Research Guides: Resources for Digital Scholarship: Resources for Text Analysis

Introduction

Text analysis is the process of extracting information from a body, or corpus, of texts and organizing it in a meaningful way so that it can serve as the basis for scholarly interpretation. Below are some helpful overviews of textual analysis (they, like the other resources and information on this page, are derived from the research guide Introduction to Textual Analysis, created at Duke University Libraries):

Types of Text Analysis

Basic Text Summaries and Analyses

Word frequency (lists of words and their frequencies)
(See also: Word counts are amazing, Ted Underwood)
Collocation (words commonly appearing near each other)
Concordance (the contexts of a given word or set of words)
N-grams (common two-word, three-word, and longer phrases)
Entity recognition (identifying names, places, time periods, etc.)
Dictionary tagging (locating a specific set of words in the texts)
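
To make these basic analyses concrete, here is a minimal Python sketch that uses only the standard library. It is illustrative rather than a substitute for the tools listed further down this page: the tokenizer is a deliberately crude regular expression, and the function names (tokenize, ngrams, concordance, dictionary_tag) and the sample sentence are assumptions made up for this example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and pull out runs of letters/apostrophes (a crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

def concordance(tokens, keyword, width=2):
    """Return each occurrence of keyword with up to `width` tokens of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

def dictionary_tag(tokens, dictionary):
    """Return (position, token) pairs for tokens found in a given word set."""
    return [(i, tok) for i, tok in enumerate(tokens) if tok in dictionary]

# Hypothetical sample text, chosen only to keep the output small.
sample = "The cat sat on the mat. The cat saw the dog."
tokens = tokenize(sample)

word_freq = Counter(tokens)      # word frequency list
bigrams = ngrams(tokens, 2)      # two-word phrases
kwic = concordance(tokens, "cat")
tagged = dictionary_tag(tokens, {"cat", "dog"})

print(word_freq.most_common(2))
print(bigrams.most_common(1))
print(kwic)
print(tagged)
```

Real projects usually need better tokenization (punctuation, hyphenation, Unicode) and stop-word handling, which is where the dedicated tools and libraries listed below come in.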

High-level Goals for Text Analysis

(From Underwood, T. (2012). Where to start with text mining.)

Document categorization
- Information retrieval (e.g., search engines)
- Supervised classification (e.g., guessing genres)
- Unsupervised clustering (e.g., alternative “genres”)
Corpora comparison (e.g., political speeches)
Language use over time (e.g., Google Ngram Viewer)
Detecting clusters of document features (i.e., topic modeling)
Entity recognition/extraction (e.g., geoparsing)
Visualization
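
As one small illustration of corpora comparison, the sketch below ranks words by how much more often they appear (proportionally) in one text than in another. This is only one simple approach, assumed here for illustration; it is not the method used by Underwood or by any tool on this page, and the function names and the two toy "speeches" are hypothetical.

```python
import re
from collections import Counter

def relative_frequencies(text):
    """Map each word to its share of all word tokens in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    counts = Counter(tokens)
    return {word: count / total for word, count in counts.items()}

def compare_corpora(text_a, text_b, top=5):
    """Rank words by (relative frequency in A) minus (relative frequency in B)."""
    freq_a = relative_frequencies(text_a)
    freq_b = relative_frequencies(text_b)
    vocab = set(freq_a) | set(freq_b)
    diffs = {w: freq_a.get(w, 0.0) - freq_b.get(w, 0.0) for w in vocab}
    # Largest positive difference first; ties broken alphabetically.
    return sorted(diffs.items(), key=lambda kv: (-kv[1], kv[0]))[:top]

# Hypothetical toy inputs standing in for two collections of speeches.
speech_a = "freedom freedom taxes"
speech_b = "taxes taxes jobs"

for word, diff in compare_corpora(speech_a, speech_b):
    print(f"{word}: {diff:+.2f}")
```

Raw differences in relative frequency favor very common words; published studies generally use statistical measures designed for comparing corpora, so treat this as a starting point only.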

Examples of Text Analysis Projects with Visualizations

Sources of Texts

Internet Archive
Project Gutenberg
Google Books
HathiTrust (Hathi Download Helper)
JSTOR Data for Research* (with Early Journal Content bundle, also from archive.org)
PubMed Open Access Subset
Document Cloud*
Open American National Corpus (collection of American English from various sources)
WordHoard* (tagged literary texts)
Corpus of Contemporary American English

Tools for Textual Analysis

Web Tools

Voyant Tools – word frequencies, concordance, word clouds, visualizations
TAPorWare – various data cleaning, annotating, and summarizing tools in a web interface
Netlytic – word frequencies, concordance, dictionary tagging, network analysis
Wmatrix – frequency profiles, concordances, compare frequency lists, n-grams and c-grams, collocations
Natural Language Processor & Analyzer – word frequencies, collocations, concordance, tokenizer, etc.
ManyEyes – interactive text visualizations (network diagram, word tree, phrase net, tag cloud, word cloud)
Overview – automatic topic tagging and visualization
Monk Workbench – corpus selection from library holdings, frequencies and corpora comparisons, supervised classification
LIWC – web version outputs a few linguistic dimensions; full version can be licensed for ~$100

Downloadable Applications
(no programming required)

AntWord – word frequencies
AntConc – frequency lists, concordances, collocations, keywords, n-grams
TextSTAT – word frequencies, concordances
Concordance – word frequencies, concordances, indexes
Cowo – semantic network
WordHoard – word frequencies, concordances, collocations, scripting (includes tagged literary corpora)
CasualConc – KWIC concordance lines, word clusters, collocation analysis, word counts

Advanced Tools for Textual Analysis

Text Annotation Tools

Natural Language Processing

GATE
NLTK (Natural Language Toolkit, Python)
Stanford NLP Group Software
National Centre for Text Mining (includes some tools for medical texts)
Reporters' Lab Reviews: Entity Extraction
Michael Collins' notes on NLP
Natural (natural language facilities for Node.js)

Sentiment Analysis

Most powerful open source sentiment analysis tools
Bing Liu's Resources on Opinion Mining (including a sentiment lexicon)
NaCTeM Sentiment Analysis Test Site (web form)
Pattern web mining module (Python)
SentiWordNet
Umigon (for tweets, etc.)
List of sentiment analysis tools for Twitter
