Resources for Digital Scholarship: Resources for Text Analysis

Introduction

Text analysis is the process of extracting information from a body, or corpus, of texts and organizing it so that it can serve as the basis for scholarly interpretation. The overviews, resources, and other information on this page are derived from the research guide Introduction to Textual Analysis, created at Duke University Libraries.

Types of Text Analysis

Basic Text Summaries and Analyses

  • Word frequency (lists of words and their frequencies)
    (See also: Word counts are amazing, Ted Underwood)
  • Collocation (words commonly appearing near each other)
  • Concordance (the contexts of a given word or set of words)
  • N-grams (common two-word, three-word, and longer phrases)
  • Entity recognition (identifying names, places, time periods, etc.)
  • Dictionary tagging (locating a specific set of words in the texts)
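Three of the summaries above (word frequency, n-grams, and concordance) can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not the method used by any of the tools on this page: the function names, the simple regular-expression tokenizer, and the sample sentence are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)


def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def concordance(tokens, keyword, width=3):
    """Return each occurrence of keyword with up to `width` tokens of context per side."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


# Hypothetical sample text, invented for this example.
sample = "The whale surfaced. The whale dived, and the ship followed the whale."
tokens = tokenize(sample)

print(word_frequencies(tokens).most_common(2))
print(ngrams(tokens, 2).most_common(1))
print(concordance(tokens, "ship", width=2))
```

A real project would usually need a more careful tokenizer (for hyphenated words, numbers, and non-English text) and a list of stop words to filter out very common terms such as "the".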

High-level Goals for Text Analysis

(From Underwood, T. (2012). Where to start with text mining.)

  • Document categorization
    • Information retrieval (e.g., search engines)
    • Supervised classification (e.g., guessing genres)
    • Unsupervised clustering (e.g., alternative “genres”)
  • Corpora comparison (e.g., political speeches)
  • Language use over time (e.g., Google ngram viewer)
  • Detecting clusters of document features (i.e., topic modeling)
  • Entity recognition/extraction (e.g., geoparsing)
  • Visualization
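As one illustration of these goals, corpora comparison can be approximated by comparing relative word frequencies between two texts. The sketch below is a simplified stand-in, assuming a plain difference of relative frequencies rather than any particular statistical measure; the function names and the two short "speeches" are invented for the example.

```python
import re
from collections import Counter


def relative_frequencies(text):
    """Map each word to its share of all tokens in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}


def compare_corpora(text_a, text_b, top=5):
    """Rank words by how much more frequent they are in text_a than in text_b."""
    freq_a = relative_frequencies(text_a)
    freq_b = relative_frequencies(text_b)
    vocabulary = set(freq_a) | set(freq_b)
    differences = {
        word: freq_a.get(word, 0.0) - freq_b.get(word, 0.0) for word in vocabulary
    }
    return sorted(differences.items(), key=lambda item: item[1], reverse=True)[:top]


# Hypothetical miniature corpora, invented for this example.
speech_a = "we will cut taxes and cut spending"
speech_b = "we will fund schools and fund hospitals"

for word, difference in compare_corpora(speech_a, speech_b, top=3):
    print(f"{word}: {difference:+.3f}")
```

Words shared equally by both texts score zero, so the top of the ranking shows vocabulary distinctive to the first text. With realistic corpora, a measure that accounts for corpus size and chance variation would be more appropriate than a raw difference.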

Tools for Textual Analysis

Web Tools

  • Voyant Tools – word frequencies, concordance, word clouds, visualizations
  • TAPorWare – various data cleaning, annotating, and summarizing tools in a web interface
  • Netlytic – word frequencies, concordance, dictionary tagging, network analysis
  • Wmatrix – frequency profiles, concordances, compare frequency lists, n-grams and c-grams, collocations
  • Natural Language Processor & Analyzer – word frequencies, collocations, concordance, tokenizer, etc.
  • ManyEyes – interactive text visualizations (network diagram, word tree, phrase net, tag cloud, word cloud)
  • Overview – automatic topic tagging and visualization
  • Monk Workbench – corpus selection from library holdings, frequencies and corpora comparisons, supervised classification
  • LIWC – web version outputs a few linguistic dimensions; the full version can be licensed for ~$100

Downloadable Applications
(no programming required)

  • AntWord – word frequencies
  • AntConc – frequency lists, concordances, collocations, keywords, n-grams
  • TextSTAT – word frequencies, concordances
  • Concordance – word frequencies, concordances, indexes
  • Cowo – semantic network
  • WordHoard – word frequencies, concordances, collocations, scripting (includes tagged literary corpora)
  • CasualConc – KWIC concordance lines, word clusters, collocation analysis, and word count