Phrase Net Guide
When to use a Phrase Net
A phrase net diagrams the relationships between different words used in a text. It uses a simple form of pattern matching to provide multiple views of the concepts contained in a book, speech, or poem. The image below is a word graph made from Jane Austen's novel "Pride and Prejudice." The program has drawn a network of words, where two words are connected if they appear together in a phrase of the form "X and Y":
For instance, "Jane" and "Elizabeth" are connected by a thicker arrow since the phrase "Jane and Elizabeth" occurs 10 times in the novel. The result of this simple pattern matching scheme is a surprisingly coherent view of some of the concepts in the book. A large cluster of the main characters and their relationships is on the left; separate clusters touch on emotion and attitude. Smaller connected pairs ("fortune" and "consequence") touch on other themes.
How Phrase Net works
Phrase net analyzes a text by looking for pairs of words that fit particular patterns. You can specify this pattern by using asterisks as wildcard characters. For instance, the pattern "* and *" will match phrases like "play and sing" or "vexation and regret." Punctuation matters, so it will not match "left, and then". You can choose from some useful defaults or you can type your own patterns in the field below the list.
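The wildcard matching described above can be approximated in a few lines of Python. This is only a sketch: the function names (wildcard_to_regex, find_pairs) are hypothetical, and the choice to treat each asterisk as a run of word characters is an assumption, not a description of how Phrase Net is actually implemented.

```python
import re

def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Turn a two-asterisk pattern such as "* and *" into a compiled regex.

    Assumption: each asterisk stands for one run of word characters, and
    everything else in the pattern must appear literally.
    """
    parts = pattern.split("*")
    if len(parts) != 3:
        raise ValueError("pattern needs exactly two asterisks")
    body = r"(\w+)".join(re.escape(part) for part in parts)
    return re.compile(r"\b" + body + r"\b")

def find_pairs(text: str, pattern: str) -> list[tuple[str, str]]:
    """Return the lowercased (first slot, second slot) words for each match.

    Matches are non-overlapping, so in "a and b and c" only ("a", "b")
    is reported by this sketch.
    """
    regex = wildcard_to_regex(pattern)
    return [(m.group(1).lower(), m.group(2).lower()) for m in regex.finditer(text)]

sample = "They play and sing. He left, and then felt vexation and regret."
pairs = find_pairs(sample, "* and *")
```

Because the comma in "left, and then" sits between the word and the space, that phrase does not produce a pair, which mirrors the punctuation behavior described above.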
Once you've specified a pattern, the program will create a network diagram of the words it found as matches. Two words will be connected if they occurred in the same phrase. The size of a word is proportional to the number of times it occurred in a match; the thickness of an arrow between words tells you how many times those two words occurred in the same phrase. The color of a word indicates whether it was more likely to be found in the first or second slot of a pattern. The darker the word, the more often it appeared in the first position.
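To make the size, thickness, and color rules concrete, here is a minimal sketch of the bookkeeping they imply. The function name build_network and the "first_ratio" field are hypothetical, and how the real program maps these numbers to pixels and shades is not specified here.

```python
from collections import Counter

def build_network(pairs):
    """Summarize matched (first, second) word pairs as nodes and edges.

    Returns:
      nodes: word -> {"count": matches containing the word (drives size),
                      "first_ratio": share of those in the first slot
                                     (closer to 1.0 would be drawn darker)}
      edges: (first, second) -> number of matching phrases (drives thickness)
    """
    edges = Counter(pairs)
    first = Counter(a for a, _ in pairs)
    second = Counter(b for _, b in pairs)
    nodes = {}
    for word in set(first) | set(second):
        total = first[word] + second[word]
        nodes[word] = {"count": total, "first_ratio": first[word] / total}
    return nodes, edges

# Hypothetical matches for the pattern "* and *"
matches = [("jane", "elizabeth"), ("jane", "elizabeth"), ("elizabeth", "darcy")]
nodes, edges = build_network(matches)
```

In this toy input, "jane" only ever appears in the first slot, so its first_ratio is 1.0, while "darcy" only appears in the second slot and gets 0.0.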
Defining patterns
Matching different patterns gives different views of the text. Each text is unique, so it is worth experimenting. For instance, looking for the pattern "* and *" will often highlight key related concepts. In contrast, the pattern "* 's *" will often result in a diagram of the main people and the things they possess. The simplest pattern is "* *" which links words if they come in immediate succession; this often provides a surprisingly clear view, especially for short documents. Sometimes there is a special pattern that will provide information on a particular document. For example, applying "* begat *" to the King James Bible yields a rough family tree.
There are three ways to specify a pattern. The easiest is to choose one of the defaults from the list on the left. A second way is to type a pattern with two asterisks for the "slots" of the pattern. Note that you need exactly two asterisks for the pattern to work. Finally, there's an advanced programmers-only option, which is to use a "regular expression" with two capturing groups. For an introduction to regular expressions, read this tutorial.
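As an illustration of the regular-expression option, the sketch below uses Python's re module to apply an expression with two capturing groups. The regex dialect that Phrase Net itself accepts is not stated in this guide, so the syntax here is an assumption, and the helper name pairs_from_regex is hypothetical.

```python
import re

def pairs_from_regex(text: str, expression: str) -> list[tuple[str, str]]:
    """Apply a user-supplied regex and return the two captured words per match."""
    compiled = re.compile(expression)
    # Mirror the "exactly two slots" rule: require exactly two capturing groups.
    if compiled.groups != 2:
        raise ValueError("expression needs exactly two capturing groups")
    return [(m.group(1), m.group(2)) for m in compiled.finditer(text)]

verse = "Abraham begat Isaac; and Isaac begat Jacob"
family = pairs_from_regex(verse, r"(\w+) begat (\w+)")
```

Each captured pair corresponds to one arrow in the diagram, pointing from the first group to the second.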
Filtering results
Not all matching words are shown in the visualization. Very common English words, such as "the" or "of," typically are not informative in this kind of display, and are removed by default. If you do want to see them, uncheck the "Hide common words" box.
In addition, if the network contains more than 50 words, it often becomes hard to read. By default, the diagram will only show the top 50 most frequent matches. In some cases you may want to change this setting, either to winnow the network further, or to allow more words. To do so, type a new number in the "Show top:" box and hit return.
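The two filters can be sketched as follows. The stop-word list is a tiny illustrative stand-in, the function name filter_pairs is hypothetical, and ranking by how often each word appears in a match is one plausible reading of "top 50 most frequent"; the program's actual list and ranking may differ.

```python
from collections import Counter

# Tiny illustrative stand-in for the list of very common English words.
COMMON_WORDS = {"the", "of", "a", "an", "to", "in"}

def filter_pairs(pairs, hide_common=True, top_n=50):
    """Drop pairs containing common words, then keep only the top_n words."""
    if hide_common:
        pairs = [(a, b) for a, b in pairs
                 if a not in COMMON_WORDS and b not in COMMON_WORDS]
    # Count how many matches each word takes part in.
    counts = Counter()
    for a, b in pairs:
        counts[a] += 1
        counts[b] += 1
    keep = {word for word, _ in counts.most_common(top_n)}
    # A pair survives only if both of its words are among the kept words.
    return [(a, b) for a, b in pairs if a in keep and b in keep]

matches = [("pride", "prejudice")] * 3 + [("the", "house"), ("sense", "pride")]
shown = filter_pairs(matches, top_n=2)
everything = filter_pairs(matches, hide_common=False)
```

Lowering top_n winnows the network to its most frequent words, while passing hide_common=False corresponds to unchecking the "Hide common words" box.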
Interaction and highlighting
As with other Many Eyes network visualizations, you can pan by right-clicking and dragging, and you can zoom either by using the mousewheel or by dragging to define an area to zoom to. Click the "reset view" button to fit the entire network on the screen.
Move the mouse over a word to see how many times it occurred in a match, or over an arrow to see how many times a particular pair of words occurred. You can also click on a word to highlight it in orange; this can be helpful when making comments.
Data format
Phrase net accepts free (unstructured) text data. It can handle documents with up to about a million words.
Expert Notes
This is an experimental technique that can be viewed as a halfway point between the tag cloud and the word tree. We're very interested in any comments. The visualization itself owes a debt to Peter Cho's diagram of news stories and Franco Moretti's work on literary style.