Background
Human language is highly structured and nonrandom: not every sequence of letters forms a word, and not every sequence of words forms an intelligible sentence. Random text begins to look familiar, however, once enough of that structure is captured. In addition, the patterns of word choice vary between different authors. By duplicating the nearest and next-nearest neighbor correlations between words in an otherwise random sequence, you can generate surprising messages that maintain much of the style (but none of the meaning) of the original.
Learning Goals
Science:
This is mostly for fun, and to introduce the Python programming language. Similar concepts are at work, however, in a number of important scientific problems that use Markov chains to model probabilistic processes. In genomics, for example, there is much evidence of higher-order structure in DNA sequences; that is, it is not sufficient to describe only the frequencies of the nucleotide bases A, C, G, and T in a sequence: higher-order correlations must be described as well. The use of base triplet codons to encode amino acids in protein sequences, for example, introduces important three-letter correlations. In terms of human text, you will gain a qualitative feel for how three-word correlations help give the English language its characteristic feel, and how they appear intrinsic to the voice of different authors.
Computation:
This is a simple exercise to familiarize you with some of the syntax of Python, how to use the editor, etc. You will learn some of how to work with file input/output, lists, and dictionaries. For those new to programming, this will also introduce the concept of a "function". Write a program that takes as input an arbitrary text file and returns as output a string that is a random text with the same statistical structure (up to three-word correlations) as the original text. Please verify that you gain an understanding of the programming concepts described at the bottom of this page.
Procedures
Consult the instructions on the Getting Started with Python page. Broadly, the procedure involves: (1) downloading the Hints file linked to the left, (2) copying it to a new solution source file, (3) opening the source file in an editor, (4) starting up an ipython interpreter, (5) running the source file within the interpreter, (6) updating the source file and rerunning to implement and test missing functionality.
In this exercise, you will read in a body of text (e.g., a famous piece of literature), in order to build up a representation of its three-word correlations. Once you have that representation, you can sample it randomly to generate random text that is consistent with the statistical correlations you have measured. The main data structure you will build to store these correlations is a prefix dictionary: a data structure that maps a pair of words to a list of words that follow that pair of words in the text. For example, if the text contains the phrases "she loves you" and "she loves me", then the prefix dictionary would contain an entry as follows: prefix_dictionary[('she', 'loves')] = ['you', 'me'].
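The prefix dictionary described above can be sketched as follows. This is a minimal illustration, not the Hints file's implementation: the function name build_prefix_dictionary and the short sample sentence are assumptions made up for this example.

```python
def build_prefix_dictionary(words):
    """Map each pair of consecutive words to the list of words that follow it.

    Hypothetical helper for illustration; the Hints file may use other names.
    """
    prefix_dictionary = {}
    # Stop two words before the end so that words[i + 2] always exists.
    for i in range(len(words) - 2):
        prefix = (words[i], words[i + 1])
        # setdefault creates an empty list the first time a prefix is seen.
        prefix_dictionary.setdefault(prefix, []).append(words[i + 2])
    return prefix_dictionary


# Made-up sample text containing "she loves you" and "she loves me".
words = "she loves you yeah she loves me".split()
prefix_dictionary = build_prefix_dictionary(words)
print(prefix_dictionary[('she', 'loves')])  # ['you', 'me']
```

The keys are tuples rather than lists because dictionary keys must be hashable. A follower that appears several times after the same prefix is stored several times, so sampling uniformly from the list reproduces the frequencies seen in the text.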
Once you've got yourself set up to edit the source file and run it within an interpreter, you can try some of the following:
- If you run read_file_into_word_list('shelovesyou.txt') within the interpreter, it should spill out a list of words into the interpreter, since that list should be returned by the function.
- If that works, try assigning this list of words to a variable, e.g., words1=read_file_into_word_list("shelovesyou.txt")
- You can manipulate this object with commands like print(len(words1))
- Note: this is a general-purpose function that you could use in other programs.
- Once everything is debugged (maybe you've done some tweaking to the code within the interpreter), make sure that you have stored the correct code in the source file, and then test it again by running the source file within the interpreter and trying out the function.
- You should call your previous function from this function.
- Again test it using "shelovesyou.txt"
- You should call your previous function from this function.
- Again test it using "shelovesyou.txt"
- Do you see any differences in the output?
- Do any of the random strings make sense?
- Could you do the same thing but with 4 word correlations? What do you expect would happen? You might imagine it could be good to use very high-order correlations to get text that is very realistic: what would happen in such a case? How does the utility of using higher-order correlations depend on the overall size of the text being mimicked?
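One possible way to sample a prefix dictionary and produce the random strings discussed above is sketched here. The function name generate_text, its parameters, and the toy dictionary are assumptions for illustration only; your own solution may be organized differently.

```python
import random


def generate_text(prefix_dictionary, num_words, seed=None):
    """Random walk through a prefix dictionary (hypothetical helper).

    Starts from a randomly chosen prefix and repeatedly appends a random
    follower of the current two-word prefix. The result always contains at
    least the two words of the starting prefix.
    """
    rng = random.Random(seed)  # a fixed seed makes a run repeatable
    prefix = rng.choice(list(prefix_dictionary.keys()))
    output = list(prefix)
    while len(output) < num_words:
        followers = prefix_dictionary.get(prefix)
        if not followers:
            # Dead end: this pair of words is never followed by anything.
            break
        next_word = rng.choice(followers)
        output.append(next_word)
        # Slide the two-word window forward by one word.
        prefix = (prefix[1], next_word)
    return " ".join(output)


# Toy dictionary, made up for this example.
toy_dictionary = {
    ('she', 'loves'): ['you', 'me'],
    ('loves', 'you'): ['yeah'],
    ('you', 'yeah'): ['she'],
    ('yeah', 'she'): ['loves'],
}
text = generate_text(toy_dictionary, 10, seed=0)
print(text)
```

With four-word correlations the prefix would be a tuple of three words instead of two. In a short text most such prefixes have only one recorded follower, so the walk tends to reproduce the original passage word for word.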
Background and Notes
Programming Concepts
Functions and procedures
For us, a function is a small computer program that does a well defined task. Typically a function takes some sort of input, and produces output. One strings together functions to build up a program. Some programming languages distinguish between functions (which return a value and typically have no side effects) and procedures (which do not return a value and therefore only operate through side effects). Python does not make this distinction: all functions return a value (even if that value is the object None, which is the default return value if no return statement is included), and functions can have side effects (e.g., to modify objects that are passed to the function). Functions are used for several reasons:
- They allow you to reuse code
- Less programming
- Less debugging
- They lead to more easily readable code
- They separate the "what" from the "how"
- In an interactive environment they let you play: you can operate on objects just as you would with paper and pencil
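The difference between returning a value and acting through a side effect can be seen in a small sketch. The two function names below are made up for this illustration.

```python
def count_words(text):
    """Return a value and leave the input untouched."""
    return len(text.split())


def append_word(word_list, word):
    """No return statement: modifies word_list in place and returns None."""
    word_list.append(word)


words = ["she", "loves"]
result = append_word(words, "you")

print(count_words("she loves you"))  # 3
print(result)                        # None
print(words)                         # ['she', 'loves', 'you']
```

Because append_word changes the list it was given, the caller sees the new word even though nothing was returned.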
Dictionaries
A Python dictionary is an object that associates a value with a key in such a way that looking up the value is quick. This is a very useful and powerful construct. A good introduction to Python dictionaries can be found in the Python tutorial.
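A few basic dictionary operations are sketched below, using a made-up sentence and variable names chosen only for this example: counting words, testing whether a key is present, and using a tuple as a key.

```python
# Count how often each word occurs in a short, made-up sentence.
word_counts = {}
for word in "she loves you she loves me".split():
    # get returns 0 if the word has not been seen yet.
    word_counts[word] = word_counts.get(word, 0) + 1

print(word_counts["she"])    # 2
print("him" in word_counts)  # False

# Tuples are hashable, so they can serve as keys; lists cannot.
followers = {("she", "loves"): ["you"]}
followers[("she", "loves")].append("me")
print(followers[("she", "loves")])  # ['you', 'me']
```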
Useful Constructions/Syntax
- Functions
- keywords: def, return
- File Objects
- function: open
- method for file objects: read
- String Objects
- Loops: for, in
- range
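Several of the constructions listed above appear together in the sketch below. It shows one plausible body for read_file_into_word_list; the version in the Hints file may differ. The temporary file and its contents are stand-ins made up for this example so that it runs without shelovesyou.txt.

```python
import os
import tempfile


def read_file_into_word_list(filename):
    """Read a whole text file and split it into a list of words."""
    # open returns a file object; read returns its contents as one string.
    with open(filename, "r") as input_file:
        text = input_file.read()
    # split with no arguments breaks the string on any run of whitespace.
    return text.split()


# Write a tiny stand-in file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")
    with open(path, "w") as output_file:
        output_file.write("she loves you\nshe loves me\n")
    words = read_file_into_word_list(path)

print(len(words))  # 6

# for ... in with range visits each index; here, each adjacent pair of words.
for i in range(len(words) - 1):
    print(words[i], words[i + 1])
```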