Computational Methods for Nonlinear Systems

Physics 7682 - Fall 2015


Instructor: Chris Myers

Mondays & Fridays 1:30-3:30, Rockefeller B3 (directions)

http://www.physics.cornell.edu/~myers/teaching/ComputationalMethods/

Random Text Exercises
"United Colonies are, and of consanguinity. We must, therefore, acquiesce in the Course of human events it becomes necessary for the sole purpose of fatiguing them into compliance with his measures. He has combined with others to be totally dissolved; and that as Free and Independent States, that they are endowed by their legislature to extend an unwarrantable jurisdiction over us. We have Petitioned for Redress in the Name, and by Authority of the United States of America, in General Congress, Assembled, appealing to the opinions of mankind requires that they are accustomed..." - Thomas Jefferson (before he honed his message)

Background

Human language is highly structured and nonrandom: not every sequence of letters forms a word, and not every sequence of words forms an intelligible sentence. But randomly generated language begins to look familiar if enough of this structure is captured. In addition, patterns of word choice vary between authors. You can generate surprising messages, with much of the style maintained (but none of the meaning), by duplicating the nearest- and next-nearest-neighbor correlations between words in an otherwise random sequence.

Learning Goals

Science:

This is mostly for fun, and to introduce the Python programming language. Similar concepts are at work, however, in a number of important scientific problems that use Markov chains to model probabilistic processes. In genomics, for example, there is much evidence of higher-order structure in DNA sequences; that is, it is not sufficient to describe only the frequencies of the nucleotide bases A, C, G, T in a sequence: one must also characterize higher-order correlations. The use of base-triplet codons to encode amino acids in protein sequences, for example, introduces important three-letter correlations. In terms of human text, you will gain a qualitative feel for how three-word correlations help give the English language its characteristic feel, and how they appear intrinsic to the voices of different authors.

Computation:

This is a simple exercise to familiarize you with some of the syntax of Python, how to use the editor, etc. You will learn how to work with file input/output, lists, and dictionaries. For those new to programming, this will also introduce the concept of a "function". Write a program that takes as input an arbitrary text file and returns as output a string of random text with the same statistical structure (up to three-word correlations) as the original text. Please verify that you gain an understanding of the programming concepts described at the bottom of this page.

Procedures

Consult the instructions on the Getting Started with Python page. Broadly, the procedure involves: (1) downloading the Hints file linked to the left, (2) copying it to a new solution source file, (3) opening the source file in an editor, (4) starting up an ipython interpreter, (5) running the source file within the interpreter, (6) updating the source file and rerunning to implement and test missing functionality.

In this exercise, you will read in a body of text (e.g., a famous piece of literature), in order to build up a representation of its three-word correlations. Once you have that representation, you can sample it randomly to generate random text that is consistent with the statistical correlations you have measured. The main data structure you will build to store these correlations is a prefix dictionary: a data structure that maps a pair of words to a list of words that follow that pair of words in the text. For example, if the text contains the phrases "she loves you" and "she loves me", then the prefix dictionary would contain an entry as follows: prefix_dictionary[('she', 'loves')] = ['you', 'me'].
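As a minimal illustration of the idea (using a short hand-built word list rather than the Hints file's actual code), the prefix dictionary described above can be built like this:

```python
# Build a prefix dictionary: map each pair of consecutive words
# to the list of words that follow that pair in the text.
words = "she loves you yeah she loves me".split()

prefix_dictionary = {}
for i in range(len(words) - 2):
    pair = (words[i], words[i + 1])
    # setdefault creates an empty list the first time a pair is seen
    prefix_dictionary.setdefault(pair, []).append(words[i + 2])

print(prefix_dictionary[('she', 'loves')])  # ['you', 'me']
```

Note that the same pair may appear several times as a key is built up, so followers accumulate in a list; repeated followers are kept, which is what makes a later random choice reflect the original word frequencies.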

Once you've got yourself set up to edit the source file and run it within an interpreter, you can try some of the following:

  • Write a function (read_file_into_word_list) that takes the name of a file (say "shelovesyou.txt") and returns a list of words.
  • Continue filling out the empty functions in the source file. Write a function (make_prefix_dictionary) that takes the name of a file and returns a dictionary in which each pair of consecutive words is a "key" and the list of possible next words is the "value". The docstring in the source file gives a detailed sample algorithm.
  • In the source file, implement a function (make_random_text) that takes the name of a file and the number of words desired, and returns the desired string. The docstring comment in the Python code gives a detailed sample algorithm.
  • Generate some random text from all of the different text files.
  • Once you've generated some random text, head on over to Wordle to make some "word clouds" from your text. (If you're especially ambitious, try to figure out whether you can use some of the Python web interface libraries, such as urllib, to automate the generation of word clouds without having to cut and paste text into the Wordle site.) While visually interesting, word clouds seem to deal only with word frequencies, rather than higher-order correlations, so it doesn't matter much that you've scrambled the original text. The FAQ indicates that words can be tied together in the clouds, but that's different from grouping words together probabilistically based on correlations, which seems similar in spirit to questions regarding optimal graph layout algorithms. If you're interested in generating higher-order word clouds, that could be an interesting project for the course. The Python Imaging Library could be a useful tool for generating those sorts of images. (Unfortunately, we can't peek at the Wordle source code to get some insight into their algorithm.)
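One possible sketch of the three functions described above, assuming words are simply whitespace-separated; this is a minimal version, and the Hints file's docstrings spell out the intended algorithm in more detail:

```python
import random

def read_file_into_word_list(filename):
    """Return the file's contents as a list of whitespace-separated words."""
    with open(filename) as f:
        return f.read().split()

def make_prefix_dictionary(filename):
    """Map each pair of consecutive words to the list of words following it."""
    words = read_file_into_word_list(filename)
    prefix_dictionary = {}
    for i in range(len(words) - 2):
        pair = (words[i], words[i + 1])
        prefix_dictionary.setdefault(pair, []).append(words[i + 2])
    return prefix_dictionary

def make_random_text(filename, num_words):
    """Generate num_words of random text sharing the file's
    three-word correlations."""
    prefix_dictionary = make_prefix_dictionary(filename)
    # Start from a randomly chosen pair of consecutive words
    pair = random.choice(list(prefix_dictionary.keys()))
    output = list(pair)
    while len(output) < num_words:
        followers = prefix_dictionary.get(pair)
        if followers is None:   # pair occurs only at the very end of the text
            break
        next_word = random.choice(followers)
        output.append(next_word)
        pair = (pair[1], next_word)
    return ' '.join(output)
```

Because followers are chosen with random.choice from a list that keeps duplicates, each next word is drawn with the same conditional probability it had in the original text.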

Background and Notes

Programming Concepts

Functions and procedures

For us, a function is a small computer program that does a well-defined task. Typically a function takes some sort of input and produces output. One strings together functions to build up a program. Some programming languages distinguish between functions (which return a value and typically have no side effects) and procedures (which do not return a value and therefore operate only through side effects). Python does not make this distinction: all functions return a value (even if that value is the object None, which is the default return value if no return statement is included), and functions can have side effects (e.g., modifying objects that are passed to them). Functions are used for several reasons:
1. They allow you to reuse code
  • Less programming
  • Less debugging
2. They lead to more easily readable code
3. They separate the "what" from the "how"
4. In an interactive environment they let you play: you can operate on objects just as you would with paper and pencil
Python tutorial: introduction to functions in Python.
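A small illustration of the distinction just described (the function names here are made up for the example):

```python
def double(x):
    """A function in the strict sense: returns a value, no side effects."""
    return 2 * x

def append_word(word_list, word):
    """A 'procedure'-style function: modifies its argument as a side effect."""
    word_list.append(word)   # side effect on the caller's list
    # no return statement, so Python returns None

print(double(21))            # 42
words = ['she', 'loves']
result = append_word(words, 'you')
print(words)                 # ['she', 'loves', 'you'] -- modified in place
print(result)                # None -- the default return value
```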

Dictionaries

A Python dictionary is an object that associates a value with a key in such a way that looking up the value by key is quick. This is a very useful and powerful construct. A good introduction to Python dictionaries can be found in the Python tutorial.
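For instance, a dictionary can count word frequencies, and (as in the prefix dictionary above) its keys need not be strings; tuples work too:

```python
# A dictionary maps keys to values; lookup by key is fast (hash-based).
counts = {}
for word in "she loves you yeah she loves you".split():
    # get(word, 0) returns 0 the first time a word is seen
    counts[word] = counts.get(word, 0) + 1

print(counts['she'])               # 2
print(('loves', 'you') in counts)  # False -- only strings are keys so far
counts[('loves', 'you')] = 2       # tuples can also serve as keys
print(counts[('loves', 'you')])    # 2
```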

Useful Constructions/Syntax