Small World Exercise

Background

Small world networks were inspired by the sociological phenomenon of "six degrees of separation": the network of human friendships is thought to link random pairs of humans through a remarkably small number of first-name basis relationships (you know someone who knows someone who ... who knows someone who was struck by a meteor, just like the dinosaurs). The same phenomena happens in movies -- hence the famous "Bacon number". This simple model has been studied thoroughly, but remains a nice introduction both to random network construction, to breadth-first search algorithms, and to scaling collapses. Your resulting software can then also be applied to real-world networks.

You will apply your algorithms to some randomly generated networks. Many interesting problems arise from studying properties of randomly generated networks. A network is a collection of nodes and edges, with each edge connected to two nodes, but with each node potentially connected to any number of edges. A random network is constructed probabilistically according to some definite rules; studying such a random network usually is done by studying the entire ensemble of networks, each weighted by the probability that it was constructed. Thus these problems naturally fall within the broad purview of statistical mechanics.

In particular, in this exercise we will generate some random networks and calculate the distribution of distances (path lengths) between pairs of nodes. We will study small world networks, a theoretical model that suggests how a small number of shortcuts (e.g., unusual international and intercultural friendships in a social network) can dramatically shorten the tyical path lengths. Finally we will study how a simple, universal scaling behavior emerges for large networks with few shortcuts.

Learning Goals

Science: You will determine how the connectivity of a network changes when a small number of random connections are added, and explore scaling collapses that summarize path length data for networks of different sizes and numbers of near-neighbor connections.

Computation: You will learn some basic algorithms for efficiently traversing and computing path length distributions of arbitrary networks. You will write a program to produce small world networks, i.e., 1D ring shaped networks with random "short circuits". Write routines for generating statistics of shortest connection paths between nodes in these networks.

Procedure

By now, you should have completed the Introduction to Networks exercise, and should have working code to create undirected graphs. In what follows, you will continue with the study of small world networks: Creating Small World Networks, Finding the Statistical Properties of Networks, Finding the Statistical Properties of Small World Networks, and two optional excercises: applying your program to the study of real networks, and a study of "betweenness".

Creating Small World Networks

Read the paper by Duncan Watts and Steven Strogatz, Collective Dynamics of "small world" Networks. (You might also be interested in the writeup by Jon Kleinberg and David Easley, "Handout on the small-world phenomenon", Networks: Economics 204 / Sociology 209 / Information Science 204.)
Download the file SmallWorldNetworksHints.py (Hints file for small world nets) and rename it as "SmallWorldNetworks.py". It is easiest if you store this file in the same directory as the networks files you worked with earlier (Networks.py and NetGraphics.py), since those files will be needed for this exercise.
- See notes below for information on how to organize files in multiple directories and have them be accessible to Python.
Also download the MultiPlot.py software for use in making scaling collapses.
Start up a new ipython session. (This is not strictly necessary, but you will probably save yourself grief, so that anything you did in previous sessions won't interact or conflict with what you are doing now.)
Construct a function MakeRingGraph that takes two arguments, the number of nodes (num_nodes), and the number of edges (Z) that each node has, and returns an UndirectedGraph object. This really only makes sense for even Z, in which case there will be edges connecting each node to the Z/2 nearest neighbors to the left and the Z/2 nearest neighbors to the right. This is a ring, so there should be periodic boundary conditions. (Hint -- the Python operator that implements modular arithmetic is %: see the python documentation)
Test by typing run SmallWorldNetworks.py, generating some networks, and plotting them using the commands in NetGraphics.
Construct a function AddRandomEdges that adds a given number of random edges to a graph. Test. (There are some subleties here. You might imagine that, at random, the same pair of nodes might be selected for an edge, in which case the number of unique edges in the graph might be less than you initially specify. This will become less likely as the size of the graph increases. Alternatively, you could check for the number of unique edges, and stop once you've reached the specified number.)
Construct the function MakeSmallWorldNetwork that calls your two previous functions to generate a ring network of L sites with Z edges per site, then add p*Z*L/2 random edges. Test. (p is typically a number less than 1.)
As per the hints file construct MakeSmallWorldSimple. Test.

Computing Path Lengths of Networks

You will now write routines for computing path lengths between pairs of nodes in a network. Although we are interested in this question in the context of small world networks, the computation is applicable to all networks. Therefore, you will add code to your network infrastructure file "Networks.py" rather that the "SmallWorldNetworks.py" file. In "Networks.py", write and debug the following routines: FindPathLengthsFromNode, FindAllPathLengths, and FindAveragePathLength.
- The algorithms are described in Small World Exercise.
- The basic algorithmic approach is to use breadth-first search to traverse the graph starting from a root node. Central to graph traversal algorithms is the need to mark nodes as they are visited, so they need not be revisited. The use of a Python dictionary serves a dual purpose in that regard: both to store the pathlength to a given node, and to mark that node as visited.
- Test.

Finding Statistical Properties of Small World Networks

Go back to "SmallWorldNetworks.py". Make and debug the routine MakePathLengthHistograms.
Plot histograms for L=1000, Z=2, and p=0.02. Repeat for p=0.2. Play with other p's. What value of p's gives "six degrees of separation"?
Use FindAverageAveragePathLength on several random small world networks with L=100,Z=2, and p=0.1. How wildly do these fluctuate?
Quantify the fluctuations by writing the FindAverageAveragePathLength routine, which will calculate the mean and standard deviation of a set of random small world networks. Calculate the mean with different sample sizes. Does this agree with your previous observations? How many samples do you need for the mean calculated this way to be representitive of the mean of an infinite set of samples? [This is the concept in statistics of " standard error of the mean", which is distinct from " standard deviation".]
Now that you know how to get the mean and standard deviation for some fixed p, you want to make a table of the average path length (and standard deviation) as a function of p. Make the functions GetPathLength_vs_p and PlotPathLength_vs_p.
Make the graph of mean vs p for L=50,Z=2, and p between 0.001 and 1.
Compare with figure 2 of Watts and Strogatz Collective Dynamics of "small world" Networks. Since they are working with a different L, you should find that you need to shift p for the results to coincide. See Small World Exercise for an explanation.
Create and display a circle graph with (Z=2, L=50) at p=0.1. Create the histogram of path lengths.
Create and display a circle graph with (Z=10, L=1000) at p=0.1 and p=0.01. Create the histogram of path lengths. How do these compare to your previous histogram?
Write the routine PlotScaledPathLength_vs_pZL, and plot Pi Z l/L (where l is the average path length) versus the total number of shortcuts pLZ/2, for p ranging between 0.001 and 1. Use L=100 and 200, and Z= 2 and 4. These scaling functions are described and derived in the paper by Newman and Watts, Renormalization--group analysis of the small world network model, and geometric intuition is provided in the Small World Exercise.
- You can either scale the data explicitly, or you can use the MultiPlot module. See the notes below for information on scaling collapses and use of the MultiPlot module.
Does this make sense?

Real World Networks (optional)

Use some of your routines on real-world networks -- some good sources of data are the Barabasi group and the Community structure in social and biological networks, and Mark Newman, Scientific collaboration networks II: shortest paths, weighted networks, and centrality.

Useful Constructions/Syntax

class: python tutorial, nanotutorial
"dot": used in calling methods, also used to navigate namespaces.
Dictionaries, lists, tuples
Iterators (for x in list : dosomething(x))
Comparison (if, elif, else)
range
Module random
- random.choice
* -- unpacking lists
modulo (%): documentation
sum -- strangely it is really hard to find this in the python documentation. It is pretty self-explanatory though -- you just type sum(lst) to sum the elements of lst.
len -- just type len(lst).
pylab (aka matplotlib):tutorial
- pylab.hist
- pylab.show: The command pylab.show() displays a pylab graph.
Functions:
- Tutorial
- keywords: def, return
- Technical Documentation
Assignment (=):
- Tutorial
import
reload

Science Concepts

Files

References

Duncan Watts and Steven Strogatz, Collective Dynamics of "small world" Networks
Jon Kleinberg and David Easley, "Handout on graph theory", Networks: Economics 204 / Sociology 209 / Information Science 204
Jon Kleinberg and David Easley, "Handout on the small-world phenomenon", Networks: Economics 204 / Sociology 209 / Information Science 204
Mark Newman, The structure and function of complex networks
Mark Newman, Models of the Small World
Mark Newman, Scientific collaboration networks II: shortest paths, weighted networks, and centrality.
Mark Newman and Duncan Watts, Renormalization--group analysis of the small world network model
Michelle Girvan and Mark Newman, Community structure in social and biological networks
Yook, Oltvai, and Barabasi, Comparing Different Protein Interaction Networks

Notes

Organizing source files in multiple directories

When you try to import a module in Python (e.g., via import somemodule), the interpreter looks through a sequence of directories to find the specified module. This sequence (a Python list) is stored in the variable sys.path: you can see what is in the default search path by first importing the sys module (import sys) and then printing the path (print sys.path). You will see that an empty directory ('') is in the path; this corresponds to your current directory. For this reason, it is often easiest to put a group of related files together in the same directory, such as is recommended here. You will see that there are also directories in sys.path. These indicate where various built-in and third-party packages are installed.

You can extend the default search path in a couple of different ways. One is to define the shell variable PYTHONPATH. Any directories placed in that list will be appended (i.e., stuck at the end) of sys.path. In the bash shell, for example, one could type at the command line or place in the ~/.bashrc file, a command like: export PYTHONPATH=~/mypythonlibrary and anything Python files in that directory would be accessible for import. If you find yourself developing source files that you reuse for a number of different projects, creating a central repository like this might be useful. (Such a repository can be hierarchically structured as well.) The other way to augment the path is to add to the sys.path variable directly within your Python code: e.g., sys.path.append('~/mypythonlibrary') will add that directory to your path, but only for the currently running session.

Scaling collapses and use of the MultiPlot software

The multiplot software is available at: Scaling collapse software (MultiPlot).

Conceptually, scaling collapses are extremely straightforward. In a family of x-y data sets, the x axis in each set is scaled by one formula, and the y-axis by another, where the formulas depend on parameters that distinguish each data set from one another.

While conceptually simple, computing scaling collapses requires managing multiple data sets, their associated parameter values, and the functional form of the desired scaling collapse. To address this complexity, we have written the MultiPlot module to facilitate scaling collapses, although the collapses can certainly be done without using MultiPlot. The key to the MultiPlot module is to store families of data sets in a Python dictionary, where the keys to the dictionary are the parameter values for each data set, and the values are scipy arrays containing the raw, unscaled data. The x-axis data and y-axis data are stored separately in different dictionaries. By using Python's capabilities for dynamic code evaluation, we can encode (nearly) arbitrary functional forms for scaling collapses in Python strings, rather than having to hardcode specific functional forms.

For example, let xdata and ydata describe data that were generated for different values of parameters A and B, say, for A = 50 and 100, and B = 2 and 4. Then xdata and ydata would be dictionaries keyed on (A,B) tuples:

xdata[(50, 2)] = # x coordinates for A=50,  B=2
xdata[(50, 4)] = # x coordinates for A=50,  B=4
xdata[(100,2)] = # x coordinates for A=100, B=2
xdata[(100,4)] = # x coordinates for A=100, B=4
ydata[(50, 2)] = # y coordinates for A=50,  B=2
ydata[(50, 4)] = # y coordinates for A=50,  B=4
ydata[(100,2)] = # y coordinates for A=100, B=2
ydata[(100,4)] = # y coordinates for A=100, B=4

In MultiPlot, we can specify the functional form of the collapse as a function of the parameters A and B. We do this by passing a symbolic representation of the transformation function, encoded in a Python string. The form of the transformation function is: "unscaled_variable->scaled_variable". For example, if we called the unscaled coordinates "x" and "y", and the scaling functions for those variable were "x*A*B" and "y*A/B", respectively, then the transformation functions would be written as xform="x->x*A*B" and yform="y->y*A/B". (xform and yform are two arguments that are passed to the MultiPlot.MultiPlot function.) In order for Python to be able to numerically evaluate these symbolic expressions (which it does using the eval() function), it needs to know what the values of A and B are for each dataset. Of course, those values make up the keys of the xdata and ydata dictionaries, but Python doesn't know that. In order to associate the numerical values in the dictionary keys with the symbolic string variables "A" and "B", we use the keyNames argument to MultiPlot.MultiPlot. keyNames is a tuple containing the string names of the keys in the xdata and ydata dictionaries, in the order that they appear in those keys. For our example here, therefore, keyNames=("A", "B"). To execute this scaling collapse, therefore, we would issue the following command:

MultiPlot.MultiPlot(xdata, ydata, xform="x->x*A*B", yform="y->y*A/B", keyNames=("A","B"))

Additional features in MultiPlot:

As its name suggests, MultiPlot is as much about plotting multiple datasets as it is about doing scaling collapses. (The scaling collapses themselves are carried out in MultiPlot.ScaleIt.) If you just want to use it to plot multiple datasets without scaling them, simply don't specify xform and yform, since the default values for those are "x->x" and "y->y", respectively.
Scaling collapses often involve not just the unscaled variables and dataset parameters (e.g., x, y, A, and B in the example above), but also other parameters, such as scaling exponents whose values one is trying to determine through the process of collapsing the data. Any symbol used in a transformation function must be defined for Python to be able to evaluate the symbolic string, and the example above describes how the unscaled variables and dataset parameters are specified. Any additional scaling parameters are specified through a dictionary scalingParams that maps the names of those parameters (encoded as strings) to their values. (One could imagine embedding MultiPlot.ScaleIt within a larger optimization and/or learning algorithm, that would find an optimal set of parameters and/or functional forms that best collapses the data. That might be an interesting side project for someone who is interested.)
James P. Sethna, Christopher R. Myers, Erich J. Mueller.
Last modified: August 25, 2008