1 July 2005

Four Degrees of Separation: Research on the Science of Networks

Reginald Smith

As an amateur scientist, I believe many times we in the amateur science community sell ourselves short. We believe that the only research we can hope to accomplish is the fun and curious. Though we call ourselves scientists often there is a nagging voice that tells us we are small-time, and the real work belongs to the guys in the gazillion dollar labs. True, some things cannot be done by amateur scientists (yet). I cannot claim to be able create energies in excess of the Large Hadron Collider in my basement. I cannot fabricate precision nano-machines. But I believe that I, and any other amateur scientist, can do research that the larger world will take notice of. This article is about one of my experiences in this regard.

In 2002, I was finishing my studies at the University of Virginia for my BA in physics. Like most graduating seniors, I had a senior thesis to write. Earlier in the year I decided to write about the emerging field of complex systems, which had recently grabbed my attention. However, I could not nail down a topic in this vast field. By chance, I had read an article on PhysicsWeb (quoted below) describing recent research on the structure of the Internet by a group headed by the Notre Dame physicist Albert-Laszlo Barabasi. They had measured the connections between websites using hyperlinks and described the overall topology (structure) of the Internet in detail. At first, I felt content to just write about his research and get this paper over with, since I had a bad case of senioritis. But one day I had a new idea.

While chatting with a friend on AOL Instant Messenger™ I realized that the networks between "buddies" on instant messaging software like AOL and Microsoft's MSN Messenger ™ is very similar to the networks between web sites. The article had mentioned that such networks were also used to measure social networks as well as computer networks. I realized that measuring the size and connections of a social network could now be done with unprecedented precision due to the rise of instant messaging. Thus I set about measuring the AOL Instant Messaging Network

A Brief Introduction to Graph Theory

The science of networks is based on a branch of mathematics known as graph theory. Here I will give a short introduction to graph theory that will give the reader a simple knowledge sufficient to understand the research. A more comprehensive knowledge can be gained in the further reading at the end of the article.

Graph Theory was developed by many of the great names in science and mathematics, including Euler, Kirchoff, and Erdos. The work of Erdos is very relevant to networks. Graphs are very easy to understand. They are an assemblage of nodes (vertices), which are distinct and individual points, connected to each other by edges (arcs).

Several mathematical measures of graphs are important:

Undirected versus directed edges:

Edges can be undirected (inherently bi-directional) or directed (from one node to another but not in reverse). Different networks are modeled by undirected edges or directed edges or both. The Internet is modeled using directed edges. A hyperlink on page X points to page Y, but the link is not necessarily reciprocal. Some social networks are undirected, for example mutual acquaintances. I know Forrest Mims and he knows me. This kind of mutual acquaintance network first gave rise to the phrase "six degrees of separation" coined by sociologist Stanley Milgram.

Degree of a node:

The degree of a node is how many edges a node is connected with. For undirected networks, every node has one degree. For directed networks, nodes have two degrees: an in-degree (how many edges point to a node) and an out-degree (how many edges depart from this node).

Connected graphs:

A graph is called connected if you can reach any given node from any other node by hopping across edges. In other words, there are no isolated nodes that cannot be reached by starting at certain nodes. A graph is simply connected if it is connected with undirected edges, and a graph is strongly connected if it is connected by directed edges (which is harder to do).

Diameter and average distance:

How far apart are the nodes? There are two main measures for this. The diameter is the measure of the longest shortest path between any two nodes in a graph. The average distance is the average distance between any two nodes, which is calculated by averaging the shortest distance between every pair of nodes (n2 pairs in directed networks, n2/2 pairs in undirected networks) in the graph.

Application to Networks

Once you have a handle on graph theory, it is simple to see how networks are modeled as graphs. For example, in the Internet a node is a web site (or router) and the edge is a hyperlink (or IP entry in a router table). In social networks, nodes are people and the edges are acquaintances. There are some directed social networks, such as the instant messaging network, email networks, and networks of citations in scientific papers. There are also networks that measure power grids, where power stations are nodes and transmission lines are edges. More surprising is that the metabolism in cells can be measured by a network, where chemical enzymes are nodes and edges point from a reactant to a product of a chemical reaction.

Random Networks and Scale-Free Networks

Much of the work on networks done by Erdos involved the concept of random networks. Random networks are networks where the number of nodes is fixed and each node connects to other nodes with a fixed probability, P. It was found that after a certain value of P the random graphs would become connected (which involves a study called percolation theory which I will not go into here). At first, researchers assumed most, if not all, real-world networks are random. Random networks have the interesting feature that if you plot the statistical distribution of the number of nodes with a given degree, you get an exponential curve. This mainly says a lot of nodes have a given degree with not many varying from this. Exponential curves are of the form P(X) = e-X.

In the late 1990s, however, a group of researchers actually asked if this is true. The work was centered at Notre Dame and involved Albert-Laszlo Barabasi, Reka Albert and Hawoong Jeong. Their research, which focused in a large part on the Internet, found the network is not random. First, the distribution of the node degrees does not follow an exponential function, but rather a power law curve. Power law curves have the form P(X) = a-X, where “a” is a base and “d” is called the critical exponent. In particular, the distribution of power law curves is one where instead of almost all nodes being of a certain degree, many nodes cluster at one degree. There is also a tail in the distribution so other nodes of a higher degree are included as well. So there are some very high degree nodes, though they are in the minority.

This does not appear in random graphs. The researchers wondered why this was and noticed that real world networks have several features that random networks don't. First, the number of nodes grows with time. Second, over time the number of edges changes. Here we assume the number of edges grows. Finally, the connection of nodes is not random. This is what is called "preferential attachment". This designates that nodes with more edges have a higher probability of receiving more edges as time goes on. The rich get richer in a way. They termed these networks "scale-free," since the distribution of node degrees looks the same no matter at which scale you look at it (as is the case with power laws in general).

It's a Small World After All

Scale-free networks (as well as random ones) exhibit what is called "small world behavior," which is that the average distance between nodes is much smaller than the number of nodes in the graph. This is where the six degrees of separation enters in. Scale-free networks have a small average distance between nodes. In fact, the researchers estimated the Internet's average distance to be 19 clicks.

The Instant Messaging Search

So now that I knew all this, I had to collect data to test my hypothesis that instant messaging is a scale-free network. I tried first talking to AOL. A physicist who had tried the same warned me this would be extremely difficult. And it was. In fact, it was impossible. AOL did not want the liability issues of me messing with their instant messaging buddy list database, even though I promised everyone would be assigned an anonymous number.

Later, I had a breakthrough. On the advice of my undergrad advisor, one of the high priests of computer know-how, I investigated the Jabber instant messaging system. Jabber is an open-source instant messaging protocol. As with many open-source adventures, its votaries are more open and willing to help. I was thus able to obtain a 50,000 user buddy list database from nioki.com, a young person's network in France . The list was anonymous, every name was a number, and I knew nothing about them. Note to amateur scientists: sometimes it pays to cooperate with the smaller guys.

The next process was the research. I had to write my own code in C++ to analyze the data. However, I luckily had some help from other scientists in the field. I am indebted to Mark Newman for sending me a sample of his code to help me improve mine. I learned a valuable lesson how to communicate with professional scientists. First, before you write them, learn something about them and their research. After reading some of Mark Newman's work, I sent him a letter explaining my research and asking for help. He graciously responded. Of course, not all scientists are so open, but never let fear (or ego) prevent you from asking for help from those in the professional community or even in the Society for Amateur Scientists. After some long nights, each run to calculate the average distance took 6 hours on the school's distributed computer cluster. I finally came up with results that confirmed my hypothesis: instant messaging networks are scale-free. In fact, the average distance between the people on the network was 4.35. I guess maybe the world is getting smaller. Or maybe that network was just exclusive.

One interesting note is that scale-free networks have an interesting stability characteristic. Scale-free networks are in general stable when certain nodes and their corresponding edges "disappear." This means that if a given random node disappears, the average distance between nodes will likely not change. However, if the nodes with the highest degree are targeted, the network rapidly loses stability and becomes disconnected or rapidly increases its average distance. Many researchers are concerned that if someone wanted to take down the Internet, they could accomplish this by taking out the most connected routers. This applies to instant messaging in the case of viruses and worms. If a virus or a worm is spreading across an instant messaging network, it could possibly be slowed by shutting down the most connected users instead of closing the network as a whole. This could help both minimize the damage done by the worm and the inconvenience to users. After finishing my paper I uploaded it to arxiv.org, the giant e-print archive for physicists. It was received very well. Many researchers told me this was one of the few times a direct measurement on a network like instant messaging had been done. In fact, several papers including a review paper on the topic ended up directly quoting my paper. The biggest moment, however, was when I found out that a full subsection of my paper was published in the monograph "Evolution and Structure of the Internet: A Statistical Physics Approach" by Romualdo Pastor-Satorras and Alessandro Vespignani. It feltgood to have contributed to such a nascent field. My only regret was that I never submitted the paper for peer review. Partly because I felt no pressure for a peer reviewed paper and partly since it was rejected by a well-known journal, I never pursued full publication. It is a mistake I now regret now but corrected with my later work.

In summary, amateur scientists CAN do widely recognized research. Many people both inside and outside of SAS do this every day. Even though I was doing this work in college, the only real resources I had besides my advisor's advice was the academic library. This was a new field with very few people when I first wrote the paper. It's something that anyone can do if they keep at it. It's not a matter of finding a big problem. Just find an interesting problem and follow it in ways others have not yet thought about.

Further Reading

My paper is "Instant Messaging as a Scale-Free Network"

A deeper but still layman-level view of the topic is "The Physics of the Web"

If you want a deep and technical view, read the following. They are not mathematically hard. I started reading them with no knowledge of graph theory. No calculus or advanced math is required. All you need is a basic knowledge of statistics and matrices. They cite many articles that research different types of networks.

"Statistical Mechanics of Complex Networks" by Albert-Laszlo Barabasi and Reka Albert.

"Evolution of Networks" by S. N. Dorogovtsev and J. F. F. Mendes.

The Structure and Function of Complex Networks” by Mark Newman.

Here are some relevant popular science books:

"Linked: The New Science of Networks" by Albert-Laszlo Barabasi.

"Small Worlds and Six Degrees: The Science of a Connected Age" by Duncan Watts.

"Evolution of Networks: From Biological Nets to the Internet and WWW" by S. N. Dorogovtsev and J. F. F. Mendes.

www.amazon.com/exec/obidos/tg/detail//0198515901?v=glance

And here is a special interest resource:

"Evolution and Structure of the Internet: A Statistical Physics Approach" by Romualdo Pastor-Satorras and Alessandro Vespignani.


 
Figure 1: A random graph (courtesy of www.mathworld.wolfram.com).
 
Figure 2: Undirected and directed edges.
 
Figure 3: A sample scale-free network degree distribution graph.
   
Copyright 2005 by Society for Amateur Scientists