Four Degrees of Separation: Research
on the Science of Networks
Reginald Smith
As an amateur scientist, I believe
many times we in the amateur science community sell
ourselves short. We believe that the only research we
can hope to accomplish is the fun and curious. Though
we call ourselves scientists often there is a nagging
voice that tells us we are small-time, and the real
work belongs to the guys in the gazillion dollar labs.
True, some things cannot be done by amateur scientists
(yet). I cannot claim to be able create energies in
excess of the Large Hadron Collider in my basement.
I cannot fabricate precision nano-machines. But I believe
that I, and any other amateur scientist, can do research
that the larger world will take notice of. This article
is about one of my experiences in this regard.
In 2002, I was finishing my studies
at the University of Virginia for my BA in physics.
Like most graduating seniors, I had a senior thesis
to write. Earlier in the year I decided to write about
the emerging field of complex systems, which had recently
grabbed my attention. However, I could not nail down
a topic in this vast field. By chance, I had read an
article on PhysicsWeb (quoted below) describing recent
research on the structure of the Internet by a group
headed by the Notre Dame physicist Albert-Laszlo Barabasi.
They had measured the connections between websites using
hyperlinks and described the overall topology (structure)
of the Internet in detail. At first, I felt content
to just write about his research and get this paper
over with, since I had a bad case of senioritis. But
one day I had a new idea.
While chatting with a friend on AOL
Instant Messenger™ I realized that the networks between
"buddies" on instant messaging software like
AOL and Microsoft's MSN Messenger ™ is very similar
to the networks between web sites. The article had mentioned
that such networks were also used to measure social
networks as well as computer networks. I realized that
measuring the size and connections of a social network
could now be done with unprecedented precision due to
the rise of instant messaging. Thus I set about measuring
the AOL Instant Messaging Network
A Brief Introduction
to Graph Theory
The science of networks is based on
a branch of mathematics known as graph theory. Here
I will give a short introduction to graph theory that
will give the reader a simple knowledge sufficient to
understand the research. A more comprehensive knowledge
can be gained in the further reading at the end of the
article.
Graph Theory was developed by many
of the great names in science and mathematics, including
Euler, Kirchoff, and Erdos. The work of Erdos is very
relevant to networks. Graphs are very easy to understand.
They are an assemblage of nodes (vertices), which are
distinct and individual points, connected to each other
by edges (arcs).
Several mathematical measures of graphs
are important:
Undirected versus directed
edges:
Edges can be undirected (inherently bi-directional) or
directed (from one node to another but not in reverse).
Different networks are modeled by undirected edges or
directed edges or both. The Internet is modeled using
directed edges. A hyperlink on page X points to page Y,
but the link is not necessarily reciprocal. Some social
networks are undirected, for example mutual acquaintances.
I know Forrest Mims and he knows me. This kind of mutual
acquaintance network first gave rise to the phrase "six
degrees of separation" coined by sociologist Stanley
Milgram.
Degree of a node:
The degree of a node is how many edges a node is connected
with. For undirected networks, every node has one degree.
For directed networks, nodes have two degrees: an in-degree
(how many edges point to a node) and an out-degree (how
many edges depart from this node).
Connected graphs:
A graph is called connected if you can reach any given
node from any other node by hopping across edges. In other
words, there are no isolated nodes that cannot be reached
by starting at certain nodes. A graph is simply connected
if it is connected with undirected edges, and a graph
is strongly connected if it is connected by directed edges
(which is harder to do).
Diameter and average distance:
How far apart are the nodes? There are two main measures
for this. The diameter is the measure of the longest shortest
path between any two nodes in a graph. The average distance
is the average distance between any two nodes, which is
calculated by averaging the shortest distance between
every pair of nodes (n2 pairs in directed networks,
n2/2 pairs in undirected networks) in the graph.
Application to Networks
Once you have a handle on graph theory,
it is simple to see how networks are modeled as graphs.
For example, in the Internet a node is a web site (or
router) and the edge is a hyperlink (or IP entry in
a router table). In social networks, nodes are people
and the edges are acquaintances. There are some directed
social networks, such as the instant messaging network,
email networks, and networks of citations in scientific
papers. There are also networks that measure power grids,
where power stations are nodes and transmission lines
are edges. More surprising is that the metabolism in
cells can be measured by a network, where chemical enzymes
are nodes and edges point from a reactant to a product
of a chemical reaction.
Random Networks and Scale-Free Networks
Much of the work on networks done by
Erdos involved the concept of random networks. Random
networks are networks where the number of nodes is fixed
and each node connects to other nodes with a fixed probability,
P. It was found that after a certain value of P the
random graphs would become connected (which involves
a study called percolation theory which I will not go
into here). At first, researchers assumed most, if not
all, real-world networks are random. Random networks
have the interesting feature that if you plot the statistical
distribution of the number of nodes with a given degree,
you get an exponential curve. This mainly says a lot
of nodes have a given degree with not many varying from
this. Exponential curves are of the form P(X) = e-X.
In the late 1990s, however, a group
of researchers actually asked if this is true. The work
was centered at Notre Dame and involved Albert-Laszlo
Barabasi, Reka Albert and Hawoong Jeong. Their research,
which focused in a large part on the Internet, found
the network is not random. First, the distribution of
the node degrees does not follow an exponential function,
but rather a power law curve. Power law curves have
the form P(X) = a-X, where “a” is a base
and “d” is called the critical exponent. In particular,
the distribution of power law curves is one where instead
of almost all nodes being of a certain degree, many
nodes cluster at one degree. There is also a tail in
the distribution so other nodes of a higher degree are
included as well. So there are some very high degree
nodes, though they are in the minority.
This does not appear in random graphs.
The researchers wondered why this was and noticed that
real world networks have several features that random
networks don't. First, the number of nodes grows with
time. Second, over time the number of edges changes.
Here we assume the number of edges grows. Finally, the
connection of nodes is not random. This is what is called
"preferential attachment". This designates
that nodes with more edges have a higher probability
of receiving more edges as time goes on. The rich get
richer in a way. They termed these networks "scale-free,"
since the distribution of node degrees looks the same
no matter at which scale you look at it (as is the case
with power laws in general).
It's a Small World After All
Scale-free networks (as well as random
ones) exhibit what is called "small world behavior,"
which is that the average distance between nodes is
much smaller than the number of nodes in the graph.
This is where the six degrees of separation enters in.
Scale-free networks have a small average distance between
nodes. In fact, the researchers estimated the Internet's
average distance to be 19 clicks.
The Instant Messaging Search
So now that I knew all this, I had
to collect data to test my hypothesis that instant messaging
is a scale-free network. I tried first talking to AOL.
A physicist who had tried the same warned me this would
be extremely difficult. And it was. In fact, it was
impossible. AOL did not want the liability issues of
me messing with their instant messaging buddy list database,
even though I promised everyone would be assigned an
anonymous number.
Later, I had a breakthrough. On the
advice of my undergrad advisor, one of the high priests
of computer know-how, I investigated the Jabber instant
messaging system. Jabber is an open-source instant messaging
protocol. As with many open-source adventures, its votaries
are more open and willing to help. I was thus able to
obtain a 50,000 user buddy list database from nioki.com,
a young person's network in France . The list was anonymous,
every name was a number, and I knew nothing about them.
Note to amateur scientists: sometimes it pays to cooperate
with the smaller guys.
The next process was the research.
I had to write my own code in C++ to analyze the data.
However, I luckily had some help from other scientists
in the field. I am indebted to Mark Newman for sending
me a sample of his code to help me improve mine. I learned
a valuable lesson how to communicate with professional
scientists. First, before you write them, learn something
about them and their research. After reading some of
Mark Newman's work, I sent him a letter explaining my
research and asking for help. He graciously responded.
Of course, not all scientists are so open, but never
let fear (or ego) prevent you from asking for help from
those in the professional community or even in the Society
for Amateur Scientists. After some long nights, each
run to calculate the average distance took 6 hours on
the school's distributed computer cluster. I finally
came up with results that confirmed my hypothesis: instant
messaging networks are scale-free. In fact, the average
distance between the people on the network was 4.35.
I guess maybe the world is getting smaller. Or maybe
that network was just exclusive.
One interesting note is that scale-free
networks have an interesting stability characteristic.
Scale-free networks are in general stable when certain
nodes and their corresponding edges "disappear."
This means that if a given random node disappears, the
average distance between nodes will likely not change.
However, if the nodes with the highest degree are targeted,
the network rapidly loses stability and becomes disconnected
or rapidly increases its average distance. Many researchers
are concerned that if someone wanted to take down the
Internet, they could accomplish this by taking out the
most connected routers. This applies to instant messaging
in the case of viruses and worms. If a virus or a worm
is spreading across an instant messaging network, it
could possibly be slowed by shutting down the most connected
users instead of closing the network as a whole. This
could help both minimize the damage done by the worm
and the inconvenience to users. After finishing my paper
I uploaded it to arxiv.org,
the giant e-print archive for physicists. It was received
very well. Many researchers told me this was one of
the few times a direct measurement on a network like
instant messaging had been done. In fact, several papers
including a review paper on the topic ended up directly
quoting my paper. The biggest moment, however, was when
I found out that a full subsection of my paper was published
in the monograph "Evolution and Structure of the
Internet: A Statistical Physics Approach" by Romualdo
Pastor-Satorras and Alessandro Vespignani. It feltgood
to have contributed to such a nascent field. My only
regret was that I never submitted the paper for peer
review. Partly because I felt no pressure for a peer
reviewed paper and partly since it was rejected by a
well-known journal, I never pursued full publication.
It is a mistake I now regret now but corrected with
my later work.
In summary, amateur scientists CAN
do widely recognized research. Many people both inside
and outside of SAS do this every day. Even though I
was doing this work in college, the only real resources
I had besides my advisor's advice was the academic library.
This was a new field with very few people when I first
wrote the paper. It's something that anyone can do if
they keep at it. It's not a matter of finding a big
problem. Just find an interesting problem and follow
it in ways others have not yet thought about.
Further Reading
My paper is "Instant
Messaging as a Scale-Free Network"
A deeper but still layman-level view
of the topic is "The
Physics of the Web"
If you want a deep and technical view,
read the following. They are not mathematically hard.
I started reading them with no knowledge of graph theory.
No calculus or advanced math is required. All you need
is a basic knowledge of statistics and matrices. They
cite many articles that research different types of
networks.
"Statistical
Mechanics of Complex Networks" by Albert-Laszlo
Barabasi and Reka Albert.
"Evolution
of Networks" by S. N. Dorogovtsev and J. F.
F. Mendes.
“The
Structure and Function of Complex Networks” by Mark
Newman.
Here are some relevant popular science
books:
"Linked:
The New Science of Networks" by Albert-Laszlo
Barabasi.
"Small
Worlds and Six Degrees: The Science of a Connected Age"
by Duncan Watts.
"Evolution
of Networks: From Biological Nets to the Internet and
WWW" by S. N. Dorogovtsev and J. F. F. Mendes.
www.amazon.com/exec/obidos/tg/detail//0198515901?v=glance
And here is a special interest resource:
"Evolution
and Structure of the Internet: A Statistical Physics
Approach" by Romualdo Pastor-Satorras and Alessandro
Vespignani. 
|