A Tool That Counts: Basic Statistics for the Amateur Scientist
Part 2. Analyzing and Evaluating Your Data
Mark Hartwig, Ph.D.
Editor's Note: The original version
of this 3-part series was first published as a major
feature in Science Probe! (April 1992). The
article received some nice reader feedback and college
professors asked to reprint it. When Nature,
one of the world's leading journals of science, reviewed
Science Probe! in a feature about new science
publications, this article was specifically cited by
the reviewer, James Lovelock. We are grateful to Dr.
Hartwig for allowing The Citizen Scientist
to present this series based on his article.
When presented properly, statistics
is a fascinating subject that can open up new ways of
looking at the natural world. Moreover, the basic principles
are actually quite simple and can enhance anyone's critical
thinking skills. The purpose of this series is to provide
a gentle introduction to the wide world of statistics
and to show you how to use statistical tools and principles
to improve your understanding of the world around you.
In Part
1 (Compiling and Sorting Your Data) (TCS, 11
March 2005) we explored basic descriptive statistics.
This time we will look at analyzing and evaluating data.
Terms of Endearment
There is much more to descriptive statistics
than counting. As useful as frequency tables may be,
sometimes you want to describe your data more compactly, say,
with one or two numbers instead of 10 or 20. In a moment,
I'll describe some particularly useful statistical indices
you can use with your own data sets. But first, it's
necessary to define a few common terms that will make
it easier to describe some of the statistical indices
I'll be covering.
The first term is the word variable.
Briefly, a variable is any measurable entity that
can take on different values. Age, for example, is a
variable that can take on an indefinite number of values.
So is weight, height or the number of kilowatts consumed
by a household in a year. Likewise for the amount of
stratospheric ozone over Fresno, California, which is
a variable we looked at in Table
1.
Some variables may take on a very limited
number of values, and sometimes these may not even be
numeric. Hair color, for example, can be specified by
one of several words. Other such variables include sex
(male or female), performance on a task (success or
failure), the status of a switch (on or off), the answer
to a true/false question, or the result in a coin toss.
In short, it doesn't matter whether a variable takes
on two values or two billion, or whether the values
are numeric. It's still a variable.
Another term worth remembering is the
word observation, a tricky term because statisticians
don't use it quite the same way the rest of us do. In
statobabble, an observation is the value you
obtain when you take a measurement on some variable.
Thus, if you were to measure the resting heart rate
of 50 brown-throated, three-toed sloths, your data set
would contain 50 observations on that variable. Or,
building on our ozone example, Table
1 contains 365 observations on stratospheric ozone
as measured by a ground-based Dobson instrument, some
of which are missing, and another 365 observations
as measured by a satellite-based TOMS.
The third term we need to know is the
word case. Statistically speaking, a case is
the particular object or person that we measure when
we take an observation on some variable. For example,
if we were to measure the weight of 75 tree shrews,
each tree shrew would constitute an individual case.
Similarly, if we were to measure the body fat, height,
and running speed of 35 male sprinters, each sprinter
would constitute a case. Only now each case would contain
three variables instead of one.
To remember the differences between
a variable, an observation, and a case, it might help
to look at Table 3, which contains some of the ozone
data from Table
1. Each column in Table 3 represents a different
variable: the date of measurement, the ozone level measured
by the Dobson instrument, and the ozone level measured
by the TOMS instrument in the satellite. Each row represents
a different case, one for every day of the year. Finally,
each value in the table constitutes an individual observation.
Day | TOMS | Ground
1 | 308 | 308
2 | 294 | 305
3 | 316 | 316
4 | 360 | 376
5 | 431 | 450
6 | 403 | 406
Table 3. First six days of Fresno ozone observations (Dobson units).
A Matter of Balance
With our new terminology, we can now
describe a few handy statistical indices. One such statistic
is the mean, or the simple arithmetic average.
The mean is a measure of central tendency; it tells
us where a series of values is "centered." (There are
other "averages" as well. These include the median,
mode, the geometric mean, and the harmonic mean. But
we'll leave them for another time.)
Except for the fact that some data
sets can be very large, the mean is an easy statistic
to compute. Simply add up all the observations on one
variable and divide by the total number of observations.
Let's take an example: In Colorado Springs, the high
temperatures for 28 April through 1 May 1991, were,
respectively, 11.1, 10.6, 5.6 and 17.2 degrees Celsius
(52, 51, 42 and 63 degrees Fahrenheit). To compute the
mean, we add up all the observations and divide by four,
the number of observations:
(11.1 + 10.6 + 5.6 + 17.2)/4 = 11.125
The mean high temperature for those
four days in Colorado Springs was 11.1 degrees Celsius
(52 degrees Fahrenheit).
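As a minimal sketch, the same calculation can be written in a few lines of Python. The function name mean and the variable name highs_c are illustrative choices, not part of the original article; the four temperatures are the ones quoted above.

```python
# High temperatures in Colorado Springs, 28 April through 1 May 1991,
# in degrees Celsius (values taken from the text).
highs_c = [11.1, 10.6, 5.6, 17.2]


def mean(values):
    """Arithmetic mean: the sum of the observations divided by their count."""
    return sum(values) / len(values)


mean_high = mean(highs_c)

# Rounded for display; this should agree with the hand calculation of 11.125.
print(round(mean_high, 3))
```

The same function works for any list of numeric observations on a single variable, such as a full year of ozone readings.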
Returning to our ozone example, we
can compute the mean for the TOMS readings. Adding up
all the values and dividing by 365, we obtain a mean
of 309.22. Notice where this value lies on our bar graph
in Fig. 1 from Part
1, which is also shown here in Part 2. If we think
of the x-axis of our bar graph as a balance board, and
the vertical bars as stacks of blocks, a fulcrum placed
at 309.22, as in Fig. 2, would come very close to balancing
the board and would balance it perfectly if we had graphed
the occurrences of each separate value rather than 10-point
intervals. Thus, we can consider the mean as being the
"center of gravity" for any set of data points.
Deviations Great and Small
The mean, of course,
doesn't tell us everything we need to know about our
data set. It tells us where our data are "centered,"
but doesn't tell us much about how the data are spread
out. For example, the annual mean temperature for Bismarck,
North Dakota, from 1951 to 1980 was about 5 degrees
Celsius (41 degrees Fahrenheit). But not every day is
that cool, or that warm. Each day's temperature deviates
a certain amount from the annual mean. In fact, the
temperature extremes for that same period were -42.2
and 42.8 degrees Celsius (-44 and 109 degrees Fahrenheit)!
This is true for other variables as
well. Individual observations can be expected to differ
somewhat from the mean. It is therefore necessary to
have some kind of statistic that tells us how much variation
is in our data set.
A simple indicator of variation is
the range, which is estimated by subtracting
the lowest, or minimum, value in your data set from
the highest, or maximum value. In our ozone example,
the range for our TOMS data is 455 – 245 = 210 Dobson
units.
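A short Python sketch of the range calculation follows. Because the full TOMS data set is not reproduced here, the example uses the four Colorado Springs high temperatures (in degrees Fahrenheit) from earlier in this article; the function name value_range is a hypothetical helper.

```python
# Colorado Springs high temperatures in degrees Fahrenheit (from the text).
highs_f = [52, 51, 42, 63]


def value_range(values):
    """Range: the maximum value minus the minimum value."""
    return max(values) - min(values)


temp_range = value_range(highs_f)  # 63 - 42
print(temp_range)
```

Given only the two extremes quoted for the TOMS data, value_range([245, 455]) reproduces the 210 Dobson units computed above.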
Unfortunately, because the range only
tells you about the most extreme values in your data
set, its usefulness as a measure of overall variation
is limited. A much better gauge of overall variation
would be to estimate the average value of all deviations
from the mean. That is, we could subtract the mean from
each value in our data set, sum up all the differences,
and then divide by the number of observations. The problem,
however, is that the result would always be zero because
the values on each side of the mean cancel each other.
Let's use our Colorado Springs temperature data as an
example.
If we subtract the mean temperature
from the high temperature for each day (in degrees
Fahrenheit), we end up with these deviations:
Value | Minus | Mean | Equals | Deviation
52 | - | 52 | = | 0
51 | - | 52 | = | -1
42 | - | 52 | = | -10
63 | - | 52 | = | 11
If we sum the deviations, we end up with a value of
zero. For example:
0 + (-1) + (-10) + 11 = 0
Therefore, it's pointless to divide by the number of
observations.
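The cancellation is easy to check for yourself. The sketch below, with illustrative variable names, computes each deviation from the mean for the four Fahrenheit temperatures and then sums them.

```python
# Colorado Springs high temperatures in degrees Fahrenheit (from the text).
highs_f = [52, 51, 42, 63]

# The mean of these four whole-number values is exactly 52.
mean_f = sum(highs_f) / len(highs_f)

# Deviation of each observation from the mean.
deviations = [x - mean_f for x in highs_f]

# Positive and negative deviations cancel, so the total is zero.
total = sum(deviations)

print(deviations)
print(total)
```

With data whose mean is not a whole number, floating-point rounding may leave a total that is extremely close to zero rather than exactly zero.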
Fortunately, statisticians have come
up with a modification of this scheme that avoids the
"canceling out" problem. Instead of estimating the mean
of all the deviations, they estimate the mean of all
the squared deviations. Let's see how this
works with our temperature example (in degrees Fahrenheit):
(Value - Mean)^2 | Equals | Squared Deviation
(52 - 52)^2 | = | 0
(51 - 52)^2 | = | 1
(42 - 52)^2 | = | 100
(63 - 52)^2 | = | 121
When the deviations are squared, the
minus signs are eliminated, and we get a positive value
when we sum everything up:
0 + 1 + 100 + 121 = 222
Now we can take the final step and
divide by the number (n) of observations (4), to obtain
the mean of our squared deviations: 55.5.
The statistic we have just estimated is called the variance
of our data set. Statisticians make extensive use of
this statistic in many of their procedures. Its one
disadvantage, however, is that the number is given in
squared units. To obtain a measure of variability in
the original units, statisticians simply take the square
root of the variance. In the present example, that would
give us a result of 7.45 degrees Fahrenheit. This statistic is known as
the standard deviation. (1)
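The whole procedure, from squared deviations to standard deviation, can be sketched in Python as follows. The variable names are illustrative, and the code uses the uncorrected (divide by n) formulas described in the text.

```python
import math

# Colorado Springs high temperatures in degrees Fahrenheit (from the text).
highs_f = [52, 51, 42, 63]

n = len(highs_f)
mean_f = sum(highs_f) / n

# Square each deviation so that negative and positive values no longer cancel.
squared_deviations = [(x - mean_f) ** 2 for x in highs_f]

# Variance: the mean of the squared deviations (222 / 4).
variance = sum(squared_deviations) / n

# Standard deviation: the square root of the variance, back in original units.
std_dev = math.sqrt(variance)

print(variance)
print(round(std_dev, 2))
```

The rounded standard deviation should agree with the 7.45 obtained by hand.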
A Common Yardstick
Once you know how to estimate the mean
and standard deviation, new doors open for
understanding your data. In addition to providing information
about your data set as a whole, the mean and standard
deviation can also be used to make sense of individual
scores. Let's return to our ozone data for an example.
If we look at the TOMS ozone reading
for Day 97 in Part
1, we find a value of 363 Dobson units. But what
does that tell us? Is it high? Is it low? If all we
have is that number, we can't really say. But once we
know the mean and standard deviation for our TOMS data,
the situation changes. We have already calculated the
mean, which is 309.97. The standard deviation is 31.56.
This reveals that our value (363) is not only much higher
than average, it is also farther than average from the
mean. A quick glance at Fig. 1
in Part 1 (also included here in Part 2) should reinforce
this impression. This is obviously very different from
what we would conclude if the mean were, say, 365, and
the standard deviation 75, or, again, if the mean were
364 and the standard deviation 0.01.
If we wanted to go a step further,
we could get a more precise indication of where our
value (363) lies by computing what statisticians call
a standardized score, or a z-score.
A standardized score is estimated simply by subtracting
the mean from the observed value and then dividing the
result by the standard deviation. In our case the result
would be (363 – 309.97) ÷ 31.56 = 1.68, which
tells us that our observed value of 363 lies 1.68 standard
deviations above the mean.
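Here is a minimal Python sketch of the z-score calculation, using the Day 97 reading and the mean and standard deviation quoted above. The function name z_score is a hypothetical helper. The two extra calls use the made-up comparison values mentioned earlier (a mean of 365 with a standard deviation of 75, and a mean of 364 with a standard deviation of 0.01) to show how much the interpretation of 363 depends on both statistics.

```python
def z_score(value, mean, std_dev):
    """Standardized score: distance from the mean in standard deviations."""
    return (value - mean) / std_dev


# Day 97 TOMS reading against the mean and standard deviation from the text.
z = z_score(363, 309.97, 31.56)
print(round(z, 2))

# The same reading against the two hypothetical alternatives in the text.
z_wide = z_score(363, 365, 75)      # slightly below the mean, very close to it
z_narrow = z_score(363, 364, 0.01)  # enormously far below the mean
print(round(z_wide, 3))
print(round(z_narrow))
```

A positive z-score means the observation lies above the mean, and a negative one means it lies below.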
This logic can be extended to entire
data sets. For example, instead of looking at individual
observations with a z-score, you can compute a related
statistic, called the t statistic, to compare
group means. Another benefit
of standardized scores is that they allow you to convert
different kinds of measurements to a common scale, greatly
facilitating the search for relationships between variables.
In the above example, we converted our observed value
from Dobson units (363) to standard deviations (1.68).
We can do the same thing with almost any numeric variable.
Thus, if we wanted to examine the relationship between,
say, height and weight in a group of men, we could convert
their measurements to standardized scores. Then we could
approach the question of a relationship by looking at
the pattern of "ups" and "downs" in the data. If the
z-scores for weight tend to follow the z-scores for
height as we go from person to person, that indicates
a strong positive relationship between the two variables.
This is clearly shown in Fig. 3.
If, on the other hand, the z-scores
for height tend to mirror those for weight (with positive
scores for weight matching up with similar negative
scores for height, and vice versa), this indicates
a strong negative relationship, as in Fig. 4.
Finally, as shown in Fig. 5, if there
is no discernible pattern of co-variation, this may
indicate no relationship. Actually, it indicates no
linear relationship. Higher order relationships
may still exist, but these are beyond the scope of the
present discussion.
Rather than looking at charts, a more
precise way to estimate the relationship between two
variables is with the correlation coefficient
or Pearson's coefficient of correlation. This
statistic varies between one and minus one, with the
former indicating a perfect positive relationship, the
latter indicating a perfect negative relationship, and
zero indicating no linear relationship.
Assuming that you have already estimated
z-scores for a given data set, you can estimate Pearson's
coefficient of correlation for two variables by doing
the following:
1. For each case, multiply the z-score for
the first variable by the z-score of the second variable.
2. Sum up all the resulting values.
3. Divide by the total number of cases.
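The three steps above can be sketched in Python as shown below. The function names z_scores and pearson_r are hypothetical helpers, and the data are only the six days listed in Table 3, so the result will not match a value computed from a full year of readings. The sketch also assumes that missing readings are recorded as None, that any day with a missing reading is dropped, and that the z-scores are computed from the remaining paired days.

```python
import math


def z_scores(values):
    """Convert a list of numbers to z-scores using the uncorrected (n) formulas."""
    n = len(values)
    mean = sum(values) / n
    std_dev = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    return [(x - mean) / std_dev for x in values]


def pearson_r(xs, ys):
    """Pearson correlation as the mean product of paired z-scores."""
    # Keep only the cases where both variables have an observation.
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    zx = z_scores([x for x, _ in pairs])
    zy = z_scores([y for _, y in pairs])
    # Step 1: multiply the paired z-scores. Step 2: sum. Step 3: divide by n.
    products = [a * b for a, b in zip(zx, zy)]
    return sum(products) / len(products)


# First six days of Fresno ozone observations from Table 3 (Dobson units).
toms = [308, 294, 316, 360, 431, 403]
ground = [308, 305, 316, 376, 450, 406]

r = pearson_r(toms, ground)
print(round(r, 2))
```

Note that this approach fails with a division by zero if either variable has a standard deviation of zero, that is, if every observation on it is identical.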
Following through with our ozone example,
we can compute the correlation between the satellite-based
TOMS readings and those from the ground-based Dobson
instrument. Starting with the first day for which we
have readings, we multiply the z-score for the TOMS
reading by the z-score for the Dobson instrument. We
repeat this for each day on which we have observations
for both variables (skipping those days where there
are any missing values). This gives us a total of 351
values, which we sum up and divide by 351. The correlation
we come up with is .96, indicating a very strong positive
relationship between our two sets of observations, which
is exactly what we would expect, given that the TOMS
and Dobson instruments are measuring the same thing.
Note that a positive correlation does
not mean that the two instruments are giving us the
same absolute values. All it means is that their z-scores
follow a similar pattern of ups and downs. If we plot
some of the absolute values, as in Fig. 6, we see that
although the readings from the TOMS and Dobson instruments
follow similar patterns, they still differ.
Part 3 of this series will appear in the next issue
of The Citizen Scientist.
Note.
1. Note that this estimate assumes
that you are only interested in describing the data
you have in hand, that is, that the objects you're studying constitute
the entire population. If, on the other hand, you consider
your objects to be a sample drawn from a larger population,
you would use a "corrected" standard deviation, dividing
the sum of squared deviations by n-1 rather than n before
taking the square root. The same correction applies when
estimating the variance from a sample; the mean itself is
still computed by dividing by n. The uncorrected
statistics are used here, because (1) the subject of
this section is descriptive statistics rather
than inferential statistics and (2) the uncorrected
formulas communicate more clearly what a given statistic
means.
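For readers who want to see the difference in practice, Python's standard statistics module provides both versions: pstdev divides by n (the uncorrected, population form used in this article) and stdev divides by n-1 (the corrected, sample form). The sketch below applies both to the four Colorado Springs temperatures; the variable names are illustrative.

```python
import statistics

# Colorado Springs high temperatures in degrees Fahrenheit (from the text).
highs_f = [52, 51, 42, 63]

# Uncorrected standard deviation: sum of squared deviations divided by n.
population_sd = statistics.pstdev(highs_f)

# Corrected standard deviation: sum of squared deviations divided by n - 1.
sample_sd = statistics.stdev(highs_f)

print(round(population_sd, 2))
print(round(sample_sd, 2))
```

The corrected value is always the larger of the two, and the gap shrinks as the number of observations grows.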
Acknowledgments
The Citizen Scientist and
Dr. Hartwig are grateful to Richard D. McPeters and
Arlin J. Krueger of NASA's Goddard Space Flight Center,
members of the TOMS Ozone Processing Team, and the National
Space Science Data Center for providing the measurements
of total ozone made by the TOMS instrument aboard the
Nimbus-7 satellite that appear in this article.
We are also happy to thank Robert Green
of the National Oceanic and Atmospheric Administration
and the Fresno-based Dobson instrument team for supplying
additional measurements of ozone.