25 March 2005

A Tool That Counts: Basic Statistics for the Amateur Scientist

Part 2. Analyzing and Evaluating Your Data

Mark Hartwig, Ph.D.

Editor's Note: The original version of this 3-part series was first published as a major feature in Science Probe! (April 1992). The article received some nice reader feedback and college professors asked to reprint it. When Nature, one of the world's leading journals of science, reviewed Science Probe! in a feature about new science publications, this article was specifically cited by the reviewer, James Lovelock. We are grateful to Dr. Hartwig for allowing The Citizen Scientist to present this series based on his article.

When presented properly, statistics is a fascinating subject that can open up new ways of looking at the natural world. Moreover, the basic principles are actually quite simple and can enhance anyone's critical thinking skills. The purpose of this series is to provide a gentle introduction to the wide world of statistics and to show you how to use statistical tools and principles to improve your understanding of the world around you. In Part 1 (Compiling and Sorting Your Data) (TCS, 11 March 2005) we explored basic descriptive statistics. This time we will look at analyzing and evaluating data.

Terms of Endearment

There is much more to descriptive statistics than counting. As useful as frequency tables may be, sometimes you want to describe your data more compactly, with, say, one or two numbers instead of 10 or 20. In a moment, I'll describe some particularly useful statistical indices you can use with your own data sets. But first, it's necessary to define a few common terms that will make those indices easier to describe.

The first term is the word variable. Briefly, a variable is any measurable entity that can take on different values. Age, for example, is a variable that can take on an indefinite number of values. So is weight, height or the number of kilowatts consumed by a household in a year. Likewise for the amount of stratospheric ozone over Fresno, California, which is a variable we looked at in Table 1.

Some variables may take on a very limited number of values, and sometimes these may not even be numeric. Hair color, for example, can be specified by one of several words. Other such variables include sex (male or female), performance on a task (success or failure), the status of a switch (on or off), the answer to a true/false question, or the result in a coin toss. In short, it doesn't matter whether a variable takes on two values or two billion, or whether the values are numeric. It's still a variable.

Another term worth remembering is the word observation, a tricky term because statisticians don't use it quite the same way the rest of us do. In statobabble, an observation is the value you obtain when you take a measurement on some variable. Thus, if you were to measure the resting heart rate of 50 brown-throated, three-toed sloths, your data set would contain 50 observations on that variable. Or, building on our ozone example, Table 1 contains 365 observations on stratospheric ozone as measured by a ground-based Dobson instrument, some of which are missing observations, and another 365 observations as measured by a satellite-based TOMS.

The third term we need to know is the word case. Statistically speaking, a case is the particular object or person that we measure when we take an observation on some variable. For example, if we were to measure the weight of 75 tree shrews, each tree shrew would constitute an individual case. Similarly, if we were to measure the body fat, height, and running speed of 35 male sprinters, each sprinter would constitute a case. Only now each case would contain three variables instead of one.

To remember the differences between a variable, an observation, and a case, it might help to look at Table 3, which contains some of the ozone data from Table 1. Each column in Table 3 represents a different variable: the date of measurement, the ozone level measured by the Dobson instrument, and the ozone level measured by the TOMS instrument in the satellite. Each row represents a different case, one for every day of the year. Finally, each value in the table constitutes an individual observation.

Day     TOMS    Ground
1       308     308
2       294     305
3       316     316
4       360     376
5       431     450
6       403     406

Table 3. First six days of Fresno ozone observations.
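To make the three terms concrete, here is a minimal Python sketch that stores the six days of Table 3 as a list of cases. The names `cases`, `variables`, and `observations` are illustrative choices made for this sketch and do not come from the original article.

```python
# Each dictionary is one case (one day); each key is a variable;
# each stored value is one observation. Values are copied from Table 3.
cases = [
    {"day": 1, "toms": 308, "ground": 308},
    {"day": 2, "toms": 294, "ground": 305},
    {"day": 3, "toms": 316, "ground": 316},
    {"day": 4, "toms": 360, "ground": 376},
    {"day": 5, "toms": 431, "ground": 450},
    {"day": 6, "toms": 403, "ground": 406},
]

# The variables are the column names shared by every case.
variables = list(cases[0].keys())

# Flatten the table: every individual value is one observation.
observations = [value for case in cases for value in case.values()]

print(len(cases), "cases")
print(len(variables), "variables:", variables)
print(len(observations), "observations")
```

Six rows and three columns give 18 individual observations, which matches the way the text counts every value in the table.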

A Matter of Balance

With our new terminology, we can now describe a few handy statistical indices. One such statistic is the mean, or the simple arithmetic average. The mean is a measure of central tendency; it tells us where a series of values is "centered." (There are other "averages" as well. These include the median, mode, the geometric mean, and the harmonic mean. But we'll leave them for another time.)

Except for the fact that some data sets can be very large, the mean is an easy statistic to compute. Simply add up all the observations on one variable and divide by the total number of observations. Let's take an example: In Colorado Springs, the high temperatures for 28 April through 1 May 1991, were, respectively, 11.1, 10.6, 5.6 and 17.2 degrees Celsius (52, 51, 42 and 63 degrees Fahrenheit). To compute the mean, we add up all the observations and divide by four, the number of observations:

(11.1 + 10.6 + 5.6 + 17.2)/4 = 11.125

The mean high temperature for those four days in Colorado Springs was 11.1 degrees Celsius (52 degrees Fahrenheit).
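The same calculation can be sketched in a few lines of Python. The function name `mean` and the variable `highs_celsius` are assumptions made for illustration; the temperatures are the four Colorado Springs highs quoted above.

```python
def mean(values):
    """Arithmetic mean: the sum of the observations divided by their count."""
    return sum(values) / len(values)


# High temperatures for 28 April through 1 May 1991, in degrees Celsius.
highs_celsius = [11.1, 10.6, 5.6, 17.2]

print(round(mean(highs_celsius), 3))  # 11.125
```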

Returning to our ozone example, we can compute the mean for the TOMS readings. Adding up all the values and dividing by 365, we obtain a mean of 309.22. Notice where this value lies on our bar graph in Fig. 1 from Part 1, which is also shown here in Part 2. If we think of the x-axis of our bar graph as a balance board, and the vertical bars as stacks of blocks, a fulcrum placed at 309.22, as in Fig. 2, would come very close to balancing the board and would balance it perfectly if we had graphed the occurrences of each separate value rather than 10-point intervals. Thus, we can consider the mean as being the "center of gravity" for any set of data points.

Deviations Great and Small

The mean, of course, doesn't tell us everything we need to know about our data set. It tells us where our data are "centered," but doesn't tell us much about how the data are spread out. For example, the annual mean temperature for Bismarck, North Dakota, from 1951 to 1980 was about 5 degrees Celsius (41 degrees Fahrenheit). But not every day is that cool, or that warm. Each day's temperature deviates a certain amount from the annual mean. In fact, the temperature extremes for that same period were -42.2 and 42.8 degrees Celsius (-44 and 109 degrees Fahrenheit)!

This is true for other variables as well. Individual observations can be expected to differ somewhat from the mean. It is therefore necessary to have some kind of statistic that tells us how much variation is in our data set.

A simple indicator of variation is the range, which is estimated by subtracting the lowest, or minimum, value in your data set from the highest, or maximum value. In our ozone example, the range for our TOMS data is 455 – 245 = 210 Dobson units.
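As a rough sketch, the range takes one line of Python. Note that the list below holds only the six TOMS readings from Table 3, not the full year of data, so its range is smaller than the 210 Dobson units quoted above. The names `value_range` and `toms_six_days` are illustrative.

```python
def value_range(values):
    """Range: the maximum value minus the minimum value."""
    return max(values) - min(values)


# Only the first six TOMS readings (Table 3), not the full 365-day series.
toms_six_days = [308, 294, 316, 360, 431, 403]

print(value_range(toms_six_days))  # 137
```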

Unfortunately, because the range only tells you about the most extreme values in your data set, its usefulness as a measure of overall variation is limited. A much better gauge of overall variation would be to estimate the average value of all deviations from the mean. That is, we could subtract the mean from each value in our data set, sum up all the differences, and then divide by the number of observations. The problem, however, is that the result would always be zero because the values on each side of the mean cancel each other. Let's use our Colorado Springs temperature data as an example.

If we subtract the mean temperature from the high temperature for each day (in degrees Fahrenheit), we end up with these deviations:

Value   Minus   Mean   Equals   Deviation
52      -       52     =          0
51      -       52     =         -1
42      -       52     =        -10
63      -       52     =         11

If we sum the deviations, we end up with a value of zero. For example:

0 + (-1) + (-10) + 11 = 0

Therefore, it's pointless to divide by the number of observations.
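The cancellation is easy to check with a short Python sketch. The variable names are assumptions for illustration; the data are the four Fahrenheit highs used above.

```python
# Colorado Springs high temperatures, in degrees Fahrenheit.
highs_f = [52, 51, 42, 63]

mean_f = sum(highs_f) / len(highs_f)  # 52.0

# Deviation of each observation from the mean.
deviations = [x - mean_f for x in highs_f]

print(deviations)       # [0.0, -1.0, -10.0, 11.0]
print(sum(deviations))  # 0.0
```

The positive and negative deviations cancel, so their sum (and therefore their average) is zero.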

Fortunately, statisticians have come up with a modification of this scheme that avoids the "canceling out" problem. Instead of estimating the mean of all the deviations, they estimate the mean of all the squared deviations. Let's see how this works with our temperature example (in degrees Fahrenheit):

(Value - Mean)²   Equals   Squared Deviation
(52 - 52)²        =          0
(51 - 52)²        =          1
(42 - 52)²        =        100
(63 - 52)²        =        121

When the deviations are squared, the minus signs are eliminated, and we get a positive value when we sum everything up:

0 + 1 + 100 + 121 = 222

Now we can take the final step and divide by the number (n) of observations (4), to obtain the mean of our squared deviations: 55.5.

The statistic we have just estimated is called the variance of our data set. Statisticians make extensive use of this statistic in many of their procedures. Its one disadvantage, however, is that the number is given in squared units. To obtain a measure of variability in the original units, statisticians simply take the square root of the variance. In the present example, that would give us a result of 7.45. This statistic is known as the standard deviation. (1)
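Here is a minimal Python sketch of the two statistics as described above, dividing by n (the uncorrected, descriptive form). The function names `population_variance` and `population_std` are illustrative choices for this sketch.

```python
import math


def population_variance(values):
    """Mean of the squared deviations from the mean (divides by n)."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / len(values)


def population_std(values):
    """Square root of the variance, in the original units."""
    return math.sqrt(population_variance(values))


# Colorado Springs high temperatures, in degrees Fahrenheit.
highs_f = [52, 51, 42, 63]

print(population_variance(highs_f))       # 55.5
print(round(population_std(highs_f), 2))  # 7.45
```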

A Common Yardstick

Knowing how to estimate the mean and standard deviation opens up new doors for understanding your data. In addition to providing information about your data set as a whole, the mean and standard deviation can also be used to make sense of individual scores. Let's return to our ozone data for an example.

If we look at the TOMS ozone reading for Day 97 in Part 1, we find a value of 363 Dobson units. But what does that tell us? Is it high? Is it low? If all we have is that number, we can't really say. But once we know the mean and standard deviation for our TOMS data, the situation changes. We have already calculated the mean, which is 309.97. The standard deviation is 31.56. This reveals that our value (363) is not only much higher than average, it is also farther than average from the mean. A quick glance at Fig. 1 in Part 1 (also included here in Part 2) should reinforce this impression. This is obviously very different from what we would conclude if the mean were, say, 365, and the standard deviation 75, or, again, if the mean were 364 and the standard deviation 0.01.

If we wanted to go a step further, we could get a more precise indication of where our value (363) lies by computing what statisticians call a standardized score, or a z-score. A standardized score is estimated simply by subtracting the mean from the observed value and then dividing the result by the standard deviation. In our case the result would be (363 – 309.97) ÷ 31.56 = 1.68, which tells us that our observed value of 363 lies 1.68 standard deviations above the mean.
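The z-score calculation can be sketched as a one-line Python function. The function name `z_score` and its parameter names are assumptions for illustration; the numbers are the Day 97 reading, mean, and standard deviation quoted in this section.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / std_dev


# Day 97 TOMS reading, with the mean and standard deviation quoted in the text.
print(round(z_score(363, 309.97, 31.56), 2))  # 1.68
```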

This logic can be extended to entire data sets. For example, instead of looking at individual observations with a z-score, you can compute a related statistic, called the t statistic, to compare group means.

Another benefit of standardized scores is that they allow you to convert different kinds of measurements to a common scale, greatly facilitating the search for relationships between variables. In the above example, we converted our observed value from Dobson units (363) to standard deviations (1.68). We can do the same thing with almost any numeric variable. Thus, if we wanted to examine the relationship between, say, height and weight in a group of men, we could convert their measurements to standardized scores. Then we could approach the question of a relationship by looking at the pattern of "ups" and "downs" in the data. If the z-scores for weight tend to follow the z-scores for height as we go from person to person, that indicates a strong positive relationship between the two variables. This is clearly shown in Fig. 3.

If, on the other hand, the z-scores for height tend to mirror those for weight (with positive scores for weight matching up with similar negative scores for height, and vice versa), this indicates a strong negative relationship, as in Fig. 4.

Finally, as shown in Fig. 5, if there is no discernible pattern of co-variation, this may indicate no relationship. Actually, it indicates no linear relationship. Higher order relationships may still exist, but these are beyond the scope of the present discussion.

Rather than looking at charts, a more precise way to estimate the relationship between two variables is with the correlation coefficient, or Pearson's coefficient of correlation. This statistic varies between one and minus one, with one indicating a perfect positive linear relationship, minus one indicating a perfect negative linear relationship, and zero indicating no linear relationship.

Assuming that you have already estimated z-scores for a given data set, you can estimate Pearson's coefficient of correlation for two variables by doing the following:

•  For each case, multiply the z-score for the first variable by the z-score of the second variable.

•  Sum up all the resulting values.

•  Divide by the total number of cases.
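The three steps above can be sketched in Python as follows. This is only an illustration: it uses the six days of readings from Table 3 rather than the full year of data, and the function names `z_scores` and `pearson_r` are assumptions made for this sketch. The z-scores use the uncorrected standard deviation (dividing by n), as elsewhere in this article.

```python
import math


def z_scores(values):
    """Convert each value to a z-score using the uncorrected (divide-by-n) standard deviation."""
    n = len(values)
    m = sum(values) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in values) / n)
    return [(x - m) / sd for x in values]


def pearson_r(xs, ys):
    """Mean of the case-by-case products of the two sets of z-scores."""
    zx, zy = z_scores(xs), z_scores(ys)
    products = [a * b for a, b in zip(zx, zy)]  # step 1: multiply the paired z-scores
    return sum(products) / len(products)        # steps 2 and 3: sum, then divide by the number of cases


# First six days of readings from Table 3 (not the full year).
toms = [308, 294, 316, 360, 431, 403]
ground = [308, 305, 316, 376, 450, 406]

r = pearson_r(toms, ground)
print(round(r, 2))  # close to 1 for these six days
```

With only six cases the result should not be read as an estimate for the whole year; it simply shows the mechanics of the calculation.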

Following through with our ozone example, we can compute the correlation between the satellite-based TOMS readings and those from the ground-based Dobson instrument. Starting with the first day for which we have readings, we multiply the z-score for the TOMS reading by the z-score for the Dobson instrument. We repeat this for each day on which we have observations for both variables (skipping those days where there are any missing values). This gives us a total of 351 values, which we sum up and divide by 351. The correlation we come up with is .96, indicating a very strong positive relationship between our two sets of observations, which is exactly what we would expect, given that the TOMS and Dobson instruments are measuring the same thing.

Note that a positive correlation does not mean that the two instruments are giving us the same absolute values. All it means is that their z-scores follow a similar pattern of ups and downs. If we plot some of the absolute values, as in Fig. 6, we see that although the readings from the TOMS and Dobson instruments follow similar patterns, they still differ.

Part 3 of this series will appear in the next issue of The Citizen Scientist.

Note.

1. Note that this estimate assumes that you are only interested in describing the data you have in hand, that is, that the objects you're studying constitute the entire population. If, on the other hand, you consider your objects to be a sample drawn from a larger population, you would use a "corrected" standard deviation, dividing the sum of squared deviations by n - 1 rather than n before taking the square root. The same correction applies when estimating the variance from a sample. The uncorrected statistics are used here, because (1) the subject of this section is descriptive statistics rather than inferential statistics and (2) the uncorrected formulas communicate more clearly what a given statistic means.
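For readers who want to see the difference, Python's standard `statistics` module provides both forms: `pstdev` divides by n (the uncorrected form used in this article) and `stdev` divides by n - 1 (the corrected, sample form). The sketch below applies both to the four Colorado Springs temperatures; the variable name `highs_f` is illustrative.

```python
import statistics

# Colorado Springs high temperatures, in degrees Fahrenheit.
highs_f = [52, 51, 42, 63]

# Uncorrected (population) standard deviation: divides by n.
print(round(statistics.pstdev(highs_f), 2))  # 7.45

# Corrected (sample) standard deviation: divides by n - 1.
print(round(statistics.stdev(highs_f), 2))   # 8.6
```

With only four observations the two figures differ noticeably; the gap shrinks as the number of observations grows.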

Acknowledgments

The Citizen Scientist and Dr. Hartwig are grateful to Richard D. McPeters and Arlin J. Krueger of NASA's Goddard Space Flight Center, members of the TOMS Ozone Processing Team, and the National Space Science Data Center for providing the measurements of total ozone made by the TOMS instrument aboard the Nimbus-7 satellite that appear in this article.

We are also happy to thank Robert Green of the National Oceanic and Atmospheric Administration and the Fresno-based Dobson instrument team for supplying additional measurements of ozone.


 
Figure 1. This histogram shows the total ozone above Fresno, California, measured by the TOMS instrument aboard the Nimbus 7 satellite.

Figure 2. A representation of Fig. 1 that shows the mean of the observed total ozone values over a simulated fulcrum indicated by the red triangle.

Figure 3. This graph reveals a positive correlation between the z-scores for height and weight of several subjects.

Figure 4. This graph shows a negative correlation between the z-scores for height and weight of several subjects.

Figure 5. The graph shown here reveals no obvious correlation between the two sets of z-scores.

Figure 6. This graph reveals a strong correlation between measurements of ozone made by a satellite instrument (TOMS) over Fresno, California, and from a Dobson instrument on the ground.
 
 
   
Copyright 2005 by Society for Amateur Scientists