16 November 2001
Analysis of Variance
by Kevin Kilty
Recently
I introduced the concept of Analysis of Variance to students in an Engineering
Metrology course, but this powerful statistical method is also a valuable
part of any scientist's data analysis toolkit. It is especially useful
in the biological sciences.
The idea behind analysis
of variance is quite simple. If a large body of data represents samples
obtained from a single population, then various ways of classifying
the data constitute merely different ways of measuring the same population
characteristic--namely the population variance. If, on the other hand,
the classification of the data actually divides the data into samples
from different populations, then that should be apparent through a comparison
of the variance estimates. It is easier to work through an example
than to explain the process in the abstract.
The table below summarizes
data from an experiment involving corms, which are the
propagating part of gladiola radicals. The gladiolas that grew from
these corms produced varying numbers of florets, and the purpose of
the experiment was to determine whether the corms found on the high part
of the gladiola bulb, which are first-year corms, produce fewer florets
than low corms, which are two years old and presumably more mature.
The experimenter planted
10 high corms and 10 low corms in each of 7 different plots of soil.
The idea behind planting a mixture of high and low corms in each plot
is to remove variation caused by different soil properties.
Statisticians call this "paired data." Each entry in the table
is the mean number of florets from the 10 plants of one corm type in one
plot. The rows list data from the different plots.
Experimental Data: Mean Florets per Gladiola

Plot        High     Low    Totals
1           11.2    14.6      25.8
2           13.3    12.6      25.9
3           12.8    15.0      27.8
4           13.7    15.6      29.3
5           12.2    12.7      24.9
6           11.9    12.0      23.9
7           12.1    13.1      25.2
Totals      87.2    95.6     182.8

Sum of squares of all 14 values               2407.90
Square of the grand total / 14                2386.85
Sum of squared corm (column) totals / 7       2391.89
Sum of squared plot (row) totals / 2          2397.02
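The four summary figures under the table can be reproduced from the raw means. The following is only a sketch of that arithmetic; the variable names are illustrative assumptions and do not come from the original analysis.

```python
# Floret means from the table: one (high, low) pair per plot.
high = [11.2, 13.3, 12.8, 13.7, 12.2, 11.9, 12.1]
low = [14.6, 12.6, 15.0, 15.6, 12.7, 12.0, 13.1]

values = high + low
n = len(values)                                    # 14 measurements
grand_total = sum(values)

plot_totals = [h + l for h, l in zip(high, low)]   # row totals
corm_totals = [sum(high), sum(low)]                # column totals

# Sum of the squares of all the data.
sum_of_squares = sum(x * x for x in values)
# Square of the grand total divided by the number of values.
correction = grand_total ** 2 / n
# Squared column totals, divided by the 7 values in each column.
corms_term = sum(t * t for t in corm_totals) / len(high)
# Squared row totals, divided by the 2 values in each row.
plots_term = sum(t * t for t in plot_totals) / len(corm_totals)

print(round(sum_of_squares, 2), round(correction, 2),
      round(corms_term, 2), round(plots_term, 2))
```

Subtracting the second figure from each of the others gives the sums of squares about the mean that appear in an analysis of variance table.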
In statistical inference
we set up a pair of hypotheses to test with the data. The first hypothesis
is that all 14 of these measurements are just samples from a single
population. This implies that the classification into high and low
corms is inconsequential. Under this hypothesis the variation from one
plot to another is all experimental variation within this single population,
and it doesn't matter whether the parent corm is high or low. This is my
null hypothesis.
The alternative hypothesis
is that the proposed classification is significant. Generally we
decide upon a level of significance for the test of these hypotheses
at this point. For example, we may decide that a 10% level of significance
is sufficient in this case because we are already convinced that low
corms are more mature and likely to be more efflorescent. If we
were less sure of our alternative, we might choose a stricter level of
significance--5% or 1%--to provide a more convincing demonstration.
I'll leave this issue of significance level open for the present time.
I have calculated the sum
of the squares of all the data. This is 2407.9. I have also computed
the square of the grand total divided by the number of values,
(182.8)^2/14 = 2386.85. The difference of these, 21.05, is the
sum of squares computed about the mean value (this is a computational
short-cut). If I divide this by the number of data values minus 1 (13),
then I have computed the total variance of my 14 measurements, 1.62. This is
one estimate of population variance. Symbolically, the population variance,
$\sigma^2$, is approximately
$$\sigma^2 \approx \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$
where n is the number of values (14 in this case), $\bar{x}$ is the mean
of the data, and the sum is over all the data.
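As a quick check on the computational short-cut, the sketch below computes the sum of squares about the mean both ways and divides by n-1. The variable names are assumptions made for illustration; `statistics.variance` from the Python standard library uses the same n-1 divisor and serves as a cross-check.

```python
import statistics

values = [11.2, 13.3, 12.8, 13.7, 12.2, 11.9, 12.1,   # high corms
          14.6, 12.6, 15.0, 15.6, 12.7, 12.0, 13.1]   # low corms
n = len(values)

# Short-cut: sum of squares minus (grand total)^2 / n.
ss_shortcut = sum(x * x for x in values) - sum(values) ** 2 / n

# Definition: squared deviations about the mean.
mean = sum(values) / n
ss_direct = sum((x - mean) ** 2 for x in values)

# Total variance: sum of squares about the mean over n - 1.
total_variance = ss_shortcut / (n - 1)

print(round(ss_shortcut, 2), round(ss_direct, 2), round(total_variance, 2))
print(round(statistics.variance(values), 2))   # library cross-check
```

The two routes agree apart from floating-point rounding. The short-cut is convenient for hand calculation, while the direct form is the less sensitive of the two to rounding when the values are large compared with their spread.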
But there are other estimates
of population variance available in this data. For example, the columns
are divided into classes of high and low corms. If all of the data come
from a single population (that is, our null hypothesis is true), then
this division is inconsequential and the mean squared difference between
the two column totals and their mean also provides an estimate of the same
population variance, $\sigma^2$. Because each column total is a sum of k
independent values, its variance is k times the population variance. In
this case
$$k\sigma^2 \approx \frac{\sum_{j=1}^{m} (T_j - \bar{T})^2}{m-1}$$
where $T_j$ is the total of column j, $\bar{T}$ is the mean of the column
totals, m is the number of columns, k is the number of data in each column
(the number of plots), and the sum is over the columns.
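The column-based estimate can be worked through numerically. This is a minimal sketch using the column totals from the data table; the names `k_sigma2` and `sigma2_from_corms` are made up for illustration.

```python
corm_totals = [87.2, 95.6]   # column totals: high, low
k = 7                        # data values in each column (the plots)
m = len(corm_totals)         # number of columns

mean_total = sum(corm_totals) / m

# Scatter of the column totals about their mean estimates k * sigma^2
# when the null hypothesis holds.
k_sigma2 = sum((t - mean_total) ** 2 for t in corm_totals) / (m - 1)

# Dividing by k gives the estimate of sigma^2 itself.
sigma2_from_corms = k_sigma2 / k

print(round(k_sigma2, 2), round(sigma2_from_corms, 2))
```

The second number is the mean square for corms. Comparing it with the variance left over after the classification is removed is what the analysis of variance table does.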
Following this paragraph
I show an analysis of variance in its usual form. The first row summarizes
the total variation including all data. There is first the sum of squares
about the mean of the data, then the number of degrees of freedom (n-1=13),
and the quotient of the two, the mean square or sample variance. Degrees
of freedom simply refers to the number of independent pieces of
information in the data. There is one fewer independent piece of information
than there are observations because we have made use of the mean value,
which prevents the 14th observation from being independent of the other 13.
The second row is the sum of
squares calculated from the column totals, then its degrees of freedom
(m-1=1), followed by the mean square. Finally, the third row begins
by taking the difference of the sum of squares of the two previous rows.
This is the residual sum of squares. Any fraction of the total sum of
squares which is left unexplained by our classification into high and
low corms, must reside in the residual sum of squares. Likewise the
degrees of freedom in the residual sum of squares is that in the total
data less that in our classification.
Analysis of Variance

Variation   Sum of Squares   Degrees of Freedom   Mean Square   F
Total            21.05               13               1.62
Corms             5.04                1               5.04      3.78 (0.10)
Residual         16.01               12               1.33
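The rows of this table can be rebuilt from the raw data by following the steps just described. The code below is a sketch under that reading of the text, with variable names chosen only for illustration.

```python
high = [11.2, 13.3, 12.8, 13.7, 12.2, 11.9, 12.1]
low = [14.6, 12.6, 15.0, 15.6, 12.7, 12.0, 13.1]
groups = [high, low]
values = high + low
n, m = len(values), len(groups)

# (grand total)^2 / n, subtracted to centre each sum of squares on the mean.
correction = sum(values) ** 2 / n

ss_total = sum(x * x for x in values) - correction
ss_corms = sum(sum(g) ** 2 / len(g) for g in groups) - correction
ss_resid = ss_total - ss_corms           # what the classification leaves over

df_total = n - 1
df_corms = m - 1
df_resid = df_total - df_corms

ms_corms = ss_corms / df_corms
ms_resid = ss_resid / df_resid
f_ratio = ms_corms / ms_resid

rows = [("Total", ss_total, df_total),
        ("Corms", ss_corms, df_corms),
        ("Residual", ss_resid, df_resid)]
for name, ss, df in rows:
    print(f"{name:<10}{ss:8.2f}{df:4d}{ss / df:8.2f}")
print(f"F = {f_ratio:.2f}")
```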
By our null hypothesis the
mean square of the classification should approximately equal the mean
square of the residual. They measure the same population variance in
this case, after all. The ratio between them should be close to 1. In reality,
however, the ratio is 3.78, which seems quite different from 1. Is this
difference significant?
An answer to this question
is available through the F statistic. The variance ratio of separate
samples drawn from a single population follows the
F distribution. Most spreadsheets provide an F distribution function, but
otherwise you can find it in many tables and tutorials on the internet.
The last column shows that a value of F this large would occur no more than
about 10% (0.10) of the time in ratios of variance of samples drawn from a
single population. This seems sufficiently unlikely that I conclude that the
data derived from the high corms differ from those from the low corms.
I have reason to reject my null hypothesis and accept its alternative.
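Where no spreadsheet is at hand, the tail probability of the F distribution can be approximated with the Python standard library alone. The function below is a hypothetical helper written for this illustration, not a library routine. It relies on the standard identity P(F >= f) = I_t(d2/2, d1/2) with t = d2/(d2 + d1 f), where I is the regularized incomplete beta function, and evaluates the integral by Simpson's rule. It assumes f > 0 and d2 >= 2.

```python
import math

def f_tail_probability(f, d1, d2, steps=2000):
    """Approximate P(F >= f) for (d1, d2) degrees of freedom.

    Assumes f > 0, d2 >= 2 and an even number of steps, so that the
    integrand below stays finite on the whole interval [0, t].
    """
    a, b = d2 / 2.0, d1 / 2.0
    t = d2 / (d2 + d1 * f)

    def integrand(u):
        return u ** (a - 1.0) * (1.0 - u) ** (b - 1.0)

    # Composite Simpson's rule on [0, t].
    h = t / steps
    total = integrand(0.0) + integrand(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(i * h)
    integral = total * h / 3.0

    # Normalise by the complete beta function B(a, b).
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return integral / math.exp(log_beta)

# Corms: F = 3.78 with 1 and 12 degrees of freedom.
p_corms = f_tail_probability(3.78, 1, 12)
print(round(p_corms, 3))
```

The result falls between 0.05 and 0.10, so the ratio passes a 10% test but not a 5% one. A spreadsheet F distribution function or a statistics library should give the same answer to within the integration error.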
I have produced an expanded
analysis of variance, below, to illustrate a two-way classification.
The various plots of soil used in the experiment provide a second classification
of the data. Often soils have varying fertility and introduce a source
of experimental variation or error. This was the motivation for making
pairs of data in the experiment. After adding this second classification
to my analysis of variance, I follow the same course of calculation
that I did before.
Analysis of Variance (2-way Classification)

Variation   Sum of Squares   Degrees of Freedom   Mean Square   F
Total            21.05               13               1.62
Plots            10.17                6               1.70      1.74 (0.26)
Corms             5.04                1               5.04      5.18 (0.06)
Residual          5.84                6               0.97
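The two-way table follows the same course of calculation, with the plot (row) totals supplying a second sum of squares that is also removed from the residual. The following is a sketch of that arithmetic; the variable names are illustrative assumptions.

```python
high = [11.2, 13.3, 12.8, 13.7, 12.2, 11.9, 12.1]
low = [14.6, 12.6, 15.0, 15.6, 12.7, 12.0, 13.1]
values = high + low
n = len(values)
n_plots, n_corms = len(high), 2

correction = sum(values) ** 2 / n

ss_total = sum(x * x for x in values) - correction
# Column totals, each a sum over the 7 plots.
ss_corms = (sum(high) ** 2 + sum(low) ** 2) / n_plots - correction
# Row totals, each a sum over the 2 corm types.
ss_plots = sum((h + l) ** 2 for h, l in zip(high, low)) / n_corms - correction
ss_resid = ss_total - ss_plots - ss_corms

df_plots = n_plots - 1
df_corms = n_corms - 1
df_resid = (n - 1) - df_plots - df_corms

ms_resid = ss_resid / df_resid
f_plots = (ss_plots / df_plots) / ms_resid
f_corms = (ss_corms / df_corms) / ms_resid

print(f"Plots    {ss_plots:6.2f} {df_plots:3d} {ss_plots / df_plots:6.2f} {f_plots:6.2f}")
print(f"Corms    {ss_corms:6.2f} {df_corms:3d} {ss_corms / df_corms:6.2f} {f_corms:6.2f}")
print(f"Residual {ss_resid:6.2f} {df_resid:3d} {ms_resid:6.2f}")
```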
Notice in the resulting
analysis that the mean square from the plots is nearly twice that in
the residual. Even though its F-ratio (1.74) is not especially significant
(0.26 or 26%), the plot classification accounts for nearly half the variation
in the raw data (sum of squares of 10.17 vs. 21.05). By including this
second classification of the data I managed to reduce the residual mean
square substantially (from 1.33 to 0.97), and this in turn has made the F-ratio
(5.18) for the corms even more significant (6%). By accounting for all
identifiable sources of variation, and making this analysis of variance,
I have shown there is significant reason to reject the null hypothesis,
and I have also found a consistent and defensible estimate of my experimental
error--0.97 florets squared per plant, or a standard deviation of about
1 floret per gladiola.
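The two closing figures are simple arithmetic on entries of the two-way table, sketched here with the rounded table values as inputs (the variable names are illustrative):

```python
import math

ss_plots, ss_total = 10.17, 21.05   # sums of squares from the two-way table
ms_resid = 0.97                     # residual mean square (florets squared)

# Share of the total variation accounted for by the plots classification.
share_plots = ss_plots / ss_total

# Standard deviation of the experimental error, in florets.
sd_error = math.sqrt(ms_resid)

print(round(share_plots, 2), round(sd_error, 2))
```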