28 December 2001
Design Your Experiments
Part III: Noise characteristics and measurement
by Kevin Kilty
As I advised at the end of the previous installment of this series, even
though noise is a constant impediment to our search for truth, there
is no reason to despair over it. We can handle noisy experiments if
we learn to think statistically. In this installment I wish to begin
doing this by examining the following topics:
- Characteristics of noise in general.
- The statistical description of noise.
- The effect of noise on experimental design.
What to expect of noise
As a model to explore the
nature of noise I'll propose a simple one: Yo = Y + n, where
Yo is the value I obtain from my measurement or experiment,
Y is the true value (which I don't know, of course), and n is some additional
value that has corrupted my measurement. You can call it error or noise.
What should I expect of the noise values? This is straightforward to
answer. I expect that as long as all conditions surrounding my measurement
stay constant, there is some probability density, call it N, that
describes the noise. A probability density is a function of sorts which
assigns to each possible noise value (or range of noise values) an associated
probability. I do not need to know the exact density at this time, but
I do need to suppose a few of its characteristics.
- The mean value of N should be zero. If it is not, then there is a
constant difference between a best estimate based on experiment and a
true value. This means that there is a systematic bias in the
experiment. Experimental design ought to identify possible systematic
errors and eliminate them.
- Each n ought to be independent and random. If this is not true, then
I need some measure of how one is correlated to another.
- I should strive to make each individual value of n as small as
possible. This makes my experimental result very precise.
What replications of an experiment
provide is a series of observed values that cluster around a central
value. Statisticians call this central tendency. We assume,
from our experiment being well designed and executed, that the central
value is our best estimate of the true value of the thing we are looking
for. This is the value that we will report in our experiment. I am going
to assume throughout this installment that our experiment is calibrated
to provide a true value. I'll discuss calibration separately at a later
time.
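The model Yo = Y + n can be tried out numerically. What follows is only
a minimal sketch in Python, assuming Gaussian noise; the true value of
10, the noise width of 1, the bias of 0.5, and the function name
replicate are all hypothetical choices made for illustration. It shows
that replications cluster around Y when the noise has zero mean, and
around Y plus the bias when it does not.

```python
import random
import statistics


def replicate(true_value, bias, sigma, count, seed):
    """Return `count` simulated measurements Yo = Y + n.

    The noise n is drawn from a Gaussian with mean `bias` and standard
    deviation `sigma` (an assumption; the article leaves the density open).
    """
    rng = random.Random(seed)
    return [true_value + rng.gauss(bias, sigma) for _ in range(count)]


Y = 10.0  # hypothetical true value, known here only because we simulate

unbiased = replicate(Y, bias=0.0, sigma=1.0, count=10_000, seed=1)
biased = replicate(Y, bias=0.5, sigma=1.0, count=10_000, seed=1)

print(f"central value, zero-mean noise: {statistics.fmean(unbiased):.3f}")
print(f"central value, biased noise:    {statistics.fmean(biased):.3f}")
```

No amount of replication removes the bias in the second case, which is
why systematic error has to be handled by design and calibration.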
Because each measurement
contains an unknown n, my result has limited value unless I also report
a measure of n. One question to answer is how to report
it. The National Institute of Standards and Technology (NIST) suggests
two ways of doing this. The first is to report the square root of the
variance of the mean of replicated measurements as the standard uncertainty
of the measurement result. Most of you familiar with statistics will
recognize this as being the standard error of a sample
mean. Take {Yo_i}, i = 1...n, as the set of measured values from n
replicated experiments (n here counts measurements and is not the noise
value above), and <Y> as the mean measured value. Then
u = sqrt( Sum over i of (Yo_i - <Y>)^2 / (n(n-1)) )
is the standard uncertainty
of these measurements.
As an example, suppose I
measure the speed of light 5 times with some apparatus and get the following
values in km/s: 299,034, 300,006, 298,510, 299,435, and 299,987. Then
the best value to report for the speed of light and the uncertainty
in this estimate is 299394.4±286.3 km/s. With an uncertainty this
large, quibbling about whether or not to include the single
digit beyond the decimal point is pointless.
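The arithmetic of this example can be checked with a few lines of
Python. This is a sketch of the standard-error formula given earlier,
applied to the five values above; the function name
standard_uncertainty is my own label, not anything from NIST.

```python
import math

# The five speed-of-light measurements from the text, in km/s.
values_km_s = [299_034, 300_006, 298_510, 299_435, 299_987]


def standard_uncertainty(values):
    """Return (mean, u) with u = sqrt( sum (x - mean)^2 / (n (n - 1)) )."""
    n = len(values)
    mean = sum(values) / n
    sum_sq = sum((x - mean) ** 2 for x in values)
    return mean, math.sqrt(sum_sq / (n * (n - 1)))


mean, u = standard_uncertainty(values_km_s)
print(f"{mean:.1f} +/- {u:.1f} km/s")  # prints 299394.4 +/- 286.3 km/s
```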
Calculating uncertainty directly
from measurements this way is very useful, but it does obscure the idea
that noise, or random errors if you will, follow some probability density.
Standard error provides a complete description of noise density only
if the noise is Gaussian--i.e. follows a normal probability distribution.
This might not always be so, and at some point in a future installment
I'll speak at length about probability density and distributions.
Calculating uncertainty from
replicated measurements is fine when we have access to them, but, what
do we do otherwise? NIST suggests a second reporting method in which
a standard uncertainty is obtained from the square root of a variance
determined by other means. This is very vague, but also
very interesting, for it suggests using prior knowledge about measurements,
models of noise sources, equipment, and even experimental conditions
to calculate and report uncertainty.
Once again let me provide
an example. A photomultiplier counts photons. If I use one to scan across
a spectral peak, each count it makes has some associated uncertainty.
I could figure this uncertainty by making numerous photon counts at
one setting of my spectrometer and calculating the standard uncertainty.
However, counting photons is a Poisson process, the variance of which
equals the expected count, so at low count rates I could spend forever
obtaining data to calculate uncertainty directly. Yet, by knowing the
theoretical density of a particular noise process, such as Poisson in
this instance, I can estimate uncertainty from it: a count of N photons
carries a standard uncertainty of about sqrt(N). What matters about
an estimate of uncertainty is not so much how it was obtained as that
the method is defensible. Peter Baum suggested
an even better example to me recently.
Suppose that I am measuring
some quantity that varies little from one experiment to another, and
I am doing it with an instrument that has not very much resolution.
All of my measurements are practically the same, and when I get around
to calculating a standard uncertainty from the replications I get a
very tiny value. Have I made an impressively precise measurement? Of
course not. This is exactly what I mean by a standard uncertainty that
is not defensible. What I must do in such a situation is estimate
measurement uncertainty by other means, such as assuming a uniform noise
(error) density whose width equals my instrument's least significant
digit (call it e), and report the standard uncertainty as e/sqrt(12*n).
The square root of 12 comes from the standard deviation of a uniform
density of width e, which is e/sqrt(12), and the factor of n,
which is the number of independent measurements I made, makes this a
standard error of the mean.
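Both of these "other means" estimates reduce to one-line formulas. The
sketch below assumes the noise models named in the text (Poisson for
photon counts, uniform for instrument resolution); the function names
and the sample numbers (a count of 400, a resolution of 0.1 units, 10
readings) are hypothetical, chosen only to show the arithmetic.

```python
import math


def poisson_count_uncertainty(count):
    """Standard uncertainty of one Poisson count.

    Uses the observed count as the estimate of the variance, which is
    an approximation that worsens for very small counts.
    """
    return math.sqrt(count)


def resolution_limited_uncertainty(e, n):
    """Standard error of the mean of n readings, each carrying a uniform
    error of total width e (the instrument's least significant digit)."""
    return e / math.sqrt(12 * n)


u_count = poisson_count_uncertainty(400)             # 20 counts
u_resolution = resolution_limited_uncertainty(0.1, 10)

print(f"photon count 400 +/- {u_count:.0f}")
print(f"resolution-limited uncertainty: {u_resolution:.4f}")
```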
Now that I know how to figure
uncertainty, how should I interpret it, and what effect does it have
on experimental design?
The first part of the question
is easy to answer. The distribution of noise is unknown, but if it is
concentrated near a central value, then I can invoke the central
limit theorem and treat the density of the mean value as approximately
normal with a standard deviation of u as calculated above. Therefore, the true
value, Y, is within ±u of the mean value of this experiment with about
68% certainty. What this percentage certainty really means is difficult
to assess. Statisticians from the school of "Bayesians" don't believe
it means a thing. The idea is, though, that if a person were to make
many repetitions of the experiment I described in my example of 5 measurements
of the speed of light, then in 68% of these the true value (Y) would
occur within an interval of plus or minus u around each mean value,
or, <Y>±u. If more certainty is needed then the interval
<Y>±2u would be 95% certain and <Y>±3u would be
99.7% certain and so forth.
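The 68%, 95%, and 99.7% figures are properties of the normal density
and can be computed from the error function. This is a small sketch
that assumes the mean really is normally distributed with a known
standard deviation u; the function name normal_coverage is
hypothetical.

```python
import math


def normal_coverage(k):
    """Probability that a normal variable lies within k standard
    deviations of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"<Y> +/- {k}u covers {100 * normal_coverage(k):.1f}%")
```

With only a handful of replications u is itself an estimate, so the
intervals would need to be somewhat wider to reach the same certainty.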
The second part of the question
gets to the heart of design, but is complex to explain. Remember that
we do not actually know the true value Y. The whole point of the experiment
is to estimate it as accurately as possible. The initial design of the
experiment has two aims. First, we rid the experiment of systematic
bias and make sure our method is calibrated. This ensures that our measurements
tend to center on the true value. Second, we need to make the uncertainty
of the central value (typically the mean) small enough to accomplish the experimental
objective.
Often we organize an experiment
to distinguish between two alternatives. For example, we may want to
see if a treatment is 50% effective at preventing a disease. One alternative
commonly is the nominal or null hypothesis, so called
because it implies that a treatment is not effective. The other is the
alternative hypothesis that the treatment is, in this
example, 50% effective. Because of unavoidable experimental noise, it
may not be possible to distinguish between these hypotheses unless we
design carefully in advance.
This brings me to the idea
of the power of an experiment and what it means for design.
Nominal incidence of a disease implies a certain probability of incidence.
A 50% effective treatment means that incidence of the disease is cut
to one-half of the nominal probability in a group that gets treated.
Power of an experiment is a probability measure and has to be between
0 and 100% as a result. Power of 90% means that if the treatment is
truly 50% effective, our experimental design has a 90% chance of success
in detecting it. Simply put, we can detect an effective treatment only
if two things happen: 1) the incidence in the treated group is less
than nominal incidence; and 2) the uncertainty of experimental incidence
is so small that we cannot easily mistake it for nominal incidence.
In other words we have to reject the nominal hypothesis. The one thing
that affects uncertainty and the probability of rejecting the nominal
hypothesis when it is in fact false, and which is also under our control,
is the number of replications of the experiment or the size of the control
and test groups.
Another numerical example
is in order. Let's take the 1954 trials of the Salk vaccine as a great
historical example. The nominal incidence of polio in the 1950s was
about 0.0003, or 30 per hundred thousand. Jonas Salk wanted to estimate
the size required of test and control groups to detect 50% effectiveness
of his vaccine at 95% confidence with 95% power. Polio incidence is a binomial
process: a person either gets the disease or does not. Unfortunately the
binomial distribution is inconvenient to work with. However, when the
expected number of successes (here, cases of the disease)
is at least 15 or so, the binomial density is nearly a normal density
with mean value of np, where n is the number of trials (at-risk population
size in this case) and p is the probability of incidence per trial (0.0003).
Binomial variance is np(1-p), the square root of which I can use as
my uncertainty calculated by other means in the vernacular
of NIST.
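To see what the normal approximation gives in practice, here is a short
sketch that computes the expected number of cases and its uncertainty
for one group. The group size of 100,000 is a hypothetical value picked
to match "30 per hundred thousand"; the function name case_count_stats
is likewise my own.

```python
import math


def case_count_stats(group_size, incidence):
    """Mean np and standard deviation sqrt(np(1-p)) of the case count."""
    mean = group_size * incidence
    std = math.sqrt(group_size * incidence * (1 - incidence))
    return mean, std


group_size = 100_000  # hypothetical, for illustration only
p = 0.0003            # nominal incidence from the text

control_mean, control_std = case_count_stats(group_size, p)
treated_mean, treated_std = case_count_stats(group_size, 0.5 * p)

print(f"control: {control_mean:.0f} cases, uncertainty {control_std:.2f}")
print(f"treated: {treated_mean:.0f} cases, uncertainty {treated_std:.2f}")
```

At this group size the treated group expects only about 15 cases, which
sits right at the edge of where the normal approximation is
comfortable.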
Salk reasoned as follows.
The nominal incidence of polio is 0.0003 and there is some uncertainty
in a control group that might produce a slightly different result. A
treated group of 50% effectiveness would have an incidence of 0.00015
with some uncertainty of a slightly different result. As long as the
observed incidence of the treated group is farther than 1.96u from the
observed incidence in the control, he would reject the nominal hypothesis,
because the 95% confidence region of the normal density is 1.96u. The
required size of u is that which will separate the expected values at
nominal and one-half nominal incidence by the combined 95% confidence
regions around both the expected nominal and experimental result. This
way Salk is 95% certain that his treated group result would be outside
the acceptance region of the nominal group 95% of the time, presuming
his vaccine really is 50% effective. Therefore, an experiment of 95%
power requires that
expected incidence difference > 95% treated uncertainty + 95% nominal uncertainty
or
0.5np > 1.96 sqrt(0.5np(1-0.5p)) + 1.96 sqrt(np(1-p))
The only unknown is n, and solving
for it shows that 150,000 people in each group would just do the task.
I hope your amateur experiments never need so many.
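The inequality above can be solved for n in closed form, because both
sides scale simply with n: dividing through by sqrt(n) leaves sqrt(n)
alone on one side. The sketch below does this in Python under the same
normal approximation used in the text. The function name
required_group_size and its parameter names are hypothetical; the
numbers 0.0003, 50%, and 1.96 come from the text.

```python
import math


def required_group_size(p, effectiveness=0.5, z=1.96):
    """Smallest n satisfying
    (p - p_t) n > z sqrt(n p_t (1 - p_t)) + z sqrt(n p (1 - p)),
    where p_t = (1 - effectiveness) * p is the treated incidence.
    """
    p_treated = (1 - effectiveness) * p
    separation = p - p_treated  # expected incidence difference per person
    spread = z * (math.sqrt(p_treated * (1 - p_treated))
                  + math.sqrt(p * (1 - p)))
    # Dividing the inequality by sqrt(n) gives sqrt(n) > spread / separation.
    return math.ceil((spread / separation) ** 2)


n_required = required_group_size(p=0.0003)
print(f"required size of each group: {n_required:,}")
```

The result comes out a little over 149,000, which rounds to the 150,000
quoted above. Changing the effectiveness or z arguments shows how
quickly the required group size grows for weaker treatments or stricter
confidence.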