Have you ever stood around a cocktail party discussing statistics? I didn't think so. But we hear about them all the time, especially when we're absorbing clinical trial results -- or election year polls. Understanding the basics of statistics is not that hard to do, despite how it seemed in school!
There are a handful of concepts that serve as the building blocks for learning statistics. I'm jumping ahead of mean, median, mode; those will not be covered here (see sidebar). We will look first at standard deviation, standard error, and confidence interval, and how they all tie in to p-value.
The standard deviation is simply a measure of the amount of variability in your particular sample. Because we can't practically measure everyone in a population that we are interested in studying, we have to take a sample of the population. The standard deviation describes the variation in measurements from individual to individual data point within the sample.
Below is a picture of what the standard deviation (SD) tells us. The mean, μ, which is the average, sits at the highest point in the center of the bell-shaped curve. For bell-shaped data, the area within plus or minus (±) 1 standard deviation (1σ) of the mean captures about 68% of all measurements in your sample; the area within ± 2 standard deviations (2σ) captures about 95% of all measurements in your sample. In other words, not very many data points -- approximately 5% -- will lie more than 2 standard deviations from the mean.
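To make the 68% and 95% figures concrete, here is a minimal sketch using Python's standard `statistics` module. The readings are made up for illustration, and the percentages describe an idealized bell-shaped (normal) curve, so a real sample will only approximate them.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical blood pressure readings (mmHg); not data from any real study.
readings = [118, 122, 125, 130, 112, 127, 135, 121, 119, 128]

sample_mean = mean(readings)
sample_sd = stdev(readings)  # sample standard deviation (divides by n - 1)

def share_within(k):
    """Fraction of an ideal normal curve lying within k SDs of the mean."""
    curve = NormalDist()  # standard normal: mean 0, SD 1
    return curve.cdf(k) - curve.cdf(-k)

print(f"mean = {sample_mean:.1f}, SD = {sample_sd:.1f}")
print(f"within 1 SD: {share_within(1):.1%}")
print(f"within 2 SD: {share_within(2):.1%}")
```

The two percentages come out slightly above 68% and 95%, which is why those figures are best read as handy approximations.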
The standard error (SE) is important in describing how well the sample mean represents the true population mean. Remember, because we can't practically measure everyone in the population, we take a random sample. Every random sample will give a slightly different estimate of the whole population. The standard error gives you a measure of how precise your sample mean is compared to the true population mean. It is calculated as the standard deviation divided by the square root of the sample size, so it depends on the size of your sample. As the sample size gets larger, the variability of the sample mean gets smaller and we get a more precise measurement of the truth. If we measure every member of the population, then it is no longer a sample. Only one value can be computed by measuring every member of the population, so there is no variability and the truth is known.
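As a small sketch of that calculation, the function below uses the usual formula: the sample standard deviation divided by the square root of the sample size. The measurements are hypothetical, and the "large" sample is simply the small one repeated four times so that the spread stays roughly the same while the size grows.

```python
from math import sqrt
from statistics import stdev

def standard_error(values):
    """Standard error of the mean: sample SD divided by sqrt(sample size)."""
    return stdev(values) / sqrt(len(values))

# Hypothetical measurements, made up for illustration.
small_sample = [118, 122, 125, 130, 112, 127, 135, 121]
# Roughly the same spread, four times as many measurements.
large_sample = small_sample * 4

print(f"n = {len(small_sample)}: SE = {standard_error(small_sample):.2f}")
print(f"n = {len(large_sample)}: SE = {standard_error(large_sample):.2f}")
```

Quadrupling the sample size cuts the standard error roughly in half, because the sample size sits under a square root.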
Since every random sample will give a slightly different estimate of the whole population, it makes sense to try to describe what the true population looks like with more than a single number. The confidence interval (CI) estimates a range of values within which we are pretty sure the truth lies. The confidence interval depends on the standard error. A 95% CI is calculated as the sample mean plus or minus approximately 2 times the standard error. It gives us the range of values within which we are 95% confident that the true population mean falls.
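Here is a rough sketch of that recipe in Python. It assumes a normal approximation, where the multiplier for 95% is about 1.96 (the "approximately 2" above); for small samples, analysts often use a slightly larger multiplier from the t-distribution instead. The readings are again hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def confidence_interval(values, level=0.95):
    """Approximate CI for the mean: sample mean +/- z * standard error."""
    se = stdev(values) / sqrt(len(values))
    # z is about 1.96 for a 95% interval (normal approximation).
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    centre = mean(values)
    return centre - z * se, centre + z * se

# Hypothetical measurements, made up for illustration.
readings = [118, 122, 125, 130, 112, 127, 135, 121, 119, 128]
low, high = confidence_interval(readings)
print(f"95% CI: {low:.1f} to {high:.1f}")
```

Asking for more confidence, say `level=0.99`, widens the interval: the price of being more sure is being less specific.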
Values outside of the confidence interval are unlikely to be the true population mean. If we have a 95% CI, then it means that there is only about a 5% chance that an interval built this way misses the true mean of the population. In other words, we're pretty confident of the value of our confidence interval! The area outside the confidence interval corresponds to a parameter called alpha, α. And usually we split α so that half is in the upper tail (2.5%) and half is in the lower tail (2.5%). This is what we mean by a two-tailed confidence interval.
The CI tells us a lot about our sample. We can draw conclusions about statistical significance based on the location of the CI. For example, suppose we have a CI that estimates the difference between two groups: a value of zero corresponds to no difference (that is, the null value). Therefore a CI that excludes zero denotes a statistically significant finding. Beware, though, that the null value is not always zero! It depends upon your null hypothesis, which will be described below.
Before moving on to new concepts, let's put the three summary statistics of standard deviation, standard error, and confidence interval together by considering an example from blood pressure measurements on 50 individuals. The scatter plot in the picture below shows each of the 50 individual measurements from this hypothetical sample. Our sample mean is represented by the large dot in the center of the 4 vertical lines beside the scatter plot. In the first green line, we can see that about 2/3 of our sample results are contained within +/- 1 standard deviation. In the second green line, 95% of our data points are covered by +/- 2 standard deviations. Once we know the standard error, we can construct the confidence interval. The 95% CI, depicted by the second blue line, gives us the range within which we are 95% confident that the true population mean lies.
The first step of the experiment is to state your hypothesis. The null hypothesis, H0, is pre-defined and represents a statement which is the reverse of what we hope the experiment will show. It is named the null hypothesis because it is the statement that we want the data to reject. The alternative hypothesis, Ha, is also predefined and represents a statement of what we hope the experiment will show. Ha is the hypothesis that there is a real effect.
Suppose we design a study to test if Optimized Background Treatment (OBT) + New Drug is better than OBT alone. Our null hypothesis is that there is no difference between the groups. Our alternative hypothesis is that the two groups are not the same. (We hope that OBT + New Drug is better!) Depending on the data gathered from the study, we will either reject the null hypothesis or not.
The next step is to design your experiment and select the test statistic. The test statistic defines the method that will be used to compare the two groups and help interpret the outcome at the end of the study.
Sometimes the comparison may be based on the differences in means, and use continuous data analysis methods. Sometimes the comparison may be based on proportions, and use categorical data analysis methods. There are many possibilities.
For our example from Step 1, the test statistic will be based on the comparison of the proportion of patients with HIV viral load (VL) less than (<) 50 copies/mL in each treatment group of our study sample.
Many other important decisions go into designing the experiment besides selecting the test statistic. We also calculate the sample size, agree on the power of the study, and establish parameters like α and β (described below).
After we generate a random study sample, we conduct the study and collect the data. The third step is to investigate the hypotheses stated in Step 1 and compare the groups. In our hypothetical example, we find that 75% of subjects in the OBT + New Drug group achieve VL <50 copies/mL compared to 35% of patients in the control group. This produces a p-value <0.0001.
The p-value (the "p" stands for "probability") helps us decide whether the data from the random study sample supports the null hypothesis or the alternative hypothesis. The p-value is the probability of seeing results at least as extreme as these if there were truly no difference between the groups -- that is, how likely such results would be to arise purely by chance. The closer the p-value is to 0, the stronger the evidence that the observed difference in viral load is real and not due to chance, and thus the more reason we have to reject the null hypothesis in favor of the alternative hypothesis. We look for a p-value of 0.05 or smaller. This represents a 5-in-100 probability -- a very small chance indeed!
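To see how a p-value like this might arise, here is a sketch of one common approach, a pooled two-proportion z-test. Two things are assumptions on my part: the group size of 100 subjects per arm (only the percentages are given) and the choice of test (other methods, such as a chi-square or Fisher's exact test, are also widely used).

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Probability of a difference at least this large, in either direction,
    # if the two groups truly had the same response rate.
    return 2 * (1 - NormalDist().cdf(abs(z)))

# ASSUMPTION: 100 subjects per arm; only the percentages are given.
p_value = two_proportion_p_value(75, 100, 35, 100)
print(f"p = {p_value:.2g}")
print("below 0.0001:", p_value < 0.0001)
```

With the same percentages but far fewer subjects, the p-value would be larger, which is one reason sample size matters so much.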
The last step is to compare the p-value with α and interpret the finding. Alpha is called the significance level. As described above in the section on CI, it is the area outside of the confidence area. It is most commonly defined as α=0.05. If the p-value is less than or equal to α, then the null hypothesis is rejected and we declare a statistically significant finding has been observed. If the p-value is greater than α, then the null hypothesis is not rejected.
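The decision rule is simple enough to write in a few lines. This sketch uses a hypothetical helper named `interpret`; note that a p-value exactly equal to α still counts as significant under the "less than or equal to" rule.

```python
def interpret(p_value, alpha=0.05):
    """Apply the decision rule: reject the null hypothesis when p <= alpha."""
    if p_value <= alpha:
        return "reject the null hypothesis: statistically significant"
    return "do not reject the null hypothesis: not statistically significant"

for p in (0.0001, 0.05, 0.20):
    print(p, "->", interpret(p))
```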
Remember, our hypothetical example produced a p-value <0.0001. This is well below α=0.05, so we reject the null hypothesis and conclude that OBT + New Drug and OBT alone are different. We can even take it one step further and conclude that OBT + New Drug is better than OBT alone.
The results of a study are often described by both the p-value and the 95% CI. The p-value is a single number that guides whether or not to reject the null hypothesis. The 95% CI provides a range of plausible values for describing the underlying population.
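As an illustration of reporting both, the sketch below builds an approximate 95% CI for the difference between the two response rates and checks whether it excludes the null value of zero. The group size of 100 per arm is an assumption, and the simple "Wald" interval used here is only one of several methods.

```python
from math import sqrt
from statistics import NormalDist

def difference_ci(successes_a, n_a, successes_b, n_b, level=0.95):
    """Approximate (Wald) CI for the difference between two proportions."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    diff = p_a - p_b
    return diff - z * se, diff + z * se

# ASSUMPTION: 100 subjects per arm; only the percentages are given.
low, high = difference_ci(75, 100, 35, 100)
null_value = 0.0  # "no difference" between the groups
print(f"difference: 95% CI {low:.1%} to {high:.1%}")
print("excludes the null value:", not (low <= null_value <= high))
```

Under these assumptions the whole interval sits above zero, which agrees with the small p-value: both point to a statistically significant difference.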
| Decision | True state Ho: no fire | True state Ha: fire |
|---|---|---|
| Accept Ho: no alarm | No error | Type II error |
| Reject Ho: alarm | Type I error | No error |
Three more terms that we often hear or read about are called Type I error, Type II error, and power. They are inter-related and are important in the design stage and in the interpretation stage as well.
One way to think about them is to consider the relationship between a smoke detector and a house fire. (Reference: Larry Gonick & Woollcott Smith; The Cartoon Guide to Statistics; 1993; pp151-152). The purpose of the smoke detector, of course, is to warn us in case of a fire. However, it is possible to have a fire without an alarm, as well as an alarm without a fire. Those are situations or errors that we do not ideally want, but they are possible events nevertheless. So the "true state" can be either no fire (Ho) or house fire (Ha).
Ideally, we want the alarm to alert us if there is a fire and we want the alarm to remain silent when there is no fire.
If we have an alarm without a fire, then a Type I error has been committed. This corresponds to α, which is the probability of claiming a difference (rejecting Ho) when none truly exists. Alpha is normally pre-set to 0.05. In other words, we accept a 5% chance of a "false alarm."
If we have a fire but it does not cause an alarm, then a Type II error has been committed. This corresponds to beta, β, which is the probability of missing a difference (not rejecting Ho) when one truly exists. Beta is normally pre-set to 10% or 20%. In other words, we accept a 10% or 20% chance of a "failed alarm."
Power is defined as 1-β, which is the probability of an alarm when there is a real fire. It is normally pre-set to 80% or 90%. It is the probability of detecting a true difference, or a "true alarm." In other words, with power=80%, we accept that eight trials out of 10 will correctly declare a true difference and that two trials out of 10 will incorrectly miss a true difference. β is a risk that we want to minimize as much as possible, but doing so comes with a price: a larger study, plus more time to recruit subjects, measure, and report.
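One way to get a feel for false alarms and power is to simulate many imaginary studies and count how often the alarm goes off. The sketch below is a deliberately simplified setup of my own, not a real trial design: each study draws 30 values from a bell curve with a known SD of 1 and runs a two-sided z-test of "the true mean is zero." The effect size of 0.5 and the random seed are arbitrary choices.

```python
import random
from math import sqrt
from statistics import NormalDist

def rejection_rate(true_effect, n=30, alpha=0.05, trials=2000, seed=1):
    """Share of simulated studies whose z-test rejects "true mean is 0".

    Each study draws n values from a normal curve with SD 1 centred on
    true_effect, so the SD is treated as known (a simplification).
    """
    rng = random.Random(seed)
    cutoff = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_effect, 1.0) for _ in range(n)]
        z = (sum(sample) / n) * sqrt(n)  # sample mean divided by SE = 1/sqrt(n)
        if abs(z) > cutoff:
            rejections += 1
    return rejections / trials

false_alarm_rate = rejection_rate(true_effect=0.0)  # no fire: Type I errors
power = rejection_rate(true_effect=0.5)             # real fire: true alarms
print(f"false alarms (should be near alpha): {false_alarm_rate:.3f}")
print(f"power: {power:.3f}, missed fires (beta): {1 - power:.3f}")
```

When there is no real effect, roughly 5% of the simulated studies still sound the alarm. When there is a real effect, the share that sounds the alarm is the power, and whatever is left over is β.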
A p-value greater than α=0.05 could be non-significant because there is truly no difference between the groups. Or it could be non-significant because the study is not large enough to detect a true underlying difference. Determining the optimal sample size for a study requires a great deal of thought at the planning stage. A sample size that is unnecessarily large is a waste of resources. But a sample size that is too small has a higher likelihood of not representing the underlying population and consequently missing a "true alarm." A small study has a wider confidence interval because its standard error is larger, so its estimate is less precise. As we said above in Building Block #2, when the sample size gets larger, the variability gets smaller and we get a more precise measurement of the truth. The optimal sample size depends on all of the various assumptions that go into its calculation.
For instance, to plan a superiority study as in Step 2 above, we need to make decisions/assumptions on the following parameters: α (generally 0.05), whether the hypothesis is one-sided or two-sided (generally two-sided), power (generally 80-90%), the response rate in the test arm, and the response rate in the control arm. These assumptions are directly tied to the study being designed -- so different types of studies require different sets of information for the sample size calculation. Changing any one of the decisions/assumptions will change the sample size calculation.
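To show how these ingredients fit together, here is a sketch of one widely used textbook approximation for comparing two proportions with equal-sized arms and a two-sided test. The response rates fed in below are hypothetical planning assumptions, and real studies typically add refinements (for example, an allowance for dropouts) that this sketch leaves out.

```python
from math import ceil, sqrt
from statistics import NormalDist

def subjects_per_arm(p_test, p_control, alpha=0.05, power=0.90):
    """Approximate per-arm sample size for comparing two proportions
    (two-sided test, equal arms, normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_test + p_control) / 2
    null_term = z_alpha * sqrt(2 * p_bar * (1 - p_bar))
    alt_term = z_beta * sqrt(p_test * (1 - p_test) + p_control * (1 - p_control))
    return ceil((null_term + alt_term) ** 2 / (p_test - p_control) ** 2)

# Hypothetical planning assumptions, chosen only for illustration.
for rates in [(0.75, 0.35), (0.60, 0.50)]:
    for target_power in (0.80, 0.90):
        n = subjects_per_arm(*rates, power=target_power)
        print(f"test {rates[0]:.0%} vs control {rates[1]:.0%}, "
              f"power {target_power:.0%}: {n} per arm")
```

Two patterns stand out when you run it: asking for more power raises the required sample size, and a smaller expected difference between the arms raises it dramatically.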
Understanding the basics of statistics is helpful in evaluating the messages that arise out of research. Good research follows clearly articulated steps and serious planning. The goal of research is to answer a question. In order to do so, it comes down to establishing:
In conclusion, study designs are chosen depending on the questions that are being studied. Study endpoints are selected according to the hypothesis under investigation and the study population being enrolled. And study interpretations depend on the hypotheses being tested. Statistics can help weigh the evidence and draw conclusions from the data.
The Median, the Mean, and the Mode
Before you can begin to understand statistics, there are four terms you will need to fully understand. The first term, "average," is something we have been familiar with from a very early age, when we start analyzing our marks on report cards. We add together all of our test results and then divide by the number of marks there are. We often call it the average. However, statistically it's the mean!
The median is the "middle value" in your list. When the number of values in the list is odd, the median is the middle entry after sorting the list into increasing order. When the number of values is even, the median is the sum of the two middle numbers (after sorting the list into increasing order) divided by two. Thus, remember to line up your values; the middle number is the median! Be sure to remember the odd and even rule.
The mode in a list of numbers is the number that occurs most frequently. A trick to remembering this one is that mode starts with the same first two letters as most. Most frequently -- mode. You'll never forget that one!
It is important to note that there can be more than one mode. If no number occurs more than once in the set, then there is no mode for that set of numbers.
Occasionally in statistics you'll be asked for the "range" in a set of numbers. The range is simply the smallest number subtracted from the largest number in your set. Thus, if your set is 9, 3, 44, 15, and 6, the range would be 44-3=41. Your range is 41.
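If you would like to check these four terms for yourself, here is a minimal sketch in Python. The helper names `modes` and `value_range` are my own; `modes` returns an empty list when no number repeats, matching the "no mode" rule above.

```python
from collections import Counter
from statistics import mean, median

def modes(values):
    """Most frequent value(s); an empty list if nothing repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)

def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)

numbers = [9, 3, 44, 15, 6]  # the set used in the range example
print("mean:", mean(numbers))
print("median:", median(numbers))
print("modes:", modes(numbers))
print("range:", value_range(numbers))

# An even-length list and a list with two modes (made up for illustration).
print(median([1, 2, 4, 7]), modes([2, 2, 5, 5, 8]))
```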
A natural progression once these four terms are understood is the concept of probability. Probability is the chance of an event happening and is usually expressed as a fraction. But that's another topic!
Amy Cutrell resides in Chapel Hill, NC, and has worked at GlaxoSmithKline for twenty years. She received an MS in biostatistics from the UNC School of Public Health.