A Basic Guide to the Numbers Game Behind Research
Have you ever stood around a cocktail party discussing statistics? I didn't think so. But we hear about them all the time, especially when we're absorbing clinical trial results -- or election year polls. Understanding the basics of statistics is not that hard to do, despite how it seemed in school!
There are a handful of concepts that serve as the building blocks for learning statistics. I'm jumping ahead of mean, median, mode; those will not be covered here (see sidebar). We will look first at standard deviation, standard error, and confidence interval, and how they all tie in to p-value.
Building Block #1: Standard Deviation
The standard deviation (SD) tells us how widely the measurements in a sample are spread around the mean. For data that follow a bell-shaped (normal) curve, the mean, μ, which is the average, sits at the highest point in the center of the curve. The area within plus or minus (±) 1 standard deviation (1σ) of the mean captures about 68% of all measurements in your sample; the area within ± 2 standard deviations (2σ) captures about 95% of all measurements in your sample. In other words, not very many data points -- approximately 5% -- will lie more than 2 standard deviations from the mean.
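To see the 68% and 95% figures for yourself, here is a minimal Python sketch. It assumes made-up data drawn from a bell-shaped curve with a mean of 100 and a standard deviation of 15; those numbers, and the helper name share_within, are illustrative and not from any study.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical data: 100,000 draws from a bell-shaped (normal) curve
# with mean 100 and standard deviation 15. Illustrative numbers only.
values = [random.gauss(100, 15) for _ in range(100_000)]

mean = statistics.fmean(values)
sd = statistics.stdev(values)  # sample standard deviation


def share_within(k):
    """Fraction of values lying within k standard deviations of the mean."""
    return sum(abs(v - mean) <= k * sd for v in values) / len(values)


within_1sd = share_within(1)
within_2sd = share_within(2)
print(f"within 1 SD: {within_1sd:.1%}")
print(f"within 2 SD: {within_2sd:.1%}")
```

The two printed shares should land close to 68% and 95%. Data that are not bell-shaped can give quite different shares.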
Building Block #2: Standard Error
The standard error (SE) is important in describing how well the sample mean represents the true population mean. Remember, because we can't practically measure everyone in the population, we take a random sample. Every random sample will give a slightly different estimate of the whole population. The standard error gives you a measure of how precise your sample mean is as an estimate of the true population mean. It is calculated as the standard deviation divided by the square root of the sample size, so it depends on the size of your sample. As the sample size gets larger, the variability of the sample mean gets smaller and we get a more precise measurement of the truth. If we measure every member of the population, then it is no longer a sample. There is only one value that can be computed by measuring every member of the population, thus there is no variability and the truth is known.
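As a rough illustration, the standard error of a mean can be computed as the sample standard deviation divided by the square root of the sample size. The sketch below uses made-up measurements; the function name standard_error is our own choice for this example.

```python
import math
import statistics


def standard_error(sample):
    """Standard error of the mean: sample SD divided by sqrt(sample size)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


# Hypothetical measurements, for illustration only.
small_sample = [4, 6, 5, 7, 3, 5, 6, 4]
# A larger sample with a similar spread (the same values repeated 8 times).
large_sample = small_sample * 8

se_small = standard_error(small_sample)
se_large = standard_error(large_sample)
print(f"n={len(small_sample)}: SE = {se_small:.3f}")
print(f"n={len(large_sample)}: SE = {se_large:.3f}")
```

The spread of the individual values is about the same in both samples, but the larger sample gives a noticeably smaller standard error.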
Building Block #3: Confidence Interval
Since every random sample will give a slightly different estimate of the whole population, it makes sense to try to describe what the true population looks like with more than a single number. The confidence interval (CI) estimates a range of values within which we are pretty sure the truth lies. The confidence interval depends on the standard error. It is calculated as the sample mean plus or minus approximately 2 times the standard error. For example, if we repeated the study many times and calculated a 95% CI each time, about 95% of those intervals would contain the true population mean.
Values outside of the confidence interval are unlikely candidates for the true population mean. If we have a 95% CI, then we accept a 5% chance that the interval misses the true mean of the population. In other words, we're pretty confident of the value of our confidence interval! This area outside the confidence area corresponds to a parameter called alpha, α. And usually we split α so that half is in the upper tail (2.5%) and half is in the lower tail (2.5%). This is what we mean by a two-tailed confidence interval.
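Here is a small sketch of that calculation, using the sample mean plus or minus 1.96 standard errors (the "approximately 2" mentioned above). The measurements and the function name confidence_interval_95 are hypothetical. For a sample this small, statisticians would often use a slightly larger multiplier, so treat the result as a rough guide.

```python
import math
import statistics


def confidence_interval_95(sample):
    """Approximate 95% CI for the mean: sample mean +/- 1.96 * standard error."""
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    margin = 1.96 * se
    return mean - margin, mean + margin


# Hypothetical measurements, for illustration only.
sample = [4, 6, 5, 7, 3, 5, 6, 4]
low, high = confidence_interval_95(sample)
print(f"mean = {statistics.fmean(sample):.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```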
The CI tells us a lot about our sample. We can draw conclusions about statistical significance based on the location of the CI. For example, suppose we have a CI that estimates the difference between two groups: a value of zero corresponds to no difference (that is, the null value). Therefore a CI that excludes zero denotes a statistically significant finding. Beware, though, that the null value is not always zero! It depends upon your null hypothesis, which will be described below.
Combining Building Blocks 1, 2, & 3: Describing a Sample
Four Steps in Conducting an Experiment or Study
The first step of the experiment is to state your hypothesis. The null hypothesis, H0, is pre-defined and represents a statement which is the reverse of what we hope the experiment will show. It is named the null hypothesis because it is the statement that we want the data to reject. The alternative hypothesis, Ha, is also predefined and represents a statement of what we hope the experiment will show. Ha is the hypothesis that there is a real effect.
Suppose we design a study to test if Optimized Background Treatment (OBT) + New Drug is better than OBT alone. Our null hypothesis is that there is no difference between the groups. Our alternative hypothesis is that the two groups are not the same. (We hope that OBT + New Drug is better!) Depending on the data gathered from the study, we will either reject the null hypothesis or not.
The next step is to design your experiment and select the test statistic. The test statistic defines the method that will be used to compare the two groups and help interpret the outcome at the end of the study.
Sometimes the comparison may be based on the differences in means, and use continuous data analysis methods. Sometimes the comparison may be based on proportions, and use categorical data analysis methods. There are many possibilities.
For our example from Step 1, the test statistic will be based on the comparison of the proportion of patients with HIV viral load (VL) less than (<) 50 copies/mL in each treatment group of our study sample.
Many other important decisions go into designing the experiment besides selecting the test statistic. We also calculate the sample size, agree on the power of the study, and establish parameters like α and β (described below).
After we generate a random study sample, we conduct the study and collect the data. The third step is to investigate the hypotheses stated in Step 1 and compare the groups. In our hypothetical example, we find that 75% of subjects in the OBT + New Drug group achieve VL <50 copies/mL compared to 35% of patients in the control group. This produces a p-value <0.0001.
The p-value (the "p" stands for "probability") helps us decide whether the data from the random study sample support the null hypothesis or the alternative hypothesis. The p-value is the probability that results at least as extreme as these would occur if there were truly no difference between the groups -- that is, how likely such results would be to arise purely by chance. The closer the p-value is to 0, the stronger the evidence that the observed difference in viral load is real and not due to chance, thus the more reason we have to reject the null hypothesis in favor of the alternative hypothesis. We look for a p-value of 0.05 or smaller. This represents at most a 5-in-100 probability -- a very small chance indeed!
The last step is to compare the p-value with α and interpret the finding. Alpha is called the significance level. As described above in the section on CI, it is the area outside of the confidence area. It is most commonly defined as α=0.05. If the p-value is less than or equal to α, then the null hypothesis is rejected and we declare a statistically significant finding has been observed. If the p-value is greater than α, then the null hypothesis is not rejected.
Remember, our hypothetical example produced a p-value <0.0001. This is well below α=0.05, so we reject the null hypothesis and conclude that OBT + New Drug and OBT alone are different. Because the response rate was higher in the OBT + New Drug group (75% versus 35%), we can even take it one step further and conclude that OBT + New Drug is better than OBT alone.
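To make the last two steps concrete, here is a sketch that turns the 75% and 35% response rates into a p-value and compares it with α. Two things are assumptions on our part: the article does not give group sizes, so we assume 100 patients per arm, and it does not name the test, so we use a common choice, the two-proportion z-test.

```python
import math
from statistics import NormalDist


def two_proportion_p_value(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Under the null hypothesis both groups share one response rate.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided: count extreme results in both tails.
    return 2 * (1 - NormalDist().cdf(abs(z)))


ALPHA = 0.05
# Assumed group sizes: 100 patients per arm (not stated in the article).
p_value = two_proportion_p_value(75, 100, 35, 100)
reject_null = p_value <= ALPHA
print(f"p-value = {p_value:.2e}, reject H0: {reject_null}")
```

With these assumed group sizes the p-value falls far below 0.0001, in line with the example. Smaller groups would give a larger p-value for the same percentages.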
The results of a study are often described by both the p-value and the 95% CI. The p-value is a single number that guides whether or not to reject the null hypothesis. The 95% CI provides a range of plausible values for describing the underlying population.
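As a companion to the p-value, the sketch below computes an approximate 95% CI for the difference between the two response rates and checks whether it excludes zero, the null value in this example. As before, the group size of 100 patients per arm is our assumption, and the function name difference_ci_95 is made up for illustration.

```python
import math


def difference_ci_95(successes_a, n_a, successes_b, n_b):
    """Approximate 95% CI for the difference in proportions (a minus b)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Standard error of the difference between two independent proportions.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_a - p_b
    return diff - 1.96 * se, diff + 1.96 * se


# Assumed group sizes: 100 patients per arm (not stated in the article).
low, high = difference_ci_95(75, 100, 35, 100)
excludes_zero = low > 0 or high < 0
print(f"95% CI for the difference: ({low:.1%}, {high:.1%})")
print("CI excludes zero:", excludes_zero)
```

An interval that sits entirely above zero tells the same story as the small p-value, and it also shows how large the benefit might plausibly be.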
Concepts in Designing a Study: Type I Error, Type II Error, and Power
One way to think about Type I error, Type II error, and power is to consider the relationship between a smoke detector and a house fire. (Reference: Larry Gonick & Woollcott Smith; The Cartoon Guide to Statistics; 1993; pp151-152). The purpose of the smoke detector, of course, is to warn us in case of a fire. However, it is possible to have a fire without an alarm, as well as an alarm without a fire. Those are errors that we do not want, but they are possible events nevertheless. So the "true state" can be either no fire (H0) or house fire (Ha).
Ideally, we want the alarm to alert us if there is a fire and we want the alarm to remain silent when there is no fire.
If we have an alarm without a fire, then a Type I error has been committed. This corresponds to α, which is the probability of claiming a difference (rejecting H0) when none truly exists. Alpha is normally pre-set to 0.05. In other words, we accept a 5% chance of a "false alarm."
If we have a fire but it does not cause an alarm, then a Type II error has been committed. This corresponds to beta, β, which is the probability of missing a difference (not rejecting H0) when one truly exists. Beta is normally pre-set to 10% or 20%. In other words, we accept a 10% or 20% chance of a "failed alarm."
Power is defined as 1-β, which is the probability of an alarm when there is a real fire. It is normally pre-set to 80% or 90%. It is the probability of detecting a true difference, or a "true alarm." In other words, with power=80%, we accept that eight trials out of 10 will correctly declare a true difference and that two trials out of 10 will incorrectly miss a true difference. β is a risk that we want to minimize as much as possible, but doing so comes with a price: a larger study, plus more time to recruit subjects, measure, and report.
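One way to get a feel for false alarms and true alarms is to simulate many imaginary trials. The sketch below is only an illustration under stated assumptions: 100 patients per arm, a two-proportion z-test, a control response rate of 35%, and a hypothetical true response rate of 55% in the test arm. None of these settings come from a real study.

```python
import math
import random
from statistics import NormalDist

random.seed(1)  # fixed seed so the sketch is repeatable
ALPHA = 0.05


def trial_rejects_null(rate_test, rate_control, n_per_arm):
    """Simulate one trial; report whether a two-proportion z-test rejects H0."""
    x_test = sum(random.random() < rate_test for _ in range(n_per_arm))
    x_control = sum(random.random() < rate_control for _ in range(n_per_arm))
    pooled = (x_test + x_control) / (2 * n_per_arm)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n_per_arm)
    if se == 0:
        return False  # no variation at all, so nothing to test
    z = (x_test - x_control) / n_per_arm / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value <= ALPHA


def rejection_rate(rate_test, rate_control, n_per_arm=100, trials=2000):
    """Share of simulated trials that declare a significant difference."""
    hits = sum(trial_rejects_null(rate_test, rate_control, n_per_arm)
               for _ in range(trials))
    return hits / trials


# No true difference: every rejection is a false alarm (Type I error).
false_alarm_rate = rejection_rate(0.35, 0.35)
# True difference (hypothetical 55% vs 35%): rejections are true alarms.
power = rejection_rate(0.55, 0.35)
print(f"false alarm rate ~ {false_alarm_rate:.3f}")
print(f"power ~ {power:.3f}, missed alarms (beta) ~ {1 - power:.3f}")
```

When there is no true difference, the alarm should sound in roughly 5% of the simulated trials. When the difference is real, the share of trials that sound the alarm is the power, and the remainder is β.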
A p-value greater than α=0.05 could be non-significant because there is truly no difference between the groups. Or it could be non-significant because the study is not large enough to detect a true underlying difference. Determining the optimal sample size for a study requires a great deal of thought at the planning stage. A sample size that is unnecessarily large is a waste of resources. But a sample size that is too small has a higher likelihood of not representing the underlying population and consequently missing a true difference (a "failed alarm"). A small study has a wider confidence interval because its standard error is larger, meaning its estimate is less precise. As we said above in Building Block #2, when the sample size gets larger, the variability gets smaller and we get a more precise measurement of the truth. The optimal sample size depends on all of the various assumptions that go into its calculation.
For instance, to plan a superiority study as in Step 2 above, we need to make decisions/assumptions on the following parameters: α (generally 0.05), whether the hypothesis is one-sided or two-sided (generally two-sided), power (generally 80-90%), the response rate in the test arm, and the response rate in the control arm. These assumptions are directly tied to the study being designed -- so different types of studies require different sets of information for the sample size calculation. Changing any one of the decisions/assumptions will change the sample size calculation.
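To show how these assumptions feed into the calculation, here is a sketch of one widely used textbook formula for comparing two proportions with equal group sizes (a normal approximation). The article does not say which formula was used, so this is our assumption, as are the example response rates of 55% and 35%.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(rate_test, rate_control, alpha=0.05, power=0.80):
    """Patients per arm for a two-sided comparison of two proportions
    (normal approximation, equal group sizes)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - beta
    p_bar = (rate_test + rate_control) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(rate_test * (1 - rate_test)
                             + rate_control * (1 - rate_control))
    ) ** 2
    return math.ceil(numerator / (rate_test - rate_control) ** 2)


# Hypothetical planning assumptions: 55% response on the test arm, 35% on control.
n_80 = sample_size_per_arm(0.55, 0.35, power=0.80)
n_90 = sample_size_per_arm(0.55, 0.35, power=0.90)
# A smaller expected difference (45% vs 35%) at the same 80% power.
n_small_diff = sample_size_per_arm(0.45, 0.35, power=0.80)
print(f"80% power: {n_80} per arm")
print(f"90% power: {n_90} per arm")
print(f"80% power, smaller difference: {n_small_diff} per arm")
```

Asking for more power, or expecting a smaller difference between the arms, pushes the required sample size up, which is the price described above.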
Basic Requirements of a Clinical Research Study
Understanding the basics of statistics is helpful in evaluating the messages that arise out of research. Good research follows clearly articulated steps and serious planning. The goal of research is to answer a question. In order to do so, it comes down to establishing:
In conclusion, study designs are chosen depending on the questions that are being studied. Study endpoints are selected according to the hypothesis under investigation and the study population being enrolled. And study interpretations depend on the hypotheses being tested. Statistics can help weigh the evidence and draw conclusions from the data.
Amy Cutrell resides in Chapel Hill, NC, and has worked at GlaxoSmithKline for twenty years. She received a MS in biostatistics from the UNC School of Public Health.
This article was provided by Test Positive Aware Network. It is a part of the publication Positively Aware.