How to Read a Scientific Paper
(Third of a three-part series)
In Part 2 of this series, we learned from flipping coins that:
The more experimental observations you make, the more precise your answer will be.
Even though more precision is nice, after a while there are diminishing returns, as results from the study continue to fall within a very small range (a range around the "true" value).
While enrolling everyone on earth in a trial would produce a highly precise result, we can be satisfied with the results of a far less ambitious study. For example, if we wanted to know if a drug could reduce viral load, we may only need to enroll a few dozen people. This is because laboratory endpoints are highly reproducible, easy to quantify and respond quickly to changes in virus levels in the blood.
But if we wanted to know if a treatment prevented AIDS, we'd be more interested in knowing how many symptoms occurred while participants were in the trial. Clinical endpoint trials such as this may require enrolling several hundred people or more. Thankfully, rates of HIV disease have declined and AIDS symptoms are far less common than they once were, but this means that a study must be very large if it is to yield enough events for us to say whether the treatment had an effect or not.
Let's revisit our imaginary HIV drug, X-100, for a minute. When we do a trial of a treatment in people, it's similar to flipping a series of coins, except that the coins have been modified (treated). If the balance of heads and tails differs from what we'd expect, we can attribute the difference to the modification. Now imagine each person in a trial becomes a "coin flip." Rather than heads or tails, let's measure something else that has only two possible outcomes. In HIV trials, viral load results are often expressed as being above or below the limit of detection. This is an example of a binary value (that is, one that can take one of two possible values). Let's say "heads" means undetectable virus and "tails" means any detectable value. We measure how many people have undetectable virus in our trial of X-100 treatment in order to say something about how many people out in the real world can expect a similar result.
It's important to note we're not yet talking about a controlled trial, where modified coins are compared to normal coins or X-100 is compared to other drugs. We are simply looking at what happened to these people who took X-100. This is sometimes called a case series or an uncontrolled trial. Obviously, whether X-100 is "good" or "bad" in the real world depends on what it is compared to. It may be better than no treatment, yet not as good as standard treatment. It may even be worse than no treatment. That's why it's so hard to interpret studies that aren't controlled: they may alert us to a severe toxicity or reveal a miracle cure, but mostly they only tell us what happened in that group of people who took X-100. They are not generalizable. Sometimes a case series is matched with a series of similar people on another treatment, but we can never be sure if the comparison is meaningful.
Our gullible alien decides to become an AIDS researcher and does his first X-100 trial in one person -- who soon develops an undetectable viral load! Visions of the Nobel Prize -- or at least making the cover of Newsweek -- dance in his head. But our skeptical alien wants to give X-100 to several more people before drawing any conclusions. The second volunteer becomes undetectable, but the third does not. Does this sound familiar? Heads, heads, tails? As the trial proceeds, it starts to mirror our coin flip series, finally yielding a high probability that X-100 works in about 50% of trial participants, and by extension, persons like them. This isn't so bad if these results are applicable to patients who are failing their current treatment. But I'd feel more confident if we could see how X-100 stacks up against treatments we are already familiar with.
A new alien joins the scene. Her name is Carol and she appears on earth with an incredible new ray gun she invented that makes coins be heads, no matter what.
"Cool!" we say, "let's try it!"
The first coin is flipped, Carol blasts it, and it comes up heads. "Wow," says the gullible alien, "It works!"
"Hmmm," the skeptic says. The second coin is flipped: heads. Third coin, heads. Fourth and fifth coins: heads. Our friend is looking less skeptical now, even though he knows that all this shows is that coins are very likely (5/5, based on the available data) to come up heads when blasted with Carol's ray gun.
"So what do you think?" Carol asks.
"Well," says the skeptical alien, "on the first coin, you had a fifty/fifty chance. With two flips, the odds that it was a fluke and that the ray gun did nothing went down to 25%. And after five flips, there was only about a 3% chance that those five heads occurred by luck alone. Hmm, five heads in a row might happen occasionally, but it's pretty unlikely.
"I believe it is highly probable that this ray gun actually works. If a 5% probability, or 0.05, is good enough for the New England Journal of Medicine, then it's good enough for me!"
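The skeptical alien's arithmetic is easy to check: each fair flip has a 1/2 chance of coming up heads, so the chance of n heads in a row is (1/2) multiplied by itself n times. A quick sketch (the function name is my own):

```python
# Probability of flipping n heads in a row with a fair coin.
def prob_all_heads(n: int) -> float:
    return 0.5 ** n

for n in range(1, 6):
    print(f"{n} heads in a row: {prob_all_heads(n):.1%}")
# Two flips: 25%. Five flips: about 3% -- the skeptic's numbers.
```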
Is There Some Significance to This? (P value)
Now I need to explain what's so special about 5%. When you are reading a research report or hearing a lecture you will often come across the term "P value." Simply put, the P value is the probability that the findings in the study could have occurred as a matter of chance, instead of reflecting a real change from the expected average. P value is also called significance, and generally the two terms are interchangeable -- unless you are talking to a very picky statistician. In our case, the odds of getting five heads in a row were about 3%, or 0.03. Carol claims that her ray gun was, in fact, the reason that they got five heads and no tails.
Notice that no one has proven that the ray gun works; they have simply decided that it is very, very likely that it does. In the same way, no clinical trial can "prove" how well a particular treatment does, but it may, however, give an answer that you feel fairly confident about. Remember that, before Carol and her ray gun appeared, the probability of getting heads was 50%, or 0.5, not a very convincing P value -- in fact, no better than flipping a coin!
I have never seen a good explanation for how and when it was decided that a P value of 0.05 (5%) was the magic cut-off for statistical significance. Who decided that if there is less than a one in twenty chance of a result being due to chance, then the effect is real? Somehow this number has been elevated to oracle status, where 0.05 is the dividing line between success and failure. If your study shows results with a P value of 0.046, you'll be published in NEJM and address the plenary at a big conference in a warm climate. But show a P value of 0.055, and you're dismissed in disgrace. It's kind of silly. There's absolutely nothing magic about one-in-twenty except that it's become a standard. Like speed limits: why 55 mph, and not 57 or 53? It's just a nice even number.
The basic rule of thumb is, the smaller the P value, the better. You can even make up your own personal P value for significance, depending on your innate skepticism. Maybe you think that one-in-twenty odds are too forgiving, so you decide to remain unconvinced unless P values are less than one in 50 (P <0.02). You go!
Let's bring this back to earth, and start talking about controlled trials. A trial is controlled when the experimental treatment is compared to one of known benefit -- a control. And the safest way to do a controlled trial is to allocate the study treatments to trial participants by a random process. Randomization helps ensure that investigators will not consciously or unconsciously influence the outcome of a trial due to personal preferences or beliefs. For example, if treatments are not randomly assigned, there is a risk that a doctor might choose to put her sicker patients on X-100 so they can get the latest drug. Or the investigator may think it's too risky to give experimental drugs to sicker patients and assigns them to the known treatment. Either way, a doctor's personal opinions can affect or bias the comparison. If sicker patients were preferentially given X-100, the drug could appear to be less effective because sicker patients tend not to respond as well to treatment. In the second example, the reverse would occur. The point is that you can never know all the factors that might influence patient selection. Therefore the only way to ensure a fair comparison is to randomly assign treatments. When treatments are compared in this way -- ensuring that both treatment groups are selected without bias -- it is called a randomized controlled clinical trial.
Let's take a harder look at our imaginary X-100. We suspect it has real benefits. After all, half the people taking it had undetectable virus afterwards. But we can't be sure that this wouldn't have happened anyway. And even more importantly, if there are already other treatments for HIV, we want to know if X-100 adds any benefit to what is otherwise available. To answer this question we need to conduct a randomized controlled clinical trial. Since the case series study of X-100 was promising, we decide to test X-100 plus HAART against HAART by itself. But how many people do we need to enroll in our trial to get results we can count on?
Sample Size ("N")
To decide on a sample size (or "N") for a clinical trial, trial investigators must first establish standards for the precision of their measurements, then decide how much comparative benefit from the drug would be meaningful or possible to observe. Finally, they calculate the number of people that should be enrolled in order to obtain useful results with the kind of precision they demand.
In the same way that P <0.05 is a convention for describing the likelihood of randomly getting some result, there are a couple of other common conventions for stating how precisely the results were measured. One convention that we discussed in Part 2 is power, or the likelihood of detecting a benefit that is really there. Power of 80% or 90% is typical for clinical trials. Another important convention is called alpha, which is an estimate of how likely it is to falsely detect an effect when there is none. Alpha, also known as the significance level, is often set at 5%. A good, but not entirely accurate analogy for power and alpha is sensitivity and specificity. If you think about it, for a given sample size, the more stringently alpha is set to avoid falsely detecting an effect, the more likely it becomes that a true effect will be missed. In other words, there is an inherent tradeoff between alpha and power.
Next, investigators must decide how much of a treatment effect they'd like to be able to capture with the precision allowed by their power and alpha. A difference of 15%-20% between compared treatments is a common expectation in AIDS trials. In order to detect smaller differences between treatments, the sample size must include a sufficient number of patients receiving each treatment. One of the dangers of too-small studies is that they have too little power. A small study may be able to confidently report: "X-100 plus HAART was no better than HAART" -- with "better" meaning at least 20% better. But so what? The trial's sample size does not allow confirming or rejecting the possibility that X-100 is 10% or 15% better than HAART alone -- something that would be good to know. If a disease affects hundreds of thousands of people, then a treatment that offers even one or two additional percentage points of benefit can have an important impact on a lot of people. However, if a relatively small number of eligible trial participants limits the achievable sample size, then perhaps detecting a 20% difference is the best that can be hoped for.
So, assuming that HAART gives you a 50% response rate and that we are looking for a 20% difference, we would like at least 60% of participants taking X-100 to have an undetectable viral load (20% of 50 is 10; 50% + 10% is 60%).
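For readers curious where numbers like "several hundred people" come from, the pieces above (alpha, power, and the difference to be detected) plug into a standard sample-size formula for comparing two proportions. The sketch below is an illustration only, not any trial's actual calculation; the function name and the use of a normal approximation are my own assumptions.

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_per_arm(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Rough participants needed per arm to detect response rates p1 vs. p2
    (two-sided test, normal approximation for two proportions)."""
    z = NormalDist()
    z_a = z.inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_b = z.inv_cdf(power)           # about 0.84 for 80% power
    p_bar = (p1 + p2) / 2
    numerator = (z_a * sqrt(2 * p_bar * (1 - p_bar))
                 + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

# The X-100 scenario above: 50% response on HAART vs. a hoped-for 60%.
print(n_per_arm(0.50, 0.60))   # several hundred people per arm
```

Notice how quickly the numbers grow: halving the detectable difference roughly quadruples the required sample size, which is why trials hunting for small benefits must be so large.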
One more point. It's important to try to estimate beforehand how many people are likely to drop out or crossover (take the other arm's treatment) during the trial. This can have a major impact on the power of the trial. It's as if the coins are changing from heads to tails in mid-air and vice versa. If dropouts and switching start to make the treatment groups look more and more alike, then the trial's power to detect a difference between groups is watered down.
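The "coins changing in mid-air" effect can be made concrete: if some fraction of each arm effectively ends up on the other arm's treatment, the observed response rates drift toward each other, shrinking the very difference the trial was sized to detect. A hypothetical sketch with made-up numbers:

```python
# If a fraction `cross` of each arm effectively receives the other arm's
# treatment, the observed response rates blend together. Illustrative only.
def observed_rates(p_control: float, p_new: float, cross: float):
    obs_control = (1 - cross) * p_control + cross * p_new
    obs_new = (1 - cross) * p_new + cross * p_control
    return obs_control, obs_new

print(observed_rates(0.50, 0.60, 0.00))  # no crossover: full 10-point gap
print(observed_rates(0.50, 0.60, 0.20))  # 20% crossover: gap shrinks to 6 points
```

A smaller apparent gap demands a much larger sample size to detect, so heavy dropout or switching can quietly gut a trial's power.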
These factors (power, alpha, size of difference to be detected and expected dropout and switching rates) plus a few others all go into determining a trial's sample size. And all of these statistical considerations are presented in that easy-to-skip "methods" section. But now you know enough to dive in and decide for yourself if the researchers had realistic expectations or if they were just starry-eyed and gullible.
Finally, there's one bit of information found in most science papers that might help you get a better grasp on trial results than just the P value alone. The "95% confidence interval" or "95% CI" is a range of possible results within which you can be 95% sure that the "real" answer lies. So for a "true answer" of 50% (as in the coin flips), the 95% CI starts out very wide. On the first flip it stretches from 0% to 100%, but as you do more and more flips, the 95% CI narrows until it covers just a tiny increment above and below 50%. You might see this written in the literature as 95% CI = (49.9, 50.1). The narrower the interval, the more certain you can be about where the "true" answer lies. Looking at the 95% CI will often give you a far better intuitive understanding of the certainty of the results. With our coins, if you saw "Odds of a head = 50%, 95% CI = [20%, 90%]," you would know that the result is almost worthless -- it could be anywhere inside that wide margin. But if you see "Odds of a head = 50%, 95% CI = [48%, 52%]," you can be far more confident that 50% is a good result.
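You can watch a confidence interval narrow by simulating the coin flips yourself. The sketch below uses the simple "Wald" approximation (estimate plus or minus 1.96 standard errors); the function name is my own.

```python
import random
from math import sqrt

def wald_ci_95(heads: int, flips: int):
    """Approximate 95% confidence interval for the true heads rate."""
    p = heads / flips
    margin = 1.96 * sqrt(p * (1 - p) / flips)
    return max(0.0, p - margin), min(1.0, p + margin)

random.seed(1)
for n in (10, 100, 1_000, 10_000):
    heads = sum(random.random() < 0.5 for _ in range(n))
    lo, hi = wald_ci_95(heads, n)
    print(f"{n:>6} flips: 95% CI = ({lo:.3f}, {hi:.3f})")
# The interval tightens around 0.5 as the number of flips grows.
```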
That's it for the statistical stuff. No one grasps all of these concepts the first time through, but they offer the clearest and least ambiguous way of measuring research precision and accuracy. Understanding their meaning is essential for critical reading. Now let's wrap up this series by looking at a far less quantifiable measure of a report's significance.
Where Does It Appear?
Not all trial reports are equally trustworthy. In Part 1 we looked at who wrote the paper; now we're going to look at who has reviewed and published it. The "gold standard" for publication is the peer-reviewed journal. Peer review means that other experts in the field have had an opportunity to go over the work in detail and ask the authors for clarification or additional information. Not only are articles in these important journals reviewed by peers before publication, they are subjected to intense scrutiny by the journal's readers. It can be amusing to follow one of the well-mannered "flame wars" that sometimes break out in the Letters section of a respected medical journal. The lesson is that no single paper decides scientific truth; general agreement comes about slowly, by consensus -- and sometimes that's a rocky process.
Scientific conferences produce an abundance of reports and papers. Many major conferences have several different levels of review. Oral presentations and especially plenary talks (meetings attended by all conference participants, as opposed to special interest "breakout sessions") may undergo a peer-review process as stringent as a journal paper does. Posters (which are really just oversized abstracts) and "break-outs" typically undergo less scrutiny. So it's especially important when reviewing conference presentations to consider them in context. If they make sense and tend to "fit together" with peer reviewed papers, they may offer some novel insights or greater detail to what has gone before. But if, say, a poster substantially disagrees with the peer-reviewed literature, a careful look is merited before breaking out the party hats. This doesn't mean the poster data is wrong -- revolutionary ideas are often rejected until they become more familiar. But if what you read sounds too good to be true, it's worth looking a little deeper into the disagreement.
More and more frequently, the pharmaceutical and other industries create "front groups" that sound as if they are independent scientific organizations, when in fact they are mostly or entirely staffed and supervised by employees or consultants of a company. It's important to look not only at who sponsors a conference, but also at who is on its "scientific advisory board." Most of the major conferences actually list these advisers on their letterhead. If it isn't obvious who is providing scientific leadership, or that information is hard to find, be very cautious. It's quite possible the conference is in effect a very detailed, complex advertisement.
Unfortunately, conflict-of-interest regulations have been weak or missing in biomedical science, so we see scandals like investigators who hold patents on a drug chairing federally funded research projects into that same drug. Fortunately, there is greater and greater interest in this issue, and many organizations are drafting guidelines requiring investigators to disclose any stock, consulting, or other fiduciary arrangements with any entities that have a stake in their research.
Many, if not most major scientists have consulted for various companies at one point or another and there is a "revolving door" between the National Institutes of Health (NIH), industry, and academia. So while it may be virtually impossible to eliminate all conflict of interest, evaluating possible conflicts has now become an inescapable part of assessing research findings.
This is a lamentable state of affairs, but this is where peer review is essential. Before publication in any major journal, experts in the relevant field pay particular attention to the "methods" section of the paper we described before. While there is always the slim possibility that investigators simply made the data up (which has occurred), knowing that standardized, well-tested methods and tools were employed helps to detect "fudging" and disingenuous conclusions.
A whole new set of complications comes with the rapidly evolving field of "Web publishing," as research appears on the Internet without having first appeared in a peer-reviewed journal. With a disease as deadly as AIDS and in a field that's evolving as rapidly as HIV research, it's understandable to want the freshest information as quickly as possible. Paper-based distribution is inherently slow, but we shouldn't sacrifice quality for speed. It will be critical over the next few years to work out rational and workable guidelines for responsible web publishing. Many sites are already taking the initiative and have started using many of the same safeguards (scientific advisory boards, peer review, etc.) that distinguish respected paper-based publications.
Let the reader beware . . . but let the reader be ruthless. As you practice reading abstracts and full-length papers you will start to recognize the signs of quality research as well as the signs of slipshod work. Keep at it. The more mysteries we can dispel about this disease, the sooner we can stop its progress.
Back to the GMHC Treatment Issues June 2001 contents page.