
How to Read a Scientific Paper

(Second of a three-part series)

May 2001

In Part One we looked at how a scientific paper is organized. Now we can continue to the interesting stuff -- the actual contents. Of the thousands of papers published each year, some are far more reliable and relevant than others. There are specific characteristics to look for when deciding which papers to trust and base important decisions on. Last month we learned that prospective trials are more generalizable than retrospective studies and that some authors carry more weight than others. This time we begin, naturally enough, with endpoints.

The term "endpoint" is an unfortunate choice and confuses many people. "Endpoint" sounds like it means the end of the trial, and in a few cases it might. A far better, but less memorable, term would be "key data item." That's all endpoints are. They are the information that is most important to the purpose of the trial. When a hypothesis (or the main question that drives a trial) is set up, the endpoints are defined as the information necessary to answer that question. Endpoints in HIV treatment trials may include the occurrence of AIDS-related conditions, death, serious toxicity, or CD4 count and viral load thresholds. They are the essential measurements that must be recorded, those that are critical for answering important questions about a drug.

Endpoints are divided into primary (or main) endpoints and secondary endpoints. Generally, but not always, a trial will have one primary endpoint. It is the single most important piece of information to be obtained, and trial design decisions should be made to best guarantee getting accurate information on the primary endpoint. As you recall, in our X-100 trial, viral load is the primary endpoint. All other measures taken in our study are secondary to this main variable. There should be no trial procedures or other design considerations that interfere with getting the complete and accurate information on viral load.

As we learned in Part One, a study is only useful to the "real world" in proportion to its generalizability. This is the degree to which a trial's findings can be applied to a wider population outside of the trial participants. For any particular study, the answers we get only refer to the people who were in that particular trial at that time. Even if we redid the trial with all of the same people at another time, the answer could be very different.

The extent to which we believe that our study findings can be helpfully applied to other people who are similar to those in the trial is perhaps the most important quality a study can have. If the results can't be generalized, the trial is a sterile, abstract experiment of no relevance. Once generalized, though, trial results can become important factors not only in helping people make decisions about care and treatment, but also in suggesting and helping to explain additional research.

It may seem obvious that exactly who gets studied has a big impact, not only on the results themselves, but also on the later interpretation of those results. The broader the eligibility criteria, the more diverse the group enrolled, and the broader the population to which the results can be extended.

Some studies call for a broad representation of possible patients among the participants, whereas others study a very narrow range of people. For example, if you wanted to know how useful our imaginary anti-HIV drug X-100 is for adults failing therapy with protease inhibitors, you would have a problem if you only enrolled men. The answers you received could not be applied to women with much confidence. For that particular question, you would have seriously failed. But if you wanted to study X-100 in men alone (who knows? Maybe it had testicular toxicity in rats), then narrowing the eligibility criteria makes sense.

Some eligibility restrictions are simply prudent. You'd never want to enroll people who are highly likely to be harmed by a trial. For this reason, persons with high liver function test results (also called LFTs, transaminases, SGOT, SGPT, ALT, or AST) are often excluded. The intention is to avoid having persons who are likely to suffer serious toxicity be among the very first to try a new drug. But sometimes there's a trade-off between protecting trial participants and producing results that will be relevant. Many people with HIV also have the hepatitis C virus (HCV), so persons with impaired liver function will use the new treatment eventually. How will the trial results apply to them? Fortunately, there is an increased awareness of these issues, and most trials involving HIV-positive people have relaxed the eligibility rules about liver function, while still barring those at imminent risk of suffering harm. The result is research that benefits a broader range of people who are actually likely to use X-100.

Human trials can be placed on a continuum from the highly restrictive "lab rat" type studies, where intense efforts are made to control every possible variable, to "public health"-oriented trials that intentionally seek diversity in order to mirror the populations in which a drug will eventually be used. Both kinds of studies are necessary and neither approach is "right." Each has unique benefits and drawbacks, and they are used to answer very different kinds of questions.

For example, if we wanted to know about the activity of X-100, we would study its pure antiviral effect under conditions divorced from "real world" issues such as adherence, interactions with other drugs, the effects of gender, age, etc. We would try to control all those variables as much as possible, to see what X-100 can do under the best possible conditions. This approach also excels at studying other kinds of questions. If we wanted to know the "why" or "how" of differences in a drug's activity between different groups, such as women and men, being able to hold most variables steady while changing one allows a precision we could never achieve with a more diverse set of people and circumstances.

On the other hand, if what we care about is X-100's efficacy (the degree to which it works under more "real world" circumstances), we would try to create a study population that mirrors the range of persons in which X-100 may eventually be used. If we overly restrict enrollment, we may end up with results that have far less applicability for public health. For example, if we arbitrarily said "no redheads" in the X-100 trials, I personally would be nervous about taking X-100 if I had red hair. That is, of course, a frivolous example. But restrictions on gender, liver function, previous treatment history, or "substance abuse" may end up screening out those who most need X-100.

Eligibility criteria are generally divided into two sections: "inclusion" and "exclusion" criteria. The first tells who can get in, the second describes who cannot. Sometimes it's not clear which category a particular criterion belongs in. If our X-100 trial enrolls persons with CD4 counts greater than 50, we can state "CD4 >50" as an inclusion criterion, or we could use "CD4 of 50 or less" as an exclusion criterion. As a general rule, though, inclusion criteria define the population the trial intends to represent, while exclusion criteria define the exceptions to that rule (for example, persons whose impaired liver function puts them at high risk, even though they meet all the other criteria).
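One way to picture how the two lists work together is a small screening function. This is only a sketch: the field names `cd4` and `high_liver_risk` and the rules themselves are invented for the imaginary X-100 trial, not taken from any real protocol.

```python
def is_eligible(participant, inclusion, exclusion):
    """Eligible = meets every inclusion rule and triggers no exclusion rule."""
    meets_all = all(rule(participant) for rule in inclusion)
    excluded = any(rule(participant) for rule in exclusion)
    return meets_all and not excluded


# Hypothetical rules for the imaginary X-100 trial
inclusion = [lambda p: p["cd4"] > 50]          # who the trial intends to represent
exclusion = [lambda p: p["high_liver_risk"]]   # exceptions made for safety

if __name__ == "__main__":
    volunteers = [
        {"cd4": 200, "high_liver_risk": False},
        {"cd4": 40, "high_liver_risk": False},
        {"cd4": 200, "high_liver_risk": True},
    ]
    for person in volunteers:
        print(person, "->", is_eligible(person, inclusion, exclusion))
```

Moving the CD4 rule from one list to the other (as `p["cd4"] <= 50` in the exclusion list) would screen exactly the same people. The only thing to watch is the boundary: someone with a count of exactly 50 must land on the same side either way.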

A study's size is very important, for several reasons. As described above, evaluating a drug for "real world" efficacy requires a diverse study population, and the practical effect of achieving diversity is that a lot of people need to be enrolled. But far more important than this practical problem is a statistical concept called power. Plainly put, the more persons you are able to observe, the higher the power, and the more certain you can be that you will detect an effect due to X-100, if one really exists. A larger trial also produces more precise estimates, which makes it less likely that an apparent benefit of X-100 is nothing more than a chance fluctuation.

No matter how many people are enrolled in a trial or how long a trial continues, you can never be absolutely certain you have found the one and only "right" answer. But you can increase your confidence that you have most likely obtained the right answer -- or one that is very, very close to the imaginary "true" answer. Large trial sizes, and power, are a big part of achieving that.
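The link between trial size and power can be checked with a rough simulation. The sketch below rests on invented assumptions: a made-up trial in which 30% of people respond without X-100 and 45% respond with it, and a simple two-proportion z-test at the 5% level standing in for whatever analysis a real trial would use. It runs many imaginary trials at each size and counts how often the difference is detected.

```python
import math
import random


def two_proportion_p_value(x1, n1, x2, n2):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (x1 / n1 - x2 / n2) / se
    # Two-sided normal tail area, written with the complementary error function
    return math.erfc(abs(z) / math.sqrt(2))


def estimate_power(n_per_arm, p_control, p_drug, trials=1000, alpha=0.05, seed=1):
    """Fraction of simulated trials whose p-value falls below alpha."""
    rng = random.Random(seed)
    detected = 0
    for _ in range(trials):
        control = sum(rng.random() < p_control for _ in range(n_per_arm))
        drug = sum(rng.random() < p_drug for _ in range(n_per_arm))
        if two_proportion_p_value(drug, n_per_arm, control, n_per_arm) < alpha:
            detected += 1
    return detected / trials


if __name__ == "__main__":
    # Hypothetical response rates: 30% without X-100, 45% with it
    for n in (25, 100, 400):
        print(n, "per arm -> estimated power", estimate_power(n, 0.30, 0.45))
```

Under these assumptions the small trial misses the real difference most of the time, while the large one almost never does. The exact figures depend entirely on the invented response rates.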

Let's leave X-100 aside for a minute, and talk about coin flips. Pretend you are an alien, and know nothing about flipping coins. Being a nerdy alien, you want to calculate the odds of getting a "head" after any particular coin flip. Your friend, another alien, decides you should do an experiment, a "trial," to learn how coin flips actually work. He flips the coin one time, and gets heads. If he were a statistically impaired alien, or one of our Congress members who opposed sampling in the last census, he might just let the issue die there, and say, "The odds are 1/1 (one out of one). That's 100%! It is certain that I will get a head on the next flip!"

But you are not so easily fooled. You, on the other hand, are a rather more sophisticated alien, and you want to investigate a bit deeper. So you flip again. Another head. Hmm. Well, maybe your easily satisfied colleague was right. Or was he?

Most earthlings know that the odds of getting heads on the first flip are 1/2 or 50%. Getting heads the next time is also 1/2, and together there is a 1/4 chance of getting two heads in a row.
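The arithmetic is just multiplication of the separate chances, and a few lines of code can check it. This sketch assumes a fair coin and flips that do not influence one another; the function name is made up for illustration.

```python
from fractions import Fraction


def chance_of_all_heads(flips, p_heads=Fraction(1, 2)):
    """Probability that every one of `flips` independent tosses comes up heads."""
    return p_heads ** flips


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, "flip(s):", chance_of_all_heads(n))
```

The same multiplication applies to tails, so a run of three tails in a row has a 1/8 chance. That is unusual, but far from impossible.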

Well, you (the curious alien) say, "Hmm. Two heads is pretty convincing. But I want to be sure," and flip again. Tails this time! Your colleague says, "Oh, wait, I was wrong. The odds are 2/3 of getting a head." Well, his reasoning is still wrong, but notice how his answer comes closer to the "true" answer?

You keep flipping. You get HHTHTTTHTHHTHTTHTHHTHTTH. The fraction (per flip) is getting closer to -- and sticking closer to -- 1/2 or 50% with every new flip. The more information you have, the more certain you are of getting closer to the "right" answer.

Meanwhile, our alien colleague may not be so dumb after all, just hasty. He sits back, watches the flips, and frantically scribbles numbers. He gets the following sequence of fractions (see chart below):

| Flip | Result | Heads / flips | Percent heads |
|---|---|---|---|
| First | heads | 1/1 | 100% |
| Second | heads | 2/2 | 100% |
| Third | tails | 2/3 | 66.7% |
| Fourth | heads | 3/4 | 75% |
| Fifth | tails | 3/5 | 60% |
| Sixth | tails | 3/6 | 50% |
| Seventh | tails | 3/7 | 43% |
| Eighth | heads | 4/8 | 50% |
| Ninth | tails | 4/9 | 44% |
| Tenth | heads | 5/10 | 50% |
| Eleventh | heads | 6/11 | 55% |
| Twelfth | tails | 6/12 | 50% |
| Thirteenth | heads | 7/13 | 54% |
| Fourteenth | tails | 7/14 | 50% |
| Fifteenth | tails | 7/15 | 47% |
| Sixteenth | heads | 8/16 | 50% |
| Seventeenth | tails | 8/17 | 47% |
| Eighteenth | heads | 9/18 | 50% |
| Nineteenth | heads | 10/19 | 53% |
| Twentieth | tails | 10/20 | 50% |
| Twenty-first | heads | 11/21 | 52% |
| Twenty-second | tails | 11/22 | 50% |
| Twenty-third | tails | 11/23 | 48% |
| Twenty-fourth | heads | 12/24 | 50% |

Eventually, he gets the right answer. His initial problem was "confusing the map with the territory" -- that is, thinking that the answers he got from his test coins said something with certainty about all coins. But you, the curious alien, realized that although you cannot predict how any particular coin toss will come out, you can still make an excellent estimate as to what will result from an extended sequence of coin flips.
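The colleague's scribbled fractions can be reproduced directly from the sequence of flips. Here is a minimal sketch; the names `FLIPS` and `running_tally` are made up for illustration.

```python
FLIPS = "HHTHTTTHTHHTHTTHTHHTHTTH"  # the 24 flips listed in the article


def running_tally(flips):
    """Return (heads_so_far, flips_so_far) after each flip."""
    tally = []
    heads = 0
    for count, flip in enumerate(flips, start=1):
        if flip == "H":
            heads += 1
        tally.append((heads, count))
    return tally


if __name__ == "__main__":
    for heads, count in running_tally(FLIPS):
        print(f"flip {count:2d}: {heads}/{count} = {heads / count:.1%}")
```

Each line is the best estimate available at that moment. No single line is "the truth," but the later ones stay much nearer to 50% than the early ones.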

If you look at this series of numbers, you may notice that they oscillate above and below 50%, which emerges as the center of the oscillations (see chart below).

As you accumulate an increasing number of coin flips, the amount by which your result differs from 50% gets smaller and smaller. Minor variations from this pattern are to be expected. There's a little deviation from that pattern at the fifth through seventh flips, with three tails in a row. Our gullible colleague, if he had only seen those three flips, would be sure that all coins always come up tails -- exactly the opposite of the answer he got from seeing the first two flips! Little wobbles like this happen in real life, so it's important that clinical trials include enough people, and run for a long enough time, so that temporary imbalances don't fool you. But even with such minor variations, on average, as you flip more and more coins, the results you come up with get closer and closer to 50%.

To put it even more simply, the more information you have, the closer your estimate will come to the "truth" (our colleague never realized he was measuring an estimate; he thought he was measuring "the truth"). The fact is, you would need to keep flipping that coin forever -- past the end of the earth and the death of the sun -- in order to be absolutely certain that the "real" answer was 50%. Most sensible beings, though, would hang it up after a couple of hundred, or even a few dozen throws, when it becomes clear that if the "true" answer isn't 50%, it's pretty darn close!
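A computer can flip far more coins than any alien has patience for. This sketch assumes a fair simulated coin and a fixed random seed so that it gives the same result each time it is run; it reports how far the observed fraction of heads sits from 50% after different numbers of flips.

```python
import random


def deviation_from_half(n_flips, seed=0):
    """Absolute gap between the observed fraction of heads and 0.5."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return abs(heads / n_flips - 0.5)


if __name__ == "__main__":
    for n in (10, 100, 1_000, 10_000, 100_000):
        print(f"{n:>7} flips: off by {deviation_from_half(n):.4f}")
```

The gap does not have to shrink at every step, since wobbles still happen, but with a very large number of flips it is typically tiny. It never becomes exactly zero by guarantee, which is why the result remains an estimate.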

Exactly what does all this flipping coin flipping have to do with AIDS research, you ask? Having watched our aliens toss quarters, we are now ready to understand most of the scary statistical concepts that affect our confidence in the results of clinical trials, like sample size, significance, p values, 95% confidence intervals, and all kinds of other forbidding terms. As nasty and mathematical as this stuff seems, understanding a few simple notions can really help you judge the credibility and significance of the studies you read. In every study, information is provided that, in essence, describes how many times the coins were flipped and how far the results wobbled around the true value. These details are crucial for judging how trustworthy an answer is, or conversely, how likely it is that the results are wrong.

Let's pretend that rather than wanting to know about coin flips, our aliens want to know what the average CD4 count is for earthlings. So they go on an abduction spree and start counting CD4 cells. They choose a city and cruise through it, sampling the locals, collecting more and more CD4 counts. Occasionally they get a PWA, and the average drops; sometimes a person with lymph cell cancer has an abnormally high reading which nudges the average up. But overall, the average CD4 count swings back and forth in smaller and smaller increments as more samples are collected. As it turns out, during their travels they pass through an HIV clinic, and get a string of lower CD4 counts -- this is similar to getting three tails in a row. If the HIV clinic had been their first stop, then in the beginning they would have gotten a measurement that was far lower than the "true" average CD4 count of everybody on earth. But later, after they had collected measurements from a more diverse population, these early abnormalities would even out. If the HIV clinic had been their last stop, though, the low CD4 counts of the clinic patients would have had very little effect on the overall average, just as the later coin flips caused that average to deviate from 50% in smaller and smaller amounts.
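The effect of the order of the stops can be sketched with made-up numbers. Everything below is invented for illustration: 1,000 readings from the general population spread between 500 and 1,500, and 20 clinic readings between 100 and 300. The code tracks the running average when the clinic is the first stop and when it is the last.

```python
import random
from itertools import accumulate


def running_means(values):
    """Mean of the first 1, 2, 3, ... values."""
    return [total / count for count, total in enumerate(accumulate(values), start=1)]


rng = random.Random(0)
# Hypothetical CD4 counts, invented for illustration only
general = [rng.uniform(500, 1500) for _ in range(1000)]
clinic = [rng.uniform(100, 300) for _ in range(20)]

clinic_first = running_means(clinic + general)
clinic_last = running_means(general + clinic)

if __name__ == "__main__":
    print("after 20 samples, clinic first:", round(clinic_first[19]))
    print("after 20 samples, clinic last: ", round(clinic_last[19]))
    print("before the final clinic stop:  ", round(clinic_last[999]))
    print("final average, either order:   ", round(clinic_last[-1]))
```

Both orders end at the same overall average, because the same people were sampled. What changes is the path: with the clinic first, the early estimate is far too low, while with the clinic last, twenty low readings barely move an average already built on a thousand others.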

To finish up and relate these concepts more clearly to clinical trials, we will need to introduce a third alien with an incredible "head-changing" ray gun. But I jump ahead.

In part three, we'll look at the effect treatment has on our results and how we can tell if our trial results have statistical significance.


Back to the GMHC *Treatment Issues* May 2001 contents page.

This article was provided by Gay Men's Health Crisis. It is a part of the publication

http://www.thebody.com/content/art13389.html