Thursday, March 1, 2012
How & Why Journals Waste Our Time
We often think of medical research as objective, rigorous, and even ruthless in separating truth from fiction; in reality, it is surprisingly easy to manipulate results or distort data in ways that support stronger claims than the evidence justifies, or that lead to erroneous conclusions. In an unusual display of candor, a senior editor from a major journal in the pain management field provides some important lessons in how research can be misleading and the statistical subtleties behind it. At the same time, he raises questions about why such research might be published in a peer-reviewed journal in the first place.
We have observed before that when healthcare providers hear the word “statistics” many of them want to run off and hide, and even experienced consumers of pain literature suffer from a certain amount of “statistiphobia.” Yet, according to Nikolai Bogduk, MD, PhD — writing in the February edition of the journal Pain Medicine from the American Academy of Pain Medicine — “avoiding statistics constitutes a disservice to their own intellect, and renders practitioners vulnerable to illusions and misinformation” [Bogduk 2012].
Bogduk is a Professor of Pain Medicine at the University of Newcastle, Australia, and a senior editor of Pain Medicine. He reminds readers that statistics serve them not by proving truth but by protecting them from errors and false assertions. The concepts discussed by Bogduk also have been presented in our ongoing Series, “Making Sense of Pain Research” [listing here], but he provides some added insights by using a specific study published in Pain Medicine as an example.
Errors & Risks in Research
Among the difficult statistical concepts are those pertaining to Type I and Type II errors in controlled trials, along with the statistical power of such trials. Problems come about because any controlled study is vulnerable to vagaries of the samples; outcome differences between groups may occur, not because of the biological effects of a therapy or intervention but because of the samples themselves, Bogduk cautions.
A difference between an experimental treatment and a control group might be detected when none actually exists, which is known as a Type I error, or a “false positive” result. This can arise when, by chance alone, subjects in the experimental group respond extraordinarily well or those in the control group respond unusually poorly. Differences in the samples, not in the strength of the treatment or intervention, account for the differences in outcome results.
In well-designed studies, the acceptable risk of random false positives is conventionally kept below 5%, or 0.05; hence the expression P<0.05. This means that there should be less than a 1 in 20 chance of a misleading result due to aberrations in the samples. “That is why repeat studies are mandatory, not so much to confirm the result, but to refute rogue results that arise from an unrepresentative sample,” Bogduk notes. “In other words, an intervention is not proved by repeat studies, but the credibility of that intervention rises by default if and when repeated attempts to disprove it fail.”
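The 5% convention can be illustrated with a small simulation: when two groups are drawn from the very same population — so that no true difference exists — a test at the P<0.05 threshold still “detects” a difference about 1 time in 20. Here is a minimal sketch in Python (illustrative only; the function name and parameters are our own):

```python
import random

def false_positive_rate(n_trials=5000, n_per_group=20, seed=42):
    """Simulate trials where the 'treatment' truly has no effect
    and count how often a z-test still declares significance."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # Both groups are drawn from the SAME distribution, so any
        # "significant" difference is a Type I error (false positive).
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        diff = sum(a) / n_per_group - sum(b) / n_per_group
        z = diff / (2 / n_per_group) ** 0.5  # known sigma = 1
        if abs(z) > 1.96:                    # the P < 0.05 criterion
            hits += 1
    return hits / n_trials
```

Run with these defaults, the observed false-positive rate comes out close to the nominal 0.05, which is exactly the “1 in 20” risk described above.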
On the other hand, there also is a risk that a study will fail to find a difference between groups when, in reality, a difference should occur and be detected in the research. This failure is called a Type II error, or “false negative” result. Again, this may come about by chance, such as when the experimental and control groups are unrepresentative of the population and respond atypically.
Bogduk points out that the risk of a Type II error, false negative, is harder to avoid than a Type I error, false positive. The risk of Type II error is greatly influenced by sample size, with larger samples more likely to be representative and less likely to produce atypical results. Conversely, small samples of subjects can be contaminated by individuals who shift responses away from what should be the group average, thereby camouflaging differences that should have been detected. In determining necessary sample sizes, researchers typically allow for a 20% risk of Type II error, which gives the study 80% (1.00 minus 0.20) statistical power to detect a true difference. [Statistical power also was thoroughly discussed in an UPDATE here.]
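The relationship between sample size and power can be made concrete with the standard normal-approximation for a two-sample comparison. This is a simplified sketch, not the calculation used in any particular study; `power_two_sample` is our own illustrative function:

```python
from math import sqrt, erf

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def power_two_sample(effect_size, n_per_group, alpha_z=1.96):
    """Approximate power of a two-sided, two-sample z-test: the
    probability of detecting a true standardized difference
    `effect_size` with `n_per_group` subjects in each arm."""
    shift = effect_size * sqrt(n_per_group / 2)
    return norm_cdf(shift - alpha_z)

# For a moderate true effect (0.5 SD), power climbs with sample size:
for n in (10, 20, 50, 100):
    print(n, round(power_two_sample(0.5, n), 2))
# → roughly 0.20, 0.35, 0.71, 0.94
```

With only 10 subjects per group, a genuine moderate effect would be missed about 4 times out of 5 — precisely the Type II hazard Bogduk describes.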
What’s All the Fuss About?
So, why are these statistical concepts important? A major concern is that small studies with insufficient statistical power may produce outcome results that are laden with Type II errors and, thereby, mislead readers. And, small studies are pervasive in the pain research field.
For example, Bogduk observes that having valid and reliable safety data regarding a therapy or intervention can be essential for practitioners; however, good studies to gather such information can be difficult to conduct. As a case in point, he discusses a study appearing in the same edition of Pain Medicine as his commentary, by Aileen Michael and colleagues [Michael, et al. 2012].
The objective of their study was to compare the safety of a new device with an older, well-established model used for the intrathecal (IT) delivery of opioid analgesics. A primary safety concern is granuloma formation around the tip of the IT catheter. Such granulomas — or clumps of immune cells and fibrous tissue — may result from a natural inflammatory reaction to foreign bodies. The incidence of this complication has been relatively low in clinical practice, but IT granulomas can lead to significant morbidity, including paraparesis (partial paralysis of lower limbs).
In the study by Michael et al., 52 female subjects were randomized to 6 groups, testing the 2 different IT pump systems and 3 different solutions in each — saline (control), morphine sulfate, or baclofen — at high and low concentrations and dosages. The study design and presentation of results are quite complex; however, Bogduk observes that the study found no statistically significant differences between groups in rates of complications, suggesting that the newer and the older IT delivery systems have equivalent safety profiles.
“Superficially, this might seem appealing or reassuring to readers,” he writes, “but such negative results warrant greater scrutiny. The risk of a Type II error arises.” That is, the negative results might reflect insufficient statistical power, rather than a true lack of differences between the devices. Indeed, a close look at sample sizes and outcomes does suggest a lack of power.
Granulomas occurred only with morphine infusion, primarily at higher concentration and dosage. With the newer device granulomas occurred in 5 of 16 cases (31%), and with the older device granulomas occurred in 10 of 16 cases (63%). Bogduk notes that “Although the latter proportion (0.63) is larger than the former (0.31), the two are not significantly different statistically because their 95% confidence intervals (CIs) overlap (0.08–0.54, 0.39–0.87).” Furthermore, the wide ranges of the CIs suggest that the sample sizes were too small; for example, a point estimate of 0.31 conveys little when its CI ranges from as low as 0.08 to as high as 0.54.
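As an illustration, the simple normal-approximation (Wald) formula for a proportion’s 95% CI reproduces intervals very close to those Bogduk reports (he does not state his exact method, so small rounding differences are expected):

```python
from math import sqrt

def wald_ci(successes, n, z=1.96):
    """Simple normal-approximation (Wald) 95% CI for a proportion."""
    p = successes / n
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half

new_device = wald_ci(5, 16)    # ≈ (0.09, 0.54) — Bogduk reports 0.08–0.54
old_device = wald_ci(10, 16)   # ≈ (0.39, 0.86) — Bogduk reports 0.39–0.87

# The intervals overlap (the upper bound of one exceeds the lower
# bound of the other), so the difference is not statistically
# significant at this sample size.
overlap = old_device[0] < new_device[1]
```

Note how each interval spans nearly half of the possible 0-to-1 range — a direct symptom of the small groups.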
NOTE: In their article, Michael et al. do not provide confidence intervals for their data, so Bogduk has at hand information unavailable to readers. By our own calculations — using data within the study article and our PTCalcs Excel spreadsheet [here] — the difference between the proportions is 0.32, and the authors indicate this is nonsignificant (P=0.156). This would represent a small effect size of 0.37 with a large standard deviation of 0.85 and a broad 95% CI of -0.12 to 0.74 (which includes zero, indicating nonsignificance). This seems to confirm Bogduk’s assertion that the small group sizes resulted in high variance in the data and unreliable outcomes.
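The authors’ P value of 0.156 is consistent with a two-sided Fisher exact test on the 2×2 table of granulomas by device, which can be checked from first principles (a sketch; the study article does not state which test was used):

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]]:
    sums the probabilities of all tables with the same margins that
    are no more likely than the observed one."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def hyper(k):
        # Hypergeometric probability of k successes in row 1
        return comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)

    p_obs = hyper(a)
    lo = max(0, col1 - row2)
    hi = min(col1, row1)
    return sum(hyper(k) for k in range(lo, hi + 1)
               if hyper(k) <= p_obs + 1e-12)

# Granuloma vs no granuloma: new device 5/16, old device 10/16
p = fisher_exact_two_sided(5, 11, 10, 6)   # ≈ 0.156
```

That the exact P value lands at 0.156 despite a two-fold difference in complication rates underscores how little resolving power these group sizes provide.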
Still, as Bogduk observes, at face value, a 63% complication rate with one device does look to be twice the rate of 31% with the other. And, in fact, he notes that, “The difference between these two proportions would have been statistically significant if the sample size had been 34.” However, he continues, “planning a study to have a sample size of 16, or stopping it at 16, provides for no scientific conclusion. We cannot tell if the observed difference is an artifact or is real.” So, what was the value of this study?
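A standard formula for the sample size needed to compare two proportions (two-sided 5% level, 80% power) lands in the same neighborhood as Bogduk’s figure of 34 per group; different formulas apply different corrections, so this is an illustration rather than a reproduction of his calculation:

```python
from math import sqrt, ceil

def n_per_group(p1, p2, z_alpha=1.959964, z_beta=0.841621):
    """Standard sample-size formula for comparing two proportions,
    at alpha = 0.05 (two-sided) with 80% power."""
    p_bar = (p1 + p2) / 2
    num = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
           + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(num / (p1 - p2) ** 2)

n = n_per_group(5 / 16, 10 / 16)   # ≈ 39 per group by this formula
```

Whatever the exact figure, the message is the same as Bogduk’s: roughly double the 16 subjects per group actually used would have been needed for a legitimate conclusion.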
What About Subjects Who Died?
As Bogduk explains, with a larger sample size that he calculated to be 34 subjects in each group in this case, a more definite answer might have arisen, which then could have been further tested for reproducibility. “Understandably, experiments incur costs,” he writes, “and doubling the sample size might double the cost; but the penalty for financial frugality is lack of a more definitive contribution.” At the same time, all of the subjects in this study, who were previously healthy, underwent an invasive procedure — and 3 of them receiving morphine infusions developed severe complications during the study and had to be euthanized!
Did we mention that the subjects were dogs? That’s right… 52 female, cross-bred hounds, aged 8 to 26 months, weighing approximately 46 to 70 pounds (21-32 kg). In fact, all of the remaining animals were euthanized at the end of the study for inspection at autopsy of any possible pathology due to the infusions. So, essentially, 52 previously healthy animals were sacrificed for a study that was not adequately designed at the outset to provide valid results.
The authors make two relevant statements in this regard…
- First, they write that this study was conducted in compliance with the Food and Drug Administration Good Laboratory Practice Regulations and the protocol was approved by “the Institutional Animal Care and Use Committee.”
- Secondly, the purpose of their study was to gather data for submitting the new device for regulatory approval, and the authors (two of whom were employed by the new-device manufacturer) state that their sample size was based on meeting that objective. “This study was specifically not powered to detect differences between pump types,” they write.
Apparently, the study protocol involving a sacrifice of animals was reviewed in advance and approved even though there was no interest in detecting potentially important differences between the two devices; and, indeed, none were found. But, as Bogduk notes, “When coupled with the small samples used, this lack of difference precludes a legitimate conclusion. In turn, this raises the specter of the value of such studies.” Yet, doubling the number of subjects in this particular study, as Bogduk suggests would be necessary, also would have dire implications for the many additional animals involved.
Whenever subjects — whether human or animal — are engaged in studies that are unlikely to provide valid results, it raises serious questions of appropriateness. As we wrote in a prior UPDATE…
“…experts have questioned the ethical propriety of conducting small-scale, underpowered studies [Halpern et al 2002; Kraemer et al. 2006]. They assert that, even in the most innocuous of small-scale studies, patients are being exposed to the burdens and possible hazards of experimental treatments or procedures, and time and money are expended, for very limited ends. In many cases, studies worth pursuing are aborted prematurely due to disappointing early results, and studies that are not aborted are underpowered and of questionable clinical value.”
Why Was This Study Published?
Besides the propriety issues, this study by Michael and colleagues raises two more questions: (1) What purpose does it serve, and (2) of what interest would it be to readers of Pain Medicine?
Bogduk seems to politely waffle regarding those questions. He writes, “The study does serve a sociological purpose. Its publication records that investigators were interested in a question, and readers might be served by learning that someone had a look at that question; but the question remains unanswered.” At the same time, he acknowledges that the outcomes are worthless and readers “are entitled to feel irritated that [a difference between groups] might have arisen had the study been larger.”
“The point is that a small study that finds no difference serves no heuristic purpose,” he continues. “If no conclusion can be drawn, the rationale for such a small study can be questioned.” Essentially, the researchers did not do the work required of them before publishing to provide valid conclusions; and, as to why Pain Medicine would be the publisher of such research, Bogduk writes…
“For a learned journal, an issue of policy arises. Strict journals would not countenance publication of inconclusive studies, on the grounds that they do not contribute scientifically to the valid resolution of a question. Other journals might indulge sharing archival information about what investigators have done, and respond to the particular interests of their readers by doing so.”
Apparently, Pain Medicine falls into the latter category, willing to provide “archival information,” and Bogduk expresses the hope that, armed with a better understanding of statistics, readers “will be able to recognize when they are being given archival information rather than being provided with conclusive data.” Of course, this still raises the question of why editors might consider such information to be of value to busy clinicians and other healthcare providers in the pain management field, who we assume comprise the journal’s primary audience.
In fairness, Pain Medicine should be applauded for this demonstration of candor by publishing such criticism from one of its own editors. In fact, this is not the first time Bogduk has offered such disapproval. As we noted in an UPDATE last October [here], he critiqued the flaws of a research study claiming that intrathecal midazolam was a beneficial supplement to standard analgesic therapy for recalcitrant lower-back pain. In that case, the study was adequately powered, but without suitable controls to account for placebo effects and other confounding factors; essentially, making the outcomes of little or no clinical value for decision-making purposes.
Furthermore, Pain Medicine should not be singled out as being the only journal in the field to publish articles that probably have little validity and/or clinical value for readers. Perhaps, if journals in the field were more selective in the quality and value of studies that they publish, researcher-authors would respond in kind with better submissions. Meanwhile, as we have frequently cautioned in these UPDATES, caveat lector — reader beware.
> Bogduk N. Power and Meaninglessness. Pain Med. 2012(Feb);13:148–149 [available here by subscription].
> Dhruva SS, Redberg RF. Evaluating Sex Differences in Medical Device Clinical Trials: Time for Action. JAMA. 2012(Feb 29); online ahead of print [abstract].
> Halpern SD, Karlawish JH, Berlin JA. The continuing unethical conduct of underpowered clinical trials. JAMA. 2002;288(3):358-362 [abstract here].
> Kraemer HC, Mintz J, Noda A, et al. Caution regarding the use of pilot studies to guide power calculations for study proposals. Arch Gen Psychiatry. 2006;63:484-489.
> Michael AM, Buffen E, Rauck R, et al. An in vivo canine study to assess granulomatous responses in the MedStream programmable infusion system and the SynchroMed II infusion system. Pain Med. 2012(Feb);13(2):175–184 [abstract here].