Probing an untrustworthy Cochrane review of exercise for “chronic fatigue syndrome”

My ongoing investigation has so far revealed that a 2016 Cochrane review misrepresents how the review was done and what was found in key meta-analyses. These problems are related to an undeclared conflict of interest.

The first author and spokesperson for the review, Lillebeth Larun, is also the first author on the protocol for a Cochrane review that has not yet been published.

Larun L, Odgaard-Jensen J, Brurberg KG, Chalder T, Dybwad M, Moss-Morris RE, Sharpe M, Wallman K, Wearden A, White PD, Glasziou PP. Exercise therapy for chronic fatigue syndrome (individual patient data) (Protocol). Cochrane Database of Systematic Reviews 2014, Issue 4. Art. No.: CD011040.

At a meeting organized and financed by PACE investigator Peter White, Larun obtained privileged access to data that the PACE investigators have spent tens of thousands of pounds to keep most of us from viewing. Larun used this information to legitimize outcome switching or p-hacking favorable to the PACE investigators’ interests. The Cochrane review misled readers about how some of the analyses crucial to its conclusions were conducted.

One of the crucial functions of Cochrane reviews is to protect policymakers, clinicians, researchers, and patients from the questionable research practices that trial investigators use to promote a particular interpretation of their results. This Cochrane review fails miserably in this respect. Cochrane is complicit in endorsing the PACE investigators’ misinterpretation of their findings.

A number of remedies should be implemented. The first could be for Cochrane Editor in Chief and Deputy Chief Director Dr. David Tovey to call publicly for the release, for independent reanalysis, of the PACE trial data from the original outcomes paper in The Lancet and the follow-up data reported in Lancet Psychiatry.

Given the breach in trust with the readership of Cochrane that has occurred, Dr. Tovey should announce that the individual patient-level data used in the ongoing review will be released for independent re-analysis.

Larun should be removed from the Cochrane review that is in progress. She should recuse herself from further comment on the 2016 review. Her misrepresentations and comments thus far have tarnished Cochrane’s reputation for unbiased assessment and for correcting mistakes when they are made.

An expression of concern should be posted for the 2016 review.

The 2016 Cochrane review of exercise for chronic fatigue syndrome:

 Larun L, Brurberg KG, Odgaard-Jensen J, Price JR. Exercise therapy for chronic fatigue syndrome. Cochrane Database Syst Rev. 2016; CD003200.

This review added only three studies that were not included in a 2004 Cochrane review of five studies:

Wearden AJ, Dowrick C, Chew-Graham C, Bentall RP, Morriss RK, Peters S, et al. Nurse led, home based self help treatment for patients in primary care with chronic fatigue syndrome: randomised controlled trial. BMJ 2010; 340 (1777):1–12. [DOI: 10.1136/bmj.c1777]

Hlavaty LE, Brown MM, Jason LA. The effect of homework compliance on treatment outcomes for participants with myalgic encephalomyelitis/chronic fatigue syndrome. Rehabilitation Psychology 2011;56(3):212–8.

White PD, Goldsmith KA, Johnson AL, Potts L, Walwyn R, DeCesare JC, et al. Comparison of adaptive pacing therapy, cognitive behaviour therapy, graded exercise therapy, and specialist medical care for chronic fatigue syndrome (PACE): a randomised trial. The Lancet 2011;377:823–36.

This blog post concentrates on subanalyses that are crucial to the conclusions of the 2016 review, reported on pages 68 and 69 as Analyses 1.1 and 1.2.

I welcome others to extend this scrutiny to other analyses in the review, especially those for the SF-36 (parallel Analyses 1.5 and 1.6).

Analysis 1.1. Comparison 1 Exercise therapy versus treatment as usual, relaxation or flexibility, Outcome 1 Fatigue (end of treatment).

The only subanalysis that involves new studies includes the Wearden et al. FINE trial, the White et al. PACE trial, and an earlier study, Powell et al. The meta-analysis gives 27.2% weight to Wearden et al. and 62.9% weight to White et al., for a combined weight of 90.1%.

 Inclusion of the Wearden et al FINE trial in the meta-analysis

The Cochrane review evaluates risk of bias for Wearden et al. on page 49:

[Figure: the review’s risk of bias entry on selective reporting for Wearden et al.]

This is untrue.

Cochrane used a ‘Likert’ scoring method (0,1,2,3), but the original Wearden et al. paper reports using the…

11 item Chalder et al fatigue scale,19 where lower scores indicate better outcomes. Each item on the fatigue scale was scored dichotomously on a four point scale (0, 0, 1, or 1).

This would seem a trivial difference, but this outcome switching will take on increasing importance as we proceed.

Based on a tip from Robert Courtney, I found the first mention of a re-scoring of the Chalder fatigue scale in the Wearden study in a BMJ Rapid Response:

 Wearden AJ, Dowrick C, Chew-Graham C, Bentall RP, Morriss RK, Peters S, et al. Nurse led, home based self help treatment for patients in primary care with chronic fatigue syndrome: randomised controlled trial. BMJ, Rapid Response 27 May 2010.

The explanation that was offered for the re-scoring in the Rapid Response was:

Following Bart Stouten’s suggestion that scoring the Chalder fatigue scale (1) 0123 might more reliably demonstrate the effects of pragmatic rehabilitation, we recalculated our fatigue scale scores.

“Might more reliably demonstrate…”? Where I come from, we call this outcome switching, p-hacking, a questionable research practice, or simply cheating.

In the original reporting of the trial, effects of exercise were not significant at follow-up. With the rescoring of the Chalder fatigue scale, these results now become significant.

A physician who suffers from myalgic encephalomyelitis (ME) – what both the PACE investigators and Cochrane review term “chronic fatigue syndrome” – sent me the following comment:

I have recently published a review of the PACE trial and follow-up articles and according to the Chalder Fatigue Questionnaire, when using the original bimodal scoring I only score 4 points, meaning I was not ill enough to enter the trial, despite being bedridden with severe ME. After changing the score in the middle of the trial to Likert scoring, the same answers mean I suddenly score the minimum number of 18 to be eligible for the trial yet that same score of 18 also meant that without receiving any treatment or any change to my medical situation I was also classed as recovered on the Chalder Fatigue Questionnaire, one of the two primary outcomes of the PACE trial.

So according to the PACE trial, despite being bedridden with severe ME, I was not ill enough to take part, ill enough to take part and recovered all 3 at the same time …

Yet according to Larun et al. there’s nothing wrong with the PACE trial.
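The physician's arithmetic can be checked with a short sketch. The response pattern below is hypothetical, chosen only because it reproduces the two totals in the comment. The thresholds are the ones quoted in this post (Likert 18 or more for entry, from the physician's comment; bimodal 6 or more for entry and Likert 18 or less for the post-hoc recovery threshold, from Sam Carter's comment quoted further down) and have not been checked here against the trial documents.

```python
def bimodal_score(responses):
    """Bimodal scoring: the four response options count as 0, 0, 1, 1."""
    return sum(1 for r in responses if r >= 2)


def likert_score(responses):
    """Likert scoring: the four response options count as 0, 1, 2, 3."""
    return sum(responses)


# Hypothetical answers to the 11 items, each coded 0-3 by response option.
# Chosen only to reproduce the totals reported in the physician's comment.
responses = [1] * 7 + [3, 3, 3, 2]

bimodal = bimodal_score(responses)
likert = likert_score(responses)

# Thresholds as quoted in this post (assumptions, not verified independently).
BIMODAL_ENTRY_MIN = 6     # bimodal score needed to enter the trial
LIKERT_ENTRY_MIN = 18     # Likert score needed to enter, per the physician
LIKERT_RECOVERY_MAX = 18  # post-hoc recovery threshold for fatigue

status = {
    "eligible_under_bimodal": bimodal >= BIMODAL_ENTRY_MIN,
    "eligible_under_likert": likert >= LIKERT_ENTRY_MIN,
    "recovered_under_likert": likert <= LIKERT_RECOVERY_MAX,
}

print(f"bimodal={bimodal}, likert={likert}")
for name, value in status.items():
    print(f"{name}: {value}")
```

With these assumed thresholds, one and the same set of answers is too mild for entry under bimodal scoring, just severe enough for entry under Likert scoring, and already within the recovery range.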

Inclusion of the White et al PACE trial in the meta-analysis

Results of the Wearden et al. FINE trial were available to the PACE investigators when they performed the controversial switching of outcomes for their trial. This should be taken into account in interpreting Larun’s defense of the PACE investigators in response to a comment from Tom Kindlon. She stated:

 You particularly mention the risk of bias in the PACE trial regarding not providing pre-specified outcomes however the trial did pre-specify the analysis of outcomes. The primary outcomes were the same as in the original protocol, although the scoring method of one was changed and the analysis of assessing efficacy also changed from the original protocol. These changes were made as part of the detailed statistical analysis plan (itself published in full), which had been promised in the original protocol. These changes were drawn up before the analysis commenced and before examining any outcome data. In other words they were pre-specified, so it is hard to understand how the changes contributed to any potential bias.

I think that what we have seen here so far gives us good reason to side with Tom Kindlon versus Lillebeth Larun on this point.

Also relevant is an excellent PubMed Commons comment by Sam Carter, “Exploring changes to PACE trial outcome measures using anonymised data from the FINE trial.” His observations about the Chalder fatigue questionnaire:

White et al wrote that “we changed the original bimodal scoring of the Chalder fatigue questionnaire (range 0–11) to Likert scoring to more sensitively test our hypotheses of effectiveness” (1). However, data from the FINE trial show that Likert and bimodal scores are often contradictory and thus call into question White et al’s assumption that Likert scoring is necessarily more sensitive than bimodal scoring.

For example, of the 33 FINE trial participants who met the post-hoc PACE trial recovery threshold for fatigue at week 20 (Likert CFQ score ≤ 18), 10 had a bimodal CFQ score ≥ 6 so would still be fatigued enough to enter the PACE trial and 16 had a bimodal CFQ score ≥ 4 which is the accepted definition of abnormal fatigue.

Therefore, for this cohort, if a person met the PACE trial post-hoc recovery threshold for fatigue at week 20 they had approximately a 50% chance of still having abnormal levels of fatigue and a 30% chance of being fatigued enough to enter the PACE trial.

A further problem with the Chalder fatigue questionnaire is illustrated by the observation that the bimodal score and Likert score of 10 participants moved in opposite directions at consecutive assessments i.e. one scoring system showed improvement whilst the other showed deterioration.

Moreover, it can be seen that some FINE trial participants were confused by the wording of the questionnaire itself. For example, a healthy person should have a Likert score of 11 out of 33, yet 17 participants recorded a Likert CFQ score of 10 or less at some point (i.e. they reported less fatigue than a healthy person), and 5 participants recorded a Likert CFQ score of 0.

The discordance between Likert and bimodal scores and the marked increase in those meeting post-hoc recovery thresholds suggest that White et al’s deviation from their protocol-specified analysis is likely to have profoundly affected the reported efficacy of the PACE trial interventions.
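Carter's percentages follow directly from the counts he reports, and a small sketch also shows how the two scoring systems can move in opposite directions. The counts are taken from the comment above. The two response patterns are hypothetical, invented only to illustrate the mechanism, and are not FINE trial data.

```python
# Counts taken from Sam Carter's comment on the FINE trial data.
met_recovery_threshold = 33  # Likert CFQ score <= 18 at week 20
still_abnormal = 16          # of those, bimodal CFQ score >= 4
still_pace_eligible = 10     # of those, bimodal CFQ score >= 6

pct_abnormal = 100 * still_abnormal / met_recovery_threshold
pct_eligible = 100 * still_pace_eligible / met_recovery_threshold
print(f"still abnormally fatigued: {pct_abnormal:.1f}%")
print(f"still fatigued enough to enter PACE: {pct_eligible:.1f}%")


def bimodal_score(responses):
    """Bimodal scoring: the four response options count as 0, 0, 1, 1."""
    return sum(1 for r in responses if r >= 2)


def likert_score(responses):
    """Likert scoring: the four response options count as 0, 1, 2, 3."""
    return sum(responses)


# Hypothetical participant at two consecutive assessments (11 items, coded 0-3).
first = [1] * 11              # every item answered with the option coded 1
second = [2, 2, 2] + [0] * 8  # three items coded 2, the rest coded 0

likert_change = likert_score(second) - likert_score(first)
bimodal_change = bimodal_score(second) - bimodal_score(first)

# Lower scores indicate less fatigue, so a negative change is an improvement.
print(f"Likert change: {likert_change}")
print(f"bimodal change: {bimodal_change}")
```

In this illustration the Likert total falls from 11 to 6 while the bimodal total rises from 0 to 3, so the same participant improves on one scale and deteriorates on the other. The all-ones pattern also gives the Likert total of 11 that Carter cites for a healthy person.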

Compare White et al.’s “more sensitively test our hypotheses” to Wearden et al.’s “might more reliably demonstrate…” explanation for switching outcomes.

A correction is needed to the review’s assessment of risk of bias for the White et al. PACE trial:

[Figure: the review’s risk of bias assessment for White et al.]

A figure on page 68 shows results of a subanalysis with the switched outcomes at the end of treatment.

[Figure: Analysis 1.1, fatigue at end of treatment]

This meta-analysis concludes that exercise therapy produced an almost 3-point drop in fatigue on the rescored Chalder scale at the end of treatment.

Analysis 1.2. Comparison 1 Exercise therapy versus treatment as usual, relaxation or flexibility, Outcome 2 Fatigue (follow-up).

A table on page 69 shows results of a subanalysis with the switched outcomes at follow up:

[Figure: Analysis 1.2, fatigue at follow-up]

This meta-analysis depends entirely on the revised scoring of the Chalder fatigue scale and on the FINE and PACE trials. It suggests that the 3-point drop in fatigue persists at follow-up.

But Cochrane should have stuck with the primary outcomes specified in the original trial registrations. That would have been consistent with what Cochrane usually does, what it says it did here, and what its readers expect.

Readers were not at the meeting that the PACE investigators financed and cannot get access to the data on which the Cochrane review depends. So they depend on Cochrane as a trusted source.

I am sure the results would be different if the expected and appropriate procedures had been followed. Cochrane should alert readers with an Expression of Concern until the record can be corrected or the review retracted.

 Now what?

Is it too much to ask that Cochrane get out of bed with the PACE investigators?

What would Bill Silverman say? Rather than speculate about someone whom neither Dr. Tovey nor I have ever met, I ask Dr. Tovey: “What would Lisa Bero say?”



  1. The claim that ‘Likert scoring’ is more sensitive than the bimodal scoring seems silly to me. I read the claim as being like someone saying let’s measure in millimeters rather than centimeters. But that is not what is happening.

    The different scoring techniques have different orderings. There are sets of answers such that, if patients A and B give them, patient A is more fatigued than patient B on one marking scheme whereas patient B is more fatigued than patient A on the other. This is of course why some patients both improved and got worse in the FINE trial as measured by the different scoring methods.

    The effect of this is that one or the other of the scoring methods cannot provide a linear (interval) scale that represents a proxy for fatigue. There is a question of which is most likely to be an interval scale on fatigue but I see no real validation for either in this respect.

    One paper on the CFQ points out that there are two major principal components, one representing mental fatigue and one physical fatigue (along with a smaller component if I remember correctly). This means that the scale is a composite scale – mental and physical fatigue may often be correlated but from a few observations sometimes the relationships change. The two components are not even evenly weighted, with fewer questions relating to mental fatigue than physical fatigue. So there is an effective utility function, to use economics terminology, where physical fatigue is considered more important than mental fatigue. Yet this choice seems arbitrary and to my mind means that the questions are unlikely to lead to an interval scale of fatigue however they are scored.

    Of course the CFQ questions are very confusing anyway in terms of the weird temporal content of the questions. The scoring as well is very dodgy as it is not symmetrical, with two ways of scoring fatigue, one score for normal and one for better than normal. The form of the response does not take the form of a Likert item despite them using the term. The whole scale is not a Likert scale either (as in the set of questions) as this requires a highly correlated set of questions basically asking the same thing. The form of the questionnaire seems likely to introduce non-linearity when mapping from concept to question answers and back to score. It seems unlikely to be an interval scale yet is used as such.

    The point is that any meta-analysis based on the mean difference on the CFQ scale is really dodgy. Using a mean difference assumes an interval scale. The two scoring methods cannot both offer interval scales. One could, but they have failed to demonstrate this or provide any evidence that switching scoring methods is more likely to give an interval scale.

  2. I don’t understand something about this.

    In his comment on the Cochrane review, Kindlon raised concern about the change in FINE scoring from bimodal to Likert, and received this response from Larun:

    “You suggest that the decision to use the 33-point fatigue scores in our analysis may bias the results because there is no statistically significant difference at the 11-point data at 70 weeks. This statement suggests that there is a statistically significant difference when using the 33-point data, but if you look into analysis 1.2 that is not the case. At 70 week we report MD -2.12 (95% CI -4.49 to 0.25) for the FINE trial, i.e. not statistically significant.”

    That could mean that they had not used the results from the BMJ rapid response cited above (which did report a statistically significant difference), but had generated different results using data from the FINE researchers (perhaps correcting an error). But the review also states: “For this updated review, we have not collected unpublished data for our outcomes but have used data from the 2004 review (Edmonds 2004) and from published versions of included articles.”

    I don’t see where the FINE results in the Cochrane analysis 1.2 came from. They cite Wearden 2010, but I couldn’t see where these results were reported there. Have I missed something, or spotted another problem?

  3. Thanks James, the Cochrane review is being used in Australia by members of Andrew Lloyd’s fatigue centre and lifestyle clinic as a basis for claiming the effectiveness of CBT and GET. They’ve also invented cognitive activity therapy. The package has been modularised and they have funding to trial teaching the medical profession about how to manage CFS.

  4. PACE was unblinded with subjective primary outcomes. Instead of doing everything they could to minimize bias, the authors acted in ways to maximize bias. The authors circulated a newsletter during the trial, including testimonies of patients saying that therapy had helped them so much. The details of this have been documented by David Tuller. One goal of the type of CBT and GET employed in PACE is to get the patient to view themselves as less ill. This can easily give the impression of improvement even without any real improvement taking place. The misleading claims of recovery were made by lowering the SF-36 physical functioning score required to count as recovered from 85 to 60 while falsely claiming that half the working-age population had a score of 85 or lower, while in reality only 17.7% score lower.

    PACE does not belong in a “low risk of bias” category, and does not even belong in a review, because it’s fatally flawed and fraudulent.

  5. This devalues the whole concept of Cochrane reviews, which had already been seriously damaged by the review of cycle helmets, conducted by the world’s most vociferous helmet proponents and highly selective in the evidence reviewed.

    The whole ethos of Cochrane reviews is that they are independent, unbiased and reliable; no longer perhaps.


