UKCTOCS: Ovaries, P-values & Questions
Ovarian cancer is a horrible disease. It is often asymptomatic until late in its course and then causes a lot of suffering. This is a disease with an extraordinarily high morbidity and currently has a 5-year survival of only 40%. So wouldn’t it be excellent if we could find this disease early in its course, intervene and cure it?
This has been the goal of several large population screening studies over the last few decades: to show that screening and subsequent early detection will improve the outcomes for women with ovarian malignancy. Last week The Lancet published the latest mega cohort – Ovarian cancer screening and mortality in the UK Collaborative Trial of Ovarian Cancer Screening (UKCTOCS): a randomised controlled trial.
This is the largest ever trial looking at ovarian cancer screening – with more than 200,000 participants randomised and exceptionally well followed and documented. This trial was roughly twice the size of the ovarian component of the American PLCO trial published in JAMA in June 2011. The American PLCO trial failed to find any benefit from screening for ovarian cancer and did find a significant burden as a result of overdiagnosis, false positives and subsequent downstream complications.
So many people in the medical community have been awaiting the outcomes of the UKCTOCS trial – and on the 17th of December it landed. So – what was the outcome? Did they find a benefit? Will we be changing our practice based on this new data? Well this is where it gets interesting. The UKCTOCS trial has been publicised widely as a “positive” result – i.e. a significant benefit for screening. However, when you read the paper things are really not so clear. In fact – they are quite unclear… muddy. So I am going to do a bit of a deep dive into the paper and try to make sense of it all. But first let’s look at the coverage in the mainstream and medical media. You need to know what your patients are hearing and go beyond the sound bites in the medical media to get this one right.
In Australia the two biggest medical News rags reported the trial very differently:
- In Australian Doctor the subheading was: “Evidence for ovarian cancer screening not there yet.”
- “The largest-ever trial of screening for ovarian cancer has found an “encouraging” late effect on mortality, but it is too early to back annual screening.”
- The Medical Observer ran with: “Blood test ‘could reduce ovarian cancer deaths by 20%”
- “SCREENING based on an annual blood test may help reduce the number of women dying from ovarian cancer by around 20%.”
So if you are the sort of doctor that reads the headlines and skims the abstracts then you are probably getting some mixed messages. Unfortunately neither of these articles delved into the statistics underlying the claims made in the discussion by the authors.
In the USA the mainstream media also reported this important trial with similar headlines.
CBS News was very positive. The headline was: “Blood test for ovarian cancer saves lives, study finds”. They interviewed a few medics and if you read the article the benefits are clarified and it is suggested that we need to wait 3 more years to check if it really works. But the expert interviewed, Dr Agus, is quoted as saying: “If this [screening] were implemented in the United States we would save about 3,500 lives per year.” Interesting… that is the sort of statement that tends to have patients knocking down our doors to get this new ‘super test’.
The New York Times was more circumspect. They interviewed Dr Menon – one of the lead authors. Their headline was: “Early Detection of Ovarian Cancer May Become Possible” They published a series of bites from a range of expert medical professionals which ranged from positive, through to skeptical and did include a number of the key statistical points. The NYTimes also made note of the author’s potential conflict-of-interest. This was probably the most rigorous coverage of the trial I have read.
So you need to know what the UKCTOCS trial did and did not find. Your patients will be asking you for advice and may request the “test”. So let’s break it down and look at the study. Let’s do a super PICO analysis.
POPULATION: This was a huge trial! 202,638 women were randomised. They were between 50 and 74 years of age at randomisation. They were largely postmenopausal. The study was run in NHS Trusts all over the UK, including Northern Ireland. More than 96% of the participants were “White” – so a very Anglo population. As you might expect from such a large cohort – the baseline stats were all very evenly matched across the groups and a good representation of the sort of patients that we all see and treat. Variables like age, menstrual history, HRT use, parity and co-morbid cancers were what you would expect.
More than 1.2 million women were invited to screening from the massive NHS databases – the idea here was to minimise the “healthy volunteer effect” [HVE], which can skew results in screening trials – i.e. you are less likely to find as many cancers if all of your volunteers are clean living, non-smoking, health conscious folk. The authors of the UKCTOCS also published an analysis of the “healthy volunteer effect” in 2011 in Trials. However they concluded that their invitation strategy did not reduce the HVE – this kinda makes sense when you consider that only ~ 1 in 6 women accepted the invitation to participate. I imagine they would represent the health conscious upper 1/6th of the population! The HVE meant that the women who participated in the trial died at a particularly low rate – overall only 37% of the predicted mortality. So the external validity is limited: these results are difficult to apply to the population as a whole.
Importantly it should be noted that women at increased risk [defined as > 10% lifetime risk] of ovarian cancer due to family history (
INTERVENTION: the women were randomised 1:1:2 into 3 groups. There were 2 separate intervention groups each consisting of more than 50,000 women. Obviously the women and clinicians could not be blinded to the intervention, however the outcomes team were masked when analysing the clinical outcomes.
The original protocol was for 6 annual screens and 7 years of follow-up.
The compliance with the screening protocol was about 80% which is roughly the same as similar large population-based screening trials.
However, as noted above, the healthy-volunteer effect messed with the calculations. The mortality rate of the trial participants was much lower than that expected in the general population. The women in the trial died at 37% of the rate anticipated at the outset and by the end of the 7 years only at about half the mortality rate expected. This is an issue. Fewer deaths means that there was less potential to detect a real difference between the groups – the efficacy of the screening would be watered down by the relative good health of the participants. However the healthy volunteer effect also means that there are likely fewer deaths from other causes – so it may bias the data to make screening look better than it is… but we don’t know for sure. See the discussion of “all-cause mortality” below.
The research team decided to extend the screening period by 3 years in order to improve the ability to detect a difference. In plain speak – they increased the ‘dosage of screening’ to try to detect a benefit. Subsequently follow-up was also extended… median follow up was about 11 years at the final curtain. With the extension the power calculation was redone: the power to detect a 30% difference in mortality dropped from 90% to 80%.
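To see why fewer deaths hurts power, here is a minimal sketch using Schoenfeld's approximation for a two-arm survival comparison. This is an assumption chosen for illustration, not necessarily the method the trial statisticians used. The function names are hypothetical, and the 1/3 allocation assumes one screening arm compared against the double-sized control arm.

```python
from math import log, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def events_required(hazard_ratio, power, alpha=0.05, alloc_screened=1 / 3):
    """Approximate number of deaths needed (Schoenfeld's formula).

    alloc_screened is the share of women in the screening arm for a pairwise
    comparison; 1/3 is an assumption based on the 1:1:2 randomisation.
    """
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = _STD_NORMAL.inv_cdf(power)
    p1, p2 = alloc_screened, 1 - alloc_screened
    return (z_alpha + z_beta) ** 2 / (p1 * p2 * log(hazard_ratio) ** 2)


def power_for_events(events, hazard_ratio, alpha=0.05, alloc_screened=1 / 3):
    """Approximate power achieved with a given number of observed deaths."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    p1, p2 = alloc_screened, 1 - alloc_screened
    return _STD_NORMAL.cdf(sqrt(events * p1 * p2) * abs(log(hazard_ratio)) - z_alpha)


# A 30% mortality reduction corresponds to a hazard ratio of 0.70.
for target in (0.90, 0.80):
    needed = events_required(0.70, target)
    print(f"power {target:.0%}: about {needed:.0f} deaths needed")

# Hypothetical event counts, only to show how power falls as deaths fall.
for deaths in (400, 300, 200):
    print(f"{deaths} deaths -> power {power_for_events(deaths, 0.70):.0%}")
```

The point of the sketch is that power depends on the number of deaths observed, not the number of women enrolled, so a healthier-than-expected cohort needs longer follow-up to reach the same power.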
So here is how the screening tests went:
- Ultrasound screening: women underwent annual trans-vaginal ultrasound and were followed up as follows:
  - Normals – ongoing annual screening scans
  - Unsatisfactory – had a repeat scan done in 3 months
  - Abnormal scan – these women had a repeat scan with a senior sonographer within 6 weeks.
- Multimodal screening (MMS)
  - This screening strategy was a bit more complicated. It is based upon annual CA-125 blood levels. Women were tested and their results fed into an algorithm called ROCA [Risk of Ovarian Cancer Algorithm] to detect rises in CA-125 over baseline, rather than using a standardised lab cut-off.
  - The ROCA algorithm triaged the women into three groups:
    - Normal: continued annual screening
    - Intermediate: repeat CA-125 in 3 months
    - Elevated: these got a repeat CA-125 and a TV ultrasound within 6 weeks
  - Important note: Prof. Jacobs, the lead author, declares in the COI statement that he holds the patent for the ROCA algorithm and has financial interest in the company Abcodia which sells this test.
Women with persistently abnormal results – in either the ultrasound or the ROCA/ultrasound (MMS) group – were subsequently investigated clinically by a trial investigator. They received whatever investigations or surgeries the specialists thought were required.
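The two follow-up pathways described above can be written down as a simple lookup, which makes the difference between the arms easy to see at a glance. This is only a sketch of the triage step as summarised in this post. The labels and the function name are hypothetical, and it makes no attempt to reproduce how ROCA itself turns serial CA-125 values into a risk category.

```python
# Follow-up rules per arm, as summarised in the post (labels are hypothetical).
USS_FOLLOW_UP = {
    "normal": "annual ultrasound",
    "unsatisfactory": "repeat ultrasound in 3 months",
    "abnormal": "repeat ultrasound with a senior sonographer within 6 weeks",
}

MMS_FOLLOW_UP = {
    "normal": "annual CA-125",
    "intermediate": "repeat CA-125 in 3 months",
    "elevated": "repeat CA-125 and trans-vaginal ultrasound within 6 weeks",
}

PATHWAYS = {"USS": USS_FOLLOW_UP, "MMS": MMS_FOLLOW_UP}


def next_step(arm, result):
    """Return the next screening action for a given arm and screen result."""
    try:
        return PATHWAYS[arm][result]
    except KeyError:
        raise ValueError(f"unknown arm/result combination: {arm!r}, {result!r}") from None


print(next_step("MMS", "elevated"))
print(next_step("USS", "unsatisfactory"))
```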
There were a few papers written by the UKCTOCS team in 2011 analysing the relative sensitivity and specificity of each of the screening modalities. Both were around the 80 – 85% mark. Of note screening detected a little over half (59%) of the tumours counted in the total of ovarian cancer deaths. So at best screening will find a bit over half of the nasty tumours.
CONTROL: About 100,000 received No Screening (although there was some contamination of the control group – as would be expected in such a large study.) 4.3% of women underwent some sort of screening in the control group based upon a questionnaire at the end of the trial.
OUTCOMES: The primary outcome used in this trial is ovarian cancer death by Dec 31, 2014. Below is the key table of results from the paper:
Note – this is a disease-specific mortality outcome. The trial was powered to detect a 30% mortality benefit for screening. There is no data provided in the paper about “all-cause mortality”. This is a bit odd as the best measure of the effect of a screening intervention would be “all-cause mortality”. The use of disease-specific mortality is useful to tell us if the screening actually does pick up cancers earlier and prevent death from ovarian malignancy, however… we, and our patients, want to know if it will make them live longer. Only all-cause data can tell us this. If screening means that we diagnose ovarian cancer early but increase other mortality, e.g. more PEs or surgical deaths, then we are not doing the right thing by our patients.
I did email the authors and ask if there were any numbers on “all-cause” mortality. The response did not throw any light on this stat. I do find it a little unusual that in such a large, well-conducted trial with great follow-up that this data was not published as part of the trial. Even if it were included as a secondary outcome – we could at least look and get a feel of the overall benefits. So I remain a bit confused as to why it didn’t get included.
Ovarian cancer was defined as: “malignant neoplasms of the ovary, which include primary non-epithelial ovarian cancer, borderline epithelial ovarian cancer, and invasive epithelial ovarian cancer; malignant neoplasms of the fallopian tube; and undesignated malignancies of the ovaries, fallopian tube, or peritoneum.”
Specifically, primary peritoneal cancer was not part of the primary outcome – although the WHO reclassification of cancers in 2014 threw a bit of a spanner in the works. The analysis therefore includes a “secondary analysis” which includes both primary ovarian cancer AND primary peritoneal cancer in the mortality numbers. Hmmm… not sure about this one! Beware the analysis of secondary outcomes.
The analysis also broke the mortality reduction numbers into two time periods – 0 – 7 years AND 7 – 14 years. This is an interesting way to crunch some numbers. As with any long term mortality study – it is a basic fact that more people die the longer that you follow them. Hence there is more likely to be a benefit shown later in follow up. So if you are trying to break through the magical p < 0.05 line – then this is one way to do it.
Now if you scan the table above you will notice a few things:
- There are two different statistical techniques used to analyse the data – the Cox model and the Royston-Parmar model. These are both accepted ways of looking at data such as the survival data in this trial. However if you go back and look at the UKCTOCS trial protocol (available here at IWH website ) you will read the planned analysis was “a Cox regression model will be used to model the difference in mortality rate between the control arm and each individual screen arm.” So the Royston-Parmar model was not originally planned. Is this a problem? Well suppose we used 20 different models to analyse the mortality curves and then only published the one or two that showed a benefit. This is why we have a trials register – to ensure transparency of trial design and analysis. Note: Prof. Parmar was the head statistician on the UKCTOCS trial – so understandable that his method was used as an analysis tool.
- None of the Cox models reached statistical significance – they all included “0” in the confidence interval. However the R-P model did just squeak under the P-value of 0.05 for a few of the stats – namely those where “prevalent cases were excluded” which brings me to point #3.
- “Prevalent cases” were excluded from the analysis. Makes sense – we should not include women who already have ovarian cancer at the outset. But… hang on a minute. How did they know that these women had a cancer before they started screening? Well it is hard to answer that question. It appears that they looked at the CA125 trend in women who were diagnosed with ovarian cancer and extrapolated backwards in time to decide which women likely had a tumour at day 1. Hmmm… so how did the women in the “no screening” group get the same treatment if they never had a CA125? They did a post hoc assay on stored serum samples taken at enrolment and decided who probably already had a ‘prevalent cancer’ based on the CA125 level. I do not understand how we could generalise this to an external population of real world women. We can never know who already has a tumour in day-to-day GP practice (if the women are asymptomatic) – so excluding them from the analysis seems to reduce the external validity of this trial.
- The analysis was extended to include both Ovarian and “primary peritoneal cancer” as a composite secondary. This also makes sense – primary peritoneal cancer is likely ovarian in origin as per the WHO reclassification. However if you look at the raw numbers you will see that the inclusion of this secondary outcome does favour the MMS strategy – there were 16 peritoneal cancer deaths in the MMS group and 15 in the control group. Recall that the control group was twice the size of the MMS group. So although it is a reasonable thing to analyse we need to beware of secondary outcomes and composite outcomes as they will be prone to bias.
- Of the screening groups 1634 women [50 per 10,000 screens] had unnecessary surgery i.e. surgery that yielded benign results. The rate of “false positive” surgery was described by the ratio of benign : malignant pathology. The ratios by group were: no screening 1 : 1.2, MMS 1 : 2.7 and USS 1 : 6.4. The surgical complication rate is quoted as 3.5%. Unfortunately the actual harms of the surgeries were not documented as far as I can tell from the paper. They are discussed in the 2-hour video produced at the trial launch – but bizarrely not in the actual paper. So it is hard to say what the actual true “harms of screening” are in this study. Below is the table you can find in the supplementary material which describes the types and numbers of harm events in the screening groups. For the record – in the PLCO trial the surgical complication rate in screened women was ~15% – so roughly 4 times higher than the UK group… must be better surgeons in the NHS?
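The “20 different models” thought experiment in the list above can be put into numbers. This is a minimal sketch that assumes the analyses are independent and that there is no true effect. Real analyses of the same dataset are correlated, so the true figure would differ, but the direction of the problem is the same. The function name is hypothetical.

```python
def prob_any_false_positive(n_analyses, alpha=0.05):
    """Chance that at least one of n independent analyses gives p < alpha
    when there is truly no effect."""
    return 1 - (1 - alpha) ** n_analyses


for n in (1, 2, 5, 20):
    print(f"{n:>2} analyses -> {prob_any_false_positive(n):.0%} chance of a 'significant' result")
```

The same logic applies to splitting follow-up into time periods, adding secondary outcomes and excluding subgroups: every extra look at the data is another chance to get under 0.05 by luck alone.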
OTHER INTERESTING STUFF
There was a retrospective review of the American PLCO population data where they went back and applied a “best guess” version of the ROCA algorithm to the American cohort. Titled: Potential effect of the risk of ovarian cancer algorithm (ROCA) on the mortality outcome of the Prostate, Lung, Colorectal and Ovarian (PLCO) trial, from Int J Cancer, 2012. In this exercise the authors asked: what if they had used the ROCA algorithm instead of the absolute CA125 cut-off value as a screening tool? Of course this has to be taken with a grain of salt – but they concluded that ROCA would not have shown any additional survival benefit in the PLCO cohort. One of the authors, Dr Skates, is also a co-inventor of the ROCA algorithm and co-author of the UKCTOCS trial.
There was another subgroup trial within the UKCTOCS cohort which examined the psychological impact of screening on the participants. Titled: Psychological morbidity associated with ovarian cancer screening: results from more than 23 000 women in the randomised trial of ovarian cancer screening (UKCTOCS) published in BJOG in Feb 2015. This was a prospective RCT which measured anxiety levels among women undergoing screening and subsequent investigation and surgery. The basic findings from the psychological surveys?
- Being screened did not seem to increase anxiety
- Having to have subsequent testing after an initial positive screening test did increase anxiety
- Undergoing surgery increased anxiety
- Being diagnosed with ovarian cancer had a large effect on anxiety – as one might expect.
The same research group also offered screening to “high risk” patients in a separate trial mentioned above – UK FOCSS. This was a prospective observational study (no control group) looking at the performance of annual screening using CA125 and TV ultrasound in 3500 high risk women. The screening was found to be 80% sensitive. And in this group there was not a significant “stage shift” in cancers detected. That is – screening did not move women from an advanced stage of disease at diagnosis to earlier, more treatable disease. There were a few reasons discussed for this finding and the trial is now in Phase II – in which women are screened at 4-monthly intervals and faster follow-up surgery etc are planned. So in summary – even in a high risk cohort we have not yet seen a benefit from annual screening in terms of finding earlier disease. That is the core goal of any screening program – to find earlier and treatable disease.
OVERALL THOUGHTS
I am “just a GP”. I am not a guru in biostats so I may be completely wrong here. However, I am just a GP who wants to know what to tell my patients when it comes to screening for a nasty disease. I would love to be able to do something to prevent my patients from getting diagnosed with late-stage ovarian cancer. Here are my summative thoughts:
- The UKCTOCS study was large and well conducted. The results, in my reading, do not show a significant benefit to screening.
- There are a number of statistical and methodological quirks that do raise questions about the reliability of the results and their external validity
- The conclusions of the authors are optimistic – yet at the basic scientific level – we cannot reject the null hypothesis based on these numbers.
- In time we may get more follow-up data which may change the situation. However, as of December 2015 I do not think we should be changing our practice.
- Screening for ovarian cancer remains unproven and the harms remain largely unknown.
- I would really like to see the all-cause mortality data presented in a clear manner so that we can all look at it and draw our own conclusions.
Based upon my reading of this paper and the surrounding data I do not think I will be recommending screening with any tool for ovarian cancer to my patients.
I am somewhat concerned that this paper and the media hype around it may represent the “edge of the wedge” – a foot in the door for screening which remains unproven. We have all seen and struggled through the perils of prostate cancer and mammography over-diagnosis. At best guess the “number needed to screen” is somewhere between 2000 and infinity. This is a very weak effect if any at all. We should be investing our time, money and patients’ goodwill in other health pursuits for now.
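For anyone who wants the arithmetic behind “between 2000 and infinity”: the number needed to screen is just one divided by the absolute risk reduction. A rough sketch follows, in which the function name is hypothetical and the absolute risk reductions are illustrative inputs. The 0.05% figure is simply what a number needed to screen of 2000 corresponds to.

```python
import math


def number_needed_to_screen(absolute_risk_reduction):
    """NNS = 1 / ARR, with ARR given as a fraction (0.0005 means 0.05%).

    An ARR of zero or less means no benefit, so no finite number of women
    screened prevents one death; that case is reported as infinity.
    """
    if absolute_risk_reduction <= 0:
        return math.inf
    return 1.0 / absolute_risk_reduction


# Illustrative values only: a larger benefit, the 0.05% case, and no benefit.
for arr in (0.0010, 0.0005, 0.0):
    print(f"ARR {arr:.2%} -> NNS {number_needed_to_screen(arr):,.0f}")
```

Whenever the confidence interval for the effect includes “no difference”, the upper end of the NNS range is infinity, which is why the range is so wide here.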
Love to hear your thoughts
Casey
Nice analysis. Given the difficulty in early detection through current tests, a better hope for reducing mortality from the condition than screening is probably the up and coming cancer immunotherapy.
Dear Casey,
Great thoughts. I agree with most of what you’ve written. In terms of the appraisal, it’s worthwhile using Bond/Oxford University’s appraisal sheet (attached), and following the PICO-RAMBO approach.
Thoughts:
Healthy volunteer effect
As mentioned, a threat to external validity, potentially. It is not unreasonable to use our judgement on whether this is an issue. Do we believe, for instance, that ovarian cancer and the effect of screening for this cancer behave very differently in people from other demographic groups? If these individuals are healthier, it is likely to exaggerate the effect from screening, as the participants are less likely to die from other causes. That is, we are less likely to get people who get ovarian cancer and then die from a cause other than ovarian cancer (a death that may still get listed as an ovarian cancer death). Presumably, the point of screening is to diagnose people with cancer, and to make them not-die. Unhealthy populations will mask the effect of screening, when using disease-specific mortality as an outcome. What does this mean? The direction of bias in this study from this effect is towards making disease-specific mortality look better than in “natural” populations.
Of the people in the control group who got diagnosed with ovarian cancer, how did they behave? Web figure 8 (http://www.thelancet.com/cms/attachment/2041209462/2055058848/mmc1.pdf) in the supplementary would suggest that they died at pretty much a similar rate as the general UK population. Even if this population was “healthier”, that didn’t seem to protect the individuals who developed ovarian cancer.
Outcomes
Let’s stick to the primary outcome. There are no statistically significant effects at the original length of the study. However, statistical significance (as I point out in my AFP article to come!) simply means “mathematically unusual”, and nothing more. It doesn’t point to truth, or importance. Let’s look at the point estimate and confidence interval. With the Cox regression model, even when looking over the 14 years, the confidence interval is relatively large and crosses no-effect. For both MMS and USS, no-effect is within the range of credible estimates of the true effect. There is relatively little precision to the effect. For the MMS arm, it runs from -3% (made things worse) to a 30% reduction.
I note the use of the Royston-Parmar model. I can accept a statistical argument that this is in fact the better model to use and that the Cox model underestimates effects from non-proportional hazards. However, the risk from this reasoning is that Royston-Parmar models are more likely to over-estimate later survival effects, when such an effect is detected using the statistical model.
This is analogous to the risk of underpowered studies. When a study is underpowered, the effect size estimate is likely to be an overestimate, when an effect is detected. This is the effect of using statistical significance thresholds – the results which “survive” the statistical significance threshold when power is low are likely to be erroneously large. The power of this study was 80%, for finding a 30% difference. I note that the true difference in HARD terms (i.e., actual numbers of people who died, not an estimate using the model), is 14.7% between the MMS and no screening groups. That’s an absolute risk reduction of 0.05% over 14 years, or an NNT of 2000. Remember, big confidence intervals – NNT anywhere from 1000 to infinity.
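The point about underpowered studies overestimating effects can be illustrated with a small simulation. This is a toy sketch with made-up numbers, not a re-analysis of the trial: it assumes each “study” produces a normally distributed estimate around a fixed true effect, and then looks only at the estimates that clear p < 0.05. The function name and all parameter values are hypothetical.

```python
import random
import statistics


def simulate_significance_filter(true_effect, standard_error, n_studies=20000, seed=1):
    """Simulate many studies and compare all estimates with the 'significant' ones."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    z_crit = 1.96              # two-sided 5% threshold
    all_estimates = []
    significant = []
    for _ in range(n_studies):
        estimate = rng.gauss(true_effect, standard_error)
        all_estimates.append(estimate)
        if abs(estimate) / standard_error > z_crit:
            significant.append(estimate)
    return {
        "power": len(significant) / n_studies,
        "mean_all": statistics.mean(all_estimates),
        "mean_significant": statistics.mean(significant) if significant else None,
    }


# Hypothetical true effect of 0.15 (a 15% reduction) measured with low and high precision.
for label, se in (("low power", 0.10), ("high power", 0.03)):
    result = simulate_significance_filter(true_effect=0.15, standard_error=se)
    print(label, {key: round(value, 3) for key, value in result.items()})
```

In the low-precision case the average of all estimates sits near the true effect, but the average of the “significant” ones is noticeably larger, which is the overestimation described above.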
Even when using the R-P model, the estimate doesn’t change meaningfully over 0-14 years (and neither should we expect it to). The subdivision between the 0-7 and 7-14 years is interesting.
RULE OF THUMB:
This should be considered a secondary, and exploratory analysis only. Why did we see these results? Remember, the study was not designed for this experiment. There are a number of possibilities:
- the effect is real (screening > no screening)
- the effect is a statistical bias (result of using the RP model)
- the effect is due to a confounding factor.
In terms of confounders, the MMS group in particular seemed to have a lower incidence of ovarian cancer right towards the end of the study. See web figure 9 in the supplementary. Pretty much, it seemed like there were no diagnoses of ovarian cancer at all from about years 11 to 14 in the MMS group. Why?
My hunch? This is just statistical error (i.e., chance due to the relatively small number of diagnoses). Screening for ovarian cancer does not change the histopathological process of cancers (unlike bowel and cervical cancer). There is no reason why ovarian cancer screening should have a protective effect from future diagnoses.
Now, this isn’t to say that my hunch is “true”. The point, however, is that for this sort of secondary analysis, the question you need to really ask is whether the narrative given by the authors is the most likely, or only hypothesis that explains the data. If it isn’t, then the best response is “well that was interesting, but we need to examine this more specifically”, not claim that the effect is true.
Prevalent cases
Let’s just accept that this is nonsense. In real world terms, we need to accept (and people do) that a good proportion of the benefit from cancer screening is in detecting existing cancer at the time of the first screen. If they really wanted to exclude this effect, they could simply have excluded all women diagnosed with cancer within a year of enrolment into the study rather than use the assumptions in the statistical magic of “prevalent cases”.
Clinical behaviour change
Much, much too early.
The authors neither demonstrate a statistically significant effect (meh), nor a clinically significant effect (effect size, and precision of the effect).
Harms are not documented in a satisfactory manner.
Why no all-cause mortality figures? No one would expect to see a difference that reached statistical significance, but one would at least hope that the estimate is in the right direction.
A change in screening practice needs an economic analysis. Screening is a population based intervention, and thus outcomes need to be at the population level. It’s why cervical cancer screening and breast cancer screening are at certain intervals – it’s where the cost to benefit ratio is okay.
Yours sincerely,
Michael (Dr Michael Tam)
Fantastic job, Casey. You have put an impressive effort into your analysis. Thanks, too, to Michael Tam for his detailed ‘supplement’.
Cheers
Justin Coleman