I started my post-doc at McGill University with the awesome Jonathan Kimmelman in 2012. I had just finished working with the CDC on ideas about research strategy and portfolio optimization. When I arrived in Montreal, Jonathan had been pondering similar questions about how we (the scientific community) should be analyzing the success and failure of clinical trials.

One particular article loomed large in Jonathan’s and my early discussions: a piece by Benjamin Djulbegovic and colleagues in Nature. They had looked at morbidity and mortality outcomes across a cohort of 860 phase 3 trials and found a nice bell-curve distribution centered on no difference. This showed that about half the time, a phase 3 trial finds that the experimental intervention is better; the other half of the time, the experimental intervention is worse.

They argued that this distribution was a good sign, since it seemed to indicate that the principle of clinical equipoise (which demands genuine uncertainty or disagreement in the expert community about the relative therapeutic merits of the interventions in a clinical trial) was being satisfied.
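
To make that intuition concrete, here is a minimal toy simulation (my own illustration, not Djulbegovic et al.’s analysis; the effect-size and noise parameters are made up): if the true effects of products entering phase 3 are genuinely centered on no difference, as equipoise demands, then the observed trial results split roughly 50/50.

```python
import numpy as np

rng = np.random.default_rng(0)

n_trials = 10_000  # hypothetical portfolio of phase 3 trials
# Equipoise: true effects centered on no difference between arms.
true_effects = rng.normal(loc=0.0, scale=0.2, size=n_trials)
# Sampling noise from running a finite-sized trial.
sampling_error = rng.normal(loc=0.0, scale=0.1, size=n_trials)
observed_effects = true_effects + sampling_error

# Roughly half the observed results favor the experimental arm.
print(f"Fraction favoring the experimental arm: {np.mean(observed_effects > 0):.2f}")
```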

But the implication for the research enterprise as a whole is rather counter-intuitive: Products that make it into phase 3 trials should fail about half the time. This is quite at odds with how product manufacturers think. Naturally enough, when manufacturers take a product into phase 3, they want to maximize the likelihood of success. After all, a phase 3 trial costs millions of dollars and usually takes years to complete. This is a huge investment. So if you are going to invest that much, you’d like to be confident that you are going to succeed and come out the other side with evidence of an effective product.

But “no”, says the ethical analysis. If success can be reliably predicted in advance of the phase 3 trial, then we ought not to do the trial. Indeed, it would be unfair (and harmful) to participants in the comparator arm and it would be a waste of research resources. If the phase 3 chance of success is too high, then we should approve the drug after phase 2. Therefore, a substantial proportion of phase 3 trials should be negative.

Lovers of logic problems may spot the regress here: If we’re too successful in phase 3 and shift the burden of evidence to phase 2, then phase 2 essentially becomes phase 3 and we have the same problem again.

So the fundamental problem is really this: In an ethical, well-ordered research enterprise, what should be the tolerance for failed drug development efforts? How often should products fail in phase 3? How often should they fail in phase 2? What heuristics or principles should we use to answer these questions?

There are no easy answers to these questions, and Jonathan and I spent many hours discussing them. We eventually boiled our analysis down to commentary length and got it published (my first ever peer-reviewed publication!) in Science Translational Medicine in 2013: Ethics, Error, and Initial Trials of Efficacy.

We articulated 4 heuristics that could be used to guide an answer to the question: “How predictive or rigorous should a phase 2 trial be?” (The idea being that by making the phase 2 trial more rigorous, you make it more like the phase 3 trial and thereby generate evidence that is more predictive of what you are likely to see in phase 3.)

We argued that (1) failure rates for similar interventions; (2) the volume of promising products in the pipeline; (3) the vulnerability of the patient population; and (4) the social utility of the knowledge from a decisive phase 3 were all informative guides for dialing the rigor of phase 2 trials up or down. (See figure 1 here.)

Because this article is now open access, I won’t rehearse the whole argument here. If you are intrigued, please do go and give it a read. (Warning: It is a dense piece of writing—but worth it, I promise!)

However, there are some interesting wrinkles that we never got into in that article that are more on my mind these days:

One wrinkle is whether or how the tolerance for failure changes depending upon the perspective. From a system-level perspective (like the one Djulbegovic et al. adopted in their article), it may be that 50% late-phase failure is the right target. But what about the small or medium-sized biotech whose existence as a company may depend upon a compelling late-phase demonstration of efficacy? It seems absurd that they should spend years getting a product to phase 3 and then bet their future on a coin flip.

Yet, this "absurdity" is probably the case: As a scientific community, we probably should aim to keep phase 3 success close to 50%. And thankfully(?) for the individual biotech, the industry seems to have recognized this risk, and so big pharma (who can better tolerate 50% failure) tends to buy up biotechs on the basis of early-phase promise and then flip the coin.

Another wrinkle: How should we judge a product failure? Underdetermination teaches us that a negative outcome in a clinical trial is not necessarily a sign that the product is ineffective or that the hypothesis is false. Maybe a product “failed” because the trial was testing the wrong population or measuring the wrong outcome. Maybe it failed simply because of random variation, and if we ran the trial again it would (most likely) be a roaring success.
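
To put a rough number on that last point, here is a back-of-the-envelope sketch (the effect size, sample size, and alpha are illustrative assumptions, not figures from any particular trial): even a genuinely effective product tested in a conventionally well-powered design will return a negative phase 3 result a meaningful fraction of the time.

```python
from scipy.stats import norm

# Hypothetical inputs: a real standardized effect of 0.25, tested at
# alpha = 0.05 (two-sided) with 250 patients per arm.
effect = 0.25
n_per_arm = 250
alpha = 0.05

# Normal-approximation power for a two-sample comparison of means (unit variance).
se = (2 / n_per_arm) ** 0.5
z_crit = norm.ppf(1 - alpha / 2)
power = 1 - norm.cdf(z_crit - effect / se)

print(f"Power: {power:.2f}")                                                  # ~0.80
print(f"Chance of a negative trial despite a real effect: {1 - power:.2f}")   # ~0.20
# An identical re-run would succeed with the same ~0.80 probability, so a single
# negative readout does not, by itself, settle whether the product works.
```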

I remember shortly after Jonathan’s and my piece came out, I was on a conference panel with Bernard Ravina (then at Biogen, now at Praxis), and I asked him about this exact issue: Given that you can have a false negative in phase 3, how did he/Biogen think about negative late-phase trials?

He told me that despite the possibility of a false negative, a negative phase 3 trial does still tend to take the wind out of the product’s sails. So there is often little motivation left within the company to flip the coin again with another long, expensive trial.

There is, of course, much more to say on this topic of how much failure to tolerate in clinical trials, and I will come back to it later in the series when I discuss my analyses of the CETP inhibitor portfolio (which was full of expensive late-phase product failures).

But to sum up this post: As Jonathan’s and my article nears 10 years old, it is interesting to reflect on whether and how the industry has changed in that time. The current amyloid/Alzheimer’s debacle signals to me that there is still plenty of opportunity for thinking more deeply and systematically about how to judge, tolerate, and learn from product failures.

I am excited to say (without giving away privileged information) that Prism will soon be exploring exactly that space and building new tools for it. Stay tuned!