Simpson’s paradox and Judea Pearl’s Causal Calculus

Michael Nielsen wrote an interesting, informative, and lengthy blog post on Simpson’s paradox and causal calculus titled “If correlation doesn’t imply causation, then what does?”  Nielsen’s post reminded me of Judea Pearl‘s talk at KDD 2011 where Pearl described his causal calculus.  At the time I found it hard to follow, but Nielsen’s post made it more clear to me.


Causal calculus is a way of reasoning about causality if the independence relationships between random variables are known even if some of the variables are unobserved.  It uses notation like

$\alpha$ = P( Y=1 | do(X=2))

to mean the probability that Y=1 if an experimenter forces the X variable to be 2. Using the Pearl’s calculus, it may be possible to estimate $\alpha$ from a large number of observations where X is free rather than performing the experiment where X is forced to be 2.  This is not as straight forward as it might seem. We tend to conflate P(Y=1 | do(X=2)) with the conditional probability P(Y=1 | X=2). Below I will describe an example1, based on Simpson’s paradox, where they are different.

Suppose that there are two treatments for kidney stones: treatment A and treatment B.  The following situation is possible:

  • Patients that received treatment A recovered 33% of the time.  
  • Patients that received treatment B recovered 67% of the time.  
  • Treatment A is significantly better than treatment B.

This seemed very counterintuitive to me.  How is this possible?

The problem is that there is a hidden variable in the kidney stone situation. Some kidney stones are larger and therefore harder to treat and others are smaller and easier to treat.  If treatment A is usually applied to large stones and treatment B is usually used for small stones, then the recovery rate for each treatment is biased by the type of stone it treated.

Imagine that

  • treatment A is given to one million people with a large stone and 1/3 of them recover,
  • treatment A is given to one thousand people with a small stone and all of them recover,
  • treatment B is given to one thousand people with a large stone and none of them recover,
  • treatment B is given to one million people with a small stone and 2/3 of them recover.

Notice that about one-third of the treatment A patients recovered and about two-thirds of the treatment B patients recovered, and yet, treatment A is much better than treatment B.  If you have a large stone, then treatment B is pretty much guaranteed to fail (0 out of 1000) and treatment A works about 1/3 of the time. If you have a small stone, treatment A is almost guaranteed to work, while treatment B only works 2/3 of the time.

Mathematically P( Recovery | Treatment A) $\approx$ 1/3   (i.e.  about 1/3 of the patients who got treatment A recovered).

The formula for P( Recovery | do(Treatment A)) is much different.  Here we force all patients (all 2,002,000 of them) to use treatment A.  In that case,

P( Recovery | do(Treatment A) ) $\approx$ 1/2*1/3 + 1/2*1 = 2/3.

Similarly for treatment B, P( Recovery |  Treatment B) $\approx$ 2/3 and

P( Recovery | do(Treatment B) ) $\approx$ 1/3.


This example may seemed contrived, but as Nielsen said, “Keep in mind that this really happened.”


Edit Aug 8,2013:  Judea Pearl has a wonderful write-up on Simpson’s paradox titled “Simpson’s Paradox: An Anatomy” (2011?).  I think equation (9) in the article has a typo on the right-hand side.  I think it should read

$$ P (E |do(\neg C)) = P (E |do(\neg C),F) P (F ) +P (E | do(\neg C), \neg F) P (\neg F ).$$


  1. antianticamper’s avatar

    Both your writeup and Mike’s were quite good. I bought Pearl’s book after pondering Simpson’s paradox for awhile and deciding, like most people, that it is surprising and important. I abandoned the book because Pearl is a very unclear writer and the application chapter was terribly disappointing. It is interesting that thus far there are few truly significant applications. But now my curiosity is piqued and so maybe I’ll return to the book.

  2. hundalhh’s avatar

    I remember that Dr Pearl gave about three example applications of causal calculus (CC) during his 2011 KDD talk. If I recall correctly, one application was using age related medical data from one city on the US west coast to support medical interventions in another city with a different age distribution on the east coast. I have forgotten the other two applications. I found the talk difficult to understand, but I think that CC is not really that difficult—it’s just new to us. I think that CC is easier than integral calculus but harder than trigonometry. Here are a few more references.

Comments are now closed.