Michael Nielsen wrote an interesting, informative, and lengthy blog post on Simpson’s paradox and causal calculus titled “If correlation doesn’t imply causation, then what does?” Nielsen’s post reminded me of Judea Pearl‘s talk at KDD 2011 where Pearl described his causal calculus. At the time I found it hard to follow, but Nielsen’s post made it more clear to me.
Causal calculus is a way of reasoning about causality when the independence relationships among random variables are known, even if some of the variables are unobserved. It uses notation like
$\alpha$ = P( Y=1 | do(X=2) )
to mean the probability that Y=1 if an experimenter forces the X variable to be 2. Using Pearl's calculus, it may be possible to estimate $\alpha$ from a large number of observations where X is free rather than performing the experiment where X is forced to be 2. This is not as straightforward as it might seem. We tend to conflate P(Y=1 | do(X=2)) with the conditional probability P(Y=1 | X=2). Below I will describe an example^{1}, based on Simpson’s paradox, where they are different.
Suppose that there are two treatments for kidney stones: treatment A and treatment B. The following situation is possible:
 Patients that received treatment A recovered 33% of the time.
 Patients that received treatment B recovered 67% of the time.
 Treatment A is significantly better than treatment B.
This seemed very counterintuitive to me. How is this possible?
The problem is that there is a hidden variable in the kidney stone situation. Some kidney stones are larger and therefore harder to treat and others are smaller and easier to treat. If treatment A is usually applied to large stones and treatment B is usually used for small stones, then the recovery rate for each treatment is biased by the type of stone it treated.
Imagine that
 treatment A is given to one million people with a large stone and 1/3 of them recover,
 treatment A is given to one thousand people with a small stone and all of them recover,
 treatment B is given to one thousand people with a large stone and none of them recover,
 treatment B is given to one million people with a small stone and 2/3 of them recover.
Notice that about one-third of the treatment A patients recovered and about two-thirds of the treatment B patients recovered, and yet, treatment A is much better than treatment B. If you have a large stone, then treatment B is pretty much guaranteed to fail (0 out of 1000) and treatment A works about 1/3 of the time. If you have a small stone, treatment A is almost guaranteed to work, while treatment B only works 2/3 of the time.
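A quick sketch to check the observed (non-interventional) arithmetic, using exact rational arithmetic on the patient counts given above (the counts are the post's hypothetical numbers, not real data):

```python
from fractions import Fraction

# Patient counts from the example above
a_large_n, a_large_rec = 1_000_000, Fraction(1, 3) * 1_000_000   # 1/3 recover
a_small_n, a_small_rec = 1_000, 1_000                            # all recover
b_large_n, b_large_rec = 1_000, 0                                # none recover
b_small_n, b_small_rec = 1_000_000, Fraction(2, 3) * 1_000_000   # 2/3 recover

# Observed conditional probabilities: P(Recovery | Treatment)
p_rec_given_a = (a_large_rec + a_small_rec) / (a_large_n + a_small_n)
p_rec_given_b = (b_large_rec + b_small_rec) / (b_large_n + b_small_n)

print(float(p_rec_given_a))  # about 1/3 of treatment A patients recovered
print(float(p_rec_given_b))  # about 2/3 of treatment B patients recovered
```

So the raw recovery rates really do come out near 33% for A and 67% for B, exactly because each treatment's rate is dominated by the million-patient stone-size group it was mostly applied to.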
Mathematically P( Recovery | Treatment A) $\approx$ 1/3 (i.e. about 1/3 of the patients who got treatment A recovered).
The formula for P( Recovery | do(Treatment A)) is much different. Here we force all patients (all 2,002,000 of them) to use treatment A. In that case,
P( Recovery | do(Treatment A) ) $\approx$ 1/2*1/3 + 1/2*1 = 2/3.
Similarly for treatment B, P( Recovery | Treatment B) $\approx$ 2/3 and
P( Recovery | do(Treatment B) ) $\approx$ 1/3.
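The interventional probabilities above can be checked the same way: average each treatment's within-group recovery rate over the population's stone-size distribution (half large, half small), rather than over the patients who happened to receive that treatment. This is a minimal sketch of that adjustment computation, using the example's numbers:

```python
from fractions import Fraction

# Stone-size distribution in the whole population:
# 1,001,000 large-stone patients out of 2,002,000 total, i.e. exactly 1/2
p_large = Fraction(1_001_000, 2_002_000)
p_small = 1 - p_large

# Recovery rates within each (treatment, stone size) group
p_rec_a_large, p_rec_a_small = Fraction(1, 3), Fraction(1)
p_rec_b_large, p_rec_b_small = Fraction(0), Fraction(2, 3)

# Adjustment: P(Recovery | do(T)) = sum over stone sizes S of
#             P(Recovery | T, S) * P(S)
p_do_a = p_rec_a_large * p_large + p_rec_a_small * p_small
p_do_b = p_rec_b_large * p_large + p_rec_b_small * p_small

print(p_do_a)  # 2/3
print(p_do_b)  # 1/3
```

Note the key difference from the conditional probabilities: here each group's rate is weighted by P(S), the population frequency of the stone size, not by how often that treatment was actually given to that group.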
This example may seem contrived, but as Nielsen said, “Keep in mind that this really happened.”
Edit Aug 8, 2013: Judea Pearl has a wonderful writeup on Simpson’s paradox titled “Simpson’s Paradox: An Anatomy” (2011?). I think equation (9) in the article has a typo on the right-hand side. I think it should read
$$ P(E \mid do(\neg C)) = P(E \mid do(\neg C), F)\, P(F) + P(E \mid do(\neg C), \neg F)\, P(\neg F). $$
Both your writeup and Mike’s were quite good. I bought Pearl’s book after pondering Simpson’s paradox for a while and deciding, like most people, that it is surprising and important. I abandoned the book because Pearl is a very unclear writer and the application chapter was terribly disappointing. It is interesting that thus far there are few truly significant applications. But now my curiosity is piqued and so maybe I’ll return to the book.