## Wednesday, April 22, 2009

### Why Take Causation Seriously?

Introductory statistics students are taught that "correlation does not imply causation." If you are unfamiliar with this mantra, here's an example:
A researcher analyzes data on the relationship between test scores (SAT, ACT, etc.) and performance in college. He finds that higher SAT scores predict higher GPAs in college. Does this mean that scoring well on the SAT causes better grades in college?
Obviously not! In this example, there is likely a third factor (say, intelligence) that is positively related to both SAT scores and GPA, and that drives the observed correlation. That's the "correlation does not imply causation" mantra. If we see a relationship between two variables in the world (say, Y = a + bX), there are three classes of explanations.
1. X causes Y: "Lipitor lowers cholesterol"
2. Y causes X: "bad health causes more doctor visits"
3. Z causes X and Z causes Y: "smarter hard-working people score better on the SAT and have higher college GPA"

Notice that in this view of the world, correlation does not imply causation, but some true model of causation is out there! Our statistical model simply does not tell us which causal model is right.
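To make explanation (3) concrete, here is a small simulation. Everything in it is made up for illustration (the variable names, coefficients, and sample size are my own choices, echoing the SAT/GPA story): X has no causal effect on Y at all, yet a naive regression slope comes out strongly positive because both depend on the confounder Z.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Z is the unobserved confounder (e.g., "intelligence" in the SAT/GPA story).
z = rng.normal(size=n)

# X and Y each depend on Z, but X has NO causal effect on Y.
x = 2.0 * z + rng.normal(size=n)   # e.g., SAT score
y = 1.5 * z + rng.normal(size=n)   # e.g., college GPA

# A naive regression of Y on X still finds a strong "effect",
# entirely manufactured by the common cause Z.
slope = np.cov(x, y)[0, 1] / np.var(x)
print(round(slope, 2))
```

With these coefficients the population slope works out to (2.0 × 1.5) / (2.0² + 1) = 0.6, even though the true causal effect of X on Y is exactly zero.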

Having lived among statisticians while earning a master's degree in statistics, I encountered a strange philosophy regarding the treatment of the causal question. I am not sure how pervasive it is among all statisticians, but this philosophy is predominant among the statisticians I know. Basically, it is a two-step approach.

1. First, divide studies into experimental and observational. The statistician does this because in an experiment the researcher can rule out (2) and (3) by design, so we can conclude (1): X causes Y. In an observational study, the researcher cannot do so.
2. Second, observational studies are still interesting, so study them. But be very careful when interpreting estimates: we cannot rule out (2) or (3), so do not, under any circumstances, rule these options out in an interpretation.

The conclusions offered in the second step are valid, but when it comes to practicing applied statistics (for example, estimating factors correlated with costs of wildfires or estimating factors correlated with the likelihood of Indian casino investment), we want to know more than just correlation. We want to draw lessons from our analysis. If not, what's the point?

The two-step strategy leads to a statement like "an additional house within one mile of a fire perimeter is correlated with \$13,000 in additional fire suppression cost." But, if we take the second step of the statistician's solution seriously, we cannot offer any policy implications without reasonably ruling out (2) or (3).

Can we safely conclude that "more money devoted to fire suppression leads to more houses near a fire perimeter" is an unreasonable interpretation? Can we safely rule out that fire suppression cost and the number of houses are both driven by some omitted factor?

If the answer is no to either of these questions, any discussion after the standard interpretation is nonsense. But, applied statistics in many observational studies is all about creating estimates that warrant interesting policy discussion. If it is not safe to conclude that X causes Y, my view is that there are two appropriate responses:

• Drop the research question. Your audience always wants to know what's causing what. If you report a correlation without ruling out alternative interpretations, people will either (a) not care because you did not answer their question, or (b) misinterpret you and think that you told them X causes Y when you could not offer that strong of a conclusion.
• Investigate Causation. Every observational study should investigate the extent to which important omitted factors can be driving observed correlations. This statement applies to everyone who even thinks about applied research.

A set of tools for investigating causation is available (Google IV or 2SLS if you're technically inclined), but using it requires careful thought. Basically, investigating causation boils down to different ways to reasonably rule out explanations for an observed correlation aside from "X causes Y." But isn't that what your audience wants anyway?
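To give a flavor of the IV/2SLS idea, here is a toy sketch. The setup is entirely hypothetical (the instrument W, confounder U, and all coefficients are invented for illustration): W shifts X but has no direct path to Y, so the ratio cov(W, Y) / cov(W, X) recovers the causal effect of X on Y even when a naive regression is biased by U.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# U is an unobserved confounder; W is an instrument that moves X
# but affects Y only through X (the key identifying assumption).
u = rng.normal(size=n)
w = rng.normal(size=n)

true_beta = 0.5                                  # the causal effect we want
x = 1.0 * w + 1.0 * u + rng.normal(size=n)
y = true_beta * x + 1.0 * u + rng.normal(size=n)

# Naive OLS slope is biased upward by the confounder U.
ols = np.cov(x, y)[0, 1] / np.var(x)

# IV estimate: ratio of the instrument's covariances with Y and X.
iv = np.cov(w, y)[0, 1] / np.cov(w, x)[0, 1]

print(round(ols, 2), round(iv, 2))
```

The naive slope lands well above 0.5 while the IV estimate is close to the true effect. Of course, the whole exercise rests on the assumption that W really has no direct channel to Y, and defending that assumption is exactly the "careful thought" the toolkit demands.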