“There are three kinds of lies: Lies, damned lies, and statistics.” That famous quotation is frequently attributed to Mark Twain but was actually (according to Twain himself, anyway) the work of British prime minister Benjamin Disraeli. Whoever said it, it remains familiar because it captures a widespread suspicion of the extent to which statistics can be made to support any position, with sufficient manipulation. This is an especially big public-relations problem for psychological scientists, who use statistical analysis to reach most of their research conclusions.
Of the various available statistical techniques, probably none is more frequently and egregiously abused than the correlation coefficient, which is also the most common way of handling the data from observational (nonexperimental) studies in psychology. A correlation coefficient is a single number that indicates the nature and the strength of the relationship between two sets of numbers. Values can range between −1.0 and 1.0.
A positive number means high scores on one factor accompany high scores on the other factor being studied—as one goes up, so does the other. A negative correlation indicates an inverse relationship—as one goes up, the other goes down. The closer the correlation gets to an absolute value (positive or negative) of 1.0, the stronger the measured relationship is. For example, there is a strong positive correlation between shoe size and pants size, at least during childhood—as one number goes up, so does the other. There is a negative correlation between the air temperature and the number of layers of clothing that people wear—as the temperature rises, fewer clothes are worn.
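To make the definition concrete, here is a minimal sketch of how one common correlation coefficient, Pearson's r, can be computed by hand. The text does not say which coefficient it has in mind, so the choice of Pearson's r is an assumption, and the temperature and clothing figures are invented purely for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from the two means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)

# Made-up illustrative data, not real measurements.
temperature_c = [0, 5, 10, 15, 20, 25, 30]
clothing_layers = [4, 4, 3, 3, 2, 1, 1]

r = pearson_r(temperature_c, clothing_layers)
print(round(r, 2))  # negative and close to -1: warmer days, fewer layers
```

The sign of the result gives the direction of the relationship and its absolute value gives the strength, which is all the coefficient is designed to report.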
Unfortunately, a common error, in everyday life as well as in statistical analysis, is to assume that a correlation represents a causal relationship. Sometimes the assumption turns out to be correct (the strong positive correlation between the total number of cigarettes smoked and the likelihood of getting lung cancer, for example), and sometimes it does not. Take the strong positive correlation between the number of churches in a city and the number of bars. Internationally, as one number increases, so does the other. A causal interpretation may tempt us: religion drives people to drink, or conversely, perhaps the consumption of alcohol leads to greater religiosity. Note that there is nothing in the data themselves to indicate either a causal relationship or, if one were to exist, its direction (which variable causes which). In fact, the relationship between the two numbers is explained by a third variable, in this case population size: larger towns and cities have both more churches and more bars.
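The third-variable explanation can be illustrated with a small simulation. This is a sketch under stated assumptions, not real data: the towns, population range, and per-capita rates below are all hypothetical, and churches and bars are generated so that neither has any influence on the other.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical towns: population is the only thing the two counts share.
populations = [rng.randint(5_000, 500_000) for _ in range(200)]

# Assumed per-capita rates, drawn independently of each other.
churches_per_1000 = [rng.uniform(0.5, 1.5) for _ in populations]
bars_per_1000 = [rng.uniform(0.5, 1.5) for _ in populations]

# Raw counts scale with population, so both grow in larger towns.
churches = [p / 1000 * c for p, c in zip(populations, churches_per_1000)]
bars = [p / 1000 * b for p, b in zip(populations, bars_per_1000)]

raw_r = pearson_r(churches, bars)
per_capita_r = pearson_r(churches_per_1000, bars_per_1000)

print("raw counts:", round(raw_r, 2))
print("per 1,000 residents:", round(per_capita_r, 2))
```

In this setup the raw counts should correlate strongly, while the per-capita rates should correlate at close to zero, because dividing by population removes the one factor the two counts have in common.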
A cause-and-effect interpretation would clearly be inappropriate in the foregoing example. An important warning: correlational data alone never justify a cause-and-effect interpretation. Sometimes the data do reflect a causal relationship, but there is no way of telling from the correlation coefficient alone. The only type of psychological research that allows causal inferences to be drawn is the experiment.
Despite this, causal inferences are made all the time. Politicians are major offenders. The last five presidents (and probably all their predecessors as well), for example, have all seized credit for improvements in certain economic indicators by correlating the figures with their months in office. The government keeps track of many economic indicators, and in any given period some will rise and others will fall. It’s a simple matter to examine the statistics for the first hundred days in office and pick one that has gone up, and to then point out that this has occurred while the president has been in office. As with other correlational data, however, there is not enough information contained in the correlation to assume a causal connection. Be alert for this sort of thing: it’s everywhere, and it’s a clever way of lying.
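A rough calculation shows why there is almost always a favorable indicator to point to. The sketch below assumes, purely for illustration, that each indicator moves independently and has an even chance of rising in a given period. Real economic indicators are neither independent nor coin flips, so the numbers are indicative only.

```python
def chance_some_indicator_rises(n_indicators, p_rise=0.5):
    """Probability that at least one of n independent indicators goes up.

    p_rise is an assumed, hypothetical chance that any single indicator rises.
    """
    # Complement rule: 1 minus the chance that every indicator fails to rise.
    return 1 - (1 - p_rise) ** n_indicators

for n in (1, 5, 10, 20):
    print(n, round(chance_some_indicator_rises(n), 4))
```

Under these assumptions, tracking just five indicators already gives about a 97 percent chance that at least one of them has gone up, whatever the officeholder did or did not do.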
The tendency to confuse correlation with causation is a perfectly natural one, and one that serves an adaptive function, despite sometimes being wrong. Recognizing that two things that occur consecutively may share a causal relationship is not a bad thing: consider the survival value of noticing that a certain type of cloud often precedes a dangerous thunderstorm or that a certain type of activity by birds often precedes the arrival of a tiger by a few seconds. The same tendency can also lead us to see such relationships where none exist, however, a phenomenon known as illusory covariation. This may explain the popularity of many unproven cold remedies. Although several clinical trials have now demonstrated that Echinacea purpurea has no effect either on the immune system or on cold symptoms, it remains hugely popular for such alleged effects. A cold usually lasts about a week, maybe a little less, maybe a bit more. If we take a remedy for several days and begin to feel better after taking it, it seems fairly obvious that the remedy caused the improvement. As scientists, however, we must be wary of the obvious: the cold would probably have gotten better anyway, with or without the remedy.
Correlation remains an extremely useful statistical technique despite these flaws, as the problem is with the interpretation rather than with the numbers themselves. The real purpose of correlation is to indicate whether two variables are related in some way and how strong that relationship is; it cannot tell us anything else about the nature of that relationship. When we get promising results from correlations, we can then use these data to plan experiments to test whether a causal relationship of any sort actually exists.