Ours is the age of statistics; of number-crunching, of quantifying, of defining everything by what it means in terms of percentages and comparisons. Statistics crop up in every walk of life, to some extent or other, in fields as diverse as advertising and sport. Many people’s livelihoods now depend on their ability to crunch the numbers, to come up with data and patterns, and much of our society’s increasing ability to do awesome things can be traced back to someone making the numbers dance.
In fact, most of what we think of as ‘statistics’ is not really statistics at all, but merely numbers; to a pedantic mathematician, a statistic is a mathematical function of a sample of data, not of the whole ‘population’ we are considering. We use statistics when it would be impractical to measure the whole population, usually because it’s too large, and we instead try to model that population mathematically from a small sample of it. By that definition, next to no sporting ‘statistics’ are true statistics, because they tend to cover the whole game; if I hear during a rugby match that “Leicester had 59% of the possession”, that is nothing more than a number; or, to use the mathematical term, a parameter. A statistic would be to say “from our sample [of one game] we can conclude that Leicester control an average of 59% of the possession when they play rugby”, but that would be a very shaky conclusion, since we can’t extrapolate Leicester’s normal behaviour from a single match. It is for this reason that mathematical formulae are used to quantify the uncertainty of a conclusion drawn from a statistical test, based largely on the size of the sample we are testing relative to the population we are trying to model. These uncertainty levels are often swept under the carpet when pseudoscientists try to make dramatic, sweeping claims about something, but they are possibly the most important feature of modern statistics.
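If you fancy seeing that in code, here is a minimal Python sketch with entirely invented possession figures (nothing here is real rugby data): it contrasts a parameter, which describes the whole population, with a statistic computed from a sample, and shows the uncertainty shrinking as the sample grows.

```python
# Toy illustration (invented numbers): parameter vs statistic, and how the
# uncertainty around a statistic shrinks as the sample gets bigger.
import math
import random
import statistics

random.seed(42)

# Pretend this is every match Leicester will ever play: possession percentages.
population = [random.gauss(52, 8) for _ in range(1000)]
parameter = statistics.mean(population)  # the 'true' average possession

# With a sample of one match you cannot even estimate the uncertainty,
# which is exactly why extrapolating from a single game is hopeless.
for n in (2, 5, 30, 200):
    sample = random.sample(population, n)
    statistic = statistics.mean(sample)  # an estimate made from the sample
    # Rough 95% confidence interval (normal approximation; a t-distribution
    # would be more careful for small samples).
    half_width = 1.96 * statistics.stdev(sample) / math.sqrt(n)
    print(f"n={n:3d}: estimate {statistic:.1f}% ± {half_width:.1f} "
          f"(true value {parameter:.1f}%)")
```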
Another weapon for the sloppy statistician is the misapplication of the idea of correlation. Correlation is basically what you have when you take two variables, plot them against one another on a graph, and find that the points fall roughly along a nice neat line, suggesting that the two are in some way related. Correlation tends to get scientists very excited, because if one thing genuinely causes another then you can change the second by manipulating the first, an often advantageous state of affairs known as a causal relationship. However, whilst correlation and causation are frequently intertwined, the first lesson every statistician learns is this: correlation DOES NOT imply causation.
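For the numerically inclined, here is a small Python sketch (made-up data again) of how that ‘nice neat line’ gets measured in practice, using Pearson’s correlation coefficient:

```python
# What correlation means numerically: Pearson's r is close to +1 or -1 when
# the points lie near a straight line, and close to 0 when the two variables
# have nothing to do with each other. (Invented data; statistics.correlation
# needs Python 3.10+.)
import random
import statistics

random.seed(0)

x = [random.uniform(0, 10) for _ in range(200)]
y_related = [2 * xi + random.gauss(0, 1) for xi in x]  # roughly a straight line
y_unrelated = [random.uniform(0, 10) for _ in x]       # no relationship at all

print("related:  ", round(statistics.correlation(x, y_related), 2))    # ~1
print("unrelated:", round(statistics.correlation(x, y_unrelated), 2))  # ~0
```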
Imagine, for instance, you have a cold. You feel like crap, your head is spinning, you’re dehydrated and you can’t breathe through your nose. If we were, during the period before, during and after your cold, to plot a graph of your relative ability to breathe through your nose against the severity of your headache (yeah, not very scientific, I know), these two variables would correlate, since they change at the same time thanks to the cold. However, if I decided that this correlation implies causation, I would draw the conclusion that all I need to do to give you a terrible headache is plug your nose with tissue paper so you can’t breathe through it. In this case, I have ignored the possibility (and, as it transpires, the actuality) of a third variable (the cold virus) that causes both of the other two, and that is very hard to spot without poking our heads out of the numbers and looking at the real world. There are statistical techniques that let us do this, but they are for another time.
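To make the third-variable point concrete, here is another made-up Python sketch in which a hidden ‘cold’ variable drives both symptoms; the two symptoms end up strongly correlated even though neither causes the other:

```python
# Toy simulation of the third-variable problem: the cold drives both symptoms,
# so blocked nose and headache correlate strongly even though neither one
# causes the other. (Invented numbers; statistics.correlation needs 3.10+.)
import random
import statistics

random.seed(1)

blocked_nose, headache = [], []
for _ in range(500):
    cold = random.uniform(0, 1)                      # the hidden third variable
    blocked_nose.append(cold + random.gauss(0, 0.1))
    headache.append(cold + random.gauss(0, 0.1))

# The two symptoms correlate strongly...
print(round(statistics.correlation(blocked_nose, headache), 2))
# ...but stuffing tissue paper up your nose changes blocked_nose without
# touching the cold itself, so your headache would stay exactly as it was.
```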
Whilst this example was more childish than anything, mis-extrapolation of a correlation can have deadly consequences. One example, explored in Ben Goldacre’s Bad Science, concerns beta-carotene, an antioxidant found in carrots. In 1981 an epidemiologist called Richard Peto published a meta-analysis (a post for another time) of a series of scientific studies suggesting that people with high beta-carotene levels showed a reduced risk of cancer. At the time, antioxidants were considered the wonder-substance of the nutrition world, and everyone got on board with the idea that beta-carotene was awesome stuff. However, all of the studies examined were observational ones: taking a lot of different people, measuring their beta-carotene levels and then recording whether they had cancer or developed it in later life. None of the studies actually gave their subjects beta-carotene and then checked whether that affected their cancer risk, and this prompted the editor of Nature (the scientific journal in which Peto’s paper was published) to include a footnote reading:
Unwary readers (if such there are) should not take the accompanying article as a sign that the consumption of large quantities of carrots (or other dietary sources of beta-carotene) is necessarily protective against cancer.
The editor’s footnote quickly proved a well-judged one; a study conducted in Finland some time afterwards actually gave beta-carotene to participants at high risk of lung cancer and found that their risk both of getting the cancer and of dying was higher than in the ‘placebo’ control group. A later study, CARET (the Carotene And Retinol Efficacy Trial), also tested groups at high risk of lung cancer, giving half of them a mixture of beta-carotene and vitamin A and the other half placebos. The idea was to run the trial for six years and see how many illnesses and deaths each group ended up with; but after preliminary data showed that those taking the antioxidant tablets were 46% more likely to die from lung cancer, the researchers decided it would be unethical to continue and the trial was terminated early. Had the message of the Nature article been allowed to get out of hand before this research was done, it could have put thousands of people who hadn’t read the article properly at risk; and all because of the dangers of assuming correlation = causation.
This wasn’t really the gentle ramble through statistics I originally intended it to be, but there you go: stats. Next time, something a little less random. Maybe.