“Lies, damn lies, and statistics”

Ours is the age of statistics; of number-crunching, of quantifying, of defining everything by what it means in terms of percentages and comparisons. Statistics crop up in every walk of life, to some extent or another, in fields as diverse as advertising and sport. Many people’s livelihoods now depend on their ability to crunch the numbers, to come up with data and patterns, and much of our society’s increasing ability to do awesome things can be traced back to someone making the numbers dance.

In fact, most of what we think of as ‘statistics’ is not really statistics at all, but merely numbers; to a pedantic mathematician, a statistic is a mathematical function calculated from a sample of data, not from the whole ‘population’ we are interested in. We use statistics when it would be impractical to measure the whole population, usually because it’s too large, and when we are instead trying to model that whole population mathematically from a small sample of it. Thus, next to no sporting ‘statistics’ are true statistics, since they tend to cover the whole game; if I hear during a rugby match that “Leicester had 59% of the possession”, that is nothing more than a number, or, to use the mathematical term, a parameter. A statistic would be to say “From our sample [of one game] we can conclude that Leicester control an average of 59% of the possession when they play rugby”, but that conclusion is plainly unreliable, since we cannot extrapolate Leicester’s normal behaviour from a single match. It is for this reason that mathematical formulae are used to quantify the uncertainty of any conclusion drawn from a statistical test, based chiefly on how large the sample is compared to the population we are trying to model. These uncertainty levels are often brushed under the carpet when pseudoscientists try to make dramatic, sweeping claims about something, but they are possibly the most important feature of modern statistics.
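To make that concrete, here is a minimal Python sketch, using made-up possession figures for a handful of matches, of how a statistic comes with an uncertainty attached and how that uncertainty depends on the size of the sample:

```python
import math

# Hypothetical possession percentages for Leicester over five matches
# (invented numbers, purely for illustration)
possession = [59, 47, 62, 51, 55]

n = len(possession)
mean = sum(possession) / n
# Sample standard deviation
sd = math.sqrt(sum((x - mean) ** 2 for x in possession) / (n - 1))
# The standard error shrinks as the sample grows, so the estimate gets more certain
se = sd / math.sqrt(n)

# Rough 95% interval using the large-sample normal value of 1.96
low, high = mean - 1.96 * se, mean + 1.96 * se
print(f"Estimated average possession: {mean:.1f}% (roughly {low:.1f}% to {high:.1f}%)")
```

Feed it fifty matches instead of five and the interval narrows sharply, which is exactly why a conclusion drawn from a single game is so shaky.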

Another weapon for the poor statistician is the misapplication of the idea of correlation. Correlation is, roughly, what you have when you take two variables, plot them against one another on a graph, and find the points lie along a nice neat line, suggesting that the two are in some way related. Correlation tends to get scientists very excited, because if two things are genuinely linked then it suggests you might be able to make one happen by doing the other; this is known as a causal relationship, and it is often a very useful thing to find. However, whilst correlation and causation frequently do go hand in hand, the first lesson every statistician learns is this: correlation DOES NOT imply causation.
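As a quick, invented-numbers illustration, the strength of that ‘nice neat line’ is usually summarised by a correlation coefficient, a single number between -1 and +1:

```python
import numpy as np

# Two made-up variables that happen to move together
hours_revised = np.array([1, 2, 3, 4, 5, 6, 7, 8])
exam_score    = np.array([42, 48, 55, 58, 66, 70, 74, 81])

# Pearson correlation: +1 is a perfect straight-line relationship,
# 0 means no linear relationship at all
r = np.corrcoef(hours_revised, exam_score)[0, 1]
print(f"correlation r = {r:.2f}")  # close to +1, so the points lie near a neat line
```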

Imagine, for instance, you have a cold. You feel like crap, your head is spinning, you’re dehydrated and you can’t breathe through your nose. If we were, during the period before, during and after your cold, to plot a graph of your relative ability to breathe through your nose against the severity of your headache (yeah, not very scientific, I know), the two would correlate, since they happen at the same time thanks to the cold. However, if I were to decide that this correlation implies causation, I would draw the conclusion that all I need to do to give you a terrible headache is to plug your nose with tissue paper so you can’t breathe through it. In this case, I have ignored the possibility (and, as it transpires, the reality) of there being a third variable (the cold virus) that causes both of the other two, and that is very hard to spot without poking our heads out of the numbers and looking at the real world. There are statistical techniques that enable us to do this, but they are for another time.
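Here is a tiny simulated version of that cold (entirely made up, of course): both ‘symptoms’ are generated from a hidden third variable, and the correlation between them still comes out close to perfect:

```python
import numpy as np

rng = np.random.default_rng(0)

# The hidden third variable: how bad the cold is on each day (0 = healthy)
cold_severity = np.concatenate([
    np.zeros(5), np.linspace(0, 1, 5), np.linspace(1, 0, 5), np.zeros(5)
])

# Both symptoms are driven by the cold plus a little noise; neither causes the other
blocked_nose = cold_severity + rng.normal(0, 0.05, cold_severity.size)
headache     = cold_severity + rng.normal(0, 0.05, cold_severity.size)

r = np.corrcoef(blocked_nose, headache)[0, 1]
print(f"correlation between blocked nose and headache: r = {r:.2f}")
# r comes out near +1, yet plugging someone's nose with tissue paper
# would do nothing to their headache: the cold causes both.
```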

Whilst this example was more childish than anything, mis-extrapolation of a correlation can have deadly consequences. One example, explored in Ben Goldacre’s Bad Science, concerns beta-carotene, an antioxidant found in carrots. In 1981 an epidemiologist called Richard Peto published a meta-analysis (post for another time) of a series of scientific studies suggesting that people with high beta-carotene levels showed a reduced risk of cancer. At the time, antioxidants were considered the wonder-substances of the nutrition world, and everyone got on board with the idea that beta-carotene was awesome stuff. However, all of the studies examined were observational ones: taking a lot of different people, measuring their beta-carotene levels and then examining whether or not they had cancer or developed it in later life. None of the studies actually gave their subjects beta-carotene and then checked whether that affected their cancer risk, and this prompted the editor of Nature (the scientific journal in which Peto’s paper was published) to include a footnote reading:

Unwary readers (if such there are) should not take the accompanying article as a sign that the consumption of large quantities of carrots (or other dietary sources of beta-carotene) is necessarily protective against cancer.

The editor’s footnote quickly proved a well-judged one; a study conducted in Finland some time afterwards actually gave beta-carotene to participants at high risk of lung cancer, and found that their risk both of getting the cancer and of dying was higher than for the ‘placebo’ control group. A later study, named CARET (Carotene And Retinol Efficacy Trial), also tested groups at high risk of lung cancer, giving half of them a mixture of beta-carotene and vitamin A and the other half placebos. The idea was to run the trial for six years and see how many illnesses and deaths each group ended up with; but after preliminary data showed that those taking the antioxidant tablets were 46% more likely to die from lung cancer, it was decided that continuing would be unethical and the trial was terminated early. Had the hype surrounding the Nature article been allowed to get out of hand before this research was done, it could have put thousands of people who hadn’t read the article properly at risk; and all because of the dangers of assuming that correlation equals causation.
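For anyone wondering where a figure like ‘46% more likely’ comes from, here is the arithmetic with entirely hypothetical counts (these are not the real CARET numbers):

```python
# Invented counts, purely to show how a relative risk is calculated
deaths_treated, n_treated = 73, 10000   # beta-carotene + vitamin A group
deaths_placebo, n_placebo = 50, 10000   # placebo group

risk_treated = deaths_treated / n_treated
risk_placebo = deaths_placebo / n_placebo

relative_risk = risk_treated / risk_placebo
print(f"relative risk = {relative_risk:.2f}")                        # 1.46
print(f"i.e. {100 * (relative_risk - 1):.0f}% more likely to die")   # 46% more likely
```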

This wasn’t really the gentle ramble through statistics I originally intended it to be, but there you go; stats. Next time, something a little less random. Maybe.


The Problems of the Real World

My last post on the subject of artificial intelligence was something of a philosophical argument about its nature; today I am going to take a more practical perspective and have a go at scratching the surface of the monumental challenges that the real world poses to the development of AI, and, indeed, how they are (broadly speaking) solved.

To understand the issues surrounding the AI problem, we must first consider what, in the strictest sense, a computer is. To quote… someone, I can’t quite remember who: “A computer is basically just a dumb adding machine that counts on its fingers- except that it has an awful lot of fingers and counts terribly fast”. This rather simplistic model is in fact rather good for explaining exactly what it is that computers are good and bad at: they are very good at numbers, at data crunching, at the processing of information. Information is the key thing here; if something can be fed into a computer purely in terms of information, then the computer is perfectly capable of modelling and processing it with ease, which is why a computer is very good at playing games. Even real-world problems that can be expressed in terms of rules and numbers can be converted into a computer-recognisable format and mastered with ease, which is why computers make short work of things like ballistics modelling (calculating gunnery tables was one of the US military’s first uses for them) and logical games like chess.
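A toy example of such a rules-and-numbers problem, ignoring air resistance and everything else a real firing table has to account for, so treat it strictly as a sketch:

```python
import math

g = 9.81  # gravitational acceleration, m/s^2

def flight(v0, angle_deg, dt=0.01):
    """Crude drag-free projectile model: returns (range in m, time of flight in s)."""
    angle = math.radians(angle_deg)
    vx, vy = v0 * math.cos(angle), v0 * math.sin(angle)
    x = y = t = 0.0
    while True:
        t += dt
        x += vx * dt
        vy -= g * dt
        y += vy * dt
        if y <= 0:            # shell has come back down
            return x, t

# A miniature 'gunnery table': range and flight time at various elevations
for angle in range(15, 80, 15):
    dist, tof = flight(700, angle)   # 700 m/s muzzle velocity, made up
    print(f"elevation {angle:2d} deg: range ~ {dist/1000:5.1f} km, flight time ~ {tof:5.1f} s")
```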

However, where a computer develops problems is at the barrier between the real world and the virtual. One must remember that the actual ‘mind’ of a computer is confined exclusively to the virtual world; the processor inside a robot has no real concept of the world surrounding it, and as such is notoriously poor at interacting with it. The problem is twofold. Firstly, the real world is not a mere simulation where the rules are constant and predictable; rather, it is an incredibly complicated, constantly changing environment in which there are a thousand different things that we living humans keep track of without even thinking. As such, there are a LOT of very complicated inputs and outputs for a computer to keep track of in the real world, which makes it very hard to deal with. But that is merely a matter of grumbling over the engineering specifications and trying to meet the programmers’ design brief; it is the second problem which is the real stumbling block for the development of AI.

The second issue relates to the way a computer processes information: bit by bit, without any real grasp of the big picture. Take, for example, the computer monitor in front of you. To you, it is quite clearly a screen, the most notable clue being the pretty pattern of lights in front of you. Now, turn your screen slightly so that you are looking at it from an angle. It’s still got a pattern of lights coming out of it, it’s still the same colours; it’s still a screen. To a computer, however, if you were to line up two pictures of your monitor from two different angles, it would be completely unable to realise that they showed the same screen, or even the same kind of object. Because the pixels are in a different order, and as such the data is different, the two pictures are completely different; the computer has no concept of the idea that the two patterns of lights are the same basic shape, just seen from different angles.
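You can see the problem in a few lines: shift a made-up ‘image’ sideways by a couple of pixels, as if the camera had moved slightly, and compared pixel by pixel it barely matches itself at all:

```python
import numpy as np

rng = np.random.default_rng(1)

# A tiny made-up 'image': an 8x8 grid of brightness values
image = rng.integers(0, 256, size=(8, 8))

# The 'same' scene shifted two pixels to the right, as if the camera had moved
shifted = np.roll(image, shift=2, axis=1)

# Compared value by value, the two grids hardly agree anywhere,
# even though a human would see the same pattern, just moved over
matching = np.mean(image == shifted)
print(f"fraction of pixels that match exactly: {matching:.2f}")
```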

There are two potential solutions to this problem. Firstly, the computer could look at the monitor and store an image of it from every conceivable angle against every conceivable background, so that it would be able to recognise it anywhere, from any viewpoint; this would, however, take up a library’s worth of memory space and be stupidly wasteful. The alternative requires some cleverer programming: by teaching the computer to spot patterns of pixels that look roughly similar (either shifted along by a few bytes, or missing a few here and there), it can be ‘trained’ to pick out basic shapes, and by using an algorithm to pick out changes in colour (an old trick that’s been used for years to clean up photos), the edges of objects can be identified and the separate objects themselves picked out. I am not by any stretch of the imagination an expert in this field so won’t go into details, but by this basic method a computer can begin to step back and look at the pattern of a picture as a whole.
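As a very rough sketch of that ‘changes in colour’ trick (nothing like a production edge detector), differencing neighbouring pixels lights up exactly where the brightness jumps, which is where the edges are:

```python
import numpy as np

# A tiny synthetic greyscale image: a bright square on a dark background
image = np.zeros((10, 10))
image[3:7, 3:7] = 1.0

# 'Pick out changes in colour': the difference between neighbouring pixels
# is large wherever the brightness jumps, i.e. at the edges of the square
dx = np.abs(np.diff(image, axis=1))   # horizontal changes
dy = np.abs(np.diff(image, axis=0))   # vertical changes

edges = (dx[:-1, :] + dy[:, :-1]) > 0.5
print(edges.astype(int))  # the 1s trace the outline of the square
```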

But all that information inputting, all that work… so your computer can identify just a monitor? What about the myriad other things our brains can recognise with such ease: animals, buildings, cars? And we haven’t even got on to differentiating between different types of things yet… how will we ever match the human brain?

This presented a big setback for the development of modern AI. So far, we have been able to develop AI that lets a computer handle a few real-world tasks or applications very well (and in some cases, depending on the task’s suitability to the computational mind, better than humans), but scientists and engineers faced a monumental challenge when it came to approaching the human mind (let alone its body) in anything like the breadth of tasks it is able to perform. So they went back to basics, and began to think about exactly how humans are able to do so much stuff.

Some of it can be put down to instinct, but then came the idea of learning. The human mind is especially remarkable in its ability to take in new information and learn new things about the world around it- and then take this new-found information and try to apply it to our own bodies. Not only can we do this, but we can also do it remarkably quickly- it is one of the main traits which has pushed us forward as a race.

So this is what inspires the current generation of AI programmers and roboticists: the idea of building into a robot’s design a capacity for learning. The latest generation of the Japanese ‘Asimo’ robots can learn what various objects presented to them are, and are then able to recognise those objects when shown them again, as well as having the best-functioning humanoid chassis of any existing robot, able to run and climb stairs. Perhaps more exciting is a pair of robots currently under development that start pretty much from first principles, just like babies do: first they are presented with a mirror and learn to manipulate their leg motors in a way that allows them to stand up straight and walk (although they aren’t quite so good at picking themselves up if they fail in this endeavour). They then face one another and begin to demonstrate and repeat actions to each other, giving each action a name as they do so. In doing this they build up an entirely new, if unsophisticated, language with which to make sense of the world around them; currently this covers only actions, but who knows what lies around the corner…