# “Lies, damn lies, and statistics”

Ours is the age of statistics; of number-crunching, of quantifying, of defining everything by what it means in terms of percentages and comparisons. Statistics crop up in every walk of life, to some extent or other, in fields as widespread as advertising and sport. Many people’s livelihoods now depend on their ability to crunch the numbers, to come up with data and patterns, and much of our society’s increasing ability to do awesome things can be traced back to someone making the numbers dance.

In fact, most of what we think of as ‘statistics’ are not really statistics at all, but merely numbers; to a pedantic mathematician, a statistic is defined as a mathematical function of a sample of data, not the whole ‘population’ we are considering. We use statistics when it would be impractical to measure the whole population, usually because it’s too large, and when we instead are trying to mathematically model the whole population based on a small sample of it. Thus, next to no sporting ‘statistics’ are in fact true statistics as they tend to cover the whole game; if I heard during a rugby match that “Leicester had 59% of the possession”, that is nothing more than a number; or, to use the mathematical term, a parameter. A statistic would be to say “From our sample [of one game] we can conclude that Leicester control an average of 59% of the possession when they play rugby”, but this is quite evidently not true since we couldn’t extrapolate Leicester’s normal behaviour from a single match. It is for this reason that complex mathematical formulae are used to determine the uncertainty of a conclusion drawn from a statistical test, and these are based on the size of the sample we are testing compared to the overall size of the population we are trying to model. These uncertainty levels are often brushed under the carpet when pseudoscientists try to make dramatic, sweeping claims about something, but they are possibly the most important feature of modern statistics.

Another weapon for the poor statistician can be the mis-application of the idea of correlation. Correlation is basically what it means when you take two variables, plot them against one another on a graph, and find you get a nice neat line joining them, suggesting that the two are in some way related. Correlation tends to get scientists very excited, since if two things are linked then it suggests that you can make one thing happen by doing another, an often advantageous concept, and this is known as a causal relationship. However, whilst correlation and causation are rarely not intertwined, the first lesson every statistician learns is this; correlation DOES NOT imply causation.

Imagine, for instance, you have a cold. You feel like crap, your head is spinning, you’re dehydrated and you can’t breath through your nose. If we were, during the period before, during and after your cold, to plot a graph of one’s relative ability to breath through the nose against the severity of your headache (yeah, not very scientific I know), these two facts would both correlate, since they happen at the same time due to the cold. However, if I were to decide that this correlation implies causation, then I would draw the conclusion that all I need to do to give you a terrible headache is to plug your nose with tissue paper so you can’t breath through it. In this case, I have ignored the possibility (and, as it transpires, the eventuality) of there being a third variable (the cold virus) that causes both of the other two variables, and this is very hard to investigate without poking our head out of the numbers and looking at the real world. There are statistical techniques that enable us to do this, but they are for another time.

Whilst this example was more childish than anything, mis-extrapolation of a correlation can have deadly consequences. One example, explored in Ben Goldacre’s Bad Science, concerns beta-carotene, an antioxidant found in carrots, and in 1981 an epidemiologist called Richard Peto published a meta-analysis (post for another time) of a series of scientific studies that suggested people with high beta-carotene levels showed a reduced risk of cancer. At the time, antioxidants were considered the wonder-substance of the nutrition, and everyone got on board with the idea that beta-carotene was awesome stuff. However, all of the studies examined were observational ones; taking a lot of different people, seeing what their beta-carotene levels were and then examining whether or not they had cancer or developed it in later life. None of the studies actually gave their subjects beta-carotene and then saw if that affected their cancer risk, and this prompted the editor of Nature magazine (the scientific journal in which Peto’s paper was published) to include a footnote reading:

Unwary readers (if such there are) should not take the accompanying article as a sign that the consumption of large quantities of carrots (or other dietary sources of beta-carotene) is necessarily protective against cancer.

The editor’s footnote quickly proved a well-judged one; a study conducted in Finland some time afterwards actually gave participants at high risk of lung cancer beta-carotene and found their risk of both getting the cancer and of death were higher than for the ‘placebo’ control group. A later study, named CARET (Carotene And Retinol Efficiency Trial), also tested groups at a high risk of lung cancer, giving half of them a mixture of beta-carotene and vitamin A and the other half placebos. The idea was to run the trial for six years and see how many illnesses/deaths each group ended up with; but after preliminary data found that those having the antioxidant tablets were 46% more likely to die from lung cancer, they decided it would be unethical to continue the trial and it was terminated early. Had the Nature article been allowed to get out of hand before this research was done, then it could have put thousands of people who hadn’t read the article properly at risk; and all because of the dangers of assuming correlation=causation.

This wasn’t really the gentle ramble through statistics I originally intended it to be, but there you go; stats. Next time, something a little less random. Maybe

# A Continued History

This post looks set to at least begin by following on directly from my last one- that dealt with the story of computers up to Charles Babbage’s difference and analytical engines, whilst this one will try to follow the history along from there until as close to today as I can manage, hopefully getting in a few of the basics of the workings of these strange and wonderful machines.

After Babbage’s death as a relatively unknown and unloved mathematician in 1871, the progress of the science of computing continued to tick over. A Dublin accountant named Percy Ludgate, independently of Babbage’s work, did develop his own programmable, mechanical computer at the turn of the century, but his design fell into a similar degree of obscurity and hardly added anything new to the field. Mechanical calculators had become viable commercial enterprises, getting steadily cheaper and cheaper, and as technological exercises were becoming ever more sophisticated with the invention of the analogue computer. These were, basically a less programmable version of the difference engine- mechanical devices whose various cogs and wheels were so connected up that they would perform one specific mathematical function to a set of data. James Thomson in 1876 built the first, which could solve differential equations by integration (a fairly simple but undoubtedly tedious mathematical task), and later developments were widely used to collect military data and for solving problems concerning numbers too large to solve by human numerical methods. For a long time, analogue computers were considered the future of modern computing, but since they solved and modelled problems using physical phenomena rather than data they were restricted in capability to their original setup.

A perhaps more significant development came in the late 1880s, when an American named Herman Hollerith invented a method of machine-readable data storage in the form of cards punched with holes. These had been around for a while to act rather like programs, such as the holed-paper reels of a pianola or the punched cards used to automate the workings of a loom, but this was the first example of such devices being used to store data (although Babbage had theorised such an idea for the memory systems of his analytical engine). They were cheap, simple, could be both produced and read easily by a machine, and were even simple to dispose of. Hollerith’s team later went on to process the data of the 1890 US census, and would eventually become most of IBM. The pattern of holes on these cards could be ‘read’ by a mechanical device with a set of levers that would go through a hole if there was one present, turning the appropriate cogs to tell the machine to count up one. This system carried on being used right up until the 1980s on IBM systems, and could be argued to be the first programming language.

However, to see the story of the modern computer truly progress we must fast forward to the 1930s. Three interesting people and acheivements came to the fore here: in 1937 George Stibitz, and American working in Bell Labs, built an electromechanical calculator that was the first to process data digitally using on/off binary electrical signals, making it the first digital. In 1936, a bored German engineering student called Konrad Zuse dreamt up a method for processing his tedious design calculations automatically rather than by hand- to this end he devised the Z1, a table-sized calculator that could be programmed to a degree via perforated film and also operated in binary. His parts couldn’t be engineered well enough for it to ever work properly, but he kept at it to eventually build 3 more models and devise the first programming language. However, perhaps the most significant figure of 1930s computing was a young, homosexual, English maths genius called Alan Turing.

Turing’s first contribution to the computing world came in 1936, when he published a revolutionary paper showing that certain computing problems cannot be solved by one general algorithm. A key feature of this paper was his description of a ‘universal computer’, a machine capable of executing programs based on reading and manipulating a set of symbols on a strip of tape. The symbol on the tape currently being read would determine whether the machine would move up or down the strip, how far, and what it would change the symbol to, and Turing proved that one of these machines could replicate the behaviour of any computer algorithm- and since computers are just devices running algorithms, they can replicate any modern computer too. Thus, if a Turing machine (as they are now known) could theoretically solve a problem, then so could a general algorithm, and vice versa if it couldn’t. Not only that, but since modern computers cannot multi-task on the. These machines not only lay the foundations for computability and computation theory, on which nearly all of modern computing is built, but were also revolutionary as they were the first theorised to use the same medium for both data storage and programs, as nearly all modern computers do. This concept is known as a von Neumann architecture, after the man who first pointed out and explained this idea in response to Turing’s work.

Turing machines contributed one further, vital concept to modern computing- that of Turing-completeness. A Turing-complete system was defined as a single Turing machine (known as a Universal Turing machine) capable of replicating the behaviour of any other theoretically possible Turing machine, and thus any possible algorithm or computable sequence. Charles Babbage’s analytical engine would have fallen into that class had it ever been built, in part because it was capable of the ‘if X then do Y’ logical reasoning that characterises a computer rather than a calculator. Ensuring the Turing-completeness of a system is a key part of designing a computer system or programming language to ensure its versatility and that it is capable of performing all the tasks that could be required of it.

Turing’s work had laid the foundations for nearly all the theoretical science of modern computing- now all the world needed was machines capable of performing the practical side of things. However, in 1942 there was a war on, and Turing was being employed by the government’s code breaking unit at Bletchley Park, Buckinghamshire. They had already cracked the German’s Enigma code, but that had been a comparatively simple task since they knew the structure and internal layout of the Enigma machine. However, they were then faced by a new and more daunting prospect: the Lorenz cipher, encoded by an even more complex machine for which they had no blueprints. Turing’s genius, however, apparently knew no bounds, and his team eventually worked out its logical functioning. From this a method for deciphering it was formulated, but it required an iterative process that took hours of mind-numbing calculation to get a result out. A faster method of processing these messages was needed, and to this end an engineer named Tommy Flowers designed and built Colossus.

Colossus was a landmark of the computing world- the first electronic, digital, and partially programmable computer ever to exist. It’s mathematical operation was not highly sophisticated- it used vacuum tubes containing light emission and sensitive detection systems, all of which were state-of-the-art electronics at the time, to read the pattern of holes on a paper tape containing the encoded messages, and then compared these to another pattern of holes generated internally from a simulation of the Lorenz machine in different configurations. If there were enough similarities (the machine could obviously not get a precise matching since it didn’t know the original message content) it flagged up that setup as a potential one for the message’s encryption, which could then be tested, saving many hundreds of man-hours. But despite its inherent simplicity, its legacy is simply one of proving a point to the world- that electronic, programmable computers were both possible and viable bits of hardware, and paved the way for modern-day computing to develop.

# What we know and what we understand are two very different things…

If the whole Y2K debacle over a decade ago taught us anything, it was that the vast majority of the population did not understand the little plastic boxes known as computers that were rapidly filling up their homes. Nothing especially wrong or unusual about this- there’s a lot of things that only a few nerds understand properly, an awful lot of other stuff in our life to understand, and in any case the personal computer had only just started to become commonplace. However, over 12 and a half years later, the general understanding of a lot of us does not appear to have increased to any significant degree, and we still remain largely ignorant of these little feats of electronic witchcraft. Oh sure, we can work and operate them (most of us anyway), and we know roughly what they do, but as to exactly how they operate, precisely how they carry out their tasks? Sorry, not a clue.

This is largely understandable, particularly given the value of ‘understand’ that is applicable in computer-based situations. Computers are a rare example of a complex system that an expert is genuinely capable of understanding, in minute detail, every single aspect of the system’s working, both what it does, why it is there, and why it is (or, in some cases, shouldn’t be) constructed to that particular specification. To understand a computer in its entirety, therefore, is an equally complex job, and this is one very good reason why computer nerds tend to be a quite solitary bunch, with quite few links to the rest of us and, indeed, the outside world at large.

One person who does not understand computers very well is me, despite the fact that I have been using them, in one form or another, for as long as I can comfortably remember. Over this summer, however, I had quite a lot of free time on my hands, and part of that time was spent finally relenting to the badgering of a friend and having a go with Linux (Ubuntu if you really want to know) for the first time. Since I like to do my background research before getting stuck into any project, this necessitated quite some research into the hows and whys of its installation, along with which came quite a lot of info as to the hows and practicalities of my computer generally. I thought, then, that I might spend the next couple of posts or so detailing some of what I learned, building up a picture of a computer’s functioning from the ground up, and starting with a bit of a history lesson…

‘Computer’ was originally a job title, the job itself being akin to accountancy without the imagination. A computer was a number-cruncher, a supposedly infallible data processing machine employed to perform a range of jobs ranging from astronomical prediction to calculating interest. The job was a fairly good one, anyone clever enough to land it probably doing well by the standards of his age, but the output wasn’t. The human brain is not built for infallibility and, not infrequently, would make mistakes. Most of these undoubtedly went unnoticed or at least rarely caused significant harm, but the system was nonetheless inefficient. Abacuses, log tables and slide rules all aided arithmetic manipulation to a great degree in their respective fields, but true infallibility was unachievable whilst still reliant on the human mind.

Enter Blaise Pascal, 17th century mathematician and pioneer of probability theory (among other things), who invented the mechanical calculator aged just 19, in 1642. His original design wasn’t much more than a counting machine, a sequence of cogs and wheels so constructed as to able to count and convert between units, tens, hundreds and so on (ie a turn of 4 spaces on the ‘units’ cog whilst a seven was already counted would bring up eleven), as well as being able to work with currency denominations and distances as well. However, it could also subtract, multiply and divide (with some difficulty), and moreover proved an important point- that a mechanical machine could cut out the human error factor and reduce any inaccuracy to one of simply entering the wrong number.

Pascal’s machine was both expensive and complicated, meaning only twenty were ever made, but his was the only working mechanical calculator of the 17th century. Several, of a range of designs, were built during the 18th century as show pieces, but by the 19th the release of Thomas de Colmar’s Arithmometer, after 30 years of development, signified the birth of an industry. It wasn’t a large one, since the machines were still expensive and only of limited use, but de Colmar’s machine was the simplest and most reliable model yet. Around 3,000 mechanical calculators, of various designs and manufacturers, were sold by 1890, but by then the field had been given an unexpected shuffling.

Just two years after de Colmar had first patented his pre-development Arithmometer, an Englishmen by the name of Charles Babbage showed an interesting-looking pile of brass to a few friends and associates- a small assembly of cogs and wheels that he said was merely a precursor to the design of a far larger machine: his difference engine. The mathematical workings of his design were based on Newton polynomials, a fiddly bit of maths that I won’t even pretend to understand, but that could be used to closely approximate logarithmic and trigonometric functions. However, what made the difference engine special was that the original setup of the device, the positions of the various columns and so forth, determined what function the machine performed. This was more than just a simple device for adding up, this was beginning to look like a programmable computer.

Babbage’s machine was not the all-conquering revolutionary design the hype about it might have you believe. Babbage was commissioned to build one by the British government for military purposes, but since Babbage was often brash, once claiming that he could not fathom the idiocy of the mind that would think up a question an MP had just asked him, and prized academia above fiscal matters & practicality, the idea fell through. After investing £17,000 in his machine before realising that he had switched to working on a new and improved design known as the analytical engine, they pulled the plug and the machine never got made. Neither did the analytical engine, which is a crying shame; this was the first true computer design, with two separate inputs for both data and the required program, which could be a lot more complicated than just adding or subtracting, and an integrated memory system. It could even print results on one of three printers, in what could be considered the first human interfacing system (akin to a modern-day monitor), and had ‘control flow systems’ incorporated to ensure the performing of programs occurred in the correct order. We may never know, since it has never been built, whether Babbage’s analytical engine would have worked, but a later model of his difference engine was built for the London Science Museum in 1991, yielding accurate results to 31 decimal places.

…and I appear to have run on a bit further than intended. No matter- my next post will continue this journey down the history of the computer, and we’ll see if I can get onto any actual explanation of how the things work.