Monday, 31 May 2010

The End Of Science

This was the cover title for a series of articles in Wired in July 2008. They were welcoming us to the petabyte age, where we will routinely deal with datasets containing petabytes of data. Inside, the titles are less provocative; the first is "The End of Theory".

In this article Chris Anderson looks at how tools like Google and the Cloud are changing the way we look at data. He raises the question of how we have to deal with such massive amounts of data.

It forces us to view data mathematically first and establish a context for it later.
Peter Norvig has gone so far as to change George Box's maxim "All models are wrong, but some are useful" to "All models are wrong, and increasingly you can succeed without them."

... faced with massive data, this approach to science - hypothesize, model, test - is becoming obsolete.
There are a couple of problems with this idea:
  1. It becomes very difficult to distinguish science and pseudo-science. The Bible Code and other such books suddenly become more convincing.
  2. There is still a model, or rather an assumption: that the patterns found in already-seen examples will also apply to new, unseen examples.
What is actually happening is that you no longer look for the universal laws themselves but for what they produce. This is how science has worked before: it is the process by which Newton produced his Law of Gravitation, but the explanation, in a more fundamental sense, of what gravity means had to wait for Einstein.

Anderson goes too far when he talks about the limits of biology:

Now biology is heading in the same direction. The models we were taught in school about "dominant" and "recessive" genes steering a strictly Mendelian process have turned out to be an even greater simplification of reality than Newton's laws. The discovery of gene-protein interactions and other aspects of epigenetics has challenged the view of DNA as destiny and even introduced evidence that environment can influence inheritable traits, something once considered a genetic impossibility.
This is over-stating reality. Biology according to pure Darwinian evolution is likely to be incomplete, and there are some epigenetic factors that follow a Lamarckian process, but not all of them do; Darwinian evolution with Mendelian inheritance still holds for most genes. We only know that it is not completely true because we know the mechanism of epigenetics, which is precisely what Anderson says we do not need to know.

There is now a better way. Petabytes allow us to say: "Correlation is enough." We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.
Sorry, but he should read any statistics textbook's treatment of data-mining, which shows that you can find almost any correlation you like depending on how you partition the data. The higher the dimensionality of the data (the more variables), the more likely this is to happen. So this paragraph is nonsense.
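The point is easy to demonstrate. Here is a minimal sketch (sample sizes and variable names are purely illustrative): every column is independent random noise, yet with enough variables some column will correlate strongly with the "outcome" by chance alone.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, n_variables = 20, 10_000
data = rng.standard_normal((n_samples, n_variables))  # pure-noise "measurements"
outcome = rng.standard_normal(n_samples)              # pure-noise "outcome"

# Pearson correlation of each variable with the outcome.
x = data - data.mean(axis=0)
y = outcome - outcome.mean()
correlations = (x * y[:, None]).sum(axis=0) / (
    np.sqrt((x ** 2).sum(axis=0)) * np.sqrt((y ** 2).sum())
)

best = np.abs(correlations).max()
print(f"strongest spurious correlation: {best:.2f}")
# With 10,000 unrelated variables and only 20 samples, the strongest
# |correlation| is typically well above 0.6, despite there being no
# real signal anywhere in the data.
```

Flip the ratio the other way (many samples, few variables) and the spurious correlations shrink towards zero, which is exactly why mechanism and hypothesis still matter: they tell you which of the petabyte's many correlations are worth believing.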


Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.
There's no reason to cling to our old ways. It's time to ask: What can science learn from Google?
To understand the importance of mechanism you need to read studies like those of Richard Peto on heart attacks and the use of aspirin as a primary medical treatment when patients are hospitalised. Why did they think to use aspirin? Because its mechanism of action is to prevent clotting.
