Friday, 25 June 2010

I spy with my little eye scientists acting deceitfully

In my last post I said that highly cited papers become a stronger part of the literature than less cited papers and those that are never cited. I also said that the proportion of uncited papers now is similar to that in 1973. Here, though, are some interesting graphs created with Scopus rather than by looking at citations in Google Scholar, which had suggested this 40-50% level of uncited articles.

So this suggests that only a very small percentage of articles are never cited, and that over time this approaches 0% for Bioinformatics and BMC Bioinformatics. This seems a little odd given the previous results. So what happens if Nature and Science are added to the chart?
Nature and Science have a similar proportion of uncited papers to the 1973 study and the study using Google Scholar. So why do the Bioinformatics journals appear to be more consistently cited than Nature and Science? There are two possibilities:

  1. Scopus is incorrectly calculating the citations for the Bioinformatics journals (but how can this be if it gets it right for Science and Nature?).
  2. The scientists are fiddling the results to make sure that their papers are cited at least once by citing them themselves - maybe in a conference report or in a non-peer-reviewed journal. Citations make careers, and an uncited and unloved paper is no use on a curriculum vitae.
So scientists have learned to play the system, but only those who publish in the discipline-specific journals seem to be playing. The big names who get the Nature and Science articles don't care. This is a response to the way tenure is earned in the States and many other countries, and the way funding is calculated by the RAE in the UK.

Citation and confidence - Bioinformatics as an example

How do you know an article is a good article?

We know that peer review is flawed and that it can let through bad articles while blocking good work. So how can we be confident about a piece of research? The more an article is cited, the more important it has become to the community. The citations can come from those who disagree with the work, but most often they come from those who agree with it. So highly cited papers, even if they are shown in the future to be flawed, have become a significant part of the literature.

As a little experiment I took the articles published in Bioinformatics in 2001, of which there are about 819 including editorials and comments, and looked at the number of times they have been cited using Google Scholar. Only about 300 articles have ever been cited, so about 500 have never been cited. Of those articles that have been cited, the most cited has over 6500 citations and there are 10 articles with more than 500 citations.

This means that in the eight and a half years since the end of 2001 fewer than 40% of articles have been cited. This agrees with the results reported in Ziman - Reliable Knowledge for 1973 (p 130), where less than 50% of articles were cited within the first year after they were published. This is slightly surprising given the advent of the internet and the increase in open access publication, which makes access to the literature wider, but it also reflects the massive growth of the literature in the last 40 years.
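The arithmetic behind that "less than 40%" figure can be checked with a few lines of Python. This is only a sketch using the rounded counts quoted above (about 819 articles, about 300 of them cited); the variable names are my own and the counts are approximate.

```python
# Approximate counts quoted in the post for Bioinformatics, 2001.
total_articles = 819   # includes editorials and comments
cited_articles = 300   # "about 300" cited at least once (Google Scholar)

uncited_articles = total_articles - cited_articles
cited_fraction = cited_articles / total_articles
uncited_fraction = uncited_articles / total_articles

print(f"cited:   {cited_articles} ({cited_fraction:.1%})")
print(f"uncited: {uncited_articles} ({uncited_fraction:.1%})")
```

With these rounded inputs the cited share comes out a little under 37%, which is where the "less than 40%" statement comes from.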

Cyril Cleverdon and Document Retrieval

Cyril Cleverdon is one of those people you will not see mentioned in the news. He was a librarian at Cranfield. In the 1950s he wrote some of the pioneering work about document retrieval and most importantly defined the terms precision and recall.

Precision is the fraction of retrieved documents relevant to the search.

Precision = no. of retrieved relevant documents / no. of retrieved documents

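Precision is the number of true positives as a fraction of everything retrieved, that is, true positives plus false positives. This is a measure of the type I error: the lower the precision, the more false positives have been retrieved.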

Recall is the fraction of the documents relevant to the query that are successfully retrieved.

Recall = no. of relevant documents retrieved / no. of relevant documents

Recall is the number of true positives retrieved out of the total number of positives, and is a measure of the type II error. This is much harder to measure or infer than the type I error, as we cannot be sure of the total number of relevant documents except in cases where we use synthetic data.
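The two definitions above can be sketched in Python using sets of document identifiers. The function name `precision_recall` and the document names are hypothetical, chosen only for illustration, and the example assumes we already know the full set of relevant documents, which, as noted, is usually only possible with synthetic data.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: identifiers of the documents the search returned
    relevant:  identifiers of all documents truly relevant to the query
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    true_positives = retrieved & relevant  # retrieved and relevant

    # Guard against division by zero when either set is empty.
    precision = len(true_positives) / len(retrieved) if retrieved else 0.0
    recall = len(true_positives) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical synthetic example: 4 documents retrieved, 5 truly relevant.
retrieved_docs = {"doc1", "doc2", "doc3", "doc4"}
relevant_docs = {"doc2", "doc4", "doc5", "doc6", "doc7"}

precision, recall = precision_recall(retrieved_docs, relevant_docs)
print(f"precision = {precision:.2f}")  # 2 of the 4 retrieved are relevant
print(f"recall    = {recall:.2f}")     # 2 of the 5 relevant were retrieved
```

Here doc1 and doc3 are the false positives that lower precision, while doc5, doc6 and doc7 are the missed documents that lower recall.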

Wednesday, 23 June 2010

Peer Review Again

The New Scientist has an article about peer review and how science is failing. It shows how pseudo-science can end up as part of the public record, as it was introduced to Parliament by David Tredinnick. Tredinnick was quoting a university study that "showed" that homeopathic treatment can kill cancer cells. This article had been peer reviewed, and although it has now been clearly debunked it still has not been withdrawn. Peer review is failing. The problem is that this undermines confidence and belief in science. Science is about giving answers and, more fundamentally, about giving us rational grounds for making decisions. Faith in science can easily be destroyed when poor scientists let their internal views and convictions override their actual experiments. There are two ways errors like this can occur.

  1. Intentional deception.
  2. Accidental mistake.
The first we can deal with by applying ethics policies and reviewing our processes, but the second is harder to deal with. Even the greatest scientists are sometimes wrong because they have world views that turn out to be wrong: Einstein never accepted quantum mechanics, Mach never accepted atomic reality, and so on. Then there are scientists who make mistakes with their experiments or analysis, often through the abuse of statistics. Mendel fiddled his statistics and got the right answer; Fleischmann and Pons did not perform the correct measurements in their cold-fusion experiment, but there was no intention to deceive.

So how can we make peer review better? Certainly making it public and not anonymous would help people to behave more honestly and less competitively. What we need is a fundamental change in the way scientists behave. Science has to have less ego and less concern for reputation, both of which make it more likely that scientists will maintain views that even they realise are not as rigorous as they claim.

Saturday, 5 June 2010

Formal Systems Modelling

The following diagram is adapted from "Why Critical Systems Need Help to Evolve", B. Cohen and P. Boxer, Computer Magazine, IEEE, May 2010, pp. 56-63. This is a systems model that divides the system using three cuts. The first, the Endo-exo cut, divides the system's behaviour from its surroundings; the Cartesian cut then divides the identity of the behaviour from its manifestation; and the final Heisenberg cut divides supply from demand/need.

Thursday, 3 June 2010

Pre-print archives arXiv and snarXiv

One way that science is becoming more open is the setting up of online repositories of pre-print (not yet peer reviewed) articles. One of the first pre-print archives was for theoretical physics and is called arXiv. There is also a newer alternative archive snarXiv.

For an interesting comparison of the two different archives you should look at the arXiv vs snarXiv page.

Another interesting and similar example can be found in the discussions of the Social Text affair (also known as the Sokal affair). See also Sokal's paper discussing the Social Text paper.

Tuesday, 1 June 2010

Teleology - Plato's Final Cause

Most people have never heard of the word teleology but it is a very important idea. It is one of the oldest ideas and also one of the most controversial. It is intimately linked to the concept of Truth, and its opposites are the more relativistic theories of knowledge. It is as fundamental to the way we think as religion or atheism. The problem is that many people (myself included) who believe they oppose it strongly often fall into the trap of supporting it.

The question of teleology, rather than its contradiction of the Bible's account of creation, is what made Darwin's On the Origin of Species so controversial. Darwin introduced a two-step process. In the first step there was variability and in the second there was selection by reproductive success. The problem is this variability, as it has neither cause nor direction. In the Origin there is no need for progress, and if there is no progress then there is no move toward some final perfection of the universe, and so there is no ultimate cause, no reason for the universe to exist.

The modern synthesis of Darwinian Evolution as presented by Richard Dawkins in books like The Blind Watchmaker forgets this idea. While I am pretty sure that Prof. Dawkins would reject teleology completely from a philosophical viewpoint, it does creep into his work. The problem is that it is very hard for us not to feel that we are superior to the rest of the world about us, and so we instinctively talk of ourselves as a higher species that has evolved from more primitive species, and this in turn implies progress. This rising to perfection is the argument for design. Now teleology can either be "designed" internally, as a result of the behaviour of the system, or externally, as the result of a designer. This is the fundamental difference between the teleology of Dawkins and that of religion.

Teleology applied more generally to science makes it very hard for us not to think that science will finally give us an absolute answer to everything and that this answer will be true. Plato wrote extensively in the Republic about this idea. Can there be a truth that we can discover? At first our ideas are mere shadows on the cave wall, but can science discover the actual truth? Can we come out of the cave and see the world as it really is? Once we discover truth there is nowhere further for us to go. This would be the final aim of science - the ultimate cause of our investigations. Among educationalists teleology also creeps in where teachers think that there should be a convergence between pupil and teacher, so that in the end the student will find the truth of the master's view.

Two modern proponents of teleology are John Barrow and Frank Tipler who wrote "The Anthropic Cosmological Principle". This uses lots of probability arguments to suggest that life is incredibly unlikely (in much the same way as the intelligent design argument) and so the world must have been created in the way it has just for us. I have paraphrased a very long and difficult book that I have not managed to complete in over 20 years. In effect they go back to the view of the world of Genesis where the Earth was the centre of the heavens but in this new world view humans are at the centre of the universe. The question of Why? has an answer.

For me this sort of teleology is an incredible arrogance, but then I am also caught by the paradox that my rejection of teleology also becomes teleological. It is just something you should always be careful of when you make an argument. Am I falling into the trap of teleology? On the other hand, what do wishy-washy relativistic views of knowledge mean? So I bend towards relativism, but I am sympathetic to some of the teleological thinkers, such as the logical positivists. Marxism, for example, is a teleological view of the evolution of society that implies that the rise of the proletariat against the bourgeoisie is a final, inevitable state of human society. So anti-Marxist philosophers have used anti-teleological arguments against logical positivism (see Karl Popper's The Poverty of Historicism).