Last week, Alvis Tang (another PhD student at the Centre for Complexity Science) and I found out we had won the Summer Datachallenge, a data science competition hosted by Imperial’s . Well, we were joint winners with another physicist at Imperial, Jason Cole, but we were still very pleased.
The challenge involved taking a load of raw data concerning London in 2012, including house prices, tweets, Olympic medals and theatre ticket sales, as well as the data we ended up using, which came from London’s transport system. The tube data consisted of the numbers of entries and exits at each tube station, in 15-minute intervals throughout the whole year, coming to a ~100MB text file. We began by just exploring the data to see what patterns emerged.
This is the number of people entering and exiting my local station, Kennington, on this day 2 years ago.
As you might expect, there are 2 daily peaks around the morning and evening rush hours (this was a Tuesday). Since Kennington is a largely residential area, there are more people entering in the morning and more exiting in the evening, and the evening peak is more spread out – suggesting people leave for work at roughly the same time but get home at lots of different times. We made similar plots for every other tube station in the network and then decided to have a look at the Olympic period in the summer of 2012 to see whether there was a significant difference. In particular, we were interested in whether TfL’s ‘Get Ahead of the Games’ scheme, designed to get commuters to work from home, was successful.
So we picked a few stations around business areas, and compared their traffic on Olympic days against traffic on a usual summer day.
Here we are looking at the total traffic through the station – i.e. entrances + exits. For almost all the stations, this plot looks the same: there is no real difference during peak hours, but outside peak hours there is a small increase in passenger numbers. The one exception to this rule is Canary Wharf. Here there was a drastic reduction in numbers during peak commuting hours – so it looks like the ‘Get Ahead of the Games’ campaign worked a lot better here than it did anywhere else.
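A comparison like this could be sketched roughly as follows. This is only an illustration, not the code used for the challenge: the record layout (day, 15-minute slot, entries, exits), the function names and the sample numbers are all assumptions made up for the example.

```python
from collections import defaultdict
from datetime import date


def mean_total_by_slot(records, days):
    """Average (entries + exits) per time slot over the given set of days.

    `records` is an iterable of (day, slot, entries, exits) tuples, where
    `slot` labels a 15-minute interval, e.g. "08:15". This layout is
    hypothetical. A slot with no record on some day counts as zero traffic.
    """
    totals = defaultdict(int)
    for day, slot, entries, exits in records:
        if day in days:
            totals[slot] += entries + exits
    return {slot: total / len(days) for slot, total in totals.items()}


def relative_change(baseline, olympic):
    """Fractional change in traffic per slot, from baseline to Olympic days."""
    return {
        slot: (olympic[slot] - baseline[slot]) / baseline[slot]
        for slot in baseline
        if slot in olympic and baseline[slot] > 0
    }


# Made-up numbers for a single station, purely for illustration.
records = [
    (date(2012, 7, 10), "08:15", 900, 100),  # usual summer day, peak
    (date(2012, 7, 10), "14:00", 150, 150),  # usual summer day, off-peak
    (date(2012, 8, 7), "08:15", 540, 60),    # Olympic day, peak
    (date(2012, 8, 7), "14:00", 180, 180),   # Olympic day, off-peak
]
usual_days = {date(2012, 7, 10)}
olympic_days = {date(2012, 8, 7)}

baseline = mean_total_by_slot(records, usual_days)
olympic = mean_total_by_slot(records, olympic_days)
change = relative_change(baseline, olympic)
print(change)
```

With these invented figures the peak slot drops by 40% and the off-peak slot rises by 20%, which is the kind of pattern described for Canary Wharf. Real data would need averaging over many days of each type before reading anything into the difference.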
We guessed that the difference is largely due to HSBC, who reportedly had up to 40% of their workforce working from home, which more than accounts for the difference at commuting peak times.
We also looked at the individual entry gates to the station and found that the East entrance accounted for almost all of the drop in traffic, with the West entrance seeing little difference, and it is the East entrance which is nearest to HSBC’s offices.
It’s a hard problem. For very applied work, it might be possible to see how the technology sector picks up various scientific advances and judge merit on that basis, but for most scientists the impact of their research is only seen on long timescales, much longer than the few years they spend on an individual project, or working with one particular group.
Traditionally, citations have been used as a measure of success. The assumption is that other scientists will be aiming to work on important topics, and so if they have cited some research it must have been useful to them, and therefore also important. So the more citations a paper has, the more useful it has been. The obvious problem with this analysis, though, is that it assumes that all citations carry the same amount of usefulness. In reality, there are many reasons to cite something beyond it being a genuinely useful inspiration to the research in question. Writers often cite well-known works in their field, even if largely irrelevant to their current research, or cite a paper to explain that its interpretation of something or other is incorrect. And because citations are used as a measure of success, writers often cite their own papers, or those of their collaborators, even when not really necessary.
This problem is most clearly uncovered in Simkin & Roychowdhury’s 2002 paper, Read before you cite! They reason that authors who read a paper and then decide to cite it have some small chance of making a typographical mistake when writing their bibliography (note, this was before the time of software reliably auto-generating bibliographies, as is commonplace now). So perhaps paper A is cited by paper B, but a spelling mistake is made in the citation. Paper C then cites both B and A – but makes the exact same mistake as paper B did. Simkin & Roychowdhury then reason that it is likely that the authors of paper C didn’t read paper A at all – they just copied the reference from the bibliography of paper B. After all, the chances of them independently making an identical typo seem very small, and if you did read the paper you would presumably just write down the authors’ names from the original. Using this simple model they calculate that around 80% of citations (in their dataset) are made without the citer even reading the paper they are citing!
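The logic of the misprint argument can be illustrated with a small sketch. It is a simplified estimator in the spirit of the argument, and the paper’s own calculation may differ in detail: it assumes that each distinct misprint was introduced once by someone typing the reference out themselves, and that every repeat of the same misprint was copied from another bibliography. The function name and the example citations are hypothetical, not data from the paper.

```python
from collections import Counter


def estimate_reader_fraction(misprinted_citations):
    """Rough share of misprinting citers who typed the reference themselves.

    Simplifying assumption: each distinct misprint has exactly one
    originator, and all further occurrences are copies. The estimate is
    then (number of distinct misprints) / (total number of misprints).
    """
    counts = Counter(misprinted_citations)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one misprinted citation")
    return len(counts) / total


# Hypothetical misprinted forms of one reference, as found in citing papers.
misprints = [
    "Phys. Rev. 112, 1940",
    "Phys. Rev. 112, 1940",
    "Phys. Rev. 112, 1940",
    "Phys. Rev. 121, 1904",
    "Phys. Rev. 121, 1904",
]
readers = estimate_reader_fraction(misprints)
print(f"estimated readers: {readers:.0%}, copiers: {1 - readers:.0%}")
```

Here two distinct misprints appear five times in total, so the sketch puts the share of original typists at 2/5 and the share of copiers at 3/5. The estimate only uses citations that contain a misprint, and treats them as representative of citers in general.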
This should all be very concerning for those of us who care about how many times our papers get cited. Measures such as the h-index and impact factors are all ultimately based on counting citations. Does it mean that citation analysis is fundamentally flawed? I don’t think so – it just means that we need to use more sophisticated tools than just counting citations, and look at other aspects of the topology of citation networks and at other information we have about papers. This has been the subject of some of my recent research, and so I hope to write a few more posts soon to ‘catch up’ to where that research currently is.
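To make concrete how directly such measures depend on raw counts, here is a minimal sketch of the h-index: the largest h such that h of an author’s papers have at least h citations each. The function name and the example counts are made up for illustration.

```python
def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank  # the top `rank` papers all have >= rank citations
        else:
            break
    return h


# Hypothetical citation counts for one author's five papers.
print(h_index([10, 8, 5, 4, 3]))
```

Nothing in this calculation distinguishes a citation from someone who read the paper from one copied out of another bibliography, so every count feeds into the result equally.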
I’m James Clough, a PhD student in Physics at Imperial College London – this is my personal website. I am currently working in the Centre for Complexity Science (formerly the Complexity & Networks group) at Imperial, supervised by Dr. Tim Evans and Prof. Kim Christensen. My research is in complex networks, and my interest is particularly in networks constrained by causality, such as citation networks. My work focuses on using the special constraints and properties of these networks to find new ways of characterising and modelling their structure. Other academic interests include game theory, statistics and other aspects of complexity science, such as applications in economics and government policy.