Archive for the ‘Good Data’ Category

Just to let you know that the Twitter account @Scrape_No10, which tweets out ministers’, special advisers’ and permanent secretaries’ meetings, gifts and hospitality, is back up and tweeting. You can read the post about its creation here and download all the data the account contains. This account needs more coding maintenance than the @OJCstatements account (read about it here) because the data is contained in CSV files posted onto a webpage. I code sentences to be tweeted from the rows and columns. The scraper feeding the Twitter account feeds off 5 separate scrapers of the CSV files. Because of this, the account is more likely to throw up errors than the simple scraping of the Office for Judicial Complaints site.
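For anyone wondering what "coding sentences from the rows and columns" might look like, here is a minimal Python sketch. It is not the code behind @Scrape_No10: the column names (Minister, Date, Organisation, Purpose), the sample row and the sentence template are all made up for illustration, and the 140-character limit is the tweet length at the time of writing.

```python
import csv
import io

# Hypothetical sample data; the real CSV column names may well differ.
SAMPLE_CSV = """Minister,Date,Organisation,Purpose
Example Minister,May 2010,Example Org,Introductory meeting
"""

def row_to_tweet(row, limit=140):
    """Build one sentence from a CSV row and trim it to the tweet limit."""
    sentence = "{Minister} met {Organisation} in {Date}: {Purpose}".format(**row)
    if len(sentence) > limit:
        # Leave room for the three-character ellipsis.
        sentence = sentence[:limit - 3].rstrip() + "..."
    return sentence

def tweets_from_csv(text):
    """Read CSV text and return one tweetable sentence per data row."""
    reader = csv.DictReader(io.StringIO(text))
    return [row_to_tweet(row) for row in reader]

tweets = tweets_from_csv(SAMPLE_CSV)
```

A row with a missing or renamed column would raise a KeyError here, which is roughly the kind of error that stops an account like this until the scraper is fixed by hand.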

So I decided, as I’m learning to code and structure scrapers, to run the scrapers manually every time the Twitter account stops, fix the bugs and set the account tweeting again. There will be better ways to structure the scrapers but right now I’m concentrating on the coding.

Learning to scrape CSVs is very handy as lots of government data are released as CSV. There is a CSV documentation/tutorial on ScraperWiki, although it is aimed at programmers. For those interested in learning to code/scrape I would recommend “Learn Python the Hard Way” (which is the easiest for beginners, it’s just ‘hard’ for programmers because it involves typing code!). For more front end work I have recently discovered Codecademy. I can’t vouch for it but it looks interesting enough. I have also put all the datasets for the @Scrape_No10 account on BuzzData as an experiment.

Data is the new word for information. But Information Journalist implies every other journalist is just a churnalist. Which is most definitely not the case. If data is anything in a database then I’m looking beyond that. For me data is any piece of information that can be turned to journalistic use. So rather than confine my scraping to CSVs and data releases, I can take anything from the web I think will be useful for the public to know.

Here’s something that is in the public domain but not the public sphere: Statements from the Office for Judicial Complaints where judges are reprimanded or struck off.  The OJC deals with complaints about the personal conduct of judges. Examples of possible personal misconduct might be use of insulting, racist or sexist language in court, or inappropriate behaviour outside the court such as a judge using their judicial title for personal advantage or preferential treatment. So they can be reprimanded and struck off for personal misconduct by the OJC but the OJC does not have the power to investigate or call into question any of their previous judgements.

So I’ve put all the statements, with a link to the PDF documents detailing their case with the OJC, on Twitter. Any new statements should be picked up by my scraper (which will run daily) and then be tweeted out. If anyone who has dealt with a tweeted judge has something to add please reply to the tweet or use the hashtag #OJC.

Seeing as I like to fly in the face of tradition, I’m going to turn things on their head and write a blog post of how I did it before I publish what “it” actually is. That is, I have scraped all the Cabinet Office spending data, cleaned it up and extracted it. But before I tell you what I’ve found (indeed, I haven’t got around to that properly yet!), I’m going to tell you how I found it.

Firstly, I scraped this page to pull out all the CSV files and put all the data in the ScraperWiki datastore. The scraper can be found here. It has over 1,200 lines of code but don’t worry, I did very little of the work myself! Spending data is very messy with trailing spaces, inconsistent capitals and various phenotypes. So I scraped the raw data which you can find in the “swdata” tab. I downloaded this and plugged it into Google Refine. I used the text facet functions to clean up the suppliers’ names as best I could (I figured these were of the most interest and would be more suitable for cleaning). Google Refine lets you export these cleaning steps as code: go into the “Undo/Redo” tab and click on “Extract…”. Select the processes you want the code for, then copy the right hand box. I pasted this prepackaged code into my scraper.
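To give a feel for the kind of cleaning involved, here is a small Python sketch that groups supplier names differing only in spacing or capitals. It is a stand-in I have written for illustration, not the code exported from Google Refine, and the supplier names are invented.

```python
import re
from collections import defaultdict

def clean_key(name):
    """Collapse runs of whitespace, trim the ends and ignore case."""
    return re.sub(r"\s+", " ", name).strip().lower()

def group_variants(names):
    """Map each cleaned key to the set of raw spellings that share it."""
    groups = defaultdict(set)
    for name in names:
        groups[clean_key(name)].add(name)
    return groups

# Invented examples of the same supplier typed three different ways.
raw = ["Acme Ltd ", "ACME LTD", "acme  ltd", "Other Supplier"]
groups = group_variants(raw)
```

Any key with more than one raw spelling is a candidate for merging, which is much the same judgement call you make by eye in a text facet.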

So if you want the cleaned data make sure you select the “Refined” table by hitting the tab and selecting “Download spreadsheet (CSV)”. If you want to use the amount as a numerical field (it was not put in as such in the original!) to get totals for each supplier, for example, you’ll have to use the refined table as I had to write code to get the “Amount” as numbers. Or if you know a bit of SQL and want to query the data from ScraperWiki you can use my viewer, which can be found here. Either way, here is the data. I have already found something of interest which I am chasing but if you’re interested in data journalism here is a data set to play with. Before I can advocate using, developing and refining the tools needed for data journalism I need journalists (and anyone interested) to actually look at data. So before I say anything of what I’ve found, here are my materials plus the process I used to get them. Just let me know what you find and please publish it!
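If you would rather convert the amounts yourself, the idea can be sketched in a few lines of Python. This assumes the amounts arrive as text with pound signs and thousands separators, which may not match every file; the rows below are invented, and this is not the code in my scraper.

```python
from collections import defaultdict
from decimal import Decimal, InvalidOperation

def parse_amount(text):
    """Turn text such as '£1,200.50' into a Decimal, or None if it is not a number."""
    cleaned = text.replace("£", "").replace(",", "").strip()
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None

def totals_by_supplier(rows):
    """Sum the parseable amounts for each supplier, skipping unparseable ones."""
    totals = defaultdict(Decimal)
    for supplier, amount_text in rows:
        amount = parse_amount(amount_text)
        if amount is not None:
            totals[supplier] += amount
    return dict(totals)

# Invented rows: (supplier, amount as it appears in the raw text).
rows = [
    ("Supplier A", "£1,200.50"),
    ("Supplier B", "300"),
    ("Supplier A", " 99.50 "),
    ("Supplier B", "n/a"),
]
totals = totals_by_supplier(rows)
```

Decimal is used rather than float so that pounds and pence add up exactly.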


Here is a table of the top 10 receivers of Cabinet Office money. I’ve put the image in here but the original is a view that feeds off the scraper so as the data gets published, this table should update. So the information becomes living information not a static visual. The story is being told not catalogued.

Oh and V is V inspired youth volunteering. They received nearly £44 million over a nine month period. On their website they say they have received over £48 million from the private sector. I imagine £44 million of that has come straight from the Cabinet Office. The Big Society seems to be costing the government a lot of money at the moment even though they say it will be mostly funded by the private sector.

#opendata from Open Knowledge Foundation on Vimeo.

You might be wondering what this short documentary has to do with journalism or even what open data has to do with journalism. No doubt you are aware that journalism has been facing a ‘crisis’ for a while now. Not just because of the recession and shrinking advertisers but because of the dominance of the web for getting information to people and allowing them to share amongst themselves.

Open data activists are working with the web to provide information in a way people can engage with and ultimately feel empowered by. Projects like FixMyStreet and Schooloscope are emblematic of this rise in civic engagement projects. Indeed, crime mapping in San Francisco led to local citizens demanding more policing in areas of high crime and a change in the policing schedule to reflect the hours when crime is at its highest.

News organizations used to have some responsibility in this area of engagement but never quite understood the field or didn’t know quite what to do with it. Now they have completely lost control and the masters of the web platforms are again taking informational control of a growing area of interest. But news organizations are missing a very important trick. Data-driven journalist Mirko Lorenz has written about how news organizations must become hubs of trusted data in a market seeking (and valuing) trust.

Which is why I think anyone interested in the area of data journalism should watch this documentary, as not only should traditional media be training journalists to engage with this new streaming of social and civic data, but managers and execs should think about the possible shifting in the traditional media market away from advertising and towards the trust market.

A recent blogpost by TotalPolitics says:


In order to get on top of growing mountains of correspondence and keep on digging through acres of committee and legislative papers MPs are having to take on more staff on a fixed staffing allowance, either paying lower wages or taking people on a volunteer basis.

This comes off the back of Nick Clegg’s initiative to get Westminster interns paid. The blog also addresses Westminster pay in general, quoting a staff survey. For a clearer picture a ScraperWiki user, MemeSpring, scraped the jobs data from Work4MP. This is the historic data from when the site first started in 2004.

So I threw the data into Google Refine and of the 2,661 job postings 30% were unpaid internships (791). Shockingly, there was only ever one internship posting that paid minimum wage and this was with Citigate Dewe Rogerson.

The highest demander of unpaid interns is actually the British Youth Council (with 18 postings) and the MP who advertised the most for unpaid work is Liz Lynne with a total of 10 internship positions. Now these internship listings include Parties and groups like Alcohol Concern. But looking at just MPs, Parties and Westminster, they account for over 300 unpaid positions with most ‘salaries’ consisting of travel, lunch and reasonable expenses.

It’s also odd that this is Nick Clegg’s initiative because the vast majority of internships sought by political party groups come from the Liberal Democrats.

So when TotalPolitics writes in defence of Government pay:


The vast majority of people involved in politics are volunteers – canvassers, committee members, deliverers, agents and organisers who want their party to succeed and gain office; of those few who are paid they are by and large paid poorly and work extraordinarily long hours, with precious little thanks.

Could this not be read the other way around? MPs getting paid whilst using an army of young naive interns to do their work for free. No doubt they put these interns on their list of expenses.

The road to No.10 is paved with advisers, they lead you in, they open doors. Often for themselves. Previous advisers include Alastair Campbell, Ed Balls and the Miliband brothers. Until they’re in the door they generally don’t command the political spotlight. That is, unless they’re on the way out like Andy Coulson. What they do command is fine wining and dining.

The Cabinet Office publishes Special Advisers’ gifts and hospitality in various Excel sheets that are filled in depending on how much coffee the civil servant had that morning, i.e. inconsistently. They weren’t even consistent with the appointed minister the adviser falls under. So I scraped it and put all the files into one download which covers May to September 2010. You can get it all by hitting the ‘Download spreadsheet (CSV)’ link here.

The Trends:

Here are the advisers listed according to the amount of hospitality they received:

Note that Nick Clegg’s chief adviser, Jonny Oates, has been taken out the most followed by the then PM’s communications chief, Andy Coulson. Most hospitality is provided by media organisations (see table below) and by using Google Refine I dug deeper into the data to look for a bias between advisers for the Prime Minister and Deputy Prime Minister (seeing as there’s a party split). It turns out the BBC only court Cameron’s advisers (15 times in 5 months). The same is true of the Daily Mail. Whereas The Financial Times dine only with those close to Clegg. The Guardian similarly enjoy Lib Dem company, inviting them to their table twice as many times as they did the Tories.
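I did the split in Google Refine, but the same comparison can be sketched in Python for anyone who prefers code. The rows and organisation names below are invented placeholders, not the real hospitality data, and the two-column layout is an assumption about how you might arrange it.

```python
from collections import Counter

# Invented rows: (whose adviser was hosted, organisation providing hospitality).
entries = [
    ("Prime Minister", "Org A"),
    ("Prime Minister", "Org A"),
    ("Deputy Prime Minister", "Org B"),
    ("Prime Minister", "Org C"),
    ("Deputy Prime Minister", "Org C"),
    ("Deputy Prime Minister", "Org C"),
]

def split_by_principal(entries):
    """For each organisation, count occasions as (PM's advisers, DPM's advisers)."""
    counts = Counter(entries)
    organisations = sorted({org for _, org in entries})
    return {
        org: (
            counts[("Prime Minister", org)],
            counts[("Deputy Prime Minister", org)],
        )
        for org in organisations
    }

table = split_by_principal(entries)
```

An organisation with a zero on one side of the pair is one that only ever hosted one camp, which is the pattern described above.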

What’s very noticeable from this information is that Special Advisers are wined and dined mostly by media organisations. Here is a list of the top 10 hospitality givers:

If you add up all of Rupert Murdoch’s empire, they account for 20 occasions split 13:7 Cameron’s to Clegg’s.

The close relationship between advisers and media organisations (this is all within a five month period) makes me wonder: when a ‘No.10 insider’ or ‘someone close to the Prime Minister’ is quoted, how often is that piece of information plucked from the lips of these well-fed advisers? A lot I imagine.

The Outliers:

In fact, media and PR are so predominant in hospitality for advisers, I’ve decided to list the rest of the givers in order of how many times they appear in the data: Bell Pottinger (mostly business clients, Airbus, Sky, Unilever, etc), News Corporation, Tetra Strategy (clients include the Government of Dubai and the jailed Russian billionaire, Mikhail Khodorkovsky), The Daily Telegraph, The Mail on Sunday, The Sunday Times, The Telegraph, Alexander Kutner, Baron Wolfson of Aspley Guise (Conservative life peer and CEO of Next), Business in the Community, Center for Court Innovation (New York think-tank), Citi, Connect Communications (lobbying), Demos (think-tank), General Sir Richard Dannatt, ITN, ITV, Ian Osborne and Partners, Institute for Public Policy Research (think-tank), Islamic Relief, James Kempton, Lansons Communications (clients include J.P. Morgan, Lloyds TSB and Barclays), London Palladium (Whoopi Goldberg), Malaria No More, Martyn Rose, Medley Global Advisors (“provider of macro policy intelligence service for the world’s top hedge funds, institutional investors, and asset managers”), News International, Lawn Tennis Association, Not to Scale, Open Road, Pakistan International Airlines, Policy Exchange (think-tank), RSA, Ramesh Dewan, Richard Thaler, Royal Bank of Scotland, SAB Miller (De Klerk Foundation Event), Save the Children, Taxpayers’ Alliance, The Daily Telegraph and The Daily Mirror, The Economist, The Evening Standard, The Spectator, The Sun, The Sunday Express, UK Music, Wall Street Journal and Wellington College.

Bell Pottinger and Martyn Rose are now with the Big Society Network.

Only six entries weren’t lunch or dinner dates. Steve Hilton was given champagne from Not to Scale, Tim Chatwin received concert tickets from Malaria No More, Naweed Khan got his flights upgraded by Pakistan International Airlines, Andy Coulson was given theatre tickets by Whoopi Goldberg and a bottle of wine by one Alexander Kutner.

The Anomalies:

Now the only Alexander Kutner I can find happens to have been the Vice President and Principal Engineer of Software Development at Electronic Evidence Discovery. They reduce the risk of electronic discovery, a process which involves digital forensics analysis for recovering evidence. (See comment below regarding the identity of Alexander Kutner.)

Also, Ian Osborne and Partners, who dined Tim Chatwin, has no existence according to Google.

The MetaData:

What’s missing is what went on at these meals. Who attended. What was said, or agreed upon. Who was being represented. What goes on is not an entry in the data sheets and it never will be. But this data should make you more aware of the existence of these meals on deals.

You can find a list of Special Advisers and their salaries here.

Numerical analysis sounds boring to most and way too complicated to nearly all news organizations. But the advent of multimedia and interactives means that facts and figures can and should be made accessible and exciting. Here is a TED talk using simple PowerPoint. The topic and research are so affecting that I am just glued to my computer screen. A good use of data which goes to show that news agencies are just plain slow on the uptake.

As Information is Beautiful shows, there is a lot being done with data at academic institutions. At the moment there is a strong push by the data mining community to get government data out into the open but academic data is, I think, the real diamond in the rough.

If you are working with academic data or want to, let me know. Follow @DataMinerUK.