Posts Tagged ‘Google refine’

Although “data journalism” can encompass infographics, interactives, web apps, FOI, databases and a whole host of other numbering, coding and displaying techniques, the road less travelled by has certain steps, turns and speed bumps. In that sense, here’s a list of things to tick off if you’re interested in going down the data journalism road:

  1. Know the legal boundaries – get to know the Data Protection Act 1998 and the sections on access to personal data and unstructured personal data held by authorities. Do not set foot on your journey without reading the section on exemptions relating to journalism. Use legislation as a reference by downloading the Mobile Legislate app.
  2. Look at data – get to know what is out there, what format it’s in and where it’s coming from. Places like the London Datastore, the Office for National Statistics and Get the Data are good places to start for raw data but don’t forget, anything on the web is data. The best data are often hidden. Data can be text and pictures, so even mining social media and catching the apps built from them can give you insight into what can be done with data.
  3. Read all about it – to make data and stats accessible you need to know how to frame them within a story. In that sense, you need to know how to understand the stories they tell. That doesn’t mean going on a stats course. There is a lot of accessible reading material out there and I would recommend The Tiger That Isn’t.
  4. Get connected – find HacksHackers near you and join Meetup groups to point you in the right directions. Data journalists’ interests and abilities are unique to the individual (much like programmers) so don’t take any piece of advice as set in stone (the web changes too quickly for that!). Find your own way and your own set of people to guide you. Go to courses and conferences. Look outside the journalism bubble. Data is more than just news.
  5. Spread your bets – the easiest way to sort data is by using spreadsheets. Start with free options like Google Docs and OpenOffice. Industry standards include Microsoft Excel and Access. Learn to sort, filter and pivot. Find data you’re interested in and explore it using your eyeballs. Know what each piece of software does and can do to the data before mashing it with another piece of software.
  6. Investigate your data – query it using the simple language SQL and the software MySQL. It’s a bit tricky to set up but by now you’ll know a hacker you can ask for help! Clean your data using Google Refine. There are tutorials and a help wiki. Know how these function, not just how to navigate the user interfaces, as the interfaces will change. These products go through iterations much more quickly than the spreadsheet software.
  7. Map your data – from Google spreadsheets the easiest way to build a map is by using MapAList. There is a long list of mapping software from GeoCommons to ArcGIS. Find what’s easiest for you and most suitable for your data. See what landscapes can be revealed and home in on areas of interest. Understand the limitations of mapping data: you’ll find devolution makes it difficult to get data for the whole of the UK and some postcodes will throw up errors.
  8. Make it pretty – visualize your data only once you fully understand it (source, format, timeframe, missing points, etc). Do not jump straight to this as visuals can be misleading. Useful and easy software solutions include Google Fusion Tables, Many Eyes and Tableau. Think of unique ways to present data by checking out what the graphics teams at news organizations have made but also what design sites such as Information is Beautiful and FlowingData are doing.
  9. Make your community – don’t just find one, build one. This area in journalism is constantly changing and for you to keep up you’ll need to source a custom made community. So blog and tweet but also source ready-made online communities from places like the European Journalism Centre, National Institute for Computer Assisted Reporting (NICAR), BuzzData and DataJournalismBlog.
  10. Scrape it – do not be constrained by data. Liberate it, mash it, make it usable. Just like a story, data is unique and bad data journalism comes from constraining it to the medium that contains it. With code, there is no need to make the story ‘fit’ into the medium. “The Medium is the Message” (à la Marshall McLuhan). Scrape the data using ScraperWiki and make applications beyond storytelling. Make data open. For examples check out OpenCorporates, Schooloscope and Planning Alerts. If you’re willing to give coding a try, the book “Learn Python the Hard Way” is actually the easiest way in for the non-programmer. There is also a Google Group for Python Journalists you should join.
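To make the scraping step above a little more concrete, here is a minimal sketch of the very first stage of a CSV scraper: finding the spreadsheet links on a publications page. The HTML snippet and URLs below are invented for illustration, not any real government page.

```python
import re

# A toy HTML snippet standing in for a publications page
# (the URLs are hypothetical, for illustration only).
html = '''
<a href="/downloads/spend-may-2010.csv">May 2010</a>
<a href="/downloads/spend-jun-2010.csv">June 2010</a>
<a href="/about.html">About</a>
'''

# Pull out every link ending in .csv -- the first step of a scraper.
csv_links = re.findall(r'href="([^"]+\.csv)"', html)
print(csv_links)
```

In a real scraper you would fetch the live page (e.g. with urllib) and then download each linked file, but the pattern-matching step is the same.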
These are guidelines and not a map for your journey. Your beat, the data landscape, changes at the speed of the web. You just need to be able to read the signs of the land as there’s no end point, no goal and no one to guide you.


If you have been keeping an eye on my blog you’ll know I scraped the Cabinet Office spending data. Few journalists will look at the mountain of CSVs of government data. Even fewer can code well enough to scrape them, although a lot of them want to and I believe doing so will address the former problem. More news institutions are interested in using data to create visualizations for their users. Give them something to play with and they spend more time on your site. So I’ve created my first visual from scraped data. Click on the image to be taken to the view page (sadly WordPress can’t embed ScraperWiki views).

With help from the amazing Tom Mortimer-Jones and Ross Jones at ScraperWiki, I made a word cloud with a date slider for all the companies the Cabinet Office gives money to. This is real-time visualization (well, as real-time as the data releases). Here is the scraper where you can download all the data. I refined the supplier names using Google Refine and you can see the result in the ‘Refined’ table. I made the word cloud in this view. I summed up all the spending for each supplier in the SQL query and used the logarithmic value as the font size of the supplier’s name in the word cloud. The final view then calls up the word cloud for the date range selected on the date slider (code was nicked from this jQuery plugin) by plugging the request into the SQL query of the word cloud view. That might seem very confusing but this blog is for all my workings. The code is open so you can take a look. I am open source.
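The sum-then-logarithm step can be sketched like this; the table and column names below are illustrative stand-ins, not the actual scraper’s schema, and the figures are made up.

```python
import math
import sqlite3

# In-memory stand-in for the ScraperWiki datastore.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE refined (supplier TEXT, amount REAL)")
con.executemany("INSERT INTO refined VALUES (?, ?)", [
    ("Acme Ltd", 40000.0), ("Acme Ltd", 60000.0), ("Widget Co", 1000.0),
])

# Total spend per supplier, as in the word-cloud view's SQL query.
rows = con.execute(
    "SELECT supplier, SUM(amount) AS total FROM refined GROUP BY supplier"
).fetchall()

# Use the logarithm of the total as the font size, so one giant
# supplier doesn't drown out everyone else in the cloud.
sizes = {supplier: round(math.log(total), 2) for supplier, total in rows}
print(sizes)
```

The logarithm compresses the range: a supplier paid a hundred times more than another gets a noticeably larger font, not one a hundred times taller.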

I want this to be a preview of what is possible. All government bodies are now required by law to release spending data of over £25,000. That’s a lot of data from a lot of bodies. OpenSpending will be tackling this. My thoughts have been about trying to get journalists/bloggers/students learning to scrape. I figure the most useful type of scraping for journalists will be CSV scraping. So I want volunteers to take the journey I have done with this view and learn to scrape one spending dataset.

If I get 20 such people to work together to build a resilient scraper from a template, they can learn from each other, i.e. when one person’s scraper breaks because a new bug has been introduced, no doubt one of the other 19 volunteers has come across and dealt with that same bug in their learning process. So by maintaining a community of scrapers, the community will be learning to scrape. And the community can do more with the data. For example, by adding category columns such as central government, health, work and pensions, etc., these can be used as filters for the visualization (and to interrogate the data).
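A resilient scraper template might look something like this sketch: each row is parsed inside a try/except so one malformed line doesn’t kill the whole run, and a hand-added category column supports filtering later. The column names, sample rows and category value are all invented for illustration.

```python
import csv
import io

# Toy CSV standing in for one body's spending release, including a
# deliberately broken row.
raw = "Supplier,Amount\nAcme Ltd,1000\nBadRow,not-a-number\nWidget Co,250\n"

records, errors = [], []
for row in csv.DictReader(io.StringIO(raw)):
    try:
        records.append({"supplier": row["Supplier"].strip(),
                        "amount": float(row["Amount"]),
                        # category column added by hand, used later as a
                        # visualization filter (illustrative value)
                        "category": "central government"})
    except ValueError:
        errors.append(row)  # keep the bad row for another volunteer to fix

print(len(records), len(errors))
```

The point of the template is that the error list is a shared artefact: when one scraper hits a new quirk, the fix can be copied across the other nineteen.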

It’s an idea, for an experiment. I’ll let you know how I get on. In theory this view can be kept as up-to-date as the data!

Seeing as I like to fly in the face of tradition, I’m going to turn things on their head and write a blog post about how I did it before I publish what “it” actually is. That is, I have scraped all the Cabinet Office spending data, cleaned it up and extracted it. But before I tell you what I’ve found (indeed, I haven’t got around to that properly yet!), I’m going to tell you how I found it.

Firstly, I scraped this page to pull out all the CSV files and put all the data in the ScraperWiki datastore. The scraper can be found here. It has over 1,200 lines of code but don’t worry, I did very little of the work myself! Spending data is very messy, with trailing spaces, inconsistent capitals and variant spellings. So I scraped the raw data, which you can find in the “swdata” tab. I downloaded this and plugged it into Google Refine. I used the text facet functions to clean up the suppliers’ names as best I could (I figured these were of the most interest and would be most suitable for cleaning). This can be done by going into the “Undo/Redo” tab and clicking on “Extract…”. Select the processes you want the code for, then copy the right-hand box. I pasted this prepackaged code into my scraper.
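The kind of cleanup Refine’s text facets automate can be sketched by hand: trim whitespace, squash runs of spaces and normalise case so variants of the same supplier name collapse into one. The sample names below are invented.

```python
import re

def clean_supplier(name):
    # Strip leading/trailing spaces, collapse internal runs of spaces,
    # then apply consistent capitalisation.
    name = re.sub(r"\s+", " ", name.strip())
    return name.title()

# Three variants of the same (hypothetical) supplier entry.
variants = ["CAPITA  ", " capita", "Capita"]
cleaned = {clean_supplier(v) for v in variants}
print(cleaned)
```

Refine goes further than this (clustering genuinely different spellings, for instance), but the principle of collapsing variants into one canonical name is the same.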

So if you want the cleaned data make sure you select the “Refined” table by hitting the tab and selecting “Download spreadsheet (CSV)”. If you want to use the amount as a numerical field (it was not entered as such in the original!) to get totals for each supplier, for example, you’ll have to use the refined table as I had to write code to get the “Amount” field as numbers. Or if you know a bit of SQL and want to query the data from ScraperWiki you can use my viewer, found here. Either way, here is the data. I have already found something of interest which I am chasing, but if you’re interested in data journalism here is a data set to play with. Before I can advocate using, developing and refining the tools needed for data journalism I need journalists (and anyone interested) to actually look at data. So before I say anything of what I’ve found, here are my materials plus the process I used to get them. Just let me know what you find and please publish it!
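Turning the amounts into numbers comes down to stripping the currency formatting before converting; a minimal sketch (the sample values are made up):

```python
# Spending amounts arrive as text ("£1,234.56"); stripping the pound
# sign and thousands separators lets them be summed as numbers.
def to_number(amount):
    return float(amount.replace("£", "").replace(",", "").strip())

total = to_number("£1,234.56") + to_number("£765.44")
print(total)
```

Without this step, a spreadsheet or SQL query treats the amounts as strings and SUM() or sorting by size silently gives nonsense.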


Here is a table of the top 10 receivers of Cabinet Office money. I’ve put the image in here but the original is a view that feeds off the scraper, so as the data gets published this table should update. The information becomes living information, not a static visual. The story is being told, not catalogued.

Oh, and V is v, the youth volunteering charity. They received nearly £44 million over a nine-month period. On their website they say they have received over £48 million from the private sector. I imagine £44 million of that has come straight from the Cabinet Office. The Big Society seems to be costing the government a lot of money at the moment, even though they say it will be mostly funded by the private sector.

The Cabinet Office, in a move towards greater transparency, are attempting to publish all their data online. This isn’t really news but I don’t think news organizations are looking at this data so I’m scraping it and seeing what it has to offer. So as an exercise I’m scraping the page where ministerial gifts, hospitality, travel and meetings with external organisations are published as CSV or PDF. All this should be pretty much covered by Who’s Lobbying but I’m hoping to set up a little social media experiment (more on that to come). So here is all the data, set to scrape the site every month. You can download it all.

I whacked it into Google Refine to deal with the different spellings, nuances and the change in the format of the date. The date transformation option never seems to work for me in Refine so I exported it and opened it up in Excel to get the data out in chronological order. This may sound cumbersome to those who don’t work with data but it’s actually quite quick and easy once you’ve tried it. Anyway, I looked at some of the more popular reasons for meeting ministers and grabbed a screenshot of the Excel table (Refine allows you to export an HTML table but I’d have to get it to open up in Firefox so I can use my full-page grab add-on).
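One way round Refine’s date troubles is to parse the mixed formats yourself before sorting. Here’s a sketch; the three format strings are guesses at the kind of variation in the sheets, not the actual formats used.

```python
from datetime import datetime

# Try each known date pattern until one parses.
FORMATS = ("%d/%m/%Y", "%d %B %Y", "%Y-%m-%d")

def parse_date(text):
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    raise ValueError("unrecognised date: " + text)

# Mixed-format dates (invented), sorted chronologically.
dates = ["3 May 2010", "2010-06-14", "01/07/2010"]
print(sorted(parse_date(d).date().isoformat() for d in dates))
```

Once every date is a real date object (or an ISO string), chronological order is just an ordinary sort.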

I looked at the meetings for Big Society:

The major meeting with the Prime Minister and Deputy Prime Minister in May involved Young Foundation, Community Links, Antigone, Big Society Network, Balsall Health Forum, London Citizens, Participle, Talk About Local, CAN Breakthrough, Mayor of Middlesbrough, Business in the Community, Esmee Fairbairn, Greener Leith, St Giles Trust, Big Issue Invest, Kids Company. Since then there has been a steady trickle of over 30 meetings with Nick Hurd, Oliver Letwin and Francis Maude about Big Society. Note that these are all Conservative MPs so the Big Society is already looking smaller along coalition party lines.

Sure, they have the titles to be involved but the trend in the data seems to be more about big financing. Meetings with the likes of Goldman Sachs, Barclays, British Banking Association and Co-op Financial Services lead one to believe that Big Society is being outsourced to local communities but the big financing has to come from the top. In Building the Big Society, the Cabinet Office writes:

We will use funds from dormant bank accounts to establish a Big Society Bank, which will provide new finance for neighbourhood groups, charities, social enterprises and other nongovernmental bodies

What are ‘funds from dormant bank accounts’ and why didn’t they use these instead of looking to the government to bail them out? The banks and their reckless trading in toxic assets and credit default swaps led to a massive recession. This shed the light on reckless government borrowing and the massive deficit. This led to budget cuts to local services and the need for the Big Society. Which is now being funded by the banks! Am I missing something?

The next thing to look at from the data is the category ‘Introductory Meeting’:

Introductory meetings interest me as I imagine it pays to be at the back of a politician’s mind. It must be worthwhile to have some ear time and get your points across. I’m sure not any old Joe can get an introductory meeting. There must be PR companies that specialise in getting these meetings (lobby firms), so it’s interesting how many large companies appear on this list. In fact, the purpose of one meeting was put down as ‘Lobbying’ with the UK Public Affairs Council. They have a register of firms and clients published in evil PDF (go figure). Will have to scrape that.

Lastly, I thought the ‘Renegotiation of Contract’ category might be of interest so here it is:

A lot of these are big technology companies yet the government is notorious for accumulating huge costs with little effectiveness when it comes to implementing new IT systems. I also wonder whether Vodafone’s tax dispute was known during the negotiation of their contract with the Cabinet Office.

I’m getting the data out so that anyone with inside knowledge can put two and two together to further the information. I’m churning the data in so that what can be churned out is journalism and not churnalism. That’s the idea anyway. Just looking at the data is a step in the right direction so anyone interested in data journalism, just keep on looking at what’s coming out. And I’ll try and put it into a context that has journalistic value.

A recent blogpost by TotalPolitics says:


In order to get on top of growing mountains of correspondence and keep on digging through acres of committee and legislative papers MPs are having to take on more staff on a fixed staffing allowance, either paying lower wages or taking people on a volunteer basis.

This comes off the back of Nick Clegg’s initiative to get Westminster interns paid. The blog also addresses Westminster pay in general, quoting a staff survey. For a clearer picture a ScraperWiki user, MemeSpring, scraped the jobs data from Work4MP. This is the historic data from when the site first started in 2004.

So I threw the data into Google Refine and of the 2,661 job postings, 30% (791) were unpaid internships. Shockingly, there was only ever one internship posting that paid minimum wage and this was with Citigate Dewe Rogerson.
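The headline share is simple arithmetic on the two figures quoted above:

```python
# 791 unpaid internships out of 2,661 postings (figures from the
# Work4MP data described above).
total_postings = 2661
unpaid = 791
share = round(100 * unpaid / total_postings, 1)
print(share)  # percentage of postings that were unpaid internships
```

That works out at a shade under 30%, which is the rounded figure quoted.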

The highest demander of unpaid interns is actually the British Youth Council (with 18 postings) and the MP who advertised the most for unpaid work is Liz Lynne, with a total of 10 internship positions. Now these internship listings include parties and groups like Alcohol Concern. But looking at just MPs, parties and Westminster, they account for over 300 unpaid positions, with most ‘salaries’ consisting of travel, lunch and reasonable expenses.

It’s also odd that this is Nick Clegg’s initiative because the vast majority of internships sought by political party groups come from the Liberal Democrats.

So when TotalPolitics writes in defence of Government pay:


The vast majority of people involved in politics are volunteers – canvassers, committee members, deliverers, agents and organisers who want their party to succeed and gain office; of those few who are paid they are by and large paid poorly and work extraordinarily long hours, with precious little thanks.

Could this not be read the other way around? MPs getting paid whilst using an army of young, naive interns to do their work for free. No doubt they put these interns on their list of expenses.

The road to No.10 is paved with advisers, they lead you in, they open doors. Often for themselves. Previous advisers include Alastair Campbell, Ed Balls and the Miliband brothers. Until they’re in the door they generally don’t command the political spotlight. That is, unless they’re on the way out like Andy Coulson. What they do command is fine wining and dining.

The Cabinet Office publishes Special Advisers’ gifts and hospitality in various Excel sheets that are filled-in depending on how much coffee the civil servant had that morning i.e. inconsistently. They weren’t even consistent with the appointed minister the adviser falls under. So I scraped it and put all the files into one download which covers May to September 2010. You can get it all by hitting the ‘Download spreadsheet (CSV)’ link here.
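Merging sheets whose headers were filled in inconsistently comes down to mapping the variant column names onto one scheme before stacking the rows. A sketch, with invented headers and sample rows (the real sheets’ columns will differ):

```python
import csv
import io

# Map each variant header onto one canonical column name
# (the variants here are illustrative, not the actual sheet headers).
HEADER_MAP = {"Adviser": "adviser", "Name of Adviser": "adviser",
              "Gift": "detail", "Hospitality": "detail"}

# Two toy "sheets" filled in by differently caffeinated civil servants.
sheets = [
    "Adviser,Gift\nA. Coulson,Theatre tickets\n",
    "Name of Adviser,Hospitality\nJ. Oates,Dinner\n",
]

merged = []
for sheet in sheets:
    for row in csv.DictReader(io.StringIO(sheet)):
        merged.append({HEADER_MAP[k]: v for k, v in row.items()})

print(merged)
```

Once every sheet speaks the same column names, producing one combined CSV download is trivial.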

The Trends:

Here are the advisers listed according to the amount of hospitality they received:

Note that Nick Clegg’s chief adviser, Jonny Oates, has been taken out the most, followed by the then PM’s communications chief, Andy Coulson. Most hospitality is provided by media organisations (see table below) and by using Google Refine I dug deeper into the data to look for a bias between advisers for the Prime Minister and Deputy Prime Minister (seeing as there’s a party split). It turns out the BBC only court Cameron’s advisers (15 times in 5 months). The same is true of the Daily Mail, whereas The Financial Times dine only with those close to Clegg. The Guardian similarly enjoy Lib Dem company, inviting them to their table twice as many times as they did the Tories.

What’s very noticeable from this information is that Special Advisers are wined and dined mostly by media organisations. Here is a list of the top 10 hospitality givers:

If you add up all of Rupert Murdoch’s empire, they account for 20 occasions split 13:7 Cameron’s to Clegg’s.

The close relationship between advisers and media organisations (this is all within a five month period) makes me wonder: when a ‘No.10 insider’ or ‘someone close to the Prime Minister’ is quoted, how often is that piece of information plucked from the lips of these well-fed advisers? A lot I imagine.

The Outliers:

In fact, media and PR are so predominant in hospitality for advisers, I’ve decided to list the rest of the givers in order of how many times they appear in the data: Bell Pottinger (mostly business clients, Airbus, Sky, Unilever, etc), News Corporation, Tetra Strategy (clients include the Government of Dubai and the jailed Russian billionaire, Mikhail Khodorkovsky), The Daily Telegraph, The Mail on Sunday, The Sunday Times, The Telegraph, Alexander Kutner, Baron Wolfson of Aspley Guise (Conservative life peer and CEO of Next), Business in the Community, Center for Court Innovation (New York think-tank), Citi, Connect Communications (lobbying), Demos (think-tank), General Sir Richard Dannatt, ITN, ITV, Ian Osborne and Partners, Institute for Public Policy Research (think-tank), Islamic Relief, James Kempton, Lansons Communications (clients include J.P. Morgan, Lloyds TSB and Barclays), London Palladium (Whoopi Goldberg), Malaria No More, Martyn Rose, Medley Global Advisors (“provider of macro policy intelligence service for the world’s top hedge funds, institutional investors, and asset managers”), News International, Lawn Tennis Association, Not to Scale, Open Road, Pakistan International Airlines, Policy Exchange (think-tank), RSA, Ramesh Dewan, Richard Thaler, Royal Bank of Scotland, SAB Miller (De Klerk Foundation Event), Save the Children, Taxpayers’ Alliance, The Daily Telegraph and The Daily Mirror, The Economist, The Evening Standard, The Spectator, The Sun, The Sunday Express, UK Music, Wall Street Journal and Wellington College.

Bell Pottinger and Martyn Rose are now with the Big Society Network.

Only six entries weren’t lunch or dinner dates. Steve Hilton was given champagne from Not to Scale, Steve Chatwin received concert tickets from Malaria No More, Naweed Khan got his flights upgraded by Pakistan International Airlines, and Andy Coulson was given theatre tickets by Whoopi Goldberg and a bottle of wine by one Alexander Kutner.

The Anomalies:

Now the only Alexander Kutner I can find happens to have been the Vice President and Principal Engineer of Software Development at Electronic Evidence Discovery. They reduce the risk of electronic discovery, a process which involves digital forensics analysis for recovering evidence. See the comment below regarding the identity of Alexander Kutner.

Also, Ian Osborne and Partners, who dined Tim Chatwin, has no existence according to Google.

The MetaData:

What’s missing is what went on at these meals. Who attended. What was said, or agreed upon. Who was being represented. What goes on is not an entry in the data sheets and it never will be. But this data should make you more aware of the existence of these meals on deals.

You can find a list of Special Advisers and their salaries here.

Here’s another tool for the amateur data-miner’s toolkit. I’m hoping it’ll be a pickaxe rather than a stick of dynamite. It’s from Google so I’m hopeful. It’s called Google Refine and was previously known as Google Gridworks.

And the blurb sounds promising: “Google Refine is a power tool for cleaning up raw data, making it consistent, linking it to data registries like Freebase, augmenting it with more data from other data sources, transforming it into the required format for other tools to consume, and contributing it back to some data sources like Freebase. Google Refine is not a web service but a desktop app that runs on your own computer, so you can process sensitive data with privacy.”

I will hopefully tackle my first data sheet sometime in the early new year (or so goes the plan). I’m planning on learning Python and so will try and have my own set of scraped data to clean (oh the joy!). Meanwhile I will be posting visualization techniques and those working on new and wonderful ways to give data a voice. So stay tuned.
