If you have been keeping an eye on my blog you’ll know I scraped Cabinet Office spending data. Few journalists will dig into the mountain of CSVs that makes up government data. Even fewer can code well enough to scrape them, although a lot of them want to learn, and I believe that would address the former problem. More news institutions are interested in using data to create visualizations for their users: give them something to play with and they spend more time on your site. So I’ve created my first visual from scraped data. Click on the image to be taken to the view page (sadly WordPress can’t embed ScraperWiki views).
With help from the amazing Tom Mortimer-Jones and Ross Jones at ScraperWiki, I made a word cloud with a date slider for all the companies the Cabinet Office gives money to. This is a real-time visualization (well, as real-time as the data releases). Here is the scraper where you can download all the data. I refined the supplier names using Google Refine; you can see the result in the ‘Refined’ table. I made the word cloud in this view: I summed up the spending for each supplier in the SQL query and used the logarithm of the total as the font size of the supplier’s name in the word cloud. The final view then calls up the word cloud for the date range selected on the date slider (code nicked from this jQuery plugin) by plugging the request into the SQL query of the word cloud view. That might seem confusing, but this blog is for all my workings. The code is open, so you can take a look. I am open source.
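The sum-then-log step above can be sketched in a few lines. This is not the original ScraperWiki view, just a minimal illustration assuming a SQLite table and columns I have invented (`spending`, `supplier`, `amount`):

```python
import math
import sqlite3

# Toy stand-in for the scraped spending table; the real data lives in
# the ScraperWiki datastore. Table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spending (supplier TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO spending VALUES (?, ?)",
    [("Acme Ltd", 40000), ("Acme Ltd", 260000), ("Widget Co", 30000)],
)

def font_sizes(conn, min_px=10, scale=6):
    """Sum spending per supplier in SQL, then map each total to a
    font size via log10, so the word cloud stays legible despite
    totals spanning several orders of magnitude."""
    rows = conn.execute(
        "SELECT supplier, SUM(amount) AS total "
        "FROM spending GROUP BY supplier ORDER BY total DESC"
    )
    return {s: round(min_px + scale * math.log10(total)) for s, total in rows}

print(font_sizes(conn))  # e.g. {'Acme Ltd': 43, 'Widget Co': 37}
```

The logarithm is the important bit: a supplier paid £1m would otherwise dwarf one paid £30k by a factor of 30 rather than by a readable few pixels.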
I want this to be a preview of what is possible. All government bodies are now required by law to release details of spending over £25,000. That’s a lot of data from a lot of bodies. OpenSpending will be tackling this. My own thoughts have been about getting journalists, bloggers and students learning to scrape. I figure the most useful type of scraping for journalists will be CSV scraping, so I want volunteers to take the journey I have taken with this view and learn to scrape one spending dataset.
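To show what I mean by CSV scraping, here is a minimal sketch. In practice you would fetch the file over HTTP (say with `urllib`); here the data is inlined, and the column names (`Supplier`, `Amount`) are assumptions, since every department formats its releases a little differently:

```python
import csv
import io

# A made-up fragment of a spending release. Real files often quote
# amounts with thousands separators, which trips up naive float() calls.
RAW_CSV = """Date,Supplier,Amount
2011-03-01,Acme Ltd,40000.00
2011-03-15,Widget Co,"30,000.00"
"""

def parse_spending(text):
    """Yield (supplier, amount) pairs, stripping thousands separators."""
    for row in csv.DictReader(io.StringIO(text)):
        amount = float(row["Amount"].replace(",", ""))
        yield row["Supplier"], amount

rows = list(parse_spending(RAW_CSV))
print(rows)  # [('Acme Ltd', 40000.0), ('Widget Co', 30000.0)]
```

Most of the work in a real scraper is exactly this kind of cleaning, done once per department’s quirks.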
If I can get 20 such people to build a resilient scraper together from a template, they can learn from each other: when one person’s scraper breaks because a new bug has been introduced, no doubt one of the other 19 volunteers has already come across and dealt with that same bug in their own learning process. By maintaining a community of scrapers, the community keeps learning to scrape. And the community can do more with the data. For example, category columns such as central government, health, or work and pensions can be added and used as filters for the visualization (and to interrogate the data).
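The category-column idea is simple in code. A hypothetical sketch, with field names and categories invented for illustration:

```python
# Rows as they might look after volunteers add a hand-tagged "category"
# column; the visualization would then filter on it.
rows = [
    {"supplier": "Acme Ltd", "amount": 40000, "category": "health"},
    {"supplier": "Widget Co", "amount": 30000, "category": "central government"},
    {"supplier": "Acme Ltd", "amount": 260000, "category": "health"},
]

def filter_by_category(rows, category):
    """Keep only the rows tagged with the given category."""
    return [r for r in rows if r["category"] == category]

health = filter_by_category(rows, "health")
print(sum(r["amount"] for r in health))  # 300000
```

The same tags double as a way to interrogate the data: total per category, biggest supplier per category, and so on.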
It’s an idea, for an experiment. I’ll let you know how I get on. In theory this view can be kept as up-to-date as the data!