Archive for the ‘Data Journalism’ Category

Seeing as tomorrow is Open Data Day and I claim to be a Data Journalist (I think JournoCoder is more suitable), here’s a little data food for journalistic thought.

RSS Feed of US Nuclear Reactor Events

Here is the site showing the power output status of US nuclear reactors. Here is the scraper for that site, written by ScraperWiki founder Julian Todd. Here is my script for catching the unplanned events and converting them to RSS format. And here is the URL you can use to subscribe to the feed yourself:

https://api.scraperwiki.com/api/1.0/datastore/sqlite?format=rss2&name=us_nuclear_reactor_rss&query=select%20title%2C%20link%2C%20description%2C%20date%2C%20guid%20from%20%60swdata%60%20limit%2020
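If you’d rather pull the feed into a script than a feed reader, here’s a minimal Python sketch using the feedparser library (the URL is the one above; nothing else in it is specific to this scraper):

```python
# A minimal sketch: read the unplanned-events feed with feedparser.
# pip install feedparser
import feedparser

FEED_URL = (
    "https://api.scraperwiki.com/api/1.0/datastore/sqlite"
    "?format=rss2&name=us_nuclear_reactor_rss"
    "&query=select%20title%2C%20link%2C%20description%2C%20date%2C%20guid"
    "%20from%20%60swdata%60%20limit%2020"
)

feed = feedparser.parse(FEED_URL)
for entry in feed.entries:
    # Each item carries the event title, a link back to the source page
    # and a description, as selected in the SQL query in the URL above.
    print(entry.title)
    print(entry.link)
    print(entry.get("description", ""))
    print("-" * 40)
```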

Oh, and a video (using the example to go through the ScraperWiki API)

ScraperWiki has had a little facelift of late so I’ve done a couple of screencasts (one more to come). Enjoy!

So I’m in the US, preparing to roll out events. To decide where to go I needed data: numbers on the type of people we’d like to attend our events. To generate good data projects we would need a cohort of guests: media folks (including journalism students) and people who can code in Ruby, Python and/or PHP. We’d also like to get the data analysts, or Data Scientists as they are now known, particularly those who use R.

So, with assistance from Ross Jones and Friedrich Lindenberg, I scraped the locations (websites and telephone numbers where available) of all the media outlets in the US (changing the structure to suit its intended purpose, which is why there are various tables), data conferences, Ruby, PHP, Python and R meetups, B2B publishers (Informa and McGraw-Hill) and the top 10 journalism schools. I added small sets by hand, such as HacksHackers chapters. All in all, nearly 12,800 data points. I had never used Fusion Tables before but had heard good things, so I mashed up the data and imported it into Fusion Tables. Here it is (click on the image, as sadly wordpress.com does not support iframes):
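For the mashing-up step, the gist is just stacking the scraped tables into one table with a category column before uploading to Fusion Tables. A rough Python sketch of that idea (the file and column names here are placeholders, not the actual tables from the scrapers):

```python
# A rough sketch of mashing several scraped location tables into one
# Fusion Tables upload. File and column names are illustrative only.
import csv

SOURCES = {
    "media_outlets.csv": "Media outlet",
    "python_meetups.csv": "Python meetup",
    "r_meetups.csv": "R meetup",
    "journalism_schools.csv": "Journalism school",
}

with open("fusion_upload.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["name", "city", "state", "category"])
    for filename, category in SOURCES.items():
        with open(filename, newline="") as f:
            for row in csv.DictReader(f):
                # Every source keeps its own structure; here we only pull
                # the fields needed to geocode and colour the points.
                writer.writerow([row.get("name", ""),
                                 row.get("city", ""),
                                 row.get("state", ""),
                                 category])
```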

Click to explore on Fusion Tables

Sadly there is a lot of overlap, so not all the points are visible. Google Earth explodes points that sit on the same spot, but it couldn’t handle this much data when I exported it. Once we decide where best to go I can home in on exact addresses. I wanted to use it to pinpoint concentrations, so a heat map of the points was mostly what I was looking for.

Click to explore on Fusion Tables

Using Fusion Tables I then broke down the data for the hot spots. I looked at the category proportions and, using filter and aggregate, made pie charts (see New York City, for example). The downsides I found with Fusion Tables are that the colour schemes cannot be adjusted (I had to fix them up using Gimp) and that the filters are AND statements only (no OR option). The downside with US location data is the similarity of place names across states (and sometimes a place and a state sharing the same name), so I had to eyeball the data. So here is the breakdown for each region, where the size of the pie chart corresponds to the number of data points for that location. Sizes are relative within a region, not across regions.
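Because Fusion Tables only offers AND filters, the same category breakdown can be done on the merged CSV directly. A hedged sketch, reusing the made-up column names from the earlier snippet, that counts category proportions for one hot spot using an OR condition:

```python
# A sketch of the "filter and aggregate" step done outside Fusion Tables,
# so OR conditions are possible. Column names are illustrative.
import csv
from collections import Counter

counts = Counter()
with open("fusion_upload.csv", newline="") as f:
    for row in csv.DictReader(f):
        # OR filter: Fusion Tables only offers AND, so this is where
        # "New York" or "New York City" can be treated as one hot spot.
        if row["city"] in ("New York", "New York City") and row["state"] == "NY":
            counts[row["category"]] += 1

total = sum(counts.values()) or 1
for category, n in counts.most_common():
    print(f"{category}: {n} ({100 * n / total:.1f}%)")
```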

Of course media outlets would outnumber coding meetups, universities and HacksHackers Chapters, but they would be a better measure of population size and city economy.

What I’ve learnt from this is:

  1. Free tools are simple to use if you play around with them
  2. They can be limiting for visual mashups
  3. The potential of your outcome is proportional to your data size, not your tool functionality (you can always use multiple tools)
  4. To work with different sources of data you need to think about your database structure and your outcome beforehand
  5. Manipulate your data in the database, not in your tool; always keep the integrity of the source data
  6. Having data feed into your outcome changes your role from event reporter to source
This all took me about a week between doing other ScraperWiki stuff and speaking at HacksHackers NYC. If I were better at coding I imagine this could be done in a day no problem.

Here’s a little experiment in using data for news:

Things have been quiet on the blog front and I apologize. What began as a tumultuous year with a big risk on my part has become even more turbulent. Happily with opportunities rather than uncertainties. Trips to Germany and the US have landed in my lap. Both hugely challenging and exciting.

I completed the Knight Mozilla Learning Lab successfully and have been invited to Berlin for the MoJoHackfest next week. I’m really looking forward to meeting all the participants and getting some in-depth, hands-on experience of creating applications built around a better news flow.

This is a level between the hack days ScraperWiki ran and the ScraperWiki platform development itself (I don’t play a part in this but work closely with those who do), which is more akin to the development newsroom.

My pitch for the Learning Lab, Big Picture, is asking a lot of developers coming with their own great ideas and prototypes. I would love to get some of the functionality working but that very much depends on the goodwill, skills and availability of a small group of relative strangers.

I have a tendency to bite off more than I can chew and ask a lot of people who have no vested interests in my development. I am acutely aware that I cannot build any part of the Big Picture project. That being said I have built a new project that can be added to with a basic knowledge of Python. I give you MoJoNewsBot:

If you want to know more about how the Special Advisers’ query was done, read my ScraperWiki blog post. Also, I fixed the bug in the Google News search so the links match the headlines.

Come October I will be heading to the US to help fulfill part of ScraperWiki’s obligations to the Knight News Challenge. I am honoured to be one of ScraperWiki’s first full-time employees and actually get paid to further the field of data journalism!

Being part of a startup has its risks. No one’s role is ever fully defined. This really is a huge experiment and I’m not sure I can even describe what it is I am doing. I am not a noun, however. I am a verb. My definition is in my functionality, and defining this through ScraperWiki, MoJo and any other opportunities that come my way will be the basis of this blog from now on. So my posts will be sporadic but I hope you look forward to them.

Click on the image to get to the widget.

Afghan Civilian Casualty Explorer

I have scraped three sources of Afghan civilian casualty data: UNAMA, ISAF and ARM. The originals can all be found here. They were obtained by Science correspondent John Bohannon after embedding with military forces in Kabul and Kandahar in October 2010. They are in Excel format. A bad format. Excel is data manipulation software, not a format for publishing data. This is an example where all three sources produced data of high interest, but none in formats that make the data usable.

Because there are three different sources, there are three different collection methods, and the date ranges also differ. The Afghan Rights Monitor (ARM) gives the finest-grained data, collecting information on particular incidents. The others collect coarser-grained data, aggregating incidents into types and regional commands. NATO’s International Security Assistance Force (ISAF) split the south of Afghanistan into two regions of command on 19 June 2009 (no doubt owing to US operations in Helmand); however, the data is split at the beginning of 2009 (I had to clarify this inconsistency with LTJG Bob Page, Media Officer for the Regional Command Southwest Public Affairs Office in Afghanistan).
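Getting the numbers out of spreadsheets like these is mostly a matter of walking the rows. A minimal sketch using the xlrd library; the file name and column positions are placeholders, since the real layouts differ between UNAMA, ISAF and ARM:

```python
# A minimal sketch of pulling rows out of one of the casualty spreadsheets.
# pip install xlrd  -- file name and column positions are placeholders.
import xlrd

book = xlrd.open_workbook("arm_casualties.xls")
sheet = book.sheet_by_index(0)

records = []
for i in range(1, sheet.nrows):          # skip the header row
    row = sheet.row_values(i)
    # ARM reports individual incidents, so each row is one event;
    # here we just assume date, location and casualty-count columns.
    records.append({
        "date": row[0],
        "location": row[1],
        "casualties": row[2],
    })

print(len(records), "incidents scraped")
```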

As I’m learning to code and calling myself a data journalist, every project I choose to undertake for the sake of ‘learning’ has to have a journalistic aspect. In building this widget (with a lot of help from Ross Jones) I haven’t made a traditional ‘story’, rather something that is functional in a news gathering sense. I got the idea from the Iraq Body Count. Their aim is to find names for the individual casualties of war, telling the story through the people rather than the numbers.

If you’ve been to the Holocaust Memorial Museum, you’ll know how important individual stories are to understanding the impact of war. I thought I would try and make something simple that would help identify and tick off an individual casualty from the data points. If someone is looking to find out more about how a loved one died and who might have been responsible, then they need as much data on the event as possible. The Afghan Casualty Explorer is very basic and a lot more could be done with the data by proper coders or a newsroom team with programming expertise.

I decided to make a tool in the computer-assisted reporting fashion. My take on data journalism is using tools to aid the news-gathering process, not just the mediation process.

There’s an Excel scraping guide on ScraperWiki for anyone who has data trapped in Excel sheets.

Here are the videos from the Data Journalism stream at the Open Knowledge Conference this year held in Berlin featuring Mirko Lorenz, Simon Rogers and Caelainn Barr amongst others.

[vimeo http://vimeo.com/26861938]

[vimeo http://vimeo.com/26666260]

[vimeo http://vimeo.com/26668162]

And just so you know, I will be heading back to Berlin at the end of September for the Knight-Mozilla Hackathon. I’m greatly looking forward to it, as I’ll be getting hands-on experience of platform building for news, quick and dirty. I’m also very excited about meeting some of the lab folk face to face. I will keep you posted and blog, from a journo perspective, on how I think this type of creativity is changing news.

Just to let you know that the Twitter account @Scrape_No10, which tweets out ministers’, special advisers’ and permanent secretaries’ meetings, gifts and hospitality, is back up and tweeting. You can read the post about its creation here and download all the data the account contains. This account needs more coding maintenance than the @OJCstatements account (read about it here) because the data is contained in CSV files posted onto a webpage, and I code sentences to be tweeted from the rows and columns. The scraper feeding the Twitter account feeds off 5 separate scrapers of the CSV files. Because of this, the account is more likely to throw up errors than the simple scraping of the Office for Judicial Complaints site.
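The “coding sentences from the rows and columns” part is the bit that breaks most easily, because it depends on the CSV layout staying put. Here’s a hedged sketch of the general idea (the column names are invented, and actually posting the sentences to Twitter is left out):

```python
# A sketch of turning CSV rows into tweetable sentences. Column names are
# illustrative; actually posting the sentences to Twitter is omitted.
import csv

def sentences_from_csv(path):
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            text = "{minister} met {organisation} on {date}: {purpose}".format(
                minister=row.get("Minister", "Unknown"),
                organisation=row.get("Organisation", "unknown organisation"),
                date=row.get("Date", "an unknown date"),
                purpose=row.get("Purpose of meeting", "no purpose given"),
            )
            # Twitter's limit at the time was 140 characters.
            yield text[:140]

for tweet in sentences_from_csv("ministers_meetings.csv"):
    print(tweet)
```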

So I decided, as I’m learning to code and structure scrapers, to run the scrapers manually every time the twitter account stops, fix the bugs and set the account tweeting again. There will be better ways to structure the scrapers but right now I’m concentrating on the coding.

Learning to scrape CSVs is very handy, as lots of government data is released as CSV. That being said, there is a CSV documentation/tutorial on ScraperWiki, although it is aimed at programmers. For those interested in learning to code/scrape I would recommend “Learn Python the Hard Way” (which is the easiest for beginners; it’s just ‘hard’ for programmers because it involves typing code!). For more front-end work I have recently discovered Codecademy. I can’t vouch for it but it looks interesting enough. I have also put all the datasets for the @Scrape_No10 account on BuzzData as an experiment.

Please read this previous blog post which runs through some quick things journalists who don’t code can do with ScraperWiki.

It is a step-by-step guide, so please give it a go and don’t just try and follow the answers, as you’ll learn more from rummaging around our site. Also check out the introductory video at the start of the tutorial if you’re not familiar with ScraperWiki.

Here’s the twitter scraper and datastore download. This is the first part of the tutorial where you fork (make a copy of) a basic Twitter scraper, run it for your chosen query, download the data and schedule it to run at a frequency to allow the data to be refreshed and accumulated:
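As an aside on the download step, the datastore can also be pulled down by a script using the same datastore API URL pattern as the nuclear reactor feed earlier on this page. A sketch of that, where the scraper name and the format=csv parameter are assumptions for illustration:

```python
# A sketch of pulling a scraper's datastore down as CSV via the ScraperWiki
# API, following the same URL pattern as the RSS example earlier. The
# scraper name and the format=csv parameter are assumptions for illustration.
import csv
import io
import urllib.parse
import urllib.request

BASE = "https://api.scraperwiki.com/api/1.0/datastore/sqlite"
params = {
    "format": "csv",
    "name": "my_twitter_scraper",          # hypothetical scraper name
    "query": "select * from `swdata` limit 50",
}

url = BASE + "?" + urllib.parse.urlencode(params)
with urllib.request.urlopen(url) as resp:
    reader = csv.DictReader(io.TextIOWrapper(resp, encoding="utf-8"))
    for row in reader:
        print(row)
```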

The next one is a SQL Query View which looks at the data with a journalistic eye in ScraperWiki. This is the second part of the tutorial where you look into a datastore using the SQL language and find out which are the top 10 publications receiving complaints from the Press Complaints Commission and also who are the top 10 making the complaints:
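If you’d rather run the same kind of question locally, the query is just a GROUP BY. A sketch using Python’s built-in sqlite3 on a downloaded copy of the datastore; the table and column names are assumptions here, and the tutorial walks you through the real ones:

```python
# A sketch of the "top 10 publications" style of query using sqlite3 on a
# local copy of the datastore. Table and column names are illustrative.
import sqlite3

conn = sqlite3.connect("pcc_complaints.sqlite")
query = """
    SELECT publication, COUNT(*) AS complaints
    FROM swdata
    GROUP BY publication
    ORDER BY complaints DESC
    LIMIT 10
"""
for publication, complaints in conn.execute(query):
    print(f"{publication}: {complaints}")
conn.close()
```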

And last we show you how to get a live league table view that updates with a scraper. This is the final part of the tutorial where you make a live league table of the above query that refreshes when the original scraper updates:

This is just the beginning: it’s no longer the medium that is the message, but the tools.

Although “data journalism” can encompass infographics, interactives, web apps, FOI, databases and a whole host of other numbering, coding, displaying techniques; the road less travelled-by has certain steps, turns and speed bumps. In that sense, here’s a list of things to tick off if you’re interested in going down the data journalism road:

  1. Know the legal boundaries – get to know the Data Protection Act 1998 and the sections on access to personal data and unstructured personal data held by authorities. Do not set foot on your journey without reading the section on exemptions relating to journalism. Use legislation as a reference by downloading the Mobile Legislate app.
  2. Look at data – get to know what is out there, what format it’s in and where it’s coming from. Places like Data.gov.uk, London Datastore, Office for National Statistics and Get the Data are good places to start for raw data but don’t forget, anything on the web is data. The best data are often hidden. Data can be text and pictures so even mining social media and catching the apps built from them can give you insight into what can be done with data.
  3. Read all about it – to make data and stats accessible you need to know how to frame them within a story. In that sense, you need to know how to understand the stories they tell. That doesn’t mean going on a stats course. There is a lot of accessible reading material and I would recommend The Tiger That Isn’t.
  4. Get connected – find HacksHackers near you and join Meetup groups to point you in the right directions. Data journalists’ interests and abilities are unique to the individual (much like programmers) so don’t take any text of advice as set in stone (the web changes too quickly for that!). Find your own way and your own set of people to guide you. Go to courses and conferences. Look outside the journalism bubble. Data is more than just news.
  5. Spread your bets – the easiest way to sort data is by using spreadsheets. Start with free options like Google Docs and OpenOffice. Industry standards include Microsoft Excel and Access. Learn to sort, filter and pivot. Find data you’re interested in and explore it with your own eyeballs. Know what each piece of software does and can do to the data before mashing it with another piece of software.
  6. Investigate your data – query it using the simple language SQL and the software MySQL. It’s a bit tricky to set up but by now you’ll know a hacker you can ask for help! Clean your data using Google Refine. There are tutorials and a help wiki. Know how these function not just how to navigate the user interfaces, as these will change. These products go through iterations much more quickly than the spreadsheet software.
  7. Map your data – from Google spreadsheets the easiest way to build a map is by using MapAList. There is a long list of mapping software, from GeoCommons to ArcGIS. Find what’s easiest for you and most suitable for your data. See what landscapes can be revealed and home in on areas of interest. Understand the limitations of mapping data: you’ll find devolution makes it difficult to get data for the whole of the UK, and some postcodes will throw up errors.
  8. Make it pretty – visualize your data only once you fully understand it (source, format, timeframe, missing points, etc). Do not jump straight to this as visuals can be misleading. Useful and easy software solutions include Google Fusion Tables, Many Eyes and Tableau. Think of unique ways to present data by checking out what the graphics teams at news organizations have made but also what design sites such as Information is Beautiful and FlowingData are doing.
  9. Make your community – don’t just find one, build one. This area in journalism is constantly changing and for you to keep up you’ll need to source a custom made community. So blog and tweet but also source ready-made online communities from places like the European Journalism Centre, National Institute for Computer Assisted Reporting (NICAR), BuzzData and DataJournalismBlog.
  10. Scrape it – do not be constrained by data. Liberate it, mash it, make it usable. Just like a story, data is unique, and bad data journalism comes from constraining the medium containing it. With code, there is no need to make the story ‘fit’ into the medium. “The Medium is the Message” (a la Marshall McLuhan). Scrape the data using ScraperWiki (see the bare-bones sketch just after this list) and make applications beyond storytelling. Make data open. For examples check out OpenCorporates, Schooloscope and Planning Alerts. If you’re willing to give coding a try, the book “Learn Python the Hard Way” is actually the easiest way to learn for the non-programmer. There is also a Google Group for Python Journalists you should join.
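And here is the bare-bones scraper sketch promised in point 10. It uses the requests and BeautifulSoup libraries with a local SQLite file rather than the ScraperWiki platform itself, and the URL and fields scraped are placeholders:

```python
# A bare-bones scraper sketch: fetch a page, pull out some rows, store them.
# Uses requests + BeautifulSoup + sqlite3 rather than the ScraperWiki
# platform itself; the URL and the fields scraped are placeholders.
import sqlite3

import requests
from bs4 import BeautifulSoup

URL = "https://example.org/some-data-page"   # placeholder

html = requests.get(URL).text
soup = BeautifulSoup(html, "html.parser")

conn = sqlite3.connect("scraped.sqlite")
conn.execute("CREATE TABLE IF NOT EXISTS swdata (title TEXT, link TEXT)")

for a in soup.select("table a"):
    # Keep the raw values; clean and reshape later, in the database.
    conn.execute("INSERT INTO swdata VALUES (?, ?)",
                 (a.get_text(strip=True), a.get("href")))

conn.commit()
conn.close()
```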
These are guidelines and not a map for your journey. Your beat, the data landscape, changes at the speed of web. You just need to be able to read the signs of the land as there’s no end point, no goal and no one to guide you.