Archive for the ‘Data Journalism’ Category

Seeing as tomorrow is Open Data Day and I claim to be a Data Journalist (I think JournoCoder is more suitable), here's a little data food for journalistic thought.

RSS Feed of US Nuclear Reactor Events

Here is the site showing the US nuclear reactors' power output status. Here is the scraper for that site, written by ScraperWiki founder Julian Todd. Here is my script for catching the unplanned events and converting them to RSS format. And here is the URL you can use to subscribe to the feed yourself:

https://api.scraperwiki.com/api/1.0/datastore/sqlite?format=rss2&name=us_nuclear_reactor_rss&query=select%20title%2C%20link%2C%20description%2C%20date%2C%20guid%20from%20%60swdata%60%20limit%2020
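
If you'd rather consume the feed from a script than a feed reader, here's a minimal sketch in Python, assuming you have the feedparser library installed (it's just a convenient parser, not part of ScraperWiki):

    import feedparser

    FEED_URL = (
        "https://api.scraperwiki.com/api/1.0/datastore/sqlite"
        "?format=rss2&name=us_nuclear_reactor_rss"
        "&query=select%20title%2C%20link%2C%20description%2C%20date%2C"
        "%20guid%20from%20%60swdata%60%20limit%2020"
    )

    feed = feedparser.parse(FEED_URL)   # fetch and parse the RSS 2.0 output
    for entry in feed.entries:          # each item is one unplanned event
        print(entry.title)
        print(entry.link)
        print(entry.get("summary", ""))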

Oh, and a video (using the example to go through the ScraperWiki API)

ScraperWiki has had a little facelift of late so I’ve done a couple of screencasts (one more to come). Enjoy!

So I’m in the US, preparing to roll out events. To decide where to go I needed data: numbers on the type of people we’d like to attend our events. In order to generate good data projects we would need a cohort of guests to attend. We would need media folks (including journalism students) and people who can code in Ruby, Python, and/or PHP. We’d also like to get the data analysts, or Data Scientists as they are known, particularly those who use R.

So, with assistance from Ross Jones and Friedrich Lindenberg, I scraped the locations (plus websites and telephone numbers where available) of all the media outlets in the US, data conferences, Ruby, PHP, Python and R meetups, B2B publishers (Informa and McGraw-Hill) and the top 10 journalism schools, restructuring the data to suit its intended purpose (which is why there are various tables). I added small sets, such as HacksHackers chapters, by hand. All in all, nearly 12,800 data points. I had never used Fusion Tables before but I had heard good things, so I mashed up the data and imported it into Fusion Tables. Here it is (click on the image, as sadly wordpress.com does not support iframes):

Click to explore on Fusion Tables

Sadly there is a lot of overlap, so not all the points are visible. Google Earth explodes points that sit on the same spot; however, it couldn’t handle this much data when I exported it. Once we decide where best to go I can home in on exact addresses. I wanted to use it to pinpoint concentrations, so a heat map of the points was mostly what I was looking for.

Click to explore on Fusion Tables

Using Fusion Tables I then broke down the data for the hot spots. I looked at the category proportions and, using the filter and aggregate functions, made pie charts (see New York City for example). The downsides I found with Fusion Tables are that the colour schemes cannot be adjusted (I had to fix them up using Gimp) and that the filters are AND statements only (there is no OR option). The downside with US location data is the similarity of place names across states (and of places that share a name with a state), so I had to eye up the data. So here is the breakdown for each region, where the size of the pie chart corresponds to the number of data points for that location. The scale is relative to each region, not across regions.
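
For comparison, the same kind of breakdown (plus an OR filter, which the Fusion Tables interface wouldn't let me express) is straightforward in plain SQLite. A rough sketch, with an illustrative table, columns and sample rows rather than my actual dataset:

    import sqlite3

    # Rough sketch only: the table, columns and sample rows are illustrative.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE places (name TEXT, city TEXT, state TEXT, category TEXT)")
    conn.executemany(
        "INSERT INTO places VALUES (?, ?, ?, ?)",
        [
            ("Example Daily", "New York", "NY", "media_outlet"),
            ("Example Python Meetup", "New York", "NY", "python_meetup"),
            ("Example Ruby Meetup", "New York", "NY", "ruby_meetup"),
        ],
    )

    # Category counts for one hot spot: the numbers behind one pie chart
    for category, count in conn.execute(
        """SELECT category, COUNT(*) AS n FROM places
           WHERE city = ? AND state = ?
           GROUP BY category ORDER BY n DESC""",
        ("New York", "NY"),
    ):
        print(category, count)

    # An OR filter, which the Fusion Tables interface would not let me express
    rows = conn.execute(
        """SELECT name FROM places
           WHERE category = 'ruby_meetup' OR category = 'python_meetup'"""
    ).fetchall()
    print(rows)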

Of course media outlets outnumber coding meetups, universities and HacksHackers chapters, but they also serve as a better measure of population size and city economy.
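
Roughly speaking, each scraper followed the same pattern: fetch a directory page, pull out the name and location, and save the rows into a table named for the source. Here's a rough sketch of that pattern in the style of a ScraperWiki scraper; the URL, XPath expressions and column names are placeholders, not any of the actual scrapers:

    import scraperwiki
    import lxml.html

    DIRECTORY_URL = "http://example.com/media-directory"  # placeholder URL

    html = scraperwiki.scrape(DIRECTORY_URL)   # fetch the listing page
    root = lxml.html.fromstring(html)

    for tr in root.xpath("//table[@id='listings']//tr[td]"):  # placeholder XPath
        cells = [td.text_content().strip() for td in tr.xpath("./td")]
        record = {
            "name": cells[0],
            "city": cells[1],
            "state": cells[2],
            "website": cells[3] if len(cells) > 3 else None,
        }
        # One table per source, so the structure suits what the data is for
        scraperwiki.sqlite.save(
            unique_keys=["name", "city"], data=record, table_name="media_outlets"
        )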

What I’ve learnt from this is:

  1. Free tools are simple to use if you play around with them
  2. They can be limiting for visual mashups
  3. The potential of your outcome is proportional to your data size, not your tool functionality (you can always use multiple tools)
  4. To work with different sources of data you need to think about your database structure and your outcome beforehand
  5. Manipulate your data in the database, not in your tool, and always keep the integrity of the source data (see the sketch after this list)
  6. Having data feed into your outcome changes your role from event reporter to source
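
On point 5, what I mean is leaving the scraped tables untouched and deriving whatever the visualisation needs as a view or a separate query. A hedged sketch with sqlite3 (all table and column names are illustrative):

    import sqlite3

    # Hedged sketch of point 5: the table and column names are illustrative.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE media_outlets (name TEXT, city TEXT, state TEXT)")
    conn.execute("CREATE TABLE meetups (name TEXT, city TEXT, state TEXT, language TEXT)")

    # The scraped source tables stay exactly as they were saved; the combined,
    # categorised dataset for mapping is derived as a view over them.
    conn.execute(
        """CREATE VIEW map_points AS
           SELECT name, city, state, 'media_outlet' AS category FROM media_outlets
           UNION ALL
           SELECT name, city, state, 'meetup_' || language AS category FROM meetups"""
    )

    print(conn.execute("SELECT * FROM map_points").fetchall())
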
This all took me about a week between doing other ScraperWiki stuff and speaking at HacksHackers NYC. If I were better at coding I imagine this could be done in a day no problem.

Here’s a little experiment in using data for news:

Things have been quiet on the blog front and I apologize. What began as a tumultuous year with a big risk on my part has become even more turbulent. Happily with opportunities rather than uncertainties. Trips to Germany and the US have landed in my lap. Both hugely challenging and exciting.

I completed the Knight Mozilla Learning Lab successfully and have been invited to Berlin for the MoJoHackfest next week. I’m really looking forward to meeting all the participants and getting some in-depth, hands-on experience of creating applications built around a better news flow.

This sits at a level between the hack days ScraperWiki ran and the ScraperWiki platform development itself (I don’t play a part in the latter but work closely with those who do), which is more akin to the development newsroom.

My pitch for the Learning Lab, Big Picture, asks a lot of developers who come with their own great ideas and prototypes. I would love to get some of the functionality working, but that very much depends on the goodwill, skills and availability of a small group of relative strangers.

I have a tendency to bite off more than I can chew and to ask a lot of people who have no vested interest in my development. I am acutely aware that I cannot build any part of the Big Picture project myself. That being said, I have built a new project that can be added to with a basic knowledge of Python. I give you MoJoNewsBot:

If you want to know more about how the Special Advisers’ query was done, read my ScraperWiki blog post. Also, I fixed the bug in the Google News search so the links match the headlines.
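
For the curious, the gist of that fix was simply taking the headline and the link from the same result item so they can't drift apart. Here's a toy sketch of the idea; it is not MoJoNewsBot's actual code, and the Google News RSS search endpoint used here is an assumption:

    import urllib.parse
    import feedparser

    def news_headlines(query, limit=5):
        # Assumed endpoint: Google News' RSS search. Not MoJoNewsBot's code.
        url = "https://news.google.com/rss/search?q=" + urllib.parse.quote(query)
        feed = feedparser.parse(url)
        # Read the title and the link from the same entry so they always match
        return [(entry.title, entry.link) for entry in feed.entries[:limit]]

    for title, link in news_headlines("special advisers"):
        print(title, "->", link)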

Come October I will be heading to the US to help fulfill part of ScraperWiki’s obligations to the Knight News Challenge. I am honoured to be one of ScraperWiki’s first full-time employees and actually get paid to further the field of data journalism!

Being part of a startup has its risks. No one’s role is ever fully defined. This really is a huge experiment and I’m not sure I can even describe what it is I am doing. I am not a noun, however. I am a verb. My definition is in my functionality, and defining this through ScraperWiki, MoJo and any other opportunities that come my way will be the basis of this blog from now on. So my posts will be sporadic, but I hope you look forward to them.

Click on the image to get to the widget.

Afghan Civilian Casualty Explorer

I have scraped three sources of Afghan civilian casualty data: UNAMA, ISAF and ARM. The originals can all be found here. They were obtained by Science correspondent John Bohannon after embedding with military forces in Kabul and Kandahar in October 2010. They are in Excel format. A bad format. Excel is data manipulation software, not a way of publishing data. This is an example where all three sources produced data of high interest, but none in a format that makes the data usable.

Because there are three different sources, there are three different collection methods. The date ranges are also different. The Afghan Rights Monitor (ARM) gives the finest-grained data, collecting information on particular incidents. The others collect coarser-grained data, aggregating incidents into types and regional commands. NATO’s International Security Assistance Force (ISAF) split the south of Afghanistan into two regional commands on 19 June 2009 (no doubt owing to US operations in Helmand); however, the data is split at the beginning of 2009 (I had to clarify this inconsistency with LTJG Bob Page, Media Officer for the Regional Command Southwest Public Affairs Office in Afghanistan).
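
Getting three differently shaped datasets into one widget meant squeezing them into a common set of columns first. Here's a simplified sketch of that kind of normalisation; the field names and sample rows are made up for illustration and differ from the real scraper schemas:

    # Simplified sketch: field names and sample rows are made up for illustration.

    def normalise_arm(incident):
        # ARM records individual incidents, so it is already fine grained
        return {
            "source": "ARM",
            "date_start": incident["date"],
            "date_end": incident["date"],
            "region": incident["province"],
            "civilians_killed": incident["killed"],
        }

    def normalise_isaf(row):
        # ISAF aggregates incidents by period and regional command,
        # so one row stands for many incidents
        return {
            "source": "ISAF",
            "date_start": row["period_start"],
            "date_end": row["period_end"],
            "region": row["regional_command"],
            "civilians_killed": row["civilians_killed"],
        }

    # Made-up example rows, just to show two shapes ending up as one
    arm_example = {"date": "2010-08-03", "province": "Kandahar", "killed": 2}
    isaf_example = {"period_start": "2010-08-01", "period_end": "2010-08-31",
                    "regional_command": "RC South", "civilians_killed": 12}

    records = [normalise_arm(arm_example), normalise_isaf(isaf_example)]
    print(records)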

As I’m learning to code and calling myself a data journalist, every project I choose to undertake for the sake of ‘learning’ has to have a journalistic aspect. In building this widget (with a lot of help from Ross Jones) I haven’t made a traditional ‘story’, but rather something that is functional in a news-gathering sense. I got the idea from the Iraq Body Count. Their aim is to find names for the individual casualties of war, telling the story through the people rather than the numbers.

If you’ve been to the Holocaust Memorial Museum, you’ll know how important individual stories are to understanding the impact of war. I thought I would try to make something simple that would help identify and tick off individual casualties from the data points. If someone is looking to find out more about how a loved one died and who might have been responsible, then they need as much data on the event as possible. The Afghan Casualty Explorer is very basic, and a lot more could be done with the data by proper coders or a newsroom team with programming expertise.

I decided to make a tool in the computer-assisted reporting fashion. My take on data journalism is to use tools to aid the news gathering process, not just the mediation process.

There’s an Excel scraping guide on ScraperWiki for anyone who has data trapped in Excel sheets.
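
If you just want the bare pattern, here's a minimal sketch of pulling rows out of a sheet with the xlrd library (the filename is a placeholder, and this isn't a substitute for the guide):

    import xlrd

    # Minimal sketch: "casualties.xls" is a placeholder filename.
    book = xlrd.open_workbook("casualties.xls")
    sheet = book.sheet_by_index(0)

    headers = [str(cell.value) for cell in sheet.row(0)]   # first row as column names
    for r in range(1, sheet.nrows):
        row = dict(zip(headers, (cell.value for cell in sheet.row(r))))
        print(row)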

Here are the videos from the Data Journalism stream at this year’s Open Knowledge Conference in Berlin, featuring Mirko Lorenz, Simon Rogers and Caelainn Barr amongst others.

[vimeo http://vimeo.com/26861938]

[vimeo http://vimeo.com/26666260]

[vimeo http://vimeo.com/26668162]

And just so you know, I will be heading back to Berlin at the end of September for the Knight-Mozilla Hackathon. I’m greatly looking forward to it, as I’ll be getting hands-on experience of quick-and-dirty platform building for the news. I’m also very excited about meeting some of the lab folk face to face. I will keep you posted and blog from a journo perspective on how I think this type of creativity is changing news.