Here’s a little experiment in using data for news:


So I’m back from Berlin and in the US. I met some amazing people at the Knight Mozilla Hacktoberfest, a 4 day hackathon with people from all over the world and from all walks of life. It was the most fun I’ve had all year and I’ve made some friends for life. The project ideas were brilliant and the discussion inspiring. To have the news partners (Al Jazeera, BBC, Guardian, BostonGlobe and Zeit) be active participants was a great move on Mozillla’s part. To have big news organisations look outside for ideas and solutions shows they realise news is out there, not solely within structured organisations.

I remember first seeing a blog post about this partnership process and thinking: “Wow, I wish I could apply. Shame I’m not a developer”. I went along to the application process out of curiosity and thankfully my creative juices got the best of me.

Even then, my scepticism told me not to expect any part of my MozNewsLab pitch, the Big Picture, to be built in 4 days and so I made a little side project, MoJoNewsBot. On the third day of the hackathon I presented my data stream connected chat bot via the Big Discussion part of Big Picture. Thanks to an amazing participant, David Bello, we got a conference with website submission, approval and iframe designed and coded in two days. I only found out before presenting that he is in management at a university in Colombia and doesn’t code for a living. I was truly blown away by how an idea; developed, designed and pitched, can be made reality owing solely to the good will of someone who “plays” with code.

You can keep track of both projects, Big Picture and MoJoNewsBot on the Mozilla wiki. I’m looking to make the first and third part of Big Picutre with further help and advice from the participants. Thanks to the magic of GitHub and DotCloud, I have a local version of Big Picture running on my computer. I’m going to learn JavaScript and add to/clean up Big Picture before I present it formally on my blog. As for my chat bot, I need to add error messages and tidy up the code a bit. Then I’ll relocate him from the #botpark to #HacksHackers on IRC. During events in the US I’m going to add more modules with interesting data for journalists to reference.

To all my viewers, whoever you are, I recommend you hop on the MoJo bandwagon next year. It’ll be the ride of your life! Almost as eventful as driving the ScraperWiki digger 😉

Things have been quiet on the blog front and I apologize. What began as a tumultuous year with a big risk on my part has become even more turbulent. Happily with opportunities rather than uncertainties. Trips to Germany and the US have landed in my lap. Both hugely challenging and exciting.

I completed the Knight Mozilla Learning Lab successfully and have been invited to Berlin for the MoJoHackfest next week. I’m really looking forward to meeting all the participants and getting some in depth hands-on experience of creating applications built around a better news flow.

This is a level between the hack days ScraperWiki ran and the ScraperWiki platform development itself (I don’t play a part in this but work closely with those who do), which is more akin to the development newsroom.

My pitch for the Learning Lab, Big Picture, is asking a lot of developers coming with their own great ideas and prototypes. I would love to get some of the functionality working but that very much depends on the goodwill, skills and availability of a small group of relative strangers.

I have a tendency to bite off more than I can chew and ask a lot of people who have no vested interests in my development. I am acutely aware that I cannot build any part of the Big Picture project. That being said I have built a new project that can be added to with a basic knowledge of Python. I give you MoJoNewsBot:

If you want to know more about how the Special Advisers’ query was done read my ScraperWiki blog post. Also, I fixed the bug in the Goolge News search so the links match the headline.

Come October I will be heading to the US to help fulfill part of ScraperWiki’s obligations to the Knight News Challenge. I am honoured to be one of ScraperWiki’s first full-time employees and actually get paid to further the field of data journalism!

Being part of a startup has its risks. No one’s role is every fully defined. This really is a huge experiment and I’m not sure I can even describe what it is I am doing. I am not a noun, however. I am a verb. My definition is in my functionality and defining this through ScraperWiki, MoJo and any other opportunities that come my way will be the basis of this blog from now on. So my posts will be sporadic but I hope you look forward to them.

Click on the image to get to the widget.

Afghan Civilian Casualty Explorer

I have scraped three sources of Afghan civilian casualty data; UNAMA, ISAF and ARM. The originals can all be found here. They were obtained by Science correspondent John Bohannon after embedding with military forces in Kabul and Kandahar in October 2010. They are in Excel format. A bad format. Excel is data manipulation software, not for displaying data. This is an example where all three sources produced data of high interest but none in formats which make the data useable.

Because there are three different sources, there are three different collection methods. Date ranges are also different. The Afghan Rights Monitor (ARM) give the smallest grained data, collecting information from particular incidents. The others collect larger grained data, aggregating incidents into types and regional commands. NATOs International Security Assistance Force (ISAF) split the south of Afghanistan into two regions of command on 19 June 2009 (no doubt owing to US operations in Helmand), however the data is split at the beginning of 2009 (I had to clarify this inconsistency with LTJG Bob Page, Media Officer for the Regional Command Southwest Public Affairs Office in Afghanistan).

As I’m learning to code and calling myself a data journalist, every project I choose to undertake for the sake of ‘learning’ has to have a journalistic aspect. In building this widget (with a lot of help from Ross Jones) I haven’t made a traditional ‘story’, rather something that is functional in a news gathering sense. I got the idea from the Iraq Body Count. Their aim is to find names for the individual casualties of war, telling the story through the people rather than the numbers.

If you’ve been to the Holocaust Memorial museum, you’ll know how important individual stories are to understanding the impact of war. I thought I would try and make something simple that would help identify and tick off an individual casualty from the data points. If someone is looking to find out more about how a love one died and who might have been responsible then they need as much data on the event. The Afghan Casualty Explorer is very basic and a lot more could be done with the data by proper coders or a newsroom team with programming expertise.

I decided to make a tool in the computer-assisted-reporting fashion. My take on data journalism being use tools to aid in the news gathering process and not just the mediation process.

There’s a Excel scraping guide on ScraperWiki for anyone who have data trapped in Excel sheets..

The functionality that has set the web world a blaze, created whole industries and churned out billionaires from fiddlers of code is ‘social’. It’s even shaken Google to its core. ‘Social’ has also made news organisations think ‘digital’, however the phoenix that will emerge from the burning embers of the newspaper industry is ‘open’. The functionality of Open Data will separate the losers from the winners in the digital news (r)evolution. Curation, aggregation, live are all currently thrown in the mix but no one overarching model has yet ignited the flames of public engagement.

So I want to talk about Open Data. But what is Open Data? The best I can offer you is the open definition from the Open Data Manual which reads: “Open data is data that can be freely used, reused and redistributed by anyone – subject only, at most, to the requirement to attribute and share alike.” For the best understanding of Open Data I would highly recommend you read a report by Marco Fioretti for the Laboratory of Economics and Management of Scuola Superiore Sant’Anna, Pisa entitled Open Data: Emerging trends, issues and best practices (2011).

This blog post will really be about how this report highlights the need, duty and opportunity for news to become part of this Open Data movement and, in my opinion, the news industry can be what Open Data needs to cultivate the ethos of information access amongst the public. The first thing the report happens upon under “Social and political landscape” is news; big news which many organisations struggled to maintain across news flows. These are the Spanish “Indignados” , the Arab Spring, the Fukushima nuclear accident and Cablegate. Whilst Marco admits that Wikileaks may have caused some hostility towards Open Data he notes that:

…while certainly both Open Data and Wikileaks are about openness and transparency in politics, not only are there deep differences between the two ideas but, in our opinion, the Wikileaks experience proves the advantages of Open Data.

Fighting for transparency through organisations who exist on the outer fringes or even outside of the law, create just another veil of secrecy. Indeed, recent events regarding the leak of unredacted Wikileaks data show how corrosive forcibly breaking through the layers of data protection can be for any organisation. Many within the news industry admire (praise is too strong a word) Wikileaks’ cause and argue that if journalism was performing its intended function then there would be no need for a Wikileaks.

Which brings me back to the newsroom. Unlike the web, the newsroom is not structured to handle large streams of data. The big data stories in the UK have been the Iraq War Logs, Cablegate and MPs expenses. These have been stories because the existence of the data itself is a story. Big data dumps can make headlines, masses of data being produced from the public sector daily need to be mined to find stories. Newsrooms don’t do that. Because as a journalist you have to pitch the ‘story’ to your editor, not content.

The news medium produces content for stories not stories from content. But the web feeds off content in the form of data. And online social networks are bringing the content to the user directly. News organisations need to work with this content, this data, these facts in plain sight as “unlike the content of most Wikileaks documents, Open Data are almost always data that should surely be open” and therein lies your public service responsibility. In the case of the data story on EU structural funds by the Bureau for Investigative Journalism and the Financial Times, an Italian reporter who picked up the story, Luigi Reggi writes:

The use of open, machine-processable and linked-data formats have unexpected advantages in terms of transparency and re-use of the data .. What is needed today is the promotion among national and local authorities of the culture of transparency and the raising of awareness of the benefits that could derive from opening up existing data and information in a re-usable way.

What distinguishes Open Data from “mere” transparency is reuse

The Open Data Movement has taken off. Of course a lot more needs to be done but the awareness and realisation of the need to publish public information is born of the web and will die with the web (i.e. never). Marco states that “In practice, public data can be opened at affordable costs, in a useful and easily usable way, only if it is in digital format … When data are opened, the problem becomes to have everybody use them, in order to actually realise Open Government.”

The relationship between media and state means that the traditional media bodies (broadcast and print) should be the ones to take that place. Why? Because it requires an organisational structure, the one thing the web cannot give to citizen journalists. It can give us the tools (print, audio and video upload and curation) but it cannot provide us with the external structures (editorship, management, legal, time and expertise) needed to unearth news not just package it. News organisations need to mine the data because structures are needed to find the truth behind data as it is not transparent to the average citizen. News needs to provide the analysis, insight and understanding.

There is not automatic cause-effect relationship between Open Data and real transparency and democracy … while correct interpretation of public data from the majority of average citizens is absolutely critical, the current situation, even in countries with (theoretical) high alphbetization and Internet access rates, is one in which most people still lack the skills needed for such analysis … It is necessary that those who access Open Data are in a position to actually understand them and use them in their own interest.

So why is ‘open’ the new ‘social’? Because services who make data open make it useful and usable. Open Data is about Open Democracy and allowing communities to engage through digital services built around the idea of openness and empowerment. News needs to get on board. But just as social was an experiment which some got right, so getting Open Data right will be the deal breaker for digital news. Just take a look at some of these:

And I’m sure there are many more examples out there. I’m not saying news organisations have to do the same. Open Data, as you can see, is a global movement and just as ‘social’ triggered the advance of web industry into the news industries’ territory so news should look to ‘open’ to claim some of that back.

Here are the videos from the Data Journalism stream at the Open Knowledge Conference this year held in Berlin featuring Mirko Lorenz, Simon Rogers and Caelainn Barr amongst others.




And just so you know I will be heading back to Berlin at the end of September for the Knight-Mozilla Hackathon. Greatly looking forward to it as I’ll be getting hands on experience of platforming building for the news quick and dirty. I’m also very excited about meeting some of the lab folk face to face. Will keep you posted and blog from a journo perspective and how I think this type of creativity is changing news.

Just to let you know that the Twitter account @Scrape_No10 which tweets out ministers’, special advisers’ and permanent secretaries’ meetings, gifts and hospitalities is back up and tweeting. You can read the post about its creation here and download all the data the account contains. This account needs more coding maintenance than the @OJCstatements account (read about it here) because the data is contained in CSV files posted onto a webpage. I code sentences to be tweeted from the rows and columns. The scraper feeding the twitter account feeds off 5 separate scrapers of the CSV files. Because of this, the account is more likely to throw up errors than the simple scraping of the Office for Judicial Complaints site.

So I decided, as I’m learning to code and structure scrapers, to run the scrapers manually every time the twitter account stops, fix the bugs and set the account tweeting again. There will be better ways to structure the scrapers but right now I’m concentrating on the coding.

Learning to scrape CSVs is very handy as lots of government data are released as CSV. That being said, there is CSV documentation/tutorial on ScraperWiki, although it is aimed at programmers. For those interested in learning to code/scrape I would recommend “Learn Python the Hard Way” (which is the easiest for beginners, it’s just ‘hard’ for programmers because it involves typing code!). For more front end work I have recently discovered Codecademy. I can’t vouch for it but it looks interesting enough. I have also put all the datasets for the @Scrape_No10 account on BuzzData as an experiment.