Archive for the ‘Open Data Movement’ Category

The functionality that has set the web world ablaze, created whole industries and churned out billionaires from fiddlers of code is ‘social’. It’s even shaken Google to its core. ‘Social’ has also made news organisations think ‘digital’. However, the phoenix that will emerge from the burning embers of the newspaper industry is ‘open’. The functionality of Open Data will separate the losers from the winners in the digital news (r)evolution. Curation, aggregation and live coverage are all currently thrown into the mix, but no one overarching model has yet ignited the flames of public engagement.

So I want to talk about Open Data. But what is Open Data? The best I can offer you is the open definition from the Open Data Manual which reads: “Open data is data that can be freely used, reused and redistributed by anyone – subject only, at most, to the requirement to attribute and share alike.” For the best understanding of Open Data I would highly recommend you read a report by Marco Fioretti for the Laboratory of Economics and Management of Scuola Superiore Sant’Anna, Pisa entitled Open Data: Emerging trends, issues and best practices (2011).

This blog post is really about how this report highlights the need, duty and opportunity for news to become part of the Open Data movement. In my opinion, the news industry can be what Open Data needs to cultivate the ethos of information access amongst the public. The first thing the report happens upon under “Social and political landscape” is news: big news which many organisations struggled to keep up with across their news flows. These are the Spanish “Indignados”, the Arab Spring, the Fukushima nuclear accident and Cablegate. Whilst Marco admits that Wikileaks may have caused some hostility towards Open Data, he notes that:

…while certainly both Open Data and Wikileaks are about openness and transparency in politics, not only are there deep differences between the two ideas but, in our opinion, the Wikileaks experience proves the advantages of Open Data.

Fighting for transparency through organisations that exist on the outer fringes of, or even outside, the law creates just another veil of secrecy. Indeed, recent events regarding the leak of unredacted Wikileaks data show how corrosive forcibly breaking through the layers of data protection can be for any organisation. Many within the news industry admire (praise is too strong a word) Wikileaks’ cause and argue that if journalism were performing its intended function there would be no need for a Wikileaks.

Which brings me back to the newsroom. Unlike the web, the newsroom is not structured to handle large streams of data. The big data stories in the UK have been the Iraq War Logs, Cablegate and MPs’ expenses. These have been stories because the existence of the data itself is a story. Big data dumps can make headlines, but the masses of data being produced by the public sector daily need to be mined to find stories. Newsrooms don’t do that, because as a journalist you have to pitch the ‘story’ to your editor, not the content.

The news medium produces content for stories, not stories from content. But the web feeds off content in the form of data. And online social networks are bringing the content to the user directly. News organisations need to work with this content, this data, these facts in plain sight as “unlike the content of most Wikileaks documents, Open Data are almost always data that should surely be open” and therein lies your public service responsibility. In the case of the data story on EU structural funds by the Bureau of Investigative Journalism and the Financial Times, an Italian reporter who picked up the story, Luigi Reggi, writes:

The use of open, machine-processable and linked-data formats have unexpected advantages in terms of transparency and re-use of the data .. What is needed today is the promotion among national and local authorities of the culture of transparency and the raising of awareness of the benefits that could derive from opening up existing data and information in a re-usable way.

What distinguishes Open Data from “mere” transparency is reuse

The Open Data Movement has taken off. Of course a lot more needs to be done but the awareness and realisation of the need to publish public information is born of the web and will die with the web (i.e. never). Marco states that “In practice, public data can be opened at affordable costs, in a useful and easily usable way, only if it is in digital format … When data are opened, the problem becomes to have everybody use them, in order to actually realise Open Government.”

The relationship between media and state means that the traditional media bodies (broadcast and print) should be the ones to take that place. Why? Because it requires an organisational structure, the one thing the web cannot give to citizen journalists. It can give us the tools (print, audio and video upload and curation) but it cannot provide us with the external structures (editorship, management, legal, time and expertise) needed to unearth news not just package it. News organisations need to mine the data because structures are needed to find the truth behind data as it is not transparent to the average citizen. News needs to provide the analysis, insight and understanding.

There is no automatic cause-effect relationship between Open Data and real transparency and democracy … while correct interpretation of public data from the majority of average citizens is absolutely critical, the current situation, even in countries with (theoretical) high alphabetization and Internet access rates, is one in which most people still lack the skills needed for such analysis … It is necessary that those who access Open Data are in a position to actually understand them and use them in their own interest.

So why is ‘open’ the new ‘social’? Because services that make data open make it useful and usable. Open Data is about Open Democracy and allowing communities to engage through digital services built around the idea of openness and empowerment. News needs to get on board. But just as social was an experiment which some got right, so getting Open Data right will be the deal breaker for digital news. Just take a look at some of these:

And I’m sure there are many more examples out there. I’m not saying news organisations have to do the same. Open Data, as you can see, is a global movement, and just as ‘social’ triggered the advance of the web industry into the news industry’s territory, so news should look to ‘open’ to claim some of that back.

I consume, code and curate news. I am no longer in the ‘newsroom’ per se, but taking a step back and looking deeper into the nuts and bolts of the news platform has given me time for reflection. I reflect off hard surfaces. As such, I would like to present three sources of information that have my brain waves bouncing and the resultant concoction of “Sink/Source Journalism” i.e. a news model for the digital age.

Here are three reasons why news organisations needed micro-startups:

  1. A journal paper titled “Network Journalism: Converging Competences of Media Professionals and Professionalism”*
  2. A blog post by Alan Mutter called “Newspapers need a jolt of Silicon Valley DNA”
  3. And a TEDx video by Jeff Jarvis called “This is Bullshit”:

Now you have my materials, here are my thoughts. Journalism constrains the medium around the story, creating the source. Online media builds a platform to allow stories to form, thus creating a sink. Unlike most journalism students, I didn’t set up a blog to put my picture and CV on and upload all my ‘work’. I created a sink, a semantic sink. I wasn’t searching for something. I wanted to gather sources in order to find out what might be out there that I had not heard about. I want a medium which gathers inwards rather than expels outwards. This is the opposite of what journalism is, but I think it’s worth trying.

Alan Mutter writes: “With new technologies, media formats and business models emerging at an ever-quickening pace, newspapers must learn to think and act like start-ups – or risk falling to the margins of the media world.” What I would like to see implemented at a news organisation is a micro-startup team which builds sinks. A sink needs to iterate at the speed of web and its success will depend on whether it metamorphoses into a source of its own accord. What do I mean by this?

Philip Meyer, in his book “The Vanishing Newspaper” (2004), predicts that the final copy of the final newspaper will appear on somebody’s doorstep one day in 2043. Bardoel & Deuze suggested ten years ago that:

This is not to say that the end of mediated communication is near, but it only shows that due to new technology the exclusive hold of journalists on the gatekeeping function to private households comes to an end. Ironically, it was the old (newspaper) technology that has brought journalists in this privileged position, and it is the new (on-line) technology that might remove journalists from that position again … technology does not determine what will happen here, but it will take a patient process of ‘social shaping’ that determines what will be the impact of the new communications technologies

And as Jeff put it, “We should question the form”. Now The Guardian has gone digital first. More will follow. But will they follow the old form? Again, Bardoel & Deuze wrote a decade ago:

For journalists it is quite a challenge – or should we say less ironically a threat – to serve this multi-faced and fragmented public, for whom the news ‘product’ is no longer sacred in se. Since the scarcity of the offering has turned into abundance people can make a choice, for journalistic selection and scope or for other information intermediaries. This, again, shows that the power relation is shifting.

This was known (albeit in academic circles) before the explosion of Twitter and social media. And the reliance on social media to turn ‘old’ media into ‘new’ media is misguided. If the medium is the message then the community is the code. The part of the social media model that should be taken is: build a sink for a community and let them make it into a news source. But, from a news angle, your sink should not be an application per se. It should be built from open data. Make it open source. Why? Because you need to make a prototype/alpha fast. Make it quick and dirty. If it works, other people can fork the code but they can’t fork the community. And that’s what makes a sink a source.

What sort of projects am I talking about? Well, things like Schooloscope, Who’s Lobbying, and The Public Whip. These are brilliant public information service sites but they cannot make it as a business. They have been made by dedicated developers who cannot maintain them as they have no business model. These ‘code communities’ of scraped information need to be adopted by news organisations. They have the resources and community links to change these code sinks into sources. In the way that B2B media make their money by hoarding and reselling data to individual businesses, so the news organisation in the new digital age should adopt this model by scraping public data and reselling it to the individual. This is Bardoel & Deuze’s conclusion at the turn of the millennium:

Journalism will become a profession that provides services not to collectives, but first and foremost to individuals, and not only in their capacity as citizens, but also as consumers, employees and clients

The Public Whip has had to find a new home. Schooloscope and Who’s Lobbying are shutting down. The databases they gather and feed off are a resource for journalists. Their design interface is a valuable resource to the public. By integrating them into a news model, into a community of readers and informed citizens, they can become powerful sinks around which you build a community. And this becomes a powerful source for a news organisation. A source of revenue even.

The gathering of the data, in the form of scraping, is a huge hurdle for developers but should be a standard for the new age of digital journalism. Now the barriers to building a quick and dirty prototype using a community are being significantly lowered. The company I work for, ScraperWiki (disclosure here), is being built for that. I wanted to be part of that process and the open data movement because I believe this is the route to go down when it comes to a news model for the digital age: Sink/Source Journalism.

*Bardoel, Jo, Deuze, Mark, (2001). Network Journalism: Converging Competences of Media Professionals and Professionalism. In: Australian Journalism Review 23 (2), pp.91-103.

Within a very short period of time the term ‘journalist’ has changed drastically in meaning. Or did it even have a meaning to begin with? At a panel on the phone hacking scandal at the Centre for Investigative Journalism Summer School, Gavin Millar QC said that a journalist is like a terrorist: we have no legally defined term for what they are!

The term ‘citizen journalist’ has grown from the web, from the free and global publishing platforms that are blogs, Twitter and Facebook (and much more than those). The enabling ability of the web, its ability not just to spread information but to upload pictures, video and words, has shaken the traditional media model. The press journalist is no longer the gatekeeper of information. Just look at the Ryan Giggs superinjunction scandal.

So what do we mean when we say “The Press”? One of the consequences of the News of the World phone hacking scandal (which includes a lot more publications and not just News International publications) is that we are going to get a Press Inquiry. I say we, because we don’t know whether any regulatory outcome will include bloggers and twitterers i.e. us, the public who have a social space on the web and use it to communicate within the open public sphere.

Another word that is taking on a new and worrying meaning is the term ‘hacker’. Now a lot of weight and attention has been given to the citizen journalist/web journalist/blogger phenomenon. The circle within which my journalistic persona travels is that of hacks/hackers. I am part hacker. I am a data journalism advocate for a developer platform called ScraperWiki. And I am very concerned about how this tumultuous time in journalism history will define the word ‘hack’ and all its related synonyms.

Wikipedia has one definition of ‘Hacker’ as “a subculture of innovative uses and modifications of computer hardware, software, and modern culture”. I sit on the edge of this and want to look further into the nucleus as a possible future for online news and newsgathering. ScraperWiki is one of a core set of online tools being used by the Open Data community. The people who are part of this community, I flatter myself to be included, are ‘hackers’ by the best definition of the word. The web allows anyone to publish their code online so these people are citizen hackers.

They are the creators of such open civic websites as Schooloscope, Openly Local, Open Corporates, Who’s Lobbying, They Work For You, Fix My Street, Where Does My Money Go? and What Do They Know? This is information in the public interest. This is a new subset of journalism. This is the web enabling civic engagement with public information. This is hacking. But it is made more important by the fact that not everyone can do it, unlike citizen journalism.

I have a Twitter account, @Scrape_No10, tweeting out meetings, gifts and hospitality at No.10. I made a Twitter account, @OJCstatements, which tweets out statements by the Office for Judicial Complaints regarding judges who have been investigated over personal conduct including racism, sexism and abuse of their position. This information is on the web so it is in the public domain. But it is not in the public sphere because the public don’t check the multitude of websites that may have information in the public interest. So I have put it on the platform where it could be of most use to the public.
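To give a flavour of the step between scraped record and tweet, here is a minimal sketch in Python. It is not the code behind either account: the function name `to_tweet`, the example title and the example.org link are all made up for illustration, and actually posting to Twitter is left out entirely. The only fixed point is Twitter's 140 character limit.

```python
TWEET_LIMIT = 140  # Twitter's character limit at the time of writing


def to_tweet(title, url, limit=TWEET_LIMIT):
    """Fit a statement title and its link into one tweet.

    The link is kept whole; the title is trimmed (with an ellipsis)
    if the two together would run over the limit.
    """
    room = limit - len(url) - 1  # one character for the space before the link
    if room < 1:
        raise ValueError("link is too long to fit in a tweet")
    if len(title) > room:
        title = title[: room - 1].rstrip() + "…"
    return f"{title} {url}"


# Hypothetical record, as it might come out of a scraper.
tweet = to_tweet(
    "Statement from the Office for Judicial Complaints regarding the conduct of a judge",
    "http://example.org/statement",
)
```

The design choice is simply that the link always survives, because the link is what takes the reader back to the source.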

In that sense, I feel journalists need to be ‘hackers’; they need to hack. Information in the public interest is not often available to the public. More and more government data is being put on the web in the form of PDFs and CSVs. Now, under the Freedom of Information Act 2000, the government doesn’t have to answer your request directly if the information is published online or will be published online. That means that with more and more information being put in the form of spreadsheets or databases, the public are going to be pointed to a sea of columns and rows rather than given direct answers. So journalists need to get to grips with data to get the public their answers.

But as we know, any journalistic endeavour is open to abuse. So where do we draw the line? Even with citizen journalism, the outing of Ryan Giggs online has blurred the boundary between the right to get private information out in the open and the individual’s right to privacy. Now deleting the voice messages of a missing girl is clearly overstepping the bounds in a horrendous way. The public will never forgive such behaviour, but invading politicians’ privacy for the purpose of uncovering corruption often is forgiven.

The argument can be made that information on the web is public information and can be used freely in a journalistic endeavour. But that isn’t always the case. The British and Irish Legal Information Institute portal, BAILII, does not allow scraping. Learning to scrape is my journalistic endeavour at the moment. Scraping is the programming form that takes information from the web and pares it down into its raw programmatic ingredients. So it can be baked into something more digestible to the public.
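For anyone wondering what “paring a page down into its raw ingredients” looks like, here is a minimal sketch using only Python's standard library. The HTML is an invented sample, not a real page, and `TableScraper` and `scrape_table` are names I have made up for illustration; a real scraper would also have to fetch the page and cope with messier markup.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collects the text of every table cell, one list per table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Turn the first row into column names and the rest into records."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Invented sample page standing in for a real web page.
SAMPLE_PAGE = """
<table>
  <tr><th>Date</th><th>Guest</th></tr>
  <tr><td>2011-05-01</td><td>Example Ltd</td></tr>
  <tr><td>2011-05-09</td><td>Another Org</td></tr>
</table>
"""

records = scrape_table(SAMPLE_PAGE)
```

Once the table is a list of records, it can be saved as a CSV, counted, mapped or tweeted, which is the “baking” part.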

Now I would love to make the legal system more digestible but I can’t. It’s because BAILII have their own databases of information that they sell to private companies, and scraping could re-form these. One of them is a database of court fines that they sell to a multitude of credit card companies. So we pay for the judicial system and if we’re fined by it they have the right to make money from the data to affect our credit rating. This makes the information they put online locked into the format they chose to put it in, a complicated and convoluted web portal.

But equally, what about unearthing a deleted tweet or matching social media accounts through email addresses which are not disclosed but which could be guessed at? Linking online personas that are set up to be separate? Not accessing their private emails, not getting past any firewall that requires a password, but using details behind the front end of the web to dig deeper into their online connections. The question is not where do we draw the boundary but can we. Or even, should we.

It’s not the technique that should be outlawed; it should be the endeavour. Please don’t let the News of the World define ‘hacking’. In the Shakespearean sense of “That which we call a rose by any other word would smell as sweet”, we should define journalism not by a word but by what it smells like. Something stank about the first phone hacking enquiry in 2009. Nick Davies smelt it and followed his nose. And that’s the definition of journalism.

The above article, by me, appeared in an edited form on the openDemocracy website. They say: “openDemocracy publishes high quality news analysis, debates and blogs about the world and the way we govern ourselves. We are not about any one set of issues, but about principles and the arguments and debates about those principles. openDemocracy believes there is an urgent need for a global culture of views and argument that is: i) Serious, thoughtful and attractively written; ii) Accessible to all; iii) Open to ideas and submissions from anywhere, part of a global human conversation that is not distorted by parochial national interests; and iv) Original and creative, able to propose and debate solutions to the real problems that we all face.” For further reading re who holds court data you should read this article.

I’m in Berlin for the Open Knowledge Conference which you’ll be hearing about. For the last few days I’ve found myself with a mixed bunch of open data hackers and some (data) journalists. It’s the first time I’ve been away from the ScraperWiki family and seen coding in the wild. One thing that surprises me is the diversity of geeks. No two people have the same experience or background. Lots of people with no experience have jumped into it out of interest. The one realization that delights and alarms me is: I’ve been thrown in at the deep end. Only a tiny number of programmers have delved deeply into the scraping soup of the web. And journalists refuse to wander far into this level of ‘difficulty’.

I was speaking to my Canadian counterpart (doppelganger), Momoko Price, from BuzzData. They’re a kind of data social network. She left the journalistic platform to join a developer platform, delving into the dirty world of data. She’s learning to code, having started her data journey with more experience than myself. Yet, even though we’re on the data journalism path and have a frighteningly similar road map, our coding environments have evolved two very different species. I am chasing the needle in the haystack and not visualizing/making the haystack interactive. We both see the need for this, thankfully. Scraping is helping me evolve this speciality. More so than tinkering with software.

It is the road less travelled by and that’s making all the difference. Stay tuned for more tales from the road.

#opendata from Open Knowledge Foundation on Vimeo.

You might be wondering what this short documentary has to do with journalism or even what open data has to do with journalism. No doubt you are aware that journalism has been facing a ‘crisis’ for a while now. Not just because of the recession and shrinking advertisers but because of the dominance of the web for getting information to people and allowing them to share amongst themselves.

Open data activists are working with the web to provide information in a way people can engage with and ultimately feel empowered by. Projects like FixMyStreet and Schooloscope are emblematic of this rise in civic engagement projects. Indeed, crime mapping in San Francisco led to local citizens demanding more policing in areas of high crime and a change in the policing schedule to reflect the hours when crime is at its highest.

News used to have some responsibility in this area of engagement but never quite understood the field or didn’t know quite what to do with it. Now news organizations have completely lost control and the masters of the web platforms are again taking informational control of a growing area of interest. But news organizations are missing a very important trick. Data-driven journalist Mirko Lorenz has written about how news organizations must become hubs of trusted data in a market seeking (and valuing) trust.

Which is why I think anyone interested in the area of data journalism should watch this documentary, as not only should traditional media be training journalists to engage with this new streaming of social and civic data, but managers and execs should think about the possible shifting in the traditional media market away from advertising and towards the trust market.


This is a fringe event to the E-Campaigning Forum run by Fairsay. Rolf Kleef (Open for Change) and Tim Davies (Practical Participation) are co-ordinating the day in a voluntary capacity, with support from Javier Ruiz (Open Rights Group)


The Open Data Campaigning Camp will immediately follow the annual E-Campaigning Forum (#ECF11), so will be targeted particularly at campaigners interested in increasing their understanding of how to engage with open data. They also invite developers and data experts interested in exploring the connections between data and campaigning.


Thursday, 24th March at 09:30 AM


St Anne’s College


The day will start with an introduction to open data, the history of open data campaigning, and short presentations on finding and using data, and on publishing data, for advocacy and campaigning. Then there will be action-learning – with participants choosing projects to work on throughout the day – exploring open data for campaigning around key themes including:

  • International Development
  • Environment and Climate
  • Public Spending Cuts

Projects might include: designing a campaign using open data; building a data visualisation; creating a data campaigning toolkit for local activists; creating a data-driven mobile app for campaigning; publishing a dataset for campaigners to create mash-ups with; exploring and updating data catalogues; and whatever other ideas you bring along. The great thing is that there will be support on hand to introduce different ways of engaging with data. Sign up for it here.


The media work with information. Fact. But what exactly is information? I’m beginning to realise that information is not static. It just doesn’t exist in a concrete form anymore. So how are the media reacting to this? I’m not sure, but I think their public, the community formerly known as the audience, are reacting quicker and better.

This is not through any failing by the media but more to do with the monetization of social media. Technical gadgets have become part of our physical selves, software has become part of our mental selves and now social networking has become part of our societal selves. And society can accept and integrate this because businesses have found a way of making money from it.

The business structure of the media is an enigma in itself and the more it tends toward the business model the more the information journalists work with becomes warped into the social fabric of gossip, celebrity and shock.

But let’s not get carried away. Here is a new way of looking at information. A quick and easy way for you to make it useful. To make it explorable. I’ve just used three free things from the web.

Thing no.1 – ScraperWiki:

I found this data set on the site

It’s by Techbelly as I am only just learning to scrape. But you can request a scraper for any data you can find on the web here.

So I downloaded the CSV file you can see on the top right and imported it into Google docs. Some scrapers allow you to import into Google docs immediately but if not it’s just a matter of download and upload.

Thing no.2 – MapAList

The next easy step is to go to MapAList. This links up directly to your Google docs and will find your spreadsheets. The great thing is that you don’t need longitude and latitude (very few data sets give these) as you can use Google Maps to plot by address.

Once you have an account you hit ‘Create Map’. Choose your source type as Google Spreadsheet, select the file you uploaded as your spreadsheet and, if you have more than one worksheet, choose the one which has the addresses listed. You should be able to make sure you’ve chosen the right file by viewing a sample of the spreadsheet.

Just hit ‘Next’ at the bottom of the page and you’ll then choose the fields (i.e. the columns of your spreadsheet) that will allow the data to be mapped. This data gave the address and postcode in separate fields, which is great. It also gives longitude and latitude; however, these are not given in the original source of the data, which can be found as a search here. The first part is matching your columns to the fields MapAList will read to map your data. So if you’ve got address, postcode and/or longitude and latitude as columns this should be easy enough. The second part is choosing which of your columns you’d like to see on your placemarkers. This should be a matter of which parts of your data are most interesting. For tax exempt works of art I’ve picked the category of art and a description of the object, as you can see. So hit ‘Apply’ and ‘Next’.

The next step troubleshoots your data. As you can see, only 1149 of 1177 records were geocoded. Below this shot are the entries that failed to be geocoded. By looking at them I can see that there were five records where no location was given. The remaining 23 missing entries were due to the fact that the art works were on loan to a public gallery. As the whole point of this exercise is to get people knocking on Lords’, Ladies’, Dames’ and Dukes’ doors, I thought I’d just go with the 1149 entries given.
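The same sorting of failures can be sketched in a few lines of Python, for anyone who wants to check a spreadsheet before uploading it. This is my own rough illustration, not anything MapAList does: the sample records are invented, the `address` column name is an assumption, and so is the idea that loaned works carry the words “on loan” in that column.

```python
# Figures from the troubleshooting step: 28 failures, made up of
# 5 records with no location and 23 works on loan to a public gallery.
TOTAL, GEOCODED = 1177, 1149
assert TOTAL - GEOCODED == 5 + 23


def triage(records):
    """Split records into mappable ones and failures grouped by reason."""
    mappable = []
    failed = {"no location": [], "on loan": []}
    for record in records:
        address = (record.get("address") or "").strip()
        if not address:
            failed["no location"].append(record)
        elif "on loan" in address.lower():
            failed["on loan"].append(record)
        else:
            mappable.append(record)
    return mappable, failed


# Invented sample rows, one of each kind.
SAMPLE_RECORDS = [
    {"object": "Painting A", "address": "1 Example Street, London"},
    {"object": "Sculpture B", "address": ""},
    {"object": "Tapestry C", "address": "On loan to a public gallery"},
]

mappable, failed = triage(SAMPLE_RECORDS)
```

Knowing why a record failed matters more than the count: a missing address is a gap in the data, while a loan is a fact worth reporting in its own right.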

The next step is to choose what your placemarkers will look like. I’ve chosen to have a different placemarker for each category of art work as you can see:

Then you put in a title and select more details depending on how you want your map presented.

You get a preview to make sure everything is as you like. Hit ‘Close this, and create new map’ and voila:

You can play around and get directions here. You can even embed it on a web page and send it to a friend using the share button at the bottom. Sadly, there’s no way to get it into a WordPress blog. But wait! There’s more. Download it as a KML using the download button at the bottom right.

Thing no.3 – Google Earth:

For more fun just open it up with Google Earth. That way, where there is more than one piece of art work at a particular location, Google Earth will split them up for you. It also means you can add layers from all your data sets. Now that’s what I call information!
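If you would rather see those shared locations as a list than click around the globe, the KML file can be read directly. Below is a minimal sketch with Python's standard library. The KML here is a tiny invented sample, and I am assuming the downloaded file uses the standard KML 2.2 namespace with simple point placemarks; I have not checked exactly what MapAList writes out.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Standard KML 2.2 namespace (assumed to match the downloaded file).
KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}

# Invented sample: two works at one address, one at another.
SAMPLE_KML = """
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
      <name>Painting A</name>
      <Point><coordinates>-0.1276,51.5072,0</coordinates></Point>
    </Placemark>
    <Placemark>
      <name>Sculpture B</name>
      <Point><coordinates>-0.1276,51.5072,0</coordinates></Point>
    </Placemark>
    <Placemark>
      <name>Tapestry C</name>
      <Point><coordinates>-1.2577,51.7520,0</coordinates></Point>
    </Placemark>
  </Document>
</kml>
""".strip()


def placemarks_by_location(kml_text):
    """Group placemark names by their coordinate string."""
    root = ET.fromstring(kml_text)
    grouped = defaultdict(list)
    for placemark in root.iterfind(".//kml:Placemark", KML_NS):
        name = placemark.findtext("kml:name", default="", namespaces=KML_NS)
        coords = placemark.findtext(
            "kml:Point/kml:coordinates", default="", namespaces=KML_NS
        ).strip()
        if coords:
            grouped[coords].append(name)
    return dict(grouped)


locations = placemarks_by_location(SAMPLE_KML)
# Locations holding more than one work of art.
crowded = {place: names for place, names in locations.items() if len(names) > 1}
```

A location with a long list of works against it is exactly the kind of needle this exercise is meant to find.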

For those not familiar with the scheme, if you have a work of art, you can register it under the scheme to avoid paying certain taxes – including inheritance tax – but under the condition that you look after it and make it available for public viewing. A few of the people signed up to the scheme are keener on the tax advantages and less on the public availability…

Points to note:

MapAList can’t handle more than 8,000 records. You can update a map when the Google Spreadsheet updates by going into ‘Manage’ and hitting the update option. That means any extra information that you add to a spreadsheet can go on the map without having to create a new one. But what would be ultimately useful is to have it update whenever the web page (the original source of the data) gets updated. Now that would be a good journalism tool!
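One simple way to know when the source page has changed is to keep a fingerprint of its content and compare it on each visit. Here is a minimal sketch of that idea in Python. The page texts are invented, the function names are my own, and fetching the page and re-running the scraper are deliberately left out; this is the general technique, not a description of how any particular tool does it.

```python
import hashlib


def fingerprint(page_text):
    """Return a short, stable fingerprint of a page's content."""
    return hashlib.sha256(page_text.encode("utf-8")).hexdigest()


def has_changed(page_text, last_fingerprint):
    """True if the page content differs from the last version we saw."""
    return fingerprint(page_text) != last_fingerprint


# Invented example: the page as first seen, and after a new entry is added.
first_visit = "<li>Painting A</li>"
second_visit = "<li>Painting A</li><li>Sculpture B</li>"

stored = fingerprint(first_visit)
needs_update = has_changed(second_visit, stored)  # the page gained an entry
```

Only the fingerprint needs storing between visits, so the check is cheap enough to run on a schedule and only trigger a fresh scrape and map update when something has actually changed.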

I think this would be possible with ScraperWiki. I’m going to find out…