Archive for the ‘My Data Journey’ Category

If you don’t already know, this year brings a fresh new challenge to a journo-coder wannabe who calls herself DataMinerUK. I am a Knight-Mozilla fellow at the Guardian and am looking to learn to code and make news in the open. As such, I have moved this blog to a self-hosted website, where I can embed iframes.

It is long overdue: although I found a hosted blog much easier to manage, my goal to build news using open frameworks means I need a proper platform. It should look and feel pretty much the same. I’m not looking to become a hot-shot developer and build an entire content management system (which the Guardian has done) but to work on content. Let’s use digital tools to find the message as well as build the medium.

My focus is on content. I’d describe myself more as a journalist than a coder. But with a lot of help from the Interactives team at the Guardian and my fellow fellows, I can harness the power of open source, open journalism and open news to bring you content in weird and wonderful ways. What they’ll be is anyone’s guess, yours as well as mine.

So stay tuned!

This post is for December’s Carnival of Journalism, where we were asked to write about the best present from programmers/journalists that Santa Claus could leave under our Christmas tree. Stick people courtesy of xkcd.

My Christmas wish from both programmers and journalists would be the realisation that both cultures depend on the goodwill and morals of their communities to avoid corruption and, ultimately, the disintegration of the integrity of their respective professions. No institution can govern fast enough to police either party. And from both I would like to see a sense of solidarity in our respective hopeless cases, as illustrated below. (Also, I would like more coding lessons – teach a man to fish and whatnot…)

The Crisis

The Confrontation

The Cure?

So things have gone quiet on the blog front but screamingly loud behind the scenes. I have met some amazing and inspiring people in both the US and Canada. A lot of them have been taken aback by my journey which I have documented here and which I should update on my timeline.

I have called myself a ‘human experiment’. I am trying to create a so-called ‘Data Journalist’ and left the newsroom to retrain. Whilst working with a startup I have come to realise that I am not an experiment, I am a startup. A living, breathing, learning, iterating, startup.

I have used the web to test my seed idea by finding organisations like HacksHackers and attending conferences like NewsRewired. I have used this blog and my journey to seek validation. I have acted to develop my business acumen, creating point stories, Twitter accounts and bots.

I now have ‘Angel’ funding provided by the Knight Foundation and invested by Mozilla. This means I have to go into testing. Across the Atlantic I was doing a lot of outreach. I believed in ScraperWiki more than myself, in some ways I still do. They are a proper startup, a business, an institution, and I believe in their power as a tool for social good. And that will never die. And so the startup will never fail in my eyes.

As the tech startup joke goes: “A million guys walk into a Silicon Valley bar. None of them buy anything. The bar is declared a rousing success.” Business is a hard world. Your passion for business is fundamentally what is needed to drive even the best ideas. My passion is not in business development, it’s in news development. And good news will always find a way to survive. I am not looking to succeed as a business but as a startup. So how can a startup not be a business?

In the same vein, how can an experiment not have a tangible result? I was never going to get to a stage when I could say ‘I have succeeded’. My experiment was to see if learning some programming (that journey still continues) and gathering news differently would be a viable route into a newsroom (when newsrooms are haemorrhaging people and resources) in order to produce news and not just talk about it. I wondered: can I restructure myself faster, and in the right direction for the evolving news industry, by using the lean-startup model of discovery?

Now that has been verified, it’s back to the hypothesis. It’s back to testing and now I have a lab that is willing to take me on, The Guardian. And I couldn’t ask for a better lab. But it’s up to me to continue the research.

I was blown away by the ‘congratulations’ I received at the announcement of my fellowship (which was uploaded here, here and here) but the fellowship is not an acknowledgement of what I have done; it is an opportunity to test out what I can do. It’s the end of one leg of my journey and the beginning of an even more daunting road.

My journey is made less daunting by the fact that I have travel partners in Cole Gillespie, Dan Schultz, Laurian Gridinoc and Mark Boas (although their skills and experience make for quite intimidating company). I have carriage provided by The Guardian and service by Knight.

So what is my success? What is my lesson learnt? I took the road less traveled by and that has made all the difference:

So I’m in the US, preparing to roll out events. To decide where to go, I needed data: numbers on the type of people we’d like to attend our events. In order to generate good data projects we would need a cohort of guests to attend. We would need media folks (including journalism students) and people who can code in Ruby, Python and/or PHP. We’d also like to get the data analysts, or Data Scientists as they are known, particularly those who use R.

So, with assistance from Ross Jones and Friedrich Lindenberg, I scraped the locations (plus websites and telephone numbers, where available) of all the media outlets in the US (changing the structure to suit its intended purpose, i.e. why there are various tables), data conferences; Ruby, PHP, Python and R meetups; B2B publishers (Informa and McGraw-Hill); and the top 10 journalism schools. I added small sets in by hand, such as HacksHackers chapters. All in all, nearly 12,800 data points. I had never used Fusion Tables before but I had heard good things. So I mashed up the data and imported it into Fusion Tables. Here it is (click on the image as, sadly, this blog does not support iframes):

Click to explore on Fusion Tables
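The mash-up step itself is simple to script: each scraped source gets normalised to one shared schema before import. Here is a minimal sketch of that merge in Python; the file names and column names are illustrative, not the actual ScraperWiki table schemas:

```python
import csv

# Each scraped source lives in its own CSV with its own columns;
# map every file to the category label it contributes.
SOURCES = {
    "media_outlets.csv": "Media outlet",
    "python_meetups.csv": "Python meetup",
    "journalism_schools.csv": "Journalism school",
}

def merge(sources, out_path="mashup.csv"):
    """Flatten several scraped CSVs into one table with a shared
    schema that Fusion Tables can geocode on the location column."""
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["name", "category", "location"])
        for path, category in sources.items():
            with open(path, newline="") as f:
                for row in csv.DictReader(f):
                    # "City, State" in one column geocodes reliably
                    writer.writerow([row["name"], category,
                                     f"{row['city']}, {row['state']}"])
```

Fusion Tables then treats the `category` column as something you can filter and colour by, which is what makes the single merged table more useful than the separate source tables.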

Sadly there is a lot of overlap, so not all the points are visible. Google Earth explodes points that share a spot; however, it couldn’t handle this much data when I exported it. Once we decide where best to go, I can home in on exact addresses. I wanted to use the map to pinpoint concentrations, so a heat map of the points was mostly what I was looking for.

Click to explore on Fusion Tables

Using Fusion Tables, I then broke down the data for the hot spots. I’ve looked at the category proportions and, using the filter and aggregate functions, made pie charts (see New York City, for example). The downsides I found with Fusion Tables are that the colour schemes cannot be adjusted (I had to fix them up using Gimp) and that the filters are AND statements (there is no OR option). The downside with US location data is the similarity of place names across states (and places sharing a name with a state), so I had to eye up the data. So here is the breakdown for each region, where the size of the pie chart corresponds to the number of data points for that location. Sizes are relative within each region, not across regions.

Of course media outlets would outnumber coding meetups, universities and HacksHackers Chapters, but they would be a better measure of population size and city economy.
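The same per-city category proportions can be sanity-checked outside Fusion Tables with a few lines of Python. A small sketch, using made-up rows rather than the real dataset:

```python
from collections import Counter, defaultdict

def category_breakdown(rows):
    """Count data points per (city, category) so pie-chart
    proportions can be checked against the Fusion Tables view."""
    by_city = defaultdict(Counter)
    for city, category in rows:
        by_city[city][category] += 1
    return by_city

# Toy rows standing in for the merged table
rows = [
    ("New York", "Media outlet"),
    ("New York", "Media outlet"),
    ("New York", "Python meetup"),
    ("Chicago", "Journalism school"),
]
# category_breakdown(rows)["New York"]
# -> Counter({'Media outlet': 2, 'Python meetup': 1})
```

Because `Counter` keeps raw counts per category, it also makes the caveat above easy to apply: you can drop or down-weight the media-outlet category before comparing cities.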

What I’ve learnt from this is:

  1. Free tools are simple to use if you play around with them
  2. They can be limiting for visual mashups
  3. The potential of your outcome is proportional to your data size, not your tool’s functionality (you can always use multiple tools)
  4. To work with different sources of data, you need to think about your database structure and your outcome beforehand
  5. Manipulate your data in the database, not in your tool; always keep the integrity of the source data
  6. Having data feed into your outcome changes your role from event reporter to source

This all took me about a week, between doing other ScraperWiki stuff and speaking at HacksHackers NYC. If I were better at coding, I imagine this could be done in a day, no problem.

Here’s a little experiment in using data for news:

So I’m back from Berlin and in the US. I met some amazing people at the Knight Mozilla Hacktoberfest, a four-day hackathon with people from all over the world and from all walks of life. It was the most fun I’ve had all year and I’ve made some friends for life. The project ideas were brilliant and the discussion inspiring. Having the news partners (Al Jazeera, BBC, Guardian, Boston Globe and Zeit) as active participants was a great move on Mozilla’s part. That big news organisations look outside for ideas and solutions shows they realise news is out there, not solely within structured organisations.

I remember first seeing a blog post about this partnership process and thinking: “Wow, I wish I could apply. Shame I’m not a developer”. I went along to the application process out of curiosity and thankfully my creative juices got the best of me.

Even then, my scepticism told me not to expect any part of my MozNewsLab pitch, the Big Picture, to be built in four days, so I made a little side project, MoJoNewsBot. On the third day of the hackathon I presented my data-stream-connected chat bot via the Big Discussion part of Big Picture. Thanks to an amazing participant, David Bello, we got a conference with website submission, approval and iframe designed and coded in two days. I only found out just before presenting that he is in management at a university in Colombia and doesn’t code for a living. I was truly blown away by how an idea, once developed, designed and pitched, can be made reality owing solely to the goodwill of someone who “plays” with code.

You can keep track of both projects, Big Picture and MoJoNewsBot, on the Mozilla wiki. I’m looking to make the first and third parts of Big Picture with further help and advice from the participants. Thanks to the magic of GitHub and DotCloud, I have a local version of Big Picture running on my computer. I’m going to learn JavaScript and add to/clean up Big Picture before I present it formally on my blog. As for my chat bot, I need to add error messages and tidy up the code a bit. Then I’ll relocate him from the #botpark to #HacksHackers on IRC. During events in the US I’m going to add more modules with interesting data for journalists to reference.

To all my viewers, whoever you are, I recommend you hop on the MoJo bandwagon next year. It’ll be the ride of your life! Almost as eventful as driving the ScraperWiki digger 😉

Things have been quiet on the blog front and I apologize. What began as a tumultuous year with a big risk on my part has become even more turbulent. Happily with opportunities rather than uncertainties. Trips to Germany and the US have landed in my lap. Both hugely challenging and exciting.

I completed the Knight Mozilla Learning Lab successfully and have been invited to Berlin for the MoJoHackfest next week. I’m really looking forward to meeting all the participants and getting some in depth hands-on experience of creating applications built around a better news flow.

This is a level between the hack days ScraperWiki ran and the ScraperWiki platform development itself (I don’t play a part in this but work closely with those who do), which is more akin to the development newsroom.

My pitch for the Learning Lab, Big Picture, asks a lot of developers who come with their own great ideas and prototypes. I would love to get some of the functionality working, but that very much depends on the goodwill, skills and availability of a small group of relative strangers.

I have a tendency to bite off more than I can chew and to ask a lot of people who have no vested interest in my development. I am acutely aware that I cannot build any part of the Big Picture project myself. That being said, I have built a new project that can be added to with a basic knowledge of Python. I give you MoJoNewsBot:

If you want to know more about how the Special Advisers’ query was done, read my ScraperWiki blog post. Also, I fixed the bug in the Google News search so the links match the headlines.
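For anyone wondering what “a basic knowledge of Python” buys you here: chat bots like this typically route messages to modules through a small command dispatcher. This is a self-contained sketch of that pattern, not actual MoJoNewsBot code; the `.hello` command and its reply are invented for illustration:

```python
import re

# Registry mapping command patterns to handler functions,
# the pattern such IRC bots use for their pluggable modules.
COMMANDS = {}

def command(pattern):
    """Decorator: register a function as the handler for a command."""
    def register(func):
        COMMANDS[re.compile(pattern)] = func
        return func
    return register

@command(r"\.hello(?:\s+(.*))?$")
def hello(match):
    # A toy module: a real news-bot module would fetch a data
    # feed here and reply with headlines or figures instead.
    name = match.group(1) or "world"
    return f"Hello, {name}!"

def dispatch(line):
    """Route one chat line to the first matching module, if any."""
    for pattern, func in COMMANDS.items():
        m = pattern.match(line)
        if m:
            return func(m)
    return None
```

Adding a new command is just another decorated function, which is why contributors who only “play” with code can still extend the bot.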

Come October I will be heading to the US to help fulfill part of ScraperWiki’s obligations to the Knight News Challenge. I am honoured to be one of ScraperWiki’s first full-time employees and actually get paid to further the field of data journalism!

Being part of a startup has its risks. No one’s role is ever fully defined. This really is a huge experiment and I’m not sure I can even describe what it is I am doing. I am not a noun, however. I am a verb. My definition is in my functionality, and defining this through ScraperWiki, MoJo and any other opportunities that come my way will be the basis of this blog from now on. So my posts will be sporadic, but I hope you look forward to them.