Posts Tagged ‘ScraperWiki’

Seeing as tomorrow is Open Data Day and I claim to be a Data Journalist (I think JournoCoder is more suitable), here’s a little data food for journalistic thought.

RSS Feed of US Nuclear Reactor Events

Here is the site showing the power output status of US nuclear reactors. Here is the scraper for that site, written by ScraperWiki founder Julian Todd. Here is my script for catching the unplanned events and converting them to RSS format. And here is the URL you can use to subscribe to the feed yourself:

https://api.scraperwiki.com/api/1.0/datastore/sqlite?format=rss2&name=us_nuclear_reactor_rss&query=select%20title%2C%20link%2C%20description%2C%20date%2C%20guid%20from%20%60swdata%60%20limit%2020
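That long URL is nothing more exotic than a URL-encoded SQL query (select title, link, description, date, guid from `swdata` limit 20) run against the scraper’s datastore, with the output format set to RSS 2.0. If you would rather pull the feed in code than in a feed reader, here is a minimal sketch using the third-party feedparser library (my choice for illustration, not something the original scripts rely on):

```python
# Fetch and print the nuclear reactor events feed via the ScraperWiki API.
# Requires the third-party "feedparser" package (pip install feedparser).
import feedparser

# The same URL as above: a URL-encoded SQL query against the scraper's
# `swdata` table, returned as RSS 2.0.
FEED_URL = (
    "https://api.scraperwiki.com/api/1.0/datastore/sqlite"
    "?format=rss2&name=us_nuclear_reactor_rss"
    "&query=select%20title%2C%20link%2C%20description%2C%20date%2C%20guid"
    "%20from%20%60swdata%60%20limit%2020"
)

feed = feedparser.parse(FEED_URL)
for entry in feed.entries:
    print(entry.title, entry.link)
```

Tweaking the SQL in the query parameter gives you different slices of the same data.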

Oh, and a video (using the example to go through the ScraperWiki API)


ScraperWiki has had a little facelift of late so I’ve done a couple of screencasts (one more to come). Enjoy!

When it comes to gaining knowledge, insight and foresight into the developing media landscape, the best forms of education lie outside the classroom. I am a huge proponent of self-learning through experimentation. So I constantly go to events, lectures, hackathons and conferences.

I have recently been to HacksHackers NYC and HacksHackers TO, as well as universities and newsrooms in the US. I find myself preaching the data journalism cause but also looking to learn more (code, as with journalism, is all about continuous learning).

An amazing opportunity that rolls everything into one brilliant bonanza of creativity, collaboration and coding is the Mozilla Festival taking place in London, UK on 4-6 November. The theme is Media, Freedom and the Web, and if that isn’t enough to entice you, I suggest you take a look at the line-up as well as the star attendees.

ScraperWiki and DataMinerUK will be there as part of the Data Driven Journalism Toolkit. So come along if you wanna dig the data and do a whole lot more!

So I’m in the US, preparing to roll out events. To make decisions as to where to go I needed to get data. I needed numbers on the type of people we’d like to attend our events. In order to generate good data projects we would need a cohort of guests to attend. We would need media folks (including journalism students) and people who can code in Ruby, Python, and/or PHP. We’d also like to get the data analysts, or Data Scientists as they are now known, particularly those who use R.

So with assistance from Ross Jones and Friedrich Lindenberg, I scraped the locations (plus websites and telephone numbers where available) of all the media outlets in the US (changing the structure to suit its intended purpose, which is why there are various tables), data conferences, Ruby, PHP, Python and R meetups, B2B publishers (Informa and McGraw-Hill) and the top 10 journalism schools. I added small sets by hand, such as HacksHackers chapters. All in all, nearly 12,800 data points. I had never used Fusion Tables before but I had heard good things, so I mashed up the data and imported it into Fusion Tables. Here it is (click on the image, as sadly wordpress.com does not support iframes):

Click to explore on Fusion Tables

Sadly there is a lot of overlap, so not all the points are visible. Google Earth explodes the points on the same spot; however, it couldn’t handle this much data when I exported it. Once we decide where best to go I can home in on exact addresses. I wanted to use it to pinpoint concentrations, so a heat map of the points was mostly what I was looking for.

Click to explore on Fusion Tables

Using Fusion Tables, I then broke down the data for the hot spots. I looked at the category proportions and, using the filter and aggregate functions, made pie charts (see New York City, for example). The downsides I found with Fusion Tables are that the colour schemes cannot be adjusted (I had to fix them up using Gimp) and that the filters are AND statements only (there is no OR option). The downside with US location data is the similarity of place names across states (and places that share a name with a state), so I had to eye up the data. So here is the breakdown for each region, where the size of the pie chart corresponds to the number of data points for that location. The sizing is relative within each region, not across regions.
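As an aside, the same filter-and-aggregate step can be done in the database before anything reaches the charting tool, which sidesteps the AND-only filter limitation. A minimal sketch, assuming the mashed-up data sits in a single SQLite table called locations with city and category columns (an illustrative schema, not the one actually used above):

```python
# Count data points per category for one hot spot, directly in SQLite.
# Table and column names ("locations", "city", "category") are assumptions
# made for this sketch, not the real schema behind the Fusion Tables map.
import sqlite3

conn = sqlite3.connect("us_locations.db")
rows = conn.execute(
    """
    SELECT category, COUNT(*) AS n
    FROM locations
    WHERE city = ?
    GROUP BY category
    ORDER BY n DESC
    """,
    ("New York",),
)
for category, n in rows:
    print(category, n)
conn.close()
```

The resulting counts can then be dropped into whichever charting tool you prefer.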

Of course media outlets would outnumber coding meetups, universities and HacksHackers chapters, but that also makes them a better measure of population size and city economy.

What I’ve learnt from this is:

  1. Free tools are simple to use if you play around with them
  2. They can be limiting for visual mashups
  3. The potential of your outcome is proportional to your data size, not your tool functionality (you can always use multiple tools)
  4. To work with different sources of data you need to think about your database structure and your outcome beforehand
  5. Manipulate your data in the database, not in your tool, and always keep the integrity of the source data
  6. Having data feed into your outcome changes your role from event reporter to source
This all took me about a week, in between doing other ScraperWiki stuff and speaking at HacksHackers NYC. If I were better at coding I imagine this could be done in a day, no problem.
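For anyone wondering what that week of scraping looked like in practice, each scraper boiled down to the same fetch-parse-save pattern, writing into its own datastore table (which is why there are various tables). A rough sketch, with a placeholder URL, XPath and field names rather than any of the real sources:

```python
# A sketch of the fetch-parse-save pattern behind each location scraper.
# The URL, XPath and field names are placeholders; the point is the shape:
# one scraper per source, one SQLite datastore table per source type.
import scraperwiki   # the ScraperWiki library: scrape() plus a SQLite datastore
import lxml.html

html = scraperwiki.scrape("http://example.com/directory-of-media-outlets")
root = lxml.html.fromstring(html)

for row in root.xpath("//table[@class='directory']//tr"):
    cells = [td.text_content().strip() for td in row.xpath(".//td")]
    if len(cells) < 4:
        continue  # skip header rows and anything malformed
    record = {
        "name": cells[0],
        "location": cells[1],      # city/state string, geocoded later for the map
        "website": cells[2] or None,
        "phone": cells[3] or None,
        "category": "media_outlet",
    }
    # Each source type (media outlets, meetups, journalism schools, B2B
    # publishers...) gets its own table in the scraper's datastore.
    scraperwiki.sqlite.save(
        unique_keys=["name", "location"],
        data=record,
        table_name="media_outlets",
    )
```

Exporting the merged tables as CSV is then all Fusion Tables needs for the import.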

Here’s a little experiment in using data for news:

Things have been quiet on the blog front and I apologize. What began as a tumultuous year with a big risk on my part has become even more turbulent. Happily with opportunities rather than uncertainties. Trips to Germany and the US have landed in my lap. Both hugely challenging and exciting.

I completed the Knight Mozilla Learning Lab successfully and have been invited to Berlin for the MoJoHackfest next week. I’m really looking forward to meeting all the participants and getting some in-depth, hands-on experience of creating applications built around a better news flow.

This is a level between the hack days ScraperWiki ran and the ScraperWiki platform development itself (I don’t play a part in this but work closely with those who do), which is more akin to the development newsroom.

My pitch for the Learning Lab, Big Picture, asks a lot of developers who come with their own great ideas and prototypes. I would love to get some of the functionality working, but that very much depends on the goodwill, skills and availability of a small group of relative strangers.

I have a tendency to bite off more than I can chew and ask a lot of people who have no vested interests in my development. I am acutely aware that I cannot build any part of the Big Picture project. That being said I have built a new project that can be added to with a basic knowledge of Python. I give you MoJoNewsBot:

If you want to know more about how the Special Advisers’ query was done, read my ScraperWiki blog post. Also, I fixed the bug in the Google News search so the links match the headlines.
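For a flavour of how a bot command like the news search hangs together, here is a minimal sketch that keeps each headline paired with its link, which is essentially what that bug fix was about. It uses today’s Google News RSS search endpoint and the feedparser package; neither is necessarily what the original MoJoNewsBot module used:

```python
# Return (headline, link) pairs for a news query, keeping each link
# attached to its headline. The RSS endpoint and the feedparser dependency
# are assumptions for this sketch, not taken from the MoJoNewsBot code.
import urllib.parse
import feedparser

def news_search(query, limit=5):
    url = "https://news.google.com/rss/search?q=" + urllib.parse.quote(query)
    feed = feedparser.parse(url)
    # Title and link travel together in one tuple, so they cannot drift apart.
    return [(entry.title, entry.link) for entry in feed.entries[:limit]]

for headline, link in news_search("special advisers"):
    print(headline, "->", link)
```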

Come October I will be heading to the US to help fulfill part of ScraperWiki’s obligations to the Knight News Challenge. I am honoured to be one of ScraperWiki’s first full-time employees and actually get paid to further the field of data journalism!

Being part of a startup has its risks. No one’s role is ever fully defined. This really is a huge experiment and I’m not sure I can even describe what it is I am doing. I am not a noun, however. I am a verb. My definition is in my functionality, and defining this through ScraperWiki, MoJo and any other opportunities that come my way will be the basis of this blog from now on. So my posts will be sporadic but I hope you look forward to them.

Just to let you know that the Twitter account @Scrape_No10, which tweets out ministers’, special advisers’ and permanent secretaries’ meetings, gifts and hospitality, is back up and tweeting. You can read the post about its creation here and download all the data the account contains. This account needs more coding maintenance than the @OJCstatements account (read about it here) because the data is contained in CSV files posted onto a webpage, and I build the sentences to be tweeted from the rows and columns. The scraper feeding the Twitter account feeds off five separate scrapers of the CSV files. Because of this, the account is more likely to throw up errors than the simple scraping of the Office for Judicial Complaints site.
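To give a flavour of the row-to-sentence step, here is a minimal sketch. The URL and column headers are invented for illustration (the real CSVs differ from department to department, which is exactly why the scrapers break), and the real account posts via the Twitter API rather than printing:

```python
# Turn each row of a published transparency CSV into a tweet-length sentence.
# The URL and column names below are placeholders, not a real data release.
import csv
import io
import urllib.request

CSV_URL = "http://example.gov.uk/transparency/ministerial-meetings.csv"

with urllib.request.urlopen(CSV_URL) as response:
    text = response.read().decode("utf-8", errors="replace")

for row in csv.DictReader(io.StringIO(text)):
    tweet = "{minister} met {org} ({purpose}) in {date}".format(
        minister=row["Minister"],
        org=row["Name of organisation"],
        purpose=row["Purpose of meeting"],
        date=row["Date"],
    )
    print(tweet[:140])  # the 140-character limit that applied at the time
```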

So I decided, as I’m learning to code and structure scrapers, to run the scrapers manually every time the Twitter account stops, fix the bugs and set the account tweeting again. There will be better ways to structure the scrapers but right now I’m concentrating on the coding.

Learning to scrape CSVs is very handy as lots of government data are released as CSV. On that note, there is a CSV documentation/tutorial on ScraperWiki, although it is aimed at programmers. For those interested in learning to code/scrape I would recommend “Learn Python the Hard Way” (which is the easiest for beginners; it’s just ‘hard’ for programmers because it involves typing code!). For more front-end work I have recently discovered Codecademy. I can’t vouch for it but it looks interesting enough. I have also put all the datasets for the @Scrape_No10 account on BuzzData as an experiment.