
So I’m in the US, preparing to roll out events. To decide where to go I needed data: numbers on the type of people we’d like to attend our events. To generate good data projects we would need a cohort of guests including media folks (journalism students among them) and people who can code in Ruby, Python and/or PHP. We’d also like to attract data analysts, or data scientists as they are now known, particularly those who use R.

So with assistance from Ross Jones and Friedrich Lindenberg, I scraped the locations (plus websites and telephone numbers where available) of all the media outlets in the US, data conferences, Ruby, PHP, Python and R meetups, B2B publishers (Informa and McGraw-Hill) and the top 10 journalism schools, restructuring the data to suit its intended purpose (hence the various tables). I added small sets in by hand, such as HacksHackers chapters. All in all, nearly 12,800 data points. I had never used Fusion Tables before but had heard good things, so I mashed up the data and imported it into Fusion Tables. Here it is (click on the image, as sadly wordpress.com does not support iframes):

Click to explore on Fusion Tables
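For anyone curious what the mash-up step looks like, here is a minimal sketch of merging per-category tables into one file for a Fusion Tables import. The file names and column layout are my own illustration, not the actual scraper output:

```python
import csv

# Hypothetical per-category CSV files; the real scraper tables varied
# in structure, which is why there were several of them.
sources = {
    "media_outlet": "media_outlets.csv",
    "data_conference": "data_conferences.csv",
    "python_meetup": "python_meetups.csv",
    "r_meetup": "r_meetups.csv",
}

with open("fusion_upload.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["name", "latitude", "longitude", "category"])
    for category, path in sources.items():
        with open(path) as f:
            for row in csv.DictReader(f):
                # Tag each point with its category so Fusion Tables can
                # colour, filter and aggregate by it after import.
                writer.writerow([row["name"], row["lat"], row["lng"], category])
```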

Sadly there is a lot of overlap, so not all the points are visible. Google Earth explodes points that share a spot, but it couldn’t handle this much data when I exported it. Once we decide where best to go I can home in on exact addresses. For now I wanted to pinpoint concentrations, so a heat map of the points was mostly what I was looking for.

Click to explore on Fusion Tables

Using Fusion Tables I then broke down the data for the hot spots. I looked at the category proportions and, using the filter and aggregate functions, made pie charts (see New York City, for example). The downsides I found with Fusion Tables are that the colour schemes cannot be adjusted (I had to fix them up using Gimp) and that the filters are AND statements only (there is no OR option). The downside with US location data is the similarity of place names across states (plus places that share a name with a state), so I had to eyeball the data. So here is the breakdown for each region, where the size of a pie chart corresponds to the number of data points for that location. Sizes are relative within each region, not across regions.
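If you would rather not retouch chart colours in Gimp, a plotting library such as matplotlib (my suggestion here, not what I actually used) lets you set the palette directly. A sketch with made-up numbers, not the real New York City proportions:

```python
import matplotlib.pyplot as plt

# Illustrative category counts for one hot spot (not the real figures).
labels = ["Media outlets", "Coding meetups", "J-schools", "HacksHackers"]
counts = [220, 45, 10, 2]
colours = ["#4e79a7", "#f28e2b", "#76b7b2", "#e15759"]  # a palette of your choosing

plt.pie(counts, labels=labels, colors=colours, autopct="%1.0f%%")
plt.title("New York City: guest-type breakdown (illustrative)")
plt.savefig("nyc_breakdown.png")
```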

Of course media outlets will outnumber coding meetups, universities and HacksHackers chapters, but they serve as a better measure of population size and city economy.

What I’ve learnt from this is:

  1. Free tools are simple to use if you play around with them
  2. They can be limiting for visual mashups
  3. The potential of your outcome is proportional to your data size, not your tool functionality (you can always use multiple tools)
  4. To work with different sources of data you need to think about your database structure and your outcome beforehand
  5. Manipulate your data in the database, not in your tool; always keep the integrity of the source data
  6. Having data feed into your outcome changes your role from event reporter to source
This all took me about a week, in between other ScraperWiki work and speaking at HacksHackers NYC. If I were better at coding, I imagine this could be done in a day, no problem.

Although “data journalism” can encompass infographics, interactives, web apps, FOI, databases and a whole host of other number-crunching, coding and display techniques, the road less travelled by has certain steps, turns and speed bumps. In that sense, here’s a list of things to tick off if you’re interested in going down the data journalism road:

  1. Know the legal boundaries – get to know the Data Protection Act 1998 and the sections on access to personal data and unstructured personal data held by authorities. Do not set foot on your journey without reading the section on exemptions relating to journalism. Use legislation as a reference by downloading the Mobile Legislate app.
  2. Look at data – get to know what is out there, what format it’s in and where it’s coming from. Places like Data.gov.uk, London Datastore, Office for National Statistics and Get the Data are good places to start for raw data but don’t forget, anything on the web is data. The best data are often hidden. Data can be text and pictures so even mining social media and catching the apps built from them can give you insight into what can be done with data.
  3. Read all about it – to make data and stats accessible you need to know how to frame them within a story. In that sense, you need to know how to understand the stories they tell. That doesn’t mean going on a stats course. There is a lot of accessible reading material out there and I would recommend The Tiger That Isn’t.
  4. Get connected – find HacksHackers near you and join Meetup groups to point you in the right directions. Data journalists’ interests and abilities are unique to the individual (much like programmers’) so don’t take any piece of advice as set in stone (the web changes too quickly for that!). Find your own way and your own set of people to guide you. Go to courses and conferences. Look outside the journalism bubble. Data is more than just news.
  5. Spread your bets – the easiest way to sort data is by using spreadsheets. Start with free options like Google Docs and OpenOffice. Industry standards include Microsoft Excel and Access. Learn to sort, filter and pivot. Find data you’re interested in and explore it using your eyeballs. Know what each piece of software does and can do to the data before mashing it with another piece of software.
  6. Investigate your data – query it using the simple language SQL and the software MySQL. It’s a bit tricky to set up, but by now you’ll know a hacker you can ask for help (a minimal query example follows this list)! Clean your data using Google Refine. There are tutorials and a help wiki. Know how these tools function, not just how to navigate their user interfaces, as the interfaces will change. These products go through iterations much more quickly than spreadsheet software.
  7. Map your data – from Google spreadsheets the easiest way to build a map is by using MapAList. There is a long list of mapping software, from GeoCommons to ArcGIS. Find what’s easiest for you and most suitable for your data. See what landscapes can be revealed and home in on areas of interest. Understand the limitations of mapping data: you’ll find devolution makes it difficult to get data for the whole of the UK, and some postcodes will throw up errors.
  8. Make it pretty – visualize your data only once you fully understand it (source, format, timeframe, missing points, etc). Do not jump straight to this as visuals can be misleading. Useful and easy software solutions include Google Fusion Tables, Many Eyes and Tableau. Think of unique ways to present data by checking out what the graphics teams at news organizations have made but also what design sites such as Information is Beautiful and FlowingData are doing.
  9. Make your community – don’t just find one, build one. This area in journalism is constantly changing and for you to keep up you’ll need to source a custom made community. So blog and tweet but also source ready-made online communities from places like the European Journalism Centre, National Institute for Computer Assisted Reporting (NICAR), BuzzData and DataJournalismBlog.
  10. Scrape it – do not be constrained by data. Liberate it, mash it, make it usable. Just like a story, data is unique, and bad data journalism comes from constraining it to the medium containing it. With code, there is no need to make the story ‘fit’ the medium. “The Medium is the Message” (à la Marshall McLuhan). Scrape the data using ScraperWiki and make applications beyond storytelling. Make data open. For examples check out OpenCorporates, Schooloscope and Planning Alerts. If you’re willing to give coding a try, the book “Learning Python the Hard Way” is actually the easiest way for the non-programmer to learn. There is also a Google Group for Python Journalists you should join.
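To make item 6 a little more concrete, here is a minimal sketch of querying spending data with SQL. It uses Python’s built-in sqlite3 module rather than MySQL so it runs without any server setup, and the database, table and column names are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect("spending.db")  # hypothetical database of scraped payments
cur = conn.cursor()

# Total spend per supplier, biggest first: the kind of question that
# pushes you from spreadsheet filters to SQL.
cur.execute(
    """
    SELECT supplier, SUM(amount) AS total
    FROM payments
    GROUP BY supplier
    ORDER BY total DESC
    LIMIT 10
    """
)
for supplier, total in cur.fetchall():
    print(supplier, total)
conn.close()
```

The same GROUP BY pattern is the workhorse of most data stories: who got the most, who paid the most, who turned up most often.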
These are guidelines and not a map for your journey. Your beat, the data landscape, changes at the speed of web. You just need to be able to read the signs of the land as there’s no end point, no goal and no one to guide you.

It’s been a while since I liberated any data, and that’s because I’ve been wrestling with a scraper of government salaries. I’ve only looked at the pay floor for the tables below; this is the minimum pay, with ceiling pay being £4,999 more than the floor. Salaries ranged from £35,000 to £235,000.

There was the coding to deal with, of course. I’ve had my third formal lesson. Lists, ooh er. But the main difficulty came from the awful state the data was in: all CSV files, but spread over the web with a blatant disregard for consistency. Anyway, you can download the data for yourself. There are 440 rows, so looking back at the poor quality of the data one could easily have done a copy-and-paste job, but a scraper that trawls a site for the links to the pages hosting the downloads ensures you get all the data sets and makes the next collection a matter of hitting a button (hopefully).
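The trawling step is less magical than it sounds. Here is a minimal sketch of the idea using the requests and BeautifulSoup libraries (the index URL is a placeholder; my actual scraper ran on ScraperWiki):

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

INDEX_URL = "https://example.gov.uk/salaries"  # placeholder for the real index page

# Find every CSV link on the index page and download it, so a re-run
# picks up newly published data sets automatically.
html = requests.get(INDEX_URL).text
soup = BeautifulSoup(html, "html.parser")
for a in soup.find_all("a", href=True):
    if a["href"].lower().endswith(".csv"):
        csv_url = urljoin(INDEX_URL, a["href"])
        data = requests.get(csv_url).content
        filename = csv_url.rsplit("/", 1)[-1]
        with open(filename, "wb") as f:
            f.write(data)
        print("saved", filename)
```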

Here is the top 10 for pay:

If you’re looking to climb your way up to the top pay band then here’s the top 10 departments:

And since the Information Commissioner has ordered the release of the names of high-paid civil servants who did not want their salaries disclosed, it’s worth noting that the data shows 48 names withheld, amounting to £508,000 in cumulative pay. Most of these redactions came from the Cabinet Office and the Home Office. I can understand why names from the Office for Security and Counter-Terrorism Unit were not disclosed. One consistent pattern in the withheld names is the lawyers: the entire list of names for the Office of the Parliamentary Counsel is N/D, as is the legal advisers branch of the Cabinet Office. Also not named are the Media Director and the Communication and Change Director.
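That cumulative figure is a simple aggregation over the scraped table. A rough sketch of the calculation in Python (the file and column names are my guesses at the scraper output, not the actual schema):

```python
import csv

withheld = 0
cumulative_floor = 0
with open("government_salaries.csv") as f:  # hypothetical download of the scraped data
    for row in csv.DictReader(f):
        if row["name"].strip() == "N/D":  # name not disclosed
            withheld += 1
            # Pay floor arrives as text like "£85,000"; strip it down to digits.
            cumulative_floor += int(row["pay_floor"].replace("£", "").replace(",", ""))

print(withheld, "names withheld; cumulative pay floor:", cumulative_floor)
```

Here are the details where the names are not disclosed: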

Seeing as I like to fly in the face of tradition, I’m going to turn things on their head and write a blog post about how I did it before I publish what “it” actually is. That is, I have scraped all the Cabinet Office spending data, cleaned it up and extracted it. But before I tell you what I’ve found (indeed, I haven’t got around to that properly yet!), I’m going to tell you how I found it.

Firstly, I scraped this page to pull out all the CSV files and put all the data in the ScraperWiki datastore. The scraper can be found here. It has over 1,200 lines of code but don’t worry, I did very little of the work myself! Spending data is very messy, with trailing spaces, inconsistent capitalization and several variant spellings of the same names. So I scraped the raw data, which you can find in the “swdata” tab, downloaded it and plugged it into Google Refine. I used the text facet functions to clean up the suppliers’ names as best I could (I figured these were of the most interest and the most suitable for cleaning). The clean-up steps can be exported by going into the “Undo/Redo” tab and clicking on “Extract…”: select the processes you want the code for, then copy the right-hand box. I pasted this prepackaged code into my scraper.
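Refine’s text facets did the heavy lifting, but the same sort of clean-up is easy to express directly in code. A rough sketch of the kinds of rules involved (the synonym list is illustrative, not my actual Refine history):

```python
import re

# Illustrative canonical forms; the real list came out of
# Google Refine's text facets and clustering.
SYNONYMS = {
    "CAPITA BUSINESS SERVICES LTD": "Capita Business Services Ltd",
    "CAPITA BUSINESS SERVICES LIMITED": "Capita Business Services Ltd",
}

def clean_supplier(name):
    # Collapse runs of whitespace and trim trailing spaces first.
    name = re.sub(r"\s+", " ", name).strip()
    # Then map known variants onto one canonical spelling.
    return SYNONYMS.get(name.upper(), name)

print(clean_supplier("  CAPITA BUSINESS  SERVICES LIMITED "))
```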

So if you want the cleaned data, make sure you select the “Refined” table by hitting the tab and selecting “Download spreadsheet (CSV)”. If you want to use the amount as a numerical field (it was not entered as such in the original!) to get totals for each supplier, for example, you’ll have to use the refined table, as I had to write code to turn “Amount” into numbers. Or if you know a bit of SQL and want to query the data from ScraperWiki, you can use my viewer, to be found here. Either way, here is the data. I have already found something of interest which I am chasing, but if you’re interested in data journalism, here is a data set to play with. Before I can advocate using, developing and refining the tools needed for data journalism, I need journalists (and anyone interested) to actually look at data. So before I say anything about what I’ve found, here are my materials plus the process I used to get them. Just let me know what you find and please publish it!
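For the curious, turning “Amount” into numbers is a one-liner once you know the gotchas: pound signs, thousands commas and stray whitespace. A sketch:

```python
def to_number(amount):
    """Convert a raw spending amount like '£1,234.56 ' to a float."""
    cleaned = amount.replace("£", "").replace(",", "").strip()
    return float(cleaned)

print(to_number("£1,234.56 "))  # 1234.56
```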

————————

Here is a table of the top 10 receivers of Cabinet Office money. I’ve put the image in here, but the original is a view that feeds off the scraper, so as more data gets published this table should update. The information becomes living information, not a static visual. The story is being told, not catalogued.

Oh, and V is vinspired, the youth volunteering charity. They received nearly £44 million over a nine-month period. On their website they say they have received over £48 million from the private sector; I imagine £44 million of that has come straight from the Cabinet Office. The Big Society seems to be costing the government a lot of money at the moment, even though they say it will be mostly funded by the private sector.

And here’s what Tim Berners-Lee, inventor of the World Wide Web, said on the subject of data journalism:

Journalists need to be data-savvy… [it’s] going to be about poring over data and equipping yourself with the tools to analyse it and picking out what’s interesting. And keeping it in perspective, helping people out by really seeing where it all fits together, and what’s going on in the country

How the Media Handle Data:

Data has sprung onto the journalistic platform of late in the form of the Iraq War Logs (mapped by The Guardian), the MPs’ expenses (bought by The Telegraph) and the leaked US Embassy Cables (visualized by Der Spiegel). What strikes me about these big hitters is that the existence of the data is a story in itself. Which is why they had to be covered. And how they could be sold to an editor. These data events force the journalistic platform into handling large amounts of data. The leaks are stories, so there’s your headline before you start actually looking for stories. In fact, the Fleet Street Blues blog pointed out the sorry lack of stories from such a rich source of data, noting the quick turn to headlines about Wikileaks and Assange.

Der Spiegel - The US Embassy Dispatches

So journalism so far has had to handle large data dumps, which has spurred on the area of data journalism. But they also serve to highlight the fact that the journalistic platform as yet cannot handle data: not the steady stream of public data eking out of government offices and public bodies, anyway. What has caught the attention of news organizations is social media, and that’s a steady stream of useful information. But again, all that’s permitted is some fancy graphics hammered out by programmers who are glad to be dealing with something more challenging than picture galleries (here’s an example of how CNN used Twitter data).

So infographics (see the Stanford project: Journalism in the Age of Data) and interactives (e.g. the New York Times’ A Peek into Netflix Queues) have been the keystones on which the journalism data platform is being built. But there are stories, not just pictures, to be found in data. There are strange goings-on that need to be unearthed. And there are players outside of the newsroom doing just that.

How the Data Journalists Handle Data:

Data, before it was made sociable or leakable, was the beat of the computer-assisted reporters. Computer-assisted reporting (CAR) dates as far back as 1989 with the setting up of the National Institute for Computer-Assisted Reporting in the States, soon to be followed by the European Centre for Computer Assisted Reporting. The French group OWNI are the latest (and coolest) revolutionaries when it comes to new-age journalism and are exploring the data avenues with aplomb. CAR then morphed into Hacks/Hackers when reporters realized that computers were tools every journalist should use for reporting. There’s no such thing as telephone-assisted reporting. So some whacky journalists (myself now included) decided to pair up with developers to see what can be done with web data.

This now seems to be catching on in the newsroom. The Chicago Tribune has a data center, to name just one. In fact, the data center at the Texas Tribune drives the majority of the site’s traffic. Data journalism is growing alongside the growing availability of data and of the tools that can be used to extract, refine and probe it. However, at the core of any data-driven story is the journalist, and what needs to be fostered now, I would argue, is the data nose of any journalist. Journalism, in its purest form, is interrogation. The world of data is an untapped goldmine, and what’s lacking now is the data acumen to get digging. There are Pulitzers embedded in the data strata which can be struck with little use of heavy machinery. Data-driven journalism, and indeed CAR, has been around long before social media, web 2.0 and even the internet. One of the earliest examples of computer-assisted reporting was in 1967, after riots in Detroit, when Philip Meyer used survey research, analyzed on a mainframe computer, to show that people who had attended college were as likely to have rioted as high-school dropouts. This turned the public’s attention to the pervasive racial discrimination in policing and housing in Detroit.

Where Data Fits into Journalism:

I’ve been looking at the States, where the broadsheets’ reputation for investigative journalism has produced some real gems. What struck me, looking at news data across the Atlantic, is that data journalism has been seeded earlier and possibly more prolifically than in the UK. I’m not sure if it’s more established, but I suspect so (though not by a wide margin). For example, at the end of 2004 The Dallas Morning News analyzed the school test scores of the Texas Assessment of Knowledge and Skills and uncovered one school’s alleged cheating on standardized tests. This then turned into a story on cheating across the state. The Seattle Times piece of 2008 on logging and landslides revealed how a logging company was blatantly allowed to clear-cut unstable slopes. Not only did they produce an interactive, but the beauty of data journalism (which is becoming a trend) is to write about how the investigation was conducted using the requested data.

The Seattle Times: Landslides in the Upper Chehalis River Basin

Newspapers in the US are clearly beginning to realize that data is a commodity with which you can buy trust from your consumer. The need for speed seems to be diminishing as social media gets there first and viewers turn to the web for richer information. News, in the sense of something new to you, is being condensed into 140-character alerts, newsletters, status updates and things that go bing on your mobile device. News companies are starting to think about news online as exploratory information that speaks to the individual (which is web 2.0). So The New York Times has mapped the census data in its project “Mapping America: Every City, Every Block”. The Los Angeles Times has also added crime data so that its readers are informed citizens, not just site surfers. My personal heroes are the investigative reporters at ProPublica, who not only partner with mainstream news outlets for projects like Dollars for Doctors but also blog about the new tools they’re using to dig the data. Proof that the US is heading down the data mine is the fact that the Pulitzer finalists for local journalism included a two-year data dig by the Las Vegas Sun into preventable medical mistakes in Las Vegas hospitals.

Lessons in Data Journalism:

Another sign that data journalism is on the up is the recent uptake at teaching centres for the next-generation journalist. Here in the UK, City University has introduced an MA in Interactive Journalism which includes a module in data journalism. Across the pond, the US is again ahead of the game, with Columbia University offering a dual master’s in Computer Science and Journalism. Words from the journalism underground now include terms like Google Refine, Ruby and ScraperWiki. O’Reilly Radar has talked about data journalism.

The beauty of the social and semantic web is that I can learn from the journalists working with data, the miners carving out the pathways I intend to follow. They share what they do. Big-shot correspondents get a blog on the news site. Data journalists don’t, but they blog because they know that collaboration and information are the key to selling what it is they do (e.g. Anthony DeBarros, database editor at USA Today). They are still trying to sell damned good journalism to the media sector! Multimedia journalists for local news are getting it (e.g. David Higgerson, Trinity Mirror Regionals). Even grassroots community bloggers are at it (e.g. Joseph Stashko of Blog Preston). Looks like data journalism is working its way from the bottom up.

Back in Business:

Here are two interesting articles relating to the growing area of data and data journalism as a business. Please have a look: Data is the New Oil and News organizations must become hubs of trusted data in a market seeking (and valuing) trust.

The Cabinet Office, in a move towards greater transparency, is attempting to publish all its data online. This isn’t really news, but I don’t think news organizations are looking at this data, so I’m scraping it and seeing what it has to offer. As an exercise I’m scraping the page where ministerial gifts, hospitality, travel and meetings with external organisations are published as CSV or PDF. All this should be pretty much covered by Who’s Lobbying, but I’m hoping to set up a little social media experiment (more on that to come). So here is all the data, set to scrape the site every month. You can download it all.

I whacked it into Google Refine to deal with the different spellings, nuances and the change in the format of the date. The date transformation option never seems to work for me in Refine, so I exported the data and opened it up in Excel to get it out in chronological order. This may sound cumbersome to those who don’t work with data, but it’s actually quite quick and easy once you’ve tried it. Anyway, I looked at some of the more popular reasons for meeting ministers and grabbed a screenshot of the Excel table (Refine allows you to export an HTML table, but I’d have to get it to open in Firefox so I could use my full-page grab add-on).
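If Refine’s date transformation fails you too, a few lines of Python will usually cope with mixed formats. A sketch (the format list is a guess at the kinds of inconsistency in these files):

```python
from datetime import datetime

# Formats one might expect to meet in files like these (illustrative).
FORMATS = ["%d/%m/%Y", "%d-%m-%Y", "%d %B %Y", "%Y-%m-%d"]

def parse_date(raw):
    # Try each known format in turn; flag anything unrecognized
    # for hand-checking rather than guessing.
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
    return None

print(parse_date("3 May 2010"))
```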

I looked at the meetings for Big Society:

The major meeting with the Prime Minister and Deputy Prime Minister in May involved the Young Foundation, Community Links, Antigone, Big Society Network, Balsall Heath Forum, London Citizens, Participle, Talk About Local, CAN Breakthrough, the Mayor of Middlesbrough, Business in the Community, Esmee Fairbairn, Greener Leith, St Giles Trust, Big Issue Invest and Kids Company. Since then there has been a steady trickle of over 30 meetings with Nick Hurd, Oliver Letwin and Francis Maude about the Big Society. Note that these are all Conservative MPs, so the Big Society is already looking smaller along coalition party lines.

Sure, they have the titles to be involved, but the trend in the data seems to be more about big financing. Meetings with the likes of Goldman Sachs, Barclays, the British Banking Association and Co-op Financial Services lead one to believe that the Big Society is being outsourced to local communities but the big financing has to come from the top. In Building the Big Society, the Cabinet Office writes:

We will use funds from dormant bank accounts to establish a Big Society Bank, which will provide new finance for neighbourhood groups, charities, social enterprises and other nongovernmental bodies.

What are ‘funds from dormant bank accounts’, and why didn’t the banks use these instead of looking to the government to bail them out? The banks and their reckless trading in toxic assets and credit default swaps led to a massive recession. This shed light on reckless government borrowing and the massive deficit, which led to budget cuts to local services and the need for the Big Society. Which is now being funded by the banks! Am I missing something?

The next thing to look at from the data is the category ‘Introductory Meeting’:

Introductory meetings interest me, as I imagine it pays to be at the back of a politician’s mind. It must be worthwhile to have some ear time and get your points across. I’m sure not any old Joe can get an introductory meeting; there must be PR companies (lobby firms) that specialise in getting these meetings, so it’s interesting how many large companies appear on this list. In fact, the purpose of one meeting was put down as ‘Lobbying’, with the UK Public Affairs Council. They have a register of firms and clients published in evil PDF (go figure). Will have to scrape that.
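Scraping “evil PDF” is doable, for what it’s worth. A minimal sketch using the pdfminer.six library (my assumption for illustration; the register’s actual layout will dictate the real parsing):

```python
# pip install pdfminer.six
from pdfminer.high_level import extract_text

# Placeholder file name for the UK Public Affairs Council register.
text = extract_text("lobbying_register.pdf")

# From here it is ordinary text wrangling: split into lines and pick
# out firm/client entries by whatever pattern the layout uses.
for line in text.splitlines():
    if line.strip():
        print(line)
```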

Lastly, I thought the ‘Renegotiation of Contract’ category might be of interest so here it is:

A lot of these are big technology companies, yet the government is notorious for accumulating huge costs with little effectiveness when it comes to implementing new IT systems. I also wonder whether Vodafone’s tax dispute was known about during the negotiation of its contract with the Cabinet Office.

I’m getting the data out so that anyone with inside knowledge can put two and two together to further the information. I’m churning the data in so that what gets churned out is journalism, not churnalism. That’s the idea, anyway. Just looking at the data is a step in the right direction, so anyone interested in data journalism, keep looking at what’s coming out, and I’ll try to put it into a context that has journalistic value.

As part of my data journey, I’m learning to scrape, so I’m looking for small pieces of data in the usual forms to work on first. With that in mind, I decided to scrape a CSV file of UK ministerial gifts received in the Cabinet Office in 2009-10.

For all you novices out there, CSV is a basic spreadsheet format which you can open in Excel, so it’s fairly usable. That said, from a data journalism point of view this file was less than clean. All the departments were entered, and ones which didn’t receive anything had a ‘NIL RETURN’ entry under ‘Minister’. I have no use for that. And the department entry was left empty if the next gift fell under the same department; I had to fix that with code. The entry of dates is appalling. But my main issue is with the data collection: only gifts worth over £140 are registered, and I doubt some poor civil servant is calling up foreign dignitaries to ask how much that bottle of wine they gave the PM is worth, so most gifts are valued at ‘Over limit’.
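The fix-it-with-code part looks roughly like this: skip the ‘NIL RETURN’ rows and carry the last-seen department down into the blank cells. The column names are my guesses at the file’s headers:

```python
import csv

rows = []
current_department = None
with open("ministerial_gifts.csv") as f:
    for row in csv.DictReader(f):
        if row["Minister"].strip() == "NIL RETURN":
            continue  # department received nothing; no use for the row
        if row["Department"].strip():
            current_department = row["Department"].strip()
        else:
            row["Department"] = current_department  # fill blanks downwards
        rows.append(row)

print(len(rows), "gift entries after cleaning")
```

Regardless, as an exercise, here’s what’s of interest: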

The King of Saudi Arabia, Abdullah bin Abdul Aziz, gave Alastair Darling jewellery! He also gave Gordon and Sarah Brown an ornament and jewellery.

Nick Keller, founder of Beyond Sport, gave Tessa Jowell a travel alarm clock worth over £140. Beyond Sport ambassadors include Tony Blair, Michael Johnson and Dame Kelly Holmes.

Other gifts from non-dignitaries include: a bathrobe, slippers, a towel set and a bed linen set for Gordon Brown from Enrico Marinelli; a selection of CDs from EMI to Ben Bradshaw (some of which he purchased); jewellery and a scarf from Lola Rose to Sarah Brown, plus a hamper from Naomi Campbell; and a hamper for Gordon Brown from Sir Gulam Noon (a controversial Labour donor).

No. 10 must be full of rugs: 3 from Pakistan, 1 from Afghanistan and 1 from Azerbaijan.

Wine given by Nicolas Sarkozy and the President of Algeria, Abdelaziz Bouteflika, was used for official entertainment, whereas wine given by the President of Tunisia and the Sultan of Brunei was given to charity. Either they didn’t bring good enough wine, or Nicolas Sarkozy and Abdelaziz Bouteflika didn’t trust No. 10 to stock good enough wine.

I think I’m going to hack gifts for a bit, so stay tuned.