Posts Tagged ‘code’

As part of my data journey, I’m learning to scrape. And so I’m looking for small pieces of data in the usual forms to work on first. With that in mind, I decided to scrape a CSV file of UK Ministerial Gifts received in Cabinet Office 2009-10.

For all you novices out there, CSV is a basic spreadsheet format which you can open in Excel, so it’s fairly usable. That being said, from a data journalism point of view it was less than clean. Every department was entered, and those which didn’t receive anything had a ‘NIL RETURN’ entry under ‘Minister’. I have no use for that. The department entry was also left empty whenever a gift fell under the same department as the row before it. I had to fix that with code. The entry of dates is appalling. But my main issue is with the data collection. Only gifts worth over £140 are registered. I doubt some poor civil servant is calling up foreign dignitaries to ask how much that bottle of wine they gave the PM is worth, so most gifts are valued at ‘Over limit’. Regardless, as an exercise, here’s what’s of interest:

The King of Saudi Arabia, Abdullah bin Abdul Aziz, gave Alistair Darling jewellery! He also gave Gordon and Sarah Brown an ornament and jewellery.

Nick Keller, founder of Beyond Sport, gave Tessa Jowell a travel alarm clock worth over £140. Beyond Sport ambassadors include Tony Blair, Michael Johnson and Dame Kelly Holmes.

Other gifts from non-dignitaries include: a bathrobe, slippers, a towel set and a bed linen set for Gordon Brown from Enrico Marinelli; a selection of CDs from EMI for Ben Bradshaw (some of which he purchased); jewellery and a scarf from Lola Rose for Sarah Brown, who also got a hamper from Naomi Campbell; and a hamper for Gordon Brown from Sir Gulam Noon (a controversial Labour donor).

No. 10 must be full of rugs: 3 from Pakistan, 1 from Afghanistan and 1 from Azerbaijan.

Wine given by Nicolas Sarkozy and the President of Algeria, Abdelaziz Bouteflika, was used for official entertainment, whereas the wine given by the President of Tunisia and the Sultan of Brunei was given to charity. Either the latter two didn’t bring good enough wine or Nicolas Sarkozy and Abdelaziz Bouteflika didn’t trust No. 10 to stock good enough wine.

I think I’m going to hack gifts for a bit so stay tuned.
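For anyone curious what “fixing that with code” might look like, here is a minimal sketch of the two clean-up steps described above: filling in a blank department from the row before it, and dropping the ‘NIL RETURN’ rows. It is not the exact code I ran. The column names (`Department`, `Minister`, `Gift`, `Value`) and the sample rows are made up for illustration, so treat them as assumptions about the file layout.

```python
import csv
import io

# Hypothetical sample that mimics the layout described in the post.
# Column names and rows are illustrative assumptions, not the real file.
SAMPLE = """Department,Minister,Gift,Value
Department A,NIL RETURN,,
Department B,Minister X,Jewellery,Over limit
,Minister X,Rug,Over limit
Department C,Minister Y,Travel alarm clock,Over limit
"""


def clean_gifts(csv_text):
    """Fill down blank departments and drop 'NIL RETURN' rows."""
    cleaned = []
    last_department = ""
    for row in csv.DictReader(io.StringIO(csv_text)):
        department = row["Department"].strip()
        if department:
            # Remember the most recent department that was actually written out.
            last_department = department
        # A blank cell means "same department as the row above".
        row["Department"] = last_department
        # Departments with nothing to declare carry no gift data, so skip them.
        if row["Minister"].strip().upper() == "NIL RETURN":
            continue
        cleaned.append(row)
    return cleaned


rows = clean_gifts(SAMPLE)
for row in rows:
    print(row["Department"], "|", row["Minister"], "|", row["Gift"])
```

The department is remembered before the ‘NIL RETURN’ check on purpose, so a blank cell always inherits from the row directly above it, whatever that row contained.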

Here is my first piece of Python code (that isn’t messing about in my command line). Well, it’s the first piece that does something with something on the web. That thing being the Complete Works of William Shakespeare. Seeing as his works now come in the HTML edition, all the words he ever writ are now public data. My task (because what’s a piece of code without a purpose?) was to find out which Shakespearean character has the largest vocabulary. I imagine it could have some academic use. So here’s the scraper on ScraperWiki. Hit the ‘Edit’ tab to see the code.

To get it in order of the character with the largest vocabulary, I just hit the ‘Interactive’ link on the top right of the scraper data and set the SQL query to order the results by the ‘Total Vocabulary’ in descending order. And here it is:

For those of you who don’t want to squint or look at code, here’s what I got:

  1. Gloucester from Richard III with a vocabulary size of 1,636 words
  2. Coriolanus from Coriolanus with a vocabulary size of 1,586 words
  3. Benedick from Much Ado About Nothing with a vocabulary size of 1,116 words
  4. Gloucester from King Lear with a vocabulary size of 957 words
  5. Gower from Pericles with a vocabulary size of 899 words
  6. Adriana from The Comedy of Errors with a vocabulary size of 806 words
  7. Oberon from A Midsummer Night’s Dream with a vocabulary size of 739 words
  8. Katharina from Taming of the Shrew with a vocabulary size of 706 words
  9. Gloucester from Henry VI with a vocabulary size of 699 words
  10. York from Henry VI with a vocabulary size of 623 words
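The ordering step described above can be sketched with Python’s built-in `sqlite3` module. This is only an illustration under a few assumptions: the table name `swdata` and the `Character` and `Play` columns are my guesses at a layout, while ‘Total Vocabulary’ is the column mentioned above. The handful of rows are taken from the results list.

```python
import sqlite3

# In-memory database standing in for the scraper's datastore.
# Table name and the first two column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE swdata ("Character" TEXT, "Play" TEXT, "Total Vocabulary" INTEGER)'
)

# A few rows from the results above, deliberately inserted out of order.
sample_rows = [
    ("Oberon", "A Midsummer Night's Dream", 739),
    ("Gloucester", "Richard III", 1636),
    ("Benedick", "Much Ado About Nothing", 1116),
    ("Coriolanus", "Coriolanus", 1586),
]
conn.executemany("INSERT INTO swdata VALUES (?, ?, ?)", sample_rows)

# Column names containing spaces need double quotes in SQLite.
ranked = conn.execute(
    'SELECT "Character", "Play", "Total Vocabulary" '
    'FROM swdata ORDER BY "Total Vocabulary" DESC'
).fetchall()

for position, (character, play, size) in enumerate(ranked, start=1):
    print(position, character, play, size)
```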

I have no formal programming training. I am a complete novice and have been reading up on compilers and other such weird and wonderful things for the first time. I don’t want to build programs, I just want to scrape data. This has little journalistic value but was a good exercise in learning to scrape. All developers can do it and they generally have scrapers sitting on their computers. But for someone with no programming experience, I’m getting all I need with ScraperWiki. It’s not easy, as the 210 revisions of the code prove, but here’s a short video to explain just what went on (in plain English, not Shakespeare!). Here’s the first part of the code explained, where I scrape the webpage and get everything each character has said put into a dict:
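As a rough sketch of that first step (not the actual ScraperWiki code), here is one way to collect everything each character says into a dict using only Python’s standard library. The markup it expects, with the speaker’s name inside `<b>` and the spoken lines inside `<blockquote>`, is an assumption about how the play pages are laid out, and the sample is a tiny hand-written snippet rather than a fetched page.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tiny hand-written sample. The tag layout is an assumption about the
# play pages; the anchor names are invented for illustration.
SAMPLE_HTML = """
<a name="speech1"><b>BENEDICK</b></a>
<blockquote><a name="1">What, my dear Lady Disdain!</a><br>
<a name="2">are you yet living?</a><br></blockquote>
<a name="speech2"><b>BEATRICE</b></a>
<blockquote><a name="3">Is it possible disdain should die?</a><br></blockquote>
<a name="speech3"><b>BENEDICK</b></a>
<blockquote><a name="4">Then is courtesy a turncoat.</a><br></blockquote>
"""


class SpeechParser(HTMLParser):
    """Collects spoken lines per character from speaker/blockquote markup."""

    def __init__(self):
        super().__init__()
        self.speeches = defaultdict(list)
        self.current_speaker = None
        self.in_speaker = False
        self.in_speech = False

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self.in_speaker = True
        elif tag == "blockquote":
            self.in_speech = True

    def handle_endtag(self, tag):
        if tag == "b":
            self.in_speaker = False
        elif tag == "blockquote":
            self.in_speech = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.in_speaker:
            # Normalise "BENEDICK" to "Benedick" so the keys read nicely.
            self.current_speaker = text.title()
        elif self.in_speech and self.current_speaker:
            self.speeches[self.current_speaker].append(text)


def collect_speeches(html):
    """Return {character name: [lines spoken]} for one page of HTML."""
    parser = SpeechParser()
    parser.feed(html)
    parser.close()
    return dict(parser.speeches)


speeches = collect_speeches(SAMPLE_HTML)
for name, lines in speeches.items():
    print(name, lines)
```

On a real page, anything else that sits inside a `<blockquote>` (stage directions, for instance) would be swept up too, so a proper scraper would need to filter those out.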

Here’s the second part showing how to calculate the total vocabulary of a character after you’ve scraped everything they’ve said:
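A minimal sketch of that second step, assuming you already have a dict mapping each character to the list of lines they speak. The tokenising rule here (lowercase everything, keep runs of letters and straight apostrophes) is my own simplification, not necessarily what the scraper does, and the `speeches` dict is a small hand-made sample.

```python
import re


def vocabulary(lines):
    """Return the set of distinct lowercase words across all lines."""
    words = set()
    for line in lines:
        # Lowercase first so "What" and "what" count as the same word.
        words.update(re.findall(r"[a-z']+", line.lower()))
    return words


# Small hand-made sample of {character: [lines spoken]}.
speeches = {
    "Benedick": [
        "What, my dear Lady Disdain!",
        "are you yet living?",
        "Then is courtesy a turncoat.",
    ],
    "Beatrice": ["Is it possible disdain should die?"],
}

# A set only keeps unique words, so its length is the vocabulary size.
vocab_sizes = {name: len(vocabulary(lines)) for name, lines in speeches.items()}

for name, size in sorted(vocab_sizes.items(), key=lambda item: item[1], reverse=True):
    print(name, size)
```

Using a set is the whole trick: however many times a character repeats a word, it is only counted once.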