Posts Tagged ‘python’

Here’s a little experiment in using data for news:

Just to let you know that the Twitter account @Scrape_No10 which tweets out ministers’, special advisers’ and permanent secretaries’ meetings, gifts and hospitalities is back up and tweeting. You can read the post about its creation here and download all the data the account contains. This account needs more coding maintenance than the @OJCstatements account (read about it here) because the data is contained in CSV files posted onto a webpage. I code sentences to be tweeted from the rows and columns. The scraper feeding the twitter account feeds off 5 separate scrapers of the CSV files. Because of this, the account is more likely to throw up errors than the simple scraping of the Office for Judicial Complaints site.
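To give a flavour of what "coding sentences from the rows and columns" can look like, here is a minimal sketch. It is not the actual @Scrape_No10 scraper: the column names, the placeholder row and the `row_to_tweet` helper are all made up for illustration, and the 140-character cap is simply Twitter's limit at the time.

```python
import csv
import io

# Placeholder data: the real CSV files have their own columns and values.
SAMPLE_CSV = """Minister,Date,Organisation,Purpose
Minister A,May 2011,Organisation B,Introductory meeting
"""


def row_to_tweet(row, limit=140):
    """Build one sentence from a CSV row and trim it to fit in a tweet."""
    sentence = "{Minister} met {Organisation} in {Date}: {Purpose}".format(**row)
    if len(sentence) <= limit:
        return sentence
    return sentence[: limit - 3] + "..."


# DictReader uses the header line as keys, so each row arrives as a dict.
tweets = [row_to_tweet(row) for row in csv.DictReader(io.StringIO(SAMPLE_CSV))]
for tweet in tweets:
    print(tweet)
```

Each of the separate CSV files would need its own sentence template, which is one reason a change in a column heading is enough to stop the account.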

So I decided, as I’m learning to code and structure scrapers, to run the scrapers manually every time the twitter account stops, fix the bugs and set the account tweeting again. There will be better ways to structure the scrapers but right now I’m concentrating on the coding.

Learning to scrape CSVs is very handy as lots of government data are released as CSV. There is a CSV documentation/tutorial on ScraperWiki, although it is aimed at programmers. For those interested in learning to code/scrape I would recommend “Learn Python the Hard Way” (which is the easiest for beginners, it’s just ‘hard’ for programmers because it involves typing code!). For more front end work I have recently discovered Codecademy. I can’t vouch for it but it looks interesting enough. I have also put all the datasets for the @Scrape_No10 account on BuzzData as an experiment.

This is Cuddles. He’s a Russian Dwarf hamster. We have 2 males. Cuddles is the beta male. He was fat and lazy which made him very docile. You’d like Cuddles. He’s a good little hamster. He’s been made docile by the fact that he’s bullied by the alpha male.

We initially called the alpha male ‘Dimples’ but he turned out to be really evil. So we renamed him Morbo. This is after the news presenter from Futurama: Morbo the Annihilator. Because Morbo incessantly bullies Cuddles, Cuddles has broken out in boils and is underweight. Whenever we give him food, Morbo steals it from him. Morbo often pushes him over and presses him until he squeaks. In fact, we’re going to separate them even though Russian Dwarf hamsters are supposed to be social creatures.

To mark this point of separation I have written a piece of code to explain why it is necessary for poor Cuddles to be taken away from his brother. It’s also an exercise for me to learn about classes in Python but nevertheless it is very poignant. Here it is:

It’s on ScraperWiki so you can play around with the parameters. It takes an initial hamster weight of 300g and assumes he loses 2g every time he’s chased by Morbo and gains 5g every time he eats. Morbo bullies him up to 10 times a day. It shows you how long he has to live. You can change these parameters to see how that changes his life expectancy.
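For anyone who can't see the original, here is a rough sketch of the same idea using a class. It is a reconstruction from the description above, not the code on ScraperWiki. The 300g, 2g, 5g and 10-a-day figures come from the post; the one meal a day, the 200g danger weight and the `days_to_live` function are my assumptions to make the example run.

```python
import random


class Hamster:
    """A hamster whose weight changes as he is chased and fed."""

    def __init__(self, name, weight_g=300):
        self.name = name
        self.weight_g = weight_g

    def chased(self, loss_g=2):
        self.weight_g -= loss_g

    def eats(self, gain_g=5):
        self.weight_g += gain_g


def days_to_live(hamster, max_chases_per_day=10, meals_per_day=1,
                 fatal_weight_g=200, max_days=3650, rng=None):
    """Count the days until the hamster drops to the (assumed) fatal weight.

    Returns None if he is still above it after max_days.
    """
    if rng is None:
        rng = random.Random()
    for day in range(1, max_days + 1):
        # Morbo chases him a random number of times, up to the daily maximum.
        for _ in range(rng.randint(0, max_chases_per_day)):
            hamster.chased()
        for _ in range(meals_per_day):
            hamster.eats()
        if hamster.weight_g <= fatal_weight_g:
            return day
    return None


cuddles = Hamster("Cuddles")
print(cuddles.name, "days left:", days_to_live(cuddles, rng=random.Random(1)))
```

Changing `max_chases_per_day` or `meals_per_day` plays the same game as changing the parameters on ScraperWiki: fewer chases or more meals and he lasts longer.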

Here’s a part of the output from my command line:

I’m a journalist learning to code and this is my story telling through the medium of Python.

This will explain a part of the code:

Here is my first piece of Python code (that isn’t messing about in my command line). Well, it’s the first piece that does something with something on the web. That thing being the Complete Works of William Shakespeare. Seeing as his works now come in the html edition, all the words he ever writ are now public data. My task – because what’s a piece of code without a purpose? – was to find out which Shakespearean character has the largest vocabulary. I imagine it could have some academic use. So here’s the scraper on ScraperWiki. Hit the ‘Edit’ tab to see the above code.

To get it in order of the character with the largest vocabulary, I just hit the ‘Interactive’ link on the top right of the scraper data and set the SQL query to order the results by the ‘Total Vocabulary’ in descending order. And here it is:

For those of you who don’t want to squint or look at code here’s what I got:

  1. Gloucester from Richard III with a vocabulary size of 1,636 words
  2. Coriolanus from Coriolanus with a vocabulary size of 1,586 words
  3. Benedick from Much Ado About Nothing with a vocabulary size of 1,116 words
  4. Gloucester from King Lear with a vocabulary size of 957 words
  5. Gower from Pericles with a vocabulary size of 899 words
  6. Adriana from The Comedy of Errors with a vocabulary size of 806 words
  7. Oberon from A Midsummer Night’s Dream with a vocabulary size of 739 words
  8. Katharina from Taming of the Shrew with a vocabulary size of 706 words
  9. Gloucester from Henry VI with a vocabulary size of 699 words
  10. York from Henry VI with a vocabulary size of 623 words
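
The descending-order query can be tried out locally with Python's built-in sqlite3 module. This is only a sketch: the table name `swdata` and the ‘Character’ and ‘Play’ column names are assumptions, and the three rows are taken from the list above rather than from the live datastore.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed table and column names; only 'Total Vocabulary' is named in the post.
conn.execute(
    'CREATE TABLE swdata ("Character" TEXT, "Play" TEXT, "Total Vocabulary" INTEGER)'
)
conn.executemany(
    "INSERT INTO swdata VALUES (?, ?, ?)",
    [
        ("Benedick", "Much Ado About Nothing", 1116),
        ("Gloucester", "Richard III", 1636),
        ("Coriolanus", "Coriolanus", 1586),
    ],
)

# Column names containing spaces need double quotes in SQLite.
ranked = conn.execute(
    'SELECT "Character", "Play", "Total Vocabulary" '
    'FROM swdata ORDER BY "Total Vocabulary" DESC'
).fetchall()

for position, (character, play, size) in enumerate(ranked, start=1):
    print(position, character, play, size)
```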

I have no formal programming training. I am a complete novice and have been reading up on compilers and other such weird and wonderful things for the first time. I don’t want to build programmes, I just want to scrape data. This has little journalistic value but was a good exercise in learning to scrape. All developers can do it and they generally have scrapers sitting on their computer. But for someone with no programming experience, I’m getting all I need with ScraperWiki. It’s not easy, as the 210 revisions of the code prove, but here’s a short video to explain just what went on (in plain English not Shakespeare!). Here’s the first part of the code explained where I scrape the webpage and get everything each character has said put into a dict:
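In text form, the idea of that first part looks something like the sketch below. It is not the scraper from the video. It assumes a simplified page where each speaker's name sits in a `<b>` tag and each speech in a `<blockquote>`, and it parses a small inline sample (the opening lines of Hamlet) rather than fetching the real site. `SpeechParser` is a name I made up.

```python
from collections import defaultdict
from html.parser import HTMLParser


class SpeechParser(HTMLParser):
    """Collect everything each character says into a dict of lists."""

    def __init__(self):
        super().__init__()
        self.speeches = defaultdict(list)
        self._in_speaker = False
        self._in_speech = False
        self._speaker = None

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self._in_speaker = True
        elif tag == "blockquote":
            self._in_speech = True

    def handle_endtag(self, tag):
        if tag == "b":
            self._in_speaker = False
        elif tag == "blockquote":
            self._in_speech = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_speaker:
            # Remember who is talking until the next name appears.
            self._speaker = text.title()
        elif self._in_speech and self._speaker:
            self.speeches[self._speaker].append(text)


# Assumed, simplified markup; the real pages are messier than this.
SAMPLE_HTML = """
<b>BERNARDO</b><blockquote>Who's there?</blockquote>
<b>FRANCISCO</b><blockquote>Nay, answer me: stand, and unfold yourself.</blockquote>
<b>BERNARDO</b><blockquote>Long live the king!</blockquote>
"""

parser = SpeechParser()
parser.feed(SAMPLE_HTML)
speeches = dict(parser.speeches)
print(speeches)
```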

Here’s the second part showing how to calculate the total vocabulary of a character after you’ve scraped everything they’ve said:
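As a rough text version of that second step (again a sketch, not the original code): lower-case everything a character has said, pull out the words, and count the distinct ones with a set. The `vocabulary` function and the one-line sample dict are just for illustration, and what counts as a "word" here is my own simple rule.

```python
import re

# Letters, optionally joined by apostrophes, so "who's" stays one word.
WORD = re.compile(r"[a-z]+(?:'[a-z]+)*")


def vocabulary(lines):
    """Return the set of distinct words used across a list of lines."""
    words = set()
    for line in lines:
        words.update(WORD.findall(line.lower()))
    return words


# Stand-in for the dict built by the scraping step.
speeches = {"Hamlet": ["To be, or not to be: that is the question:"]}

total_vocabulary = {name: len(vocabulary(lines)) for name, lines in speeches.items()}
print(total_vocabulary)
```

A set only keeps one copy of each word, so the repeated "to" and "be" are counted once. That is the whole trick behind the vocabulary total.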

The web has two faces. One it shows to programmers, the other to us mere mortals. As a source, the face it shows to programmers is so much more revealing. In this way, I’m trying first to understand the language of the web before I can scrape it.

Now scraping is a lot harder than actually building a web page so I need to get my head around the nuts and bolts. Coding looks scary and without having studied it at university it’s hard to know where to start. So here’s what I’ve found:

W3schools is a brilliant online resource that even advanced programmers refer to. It covers HTML, XHTML, CSS, JavaScript, jQuery, SQL, and PHP. Basically all you need to know to build the web face mortals see. To scrape a site you need to know how it was constructed. You can pick apart a site’s recipe in Chrome or by using Firebug. So going through the tutorials on this site will help you understand what’s cooking. You don’t need to make anything yourself although if you want to build a view in ScraperWiki the HTML is useful for creating tables and the SQL for using the SQLite view.

I’m aware that plebs don’t use command consoles and I don’t plan on needing to. I worry about making my computer angry. I like to use stuff on the web so that I don’t break my shit. So here’s a console and tutorial for writing Python.

This is really helpful and the console can be used to play with Python without having to install it. You can use it to try out this Python tutorial and then build up your skills by using the Python documentation. A more learner friendly version is Dive Into Python. I find this very programmey but by playing around on ScraperWiki I will hopefully be able to get through all I need to become a killer data journalist. Whatever happens, I’ll let you know.

I have gone dark on the data journalism front and I apologize. It’s just that work that actually pays me has picked up! But I have not dropped the baton and indeed, my latest findings have reignited my flame for all things data driven.

So please do join me on my data journey. I am attempting to learn Python. But before I (or anyone for that matter) can begin wailing and gnashing my teeth over text errors, I have to educate myself on the very basics of coding. A hacker friend has recommended some light reading material in the form of a tutorial from City University of New York. It is for those who have never seen code in their lives!

So once this has been ingested into my neural network I can begin the Python tutorials on ScraperWiki. More on ScraperWiki soon (and scraping in general).

To keep up to date with all things #data #datajournalism follow me @DataMinerUK.