Ode to a Piece of Code

Posted: March 28, 2011 in My Data Journey

Here is my first piece of Python code (that isn’t just messing about in my command line). Well, it’s the first piece that does something with something on the web: the Complete Works of William Shakespeare. Seeing as his works now come in an HTML edition, all the words he ever writ are now public data. My task (because what’s a piece of code without a purpose?) was to find out which Shakespearean character has the largest vocabulary. I imagine it could have some academic use. So here’s the scraper on ScraperWiki. Hit the ‘Edit’ tab to see the code.

To rank the characters by vocabulary size, I just hit the ‘Interactive’ link at the top right of the scraper data and set the SQL query to order the results by ‘Total Vocabulary’ in descending order.
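For anyone curious what that ordering step looks like outside the ScraperWiki interface, here is a minimal sketch using Python’s built-in sqlite3 module. The table name `swdata` and the column names are my assumptions for illustration, and the three rows are simply copied from the results in this post, so treat it as a rough picture of the query rather than the exact one I ran.

```python
import sqlite3

# In-memory database standing in for the scraper's data store.
# Table name "swdata" and the column names are assumptions for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE swdata ("Character" TEXT, "Play" TEXT, "Total Vocabulary" INTEGER)'
)

# Three rows taken from the results in this post, deliberately out of order.
rows = [
    ("Benedick", "Much Ado About Nothing", 1116),
    ("Gloucester", "Richard III", 1636),
    ("Coriolanus", "Coriolanus", 1586),
]
conn.executemany("INSERT INTO swdata VALUES (?, ?, ?)", rows)

# ORDER BY ... DESC puts the largest vocabulary first.
ranked = conn.execute(
    'SELECT "Character", "Play", "Total Vocabulary" '
    'FROM swdata ORDER BY "Total Vocabulary" DESC'
).fetchall()

for position, (character, play, total) in enumerate(ranked, start=1):
    print(f"{position}. {character} from {play}: {total} words")

conn.close()
```

The double quotes around `"Total Vocabulary"` are there because the column name contains a space.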

For those of you who don’t want to squint or look at code, here’s what I got:

  1. Gloucester from Richard III with a vocabulary size of 1,636 words
  2. Coriolanus from Coriolanus with a vocabulary size of 1,586 words
  3. Benedick from Much Ado About Nothing with a vocabulary size of 1,116 words
  4. Gloucester from King Lear with a vocabulary size of 957 words
  5. Gower from Pericles with a vocabulary size of 899 words
  6. Adriana from The Comedy of Errors with a vocabulary size of 806 words
  7. Oberon from A Midsummer Night’s Dream with a vocabulary size of 739 words
  8. Katharina from The Taming of the Shrew with a vocabulary size of 706 words
  9. Gloucester from Henry VI with a vocabulary size of 699 words
  10. York from Henry VI with a vocabulary size of 623 words

I have no formal programming training. I am a complete novice and have been reading up on compilers and other such weird and wonderful things for the first time. I don’t want to build programmes; I just want to scrape data. This has little journalistic value but was a good exercise in learning to scrape. All developers can do it, and they generally have scrapers sitting on their computers. But as someone with no programming experience, I’m getting all I need with ScraperWiki. It’s not easy, as the 210 revisions of the code prove, but here’s a short video to explain just what went on (in plain English, not Shakespeare!).

Here’s the first part of the code explained, where I scrape the webpage and put everything each character has said into a dict:
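To give a flavour of that first step in text form, here is a rough sketch of collecting each character’s lines into a dict. It is not the scraper itself: it uses only Python’s standard html.parser, and it assumes (purely for illustration) that speaker names sit in `<b>` tags and their speeches in `<blockquote>` tags. The sample HTML is a few lines typed in by hand, not the real page, and `SpeechCollector` is a made-up name.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Hand-typed sample; the tag layout is an assumption, not the real page.
SAMPLE_HTML = """
<b>BENEDICK</b>
<blockquote>What, my dear Lady Disdain! are you yet living?</blockquote>
<b>BEATRICE</b>
<blockquote>Is it possible disdain should die?</blockquote>
<b>BENEDICK</b>
<blockquote>Then is courtesy a turncoat.</blockquote>
"""


class SpeechCollector(HTMLParser):
    """Collect speech text per speaker (hypothetical helper for this sketch)."""

    def __init__(self):
        super().__init__()
        self.speeches = defaultdict(list)  # speaker name -> list of speech chunks
        self.current_speaker = None
        self.in_speaker = False
        self.in_speech = False

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self.in_speaker = True
        elif tag == "blockquote":
            self.in_speech = True

    def handle_endtag(self, tag):
        if tag == "b":
            self.in_speaker = False
        elif tag == "blockquote":
            self.in_speech = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # skip whitespace between tags
        if self.in_speaker:
            # Normalise "BENEDICK" to "Benedick" so repeat speakers share a key.
            self.current_speaker = text.title()
        elif self.in_speech and self.current_speaker:
            self.speeches[self.current_speaker].append(text)


collector = SpeechCollector()
collector.feed(SAMPLE_HTML)
collector.close()

# Join each speaker's chunks into one string of everything they said.
speeches = {name: " ".join(chunks) for name, chunks in collector.speeches.items()}
print(speeches)
```

The key idea is that every time a speaker name turns up, whatever follows gets appended to that speaker’s entry, so a character who speaks fifty times still ends up with one entry in the dict.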

Here’s the second part, showing how to calculate the total vocabulary of a character after you’ve scraped everything they’ve said:
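In text form, the vocabulary step boils down to something like the sketch below: lowercase everything a character has said, split it into words, and count the distinct ones with a set. The `vocabulary` function, the regular expression and the sample dict are all my own illustration here, and details such as how to treat apostrophes or hyphens are assumptions that would change the totals.

```python
import re


def vocabulary(text):
    """Return the set of distinct lowercase words in text.

    Words are runs of letters, optionally with internal apostrophes
    (so "what's" counts as one word). This rule is an assumption.
    """
    return set(re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower()))


# Hand-typed sample standing in for the dict of everything each character said.
speeches = {
    "Benedick": "What, my dear Lady Disdain! are you yet living? "
                "Then is courtesy a turncoat.",
    "Beatrice": "Is it possible disdain should die? "
                "Courtesy itself must convert to disdain.",
}

# A set drops repeats, so its length is the number of distinct words.
vocab_sizes = {name: len(vocabulary(text)) for name, text in speeches.items()}

# Largest vocabulary first.
ranking = sorted(vocab_sizes.items(), key=lambda item: item[1], reverse=True)

for name, size in ranking:
    print(f"{name}: {size} distinct words")
```

Beatrice says “disdain” twice in the sample but it only counts once, which is the difference between total vocabulary and total words spoken.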

