Web scraping is a powerful capability for data science and analytics. We covered the technical aspects of getting started elsewhere on this blog (see: readLines and RCurl, importing web-based CSV files, reading JSON / API data). This article is intended to help you find a project that motivates you to dig deeper in this space.
These project ideas are intended to be open-ended. We will point you towards an interesting data source, accessible via web scraping; use the web scraping tutorials to pull back an initial data set and build things up from there. If you find something interesting that the rest of the community may like, drop us a note in the comments section below.
Basic Text Mining:
To start with something basic, download some large text files from Project Gutenberg. For example, take a look at the text file for the Complete Works of Shakespeare. This can be downloaded using readLines() and parsed. A common next step is to count word frequencies to identify the most common terms in the document.
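A minimal sketch of that word-frequency step might look like the following. The Gutenberg URL shown in the comment is an assumption (check the site for the current file location); to keep the example self-contained, we run the counting logic on a small inline sample rather than the full download.

```r
# In practice, fetch the full text first, e.g. (URL is an assumption):
#   lines <- readLines("https://www.gutenberg.org/files/100/100-0.txt")
# Here we use a small sample so the logic is easy to follow.
lines <- c("To be, or not to be, that is the question:",
           "Whether 'tis nobler in the mind to suffer")

# Lowercase, then split on anything that is not a letter or apostrophe.
words <- unlist(strsplit(tolower(lines), "[^a-z']+"))
words <- words[nzchar(words)]          # drop empty tokens

# Count and rank terms by frequency.
freq <- sort(table(words), decreasing = TRUE)
head(freq)
```

On the full Shakespeare file, the top of this table will be dominated by stop words ("the", "and", "to"), so filtering against a stop-word list is a natural refinement.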
A second round of analysis might look at sets of words which a particular author likes to use together. These could be bi-grams (pairs of sequential words), tri-grams (three words), or quads (four words). The distribution of tri-grams or quads could be computed for two different books and compared, to identify distinctive differences in the language used by their authors (or, in certain situations, to determine that two authors have suspiciously similar writing styles).
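Building bi-grams from a token vector is a one-liner in base R: pair each token with its successor. A sketch, using a toy token vector in place of a real tokenized book:

```r
# 'tokens' stands in for the output of the tokenization step above.
tokens <- c("to", "be", "or", "not", "to", "be")

# Pair every token with the one that follows it.
bigrams <- paste(head(tokens, -1), tail(tokens, -1))
freq <- sort(table(bigrams), decreasing = TRUE)
```

The same pattern extends to tri-grams and quads by pasting three or four shifted copies of the vector; comparing the resulting frequency tables across two books gives the author-similarity signal described above.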
If you’re looking for something a little more difficult, try writing a crawler that takes advantage of Project Gutenberg’s subject and author search capabilities. This will give you a list of books to traverse as you build up a corpus for analysis.
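The skeleton of such a crawler is a queue of pages to visit plus a set of pages already seen. In the sketch below, fetch_links is a hypothetical stand-in for the function you would write to download a search-results page and extract book links (for example with RCurl and regular expressions); here it is simulated with an in-memory link graph so the control flow is clear.

```r
# Simulated site: each "page" maps to the links found on it.
# (In a real crawler, fetch_links would download and parse the page.)
link_graph <- list(
  "search?author=shakespeare" = c("book/100", "book/1041"),
  "book/100"  = character(0),
  "book/1041" = character(0)
)
fetch_links <- function(url) link_graph[[url]]

# Breadth-first crawl with a visited set and a page limit.
crawl <- function(start, max_pages = 50) {
  queue <- start
  seen  <- character(0)
  while (length(queue) > 0 && length(seen) < max_pages) {
    url   <- queue[1]
    queue <- queue[-1]
    if (url %in% seen) next
    seen  <- c(seen, url)
    queue <- c(queue, setdiff(fetch_links(url), seen))
  }
  seen
}

pages <- crawl("search?author=shakespeare")
```

A real implementation should also throttle its requests and respect the site's robots.txt; Project Gutenberg in particular publishes guidelines for automated access.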
Next Up – what can we learn from analyzing forums and discussion lists?