Scraper Ergo Sum – Good Web Scraping Projects for R

Social News Site Voting:

An evolution of the forum site, social news sites such as Reddit and Hacker News give the analyst a richer set of feedback about what the audience thinks of a particular topic. These sites are centered around a list of “top stories” for a particular subject, ranked by the number of “votes” each article receives. Other indicators of audience sentiment include cross-postings (to related interest groups), comments, and votes on comments. Each of these can be analyzed in turn for factors such as sentiment, spam, and predictability.

Good options for analysis (that are relatively friendly to analysts) include Hacker News and Reddit. Please take a careful look at each site’s current terms and conditions and, if appropriate, use any APIs the site provides. (First rule of web scraping: don’t be a jerk.)

From a predictive analytics perspective, there are some interesting things to look at:

  • Which topics tend to attract the greatest number of up-votes or user comments? Can we predict which articles will win? Any pointers on what words or topics to mention in your next blog post so it will rock the front page of Reddit?
  • Are there any patterns in link submission (across sites, or by a single user) that suggest spam or abuse? What rules can be developed to block this type of activity?
  • Look at some threads where there was a debate between two (or more) parties. Can we spot any trends in voting or language choice?
  • Look at comment activity across threads – how often do the same users interact with each other? How directly?
  • Can you build a profile of user activity? What are the common patterns, in terms of days between first and last activity and average number of events per day? How do sharing a link, the success or failure of that link, and commenting change this pattern of activity?
  • Another really cool idea – but a ton of work – can we look at multiple related social news sites and track how quickly a story moves across them? You may notice that the same topic trends on both Hacker News and the startup / technical subreddits.
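
To make the user activity profile question a little more concrete, here is a minimal sketch in base R. It assumes you have already collected events into a data frame with three hypothetical columns (`user`, `type`, `timestamp`). The sample rows are made up for illustration, and the helper names `profile_user` and `build_profiles` are my own, not part of any package. Treat it as a starting point rather than a finished analysis.

```r
# Hypothetical sample of scraped events. The column names and the rows
# are illustrative assumptions, not real data from any site.
events <- data.frame(
  user = c("alice", "alice", "alice", "bob", "bob", "carol"),
  type = c("submission", "comment", "comment",
           "comment", "submission", "comment"),
  timestamp = as.POSIXct(
    c("2024-01-01 09:00:00", "2024-01-03 12:00:00", "2024-01-11 09:00:00",
      "2024-01-02 08:00:00", "2024-01-02 20:00:00", "2024-01-05 10:00:00"),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

# Summarise the events of a single user into a one-row data frame.
profile_user <- function(df) {
  # Days between the user's first and last recorded activity.
  span_days <- as.numeric(
    difftime(max(df$timestamp), min(df$timestamp), units = "days")
  )
  # Treat any span shorter than one day as one day, so that users seen
  # only briefly do not end up with an inflated (or infinite) rate.
  active_days <- max(span_days, 1)
  data.frame(
    user = df$user[1],
    n_events = nrow(df),
    n_submissions = sum(df$type == "submission"),
    n_comments = sum(df$type == "comment"),
    span_days = span_days,
    events_per_day = nrow(df) / active_days,
    stringsAsFactors = FALSE
  )
}

# Split the events by user, profile each group, and stack the results.
build_profiles <- function(events) {
  per_user <- split(events, events$user)
  profiles <- do.call(rbind, lapply(per_user, profile_user))
  rownames(profiles) <- NULL
  profiles
}

profiles <- build_profiles(events)
print(profiles)
```

The one-day floor on the active span is a judgment call; depending on how you want to treat one-off visitors, you might drop them instead or measure activity in hours.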

Next Up: Open Government Data!