Scraper Ergo Sum – Good Web Scraping Projects for R

Forum / Discussion List Scraping:

There is a wide variety of forums on the web, covering almost any topic you can think of. They are often a good place to start extracting HTML text “in the wild”.

Most forums and mailing list servers are fairly well structured, and many run on older scripts that are largely free of plugins and cruft.

Posts follow a predictable structure: they are usually threaded around a specific topic and frequently organized into sub-sections. A typical post consists of a title (often inherited from its thread) and a primary message (usually text), followed by a (repetitive) forum signature block beneath the message. Most posts also carry a good deal of metadata, such as poster name, topic tags, and IP address information, which can be extracted and analyzed. Many boards also allow posters to maintain their own profile pages, which can be an additional source of information.
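As a rough illustration of pulling those pieces out of a page, here is a minimal sketch using the rvest package. The HTML snippet and its class names (`post`, `title`, `author`, `message`, `signature`) are invented for the example; every forum uses its own markup, so the selectors would need to be adapted after inspecting the real page source.

```r
library(rvest)

# Hypothetical forum markup, built inline so the example is self-contained.
# Real boards will use different tags and class names.
page <- minimal_html('
  <div class="post">
    <h3 class="title">Re: Best R packages for scraping</h3>
    <span class="author">alice</span>
    <div class="message">I mostly use rvest for this.</div>
    <div class="signature">alice, R user since 2009</div>
  </div>
  <div class="post">
    <h3 class="title">Re: Best R packages for scraping</h3>
    <span class="author">bob</span>
    <div class="message">XML and RCurl still work well too.</div>
    <div class="signature">Sent from my phone</div>
  </div>')

# One node per post
posts <- html_elements(page, "div.post")

# html_element() (singular) yields one result per post, or NA when a
# field is missing, so the columns stay aligned row by row.
forum_posts <- data.frame(
  title     = html_text2(html_element(posts, ".title")),
  author    = html_text2(html_element(posts, ".author")),
  message   = html_text2(html_element(posts, ".message")),
  signature = html_text2(html_element(posts, ".signature")),
  stringsAsFactors = FALSE
)

print(forum_posts)
```

Keeping the signature in its own column makes it easy to drop later, so the repeated text does not skew word counts or other text analysis on the messages.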

This data can be analyzed for several interesting sets of insights:

  • Develop a profile of the typical poster: how often do they post? How broadly do they participate in the forum? How diverse is their interest in topics? How long do they remain active on the forum?
  • Forums are frequently a target of spam and commercial posting. What signals can we use to detect this?
  • What topics tend to get the most engaged response from forum members? Can we draw any inferences about how the way a topic is presented or initially discussed relates to the degree of attention it receives? Are there any trigger phrases or keywords that tend to prolong debate?
  • Given the large body of text, sentiment analysis is always possible and could allow you to see how trends move through the group.
  • For forums discussing broader issues, such as politics, identify pairs of related (same party) or complementary (opposite party) forums. Compare discussion on the two, either over long periods of time or around a specific event (to compare their perspectives).
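To make the first idea in the list concrete, the poster-profile questions can be sketched in base R once the posts sit in a data frame. The data below is made up, and the column names (`author`, `thread`, `posted`) are assumptions about how the scraped results might be stored, not a fixed format.

```r
# Made-up sample of scraped posts; one row per post.
posts <- data.frame(
  author = c("alice", "bob", "alice", "carol", "alice", "bob"),
  thread = c("t1", "t1", "t2", "t2", "t3", "t1"),
  posted = as.Date(c("2012-01-03", "2012-01-03", "2012-01-10",
                     "2012-01-11", "2012-03-01", "2012-01-04")),
  stringsAsFactors = FALSE
)

# Split the posts by author and summarise each poster:
#   n_posts     - how often they post
#   n_threads   - how broadly they participate
#   active_days - days between their first and last post
by_author <- split(posts, posts$author)
profiles <- do.call(rbind, lapply(by_author, function(p) {
  data.frame(
    author      = p$author[1],
    n_posts     = nrow(p),
    n_threads   = length(unique(p$thread)),
    active_days = as.numeric(difftime(max(p$posted), min(p$posted),
                                      units = "days")),
    stringsAsFactors = FALSE
  )
}))
rownames(profiles) <- NULL

# Most active posters first
profiles <- profiles[order(-profiles$n_posts), ]
print(profiles)
```

A poster with many posts but only one thread and a very short active period may also be worth a second look when hunting for spam, though that is only a starting heuristic and not a tested rule.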

We’re ready to face the public – Analyzing Social News Site Voting?