Webscraping with rvest: So Easy Even An MBA Can Do It!

This is the fourth installment in our series about web scraping with R. The series includes practical examples for the leading R web scraping packages, including RCurl and jsonlite (for JSON). This article focuses on the rvest package. We will target data using CSS selectors.

I read the email and my heart sank. As part of our latest project, my team was being asked to compile statistics for a large group of public companies.  A rather diverse set of statistics, from about four different sources. And to make life better, the list was “subject to change”. Translated: be ready to update this mess at a moment’s notice….

The good news: most of the requested data was publicly available, crammed into the nooks and corners of various financial sites.

This was a perfect use case for web scraping. An old school update (aka, the intern-o-matic model) would take about three or four hours. Even worse, it would be nearly impossible to quality check. A well written web scraper would be faster and easier to check afterwards.

After installing the rvest and jsonlite packages, I fired up Google and started looking for sources. The information we needed was available on several sites. After doing a little comparison and data validation, I settled on several preferred sources.

Important: Many websites have policies which restrict or prohibit web scraping; the same policies generally prohibit you from doing anything else useful with the data (such as compiling it).  If you intend to use the scraped data for public (publication) or commercial use, you should consult a lawyer to understand your legal risks. This code should be used for educational purposes only. In practice, personal scraping is difficult to detect and rarely pursued (particularly if there is a low volume of requests).

Back to our example. To reduce the risk of getting a snarky legal letter, we’re going to share a couple of examples that use the rvest package to grab information from Wikipedia. The same techniques can be used to pull data from other sites.

Nice Table. We’ll Take It!

In many cases, the data you want is neatly laid out on the page in a series of tables. Here’s a sample of rvest code where you target a specific page and pick the table you want (in order of appearance). This script collects every <table> element on the page and selects the first one.

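library(rvest)

# Download and parse the page
page <- read_html("https://en.wikipedia.org/wiki/List_of_largest_employers_in_the_United_States")

# Collect every <table> node, keep the first, and convert it to a data frame
employers <- page %>%
  html_nodes("table") %>%
  .[1] %>%
  html_table()

# html_table() returns a list here, so pull out its single element
employers <- employers[[1]]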

This will generate a nicely formatted table (a data frame) of the top employers in the US. This technique can be easily extended to grab data in almost any table on a web page. Basically, grab everything enclosed within a <table> tag and count through the tables until you find the one you want.
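To make the "count through the tables" step concrete, here is a minimal sketch that runs against a small inline page instead of a live site. The HTML, company names, and figures are made up for illustration; only the rvest calls are real.

```r
library(rvest)

# Hypothetical stand-in for a downloaded page; a real page would come from read_html(url)
html <- '
<html><body>
  <table>
    <tr><th>Rank</th><th>Site</th></tr>
    <tr><td>1</td><td>alpha.example</td></tr>
  </table>
  <table>
    <tr><th>Employer</th><th>Employees</th></tr>
    <tr><td>Acme Corp</td><td>1200</td></tr>
    <tr><td>Globex</td><td>850</td></tr>
  </table>
</body></html>'

page <- read_html(html)

# Grab every <table> node and see how many there are to count through
tables <- page %>% html_nodes("table")
length(tables)

# html_table() on a set of nodes returns a list of data frames;
# the table we want is the second one in order of appearance
employers <- html_table(tables)[[2]]
employers
```

Checking length(tables) first, then eyeballing a candidate or two, is usually quicker than guessing the index.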

Slicing and Dicing with CSS selectors

But wait, there’s more! It slices, it dices, it even finds julienne fries.

The ability to select pieces of a page using CSS selectors gives you the freedom to do some creative targeting. For example, you might want to grab the content of a specific text box on a page. For this second example, we’re going to target a non-table element of the page: the list of sources at the end of a Wikipedia article. On julienning, of course (the cutting technique used to make julienne fries).

A quick inspection of the page reveals the sources are organized as an ordered list (<ol> HTML element). This list has been assigned an HTML class of “references”. We are interested in the content of each of the list items (<li> HTML elements) that are children of this ordered list. This is reflected in the selector below.

page <- read_html("https://en.wikipedia.org/wiki/Julienning")

# Select every <li> inside an element with class "references" and extract its text
sources <- page %>%
  html_nodes(".references li") %>%
  html_text()

The output of this:

sources
[1] "^ a b Larousse Gastronomique. Hamlyn. 2000. p. 642. ISBN 0-600-60235-4. "
[2] "^ Viard, Alexandre (1820). Le Cuisinier Impérial (10th ed.). Paris. OCLC 504878002. "

The same technique can be used to select items by element ID, by tag type, or by position inside another element. In simple terms:

  1. Target by class => appears as <div class="target"></div> => you target this as: ".target"
  2. Target by element ID => appears as <div id="target"></div> => you target this as: "#target"
  3. Target by HTML tag type => appears as <table></table> => you target this as: "table"
  4. Target child of another tag => appears as <ol class="sources"><li></li></ol> => you target this as: ".sources li"
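The four selector patterns above can be tried side by side on a toy page. This is only a sketch: the HTML and its text content are invented for illustration, and the class and ID names simply mirror the list above.

```r
library(rvest)

# Hypothetical page containing one element for each selector pattern
html <- '
<html><body>
  <div class="target">by class</div>
  <div id="target">by id</div>
  <table><tr><td>by tag</td></tr></table>
  <ol class="sources">
    <li>first source</li>
    <li>second source</li>
  </ol>
</body></html>'

page <- read_html(html)

by_class <- page %>% html_nodes(".target") %>% html_text()     # class selector
by_id    <- page %>% html_nodes("#target") %>% html_text()     # ID selector
by_tag   <- page %>% html_nodes("table") %>% html_text()       # tag selector
children <- page %>% html_nodes(".sources li") %>% html_text() # <li> inside .sources
```

Note that ".sources li" matches every <li> nested anywhere inside the list, which is usually what you want for a flat list of references.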

This is just scratching the surface of what you can accomplish using CSS selector targeting. For a deeper view of the possibilities, take a look at some of the tutorials written by the jQuery community.

JSON: On a Silver Platter…

Many modern web design frameworks don’t include the data in the initial HTML document. The initial document serves as a template, and the data is retrieved via a series of follow-up JavaScript calls after the page is loaded. You’ll encounter this when you look at the document and realize the data you’re after isn’t anywhere in the HTML (which is usually 80% JavaScript). The trick with these sites is to look at the “network activity” from the page, for example in the Network tab of your browser’s developer tools. One of these calls is requesting and getting the data. You’ll see a neatly formed JSON (JavaScript Object Notation) object returned by that request. Once you find it, try to reverse engineer the request.

The good news is once you’ve figured out how the request is structured, the data is usually handed to you on a silver platter. The basic design of JSON is a dictionary structure. Data is labeled (usually very well), free of display cruft, and you can filter down to the parts you want. For a deeper look at how to work with JSON, check out our article on this topic.
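As a rough sketch of what "handed to you on a silver platter" looks like, the snippet below parses a JSON payload with jsonlite. The payload is a made-up stand-in for a response body; with a real site you would pass fromJSON() the request URL you found in the network activity instead of a string.

```r
library(jsonlite)

# Hypothetical response body; field names and values are invented for illustration
json <- '{
  "company": "Acme Corp",
  "ticker": "ACME",
  "financials": [
    {"year": 2019, "revenue": 120.5},
    {"year": 2020, "revenue": 133.1}
  ]
}'

payload <- fromJSON(json)

# Top-level fields behave like named list entries
payload$company

# An array of records is simplified into a data frame by default
financials <- payload$financials

# Filter down to the part we want: the most recent year
latest <- financials[financials$year == max(financials$year), ]
latest
```

Because the fields arrive labeled, there is no selector hunting at all; the work shifts to figuring out which request returns the payload.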

Other Benefits

While it is always nice to automate the boring stuff, there are a couple of other advantages to using web scraping over manual collection. Scripted processes make it easier to reproduce errors and fix them. You’re no longer at the whim of a (usually bored) human data collector (aka the intern-o-matic) grabbing the wrong fields or mis-coding a record. We have also found that large-scale database errors are detected faster with this approach. For example, in the corporate data collection project we mentioned earlier, we noticed that the websites we were scraping generally didn’t seem to collect accurate data on certain types of companies. While this would have eventually surfaced via a manual collection effort, the process-focused nature of scraping forced the issue to the surface quickly. And finally, since the scraping script shrank our refresh cycle from several hours to under a minute, we can refresh our results much more frequently.
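One way to get that quality-check benefit is to run a few sanity rules over the scraped results on every refresh. The sketch below uses only base R. The data frame and the rules are hypothetical and would need to match your own fields.

```r
# Hypothetical scraped result; company names and values are made up
scraped <- data.frame(
  company = c("Acme Corp", "Globex", "Initech"),
  employees = c(1200, NA, -5),
  stringsAsFactors = FALSE
)

# Flag rows that break basic rules: missing or non-positive headcount
bad_rows <- scraped[is.na(scraped$employees) | scraped$employees <= 0, ]

# Review the suspicious records instead of rechecking every row by hand
bad_rows$company
```

Because the same checks run on every refresh, a source that systematically misreports a certain type of company shows up as a recurring pattern rather than a one-off typo.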

This was the latest in our series on web scraping. Check out one of the earlier articles to learn more about scraping.
