How To Use Sys.sleep To Slow Down Web Scraping in R (and Avoid Getting Busted)

The R programming language is best known for its strengths in data science. R ships with a vast number of functions and data types tailored to statistics, machine learning, and data science. For example, you don’t need any special third-party libraries to create and manipulate a large data frame in R.

This data-focused design by no means implies that R can’t be leveraged for other tasks. For example, it’s a remarkably friendly platform for web scraping. R even makes it easy to implement some scraping safeguards so that you won’t get busted when grabbing online data. You’ll soon see just how easy it is to use R for web scraping, and you’ll also discover how the Sys.sleep function can protect you from accidentally tripping any automated defenses.

What Is Scraping?

Before we dive into any R code, it’s important to define exactly what we’re trying to accomplish. Web scraping encompasses a vast array of different subjects. But at the simplest level, you can think of it as doing exactly what the name suggests: we use R to load up a webpage and scrape it for content to use within our own code.

This is most commonly used in the context of data science to populate a data frame or database with statistical information. For example, imagine that NASA released information about the geological composition of a mined sample on Mars. However, that information was released on a website rather than as a CSV or other usable file format. If we wanted to use that data in R, we’d need to either copy it into a database by hand or automate the process in code. If we’re only using one or two values, it’s easy enough to do by hand. But imagine if we were looking at thousands of values scattered over multiple linked pages. That’s a rough situation for humans, but it’s a problem almost tailor-made for computerized automation.

We do have one problem to consider when designing a scraping implementation. Many online attacks work by repeatedly requesting resources, leveraging hundreds of thousands, or even millions, of requests at the same time. A scraping solution, by contrast, typically only needs a handful of requests, and even large-scale scraping will usually stay in the triple digits. Unfortunately, automated security on a site will often mistake scraping for an attack and block the offending IP address.

Thankfully there’s an easy way around that problem: we just reduce the frequency with which we make requests to the server. We essentially try to emulate a normal human being’s usage patterns when requesting data through an automated system. You might imagine that this would require some complex looping logic, but R gives us plenty of tools for working with loops. That still leaves us with the problem of actually working with a website’s content, which we’ll turn to next.
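
Before we get there, here’s a rough sketch of the pacing idea on its own, in plain R. The URLs below are placeholders, and the five-second pause is just a reasonable starting point rather than a magic number.

someURLs <- c("https://example.com/page1", "https://example.com/page2")

for (currentURL in someURLs) {
  # Pause before each request so our traffic looks like a person browsing
  Sys.sleep(5)
  # readLines simply pulls down the raw text of the page
  pageLines <- readLines(currentURL)
}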

Starting Out With Some Basic Scraping

Thankfully the solution to our problem can be found in an R library called rvest. The rvest library provides a wide range of functions related to web scraping. Rvest is similar in many ways to the httr package. The main difference is that httr is a general-purpose tool for working with HTTP itself: the requests, responses, and other networking details of the transfer protocol. Rvest focuses on web content rather than on protocol specifics. With that in mind, let’s dive into some code using rvest.
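
To make that distinction a little more concrete, here’s roughly what the same page request looks like through httr. This is only a sketch for comparison, assuming you have httr installed; we won’t use it in the rest of the article.

library(httr)

# httr exposes the protocol-level details of the request
ourResponse <- GET("https://www.google.com")
status_code(ourResponse)           # e.g. 200 if the request succeeded
content(ourResponse, as = "text")  # the raw HTML as one long string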

Remember that you’ll need to install the rvest package, since it isn’t part of base R. Thankfully, it’s easy to install from most IDEs, such as RStudio, or you can simply install it with install.packages("rvest"). With that in mind, let’s take a look at a simple example of web scraping with rvest. You might be surprised by just how easy rvest makes the scraping process!

library(rvest)
ourHTML <- read_html("https://www.google.com")

We just need one line of code, plus the initial rvest import, to set up our scraping procedure. From here on we’ll use Google as our target URL. It’s always best to test something like this against a server large enough that a little extra traffic won’t be a bother. Plus, Google is a heavy user of automated defenses, so we can be reasonably confident that our precautions are working as expected.
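
If you want to confirm that the page actually loaded, you can poke at the parsed document right away. This is just a quick sanity check, not part of the scraping script itself.

# read_html returns an xml_document that we can query with CSS selectors
class(ourHTML)
html_node(ourHTML, "title") %>% html_text()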

Full Scraping While Using Sys.sleep

Our initial example didn’t do anything except load up our URL. But we can easily expand on this to implement true scraping. For example, imagine that you wanted to read through the source of the Google homepage to find the first three image files within it. And you then needed to download those files without Google thinking that you were a malicious attacker. We could easily implement that with the following script.

library(rvest)

# Load and parse the Google homepage
ourHTML <- read_html("https://www.google.com")

# Grab the src attribute of the first three img tags on the page
ourTargetURLs <- html_nodes(ourHTML, "img") %>%
  html_attr("src") %>%
  head(3)

for (ourURL in ourTargetURLs) {
  # Relative paths lack a domain, so prepend one before downloading
  if (!startsWith(ourURL, "http")) {
    ourURL <- paste0("https://www.google.com", ourURL)
  }
  # Wait five seconds so we don't look like an automated attack
  Sys.sleep(5)
  download.file(ourURL, destfile = basename(ourURL))
}

We begin with our previous code, which loads up the totality of whatever is currently at google.com. Because Google changes the content of its homepage on a regular basis, you might get different results when running this script at different times. With that in mind, take a look at the html_nodes call. This function takes a CSS selector and makes it easy to traverse the page’s HTML.

In this example we’re not doing anything particularly complex with html_nodes; we’re just looking for img tags. Next, we pass the raw HTML content through some pipes to grab what we need. This entails looking for the src attribute in the img tags. We then use the head function on the results to select the first three elements in our vector of image locations. However, at this point we typically face a path problem. A site might hardcode a full URL, domain and all, or it might point to a relative location that the browser silently resolves against the current domain. A browser handles this type of path resolution transparently, but we need to manually fix any URL that contains a path without a domain name. Our for loop prepends the requisite domain name to any of our results that don’t start with http.
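
As an aside, the xml2 package that rvest is built on also provides url_absolute, which performs this same resolution for us. The path below is just a placeholder to show the behavior.

library(xml2)

# Resolve a relative path against a base URL, just like a browser would
url_absolute("/images/example.png", "https://www.google.com")
# [1] "https://www.google.com/images/example.png"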

Finally, it’s time to actually act on our extracted data. After each loop iteration checks and fixes the URL, we call Sys.sleep with an argument of 5. This means we wait for five seconds before each download, which keeps us from getting flagged by any automated security system. Once five seconds have passed, we download the resource pointed to by our scraped URL. In total we make three passes through the loop, and with each pass we wait five seconds thanks to the sleep function.
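
One optional refinement, not required by the script above, is to randomize the pause so the requests look even less mechanical. Swapping this line into the loop would do it:

# Sleep for a random duration between four and eight seconds
Sys.sleep(runif(1, min = 4, max = 8))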
