Ever need to quickly read a CSV file from the web?
CSV files are the most basic option for moving data between systems. They are supported by every major database and spreadsheet application, they can be generated from almost any programming language (including R), and they can be edited in any text editor. Fortunately, there's an easy trick using the
read.csv() function that imports data from the web directly into a data frame.
Simply take the URL and feed it into read.csv():
data <- read.csv("http://apps.fs.fed.us/fiadb-downloads/CSV/LICHEN_SPECIES_SUMMARY.csv")
This only works if the file is being served via plain http. Unfortunately, that is becoming increasingly rare: as major companies push for more security on the Internet, a growing share of websites serve data over https instead. Fortunately, there is a simple tweak we can make to the read.csv() one-liner using the getURL() function from the RCurl library that solves this.
Our next example involves a list of lost pets in Seattle, Washington. We're going to adapt our one-liner to use getURL() to handle the file transfer and then read the result with read.csv():
library(RCurl)

download <- getURL("https://data.kingcounty.gov/api/views/yaai-7frk/rows.csv?accessType=DOWNLOAD")
data <- read.csv(text = download)
Simple but effective. This is a good option for web developers building basic reporting and data-extraction systems.
R is also capable of handling more complicated data requests. The next step up from processing CSV files involves using readLines() together with the RCurl and XML libraries to handle more complicated import operations. This gives you some capacity to parse and reshape the contents of the web pages you are scraping. We also have an article covering JSON-based web scraping options.
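To give a flavor of that next step, here is a minimal sketch of parsing HTML with the XML library. The HTML snippet and its contents are made up for illustration; in practice the page text would come from getURL() or readLines() as shown above.

```r
library(XML)

# Hypothetical HTML snippet standing in for a downloaded page
page <- "<html><body><table>
  <tr><td>Fluffy</td><td>Cat</td></tr>
  <tr><td>Rex</td><td>Dog</td></tr>
</table></body></html>"

# Parse the raw text into a document tree
doc <- htmlParse(page, asText = TRUE)

# Extract the text of every table cell with an XPath query
cells <- xpathSApply(doc, "//td", xmlValue)
cells
```

From here you can reshape the extracted values into a data frame, which is where most scraping projects end up.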
And finally, if you’re still looking for a project – here are some web scraping project ideas.