Scraping Commercial Sites
The sites presented so far generally have a big “welcome, visitors” sign out front when it comes to web scraping. Commercial sites can also be scraped, although doing so moves you into a legal grey area. Many terms of service for publications and commercial websites officially forbid scraping their content via automated systems. In reality, plenty of people scrape these sites continually (including Google, which crawls them to build its search rankings). Look at this on a case-by-case basis and use appropriate judgement.
Where possible, check whether an API already exists. For many popular datasets, R packages have been created that wrap the API in a convenient set of helper functions to simplify access to the data.
A few particularly useful R packages for APIs:
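- The quantmod package is a well-regarded source for economic and financial data. It can be used to retrieve macroeconomic data from the Federal Reserve Bank of St. Louis (the FRED database) and stock quotes from Yahoo Finance. Older versions could also pull quotes from Google Finance, which has since been discontinued as a data source.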
- The blscrapeR package provides access to the Bureau of Labor Statistics API. This is good for gathering data on wages, employment, and inflation activity.
- Got a website? You may find the googleAnalyticsR package interesting – it provides a wrapper for accessing Google Analytics data.
- The pewdata package provides access to the Pew Research Center survey datasets on American attitudes.
- Got a Fitbit? This one looked interesting: fitbitScraper.
- We also have an article with an example that uses the WDI package to generate plots of economic activity.
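To give a feel for how little code these wrappers need, here is a minimal sketch that pulls one FRED series with quantmod. It assumes quantmod is installed and that you have network access. The helper name fetch_fred_series() is made up for illustration and is not part of the package, and "UNRATE" (the FRED ID for the US unemployment rate) is just one example series.

```r
# Sketch only: assumes quantmod is installed, e.g. install.packages("quantmod").
# fetch_fred_series() is a hypothetical helper name, not part of quantmod.
fetch_fred_series <- function(series_id) {
  # Fail early on bad input, before any network request is made
  stopifnot(is.character(series_id), length(series_id) == 1)

  # auto.assign = FALSE returns the data as an object instead of
  # creating a variable in the calling environment
  quantmod::getSymbols(series_id, src = "FRED", auto.assign = FALSE)
}

# The download needs network access, so only run it in an interactive session
if (interactive()) {
  # "UNRATE" is the FRED series ID for the US unemployment rate
  unrate <- fetch_fred_series("UNRATE")
  print(tail(unrate))  # most recent observations
  plot(unrate, main = "US unemployment rate (FRED series UNRATE)")
}
```

The other packages in the list follow a similar pattern: load the package, call a retrieval function with a series or dataset identifier, and work with the result as a regular R object. Some of them require an API key or account credentials, so check each package's documentation first.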
Hopefully one of these ideas sparked your interest and you’re well on your way to creating a new project. If you get stuck, be sure to check our reference materials: