Scraper Ergo Sum – Good Web Scraping Projects for R

Open Government Data

There is a wealth of public data available on the web, much of which is highly suitable for data science projects. Here are a few sources that caught my eye as potentially interesting analysis projects. You should also check out our tutorial on using SPARQL to access open government data sets on data.gov.

  • City of Chicago Public Employee Data: As part of its transparency effort, the City of Chicago has published a large body of data about compensation and employee behavior. You could probably glean some interesting insights from this. Some of the relevant files:

Current Employee Names, Salaries, and Position Titles

Lobbyist Contribution Data

  • US Treasury News Feed – Treasury’s official blog, featuring blog posts from Treasury’s senior officials and staff sharing news, announcements and information about the work done at the Treasury Department.

Treasury Blog Posts

  • Lost, Found, and Adoptable Pets: Data feed from King County, Washington (Seattle) about lost, found, and adoptable pets. Seems like something you could pull a couple of times and compare versions to spot trends in how these cases are resolved.

King County Lost, Found, Adoptable Pets

  • Sunlight Foundation: Collection of open APIs and datasets focused on making the government and politicians more accountable and transparent.

Sunlight Foundation APIs

  • College Scorecard: Large Department of Education dataset that tracks how well students and alumni of each school perform after they graduate.

College Scorecard – Detailed Data

  • Bureau of Labor Statistics – Occupational Information: Survey data on wages and employment for different occupations. The entire BLS site is well worth exploring if this type of analysis interests you.

Occupational Statistics

  • Financial Services Consumer Complaints: Large dataset of consumer complaints, available in multiple formats. Ample opportunity to mine this for all sorts of insights. Warning: This one will test the size limits of your computer.

Consumer Complaint Data

  • Healthcare – Timely & Effective Care:

Provider Level Dataset

  • California State Government Open Data Portal: One of many states that are making various data assets public. Their collection includes purchasing information, transportation, and environmental data.
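
The "pull it a couple of times and compare versions" idea from the King County pets feed can be sketched in a few lines of base R. This is only a rough illustration: the two data frames stand in for snapshots you would download on different days, and the column names (animal_id, record_type) and the compare_snapshots() helper are my own assumptions, not the feed's actual schema or an existing function.

```r
# Hypothetical snapshots of the pets feed taken on two different days.
# Column names and values are made up for illustration only.
snapshot_day1 <- data.frame(
  animal_id   = c("A100", "A101", "A102", "A103"),
  record_type = c("LOST", "FOUND", "ADOPTABLE", "LOST"),
  stringsAsFactors = FALSE
)
snapshot_day2 <- data.frame(
  animal_id   = c("A100", "A102", "A104"),
  record_type = c("LOST", "ADOPTABLE", "FOUND"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split the difference between two snapshots into
# records that dropped out of the feed and records that are new.
compare_snapshots <- function(old, new, key = "animal_id") {
  list(
    dropped = old[!(old[[key]] %in% new[[key]]), , drop = FALSE],
    added   = new[!(new[[key]] %in% old[[key]]), , drop = FALSE]
  )
}

changes <- compare_snapshots(snapshot_day1, snapshot_day2)

# Count the dropped records by type to see which kinds of cases leave the feed.
dropped_by_type <- table(changes$dropped$record_type)
print(dropped_by_type)
```

A record dropping out of the feed does not tell you why it left, so treat "dropped" as a starting point for analysis rather than proof that a case was resolved. With real data you would replace the two data frames with reads of the saved downloads and use whatever identifier column the feed actually provides.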

Next Up: Scraping Commercial Sites…