R and ODBC – Pulling Data From A SQL Database

In our previous installment on this tutorial, we demonstrated how to load data into R from existing sources like an Excel spreadsheet by saving it as a Comma Separated Value file and using R’s utility functions to load and check the data. The same approach also works fairly well when you’re trying to get a file of data from your IT department; most business intelligence and querying tools can dump out their results as a CSV and these files can be easily emailed between departments. I used this approach for years before seeking better methods.

However, there are certain types of projects where the “download and save to CSV trick” gets either tedious or impractical. For example, when you need to run the reporting package every week. Or when you wind up having to make a lot of little changes in a database query (adding fields, tweaking selection criteria, etc.). Or, in my case, when your sense of perfectionism decides that a 5,000 line “summary file” just isn’t enough and there is simply no good alternative to grabbing the whole 1,000,000 line sales extract for the previous six months. Naturally, we didn’t just grab 3 or 4 fields, we grabbed about twenty so we could FULLY examine the situation. The reporting system that resulted from this effort, while interesting, required me to spend two hours downloading and loading data each week. Not a good thing.

So what if we could… perhaps… connect R directly to the corporate database in question and pull the data directly? It turns out this sort of process is indeed “a thing” and, depending on how liberal your corporate system administrators are, can significantly simplify your work. It’s also fairly simple to implement using R – for this next example, we will share how you can use two protocols called ODBC and SQL to pull data directly R from a larger corporate system. ODBC is a common protocol that allows computer software to talk to many common databases (Oracle, SQL Server, etc). SQL stands for Structured Query Language and is the primary language used to extract information from relational databases; it’s fairly simple (good intro here, including a code playground) and a “good thing” for analysts to know.

In this example, we will use a R library called RODBC to connect to a database and run a simple query. The results of this query will be pulled back into R and available for us to use as a standard data frame.

Incidentally, your local installation of R likely doesn’t already have the RODBC package installed; so we will need to download and install it. This is quite simple.

install.packages('RODBC')

The system will take it from there (it may prompt you to pick a download site from a list); congratulations, you’ve likely just installed your first R package. These little libraries can handle a wide range of tasks and are one of the best things about the R community. No matter what you’re trying to do, someone else has probably already attempted a similar problem and published a helpful library for the technique.

Once installed, we will load the library into the R environment and connect it to a database.

library(RODBC)
myconn <- odbcConnect('my corporate datasource')

Boom, we’re connected. And now for a little query….

customers <- sqlQuery(myconn, "select cust_id as cust, 
count(sales) as num_sales,
sum(sales) as tot_sales
from invoices 
group by cust_id")
summary(customers)

That particularly epic example of an SQL query just aggregated sales data from my invoice table by customer, returning the result as a neat little table of customers and sale amounts which we can use for future analysis. For example, we could examine the distribution of sales by customer size to understand if our business was driven by a handful of accounts or broadly distributed across a base of small customers. SQL is an incredibly flexible language – you can join different tables together, summarize results, and filter down to the items you want. This approach also has the advantage of allowing you to run heavy processing on a remote machine (usually more powerful that your computer) and only bring back the summarized results to your own machine.