Our tutorial starts with the basic task of loading information from a text file into the R environment. For those of you who are new to a corporate environment, Excel spreadsheets are the backbone of most Finance and Operations departments, providing business users with a convenient way to collect, edit, and present information. The contents of an Excel spreadsheet can be easily exported into a file format known as CSV (Comma Separated Values) which can then be loaded into R.
To make this tutorial more interesting, imagine that we have just been hired as analysts on the marketing team for a small manufacturer. Our job is to take information about the company’s products and customers (from management and the company’s information systems) and distill it into useful reports and charts for business managers.
In this particular case, we have been handed a file of information about the company’s sales over the past week. It is in the form of an Excel spreadsheet downloaded from the accounts receivable system, containing the following fields:
- invoice number
- customer id
- product id
- delivery date
- sale amount
The general manager of the business would like us to summarize this data to understand which customers and products are most active. Naturally, we would like to use R for this analysis and thus we need to load the file into R. We use Excel’s “save as” command to save the file in Comma Separated Value format (CSV) and get started.
Since the first line of the file contains field names for each column, this is easily accomplished with the following single line of code:
mydataset <- read.csv("invoices.csv")
Incidentally, in the event the accounting system had not included a header row, we could have used the following command.
mydataset <- read.csv("filename.csv", header=FALSE)
It is good form to inspect your data after you load it from a file into a new system; while the process we are using is considered reliable, conversion and formatting errors can occur and will cause problems for you during the analysis process. The R language includes several built in utilities for inspecting your data. The most basic of these is summary().
If we type:
The summary function provides you with some default statistics for each column. If the data is a numerical variable, it will describe the results in the form of a statistical distribution – mean, median, min, max, and 1st / 3rd quartiles (25% and 75%). If the data is a date or string (company name, etc,) it will show several common values and provide a total count.
Inspecting the data, we actually DO notice a potential issue here – the sale amount, which should be processed as a numerical variable was instead loaded as a string due to saving the Excel currency format on the CSV. This will cause problems later. So we go back to the original Excel file, change the format of that column to “general” (use the Format Cells feature on the Toolbars) and resave it. We reload and see:
The sale is now expressed as a statistical distribution which is closer to what we expected. Mission Accomplished.
In the event we wanted to inspect a couple of individual lines of data, we could use the head command:
head (mydata, 10) which yields the following result...
Now that we’ve got our data loaded into R, lets move onto the next part of our assignment, which involves pulling some additional data from the accounting database and summarizing this into a report for management.