As you move into advanced R programming, you will want to be able to initialize an empty data frame for use in more complicated procedures. Fortunately, R offers several ways to create an empty data frame depending on your situation and needs.
We’re going to look at four common cases:
- Creating a data frame from scratch in code
- Creating a data frame from the headers of a CSV file
- Creating a data frame from an existing data frame
- Automatic extraction and formatting of data from a SQL query
Use Empty Vectors to Create DataFrame in R
While there are more efficient ways to approach this, for readers solely concerned with coding time and complexity, there is a lot of value in the traditional programming approach to initializing a data object. This is generally done as a slightly ponderous list at the top of the program which enumerates the fields, their data type, and in many cases, adds a comment or tow about what they should contain. The real value of this will be apparent several months later, when you dust off the code to recycle it for another project and try to remember what you did in the first place.
The approach below has the advantage of being both easy to implement and intuitive for other readers of your code to understand (industry clients and lab partners). We will create a series of empty vectors and use them to create a data frame. This has the advantage of tightly controlling the type of data elements we’re putting into the data frame.
# create empty dataframe in r with column names df <- data.frame(Doubles=double(), Ints=integer(), Factors=factor(), Logicals=logical(), Characters=character(), stringsAsFactors=FALSE)
Initializing an Empty Data Frame From Fake CSV
Next up – initializing an empty data frame from scratch, while naming columns and defining data types. We’re going to create a hypothetical list of transactions, including: date, what was sold, who bought it, and the sale price. Each of which has a different data type, of course.
# create empty dataframe in r with column names df <- read.csv(text="Date,customer,prodid,sale", colClasses = c("Date", "character", "integer","numeric"))
This approach uses a couple of clever shortcuts. First, you can initialize the columns of a dataframe through the read.csv function. The function assumes the first row of the file is the headers; in this case, we’re replacing the actual file with a comma delimited string. We provide the process with class descriptions via a vector that we initialize inline. You can even change column names at a later date if you identify you want to drop a particular field from the data frame.
Create Empty Data Frame From Existing Data Frame
Suppose you have an existing data frame, with a lovely naming convention which you have grown very attached to. (We leave it as an exercise for the reader to determine why they are so attached to their data frames. Perhaps it was initialized for you by a friend? Or maybe you dislike creating new data frames). In any event, the proper solution is to merely remove all the rows, as shown below:
# create empty dataframe in r with column names mere_husk_of_my_data_frame <- originaldataframe[FALSE,]
In the blink of an eye, the rows of your data frame will disappear, leaving the neatly structured column heading ready for this next adventure. Flip commentary aside, this is actually very useful when dealing with large and complex datasets. Cloning a properly formatted (and vetted) data frame and emptying the clone is a great way to reduce the frustration associated with processing data updates and similar files.
Creating Data Frames from SQL Query Results or WebScraping
So having created our empty data frame, we can potentially fill it by querying an SQL database. This is a common practice in industry, particularly commercial analytics, where scripting your extracts from the corporate transaction databases is a great way to speed up your process.
Regarding database access, a really clever type could have a little fun with the header record that most databases will provide you. The header describes the field names and the data types of the query results. You can use that to automatically configure column names and data types. I had an Oracle => Python function which automatically performed this for any query results, scanning the results of whatever came back from Oracle and automatically converting the fields and their content into a relevant data type. It was a tremendous time saver for a system that I hit several times per day as a pricing analyst.
A similar approach can be used when working with web scraping results. This permits you to set up the base data frame and invest your time in developing code to unpack and QA the contents of what your web scraping queries return to you.
The details of creating a data frame from these sources are left as an exercise for the student. The details will vary by data source. Every major database has its quirks in terms of how it handles data formats. Along the same lines, results returned from a web scraping query can vary dramatically by site. The basic approach is the split the project into three parts:
- An introspective pass against the underlying data to figure out the general shape and type of what is being returned from the query process
- Automated checks against each row to remove outliers and errors
- Code to convert each record into a standard format and insert into the data frame
The Value of being Explicit
While most of our readers are focused on data analysis, we’re going to do a brief detour into software engineering for a moment.
Most data analysis coders design their code for speed and flexibility. Need to add a new column? Drop it in. Got a messy data structure? No Issue. Client gave you a weird table with odd data type configurations and weird array variable names? Straight from a json file? Just write some throwaway code to get around it.
The problem with that? If you are working in industry and your data analysis spots something significant, you’re going to have to run it again. You may even generate a few versions of it (each with its own different way).
Or even hand it to someone else. For example, your boss or your intern.
Write your code with that in mind. That’s where defining the data structure and data type to explicitly identify how the data needs to look. Name the data frame columns and row names for humans. Avoid splitting information across multiple columns or multiple rows using a complex system. If you have an existing column with a reasonable name, don’t adjust it. A good piece of code should have easy to understand row names, column names, data type definitions, and clear procedures for dealing with duplicate rows or missing value (nan value) situations.
Then take that code and test it. Take a data set and convert it into a dataframe using the code. Test it against your existing dataframe. The resulting dataframe and dataframe column definitions should match.