R is often held up as one of the most valuable tools in data science. Part of this popularity comes from the fact that it supports a wide range of mathematical operations. But the popularity of R is also due to the level of flexibility you have when working with advanced data structure implementations. For example, many programming languages provide users with some form of data collection. But R goes above and beyond.
Programming languages might have different names for these collections, but they provide similar functionality. The collections are essentially a bundle of other data types. A language might even provide functionality that lets collections hold other collections. By doing so a programmer can easily create multi-dimensional structures. But what sets R apart from most other languages is just how flexible these structures are. For example, you can easily take an R data frame and analyze categorical variables within it using the table function. And you’ll soon find out just how easy this is to do in R and how it can help you see your data structure in a whole new light.
An Introduction to R’s Table Function
At this point, you might be wondering exactly what R’s table function actually does. Technically, table takes one or more factors (or objects that can be treated as factors), cross-classifies them, and counts how many observations fall into each combination of factor levels. However, that description risks missing the forest for the trees. The function does that one specific thing, but its applications are extremely wide-reaching.
For example, we can create a frequency table by showing the counts or the proportions of the values within a supplied variable. Or we can use the same functionality to create a contingency table by analyzing the relationship between two or more categorical variables. While we’re concentrating on the latter usage scenario, it’s important to keep in mind how much the table function in R can do.
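To make the distinction concrete, here is a minimal sketch of both kinds of table. The `color` and `size` vectors are invented purely for illustration and are not part of the examples that follow.

```r
# Hypothetical sample data, invented for illustration only.
color <- c("red", "blue", "red", "green", "blue", "red")
size  <- c("S", "L", "S", "S", "L", "L")

# One-way frequency table: counts of each distinct value.
colorCounts <- table(color)
print(colorCounts)

# The same frequency table expressed as proportions of the total.
colorProportions <- prop.table(colorCounts)
print(colorProportions)

# Two-way contingency table: counts for each color/size pair.
colorBySize <- table(color, size)
print(colorBySize)
```

The one-way tables describe a single variable, while the two-way table counts how often each pair of values occurs together.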
Basic Table Functionality
You’d be justified in thinking that the table function will either be quite complex, need a tedious amount of preparation, or both. But if that’s running through your mind then you should prepare for a shock. Because the function really can be boiled down to a single, quite succinct, line of code. Take a look at the following example to see the table function in action.
# Build a data frame with three binary (Positive/Negative) columns.
df <- data.frame(
  Result = c("Negative", "Positive", "Positive", "Negative", "Positive", "Positive", "Negative", "Negative"),
  First = c("Positive", "Positive", "Positive", "Negative", "Negative", "Negative", "Positive", "Negative"),
  Second = c("Positive", "Negative", "Positive", "Positive", "Positive", "Negative", "Negative", "Negative")
)
# Cross-classify the three columns and count each combination.
OurContingencyTable <- table(df$Result, df$First, df$Second)
print(OurContingencyTable)
We begin by creating a data frame with three columns: the final result, a first data point, and a second data point. Each column holds a simple binary value of positive or negative, which gives eight possible combinations across the three columns. However, the human eye isn’t able to see the simplicity of that fact very clearly when we’re presented with raw information. And that’s where the table function comes in.
We can create a contingency table by simply passing the three columns of our data frame to table and assigning the result to the OurContingencyTable variable. When we print the result, the information is laid out concisely: because there are three variables, R prints one two-way table of Result against First for each level of Second. Each cell shows how often that combination of values appears in the data frame. We can also expand on this idea by focusing on how often individual values show up in the original information. This is known as a frequency table. It’s the flip side of contingency tables, which look at combinations of categorical information.
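As a side note on the three-way table above, its printed output carries no variable names, because the columns are passed as bare vectors. The sketch below shows one possible way to label the dimensions with table's `dnn` argument and to flatten the result with `ftable`. The variable name `labelledTable` is made up for this illustration.

```r
# Same data as the example above, rebuilt so this snippet runs on its own.
df <- data.frame(
  Result = c("Negative", "Positive", "Positive", "Negative", "Positive", "Positive", "Negative", "Negative"),
  First = c("Positive", "Positive", "Positive", "Negative", "Negative", "Negative", "Positive", "Negative"),
  Second = c("Positive", "Negative", "Positive", "Positive", "Positive", "Negative", "Negative", "Negative")
)

# dnn supplies a name for each dimension, so the printed output is labelled.
labelledTable <- table(df$Result, df$First, df$Second,
                       dnn = c("Result", "First", "Second"))
print(labelledTable)

# ftable() flattens the three-way table into a single two-dimensional layout.
print(ftable(labelledTable))

# Individual cells can be looked up by level name, in dimension order.
print(labelledTable["Positive", "Negative", "Positive"])
```

With this particular data, each of the eight possible combinations occurs exactly once, so every cell holds a count of 1.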
Consider a situation where we’re testing a game to see if it might give left-handed players an unfair disadvantage. We could collect information and test the results with the following code.
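# Handedness and score for ten test players.
df <- data.frame(
  Handedness = c("Right", "Left", "Right", "Left", "Right", "Left", "Right", "Left", "Right", "Left"),
  Score = c(12, 5, 10, 5, 20, 7, 15, 5, 20, 7)
)
# Count how many players of each handedness achieved each score.
ourFrequencyTable <- table(df$Handedness, df$Score)
print(ourFrequencyTable)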
In terms of how we use table, the only real difference is that the second argument is a numeric score rather than a category, and table treats each distinct score as its own level. When we print out the result we’re given a well-organized display with one row per handedness and one column per score. It’s hard to see whether left-handed players are at a disadvantage when looking at the raw information. But once it’s organized we can clearly see that the left-handed players account for all of the 5s and 7s and disappear entirely at scores of 10 and above.
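The same pattern can also be checked in code rather than by eye. The following is a rough sketch, assuming the same handedness data as above. The names `scoreByHand` and `scoresMissingLeft` are hypothetical and chosen only for this illustration.

```r
# Same data as the example above, rebuilt so this snippet runs on its own.
df <- data.frame(
  Handedness = c("Right", "Left", "Right", "Left", "Right", "Left", "Right", "Left", "Right", "Left"),
  Score = c(12, 5, 10, 5, 20, 7, 15, 5, 20, 7)
)
scoreByHand <- table(df$Handedness, df$Score, dnn = c("Handedness", "Score"))

# addmargins() appends row and column totals, labelled "Sum".
print(addmargins(scoreByHand))

# Scores that no left-handed player achieved (a zero in the "Left" row).
scoresMissingLeft <- colnames(scoreByHand)[scoreByHand["Left", ] == 0]
print(scoresMissingLeft)
```

Note that the column names of the table are character strings, so the scores come back as text rather than numbers.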
Moving On to More Advanced Use of Table
We’ve looked at some individual usage scenarios at this point. And now we can combine those elements to demonstrate some of the function’s versatility. Let’s imagine that we’ve modified the input system for the game in our previous example. We want to evaluate our new results to see whether handedness affects whether someone beats our game. We can set that situation up with the following code.
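# Handedness plus the outcome of two final rounds for six players.
df <- data.frame(
  Handedness = c("Right", "Left", "Right", "Left", "Right", "Left"),
  Final1 = c("Won", "Lost", "Won", "Lost", "Lost", "Won"),
  Final2 = c("Won", "Won", "Lost", "Lost", "Won", "Won")
)
# One-way frequency tables, one per column.
ourHandednessTable <- table(df$Handedness)
ourFinal1Table <- table(df$Final1)
ourFinal2Table <- table(df$Final2)
print(ourHandednessTable)
print(ourFinal1Table)
print(ourFinal2Table)
# Contingency tables: two two-way variations and one three-way variation.
ourContingencyTable <- table(df$Handedness, df$Final1)
ourContingencyTable2 <- table(df$Handedness, df$Final2)
ourContingencyTable3 <- table(df$Handedness, df$Final1, df$Final2)
print(ourContingencyTable)
print(ourContingencyTable2)
print(ourContingencyTable3)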
We begin by laying out our information as df. Next, we pass every column in the data frame to table on its own and print out each result as a simple frequency table. This gives us a better look at what we’re working with, though it’s not a requirement of the contingency table creation.
With our information printed out we can proceed to move on to the contingency table. We’ll try three different variations on this theme to demonstrate how the function works. The first attempt just uses Handedness and Final1. The second uses Handedness and Final2. And the third table uses Handedness, Final1, and Final2. Note too that production code typically wouldn’t include the print statements, frequency tables, or multiple variations on contingency. They’re included for explanatory purposes in order to demonstrate how the data is fit into the different presentations.
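The three variations are closely related: summing the three-way table over its third dimension gives back the two-way table of Handedness and Final1. Here is a small sketch of that relationship using `margin.table`. The names `threeWay`, `collapsed`, and `twoWay` are hypothetical and used only in this illustration.

```r
# Same data as the example above, rebuilt so this snippet runs on its own.
df <- data.frame(
  Handedness = c("Right", "Left", "Right", "Left", "Right", "Left"),
  Final1 = c("Won", "Lost", "Won", "Lost", "Lost", "Won"),
  Final2 = c("Won", "Won", "Lost", "Lost", "Won", "Won")
)

threeWay <- table(df$Handedness, df$Final1, df$Final2,
                  dnn = c("Handedness", "Final1", "Final2"))

# Keep dimensions 1 and 2, summing the counts over Final2.
collapsed <- margin.table(threeWay, margin = c(1, 2))
print(collapsed)

# Build the two-way table directly and compare cell by cell.
twoWay <- table(df$Handedness, df$Final1, dnn = c("Handedness", "Final1"))
print(all(collapsed == twoWay))
```

This is one reason production code rarely needs every variation: the lower-dimensional tables can be derived from the fullest one when required.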
Now, let’s imagine that we’re ready to take the testing into the final phase by looking at whether there’s an even balance between the games won and lost in every tested category. Take a look at the following example.
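# Eight observations of a final result and two intermediate results.
df <- data.frame(
  FinalResult = c("Won", "Won", "Lost", "Lost", "Won", "Won", "Lost", "Lost"),
  Final1 = c("Won", "Lost", "Won", "Lost", "Won", "Lost", "Won", "Lost"),
  Final2 = c("Won", "Won", "Won", "Won", "Lost", "Lost", "Lost", "Lost")
)
# Three-way contingency table of the outcomes.
ourContingencyTable <- table(df$FinalResult, df$Final1, df$Final2)
print(ourContingencyTable)
# Proportions within each FinalResult level (margin = 1 is the first dimension).
percentWon <- prop.table(ourContingencyTable, margin = 1)
print(percentWon)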
We once again create and populate a data frame assigned to df. This time we’re returning to binary results of won and lost. We use df to create a contingency table and print the result to screen. Then we use prop.table with margin=1 to convert the counts in ourContingencyTable into proportions within each level of FinalResult, which is the first dimension of the table.
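The effect of the margin argument is easier to see when the results are compared side by side. The sketch below is one way to do that, assuming the same data as above. The names `outcomes`, `byResult`, and `overall` are hypothetical.

```r
# Same data as the example above, rebuilt so this snippet runs on its own.
df <- data.frame(
  FinalResult = c("Won", "Won", "Lost", "Lost", "Won", "Won", "Lost", "Lost"),
  Final1 = c("Won", "Lost", "Won", "Lost", "Won", "Lost", "Won", "Lost"),
  Final2 = c("Won", "Won", "Won", "Won", "Lost", "Lost", "Lost", "Lost")
)
outcomes <- table(df$FinalResult, df$Final1, df$Final2,
                  dnn = c("FinalResult", "Final1", "Final2"))

# margin = 1: proportions are computed separately for each FinalResult level,
# so the cells belonging to "Won" sum to 1, and likewise for "Lost".
byResult <- prop.table(outcomes, margin = 1)
print(apply(byResult, 1, sum))

# No margin: every cell is a share of the grand total instead.
overall <- prop.table(outcomes)
print(sum(overall))

# Multiply by 100 to show the proportions as percentages.
print(round(100 * byResult, 1))
```

In this data every combination occurs exactly once, so each cell of `byResult` is 0.25 and each cell of `overall` is 0.125, which is the even balance the test was looking for.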
The end result is an analysis of categorical variables using table. Take note of how much easier it is to understand this information as either a table of proportions or a contingency table when compared to the raw information. And this same concept becomes far more important as the amount of information scales upward. You might be able to get the gist of the results here by eyeballing the frame’s contents. But nobody’s going to be able to simply feel out the results at a glance once the data runs to hundreds of rows or more. R takes care of all of that for us. Instead of hundreds of points, we simply need to look at a neatly formatted tabulation.