When people start out with a new programming language they typically focus on its standard library and on how to package code for distribution. But what really distinguishes languages is their underlying syntax, and syntax is often where a language earns its reputation within a particular industry. For example, the R programming language is extremely popular in data science and statistical computing.
It’s true that R’s data-related libraries are part of its appeal. But R also has a general syntax designed around statistical computing concepts. For example, R supports user-defined infix operators, and one such operator has some impressive functionality. You’ll soon see how you can use that operator to act as a data pipe between functions.
R and the Concept of Data Piping
If you’re familiar with Linux or other Unix variants then you may already be familiar with the idea of pipes. One of the classic examples in those environments is pairing a long listing with a filter that narrows it down. For example, take a look at the following command, which we could use to look for R scripts in a crowded directory.
ls | grep '\.r$'
We use ls to generate a list of files in our current directory. The Unix pipe operator, | , is then used to “pipe” that list as input data for the grep application. Grep matches the piped data against the pattern we pass to it: the regular expression '\.r$' matches any file name ending in the .r extension, and the single quotes keep the shell from expanding the pattern before grep sees it. Grep reads through all of the file names passed along from ls and lists any that match. This means that if you ran the previous command in a Unix-based environment it would list out all of the R scripts in the current directory. Note that the information originates from the left side of the statement and then flows to the right through the pipe.
We’re starting out with a Unix example because it’s one of the most basic demonstrations of how pipes are used. A fundamental aspect of software pipes is that they operate much like physical pipes: a contained system that carries isolated information from one program, process, or function to another.
Our earlier look at pipes was specific to Unix systems. But the R programming language also supports pipes through the magrittr library, which defines the %>% operator. The popular dplyr package re-exports that operator, so loading either package makes it available.
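Either package gives you the operator once it’s loaded. Here’s a minimal sketch of both options, plus a trivial smoke test:
library(magrittr)     # the package that defines %>%
# or
library(dplyr)        # re-exports magrittr's %>% alongside its data tools

c(1, 2, 3) %>% sum()  # pipes the vector into sum(); returns 6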
You’re probably wondering what role pipes serve in R. The previous example was useful on a command line, but the utility might seem less apparent in a programming language that already makes it easy to set up conditionals or loop through collections. To answer that, we need to take a look at R’s syntax and coding style.
Style and Syntax in R
Before we take a deeper dive into why you’d want to use pipes in R, we should take a look at how they’re implemented. The cornerstone of piping in R is the %>% operator. In R, any symbol enclosed between two % characters designates a user-defined (or library-defined) infix operator. You can think of it as somewhat similar to creating our own version of the + sign: R lets us bind our own functions to new pieces of syntax, and those functions are then called by writing the designated name between the % characters. In the case of pipes, this is how magrittr adds a pipe operator that we call as %>%.
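To make the mechanism concrete, here’s a minimal sketch of a custom infix operator. The name %+% is our own invention for illustration, not something defined by any library.
`%+%` <- function(a, b) paste(a, b)   # any function bound to a %...% name becomes infix
"Hello" %+% "world"                   # returns "Hello world"
Magrittr does the same kind of thing with %>%, just with far more machinery behind the scenes. With that in mind, take a look at the following code sample.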
ourVec <- morley                            # copy the built-in morley dataset
ourVec2 <- ourVec[ourVec$Speed > 940, ]     # keep only rows where Speed exceeds 940
print(ourVec2)                              # display the filtered data frame
We begin by assigning the built-in morley dataset to ourVec. This gives us a large prebuilt data frame to work with. Next, we create a new data frame called ourVec2 by filtering ourVec for rows where Speed is greater than 940. And, finally, we print ourVec2 to the screen. This example is similar to the earlier one where we filtered data on a Linux terminal, and like that example, we can simplify things by using pipes. Now take a look at this modified version of the code sample.
library(dplyr)                               # provides filter() and the %>% operator
morley %>% filter(Speed > 940) %>% print()   # filter and print in one chain
The first thing you’ll probably notice is how much cleaner the code is now. We’re producing the same result as the prior code sample, but it’s all done in a single line and without defining any intermediate variables. The credit goes to the infix operator and piping.
We begin by loading the dplyr library to get its piping operator. Next, we start with the morley dataset, whose contents are piped into the filter function. Throughout this process the information moves from function to function like water moving through pipes.
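It helps to know what the operator actually does under the hood: x %>% f(y) is evaluated as f(x, y), with the left-hand value inserted as the first argument of the right-hand function. A quick illustration, assuming dplyr is loaded:
# These two calls are equivalent:
filter(morley, Speed > 940)
morley %>% filter(Speed > 940)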
We pass the piped morley data to filter, which narrows our data down to rows with a Speed over 940. The result of filter is then passed directly to the print function through another pipe. We’re essentially chaining our functions together into a single whole by piping data. It’s not just more concise, either: we also keep the workspace tidier by forgoing the intermediate data frames we’d otherwise have to define and manage. But the following code highlights another important aspect of piping.
library(dplyr)
morley %>%
  filter(Speed > 940) %>%
  pull(Speed) %>%   # extract the Speed column as a plain vector
  sum() %>%
  print()
This code is similar to our prior example. Functionally, the only difference is that we extended the pipe with two more steps: pull extracts the Speed column as a plain vector, and sum totals its values. (Without pull, sum would add up every column of the filtered data frame, experiment and run numbers included.) These added steps highlight how clean code can stay when piping is used to chain functions. We added functionality while still keeping the code cleaner than the initial example.
But we also made a secondary change to the code, one that is entirely about style rather than functionality. This modified version breaks the chain after each %>%, placing one step of the processing pipeline on each line.
This small example simply demonstrates the concept. But now think about how pipelines grow over time. As people work on projects, it’s quite possible for a procedure to end up needing dozens of separate functions to implement a single process.
As more processing is added to a system, the chance of making an error increases, largely because a dense mass of unformatted calls isn’t very human-readable. But when we use pipes we create an easily followed, well-formatted chain of functions. The end result is that even a complex procedure that juggles data between dozens of functions can be written in a way that’s easy to understand.
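To see the difference concretely, compare the same chain written as conventional nested calls versus a pipeline. This is only a small sketch, but the trade-off holds at any scale:
# Nested calls: read from the inside out
print(sum(pull(filter(morley, Speed > 940), Speed)))

# Piped: read from top to bottom
morley %>%
  filter(Speed > 940) %>%
  pull(Speed) %>%
  sum() %>%
  print()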