Parallel computing is an important tool in data science. It’s a method for distributing the individual elements of a larger task across separate computational resources, whether that means separate CPU cores, entirely separate machines, or even a full cluster on the backend.
Parallelism is useful in quite a few contexts, but it’s especially important for data analysis because of the heavy computational load that often comes with larger data sets. This type of processing is often quite complex at a foundational level, but R makes it relatively easy to use thanks to the foreach package. As you’ll soon see, foreach and its %dopar% operator vastly simplify parallel processing in R.
The Basics of Parallel Computing in R
The concept is best understood by thinking about R code as a set of problems that need to be solved. By default, most programs work sequentially: execution runs in a straight line from the start of a script to the end. Parallelism instead divides the work into discrete chunks. For example, you might have two data sets to work through. Sequentially, you’d process the first one, move on to the second, and then finish up any remaining tasks involving the results. Parallelism lets you hand off calculations to be handled independently of the main thread, so your code can multitask and work on both data sets at the same time.
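As a rough illustration of the sequential case described above, the sketch below summarises two small data sets one after the other. The names dataSetA, dataSetB, and combined are hypothetical and chosen only for this example. The point to notice is that neither calculation depends on the other, which is what makes this kind of work a candidate for parallelism.

```r
# Two independent, made-up data sets (hypothetical example values)
dataSetA <- c(2, 4, 6, 8)
dataSetB <- c(10, 20, 30)

# Sequential approach: finish the first task, then start the second
resultA <- mean(dataSetA)
resultB <- mean(dataSetB)

# Remaining work that needs both results
combined <- c(A = resultA, B = resultB)
print(combined)
```

Because resultA and resultB never read each other's values, the two mean() calls could in principle run at the same time without changing the outcome.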
Setting the Stage for Proper Parallelism
The idea is easiest to see by performing a simple calculation both with and without parallelism. Take a look at the following example, which starts with the non-parallel version.
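library(foreach)
# Two equal-length vectors to add element by element
ourNumbersA <- c(1, 2, 3, 4, 5)
ourNumbersB <- c(6, 7, 8, 9, 10)
# %do% runs each iteration sequentially and returns the results as a list
ourTotal <- foreach(a = ourNumbersA, b = ourNumbersB) %do% {
  a + b
}
print(ourTotal)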
The code begins by loading foreach. This library gives us a lot of added flexibility for loops and will make parallelism a lot easier. We then create two simple vectors, ourNumbersA and ourNumbersB. Next, we use foreach with the %do% operator to step through both vectors together and add each pair of values. Unlike a standard for loop, foreach returns a value, which by default is a list with one element per iteration. Finally, after the loop is completed, the code prints out the results. This isn’t too different from the standard method of setting up a loop in R. But you’re about to see just how easily this concept can be turned into true parallel computing.
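By default, foreach hands its results back as a list. If a plain numeric vector is more convenient, the .combine argument of foreach can join the per-iteration results. The following is a minimal sketch that reuses the vectors from the example above; the name ourTotalVector is introduced here only for illustration.

```r
library(foreach)

ourNumbersA <- c(1, 2, 3, 4, 5)
ourNumbersB <- c(6, 7, 8, 9, 10)

# .combine = c concatenates each iteration's result into a single vector
ourTotalVector <- foreach(a = ourNumbersA, b = ourNumbersB, .combine = c) %do% {
  a + b
}

print(ourTotalVector)
```

Other combining functions, such as rbind or `+`, can be passed in the same way, depending on the shape of result you need.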
Implementing Parallelism
Parallelism is generally fairly difficult to implement in lower-level languages. And even many higher-level languages struggle to provide both power and ease of use within their parallelism implementations. But take a look at the following example to see just how easy it is to modify the previous code to use parallel computing in R.
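library(foreach)
library(doParallel)
# Start a local cluster of two worker processes
ourCluster <- makeCluster(2)
# Register the cluster as the backend that %dopar% will use
registerDoParallel(ourCluster)
ourNumbersA <- c(1, 2, 3, 4)
ourNumbersB <- c(5, 6, 7, 8)
# %dopar% sends the iterations to the registered workers
ourTotal <- foreach(a = ourNumbersA, b = ourNumbersB) %dopar% {
  a + b
}
print(ourTotal)
# Shut the workers down once they are no longer needed
stopCluster(ourCluster)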
In addition to foreach, we also load the doParallel package. You can think of it as the bridge between foreach and R’s built-in parallel package: it supplies the backend that lets foreach loops run on multiple workers. And the power comes in a form so easy to use that we’re able to call it with just a couple of functions. Setting up that backend is also the first thing the program logic does. The ourCluster variable contains the result of running makeCluster with an argument of 2. This simply means that we’ve created a new cluster of two worker processes, so the work can be spread across two cores. In this context, a cluster just refers to a group of workers capable of handling the processing tasks that we give them. This example provides a fairly simple implementation of the clustering function. But it can be easily extended to handle more advanced usage scenarios, such as making use of networked resources.
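The worker count doesn’t have to be hard-coded. As a hedged sketch, the parallel package’s detectCores function can be used to size the cluster to the machine. The names availableCores and workerCount, the habit of leaving one core free, and the cap of two workers are all assumptions made for this small demo rather than requirements.

```r
library(parallel)

# detectCores() can return NA on some platforms, so fall back to 1
availableCores <- detectCores()
if (is.na(availableCores)) availableCores <- 1

# Assumed rule of thumb: leave one core free for the rest of the system.
# The cap of 2 just keeps this demo lightweight.
workerCount <- max(1, min(2, availableCores - 1))

ourCluster <- makeCluster(workerCount)

# A cluster object holds one entry per worker
print(length(ourCluster))

stopCluster(ourCluster)
```

For real workloads you would likely drop the cap and choose a worker count based on both the available cores and the memory each worker needs.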
With the cluster created, the next step is to register it with the backend through registerDoParallel. The code is mostly unchanged from that point, right up until the %do%. The parallel computing approach uses %dopar% instead. As the name suggests, it’s essentially a version of %do% that uses parallelism. And that really is all there is to it. Aside from loading the additional library and creating and registering the cluster, you simply need to replace %do% with %dopar% in order to use parallel computing.
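To make the swap concrete, the sketch below runs the same loop once with %do% and once with %dopar% and compares the two results. The variable names sequentialTotal and parallelTotal are introduced only for this comparison. By default foreach returns results in the order of the inputs, so the two lists would be expected to match.

```r
library(foreach)
library(doParallel)

ourCluster <- makeCluster(2)
registerDoParallel(ourCluster)

ourNumbersA <- c(1, 2, 3, 4)
ourNumbersB <- c(5, 6, 7, 8)

# Same loop body, run sequentially in the main R session
sequentialTotal <- foreach(a = ourNumbersA, b = ourNumbersB) %do% {
  a + b
}

# Same loop body, run on the registered workers
parallelTotal <- foreach(a = ourNumbersA, b = ourNumbersB) %dopar% {
  a + b
}

stopCluster(ourCluster)

# Compare the two result lists
print(identical(sequentialTotal, parallelTotal))
```

For a calculation this small, the parallel version will typically be slower because of the cost of sending work to the workers. The benefit shows up when each iteration does substantial work.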
The script proceeds to print out the result of the calculation. But there’s one final step at the very end which is unique to this new approach. It’s important to always terminate a cluster once it’s no longer needed. In this case, the script uses the stopCluster function on the ourCluster variable to do so. Technically, you don’t absolutely need to perform this step, just as you can technically flip your PC’s PSU switch to turn the machine off instead of shutting it down through the operating system’s menus. But it’s best to shut down a cluster explicitly, for much the same reason that clean shutdowns matter on your PC.
Stopping a cluster gives the code a chance to clean up any lingering resources that are still in use, including the worker processes themselves. This isn’t a huge issue with a lightweight example like the previous code. But the importance scales up with the size of your workload. Simple arithmetic remaining in memory isn’t a problem. But huge data sets are another matter entirely. They can hog resources if not cleanly removed. Or in worst-case scenarios, they could even interfere with other active projects. So just make sure to always clean up after your code when using parallel computing.
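One way to make the cleanup hard to forget is to wrap the parallel work in a function and register stopCluster with base R’s on.exit, which runs when the function finishes whether it returns normally or stops with an error. The following is a minimal sketch under that assumption; addInParallel and its workers parameter are hypothetical names invented for this example.

```r
library(foreach)
library(doParallel)

# Hypothetical helper: adds two vectors element by element on a temporary cluster
addInParallel <- function(numbersA, numbersB, workers = 2) {
  ourCluster <- makeCluster(workers)
  # Runs when the function exits, even if the loop below raises an error
  on.exit(stopCluster(ourCluster), add = TRUE)
  registerDoParallel(ourCluster)

  foreach(a = numbersA, b = numbersB, .combine = c) %dopar% {
    a + b
  }
}

print(addInParallel(c(1, 2, 3, 4), c(5, 6, 7, 8)))
```

The trade-off is that a new cluster is started on every call. For repeated work it may be cheaper to create one cluster, reuse it, and stop it once at the end.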