Zombie-Safe Ways to Perform Mutating Joins in R (with Examples)

R’s combination of power and minimalist design has made it a favorite among people working in data science. The platform lets coders draw on some of the more complex elements of both statistics and computing. And unlike lower-level languages, R handles most of the housekeeping for us. It usually takes care of memory management, file handling, and similar chores on its own. But there are some exceptions to that rule.

It can sometimes be a bit tricky to work with R’s memory management and handle more advanced operations like mutating joins while avoiding zombies at the same time. The task is easy enough once you know what R will and won’t handle on its own. Knowing which elements those actually are is the crux of the problem. But don’t worry, you’re about to find out how to create a zombie-safe way to perform mutating joins in R. This will entail the use of R’s merge function to alter one data frame through the inclusion of another’s information.

Mutating Joins and Zombie Processes

Zombies and mutations might sound like something out of a horror movie. But they’re very real concerns within computational data science. One is generally beneficial and the other is a significant danger.

R mutations might sound harmful, but they’re generally benign. A mutation just describes a change in the value of an object, with an object essentially being anything in R that can hold a value. For example, a data frame is an object, and changing its structure can be considered a mutation. That’s especially true when we join two data frames together.

In operating system terms, a zombie process is a child process that has finished running but still occupies an entry in the process table because its parent hasn’t yet collected its exit status. This article uses the word more loosely, for any allocated resources that linger after the code that needed them has finished. For example, imagine that you wanted to perform a mutating join in a separate process within your script. If there was an issue with the join, that process might linger on while using up system resources. And even after a successful join, we’re still left with one or more redundant data frames taking up memory. Lingering, unused variables can also be considered zombies. They’re effectively dead data that are still present within our system’s runtime.

This comes back to the fact that R is able to handle a lot of internal memory management on its own. But no machine, whether physical or virtual, is ever going to do a perfect job of memory management. It’s often a question of whether the designers want to take risks in being too lax or take equal risks in being too strict. R generally takes a middle path between the two extremes. The interpreter handles a lot of these issues for us. But it also gives us the tools we need to manage things on our own if we’re worried about zombies and similar problems.

A Data Scientist’s Guide to Applied Mutations

Never let it be said that data science is anything less than exciting. Because you’re about to spark a mutation in some animals. Well, a mutation in R data frames containing information about animals at least. Take a look at the following code.

# Ten animals, each with a unique id and a habitat.
ourAnimals <- data.frame(
  id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  name = c("Lion", "Tiger", "Bear", "Gorilla", "Chimpanzee", "Giraffe", "Elephant", "Hippopotamus", "Zebra", "Kangaroo"),
  eco = c("Savanna", "Jungle", "Forest", "Jungle", "Jungle", "Savanna", "Savanna", "Jungle", "Savanna", "Desert")
)

# Traits for five of those animals, keyed on the same id column.
ourAnimalTraits <- data.frame(
  id = c(1, 3, 5, 7, 9),
  size = c("Large", "Large", "Medium", "Large", "Medium"),
  diet = c("Carnivore", "Carnivore", "Omnivore", "Herbivore", "Herbivore")
)

# Join on id and assign the result back to ourAnimals.
ourAnimals <- merge(ourAnimals, ourAnimalTraits, by = "id")
print(ourAnimals)

We begin by creating two new data frames – ourAnimals and ourAnimalTraits. These two data frames share a common element in the id listing. And in the next line down we specify the id column as the connection point for the merge function. Merge, as the name suggests, merges our two data frames on a specified element.
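The example relies on both data frames naming their key column id. As a minimal sketch of the other case, assuming a hypothetical traits table whose key is called animal_id (a name invented here for illustration), merge's by.x and by.y arguments name the key column in each frame:

```r
# Small stand-ins for the article's data frames.
animals <- data.frame(
  id   = c(1, 2, 3),
  name = c("Lion", "Tiger", "Bear")
)

# Hypothetical variant: the key column has a different name here.
traits <- data.frame(
  animal_id = c(1, 3),
  diet      = c("Carnivore", "Carnivore")
)

# by.x names the key in the first frame, by.y the key in the second.
joined <- merge(animals, traits, by.x = "id", by.y = "animal_id")
print(joined)
```

The key column in the result keeps the name it had in the first data frame.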

Merge’s syntactical simplicity runs in direct opposition to the complexity beneath its surface. The merge function can perform some pretty complex data management for us. For example, its all, all.x, and all.y arguments switch between inner, left, right, and full outer joins, and by.x and by.y let us join on key columns with different names. By default merge performs an inner join, which keeps only the rows whose id appears in both data frames. That’s why, as the print function on the next line shows, the merged ourAnimals contains just the five animals that have an entry in ourAnimalTraits, now with size and diet columns attached.
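Because the default inner join silently drops animals that have no traits, it can help to compare it with a left join. The following is a minimal sketch using cut-down, hypothetical versions of the two data frames:

```r
# Small stand-ins for the article's data frames.
animals <- data.frame(
  id   = c(1, 2, 3, 4),
  name = c("Lion", "Tiger", "Bear", "Gorilla")
)
traits <- data.frame(
  id   = c(1, 3),
  size = c("Large", "Large")
)

# Default merge: inner join, keeps only ids present in both frames.
inner <- merge(animals, traits, by = "id")

# all.x = TRUE: left join, keeps every row of animals and fills
# the missing trait values with NA.
left <- merge(animals, traits, by = "id", all.x = TRUE)

print(nrow(inner))
print(nrow(left))
print(left)
```

If the goal is to add columns without losing rows, all.x = TRUE is probably the safer choice when assigning the result back to the original data frame.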

But take a moment to look at the merge assignment. Many people just create a new variable to hold the result of a merge. The main reason is that you can continue to perform new calculations on the original data if you keep it around. But what if we were working with a huge data set that took up a significant chunk of memory? We might want to avoid keeping multiple huge data frames in memory at the same time. In this case, we’d do exactly as you see in this example and assign the result of our merge back to one of the originating data frames. We call this a mutating join because it alters the structure of an existing data frame rather than creating a new one. (The dplyr package uses the same term for any join that adds columns from one table to another.)

R and Memory Management

Something might have occurred to you when the topic of large data frames and memory came up. What happens to the resources that were used to perform the mutating join? Some of them are cleaned up automatically. R has a built-in garbage collector that reclaims the memory held by a variable’s old value once nothing refers to it anymore, which is what happens after reassignment. But that’s quite literally only half of the problem with our earlier example.
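To make the reassignment case concrete, here is a minimal sketch that measures a data frame before and after its name is reassigned. The data frame big and its columns are hypothetical, and the call to gc() is optional since R normally runs the collector on its own:

```r
# Hypothetical data frame, large enough for its size to be visible.
big <- data.frame(
  id    = as.numeric(seq_len(100000)),
  group = rep(c(1, 2, 3, 4), times = 25000)
)
size_before <- object.size(big)

# Reassigning the name: the original 100,000-row data is no longer
# reachable, so the garbage collector may reclaim it.
big <- big[big$group == 1, ]
size_after <- object.size(big)

# gc() requests a collection now; R would otherwise run one
# on its own when it needs memory.
invisible(gc())

print(format(size_before, units = "auto"))
print(format(size_after, units = "auto"))
```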

The old data in ourAnimals became eligible for cleanup as soon as we reassigned the name with the merge function. But what about ourAnimalTraits? We now have two instances of that data in memory. The interpreter holds our original data frame and the data that was copied into ourAnimals. That’s not a very big deal if we’re only dealing with tiny data frames. But larger-scale R projects typically deal with immense data frames containing large amounts of information. Keeping unused data like that in memory isn’t the most efficient way to handle things. In fact, if we never use those variables again then they can be classified as zombies. Thankfully R makes it extremely easy to deal with lingering data. Try adding this line to the end of the previous example.

rm(ourAnimalTraits)

The rm function removes a variable from the current environment, which leaves its data free to be reclaimed by the garbage collector. To see this in effect, try running the following altered version of our initial example.

ourAnimals <- data.frame(
  id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  name = c("Lion", "Tiger", "Bear", "Gorilla", "Chimpanzee", "Giraffe", "Elephant", "Hippopotamus", "Zebra", "Kangaroo"),
  eco = c("Savanna", "Jungle", "Forest", "Jungle", "Jungle", "Savanna", "Savanna", "Jungle", "Savanna", "Desert")
)

ourAnimalTraits <- data.frame(
  id = c(1, 3, 5, 7, 9),
  size = c("Large", "Large", "Medium", "Large", "Medium"),
  diet = c("Carnivore", "Carnivore", "Omnivore", "Herbivore", "Herbivore")
)

ourAnimals <- merge(ourAnimals, ourAnimalTraits, by = "id")
print(ourAnimals)
print(ourAnimalTraits)
# Remove the now-redundant traits data frame.
rm(ourAnimalTraits)
# This final print fails on purpose: ourAnimalTraits no longer exists.
print(ourAnimalTraits)

This script is exactly the same as our original in terms of the merge functionality. But after the merge we print out the newly mutated ourAnimals, then the pre-existing ourAnimalTraits. This highlights that ourAnimals has changed while ourAnimalTraits remains the same.

We then use rm to wipe ourAnimalTraits and its associated data from memory. And, finally, we try printing out ourAnimalTraits one more time. The script will fail with an error message stating that ourAnimalTraits isn’t found. But don’t worry, that’s a good thing. The error occurs because the rm function removed the zombie data from memory. If we try to work with ourAnimalTraits after sending it to rm it’s the same as trying to work with it before the initial declaration. In short, we can’t do it without invoking an error.
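In a longer script we usually don’t want a missing variable to halt everything. As a minimal sketch, with tempTraits as a hypothetical stand-in for the removed data frame, exists() can check for a name without raising an error, and tryCatch() can turn the error into a value the script handles:

```r
# Hypothetical stand-in for a data frame that is no longer needed.
tempTraits <- data.frame(id = c(1, 3), size = c("Large", "Large"))
rm(tempTraits)

# exists() reports whether a name is still bound, without raising an error.
still_there <- exists("tempTraits")
print(still_there)

# tryCatch() converts the "object not found" error into an ordinary value.
result <- tryCatch(
  {
    print(tempTraits)
    "printed"
  },
  error = function(e) "missing"
)
print(result)
```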

The error also highlights the major issue with both mutation and zombies. When you’re altering a fundamental aspect of your script, it’s immensely important to make sure the new layout is compatible with all of your pre-existing functionality. The final print statement trying to access the now non-existent data highlights what happens if you don’t take this practice into account. You should always take great care when permanently altering an object’s structure or removing variables from memory.
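One way to act on that advice is to check the result of a join right after it happens. The sketch below uses hypothetical miniature data frames and assumes that later code depends on the columns id, name, and size and on no rows being lost; stopifnot() halts the script if either assumption fails:

```r
# Small stand-ins for the article's data frames.
animals <- data.frame(
  id   = c(1, 2, 3, 4),
  name = c("Lion", "Tiger", "Bear", "Gorilla")
)
animalTraits <- data.frame(
  id   = c(1, 3),
  size = c("Large", "Large")
)

rows_before <- nrow(animals)

# Left join so that animals without traits are kept, then drop the
# redundant traits data frame.
animals <- merge(animals, animalTraits, by = "id", all.x = TRUE)
rm(animalTraits)

# Columns that later code in this hypothetical script relies on.
required <- c("id", "name", "size")

# Fail early if the join changed the layout in an unexpected way.
stopifnot(
  all(required %in% names(animals)),
  nrow(animals) == rows_before
)
print(animals)
```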
