How To Identify and Remove Duplicate Data in R

Very often in your datasets you have a situation where you have duplicated values of data, when one row has same values as some other row. That doesn’t have to be wrong or bad, but sometimes you get into situations where you only want to consider specific values and exclude values that you already have taken into account.

Now, if you have small datasets that you can quickly analyze by naked eye, you can simply remove the rows you want:

planets <-c(“Mercury”,”Venus”,”Mars”,”Earth”,”Jupyter”,

“Saturn”,”Uranus”,”Neptune”,”Pluto”)

position <- c(“First”,”Second”,”Third”,”Fourth”,”Fifth”,

“Seventh”,”Eighth”,”Nineth”)

df <- data.frame(planets, position)

planets position

1 Mercury First

2 Venus Second

3 Mars Third

4 Earth Fourth

5 Jupyter Fifth

6 Saturn Sixth

7 Uranus Seventh

8 Neptune Eighth

9 Pluto Nineth

Now I can remove the 9th row by:

df <- df[-c(9), ]

I specified row (9) to remove and left all the columns as they are.

Now, what happens when you have thousands of rows? Here is where dplyr comes in help. Dplyr is a package which provides a set of tools for efficiently manipulating datasets in R.

In the context of removing duplicate rows, there are three functions from this package that are of our interest: duplicated(), unique() and distinct()

duplicated() identifies rows which values appear more than once.

unique() identifies rows which are original (don’t appear more than once).

distinct() is a function which removes duplicate data from a dataset

Let’s see on example how to use these methods.

Dplyr comes from tidyverse library, so we first must include it by executing:

library(tidyverse)

We’ll use the R built-in iris data set, which we start by converting into a tibble data frame for easier data analysis:

my_data <- as_tibble(iris)

my_data

## # A tibble: 150 x 5

## Sepal.Length Sepal.Width Petal.Length Petal.Width Species

## <dbl> <dbl> <dbl> <dbl> <fct>

## 1 5.1 3.5 1.4 0.2 setosa

## 2 4.9 3 1.4 0.2 setosa

## 3 4.7 3.2 1.3 0.2 setosa

## 4 4.6 3.1 1.5 0.2 setosa

## 5 5 3.6 1.4 0.2 setosa

## 6 5.4 3.9 1.7 0.4 setosa

## # … with 144 more rows

The R function duplicated() returns a logical vector where TRUE specifies which elements of a vector or data frame are duplicates:

x <- c(1, 1, 4, 5, 4, 6)

duplicated(x)

[1] FALSE TRUE FALSE FALSE TRUE FALSE

To extract duplicate elements:

x[duplicated(x)]

## [1] 1 4

If you want to remove duplicated elements, use !duplicated(), where ! is a logical negation:

x[!duplicated(x)]

## [1] 1 4 5 6

You will rarely get identical rows, but very often you will get identical values in specific columns. For example, in our iris data (my_data), Sepal.width is very often repeated by values. Let’s see how to remove rows based on duplicate values from that column:

my_data[!duplicated(my_data$Sepal.Width), ]

## # A tibble: 23 x 5

## Sepal.Length Sepal.Width Petal.Length Petal.Width Species

## <dbl> <dbl> <dbl> <dbl> <fct>

## 1 5.1 3.5 1.4 0.2 setosa

## 2 4.9 3 1.4 0.2 setosa

## 3 4.7 3.2 1.3 0.2 setosa

## 4 4.6 3.1 1.5 0.2 setosa

## 5 5 3.6 1.4 0.2 setosa

## 6 5.4 3.9 1.7 0.4 setosa

## # … with 17 more rows

Another way of removing duplicates is by using unique() function. It works in opposite way of duplicated() function. For example:

unique(c(1,1,4,5,4,6))

## [1] 1 4 5 6

It’s also possible to apply unique() on a data frame, for removing duplicated rows as follow:

unique(my_data)

## # A tibble: 149 x 5

## Sepal.Length Sepal.Width Petal.Length Petal.Width Species

## <dbl> <dbl> <dbl> <dbl> <fct>

## 1 5.1 3.5 1.4 0.2 setosa

## 2 4.9 3 1.4 0.2 setosa

## 3 4.7 3.2 1.3 0.2 setosa

## 4 4.6 3.1 1.5 0.2 setosa

## 5 5 3.6 1.4 0.2 setosa

## 6 5.4 3.9 1.7 0.4 setosa

## # … with 143 more rows

You notice that in those filtering methods where we don’t specify the column, we get rid of only one row because there is not many rows that are identical by all columns. But when we specify the column, we often get rid of quite a lot of rows.

The function distinct() [dplyr package] can be used to keep only unique/distinct rows from a data frame. If there are duplicate rows, only the first row is preserved. It’s an efficient version of the R base function unique().

Remove duplicate rows based on all columns:

my_data %>% distinct()

## # A tibble: 149 x 5

## Sepal.Length Sepal.Width Petal.Length Petal.Width Species

## <dbl> <dbl> <dbl> <dbl> <fct>

## 1 5.1 3.5 1.4 0.2 setosa

## 2 4.9 3 1.4 0.2 setosa

## 3 4.7 3.2 1.3 0.2 setosa

## 4 4.6 3.1 1.5 0.2 setosa

## 5 5 3.6 1.4 0.2 setosa

## 6 5.4 3.9 1.7 0.4 setosa

## # … with 143 more rows

Remove duplicate rows based on certain columns (Sepal.Length):

my_data %>% distinct(Sepal.Length, .keep_all = TRUE)

## # A tibble: 149 x 5

## Sepal.Length Sepal.Width Petal.Length Petal.Width Species

## <dbl> <dbl> <dbl> <dbl> <fct>

## 1 5.1 3.5 1.4 0.2 setosa

## 2 4.9 3 1.4 0.2 setosa

## 3 4.7 3.2 1.3 0.2 setosa

## 4 4.6 3.1 1.5 0.2 setosa

## 5 5 3.6 1.4 0.2 setosa

## 6 5.4 3.9 1.7 0.4 setosa

## # … with 29 more rows

The .keep_all attribute is used to retain all other variables in the output data frame.

Remove duplicated rows based on Sepal.Length and Petal.Width:

my_data %>% distinct(Sepal.Length, Petal.Width, .keep_all = TRUE)

## # A tibble: 149 x 5

## Sepal.Length Sepal.Width Petal.Length Petal.Width Species

## <dbl> <dbl> <dbl> <dbl> <fct>

## 1 5.1 3.5 1.4 0.2 setosa

## 2 4.9 3 1.4 0.2 setosa

## 3 4.7 3.2 1.3 0.2 setosa

## 4 4.6 3.1 1.5 0.2 setosa

## 5 5 3.6 1.4 0.2 setosa

## 6 5.4 3.9 1.7 0.4 setosa

## # … with 104 more rows

Here we use pipe operator %>%

The point of pipe is to help you write a code that is easier to read and understand.

When we say

my_data %>% distinct()

It is equivalent to

my_data <- distinct(my_data,…)

To learn more about pipe operator and its benefits, check out free e book ‘R for data science’ by Garett Grolemund