How To Remove Duplicates in the R Programming Language

Identify Duplicate Elements

Description

duplicated() determines which elements of a vector or data frame are duplicates of elements with smaller subscripts, and returns a logical vector indicating which elements (rows) are duplicates.

anyDuplicated(.) is a “generalized”, more efficient version of any(duplicated(.)), returning a positive integer index instead of just TRUE.

Usage

duplicated(x, incomparables = FALSE, ...)

## Default S3 method:
duplicated(x, incomparables = FALSE,
           fromLast = FALSE, nmax = NA, ...)

## S3 method for class 'array'
duplicated(x, incomparables = FALSE, MARGIN = 1,
           fromLast = FALSE, ...)

anyDuplicated(x, incomparables = FALSE, ...)

## Default S3 method:
anyDuplicated(x, incomparables = FALSE,
              fromLast = FALSE, ...)

## S3 method for class 'array'
anyDuplicated(x, incomparables = FALSE,
              MARGIN = 1, fromLast = FALSE, ...)

Arguments
x
a vector, a data frame, an array, or NULL.

incomparables
a vector of values that cannot be compared. FALSE is a special value, meaning that all values can be compared, and may be the only value accepted for methods other than the default. It will be coerced internally to the same type as x.

fromLast
logical indicating whether duplication should be considered from the reverse side, i.e., the last (or rightmost) of identical elements would correspond to duplicated = FALSE.

nmax
the maximum number of unique items expected (greater than one).

...
arguments for particular methods.

MARGIN
the array margin to be held fixed: see apply, and note that MARGIN = 0 may be useful.

Details
These are generic functions with methods for vectors, data frames and arrays.

For the default methods, and whenever there are equivalent method definitions for duplicated and anyDuplicated, anyDuplicated(x, ...) is a “generalized” shortcut for any(duplicated(x, ...)), in the sense that it returns the index i of the first duplicated entry x[i] if there is one, and 0 otherwise. Their behaviours may differ when at least one of duplicated and anyDuplicated has a relevant method.
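
As a minimal sketch of that relationship, the following compares the two calls on a small numeric vector. The vector x and the variable names are made up for illustration.

```r
# Illustrative vector: 21 repeats at position 4, 19 repeats at position 7
x <- c(11, 21, 46, 21, 19, 18, 19)

# any(duplicated(x)) only says whether a duplicate exists
has_dup <- any(duplicated(x))

# anyDuplicated(x) gives the index of the first duplicated entry
first_dup <- anyDuplicated(x)
print(first_dup)      # 4
print(x[first_dup])   # 21

# With no duplicates the result is 0
no_dup <- anyDuplicated(c(1, 2, 3))
print(no_dup)         # 0
```

Because 0 is treated as FALSE and any positive integer as TRUE, the result of anyDuplicated() can be used directly in an if() condition.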

duplicated(x, fromLast = TRUE) is equivalent to but faster than rev(duplicated(rev(x))).
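
The equivalence can be checked on a small example. This is only a sketch with an invented character vector; it also shows how fromLast = TRUE keeps the last occurrence of each value instead of the first.

```r
# Illustrative vector: "a" and "b" each appear twice
x <- c("a", "b", "a", "c", "b")

# Scan from the end: earlier copies are the ones flagged as duplicates
from_last <- duplicated(x, fromLast = TRUE)
print(from_last)   # TRUE TRUE FALSE FALSE FALSE

# Same result, built by reversing twice
via_rev <- rev(duplicated(rev(x)))
print(identical(from_last, via_rev))   # TRUE

# Keep the last occurrence of each value
kept_last <- x[!from_last]
print(kept_last)   # "a" "c" "b"
```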

The array method calculates, for each element of the sub-array specified by MARGIN, whether the remaining dimensions are identical to those for an earlier (or later, when fromLast = TRUE) element (in row-major order). This is most commonly used to find duplicated rows (the default) or columns (with MARGIN = 2). Note that MARGIN = 0 returns an array with the same dimensionality attributes as x.
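
A short sketch of the array method on two small matrices follows. The matrices m and m2 are invented for illustration; as.vector() is used only to drop the dimension attribute before printing.

```r
# Matrix whose third row repeats the first row
m <- rbind(c(1, 2), c(3, 4), c(1, 2))
dup_rows <- duplicated(m)                 # MARGIN = 1 is the default
print(as.vector(dup_rows))                # FALSE FALSE TRUE

# Matrix whose third column repeats the first column
m2 <- cbind(c(1, 3), c(5, 6), c(1, 3))
dup_cols <- duplicated(m2, MARGIN = 2)
print(as.vector(dup_cols))                # FALSE FALSE TRUE

# Drop the duplicated rows, keeping the matrix shape
m_unique <- m[!dup_rows, , drop = FALSE]
print(m_unique)
```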

Missing values (“NA”) are regarded as equal, with numeric and complex ones differing from NaN; character strings are compared in a “common encoding”; for details, see match, which uses the same concept.
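
A small sketch of the missing-value rule for a numeric vector, using made-up data: the second NA counts as a duplicate of the first, the second NaN as a duplicate of the first NaN, but NA and NaN are not duplicates of each other.

```r
# Illustrative vector mixing NA and NaN
x <- c(1, NA, 2, NA, NaN, NaN)

dups <- duplicated(x)
print(dups)   # FALSE FALSE FALSE TRUE FALSE TRUE

# Removing duplicates keeps one NA and one NaN
deduped <- x[!dups]
print(deduped)
```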

Values in incomparables will never be marked as duplicated. This is intended to be used for a fairly small set of values and will not be efficient for a very large set.
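
The effect of incomparables can be sketched on a plain vector. Here 0 is assumed to be a placeholder value that should never be treated as a duplicate; the data are invented for illustration.

```r
# Illustrative vector where 0 is a placeholder, not a real measurement
x <- c(5, 0, 5, 0, 0)

# Default: repeated zeros are flagged like any other value
default_dups <- duplicated(x)
print(default_dups)    # FALSE FALSE TRUE TRUE TRUE

# With incomparables = 0 the zeros are never marked as duplicated
skip_zero <- duplicated(x, incomparables = 0)
print(skip_zero)       # FALSE FALSE TRUE FALSE FALSE
```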

Except for factors, logical and raw vectors, the default nmax = NA is equivalent to nmax = length(x). Since a hash table of size 8*nmax bytes is allocated, setting nmax suitably can save large amounts of memory. For factors it is automatically set to the smaller of length(x) and the number of levels plus one (for NA). If nmax is set too small there is liable to be an error: nmax = 1 is silently ignored.

Data cleaning is one of the hardest tasks for a data science specialist. Part of it is examining the data set and removing duplicates, sometimes under specific conditions such as matching column values.

How To Remove Duplicates in R
To remove duplicates in R, use one of the following:

1. The duplicated() function: it identifies the duplicate elements.
2. The unique() function: it returns the elements with duplicates removed.
3. The dplyr package’s distinct() function: it removes duplicate rows from a data frame.

duplicated() in R

duplicated() is a built-in R function that determines which elements of a vector or rows of a data frame are duplicates of elements with smaller subscripts, and returns a logical vector indicating which elements (rows) are duplicates.

rv <- c(11, 21, 46, 21, 19, 18, 19)

duplicated(rv)

Output

[1] FALSE FALSE FALSE TRUE FALSE FALSE TRUE
If an element has already appeared earlier in the vector, duplicated() returns TRUE for it.

The TRUE values therefore mark the positions of the duplicated elements in the vector.
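
To turn the logical vector into actual positions or values, one option is to combine duplicated() with which() or with subsetting. This is a sketch using the same rv vector; the variable names are only illustrative.

```r
rv <- c(11, 21, 46, 21, 19, 18, 19)

# Positions of the elements flagged as duplicates
dup_positions <- which(duplicated(rv))
print(dup_positions)   # 4 7

# The duplicated values themselves
dup_values <- rv[duplicated(rv)]
print(dup_values)      # 21 19
```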

To remove duplicate elements from a vector in R, use !duplicated(), where ! is logical negation.

rv <- c(11, 21, 46, 21, 19, 18, 19)

rv[!duplicated(rv)]

Output

[1] 11 21 46 19 18

Remove duplicate rows in R

To remove duplicate rows from a data frame in R, use !duplicated(), where ! is logical negation. The duplicated() function determines which elements of a vector or rows of a data frame are duplicates.

To create a data frame in R, use the data.frame() function.

provider <- data.frame(service_id = c(21, 19, 18, 46, 29, 18, 46),
                       service_name = c("Netflix", "Disney+", "HBOMAX", "Hulu", "Peacock", "HBOMAX", "Hulu"),
                       service_price = c(18, 10, 15, 7, 12, 15, 7),
                       stringsAsFactors = FALSE)

print(provider)

Output

  service_id service_name service_price
1         21      Netflix            18
2         19      Disney+            10
3         18       HBOMAX            15
4         46         Hulu             7
5         29      Peacock            12
6         18       HBOMAX            15
7         46         Hulu             7

You can see that our data frame contains duplicate rows.

To remove duplicate rows from a data frame based on column values, use !duplicated(). The following code removes duplicate rows based on the service_name column.

provider <- data.frame(service_id = c(21, 19, 18, 46, 29, 18, 46),
                       service_name = c("Netflix", "Disney+", "HBOMAX", "Hulu", "Peacock", "HBOMAX", "Hulu"),
                       service_price = c(18, 10, 15, 7, 12, 15, 7),
                       stringsAsFactors = FALSE)

print(provider)

cat("======== After removing duplicate rows ==========", "\n")

# Keep only the first row for each service_name
provider[!duplicated(provider$service_name), ]

Output

  service_id service_name service_price
1         21      Netflix            18
2         19      Disney+            10
3         18       HBOMAX            15
4         46         Hulu             7
5         29      Peacock            12
6         18       HBOMAX            15
7         46         Hulu             7

======== After removing duplicate rows ==========

  service_id service_name service_price
1         21      Netflix            18
2         19      Disney+            10
3         18       HBOMAX            15
4         46         Hulu             7
5         29      Peacock            12

The output holds only five rows, which means the two duplicate rows have been removed.

You can remove rows based on whichever column you choose. In our example, we removed the duplicate rows based on the service_name column, but you can remove them based on any column.
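
The same idea extends to a combination of columns. The sketch below is one possible approach, assuming the provider data frame from above; the key_cols vector is a hypothetical helper name. It passes a sub-data-frame to duplicated() and also shows fromLast = TRUE for keeping the last row of each group.

```r
provider <- data.frame(service_id = c(21, 19, 18, 46, 29, 18, 46),
                       service_name = c("Netflix", "Disney+", "HBOMAX", "Hulu", "Peacock", "HBOMAX", "Hulu"),
                       service_price = c(18, 10, 15, 7, 12, 15, 7),
                       stringsAsFactors = FALSE)

# Columns that together define what counts as a duplicate (illustrative choice)
key_cols <- c("service_id", "service_name")

# Keep the first row for each combination of the key columns
deduped <- provider[!duplicated(provider[, key_cols]), ]
print(deduped)

# Keep the last row for each service_name instead of the first
kept_last <- provider[!duplicated(provider$service_name, fromLast = TRUE), ]
print(kept_last)
```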

Column values are case sensitive, so if there are two values like HULU and Hulu, !duplicated() treats them as two different values, and neither is counted as a duplicate of the other. So please remember that the duplicated() function is case-sensitive.
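
If values that differ only by case should count as duplicates, one possible workaround is to normalize the case before calling duplicated(). The sketch below uses an invented services data frame and tolower(); it is not part of the original example.

```r
# Illustrative data: three spellings of the same service name
services <- data.frame(service_name = c("Hulu", "HULU", "Netflix", "hulu"),
                       service_price = c(7, 7, 18, 7),
                       stringsAsFactors = FALSE)

# Case-sensitive comparison: all four names are treated as different
case_sensitive <- services[!duplicated(services$service_name), ]
print(nrow(case_sensitive))   # 4

# Compare lower-cased names, but keep the original spelling in the result
case_insensitive <- services[!duplicated(tolower(services$service_name)), ]
print(case_insensitive$service_name)   # "Hulu" "Netflix"
```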

Using the unique() function in R
To remove duplicate items from a vector, data frame, or array-like object in R, use the unique() function.

rv <- c(11, 21, 46, 21, 19, 18, 19)

unique(rv)

Output

[1] 11 21 46 19 18

You can see that applying the unique() function to a vector filters out the duplicate elements and returns a vector of unique elements.
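
For a plain vector, unique() and the !duplicated() idiom give the same result. A quick sketch with the same rv vector confirms this and counts how many elements were dropped; the variable names are illustrative.

```r
rv <- c(11, 21, 46, 21, 19, 18, 19)

via_unique <- unique(rv)
via_duplicated <- rv[!duplicated(rv)]
print(identical(via_unique, via_duplicated))   # TRUE

# Number of elements removed as duplicates
n_removed <- length(rv) - length(via_unique)
print(n_removed)   # 2
```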

Extract unique rows from a data frame in R
unique() is a built-in R function that returns a vector, data frame, or array-like object with duplicate elements/rows removed.

print(provider)
cat("======== After removing duplicate rows ==========", "\n")
unique(provider)

Output

  service_id service_name service_price
1         21      Netflix            18
2         19      Disney+            10
3         18       HBOMAX            15
4         46         Hulu             7
5         29      Peacock            12
6         18       HBOMAX            15
7         46         Hulu             7

======== After removing duplicate rows ==========

  service_id service_name service_price
1         21      Netflix            18
2         19      Disney+            10
3         18       HBOMAX            15
4         46         Hulu             7
5         29      Peacock            12

And we get the unique rows from the data frame.
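
For data frames, unique() compares whole rows, so it should match the !duplicated() idiom applied to the full data frame. The sketch below checks this on the provider data and counts the fully duplicated rows; the variable names are illustrative.

```r
provider <- data.frame(service_id = c(21, 19, 18, 46, 29, 18, 46),
                       service_name = c("Netflix", "Disney+", "HBOMAX", "Hulu", "Peacock", "HBOMAX", "Hulu"),
                       service_price = c(18, 10, 15, 7, 12, 15, 7),
                       stringsAsFactors = FALSE)

via_unique <- unique(provider)
via_duplicated <- provider[!duplicated(provider), ]
print(identical(via_unique, via_duplicated))

# How many rows are exact copies of an earlier row
n_dup_rows <- sum(duplicated(provider))
print(n_dup_rows)   # 2
```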

The dplyr package’s distinct() function
distinct() is a function of the dplyr package that keeps only the distinct/unique rows of a data frame. When rows are duplicated, only the first occurrence is kept.

If the dplyr package is not installed on your system, install it first, for example with install.packages("dplyr"). Then you can use the distinct() function.

After installation, you need to load it into your program using the following code.

library(dplyr)

To get the unique rows from the data frame, use the following code.

provider %>% distinct()

See the complete code below.

library(dplyr)

print(provider)

cat("======== Using distinct() method to get unique rows ==========", "\n")

provider %>% distinct()

And we will get the same output as in the sections above.

To remove duplicate rows based on a single column (variable), use the following code.

provider %>% distinct(service_price, .keep_all = TRUE)

To remove duplicate rows based on several columns (variables), use the following code.

provider %>% distinct(service_price, service_name, .keep_all = TRUE)

It will yield the unique rows based on the service_name and service_price columns.
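
For comparison, here is a sketch of a base R equivalent of the multi-column distinct() call, assuming the same provider data frame. The requireNamespace() guard and the key_cols name are additions for illustration, so the snippet still runs when dplyr is not installed.

```r
provider <- data.frame(service_id = c(21, 19, 18, 46, 29, 18, 46),
                       service_name = c("Netflix", "Disney+", "HBOMAX", "Hulu", "Peacock", "HBOMAX", "Hulu"),
                       service_price = c(18, 10, 15, 7, 12, 15, 7),
                       stringsAsFactors = FALSE)

# Base R: keep the first row for each (service_price, service_name) pair
key_cols <- c("service_price", "service_name")
base_result <- provider[!duplicated(provider[, key_cols]), ]
rownames(base_result) <- NULL
print(base_result)

# dplyr version, run only if the package is available
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr_result <- dplyr::distinct(provider, service_price, service_name,
                                  .keep_all = TRUE)
  print(dplyr_result)
}
```

Without .keep_all = TRUE, distinct() returns only the columns named in the call, so the base R counterpart would be unique(provider[, key_cols]).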

Closing Words

To remove duplicate elements from a vector or duplicate rows from a data frame, use base functions like duplicated() or unique(). When working with a large data set and removing duplicate rows, use the dplyr package’s distinct() function.