[Python] How to check and remove duplicates in a table (pandas DataFrame)

A summary of how to check for duplicate elements in a table and how to remove duplicate values.

It is possible to perform the same processing as "duplicate removal" which is frequently used in excel.

> Official page · [Dupplicated](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html?highlight=dupli#pandas.DataFrame.duplicated) · [Drop_duplicates](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.drop_duplicates.html?highlight=dupli#pandas.Index.drop_duplicates)
**table of contents**

[Check for duplicate lines](# 1 Check for duplicate lines)
[Delete Duplicate Lines](# 2 Delete Duplicate Lines)
[Check for duplicates with line headings](Check for duplicates with # 3 line headings)
[Check for duplicate line names](#Check for duplicate line names)
[Delete and confirm duplicate line names](# Delete and confirm duplicate line names)
[Check for duplicates by column name](# 4 Check for duplicates by column name)

### ▼ Table data to be used Check each method using the table below.

-Row0 and row5 are completely duplicated. ・ All elements of col0 and col4 are duplicated ・ Col2 is row0 = row5, row1 = row4 ・ Col3 is row0 = row5, row1 = row3

## 1. 1. Check for duplicate lines -By using ** duplicated method **, rows with duplicate elements can be extracted.

-** Duplicate lines are True **.

-** By default, the upper duplicate line is regarded as False **, and the lines below it are regarded as True. It is also possible to set the bottom side to True as an option. └ Decide whether the target is the top or bottom when deleting. └ It is also possible to delete all duplicate lines.

-It is also possible to specify the target column by column name.

### Main duplicated methods

①duplicated ②duplicated(keep='last') ③duplicated(keep=False) ④duplicated(['aaa']) ⑤duplicated(['aaa','bbb'])

①duplicated `duplicated()` Duplicate rows that are completely duplicated in all columns are judged.

`duplicated`


df.duplicated()

#output
row0    False
row1    False
row2    False
row3    False
row4    False
row5     True
dtype: bool

Only row5 is judged as duplicate (True).

Row = 0 and row = 5 are the same in all columns. By default, the first row is False, so row5 is judged as duplicate.

Default: "keep ='first'" The following two processes are the same.

df.duplicated() df.duplicated(keep='first')

②duplicated(keep='last') `duplicated(keep='last')` The first duplicate line is judged as duplicate. (Leave the back side)

`keep=last`


df.duplicated(keep='last')

#output
row0     True
row1    False
row2    False
row3    False
row4    False
row5    False
dtype: bool

row0 is a duplicate judgment (True).

③duplicated(keep=False) `duplicated(keep=False)` All duplicate lines are judged as duplicates.

`keep=False`


df.duplicated(keep=False)

#output
row0     True
row1    False
row2    False
row3    False
row4    False
row5     True
dtype: bool

Row0 and row5 are duplicate judgment (True).

④duplicated(['aaa']) `duplicated(['aaa'])` └ Duplicate judgment is performed only in the specified column (here, "aaa").

** ▼ In case of duplicated (['col0']) **

`Specify one column`


df.duplicated(['col0'])

#output

row0    False
row1     True
row2     True
row3     True
row4     True
row5     True
dtype: bool

Since all the elements of col0 are 1, all except the first column are judged as duplicates.

⑤duplicated(['aaa','bbb']) `duplicated(['aaa','bbb',,,])` └ Specify multiple columns for duplicate judgment. └ If each specified line is a duplicate judgment (True), a duplicate judgment is made. └ If even one column is different, it will be False.

df.duplicated(['col0','col3'])

#output
row0    False
row1    False
row2    False
row3     True
row4    False
row5     True
dtype: bool

Only row3 and row5 are True. └ row3 of col0 and col3 is True → True └ row5 of col0 and col3 is True → True └ col0 and row1 are True, row1 of col3 is False → False

## 2. Remove duplicate lines Use the drop_duplicates method. Unlike confirmation of duplication ** plural ** 　└`drop_duplicates` 　└`duplicated`

The line that is judged as duplicate (True) by the duplicated method is deleted.

### Main drop_duplicates methods

①drop_duplicates ②drop_duplicates(keep='last') ③drop_duplicates(keep=False) ④drop_duplicates(['aaa']) ⑤drop_duplicates(['aaa','bbb'])

①drop_duplicates `drop_duplicates()` └ Delete lines where all elements are duplicated. └ Leave the top line (keep ='first')

`drop_duplicates`


df.drop_duplicates()

row5 has been deleted.

Duplicate row0 and row5. Since the top row is not judged as duplicate (default keep ='first'), row5 was deleted.

②drop_duplicates(keep='last') `drop_duplicates(keep='last')` └ Delete lines where all elements are duplicated. └ Leave the bottom line (keep ='last')

`keep='last'`


df.drop_duplicates(keep='last')

③drop_duplicates(keep=False) `drop_duplicates(keep=False)` └ Delete all duplicate lines

`keep=False`


df.drop_duplicates(keep=False)S

Deleted both row0 and row5.

④drop_duplicates(['aaa']) `drop_duplicates(['aaa'])` └ Delete duplicate rows in the specified column. └ "aaa": Column name
** ▼ In case of drop_duplicates (['col0']) **

`Specify column`


df.drop_duplicates(['col0'])

All elements of col0 are duplicated from row0 to row5. Since it is a duplicate judgment other than row1, it will be deleted.

** ▼ In case of drop_duplicates (['col0'], keep ='last') **

`Specify column`


drop_duplicates(['col0'], keep='last')

If keep ='last', the bottom line will not be a duplicate check.

⑤drop_duplicates(['aaa','bbb']) `drop_duplicates(['aaa','bbb'])` └ Delete the duplicate row in the specified f multiple columns. └ "aaa" "bbb": Column name
** ▼ For df.drop_duplicates (['col2','col3']) **

`Specify multiple columns`


df.drop_duplicates(['col2', 'col3'])

row5 deleted.

The duplicates in col2 are row1, row4, row5. The duplicates in col3 are row1 and row5. ⇛ Common is row1 and row5. Since the first duplicate row is set to be left (default keep ='first'), only row5 is deleted.

## 3. 3. Check for duplicates in line headings If you specify a row for the object, you can check if there is a row with the same name or display a list of row names without duplicates.

① Confirmation of duplicate line names ② Delete the duplicate line name and check

** ▼ Available options ** keep='first' └ Do not judge duplicates for the first line name keep='last' └ Do not judge the last line name of duplicates as duplicates keep=False └ All duplicate line names are considered duplicates └ If False, no quotation is required

### ① Confirmation of duplicate line names `df.index.duplicated()` └ "df": Table data └ ".index": Get line name list └ ".duplicated ()": Show duplicate lines as True

Since the option of ".duplicated ()" is "keep ='first'" by default, the first duplicate line is False, and the second and subsequent lines are True.

`Check for duplicate line names`


df.index.duplicated()

#output
# array([False, False, False, False,  True,  True])

▼ Original table

The 0th and 5th, 1st and 4th line names are duplicated. ⇛ The 5th and 4th are duplicate judgments.

** ▼ If keep = False **

`All duplicate line names are True`


df.index.duplicated(keep=False)

#output
# array([ True,  True, False, False,  True,  True])

### ② Delete the duplicate line name and check `df.index.drop_duplicates()`

How to get a list of row names by removing duplicate row names.

Used when you want to check the line name.

`Get by deleting the line name`


df.index.drop_duplicates()

#output
# Index(['AAA', 'BBB', 'CCC', 'DDD'], dtype='object')

The second "AAA" and "BBB" that were duplicated have been deleted.

▼ Original table

## 4. Duplicate confirmation by column name If you specify a column name in the object, you can do the same with the column name.

df.columns.duplicated() df.columns.drop_duplicates() └ ".columns": Extract the column names of the table with list.

[Back to top](#python table duplicate check and how to remove duplicates)

[PYTHON] [Pandas] How to check duplicates and delete duplicates in a table (equivalent to deleting duplicates in Excel)

[Python] How to check and remove duplicates in a table (pandas DataFrame)

duplicated

keep=last

keep=False

Specify one column

drop_duplicates

keep='last'

keep=False

Specify column

Specify column

Specify multiple columns

Check for duplicate line names

All duplicate line names are True

Get by deleting the line name

`duplicated`

`keep=last`

`keep=False`

`Specify one column`

`drop_duplicates`

`keep='last'`

`keep=False`

`Specify column`

`Specify column`

`Specify multiple columns`

`Check for duplicate line names`

`All duplicate line names are True`

`Get by deleting the line name`