[PYTHON] [Pandas] How to check duplicates and delete duplicates in a table (equivalent to deleting duplicates in Excel)

[Python] How to check and remove duplicates in a table (pandas DataFrame)

A summary of how to check for duplicate elements in a table and how to remove duplicate values.

It is possible to perform the same processing as "duplicate removal" which is frequently used in excel.


> Official page · [Dupplicated](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html?highlight=dupli#pandas.DataFrame.duplicated) · [Drop_duplicates](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.drop_duplicates.html?highlight=dupli#pandas.Index.drop_duplicates)
**table of contents**
  1. [Check for duplicate lines](# 1 Check for duplicate lines)
    1. duplicated
    2. duplicated(keep='last')
    3. duplicated(keep=False)
    4. duplicated(['aaa'])
    5. duplicated(['aaa','bbb'])
  2. [Delete Duplicate Lines](# 2 Delete Duplicate Lines)
    1. drop_duplicates
    2. drop_duplicates(keep='last')
    3. drop_duplicates(keep=False)
    4. drop_duplicates(['aaa'])
    5. drop_duplicates(['aaa','bbb'])
  3. [Check for duplicates with line headings](Check for duplicates with # 3 line headings)
  4. [Check for duplicate line names](#Check for duplicate line names)
  5. [Delete and confirm duplicate line names](# Delete and confirm duplicate line names)
  6. [Check for duplicates by column name](# 4 Check for duplicates by column name)

### ▼ Table data to be used Check each method using the table below.

image.png

-Row0 and row5 are completely duplicated. ・ All elements of col0 and col4 are duplicated ・ Col2 is row0 = row5, row1 = row4 ・ Col3 is row0 = row5, row1 = row3


## 1. 1. Check for duplicate lines -By using ** duplicated method **, rows with duplicate elements can be extracted.

-** Duplicate lines are True **.

-** By default, the upper duplicate line is regarded as False **, and the lines below it are regarded as True. It is also possible to set the bottom side to True as an option. └ Decide whether the target is the top or bottom when deleting. └ It is also possible to delete all duplicate lines.

-It is also possible to specify the target column by column name.


### Main duplicated methods

①duplicated ②duplicated(keep='last') ③duplicated(keep=False) ④duplicated(['aaa']) ⑤duplicated(['aaa','bbb'])


①duplicated `duplicated()` Duplicate rows that are completely duplicated in all columns are judged.

duplicated


df.duplicated()

#output
row0    False
row1    False
row2    False
row3    False
row4    False
row5     True
dtype: bool

Only row5 is judged as duplicate (True).

Row = 0 and row = 5 are the same in all columns. By default, the first row is False, so row5 is judged as duplicate.

Default: "keep ='first'" The following two processes are the same.

df.duplicated() df.duplicated(keep='first')


②duplicated(keep='last') `duplicated(keep='last')` The first duplicate line is judged as duplicate. (Leave the back side)

keep=last


df.duplicated(keep='last')

#output
row0     True
row1    False
row2    False
row3    False
row4    False
row5    False
dtype: bool

row0 is a duplicate judgment (True).


③duplicated(keep=False) `duplicated(keep=False)` All duplicate lines are judged as duplicates.

keep=False


df.duplicated(keep=False)

#output
row0     True
row1    False
row2    False
row3    False
row4    False
row5     True
dtype: bool

Row0 and row5 are duplicate judgment (True).


④duplicated(['aaa']) `duplicated(['aaa'])` └ Duplicate judgment is performed only in the specified column (here, "aaa").

** ▼ In case of duplicated (['col0']) **

Specify one column


df.duplicated(['col0'])

#output

row0    False
row1     True
row2     True
row3     True
row4     True
row5     True
dtype: bool

Since all the elements of col0 are 1, all except the first column are judged as duplicates.


⑤duplicated(['aaa','bbb']) `duplicated(['aaa','bbb',,,])` └ Specify multiple columns for duplicate judgment. └ If each specified line is a duplicate judgment (True), a duplicate judgment is made. └ If even one column is different, it will be False.
df.duplicated(['col0','col3'])

#output
row0    False
row1    False
row2    False
row3     True
row4    False
row5     True
dtype: bool

Only row3 and row5 are True. └ row3 of col0 and col3 is True → True └ row5 of col0 and col3 is True → True └ col0 and row1 are True, row1 of col3 is False → False


## 2. Remove duplicate lines Use the drop_duplicates method. Unlike confirmation of duplication ** plural **  └`drop_duplicates`  └`duplicated`

The line that is judged as duplicate (True) by the duplicated method is deleted.


### Main drop_duplicates methods

①drop_duplicates ②drop_duplicates(keep='last') ③drop_duplicates(keep=False) ④drop_duplicates(['aaa']) ⑤drop_duplicates(['aaa','bbb'])


①drop_duplicates `drop_duplicates()` └ Delete lines where all elements are duplicated. └ Leave the top line (keep ='first')

drop_duplicates


df.drop_duplicates()

image.png

row5 has been deleted.

Duplicate row0 and row5. Since the top row is not judged as duplicate (default keep ='first'), row5 was deleted.


②drop_duplicates(keep='last') `drop_duplicates(keep='last')` └ Delete lines where all elements are duplicated. └ Leave the bottom line (keep ='last')

keep='last'


df.drop_duplicates(keep='last')

image.png


③drop_duplicates(keep=False) `drop_duplicates(keep=False)` └ Delete all duplicate lines

keep=False


df.drop_duplicates(keep=False)S

image.png

Deleted both row0 and row5.


④drop_duplicates(['aaa']) `drop_duplicates(['aaa'])` └ Delete duplicate rows in the specified column. └ "aaa": Column name
** ▼ In case of drop_duplicates (['col0']) **

Specify column


df.drop_duplicates(['col0'])

image.png

All elements of col0 are duplicated from row0 to row5. Since it is a duplicate judgment other than row1, it will be deleted.


** ▼ In case of drop_duplicates (['col0'], keep ='last') **

Specify column


drop_duplicates(['col0'], keep='last')

image.png

If keep ='last', the bottom line will not be a duplicate check.


⑤drop_duplicates(['aaa','bbb']) `drop_duplicates(['aaa','bbb'])` └ Delete the duplicate row in the specified f multiple columns. └ "aaa" "bbb": Column name
** ▼ For df.drop_duplicates (['col2','col3']) **

Specify multiple columns


df.drop_duplicates(['col2', 'col3'])

image.png

row5 deleted.

The duplicates in col2 are row1, row4, row5. The duplicates in col3 are row1 and row5. ⇛ Common is row1 and row5. Since the first duplicate row is set to be left (default keep ='first'), only row5 is deleted.


## 3. 3. Check for duplicates in line headings If you specify a row for the object, you can check if there is a row with the same name or display a list of row names without duplicates.

① Confirmation of duplicate line names ② Delete the duplicate line name and check

** ▼ Available options ** keep='first' └ Do not judge duplicates for the first line name keep='last' └ Do not judge the last line name of duplicates as duplicates keep=False └ All duplicate line names are considered duplicates └ If False, no quotation is required


### ① Confirmation of duplicate line names `df.index.duplicated()` └ "df": Table data └ ".index": Get line name list └ ".duplicated ()": Show duplicate lines as True

Since the option of ".duplicated ()" is "keep ='first'" by default, the first duplicate line is False, and the second and subsequent lines are True.

Check for duplicate line names


df.index.duplicated()

#output
# array([False, False, False, False,  True,  True])

▼ Original table image.png

The 0th and 5th, 1st and 4th line names are duplicated. ⇛ The 5th and 4th are duplicate judgments.


** ▼ If keep = False **

All duplicate line names are True


df.index.duplicated(keep=False)

#output
# array([ True,  True, False, False,  True,  True])

### ② Delete the duplicate line name and check `df.index.drop_duplicates()`

How to get a list of row names by removing duplicate row names.

Used when you want to check the line name.

Get by deleting the line name


df.index.drop_duplicates()

#output
# Index(['AAA', 'BBB', 'CCC', 'DDD'], dtype='object')

The second "AAA" and "BBB" that were duplicated have been deleted.

▼ Original table image.png


## 4. Duplicate confirmation by column name If you specify a column name in the object, you can do the same with the column name.

df.columns.duplicated() df.columns.drop_duplicates() └ ".columns": Extract the column names of the table with list.


[Back to top](#python table duplicate check and how to remove duplicates)

Recommended Posts

[Pandas] How to check duplicates and delete duplicates in a table (equivalent to deleting duplicates in Excel)
[Python] How to output a pandas table to an excel file
[GCF + Python] How to upload Excel to GCS and create a new table in BigQuery
[Python] How to add rows and columns to a table (pandas DataFrame)
[Python] How to delete rows and columns in a table (list of drop method options)
<Pandas> How to handle time series data in a pivot table
How to format a table using Pandas apply, pivot and swaplevel
How to get a specific column name and index name in pandas DataFrame
How to display DataFrame as a table in Markdown
How to check if a value exists in an enum
How to create dataframes and mess with elements in pandas
While solving the introductory statistics exercise 12.10, check how to draw a scatter plot in pandas.
How to write soberly in pandas
How to delete a Docker container
How to paste a CSV file into an Excel file using Pandas
How to compare lists and retrieve common elements in a list
How to check the memory size of a dictionary in Python
How to delete multiple specified positions (indexes) in a Python list
How to split and save a DataFrame
How to delete expired sessions in Django
How to get a stacktrace in python
How to display multiplication table in python
How to reassign index in pandas dataframe
How to put a half-width space before letters and numbers in Python.
How to check opencv version in python
How to make a container name a subdomain and make it accessible in Docker
How to read CSV files in Pandas
How to use is and == in Python
How to use pandas Timestamp and date_range
[Note] How to deal with unicode error and No such file or directory (output table to excel file with pandas)
How to stop a program in python until a specific date and time
How to mention a user group in slack notification, how to check the id of the user group
[Python] How to name table data and output it in csv (to_csv method)
How to count the number of elements in Django and output to a template
How to read a serial number file in a loop, process it, and graph it
The first step to log analysis (how to format and put log data in Pandas)
How to clear tuples in a list (Python)
How to generate permutations in Python and C ++
How to embed a variable in a python string
[Python] How to read excel file with pandas
How to implement a gradient picker in Houdini
How to check / extract files in RPM package
How to notify a Discord channel in Python
[Python] How to draw a histogram in Matplotlib
How to create a Rest Api in Django
How to write async and await in Vue.js
How to write a named tuple document in 2020
How to count numbers in a specific range
Add totals to rows and columns in pandas
How to read a file in a different directory
How to Mock a Public function in Pytest
How to plot autocorrelation and partial autocorrelation in python
How to split machine learning training data into objective variables and others in Pandas
How to use any or all to check if it is in a dictionary (Hash)
How to check in Python if one of the elements of a list is in another list
How to use Docker to containerize your application and how to use Docker Compose to run your application in a development environment
How to specify a schema in Django's database settings
How to convert / restore a string with [] in python
[Python] How to expand variables in a character string
A memorandum on how to use keras.preprocessing.image in Keras
How to access with cache when reading_json in pandas