--Environment --macOS Catalina version 10.15.7 - Python 3.8.5 - pandas 1.1.3
The sample data is such data read from CSV. Focus on the [ID] column of this data and try various unique rows.
if __name__ == '__main__':
    df = pandas.read_csv('CSV.csv')
    print(df)
#ID Date Start End
# 0   20.0  10/14   9:00  18:00
# 1   16.0  10/15  10:00  19:00
# 2   39.0  10/16  11:00  20:00
# 3    NaN  10/17  12:00  21:00
# 4   47.0  10/18  20:00  22:00
# 5   29.0  10/19  11:00  12:00
# 6   62.0  10/20  15:00  17:00
# 7    NaN  10/21  16:00  18:00
# 8   16.0  10/22  17:00  19:00
# 9    3.0  10/23  18:00  20:00
# 10  16.0  10/24  19:00  21:00
# 11  20.0  10/25  20:00  21:00
| line number | ID | date | ... | [ID]Duplicate columns しているline | 
|---|---|---|---|---|
| 0 | 20.0 | 10/14 | ... | o | 
| 1 | 16.0 | 10/15 | ... | o | 
| 2 | 39.0 | 10/16 | ... | |
| 3 | NaN | 10/17 | ... | o | 
| 4 | 47.0 | 10/18 | ... | |
| 5 | 29.0 | 10/19 | ... | |
| 6 | 62.0 | 10/20 | ... | |
| 7 | NaN | 10/21 | ... | o | 
| 8 | 16.0 | 10/22 | ... | o | 
| 9 | 3.0 | 10/23 | ... | |
| 10 | 16.0 | 10/24 | ... | o | 
| 11 | 20.0 | 10/25 | ... | o | 
--Reference: Count the number and frequency (number of occurrences) of unique elements in pandas | note.nkmk.me
#When including NaN
print(df.nunique(dropna=False)['ID'])
# 8
#When NaN is excluded
print(df.nunique()['ID'])
# 7
--Reference - pandas.DataFrame.value_counts — pandas 1.1.3 documentation - pandas.DataFrame.groupby — pandas 1.1.3 documentation -query to extract rows of pandas.DataFrame by condition | note.nkmk.me
| the difference | groupby | value_counts | Remarks | 
|---|---|---|---|
| Column heading | Yes | None | |
| sort | Column value | frequency | sort=If set to False, the order will be the same | 
#When including NaN
print(df.groupby('ID', dropna=False).size())
# ID
# 3.0     1
# 16.0    3
# 20.0    2
# 29.0    1
# 39.0    1
# 47.0    1
# 62.0    1
# NaN     2
# dtype: int64
print(df['ID'].value_counts(dropna=False))
# 16.0    3
# NaN     2
# 20.0    2
# 3.0     1
# 62.0    1
# 29.0    1
# 47.0    1
# 39.0    1
# Name: ID, dtype: int64
#When NaN is excluded
print(df.groupby('ID', sort=False).size())
print(df['ID'].value_counts(sort=False))
# 20.0    2
# 16.0    3
# 39.0    1
# 47.0    1
# 29.0    1
# 62.0    1
# 3.0     1
# Name: ID, dtype: int64
#When including NaN
val_con = df.groupby('ID', dropna=False).size()
print(val_con[val_con >= 2])
# ID
# 16.0    3
# 20.0    2
# NaN     2
# dtype: int64
#When NaN is excluded
val_con = df['ID'].value_counts()
print(val_con[val_con >= 2])
# 16.0    3
# 20.0    2
# Name: ID, dtype: int64
--Reference - pandas.DataFrame.duplicated — pandas 1.1.3 documentation -query to extract rows of pandas.DataFrame by condition | note.nkmk.me -Extract / delete duplicate rows of pandas.DataFrame, Series | note.nkmk.me
#When including NaN
print(df.duplicated(subset=['ID']))
# 0     False
# 1     False
# 2     False
# 3     False
# 4     False
# 5     False
# 6     False
# 7      True
# 8      True
# 9     False
# 10     True
# 11     True
# dtype: bool
#When NaN is excluded
print(df.dropna(subset=['ID']).duplicated(subset=['ID']))
# 0     False
# 1     False
# 2     False
# 4     False
# 5     False
# 6     False
# 8      True
# 9     False
# 10     True
# 11     True
# dtype: bool
duplicated defaults to False for the first duplicate line ( keep ='first').
Depending on the value specified for keep, which part of the duplicate line becomes False changes.
| value of keep | ID | keep='first' (Default) | keep='last' | keep=False | 
|---|---|---|---|---|
| 0 | 20 | False | True | True | 
| 1 | 16 | False | True | True | 
| 2 | False | False | False | |
| 3 | NaN | False | True | True | 
| 4 | False | False | False | |
| 5 | False | False | False | |
| 6 | False | False | False | |
| 7 | NaN | True | False | True | 
| 8 | 16 | True | True | True | 
| 9 | False | False | False | |
| 10 | 16 | True | False | True | 
| 11 | 20 | True | False | True | 
# df.shape[0]You can get all the cases with
if df.shape[0] == df.nunique()['ID']:
    print('ID column values are unique')
else:
    print('ID column is not unique')
#When including NaN
print(df[df.duplicated(subset=['ID'])])
#ID Date Start End
# 0   20.0  10/14   9:00  18:00
# 1   16.0  10/15  10:00  19:00
# 3    NaN  10/17  12:00  21:00
# 7    NaN  10/21  16:00  18:00
# 8   16.0  10/22  17:00  19:00
# 10  16.0  10/24  19:00  21:00
# 11  20.0  10/25  20:00  21:00
#When NaN is excluded
df_dropna = df.dropna(subset=['ID'])
print(df_dropna[df_dropna.duplicated(subset=['ID'], keep=False)])
#ID Date Start End
# 0   20.0  10/14   9:00  18:00
# 1   16.0  10/15  10:00  19:00
# 8   16.0  10/22  17:00  19:00
# 10  16.0  10/24  19:00  21:00
# 11  20.0  10/25  20:00  21:00
Recommended Posts