[Pandas] Delete duplicates while filling in missing values

Introduction

When removing duplicates from a pandas DataFrame by some key, you may want to first fill in missing values using the other records that share the same key, and only then drop the duplicates.

import pandas as pd

df = pd.DataFrame({
    'building_name': ['Building A', 'Building A', None, 'C building', 'B building', None, 'D building'],
    'property_scale': ['large', 'large', 'small', 'small', 'small', 'small', 'large'],
    'city_code': [1, 1, 1, 2, 1, 1, 1]
})
df
   building_name  property_scale  city_code
0     Building A           large          1
1     Building A           large          1
2           None           small          1
3     C building           small          2
4     B building           small          1
5           None           small          1
6     D building           large          1
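
For comparison, here is a small sketch of what plain `DataFrame.drop_duplicates` does with the same key on this data. The variable names `sample` and `plain` are only for illustration. It keeps the first row of each key as-is, so the ('small', 1) key ends up with a missing building_name even though another row with the same key has one.

```python
import pandas as pd

# Sample data matching the table above (rebuilt so the snippet runs on its own).
sample = pd.DataFrame({
    'building_name': ['Building A', 'Building A', None, 'C building', 'B building', None, 'D building'],
    'property_scale': ['large', 'large', 'small', 'small', 'small', 'small', 'large'],
    'city_code': [1, 1, 1, 2, 1, 1, 1]
})

# Plain de-duplication keeps the first row per key without filling anything.
plain = sample.drop_duplicates(subset=['property_scale', 'city_code'])
print(plain)
```

This is the situation the function in the next section is meant to avoid.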

Function for filling missing values and removing duplicates

from pandas import DataFrame

def drop_duplicates(df: DataFrame, subset: list, fillna: bool = False) -> DataFrame:
    """Delete duplicates on the subset key, optionally filling missing values first.

    Args:
        df (DataFrame): Arbitrary data frame
        subset (list): Columns used as the key for detecting duplicates
        fillna (bool): Whether to fill missing values from other records with the same key. Default False.

    Returns:
        DataFrame

    """
    if fillna:
        # Within each key group, back-fill and then forward-fill so that every
        # row picks up a non-missing value from its duplicates when one exists.
        # .bfill()/.ffill() replace the deprecated fillna(method=...) form.
        df = pd.concat([
            group.bfill().ffill()
            for _, group in df.groupby(by=subset)])
    return df.drop_duplicates(subset=subset)
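
As a possible alternative, the same "first non-missing value per key" result can be sketched with `GroupBy.first()`, which skips missing values by default. This is not the approach used in this post, and `first_non_missing_per_key` and `example` are hypothetical names introduced only for this sketch.

```python
import pandas as pd

def first_non_missing_per_key(df: pd.DataFrame, subset: list) -> pd.DataFrame:
    """Hypothetical helper: one row per key, each column's first non-missing value."""
    # GroupBy.first() skips missing values by default, so each remaining column
    # gets the first non-missing entry found within its key group.
    return df.groupby(by=subset, as_index=False).first()

# Small illustrative frame: the ('small', 1) key has a name only in its second row.
example = pd.DataFrame({
    'building_name': [None, 'B building', None, 'C building'],
    'property_scale': ['small', 'small', 'small', 'small'],
    'city_code': [1, 1, 1, 2]
})
result = first_non_missing_per_key(example, ['property_scale', 'city_code'])
print(result)
```

Note the differences from the function above: the key columns move to the front, and the original row index is replaced by a fresh 0..n-1 index, so this variant only fits when the original index does not matter.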

Run

drop_duplicates(df, ['property_scale', 'city_code'], True)
   building_name  property_scale  city_code
0     Building A           large          1
2     B building           small          1
3     C building           small          2
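
One caveat worth checking before relying on this: the function groups rows with `groupby`, and `groupby` by default leaves out rows whose key contains a missing value. The sketch below uses hypothetical data (the 'E building' row is made up for illustration) and assumes pandas 1.1 or later, where `groupby` accepts `dropna=False`.

```python
import pandas as pd

# Hypothetical data: the last row has a missing value in the key column.
with_missing_key = pd.DataFrame({
    'building_name': ['Building A', None, 'E building'],
    'property_scale': ['large', 'large', None],
})

# Default behaviour: the row whose key is missing belongs to no group.
default_sizes = with_missing_key.groupby(by=['property_scale']).size()

# dropna=False keeps missing keys as their own group.
kept_sizes = with_missing_key.groupby(by=['property_scale'], dropna=False).size()

print(default_sizes.sum(), kept_sizes.sum())
```

If the key columns can contain missing values, passing `dropna=False` to `groupby` inside the function is one way to keep those rows.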
