[PYTHON] Convert one-hot encoded features back to the original category values

The MovieLens dataset was awkward to process in this form, so I am leaving this as a memo.

What I want to do

The dataset is provided in a one-hot encoded state, as shown below.

   movie_id  action  horror  romance  sf
0         1       1       0        0   0
1         2       0       0        1   0
2         2       1       0        0   0
3         3       0       0        0   1
4         3       1       0        0   0
5         4       0       1        0   0
6         5       0       0        0   1
7         5       0       1        0   0
8         5       1       0        0   0
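To follow along, the table above can be rebuilt as a DataFrame. This is only a sketch: the values are copied from the printed table, and the name df_onehot is chosen to match the variable used in the call later in this memo.

```python
import pandas as pd

# Rebuild the sample table; every value is copied from the printed output above.
df_onehot = pd.DataFrame({
    'movie_id': [1, 2, 2, 3, 3, 4, 5, 5, 5],
    'action':   [1, 0, 1, 0, 1, 0, 0, 0, 1],
    'horror':   [0, 0, 0, 0, 0, 1, 0, 1, 0],
    'romance':  [0, 1, 0, 0, 0, 0, 0, 0, 0],
    'sf':       [0, 0, 0, 1, 0, 0, 1, 0, 0],
})

print(df_onehot)
```

Note that a movie_id can appear on several rows, one per genre, which is why the id column has to be carried through the conversion.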

I want to convert it back to the categorical state it had before one-hot encoding, as shown below.

   movie_id    genre
0         1   action
1         2  romance
2         2   action
3         3       sf
4         3   action
5         4   horror
6         5       sf
7         5   horror
8         5   action

Method

Prepare the following function.

import pandas as pd


def convert_onehot_to_category(df, id_col, one_hot_columns, category_col='category'):
    frames = []
    for col in one_hot_columns:
        # Keep only the rows where this one-hot column is 1 or more
        df_each = df.loc[df[col] >= 1, [id_col]].copy()
        # Store the column name itself as the category value
        df_each[category_col] = col
        frames.append(df_each)

    # Nothing to convert: return an empty frame with the expected columns
    if not frames:
        return pd.DataFrame(columns=[id_col, category_col])

    df_concat = pd.concat(frames, axis=0)

    # Remove duplicates and renumber the index, then sort by id
    # (the index is renumbered before sorting, so it is not sequential in the result)
    df_concat = df_concat.drop_duplicates().reset_index(drop=True).sort_values(by=id_col)
    return df_concat

Pass the following three arguments, as in the example below:

- the column names created by one-hot encoding
- the column containing the id
- the column name to use for the converted category values

# df_onehot is the one-hot encoded DataFrame shown at the top
genres = ['action', 'romance', 'sf', 'horror']
id_col = 'movie_id'
category_col = 'genre'

df_category = convert_onehot_to_category(df_onehot, id_col=id_col, one_hot_columns=genres, category_col=category_col)

print(df_category)

The data is converted back to the original category values.

  movie_id    genre
0        1   action
1        2   action
4        2  romance
2        3   action
5        3       sf
7        4   horror
3        5   action
6        5       sf
8        5   horror
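For comparison, the same reshaping can probably be written more compactly with DataFrame.melt. The following is a minimal sketch, not the method used in this memo: the helper name onehot_to_category_melt, the temporary column name 'flag', and the small sample frame are all assumptions made for illustration. It also assumes that neither id_col nor category_col is named 'flag'.

```python
import pandas as pd


def onehot_to_category_melt(df, id_col, one_hot_columns, category_col='category'):
    # Hypothetical helper (not part of the original memo).
    # Unpivot the one-hot columns into (id, category, flag) rows.
    long_df = df.melt(
        id_vars=[id_col],
        value_vars=list(one_hot_columns),
        var_name=category_col,
        value_name='flag',  # temporary column name, assumed not to clash
    )
    # Keep only the rows where the one-hot flag is set.
    long_df = long_df[long_df['flag'] >= 1]
    # Drop duplicates, sort by id and category, and renumber the index.
    return (
        long_df[[id_col, category_col]]
        .drop_duplicates()
        .sort_values(by=[id_col, category_col])
        .reset_index(drop=True)
    )


# Small sample in the same shape as the dataset in this memo.
sample = pd.DataFrame({
    'movie_id': [1, 2, 2, 3],
    'action':   [1, 0, 1, 0],
    'romance':  [0, 1, 0, 0],
    'sf':       [0, 0, 0, 1],
})

result = onehot_to_category_melt(
    sample, 'movie_id', ['action', 'romance', 'sf'], category_col='genre'
)
print(result)
```

Unlike the function above, this sketch sorts by both id and category and renumbers the index after sorting, so the row order within each id is alphabetical rather than following the order of one_hot_columns.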
