Data source

https://drive.google.com/file/d/1y5DYn0dGoSbC22xowBq2d4po6h1JxcTQ/view?usp=sharing

Individual purchase data (before conversion)

Each row corresponds to one user ID, and each column holds one item purchased by that user. The number of columns equals the number of items bought by the user with the most purchases, so the unused cells for every other user contain NaN.
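To make the layout concrete, here is a minimal sketch of a table with this shape. The user IDs and item names are invented for illustration and are not taken from the linked file.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: user IDs and item names are made up.
# user_1 bought the most items (3), so the table has 3 columns;
# the other users' unused cells are NaN.
df = pd.DataFrame(
    [
        ["apple", "bread", "milk"],
        ["bread", np.nan, np.nan],
        ["milk", "apple", np.nan],
    ],
    index=["user_1", "user_2", "user_3"],
)
print(df)
```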

Table you want to create (after conversion)

Assign each column to a specific item, and use 1/0 to indicate whether each user purchased that item.

How to do it

Use scikit-learn's MultiLabelBinarizer. Let df be the data frame before conversion. The output after conversion is df_trans.

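import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Replace NaN with a placeholder string that is not a real item name
df = df.fillna("none")

# Learn the set of items and build the 0/1 indicator matrix in one step
mlb = MultiLabelBinarizer()
result = mlb.fit_transform(df.values)

# Label the columns with the item names, keep the user index, and drop the placeholder column
df_trans = pd.DataFrame(result, columns=mlb.classes_, index=df.index).drop("none", axis=1)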

If the data frame contains NaN, MultiLabelBinarizer raises an error, so first replace NaN with a placeholder string. It does not have to be "none"; any string works as long as it does not collide with a real item name.
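As an alternative to a placeholder string, the NaN cells can be dropped row by row before fitting, so no extra column has to be removed afterwards. This is a different technique from the one used in this article, shown only as a sketch; the sample data and the name `baskets` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical sample data (not from the linked file).
df = pd.DataFrame(
    [
        ["apple", "bread", "milk"],
        ["bread", np.nan, np.nan],
        ["milk", "apple", np.nan],
    ],
    index=["user_1", "user_2", "user_3"],
)

# One list of purchased items per user, with the NaN cells removed.
baskets = [row.dropna().tolist() for _, row in df.iterrows()]

mlb = MultiLabelBinarizer()
df_trans = pd.DataFrame(
    mlb.fit_transform(baskets),
    columns=mlb.classes_,
    index=df.index,
)
print(df_trans)
```

Because no placeholder is introduced, there is no risk of it clashing with a real item name.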

Create a MultiLabelBinarizer object and call its fit_transform method. Pass the data as a NumPy array by giving df.values as the argument.

The column names (item names) can be retrieved from mlb.classes_.
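The relationship between mlb.classes_ and the columns of the result can be illustrated as follows. This is a small sketch with made-up purchase lists: column i of the indicator matrix corresponds to mlb.classes_[i], and inverse_transform maps each 0/1 row back to item names.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical purchase lists, one per user.
baskets = [["milk", "apple"], ["bread"], ["apple", "bread", "milk"]]

mlb = MultiLabelBinarizer()
result = mlb.fit_transform(baskets)

# classes_ holds the item names in sorted order, one per output column.
items = list(mlb.classes_)
print(items)

# inverse_transform recovers the item names from each 0/1 row.
recovered = mlb.inverse_transform(result)
print(recovered)
```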

Finally, remove the placeholder column none with the drop method to get the converted table.
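The whole procedure can be tried end to end on a small table. The sketch below uses invented sample data and the hypothetical variable name `raw`; it also adds a sanity check, not part of the original steps, that each user's row sum equals the number of distinct items that user bought.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical sample data (not from the linked file).
raw = pd.DataFrame(
    [
        ["apple", "bread", "milk"],
        ["bread", np.nan, np.nan],
        ["milk", "apple", np.nan],
    ],
    index=["user_1", "user_2", "user_3"],
)

# Same steps as in the article: fill NaN, binarize, drop the placeholder.
df = raw.fillna("none")
mlb = MultiLabelBinarizer()
result = mlb.fit_transform(df.values)
df_trans = pd.DataFrame(result, columns=mlb.classes_, index=df.index).drop("none", axis=1)
print(df_trans)

# Sanity check: the 1s in each row should add up to the number of
# distinct items the user purchased (nunique ignores NaN).
row_sums_match = bool((df_trans.sum(axis=1) == raw.nunique(axis=1)).all())
print(row_sums_match)
```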
