[PYTHON] Organize individual purchase data into a table with scikit-learn's MultiLabelBinarizer

Data source

https://drive.google.com/file/d/1y5DYn0dGoSbC22xowBq2d4po6h1JxcTQ/view?usp=sharing

Individual purchase data (before conversion)

[Screenshot: purchase data before conversion]

Each row corresponds to a user ID, and each column holds one item purchased by that user. The number of columns equals the number of items bought by the user with the most purchases, so the leftover cells for users who bought fewer items contain NaN.
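To make the layout concrete, here is a minimal sketch that builds a table of the same shape from made-up data. The user IDs and item names are hypothetical and are not taken from the data source above.

```python
import pandas as pd

# Hypothetical toy data: user ID -> items purchased by that user
purchases = {
    "user_a": ["apple", "milk", "bread"],
    "user_b": ["milk"],
    "user_c": ["bread", "eggs"],
}

# One Series per user; aligning Series of different lengths leaves
# missing values in the shorter ones. Transpose so each row is a user.
df = pd.DataFrame({user: pd.Series(items) for user, items in purchases.items()}).T

print(df.shape)  # (number of users, largest number of purchases)
print(df)
```

The table is as wide as the longest purchase list, and every shorter row is padded with missing values on the right.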

Table you want to create (after conversion)

[Screenshot: table after conversion]

Assign each column to a specific item, and mark whether each user purchased that item with 1 or 0.

How to do it

Use scikit-learn's MultiLabelBinarizer. Let df be the data frame before conversion; the converted output is df_trans.

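import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Replace the NaN padding with a placeholder string
df = df.fillna("none")

# Produce one 0/1 column per distinct value (including the placeholder)
mlb = MultiLabelBinarizer()
result = mlb.fit_transform(df.values)

# Label the columns with the item names and drop the placeholder column
df_trans = pd.DataFrame(result, columns=mlb.classes_).drop('none', axis=1)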

If the data frame contains NaN, MultiLabelBinarizer raises an error, so replace NaN with a suitable placeholder string first. Any string works as long as it does not collide with a real item name; it does not have to be none.
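As a possible alternative to a placeholder (not the approach used in this post), the NaN padding can be skipped row by row so that no extra column ever appears. This is only a sketch on a hypothetical toy table; the user IDs and item names are made up.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy table: rows are users, NaN pads the shorter rows
df = pd.DataFrame(
    [["apple", "milk", "bread"],
     ["milk", np.nan, np.nan],
     ["bread", "eggs", np.nan]],
    index=["user_a", "user_b", "user_c"],
)

# Build one list of items per user, leaving out the NaN padding
item_lists = [row.dropna().tolist() for _, row in df.iterrows()]

mlb = MultiLabelBinarizer()
result = mlb.fit_transform(item_lists)

# No placeholder column to drop; index=df.index keeps the user IDs as row labels
df_trans = pd.DataFrame(result, columns=mlb.classes_, index=df.index)
print(df_trans)
```

This avoids having to pick a string that is guaranteed not to be an item name, at the cost of a Python-level loop over the rows, which may be slower on large tables.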

Create a MultiLabelBinarizer object and call its fit_transform method, passing the data as a NumPy array via df.values.

The column names (item names) can be retrieved from mlb.classes_.

Finally, remove the none column with the drop method to get the converted table.
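The steps above can be tried end to end on a small table. The sketch below uses a hypothetical three-user data frame in place of the real data; passing index=df.index is an addition not present in the code above, included only so the user IDs survive as row labels.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy table standing in for df
df = pd.DataFrame(
    [["apple", "milk", "bread"],
     ["milk", np.nan, np.nan],
     ["bread", "eggs", np.nan]],
    index=["user_a", "user_b", "user_c"],
)

# Step 1: replace NaN with a placeholder that is not a real item name
filled = df.fillna("none")

# Step 2: binarize; each row of the array is treated as one set of labels
mlb = MultiLabelBinarizer()
result = mlb.fit_transform(filled.values)

# Step 3: classes_ holds the item names plus the "none" placeholder
print(list(mlb.classes_))

# Step 4: build the table and drop the placeholder column
df_trans = pd.DataFrame(result, columns=mlb.classes_, index=df.index).drop("none", axis=1)
print(df_trans)
```

Each user ends up with a 1 in the columns of the items they bought and a 0 everywhere else, regardless of the order in which the items appeared in the original row.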
