https://drive.google.com/file/d/1y5DYn0dGoSbC22xowBq2d4po6h1JxcTQ/view?usp=sharing
Each row corresponds to a user ID, and each column holds one item purchased by that user. The number of columns equals the number of items bought by the user with the most purchases, so the unused cells for every other user contain NaN.
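As a rough illustration of this layout, the sketch below builds a small table of the same shape. The user IDs and item names are made up for the example and are not taken from the original data.

```python
import pandas as pd

# Hypothetical purchase histories (names are illustrative only)
purchases = {
    "user_1": ["apple", "banana", "milk"],
    "user_2": ["banana"],
    "user_3": ["milk", "bread"],
}

# orient="index" makes each dict key a row; rows shorter than the
# longest one are padded with missing values
df = pd.DataFrame.from_dict(purchases, orient="index")
print(df)
```

Here user_1 has the most purchases (three), so the table has three columns and the remaining cells of user_2 and user_3 are missing.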
Assign each column to a specific item and indicate with 1/0 whether each user purchased it
Use scikit-learn's MultiLabelBinarizer. Let df be the data frame before conversion; the converted output is df_trans.
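import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
# Replace NaN with a placeholder string so that every cell is a string
df = df.fillna("none")
mlb = MultiLabelBinarizer()
# Each row of df.values is treated as one user's set of items
result = mlb.fit_transform(df.values)
# One column per item; the placeholder column is then dropped
df_trans = pd.DataFrame(result, columns=mlb.classes_).drop('none', axis=1)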
If the data frame contains NaN, MultiLabelBinarizer raises an error, so replace NaN with a placeholder string first. The placeholder does not have to be "none"; any string that does not clash with an item name will do.
Create a MultiLabelBinarizer object and call its fit_transform method, passing the data as a NumPy array via df.values.
The column names (item names) can be retrieved from mlb.classes_.
Finally, remove the none column with the drop method to obtain the converted table.
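The steps above can be tried end to end on a tiny table. This is a minimal sketch with made-up user IDs and item names. Passing index=df.index (to keep the user IDs as row labels) and errors="ignore" (so the drop does not fail when no cell was NaN) are additions not found in the original code.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical input: one row per user, NaN where a user bought fewer items
df = pd.DataFrame(
    [["apple", "banana", "milk"],
     ["banana", np.nan, np.nan],
     ["milk", "bread", np.nan]],
    index=["user_1", "user_2", "user_3"],
    dtype=object,
)

PLACEHOLDER = "none"  # any string that is not an item name

# Step 1: replace NaN so that every cell is a string
filled = df.fillna(PLACEHOLDER)

# Step 2: binarize; each row is treated as one user's set of items
mlb = MultiLabelBinarizer()
result = mlb.fit_transform(filled.values)

# Step 3: label the columns with mlb.classes_ and keep the user IDs (assumption)
df_trans = pd.DataFrame(result, columns=mlb.classes_, index=df.index)

# Step 4: drop the placeholder column; errors="ignore" is an added safeguard
df_trans = df_trans.drop(columns=PLACEHOLDER, errors="ignore")
print(df_trans)
```

The resulting columns are the item names in sorted order (apple, banana, bread, milk), and the user_2 row has a 1 only under banana.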