Suppose you have log data keyed hierarchically by values such as user and item, as shown below. (From here on, read "user" and "item" as whatever keys apply in your case.)
>>> import pandas as pd
>>> df = pd.DataFrame([['user1','item1',5],['user1','item2',4],['user2','item2',5],['user2','item3',6],['user3','item4',3]], columns=['username','itemname','rate'])
>>> df
  username itemname  rate
0    user1    item1     5
1    user1    item2     4
2    user2    item2     5
3    user2    item3     6
4    user3    item4     3
If you want to convert this to matrix format, you can write it in pandas as follows.
>>> df.groupby(['itemname', 'username']).mean().unstack(fill_value=0).values
array([[5, 0, 0],
       [4, 5, 0],
       [0, 6, 0],
       [0, 0, 3]])
With the pandas approach above, all of the target data has to be held as a DataFrame, so for large-scale data the conversion itself can cause a memory error. And if the data does not fit in memory to begin with, it would have to be read in pieces, which this approach cannot handle.
The motivation here, then, is to perform the above conversion without relying on pandas for the pivot.
The idea is to convert each key into an index and keep the values and their indices as lists.
Convert each key value with Label Encoding as follows.
>>> from sklearn.preprocessing import LabelEncoder
>>> le_username = LabelEncoder()
>>> le_itemname = LabelEncoder()
>>> df['username'] = le_username.fit_transform(df['username'])
>>> df['itemname'] = le_itemname.fit_transform(df['itemname'])
>>> df
   username  itemname  rate
0         0         0     5
1         0         1     4
2         1         1     5
3         1         2     6
4         2         3     3
To hold the data in the form "item number i for user number j has value v", retrieve each column as a list as follows.
>>> row = df['itemname'].values.tolist()
>>> col = df['username'].values.tolist()
>>> value = df['rate'].values.tolist()
>>> row
[0, 1, 1, 2, 3]
>>> col
[0, 0, 1, 1, 2]
>>> value
[5, 4, 5, 6, 3]
If you are reading the data in pieces, this is the point at which to extend these lists with the values from each piece.
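A minimal sketch of what reading in pieces might look like. It is not the code from this article: it assumes the log is a CSV read with the `chunksize` option of `pd.read_csv`, and it swaps `LabelEncoder` for plain dictionaries (the hypothetical names `user_index` and `item_index`), because the full set of labels is not known until every piece has been seen. Note that the dictionaries assign indices in first-seen order rather than the sorted order `LabelEncoder` produces; for the sample data the two happen to coincide.

```python
import io

import pandas as pd
from scipy.sparse import coo_matrix

# Hypothetical stand-in for a large log file; in practice this would be a path on disk.
csv_data = io.StringIO(
    "username,itemname,rate\n"
    "user1,item1,5\n"
    "user1,item2,4\n"
    "user2,item2,5\n"
    "user2,item3,6\n"
    "user3,item4,3\n"
)

user_index = {}  # label -> column index, assigned in first-seen order
item_index = {}  # label -> row index, assigned in first-seen order
row, col, value = [], [], []

# chunksize=2 is only for illustration; a real value would be much larger.
for chunk in pd.read_csv(csv_data, chunksize=2):
    for user, item, rate in zip(chunk["username"], chunk["itemname"], chunk["rate"]):
        # setdefault returns the existing index, or registers the next free one.
        row.append(item_index.setdefault(item, len(item_index)))
        col.append(user_index.setdefault(user, len(user_index)))
        value.append(rate)

# Passing shape explicitly keeps the dimensions tied to the number of labels seen.
matrix = coo_matrix((value, (row, col)), shape=(len(item_index), len(user_index)))
```

Only one piece of the DataFrame is in memory at a time; what accumulates is the three lists and the two dictionaries.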
Convert the above list to a sparse matrix.
>>> from scipy.sparse import coo_matrix
>>> matrix = coo_matrix((value, (row, col)))
>>> matrix
<4x3 sparse matrix of type '<class 'numpy.int64'>' with 5 stored elements in COOrdinate format>
Converting it to a dense array gives the following, which is the desired matrix.
>>> matrix.toarray()
array([[5, 0, 0],
       [4, 5, 0],
       [0, 6, 0],
       [0, 0, 3]])
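For large data you would probably avoid `toarray()` and keep the matrix sparse. The following is a rough sketch, under that assumption, of looking up a value by its original labels and recovering the row labels from the fitted encoders. The variable names `usernames`, `itemnames` and `rates` are made up for this example. One difference from the pandas version is worth knowing: where the same (item, user) pair appears more than once, `coo_matrix` sums the duplicates on conversion, whereas the `groupby(...).mean()` above averages them.

```python
import numpy as np
from scipy.sparse import coo_matrix
from sklearn.preprocessing import LabelEncoder

# Same sample data as the article, held as plain lists for illustration.
usernames = ["user1", "user1", "user2", "user2", "user3"]
itemnames = ["item1", "item2", "item2", "item3", "item4"]
rates = [5, 4, 5, 6, 3]

le_username = LabelEncoder()
le_itemname = LabelEncoder()
col = le_username.fit_transform(usernames)
row = le_itemname.fit_transform(itemnames)

# Derive the shape from the encoders rather than from the largest index seen.
shape = (len(le_itemname.classes_), len(le_username.classes_))

# CSR supports element and row access while staying sparse.
matrix = coo_matrix((rates, (row, col)), shape=shape).tocsr()

# Look up a rating by its original labels.
i = le_itemname.transform(["item3"])[0]
j = le_username.transform(["user2"])[0]
rating = matrix[i, j]

# Map row indices back to item names.
row_labels = le_itemname.inverse_transform(np.arange(shape[0]))
```

Keeping the fitted encoders around is what lets you translate between matrix positions and the original user and item names later.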