What I want to do

Suppose you have log data keyed hierarchically by values such as user and item, as shown below. (From here on, read "user" and "item" as whatever keys apply to your case.)

>>> import pandas as pd
>>> df = pd.DataFrame([['user1','item1',5],['user1','item2',4],['user2','item2',5],['user2','item3',6],['user3','item4',3]], columns=['username','itemname','rate'])
>>> df
  username itemname  rate
0    user1    item1     5
1    user1    item2     4
2    user2    item2     5
3    user2    item3     6
4    user3    item4     3

To convert this into matrix form (items as rows, users as columns), you can write the following in pandas.

>>> df.groupby(['itemname', 'username']).mean().unstack(fill_value=0).values
array([[5, 0, 0],
       [4, 5, 0],
       [0, 6, 0],
       [0, 0, 3]])

If the same user × item pair appears more than once, the ratings are averaged.

Problem

When this is done with pandas as above, all of the target data has to be held as a DataFrame, so with large-scale data the conversion itself can end in a memory error. And if the data does not fit in memory to begin with, it has to be split up and read in pieces, which this approach cannot handle.

So the motivation here is to perform the above conversion by some means other than pandas.

Method

The idea is to convert each key into an index, and to keep the values and those indexes as lists.

Index conversion

Convert each key value with Label Encoding as follows.

>>> from sklearn.preprocessing import LabelEncoder
>>> le_username = LabelEncoder()
>>> le_itemname = LabelEncoder()
>>> df['username'] = le_username.fit_transform(df['username'])
>>> df['itemname'] = le_itemname.fit_transform(df['itemname'])
>>> df
   username  itemname  rate
0         0         0     5
1         0         1     4
2         1         1     5
3         1         2     6
4         2         3     3
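
Once the keys are replaced by indexes, you will usually want a way back to the original names later, for example to label the rows and columns of the matrix. A minimal sketch, assuming the same usernames as above: LabelEncoder keeps the original labels in its classes_ attribute, and inverse_transform maps indexes back to labels.

```python
from sklearn.preprocessing import LabelEncoder

usernames = ['user1', 'user1', 'user2', 'user2', 'user3']

le_username = LabelEncoder()
codes = le_username.fit_transform(usernames)

# classes_ holds the original labels in sorted order;
# the position of a label in classes_ is its index
classes = le_username.classes_.tolist()

# inverse_transform maps indexes back to the original labels
restored = le_username.inverse_transform(codes).tolist()

print(codes.tolist())
print(classes)
print(restored)
```

It is worth using a separate encoder per key (le_username, le_itemname), since refitting one encoder on another column overwrites its classes_.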

List

To hold the data in the form "item number i of user number j has value v", extract each column as a list, as follows.

>>> row = df['itemname'].values.tolist()
>>> col = df['username'].values.tolist()
>>> value = df['rate'].values.tolist()
>>> row
[0, 1, 1, 2, 3]
>>> col
[0, 0, 1, 1, 2]
>>> value
[5, 4, 5, 6, 3]

If you are reading the data in pieces, extend these lists with each piece as it is read.
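
Reading in pieces could look like the sketch below. One thing to watch: fitting a LabelEncoder on each piece separately would give indexes that do not agree between pieces, so this sketch swaps in plain dictionaries that grow as new keys appear. The CSV text, the chunk size, and the names user_index and item_index are assumptions made for illustration only.

```python
import io
import pandas as pd

# Hypothetical CSV standing in for a log file too large to load at once
csv_text = """username,itemname,rate
user1,item1,5
user1,item2,4
user2,item2,5
user2,item3,6
user3,item4,3
"""

user_index = {}  # username -> column index, grown as new users appear
item_index = {}  # itemname -> row index, grown as new items appear
row, col, value = [], [], []

# chunksize makes read_csv return an iterator of small DataFrames
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    for user, item, rate in zip(chunk['username'], chunk['itemname'], chunk['rate']):
        # setdefault assigns the next free index the first time a key is seen
        col.append(user_index.setdefault(user, len(user_index)))
        row.append(item_index.setdefault(item, len(item_index)))
        value.append(int(rate))

print(row, col, value)
```

With this approach the indexes follow order of first appearance rather than sorted order, so they match LabelEncoder's output only when the keys happen to arrive sorted, as they do in this sample.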

Convert to matrix

Convert the above list to a sparse matrix.

>>> from scipy.sparse import coo_matrix
>>> matrix = coo_matrix((value, (row, col)))
>>> matrix
<4x3 sparse matrix of type '<class 'numpy.int64'>' with 5 stored elements in COOrdinate format>

Converting it to a dense array gives the following, which is the desired matrix.

>>> matrix.toarray()
array([[5, 0, 0],
       [4, 5, 0],
       [0, 6, 0],
       [0, 0, 3]])
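
One difference from the pandas version is worth noting: when the same (row, col) pair appears more than once, a COO matrix sums the values on conversion instead of averaging them. A minimal sketch of one way to recover the average, assuming a duplicate rating is added to the sample data, is to build a second matrix of counts and divide the stored entries.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Same data as above plus one duplicate: user 0 rated item 0 twice (5 and 3)
row = [0, 1, 1, 2, 3, 0]
col = [0, 0, 1, 1, 2, 0]
value = [5, 4, 5, 6, 3, 3]
shape = (4, 3)

# Converting COO to CSR sums entries that share the same (row, col)
sums = coo_matrix((value, (row, col)), shape=shape, dtype=float).tocsr()
counts = coo_matrix((np.ones(len(value)), (row, col)), shape=shape).tocsr()

# Both matrices come from the same coordinates, so their stored entries
# line up; dividing only those entries keeps the result sparse
mean = sums.copy()
mean.data = sums.data / counts.data

print(mean.toarray())
```

Passing shape explicitly is optional here, but it keeps the two matrices the same size and avoids relying on the largest index seen so far.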