I'm getting started with machine learning in Python.
Regardless of the algorithm used, sample data in csv or tsv format first has to be converted to a matrix, so I looked into several ways of doing that.
This time I use the MovieLens 100K Dataset, one of the MovieLens datasets, which is said to be the most commonly used benchmark for collaborative filtering.
MovieLens Dataset
See the README for details about the dataset; the main file to use is probably u.data.
It is a 4-column tsv with user_id, item_id, rating, and timestamp.
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
...
Ultimately, I want to convert this into a matrix R such that R(i, j) = rating, where rating is the score user i gave to item j.
import csv

# u.data is tab-separated, so pass the delimiter explicitly.
with open('u.data', newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        print(row)
If you just want to read it, the csv module is enough. However, in order to determine the shape of the matrix R, it is necessary to find the maximum values of user_id and item_id.
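As a rough sketch of what finding those maximum values looks like with the csv module alone, the loop below tracks the largest IDs by hand. To keep it self-contained it reads a few rows copied from the sample above out of an in-memory buffer (an assumption made for illustration); with the real file you would pass the object returned by open('u.data', newline='') instead.

```python
import csv
import io

# A few rows copied from the u.data sample shown earlier (tab-separated).
sample = io.StringIO(
    "196\t242\t3\t881250949\n"
    "186\t302\t3\t891717742\n"
    "22\t377\t1\t878887116\n"
)

max_user_id = 0
max_item_id = 0
for user_id, item_id, rating, timestamp in csv.reader(sample, delimiter='\t'):
    # csv.reader yields strings, so convert to int before comparing.
    max_user_id = max(max_user_id, int(user_id))
    max_item_id = max(max_item_id, int(item_id))

print(max_user_id, max_item_id)
```

This works, but every column has to be converted and tracked manually, which is the bookkeeping a data-frame library takes over.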
pandas, the "powerful Python data analysis toolkit", makes this kind of data handling easier.
Install it with pip install pandas. The maximum value of each column of the csv can then be found as follows.
>>> import pandas as pd
>>> df = pd.read_csv('u.data', sep='\t', names=['user_id','item_id', 'rating', 'timestamp'])
>>> df.max()
user_id 943
item_id 1682
rating 5
timestamp 893286638
dtype: int64
Here df is a DataFrame object and df.max() is a Series object.
>>> type(df)
<class 'pandas.core.frame.DataFrame'>
>>> type(df.max())
<class 'pandas.core.series.Series'>
To access the maximum value for each column, you can do the following:
>>> df.max().loc['user_id']
943
>>> df.max().loc['item_id']
1682
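Note that the .ix indexer was removed in pandas 1.0, so label-based access now goes through .loc. As a small sketch of an alternative, the maximum can also be taken on a single column directly, which avoids computing the maximum of columns you do not need. The in-memory sample and the names n_users and n_items are assumptions for illustration only.

```python
import io

import pandas as pd

# A few rows copied from the u.data sample shown earlier (tab-separated).
sample = io.StringIO(
    "196\t242\t3\t881250949\n"
    "186\t302\t3\t891717742\n"
    "22\t377\t1\t878887116\n"
)
df = pd.read_csv(sample, sep='\t',
                 names=['user_id', 'item_id', 'rating', 'timestamp'])

# Maximum of one column at a time.
max_user_id = df['user_id'].max()
max_item_id = df['item_id'].max()

# Or both ID columns at once; iterating the resulting Series yields its values.
n_users, n_items = df[['user_id', 'item_id']].max()

print(max_user_id, max_item_id)
```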
For an explanation in Japanese, http://oceanmarine.sakura.ne.jp/sphinx/group/group_pandas.html is easy to follow.
From here, all that remains is to process the records one by one in the straightforward way.
import numpy as np
import pandas as pd

df = pd.read_csv('u.data', sep='\t', names=['user_id','item_id', 'rating', 'timestamp'])
# IDs start at 1, so the largest ID in each column gives the matrix size.
shape = (df.max().loc['user_id'], df.max().loc['item_id'])
R = np.zeros(shape)
for i in df.index:
    row = df.loc[i]
    # Shift the 1-based IDs to 0-based matrix indices.
    R[row['user_id'] - 1, row['item_id'] - 1] = row['rating']
>>> print(R)
[[ 5. 3. 4. ..., 0. 0. 0.]
[ 4. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
...,
[ 5. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 5. 0. ..., 0. 0. 0.]]
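The row-by-row loop is easy to read but slow on 100,000 ratings. As a sketch of a vectorized alternative (not part of the original approach), NumPy integer-array indexing can assign all ratings in one statement. It assumes each (user_id, item_id) pair appears at most once; if a pair were repeated, the last value would win. The in-memory sample stands in for u.data.

```python
import io

import numpy as np
import pandas as pd

# A few rows copied from the u.data sample shown earlier (tab-separated).
sample = io.StringIO(
    "196\t242\t3\t881250949\n"
    "186\t302\t3\t891717742\n"
    "22\t377\t1\t878887116\n"
)
df = pd.read_csv(sample, sep='\t',
                 names=['user_id', 'item_id', 'rating', 'timestamp'])

# Shift the 1-based IDs to 0-based indices.
rows = df['user_id'].to_numpy() - 1
cols = df['item_id'].to_numpy() - 1

R = np.zeros((df['user_id'].max(), df['item_id'].max()))
# One assignment fills every (row, col) position with its rating.
R[rows, cols] = df['rating'].to_numpy()

print(R.shape)
```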
In general R is a sparse matrix (there are many movies, but each person rates only a limited number of them), so it seems better to use scipy.sparse.
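import pandas as pd
from scipy import sparse

df = pd.read_csv('u.data', sep='\t', names=['user_id','item_id', 'rating', 'timestamp'])
# IDs start at 1, so the largest ID in each column gives the matrix size.
shape = (df.max().loc['user_id'], df.max().loc['item_id'])
# lil_matrix supports efficient element-by-element assignment.
R = sparse.lil_matrix(shape)
for i in df.index:
    row = df.loc[i]
    # Shift the 1-based IDs to 0-based matrix indices.
    R[row['user_id'] - 1, row['item_id'] - 1] = row['rating']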
>>> print(R.todense())
[[ 5. 3. 4. ..., 0. 0. 0.]
[ 4. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
...,
[ 5. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 5. 0. ..., 0. 0. 0.]]
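Since the data is already a list of (row, column, value) triples, another option is to hand the three columns to scipy.sparse.coo_matrix and skip the Python loop entirely. This is only a sketch of that alternative, again using an in-memory sample in place of u.data. Note that coo_matrix sums duplicate entries when converted, so it assumes each (user_id, item_id) pair appears only once.

```python
import io

import pandas as pd
from scipy import sparse

# A few rows copied from the u.data sample shown earlier (tab-separated).
sample = io.StringIO(
    "196\t242\t3\t881250949\n"
    "186\t302\t3\t891717742\n"
    "22\t377\t1\t878887116\n"
)
df = pd.read_csv(sample, sep='\t',
                 names=['user_id', 'item_id', 'rating', 'timestamp'])

# Shift the 1-based IDs to 0-based indices.
rows = df['user_id'].to_numpy() - 1
cols = df['item_id'].to_numpy() - 1
shape = (int(df['user_id'].max()), int(df['item_id'].max()))

# Build in COO (triplet) form, then convert to CSR for fast row access.
R = sparse.coo_matrix((df['rating'].to_numpy(), (rows, cols)), shape=shape).tocsr()

print(R.shape, R.nnz)
```

CSR is chosen here on the assumption that you will mostly read whole user rows; tocsc() would be the counterpart for column-wise access.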
That's all.
Later I noticed that the matrix had an extra first row and first column, so I fixed it. In the first draft, I had written the following.
# First draft: the + 1 and the unshifted IDs leave row 0 and column 0 unused.
shape = (df.max().loc['user_id'] + 1, df.max().loc['item_id'] + 1)
R = np.zeros(shape)
for i in df.index:
    row = df.loc[i]
    R[row['user_id'], row['item_id']] = row['rating']
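To see why the first draft produced an extra row and column, here is a small check on a made-up ratings list (the values are hypothetical and only the 1-based ID convention is taken from u.data). It builds both layouts and confirms that the draft's row 0 and column 0 stay empty, and that the rest of the draft matrix equals the fixed one.

```python
import numpy as np

# Hypothetical (user_id, item_id, rating) triples; IDs start at 1 as in u.data.
ratings = [(1, 1, 5), (1, 2, 3), (2, 1, 4)]
max_user = max(u for u, _, _ in ratings)
max_item = max(i for _, i, _ in ratings)

# First-draft layout: one extra row and column, IDs used directly as indices.
R_draft = np.zeros((max_user + 1, max_item + 1))
for u, i, r in ratings:
    R_draft[u, i] = r

# Fixed layout: shift the 1-based IDs to 0-based indices.
R_fixed = np.zeros((max_user, max_item))
for u, i, r in ratings:
    R_fixed[u - 1, i - 1] = r

# Row 0 and column 0 of the draft never receive a rating.
draft_has_empty_border = not R_draft[0].any() and not R_draft[:, 0].any()
# Dropping that border gives exactly the fixed matrix.
same_after_trim = np.array_equal(R_draft[1:, 1:], R_fixed)

print(draft_has_empty_border, same_after_trim)
```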