Converting csv and tsv data to a matrix with Python, using MovieLens as an example

I'm getting started with machine learning in Python.

Whatever algorithm you use, the sample data, usually in csv or tsv format, first has to be converted to a matrix, so I looked into several ways of doing that.

This time I use the MovieLens 100K Dataset, one of the MovieLens datasets, which is said to be the most commonly used benchmark for collaborative filtering.

MovieLens Dataset

See the README for details about the dataset; the main file to use is u.data.

It is a 4-column tsv file with user_id, item_id, rating, and timestamp.

196	242	3	881250949
186	302	3	891717742
22	377	1	878887116
244	51	2	880606923
166	346	1	886397596
298	474	4	884182806
...

In the end, I want a matrix R in which the rating that user i gave to item j is stored as R(i, j) = rating.

Use the standard csv module

import csv

with open('u.data', newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        print(row)

If you only need to read the file, the csv module is enough. To determine the shape of the matrix R, however, you also need the maximum values of user_id and item_id.
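
As a rough sketch of what that looks like with the csv module alone, the following finds both maxima and fills a plain list-of-lists matrix. To stay self-contained it reads the six sample rows shown earlier from a string instead of opening u.data; the names SAMPLE and read_ratings are my own and purely illustrative.

```python
import csv
import io

# Six sample rows from u.data (tab-separated).
# Stand-in for open('u.data', newline='') so the sketch runs without the file.
SAMPLE = (
    "196\t242\t3\t881250949\n"
    "186\t302\t3\t891717742\n"
    "22\t377\t1\t878887116\n"
    "244\t51\t2\t880606923\n"
    "166\t346\t1\t886397596\n"
    "298\t474\t4\t884182806\n"
)


def read_ratings(f):
    """Return (user_id, item_id, rating) integer triples from a tsv file object."""
    reader = csv.reader(f, delimiter='\t')
    return [(int(u), int(i), int(r)) for u, i, r, _timestamp in reader]


ratings = read_ratings(io.StringIO(SAMPLE))

# IDs start at 1, so the largest IDs give the shape of R.
n_users = max(u for u, _, _ in ratings)
n_items = max(i for _, i, _ in ratings)

# Plain list-of-lists matrix; 0 means "not rated".
R = [[0] * n_items for _ in range(n_users)]
for u, i, r in ratings:
    R[u - 1][i - 1] = r  # shift 1-based IDs to 0-based indices

print(n_users, n_items)  # prints: 298 474
```

It works, but everything (type conversion, maxima, indexing) has to be written by hand, which is the motivation for trying pandas next.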

Handle csv with pandas

pandas (the "powerful Python data analysis toolkit") makes the data easier to handle.

Install it with pip install pandas. The maximum value of each column of the csv can be found as follows.

>>> import pandas as pd
>>> df = pd.read_csv('u.data', sep='\t', names=['user_id','item_id', 'rating', 'timestamp'])

>>> df.max()
user_id            943
item_id           1682
rating               5
timestamp    893286638
dtype: int64

Here df is a DataFrame object, and df.max() returns a Series object.

>>> type(df)
<class 'pandas.core.frame.DataFrame'>

>>> type(df.max())
<class 'pandas.core.series.Series'>

To access the maximum value of a single column, index the Series by label with .loc:

>>> df.max().loc['user_id']
943
>>> df.max().loc['item_id']
1682
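
An alternative that I find a little more direct is to take the maximum of one column at a time and convert it to a plain int, which is convenient when the value is used as a matrix dimension. This is only a sketch: it reads the six sample rows from a string (SAMPLE is an illustrative name), so the maxima differ from the 943 and 1682 of the full file.

```python
import io

import pandas as pd

# Six sample rows from u.data; stand-in for the real file.
SAMPLE = (
    "196\t242\t3\t881250949\n"
    "186\t302\t3\t891717742\n"
    "22\t377\t1\t878887116\n"
    "244\t51\t2\t880606923\n"
    "166\t346\t1\t886397596\n"
    "298\t474\t4\t884182806\n"
)

df = pd.read_csv(io.StringIO(SAMPLE), sep='\t',
                 names=['user_id', 'item_id', 'rating', 'timestamp'])

# Maximum of a single column, converted from a NumPy integer to a Python int.
n_users = int(df['user_id'].max())
n_items = int(df['item_id'].max())

# The same values are available from the Series returned by df.max().
maxima = df.max()
same = (int(maxima.loc['user_id']) == n_users
        and int(maxima.loc['item_id']) == n_items)

print(n_users, n_items, same)  # prints: 298 474 True
```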

For an explanation of pandas in Japanese, http://oceanmarine.sakura.ne.jp/sphinx/group/group_pandas.html is easy to follow.

Convert to the desired matrix

At this point, all that remains is to process the records one by one.

import numpy as np
import pandas as pd

df = pd.read_csv('u.data', sep='\t', names=['user_id','item_id', 'rating', 'timestamp'])

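# user_id and item_id start at 1, so the maxima give the matrix shape
shape = (df.max().loc['user_id'], df.max().loc['item_id'])
R = np.zeros(shape)

for i in df.index:
    row = df.loc[i]
    # subtract 1 to turn the 1-based IDs into 0-based indices
    R[row['user_id'] - 1, row['item_id'] - 1] = row['rating']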


>>> print(R)
[[ 5.  3.  4. ...,  0.  0.  0.]
 [ 4.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ...,
 [ 5.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  5.  0. ...,  0.  0.  0.]]
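
The row-by-row loop is easy to read but slow on 100,000 records. As a sketch of a faster variant, NumPy integer-array indexing can assign all ratings in one statement. It assumes each (user_id, item_id) pair appears at most once; if a pair were repeated, the last value would win. The six sample rows stand in for u.data here, and SAMPLE is an illustrative name.

```python
import io

import numpy as np
import pandas as pd

# Six sample rows from u.data; stand-in for the real file.
SAMPLE = (
    "196\t242\t3\t881250949\n"
    "186\t302\t3\t891717742\n"
    "22\t377\t1\t878887116\n"
    "244\t51\t2\t880606923\n"
    "166\t346\t1\t886397596\n"
    "298\t474\t4\t884182806\n"
)

df = pd.read_csv(io.StringIO(SAMPLE), sep='\t',
                 names=['user_id', 'item_id', 'rating', 'timestamp'])

shape = (int(df['user_id'].max()), int(df['item_id'].max()))
R = np.zeros(shape)

# Shift the 1-based IDs to 0-based row and column indices.
rows = df['user_id'].to_numpy() - 1
cols = df['item_id'].to_numpy() - 1

# One vectorized assignment instead of a Python loop over the records.
R[rows, cols] = df['rating'].to_numpy()

print(R.shape, R[195, 241])  # prints: (298, 474) 3.0
```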

In general, R is a sparse matrix (there are many movies, and each person rates only a limited number of them), so it seems better to use scipy.sparse.

import numpy as np
import pandas as pd
from scipy import sparse

df = pd.read_csv('u.data', sep='\t', names=['user_id','item_id', 'rating', 'timestamp'])

# same shape and 0-based indexing as the dense version
shape = (df.max().loc['user_id'], df.max().loc['item_id'])
R = sparse.lil_matrix(shape)

for i in df.index:
    row = df.loc[i]
    R[row['user_id'] - 1, row['item_id'] - 1] = row['rating']

>>> print(R.todense())
[[ 5.  3.  4. ...,  0.  0.  0.]
 [ 4.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ...,
 [ 5.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  5.  0. ...,  0.  0.  0.]]
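
Since the file is already a list of (row, column, value) triples, another option is to hand the three columns to scipy.sparse.coo_matrix directly and skip the loop. The sketch below does that on the six sample rows (SAMPLE is an illustrative name) and also computes the fraction of filled cells. Note that duplicate (user_id, item_id) pairs, if any existed, would be summed when converting to CSR rather than overwritten.

```python
import io

import pandas as pd
from scipy import sparse

# Six sample rows from u.data; stand-in for the real file.
SAMPLE = (
    "196\t242\t3\t881250949\n"
    "186\t302\t3\t891717742\n"
    "22\t377\t1\t878887116\n"
    "244\t51\t2\t880606923\n"
    "166\t346\t1\t886397596\n"
    "298\t474\t4\t884182806\n"
)

df = pd.read_csv(io.StringIO(SAMPLE), sep='\t',
                 names=['user_id', 'item_id', 'rating', 'timestamp'])

# Shift the 1-based IDs to 0-based row and column indices.
rows = df['user_id'].to_numpy() - 1
cols = df['item_id'].to_numpy() - 1
shape = (int(df['user_id'].max()), int(df['item_id'].max()))

# Build from (data, (row, col)) triples, then convert to CSR for fast row access.
R = sparse.coo_matrix((df['rating'].to_numpy(), (rows, cols)), shape=shape).tocsr()

# Fraction of cells that actually hold a rating.
density = R.nnz / (shape[0] * shape[1])

print(R.shape, R.nnz, R[195, 241])
```

For the full file, 100,000 ratings in a 943 x 1682 matrix would fill only about 6% of the cells, which is why the sparse representation pays off.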

That's all.

Correction

I found a problem where the matrix had an extra first row and first column, so I fixed it. The first draft read as follows:

shape = (df.max().ix['user_id'] + 1, df.max().ix['item_id'] + 1)
R = np.zeros(shape) 

for i in df.index:
    row = df.ix[i]
    R[row['user_id'], row['item_id']] = row['rating']