Convert CSV and TSV data to a matrix with Python, using MovieLens as an example

I'm getting started with machine learning in Python.

Whatever algorithm you use, you first have to convert sample data in CSV or TSV format into a matrix, so I looked into several ways to do it.

This time I use the MovieLens 100K Dataset, one of the MovieLens datasets, which is said to be the most commonly used benchmark for collaborative filtering.

MovieLens Dataset

See the README for details about the dataset; the main file to use is u.data.

It is a four-column TSV with user_id, item_id, rating, and timestamp.

196	242	3	881250949
186	302	3	891717742
22	377	1	878887116
244	51	2	880606923
166	346	1	886397596
298	474	4	884182806
...

In the end, I want a matrix R such that R(i, j) is the rating that user i gave to item j.

Use the standard csv module

import csv

# u.data is tab-separated, so pass delimiter='\t'
with open('u.data', newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        print(row)  # each row is a list of strings

If you only need to read the file, the csv module is enough. To determine the shape of the matrix R, however, you also need the maximum values of user_id and item_id.
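As a rough sketch of what finding those maximum values looks like with the csv module alone, the following scans the rows once. The helper name max_ids is hypothetical, and the six sample rows shown earlier are inlined through io.StringIO so the snippet runs without the real file.

```python
import csv
import io

# The six sample rows of u.data shown earlier, inlined for a runnable sketch.
SAMPLE = (
    "196\t242\t3\t881250949\n"
    "186\t302\t3\t891717742\n"
    "22\t377\t1\t878887116\n"
    "244\t51\t2\t880606923\n"
    "166\t346\t1\t886397596\n"
    "298\t474\t4\t884182806\n"
)

def max_ids(f):
    """Return (max user_id, max item_id) found in a tab-separated ratings file.

    Hypothetical helper for illustration; it is not part of the csv module.
    """
    max_user = max_item = 0
    for user_id, item_id, _rating, _timestamp in csv.reader(f, delimiter="\t"):
        # csv.reader yields strings, so convert before comparing
        max_user = max(max_user, int(user_id))
        max_item = max(max_item, int(item_id))
    return max_user, max_item

# With the real data you would pass open('u.data', newline='') instead.
print(max_ids(io.StringIO(SAMPLE)))  # (298, 474)
```

This works, but every value has to be converted by hand, which is the kind of bookkeeping a data-handling library takes care of.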

Handle csv with pandas

pandas (a powerful Python data analysis toolkit) makes the data easier to handle.

Install it with pip install pandas. The maximum value of each column can then be found as follows.

>>> import pandas as pd
>>> df = pd.read_csv('u.data', sep='\t', names=['user_id','item_id', 'rating', 'timestamp'])

>>> df.max()
user_id            943
item_id           1682
rating               5
timestamp    893286638
dtype: int64

Here df is a DataFrame object and df.max() returns a Series object.

>>> type(df)
<class 'pandas.core.frame.DataFrame'>

>>> type(df.max())
<class 'pandas.core.series.Series'>

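To access the maximum value of a single column, index the Series by label. The .ix indexer used in older pandas code was removed in pandas 1.0, so .loc is used here:

>>> int(df.max().loc['user_id'])
943
>>> int(df.max().loc['item_id'])
1682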

For commentary articles in Japanese, http://oceanmarine.sakura.ne.jp/sphinx/group/group_pandas.html is easy to understand.
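As a small variation on the lookups above, the two maxima can also be taken directly from the columns of interest and converted to plain Python integers in one step. This is only a sketch: the sample rows of u.data are inlined through io.StringIO, and the names n_users and n_items are my own.

```python
import io

import pandas as pd

# The six sample rows of u.data shown earlier, inlined for a runnable sketch.
SAMPLE = (
    "196\t242\t3\t881250949\n"
    "186\t302\t3\t891717742\n"
    "22\t377\t1\t878887116\n"
    "244\t51\t2\t880606923\n"
    "166\t346\t1\t886397596\n"
    "298\t474\t4\t884182806\n"
)

df = pd.read_csv(io.StringIO(SAMPLE), sep="\t",
                 names=["user_id", "item_id", "rating", "timestamp"])

# Select only the id columns, take the column-wise maximum (a Series),
# and unpack its two values as plain ints.
n_users, n_items = (int(v) for v in df[["user_id", "item_id"]].max())
print(n_users, n_items)  # 298 474
```

On the full u.data file the same two lines would give 943 and 1682, matching the output of df.max() shown above.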

Convert to the desired matrix

At this point, all that remains is to process the rows one by one.

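import numpy as np
import pandas as pd

df = pd.read_csv('u.data', sep='\t', names=['user_id','item_id', 'rating', 'timestamp'])

# ids start at 1, so the largest id equals the number of rows / columns
shape = (df.max().loc['user_id'], df.max().loc['item_id'])
R = np.zeros(shape)

# .loc replaces the .ix indexer, which was removed in pandas 1.0
for i in df.index:
    row = df.loc[i]
    # shift the 1-based ids to 0-based array indices
    R[row['user_id'] - 1, row['item_id'] - 1] = row['rating']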


>>> print(R)
[[ 5.  3.  4. ...,  0.  0.  0.]
 [ 4.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ...,
 [ 5.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  5.  0. ...,  0.  0.  0.]]
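The row-by-row loop is easy to follow but slow on larger files. One possible alternative, sketched here under the assumption that each (user_id, item_id) pair appears only once, is to fill the matrix in a single step with NumPy integer-array indexing. The sample rows of u.data are inlined so the snippet runs on its own, and the shape is taken from the sample rather than the full dataset.

```python
import io

import numpy as np
import pandas as pd

# The six sample rows of u.data shown earlier, inlined for a runnable sketch.
SAMPLE = (
    "196\t242\t3\t881250949\n"
    "186\t302\t3\t891717742\n"
    "22\t377\t1\t878887116\n"
    "244\t51\t2\t880606923\n"
    "166\t346\t1\t886397596\n"
    "298\t474\t4\t884182806\n"
)

df = pd.read_csv(io.StringIO(SAMPLE), sep="\t",
                 names=["user_id", "item_id", "rating", "timestamp"])

# Shift the 1-based ids to 0-based row and column indices.
users = df["user_id"].to_numpy() - 1
items = df["item_id"].to_numpy() - 1

R = np.zeros((users.max() + 1, items.max() + 1))
# Assign every rating at once: R[users[k], items[k]] = rating[k]
R[users, items] = df["rating"].to_numpy()

print(R.shape)      # (298, 474)
print(R[195, 241])  # 3.0  (user 196, item 242)
```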

In general this is a sparse matrix (there are many movies, but each person rates only a few of them), so it seems better to use scipy.sparse.

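import numpy as np
import pandas as pd
from scipy import sparse

df = pd.read_csv('u.data', sep='\t', names=['user_id','item_id', 'rating', 'timestamp'])

# ids start at 1, so the largest id equals the number of rows / columns
shape = (df.max().loc['user_id'], df.max().loc['item_id'])
R = sparse.lil_matrix(shape)

# .loc replaces the .ix indexer, which was removed in pandas 1.0
for i in df.index:
    row = df.loc[i]
    # shift the 1-based ids to 0-based matrix indices
    R[row['user_id'] - 1, row['item_id'] - 1] = row['rating']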

>>> print(R.todense())
[[ 5.  3.  4. ...,  0.  0.  0.]
 [ 4.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ...,
 [ 5.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  5.  0. ...,  0.  0.  0.]]
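Since the data already comes as (row, column, value) triples, another option is to hand the three columns to scipy.sparse.coo_matrix and skip the loop entirely. This is a sketch rather than the approach used above; it runs on the inlined sample rows of u.data, and it relies on the fact that duplicate (row, column) pairs, if there were any, would be summed when converting to CSR.

```python
import io

import pandas as pd
from scipy import sparse

# The six sample rows of u.data shown earlier, inlined for a runnable sketch.
SAMPLE = (
    "196\t242\t3\t881250949\n"
    "186\t302\t3\t891717742\n"
    "22\t377\t1\t878887116\n"
    "244\t51\t2\t880606923\n"
    "166\t346\t1\t886397596\n"
    "298\t474\t4\t884182806\n"
)

df = pd.read_csv(io.StringIO(SAMPLE), sep="\t",
                 names=["user_id", "item_id", "rating", "timestamp"])

# Shift the 1-based ids to 0-based row and column indices.
rows = df["user_id"].to_numpy() - 1
cols = df["item_id"].to_numpy() - 1
ratings = df["rating"].to_numpy(dtype=float)

shape = (int(rows.max()) + 1, int(cols.max()) + 1)
# COO takes (data, (row_indices, col_indices)); CSR is better for later arithmetic.
R = sparse.coo_matrix((ratings, (rows, cols)), shape=shape).tocsr()

print(R.shape)      # (298, 474)
print(R.nnz)        # 6 stored ratings
print(R[195, 241])  # 3.0  (user 196, item 242)
```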

That's all.

Correction

I found that the matrix had an extra first row and first column, so I fixed it. In the first draft, I had written the following.

shape = (df.max().ix['user_id'] + 1, df.max().ix['item_id'] + 1)
R = np.zeros(shape) 

for i in df.index:
    row = df.ix[i]
    R[row['user_id'], row['item_id']] = row['rating']
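To make the difference concrete, here is a small sketch comparing the first-draft indexing with the shifted indexing on three of the sample ratings. The variable names R_draft and R_fixed are my own; the point is only that using the ids directly leaves row 0 and column 0 empty.

```python
import numpy as np

# (user_id, item_id, rating) triples taken from the sample rows of u.data.
ratings = [(196, 242, 3), (186, 302, 3), (22, 377, 1)]
max_user = max(u for u, _, _ in ratings)
max_item = max(i for _, i, _ in ratings)

# First draft: ids used directly as indices, so the shape needs +1.
R_draft = np.zeros((max_user + 1, max_item + 1))
# Fixed version: ids shifted by one to 0-based indices.
R_fixed = np.zeros((max_user, max_item))

for u, i, r in ratings:
    R_draft[u, i] = r
    R_fixed[u - 1, i - 1] = r

print(R_draft.shape, R_fixed.shape)              # (197, 378) (196, 377)
# Ids start at 1, so nothing is ever written to row 0 or column 0.
print(R_draft[0].any(), R_draft[:, 0].any())     # False False
# Dropping that extra row and column gives exactly the fixed matrix.
print(np.array_equal(R_draft[1:, 1:], R_fixed))  # True
```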
