Reading from text files and SQLite in Python (+ Pandas), R, Julia (+ DataFrames)

2017/1/3 postscript: DataFrames.jl is planning to release v.0.9 in February 2017 with a major fix that also partially breaks API compatibility. Please note that this article is based on v.0.6.10. please. If I have time and feel like it (hey), I may rewrite it accordingly. However, I haven't had much time to touch Julia these days ...


I'm really interested in the words "let's write it casually!", So I've touched it, but I'll post it as an article for Julia Advent Calendar 2015.

In this article, I will write about data formatting and basic aggregation processing in Julia by comparing it with Python's Pandas library and R (which I usually use). I was planning to do that, but when I was writing it, I was in a sad situation that I had run out of time to read the data. Please forgive me as I plan to write more.

As a sample data set, use the Subset of Million Songs Dataset (Thierry Bertin-Mahieux et al., 2011.) published below.

http://labrosa.ee.columbia.edu/millionsong/pages/getting-dataset#subset

I'm sorry if anyone downloads this data, which is a fairly large subset, but this article uses only a small portion of this dataset, the following two in the AdditionalDataset directory: is:

  1. subset_artist_term.db
  2. subset_unique_artists.txt

The file 1 . Is SQLite DB and 2. is in text format. If you read the description on the original page, you will find that the file in 2. is described in the format "artist id \ <SEP > artist mbid \ <SEP > track id \ <SEP > artist name". If you look at this file with head for the time being, it looks like the following.

AR009211187B989185<SEP>9dfe78a6-6d91-454e-9b95-9d7722cbc476<SEP>TRAWSCW12903CD6C7E<SEP>Carroll Thompson
AR00A6H1187FB5402A<SEP>312c14d9-7897-4608-944a-c5b1c76ae682<SEP>TRAKWGL12903CB8529<SEP>The Meatmen
AR00LNI1187FB444A5<SEP>7e836d29-fc2d-4a1f-b8da-566d47c49eed<SEP>TRAHVYQ128F429826B<SEP>Bruce BecVar
AR00MBZ1187B9B5DB1<SEP>ff748426-8873-4725-bdc7-c2b18b510d41<SEP>TRAFCPP128F426EC01<SEP>Memphis Minnie
AR01IP11187B9AF5D2<SEP>dbd2ebce-623d-4639-946e-c558bf56a0e3<SEP>TRAZNKG12903CDCF8A<SEP>Call To Preserve
AR01VU31187B997DA0<SEP>103241b0-6adf-4b4f-9cff-5c87459f61a4<SEP>TRAKZMB128F427B44F<SEP>Grand Funk
AR01W2D1187FB5912F<SEP>125948ec-7f91-4d1a-8b83-accbf50fae3d<SEP>TRASHYD128F93119EE<SEP>3OH!3
AR022JO1187B99587B<SEP>9538ab80-dcbf-4b94-a0cf-e5d1fbbc42c9<SEP>TRAHSJR12903CBF093<SEP>Ross
AR02DB61187B9A0B5E<SEP>bf035517-124e-409b-90f5-35618081a332<SEP>TRBANKK12903CDA071<SEP>Annie Philippe
AR02IU11187FB513F2<SEP>f19ad155-d809-4770-ab8d-7579467d9f55<SEP>TRAJFCC12903CC55AD<SEP>Tito Puente

Looking at the file format, \ <SEP > is used as the delimiter string as described, and there is no header. R's read.delim and Julia's DataFrames.readtable do not support multi-character delimiters and must be replaced appropriately. The comma may be used in the artist name etc., so the tab seems safe. In R and Julia, we will read the tab-replaced version of \ <SEP >.

Incidentally, there is an artist name Buzzov¬ "en on line 224 of this file, and in cases like this where there is only one double quote, Pandas seems to ignore it and read it. , Other than that, an error occurs and it cannot be read, so I decided to delete this double quote this time.

Data reading with Python + Pandas is as follows.

import pandas
import sqlite3


conn = sqlite3.connect("subset_artist_term.db")
artist_term = pandas.read_sql(
    "SELECT * FROM ARTIST_TERM",
    conn)
unique_artists = pandas.read_csv(
    "./subset_unique_artists.txt",
    names=["artist_id", "artist_mbid", "track_id", "artist_name"],
    sep="<SEP>")

With R, it looks like this.

require(RSQLite)

conn <- dbConnect(SQLite(), "subset_artist_term.db")
artist_term <- dbReadTable(conn, "artist_term")

unique_artists <- read.delim(
    "subset_unique_artists.tsv",
    sep="\t",
    header=F,
    col.names=c("artist_id", "artist_mbid", "track_id", "artist_name"))

And the Julia version.

import DataFrames
unique_artists = DataFrames.readtable(
    "subset_unique_artists.tsv",
    header=false,
    names=[:artist_id, :artist_mbid, :track_id, :artist_name])

It was good until I loaded the TSV with the above code, but I got stuck loading SQLite. SQLite.jl reads in DataStreams.Data.Table format, so it needs to be converted to a DataFrame. For the conversion method, the following exchange in julia-users was helpful (although it did not work as it is) https://groups.google.com/forum/#!msg/julia-users/ IFkPso4JUac / KwYj91fJAwAJ Data reading with DataFrame from SQLite was done as follows.

import SQLite
import DataStreams

function DataFrame(dt::DataStreams.Data.Table)
    cols = dt.schema.cols
    data = Array(Any,cols)
    types = dt.schema.types
    for i = 1:cols
        data[i] = dt.data[i]
    end
    return DataFrames.DataFrame(data,Symbol[symbol(x) for x in dt.schema.header]) 
end

conn =  SQLite.DB("subset_artist_term.db")
artist_term_table = SQLite.query(conn, "SELECT * FROM ARTIST_TERM")
artist_term = DataFrame(artist_term_table)

2015/12/18 postscript:

In the above code, each column is of type Nullable {UTF8Strings}, but I feel that it would be better to express it using NA rather than that.

So

function DataFrame(dt::DataStreams.Data.Table)
    cols = dt.schema.cols
    data = Array(Any,cols)
    types = dt.schema.types
    for i = 1:cols
        data[i] = [isnull(el)? NA : get(el) for el in dt.data[i]]
    end
    return DataFrames.DataFrame(data, Symbol[symbol(x) for x in dt.schema.header]) 
end

It may be better to say.


For the time being, this is the end. I will write a comparison of slicing processing, farewell aggregation, etc. again.

The operation was confirmed in the following environment:

Fix:

--I wrote SQLite.connect in Julia's code, but it was a mistake in SQLite.DB. (2015/12/16)

Recommended Posts

Reading from text files and SQLite in Python (+ Pandas), R, Julia (+ DataFrames)
Reading and writing text in Python
Study from Python Reading and writing Hour9 files
Reading and writing CSV and JSON files in Python
Manipulate files and folders in Python
Read and use Python files from Python
Extract text from images in Python
Sort large text files in Python
Hashing data in R and Python
Export and output files in Python
Extract strings from files in Python
Reading and writing JSON files with Python
Read and write JSON files in Python
[Easy Python] Reading Excel files with pandas
Get options in Python from both JSON files and command line arguments
Load and execute command from yml in python
Sqlite in python
Comparison of Python (+ Pandas), R, Julia (+ DataFrames) (summary of table contents, access by column)
Reading and writing fits files with Python (memo)
Python text reading for multiple lines and one line
Translator in Python from Visual Studio 2017 (Microsoft Translator Text API)
Implemented List and Bool in Python and SQLite3 (personal note)
Include and use external Kv files in Python Kivy
[Introduction for beginners] Reading and writing Python CSV files
How to download files from Selenium in Python in Chrome
Import classes in jar files directly from Python scripts
Get files from Linux using paramiko and scp [Python]
Recursively search for files and directories in Python and output
Clustering text in Python
Text processing in Python
[Python] Reading CSV files
I tried [scraping] fashion images and text sentences in Python.
[R] [Python] Memo to read multiple csv files in multiple zip files
Poisson distribution and Poisson cumulative distribution plot via sqlite in Python and Java
[Python] Plotly draws Pandas dataframes in one shot with Cufflinks
Allow Python to select strings in input files from folders
Predict gender from name using Gender API and Pykakasi in Python
Graph time series data in Python using pandas and matplotlib
Comparison of data frame handling in Python (pandas), R, Pig
[Python] Random data extraction / combination from DataFrame using random and pandas
How to create dataframes and mess with elements in pandas
Notes on reading and writing float32 TIFF images in python
UTF8 text processing in python
Reading and writing csv files
Stack and Queue in Python
Reading .txt files with Python
Speech to speech in python [text to speech]
Connect to sqlite from python
Unittest and CI in Python
Swap columns in pandas dataframes
Works with Python and R
How to import CSV and TSV files into SQLite with Python
uproot: Python / Numpy based library for reading and writing ROOT files
How to use functions in separate files Perl and Python versions
[python] Extract text from pdf and read characters aloud with Open-Jtalk
[Python3] Save the mean and covariance matrix in json with pandas
Conditional element extraction from data frame: R is% in%, Python is .isin ()
Extract database tables with CSV [ODBC connection from R and python]