2017/1/3 postscript: DataFrames.jl is planning a v0.9 release in February 2017, a major update that also partially breaks API compatibility. Please note that this article is based on v0.6.10. If I have time and feel like it, I may rewrite it accordingly, but I haven't had much time to touch Julia lately...
Drawn in by the phrase "let's keep it casual!", I've only dabbled with Julia so far, but I'm posting this as an article for the Julia Advent Calendar 2015.
In this article, I planned to cover data wrangling and basic aggregation in Julia by comparing it with Python's pandas library and R, which I usually use. Sadly, while writing it I ran out of time after only getting the data loaded, so please forgive me; I plan to write a follow-up.
As a sample data set, I use the subset of the Million Song Dataset (Thierry Bertin-Mahieux et al., 2011) published at the following page.
http://labrosa.ee.columbia.edu/millionsong/pages/getting-dataset#subset
Apologies to anyone who goes ahead and downloads it, since even the subset is fairly large, but this article uses only a small portion of the dataset: the following two files in the AdditionalDataset directory.
1. subset_artist_term.db
2. subset_unique_artists.txt
File 1 is a SQLite DB and file 2 is a plain text file. According to the description on the original page, file 2 is in the format "artist id<SEP>artist mbid<SEP>track id<SEP>artist name". Taking a quick look at the file with head gives the following.
AR009211187B989185<SEP>9dfe78a6-6d91-454e-9b95-9d7722cbc476<SEP>TRAWSCW12903CD6C7E<SEP>Carroll Thompson
AR00A6H1187FB5402A<SEP>312c14d9-7897-4608-944a-c5b1c76ae682<SEP>TRAKWGL12903CB8529<SEP>The Meatmen
AR00LNI1187FB444A5<SEP>7e836d29-fc2d-4a1f-b8da-566d47c49eed<SEP>TRAHVYQ128F429826B<SEP>Bruce BecVar
AR00MBZ1187B9B5DB1<SEP>ff748426-8873-4725-bdc7-c2b18b510d41<SEP>TRAFCPP128F426EC01<SEP>Memphis Minnie
AR01IP11187B9AF5D2<SEP>dbd2ebce-623d-4639-946e-c558bf56a0e3<SEP>TRAZNKG12903CDCF8A<SEP>Call To Preserve
AR01VU31187B997DA0<SEP>103241b0-6adf-4b4f-9cff-5c87459f61a4<SEP>TRAKZMB128F427B44F<SEP>Grand Funk
AR01W2D1187FB5912F<SEP>125948ec-7f91-4d1a-8b83-accbf50fae3d<SEP>TRASHYD128F93119EE<SEP>3OH!3
AR022JO1187B99587B<SEP>9538ab80-dcbf-4b94-a0cf-e5d1fbbc42c9<SEP>TRAHSJR12903CBF093<SEP>Ross
AR02DB61187B9A0B5E<SEP>bf035517-124e-409b-90f5-35618081a332<SEP>TRBANKK12903CDA071<SEP>Annie Philippe
AR02IU11187FB513F2<SEP>f19ad155-d809-4770-ab8d-7579467d9f55<SEP>TRAJFCC12903CC55AD<SEP>Tito Puente
As described, <SEP> is used as the delimiter string, and there is no header. R's read.delim and Julia's DataFrames.readtable do not support multi-character delimiters, so the delimiter has to be replaced with something suitable. Commas may appear in artist names and the like, so tabs seem safe; in R and Julia we will read a version of the file with <SEP> replaced by tabs.
Incidentally, line 224 of this file contains the artist name Buzzov¬"en, and in a case like this where there is only a single double quote, pandas seems to ignore it and read the file anyway, but the others raise an error and cannot read it, so I decided to delete this double quote.
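The preprocessing can be done with a small Julia script along the following lines. This is just a minimal sketch that is not part of the original article; the output filename subset_unique_artists.tsv matches the one used in the R and Julia code below, and the three-argument replace of the Julia 0.4 era is assumed.
open("subset_unique_artists.txt") do fin
    open("subset_unique_artists.tsv", "w") do fout
        for line in eachline(fin)
            line = replace(line, "<SEP>", "\t")   # multi-character delimiter -> tab
            line = replace(line, "\"", "")        # drop the lone double quote (line 224)
            write(fout, line)                     # eachline keeps the trailing newline here
        end
    end
end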
Data reading with Python + Pandas is as follows.
import pandas
import sqlite3

conn = sqlite3.connect("subset_artist_term.db")
artist_term = pandas.read_sql(
    "SELECT * FROM ARTIST_TERM",
    conn)

unique_artists = pandas.read_csv(
    "./subset_unique_artists.txt",
    names=["artist_id", "artist_mbid", "track_id", "artist_name"],
    sep="<SEP>")
With R, it looks like this.
require(RSQLite)

conn <- dbConnect(SQLite(), "subset_artist_term.db")
artist_term <- dbReadTable(conn, "artist_term")

unique_artists <- read.delim(
    "subset_unique_artists.tsv",
    sep="\t",
    header=F,
    col.names=c("artist_id", "artist_mbid", "track_id", "artist_name"))
And the Julia version.
import DataFrames

unique_artists = DataFrames.readtable(
    "subset_unique_artists.tsv",
    header=false,
    names=[:artist_id, :artist_mbid, :track_id, :artist_name])
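If you want a quick sanity check of what readtable produced, something like the following works (just a sketch; the functions are qualified with the module name because only import DataFrames was used above):
size(unique_artists)                 # (number of rows, 4)
DataFrames.head(unique_artists)      # first few rows, with the column names given above
DataFrames.eltypes(unique_artists)   # element type of each column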
Loading the TSV with the code above went fine, but I got stuck loading the SQLite database. SQLite.jl returns results as a DataStreams.Data.Table, so it has to be converted into a DataFrame. For the conversion, the following thread on julia-users was helpful (although its code did not work as-is): https://groups.google.com/forum/#!msg/julia-users/IFkPso4JUac/KwYj91fJAwAJ
Reading the data from SQLite into a DataFrame ended up looking like this.
import SQLite
import DataStreams

# Convert a DataStreams.Data.Table (as returned by SQLite.query) into a DataFrame
function DataFrame(dt::DataStreams.Data.Table)
    cols = dt.schema.cols
    data = Array(Any, cols)
    types = dt.schema.types
    for i = 1:cols
        data[i] = dt.data[i]
    end
    return DataFrames.DataFrame(data, Symbol[symbol(x) for x in dt.schema.header])
end
conn = SQLite.DB("subset_artist_term.db")
artist_term_table = SQLite.query(conn, "SELECT * FROM ARTIST_TERM")
artist_term = DataFrame(artist_term_table)
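To confirm the conversion worked, a check along these lines can help (again just a sketch; the column names depend on the ARTIST_TERM schema in the DB):
size(artist_term)                 # rows x columns of the ARTIST_TERM table
DataFrames.names(artist_term)     # column names taken from the SQLite result header
DataFrames.eltypes(artist_term)   # Nullable-typed columns, as noted in the postscript below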
2015/12/18 postscript:
In the above code, each column ends up with Nullable{UTF8String} elements, but I feel it would be better to represent missing values with NA instead. So:
function DataFrame(dt::DataStreams.Data.Table)
    cols = dt.schema.cols
    data = Array(Any, cols)
    types = dt.schema.types
    for i = 1:cols
        # Unwrap each Nullable value; SQL NULLs become NA
        data[i] = [isnull(el) ? NA : get(el) for el in dt.data[i]]
    end
    return DataFrames.DataFrame(data, Symbol[symbol(x) for x in dt.schema.header])
end
may be a better way to write it.
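After redefining the conversion function, re-running the conversion gives columns with plain values and NA for NULLs, which can be eyeballed like this (a sketch; whether any NA actually appears depends on whether the table contains NULLs):
artist_term = DataFrame(artist_term_table)   # uses the redefined conversion above
DataFrames.head(artist_term)                 # plain values now; NULLs show up as NA
DataFrames.eltypes(artist_term)              # no longer Nullable-wrapped element types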
That's all for now. I will write a follow-up comparing slicing, group-by aggregation, and so on.
The operation was confirmed in the following environment:
Fix:
--I had written SQLite.connect in the Julia code, but that was a mistake; it should be SQLite.DB. (2015/12/16)