Let's process the same tabular data in the same way with dplyr in R and pandas in Python.
Which is faster? I was curious, so I tried it out.
Let's build a ranking CSV of batters from the 2013 Major League at-bat result data (77 MB, about 190,000 lines).
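Only two columns are actually used below: BAT_ID (the batter ID) and H_FL. If H_FL is the Retrosheet hit-value field (0 = no hit, 1 through 4 = single through home run; an assumption here, since the file format isn't spelled out), then summing it per batter gives total bases. A minimal pandas sketch for peeking at those columns:
import pandas as pd
# Peek at the two columns used by the scripts below.
# Assumption: H_FL is the Retrosheet hit-value field (0 = no hit, 1-4 = single..HR).
peek = pd.read_csv('all2013.csv', usecols=['BAT_ID', 'H_FL'], nrows=5)
print(peek)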
The script dplyr.R, written with R's dplyr, is
library(data.table)
library(dplyr)
##Data read
dat = fread("all2013.csv")
##Aggregate
dat %>% select(BAT_ID, H_FL) %>% 
 group_by(BAT_ID) %>% 
 summarise(BASE = sum(H_FL)) %>% 
 arrange(desc(BASE)) %>% 
 write.csv("hoge.csv")
Like this.
> time R -f dplyr.R
R -f dplyr.R  3.13s user 0.15s system 99% cpu 3.294 total
With Python's pandas, the script pd.py is
#!/usr/bin/python
import pandas as pd
# Data read
df = pd.read_csv('all2013.csv')
# Aggregate: sum H_FL per batter, sort descending, write out
df[["BAT_ID", "H_FL"]].groupby("BAT_ID").sum().sort_values("H_FL", ascending=False).to_csv('hoge.csv')
Like this.
> time ./pd.py
./pd.py  3.12s user 0.40s system 98% cpu 3.567 total
3.29 seconds for dplyr, 3.56 seconds for pandas.
dplyr is a little better.
With 77MB of data, neither seems to be particularly fast.
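If you're curious whether the time goes into CSV parsing or into the aggregation itself, a rough sketch like this splits the measurement (same file and columns as the pandas script above):
import time
import pandas as pd
t0 = time.perf_counter()
df = pd.read_csv('all2013.csv')   # read
t1 = time.perf_counter()
df[["BAT_ID", "H_FL"]].groupby("BAT_ID").sum().sort_values("H_FL", ascending=False)   # aggregate
t2 = time.perf_counter()
print("read_csv:  %.2fs" % (t1 - t0))
print("aggregate: %.2fs" % (t2 - t1))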
You can probably just use whichever one you're more used to.
That's all.