Comparing the processing speed of R's dplyr and Python's pandas

Introduction

Let's process the same table data in the same way with dplyr in R and with pandas in Python.

Which is faster? I was curious, so I measured it.

Aggregation of baseball data

Let's build a total-bases ranking as a .csv file from the 2013 Major League Baseball at-bat result data (77MB, about 190,000 lines). For each batter (BAT_ID), we sum the H_FL column and sort the totals in descending order.
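To make the target output concrete, here is a minimal sketch of the same group, sum, and sort steps on a tiny made-up table. The batter IDs and values are invented for illustration, and it assumes H_FL is 0 for no hit and 1 to 4 for a single through a home run, so that its per-batter sum is the total bases.

```python
import pandas as pd

# Made-up at-bat rows; the IDs and values are illustrative only.
# Assumption: H_FL is 0 for no hit and 1-4 for single through home run.
toy = pd.DataFrame({
    "BAT_ID": ["a001", "b002", "a001", "c003", "b002", "a001"],
    "H_FL":   [1, 4, 0, 2, 2, 3],
})

# Sum H_FL per batter, sort from largest to smallest, and name the result BASE.
ranking = (
    toy.groupby("BAT_ID")["H_FL"]
    .sum()
    .sort_values(ascending=False)
    .rename("BASE")
)

# Totals: b002 has 6, a001 has 4, c003 has 2.
print(ranking)
```

The real scripts do the same thing on the full file and write the result to hoge.csv.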

The script dplyr.R, written with R's dplyr, looks like this:

library(data.table)
library(dplyr)

## Read the data
dat = fread("all2013.csv")

## Aggregate: total H_FL per batter, sorted in descending order
dat %>% select(BAT_ID, H_FL) %>% 
 group_by(BAT_ID) %>% 
 summarise(BASE = sum(H_FL)) %>% 
 arrange(desc(BASE)) %>% 
 write.csv("hoge.csv")

Timing it from the shell:

> time R -f dplyr.R
R -f dplyr.R  3.13s user 0.15s system 99% cpu 3.294 total

With Python's pandas, the equivalent script pd.py is


#!/usr/bin/python

import pandas as pd

# Read the data
df = pd.read_csv('all2013.csv')

# Aggregate: total H_FL per batter, sorted in descending order
df[["BAT_ID", "H_FL"]].groupby("BAT_ID").sum().sort_values("H_FL", ascending=False).to_csv('hoge.csv')

Timing it the same way:

> time ./pd.py
./pd.py  3.12s user 0.40s system 98% cpu 3.567 total

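In total elapsed time, that is 3.29 seconds for dplyr and 3.57 seconds for pandas.

dplyr is slightly faster, by about 0.27 seconds. Both figures cover the whole script, including interpreter startup and reading the CSV, not just the aggregation.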
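Since `time` measures the whole process, it does not show how much of the total goes to reading the file versus aggregating. A minimal sketch of how the two stages could be timed separately in pandas follows. The helper `time_stages` and the synthetic file are hypothetical stand-ins, not part of the original measurement; the synthetic data has only two columns, so its read time will be much smaller than for the real 77MB file.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd


def time_stages(csv_path):
    """Return (ranking, seconds spent reading, seconds spent aggregating)."""
    t0 = time.perf_counter()
    df = pd.read_csv(csv_path)
    t1 = time.perf_counter()
    ranking = (
        df[["BAT_ID", "H_FL"]]
        .groupby("BAT_ID")
        .sum()
        .sort_values("H_FL", ascending=False)
    )
    t2 = time.perf_counter()
    return ranking, t1 - t0, t2 - t1


# Synthetic stand-in for all2013.csv (assumption: 190,000 rows, 900 batters,
# H_FL values 0-4). Only the two columns used by the aggregation are generated.
rng = np.random.default_rng(0)
n_rows = 190_000
fake = pd.DataFrame({
    "BAT_ID": ["p%03d" % i for i in rng.integers(0, 900, size=n_rows)],
    "H_FL": rng.integers(0, 5, size=n_rows),
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "fake_at_bats.csv")
    fake.to_csv(path, index=False)
    ranking, read_s, agg_s = time_stages(path)

print(f"read: {read_s:.3f}s, aggregate: {agg_s:.3f}s")
```

To try it on the real data, `time_stages("all2013.csv")` could be called directly instead of generating the synthetic file.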

Summary

With 77MB of data, neither one is notably faster than the other.

It is probably fine to use whichever one you are more used to.

That's all.
