Let's process the same tabular data in the same way with dplyr in R and pandas in Python.
Which is faster? I was curious, so I tried it out.
Let's build a ranking CSV of batters from the 2013 Major League at-bat result data (77 MB, about 190,000 lines).
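Only two columns are actually used below: BAT_ID (the batter ID) and H_FL. If H_FL is the Retrosheet hit-value field (0 = no hit, 1 through 4 = single through home run; an assumption here, since the file format isn't spelled out), then summing it per batter gives total bases. A minimal pandas sketch for peeking at those columns:
import pandas as pd
# Peek at the two columns used by the scripts below.
# Assumption: H_FL is the Retrosheet hit-value field (0 = no hit, 1-4 = single..HR).
peek = pd.read_csv('all2013.csv', usecols=['BAT_ID', 'H_FL'], nrows=5)
print(peek)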
The script dplyr.R, written with R's dplyr, is
library(data.table)
library(dplyr)
##Data read
dat = fread("all2013.csv")
##Aggregate
dat %>% select(BAT_ID, H_FL) %>% 
 group_by(BAT_ID) %>% 
 summarise(BASE = sum(H_FL)) %>% 
 arrange(desc(BASE)) %>% 
 write.csv("hoge.csv")
Like this.
> time R -f dplyr.R
R -f dplyr.R  3.13s user 0.15s system 99% cpu 3.294 total
With Python's pandas, the script pd.py is
#!/usr/bin/python
import pandas as pd
# Data read
df = pd.read_csv('all2013.csv')
# Aggregate: sum H_FL per batter, sort descending, write out
df[["BAT_ID", "H_FL"]].groupby("BAT_ID").sum().sort_values("H_FL", ascending=False).to_csv('hoge.csv')
Like this.
> time ./pd.py
./pd.py  3.12s user 0.40s system 98% cpu 3.567 total
3.29 seconds for dplyr, 3.56 seconds for pandas.
dplyr is a little better.
With 77MB of data, neither seems to be particularly fast.
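If you're curious whether the time goes into CSV parsing or into the aggregation itself, a rough sketch like this splits the measurement (same file and columns as the pandas script above):
import time
import pandas as pd
t0 = time.perf_counter()
df = pd.read_csv('all2013.csv')   # read
t1 = time.perf_counter()
df[["BAT_ID", "H_FL"]].groupby("BAT_ID").sum().sort_values("H_FL", ascending=False)   # aggregate
t2 = time.perf_counter()
print("read_csv:  %.2fs" % (t1 - t0))
print("aggregate: %.2fs" % (t2 - t1))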
You can probably just use whichever one you're more used to.
That's all.