Comparison of data frame handling in Python (pandas), R, and Pig

Yesterday I touched on Pig in a post titled "Basic grammar of Apache Pig (1)", so you might expect today's post to be "Basic grammar (2)". I considered it, but the reference link in the article from the day before yesterday (yutakikuchi / 20130107/1357514830) covers everything you need in this area, so the grammar series is over after one installment. It has suddenly reached its final inning.

Instead, today I would like to find out how differently pandas, R, and Pig, the tools I have covered so far, behave when handling a fairly large amount of data.

Data used for the test

Consider a text file made up of lines like the one below. The fields are date, primary key, store name, timestamp, human-readable timestamp, and a numeric value. The fields are separated by tabs.

20140205 XXXXXXAABBCC    Shop7 1391568621      2014-02-05 11:50:21 +0900       0
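
To make the record layout concrete, here is a minimal Python sketch that splits one such line into its six fields. The field names are borrowed from the Pig schema used later in this post; the Record type and the parse_line helper are hypothetical names introduced only for illustration, and the sketch assumes the fields really are separated by single tab characters.

```python
from typing import NamedTuple


class Record(NamedTuple):
    """One row of the sample file (hypothetical type for illustration)."""
    date: str
    key: str
    shop: str
    unixtime: int
    humantime: str
    times: int


def parse_line(line: str) -> Record:
    """Split a tab-delimited line into its six fields and convert the numbers."""
    date, key, shop, unixtime, humantime, times = line.rstrip("\n").split("\t")
    return Record(date, key, shop, int(unixtime), humantime, int(times))


# The sample line from above, with the tabs written out explicitly.
sample = "20140205\tXXXXXXAABBCC\tShop7\t1391568621\t2014-02-05 11:50:21 +0900\t0"
record = parse_line(sample)
print(record.shop, record.unixtime, record.times)
```

The human-readable timestamp contains spaces, which is why splitting on tabs rather than on arbitrary whitespace matters here.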

The file has about 100 million rows and is 7.5 gigabytes in size. This time, let's compute the mean of the numeric column and see which of pandas, R, and Pig handles the task best.

The computer used for the test has a Core i7 (Haswell) CPU and 32 GB of memory.

Python (pandas)

Python has a library for data frame manipulation called pandas. Unless something prevents you from using it, this is the one to use.

$ pip install pandas
$ ipython

In [1]: import pandas as pd
In [2]: df = pd.read_table('sample.txt', header=None)
In [3]: df.iloc[:, 5].mean()  # .ix was removed in pandas 1.0; iloc selects by position
Out[3]: 305.4479883399822

The key feature of pandas here is that read_table loads the whole file into memory. In other words, pandas is a good choice as long as the computer has enough memory to hold the data being processed.
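
If memory is tight, one possible workaround is to read only the needed column in chunks and accumulate the sum and count. This is a minimal sketch, not what was measured in this post: chunked_mean is a hypothetical helper, the tiny in-memory demo data stands in for sample.txt, and the chunk size is an arbitrary assumption.

```python
import io

import pandas as pd


def chunked_mean(source, column=5, chunksize=1_000_000):
    """Mean of one column, reading the tab-delimited input chunk by chunk."""
    total = 0.0
    count = 0
    reader = pd.read_csv(
        source,
        sep="\t",
        header=None,
        usecols=[column],     # parse only the numeric column
        chunksize=chunksize,  # returns an iterator of DataFrames
    )
    for chunk in reader:
        values = chunk[column]
        total += float(values.sum())
        count += int(values.count())  # count() skips missing values, like mean()
    return total / count if count else float("nan")


# Three made-up rows in the same layout as the sample file.
demo = io.StringIO(
    "20140205\tK1\tShop7\t1391568621\t2014-02-05 11:50:21 +0900\t0\n"
    "20140205\tK2\tShop7\t1391568622\t2014-02-05 11:50:22 +0900\t10\n"
    "20140205\tK3\tShop8\t1391568623\t2014-02-05 11:50:23 +0900\t20\n"
)
result = chunked_mean(demo, chunksize=2)
print(result)
```

For the real file, the call would look like chunked_mean('sample.txt'). Only one chunk is held in memory at a time, at the cost of giving up the convenience of a single in-memory data frame.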

R

When it comes to data frame operations, R is the first thing that comes to mind.

df <- read.table("sample.txt", sep="\t")
colMeans(df[6])
#=>     V6
#  305.448

As with pandas, R loads the data into memory when read.table is called, but with several gigabytes of data the loading is clearly slower than in pandas.

I used colMeans() to compute the mean, and here the execution speed of the statistical function was better than that of pandas.
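
Because loading and aggregation behave so differently, it helps to time the two phases separately. The following Python sketch shows one way to do that; the timed helper is a hypothetical name, and the list comprehension is only a stand-in for the real file read, so the numbers it prints say nothing about pandas or R.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock time of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}

with timed("load", timings):
    # Stand-in for reading the file (an assumption for this sketch).
    values = [float(i % 7) for i in range(100_000)]

with timed("mean", timings):
    mean = sum(values) / len(values)

for label, seconds in timings.items():
    print(f"{label}: {seconds:.4f} s")
```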

To summarize R: loading the data is slower than in pandas, but the statistical function itself ran faster.

Pig

Finally, Apache Pig. Since we are handling the text file on a single computer this time, we run it in local mode with pig -x local.

df = LOAD 'sample.txt' USING PigStorage('\t') AS (date: chararray, key: chararray, shop: chararray, unixtime: int, humantime: chararray, times: int);

grouped = group df all; 
times_mean = foreach grouped generate AVG(df.times);

dump times_mean;

#=> (305.4479883399822)

With Pig, entering LOAD and the statements that follow it does not load any data into memory, so the interactive shell responds instantly.

The MapReduce job runs only when the final dump times_mean; is entered.
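
This deferred execution can be imitated in Python with generators. The sketch below is only an analogy and not how Pig is implemented: load, project_times, and dump_mean are hypothetical names chosen to mirror the Pig statements, and the two input rows are made up.

```python
def load(lines):
    """Lazily split tab-delimited lines; nothing is read until iteration starts."""
    for line in lines:
        yield line.rstrip("\n").split("\t")


def project_times(rows, index=5):
    """Lazily pick the numeric column out of each row."""
    for row in rows:
        yield int(row[index])


def dump_mean(values):
    """Consume the pipeline and return the mean; this is where work happens."""
    total = 0
    count = 0
    for value in values:
        total += value
        count += 1
    return total / count


consumed = []  # records which input lines have actually been read


def source():
    for line in ["a\tb\tc\t1\td\t10\n", "a\tb\tc\t2\td\t20\n"]:
        consumed.append(line)
        yield line


# Building the pipeline reads nothing, just as LOAD and FOREACH return instantly.
pipeline = project_times(load(source()))
before = len(consumed)

# Only the final step pulls data through, like dump in Pig.
mean = dump_mean(pipeline)
after = len(consumed)
print(before, mean, after)
```

The difference is that Pig turns the deferred plan into MapReduce jobs, whereas this sketch simply streams rows through a single process.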

Summary

I think it is best to use pandas if a single computer is capable of processing the data, and Pig if the computer's performance is not enough for the data.
