Data analysis in Python: A note about line_profiler

line_profiler is useful

They say that speeches and skirts are better short.

Even in data analysis I want to run as many experiments as possible, so routine repetitive work such as preprocessing should take as little time as possible.

I think profiling is useful in such cases.

Recently I've been working, as a personal project, with data that is several tens of GB in size.

Through that work I made a few small discoveries about parallel processing, profiling, and so on, which I'd like to share.

The first is a discovery I made while profiling with line_profiler.

Many people have already written about line_profiler, so please look those articles up as well. It's a great project.

Profiling data aggregation processing

About the data

I can't show the data that was actually used, so we will proceed with sample data that has a similar structure.

In [1]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 3 columns):
key      100000 non-null int64
data1    100000 non-null int64
data2    100000 non-null int64
dtypes: int64(3)
memory usage: 3.1 MB
In [2]: df.head()
Out[2]: 
    key  data1  data2
0  1800   4153    159
1  5568   6852     45
2   432   7598    418
3  4254   9412    931
4  3634   8204    872

The actual data has tens of millions of rows; since this is sample data, it has been cut down to 100,000 rows.
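I can't publish the file itself, but as a rough sketch, stand-in data with the same structure could be generated like this and written to ./data/testdata.csv for the chunked reading below (the value ranges are only guessed from the head() output above):

import os
import numpy as np
import pandas as pd

os.makedirs('./data', exist_ok=True)

# Generate 100,000 rows with the same columns and dtypes as the sample
n = 100000
df = pd.DataFrame({
    'key': np.random.randint(0, 6000, n),      # assumed key range
    'data1': np.random.randint(0, 10000, n),   # assumed value range
    'data2': np.random.randint(0, 1000, n),    # assumed value range
})
df.to_csv('./data/testdata.csv', index=False)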

Aggregation processing

The aggregation uses the code below (written somewhat redundantly so that it can be profiled line by line).

import pandas as pd

def proc1():
    chunker = pd.read_csv('./data/testdata.csv', chunksize=10000)

    li = []
    for df in chunker:
        # Rename the columns to easier-to-use names
        df.rename(columns={'data1': 'value1', 'data2': 'value2'}, inplace=True)
        # Aggregate per key and take the sum of value1
        li.append(df.groupby('key')['value1'].sum())

    g = pd.concat(li, axis=1)
    return g.sum(axis=1)

A few notes about the code:

  1. Since the data does not fit in memory, it is read in chunks of the given size.
  2. The original column names are awkward to work with, so they are renamed to names that are easier to use.
  3. This aggregation does not use value2.

Use line_profiler

This time I'm using it in an IPython notebook.

%load_ext line_profiler

Once the extension is loaded, the %lprun magic command becomes available. Let's use it to measure proc1 above.

In [3]: %load_ext line_profiler

In [4]: %lprun -f proc1 proc1()
Timer unit: 1e-06 s

Total time: 0.060401 s
File: <ipython-input-105-0457ade3b36e>
Function: proc1 at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def proc1():
     2         1         1785   1785.0      3.0      chunker = pd.read_csv('./data/coltest.csv', chunksize=100000)
     3                                           
     4         1            2      2.0      0.0      li = []
     5         2        49155  24577.5     81.4      for df in chunker:
     6         1         1932   1932.0      3.2          df.rename(columns={'data': 'value1', 'data2': 'value2'}, inplace=True)
     7         1         4303   4303.0      7.1          li.append(df.groupby('key')['value1'].sum())
     8                                           
     9         1         2723   2723.0      4.5      g = pd.concat(li, axis=1)
    10         1          501    501.0      0.8      return g.sum(axis=1)

Look for bottlenecks

Until I ran line_profiler, I had assumed that reading the file in chunks was the slow part. It is indeed slow, but other parts also take a surprising amount of time.

a. In terms of % Time (percentage of the total), the df.rename(...) line takes about half as long as the groupby aggregation itself.

  - The column renaming does not need to be inside the loop.
  - If you want to rename at all, it is better to do it via read_csv's options when the file is read (see the sketch after this list).
  - Arguably, you don't need to rename in the first place.

b. Since the column value2 is not used, it would also be better not to read it in at all, using the usecols option of read_csv.
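As an illustration, here is a minimal sketch of how proc1 might look with both ideas applied, assuming the file's columns are key, data1, data2 as in the sample data. The name proc1_revised is mine; usecols skips data2, and no rename is done at all.

import pandas as pd

def proc1_revised():
    # Read only the columns that are actually used; data2 is never loaded
    chunker = pd.read_csv('./data/testdata.csv',
                          usecols=['key', 'data1'],
                          chunksize=10000)

    li = []
    for df in chunker:
        # Aggregate per key and take the sum of data1 (no rename inside the loop)
        li.append(df.groupby('key')['data1'].sum())

    g = pd.concat(li, axis=1)
    return g.sum(axis=1)

If the friendlier name value1 is still wanted, read_csv's names option (together with header=0) can assign new column names as the file is read, instead of calling rename on every chunk.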

In closing

I never expected the rename to take as much time as it did. I think line_profiler, which led to that discovery, is a very good tool.

Next, I would like to write up some working notes about parallel processing in IPython.
