Python Pandas is not suitable for batch processing

What is Pandas?

Pandas is a library for processing many kinds of data, built around a tabular data structure called the DataFrame. Because a DataFrame is much like a database table, you can get started right away if you know SQL. It is familiar to anyone who analyzes data with Python.

How was it introduced?

Much of this is hearsay, but I believe it made its way into the development team roughly as follows.

- The mix of on-premises and cloud systems kept growing, and database storage became more and more distributed.
- Managing data flows became an issue, so Luigi, which lets you define data flows in Python, was introduced.
- Initially, Luigi was meant to be responsible mainly for input and output against the database storage.
- Since the team's common language is Scala, the plan was to factor the logic out and implement it properly in that language.
- An environment was set up in which Luigi could connect to each database storage.
- Simple transfer and reporting data flows were organized in Luigi.
- Because this was convenient, migration and modification work moved forward, processing such as filtering, joining, and aggregation gradually crept in, and Pandas came to be used as a matter of course.
- Before we noticed, some batch processing had come to depend on Pandas, and we ran into a variety of pitfalls.

Main pitfalls :scream:

Missing value problem

Because the missing value NaN is treated as a float, the moment a missing value gets into an int column, the entire column is cast to float. Corrupted type information tends to cause problems, especially when the data is loaded into a database.

>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series([0, 1, 2])
>>> s[2]
2
>>> s[1] = np.nan
>>> s[2]
2.0
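
In batch jobs the same cast often happens implicitly, for example when a join leaves some rows without a match. The following is a minimal sketch of that case; the tables `users` and `orders` and their columns are hypothetical, made up for illustration. The last step assumes pandas 0.24 or later, where the nullable `Int64` dtype is available. That is newer than the Pandas 0.17 environment described in this post.

```python
import pandas as pd

# Hypothetical tables, for illustration only.
users = pd.DataFrame({"user_id": [1, 2, 3]})
orders = pd.DataFrame({"user_id": [1, 3], "item_count": [5, 7]})

print(orders["item_count"].dtype)  # int64

# User 2 has no matching order, so the left join fills item_count
# with NaN and the whole column is upcast to float64.
joined = users.merge(orders, on="user_id", how="left")
print(joined["item_count"].dtype)     # float64
print(joined["item_count"].tolist())  # [5.0, nan, 7.0]

# Assumption: pandas >= 0.24. The nullable "Int64" dtype keeps
# integer values while still representing the missing entry.
nullable = joined["item_count"].astype("Int64")
print(nullable.dtype)  # Int64
```

Whether the nullable dtype survives the trip to the database depends on the driver in use, so the column types are still worth checking before loading.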

Reference problem

With just a little indexing, you can end up in a situation where you cannot tell whether you are holding a view or a copy (!?)

def do_something(df, value):
    foo = df[['bar', 'baz']]  # Is foo a view? A copy? Nobody knows!
    # ... many lines here ...
    foo['quux'] = value       # We don't know whether this will modify df or not!
    return foo

In this situation, no amount of testing can guarantee quality. A warning may be emitted at runtime, but the only remedy is to call the copy method explicitly wherever things look suspicious ...
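
As a minimal sketch of that workaround, the function can take an explicit copy right after selecting the columns. The sample DataFrame and the extra `value` parameter are assumptions added for illustration.

```python
import pandas as pd


def do_something(df, value):
    # Explicit copy: foo is now independent of df,
    # so the assignment below cannot write back into df.
    foo = df[["bar", "baz"]].copy()
    foo["quux"] = value
    return foo


# Hypothetical input data, for illustration only.
df = pd.DataFrame({"bar": [1, 2], "baz": [3, 4], "other": [5, 6]})
result = do_something(df, 0)

print(list(df.columns))      # ['bar', 'baz', 'other']
print(list(result.columns))  # ['bar', 'baz', 'quux']
```

The price is an extra copy of the selected columns on every call, which can matter for large batch inputs.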

Sudden death

Looking at the logs of one particular batch, it dies about 1% of the time. Most of the failures are memory-related, and the core dumps keep piling up. Sometimes it also hangs.

*** glibc detected *** /usr/local/anaconda/bin/python: free(): invalid pointer:
Fatal Python error: GC object already tracked

＿人人人人人人＿
＞ Sudden death ＜
￣Y^Y^Y^Y^Y￣

Since this is a Python 2.7 and Pandas 0.17 environment, updating might fix it ....

What to do going forward :thinking:

For new development going forward, our policy is to avoid combining Pandas with Luigi as much as possible. In the end, Pandas is meant for analysis, and using it in batch jobs was not a good idea ...

That said, even for analysis, I personally consider the reference problem fatal, so from now on I will use Spark when I want a DataFrame. Note that although Spark code can be written in statically typed Scala, compile-time checks do not cover the schema operations, which are the part that matters most. There is also frameless, a library built on cats, but it is only a proof of concept.

Incidentally, Luigi assumes that each task is idempotent and produces a single output, so it may not fit some of the data flows you want to build. It also seems that Spotify, the developer of Luigi, has moved to Google Cloud Dataflow and is developing scio, a Scala wrapper library for it ....

Scio - A Scala API for Google Cloud Dataflow & Apache Beam
