[Python] On the inefficiency of in-memory data transfer in luigi

In a comment on "Parameter tuning with luigi", I was told that data can be passed in memory using luigi.mock, so I tried it. This is that story. Given the name "mock", I suspected that it merely simulates file I/O in memory and would not be very efficient, and that turned out to be the case.

In-memory data transfer code

The code used this time is at https://github.com/keisuke-yanagisawa/study/blob/20151208/luigi/mock_test.py. Run python mock_test.py main --use mock to try the mock version, or python mock_test.py main to run the version without mock.

As you can see, the code creates a CSV of 10,000,000 "1"s separated by commas, reads it back, and counts the number of characters. The final output is 19999999 (10,000,000 digits plus 9,999,999 commas). Creating the array takes some time, but otherwise that is roughly all the code does. Even so, it produced a clear difference in the time measurements below.
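To make the workload concrete, here is a minimal sketch of the same steps using only the standard library. It is not the linked mock_test.py and does not use luigi at all; the function names and the temporary file are assumptions made for illustration.

```python
import os
import tempfile


def write_ones_csv(path, n):
    """Write n copies of "1" separated by commas (no trailing newline)."""
    with open(path, "w") as f:
        f.write(",".join(["1"] * n))


def count_chars(path):
    """Read the whole file back and return its length in characters."""
    with open(path) as f:
        return len(f.read())


def run_workload(n):
    """Write the CSV to a temporary file, read it back, and count characters."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "ones.csv")  # hypothetical file name
        write_ones_csv(path, n)
        return count_chars(path)


if __name__ == "__main__":
    # n ones plus (n - 1) commas gives 2 * n - 1 characters.
    print(run_workload(10))
```

With n = 10 this counts 19 characters; by the same 2 * n - 1 rule, n = 10,000,000 gives the 19999999 reported above.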

Time measurement result

Here are the results right away. I ran each version three times and measured it with the time command.
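For anyone who prefers to repeat the measurement from Python, the following is a rough sketch that runs a command several times and records the wall-clock time of each run. It is a stand-in for the time command, not what was actually used here, and the helper name time_command is hypothetical.

```python
import subprocess
import sys
import time


def time_command(argv, repeats=3):
    """Run argv `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(argv, check=True)
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    # Placeholder command so the sketch runs anywhere. To reproduce the
    # measurement in this post, pass something like
    # [sys.executable, "mock_test.py", "main", "--use", "mock"] instead.
    for i, seconds in enumerate(time_command([sys.executable, "-c", "pass"]), 1):
        print(f"run {i}: {seconds:.3f} sec")
```

Like the time command, this includes interpreter startup in every run, so the numbers are comparable in spirit but will not match exactly.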

Run           luigi.LocalTarget    luigi.mock.MockTarget
First run     10.952 sec           29.879 sec
Second run     7.829 sec           30.883 sec
Third run     11.137 sec           27.766 sec

The result speaks for itself. Even for a mock, I did not expect it to be this slow. As the official documentation explains, it seems to be a mechanism intended for testing.

So, for everyday use, let's just go ahead and write files to disk.
