[Python] Strengths and weaknesses of DataFrame in terms of time required

Introduction
In Python, pandas.DataFrame handles 2D tabular data. It makes tabular data easier to work with than list and tuple, which correspond to arrays in other languages, and for whole-column operations it is generally faster as well. Let's look at the strengths and weaknesses of DataFrame by actually running some code.

Execution environment / conditions
Execution environment
・ Windows10 Home 64bit
Conditions
・ Data used: CSV data of 10,000 rows
・ Columns of the data used: index, student ID, scores for 5 subjects A-E (0-10000)
* Modeled on a list of grades for 10,000 students who took a certain exam

DataFrame strengths
The strengths of DataFrame are that operations are easy to write and that they run fast.

When calculating the average value
Consider appending a column and storing each row's mean score in it. With a list, the processing looks like this.
# `list` is the variable holding the rows read from the CSV (it shadows the built-in name)
for row in list:
    # columns 2-6 hold the scores for subjects A-E; the new column 7 holds their mean
    row.append(str((float(row[2]) + float(row[3]) + float(row[4]) + float(row[5]) + float(row[6])) / 5.0))

This reads much like the equivalent loop in other languages. It works, but with a DataFrame the same thing takes just the one line below.

df['average'] = (df['subjectA'] + df['subjectB'] + df['subjectC'] + df['subjectD'] + df['subjectE'])/5

Because the whole column is handled in a single expression, there is less room for coding mistakes.
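As a minimal sketch of the same idea, the five subject columns can also be averaged with DataFrame.mean(axis=1) instead of adding them by hand. The three-row table and the student_id column name are made up for illustration; only the subject column names follow the line above.

```python
import pandas as pd

# Hypothetical miniature of the grade table (not the article's real data).
df = pd.DataFrame({
    "student_id": [1, 2, 3],
    "subjectA": [50, 80, 10],
    "subjectB": [60, 70, 20],
    "subjectC": [70, 60, 30],
    "subjectD": [80, 50, 40],
    "subjectE": [90, 40, 50],
})

subject_cols = ["subjectA", "subjectB", "subjectC", "subjectD", "subjectE"]

# Row-wise mean over the five subject columns, with no Python-level loop.
df["average"] = df[subject_cols].mean(axis=1)

print(df[["student_id", "average"]])
```

Listing the columns once in subject_cols keeps the expression short if subjects are added or removed later.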

When sorting
To sort in descending order of the average computed above, the list version looks like this.
# column 7 holds the average as a string, so convert it to float to sort numerically
list = sorted(list, key=lambda x: float(x[7]), reverse=True)

Sorting takes one line even with a list. With a DataFrame it is also one line.

# sort_values returns a new DataFrame, so assign the result
df = df.sort_values('average', ascending=False)
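The following small sketch shows how sort_values behaves: it returns a sorted copy and leaves the original DataFrame untouched unless the result is assigned. The data and the name ranked are hypothetical.

```python
import pandas as pd

# Hypothetical three-student table with a precomputed average column.
df = pd.DataFrame({"student_id": [1, 2, 3], "average": [60.0, 70.0, 30.0]})

# Sort in descending order of average; reset_index renumbers rows 0..n-1
# so that the new index reflects the ranking order.
ranked = df.sort_values("average", ascending=False).reset_index(drop=True)

print(ranked)
```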
When narrowing down by conditions
To keep only the rows with an average score of 50 points or more, the list version looks like this.
list2 = []
for row in list:
    # column 7 holds the average as a string
    if 50 <= float(row[7]):
        list2.append(row)

This uses a for loop, just as the average calculation did. With a DataFrame, as you might expect, it is one line.

# "50 points or more" includes exactly 50, so use <=
df2 = df[50 <= df['average']]
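The one-line filter works through a boolean mask: the comparison produces one True/False value per row, and indexing with that mask keeps only the True rows. A minimal sketch with made-up data (the name mask is just for illustration):

```python
import pandas as pd

# Hypothetical table; the second student sits exactly on the 50-point boundary.
df = pd.DataFrame({
    "student_id": [1, 2, 3, 4],
    "average": [70.0, 50.0, 49.9, 30.0],
})

# Comparing a column with a scalar yields a boolean Series, one value per row.
mask = df["average"] >= 50

# Indexing with the mask keeps only the rows where it is True.
df2 = df[mask]

print(df2)
```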
The speed of each process is ...
Each of these processes was timed 10 times, and the average times are as follows.

| Process | Average of 10 runs, list [sec] | Average of 10 runs, DataFrame [sec] |
| --- | --- | --- |
| Average calculation | 0.764768385887146 | 0.01179955005645752 |
| Sort | 0.030899477005004884 | 0.011399650573730468 |
| Narrow down | 0.04529948234558105 | 0.006699275970458984 |

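Not only is the code simpler with DataFrame, it is also faster in all three cases: roughly 65 times faster for the average calculation, about 3 times for sorting, and about 7 times for narrowing down.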
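The article does not show how the timings were taken, so the following is only an assumed approach: a small helper that runs a function 10 times with time.perf_counter and returns the mean. The name average_seconds and the workload passed to it are hypothetical.

```python
import time


def average_seconds(func, repeats=10):
    """Call func() `repeats` times and return the mean elapsed time in seconds."""
    total = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        total += time.perf_counter() - start
    return total / repeats


# Stand-in workload; in practice this would be the list or DataFrame process.
elapsed = average_seconds(lambda: sum(range(100_000)))
print(f"{elapsed:.6f} [sec] on average over 10 runs")
```

Averaging over several runs smooths out one-off delays, which matters when the measured times are only a few milliseconds.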

DataFrame weaknesses
The weakness of DataFrame is the for loop. Computing the mean value row by row in a for loop looks like the code below.
for idx in range(len(df)):
    # columns 1-5 hold the scores for subjects A-E; column 6 holds their mean
    df.iat[idx, 6] = str((float(df.iat[idx, 1]) + float(df.iat[idx, 2])
        + float(df.iat[idx, 3]) + float(df.iat[idx, 4]) + float(df.iat[idx, 5])) / 5.0)

Averaged over 10 runs, this takes 2.33 [s], which is slower than even the list version. Therefore, when working with a DataFrame, it is best to avoid for loops as much as possible.

If you really want to use for in DataFrame
Still, there are situations where a for loop over a DataFrame is needed. In that case, you can speed things up by converting only the part that uses for into a list or ndarray, or by using a method like the one in the post below.
[[Python3 / pandas] Speed improvement measures when you really want to process DataFrame line by line](https://qiita.com/siruku6/items/0633db690283a0f525ad)
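One way the ndarray approach could look is sketched below: the values are pulled out once with DataFrame.to_numpy(), the loop runs over plain array rows, and the results are written back in a single column assignment. The two-row table is made up, and this is not necessarily the method used in the linked post.

```python
import pandas as pd

# Hypothetical two-student table with the five subject columns.
df = pd.DataFrame({
    "subjectA": [50, 80],
    "subjectB": [60, 70],
    "subjectC": [70, 60],
    "subjectD": [80, 50],
    "subjectE": [90, 40],
})

# Extract the values once, so the loop never goes through DataFrame indexing.
scores = df.to_numpy()

averages = []
for row in scores:
    # Each row is a 1D ndarray holding one student's five scores.
    averages.append(float(row.sum()) / 5.0)

# Write all results back at once instead of cell by cell.
df["average"] = averages

print(df)
```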

About sample data creation
The sample data used this time was created with the code at the following URL.
https://github.com/HagiAyato/PythonTests/blob/main/make10000data.py
