TL;DR

This post summarizes why you should use apply() when preprocessing data with Pandas.

What is Pandas

http://pandas.pydata.org/pandas-docs/stable/ pandas is a library that provides the tools needed for data analysis in Python. It handles a wide range of data, from time series to tabular data, and can aggregate it quickly. These days you often hear about people "analyzing data with Python".

Data analysis work at our company

When we do data analysis at Moff, we often extract the raw data first and then process and analyze it on the spot. Once you have sketched out the table you want to end up with, you need to run some preprocessing to get there.

What beginners tend to do

For example, suppose you have the following dataset.

[Screenshot: the sample dataset, which includes a name column]

Now suppose the preprocessing you want is "add a new column holding the number of characters in name".

Code like the following turned up often, especially among interns who had just started programming.

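data['len'] = 0
# Series.iteritems() was removed in pandas 2.0; items() is the equivalent
for k, d in data['name'].items():
    # chained assignment, one element at a time (slow, and discouraged by pandas)
    data['len'][k] = len(d)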

This does produce the correct output, and with only a few rows of data it does not take much time.

Doing it with apply()

Now let's do the same thing with apply(). apply() is a method provided by both the DataFrame and Series types. It calls the function you pass as its argument on each value of a Series (or on each row or column of a DataFrame, including a grouped one) and returns the collected results. The result can be a Series, a DataFrame, or a scalar value.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html

For the previous example, if you write a function that returns the length, it looks like this.

def return_len(x):
    return len(x)

data['len'] = data['name'].apply(return_len)

If you know a lambda expression (anonymous function), you can also do the following:

data['len'] = data['name'].apply(lambda x: len(x))

I think this is relatively concise to write.

As you may have noticed, the function I wrote does nothing but return the result of len(). Since len() is itself a function, you can pass it directly:

data['len'] = data['name'].apply(len)

A simple task like this can be handled with a for loop, but for more complicated preprocessing it is better to write the logic as a function and apply it with apply(). I think mistakes are also cheaper to notice that way.
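As a minimal sketch of that idea, the example below puts a slightly more involved rule into its own function and applies it. The sample rows and the clean_len helper are made up for illustration; they are not from the original dataset.

```python
import pandas as pd

# Hypothetical sample data; the original dataset is only shown as a screenshot.
data = pd.DataFrame({"name": ["  Alice ", "Bob", "Charlie Brown"]})


def clean_len(name: str) -> int:
    """Return the length of a name after trimming surrounding whitespace."""
    return len(name.strip())


# The function can be checked on a single value before touching the table.
assert clean_len("  Alice ") == 5

# Raw length, as in the post, and the cleaned length side by side.
data["len"] = data["name"].apply(len)
data["clean_len"] = data["name"].apply(clean_len)

print(data)
```

Because the rule lives in a plain function, it can be tested on one string at a time, which is where the "cheaper to notice mistakes" benefit comes from.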

What about processing speed?

The example above has so little data that the difference in processing speed is hard to see, so here are the results of computing string lengths over randomly generated string data. The test environment was as follows.

MacBook Pro (13-inch, 2016, Four Thunderbolt 3 Ports)
macOS Mojave 10.14.5
Processor 2.9GHz Intel Core i5
Memory 8GB 2133 MHz LPDDR3

I measured the time taken to finish adding a new column whose values are the lengths of the strings (that is, the process shown above).

# of data	for loop	pandas.DataFrame.apply()
100	3.907sec	0.079sec
10000	415.032sec	0.231sec
100000	3100.906sec	1.283sec

As you can see, even at around 100 rows the difference is about 3.8 seconds, and at 10,000 or 100,000 rows the gap in time to get a result becomes enormous. For preprocessing like this, apply() is clearly the quicker way to get a result, whatever the size of the dataset.
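If you want to try a comparison like this yourself, here is a rough sketch of a timing harness. It is not the script used for the table above: make_data, with_loop, and with_apply are hypothetical helpers, the loop writes with .at rather than the chained assignment shown earlier so that it runs on current Pandas versions, and the numbers will differ by machine and Pandas version.

```python
import random
import string
import timeit

import pandas as pd


def make_data(n: int, seed: int = 0) -> pd.DataFrame:
    """Build n random strings of length 1 to 20 (hypothetical test data)."""
    rng = random.Random(seed)
    names = [
        "".join(rng.choices(string.ascii_letters, k=rng.randint(1, 20)))
        for _ in range(n)
    ]
    return pd.DataFrame({"name": names})


def with_loop(data: pd.DataFrame) -> pd.DataFrame:
    """Fill the new column one row at a time in a Python for loop."""
    out = data.copy()
    out["len"] = 0
    for k, d in out["name"].items():
        out.at[k, "len"] = len(d)
    return out


def with_apply(data: pd.DataFrame) -> pd.DataFrame:
    """Fill the new column in one call to apply()."""
    out = data.copy()
    out["len"] = out["name"].apply(len)
    return out


data = make_data(1_000)
loop_sec = timeit.timeit(lambda: with_loop(data), number=3)
apply_sec = timeit.timeit(lambda: with_apply(data), number=3)
print(f"for loop: {loop_sec:.3f} sec, apply(): {apply_sec:.3f} sec")
```

Both helpers work on a copy so that repeated timing runs start from the same input.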

What is apply() actually doing?

For example, here is the body of DataFrame.apply: https://github.com/pandas-dev/pandas/blob/5a7b5c958c6f004c395136f9438a1ed6c04861dd/pandas/core/frame.py#L6440

If you follow frame_apply(),

https://github.com/pandas-dev/pandas/blob/5a7b5c958c6f004c395136f9438a1ed6c04861dd/pandas/core/apply.py#L26

the actual work appears to be implemented in a class called FrameApply.

https://github.com/pandas-dev/pandas/blob/5a7b5c958c6f004c395136f9438a1ed6c04861dd/pandas/core/apply.py#L56

DataFrame.apply calls get_result() on the object returned by frame_apply(), so let's trace that next.

https://github.com/pandas-dev/pandas/blob/5a7b5c958c6f004c395136f9438a1ed6c04861dd/pandas/core/apply.py#L144

There are branches depending on the parameters, but unless optional arguments are given, the following appears to be called.

https://github.com/pandas-dev/pandas/blob/5a7b5c958c6f004c395136f9438a1ed6c04861dd/pandas/core/apply.py#L269

Inside apply_standard(), libreduction.compute_reduction() appears to do the core of the computation. After that, apply_series_generator() and wrap_results() appear to produce the result. libreduction lives in pandas/_libs/reduction.pyx.

https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/reduction.pyx

In Reducer, it appears to split the matrix data according to numpy.ndarray.shape and then do the work inside the loop starting at for i in range(self.nresults):.

https://github.com/pandas-dev/pandas/blob/5a7b5c958c6f004c395136f9438a1ed6c04861dd/pandas/_libs/reduction.pyx#L88

From here on it seems to be NumPy's territory. Browsing the code above, I found nothing in Pandas itself that is specially devised for the computation, so any such work appears to happen on the NumPy side. In the end, the Reducer simply does res = self.f(chunk), and I could not find anything specially devised in the processing that follows applying the function to a NumPy ndarray. There is code that handles the data chunk by chunk, but at the point of function application, res = self.f(chunk), the details of the actual processing are not visible, and everything after that appears to be completed on the NumPy side. This fits the commonly repeated claim that computation in NumPy is faster than computation in pure Python alone.
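To make the shape of that loop concrete, here is a deliberately simplified sketch. It is not the Pandas implementation: naive_apply is a hypothetical helper that only mirrors the pattern described above, namely pulling out the underlying NumPy array, calling the function once per element, and storing each result in a preallocated array.

```python
import numpy as np
import pandas as pd


def naive_apply(series: pd.Series, func) -> pd.Series:
    """Hypothetical stand-in for an element-wise apply, for illustration only."""
    values = series.to_numpy()                      # underlying NumPy array
    results = np.empty(len(values), dtype=object)   # preallocated result buffer
    for i in range(len(values)):
        results[i] = func(values[i])                # analogous to res = self.f(chunk)
    # Let pandas pick a tighter dtype (for example int64) for the collected results.
    return pd.Series(results, index=series.index).infer_objects()


names = pd.Series(["Alice", "Bob", "Charlie"])
print(naive_apply(names, len))
```

Even in this toy version the user function is still called once per element in Python; what the sketch avoids is the per-row label lookup and assignment on the DataFrame that the for loop version performs.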

I would like to dig a little deeper, but this post is already long, so I will pin down the processing on the NumPy side another time. At the very least, the speed-up does not seem to come from Pandas itself. (It is still unclear whether it comes from language-level performance or from how the computation is structured.)

Related: map, apply, and applymap

As relatives of apply(), Series has map() and DataFrame has applymap(). At first glance I was not sure how they differ, so I did a quick search and found the differences summarized in the Stack Overflow answer below. https://stackoverflow.com/a/56300992/7649377

Picking out only the main points:

apply() can be used on both DataFrame and Series, applymap() only on DataFrame, and map() only on Series.
Series.map() also accepts a dict or a Series as its argument, in which case each value is mapped according to the keys of that argument.
apply() can also be used on grouped DataFrames and Series, for aggregation such as group by.

That seems to be the gist.
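The differences are easier to see side by side. The snippet below is a small sketch on made-up data (the team column and all values are hypothetical). One detail not mentioned above: DataFrame.applymap() was renamed to DataFrame.map() in pandas 2.1, so the sketch picks whichever name the installed version provides.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"team": ["a", "b", "a"], "name": ["Alice", "Bob", "Charlie"]})

# Series.map with a dict: each value is looked up by key.
team_label = df["team"].map({"a": "Team A", "b": "Team B"})

# Series.apply with a function: called once per value.
name_len = df["name"].apply(len)

# DataFrame.apply with axis=1: the function receives a whole row.
summary = df.apply(lambda row: f"{row['team']}:{row['name']}", axis=1)

# Element-wise over every cell of a DataFrame.
# applymap was renamed to DataFrame.map in pandas 2.1, so use whichever exists.
elementwise = df.map if hasattr(df, "map") else df.applymap
upper = elementwise(str.upper)

# apply on a grouped Series: the function receives one group at a time.
len_by_team = df.groupby("team")["name"].apply(lambda s: s.str.len().sum())

print(team_label, name_len, summary, upper, len_by_team, sep="\n\n")
```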

Summary

Just by getting used to apply(), you can make preprocessing work much faster.
What apply() does is implemented in Pandas, but following the code leads to array processing in NumPy, so reading the NumPy code is necessary to understand it fully.

Next time I write on this topic, I will read the NumPy code and follow up. Thank you very much.

[PYTHON] Why you should use Pandas apply()