[PYTHON] Be careful when assigning Series as a column to pandas.DataFrame

When you assign a series as a column of a pandas data frame, the values are aligned on the index, so it behaves like a join rather than a positional copy. Be careful.

Introduction

When processing data with pandas, adding columns to a data frame is a frequent operation. There are two main ways to add a column to a data frame.

  1. Add column by specifying column name
  2. Add with pd.DataFrame.assign method

**1. Add column by specifying column name**

df['new_col'] = data

**2. Add column using assign method**

df.assign(new_col=data)

In either case, you can pass a single value, or a list, np.array, pd.Series, and so on whose size equals the number of rows.
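As a rough illustration of the two approaches, here is a minimal sketch. The column names (`from_list`, `from_array`, and so on) are made up for this example. Note that item assignment modifies the data frame in place, while `assign` returns a new data frame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# 1. Item assignment modifies df in place.
df["from_list"] = [10, 20, 30]
df["from_array"] = np.array([0.1, 0.2, 0.3])
df["from_scalar"] = 0  # a single value is broadcast to every row

# 2. assign returns a new data frame and leaves df unchanged.
#    The series here has the default index 0, 1, 2, the same as df.
df2 = df.assign(from_series=pd.Series(["x", "y", "z"]))

print("from_series" in df.columns)   # False
print("from_series" in df2.columns)  # True
```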

Counterintuitive behavior that occurs when assigning a series

Prepare an arbitrary data frame and a series with the same number of records.

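import pandas as pd

# The data frame uses the index 1, 2, 3 ...
df = pd.DataFrame(
    [[1,2,3], [4,5,6], [7,8,9]],
    columns=['a', 'b', 'c'],
    index=[1,2,3]
)
# ... while the series gets the default index 0, 1, 2.
sr = pd.Series([-1, -2, -3])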

df
#   	a 	b 	c
# 1 	1 	2 	3
# 2 	4 	5 	6
# 3 	7 	8 	9

sr
# 0   -1
# 1   -2
# 2   -3
# dtype: int64

If you want to add the data of sr to df as a new column 'd', you would do the following.

df = df.assign(d=sr)

You would expect the following table to be created.

   a  b  c  d
1  1  2  3 -1
2  4  5  6 -2
3  7  8  9 -3

In reality, however, the following data frame is returned. (Because of the NaN, column d becomes a float column.)

   a  b  c    d
1  1  2  3 -2.0
2  4  5  6 -3.0
3  7  8  9  NaN

What's going on

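Comparing the data frame and the series again, you can see that their indexes do not match: df has the index 1, 2, 3, while sr has 0, 1, 2. With such data, even a plain assignment behaves like a join on the index. Only the labels 1 and 2 exist on both sides, so label 3 gets NaN and the value at label 0 is dropped.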
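One way to spot this kind of mismatch before assigning is to compare the two indexes directly. The following is only a sketch using the same df and sr as above:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    columns=["a", "b", "c"],
    index=[1, 2, 3],
)
sr = pd.Series([-1, -2, -3])

# Do both objects carry exactly the same index?
print(df.index.equals(sr.index))              # False

# Labels present on both sides: these rows receive a value.
print(list(df.index.intersection(sr.index)))  # [1, 2]

# Labels only in df: these rows become NaN.
print(list(df.index.difference(sr.index)))    # [3]
```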

Workaround

This can be avoided by passing the values as an np.array, which carries no index.

df.assign(new_col=new_series.values)

Note: According to the official documentation, the to_numpy method is recommended over the values attribute. This seems to be intended to clearly distinguish the result from the `ExtensionArray` added in pandas 0.24.
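For reference, the workaround written with to_numpy might look like this. It is a sketch using the df and sr from earlier; the column name `d` follows the example above.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    columns=["a", "b", "c"],
    index=[1, 2, 3],
)
sr = pd.Series([-1, -2, -3])

# to_numpy() drops the index, so the values are assigned by position.
out = df.assign(d=sr.to_numpy())

print(out["d"].tolist())  # [-1, -2, -3]
```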

Why this happens

First, if the value passed to the pd.DataFrame.assign method is not callable, assign internally just performs method 1 shown at the beginning (on a copy of the data frame). Therefore, the phenomenon described here occurs with either method.

(Incidentally, "the passed value is callable" refers to cases such as passing a lambda expression that computes the new column from the columns of the data frame itself.) [^callable]
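A minimal sketch of the callable case, with made-up column names: the lambda receives the data frame itself, so the series it returns already shares the data frame's index and no misalignment occurs.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [10, 20]}, index=[5, 6])

# The callable is invoked with the data frame; its result becomes the new column.
out = df.assign(total=lambda d: d["a"] + d["b"])

print(out["total"].tolist())  # [11, 22]
```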

When you write an assignment such as df['X'] = value, pd.DataFrame.__setitem__() is called. Following the code from there, I found the following docstring. [^setitem]

        """
        Add series to DataFrame in specified column.
        If series is a numpy-array (not a Series/TimeSeries), it must be the
        same length as the DataFrames index or an error will be thrown.
        Series/TimeSeries will be conformed to the DataFrames index to
        ensure homogeneity.
        """

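In other words, the docstring states that the passed data is handled as follows: a numpy-array (anything that is not a series) must have the same length as the data frame's index, otherwise an error is raised, whereas a series is conformed to the data frame's index.

If you follow the code further, you can see that the series is reindexed to the index of the data frame before it is added. [^reindex]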
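The effect can be checked from the outside without reading the pandas source. As a sketch, the column produced by assigning the series matches what `sr.reindex(df.index)` returns:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    columns=["a", "b", "c"],
    index=[1, 2, 3],
)
sr = pd.Series([-1, -2, -3])

assigned = df.assign(d=sr)["d"]
reindexed = sr.reindex(df.index)  # labels missing from sr become NaN

print(assigned.tolist())   # [-2.0, -3.0, nan]
print(reindexed.tolist())  # [-2.0, -3.0, nan]
```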

The surprise came from thinking of a series in the same way as an array. As you can see below, a series does not even need to have the same number of records as the data frame it is added to; in this respect it is completely different from a list or array.

df.assign(
    x=pd.Series([3], index=[2])
)
#  	 	a 	b 	c 	x
# 1 	1 	2 	3 	NaN
# 2 	4 	5 	6 	3.0
# 3 	7 	8 	9 	NaN
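To make the contrast concrete, here is a sketch comparing an array and a series of the wrong length. The array is rejected with a ValueError, while the series is accepted silently and padded with NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    columns=["a", "b", "c"],
    index=[1, 2, 3],
)

# An array with 2 elements does not fit 3 rows: pandas raises ValueError.
try:
    df.assign(x=np.array([3, 4]))
    raised = False
except ValueError:
    raised = True
print(raised)  # True

# A series with 2 elements is accepted; only label 1 matches, the rest is NaN.
out = df.assign(x=pd.Series([3, 4]))
print(int(out["x"].isna().sum()))  # 2
```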

Summary

Don't add a series to a data frame as if it were just one-dimensional data of the same size, like a list or array. When you assign a series to a column of a data frame, processing continues without an error even if the size is different, so the risk of introducing a bug without noticing it increases. As a reminder to myself, I will always convert to a numpy-array before assigning.
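One possible way to enforce this habit is a small wrapper. The function name `assign_by_position` is hypothetical and not part of pandas; this is only a sketch of the idea, under the assumption that positional assignment is always what you want.

```python
import numpy as np
import pandas as pd


def assign_by_position(df, name, values):
    """Hypothetical helper: add `values` as column `name` by position.

    Any index on `values` is discarded, and a length mismatch raises
    an error instead of silently producing NaN.
    """
    arr = np.asarray(values)  # a series loses its index here
    if arr.ndim != 1 or len(arr) != len(df):
        raise ValueError(
            f"expected 1-D data of length {len(df)}, got shape {arr.shape}"
        )
    return df.assign(**{name: arr})


df = pd.DataFrame({"a": [1, 4, 7]}, index=[1, 2, 3])
sr = pd.Series([-1, -2, -3])

out = assign_by_position(df, "d", sr)
print(out["d"].tolist())  # [-1, -2, -3]
```

Passing a series of a different length to this helper raises ValueError, which is the behavior you already get for lists and arrays.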
