[PYTHON] Pandas memo ~ None, np.nan, empty string ~

I got tripped up by how pandas handles None and np.nan, so here are my personal notes.

The environments I verified are as follows (the results did not differ between them)

Summary

| | None | np.nan | Empty string |
|---|---|---|---|
| DataFrame conversion | Converted to np.nan unless dtype=object is specified | np.nan cannot be converted to int, so columns containing np.nan basically become float | Treated as a string (non-numeric) value, so it is not treated as a missing value; columns containing an empty string basically become object |
| read_csv | - | Both empty fields and empty strings in the CSV are read as np.nan, regardless of the dtype specified | - |
| fillna, dropna | Judged as a missing value | Judged as a missing value | Not judged as a missing value |
| groupby | Judged as a missing value and ignored | Judged as a missing value and ignored | Not judged as a missing value |
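
The fillna/dropna row can be checked directly with pd.isna, pandas' own missing-value check. A minimal sketch (the variable names are only for illustration):

```python
import numpy as np
import pandas as pd

# pd.isna reports whether pandas treats a scalar as missing.
none_missing = pd.isna(None)     # None counts as missing
nan_missing = pd.isna(np.nan)    # np.nan counts as missing
empty_missing = pd.isna("")      # an empty string is an ordinary value

print(none_missing, nan_missing, empty_missing)
```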

Verification results

DataFrame conversion by specifying dtype

Check how each column's type changes when the following data is passed to DataFrame with different dtype values


import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        # Column A: int + None
        "A": [1, 2, 3, None],
        # Column B: str + empty string
        "B": ["1", "2", "3", ""],
        # Column C: int + np.nan
        "C": [1, 2, 3, np.nan],
        # Column D: int only
        "D": [1, 2, 3, 4]
    }
)

dtype not specified

None appears to be converted to np.nan. As a result, the columns containing np.nan become float64.


df = pd.DataFrame(
    {
        "A": [1, 2, 3, None],
        "B": ["1", "2", "3", ""],
        "C": [1, 2, 3, np.nan],
        "D": [1, 2, 3, 4]
    }
)

print(df)

     A  B    C  D
0  1.0  1  1.0  1
1  2.0  2  2.0  2
2  3.0  3  3.0  3
3  NaN     NaN  4

print(df.dtypes)

A    float64
B     object
C    float64
D      int64
dtype: object

print(df.values)

array([[1.0, '1', 1.0, 1],
       [2.0, '2', 2.0, 2],
       [3.0, '3', 3.0, 3],
       [nan, '', nan, 4]], dtype=object)
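
The promotion to float can be confirmed on a smaller frame. This is only a sketch; the two-column frame is a cut-down version of the data above, and a_dtype and last_a are illustrative names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, None], "D": [1, 2, 3, 4]})

# Without a dtype, None is stored as NaN and the column is promoted to float64.
a_dtype = df["A"].dtype
last_a = df.loc[3, "A"]

print(a_dtype, last_a)     # column A: float64, last value NaN
print(df["D"].dtype)       # column D has no missing value and stays an integer dtype
```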

Specify object

No values are changed; None stays None.


df = pd.DataFrame(
    {
        "A": [1, 2, 3, None],
        "B": ["1", "2", "3", ""],
        "C": [1, 2, 3, np.nan],
        "D": [1, 2, 3, 4]
    },
    dtype=object
)

print(df)

      A  B    C  D
0     1  1    1  1
1     2  2    2  2
2     3  3    3  3
3  None     NaN  4

print(df.dtypes)

A    object
B    object
C    object
D    object
dtype: object

print(df.values)

array([[1, '1', 1, 1],
       [2, '2', 2, 2],
       [3, '3', 3, 3],
       [None, '', nan, 4]], dtype=object)
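
Even though None is kept as a Python object here, pandas should still count it as missing. A small sketch of that check, using a one-column version of the data (kept_none and counted_missing are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, None]}, dtype=object)

# With dtype=object the Python object None is stored as-is ...
kept_none = df.loc[3, "A"] is None
# ... but isna() still reports it as missing.
counted_missing = int(df["A"].isna().sum())

print(kept_none, counted_missing)
```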

Specify float

The empty string cannot be converted to float, so only the column containing the empty string becomes object type.


df = pd.DataFrame(
    {
        "A": [1, 2, 3, None],
        "B": ["1", "2", "3", ""],
        "C": [1, 2, 3, np.nan],
        "D": [1, 2, 3, 4]
    },
    dtype=float
)

print(df)

     A  B    C    D
0  1.0  1  1.0  1.0
1  2.0  2  2.0  2.0
2  3.0  3  3.0  3.0
3  NaN     NaN  4.0

print(df.dtypes)

A    float64
B     object
C    float64
D    float64
dtype: object

print(df.values)

array([[1.0, '1', 1.0, 1.0],
       [2.0, '2', 2.0, 2.0],
       [3.0, '3', 3.0, 3.0],
       [nan, '', nan, 4.0]], dtype=object)

Specify int

Columns that cannot be converted to int64 (those containing np.nan, None, or an empty string) become object type.


df = pd.DataFrame(
    {
        "A": [1, 2, 3, None],
        "B": ["1", "2", "3", ""],
        "C": [1, 2, 3, np.nan],
        "D": [1, 2, 3, 4]
    },
    dtype=int
)

print(df)

      A  B    C  D
0     1  1    1  1
1     2  2    2  2
2     3  3    3  3
3  None     NaN  4

print(df.dtypes)

A    object
B    object
C    object
D     int64
dtype: object

print(df.values)

array([[1, '1', 1, 1],
       [2, '2', 2, 2],
       [3, '3', 3, 3],
       [None, '', nan, 4]], dtype=object)
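
If the goal is an integer column that still allows missing values, one option that was not part of the checks above is pandas' nullable integer dtype "Int64" (capital I). The following is a sketch under the assumption that the installed pandas provides this extension dtype (0.24 or later):

```python
import pandas as pd

# "Int64" (capital I) is the nullable integer extension dtype;
# None is stored as pd.NA and the other values stay integers.
s = pd.Series([1, 2, 3, None], dtype="Int64")

print(s.dtype)           # Int64
print(s.isna().sum())    # 1
print(s.sum())           # 6 (missing values are skipped by default)
```

This is a different mechanism from the default NumPy-backed int64, so it is worth checking that downstream code accepts it.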

read_csv with dtype specified

Check how each column's type changes when the following csv is read with different dtype values

sample.csv (the lines starting with # are annotations, not part of the file)


# Column A: int + empty
# Column B: str + empty string
# Column C: float + empty
# Column D: int only
A,B,C,D
1,"1",1.0,1
2,"2",2.0,2
3,"3",3.0,3
,"",,4

dtype not specified

Both empty fields and empty strings are read as np.nan, so the columns containing them are converted to float.


df = pd.read_csv("sample.csv")

print(df)

     A    B    C  D
0  1.0  1.0  1.0  1
1  2.0  2.0  2.0  2
2  3.0  3.0  3.0  3
3  NaN  NaN  NaN  4

print(df.dtypes)

A    float64
B    float64
C    float64
D      int64
dtype: object

print(df.values)

array([[ 1.,  1.,  1.,  1.],
       [ 2.,  2.,  2.,  2.],
       [ 3.,  3.,  3.,  3.],
       [nan, nan, nan,  4.]])
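
The same read can be reproduced without a file by feeding the CSV text through io.StringIO. A minimal sketch; csv_text and missing_per_column are illustrative names:

```python
import io
import pandas as pd

# Same content as sample.csv, held in memory so no file is needed.
csv_text = 'A,B,C,D\n1,"1",1.0,1\n2,"2",2.0,2\n3,"3",3.0,3\n,"",,4\n'

df = pd.read_csv(io.StringIO(csv_text))

# Number of values read as missing in each column.
missing_per_column = df.isna().sum().to_dict()
print(missing_per_column)
```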

Specify object

Empty fields and empty strings are converted to np.nan, while all other values are read as str.


df = pd.read_csv("sample.csv", dtype=object)

print(df)

     A    B    C  D
0    1    1  1.0  1
1    2    2  2.0  2
2    3    3  3.0  3
3  NaN  NaN  NaN  4

print(df.dtypes)

A    object
B    object
C    object
D    object
dtype: object

print(df.values)

array([['1', '1', '1.0', '1'],
       ['2', '2', '2.0', '2'],
       ['3', '3', '3.0', '3'],
       [nan, nan, nan, '4']], dtype=object)
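
If the empty strings need to survive the read, read_csv has a keep_default_na parameter that turns off the default NaN markers. This was not part of the checks above; the sketch below assumes that reading everything as text is acceptable:

```python
import io
import pandas as pd

csv_text = 'A,B,C,D\n1,"1",1.0,1\n2,"2",2.0,2\n3,"3",3.0,3\n,"",,4\n'

# dtype=str reads every field as text; keep_default_na=False turns off the
# default NaN markers, so empty fields stay empty strings.
df = pd.read_csv(io.StringIO(csv_text), dtype=str, keep_default_na=False)

total_missing = int(df.isna().sum().sum())
last_row = df.iloc[3].tolist()
print(total_missing, last_row)
```

Note that with this setting an unquoted empty field and a quoted "" both come back as an empty string, so the two still cannot be told apart.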

Specify float

All columns are converted to float64 type


df = pd.read_csv("sample.csv", dtype=float)

print(df)

     A    B    C    D
0  1.0  1.0  1.0  1.0
1  2.0  2.0  2.0  2.0
2  3.0  3.0  3.0  3.0
3  NaN  NaN  NaN  4.0

print(df.dtypes)

A    float64
B    float64
C    float64
D    float64
dtype: object

print(df.values)

array([[ 1.,  1.,  1.,  1.],
       [ 2.,  2.,  2.,  2.],
       [ 3.,  3.,  3.,  3.],
       [nan, nan, nan,  4.]])

Specify int

Because empty fields and empty strings are converted to np.nan, the columns cannot be read as int and an error occurs.


df = pd.read_csv("sample.csv", dtype=int)

ValueError: Integer column has NA values in column 0
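
One way around the error is to pass dtype as a dict so that int is requested only for columns without missing values. A sketch, assuming column D is the only one that needs to be int:

```python
import io
import pandas as pd

csv_text = 'A,B,C,D\n1,"1",1.0,1\n2,"2",2.0,2\n3,"3",3.0,3\n,"",,4\n'

# Request int only for column D; the other columns are left to type inference.
df = pd.read_csv(io.StringIO(csv_text), dtype={"D": int})

print(df.dtypes)
```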

Behavior of fillna and dropna

Behavior when fillna is applied to the following data


df = pd.DataFrame(
    {
        #Column A: int+None
        "A": [1, 2, 3, None],
        #Column B: str+Empty string
        "B": ["1", "2", "3", ""],
        #Column C: int+np.nan
        "C": [1, 2, 3, np.nan],
        #Column D:int only
        "D": [1, 2, 3, 4]
    },
    dtype="object"
)

print(df.values)

array([[1, '1', 1, 1],
       [2, '2', 2, 2],
       [3, '3', 3, 3],
       [None, '', nan, 4]], dtype=object)

With df.fillna('FILL'), None and np.nan are replaced, but the empty string remains.


print(df.fillna('FILL'))

      A  B     C  D
0     1  1     1  1
1     2  2     2  2
2     3  3     3  3
3  FILL     FILL  4

print(df.fillna('FILL').values)

array([[1, '1', 1, 1],
       [2, '2', 2, 2],
       [3, '3', 3, 3],
       ['FILL', '', 'FILL', 4]], dtype=object)
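
If empty strings should be filled as well, one approach is to turn them into np.nan first with replace. This is only a sketch of that idea, not something verified in the original checks; filled is an illustrative name:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3, None], "B": ["1", "2", "3", ""], "C": [1, 2, 3, np.nan]},
    dtype=object,
)

# Turn empty strings into np.nan first so that fillna covers them as well.
filled = df.replace("", np.nan).fillna("FILL")

print(filled.loc[3].tolist())
```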

Likewise, dropna deletes the rows or columns that contain np.nan or None, but does not treat empty strings as missing values.


print(df.dropna(axis=1))

   B  D
0  1  1
1  2  2
2  3  3
3     4

print(df.dropna(axis=1).values)

array([['1', 1],
       ['2', 2],
       ['3', 3],
       ['', 4]], dtype=object)
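
The row-wise case behaves the same way. A short sketch using a three-column version of the data (the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3, None], "B": ["1", "2", "3", ""], "D": [1, 2, 3, 4]},
    dtype=object,
)

# Default dropna() works row-wise: row 3 is dropped because column A holds None.
rows_after_dropna = len(df.dropna())

# Restricting the check to column B keeps every row: "" is not missing.
rows_after_subset_b = len(df.dropna(subset=["B"]))

print(rows_after_dropna, rows_after_subset_b)
```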

Behavior of groupby

Verification uses the following data frame


df = pd.DataFrame(
    {
        #Column A: int+None
        "A": [1, 2, 3, None],
        #Column B: str+Empty string
        "B": ["1", "2", "3", ""],
        #Column C: int+np.nan
        "C": [1, 2, 3, np.nan],
        #Column D:int only
        "D": [1, 2, 3, 4]
    },
    dtype="object"
)

When grouping by a column that contains None or np.nan, the rows holding None or np.nan are treated as missing and ignored.


print(df.groupby("A").max().reset_index())

   A  B  C  D
0  1  1  1  1
1  2  2  2  2
2  3  3  3  3

print(df.groupby("A").max().reset_index().values)

array([[1, '1', 1, 1],
       [2, '2', 2, 2],
       [3, '3', 3, 3]], dtype=object)

print(df.groupby("C").max().reset_index())

   C  A  B  D
0  1  1  1  1
1  2  2  2  2
2  3  3  3  3

print(df.groupby("C").max().reset_index().values)

array([[1, 1, '1', 1],
       [2, 2, '2', 2],
       [3, 3, '3', 3]], dtype=object)
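
To keep those rows, groupby accepts dropna=False in pandas 1.1 and later. This option was not part of the checks above; the sketch uses a default-typed (float) key column rather than the object frame, which is an assumption made to keep the example small:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, None], "D": [1, 2, 3, 4]})

# Default: the row whose key is missing is left out of the result.
default_groups = df.groupby("A")["D"].max()

# dropna=False (pandas 1.1 or later) keeps missing keys as a group of their own.
all_groups = df.groupby("A", dropna=False)["D"].max()

print(len(default_groups), len(all_groups))
```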

When grouping by a column that contains an empty string, the empty string is not ignored and forms its own group.


print(df.groupby("B").max().reset_index())

   B    A    C  D
0     NaN  NaN  4
1  1  1.0  1.0  1
2  2  2.0  2.0  2
3  3  3.0  3.0  3

print(df.groupby("B").max().reset_index().values)

array([['', nan, nan, 4],
       ['1', 1.0, 1.0, 1],
       ['2', 2.0, 2.0, 2],
       ['3', 3.0, 3.0, 3]], dtype=object)
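
The extra group can also be seen by counting group sizes. A minimal sketch on a two-column version of the data (sizes is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({"B": ["1", "2", "3", ""], "D": [1, 2, 3, 4]})

# The empty string is a regular key, so it forms a group of its own.
sizes = df.groupby("B").size()

print(len(sizes))
print("" in sizes.index)
```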
