[Python] How to extract / delete / convert a matrix containing missing values (NaN)

[Python] How to extract / delete / convert a matrix containing missing values (NaN)

When reading a csv or excel file, if there is a blank cell, it will be read as a missing value (NaN).

A summary of how to find, delete, and convert matrices that contain this missing value (NaN).


**table of contents**
  1. [Identification of elements / matrices containing missing values](# 1 Identification of element matrices containing missing values)
    1. isna()
      1. isna().all()
      2. isna().any()
      3. isna().all().any()
      4. isna().any().any()
    2. isnull()
  2. [Count the number of missing values](# 2 Count the number of missing values)
    1. isna().sum()
    2. isna().sum().sum()
    3. len() - count()
  3. [Delete missing value](# 3 Delete missing value)
  4. [List of dropna options](List of options for #dropna)
  5. [Delete all missing value matrix](# 5 Delete all missing value matrix)
  6. [axis = 0 (omitted)](# axis0 omitted) 2. axis=1 3. dropna(how='all').dropna(how='all', axis=1)
  7. [Delete by specifying the number of non-missing cells](# 6 Delete by specifying the number of non-missing cells)
    1. df.dropna(thresh=n)
    2. df.dropna(thresh=n, axis=1)
  8. [Specify the row / column where you want to delete NaN](Specify the matrix where you want to delete # 7nan)
    1. df.dropna(subset=['AAA', 'BBB'])
    2. df.dropna(subset=['aaa', 'bbb'], axis=1)
  9. [Allow delete overwrite](# 8 Allow delete overwrite)
  10. [Convert Missing Value](# 9 Convert Missing Value)
  11. [Replace entire table](#Replace entire table) 1. fillna('AAA')
  12. [method option](# 1-2-method option) 3. fillna(method='ffill') 4. fillna(method='ffill', axis=1) 5. fillna(method='bfill') 6. fillna(method='bfill', axis=1)
  13. [Specify how many NaNs to replace](#Specify how many nans to replace) 1. fillna('AAA', limit=n) 2. fillna('AAA', limit=n, axis=1) 3. fillna(method='ffill', limit=n) 4. fillna(method='ffill', limit=n, axis=1)
  14. [Allow overwrite](# Allow overwrite)

## 1. 1. Identification of elements / matrices containing missing values **①isna()**  1-1. isna().all()  1-2. isna().any()  1-3. isna().all().any()  1-4. isna().any().any() **②isnull()**
** ▼ Usage ** -** Element identification ** is ** isna method **. -To find out if ** all ** of a row or column is NaN, combine the ** isna method and the all method **. -To find out if there is NaN in the ** element of the row or column **, combine the ** isna method and any method **. -Isnull method is ** isna method alias ** (alias). The official page mainly uses isna.
## Original data to use Read the 6-by-6 csv file as shown below.

image.png

Contains cells with no values entered or blank matrices.

▼ [Reference] Read the test.csv file on the desktop.

Read csv file


import pandas as pd
df = pd.read_csv('~/desktop/test.csv')
df

See below for details on reading csv files with python.

How to read csv file Application for reading csv files


### Read data (stored in variable df) ![image.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/563526/d1656f44-8997-228f-ae49-1d46eb24da4c.png)

-The blank cell becomes "NaN". -Formula errors "#REF!", "# DIV / 0!", And "#NAME?" Are not recognized as missing values (as they were). ・ The first line is recognized as a header


①isna() Used to check if each element is NaN. If it is NaN, "True" is returned, otherwise "False" is returned. (Boolean type: boolean)

** 1-1. Mapping the entire table **

df.isna() └ "df": Data with a table

Map the entire table


df.isna()

▼ Output image.png

NaN is displayed as "True" and others are displayed as "False".

▼ Original data image.png


** 1-2. Mapping by specific row, column, element ** It is also possible to map a specific matrix or element by specifying the target range.

pd.isna (specified range) └ "pd.isna ()": pd becomes an object


** ▼ Map specified columns **

Map specified columns


pd.isna(df['col1'])

#output
0    False
1    False
2    False
3    False
4     True
5    False
Name: col1, dtype: bool

** ▼ Map specified line **

Map specified line


pd.isna(df['col1'])

#output
0    False
1    False
2    False
3    False
4     True
5    False
Name: col1, dtype: bool

** ▼ Map specified line **

Map specified line


pd.isna(df.loc[[3]])

▽ Output image.png


** ▼ Map multiple lines **

Map specified line


pd.isna(df[1:4])

▽ Output image.png


** ▼ Judge the specified cell **

If you put a value in "()" of pd.isna (), it will be judged whether it is NaN. ⇒Enter the cell element.


** ① Judge the cell with row name "0" and column name "col4" ** -Cell specification: `df.loc [0,'col4']`

Judge cell (specified by matrix name)


pd.isna(df.loc[0,'col4'])

#output
#True

** ② Judge the cell with row number "1" and column number "3" ** -Cell specification: `df.iloc [1,3]`

Judge cell (specified by matrix name)


pd.isna(df.iloc[1,3])

#output
#True

There are various ways to specify the range of a DataFrame table. Click here for details


1-1. isna().all() List rows or columns with ** all NaN ** contents.

True if applicable. False if not applicable.

Whether it is a row or a column is switched by the presence or absence of the option "axis = 1".

-Df.isna (). All () : ** Column ** list ・ Df.isna (). All (axis = 1): ** Line ** list


** ▼ Original table ** ![image.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/563526/bd35b75c-b12b-62fc-75b3-339b4a4f126e.png)
** ▼ List of columns ** `df.isna().all()`

List of columns


df.isna().all()

#output
col1          False
Unnamed: 1    False
col3          False
col4           True
Unnamed: 4     True
col6          False
dtype: bool

The column names "col4" and "Unnamed: 4" are empty.


** ▼ List of lines ** `df.isna().all(axis=1)`

List of rows


df.isna().all(axis=1)

0    False
1    False
2    False
3    False
4     True
5    False
dtype: bool

Line name "4" is blank


1-2. isna().any() List rows or columns that contain ** at least one NaN **.

True if applicable. False if not applicable.

Whether it is a row or a column is switched by the presence or absence of the option "axis = 1".

-Df.isna (). Any () : ** Column ** list ・ Df.isna (). Any (axis = 1): ** Line ** list


** ▼ Original table ** ![image.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/563526/bd35b75c-b12b-62fc-75b3-339b4a4f126e.png)
** ▼ List of columns ** `df.isna().any()`

List of columns


df.isna().any()

#output
col1          True
Unnamed: 1    True
col3          True
col4          True
Unnamed: 4    True
col6          True
dtype: bool

All columns contain NaN.


** ▼ List of lines ** `df.isna().any(axis=1)`

List of rows


df.isna().any(axis=1)

0    True
1    True
2    True
3    True
4    True
5    True
dtype: bool

All lines contain NaN.


1-3. isna().all().any() ** If there is at least one blank column in the table ** returns False if there is no True.

It can be used when you want to see the state quickly.


** ▼ What you are doing ** Get a list showing the state of each column as True or False with ʻisna (). All`.

ʻAny ()` returns True if there is at least one True in the list, False otherwise.

Check if there is at least one blank column


df.isna().all().any()

#output
# True

1-3'. isna().all(axis=1).any() By specifying "axis = 1" in the option of all method,

If there is at least one ** all blank row ** in the table, False is returned if there is no True.

Check if there is at least one blank line


df.isna().all(axis=1).any()

#output
# True

1-4. isna().any().any() Returns True ** if there is at least one NaN in the table, False otherwise.

Does it contain at least one NaN?


df.isna().any().any()

#output
# True

②isnull() isna method alias Even if the above isna is replaced with isnull, the process is exactly the same.
>> Link to official page - [isna()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isna.html) - [all()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.all.html) - [any()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.any.html) - [isnull()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isnull.html)
## 2. Count the number of missing values ①isna().sum() ②isna().sum().sum() ③len() - count()
①isna().sum() Count how many NaNs each row or column contains.

Whether it is a row or a column is switched by the presence or absence of the option "axis = 1".

Df.isna (). Sum (): Count number per ** column ** ・ Df.isna (). Sum (axis = 1): Count number per ** line **


** ▼ Original table ** ![image.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/563526/bd35b75c-b12b-62fc-75b3-339b4a4f126e.png)
** ▼ Count NaN in column ** `df.isna().sum()`

Count the NaN contained in each column


df.isna().sum()

#output

col1          1
Unnamed: 1    2
col3          1
col4          6
Unnamed: 4    6
col6          2
dtype: int64

** ▼ Count NaN in line ** `df.isna().sum(axis=1)`

Count NaN contained in each row


df.isna().sum(axis=1)

#output

0    2
1    3
2    3
3    2
4    6
5    2
dtype: int64

②isna().sum().sum() Count the number of NaNs contained in the entire table.
** ▼ What you are doing ** Count the number of NaNs for each row with ʻisna (). Sum ()`.

Furthermore, by doing sum (), the NaN number of each line is added up.

Number of NaNs in the entire table


df.isna().sum().sum()

#output
# 18

③len() - count()

len (df ['line name']) --count (df ['line name'])

Find the number of NaNs in the specified row or column.


** ▼ What you are doing ** ① `len (df ['line name'])` └ Count the number of elements in the specified row

count (df ['line name']) └ Number of elements other than NaN contained in the specified line

By setting (1)-(2), the number of NaNs contained in the specified line is calculated.


Supplement
count method: counts the number of elements other than NaN

** ▼ Target column (column containing two NaN. Column name: col6) **

Target column


df.col6

0      A
1    NaN
2      C
3      D
4    NaN
5      F

** ▼ Execution result **

Number of NaNs in the specified row


len(df['col6'])-df['col6'].count()

#output
# 2

>> Official page -[sum method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.sum.html) --[count method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.count.html)
## 3. 3. Delete missing values Missing value NaN can be deleted by using the dropna method.

▼ Dropna option list

option Contents
axis=0 Line (optional)
axis=1 Column
how='any' Delete if there is even one NaN(Optional)
how='all' Delete all NaN
thresh=n Not NaN(Valid)Do not delete if there are n or more cells
subset=['A',,] Delete rows that contain NaN in the specified column name (remove NaN from the specified column name)
subset=['a',,], axis=1 Delete columns that contain NaN in the specified row name (remove NaN from the specified row name)
inplace=True Allow overwriting

** Original table ** ![image.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/563526/84b5b6ed-38e4-a9b3-14ac-16e6d18ce605.png)
### 5. Delete all missing matrix -Specify "** how ='all'**" as an option. -Specify the row or column with axis.

① axis = 0 (omitted)  ②axis=1  ③dropna(how='all').dropna(how='all', axis=1)


#### ① axis = 0 (omitted) -Delete "lines" containing spaces. -Default is "axis = 0" (optional) └ The following two processes are the same.

dropna(how='all') dropna(how='all', axis=0)


Remove all blank lines


df.dropna(how='all')

▽ Output image.png

Line 4 deleted.

②axis=1 dropna(how='all', axis=1)

Delete "columns" that contain spaces.


Remove all blank columns


df.dropna(how='all')

▽ Output image.png

Removed two columns, col4 and Unnamed: 4.


③dropna(how='all').dropna(how='all', axis=1) 

Removed rows and columns where all elements are blank.

After removing blank lines with dropna (how ='all') Remove blank columns with dropna (how ='all', axis = 1).


Remove blank rows and columns


df.dropna(how='all').dropna(how='all', axis=1)

▽ Output

image.png

Row 4, column col4, Unnamed: 4 deleted.


### 6. Delete a matrix that contains at least one missing value -Specify "** how ='any' **" as an option. ・ The default is this state (optional) -Specify the row or column with axis. -The original table is not overwritten └ Overwrite setting is also possible as an option.

① Default  ②axis=1


** ▼ Original table **
image.png

There are four NaNs. (Row, Column) = (4,1), (5,2), (1,5), (4,5)


#### ① Default ** Deleted any line containing NaN. ** **

The default of the dropna method is that how ='any' is specified as an option.


** ▼ Execution example **
image.png

In each case, lines 1, 4, 5, and 6 including NaN have been deleted.


②axis=1 Removed ** columns ** containing at least one NaN.

Since how ='any' can be omitted, the following two processes are the same.

dropna(axis=1) dropna(how='any', axis=1)


** ▼ Execution example **
image.png

All contain one or more NaNs, col1, Unnamed: 1, col5 deleted.


## 6. Delete by specifying the number of cells that are not missing values Example: Ignore if there are n or more non-missing values. If it is less than that, delete it.

①df.dropna(thresh=n) ②df.dropna(thresh=n, axis=1)

▽ Supplement thresh is an abbreviation for "threshold".


### ▼ Original data
image.png

・ 10 rows x 6 columns table -The number of NaN is 1 in the 0th row (9 cells are not). The number of NaNs increases one column at a time. ・ All 4th columns are blank


①df.dropna(thresh=n) Example: `df.dropna (thresh = 1)` └ "df": Table data └ "thresh = 1": Threshold 1

Deleted rows with one or less cells other than NaN. Same as deleting all NaN lines.

** ▼ When thresh = 1 **

Specifying the threshold (n)=1)


df.dropna(thresh=1)
image.png

There are no NaN-only cells, so all rows remain.


** ▼ When thresh = 4 **

Specifying the threshold (n)=4)


df.dropna(thresh=4)
image.png

Rows with 4 or less cells other than NaN, rows 1, 2 and 3 are deleted.


** ▼ When thresh = 7 **

Specifying the threshold (n)=7)


df.dropna(thresh=7)
image.png

Since there are no rows with 7 or more cells other than NaN, delete all.


②df.dropna(thresh=n, axis=1) Example: `df.dropna (thresh = 1, axis = 1)` └ "df": Table data └ "thresh = 1": Threshold 1

Deleted columns with one or less cells other than NaN. Same as deleting all NaN columns.


** ▼ When thresh = 1, axis = 1 **

Specifying the threshold (n)=1)


df.dropna(thresh=1, axis=1)
image.png

All NaN column 4 removed.


** ▼ When thresh = 7, axis = 1 **

Specifying the threshold (n)=7)


df.dropna(thresh=7, axis=1)
image.png

Columns with 7 or less cells other than NaN, rows 4, 5, and 6 are deleted.


** ▼ When thresh = 11, axis = 1 **

Specifying the threshold (n)=11)


df.dropna(thresh=11, axis=1)

image.png

Delete all because there are no columns with 11 or more cells other than NaN.


## 7. Specify the row / column for which you want to delete NaN Use the subset option. The specified row / column is not deleted, but the NaN contained in it is deleted.

①df.dropna(subset=['AAA', 'BBB']) ②df.dropna(subset=['aaa', 'bbb'], axis=1)


### ▼ Original data |![image.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/563526/d1a720eb-4f71-9f8e-ffd3-d7bc3d357768.png)| |:--|

・ 10 rows x 6 columns ・ The row name is row x row number ・ Column name is col x column number

①df.dropna(subset=['AAA', 'BBB']) Delete the row by specifying the column name.

df.dropna(subset=['AAA', 'BBB']) └ "df": Variable with table └ "subset": Option to specify the name of the table or column └ "AAA" "BBB": Column name (optional)

-Whether the deletion target is row or column is specified by the presence or absence of axis.


** ▼ Delete specified line ** `df.dropna(subset=['col1','col3','col5'])` └ Delete the row that has at least one NaN in columns col1, col3, col5. ⇛ NaN disappears from the specified column.

Delete row by specifying by column name


df.dropna(subset=['col1','col3','col5'])
image.png

Lines 1,2,3,4,5 deleted. The specified columns col1,3,5 no longer contain NaN.


②df.dropna(subset=['aaa', 'bbb'], axis=1) Delete the column by specifying the row name. NaN is deleted from the specified line.
`df.dropna(subset=['row0','row3'], axis=1)`

Delete the column with NaN in rows 0 and row 3. (NaN disappears from the specified column)

Delete row by specifying by column name


df.dropna(subset=['row0','row3'], axis=1)
image.png

Removed columns col3, col4, col5.


## 8. Allow delete overwrite Use the inplace option.

Do not overwrite by default.  └inplace=False

Describe the following to allow overwriting.  └**inplace=True**


** ▼ Allow overwriting **

Overwrite permission


df.dropna(thresh=5, inplace=True)
df
image.png

** ▼ Default (do not overwrite) **

No overwrite


df.dropna(thresh=5)
df
image.png

The table remains the same.


>> Official page [dropna method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html?highlight=dropna)
## 9. Convert missing values You can replace NaN with the specified value with the fillna method.

The replacement method and the conditions of NaN to be replaced are specified as options.


### ▼ Fillna option list
option Contents
axis=0 Line (omitted)
axis=1 Column
value Value to convert NaN
method='bfill' Convert to lower cell value of NaN
method='backfill' 'bfill'Same as
method='ffill' Convert NaN to the value in the upper cell
method='pad' 'ffill'Same as
method='bfill', axis=1 Convert to NaN right cell value
method='ffill', axis=1 Convert to the value in the left cell of NaN
limit=n Replace up to nth NaN. Do not replace any more.
inplace=True Allow overwriting

** ① Replace the entire table **  1-1. fillna('AAA') 1-2. method option  1-3. fillna(method='ffill')  1-4. fillna(method='ffill', axis=1)  1-5. fillna(method='bfill')  1-6. fillna(method='bfill', axis=1)

** ② Specify the number of NaN to replace **  1-1. fillna('AAA', limit=n)  1-2. fillna('AAA', limit=n, axis=1)  1-3. fillna(method='ffill', limit=n)  1-4. fillna(method='ffill', limit=n, axis=1)

** ③ Allow overwriting **


### ▼ Original data
image.png

## ① Replace the entire table 1-1. fillna('AAA') `fillna('AAA')` └ "AAA": Replace NaN with the specified character (optional)

fillna('AAA')


df.fillna('AAA')

image.png


### 1-2. method option It can be replaced with a * value adjacent to NaN by using the method option.

** ▼ Example: When you specify to replace with the value in the right cell **

Before replacement: NaN NaN NaN AAA After replacement: AAA AAA AAA AAA

** ▼ method option list **

method meaning Contents
ffill forward fill Replace with the value in the front cell (row: top, column: left)
pad padding Same as ffill
bfill backward fill Replace with the value in the back cell (row: bottom, column: right)
backfill backward fill Same as bfill

When specifying a line, determine whether it is before or after based on the line number. (Up or down)

When specifying a column, determine whether it is the front or the back based on the column number. (Left or right)

error


ValueError: Cannot specify both 'value' and 'method'.

1-3. fillna(method='ffill') `fillna(method='ffill')`

Replace with the value in the top row.

ffill


df.fillna(method='ffill')

image.png

Convert to a value above NaN. Not replaced if the top row is NaN.


1-4. fillna(method='ffill', axis=1) `fillna(method='ffill', axis=1)`

Replace with the value in the front column.

ffill,axis=1


df.fillna(method='ffill', axis=1)

image.png


1-5. fillna(method='bfill') `fillna(method='bfill')`

Replace with the value in the last row.

bfill


df.fillna(method='bfill')

image.png


1-6. fillna(method='bfill', axis=1) `fillna(method='bfill', axis=1)`

Replace with the value in the back column.

bfill,axis=1


df.fillna(method='bfill', axis=1)

image.png

(2) Specify the number of NaN to replace

By using the limit option, you can specify the number of NaN to replace.

limit=n Replace up to nth. Do not replace any more.

** Example: When limit = 2, the value to be replaced AAA, and the axis is a column **

■ Before replacement AAA NaN NaN AAA AAA AAA NaN NaN NaN AAA AAA NaN AAA NaN AAA NaN NaN AAA NaN NaN

■ After replacement AAA AAA AAA AAA AAA AAA AAA AAA NaN AAA AAA AAA AAA AAA AAA AAA AAA AAA NaN NaN

-Replace each line from the left up to the second NaN.


1-1. fillna('AAA', limit=n) `fillna('AAA', limit=3)`

From the top in line number order, NaN replaces up to the second.

limit=2


df.fillna('AAA', limit=2)

image.png

Replace from top to bottom for each column. Do not replace any more.


1-2. fillna('AAA', limit=n, axis=1) `fillna('AAA', limit=3, axis=1)`

Replace from the top to the third NaN in line number order.

limit=3,axis=1


df.fillna('AAA', limit=3, axis=1)

image.png


1-3. fillna(method='ffill', limit=n) `fillna(method='ffill', limit=3)`

Replaces up to the third NaN in line number order with the previous value.

ffill,limit=3


df.fillna(method='ffill', limit=3)

image.png


1-4. fillna(method='ffill', limit=n, axis=1) `fillna(method='ffill', limit=3, axis=1)`

Replace up to the third NaN in column number order with the previous value.

ffill,limit=3,axis=1


df.fillna(method='ffill', limit=3)

image.png


### ③ Allow overwriting Use the inplace option.

Do not overwrite by default.  └inplace=False

Describe the following to allow overwriting.  └**inplace=True**


** ▼ Allow overwriting **

Overwrite permission


df.fillna('AAA', inplace=True)
df
image.png

** ▼ Default (do not overwrite) **

No overwrite


df.fillna('AAA')
df
image.png

The table remains the same.


>> Official page [dropna method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html?highlight=dropna)
[Back to top](#python How to extract, delete, and convert a matrix containing missing value nan)

Recommended Posts

[Python] How to extract / delete / convert a matrix containing missing values (NaN)
[Python] How to convert a 2D list to a 1D list
How to convert / restore a string with [] in python
How to convert an array to a dictionary with Python [Application]
How to write a Python class
[Python] How to swap array values
How to delete a Docker container
[Python] How to get & change rows / columns / values from a table.
How to delete multiple specified positions (indexes) in a Python list
[Python] How to make a matrix of repeating patterns (repmat / tile)
[Python] How to convert db file to csv
[Python] How to make a class iterable
How to convert Python to an exe file
[Python] How to invert a character string
How to get a stacktrace in python
How to extract polygon area in Python
How to run a Maya Python script
[Python] How to create a dictionary type list, add / change / delete elements, and extract with a for statement
How to read a CSV file with Python 2/3
How to create a Python virtual environment (venv)
How to open a web browser from python
How to clear tuples in a list (Python)
How to embed a variable in a python string
How to create a JSON file in Python
Convert Windows epoch values to date with python
How to generate a Python object from JSON
How to add a Python module search path
How to convert SVG to PDF and PNG [Python]
How to extract coefficients from a fractional formula
How to notify a Discord channel in Python
[Python] How to draw a histogram in Matplotlib
How to unit test a function containing the current time using freezegun in python
How to check for missing values (Kaggle: House Prices)
[Python] How to remove duplicate values from the list
How to convert Python # type for Python super beginners: str
[Python] How to draw a line graph with Matplotlib
How to set up a Python environment using pyenv
Convert a text file with hexadecimal values to a binary file
How to convert a class object to a dictionary with SQLAlchemy
[Python] How to expand variables in a character string
How to write a list / dictionary type of Python3
How to convert floating point numbers to binary numbers in Python
How to build a Django (python) environment on docker
How to extract null values and non-null values with pandas
How to extract non-missing value nan data with pandas
How to convert JSON file to CSV file with Python Pandas
How to make a Python package using VS Code
[Python] How to write a docstring that conforms to PEP8
How to convert a mel spectrogram back to a wav file
[Python] How to create Correlation Matrix and Heat Map
How to save a table scraped by python to csv
[Python] Created a method to convert radix in 1 second
python, php, ruby How to convert decimal numbers to n-ary numbers
[Python] How to create a 2D histogram with Matplotlib
How to execute a command using subprocess in Python
[Python] How to output the list values in order
How to read csv containing only integers in Python
How to build a Python environment on amazon linux 2
[Python] How to call a c function from python (ctypes)
How to create a kubernetes pod from python code
[Python] How to draw a scatter plot with Matplotlib