I searched for a while and couldn't find this, so once I worked out how to do it I wrote it down as a note.
The theme is "Is there NaN in the pandas DataFrame?"
As a simple check that the data is being processed properly, I want to **find out whether the data frame contains any NaN values and where they are**.
If you want to fill or delete NaN, you can use `fillna()` / `dropna()`, but what I want to do here is **"check whether there is NaN and display the rows (columns) that contain it"**.
As an example, I want to extract only rows 2-4 or columns 1-3 (counting from 0, i.e. columns b, c, d) of this data frame.
Data creation
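import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 5))
df.iloc[2:, 1:4] = np.nan  # rows 2-4, columns 1-3 (.ix was removed in pandas 1.0)
df.columns = list('abcde')
df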
#[Out]# a b c d e
#[Out]# 0 -0.678873 -1.277486 -1.062232 0.097525 -2.386115
#[Out]# 1 -1.063709 -1.919997 -0.131733 -0.606348 0.101888
#[Out]# 2 -1.701473 NaN NaN NaN 0.201468
#[Out]# 3 -0.624932 NaN NaN NaN -0.654297
#[Out]# 4 0.345065 NaN NaN NaN -0.232199
Use `isnull()` / `notnull()` to see whether there is NaN. Reference:
The official pandas guide to handling NaN: pandas 0.19.1 documentation » Working with missing data
Using the isnull method
isnull()
df.isnull()
#[Out]# a b c d e
#[Out]# 0 False False False False False
#[Out]# 1 False False False False False
#[Out]# 2 False True True True False
#[Out]# 3 False True True True False
#[Out]# 4 False True True True False
What is returned is a data frame of the same size as df that contains bool values, True only where the value is NaN.
`notnull()` returns the same data frame as `isnull()` with True and False swapped.
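As a quick check of that relationship, here is a minimal sketch on a small hand-made frame. The name `small` and its values are made up for illustration and are not part of the data above.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x2 frame with a single NaN, for illustration only.
small = pd.DataFrame({"a": [1.0, np.nan], "b": [3.0, 4.0]})

null_mask = small.isnull()      # True where the value is NaN
notnull_mask = small.notnull()  # True where a value is present

# notnull() should be the element-wise negation of isnull().
masks_are_opposite = notnull_mask.equals(~null_mask)
print(masks_are_opposite)
```

Either mask can be used, depending on whether you want to select the missing cells or the present ones.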
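This is still a little different from what I want to do.
Broken down, what I want to do, **"check for NaN and display its rows (columns)"**, comes to: find out whether each row (column) contains at least one NaN, then extract the rows (columns) that do.
At least I think so.
For **"is there at least one such-and-such"**, the tool is **numpy's `any` method**.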
np.any()
df.isnull().any()
#[Out]# a False
#[Out]# b True
#[Out]# c True
#[Out]# d True
#[Out]# e False
#[Out]# dtype: bool
df.isnull().any(axis=1)
#[Out]# 0 False
#[Out]# 1 False
#[Out]# 2 True
#[Out]# 3 True
#[Out]# 4 True
#[Out]# dtype: bool
df.isnull().any(axis=0) # same as df.isnull().any()
#[Out]# a False
#[Out]# b True
#[Out]# c True
#[Out]# d True
#[Out]# e False
#[Out]# dtype: bool
Since `any()` scans along the rows by default (axis=0), `df.isnull().any()` returns, for each column, True if the result of `isnull()` contains at least one True in that column (that is, at least one NaN) and False if not. With `any(axis=1)` the scanning direction changes, and each row is searched along the column direction (axis=1) for True (that is, NaN).
In the pandas version used here `axis=` can be omitted, so writing `df.isnull().any(1)` is the same as `df.isnull().any(axis=1)`. Recent pandas versions accept the axis only as a keyword, so `any(axis=1)` is the safer spelling.
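To make the two scanning directions concrete, here is a small sketch on a made-up frame (the name `small` and its values are assumptions for illustration). It shows that the axis=0 result has one entry per column and the axis=1 result has one entry per row.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the only NaN is in column "b", row 1.
small = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, np.nan, 6.0]})

per_column = small.isnull().any(axis=0)  # one bool per column
per_row = small.isnull().any(axis=1)     # one bool per row

print(list(per_column.index), per_column.tolist())  # indexed by column labels
print(list(per_row.index), per_row.tolist())        # indexed by row labels
```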
This is a slight detour from what I want to do, but to **get a single True if there is a NaN anywhere in the frame**, chain `any` twice.
Does it contain even one NaN?
df.isnull().any().any() #Contains NaN
#[Out]# True
dff=pd.DataFrame(np.random.randn(5,5)) #Does not contain NaN
dff.isnull().any().any()
#[Out]# False
The same question comes up on Stack Overflow: Stack Overflow - Python pandas: check if any value is NaN in DataFrame. Approaches other than chaining `any().any()` are used there as well.
The fastest one when measured with `%timeit` was `df.isnull().values.any()`.
**If you only want to know whether at least one NaN is included**, use that.
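A minimal sketch of that check wrapped in a function follows. The helper name `has_nan` and the two sample frames are hypothetical; no timing claim is made here beyond what is stated above.

```python
import numpy as np
import pandas as pd

def has_nan(frame):
    """Return True if the frame contains at least one NaN (hypothetical helper)."""
    # .values exposes the underlying NumPy array, so any() runs once over all cells.
    return bool(frame.isnull().values.any())

with_nan = pd.DataFrame({"a": [1.0, np.nan], "b": [3.0, 4.0]})
without_nan = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

print(has_nan(with_nan), has_nan(without_nan))

# The chained form should give the same answer on both frames.
chained_with = bool(with_nan.isnull().any().any())
chained_without = bool(without_nan.isnull().any().any())
print(chained_with, chained_without)
```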
Now I can finally do what I wanted.
`df.isnull().any(axis=1)` gives a bool value for each row telling whether that row contains NaN. **Use it as a mask to slice the frame** and you extract only the rows that contain NaN.
Extracting rows that contain NaN
df[df.isnull().any(axis=1)]
#[Out]# a b c d e
#[Out]# 2 -1.701473 NaN NaN NaN 0.201468
#[Out]# 3 -0.624932 NaN NaN NaN -0.654297
#[Out]# 4 0.345065 NaN NaN NaN -0.232199
Extracting columns that contain NaN
df.loc[:, df.isnull().any()]
#[Out]# b c d
#[Out]# 0 -1.277486 -1.062232 0.097525
#[Out]# 1 -1.919997 -0.131733 -0.606348
#[Out]# 2 NaN NaN NaN
#[Out]# 3 NaN NaN NaN
#[Out]# 4 NaN NaN NaN
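Both extractions can be sketched together in a form that should run on current pandas. This assumes `.loc` as the replacement for `.ix` (which was removed in pandas 1.0) and uses fixed values instead of random ones so the result is reproducible; the names `rows_with_nan` and `cols_with_nan` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Same NaN layout as the example above, but with fixed (non-random) values.
df = pd.DataFrame(np.arange(25, dtype=float).reshape(5, 5), columns=list("abcde"))
df.iloc[2:, 1:4] = np.nan

# Boolean mask per row -> keep only rows that contain NaN.
rows_with_nan = df[df.isnull().any(axis=1)]

# Boolean mask per column -> keep only columns that contain NaN.
cols_with_nan = df.loc[:, df.isnull().any(axis=0)]

print(list(rows_with_nan.index))
print(list(cols_with_nan.columns))
```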
That's all!
There may well be an easier way. If you know one, please let me know.
Also, while pandas row extraction has `loc` and `iloc`, the column extraction has
df. or
df.ix [:,
Update 2017/4/15
Extract column 3 with df.icol(3)
Extract columns 0 and 2 with df.icol([0,2])
df.icol([0:2]) does **not extract columns 0, 1 and 2; it raises an error**
I posted a speed comparison in the comment section.
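As far as I know `icol` is no longer available in recent pandas releases. The positional column selections from the update above can be sketched with `iloc` instead; this is my substitution, not the method used in the update, and the frame and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical 5x5 frame with fixed values and columns a-e.
df = pd.DataFrame(np.arange(25, dtype=float).reshape(5, 5), columns=list("abcde"))

col_3 = df.iloc[:, 3]          # single column at position 3, returned as a Series
cols_0_2 = df.iloc[:, [0, 2]]  # columns at positions 0 and 2
cols_0_to_2 = df.iloc[:, 0:3]  # slice: positions 0, 1, 2 (the end is exclusive)

print(col_3.name)
print(list(cols_0_2.columns))
print(list(cols_0_to_2.columns))
```

With `iloc` a slice is written directly (`0:3`) rather than inside a list, which avoids the error seen with `[0: 2]`.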