I searched for a while and couldn't find this, so once I worked out how to do it I wrote it down as a note.
The theme is "Is there a NaN in the pandas DataFrame?"
As a simple check that the data is being processed properly, I want to **find out whether the data frame contains a NaN value and where it is**.
If you want to fill or drop NaN, you can use `fillna()` / `dropna()`, but what I want to do here is **"check whether there is a NaN and display that row (column)"**.
As an example, I want to extract only rows 2 to 4, or only columns 1 to 3 (b to d), of the data frame below.
Data creation
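import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(5, 5))
# .ix has been removed from pandas; .loc slices by label, both ends inclusive
df.loc[2:, 1:3] = np.nan
df.columns = list('abcde')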
df
#[Out]# a b c d e
#[Out]# 0 -0.678873 -1.277486 -1.062232 0.097525 -2.386115
#[Out]# 1 -1.063709 -1.919997 -0.131733 -0.606348 0.101888
#[Out]# 2 -1.701473 NaN NaN NaN 0.201468
#[Out]# 3 -0.624932 NaN NaN NaN -0.654297
#[Out]# 4 0.345065 NaN NaN NaN -0.232199
Use `isnull()` / `notnull()` to see whether there is a NaN. Reference below.
The official pandas guide on handling NaN: pandas 0.19.1 documentation » Working with missing data
Using the isnull method
isnull()
df.isnull()
#[Out]# a b c d e
#[Out]# 0 False False False False False
#[Out]# 1 False False False False False
#[Out]# 2 False True True True False
#[Out]# 3 False True True True False
#[Out]# 4 False True True True False
What is returned is a data frame of the same size as df that contains bool values: True only where the value is NaN.
`notnull()` returns the same data frame as `isnull()` with True and False reversed.
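As a quick check of that relationship, here is a minimal sketch. The frame `small` and its values are made up for illustration and are not part of the original example.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (values chosen for this sketch only)
small = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, 4.0]})

null_mask = small.isnull()       # True where the value is NaN
notnull_mask = small.notnull()   # True where a value is present

# notnull() should be the element-wise negation of isnull()
same = notnull_mask.equals(~null_mask)
print(same)  # True
```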
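This is a little different from what I want to do.
Broken down, what I want to do, **"check for NaN and display its rows (columns)"**, probably comes to: decide for each row (column) whether it contains at least one NaN, then use that result to select the rows (columns).
When the question is **"is there at least one of something"**, the tool is **numpy's `any` method**.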
np.any()
df.isnull().any()
#[Out]# a False
#[Out]# b True
#[Out]# c True
#[Out]# d True
#[Out]# e False
#[Out]# dtype: bool
df.isnull().any(axis=1)
#[Out]# 0 False
#[Out]# 1 False
#[Out]# 2 True
#[Out]# 3 True
#[Out]# 4 True
#[Out]# dtype: bool
df.isnull().any(axis=0) # same as df.isnull().any()
#[Out]# a False
#[Out]# b True
#[Out]# c True
#[Out]# d True
#[Out]# e False
#[Out]# dtype: bool
The default scan direction of `any()` is down the rows (axis=0), so `df.isnull().any()` returns, for each column, True if it contains at least one True (that is, a NaN as converted by `isnull()`) and False otherwise. With `any(axis=1)` the scan direction changes to across the columns (axis=1), and each row is checked for a True (that is, a NaN).
At the time of writing, `axis=` could be omitted, so `df.isnull().any(1)` was the same as `df.isnull().any(axis=1)`. As far as I know, recent pandas versions accept only the keyword form, so `axis=1` is the safer spelling.
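To make the two directions concrete, here is a small sketch on a frame with fixed values, so the result is reproducible. The name `demo` and its contents are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Fixed values so the result does not depend on random data
demo = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [np.nan, 5.0, 6.0]})

per_column = demo.isnull().any(axis=0)  # one bool per column
per_row = demo.isnull().any(axis=1)     # one bool per row

print(per_column.tolist())  # [False, True]
print(per_row.tolist())     # [True, False, False]
```

The result of `axis=0` has one entry per column, while `axis=1` has one entry per row, which is what makes it usable as a row mask.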
This is a slight detour from my main goal, but to **get True if there is a NaN anywhere at all**, chain `any` twice.
Does it contain even one NaN?
df.isnull().any().any() #Contains NaN
#[Out]# True
dff=pd.DataFrame(np.random.randn(5,5)) #Does not contain NaN
dff.isnull().any().any()
#[Out]# False
The same approach appears on Stack Overflow (Python pandas: check if any value is NaN in DataFrame), where other methods besides `df.isnull().any().any()` are also used.
The fastest when measured with `%timeit` was `df.isnull().values.any()`.
**If you only want to know whether even one NaN is present**, use that.
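A rough way to reproduce that comparison outside IPython is the standard `timeit` module. This is only a sketch: the frame size, the position of the planted NaN, and the repeat count are arbitrary assumptions, and the timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Arbitrary test frame for this sketch, with a single NaN planted in it
frame = pd.DataFrame(np.random.randn(1000, 10))
frame.iloc[5, 3] = np.nan

chained = frame.isnull().any().any()      # reduce per column, then overall
via_values = frame.isnull().values.any()  # reduce the underlying numpy array

# Both forms give the same answer
print(bool(chained), bool(via_values))  # True True

# Time 100 calls of each form; absolute numbers are machine dependent
t_chained = timeit.timeit(lambda: frame.isnull().any().any(), number=100)
t_values = timeit.timeit(lambda: frame.isnull().values.any(), number=100)
print(f"any().any(): {t_chained:.4f}s  values.any(): {t_values:.4f}s")
```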
Now I can finally do what I set out to do.
`df.isnull().any(axis=1)` gives a bool value per row saying whether that row contains a NaN; **slice with it** to extract only the rows that contain NaN.
Extracting the rows that contain NaN
df[df.isnull().any(axis=1)]
#[Out]# a b c d e
#[Out]# 2 -1.701473 NaN NaN NaN 0.201468
#[Out]# 3 -0.624932 NaN NaN NaN -0.654297
#[Out]# 4 0.345065 NaN NaN NaN -0.232199
Extracting the columns that contain NaN
df.loc[:, df.isnull().any()]
#[Out]# b c d
#[Out]# 0 -1.277486 -1.062232 0.097525
#[Out]# 1 -1.919997 -0.131733 -0.606348
#[Out]# 2 NaN NaN NaN
#[Out]# 3 NaN NaN NaN
#[Out]# 4 NaN NaN NaN
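If you want the "where" as labels rather than as a sliced frame, the same masks can index `df.index` and `df.columns`. This is a sketch under assumptions: `df_demo` is a stand-in frame with fixed values and the same NaN layout as the example above.

```python
import numpy as np
import pandas as pd

# Stand-in frame with the same NaN layout as the article's df (rows 2-4, columns b-d)
df_demo = pd.DataFrame(np.arange(25, dtype=float).reshape(5, 5), columns=list("abcde"))
df_demo.iloc[2:, 1:4] = np.nan

# Row labels and column labels that contain at least one NaN
nan_rows = df_demo.index[df_demo.isnull().any(axis=1)].tolist()
nan_cols = df_demo.columns[df_demo.isnull().any(axis=0)].tolist()

print(nan_rows)  # [2, 3, 4]
print(nan_cols)  # ['b', 'c', 'd']

# Apply both masks at once to keep only the rows and columns touched by NaN
block = df_demo.loc[df_demo.isnull().any(axis=1), df_demo.isnull().any(axis=0)]
print(block.shape)  # (3, 3)
```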
That's all!
There may well be an easier way. If you know one, please let me know.
Also, while pandas has `loc` and `iloc` for row extraction, column extraction has df. or df.ix[:,
Update 2017/4/15
Extract column 3 with `df.icol(3)`
Extract columns 0 and 2 with `df.icol([0,2])`
`df.icol([0:2])` does **not extract columns 0, 1 and 2; it raises an error**
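As far as I know, `icol` was deprecated in favor of `iloc` and is no longer available in current pandas. A minimal sketch of the position-based equivalents follows; the frame `df_pos` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative frame: 4 rows, columns a..e
df_pos = pd.DataFrame(np.arange(20).reshape(4, 5), columns=list("abcde"))

col3 = df_pos.iloc[:, 3]            # single column at position 3 (a Series)
cols_0_2 = df_pos.iloc[:, [0, 2]]   # columns at positions 0 and 2
cols_slice = df_pos.iloc[:, 0:2]    # positional slice, end exclusive

print(col3.name)                 # d
print(list(cols_0_2.columns))    # ['a', 'c']
print(list(cols_slice.columns))  # ['a', 'b']
```

Note that an `iloc` slice excludes its end position, so columns 0, 1 and 2 would be `df_pos.iloc[:, 0:3]`.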
I posted a speed comparison in the comment section.