[PYTHON] Masks are useful for searching within Pandas data frames

Pandas comes in handy for all sorts of data work in Python, but until now, when I searched for specific values, I processed rows one by one with a for loop or df.iterrows(). That gets slow as the data grows.

After looking into whether there was a better way, it turns out you can pass df.ix[] a boolean mask (a filter, essentially).

For example, suppose you have a data frame like this:

print df

                           A        B         C         D         E
TIME                                                                   
2001-01-08 06:00:00  -0.5925    -1.11  0.055023 -0.074103  0.129126    
2001-01-08 06:05:00  -0.4213      NaN       NaN       NaN       NaN
2001-01-08 06:10:00  -0.3613    -2.19       NaN       NaN       NaN    
2001-01-08 06:15:00  -0.3613      NaN       NaN       NaN  0.224574   
2001-01-08 06:20:00  -0.3613      NaN  0.129036 -0.000281  1.094003   
2001-01-08 06:25:00  -0.8243    -0.65  0.137242 -0.022502  2.145931   
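
For reference, here is a minimal sketch that rebuilds this frame by hand; the date_range frequency and the literal values are simply copied from the table above.

import numpy as np
import pandas as pd

# Rebuild the sample frame; the NaN values come from numpy.
index = pd.date_range("2001-01-08 06:00:00", periods=6, freq="5min")
index.name = "TIME"
df = pd.DataFrame({
    "A": [-0.5925, -0.4213, -0.3613, -0.3613, -0.3613, -0.8243],
    "B": [-1.11, np.nan, -2.19, np.nan, np.nan, -0.65],
    "C": [0.055023, np.nan, np.nan, np.nan, 0.129036, 0.137242],
    "D": [-0.074103, np.nan, np.nan, np.nan, -0.000281, -0.022502],
    "E": [0.129126, np.nan, np.nan, 0.224574, 1.094003, 2.145931],
}, index=index)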

I want to know which rows are NaN in column B

print df.ix[df["B"].isnull()]

                           A        B         C         D         E
TIME                                                                   
2001-01-08 06:05:00  -0.4213      NaN       NaN       NaN       NaN
2001-01-08 06:15:00  -0.3613      NaN       NaN       NaN  0.224574
2001-01-08 06:20:00  -0.3613      NaN  0.129036 -0.000281  1.094003      

I want to know which rows are not NaN in column E

print df.ix[df["E"].isnull()==False]

                           A        B         C         D         E
TIME                                                                   
2001-01-08 06:00:00  -0.5925    -1.11  0.055023 -0.074103  0.129126    
2001-01-08 06:15:00  -0.3613      NaN       NaN       NaN  0.224574
2001-01-08 06:20:00  -0.3613      NaN  0.129036 -0.000281  1.094003
2001-01-08 06:25:00  -0.8243    -0.65  0.137242 -0.022502  2.145931         
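
By the way, df["E"].notnull() builds the same mask as isnull() == False and reads more naturally:

print df.ix[df["E"].notnull()]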

I want to hit every row in which any column is NaN

print df.ix[df.isnull().any(axis=1)]

                           A        B         C         D         E
TIME                                                                   
2001-01-08 06:05:00  -0.4213      NaN       NaN       NaN       NaN
2001-01-08 06:10:00  -0.3613    -2.19       NaN       NaN       NaN    
2001-01-08 06:15:00  -0.3613      NaN       NaN       NaN  0.224574   
2001-01-08 06:20:00  -0.3613      NaN  0.129036 -0.000281  1.094003   
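
The axis=1 matters here: df.isnull() yields a frame of booleans, and .any(axis=1) collapses it to one boolean per row, which is exactly the shape a row mask needs. The other variants collapse differently:

df.isnull().any(axis=1)   # one boolean per row -- usable as a row mask
df.isnull().any()         # one boolean per column
df.isnull().values.any()  # a single boolean for the whole frame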

I want only the first three columns of the rows where column C is NaN

print df.ix[df["C"].isnull(), :3]

                           A        B         C
TIME                                                                   
2001-01-08 06:05:00  -0.4213      NaN       NaN
2001-01-08 06:10:00  -0.3613    -2.19       NaN
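
Note that the :3 here is a positional column slice, which .ix allowed. In current pandas, where .ix has been removed, the same selection can be written with .loc and the first three column labels (a sketch assuming the column layout above):

print df.loc[df["C"].isnull(), df.columns[:3]]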

I want columns A and B of the rows where column D is NaN

print df.ix[df["D"].isnull(), ["A", "B"]]

                           A        B
TIME                                                                   
2001-01-08 06:05:00  -0.4213      NaN
2001-01-08 06:10:00  -0.3613    -2.19
2001-01-08 06:15:00  -0.3613      NaN

Manipulate the values in column A for the rows where column D is NaN (try subtracting 1)

df.ix[df["D"].isnull(), "A"] -= 1 
print df["A"]

                           A
TIME                          
2001-01-08 06:00:00  -0.5925                 
2001-01-08 06:05:00  -1.4213
2001-01-08 06:10:00  -1.3613
2001-01-08 06:15:00  -1.3613
2001-01-08 06:20:00  -0.3613 
2001-01-08 06:25:00  -0.8243  

A conditional search also works on the data in this state

print df.ix[df["A"] <= -1]

                           A        B         C         D         E
TIME                      
2001-01-08 06:05:00  -1.4213      NaN       NaN       NaN       NaN
2001-01-08 06:10:00  -1.3613    -2.19       NaN       NaN       NaN    
2001-01-08 06:15:00  -1.3613      NaN       NaN       NaN  0.224574   
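
Masks also combine with & (and) and | (or); each condition needs its own parentheses because of operator precedence. For example, rows where A <= -1 and E is not NaN (a sketch against the frame above):

print df.ix[(df["A"] <= -1) & df["E"].notnull()]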

The example here is tiny, but because the mask is evaluated as a single vectorized operation rather than a Python-level loop, the approach stays fast even on large data frames.
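
If you want to see the difference yourself, a rough benchmark along these lines makes it obvious (a sketch; the frame size and the threshold are arbitrary):

import time
import numpy as np
import pandas as pd

big = pd.DataFrame(np.random.randn(100000, 5), columns=list("ABCDE"))

# Row-by-row scan with iterrows()
t0 = time.time()
hits = [idx for idx, row in big.iterrows() if row["A"] <= -1]
print(time.time() - t0)

# Vectorized boolean mask
t0 = time.time()
hits = big[big["A"] <= -1]
print(time.time() - t0)

The mask version should come out orders of magnitude faster.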


Postscript: when you want to derive values row by row with a complicated conditional expression, df.apply(function, axis=1) seems to be the best fit.
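
For example, something along these lines (the rule here is hypothetical, just to show the shape; apply with axis=1 passes each row to the function as a Series):

import numpy as np

# Hypothetical rule: flag rows that are missing D but still have E.
def label(row):
    if np.isnan(row["D"]) and not np.isnan(row["E"]):
        return "partial"
    return "ok"

df["STATUS"] = df.apply(label, axis=1)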
