[PYTHON] 10 selections of data extraction by pandas.DataFrame.query

Introduction

In this article, I will introduce various conditional extractions using query, which is one of the pandas functions. In my experience, I will write the actual code based on 10 examples that are frequently encountered during data preprocessing.

Execution environment

!python -V
Python 3.7.3

Module used

The version of pandas uses 1.0.3.

import pandas as pd  #The leading role this time
import numpy as np  #Used to generate inf and nan values
#Check version of pandas
print(pd.__version__)
1.0.3

Data preparation

#Preparation of data frame to be used this time
df = pd.DataFrame({
        'a': range(1, 6),
        'b': range(10, 0, -2),
        'c c' : [1, 3, 5, 7, 9],
        'd' : [2, np.nan, np.inf, None, 10],
        'e' : ['a', 'b', 'c', 'd', 'e'],
        'f' : [False, True, True, False, True]})
#Check data frame
df
a b c c d e f
0 1 10 1 2.0 a False
1 2 8 3 NaN b True
2 3 6 5 inf c True
3 4 4 7 NaN d False
4 5 2 9 10.0 e True

10 selections of data extraction

  1. [Condition extraction in 1 column](Condition extraction in # 1 column)
  2. [Condition extraction with multiple columns](# Condition extraction with multiple columns)
  3. [Conditional extraction of columns with spaces in column names](#Conditional extraction of columns with spaces in column names)
  4. [Condition extraction by character string specification](# Condition extraction by character string specification)
  5. [Condition extraction by multiple conditions](# Condition extraction by multiple conditions)
  6. [Condition extraction by inf specification](Condition extraction by #inf specification)
  7. [Condition extraction by specifying missing values](# Condition extraction by specifying missing values)
  8. [Condition extraction by specifying boolean value](#Conditional extraction by specifying boolean value)
  9. [Condition extraction using list](# Condition extraction using list)
  10. [Condition extraction using variables](# Condition extraction using variables)

Condition extraction in one column

Basic conditional extraction using only one column of the data frame. Normal comparison operators can be used.

df.query('a == 3')  #Extract rows with column a value of 3
a b c c d e f
2 3 6 5 inf c True
df.query('a != 3')  #Extract rows where the value in column a is not 3
a b c c d e f
0 1 10 1 2.0 a False
1 2 8 3 NaN b True
3 4 4 7 NaN d False
4 5 2 9 10.0 e True
df.query('a > 3')  #Extract rows with column a value greater than 3
a b c c d e f
3 4 4 7 NaN d False
4 5 2 9 10.0 e True

Condition extraction in multiple columns

Conditional extraction using values between multiple columns

df.query('a > b')  #Extract rows where the value in column a is greater than the value in column b
a b c c d e f
4 5 2 9 10.0 e True

Conditional extraction of columns containing spaces in column names

If the column name contains spaces, the column name must be enclosed in backticks (\ `). Since this function was implemented in pandas 0.25.0, an error will occur in version 0.24.0 or earlier. https://pandas-docs.github.io/pandas-docs-travis/whatsnew/v0.25.0.html#other-enhancements

df.query(' `c c` > b')  #c Extract rows where the value in column c is greater than the value in column b
a b c c d e f
3 4 4 7 NaN d False
4 5 2 9 10.0 e True

Condition extraction by specifying a character string

If you specify the condition as a string, you must enclose the value in double quotes (").

df.query('e == "a" ')  #The value in column e is a string"a"Extract the line that is
a b c c d e f
0 1 10 1 2.0 a False

Condition extraction by multiple conditions

Condition extraction using multiple conditions and logical operators such as and and or.

df.query('a > 2 and b < 3')  #Extract rows where the value in column a is greater than 2 and the value in column b is less than 3
a b c c d e f
4 5 2 9 10.0 e True
df.query('a > 2 or b < 3')  #Extract rows where the value in column a is greater than 2 or the value in column b is less than 3
a b c c d e f
2 3 6 5 inf c True
3 4 4 7 NaN d False
4 5 2 9 10.0 e True
df.query('a > 2 or e == "d" ')  #The value in column a is greater than 2 or the value in column e is a string"d"Extract the line that is
a b c c d e f
2 3 6 5 inf c True
3 4 4 7 NaN d False
4 5 2 9 10.0 e True
df.query(' 2<= a <= 4')  #Extract rows where the value in column a is 2 or more and 4 or less
a b c c d e f
1 2 8 3 NaN b True
2 3 6 5 inf c True
3 4 4 7 NaN d False

Condition extraction by specifying inf

If the value has inf, extract as follows.

df.query(' d == inf')  #Extract rows where the value in column d is inf
a b c c d e f
2 3 6 5 inf c True
df.query(' d != inf')  #Extract rows where the value in column d is not inf
a b c c d e f
0 1 10 1 2.0 a False
1 2 8 3 NaN b True
3 4 4 7 NaN d False
4 5 2 9 10.0 e True

Condition extraction by specifying missing values

If the value has a missing value, extract as follows.

df.query(' d != d')  #Extract rows where the value in column d is missing
a b c c d e f
1 2 8 3 NaN b True
3 4 4 7 NaN d False
df.query(' d == d')  #Extract rows where the value in column d is not missing
a b c c d e f
0 1 10 1 2.0 a False
2 3 6 5 inf c True
4 5 2 9 10.0 e True

Condition extraction by specifying boolean value

If the value has a boolean value such as True or False, extract as follows.

df.query('f == True')  #Extract rows where the value in column f is True
a b c c d e f
1 2 8 3 NaN b True
2 3 6 5 inf c True
4 5 2 9 10.0 e True
df.query('f == False')  #Extract rows where the value in column f is False
a b c c d e f
0 1 10 1 2.0 a False
3 4 4 7 NaN d False

Conditional extraction using a list

Conditional extraction is possible depending on whether the value is in the list.

df.query(' a in (1, 2, 3)')  #The value in column a is(1, 2, 3)Extract a row that is one of
a b c c d e f
0 1 10 1 2.0 a False
1 2 8 3 NaN b True
2 3 6 5 inf c True
df.query(' a not in (1, 2, 3)')  #The value in column a is(1, 2, 3)Extract a row that is none of
a b c c d e f
3 4 4 7 NaN d False
4 5 2 9 10.0 e True

Conditional extraction using variables

It is possible to use variables prepared in advance as the value of the condition.

x = 2  #Store 2 in variable x
df.query(' a == @x')  #Extract the row where the value of column a is x
a b c c d e f
1 2 8 3 NaN b True
y = [2, 3, 4]  #Store list in variable y
df.query('a in @y')  #Extract rows where the value in column a is one of y
a b c c d e f
1 2 8 3 NaN b True
2 3 6 5 inf c True
3 4 4 7 NaN d False

Recommended Posts

10 selections of data extraction by pandas.DataFrame.query
Visualization of data by prefecture
Animation of geographic data by geopandas
A memorandum of extraction by python bs4 request
Sentiment analysis of large-scale tweet data by NLTK
Numerical summary of data
Analysis of financial data by pandas and its visualization (2)
Anomaly detection of time series data by LSTM (Keras)
Split data by threshold
Training data by CNN
Story of image analysis of PDF file and data extraction
Correlation by data preprocessing
Data batch extraction method by regular expression from Series
Visualization method of data by explanatory variable and objective variable
Preprocessing of prefecture data
Selection of measurement data
What I saw by analyzing the data of the engineer market
Tuning experiment of Tensorflow data
Classify data by k-means method
Gzip the data by streaming
Calculation of similarity by MinHash
Keyword extraction by MeCab (python)
Line art extraction of illustrations
Fourier transform of raw data
Data acquired by Django releted
Average estimation of capped data
About data management of anvil-app-server
Probability prediction of imbalanced data
Automatic acquisition of gene expression level data by python and R
Feature extraction by TF method using the result of morphological analysis
Impressions of touching Dash, a data visualization tool made by python