Python pandas: Search for DataFrame using regular expressions

pandas has the ability to retrieve data. Note that I stumbled on the method when I wanted to use a part of the character string as a search condition, such as when the ID was grouped by the first character of several digits. I used Jupyter to execute the code. ** Please install pandas using pip etc. ** **

environment

Python3.5 pandas0.17 Jupyter notebook

ID specifications

1st digit: A ~ E 2nd digit: A ~ E 3rd digit: 0 ~ 9 4th digit: 0 ~ 9 5th digit: 0 ~ 9 6th digit: 0 ~ 9

Example: AD3489

Preparation: Data creation

Generate a large amount of ID-like data

It's not very clever code, but it generated 200 ID-like strings and names.

python


import random
id = []
text=['A','B','C','D','E']
name = []
namelistF = ['Thom','Walter','Alexis','Allan','Hubert','Amy','Anastasia','Lorna','Barbara']
namelistL=['Abernethy','Ackroyd','Collins','Guises','Haworth','Hayden','Kingsley','Leyland','McLennan']

for i in range(200):
    id.append(text[random.randint(0,len(text))-1] + text[random.randint(0,len(text))-1] + str(random.randint(0,len(text))) + str(random.randint(0,9)) + str(random.randint(0,9) )+ str(random.randint(0,9)))
    name.append(namelistF[random.randint(0,len(namelist)-1)] + ' ' + namelistL[random.randint(0,len(namelist)-1)])

In this way, 200 pieces of data were generated.

Obtained ID data


['AD4865', 'DE5898', 'ED1278', 'EA4809', 'CA1251', 'EC2877', 'EC0579', 'EC3476', 'EE1624', 'AD3489', 'AC1209', 'EE3963', 'BC4656', 'BE3524', 'BE1321', 'CA4381', 'BB4813', 'AC4885', 'EA2290', 'DE3520', 'EC3145', 'CE5665', 'BE0071', 'EA0096', 'DC3792', 'CC1623', 'EE3194', 'EE1599', 'EE3398', 'EE0891', 'AB2412', 'AB3631', 'EC3118', 'DE2280', 'DA2018', 'CE4929', 'DA0157', 'BE3865', 'EE5122', 'ED5983', 'CD3569', 'BA3981', 'DE3913', 'BC3383', 'EE1456', 'AB0498', 'DA3937', 'AE2070', 'CE5804', 'DE2952', 'EB5014', 'AB0785', 'EC5061', 'AE1615', 'AC5219', 'EB3453', 'BB4372', 'EC4597', 'EE3370', 'CC1251', 'AA2228', 'CE3570', 'AC2734', 'DC0885', 'AC1142', 'DE3956', 'EE4113', 'EA4665', 'EB5010', 'CE1369', 'EC4333', 'AC1651', 'CE2802', 'BE3500', 'BB5521', 'DC5230', 'BA3568', 'AD4049', 'EE0666', 'CB1121', 'AA1281', 'CC1060', 'AD3520', 'BC3767', 'CD1125', 'DA0091', 'DA1978', 'BD2344', 'EE1345', 'AE2442', 'BC2306', 'CD1526', 'DD0514', 'EE0192', 'AE0517', 'ED3991', 'EE4694', 'AA1882', 'AC3296', 'ED1368', 'BD2772', 'BC1574', 'AD4402', 'DA2235', 'EC0168', 'CD0686', 'AB1050', 'BB2185', 'EA0274', 'DE4259', 'CB1594', 'ED3869', 'EA4427', 'CE0328', 'DD1004', 'BC4501', 'EE4727', 'CC5595', 'ED2344', 'ED4679', 'EE4167', 'AD5802', 'CB4401', 'EB2113', 'ED1493', 'CD3285', 'AA2717', 'AA1299', 'DA0957', 'CB5816', 'EE3520', 'DB2187', 'EA4323', 'DC3989', 'AA0663', 'DE5448', 'EE1418', 'EE2066', 'AA1373', 'EE1518', 'EE1782', 'CB2087', 'CE0266', 'AC4397', 'EC1308', 'EE2743', 'AD3849', 'CC3094', 'EE2725', 'BD3893', 'EC4543', 'DE2661', 'BE2766', 'DC2414', 'ED1347', 'CB2313', 'AD0026', 'ED1356', 'BE3893', 'EE3442', 'EC1069', 'EE2767', 'CA1410', 'EC5553', 'EA2432', 'AB5636', 'ED1639', 'DB3176', 'EE5932', 'EC1606', 'CC4372', 'EA2838', 'DE2964', 'DD3804', 'BC5171', 'BD4210', 'EE3239', 'AA3637', 'DA4393', 'CB3981', 'CD4627', 'EE0674', 'BE3231', 'BE0014', 'EE4350', 'BD1956', 'CE3526', 'EB2453', 'AB3957', 'EE5343', 'AD1178', 'CC4550', 'DD0465', 'CE2580', 'CD1344', 'ED0247', 'ED4614', 'DE0976', 'CE5883', 'AA2922']

Obtained name data


['Lorna Leyland', 'Anastasia Guises', 'Hubert Guises', 'Allan Kingsley', 'Lorna Kingsley', 'Amy Hayden', 'Hubert Abernethy', 'Anastasia Guises', 'Thom Hayden', 'Walter Kingsley', 'Allan Ackroyd', 'Thom Kingsley', 'Anastasia Collins', 'Thom Ackroyd', 'Thom Collins', 'Lorna Haworth', 'Anastasia Hayden', 'Thom Leyland', 'Alexis Leyland', 'Anastasia McLennan', 'Walter Haworth', 'Amy Collins', 'Barbara Leyland', 'Anastasia Ackroyd', 'Amy Kingsley', 'Barbara Haworth', 'Allan Ackroyd', 'Barbara Guises', 'Thom Haworth', 'Thom Ackroyd', 'Allan Guises', 'Barbara McLennan', 'Barbara Guises', 'Hubert Abernethy', 'Anastasia Haworth', 'Walter Collins', 'Allan Abernethy', 'Hubert Ackroyd', 'Thom McLennan', 'Thom Abernethy', 'Allan McLennan', 'Hubert Kingsley', 'Allan Leyland', 'Allan Haworth', 'Alexis Collins', 'Thom Ackroyd', 'Anastasia Haworth', 'Alexis McLennan', 'Amy Haworth', 'Walter Guises', 'Barbara Hayden', 'Lorna Collins', 'Lorna Abernethy', 'Alexis Abernethy', 'Anastasia Ackroyd', 'Allan Kingsley', 'Thom McLennan', 'Thom Hayden', 'Walter Hayden', 'Lorna Leyland', 'Alexis Leyland', 'Barbara Leyland', 'Barbara Collins', 'Amy Leyland', 'Lorna Hayden', 'Lorna Leyland', 'Walter Kingsley', 'Allan McLennan', 'Anastasia Haworth', 'Allan Guises', 'Anastasia Abernethy', 'Walter Kingsley', 'Amy Abernethy', 'Allan Guises', 'Lorna Ackroyd', 'Alexis Guises', 'Thom Abernethy', 'Lorna McLennan', 'Allan Abernethy', 'Barbara Hayden', 'Amy Collins', 'Allan Kingsley', 'Thom Haworth', 'Thom Kingsley', 'Alexis Leyland', 'Amy Hayden', 'Hubert Hayden', 'Thom Ackroyd', 'Walter Abernethy', 'Barbara Guises', 'Walter Collins', 'Hubert Abernethy', 'Allan Ackroyd', 'Barbara Kingsley', 'Anastasia Hayden', 'Lorna Abernethy', 'Thom Hayden', 'Thom Leyland', 'Amy Leyland', 'Thom Guises', 'Alexis Hayden', 'Walter Guises', 'Anastasia Leyland', 'Thom Ackroyd', 'Allan Collins', 'Barbara Collins', 'Allan Hayden', 'Thom Collins', 'Barbara McLennan', 'Lorna Haworth', 'Walter Hayden', 'Barbara Guises', 'Alexis Collins', 'Lorna Hayden', 'Barbara Haworth', 'Thom Guises', 'Hubert Guises', 'Anastasia Haworth', 'Thom Ackroyd', 'Hubert Haworth', 'Hubert Abernethy', 'Anastasia Leyland', 'Amy Leyland', 'Walter Guises', 'Thom McLennan', 'Hubert Collins', 'Barbara McLennan', 'Anastasia Guises', 'Amy Ackroyd', 'Barbara McLennan', 'Anastasia Guises', 'Walter Hayden', 'Amy Kingsley', 'Lorna Hayden', 'Walter Guises', 'Barbara Guises', 'Thom Kingsley', 'Anastasia Abernethy', 'Amy Kingsley', 'Walter Kingsley', 'Walter McLennan', 'Alexis Ackroyd', 'Lorna Kingsley', 'Amy Collins', 'Anastasia Haworth', 'Amy Haworth', 'Lorna Kingsley', 'Allan Guises', 'Anastasia Guises', 'Amy Haworth', 'Barbara Guises', 'Lorna Leyland', 'Allan Hayden', 'Alexis Leyland', 'Lorna Ackroyd', 'Alexis Guises', 'Walter Collins', 'Lorna Abernethy', 'Hubert Collins', 'Hubert McLennan', 'Alexis Collins', 'Lorna Leyland', 'Barbara Hayden', 'Lorna Kingsley', 'Barbara Haworth', 'Thom Leyland', 'Lorna Ackroyd', 'Walter Guises', 'Allan Haworth', 'Hubert McLennan', 'Walter Haworth', 'Lorna Ackroyd', 'Allan Hayden', 'Lorna Abernethy', 'Allan Abernethy', 'Allan Ackroyd', 'Lorna Haworth', 'Barbara Hayden', 'Thom Haworth', 'Alexis Hayden', 'Barbara Leyland', 'Alexis Ackroyd', 'Walter Abernethy', 'Anastasia Abernethy', 'Thom Kingsley', 'Hubert Haworth', 'Amy Hayden', 'Hubert Guises', 'Walter Hayden', 'Allan Guises', 'Allan Ackroyd', 'Hubert Abernethy', 'Amy McLennan', 'Allan Abernethy', 'Walter Kingsley', 'Allan Haworth', 'Hubert Hayden', 'Lorna McLennan', 'Lorna Collins', 'Lorna Haworth']

Creating a dictionary

Put this ID array and name array in a dictionary type called data.

data = {
'ID':
['AD4865', 'DE5898', 'ED1278', 'EA4809', 'CA1251', 'EC2877', 'EC0579', 'EC3476', 'EE1624', 'AD3489', 'AC1209', 'EE3963', 'BC4656', 'BE3524', 'BE1321', 'CA4381', 'BB4813', 'AC4885', 'EA2290', 'DE3520', 'EC3145', 'CE5665', 'BE0071', 'EA0096', 'DC3792', 'CC1623', 'EE3194', 'EE1599', 'EE3398', 'EE0891', 'AB2412', 'AB3631', 'EC3118', 'DE2280', 'DA2018', 'CE4929', 'DA0157', 'BE3865', 'EE5122', 'ED5983', 'CD3569', 'BA3981', 'DE3913', 'BC3383', 'EE1456', 'AB0498', 'DA3937', 'AE2070', 'CE5804', 'DE2952', 'EB5014', 'AB0785', 'EC5061', 'AE1615', 'AC5219', 'EB3453', 'BB4372', 'EC4597', 'EE3370', 'CC1251', 'AA2228', 'CE3570', 'AC2734', 'DC0885', 'AC1142', 'DE3956', 'EE4113', 'EA4665', 'EB5010', 'CE1369', 'EC4333', 'AC1651', 'CE2802', 'BE3500', 'BB5521', 'DC5230', 'BA3568', 'AD4049', 'EE0666', 'CB1121', 'AA1281', 'CC1060', 'AD3520', 'BC3767', 'CD1125', 'DA0091', 'DA1978', 'BD2344', 'EE1345', 'AE2442', 'BC2306', 'CD1526', 'DD0514', 'EE0192', 'AE0517', 'ED3991', 'EE4694', 'AA1882', 'AC3296', 'ED1368', 'BD2772', 'BC1574', 'AD4402', 'DA2235', 'EC0168', 'CD0686', 'AB1050', 'BB2185', 'EA0274', 'DE4259', 'CB1594', 'ED3869', 'EA4427', 'CE0328', 'DD1004', 'BC4501', 'EE4727', 'CC5595', 'ED2344', 'ED4679', 'EE4167', 'AD5802', 'CB4401', 'EB2113', 'ED1493', 'CD3285', 'AA2717', 'AA1299', 'DA0957', 'CB5816', 'EE3520', 'DB2187', 'EA4323', 'DC3989', 'AA0663', 'DE5448', 'EE1418', 'EE2066', 'AA1373', 'EE1518', 'EE1782', 'CB2087', 'CE0266', 'AC4397', 'EC1308', 'EE2743', 'AD3849', 'CC3094', 'EE2725', 'BD3893', 'EC4543', 'DE2661', 'BE2766', 'DC2414', 'ED1347', 'CB2313', 'AD0026', 'ED1356', 'BE3893', 'EE3442', 'EC1069', 'EE2767', 'CA1410', 'EC5553', 'EA2432', 'AB5636', 'ED1639', 'DB3176', 'EE5932', 'EC1606', 'CC4372', 'EA2838', 'DE2964', 'DD3804', 'BC5171', 'BD4210', 'EE3239', 'AA3637', 'DA4393', 'CB3981', 'CD4627', 'EE0674', 'BE3231', 'BE0014', 'EE4350', 'BD1956', 'CE3526', 'EB2453', 'AB3957', 'EE5343', 'AD1178', 'CC4550', 'DD0465', 'CE2580', 'CD1344', 'ED0247', 'ED4614', 'DE0976', 'CE5883', 'AA2922'],

'Name':
['Lorna Leyland', 'Anastasia Guises', 'Hubert Guises', 'Allan Kingsley', 'Lorna Kingsley', 'Amy Hayden', 'Hubert Abernethy', 'Anastasia Guises', 'Thom Hayden', 'Walter Kingsley', 'Allan Ackroyd', 'Thom Kingsley', 'Anastasia Collins', 'Thom Ackroyd', 'Thom Collins', 'Lorna Haworth', 'Anastasia Hayden', 'Thom Leyland', 'Alexis Leyland', 'Anastasia McLennan', 'Walter Haworth', 'Amy Collins', 'Barbara Leyland', 'Anastasia Ackroyd', 'Amy Kingsley', 'Barbara Haworth', 'Allan Ackroyd', 'Barbara Guises', 'Thom Haworth', 'Thom Ackroyd', 'Allan Guises', 'Barbara McLennan', 'Barbara Guises', 'Hubert Abernethy', 'Anastasia Haworth', 'Walter Collins', 'Allan Abernethy', 'Hubert Ackroyd', 'Thom McLennan', 'Thom Abernethy', 'Allan McLennan', 'Hubert Kingsley', 'Allan Leyland', 'Allan Haworth', 'Alexis Collins', 'Thom Ackroyd', 'Anastasia Haworth', 'Alexis McLennan', 'Amy Haworth', 'Walter Guises', 'Barbara Hayden', 'Lorna Collins', 'Lorna Abernethy', 'Alexis Abernethy', 'Anastasia Ackroyd', 'Allan Kingsley', 'Thom McLennan', 'Thom Hayden', 'Walter Hayden', 'Lorna Leyland', 'Alexis Leyland', 'Barbara Leyland', 'Barbara Collins', 'Amy Leyland', 'Lorna Hayden', 'Lorna Leyland', 'Walter Kingsley', 'Allan McLennan', 'Anastasia Haworth', 'Allan Guises', 'Anastasia Abernethy', 'Walter Kingsley', 'Amy Abernethy', 'Allan Guises', 'Lorna Ackroyd', 'Alexis Guises', 'Thom Abernethy', 'Lorna McLennan', 'Allan Abernethy', 'Barbara Hayden', 'Amy Collins', 'Allan Kingsley', 'Thom Haworth', 'Thom Kingsley', 'Alexis Leyland', 'Amy Hayden', 'Hubert Hayden', 'Thom Ackroyd', 'Walter Abernethy', 'Barbara Guises', 'Walter Collins', 'Hubert Abernethy', 'Allan Ackroyd', 'Barbara Kingsley', 'Anastasia Hayden', 'Lorna Abernethy', 'Thom Hayden', 'Thom Leyland', 'Amy Leyland', 'Thom Guises', 'Alexis Hayden', 'Walter Guises', 'Anastasia Leyland', 'Thom Ackroyd', 'Allan Collins', 'Barbara Collins', 'Allan Hayden', 'Thom Collins', 'Barbara McLennan', 'Lorna Haworth', 'Walter Hayden', 'Barbara Guises', 'Alexis Collins', 'Lorna Hayden', 'Barbara Haworth', 'Thom Guises', 'Hubert Guises', 'Anastasia Haworth', 'Thom Ackroyd', 'Hubert Haworth', 'Hubert Abernethy', 'Anastasia Leyland', 'Amy Leyland', 'Walter Guises', 'Thom McLennan', 'Hubert Collins', 'Barbara McLennan', 'Anastasia Guises', 'Amy Ackroyd', 'Barbara McLennan', 'Anastasia Guises', 'Walter Hayden', 'Amy Kingsley', 'Lorna Hayden', 'Walter Guises', 'Barbara Guises', 'Thom Kingsley', 'Anastasia Abernethy', 'Amy Kingsley', 'Walter Kingsley', 'Walter McLennan', 'Alexis Ackroyd', 'Lorna Kingsley', 'Amy Collins', 'Anastasia Haworth', 'Amy Haworth', 'Lorna Kingsley', 'Allan Guises', 'Anastasia Guises', 'Amy Haworth', 'Barbara Guises', 'Lorna Leyland', 'Allan Hayden', 'Alexis Leyland', 'Lorna Ackroyd', 'Alexis Guises', 'Walter Collins', 'Lorna Abernethy', 'Hubert Collins', 'Hubert McLennan', 'Alexis Collins', 'Lorna Leyland', 'Barbara Hayden', 'Lorna Kingsley', 'Barbara Haworth', 'Thom Leyland', 'Lorna Ackroyd', 'Walter Guises', 'Allan Haworth', 'Hubert McLennan', 'Walter Haworth', 'Lorna Ackroyd', 'Allan Hayden', 'Lorna Abernethy', 'Allan Abernethy', 'Allan Ackroyd', 'Lorna Haworth', 'Barbara Hayden', 'Thom Haworth', 'Alexis Hayden', 'Barbara Leyland', 'Alexis Ackroyd', 'Walter Abernethy', 'Anastasia Abernethy', 'Thom Kingsley', 'Hubert Haworth', 'Amy Hayden', 'Hubert Guises', 'Walter Hayden', 'Allan Guises', 'Allan Ackroyd', 'Hubert Abernethy', 'Amy McLennan', 'Allan Abernethy', 'Walter Kingsley', 'Allan Haworth', 'Hubert Hayden', 'Lorna McLennan', 'Lorna Collins', 'Lorna Haworth']
}

Since the data is generated using random numbers, the result will be different from this article, but it is the same even if you do the following.

data = {
'ID':id,
'Name':name
}

Creating a DataFrame

Create a DataFrame from the data created earlier.

import pandas ps pd
frame = pd.DataFrame(data)

The DataFrame was created in pandas like this. スクリーンショット 2015-10-31 9.59.20.jpg

Data search

Use ** str.contains () ** to search pandas data using regular expressions. When you want to check "when the first character of the 6-digit ID starts with E"

frame[frame['ID'].str.contains('E.....')]

Then you can get a list of lines whose ID starts with E like this. 71 data hits. The'E .....'part is a regular expression. "." Means something for one character. 'E .....' means to search for the 6 letters starting with E. If you try'E ......', it means 7 characters starting with E, so it will not hit.

スクリーンショット 2015-10-31 10.07.50.jpg

To find data with 1 in the 3rd digit

frame[frame['ID'].str.contains('..1...')]

It will be.

Search by multiple conditions

There are still other ways to write a regular expression, but if you want to search, for example, if the first digit is AA or AB.

frame[frame['ID'].str.contains('(AA|AB)....')]

will do. If you write this way, you will get a warning saying User Warning: This pattern has match groups. To actually get the groups, use str. Extract., But that's okay.

The result is the same regardless of which one is big, but as pandas it was written with or condition

frame[(frame['ID'].str.contains('AA....')) | (frame['ID'].str.contains('AB....')) ]

I think the grammar of is more correct.

When using and condition in pandas

frame[(frame['ID'].str.contains('A.....')) & (frame['ID'].str.contains('...1..')) ]

Scratch like.

I feel that how to write and and or is a point to stumble in pandas. ** frame [(Condition 1) & (Condition 2) | (Condition 3)] **.

Search by two column conditions

If you want to search under the condition that "the first digit of the ID is A and the name is Walter", use the and condition.

frame[(frame['ID'].str.contains('A.....')) & (frame['Name'].str.contains('Walter.*'))]

I will write. スクリーンショット 2015-10-31 10.31.25.jpg

'. \ *'Means "anything is fine, so the string continues". If you enter'Walter. \ *', All strings starting with'Walter' will be hit.

In the case of the search condition "starting with Walter and ending with s" in the previous conditions

frame[(frame['ID'].str.contains('A.....')) & (frame['Name'].str.contains('Walter.*s'))]

It seems that it should be done like this.

スクリーンショット 2015-10-31 10.40.33.jpg However, when I looked at the execution result, I got the result of Walter Kingsley. Apparently it hit the s of King "s" ley. Use $ to specify that s is the last character.
frame[(frame['ID'].str.contains('A.....')) & (frame['Name'].str.contains('Walter.*s$'))]

Then

スクリーンショット 2015-10-31 10.42.08.jpg The $ mark has the meaning of searching for the end of a regular expression string. So I didn't hit Kingsley's s, but Guises and Collins.

Summary

I didn't know how to search based on the first character of the data stored in the DataFrame, so I was forced to search by turning the for loop, but if you use this method, you can search quickly with a short syntax.

Recommended Posts

Python pandas: Search for DataFrame using regular expressions
When using regular expressions in Python
Convert from Pandas DataFrame to System.Data.DataTable using Python for .NET
[Python] Regular Expressions Regular Expressions
[Python / Tkinter] Search for Pandas DataFrame → Create a simple search form to display
Refined search for Pokemon race values using Python
Search Twitter using Python
Python application: Pandas # 3: Dataframe
Summary of pre-processing practices for Python beginners (Pandas dataframe)
[Python] Summary of table creation method using DataFrame (pandas)
Search for strings in Python
Search algorithm using word2vec [python]
Use regular expressions in Python
100 Pandas knocks for Python beginners
About Python and regular expressions
Data analysis using python pandas
Extract arbitrary strings using Python regular expressions / Use named groups
Try a similar search for Image Search using the Python SDK [Search]
[Python] Random data extraction / combination from DataFrame using random and pandas
Python hand play (Pandas / DataFrame beginning)
[Python] Loading csv files using pandas
Pandas / DataFrame Tips for practical use
[Python] Operation memo of pandas DataFrame
3.6 Text Normalization 3.7 Regular Expressions for Tokenizing Text
slackbot memorandum ~ Request using regular expressions ~
I can't remember Python regular expressions
By language: Regular expressions for passwords
Search for profitable brands using COTOHA
[TouchDesigner] Tips for for statements using python
[Python] Reasons for overriding using super ()
[Python] Multiplication table using for statement
Handling regular expressions with PHP / Python
Depth-first search using stack in Python
Dump, restore and query search for Python class instances using mongodb
Overlapping regular expressions in Python and Java
Notes for using OpenCV on Windows10 Python 3.8.3.
[Python] What is pandas Series and DataFrame?
Python development environment for macOS using venv 2016
[50 counts] Key transmission using Python for Windows
[python, multiprocessing] Behavior for exceptions when using multiprocessing
Precautions when using for statements in pandas
Python application: Pandas Part 4: DataFrame concatenation / combination
Tips for using python + caffe with TSUBAME
Replace non-ASCII with regular expressions in Python
Don't use \ d in Python 3 regular expressions!
Pandas of the beginner, by the beginner, for the beginner [Python]
[Python] Add total rows to Pandas DataFrame
Notes for using python (pydev) in eclipse
Search for synonyms from the word list (csv) using Python Japanese WordNet
Create a dataframe from excel using pandas
How to use regular expressions in Python
I analyzed cowrie (honeypot) using python pandas
Python: Simplified morphological analysis with regular expressions
Invert pandas DataFrame upside down with just 15 characters without using a for statement
Search for adsorption structure using Minima Hopping method
vprof --I tried using the profiler for Python
Causal reasoning and causal search with Python (for beginners)
Python Pandas is not suitable for batch processing
Pharmaceutical company researchers summarized regular expressions in Python
[Python pandas] Create an empty DataFrame from an existing DataFrame
[python] Create table from pandas DataFrame to postgres