[PYTHON] Batch extraction of data from a Series with regular expressions

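Extracting strings from a Series with regular expressions

This post shows how to use regular expressions to pull only the required strings out of a file that pandas cannot split correctly on the "," delimiter (like the sample below) and turn the result into a DataFrame.

If you pass the sample data below to read_csv as it is, the comma inside the parentheses splits the lines into different numbers of fields, so the data cannot be parsed into the intended columns (pandas raises a ParserError whenever a line has more fields than the first one).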

test.csv


value1=12333,value2(fuga,hoge),value3=fuga
value1=111,value2(hoge),value3=fugahoge
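
As a rough illustration of the field-count problem, the sketch below feeds the two sample lines to read_csv with the default "," separator. It assumes the lines arrive in reverse order (shorter line first), which is the case where pandas raises an error; the names `raw` and `failed` are only for this illustration.

```python
import io
import pandas as pd

# The two sample lines in reverse order (an assumption for this sketch):
# the first line yields 3 fields, the second yields 4 because of the
# comma inside the parentheses.
raw = (
    "value1=111,value2(hoge),value3=fugahoge\n"
    "value1=12333,value2(fuga,hoge),value3=fuga\n"
)

try:
    pd.read_csv(io.StringIO(raw), header=None)
    failed = False
except pd.errors.ParserError as exc:
    # pandas reports the mismatch between expected and actual field counts
    failed = True
    print(exc)
```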

First, read each line as a single column of data. The file contains no tab characters, so specifying sep='\t' keeps every line intact.

In[2]: import pandas as pd
In[3]: df = pd.read_csv('test.csv',header=None,sep='\t')
In[4]: df
Out[4]: 
                                            0
0  value1=12333,value2(fuga,hoge),value3=fuga
1     value1=111,value2(hoge),value3=fugahoge
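
The same step can be tried without creating test.csv. This is a minimal sketch that assumes the sample lines are held in an in-memory string (`raw` is a name made up here) and read through io.StringIO.

```python
import io
import pandas as pd

# Same two lines as test.csv, held in memory so the sketch runs without the file.
raw = (
    "value1=12333,value2(fuga,hoge),value3=fuga\n"
    "value1=111,value2(hoge),value3=fugahoge\n"
)

# The data contains no tab characters, so sep='\t' never splits a line:
# each line becomes one string in column 0.
df = pd.read_csv(io.StringIO(raw), header=None, sep="\t")

print(df.shape)
print(df[0].tolist())
```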

Use Series.str.extract() to split each line with a regular expression.

In[5]: df[0].str.extract(r'value1=(?P<val1>\d+),value2\((?P<val2>[\w,]+)\),value3=(?P<val3>.*)')
Out[5]: 
    val1       val2      val3
0  12333  fuga,hoge      fuga
1    111       hoge  fugahoge

The column names are set by the "?P<name>" part of each group, and the values actually extracted are whatever matches inside the "()". If no name is specified, the columns are numbered in order from 0.
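
To compare the two naming behaviours side by side, the sketch below runs the same pattern with and without named groups. The Series `s` and the variables `named` and `unnamed` are illustrative names, not part of the original session.

```python
import pandas as pd

# The two sample lines as a Series (equivalent to df[0] in the session above).
s = pd.Series([
    "value1=12333,value2(fuga,hoge),value3=fuga",
    "value1=111,value2(hoge),value3=fugahoge",
])

# Named groups: each (?P<name>...) becomes a column called `name`.
named = s.str.extract(
    r"value1=(?P<val1>\d+),value2\((?P<val2>[\w,]+)\),value3=(?P<val3>.*)"
)

# Unnamed groups: the columns are numbered 0, 1, 2 in group order.
unnamed = s.str.extract(r"value1=(\d+),value2\(([\w,]+)\),value3=(.*)")

print(list(named.columns))
print(list(unnamed.columns))
```

Named groups are usually the more convenient choice here, since the result can be used without a separate renaming step.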

Moreover, the extracted values are returned as strings (object dtype), so convert them to int or another type as appropriate.
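
One possible way to do the conversion is astype, sketched below under the assumption that every row matched the pattern (a row that does not match yields NaN, in which case pd.to_numeric with errors="coerce" may be a better fit). The variable name `extracted` is made up for this example.

```python
import pandas as pd

s = pd.Series([
    "value1=12333,value2(fuga,hoge),value3=fuga",
    "value1=111,value2(hoge),value3=fugahoge",
])

extracted = s.str.extract(
    r"value1=(?P<val1>\d+),value2\((?P<val2>[\w,]+)\),value3=(?P<val3>.*)"
)

# val1 holds digit strings at this point; convert it to integers.
extracted["val1"] = extracted["val1"].astype(int)

print(extracted.dtypes)
print(extracted["val1"].sum())
```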

Reference

http://sinhrks.hatenablog.com/entry/2014/12/06/233032
