Get rid of dirty data with Python and regular expressions

Preface

I usually do pre-processing of data. Most of the data works well if you can read it in pandas, but you may come across unreadable data, so keep a record of your struggle with them.

What kind of data

--The delimiter is a comma "," and each data is enclosed in double quotes "" ". --A line feed code is included in the data (between "and"). --Comma is included in the data. --Double quotes are included in the data.

↓ If you compare it like this

"a","b"
"1","Ho\n""
"2","Fu,Or'"

Aims

--Create a csv file that can be read by read.csv () of pandas, which is a Python library. ――Since it is the first part of the pre-processing, I want to make it a DataFrame for the subsequent processing. --Commas and quotes are treated as they are as part of the data, considering the possibility of meaning. --The line feed code in the data is only bad, so remove it.

↓ In other words, I want to make this kind of DataFrame.

a b
1 Hoke"
2 Fu,'Or'

Failure example

Try to read as it is with pandas

import pandas as pd

df = pd.read_csv('hoge.csv')
print(df)

result

a b
1 Ho\n"\n2" Fu,Or'

Successful way

script

--Python version is 3.7

import re
import pandas as pd

#Read as text
with open('hoge.csv', 'r') as f:
    text = f.read()

tmp_text = re.sub('([^"])\n([^"])', r'\1\2', text) #Line feed code in the middle of data(\n)Get rid of
tmp_text = re.sub('","', '\t', tmp_text)           #Convert delimiter to tab
tmp_text = re.sub('(^"|"$)', '', tmp_text)         #Remove the first and last quotes in the file
tmp_text = re.sub('"\n"', '\n', tmp_text)          #Remove the quotes in the middle

#Spit out to a file once
with open('data.csv', 'w') as f:
    f.write(tmp_text)

#Confirm
df = pd.read_csv('data.csv', sep='\t')
print(df)

output


   a     b
0 1"
1 2,Or'

I read.

File

The contents of the file look like this

data.csv


a	b
1"
2 Fu,Or'

Concerns

If the data contains \ t, this code will of course not work. The same applies when the line feed code is different. You need to check which characters are included and how to replace them. → Creating a method for checking and parameterizing delimiters?

I'll do it when I feel like it. If you process it on jupyter, you can easily check the data and change the source, so it may not be necessary ...?

Conclusion

If you read it with pandas, it's already here. It should be usable not only for pandas but also before it is eaten by BI tools.

** Regular expressions are convenient! !! ** **

Recommended Posts

Get rid of dirty data with Python and regular expressions
[Python] Get rid of dating with regular expressions
Get Youtube data with python
About Python and regular expressions
Get a large amount of Starbucks Twitter data with python and try data analysis Part 1
[Python] Get economic data with DataReader
Sample of HTTP GET and JSON parsing with python of pepper
Coexistence of Python2 and 3 with CircleCI (1.0)
Handling regular expressions with PHP / Python
Get additional data to LDAP with python (Writer and Reader)
Get data from MySQL on a VPS with Python 3 and SQLAlchemy
Overlapping regular expressions in Python and Java
Get additional data in LDAP with python
Data pipeline construction with Python and Luigi
[Note] Get data from PostgreSQL with Python
Replace non-ASCII with regular expressions in Python
Recommendation of Altair! Data visualization with Python
Python: Simplified morphological analysis with regular expressions
Get rid of DICOM images in Python
[Python] Regular Expressions Regular Expressions
I have 0 years of programming experience and challenge data processing with python
Get the number of articles accessed and likes with Qiita API + Python
Get images of OpenStreetMap and Geographical Survey Institute maps with Python + staticmap
Get and estimate the shape of the head using Dlib and OpenCV with python
Distinguish between numbers and letters with regular expressions
I tried to get CloudWatch data with Python
Get git branch name and tag name with python
Implementation of TRIE tree with Python and LOUDS
Get CPU information of Raspberry Pi with Python
I tried to get and analyze the statistical data of the new corona with Python: Data of Johns Hopkins University
Get Gmail subject and body with Python and Gmail API
Continuation of multi-platform development with Electron and Python
Example of reading and writing CSV with Python
I compared the speed of regular expressions in Ruby, Python, and Perl (2013 version)
Get the stock price of a Japanese company with Python and make a graph
[Introduction to Python] How to get the index of data with a for statement
Get date with python
Data analysis with Python
Get media timeline images and videos with Python + Tweepy
Full-width and half-width processing of CSV data in Python
List of Python libraries for data scientists and data engineers
Get financial data with python (then a little tinkering)
Easy partial download of mp4 with python and youtube-dl!
Visualize the range of interpolation and extrapolation with python
Challenge principal component analysis of text data with Python
Get comments on youtube Live with [python] and [pytchat]!
Get comments and subscribers with the YouTube Data API
Get mail from Gmail and label it with Python3
Investigate Java and python data exchange with Apache Arrow
Get the operation status of JR West with Python
[Python] Get user information and article information with Qiita API
Get an abstract understanding of Python modules and packages
[Basics of data science] Collecting data from RSS with python
Get data from database via ODBC with Python (Access)
Extract the band information of raster data with python
Comparison of CoffeeScript with JavaScript, Python and Ruby grammar
Version control of Node, Ruby and Python with anyenv
[CGI] Run the Python program on the server with Vue.js + axios and get the output data
Get country code with python
Sample data created with python
Programming with Python and Tkinter