Data cleaning using Python


Actual code

0. Loading the libraries

This time we will use `pandas` and `re` (the module for regular expressions).

import pandas as pd
import re

1. Read data

df = pd.read_csv("filename.csv")

2. Delete unnecessary elements (blanks, symbols, numbers, words)

Delete unnecessary elements for the entire column

df['Column name'] = df['Column name'].str.replace(r'\d', '', regex=True) # Delete digits (regex=True is required in recent pandas versions)
df['Column name'] = df['Column name'].str.replace('-', '') # Delete hyphens
df['Column name'] = df['Column name'].str.replace('word', '') # Delete a specific word
df['Column name'] = df['Column name'].str.strip() # Remove leading and trailing whitespace
# These can also be chained into a single statement
df['Column name'] = df['Column name'].str.replace(r'\d', '', regex=True).str.replace('-', '').str.replace('word', '').str.strip()
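
As a minimal sketch of the steps above, the cleaning can be tried on a small made-up DataFrame. The column name `code` and the sample values are hypothetical. `regex=True` is passed explicitly for the digit pattern because recent pandas versions treat the pattern as a literal string by default.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({"code": [" 12-word apple ", "banana-7", "  word cherry3 "]})

cleaned = (
    df["code"]
    .str.replace(r"\d", "", regex=True)     # delete digits
    .str.replace("-", "", regex=False)      # delete hyphens
    .str.replace("word", "", regex=False)   # delete the literal string "word"
    .str.strip()                            # trim leading/trailing whitespace
)
print(cleaned.tolist())
```

Calling `strip()` last matters here: deleting a word from the middle or the start of a string can leave extra spaces at the edges.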

3. Cut out words

What we want to do

Suppose each element of the column `name` consists of multiple words. Example:

df['name'][0] = "I have a pen."
df['name'][1] = "She has a pen."

From this, we extract the first word of each element and store it in a new column called `subject`. Example:

df['subject'][0] = "I"
df['subject'][1] = "She"


temp = df['name'].str.split() # Split each element into a list of words
subject = [] # Create an empty list to store the extracted words
for item in temp:
    subject.append(item[0]) # Store the first word of each row in the list
df['subject'] = subject # Add the list to the original DataFrame as the column subject
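
The same result can also be obtained without an explicit loop. The following is a sketch on hypothetical sample data that uses the `.str` accessor to index into each list of words. Unlike the loop, it leaves a missing value as missing instead of raising an error.

```python
import pandas as pd

# Hypothetical sample data; the last row is deliberately missing
df = pd.DataFrame({"name": ["I have a pen.", "She has a pen.", None]})

# Split on whitespace, then take element 0 of each resulting list
df["subject"] = df["name"].str.split().str[0]
print(df)
```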

4. Write to a specific data element

You can access a specific element by using `.at[]`.

df.at['Row name', 'Column name'] = "This is a test"
df.at[row_number, 'Column name'] = "This is a test" # Works when the row labels are integers, as in the default index
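
A short sketch of writing to a single cell, using hypothetical DataFrames. `.at` selects by row label and column label; `.iat`, shown at the end for comparison, selects by integer position instead.

```python
import pandas as pd

# Hypothetical data with the default integer index (0, 1, ...)
df = pd.DataFrame({"name": ["I have a pen.", "She has a pen."]})
df.at[1, "name"] = "This is a test"          # row label 1, column "name"

# The same idea with string row labels
df2 = pd.DataFrame({"name": ["a", "b"]}, index=["row1", "row2"])
df2.at["row2", "name"] = "This is a test"    # row label "row2", column "name"

# .iat is the position-based counterpart: (row position, column position)
df2.iat[0, 0] = "first cell"

print(df)
print(df2)
```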

5. CSV output
Finally, write the edited DataFrame to a CSV file. Adding `encoding='utf_8_sig'` prevents garbled characters when the file is opened in applications such as Excel.

df.to_csv("filename_v2.csv", encoding='utf_8_sig')
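
To see what `utf_8_sig` changes, the sketch below writes a small hypothetical DataFrame to a temporary file and inspects its first bytes. The encoding prepends a byte order mark (BOM), which spreadsheet applications can use to detect UTF-8. `index=False` is an optional addition here, not part of the call above; it omits the row index from the output.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data containing non-ASCII text
df = pd.DataFrame({"name": ["ペン", "pen"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "filename_v2.csv")
    df.to_csv(path, encoding="utf_8_sig", index=False)

    # Read the first three raw bytes of the file
    with open(path, "rb") as f:
        head = f.read(3)

    # Read the file back with the same encoding so the BOM is stripped
    restored = pd.read_csv(path, encoding="utf_8_sig")

print(head == b"\xef\xbb\xbf")   # checks whether the file starts with the UTF-8 BOM
print(restored["name"].tolist())
```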

