Data cleaning using Python


Actual code

0. Loading the libraries

This time we will use `pandas` and `re` (the module for regular expressions).

import pandas as pd
import re

1. Read data

df = pd.read_csv("filename.csv")

2. Delete unnecessary elements (blanks, symbols, numbers, words)

Delete unnecessary elements from every value in a column. Each call below applies to the whole column at once.

df['Column name'] = df['Column name'].str.replace(r'\d', '', regex=True) # Delete digits (regex=True is required in pandas 2.0 and later)
df['Column name'] = df['Column name'].str.replace('-', '') # Delete hyphens
df['Column name'] = df['Column name'].str.replace('word', '') # Delete a specific word
df['Column name'] = df['Column name'].str.strip() # Remove leading and trailing whitespace
# These can also be chained into a single statement
df['Column name'] = df['Column name'].str.replace(r'\d', '', regex=True).str.replace('-', '').str.replace('word', '').str.strip()
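
As a minimal sketch of the chained cleanup on concrete values, the example below builds a small DataFrame and applies the same four steps. The column name `item` and the sample strings are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; the column name "item" is only for illustration.
df = pd.DataFrame({"item": [" 12-apple word ", "word-34 banana"]})

df["item"] = (
    df["item"]
    .str.replace(r"\d", "", regex=True)     # delete digits (regular expression)
    .str.replace("-", "", regex=False)      # delete the literal hyphen
    .str.replace("word", "", regex=False)   # delete a literal word
    .str.strip()                            # trim leading and trailing whitespace
)

print(df["item"].tolist())  # ['apple', 'banana']
```

Passing `regex` explicitly in every call keeps the behavior the same across pandas versions, since the default changed from `True` to `False` in pandas 2.0.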

3. Cut out words

Goal

Suppose the column `name` holds elements that each consist of multiple words. Example:

df['name'][0] = "I have a pen."
df['name'][1] = "She has a pen."

From these, extract the first word of each element and store the results in a new column called `subject`. Example:

df['subject'][0] = "I"
df['subject'][1] = "She"


temp = df['name'].str.split() # Split each element into a list of words
subject = [] # Create an empty list to store the extracted words
for item in temp:
    subject.append(item[0]) # Store the first word of each row in the list
df['subject'] = subject # Add the list to the original dataframe as the column subject
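
The same result can probably be reached without an explicit loop. The sketch below is an alternative, not the approach used above: it relies on the `.str` accessor to index into the lists produced by `.str.split()`. The `None` row is an assumption added to show how missing values behave.

```python
import pandas as pd

# Sample rows from the example above, plus one hypothetical missing value.
df = pd.DataFrame({"name": ["I have a pen.", "She has a pen.", None]})

# .str.split() yields a list of words per row; .str[0] takes the first word of each list.
# A missing value stays missing instead of raising an error inside a loop.
df["subject"] = df["name"].str.split().str[0]

print(df["subject"].tolist()[:2])  # ['I', 'She']
```

With the loop version, a missing value in `name` would fail at `item[0]`, so the vectorized form may be more forgiving on uncleaned data.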

4. Write to a specific data element

You can access a specific element by using `.at[]`.

df.at['Row name', 'Column name'] = "This is a test"
df.at[row number, 'Column name'] = "This is a test"
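
A minimal sketch of writing a single element with `.at[]`, once on a DataFrame with the default integer index and once on a DataFrame with row labels. The labels `row_a` and `row_b` are hypothetical.

```python
import pandas as pd

# Default index: the row labels are the integers 0, 1, ...
df = pd.DataFrame({"name": ["I have a pen.", "She has a pen."]})
df.at[0, "name"] = "This is a test"  # row label 0, column "name"

# Explicit row labels (hypothetical names for illustration).
labeled = pd.DataFrame(
    {"name": ["I have a pen.", "She has a pen."]},
    index=["row_a", "row_b"],
)
labeled.at["row_b", "name"] = "This is another test"  # row label "row_b", column "name"

print(df)
print(labeled)
```

Unlike chained indexing such as `df['name'][0] = ...`, which may assign to a temporary copy, `.at[]` writes directly into the DataFrame.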

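5. CSV output

Finally, output the edited data frame to CSV. Adding `encoding='utf_8_sig'` writes a UTF-8 byte order mark at the start of the file, which helps applications such as Excel detect the encoding and so prevents garbled characters.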

df.to_csv("filename_v2.csv", encoding='utf_8_sig')
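
To check what the encoding option does, the sketch below writes a small DataFrame to a temporary file, inspects the first three bytes, and reads the file back. The temporary path, the sample values, and the `index=False` argument are assumptions for this demo; the call above keeps the index column.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data containing non-ASCII text.
df = pd.DataFrame({"name": ["ペン", "pen"]})

# Temporary location used only for this demo.
path = os.path.join(tempfile.mkdtemp(), "filename_v2.csv")
df.to_csv(path, encoding="utf_8_sig", index=False)  # index=False omits the row index column

# The file should start with the UTF-8 byte order mark.
with open(path, "rb") as f:
    has_bom = f.read(3) == b"\xef\xbb\xbf"

# Reading with the same encoding strips the mark again.
restored = pd.read_csv(path, encoding="utf_8_sig")

print(has_bom)
print(restored["name"].tolist())
```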
