[PYTHON] Data manipulation with Pandas!

Introduction

I have used pandas several times for in-house development tasks, but not often enough to remember it, so I frequently find myself wondering how to do even basic things. Because this work tends to involve several GB of data, I will also add techniques for handling large files from time to time. For that reason, this article is organized as "what you want to do ⇒ how to do it".

Note that this is not about heavy-duty machine learning, so that area is not covered here.

Loading CSV

Nothing can be done until the CSV file has been read, so we start with how to read one. The basic form is as follows.


import pandas as pd
df = pd.read_csv("file name")

Reading large files

However, once a file reaches the GB range, it is quite likely that it will not fit in memory. In that case, add the chunksize option and load the file in pieces.

If you specify chunksize, read_csv returns a TextFileReader instance instead of a DataFrame. Iterating over the TextFileReader in a loop yields one DataFrame per chunk. The following example reads and prints 50 rows at a time.


data = pd.read_csv("test.csv", chunksize=50)
for i in data:
  print(i)
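
Printing each chunk is rarely the goal in practice. As a minimal sketch, assuming you only need a running aggregate, the loop below accumulates a sum and a row count across chunks so that the whole file never has to be in memory at once. The in-memory CSV and the column names id and value are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a large file on disk.
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(120))

total = 0
row_count = 0
# With chunksize=50, each iteration yields a DataFrame of at most 50 rows.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=50):
    total += chunk["value"].sum()
    row_count += len(chunk)

# Mean over the whole file, computed without loading it all at once.
mean_value = total / row_count
print(row_count, mean_value)
```
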

Options

| Option name | Meaning | Example |
|---|---|---|
| encoding | Character encoding of the file | encoding='UTF-8' |
| skiprows | Number of rows to skip at the start | skiprows=2 |
| chunksize | Read the specified number of rows at a time | chunksize=50 |
| usecols | Read only the specified columns | usecols=[1, 3] |
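
These options can be combined in one call. The following is a small sketch, assuming a file with two preamble lines before the header. The data and the column names are invented for the example, and a BytesIO buffer stands in for a real file.

```python
import io

import pandas as pd

# Hypothetical file contents: two preamble lines, then a header and two rows.
raw = (
    "exported by a tool\n"
    "second preamble line\n"
    "name,age,city,price\n"
    "Alice,30,Tokyo,100\n"
    "Bob,25,Osaka,200\n"
).encode("utf-8")

df_opts = pd.read_csv(
    io.BytesIO(raw),
    encoding="UTF-8",  # character encoding of the bytes
    skiprows=2,        # skip the two preamble lines
    usecols=[1, 3],    # keep only the 2nd and 4th columns (age, price)
)
print(df_opts)
```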

[Other options](https://own-search-and-study.xyz/2015/09/03/pandas%E3%81%AEread_csv%E3%81%AE%E5%85%A8%E5%BC%95%E6%95%B0%E3%82%92%E4%BD%BF%E3%81%84%E3%81%93%E3%81%AA%E3%81%99/)

Data concatenation

How to read and concatenate separate files


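import numpy as np
import pandas as pd

# Read file 1 (file1 and file2 are paths to CSV files)
Data1 = pd.read_csv(file1, dtype=np.float32)
# Read file 2
Data2 = pd.read_csv(file2, dtype=np.float32)
# Concatenate the two DataFrames and renumber the index from 0
rawData = pd.concat([Data1, Data2], ignore_index=True)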
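
To see what ignore_index=True changes, here is a small sketch with two tiny DataFrames built by hand instead of read from files. The names a and b and their contents are made up.

```python
import pandas as pd

a = pd.DataFrame({"x": [1.0, 2.0]})
b = pd.DataFrame({"x": [3.0]})

# Without ignore_index, each frame keeps its own row labels, so 0 appears twice.
kept = pd.concat([a, b])
# With ignore_index=True, the result is renumbered 0, 1, 2.
renumbered = pd.concat([a, b], ignore_index=True)

print(kept.index.tolist())
print(renumbered.index.tolist())
```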

Get a specific column


df['A']

Extract rows for a specific section


df[1:3]

Row positions start at 0, so this slice does not include the first row. The end of the slice is exclusive, so :3 means up to position 2; df[1:3] returns the rows at positions 1 and 2.
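
The slicing rule can be checked on a small frame. This sketch uses a made-up four-row DataFrame and also shows, for contrast, that label-based slicing with loc includes the end label.

```python
import pandas as pd

df_demo = pd.DataFrame({"A": [10, 20, 30, 40]})

# Positional slice: rows at positions 1 and 2 (position 3 is excluded).
subset = df_demo[1:3]
# iloc makes the positional meaning explicit and gives the same rows.
same_rows = df_demo.iloc[1:3]
# loc slices by label, and the end label is included.
by_label = df_demo.loc[1:3]

print(subset["A"].tolist())
print(by_label["A"].tolist())
```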

Delete unnecessary columns

Specifying inplace=True modifies the original DataFrame. In the following example, the City and Price columns are removed from df.


df.drop(columns=['City', 'Price'], inplace=True)

Columns can also be specified by position, by looking up their labels through df.columns.


df.drop(columns=df.columns[[1, 2]], inplace=True)
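
If you would rather keep the original DataFrame intact, drop can be called without inplace=True, in which case it returns a new DataFrame. A minimal sketch on made-up data with Name, City and Price columns:

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"Name": ["a", "b"], "City": ["Tokyo", "Osaka"], "Price": [1, 2]}
)

# Drop by label; df_demo itself is left unchanged.
trimmed = df_demo.drop(columns=["City", "Price"])
# Drop by position: look up the labels at positions 1 and 2 first.
by_position = df_demo.drop(columns=df_demo.columns[[1, 2]])

print(list(trimmed.columns))
print(list(df_demo.columns))
```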

Delete row


df.drop(df.index[[1, 3, 5]])

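By default, the row index is a sequence of integers starting from 0, but note that it is no longer sequential after sorting and similar operations. df.index[[1, 3, 5]] picks the rows at those positions regardless of their labels. Also, without inplace=True, drop returns a new DataFrame and leaves df unchanged.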
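
The following sketch illustrates the point about sorting on a made-up score column: after sort_values the index labels are out of order, df.index[[...]] still selects by position, and reset_index(drop=True) renumbers the result.

```python
import pandas as pd

df_demo = pd.DataFrame({"score": [30, 10, 20, 40]})

# After sorting, the index labels are no longer 0, 1, 2, 3.
sorted_df = df_demo.sort_values("score")
labels_after_sort = sorted_df.index.tolist()

# Drop the rows at positions 0 and 2 of the sorted frame.
remaining = sorted_df.drop(sorted_df.index[[0, 2]])
# Renumber the index from 0 again.
renumbered = remaining.reset_index(drop=True)

print(labels_after_sort)
print(renumbered)
```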

Substitution of a specific column


df.replace({"column name": {"original value": "value after replacement"}})
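
As a concrete sketch of the nested-dictionary form, the example below replaces a value in one column only. The columns city and note and the replacement value "TYO" are invented for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"city": ["Tokyo", "Osaka"], "note": ["Tokyo", "x"]}
)

# Only the "city" column is searched; "note" keeps its "Tokyo" value.
replaced = df_demo.replace({"city": {"Tokyo": "TYO"}})

print(replaced)
```

replace returns a new DataFrame here, so df_demo itself is unchanged.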

Splitting data by the value of a column

Use this when the processing should differ depending on a column's value. groupby returns a GroupBy object, which can be iterated over directly in a for statement.


for value, group_df in df.groupby('Column name'):
    pass  # processing for each group goes here

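In addition, each group keeps the row labels of the original DataFrame, so its index is not sequential. Use reset_index() to renumber the index within each DataFrame. Note that reset_index returns a new DataFrame, so the result has to be assigned.

for city, sdf in df.groupby('city'):
    sdf = sdf.reset_index(drop=True)
    if city == 'Tokyo':
        flags = 1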
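
Here is a runnable sketch of the same loop on made-up data, collecting each renumbered group into a dictionary so the effect of reset_index can be inspected afterwards. The column names city and price and the dictionary name groups are assumptions for the example.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"city": ["Tokyo", "Osaka", "Tokyo"], "price": [100, 200, 300]}
)

groups = {}
for city, sdf in df_demo.groupby("city"):
    # The Tokyo group arrives with labels 0 and 2; renumber to 0 and 1.
    groups[city] = sdf.reset_index(drop=True)

print(groups["Tokyo"])
```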

Count the number of records

df.shape is a (rows, columns) tuple, so df.shape[0] is the number of rows. Without [0], you get both the row count and the column count.


df.shape[0]
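
A quick check on a made-up frame with three rows and two columns. len(df) is shown as an equivalent way to get the row count.

```python
import pandas as pd

df_demo = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

shape = df_demo.shape       # (number of rows, number of columns)
n_rows = df_demo.shape[0]   # number of rows only
n_rows_len = len(df_demo)   # same value via len()

print(shape, n_rows)
```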

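Average

With axis=0, the mean is computed down the rows, giving one value per column.


means = df.mean(axis=0)

Standard deviation

In the same way, axis=0 gives one standard deviation per column.


std = df.std(axis=0)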
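
The sketch below applies both calls to a small made-up numeric frame. One detail worth knowing is that DataFrame.std uses ddof=1 by default, that is, the sample standard deviation; pass ddof=0 if you want the population value.

```python
import pandas as pd

df_demo = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

means = df_demo.mean(axis=0)           # one mean per column
stds = df_demo.std(axis=0)             # sample standard deviation (ddof=1)
stds_pop = df_demo.std(axis=0, ddof=0) # population standard deviation

print(means)
print(stds)
```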

Save csv

To save as csv after processing the dataframe:


df.to_csv("file name")

| Option name | Meaning | Example |
|---|---|---|
| columns | Export only specific columns | columns=['age'] |
| header | Whether to write the header row | header=False |
| index | Whether to write the index | index=False |
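
To try these options without touching the disk, to_csv can be called without a file name, in which case it returns the CSV text as a string. A minimal sketch with made-up name and age columns:

```python
import pandas as pd

df_demo = pd.DataFrame({"name": ["Alice", "Bob"], "age": [30, 25]})

# No path given, so the CSV is returned as a string instead of being written.
csv_text = df_demo.to_csv(columns=["age"], header=False, index=False)
lines = csv_text.splitlines()

print(lines)
```

Only the age values are written, with no header row and no index column.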

For more details, see the article "Export / add csv file with pandas (to_csv)".

