Basic summary of data manipulation with Python Pandas - First half: Data creation & manipulation

Introduction

This article summarizes the basics of manipulating data with Pandas, which is essential for data analysis in Python.

It covers the important syntax that is easy to forget, along with a few tips.

Recommended for people who:
→ want to try Pandas for the first time
→ are R users who want to try Python
→ can't remember the Pandas syntax and wish there were a list somewhere
→ wonder how much data handling can be done with Python in the first place

See also ◆ Data manipulation with Pandas: Use Pandas_ply http://qiita.com/hik0107/items/3dd260d9939a5e61c4f6

Let's create data

First, import Pandas and create some sample data as a data frame.

data_creation.py


import pandas as pd
 
df_sample =\
pd.DataFrame([["day1","day2","day1","day2","day1","day2"],
              ["A","B","A","B","C","C"],
              [100,150,200,150,100,50],
              [120,160,100,180,110,80]] ).T  # Each inner list holds one column's values; .T turns them into columns
 
df_sample.columns = ["day_no","class","score1","score2"]  # Assign column names
df_sample.index   = [11,12,13,14,15,16]  # Assign index labels
df_sample = df_sample.astype({"score1": int, "score2": int})  # .T leaves every column as object dtype; restore the numeric columns
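Transposing a frame whose rows mix strings and numbers leaves the numeric columns with dtype object. As a minimal sketch of an alternative (the names `df_typed` and `df_transposed` are made up for this illustration), building the frame from a dict of columns lets Pandas infer a dtype per column:

```python
import pandas as pd

# Same values as the sample data, but built column by column from a dict
df_typed = pd.DataFrame(
    {
        "day_no": ["day1", "day2", "day1", "day2", "day1", "day2"],
        "class": ["A", "B", "A", "B", "C", "C"],
        "score1": [100, 150, 200, 150, 100, 50],
        "score2": [120, 160, 100, 180, 110, 80],
    },
    index=[11, 12, 13, 14, 15, 16],
)

# Row-wise construction followed by .T: the numeric column comes out as object
df_transposed = pd.DataFrame([["day1", "day2"], [100, 150]]).T
df_transposed.columns = ["day_no", "score1"]

print(df_typed.dtypes)       # score1 and score2 are integer columns
print(df_transposed.dtypes)  # score1 is an object column here
```

Numeric dtypes matter later on: methods such as describe() only report means and quartiles for numeric columns.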

◆ Column / Index Access

Access specific columns and index labels

col_index_access.py


 
df_sample.columns   # Get the column names
df_sample.index     # Get the index labels
 
 
df_sample.columns = ["day_no","class","score1","score2"]   # Overwrite all column names at once (put new names here to rename)
df_sample.index   = [11,12,13,14,15,16]   # Overwrite the index labels
 
 
# Use the rename method
df_sample.rename(columns={'score1': 'point1'})  # Pass the old-to-new mapping as a dict; returns a new DataFrame, df_sample itself is unchanged
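One point that is easy to miss: rename does not modify the data frame in place by default. A small sketch with a throwaway frame (`df` and `renamed` are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({"score1": [100, 150], "score2": [120, 160]})

# rename returns a new DataFrame; assign the result to keep it
renamed = df.rename(columns={"score1": "point1"})

print(list(df.columns))       # the original still has score1
print(list(renamed.columns))  # the returned frame has point1
```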

◆ Check the data structure

Take a look at the data overview

datacheck.py


# Check the number of rows
len(df_sample)
 
# Check the dimensions
df_sample.shape  # Returns (number of rows, number of columns)
 
# List the column information
df_sample.info()  # Column names, non-null counts, and dtypes
 
# Check the basic statistics for each column
# Equivalent to summary() in R
df_sample.describe()  # Count, mean, standard deviation, min/max, quartiles (for numeric columns)
 
# head / tail
df_sample.head(10)  # Check the first 10 rows
df_sample.tail(10)  # Check the last 10 rows
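To make the output of these checks concrete, here is a short sketch on a small frame (the frame `df` is rebuilt here so the snippet runs on its own):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "class": ["A", "B", "A", "B", "C", "C"],
        "score1": [100, 150, 200, 150, 100, 50],
    }
)

summary = df.describe()  # by default only numeric columns are summarized

print(len(df))                 # number of rows
print(df.shape)                # (rows, columns)
print(summary.index.tolist())  # names of the statistics that describe() reports
print(summary.loc["mean", "score1"])
```

The result of describe() is itself a data frame, so individual statistics can be picked out with loc as in the last line.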

Let's play with the data

Select only specific columns from the data

datacheck.py


# Selection using the built-in __getitem__ (square brackets)
df_sample["day_no"]  # Specify the column name
df_sample[["day_no","score1"]]  # Pass a list of names to select multiple columns
 
# Column selection using loc
# Syntax: loc[row labels, column labels]
# You can subset rows and columns at the same time
df_sample.loc[:,"day_no"]  # ":" in the row position means "select all rows"
df_sample.loc[:,["day_no","score1"]]  # Pass a list of names to select multiple columns
         
# Column selection using iloc
# Syntax: iloc[row positions, column positions]
df_sample.iloc[:,0]  # Select by position
df_sample.iloc[:,0:2]  # A slice selects consecutive positions; a list of positions also works
 
 
# Mixing positions and names
# The old .ix indexer accepted both, but it was deprecated in pandas 0.20 and removed in pandas 1.0. Use loc / iloc instead
df_sample.loc[:,"day_no"]  # Selecting a single column returns a pandas.Series
df_sample.loc[:,["day_no","score1"]]  # Selecting multiple columns returns a pandas.DataFrame
 
df_sample.iloc[0:4, df_sample.columns.get_loc("score1")]  # Rows by position, column by name (converted to its position)
 
 
series_bool = [True,False,True,False]
df_sample.loc[:,series_bool]  # A Boolean list also works (one entry per column)
 
 
# Select by partial match of the column name
# Comparable to select(contains()) in R's dplyr
# df_sample.filter(like="score") does this in one call; the Boolean-mask route is shown here
 
score_select = df_sample.columns.str.contains("score")  # Boolean array: does each column name contain "score"?
df_sample.loc[:,score_select]  # Column selection using the Boolean array
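The difference between label-based and position-based selection is easiest to see with slices. The following sketch uses a small stand-in frame (`df` is an illustrative name) and also compares the two ways of selecting columns by partial name match:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "day_no": ["day1", "day2", "day1", "day2"],
        "score1": [100, 150, 200, 150],
        "score2": [120, 160, 100, 180],
    },
    index=[11, 12, 13, 14],
)

# loc is label-based: the slice 11:13 includes the end label 13
by_label = df.loc[11:13, "score1"]

# iloc is position-based: the slice 0:3 excludes position 3
by_position = df.iloc[0:3, df.columns.get_loc("score1")]

# Partial match on column names: Boolean mask vs. DataFrame.filter
by_mask = df.loc[:, df.columns.str.contains("score")]
by_filter = df.filter(like="score")

print(by_label.equals(by_position))
print(by_mask.equals(by_filter))
```

Both comparisons select the same data, so which form to use is mostly a matter of readability.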

◆ Subsetting

Select part of the data based on conditions

subsetting.py


## Standard bracket notation
## data_frame[Boolean array]
df_sample[df_sample.day_no == "day1"]  # Select only the rows whose day_no column is "day1"
series_bool = [True,False,True,False,True,False]
df_sample[series_bool]  # The condition does not have to come from the data frame's own columns
 
 
## Notation using the Pandas query method
df_sample.query("day_no == 'day1'")  
     # Neater, because the data frame name is not written twice
     # Note that the condition must be passed as a string
 
df_sample.query("day_no == 'day1'|day_no == 'day2'")
     # Combine multiple conditions with "|" (or) or "&" (and)
 
select_condition = "day1"
# df_sample.query("day_no == select_condition")  # ☓ doesn't work
        # Inside the string, a bare name is looked up as a column, so the Python variable is not found and an error is raised
 
df_sample.query("day_no == @select_condition")  # ◯ it works
        # Prefix the name with "@" to refer to a Python variable
 
 
## Subsetting using the index
df_sample.query("index == 11 ")  # "index" refers to the data frame's index
df_sample.query("index  in [11,12] ")  # "in" can be used for an or condition
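As a quick check that the two notations are interchangeable, the sketch below runs the same filters both ways on a small stand-in frame (`df` and the result names are illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "day_no": ["day1", "day2", "day1", "day2"],
        "score1": [100, 150, 200, 150],
    },
    index=[11, 12, 13, 14],
)

select_condition = "day1"

# Boolean mask and query with an @-prefixed variable select the same rows
by_mask = df[df["day_no"] == select_condition]
by_query = df.query("day_no == @select_condition")

# Filtering on the index: query's "in" vs. Index.isin
by_query_index = df.query("index in [11, 12]")
by_isin = df[df.index.isin([11, 12])]

print(by_mask.equals(by_query))
print(by_query_index.equals(by_isin))
```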

◆ Sorting

Sort the data.

sorting.py


# DataFrame.sort was removed from pandas; use sort_values instead
df_sample.sort_values("score1")  # Sort by the score1 value in ascending order
df_sample.sort_values(["score1","score2"])  # Sort by score1, then score2, in ascending order
 
 
df_sample.sort_values("score1",ascending=False)  # Sort by the score1 value in descending order
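When sorting by several columns, the direction can be set per column by passing a list to ascending. A minimal sketch on a stand-in frame (`df` and `ordered` are illustrative names):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "score1": [100, 150, 200, 150, 100, 50],
        "score2": [120, 160, 100, 180, 110, 80],
    },
    index=[11, 12, 13, 14, 15, 16],
)

# score1 descending, ties broken by score2 ascending
ordered = df.sort_values(["score1", "score2"], ascending=[False, True])

print(ordered)
```

sort_values returns a new data frame, so the original row order of `df` is kept unless the result is assigned back.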

◆ pandas.concat

Add records and columns by combining data.

concat.py


 
# Add rows
# Create the data to add, then combine the two data frames.
# Suppose we want to add a record with index 17 to df_sample.
 
df_addition_row =\
    pd.DataFrame([["day1","A",100,180]])  # Create a DF with the same column structure as df_sample
df_addition_row.columns =["day_no","class","score1","score2"]  # Give it the same column names
df_addition_row.index   =[17]  # Assign the index label
 
pd.concat([df_sample,df_addition_row],axis=0)  # Vertical concatenation, like rbind in R
        # First argument: the DFs to combine, given as a list [].
        # Second argument: axis=0 specifies vertical concatenation.
 
 
# Add columns
# Suppose we want to add a score3 column alongside score1 and score2.
# Create the data to add, then combine the two data frames.
 
df_addition_col =\
    pd.DataFrame([[120,160,100,180,110,80]]).T  # Create a DF with the same number of rows as df_sample
 
df_addition_col.columns =["score3"]  # The column name is kept as is after concatenation
df_addition_col.index   = [11,12,13,14,15,16] 
         # Caution: pandas.concat aligns rows by index, so the result is not what you expect unless both DFs share the same index (see below)
 
 
pd.concat([df_sample,df_addition_col],axis=1)  # axis=1 specifies horizontal concatenation.
 
 
# About the index
# If the index of the new data differs from that of the data it is joined to, rows are aligned by index label:
# labels that exist in only one DF get NaN in the other DF's columns.
# Try the following
 
df_addition_col =\
    pd.DataFrame([[120,160,100,180,110,80]]).T
 
df_addition_col.columns =["score3"]
df_addition_col.index   = [11,12,13,21,22,23]   # Some labels match the original data, some do not
 
 
pd.concat([df_sample,df_addition_col],axis=1)  # The result has 9 rows, with NaN where a label is missing on one side
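The index alignment can be checked on a smaller pair of frames. This sketch (with illustrative names `df_left` and `df_right`) also shows the join="inner" option, which keeps only the labels present in both frames:

```python
import pandas as pd

df_left = pd.DataFrame({"score1": [100, 150, 200, 150]}, index=[11, 12, 13, 14])
df_right = pd.DataFrame({"score3": [120, 160, 100, 180]}, index=[11, 12, 21, 22])

# Default (outer) alignment: union of the index labels, NaN where a side is missing
outer = pd.concat([df_left, df_right], axis=1)

# Inner alignment: only labels present in both frames
inner = pd.concat([df_left, df_right], axis=1, join="inner")

print(outer)
print(inner)
```

If the rows should be matched purely by position rather than by label, one option is to reset or reassign the index on one side before concatenating.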

◆ Joining

Combine two data sets based on a key.

join.py


## Work in progress
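The original listing for this section is still empty. As a hedged sketch of what a key-based join can look like, the example below uses pd.merge; the lookup table `df_class_info` and its `teacher` column are hypothetical and exist only for this illustration:

```python
import pandas as pd

df_scores = pd.DataFrame(
    {
        "class": ["A", "B", "C", "A"],
        "score1": [100, 150, 100, 200],
    }
)

# Hypothetical lookup table keyed by class (not part of the original article)
df_class_info = pd.DataFrame(
    {
        "class": ["A", "B"],
        "teacher": ["Sato", "Suzuki"],
    }
)

# Left join: keep every row of df_scores, NaN where the key has no match
left_joined = pd.merge(df_scores, df_class_info, on="class", how="left")

# Inner join: keep only rows whose key exists in both frames
inner_joined = pd.merge(df_scores, df_class_info, on="class", how="inner")

print(left_joined)
print(inner_joined)
```

Unlike pandas.concat, which aligns on the index, merge matches rows on the values of the key column given in `on`.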

Continued in the second half

◆ Basic summary of data manipulation in Python Pandas-Second half: Data aggregation http://qiita.com/hik0107/items/0ae69131e5317b62c3b7
