[PYTHON] Pandas basics

What is Pandas

Pandas is a Python library for data analysis. Exploratory analysis of the data is an important first step when building AI models, and Pandas (especially together with Jupyter Notebook) makes that analysis convenient. Pandas objects are also commonly used to prepare input for artificial intelligence frameworks, so it is worth knowing how to use it when studying artificial intelligence.

Here's a summary of what you can do with Pandas.

Install Pandas

With pip, you can easily install Pandas with the following command.

pip install pandas

Import Pandas

To use Pandas, import it as shown below. By convention it is usually abbreviated as "pd".

import pandas as pd

The code that follows assumes this import has been executed. NumPy is also used frequently, so the examples assume it has been imported under the abbreviation "np" (import numpy as np).

Series type and DataFrame type

Pandas provides two types for storing the data to be analyzed: the Series type and the DataFrame type.

Series type

If a NumPy array is like a Python list, a Series is like a Python dictionary (dict): each value carries a label, much like a dictionary key, and many other operations are available. A Series also corresponds to the data of one column or one record (one row) of the DataFrame type introduced below.

Creating a Series type

You can create a Series type mainly from list type and dictionary type.

Created from dictionary type (dict type)

You can create a Series from a dictionary (dict) with pd.Series(). The dictionary keys become the Series labels, and the dictionary values become the Series data.

Science = {"akiyama": 100, "satou": 75, "tanaka": 120}

# SR = akiyama    100
#      satou       75
#      tanaka     120
#      dtype:   int64
SR = pd.Series(Science)

Created from NumPy array and list type

You can use pd.Series() to create a Series from a list or a NumPy array. By default the labels are sequential integers starting from 0, but you can specify the labels separately with the "index" argument.

record = [100, 75, 120]
record_np = np.array(record)
labels = ["akiyama", "satou", "tanaka"]

#Labels are serial numbers from "0"
# SR = 0        100
#      1         75
#      2        120
#      dtype: int64
SR = pd.Series(record)
SR = pd.Series(record_np)

#Specify the label with "index"
# SR = akiyama    100
#      satou       75
#      tanaka     120
#      dtype:   int64
SR = pd.Series(record, index=labels)
SR = pd.Series(record_np, index=labels)

Series type operation

With the Series type, you can perform various operations.

Display only labels, display only data

The index attribute of a Series returns only the labels, and the values attribute returns only the data.

Science = {"akiyama": 100, "satou": 75, "tanaka": 120}

# SR = akiyama    100
#      satou       75
#      tanaka     120
#      dtype:   int64
SR = pd.Series(Science)

# SR.index = Index(['akiyama', 'satou', 'tanaka'], dtype='object')
SR.index

# SR.values is a NumPy array
# SR.values = [100 75 120]
SR.values

Name the Series type and label

A Series can be given a name through its name attribute. The labels (index) can also be named through index.name.

Science = {"akiyama": 100, "satou": 75, "tanaka": 120}
SR = pd.Series(Science)

# SR = Students
#      akiyama    100
#      satou       75
#      tanaka     120
#      Name: Science, dtype: int64
SR.name = 'Science'
SR.index.name = 'Students'

Add new label / change label

To add new labels to a Series, rebuild it with the desired list of labels as follows.

Science = {"akiyama": 100, "satou": 75, "tanaka": 120}

# SR = akiyama    100
#      satou       75
#      tanaka     120
#      dtype:   int64
SR = pd.Series(Science)

# Rebuilding the Series with a list of labels as "index" returns a Series with exactly those labels.
# Labels that are not in the original Series ("nico", "mochidan") get NaN,
# and original labels that are not in the list ("tanaka") are dropped.
# SR_new = akiyama     100.0
#          satou        75.0
#          nico          NaN
#          mochidan      NaN
#          dtype:    float64
labels = ["akiyama", "satou", "nico", "mochidan"]
SR_new = pd.Series(SR, index=labels)
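The same relabeling can also be written with the reindex() method of Series, which makes the intent a little more explicit. The following is a minimal sketch; the variable names (science, reindexed, reindexed_zero) are only illustrative and do not appear in the original code.

```python
import pandas as pd

science = pd.Series({"akiyama": 100, "satou": 75, "tanaka": 120})
labels = ["akiyama", "satou", "nico", "mochidan"]

# reindex() keeps the values whose labels appear in `labels`,
# drops the others ("tanaka"), and fills labels without a value with NaN.
reindexed = science.reindex(labels)

# fill_value replaces the NaN placeholder with a value of your choice.
reindexed_zero = science.reindex(labels, fill_value=0)

print(reindexed)
print(reindexed_zero)
```

The original Series is not modified; reindex() returns a new Series.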

Access Series type data

Series data can be accessed by position or by label name.

Science = {"akiyama": 100, "satou": 75, "tanaka": 120}

# SR = akiyama    100
#      satou       75
#      tanaka     120
#      dtype:   int64
SR = pd.Series(Science)

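# tmp = 75
# Access by position uses iloc; plain SR[1] on a labeled Series is deprecated in recent pandas
tmp = SR.iloc[1]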
tmp = SR["satou"]

#It is also possible to specify using slice or list type
#In this case, Series type is returned
# tmp2 = akiyama    100
#        satou       75
#        dtype:   int64
tmp2 = SR[0:2]
tmp2 = SR[["akiyama", "satou"]]
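When it matters whether you mean a position or a label, the loc and iloc accessors state it explicitly. The sketch below is illustrative only (the variable names are not from the original); it also shows one difference that is easy to miss: a label slice includes its end label, while a positional slice excludes its end position.

```python
import pandas as pd

science = pd.Series({"akiyama": 100, "satou": 75, "tanaka": 120})

by_position = science.iloc[1]      # second element, selected by position
by_label = science.loc["satou"]    # the same element, selected by label

# A label slice with loc includes the end label ("satou"),
# while a positional slice with iloc excludes the end position (2).
label_slice = science.loc["akiyama":"satou"]
position_slice = science.iloc[0:2]

print(by_position, by_label)
print(label_slice)
```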

Determine if the data has nulls

You can use pd.isnull() or pd.notnull() to determine whether each value is null (missing).

Science = {"akiyama": 100, "satou": 75, "tanaka": 120}
SR = pd.Series(Science)
SR["satou"] = np.nan

# SR["satou"] is now a missing value (null)
# Introducing a missing value changes the data type to float64
# SR = akiyama    100.0
#      satou        NaN
#      tanaka     120.0
#      dtype:   float64
SR

# akiyama    False
# satou       True
# tanaka     False
# dtype:      bool
pd.isnull(SR)

# akiyama     True
# satou      False
# tanaka      True
# dtype:      bool
pd.notnull(SR)
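Because the result of isnull() is a boolean Series, it can be summed to count the missing values or used as a filter. A small sketch, assuming a Series that already contains a NaN (the names missing_mask, missing_count, and present are illustrative):

```python
import numpy as np
import pandas as pd

science = pd.Series({"akiyama": 100.0, "satou": np.nan, "tanaka": 120.0})

# Method form of pd.isnull(science)
missing_mask = science.isnull()

# True counts as 1 when summed, so this is the number of missing values
missing_count = int(missing_mask.sum())

# Keep only the elements that have a value
present = science[science.notnull()]

print(missing_count)
print(present)
```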

Display / change data that meets the conditions

You can test each value against a condition, then display or change only the values that satisfy it.

Science = {"akiyama": 100, "satou": 75, "tanaka": 120}

# SR = akiyama    100
#      satou       75
#      tanaka     120
#      dtype:   int64
SR = pd.Series(Science)

# Test which values are over 80
# The result is returned as a boolean Series
# is_Excellent = akiyama     True
#                satou      False
#                tanaka      True
#                dtype:      bool
is_Excellent = SR > 80

#Only data over 80 is acquired as Series type
# Excellent = akiyama    100
#             tanaka     120
#             dtype:   int64
Excellent = SR[SR > 80]
Excellent = SR[is_Excellent]

#Update data that meets the conditions by assigning a value
# SR = akiyama     80
#      satou       75
#      tanaka      80
#      dtype:   int64
SR[SR > 80] = 80
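Assigning through a condition changes the Series itself. If you want to keep the original, the where() and mask() methods return a new Series instead. This is a sketch of that alternative, not part of the original example; the variable names are illustrative.

```python
import pandas as pd

science = pd.Series({"akiyama": 100, "satou": 75, "tanaka": 120})

# where() keeps values where the condition is True and replaces the rest
capped = science.where(science <= 80, 80)

# mask() is the inverse: it replaces values where the condition is True
capped_with_mask = science.mask(science > 80, 80)

print(capped)
print(science)  # the original Series is unchanged
```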

Delete the specified data

Using the drop() method of a Series, you can get a new Series with the specified data removed.

Science = {"akiyama": 100, "satou": 75, "tanaka": 120}

# SR = akiyama    100
#      satou       75
#      tanaka     120
#      dtype:   int64
SR = pd.Series(Science)

# Specify the label of the element to delete
# SR_new = akiyama    100
#          tanaka     120
#          dtype:   int64
SR_new = SR.drop("satou")

Sort data

You can sort a Series by label with sort_index() and by value with sort_values().

Science = {"satou": 100, "akiyama": 75, "tanaka": 120}

# SR = satou      100
#      akiyama     75
#      tanaka     120
#      dtype:   int64
SR = pd.Series(Science)

#Sort by label in ascending order
# SR_index = akiyama     75
#            satou      100
#            tanaka     120
#            dtype:   int64
SR_index = SR.sort_index()

#When the argument "inplace" is set to "True", the Series type itself is updated.
SR.sort_index(inplace = True)

#If the argument "ascending" is set to "False", the sort will be in descending order.
SR.sort_index(inplace = True, ascending=False)

# Sort by value in ascending order
# SR_values = akiyama     75
#             satou      100
#             tanaka     120
#             dtype:   int64
SR_values = SR.sort_values()

#When the argument "inplace" is set to "True", the Series type itself is updated.
SR.sort_values(inplace = True)

#If the argument "ascending" is set to "False", the sort will be in descending order.
SR.sort_values(inplace = True, ascending=False)

DataFrame type

A DataFrame is like a table. The data to be analyzed may come as CSV, Excel, or HTML, and Pandas provides functions that read each of these and let you work with them as a table.

Creating a DataFrame type

You can read data in various formats as a DataFrame type.

Created from csv file

You can use pd.read_csv() or pd.read_table() to create a DataFrame from a CSV file. As an example, suppose the following CSV file exists.

test.csv


Students,Science,Math
akiyama,100,100
satou,75,99
tanaka,120,150
suzuki,50,50
mochidan,0,10

If you use read_csv(), you can create a DataFrame as follows.

# The argument is the path of the file to read, given here relative to the current directory.
DF = pd.read_csv('test.csv')

# If None is specified for "header", the first line is read as data rather than as column names.
DF = pd.read_csv('test.csv', header=None)

With read_table() you can specify the separator. You can create a DataFrame as follows:

# The first argument is the path of the file to read, given here relative to the current directory.
# Specify the separator with "sep"
DF = pd.read_table('test.csv', sep=',')

# If None is specified for "header", the first line is read as data rather than as column names.
DF = pd.read_table('test.csv', sep=',', header=None)

When the first line is used as the header (the default), DF is the following DataFrame.

   Students  Science  Math
0   akiyama      100   100
1     satou       75    99
2    tanaka      120   150
3    suzuki       50    50
4  mochidan        0    10
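To try read_csv() without creating a file, you can pass it a file-like object instead of a path. The sketch below uses io.StringIO with a shortened copy of the CSV text; the variable names are illustrative, and index_col is shown as one optional argument you may find useful, not something the original example relies on.

```python
import io

import pandas as pd

csv_text = """Students,Science,Math
akiyama,100,100
satou,75,99
tanaka,120,150
"""

# read_csv() accepts a file-like object as well as a path
df = pd.read_csv(io.StringIO(csv_text))

# With header=None the first line becomes a data row,
# and the columns are numbered 0, 1, 2
df_no_header = pd.read_csv(io.StringIO(csv_text), header=None)

# index_col uses a column as the row labels instead of 0, 1, 2, ...
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col="Students")

print(df)
print(df_no_header)
print(df_indexed)
```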

Create from clipboard

You can create a DataFrame from the clipboard with pd.read_clipboard(). As an example, suppose the following table exists.

   Students  Science  Math
0   akiyama      100   100
1     satou       75    99
2    tanaka      120   150
3    suzuki       50    50
4  mochidan        0    10

Suppose you copy the table above to the clipboard. You can then create a DataFrame as follows.

DF = pd.read_clipboard()

#If None is specified for "header", it will be read as data from the first line.
DF = pd.read_clipboard(header=None)

Created from an Excel file

You can use pd.read_excel() to create a DataFrame from an Excel file. However, be aware that merged cells are unmerged when read, which results in "NaN" values and column names of the form "Unnamed: x".

# The first argument is the path of the file to read, given here relative to the current directory.
# Specify the name of the sheet to read with sheet_name
DF = pd.read_excel('test.xlsx', sheet_name='Sheet1')

#If None is specified for "header", the data will be read from the first line of the Excel file.
DF = pd.read_excel('test.xlsx', sheet_name='Sheet1', header=None)

Created from dictionary type (dict type)

You can also create a DataFrame from a dictionary (dict). For example, when an HTTP request returns JSON, converting that JSON into dictionaries lets you work with the data as a DataFrame.

import pandas as pd
import json

#For example, if you have this JSON
json_obj = """
{    
    "result": [{"Students": "akiyama", "Science": 100, "Math": 100}, 
               {"Students": "satou"  , "Science":  75, "Math":  99}, 
               {"Students": "tanaka" , "Science": 120, "Math": 150}]
}
"""

# Parse the JSON string into Python objects
data = json.loads(json_obj)

# In this case, data["result"] is a list of dictionaries
# data["result"] = [{'Students': 'akiyama', 'Science': 100, 'Math': 100},
#                   {'Students': 'satou'  , 'Science':  75, 'Math':  99},
#                   {'Students': 'tanaka' , 'Science': 120, 'Math': 150}]
data["result"]

# Passing a list of dictionaries to pd.DataFrame() produces a DataFrame
# Each dictionary becomes one row of the DataFrame
DF = pd.DataFrame(data["result"])

In this case, the DF will be of the following DataFrame type.

  Students  Science  Math
0  akiyama      100   100
1    satou       75    99
2   tanaka      120   150
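A list of dictionaries describes the table row by row. A single dictionary whose values are lists describes the same table column by column, and pd.DataFrame() accepts that form too. The following sketch compares the two; the variable names are illustrative.

```python
import pandas as pd

# Row-oriented: one dictionary per row
records = [
    {"Students": "akiyama", "Science": 100, "Math": 100},
    {"Students": "satou", "Science": 75, "Math": 99},
    {"Students": "tanaka", "Science": 120, "Math": 150},
]
df_from_records = pd.DataFrame(records)

# Column-oriented: one key per column, each holding a list of values
columns = {
    "Students": ["akiyama", "satou", "tanaka"],
    "Science": [100, 75, 120],
    "Math": [100, 99, 150],
}
df_from_columns = pd.DataFrame(columns)

print(df_from_records)
print(df_from_columns)
```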

Created from NumPy array and list type

You can also create a DataFrame type from a NumPy array and a list type.

# np.array() converts a mix of strings and numbers to a single string type:
# data_np = [['akiyama' '100' '100'],
#            ['satou'   '75'  '99']]
data = [["akiyama", 100, 100], ["satou", 75, 99]]
data_np = np.array(data)

# Specify the column names with "columns" (if omitted, they are numbered from 0)
DF = pd.DataFrame(data, columns=['Students', 'Science', 'Math'])
DF = pd.DataFrame(data_np, columns=['Students', 'Science', 'Math'])

In this case, the DF will be of the following DataFrame type.

  Students  Science  Math
0  akiyama      100   100
1    satou       75    99
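The two DataFrames look the same when printed, but their column types differ: a NumPy array holds a single type, so mixing names and scores turns the scores into strings. The sketch below shows the difference and one way to convert a column back with pd.to_numeric(); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

data = [["akiyama", 100, 100], ["satou", 75, 99]]
cols = ["Students", "Science", "Math"]

df_list = pd.DataFrame(data, columns=cols)
df_np = pd.DataFrame(np.array(data), columns=cols)

# From the list, "Science" holds integers and can be summed directly
total_from_list = df_list["Science"].sum()

# From the NumPy array, every value (including the scores) is a string
raw_value = df_np["Science"].iloc[0]

# Convert the string column back to numbers before doing arithmetic
science_numeric = pd.to_numeric(df_np["Science"])
total_from_np = science_numeric.sum()

print(total_from_list, repr(raw_value), total_from_np)
```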

Save DataFrame type

You can save the DataFrame type as a file.

Save in csv format

You can save a DataFrame in CSV format with its to_csv() method.

#Suppose you create a DataFrame type in some way and make various edits.
DF = pd.read_csv('test.csv')

# The first argument is the destination path, given here relative to the current directory.
# The default separator is ","
DF.to_csv('test_2.csv')

#You can also specify the separator with "sep"
DF.to_csv('test_3.csv', sep='_')

# You can choose whether to write the header and the index
DF.to_csv("test_4.csv", header=False, index=False)
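If you call to_csv() without a path, it returns the CSV text as a string, which is a convenient way to check what the "index" option changes before writing a file. A minimal sketch with illustrative variable names:

```python
import io

import pandas as pd

df = pd.DataFrame({"Students": ["akiyama", "satou"], "Science": [100, 75]})

# Without a path, to_csv() returns the CSV text instead of writing a file
csv_with_index = df.to_csv()
csv_without_index = df.to_csv(index=False)

# With the index written, the header line starts with an empty column name
print(csv_with_index)
print(csv_without_index)

# Text written without the index can be read back with the same columns
restored = pd.read_csv(io.StringIO(csv_without_index))
print(restored)
```

When the index is written (the default), reading the file back produces an extra unnamed column unless you pass index_col=0 to read_csv().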

DataFrame type operations

If you use the DataFrame type, you can perform various operations.

Display column name, specify column name and display

The columns attribute of a DataFrame returns the column names, and you can retrieve a single column by specifying its name.

data = [["akiyama", 100, 100],
        ["satou"  ,  75,  99]]
DF = pd.DataFrame(data, columns = ['Students', 'Science', 'Math'])

#Show column name
# DF.columns = Index(['Students', 'Science', 'Math'], dtype='object')
DF.columns

#You can get the specified column as Series type by doing the following
# 0    akiyama
# 1      satou
# Name: Students, dtype: object
DF["Students"]
DF.Students

# To get two or more columns, pass their names as a list
# Two or more columns are returned as a DataFrame
DF[["Students", "Math"]]

DF[["Students", "Math"]] above is the following DataFrame.

  Students  Math
0  akiyama   100
1    satou    99

Display by specifying the row number

Using iloc on a DataFrame, you can retrieve only the specified rows.

data = [["akiyama", 100, 100],
        ["satou"  ,  75,  99]]
DF = pd.DataFrame(data, columns = ['Students', 'Science', 'Math'])

# You can get the specified row as a Series as follows
# Note that row numbers start from 0
# Students    satou
# Science        75
# Math           99
# Name: 1, dtype: object
DF.iloc[1]

# To get more than one row, give the range as a slice
# Two or more rows are returned as a DataFrame
DF.iloc[0:2]

DF.iloc[0:2] above is the following DataFrame.

  Students  Science  Math
0  akiyama      100   100
1    satou       75    99
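iloc also accepts a column position after the row position, and loc does the same with labels. The following sketch is an illustration only (the variable names are not from the original) of selecting a single cell and a sub-table.

```python
import pandas as pd

data = [["akiyama", 100, 100],
        ["satou", 75, 99]]
df = pd.DataFrame(data, columns=["Students", "Science", "Math"])

# Row position 1, column position 2 (the "Math" score of "satou")
cell_by_position = df.iloc[1, 2]

# The same cell by row label and column name
cell_by_label = df.loc[1, "Math"]

# Slices work on both axes: first two rows, first two columns
sub_table = df.iloc[0:2, 0:2]

print(cell_by_position, cell_by_label)
print(sub_table)
```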

Show only the beginning, show only the end

The head() method of a DataFrame displays only the beginning, and tail() displays only the end.

data = [["akiyama", 100, 100],
        ["satou"  ,  75,  99]]
DF = pd.DataFrame(data, columns = ['Students', 'Science', 'Math'])

#With no arguments, display up to the first 5 lines
#If you specify the number of lines in the argument, only the specified number of lines will be displayed from the beginning.
DF.head()
DF.head(1)

#With no arguments, display the last 5 lines
#If you specify the number of lines in the argument, only the specified number of lines will be displayed from the end.
DF.tail()
DF.tail(1)

Create a new DataFrame from specific columns

You can create a new DataFrame type using only specific columns of the DataFrame type, as shown below.

data = [["akiyama", 100, 100],
        ["satou"  ,  75,  99]]
DF = pd.DataFrame(data, columns = ['Students', 'Science', 'Math'])

# Create a new DataFrame using only "Students" and "Math" of the DataFrame "DF"
# Specify the columns in "columns"; a column name that does not exist in DF is created with all values "NaN"
DF_new = pd.DataFrame(DF, columns=['Students', 'Math'])

DF_new above is the following DataFrame.

  Students  Math
0  akiyama   100
1    satou    99

Delete the specified row or column

Using the drop() method of a DataFrame, you can get a new DataFrame with the specified rows or columns removed.

data = [["akiyama", 100, 100],
        ["satou"  ,  75,  99]]
DF = pd.DataFrame(data, columns = ['Students', 'Science', 'Math'])

# To delete a row, specify the index of the row (or its label, if the rows are labeled)
# The index starts from 0, so this deletes the second row
DF_drop_axis0 = DF.drop(1)

# To delete a column, specify 1 for the argument "axis"
# and give the name of the column to delete
# Here the column named "Science" is deleted
DF_drop_axis1 = DF.drop("Science", axis=1)

DF_drop_axis0 above is the following DataFrame.

  Students  Science  Math
0  akiyama      100   100

DF_drop_axis1 is the following DataFrame.

  Students  Math
0  akiyama   100
1    satou    99
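drop() also accepts the keyword arguments "index" and "columns", which say directly whether rows or columns are being removed, so you do not have to remember the meaning of axis. A short sketch with illustrative variable names:

```python
import pandas as pd

data = [["akiyama", 100, 100],
        ["satou", 75, 99]]
df = pd.DataFrame(data, columns=["Students", "Science", "Math"])

# Equivalent to df.drop(1): remove the row labeled 1
without_second_row = df.drop(index=[1])

# Equivalent to df.drop("Science", axis=1): remove the "Science" column
without_science = df.drop(columns=["Science"])

print(without_second_row)
print(without_science)
print(df)  # drop() returns a new DataFrame; the original is unchanged
```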

Add column

You can add a new column to the DataFrame type as follows:

data = [["akiyama", 100, 100],
        ["satou"  ,  75,  99]]
DF = pd.DataFrame(data, columns = ['Students', 'Science', 'Math'])

# Add the column "English", with "NaN" as the initial value
DF['English'] = np.nan

After this, DF is the following DataFrame.

  Students  Science  Math  English
0  akiyama      100   100      NaN
1    satou       75    99      NaN

You can also add columns using the Series type.

data = [["akiyama", 100, 100],
        ["satou"  ,  75,  99]]
DF = pd.DataFrame(data, columns = ['Students', 'Science', 'Math'])

# English = 0        100
#           1         30
#           dtype: int64
English = pd.Series([100, 30], index=[0, 1])

# Values are inserted where the index of the Series matches the index of the DataFrame
# Rows with no matching index become "NaN"
DF['English'] = English

After this, DF is the following DataFrame.

  Students  Science  Math  English
0  akiyama      100   100      100
1    satou       75    99       30
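To see the index matching in action, the following sketch deliberately gives the Series an index that only partly overlaps the DataFrame. The index values [1, 2] are chosen for illustration and are not part of the original example.

```python
import pandas as pd

data = [["akiyama", 100, 100],
        ["satou", 75, 99]]
df = pd.DataFrame(data, columns=["Students", "Science", "Math"])

# The Series has labels 1 and 2, while the DataFrame has rows 0 and 1
english = pd.Series([100, 30], index=[1, 2])
df["English"] = english

# Row 0 has no matching label, so it becomes NaN.
# Row 1 receives 100. Label 2 has no matching row and is ignored.
print(df)
```

Because of the NaN, the new column is stored as floating-point numbers rather than integers.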

Display data that meets the conditions

You can test the data against a condition and display only the rows that satisfy it.

data = [["akiyama", 100, 100],
        ["satou"  ,  75,  99]]
DF = pd.DataFrame(data, columns = ['Students', 'Science', 'Math'])

# Test whether each value of "Science" and "Math" is over 80
DF_80over = DF[["Science", "Math"]] > 80

# Select the rows where "Science" is over 80
Science_80over = DF[DF['Science'] > 80]

DF_80over above is the following DataFrame.

   Science  Math
0     True  True
1    False  True

Science_80over is the following DataFrame.

  Students  Science  Math
0  akiyama      100   100
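Conditions on several columns can be combined with & (and) and | (or). Each comparison needs its own parentheses, because & and | take precedence over comparison operators in Python. The sketch below uses a slightly larger table than the example above so that the two results differ; the variable names are illustrative.

```python
import pandas as pd

data = [["akiyama", 100, 100],
        ["satou", 75, 99],
        ["tanaka", 120, 150],
        ["suzuki", 50, 50]]
df = pd.DataFrame(data, columns=["Students", "Science", "Math"])

# Rows where both scores are over 80
both_over_80 = df[(df["Science"] > 80) & (df["Math"] > 80)]

# Rows where at least one score is over 80
either_over_80 = df[(df["Science"] > 80) | (df["Math"] > 80)]

print(both_over_80)
print(either_over_80)
```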
