Pandas is a Python extension module that provides features to assist in data analysis. Preliminary data analysis is an important step when building AI systems, and Pandas (together with Jupyter Notebook) makes that analysis very convenient. It is also commonly used to prepare input for machine learning frameworks, so it is worth knowing how to use it when studying artificial intelligence.
Here's a summary of what you can do with Pandas.
With pip, you can easily install Pandas with the following command.
# pip install pandas
You can use Pandas after the following import. The abbreviation "pd" is commonly used.
import pandas as pd
The following code assumes that the import above has already been executed. NumPy is also used frequently, so it is assumed to be imported under the abbreviation "np".
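For reference, that import is as follows.
import numpy as np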
Data analysis is performed using Pandas, and Pandas provides two main types for holding the data to be analyzed: the Series type and the DataFrame type.
If a NumPy array is "like a Python list", a Series is "like a Python dictionary (dict type)". You can label your data just as you would use dictionary keys, and you can do a lot of other things. A Series also corresponds to one column, or one record (one row), of the DataFrame type introduced below.
You can create a Series type mainly from list type and dictionary type.
You can create a Series type from a dictionary type (dict type) with Pandas' *Series()*. The dictionary keys become the Series labels, and the dictionary values become the Series data.
Science = {"akiyama": 100, "satou": 75, "tanaka": 120}
# SR = akiyama 100
# satou 75
# tanaka 120
# dtype: int64
SR = pd.Series(Science)
You can use Pandas' *Series()* to create a Series type from a list type or a NumPy array. When created from a list, the labels are numbered sequentially from "0", but you can also specify the labels separately with the "index" argument.
record = [100, 75, 120]
record_np = np.array(record)
labels = ["akiyama", "satou", "tanaka"]
#Labels are serial numbers from "0"
# SR = 0 100
# 1 75
# 2 120
# dtype: int64
SR = pd.Series(record)
SR = pd.Series(record_np)
#Specify the label with "index"
# SR = akiyama 100
# satou 75
# tanaka 120
# dtype: int64
SR = pd.Series(record, index=labels)
SR = pd.Series(record_np, index=labels)
With the Series type, you can perform various operations.
For the Series type, *index* gives you only the labels, and *values* gives you only the data.
Science = {"akiyama": 100, "satou": 75, "tanaka": 120}
# SR = akiyama 100
# satou 75
# tanaka 120
# dtype: int64
SR = pd.Series(Science)
# SR.index = Index(['akiyama', 'satou', 'tanaka'], dtype='object')
SR.index
# SR.values is a NumPy array
# SR.values = [100 75 120]
SR.values
A Series can be given a name with *name*. You can also name the labels with *index.name*.
Science = {"akiyama": 100, "satou": 75, "tanaka": 120}
SR = pd.Series(Science)
# SR = Students
# akiyama 100
# satou 75
# tanaka 120
# Name: Science, dtype: int64
SR.name = 'Science'
SR.index.name = 'Students'
If you want to add a new label to the Series type, add it as follows.
Science = {"akiyama": 100, "satou": 75, "tanaka": 120}
# SR = akiyama 100
# satou 75
# tanaka 120
# dtype: int64
SR = pd.Series(Science)
#If you recreate the Series using a list of labels, you get a Series with the new labels added.
#Labels of the original Series that are not in the new list are dropped, and new labels that are not in the original Series get "NaN".
# SR_new = akiyama 100.0
# satou 75.0
# nico NaN
# mochidan NaN
# dtype: float64
labels = ["akiyama", "satou", "nico", "mochidan"]
SR_new = pd.Series(SR, index=labels)
You can access Series data by position (subscript) or by label name.
Science = {"akiyama": 100, "satou": 75, "tanaka": 120}
# SR = akiyama 100
# satou 75
# tanaka 120
# dtype: int64
SR = pd.Series(Science)
# tmp = 75
tmp = SR[1]
tmp = SR["satou"]
#It is also possible to specify using slice or list type
#In this case, Series type is returned
# tmp2 = akiyama 100
# satou 75
# dtype: int64
tmp2 = SR[0:2]
tmp2 = SR[["akiyama", "satou"]]
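Note that, depending on your Pandas version, positional access such as SR[1] on a label-indexed Series may emit a deprecation warning. As a sketch, *iloc* and *loc* are the explicit ways to access by position and by label.
#Access by position
# tmp = 75
tmp = SR.iloc[1]
#Access by label
tmp = SR.loc["satou"]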
You can use Pandas' *isnull()* or *notnull()* to determine whether the data is null.
Science = {"akiyama": 100, "satou": 75, "tanaka": 120}
SR = pd.Series(Science)
SR["satou"] = np.nan
# SR["satou"]Is a missing value(null)
#Missing values change data type to float64
# SR = akiyama 100.0
# satou NaN
# tanaka 120.0
# dtype: float64
SR
# akiyama False
# satou True
# tanaka False
# dtype: bool
pd.isnull(SR)
# akiyama True
# satou False
# tanaka True
# dtype: bool
pd.notnull(SR)
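As a related sketch, missing values can also be removed with *dropna()* or replaced with *fillna()* of the Series type.
#Get a Series with the missing values removed
# akiyama 100.0
# tanaka 120.0
# dtype: float64
SR.dropna()
#Get a Series with the missing values replaced by 0
# akiyama 100.0
# satou 0.0
# tanaka 120.0
# dtype: float64
SR.fillna(0)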
You can test which data meets a condition, and display or change only the data that meets it.
Science = {"akiyama": 100, "satou": 75, "tanaka": 120}
# SR = akiyama 100
# satou 75
# tanaka 120
# dtype: int64
SR = pd.Series(Science)
#Test which data is greater than 80
#The result is obtained as a Series of booleans
# is_Excellent = akiyama True
# satou False
# tanaka True
# dtype: bool
is_Excellent = SR > 80
#Only data over 80 is acquired as Series type
# Excellent = akiyama 100
# tanaka 120
# dtype: int64
Excellent = SR[SR > 80]
Excellent = SR[is_Excellent]
#Update data that meets the conditions by assigning a value
# SR = akiyama 80
# satou 75
# tanaka 80
# dtype: int64
SR[SR > 80] = 80
By using *drop()* of the Series type, you can get a Series with the specified data deleted.
Science = {"akiyama": 100, "satou": 75, "tanaka": 120}
# SR = akiyama 100
# satou 75
# tanaka 120
# dtype: int64
SR = pd.Series(Science)
#Specify the label name of the row to be deleted
# SR_new = akiyama 100
# tanaka 120
# dtype: int64
SR_new = SR.drop("satou")
You can sort by label with *sort_index()* of the Series type, and by data with *sort_values()*.
Science = {"satou": 100, "akiyama": 75, "tanaka": 120}
# SR = satou 100
# akiyama 75
# tanaka 120
# dtype: int64
SR = pd.Series(Science)
#Sort by label in ascending order
# SR_index = akiyama 75
# satou 100
# tanaka 120
# dtype: int64
SR_index = SR.sort_index()
#When the argument "inplace" is set to "True", the Series type itself is updated.
SR.sort_index(inplace = True)
#If the argument "ascending" is set to "False", the sort will be in descending order.
SR.sort_index(inplace = True, ascending=False)
#Sort by data (values) in ascending order
# SR_values = akiyama 75
# satou 100
# tanaka 120
# dtype: int64
SR_values = SR.sort_values()
#When the argument "inplace" is set to "True", the Series type itself is updated.
SR.sort_values(inplace = True)
#If the argument "ascending" is set to "False", the sort will be in descending order.
SR.sort_values(inplace = True, ascending=False)
The DataFrame type is "like a table". The data to be analyzed may be in csv format, Excel format, or HTML, and Pandas provides functions to read these and work with them as tables.
You can read data in various formats as a DataFrame type.
You can use Pandas' *read_csv()* or *read_table()* to create a DataFrame type from a csv file. As an example, suppose the following csv file exists.
test.csv
Students,Science,Math
akiyama,100,100
satou,75,99
tanaka,120,150
suzuki,50,50
mochidan,0,10
If you use *read_csv()*, you can create a DataFrame type as follows.
#The argument is the path of the file to read, specified as a relative path from the current directory.
DF = pd.read_csv('test.csv')
#If None is specified for "header", it will be read as data from the first line.
DF = pd.read_csv('test.csv', header=None)
When using *read_table()*, you can specify the separator. You can create a DataFrame type as follows:
#The argument is the path of the file to read, specified as a relative path from the current directory.
#Specify a separator for "sep"
DF = pd.read_table('test.csv', sep=',')
#If None is specified for "header", it will be read as data from the first line.
DF = pd.read_table('test.csv', sep=',', header=None)
In these cases, the DF will be of the following DataFrame type.
| | Students | Science | Math |
|---|---|---|---|
| 0 | akiyama | 100 | 100 |
| 1 | satou | 75 | 99 |
| 2 | tanaka | 120 | 150 |
| 3 | suzuki | 50 | 50 |
| 4 | mochidan | 0 | 10 |
You can create a DataFrame type from the clipboard with *read_clipboard()* in Pandas. As an example, suppose the following table exists.
| | Students | Science | Math |
|---|---|---|---|
| 0 | akiyama | 100 | 100 |
| 1 | satou | 75 | 99 |
| 2 | tanaka | 120 | 150 |
| 3 | suzuki | 50 | 50 |
| 4 | mochidan | 0 | 10 |
Suppose you copy the above table and save it to the clipboard. After that, you can create a DataFrame type as follows.
DF = pd.read_clipboard()
#If None is specified for "header", it will be read as data from the first line.
DF = pd.read_clipboard(header=None)
You can use Pandas' *read_excel()* to create a DataFrame type from an Excel file. However, be aware that if there are merged cells, they will be unmerged, resulting in "NaN" for the values and "Unnamed: x" for the column names.
#The first argument is the path of the file to read, specified as a relative path from the current directory.
#Specify the name of the sheet to read with "sheet_name"
DF = pd.read_excel('test.xlsx', sheet_name='Sheet1')
#If None is specified for "header", the data will be read from the first line of the Excel file.
DF = pd.read_excel('test.xlsx', sheet_name='Sheet1', header=None)
You can also create a DataFrame type from a dictionary type (dict type). When JSON data is returned by an HTTP request, for example, you can convert the JSON into dictionaries and then manipulate the data as a DataFrame type.
import pandas as pd
import json
#For example, if you have this JSON
json_obj = """
{
"result": [{"Students": "akiyama", "Science": 100, "Math": 100},
{"Students": "satou" , "Science": 75, "Math": 99},
{"Students": "tanaka" , "Science": 120, "Math": 150}]
}
"""
#Create JSON object
data = json.loads(json_obj)
#In this case, data["result"] is a list of dictionaries
# data["result"] = [{'Students': 'akiyama', 'Science': 100, 'Math': 100},
# {'Students': 'satou' , 'Science': 75, 'Math': 99},
# {'Students': 'tanaka' , 'Science': 120, 'Math': 150}]
data["result"]
#If you pass a list of dictionaries to Pandas' DataFrame(), you get a DataFrame type
#Each dictionary becomes one row of the DataFrame
DF = pd.DataFrame(data["result"])
In this case, the DF will be of the following DataFrame type.
| | Students | Science | Math |
|---|---|---|---|
| 0 | akiyama | 100 | 100 |
| 1 | satou | 75 | 99 |
| 2 | tanaka | 120 | 150 |
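As a sketch, a dictionary whose values are lists can also be passed to *DataFrame()*; each key becomes a column name and each list becomes the column data (the variable name "data_dict" here is just an example).
data_dict = {"Students": ["akiyama", "satou", "tanaka"],
             "Science": [100, 75, 120],
             "Math": [100, 99, 150]}
DF = pd.DataFrame(data_dict)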
You can also create a DataFrame type from a NumPy array and a list type.
# data_np = [['akiyama' '100' '100'],
#            ['satou' '75' '99']]  (NumPy converts everything to strings in a mixed array)
data = [["akiyama", 100, 100], ["satou", 75, 99]]
data_np = np.array(data)
#Specify the column names with "columns" (if not specified, they are numbered sequentially from "0")
DF = pd.DataFrame(data, columns = ['Students', 'Science', 'Math'])
DF = pd.DataFrame(data_np, columns = ['Students', 'Science', 'Math'])
In this case, the DF will be of the following DataFrame type.
| | Students | Science | Math |
|---|---|---|---|
| 0 | akiyama | 100 | 100 |
| 1 | satou | 75 | 99 |
You can save the DataFrame type as a file.
You can save the DataFrame type in csv format by using *to_csv()* of the DataFrame type.
#Suppose you create a DataFrame type in some way and make various edits.
DF = pd.read_csv('test.csv')
#The first argument is the save destination path, which is specified as a relative path from the current directory.
#The default separator is ","
DF.to_csv('test_2.csv')
#You can also specify the separator with "sep"
DF.to_csv('test_3.csv', sep='_')
#Selectable with or without index and header
DF.to_csv("test_4.csv", header=False, index=False)
If you use the DataFrame type, you can perform various operations.
With the DataFrame type, you can get the column names with *columns*, or you can specify a column name to get only that column.
data = [["akiyama", 100, 100],
["satou" , 75, 99]]
DF = pd.DataFrame(data, columns = ['Students', 'Science', 'Math'])
#Show column name
# DF.columns = Index(['Students', 'Science', 'Math'], dtype='object')
DF.columns
#You can get the specified column as Series type by doing the following
# 0 akiyama
# 1 satou
# Name: Students, dtype: object
DF["Students"]
DF.Students
#If you want to get two or more columns, specify those column names as list type.
#When fetching two or more columns, it is fetched by DataFrame type
DF[["Students", "Math"]]
The above DF[["Students", "Math"]] is the following DataFrame type.
| | Students | Math |
|---|---|---|
| 0 | akiyama | 100 |
| 1 | satou | 99 |
By using *iloc* of the DataFrame type, only the specified rows can be retrieved.
data = [["akiyama", 100, 100],
["satou" , 75, 99]]
DF = pd.DataFrame(data, columns = ['Students', 'Science', 'Math'])
#You can get the specified row as a Series type by doing the following
#Note that row numbers start from "0"
# Students satou
# Science 75
# Math 99
# Name: 1, dtype: object
DF.iloc[1]
#If you want to get more than one row, specify the range to get as a slice
#When fetching two or more rows, the result is a DataFrame type
DF.iloc[0:2]
The above DF.iloc[0:2] is the following DataFrame type.
| | Students | Science | Math |
|---|---|---|---|
| 0 | akiyama | 100 | 100 |
| 1 | satou | 75 | 99 |
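As an additional sketch, *iloc* can also select rows and columns at the same time by giving a row position and a column position (or slices).
#Rows 0-1 and columns 0-1 ("Students" and "Science") as a DataFrame
DF.iloc[0:2, 0:2]
#A single value
# 'satou'
DF.iloc[1, 0]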
The DataFrame type's *head()* can be used to display only the beginning, and *tail()* only the end.
data = [["akiyama", 100, 100],
["satou" , 75, 99]]
DF = pd.DataFrame(data, columns = ['Students', 'Science', 'Math'])
#With no arguments, display up to the first 5 rows
#If you specify a number of rows as the argument, only that many rows are displayed from the beginning
DF.head()
DF.head(1)
#With no arguments, display the last 5 rows
#If you specify a number of rows as the argument, only that many rows are displayed from the end
DF.tail()
DF.tail(1)
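As a quick-overview sketch, *shape* and *describe()* of the DataFrame type give the size of the table and summary statistics of the numeric columns.
#Number of rows and columns as a tuple
# (2, 3)
DF.shape
#Summary statistics (count, mean, etc.) of the numeric columns
DF.describe()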
You can create a new DataFrame type using only specific columns of the DataFrame type, as shown below.
data = [["akiyama", 100, 100],
["satou" , 75, 99]]
DF = pd.DataFrame(data, columns = ['Students', 'Science', 'Math'])
#Create a new DataFrame type using only "Students" and "Math" of DataFrame type "DF"
#Specify the columns with "columns". If you specify a column that does not exist, all the data in that column will be "NaN"
DF_new = pd.DataFrame(DF, columns=['Students', 'Math'])
The above DF_new has the following DataFrame type.
| | Students | Math |
|---|---|---|
| 0 | akiyama | 100 |
| 1 | satou | 99 |
By using *drop()* of the DataFrame type, you can get a DataFrame with the specified column / row deleted.
data = [["akiyama", 100, 100],
["satou" , 75, 99]]
DF = pd.DataFrame(data, columns = ['Students', 'Science', 'Math'])
#When deleting a row, specify the index of the row to delete (if it has a label name, specify the label name)
#Since the index starts from "0", this deletes the second row
DF_drop_axis0 = DF.drop(1)
#When deleting a column, specify "1" for the argument "axis"
#Also, specify the index of the column to delete (if the column has a name, specify the column name)
#Since the columns are named, delete the column "Science"
DF_drop_axis1 = DF.drop("Science", axis=1)
The above DF_drop_axis0 is of the following DataFrame type.
| | Students | Science | Math |
|---|---|---|---|
| 0 | akiyama | 100 | 100 |
DF_drop_axis1 has the following DataFrame type.
| | Students | Math |
|---|---|---|
| 0 | akiyama | 100 |
| 1 | satou | 99 |
You can add a new column to the DataFrame type as follows:
data = [["akiyama", 100, 100],
["satou" , 75, 99]]
DF = pd.DataFrame(data, columns = ['Students', 'Science', 'Math'])
#Added column "English". Specify "NaN" as the initial value.
DF['English'] = np.nan
| | Students | Science | Math | English |
|---|---|---|---|---|
| 0 | akiyama | 100 | 100 | NaN |
| 1 | satou | 75 | 99 | NaN |
You can also add columns using the Series type.
data = [["akiyama", 100, 100],
["satou" , 75, 99]]
DF = pd.DataFrame(data, columns = ['Students', 'Science', 'Math'])
# English = 0 100
# 1 30
# dtype: int64
English = pd.Series([100, 30], index=[0, 1])
#Data is inserted where the index on the Series side and the index of the DataFrame type match.
#If there is no match, it will be "NaN"
DF['English'] = English
| | Students | Science | Math | English |
|---|---|---|---|---|
| 0 | akiyama | 100 | 100 | 100 |
| 1 | satou | 75 | 99 | 30 |
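As a related sketch, a new column can also be computed from existing columns (the column name "Total" here is just an example).
#Add a column calculated from other columns
# DF['Total'] = 0 200
# 1 174
# Name: Total, dtype: int64
DF['Total'] = DF['Science'] + DF['Math']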
You can test which data meets a condition and display only the data that meets it.
data = [["akiyama", 100, 100],
["satou" , 75, 99]]
DF = pd.DataFrame(data, columns = ['Students', 'Science', 'Math'])
#Test, for "Science" and "Math", whether each value is greater than 80
DF_80over = DF[["Science", "Math"]] > 80
#Show rows where "Science" is greater than 80
Science_80over = DF[DF['Science'] > 80]
The above DF_80over has the following DataFrame type.
| | Science | Math |
|---|---|---|
| 0 | True | True |
| 1 | False | True |
Science_80over has the following DataFrame type.
| | Students | Science | Math |
|---|---|---|---|
| 0 | akiyama | 100 | 100 |