[PYTHON] An introduction to Pandas that you can learn while suffering [Part 1]

Introduction

Pandas is an indispensable library for data analysis in Python. However, Pandas has a high hurdle for beginners ... I had a lot of trouble with it myself, so I will try to summarize it starting from the basics. I'm not an advanced Pandas user, so I'd appreciate your comments and advice. This article is planned as a three-part series. (It is only a plan, so it is quite likely to change ...)

Introduction to Pandas to learn while suffering [Part1] ← You are here
How to create Series and DataFrame
Explanation of various attributes of DataFrame
Introduction to Pandas to learn while suffering [Part2]
Explanation of various methods of DataFrame
Introduction to Pandas to learn while suffering [Part3]
Explanation of various methods of DataFrame (continued)
About updating and searching values
About DataFrame calculation

Let's get started (* ^ ▽ ^ *)

Notes

Please note that the following imports are omitted from the code samples in the rest of this article.

import pandas as pd
import numpy as np

About Series

Pandas has two main data types: the Series class and the DataFrame class. In short, a Series holds one-dimensional data (a vector), while a DataFrame holds two-dimensional data (a matrix). Let's look at Series first. Official reference: pandas.Series

Create Series (specify dictionary)

sr = pd.Series({'taro': 'kyoto', 'jiro': 'osaka', 'saburo': 'nara'})
print(sr.head())

In the code above, the dictionary keys become the row labels (index) of the Series, and the dictionary values become the values of the Series.

Create Series (specify array)

sr = pd.Series(['kyoto', 'osaka', 'nara'], index=['taro', 'jiro','saburo'] )
print(sr.head())

You can set the row labels (index) as shown above. If you do not set them explicitly, no error occurs; the labels are assigned automatically as 0, 1, 2, and so on.
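As a quick check of the automatic numbering, here is a minimal sketch. The variable name sr_auto is only for illustration.

```python
import pandas as pd

# No index is passed, so pandas numbers the rows automatically
sr_auto = pd.Series(['kyoto', 'osaka', 'nara'])

print(list(sr_auto.index))  # [0, 1, 2]
```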

Series has many more features, but I will skip them because most of them overlap with DataFrame. **However, note that even when an attribute has the same name, its usage can differ between DataFrame and Series.** (I think this is one of the things that makes Pandas difficult ...)
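To make "same name, different usage" a little more concrete, here is a small sketch of my own. I picked [] and shape as examples; the small DataFrame here is made up just for the comparison.

```python
import pandas as pd

sr = pd.Series({'taro': 'kyoto', 'jiro': 'osaka', 'saburo': 'nara'})
df = pd.DataFrame(
    {'name': ['taro', 'jiro'], 'address': ['kyoto', 'osaka']},
    index=['a', 'b'],
)

# [] on a Series looks up a row label and returns a single value
series_value = sr['taro']
# [] on a DataFrame looks up a column label and returns a whole column
frame_column = df['name']

# shape has one element for a Series and two for a DataFrame
print(sr.shape)  # (3,)
print(df.shape)  # (2, 2)
```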

About DataFrame

When people talk about Pandas, they usually mean DataFrame. Let's take a look at the official reference for DataFrame: pandas.DataFrame

The following three are important arguments for the pandas.DataFrame constructor.

data
index
columns

Let's take a look at the actual program.

DataFrame creation (column)

First is the case of creating a DataFrame from column data. The dictionary keys are used as the column labels (columns), and each value (given as an array) supplies the entries of that column, one per row. (* If every value is a scalar rather than an array, an error occurs unless you also pass index. It's complicated ...)

data = {
  'name': ['taro', 'jiro', 'saburo'],
  'address': ['kyoto', 'osaka', 'nara'],
  'birth': ['2020-01-01T12:00:00', '2020-02-11T12:00:00', '2020-03-22T12:00:00']
}
df = pd.DataFrame(data = data)
print(df.head())
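The remark about values that are not arrays can be checked with a short sketch. The names scalar_data and df_one are my own, chosen for illustration.

```python
import pandas as pd

scalar_data = {'name': 'taro', 'address': 'kyoto'}

# Every value is a scalar and no index is given, so pandas raises ValueError
try:
    pd.DataFrame(data=scalar_data)
    raised = False
except ValueError:
    raised = True
print(raised)  # True

# Passing index tells pandas how many rows to build, so this works
df_one = pd.DataFrame(data=scalar_data, index=['a'])
print(df_one.shape)  # (1, 2)
```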

DataFrame creation (row)

Next, create a DataFrame from row data. Omitting columns does not cause an error, but the columns are then labeled automatically with numbers such as 0, 1.

data = [
  ['taro', 'kyoto', '2020-01-01T12:00:00'],
  ['jiro', 'osaka', '2020-02-11T12:00:00'],
  ['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data = data, columns = ['name', 'address', 'birth'])
print(df.head())
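For comparison, here is a minimal sketch of what happens when columns is left out. The names rows and df_no_cols are only for illustration; assigning to columns afterwards is one way to add the labels later.

```python
import pandas as pd

rows = [
    ['taro', 'kyoto'],
    ['jiro', 'osaka'],
]

# columns is omitted, so the column labels become 0, 1
df_no_cols = pd.DataFrame(data=rows)
auto_labels = list(df_no_cols.columns)
print(auto_labels)  # [0, 1]

# The labels can still be assigned afterwards
df_no_cols.columns = ['name', 'address']
print(list(df_no_cols.columns))  # ['name', 'address']
```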

By the way, if you want to set both the row labels (index) and the column labels (columns), pass both index and columns as follows.

data = [
  ['taro', 'kyoto', '2020-01-01T12:00:00'],
  ['jiro', 'osaka', '2020-02-11T12:00:00'],
  ['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data = data, index = ['a', 'b', 'c'], columns = ['name', 'address', 'birth'])
print(df.head())

Extracting columns from DataFrame

You will often want to extract specific columns from your data. loc, described later, is more flexible, but you can also narrow down the columns by indexing the DataFrame directly. **You don't need to memorize this method at first, because loc is more capable and knowing two ways only adds confusion.**

data = [
  ['taro', 'kyoto', '2020-01-01T12:00:00'],
  ['jiro', 'osaka', '2020-02-11T12:00:00'],
  ['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data = data, columns = ['name', 'address', 'birth'])
print(df['name'])

The code above extracts the "name" column. You will notice that the output no longer looks like a table. This is because the result of extracting a single column is a Series, not a DataFrame.

If you want to extract multiple columns, do as follows.

df[['name', 'address']]

If you extract multiple columns, the result is a DataFrame instead of a Series.

Extracting rows from DataFrame

Rows can be extracted as well as columns, using slices. **You don't need to memorize this method at first, because loc is more capable and knowing two ways only adds confusion.**
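The difference in return type is easy to confirm with type. A minimal sketch (the variable names are mine):

```python
import pandas as pd

df = pd.DataFrame(
    data=[['taro', 'kyoto'], ['jiro', 'osaka'], ['saburo', 'nara']],
    columns=['name', 'address'],
)

one_column = df['name']        # a single label returns a Series
one_col_frame = df[['name']]   # a list of labels returns a DataFrame

print(type(one_column).__name__)     # Series
print(type(one_col_frame).__name__)  # DataFrame
print(one_col_frame.shape)           # (3, 1)
```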

data = [
  ['taro', 'kyoto', '2020-01-01T12:00:00'],
  ['jiro', 'osaka', '2020-02-11T12:00:00'],
  ['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data = data, index = ['a', 'b', 'c'], columns = ['name', 'address', 'birth'])
print(df['a':'b'])

The code above extracts rows "a" and "b". **Unlike columns, you cannot pick out several individual rows this way.** (For example, you cannot extract only rows a and c.) Also, even to select just row a, you must write it as a slice. (For example, print(df['a':'a']) extracts only row a.)

Some readers may be puzzled here. A normal slice does not include the index on the right side, so shouldn't df['a':'a'] fail to extract row a? In a DataFrame, slices by row label or column label include the right endpoint (closed interval), while slices by row number or column number exclude it (half-open interval, closed on the left and open on the right), like normal slices. (I got caught by this myself ...)
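Here is a small sketch of the two slicing rules side by side. To make it explicit which kind of slice is meant, I use loc for labels and iloc for numbers (both appear in the attribute list later in this article); the variable names are only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    data=[['taro', 'kyoto'], ['jiro', 'osaka'], ['saburo', 'nara']],
    index=['a', 'b', 'c'],
    columns=['name', 'address'],
)

# Label slice: the right endpoint 'b' is included
by_label = df.loc['a':'b']
print(list(by_label.index))   # ['a', 'b']

# Number slice: the right endpoint 1 is excluded
by_number = df.iloc[0:1]
print(list(by_number.index))  # ['a']

# That is why 'a':'a' still returns row a
only_a = df.loc['a':'a']
print(list(only_a.index))     # ['a']
```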

About DataFrame attributes

DataFrame has the following attributes.

T
at
iat
loc
iloc
columns
index
shape
values

All of them are important, so let's look at them one by one.

DataFrame.T [Get the transposed matrix]

T returns the transposed matrix. Simply put, you get the data with rows and columns swapped.

data = [
  ['taro', 'kyoto', '2020-01-01T12:00:00'],
  ['jiro', 'osaka', '2020-02-11T12:00:00'],
  ['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data = data, index = ['a', 'b', 'c'], columns = ['name', 'address', 'birth'])
print(df.T)
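The swap is easier to see when the numbers of rows and columns differ, so here is a sketch with a 3 x 2 DataFrame of my own (only the name and address columns):

```python
import pandas as pd

df_small = pd.DataFrame(
    data=[['taro', 'kyoto'], ['jiro', 'osaka'], ['saburo', 'nara']],
    index=['a', 'b', 'c'],
    columns=['name', 'address'],
)
transposed = df_small.T

print(df_small.shape)            # (3, 2)
print(transposed.shape)          # (2, 3)
# The old column labels are now the row labels, and vice versa
print(list(transposed.index))    # ['name', 'address']
print(list(transposed.columns))  # ['a', 'b', 'c']
```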

DataFrame.at & DataFrame.iat [Extract a single value]

With at and iat you can get the value at any single position in a DataFrame. **For a DataFrame, at and iat always require two arguments** (to pin the position down to one cell). The difference is that at specifies the position by row label and column label, while iat specifies it by row number and column number.

data = [
  ['taro', 'kyoto', '2020-01-01T12:00:00'],
  ['jiro', 'osaka', '2020-02-11T12:00:00'],
  ['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data = data, index = ['a', 'b', 'c'], columns = ['name', 'address', 'birth'])
print(df.at['a','address']) # kyoto

In the case of the above code, since the "address" column of the "a" row is specified, kyoto can be obtained.

When using iat, it will be as follows.

data = [
  ['taro', 'kyoto', '2020-01-01T12:00:00'],
  ['jiro', 'osaka', '2020-02-11T12:00:00'],
  ['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data = data, index = ['a', 'b', 'c'], columns = ['name', 'address', 'birth'])
print(df.iat[1,2]) #2020-02-11T12:00:00

Since column 2 of row 1 is specified, 2020-02-11T12:00:00 is returned. (Row numbers and column numbers start at 0.)
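To connect the two, here is a sketch that reads the same cell both ways. Using Index.get_loc to turn a label into a number is my own addition for illustration, not something the examples above rely on.

```python
import pandas as pd

data = [
    ['taro', 'kyoto', '2020-01-01T12:00:00'],
    ['jiro', 'osaka', '2020-02-11T12:00:00'],
    ['saburo', 'nara', '2020-03-22T12:00:00'],
]
df = pd.DataFrame(data=data, index=['a', 'b', 'c'], columns=['name', 'address', 'birth'])

value_by_label = df.at['b', 'birth']   # row label, column label
value_by_number = df.iat[1, 2]         # row number, column number
print(value_by_label == value_by_number)  # True

# get_loc converts a label into its position
row_pos = df.index.get_loc('b')
col_pos = df.columns.get_loc('birth')
print(row_pos, col_pos)  # 1 2
```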

DataFrame.loc & DataFrame.iloc [Extracting rows and columns]

Next is loc, which is used so often that it is no exaggeration to call it the most important feature of DataFrame. Once you know how to use loc, you can handle most basic tasks. The basic syntax of loc is as follows. **When using a slice, do not wrap it in a list; doing so results in an error.**

DataFrame.loc[[row label, ...], [column label, ...]]
DataFrame.loc[row label A:row label B, column label A:column label B]

data = [
  ['taro', 'kyoto', '2020-01-01T12:00:00'],
  ['jiro', 'osaka', '2020-02-11T12:00:00'],
  ['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data = data, index = ['a', 'b', 'c'], columns = ['name', 'address', 'birth'])
#Single row extraction
print(df.loc[['a']])

#Extraction of multiple rows
print(df.loc[['a', 'b']])

#Single column extraction
print(df.loc[:, ['name']])

#Extraction of multiple columns
print(df.loc[:, ['name', 'address']])

#Row and column combination extraction
print(df.loc[['a', 'c'], ['name', 'birth']])
print(df.loc['a':'c', ['name', 'birth']])

You can also extract a single row or a single column without wrapping the label in a list, but the result differs from what you get when the label is wrapped in a list. (Be careful, as this is another source of confusion ...)

data = [
  ['taro', 'kyoto', '2020-01-01T12:00:00'],
  ['jiro', 'osaka', '2020-02-11T12:00:00'],
  ['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data = data, index = ['a', 'b', 'c'], columns = ['name', 'address', 'birth'])
#Single row extraction (label wrapped in a list)
df.loc[['a']]
print(type(df.loc[['a']])) # pandas.core.frame.DataFrame

#Single row extraction (bare label)
df.loc['a']
print(type(df.loc['a'])) # pandas.core.series.Series

#Single column extraction (label wrapped in a list)
df.loc[:, ['name']]
print(type(df.loc[:, ['name']])) # pandas.core.frame.DataFrame

#Single column extraction (bare label)
df.loc[:, 'name']
print(type(df.loc[:, 'name'])) # pandas.core.series.Series

As the code above shows, passing a list returns a DataFrame, while passing a bare label returns a Series. (**Be careful about this in real code.**)
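The heading of this section also names iloc. As far as I understand it, iloc follows the same pattern as loc but takes row numbers and column numbers instead of labels, and its slices exclude the right endpoint. A minimal sketch under that assumption:

```python
import pandas as pd

data = [
    ['taro', 'kyoto', '2020-01-01T12:00:00'],
    ['jiro', 'osaka', '2020-02-11T12:00:00'],
    ['saburo', 'nara', '2020-03-22T12:00:00'],
]
df = pd.DataFrame(data=data, index=['a', 'b', 'c'], columns=['name', 'address', 'birth'])

# A list of row numbers returns a DataFrame, a bare number returns a Series
row_as_frame = df.iloc[[0]]
row_as_series = df.iloc[0]

# Number slices exclude the right endpoint: rows 0-1 and columns 0-1
top_left = df.iloc[0:2, 0:2]
print(top_left.shape)  # (2, 2)

# The same cells selected by number and by label
by_number = df.iloc[[0, 2], [0, 2]]
by_label = df.loc[['a', 'c'], ['name', 'birth']]
print(by_number.equals(by_label))  # True
```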

DataFrame.columns & DataFrame.index [Check column labels and row labels]

You will often want to check which column labels and row labels a DataFrame has. columns gives the list of column labels, and index gives the row labels.

data = [
  ['taro', 'kyoto', '2020-01-01T12:00:00'],
  ['jiro', 'osaka', '2020-02-11T12:00:00'],
  ['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data = data, index = ['a', 'b', 'c'], columns = ['name', 'address', 'birth'])
#Show column label
print(df.columns) # Index(['name', 'address', 'birth'], dtype='object')

#Show row label
print(df.index) # Index(['a', 'b', 'c'], dtype='object')
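Both attributes return Index objects rather than plain lists. If a plain list is more convenient, tolist can be used, and the in operator works for a quick membership check. A small sketch (the variable names are mine):

```python
import pandas as pd

df = pd.DataFrame(
    data=[['taro', 'kyoto'], ['jiro', 'osaka'], ['saburo', 'nara']],
    index=['a', 'b', 'c'],
    columns=['name', 'address'],
)

# Convert the labels to ordinary Python lists
column_labels = df.columns.tolist()
row_labels = df.index.tolist()
print(column_labels)  # ['name', 'address']
print(row_labels)     # ['a', 'b', 'c']

# Check whether a label exists
has_birth = 'birth' in df.columns
print(has_birth)  # False
```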

DataFrame.shape [Get the number of rows and columns]

You will often want to know how many rows and columns a DataFrame has. shape gives you both: the first element of the returned tuple is the number of rows, and the second is the number of columns.

data = [
  ['taro', 'kyoto', '2020-01-01T12:00:00'],
  ['jiro', 'osaka', '2020-02-11T12:00:00'],
  ['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data = data, index = ['a', 'b', 'c'], columns = ['name', 'address', 'birth'])
#Get the number of rows and columns
print(df.shape) # (3, 3)
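Since shape is an ordinary tuple, it can be unpacked into two variables. A minimal sketch with a 3 x 2 DataFrame of my own, so that the two numbers are distinguishable:

```python
import pandas as pd

df_small = pd.DataFrame(
    data=[['taro', 'kyoto'], ['jiro', 'osaka'], ['saburo', 'nara']],
    columns=['name', 'address'],
)

# Unpack the tuple into the row count and the column count
n_rows, n_cols = df_small.shape
print(n_rows)  # 3
print(n_cols)  # 2

# len() on a DataFrame gives the number of rows
print(len(df_small))  # 3
```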

DataFrame.values [Get numpy array]

A DataFrame carries row labels and column labels in addition to its values. If you need only the data and not the labels, you can convert it to a numpy array.

However, the pandas documentation recommends the to_numpy method over values for getting a numpy array. (Thanks to nkay.) If you want a numpy array, use to_numpy. (Details are in the comment section.)

data = [
  ['taro', 'kyoto', '2020-01-01T12:00:00'],
  ['jiro', 'osaka', '2020-02-11T12:00:00'],
  ['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data = data, index = ['a', 'b', 'c'], columns = ['name', 'address', 'birth'])
#Get a numpy array
print(df.values)

Once converted to a numpy array, the data can be accessed by index like a normal 2D array. **Note that DataFrame, Series, and numpy arrays each have their own way of extracting data.** (This is also a source of confusion ...)

#numpy arrays can be indexed just like regular 2D arrays
#Extract the element at row 0, column 1
print(df.values[0][1]) # kyoto

#Extract row 0
print(df.values[0])

#You can also use slices
#Extract column 1
print(df.values[:, 1]) # ['kyoto' 'osaka' 'nara']
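For reference, the same accesses written with to_numpy look like this. This is only a sketch of the recommended method with the same data; the variable name arr is my own.

```python
import pandas as pd

data = [
    ['taro', 'kyoto', '2020-01-01T12:00:00'],
    ['jiro', 'osaka', '2020-02-11T12:00:00'],
    ['saburo', 'nara', '2020-03-22T12:00:00'],
]
df = pd.DataFrame(data=data, index=['a', 'b', 'c'], columns=['name', 'address', 'birth'])

# to_numpy drops the labels and returns only the values
arr = df.to_numpy()
print(arr.shape)  # (3, 3)

# Row 0, column 1 (numpy also accepts the [row, column] form)
print(arr[0, 1])  # kyoto

# Column 1 of every row
second_column = arr[:, 1].tolist()
print(second_column)  # ['kyoto', 'osaka', 'nara']
```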

Finally

Pandas is certainly difficult, but the official reference is very clearly written, so I think the hurdle is manageable if you read the reference carefully as you learn. (Maybe that's just me ...) In particular, "10 minutes to pandas" is compact and easy to understand, so I highly recommend it as a starting point.

I think I used the phrase "source of confusion" about four times in this article. Those parts are especially tricky; if you don't understand them well, you may get stuck when the amount of data grows or when you combine datasets.

Part 2 will finally introduce the Pandas methods. There are so many of them that they are hard to learn, but I will explain them as simply as I can, so please stay with me. (^^ ♪ See you then (^ _-)-☆