[PYTHON] An introduction to Pandas that you can learn while suffering [Part 1]

Introduction

Pandas is an indispensable library for data analysis in Python. However, it has a steep learning curve for beginners ... I struggled with it a lot myself, so I will try to summarize it from the basics. I'm not an advanced Pandas user, so I'd appreciate your comments and advice. This article is planned as a three-part series. (It is only a plan, so it may well change ...)

Let's get started (* ^ ▽ ^ *)

Notes

Note that the following imports are omitted from the code examples in the rest of this article.

import pandas as pd
import numpy as np

About Series

Pandas has two main classes, Series and DataFrame. Briefly, a Series handles one-dimensional data (a vector), and a DataFrame handles two-dimensional data (a matrix). Let's check Series first. Official reference: pandas.Series

Creating a Series (from a dictionary)

sr = pd.Series({'taro': 'kyoto', 'jiro': 'osaka', 'saburo': 'nara'})
print(sr.head())


In the above code, the dictionary keys become the row labels (index) of the Series, and the dictionary values become the values of the Series.

Creating a Series (from a list)

sr = pd.Series(['kyoto', 'osaka', 'nara'], index=['taro', 'jiro','saburo'] )
print(sr.head())


You can set the row labels (index) as shown above. If you omit index, no error occurs; the integers 0, 1, 2, ... are assigned automatically as row labels.
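
To make the default labels concrete, here is a minimal sketch. The variable names are my own and only for illustration. It builds the same Series with and without index and looks up one value each way.

```python
import pandas as pd

# Without index, the row labels default to 0, 1, 2, ...
sr_default = pd.Series(['kyoto', 'osaka', 'nara'])
print(list(sr_default.index))  # [0, 1, 2]
print(sr_default.loc[1])       # osaka

# With index, values are looked up by the labels you supplied
sr_labeled = pd.Series(['kyoto', 'osaka', 'nara'], index=['taro', 'jiro', 'saburo'])
print(sr_labeled.loc['jiro'])  # osaka
```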

Series has many more features, but I will omit them because they largely overlap with DataFrame. **However, please note that usage can differ between DataFrame and Series even when the attribute name is the same.** (I think this is also one of the factors that make Pandas difficult ...)

About DataFrame

When people talk about Pandas, they usually mean DataFrame. Let's take a look at the official reference for DataFrame: pandas.DataFrame

The following three arguments of the pandas.DataFrame constructor are the important ones: data (the values to store), index (the row labels), and columns (the column labels).

Let's take a look at the actual program.

DataFrame creation (column)

First is the case of creating a DataFrame from column data. The dictionary keys are used as the column labels (columns), and each value (a list) supplies that column's values, one per row. (If the values are plain scalars rather than lists, an error occurs unless you also pass index; with index set, scalars are accepted. It's complicated ...)

data = {
  'name': ['taro', 'jiro', 'saburo'],
  'address': ['kyoto', 'osaka', 'nara'],
  'birth': ['2020-01-01T12:00:00', '2020-02-11T12:00:00', '2020-03-22T12:00:00']
}
df = pd.DataFrame(data = data)
print(df.head())
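
The note about scalar values is easier to see in code. The following is a minimal sketch with a made-up `record` dictionary (my own name, not from the pandas documentation): without index the constructor raises ValueError, and with index it builds a one-row DataFrame.

```python
import pandas as pd

# Hypothetical single record whose values are scalars, not lists
record = {'name': 'taro', 'address': 'kyoto'}

# Scalars only, no index: pandas cannot tell how many rows to create
try:
    pd.DataFrame(record)
    raised = False
except ValueError:
    raised = True
print(raised)  # True

# Scalars plus an explicit index: one row per index label
df = pd.DataFrame(record, index=['a'])
print(df.shape)  # (1, 2)
```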


DataFrame creation (row)

Next, create a DataFrame from row data. No error occurs if you omit columns, but in that case the columns are automatically labeled with the integers 0, 1, 2, ...

data = [
  ['taro', 'kyoto', '2020-01-01T12:00:00'],
  ['jiro', 'osaka', '2020-02-11T12:00:00'],
  ['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data = data, columns = ['name', 'address', 'birth'])
print(df.head())
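
For comparison, here is a small sketch of what happens when columns is omitted. The names `rows` and `df_unlabeled` are only for illustration. The column labels become integers, so a column is selected by number.

```python
import pandas as pd

rows = [
    ['taro', 'kyoto'],
    ['jiro', 'osaka'],
]

# No columns argument: the column labels default to 0, 1, ...
df_unlabeled = pd.DataFrame(rows)
print(list(df_unlabeled.columns))  # [0, 1]

# The integer label is then used to select the column
print(df_unlabeled[1].tolist())    # ['kyoto', 'osaka']
```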


By the way, if you want to set both the row labels (index) and the column labels (columns), pass both index and columns as follows.

data = [
  ['taro', 'kyoto', '2020-01-01T12:00:00'],
  ['jiro', 'osaka', '2020-02-11T12:00:00'],
  ['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data = data, index = ['a', 'b', 'c'], columns = ['name', 'address', 'birth'])
print(df.head())


Extracting columns from DataFrame

You will often want to extract specific columns from the data. loc, described later, is more flexible, but you can also select columns by indexing the DataFrame directly. **You don't need to remember this method at first, because loc, described later, is more capable (and learning both at once is confusing).**

data = [
  ['taro', 'kyoto', '2020-01-01T12:00:00'],
  ['jiro', 'osaka', '2020-02-11T12:00:00'],
  ['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data = data, columns = ['name', 'address', 'birth'])
print(df['name'])


The above code extracts the "name" column. You will notice that the output no longer looks like a table. This is because extracting a single column returns a Series, not a DataFrame.

If you want to extract multiple columns, do as follows.

df[['name', 'address']]


When you extract multiple columns this way, the result is a DataFrame instead of a Series.
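
The difference in return types can be checked directly. This is a minimal sketch using a cut-down version of the sample data. Note that a list with a single column name still returns a DataFrame.

```python
import pandas as pd

df = pd.DataFrame(
    [['taro', 'kyoto'], ['jiro', 'osaka'], ['saburo', 'nara']],
    columns=['name', 'address'],
)

single = df['name']               # bare label -> Series
wrapped = df[['name']]            # list with one label -> DataFrame
multiple = df[['name', 'address']]  # list with several labels -> DataFrame

print(isinstance(single, pd.Series))       # True
print(isinstance(wrapped, pd.DataFrame))   # True
print(isinstance(multiple, pd.DataFrame))  # True
```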

Extracting rows from DataFrame

Rows can be extracted as well as columns, using slices. **You don't need to remember this method at first, because loc, described later, is more capable (and learning both at once is confusing).**

data = [
  ['taro', 'kyoto', '2020-01-01T12:00:00'],
  ['jiro', 'osaka', '2020-02-11T12:00:00'],
  ['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data = data, index = ['a', 'b', 'c'], columns = ['name', 'address', 'birth'])
print(df['a':'b'])


In the above, rows "a" and "b" are extracted. Note that a label slice includes its end label. **Unlike columns, you cannot pick out several individual rows this way.** (For example, you cannot extract only rows a and c.) Also, to select only row a, you still need to write a slice. (For example, print(df['a':'a']) extracts only row a.)
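
Here is a small sketch of these restrictions, using a cut-down version of the sample data. The flag `list_of_rows_failed` is my own name for illustration. A list inside `[]` is read as column labels, so row labels raise KeyError, while loc accepts the same list.

```python
import pandas as pd

df = pd.DataFrame(
    [['taro', 'kyoto'], ['jiro', 'osaka'], ['saburo', 'nara']],
    index=['a', 'b', 'c'],
    columns=['name', 'address'],
)

# A label slice includes both endpoints
print(list(df['a':'b'].index))  # ['a', 'b']

# A list inside [] is interpreted as column labels, so row labels fail
try:
    df[['a', 'c']]
    list_of_rows_failed = False
except KeyError:
    list_of_rows_failed = True
print(list_of_rows_failed)  # True

# loc accepts a list of row labels
print(list(df.loc[['a', 'c']].index))  # ['a', 'c']
```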

About DataFrame attributes

This article covers the following DataFrame attributes: T, at, iat, loc, iloc, columns, index, shape, and values.

All of them are important, so let's look at them one by one.

DataFrame.T [Getting the transpose]

T returns the transpose. Simply put, you get the data with rows and columns swapped.

data = [
  ['taro', 'kyoto', '2020-01-01T12:00:00'],
  ['jiro', 'osaka', '2020-02-11T12:00:00'],
  ['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data = data, index = ['a', 'b', 'c'], columns = ['name', 'address', 'birth'])
print(df.T)
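
As a quick check of what "swapped" means, here is a minimal sketch on a cut-down version of the sample data. After transposing, the old column labels are the row labels and vice versa.

```python
import pandas as pd

df = pd.DataFrame(
    [['taro', 'kyoto'], ['jiro', 'osaka'], ['saburo', 'nara']],
    index=['a', 'b', 'c'],
    columns=['name', 'address'],
)

transposed = df.T
print(df.shape)                  # (3, 2)
print(transposed.shape)          # (2, 3)
print(list(transposed.index))    # ['name', 'address']
print(list(transposed.columns))  # ['a', 'b', 'c']
```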


DataFrame.at & DataFrame.iat [Extracting a single value]

With at and iat, you can get the value at any single position in a DataFrame. **For a DataFrame, at and iat always require two arguments** (to pin the position down to one cell). The difference is that at specifies the position by row label and column label, while iat specifies it by row number and column number.

data = [
  ['taro', 'kyoto', '2020-01-01T12:00:00'],
  ['jiro', 'osaka', '2020-02-11T12:00:00'],
  ['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data = data, index = ['a', 'b', 'c'], columns = ['name', 'address', 'birth'])
print(df.at['a','address']) # kyoto

In the above code, the "address" column of row "a" is specified, so kyoto is returned.


With iat, it looks like this.

data = [
  ['taro', 'kyoto', '2020-01-01T12:00:00'],
  ['jiro', 'osaka', '2020-02-11T12:00:00'],
  ['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data = data, index = ['a', 'b', 'c'], columns = ['name', 'address', 'birth'])
print(df.iat[1,2]) #2020-02-11T12:00:00


Since column 2 of row 1 is specified, 2020-02-11T12:00:00 is returned. (Row and column numbers start at 0.)
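
The two accessors can point at the same cell, which the following sketch confirms. It also shows that at can be used to overwrite a single value. The new value 'tokyo' is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [['taro', 'kyoto'], ['jiro', 'osaka'], ['saburo', 'nara']],
    index=['a', 'b', 'c'],
    columns=['name', 'address'],
)

# Row label 'b' is position 1, column label 'address' is position 1
print(df.at['b', 'address'])                   # osaka
print(df.at['b', 'address'] == df.iat[1, 1])   # True

# at (and iat) can also assign to a single cell
df.at['a', 'address'] = 'tokyo'
print(df.iat[0, 1])  # tokyo
```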

DataFrame.loc & DataFrame.iloc [Extracting rows and columns]

Next is loc, which is used so often that it is no exaggeration to call it the most important attribute of DataFrame. Once you know how to use loc, you can handle most basic tasks. The basic syntax of loc is as follows. **When using a slice, do not wrap it in a list; doing so results in an error.**

DataFrame.loc[[<row label>, ...], [<column label>, ...]]
DataFrame.loc[<row label A>:<row label B>, <column label A>:<column label B>]

data = [
  ['taro', 'kyoto', '2020-01-01T12:00:00'],
  ['jiro', 'osaka', '2020-02-11T12:00:00'],
  ['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data = data, index = ['a', 'b', 'c'], columns = ['name', 'address', 'birth'])
# Extract a single row
print(df.loc[['a']])

# Extract multiple rows
print(df.loc[['a', 'b']])

# Extract a single column
print(df.loc[:, ['name']])

# Extract multiple columns
print(df.loc[:, ['name', 'address']])

# Extract a combination of rows and columns
print(df.loc[['a', 'c'], ['name', 'birth']])
print(df.loc['a':'c', ['name', 'birth']])

You can also extract a single row or a single column without wrapping the row label or column label in a list, but the result differs from what you get with the list form. (Be careful, as this behavior is also a source of confusion ...)

data = [
  ['taro', 'kyoto', '2020-01-01T12:00:00'],
  ['jiro', 'osaka', '2020-02-11T12:00:00'],
  ['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data = data, index = ['a', 'b', 'c'], columns = ['name', 'address', 'birth'])
# Single row, label given in a list
df.loc[['a']]
print(type(df.loc[['a']])) # pandas.core.frame.DataFrame

# Single row, bare label
df.loc['a']
print(type(df.loc['a'])) # pandas.core.series.Series

# Single column, label given in a list
df.loc[:, ['name']]
print(type(df.loc[:, ['name']])) # pandas.core.frame.DataFrame

# Single column, bare label
df.loc[:, 'name']
print(type(df.loc[:, 'name'])) # pandas.core.series.Series

As the above code shows, passing a list returns a DataFrame, while passing a bare label returns a Series. (**Be careful about this in real code.**)
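
The heading also mentions iloc, so here is a minimal sketch of it. iloc works like loc but takes row and column numbers instead of labels. One difference worth knowing: an iloc slice excludes its end position, while a loc slice includes its end label. The variable names are my own.

```python
import pandas as pd

data = [
    ['taro', 'kyoto', '2020-01-01T12:00:00'],
    ['jiro', 'osaka', '2020-02-11T12:00:00'],
    ['saburo', 'nara', '2020-03-22T12:00:00'],
]
df = pd.DataFrame(data, index=['a', 'b', 'c'], columns=['name', 'address', 'birth'])

# Bare position -> Series, list of positions -> DataFrame (same rule as loc)
row_as_series = df.iloc[0]
row_as_frame = df.iloc[[0]]

# Position slice: the end position (2) is excluded
print(df.iloc[0:2].index.tolist())     # ['a', 'b']

# Label slice: the end label ('b') is included
print(df.loc['a':'b'].index.tolist())  # ['a', 'b']

# Rows 0 and 2, columns 0 and 2
subset = df.iloc[[0, 2], [0, 2]]
print(subset.columns.tolist())         # ['name', 'birth']
```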

DataFrame.columns & DataFrame.index [Check column labels and row labels]

You will often want to check which column labels and row labels are set on a DataFrame. columns returns the column labels, and index returns the row labels.

data = [
  ['taro', 'kyoto', '2020-01-01T12:00:00'],
  ['jiro', 'osaka', '2020-02-11T12:00:00'],
  ['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data = data, index = ['a', 'b', 'c'], columns = ['name', 'address', 'birth'])
#Show column label
print(df.columns) # Index(['name', 'address', 'birth'], dtype='object')

#Show row label
print(df.index) # Index(['a', 'b', 'c'], dtype='object')
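
Both attributes return Index objects rather than plain lists. As a small sketch on a cut-down version of the sample data, you can convert them with tolist() or test membership with `in`.

```python
import pandas as pd

df = pd.DataFrame(
    [['taro', 'kyoto'], ['jiro', 'osaka'], ['saburo', 'nara']],
    index=['a', 'b', 'c'],
    columns=['name', 'address'],
)

# Convert the labels to ordinary Python lists
column_labels = df.columns.tolist()
row_labels = df.index.tolist()
print(column_labels)  # ['name', 'address']
print(row_labels)     # ['a', 'b', 'c']

# Check whether a label exists before using it
print('address' in df.columns)  # True
print('z' in df.index)          # False
```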

DataFrame.shape [Get the number of rows and columns]

You will often want to know how many rows and columns a DataFrame has. shape gives you both: the first element of the returned tuple is the number of rows, and the second is the number of columns.

data = [
  ['taro', 'kyoto', '2020-01-01T12:00:00'],
  ['jiro', 'osaka', '2020-02-11T12:00:00'],
  ['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data = data, index = ['a', 'b', 'c'], columns = ['name', 'address', 'birth'])
#Get the number of rows and columns
print(df.shape) # (3, 3)
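
Because shape is an ordinary tuple, it can be unpacked into two variables. This sketch uses my own names `n_rows` and `n_cols` on a cut-down version of the sample data, and also shows that len(df) gives the row count.

```python
import pandas as pd

df = pd.DataFrame(
    [['taro', 'kyoto'], ['jiro', 'osaka'], ['saburo', 'nara']],
    index=['a', 'b', 'c'],
    columns=['name', 'address'],
)

# Unpack the (rows, columns) tuple
n_rows, n_cols = df.shape
print(n_rows, n_cols)    # 3 2

# len() on a DataFrame counts rows
print(len(df))           # 3
print(len(df.columns))   # 2
```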

DataFrame.values [Get numpy array]

A DataFrame carries row labels and column labels in addition to its values. If you only need the data and not the labels, values gives you the data as a numpy array.

data = [
  ['taro', 'kyoto', '2020-01-01T12:00:00'],
  ['jiro', 'osaka', '2020-02-11T12:00:00'],
  ['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data = data, index = ['a', 'b', 'c'], columns = ['name', 'address', 'birth'])
#Get a numpy array
print(df.values)


Once you have a numpy array, you can access the data by position, just as with an ordinary 2D array. **Please note that the way to extract data differs among DataFrame, Series, and numpy arrays.** (This is also a source of confusion ...)

# numpy arrays can be indexed just like regular 2D arrays
# Extract row 0, column 1
print(df.values[0][1]) # kyoto

# Extract row 0
print(df.values[0])

# You can also use slices
# Extract column 1
print(df.values[:, 1]) # ['kyoto' 'osaka' 'nara']
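
As a side note, the pandas documentation recommends the to_numpy() method over values for this conversion. The following is a minimal sketch on a cut-down version of the sample data. It also uses numpy's `[row, column]` indexing, which does the same job as `[row][column]`.

```python
import pandas as pd

df = pd.DataFrame(
    [['taro', 'kyoto'], ['jiro', 'osaka'], ['saburo', 'nara']],
    index=['a', 'b', 'c'],
    columns=['name', 'address'],
)

# Convert to a numpy array; the labels are dropped
arr = df.to_numpy()
print(arr.shape)  # (3, 2)

# Row 0, column 1 in a single indexing step
print(arr[0, 1])  # kyoto

# All rows of column 1
print(arr[:, 1].tolist())  # ['kyoto', 'osaka', 'nara']
```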

Finally

Pandas is certainly difficult, but the official reference is written very clearly, so I think the hurdle is not unreasonable if you read the reference carefully as you learn. (Maybe that's just me, but ...) In particular, "10 minutes to pandas" is compact and easy to follow, so I highly recommend it as a starting point.

I think I used the phrase "source of confusion" about four times in this article. These parts are particularly tricky, and if you do not understand them well, you may get stuck when the amount of data grows or when you combine data.

Part 2 will finally introduce Pandas methods. There are so many methods that they are hard to learn, but I will explain them as simply as I can, so please look forward to it. (^^ ♪ See you then (^ _-)-☆
