I tried to summarize how to use pandas in Python

Introduction

This time I will summarize how to use pandas.

Many people have already summarized how to use pandas, so this may not be anything new, but I would appreciate it if you read along.

The previous article summarized how to use numpy, so please check it if you like.

I tried to summarize python numpy

Series generation

You can generate a Series as follows. A Series is a one-dimensional array with an index (labels) attached.

import numpy as np
import pandas as pd

series = pd.Series(data=[1, 2, 3, 4, 5], index=['A', 'B', 'C', 'D', 'E'])
print(series)

A    1
B    2
C    3
D    4
E    5
dtype: int64

It can also be generated in combination with numpy.

series = pd.Series(data=np.arange(5), index=['A', 'B', 'C', 'D', 'E'])
print(series)

A    0
B    1
C    2
D    3
E    4
dtype: int64

Extract data from Series

A Series can be indexed by its labels to retrieve data, much like a dictionary.

series = pd.Series(data=np.arange(5), index=['A', 'B', 'C', 'D', 'E'])
print(series['A'])

0
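Because the index labels behave like dictionary keys, a Series can also be built directly from a dict. A minimal sketch:

import pandas as pd

# The dict keys become the index labels and the values become the data.
series = pd.Series({'A': 1, 'B': 2, 'C': 3})
print(series['B'])  # 2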

You can also use slices.

series = pd.Series(data=np.arange(5), index=['A', 'B', 'C', 'D', 'E'])
print(series['A':'D'])

A    0
B    1
C    2
D    3
dtype: int64

With an ordinary Python slice, the stop is exclusive, so you would expect only the data up to C, one before D; with a label slice on a Series, however, the data is extracted up to and including the index you specify.

With loc

However, when retrieving data by label like this, the loc accessor is conventionally used.

series = pd.Series(data=np.arange(5), index=['A', 'B', 'C', 'D', 'E'])
print(series.loc['A':'D'])

A    0
B    1
C    2
D    3
dtype: int64

You can also pass a list of labels instead of a slice.

series = pd.Series(data=np.arange(5), index=['A', 'B', 'C', 'D', 'E'])
print(series.loc[['A', 'D']])

A    0
D    3
dtype: int64

With iloc

Instead of using the Series labels, you can also retrieve data by specifying the integer position counted from the beginning.

series = pd.Series(data=np.arange(5), index=['A', 'B', 'C', 'D', 'E'])
print(series.iloc[[0]])

A    0
dtype: int64
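For contrast with the label slices above, iloc slices behave like ordinary Python slices, i.e. the stop position is exclusive. A minimal sketch:

import numpy as np
import pandas as pd

series = pd.Series(data=np.arange(5), index=['A', 'B', 'C', 'D', 'E'])

# Positional slice: the stop is exclusive, so this returns positions 0, 1, 2 (A, B, C).
print(series.iloc[0:3])

# Label slice: the stop label is included, so this returns A through D.
print(series.loc['A':'D'])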

DataFrame generation

You can generate a DataFrame by doing the following.

df = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6], [7, 8, 9]], index=['A', 'B', 'C'], columns=['A1', 'A2', 'A3'])
print(df)

   A1  A2  A3
A   1   2   3
B   4   5   6
C   7   8   9

In this way, a DataFrame is two-dimensional data with both an `index` and `columns`.

When used in machine learning, the index typically labels the individual samples and the columns represent their features.

Read file

Reading files is a very common task, since pandas is usually used to read data from files.

Here we load a small CSV file, train.csv: three rows of people (takash, kenta, yoko) with their scores in three columns (math, Engrish, society).
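The original screenshot of the file is not reproduced here, so the sketch below creates a comparable train.csv. The row and column names are taken from the info() output shown later in this article; the score values themselves are made up for illustration.

import pandas as pd

# Hypothetical contents for train.csv: the real scores are not shown in the
# original article, so these numbers are placeholders.
sample = pd.DataFrame(
    {'math': [90, 70, 80],
     'Engrish': [60, 85, 75],
     'society': [80, 65, 95]},
    index=['takash', 'kenta', 'yoko'])
sample.index.name = 'name'  # assumed header for the index column
sample.to_csv('train.csv')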

Let's read it.
df = pd.read_csv('train.csv')
print(df)

Looking at the execution result, I wanted takash, kenta, and yoko to be used as the index, but by default read_csv does not treat any column as the index; it simply attaches a default integer index.

To read data that has an index column, you must specify it with `index_col`. In this example, the leftmost column should be treated as the index, so set `index_col=0`.

df = pd.read_csv('train.csv', index_col=0)
print(df)

Also, by default the very first row is treated as the header. If you don't want the first row to be used as the header, specify `header=None`.

df = pd.read_csv('train.csv', header=None)
print(df)

Confirmation of data contents

Check the shape

Let's check the shape of the data. As in numpy, the shape attribute holds the dimensions.

df = pd.read_csv('train.csv', index_col=0)
print(df.shape)

(3, 3)

Check statistics

You can check the statistics of the data using the describe method.

df = pd.read_csv('train.csv', index_col=0)
print(df.describe())

In this way, you can get the count, mean, standard deviation, minimum, maximum, and quartiles for each column.

Check the number and type of data

You can check it with the code below.

df = pd.read_csv('train.csv', index_col=0)
print(df.info())

<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, takash to yoko
Data columns (total 3 columns):
math       3 non-null int64
Engrish    3 non-null int64
society    3 non-null int64
dtypes: int64(3)
memory usage: 96.0+ bytes
None

You can check the data like this: the number of rows, the non-null count and dtype of each column, and the memory usage.

Checking data without duplication

The nunique method returns the number of unique (non-duplicated) values in each column.

df = pd.read_csv('train.csv', index_col=0)
print(df.nunique())

math       3
Engrish    3
society    3
dtype: int64

Since every value in each column is unique this time, each column reports 3.

Confirmation of row name and column name

The index attribute stores the row labels and the columns attribute stores the column names. Let's check.

df = pd.read_csv('train.csv', index_col=0)
print(df.index)
print(df.columns)

Index(['takash', 'kenta', 'yoko'], dtype='object')
Index(['math', 'Engrish', 'society'], dtype='object')

Check the total of missing values

You can see the location of the missing values in each column with the code below.

df = pd.read_csv('train.csv', index_col=0)
print(df.isnull())

         math  Engrish  society
takash  False    False    False
kenta   False    False    False
yoko    False    False    False

Since none of the values are missing, False is returned everywhere.

Now let's get the sum of the missing values with the following code.

df = pd.read_csv('train.csv', index_col=0)
print(df.isnull().sum())

math       0
Engrish    0
society    0
dtype: int64
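If you want a single grand total across the whole frame rather than per-column totals, you can simply sum once more. A small sketch:

import pandas as pd

df = pd.read_csv('train.csv', index_col=0)
# Sum the per-column counts of missing values into one number.
print(df.isnull().sum().sum())  # 0 for this data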

DataFrame data selection and extraction

Now let's extract the data from the DataFrame. For the time being, I generated the following DataFrame.

df = pd.DataFrame(data=np.random.rand(5, 5),
                  index=['A', 'B', 'C', 'D', 'E'],
                  columns=('A1', 'A2', 'A3', 'A4', 'A5'))
np.random.seed(0)
print(df)

np.random.seed(0) fixes the random numbers generated by np.random.rand, but only if it is called before the data is generated. In these snippets it is called after the DataFrame has already been created, so it has no effect on that frame, and the values change on every run (which is why the outputs below differ between examples).

np.random.rand generates uniform random numbers in the interval [0, 1).
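As a quick check of that behavior, here is a minimal sketch: calling seed before each call to rand reproduces exactly the same values.

import numpy as np

np.random.seed(0)
a = np.random.rand(2, 2)

np.random.seed(0)
b = np.random.rand(2, 2)

# Seeding before each call makes the generated values identical.
print(np.array_equal(a, b))  # True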

Let's select and extract a column with the code below.

df = pd.DataFrame(data=np.random.rand(5, 5),
                  index=['A', 'B', 'C', 'D', 'E'],
                  columns=('A1', 'A2', 'A3', 'A4', 'A5'))
np.random.seed(0)
print(df['A1'])

A    0.165899
B    0.144862
C    0.974517
D    0.144633
E    0.806085
Name: A1, dtype: float64

In this way, we were able to extract the columns.

You can use the loc method to specify the index and extract.

df = pd.DataFrame(data=np.random.rand(5, 5),
                  index=['A', 'B', 'C', 'D', 'E'],
                  columns=('A1', 'A2', 'A3', 'A4', 'A5'))
np.random.seed(0)
print(df.loc['A'])

A1    0.687867
A2    0.243104
A3    0.568371
A4    0.125892
A5    0.749777
Name: A, dtype: float64

I wrote that data is extracted by specifying the index, but indexing a DataFrame with loc is quite similar to indexing a two-dimensional numpy array.

You can write `loc[row, column]`.

Let's see how to use it below.

print(df.loc[:, 'A1'])

A    0.108650
B    0.819086
C    0.250341
D    0.950634
E    0.852035
Name: A1, dtype: float64

Since `:` is given for the row part, all rows are selected, and since `A1` is given for the column, the A1 column is extracted.

print(df.loc['C', ['A2', 'A4']])

A2    0.129296
A4    0.367573
Name: C, dtype: float64

In this way, you can extract the values of columns A2 and A4 from row C.

Selection by condition

Let's select data from the DataFrame using conditions. First, let's check the behavior of `df > 0.5` with the following code.

df = pd.DataFrame(data=np.random.rand(5, 5),
                  index=['A', 'B', 'C', 'D', 'E'],
                  columns=('A1', 'A2', 'A3', 'A4', 'A5'))
np.random.seed(0)
print(df)
print(df > 0.5)

In this way, True is stored when the value in the DataFrame satisfies the condition, and False is stored when the condition is not satisfied.

By using this boolean mask, you can mask out the values that do not meet the condition, as shown below (they become NaN).

df = pd.DataFrame(data=np.random.rand(5, 5),
                  index=['A', 'B', 'C', 'D', 'E'],
                  columns=('A1', 'A2', 'A3', 'A4', 'A5'))
np.random.seed(0)
print(df > 0.5)
print(df[df > 0.5])

You can also extract only the rows where a specific column satisfies a condition, as follows.

df = pd.DataFrame(data=np.random.rand(5, 5),
                  index=['A', 'B', 'C', 'D', 'E'],
                  columns=('A1', 'A2', 'A3', 'A4', 'A5'))
np.random.seed(0)
print(df)
print(df[df['A3'] > 0.5])

You can also combine multiple conditions using &, as shown below.

df = pd.DataFrame(data=np.random.rand(5, 5),
                  index=['A', 'B', 'C', 'D', 'E'],
                  columns=('A1', 'A2', 'A3', 'A4', 'A5'))
np.random.seed(0)
print(df)
print(df[(df['A3'] > 0.2) & (df['A3'] < 0.6)])

Add / Remove Data in DataFrame

You can add columns by doing the following.

df = pd.DataFrame(data=np.random.rand(5, 5),
                  index=['A', 'B', 'C', 'D', 'E'],
                  columns=('A1', 'A2', 'A3', 'A4', 'A5'))
np.random.seed(0)
df['new'] = np.arange(5)
print(df)

You can delete columns by specifying their names.

df = pd.DataFrame(data=np.random.rand(5, 5),
                  index=['A', 'B', 'C', 'D', 'E'],
                  columns=('A1', 'A2', 'A3', 'A4', 'A5'))
np.random.seed(0)
df = df.drop(columns=['A1', 'A3'])
print(df)

You can delete rows by specifying their index labels.

df = pd.DataFrame(data=np.random.rand(5, 5),
                  index=['A', 'B', 'C', 'D', 'E'],
                  columns=('A1', 'A2', 'A3', 'A4', 'A5'))
np.random.seed(0)
df = df.drop(index=['A', 'D'])
print(df)

Handling of missing values

Let's prepare the data as follows.

df = pd.DataFrame([[1, 2, 3, np.nan, 5],
                   [np.nan, 7, 8, 9, 10],
                   [11, np.nan, 13, 14, 15],
                   [16, 17, np.nan, 19, 20],
                   [21, 22, 23, 24, np.nan]],
                  index=['A', 'B', 'C', 'D', 'E'],
                  columns=['A1', 'A2', 'A3', 'A4', 'A5'])
print(df)

You can use the dropna method to drop any row that contains a missing value.

df = df.dropna()
print(df)

Empty DataFrame
Columns: [A1, A2, A3, A4, A5]
Index: []

This time every row contained at least one missing value, so all of them disappeared. Applying such a strict condition often leaves very little data.

Remove missing values in a particular column

You can remove the rows that have a missing value in a particular column by doing the following:

df = pd.DataFrame([[1, 2, 3, np.nan, 5],
                   [np.nan, 7, 8, 9, 10],
                   [11, np.nan, 13, 14, 15],
                   [16, 17, np.nan, 19, 20],
                   [21, 22, 23, 24, np.nan]],
                  index=['A', 'B', 'C', 'D', 'E'],
                  columns=['A1', 'A2', 'A3', 'A4', 'A5'])
df = df[df['A3'].isnull() == False]
print(df)

`isnull` returns True where the data is NaN and False where it is not. Therefore, as shown above, you can delete only the rows that have a missing value in A3 (the more idiomatic `df['A3'].notnull()` does the same thing).

Delete by specifying the required number of non-missing values

By specifying the `thresh` argument of dropna, you can keep only the rows that have at least the specified number of non-missing values; all other rows are deleted.

For example, `thresh=4` drops every row that has fewer than 4 non-missing values.

df = pd.DataFrame([[1, 2, 3, np.nan, 5],
                   [np.nan, 7, 8, 9, 10],
                   [11, np.nan, 13, 14, 15],
                   [16, np.nan, np.nan, 19, 20],
                   [21, 22, 23, 24, np.nan]],
                  index=['A', 'B', 'C', 'D', 'E'],
                  columns=['A1', 'A2', 'A3', 'A4', 'A5'])
df = df.dropna(thresh=4)
print(df)

You can do the same for columns by setting `axis=1`.

df = pd.DataFrame([[1, 2, 3, np.nan, 5],
                   [np.nan, 7, 8, 9, 10],
                   [11, np.nan, 13, 14, 15],
                   [16, np.nan, np.nan, 19, 20],
                   [21, 22, 23, 24, np.nan]],
                  index=['A', 'B', 'C', 'D', 'E'],
                  columns=['A1', 'A2', 'A3', 'A4', 'A5'])
df = df.dropna(thresh=4, axis=1)
print(df)

Replace missing value with another value

For a particular column, you can fill its missing values with the mean of that column:

df = pd.DataFrame([[1, 2, 3, np.nan, 5],
                   [np.nan, 7, 8, 9, 10],
                   [11, np.nan, 13, 14, 15],
                   [16, np.nan, np.nan, 19, 20],
                   [21, 22, 23, 24, np.nan]],
                  index=['A', 'B', 'C', 'D', 'E'],
                  columns=['A1', 'A2', 'A3', 'A4', 'A5'])
df['A3'] = df['A3'].fillna(df['A3'].mean())
print(df)

You can fill the missing values in every column with that column's mean by doing the following.

df = pd.DataFrame([[1, 2, 3, np.nan, 5],
                   [np.nan, 7, 8, 9, 10],
                   [11, np.nan, 13, 14, 15],
                   [16, np.nan, np.nan, 19, 20],
                   [21, 22, 23, 24, np.nan]],
                  index=['A', 'B', 'C', 'D', 'E'],
                  columns=['A1', 'A2', 'A3', 'A4', 'A5'])
df = df.fillna(df.mean())
print(df)

Missing values for categorical data

Let's create the following DataFrame.

df = pd.DataFrame({'A1': ['A', 'A', 'B', 'B', 'B', 'C', np.nan],
                   'A2': [1, 2, 3, 4, 5, 6, 7],
                   'A3': [8, 9, 10, 11, 12, 13, 14]})
print(df)

    A1  A2  A3
0    A   1   8
1    A   2   9
2    B   3  10
3    B   4  11
4    B   5  12
5    C   6  13
6  NaN   7  14

Let's check the categories and how many rows belong to each with the code below.

print(df['A1'].value_counts())

B    3
A    2
C    1

You can retrieve only the data of a specific category with the following code.

print(df[df['A1'] == 'B'])

  A1  A2  A3
2  B   3  10
3  B   4  11
4  B   5  12

You can fill in the missing values of categorical data with the following code. `mode()` returns a Series (since there can be ties), so `mode()[0]` takes the first mode, and that value is assigned to the missing entries.

df = pd.DataFrame({'A1': ['A', 'A', 'B', 'B', 'B', 'C', np.nan],
                   'A2': [1, 2, 3, 4, 5, 6, 7],
                   'A3': [8, 9, 10, 11, 12, 13, 14]})
df['A1'] = df['A1'].fillna(df['A1'].mode()[0])
print(df)

  A1  A2  A3
0  A   1   8
1  A   2   9
2  B   3  10
3  B   4  11
4  B   5  12
5  C   6  13
6  B   7  14

Let's calculate the percentage of categorical data with the code below.

df = pd.DataFrame({'A1': ['A', 'A', 'B', 'B', 'B', 'C', np.nan],
                   'A2': [1, 2, 3, 4, 5, 6, 7],
                   'A3': [8, 9, 10, 11, 12, 13, 14]})
print(round(df['A1'].value_counts() / len(df), 3))

B    0.429
A    0.286
C    0.143
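A similar ratio can also be obtained with the normalize option of value_counts; a small sketch is below. Note that value_counts ignores NaN by default, so the denominator is the number of non-missing values rather than len(df), and the ratios come out slightly larger than the ones above.

import numpy as np
import pandas as pd

df = pd.DataFrame({'A1': ['A', 'A', 'B', 'B', 'B', 'C', np.nan],
                   'A2': [1, 2, 3, 4, 5, 6, 7],
                   'A3': [8, 9, 10, 11, 12, 13, 14]})
# Ratios among the non-missing values: B 0.5, A 0.333..., C 0.166...
print(df['A1'].value_counts(normalize=True))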

You can use the code below to group categorical data and calculate statistics.

df = pd.DataFrame({'A1': ['A', 'A', 'B', 'B', 'B', 'C', np.nan],
                   'A2': [1, 2, 3, 4, 5, 6, 7],
                   'A3': [8, 9, 10, 11, 12, 13, 14]})
print(df.groupby('A1').sum())
print(df.groupby('A1').mean())

    A2  A3
A1
A    3  17
B   12  33
C    6  13
     A2    A3
A1
A   1.5   8.5
B   4.0  11.0
C   6.0  13.0

DataFrame Join

DataFrames can be concatenated with pd.concat. By default `axis=0`, so they are stacked vertically.

df1 = pd.DataFrame(data=np.random.rand(3, 3),
                   index=['A', 'B', 'C'],
                   columns=['A1', 'A2', 'A3'])
df2 = pd.DataFrame(data=np.random.rand(3, 3),
                   index=['D', 'E', 'F'],
                   columns=['A1', 'A2', 'A3'])

df3 = pd.concat([df1, df2])
print(df3)

If you specify `axis=1` as shown below, you can combine them horizontally.

You need to match the columns when joining vertically and the `index` when joining horizontally; labels that do not match are still kept, and the missing cells are filled with NaN (the default outer join). See the small sketch after the next example.

df1 = pd.DataFrame(data=np.random.rand(3, 3),
                   index=['A', 'B', 'C'],
                   columns=['A1', 'A2', 'A3'])
df2 = pd.DataFrame(data=np.random.rand(3, 3),
                   index=['A', 'B', 'C'],
                   columns=['A4', 'A5', 'A6'])

df3 = pd.concat([df1, df2], axis=1)
print(df3)
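As noted above, here is a minimal sketch (with small made-up frames df_a and df_b) of what happens when the labels do not line up: with the default outer join, pd.concat keeps all labels and fills the non-overlapping cells with NaN.

import pandas as pd

df_a = pd.DataFrame({'A1': [1, 2]}, index=['A', 'B'])
df_b = pd.DataFrame({'A2': [3, 4]}, index=['B', 'C'])

# Vertical concat with different columns: the missing cells become NaN.
print(pd.concat([df_a, df_b]))

# Horizontal concat with partially different indexes: again NaN where data is absent.
print(pd.concat([df_a, df_b], axis=1))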

Applying functions to DataFrame

You can apply a function to specific data by using `apply`.

df = pd.DataFrame(data=np.random.rand(3, 3),
                  index=['A', 'B', 'C'],
                  columns=['A1', 'A2', 'A3'])

print(df)
df['A1'] = df['A1'].apply(lambda x: x ** 2)
print(df)

When you want to apply a function that uses several columns at once, it is convenient to define a function that takes a whole row (passed as a Series) and apply it with `axis=1`.

df = pd.DataFrame(data=np.random.rand(3, 3),
                  index=['A', 'B', 'C'],
                  columns=['A1', 'A2', 'A3'])
print(df)

def matmul(df):
    # Each row is passed in as a Series; this returns the product of its
    # A1 and A2 values (an element-wise product of the two columns,
    # not a matrix multiplication despite the name).
    return df['A1'] * df['A2']

df['A4'] = df.apply(matmul, axis=1)
print(df)

If you have multiple return values, you can receive them by doing the following.

df = pd.DataFrame(data=np.random.rand(3, 3),
                  index=['A', 'B', 'C'],
                  columns=['A1', 'A2', 'A3'])

def square_and_twice(x):
    return pd.Series([x**2, x*2])
df[['square', 'twice']] = df['A3'].apply(square_and_twice)
print(df)

At the end

This is the end of this article.

Thank you for reading.
