[PYTHON] I will explain how to use Pandas in an easy-to-understand manner.

About this article

This article explains how to use Pandas in an easy-to-understand way. If you read it through carefully, you will have the basics covered.

Understanding CSV files before getting started with Pandas

If you're a complete beginner, learn about CSV files before you start studying Pandas.

What is a CSV file?

CSV (comma-separated values) is, as the name says, a file format in which values are separated by commas (,). Let's look at a concrete example. Suppose you have a file like the one below.

Language used,Years of experience,annual income
Python,10,"¥60,000,000.00"
Ruby,2,"¥3,500,000.00"
Swift,4,"¥5,000,000.00"

If you open this with Excel or Google Sheets, it is displayed as a table, with one row per line and one column per comma-separated value.

Conclusion

The only thing you need to keep in mind is that a CSV file is like a comma-delimited version of an Excel file.
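To see what "separated by commas" means in practice, here is a minimal sketch that parses the sample above with Python's standard csv module (not Pandas yet). The variable names are only for illustration. Note that the quoted annual income stays a single value even though it contains commas.

```python
import csv
import io

# The sample CSV from above, held in a string instead of a file.
csv_text = '''Language used,Years of experience,annual income
Python,10,"¥60,000,000.00"
Ruby,2,"¥3,500,000.00"
Swift,4,"¥5,000,000.00"
'''

# csv.reader splits each line on commas but keeps quoted fields intact.
rows = list(csv.reader(io.StringIO(csv_text)))

header = rows[0]     # the first line holds the column names
records = rows[1:]   # the remaining lines hold the data

print(header)
for record in records:
    print(record)
```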

・ What is Pandas?
・ Installation procedure
・ Basic data types
・ How to retrieve data (loc, iloc, head, tail, etc.)
・ Data reading and output
・ Data sorting
・ Processing of missing values
・ Manipulating data
・ Series edition
・ DataFrame
・ Statistical processing

What is Pandas

Pandas is a library for efficient data analysis in Python. That is rather abstract, so let me be concrete. When you do machine learning or data analysis, the raw data is often not organized in a form suitable for training. Pandas lets you conveniently reshape that data. The processing done before machine learning is called data preprocessing. When you think of data preprocessing, think of Pandas! Please keep that in mind.

Installation procedure

If you installed Python using Anaconda, Pandas is probably already installed. If it is not, install it with pip:

pip install pandas

Use Pandas

When using Pandas, you need to load the Pandas library.

import pandas as pd

Typing pandas every time is tedious, so it is conventionally imported under the alias pd.

Data types (Series and DataFrame)

Series

A Series is a data type with only one column. Put simply, it is a one-dimensional data structure.

import pandas as pd

l = [1,2,3,4,5]
series = pd.Series(l)
print(series)
==========>
0    1
1    2
2    3
3    4
4    5
dtype: int64

The number on the left is the index (row label) and the number on the right is the series data.
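The index does not have to be 0, 1, 2, and so on. As a minimal sketch (the labels 'a' to 'e' are made up for illustration), you can pass your own row labels and then inspect the index and the data separately:

```python
import pandas as pd

# Hypothetical labels 'a'..'e' replace the default 0..4 index.
series = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

print(series.index.tolist())   # the row labels
print(series.values.tolist())  # the data
print(series['c'])             # look up a value by its label
```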

DataFrame

A DataFrame is a two-dimensional labeled data structure, and it is the most commonly used data structure in Pandas. It is easy to understand if you picture a table in Excel or a spreadsheet.


import pandas as pd
df = pd.DataFrame({
    'Program language' :['Python', 'Ruby', 'Go'],
    'Years of experience' : [1, 1, 2],
    'annual income' : [3000000, 2800000, 16900000]
    })
print(df)
===========>
   Program language  Years of experience  annual income
0            Python                    1        3000000
1              Ruby                    1        2800000
2                Go                    2       16900000


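By the way, a DataFrame keeps its rows in the order you supply them; it is not automatically sorted by the row label (index). In older environments (Pandas before 0.23, or Python before 3.6), the columns of a DataFrame built from a dict could come out sorted by column name, so the column order may differ from the order you wrote.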
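If you want to check what order a DataFrame actually has, a minimal sketch like the following prints its row labels, column names, and shape. It assumes a recent Pandas on Python 3.7 or later, where the insertion order of a dict is preserved.

```python
import pandas as pd

df = pd.DataFrame({
    'Program language': ['Python', 'Ruby', 'Go'],
    'Years of experience': [1, 1, 2],
    'annual income': [3000000, 2800000, 16900000],
})

print(df.index.tolist())    # row labels
print(df.columns.tolist())  # column names
print(df.shape)             # (number of rows, number of columns)
```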

How to retrieve data

Series

For a Series, you can access a value directly by its row label.


import pandas as pd

l = [1,2,3,4,5]
series = pd.Series(l)
print(series[1])
==========>
2
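The example above works because the default row labels 0, 1, 2, ... happen to match the positions. As a sketch of the difference, assuming a Series whose labels deliberately do not match the positions (the data is made up), .loc looks up by label and .iloc looks up by position:

```python
import pandas as pd

# Labels deliberately differ from positions (hypothetical example data).
series = pd.Series([10, 20, 30], index=[3, 2, 1])

by_label = series.loc[1]      # the value whose row label is 1
by_position = series.iloc[1]  # the second value, counting from 0

print(by_label)
print(by_position)
```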

DataFrame

This is where it gets tricky. There are several ways to retrieve data, so let's look at them in order. As a premise, assume that you have the following data.

import pandas as pd
df = pd.DataFrame({
    'Program language' :['Python', 'Python','Ruby', 'Go','C#','C#'],
    'Years of experience' : [1, 1, 2, 3, 1,3],
    'annual income' : [3000000, 2800000, 16900000,1230000,2000000,500000],
    'age' : [21,22,34,55,11,8]
    })
print(df)
============>
   Program language  Years of experience  annual income  age
0            Python                    1        3000000   21
1            Python                    1        2800000   22
2              Ruby                    2       16900000   34
3                Go                    3        1230000   55
4                C#                    1        2000000   11
5                C#                    3         500000    8


Get a specific column

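print(df['Program language'])
# Attribute access (for example df.age) gives a similar result, but only when the
# column name is a valid Python identifier, so it does not work for 'Program language'.
=================>
0    Python
1    Python
2      Ruby
3        Go
4        C#
5        C#
Name: Program language, dtype: object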
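As a small sketch building on the same data, passing a list of column names to df[] returns several columns at once as a DataFrame, while a single name returns a Series. The variable names are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Program language': ['Python', 'Python', 'Ruby', 'Go', 'C#', 'C#'],
    'Years of experience': [1, 1, 2, 3, 1, 3],
    'annual income': [3000000, 2800000, 16900000, 1230000, 2000000, 500000],
    'age': [21, 22, 34, 55, 11, 8],
})

one_column = df['age']                         # a single key gives a Series
two_columns = df[['Program language', 'age']]  # a list of keys gives a DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
print(two_columns.shape)
```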

Get a specific row

print(df[0:2])
===============>
   Program language  Years of experience  annual income  age
0            Python                    1        3000000   21
1            Python                    1        2800000   22

This looks like the syntax for getting a column, so let me explain in detail. If you pass an ordinary key to df[], Pandas treats it as a column name. If you pass a slice, as in df[0:2], Pandas selects rows instead. With integer bounds the slice works by position, which is why row 2 is not included here.
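The following sketch puts the two behaviors of df[] side by side on a smaller, made-up table. It also shows that a bare integer such as df[0] is treated as a column name, so it fails when no column has that name.

```python
import pandas as pd

df = pd.DataFrame({
    'Program language': ['Python', 'Python', 'Ruby'],
    'age': [21, 22, 34],
})

rows = df[0:2]      # a slice selects rows
column = df['age']  # a plain key selects a column

# A bare integer is treated as a column name, and no column is named 0 here.
try:
    df[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(len(rows), column.tolist(), lookup_failed)
```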

Retrieving specific "columns and rows" (loc and iloc)

This time we specify both rows and columns.

loc

Basic usage of loc: loc[row label, column label]. With loc, you specify the row name and the column name.

iloc

Basic usage of iloc: iloc[row number, column number]. With iloc, you specify the row number and the column number, both counted from 0.

Let's actually run it

import pandas as pd
df = pd.DataFrame({
    'Program language' :['Python', 'Python','Ruby', 'Go','C#','C#'],
    'Years of experience' : [1, 1, 2, 3, 1,3],
    'annual income' : [3000000, 2800000, 16900000,1230000,2000000,500000],
    'age' : [21,22,34,55,11,8]
    })

print(df.loc[0:2, 'Program language'])  # loc slices by row label, so the end of the slice (label 2) is included.
print(df.iloc[0:2, 0])  # iloc slices by position, so the end of the slice is not included!
=================>
0    Python
1    Python
2      Ruby
Name: Program language, dtype: object
0    Python
1    Python
Name: Program language, dtype: object

Read the comments in the code first: the two outputs differ because loc includes the end of the slice and iloc does not. By the way, if you access a column that does not exist, Pandas raises a KeyError; NaN is not returned.
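The difference between loc and iloc is easier to see when the row labels are not numbers. The sketch below assumes made-up row labels 'a', 'b', 'c'. It also checks what happens, in recent versions of Pandas, when loc is asked for a column label that does not exist ('salary' is a hypothetical name).

```python
import pandas as pd

# Hypothetical row labels 'a', 'b', 'c' make the difference easier to see.
df = pd.DataFrame(
    {'Program language': ['Python', 'Ruby', 'Go'], 'age': [21, 34, 55]},
    index=['a', 'b', 'c'],
)

by_label = df.loc['a':'b', 'Program language']  # labels; 'b' is included
by_position = df.iloc[0:2, 0]                   # positions; position 2 is excluded

print(by_label.tolist())
print(by_position.tolist())

# Asking loc for a column label that does not exist raises KeyError.
try:
    df.loc['a':'b', 'salary']
    missing_raises = False
except KeyError:
    missing_raises = True

print(missing_raises)
```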

head() and tail()

head() returns the first 5 rows, and tail() returns the last 5 rows.

print(df.head())
==================>
   Program language  Years of experience  annual income  age
0            Python                    1        3000000   21
1            Python                    1        2800000   22
2              Ruby                    2       16900000   34
3                Go                    3        1230000   55
4                C#                    1        2000000   11

print(df.tail())
==================>
   Program language  Years of experience  annual income  age
1            Python                    1        2800000   22
2              Ruby                    2       16900000   34
3                Go                    3        1230000   55
4                C#                    1        2000000   11
5                C#                    3         500000    8

# You can specify how many rows to return with an argument.
print(df.head(2))
====================>
   Program language  Years of experience  annual income  age
0            Python                    1        3000000   21
1            Python                    1        2800000   22

print(df.tail(2))
=====================>
   Program language  Years of experience  annual income  age
4                C#                    1        2000000   11
5                C#                    3         500000    8

Extract rows by specifying conditions (query)

With query(), you can give a condition on the values of a DataFrame and extract the rows that satisfy it. The condition is usually written with a comparison operator.

import pandas as pd
df = pd.DataFrame({
    'Program language' :['Python', 'Python','Ruby', 'Go','C#','C#'],
    'Years of experience' : [1, 1, 2, 3, 1,3],
    'annual income' : [3000000, 2800000, 16900000,1230000,2000000,500000],
    'age' : [21,22,34,55,11,8]
    })
# A column name that contains spaces must be wrapped in backticks (Pandas 0.25 or later).
print(df.query('`Years of experience` <= 2'))
========================>
   Program language  Years of experience  annual income  age
0            Python                    1        3000000   21
1            Python                    1        2800000   22
2              Ruby                    2       16900000   34
4                C#                    1        2000000   11
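The same kind of filtering can also be written without query(), using boolean indexing. This is a sketch of an alternative, not a replacement, and the variable names are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Program language': ['Python', 'Python', 'Ruby', 'Go', 'C#', 'C#'],
    'Years of experience': [1, 1, 2, 3, 1, 3],
    'annual income': [3000000, 2800000, 16900000, 1230000, 2000000, 500000],
    'age': [21, 22, 34, 55, 11, 8],
})

# Boolean indexing: build a True/False mask, then keep the rows where it is True.
mask = df['Years of experience'] <= 2
juniors = df[mask]

# Conditions can be combined with & (and) and | (or); each one needs parentheses.
young_juniors = df[(df['Years of experience'] <= 2) & (df['age'] < 30)]

print(juniors.index.tolist())
print(young_juniors.index.tolist())
```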

Data input / output

Pandas can read data from files and write data back out to files after you have manipulated it. Here, I will only introduce the functions.

import pandas as pd

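# header and sep are optional; the values shown are the usual ones for a file with a header row.
df = pd.read_csv('file name', header=0, sep=',')     # read_csv: the default delimiter is ","
df = pd.read_table('file name', header=0, sep='\t')  # read_table: the default delimiter is "\t"

# Output is done with methods of the DataFrame:
df.to_csv('file name')
df.to_excel('file name')
df.to_html('file name')
# And so on.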
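As a minimal sketch of a write-then-read round trip, the example below uses an in-memory text buffer (io.StringIO) in place of a real file so that it runs anywhere. The data and variable names are made up for illustration.

```python
import io

import pandas as pd

df = pd.DataFrame({
    'Program language': ['Python', 'Ruby', 'Go'],
    'Years of experience': [1, 1, 2],
})

# Write the DataFrame as CSV text. index=False leaves out the row labels.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Read the same text back into a new DataFrame.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored.equals(df))
```

With a real file, you would pass a path such as 'data.csv' (a hypothetical name) instead of the buffer.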

Sorting data

There are two main methods.

Sort by index (row names / column names) ... sort_index()
Sort by the values in a column ... sort_values()

import pandas as pd
df = pd.DataFrame({
    'Program language' :['Python', 'Python','Ruby', 'Go','C#','C#'],
    'Years of experience' : [1, 1, 2, 3, 1,3],
    'annual income' : [3000000, 2800000, 16900000,1230000,2000000,500000],
    'age' : [21,22,34,55,11,8]
    })

print(df.sort_index(ascending=False))
===============================>
   Program language  Years of experience  annual income  age
5                C#                    3         500000    8
4                C#                    1        2000000   11
3                Go                    3        1230000   55
2              Ruby                    2       16900000   34
1            Python                    1        2800000   22
0            Python                    1        3000000   21

print(df.sort_values(by="annual income"))
=================================>
   Program language  Years of experience  annual income  age
5                C#                    3         500000    8
3                Go                    3        1230000   55
4                C#                    1        2000000   11
1            Python                    1        2800000   22
0            Python                    1        3000000   21
2              Ruby                    2       16900000   34
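sort_values() can also take several columns at once. The following sketch, using the same data, sorts by years of experience in ascending order and breaks ties by annual income in descending order. It then renumbers the rows with reset_index(), which is an optional extra step.

```python
import pandas as pd

df = pd.DataFrame({
    'Program language': ['Python', 'Python', 'Ruby', 'Go', 'C#', 'C#'],
    'Years of experience': [1, 1, 2, 3, 1, 3],
    'annual income': [3000000, 2800000, 16900000, 1230000, 2000000, 500000],
    'age': [21, 22, 34, 55, 11, 8],
})

# Sort by years of experience (ascending), then by annual income (descending) within ties.
ranked = df.sort_values(
    by=['Years of experience', 'annual income'],
    ascending=[True, False],
)
print(ranked.index.tolist())

# reset_index(drop=True) renumbers the rows 0, 1, 2, ... after sorting.
renumbered = ranked.reset_index(drop=True)
print(renumbered.index.tolist())
```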

Handling of missing values

You will come across many missing values in data analysis and machine learning. A missing value is a part of the data that is absent (for example, an unanswered question in a survey). Coming soon...
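As a minimal sketch of what handling missing values can look like (the survey data is made up, and this is only one possible approach), Pandas can count, drop, or fill them with isna(), dropna(), and fillna():

```python
import pandas as pd

# Hypothetical survey data; None marks an unanswered question.
df = pd.DataFrame({
    'Program language': ['Python', 'Ruby', 'Go'],
    'Years of experience': [1, None, 2],
})

missing_per_column = df.isna().sum()            # count missing values in each column
dropped = df.dropna()                           # remove rows that contain a missing value
filled = df.fillna({'Years of experience': 0})  # or replace them with a chosen value

print(missing_per_column.to_dict())
print(len(dropped))
print(filled['Years of experience'].tolist())
```

Whether dropping or filling is appropriate depends on the data, so treat the choice of 0 here as a placeholder.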