[PYTHON] I will explain how to use Pandas in an easy-to-understand manner.

About this article

This article explains how to use Pandas in an easy-to-understand manner. If you read it through carefully, you should be fine.

Understanding CSV files before getting started with Pandas

If you're a complete beginner, read this short explanation of CSV files before you start studying Pandas.

What is a CSV file?

CSV (comma-separated values) is, as the name says, a file format in which values are separated by commas (,). Let's look at a concrete example. Suppose you have a file like the one below.

Language used,Years of experience,annual income
Python,10,"¥60,000,000.00"
Ruby,2,"¥3,500,000.00"
Swift,4,"¥5,000,000.00"

If you open this in Excel or Google Sheets, it is displayed as a table with three columns and one row per language.

Conclusion: the only thing you need to keep in mind is that a CSV file is like a comma-delimited version of an Excel sheet.
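To see why the income values are wrapped in quotes, here is a minimal sketch that parses the same text with Python's standard csv module. The variable names (csv_text, rows) are made up for this illustration. A quoted field stays a single value even though it contains commas.

```python
import csv
import io

# The same CSV text as above (csv_text is a name chosen for this sketch).
csv_text = '''Language used,Years of experience,annual income
Python,10,"¥60,000,000.00"
Ruby,2,"¥3,500,000.00"
Swift,4,"¥5,000,000.00"
'''

# csv.reader splits each line on commas but keeps quoted fields intact.
rows = list(csv.reader(io.StringIO(csv_text)))

header, first_row = rows[0], rows[1]
print(header)     # the three column names
print(first_row)  # three values; the income is one field despite its commas
```

Without the quotes, the commas inside the amount would be read as column separators.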

Table of contents

・ What is Pandas?
・ Installation procedure
・ Basic data types
・ How to retrieve data (loc, iloc, head, tail, etc.)
・ Data reading and output
・ Data sorting
・ Processing of missing values
・ Manipulating data
・ Series edition
・ DataFrame
・ Statistical processing

What is Pandas

Pandas is a library for efficient data analysis in Python. That description is rather abstract, so here is a concrete picture. When you do machine learning or data analysis, the raw data is often not organized in a form suitable for training. Pandas lets you conveniently shape that data. This processing done before machine learning is called data preprocessing. Just remember: for data preprocessing, use Pandas!

Installation procedure

If you installed Python using Anaconda, you probably already have Pandas installed. If not, install it with pip:

pip install pandas

Use Pandas

When using Pandas, you need to load the Pandas library.

import pandas as pd

Typing pandas every time is tedious, so it is generally imported under the short alias pd.

Data types (Series and DataFrame)

Series

A Series is a data type with only one column. Put simply, it is a one-dimensional data structure.

import pandas as pd

l = [1,2,3,4,5]
series = pd.Series(l)
print(series)
==========>
0    1
1    2
2    3
3    4
4    5
dtype: int64

The number on the left is the index (row label) and the number on the right is the series data.
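The index does not have to be 0, 1, 2, and so on. As a small sketch (the labels 'a' to 'e' are chosen only for this illustration), you can pass your own row labels through the index argument and then look up elements by label.

```python
import pandas as pd

# Same values as above, but with explicit row labels instead of 0..4.
series = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

print(series.index.tolist())   # the row labels
print(series.values.tolist())  # the data
print(series['c'])             # look up one element by its label
```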

DataFrame

A DataFrame is a two-dimensional labeled data structure, and it is the most commonly used data structure in Pandas. It is easy to understand if you picture the data in an Excel sheet or a spreadsheet.


import pandas as pd
df = pd.DataFrame({
    'Program language' :['Python', 'Ruby', 'Go'],
    'Years of experience' : [1, 1, 2],
    'annual income' : [3000000, 2800000, 16900000]
    })
print(df)
===========>
   Program language  Years of experience  annual income
0            Python                    1        3000000
1              Ruby                    1        2800000
2                Go                    2       16900000


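By the way, when you build a DataFrame from a dict, older environments (pandas before 0.23, or Python before 3.6) sorted the column labels, so the column order could differ from the order you wrote. Current versions keep the order in which the keys were written.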
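If you want to check what a DataFrame looks like without printing all of it, a few attributes help. The following is a minimal sketch using the same small table; it only reads information and changes nothing.

```python
import pandas as pd

df = pd.DataFrame({
    'Program language': ['Python', 'Ruby', 'Go'],
    'Years of experience': [1, 1, 2],
    'annual income': [3000000, 2800000, 16900000],
})

print(df.shape)             # (number of rows, number of columns)
print(df.columns.tolist())  # column labels
print(df.index.tolist())    # row labels
print(df.dtypes)            # data type of each column
```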

How to retrieve data

Series

For a Series, you can access an element directly by its row label.


import pandas as pd

l = [1,2,3,4,5]
series = pd.Series(l)
print(series[1])
==========>
2
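With the default index, the label and the position are the same number, which hides the difference between them. Here is a small sketch with labels 10, 20, 30 (picked only for this example) that separates access by label (loc) from access by position (iloc).

```python
import pandas as pd

# Labels deliberately differ from positions (10, 20, 30 instead of 0, 1, 2).
series = pd.Series(['a', 'b', 'c'], index=[10, 20, 30])

print(series.loc[20])   # by row label: the element labelled 20
print(series.iloc[0])   # by position: the first element
print(series[series.index >= 20].tolist())  # boolean filtering on the labels
```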

DataFrame

This is where it gets tricky. There are various ways to retrieve data, so let's look at them in order. As a premise, assume that you have the following data.

import pandas as pd
df = pd.DataFrame({
    'Program language' :['Python', 'Python','Ruby', 'Go','C#','C#'],
    'Years of experience' : [1, 1, 2, 3, 1,3],
    'annual income' : [3000000, 2800000, 16900000,1230000,2000000,500000],
    'age' : [21,22,34,55,11,8]
    })
print(df)
============>
   Program language  Years of experience  annual income  age
0            Python                    1        3000000   21
1            Python                    1        2800000   22
2              Ruby                    2       16900000   34
3                Go                    3        1230000   55
4                C#                    1        2000000   11
5                C#                    3         500000    8


Get a specific column

print(df['Program language'])
# Attribute access such as df.age also works, but only when the column name is a valid Python identifier (no spaces).
=================>
0    Python
1    Python
2      Ruby
3        Go
4        C#
5        C#
Name: Program language, dtype: object
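A single key returns a Series, as shown above. If you pass a list of keys instead, you get a DataFrame containing just those columns. A minimal sketch (the variable names one_column and two_columns are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'Program language': ['Python', 'Python', 'Ruby', 'Go', 'C#', 'C#'],
    'Years of experience': [1, 1, 2, 3, 1, 3],
    'annual income': [3000000, 2800000, 16900000, 1230000, 2000000, 500000],
    'age': [21, 22, 34, 55, 11, 8],
})

one_column = df['age']                         # a single key gives a Series
two_columns = df[['Program language', 'age']]  # a list of keys gives a DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
print(two_columns.shape)
print(df.age.equals(df['age']))  # attribute access works for identifier-like names
```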

Get a specific row

print(df[0:2])
===============>
   Program language  Years of experience  annual income  age
0            Python                    1        3000000   21
1            Python                    1        2800000   22

This needs a closer look, because the syntax looks just like getting a column. If you pass an ordinary key to df[], pandas treats it as a column name. If you pass a slice, as in df[0:2], pandas treats it as a selection of rows. With an integer slice the rows are chosen by position, so the end of the slice is not included.
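The rule above can be checked with a small sketch. The table is a shortened version of the one used so far, and got_key_error is a name invented for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Program language': ['Python', 'Python', 'Ruby'],
    'age': [21, 22, 34],
})

print(type(df['age']).__name__)  # a plain key selects a column
print(len(df[0:2]))              # a slice selects rows

# A bare integer is still treated as a column key, so df[0] fails
# here instead of returning the first row.
try:
    df[0]
    got_key_error = False
except KeyError:
    got_key_error = True
print(got_key_error)
```

To get a single row, use loc or iloc, which are described next.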

Retrieving specific "columns and rows" (loc and iloc)

This time, specify both rows and columns.

loc

Basic usage of loc: loc[row label, column label]. With loc, you specify the row name and the column name.

iloc

Basic usage of iloc: iloc[row number, column number]. With iloc, you specify the row position and the column position.

Try it out

import pandas as pd
df = pd.DataFrame({
    'Program language' :['Python', 'Python','Ruby', 'Go','C#','C#'],
    'Years of experience' : [1, 1, 2, 3, 1,3],
    'annual income' : [3000000, 2800000, 16900000,1230000,2000000,500000],
    'age' : [21,22,34,55,11,8]
    })

print(df.loc[0:2, 'Program language'])  # loc slices by row label, so the end label (2) is included
print(df.iloc[0:2, 0])  # iloc slices by position, so the end position (2) is not included
=================>
0    Python
1    Python
2      Ruby
Name: Program language, dtype: object
0    Python
1    Python
Name: Program language, dtype: object

Read the comments in the code: the two outputs differ because loc includes the end of the slice and iloc does not. By the way, in current versions of pandas, accessing a column that does not exist raises a KeyError rather than returning NaN.
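The difference between labels and positions is easier to see when the row labels are not numbers. The following sketch uses the string labels 'a', 'b', 'c', which are an assumption made only for this example.

```python
import pandas as pd

df = pd.DataFrame(
    {'Program language': ['Python', 'Ruby', 'Go'], 'age': [21, 34, 55]},
    index=['a', 'b', 'c'],  # string row labels chosen for this sketch
)

by_label = df.loc['a':'b', 'age']  # label slice: 'b' is included
by_position = df.iloc[0:2, 1]      # position slice: position 2 is excluded

print(by_label.tolist())
print(by_position.tolist())
print(df.loc['c', 'Program language'])  # one cell by row and column label
print(df.iloc[2, 0])                    # the same cell by position
```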

head() and tail()

head() returns the first 5 rows, and tail() returns the last 5 rows.

print(df.head())
==================>
   Program language  Years of experience  annual income  age
0            Python                    1        3000000   21
1            Python                    1        2800000   22
2              Ruby                    2       16900000   34
3                Go                    3        1230000   55
4                C#                    1        2000000   11

print(df.tail())
==================>
   Program language  Years of experience  annual income  age
1            Python                    1        2800000   22
2              Ruby                    2       16900000   34
3                Go                    3        1230000   55
4                C#                    1        2000000   11
5                C#                    3         500000    8

# You can specify how many rows to return with an argument.
print(df.head(2))
====================>
   Program language  Years of experience  annual income  age
0            Python                    1        3000000   21
1            Python                    1        2800000   22

print(df.tail(2))
=====================>
   Program language  Years of experience  annual income  age
4                C#                    1        2000000   11
5                C#                    3         500000    8
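One way to think about head(n) and tail(n) is as shortcuts for position slices with iloc. This sketch checks that on a one-column table made up for the example.

```python
import pandas as pd

df = pd.DataFrame({'age': [21, 22, 34, 55, 11, 8]})

print(df.head(2).equals(df.iloc[:2]))   # head(n) is the first n rows
print(df.tail(2).equals(df.iloc[-2:]))  # tail(n) is the last n rows
print(len(df.head(100)))                # asking for more rows than exist is fine
```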

Extract rows by specifying conditions (query)

By using query(), you can extract the rows whose values satisfy a condition. The condition is usually written with comparison operators. Column names that contain spaces must be enclosed in backticks.

import pandas as pd
df = pd.DataFrame({
    'Program language' :['Python', 'Python','Ruby', 'Go','C#','C#'],
    'Years of experience' : [1, 1, 2, 3, 1,3],
    'annual income' : [3000000, 2800000, 16900000,1230000,2000000,500000],
    'age' : [21,22,34,55,11,8]
    })
print(df.query('`Years of experience` <= 2'))  # backticks quote a column name that contains spaces
========================>
   Program language  Years of experience  annual income  age
0            Python                    1        3000000   21
1            Python                    1        2800000   22
2              Ruby                    2       16900000   34
4                C#                    1        2000000   11
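Conditions can be combined with and / or, and a Python variable can be referenced with @. The sketch below also writes the same filter as a boolean mask for comparison. The threshold max_years is a hypothetical value chosen for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'Program language': ['Python', 'Python', 'Ruby', 'Go', 'C#', 'C#'],
    'Years of experience': [1, 1, 2, 3, 1, 3],
    'age': [21, 22, 34, 55, 11, 8],
})

max_years = 2  # hypothetical threshold for this sketch

# Backticks quote a column name containing spaces; @ refers to a Python variable.
by_query = df.query('`Years of experience` <= @max_years and age >= 20')

# The same filter written as a boolean mask.
by_mask = df[(df['Years of experience'] <= max_years) & (df['age'] >= 20)]

print(by_query.equals(by_mask))
print(by_query.index.tolist())
```

Which style to use is mostly a matter of taste; query() tends to be shorter when several conditions are combined.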

Data input / output

Pandas can read data from files and write the data back out to a file after you have manipulated it. Here, I will only introduce the functions.

import pandas as pd

df = pd.read_csv('file name', header=0, sep=',')     # read_csv: the default delimiter is ","
df = pd.read_table('file name', header=0, sep='\t')  # read_table: the default delimiter is "\t"

# For output, call these methods on a DataFrame:
df.to_csv('file name')
df.to_excel('file name')
df.to_html('file name')
# And so on.
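To try reading and writing without creating a real file, you can use an in-memory text buffer. This is only a sketch: io.StringIO stands in for a file, and the names buffer, csv_text and restored are made up for the example.

```python
import io
import pandas as pd

df = pd.DataFrame({
    'Program language': ['Python', 'Ruby', 'Go'],
    'Years of experience': [1, 1, 2],
})

# Write to an in-memory text buffer instead of a real file.
buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False leaves the row labels out

csv_text = buffer.getvalue()
print(csv_text)

# Read the same text back into a new DataFrame.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored.equals(df))
```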

Sorting data

There are two main methods.

  1. Sort by the index (row labels or column labels) ... sort_index()
  2. Sort by the values in a column ... sort_values()

import pandas as pd
df = pd.DataFrame({
    'Program language' :['Python', 'Python','Ruby', 'Go','C#','C#'],
    'Years of experience' : [1, 1, 2, 3, 1,3],
    'annual income' : [3000000, 2800000, 16900000,1230000,2000000,500000],
    'age' : [21,22,34,55,11,8]
    })

print(df.sort_index(ascending=False))
===============================>
   Program language  Years of experience  annual income  age
5                C#                    3         500000    8
4                C#                    1        2000000   11
3                Go                    3        1230000   55
2              Ruby                    2       16900000   34
1            Python                    1        2800000   22
0            Python                    1        3000000   21

print(df.sort_values(by="annual income"))
=================================>
   Program language  Years of experience  annual income  age
5                C#                    3         500000    8
3                Go                    3        1230000   55
4                C#                    1        2000000   11
1            Python                    1        2800000   22
0            Python                    1        3000000   21
2              Ruby                    2       16900000   34
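Both methods accept a few more options. As a sketch, sort_values() can take several columns with a separate direction for each, and sort_index(axis=1) sorts the column labels instead of the row labels. The variable name by_two_keys is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Program language': ['Python', 'Python', 'Ruby', 'Go', 'C#', 'C#'],
    'Years of experience': [1, 1, 2, 3, 1, 3],
    'annual income': [3000000, 2800000, 16900000, 1230000, 2000000, 500000],
})

# Sort by experience (ascending), breaking ties by income (descending).
by_two_keys = df.sort_values(
    by=['Years of experience', 'annual income'],
    ascending=[True, False],
)
print(by_two_keys.index.tolist())

# axis=1 sorts the column labels; here in descending order.
print(df.sort_index(axis=1, ascending=False).columns.tolist())

# Neither call modifies df itself; both return a new DataFrame.
print(df.index.tolist())
```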

Handling of missing values

You will come across many missing values in data analysis and machine learning. Missing values are the parts of the data that are absent (for example, an unanswered item in a questionnaire). Coming soon...
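Until that part is written, here is a very small sketch of what handling missing values can look like. The questionnaire data is made up, and this is not a full treatment, only the three most basic calls: isna(), dropna() and fillna().

```python
import pandas as pd

# A made-up questionnaire where one answer is missing (NaN).
df = pd.DataFrame({
    'Program language': ['Python', 'Ruby', 'Go'],
    'Years of experience': [1, float('nan'), 2],
})

print(df.isna().sum())  # number of missing values per column
print(df.dropna())      # drop the rows that contain a missing value
print(df.fillna({'Years of experience': 0}))  # or replace them with a chosen value
```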
