[PYTHON] [Introduction to Data Scientists] Basics of scientific calculation, data processing, and how to use graph drawing library ♬ Basics of Pandas

Last night, I summarized [Introduction to Data Scientists] Basics of Scipy as the basis of scientific calculation, data processing, and how to use the graph drawing library, but tonight I will summarize the basics of Pandas. I will supplement the explanations in this book. 【Caution】 ["Data Scientist Training Course at the University of Tokyo"](https://www.amazon.co.jp/%E6%9D%B1%E4%BA%AC%E5%A4%A7%E5%AD%A6%E3 % 81% AE% E3% 83% 87% E3% 83% BC% E3% 82% BF% E3% 82% B5% E3% 82% A4% E3% 82% A8% E3% 83% B3% E3% 83 % 86% E3% 82% A3% E3% 82% B9% E3% 83% 88% E8% 82% B2% E6% 88% 90% E8% AC% 9B% E5% BA% A7-Python% E3% 81 % A7% E6% 89% 8B% E3% 82% 92% E5% 8B% 95% E3% 81% 8B% E3% 81% 97% E3% 81% A6% E5% AD% A6% E3% 81% B6 % E3% 83% 87% E2% 80% 95% E3% 82% BF% E5% 88% 86% E6% 9E% 90-% E5% A1% 9A% E6% 9C% AC% E9% 82% A6% I will read E5% B0% 8A / dp / 4839965250 / ref = tmm_pap_swatch_0? _ Encoding = UTF8 & qid = & sr =) and summarize the parts that I have some doubts or find useful. Therefore, I think the synopsis will be straightforward, but please read it, thinking that the content has nothing to do with this book.

Chapter 2-4 Pandas Basics

"Pandas is a convenient library for so-called pre-processing before modeling in Python (using machine learning etc.) ... You can perform operations such as spreadsheets and data extraction and retrieval."

2-4-1 Import Pandas Library

>>> import pandas as pd
>>> from pandas import Series, DataFrame
>>> pd.__version__
'1.0.3

2-4-2 How to use Series

"Series is like a one-dimensional array ..." "Like", what is it? So, if you look at the type below and output it, it looks like. .. ..

>>> sample_pandas_data = pd.Series([0,10,20,30,40,50,60,70,80,90])
>>> print(type(sample_pandas_data))
<class 'pandas.core.series.Series'>
>>> print(sample_pandas_data)
0     0
1    10
2    20
3    30
4    40
5    50
6    60
7    70
8    80
9    90
dtype: int64

<class'pandas.core.series.Series'> is indexed.

Convert from numpy array to pd.Series

According to the reference "Pandas is based on NumPy, so compatibility is very high."

>>> array = np.arange(0,100,10)
>>> array
array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90])
>>> series_sample = pd.Series(array)
>>> series_sample
0     0
1    10
2    20
3    30
4    40
5    50
6    60
7    70
8    80
9    90
dtype: int32

Specify dtype ='int64'

>>> array = np.arange(0,100,10, dtype = 'int64')
>>> array
array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90], dtype=int64)
>>> series_sample = pd.Series(array)
>>> series_sample
0     0
1    10
2    20
3    30
4    40
5    50
6    60
7    70
8    80
9    90
dtype: int64

【reference】 Differences between Pandas and NumPy and how to use them properly

pd.Series: Specify index
>>> sample_pandas_index_data = pd.Series([0,10,20,30,40,50,60,70,80,90], index = ['a','b','c','d','e','f','g','h','i','j'])
>>> sample_pandas_index_data
a     0
b    10
c    20
d    30
e    40
f    50
g    60
h    70
i    80
j    90
dtype: int64
>>> sample_pandas_index_data.index
Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='object')
>>> sample_pandas_index_data.values
array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90], dtype=int64)

Can be created from a numpy array.

>>> array0 = np.arange(0,100,10, dtype = 'int64')
>>> array1 = np.array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])
>>> sample_pandas_index_data2 = pd.Series(array0,index = array1)
>>> sample_pandas_index_data2
a     0
b    10
c    20
d    30
e    40
f    50
g    60
h    70
i    80
j    90
dtype: int64

2-4-3 How to use DataFrame

"DataFrame is a two-dimensional array ..." data is converted from dictionary format. The output is in tabular format.

>>> attri_data1 = {'ID':['100','101','102','103','104'],
...               'City':['Tokyo','Osaka','Kyoto','Hokkaido','Tokyo'],
...               'Birth_year':['1990','1989','1970','1954','2014'],
...                'Name':['Hiroshi','Akiko','Yuki','Satoru','Steve']}
>>> attri_data_frame1=DataFrame(attri_data1)
>>> attri_data_frame1
    ID      City Birth_year     Name
0  100     Tokyo       1990  Hiroshi
1  101     Osaka       1989    Akiko
2  102     Kyoto       1970     Yuki
3  103  Hokkaido       1954   Satoru
4  104     Tokyo       2014    Steve
>>> type(attri_data1)
<class 'dict'>
DataFrame: Specify index
>>> attri_data_frame1=DataFrame(attri_data1, index=['a','b','c','d','e'])
>>> attri_data_frame1
    ID      City Birth_year     Name
a  100     Tokyo       1990  Hiroshi
b  101     Osaka       1989    Akiko
c  102     Kyoto       1970     Yuki
d  103  Hokkaido       1954   Satoru
e  104     Tokyo       2014    Steve

2-4-4 Matrix operation

2-4-4-1 Transpose
>>> attri_data_frame1.T
                  a      b      c         d      e
ID              100    101    102       103    104
City          Tokyo  Osaka  Kyoto  Hokkaido  Tokyo
Birth_year     1990   1989   1970      1954   2014
Name        Hiroshi  Akiko   Yuki    Satoru  Steve
2-4-4-2 Extraction of specific columns
>>> attri_data_frame1.Birth_year
a    1990
b    1989
c    1970
d    1954
e    2014
Name: Birth_year, dtype: object
>>> attri_data_frame1[['ID','Birth_year']]
    ID Birth_year
a  100       1990
b  101       1989
c  102       1970
d  103       1954
e  104       2014

2-4-5 Data extraction

>>> attri_data_frame1[attri_data_frame1['City']=='Tokyo']
    ID   City Birth_year     Name
a  100  Tokyo       1990  Hiroshi
e  104  Tokyo       2014    Steve
>>> attri_data_frame1['City']=='Tokyo'
a     True
b    False
c    False
d    False
e     True
Name: City, dtype: bool
Specify multiple conditions
>>> attri_data_frame1[attri_data_frame1['City'].isin(['Tokyo','Osaka'])]
    ID   City Birth_year     Name
a  100  Tokyo       1990  Hiroshi
b  101  Osaka       1989    Akiko
e  104  Tokyo       2014    Steve

2-4-6 Delete and combine data

2-4-6-1 Delete columns and rows
drop (list of columns you want to drop, axis = 1)

** axis = 1 is a column **

>>> attri_data_frame1.drop(['Birth_year'], axis = 1)
    ID      City     Name
a  100     Tokyo  Hiroshi
b  101     Osaka    Akiko
c  102     Kyoto     Yuki
d  103  Hokkaido   Satoru
e  104     Tokyo    Steve
drop (list of rows you want to drop, axis = 0)

** axis = 0 is a line **

>>> attri_data_frame1.drop(['c','e'], axis = 0)
    ID      City Birth_year     Name
a  100     Tokyo       1990  Hiroshi
b  101     Osaka       1989    Akiko
d  103  Hokkaido       1954   Satoru

The above operation does not change the original data

>>> attri_data_frame1
    ID      City Birth_year     Name
a  100     Tokyo       1990  Hiroshi
b  101     Osaka       1989    Akiko
c  102     Kyoto       1970     Yuki
d  103  Hokkaido       1954   Satoru
e  104     Tokyo       2014    Steve

Replaced by the following option replace = True.

>>> attri_data_frame1.drop(['c','e'], axis = 0, inplace = True)
>>> attri_data_frame1
    ID      City Birth_year     Name
a  100     Tokyo       1990  Hiroshi
b  101     Osaka       1989    Akiko
d  103  Hokkaido       1954   Satoru

2-4-6-2 Data combination

Add data
>>> attri_data1 = {'ID':['100','101','102','103','104'],
...               'City':['Tokyo','Osaka','Kyoto','Hokkaido','Tokyo'],
...               'Birth_year':['1990','1989','1970','1954','2014'],
...                'Name':['Hiroshi','Akiko','Yuki','Satoru','Steve']}
>>> attri_data_frame1=DataFrame(attri_data1)
>>> attri_data_frame1
    ID      City Birth_year     Name
0  100     Tokyo       1990  Hiroshi
1  101     Osaka       1989    Akiko
2  102     Kyoto       1970     Yuki
3  103  Hokkaido       1954   Satoru
4  104     Tokyo       2014    Steve
>>> math_pt = [50, 43, 33,76,98]
>>> attri_data_frame1['Math']=math_pt
>>> attri_data_frame1
    ID      City Birth_year     Name  Math
0  100     Tokyo       1990  Hiroshi    50
1  101     Osaka       1989    Akiko    43
2  102     Kyoto       1970     Yuki    33
3  103  Hokkaido       1954   Satoru    76
4  104     Tokyo       2014    Steve    98
Data combination
>>> attri_data2 = {'ID':['100','101','102','105','107'],
...                'Math':[50, 43, 33,76,98],
...                'English':[90, 30, 20,50,30],
...                'Sex':['M', 'F', 'F', 'M', 'M']}
>>> attri_data_frame2=DataFrame(attri_data2)
>>> attri_data_frame2
    ID  Math  English Sex
0  100    50       90   M
1  101    43       30   F
2  102    33       20   F
3  105    76       50   M
4  107    98       30   M
>>> attri_data_frame1
    ID      City Birth_year     Name  Math
0  100     Tokyo       1990  Hiroshi    50
1  101     Osaka       1989    Akiko    43
2  102     Kyoto       1970     Yuki    33
3  103  Hokkaido       1954   Satoru    76
4  104     Tokyo       2014    Steve    98

Find the same key and merge it. The key is ID. .. ..

>>> pd.merge(attri_data_frame1,attri_data_frame2)
    ID   City Birth_year     Name  Math  English Sex
0  100  Tokyo       1990  Hiroshi    50       90   M
1  101  Osaka       1989    Akiko    43       30   F
2  102  Kyoto       1970     Yuki    33       20   F

pandas.merge

>>> pd.merge(attri_data_frame1,attri_data_frame2, how = 'outer')
    ID      City Birth_year     Name  Math  English  Sex
0  100     Tokyo       1990  Hiroshi    50     90.0    M
1  101     Osaka       1989    Akiko    43     30.0    F
2  102     Kyoto       1970     Yuki    33     20.0    F
3  103  Hokkaido       1954   Satoru    76      NaN  NaN
4  104     Tokyo       2014    Steve    98      NaN  NaN
5  105       NaN        NaN      NaN    76     50.0    M
6  107       NaN        NaN      NaN    98     30.0    M

Relation Merge, join, concatenate and compare

2-4-7 Aggregation

"Aggregation around a specific column with group by"

>>> attri_data_frame2.groupby('Sex')['Math'].mean()
Sex
F    38.000000
M    74.666667
Name: Math, dtype: float64
>>> attri_data_frame2.groupby('Sex')['English'].mean()
Sex
F    25.000000
M    56.666667
Name: English, dtype: float64

2-4-8 Sorting of values

You can sort by index with attri_data_frame1.sort_index ().

>>> attri_data_frame1=DataFrame(attri_data1, index=['e','b','a','c','d'])
>>> attri_data_frame1
    ID      City Birth_year     Name
e  100     Tokyo       1990  Hiroshi
b  101     Osaka       1989    Akiko
a  102     Kyoto       1970     Yuki
c  103  Hokkaido       1954   Satoru
d  104     Tokyo       2014    Steve
>>> attri_data_frame1.sort_index()
    ID      City Birth_year     Name
a  102     Kyoto       1970     Yuki
b  101     Osaka       1989    Akiko
c  103  Hokkaido       1954   Satoru
d  104     Tokyo       2014    Steve
e  100     Tokyo       1990  Hiroshi

Attri_data_frame1.sort_values (by = ['Birth_year']) allows you to sort by the value in the'Birth_year' column.

>>> attri_data_frame1.sort_values(by=['Birth_year'])
    ID      City Birth_year     Name
c  103  Hokkaido       1954   Satoru
a  102     Kyoto       1970     Yuki
b  101     Osaka       1989    Akiko
e  100     Tokyo       1990  Hiroshi
d  104     Tokyo       2014    Steve

2-4-9 Judgment of nan (null)

Perform operations such as excluding missing values.

2-4-9-1 Comparison of data that meets the conditions
>>> attri_data_frame1.isin(['Tokyo'])
      ID   City  Birth_year   Name
e  False   True       False  False
b  False  False       False  False
a  False  False       False  False
c  False  False       False  False
d  False   True       False  False
2-4-9-2 Examples of nan and null
>>> attri_data_frame1['Name'] = np.nan
>>> attri_data_frame1
    ID      City Birth_year  Name
e  100     Tokyo       1990   NaN
b  101     Osaka       1989   NaN
a  102     Kyoto       1970   NaN
c  103  Hokkaido       1954   NaN
d  104     Tokyo       2014   NaN
>>> attri_data_frame1.isnull()
      ID   City  Birth_year  Name
e  False  False       False  True
b  False  False       False  True
a  False  False       False  True
c  False  False       False  True
d  False  False       False  True

Count the number of nulls.

>>> attri_data_frame1.isnull().sum()
ID            0
City          0
Birth_year    0
Name          5
dtype: int64

Exercises

Extraction of Math> = 50

>>> attri_data_frame2
    ID  Math  English Sex  Money
0  100    50       90   M   1000
1  101    43       30   F   2000
2  102    33       20   F    500
3  105    76       50   M    300
4  107    98       30   M    700
>>> attri_data_frame2[attri_data_frame2['Math'] >= 50]
    ID  Math  English Sex  Money
0  100    50       90   M   1000
3  105    76       50   M    300
4  107    98       30   M    700

Money Gender average

>>> attri_data_frame2['Money'] = np.array([1000,2000, 500,300,700])
>>> attri_data_frame2
    ID  Math  English Sex  Money
0  100    50       90   M   1000
1  101    43       30   F   2000
2  102    33       20   F    500
3  105    76       50   M    300
4  107    98       30   M    700
>>> attri_data_frame2.groupby('Sex')['Money'].mean()
Sex
F    1250.000000
M     666.666667
Name: Money, dtype: float64

You may want to process missing values. .. ..

>>> attri_data_frame2['Money'].mean()
900.0
>>> attri_data_frame2['Math'].mean()
60.0
>>> attri_data_frame2['English'].mean()
44.0

csv input / output

Add the writing, reading, and index presence / absence of the csv file. It is necessary to read the saved file with or without index.

>>> attri_data_frame2.to_csv(r'samole0.csv',index=False)
>>> attri_data_frame2.to_csv(r'samole1.csv',index=True)
>>> df = pd.read_csv("samole0.csv")
>>> df
    ID  Math  English Sex  Money
0  100    50       90   M   1000
1  101    43       30   F   2000
2  102    33       20   F    500
3  105    76       50   M    300
4  107    98       30   M    700
>>> df = pd.read_csv("samole1.csv")
>>> df
   Unnamed: 0   ID  Math  English Sex  Money
0           0  100    50       90   M   1000
1           1  101    43       30   F   2000
2           2  102    33       20   F    500
3           3  105    76       50   M    300
4           4  107    98       30   M    700
>>> df = pd.read_csv("samole1.csv", index_col=0)
>>> df
    ID  Math  English Sex  Money
0  100    50       90   M   1000
1  101    43       30   F   2000
2  102    33       20   F    500
3  105    76       50   M    300
4  107    98       30   M    700

Without index ,. .. .. After all it is better to be aware of it.

>>> df.to_csv(r'samole3.csv')
>>> df_ = pd.read_csv("samole3.csv")
>>> df_
   Unnamed: 0   ID  Math  English Sex  Money
0           0  100    50       90   M   1000
1           1  101    43       30   F   2000
2           2  102    33       20   F    500
3           3  105    76       50   M    300
4           4  107    98       30   M    700
>>> df_ = pd.read_csv("samole3.csv", index_col=0)
>>> df_
    ID  Math  English Sex  Money
0  100    50       90   M   1000
1  101    43       30   F   2000
2  102    33       20   F    500
3  105    76       50   M    300
4  107    98       30   M    700

Summary

・ Summarized according to the basics of Pandas in this book ・ Pandas can also draw graphs and perform various processing, but I think that it can be used if you understand the range summarized this time.

・ For further learning, a link was added to the relatively easy-to-understand Tutorial.

bonus

Package overview Getting started tutorials What kind of data does pandas handle? How do I read and write tabular data? How do I select a subset of a DataFrame? How to create plots in pandas? How to create new columns derived from existing columns? How to calculate summary statistics? How to reshape the layout of tables? How to combine data from multiple tables? How to handle time series data with ease? How to manipulate textual data? Comparison with other tools

Recommended Posts

[Introduction to Data Scientists] Basics of scientific calculation, data processing, and how to use graph drawing library ♬ Basics of Pandas
[Introduction to Data Scientists] Basics of scientific calculation, data processing, and how to use graph drawing library ♬ Basics of Scipy
[Introduction to Data Scientists] Basics of scientific calculation, data processing, and how to use graph drawing library ♬ Basics of Matplotlib
[Introduction to Data Scientists] Basics of scientific calculation, data processing, and how to use the graph drawing library ♬ Environment construction
[Introduction to Data Scientists] Basics of Python ♬ Functions and classes
[Introduction to Data Scientists] Basics of Python ♬
[Introduction to Data Scientists] Basics of Python ♬ Conditional branching and loops
[Introduction to Data Scientists] Basics of Python ♬ Functions and anonymous functions, etc.
How to use the graph drawing library Bokeh
[Introduction to Data Scientists] Basics of Probability and Statistics ♬ Probability / Random Variables and Probability Distribution
Introduction of DataLiner ver.1.3 and how to use Union Append
[Python] Summary of how to use pandas
Basics of PyTorch (1) -How to use Tensor-
How to use pandas Timestamp and date_range
Introduction of cyber security framework "MITRE CALDERA": How to use and training
Use decorators to prevent re-execution of data processing
[Pandas] Basics of processing date data using dt
How to use PyTorch-based image processing library "Kornia"
[Introduction to Python] How to use while statements (repetitive processing)
How to use Pandas 2
[Introduction to Azure for kaggle users] Comparison of how to start and use Azure Notebooks and Azure Notebooks VM
[Python] How to use the graph creation library Altair
[Introduction to Data Scientists] Descriptive Statistics and Simple Regression Analysis ♬
[python] How to use the library Matplotlib for drawing graphs
[Python] Summary of how to use split and join functions
Comparison of how to use higher-order functions in Python 2 and 3
[Introduction to Scipy] Calculation of Lorenz curve and Gini coefficient ♬
How to get an overview of your data in Pandas
How to use Pandas Rolling
[Introduction] How to use open3d
processing to use notMNIST data in Python (and tried to classify it)
[Introduction to cx_Oracle] (Part 2) Basics of connecting and disconnecting to Oracle Database
[Introduction to Python] How to use the Boolean operator (and ・ or ・ not)
Summary of how to use pandas.DataFrame.loc
How to install and use Tesseract-OCR
[Python] How to use Pandas Series
How to use Requests (Python Library)
Summary of how to use pyenv-virtualenv
[Introduction to Python] Let's use pandas
How to use .bash_profile and .bashrc
[Introduction to Python] Let's use pandas
Summary of how to use csvkit
[Introduction to Python] Let's use pandas
From the introduction of GoogleCloudPlatform Natural Language API to how to use it
[Introduction to pytorch-lightning] How to use torchvision.transforms and how to freely create your own dataset ♬
If you use Pandas' Plot function in Python, it is really seamless from data processing to graph creation
[Introduction to Python] How to use class in Python?
How to install and use pandas_datareader [Python]
[Pandas] What is set_option [How to use]
How to calculate Use% of df command
Use pandas to convert grid data to row-holding (?) Data
[Python2.7] Summary of how to use unittest
python: How to use locals () and globals ()
How to use "deque" for Python data
How to use Python zip and enumerate
[Python2.7] Summary of how to use subprocess
How to use is and == in Python
Example of efficient data processing with PANDAS
[Question] How to use plot_surface of python
[Python] From morphological analysis of CSV data to CSV output and graph display [GiNZA]
How to use OpenGoddard 3-python library for nonlinear optimal control and orbit generation