[PYTHON] [Introduction to Data Scientists] Basics of scientific computing, data processing, and how to use the graph drawing library: Pandas basics

Last night I summarized the basics of SciPy in [Introduction to Data Scientists] Basics of scientific computing, data processing, and how to use the graph drawing library. Tonight I will summarize the basics of Pandas, supplementing the book's explanations along the way.

【Caution】 I am reading ["Data Scientist Training Course at the University of Tokyo"](https://www.amazon.co.jp/%E6%9D%B1%E4%BA%AC%E5%A4%A7%E5%AD%A6%E3%81%AE%E3%83%87%E3%83%BC%E3%82%BF%E3%82%B5%E3%82%A4%E3%82%A8%E3%83%B3%E3%83%86%E3%82%A3%E3%82%B9%E3%83%88%E8%82%B2%E6%88%90%E8%AC%9B%E5%BA%A7-Python%E3%81%A7%E6%89%8B%E3%82%92%E5%8B%95%E3%81%8B%E3%81%97%E3%81%A6%E5%AD%A6%E3%81%B6%E3%83%87%E2%80%95%E3%82%BF%E5%88%86%E6%9E%90-%E5%A1%9A%E6%9C%AC%E9%82%A6%E5%B0%8A/dp/4839965250/ref=tmm_pap_swatch_0?_encoding=UTF8&qid=&sr=) and summarizing the parts I had doubts about or found useful. The outline should therefore be easy to follow, but please read it on the understanding that the content is my own and not that of the book.

Chapter 2-4 Pandas basics

"Pandas is a convenient library for so-called preprocessing, the step before modeling (with machine learning and so on) in Python ... It lets you perform operations such as table calculations and data extraction and search."

2-4-1 Importing the Pandas library

>>> import numpy as np
>>> import pandas as pd
>>> from pandas import Series, DataFrame
>>> pd.__version__
'1.0.3'

2-4-2 How to use Series

"A Series is like a one-dimensional array ..." What does "like" mean here? Checking the type and printing the object gives the following.

>>> sample_pandas_data = pd.Series([0,10,20,30,40,50,60,70,80,90])
>>> print(type(sample_pandas_data))
<class 'pandas.core.series.Series'>
>>> print(sample_pandas_data)
0     0
1    10
2    20
3    30
4    40
5    50
6    60
7    70
8    80
9    90
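dtype: int64

Unlike a plain list, a <class 'pandas.core.series.Series'> object carries an index: the left-hand column (0 to 9) in the output above.

Converting a NumPy array to a pd.Series

According to the reference, "Pandas is built on NumPy, so compatibility between the two is very high."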

>>> array = np.arange(0,100,10)
>>> array
array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90])
>>> series_sample = pd.Series(array)
>>> series_sample
0     0
1    10
2    20
3    30
4    40
5    50
6    60
7    70
8    80
9    90
dtype: int32

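Specify dtype = 'int64' explicitly (the result above was int32 because that was NumPy's default integer type in this environment)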

>>> array = np.arange(0,100,10, dtype = 'int64')
>>> array
array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90], dtype=int64)
>>> series_sample = pd.Series(array)
>>> series_sample
0     0
1    10
2    20
3    30
4    40
5    50
6    60
7    70
8    80
9    90
dtype: int64
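As a small sketch of an alternative, the integer width can also be fixed on the pandas side instead of in np.arange. The variable names below are illustrative and not from the book; astype and the dtype argument of pd.Series are standard pandas features.

```python
import numpy as np
import pandas as pd

# The default integer dtype of np.arange depends on the platform and NumPy version
array = np.arange(0, 100, 10)

# Option 1: convert after building the Series
series_astype = pd.Series(array).astype("int64")

# Option 2: request the dtype when building the Series
series_dtype = pd.Series(array, dtype="int64")

print(series_astype.dtype, series_dtype.dtype)
```

Either form makes the result independent of the environment's default integer type.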

【Reference】 Differences between Pandas and NumPy and how to use each properly

pd.Series: specifying the index

>>> sample_pandas_index_data = pd.Series([0,10,20,30,40,50,60,70,80,90], index = ['a','b','c','d','e','f','g','h','i','j'])
>>> sample_pandas_index_data
a     0
b    10
c    20
d    30
e    40
f    50
g    60
h    70
i    80
j    90
dtype: int64
>>> sample_pandas_index_data.index
Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='object')
>>> sample_pandas_index_data.values
array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90], dtype=int64)

A Series, including its index, can also be created from NumPy arrays.

>>> array0 = np.arange(0,100,10, dtype = 'int64')
>>> array1 = np.array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])
>>> sample_pandas_index_data2 = pd.Series(array0,index = array1)
>>> sample_pandas_index_data2
a     0
b    10
c    20
d    30
e    40
f    50
g    60
h    70
i    80
j    90
dtype: int64
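Once a Series has labels, its elements can be reached either by label or by position. The following is a minimal sketch using the same values and labels as above; the variable names are my own.

```python
import pandas as pd

# Same values and labels as sample_pandas_index_data in the text
s = pd.Series(range(0, 100, 10), index=list("abcdefghij"))

by_label = s.loc["c"]          # look up by index label
by_position = s.iloc[2]        # look up by integer position
label_slice = s.loc["b":"d"]   # label slices include both endpoints

print(by_label, by_position)
print(label_slice.tolist())
```

Note that a label slice such as "b":"d" includes "d", unlike a positional slice.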

2-4-3 How to use DataFrame

"A DataFrame is a two-dimensional array ..." Here the data is converted from a dictionary, and the output is displayed as a table.

>>> attri_data1 = {'ID':['100','101','102','103','104'],
...               'City':['Tokyo','Osaka','Kyoto','Hokkaido','Tokyo'],
...               'Birth_year':['1990','1989','1970','1954','2014'],
...                'Name':['Hiroshi','Akiko','Yuki','Satoru','Steve']}
>>> attri_data_frame1=DataFrame(attri_data1)
>>> attri_data_frame1
    ID      City Birth_year     Name
0  100     Tokyo       1990  Hiroshi
1  101     Osaka       1989    Akiko
2  102     Kyoto       1970     Yuki
3  103  Hokkaido       1954   Satoru
4  104     Tokyo       2014    Steve
>>> type(attri_data1)
<class 'dict'>

DataFrame: specifying the index

>>> attri_data_frame1=DataFrame(attri_data1, index=['a','b','c','d','e'])
>>> attri_data_frame1
    ID      City Birth_year     Name
a  100     Tokyo       1990  Hiroshi
b  101     Osaka       1989    Akiko
c  102     Kyoto       1970     Yuki
d  103  Hokkaido       1954   Satoru
e  104     Tokyo       2014    Steve

2-4-4 Matrix operations

2-4-4-1 Transpose

>>> attri_data_frame1.T
                  a      b      c         d      e
ID              100    101    102       103    104
City          Tokyo  Osaka  Kyoto  Hokkaido  Tokyo
Birth_year     1990   1989   1970      1954   2014
Name        Hiroshi  Akiko   Yuki    Satoru  Steve

2-4-4-2 Extracting specific columns

>>> attri_data_frame1.Birth_year
a    1990
b    1989
c    1970
d    1954
e    2014
Name: Birth_year, dtype: object
>>> attri_data_frame1[['ID','Birth_year']]
    ID Birth_year
a  100       1990
b  101       1989
c  102       1970
d  103       1954
e  104       2014
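The dtype: object shown for Birth_year reflects that the years were entered as strings. The sketch below, with a reduced copy of the data and illustrative variable names, shows the difference between selecting one column and a list of columns, and one way to turn the years into integers.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["100", "101", "102", "103", "104"],
    "Birth_year": ["1990", "1989", "1970", "1954", "2014"],
})

# A single column name returns a Series; a list of names returns a DataFrame
one_column = df["Birth_year"]
sub_frame = df[["ID", "Birth_year"]]

# Convert the string years to integers so numeric operations work
years = df["Birth_year"].astype(int)
oldest = years.min()

print(type(one_column).__name__, type(sub_frame).__name__, oldest)
```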

2-4-5 Data extraction

>>> attri_data_frame1[attri_data_frame1['City']=='Tokyo']
    ID   City Birth_year     Name
a  100  Tokyo       1990  Hiroshi
e  104  Tokyo       2014    Steve
>>> attri_data_frame1['City']=='Tokyo'
a     True
b    False
c    False
d    False
e     True
Name: City, dtype: bool

Specifying multiple conditions (isin keeps the rows whose value matches any entry in the list)

>>> attri_data_frame1[attri_data_frame1['City'].isin(['Tokyo','Osaka'])]
    ID   City Birth_year     Name
a  100  Tokyo       1990  Hiroshi
b  101  Osaka       1989    Akiko
e  104  Tokyo       2014    Steve
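Conditions on different columns can be combined with & (and) and | (or). This is a minimal sketch on the same data; the variable names are illustrative, and the year is converted to an integer because it is stored as a string.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["100", "101", "102", "103", "104"],
    "City": ["Tokyo", "Osaka", "Kyoto", "Hokkaido", "Tokyo"],
    "Birth_year": ["1990", "1989", "1970", "1954", "2014"],
    "Name": ["Hiroshi", "Akiko", "Yuki", "Satoru", "Steve"],
}, index=list("abcde"))

# Each comparison needs its own parentheses because & and | bind tighter than == and >
tokyo_after_2000 = df[(df["City"] == "Tokyo") & (df["Birth_year"].astype(int) > 2000)]
osaka_or_kyoto = df[(df["City"] == "Osaka") | (df["City"] == "Kyoto")]

print(tokyo_after_2000)
print(osaka_or_kyoto)
```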

2-4-6 Deleting and combining data

2-4-6-1 Deleting columns and rows

drop(list of columns to delete, axis = 1)

**axis = 1 refers to columns**

>>> attri_data_frame1.drop(['Birth_year'], axis = 1)
    ID      City     Name
a  100     Tokyo  Hiroshi
b  101     Osaka    Akiko
c  102     Kyoto     Yuki
d  103  Hokkaido   Satoru
e  104     Tokyo    Steve

drop(list of rows to delete, axis = 0)

**axis = 0 refers to rows**

>>> attri_data_frame1.drop(['c','e'], axis = 0)
    ID      City Birth_year     Name
a  100     Tokyo       1990  Hiroshi
b  101     Osaka       1989    Akiko
d  103  Hokkaido       1954   Satoru

The operations above do not modify the original data.

>>> attri_data_frame1
    ID      City Birth_year     Name
a  100     Tokyo       1990  Hiroshi
b  101     Osaka       1989    Akiko
c  102     Kyoto       1970     Yuki
d  103  Hokkaido       1954   Satoru
e  104     Tokyo       2014    Steve

With the option inplace = True, the DataFrame itself is modified.

Note that the original data will be lost.

>>> attri_data_frame1.drop(['c','e'], axis = 0, inplace = True)
>>> attri_data_frame1
    ID      City Birth_year     Name
a  100     Tokyo       1990  Hiroshi
b  101     Osaka       1989    Akiko
d  103  Hokkaido       1954   Satoru
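If the original data should be kept, one alternative to inplace = True is to bind the result of drop to a new name. A small sketch with a shortened table and illustrative names:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["100", "101", "102"],
    "City": ["Tokyo", "Osaka", "Kyoto"],
}, index=["a", "b", "c"])

# drop returns a new DataFrame, so the original stays available
trimmed = df.drop(["c"], axis=0)

# columns= is an alternative spelling of axis=1
no_city = df.drop(columns=["City"])

print(df.shape, trimmed.shape, no_city.shape)
```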

2-4-6-2 Combining data

Adding data (a new column)

>>> attri_data1 = {'ID':['100','101','102','103','104'],
...               'City':['Tokyo','Osaka','Kyoto','Hokkaido','Tokyo'],
...               'Birth_year':['1990','1989','1970','1954','2014'],
...                'Name':['Hiroshi','Akiko','Yuki','Satoru','Steve']}
>>> attri_data_frame1=DataFrame(attri_data1)
>>> attri_data_frame1
    ID      City Birth_year     Name
0  100     Tokyo       1990  Hiroshi
1  101     Osaka       1989    Akiko
2  102     Kyoto       1970     Yuki
3  103  Hokkaido       1954   Satoru
4  104     Tokyo       2014    Steve
>>> math_pt = [50, 43, 33,76,98]
>>> attri_data_frame1['Math']=math_pt
>>> attri_data_frame1
    ID      City Birth_year     Name  Math
0  100     Tokyo       1990  Hiroshi    50
1  101     Osaka       1989    Akiko    43
2  102     Kyoto       1970     Yuki    33
3  103  Hokkaido       1954   Satoru    76
4  104     Tokyo       2014    Steve    98

Combining data

>>> attri_data2 = {'ID':['100','101','102','105','107'],
...                'Math':[50, 43, 33,76,98],
...                'English':[90, 30, 20,50,30],
...                'Sex':['M', 'F', 'F', 'M', 'M']}
>>> attri_data_frame2=DataFrame(attri_data2)
>>> attri_data_frame2
    ID  Math  English Sex
0  100    50       90   M
1  101    43       30   F
2  102    33       20   F
3  105    76       50   M
4  107    98       30   M
>>> attri_data_frame1
    ID      City Birth_year     Name  Math
0  100     Tokyo       1990  Hiroshi    50
1  101     Osaka       1989    Akiko    43
2  102     Kyoto       1970     Yuki    33
3  103  Hokkaido       1954   Satoru    76
4  104     Tokyo       2014    Steve    98

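pd.merge finds the keys the two frames share and merges on them. When no on argument is given, every column name common to both frames is used as a key; here that means ID and Math, not ID alone.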

>>> pd.merge(attri_data_frame1,attri_data_frame2)
    ID   City Birth_year     Name  Math  English Sex
0  100  Tokyo       1990  Hiroshi    50       90   M
1  101  Osaka       1989    Akiko    43       30   F
2  102  Kyoto       1970     Yuki    33       20   F

pandas.merge

>>> pd.merge(attri_data_frame1,attri_data_frame2, how = 'outer')
    ID      City Birth_year     Name  Math  English  Sex
0  100     Tokyo       1990  Hiroshi    50     90.0    M
1  101     Osaka       1989    Akiko    43     30.0    F
2  102     Kyoto       1970     Yuki    33     20.0    F
3  103  Hokkaido       1954   Satoru    76      NaN  NaN
4  104     Tokyo       2014    Steve    98      NaN  NaN
5  105       NaN        NaN      NaN    76     50.0    M
6  107       NaN        NaN      NaN    98     30.0    M
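To make the join key explicit, the on and how arguments of pd.merge can be spelled out. The sketch below uses reduced versions of the two tables (the names left and right are mine) and adds indicator=True to show where each row came from.

```python
import pandas as pd

left = pd.DataFrame({
    "ID": ["100", "101", "102", "103", "104"],
    "City": ["Tokyo", "Osaka", "Kyoto", "Hokkaido", "Tokyo"],
})
right = pd.DataFrame({
    "ID": ["100", "101", "102", "105", "107"],
    "English": [90, 30, 20, 50, 30],
})

# Inner join (the default): only IDs present in both frames
inner = pd.merge(left, right, on="ID")

# Left join: every row of left, NaN where right has no match
left_join = pd.merge(left, right, on="ID", how="left")

# Outer join with an extra _merge column: left_only, right_only, or both
outer = pd.merge(left, right, on="ID", how="outer", indicator=True)

print(outer)
```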

【Related】 Merge, join, concatenate and compare

2-4-7 Aggregation

"Aggregate around a specific column with groupby"

>>> attri_data_frame2.groupby('Sex')['Math'].mean()
Sex
F    38.000000
M    74.666667
Name: Math, dtype: float64
>>> attri_data_frame2.groupby('Sex')['English'].mean()
Sex
F    25.000000
M    56.666667
Name: English, dtype: float64
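Several columns and several statistics can be aggregated in one call with agg. This is a minimal sketch on the same scores; the name summary is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Sex": ["M", "F", "F", "M", "M"],
    "Math": [50, 43, 33, 76, 98],
    "English": [90, 30, 20, 50, 30],
})

# One row per group, one (column, statistic) pair per output column
summary = df.groupby("Sex")[["Math", "English"]].agg(["mean", "max"])

print(summary)
```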

2-4-8 Sorting values

attri_data_frame1.sort_index() sorts by index.

>>> attri_data_frame1=DataFrame(attri_data1, index=['e','b','a','c','d'])
>>> attri_data_frame1
    ID      City Birth_year     Name
e  100     Tokyo       1990  Hiroshi
b  101     Osaka       1989    Akiko
a  102     Kyoto       1970     Yuki
c  103  Hokkaido       1954   Satoru
d  104     Tokyo       2014    Steve
>>> attri_data_frame1.sort_index()
    ID      City Birth_year     Name
a  102     Kyoto       1970     Yuki
b  101     Osaka       1989    Akiko
c  103  Hokkaido       1954   Satoru
d  104     Tokyo       2014    Steve
e  100     Tokyo       1990  Hiroshi

attri_data_frame1.sort_values(by = ['Birth_year']) sorts by the values in the 'Birth_year' column.

>>> attri_data_frame1.sort_values(by=['Birth_year'])
    ID      City Birth_year     Name
c  103  Hokkaido       1954   Satoru
a  102     Kyoto       1970     Yuki
b  101     Osaka       1989    Akiko
e  100     Tokyo       1990  Hiroshi
d  104     Tokyo       2014    Steve
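Because Birth_year holds strings, the sort above is alphabetical; it happens to match numeric order only because every year has four digits. The sketch below (illustrative names) shows a descending sort and a case where string and numeric order differ.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Hiroshi", "Akiko", "Yuki", "Satoru", "Steve"],
    "Birth_year": ["1990", "1989", "1970", "1954", "2014"],
})

# ascending=False reverses the order
newest_first = df.sort_values(by="Birth_year", ascending=False)

# Strings compare character by character, so "9" sorts after "100"
s = pd.Series(["9", "10", "100"])
as_text = s.sort_values().tolist()
as_number = s.astype(int).sort_values().tolist()

print(newest_first["Name"].tolist())
print(as_text, as_number)
```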

2-4-9 Detecting NaN (null)

This covers operations such as excluding missing values.

2-4-9-1 Comparing data against a condition

>>> attri_data_frame1.isin(['Tokyo'])
      ID   City  Birth_year   Name
e  False   True       False  False
b  False  False       False  False
a  False  False       False  False
c  False  False       False  False
d  False   True       False  False

2-4-9-2 Examples of NaN and null

>>> attri_data_frame1['Name'] = np.nan
>>> attri_data_frame1
    ID      City Birth_year  Name
e  100     Tokyo       1990   NaN
b  101     Osaka       1989   NaN
a  102     Kyoto       1970   NaN
c  103  Hokkaido       1954   NaN
d  104     Tokyo       2014   NaN
>>> attri_data_frame1.isnull()
      ID   City  Birth_year  Name
e  False  False       False  True
b  False  False       False  True
a  False  False       False  True
c  False  False       False  True
d  False  False       False  True

Count the number of null values in each column.

>>> attri_data_frame1.isnull().sum()
ID            0
City          0
Birth_year    0
Name          5
dtype: int64
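Once missing values have been located, they can be excluded with dropna. The following is a rough sketch on a small made-up table (the data and names are assumptions for illustration, not from the book).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": ["100", "101", "102"],
    "Math": [50.0, np.nan, 33.0],
    "Name": [np.nan, np.nan, np.nan],
})

# Drop columns in which every value is NaN
no_empty_cols = df.dropna(axis=1, how="all")

# Then drop rows that still contain any NaN
complete_rows = no_empty_cols.dropna()

print(complete_rows)
```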

Exercises

Extract the rows with Math >= 50 (the Money column shown here is the one added in the next exercise)

>>> attri_data_frame2
    ID  Math  English Sex  Money
0  100    50       90   M   1000
1  101    43       30   F   2000
2  102    33       20   F    500
3  105    76       50   M    300
4  107    98       30   M    700
>>> attri_data_frame2[attri_data_frame2['Math'] >= 50]
    ID  Math  English Sex  Money
0  100    50       90   M   1000
3  105    76       50   M    300
4  107    98       30   M    700

Mean of Money by Sex

>>> attri_data_frame2['Money'] = np.array([1000,2000, 500,300,700])
>>> attri_data_frame2
    ID  Math  English Sex  Money
0  100    50       90   M   1000
1  101    43       30   F   2000
2  102    33       20   F    500
3  105    76       50   M    300
4  107    98       30   M    700
>>> attri_data_frame2.groupby('Sex')['Money'].mean()
Sex
F    1250.000000
M     666.666667
Name: Money, dtype: float64

You may also want column means like these when dealing with missing values.

>>> attri_data_frame2['Money'].mean()
900.0
>>> attri_data_frame2['Math'].mean()
60.0
>>> attri_data_frame2['English'].mean()
44.0
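One common way to use such a mean is to fill missing entries with it. This is a minimal sketch assuming a Math column with one missing score; the data is made up for illustration.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({"Math": [50, 43, np.nan, 76, 98]})

# mean() skips NaN by default, so it averages the four known scores
math_mean = scores["Math"].mean()

# Replace the missing score with that mean
filled = scores["Math"].fillna(math_mean)

print(math_mean)
print(filled.tolist())
```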

CSV input / output

This section adds writing and reading a CSV file, with and without the index. When reading a saved file, you need to know whether it was written with an index.

>>> attri_data_frame2.to_csv(r'samole0.csv',index=False)
>>> attri_data_frame2.to_csv(r'samole1.csv',index=True)
>>> df = pd.read_csv("samole0.csv")
>>> df
    ID  Math  English Sex  Money
0  100    50       90   M   1000
1  101    43       30   F   2000
2  102    33       20   F    500
3  105    76       50   M    300
4  107    98       30   M    700
>>> df = pd.read_csv("samole1.csv")
>>> df
   Unnamed: 0   ID  Math  English Sex  Money
0           0  100    50       90   M   1000
1           1  101    43       30   F   2000
2           2  102    33       20   F    500
3           3  105    76       50   M    300
4           4  107    98       30   M    700
>>> df = pd.read_csv("samole1.csv", index_col=0)
>>> df
    ID  Math  English Sex  Money
0  100    50       90   M   1000
1  101    43       30   F   2000
2  102    33       20   F    500
3  105    76       50   M    300
4  107    98       30   M    700

If the index argument is not specified, the index is still written. After all, it is better to be aware of it.

>>> df.to_csv(r'samole3.csv')
>>> df_ = pd.read_csv("samole3.csv")
>>> df_
   Unnamed: 0   ID  Math  English Sex  Money
0           0  100    50       90   M   1000
1           1  101    43       30   F   2000
2           2  102    33       20   F    500
3           3  105    76       50   M    300
4           4  107    98       30   M    700
>>> df_ = pd.read_csv("samole3.csv", index_col=0)
>>> df_
    ID  Math  English Sex  Money
0  100    50       90   M   1000
1  101    43       30   F   2000
2  102    33       20   F    500
3  105    76       50   M    300
4  107    98       30   M    700
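The round trip above can be reproduced without leaving files behind by writing to a temporary directory. This is a sketch with a small made-up table; the file name roundtrip.csv and the variable names are illustrative.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"ID": [100, 101], "Math": [50, 43]}, index=["a", "b"])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "roundtrip.csv")
    df.to_csv(path)                            # index written as the first, unnamed column
    naive = pd.read_csv(path)                  # index comes back as an ordinary column
    restored = pd.read_csv(path, index_col=0)  # first column becomes the index again

print(naive.columns.tolist())
print(restored)
```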

Summary

・ I summarized the basics of Pandas, following this book.

・ Pandas can also draw graphs and perform various other kinds of processing, but I think you can put it to use once you understand the range summarized here.

・ For deeper study, I have added links to the official tutorials, which are relatively easy to understand.

Bonus

・ Package overview
・ Getting started tutorials
・ What kind of data does pandas handle?
・ How do I read and write tabular data?
・ How do I select a subset of a DataFrame?
・ How to create plots in pandas?
・ How to create new columns derived from existing columns?
・ How to calculate summary statistics?
・ How to reshape the layout of tables?
・ How to combine data from multiple tables?
・ How to handle time series data with ease?
・ How to manipulate textual data?
・ Comparison with other tools