[PYTHON] [Introduction to Data Scientists] Basics of scientific computing, data processing, and using graph-drawing libraries ♬ Basics of Pandas

Last night I summarized [Introduction to Data Scientists] Basics of Scipy as part of this series on the basics of scientific computing, data processing, and using graph-drawing libraries; tonight I will summarize the basics of pandas, supplementing the explanations in the book. 【Note】 I am reading ["Data Scientist Training Course at the University of Tokyo"](https://www.amazon.co.jp/%E6%9D%B1%E4%BA%AC%E5%A4%A7%E5%AD%A6%E3%81%AE%E3%83%87%E3%83%BC%E3%82%BF%E3%82%B5%E3%82%A4%E3%82%A8%E3%83%B3%E3%83%86%E3%82%A3%E3%82%B9%E3%83%88%E8%82%B2%E6%88%90%E8%AC%9B%E5%BA%A7-Python%E3%81%A7%E6%89%8B%E3%82%92%E5%8B%95%E3%81%8B%E3%81%97%E3%81%A6%E5%AD%A6%E3%81%B6%E3%83%87%E2%80%95%E3%82%BF%E5%88%86%E6%9E%90-%E5%A1%9A%E6%9C%AC%E9%82%A6%E5%B0%8A/dp/4839965250/ref=tmm_pap_swatch_0?_encoding=UTF8&qid=&sr=) and summarizing the parts I have some doubts about or find useful. The summary should therefore be straightforward, but please read it on the understanding that the content is my own and independent of the book.

Chapter 2-4 Basics of Pandas

"Pandas is a convenient library for the so-called preprocessing done in Python before modeling (with machine learning and the like) ... It lets you perform operations such as spreadsheet-style calculation, data extraction, and search."

2-4-1 Importing the pandas library

>>> import numpy as np
>>> import pandas as pd
>>> from pandas import Series, DataFrame
>>> pd.__version__
'1.0.3'

2-4-2 Using Series

"A Series is like a one-dimensional array ..." "Like" in what way? Printing the type and the object itself, as below, shows what it looks like...

>>> sample_pandas_data = pd.Series([0,10,20,30,40,50,60,70,80,90])
>>> print(type(sample_pandas_data))
<class 'pandas.core.series.Series'>
>>> print(sample_pandas_data)
0     0
1    10
2    20
3    30
4    40
5    50
6    60
7    70
8    80
9    90
dtype: int64

As the output shows, an object of <class 'pandas.core.series.Series'> is indexed: each value is paired with an index label (0 to 9 here).

Converting a NumPy array to pd.Series

According to the reference, "pandas is built on NumPy, so compatibility is very high."

>>> array = np.arange(0,100,10)
>>> array
array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90])
>>> series_sample = pd.Series(array)
>>> series_sample
0     0
1    10
2    20
3    30
4    40
5    50
6    60
7    70
8    80
9    90
dtype: int32

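Specify dtype='int64'. The int32 above is the default integer type that np.arange chose in this environment, so it can differ between platforms.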

>>> array = np.arange(0,100,10, dtype = 'int64')
>>> array
array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90], dtype=int64)
>>> series_sample = pd.Series(array)
>>> series_sample
0     0
1    10
2    20
3    30
4    40
5    50
6    60
7    70
8    80
9    90
dtype: int64
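The dtype handling above can be sketched as a small script. This is only an illustration under my own assumptions: the variable names are made up for the example, and `Series.astype` is shown as one way to convert after the fact, not as something the book prescribes.

```python
import numpy as np
import pandas as pd

# Request 64-bit integers explicitly instead of relying on the platform default.
array = np.arange(0, 100, 10, dtype="int64")
series_from_array = pd.Series(array)
print(series_from_array.dtype)

# A Series that already exists with a smaller integer type can be converted with astype.
series_small = pd.Series(np.arange(0, 100, 10, dtype="int32"))
series_converted = series_small.astype("int64")
print(series_small.dtype, series_converted.dtype)
```

The Series simply inherits the dtype of the NumPy array it was built from, which is why fixing the dtype at array creation is enough.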

【Reference】 Differences between pandas and NumPy and how to use each properly

pd.Series: specifying the index

>>> sample_pandas_index_data = pd.Series([0,10,20,30,40,50,60,70,80,90], index = ['a','b','c','d','e','f','g','h','i','j'])
>>> sample_pandas_index_data
a     0
b    10
c    20
d    30
e    40
f    50
g    60
h    70
i    80
j    90
dtype: int64
>>> sample_pandas_index_data.index
Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='object')
>>> sample_pandas_index_data.values
array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90], dtype=int64)

Both the values and the index can also be created from NumPy arrays.

>>> array0 = np.arange(0,100,10, dtype = 'int64')
>>> array1 = np.array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])
>>> sample_pandas_index_data2 = pd.Series(array0,index = array1)
>>> sample_pandas_index_data2
a     0
b    10
c    20
d    30
e    40
f    50
g    60
h    70
i    80
j    90
dtype: int64
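Once a Series has string labels, values can be looked up either by label or by position. The following is a minimal sketch of that distinction using `.loc` and `.iloc`; the data is a shortened version of the example above and the variable names are my own.

```python
import pandas as pd

labeled = pd.Series([0, 10, 20, 30, 40], index=["a", "b", "c", "d", "e"])

by_label = labeled.loc["c"]           # look up by index label
by_position = labeled.iloc[2]         # look up by integer position
label_slice = labeled.loc["b":"d"]    # label slices include both endpoints
position_slice = labeled.iloc[1:3]    # position slices exclude the end

print(by_label, by_position)
print(label_slice.tolist(), position_slice.tolist())
```

Note the asymmetry: slicing by label keeps the end label, while slicing by position follows the usual Python convention.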

2-4-3 Using DataFrame

"A DataFrame is a two-dimensional array ..." The data is converted from dictionary format, and the output is displayed as a table.

>>> attri_data1 = {'ID':['100','101','102','103','104'],
...               'City':['Tokyo','Osaka','Kyoto','Hokkaido','Tokyo'],
...               'Birth_year':['1990','1989','1970','1954','2014'],
...                'Name':['Hiroshi','Akiko','Yuki','Satoru','Steve']}
>>> attri_data_frame1=DataFrame(attri_data1)
>>> attri_data_frame1
    ID      City Birth_year     Name
0  100     Tokyo       1990  Hiroshi
1  101     Osaka       1989    Akiko
2  102     Kyoto       1970     Yuki
3  103  Hokkaido       1954   Satoru
4  104     Tokyo       2014    Steve
>>> type(attri_data1)
<class 'dict'>

DataFrame: specifying the index

>>> attri_data_frame1=DataFrame(attri_data1, index=['a','b','c','d','e'])
>>> attri_data_frame1
    ID      City Birth_year     Name
a  100     Tokyo       1990  Hiroshi
b  101     Osaka       1989    Akiko
c  102     Kyoto       1970     Yuki
d  103  Hokkaido       1954   Satoru
e  104     Tokyo       2014    Steve

2-4-4 Matrix operations

2-4-4-1 Transpose

>>> attri_data_frame1.T
                  a      b      c         d      e
ID              100    101    102       103    104
City          Tokyo  Osaka  Kyoto  Hokkaido  Tokyo
Birth_year     1990   1989   1970      1954   2014
Name        Hiroshi  Akiko   Yuki    Satoru  Steve

2-4-4-2 Extracting specific columns

>>> attri_data_frame1.Birth_year
a    1990
b    1989
c    1970
d    1954
e    2014
Name: Birth_year, dtype: object
>>> attri_data_frame1[['ID','Birth_year']]
    ID Birth_year
a  100       1990
b  101       1989
c  102       1970
d  103       1954
e  104       2014
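The two selection styles above return different types: a single column comes back as a Series, a list of columns as a DataFrame. A small sketch, assuming a cut-down copy of the sample data (the names `people`, `one_column`, and so on are mine), also shows converting the string years to integers with `astype`, since Birth_year is stored as text in this example.

```python
import pandas as pd

people = pd.DataFrame({
    "ID": ["100", "101", "102"],
    "City": ["Tokyo", "Osaka", "Kyoto"],
    "Birth_year": ["1990", "1989", "1970"],
})

one_column = people["Birth_year"]          # single label -> Series
sub_frame = people[["ID", "Birth_year"]]   # list of labels -> DataFrame

# The years are strings in the sample data; astype(int) yields numbers.
birth_year_int = people["Birth_year"].astype(int)

print(type(one_column).__name__, type(sub_frame).__name__)
print(birth_year_int.tolist())
```

Bracket selection (`people["Birth_year"]`) behaves the same as attribute access (`people.Birth_year`) here, but it also works for column names that are not valid Python identifiers.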

2-4-5 Data extraction

>>> attri_data_frame1[attri_data_frame1['City']=='Tokyo']
    ID   City Birth_year     Name
a  100  Tokyo       1990  Hiroshi
e  104  Tokyo       2014    Steve
>>> attri_data_frame1['City']=='Tokyo'
a     True
b    False
c    False
d    False
e     True
Name: City, dtype: bool

Specifying multiple conditions

>>> attri_data_frame1[attri_data_frame1['City'].isin(['Tokyo','Osaka'])]
    ID   City Birth_year     Name
a  100  Tokyo       1990  Hiroshi
b  101  Osaka       1989    Akiko
e  104  Tokyo       2014    Steve
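`isin` covers several allowed values for one column. Conditions on different columns can be combined with `&` (and) and `|` (or), as in the sketch below. The Math column and the thresholds are assumptions chosen for illustration; each condition needs its own parentheses because `&` and `|` bind more tightly than the comparisons.

```python
import pandas as pd

scores = pd.DataFrame({
    "ID": ["100", "101", "102", "103", "104"],
    "City": ["Tokyo", "Osaka", "Kyoto", "Hokkaido", "Tokyo"],
    "Math": [50, 43, 33, 76, 98],
})

# Both conditions must hold.
tokyo_and_high = scores[(scores["City"] == "Tokyo") & (scores["Math"] >= 60)]

# Either condition may hold.
osaka_or_low = scores[(scores["City"] == "Osaka") | (scores["Math"] < 40)]

print(tokyo_and_high)
print(osaka_or_low)
```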

2-4-6 Deleting and combining data

2-4-6-1 Deleting columns and rows

drop(list of columns to delete, axis=1)

**axis=1 refers to columns**

>>> attri_data_frame1.drop(['Birth_year'], axis = 1)
    ID      City     Name
a  100     Tokyo  Hiroshi
b  101     Osaka    Akiko
c  102     Kyoto     Yuki
d  103  Hokkaido   Satoru
e  104     Tokyo    Steve

drop(list of rows to delete, axis=0)

**axis=0 refers to rows**

>>> attri_data_frame1.drop(['c','e'], axis = 0)
    ID      City Birth_year     Name
a  100     Tokyo       1990  Hiroshi
b  101     Osaka       1989    Akiko
d  103  Hokkaido       1954   Satoru

The operations above do not modify the original data.

>>> attri_data_frame1
    ID      City Birth_year     Name
a  100     Tokyo       1990  Hiroshi
b  101     Osaka       1989    Akiko
c  102     Kyoto       1970     Yuki
d  103  Hokkaido       1954   Satoru
e  104     Tokyo       2014    Steve

With the option inplace=True, as below, the DataFrame itself is replaced.

Note that the original data is lost.

>>> attri_data_frame1.drop(['c','e'], axis = 0, inplace = True)
>>> attri_data_frame1
    ID      City Birth_year     Name
a  100     Tokyo       1990  Hiroshi
b  101     Osaka       1989    Akiko
d  103  Hokkaido       1954   Satoru
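The difference between the default behavior and inplace=True can be checked directly. This is a minimal sketch with a three-row stand-in for the sample data; the `columns=` keyword is an alternative spelling of `axis=1` that I am adding for comparison.

```python
import pandas as pd

frame = pd.DataFrame(
    {"ID": ["100", "101", "102"], "City": ["Tokyo", "Osaka", "Kyoto"]},
    index=["a", "b", "c"],
)

# Default: drop returns a new DataFrame and leaves `frame` untouched.
without_c = frame.drop(["c"], axis=0)

# columns=[...] is equivalent to passing the labels with axis=1.
without_city = frame.drop(columns=["City"])

# inplace=True modifies the object itself, so work on a copy to keep the original.
frame_copy = frame.copy()
frame_copy.drop(["c"], axis=0, inplace=True)

print(list(frame.index), list(without_c.index), list(frame_copy.index))
```

Assigning the result back (`frame = frame.drop(...)`) has the same effect as inplace=True while keeping the call chainable.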

2-4-6-2 Combining data

Adding data

>>> attri_data1 = {'ID':['100','101','102','103','104'],
...               'City':['Tokyo','Osaka','Kyoto','Hokkaido','Tokyo'],
...               'Birth_year':['1990','1989','1970','1954','2014'],
...                'Name':['Hiroshi','Akiko','Yuki','Satoru','Steve']}
>>> attri_data_frame1=DataFrame(attri_data1)
>>> attri_data_frame1
    ID      City Birth_year     Name
0  100     Tokyo       1990  Hiroshi
1  101     Osaka       1989    Akiko
2  102     Kyoto       1970     Yuki
3  103  Hokkaido       1954   Satoru
4  104     Tokyo       2014    Steve
>>> math_pt = [50, 43, 33,76,98]
>>> attri_data_frame1['Math']=math_pt
>>> attri_data_frame1
    ID      City Birth_year     Name  Math
0  100     Tokyo       1990  Hiroshi    50
1  101     Osaka       1989    Akiko    43
2  102     Kyoto       1970     Yuki    33
3  103  Hokkaido       1954   Satoru    76
4  104     Tokyo       2014    Steve    98

Combining data

>>> attri_data2 = {'ID':['100','101','102','105','107'],
...                'Math':[50, 43, 33,76,98],
...                'English':[90, 30, 20,50,30],
...                'Sex':['M', 'F', 'F', 'M', 'M']}
>>> attri_data_frame2=DataFrame(attri_data2)
>>> attri_data_frame2
    ID  Math  English Sex
0  100    50       90   M
1  101    43       30   F
2  102    33       20   F
3  105    76       50   M
4  107    98       30   M
>>> attri_data_frame1
    ID      City Birth_year     Name  Math
0  100     Tokyo       1990  Hiroshi    50
1  101     Osaka       1989    Akiko    43
2  102     Kyoto       1970     Yuki    33
3  103  Hokkaido       1954   Satoru    76
4  104     Tokyo       2014    Steve    98

pd.merge finds rows with matching keys and merges them. No key is specified here, so every column name the two frames share is used as the key: ID and Math.

>>> pd.merge(attri_data_frame1,attri_data_frame2)
    ID   City Birth_year     Name  Math  English Sex
0  100  Tokyo       1990  Hiroshi    50       90   M
1  101  Osaka       1989    Akiko    43       30   F
2  102  Kyoto       1970     Yuki    33       20   F

pandas.merge

>>> pd.merge(attri_data_frame1,attri_data_frame2, how = 'outer')
    ID      City Birth_year     Name  Math  English  Sex
0  100     Tokyo       1990  Hiroshi    50     90.0    M
1  101     Osaka       1989    Akiko    43     30.0    F
2  102     Kyoto       1970     Yuki    33     20.0    F
3  103  Hokkaido       1954   Satoru    76      NaN  NaN
4  104     Tokyo       2014    Steve    98      NaN  NaN
5  105       NaN        NaN      NaN    76     50.0    M
6  107       NaN        NaN      NaN    98     30.0    M
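To make the choice of key explicit, `on` can be passed to pd.merge. The sketch below contrasts the default (join on all shared columns) with `on="ID"`; the frames are reduced stand-ins for the sample data, and the suffix strings are arbitrary names picked for the example.

```python
import pandas as pd

left = pd.DataFrame({
    "ID": ["100", "101", "103"],
    "Name": ["Hiroshi", "Akiko", "Satoru"],
    "Math": [50, 43, 76],
})
right = pd.DataFrame({
    "ID": ["100", "101", "105"],
    "Math": [50, 43, 76],
    "English": [90, 30, 50],
})

# No `on`: the join key is every shared column, here ID and Math.
default_merge = pd.merge(left, right)

# Explicit key: only ID has to match; the clashing Math columns get suffixes.
left_on_id = pd.merge(
    left, right, on="ID", how="left", suffixes=("_left", "_right")
)

print(default_merge)
print(left_on_id)
```

With how="left", every row of `left` is kept and the columns coming from `right` are NaN where no ID matches, which mirrors the NaN rows in the outer merge above.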

Related: Merge, join, concatenate and compare

2-4-7 Aggregation

"Aggregation around a particular column using groupby"

>>> attri_data_frame2.groupby('Sex')['Math'].mean()
Sex
F    38.000000
M    74.666667
Name: Math, dtype: float64
>>> attri_data_frame2.groupby('Sex')['English'].mean()
Sex
F    25.000000
M    56.666667
Name: English, dtype: float64
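The two groupby calls above can be combined, and several statistics can be computed at once with `agg`. A minimal sketch using the same Math, English, and Sex values (the variable names are my own):

```python
import pandas as pd

scores = pd.DataFrame({
    "Sex": ["M", "F", "F", "M", "M"],
    "Math": [50, 43, 33, 76, 98],
    "English": [90, 30, 20, 50, 30],
})

# Mean of several columns per group in one call.
means = scores.groupby("Sex")[["Math", "English"]].mean()

# Several statistics for one column per group.
math_summary = scores.groupby("Sex")["Math"].agg(["mean", "max", "count"])

print(means)
print(math_summary)
```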

2-4-8 Sorting values

attri_data_frame1.sort_index() sorts by index.

>>> attri_data_frame1=DataFrame(attri_data1, index=['e','b','a','c','d'])
>>> attri_data_frame1
    ID      City Birth_year     Name
e  100     Tokyo       1990  Hiroshi
b  101     Osaka       1989    Akiko
a  102     Kyoto       1970     Yuki
c  103  Hokkaido       1954   Satoru
d  104     Tokyo       2014    Steve
>>> attri_data_frame1.sort_index()
    ID      City Birth_year     Name
a  102     Kyoto       1970     Yuki
b  101     Osaka       1989    Akiko
c  103  Hokkaido       1954   Satoru
d  104     Tokyo       2014    Steve
e  100     Tokyo       1990  Hiroshi

attri_data_frame1.sort_values(by=['Birth_year']) sorts by the values in the 'Birth_year' column.

>>> attri_data_frame1.sort_values(by=['Birth_year'])
    ID      City Birth_year     Name
c  103  Hokkaido       1954   Satoru
a  102     Kyoto       1970     Yuki
b  101     Osaka       1989    Akiko
e  100     Tokyo       1990  Hiroshi
d  104     Tokyo       2014    Steve
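One caveat: Birth_year holds strings in this sample, so the sort above compares text. That matches numeric order here only because every year has four digits. A hedged sketch of a more robust variant, converting to integers first and sorting in descending order with `ascending=False` (the frame is a reduced copy of the sample):

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Hiroshi", "Akiko", "Yuki", "Satoru", "Steve"],
    "Birth_year": ["1990", "1989", "1970", "1954", "2014"],
})

# Convert the text years to integers so the comparison is numeric.
people["Birth_year"] = people["Birth_year"].astype(int)

# ascending=False puts the largest (most recent) year first.
youngest_first = people.sort_values(by="Birth_year", ascending=False)

print(youngest_first)
```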

2-4-9 Checking for NaN (null)

This covers operations such as excluding missing values.

2-4-9-1 Comparing data against a condition

>>> attri_data_frame1.isin(['Tokyo'])
      ID   City  Birth_year   Name
e  False   True       False  False
b  False  False       False  False
a  False  False       False  False
c  False  False       False  False
d  False   True       False  False

2-4-9-2 Examples of NaN and null

>>> attri_data_frame1['Name'] = np.nan
>>> attri_data_frame1
    ID      City Birth_year  Name
e  100     Tokyo       1990   NaN
b  101     Osaka       1989   NaN
a  102     Kyoto       1970   NaN
c  103  Hokkaido       1954   NaN
d  104     Tokyo       2014   NaN
>>> attri_data_frame1.isnull()
      ID   City  Birth_year  Name
e  False  False       False  True
b  False  False       False  True
a  False  False       False  True
c  False  False       False  True
d  False  False       False  True

Count the number of nulls.

>>> attri_data_frame1.isnull().sum()
ID            0
City          0
Birth_year    0
Name          5
dtype: int64
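Counting nulls is usually followed by either dropping or filling them. The sketch below shows both with `dropna` and `fillna`; the small frame and the choice of the column mean as fill value are assumptions for illustration, not the book's procedure.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "ID": ["100", "101", "102"],
    "Math": [50.0, np.nan, 33.0],
    "Name": ["Hiroshi", None, "Yuki"],
})

# Number of missing values per column.
missing_per_column = frame.isnull().sum()

# Keep only rows that have no missing value at all.
complete_rows = frame.dropna()

# Fill missing Math values with the column mean (NaN is skipped by mean()).
filled = frame.fillna({"Math": frame["Math"].mean()})

print(missing_per_column)
print(complete_rows)
print(filled)
```

Passing a dict to `fillna` restricts the fill to the named columns, so the missing Name stays missing.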

Exercises

Extract the rows with Math >= 50. (The output below already contains the Money column, which is added in the next exercise.)

>>> attri_data_frame2
    ID  Math  English Sex  Money
0  100    50       90   M   1000
1  101    43       30   F   2000
2  102    33       20   F    500
3  105    76       50   M    300
4  107    98       30   M    700
>>> attri_data_frame2[attri_data_frame2['Math'] >= 50]
    ID  Math  English Sex  Money
0  100    50       90   M   1000
3  105    76       50   M    300
4  107    98       30   M    700

Mean of Money by Sex

>>> attri_data_frame2['Money'] = np.array([1000,2000, 500,300,700])
>>> attri_data_frame2
    ID  Math  English Sex  Money
0  100    50       90   M   1000
1  101    43       30   F   2000
2  102    33       20   F    500
3  105    76       50   M    300
4  107    98       30   M    700
>>> attri_data_frame2.groupby('Sex')['Money'].mean()
Sex
F    1250.000000
M     666.666667
Name: Money, dtype: float64

Column means like the following may come in handy when you want to handle missing values...

>>> attri_data_frame2['Money'].mean()
900.0
>>> attri_data_frame2['Math'].mean()
60.0
>>> attri_data_frame2['English'].mean()
44.0

CSV input / output

As an addition, this covers writing and reading CSV files with and without the index. A file has to be read in a way that matches whether it was saved with or without the index.

>>> attri_data_frame2.to_csv(r'samole0.csv',index=False)
>>> attri_data_frame2.to_csv(r'samole1.csv',index=True)
>>> df = pd.read_csv("samole0.csv")
>>> df
    ID  Math  English Sex  Money
0  100    50       90   M   1000
1  101    43       30   F   2000
2  102    33       20   F    500
3  105    76       50   M    300
4  107    98       30   M    700
>>> df = pd.read_csv("samole1.csv")
>>> df
   Unnamed: 0   ID  Math  English Sex  Money
0           0  100    50       90   M   1000
1           1  101    43       30   F   2000
2           2  102    33       20   F    500
3           3  105    76       50   M    300
4           4  107    98       30   M    700
>>> df = pd.read_csv("samole1.csv", index_col=0)
>>> df
    ID  Math  English Sex  Money
0  100    50       90   M   1000
1  101    43       30   F   2000
2  102    33       20   F    500
3  105    76       50   M    300
4  107    98       30   M    700

When the index argument is omitted, to_csv writes the index (the default is index=True)... In the end, it is better to stay aware of this.

>>> df.to_csv(r'samole3.csv')
>>> df_ = pd.read_csv("samole3.csv")
>>> df_
   Unnamed: 0   ID  Math  English Sex  Money
0           0  100    50       90   M   1000
1           1  101    43       30   F   2000
2           2  102    33       20   F    500
3           3  105    76       50   M    300
4           4  107    98       30   M    700
>>> df_ = pd.read_csv("samole3.csv", index_col=0)
>>> df_
    ID  Math  English Sex  Money
0  100    50       90   M   1000
1  101    43       30   F   2000
2  102    33       20   F    500
3  105    76       50   M    300
4  107    98       30   M    700
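The round trip can also be tried without touching the file system. This sketch relies on `to_csv` returning the CSV text when no path is given and feeds that text back through `io.StringIO`; the two-row frame and the variable names are made up for the example.

```python
import io
import pandas as pd

frame = pd.DataFrame({"ID": [100, 101], "Math": [50, 43]})

# With no path argument, to_csv returns the CSV content as a string.
text_without_index = frame.to_csv(index=False)
text_with_index = frame.to_csv()  # index=True is the default

# Saved without index: a plain read restores the same columns.
restored_plain = pd.read_csv(io.StringIO(text_without_index))

# Saved with index, read naively: the index shows up as an extra column.
restored_unnamed = pd.read_csv(io.StringIO(text_with_index))

# Saved with index, read with index_col=0: the first column becomes the index again.
restored_indexed = pd.read_csv(io.StringIO(text_with_index), index_col=0)

print(list(restored_plain.columns))
print(list(restored_unnamed.columns))
print(list(restored_indexed.columns))
```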

Summary

・ Summarized the basics of pandas, following this book.

・ pandas can also draw plots and perform various other kinds of processing, but I think it becomes usable once you understand the range summarized here.

・ For further study, links to the relatively easy-to-follow tutorials are added below.

Bonus

・ Package overview
・ Getting started tutorials
・ What kind of data does pandas handle?
・ How do I read and write tabular data?
・ How do I select a subset of a DataFrame?
・ How to create plots in pandas?
・ How to create new columns derived from existing columns?
・ How to calculate summary statistics?
・ How to reshape the layout of tables?
・ How to combine data from multiple tables?
・ How to handle time series data with ease?
・ How to manipulate textual data?
・ Comparison with other tools