Pandas demande souvent: "Que dois-je faire quand je veux faire ça?", Je vais donc les résumer par utilisation.
Dans cet exemple de code,
Liste des survivants du Titanic (train.csv) fournie par Kaggle
Lire et utiliser avec pandas.read_csv ().
Titanic: Machine Learning from Disaster | Kaggle
import pandas as pd
df = pd.read_csv('train.csv')
pandas.read_csv — pandas 1.0.5 documentation
df.describe()
| PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | |
|---|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 | 
| mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 | 
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 | 
| min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 | 
| 25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 | 
| 50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 | 
| 75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 | 
| max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 | 
#Réduisez les colonnes de sortie
df['Age'].describe()
count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64
pandas.DataFrame.describe — pandas 1.0.5 documentation
df['Age'].count()
714
Vous pouvez vérifier le nombre de lignes / colonnes contenant des valeurs autres que «None», «NaN» et «NaT».
# 20 < Age <Extraire 40 lignes
df[(20 < df['Age']) & (df['Age'] < 40)].head()
| Index | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 
Si vous souhaitez restreindre par plusieurs conditions ET / OU, spécifiez les conditions en les entourant de (), comme df [(A) & (B)].
# Embarked(C, Q, S)Valeur numérique(1, 2, 3)Conversion en
df['Embarked'] = df['Embarked'].map({'C': 1, 'Q': 2, 'S': 3})
| Index | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | 3.0 | 
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | 1.0 | 
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | 3.0 | 
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | 3.0 | 
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | 3.0 | 
pandas.Series.map — pandas 1.0.4 documentation
# Sex(female, male)Valeur numérique(0, 1)Convertir en et nom de colonne(Sex)À l'homme
df['Sex'] = df['Sex'].map({'female': 0, 'male': 1})
df = df.rename(columns={'Sex': 'Male'})
| Index | PassengerId | Survived | Pclass | Name | Male | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | 3.0 | 
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | 1.0 | 
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | 3.0 | 
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | 3.0 | 
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | 3.0 | 
pandas.DataFrame.rename — pandas 1.0.4 documentation
Si vous passez un tableau avec une liste de noms de colonnes, vous pouvez modifier tous les noms de colonnes à la fois.
pd.DataFrame({'c': [1, 2], 'd': [10, 20]}).columns = ['a', 'b']
| Index | a | b | 
|---|---|---|
| 0 | 1 | 10 | 
| 1 | 2 | 20 | 
python - Renaming columns in pandas - Stack Overflow
df.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Male             0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
pandas.isnull — pandas 1.0.4 documentation pandas.DataFrame.sum — pandas 1.0.4 documentation
#Exclure toutes les lignes contenant des valeurs manquantes
df_dn = df.dropna()
df_dn.count()
PassengerId    183
Survived       183
Pclass         183
Name           183
Male           183
Age            183
SibSp          183
Parch          183
Ticket         183
Fare           183
Cabin          183
Embarked       183
dtype: int64
pandas.DataFrame.dropna — pandas 1.0.5 documentation
#Extraire les colonnes Survived et Age
df[['Survived', 'Age']]
| Index | Survived | Age | 
|---|---|---|
| 0 | 0 | 22.0 | 
| 1 | 1 | 38.0 | 
| 2 | 1 | 26.0 | 
| 3 | 1 | 35.0 | 
| 4 | 0 | 35.0 | 
Indexing and selecting data — pandas 1.0.4 documentation Obtenir / modifier la valeur de n'importe quelle position avec les pandas à, iat, loc, iloc | note.nkmk.me
df_dn = df.drop('Cabin', axis='columns')
| Index | PassengerId | Survived | Pclass | Name | Male | Age | SibSp | Parch | Ticket | Fare | Embarked | 
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | 3.0 | 
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | 1.0 | 
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | 3.0 | 
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1000 | 3.0 | 
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.0500 | 3.0 | 
pandas.DataFrame.dropna — pandas 1.0.5 documentation
import re
#Fonction pour extraire le titre
def getTitle(row):
    name = row['Name']
    p = re.compile('.*\ (.*)\.\ .*')
    surname = p.search(name)
    return surname.group(1)
df['Title'] = df.apply(getTitle, axis='columns')
| Index | PassengerId | Survived | Pclass | Name | Male | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | 3.0 | Mr | 
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | 1.0 | Mrs | 
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | 3.0 | Miss | 
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | 3.0 | Mrs | 
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | 3.0 | Mr | 
pandas.DataFrame.apply — pandas 1.0.5 documentation
#Trouvez l'âge moyen pour chaque titre
df.groupby('Title').mean()['Age']
Title
Capt        70.000000
Col         58.000000
Countess    33.000000
Don         40.000000
Dr          42.000000
Jonkheer    38.000000
L           54.000000
Lady        48.000000
Major       48.500000
Master       4.574167
Miss        21.773973
Mlle        24.000000
Mme         24.000000
Mr          32.368090
Mrs         35.728972
Ms          28.000000
Rev         43.166667
Sir         49.000000
Name: Age, dtype: float64
Vous pouvez également trouver le nombre d'éléments de données pour chaque titre en utilisant df.groupby ('Titre'). Count ().
Comment utiliser Pandas groupby --Qiita
df.sort_values(by='Age')
| Index | PassengerId | Survived | Pclass | Name | Male | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | AgeMean | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 803 | 804 | 1 | 3 | Thomas, Master. Assad Alexander | 1 | 0.42 | 0 | 1 | 2625 | 8.5167 | NaN | 1.0 | Master | NaN | 
| 755 | 756 | 1 | 2 | Hamalainen, Master. Viljo | 1 | 0.67 | 1 | 1 | 250649 | 14.5000 | NaN | 3.0 | Master | NaN | 
| 644 | 645 | 1 | 3 | Baclini, Miss. Eugenie | 0 | 0.75 | 2 | 1 | 2666 | 19.2583 | NaN | 1.0 | Miss | NaN | 
| 469 | 470 | 1 | 3 | Baclini, Miss. Helene Barbara | 0 | 0.75 | 2 | 1 | 2666 | 19.2583 | NaN | 1.0 | Miss | NaN | 
| 78 | 79 | 1 | 2 | Caldwell, Master. Alden Gates | 1 | 0.83 | 0 | 2 | 248738 | 29.0000 | NaN | 3.0 | Master | NaN | 
pandas.DataFrame.sort_values — pandas 1.0.5 documentation
Normalement, le DaraFrame qui a exécuté sort_values () est inchangé et les valeurs de retour sont obtenues dans un état trié.
Si ʻascending = False est spécifié, les colonnes spécifiées seront triées par ordre décroissant.  Si ʻinplace = True est spécifié, le DataFrame qui a exécutésort_values ()sera trié et la valeur de retour sera None.
df['Survived'].unique()
array([0, 1], dtype=int64)
pandas.unique — pandas 1.0.5 documentation
df[df['Name'].str.contains('Thomas')]
| Index | PassengerId | Survived | Pclass | Name | Male | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | AgeMean | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 149 | 150 | 0 | 2 | Byles, Rev. Thomas Roussel Davids | 1 | 42.00 | 0 | 0 | 244310 | 13.0000 | NaN | 3.0 | Rev | NaN | 
| 151 | 152 | 1 | 1 | Pears, Mrs. Thomas (Edith Wearne) | 0 | 22.00 | 1 | 0 | 113776 | 66.6000 | C2 | 3.0 | Mrs | NaN | 
| 159 | 160 | 0 | 3 | Sage, Master. Thomas Henry | 1 | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | 3.0 | Master | NaN | 
| 186 | 187 | 1 | 3 | O'Brien, Mrs. Thomas (Johanna "Hannah" Godfrey) | 0 | NaN | 1 | 0 | 370365 | 15.5000 | NaN | 2.0 | Mrs | NaN | 
| 252 | 253 | 0 | 1 | Stead, Mr. William Thomas | 1 | 62.00 | 0 | 0 | 113514 | 26.5500 | C87 | 3.0 | Mr | NaN | 
pandas.Series.str.contains — pandas 1.0.5 documentation python - How to filter rows containing a string pattern from a Pandas dataframe - Stack Overflow
Utilisez l'opérateur ~ si vous voulez récupérer * des valeurs qui n'incluent pas * de chaîne spécifique.
df[~df['Name'].str.contains('Thomas')]
| Index | PassengerId | Survived | Pclass | Name | Male | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | AgeMean | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | 3.0 | Mr | NaN | 
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | 1.0 | Mrs | NaN | 
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | 3.0 | Miss | NaN | 
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | 3.0 | Mrs | NaN | 
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | 3.0 | Mr | NaN | 
python - Search for "does-not-contain" on a DataFrame in pandas - Stack Overflow
#valeur"Mr"Rendre la couleur d'arrière-plan de la colonne jaune
df.style.apply(lambda x: ['background-color: yellow' if v == 'Mr' else '' for v in x])
| Index | PassengerId | Survived | Pclass | Name | Male | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | AgeMean | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22.000000 | 1 | 0 | A/5 21171 | 7.250000 | nan | 3.000000 | Mr | nan | 
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | 0 | 38.000000 | 1 | 0 | PC 17599 | 71.283300 | C85 | 1.000000 | Mrs | nan | 
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26.000000 | 0 | 0 | STON/O2. 3101282 | 7.925000 | nan | 3.000000 | Miss | nan | 
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.000000 | 1 | 0 | 113803 | 53.100000 | C123 | 3.000000 | Mrs | nan | 
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 1 | 35.000000 | 0 | 0 | 373450 | 8.050000 | nan | 3.000000 | Mr | nan | 
Lorsqu'elle est ouverte dans Jupyter Notebook, la colonne correspondante s'affiche avec un arrière-plan coloré. Notez que lorsque vous ouvrez le bloc-notes Jupyter sur GitHub, la couleur d'arrière-plan ne sera pas ajoutée.
pandas.io.formats.style.Styler.apply — pandas 1.0.5 documentation python - Pandas style function to highlight specific columns - Stack Overflow
df.to_csv('output.csv', index=False)
Si vous ne souhaitez pas inclure l'index (numéro de ligne), spécifiez ʻindex = False`. pandas.DataFrame.to_csv — pandas 1.0.5 documentation
Si vous ne voulez pas insérer de saut de ligne sur la dernière ligne du fichier, passez line_terminator =" " uniquement sur la dernière ligne
python - How to stop writing a blank line at the end of csv file - pandas - Stack Overflow
Recommended Posts