[PYTHON] pandas Matplotlib Summary by usage

Pandas

Read data

import pandas as pd
df = pd.read_csv('data.csv')

Output statistical information

pandas.DataFrame.describe — pandas 1.0.4 documentation

df.describe()
TeamId	Score
count	4.709900e+04	47099.000000
mean	4.409698e+06	0.749839
std	9.901986e+05	0.099161
min	2.792400e+04	0.000000
25%	4.501446e+06	0.760760
50%	4.774358e+06	0.770330
75%	4.915774e+06	0.779900
max	5.051599e+06	1.000000

#Narrow down the output columns
df['Score'].describe()
count    47099.000000
mean         0.749839
std          0.099161
min          0.000000
25%          0.760760
50%          0.770330
75%          0.779900
max          1.000000
Name: Score, dtype: float64

Narrow down the data

Python Pandas: Boolean indexing on multiple columns - Stack Overflow

total_count = df['Score'].count() # 47099
partial_count = df[(0.6 < df['Score']) & (df['Score'] < 0.8)]['Score'].count() # 42893
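
A minimal sketch of the same filter on a small made-up DataFrame (the values are illustrative, not the article's data.csv). Each comparison needs its own parentheses because & binds more tightly than < and >.

```python
import pandas as pd

# Illustrative data only; not the article's dataset.
df_demo = pd.DataFrame({'Score': [0.55, 0.61, 0.75, 0.79, 0.80, 0.95]})

# Each comparison yields a boolean Series; & combines them element-wise.
mask = (0.6 < df_demo['Score']) & (df_demo['Score'] < 0.8)

filtered = df_demo[mask]         # rows where the mask is True
partial_count_demo = mask.sum()  # True counts as 1, so this is the row count

print(filtered['Score'].tolist())  # [0.61, 0.75, 0.79]
print(partial_count_demo)          # 3
```

Summing the mask gives the same count as filtering first and then calling count(), as long as the column has no missing values.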

Convert categorized data to numbers

pandas.Series.map — pandas 1.0.4 documentation

# Convert Embarked (C, Q, S) to numeric values (1, 2, 3)
df_train['Embarked'] = df_train['Embarked'].map({'C': 1, 'Q': 2, 'S': 3})
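
One thing worth checking after map(): values that are not keys of the dictionary become NaN. A small sketch with made-up values (the 'X' entry is hypothetical, added only to show the effect):

```python
import pandas as pd

# Illustrative values; 'X' and None are not keys of the mapping.
embarked = pd.Series(['C', 'Q', 'S', None, 'X'])
codes = embarked.map({'C': 1, 'Q': 2, 'S': 3})

print(codes.tolist())        # [1.0, 2.0, 3.0, nan, nan]
print(codes.isnull().sum())  # 2
```

Because NaN is a float, the result column becomes float64 whenever any value is left unmapped.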

Rename column

pandas.DataFrame.rename — pandas 1.0.4 documentation

# Convert Sex (female, male) to numeric values (0, 1) and rename the column to Male
df_train['Sex'] = df_train['Sex'].map({'female': 0, 'male': 1})
df_train = df_train.rename(columns={'Sex': 'Male'})

Check for missing values

pandas.isnull — pandas 1.0.4 documentation
pandas.DataFrame.sum — pandas 1.0.4 documentation

df_train.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Male             0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Exclude missing values

# Exclude all rows containing missing values
df_train_dn = df_train.dropna()
# Exclude the specified column
df_train_dn = df_train_dn.drop('Cabin', axis='columns')
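
The order of these two steps affects how many rows survive. A sketch on made-up data (column names borrowed from the example above, values invented) comparing dropna() on everything, dropna() restricted with subset, and dropping the sparse column first:

```python
import numpy as np
import pandas as pd

# Illustrative data: Cabin is mostly missing, Age is missing once.
df_demo = pd.DataFrame({
    'Age':   [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, 'C123'],
})

rows_all = df_demo.dropna()                # every column must be present
rows_age = df_demo.dropna(subset=['Age'])  # only Age must be present
no_cabin = df_demo.drop('Cabin', axis='columns').dropna()

print(len(rows_all), len(rows_age), len(no_cabin))  # 1 3 3
```

If a mostly empty column is going to be discarded anyway, dropping it before dropna() keeps more rows.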

Apply function to row / column values

pandas.DataFrame.apply — pandas 1.0.4 documentation

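import re

# Extract the title (Mr, Mrs, Miss, ...) from the Name column
def getTitle(row):
    name = row['Name']
    # Capture the word that sits between a space and ". "
    p = re.compile(r'.*\ (.*)\.\ .*')
    title = p.search(name)
    return title.group(1)

df_train['Title'] = df_train.apply(getTitle, axis=1)
# getFamilyName is not shown in this article; it is assumed to be defined in the same way
df_train['FamilyName'] = df_train.apply(getFamilyName, axis=1)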

Extract value

Indexing and selecting data — pandas 1.0.4 documentation
Get / change the value of any position with pandas at, iat, loc, iloc | note.nkmk.me

#Specify column label
df_train.loc[:, ['Title', 'FamilyName']].head()

# 	Title	FamilyName
# 0	Mr	Braund
# 1	Mrs	Cumings
# 2	Miss	Heikkinen
# 3	Mrs	Futrelle
# 4	Mr	Allen

Calculate the average for each category (GROUP BY)

How to use Pandas groupby - Qiita

# Find the mean age and the number of records for each title
# (select 'Age' before aggregating so non-numeric columns are not averaged)

s_age_mean_groupby_title = df_train.groupby('Title')['Age'].mean()
s_age_count_groupby_title = df_train.groupby('Title')['Age'].count()

df_age = pd.concat([s_age_mean_groupby_title, s_age_count_groupby_title], axis='columns')
df_age.columns = ['AgeMean', 'AgeCount']
df_age.sort_values(by='AgeCount', ascending=False)

#        AgeMean   AgeCount	
# Mr	 32.368090 398
# Miss	 21.773973 146
# Mrs	 35.728972 107
# Master  4.574167  36
# Rev    43.166667   6
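
The same mean-and-count table can probably be built in one step with named aggregation. This is a sketch on made-up data, not the article's Titanic frame; the column names AgeMean and AgeCount follow the example above.

```python
import pandas as pd

# Illustrative data; one Age is missing to show that count() skips NaN.
df_demo = pd.DataFrame({
    'Title': ['Mr', 'Mr', 'Miss', 'Mrs', 'Mr'],
    'Age':   [30.0, 40.0, 20.0, 35.0, None],
})

df_age_demo = (
    df_demo.groupby('Title')['Age']
    .agg(AgeMean='mean', AgeCount='count')  # output column name = aggregation
    .sort_values(by='AgeCount', ascending=False)
)
print(df_age_demo)
```

Both mean() and count() ignore missing ages, so AgeCount is the number of rows that actually contributed to AgeMean.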

Sort values

pandas.DataFrame.sort_values — pandas 1.0.5 documentation

Note: by default, the DataFrame on which sort_values() is called is not changed; the sorted result is returned as a new object. If inplace=True is specified, the DataFrame itself is sorted and the return value is None.
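
A short sketch of the difference between the two modes, on a made-up frame:

```python
import pandas as pd

# Illustrative data only.
df_demo = pd.DataFrame({'AgeCount': [6, 398, 36]}, index=['Rev', 'Mr', 'Master'])

# Default: a sorted copy is returned and the original keeps its order.
sorted_copy = df_demo.sort_values(by='AgeCount', ascending=False)
print(sorted_copy.index.tolist())  # ['Mr', 'Master', 'Rev']
print(df_demo.index.tolist())      # ['Rev', 'Mr', 'Master']

# inplace=True: the original is sorted and None is returned.
result = df_demo.sort_values(by='AgeCount', inplace=True)
print(result)                      # None
print(df_demo.index.tolist())      # ['Rev', 'Master', 'Mr']
```

Assigning the result of an inplace=True call back to the variable is a common mistake, since it replaces the DataFrame with None.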

Extract unique values

pandas.unique — pandas 1.0.5 documentation
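
A minimal sketch with made-up values. pd.unique() and Series.unique() return the distinct values in order of first appearance; nunique() and value_counts() are related helpers for counting them.

```python
import pandas as pd

# Illustrative values only.
titles = pd.Series(['Mr', 'Mrs', 'Miss', 'Mr', 'Miss'])

unique_titles = pd.unique(titles)    # array of distinct values, unsorted
print(list(unique_titles))           # ['Mr', 'Mrs', 'Miss']
print(titles.unique().tolist())      # ['Mr', 'Mrs', 'Miss']
print(titles.nunique())              # 3

counts = titles.value_counts().to_dict()  # occurrences per distinct value
print(counts['Mr'], counts['Mrs'])        # 2 1
```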

Color cells when displaying a DataFrame

pandas.io.formats.style.Styler.apply — pandas 1.0.5 documentation
python - Pandas style function to highlight specific columns - Stack Overflow

Matplotlib

Set graph axis

matplotlib.pyplot.axis — Matplotlib 3.2.1 documentation

import matplotlib.pyplot as plt

# plt.axis() takes the limits as [xmin, xmax, ymin, ymax]
plt.axis([-0.005, 1.005, 0, 9000])

matplotlib.axes.Axes.set_ylim — Matplotlib 3.2.2 documentation

The limits can also be set per axis with set_xlim() and set_ylim().

#Set the upper limit of the y-axis to 100
plt.gca().set_ylim(top=100)
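
The same settings can be written against an explicit Axes object instead of plt.gca(). A sketch with made-up data; the Agg backend line is an assumption so that it runs without a display, and can be dropped in a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: headless run; omit in a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 0.5, 1], [10, 250, 40])  # illustrative data

ax.set_xlim(-0.005, 1.005)  # both x limits
ax.set_ylim(top=100)        # only the upper y limit; the lower one is kept

print(ax.get_xlim())
print(ax.get_ylim()[1])
```

Passing only top (or bottom, left, right) changes one side and leaves the other as it was.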

Label adjustment

Specify the position of the label

plt.gca().yaxis.set_label_position('right')

Specify label coordinates

# Put the label on the right side and move it to the coordinates (x, y) = (1.25, 0.5)
# (the label ends up offset from the default right-side position)
plt.gca().yaxis.set_label_position('right')
plt.gca().yaxis.set_label_coords(1.25, 0.5)

Hide tick labels

# Hide the x-axis tick labels
plt.gca().set_xticklabels([])
# Hide the y-axis tick labels
plt.gca().set_yticklabels([])

Place text at any position

matplotlib.pyplot.text — Matplotlib 3.1.2 documentation

# Write a shared y-axis label (Response Time (s)) when there are multiple graphs
plt.gcf().text(
  plt.gcf().axes[0].get_position().x1 - 0.45,
  plt.gcf().axes[0].get_position().y1 - 0.5,
  'Response Time (s)',
  rotation=90
)

Adjust the width between graphs

matplotlib.pyplot.tight_layout — Matplotlib 3.1.2 documentation
[Python] Introducing how to eliminate overlapping characters output by Matplotlib! │ Python beginner's memorandum

plt.tight_layout()

Show legend

matplotlib.pyplot.legend — Matplotlib 3.1.2 documentation

plt.legend(["legend1", "legend2"])

Display Japanese text

Specify the font with prop.
How to easily display Japanese with Matplotlib (Windows) | Gammasoft Co., Ltd.

plt.legend(["Squared value"], prop={"family":"MS Gothic"})

Display outside the graph

Specify the position with bbox_to_anchor.
python - How to put the legend out of the plot - Stack Overflow

plt.legend(["Squared value"], prop={"family":"MS Gothic"}, bbox_to_anchor=(1.05, 1))

Show the slope of a linear approximation of a scatter plot drawn with Matplotlib

import numpy as np

# Fit a straight line (degree 1); the first coefficient is the slope
a = np.polyfit(x, y, 1)[0]
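
A sketch of the fit on made-up, exactly linear data, to show what np.polyfit returns for degree 1: the coefficients come back highest power first, so the slope is element 0 and the intercept is element 1.

```python
import numpy as np

# Illustrative data on the line y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

slope, intercept = np.polyfit(x, y, 1)  # [slope, intercept] for degree 1
y_fit = slope * x + intercept           # points on the fitted line

print(round(slope, 6), round(intercept, 6))  # 2.0 1.0
```

Passing x and y_fit to plt.plot() would draw the fitted line over the scatter plot.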

Change exponential notation on labels to plain notation

plt.ticklabel_format(style='plain')

Display the numbers on the label separated by three digits

Draw numbers on axis labels separated by three digits (matplotlib) - Qiita

plt.gca().xaxis.set_major_formatter(plt.FuncFormatter(lambda x, loc: '{:,}'.format(int(x))))
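
The formatter function can be pulled out and checked on its own. The sketch below also uses StrMethodFormatter from matplotlib.ticker as an alternative spelling. The data is made up, and the Agg backend line is an assumption so that it runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: headless run; omit in a notebook
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter, StrMethodFormatter

def thousands(x, pos):
    """Format a tick value with a comma every three digits."""
    return '{:,}'.format(int(x))

fig, ax = plt.subplots()
ax.plot([0, 1500000], [0, 2500000])  # illustrative data

ax.xaxis.set_major_formatter(FuncFormatter(thousands))
# Same idea as a format string; x is the tick value.
y_formatter = StrMethodFormatter('{x:,.0f}')
ax.yaxis.set_major_formatter(y_formatter)

print(thousands(1234567, None))  # 1,234,567
```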

Sort labels

Legend guide — Matplotlib 3.2.2 documentation
python - How is order of items in matplotlib legend determined? - Stack Overflow

handles = []
for label in labels:
  handle = plt.scatter(..., label=label)
  handles.append(handle)

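# Sort the (label, handle) pairs; the lambda is the sort key (here, the label)
labels, handles = zip(*sorted(zip(labels, handles), key=lambda x: x[0]))
# Pass the sorted handles and labels to the legend
plt.legend(handles, labels)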

Adjust the size of the graph

matplotlib.pyplot.subplots_adjust — Matplotlib 3.2.2 documentation

plt.figure()

plt.subplot(121)
# ...
plt.subplot(122)
# ...

# Adjust the spacing between subplots (wspace) and the right edge of the subplot area (right)
plt.subplots_adjust(wspace=1, right=3)

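Use the ggplot style

ggplot is a popular graphing tool in R.

Its distinguishing feature is that a graph is described as multiple layers stacked on top of each other. In Matplotlib, plt.style.use('ggplot') applies a built-in style sheet that imitates ggplot's appearance.
What is R | ggplot2 | hanaori | note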

plt.style.use('ggplot')

# Plot the gender of the survivors against age
# (columns are selected by label so the code does not depend on column order)
df_train_survived = df_train_dn[df_train_dn.Survived == 1]
df_train_survived_age = df_train_survived['Age']
df_train_survived_male = df_train_survived['Male']
plt.scatter(
  df_train_survived_age,
  df_train_survived_male,
  color="#cc6699",
  alpha=0.5
)

# Plot the gender of the dead against age
df_train_dead = df_train_dn[df_train_dn.Survived == 0]
df_train_dead_age = df_train_dead['Age']
df_train_dead_male = df_train_dead['Male']
plt.scatter(
  df_train_dead_age,
  df_train_dead_male,
  color="#6699cc",
  alpha=0.5
)

plt.show()

Other

Rounding

9.4. decimal — Decimal fixed point and floating point arithmetic — Python 2.7.18 documentation

Specify the number of digits with the first argument of Decimal.quantize().

from collections import Counter
from decimal import Decimal, ROUND_HALF_UP

decile = lambda num: Decimal(num).quantize(Decimal('.001'), rounding=ROUND_HALF_UP)
histogram = Counter(decile(score) for score in df['Score'])
print(histogram.keys())
# dict_keys([Decimal('0.761'), Decimal('0.000'), Decimal('0.775'), ...])
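
A sketch of why ROUND_HALF_UP is used instead of the built-in round(), and of one pitfall. The helper name round_half_up is hypothetical, and unlike the snippet above it goes through str(), which is an assumption on my part: Decimal(float) converts the float's exact binary value, which can sit just below the printed value.

```python
from decimal import Decimal, ROUND_HALF_UP

def round_half_up(num, digits='.001'):
    """Round half up, going through str() to use the printed value."""
    return Decimal(str(num)).quantize(Decimal(digits), rounding=ROUND_HALF_UP)

# Built-in round() rounds ties to the nearest even number.
print(round(2.5))               # 2
print(round_half_up(2.5, '1'))  # 3

# 2.675 is stored as 2.67499999..., so the two constructions differ.
from_float = Decimal(2.675).quantize(Decimal('.01'), rounding=ROUND_HALF_UP)
from_str = round_half_up(2.675, '.01')
print(from_float, from_str)     # 2.67 2.68
```

Which construction is appropriate depends on whether the stored binary value or the displayed decimal value should decide the tie.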

Use the index with map()

Getting index of item while processing a list using map in python - Stack Overflow
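
A minimal sketch of the usual approach, which is to wrap the list in enumerate() so that map() receives (index, item) pairs. The list contents are made up.

```python
scores = [0.76, 0.77, 0.78]  # illustrative values

# enumerate() yields (index, item) tuples, so the lambda takes one pair.
labeled = list(map(lambda pair: '{}: {}'.format(pair[0], pair[1]), enumerate(scores)))
print(labeled)  # ['0: 0.76', '1: 0.77', '2: 0.78']

# The same result as a list comprehension, which unpacks the pair directly.
labeled_lc = ['{}: {}'.format(i, s) for i, s in enumerate(scores)]
print(labeled_lc == labeled)  # True
```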

Change the number of digits displayed for a float

How to output by specifying the number of digits (numbers, decimal places, etc.) in Python print | HEADBOOST

# Show 3 digits after the decimal point in exponential notation
# e.g. float_number = 7.918330583e-06
'{:.3e}'.format(float_number)
# 7.918e-06
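
The same format specification also works in f-strings, and other presentation types follow the same pattern. A short sketch (the second number is made up):

```python
float_number = 7.918330583e-06

as_exp = f'{float_number:.3e}'    # exponential, 3 digits after the point
as_fixed = f'{float_number:.8f}'  # fixed point, 8 digits after the point
with_commas = f'{1234.5678:,.2f}' # thousands separator, 2 decimals

print(as_exp)       # 7.918e-06
print(as_fixed)     # 0.00000792
print(with_commas)  # 1,234.57
```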
