[PYTHON] pandas Matplotlib Summary by usage

Pandas

Data read

import pandas as pd
df = pd.read_csv('data.csv')

Output statistical information

pandas.DataFrame.describe — pandas 1.0.4 documentation

df.describe()
TeamId	Score
count	4.709900e+04	47099.000000
mean	4.409698e+06	0.749839
std	9.901986e+05	0.099161
min	2.792400e+04	0.000000
25%	4.501446e+06	0.760760
50%	4.774358e+06	0.770330
75%	4.915774e+06	0.779900
max	5.051599e+06	1.000000

#Narrow down the output columns
df['Score'].describe()
count    47099.000000
mean         0.749839
std          0.099161
min          0.000000
25%          0.760760
50%          0.770330
75%          0.779900
max          1.000000
Name: Score, dtype: float64

Narrow down the data

Python Pandas: Boolean indexing on multiple columns - Stack Overflow

total_count = df['Score'].count() # 47099
partial_count = df[(0.6 < df['Score']) & (df['Score'] < 0.8)]['Score'].count() # 42893
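The filter above can be tried on a tiny frame. This is a minimal sketch: `df_demo` and its scores are made-up values for illustration, not the data used in this article.

```python
import pandas as pd

# Hypothetical scores, used only for illustration
df_demo = pd.DataFrame({'Score': [0.55, 0.65, 0.75, 0.80, 0.95]})

# Parenthesize each comparison: & binds more tightly than < and >
mask = (0.6 < df_demo['Score']) & (df_demo['Score'] < 0.8)

partial_count_demo = df_demo.loc[mask, 'Score'].count()  # rows with 0.6 < Score < 0.8
ratio_demo = mask.mean()  # share of rows that satisfy the condition

print(partial_count_demo, ratio_demo)  # 2 0.4
```

Keeping the boolean mask in its own variable makes it easy to reuse for both the count and the ratio.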

Convert categorized data to numbers

pandas.Series.map — pandas 1.0.4 documentation

# Convert Embarked (C, Q, S) to numeric values (1, 2, 3)
df_train['Embarked'] = df_train['Embarked'].map({'C': 1, 'Q': 2, 'S': 3})
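One thing worth checking when mapping with a dict is what happens to values that are not keys. A small sketch, assuming a made-up Series (`'X'` is a hypothetical category that is not in the mapping):

```python
import pandas as pd

# Hypothetical sample: None is a missing value, 'X' is not in the mapping
s_embarked = pd.Series(['C', 'Q', 'S', None, 'X'])

mapped = s_embarked.map({'C': 1, 'Q': 2, 'S': 3})

# Values without a matching key become NaN, so the result is a float Series
unmapped_count = mapped.isna().sum()
print(mapped.tolist())
print(unmapped_count)  # 2
```

Counting the NaN values after the conversion is a quick way to notice categories the mapping forgot.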

Rename column

pandas.DataFrame.rename — pandas 1.0.4 documentation

# Convert Sex (female, male) to numeric values (0, 1) and rename the column to Male
df_train['Sex'] = df_train['Sex'].map({'female': 0, 'male': 1})
df_train = df_train.rename(columns={'Sex': 'Male'})

Check for missing values

pandas.isnull — pandas 1.0.4 documentation
pandas.DataFrame.sum — pandas 1.0.4 documentation

df_train.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Male             0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Exclude missing values

# Exclude all rows containing missing values
df_train_dn = df_train.dropna()
# Drop the column given by its label (here Cabin)
df_train_dn = df_train_dn.drop('Cabin', axis='columns')
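The order of the two steps changes how many rows survive, because a mostly empty column such as Cabin removes most rows when `dropna()` runs first. A minimal sketch with a hypothetical four-row frame (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values in every column
df_demo = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, 'C123'],
    'Embarked': ['S', 'C', np.nan, 'S'],
})

# Same order as above: drop incomplete rows, then drop the column
rows_then_column = df_demo.dropna().drop('Cabin', axis='columns')

# Alternative order: drop the sparse column first, then incomplete rows
column_then_rows = df_demo.drop('Cabin', axis='columns').dropna()

# Only require selected columns to be present
subset_only = df_demo.dropna(subset=['Age'])

print(len(rows_then_column), len(column_then_rows), len(subset_only))  # 1 2 3
```

Which variant is appropriate depends on which columns the later analysis actually needs.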

Apply function to row / column values

pandas.DataFrame.apply — pandas 1.0.4 documentation

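import re

# Extract the title (Mr, Mrs, ...) from the Name column
def getTitle(row):
    name = row['Name']
    # Capture the word that follows a space and ends with '. '
    p = re.compile(r'.*\ (.*)\.\ .*')
    match = p.search(name)
    return match.group(1)

df_train['Title'] = df_train.apply(getTitle, axis=1)
# getFamilyName is not shown here; it is assumed to be defined in the same way as getTitle
df_train['FamilyName'] = df_train.apply(getFamilyName, axis=1)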
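The article does not show `getFamilyName`. The following is only a sketch of what such a helper might look like, assuming names follow the "Family, Title. Given names" pattern; `get_family_name` and `df_demo` are hypothetical names, and `str.extract` is shown as a possible vectorized alternative to a row-wise `apply`.

```python
import pandas as pd

def get_family_name(row):
    # Hypothetical helper: take everything before the first comma
    return row['Name'].split(',')[0].strip()

# Sample names in the assumed "Family, Title. Given names" format
df_demo = pd.DataFrame({'Name': ['Braund, Mr. Owen Harris', 'Heikkinen, Miss. Laina']})

df_demo['FamilyName'] = df_demo.apply(get_family_name, axis=1)

# Vectorized alternative for the title: text between ', ' and the next '.'
df_demo['Title'] = df_demo['Name'].str.extract(r',\s*([^.]+)\.', expand=False)

print(df_demo[['Title', 'FamilyName']])
```

Row-wise `apply` is flexible but slow on large frames, which is why the string accessor may be preferable when a single column is involved.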

Extract value

Indexing and selecting data — pandas 1.0.4 documentation
Get / change the value of any position with pandas at, iat, loc, iloc | note.nkmk.me

#Specify column label
df_train.loc[:, ['Title', 'FamilyName']].head()

# 	Title	FamilyName
# 0	Mr	Braund
# 1	Mrs	Cumings
# 2	Miss	Heikkinen
# 3	Mrs	Futrelle
# 4	Mr	Allen

Calculate the average for each category (GROUP BY)

How to use Pandas groupby --Qiita

# Find the average age and the number of records for each title

# Select the Age column before aggregating so that non-numeric columns are not averaged
s_age_mean_groupby_title = df_train.groupby('Title')['Age'].mean()
s_age_count_groupby_title = df_train.groupby('Title')['Age'].count()

df_age = pd.concat([s_age_mean_groupby_title, s_age_count_groupby_title], axis='columns')
df_age.columns = ['AgeMean', 'AgeCount']
df_age.sort_values(by='AgeCount', ascending=False)

#        AgeMean   AgeCount	
# Mr	 32.368090 398
# Miss	 21.773973 146
# Mrs	 35.728972 107
# Master  4.574167  36
# Rev    43.166667   6
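The same mean-and-count table can probably be built in one step with named aggregation. A minimal sketch, assuming a hypothetical five-row frame (`df_demo` and its values are invented):

```python
import pandas as pd

# Hypothetical data; one Age is missing on purpose
df_demo = pd.DataFrame({
    'Title': ['Mr', 'Mr', 'Miss', 'Mrs', 'Miss'],
    'Age': [30.0, 40.0, 20.0, 35.0, None],
})

df_age_demo = (
    df_demo.groupby('Title')['Age']
    .agg(AgeMean='mean', AgeCount='count')  # output column name = aggregation
    .sort_values(by='AgeCount', ascending=False)
)

print(df_age_demo)
```

Note that `count` ignores missing ages, so AgeCount can be smaller than the number of rows per title.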

Sort values

pandas.DataFrame.sort_values — pandas 1.0.5 documentation

By default, `sort_values()` leaves the DataFrame it is called on unchanged and returns a sorted copy. If `inplace=True` is specified, the DataFrame itself is sorted in place and the return value is `None`.
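A small sketch of the difference between the default behavior and `inplace=True`, using a hypothetical three-row frame:

```python
import pandas as pd

# Hypothetical counts, used only for illustration
df_demo = pd.DataFrame({'AgeCount': [6, 398, 146]}, index=['Rev', 'Mr', 'Miss'])

# Default: a sorted copy is returned, the original keeps its order
sorted_copy = df_demo.sort_values(by='AgeCount', ascending=False)
order_before = df_demo.index.tolist()

# inplace=True: the original is reordered and None is returned
result = df_demo.sort_values(by='AgeCount', ascending=False, inplace=True)
order_after = df_demo.index.tolist()

print(order_before)  # ['Rev', 'Mr', 'Miss']
print(order_after)   # ['Mr', 'Miss', 'Rev']
print(result)        # None
```

Assigning the result of an in-place sort back to the variable would therefore replace the DataFrame with `None`.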

Extract unique values

pandas.unique — pandas 1.0.5 documentation
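As an illustration of the function referenced above, here is a minimal sketch with a made-up Series of titles; `nunique` and `value_counts` are shown alongside because they often answer the follow-up questions.

```python
import pandas as pd

# Hypothetical titles, used only for illustration
s_title = pd.Series(['Mr', 'Mrs', 'Miss', 'Mr', 'Miss'])

unique_titles = pd.unique(s_title)  # order of first appearance, not sorted
n_titles = s_title.nunique()        # number of distinct values
counts = s_title.value_counts()     # occurrences per value

print(list(unique_titles))  # ['Mr', 'Mrs', 'Miss']
print(n_titles)             # 3
```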

Color when displaying data frames

pandas.io.formats.style.Styler.apply — pandas 1.0.5 documentation
python - Pandas style function to highlight specific columns - Stack Overflow

Matplotlib

Set graph axis

matplotlib.pyplot.axis — Matplotlib 3.2.1 documentation

# plt.axis takes the limits as [xmin, xmax, ymin, ymax]; it has no xlim / ylim keywords
plt.axis([-0.005, 1.005, 0, 9000])

matplotlib.axes.Axes.set_ylim — Matplotlib 3.2.2 documentation

Limits can also be set for each axis separately with set_xlim() and set_ylim().

#Set the upper limit of the y-axis to 100
plt.gca().set_ylim(top=100)
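The following sketch sets all four limits at once and then changes only the upper y limit. The plotted data is invented, and the `Agg` backend is selected only as an assumption so that the snippet runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: render off-screen, no window needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 5000])  # hypothetical data

# Set all four limits at once: [xmin, xmax, ymin, ymax]
plt.axis([-0.005, 1.005, 0, 9000])
limits_after_axis = plt.axis()  # called without arguments, returns the current limits

# Change only the upper y limit; the lower limit stays as it is
ax.set_ylim(top=100)
ylim_after = ax.get_ylim()

print(limits_after_axis)  # (-0.005, 1.005, 0.0, 9000.0)
print(ylim_after)         # (0.0, 100.0)
```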

Label adjustment

Specify the position of the label

plt.gca().yaxis.set_label_position('right')

Specify label coordinates

# Place the label on the right, then move it to the coordinates (x, y) = (1.25, 0.5)
# (the coordinates are in axes units by default, so x = 1.25 lies beyond the right edge)
plt.gca().yaxis.set_label_position('right')
plt.gca().yaxis.set_label_coords(1.25, 0.5)

Hide label

# Hide the x-axis tick labels
plt.gca().set_xticklabels([])
# Hide the y-axis tick labels
plt.gca().set_yticklabels([])

Place characters in any position

matplotlib.pyplot.text — Matplotlib 3.1.2 documentation

# Write a shared y-axis label (Response Time (s)) when there are multiple graphs
plt.gcf().text(
  plt.gcf().axes[0].get_position().x1 - 0.45,
  plt.gcf().axes[0].get_position().y1 - 0.5,
  'Response Time (s)',
  rotation=90
)

Adjust the width between graphs

matplotlib.pyplot.tight_layout — Matplotlib 3.1.2 documentation
[Python] Introducing how to eliminate overlapping characters output by Matplotlib! │ Python beginner's memorandum

plt.tight_layout()

Show legend

matplotlib.pyplot.legend — Matplotlib 3.1.2 documentation

plt.legend(["legend1", "legend2"])

Display in Japanese

Specify the font with prop. How to easily display Japanese with Matplotlib (Windows) | Gammasoft Co., Ltd.

plt.legend(["Squared value"], prop={"family":"MS Gothic"})

Display outside the graph

Specify the position with bbox_to_anchor. python - How to put the legend out of the plot - Stack Overflow

plt.legend(["Squared value"], prop={"family":"MS Gothic"}, bbox_to_anchor=(1.05, 1))

Show the slope of a linear fit on a scatter plot drawn with Matplotlib

numpy.polyfit — NumPy v1.18 Manual
Curve fitting using Numpy.polyfit --Qiita
Python memorandum Make a scatter plot of x, y and try linear approximation --Eat and sleep (Digistill's stationery diary)

import numpy as np

# Slope of the least-squares straight-line fit (index 0 is the highest-degree coefficient)
a = np.polyfit(x, y, 1)[0]
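For a degree-1 fit, `np.polyfit` returns the slope first and the intercept second. A minimal sketch with made-up points that lie exactly on y = 2x + 1:

```python
import numpy as np

# Hypothetical points on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Degree-1 fit: coefficients come back highest degree first
slope, intercept = np.polyfit(x, y, 1)

# Fitted values, e.g. for drawing the line over the scatter plot
y_fit = slope * x + intercept

print(round(slope, 6), round(intercept, 6))  # 2.0 1.0
```

Passing `x` and `y_fit` to `plt.plot` after `plt.scatter(x, y)` would overlay the fitted line on the points.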

Change tick labels from exponential notation to plain notation

matplotlib.axes.Axes.ticklabel_format — Matplotlib 3.2.1 documentation
Change the Y-axis scale of Matplotlib to exponential notation (10 Nth power notation) --Qiita

plt.ticklabel_format(style='plain')

Display the numbers on the label separated by three digits

Draw numbers on axis labels separated by three digits (matplotlib) --Qiita

plt.gca().xaxis.set_major_formatter(plt.FuncFormatter(lambda x, loc: '{:,}'.format(int(x))))
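The lambda above can also be written as a named function, which makes it easy to check the formatting on its own. A minimal sketch; `thousands` is a hypothetical name chosen for this example.

```python
from matplotlib.ticker import FuncFormatter

def thousands(x, pos):
    """Format a tick value with comma separators; pos (tick position) is unused."""
    return '{:,}'.format(int(x))

formatter = FuncFormatter(thousands)

# A formatter can be called directly with (value, position) to inspect its output
sample = formatter(1234567.0, 0)
print(sample)  # 1,234,567
```

The formatter would then be attached with `plt.gca().xaxis.set_major_formatter(formatter)`. Because of `int(x)`, fractional tick values are truncated.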

Sort labels

Legend guide — Matplotlib 3.2.2 documentation
python - How is order of items in matplotlib legend determined? - Stack Overflow

handles = []
for label in labels:
  handle = plt.scatter(..., label=label)
  handles.append(handle)

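# Sort the (label, handle) pairs by label; the lambda is the sort key
labels, handles = zip(*sorted(zip(labels, handles), key=lambda x: x[0]))
# Pass the sorted handles and labels to the legend
plt.legend(handles, labels)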
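A self-contained sketch of the same idea, assuming three made-up series labelled 'b', 'c' and 'a'. Here the handles are collected with `get_legend_handles_labels()` instead of by hand, and the `Agg` backend is an assumption so that no display is required.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: render off-screen, no window needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for i, label in enumerate(['b', 'c', 'a']):  # hypothetical labels, unsorted
    ax.scatter([i], [i], label=label)

# Collect what the legend would show, then sort the pairs by label
handles, labels = ax.get_legend_handles_labels()
pairs = sorted(zip(labels, handles), key=lambda pair: pair[0])
sorted_labels, sorted_handles = zip(*pairs)

legend = ax.legend(sorted_handles, sorted_labels)
legend_texts = [text.get_text() for text in legend.get_texts()]
print(legend_texts)  # ['a', 'b', 'c']
```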

Adjust the size of the graph

matplotlib.pyplot.subplots_adjust — Matplotlib 3.2.2 documentation

plt.figure()

plt.subplot(121)
# ...
plt.subplot(122)
# ...

#Adjust width between subplot
plt.subplots_adjust(wspace=1, right=3)

use ggplot

ggplot is a popular graphing tool in R.

Its distinguishing feature is that a graph is described as multiple layers drawn on top of one another. In Matplotlib, plt.style.use('ggplot') applies a ggplot-like visual style.
What is R ｜ ggplot2 ｜ hanaori ｜ note

plt.style.use('ggplot')

# Plot age against gender (Male) for the survivors
df_train_survived = df_train_dn[df_train_dn.Survived == 1]
# Select by column label so the code does not depend on column order
df_train_survived_age = df_train_survived.loc[:, 'Age']
df_train_survived_male = df_train_survived.loc[:, 'Male']
plt.scatter(
  df_train_survived_age,
  df_train_survived_male,
  color="#cc6699",
  alpha=0.5
)

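# Plot age against gender (Male) for the passengers who died
df_train_dead = df_train_dn[df_train_dn.Survived == 0]
# Select by column label so the code does not depend on column order
df_train_dead_age = df_train_dead.loc[:, 'Age']
df_train_dead_male = df_train_dead.loc[:, 'Male']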
plt.scatter(
  df_train_dead_age,
  df_train_dead_male,
  color="#6699cc",
  alpha=0.5
)

plt.show()

Other

Rounding

9.4. decimal — Decimal fixed point and floating point arithmetic — Python 2.7.18 documentation

Specify the number of digits with the first argument of Decimal.quantize().

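from collections import Counter
from decimal import Decimal, ROUND_HALF_UP

# Round to 3 decimal places; str() avoids carrying over the float's binary error
decile = lambda num: Decimal(str(num)).quantize(Decimal('.001'), rounding=ROUND_HALF_UP)
histogram = Counter(decile(score) for score in df['Score'])
print(histogram.keys())
# dict_keys([Decimal('0.761'), Decimal('0.000'), Decimal('0.775'), ...])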
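The reason for reaching for `decimal` is that the built-in `round()` rounds halves to the nearest even digit, and a `Decimal` built directly from a float inherits its binary representation error. A short sketch of both effects; `round_half_up` is a hypothetical helper name.

```python
from decimal import Decimal, ROUND_HALF_UP

def round_half_up(num, digits='.001'):
    # Hypothetical helper: go through str() so the decimal literal is kept as written
    return Decimal(str(num)).quantize(Decimal(digits), rounding=ROUND_HALF_UP)

builtin = round(2.5)               # 2: halves go to the nearest even integer
half_up = round_half_up(2.5, '1')  # Decimal('3')

# 2.675 is stored slightly below 2.675 as a binary float
from_float = Decimal(2.675).quantize(Decimal('.01'), rounding=ROUND_HALF_UP)  # Decimal('2.67')
from_str = round_half_up(2.675, '.01')                                        # Decimal('2.68')

print(builtin, half_up, from_float, from_str)  # 2 3 2.67 2.68
```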

Use the index with `map()`

Getting index of item while processing a list using map in python - Stack Overflow
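One common way to get the index inside `map()` is to map over `enumerate()`, which yields (index, item) pairs. A minimal sketch with a made-up list of names:

```python
# Hypothetical list, used only for illustration
names = ['Braund', 'Cumings', 'Heikkinen']

# enumerate() yields (index, item) pairs, so the lambda receives one tuple
numbered = list(map(lambda pair: f'{pair[0]}: {pair[1]}', enumerate(names)))

# The equivalent list comprehension is often easier to read
also_numbered = [f'{i}: {name}' for i, name in enumerate(names)]

print(numbered)  # ['0: Braund', '1: Cumings', '2: Heikkinen']
```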

Change the number of digits displayed for a `float`

How to output by specifying the number of digits (numbers, decimal places, etc.) in Python print | HEADBOOST

# Show 3 digits after the decimal point in exponential notation
# e.g. float_number = 7.918330583e-06
'{:.3e}'.format(float_number)
# '7.918e-06'
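The same format specifications also work in f-strings. A short sketch comparing exponential and fixed-point output; the second number is a made-up value.

```python
float_number = 7.918330583e-06

exp_3 = f'{float_number:.3e}'    # exponential notation, 3 digits after the point
fixed_8 = f'{float_number:.8f}'  # fixed-point notation, 8 digits after the point
score = f'{0.7498394:.3f}'       # hypothetical value rounded to 3 digits

print(exp_3)    # 7.918e-06
print(fixed_8)  # 0.00000792
print(score)    # 0.750
```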