Introduction

I think the first step in analyzing data is to understand its characteristics. pandas-profiling is very convenient for this because it runs a full EDA in one go. However, when I tried it in a Jupyter notebook, the Japanese column names were garbled into "tofu" boxes (□□□), so I summarize the fix here.

Environment

Python 3.8
Docker
Jupyter notebook

Cause

pandas-profiling draws its charts with **matplotlib** and **seaborn**, and the garbled characters appear because neither library is set up to render Japanese text by default. Once matplotlib and seaborn can render Japanese, pandas-profiling, which uses them, will render it as well.
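
To see the cause concretely, the following is a minimal sketch that asks matplotlib which font it currently resolves and whether a font named IPAexGothic is registered. The helper names `registered_font_names` and `default_font_file` are made up for this illustration; the calls they wrap (`font_manager.fontManager.ttflist`, `font_manager.findfont`, `matplotlib.rcParams`) are part of matplotlib. On an unmodified install the resolved font is typically DejaVu Sans, which has no Japanese glyphs.

```python
import matplotlib
from matplotlib import font_manager


def registered_font_names():
    """Return the sorted names of all fonts matplotlib currently knows about."""
    return sorted({font.name for font in font_manager.fontManager.ttflist})


def default_font_file():
    """Return the path of the font file matplotlib picks for the current font.family."""
    family = matplotlib.rcParams["font.family"]
    return font_manager.findfont(font_manager.FontProperties(family=family))


# Print the current configuration so it can be compared before and after the fix.
print("font.family           :", matplotlib.rcParams["font.family"])
print("resolved font file    :", default_font_file())
print("IPAexGothic registered:", "IPAexGothic" in registered_font_names())
```

If the last line prints False, matplotlib has no Japanese-capable font under that name yet, which matches the tofu symptom.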

This article explains the procedure for enabling Japanese in **matplotlib** and **seaborn**.

Japanese localization of Matplotlib and seaborn

The steps below assume a Jupyter notebook environment running in Docker. There may be a more efficient method, so I would appreciate a comment if you know one.

1. Download Japanese fonts

Download **ipaexg00401.zip (4.0MB)** from the IPAex font download site and unzip it. Move **ipaexg.ttf** from the ipaexg00401 folder to the directory that contains the Dockerfile.

2. Copy rcmod.py from the container to the host for seaborn Japanese support

The goal of this step is to get a local copy of rcmod.py, the file that must be edited to enable Japanese in seaborn, and to edit it on the host. The image is then set up so that rcmod.py on the container is overwritten with the edited copy from the host. With this flow you do not have to rewrite rcmod.py every time you run docker-compose up.

(Ideally I would rewrite the file on the container directly from the Dockerfile, but I could not work out how.)

Run **docker-compose up** while Japanese is still unsupported. Then open another terminal and check the container ID.

# Check the container ID
$ docker ps

Then save rcmod.py on the container to the host (locally).

$ docker cp [Container ID]:/opt/conda/lib/python3.8/site-packages/seaborn/rcmod.py [Destination (e.g. C:\Users\...)]

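Copy the rcmod.py you just saved into the directory that contains the Dockerfile. (The Dockerfile in step 4 copies it from settings/localize_ja/rcmod.py, so either place the file there or adjust that path to match.)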
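
If you would rather script this step than type it each time, the docker cp call can be wrapped in Python. This is only a sketch: `docker_cp_command` and `copy_rcmod` are hypothetical helper names, and the source path is the one used in this article, so it will differ for other base images or Python versions.

```python
import subprocess

# Path of seaborn's rcmod.py inside the container, as used in this article.
SEABORN_RCMOD = "/opt/conda/lib/python3.8/site-packages/seaborn/rcmod.py"


def docker_cp_command(container_id, dest_dir, src=SEABORN_RCMOD):
    """Build the argument list for copying a file out of a container."""
    return ["docker", "cp", f"{container_id}:{src}", dest_dir]


def copy_rcmod(container_id, dest_dir):
    """Run docker cp and raise CalledProcessError if the command fails."""
    subprocess.run(docker_cp_command(container_id, dest_dir), check=True)
```

Calling `copy_rcmod("<container ID from docker ps>", ".")` would place rcmod.py in the current directory. Building the command as a list avoids shell quoting problems with Windows-style destination paths.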

3. Rewrite rcmod.py

Open rcmod.py and change the following:

In the signature of def set_theme(context="notebook", ...) on lines 86-87, change the font argument to **font="IPAexGothic"**. (Line numbers may differ depending on the seaborn version.)

def set_theme(context="notebook", style="darkgrid", palette="deep",
              font="IPAexGothic", font_scale=1, color_codes=True, rc=None):

Then change **"font.family": ["sans-serif"]** on line 205 to:

"font.family": ["IPAexGothic"]

This completes the rewriting of seaborn for Japanese support.
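
The two manual edits can also be expressed as a small script, which might make it possible to do the rewrite during the image build instead of by hand. The sketch below is an assumption-laden illustration, not a tested part of this setup: `patch_rcmod_source` and `patch_rcmod_file` are hypothetical names, and the script assumes rcmod.py contains exactly the strings `font="sans-serif"` and `"font.family": ["sans-serif"]`.

```python
from pathlib import Path

FONT_NAME = "IPAexGothic"

# (old text, new text) pairs corresponding to the two manual edits in this step.
REPLACEMENTS = [
    ('font="sans-serif"', f'font="{FONT_NAME}"'),
    ('"font.family": ["sans-serif"]', f'"font.family": ["{FONT_NAME}"]'),
]


def patch_rcmod_source(source):
    """Return the source text with both font settings switched to FONT_NAME."""
    for old, new in REPLACEMENTS:
        if old in source:
            source = source.replace(old, new)
        elif new not in source:
            # Neither the original nor the patched text is present,
            # so this seaborn version probably lays the file out differently.
            raise ValueError(f"expected text not found: {old!r}")
    return source


def patch_rcmod_file(path):
    """Rewrite the rcmod.py at the given path in place."""
    path = Path(path)
    original = path.read_text(encoding="utf-8")
    path.write_text(patch_rcmod_source(original), encoding="utf-8")
```

Running the function twice is harmless because already patched text is left as it is, and an unexpected file layout raises an error instead of silently doing nothing.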

4. Add the following to your Dockerfile

# Japanese localization of matplotlib and seaborn
# Copy the Japanese font into matplotlib's font directory
COPY ipaexg.ttf /opt/conda/lib/python3.8/site-packages/matplotlib/mpl-data/fonts/ttf/ipaexg.ttf
# Overwrite rcmod.py on the container with the edited rcmod.py
COPY settings/localize_ja/rcmod.py /opt/conda/lib/python3.8/site-packages/seaborn/rcmod.py
# Append "font.family : IPAexGothic" to the end of the matplotlib config file
RUN echo "font.family : IPAexGothic" >> /opt/conda/lib/python3.8/site-packages/matplotlib/mpl-data/matplotlibrc
# Clear the cache so matplotlib rebuilds its font list (-f keeps the build going if .cache does not exist)
RUN rm -rf ./.cache

matplotlib and seaborn can now render Japanese. You can check whether characters are still garbled with the following code.

#Check if matplotlib is compatible with Japan
import matplotlib.pyplot as plt
plt.plot([1, 2, 3, 4])
plt.xlabel('Localizing into Japanese')
plt.ylabel('of matplotlib')
plt.show()

# Check that seaborn can render Japanese
import seaborn as sns
sns.set(style="whitegrid")

# Load the example Titanic dataset
titanic = sns.load_dataset("titanic")

# Draw a nested barplot to show survival for class and sex
g = sns.catplot(x="class", y="survived", hue="sex", data=titanic,
                height=6, kind="bar", palette="muted")
g.despine(left=True)
g.set_ylabels("Japaneseization of seaborn")

The labels are written in Japanese, so the setup works if they are displayed correctly instead of as tofu (□□□).
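
If the labels still come out as tofu, or if editing files inside the image is not an option, the font can also be registered from the notebook at runtime. The following is a sketch of that alternative, not the method used in this article: `use_japanese_font` is a hypothetical helper, and the font path in the usage note is an assumption. `fontManager.addfont` is available in matplotlib 3.2 and later.

```python
from pathlib import Path


def use_japanese_font(font_path, family="IPAexGothic"):
    """Register a font file with matplotlib and make it the default family."""
    font_path = Path(font_path)
    if not font_path.is_file():
        raise FileNotFoundError(f"font file not found: {font_path}")

    # Imported here so that a wrong path is reported before matplotlib is loaded.
    import matplotlib
    from matplotlib import font_manager

    # Make the font known to matplotlib for this session only.
    font_manager.fontManager.addfont(str(font_path))
    # Use it as the default family for all text drawn afterwards.
    matplotlib.rcParams["font.family"] = family
    return family
```

In a notebook this might be used as `family = use_japanese_font("ipaexg.ttf")` followed by `sns.set_theme(style="whitegrid", font=family)`. Passing `font` matters because `sns.set` and `sns.set_theme` set font.family again from their own `font` argument.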

Once you have confirmed that matplotlib and seaborn render Japanese, pandas-profiling should render it as well.
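
To check pandas-profiling itself, a small DataFrame with Japanese column names is enough. The sketch below uses invented sample data, and `sample_rows` and `build_report` are hypothetical helper names. `ProfileReport` and its `to_notebook_iframe` method are part of pandas-profiling.

```python
def sample_rows():
    """Return a small made-up dataset with Japanese column names."""
    return {
        "名前": ["佐藤", "鈴木", "高橋", "田中"],        # name
        "年齢": [23, 35, 41, 29],                          # age
        "都道府県": ["東京都", "大阪府", "北海道", "福岡県"],  # prefecture
    }


def build_report(rows, title="日本語カラムのプロファイル"):
    """Build a pandas-profiling report from a dict of columns."""
    # Imported here so the data helper works without pandas-profiling installed.
    import pandas as pd
    from pandas_profiling import ProfileReport

    df = pd.DataFrame(rows)
    return ProfileReport(df, title=title)
```

In a notebook cell, `build_report(sample_rows()).to_notebook_iframe()` displays the report inline. If the column names in its charts are readable, the Japanese support has carried over.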

In closing

I expect pandas-profiling to be my standard first step when starting EDA for the time being.

[PYTHON] About the garbled Japanese part of pandas-profiling in Jupyter notebook