[Python] Science: "Is 斎 the representative of all the Saito kanji?"

The Saito problem

What I did

"Saito" recognized as a kanji

  • There are 4 types (2 new forms, 2 old forms). According to Toyo Keizai Online:
  • ① 斎 is the source,
  • ② 齋 is the old form of the source (①),
  • ③ 斉 is a miswriting of the new form (①). (Surprising fact 1)
  • ④ 齊 is a miswriting of the old form (②). (Surprising fact 2)
  • The table below shows the population in Japan in parentheses; **① 斎, which has the largest population, is the source**.
|                       | New form | Old form |
|-----------------------|----------|----------|
| Source                | ① 斎 U+658E, 33.jpg (542,000 people) | ② 齋 U+9F4B, 13.jpg (86,800 people): old form of the source (①) |
| Actually a miswriting | ③ 斉 U+6589, 32.jpg (323,000 people): miswriting of the source (①) | ④ 齊 U+9F4A, 02.jpg (37,300 people): miswriting of the old form (②) |

And my feelings

After all, I want **斎** (①, the source) to be the **representative of all four "Saito" characters**, i.e., the one sitting **in the middle**.

So let's check

  • I want to draw a "Saito map" to check which character is the **representative (center)** of Saito.
  • However, each character image is 58x58 = 3,364 pixels (3,364 dimensions), which cannot be plotted directly on XY coordinates (2 dimensions).
  • Therefore, I use a technique called **dimensionality reduction** to compress **3,364 dimensions ⇒ 2 dimensions**.
  • Dimensionality reduction of characters is covered in another article, so I will just link to it (a minimal sketch of the data layout assumed below follows this list).
  • This time, **UMAP** is used as the dimensionality reduction algorithm.
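  • For reference, here is a minimal sketch of the data layout assumed by the code in this post: 33 grayscale 58x58 images, named 01.jpg to 33.jpg, loaded into a matrix `all` of shape (3364, 33). The loading code itself is my assumption; the actual data preparation is described in the linked article.
import numpy as np
from PIL import Image

# assumed layout: 33 grayscale images of 58x58 pixels, named 01.jpg ... 33.jpg
h, w = 58, 58
files = [f"{i:02d}.jpg" for i in range(1, 34)]

# each image becomes one 3,364-dimensional column, so `all` has shape (3364, 33)
# (the name `all` shadows the Python builtin, but it matches the code in this post)
all = np.stack(
    [np.asarray(Image.open(f).convert("L")).reshape(-1) for f in files],
    axis=1,
)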


  • Now, compress the dimensions with UMAP.
from umap import UMAP
# UMAP decomposition down to 2 dimensions
decomp = UMAP(n_components=2, random_state=42)
# fit_transform on the 4 "Saito" characters recognized as kanji
embedding4 = decomp.fit_transform(all.T[[1, 12, 31, 32]])

Verification 1) Decide the representative of the four Saito

  • Using UMAP, map the **kanji images** onto **2D (a plane)** and check the **"representative"**.
  • As the "representative", let's use the **"center of gravity" of all the data points**, not the "center" of the plot (0.5, 0.5).
  • The "center of gravity" is drawn as an **× mark**; how does it look? (The × mark in the figure below is the center of gravity.)
from sklearn.cluster import KMeans

# clustering (1 cluster) -- used only to obtain the center of gravity
clustering = KMeans(n_clusters=1, random_state=42)
# fit_predict cluster labels
cl_y = clustering.fit_predict(embedding4)

# visualize (the implementation of showScatter is given at the end of this post)
showScatter(
    embeddings    = embedding4,
    clusterlabels = cl_y,
    centers       = clustering.cluster_centers_,
    imgs          = all.T[[1, 12, 31, 32]].reshape(-1, h, w)
)

[Scatter plot: the 4 characters mapped by UMAP, center of gravity marked with ×]

  • **The result is... subtle.**
  • Computing the Euclidean distance from the "center of gravity" to each character gives the table below (a sketch of this computation follows the table).
  • With this result, **② 齋, the old form of the source,** became the representative...
| Rank (closest to the center of gravity) | Character | Distance from the center of gravity | Note |
|---|---|---|---|
| 1st | 齋 (13.jpg) | 0.6281 | ② source (old form) |
| 2nd | 斉 (32.jpg) | 0.6889 | ③ miswriting (new form) |
| 3rd | 斎 (33.jpg) | 0.7339 | ① source (new form) |
| 4th | 齊 (02.jpg) | 0.8743 | ④ miswriting (old form) |
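  • For reference, a minimal sketch of how such a ranking could be computed (my own code, not from the original article; the file names assume the columns of `all` are in 01.jpg–33.jpg order, and the exact values may differ if the plotted coordinates were rescaled).
import numpy as np

# Euclidean distance from the single cluster's center of gravity to each of the 4 characters
center = clustering.cluster_centers_[0]
dists = np.linalg.norm(embedding4 - center, axis=1)

# columns [1, 12, 31, 32] of `all` correspond to these image files (assumption)
names = ["02.jpg", "13.jpg", "32.jpg", "33.jpg"]
for rank, idx in enumerate(np.argsort(dists), start=1):
    print(f"{rank}: {names[idx]}  distance = {dists[idx]:.4f}")
  • The same computation on `embeddings` (all 33 characters) gives the ranking used in Verification 2.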

Verification 2) Determine the "representative" of 33 Saito

By the way, how many types of Saito are there?

  • There are only four types recognized as kanji, but in fact, according to Wikipedia,
  • there are 31 variant forms besides "斎" and "斉".
  • On the other hand, the Ministry of Justice recognizes only the four characters "斎, 齋, 斉, 齊" as kanji.
  • In other words, **of all 33 patterns, only 4 are accepted as kanji**.
  • So, beyond the ones recognized as kanji, I would like to find the **"representative" of all 33 Saito**.

[Image: all 33 "Saito" variants (01.jpg–33.jpg)]

  • Now, dimensionally compress all 33 characters with UMAP.
from umap import UMAP
# UMAP decomposition down to 2 dimensions
decomp = UMAP(n_components=2, random_state=42)
# fit_transform on all 33 character images
embeddings = decomp.fit_transform(all.T)

What is the "representative" of the 33 "Saito"?

  • As before, compress the dimensions with UMAP and check which characters are close to the "center of gravity".
from sklearn.cluster import KMeans
# clustering (number of clusters: 1)
clustering = KMeans(n_clusters=1, random_state=42)
# fit_predict cluster labels
cl_y = clustering.fit_predict(embeddings)
# visualize
showScatter(embeddings, cl_y, clustering.cluster_centers_)
[Scatter plot: all 33 characters mapped by UMAP, center of gravity marked with ×]
  • Instead of the expected "斎", the variant 28.jpg turned out to be closest to the representative position (the middle)...
  • The ranking by distance from the center of gravity (top 4) is as follows. Things did not go as expected (lol).
| Rank (closest to the center of gravity) | Character | Distance from the center of gravity | Note |
|---|---|---|---|
| 1st | 28.jpg | 0.494 | |
| 2nd | 30.jpg | 0.787 | |
| 3rd | 27.jpg | 1.013 | |
| 4th | 31.jpg | 1.014 | |

Verification 3) Select 4 representative "Saito" characters

  • "Don't middle" didn't work, but ** 4 types ** are accepted as kanji.
  • Then, the kanji on this map is divided into 4 clusters, and which kanji is the center of gravity of each cluster?
  • In other words, I would like to select and see the representative 4 characters ** from all 33 characters.
  • Using the clustering algorithm KMeans, it is divided into 4 clusters as shown below.
from sklearn.cluster import KMeans
# clustering (number of clusters: 4)
clustering = KMeans(n_clusters=4, random_state=42)
# fit_predict cluster labels
cl_y = clustering.fit_predict(embeddings)
# visualize
showScatter(embeddings, cl_y, clustering.cluster_centers_)
[Scatter plot: the 33 characters divided into 4 clusters, cluster centers marked with ×]
  • The characters in each cluster and the character nearest each center of gravity are shown in the table below (a sketch of how the nearest character can be picked follows the table).
  • The clusters do seem to capture features of the kanji (such as the 月 and 示 parts).
  • Whether the points near each cluster's center of gravity really capture the characteristics of that cluster is debatable.
  • The data does not split cleanly into 4 clusters: the red cluster contains several different patterns.
  • It seems we need to split a little further.
  • At a quick glance, it feels like **about 8 clusters** would separate them nicely.
| No | Cluster | Center of gravity | Other characters included |
|---|---|---|---|
| 1 | Red | 25.jpg | 19.jpg, 33.jpg, 20.jpg, 21.jpg, 27.jpg, 28.jpg, 30.jpg, 31.jpg |
| 2 | Orange | 26.jpg | 13.jpg, 14.jpg, 15.jpg, 16.jpg, 17.jpg, 18.jpg, 22.jpg, 23.jpg, 24.jpg |
| 3 | Blue | 29.jpg | 10.jpg, 32.jpg, 11.jpg |
| 4 | Green | 08.jpg | 01.jpg, 02.jpg, 03.jpg, 04.jpg, 05.jpg, 06.jpg, 07.jpg, 08.jpg, 12.jpg |
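  • For reference, a minimal sketch of how the character nearest each cluster's center of gravity could be picked (my own code, not from the original article; the file-name list assumes the columns of `all` are in 01.jpg–33.jpg order).
import numpy as np
from scipy.spatial.distance import cdist

filenames = [f"{i:02d}.jpg" for i in range(1, 34)]  # assumed column order of `all`

# distance from every embedded character to every cluster center, shape (33, n_clusters)
d = cdist(embeddings, clustering.cluster_centers_)

for c in range(clustering.n_clusters):
    members = np.where(cl_y == c)[0]
    # the member whose embedding is closest to this cluster's center
    rep = members[np.argmin(d[members, c])]
    print(f"cluster {c}: representative {filenames[rep]}, members {[filenames[m] for m in members]}")
  • The same snippet works unchanged for the 8-cluster and 5-cluster runs below.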

Verification 4) Try 8 clusters

  • Earlier, we used 4 clusters in order to select 4 representative kanji.
  • However, some clusters did not separate cleanly, so let's set the number of clusters to 8.
  • The results are as follows.
from sklearn.cluster import KMeans
# clustering (number of clusters: 8)
clustering = KMeans(n_clusters=8, random_state=42)
# fit_predict cluster labels
cl_y = clustering.fit_predict(embeddings)
# visualize
showScatter(embeddings, cl_y, clustering.cluster_centers_)
[Scatter plot: the 33 characters divided into 8 clusters, cluster centers marked with ×]
  • They are not separated perfectly, but it does feel like they have been sorted reasonably well.
| No | Cluster | Characters in the cluster |
|---|---|---|
| 1 | Pink | 13.jpg, 15.jpg, 33.jpg, 18.jpg |
| 2 | Red | 14.jpg, 22.jpg, 24.jpg, 26.jpg |
| 3 | Brown | 16.jpg, 17.jpg, 23.jpg |
| 4 | Gray | 30.jpg, 31.jpg, 28.jpg |
| 5 | Orange | 19.jpg, 20.jpg, 21.jpg, 25.jpg, 27.jpg |
| 6 | Blue | 10.jpg, 32.jpg, 29.jpg |
| 7 | Purple | 11.jpg, 12.jpg, 01.jpg |
| 8 | Green | 02.jpg, 03.jpg, 04.jpg, 05.jpg, 06.jpg, 07.jpg, 08.jpg, 09.jpg |
  • Unfortunately, 27.jpg (orange) and 28.jpg (gray) ended up in different clusters.
  • However, since both sit on the boundary between those clusters, I think the idea still gets across (laughs).

Verification 5) Examine how many clusters are appropriate

  • We used 4 clusters to match the 4 characters registered as kanji.
  • Then, after looking at the 4-cluster result, we tried splitting into 8 clusters.
  • So, **how many clusters is actually appropriate?**
  • Here, to choose the number of clusters, I visualize and examine the clustering with the following three methods.
  1. Elbow chart
  2. Silhouette chart
  3. Dendrogram

Elbow Chart

  • The elbow chart plots the **within-cluster variation of the data** on the vertical axis against the **number of clusters** on the horizontal axis.
  • Increasing the number of clusters reduces the variation, but too many clusters is a problem in itself.
  • So, from this chart, we look for a number of clusters that is reasonably small yet still keeps the variation low.
  • Yellowbrick is used for drawing.
from yellowbrick.cluster import KElbowVisualizer

vis = KElbowVisualizer(
    KMeans(random_state=42),
    k=(1, 34)  # number of clusters (range on the horizontal axis)
)
vis.fit(embeddings)
vis.show()
[Elbow chart for k = 1 to 33]
  • My impression from the chart:
  • up to 5 clusters the **(average) variation of the data keeps decreasing**, but after that it flattens out.
  • So **splitting into 5 clusters** looks good = **5 types** of representative kanji looks good.
  • But let's also look at an enlarged version (zoomed in on k = 4 to 18).
  • There does seem to be an inflection point at 5, but the curve really flattens out from **around 10**.
  • In other words, **splitting into 8 clusters and choosing 8 representative kanji** does not seem to be a mistake either.
from yellowbrick.cluster import KElbowVisualizer

vis = KElbowVisualizer(
    KMeans(random_state=42),
    k=(4, 19)  # number of clusters (range on the horizontal axis)
)
vis.fit(embeddings)
vis.show()
[Elbow chart enlarged for k = 4 to 18]

Silhouette Chart

  • The silhouette chart shows the following for each cluster.
  • Vertical axis (thickness of each band): number of samples in the cluster
  • Horizontal axis (length of each band): silhouette coefficient of the cluster
  • Dashed line: average silhouette coefficient
  • When reading it, the point is to find a number of clusters that satisfies the following:
  • roughly the same number of samples in every cluster = bands of the same thickness
  • every cluster's silhouette coefficient close to the average = band lengths close to the dashed line
  • Yellowbrick is used for drawing here as well.
from sklearn.cluster import KMeans
from yellowbrick.cluster import silhouette_visualizer
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(15, 25))
# draw the charts for 4 to 9 clusters in one figure
for i in range(4, 10):
    ax = fig.add_subplot(3, 2, i - 3)
    silhouette_visualizer(KMeans(n_clusters=i, random_state=42), embeddings, ax=ax, show=False)
plt.show()
  • Looking at the result, the pattern at the upper right (**5 clusters**) looks the nicest.
[Silhouette charts for 4 to 9 clusters]

Dendrogram

  • A dendrogram is a graph that expresses the **closeness** between clusters, like a tournament bracket.
  • Since it is a diagram used with hierarchical clustering, SciPy's hierarchical clustering is used here instead of KMeans.
  • How to read it:
  • the leaves are the data points, and branches drawn in the same color belong to the same cluster
  • the height of a junction is the distance between the clusters it joins
from scipy.cluster.hierarchy import linkage, dendrogram

# hierarchical clustering on the 2D embedding
Z = linkage(
    y = embeddings,
    method = 'weighted',
    metric = 'euclidean',
)

# draw the dendrogram
R = dendrogram(
    Z=Z,
    color_threshold=1.2,  # adjust the number of clusters with this threshold
    show_contracted=False,
)
  • **It looks good when each color has a well-balanced number of branches and similar heights.** After all, is about 5 clusters the answer? (A sketch of turning the dendrogram into flat cluster labels follows the table.)
| Number of clusters | Dendrogram | Comment |
|---|---|---|
| | [dendrogram] | The red branch is just a little too high |
| | [dendrogram] | The heights are uniform; a few purple branches are slightly concerning, but it feels pretty good |
| | [dendrogram] | The heights and counts are even, but it may be divided too finely |
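  • For reference, a minimal sketch of how the dendrogram can be turned into flat cluster labels with SciPy's fcluster (my own code, not from the original article).
from scipy.cluster.hierarchy import fcluster

# cut the linkage at the same distance threshold used to color the dendrogram
labels_by_distance = fcluster(Z, t=1.2, criterion='distance')
print(len(set(labels_by_distance)), "clusters at threshold 1.2")

# or ask directly for a fixed number of clusters, e.g. 5
labels_5 = fcluster(Z, t=5, criterion='maxclust')
print(sorted(set(labels_5)))  # cluster labels 1..5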

Verification 6) Try 5 clusters

  • Having examined the number of clusters, let's plot again and see what 5 clusters looks like.
  • It looks pretty good. In the end, 5 clusters it is?
from sklearn.cluster import KMeans
# clustering (number of clusters: 5)
clustering = KMeans(n_clusters=5, random_state=42)
# fit_predict cluster labels
cl_y = clustering.fit_predict(embeddings)
# visualize
showScatter(embeddings, cl_y, clustering.cluster_centers_)
[Scatter plot: the 33 characters divided into 5 clusters, cluster centers marked with ×]
| No | Cluster | Center of gravity | Other characters included |
|---|---|---|---|
| 1 | Blue | 15.jpg | 13.jpg, 14.jpg, 33.jpg, 18.jpg |
| 2 | Purple | 23.jpg | 16.jpg, 17.jpg, 22.jpg, 24.jpg, 26.jpg, 21.jpg |
| 3 | Green | 27.jpg | 19.jpg, 20.jpg, 25.jpg, 27.jpg, 28.jpg, 30.jpg, 31.jpg |
| 4 | Red | 29.jpg | 32.jpg, 10.jpg, 11.jpg |
| 5 | Orange | 08.jpg | 01.jpg, 02.jpg, 03.jpg, 04.jpg, 05.jpg, 06.jpg, 07.jpg, 09.jpg, 12.jpg |

Summary

Impressions

  • To recap the flow:
  • I started by selecting a representative among the 4 characters registered as kanji,
  • then selected 1, 4, and 8 representatives from all 33 characters, including those not registered as kanji,
  • and after examining the appropriate number of clusters, I finally chose 5 characters, since 5 clusters seemed best.
  • The representative kanji are listed below, but beyond simply choosing representatives,
  • it is also interesting that, on the XY plane obtained by compressing about 3,000 dimensions, **kanji with similar shapes land near each other**.
  • It was interesting that distance-based clustering produced **groups organized by radical**.
  • The number of clusters was judged to be about 5 based on the elbow method, the silhouette method, and the dendrogram.
  • It was also interesting that the **visualization of the 5-cluster result looked reasonably good**.

Verification list

| No | How to choose | Representative Saito |
|---|---|---|
| 1 | Choose 1 character from the 4 recognized kanji | 13.jpg (齋) |
| 2 | Choose 1 character from all 33 kanji | 28.jpg |
| 3 | Choose 4 characters from all 33 kanji | 25.jpg, 26.jpg, 29.jpg, 08.jpg |
| 4 | Choose 8 characters from all 33 kanji | 21.jpg, 26.jpg, 29.jpg, 31.jpg, 07.jpg, 12.jpg, 15.jpg, 19.jpg |
| 5 | How many clusters should all 33 kanji be divided into? | About 5 clusters looks good |
| 6 | Choose 5 characters from all 33 kanji | 08.jpg, 15.jpg, 23.jpg, 27.jpg, 29.jpg |

Finally

  • Thank you for reading such a silly story to the end.
  • If you liked it, I would appreciate it if you could share it.

Reference information

  • About UMAP
    • https://umap-learn.readthedocs.io/en/latest/index.html
  • Discussion on clustering with UMAP (apparently)
    • https://umap-learn.readthedocs.io/en/latest/index.html
  • About KMeans
    • https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
  • Examining the number of clusters (Yellowbrick)
    • https://www.scikit-yb.org/en/latest/api/cluster/elbow.html
    • https://www.scikit-yb.org/en/latest/api/cluster/silhouette.html
  • Drawing a dendrogram
    • https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.cluster.hierarchy.dendrogram.html

Visualization function

  • I referred to another article for this function, so I will link to it. Thank you very much.
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.lines as mlines
from matplotlib import offsetbox
from sklearn.preprocessing import MinMaxScaler
from PIL import Image
import matplotlib.patches as patches

rc = {
  'font.family': ['sans-serif'],
  'font.sans-serif': ['Open Sans', 'Arial Unicode MS'],
  'font.size': 12,
  'figure.figsize': (8, 6),
  'grid.linewidth': 0.5,
  'legend.fontsize': 10,
  'legend.frameon': True,
  'legend.framealpha': 0.6,
  'legend.handletextpad': 0.2,
  'lines.linewidth': 1,
  'axes.facecolor': '#fafafa',
  'axes.labelsize': 10,
  'axes.titlesize': 14,
  'axes.linewidth': 0.5,
  'xtick.labelsize': 10,
  'xtick.minor.visible': True,
  'ytick.labelsize': 10,
  'figure.titlesize': 14
}
sns.set('notebook', 'whitegrid', rc=rc)

def colorize(d, color, alpha=1.0):
  # turn a grayscale image d into an RGBA image tinted with the given cluster color,
  # using the pixel values as the alpha channel (scaled by alpha)
  rgb = np.dstack((d,d,d)) * color
  return np.dstack((rgb, d * alpha)).astype(np.uint8)

colors = sns.color_palette('tab10')

def showScatter(
    embeddings,
    clusterlabels,
    centers = [],
    imgs = all.T.reshape(-1,h,w),
):
    fig, ax = plt.subplots(figsize=(15,15))
    
    #Scaling before drawing scatter plot
    scaler = MinMaxScaler()
    embeddings = scaler.fit_transform(embeddings)
    
    source = zip(embeddings, imgs ,clusterlabels)
    
    #Draw each kanji image at its (scaled) position, tinted with its cluster color
    for pos, d, i in source:
        img = colorize(d, colors[i], 0.5)
        # 0.03 + pos * 0.94 keeps a small margin inside the [0, 1] plot area
        ab = offsetbox.AnnotationBbox(offsetbox.OffsetImage(img), 0.03 + pos * 0.94, frameon=False)
        ax.add_artist(ab)
          
    #Draw concentric circles from the center of gravity
    if len(centers) != 0:
        for c in scaler.transform(centers):
            for r in np.arange(3,0,-1)*0.05:
                circle = patches.Circle(
                    xy=(c[0], c[1]),
                    radius=r,
                    fc='#FFFFFF', 
                    ec='black'
                )
                circle.set_alpha(0.3)
                ax.add_patch(circle)

            ax.scatter(c[0],c[1],s=300,marker="X")
  

    #Axis drawing range
    limit = [-0.1,1.1]
    plt.xlim(limit)
    plt.ylim(limit)
    plt.show()
