[PYTHON] Visualize the "regional color" of the city by applying document vectorization

Overview

By applying document vectorization methods from natural language processing, I created a map that color-codes each region based on regional information written as text. Specifically, I classified the city planning documents of municipalities in the Tokyo metropolitan area (the policies for improvement, development and conservation, the so-called "seikaiho") with a topic model (LDA) and expressed the results on a map as RGB values.

[Figure: sample_20191230.png]

I am hoping this may lead to discoveries such as "these two areas are geographically far apart, but actually share the same characteristics."

At present the data is not fully compiled and there are gaps. For example, Ibaraki Prefecture tends toward purple, while Chiba and Saitama prefectures are mostly green and brown. Even within Saitama, however, the western part around Chichibu shows colors close to those of Ibaraki. Kimitsu and Tokorozawa also turn out to be surprisingly close. Why the center of Saitama City (Omiya) is also purple remains a mystery, and so on. As the amount of data grows, the results may become more convincing.

Original data

I collected the city planning documents for Chiba, Saitama, and Ibaraki from the Ministry of Land, Infrastructure, Transport and Tourism's City Planning Master Plan Links page. Not all areas are covered, however; these are just samples. As input, I created a dataset in CSV format like the following.

   name                description
1  Ryugasaki / Ushiku  Ryugasaki / Ushiku city planning (Ryugasaki City, Ushiku City, Tone Town): strengthen cooperation between neighboring cities within the city planning area, coexisting with the abundant nature and rural environment while working and living together ... (omitted)
2  Hanno City          Hanno city planning (Hanno City) city planning area ... (omitted) ... aim to realize a low-carbon society by promoting the use of public transportation and creating greenery; unique development of the region ... (omitted)
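
The script later in this article simply reads this file with pandas. As a minimal sketch of the expected format (the text below is a placeholder, not the actual documents), such a file could be produced like this:

import pandas as pd

#Minimal sketch of the expected input.csv: one row per city planning area,
#with 'name' and 'description' columns (placeholder text only)
records = [
    {"name": "Ryugasaki / Ushiku", "description": "Ryugasaki / Ushiku city planning ... (full document text)"},
    {"name": "Hanno City", "description": "Hanno city planning area ... (full document text)"},
]
pd.DataFrame(records).to_csv("input.csv", index=False, encoding="utf-8")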

Document vectorization

--The code I actually wrote is as follows. Morphological analysis is done with MeCab, vectorization with a topic model (LDA) in gensim, and dimensionality reduction with t-SNE in scikit-learn. There are probably many points that could be improved, and I intend to fix them as needed.

--Execution environment >> OS: macOS Catalina | Language: Python 3.6.6

visualizer.py


from sklearn.manifold import TSNE
from gensim import corpora, models
import string
import re
import MeCab
import pandas as pd
import numpy as np

#Text Tokenizer by MeCab
def text_tokenizer(text):
    token_list = []
    tagger = MeCab.Tagger()
    tagger.parse('') #workaround to keep node.surface from being garbled on Python 3
    node = tagger.parseToNode(text)
    while node:
        pos = node.feature.split(",")
        if pos[0] in ["名詞", "動詞", "形容詞"]: #target parts of speech: nouns, verbs, adjectives
            if pos[6] != '*': #use the lemma (base form) when it exists
                token_list.append(pos[6])
            else:
                token_list.append(node.surface)        
        node = node.next
    return list(token_list)

#Loading input dataset
df = pd.read_csv('input.csv', encoding="utf-8")
df['text'] = df['description'] #set target column

#Remove https-links
df['text_clean'] = df.text.map(lambda x: re.sub(r'https?://[\w/:%#\$&\?\(\)~\.=\+\-]+', "", x))
#Remove numerals
df['text_clean'] = df.text_clean.map(lambda x: re.sub(r'\d+', '', x))
#Converting all letters into lower case
df['text_clean'] = df.text_clean.map(lambda x: x.lower())
#Creating DataFrame for Token-list
df['text_tokens'] = df.text_clean.map(lambda x: text_tokenizer(x))

#LDA
np.random.seed(2000)
texts = df['text_tokens'].values
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
ldamodel = models.ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=20, passes=5, minimum_probability=0) #minimum_probability=0 so that a probability is returned for every topic and each document vector has num_topics entries
ldamodel.save('lda_bz.model')
print(ldamodel.print_topics())

#Converting Topic-Model result into numpy matrix
hm = np.array([[y for (x,y) in ldamodel[corpus[i]]] for i in range(len(corpus))])

#Dimensionality reduction by tsne
tsne = TSNE(n_components=3, init='pca', verbose=1, random_state=2000, perplexity=50, method='exact', early_exaggeration=120, learning_rate=200, n_iter=1000)
embedding = tsne.fit_transform(hm)

x_coord = embedding[:, 0]
y_coord = embedding[:, 1]
z_coord = embedding[:, 2]

#RGB conversion with normalization
def std_norm(x, axis=None):
    #z-score standardization followed by min-max scaling into the 0-254 range
    xmean = x.mean(axis=axis, keepdims=True)
    xstd = np.std(x, axis=axis, keepdims=True)
    y = (x - xmean) / xstd
    y_min = y.min(axis=axis, keepdims=True)
    y_max = y.max(axis=axis, keepdims=True)
    norm_rgb = (y - y_min) / (y_max - y_min) * 254
    return norm_rgb.round(0)

x_rgb = std_norm(x_coord, axis=0)
y_rgb = std_norm(y_coord, axis=0)
z_rgb = std_norm(z_coord, axis=0)

embedding = pd.DataFrame(x_coord, columns=['x'])
embedding['y'] = pd.DataFrame(y_coord)
embedding['z'] = pd.DataFrame(z_coord)
embedding["r"] = pd.DataFrame(x_rgb)
embedding["g"] = pd.DataFrame(y_rgb)
embedding["b"] = pd.DataFrame(z_rgb)
embedding['description'] = df.description

#export to csv
embedding.to_csv("output.csv", encoding="utf_8")

――The basic idea is to perform the "three-dimensional visualization of document vectors" that is often done in natural language processing, and then convert the resulting three-dimensional vectors (the x, y, z values) into RGB values.

--This time I adopted the [Topic Model](https://qiita.com/tags/Topic Model) for document vectorization. The reason is that with other vectorization methods (for example, ones based on tf-idf or word2vec), outliers often appeared in the x, y, z values after dimensionality reduction, and their effect was too large when converting to RGB values in the 0-255 range. With a topic model, I expected that outliers would be less likely to occur if the parameters are set appropriately, which would avoid this problem.
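
For comparison, a tf-idf based variant of the vectorization step (not the method adopted here; just a rough sketch that reuses the dictionary and corpus variables built in visualizer.py) would look roughly like this:

from gensim import matutils, models

#Sketch only: tf-idf weighting followed by LSI, as an alternative to the LDA vectors
tfidf = models.TfidfModel(corpus)               #re-weight the bag-of-words counts
corpus_tfidf = tfidf[corpus]
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=20)

#Dense (num_topics x n_docs) matrix, transposed to one row per document;
#as noted above, such vectors tended to produce more extreme x/y/z values after t-SNE
lsi_vectors = matutils.corpus2dense(lsi[corpus_tfidf], num_terms=20).T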

Visualization on a map

--After converting the document information to RGB values, I visualize the RGB value of each municipality on a map. In my case, I used QGIS to join the values with geographic information and Leaflet to draw the map. I omit that part this time, but will explain it in detail in a separate article if there is demand.

――If anyone knows a way to have QGIS use the RGB values stored in the attribute table directly as the display color, please let me know. (As one starting point, the sketch below prepares a single hex color column on the Python side.)
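
The following is a minimal sketch of that preparation step; it is my own addition, not part of the original script. It assumes the output.csv produced by visualizer.py and combines its r, g, b columns into a "#rrggbb" string (the column name hex_color is arbitrary). Leaflet accepts such a string as a fillColor, and the column can be joined to a boundary layer in QGIS like any other attribute.

import pandas as pd

#Sketch only: combine the r, g, b columns of output.csv into one hex color string
emb = pd.read_csv("output.csv", encoding="utf_8")
emb["hex_color"] = emb.apply(
    lambda row: "#{:02x}{:02x}{:02x}".format(int(row["r"]), int(row["g"]), int(row["b"])),
    axis=1,
)
emb.to_csv("output_with_color.csv", index=False, encoding="utf_8")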

Points to keep in mind and future issues

――Roughly speaking, if the content of two documents (word-occurrence tendencies, topics, and so on) is similar, their document vectors end up close to each other, so after converting to color I think we can judge that municipalities with similar shades are talking about similar things. If you see a flaw in this reasoning, I would be happy to hear about it. (A small check of this idea is sketched after this list.)

――Since these colors only express relative relationships between regions, they change every time the analysis is rerun with different initial values. This makes the method hard to use for cases like "run this every year and track changes over time", so I would like to find a solution for that. (One possible direction is sketched after this list.)

――I used city planning documents on a trial basis this time, but not only do they not yet cover every municipality, I also feel there is a gap between these documents and what people generally sense as the "character of a city". Eventually, I think it would be interesting to build something that reflects the voices of tourists or the minutes of community-based workshops.
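
Regarding the first point (similar content leads to nearby vectors and therefore similar colors), one simple way to check it is to compare the topic distributions of two documents directly with the Hellinger distance provided by gensim. This is only a sketch, under the assumption that ldamodel and corpus from visualizer.py are still in scope:

from gensim.matutils import hellinger

#Sketch only: distance between the topic distributions of two documents.
#Small values should correspond to similar colors on the final map.
vec_a = ldamodel[corpus[0]]
vec_b = ldamodel[corpus[1]]
print("Hellinger distance:", hellinger(vec_a, vec_b))  #0 = identical, close to 1 = very different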
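
Regarding the second point (colors changing on every run), one direction I can think of, again only a rough sketch and not something implemented in this article: visualizer.py already saves the model with ldamodel.save('lda_bz.model'), so the same topic space could be reused to score next year's documents, which keeps the LDA step stable; the t-SNE step would still need a fixed random_state or a deterministic projection. The dictionary would also have to be saved (for example with dictionary.save('lda_bz.dict')), which the current script does not do.

from gensim import corpora, models

#Sketch only: reuse the saved dictionary and LDA model to score new documents
#in the same topic space (assumes the dictionary was saved as 'lda_bz.dict')
dictionary = corpora.Dictionary.load('lda_bz.dict')
ldamodel = models.LdaModel.load('lda_bz.model')

new_tokens = text_tokenizer("(next year's city planning document text)")  #function from visualizer.py
bow = dictionary.doc2bow(new_tokens)
print(ldamodel.get_document_topics(bow, minimum_probability=0))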
