[PYTHON] Visualize keywords in documents with TF-IDF and Word Cloud

A memo on drawing word clouds from TF-IDF weights.

Prepare word dictionary (vocab) and TF-IDF

# All words (the output below is an example)
$ vocab
array(['a', 'able', 'at', ..., 'zebra', 'zone', 'zoo'], dtype='<U79')

# TF-IDF vector for each document
$ TF_IDF
array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [61.9792226 ,  0.        ,  3.38385083, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  6.76770166, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 2.75463212,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 1.37731606,  2.84060202,  0.        , ...,  0.        ,
         0.        ,  0.        ]])
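
The memo starts from ready-made `vocab` and `TF_IDF`. As a rough sketch of how inputs of this shape could be prepared, the following builds a sorted vocabulary and a raw-count TF times IDF matrix from a toy corpus. The corpus, the whitespace tokenizer, and the IDF formula `log(N / df) + 1` are assumptions for illustration, not the method that produced the values above. Plain lists are used to keep the sketch dependency-free; wrapping them in `numpy.array` gives arrays like the ones shown.

```python
import math
from collections import Counter

# Hypothetical toy corpus; replace with your own documents.
docs = [
    "a zebra at the zoo",
    "a zoo zone",
    "able zebra",
]
# Assumption: whitespace tokenization is good enough for this example.
tokenized = [doc.split() for doc in docs]

# Sorted vocabulary, one entry per distinct word.
vocab = sorted({word for tokens in tokenized for word in tokens})

# Document frequency: number of documents that contain each word.
df = Counter(word for tokens in tokenized for word in set(tokens))
n_docs = len(tokenized)

# Assumed weighting: raw term count * (log(N / df) + 1).
idf = {word: math.log(n_docs / df[word]) + 1.0 for word in vocab}

# One row per document, one column per vocabulary word.
TF_IDF = []
for tokens in tokenized:
    counts = Counter(tokens)
    TF_IDF.append([counts[word] * idf[word] for word in vocab])

print(vocab)
print(len(TF_IDF), len(TF_IDF[0]))
```

Words that do not occur in a document get a weight of exactly 0, which is why most entries of a real TF-IDF matrix are zero.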

Create a dict for each document: dic[word] = TF-IDF value

words = vocab.tolist()
vecs = TF_IDF.tolist()
# One dict per document, mapping each word to its TF-IDF value
vecs_dic = [dict(zip(words, vec)) for vec in vecs]
$ len(vecs_dic)
(Number of documents)

$ len(vecs_dic[0])
(Number of dimensions of each vector, i.e. the vocabulary size)
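
Before drawing anything, it can help to check which words carry the largest weights, since those are the ones the word cloud will emphasize. The following is a minimal sketch; `top_keywords` and `example_doc` are hypothetical names introduced here, and `example_doc` stands in for one entry of `vecs_dic`.

```python
def top_keywords(word_weights, k=10):
    """Return the k highest-weighted (word, weight) pairs, largest first."""
    ranked = sorted(word_weights.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

# Hypothetical stand-in for one entry of vecs_dic.
example_doc = {"a": 61.98, "able": 0.0, "at": 3.38, "zoo": 0.0}

print(top_keywords(example_doc, k=2))
# → [('a', 61.98), ('at', 3.38)]
```

With the real data, `top_keywords(vecs_dic[88])` would list the top words of the 89th document.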

Visualization

# Visualize the 89th document in the document list (index 88)
from wordcloud import WordCloud
import matplotlib.pyplot as plt

wordcloud = WordCloud(background_color='white', width=1024, height=674)
wordcloud.generate_from_frequencies(vecs_dic[88])
plt.figure()  # create the figure first; calling this after imshow opens an extra empty figure
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()


If you get a ZeroDivisionError in WordCloud

Following reference [2], I solved this by adding a small value to every element.

words = vocab.tolist()
vecs = TF_IDF.tolist()
# Add a small value to every weight so that no element is exactly 0
vecs_dic = [{word: weight + 1e-5 for word, weight in zip(words, vec)}
            for vec in vecs]
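
An alternative to adding a small value is to drop zero-weight words and skip documents that have nothing left to draw. This sketch assumes the error comes from documents whose weights are all zero, which is a guess rather than something confirmed here; see reference [2] for the actual discussion. `positive_weights` and `vecs_dic_example` are hypothetical names.

```python
def positive_weights(word_weights):
    """Keep only the words whose weight is strictly positive."""
    return {word: w for word, w in word_weights.items() if w > 0}

# Hypothetical stand-in for vecs_dic: the first document is all zeros.
vecs_dic_example = [
    {"a": 0.0, "able": 0.0},
    {"a": 61.98, "able": 0.0},
]

# Pair each document index with its filtered dict, then drop the empty ones.
filtered = [(i, positive_weights(d)) for i, d in enumerate(vecs_dic_example)]
drawable = [(i, d) for i, d in filtered if d]

print(drawable)
```

Unlike the small-value approach, this keeps words that never occur in a document out of its word cloud, at the cost of having to handle skipped documents.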

Create and save images for each document

To save the images, call wordcloud.to_file for each document as follows.

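# Replace [PATH] with the output directory prefix
for i, v in enumerate(vecs_dic, start=1):
    wordcloud = WordCloud(background_color='white', width=1024, height=674)
    wordcloud.generate_from_frequencies(v)
    wordcloud.to_file("[PATH]" + str(i) + ".png")  # no trailing space in the file name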
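
If many images are written, zero-padded file names keep them sorted in document order. Below is a sketch using `pathlib`; `image_path`, `save_word_clouds`, and `out_dir` are hypothetical names, and `out_dir` plays the role of the `[PATH]` placeholder. The `wordcloud` import sits inside the function so that the path helper can be tried on its own.

```python
from pathlib import Path

def image_path(out_dir, index, width=3):
    """Build a zero-padded PNG path such as out_dir/007.png."""
    return Path(out_dir) / f"{index:0{width}d}.png"

def save_word_clouds(vecs_dic, out_dir):
    """Render one word cloud per document and save it under out_dir."""
    from wordcloud import WordCloud  # imported here so the helper above has no dependencies

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for i, v in enumerate(vecs_dic, start=1):
        wc = WordCloud(background_color='white', width=1024, height=674)
        wc.generate_from_frequencies(v)
        wc.to_file(str(image_path(out_dir, i)))

print(image_path("out", 7).name)
# → 007.png
```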

References

[1] https://qiita.com/pma1013/items/d183b4b2504173ba037e
[2] https://github.com/amueller/word_cloud/issues/456
