[PYTHON] Try to make Qiita's Word Cloud from your browser history

qiita.png

I made a word cloud like the one above. I extracted the titles of qiita.com pages from my browser history, split them into words with MeCab, and turned only the nouns into a word cloud, which visualizes the fields I have been interested in and searching for. Seeing my own history this way was quite interesting, so please give it a try.

I use https://github.com/amueller/word_cloud as the word cloud library for the visualization. For morphological analysis, set up MeCab by following "Making the morphological analysis engine MeCab available in Python3 (March 2016 version)".

On macOS, Safari's history is in ~/Library/Safari/History.db and Chrome's is in ~/Library/Application\ Support/Google/Chrome/Default/History, so copy both files to the same directory as the script. Even if you point the script directly at these paths without copying, the read may fail with an error, and working on a copy also avoids the risk of corrupting the live database.
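The copy step could be sketched in Python as follows. This is a minimal sketch and not part of the original script: the helper name copy_history and its dest_dir argument are hypothetical, and the source paths are simply the macOS locations given above.

```python
import shutil
from pathlib import Path

# macOS locations quoted in the text above.
SAFARI_SRC = Path.home() / 'Library/Safari/History.db'
CHROME_SRC = Path.home() / 'Library/Application Support/Google/Chrome/Default/History'


def copy_history(src, dest_dir='.'):
    """Copy one history database into dest_dir (hypothetical helper).

    Returns the path of the copy, or None if the source file does not exist.
    """
    src = Path(src)
    if not src.is_file():
        return None
    dest = Path(dest_dir) / src.name
    shutil.copy2(src, dest)
    return dest


if __name__ == '__main__':
    for source in (SAFARI_SRC, CHROME_SRC):
        copied = copy_history(source)
        print(source, '->', copied if copied else 'not found, skipped')
```

The copies keep the file names History.db and History, which are the names the script expects to find next to it.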

The script that creates the Qiita word cloud from the browser history is as follows.

import sqlite3
from enum import Enum
import MeCab
from wordcloud import WordCloud

# Work on copies of the original history databases, not the live files.
# Safari History
# => ~/Library/Safari/History.db
# Chrome History
# => ~/Library/Application\ Support/Google/Chrome/Default/History
SAFARI_HISTORY = 'History.db'
CHROME_HISTORY = 'History'


class Browser(Enum):
    Safari = 1
    Chrome = 2


MECAB = MeCab.Tagger('-Ochasen')
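# Workaround: call parse('') once so that node.surface is not released early
# and read back garbled.
MECAB.parse('')


def get_nouns(text):
    """Return the surface forms of all nouns in text."""
    nouns = []
    node = MECAB.parseToNode(text)
    while node:
        # A Japanese dictionary marks nouns with '名詞' in the feature string.
        if '名詞' in node.feature:
            nouns.append(node.surface)
        node = node.next
    return nouns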


def get_db_config(browser):
    if browser == Browser.Safari:
        dbname = SAFARI_HISTORY
        sql = 'select v.title from history_items i join history_visits v on v.history_item = i.id and i.url like "http://qiita.com%"  group by i.url'
    elif browser == Browser.Chrome:
        dbname = CHROME_HISTORY
        sql = 'select u.title from urls u where u.url like "http://qiita.com%" group by u.url'
    else:
        raise ValueError('invalid argument')
    return (dbname, sql)


def get_qiitas(browser):
    """Return the non-empty Qiita page titles from the given browser's history."""
    (dbname, sql) = get_db_config(browser)
    conn = sqlite3.connect(dbname)
    try:
        return [row[0].strip() for row in conn.execute(sql) if row[0]]
    finally:
        conn.close()


def create_wordcloud(text, output):
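    # Path to a font that contains Japanese glyphs; adjust it to a font installed on your system.
    fpath = '/Library/Fonts/Hiragino Maru Go ProN W4.ttc'
    # Filler nouns to exclude. The tokens come from Japanese titles, so these
    # entries need to be the Japanese forms (for example 'よう' for 'Yo') to match.
    stop_words = ['thing', 'this', 'For', 'When', 'Yo']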
    wordcloud = WordCloud(background_color='white', font_path=fpath, width=900, height=500,
                          stopwords=set(stop_words)).generate(text)
    wordcloud.to_file(output)


def main():
    qiita_nouns = []
    for browser in [Browser.Chrome, Browser.Safari, ]:
        for title in get_qiitas(browser):
            qiita_nouns.extend(get_nouns(title))
    create_wordcloud(','.join(qiita_nouns), 'qiita.png')


if __name__ == '__main__':
    main()

That is the whole script. Both Safari and Chrome store their history in SQLite databases, so the script simply queries them. Unfortunately the schemas are completely different, so each browser needs its own SQL. Also, because the string "Qiita" always appears at the end of a Qiita article's title, it shows up very prominently in the word cloud; you can remove it by adding "qiita" in lowercase to stop_words. Personally I like that it clearly marks the cloud as Qiita's, so I leave it in.
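As a rough illustration of that last point, the nouns could also be filtered before they are joined into the text passed to create_wordcloud. This is a sketch under assumptions: drop_words is a hypothetical helper that is not part of the original script, and it compares case-insensitively so that "Qiita" and "qiita" are both dropped.

```python
def drop_words(nouns, unwanted=('qiita',)):
    """Return nouns without any word that matches `unwanted`, ignoring case.

    Hypothetical helper, not part of the original script.
    """
    blocked = {word.lower() for word in unwanted}
    return [noun for noun in nouns if noun.lower() not in blocked]


if __name__ == '__main__':
    sample = ['Python', 'MeCab', 'Qiita', 'WordCloud', 'qiita']
    print(drop_words(sample))  # ['Python', 'MeCab', 'WordCloud']
```

In main(), this would be applied to qiita_nouns just before the ','.join(...) call.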

Again, it was quite interesting to visualize my history, so please give it a try!
