This article is the first day article of Zeals Advent Calendar 2019. Nice to meet you. I'm Tamaki, who will join Zeals next year. Zeals is engaged in a business using chatbot technology called chat commerce. Also, I am currently doing research using natural language processing at university. There are conversations in bots that do not introduce natural language processing, but I thought it would be interesting to add natural language processing technology to them, so I decided to make it the theme of this article.

What is Word2Vec

To put it very simply, it is a technology that enables the degree of similarity between words and the addition and subtraction of words by replacing words with vector representations and learning using neural networks. (It's too vague to fly Masakari) Let's incorporate the similarity and addition / subtraction functions into the chatbot! That is the story of this time.

Chatbot conversation flow to create

A chatbot conversation is a flow of hearing from users and proposing the best one based on it. So, let's create a chatbot that introduces my boom Vtuber to users this time. The flow of the chatbot conversation this time is as follows.

Have the user enter the features
Introducing Vtuber, which is probably the most applicable feature

Introducing Vtuber

I will decide the Vtuber to introduce. Therefore, this time, we selected people who have strong characteristics among Vtubers and have a large number of registrants.

The first one is Sisters. It is a Vtuber with very high VR technology. Click here for Youtube channel of Omesys

The second is Mito Tsukino. Vtuber with interesting video planning. Click here for Tsukino Mito's Youtube channel

The third person is Kaf. It is a Vtuber with very high singing ability. Click here for the Youtube channel of Kaf

Overall picture of the program

Now let's create a chatbot program that introduces the above Vtuber. The overall flow of the program is as follows.

Get tweets tweeting under the name of Vtuber from Twitter
Formatting tweets to train Word2Vec
Generate a Word2Vec model with formatted tweets
Incorporate the Word2Vec model into your chatbot

The directory structure looks like this.

.
├── chatbot.py
├── make_model.py
├── model_test.py
└── twitter

Here is the github repository created this time https://github.com/ssabcire/chatbot

Version used this time

macOS Catalina 10.15.1
Python 3.7.3
gensim==3.8.1
pyknp==0.4.1
pandas==0.25.1
go version 1.11.1

Data acquisition from Twitter

Get tweets tweeting about Vtuber. I don't think the code that hits the Twitter API is very important, so I will explain only how to create a binary from the repository that has already been created and hit the API.

Please clone this. https://github.com/ssabcire/get-tweets

git clone https://github.com/ssabcire/get-tweets.git

Next, create keys.go and decide the key of Twitter API and the directory to store tweets.

`keys.go`


package lib

const consumerKey string = ""
const consumerSecret string = ""
const accessToken string = ""
const accessTokenSecret string = ""

// path = $HOME+path format. PATH is created under the home directory
const path = "py/chatbot/search-Omeshisu"

Then hit the Twitter API to generate a binary to search.

cd search
go build
./search Omeshisu

It will take some time due to the limitation of Twitter API, but now I can get the tweets tweeted by Omeshisu.

Model creation from tweets

Next, we will format the tweet and generate a Word2Vec model.

`make_model.py`


import re
import json
from itertools import islice
from pathlib import Path
from typing import List, Set, Iterator
from pyknp import Juman
from gensim.models.word2vec import Word2Vec


def make_w2v(json_files: Iterator[Path], model_path: str):
    '''
Save Word2Vec model in Tweet
    '''
    model = Word2Vec(_make_sentences(json_files), size=100,
                     window=5, min_count=3, workers=4)
    model.save(model_path)


def morphological_analysis(tweet: str) -> List[str]:
    '''
Morphologically parse the tweet and return it as a list
    '''
    text = _remove_unnecessary(tweet)
    if not text:
        return []
    return [mrph.genkei for mrph in Juman().analysis(text).mrph_list()
            if mrph.hinsi in ['noun', 'verb', 'adjective']]


def _make_sentences(json_files: Iterator[Path]) -> List[List[str]]:
    '''
Reads tweets, performs morphological analysis, and returns a two-dimensional list
    '''
    return [morphological_analysis(tweet) for tweet in _load_files(json_files)]


def _load_files(json_files: Iterator[Path]) -> Set[str]:
    '''
Read all the files from the list containing the PATH of the retrieved JSON tweets,
Returns a set of text
    '''
    tweets = set()
    for file in json_files:
        with file.open(encoding='utf-8') as f:
            try:
                tweets.add(json.load(f)['full_text'])
            except json.JSONDecodeError as e:
                print(e, "\njsofilename: ", file)
    return tweets


def _remove_unnecessary(tweet: str) -> str:
    '''
Delete unnecessary parts of tweets
    '''
    # URL, 'RT@...:', '@<ID> '
    text = re.sub(
        r'(https?://[\w/:%#\$&\?\(\)~\.=\+\-]+)|(RT@.*?:)|(@(.)+ )',
        '', tweet
    )
    #Tweet is hiragana 1,If there are only 2 characters,Blank
    # [", #, @]Can't handle juman
    return re.sub(
        r'(^[Ah-Hmm]{1,2}$)|([ |　])|([#"@])',
        '', text
    )


if __name__ == '__main__':
    cwd = Path().cwd()
    make_w2v(
        islice((cwd / "twitter" / "search-omesis").iterdir(), 0, 5000),
        str(cwd / 'omesis.model')
    )
    make_w2v(
        islice((cwd / "twitter" / "search-kahu").iterdir(), 0, 5000),
        str(cwd / 'kahu.model')
    )
    make_w2v(
        islice((cwd / "twitter" / "search-mito").iterdir(), 0, 5000),
        str(cwd / 'mito.model')
    )

First, I will explain from this method at the top. Create a model using the Word2Vec class. Since it is necessary to pass a 2D array as the first argument, we will create a 2D array with _make_sentences ().

def make_w2v(json_files: Iterator[Path], model_path: str):
    model = Word2Vec(_make_sentences(json_files), size=100,
                     window=5, min_count=3, workers=4)
    model.save(model_path)

_make_sentences () takes a tweet from the list of tweets, morphologically parses the tweet, and creates a list of words.

def _make_sentences(json_files: Iterator[Path]) -> List[List[str]]:
    return [morphological_analysis(tweet) for tweet in _load_files(json_files)]

Juman ++ is used for morphological analysis. I'm using Human this time, but please use whatever you like as it can be anything that can analyze morphological elements.

def morphological_analysis(tweet: str) -> List[str]:
    '''
Morphologically parse the tweet and return it as a list
    '''
    text = _remove_unnecessary(tweet)
    if not text:
        return []
    return [mrph.genkei for mrph in Juman().analysis(text).mrph_list()
            if mrph.hinsi in ['noun', 'verb', 'adjective']]

Now let's run this script.

python make_model.py

It takes a long time to parse a fair amount of tweets, but I was able to generate three Word2Vec models.

Confirmation of words learned by the model

Let's take a look at what words were learned in the model

`model_test.py`


from pathlib import Path
from gensim.models.word2vec import Word2Vec

cwd = Path().cwd()
model = Word2Vec.load(str(cwd / "kahu.model"))
print(model.wv.index2entity)

['Kaf', 'To do', 'Exhibition', 'go', 'sing', 'Song', 'Become', 'of', 'I like', .......

Words are being learned like this. Next, let's look at the words that most closely resemble the kaf.

print(model.wv.most_similar(positive=['Kaf'], topn=5))
> [('To do', 0.9999604225158691), ('of', 0.9999315738677979), ('Become', 0.9999290704727173), ('Say', 0.9999224543571472), ('Observation', 0.9999198317527771)]

Observation is the only word that seems to make sense in the hypernym ... You can also check the similarity between words.

print(model.wv.similarity('song', 'Kaf'))
> 0.9998921

Let's use this Word2Vec similarity and other features to incorporate it into our chatbots.

Chatbot creation

Creating a bot using the LINE API was difficult because I didn't have much time, so this time I will use standard input and standard output.

`chatbot.py`


import random
from pathlib import Path
from typing import List, Tuple
from gensim.models.word2vec import Word2Vec
from make_model import morphological_analysis


def exec(vtubers: List[Tuple[str, str]]):
    print("Introducing Vtuber from the features. What kind of features do you want to see Vtuber?")
    utterance = input("Example:interesting,cute,High technology, ...Please enter the feature as a trial: ")
    if not utterance:
        return print("No features entered")
    wakati_utterance = morphological_analysis(utterance)
    if not wakati_utterance:
        return print("Excuse me, but please enter the features in other words.")
    s = set()
    for name, path in vtubers:
        model = Word2Vec.load(path)
        utterance_entities = [word for word in wakati_utterance
                              if word in model.wv.index2entity]
        if not utterance_entities:
            continue
        most_similar_word = model.wv.most_similar_to_given(
            name, utterance_entities)
        if model.wv.similarity(name, most_similar_word) > 0.95:
            s.add(name)
    if s:
        print("Here is the Vtuber that matches the features you entered!", _introduce(s.pop()))
    else:
        print('''I'm sorry, but I couldn't find a Vtuber with that feature..
How about this instead.''', _introduce())


def _introduce(name: str = "") -> str:
    if not name:
        return random.choice((_message1(), _message2(), _message3()))
    elif name == "Omeshisu":
        return _message1()
    elif name == "Kaf":
        return _message2()
    elif name == "Tsukino Mito":
        return _message3()


def _message1() -> str:
    return """\"Omeshisu\"
Click here for the link https://www.youtube.com/channel/UCNjTjd2-PMC8Oo_-dCEss7A"""


def _message2() -> str:
    return """\"Kaf\"
Click here for the link https://www.youtube.com/channel/UCQ1U65-CQdIoZ2_NA4Z4F7A"""


def _message3() -> str:
    return """\"Tsukino Mito\"
Click here for the link https://www.youtube.com/channel/UCD-miitqNY3nyukJ4Fnf4_A"""


if __name__ == '__main__':
    cwd = Path().cwd()
    exec([('Omeshisu', str(cwd / 'omesis.model')),
          ('Kaf', str(cwd / 'kahu.model')),
          ('Tsukino Mito', str(cwd / 'mito.model'))
          ])

Let's briefly explain the code. It receives standard input and performs morphological analysis.

def exec(vtubers: List[Tuple[str, str]]):
    print("Introducing Vtuber from the features. What kind of features do you want to see Vtuber?")
    utterance = input("Example:interesting,cute,High technology, ...Please enter the feature as a trial: ")
    if not utterance:
        return print("No features entered")
    wakati_utterance = morphological_analysis(utterance)
    if not wakati_utterance:
        return print("Excuse me, but please enter the features in other words.")

Check if each word exists in the Word2Vec model from wakati_utterance in the list containing the morphologically analyzed words, and if so, add it to the list. Then, take out the one with the highest similarity from them, and if the value is 0.95 or more (please decide each one), add it to Set and introduce Vtuber. If the similarity is 95% or higher, it's safe to say that the word is a feature of Vtuber! That is the idea.

    s = set()
    for name, path in vtubers:
        model = Word2Vec.load(path)
        utterance_entities = [word for word in wakati_utterance
                              if word in model.wv.index2entity]
        if not utterance_entities:
            continue
        most_similar_word = model.wv.most_similar_to_given(
            name, utterance_entities)
        if model.wv.similarity(name, most_similar_word) > 0.95:
            s.add(name)
    if s:
        print("Here is the Vtuber that matches the features you entered!", _introduce(s.pop()))
    else:
        print('''I'm sorry, but I couldn't find a Vtuber with that feature..
How about this instead.''', _introduce())

Let's run this script as a trial.

python chatbot.py
Introducing Vtuber from the features. What kind of features do you want to see Vtuber?
Example:interesting,cute,High technology, ...Please enter the feature as a trial:interesting
Here is the Vtuber that matches the features you entered!"Omeshisu"
Click here for the link https://www.youtube.com/channel/UCNjTjd2-PMC8Oo_-dCEss7A

python chatbot.py
Introducing Vtuber from the features. What kind of features do you want to see Vtuber?
Example:interesting,cute,High technology, ...Please enter the feature as a trial:Singing voice
Here is the Vtuber that matches the features you entered!"Kaf"
Click here for the link https://www.youtube.com/channel/UCQ1U65-CQdIoZ2_NA4Z4F7A

Ummmm. I was able to introduce it nicely. It's the best!

Problems with this program

This time I decided not to introduce it depending on whether the entered word is included in the model, so If you enter either song is good or song is bad, the above code will react to song, and people who are good at singing will be introduced. I don't understand the sentence dependency. However, Word2Vec can calculate words, so you may be able to do it well by doing something like song-bad = <a certain word>. It seems interesting to try more ideas around here.

bonus

I've written this for a long time, but in reality, chatbots don't use free speech very much. The reason is that rather than letting the user speak freely, the user is given a choice format such as quick reply. This is because the response rate is higher when prompting for answers. (↓ Quick reply is a function displayed like this)

in conclusion

The Vtuber I recommend most is Hanabasami Kyo-chan. It's cute and the song is so good that I'm glad to cry if you take a look! !! !! Hanabasami Kyo-chan's Youtube channel is here ... Thank you for your cooperation ...!

[PYTHON] Create a chatbot that supports free input with Word2Vec