[PYTHON] Create a chatbot that supports free input with Word2Vec

This is the day 1 article of the Zeals Advent Calendar 2019. Nice to meet you. I'm Tamaki, and I will be joining Zeals next year. Zeals runs a business called chat commerce that is built on chatbot technology. I am also currently doing research on natural language processing at university. Many chatbot conversations work without any natural language processing, but I thought it would be interesting to add it, so I chose that as the theme of this article.

What is Word2Vec

Put very simply, Word2Vec is a technique that replaces words with vector representations learned by a neural network, which makes it possible to measure the similarity between words and to add and subtract words. (This is a rough description, so please go easy on me.) In this article, we will build these similarity and addition/subtraction features into a chatbot.
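To make "similarity" and "addition and subtraction of words" a little more concrete, here is a minimal sketch using only the standard library. The three-dimensional vectors are hand-made for illustration (an assumption, not learned values); a real Word2Vec model learns its vectors from text, with 100 dimensions in this article.

```python
import math

# Hypothetical hand-made vectors for illustration only.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "song":  [0.2, 0.7, 0.3],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def add(a, b):
    """Element-wise vector addition."""
    return [x + y for x, y in zip(a, b)]


def sub(a, b):
    """Element-wise vector subtraction."""
    return [x - y for x, y in zip(a, b)]


def nearest(target, exclude=()):
    """Return the vocabulary word whose vector is most similar to target."""
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine_similarity(vectors[w], target))


# "king" - "man" + "woman", excluding the input words from the candidates.
result = add(sub(vectors["king"], vectors["man"]), vectors["woman"])
print(nearest(result, exclude=("king", "man", "woman")))
```

In this toy space the script prints queen. Real models behave less cleanly, but the operations are the same: cosine similarity for "how alike are two words" and element-wise arithmetic for "adding and subtracting words".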

Chatbot conversation flow to create

A chatbot conversation typically asks the user some questions and then proposes the best option based on the answers. So this time, let's create a chatbot that introduces the Vtubers I'm currently into. The conversation flows as follows.

  1. Have the user enter the features they are looking for
  2. Introduce the Vtuber that most likely matches those features

Introducing Vtuber

First, I need to decide which Vtubers to introduce. This time, I selected Vtubers who have strong characteristics and a large number of subscribers.

The first is Omeshisu, a Vtuber with very advanced VR technology. Click here for Omeshisu's Youtube channel

The second is Tsukino Mito, a Vtuber with interesting video ideas. Click here for Tsukino Mito's Youtube channel

The third is Kaf, a Vtuber with very strong singing ability. Click here for Kaf's Youtube channel

Overall picture of the program

Now let's create a chatbot program that introduces the above Vtuber. The overall flow of the program is as follows.

  1. Collect tweets that mention each Vtuber's name from Twitter
  2. Format the tweets for training Word2Vec
  3. Generate a Word2Vec model from the formatted tweets
  4. Incorporate the Word2Vec models into the chatbot

The directory structure looks like this.

.
├── chatbot.py
├── make_model.py
├── model_test.py
└── twitter

The GitHub repository for this article is here: https://github.com/ssabcire/chatbot

Version used this time

Data acquisition from Twitter

First, collect tweets that mention each Vtuber. The code that calls the Twitter API is not the main point of this article, so I will only explain how to build a binary from an existing repository and use it to call the API.

Please clone this repository. https://github.com/ssabcire/get-tweets

git clone https://github.com/ssabcire/get-tweets.git

Next, create keys.go and set your Twitter API keys and the directory in which to store the tweets.

keys.go


package lib

const consumerKey string = ""
const consumerSecret string = ""
const accessToken string = ""
const accessTokenSecret string = ""

// Tweets are saved to $HOME + path; the directory is created under the home directory
const path = "py/chatbot/search-Omeshisu"

Then build the search binary and run it to query the Twitter API.

cd search
go build
./search Omeshisu

It takes a while because of the Twitter API limits, but this collects the tweets that mention Omeshisu.

Model creation from tweets

Next, we will format the tweets and generate the Word2Vec models.

make_model.py


import re
import json
from itertools import islice
from pathlib import Path
from typing import List, Set, Iterator
from pyknp import Juman
from gensim.models.word2vec import Word2Vec


def make_w2v(json_files: Iterator[Path], model_path: str):
    '''
    Train a Word2Vec model on the tweets and save it
    '''
    model = Word2Vec(_make_sentences(json_files), size=100,
                     window=5, min_count=3, workers=4)
    model.save(model_path)


def morphological_analysis(tweet: str) -> List[str]:
    '''
    Morphologically analyze the tweet and return its words as a list
    '''
    text = _remove_unnecessary(tweet)
    if not text:
        return []
    # Juman reports parts of speech in Japanese: noun, verb, adjective
    return [mrph.genkei for mrph in Juman().analysis(text).mrph_list()
            if mrph.hinsi in ['名詞', '動詞', '形容詞']]


def _make_sentences(json_files: Iterator[Path]) -> List[List[str]]:
    '''
    Read the tweets, analyze them morphologically, and return a two-dimensional list
    '''
    return [morphological_analysis(tweet) for tweet in _load_files(json_files)]


def _load_files(json_files: Iterator[Path]) -> Set[str]:
    '''
    Read every file in the given paths of retrieved JSON tweets
    and return the tweet texts as a set
    '''
    tweets = set()
    for file in json_files:
        with file.open(encoding='utf-8') as f:
            try:
                tweets.add(json.load(f)['full_text'])
            except json.JSONDecodeError as e:
                print(e, "\njson filename: ", file)
    return tweets


def _remove_unnecessary(tweet: str) -> str:
    '''
    Delete unnecessary parts of tweets
    '''
    # Remove URLs, 'RT@...:', and '@<ID> '
    text = re.sub(
        r'(https?://[\w/:%#\$&\?\(\)~\.=\+\-]+)|(RT@.*?:)|(@(.)+ )',
        '', tweet
    )
    # Blank out tweets that consist of only 1 or 2 hiragana characters.
    # Also remove spaces (half-width and full-width) and the characters
    # ", #, and @, which Juman cannot handle.
    return re.sub(
        r'(^[ぁ-ん]{1,2}$)|([ |\u3000])|([#"@])',
        '', text
    )


if __name__ == '__main__':
    cwd = Path().cwd()
    make_w2v(
        islice((cwd / "twitter" / "search-omesis").iterdir(), 0, 5000),
        str(cwd / 'omesis.model')
    )
    make_w2v(
        islice((cwd / "twitter" / "search-kahu").iterdir(), 0, 5000),
        str(cwd / 'kahu.model')
    )
    make_w2v(
        islice((cwd / "twitter" / "search-mito").iterdir(), 0, 5000),
        str(cwd / 'mito.model')
    )

Let me start with the function at the top, which creates a model using the Word2Vec class. The first argument must be a two-dimensional list (a list of word lists), so we build it with _make_sentences(). Note that gensim 4.0 and later renamed the size argument to vector_size.

def make_w2v(json_files: Iterator[Path], model_path: str):
    model = Word2Vec(_make_sentences(json_files), size=100,
                     window=5, min_count=3, workers=4)
    model.save(model_path)
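As a rough illustration of what the first argument looks like, and of what a frequency cutoff such as min_count=3 does, here is a standard-library-only sketch. The sample tokens and the helper name surviving_vocabulary are made up for this example; the rule it mimics is that Word2Vec ignores words whose total frequency is lower than min_count.

```python
from collections import Counter
from typing import List, Set

# Made-up tokenized "tweets": one inner list of words per tweet.
sentences: List[List[str]] = [
    ["song", "good", "sing"],
    ["song", "cute"],
    ["song", "sing", "live"],
    ["sing", "good"],
]


def surviving_vocabulary(sentences: List[List[str]], min_count: int) -> Set[str]:
    """Return the words whose total frequency is at least min_count."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}


print(sorted(surviving_vocabulary(sentences, min_count=3)))
```

Only "sing" and "song" appear three times, so they are the only words that would stay in the vocabulary here. With a small set of tweets, a cutoff like this can remove many of the feature words you might want the chatbot to recognize.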

_make_sentences() takes each tweet from the set of tweets, morphologically analyzes it, and turns it into a list of words.

def _make_sentences(json_files: Iterator[Path]) -> List[List[str]]:
    return [morphological_analysis(tweet) for tweet in _load_files(json_files)]
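The tweets themselves come from _load_files in make_model.py, which reads one JSON file per tweet, takes its full_text field, and collects the texts in a set so that identical texts are counted once. The sketch below reproduces that behavior with temporary files standing in for the downloaded tweets. The function name load_unique_texts and the extra KeyError handling are my own additions for illustration, not part of the original script.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable, Set


def load_unique_texts(json_files: Iterable[Path]) -> Set[str]:
    """Collect the 'full_text' field of every readable JSON file, without duplicates."""
    texts = set()
    for path in json_files:
        try:
            with path.open(encoding="utf-8") as f:
                texts.add(json.load(f)["full_text"])
        except (json.JSONDecodeError, KeyError) as e:
            # Skip broken files and tweets that have no full_text field.
            print(f"skipped {path.name}: {e!r}")
    return texts


# Demo: temporary files stand in for the tweets saved by the search binary.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    samples = [
        '{"full_text": "Kaf sings well"}',
        '{"full_text": "Kaf sings well"}',  # same text again, kept only once
        '{"text": "no full_text here"}',    # missing field, skipped
        'not json at all',                  # invalid JSON, skipped
    ]
    for i, body in enumerate(samples):
        (tmp_dir / f"{i}.json").write_text(body, encoding="utf-8")
    texts = load_unique_texts(sorted(tmp_dir.iterdir()))

print(texts)
```

Using a set is a cheap way to keep identical retweets from dominating the training data.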

Juman++ is used for morphological analysis. I'm using Juman++ this time, but any morphological analyzer will do, so use whichever you like.

def morphological_analysis(tweet: str) -> List[str]:
    '''
    Morphologically analyze the tweet and return its words as a list
    '''
    text = _remove_unnecessary(tweet)
    if not text:
        return []
    # Juman reports parts of speech in Japanese: noun, verb, adjective
    return [mrph.genkei for mrph in Juman().analysis(text).mrph_list()
            if mrph.hinsi in ['名詞', '動詞', '形容詞']]
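Before the text reaches the analyzer, _remove_unnecessary strips the parts of a tweet that carry no meaning for the model. The following is a simplified stand-alone sketch of that cleanup step. The patterns and the name clean_tweet are my own and deliberately simpler than the ones in make_model.py, so treat it as an illustration of the idea rather than the script's exact behavior.

```python
import re

URL = re.compile(r"https?://\S+")
RETWEET_PREFIX = re.compile(r"^RT @\w+:\s*")
MENTION = re.compile(r"@\w+")
SYMBOLS_AND_SPACES = re.compile(r'[#"@\s]')
SHORT_HIRAGANA = re.compile(r"^[\u3041-\u3093]{1,2}$")  # hiragana range, ぁ to ん


def clean_tweet(tweet: str) -> str:
    """Strip URLs, retweet prefixes, mentions, symbols, and whitespace."""
    text = URL.sub("", tweet)
    text = RETWEET_PREFIX.sub("", text)
    text = MENTION.sub("", text)
    text = SYMBOLS_AND_SPACES.sub("", text)
    # A tweet that is only one or two hiragana characters is treated as empty.
    return "" if SHORT_HIRAGANA.match(text) else text


print(clean_tweet("RT @someone: 花譜の歌が好き https://example.com/x #花譜"))
print(repr(clean_tweet("@someone うん")))
```

The first call keeps only the Japanese text, and the second returns an empty string, which morphological_analysis then turns into an empty word list.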

Now let's run this script.

python make_model.py

Parsing this many tweets takes quite a while, but this generates three Word2Vec models.

Confirmation of words learned by the model

Let's take a look at which words the model learned.

model_test.py


from pathlib import Path
from gensim.models.word2vec import Word2Vec

cwd = Path().cwd()
model = Word2Vec.load(str(cwd / "kahu.model"))
print(model.wv.index2entity)
> ['Kaf', 'To do', 'Exhibition', 'go', 'sing', 'Song', 'Become', 'of', 'I like', .......

The words have been learned like this. Next, let's look at the words most similar to Kaf.

print(model.wv.most_similar(positive=['Kaf'], topn=5))
> [('To do', 0.9999604225158691), ('of', 0.9999315738677979), ('Become', 0.9999290704727173), ('Say', 0.9999224543571472), ('Observation', 0.9999198317527771)]

Among the top results, "Observation" is the only word that seems meaningful ... You can also check the similarity between two words.

print(model.wv.similarity('song', 'Kaf'))
> 0.9998921

Let's use similarity and the other Word2Vec features in our chatbot.

Chatbot creation

Building a bot on the LINE API was too much for the time I had, so this time I will use standard input and standard output.

chatbot.py


import random
from pathlib import Path
from typing import List, Tuple
from gensim.models.word2vec import Word2Vec
from make_model import morphological_analysis


def exec(vtubers: List[Tuple[str, str]]):
    print("Introducing Vtuber from the features. What kind of features do you want to see Vtuber?")
    utterance = input("Example:interesting,cute,High technology, ...Please enter the feature as a trial: ")
    if not utterance:
        return print("No features entered")
    wakati_utterance = morphological_analysis(utterance)
    if not wakati_utterance:
        return print("Excuse me, but please enter the features in other words.")
    s = set()
    for name, path in vtubers:
        model = Word2Vec.load(path)
        utterance_entities = [word for word in wakati_utterance
                              if word in model.wv.index2entity]
        if not utterance_entities:
            continue
        most_similar_word = model.wv.most_similar_to_given(
            name, utterance_entities)
        if model.wv.similarity(name, most_similar_word) > 0.95:
            s.add(name)
    if s:
        print("Here is the Vtuber that matches the features you entered!", _introduce(s.pop()))
    else:
        print('''I'm sorry, but I couldn't find a Vtuber with that feature..
How about this instead.''', _introduce())


def _introduce(name: str = "") -> str:
    if not name:
        return random.choice((_message1(), _message2(), _message3()))
    elif name == "Omeshisu":
        return _message1()
    elif name == "Kaf":
        return _message2()
    elif name == "Tsukino Mito":
        return _message3()


def _message1() -> str:
    return """\"Omeshisu\"
Click here for the link https://www.youtube.com/channel/UCNjTjd2-PMC8Oo_-dCEss7A"""


def _message2() -> str:
    return """\"Kaf\"
Click here for the link https://www.youtube.com/channel/UCQ1U65-CQdIoZ2_NA4Z4F7A"""


def _message3() -> str:
    return """\"Tsukino Mito\"
Click here for the link https://www.youtube.com/channel/UCD-miitqNY3nyukJ4Fnf4_A"""


if __name__ == '__main__':
    cwd = Path().cwd()
    exec([('Omeshisu', str(cwd / 'omesis.model')),
          ('Kaf', str(cwd / 'kahu.model')),
          ('Tsukino Mito', str(cwd / 'mito.model'))
          ])

Let me briefly explain the code. First, it reads standard input and performs morphological analysis on it.

def exec(vtubers: List[Tuple[str, str]]):
    print("Introducing Vtuber from the features. What kind of features do you want to see Vtuber?")
    utterance = input("Example:interesting,cute,High technology, ...Please enter the feature as a trial: ")
    if not utterance:
        return print("No features entered")
    wakati_utterance = morphological_analysis(utterance)
    if not wakati_utterance:
        return print("Excuse me, but please enter the features in other words.")

Next, for each word in wakati_utterance (the list of morphologically analyzed words), check whether it exists in the Word2Vec model, and if so, add it to a list. Then pick the word in that list that is most similar to the Vtuber's name, and if its similarity is greater than 0.95 (choose a threshold that suits your data), add the Vtuber to the set of candidates to introduce. The idea is that if the similarity exceeds 0.95, it is reasonable to treat the word as a feature of that Vtuber.

    s = set()
    for name, path in vtubers:
        model = Word2Vec.load(path)
        utterance_entities = [word for word in wakati_utterance
                              if word in model.wv.index2entity]
        if not utterance_entities:
            continue
        most_similar_word = model.wv.most_similar_to_given(
            name, utterance_entities)
        if model.wv.similarity(name, most_similar_word) > 0.95:
            s.add(name)
    if s:
        print("Here is the Vtuber that matches the features you entered!", _introduce(s.pop()))
    else:
        print('''I'm sorry, but I couldn't find a Vtuber with that feature..
How about this instead.''', _introduce())
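The selection logic above can also be tried without trained models. The sketch below replaces each gensim model with a small dictionary of hand-made vectors (hypothetical values, chosen only so the example is easy to follow) and applies the same three steps: keep the input words the model knows, find the one most similar to the Vtuber's name, and compare it with the threshold. The name matching_vtubers is made up for this example.

```python
import math
from typing import Dict, List

Vectors = Dict[str, List[float]]


def cosine(a: List[float], b: List[float]) -> float:
    """Return the cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def matching_vtubers(words: List[str], models: Dict[str, Vectors],
                     threshold: float = 0.95) -> List[str]:
    """Return the names whose best-matching input word exceeds the threshold."""
    matches = []
    for name, vectors in models.items():
        known = [w for w in words if w in vectors]  # skip out-of-vocabulary words
        if not known:
            continue
        best = max(cosine(vectors[name], vectors[w]) for w in known)
        if best > threshold:
            matches.append(name)
    return matches


# Hypothetical hand-made vectors, one tiny "model" per Vtuber.
models = {
    "Kaf": {"Kaf": [1.0, 0.1], "song": [0.9, 0.1], "funny": [0.1, 1.0]},
    "Tsukino Mito": {"Tsukino Mito": [0.1, 1.0], "song": [1.0, 0.2], "funny": [0.2, 1.0]},
}

print(matching_vtubers(["song"], models))
print(matching_vtubers(["funny", "unknown"], models))
```

With these toy values, "song" matches only Kaf and "funny" matches only Tsukino Mito, while words that no model knows are ignored.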

Let's run this script as a trial.
python chatbot.py
Introducing Vtuber from the features. What kind of features do you want to see Vtuber?
Example:interesting,cute,High technology, ...Please enter the feature as a trial:interesting
Here is the Vtuber that matches the features you entered!"Omeshisu"
Click here for the link https://www.youtube.com/channel/UCNjTjd2-PMC8Oo_-dCEss7A

python chatbot.py
Introducing Vtuber from the features. What kind of features do you want to see Vtuber?
Example:interesting,cute,High technology, ...Please enter the feature as a trial:Singing voice
Here is the Vtuber that matches the features you entered!"Kaf"
Click here for the link https://www.youtube.com/channel/UCQ1U65-CQdIoZ2_NA4Z4F7A

Nice. It introduced them well. Great!

Problems with this program

This program decides which Vtuber to introduce based only on whether an entered word is in the model, so whether you enter "the song is good" or "the song is bad", the code reacts to "song" and introduces someone who is good at singing. It does not understand the dependency relations within a sentence. However, Word2Vec supports arithmetic on words, so something like song - bad = <a certain word> might make this work. It would be interesting to try more ideas along these lines.
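One way to experiment with word arithmetic here, sketched under my own assumptions and not tested on the real models, is to combine the vectors of all input words into a single query vector instead of reacting to one word. With the hand-made two-dimensional vectors below (hypothetical values), "bad" pulls the query away from Kaf while "good" pulls it closer. In gensim, comparable arithmetic is available through most_similar with its positive and negative arguments.

```python
import math
from typing import Dict, List

# Hypothetical vectors: axis 0 is "about singing", axis 1 is "positive tone".
vectors: Dict[str, List[float]] = {
    "Kaf": [0.8, 0.6],
    "song": [1.0, 0.0],
    "good": [0.0, 1.0],
    "bad": [0.0, -1.0],
}


def cosine(a: List[float], b: List[float]) -> float:
    """Return the cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def query_vector(words: List[str]) -> List[float]:
    """Sum the vectors of all known words into one query vector."""
    known = [vectors[w] for w in words if w in vectors]
    return [sum(axis) for axis in zip(*known)]


good_query = query_vector(["song", "good"])
bad_query = query_vector(["song", "bad"])
print(round(cosine(vectors["Kaf"], good_query), 2))
print(round(cosine(vectors["Kaf"], bad_query), 2))
```

In this toy space only the "song + good" query clears the 0.95 threshold used by the chatbot. Real embeddings may not separate "good" and "bad" this cleanly, because antonyms tend to appear in similar contexts, so this would need to be checked against the actual models.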

Bonus

I've written a lot about this, but in practice chatbots rarely rely on free-form input. Rather than letting users type freely, they usually offer choices through something like quick replies, because the response rate is higher when users are prompted with possible answers. (A quick reply is a feature that presents answer choices to the user.)

In conclusion

The Vtuber I recommend most is Hanabasami Kyo-chan. She is cute and her singing is so good that I would be moved to tears if you took a look! Hanabasami Kyo-chan's Youtube channel is here ... Thank you for your support ...!
