[PYTHON] (Note) A web application that uses TensorFlow to infer recommended song names [Machine learning]

Introduction

This article is a continuation of **(Note) A web application that uses TensorFlow to infer recommended song names [Creating an execution environment with docker-compose]**. Last time I built the TensorFlow and Flask environment with docker-compose, so this time I would like to organize the machine learning part using TensorFlow + Keras. Please note that this is a note I wrote for myself, so it may be hard to follow, and the information and technology may be out of date :bow: I also hope it will be helpful for those who want to build some kind of web application on their own.

The actual web application looks like the GIF below (ezgif.com-crop.gif). When I typed a sentence into the search box, it answered with Humbert Humbert's "Onaji Hanashi (The Same Story)" :clap: *Since there is little training data, only some songs will get hits. It's a bit shabby* :bow_tone1: *Clicking the score link shows part of the score, but that is outside the scope of this article* :no_good_tone1:

References

These are the materials I referred to when creating this article :bow_tone1:

TODO map

The previous article was **(Note) A web application that uses TensorFlow to infer recommended song names [Creating an execution environment with docker-compose]**; this time is **machine learning**.

| Chapter | Classification | Status | Contents | Language, FW, environment, etc. |
|---|---|---|---|---|
| Preface | Common | Done | App overview | Python, TensorFlow, Keras, Google Colaboratory |
| Chapter 1 | Web API | Done | Environment construction (execution environment) | docker-compose, Flask, Nginx, gunicorn |
| Chapter 2 | Web API | Done | (This time) Machine learning | Python, TensorFlow, Keras, Flask |
| Chapter 3 | Screen | Not started | Environment construction | Python, Django, Nginx, gunicorn, PostgreSQL, virtualenv |
| Chapter 4 | Screen | Not started | Display, Web API call part | Python, Django |
| Chapter 5 | AWS | Not started | AWS auto-deploy | GitHub, EC2, CodeDeploy, CodePipeline |

Environment

*I think this will work with versions other than those below too, but please note that they are old :no_good_tone2:*

OS:Ubuntu 18.04.4 LTS
----------------------  ---------
Flask                  1.1.0
gunicorn               19.9.0
Keras                  2.3.1
Keras-Applications     1.0.8
Keras-Preprocessing    1.1.2
matplotlib             3.1.1
mecab-python3          0.996.2
numpy                  1.16.4
pandas                 0.24.2
Pillow                 7.1.2
pip                    20.1
Python                 3.6.9
requests               2.22.0
scikit-learn           0.21.2
sklearn                0.0
tensorflow             2.2.0
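
If you want to roughly match the main library versions above in your own environment, something like the following should work (an optional sketch; on Google Colaboratory the preinstalled versions may conflict, so this is not strictly required for the steps below):

# Optional: pin the main libraries to the versions listed above
!pip3 install tensorflow==2.2.0 Keras==2.3.1 Flask==1.1.0 scikit-learn==0.21.2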

Flow of the recommended-song inference part

First of all, I would like to create the following function: a Web API that, given a sentence (a song's atmosphere, etc.), returns a recommended song title. The actual Web API in action looks like this (Peek 2020-05-16 14-30.gif).

In the example, the GET parameter is "A song that sadly wishes for someone's happiness", and the song title "Kumo ga Yuku no wa" comes back in JSON. [(Example) Web API link](http://52.192.175.215:8888/recommend/api/what-music/ A song that sadly wishes someone's happiness)

The processing flow inside this Web API is as follows (flow diagram). As the flow shows, the song title is returned at the end, but weight data is loaded along the way. This is a pre-trained model created by machine learning, so let's walk through how to create that trained model.
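
For reference, the skeleton of such an endpoint can be sketched with Flask roughly as follows. *This is only a minimal sketch, not the app's actual source: the route follows the example URL above, while the JSON key and the infer_music_name helper are assumptions.*

# Minimal sketch of the Web API (assumed names; the real app differs)
from flask import Flask, jsonify

app = Flask(__name__)

def infer_music_name(text):
    # Placeholder: the real app converts the text to a TF-IDF vector and
    # runs predict() on the trained model described below
    return "Kumo ga Yuku no wa"

@app.route('/recommend/api/what-music/<text>')
def what_music(text):
    # Return the inferred song title as JSON
    return jsonify({'music_name': infer_music_name(text)})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8888)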

Machine learning flow

The following flow, seen from the developer's point of view, is the path up to machine learning (flow diagram). First, prepare the training source data; this is written as text that humans can understand. Next, preprocessing is performed so that the machine (computer) can understand it: in this example, the training source data is converted into numeric vectors by a method called TF-IDF. Finally, machine learning is performed with an MLP (multilayer perceptron). Each step is detailed below.

Creating the training source data

The training source data is comma-separated as shown below (original machine-learning data). Each line contains song information (atmosphere, artist name, etc.) for the song title to be inferred. The text is further separated by "|" (pipe), but it works without it too.
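
For reference, you can peek at the file with something like the following (a sketch; the column indices 2 = song title and 3 = text follow the read_file function shown later in this article):

import csv

# Print the song title (label) and descriptive text of each training row
with open('ans_studyInput_fork.txt', encoding='utf-8') as f:
    for row in csv.reader(f):
        print('title:', row[2], '| text:', row[3])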

Preprocessing (TF-IDF)

Convert the text to numeric vectors with TF-IDF. First, load the training source data created above, then split each sentence into words (tokenize) for the TF-IDF calculation. This step uses MeCab for morphological analysis. For reference, the tokenization source is shown below.

Below is the code to paste into Google Colaboratory. *Don't stare at it too hard* :no_good_tone1: Please paste and execute the cells in order from the top.

Install the required libraries


# Install the required libraries
!apt-get -y install mecab libmecab-dev mecab-ipadic-utf8
!pip3 install mecab-python3

Tokenization part (excerpt)


import MeCab

# Initialize MeCab
tagger = MeCab.Tagger()

def tokenize(text):
    '''Perform morphological analysis with MeCab'''
    result = []
    word_s = tagger.parse(text)
    for n in word_s.split("\n"):
        if n == 'EOS' or n == '': continue
        p = n.split("\t")[1].split(",")
        h, h2, org = (p[0], p[1], p[6])
        # Keep only nouns, verbs and adjectives (MeCab emits Japanese POS labels)
        if not (h in ['名詞', '動詞', '形容詞']): continue
        # Skip numeric nouns
        if h == '名詞' and h2 == '数': continue
        result.append(org)
    return result

# Module test (the original input sentence is Japanese; shown here in translation)
if __name__ == '__main__':
    print(tokenize("movies|Tetsuya Takeda|painful|I wish the happiness of someone I don't know"))

When you run it, you should see something like the following on the console:

['movies', '*', 'Takeda', 'Tetsuya', '*', 'painful', '*', 'know', 'who', 'happiness', 'Wish']

The sentence has been split into words. The example above is just one sentence; the actual program repeats this process for every sentence (line) in the file.

Once tokenization works, calculate TF-IDF. There was an easy-to-understand explanation of TF-IDF, so I will quote it. Source: TF-IDF

A value used to extract the characteristic words of a document. When there are several documents, it quantifies which words are important to a document based on the words that appear in them and their frequencies.

TF-IDF is expressed by the following formula.

\textrm{tfidf}(t,d) = \textrm{tf}(t,d) \times \textrm{idf}(t)

Also, $\textrm{tf}(t,d)$ and $\textrm{idf}(t)$ are expressed by the following formulas.

\textrm{tf}(t,d) = \frac{n_{t,d}}{\sum_{s \in d} n_{s,d}} \quad , \quad \textrm{idf}(t) = \log{\frac{N}{df(t)}} + 1

- $n_{t,d}$: number of occurrences of word $t$ in document $d$
- $\sum_{s \in d} n_{s,d}$: total number of occurrences of all words in document $d$
- $N$: total number of documents
- $df(t)$: number of documents in which word $t$ appears
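
For example, with $N = 10$ documents and a word that appears in $df(t) = 2$ of them, $\textrm{idf}(t) = \log(10/2) + 1 \approx 2.61$ (natural logarithm, matching the code below), while a word that appears in all 10 documents gets $\log(10/10) + 1 = 1$; common words are down-weighted but not zeroed out.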

The above formulas can be converted into a program as follows.

TF-IDF calculation (excerpt)


import numpy as np

def calc_files():
    '''Calculate TF-IDF for the added files'''
    # `files` (tokenized documents as word-ID lists) and `word_dic` (word-to-ID
    # dictionary) are module-level globals of tfidfWithIni
    global dt_dic
    result = []
    doc_count = len(files)
    dt_dic = {}
    # Count word occurrences per document
    for words in files:
        used_word = {}
        data = np.zeros(word_dic['_id'])
        for id in words:
            data[id] += 1
            used_word[id] = 1
        # If word t is used in this document, count it in dt_dic
        for id in used_word:
            if not(id in dt_dic): dt_dic[id] = 0
            dt_dic[id] += 1
        # Convert occurrence counts to frequencies (TF)
        data = data / len(words)
        result.append(data)
    # Calculate TF-IDF
    for i, doc in enumerate(result):
        for id, v in enumerate(doc):
            idf = np.log(doc_count / dt_dic[id]) + 1
            doc[id] = min([doc[id] * idf, 1.0])
        result[i] = doc
    return result

*This is largely adapted from the sample source [^1] of [this reference book](https://www.amazon.co.jp/%E3%81%99%E3%81%90%E3%81%AB%E4%BD%BF%E3%81%88%E3%82%8B-%E6%A5%AD%E5%8B%99%E3%81%A7%E5%AE%9F%E8%B7%B5%E3%81%A7%E3%81%8D%E3%82%8B-Python%E3%81%AB%E3%82%88%E3%82%8B-AI%E3%83%BB%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92%E3%83%BB%E6%B7%B1%E5%B1%A4%E5%AD%A6%E7%BF%92%E3%82%A2%E3%83%97%E3%83%AA%E3%81%AE%E3%81%A4%E3%81%8F%E3%82%8A%E6%96%B9-%E3%82%AF%E3%82%B8%E3%83%A9%E9%A3%9B%E8%A1%8C%E6%9C%BA/dp/4802611641), but this time I'm publishing the source on GitHub (source).*

[^1]: [Source: Ready to use! Practical for business! How to build AI / machine learning / deep learning apps with Python](https://github.com/kujirahand/book-mlearn-gyomu)

The source from reading the training source file through calculating and outputting TF-IDF is as follows. Since the core TF-IDF calculation source is long, it is modularized and imported :sweat: It also reads the training source data. The files are stored below, so please upload them.
tfidfWithIni.py ← module that calculates TF-IDF
ans_studyInput_fork.txt ← training source file

TF-IDF vector creation procedure

Below is the code to paste into Google Colaboratory for your reference. *Don't stare at it too hard* :no_good_tone1: Please paste and execute the cells in order from the top.

Step 1: Upload the files


# Upload the files ("tfidfWithIni.py", "ans_studyInput_fork.txt")
from google.colab import files
uploaded = files.upload()

Step 2: Install the required libraries


# Create a directory for saving files
!mkdir text
# Install the required libraries
!apt-get -y install mecab libmecab-dev mecab-ipadic-utf8
!pip3 install mecab-python3

Step 3: Convert to TF-IDF vectors


import os, csv, glob, pickle
import tfidfWithIni
import importlib

# Reload the module (tfidfWithIni)
importlib.reload(tfidfWithIni)

# Initialize variables
y = []
x = []

# Label-to-code conversion dictionary
labelToCode = {}

# Read the csv file
def read_file(path):
    '''Add a text file for training'''
    with open(path, "r", encoding="utf-8") as f:
        reader = csv.reader(f)
        label_id = 0
        for row in reader:
            # Create a label code
            if row[2] not in labelToCode:
                labelToCode[row[2]] = label_id
                label_id += 1

            y.append(labelToCode[row[2]])  # Set the label
            tfidfWithIni.add_text(row[3])  # Set the sentence
            # print("label: ", row[2], "(", labelToCode[row[2]], ")", "Sentence: ", row[3])

# Module test
if __name__ == '__main__':
    # Initialize the TF-IDF module (empty the file list)
    tfidfWithIni.iniForOri()

    # Read the training source file
    read_file("ans_studyInput_fork.txt")

    # Convert to TF-IDF vectors
    x = tfidfWithIni.calc_files()

    # Save the results
    pickle.dump([y, x], open('text/genre.pickle', 'wb'))
    tfidfWithIni.save_dic('text/genre-tdidf.dic')
    pickle.dump(labelToCode, open('text/label_to_code.pickle', 'wb'))

When executed, the text folder and its files should be created (genre.pickle, genre-tdidf.dic, label_to_code.pickle).
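
To double-check, you can list the output directory from a Colab cell:

# List the generated files
!ls text
# Expected: genre-tdidf.dic  genre.pickle  label_to_code.pickle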

The dictionary for the TF-IDF calculation is a mapping from the words used in the calculation to IDs.
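
If you want to inspect it yourself, something like the following should work (a sketch assuming the first element of the pickled dictionary is the word-to-ID dict, which is how the inference code later in this article reads it):

import pickle

# The first element of the pickle is the word-to-ID dictionary;
# its '_id' entry holds the number of registered words
word_dic = pickle.load(open('text/genre-tdidf.dic', 'rb'))[0]
print('vocabulary size:', word_dic['_id'])
for word, word_id in list(word_dic.items())[:10]:
    print(word, '->', word_id)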

Machine learning (MLP)

With preprocessing done, we are ready for machine learning. Based on the training data above, the model learns to identify the correct song title. MLP (multilayer perceptron) is used as the learning method. An MLP is a type of neural network modeled on human nerves; it consists of three or more layers of nodes. Using a certain method, the MLP learns from the training (correct-answer) data so that even when unknown data (in this example, a song's atmosphere) comes in, it can make the correct judgment (in this example, the song title). We use the machine learning framework TensorFlow + Keras to do this. This time we create a neural network with the following structure *(network diagram)* :sweat:

Modeling this neural network with TensorFlow + Keras looks like the following [^1].

# Define the MLP model structure
model = Sequential()
model.add(Dense(512, activation='relu', input_shape=(in_size,)))
model.add(Dropout(0.2))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(nb_classes, activation='softmax'))

The layers use what Keras calls Dense; with this, every perceptron in one layer is connected to every perceptron in the next layer. The number of inputs, x1 to xt in the diagram, is defined by the argument input_shape: it is simply the number of distinct words obtained by tokenizing all the sentences, and in the sample training file there are 144 (dimensions). The outputs, y1 to yclass, correspond to the number of song titles in the training file and are specified by the argument nb_classes; there are 10 (songs) in the sample.
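
As a sanity check on these sizes: with the sample's 144-dimensional input, the first Dense layer has 144 × 512 + 512 = 74,240 parameters, the second 512 × 512 + 512 = 262,656, and the softmax output layer 512 × 10 + 10 = 5,130; these are the numbers model.summary() should report.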

Next, set up how training is performed so that the model can judge correctly (compile). Following the Keras documentation for multiclass classification problems, we use RMSprop as the optimization algorithm and categorical_crossentropy as the loss function. *(Rough image: loss function = an index measuring how far off the learning is; optimization algorithm = the correction method for getting closer to the correct answer)*

#Compile the model
model.compile(
    loss='categorical_crossentropy',
    optimizer=RMSprop(),
    metrics=['accuracy'])

Finally, the training execution part. Training is performed with the fit method: you give NumPy arrays of the inputs (song atmosphere, etc.) and outputs (song titles) to the Sequential model's fit method.

hist = model.fit(x_train, y_train,
          batch_size=16, # number of samples processed at a time
          epochs=150,    # roughly, how many times training is repeated
          verbose=1,
          validation_data=(x_test, y_test))

Executing machine learning

Below is the code to paste into Google Colaboratory for your reference.

After executing up to step 3 of [the TF-IDF vector creation procedure above](# tf-idf vector creation procedure), you should be able to run machine learning with the following:

Step 4: Run machine learning (MLP)


import pickle
from sklearn.model_selection import train_test_split
import sklearn.metrics as metrics
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import RMSprop
import matplotlib.pyplot as plt
import numpy as np
import h5py

# Number of labels (classes) to classify
labelToCode = pickle.load(open("text/label_to_code.pickle", "rb"))
nb_classes = len(labelToCode)

# Read the database
data = pickle.load(open("text/genre.pickle", "rb"))
y = data[0] # label codes
x = data[1] # TF-IDF vectors
# Convert the label data to one-hot vectors
y = keras.utils.np_utils.to_categorical(y, nb_classes)
in_size = x[0].shape[0] # number of elements of input x[0]

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(
        np.array(x), np.array(y), test_size=0.2)

# Define the MLP model structure
model = Sequential()
model.add(Dense(512, activation='relu', input_shape=(in_size,)))
model.add(Dropout(0.2))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(nb_classes, activation='softmax'))

# Compile the model
model.compile(
    loss='categorical_crossentropy',
    optimizer=RMSprop(),
    metrics=['accuracy'])

# Run training
hist = model.fit(x_train, y_train,
          batch_size=16, # number of samples processed at a time
          epochs=150,    # roughly, how many times training is repeated
          verbose=1,
          validation_data=(x_test, y_test))

# Evaluate
score = model.evaluate(x_test, y_test, verbose=1)
print("Correct answer rate=", score[1], 'loss=', score[0])

# Save the weight data
model.save_weights('./text/genre-model.hdf5')

# Plot the training history (train and validation accuracy)
plt.plot(hist.history['accuracy'])
plt.plot(hist.history['val_accuracy'])
plt.title('Accuracy')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

When the run finishes, the accuracy graph is displayed and the weight file (/content/text/genre-model.hdf5) should have been created as well. This concludes the machine learning.

Inferring song titles with the trained model

The inference part defines the same model as in training, loads the trained weights, the TF-IDF dictionary, and the result-label dictionary, and then converts an unknown document (a song's atmosphere) into a TF-IDF vector. Finally, giving that TF-IDF vector to the Sequential model's predict method yields the inferred song title.

Below is the code to paste into Google Colaboratory for your reference.

After executing up to step 4 of [Executing machine learning above](# Execution of machine learning), you should be able to infer the song title with the following:

Inferring the song title



import pickle, tfidfWithIni
import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import RMSprop
from keras.models import model_from_json
import importlib

# Reload the module (tfidfWithIni)
importlib.reload(tfidfWithIni)

def inverse_dict(d):
    return {v:k for k,v in d.items()}

# Infer the song title for a given text
def getMusicName(text):
    # Convert to a TF-IDF vector
    data = tfidfWithIni.calc_text(text)
    # Predict with the MLP
    pre = model.predict(np.array([data]))[0]
    n = pre.argmax()
    print("Recommended song name: " + label_dic[n], "(", pre[n], ")")


# Label definitions
labelToCode = pickle.load(open("text/label_to_code.pickle", "rb"))
nb_classes = len(labelToCode)
label_dic = inverse_dict(labelToCode)

# Get the number of input elements from the dictionary
in_size_hantei = pickle.load(open("text/genre-tdidf.dic", "rb"))[0]['_id']

# Load the TF-IDF dictionary
tfidfWithIni.load_dic("text/genre-tdidf.dic")

# Define the Keras model and load the weight data
model = Sequential()
model.add(Dense(512, activation='relu', input_shape=(in_size_hantei,)))
model.add(Dropout(0.2))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(nb_classes, activation='softmax'))
model.compile(
    loss='categorical_crossentropy',
    optimizer=RMSprop(),
    metrics=['accuracy'])
model.load_weights('./text/genre-model.hdf5')

if __name__ == '__main__':
    requestParam = """
A song that is sad and wishes for someone's happiness
    """
    getMusicName(requestParam)

The result may vary depending on how training went, but something like the following should be displayed.

Recommended song name: Kumo ga Yuku no wa ( 0.99969995 )

Inferring the song title through a Web API with Flask was touched on in the previous article, so I'll omit it here :sweat:

About the future

This time I was able to organize my notes on machine learning a little. I hope to brush this up and tidy it little by little when I have time :sob: It's not decided yet, but next time I'd like to cover the environment construction for the screen (front-end) side.

| Chapter | Classification | Status | Contents | Language, FW, environment, etc. |
|---|---|---|---|---|
| Preface | Common | Done | App overview | Python, TensorFlow, Keras, Google Colaboratory |
| Chapter 1 | Web API | Done | Environment construction (execution environment) | docker-compose, Flask, Nginx, gunicorn |
| Chapter 2 | Web API | Done | Machine learning | Python, TensorFlow, Keras, Flask |
| Chapter 3 | Screen | Not started | (Next time) Environment construction | Python, Django, Nginx, gunicorn, PostgreSQL, virtualenv |
| Chapter 4 | Screen | Not started | Display, Web API call part | Python, Django |
| Chapter 5 | AWS | Not started | AWS auto-deploy | GitHub, EC2, CodeDeploy, CodePipeline |
