[PYTHON] Automatically select BGM according to the content of the conversation

1. Introduction

This is another attempt at the second AI development contest, the "Neural Network Console Challenge" organized by Sony and Ledge. By analyzing Audiostock's audio (BGM) data and bibliographic data (song descriptions), I work on the free task **"Create a player that automatically selects BGM according to the content of daily conversation"**. The idea is a system in which a smart speaker such as Google Home automatically plays BGM that matches the conversation of the people in the room (although privacy concerns make the hurdle for practical use seem high).

2. Experimental environment & data

・ Google Colaboratory (Python 3)
・ Neural Network Console (Windows version)

Work No. | Data name | One line explanation | tag
42554 | audiostock_42554.wav | The best song for the opening | opening
42555 | audiostock_42555.wav | It's a bossa nova song | Bossa Nova
42556 | audiostock_42556.wav | Heartwarming Easy Listening Comical | comical,cute,warm,Heartwarming,easy listening
42557 | audiostock_42557.wav | It's a strange song | Strange time signature

3. BGM automatic classification

Using the BGM audio data (WAV), I create an automatic classification model with NNC. Due to time constraints, the model built this time classifies songs into only 3 classes.

3-1. Annotation and learning data

To decide which classes to use, I first examined the words that appear in the "one-line explanation" column above using KH Coder, a tool for statistical analysis of text data. The top results are as follows.

[Figure: image.png]

From these results, I decided to use as training data the songs whose explanation contains any of "rock", "pop", or "ballad", since these seemed distinguishable when actually listening to the BGM (tempo, tone, and so on differ). This gave 1468 training samples and 105 evaluation samples. Sound sources such as jingles and sound effects were excluded because the clips are too short.
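The annotation rule described above can be sketched as a small function. This is only an illustration under assumptions: the class ids (0, 1, 2), the rule that the first matching keyword wins, and the case-insensitive matching are my own choices and are not stated in the article.

```python
# Hypothetical class ids; the article does not say which id each class received.
CLASS_KEYWORDS = {"rock": 0, "pop": 1, "ballad": 2}

def assign_label(explanation, tag=""):
    """Return a class id for a song, or None if it is not used for training.

    Jingles are skipped, and the first keyword found in the one-line
    explanation decides the class (an assumption for this sketch).
    """
    if "jingle" in explanation.lower() or "jingle" in tag.lower():
        return None
    lowered = explanation.lower()
    for keyword, label in CLASS_KEYWORDS.items():
        if keyword in lowered:
            return label
    return None

# One-line explanations taken from examples quoted in the article
for text in ["Busy but fun pop/rock",
             "Warming ballads, teen feelings",
             "It's a bossa nova song"]:
    print(text, "->", assign_label(text))
```

Songs for which the function returns None would simply be left out of the training and evaluation sets.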

3-2. Conversion to mel-frequency cepstral coefficients

The WAV data of each BGM is converted into mel-frequency cepstral coefficients (MFCC), with 40 coefficients per frame (details are omitted, but this page / 34161f2facb80edd999f)). Each coefficient was then averaged over all time frames (the mean of each row of the coefficient-by-frame matrix), giving an array of shape (1, 40) that was used as training data.

Wav_to_Mel.py


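import numpy as np
import librosa

# Hypothetical example paths; replace with the WAV to convert and the CSV to write
file_name = "audiostock_42554.wav"
output_filename = "audiostock_42554.csv"

# Load the audio (librosa resamples to 22050 Hz by default)
y, sr = librosa.load(file_name)
# Extract 40 MFCCs per frame; the result has shape (40, n_frames)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
# Average each coefficient over all frames to get a 40-dimensional vector
S_A = np.mean(mfcc, axis=1)
# Save as a single CSV row of shape (1, 40)
np.savetxt(output_filename, S_A.reshape(1, -1), delimiter=',', fmt="%s")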
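To make the averaging step concrete, the following sketch applies the same reduction to random matrices standing in for MFCC output. The helper name `summarize_mfcc` and the synthetic data are assumptions for illustration; no audio is loaded here.

```python
import numpy as np

def summarize_mfcc(mfcc):
    """Average an (n_mfcc, n_frames) matrix over frames, giving shape (1, n_mfcc)."""
    return np.mean(mfcc, axis=1).reshape(1, -1)

# Synthetic stand-ins for MFCC matrices of a short and a long song
rng = np.random.default_rng(0)
short_clip = rng.normal(size=(40, 120))
long_clip = rng.normal(size=(40, 5000))

print(summarize_mfcc(short_clip).shape)  # (1, 40)
print(summarize_mfcc(long_clip).shape)   # (1, 40)
```

Songs of any length map to a vector of the same size, which is what lets a simple network take them as input. The trade-off is that averaging over frames discards how the sound changes over time.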

3-3. Create a classification model with NNC

I trained a model with NNC to classify these vectors. A CNN-based approach seems to be common, but after trying various networks and activation functions, the following settings gave the highest accuracy. If I have time, I would like to repeat the experiments. Incidentally, an advantage of NNC is that trial and error, such as swapping a function, is very easy because it is done through a GUI. You can see intuitively what kind of network you are building, which I think is one of its attractions compared with Google Colab.

[Figure: image.png]

Since the input vectors are low-dimensional, the CPU (Windows version) was sufficient for training this time, but for reference I show the results from the cloud version, trained with almost the same settings. After 30 epochs of training, the learning curve was as follows (the best validation result was at the 9th epoch).

[Figure: image.png]

Next, I evaluated the trained model on the test data to measure its accuracy.

[Figure: image.png]
[Figure: image.png]

For this three-class problem the accuracy is 0.8, which suggests that the model has captured some characteristics of each class. The average precision is also about 80% or higher, so the model seems useful for the task of selecting suitable BGM.
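For readers who want to reproduce the two metrics quoted above outside NNC, here is a minimal sketch of accuracy and per-class precision. The label lists are made up for illustration and are not the article's evaluation data.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def precision_per_class(y_true, y_pred):
    """For each predicted class, the fraction of those predictions that are correct."""
    predicted = Counter(y_pred)
    true_positive = Counter(p for t, p in zip(y_true, y_pred) if t == p)
    return {c: true_positive[c] / predicted[c] for c in sorted(predicted)}

# Toy labels (hypothetical): 0 = rock, 1 = pop, 2 = ballad
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 2, 2, 0]

print(accuracy(y_true, y_pred))             # 0.7
print(precision_per_class(y_true, y_pred))
```

Averaging the per-class values gives the "average precision" figure in the sense used in this section.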

4. Analyze everyday conversation

Using a pre-trained BERT model, I select the BGM whose one-line explanation is closest to the conversation content. The conversation text is converted into a vector, and the BGM whose one-line explanation vector is closest in cosine similarity is selected. In other words, a suitable BGM is first found from the text, and songs are then selected using the audio-based classifier created in Section 3. I could not think of a way to implement BERT in NNC, so this part was processed with Google Colab and transformers, which I have experience with. (Personally, NNC seems to focus mostly on image tasks, so it would help me in my work if natural language support were strengthened in the future.)

Conversation_to_BGM.py


import pandas as pd
import numpy as np
import torch
import transformers

from sklearn.neighbors import NearestNeighbors
from transformers import BertJapaneseTokenizer
from tqdm import tqdm
# Enable DataFrame.progress_apply
tqdm.pandas()

class BertSequenceVectorizer:
    def __init__(self):
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.model_name = 'cl-tohoku/bert-base-japanese-whole-word-masking'
        self.tokenizer = BertJapaneseTokenizer.from_pretrained(self.model_name)
        self.bert_model = transformers.BertModel.from_pretrained(self.model_name)
        self.bert_model = self.bert_model.to(self.device)
        self.max_len = 128
            
    def vectorize(self, sentence : str) -> np.array:
        inp = self.tokenizer.encode(sentence)
        len_inp = len(inp)

        if len_inp >= self.max_len:
            inputs = inp[:self.max_len]
            masks = [1] * self.max_len
        else:
            inputs = inp + [0] * (self.max_len - len_inp)
            masks = [1] * len_inp + [0] * (self.max_len - len_inp)

        inputs_tensor = torch.tensor([inputs], dtype=torch.long).to(self.device)
        masks_tensor = torch.tensor([masks], dtype=torch.long).to(self.device)
        
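        # Run BERT without tracking gradients (inference only)
        with torch.no_grad():
            outputs = self.bert_model(inputs_tensor, attention_mask=masks_tensor)
        # outputs[0] is the last hidden state; indexing works for both tuple
        # and ModelOutput return types across transformers versions
        seq_out = outputs[0]

        # Use the vector of the first ([CLS]) token as the sentence feature
        return seq_out[0][0].cpu().numpy()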

if __name__ == '__main__':
    #Reading the original data
    df_org = pd.read_csv('./drive/NNC/BGM data list.csv')
    #Focus only on the songs in the learning data
    df_org = df_org.dropna(subset=["One line explanation"])
    df_org = df_org[~df_org['One line explanation'].str.contains("Jingle")]
    df_org = df_org[~df_org['tag'].str.contains("Jingle")]
    df_org = df_org.head(5000)
    # Keywords that define the three classes
    word = ["Rock", "pop", "Ballad"]
    df = df_org.iloc[0:0]
    for w in word:
        df_detect = df_org[df_org["One line explanation"].str.contains(w)]
        df = pd.concat([df, df_detect])
    df = df.reset_index(drop=True)
    BSV = BertSequenceVectorizer()
    #Calculate feature vector from bibliography
    df['text_feature'] = df['One line explanation'].progress_apply(lambda x: BSV.vectorize(x))
    #Search for similar vector (BGM) from input text
    nn = NearestNeighbors(metric='cosine')
    nn.fit(df["text_feature"].values.tolist())
    vec = BSV.vectorize("Good morning. It's nice weather today. Yeah. It looks like it's sunny all day.")
    # Find the nearest explanation by cosine distance; result holds row indices
    dists, result = nn.kneighbors([vec], n_neighbors=1)
    r = result[0]
    print(df["Data name"][r], df["One line explanation"][r])

###Output result
188    audiostock_45838.wav
Name: Data name, dtype: object 188    Busy but fun pop/rock
Name: One line explanation, dtype: object
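The nearest-neighbour search in the script relies on cosine distance, which scikit-learn defines as 1 minus the cosine similarity. The sketch below shows the underlying computation in plain Python on tiny made-up vectors; the song names and 3-dimensional vectors are hypothetical stand-ins for the 768-dimensional BERT features.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, candidates):
    """Return (name, similarity) of the candidate most similar to the query."""
    scored = ((name, cosine_similarity(query, vec)) for name, vec in candidates.items())
    return max(scored, key=lambda pair: pair[1])

# Made-up explanation vectors and a made-up conversation vector
candidates = {
    "song_a": [1.0, 0.0, 0.0],
    "song_b": [0.6, 0.8, 0.0],
    "song_c": [0.0, 0.0, 1.0],
}
query = [0.5, 0.5, 0.0]

name, similarity = nearest(query, candidates)
print(name, round(similarity, 3))  # song_b 0.99
```

Because cosine similarity ignores vector length, only the direction of the feature vectors matters for the match.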

5. Experiment

Now let's see what kind of songs are selected when typical conversational sentences are given as input. The candidate pool for the final selection was 300 songs that were not used for training or evaluation and whose one-line explanation does not include the words "rock", "pop", or "ballad". The overall flow is shown in the figure.

[Figure: image.png]

For the final selection, songs are played in descending order of prediction probability, taken from the file "output_result.csv" output by NNC (NNC lets you use different data for the evaluation during training and for the final evaluation). Let's select songs for several cases.
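The ranking step can be sketched as follows. The CSV content and the column names (`name`, `prob_rock`, `prob_pop`, `prob_ballad`) are invented for this example; the real "output_result.csv" written by NNC uses its own column names, which would need to be substituted.

```python
import csv
import io

# Made-up stand-in for output_result.csv; the column names are assumptions.
SAMPLE_CSV = """name,prob_rock,prob_pop,prob_ballad
song_a.wav,0.70,0.20,0.10
song_b.wav,0.10,0.85,0.05
song_c.wav,0.15,0.60,0.25
song_d.wav,0.05,0.05,0.90
"""

def top_songs(csv_text, label, n=5):
    """Return the n song names with the highest predicted probability for label."""
    column = f"prob_{label}"
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda row: float(row[column]), reverse=True)
    return [row["name"] for row in rows[:n]]

print(top_songs(SAMPLE_CSV, "pop", n=2))  # ['song_b.wav', 'song_c.wav']
```

Here `label` would be the class of the one-line explanation matched to the conversation, so the text side picks the class and the audio classifier picks the songs.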

**Case 1)**
**◆ Conversation:** Good morning. It's nice weather today. Yeah. It looks like it's sunny all day.
**◆ One-line explanation with high similarity:** Busy and fun pop / rock (audiostock_45838.wav) → Label "Rock"

Setting aside whether they are morning songs, the selected songs seem energetic, with descriptions such as "powerful" and "lively"! There also seems to be a tendency for metal-style songs with electric guitars to be selected.

**Case 2)**
**◆ Conversation:** We plan to camp in Yamanashi on the weekend. You can spend a quiet time by the lake for the first time in a while. It's getting cooler, so be careful.
**◆ One-line explanation with high similarity:** Pop that suddenly makes you miss your parents' love from distant days (audiostock_45997.wav) → Label "Pop"
**◆ Song selection result:**
・ audiostock_45254 Pure Japanese music with a ghostly story that freezes your spine
・ audiostock_44771 BGM of horror document touch
・ audiostock_46760 Travel information Nostalgic melancholy lonely twilight
・ audiostock_46657 Refreshing desired drive Light forward
・ audiostock_44331 Heartwarming music from the tropical Caribbean

The 1st and 2nd songs are obviously poor selections (ghost stories ...), but the 4th and 5th are pop BGM that suits a trip well. The explanation of the 3rd song has a sad atmosphere, but it had "pop" in its tags, and when I actually listened to it, it was not that dark. This suggests that the model does tend to select pop songs for this input.

**Case 3)**
**◆ Conversation:** Did you see that drama everyone says is so moving? It was a really sad story. I cried at the end.
**◆ One-line explanation with high similarity:** Warming ballads, teen feelings (audiostock_43810.wav) → Label "Ballad"
**◆ Song selection result:**
・ audiostock_46013 Fresh, mysterious and spacious environment
・ audiostock_44891 Relaxation ambient of night stars
・ audiostock_44575 A gentle ambient-style sound that expands the world of fairy tales
・ audiostock_45599 A mysterious environment with a cool morning atmosphere
・ audiostock_45452 A graceful classic with artistic elegance in the garden

I was able to extract quiet BGM such as mysterious, relaxed ballad-like songs and classical music.

For all three categories, the songs were selected automatically based only on the BGM audio features, yet the extracted songs were mostly as intended! The BGM that was not selected (low prediction probability) included "Variety program title BGM" (audiostock_43840), "Latin-flavored Euro-house style" (audiostock_42921), and "Multinational African mysterious travelogue fashionable" (audiostock_46146), which confirms that the model can also set apart unsuitable BGM.

6. Summary and consideration

I was able to build a working solution for the task of creating a player that automatically selects BGM according to the content of daily conversation. This time the model was built from a small dataset of only about 1600 songs in total, but by refining the annotation and increasing the amount of data, further improvement in accuracy can be expected, and it should also be possible to handle more than 3 classes. There also seems to be room for further study of how the BGM features are computed. The service idea assumed smart speakers, but it is not limited to them: it could also be applied to, for example, suggesting songs from the tags and text of SNS posts, or automatically selecting BGM from subtitle data in video editing.
