[PYTHON] Extract music features with Deep Learning and predict tags

Motivation

Conventional music recommendation often relies on collaborative filtering, but collaborative filtering has the disadvantage that it cannot handle works with few or no user ratings, such as obscure songs and new releases.

Another approach to music recommendation, extracting features from the music itself and using them for recommendation, seems to avoid this problem.

So I tried to study this using the paper 'End-to-end learning for music audio' [^1], but I couldn't find its source code, so I implemented the tag prediction task myself. (Please note that I changed some parts, so this is not a complete reproduction.)

I couldn't find many articles about processing music with Python and Deep Learning, so I hope it helps.

Obtaining a dataset

Use the MagnaTagATune Dataset [^2]. It contains 25,863 clips of 29 seconds each, annotated with 188 tags.

#Obtaining MP3 data
$ wget http://mi.soi.city.ac.uk/datasets/magnatagatune/mp3.zip.001
$ wget http://mi.soi.city.ac.uk/datasets/magnatagatune/mp3.zip.002
$ wget http://mi.soi.city.ac.uk/datasets/magnatagatune/mp3.zip.003

#Concatenate the split zip files and unzip the result
$ cat mp3.zip.* > music.zip
$ unzip music.zip

#Obtaining tag data
$ wget http://mi.soi.city.ac.uk/datasets/magnatagatune/annotations_final.csv

Make MP3s workable with Numpy

Install pydub

The audio features normally used in speech recognition and MIR (music information retrieval) are mel-frequency cepstra, obtained by applying feature extraction to the RAW data. In this paper, however, the RAW data is used directly as the audio feature. As with images, there is something appealing about feeding raw data into Deep Learning and letting it extract the features automatically.

I used a package called pydub to convert MP3 to RAW. You also need libav or ffmpeg (which handle audio encoding and decoding). For more information, see the official GitHub page.

$ pip install pydub

#For mac
$ brew install libav --with-libvorbis --with-sdl --with-theora

#For linux
$ apt-get install libav-tools libavcodec-extra-53

Also, the official method did not work in my Ubuntu environment, so I referred to this article.

File import and conversion to ndarray

Let's define the following function, which takes the path of an mp3 file as an argument and returns an ndarray.

import numpy as np
from pydub import AudioSegment

def mp3_to_array(file):

    # Decode the MP3 into RAW (PCM) audio
    song = AudioSegment.from_mp3(file)

    # RAW samples as a bytes object
    song_data = song.raw_data

    # Interpret the bytes as 16-bit integers (assumes 16-bit samples)
    song_arr = np.frombuffer(song_data, dtype=np.int16)

    return song_arr
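
The byte-to-array step can be tried without an MP3 file. The following is a minimal sketch, assuming 16-bit little-endian PCM with interleaved channels; `pcm_bytes_to_mono` is a hypothetical helper, not part of pydub or of this article's pipeline. Unlike the function above, it also averages stereo down to mono and scales the values to [-1, 1].

```python
import numpy as np

def pcm_bytes_to_mono(raw, channels=1):
    """Turn interleaved 16-bit PCM bytes into a mono float32 array in [-1, 1]."""
    samples = np.frombuffer(raw, dtype="<i2")
    # Interleaved layout: one row per frame, one column per channel
    frames = samples.reshape(-1, channels)
    # Average the channels, then scale by the int16 magnitude limit
    mono = frames.astype(np.float32).mean(axis=1)
    return mono / 32768.0

# Two made-up stereo frames: (L=16384, R=0) and (L=-32768, R=-32768)
raw = np.array([16384, 0, -32768, -32768], dtype="<i2").tobytes()
mono = pcm_bytes_to_mono(raw, channels=2)
print(mono)  # the two frames become 0.25 and -1.0
```

Whether to normalize is a design choice; the article feeds the int16 values to the network unchanged.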

Data set preparation

Preparation of music tag (y)

Let's read the tag data downloaded earlier. Note the following two points.

- Tags are limited to the 50 most frequently used.
- Samples are limited to 3,000, because more would not fit in memory.

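import pandas as pd

# The annotation file is tab-separated
tags_df = pd.read_csv('annotations_final.csv', sep='\t')
# Shuffle the rows, then keep the first 3000
tags_df = tags_df.sample(frac=1)
tags_df = tags_df[:3000]

# Columns 1..188 are the tag columns; keep the 50 with the most positive labels
top50_tags = tags_df.iloc[:, 1:189].sum().sort_values(ascending=False).index[:50].tolist()
y = tags_df[top50_tags].values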
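
To see what the tag selection does, here is a small sketch on a made-up table. The column names and values are invented for illustration; only the layout (an id column, 0/1 tag columns, then a path column) is meant to mirror the annotation file.

```python
import pandas as pd

# Toy annotation table (made-up rows): one 0/1 column per tag
toy = pd.DataFrame({
    "clip_id":  [1, 2, 3, 4],
    "guitar":   [1, 1, 1, 0],
    "piano":    [0, 1, 0, 0],
    "vocal":    [1, 0, 1, 0],
    "flute":    [0, 0, 0, 0],
    "mp3_path": ["a.mp3", "b.mp3", "c.mp3", "d.mp3"],
})

# Tag columns sit between the id column and the path column
tag_counts = toy.iloc[:, 1:-1].sum().sort_values(ascending=False)
top2_tags = tag_counts.index[:2].tolist()

# Multi-hot label matrix: one row per clip, one column per kept tag
y_toy = toy[top2_tags].values
print(top2_tags)    # ['guitar', 'vocal']
print(y_toy.shape)  # (4, 2)
```

A clip can carry several tags at once or none at all, so each row of the label matrix is a multi-hot vector rather than a single class.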

Preparation of RAW data (X)

- Use tags_df, because it contains the path to each mp3 file.
- X is reshaped to [samples (number of songs), features, channels (1 this time)].
- Since the RAW data is 16 kHz, there are 16,000 features per second and 465,984 features for roughly 30 seconds.
- In the original paper the audio was split into 3-second clips for training, but since that is a hassle, I feed in the full 30 seconds as is.

files = tags_df.mp3_path.values
X = np.array([ mp3_to_array(file) for file in files ])
X = X.reshape(X.shape[0], X.shape[1], 1)
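
`np.array([...])` only yields a regular 2-D array when every clip decodes to exactly the same number of samples. Below is a minimal sketch of one way to guard against that, together with the 3-second split mentioned for the original paper. It assumes 16 kHz mono audio; `fix_length` and `split_clips` are hypothetical helpers and were not used in this article's experiment.

```python
import numpy as np

SAMPLE_RATE = 16000   # samples per second, as stated above
TARGET_LEN = 465984   # samples per clip used in this article

def fix_length(arr, length=TARGET_LEN):
    """Truncate or zero-pad a 1-D array to exactly `length` samples."""
    if len(arr) >= length:
        return arr[:length]
    return np.pad(arr, (0, length - len(arr)), mode="constant")

def split_clips(arr, seconds=3, rate=SAMPLE_RATE):
    """Cut a 1-D array into non-overlapping windows, dropping the remainder."""
    win = seconds * rate
    n = len(arr) // win
    return arr[: n * win].reshape(n, win)

clip = fix_length(np.ones(470000, dtype=np.int16))   # too long: truncated
short = fix_length(np.ones(100, dtype=np.int16))     # too short: zero-padded
windows = split_clips(clip)
print(clip.shape, short.shape, windows.shape)  # (465984,) (465984,) (9, 48000)
```

With 3-second windows, each clip gives 9 training examples of 48,000 samples, and every window would inherit the tags of its parent clip.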

Preparation of training data and test data

from sklearn.model_selection import train_test_split
random_state = 42

train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=random_state)

Learning & test (7/9 revision)

Model building

I used Keras. Unlike the original paper, the input x is as long as 465,984 samples, so I stack the layers a little deeper.


import keras
from keras.models import Model
from keras.layers import Dense, Flatten, Input
from keras.layers import Conv1D, MaxPooling1D

features = train_X.shape[1]

x_inputs = Input(shape=(features, 1), name='x_inputs') # (number of features, number of channels)
x = Conv1D(128, 256, strides=256,
           padding='valid', activation='relu') (x_inputs)
x = Conv1D(32, 8, activation='relu') (x) # (number of filters, kernel size)
x = MaxPooling1D(4) (x) # (pool size)
x = Conv1D(32, 8, activation='relu') (x)
x = MaxPooling1D(4) (x)
x = Conv1D(32, 8, activation='relu') (x)
x = MaxPooling1D(4) (x)
x = Conv1D(32, 8, activation='relu') (x)
x = MaxPooling1D(4) (x)
x = Flatten() (x)
x = Dense(100, activation='relu') (x) # (number of units)
x_outputs = Dense(50, activation='sigmoid', name='x_outputs') (x)

model = Model(inputs=x_inputs, outputs=x_outputs)
# Each tag is an independent yes/no output (sigmoid), so use binary cross-entropy
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(train_X, train_y, batch_size=600, epochs=50)
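
To check how far the stack above shrinks the 465,984-sample input, the output lengths can be worked out by hand. This is a small sketch assuming 'valid' padding throughout and pooling with a stride equal to the pool size (the Keras defaults for these layers); `conv1d_len` and `pool1d_len` are hypothetical helper names.

```python
def conv1d_len(n, kernel, stride=1):
    """Output length of a 'valid' 1-D convolution."""
    return (n - kernel) // stride + 1

def pool1d_len(n, pool):
    """Output length of max pooling with stride equal to the pool size."""
    return (n - pool) // pool + 1

n = 465984                     # input samples per clip
n = conv1d_len(n, 256, 256)    # strided first layer
lengths = [n]
for _ in range(4):             # four Conv1D(32, 8) + MaxPooling1D(4) pairs
    n = pool1d_len(conv1d_len(n, 8), 4)
    lengths.append(n)

flat = n * 32                  # Flatten: time steps x 32 filters
print(lengths, flat)           # [1820, 453, 111, 26, 4] 128
```

Under these assumptions the strided first layer alone reduces the sequence to 1,820 steps, and only 128 values reach the Dense layers.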

Calculation graph visualization

'''Output to png'''
from keras.utils import plot_model
plot_model(model, to_file="music_only.png", show_shapes=True)


'''Visualize interactively'''
from IPython.display import SVG
from keras.utils import model_to_dot
SVG(model_to_dot(model, show_shapes=True).create(prog='dot', format='svg'))

[Figure: music_only.png, the model graph with layer shapes]

Test

In the original paper the AUC was about 0.87, but this experiment reached only 0.66. With less than 1/5 of the sample size, a lower score is to be expected, but feeding in the raw audio as is (and a full 30 seconds at that) was still able to predict tags to some extent.

from sklearn.metrics import roc_auc_score
pred_y = model.predict(test_X, batch_size=50)
print(roc_auc_score(test_y, pred_y)) # => 0.668582599155
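
For multi-label targets, `roc_auc_score` by default computes one AUC per tag and takes the unweighted mean (macro average). The sketch below re-implements that idea in plain NumPy, purely for illustration and with made-up data; `binary_auc` and `macro_auc` are hypothetical helpers. One deliberate difference: tags with only one class present are skipped here, whereas scikit-learn raises an error in that case.

```python
import numpy as np

def binary_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    pos, neg = y_score[y_true], y_score[~y_true]
    if len(pos) == 0 or len(neg) == 0:
        return float("nan")   # AUC is undefined for a single-class tag
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def macro_auc(Y_true, Y_score):
    """Unweighted mean of the per-tag AUCs, ignoring undefined tags."""
    aucs = [binary_auc(Y_true[:, j], Y_score[:, j])
            for j in range(Y_true.shape[1])]
    return float(np.nanmean(aucs))

# Made-up labels and scores: 4 clips, 2 tags
Y_true = np.array([[0, 1], [0, 0], [1, 1], [1, 0]])
Y_score = np.array([[0.1, 0.9], [0.4, 0.2], [0.35, 0.6], [0.8, 0.3]])
print(macro_auc(Y_true, Y_score))  # 0.875 (per-tag AUCs: 0.75 and 1.0)
```

Looking at the per-tag values rather than only the mean can show which tags the model handles well and which it does not.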

Summary

- I was able to convert audio files to ndarray.
- I was able to predict tags by feeding in RAW data without feature extraction.
- My research environment will be ready soon, so I would like to try again with more samples.
