The first article is here. This time, I will describe how to separate the vocals from a music file with **spleeter** and estimate the pitch of the separated vocals with **pyworld**.
First, prepare the music you want to separate. This time I borrowed "Shining Star" by Maoudamashii, renamed it to test.mp3, and placed it in ./music.
from spleeter.separator import Separator
def main():
    # Separate into two stems: vocals and accompaniment
    sp = Separator("spleeter:2stems")
    sp.separate_to_file("music/test.mp3", "music/spleeter")

if __name__ == '__main__':
    main()
This will create a test folder in ./music/spleeter containing vocals.wav and accompaniment.wav.

music/
├ test.mp3
└ spleeter/
  └ test/
    ├ vocals.wav
    └ accompaniment.wav

If it looks like this, it is a success.
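As a quick check (this snippet is my own addition, not part of the original article), you can confirm that the separated files are where the rest of the code expects them:

from pathlib import Path

out_dir = Path("music/spleeter/test")
for name in ("vocals.wav", "accompaniment.wav"):
    path = out_dir / name
    print(path, "exists:", path.exists())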
The official pyworld GitHub repository includes an example of f0 (pitch) estimation, so we will build on that. First, let's look at the waveform, the f0 estimate, and the spectrogram.
import pyworld as pw
import matplotlib.pyplot as plt
import numpy as np
from scipy import signal
from pydub import AudioSegment
start = 30  # Extraction range in seconds: the first part of the vocal track is silent, so skip it
end = 35
sound = AudioSegment.from_wav("music/spleeter/test/vocals.wav")
fs = sound.frame_rate
# The audio is stereo, so take every other sample to extract one channel
data = np.array(sound.get_array_of_samples())[start*fs*2:end*fs*2:2]
data = data.astype(np.float64)  # pyworld expects float64 (np.float is deprecated)
# Plot the audio waveform
t = np.arange(data.shape[0]) / fs + start  # time axis in seconds
plt.subplot(3,1,1)
plt.plot(t,data)
plt.xlim([t[0],t[-1]])
#Display of f0 estimate
plt.subplot(3,1,2)
_f0, _time = pw.dio(data, fs)  # extract the fundamental frequency
f0 = pw.stonemask(data, _f0, _time, fs)  # refine the f0 estimate
t = np.arange(f0.shape[0]) * (data.shape[0] / fs / f0.shape[0]) + start
plt.plot(t,f0)
plt.xlim([t[0],t[-1]])
#Spectrogram display
plt.subplot(3,1,3)
f, t, Zxx = signal.stft(data, fs, nperseg=1024)
plt.pcolormesh(t+start, f, np.abs(Zxx))
plt.ylim([f[1], f[-1]])
plt.ylabel('Frequency [Hz]')
plt.xlabel('Time [sec]')
plt.yscale('log')
plt.show()
When you run this, an image like this is created. The estimation seems to work, but there are places where no pitch is estimated even though there is clearly some audio. Let's specify a parameter to improve this.
Modify the frequency extraction part as follows.
_f0, _time = pw.dio(data, fs, allowed_range=1)
If you fix it like this, the result looks like this: there are some outliers, but the unestimated gaps from before are gone.
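As a quick sanity check (my own addition, not part of the original article; it reuses data, fs, and the imports from the script above), you can count how many frames remain unvoiced (f0 == 0) with and without the parameter:

# Compare the number of unvoiced frames with the default settings and with allowed_range=1
_f0_def, _t_def = pw.dio(data, fs)
f0_def = pw.stonemask(data, _f0_def, _t_def, fs)

_f0_ar, _t_ar = pw.dio(data, fs, allowed_range=1)
f0_ar = pw.stonemask(data, _f0_ar, _t_ar, fs)

print("unvoiced frames (default):        ", int(np.sum(f0_def == 0)))
print("unvoiced frames (allowed_range=1):", int(np.sum(f0_ar == 0)))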
Now, let's smooth this curve by removing the outliers, using a weighted convolution to detect and correct them. This site has a detailed explanation of convolution; the only difference here is the weight sequence we convolve with. It is a triangular window of radius r whose weights decay linearly with distance from the center, and whose center weight is set to 0 while we are looking for outliers. With these weights, an outlier shows up as a large difference between the convolved estimate and the original value, and because the weights decay with distance, nearby samples are emphasized.
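For concreteness, here is a small sketch (my own addition) of what that weight sequence looks like for r = 3, built the same way as inside the avg function below:

import numpy as np

r = 3
weight = np.arange(2 * r + 3)
weight = 1 - np.abs(1 - weight / (r + 1))[1:-1]
print(weight)   # roughly [0.25 0.5 0.75 1.0 0.75 0.5 0.25]
weight[r] = 0   # for outlier detection, a sample's own value is excluded
print(weight)   # roughly [0.25 0.5 0.75 0.0 0.75 0.5 0.25]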
The code is below.
# Function that performs the weighted convolution.
# When useCenter=False, the center weight is set to 0,
# i.e. a sample's own value is not used in its own estimate (used for outlier detection).
def avg(f0, r, useCenter=False):
    '''
    :param f0: the pitch sequence to convolve
    :param r: radius of the window of neighbouring values to use
    :param useCenter: whether to use each sample's own value (False for outlier detection)
    :return: the sequence after convolution
    '''
    # Triangular weights of length 2*r+1 that decay linearly with distance from the center
    weight = np.arange(2*r+3)
    weight = 1 - np.abs(1 - weight/(r+1))[1:-1]
    if not useCenter:
        weight[r] = 0
    # Weighted sum of f0, divided by the sum of weights over voiced (f0 > 0) frames only
    arr = np.convolve(f0, weight, "same")
    arr_num = np.convolve(f0 > 0, weight, "same")
    arr_num[arr_num == 0] = 1  # avoid division by zero where no voiced frames are in range
    arr_avg = arr / arr_num
    return arr_avg
If you use this, outlier values like these are estimated from their surroundings, so the curve becomes fairly smooth.
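As a minimal usage sketch (the same steps appear in the full code below, using the avg function and the f0 array from above): first estimate each frame from its neighbours only, treat large deviations as outliers, zero them out, and then re-estimate with the original values included.

r = 20
f0_est = avg(f0, r)                  # estimate each frame from its neighbours only
outlier = np.abs(f0 - f0_est) > 50   # deviations of more than 50 Hz are treated as outliers
f0[outlier] = 0                      # drop them (0 means unvoiced)
f0 = avg(f0, r, useCenter=True)      # smooth again, now also using the remaining original values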
Below is the full code.
import pyworld as pw
import matplotlib.pyplot as plt
import numpy as np
from scipy import signal
from pydub import AudioSegment
start=30
end=35
r=20
sound = AudioSegment.from_wav("music/spleeter/test/vocals.wav")
fs = sound.frame_rate
# The audio is stereo, so take every other sample to extract one channel
data = np.array(sound.get_array_of_samples())[start*fs*2:end*fs*2:2]
data = data.astype(np.float64)  # pyworld expects float64
_f0, _time = pw.dio(data, fs, allowed_range=1)  # extract the fundamental frequency
f0 = pw.stonemask(data, _f0, _time, fs)  # refine the f0 estimate
# Function that performs the weighted convolution.
# When useCenter=False, the center weight is set to 0,
# i.e. a sample's own value is not used in its own estimate (used for outlier detection).
def avg(f0, r, useCenter=False):
    '''
    :param f0: the pitch sequence to convolve
    :param r: radius of the window of neighbouring values to use
    :param useCenter: whether to use each sample's own value (False for outlier detection)
    :return: the sequence after convolution
    '''
    # Triangular weights of length 2*r+1 that decay linearly with distance from the center
    weight = np.arange(2*r+3)
    weight = 1 - np.abs(1 - weight/(r+1))[1:-1]
    if not useCenter:
        weight[r] = 0
    # Weighted sum of f0, divided by the sum of weights over voiced (f0 > 0) frames only
    arr = np.convolve(f0, weight, "same")
    arr_num = np.convolve(f0 > 0, weight, "same")
    arr_num[arr_num == 0] = 1  # avoid division by zero where no voiced frames are in range
    arr_avg = arr / arr_num
    return arr_avg
t = np.arange(data.shape[0]) / fs + start  # time axis in seconds (offset so all subplots share the same range)
plt.subplot(4,1,1)
plt.plot(t,data)
plt.xlim([t[0],t[-1]])
plt.subplot(4,1,2)
t = np.arange(f0.shape[0]) * (data.shape[0] / fs / f0.shape[0]) + start
plt.plot(t,f0)
plt.yscale('log')
plt.ylim([200, 500])
plt.xlim([t[0],t[-1]])
plt.subplot(4,1,3)
f0_ = avg(f0, r)  # estimate each frame from its neighbours only (center excluded)
br = np.abs(f0 - f0_) > 50  # if the difference from the estimate is 50 Hz or more, treat the value as an outlier
f0[br] = 0
f0 = avg(f0, r, useCenter=True)  # re-estimate, now also using the remaining original values
t=np.arange(f0.shape[0])*(data.shape[0]/fs/f0.shape[0])+start
plt.plot(t,f0)
plt.yscale('log')
plt.ylim([200, 500])
plt.xlim([t[0],t[-1]])
plt.subplot(4,1,4)
f, t, Zxx = signal.stft(data, fs, nperseg=1024)
plt.pcolormesh(t+start, f, np.abs(Zxx))
plt.ylim([200, 500])
plt.ylabel('Frequency [Hz]')
plt.xlabel('Time [sec]')
plt.yscale('log')
plt.show()
Next time, I would like to work on determining the breaks between notes.