Let's implement a phase-sensitive mask (PSM) in Python: a filter (mask) that extracts the desired signal from the observed signal when the desired signal is known. This is the design of a voice mask in the STFT domain, which is speech enhancement at its most basic.
- $x$: observed signal
- $s$: desired signal
- $n$: other signals (noise)

We want to recover $s$ somehow; speech enhancement is the problem of estimating $s$ from $x$. The basic idea uses the following quantities:

- $X$: STFT of $x$
- $S$: STFT of $s$
- $\hat{S}$: estimate of the desired signal
- $G$: time-frequency mask (a matrix whose elements are real values with $0 \leq G \leq 1$)

The desired signal $\hat{S}$ is estimated by multiplying the observed signal $X$ element by element (the Hadamard product) with a time-frequency mask $G$ built by some means. There are various ways to build $G$: for example, a Wiener filter extracts the desired signal by passing the frequencies that the desired signal contains and attenuating everything else.
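In formulas, with $\odot$ denoting the Hadamard product, the model and the mask application are

$$
x = s + n, \qquad \hat{S} = G \odot X.
$$

The PSM computed in the code below is the magnitude ratio weighted by the cosine of the phase difference, clipped to $[0, 1]$ (the clipping corresponds to the two np.where lines):

$$
G_{\mathrm{PSM}} = \min\left(1,\ \max\left(0,\ \frac{|S|}{|X|}\cos\left(\angle S - \angle X\right)\right)\right)
$$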
PSM.py
#Calculation
import numpy as np
#Speech processing
import librosa
import librosa.display
#Data display
from IPython.display import display, Audio
# Original data
#Desired (clean) audio file; sr=None loads at the file's native sampling rate
voice, rate = librosa.load("voice.wav", sr=None)
#Observed audio file (the same voice with noise added), resampled to the same rate
noise, _ = librosa.load("noise.wav", sr=rate)
# STFT
#STFT parameters
n_fft = 1024*4               #Frame length (number of samples)
win_length = n_fft           #Window function length (number of samples)
hop_length = win_length // 4 #Hop length (number of samples)
sr = rate                    #Sampling rate [Hz]
f_voice = librosa.stft(voice, n_fft=n_fft, win_length=win_length, hop_length=hop_length)
f_noise = librosa.stft(noise, n_fft=n_fft, win_length=win_length, hop_length=hop_length)
#PSM
X, S = f_noise, f_voice #Observed signal, desired signal
A = (np.abs(S) / np.abs(X)) * np.cos(np.angle(S) - np.angle(X))
B = np.where(A < 0, 0, A) #Clip negative values to 0
G = np.where(B > 1, 1, B) #Clip values above 1, so the mask satisfies 0 <= G <= 1
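#Note: the two np.where lines above are equivalent to a single G = np.clip(A, 0, 1)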
#Spectrogram display
# librosa.display.specshow(G, y_axis='log', x_axis='time', sr=sr, hop_length=hop_length)
#Mask application
_data = X * G #Hadamard product
data = librosa.istft(_data, win_length=win_length, hop_length=hop_length)
#Play the enhanced audio
# display(Audio(data, rate=rate))
Audio files cannot be posted on Qiita, so please try it yourself; I will publish everything together elsewhere.
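If you want to keep the result for listening outside the notebook, you can write it to a WAV file. A minimal sketch using the soundfile package (the package choice and the output filename are my assumptions, not part of the original):

import soundfile as sf
#Write the enhanced signal to disk (hypothetical output filename)
sf.write("psm_enhanced.wav", data, rate)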
In the Wiener filter, $\sigma_s^2$ and $\sigma_d^2$ represent the variances of the spectra of the desired signal and the noise signal, respectively (presumably the spectral variance per frequency bin). In other words, you can select a segment where no one is talking as the noise and use it as $D$; the ratio of these two variances is the Wiener filter.
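Written out, the per-frequency gain generated in win.py below has the form

$$
G_{\mathrm{W}} = \frac{\sigma_s^2}{\sigma_s^2 + \sigma_d^2}.
$$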
win.py
#Signal processing (imports needed by this snippet)
import numpy as np
from scipy.signal import stft, istft

s = voice #Desired signal
d = noise #Observed signal
axis = 1  #Variance is taken along the time-frame axis

#STFT parameters (values assumed; the original does not show them)
nperseg = 1024
noverlap = nperseg // 2

def time2tap(t, rate):
    """Convert a time in seconds to a sample index (helper not shown in the original)."""
    return int(t * rate)

#Observed signal
_, _, X = stft(d, fs=rate, nperseg=nperseg, noverlap=noverlap)
#Variance of the desired signal
_, _, S = stft(s, fs=rate, nperseg=nperseg, noverlap=noverlap)
sigma_s = np.square(np.var(S, axis=axis))
#Variance of the noise (a segment where no one is talking, here 8-9 s)
noise_start = time2tap(8, rate)
noise_end = time2tap(9, rate)
_noise = d[noise_start:noise_end]
_, _, D = stft(_noise, fs=rate, nperseg=nperseg, noverlap=noverlap)
sigma_d = np.square(np.var(D, axis=axis))
#Filter generation
W = sigma_s / (sigma_s + sigma_d) #With just this, real-time processing is possible
G = np.tile(W, (X.shape[1], 1)).T #Repeat the per-frequency gain over all time frames
#Filter application
ret = G * X
ret = ret / np.amax(np.abs(ret)) #Normalize the amplitude
_, test = istft(ret, fs=rate, nperseg=nperseg, noverlap=noverlap)
display(Audio(test, rate=rate))
display(Audio(d, rate=rate))
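As an aside on the comment above that real-time processing is possible with just $W$: because $W$ is a fixed per-frequency gain, each incoming frame can be masked on its own, without seeing the whole signal. A minimal sketch of that idea (an illustration, not the original author's code; it reuses d, nperseg, noverlap, and W from win.py, and glosses over overlap-add normalization):

import numpy as np
from scipy.signal.windows import hann

hop = nperseg - noverlap
window = hann(nperseg, sym=False)
out = np.zeros(len(d))
for start in range(0, len(d) - nperseg + 1, hop):
    frame = d[start:start + nperseg] * window       #Window one incoming frame
    spec = np.fft.rfft(frame)                       #nperseg//2 + 1 bins, same length as W
    enhanced = np.fft.irfft(spec * W, n=nperseg)    #Apply the fixed gain, back to time domain
    out[start:start + nperseg] += enhanced * window #Overlap-add (scaling omitted)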