Let's implement a phase-sensitive mask (PSM) in Python: a filter (mask) that extracts the desired signal from the observed signal when the desired signal is known. This is the design of a voice mask in the STFT domain, which is speech enhancement at its most basic.
- $x$: observed signal
- $s$: desired signal
- $n$: other signals (noise)

We want to recover $s$ somehow; speech enhancement is the problem of estimating $s$ from $x$. The basic idea uses the following quantities:

- $X$: STFT of $x$
- $S$: STFT of $s$
- $\hat{S}$: estimate of the desired signal
- $G$: time-frequency mask (a matrix whose elements are real values with $0 \leq G \leq 1$)

The desired signal $\hat{S}$ is estimated by multiplying the observed signal $X$ element by element (the Hadamard product) with a time-frequency mask $G$ built by some means. There are various ways to build $G$: for example, a Wiener filter extracts the desired signal by passing the frequencies that the desired signal contains and attenuating everything else.
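In formulas, with $\odot$ denoting the Hadamard product, the model and the mask application are

$$
x = s + n, \qquad \hat{S} = G \odot X.
$$

The PSM computed in the code below is the magnitude ratio weighted by the cosine of the phase difference, clipped to $[0, 1]$ (the clipping corresponds to the two np.where lines):

$$
G_{\mathrm{PSM}} = \min\left(1,\ \max\left(0,\ \frac{|S|}{|X|}\cos\left(\angle S - \angle X\right)\right)\right)
$$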
PSM.py
#Calculation
import numpy as np
#Speech processing
import librosa
import librosa.display
#Data display
from IPython.display import display, Audio
# Original data
#Desired (clean) audio file; sr=None loads at the file's native sampling rate
voice, rate = librosa.load("voice.wav", sr=None)
#Observed audio file (the same voice with noise added), resampled to the same rate
noise, _ = librosa.load("noise.wav", sr=rate)
# STFT
#STFT parameters
n_fft = 1024*4               #Frame length (number of samples)
win_length = n_fft           #Window function length (number of samples)
hop_length = win_length // 4 #Hop length (number of samples)
sr = rate                    #Sampling rate [Hz]
f_voice = librosa.stft(voice, n_fft=n_fft, win_length=win_length, hop_length=hop_length)
f_noise = librosa.stft(noise, n_fft=n_fft, win_length=win_length, hop_length=hop_length)
#PSM
X, S = f_noise, f_voice #Observed signal, desired signal
A = (np.abs(S) / np.abs(X)) * np.cos(np.angle(S) - np.angle(X))
B = np.where(A < 0, 0, A) #Clip negative values to 0
G = np.where(B > 1, 1, B) #Clip values above 1, so the mask satisfies 0 <= G <= 1
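#Note: the two np.where lines above are equivalent to a single G = np.clip(A, 0, 1)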
#Spectrogram display
# librosa.display.specshow(G, y_axis='log', x_axis='time', sr=sr, hop_length=hop_length)
#Mask application
_data = X * G #Hadamard product
data = librosa.istft(_data, win_length=win_length, hop_length=hop_length)
#Play the enhanced audio
# display(Audio(data, rate=rate))
Audio files cannot be posted on Qiita, so please try it yourself; I will publish everything together elsewhere.
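If you want to keep the result for listening outside the notebook, you can write it to a WAV file. A minimal sketch using the soundfile package (the package choice and the output filename are my assumptions, not part of the original):

import soundfile as sf
#Write the enhanced signal to disk (hypothetical output filename)
sf.write("psm_enhanced.wav", data, rate)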
In the Wiener filter, $\sigma_s^2$ and $\sigma_d^2$ represent the variances of the spectra of the desired signal and the noise signal, respectively (presumably the spectral variance per frequency bin). In other words, you can select a segment where no one is talking as the noise and use it as $D$; the ratio of these two variances is the Wiener filter.
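Written out, the per-frequency gain generated in win.py below has the form

$$
G_{\mathrm{W}} = \frac{\sigma_s^2}{\sigma_s^2 + \sigma_d^2}.
$$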
win.py
#Signal processing (imports needed by this snippet)
import numpy as np
from scipy.signal import stft, istft

s = voice #Desired signal
d = noise #Observed signal
axis = 1  #Variance is taken along the time-frame axis

#STFT parameters (values assumed; the original does not show them)
nperseg = 1024
noverlap = nperseg // 2

def time2tap(t, rate):
    """Convert a time in seconds to a sample index (helper not shown in the original)."""
    return int(t * rate)

#Observed signal
_, _, X = stft(d, fs=rate, nperseg=nperseg, noverlap=noverlap)
#Variance of the desired signal
_, _, S = stft(s, fs=rate, nperseg=nperseg, noverlap=noverlap)
sigma_s = np.square(np.var(S, axis=axis))
#Variance of the noise (a segment where no one is talking, here 8-9 s)
noise_start = time2tap(8, rate)
noise_end = time2tap(9, rate)
_noise = d[noise_start:noise_end]
_, _, D = stft(_noise, fs=rate, nperseg=nperseg, noverlap=noverlap)
sigma_d = np.square(np.var(D, axis=axis))
#Filter generation
W = sigma_s / (sigma_s + sigma_d) #With just this, real-time processing is possible
G = np.tile(W, (X.shape[1], 1)).T #Repeat the per-frequency gain over all time frames
#Filter application
ret = G * X
ret = ret / np.amax(np.abs(ret)) #Normalize the amplitude
_, test = istft(ret, fs=rate, nperseg=nperseg, noverlap=noverlap)
display(Audio(test, rate=rate))
display(Audio(d, rate=rate))
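As an aside on the comment above that real-time processing is possible with just $W$: because $W$ is a fixed per-frequency gain, each incoming frame can be masked on its own, without seeing the whole signal. A minimal sketch of that idea (an illustration, not the original author's code; it reuses d, nperseg, noverlap, and W from win.py, and glosses over overlap-add normalization):

import numpy as np
from scipy.signal.windows import hann

hop = nperseg - noverlap
window = hann(nperseg, sym=False)
out = np.zeros(len(d))
for start in range(0, len(d) - nperseg + 1, hop):
    frame = d[start:start + nperseg] * window       #Window one incoming frame
    spec = np.fft.rfft(frame)                       #nperseg//2 + 1 bins, same length as W
    enhanced = np.fft.irfft(spec * W, n=nperseg)    #Apply the fixed gain, back to time domain
    out[start:start + nperseg] += enhanced * window #Overlap-add (scaling omitted)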