[Python] [Introduction to PyTorch] Preprocessing with audio I/O and torchaudio (> <;)

As the title suggests, I ran the code from the reference pages below. I can't claim it is particularly instructive, but it worked, so I'll summarize it here.

【Reference】
① AUDIO I/O AND PRE-PROCESSING WITH TORCHAUDIO
② TORCHAUDIO.TRANSFORMS
③ SOURCE CODE FOR TORCHAUDIO.TRANSFORMS

The pitch of this section is as follows: "Significant effort in solving machine learning problems goes into data preparation. torchaudio leverages PyTorch's GPU support and provides many tools to make data loading easy and more readable. This tutorial shows how to load and preprocess data from a simple dataset. For more information, see Audio I/O and Pre-Processing with torchaudio (https://pytorch.org/tutorials/beginner/audio_preprocessing_tutorial.html). For this tutorial, please make sure the matplotlib package is installed for easier visualization."

What I did

・Preparation
・Opening a file
・Transformations
・Functional
・Migrating to torchaudio from Kaldi
・Available Datasets

・ Preparation

# Uncomment the following line to run in Google Colab
# !pip install torchaudio
import torch
import torchaudio
import requests
import matplotlib.pyplot as plt

When you run it for the first time, you need to install the following.

pip install torchaudio

Also, when I first tried to read a file, it threw an error and would not load: cannot import torchaudio 'No audio backend is available.' So, following the link above, on Windows:

pip install PySoundFile

On Linux

pip install sox
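Depending on your torchaudio version, you may also need to select the backend explicitly. A minimal sketch, assuming the 0.x-era torchaudio.set_audio_backend API (newer versions dispatch automatically):

import torchaudio

# Hedged sketch: pick the backend by hand if loading still fails.
# "soundfile" pairs with PySoundFile on Windows; "sox_io" with sox on Linux.
torchaudio.set_audio_backend("soundfile")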

・ Opening a file

"Torchaudio also supports loading wav and mp3 format sound files. Waveforms are called raw audio signals." The following code reads the wav file that exists in the url with r = requests.get (url) into r and then stores it locally as'steam-train-whistle-daniel_simon-converted-from-mp3.wav' I will.

url = "https://pytorch.org/tutorials/_static/img/steam-train-whistle-daniel_simon-converted-from-mp3.wav"
r = requests.get(url)

with open('steam-train-whistle-daniel_simon-converted-from-mp3.wav', 'wb') as f:
    f.write(r.content)

filename = "steam-train-whistle-daniel_simon-converted-from-mp3.wav"
waveform, sample_rate = torchaudio.load(filename)

print("Shape of waveform: {}".format(waveform.size()))
print("Sample rate of waveform: {}".format(sample_rate))

plt.figure()
plt.plot(waveform.t().numpy())

And it is drawn with plt.plot(waveform.t().numpy()). The result is the figure below. (fig_1__.png)

At first I wondered why two curves overlapped, but the standard output shows:

Shape of waveform: torch.Size([2, 276858])
Sample rate of waveform: 44100

The first dimension of torch.Size is 2, i.e. there are two series of 276858 samples each. In other words, this is 2-channel (stereo) data. If you split the plot per channel, it is drawn as below. (fig_2_.png)
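For reference, splitting the stereo channels into separate subplots can be done like this (my own sketch, not tutorial code):

# Plot each channel of the stereo waveform in its own subplot
fig, axes = plt.subplots(2, 1, sharex=True)
for ch in range(waveform.size(0)):  # waveform is torch.Size([2, 276858])
    axes[ch].plot(waveform[ch].numpy())
    axes[ch].set_ylabel("channel {}".format(ch))
plt.show()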

・ Transformations

"Torchaudio is still growing, but it supports conversions like the ones listed below."

Resample: Resample waveform to a different sample rate.
Spectrogram: Create a spectrogram from a waveform.
GriffinLim: Compute waveform from a linear scale magnitude spectrogram using the Griffin-Lim transformation.
ComputeDeltas: Compute delta coefficients of a tensor, usually a spectrogram.
ComplexNorm: Compute the norm of a complex tensor.
MelScale: This turns a normal STFT into a Mel-frequency STFT, using a conversion matrix.
AmplitudeToDB: This turns a spectrogram from the power/amplitude scale to the decibel scale.
MFCC: Create the Mel-frequency cepstrum coefficients from a waveform.
MelSpectrogram: Create MEL Spectrograms from a waveform using the STFT function in PyTorch.
MuLawEncoding: Encode waveform based on mu-law companding.
MuLawDecoding: Decode mu-law encoded waveform.
TimeStretch: Stretch a spectrogram in time without modifying pitch for a given rate.
FrequencyMasking: Apply masking to a spectrogram in the frequency domain.
TimeMasking: Apply masking to a spectrogram in the time domain.

"Each transform supports batching: you can perform a transform on a single raw audio signal or spectrogram, or on many of the same shape. All transforms are nn.Modules or jit.ScriptModules, so they can be used as part of a neural network at any point."
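To illustrate that last point, transforms compose like any other layers. A minimal sketch (my own example, not from the tutorial), chaining MelSpectrogram and AmplitudeToDB with nn.Sequential:

import torch.nn as nn

# Transforms are nn.Modules, so they can be stacked into a feature pipeline
pipeline = nn.Sequential(
    torchaudio.transforms.MelSpectrogram(sample_rate=44100),
    torchaudio.transforms.AmplitudeToDB(),
)
features = pipeline(waveform)  # shape: [channels, n_mels, time]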

View the spectrogram on a log scale.

specgram = torchaudio.transforms.Spectrogram()(waveform)
print("Shape of spectrogram: {}".format(specgram.size()))
plt.figure()
plt.imshow(specgram.log2()[0,:,:].numpy(), cmap='gray')
The signature of torchaudio.transforms.Spectrogram is as follows.

torchaudio.transforms.Spectrogram(n_fft: int = 400, win_length: Optional[int] = None, hop_length: Optional[int] = None, pad: int = 0, window_fn: Callable[[...], torch.Tensor] = <built-in method hann_window of type object>, power: Optional[float] = 2.0, normalized: bool = False, wkwargs: Optional[dict] = None)

Parameter : Explanation
n_fft (int, optional) – Size of FFT creates n_fft // 2 + 1 bins. (Default: 400)
win_length (int or None, optional) – Window size. (Default: n_fft)
hop_length (int or None, optional) – Length of hop between STFT windows. (Default: win_length // 2)
pad (int, optional) – Two sided padding of signal. (Default: 0)
window_fn (Callable[.., Tensor], optional) – A function to create a window tensor that is applied/multiplied to each frame/window. (Default: torch.hann_window)
power (float or None, optional) – Exponent for the magnitude spectrogram, (must be > 0) e.g., 1 for energy, 2 for power, etc. If None, then the complex spectrum is returned instead. (Default: 2)
normalized (bool, optional) – Whether to normalize by magnitude after stft. (Default: False)
wkwargs (dict or None, optional) – Arguments for window function. (Default: None)

From the table above, the following code outputs a decent spectrogram.

filename = "10ohayo0hirakegoma_out.wav" 
waveform, sample_rate = torchaudio.load(filename)
print("Shape of waveform: {}".format(waveform.size()))
print("Sample rate of waveform: {}".format(sample_rate))

sk = "waveform"
fig, (ax1,ax2,ax3) = plt.subplots(3,1,figsize=(1.6180 * 4, 4*2))
lns1=ax1.plot(waveform.t().numpy(),"red",label = "waveform[0]")
lns2=ax2.plot(waveform.t().numpy(),"red",label = "waveform[0]")
lns3=ax3.plot(waveform.t().numpy(),"blue",label = "waveform[0]")
ax1.legend(loc=0)
ax2.legend(loc=0)
ax3.legend(loc=0)
ax1.set_title(sk)
ax2.set_xlim(50000,50000+44100*0.0625) #0,44100*0.25
ax3.set_xlim(3*44100,44100*3.0625)
plt.pause(1)
plt.savefig('./fig/fig_{}_double_.png'.format(sk)) 
plt.close()

specgram = torchaudio.transforms.Spectrogram(n_fft=1024)(waveform)
print("Shape of spectrogram: {}".format(specgram.size()))

sk = "specgram"
fig, (ax1,ax2,ax3) = plt.subplots(3,1,figsize=(1.6180 * 4, 4*2))
lns1=ax1.imshow(specgram.log2()[0,:,:].numpy(), cmap='gray') 
lns2=ax2.imshow(specgram.log2()[0,:,:].numpy(), cmap='hsv') 
lns3=ax3.imshow(specgram.log2()[0,:,:].numpy(), cmap='hsv') 

ax2.set_ylim(250,0)
ax3.set_ylim(125,0)
ax1.set_title(sk)

plt.pause(1)
plt.savefig('./fig/fig_{}_double_.png'.format(sk)) 
plt.close()

View the mel spectrogram on a logarithmic scale

Next, print the mel spectrogram. What is a MelSpectrogram? Internally, the spectrogram is multiplied by what is called a mel filter bank: an amplitude-weighting filter that emphasizes the low-frequency region, somewhat like a brightness adjustment for images.

【Reference】
④ Understanding Mel filter Bank
⑤ Mel scale @wikipedia

The mel scale is as follows; note the log2.

m = 1000 \log_2 \left( \frac{f}{1000\,\mathrm{Hz}} + 1 \right)
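As a quick numerical check of this formula (my own sketch, not part of the tutorial):

import math

# The variant above maps 1000 Hz to 1000 mel by construction
def mel_log2(f_hz):
    return 1000.0 * math.log2(f_hz / 1000.0 + 1.0)

print(mel_log2(1000.0))  # -> 1000.0
print(mel_log2(440.0))   # low frequencies get proportionally more resolution

Back to the tutorial, the mel spectrogram itself is computed as follows.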
specgram = torchaudio.transforms.MelSpectrogram()(waveform)
print("Shape of spectrogram: {}".format(specgram.size()))
plt.figure()
p = plt.imshow(specgram.log2()[0,:,:].detach().numpy(), cmap='gray')

The output is Shape of spectrogram: torch.Size([2, 128, 1385]). A picture similar to the Spectrogram above comes out, but with the mel conversion applied. (fig_3__.png) So let's check the specification in the same way as for Spectrogram above. It looks like the following.

torchaudio.transforms.MelSpectrogram(sample_rate: int = 16000, n_fft: int = 400, win_length: Optional[int] = None, hop_length: Optional[int] = None, f_min: float = 0.0, f_max: Optional[float] = None, pad: int = 0, n_mels: int = 128, window_fn: Callable[[...], torch.Tensor] = <built-in method hann_window of type object>, power: Optional[float] = 2.0, normalized: bool = False, wkwargs: Optional[dict] = None)

The meaning of each parameter is almost the same as for Spectrogram above; in addition, you can specify sample_rate.

Parameter : Explanation
sample_rate (int, optional) – Sample rate of audio signal. (Default: 16000)
win_length (int or None, optional) – Window size. (Default: n_fft)
hop_length (int or None, optional) – Length of hop between STFT windows. (Default: win_length // 2)
n_fft (int, optional) – Size of FFT, creates n_fft // 2 + 1 bins. (Default: 400)
f_min (float, optional) – Minimum frequency. (Default: 0.)
f_max (float or None, optional) – Maximum frequency. (Default: None)
pad (int, optional) – Two sided padding of signal. (Default: 0)
n_mels (int, optional) – Number of mel filterbanks. (Default: 128)
window_fn (Callable[.., Tensor], optional) – A function to create a window tensor that is applied/multiplied to each frame/window. (Default: torch.hann_window)
wkwargs (Dict[.., ..] or None, optional) – Arguments for window function. (Default: None)

Based on the table above, output with the following code.

specgram = torchaudio.transforms.MelSpectrogram(sample_rate=44100,n_fft=2048)(waveform)
print("Shape of MelSpectrogram: {}".format(specgram.size()))

The result resembles the spectrogram above, but with a smaller vertical scale, and the resolution looks better than before. (fig_MekSpecgram_double_.png)

Resampling

You can resample the waveform one channel at a time. Here new_sample_rate is set to sample_rate/10, so the horizontal axis is 1/10 the length of the original waveform.

new_sample_rate = sample_rate/10
# Since Resample applies to a single channel, we resample first channel here
channel = 0
transformed = torchaudio.transforms.Resample(sample_rate, new_sample_rate)(waveform[channel,:].view(1,-1))
print("Shape of transformed waveform: {}".format(transformed.size()))
plt.figure()
plt.plot(transformed[0,:].numpy())

Shape of transformed waveform: torch.Size([1, 27686]) (fig_4__.png)
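To resample both channels, one option (my own sketch; Resample itself is applied per channel) is to loop and stack:

# Resample each channel separately, then reassemble the stereo tensor
resampler = torchaudio.transforms.Resample(sample_rate, new_sample_rate)
transformed_stereo = torch.stack(
    [resampler(waveform[ch, :].view(1, -1))[0] for ch in range(waveform.size(0))]
)
print(transformed_stereo.size())  # expect torch.Size([2, 27686])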

μ-Law algorithm

"As another example of a transformation, we can encode the signal based on Mu-Law encoding. But to do so, we need the signal to be between -1 and 1. Since the tensor is just a regular PyTorch tensor, we can apply standard operators on it."

# Let's check if the tensor is in the interval [-1,1]
print("Min of waveform: {}\nMax of waveform: {}\nMean of waveform: {}".format(waveform.min(), waveform.max(), waveform.mean()))

Min of waveform: -0.572845458984375
Max of waveform: 0.575958251953125
Mean of waveform: 9.293758921558037e-05

"The waveform is already between -1 and 1, so there is no need to normalize." The following is a standardization function.

def normalize(tensor):
    # Subtract the mean, and scale to the interval [-1,1]
    tensor_minusmean = tensor - tensor.mean()
    return tensor_minusmean/tensor_minusmean.abs().max()
# Let's normalize to the full interval [-1,1]
# waveform = normalize(waveform)

If you uncomment the last line, the waveform is normalized to the full interval [-1, 1].
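A quick check of what normalize does (my own sketch): after uncommenting, the mean is shifted to roughly 0 and the largest absolute value becomes exactly 1.

normed = normalize(waveform)
print("Min: {}\nMax: {}\nMean: {}".format(normed.min(), normed.max(), normed.mean()))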

"Let's apply the waveform encoding."

transformed = torchaudio.transforms.MuLawEncoding()(waveform)
print("Shape of transformed waveform: {}".format(transformed.size()))
plt.figure()
plt.plot(transformed[0,:].numpy())

Shape of transformed waveform: torch.Size([2, 276858]) (fig_5__.png)

"And now decode."

reconstructed = torchaudio.transforms.MuLawDecoding()(transformed)
print("Shape of recovered waveform: {}".format(reconstructed.size()))
plt.figure()
plt.plot(reconstructed[0,:].numpy())

Shape of recovered waveform: torch.Size([2, 276858]) (fig_6__.png)

"Finally, we can compare the original waveform with its reconstructed version."

# Compute median relative difference
err = ((waveform-reconstructed).abs() / waveform.abs()).median()
print("Median relative difference between original and MuLaw reconstucted signals: {:.2%}".format(err))

Median relative difference between original and MuLaw reconstructed signals: 1.28%

In other words, the signal was mu-law encoded and then decoded, with a median relative error of 1.28%.

・Functional

"The transformations seen above rely on lower-level stateless functions for their computations. These functions are available under torchaudio.functional. The complete list is available here" (the link is broken) "and includes the following."

[Reference] The broken link appears to have moved to the page below.
⑥ TORCHAUDIO.FUNCTIONAL
Unfortunately, the torchaudio.functional page has changed and no longer includes stft etc.; I found stft at the link further below (torch.stft). A sketch using some of these functional counterparts follows the table.

Function : Description
istft : Inverse short time Fourier Transform.
stft : Short time Fourier Transform.
gain : Applies amplification or attenuation to the whole waveform.
dither : Increases the perceived dynamic range of audio stored at a particular bit-depth.
compute_deltas : Compute delta coefficients of a tensor.
equalizer_biquad : Design biquad peaking equalizer filter and perform filtering.
lowpass_biquad : Design biquad lowpass filter and perform filtering.
highpass_biquad : Design biquad highpass filter and perform filtering.
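A minimal sketch using some of these stateless counterparts (my own example; the transform-based mu-law code above could be written this way instead):

import torchaudio.functional as F

# mu-law encode/decode and a simple gain, all without transform objects
encoded = F.mu_law_encoding(waveform, quantization_channels=256)
decoded = F.mu_law_decoding(encoded, quantization_channels=256)
louder = F.gain(waveform, gain_db=6.0)
print(encoded.size(), decoded.size(), louder.size())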

STFT

As a bonus, I also tried STFT. First, the signature is as follows.

torch.stft(input: torch.Tensor, n_fft: int, hop_length: Optional[int] = None, win_length: Optional[int] = None, window: Optional[torch.Tensor] = None, center: bool = True, pad_mode: str = 'reflect', normalized: bool = False, onesided: Optional[bool] = None, return_complex: Optional[bool] = None) → torch.Tensor

The code is below. Here input is the waveform, whose shape is Shape of waveform: torch.Size([1, 220160]). Since torch.stft expects a 1D tensor or a 2D (batch, waveform) tensor, I reshaped it to 1D with waveform.reshape(220160). Drawing is simplified by extracting only one element of the last dimension, as in lns1=ax1.imshow(specgram.log2().numpy()[:,:,0], cmap='gray'). The result, tentative (I am not sure it is correct), is the figure below. (fig_stft_double.png)

sk = "stft"
specgram = torch.stft(input=waveform.reshape(220160), n_fft=1024)  # old API: real tensor, last dim is [real, imag]
print("Shape of stftSpectrogram: {}".format(specgram.size()))
fig, (ax1,ax2,ax3) = plt.subplots(3,1,figsize=(1.6180 * 4, 4*2))
lns1=ax1.imshow(specgram.log2().numpy()[:,:,0], cmap='gray')
lns2=ax2.imshow(specgram.log2().numpy()[:,:,0], cmap='hsv')
lns3=ax3.imshow(specgram.log2().numpy()[:,:,0], cmap='hsv')
ax2.set_ylim(250,0)
ax3.set_ylim(125,0)
ax1.set_title(sk)
plt.pause(1)
plt.savefig('./fig/fig_{}_double_.png'.format(sk)) 
plt.close()
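One caveat about the [:,:,0] indexing: that slice is only the real part, and log2 of its negative values produces NaN. What we probably want is the magnitude. A sketch of the fix (my own, assuming the old real-valued STFT layout where the last dimension holds [real, imag]):

# Magnitude spectrogram: sqrt(real^2 + imag^2) over the last dimension
magnitude = specgram.pow(2).sum(-1).sqrt()
plt.imshow(magnitude.log2().numpy(), cmap='gray')  # zero bins map to -inf but still render
plt.show()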
The torch.stft docstring says: "The STFT computes the Fourier transform of short overlapping windows of the input. This gives frequency components of the signal as they change over time. The interface of this function is modeled after the librosa stft function (https://librosa.org/doc/latest/generated/librosa.stft.html)."

In other words, the design was borrowed from librosa.stft. So let's try the original directly. This one seems to display correctly.

# Feature extraction example
import numpy as np
import librosa
import librosa.display

y, sr = librosa.load('10ohayo0hirakegoma_out.wav')

S = np.abs(librosa.stft(y))

S_left = librosa.stft(y, center=False)

D_short = librosa.stft(y, hop_length=64)

import matplotlib.pyplot as plt
fig, ax = plt.subplots()
img = librosa.display.specshow(librosa.amplitude_to_db(S,
                                                       ref=np.max),
                               y_axis='log', x_axis='time', ax=ax)
sk = 'Power spectrogram'
ax.set_title(sk)
fig.colorbar(img, ax=ax, format="%+2.0f dB")
plt.pause(1)
plt.savefig('./fig/fig_{}_librose_.png'.format(sk)) 
plt.close()

Here, both the vertical and horizontal axes are displayed reliably, as shown below. This has nothing to do with PyTorch, and it can't use the GPU, which was the original goal, but it can be used for preprocessing (for example, preprocessing these STFT images for speech recognition). And having seen this figure, I understood that the torch graphs above are simply flipped top to bottom, with the scales in raw element counts. (fig_Power spectrogram_librose_.png)
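For what it's worth, the torch spectrogram can also be put on physical axes directly with imshow (my own sketch): bin k spans k * sr / n_fft Hz and frame t spans t * hop / sr seconds, so extent plus origin='lower' recovers the librosa-style orientation.

# Assumed values: n_fft=1024, so hop defaults to win_length // 2 = 512
n_fft, hop, sr = 1024, 512, 44100
spec = torchaudio.transforms.Spectrogram(n_fft=n_fft)(waveform)
plt.imshow(spec.log2()[0].numpy(), origin='lower', aspect='auto',
           extent=[0, spec.size(-1) * hop / sr, 0, sr / 2])
plt.xlabel("time [s]")
plt.ylabel("frequency [Hz]")
plt.show()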

There is more in the tutorial, but I'll skip it this time:
・mu_law_encoding functional
・Visualize a waveform with the highpass biquad filter
・Migrating to torchaudio from Kaldi
・Create mel frequency cepstral coefficients from a raw audio signal
Also, the provided Datasets seem to be newer at the link below.
・Available Datasets ⇒ TORCHAUDIO.DATASETS

And looking at the timestamp, it reads © Copyright 2017, PyTorch. Is it getting a bit stale? lol

It may take some time before this becomes second nature (i.e., knowing where to find the information you need)...

Summary

・I tried out torchaudio.
・I expected to be able to compute on the GPU, but not on my machine.
・Since the timestamps are old, I recommend referring to the newest pages whenever possible.

By the way, the following pages are from 2017-2018.
TORCHAUDIO © Copyright 2018, Torchaudio Contributors.
TORCHAUDIO.FUNCTIONAL (https://pytorch.org/audio/stable/functional.html) © Copyright 2018, Torchaudio Contributors.
SPEECH COMMAND RECOGNITION WITH TORCHAUDIO © Copyright 2017, PyTorch.
