In a nutshell, the task is to identify the language from voice data. For example, given an audio clip saying "Good morning, nice weather", the system should answer "this voice data is Japanese!", and given "Buenas tardes", it should answer "this voice data is Spanish!".
The intended use case is identifying the language when you don't know what language the speaker is using. Existing automatic translators apparently require you to specify the source language, such as "English" or "Spanish", in advance, so they cannot translate speech whose language is unknown. (I think.)
So for speech in an unknown language, the flow would be: identify the language with language identification → then translate. That is how it would be used. (I think.)
There are various methods for language identification, but this time I tried a CNN, simply because I happened to find an easy-to-understand article in English. (http://yerevann.github.io/2015/10/11/spoken-language-identification-with-deep-convolutional-networks/)
The approach in that article apparently placed 10th in a language identification contest held by Topcoder in 2015, so I studied it and gave it a try.
In the above article, the problem was to classify 66,176 10-second MP3 files prepared in advance into 176 languages.
This time, however, I used wav-format audio files in English, French, and Spanish obtained from VoxForge (http://www.voxforge.org/). Each language can be downloaded from the URLs below. Because of the large amount of audio data, I fetched it with the wget command.
http://www.repository.voxforge1.org/downloads/SpeechCorpus/Trunk/Audio/Main/16kHz_16bit/
http://www.repository.voxforge1.org/downloads/fr/Trunk/Audio/Main/16kHz_16bit/
http://www.repository.voxforge1.org/downloads/es/Trunk/Audio/Main/16kHz_16bit/
Since a CNN is used, the wav files of each language obtained above are converted into images.
This time, I used mel spectrograms, where the horizontal axis is time, the vertical axis is frequency, and the shade of the image is the intensity. A mel spectrogram can easily be obtained from a wav file with a library called librosa.
Below is the code that converts a wav file into a mel spectrogram image. (I think the same code also works for mp3 files.)
# input: path to an audio file
# output: mel spectrogram image of the audio data (192x192)
import librosa as lr

def wav_to_img(path, height=192, width=192):
    signal, sr = lr.load(path, res_type='kaiser_fast')
    if signal.shape[0] < sr * 3:  # skip wav files shorter than 3 seconds
        return False, False
    else:
        signal = signal[:sr * 3]  # keep only the first 3 seconds
        hl = signal.shape[0] // (width * 1.1)
        spec = lr.feature.melspectrogram(y=signal, sr=sr, n_mels=height, hop_length=int(hl))
        img = lr.amplitude_to_db(spec) ** 2
        start = (img.shape[1] - width) // 2
        return True, img[:, start:start + width]
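For example, it can be called like this (the file path here is just a placeholder, not an actual file from the dataset):

# 'sample.wav' is a hypothetical path; use any wav file of 3 seconds or longer
ok, img = wav_to_img('data/voxforge/english/sample.wav')
if ok:
    print(img.shape)  # should be (192, 192)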
The dataset mentioned above contains clips ranging from a few seconds to a few tens of seconds. This time I extracted and used only the first 3 seconds of each clip that is at least 3 seconds long; clips shorter than 3 seconds are not used.
The following function converts all audio files in the specified folder to mel spectrogram images and saves them.
# Convert every audio file matched by the input pattern to a spectrogram image
# and save it to the specified output folder
import os
import glob
import imageio

def process_audio(in_folder, out_folder):
    os.makedirs(out_folder, exist_ok=True)
    files = glob.glob(in_folder)
    for file in files:
        ok, img = wav_to_img(file)
        if ok:
            name = os.path.splitext(os.path.basename(file))[0]
            imageio.imwrite(os.path.join(out_folder, name + '.jpg'), img)
Run it as shown below, with the audio folder for each language (as a glob pattern) as the first argument and the output folder as the second. Do this for all languages so that every audio file is converted to a mel spectrogram.
# First argument: glob pattern for the input audio files; second argument: output folder
process_audio('data/voxforge/english/*wav', 'data/voxforge/english_3s_img/')
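The same goes for French and Spanish (the folder names below follow the same pattern and are my assumption; adjust them to wherever you actually saved the data):

process_audio('data/voxforge/french/*wav', 'data/voxforge/french_3s_img/')
process_audio('data/voxforge/spanish/*wav', 'data/voxforge/spanish_3s_img/')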
For convenience, the mel spectrogram images of all languages are then saved into a single HDF5 file. It's a bit clunky, but the images saved above are stored per language in HDF5 format at the following path. Destination path: 'data/voxforge/3sImg.h5'
import dask.array.image

dask.array.image.imread('data/voxforge/english_3s_img/*.jpg').to_hdf5('data/voxforge/3sImg.h5', 'english')
dask.array.image.imread('data/voxforge/french_3s_img/*.jpg').to_hdf5('data/voxforge/3sImg.h5', 'french')
dask.array.image.imread('data/voxforge/spanish_3s_img/*.jpg').to_hdf5('data/voxforge/3sImg.h5', 'spanish')
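As a quick sanity check (just a sketch; the exact number of images depends on how much audio you downloaded), you can confirm that the three datasets were written:

import h5py

with h5py.File('data/voxforge/3sImg.h5', 'r') as f:
    for lang in ['english', 'french', 'spanish']:
        print(lang, f[lang].shape)  # (number of images, 192, 192)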
Divide into training data, validation data, and test data.
import h5py
import numpy as np
import dask.array as da
from tensorflow.python.keras.utils import to_categorical

# Decide the data sizes yourself based on how many mel spectrogram images you obtained
data_size = 60000
tr_size = 50000
va_size = 5000
te_size = 5000

x_english = h5py.File('data/voxforge/3sImg.h5')['english']
x_french = h5py.File('data/voxforge/3sImg.h5')['french']
x_spanish = h5py.File('data/voxforge/3sImg.h5')['spanish']

x = np.vstack((x_english[:20000], x_french[:20000], x_spanish[:20000]))
del x_french
del x_english
del x_spanish
x = da.from_array(x, chunks=1000)

# Prepare the labels: 0 for English, 1 for French, 2 for Spanish
y = np.zeros(data_size)
y[0:20000] = 0
y[20000:40000] = 1
y[40000:60000] = 2

# Shuffle and split the data
shfl = np.random.permutation(data_size)
training_size = tr_size
validation_size = va_size
test_size = te_size

# Assign the randomly permuted indices shfl to the training, validation, and test splits
train_idx = shfl[:training_size]
validation_idx = shfl[training_size:training_size+validation_size]
test_idx = shfl[training_size+validation_size:]

# Build the training, validation, and test sets from the assigned indices
x_train = x[train_idx]
y_train = y[train_idx]
x_vali = x[validation_idx]
y_vali = y[validation_idx]
x_test = x[test_idx]
y_test = y[test_idx]

# Normalize the images
x_train = x_train/255
x_vali = x_vali/255
x_test = x_test/255

# Reshape for training (add a channel axis)
x_train = x_train.reshape(tr_size, 192, 192, 1)
x_vali = x_vali.reshape(va_size, 192, 192, 1)
x_test = x_test.reshape(te_size, 192, 192, 1)

# One-hot encode the labels
y_train = to_categorical(y_train.astype(int), 3)
y_vali = to_categorical(y_vali.astype(int), 3)
y_test = to_categorical(y_test.astype(int), 3)
With the above processing, the data is divided into training data, validation data, and test data.
The network structure used is shown below. Feel free to change it. The framework used is Keras.
import tensorflow as tf
from tensorflow.python import keras
from tensorflow.python.keras import backend as K
from tensorflow.python.keras.models import Model, Sequential, load_model
from tensorflow.python.keras.layers import Conv2D, MaxPooling2D, Dense, Flatten, Dropout, Input, BatchNormalization, Activation
from tensorflow.python.keras.preprocessing.image import load_img, img_to_array, array_to_img, ImageDataGenerator
i = Input(shape=(192,192,1))
m = Conv2D(16, (7, 7), activation='relu', padding='same', strides=1)(i)
m = MaxPooling2D(pool_size=(3, 3), strides=2, padding='same')(m)
m = BatchNormalization()(m)
m = Conv2D(32, (5, 5), activation='relu', padding='same', strides=1)(m)
m = MaxPooling2D(pool_size=(3, 3), strides=2, padding='same')(m)
m = BatchNormalization()(m)
m = Conv2D(64, (3, 3), activation='relu', padding='same', strides=1)(m)
m = MaxPooling2D()(m)
m = BatchNormalization()(m)
m = Conv2D(128, (3, 3), activation='relu', padding='same', strides=1)(m)
m = MaxPooling2D(pool_size=(3, 3), strides=2, padding='same')(m)
m = BatchNormalization()(m)
m = Conv2D(128, (3, 3), activation='relu', padding='same', strides=1)(m)
m = MaxPooling2D(pool_size=(3, 3), strides=2, padding='same')(m)
m = BatchNormalization()(m)
m = Conv2D(256, (3, 3), activation='relu', padding='same', strides=1)(m)
m = MaxPooling2D(pool_size=(3, 3), strides=2, padding='same')(m)
m = BatchNormalization()(m)
m = Flatten()(m)
m = Activation('relu')(m)
m = BatchNormalization()(m)
m = Dropout(0.5)(m)
m = Dense(512, activation='relu')(m)
m = BatchNormalization()(m)
m = Dropout(0.5)(m)
o = Dense(3, activation='softmax')(m)
model = Model(inputs=i, outputs=o)
model.summary()
Training is done as follows. Perhaps because the amount of training data was small, or because the model wasn't great, it tended to overfit almost immediately, so about 5 epochs is enough. Sorry, I haven't had time to look into this properly.
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=32, epochs=5, verbose=1, validation_data=(x_vali, y_vali), shuffle = True)
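Since the model overfits quickly, one option (not used in the run above, just a sketch) is to train with early stopping on the validation loss:

from tensorflow.python.keras.callbacks import EarlyStopping

# Stop training once the validation loss has not improved for 2 epochs
es = EarlyStopping(monitor='val_loss', patience=2)
model.fit(x_train, y_train, batch_size=32, epochs=20, verbose=1,
          validation_data=(x_vali, y_vali), shuffle=True, callbacks=[es])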
This is the result on the test data.
model.evaluate(x_test, y_test)
[0.2763474455833435, 0.8972]
So it predicts correctly about 90% of the time.
However, the training data and the test data actually contain many recordings from the same speakers, so I think that inflates the accuracy.
If you want to check the accuracy more rigorously, you should use the data that was not included before shuffling, in this example the images beyond the first 20,000 per language:
x_english = h5py.File('data/voxforge/3sImg.h5')['english']
x_french = h5py.File('data/voxforge/3sImg.h5')['french']
x_spanish = h5py.File('data/voxforge/3sImg.h5')['spanish']
x = np.vstack((x_english[20000:], x_french[20000:], x_spanish[20000:]))
I think evaluating on that data gives a more accurate picture. For reference, the per-language accuracy in that case was as follows.
English: 0.8414201183431953
French: 0.7460106382978723
Spanish: 0.8948035487959443
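For reference, here is a sketch of how those per-language accuracies could be computed from the held-out data above (the preprocessing mirrors what was done for the training set; the helper name is mine):

import numpy as np

def lang_accuracy(model, x_lang, label):
    # Apply the same preprocessing as the training data: normalize and add a channel axis
    x_lang = np.asarray(x_lang, dtype=np.float32) / 255
    x_lang = x_lang.reshape(len(x_lang), 192, 192, 1)
    preds = np.argmax(model.predict(x_lang), axis=1)
    return np.mean(preds == label)

print('English:', lang_accuracy(model, x_english[20000:], 0))
print('French:', lang_accuracy(model, x_french[20000:], 1))
print('Spanish:', lang_accuracy(model, x_spanish[20000:], 2))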
French is not identified very well.
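Finally, here is a rough sketch of how the trained model could be used in the way described at the beginning, i.e. to identify the language of a single audio file. The wav path is a placeholder, and the image is round-tripped through a JPEG so that the preprocessing matches how the training images were produced:

import numpy as np
import imageio

languages = ['English', 'French', 'Spanish']  # label order used above: 0, 1, 2

def identify_language(path, model):
    ok, img = wav_to_img(path)
    if not ok:
        return None  # clip shorter than 3 seconds
    # Save and reload as a JPEG, exactly as process_audio does for the training images
    imageio.imwrite('tmp.jpg', img)
    x = imageio.imread('tmp.jpg') / 255
    x = x.reshape(1, 192, 192, 1)
    probs = model.predict(x)[0]
    return languages[int(np.argmax(probs))]

print(identify_language('some_clip.wav', model))  # 'some_clip.wav' is a hypothetical file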
This time, after some studying, I tried spoken language identification using a CNN. Sorry if the explanation is lacking in places because I rushed partway through. I introduced the basic idea at the beginning, but you can check the original article at the following URL, so if you are comfortable with English you may want to read it as well. (http://yerevann.github.io/2015/10/11/spoken-language-identification-with-deep-convolutional-networks/)
Thank you for reading until the end.