[PYTHON] I tried automatic composition from wav files using Keras's stateful RNN.

Hello everybody

Google released an automatic composition learner called Magenta, and I thought, "Now I can make as many songs as I like!" It turned out not to be that simple: the training data has to be MIDI, so it seems you cannot just feed it an arbitrary recorded file.

So I experimented with various things to see whether I could build my own automatic composition learner that works with wav files instead. My last few articles were the groundwork for this one.

TL;DR

The repository is here: https://github.com/niisan-tokyo/music_generator

The actual training uses the following file: https://github.com/niisan-tokyo/music_generator/blob/master/src/stateful_learn.py

Motivation

Google has released Magenta, an automatic composition tool that uses TensorFlow. https://magenta.tensorflow.org/ It is a great tool, but I find it a little hard to use because the target files are MIDI. (There used to be plenty of those around, but nowadays most of what I have is mp3 ...)

To handle recorded files easily, it seemed best to work with the waveform data itself, such as wav. Python can also handle wav natively, so I felt I had to give this a try.

Plan

I first thought about feeding the raw waveform data into the model as is, but that turned out to be completely useless, so I came up with the following plan.

In other words, I thought I could build a generator like this: (generator.png) Once you have a time series of frequency distributions, you can turn it into music by applying an inverse Fourier transform.

Fourier transform

In a previous article, I confirmed that a wav file reconstructed by inverse-transforming a 256-frame Fourier transform can be played back without any problem, so I think this is a sufficient feature representation of the sound itself. http://qiita.com/niisan-tokyo/items/764acfeec77d8092eb73

Applying the fast Fourier transform (FFT) from the numpy library to a 256-frame window gives 256 frequency components. Since the frame rate (frames per second) of an audio file is normally 44100, you get one frequency distribution every 256 / 44100 ≈ 5.8 msec. The idea is that if we can build a learner that automatically generates this time transition, one step every 5.8 msec, music will come out automatically.
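
For example, the arithmetic and the transform itself look roughly like this (a minimal sketch assuming a mono, 16-bit wav; the file name is just an example):

import numpy as np
import wave

N = 256        # frames per FFT window
rate = 44100   # typical frame rate of a wav file

# One window covers N / rate seconds of audio
print(1000.0 * N / rate)  # => about 5.8 (msec)

# FFT of a single 256-frame window
wav = wave.open('sample.wav', 'rb')
frames = np.frombuffer(wav.readframes(N), dtype=np.int16)
spectrum = np.fft.fft(frames)   # 256 complex frequency components
print(spectrum.shape)           # (256,)
wav.close()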

Stateful RNN

An RNN (Recurrent Neural Network) is a type of network that handles sequential state: when producing an output from an input, it also refers to what it computed in earlier steps. http://qiita.com/kiminaka/items/87afd4a433dc655d8cfd

When dealing with RNNs in Keras, the model usually takes several consecutive steps as one input and produces an output, but the internal state is reset for each new input. A stateful RNN instead keeps the state left over from the previous processing and carries it into the next one. The hope is that this makes it possible to handle complex sequences across batch boundaries, without fixing the sequence length in advance.
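
As a rough illustration of that behavior (a toy model with made-up shapes, not the one used below): with stateful=True the hidden state carries over from one call to the next until reset_states() is called explicitly.

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

# Toy stateful model: the batch size must be fixed in advance
model = Sequential()
model.add(LSTM(8, batch_input_shape=(1, 1, 4), stateful=True))
model.add(Dense(4))
model.compile(loss='mse', optimizer='adam')

x = np.random.random((1, 1, 4))
a = model.predict(x)
b = model.predict(x)   # same input, different output: the state was kept
model.reset_states()   # the state is only cleared when you ask for it
c = model.predict(x)   # matches a again, because the state was reset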

(stateful_rnn.png) This time, I thought that by training a stateful RNN step by step, with the frequency distribution at the current moment as the input and the frequency distribution at the next moment as the output, I could create a generator that captures the flow of a song.

Preparation for learning

File preparation

If you have m4a or mp3 files, you can use ffmpeg to convert them to wav. http://qiita.com/niisan-tokyo/items/135824905e4a3021d358 I recorded some of my favorite game music on my Mac and exported it to wav.
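
For reference, the conversion can also be driven from Python (a sketch that assumes ffmpeg is on the PATH; the file names are placeholders):

import subprocess

# Convert an m4a (or mp3) recording to wav with ffmpeg
subprocess.run(['ffmpeg', '-i', 'input.m4a', 'output.wav'], check=True)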

Creating a dataset

The dataset is created by Fourier transforming the wav files, so you can basically just refer to the code, but there are a few caveats.

def create_test_data(left, right):
    arr = []
    for i in range(0, len(right)-1):
        # Combine the complex spectra for this window into one real-valued vector:
        # left real, left imaginary, right real, right imaginary
        temp = np.array([])
        temp = np.append(temp, left[i].real)
        temp = np.append(temp, left[i].imag)
        temp = np.append(temp, right[i].real)
        temp = np.append(temp, right[i].imag)
        arr.append(temp)

    return np.array(arr)

This is the part that combines the frequency distributions of the Fourier-transformed stereo source into the input data. The real and imaginary parts of the complex-valued frequency distribution are stored in separate elements of the vector, because if you compute with complex numbers directly, the imaginary part gets dropped. As a result, one window is 256 frames of samples, but the actual input dimension is 4 × 256 = 1024.
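
Later, when interpreting the model's output, this 1024-dimensional vector has to be turned back into two complex spectra. A minimal sketch of that inverse operation, assuming the layout above (left real, left imaginary, right real, right imaginary, each of length N = 256); the helper name is just illustrative:

import numpy as np

N = 256  # FFT window size

def split_to_complex(vec):
    # Undo create_test_data for one time step:
    # 4 * N real values -> two length-N complex spectra
    left = vec[0:N] + 1j * vec[N:2*N]
    right = vec[2*N:3*N] + 1j * vec[3*N:4*N]
    return left, right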

Learning

Creating a model

The model is simple: three LSTMs connected in series, followed by a fully connected layer.

# Stateful LSTMs need the batch size to be fixed in advance (batch_size=samples)
model = Sequential()
model.add(LSTM(256,
              input_shape=(1, dims),
              batch_size=samples,
              return_sequences=True,
              activation='tanh',
              stateful=True))
model.add(Dropout(0.5))
model.add(LSTM(256, stateful=True, return_sequences=True, activation='tanh'))
model.add(Dropout(0.3))
model.add(LSTM(256, stateful=True, return_sequences=False, activation='tanh'))
model.add(Dropout(0.2))
model.add(Dense(dims))
model.compile(loss='mse', optimizer='adam')

Now, there are some conditions when using a stateful RNN. First, you have to fix the input dimensions per batch. Second, each sample in one batch and the corresponding sample in the next batch must be consecutive parts of the same series: if $X_1$ is the first batch and $X_2$ is the second, then the $i$-th samples $X_1[i]$ and $X_2[i]$ must follow on from each other. This time, we assume the generator has the form $x_{n+1} = RNN(x_n)$ and create the next set of states, of the same size, from a set of consecutive states.

(model.png) In other words, $X_2[i]$ is always `samples` steps ahead of $X_1[i]$. It may be a bit crude, but the result is a machine that advances by `samples` frames, that is, 32 at a time, generating the next block of states from the current one.
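
Concretely, the input and the target are the same series shifted by `samples` steps, so the i-th sample of one batch and the i-th sample of the next batch really are consecutive pieces of one series (a toy check with stand-in numbers):

import numpy as np

samples = 32                # batch size, and the step width of the generator
data = np.arange(256)       # stand-in for the indices of consecutive spectra

in_data = data[:-samples]   # inputs
out_data = data[samples:]   # targets: the same series, `samples` steps later

# With batch_size = samples, batch k of the input is in_data[k*samples:(k+1)*samples]
batch_1 = in_data[0:samples]
batch_2 = in_data[samples:2*samples]
print(batch_2[0] - batch_1[0])  # => 32: X_2[i] is `samples` steps after X_1[i]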

Fitting

Now that everything is ready, let's start training.

for num in range(0, epochs):
    print(num + 1, '/', epochs, ' start')
    for one_data in test:
        # Input and target are the same series shifted by `samples` steps
        in_data = one_data[:-samples]
        out_data = np.reshape(one_data[samples:], (batch_num, dims))
        model.fit(in_data, out_data, epochs=1, shuffle=False, batch_size=samples)

        # Reset the internal state before moving on to the next wav file
        model.reset_states()
    print(num+1, '/', epochs, ' epoch is done!')

model.save('/data/model/mcreator')

Since the order of the batches matters during training, the samples are not shuffled. Training is also done one wav file at a time, and the internal state is reset after each file.

Results

Learning results

First of all, running the training, the fitting does seem to make some progress.

1 / 10  start
Epoch 1/1
16384/16384 [==============================] - 87s - loss: 1.9879e-04
Epoch 1/1
16384/16384 [==============================] - 84s - loss: 1.9823e-04
Epoch 1/1
16384/16384 [==============================] - 75s - loss: 1.1921e-04
Epoch 1/1
16384/16384 [==============================] - 82s - loss: 2.3389e-04
Epoch 1/1
16384/16384 [==============================] - 80s - loss: 3.7428e-04
Epoch 1/1
16384/16384 [==============================] - 90s - loss: 3.3968e-04
Epoch 1/1
16384/16384 [==============================] - 87s - loss: 5.0188e-04
Epoch 1/1
16384/16384 [==============================] - 76s - loss: 4.9725e-04
Epoch 1/1
16384/16384 [==============================] - 74s - loss: 3.7447e-04
Epoch 1/1
16384/16384 [==============================] - 87s - loss: 4.1855e-04
1 / 10  epoch is done!
2 / 10  start
Epoch 1/1
16384/16384 [==============================] - 82s - loss: 1.9742e-04
Epoch 1/1
16384/16384 [==============================] - 85s - loss: 1.9718e-04
Epoch 1/1
16384/16384 [==============================] - 90s - loss: 1.1876e-04
Epoch 1/1
16384/16384 [==============================] - 104s - loss: 2.3144e-04
Epoch 1/1
16384/16384 [==============================] - 97s - loss: 3.7368e-04
Epoch 1/1
16384/16384 [==============================] - 78s - loss: 3.3906e-04
Epoch 1/1
16384/16384 [==============================] - 87s - loss: 5.0128e-04
Epoch 1/1
16384/16384 [==============================] - 79s - loss: 4.9627e-04
Epoch 1/1
16384/16384 [==============================] - 82s - loss: 3.7420e-04
Epoch 1/1
16384/16384 [==============================] - 90s - loss: 4.1857e-04
2 / 10  epoch is done!
...

Let's generate some sound with the trained model.

Sound generation

I'll leave the detailed code to stateful_use.py in the repository; the general flow is:

  1. Load the trained model
  2. Read a seed music file
  3. Run the model on the seed for a while to build up its internal state
  4. Then generate each next state from the time series data the model has produced itself
  5. Once a time series of sufficient length has been generated, reconstruct the waveform from it with an inverse Fourier transform

The generator part is as follows:

# Fourier transform of the seed file
Kl = fourier(left, N, samples * steps)
Kr = fourier(right, N, samples * steps)
sample = create_test_data(Kl, Kr)
sample = np.reshape(sample, (samples * steps, 4 * N))
music = []

# Feed the seed data into the model to build up ("warm up") its internal state
for i in range(steps):
    in_data = np.reshape(sample[i * samples:(i + 1) * samples], (samples, 1, 4 * N))
    model.predict(in_data)

# Self-generate music by repeatedly feeding the model's last output back in as the next input
for i in range(0, frames):
    if i % 50 == 0:
        print('progress: ', i, '/', frames)

    music_data = model.predict(np.reshape(in_data, (samples, 1, 4 * N)))
    music.append(np.reshape(music_data, (samples, 4 * N)))
    in_data = music_data

music = np.array(music)

The data obtained this way is put through an inverse Fourier transform to convert it back to real space, and then written out as a wav file.
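
A minimal sketch of this reconstruction step, assuming the same vector layout as create_test_data and using scipy for the wav output (the int16 scaling and the output file name are illustrative):

import numpy as np
from scipy.io import wavfile

N = 256
rate = 44100

# music has shape (frames, samples, 4 * N) after the generation loop above
blocks = []
for vec in music.reshape(-1, 4 * N):
    left = vec[0:N] + 1j * vec[N:2*N]
    right = vec[2*N:3*N] + 1j * vec[3*N:4*N]
    # Inverse FFT back to real space; keep only the real part
    blocks.append(np.stack([np.fft.ifft(left).real,
                            np.fft.ifft(right).real], axis=1))

waveform = np.concatenate(blocks)                 # shape: (total_frames, 2)
wavfile.write('generated.wav', rate, waveform.astype(np.int16))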

Looking at the waveform in real space, it looks like the following. (Waveform screenshot: スクリーンショット 2017-06-14 18.11.11.png; a slightly longer span: スクリーンショット 2017-06-14 18.11.22.png) Listening to the resulting wav is a surreal experience: a buzzer-like tone at a fixed pitch just drones on the whole time ...

Far from music, what I got is a mysterious machine that only produces a steady tone. No matter what music file you feed in, it makes the same sound.

Consideration

I think the reason it did not work is that the sound transitions are so intense that training drifted toward outputting a constant that minimizes the error. That would also explain why the loss fluctuates so little despite the complexity of the system. Doing this stateless instead is not straightforward either: you don't know how long the sequences should be, and the input dimension grows with the sequence length, so it would not be easy to train. It may also be that there simply weren't enough training iterations, but since the loss is already very small, the problem probably lies elsewhere.

Either way, it needs more improvement.

Summary

As expected, making songs is not that easy. It did not work very well, but I did get a reasonable grasp of how to use Python, and the meaning of numpy operations in particular, so that much was worthwhile.

Also, somewhat incidentally, I ended up learning a great deal about numpy's reshape.

That's all for this time.

Postscript

2017/06/18: The following changes were made.

Results

I obtained the following waveform. (Waveform screenshots: スクリーンショット 2017-06-18 8.48.25.png, スクリーンショット 2017-06-18 8.48.35.png) The background noise is still there, but it now sounds as if it has a steady rhythm. The frequency distributions obtained (real part) are shown below: ten distributions, taken every 5 frames, overlaid on one plot.

N = 256, 256 LSTM neurons: スクリーンショット 2017-06-18 8.54.15.png / N = 1024, 512 LSTM neurons: スクリーンショット 2017-06-18 8.52.57.png With N = 1024 and 256 neurons, the result was the same as with N = 256.

Consideration

I changed the scaling factor applied during the Fourier transform, and of course the loss increased: since MSE is used as the loss, multiplying the data by a factor multiplies the loss by its square. This makes changes in the smaller components more visible to the optimizer, which may be why the accuracy improved. Increasing the number of neurons also increased the model's expressive power.
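
The square relationship is easy to check directly (a toy check, not from the repository):

import numpy as np

a = np.random.random(100)
b = np.random.random(100)
factor = 4.0

mse = np.mean((a - b) ** 2)
mse_scaled = np.mean((factor * a - factor * b) ** 2)
print(np.isclose(mse_scaled, factor ** 2 * mse))  # => True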

As further improvements, I could increase the factor applied during the Fourier transform, or increase the number of neurons further. The number of epochs (10) and the number of source songs (9) could also be changed, but since the change in loss is so small I could not see a clear effect, so I have put that on hold for now.
