[Python] Applying to SONY's NNC Challenge (using audio data to create something useful for Audiostock)

0. Who am I?

I am a graduate student in a doctoral program in mathematics. My specialty is analysis; I have little knowledge of computing, let alone machine learning, but I sometimes use computers for research and can just about manage Python. I also participated in the first NNC Challenge (image classification), but without a clear sense of purpose or enough knowledge, I produced nothing remarkable (I only managed to submit an entry).

1. What is SONY NNC Challenge?

This is a machine learning contest built around the "Neural Network Console (NNC)", SONY's integrated development environment for machine learning. The contest appears to be operated by SONY and ledge Co., Ltd. An outline is available at the following URL: https://nnc-challenge.com/

2. This theme

Since this round of the contest deals with audio data, it was immediately clear what I wanted to do. My hobby is so-called "sound games" (rhythm games), and most of the music I usually listen to is game music. Among composers, my favorite, t+pazolite (pronounced "to-pazolite"), makes very distinctive music. My theme this time is whether a machine can discern that distinctive sound. In other words: "Can a machine tell t+pazolite songs apart?" Framed to match the contest theme, it becomes "learn the songs a user likes and build a function that recommends new songs to them." Services such as Amazon Prime Music have already implemented song recommendation, so this may not be unique as a "challenge," but given my ability and the contest deadline, I couldn't spend much time choosing a theme. Above all, a theme you are genuinely interested in keeps you motivated, so this is the theme I went with.

3. What you need

My goal this time is to build a "discriminator" that determines whether an input song is by t+pazolite (hereinafter "topazo"). In other words, to create a program (trained by the machine itself) that returns "0" (not a topazo song) or "1" (a topazo song) when given some music data. A large number of songs labeled "0" or "1" are used to learn what makes a song a "0" or a "1", producing a program that can make this judgment on unknown songs. NNC takes arrays as input, so the music data must be turned into arrays in some way. Since a WAV file stores music as an array of samples, the songs are converted to this format. (NNC official document: https://support.dl.sony.com/wp-content/uploads/sites/2/2017/11/19120828/starter_guide_sound_classification.pdf) All input arrays must also have the same shape. This time, more than 10,000 songs of the same size were distributed in WAV format by the NNC organizers (training data: Audiostock), so I formatted the "topazo songs" to match.
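To illustrate the point that a WAV file is just an array of samples (which is why NNC can take it as input), here is a minimal sketch: it writes a synthetic 24-second mono clip at 8 kHz and reads it back as a NumPy array. The file name and signal are purely illustrative.

```python
# Sketch: a WAV file is an array of samples; write one and read it back.
import numpy as np
from scipy.io import wavfile

sr = 8000                       # 8 kHz, matching the distributed data
t = np.linspace(0, 24, sr * 24, endpoint=False)
mono = (0.3 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)

wavfile.write("example.wav", sr, mono)     # a 24-second mono clip
rate, data = wavfile.read("example.wav")   # back as a 1-D sample array

print(rate, data.shape)                    # 8000 (192000,)
```

A 24-second mono clip at 8 kHz is a flat array of 8000 × 24 = 192,000 samples, so "same shape" here means same channel count, sampling rate, and length.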

4. Formatting music data

First, format the songs that will serve as the "1"-labeled data. My topazo songs were in m4a format, so I needed to convert them to WAV. I wrote a Python script based on this post. (Convert M4A to WAV and vice versa: https://wave.hatenablog.com/entry/2017/01/29/160000) Once a song is in WAV format, the next step is to match it to the size of the provided training data. The array shape of a song is determined by: 1. stereo vs. monaural, 2. sampling rate, and 3. length in seconds (see the NNC official documentation). For stereo vs. monaural, no change was needed, since both the provided data and my own data were monaural. For the sampling rate, the official data was 8 kHz while my songs were 48 kHz, so downsampling was necessary. However, if you simply relabel the sampling rate at 1/6 of the original, you get slowed-down audio, so the conversion method must be chosen carefully. I achieved the desired sample rate conversion easily using the method in this article. (Downsampling wav audio file: https://stackoverflow.com/questions/30619740/downsampling-wav-audio-file) Finally, align the lengths of the songs. All the provided data was formatted to 24 seconds, so my songs just need to be cut into 24-second segments. Referring to this article, I wrote a program that names each output file after the original song plus a serial number and repeats the same operation for every song in the folder. ([Python] A program that divides WAV files at equal intervals [Sound programming]: http://tacky0612.hatenablog.com/entry/2017/11/21/164409)
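The two formatting steps above (downsampling 48 kHz → 8 kHz, then cutting into 24-second segments) can be sketched as follows. This is not the author's exact script; function names and the demo signal are chosen for illustration, and `scipy.signal.resample_poly` stands in for whichever resampling method the linked Stack Overflow answer uses.

```python
# Sketch: resample 48 kHz audio to 8 kHz, then split into 24-second chunks.
import numpy as np
from scipy.signal import resample_poly

SRC_SR, DST_SR = 48_000, 8_000
SEG_SEC = 24

def downsample(samples: np.ndarray) -> np.ndarray:
    # resample_poly low-pass filters before decimating, so pitch and tempo
    # are preserved (naively keeping every 6th sample would alias).
    return resample_poly(samples, up=1, down=SRC_SR // DST_SR)

def split_segments(samples: np.ndarray, sr: int = DST_SR) -> list:
    seg_len = sr * SEG_SEC
    n_full = len(samples) // seg_len           # drop the trailing remainder
    return [samples[i * seg_len:(i + 1) * seg_len] for i in range(n_full)]

# Demo: 60 seconds of 48 kHz noise -> two full 24-second segments at 8 kHz
audio = np.random.default_rng(0).standard_normal(SRC_SR * 60)
segments = split_segments(downsample(audio))
print(len(segments), segments[0].shape)        # 2 (192000,)
```

In practice each segment would be written back out with the original song's name plus a serial number, and the whole thing looped over every file in the folder, as the linked article does.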

5. Create annotation CSV file

Finally comes the annotation work needed to feed the data into NNC. Annotation means labeling each song "1" or "0". In NNC this is done by uploading a CSV file containing each song's path and the "0" or "1" corresponding to that song; the music data is then loaded automatically from the paths written in the CSV (see the NNC official document). I wrote a Python program that collects the paths from the music folders and writes them to a CSV. The "topazo songs" and "other songs" are placed in separate folders, with "1" associated with paths in the topazo folder and "0" with the others. I shuffled the song order when creating the CSV, but this turned out to be unnecessary, since NNC has a built-in function to shuffle uploaded data.
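A minimal sketch of that CSV-generation step is below. The folder names, function name, and the `x:wav`/`y:label` header row are assumptions for illustration; check the NNC documentation for the exact column names your project expects.

```python
# Sketch: collect wav paths from two folders and write an annotation CSV,
# labeling one folder's songs "1" and the other's "0".
import csv
from pathlib import Path

def write_annotation_csv(topazo_dir, other_dir, out_csv):
    rows = []
    for label, folder in ((1, topazo_dir), (0, other_dir)):
        for wav in sorted(Path(folder).glob("*.wav")):
            rows.append((str(wav), label))
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["x:wav", "y:label"])   # assumed NNC-style header
        writer.writerows(rows)
    return len(rows)

# Usage (paths illustrative):
# write_annotation_csv("topazo/", "others/", "train.csv")
```

Shuffling the rows here is optional, since NNC can shuffle the dataset itself after upload.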

6. Learning

With all the data ready, it was time to train. By this point, however, it was already 2:00 p.m. on the deadline day. Knowing that training would take some time, I borrowed an audio network from the NNC sample projects and adapted its input and output to this theme. Specifically, I used the (96000, 1) input of the wav_keyboard_sound project and changed the layers before the output to Affine -> Sigmoid -> BinaryCrossEntropy. It's a shame my time allocation didn't work out.
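NNC builds this output head in its GUI, but for readers unfamiliar with it, here is a NumPy sketch of the math that Affine -> Sigmoid -> BinaryCrossEntropy computes for a binary "topazo / not topazo" decision. The shapes and random values are illustrative only.

```python
# Sketch of the binary-classification head: fully connected layer, then
# sigmoid to get P("topazo song"), scored with binary cross-entropy.
import numpy as np

def affine(x, W, b):
    return x @ W + b                     # fully connected (Affine) layer

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))      # squashes scores into (0, 1)

def binary_cross_entropy(p, y, eps=1e-7):
    p = np.clip(p, eps, 1 - eps)         # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 96_000))     # 4 clips, (96000, 1) flattened
W = rng.standard_normal((96_000, 1)) * 0.001
b = np.zeros(1)

p = sigmoid(affine(x, W, b))             # predicted P("topazo song")
y = np.array([[1.0], [0.0], [1.0], [0.0]])
loss = binary_cross_entropy(p, y)
print(p.shape, loss > 0)
```

Training then amounts to adjusting W and b to drive this loss down, which NNC does automatically once the network graph and dataset CSV are in place.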

7. Results

(Result screenshots: 12sec_completed.jpeg, 12sec_result1.jpeg)

The result is simply overwhelming: even to a machine, the "topazo songs" stand out with an overwhelming recognition rate. The screenshots show training on 5,450 pieces of music data cut into 12-second clips. Training on four TESLA V100s took 14 minutes for this processing. I also tried training on a CPU, but it took more than 2 hours, so I gave up on those results. Looking back, I regret the poor time allocation. If there is a next time, I would like to use the available time well enough to try things like comparing data preprocessing methods and training more efficiently.
