TL;DR
General MIDI (GM) is a unified MIDI standard that defines a basic tone map and control changes. (From Wikipedia: https://ja.wikipedia.org/wiki/General_MIDI)
Roland then extended this standard, while staying compatible with it, as GS, and Yamaha extended it as XG. At the time, Roland's SC series was the best-selling sound module of this kind, so quite a lot of GS data was shared on the Nifty forums and elsewhere. Nostalgic...
So I wrote a script that searches for GM, GS, and XG data among the 130,000 songs linked from this reddit page (https://www.reddit.com/r/WeAreTheMusicMakers/comments/3ajwe4/the_largest_midi_collection_on_the_internet/).
GM, GS, and XG data often contains SysEx messages that configure the device (at least if the data was authored properly for those devices), so I built the script on the policy of judging each MIDI file by its SysEx.
SysEx is short for System Exclusive. This page (https://www.g200kg.com/jp/docs/dic/systemexclusive.html) explains it as follows.
It is one type of MIDI message: not a function common to all MIDI devices, but a message used to control functions, such as effects, that are specific to a particular sound module model.
So, on the assumption that MIDI data using any GM, GS, or XG feature should always contain the corresponding SysEx, I looked up the SysEx for each standard:
MIDI data standard | System Exclusive prefix |
---|---|
GM (General MIDI) | F0 7E xx 09 |
GS (Roland's GM expansion) | F0 41 xx 42 |
XG (Yamaha's GM expansion) | F0 43 xx 4C |
If a SysEx message begins with one of these byte sequences, the data can be regarded as targeting that standard. In the table above, F0 marks the beginning of SysEx. Next comes the manufacturer ID (41 for Roland, 43 for Yamaha; for GM, 7E is the Universal Non-Real Time ID rather than a manufacturer ID), xx is the device ID (a device-specific ID), and the last byte is the model ID (for Roland's GS, for example, the model ID is 42, which identifies a GS sound module; for GM, 09 is the General MIDI sub-ID). The script matches on these bytes.
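The prefix check described above can be sketched roughly as follows. This is only an illustration, not the code from the script: the names `SYSEX_PATTERNS`, `matches_prefix`, and `detect_standards` are made up here, and the input is assumed to be space-separated hex strings such as `F0 7E 7F 09 01 F7`.

```python
# Prefix patterns copied from the table; "xx" (the device ID) matches any byte.
SYSEX_PATTERNS = {
    "GM": "F0 7E xx 09",
    "GS": "F0 41 xx 42",
    "XG": "F0 43 xx 4C",
}


def matches_prefix(hex_string, pattern):
    """Return True if a space-separated hex string starts with the pattern."""
    tokens = hex_string.upper().split()
    wanted = pattern.split()
    if len(tokens) < len(wanted):
        return False
    # Compare byte by byte, treating "xx" as a wildcard.
    return all(w == "xx" or w == t for w, t in zip(wanted, tokens))


def detect_standards(hex_messages):
    """Return the set of standards whose SysEx prefix appears in the messages."""
    found = set()
    for hex_string in hex_messages:
        for name, pattern in SYSEX_PATTERNS.items():
            if matches_prefix(hex_string, pattern):
                found.add(name)
    return found
```

For example, `detect_standards(["F0 7E 7F 09 01 F7"])` (the GM System On message) returns `{"GM"}`, while a list containing only note messages returns an empty set.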
To handle MIDI in Python I use a library called mido (https://mido.readthedocs.io/en/latest/#). The repository is also named with mastering mido in mind. mido is well maintained (important) and flexible, so I think it is a great fit for handling MIDI.
The results of a run are in a file called GMMidiCheck.ipynb. While I was writing it, a friend told me, "If you write it in .py, you can run tests in CI," and I agreed, so all the functions ended up in midi_utill.py (I haven't written any tests yet, though...). Therefore, each cell of the .ipynb calls
importlib.reload(midi_utill)
to reload midi_utill.py.
The core of the processing is comparing the SysEx in each MIDI file, so every MIDI message is first converted to hex with the following function and then compared.
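import mido


def getMidiHexData(midifilename):
    """Return every message in the file as a hex string such as 'F0 7E 7F 09 01 F7'."""
    midi = mido.MidiFile(midifilename)
    MidiData = []
    # Walk every track and collect each message as space-separated hex bytes.
    for track in midi.tracks:
        for msg in track:
            MidiData.append(msg.hex())
    return MidiData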
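Since only SysEx matters for the comparison, a variant that keeps just the SysEx messages might look like the sketch below. It is not part of midi_utill.py: the names `sysex_hex_strings` and `get_sysex_hex_data` are hypothetical, and it relies on mido reporting `msg.type == "sysex"` for System Exclusive messages.

```python
def sysex_hex_strings(messages):
    """Keep only SysEx messages and return them as hex strings."""
    return [msg.hex() for msg in messages if msg.type == "sysex"]


def get_sysex_hex_data(midifilename):
    """Read a MIDI file with mido and return only its SysEx messages as hex."""
    import mido  # imported here so the helper above works without mido

    midi = mido.MidiFile(midifilename)
    # Flatten all tracks into one stream of messages before filtering.
    return sysex_hex_strings(msg for track in midi.tracks for msg in track)
```

Filtering first keeps the list short, which may help when scanning a large collection, although the hex conversion in getMidiHexData is unlikely to be the main cost.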
Along the way I also learned a lot about file and directory handling, but that is plain Python, so I will omit it here. If you are interested, please read the script.
By the way, please set the following two variables in GMMidiCheck.ipynb.
For now, the script assumes that the reddit data is in the same directory as GMMidiCheck.ipynb. Likewise, the data judged to be GM compliant is written to a directory created in that same location.
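The scan-and-copy step could be sketched like this. It is an assumption about the general shape, not the notebook's actual code: `copy_matching_midi` and its `is_match` callback are hypothetical names, and the caller supplies the check (for instance, a function that reads the file and looks for a GM, GS, or XG SysEx prefix).

```python
import shutil
from pathlib import Path


def copy_matching_midi(src_dir, dst_dir, is_match):
    """Copy every .mid/.midi file under src_dir for which is_match(path) is true."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    # sorted() materializes the listing before anything is copied.
    for path in sorted(src_dir.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in {".mid", ".midi"}:
            continue
        try:
            matched = is_match(path)
        except Exception:
            # Large collections tend to contain broken files; skip them.
            continue
        if matched:
            target = dst_dir / path.name  # files with the same name overwrite each other
            shutil.copy2(path, target)
            copied.append(target)
    return copied
```

Passing the check as a callback keeps the directory handling separate from the MIDI parsing, so either part can be tested on its own.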
Of the 130,000 songs, about 33,000 were flagged by this script as "looks like GM, GS, or XG data".
That should be plenty of training data for machine learning, shouldn't it? Moreover, since the GM tone map is public, it should also be possible to extract "drums and percussion", "monophonic instruments", "polyphonic instruments", and so on from the tracks (and since the data is GM compliant, this should be reliably doable from the program change and control change information. Probably.).
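As a rough sketch of that idea, assuming plain GM conventions (percussion on MIDI channel 10, instruments selected by program change), the messages could be grouped as follows. The helper names are hypothetical and this is not part of the script; GS and XG data can reassign the drum channel, so treat the result as a first guess.

```python
GM_DRUM_CHANNEL = 9  # MIDI channel 10; mido numbers channels from 0


def programs_by_channel(messages):
    """Map each channel to the set of program numbers it selects."""
    programs = {}
    for msg in messages:
        if msg.type == "program_change":
            programs.setdefault(msg.channel, set()).add(msg.program)
    return programs


def split_drum_notes(messages):
    """Split note messages into (drum notes, melodic notes) by channel."""
    drums, melodic = [], []
    for msg in messages:
        if msg.type in ("note_on", "note_off"):
            if msg.channel == GM_DRUM_CHANNEL:
                drums.append(msg)
            else:
                melodic.append(msg)
    return drums, melodic
```

Both helpers only read the `type`, `channel`, and `program` attributes, so they can be fed the messages of a file opened with mido.MidiFile, track by track.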
By the way, checking all 130,000 songs takes about 5 hours, so plan your runs accordingly.