background

When analyzing the video you sang, the information about "which song" is important.
However, in the case of YouTube, the song title may not be linked to the video.
It's hard to manually find the song title from there, so I thought I'd do something about it.

code

I usually processed it using regular expressions. (There is an improved version of the code at the bottom, so it's recommended to scroll to the bottom before copying)

`get_song_titles.py`



def get_song_title(raw_title):

    # ()（）[]Exclude []. Sometimes the left is half-width and the right is full-width.
    title = re.sub("[【（\(\[].+?[】）\)\]]","",raw_title)

    #If there are "" and "", extract the character string in them
    if "「" in title and "」" in title:
        title = title.split("「")[1].split("」")[0]

    if "『" in title and "』" in title:
        title = title.split("『")[1].split("』")[0]

    #I tried singing X with Y,I tried to sing and erase
    title = re.sub("To.*tried singing","",title)
    title = title.replace("tried singing", "") 

    # cover, covered,Erase the character string after covered by
    title = re.sub("[cC]over(ed)?( by)?.*", "", title)
 
    #Last/Delete after that
    if "/" in title:
        title = "/".join(title.split("/")[:-1])
    
    if "／" in title:
        title = "／".join(title.split("／")[:-1])
  
    # -If there is only one, erase the back
    if len(title.split("-")) == 2: 
        title = title.split("-")[0]

    #Erase the part that represents the collaboration member with x
    title_part_list = []
    for title_part in title.split(" "):
        if "×" not in title_part:
            title_part_list.append(title_part)
    title = " ".join(title_part_list)

    title = title.strip()

    return title


#Example when reading and using the data described later
if __name__ == "__main__":
    x = []
    y = []

    with open("sing_videos.tsv") as f:    
        for line in f:
            x.append(line.strip().split("\t")[1])

    #The first line is the header, so delete it
    del x[0]

    estimated_titles = [ get_song_titles(x) for raw_title in x ]

Performance evaluation

data

I manually labeled it using the metadata of the video I tried to sing, which I collected in the previous article (https://qiita.com/miyatsuki/items/fb933bb233d2896ca644). I have posted the data on GitHub, so please refer to it if necessary for additional tests. https://github.com/miyatsuki/VTuberNayoseDataset/blob/57fe0d785b40c19fa7b249034bdfe1fa62363743/data/sing_videos.tsv

result

Number of videos: 277 Correct answer rate: 92.42% (256/277)

It's not a big deal, but it's over 90%. Since the score is tuned while looking at the result, it is unknown whether the accuracy so far will be obtained for unknown videos.

What was useless

Pattern that erases necessary information too much
Almost if the song title contains /, it will definitely fail if it contains ()
There may be a title in [], which will definitely fail
A pattern that cannot erase unnecessary information
Weak against typo such as typographical errors and omissions
Fails if there is a symbol unrelated to the title such as ♡
If there is a lot of information other than the song title, it is easy to fail.

Video title	Estimated title	Correct answer
Natto!! -Moon!!(Tsukino Mito/iru) Parody-I tried to sing [Kou Uzuki]	Natto!! -Moon!!parody-	Moon!!
Disney Medley ｜ covered by Inui Toko	Disney Medley｜	Disney Medley
【Virtual to LIVE（covered by #Sanbaka)] Thank you for half a year of activity [Nijisanji]	] Thank you for half a year of activity	Virtual to LIVE
[I tried to sing] Farewell Itching with Kenmochi [Crotch Warrior M Zune]	Farewell Itching with Kenmochi	Farewell itching
[Renai Circulation] I tried to sing. Renai Circulation- Bakemonogatari Cover By Utako suzuka		Love circulation
[Yukitoki] cover.Eru [listening video]		Yukitoki
[I tried to sing] Anime "How many kilograms of dumbbells can I have?" OP Request Muscle [Sister Claire x Hanahata Chaika]	How many kilograms of dumbbells can you carry?	Please muscle
[Parody] Do not blame the youkai Gymnastics No. 1 [Singing statement]	Do not blame youkai Gymnastics first	Yokai Gymnastics No. 1
[Original MV] Cinderella Girl with Ryushen and Suzuka Uta/ King&Prince I tried to sing [cover]	Cinderella Girl with Ryushen and Suzuka Uta	Cinderella Girl
⚙*.。..Karakuri Pierrot/Mahiro Yukishiro [I tried to sing]	⚙*.。..Karakuri Pierrot	Karakuri clown
[Weathering with You] Grand Escape(Movie edit) feat.Toko Miura-Covered by Chima Machida & Piropar	Grand Escape feat.Toko Miura	Grand escape(Movie edit) feat.Toko Miura
[LOL part] VD&I tried to sing Blessing with G [parody]	VD&Blessing with G	Blessing
[1st Anniversary] K-ON!!U & I I tried to sing [Meiji Warabeda]	K-ON!!U & I	U＆I
♡ future base I tried to sing ♡	♡future base	future base
[Yorushika] Suddenly I tried to sing rain and cappuccino [Alice Mononobe]	Suddenly rain and cappuccino	Rain with Cappuccino
Volume note] I tried to sing the old book mansion murder case/Rion Takamiya	Volume note] Old book mansion murder case	Old book mansion murder case
White sun/ King Gnu (Covered by Kakeru Yumeoi)Nippon Television's "Innocence Innocent Lawyer" theme song [I tried to sing] [Cover] King Gnu	Innocence guilty lawyer	White sun
[I tried to sing] Haro/Hawayu [Hand-painted PV]	Halo	Halo/Hawayu
[JK playing talk] I played Oration and sang [Alice Mononobe]	Play the oration	Oration
[I tried to sing] Red trap(who loves it?) /LiSA [Kaede Higuchi cover]	Red trap	Red trap(who loves it?)
[Your name. ] Nothing/ RADWIMPS (cover)Utako Suzuka [Original PV in the Sanctuary] Nandemonaiya"Your Name"/Utako Suzuka	Nothing/RADWIMPS Utako Suzuka Nandemonaiya"Your Name"	Nothing

Retest

I increased the number of data and investigated the accuracy again

data

We increased the number of target singers (all VTubers) and collected data again (the number of target videos has increased 4.8 times). https://github.com/miyatsuki/VTuberNayoseDataset/commit/576b89b5c8a6f74744cb24c62a5d8cb77a736ea7

result

Number of videos: 1335 Correct answer rate: 75.95% (1014/1335)

Logic adjustment

Since the percentage of correct answers has dropped sharply, I tried to support a slightly special pattern. Also, since there is a pattern that other people are singing the same song and the title is acquired correctly, I tried to utilize other estimation results so that I could pick up that information.

import pandas as pd
import re

def get_song_title(raw_title):
    
    #There is a pattern called [Song title] from "Title of work", so in that case, the content of [] is used as the title.
    if "Than【" in raw_title:
        title = raw_title.split("【")[1].split("】")[0]
    else:
        title = raw_title
    
    #If there is a symbol in the header, delete it
    if title[0] == "★":
        title = title[1:]
    
    # ()（）[]Exclude []. Sometimes the left is half-width and the right is full-width.
    title = re.sub("[【（《\(\[].+?[】）》\)\]]"," ",title)

    #In the case of a pattern such as the "work name" theme song, delete that part
    for keyword in ["Theme song", "OP", "CM song"]:
        if "」{}".format(keyword) in title:
            end_index = title.index("」{}".format(keyword))
            for start_index in range(end_index, -1, -1):
                if title[start_index] == "「":
                    title = title[:start_index] + title[end_index + len(keyword) + 1:]
                    break
        
    for keyword in ["Theme song", "OP", "CM song"]:
        if "』{}".format(keyword) in title:
            end_index = title.index("』{}".format(keyword))
            for start_index in range(end_index, -1, -1):
                if title[start_index] == "『":
                    title = title[:start_index] + title[end_index + len(keyword) + 1:]
                    break
        
    #If there are "" and "", extract the character string in them
    #However, in rare cases, you may put your name in "". Ignore in that case
    if "「" in title and "」" in title:
        temp_title = title = title.split("「")[1].split("」")[0]
        if "cover" not in temp_title.lower():
            title = temp_title

    if "『" in title and "』" in title:
        temp_title = title.split("『")[1].split("』")[0]
        if "cover" not in temp_title.lower():
            title = temp_title

    #Erase the character string after singing
    title = re.sub("I tried to sing.*"," ", title)
    title = re.sub("tried singing.*"," ", title)

    # cover, covered,Erase the character string after covered by
    title = re.sub("[cC]over(ed)?( by)?.*", "", title)

    # /Delete after that
    if "/" in title:
        title = title.split("/")[0]

    if "／" in title:
        title =  title.split("／")[0]

    # -If there is, erase the back
    title = title.split("-")[0]

    #Erase the part that represents the collaboration member with x
    # #Erase 012-like expressions
    title_part_list = []
    for title_part in title.split(" "):
        if "×" not in title_part and not re.fullmatch("#[0-9]+", title_part):
            title_part_list.append(title_part)
    title = " ".join(title_part_list)

    #Remove leading and trailing whitespace
    title = title.strip()

    return title


#Video title and song title(Estimated value)And return the longest of the partially matched song titles
def get_nearest_title(video_title, music_titles):
    longest = 0
    ans = ""
    
    for music_title in music_titles:
        if len(music_title) <= longest:
            continue
        
        if music_title in video_title:
            ans = music_title
            longest = len(music_title)
    
    return ans


def decide_title(row):
    return row["estimated_title"] if len(row["estimated_title"]) > 0 else row["estimated_title2"]    


if __name__ == "__main__":

    evaluate_df = pd.read_table("sing_videos.tsv")
    evaluate_df["estimated_title"] = evaluate_df["video_title"].apply(get_song_title)

    #If the regular expression estimation result is empty, look for the most likely one from the estimation results of other videos.
    evaluate_df["estimated_title2"] = evaluate_df["video_title"].apply(
        get_nearest_title, music_titles = evaluate_df["estimated_title"].unique()
    )
    evaluate_df["estimated_title"] = evaluate_df.apply(decide_title, axis=1)
    evaluate_df = evaluate_df.drop(columns=["estimated_title2"])

Result (after logic correction)

Number of videos: 1335 Correct answer rate: 85.24% (1138/1335)

[PYTHON] Get the song name from the title of the video you tried to sing

background

code

get_song_titles.py

Performance evaluation

data

result

What was useless

Retest

data

result

Logic adjustment

Result (after logic correction)

`get_song_titles.py`