I took the straightforward approach and processed the titles with regular expressions. (An improved version of the code is at the bottom of the article, so I recommend scrolling down before copying anything.)
get_song_titles.py
import re


def get_song_title(raw_title):
    # Remove text enclosed in 【】, (), () or []. Opening and closing brackets are
    # matched independently because the left one is sometimes half-width while
    # the right one is full-width.
    title = re.sub(r"[【((\[].+?[】))\]]", "", raw_title)

    # If the title contains 「」 or 『』, extract the string inside them
    if "「" in title and "」" in title:
        title = title.split("「")[1].split("」")[0]
    if "『" in title and "』" in title:
        title = title.split("『")[1].split("』")[0]

    # Remove phrases of the form "tried singing X with Y" and plain "tried singing"
    title = re.sub("To.*tried singing", "", title)
    title = title.replace("tried singing", "")

    # Remove "cover", "covered", "covered by" and everything after them
    title = re.sub(r"[cC]over(ed)?( by)?.*", "", title)

    # Delete everything after the last slash (half-width, then full-width)
    if "/" in title:
        title = "/".join(title.split("/")[:-1])
    if "/" in title:
        title = "/".join(title.split("/")[:-1])

    # If there is exactly one "-", delete everything after it
    if len(title.split("-")) == 2:
        title = title.split("-")[0]

    # Remove space-separated parts containing ×, which denote collaboration members
    title_part_list = []
    for title_part in title.split(" "):
        if "×" not in title_part:
            title_part_list.append(title_part)
    title = " ".join(title_part_list)

    title = title.strip()
    return title


# Example of reading and using the data described below
if __name__ == "__main__":
    x = []
    # Assumes the TSV file is UTF-8 encoded
    with open("sing_videos.tsv", encoding="utf-8") as f:
        for line in f:
            x.append(line.strip().split("\t")[1])

    # The first line is the header, so delete it
    del x[0]

    estimated_titles = [get_song_title(raw_title) for raw_title in x]
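The bracket-removal step at the top of the function is easy to try in isolation. The following is a minimal sketch, assuming the character classes are meant to cover both half-width and full-width brackets, as the comment in the code suggests. The helper name `strip_brackets` and the sample titles are made up for illustration and do not come from the dataset.

```python
import re

# Illustrative pattern: any opening bracket, then the shortest possible run of
# characters, then any closing bracket. Opener and closer are not required to
# be the same kind, which tolerates mixed half-width/full-width pairs.
BRACKETED = re.compile(r"[【((\[].+?[】))\]]")


def strip_brackets(raw_title):
    """Remove bracketed segments such as 【...】, (...) or [...] from a title."""
    return BRACKETED.sub("", raw_title).strip()


# Made-up titles: the second bracket pair opens full-width and closes half-width.
print(strip_brackets("【Cover】Song Title(Singer Name)"))
print(strip_brackets("Song Title (short ver.)"))
```

Because the opener and closer are matched independently, a stray bracket can also pair with an unrelated one later in the title, which is one way a rule like this can over-delete.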
For evaluation, I manually labeled the metadata of the "I tried to sing" (cover song) videos that I collected in the previous article (https://qiita.com/miyatsuki/items/fb933bb233d2896ca644). The data is posted on GitHub, so please refer to it if you want to run additional tests. https://github.com/miyatsuki/VTuberNayoseDataset/blob/57fe0d785b40c19fa7b249034bdfe1fa62363743/data/sing_videos.tsv
Number of videos: 277 Correct answer rate: 92.42% (256/277)
The method is nothing sophisticated, but the accuracy is over 90%. That said, I tuned the rules while looking at these results, so it is unknown whether the same accuracy would hold for videos the rules have not seen.
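For reference, a correct answer rate like the one above could be computed as follows. This is only a sketch: it assumes that an estimate counts as correct when it matches the manual label exactly, which the article does not state explicitly. The function name `correct_answer_rate` and the toy data are hypothetical.

```python
def correct_answer_rate(estimated_titles, labeled_titles):
    """Return (correct, total, rate), counting exact string matches as correct."""
    if len(estimated_titles) != len(labeled_titles):
        raise ValueError("both lists must have the same length")
    correct = sum(
        estimated == labeled
        for estimated, labeled in zip(estimated_titles, labeled_titles)
    )
    total = len(labeled_titles)
    rate = correct / total if total else 0.0
    return correct, total, rate


# Toy data, not taken from the dataset.
estimated = ["Song A", "Song B feat. X", "Song C"]
labeled = ["Song A", "Song B", "Song C"]

correct, total, rate = correct_answer_rate(estimated, labeled)
print(f"Number of videos: {total} Correct answer rate: {rate:.2%} ({correct}/{total})")
```

Exact matching is strict: an estimate that differs from the label only in spacing or letter case counts as a miss under this assumption.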
Video title | Estimated title | Correct answer |
---|---|---|
Natto!! -Moon!!(Tsukino Mito/iru) Parody-I tried to sing [Kou Uzuki] | Natto!! -Moon!!parody- | Moon!! |
Disney Medley \| covered by Inui Toko | Disney Medley\| | Disney Medley |
【Virtual to LIVE(covered by #Sanbaka)] Thank you for half a year of activity [Nijisanji] | ] Thank you for half a year of activity | Virtual to LIVE |
[I tried to sing] Farewell Itching with Kenmochi [Crotch Warrior M Zune] | Farewell Itching with Kenmochi | Farewell itching |
[Renai Circulation] I tried to sing. Renai Circulation- Bakemonogatari Cover By Utako suzuka | Love circulation | |
[Yukitoki] cover.Eru [listening video] | Yukitoki | |
[I tried to sing] Anime "How many kilograms of dumbbells can I have?" OP Request Muscle [Sister Claire x Hanahata Chaika] | How many kilograms of dumbbells can you carry? | Please muscle |
[Parody] Do not blame the youkai Gymnastics No. 1 [Singing statement] | Do not blame youkai Gymnastics first | Yokai Gymnastics No. 1 |
[Original MV] Cinderella Girl with Ryushen and Suzuka Uta/ King&Prince I tried to sing [cover] | Cinderella Girl with Ryushen and Suzuka Uta | Cinderella Girl |
⚙*.。..Karakuri Pierrot/Mahiro Yukishiro [I tried to sing] | ⚙*.。..Karakuri Pierrot | Karakuri clown |
[Weathering with You] Grand Escape(Movie edit) feat.Toko Miura-Covered by Chima Machida & Piropar | Grand Escape feat.Toko Miura | Grand escape(Movie edit) feat.Toko Miura |
[LOL part] VD&I tried to sing Blessing with G [parody] | VD&Blessing with G | Blessing |
[1st Anniversary] K-ON!!U & I I tried to sing [Meiji Warabeda] | K-ON!!U & I | U&I |
♡ future base I tried to sing ♡ | ♡future base | future base |
[Yorushika] Suddenly I tried to sing rain and cappuccino [Alice Mononobe] | Suddenly rain and cappuccino | Rain with Cappuccino |
Volume note] I tried to sing the old book mansion murder case/Rion Takamiya | Volume note] Old book mansion murder case | Old book mansion murder case |
White sun/ King Gnu (Covered by Kakeru Yumeoi)Nippon Television's "Innocence Innocent Lawyer" theme song [I tried to sing] [Cover] King Gnu | Innocence guilty lawyer | White sun |
[I tried to sing] Haro/Hawayu [Hand-painted PV] | Halo | Halo/Hawayu |
[JK playing talk] I played Oration and sang [Alice Mononobe] | Play the oration | Oration |
[I tried to sing] Red trap(who loves it?) /LiSA [Kaede Higuchi cover] | Red trap | Red trap(who loves it?) |
[Your name. ] Nothing/ RADWIMPS (cover)Utako Suzuka [Original PV in the Sanctuary] Nandemonaiya"Your Name"/Utako Suzuka | Nothing/RADWIMPS Utako Suzuka Nandemonaiya"Your Name" | Nothing |
I increased the amount of data and measured the accuracy again
I increased the number of target singers (all VTubers) and collected the data again, which made the number of target videos about 4.8 times larger. https://github.com/miyatsuki/VTuberNayoseDataset/commit/576b89b5c8a6f74744cb24c62a5d8cb77a736ea7
Number of videos: 1335 Correct answer rate: 75.95% (1014/1335)
Because the correct answer rate dropped sharply, I added support for a few slightly unusual patterns. In addition, there are cases where someone else sings the same song and its title is extracted correctly from their video, so I also made use of the estimation results from other videos to pick up that information.
import pandas as pd
import re
def get_song_title(raw_title):
    # Some titles have the form: "work title" from 【song title】.
    # In that case, use the contents of 【】 as the title.
    if "Than【" in raw_title:
        title = raw_title.split("【")[1].split("】")[0]
    else:
        title = raw_title

    # If the title starts with a symbol, delete it
    # (startswith also avoids an IndexError on an empty string)
    if title.startswith("★"):
        title = title[1:]

    # Remove text enclosed in 【】, 《》, (), () or []. Opening and closing brackets
    # are matched independently because the left one is sometimes half-width
    # while the right one is full-width.
    title = re.sub(r"[【(《(\[].+?[】)》)\]]", " ", title)

    # For patterns such as: 「work name」 theme song, delete that part
    for open_bracket, close_bracket in [("「", "」"), ("『", "』")]:
        for keyword in ["Theme song", "OP", "CM song"]:
            marker = close_bracket + keyword
            if marker in title:
                end_index = title.index(marker)
                # Find the nearest opening bracket before the marker
                start_index = title.rfind(open_bracket, 0, end_index)
                if start_index != -1:
                    title = title[:start_index] + title[end_index + len(marker):]

    # If the title contains 「」 or 『』, extract the string inside them.
    # In rare cases the singer's own name is placed inside instead; ignore those.
    if "「" in title and "」" in title:
        temp_title = title.split("「")[1].split("」")[0]
        if "cover" not in temp_title.lower():
            title = temp_title
    if "『" in title and "』" in title:
        temp_title = title.split("『")[1].split("』")[0]
        if "cover" not in temp_title.lower():
            title = temp_title

    # Delete "I tried to sing" / "tried singing" and everything after it
    title = re.sub("I tried to sing.*", " ", title)
    title = re.sub("tried singing.*", " ", title)

    # Remove "cover", "covered", "covered by" and everything after them
    title = re.sub(r"[cC]over(ed)?( by)?.*", "", title)

    # Delete the first slash and everything after it (half-width, then full-width)
    if "/" in title:
        title = title.split("/")[0]
    if "/" in title:
        title = title.split("/")[0]

    # If there is a "-", delete everything after it
    title = title.split("-")[0]

    # Remove space-separated parts containing ×, which denote collaboration
    # members, and expressions like #012
    title_part_list = []
    for title_part in title.split(" "):
        if "×" not in title_part and not re.fullmatch(r"#[0-9]+", title_part):
            title_part_list.append(title_part)
    title = " ".join(title_part_list)

    # Remove leading and trailing whitespace
    title = title.strip()
    return title
# Given a video title and the estimated song titles, return the longest
# song title that appears as a substring of the video title
def get_nearest_title(video_title, music_titles):
    longest = 0
    ans = ""
    for music_title in music_titles:
        # Skip candidates that cannot beat the current best (also skips "")
        if len(music_title) <= longest:
            continue
        if music_title in video_title:
            ans = music_title
            longest = len(music_title)
    return ans


# Prefer the regular-expression estimate; fall back to the borrowed title if it is empty
def decide_title(row):
    return row["estimated_title"] if len(row["estimated_title"]) > 0 else row["estimated_title2"]
if __name__ == "__main__":
    evaluate_df = pd.read_table("sing_videos.tsv")
    evaluate_df["estimated_title"] = evaluate_df["video_title"].apply(get_song_title)

    # If the regular-expression estimate is empty, look for the most likely
    # title among the estimates obtained from the other videos
    evaluate_df["estimated_title2"] = evaluate_df["video_title"].apply(
        get_nearest_title, music_titles=evaluate_df["estimated_title"].unique()
    )
    evaluate_df["estimated_title"] = evaluate_df.apply(decide_title, axis=1)
    evaluate_df = evaluate_df.drop(columns=["estimated_title2"])
Number of videos: 1335 Correct answer rate: 85.24% (1138/1335)
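To make the fallback idea concrete, here is a small self-contained sketch of borrowing a title that was already extracted from other videos. It is not the code above but a compact equivalent of the same idea (pick the longest known title contained in the video title). The name `longest_contained_title` and the sample titles are hypothetical.

```python
def longest_contained_title(video_title, known_titles):
    """Return the longest known title that occurs in video_title, or "" if none."""
    # Empty strings are excluded because "" is a substring of every title.
    candidates = [t for t in known_titles if t and t in video_title]
    return max(candidates, key=len, default="")


# Made-up titles standing in for estimates obtained from other videos.
known_titles = ["Blue", "Blue Sky", "Red Moon", ""]

# "Blue" and "Blue Sky" both occur in the title; the longer one is chosen.
print(longest_contained_title("Blue Sky sung by someone", known_titles))

# No known title occurs here, so the result is an empty string.
print(repr(longest_contained_title("Weekly chat stream", known_titles)))
```

Preferring the longest match reduces the chance that a short title wins merely because it happens to be part of a longer one, although a short known title can still match an unrelated video by coincidence.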