Using deep learning, I made software that reads out game-related sentences aloud while I play Minecraft.
Something like this ↓
It's hard to fit everything into one article, so I'll split it into several. This one covers **the part that prepares the data**.
What I want is text that's suitable to display and read aloud during the game: text related to in-game objects (zombies, creepers, and so on).
First, ask Google.
Search Google for things like "zombie Minecraft" and collect the text of the top sites in the results.
The related search terms that appear at the bottom of the results page get searched as well.
The article below let me do this nicely: [Python] Get Google search results without access restrictions
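As a rough sketch (not necessarily the method from that article), collecting the top result URLs could look like this with the third-party googlesearch-python package; the query and result count here are made up for illustration:

```python
# pip install googlesearch-python  (one of several packages with this name;
# the exact signature varies between them)
from googlesearch import search

# Hypothetical query: collect the URLs of the top search results.
urls = list(search('ゾンビ マインクラフト', num_results=20))
print(urls)
```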
Now that I have the URLs of the websites I want, I fetch them with everyone's favorite, Selenium.
However, I don't verify the contents of each URL beforehand, so if an error occurs during loading, the program would stop. To cope:
- Timeouts are retried a few times.
- Alert modals are automatically accepted.
Anything else just gets skipped for now!! (A sketch follows below.)
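A minimal sketch of such a fetch loop; the article doesn't show the actual code, so the retry count, timeout value, and structure here are my assumptions:

```python
from selenium import webdriver
from selenium.common.exceptions import (
    TimeoutException, UnexpectedAlertPresentException, WebDriverException)

driver = webdriver.Chrome()
driver.set_page_load_timeout(15)  # assumed timeout in seconds

def fetch_html(url, retries=3):   # assumed retry count
    for _ in range(retries):
        try:
            driver.get(url)
            return driver.page_source
        except UnexpectedAlertPresentException:
            try:
                driver.switch_to.alert.accept()  # auto-OK the alert, then retry
            except Exception:
                pass
        except TimeoutException:
            continue                             # retry on timeout
        except WebDriverException:
            break                                # anything else: skip this URL
    return None
```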
The HTML gets saved, and then the text is extracted from it with BeautifulSoup and the like.
I discarded lines that didn't start with a Japanese character or an opening bracket. (There were leftover tags and dates I wanted to erase but never could erase completely.) A sketch of this step follows.
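A sketch of the extraction and line filtering; the file name is hypothetical, and exactly which leading brackets were allowed is my guess:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical input: one of the HTML files saved in the previous step.
with open('page.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

for tag in soup(['script', 'style']):
    tag.decompose()                      # drop script/style contents

lines = [l.strip() for l in soup.get_text('\n').split('\n') if l.strip()]

# Keep only lines that start with a Japanese character or an opening bracket
# (which brackets to allow is my assumption).
starts_ja = re.compile(r'^[ぁ-んァ-ン一-鿐「『（\[]')
lines = [l for l in lines if starts_ja.match(l)]
```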
There is a dataset distributed here: about 10 years of Nico Nico comments, nicely organized. It includes not only the comments but also metadata such as video names, tags, and descriptions.
**I did it! Let's extract it right away......**
The extraction doesn't finish. It really doesn't end; there are just too many files. **I can't wait for this.**
No, wait. The only data I want is the comments that look related to Minecraft.
Couldn't I check the metadata tags and process only the comments of Minecraft-related videos, with or even without extracting the archive? Using Python's `zipfile` module, it was possible, as sketched below.
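A sketch of reading individual archive members with `zipfile` instead of extracting everything; the archive name, member names, and JSON layout here are assumptions, not the dataset's documented format:

```python
import json
import zipfile

# Hypothetical archive and layout: one metadata JSON per video.
with zipfile.ZipFile('nico_metadata.zip') as zf:
    for name in zf.namelist():
        if not name.endswith('.json'):
            continue
        with zf.open(name) as f:          # read one member, no full extraction
            meta = json.loads(f.read().decode('utf-8'))
        tags = meta.get('tags', [])
        if any('マインクラフト' in t or 'Minecraft' in t for t in tags):
            print(name)                   # a Minecraft-related video: keep it
```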
Thanks to the Nico Nico dataset, I got a ton of data!!
For word segmentation, I used GiNZA.
You can tokenize like this ↓
```python
import spacy

nlp = spacy.load('ja_ginza')

# path: the text file collected above, one sentence per line
with open(path, mode='r', encoding='utf-8', errors='ignore') as f:
    text = f.read().split('\n')

docs = nlp.pipe(text, disable=['ner'])
for doc in docs:
    for sent in doc.sents:
        for word in sent:
            pass  # hogehoge: process each token here
```
You can turn off pipeline components you don't need with the `disable` argument of `nlp.pipe()`.
I then cleaned up the sentences (sketched below):

- Deleted sentences with 3 or fewer words.
- Deleted sentences in which at most one word contained Japanese characters (matched with the regex `r'[ぁ-んァ-ン一-鿐]'`).
- Used `set` to delete sentences consisting of a single repeated word.
- Deleted kanji-only and hiragana-only sentences with regular expressions.
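A sketch of these filters as one function; the exact thresholds are my reading of the bullets above:

```python
import re

JA_CHAR       = re.compile(r'[ぁ-んァ-ン一-鿐]')  # hiragana / katakana / kanji
KANJI_ONLY    = re.compile(r'^[一-鿐]+$')
HIRAGANA_ONLY = re.compile(r'^[ぁ-ん]+$')

def keep_sentence(words):
    """words: surface tokens of one sentence from GiNZA."""
    if len(words) <= 3:                                  # too short
        return False
    if sum(1 for w in words if JA_CHAR.search(w)) <= 1:  # barely any Japanese
        return False
    if len(set(words)) == 1:                             # one word repeated
        return False
    text = ''.join(words)
    if KANJI_ONLY.match(text) or HIRAGANA_ONLY.match(text):
        return False                                     # kanji/hiragana only
    return True
```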
Reference: [Python] Summary of regular expression notation (re module)
Collecting data is hard. But it seems there are many ways to get creative about making the data usable.
There's a project called AI Dungeon 2 that was trained on a text-adventure fan site. Being able to play while it generates the story automatically is amazing. I wonder if there's a Japanese text-adventure site that could be scraped.