Using deep learning, I made software that reads out game-related sentences aloud while I play Minecraft.
Something like this ↓
It's hard to fit everything into one article, so I'll split it into several. This one covers **the part that prepares the data**.
What I want is text that's suitable to display and read aloud during the game: text related to in-game objects (zombies, creepers, and so on).
First, ask Google.
Search Google for things like "zombie Minecraft" and collect the text of the top sites in the results.
The related search terms that appear at the bottom of the results page get searched as well.
The article below let me do this nicely: [Python] Get Google search results without access restrictions
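As a rough sketch (not necessarily the method from that article), collecting the top result URLs could look like this with the third-party googlesearch-python package; the query and result count here are made up for illustration:

```python
# pip install googlesearch-python  (one of several packages with this name;
# the exact signature varies between them)
from googlesearch import search

# Hypothetical query: collect the URLs of the top search results.
urls = list(search('ゾンビ マインクラフト', num_results=20))
print(urls)
```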
Now that I have the URLs of the websites I want, I fetch them with everyone's favorite, Selenium.
However, I don't verify the contents of each URL beforehand, so if an error occurs during loading, the program would stop. To cope:
- Timeouts are retried a few times.
- Alert modals are automatically accepted.
Anything else just gets skipped for now!! (A sketch follows below.)
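A minimal sketch of such a fetch loop; the article doesn't show the actual code, so the retry count, timeout value, and structure here are my assumptions:

```python
from selenium import webdriver
from selenium.common.exceptions import (
    TimeoutException, UnexpectedAlertPresentException, WebDriverException)

driver = webdriver.Chrome()
driver.set_page_load_timeout(15)  # assumed timeout in seconds

def fetch_html(url, retries=3):   # assumed retry count
    for _ in range(retries):
        try:
            driver.get(url)
            return driver.page_source
        except UnexpectedAlertPresentException:
            try:
                driver.switch_to.alert.accept()  # auto-OK the alert, then retry
            except Exception:
                pass
        except TimeoutException:
            continue                             # retry on timeout
        except WebDriverException:
            break                                # anything else: skip this URL
    return None
```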
The HTML gets saved, and then the text is extracted from it with BeautifulSoup and the like.
I discarded lines that didn't start with a Japanese character or an opening bracket. (There were leftover tags and dates I wanted to erase but never could erase completely.) A sketch of this step follows.
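A sketch of the extraction and line filtering; the file name is hypothetical, and exactly which leading brackets were allowed is my guess:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical input: one of the HTML files saved in the previous step.
with open('page.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

for tag in soup(['script', 'style']):
    tag.decompose()                      # drop script/style contents

lines = [l.strip() for l in soup.get_text('\n').split('\n') if l.strip()]

# Keep only lines that start with a Japanese character or an opening bracket
# (which brackets to allow is my assumption).
starts_ja = re.compile(r'^[ぁ-んァ-ン一-鿐「『（\[]')
lines = [l for l in lines if starts_ja.match(l)]
```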
There is a dataset distributed here: about 10 years of Nico Nico comments, nicely organized. It includes not only the comments but also metadata such as video names, tags, and descriptions.
**I did it! Let's extract it right away......**
The extraction doesn't finish. It really doesn't end; there are just too many files. **I can't wait for this.**
No, wait. The only data I want is the comments that look related to Minecraft.
Couldn't I check the metadata tags and process only the comments of Minecraft-related videos, with or even without extracting the archive? Using Python's `zipfile` module, it was possible, as sketched below.
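A sketch of reading individual archive members with `zipfile` instead of extracting everything; the archive name, member names, and JSON layout here are assumptions, not the dataset's documented format:

```python
import json
import zipfile

# Hypothetical archive and layout: one metadata JSON per video.
with zipfile.ZipFile('nico_metadata.zip') as zf:
    for name in zf.namelist():
        if not name.endswith('.json'):
            continue
        with zf.open(name) as f:          # read one member, no full extraction
            meta = json.loads(f.read().decode('utf-8'))
        tags = meta.get('tags', [])
        if any('マインクラフト' in t or 'Minecraft' in t for t in tags):
            print(name)                   # a Minecraft-related video: keep it
```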
Thanks to the Nico Nico dataset, I got a ton of data!!
For word segmentation, I used GiNZA.
You can tokenize like this ↓
```python
import spacy

nlp = spacy.load('ja_ginza')

# path: the text file collected above, one sentence per line
with open(path, mode='r', encoding='utf-8', errors='ignore') as f:
    text = f.read().split('\n')

docs = nlp.pipe(text, disable=['ner'])
for doc in docs:
    for sent in doc.sents:
        for word in sent:
            pass  # hogehoge: process each token here
```
You can turn off pipeline components you don't need with the `disable` argument of `nlp.pipe()`.
I then cleaned up the sentences (sketched below):

- Deleted sentences with 3 or fewer words.
- Deleted sentences in which at most one word contained Japanese characters (matched with the regex `r'[ぁ-んァ-ン一-鿐]'`).
- Used `set` to delete sentences consisting of a single repeated word.
- Deleted kanji-only and hiragana-only sentences with regular expressions.
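A sketch of these filters as one function; the exact thresholds are my reading of the bullets above:

```python
import re

JA_CHAR       = re.compile(r'[ぁ-んァ-ン一-鿐]')  # hiragana / katakana / kanji
KANJI_ONLY    = re.compile(r'^[一-鿐]+$')
HIRAGANA_ONLY = re.compile(r'^[ぁ-ん]+$')

def keep_sentence(words):
    """words: surface tokens of one sentence from GiNZA."""
    if len(words) <= 3:                                  # too short
        return False
    if sum(1 for w in words if JA_CHAR.search(w)) <= 1:  # barely any Japanese
        return False
    if len(set(words)) == 1:                             # one word repeated
        return False
    text = ''.join(words)
    if KANJI_ONLY.match(text) or HIRAGANA_ONLY.match(text):
        return False                                     # kanji/hiragana only
    return True
```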
Reference: [Python] Summary of regular expression notation (re module)
Collecting data is hard. But it seems there are many ways to get creative about making the data usable.
There's a project called AI Dungeon 2 that was trained on a text-adventure fan site. Being able to play while it generates the story automatically is amazing. I wonder if there's a Japanese text-adventure site that could be scraped.