This is a memo to myself as I read the book "Introduction to Natural Language Processing Application Development in 15 Steps". This time I note my own takeaways from Chapter 4, Step 15. (There is not much to write this time, though.)

Preparation

- Personal Mac: macOS Mojave version 10.14.6
- Docker: version 19.03.2 for both Client and Server

Chapter overview

As the final step of the book, this is a collection of tips for finding public datasets that suit your purpose, or building your own, when doing the kind of natural language processing and machine learning covered so far.

- Dataset collection
- Crowdsourcing

15.2 Dataset collection

Use of public datasets

Dataset	Features
Wikipedia	A web encyclopedia; dump files of all its data are officially published.
Aozora Bunko	Text files of literary works whose copyright has expired can be downloaded for free.
livedoor news corpus	A subset of livedoor News articles, provided under the Creative Commons Attribution-NoDerivs license.
Japanese WordNet	A database that represents the hierarchical structure of word meanings; it can be used for preprocessing and morphological analysis.

In addition to these, some datasets are paid, require an application before use, or restrict how they may be used.
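As a rough illustration of using a public dataset, here is a minimal sketch that reads the livedoor news corpus into texts and category labels. It assumes the archive has already been extracted into a directory with one subdirectory per category, and that each article file starts with a URL line and a timestamp line followed by the title and body. The function name `load_livedoor_corpus` is hypothetical, and the layout should be checked against the README shipped with the corpus.

```python
from pathlib import Path


def load_livedoor_corpus(root):
    """Return (texts, labels) from a directory laid out as root/<category>/<article>.txt."""
    texts, labels = [], []
    # Only subdirectories are categories; top-level files such as a README are ignored.
    for category_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for article in sorted(category_dir.glob("*.txt")):
            if article.name == "LICENSE.txt":
                continue  # per-category license file, not an article
            lines = article.read_text(encoding="utf-8").splitlines()
            # Assumed layout: line 1 = URL, line 2 = timestamp, then title and body.
            # Drop the two metadata lines and keep the title plus body as the text.
            texts.append("\n".join(lines[2:]).strip())
            labels.append(category_dir.name)
    return texts, labels
```

The category directory name doubles as the label, so the result can be passed directly to a classifier pipeline as supervised data.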

Crawling

If no public dataset fits your needs, you might consider crawling websites to collect data. Unlabeled data (for unsupervised learning) is easy to collect this way.

- **Many web services prohibit mass access for crawling purposes in their terms of service.**
- **The terms of use of the website you collect data from may restrict the purposes for which its content can be used.**
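With those cautions in mind, a crawler should at least limit its request rate and skip pages the site asks crawlers not to visit. The following is a minimal sketch using only the Python standard library. Checking robots.txt is my own addition and is not mentioned in the book; it is a separate mechanism from the terms of use and does not replace reading them. The crawler name `my-research-crawler` and the function names are hypothetical.

```python
import time
import urllib.request
import urllib.robotparser

USER_AGENT = "my-research-crawler"  # hypothetical crawler name


def build_robots(robots_txt_lines):
    """Parse the lines of a robots.txt file that has already been downloaded."""
    robots = urllib.robotparser.RobotFileParser()
    robots.parse(robots_txt_lines)
    return robots


def fetch_with_urllib(url):
    """Download one page, identifying the crawler through the User-Agent header."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def polite_crawl(urls, robots, fetch=fetch_with_urllib, delay_seconds=1.0):
    """Fetch only URLs that robots.txt allows, pausing between requests."""
    pages = {}
    for url in urls:
        if not robots.can_fetch(USER_AGENT, url):
            continue  # skip anything the site asks crawlers not to visit
        pages[url] = fetch(url)
        time.sleep(delay_seconds)  # avoid sending requests in rapid succession
    return pages
```

In real use, `RobotFileParser.set_url()` followed by `read()` can download robots.txt directly instead of passing the lines in by hand. The `fetch` parameter is injectable so the crawl logic can be tried without touching the network.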

15.3 Crowdsourcing

Crawling is free, but it is difficult to collect supervised (labeled) data that way. Crowdsourcing costs money (crowd workers must be paid), but you can define the task yourself and have many workers process many tasks in parallel at low cost.

Since building a Japanese dataset requires the work of Japanese speakers, you will inevitably end up using domestic (Japanese) services such as CrowdWorks and Lancers.
