Trying the book "Introduction to Natural Language Processing Application Development in 15 Steps": Chapter 4, Step 15 memo "Data Collection"

Contents

This is a memo for myself as I read "Introduction to Natural Language Processing Application Development in 15 Steps". This time I note my own takeaways from Chapter 4, Step 15. (Not that there was much to write.)

Preparation

- Personal Mac: macOS Mojave version 10.14.6
- Docker: version 19.03.2 for both Client and Server

Chapter overview

As the final step of the book, this is a collection of tips for the natural language processing and machine learning covered so far: how to search public data for a dataset that fits your purpose, or how to build your own.

- Dataset collection
- Crowdsourcing

15.2 Dataset collection

Use of public datasets

| Dataset | Features |
| --- | --- |
| Wikipedia | A web encyclopedia; dump files of all its data are officially published. |
| Aozora Bunko | Text files of literary works whose copyright has expired can be downloaded for free. |
| livedoor news corpus | A subset of Livedoor News articles, provided under a Creative Commons Attribution-NoDerivatives license. |
| Japanese WordNet | A database that expresses the hierarchical structure of word meanings; it can be used for preprocessing and morphological analysis. |

In addition to these, some datasets are paid, some require an application before use, and some restrict how they may be used.
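Text from these sources usually needs some cleanup before it can be used. As a minimal sketch, assuming Aozora Bunko's usual notation (ruby readings in `《…》`, an optional `｜` marking where the ruby base starts, and editor notes in `［＃…］`), the markup could be stripped like this. The function name `strip_aozora_markup` is hypothetical and not taken from the book.

```python
import re

# Ruby readings follow the base text, e.g. 吾輩《わがはい》.
RUBY = re.compile(r"《[^》]*》")
# Editor annotations look like ［＃「猫」に傍点］.
ANNOTATION = re.compile(r"［＃[^］]*］")
# Marks the start of the ruby base text when it is ambiguous.
RUBY_START = "｜"


def strip_aozora_markup(text: str) -> str:
    """Remove ruby readings, annotations, and ruby-start marks (hypothetical helper)."""
    text = RUBY.sub("", text)
    text = ANNOTATION.sub("", text)
    return text.replace(RUBY_START, "")


sample = "吾輩《わがはい》は猫である。［＃「猫」に傍点］名前はまだ無い。"
print(strip_aozora_markup(sample))  # 吾輩は猫である。名前はまだ無い。
```

Real files may contain other notation (headers, colophons), so treat this only as a starting point.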

Crawling

If no public dataset fits your needs, you might consider crawling websites to collect data. Unlabeled data is easy to collect this way.

- **Many web services prohibit mass access for crawling purposes in their terms of service**
- **The terms of use of the website you collect data from may restrict how its content can be used**
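One way to respect such restrictions in code is to consult a site's `robots.txt` before fetching anything. The following is a rough sketch using the standard library's `urllib.robotparser`; the user agent name, the URLs, and the inline `robots.txt` content are made up for illustration, and a site's terms of use still have to be read separately because `robots.txt` does not cover them.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "my-nlp-crawler"  # hypothetical crawler name


def build_robot_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(urls, rules, user_agent=USER_AGENT):
    """Keep only the URLs that robots.txt lets this user agent fetch."""
    return [url for url in urls if rules.can_fetch(user_agent, url)]


def polite_delay(rules, user_agent=USER_AGENT, default=1.0):
    """Seconds to wait between requests; falls back to an assumed default."""
    delay = rules.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


# Made-up robots.txt content for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

rules = build_robot_rules(EXAMPLE_ROBOTS)
candidates = [
    "https://example.com/articles/1",
    "https://example.com/private/2",
]
targets = allowed_urls(candidates, rules)
delay = polite_delay(rules)
print(targets, delay)
```

An actual crawler would then fetch each URL in `targets` and sleep for `delay` seconds between requests, which keeps the access rate low.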

15.3 Crowdsourcing

Crawling is free, but it is difficult to collect labeled (supervised) data that way. Crowdsourcing costs money (crowd workers must be paid), but you can define the tasks yourself and request many tasks from many workers in parallel at low cost.

Because building a Japanese dataset requires work by Japanese speakers, you will inevitably end up using domestic Japanese services such as CrowdWorks and Lancers.
