This is a memo for myself as I read "Introduction to Natural Language Processing Applications in 15 Steps". This time I note my own takeaways from Chapter 4, Step 15. (Though I wrote very little.)
- Personal Mac: macOS Mojave version 10.14.6
- Docker version: 19.03.2 for both Client and Server
As the final chapter of the book, this step is a collection of tips on finding public datasets that fit your purpose, or building your own, for the natural language processing and machine learning covered so far.
- Dataset collection
- Crowdsourcing
Dataset | Features |
---|---|
Wikipedia | A web encyclopedia; dump files of all its data are officially published. |
Aozora Bunko | Text files of literary works whose copyright has expired can be downloaded for free. |
livedoor news corpus | A subset of livedoor News articles, provided under the Creative Commons Attribution-NoDerivatives license. |
Japanese WordNet | A database that expresses the hierarchical structure of word meanings; it can be used for preprocessing and morphological analysis. |
In addition to these, some datasets are paid, require an application for use, or restrict how they may be used.
If no public dataset fits your needs, you might consider crawling websites to collect data. Unlabeled data is easy to collect this way.
- **Many web services prohibit mass access for crawling purposes in their terms of use**
- **The terms of use of the website from which data is collected may restrict the purposes for which its content can be used**
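One machine-readable part of these restrictions is a site's `robots.txt`. The sketch below uses Python's standard `urllib.robotparser` to check whether a URL may be fetched. The rules, the crawler name `my-crawler`, and the `example.com` URLs are made up for illustration, and passing this check does not replace reading the site's terms of use, which `robots.txt` does not capture.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network access.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    parser = build_parser(EXAMPLE_ROBOTS)
    for url in ("https://example.com/articles/1",
                "https://example.com/private/data"):
        print(url, is_allowed(parser, "my-crawler", url))
```

For a real site you would call `set_url()` and `read()` on the parser to load the actual `robots.txt`, and also pause between requests to avoid the mass access that the sites prohibit.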
Crawling is free, but it is difficult to collect labeled (supervised) data that way. Crowdsourcing costs money (crowd workers must be paid), but you can define the task yourself and request many tasks from many workers in parallel at low cost.
Building a Japanese dataset requires the work of Japanese speakers, so domestic services (such as CrowdWorks and Lancers) will inevitably be used.
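When the same task is given to several workers in parallel, their answers have to be merged into one label per item. The following is a minimal sketch of one common approach, majority voting; it is my own illustration rather than something prescribed by the book or by any crowdsourcing service, and the tuple layout `(item_id, worker_id, label)` and the function name are assumptions.

```python
from collections import Counter, defaultdict


def aggregate_by_majority(answers):
    """Merge worker answers into one label per item by majority vote.

    answers: iterable of (item_id, worker_id, label) tuples (assumed layout).
    Returns a dict mapping item_id to the winning label, or None on a tie.
    """
    votes = defaultdict(Counter)
    for item_id, _worker_id, label in answers:
        votes[item_id][label] += 1

    result = {}
    for item_id, counter in votes.items():
        ranked = counter.most_common()
        top_label, top_count = ranked[0]
        # A tie for first place means there is no clear majority.
        if len(ranked) > 1 and ranked[1][1] == top_count:
            result[item_id] = None
        else:
            result[item_id] = top_label
    return result


if __name__ == "__main__":
    sample = [
        ("text-1", "worker-a", "positive"),
        ("text-1", "worker-b", "positive"),
        ("text-1", "worker-c", "negative"),
        ("text-2", "worker-a", "negative"),
        ("text-2", "worker-b", "positive"),
    ]
    print(aggregate_by_majority(sample))
```

Items that come back as `None` could be sent out again to additional workers or reviewed by hand.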