Introduction

100 Knocks of Language Processing has long been popular as a problem set for learning the basics of natural language processing, and its 2020 edition was released on 4/6!! This is the first revision in five years. For those who finished the 2015 edition but are curious about the new one, those who are disappointed that the Qiita articles on the 2015 edition are no longer useful, and those who were halfway through the 2015 edition and are about to lose heart over the 2020 edition, I will summarize what has changed. Of course, since this summary is unofficial, I may have overlooked some changes.

Summary of revision

As of 4/7, according to the official Update History, the following three points have changed significantly.

-- Problems on deep neural networks were added
  -- Chapters 8, 9, and 10 consist entirely of newly created problems.
-- An English version was released (up to problem 39)
  -- Problems 40 and beyond will be released one by one (per the author's Twitter).
-- The old Chapter 6 (Processing English Texts) was moved to the English version
  -- The corresponding part of the English version has not been released yet. It seems to be under construction (GitHub).

As I will explain later, as far as I could tell by comparing the Japanese versions, Chapters 1 to 5 of the 2015 edition are essentially unchanged, Chapter 8 has a different task, Chapters 9 and 10 have been simplified, and the rest has been removed. Apart from that, the URLs of the files to download have all changed (perhaps to deal with character encoding or link destination problems?).

What has changed specifically

Old Chapter 1: Warm-Up

-- No change
  -- The English version also seems to be the same, except that the sample string "Patatokukashi" (パタトクカシーー) has become "schooled".

Old Chapter 2: Unix Command Basics

-- The title changed to "Unix Commands"
-- The tsv file has changed. The number of lines has also increased.
-- No change in the question text

Old Chapter 3: Regular Expressions

-- No change in the question text
-- The Japanese Wikipedia data has been updated
-- The English version uses English Wikipedia data

Old Chapter 4: Morphological Analysis

-- The old No. 33 "Sahen Nouns" has been deleted, and a new No. 37 "Top 10 words that frequently co-occur with 'cat'" has been added
-- The URL of the text file has been updated
-- The English version of this chapter is titled "POS tagging" and comes with a script to download Alice's Adventures in Wonderland
-- MeCab is specified in the Japanese version, but no particular POS tagger is specified in the English version
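To give a rough idea of what the new No. 37 asks for, here is a minimal sketch of counting co-occurring words. It assumes the text has already been split into sentences and tokenized (in the Japanese version this would be the MeCab output), and it treats "co-occurring" as "appearing in the same sentence", which is my own assumption rather than the official definition. The function name `top_cooccurring` and the toy sentences are made up for illustration.

```python
from collections import Counter

def top_cooccurring(sentences, target, n=10):
    """Return the n words that most often share a sentence with `target`.

    `sentences` is a list of token lists. Treating a sentence as the
    co-occurrence window is an assumption made for this sketch.
    """
    counts = Counter()
    for words in sentences:
        if target in words:
            # Count every other word in a sentence that contains the target.
            counts.update(w for w in words if w != target)
    return counts.most_common(n)

# Hypothetical, already-tokenized toy input (not the real corpus).
sentences = [
    ["cat", "sat", "mat"],
    ["cat", "ate", "fish"],
    ["dog", "ate", "bone"],
    ["cat", "chased", "fish"],
]
top = top_cooccurring(sentences, "cat", n=2)
```

With real data you would probably also want to filter out particles and punctuation before counting, since they would otherwise dominate the ranking.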

Old Chapter 5: Dependency Analysis

-- No change in the question text

Old Chapter 6: Processing English Texts

-- Moved to the English version, as described in the summary above

Old Chapter 7: Database

-- Deleted
  -- I think this is a reasonable change, because the old Chapter 7 was the only chapter that did not feel like NLP.

Old Chapter 8: Machine Learning

-- Moved to the new Chapter 6
-- However, there are many small changes
  -- Perhaps they were made with the deep learning methods of the later chapters in mind.
-- The task changed from polarity analysis (binary classification) to category classification (multiclass classification)
  -- When I try to open the download page for the new version's data, I get a warning that the certificate may have expired.
-- The old No. 78 used k-fold cross-validation, but the new version uses holdout validation instead
  -- I think this is a reasonable change because the dataset is larger.
  -- Accordingly, the new version splits the data into training, validation, and test data in No. 50.
    -- The validation data seems to be used only in No. 58; is that okay...?
-- The old No. 71 "Stop Words" has been deleted
-- The old Nos. 72-75 and 77 carry over to the new version almost unchanged
  -- There are minor wording changes, such as a different Japanese term being used for "feature".
  -- Following stop words, stemming has also been dropped.
    -- A sign of the times, perhaps?
-- The old No. 76 "Labeling" has been deleted, and the new No. 55 "Creating a Confusion Matrix" has been added in its place
  -- The purpose of the old problem was questionable to begin with.
-- The old No. 79 "Drawing a Precision-Recall Graph" has been deleted; in its place, the new No. 58 draws a graph of accuracy against the regularization parameter
-- The new No. 59 "Hyperparameter Search" has been added
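For readers who have not seen holdout validation or a confusion matrix before, the three-way split at No. 50 and the matrix at No. 55 can be sketched roughly as follows. This is plain Python rather than whatever library you end up using, the 80/10/10 ratio is an assumption on my part, and the category labels and toy predictions are invented for illustration.

```python
import random
from collections import Counter

def holdout_split(items, percents=(80, 10, 10), seed=0):
    """Shuffle items and cut them into three parts by integer percentages.

    The 80/10/10 default is an assumption for this sketch. The last part
    takes whatever remains, so no item is lost to rounding.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    n_first = len(items) * percents[0] // 100
    n_second = len(items) * percents[1] // 100
    return (items[:n_first],
            items[n_first:n_first + n_second],
            items[n_first + n_second:])

def confusion_matrix(gold, pred, labels):
    """Rows are gold labels, columns are predicted labels."""
    counts = Counter(zip(gold, pred))
    return [[counts[(g, p)] for p in labels] for g in labels]

def accuracy(matrix):
    """Fraction of examples on the diagonal of the confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Hypothetical toy data: 100 example ids and a handful of predictions.
parts = holdout_split(range(100))
labels = ["business", "sports", "tech"]
gold = ["business", "business", "sports", "tech"]
pred = ["business", "sports", "sports", "tech"]
matrix = confusion_matrix(gold, pred, labels)
```

The point of keeping a separate middle part is that you can tune things like the regularization parameter on it and still report the final score on data the model has never influenced.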

Old Chapters 9 and 10: Vector Space Method

-- Moved to the new Chapter 7 "Word Vectors"
-- In the old version you trained the word vectors yourself, but the new version uses pre-trained ones
-- Other than that, it is the same as the old Chapter 10
-- The evaluation data had been broken, but it has been updated
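Once pre-trained vectors are loaded, much of the work with them comes down to comparing vectors. As a small illustration, here is a sketch of cosine similarity and a nearest-word lookup in plain Python. The three two-dimensional vectors are made-up toy values, not real pre-trained embeddings, and the helper names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors):
    """Return the other word whose vector is closest to `word` by cosine."""
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine_similarity(vectors[word], vectors[w]))

# Toy vectors invented for this example; real ones have hundreds of dimensions.
vectors = {
    "cat": [1.0, 0.2],
    "dog": [0.9, 0.3],
    "car": [0.1, 1.0],
}
nearest = most_similar("cat", vectors)
```

In practice you would let a library load the vectors and do this lookup efficiently, but the underlying computation is the same.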

New Chapter 8: Neural Net

-- Build a single-layer (feedforward) neural network using PyTorch, TensorFlow, etc.
  -- The problems proceed through matrix creation -> prediction -> loss and gradient calculation -> mini-batches -> GPU -> multiple layers, and so on.
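To make the first few steps of that progression concrete (prediction, loss and gradient, mini-batch update), here is a minimal sketch of a single-layer softmax classifier. It is deliberately written in plain Python instead of PyTorch or TensorFlow, so it is not how the problems expect you to solve them; the function names, the learning rate, and the two-example toy batch are all assumptions for illustration. The GPU and multi-layer steps are out of scope here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(W, x):
    """W is a list of rows (one per class); returns class probabilities."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in W])

def loss_and_grad(W, x, y):
    """Cross-entropy loss and its gradient w.r.t. W for one example."""
    probs = predict(W, x)
    loss = -math.log(probs[y])
    # d(loss)/dW[k][j] = (probs[k] - 1[k == y]) * x[j]
    grad = [[(p - (1.0 if k == y else 0.0)) * xi for xi in x]
            for k, p in enumerate(probs)]
    return loss, grad

def sgd_step(W, batch, lr=0.1):
    """Average the gradients over a mini-batch and update W in place.

    Returns the mean loss measured before the update.
    """
    total_loss = 0.0
    acc = [[0.0] * len(W[0]) for _ in W]
    for x, y in batch:
        loss, grad = loss_and_grad(W, x, y)
        total_loss += loss
        for k, row in enumerate(grad):
            for j, g in enumerate(row):
                acc[k][j] += g
    for k in range(len(W)):
        for j in range(len(W[0])):
            W[k][j] -= lr * acc[k][j] / len(batch)
    return total_loss / len(batch)

# Toy mini-batch: two features, two classes (values invented for this sketch).
batch = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]
W = [[0.0, 0.0], [0.0, 0.0]]
losses = [sgd_step(W, batch) for _ in range(50)]
```

A framework computes the gradient for you through automatic differentiation, so the hand-written `loss_and_grad` above is exactly the part you would not write yourself in the actual problems.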

New Chapter 9: RNN, CNN

-- Implement (simple) RNN and CNN models, which were once commonly used in natural language processing research
-- No. 89 has you try transfer learning using a pre-trained language model such as BERT
  -- This is the approach currently used in a wide range of natural language processing tasks.
  -- It will probably come down to running a library by following sample code, but I think that is an important experience too.

New Chapter 10: Machine Translation

-- Perform neural machine translation using existing tools
  -- Machine translation is a central field of natural language processing, and I think that even just using the tools is an important experience.
-- No. 99 has you build a demo that displays machine translation results in the browser
  -- The old No. 69 "Creating a Web Application" was a hard problem for those with no background, but I think it has now settled into a reasonable form.

Personal impression

The new edition reflects the latest trends in natural language processing research, and I think it has become even better suited as a problem set for training new members of a research lab!

Digression

-- Recommended for those who want to tackle the 100 knocks without knowing Python → Introduction to Python with 100 Knocks of Language Processing
-- Recommended for those who are wondering what to use when solving the new Chapter 10 → A Rough Introduction to Neural Machine Translation Libraries

[PYTHON] The 2020 edition of 100 Knocks of Language Processing has been released! What has changed?