100 Amateur Language Processing Knock: Summary

This is a summary of my challenge records for 100 Language Processing Knock 2015.

:warning: **This is not a challenge record of 100 Language Processing Knock 2020. The target is the old 2015 edition. Please note!** :bangbang:

Environment used for the challenge

Ubuntu 16.04 LTS + Python 3.5.2 :: Anaconda 4.1.1 (64-bit). (Only Problem 00 and Problem 01 use Python 2.7.)

Chapter 1: Warm-up

Review some advanced topics in programming languages while working on subjects dealing with text and strings.

Link to post | What I mainly learned, what I learned from the comments, etc.
Problem 00 | slice, print()
Problem 01 | slice
Problem 02 | Anaconda, zip(), itertools.zip_longest(), prefixing an iterable with * to unpack it into separate arguments, str.join(), functools.reduce()
Problem 03 | len(), list.append(), str.split(), list.count()
Problem 04 | enumerate(), hash randomization enabled by default since Python 3.3
Problem 05 | n-grams, range()
Problem 06 | set(), set.union(), set.intersection(), set.difference()
Problem 07 | str.format(), string.Template, string.Template.substitute()
Problem 08 | chr(), str.islower(), input(), ternary operator
Problem 09 | typoglycemia, random.shuffle()
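As a small taste of the warm-up problems, here is a minimal sketch of character n-grams and the set operations from Problems 05 and 06. The sample strings are placeholders of my own choosing, not necessarily the official inputs.

```python
def ngrams(seq, n):
    """Return the list of n-grams of seq (works for strings and for lists of words)."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

# Character bigrams of two placeholder strings.
x = set(ngrams("paraphrase", 2))
y = set(ngrams("paragraph", 2))

print(x | y)   # union        (set.union)
print(x & y)   # intersection (set.intersection)
print(x - y)   # difference   (set.difference)
```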

Chapter 2: UNIX Command Basics

Experience UNIX tools that are useful for research and data analysis. By reimplementing them, you will experience the ecosystem of existing tools while improving your programming skills.

Link to post | What I mainly learned, what I learned from the comments, etc.
Problem 10 | [UNIX commands] Japanese localization of man, open(), shell scripts, [UNIX commands] wc, chmod, file execute permission
Problem 11 | str.replace(), [UNIX commands] sed, tr, expand
Problem 12 | io.TextIOBase.write(), [UNIX commands] cut, diff, short and long options of UNIX commands
Problem 13 | [UNIX commands] paste, str.rstrip(), Python's definition of "whitespace"
Problem 14 | [UNIX commands] echo, read, head
Problem 15 | io.IOBase.readlines(), [UNIX commands] tail
Problem 16 | [UNIX commands] split, math.ceil(), str.format(), floor (truncating) division with //
Problem 17 | set.add(), [UNIX commands] cut, sort, uniq
Problem 18 | lambda expressions
Problem 19 | list comprehensions, itertools.groupby(), list.sort()
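To show the reimplementation style of this chapter, here is a minimal Python stand-in for the UNIX head command (Problem 14). The command-line handling and the file name in the comment are illustrative assumptions, not the exact code from my posts.

```python
import sys

def head(path, n):
    """Print the first n lines of a file, like `head -n N` does."""
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= n:
                break
            print(line, end="")   # each line already ends with a newline

if __name__ == "__main__":
    # Illustrative usage: python head.py somefile.txt 5
    head(sys.argv[1], int(sys.argv[2]))
```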

Chapter 3: Regular Expressions

By applying regular expressions to the markup of Wikipedia pages, we extract various kinds of information and knowledge.

Link to post | What I mainly learned, what I learned from the comments, etc.
Problem 20 | JSON manipulation, gzip.open(), json.loads()
Problem 21 | regular expressions, raw string notation, raise, re.compile(), re.regex.findall()
Problem 22 | [Regular expressions] greedy and non-greedy matching
Problem 23 | [Regular expressions] backreferences
Problem 24 |
Problem 25 | [Regular expressions] positive lookahead, sorted()
Problem 26 | re.regex.sub()
Problem 27 |
Problem 28 |
Problem 29 | use of web services, urllib.request.Request(), urllib.request.urlopen(), bytes.decode()
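The chapter's workflow is roughly: read a gzip-compressed file of JSON-encoded articles, pull out one article, and run regular expressions over its wiki markup. The sketch below assumes a file with one JSON object per line containing "title" and "text" keys; the file name and article title are placeholders.

```python
import gzip
import json
import re

def load_article(path, title):
    """Return the wiki text of the article with the given title, or None."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            article = json.loads(line)
            if article["title"] == title:
                return article["text"]
    return None

text = load_article("articles.json.gz", "SomeCountry")   # placeholder names
if text is not None:
    # Raw string notation and a compiled pattern (Problem 21-style):
    # lines that declare a MediaWiki category.
    pattern = re.compile(r"^\[\[Category:(.*?)\]\]$", re.MULTILINE)
    for name in pattern.findall(text):
        print(name)
```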

Chapter 4: Morphological analysis

Apply the morphological analyzer MeCab to Natsume Soseki's novel "I Am a Cat" to obtain the statistics of the words in the novel.

Link to post | What I mainly learned, what I learned from the comments, etc.
Problem 30 | conda, pip, apt, [MeCab] installation and usage, morphological analysis, generators, yield
Problem 31 | [Morphological analysis] surface form
Problem 32 | [Morphological analysis] base form, list comprehensions
Problem 33 | [Morphological analysis] sahen-connection nouns, list comprehensions with double loops
Problem 34 |
Problem 35 | [Morphological analysis] noun concatenation
Problem 36 | collections.Counter, collections.Counter.update()
Problem 37 | [matplotlib] installation, bar charts, Japanese text display, axis range, grid display
Problem 38 | [matplotlib] histograms
Problem 39 | [matplotlib] scatter plots, Zipf's law
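As a sketch of how this chapter's morpheme handling fits together, the generator below parses MeCab's default output format and feeds collections.Counter for a word-frequency ranking (in the spirit of Problems 30 and 36). The input file name is an assumption; adjust it to wherever you saved the pre-analyzed text.

```python
from collections import Counter

def read_mecab(path):
    """Yield one morpheme dict per line of MeCab's default output:
    surface<TAB>pos,pos1,pos2,pos3,conj_type,conj_form,base,reading,pron
    Sentences are delimited by 'EOS' lines."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line == "" or line == "EOS" or "\t" not in line:
                continue
            surface, feature_str = line.split("\t")
            features = feature_str.split(",")
            yield {"surface": surface,
                   "pos": features[0],
                   "base": features[6] if len(features) > 6 else surface}

# Frequency of base forms, most common first (Problem 36-style).
counter = Counter(m["base"] for m in read_mecab("neko.txt.mecab"))  # assumed file name
print(counter.most_common(10))
```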

Chapter 5: Dependency Analysis

Apply the dependency parser CaboCha to "I Am a Cat" and get hands-on experience working with dependency trees and syntactic analysis.

Link to post | What I mainly learned, what I learned from the comments, etc.
Problem 40 | [CaboCha] installation and usage, __str__(), __repr__(), repr()
Problem 41 | [Dependency analysis] phrases (bunsetsu) and dependencies
Problem 42 |
Problem 43 |
Problem 44 | [pydot-ng] installation, directed graphs, and how to check the source of modules made in Python
Problem 45 | [Dependency analysis] case, [UNIX commands] grep
Problem 46 | [Dependency analysis] case frames / case grammar
Problem 47 | [Dependency analysis] functional verbs
Problem 48 | [Dependency analysis] paths from nouns to the root
Problem 49 | [Dependency analysis] dependency paths between nouns
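The backbone of this chapter is reading CaboCha's lattice output into chunk objects and following the dependency links. Below is a minimal sketch of that idea; the Chunk class in my posts is richer, and the -f1 line format and file name here are assumptions to check against your own output.

```python
class Chunk:
    """A minimal phrase (bunsetsu) holder in the spirit of Problems 40 and 41."""
    def __init__(self, dst):
        self.morphs = []   # raw MeCab-style morpheme lines
        self.dst = dst     # index of the chunk this one depends on (-1 = none)

def read_cabocha(path):
    """Yield one sentence (a list of Chunk) at a time from CaboCha -f1 output.
    Assumed format: '* <idx> <dst>D <head/func> <score>' chunk headers,
    MeCab-style morpheme lines, and 'EOS' between sentences."""
    chunks = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith("* "):
                dst = int(line.split()[2].rstrip("D"))
                chunks.append(Chunk(dst))
            elif line == "EOS":
                if chunks:
                    yield chunks
                chunks = []
            else:
                chunks[-1].morphs.append(line)

# Print the dependency structure of the first sentence (illustrative file name).
for sentence in read_cabocha("neko.txt.cabocha"):
    for i, chunk in enumerate(sentence):
        print(i, "->", chunk.dst)
    break
```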

Chapter 6: Processing English Text

Get an overview of various fundamental natural language processing techniques through English text processing with Stanford Core NLP.

Link to post | What I mainly learned, what I learned from the comments, etc.
Problem 50 | generators
Problem 51 |
Problem 52 | stems, stemming, how to use the Snowball stemmer
Problem 53 | [Stanford Core NLP] installation and usage, subprocess.run(), XML parsing, xml.etree.ElementTree.ElementTree.parse(), xml.etree.ElementTree.ElementTree.iter()
Problem 54 | [Stanford Core NLP] part of speech, lemma, XML parsing, xml.etree.ElementTree.Element.findtext()
Problem 55 | [Stanford Core NLP] named entities, XPath, xml.etree.ElementTree.Element.iterfind()
Problem 56 | [Stanford Core NLP] coreference
Problem 57 | [Stanford Core NLP] dependencies, [pydot-ng] directed graphs
Problem 58 | [Stanford Core NLP] subject, predicate, object
Problem 59 | [Stanford Core NLP] phrase structure parsing, S-expressions, recursive calls, sys.setrecursionlimit(), threading.stack_size()
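Most of this chapter boils down to walking the XML that Stanford Core NLP writes. The sketch below pulls the word, lemma, and part of speech of every token with xml.etree.ElementTree (Problems 53 and 54); the element names and the output file name are what I remember from the chapter, so treat them as assumptions to verify against your own file.

```python
import xml.etree.ElementTree as ET

tree = ET.parse("nlp.txt.xml")   # assumed Core NLP output file name
root = tree.getroot()

# Every <token> element carries <word>, <lemma>, and <POS> children.
for token in root.iter("token"):
    word = token.findtext("word")
    lemma = token.findtext("lemma")
    pos = token.findtext("POS")
    print("{}\t{}\t{}".format(word, lemma, pos))
```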

Chapter 7: Database

Learn how to build and search databases using a key-value store (KVS) and NoSQL. We will also develop a demo system using CGI.

Link to post | What I mainly learned, what I learned from the comments, etc.
Problem 60 | [LevelDB] installation and usage, str.encode(), bytes.decode()
Problem 61 | [LevelDB] search, Unicode code points, ord()
Problem 62 | [LevelDB] enumeration
Problem 63 | JSON manipulation, json.dumps()
Problem 64 | [MongoDB] installation and usage, interactive shell, bulk insert, indexes
Problem 65 | [MongoDB] search, handling ObjectId and other types not covered by the JSON conversion table
Problem 66 |
Problem 67 |
Problem 68 | [MongoDB] sort
Problem 69 | web servers, CGI, HTML escaping, html.escape(), html.unescape(), [MongoDB] searching on multiple conditions
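For the MongoDB problems, the essential round trip is: connect, insert documents, build an index, and query. The sketch below uses pymongo against a locally running server; the database name, collection name, and documents are made-up examples, not the chapter's artist data set.

```python
from pymongo import MongoClient

client = MongoClient("localhost", 27017)      # assumes a local MongoDB server
collection = client["test_db"]["artists"]

# Bulk insert (Problem 64-style) and an index for fast lookups by name.
collection.insert_many([
    {"name": "Band A", "tags": ["rock"]},
    {"name": "Band B", "tags": ["rock", "pop"]},
])
collection.create_index("name")

print(collection.find_one({"name": "Band B"}))     # search by a single key
for doc in collection.find({"tags": "rock"}):      # search by array membership
    print(doc["name"])
```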

Chapter 8: Machine Learning

Build a sentiment analyzer (positive/negative classifier) using machine learning. You will also learn how to evaluate the method.

Link to post | What I mainly learned, what I learned from the comments, etc.
Problem 70 | [Machine learning] automatic classification, labels, supervised vs. unsupervised learning
Problem 71 | stop words, assertions, assert
Problem 72 | [Machine learning] features
Problem 73 | [NumPy] installation, matrix operations, [Machine learning] logistic regression, vectorization, hypothesis function, sigmoid function, objective function, steepest descent, learning rate and number of iterations
Problem 74 | [Machine learning] prediction
Problem 75 | [Machine learning] feature weights, [NumPy] getting the indices of a sorted result
Problem 76 |
Problem 77 | accuracy, precision, recall, F1 score
Problem 78 | [Machine learning] 5-fold cross-validation
Problem 79 | [matplotlib] line graphs
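The core of Problems 73 and 74 is vectorized logistic regression trained by steepest descent. Here is a generic textbook-style sketch with NumPy on a tiny synthetic data set, not the exact code or features from my posts.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, learning_rate=0.1, epochs=1000):
    """Batch steepest descent on the logistic-regression objective.
    X: (m, n) feature matrix whose first column is all ones (bias term).
    y: (m,) vector of 0/1 labels."""
    theta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(epochs):
        h = sigmoid(X @ theta)              # hypothesis for every example
        gradient = X.T @ (h - y) / m        # gradient of the log loss
        theta -= learning_rate * gradient   # descent step
    return theta

# Tiny synthetic example: a bias column plus one feature.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = train_logistic_regression(X, y)
print(sigmoid(X @ theta))   # predicted probabilities (Problem 74-style)
```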

Chapter 9: Vector Space Methods (I)

Build a word-context co-occurrence matrix from a large corpus and learn vectors that represent the meanings of words. The word vectors are then used to compute word similarities and analogies.

Link to post | What I mainly learned, what I learned from the comments, etc.
Problem 80 | word vectorization, bz2.open()
Problem 81 | [Word vectors] handling compound words
Problem 82 |
Problem 83 | object serialization (pickling), pickle.dump(), pickle.load()
Problem 84 | [Word vectors] word-context matrix, PPMI (positive pointwise mutual information), [SciPy] installation, handling sparse matrices, serialization, collections.OrderedDict
Problem 85 | principal component analysis (PCA), [scikit-learn] installation, PCA
Problem 86 |
Problem 87 | cosine similarity
Problem 88 |
Problem 89 | additive composition, analogies
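Two of the chapter's building blocks are PPMI weighting of a word-context co-occurrence matrix (Problem 84) and cosine similarity between word vectors (Problem 87). The sketch below works on a tiny dense NumPy matrix purely for illustration; the chapter itself uses SciPy sparse matrices because the real matrix is huge.

```python
import numpy as np

def ppmi(C):
    """Positive pointwise mutual information of a co-occurrence matrix C
    (rows = target words, columns = context words)."""
    total = C.sum()
    row = C.sum(axis=1, keepdims=True)
    col = C.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore"):
        pmi = np.log2(C * total / (row * col))
    return np.maximum(pmi, 0.0)   # clamp negative / undefined PMI to zero

def cosine_similarity(a, b):
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# A toy 3x3 co-occurrence matrix.
C = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
M = ppmi(C)
print(cosine_similarity(M[0], M[1]))   # similarity of the first two word vectors
```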

Chapter 10: Vector Space Methods (II)

Use word2vec to learn vectors that represent the meanings of words, and evaluate them against gold-standard data. You will also experience clustering and vector visualization.

Link to post | What I mainly learned, what I learned from the comments, etc.
Problem 90 | [word2vec] installation and usage
Problem 91 |
Problem 92 |
Problem 93 |
Problem 94 |
Problem 95 | Spearman's rank correlation coefficient, dynamically adding members to instances, exponentiation with **
Problem 96 |
Problem 97 | classification, clustering, K-means, [scikit-learn] K-means
Problem 98 | hierarchical clustering, Ward's method, dendrograms, [SciPy] Ward's method, dendrograms
Problem 99 | t-SNE, [scikit-learn] t-SNE, [matplotlib] labeled scatter plots
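As an example of the clustering problems, here is a scikit-learn K-means sketch in the spirit of Problem 97. In the actual knock the vectors come from a trained word2vec model; the random vectors and word list below are placeholders so the snippet runs on its own.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
words = ["tokyo", "paris", "cat", "dog", "river", "mountain"]   # placeholder words
vectors = rng.rand(len(words), 300)    # stand-in for 300-dimensional word vectors

kmeans = KMeans(n_clusters=2, random_state=0)
labels = kmeans.fit_predict(vectors)   # cluster label for each word

for word, label in zip(words, labels):
    print("{}\t{}".format(word, label))
```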

After 100 knocks

It took eight months, but I managed to make it through all 100 knocks. I am very grateful to Dr. Okazaki for publishing such a wonderful set of problems together with the data corpus.

I was also greatly encouraged by the comments, edit requests, likes, stocks, follows, and mentions on blogs and social media. Thanks to everyone, I was able to keep going to the end. Thank you very much.

I hope the articles I posted will be helpful to those who take on the challenge after me.
