[PYTHON] Study natural language processing with Kikagaku

This post covers what I learned while studying natural language processing on Kikagaku, a site where you can learn about deep learning for free.

Execution environment

・macOS
・Python 3.6 (Anaconda)
・VS Code

Citations

[Illustration! Thorough explanation of how to use Python Beautiful Soup! (select, find, find_all, install, scraping, etc.)](https://ai-inter1.com/beautifulsoup_1/)
[Python] Case conversion of character strings (lower, upper functions)
[Cloning a repository](https://docs.github.com/en/free-pro-team@latest/github/creating-cloning-and-archiving-repositories/cloning-a-repository)
Comparison of morphological analyzers at the end of 2019
Preparing the environment for using MeCab on Mac
Morphological analysis with MeCab on Mac
Settings for using JUMAN++ in Python's pyenv environment on Mac
I tried morphological analysis with pyknp (JUMAN++)
Running JUMAN++ with Python

Things I didn't understand and looked into

What is Beautiful Soup?

This was my first encounter with Beautiful Soup. It is a library that extracts only the necessary information from HTML documents. HTML on the web is wrapped in tags such as `div` and `h1`, but these tags get in the way when parsing sentences, so Beautiful Soup is used to pull out just the text without them.
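As a minimal sketch of that idea (assuming the `bs4` package is installed, and using an HTML string I made up for illustration):

```python
# Strip tags from an HTML string with Beautiful Soup.
from bs4 import BeautifulSoup

html = "<html><div><h1>Deep learning from scratch</h1></div></html>"
soup = BeautifulSoup(html, "html.parser")

# get_text() returns only the text content, with all tags removed.
print(soup.get_text())  # → Deep learning from scratch
```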

What is the lower () function?

There are a `lower` function and an `upper` function, which convert a string to **lowercase** and **uppercase** respectively.
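For example (these are built-in Python string methods, so nothing extra needs to be installed):

```python
# str.lower() and str.upper() return new strings; the original is unchanged.
text = "Natural Language Processing"
print(text.lower())  # → natural language processing
print(text.upper())  # → NATURAL LANGUAGE PROCESSING
```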

Morphological analysis

I installed MeCab and JUMAN++ because they are said to be good for morphological analysis. MeCab is probably the best in terms of speed, and if accuracy is all you care about, JUMAN++ seems to be good. Installing MeCab was pretty easy.

$ brew install mecab
$ brew install mecab-ipadic
$ pip install mecab-python3
$ git clone url

For the URL after `git clone`, paste the URL copied from the GitHub repository.

import MeCab

# Use the mecab-ipadic-NEologd dictionary cloned above
m = MeCab.Tagger('-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd')
text = '<html>Deep learning from scratch</html>'
print(m.parse(text))

MeCab is written like this. By the way, inside the parentheses of `Tagger()` you can specify options: for example, `-d` followed by a path switches the dictionary, and `-Owakati` outputs the text split into words separated by spaces.

JUMAN++ also needed pyknp to be installed before it could be used from Python.

$ brew install jumanpp
$ pip install pyknp

This completes the installation of JUMAN++ and pyknp. Next, I will write about how to use JUMAN++ from Python.

from pyknp.juman.juman import Juman

juman = Juman()
# analysis() runs morphological analysis and returns a result object
result = juman.analysis("Foreigners to vote")
# mrph_list() yields one morpheme object per word
for mrph in result.mrph_list():
    print(mrph.midasi, mrph.yomi)

JUMAN++ is used from Python like this. I'm running JUMAN++, but apparently it's fine to write it as `Juman` in the code. Besides `midasi` (surface form) and `yomi` (reading) in the `print` part, you can also use `mrph.genkei`, `mrph.hinsi`, `mrph.bunrui`, `mrph.katuyou1`, `mrph.katuyou2`, `mrph.imis`, and `mrph.repname`.

Things I looked into that aren't specific to natural language processing

I used `split`, a function that breaks text up word by word. At first I thought it was specialized for natural language processing, but it is an ordinary Python string method. **`split` breaks a sentence into a list of pieces at a delimiter.**

However, if you only use `split`, the whitespace after each comma is left as it is, which doesn't look very good, so I also used a function called `strip`. Using it, the surrounding whitespace can be removed.
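Putting the two together (both are built-in string methods; the sample sentence is my own):

```python
# split(",") alone keeps the space that follows each comma.
text = "apple, banana, cherry"
parts = text.split(",")
print(parts)  # → ['apple', ' banana', ' cherry']

# strip() removes the leading/trailing whitespace from each piece.
cleaned = [p.strip() for p in parts]
print(cleaned)  # → ['apple', 'banana', 'cherry']
```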

Summary

This time, the morphological analysis part took a long time. However, I think it is essential knowledge for natural language processing going forward, so I'm glad I could learn it carefully.
