[PYTHON] Study natural language processing with Kikagaku

This post covers what I learned while studying natural language processing on Kikagaku, a site where you can learn about deep learning for free.

Execution environment

・macOS
・Python 3.6 (Anaconda)
・VS Code

Citations

[Illustration! Thorough explanation of how to use Python Beautiful Soup! (select, find, find_all, install, scraping, etc.)](https://ai-inter1.com/beautifulsoup_1/)
[Python] Case conversion of character strings (lower, upper functions)
[Cloning a repository](https://docs.github.com/en/free-pro-team@latest/github/creating-cloning-and-archiving-repositories/cloning-a-repository)
Comparison of morphological analyzers at the end of 2019
Preparing the environment for using MeCab on Mac
Morphological analysis with MeCab on Mac
Settings for using JUMAN++ in Python's pyenv environment on Mac
I tried morphological analysis with pyknp (JUMAN++)
Running JUMAN++ with Python

Things I didn't understand and looked into

What is Beautiful Soup?

This was my first encounter with Beautiful Soup. It is a library that extracts only the necessary information from HTML documents. HTML on the web is wrapped in tags such as `div` and `h1`, but these tags get in the way when parsing sentences, so Beautiful Soup is used to pull out just the text without them.
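As a minimal sketch of that idea (assuming the `bs4` package is installed, and using an HTML string I made up for illustration):

```python
# Strip tags from an HTML string with Beautiful Soup.
from bs4 import BeautifulSoup

html = "<html><div><h1>Deep learning from scratch</h1></div></html>"
soup = BeautifulSoup(html, "html.parser")

# get_text() returns only the text content, with all tags removed.
print(soup.get_text())  # → Deep learning from scratch
```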

What is the lower () function?

There are a `lower` function and an `upper` function, which convert a string to **lowercase** and **uppercase** respectively.
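For example (these are built-in Python string methods, so nothing extra needs to be installed):

```python
# str.lower() and str.upper() return new strings; the original is unchanged.
text = "Natural Language Processing"
print(text.lower())  # → natural language processing
print(text.upper())  # → NATURAL LANGUAGE PROCESSING
```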

Morphological analysis

I installed MeCab and JUMAN++ because they are said to be good for morphological analysis. MeCab is probably the best in terms of speed, and if accuracy is all you care about, JUMAN++ seems to be good. Installing MeCab was pretty easy.

$ brew install mecab
$ brew install mecab-ipadic
$ pip install mecab-python3
$ git clone url

For the URL after `git clone`, paste the URL copied from the GitHub repository.

import MeCab

# Use the mecab-ipadic-NEologd dictionary cloned above
m = MeCab.Tagger('-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd')
text = '<html>Deep learning from scratch</html>'
print(m.parse(text))

MeCab is written like this. By the way, inside the parentheses of `Tagger()` you can specify options: for example, `-d` followed by a path switches the dictionary, and `-Owakati` outputs the text split into words separated by spaces.

JUMAN++ also needed pyknp to be installed before it could be used from Python.

$ brew install jumanpp
$ pip install pyknp

This completes the installation of JUMAN++ and pyknp. Next, I will write about how to use JUMAN++ from Python.

from pyknp.juman.juman import Juman

juman = Juman()
# analysis() runs morphological analysis and returns a result object
result = juman.analysis("Foreigners to vote")
# mrph_list() yields one morpheme object per word
for mrph in result.mrph_list():
    print(mrph.midasi, mrph.yomi)

JUMAN++ is used from Python like this. I'm running JUMAN++, but apparently it's fine to write it as `Juman` in the code. Besides `midasi` (surface form) and `yomi` (reading) in the `print` part, you can also use `mrph.genkei`, `mrph.hinsi`, `mrph.bunrui`, `mrph.katuyou1`, `mrph.katuyou2`, `mrph.imis`, and `mrph.repname`.

Things I looked into that aren't specific to natural language processing

I used `split`, a function that breaks text up word by word. At first I thought it was specialized for natural language processing, but it is an ordinary Python string method. **`split` breaks a sentence into a list of pieces at a delimiter.**

However, if you only use `split`, the whitespace after each comma is left as it is, which doesn't look very good, so I also used a function called `strip`. Using it, the surrounding whitespace can be removed.
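Putting the two together (both are built-in string methods; the sample sentence is my own):

```python
# split(",") alone keeps the space that follows each comma.
text = "apple, banana, cherry"
parts = text.split(",")
print(parts)  # → ['apple', ' banana', ' cherry']

# strip() removes the leading/trailing whitespace from each piece.
cleaned = [p.strip() for p in parts]
print(cleaned)  # → ['apple', 'banana', 'cherry']
```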

Summary

This time, the morphological analysis part took a long time. However, I think it is essential knowledge for natural language processing going forward, so I'm glad I could learn it carefully.
