[PYTHON] ■ [Google Colaboratory] Preprocessing of Natural Language Processing & Morphological Analysis (janome)

  1. Read Data by "with open" method

Let's read Ryunosuke Akutagawa's "Hana" ("The Nose") from Aozora Bunko. The file is encoded in shift_jis, so the encoding has to be specified when opening it.

#Read the text file in Python, specifying its encoding
with open('/hana.txt', mode='r', encoding='shift_jis') as f: 
  nose_hana = f.read()

print(nose_hana)
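Why the encoding argument matters can be checked with a small round trip. The following is only a sketch: `read_text` is a hypothetical helper name, and the temporary file with made-up content stands in for hana.txt, which is not included here.

```python
import os
import tempfile

def read_text(path, encoding="shift_jis"):
    """Read a whole text file with an explicit encoding (hypothetical helper)."""
    with open(path, mode="r", encoding=encoding) as f:
        return f.read()

# Create a small Shift_JIS file standing in for hana.txt (sample content is made up).
sample = "鼻\n芥川龍之介\n"
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(sample.encode("shift_jis"))
    sample_path = tmp.name

# Reading with the matching encoding returns the original text.
text = read_text(sample_path)
print(text)

# Decoding the same bytes as UTF-8 fails, which is why the encoding must be stated.
try:
    read_text(sample_path, encoding="utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)

os.remove(sample_path)  # clean up the temporary file
```

If the file were saved in another encoding, only the `encoding` argument would need to change.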


  2. Preprocessing of "Hana"
#Data preprocessing
import re

nose = re.sub('《[^》]+》', '', nose_hana)    #Delete ruby (readings enclosed in 《》)
nose = re.sub('[|―  「」\n]', '', nose)      #Delete |, ―, spaces, 「」 and line breaks
nose = re.sub('[ ]', '', nose)                #Delete half-width spaces
nose = re.sub('[\u3000]', '', nose)           #Delete full-width spaces (\u3000)

sentence_end = '。'

nose_list = nose.split(sentence_end)          #Split into sentences at each 。
nose_list.pop()                               #Drop the text left over after the last 。
nose_list = [x+sentence_end for x in nose_list]   #Put 。 back at the end of each sentence

print(nose_list)
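To see what each step does, the same preprocessing can be tried on a short string. This is a minimal sketch: `to_sentences` is a hypothetical function name that wraps the steps above, and the sample text is made up in the Aozora Bunko ruby style rather than taken from the actual file.

```python
import re

def to_sentences(text, sentence_end="。"):
    """Strip Aozora-style ruby and whitespace, then split into sentences."""
    text = re.sub("《[^》]+》", "", text)    # remove ruby such as 《きょう》
    text = re.sub("[|―「」\n]", "", text)   # remove |, ―, 「」 and line breaks
    text = re.sub("[ \u3000]", "", text)     # remove half-width and full-width spaces
    parts = text.split(sentence_end)
    parts.pop()                              # drop whatever follows the final 。
    return [p + sentence_end for p in parts]

# Made-up sample containing ruby, both kinds of space, brackets and a line break.
sample = "今日《きょう》は 晴れ。\n「明日《あした》は\u3000雨」。"
sentences = to_sentences(sample)
print(sentences)  # ['今日は晴れ。', '明日は雨。']
```

Note that `pop()` removes the last piece unconditionally. When the text ends with 。 that piece is an empty string, otherwise it is whatever trailing text follows the final sentence.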


  3. Wakati-gaki (word segmentation)

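from janome.tokenizer import Tokenizer

s = Tokenizer() #Instantiation

for sentence in nose_list:
  #wakati=True yields only the surface forms; list() makes them printable
  print(list(s.tokenize(sentence, wakati=True)))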


  4. Analysis of the "wakati" results
#collections.Counter counts how often each word appears
import collections
from janome.tokenizer import Tokenizer

s = Tokenizer() #Instantiation
words = []
for sentence in nose_list:
  words += s.tokenize(sentence, wakati=True)   #Collect the words of every sentence

c = collections.Counter(words)
print(c)
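Printing the whole Counter gets long for a full story, so it may be easier to look at only the most frequent words. The sketch below uses a hand-written token list in place of the janome output (these are not real analysis results), and the punctuation set to exclude is an assumption chosen for illustration.

```python
import collections

# Hand-written tokens standing in for the janome output (not real analysis results).
words = ["鼻", "は", "長い", "。", "鼻", "が", "短い", "。", "鼻", "は", "鼻", "。"]
counter = collections.Counter(words)

# most_common(n) returns the n most frequent (token, count) pairs, highest first.
top3 = counter.most_common(3)
print(top3)  # [('鼻', 4), ('。', 3), ('は', 2)]

# Dropping punctuation before counting keeps symbols out of the ranking.
punctuation = {"。", "、"}
content = collections.Counter(w for w in words if w not in punctuation)
top2 = content.most_common(2)
print(top2)  # [('鼻', 4), ('は', 2)]
```

The same calls should work on the `words` list built above, since it is a plain list of strings.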

Reference

  1. Installation of morphological analysis tool (janome)
