3. Natural language processing with Python 1-1. Word N-gram

⑴ Reading text data

from google.colab import files
uploaded = files.upload()

with open('Neko.txt', mode='rt', encoding='utf-8') as f:
    read_text = f.read()
nekotxt = read_text

print(nekotxt)

⑵ Morphological analysis by MeCab

!apt install aptitude
!aptitude install mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file -y
!pip install mecab-python3==0.7
import MeCab
tagger = MeCab.Tagger("-Owakati")
nekotxt = tagger.parse(nekotxt)

print(nekotxt)

nekotxt = nekotxt.split()
print(nekotxt)

⑶ Generation of N-gram dictionary

from collections import Counter
import numpy as np
from numpy.random import rand
string = nekotxt

# Tokens to exclude from the N-grams
delimiter = ['「', '」', '…', ' ']

# List of 2-word sequences (bigrams)
double = list(zip(string[:-1], string[1:]))
double = filter((lambda x: not((x[0] in delimiter) or (x[1] in delimiter))), double)

# List of 3-word sequences (trigrams)
triple = list(zip(string[:-2], string[1:-1], string[2:]))
triple = filter((lambda x: not((x[0] in delimiter) or (x[1] in delimiter) or (x[2] in delimiter))), triple)

# Count the occurrences of each N-gram to build the dictionaries
dic2 = Counter(double)
dic3 = Counter(triple)
for u,v in dic2.items():
    print(u, v)

for u,v in dic3.items():
    print(u, v)
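
The 2-word and 3-word lists above follow the same pattern: zip the token list against shifted copies of itself, drop any tuple containing an excluded token, and count. A minimal sketch of that pattern for arbitrary N, using a toy English token list so it runs without MeCab. The function name `ngram_counts` and the sample sentence are illustrative assumptions, not part of the original code.

```python
from collections import Counter

def ngram_counts(tokens, n, exclude=()):
    """Count the n-grams in `tokens`, skipping any n-gram that contains an excluded token."""
    # n copies of the token list, each starting one token later than the previous one
    shifted = [tokens[i:] for i in range(n)]
    # zip stops at the shortest copy, so only complete n-grams are produced
    grams = zip(*shifted)
    return Counter(g for g in grams if not any(t in exclude for t in g))

# Toy data (assumption): stands in for the MeCab-tokenized text
tokens = "the cat sat on the mat and the cat slept".split()
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3, exclude={"and"})

print(bigrams[("the", "cat")])
print(len(trigrams))
```

With `n=2` and `n=3` this should produce the same kind of dictionaries as `dic2` and `dic3`, keyed by tuples of tokens.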

⑷ Definition of sentence generation method

def nextword(words, dic):
    ## ➀ Number of elements in the seed list words (2 for 2-gram, 3 for 3-gram)
    grams = len(words)

    ## ➁ Extract the matching entries from the N-gram dictionary dic
    # For 2 words
    if grams == 2:
        matcheditems = list(filter(
            (lambda x: x[0][0] == words[1]), # 1st word matches
            dic.items()))
    # For 3 words
    else:
        matcheditems = list(filter(
            (lambda x: x[0][0] == words[1] and x[0][1] == words[2]), # 1st and 2nd words match
            dic.items()))

    ## ➂ Message when there is no matching entry
    if len(matcheditems) == 0:
        print("No matched generator for", words[1])
        return ''

    ## ➃ Weighted frequency list
    # Frequency of each matched entry
    probs = [row[1] for row in matcheditems]
    # Multiply each frequency by a pseudo-random number in [0, 1)
    weightlist = rand(len(matcheditems)) * probs

    ## ➄ Return the last word of the entry with the highest weighted frequency
    if grams == 2:
        u = matcheditems[np.argmax(weightlist)][0][1]
    else:
        u = matcheditems[np.argmax(weightlist)][0][2]
    return u
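
Steps ➃ and ➄ multiply each count by a uniform random number and take the maximum. That favours frequent continuations, but it is not the same as drawing a word with probability proportional to its count. As an alternative (not the method used above), here is a minimal sketch of proportional sampling with the standard library's `random.choices`. It assumes the same dictionary layout of tuple keys and integer counts; the name `sample_next` and the toy dictionary are hypothetical.

```python
import random

def sample_next(context, dic, rng=random):
    """Pick a continuation of `context` with probability proportional to its count.

    `dic` maps N-gram tuples to counts; `context` holds the N-1 preceding tokens.
    Returns '' when no N-gram starts with `context`.
    """
    context = tuple(context)
    n_ctx = len(context)
    # Keep (next word, count) for every N-gram whose leading tokens equal the context
    candidates = [(gram[n_ctx], count) for gram, count in dic.items()
                  if gram[:n_ctx] == context]
    if not candidates:
        return ''
    words, counts = zip(*candidates)
    return rng.choices(words, weights=counts, k=1)[0]

# Toy dictionary (assumption): "the" is followed by "cat" three times and "mat" once
toy = {("the", "cat"): 3, ("the", "mat"): 1, ("cat", "sat"): 2}
rng = random.Random(0)
draws = [sample_next(("the",), toy, rng) for _ in range(1000)]
print(draws.count("cat") / len(draws))  # should land near 3/4 for this toy dictionary
```

Because the context length is taken from `context` itself, the same function covers both the 2-gram and the 3-gram case.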

⑸ Execution of sentence generation program

# Seed words; each must be a token that occurs in the tokenized text
words = ['', '吾輩'] # 2-gram
#words = ['', '吾輩', 'は'] # 3-gram

# Start the output with the seed words
output = words[1:]

# Get the "next word" repeatedly
for i in range(100):
    # For 2 words
    if len(words) == 2:
        newword = nextword(words, dic2)
    # For 3 words
    else:
        newword = nextword(words, dic3)

    # Append the next word to the output
    output.append(newword)
    # Stop if no word was found or the word ends a sentence
    if newword in ['', '。', '?', '!', '？', '！']:
        break
    # Prepare the seed for the next iteration
    words = output[-len(words):]
    print(words)

# Display the output
for u in output:
    print(u, end='')
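
The loop above has three exits: no matching N-gram, a sentence-ending token, or the 100-word limit. A minimal self-contained sketch that wraps the same control flow in a function and runs it on toy English tokens, so it can be tried without the corpus. The function name `generate`, its parameters, and the toy sentence are assumptions for illustration. The word choice mirrors the article's idea (count times a uniform random number, largest wins) but uses the standard library instead of NumPy.

```python
import random
from collections import Counter

def generate(seed, dic, stop=("。", ".", "?", "!"), max_words=100, rng=random):
    """Extend `seed` one token at a time until a stop token, a dead end, or max_words."""
    n = len(next(iter(dic)))   # N-gram order, read from any dictionary key (assumes n >= 2)
    output = list(seed)        # the seed needs at least n-1 tokens to match anything
    for _ in range(max_words):
        context = tuple(output[-(n - 1):])
        # Entries whose first n-1 tokens equal the current context
        matches = [(gram[-1], count) for gram, count in dic.items()
                   if gram[:-1] == context]
        if not matches:
            break              # dead end: no N-gram continues this context
        # Weight each count by a uniform random number and keep the largest
        word, _count = max(matches, key=lambda m: rng.random() * m[1])
        output.append(word)
        if word in stop:
            break
    return output

# Toy data (assumption): a bigram dictionary built from two short sentences
tokens = "the cat sat on the mat . the cat slept .".split()
dic = Counter(zip(tokens, tokens[1:]))
result = generate(["the"], dic, rng=random.Random(0))
print(" ".join(result))
```

Passing a seeded `random.Random` makes a run repeatable, which helps when comparing 2-gram and 3-gram output.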
