[Let's play with Python] Aiming for automatic sentence generation ~ Completion of automatic sentence generation ~

Introduction

This is the third installment of the series aiming for automatic sentence generation. This time, we will write the functions that actually generate sentences. The code is long, so let's go through it step by step.

Code part

Prepare text data

Now for the code. These are the imports and the tokenizer we need first.

import re
from janome.tokenizer import Tokenizer
from tqdm import tqdm
from collections import Counter
from collections import defaultdict
import random
t = Tokenizer()

Prepare the text file and load it, then clean up its contents. This part is the same as in the previous article.

with open('test.txt', 'r', encoding='utf-8') as f:  #Open the file and close it automatically
    original_text = f.read()
#print(original_text) #View document

first_sentence = '"Description of Python."'
last_sentence = 'The reptile python, which means the English word Python, is used as the mascot and icon in the Python language.'
#Organize the text data: keep only the part from the first sentence to the last sentence.
_, text = original_text.split(first_sentence)
text, _ = text.split(last_sentence)
text = first_sentence + text + last_sentence

text = text.replace('!', '。') #Change ! and ? to 。 (be careful of full-width vs. half-width characters)
text = text.replace('?', '。')
text = text.replace('(', '').replace(')', '') #Delete ()
text = text.replace('\r', '').replace('\n', '') #Delete the line breaks (\r and \n) in the text data
text = re.sub('[、「」?]', '', text) #Delete 、「」 and any remaining ?
sentences = text.split('。') #Split the text into sentences at 。
print('sentence count:', len(sentences))
sentences[:10] #Display the first 10 sentences
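
The cleanup steps above can be tried on a short string without preparing a file. The following is only a minimal sketch: the helper name `split_into_sentences` and the sample text are made up for illustration, and it normalizes both half-width and full-width marks in a single pass, which is a slightly different choice from the step-by-step `replace` calls above.

```python
import re

def split_into_sentences(text):
    """Normalize sentence-ending marks to '。' and split on it."""
    # Treat half-width and full-width ! and ? as sentence endings.
    text = re.sub('[!?!?]', '。', text)
    # Drop parentheses, commas, quotation brackets, and line breaks.
    text = re.sub('[()()、「」\r\n]', '', text)
    # Splitting leaves an empty string after the final '。', so filter it out.
    return [s for s in text.split('。') if s]

# Hypothetical sample text, not taken from test.txt.
sample = 'Pythonは楽しい!\n本当に?「はい」、楽しいです。'
print(split_into_sentences(sample))
```

Filtering out empty strings keeps blank "sentences" from being passed on to the tokenizer later.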

Break down sentences

Split each sentence into words, then group the words into overlapping sets of three.

start = '__start__'  #Sentence start mark
fin = '__fin__'  #End of sentence

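def get_three_words_list(sentence):  #Return a sentence as a list of 3-word tuples
    #Reuse the Tokenizer t created at the top; building a new one for every sentence is slow
    #list() is needed because newer versions of janome return a generator from tokenize()
    words = list(t.tokenize(sentence, wakati=True))
    words = [start] + words + [fin]
    three_words_list = []
    for i in range(len(words) - 2):
        three_words_list.append(tuple(words[i:i+3]))
    return three_words_list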

three_words_list = []
for sentence in tqdm(sentences):
    three_words_list += get_three_words_list(sentence)
    
three_words_count = Counter(three_words_list)
len(three_words_count) 
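
To see what the 3-word sets look like, here is a small sketch that skips janome and starts from word lists that are already split. The names `to_trigrams` and `tokenized` and the two sample sentences are assumptions made for this illustration only.

```python
from collections import Counter

START = '__start__'
FIN = '__fin__'

def to_trigrams(words):
    """Pad a word list with start/end marks and slide a 3-word window over it."""
    padded = [START] + list(words) + [FIN]
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]

# Hypothetical pre-tokenized sentences (in the article, janome produces these).
tokenized = [['私', 'は', '猫', 'です'], ['私', 'は', '犬', 'です']]

trigram_count = Counter()
for words in tokenized:
    trigram_count.update(to_trigrams(words))

# Both sentences start with the same two words, so this set is counted twice.
print(trigram_count[(START, '私', 'は')])
```

The counts are what later become the weights: a set of three words that appears often in the source text is chosen more often during generation.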

Connect and weight words

#Markov chain
def generate_markov_dict(three_words_count):
    markov_dict = {}
    for three_words, count in three_words_count.items():
        two_words = three_words[:2]  #Split into the first two words and the next word
        next_word = three_words[2]
        if two_words not in markov_dict: #Create an empty entry if the key is not in the dictionary yet
            markov_dict[two_words] = {'words': [], 'weights': []}
        markov_dict[two_words]['words'].append(next_word)  #Add the next word and its count (outside the if, so every next word is recorded)
        markov_dict[two_words]['weights'].append(count)
    return markov_dict

markov_dict = generate_markov_dict(three_words_count)
markov_dict
def get_first_words_weights(three_words_count):
    first_word_count = defaultdict(int)
    
    for three_words, count in three_words_count.items():
        if three_words[0] == start:
            next_word = three_words[1]
            first_word_count[next_word] += count

    words = []  #Lists to store the words and their weights (number of appearances)
    weights = []
    for word, count in first_word_count.items():
        words.append(word)  #Add the word and its weight to the lists
        weights.append(count)
    return words, weights

get_first_words_weights(three_words_count)
markov_dict = generate_markov_dict(three_words_count)
print(len(markov_dict))
first_words, first_weights = get_first_words_weights(three_words_count)
print(len(first_words))
#Simpler variant that returns the raw counts as a dictionary.
#It has its own name so that it does not replace get_first_words_weights, which returns two lists.
def get_first_word_counts(three_words_count):
    first_word_count = defaultdict(int)  #Create a defaultdict whose default value is int (0)
    for three_words, count in three_words_count.items():
        if three_words[0] == start:  #Extract only the sets that begin with start
            next_word = three_words[1]
            first_word_count[next_word] += count #Add the number of appearances
    return first_word_count

get_first_word_counts(three_words_count)
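
The shape of the resulting dictionary is easier to see on a tiny input. The sketch below builds the same kind of table from hand-written counts. The names `toy_counts` and `build_transitions` are hypothetical, and it uses a `defaultdict` instead of an explicit `if` check, which is just one possible way to write it.

```python
from collections import Counter, defaultdict

# Hand-written 3-word counts standing in for three_words_count.
toy_counts = Counter({
    ('__start__', '私', 'は'): 2,
    ('私', 'は', '猫'): 1,
    ('私', 'は', '犬'): 1,
    ('は', '猫', 'です'): 1,
    ('は', '犬', 'です'): 1,
    ('猫', 'です', '__fin__'): 1,
    ('犬', 'です', '__fin__'): 1,
})

def build_transitions(counts):
    """Map each (word1, word2) pair to its candidate next words and weights."""
    table = defaultdict(lambda: {'words': [], 'weights': []})
    for (w1, w2, w3), n in counts.items():
        table[(w1, w2)]['words'].append(w3)
        table[(w1, w2)]['weights'].append(n)
    return dict(table)

transitions = build_transitions(toy_counts)
# The pair ('私', 'は') has two possible continuations with equal weight.
print(transitions[('私', 'は')])
```

A pair with several candidate words is where the generated sentence can branch away from the original text. A pair with only one candidate can only reproduce what was in the source.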

Automatically generate sentences

def generate_text(fwords, fweights, markov_dict):
    first_word = random.choices(fwords, weights=fweights)[0]  #Get the first word
    generate_words = [start, first_word]  #List to store words for sentence generation
    while True:
        pair = tuple(generate_words[-2:])  #Get the last two words
        words = markov_dict[pair]['words']  #Get the lists of candidate next words and their weights
        weights = markov_dict[pair]['weights']
        next_word = random.choices(words, weights=weights)[0]  #Get the next word
        if next_word == fin:  #Exit the loop when the sentence ends
            break
        generate_words.append(next_word)
    return ''.join(generate_words[1:])  #Create sentences from words
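
The generation loop can be tried on a hand-written table before running it on real text. This is only a sketch under some assumptions: `toy_table` and `generate_toy` are made-up names, the table contents are invented, and the `max_words` limit and the check for a missing pair are extra safety guards that the function above does not have.

```python
import random

START, FIN = '__start__', '__fin__'

# Hypothetical table in the same format as markov_dict.
toy_table = {
    (START, '私'): {'words': ['は'], 'weights': [2]},
    ('私', 'は'): {'words': ['猫', '犬'], 'weights': [1, 1]},
    ('は', '猫'): {'words': ['です'], 'weights': [1]},
    ('は', '犬'): {'words': ['です'], 'weights': [1]},
    ('猫', 'です'): {'words': [FIN], 'weights': [1]},
    ('犬', 'です'): {'words': [FIN], 'weights': [1]},
}

def generate_toy(table, first_word, rng, max_words=50):
    """Walk the table from first_word until the end mark or a guard stops it."""
    words = [START, first_word]
    while len(words) < max_words:  # Guard against very long or endless chains
        entry = table.get(tuple(words[-2:]))
        if entry is None:  # Guard: this pair was never seen in the source
            break
        nxt = rng.choices(entry['words'], weights=entry['weights'])[0]
        if nxt == FIN:
            break
        words.append(nxt)
    return ''.join(words[1:])

rng = random.Random(0)  # Fixed seed so repeated runs pick the same words
results = {generate_toy(toy_table, '私', rng) for _ in range(20)}
print(results)
```

With this table the only branch is after 私は, so every result is one of the two source sentences.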

Start generation!

for _ in range(3):  #Generate three sentences
    sentence = generate_text(first_words, first_weights, markov_dict)
    print(sentence)

The result looks like this. (Image: 2020-02-19.png)

Reflection and caution

(゚Д゚) Sentences came out exactly as they appear in the original text... The amount of source text is definitely too small. Caution: when I ran the code, there were times when nothing was generated properly, possibly because my PC's specs were insufficient. The cause is unknown. It also takes time when the amount of text is large.
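
One rough way to check how much a small source text limits the output is to count how many word pairs have only one possible next word. This is a sketch, not something from the original code: the function name `single_successor_ratio` is made up, and it assumes a table in the same format as `markov_dict`.

```python
def single_successor_ratio(table):
    """Return the share of word pairs that have exactly one candidate next word."""
    if not table:
        return 0.0
    single = sum(1 for entry in table.values() if len(entry['words']) == 1)
    return single / len(table)

# Hypothetical table in the same format as markov_dict.
toy_table = {
    ('__start__', '私'): {'words': ['は'], 'weights': [2]},
    ('私', 'は'): {'words': ['猫', '犬'], 'weights': [1, 1]},
    ('は', '猫'): {'words': ['です'], 'weights': [1]},
    ('は', '犬'): {'words': ['です'], 'weights': [1]},
    ('猫', 'です'): {'words': ['__fin__'], 'weights': [1]},
    ('犬', 'est'.replace('est', 'です')): {'words': ['__fin__'], 'weights': [1]},
}

ratio = single_successor_ratio(toy_table)
print(f'{ratio:.2f}')
```

The closer this value is to 1, the more often generation has no choice but to follow the source text word for word. Adding more text should lower it.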

Chat

Sentences are now generated automatically. This is the end of "Aiming for automatic sentence generation". There is room for improvement: the amount of source text and the features are still insufficient. Because of this, many of the generated sentences inevitably end up identical to sentences in the original. Personally, it's a pity that the text in this example isn't interesting. I will write another article if I manage to improve it a little.

This code is based on a book and its sample code. (I forgot the name of the book.) I will also post the text that was generated from Osamu Dazai's "No Longer Human". (Image: 2020-02-19 (1).png)
