[Let's play with Python] Aiming for automatic sentence generation ~ Read a .txt file and split it into sentences ~

Introduction

This is the second post in my attempt at automatic sentence generation. Last time, I used morphological analysis to examine the structure of sentences. This time, we will read a .txt file and split it into individual sentences.

Read the text

Prepare the text data in advance with Notepad. Be careful about the encoding the file is saved in. (In this example, it is 'utf-8'.) Let's read the text and display it.

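import re  # used later when cleaning up the text

# Open the file with an explicit encoding; the with-block closes it automatically.
with open('test.txt', 'r', encoding='utf-8') as f:
    original_text = f.read()
print(original_text)  # Show the text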

The output looks like this. (Screenshot: 2020-02-11.png)

Organize text data

Next, clean up the text data. Depending on how the original text is written, you will need to make your own adjustments; the code in this section is tailored to my text data. (For example, if the text contains furigana annotations, they must be deleted.)
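For the furigana case, a minimal sketch might look like the following. It assumes the annotations use Aozora Bunko-style notation, where the reading is wrapped in 《》 and an optional | marks where the annotated word starts. The article does not say which notation is meant, and `strip_ruby` is a hypothetical helper name.

```python
import re

def strip_ruby(text):
    """Remove Aozora Bunko-style furigana (an assumed notation)."""
    text = re.sub(r'《[^》]*》', '', text)  # drop the reading inside 《》
    return text.replace('|', '')          # drop the marker before the annotated word

sample = '吾輩《わがはい》は猫である。'
print(strip_ruby(sample))  # prints 吾輩は猫である。
```

The cleanup code in this section does not include this step.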

first_sentence = '"Description of Python."'
last_sentence = 'The reptile python, which means the English word Python, is used as the mascot and icon in the Python language.'
# Keep only the span from first_sentence through last_sentence.
# Each marker must appear exactly once in the text, or the unpacking fails.
_, text = original_text.split(first_sentence)
text, _ = text.split(last_sentence)
text = first_sentence + text + last_sentence

# Replace ! and ? with 。 so that they also end a sentence.
# Be careful of full-width vs. half-width characters.
text = text.replace('!', '。')
text = text.replace('?', '。')
text = text.replace('(', '').replace(')', '')  # Delete the parentheses characters.
text = text.replace('\r', '').replace('\n', '')  # Delete the line breaks (\r, \n) in the text data.
text = re.sub('[、「」?]', '', text)  # Delete 、「」 and any remaining ?
sentences = [s for s in text.split('。') if s]  # Split into sentences at 。, skipping empty strings.
print('sentence count:', len(sentences))
print(sentences[:10])  # Display the first 10 sentences

This is the result. (Screenshot: 2020-02-11 (1).png)

That's it for this code. Now you have a list with one sentence per element! Next, I plan to run these sentences through morphological analysis and build new sentences from them.

Aside

Here are a few things I personally stumbled over.

-- Reading the file without encoding='utf-8' raised an error.
-- I had not grasped the characteristics of the text data, so sentences ending in '!' were not separated.

That's about it. Both problems were relatively simple, but they took a long time because I didn't notice them. After some thought about which example text to use in the article, I settled on a safe choice (Wikipedia's explanation of Python).
