This is the second post in my attempt at automatic sentence generation. Last time, I used morphological analysis to examine the structure of sentences. This time, we will read a .txt file and split it into individual sentences.
Prepare a text file in advance with Notepad, and pay attention to the encoding it is saved in (in this example, it is 'utf-8'). Let's read the file and display its contents.
import re
# Open the file with an explicit encoding; the with-block closes it automatically.
with open('test.txt', 'r', encoding='utf-8') as f:
    original_text = f.read()
print(original_text)  # Show the text
This prints the contents of the file as they are.
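If the file was saved in a different encoding, the read fails with a UnicodeDecodeError. As a minimal sketch, assuming the file is either UTF-8 (with or without a byte order mark) or cp932 (the legacy Japanese Windows code page, which is my assumption about what an editor might have used), a hypothetical helper `read_text` could try each candidate in turn:

```python
from pathlib import Path

def read_text(path, encodings=('utf-8-sig', 'cp932')):
    """Return the file contents, trying each candidate encoding in order.

    `read_text` and the default encoding list are illustrative assumptions,
    not part of the original code.
    """
    data = Path(path).read_bytes()
    for enc in encodings:
        try:
            # 'utf-8-sig' also strips a leading byte order mark if present.
            return data.decode(enc)
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError(f'could not decode {path} with any of {encodings}')
```

Trying encodings in order is only a heuristic: a file can decode without error in the wrong encoding and still produce garbage, so it is worth eyeballing the printed text.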
Next, clean up the text data. Depending on how your source text is written, you will need to make your own adjustments; the code here is tailored to my text data. (For example, if the text contains furigana annotations, they must be removed.)
first_sentence = '"Description of Python."'
last_sentence = 'The reptile python, which means the English word Python, is used as the mascot and icon in the Python language.'
# Organize the text data.
# Keep only the part between first_sentence and last_sentence (each must occur exactly once).
_, text = original_text.split(first_sentence)
text, _ = text.split(last_sentence)
text = first_sentence + text + last_sentence
text = text.replace('!', '。')  # Replace ! and ? with 。 (watch out for full-width vs. half-width characters)
text = text.replace('?', '。')
text = text.replace('(', '').replace(')', '')  # Remove parentheses
text = text.replace('\r', '').replace('\n', '')  # Remove the line-break characters (\r, \n) in the text data
text = re.sub('[、「」?]', '', text)  # Remove 、, 「, 」, and any remaining ?
sentences = text.split('。')  # Split the text into sentences on 。
print('sentence count:', len(sentences))
print(sentences[:10])  # Display the first 10 sentences
This gives the list of sentences I was after.
That's it for this code. Now we have a sentence-by-sentence list! Next, I plan to run it through morphological analysis and use the result to build sentences.
I stumbled over a few things along the way, so I will share them here.
- Omitting encoding='utf-8' causes an error.
- I had not grasped the characteristics of the text data, so sentences ending in '!' were not split.
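The problem with '!' comes from full-width and half-width punctuation being different characters. One possible way to guard against it, which is not what I did in the code for this post, is to normalize the text with the standard library's `unicodedata.normalize` before replacing anything:

```python
import unicodedata

full_width = '\uFF01'  # full-width exclamation mark
half_width = '!'       # half-width (ASCII) exclamation mark

# The two look alike but are different characters,
# so a replace() written for one silently misses the other.
print(full_width == half_width)  # → False

# NFKC normalization folds full-width ASCII variants into half-width forms.
normalized = unicodedata.normalize('NFKC', 'すごい\uFF01本当\uFF1F')
print(normalized)  # → すごい!本当?
```

Note that NFKC also changes other characters (for example, it converts half-width katakana to full-width), so check that this is acceptable for your text before applying it.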
That's about it. These were relatively simple problems, but it took me a long time because I didn't notice them. After some thought about what to use as the example text in this article, I settled on a safe choice (Wikipedia's description of Python).