[PYTHON] Exploring a grammar-proofreading and sentence-comparison model for insight, considering similarity of writing style and authorial style

Motivation

・ I wanted a grammar-proofing tool that respects a specific writing style and authorial style, rather than a general-purpose one (e.g. proofreading tools for regulatory application documents, for checking patent specifications, for contracts, for applying case law, for imitating the style of a particular author, and so on).

・ In working to understand the individuality of my self-made AI models, I wanted a simple, interpretable model that shows which parts of a sentence can be called characteristic of that individuality. I also wanted to generate sentences from that individuality and compare the individuality of different models.

・ I wanted a way to change the style of a sentence that is understandable and hands-on.

・ I wanted a means of making predictions from small data, rather than deep learning.

・ I wanted to understand what a writing style actually is (what kinds of templates exist). At the moment, even if I train on the style, put it into an evaluation function, or visualize TF-IDF embedding clusters as templates, I end up understanding almost nothing. Giving up on understanding and running straight to distillation feels wrong.

I thought it would be worth modifying an n-gram probability model that assumes the Markov property.

References

・ This code is based on DLAI's code, and is created and published with permission. (Loosely translated, the terms were: feel free to use it, but write it so the origin is clear, and also write that DLAI is a masterpiece.)

・ Style Transformer: Unpaired Text Style Transfer without Disentangled Latent Representation https://arxiv.org/abs/1905.05621 This was worth reading. If style cannot be completely removed from the stylistic structure inside an end-to-end model, and that is what makes style transfer difficult, then adding style afterwards may be a good idea. Does style (at least at the microscopic level) correlate with long-distance dependencies? It should be harmless to treat them as independent. Personally, I also think there are solutions other than memory networks, each with its own strengths; wouldn't it be good to substitute one a priori and verify it inductively? I feel that style transfer can be done relatively easily with encoder-decoders such as mBART and mT5, and I look forward to the papers published in 2021. (Essentially, as with translation, you should be able to shift the vector toward the centroid of a template; the problem is probably that you don't know what template to assume.)

Two examples of masked-word prediction:

['[CLS]', 'I', 'am', 'a', '[MASK]', '.', 'Name', 'is', 'not', 'yet', '.', '[SEP]'] (the opening of "I Am a Cat"), candidates: cat, dog, human (correct answer: presumably cat)

['[CLS]', 'that', 'rule', '[MASK]', 'does', 'not', 'fit', 'the', 'customer', 'needs', '[SEP]'], candidates: compliance, instruction, sentence (correct answer: presumably a standard)
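The masked-word examples above can be approximated without BERT: an n-gram model can score each candidate by how often it occurs between the left and right neighbour tokens. A minimal sketch (the English toy corpus and candidate words are illustrative, not the actual data):

```python
from collections import Counter

def rank_candidates(corpus_tokens, left, right, candidates):
    """Rank fillers for '<left> [MASK] <right>' by trigram frequency
    in the training corpus (higher count = better candidate)."""
    trigrams = Counter(zip(corpus_tokens, corpus_tokens[1:], corpus_tokens[2:]))
    return sorted(candidates, key=lambda c: -trigrams[(left, c, right)])

corpus = "i am a cat . i am a cat . you are a person .".split()
print(rank_candidates(corpus, 'a', '.', ['dog', 'person', 'cat']))
# → ['cat', 'person', 'dog']
```

The ranking falls out of raw counts: "a cat ." occurs twice in the toy corpus, "a person ." once, "a dog ." never.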


Code overview

0. Preparation: drop the text files for training into a specific folder (raw text is OK); enter the test text
1. Preprocessing
2. OOV handling
3. Building multiple probabilistic models (matrices) with n-grams, bidirectionally
4. Model execution: take the weighted sum of the post-training probabilities obtained from each model; if a word outside the training data is detected, ● is appended after it
5. Output: the test sentences annotated with ●
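Steps 3 through 5 can be sketched roughly as follows. This is a hypothetical simplification, not the article's actual code: one bigram model per direction instead of multiple n-gram matrices, add-one smoothing, and an arbitrary threshold.

```python
from collections import defaultdict

def train_bigrams(tokens):
    """Count bigram transitions prev -> word."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(tokens, tokens[1:]):
        counts[a][b] += 1
    return counts

def prob(counts, prev, word, vocab_size):
    """P(word | prev) with add-one smoothing."""
    total = sum(counts[prev].values())
    return (counts[prev][word] + 1) / (total + vocab_size)

def proofread(train_tokens, test_tokens, w_fwd=0.5, w_bwd=0.5, threshold=0.05):
    """Append ● to out-of-vocabulary or low-probability test tokens."""
    vocab = set(train_tokens)
    fwd = train_bigrams(train_tokens)        # forward model
    bwd = train_bigrams(train_tokens[::-1])  # reverse (right-to-left) model
    out = []
    for i, tok in enumerate(test_tokens):
        if tok not in vocab:                 # OOV word: mark immediately
            out.append(tok + '●')
            continue
        # weighted sum of the probabilities from each directional model
        p_f = prob(fwd, test_tokens[i - 1], tok, len(vocab)) if i > 0 else 1.0
        p_b = prob(bwd, test_tokens[i + 1], tok, len(vocab)) if i < len(test_tokens) - 1 else 1.0
        p = w_fwd * p_f + w_bwd * p_b
        out.append(tok + '●' if p < threshold else tok)
    return ' '.join(out)
```

For example, `proofread("the cat sat on the mat".split(), "the cat flew".split())` returns `"the cat flew●"`, since "flew" never appears in the training tokens.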

Code

・ Still to do: examination of cut-offs, correction of the over-specialization to particular styles, cleanup of the code's appearance, and other fixes...

Extracting specific authors from Aozora Bunko


import os

def file2data(path, exti='', aut=0):
    """Read every .txt file under path (recursively) into one string.

    exti: skip any file whose first 20 characters contain this marker.
    aut:  0 keeps every file; otherwise a list of author-name substrings,
          and only files whose header author line matches one is kept.
    """
    data = ''
    # Walk all subfolders
    for current, subfolders, subfiles in os.walk(path):
        for file in subfiles:
            if not file.endswith('.txt'):
                continue
            try:
                with open(os.path.join(current, file), 'r', encoding='utf-8') as f:
                    fd = f.read()
            except UnicodeDecodeError:
                with open(os.path.join(current, file), 'r', encoding='Shift-JIS', errors='ignore') as f:
                    fd = f.read()
            # Exclusion (better placed after the author filter, but kept
            # here to avoid deepening the code)
            if exti != '' and exti in fd[:20]:
                print(exti, '● Excluded', file, fd.replace('\n', ' ')[:50])
                continue
            # Keep everything when no author filter is given
            if aut == 0:
                data += fd + '\n'
                print(file, fd.replace('\n', ' ')[:50])
                continue
            # Limit to specific authors: in Aozora Bunko files the author
            # name is the last header line just before the first blank line
            for a in aut:
                autareaend = fd.find('\n\n')  # author name precedes \n\n (fixed 2021-01-16)
                autareastart = fd[:autareaend].rfind('\n') + 1
                if a in fd[autareastart:autareaend]:
                    data += fd + '\n'
                    print(file, fd.replace('\n', ' ')[:50])
                    break  # avoid adding the same file twice
    return data
train_data = file2data(path, aut=['Miyazawa', 'Dazai', 'Fukuzawa', 'Natsume', 'Akutagawa'])
#'Akutagawa' also matches e.g. 'Akutagawa Saori'; check the matches and
#exclude what matters before using the data.
#Not included in the code, but you could add the ability to ignore authors on a list.

gingatetsudono_yoru.txt Night on the Galactic Railroad Kenji Miyazawa -------------------------------------
gusukobudorino_denki.txt The Biography of Budori Gusuko Kenji Miyazawa ---------------------------------
kazeno_matasaburo.txt Matasaburo of the Wind Kenji Miyazawa --------------------------------------
kenjukoenrin.txt Kenju's Park Grove Kenji Miyazawa --------------------------------------
・・・
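The header-parsing heuristic inside file2data, together with the suggested exclude list, could be factored out like this (a sketch; the blank-line convention matches the Aozora Bunko header format assumed above, and the function names are my own):

```python
def author_of(text):
    """Aozora Bunko plain text: the author name is the last line of the
    header block, which ends at the first blank line."""
    header_end = text.find('\n\n')
    start = text[:header_end].rfind('\n') + 1
    return text[start:header_end]

def keep_file(text, include, exclude=()):
    """True when the author matches the include list and matches no
    exclude pattern (e.g. keep 'Akutagawa', drop 'Saori Akutagawa')."""
    author = author_of(text)
    if any(x in author for x in exclude):
        return False
    return any(a in author for a in include)
```

With this, `keep_file(fd, ['Akutagawa'], exclude=['Saori'])` keeps Ryunosuke Akutagawa's files while dropping Saori Akutagawa's, addressing the caveat in the comments above.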

print(train_data[:200])

Night on the Galactic Railroad Kenji Miyazawa


[About the symbols that appear in the text]

《》: ruby (example) 言葉《ことば》

｜: marks the start of the string that ruby attaches to (example) 一｜袋《ふくろ》

[#]: transcriber's note: mainly explanations of gaiji (external characters) and designation of emphasis marks (the number is the JIS X 0213 plane-row-cell number or the base ~

Trial

Output of test run 1: (screenshot)

Output of test run 2: (screenshot)

Trial of author style extraction using Aozora Bunko

Keisuke Hanahata, Masaki Aono: "Estimating Authors of Literary Works Focusing on Vocabulary and Context", Proceedings of the 25th Annual Meeting of the Association for Natural Language Processing (March 2019) https://www.anlp.jp/proceedings/annual_meeting/2019/pdf_dir/P7-27.pdf

Consider these references.

First, in the diff above, ● is attached to words "opened into hiragana" (written in kana where kanji would be usual): "~ kimi wa", "beginning to scold", "road dust", "done", and so on. Such hiragana-opened words can fairly be called the part that shows the writing style of the children's author Nankichi Niimi.

Next, in the diff above, ● is attached to the notation "what is it". I checked the post-training probabilities of each model to see whether it is a unique expression (setting aside the reverse model for now). The notation seems to be used in Nankichi Niimi's works other than "Grandpa's Lamp", and does not seem to be used (much?) in the works of Miyazawa and the others, so it can be regarded as part of Nankichi Niimi's style. (I could not find the identical word in the original text; the trailing "a" sound often seems to come at the end of a sentence?)

Also in the diff above, ● is attached to the notations "to the outside road" and "it seems to be a child". Hard to pin down, but these are slightly unusual expressions; "to the road" and "children think" would be the commonly used phrasings, I believe. The blue underline marks a word whose style is expected to be unique to the author. If only blocks built around such spots, taking a window before and after and replacing the words with part-of-speech names, were used as training data, more accurate author estimation with less noise might be possible. * Takeaway: the wording tends to be one-off and unique.

Kenji Miyazawa, "Night on the Galactic Railroad": I don't understand this one at all (hey).

No, the tool is meant for practical insight, so what you read out of it depends on your ability and domain knowledge.

As far as I can tell, the notations "I hurriedly stopped" and "I don't understand anything" seem to be characteristic, with long inter-phrase distances. From the post-training probabilities, there also seem to be characteristics near "everyone" and "and". If anything, there may be many numbers marked with ●. * Takeaway: the wording is unique, while the proper nouns are not.

Eriko Kanagawa, Ryosuke Sahara, Go Okadome: "Stylistic Similarity of Writers: Analysis Using Syntactic Distance by Introducing the Information-Amount Tree Kernel", The 29th Annual Conference of the Japanese Society for Artificial Intelligence, 2015 https://www.ai-gakkai.or.jp/jsai2015/webprogram/2015/pdf/2K1-1in.pdf "As Table 1 shows, when comparing the degree of similarity with other writers, the information-amount tree kernel values of Akutagawa and Natsume are large. It can therefore be inferred that these two write many sentences using generally rare dependencies and are syntactically similar. Rare dependencies are thought to contain the characteristics of a writer's style. Since the information-amount tree kernel value of Miyazawa and Niimi is small, these two can be said to write sentences that are not syntactically similar." "From this, Natsume can be said to have the characteristic of writing contrastive sentences that use the same structure twice, as in 'my cup and his cup', more than other writers. Akutagawa, who writes 'find my friend's Maiko' rather than just 'find Maiko', is thought to be characterized by relatively many detailed depictions such as 'my' and 'friend's' compared to other writers."

Ryunosuke Akutagawa, "Rashomon". If anything stands out, it is that ● marks appear in runs. From the post-training probabilities, there also seem to be characteristics near "there" and "light". Is "the light of the fire" a characteristic of this work, or of the author (detailed description?)? (Personally, "the light of the fire" leaves a strong impression in "Rashomon". I get a pictorial impression from "Rashomon", and I suspect this is one reason; my impression is mixed up with "Hell Screen", with its fire and the painting itself.) (The light of the fire may have come out as characteristic of the author because of the description in "Hell Screen".) * Takeaway: there are many proper nouns, and they are often unique. It is not surprising that the wording around a proper noun, where the proper noun is being explained, turns out unique. (If the author deliberately makes the wording and dependencies distinctive in the most impressive scenes, that is impressive in itself; if so, reading only the neighbourhoods of ● might give you the synopsis.)

Natsume Soseki, "I Am a Cat". If anything, the notation "*" seems characteristic. Judging from the post-training probabilities, there also seem to be characteristics near "beginning" and "understanding". * Takeaway: certain specific proper nouns are unique. Of the elements of style, I can judge unique wording, but can I judge dependency? Dependencies seem to exert influence over longer distances than expected. Should the weighting be changed? Could the distances and waveform of the ● marks be analyzed? Under these conditions, would a CNN-like method that considers features over fixed-size windows across the whole text work in a specific domain? Since the structure is needed but the content is not, should each word be replaced with its part-of-speech name? Are there ranges whose features can only be predicted by the reverse model?
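One of the questions above, whether the distance pattern (the "waveform") of the ● marks carries a signal, could be probed with something as simple as the gap sequence between consecutive marks. A hypothetical sketch over tokenized, ●-annotated output:

```python
def gap_profile(tokens, mark='●'):
    """Distances between consecutive ●-marked tokens: short gaps mean
    densely clustered unique wording, long gaps mean isolated marks."""
    positions = [i for i, t in enumerate(tokens) if t.endswith(mark)]
    return [b - a for a, b in zip(positions, positions[1:])]

print(gap_profile(['a', 'b●', 'c', 'd', 'e●', 'f●']))
# → [3, 1]
```

Comparing such gap profiles (e.g. their mean and variance) across authors would be one cheap way to test whether ● density itself is stylistic.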

Reference memo

・ Multiword expressions (MWEs): syntactic non-compositionality (each part of speech is hard to predict) or semantic non-compositionality (the meaning the group carries is hard to predict from its constituent words; * there was a long-distance citation). A group of words that only takes on its meaning together with its function words is not guaranteed to form a subtree of the phrase structure.
