[PYTHON] Exploring a grammar-proofreading and sentence-comparison model for insight, considering similarity of writing style and authorial style

Motivation

・ I wanted a grammar-proofing tool that respects a specific writing style and authorial style, rather than a general-purpose one (e.g. proofreading tools for regulatory application documents, for checking patent specifications, for contracts, for applying case law, for imitating the style of a particular author, and so on).

・ In working to understand the individuality of my self-made AI models, I wanted a simple, interpretable model that shows which parts of a sentence can be called characteristic of that individuality. I also wanted to generate sentences from that individuality and compare the individuality of different models.

・ I wanted a way to change the style of a sentence that is understandable and hands-on.

・ I wanted a means of making predictions from small data, rather than deep learning.

・ I wanted to understand what a writing style actually is (what kinds of templates exist). At the moment, even if I train on the style, put it into an evaluation function, or visualize TF-IDF embedding clusters as templates, I end up understanding almost nothing. Giving up on understanding and running straight to distillation feels wrong.

I thought it would be worth modifying an n-gram probability model that assumes the Markov property.

References

・ This code is based on DLAI's code, and is created and published with permission. (Loosely translated, the terms were: feel free to use it, but write it so the origin is clear, and also write that DLAI is a masterpiece.)

・ Style Transformer: Unpaired Text Style Transfer without Disentangled Latent Representation https://arxiv.org/abs/1905.05621 This was worth reading. If style cannot be completely removed from the stylistic structure inside an end-to-end model, and that is what makes style transfer difficult, then adding style afterwards may be a good idea. Does style (at least at the microscopic level) correlate with long-distance dependencies? It should be harmless to treat them as independent. Personally, I also think there are solutions other than memory networks, each with its own strengths; wouldn't it be good to substitute one a priori and verify it inductively? I feel that style transfer can be done relatively easily with encoder-decoders such as mBART and mT5, and I look forward to the papers published in 2021. (Essentially, as with translation, you should be able to shift the vector toward the centroid of a template; the problem is probably that you don't know what template to assume.)

Two examples of masked-word prediction:

['[CLS]', 'I', 'am', 'a', '[MASK]', '.', 'Name', 'is', 'not', 'yet', '.', '[SEP]'] (the opening of "I Am a Cat"), candidates: cat, dog, human (correct answer: presumably cat)

['[CLS]', 'that', 'rule', '[MASK]', 'does', 'not', 'fit', 'the', 'customer', 'needs', '[SEP]'], candidates: compliance, instruction, sentence (correct answer: presumably a standard)
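The masked-word examples above can be approximated without BERT: an n-gram model can score each candidate by how often it occurs between the left and right neighbour tokens. A minimal sketch (the English toy corpus and candidate words are illustrative, not the actual data):

```python
from collections import Counter

def rank_candidates(corpus_tokens, left, right, candidates):
    """Rank fillers for '<left> [MASK] <right>' by trigram frequency
    in the training corpus (higher count = better candidate)."""
    trigrams = Counter(zip(corpus_tokens, corpus_tokens[1:], corpus_tokens[2:]))
    return sorted(candidates, key=lambda c: -trigrams[(left, c, right)])

corpus = "i am a cat . i am a cat . you are a person .".split()
print(rank_candidates(corpus, 'a', '.', ['dog', 'person', 'cat']))
# → ['cat', 'person', 'dog']
```

The ranking falls out of raw counts: "a cat ." occurs twice in the toy corpus, "a person ." once, "a dog ." never.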


Code overview

0. Preparation: drop the text files for training into a specific folder (raw text is OK); enter the test text
1. Preprocessing
2. OOV handling
3. Building multiple probabilistic models (matrices) with n-grams, bidirectionally
4. Model execution: take the weighted sum of the post-training probabilities obtained from each model; if a word outside the training data is detected, ● is appended after it
5. Output: the test sentences annotated with ●
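Steps 3 through 5 can be sketched roughly as follows. This is a hypothetical simplification, not the article's actual code: one bigram model per direction instead of multiple n-gram matrices, add-one smoothing, and an arbitrary threshold.

```python
from collections import defaultdict

def train_bigrams(tokens):
    """Count bigram transitions prev -> word."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(tokens, tokens[1:]):
        counts[a][b] += 1
    return counts

def prob(counts, prev, word, vocab_size):
    """P(word | prev) with add-one smoothing."""
    total = sum(counts[prev].values())
    return (counts[prev][word] + 1) / (total + vocab_size)

def proofread(train_tokens, test_tokens, w_fwd=0.5, w_bwd=0.5, threshold=0.05):
    """Append ● to out-of-vocabulary or low-probability test tokens."""
    vocab = set(train_tokens)
    fwd = train_bigrams(train_tokens)        # forward model
    bwd = train_bigrams(train_tokens[::-1])  # reverse (right-to-left) model
    out = []
    for i, tok in enumerate(test_tokens):
        if tok not in vocab:                 # OOV word: mark immediately
            out.append(tok + '●')
            continue
        # weighted sum of the probabilities from each directional model
        p_f = prob(fwd, test_tokens[i - 1], tok, len(vocab)) if i > 0 else 1.0
        p_b = prob(bwd, test_tokens[i + 1], tok, len(vocab)) if i < len(test_tokens) - 1 else 1.0
        p = w_fwd * p_f + w_bwd * p_b
        out.append(tok + '●' if p < threshold else tok)
    return ' '.join(out)
```

For example, `proofread("the cat sat on the mat".split(), "the cat flew".split())` returns `"the cat flew●"`, since "flew" never appears in the training tokens.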

Code

・ Still to do: examination of cut-offs, correction of the over-specialization to particular styles, cleanup of the code's appearance, and other fixes...

Extracting specific authors from Aozora Bunko


import os

def file2data(path, exti='', aut=0):
    """Read every .txt file under path (recursively) into one string.

    exti: skip any file whose first 20 characters contain this marker.
    aut:  0 keeps every file; otherwise a list of author-name substrings,
          and only files whose header author line matches one is kept.
    """
    data = ''
    # Walk all subfolders
    for current, subfolders, subfiles in os.walk(path):
        for file in subfiles:
            if not file.endswith('.txt'):
                continue
            try:
                with open(os.path.join(current, file), 'r', encoding='utf-8') as f:
                    fd = f.read()
            except UnicodeDecodeError:
                with open(os.path.join(current, file), 'r', encoding='Shift-JIS', errors='ignore') as f:
                    fd = f.read()
            # Exclusion (better placed after the author filter, but kept
            # here to avoid deepening the code)
            if exti != '' and exti in fd[:20]:
                print(exti, '● Excluded', file, fd.replace('\n', ' ')[:50])
                continue
            # Keep everything when no author filter is given
            if aut == 0:
                data += fd + '\n'
                print(file, fd.replace('\n', ' ')[:50])
                continue
            # Limit to specific authors: in Aozora Bunko files the author
            # name is the last header line just before the first blank line
            for a in aut:
                autareaend = fd.find('\n\n')  # author name precedes \n\n (fixed 2021-01-16)
                autareastart = fd[:autareaend].rfind('\n') + 1
                if a in fd[autareastart:autareaend]:
                    data += fd + '\n'
                    print(file, fd.replace('\n', ' ')[:50])
                    break  # avoid adding the same file twice
    return data
train_data = file2data(path, aut=['Miyazawa', 'Dazai', 'Fukuzawa', 'Natsume', 'Akutagawa'])
#'Akutagawa' also matches e.g. 'Akutagawa Saori'; check the matches and
#exclude what matters before using the data.
#Not included in the code, but you could add the ability to ignore authors on a list.

gingatetsudono_yoru.txt Night on the Galactic Railroad Kenji Miyazawa -------------------------------------
gusukobudorino_denki.txt The Biography of Budori Gusuko Kenji Miyazawa ---------------------------------
kazeno_matasaburo.txt Matasaburo of the Wind Kenji Miyazawa --------------------------------------
kenjukoenrin.txt Kenju's Park Grove Kenji Miyazawa --------------------------------------
・・・
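The header-parsing heuristic inside file2data, together with the suggested exclude list, could be factored out like this (a sketch; the blank-line convention matches the Aozora Bunko header format assumed above, and the function names are my own):

```python
def author_of(text):
    """Aozora Bunko plain text: the author name is the last line of the
    header block, which ends at the first blank line."""
    header_end = text.find('\n\n')
    start = text[:header_end].rfind('\n') + 1
    return text[start:header_end]

def keep_file(text, include, exclude=()):
    """True when the author matches the include list and matches no
    exclude pattern (e.g. keep 'Akutagawa', drop 'Saori Akutagawa')."""
    author = author_of(text)
    if any(x in author for x in exclude):
        return False
    return any(a in author for a in include)
```

With this, `keep_file(fd, ['Akutagawa'], exclude=['Saori'])` keeps Ryunosuke Akutagawa's files while dropping Saori Akutagawa's, addressing the caveat in the comments above.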

print(train_data[:200])

Night on the Galactic Railroad Kenji Miyazawa


[About the symbols that appear in the text]

《》: ruby (example) 言葉《ことば》

｜: marks the start of the string that ruby attaches to (example) 一｜袋《ふくろ》

[#]: transcriber's note: mainly explanations of gaiji (external characters) and designation of emphasis marks (the number is the JIS X 0213 plane-row-cell number or the base ~

Trial

Output of test run 1: (screenshot)

Output of test run 2: (screenshot)

Trial of author style extraction using Aozora Bunko

Keisuke Hanahata, Masaki Aono: "Estimating Authors of Literary Works Focusing on Vocabulary and Context", Proceedings of the 25th Annual Meeting of the Association for Natural Language Processing (March 2019) https://www.anlp.jp/proceedings/annual_meeting/2019/pdf_dir/P7-27.pdf

Consider these references.

First, in the diff above, ● is attached to words "opened into hiragana" (written in kana where kanji would be usual): "~ kimi wa", "beginning to scold", "road dust", "done", and so on. Such hiragana-opened words can fairly be called the part that shows the writing style of the children's author Nankichi Niimi.

Next, in the diff above, ● is attached to the notation "what is it". I checked the post-training probabilities of each model to see whether it is a unique expression (setting aside the reverse model for now). The notation seems to be used in Nankichi Niimi's works other than "Grandpa's Lamp", and does not seem to be used (much?) in the works of Miyazawa and the others, so it can be regarded as part of Nankichi Niimi's style. (I could not find the identical word in the original text; the trailing "a" sound often seems to come at the end of a sentence?)

Also in the diff above, ● is attached to the notations "to the outside road" and "it seems to be a child". Hard to pin down, but these are slightly unusual expressions; "to the road" and "children think" would be the commonly used phrasings, I believe. The blue underline marks a word whose style is expected to be unique to the author. If only blocks built around such spots, taking a window before and after and replacing the words with part-of-speech names, were used as training data, more accurate author estimation with less noise might be possible. * Takeaway: the wording tends to be one-off and unique.

Kenji Miyazawa, "Night on the Galactic Railroad": I don't understand this one at all (hey).

No, the tool is meant for practical insight, so what you read out of it depends on your ability and domain knowledge.

As far as I can tell, the notations "I hurriedly stopped" and "I don't understand anything" seem to be characteristic, with long inter-phrase distances. From the post-training probabilities, there also seem to be characteristics near "everyone" and "and". If anything, there may be many numbers marked with ●. * Takeaway: the wording is unique, while the proper nouns are not.

Eriko Kanagawa, Ryosuke Sahara, Go Okadome: "Stylistic Similarity of Writers: Analysis Using Syntactic Distance by Introducing the Information-Amount Tree Kernel", The 29th Annual Conference of the Japanese Society for Artificial Intelligence, 2015 https://www.ai-gakkai.or.jp/jsai2015/webprogram/2015/pdf/2K1-1in.pdf "As Table 1 shows, when comparing the degree of similarity with other writers, the information-amount tree kernel values of Akutagawa and Natsume are large. It can therefore be inferred that these two write many sentences using generally rare dependencies and are syntactically similar. Rare dependencies are thought to contain the characteristics of a writer's style. Since the information-amount tree kernel value of Miyazawa and Niimi is small, these two can be said to write sentences that are not syntactically similar." "From this, Natsume can be said to have the characteristic of writing contrastive sentences that use the same structure twice, as in 'my cup and his cup', more than other writers. Akutagawa, who writes 'find my friend's Maiko' rather than just 'find Maiko', is thought to be characterized by relatively many detailed depictions such as 'my' and 'friend's' compared to other writers."

Ryunosuke Akutagawa, "Rashomon". If anything stands out, it is that ● marks appear in runs. From the post-training probabilities, there also seem to be characteristics near "there" and "light". Is "the light of the fire" a characteristic of this work, or of the author (detailed description?)? (Personally, "the light of the fire" leaves a strong impression in "Rashomon". I get a pictorial impression from "Rashomon", and I suspect this is one reason; my impression is mixed up with "Hell Screen", with its fire and the painting itself.) (The light of the fire may have come out as characteristic of the author because of the description in "Hell Screen".) * Takeaway: there are many proper nouns, and they are often unique. It is not surprising that the wording around a proper noun, where the proper noun is being explained, turns out unique. (If the author deliberately makes the wording and dependencies distinctive in the most impressive scenes, that is impressive in itself; if so, reading only the neighbourhoods of ● might give you the synopsis.)

Natsume Soseki, "I Am a Cat". If anything, the notation "*" seems characteristic. Judging from the post-training probabilities, there also seem to be characteristics near "beginning" and "understanding". * Takeaway: certain specific proper nouns are unique. Of the elements of style, I can judge unique wording, but can I judge dependency? Dependencies seem to exert influence over longer distances than expected. Should the weighting be changed? Could the distances and waveform of the ● marks be analyzed? Under these conditions, would a CNN-like method that considers features over fixed-size windows across the whole text work in a specific domain? Since the structure is needed but the content is not, should each word be replaced with its part-of-speech name? Are there ranges whose features can only be predicted by the reverse model?
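One of the questions above, whether the distance pattern (the "waveform") of the ● marks carries a signal, could be probed with something as simple as the gap sequence between consecutive marks. A hypothetical sketch over tokenized, ●-annotated output:

```python
def gap_profile(tokens, mark='●'):
    """Distances between consecutive ●-marked tokens: short gaps mean
    densely clustered unique wording, long gaps mean isolated marks."""
    positions = [i for i, t in enumerate(tokens) if t.endswith(mark)]
    return [b - a for a, b in zip(positions, positions[1:])]

print(gap_profile(['a', 'b●', 'c', 'd', 'e●', 'f●']))
# → [3, 1]
```

Comparing such gap profiles (e.g. their mean and variance) across authors would be one cheap way to test whether ● density itself is stylistic.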

Reference memo

・ Multiword expressions (MWEs): syntactic non-compositionality (each part of speech is hard to predict) or semantic non-compositionality (the meaning the group carries is hard to predict from its constituent words; * there was a long-distance citation). A group of words that only takes on its meaning together with its function words is not guaranteed to form a subtree of the phrase structure.
