[PYTHON] I tried to solve the 2020 version of 100 language processing knocks [Chapter 1: Preparatory movement 05-09]

This is the second article in which I solve, in Python (3.7), "100 Language Processing Knock 2020", the teaching material for the introductory programming study session (part of the training for newcomers) created by Tohoku University's Inui/Okazaki Lab (currently the Inui/Suzuki Lab).

Since I studied Python on my own, there may be mistakes or more efficient ways to do things. I would appreciate it if you could point out any improvements you find.

The source code is also available on GitHub.

Chapter 1: Preparatory movement

Review some advanced topics in programming languages while working on subjects dealing with texts and strings.

05. n-gram

Create a function that creates an n-gram from a given sequence (string, list, etc.). Use this function to get the word bi-gram and the letter bi-gram from the sentence "I am an NLPer".

What is n-gram

A contiguous sequence of n characters (or n words) taken from a document or string.

A bi-gram is a sequence of two consecutive characters (or words).

Reference: What is n-gram? Weblio Dictionary

05.py


def n_gram(target, n):
    return [target[index: index + n] for index in range(len(target) - n + 1)]


words = "I am an NLPer"
print(n_gram(words.split(), 2))
# >> [['I', 'am'], ['am', 'an'], ['an', 'NLPer']]
print(n_gram(words, 2))
# >> ['I ', ' a', 'am', 'm ', ' a', 'an', 'n ', ' N', 'NL', 'LP', 'Pe', 'er']

I use slices here as well. The n_gram function returns a list built by extracting n consecutive elements while shifting the starting index one position at a time over the given list or string.

For the word bi-gram, the input string is split on spaces with the split method and the resulting list is passed as the argument. For the character bi-gram, the string is passed and sliced as it is.
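As a side note, another common way to build n-grams is with zip. This is my own sketch, not part of the original answer; it works on both strings and lists, but yields tuples instead of slices:

```python
def n_gram(target, n):
    # zip the sequence against itself shifted by 1, 2, ..., n - 1;
    # each tuple that zip produces is one n-gram
    return list(zip(*(target[i:] for i in range(n))))


words = "I am an NLPer"
print(n_gram(words.split(), 2))  # word bi-grams as tuples
print(n_gram(words, 2))          # character bi-grams as tuples
```

The only behavioral difference from the slice version is the element type: tuples such as ('I', 'am') instead of sub-lists or substrings.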

06. Sets

Find the sets of character bi-grams contained in "paraparaparadise" and "paragraph" as X and Y, respectively, and find the union, intersection, and difference of X and Y. In addition, find out whether the bi-gram 'se' is included in X and in Y.

06.py


def n_gram(target, n):
    return [target[index: index + n] for index in range(len(target) - n + 1)]


word1 = "paraparaparadise"
word2 = "paragraph"
x = set(n_gram(word1, 2))
y = set(n_gram(word2, 2))
# x = {'is', 'ra', 'ad', 'se', 'ar', 'ap', 'pa', 'di'}
# y = {'ag', 'gr', 'ra', 'ar', 'ap', 'pa', 'ph'}

# Union
print(x | y)
# >> {'ph', 'ap', 'is', 'ad', 'pa', 'se', 'di', 'ar', 'gr', 'ag', 'ra'}

# Intersection
print(x & y)
# >> {'ar', 'pa', 'ap', 'ra'}

# Difference
print(x - y)
# >> {'di', 'is', 'se', 'ad'}

print(y - x)
# >> {'ph', 'ag', 'gr'}

print('se' in x)
# >> True

print('se' in y)
# >> False

The part that creates the character bi-grams is the same as before, so I omit the explanation. In Python, sets are handled with the set type: convert the list of bi-grams returned by the n_gram function to a set and compute each set operation.

Whether a string is contained in a set can be determined with the `in` operator.
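Beyond the operators used above, Python sets support a few more operations worth knowing. A small sketch of my own, reusing the bi-gram sets from the answer:

```python
def n_gram(target, n):
    return [target[i:i + n] for i in range(len(target) - n + 1)]


x = set(n_gram("paraparaparadise", 2))
y = set(n_gram("paragraph", 2))

# Symmetric difference: bi-grams contained in exactly one of the two sets
print(x ^ y)

# Every operator also has an equivalent method form
print((x | y) == x.union(y))         # >> True
print((x & y) == x.intersection(y))  # >> True
print((x - y) == x.difference(y))    # >> True
```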

07. Sentence generation by template

Implement a function that takes arguments x, y, z and returns the string "y at x is z". Furthermore, set x = 12, y = "temperature", z = 22.4, and check the execution result.

07.py


def something_at_that_time(hour, something, predicate):
    return "{} at {} is {}".format(something, hour, predicate)


x = 12
y = "temperature"
z = 22.4
print(something_at_that_time(x, y, z))

The format method is useful for template-based sentence generation. Put {} placeholders in the string, and each argument of the format method is converted to a string and substituted into the placeholders in order.

You can also name a placeholder, like {h1}, and fill it with a keyword argument: "{h1}".format(h1=value).

You can also specify a format spec for the converted string (for example, the number of decimal places or zero padding).
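To illustrate those points, here is a small sketch of format specs and the keyword-argument style; the f-string at the end is equivalent and available since Python 3.6:

```python
z = 22.4
print("{:.2f}".format(z))    # >> 22.40 (two decimal places)
print("{:06.2f}".format(z))  # >> 022.40 (zero-padded to width 6)

# Named placeholder filled by a keyword argument
print("{h1} degrees".format(h1=z))  # >> 22.4 degrees

# f-strings accept the same format specs inline
x, y = 12, "temperature"
print(f"{y} at {x} is {z:.1f}")  # >> temperature at 12 is 22.4
```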

08. Ciphertext

Implement the function cipher that converts each character of the given character string with the following specifications.

- If the character is a lowercase letter, replace it with the character whose code is (219 - character code)
- Otherwise, output the character as it is

Use this function to encrypt / decrypt English messages.

08.py


def cipher(string):
    encryption = ""
    for char in string:  # iterating over a string yields its characters
        if char.islower():
            encryption += chr(219 - ord(char))
        else:
            encryption += char
    return encryption


test = "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can."
encryption = cipher(test)
print(encryption)
# >> Hr Hv Lrvw Bvxzfhv Blilm Clfow Nlg Ocrwrav Foflirmv. Nvd Nzgrlmh Mrtsg Aohl Srtm Pvzxv Svxfirgb Cozfhv. Aigsfi Krmt Czm.
normal = cipher(encryption)
print(normal)
# >> Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can.

Encryption and decryption are both handled by the cipher function, which takes a string as its argument. As the problem specifies, for a lowercase letter we get its Unicode code point with the `ord` function, subtract it from 219 to get the encrypted code point, and convert that back to a character with the `chr` function.

The key point of this problem is that decryption is also done by subtracting from 219: since 219 = ord('a') + ord('z'), applying the function twice returns the original string.
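Because 219 - code maps 'a' to 'z', 'b' to 'y', and so on, the same cipher can also be sketched with str.maketrans / str.translate. This is an alternative of my own, not part of the original answer:

```python
import string

lower = string.ascii_lowercase
# Map each lowercase letter to its mirror: 'a'<->'z', 'b'<->'y', ...
# Characters without an entry in the table pass through unchanged.
table = str.maketrans(lower, lower[::-1])


def cipher(text):
    return text.translate(table)


message = "Arthur King Can."
encrypted = cipher(message)
print(encrypted)          # >> Aigsfi Krmt Czm.
print(cipher(encrypted))  # >> Arthur King Can.
```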

09. Typoglycemia

Create a program that, for a string of words separated by spaces, randomly rearranges the order of the letters in each word while leaving the first and last letters in place. However, words of length 4 or less are not rearranged. Give an appropriate English sentence (for example, "I couldn't believe that I could actually understand what I was reading: the phenomenal power of the human mind.") and check the execution result.

This reproduces the phenomenon that humans can still read a word whose middle letters are scrambled, as long as the first and last letters are in the right place.

09.py


import random

input_line = "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind ."
words_list = input_line.split()
ans = []
for i in words_list:
    if len(i) <= 4:
        ans.append(i)
        continue
    char = list(i)
    middle_char = char[1:len(i) - 1]
    ans.append(char[0] + "".join(random.sample(middle_char, len(middle_char))) + char[-1])
print(" ".join(ans))
# >> I cod'nult bvieele that I culod auclatly unserdatnd what I was rdienag : the panmhoeenl poewr of the hmaun mind .

For this problem, the input is first split into a list of words, which are then processed one word at a time. If a word is 4 letters or less, it is appended to the answer list as-is.

If a word does not take the `if` branch, it is converted to a list of characters and the middle characters are extracted with a slice. The first and last characters are kept fixed; `random.sample` returns all elements of the extracted middle list in random order without duplication, the result is turned back into a string with the `join` method, and the first and last characters are concatenated around it before it is appended to the answer list.

Finally, the answer is obtained by calling the `join` method on a space with the answer list as its argument.
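Wrapped up as a reusable function, the same steps could be sketched like this. This is my own variation, not the original code: it uses random.shuffle to permute the middle characters in place instead of random.sample:

```python
import random


def typoglycemia(sentence):
    result = []
    for word in sentence.split():
        if len(word) <= 4:
            # short words are kept as-is
            result.append(word)
        else:
            # shuffle everything between the first and last character
            middle = list(word[1:-1])
            random.shuffle(middle)
            result.append(word[0] + "".join(middle) + word[-1])
    return " ".join(result)


print(typoglycemia("the phenomenal power of the human mind ."))
```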

Summary

In this article, I tried to solve 100 language processing knocks 2020 edition Chapter 1: Preparatory movement problem numbers 05 to 09.

Sets turned out to be used more often than I expected, and converting between characters and character codes is characteristic of language processing, so I learned a lot this time as well.

I'm still inexperienced, so if you have a better answer, please let me know! Thank you.

Continued

- Language processing 100 knocks 2020 version [Chapter 2: UNIX commands 10-14]
- I tried to solve 100 language processing knock 2020 version [Chapter 2: UNIX commands 15-19]
- Language processing 100 knocks 2020 version [Chapter 3: Regular expressions 20-24]
- Language processing 100 knocks 2020 version [Chapter 3: Regular expressions 25-29]

Until last time

- I tried to solve the 2020 version of 100 language processing knocks [Chapter 1: Preparatory movement 00-04]
