100 Language Processing Knock 2020 Chapter 1

The 2020 edition of 100 Language Processing Knock has been released, so I'm taking this opportunity to solve it. This is a markdown export of my Jupyter notebook, so the explanations are brief. I'm posting to Qiita what I plan to compile on my blog and GitHub, in the hope that it will be at least somewhat helpful. I can't thank the teachers who provide such wonderful teaching materials enough.

Chapter 5 and beyond will be added as soon as they are completed.

Chapter 1: Warm-up

(Corrected on 2020/04/15) Revised based on comments from @hi-asano, the author of "100 Language Processing Knock 2020 released! What has changed?". I will add a comprehension-based version of 04 when I have time.

00. Reverse order of strings

Obtain the string in which the characters of the string "stressed" are arranged in reverse (from the end to the beginning).

def reverse_strings(s):
    return s[::-1]

print(reverse_strings("stressed"))
desserts
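As a side note, the same result can be obtained without slicing, using the built-ins `reversed()` and `str.join`; a small sketch (the function name is mine):

```python
def reverse_strings_alt(s):
    # reversed() yields the characters from the end; join reassembles them
    return "".join(reversed(s))

print(reverse_strings_alt("stressed"))  # → desserts
```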

01. "Patatokukashii"

Take out the 1st, 3rd, 5th, and 7th characters of the string "パタトクカシーー" and obtain the concatenated string.

def extract_strings(s):
    # a slice with step 2 picks the 1st, 3rd, 5th, ... characters
    return s[::2]

print(extract_strings("パタトクカシーー"))

パトカー

02. "Police car" + "Taxi" = "Patatokukashii"

Obtain the string "パタトクカシーー" by alternately concatenating the characters of "パトカー" (police car) and "タクシー" (taxi) from the beginning.

# Note: zip() stops at the shorter input, so this only works as intended
# when both strings have the same length
def connect_strings(sone, stwo):
    result = "".join(s1 + s2 for s1, s2 in zip(sone, stwo))
    return result

print(connect_strings("パトカー", "タクシー"))

パタトクカシーー
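The zip-based version silently drops the tail when the two strings differ in length. If the leftover characters should be kept, `itertools.zip_longest` is one option; a sketch (the function name is mine):

```python
from itertools import zip_longest

def connect_strings_longest(sone, stwo):
    # fillvalue="" pads the shorter string, so nothing is dropped
    return "".join(s1 + s2 for s1, s2 in zip_longest(sone, stwo, fillvalue=""))

print(connect_strings_longest("パトカー", "タクシー"))  # → パタトクカシーー
print(connect_strings_longest("abc", "XYZW"))           # → aXbYcZW
```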

03. Pi

Break down the sentence “Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics.” into words, and create a list of the number of (alphabetic) characters in each word, in order of appearance.

The output turns out to be the digits of pi. I was at a loss over how to write this neatly, but I kept it as short as possible by removing commas and periods with a regular expression. The character counts are computed with map rather than a for loop.

import re

def circumference(s):
    # strip commas and periods, then split on whitespace and count each word
    splited = re.split(r'\s', re.sub(r'[,.]', '', s))
    words_len = list(map(len, splited))
    return words_len

sentence = "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."
print(circumference(sentence))
[3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]
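Regular expressions aside, the same counts can also be obtained by stripping punctuation from each word with `str.strip`; a minimal sketch (the function name is mine):

```python
def circumference_alt(s):
    # strip leading/trailing commas and periods from each word, then count
    return [len(w.strip(",.")) for w in s.split()]

sentence = "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."
print(circumference_alt(sentence))  # → [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]
```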

04. Element symbol

Break down the sentence “Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can.” into words. Take the first character of the 1st, 5th, 6th, 7th, 8th, 9th, 15th, 16th, and 19th words, and the first two characters of the other words, and create an associative array (dictionary or map type) from each extracted string to the position of its word (its ordinal number from the beginning).

It should be possible to write this more elegantly using zip and comprehension notation. I'd appreciate it if anyone could suggest a better way.

def element_symbol(s, number):
    out_dict = {}
    splited = re.split(r'\s', s)
    for i, w in enumerate(splited, start=1):
        # words at positions listed in `number` contribute 1 character, the rest 2
        if i in number:
            out_dict[w[:1]] = i
        else:
            out_dict[w[:2]] = i

    return out_dict

sentence = "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can."
only_first_number = [1, 5, 6, 7, 8, 9, 15, 16, 19]
print(element_symbol(sentence, only_first_number))
{'H': 1, 'He': 2, 'Li': 3, 'Be': 4, 'B': 5, 'C': 6, 'N': 7, 'O': 8, 'F': 9, 'Ne': 10, 'Na': 11, 'Mi': 12, 'Al': 13, 'Si': 14, 'P': 15, 'S': 16, 'Cl': 17, 'Ar': 18, 'K': 19, 'Ca': 20}
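Since the post asks for a comprehension version, here is one possible dict-comprehension sketch (the names are mine, not from the original):

```python
def element_symbol_comp(s, number):
    # take 1 character for word positions in `number`, otherwise 2
    return {w[:1 if i in number else 2]: i
            for i, w in enumerate(s.split(), start=1)}

sentence = "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can."
only_first_number = [1, 5, 6, 7, 8, 9, 15, 16, 19]
print(element_symbol_comp(sentence, only_first_number))
```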

05. n-gram

Create a function that creates an n-gram from a given sequence (string, list, etc.). Use this function to get the word bi-gram and the letter bi-gram from the sentence "I am an NLPer".

According to Wikipedia, an N-gram is as follows:

A method that decomposes the search target by characters rather than by words and finds occurrence frequencies, including the N−1 characters that follow. When N is 1 it is called a "uni-gram", when N is 2 a "bi-gram", and when N is 3 a "tri-gram".

~~This time, I decided to implement the word n-gram and the character n-gram as separate functions.~~ A single function can be used generically: split the given string and pass it in as a list to get word n-grams.

def generate_ngram(sentence, N):
    # slide a window of width N over the sequence (works for strings and lists)
    return [sentence[i:i+N] for i in range(len(sentence) - N + 1)]

input_text = "I am an NLPer"

print("Word bi-gram : " + str(generate_ngram(input_text.split(' '), 2)))
print("Character bi-gram : " + str(generate_ngram(input_text, 2)))

Word bi-gram : [['I', 'am'], ['am', 'an'], ['an', 'NLPer']]
Character bi-gram : ['I ', ' a', 'am', 'm ', ' a', 'an', 'n ', ' N', 'NL', 'LP', 'Pe', 'er']
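As another possibility, an n-gram can be written with zip over shifted copies of the sequence; a sketch (the function name is mine, and note it returns tuples rather than slices):

```python
def generate_ngram_zip(seq, n):
    # zip(seq, seq[1:], ..., seq[n-1:]) pairs up shifted views of the sequence
    return list(zip(*(seq[i:] for i in range(n))))

print(generate_ngram_zip("I am an NLPer".split(), 2))
# → [('I', 'am'), ('am', 'an'), ('an', 'NLPer')]
```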

06. Sets

Find the sets of character bi-grams contained in "paraparaparadise" and "paragraph" as X and Y, respectively, and find the union, intersection, and difference of X and Y. In addition, find out whether the bi-gram 'se' is included in X and in Y.

X_text = "paraparaparadise"
Y_text = "paragraph"

X = set(generate_ngram(X_text, 2))
Y = set(generate_ngram(Y_text, 2))

print("Union: " + str(X.union(Y)))
print("Intersection: " + str(X.intersection(Y)))
print("Difference set: " + str(X.difference(Y)))

print("Is se included in X: " + str('se' in X))
print("Is se included in Y: " + str('se' in Y))

Union: {'pa', 'di', 'ph', 'gr', 'is', 'ag', 'ra', 'ad', 'se', 'ap', 'ar'}
Intersection: {'ap', 'pa', 'ra', 'ar'}
Difference set: {'se', 'is', 'di', 'ad'}
Is se included in X: True
Is se included in Y: False
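The set method calls also have operator forms (`|`, `&`, `-`, plus `^` for symmetric difference); a self-contained sketch (the `bigrams` helper is mine):

```python
def bigrams(s):
    # the set of character bi-grams of s
    return {s[i:i+2] for i in range(len(s) - 1)}

X = bigrams("paraparaparadise")
Y = bigrams("paragraph")

print(X | Y)  # union, same as X.union(Y)
print(X & Y)  # intersection, same as X.intersection(Y)
print(X - Y)  # difference, same as X.difference(Y)
print(X ^ Y)  # symmetric difference: bi-grams in exactly one of X, Y
print('se' in X, 'se' in Y)  # → True False
```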

07. Sentence generation by template

Implement a function that takes arguments x, y, and z and returns the string "y at x is z". Furthermore, set x = 12, y = "temperature", z = 22.4 and check the execution result.

Note that f-strings cannot be used before Python 3.6; conversely, articles written before then do not use f-strings.

def generate_temp(x, y, z):
    return f"{y} at {x} is {z}"

print(generate_temp(12, "temperature", 22.4))

temperature at 12 is 22.4
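For Python versions before 3.6, where f-strings aren't available, `str.format` gives the same result; a sketch (the function name is mine):

```python
def generate_temp_format(x, y, z):
    # str.format works on Python 2.7+ and all 3.x versions
    return "{} at {} is {}".format(y, x, z)

print(generate_temp_format(12, "temperature", 22.4))  # → temperature at 12 is 22.4
```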

08. Ciphertext

Implement a function cipher that converts each character of a given string according to the following specification: ・If the character is a lowercase letter, replace it with the character whose code is (219 − character code) ・Output all other characters as they are. Use this function to encrypt and decrypt an English message.

It seems the Unicode code point of a character can be obtained with the built-in function `ord()`, so I tried using it. Substituting with a regular expression is quick, since only characters matching a lowercase letter need to be converted.

def chipher(s):
    result = ""
    for character in s:
        # the regex matches only lowercase letters, so other characters pass through unchanged
        result += re.sub(r'[a-z]', chr(219 - ord(character)), character)
    return result


sentence = "Hi, Thank you for reading my article!!"
print(chipher(sentence))
print(chipher(chipher(sentence)))
Hr, Tszmp blf uli ivzwrmt nb zigrxov!!
Hi, Thank you for reading my article!!

@suzu6 solved it more elegantly using a lambda expression. In retrospect, running a for loop just to apply a regex substitution wasn't very good. In the example below, m is a match object, and group (https://note.nkmk.me/python-re-match-object-span-group/) is used to get the matched string; group(0) returns the entire matched string.

def cipher(src):
    return re.sub(r'[a-z]', lambda m: chr(219 - ord(m.group(0))), src)

text = "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can."

# encryption
print(cipher(text))
Hr Hv Lrvw Bvxzfhv Blilm Clfow Nlg Ocrwrav Foflirmv. Nvd Nzgrlmh Mrtsg Aohl Srtm Pvzxv Svxfirgb Cozfhv. Aigsfi Krmt Czm.
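Another option is to precompute the character mapping once with `str.maketrans` and apply it with `str.translate`; a sketch (the names are mine):

```python
import string

# map each lowercase letter c to chr(219 - ord(c)); all other characters pass through
TABLE = str.maketrans({c: chr(219 - ord(c)) for c in string.ascii_lowercase})

def cipher_translate(src):
    return src.translate(TABLE)

text = "Hi, Thank you for reading my article!!"
print(cipher_translate(text))                    # → Hr, Tszmp blf uli ivzwrmt nb zigrxov!!
print(cipher_translate(cipher_translate(text)))  # → Hi, Thank you for reading my article!!
```

Because the mapping is its own inverse, applying the function twice recovers the original message.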

09. Typoglycemia

Create a program that, given a sequence of words separated by spaces, randomly rearranges the order of the letters in each word while leaving the first and last letters in place. However, words of length 4 or less are not rearranged. Give it an appropriate English sentence (for example, "I couldn't believe that I could actually understand what I was reading: the phenomenal power of the human mind.") and check the execution result.

It might have been better not to use a list comprehension here, but it was just the right length for practice, so I did. The output is a sentence you can still somehow read.

import random

def mixing_word(sentence):
    splited = sentence.split(" ")
    # words of length 4 or less are left as-is
    randomed_list = [s[0] + ''.join(random.sample(s[1:-1], len(s) - 2)) + s[-1] if len(s) > 4 else s for s in splited]
    return " ".join(randomed_list)

input_text = "I couldn’t believe that I could actually understand what I was reading : the phenomenal power of the human mind."
mixing_word(input_text)
'I cndolu’t beveile that I colud aulltacy unndtserad what I was raednig : the penhaneoml pwoer of the hmuan mdin.'
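Drawing the whole middle of the word with `random.sample` (without replacement) amounts to shuffling it. The same idea can be written with `random.shuffle`, and fixing the seed makes a run reproducible; a sketch (the function name is mine):

```python
import random

def mix_word(word):
    # leave words of length 4 or less untouched, per the problem statement
    if len(word) <= 4:
        return word
    middle = list(word[1:-1])
    random.shuffle(middle)  # shuffle only the interior characters in place
    return word[0] + "".join(middle) + word[-1]

random.seed(0)  # fixed seed → the same shuffle on every run
print(" ".join(mix_word(w) for w in "the phenomenal power of the human mind.".split()))
```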

Finally

I'm from a biology background and taught myself programming without doing competitive programming, so I'd appreciate it if you could point out anything that's wrong.
