The 2020 edition of 100 Language Processing Knock was released the other day. I have only been working in natural language processing for about a year and there is a lot I don't know, but I am going to solve every problem and publish my solutions to improve my technical skills.

This is my first article on Qiita, so I'm still figuring out how things work.

Everything is run in Jupyter Notebook, and I may bend the constraints in a problem statement where convenient. The source code is also on GitHub.

The environment is Python 3.8.2 and Ubuntu 18.04.

Chapter 1: Warm-up

00. Reverse order of strings

Obtain the string in which the characters of "stressed" are arranged in reverse order (from the end to the beginning).

`code`


x = 'stressed'
x = x[::-1]
x

`output`


'desserts'

This is a slice operation. In `[start:stop:step]`, a negative step makes the slice walk through the string in reverse order.
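As a small illustration of the `[start:stop:step]` rule described above, the following sketch compares a few slices of the same string. The variable names are chosen only for this example.

```python
# Illustrative sketch of slicing with [start:stop:step].
s = 'stressed'

reversed_s = s[::-1]             # negative step: walk from the end to the beginning
every_other = s[::2]             # step of 2: indices 0, 2, 4, 6
last_three_reversed = s[:-4:-1]  # start at the end, stop before index -4

print(reversed_s)           # desserts
print(every_other)          # srse
print(last_three_reversed)  # des
```

When start or stop is omitted, Python picks the appropriate end of the string for the direction of travel, which is why `[::-1]` covers the whole string.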

01. "パタトクカシーー"

Take the 1st, 3rd, 5th, and 7th characters of the string "パタトクカシーー" and concatenate them into a new string.

When I googled "パタトクカシーー", it seems to come from PythagoraSwitch.

`code`


x = 'パタトクカシーー'
x = x[::2]
x

`output`


'パトカー'

You could pick out the 1st, 3rd, 5th, and 7th characters one by one, but a slice with a step of 2 is simpler. The result, パトカー, means "police car".
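To make the comparison above concrete, here is a minimal sketch that extracts the characters both ways. It assumes the original Japanese input string パタトクカシーー; the names `by_index` and `by_slice` are just for illustration.

```python
x = 'パタトクカシーー'

# The 1st, 3rd, 5th, and 7th characters sit at indices 0, 2, 4, and 6.
by_index = ''.join(x[i] for i in (0, 2, 4, 6))

# A slice with step 2 visits the same indices for this 8-character string.
by_slice = x[::2]

print(by_index, by_slice)
```

The two agree here only because the string has exactly eight characters; for a longer string the slice would keep going past the 7th character.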

02. "パトカー" + "タクシー" = "パタトクカシーー"

Obtain the string "パタトクカシーー" by alternately concatenating the characters of "パトカー" (police car) and "タクシー" (taxi) from the beginning.

Grouping the elements at the same position in several lists is what the zip function does. Here the two strings have the same length, so zip would be enough; if the lengths differ, the result of zip (an iterator, to be exact) stops at the length of the shorter input, and the trailing characters of the longer one are silently dropped.

With zip_longest from `itertools`, the result is padded out to the length of the longer input instead.
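The difference between the two functions is easiest to see with inputs of unequal length. This is a small sketch with made-up strings, not part of the original solution.

```python
from itertools import zip_longest

short, long_ = 'ab', 'wxyz'

truncated = list(zip(short, long_))                       # stops with the shorter input
padded = list(zip_longest(short, long_))                  # pads the shorter input with None
padded_empty = list(zip_longest(short, long_, fillvalue=''))  # pads with '' instead

print(truncated)     # [('a', 'w'), ('b', 'x')]
print(padded)        # [('a', 'w'), ('b', 'x'), (None, 'y'), (None, 'z')]
print(padded_empty)  # [('a', 'w'), ('b', 'x'), ('', 'y'), ('', 'z')]
```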

`code`


from itertools import zip_longest

`code`


x1 = 'パトカー'
x2 = 'タクシー'
x = [
    char
    for two_chars in zip_longest(x1, x2, fillvalue='')
    for char in two_chars
]
x = ''.join(x)
x

`output`


'パタトクカシーー'

By default zip_longest pads the shorter input with None, so I pass fillvalue='' to pad with the empty string instead. The two for clauses in the list comprehension form a double loop, read left to right from outer to inner. Once you have confirmed that it behaves the same as two nested for statements, it no longer looks difficult.
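For readers who want to check the equivalence mentioned above, here is a sketch that writes the same logic as two nested for statements and compares it with the comprehension. It assumes the original Japanese inputs パトカー and タクシー.

```python
from itertools import zip_longest

x1, x2 = 'パトカー', 'タクシー'

# Explicit double loop: the outer loop yields pairs, the inner loop unpacks each pair.
chars = []
for two_chars in zip_longest(x1, x2, fillvalue=''):
    for char in two_chars:
        chars.append(char)
nested_result = ''.join(chars)

# The same thing as a single comprehension; the for clauses appear in the same order.
comprehension_result = ''.join(
    char
    for two_chars in zip_longest(x1, x2, fillvalue='')
    for char in two_chars
)

print(nested_result == comprehension_result)
```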

03. Pi

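Split the sentence "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics." into words, and create a list of the number of alphabetic characters in each word, in order of appearance.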

It's a mnemonic for the digits of pi. Honestly, memorizing the digits directly seems easier.

`code`


import re

`code`


x = 'Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics.'
x = re.sub(r'[^\w\s]', '', x)
x = x.split(' ')
x = [len(word) for word in x]
x

`output`


[3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]

First, a regular expression removes everything except word characters (`\w`: letters, digits, and underscore) and whitespace, which for this sentence strips the commas and the period. Then the text is split on spaces and the length of each word is taken. I stick to regular expressions that a human can read; I haven't learned the dark arts, so I can't write clever ones.
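The following sketch shows what the pattern `[^\w\s]` does and does not remove, along with one possible alternative that extracts runs of ASCII letters directly with `re.findall`. The shortened sentence and the variable names are assumptions made for this example.

```python
import re

sentence = 'Now I need a drink, alcoholic of course.'

# [^\w\s] matches any character that is neither a word character nor whitespace.
stripped = re.sub(r'[^\w\s]', '', sentence)
lengths = [len(word) for word in stripped.split()]

# Alternative: collect runs of ASCII letters, ignoring everything else.
lengths_alt = [len(word) for word in re.findall(r'[A-Za-z]+', sentence)]

# Note that \w also keeps digits and underscores, not only letters.
kept = re.sub(r'[^\w\s]', '', 'a_1, b!')

print(lengths)      # [3, 1, 4, 1, 5, 9, 2, 6]
print(lengths_alt)  # [3, 1, 4, 1, 5, 9, 2, 6]
print(kept)         # a_1 b
```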

04. Element symbol

Split the sentence "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can." into words. Take the first character of the 1st, 5th, 6th, 7th, 8th, 9th, 15th, 16th, and 19th words and the first two characters of the other words, and create an associative array (dictionary or map type) from each extracted string to the position of its word (counted from the beginning of the sentence).

`code`


x = 'Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can.'
x = x.split(' ')
idx = {1, 5, 6, 7, 8, 9, 15, 16, 19}
d1 = [
    (num + 1, word[:1])
    for num, word in enumerate(x)
    if num + 1 in idx
]
d2 = [
    (num + 1, word[:2])
    for num, word in enumerate(x)
    if num + 1 not in idx
]
dct = {name:num for num, name in d1 + d2}
dct

`output`


{'H': 1,
 'B': 5,
 'C': 6,
 'N': 7,
 'O': 8,
 'F': 9,
 'P': 15,
 'S': 16,
 'K': 19,
 'He': 2,
 'Li': 3,
 'Be': 4,
 'Ne': 10,
 'Na': 11,
 'Mi': 12,
 'Al': 13,
 'Si': 14,
 'Cl': 17,
 'Ar': 18,
 'Ca': 20}

Split the input on spaces, cut each token at the given position down to its first one or two characters, and build a tuple of the position and the element symbol. The tuples are then stored in a dictionary. Python list indices start at 0, so the only thing to watch for is that they are off by one from the word position (which is also the atomic number).
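One way to sidestep the off-by-one bookkeeping described above is to let `enumerate` start counting at 1. This is an alternative sketch, not the solution used in this article; `one_char_positions` and `symbols` are names chosen for illustration.

```python
sentence = ('Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations '
            'Might Also Sign Peace Security Clause. Arthur King Can.')
one_char_positions = {1, 5, 6, 7, 8, 9, 15, 16, 19}

symbols = {}
# start=1 makes `position` match the 1-based word position directly.
for position, word in enumerate(sentence.split(), start=1):
    length = 1 if position in one_char_positions else 2
    symbols[word[:length]] = position

print(symbols['H'], symbols['He'], symbols['Mi'], symbols['Ca'])  # 1 2 12 20
```

A single pass also keeps the dictionary in word order, whereas building two lists and concatenating them puts the one-character symbols first.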

Magnesium has become Mi.

05. n-gram

Create a function that builds n-grams from a given sequence (string, list, etc.). Use this function to get the word bi-grams and the character bi-grams from the sentence "I am an NLPer".

For building n-grams, I like the approach in the following article.

Easy and fast ngram with Python

`code`


def ngram(n, lst):
    return list(zip(*[lst[i:] for i in range(n)]))

`code`


chars = 'I am an NLPer'
char_bi_gram = ngram(2, chars)

words = chars.split(' ')
word_bi_gram = ngram(2, words)

print('Character bi-gram', char_bi_gram)
print('Word bi-gram', word_bi_gram)

`output`


Character bi-gram [('I', ' '), (' ', 'a'), ('a', 'm'), ('m', ' '), (' ', 'a'), ('a', 'n'), ('n', ' '), (' ', 'N'), ('N', 'L'), ('L', 'P'), ('P', 'e'), ('e', 'r')]
Word bi-gram [('I', 'am'), ('am', 'an'), ('an', 'NLPer')]

Written this way, the same function produces n-grams for both strings and lists, which is convenient.
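To see why the one-liner works on any sequence, it may help to look at the intermediate shifted copies that get zipped together. The sketch below repeats the `ngram` definition so it runs on its own; `shifted` is a name introduced only for this illustration.

```python
def ngram(n, lst):
    return list(zip(*[lst[i:] for i in range(n)]))

words = ['I', 'am', 'an', 'NLPer']

# For n = 3, the comprehension builds three copies, each shifted by one more position.
shifted = [words[i:] for i in range(3)]
# zip(*shifted) then reads one column at a time and stops at the shortest copy.
trigrams = ngram(3, words)

print(shifted)
print(trigrams)         # [('I', 'am', 'an'), ('am', 'an', 'NLPer')]
print(ngram(5, words))  # [] because n is larger than the sequence
```

Because slicing works the same way on strings and lists, nothing in the function depends on the element type.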

I chose the argument order of ngram (n first) so that functools.partial can be applied to create a function for a specific n.

`code`


from functools import partial

`code`


bigram = partial(ngram, 2)
bigram(chars)

`output`


[('I', ' '),
 (' ', 'a'),
 ('a', 'm'),
 ('m', ' '),
 (' ', 'a'),
 ('a', 'n'),
 ('n', ' '),
 (' ', 'N'),
 ('N', 'L'),
 ('L', 'P'),
 ('P', 'e'),
 ('e', 'r')]

I was able to create a function to find the bi-gram.

06. Set

Let X and Y be the sets of character bi-grams contained in "paraparaparadise" and "paragraph", respectively. Find the union, intersection, and difference of X and Y. In addition, check whether the bi-gram 'se' is included in X and in Y.

`code`


str1 = 'paraparaparadise'
str2 = 'paragraph'
X = set(ngram(2, str1))
Y = set(ngram(2, str2))
print('Intersection', X & Y)
print('Difference set', X - Y)
print('se in X?', ('s', 'e') in X)
print('se in Y?', ('s', 'e') in Y)

`output`


Intersection {('r', 'a'), ('p', 'a'), ('a', 'r'), ('a', 'p')}
Difference set {('a', 'd'), ('s', 'e'), ('i', 's'), ('d', 'i')}
se in X? True
se in Y? False

These are set operations. You might expect `and` and `or` to work here, but they do not perform set operations; use `&` and `|` instead. The methods `union`, `intersection`, and so on are also available, and there are other conveniences, such as testing for a subset with `<=`.
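The points above can be checked with a short sketch, which also computes the union that the problem asks for. `char_bigrams` is a hypothetical helper written for this example so that the block runs on its own; it is not the `ngram` function from problem 05.

```python
def char_bigrams(text):
    # Hypothetical helper: pair each character with its successor.
    return set(zip(text, text[1:]))

X = char_bigrams('paraparaparadise')
Y = char_bigrams('paragraph')

union = X | Y
print(len(union))  # 11

# Operators and methods give the same results.
print(union == X.union(Y), X & Y == X.intersection(Y), X - Y == X.difference(Y))

# `and` is plain boolean logic: for a non-empty X it simply returns Y.
print((X and Y) == Y)  # True

# Subset test: the intersection is always a subset of X.
print((X & Y) <= X)  # True
```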

07. Sentence generation by template

Implement a function that takes arguments x, y, z and returns the string "x時のyはz" ("y at x o'clock is z"). Then set x = 12, y = "気温" ("temperature"), z = 22.4 and check the result.

`code`


def temperature(x, y, z):
    # "x時のyはz" means "y at x o'clock is z"
    return '{}時の{}は{}'.format(x, y, z)

`code`


temperature(12, '気温', 22.4)

`output`


'12時の気温は22.4'

This is string formatting. You can control how each value is displayed by writing a format specification inside {}. If you don't need anything complicated, you can also write it as the f-string f'{x}時の{y}は{z}'. The % operator works too, but I only reach for it when I want printf-style formatting.
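The three formatting styles mentioned above can be compared side by side. This sketch uses an English template for readability, which is an assumption of the example and not the string used in the solution.

```python
x, y, z = 12, 'temperature', 22.4

with_format = '{} at {} is {}'.format(y, x, z)  # str.format with positional fields
with_fstring = f'{y} at {x} is {z}'             # f-string, evaluated in place
with_percent = '%s at %d is %.1f' % (y, x, z)   # printf-style % operator

# A format specification after the colon controls the display.
padded = f'{x:04d} / {z:.2f}'

print(with_format)   # temperature at 12 is 22.4
print(with_fstring)  # temperature at 12 is 22.4
print(with_percent)  # temperature at 12 is 22.4
print(padded)        # 0012 / 22.40
```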

08. Ciphertext

Implement the function cipher that converts each character of the given string according to the following specification.

・ If the character is a lowercase letter, replace it with the character whose code is (219 - character code)
・ Output all other characters as they are

Use this function to encrypt and decrypt an English message.

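Strictly speaking this is an Atbash-style substitution rather than a Caesar cipher: it mirrors the alphabet (a becomes z, b becomes y, and so on) instead of shifting it.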

Try using The quick brown fox jumps over the lazy dog as an English message.

`code`


def cipher(xs):
    xs = [
        chr(219 - ord(x)) if x.islower() else x
        for x in xs
    ]
    return ''.join(xs)

`code`


x = 'The quick brown fox jumps over the lazy dog. 1234567890'
print('Plaintext', x)
x = cipher(x)
print('Ciphertext', x)
x = cipher(x)
print('Decrypted text', x)

`output`


Plaintext The quick brown fox jumps over the lazy dog. 1234567890
Ciphertext Tsv jfrxp yildm ulc qfnkh levi gsv ozab wlt. 1234567890
Decrypted text The quick brown fox jumps over the lazy dog. 1234567890

The task is to convert each character to its character code, apply the substitution, and turn the result back into a string. You need to know how to use `ord` and `chr`, and to notice that applying `cipher` twice decrypts the message, because 219 is the sum of the codes of 'a' (97) and 'z' (122).
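The reason applying the function twice restores the message can be checked directly. The variant below restricts the substitution to ASCII a-z, which is an assumption of this sketch: with `str.islower()`, a non-ASCII lowercase letter such as 'é' would make `219 - ord(x)` negative and `chr` would raise a ValueError.

```python
def cipher(text):
    # Only ASCII lowercase letters are mirrored; everything else passes through.
    return ''.join(
        chr(219 - ord(c)) if 'a' <= c <= 'z' else c
        for c in text
    )

# 219 is ord('a') + ord('z'), so the mapping mirrors the alphabet: a<->z, b<->y, ...
print(ord('a') + ord('z'))  # 219

message = 'The quick brown fox.'
encrypted = cipher(message)
print(encrypted)                     # Tsv jfrxp yildm ulc.
print(cipher(encrypted) == message)  # True
```

Since 219 - (219 - c) = c, the mapping is its own inverse, so no separate decryption function is needed.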

09. Typoglycemia

Create a program that, for a sequence of words separated by spaces, randomly shuffles the letters of each word while keeping the first and last letters in place. Words of length 4 or less are left unchanged. Give it a suitable English sentence (for example, "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind .") and check the result.

Nico Nico Pedia has a detailed entry: Typoglycemia

`code`


import random as rd

`code`


def shuffle_str(x):
    x = list(x)
    rd.shuffle(x)
    return ''.join(x)

def typoglycemia(x):
    if len(x) <= 4:
        return x
    return x[0] + shuffle_str(x[1:-1]) + x[-1]

`code`


x = "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind ."
x = x.split(' ')
x = [typoglycemia(word) for word in x]
x = ' '.join(x)
x

`output`


"I c'lndout bliveee that I could allcutay uasntrdend what I was radneig : the penhnmaeol poewr of the haumn mind ."

Strings of length 4 or less are returned unchanged; for longer strings, everything from the second character through the second-to-last character is shuffled. Python strings, like tuples, are immutable, so you cannot assign to individual positions. Instead, you build a new string by concatenating the first character, the shuffled middle, and the last character.
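The immutability point can be seen directly, and the shuffle can be made reproducible for testing. In this sketch the optional `rng` parameter is an addition for illustration and is not part of the solution above; passing a seeded `random.Random` instance gives repeatable output.

```python
import random

word = 'reading'

# Strings are immutable: item assignment raises TypeError.
try:
    word[1] = 'x'
    assignment_worked = True
except TypeError:
    assignment_worked = False
print(assignment_worked)  # False

def typoglycemia(word, rng=random):
    # `rng` is a hypothetical hook: the random module itself or a random.Random instance.
    if len(word) <= 4:
        return word
    middle = list(word[1:-1])  # copy the middle into a mutable list
    rng.shuffle(middle)        # shuffle in place
    return word[0] + ''.join(middle) + word[-1]

shuffled = typoglycemia(word, random.Random(0))
print(shuffled[0], shuffled[-1], sorted(shuffled) == sorted(word))
```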

Next is Chapter 2

Language processing 100 knocks 2020 Chapter 2: UNIX commands
