[PYTHON] 100 Language Processing Knock 2020 Chapter 1: Preparatory Movement

The other day, 100 Language Processing Knock 2020 was released. I have only been working in natural language processing for a year and there is a lot I don't know, but I will solve all the problems and publish my answers in order to improve my technical skills.

This is my first article on Qiita, so I don't yet know how anything works here.

Everything is run in Jupyter Notebook, and I may bend the constraints of the problem statements where convenient. The source code is also on GitHub.

The environment is Python 3.8.2 and Ubuntu 18.04.

Chapter 1: Preparatory movement

00. Reverse order of strings

Get the string obtained by arranging the characters of "stressed" in reverse order (from the end to the beginning).

code


x = 'stressed'
x = x[::-1]
x

output


'desserts'

This is a slice operation. In [start:stop:step], a negative step walks the sequence from the end toward the beginning, so [::-1] returns the whole string reversed.
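As a small sketch of how the three slice fields interact (the variable names are my own, not part of the problem), the following applies a few negative-step slices to the same string and compares them with the built-in reversed():

```python
word = 'stressed'

# A negative step walks from the end toward the beginning.
reversed_word = word[::-1]

# Explicit start/stop also work with a negative step:
# start at index 3 and go down to (but not including) index 0.
partial_reverse = word[3:0:-1]

# reversed() yields the characters lazily, so join them back into a string.
reversed_with_builtin = ''.join(reversed(word))

print(reversed_word)          # desserts
print(partial_reverse)        # ert
print(reversed_with_builtin)  # desserts
```

The slice form is the shortest, while reversed() also works on iterables that do not support slicing.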

01. 「パタトクカシーー」 (Patatokukashi)

Take out the 1st, 3rd, 5th, and 7th characters of the string 「パタトクカシーー」 and get the string formed by concatenating them.

When I googled パタトクカシーー, it seems to originate from the TV program PythagoraSwitch.

code


x = 'パタトクカシーー'
x = x[::2]
x

output


'パトカー'

You could pick out the 1st, 3rd, 5th, and 7th characters one by one, but a slice with a step of 2 is easier.
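To make the comparison concrete, here is a minimal sketch using the original Japanese string from the problem, パタトクカシーー. It picks the characters by explicit index and by slice, and the variable names are only illustrative:

```python
text = 'パタトクカシーー'

# Picking the 1st, 3rd, 5th, and 7th characters one index at a time
# (Python indices start at 0, so these are indices 0, 2, 4, 6).
picked = ''.join(text[i] for i in (0, 2, 4, 6))

# The same result with a slice whose step is 2.
sliced = text[::2]

# Starting the slice at index 1 gives the other half of the string.
other_half = text[1::2]

print(picked)      # パトカー
print(sliced)      # パトカー
print(other_half)  # タクシー
```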

02. 「パトカー」+「タクシー」=「パタトクカシーー」

Obtain the string 「パタトクカシーー」 by alternately joining the characters of 「パトカー」 (police car) and 「タクシー」 (taxi) from the beginning.

Grouping the elements at the same position in several lists is what the zip function does. The two strings here have the same length, so zip would be fine, but when the lengths differ, the iterator returned by zip stops at the end of the shorter input, and the trailing characters of the longer input are silently dropped.

If you use zip_longest from itertools, the result is padded out to the length of the longer input instead.

code


from itertools import zip_longest

code


x1 = 'パトカー'
x2 = 'タクシー'
x = [
    char
    for two_chars in zip_longest(x1, x2, fillvalue='')
    for char in two_chars
]
x = ''.join(x)
x

output


'パタトクカシーー'

zip_longest pads the shorter input with None by default, so pass fillvalue='' to pad with the empty string instead. The two for clauses in the list comprehension form a nested loop. Once you have confirmed that it behaves the same as a nested for statement written out in full, there is nothing difficult about it.
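The difference between zip and zip_longest only shows up with inputs of different lengths, so here is a small sketch with made-up strings (not from the problem). It also writes the nested comprehension out as an explicit double loop to check that the two are equivalent:

```python
from itertools import zip_longest

long_word = 'abcde'
short_word = 'XY'

# zip stops as soon as the shorter input is exhausted.
truncated = list(zip(long_word, short_word))

# zip_longest keeps going and pads the shorter input with fillvalue.
padded = list(zip_longest(long_word, short_word, fillvalue=''))

# Nested list comprehension: the outer for comes first, the inner for second.
flat = [ch for pair in padded for ch in pair]

# The same thing written as an explicit double loop.
flat_loop = []
for pair in padded:
    for ch in pair:
        flat_loop.append(ch)

interleaved = ''.join(flat)

print(truncated)    # [('a', 'X'), ('b', 'Y')]
print(interleaved)  # aXbYcde
```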

03. Pi

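Break down the sentence "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics." into words, and create a list of the number of (alphabetic) characters in each word, in order of appearance.

This is a mnemonic for the digits of pi. It would be easier to just memorize the digits.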

code


import re

code


x = 'Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics.'
x = re.sub(r'[^\w\s]', '', x)
x = x.split(' ')
x = [len(word) for word in x]
x

output


[3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]

First, a regular expression removes everything except word characters (letters, digits, and the underscore, which is what \w matches) and whitespace. Then the text is split on spaces and the length of each word is taken. This is about as far as a human-readable regular expression goes. I haven't learned the magic, so I can't write clever regular expressions.
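As a hedged alternative to removing punctuation and then splitting, one could extract the runs of letters directly with re.findall. This is my own variation rather than the approach used above, and the pattern [A-Za-z]+ would split words containing apostrophes, which this particular sentence does not have:

```python
import re

sentence = ('Now I need a drink, alcoholic of course, after the heavy '
            'lectures involving quantum mechanics.')

# Approach used in the article: strip non-word, non-space characters, then split.
lengths_sub = [len(word) for word in re.sub(r'[^\w\s]', '', sentence).split(' ')]

# Alternative sketch: pull out each run of ASCII letters directly.
lengths_findall = [len(word) for word in re.findall(r'[A-Za-z]+', sentence)]

print(lengths_findall)  # [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]
```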

04. Element symbol

Break down the sentence “Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can.” into words. Take the first character of the 1st, 5th, 6th, 7th, 8th, 9th, 15th, 16th, and 19th words and the first two characters of all other words, and create an associative array (dictionary or map type) from each extracted string to its word position (which word it is, counting from the beginning).

code


x = 'Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can.'
x = x.split(' ')
idx = {1, 5, 6, 7, 8, 9, 15, 16, 19}
d1 = [
    (num + 1, word[:1])
    for num, word in enumerate(x)
    if num + 1 in idx
]
d2 = [
    (num + 1, word[:2])
    for num, word in enumerate(x)
    if num + 1 not in idx
]
dct = {name:num for num, name in d1 + d2}
dct

output


{'H': 1,
 'B': 5,
 'C': 6,
 'N': 7,
 'O': 8,
 'F': 9,
 'P': 15,
 'S': 16,
 'K': 19,
 'He': 2,
 'Li': 3,
 'Be': 4,
 'Ne': 10,
 'Na': 11,
 'Mi': 12,
 'Al': 13,
 'Si': 14,
 'Cl': 17,
 'Ar': 18,
 'Ca': 20}

Split the input on spaces and take the first one or two characters of each token, depending on its position, to make a tuple of the position and the element symbol. Then store the tuples in a dictionary. Python list indices start at 0, so access itself is not difficult; just be careful that the index is off by one from the atomic number.

Magnesium has become Mi, because the 12th word is "Might".
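The two list comprehensions can also be folded into a single dictionary comprehension. The following is only a sketch of that idea, with variable names of my own choosing; it uses enumerate(..., start=1) so that the positions are 1-based from the outset:

```python
sentence = ('Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations '
            'Might Also Sign Peace Security Clause. Arthur King Can.')

# Positions (1-based) whose symbol is a single character.
single_char_positions = {1, 5, 6, 7, 8, 9, 15, 16, 19}

# Key: one or two leading characters, value: 1-based word position.
symbol_to_position = {
    (word[:1] if pos in single_char_positions else word[:2]): pos
    for pos, word in enumerate(sentence.split(' '), start=1)
}

print(symbol_to_position['H'])   # 1
print(symbol_to_position['Mi'])  # 12
print(symbol_to_position['Ca'])  # 20
```

Unlike the two-list version, this keeps the entries in word order, because dictionaries preserve insertion order.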

05. n-gram

Create a function that builds n-grams from a given sequence (string, list, etc.). Use this function to get the word bi-grams and the character bi-grams from the sentence "I am an NLPer".

For building n-grams, I like the approach in the following article.

Easy and fast ngram with Python

code


def ngram(n, lst):
    return list(zip(*[lst[i:] for i in range(n)]))

code


chars = 'I am an NLPer'
char_bi_gram = ngram(2, chars)

words = chars.split(' ')
word_bi_gram = ngram(2, words)

print('Character bi-gram', char_bi_gram)
print('Word bi-gram', word_bi_gram)

output


Character bi-gram [('I', ' '), (' ', 'a'), ('a', 'm'), ('m', ' '), (' ', 'a'), ('a', 'n'), ('n', ' '), (' ', 'N'), ('N', 'L'), ('L', 'P'), ('P', 'e'), ('e', 'r')]
Word bi-gram [('I', 'am'), ('am', 'an'), ('an', 'NLPer')]

Written this way, the same function produces n-grams for both strings and lists, which is convenient.

I put n first in the argument list of ngram on the assumption that functools.partial would be applied to it, fixing n to create a dedicated function for each n-gram size.
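To see why the one-liner works, it helps to look at the intermediate list of shifted slices that is passed to zip. The sketch below repeats the ngram function so that it runs on its own; the variable shifted exists only for illustration:

```python
def ngram(n, seq):
    # Build n copies of seq, each shifted one position further, and zip them.
    return list(zip(*[seq[i:] for i in range(n)]))

words = ['I', 'am', 'an', 'NLPer']

# For n = 2 the shifted copies are the sequence itself and the sequence minus its head.
shifted = [words[i:] for i in range(2)]
print(shifted)  # [['I', 'am', 'an', 'NLPer'], ['am', 'an', 'NLPer']]

# zip pairs up elements at the same position and stops at the shortest copy,
# so a sequence of length L yields L - n + 1 n-grams.
print(ngram(2, words))  # [('I', 'am'), ('am', 'an'), ('an', 'NLPer')]
print(ngram(3, words))  # [('I', 'am', 'an'), ('am', 'an', 'NLPer')]
```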

code


from functools import partial

code


bigram = partial(ngram, 2)
bigram(chars)

output


[('I', ' '),
 (' ', 'a'),
 ('a', 'm'),
 ('m', ' '),
 (' ', 'a'),
 ('a', 'n'),
 ('n', ' '),
 (' ', 'N'),
 ('N', 'L'),
 ('L', 'P'),
 ('P', 'e'),
 ('e', 'r')]

I was able to create a function to find the bi-gram.

06. Sets

Let X and Y be the sets of character bi-grams contained in "paraparaparadise" and "paragraph", respectively, and find the union, intersection, and difference of X and Y. In addition, find out whether the bi-gram 'se' is included in X and in Y.

code


str1 = 'paraparaparadise'
str2 = 'paragraph'
X = set(ngram(2, str1))
Y = set(ngram(2, str2))
print('Intersection', X & Y)
print('Difference set', X - Y)
print('se in X?', ('s', 'e') in X)
print('se in Y?', ('s', 'e') in Y)

output


Intersection {('r', 'a'), ('p', 'a'), ('a', 'r'), ('a', 'p')}
Difference set {('a', 'd'), ('s', 'e'), ('i', 's'), ('d', 'i')}
se in X? True
se in Y? False

These are set operations. It looks as though and and or might work, but they do not compute the intersection or union. Use & and | instead; the union and intersection methods are also available. There is more besides, such as testing for a subset with <=. The union asked for in the problem is X | Y.
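The following sketch collects the set operators mentioned above in one place, including the union. It redefines ngram so that it is self-contained, and it shows what `and` actually does with two sets: it returns one of its operands unchanged rather than combining them.

```python
def ngram(n, seq):
    return list(zip(*[seq[i:] for i in range(n)]))

X = set(ngram(2, 'paraparaparadise'))
Y = set(ngram(2, 'paragraph'))

union = X | Y            # same as X.union(Y)
intersection = X & Y     # same as X.intersection(Y)
x_minus_y = X - Y        # same as X.difference(Y)
y_minus_x = Y - X

# `and` is a boolean operator: for a non-empty X it simply returns Y.
not_an_intersection = X and Y

print(len(union))                  # 11
print(not_an_intersection is Y)    # True
print(intersection <= X)           # True: the intersection is a subset of X
```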

07. Sentence generation by template

Implement a function that takes arguments x, y, z and returns the string "y at x is z". Furthermore, set x = 12, y = "temperature", z = 22.4, and check the execution result.

code


def temperature(x, y, z):
    # Positional indices let the template reorder the arguments: "y at x is z".
    return '{1} at {0} is {2}'.format(x, y, z)

code


temperature(12, 'temperature', 22.4)

output


'temperature at 12 is 22.4'

This is string formatting. You can control the display format by writing format specifiers inside {}. If you don't need anything complicated, you can write it as an f-string: f'{y} at {x} is {z}'. There is also the % operator, but I only use it when I want to write something printf-like, and I think that is fine.
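For comparison, here is a short sketch of the three formatting styles mentioned above, applied to the same values, plus one format specifier to show the kind of control that {} allows. The variable names are illustrative only:

```python
x, y, z = 12, 'temperature', 22.4

# str.format with positional indices.
by_format = '{1} at {0} is {2}'.format(x, y, z)

# f-string: expressions are written directly inside the braces.
by_fstring = f'{y} at {x} is {z}'

# printf-style % operator.
by_percent = '%s at %s is %s' % (y, x, z)

# A format specifier after the colon controls the display, e.g. two decimals.
two_decimals = f'{z:.2f}'

print(by_fstring)    # temperature at 12 is 22.4
print(two_decimals)  # 22.40
```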

08. Ciphertext

Implement the function cipher that converts each character of the given string according to the following specification.
・ If it is a lowercase letter, replace it with the character whose code is (219 - character code)
・ Output all other characters as they are
Use this function to encrypt and decrypt an English message.

This is a substitution cipher that mirrors the lowercase alphabet (a becomes z, b becomes y, and so on). It is known as the Atbash cipher rather than a Caesar shift.

I'll try it with "The quick brown fox jumps over the lazy dog" as the English message.

code


def cipher(xs):
    xs = [
        chr(219 - ord(x)) if x.islower() else x
        for x in xs
    ]
    return ''.join(xs)

code


x = 'The quick brown fox jumps over the lazy dog. 1234567890'
print('Plaintext', x)
x = cipher(x)
print('Ciphertext', x)
x = cipher(x)
print('Decrypted text', x)

output


Plaintext The quick brown fox jumps over the lazy dog. 1234567890
Ciphertext Tsv jfrxp yildm ulc qfnkh levi gsv ozab wlt. 1234567890
Decrypted text The quick brown fox jumps over the lazy dog. 1234567890

The task is to convert each character to its character code, apply the substitution, and convert the result back to a string. You need to know how to use ord and chr, and to notice that applying cipher twice decrypts the message: since 219 - (219 - c) = c, the function is its own inverse.
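The constant 219 is ord('a') + ord('z'), which is why the mapping mirrors the alphabet. The sketch below checks that, along with the self-inverse property. One deliberate difference from the code above, and an assumption on my part, is that it tests 'a' <= ch <= 'z' instead of str.islower(), because islower() is also true for non-ASCII lowercase letters such as 'é', for which 219 - ord(ch) would be negative and chr would raise ValueError:

```python
def cipher(text):
    # Mirror ASCII lowercase letters (a<->z, b<->y, ...); leave everything else alone.
    return ''.join(
        chr(219 - ord(ch)) if 'a' <= ch <= 'z' else ch
        for ch in text
    )

message = 'The quick brown fox jumps over the lazy dog. 1234567890'
encrypted = cipher(message)
decrypted = cipher(encrypted)

print(ord('a') + ord('z'))   # 219
print(cipher('abc'))         # zyx
print(decrypted == message)  # True
```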

09. Typoglycemia

Create a program that, for a string of words separated by spaces, keeps the first and last letters of each word and randomly rearranges the order of the other letters. However, words with a length of 4 or less are not rearranged. Give it a suitable English sentence (for example, "I couldn't believe that I could actually understand what I was reading: the phenomenal power of the human mind.") and check the execution result.

Nico Nico Pedia has a detailed article: Typoglycemia

code


import random as rd

code


def shuffle_str(x):
    x = list(x)
    rd.shuffle(x)
    return ''.join(x)

def typoglycemia(x):
    if len(x) <= 4:
        return x
    return x[0] + shuffle_str(x[1:-1]) + x[-1]

code


x = "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind ."
x = x.split(' ')
x = [typoglycemia(word) for word in x]
x = ' '.join(x)
x

output


"I c'lndout bliveee that I could allcutay uasntrdend what I was radneig : the penhnmaeol poewr of the haumn mind ."

Strings with a length of 4 or less are returned as they are, and for longer strings, the characters from the second through the second-to-last are shuffled. Python strings are immutable, so, like tuples, they do not support item assignment. You need to build a new string by concatenating the first character, the shuffled middle part, and the last character.
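Because the output is random, it is easier to check the invariants than the exact result. The sketch below is a variation of the code above that takes a random.Random instance as an extra argument (a hypothetical rng parameter of my own, so that a fixed seed makes runs repeatable) and then verifies that every word keeps its first letter, last letter, and multiset of letters:

```python
import random

def typoglycemia(word, rng):
    # Words of length 4 or less are returned unchanged.
    if len(word) <= 4:
        return word
    middle = list(word[1:-1])  # strings are immutable, so shuffle a list copy
    rng.shuffle(middle)
    return word[0] + ''.join(middle) + word[-1]

rng = random.Random(0)  # fixed seed: an assumption made for reproducibility
sentence = ("I couldn't believe that I could actually understand what "
            "I was reading : the phenomenal power of the human mind .")
original = sentence.split(' ')
scrambled = [typoglycemia(word, rng) for word in original]

for before, after in zip(original, scrambled):
    assert before[0] == after[0] and before[-1] == after[-1]
    assert sorted(before) == sorted(after)

print(' '.join(scrambled))
```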

Next is Chapter 2

Language processing 100 knocks 2020 Chapter 2: UNIX commands
