[PYTHON] I tried solving the 100 Language Processing Knocks (2020 edition) [Chapter 1: Warm-up 00-04]

This article works through "[Language Processing 100 Knocks 2020 Edition](https://nlp100.github.io/ja/)" in Python (3.7). The exercises were created by the Inui/Okazaki Lab at Tohoku University (currently the Inui/Suzuki Lab) as teaching material for a basic programming study session, one of its training courses for newcomers.

I taught myself Python, so there may be mistakes or more efficient approaches. I would appreciate it if you could point out any improvements you find.

The source code is also available on GitHub.

I plan to publish the solutions a little at a time, five problems per post.

Chapter 1: Warm-up

Review some advanced topics in programming languages while working on subjects dealing with texts and strings.

00. Reverse order of strings

Get a string in which the characters of the string "stressed" are arranged in reverse (from the end to the beginning).

`00.py`


strings = "stressed"
reversed_strings = strings[::-1]
print(reversed_strings) 
# >> desserts

I used a slice to get the string in reverse order. Slices work on sequence objects such as strings, lists, and tuples. A slice takes three values in the form [start:stop:step], where start is the start position, stop is the end position, and step is the increment, and it returns the elements at the matching indices.

Note that start is included but stop is not: the selected indices satisfy start <= index < stop. start and stop can be omitted to mean **from the beginning** and **to the end**, respectively. Specifying a stop greater than the length of the sequence does not raise an error; the slice simply ends at the last element.
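The rules above can be checked with a short sketch. The sample string `text` is a hypothetical value chosen only for illustration and is not part of the exercise.

```python
# Hypothetical sample string, chosen only for illustration.
text = "abcdefg"

# start is included, stop is excluded: indices 2, 3, and 4.
middle = text[2:5]
print(middle)        # cde

# Omitting start or stop means "from the beginning" or "to the end".
head = text[:3]
tail = text[4:]
print(head, tail)    # abc efg

# A stop beyond the end of the sequence is clamped instead of raising.
clamped = text[2:100]
print(clamped)       # cdefg

# Plain indexing, by contrast, does raise for an out-of-range index.
index_error_raised = False
try:
    text[100]
except IndexError:
    index_error_raised = True
print(index_error_raised)  # True
```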

Here the step is negative, which retrieves the elements in reverse order. When start and stop are given together with a negative step, they must also be given in reverse (start after stop), because the direction of traversal is reversed.

Negative values can also be specified for start and stop. In this case, the position is specified from the end, just like the normal index specification.
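As a small illustration of negative steps and negative positions, here is a sketch using a hypothetical sample string (not taken from the exercise):

```python
# Hypothetical sample string, chosen only for illustration.
text = "abcdefg"

# With a negative step, start must lie to the right of stop.
backwards = text[5:1:-1]    # indices 5, 4, 3, 2
print(backwards)            # fedc

# Ascending start/stop combined with a negative step selects nothing.
empty = text[1:5:-1]
print(repr(empty))          # ''

# Negative start/stop count from the end, as with ordinary indexing.
last_three = text[-3:]
without_last = text[:-1]
print(last_three, without_last)  # efg abcdef
```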

01. "パタトクカシーー"

Take the 1st, 3rd, 5th, and 7th characters of the string "パタトクカシーー" and concatenate them into a single string.

`01.py`


strings = "パタトクカシーー"
answer = strings[::2]
print(answer)
# >> パトカー

The 1st, 3rd, 5th, and 7th characters were extracted by setting the increment of the slice mentioned above to 2 and skipping one character from the beginning of the character string.

02. "パトカー" + "タクシー" = "パタトクカシーー"

Obtain the string "パタトクカシーー" by alternately joining the characters of "パトカー" and "タクシー" from the beginning.

`02.py`


strings1 = "パトカー"
strings2 = "タクシー"

answer = "".join([i + j for i, j in zip(strings1, strings2)])
print(answer)
# >> パタトクカシーー

I solved it with the built-in function zip(), which takes elements from several iterables at once. In a for loop it yields the elements of multiple sequence objects in parallel; here a comprehension assigns each pair to the variables `i` and `j` and concatenates them.

Because both strings have the same number of characters, this works neatly. Note, however, that when the inputs have different lengths, zip() stops at the shorter one and ignores the remaining elements of the longer one.
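The truncation can be seen in a minimal sketch. The inputs `first` and `second` are hypothetical, and `itertools.zip_longest` is shown only as one possible alternative when the leftover characters should be kept; the solution above does not use it.

```python
from itertools import zip_longest

# Hypothetical inputs of unequal length, for illustration only.
first = "abcde"
second = "XYZ"

# zip() stops at the shorter input, so "d" and "e" are dropped.
truncated = "".join(a + b for a, b in zip(first, second))
print(truncated)  # aXbYcZ

# zip_longest() pads the shorter input with fillvalue instead.
padded = "".join(a + b for a, b in zip_longest(first, second, fillvalue=""))
print(padded)     # aXbYcZde
```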

In the comprehension notation, a list containing concatenated characters of the same index is created, so the result is concatenated into one character string using the join method.

03. Pi

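Break the sentence "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics." into words, and create a list of the number of (alphabetic) characters in each word, in order of appearance.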

`03.py`


pi = "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."
table = str.maketrans('', '', ',.')
words = pi.translate(table).split()

print([len(i) for i in words])
# >> [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]

Because the task is to split the sentence into words and output the number of characters in each, everything other than the words is removed first. I wanted to remove the commas and periods, so I used the translate method.

translate requires a translation table, which you create with the str.maketrans() function. str.maketrans() accepts a dictionary, or two or three strings, as arguments. For a dictionary, each key is the character to replace (a single character) and each value is its replacement (None to delete it). With strings, the first argument lists the characters to replace, the second lists their replacements, and the optional third lists the characters to delete. The first and second strings must have the same length because characters are mapped one to one, so a character cannot be replaced by a multi-character string in this form.

Calling the translate method with this table replaces and deletes characters (here, it deletes commas and periods). Calling the split method with no arguments then gives the words as a list split on whitespace.

Finally, a list comprehension computes the length of each word, and printing that list completes the solution.
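The two ways of building a table can be compared in a short sketch. The sample sentence and the extra "a" to "A" replacement are hypothetical additions for illustration; the solution above only deletes commas and periods.

```python
# Hypothetical sample sentence, for illustration only.
sample = "a cat, a hat."

# Dictionary form: keys are single characters, values are the
# replacement, or None to delete the character.
dict_table = str.maketrans({",": None, ".": None, "a": "A"})

# Three-string form: "a" maps to "A" one to one, and every
# character in the third string is deleted.
three_table = str.maketrans("a", "A", ",.")

from_dict = sample.translate(dict_table)
from_strings = sample.translate(three_table)
print(from_dict)     # A cAt A hAt
print(from_strings)  # A cAt A hAt

# The first two strings must have equal length; otherwise
# str.maketrans() raises ValueError.
length_mismatch_rejected = False
try:
    str.maketrans("ab", "A")
except ValueError:
    length_mismatch_rejected = True
print(length_mismatch_rejected)  # True
```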

04. Element symbol

Split the sentence "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can." into words. Take the first character of the 1st, 5th, 6th, 7th, 8th, 9th, 15th, 16th, and 19th words and the first two characters of the other words, and create an associative array (dictionary type or map type) from each extracted string to the position of its word (its ordinal number counted from the beginning).

`04.py`


element_symbol = "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can."
table = str.maketrans('', '', ',.')
words = element_symbol.translate(table).split()

element_symbol_dict = {}
single_chars = [i - 1 for i in [1, 5, 6, 7, 8, 9, 15, 16, 19]]
for index, word in enumerate(words):
    length = 1 if index in single_chars else 2
    element_symbol_dict[word[:length]] = index + 1
print(element_symbol_dict)
# {'H': 1, 'He': 2, 'Li': 3, 'Be': 4, 'B': 5, 'C': 6, 'N': 7, 'O': 8, 'F': 9, 'Ne': 10, 'Na': 11, 'Mi': 12, 'Al': 13, 'Si': 14, 'P': 15, 'S': 16, 'Cl': 17, 'Ar': 18, 'K': 19, 'Ca': 20}

As in "03. Pi", this solution first removes commas and periods and then gets a list of words split on whitespace.

Next, initialize the dictionary of element symbols and build a list of the (zero-based) indices of the words from which only the first character is taken. That completes the preparation.

Then use the `enumerate()` function to get each word together with its index from the word list. The `in` operator checks whether the word's index is among those that take only the first character; if it is, length is set to 1, and otherwise to 2.

Finally, slice the word to that length to get the key, and add an entry whose value is the word's position, that is, its index + 1.
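As one possible variation on the same idea (a sketch, not the solution above), the 1-based positions can be kept in a set and `enumerate()` can start counting at 1, which removes the `- 1` and `+ 1` adjustments. The names `single_char_positions` and `symbols` are hypothetical.

```python
sentence = (
    "Hi He Lied Because Boron Could Not Oxidize Fluorine. "
    "New Nations Might Also Sign Peace Security Clause. Arthur King Can."
)
# Delete commas and periods, then split on whitespace.
words = sentence.translate(str.maketrans("", "", ",.")).split()

# 1-based positions of the words that contribute only one character.
single_char_positions = {1, 5, 6, 7, 8, 9, 15, 16, 19}

# enumerate(..., start=1) yields 1-based positions directly.
symbols = {
    word[: (1 if position in single_char_positions else 2)]: position
    for position, word in enumerate(words, start=1)
}
print(symbols["H"], symbols["He"], symbols["K"])  # 1 2 19
```

A set makes the membership test independent of how many positions are listed, although with nine entries the difference from a list is negligible.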

Summary

In this article, I solved problems 00 to 04 of Chapter 1: Warm-up of the 100 Language Processing Knocks (2020 edition).

I don't usually use slices much, but since they appear right at the start of these 100 knocks, perhaps they are used often in language processing. I hope to get used to language processing by working through the 100 knocks.

I want to improve my skills, so please let me know if you have a better answer! Thank you.

Continued

- I tried to solve the 2020 version of 100 language processing knocks [Chapter 1: Preparatory movement 05-09]
- Language processing 100 knocks 2020 version [Chapter 2: UNIX commands 10-14]
- I tried to solve 100 language processing knock 2020 version [Chapter 2: UNIX commands 15-19]
- Language processing 100 knocks 2020 version [Chapter 3: Regular expressions 20-24]
- Language processing 100 knocks 2020 version [Chapter 3: Regular expressions 25-29]