[PYTHON] I tried to solve the 2020 version of 100 language processing knocks [Chapter 1: Preparatory movement 00-04]

This is my attempt at solving, in Python (3.7), "[Language Processing 100 Knock 2020 Edition](https://nlp100.github.io/ja/)", the teaching material created by the Tohoku University Inui/Okazaki Lab (currently the Inui/Suzuki Lab) for its programming fundamentals study group, which is part of its training for newcomers.

I taught myself Python, so there may be mistakes or more efficient ways to do things. I would appreciate it if you could point out any improvements you find.

The source code is also available on GitHub.

I plan to publish my answers in small batches of five problems each.

Chapter 1: Preparatory movement

Review some advanced topics in programming languages while working on subjects dealing with texts and strings.

00. Reverse order of strings

Get a string in which the characters of the string "stressed" are arranged in reverse (from the end to the beginning).

00.py


strings = "stressed"
reversed_strings = strings[::-1]
print(reversed_strings) 
# >> desserts

I used a slice to get the string in reverse order. Slices work on sequence objects such as strings, lists, and tuples. A slice takes up to three values in the form [start:stop:step]: the start position, the stop position, and the increment. The elements at the matching indices are returned.

Note that the range is start <= index < stop, so the element at start is included but the element at stop is not. start can be omitted to mean "from the beginning" and stop can be omitted to mean "to the end". If stop is larger than the length of the sequence, no error is raised; the slice simply ends at the end of the sequence.
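As a quick check of the rules above, here is a minimal sketch. The variable names are my own, chosen for illustration, and are not part of the original answer.

```python
# Illustrative only: these names are not from the original answer.
text = "stressed"

first_three = text[0:3]      # start is included, stop is excluded
from_third = text[2:]        # stop omitted: continue to the end
up_to_third = text[:3]       # start omitted: begin at the first character
past_the_end = text[2:100]   # stop beyond the length is clamped, no error

print(first_three, from_third, up_to_third, past_the_end)
# str ressed str ressed
```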

This time I specified a negative step, which retrieves the elements in reverse order. If you also give start and stop together with a negative step, they must be swapped (start comes after stop), because the direction of traversal is reversed.

Negative values can also be used for start and stop. As with ordinary indexing, a negative value counts from the end of the sequence.
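The behavior of negative steps and negative positions can be sketched as follows. Again, the variable names are assumptions made for this illustration only.

```python
# Illustrative only: these names are not from the original answer.
text = "stressed"

whole_reversed = text[::-1]    # every character, from the end to the beginning
part_reversed = text[5:1:-1]   # indices 5, 4, 3, 2 (stop index 1 is excluded)
wrong_order = text[1:5:-1]     # start before stop with a negative step: empty
last_three = text[-3:]         # negative start counts from the end
without_last = text[:-1]       # negative stop drops the final character

print(whole_reversed, part_reversed, repr(wrong_order), last_three, without_last)
# desserts sser '' sed stresse
```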

01. "パタトクカシーー"

Take the 1st, 3rd, 5th, and 7th characters of the string "パタトクカシーー" and get the string formed by concatenating them.

01.py


strings = "パタトクカシーー"
answer = strings[::2]
print(answer)
# >> パトカー

The 1st, 3rd, 5th, and 7th characters were extracted by setting the step of the slice described above to 2, which takes every other character starting from the beginning of the string. The result is "パトカー" ("police car").
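To make the link between the 1-based positions in the problem and the 0-based slice explicit, here is a small sketch using the original Japanese string from the problem, "パタトクカシーー". The variable names are assumptions for illustration.

```python
# Illustrative only: these names are not from the original answer.
text = "パタトクカシーー"

by_slice = text[::2]                               # indices 0, 2, 4, 6
by_index = "".join(text[i] for i in (0, 2, 4, 6))  # same positions, spelled out
other_half = text[1::2]                            # indices 1, 3, 5, 7

print(by_slice, by_index, other_half)
# パトカー パトカー タクシー
```

Starting the slice at 1 instead of 0 picks up the remaining characters, which form the second word used in problem 02.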

02. "パトカー" + "タクシー" = "パタトクカシーー"

Get the string "パタトクカシーー" by alternately concatenating the characters of "パトカー" (police car) and "タクシー" (taxi) from the beginning.

02.py


strings1 = "パトカー"
strings2 = "タクシー"

answer = "".join([i + j for i, j in zip(strings1, strings2)])
print(answer)
# >> パタトクカシーー

I solved it using the built-in function zip(), which iterates over several iterables at once. In a for loop it yields the elements of multiple sequence objects in parallel; here a list comprehension assigns each pair to the variables i and j and concatenates them.

The two strings have the same number of characters, so this worked out neatly. Note, however, that when the inputs to zip() have different lengths, the extra elements of the longer one are ignored.

The comprehension produces a list of two-character strings, one for each index, so I joined the list into a single string with the join method.
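The truncation behavior of zip() mentioned above can be seen in the following sketch. The sample strings and names are made up for illustration, and itertools.zip_longest is shown only as one possible way to keep the leftover characters; it is not used in the original answer.

```python
from itertools import zip_longest

# Illustrative only: sample data and names are not from the original answer.
short = "abc"
long = "12345"

# zip() stops at the end of the shorter input, so "4" and "5" are dropped.
truncated = "".join(a + b for a, b in zip(short, long))

# zip_longest() keeps going and fills the missing side with fillvalue.
padded = "".join(a + b for a, b in zip_longest(short, long, fillvalue=""))

print(truncated, padded)
# a1b2c3 a1b2c345
```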

03. Pi

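Split the sentence "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics." into words and create a list of the number of (alphabetic) characters in each word, in order of appearance from the beginning.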

03.py


pi = "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."
table = str.maketrans('', '', ',.')
words = pi.translate(table).split()

print([len(i) for i in words])
# >> [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]

The task is to split the sentence into words and output the number of characters in each word, so I first removed everything that is not part of a word. I wanted to delete the commas and periods, so I used the translate method.

translate requires a translation table, which you create with the str.maketrans() function. str.maketrans() accepts either a single dictionary or two or three strings as arguments. With a dictionary, each key is a character to replace (a single character) and each value is its replacement (None to delete the character). With strings, the first argument lists the characters to replace, the second lists their replacements, and the optional third argument lists the characters to delete. The first and second strings must have the same length, because each character in the first is mapped to the character at the same position in the second; in this form a character cannot be replaced by more than one character.
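The two ways of building a translation table described above can be compared in a short sketch. The sample string and the variable names are assumptions for illustration.

```python
# Illustrative only: sample data and names are not from the original answer.
text = "a,b.c"

# Three-string form: nothing to replace, delete commas and periods.
table_from_strings = str.maketrans("", "", ",.")

# Dictionary form: map each character to None to delete it.
table_from_dict = str.maketrans({",": None, ".": None})

# Replacement plus deletion: "a" -> "x", "b" -> "y", and delete ",.".
replace_table = str.maketrans("ab", "xy", ",.")

deleted_a = text.translate(table_from_strings)
deleted_b = text.translate(table_from_dict)
replaced = text.translate(replace_table)

print(deleted_a, deleted_b, replaced)
# abc abc xyc
```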

Calling translate with this table replaces or deletes characters in the string (here, the commas and periods are deleted). Then calling split with no arguments splits the result on whitespace and returns the words as a list.

Finally, a list comprehension computes the length of each word, and printing that list gives the answer.
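For comparison, here is a sketch of one alternative that I did not use in my answer: split first, then strip the punctuation from each word with str.strip. It works for this sentence because the commas and periods only appear at the ends of words. The variable names are assumptions for illustration.

```python
# Alternative sketch, not the original answer: strip punctuation per word.
sentence = (
    "Now I need a drink, alcoholic of course, "
    "after the heavy lectures involving quantum mechanics."
)

# strip(",.") removes leading and trailing commas and periods only.
lengths = [len(word.strip(",.")) for word in sentence.split()]

print(lengths)
# [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]
```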

04. Element symbol

Split the sentence "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can." into words. Take the first character of the 1st, 5th, 6th, 7th, 8th, 9th, 15th, 16th, and 19th words and the first two characters of the other words, and create an associative array (dictionary or map type) from each extracted string to the position of its word (its number counting from the beginning).

04.py


element_symbol = "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can."
table = str.maketrans('', '', ',.')
words = element_symbol.translate(table).split()

element_symbol_dict = {}
single_chars = [i - 1 for i in [1, 5, 6, 7, 8, 9, 15, 16, 19]]
for index, word in enumerate(words):
    length = 1 if index in single_chars else 2
    element_symbol_dict[word[:length]] = index + 1
print(element_symbol_dict)
# {'H': 1, 'He': 2, 'Li': 3, 'Be': 4, 'B': 5, 'C': 6, 'N': 7, 'O': 8, 'F': 9, 'Ne': 10, 'Na': 11, 'Mi': 12, 'Al': 13, 'Si': 14, 'P': 15, 'S': 16, 'Cl': 17, 'Ar': 18, 'K': 19, 'Ca': 20}

As in "03. Pi", this solution first removes the commas and periods and then gets the list of words by splitting on whitespace.

To prepare, I initialized the dictionary for the element symbols and created a list of the (0-based) indices of the words from which only the first character is taken.

Next, the enumerate() function yields the index and the word one at a time from the list of words. The in operator then checks whether the index of the current word is in the list of single-character indices; length is set to 1 if it is and to 2 if it is not.

Finally, the first length characters of the word, taken with a slice, become the key, and the word's position (index + 1) is stored as the value.
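The same steps can also be written more compactly. The following is a sketch of one possible variant, not the answer above: it assumes enumerate() with start=1 so that the positions are already 1-based, uses a set for the membership test, and builds the dictionary with a comprehension. All names are my own for illustration.

```python
# Variant sketch, not the original answer: dict comprehension with 1-based positions.
sentence = (
    "Hi He Lied Because Boron Could Not Oxidize Fluorine. "
    "New Nations Might Also Sign Peace Security Clause. Arthur King Can."
)
single_char_positions = {1, 5, 6, 7, 8, 9, 15, 16, 19}

# The only punctuation in this sentence is the period, so remove it and split.
words = sentence.replace(".", "").split()

symbols = {
    word[: 1 if position in single_char_positions else 2]: position
    for position, word in enumerate(words, start=1)
}

print(symbols["H"], symbols["He"], symbols["K"], len(symbols))
# 1 2 19 20
```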

Summary

In this article, I solved problems 00 to 04 of Chapter 1: Preparatory movement from the 2020 edition of 100 language processing knocks.

I don't normally use slices much, but they appear right from the start of these 100 knocks, so perhaps they are used a lot in language processing. I hope to get used to language processing by working through the 100 knocks.

I want to improve my skills, so please let me know if you have a better answer! Thank you.

Continued

- I tried to solve the 2020 version of 100 language processing knocks [Chapter 1: Preparatory movement 05-09]
- Language processing 100 knocks 2020 version [Chapter 2: UNIX commands 10-14]
- I tried to solve 100 language processing knock 2020 version [Chapter 2: UNIX commands 15-19]
- Language processing 100 knocks 2020 version [Chapter 3: Regular expressions 20-24]
- Language processing 100 knocks 2020 version [Chapter 3: Regular expressions 25-29]
