This is a record of "Chapter 1: Preparatory Movement" (tohoku.ac.jp/nlp100/#ch1) of Language Processing 100 Knock 2015. It is a review of work I did over a year ago. Looking at my code from that time again, I found many things to correct, which I take as a sign of my own growth. The amount of code feels like it has been compressed to about half of what I wrote back then. Now that I have some Python experience, I also think this is a **good tutorial for learning Python and language processing**. Each knock is lighter than those in the later chapters, which fits the name "preparatory movement" exactly.
Type | Version | Notes |
---|---|---|
OS | Ubuntu 18.04.01 LTS | Running as a virtual machine |
pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
Python | 3.6.9 | Python 3.6.9 on pyenv. There is no deep reason for not using the 3.7 or 3.8 series. Packages are managed with venv |
Review some slightly advanced programming-language topics while working on tasks that deal with text and strings.
String, Unicode, List type, Dictionary type, Set type, Iterator, Slice, Random number
Get a string in which the characters of the string "stressed" are arranged in reverse (from the end to the beginning).
Specify a slice as [start:stop:step] and make step a negative number to reverse the order.
python:000.Reverse order of strings.ipynb
print('stressed'[::-1])
Terminal output result
desserts
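As a quick check of the negative-step slice, here is a minimal sketch that compares it with the built-in reversed combined with join. The variable names are only illustrative.

```python
text = 'stressed'

# A negative step walks the string from the end to the beginning.
by_slice = text[::-1]

# reversed() yields the characters back to front; join() rebuilds a string.
by_reversed = ''.join(reversed(text))

print(by_slice, by_reversed, by_slice == by_reversed)
```

Both forms give the same string; the slice is simply the shorter way to write it.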
Take the 1st, 3rd, 5th, and 7th characters of the string "パタトクカシーー" and get the concatenated string.
Specify a slice as [start:stop:step] and take every second character from the beginning, stopping before the 8th character.
python:001."パタトクカシーー".ipynb
print('パタトクカシーー'[0:7:2])
Terminal output result
パトカー
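To see which positions a stepped slice visits, the following sketch lists the indices explicitly. It uses the original Japanese string of the task, パタトクカシーー (8 characters); the variable names are my own.

```python
text = 'パタトクカシーー'  # the 8-character string from the task

# Indices visited by the slice [0:7:2]: start at 0, stop before 7, step 2.
indices = list(range(0, 7, 2))

# Picking those indices by hand gives the same result as the slice.
by_index = ''.join(text[i] for i in indices)

print(indices)
print(by_index == text[0:7:2] == text[::2])
```

For an 8-character string, the shorter form [::2] selects the same characters, so start and stop can be omitted here.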
Obtain the string "パタトクカシーー" by alternately concatenating the characters of "パトカー" (police car) and "タクシー" (taxi) from the beginning.
Use the zip function to loop over the two words "パトカー" and "タクシー" together and build the list ['パタ', 'トク', 'カシ', 'ーー'] with a list comprehension. Then join the list into one string with the join function.
I understand the zip function in my head, but it is the kind of feature I have not used in other languages, so the idea of using it does not come to me easily.
python:002."パトカー" + "タクシー" = "パタトクカシーー".ipynb
result = [char1+char2 for char1, char2 in zip('パトカー', 'タクシー')]
print(''.join(result))
Terminal output result
パタトクカシーー
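Since zip can be hard to picture, here is a small sketch of what it produces. The second half uses inputs of different lengths, which the task does not need; zip_longest from itertools is shown only as a possible option for that case.

```python
from itertools import zip_longest

# zip pairs up items position by position.
pairs = list(zip('パトカー', 'タクシー'))
print(pairs)

# zip stops at the shorter input, so trailing characters are dropped.
short = ''.join(a + b for a, b in zip('abc', '12345'))

# zip_longest keeps going and pads the shorter input with fillvalue.
full = ''.join(a + b for a, b in zip_longest('abc', '12345', fillvalue=''))

print(short, full)
```

In this task both words have four characters, so plain zip loses nothing.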
Break down the sentence "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics." into words and list the number of (alphabetic) characters in each word in order of appearance.
Use the split function to split on whitespace. It is very handy in English language processing. The strip function removes the commas and periods at the ends of words.
python:003.Pi.ipynb
sentence = 'Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics.'
for word in sentence.split():
    print(len(word.strip(',.')), word.strip(',.'))
The character counts spell out the digits of pi.
Terminal output result
3 Now
1 I
4 need
1 a
5 drink
9 alcoholic
2 of
6 course
5 after
3 the
5 heavy
8 lectures
9 involving
7 quantum
9 mechanics
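The task asks for a list, so the same split and strip steps can also be collected with a list comprehension. This is a minimal sketch; the variable name lengths is my own.

```python
sentence = ('Now I need a drink, alcoholic of course, after the heavy '
            'lectures involving quantum mechanics.')

# Strip trailing commas and periods, then record each word's length.
lengths = [len(word.strip(',.')) for word in sentence.split()]

print(lengths)
```

Read in order, the lengths are 3, 1, 4, 1, 5, 9, and so on.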
Break the sentence "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can." into words, take the first character of the 1st, 5th, 6th, 7th, 8th, 9th, 15th, 16th, and 19th words and the first two characters of the other words, and create an associative array (dictionary type or map type) from each extracted string to the position of the word (its number counting from the beginning).
I use a dictionary comprehension (I had a hard time because I did not know how to combine it with an if condition).
The dictionary items are sorted by word position so that the output follows the order of the element symbols.
Finally, I used pprint for the output because I wanted a line break after each element.
python:004.Element symbol.ipynb
from pprint import pprint
sentence = 'Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can.'
word_list = sentence.split()
result = {word[0] if i in {1, 5, 6, 7, 8, 9, 15, 16, 19} else word[:2]: i for i, word in enumerate(word_list, 1)}
pprint(sorted(result.items(), key=lambda x: x[1]))
Terminal output result
[('H', 1),
('He', 2),
('Li', 3),
('Be', 4),
('B', 5),
('C', 6),
('N', 7),
('O', 8),
('F', 9),
('Ne', 10),
('Na', 11),
('Mi', 12),
('Al', 13),
('Si', 14),
('P', 15),
('S', 16),
('Cl', 17),
('Ar', 18),
('K', 19),
('Ca', 20)]
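The combination that gave me trouble, a conditional expression inside a dictionary comprehension, may be easier to read on a smaller input. The following is a sketch with only the first five words; one_char_positions and symbols are names I made up for illustration.

```python
words = ['Hi', 'He', 'Lied', 'Because', 'Boron']
one_char_positions = {1, 5}  # positions whose symbol is a single character

# General shape: {key_expression: value_expression for ... in ...}
# Here the key expression is a conditional expression (a if cond else b).
symbols = {
    (word[0] if pos in one_char_positions else word[:2]): pos
    for pos, word in enumerate(words, 1)
}

print(symbols)
```

The parentheses around the key are optional, but they make it clearer that the if ... else belongs to the key and is not a filter on the loop.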
Create a function that creates n-grams from a given sequence (string, list, etc.). Use this function to get the word bi-grams and the character bi-grams from the sentence "I am an NLPer".
The only new technique here is probably using range in a for loop.
python:005.n-gram.ipynb
def generate_ngram(sentence):
    # Split on whitespace into a list of words
    words = sentence.split()
    # Remove spaces
    chars = sentence.replace(' ', '')
    # Generate word bi-grams
    bigram_word = [words[i-1] + ' ' + words[i] for i in range(1, len(words))]
    # Generate character bi-grams
    bigram_char = [chars[i-1] + chars[i] for i in range(1, len(chars))]
    return bigram_word, bigram_char

print(generate_ngram('I am an NLPer'))
Terminal output result
(['I am', 'am an', 'an NLPer'], ['Ia', 'am', 'ma', 'an', 'nN', 'NL', 'LP', 'Pe', 'er'])
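The function above is fixed to bi-grams. As a sketch of the more general form the task describes, the hypothetical helper ngram below takes any sliceable sequence and a size n. Unlike my version, it does not remove spaces, so the character bi-grams of the raw sentence include pairs containing a space.

```python
def ngram(sequence, n):
    """Return the list of n-grams of a sliceable sequence (hypothetical helper)."""
    # There are len(sequence) - n + 1 windows of width n; none if n is too large.
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]

sentence = 'I am an NLPer'
word_bigrams = ngram(sentence.split(), 2)  # list input gives lists of words
char_bigrams = ngram(sentence, 2)          # string input gives substrings

print(word_bigrams)
print(char_bigrams)
```

Because slicing works the same way on strings and lists, one function covers both the word and the character case.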
Find the sets of character bi-grams contained in "paraparaparadise" and "paragraph" as X and Y, respectively, and find the union, intersection, and difference of X and Y. In addition, check whether the bi-gram 'se' is included in X and in Y.
Python has a set type, which makes it easy to find the union, intersection, and difference.
python:006.set.ipynb
def generate_ngram(sentence):
    # Remove spaces
    chars = sentence.replace(' ', '')
    # Generate character bi-grams
    bigram_char = [chars[i-1] + chars[i] for i in range(1, len(chars))]
    return bigram_char

bigram_x = set(generate_ngram('paraparaparadise'))
bigram_y = set(generate_ngram('paragraph'))
# Union
print(bigram_x.union(bigram_y))
# Intersection
print(bigram_x.intersection(bigram_y))
# Difference
print(bigram_x.difference(bigram_y))
# Check whether 'se' is contained in each set
search_word = {'se'}
print(search_word.intersection(bigram_x))
print(search_word.intersection(bigram_y))
Terminal output result
{'ag', 'ap', 'se', 'ra', 'is', 'pa', 'ad', 'ph', 'di', 'ar', 'gr'}
{'pa', 'ar', 'ap', 'ra'}
{'ad', 'se', 'di', 'is'}
{'se'}
set()
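The set methods also have operator forms, and membership can be tested directly with in. The sketch below writes the two bi-gram sets out by hand so that it runs on its own.

```python
x = {'pa', 'ar', 'ra', 'ap', 'ad', 'di', 'is', 'se'}  # bi-grams of "paraparaparadise"
y = {'pa', 'ar', 'ra', 'ag', 'gr', 'ap', 'ph'}        # bi-grams of "paragraph"

union = x | y         # same as x.union(y)
intersection = x & y  # same as x.intersection(y)
difference = x - y    # same as x.difference(y)

# "in" answers the membership question with a plain True or False.
print('se' in x, 'se' in y)
```

Using in returns a boolean, which may be easier to read than checking whether an intersection is empty.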
Implement a function that takes arguments x, y, z and returns the string "y at x is z". Furthermore, set x = 12, y = "temperature", z = 22.4, and check the execution result.
The strings are concatenated with +. The same result can also be obtained with '{} at {} is {}'.format(y, x, z).
python:007.Sentence generation by template.ipynb
def create_sentence(x, y, z):
    return str(y) + ' at ' + str(x) + ' is ' + str(z)

print(create_sentence(12, 'temperature', 22.4))
Terminal output result
temperature at 12 is 22.4
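Since Python 3.6 an f-string is another way to fill in a template. The following is a minimal sketch that follows the English wording of the task, "y at x is z"; create_sentence_f is a hypothetical name.

```python
def create_sentence_f(x, y, z):
    # f-strings convert each value with str() automatically.
    return f'{y} at {x} is {z}'

sentence = create_sentence_f(12, 'temperature', 22.4)
print(sentence)
```

Compared with + concatenation, the explicit str() calls disappear and the shape of the sentence stays visible in one place.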
Implement the function cipher that converts each character of a given string according to the following specification.
- Replace each lowercase letter with the character whose code is (219 - character code)
- Output other characters as they are
Use this function to encrypt and decrypt English messages.
"219 - character code" seems to mean the following.
The character code of a is 97, and 219 - 97 = 122, so this encryption turns it into character code 122, which is z.
The character code of z is 122, and 219 - 122 = 97, so it becomes character code 97, which is a.
In other words, it is an encryption that maps the lowercase letters a to z onto z to a in reverse order.
Use the built-in functions ord and chr to convert between characters and character codes.
I considered using a comprehension but decided against it, because having to add join at the end seemed like double work.
python:008.Cryptogram.ipynb
def cipher(sentence):
    result = ''
    for char in sentence:
        # Lowercase letters are mirrored (a <-> z); everything else is kept
        if char.islower():
            result += chr(219 - ord(char))
        else:
            result += char
    return result

print(cipher('I Am An Idiot'))
Terminal output result
I An Am Iwrlg
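Because 219 - (219 - code) gives back the original code, applying the same function twice should decrypt the message. The sketch below checks that, written as the comprehension-plus-join variant I decided against. The name cipher_ascii is hypothetical, and the test 'a' <= c <= 'z' is my own choice: islower() is also true for non-ASCII lowercase letters such as 'é', whose code is larger than 219 and would make chr fail.

```python
def cipher_ascii(text):
    # Mirror ASCII lowercase letters (a <-> z, b <-> y, ...); keep the rest.
    return ''.join(chr(219 - ord(c)) if 'a' <= c <= 'z' else c for c in text)

message = 'I Am An Idiot'
encrypted = cipher_ascii(message)
decrypted = cipher_ascii(encrypted)  # the same function undoes itself

print(encrypted)
print(decrypted == message)
```

So no separate decryption function is needed for this cipher.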
For a string of words separated by spaces, create a program that keeps the first and last letters of each word and randomly rearranges the order of the other letters. However, words with a length of 4 or less are not rearranged. Give an appropriate English sentence (for example, "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind .") and check the execution result.
This is the phenomenon where words in a sentence can still be read correctly even if the order of the letters other than the first and last is changed.
I see, you can read it somehow.
The letters are shuffled with the shuffle function of the random module.
python:009.Typoglycemia.ipynb
from random import shuffle

def typoglycemia(word):
    # Shuffle only the letters between the first and the last
    mid_chars = list(word[1:-1])
    shuffle(mid_chars)
    return word[0] + ''.join(mid_chars) + word[-1]

sentence = "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind ."
' '.join([word if len(word) <= 4 else typoglycemia(word) for word in sentence.split(' ')])
Terminal output result
"I cul'dnot beilvee that I culod altualcy udnnrseatd what I was riadeng : the paemhnenol peowr of the hmuan mind ."
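The output changes on every run, which makes the result hard to compare. One option, sketched below, is to pass in a random.Random instance with a fixed seed. The seed value 0 and the name typoglycemia_seeded are arbitrary choices of mine, and the length check is moved inside the function.

```python
import random

def typoglycemia_seeded(word, rng):
    # Words of 4 characters or fewer are returned unchanged, as in the task.
    if len(word) <= 4:
        return word
    middle = list(word[1:-1])
    rng.shuffle(middle)  # shuffles in place using the given generator
    return word[0] + ''.join(middle) + word[-1]

rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability
sentence = 'I could not believe that I could actually understand what I was reading'
shuffled = ' '.join(typoglycemia_seeded(w, rng) for w in sentence.split())

print(shuffled)
```

With the same seed the shuffled sentence is the same on every run, while the first and last letters of each word stay in place.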