To train myself in natural language processing and Python, I am taking on the 100 Language Processing Knocks (nlp100) published on the web page of the Inui-Okazaki Laboratory at Tohoku University. I plan to note down the code I implemented along the way and the techniques worth mastering. The code is also available on GitHub.
As a textbook I am using "Introduction to Python 2 & 3 support" (by Kenji Hosoda et al., Shuwa System).
I would like to mention the articles I referred to when getting started. I cannot deny that I leaned on them too heavily, so please contact me if that bothers you.
I am a complete amateur, so the code is rather unsightly: the notation is not unified and Python 2 and 3 styles are mixed. I would appreciate it if you could point out any problems. The execution environment itself is Python 2.
Get a string in which the characters of the string "stressed" are arranged in reverse (from the end to the beginning).
00.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 00.py
str = "stressed"
print(str[-1::-1])
This is an exercise in "slicing" strings. As also noted in the article mentioned above, I took the opportunity to study slices again.
A slice is written in the form `string[start:stop:step]` and extracts part of the string. Slices work on lists as well as on strings.
str = "abcdefgh"
# Get a specific character
str[0]      # 'a'; indices are zero-based from the beginning
str[-1]     # 'h'; negative indices count back from the end (same as str[7] here)
# Slice
str[1:3]    # 'bc'; the character at the end index is not included (it is not a character count)
str[0:-3]   # 'abcde'; negative indices are OK (same as str[0:5] here)
str[:4]     # 'abcd'; if the start index is omitted, starts from the beginning
str[4:]     # 'efgh'; if the end index is omitted, runs to the end
# Specify the step
str[0:6:2]  # 'ace'; takes every second character (indices 0, 2, 4)
str[::3]    # 'adg'; the start and end indices can be omitted
str[-3::2]  # 'fh'; a negative start index is also possible
str[::-3]   # 'heb'; a negative step walks backward from the end
So `str[::-1]` would have been enough as the answer this time, but this was my first experience with slicing, so please go easy on me...
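As a small illustration of the remark that slices also work on lists, here is a minimal sketch written for Python 3 (unlike the Python 2 listings in this post). The variable names `word` and `letters` are my own, chosen to avoid shadowing the built-in `str`.

```python
# Slicing works the same way on any sequence, not only on strings.
word = "stressed"
letters = list(word)  # ['s', 't', 'r', 'e', 's', 's', 'e', 'd']

# A step of -1 with start and end omitted walks the whole sequence backward.
reversed_word = word[::-1]        # a new string
reversed_letters = letters[::-1]  # a new list; the original list is unchanged

print(reversed_word)
print("".join(reversed_letters))
```

Both results contain the same characters in the same reversed order; only the container type differs.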
Take the 1st, 3rd, 5th, and 7th characters of the string "パタトクカシーー" and get the string formed by concatenating them.
01.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 01.py
str = u'パタトクカシーー'
print(str[0::2])  # パトカー ("police car")
This is another slicing exercise, like 00. Here too, the start position can be omitted: `str[::2]`.
In addition, a Japanese (Unicode) string literal can be written with a `u` prefix, as in `u'hogehoge'` (in a UTF-8 environment).
Obtain the string "パタトクカシーー" by alternately joining the characters of "パトカー" (police car) and "タクシー" (taxi) from the beginning.
02.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 02.py
str1 = u'パトカー'
str2 = u'タクシー'
str3 = u''
for a, b in zip(str1, str2):
    str3 = str3 + a + b
print str3
`zip()` is a function that takes one element at a time from each of its arguments and builds a tuple from them. It is a handy technique for the iterable of a `for` loop.
`print` is suddenly no longer a function here, but this is Python 2 notation. I apologize for the mixture.
Also, as the article mentioned above points out, appending to the end of a string on every loop iteration is apparently a problem in terms of execution speed.
It seems best to join the strings afterwards, as in `print(''.join([a + b for a, b in zip(str1, str2)]))`.
`''.join()` joins the elements of its argument, placing the string in `''` between them as a separator. Note that the syntax is a little unusual.
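To make the behavior of `zip()` and `join()` concrete, here is a minimal sketch written for Python 3. The names `pairs`, `merged`, and `hyphenated` are my own and do not appear in the original code.

```python
str1 = "パトカー"
str2 = "タクシー"

# zip() pairs up the i-th characters of both strings as tuples.
pairs = list(zip(str1, str2))

# join() is called on the separator; an empty separator simply concatenates.
merged = "".join(a + b for a, b in pairs)

# With a non-empty separator, it is placed between the elements.
hyphenated = "-".join(["a", "b", "c"])

print(pairs[0])
print(merged)
print(hyphenated)
```

Building the pieces first and joining once avoids creating a new intermediate string on every loop iteration.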
Break the sentence "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics." into words, and create a list of the number of (alphabetic) characters in each word, in order of appearance.
03.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 03.py
str = "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."
str = str.replace('.', "")
str = str.replace(',', "")
str = str.split()
list = []
for word in str:
    list.append(len(word))
print list
After removing periods and commas with `replace()`, I split the sentence into words with `split()`, get each length with `len()`, and push it into `list`.
I wondered whether there was a better way to remove the periods and commas, but gave up. Since `split()` accepts a delimiter as an argument (the default is whitespace), I thought I could specify them all at once, but I couldn't.
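For what it is worth, here are two possible ways to deal with the punctuation in one step, sketched for Python 3 using only the standard library. This is one option among several, not necessarily the best; the variable names are my own.

```python
import re
import string

sentence = ("Now I need a drink, alcoholic of course, "
            "after the heavy lectures involving quantum mechanics.")

# Option 1: split on whitespace, then strip punctuation from both ends of each word.
lengths_strip = [len(word.strip(string.punctuation)) for word in sentence.split()]

# Option 2: let a regular expression pick out runs of letters directly.
lengths_regex = [len(word) for word in re.findall(r"[A-Za-z]+", sentence)]

print(lengths_strip)
print(lengths_regex)
```

Note that the regular-expression version would split a word such as "couldn't" at the apostrophe, so the first option is the safer default for general text.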
Break the sentence "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can." into words. Take the first character of the 1st, 5th, 6th, 7th, 8th, 9th, 15th, 16th, and 19th words and the first two characters of the other words, and create an associative array (dictionary or map type) from each extracted string to the position of its word (its ordinal number from the beginning).
04.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 04.py
str = "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can."
str = str.split()
dict = {}
single = [1, 5, 6, 7, 8, 9, 15, 16, 19]
for element in str:
    if str.index(element) + 1 in single:
        dict[element[:1]] = str.index(element) + 1
    else:
        dict[element[:2]] = str.index(element) + 1
# Sort by atomic number and print
for k, v in sorted(dict.items(), key=lambda x: x[1]):
    print k, v
As in 03, the sentence is split into words and each word is processed individually in a `for` loop. Since only the beginning of each word is examined, the period handling is omitted.
It bothers me that magnesium comes out as Mi this way, but I suppose that is unavoidable? It could be special-cased with a slice (`element[:3:2]`). Also, `single` does not strictly have to be defined as a separate list, but `str.index(element) + 1` appears three times, so I would like to tidy this part up. Would assigning it to a suitable variable solve that?
Note that a dictionary does not guarantee order in the first place; the output is sorted only for readability.
Modified version
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 04.py
str = "Hi He Lied Because Boron Could Not Oxidize Fluorine. \
New Nations Might Also Sign Peace Security Clause. Arthur King Can."
words_list = str.split()
dict = {}
single = [0, 4, 5, 6, 7, 8, 14, 15, 18]
for i in range(len(words_list)):
    clen = 1 if i in single else 2
    dict[words_list[i][:clen]] = i + 1
# Sort by atomic number and print
# for k, v in sorted(dict.items(), key=lambda x: x[1]):
#     print(k, v)
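The main improvements are as follows: the long line is wrapped with `\`; the result of `split()` is no longer assigned back to `str`; `single` is now zero-based; and the `for` loop iterates over indices instead of elements.
I think the biggest change this time is looping by index with `for`.
In the previous code I wanted to try the element-based way of writing `for`, which I had just learned, and as a result I ended up fetching the index all over again with `index()`...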
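As a possible middle ground between the two versions, the built-in `enumerate()` yields the position and the element together, so neither `index()` nor `range(len(...))` is needed. The following is a sketch for Python 3 under that assumption; the names `sentence`, `symbols`, and `position` are my own.

```python
sentence = ("Hi He Lied Because Boron Could Not Oxidize Fluorine. "
            "New Nations Might Also Sign Peace Security Clause. Arthur King Can.")

# One-based word positions that keep only their first character.
single = {1, 5, 6, 7, 8, 9, 15, 16, 19}

symbols = {}
# start=1 makes enumerate() count from 1, matching the problem statement.
for position, word in enumerate(sentence.split(), start=1):
    length = 1 if position in single else 2
    symbols[word[:length]] = position

# Sort by position (atomic number) for display.
for symbol, position in sorted(symbols.items(), key=lambda item: item[1]):
    print(symbol, position)
```

Keeping `single` one-based here means the numbers can be copied straight from the problem statement.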
Create a function that creates an n-gram from a given sequence (string, list, etc.). Use this function to get the word bi-gram and the letter bi-gram from the sentence "I am an NLPer".
05.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 05.py
original = "I am an NLPer"
def ngram(input, n):
    # Character n-gram (argument is a str)
    l = len(input)
    if type(input) == str:
        input = "$" * (n - 1) + input + "$" * (n - 1)
        # The padded sequence contains l + n - 1 windows of width n
        for i in xrange(l + n - 1):
            print input[i:i+n]
    # Word n-gram (argument is a list)
    elif type(input) == list:
        input = ["$"] * (n - 1) + input + ["$"] * (n - 1)
        for i in xrange(l + n - 1):
            print input[i:i+n]
ngram(original, 2)  # Character n-gram
original = original.split()
ngram(original, 2)  # Word n-gram
It was harder than I expected. There were many fine adjustments, such as ±1 offsets depending on the length of the string...
I inserted `$` before the beginning and after the end of the string.
I wanted to implement something like Java's method overloading, but overloading does not seem to be available by default in Python, so I branched on `type()` instead.
knok kindly provided comments and an implementation.
Implementation of ngram function by knok
def ngram(input, n):
    last = len(input) - n + 1
    ret = []
    for i in range(0, last):
        ret.append(input[i:i+n])
    return ret
By not inserting `$` at the beginning and end, it can be implemented neatly in one go.
Both strings and lists let you specify elements by index and slice them, so you do not have to be aware of the type.
Hmmm beautiful. Looking at my code again makes me dizzy. Thank you very much.
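To see the type-agnostic behavior in action, here is a sketch for Python 3 that restates knok's function as a list comprehension (my own rewording, not knok's code) and applies it to both a string and a list of words.

```python
def ngram(sequence, n):
    """Return every window of n consecutive items from a string or a list."""
    # Slicing works identically on str and list, so no type check is needed.
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]

text = "I am an NLPer"

char_bigrams = ngram(text, 2)          # slices of a str are str
word_bigrams = ngram(text.split(), 2)  # slices of a list are lists

print(char_bigrams)
print(word_bigrams)
```

The element type of the result follows the input: two-character strings for the character bi-gram, two-element lists for the word bi-gram.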
Find the sets of character bi-grams contained in "paraparaparadise" and "paragraph" as X and Y, respectively, and find the union, intersection, and difference of X and Y. In addition, check whether the bi-gram 'se' is contained in X and in Y.
06.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 06.py
str1 = "paraparaparadise"
str2 = "paragraph"
def ngram(input, n):
    l = len(input)
    list = []
    input = "$" * (n - 1) + input + "$" * (n - 1)
    # The padded string contains l + n - 1 windows of width n
    for i in xrange(l + n - 1):
        list.append(input[i:i+n])
    return list
# Convert the n-gram lists to sets; this removes duplicates and enables set operations
X = set(ngram(str1, 2))
Y = set(ngram(str2, 2))
print X.union(Y)  # Union
print X.intersection(Y)  # Intersection
print X.difference(Y)  # Difference
print "se" in X  # in: True if "se" is in X, False if not
print "se" in Y  # Same as above, for Y
See the code for specific usage. I'm happy to be able to write such operations intuitively.
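The same set operations can also be written with operators. The sketch below is for Python 3 and, as an assumption on my part, builds the bi-grams without `$` padding (following knok's approach from 05); `char_bigrams` is a hypothetical helper name.

```python
def char_bigrams(text):
    """Return the set of two-character windows in text (no padding)."""
    return {text[i:i + 2] for i in range(len(text) - 1)}

X = char_bigrams("paraparaparadise")
Y = char_bigrams("paragraph")

print(sorted(X | Y))  # union, same as X.union(Y)
print(sorted(X & Y))  # intersection, same as X.intersection(Y)
print(sorted(X - Y))  # difference, same as X.difference(Y)
print("se" in X, "se" in Y)
```

Sets are unordered, so the results are sorted here only to make the printed output stable.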
Modified version
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 06.py
from mymodule import ngram
str1 = "paraparaparadise"
str2 = "paragraph"
X = set(ngram(str1, 2))
Y = set(ngram(str2, 2))
# (rest omitted)
With reference to this article, I set things up so that the function I wrote in 05 can be reused as a module.
Implement a function that takes arguments x, y, z and returns the string "y at x is z". Furthermore, set x = 12, y = "temperature", z = 22.4, and check the execution result.
07.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 07.py
x = 12
y = u'temperature'
z = 22.4
def function(x, y, z):
    # Build "y at x is z"; non-string values are converted to unicode first
    return unicode(y) + u' at ' + unicode(x) + u' is ' + unicode(z)
print function(x, y, z)
Since x and z are an `int` and a `float`, respectively, they must be converted to `unicode` before being concatenated.
Character encoding conversion seems to run quite deep once you dig into it, but it worked this time, and that is what matters.
~~Is there a way to use `zip()`?~~ It doesn't look like it.
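As an alternative to converting each value by hand, `str.format()` performs the conversion itself. The following is a sketch for Python 3; the function name `describe` is hypothetical.

```python
def describe(x, y, z):
    """Return the string "y at x is z" for arbitrary printable values."""
    # format() converts the int and the float to text automatically.
    return "{y} at {x} is {z}".format(x=x, y=y, z=z)

result = describe(12, "temperature", 22.4)
print(result)
```

Named placeholders also make the word order explicit, which helps when the template is reordered for another language.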
Implement a function cipher that converts each character of a given string according to the following specification.
Lowercase letters are replaced with the character whose code is (219 - character code)
Other characters are output as they are
Use this function to encrypt and decrypt an English message.
08.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 08.py
# Source: English Wikipedia, "Atbash"
str = "Atbash is a simple substitution cipher for the Hebrew alphabet."
def cipher(input):
    ret = ""
    for char in input:
        ret += chr(219-ord(char)) if char.islower() else char
    return ret
str = cipher(str)
print str
str = cipher(str)
print str
Since this is the so-called Atbash cipher, the same function both encrypts and decrypts.
`chr()` is a function that converts an ASCII code to the corresponding character (`chr(97)` -> `'a'`).
`ord()` does the opposite, and for a Unicode string it returns the Unicode code point. The Unicode version of `chr()` is `unichr()`.
Before conversion | After conversion | Function to use |
---|---|---|
ASCII code | ASCII characters | chr() |
Unicode code point | Unicode characters | unichr() |
ASCII characters | ASCII code | ord() |
Unicode characters | Unicode code point | ord() |
See also the official documentation.
I condensed the `if` branch into a single line using the ternary operator.
Ternary operator
# Value 1 when the conditional expression is true, value 2 when it is false
value_1 if conditional_expression else value_2
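The following sketch, written for Python 3, combines the ternary expression with a check that applying the cipher twice restores the message. As an assumption of mine, it tests for the ASCII range `"a"` to `"z"` explicitly rather than calling `islower()`, because in Python 3 `islower()` is also true for non-ASCII lowercase letters, for which `219 - ord(c)` would be negative.

```python
def cipher(text):
    """Mirror ASCII lowercase letters (a<->z, b<->y, ...); keep everything else."""
    # ord("a") + ord("z") == 219, so 219 - ord(c) maps each letter to its mirror.
    return "".join(chr(219 - ord(c)) if "a" <= c <= "z" else c for c in text)

message = "Atbash is a simple substitution cipher for the Hebrew alphabet."
encrypted = cipher(message)
decrypted = cipher(encrypted)

print(encrypted)
print(decrypted)
```

Because the mapping is its own inverse, no separate decryption function is needed.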
Create a program that, for a sequence of words separated by spaces, randomly rearranges the letters of each word while leaving its first and last letters in place. However, words of length 4 or less are not rearranged. Give it an appropriate English sentence (for example, "I couldn't believe that I could actually understand what I was reading: the phenomenal power of the human mind.") and check the execution result.
09.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 09.py
import random
str = "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind ."
words = str.split()
shuffled_list = []
for word in words:
    # Words of four characters or fewer are left untouched
    if len(word) <= 4:
        pass
    else:
        char_list = list(word)
        mid_list = char_list[1:-1]
        random.shuffle(mid_list)
        word = word[0] + "".join(mid_list) + word[-1]
    shuffled_list.append(word)
shuffled_str = " ".join(shuffled_list)
print shuffled_str
Shuffle the characters randomly with `random.shuffle()`! Very convenient.
This would be quite a chore to implement in C, so I'm grateful that Python has such a rich library.
Since the shuffle is completely random, the result may come out identical to the original string. One option would be to compare the strings and retry if they are the same.
(String) comparison can be done with either `==` or `is`. Whereas `==` compares purely the content, `is` compares whether the two are the same object.
So `==` is the right choice when implementing the string comparison here.
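The difference between `==` and `is` is easiest to observe with lists, since two separately built lists are always distinct objects. The sketch below (Python 3, with variable names of my own) uses the middle characters of a word as the data.

```python
word = "believe"

# Two lists built separately from the same characters.
first = list(word[1:-1])
second = list(word[1:-1])

same_content = first == second  # compares the elements one by one
same_object = first is second   # compares object identity

# Assigning a list to another name does not copy it.
alias = first
alias_is_first = alias is first

print(same_content, same_object, alias_is_first)
```

For strings, whether two equal values happen to be the same object depends on the interpreter, which is another reason to compare content with `==`.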
(Quotation marks)
In Python, either `"` or `'` may be used to enclose a string.
However, if `'` appears inside the string as an English possessive or contraction, enclosing the whole string in `'` causes an error because the quotation marks no longer pair up.
As a workaround, either enclose the string in `"` or escape the quote with a backslash, as in `\'`.
Modified version
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 09.py
import random
def word_typoglycemia(word):
    if len(word) <= 4:
        return word
    mid_list = list(word[1:-1])
    while mid_list == list(word[1:-1]):
        random.shuffle(mid_list)
    return word[0] + "".join(mid_list) + word[-1]
def str_typoglycemia(str):
    shuffled_list = []
    for word in str.split():
        shuffled_list.append(word_typoglycemia(word))
    return " ".join(shuffled_list)
str = "I couldn't believe that I could actually understand \
what I was reading : the phenomenal power of the human mind ."
print(str_typoglycemia(str))
The main improvements include wrapping long lines with `\`.
The big change is that coincidental matches with the original are eliminated, but it is a little unfortunate that the `while` loop is not guaranteed to terminate.
It is very unlikely, though... (even for the riskiest case of 5 characters, the probability of still looping after n iterations is $\frac{1}{6^{n}}$)
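The figure of 1/6 comes from a 5-character word having 3 middle characters, which (when all distinct) have 3! = 6 orderings, only one of which equals the original. One case where the loop could never end is a word whose middle characters are all identical, since no different ordering exists. The sketch below (Python 3) adds a guard for that case; the guard is my own addition, not part of the original code.

```python
import random

def word_typoglycemia(word):
    """Shuffle the middle of a word, guaranteeing a change when one is possible."""
    if len(word) <= 4:
        return word
    original = list(word[1:-1])
    # If every middle character is the same, no ordering differs from the original.
    if len(set(original)) == 1:
        return word
    middle = list(original)
    # At least one different ordering exists, so this ends with probability 1.
    while middle == original:
        random.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

print(word_typoglycemia("believe"))
print(word_typoglycemia("aaaaa"))
```

Termination is still probabilistic rather than bounded, but the only input that made it impossible is now handled up front.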
Continue to Chapter 2, Part 1.