Rehabilitating Python and NLP skills with "100 Language Processing Knocks 2015" (Chapter 1)

"100 Language Processing Knocks" has been renewed, and it seems that the 2015 version has been released.

I worked through it in my favorite language, Python (2.x). I know there are already many similar articles, but I am publishing this both as a memo of my own progress and to share it. Suggestions are welcome.

I want to keep going with Chapter 2 and beyond... "Rehabilitation" only means something if you keep at it!

Chapter 1: Warm-up

00. Reversing a string

Get a string in which the characters of the string "stressed" are arranged in reverse (from the end to the beginning).


text = "stressed"
text_reverse = ""
n = len(text)
for i in xrange(n):
    text_reverse += text[n-i-1]
print text_reverse
#>>> desserts

This walks straightforwardly from the end to the beginning using indices. It also works when text = "".

Revised version

text = "stressed"
n = len(text)
text_reverse_list = [text[n-i-1] for i in xrange(n)]
text_reverse = ''.join(text_reverse_list)

print text_reverse
#>>> desserts

Postscript: building up a string by repeated concatenation in a for loop is apparently bad for both execution speed and memory. So I switched to the "build a list of strings, then combine them with join" approach.


text = "stressed"
text_reverse = text[::-1]
print text_reverse
#>>> desserts

This is simple with a slice. The syntax is string[start:end:step].
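
As a small illustration of the string[start:end:step] notation, here is a sketch with a few more slices of the same word. The variable names are only for this example, and print() is written with parentheses so the snippet runs on Python 2 and 3 alike.

```python
text = "stressed"

# Omitted start/end cover the whole string; step -1 walks backwards.
reversed_text = text[::-1]

# start=1, end=6, step=2 picks indices 1, 3 and 5.
every_other = text[1:6:2]

# Negative step with explicit bounds: index 5 down to (not including) index 1.
partial_reverse = text[5:1:-1]

print(reversed_text)
print(every_other)
print(partial_reverse)
```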

01. "パタトクカシーー"

Take the 1st, 3rd, 5th, and 7th characters of the string "パタトクカシーー" and get the string formed by concatenating them.


text = u"パタトクカシーー"
text_concat = text[0] + text[2] + text[4] + text[6]
print text_concat
#>>> パトカー

I pass the string as a unicode object so that indexing works per character rather than per byte.


text = u"パタトクカシーー"
text_concat = text[::2]
print text_concat
#>>> パトカー

I see, the same thing can be done with a slice.

02. "パトカー" + "タクシー" = "パタトクカシーー"

Obtain the string "パタトクカシーー" by alternately joining the characters of "パトカー" (police car) and "タクシー" (taxi) from the beginning.


text1 = u"パトカー"
text2 = u"タクシー"
text_concat = ""
m = len(text1)
n = len(text2)
for i in xrange(m):
    if i<n:
        text_concat += text1[i] + text2[i]
        # text1 ends here: append whatever is left of text2
        if i == m-1:
            text_concat += text2[i+1:]
    else:
        # text2 has run out: append the rest of text1 and stop
        text_concat += text1[i:]
        break

print text_concat
#>>> パタトクカシーー

I walk through the two strings from the beginning. It is not strictly part of the problem, but I also considered the case where text1 and text2 have different lengths (m != n): when one string runs out, the remainder of the other is appended as-is.
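
The same "alternate, then append the leftover" idea can also be sketched with zip and slicing. The function name interleave and the sample strings below are made up for this illustration; this is one possible variant, not the code used above.

```python
def interleave(a, b):
    # Pair up characters only for the length both strings share.
    k = min(len(a), len(b))
    pairs = [x + y for x, y in zip(a[:k], b[:k])]
    # At most one of a[k:] and b[k:] is non-empty: the leftover tail.
    return "".join(pairs) + a[k:] + b[k:]

print(interleave("abc", "12345"))
print(interleave("abcde", "12"))
```

Because the leftover is handled by slicing, no special case is needed inside the loop.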

03. Pi

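Break the sentence "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics." into words, and create a list of the number of (alphabetic) characters in each word, in order of appearance.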


sentence = "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."

word_length = [len(x.strip(',.')) for x in sentence.split()]
print word_length
#>>> [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]

This uses a list comprehension. I like Python because it lets you write a neat one-liner like this.
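
Note that strip(',.') only removes punctuation at the two ends of a word. A variant that counts alphabetic characters wherever they sit might look like the sketch below; count_letters is a hypothetical helper name, and for this particular sentence both approaches give the same list.

```python
sentence = ("Now I need a drink, alcoholic of course, "
            "after the heavy lectures involving quantum mechanics.")

def count_letters(word):
    # Count only alphabetic characters, ignoring punctuation anywhere in the word.
    return sum(1 for ch in word if ch.isalpha())

word_length = [count_letters(w) for w in sentence.split()]
print(word_length)
```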

04. Element symbol

Break the sentence "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can." into words. Take the first character of the 1st, 5th, 6th, 7th, 8th, 9th, 15th, 16th, and 19th words and the first two characters of every other word, then create an associative array (dictionary or map type) from each extracted string to the position of its word (counted from the beginning).


sentence = "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can."

word_list = [x.strip(',.') for x in sentence.split()]
word_dict = dict()
specified = [1, 5, 6, 7, 8, 9, 15, 16, 19]
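for i, word in enumerate(word_list):
    if i in [x-1 for x in specified]:
        # specified words contribute their first character only
        word_dict[word[:1]] = i+1
    else:
        # every other word contributes its first two characters
        word_dict[word[:2]] = i+1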

print word_dict
#>>> {'Be': 4, 'C': 6, 'B': 5, 'Ca': 20, 'F': 9, 'S': 16, 'H': 1, 'K': 19, 'Al': 13, 'Mi': 12, 'Ne': 10, 'O': 8, 'Li': 3, 'P': 15, 'Si': 14, 'Ar': 18, 'Na': 11, 'N': 7, 'Cl': 17, 'He': 2}
print word_dict['Be']
#>>> 4

I store i+1 in word_dict so that the values match the 1-based positions used in the problem. I only noticed after solving it, but the values turn out to be the atomic numbers, that is, the number of protons. It brings back memories of "Sui-hei-ri-be, boku no fune" (the Japanese mnemonic for the periodic table).

But wait... isn't magnesium 'Mg'?


'Mi': 12
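
To see whether 'Mi' is the only entry that disagrees with the real periodic table, a quick check like the following can help. The symbols list is typed in by hand for this sketch (atomic numbers 1 to 20), and the other names are illustrative.

```python
sentence = ("Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations "
            "Might Also Sign Peace Security Clause. Arthur King Can.")
one_letter = set([1, 5, 6, 7, 8, 9, 15, 16, 19])

# Real element symbols for atomic numbers 1..20, entered manually.
symbols = ["H", "He", "Li", "Be", "B", "C", "N", "O", "F", "Ne",
           "Na", "Mg", "Al", "Si", "P", "S", "Cl", "Ar", "K", "Ca"]

words = [w.strip(",.") for w in sentence.split()]
extracted = [w[:1] if pos in one_letter else w[:2]
             for pos, w in enumerate(words, 1)]

# Collect (position, extracted string, real symbol) wherever they differ.
mismatches = [(pos, got, want)
              for pos, (got, want) in enumerate(zip(extracted, symbols), 1)
              if got != want]
print(mismatches)
```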

05. n-gram

Create a function that creates an n-gram from a given sequence (string, list, etc.). Use this function to get the word bi-gram and the letter bi-gram from the sentence "I am an NLPer".


def n_gram(sentence_str, n, type):
    result = set()
    if type == 'word':
        words = [x.strip(',.') for x in sentence_str.split()]
    elif type == 'letter':
        words = sentence_str
    m = len(words)
    for i in xrange(m-n+1):
        # store each n-gram as a tuple so it can be a set element
        result.add(tuple(words[i:i+n]))
    return result

#If you also want frequency information
from collections import defaultdict
def n_gram_freq(sentence_str, n, type):
    result = defaultdict(int)
    if type == 'word':
        words = [x.strip(',.') for x in sentence_str.split()]
    elif type == 'letter':
        words = sentence_str
    m = len(words)
    for i in xrange(m-n+1):
        result[tuple(words[i:i+n])] += 1
    return result

sentence_str = "I am an NLPer"
#sentence_list = ['I', 'am', 'an', 'NLPer']

print n_gram(sentence_str, 2, 'word')
#>>> set([('am', 'an'), ('an', 'NLPer'), ('I', 'am')])
print n_gram(sentence_str, 2, 'letter')
#>>> set([('N', 'L'), ('m', ' '), ('e', 'r'), ('a', 'n'), ('I', ' '), ('n', ' '), ('L', 'P'), (' ', 'N'), (' ', 'a'), ('a', 'm'), ('P', 'e')])

print n_gram_freq(sentence_str, 2, 'word')
#>>> defaultdict(<type 'int'>, {('am', 'an'): 1, ('an', 'NLPer'): 1, ('I', 'am'): 1})
print n_gram_freq(sentence_str, 2, 'letter')
#>>>defaultdict(<type 'int'>, {('N', 'L'): 1, ('m', ' '): 1, ('e', 'r'): 1, ('a', 'n'): 1, ('I', ' '): 1, ('n', ' '): 1, ('L', 'P'): 1, (' ', 'N'): 1, (' ', 'a'): 2, ('a', 'm'): 1, ('P', 'e'): 1})

The function takes a string as its argument. A separate argument, type, selects between word n-grams and character n-grams.
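
Since the problem asks for a function that accepts any sequence (string, list, etc.), another possible shape is to let the caller decide how to split and keep the function itself generic. The sketch below is one such variant; n_gram_seq is a hypothetical name, and it returns a list (keeping order and duplicates) rather than a set.

```python
def n_gram_seq(seq, n):
    # Works on any sliceable sequence: a str gives character n-grams,
    # a list of words gives word n-grams.
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

sentence = "I am an NLPer"
word_bigrams = n_gram_seq(sentence.split(), 2)
char_bigrams = n_gram_seq(sentence, 2)

print(word_bigrams)
print(char_bigrams[:3])
```

Wrapping the result in set() gives the same kind of value as n_gram above.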

06. Sets

Let X and Y be the sets of character bi-grams contained in "paraparaparadise" and "paragraph", respectively, and find the union, intersection, and difference of X and Y. In addition, check whether the bi-gram 'se' is included in X and in Y.


str_x = "paraparaparadise"
str_y = "paragraph"

X = n_gram(str_x, 2, 'letter')
Y = n_gram(str_y, 2, 'letter')

print X.union(Y) # Union; same as X | Y
#>>> set([('g', 'r'), ('p', 'h'), ('p', 'a'), ('s', 'e'), ('a', 'p'), ('a', 'g'), ('a', 'd'), ('i', 's'), ('r', 'a'), ('a', 'r'), ('d', 'i')])

print X.intersection(Y) # Intersection; same as X & Y
#>>> set([('a', 'p'), ('r', 'a'), ('p', 'a'), ('a', 'r')])

print X.difference(Y) # Difference; same as X - Y
#>>> set([('a', 'd'), ('s', 'e'), ('d', 'i'), ('i', 's')])

print tuple('se') in X
#>>> True
print tuple('se') in Y
#>>> False

This reuses the function defined in the previous problem. Because that function stores each n-gram as a tuple, the given 'se' is converted to a tuple before testing membership.
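
The operator forms of the set operations can be checked against the method forms with a short self-contained sketch. The helper char_bigrams is a stand-in written for this example (it builds the bi-grams with zip), not the n_gram function above.

```python
def char_bigrams(text):
    # Set of character bi-grams, each stored as a 2-tuple.
    return set(zip(text, text[1:]))

X = char_bigrams("paraparaparadise")
Y = char_bigrams("paragraph")

union_op = X | Y
intersection_op = X & Y
difference_op = X - Y

print(sorted(intersection_op))
print(sorted(difference_op))
print(("s", "e") in X)
print(("s", "e") in Y)
```

Note that the difference is not symmetric: Y - X gives the bi-grams that appear only in "paragraph".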

07. Sentence generation by template

Implement a function that takes arguments x, y, z and returns the string "y at x is z". Furthermore, set x = 12, y = "temperature", z = 22.4, and check the execution result.


from string import Template

def generate_sentence(x,y,z):
    t = Template("${y_tmpl} at ${x_tmpl} is ${z_tmpl}")
    return t.safe_substitute(x_tmpl=x, y_tmpl=y, z_tmpl=z)

def generate_sentence_incomplete(x,y,z):
    # z_tmpl is deliberately left out to show what safe_substitute does
    t = Template("${y_tmpl} at ${x_tmpl} is ${z_tmpl}")
    return t.safe_substitute(x_tmpl=x, y_tmpl=y)

x, y, z = 12, "temperature", 22.4
print generate_sentence(x,y,z)
#>>> temperature at 12 is 22.4
print generate_sentence_incomplete(x,y,z)
#>>> temperature at 12 is ${z_tmpl}

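Because I used Template.safe_substitute(), a missing value does not raise an error; the placeholder is simply left in the output. From the documentation:

Like substitute(), except that if placeholders are missing from mapping and kwds, instead of raising a KeyError exception, the original placeholder will appear in the resulting string intact.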
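
The difference between the two methods can be seen side by side in a short sketch. The template string and variable names here are chosen just for this illustration.

```python
from string import Template

t = Template("${y_tmpl} at ${x_tmpl} is ${z_tmpl}")

# safe_substitute leaves the unresolved placeholder in place.
partial = t.safe_substitute(x_tmpl=12, y_tmpl="temperature")
print(partial)

# substitute raises KeyError when a placeholder has no value.
try:
    t.substitute(x_tmpl=12, y_tmpl="temperature")
    raised = False
except KeyError as err:
    raised = True
    print("missing placeholder: %s" % err)
```

safe_substitute is convenient when a partly filled template is acceptable; substitute is the safer choice when a missing value should be treated as a bug.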

08. Ciphertext

Implement a function cipher that converts each character of the given string according to the following specification: replace each lowercase letter with the character whose code is (219 - character code), and output all other characters unchanged. Use this function to encrypt and decrypt an English message.


def cipher(input_str):
    result = ""
    for letter in input_str:
        # Lowercase letters are mirrored: 'a' <-> 'z', 'b' <-> 'y', ...
        if letter.isalpha() and letter.islower():
            result += chr(219-ord(letter))
        # Everything else is passed through unchanged
        else:
            result += letter
    return result

english_message = "This is a pen."

print cipher(english_message)
#>>> Tsrh rh z kvm.

print cipher(cipher(english_message))
#>>> This is a pen.

I was not familiar with this topic, but it is apparently called the Atbash cipher.

Cryptography was already in use in the Old Testament. One example is Atbash, a Hebrew substitution cipher. It numbers the letters and swaps each position counted from the beginning with the same position counted from the end. Applied to the 26 letters of the alphabet, A is replaced with Z, B with Y, and so on.

That is why encryption and decryption can be done with the same function: applying it twice gives back the original string.
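
The constant 219 is what makes this work: it is ord('a') + ord('z') = 97 + 122, so 219 - code maps 'a' to 'z', 'b' to 'y', and so on. A compact sketch of the same idea follows; note that, as an assumption of this variant, it tests for the ASCII range 'a'..'z' directly instead of using isalpha() and islower().

```python
# 219 is the sum of the two ends of the lowercase ASCII range.
assert ord("a") + ord("z") == 219

def cipher(text):
    # Mirror a..z onto z..a; leave every other character untouched.
    return "".join(chr(219 - ord(ch)) if "a" <= ch <= "z" else ch
                   for ch in text)

message = "This is a pen."
encrypted = cipher(message)
print(encrypted)
print(cipher(encrypted))
```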

The built-in functions used here (chr and ord) are described in the "Built-in Functions" section of the Python documentation.


09. Typoglycemia

Given a sequence of words separated by spaces, create a program that keeps the first and last letters of each word and randomly shuffles the order of the remaining letters. Words of length 4 or less are left unchanged. Feed it a suitable English sentence (for example, "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind .") and check the result.


import random

def random_shuffled_str(input_str):
    input_list = list(input_str)
    random.shuffle(input_list)  # shuffles in place and returns None
    result = ''.join(input_list)
    return result

def typoglycemia(sentence):
    str_list = sentence.split()
    result_list = [x[0]+ random_shuffled_str(x[1:-1]) +x[-1] if len(x) > 4 else x for x in str_list]
    return ' '.join(result_list)

message = "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind ."

print typoglycemia(message)
#>>> I cunl'dot beveile that I cloud aallucty unsrndtead what I was rdaineg : the phaenmneol pweor of the huamn mind .

print typoglycemia(message)
#>>> I cu'dolnt bivelee that I cloud aculltay udetnnasrd what I was ridneag : the pheonaenml peowr of the hamun mind .

When you want an else branch in a list comprehension, the conditional expression

if len(x) > 4 else x

goes in this position, before the for clause, rather than after it like a filter.

As for random.shuffle:

- It directly modifies the list (or other mutable sequence) given as its argument.
- So I wrapped it in a function that returns the shuffled result, which can then be used inside the list comprehension.

random.shuffle(x[, random]) Shuffle the sequence x in place. The optional argument random is a 0-argument function returning a random float in [0.0, 1.0); by default, this is the function random().

Note that for even rather small len(x), the total number of permutations of x is larger than the period of most random number generators; this implies that most permutations of a long sequence can never be generated.
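
The in-place behaviour is easy to confirm, and random.sample offers a non-mutating alternative that returns a new list. The sketch below uses illustrative names and is not the function defined above; shuffle_middle is hypothetical.

```python
import random

letters = list("nderstan")  # the middle of "understand"

# shuffle works in place and returns None, so it cannot be used
# directly as an expression inside a list comprehension.
returned = random.shuffle(letters)

def shuffle_middle(word):
    # Keep the first and last letters; shuffle the rest via random.sample,
    # which returns a new list and leaves its input untouched.
    if len(word) <= 4:
        return word
    middle = list(word[1:-1])
    return word[0] + "".join(random.sample(middle, len(middle))) + word[-1]

scrambled = shuffle_middle("understand")
print(returned)
print(scrambled)
```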

