Introduction

To practice natural language processing and Python, I am taking on the 100 Language Processing Knock (nlp100) published on the web page of the Inui-Okazaki Laboratory at Tohoku University. I plan to note down the code I implemented and the techniques worth remembering. The code is also available on GitHub.

As a textbook I am using "Introduction to Python 2 & 3 support" (Kenji Hosoda et al., Shuwa System).

Here is the article I referred to when getting started. I cannot deny that I leaned on it too heavily, so please contact me if that is a problem.

http://qiita.com/tanaka0325/items/08831b96b684d7ecb2f7

I am a complete amateur, so the code is rather unsightly: the notation is not unified and Python 2 and Python 3 styles are mixed. I would appreciate it if you could point out any problems. The execution environment itself is Python 2.

Chapter 1: Warm-up

00. Reverse order of strings

Get a string in which the characters of the string "stressed" are arranged in reverse (from the end to the beginning).

Answer

`00.py`


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 00.py

str = "stressed"
print(str[-1::-1])

Comment

An exercise in "slicing" strings. As noted in the article mentioned above, this was a chance to study slices again.

A slice is written as `string[start:end:step]` and extracts part of the string. Slices work on lists as well as strings.

str = "abcdefgh"

# Get a specific character
str[0]        # 'a', indices are zero-based from the beginning
str[-1]       # 'h', negative indices count back from the end; here the same as str[7]

# Slice
str[1:3]      # 'bc', the character at the end index is not included; the end is an index, not a count
str[0:-3]     # 'abcde', negative indices are OK; here the same as str[0:5]
str[:4]       # 'abcd', starts from the beginning if the start index is omitted
str[4:]       # 'efgh', runs to the end if the end index is omitted

# Specify the step
str[0:6:2]    # 'ace', takes every step-th character (indices 0, 2, 4)
str[::3]      # 'adg', start and end can both be omitted
str[-3::2]    # 'fh', a negative start index also works
str[::-3]     # 'heb', a negative step walks backwards from the end

So the answer this time could simply have been `str[::-1]`. This was my first experience with slicing, so please bear with me...
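As a small illustration of the remark that slices also work on lists, here is a sketch in Python 3 syntax (the variable names are made up for this example). It also checks that `[::-1]` and the `[-1::-1]` form used in 00.py give the same reversed result.

```python
# Hypothetical example data; slices behave the same on lists as on strings.
letters = list("stressed")

reversed_letters = letters[::-1]      # a reversed copy of the list
reversed_text = "stressed"[::-1]      # the reversed string
same_as_answer = "stressed"[-1::-1]   # the form used in 00.py

print(reversed_text)
print("".join(reversed_letters) == reversed_text == same_as_answer)
```

Omitting the start index with a negative step makes the slice begin at the last element, which is why both forms agree.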

01. 「パタトクカシーー」

Take the 1st, 3rd, 5th, and 7th characters of the string 「パタトクカシーー」 and get the string formed by concatenating them.

Answer

`01.py`


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 01.py

str = u'パタトクカシーー'
print(str[0::2])

Comment

Another slicing exercise, like 00. Here too the start position can be omitted, giving `str[::2]`. Also, Japanese (Unicode) strings are written with a `u` prefix, as in `u'hogehoge'` (in a UTF-8 environment).

02. 「パトカー」 + 「タクシー」 = 「パタトクカシーー」

Obtain the string 「パタトクカシーー」 by alternately joining the characters of 「パトカー」 ("police car") and 「タクシー」 ("taxi") from the beginning.

Answer

`02.py`


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 02.py

str1 = u'パトカー'
str2 = u'タクシー'
str3 = u''

for a,b in zip(str1, str2):
    str3 = str3 + a + b

print str3

Comment

`zip()` is a function that takes one element from each of its arguments and builds tuples. It is a handy technique for iterating over several sequences in a `for` loop. `print` has suddenly stopped being a function here, but this is Python 2 notation. Sorry for the mixture.

As the article above also mentions, appending to the end of a string on every loop iteration is said to be a problem for execution speed. It seems best to combine the strings afterwards, as in `print(''.join([a + b for a, b in zip(str1, str2)]))`. `''.join()` joins the elements of its argument, separated by the delimiter given in `''`. Note that the writing style changes accordingly.
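The join-based approach can be sketched as follows, written in Python 3 syntax (so no `u` prefix is needed). The names `pairs` and `result` are illustrative and do not come from the original code.

```python
# The two strings from problem 02.
str1 = "パトカー"
str2 = "タクシー"

# Build all the two-character pieces first, then join them once at the end.
pairs = [a + b for a, b in zip(str1, str2)]
result = "".join(pairs)

print(result)
```

Collecting the pieces in a list and joining once avoids creating a new, longer string on every loop iteration.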

03. Pi

Break the sentence "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics." into words, and create a list of the number of (alphabetic) characters in each word, in order of appearance.

Answer

`03.py`


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 03.py

str = "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."
str = str.replace('.', "")
str = str.replace(',', "")
str = str.split()

list = []

for word in str:
    list.append(len(word))

print list

Comment

After removing periods and commas with `replace()`, split the sentence into words with `split()`, get each length with `len()`, and append it to `list`. I wondered whether there was a better way to remove the periods and commas, but I gave up. Since `split()` accepts a delimiter as an argument (the default is whitespace), I thought I could specify them all at once, but I could not.
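One possible way around the punctuation problem, offered as a sketch rather than as the intended solution: `str.strip()` accepts a set of characters to trim from both ends of each word, and `re.findall()` can pick out runs of letters directly. The variable names below are illustrative, and the code uses Python 3 syntax.

```python
import re

sentence = ("Now I need a drink, alcoholic of course, "
            "after the heavy lectures involving quantum mechanics.")

# Option 1: split on whitespace, then trim commas and periods from each word.
lengths_strip = [len(word.strip(",.")) for word in sentence.split()]

# Option 2: extract runs of ASCII letters with a regular expression.
lengths_regex = [len(word) for word in re.findall(r"[A-Za-z]+", sentence)]

print(lengths_strip)
```

Both options give the same list for this sentence. The regular expression version assumes the words contain only ASCII letters.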

04. Element symbol

Break the sentence "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can." into words. Take the first character of the 1st, 5th, 6th, 7th, 8th, 9th, 15th, 16th, and 19th words and the first two characters of the other words, and create an associative array (dictionary type or map type) from each extracted string to the word's position (its ordinal number from the beginning of the sentence).

Answer

`04.py`


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 04.py

str = "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can."
str = str.split()

dict = {}
single = [1, 5, 6, 7, 8, 9, 15, 16, 19]

for element in str:
    if str.index(element) + 1 in single:
        dict[element[:1]] = str.index(element) + 1
    else:
        dict[element[:2]] = str.index(element) + 1

#Sort by atomic number and print
for k, v in sorted(dict.items(), key=lambda x:x[1]):
    print k, v

Comment

As in 03, the sentence is split into words, which are processed one by one in a `for` loop. Since only the beginning of each word is used anyway, the period handling is omitted. It is unfortunate that magnesium comes out as Mi this way, but is that unavoidable? It could be handled as a special case with a slice such as `element[:3:2]`. The positions do not have to be stored in `single`, but `str.index(element) + 1` appears three times, so I would like to tidy this part up. Would assigning it to a suitable variable solve it? Also, a dictionary does not guarantee any order in the first place, so the output is sorted to make it easier to read.

Fix

`Modified version`


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 04.py

str = "Hi He Lied Because Boron Could Not Oxidize Fluorine.\
 New Nations Might Also Sign Peace Security Clause. Arthur King Can."
words_list = str.split()

dict = {}
single = [0, 4, 5, 6, 7, 8, 14, 15, 18]

for i in range(len(words_list)):
    clen = 1 if i in single else 2
    dict[words_list[i][:clen]] = i + 1

#Sort by atomic number and print
# for k, v in sorted(dict.items(), key=lambda x: x[1]):
#     print(k, v)

The main improvements are as follows.

Break overly long lines partway with \
Avoid reusing str
Change single to zero-based
Tidy up redundant code
Iterate the for loop over indices instead of elements

I think the biggest change this time is looping by index. Having just learned the `for element in ...` style in the previous code, I wanted to try it out, and as a result I ended up recovering the index again with `index()`...
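The index-based loop can also be written with the built-in `enumerate()`, which yields the position together with each word. The following is only a sketch in Python 3 syntax: the names `sentence`, `symbols`, and `position` are chosen for this illustration, and `single` is kept one-based here, as in the first version.

```python
sentence = ("Hi He Lied Because Boron Could Not Oxidize Fluorine. "
            "New Nations Might Also Sign Peace Security Clause. Arthur King Can.")

# One-based positions of the words that contribute a single character.
single = {1, 5, 6, 7, 8, 9, 15, 16, 19}

symbols = {}
# start=1 makes enumerate() count word positions from 1 instead of 0.
for position, word in enumerate(sentence.split(), start=1):
    length = 1 if position in single else 2
    symbols[word[:length]] = position

# Print in order of position (atomic number).
for symbol, position in sorted(symbols.items(), key=lambda item: item[1]):
    print(symbol, position)
```

With `enumerate()` there is no need for either `index()` or `range(len(...))`, and using a set for `single` keeps the membership test cheap.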

05. n-gram

Create a function that builds an n-gram from a given sequence (string, list, etc.). Use this function to get the word bi-grams and the character bi-grams from the sentence "I am an NLPer".

Answer

`05.py`


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 05.py

original = "I am an NLPer"

def ngram(input, n):
    l = len(input)
    # Character n-gram (argument is a str)
    if type(input) == str:
        input = "$" * (n - 1) + input + "$" * (n - 1)
        # The padded sequence holds l + n - 1 n-grams (l + 1 when n = 2)
        for i in xrange(l + n - 1):
            print input[i:i+n]
    # Word n-gram (argument is a list)
    elif type(input) == list:
        input = ["$"] * (n - 1) + input + ["$"] * (n - 1)
        for i in xrange(l + n - 1):
            print input[i:i+n]

ngram(original, 2)              #Character n-gram
original = original.split()
ngram(original, 2)              #Word n-gram

Comment

This was harder than I expected, with many fine ±1 adjustments depending on the length of the string. I inserted `$` before the beginning and after the end of the string. I wanted to implement something like function overloading in Java, but Python does not seem to provide overloading by default, so I branched on `type()` instead.

Comments and implementation

Comments and an implementation from knok.

`Implementation of ngram function by knok`


def ngram(input, n):
    last = len(input) - n + 1
    ret = []
    for i in range(0, last):
        ret.append(input[i:i+n])
    return ret

By not inserting `$` at the beginning and end, it can be implemented neatly in one go. Both strings and lists let you specify elements by index and slice them, so there is no need to be aware of the type.

Hmm, beautiful. Looking at my own code again makes me dizzy. Thank you very much.
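To see the type-agnostic behaviour in action, here is a short usage sketch in Python 3 syntax. It restates the unpadded function as a list comprehension (my own rewording, not knok's exact code) and applies it to both a string and a list of words.

```python
def ngram(sequence, n):
    """Return all length-n windows of a string or list, without padding."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]

text = "I am an NLPer"

char_bigrams = ngram(text, 2)          # slices of a str are strs
word_bigrams = ngram(text.split(), 2)  # slices of a list are lists

print(char_bigrams)
print(word_bigrams)
```

Because slicing returns the same type it was given, the character bi-grams come back as short strings and the word bi-grams as two-element lists.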

06. Set

Find the sets of character bi-grams contained in "paraparaparadise" and "paragraph" as X and Y, respectively, and find the union, intersection, and difference of X and Y. In addition, find out whether the bi-gram 'se' is included in X and Y.

Answer

`06.py`


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 06.py

str1 = "paraparaparadise"
str2 = "paragraph"

def ngram(input, n):
    l = len(input)
    list = []
    input = "$" * (n - 1) + input + "$" * (n - 1)
    for i in xrange(l + n - 1):  # the padded string holds l + n - 1 n-grams
        list.append(input[i:i+n])
    return list

# Convert the n-gram lists to sets; this removes duplicates and enables set operations.
X = set(ngram(str1, 2))
Y = set(ngram(str2, 2))

print X.union(Y)            # Union
print X.intersection(Y)     # Intersection
print X.difference(Y)       # Difference

print "se" in X             # in: True if "se" is in X, False if not
print "se" in Y             # Likewise for Y

Comment

See the code for the specific usage. I am happy that such operations can be written so intuitively.
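The same set operations can also be written with operators, as in the Python 3 sketch below. It assumes an unpadded bi-gram helper (no `$` markers), so the sets contain only bi-grams that actually occur in the words. With the padded version in 06.py the sets would also contain entries such as `$p`.

```python
def ngram(sequence, n):
    """Unpadded n-grams; an assumption for this sketch, not the 06.py version."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]

X = set(ngram("paraparaparadise", 2))
Y = set(ngram("paragraph", 2))

union = X | Y          # same as X.union(Y)
intersection = X & Y   # same as X.intersection(Y)
difference = X - Y     # same as X.difference(Y)

print(sorted(union))
print(sorted(intersection))
print(sorted(difference))
print("se" in X, "se" in Y)
```

The results are sorted before printing only because sets have no defined order.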

Fix

`Modified version`


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 06.py

from mymodule import ngram

str1 = "paraparaparadise"
str2 = "paragraph"

X = set(ngram(str1, 2))
Y = set(ngram(str2, 2))

# (rest omitted)

With reference to this article, I arranged things so that the function I wrote in 05 can be reused.

07. Sentence generation by template

Implement a function that takes arguments x, y, z and returns the string "y at x is z". Furthermore, set x = 12, y = "temperature", z = 22.4, and check the execution result.

Answer

`07.py`


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 07.py

x = 12
y = u'temperature'
z = 22.4

def function(x, y, z):
    return unicode(y) + u' at ' + unicode(x) + u' is ' + unicode(z)

print function(x, y, z)

Comment

Since x and z are an `int` and a `float`, respectively, they must be converted with `unicode()` before being concatenated. Character encoding conversion seems to run quite deep once you dig into it, but it worked this time, which is what matters. ~~Is there a way to use zip()?~~ It does not look like it.
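An alternative to converting each value by hand is `str.format()`, which converts its arguments to text while filling the template. This is a sketch in Python 3 syntax, and the function name `describe` is made up for this example.

```python
def describe(x, y, z):
    """Fill the template "y at x is z"; format() converts each value to text."""
    return "{1} at {0} is {2}".format(x, y, z)

message = describe(12, "temperature", 22.4)
print(message)
```

The numbered placeholders refer to the argument positions, so the order in the template can differ from the order of the parameters.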

08. Ciphertext

Implement the function cipher that converts each character of the given character string with the following specifications.

Replace each lowercase letter with the character whose code is (219 - character code)

Output all other characters as they are

Use this function to encrypt and decrypt an English message.

Answer

`08.py`


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 08.py

# Source: English Wikipedia, "Atbash"
str = "Atbash is a simple substitution cipher for the Hebrew alphabet."

def cipher(input):
    ret = ""
    for char in input:
        ret += chr(219-ord(char)) if char.islower() else char
    return ret

str = cipher(str)
print str
str = cipher(str)
print str

Comment

Since this is a so-called "Atbash cipher", encryption and decryption are done by the same function. `chr()` converts an ASCII code to the corresponding character (`chr(97)` -> `'a'`). `ord()` does the opposite, and for a Unicode character it returns the Unicode code point. The Unicode version of `chr()` is `unichr()` (in Python 2; in Python 3, `chr()` covers Unicode and `unichr()` no longer exists).

| Before conversion | After conversion | Function to use |
| --- | --- | --- |
| ASCII code | ASCII character | chr() |
| Unicode code point | Unicode character | unichr() |
| ASCII character | ASCII code | ord() |
| Unicode character | Unicode code point | ord() |

`Ternary operator`


# value1 when the condition is true, value2 when it is false
value1 if condition else value2
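As a runnable illustration of the conditional expression, here is a Python 3 sketch of the same cipher. It tests for ASCII lowercase letters explicitly with `'a' <= c <= 'z'`, which is my own choice for this example, so that other characters are never passed through `219 - ord(c)`.

```python
def cipher(text):
    """Mirror ASCII lowercase letters (a <-> z, b <-> y, ...); keep the rest."""
    return "".join(
        chr(219 - ord(c)) if "a" <= c <= "z" else c  # conditional expression
        for c in text
    )

message = "Atbash is a simple substitution cipher for the Hebrew alphabet."
encrypted = cipher(message)
decrypted = cipher(encrypted)

print(encrypted)
print(decrypted)
```

Since `ord('a') + ord('z')` is 219, the mapping is its own inverse, so applying `cipher` twice returns the original message.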

09. Typoglycemia

Create a program that, for a sequence of words separated by spaces, randomly rearranges the letters of each word while leaving its first and last letters in place. However, words of length 4 or less are not rearranged. Give it a suitable English sentence (for example, "I couldn't believe that I could actually understand what I was reading: the phenomenal power of the human mind.") and check the execution result.

Answer

`09.py`


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 09.py
import random

str = "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind ."
words = str.split()
shuffled_list = []

for word in words:
    if len(word) < 4:
        pass
    else:
        char_list = list(word)
        mid_list = char_list[1:-1]
        random.shuffle(mid_list)
        word = word[0] + "".join(mid_list) + word[-1]
    shuffled_list.append(word)

shuffled_str = " ".join(shuffled_list)
print shuffled_str

Comment

Shuffle the characters at random with `random.shuffle()`! Very convenient. This looks like it would be quite a chore to implement in C, so I am grateful that Python has such a rich library. Because the shuffle is completely random, it may return the same string as the original. One option is to compare the strings and retry if they are identical.

Supplement

(String comparison) Comparisons can be made with `==` or `is`. Whereas `==` compares purely the contents, `is` checks whether the two are the same object. For the string comparison here, `==` is the correct choice.

(Quotation marks) In Python, a string can be enclosed in either `"` or `'`. However, if `'` is used inside the string for an English possessive or contraction, enclosing the whole string in `'` causes an error because the quotation marks no longer pair up. As a workaround, either enclose the string in `"` or escape the quote with a backslash, as in `\'`.
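Both points in this supplement can be checked with a few lines. The sketch below uses Python 3 syntax and made-up variable names. It uses lists rather than strings for the identity check, because whether two equal strings are the same object depends on the interpreter.

```python
word = "reading"
middle = list(word[1:-1])
copy_of_middle = list(middle)   # a new list with the same contents

same_contents = (middle == copy_of_middle)   # compares element by element
same_object = (middle is copy_of_middle)     # compares object identity

# Two ways to write a string that contains an apostrophe.
double_quoted = "I couldn't"
escaped = 'I couldn\'t'

print(same_contents, same_object)
print(double_quoted == escaped)
```

The two lists are equal but distinct objects, which is why a content check such as the one in the shuffle retry needs `==` and not `is`.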

Fix

`Modified version`


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 09.py

import random


def word_typoglycemia(word):
    if len(word) <= 4:
        return word

    mid_list = list(word[1:-1])
    while mid_list == list(word[1:-1]):
        random.shuffle(mid_list)
    return word[0] + "".join(mid_list) + word[-1]


def str_typoglycemia(str):
    shuffled_list = []
    for word in str.split():
        shuffled_list.append(word_typoglycemia(word))
    return " ".join(shuffled_list)


str = "I couldn't believe that I could actually understand \
 what I was reading : the phenomenal power of the human mind ."

print(str_typoglycemia(str))

The main improvements are as follows.

Split into functions
Tidy up redundant code
Break overly long lines partway with \
Strict compliance with the problem statement (skip words shorter than 4 characters → skip words of 4 characters or fewer)
Rule out results that happen to match the original string (debatable from the standpoint of randomness)

The big change is ruling out accidental matches, but it is a little regrettable that the `while` loop is no longer guaranteed to end. The chance is very small, though... (Even in the riskiest case, a 5-character word, the probability that it is still looping after n shuffles is $\frac{1}{6^{n}}$.)
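One way to guarantee termination, sketched here in Python 3 syntax as a possible variation and not as the author's method, is to cap the number of retries and to return early when every middle letter is identical, since no shuffle can change such a word. The function name `shuffle_middle` and the parameter `max_tries` are hypothetical.

```python
import random

def shuffle_middle(word, max_tries=10):
    """Shuffle the inner letters of word, retrying at most max_tries times."""
    if len(word) <= 4:
        return word
    middle = list(word[1:-1])
    # If all inner letters are identical, no shuffle can change the word.
    if len(set(middle)) == 1:
        return word
    shuffled = middle[:]
    for _ in range(max_tries):
        random.shuffle(shuffled)
        if shuffled != middle:
            break
    return word[0] + "".join(shuffled) + word[-1]

sentence = "I couldn't believe that I could actually understand what I was reading"
print(" ".join(shuffle_middle(w) for w in sentence.split()))
```

With a cap, the function may occasionally return an unchanged word, which trades the strict "never identical" rule for a loop that always ends.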

Conclusion

To be continued in Chapter 2, Part 1.
