100 Language Processing Knock with Python (Chapter 1)

Introduction

To practice natural language processing and Python, I am taking on the 100 Language Processing Knock (nlp100) published on the web page of the Inui-Okazaki Laboratory at Tohoku University. I plan to note down the code I wrote for it and the techniques worth remembering. The code is also available on GitHub.

As a textbook I am using "Introduction to Python 2 & 3 support" (Kenji Hosoda et al., Shuwa System).

I would also like to mention the article I referred to when getting started. I cannot deny that I leaned on it heavily, so please contact me if that is a problem.

I am a complete amateur, so the code is untidy: the notation is not unified and Python 2 and Python 3 styles are mixed. I would appreciate it if you pointed out any problems. The execution environment itself is Python 2.

Chapter 1: Warm-up

00. Reverse order of strings

Get a string in which the characters of the string "stressed" are arranged in reverse (from the end to the beginning).

Answer

00.py


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 00.py

str = "stressed"
print(str[-1::-1])

Comment

An exercise in string "slicing". Following the article mentioned above, I studied slices once more.

A slice is written in the form string[start index:end index:step] and extracts part of the string. Slices work on lists as well as strings.

str = "abcdefgh"

# Get a single character
str[0]        # 'a', indices are zero-based from the beginning
str[-1]       # 'h', negative indices count back from the end; same as str[7] here

# Slice
str[1:3]      # 'bc', the character at the end index is not included; the end is not a character count
str[0:-3]     # 'abcde', negative indices are allowed; same as str[0:5] here
str[:4]       # 'abcd', an omitted start index means from the beginning
str[4:]       # 'efgh', an omitted end index means up to the end

# Specify the step
str[0:6:2]    # 'ace', takes every step-th character (indices 0, 2, 4)
str[::3]      # 'adg', start and end can both be omitted
str[-3::2]    # 'fh', a negative start index is also allowed
str[::-3]     # 'heb', a negative step walks backward from the end

So the answer this time could simply have been str[::-1], but this was my first experience with slicing, so please go easy on me...
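
As a small check of the rules above, here is a minimal sketch written for Python 3 (the article itself runs on Python 2). It applies the same slice syntax to a list and confirms that [::-1] gives the same result as [-1::-1]. The variable names are my own choices for illustration.

```python
# Sketch for Python 3; names are illustrative, not from the original code.
text = "stressed"
letters = list("abcdefgh")

reversed_text = text[::-1]   # same result as text[-1::-1]
every_other = letters[::2]   # slices work on lists as well
last_three = letters[-3:]    # a negative start index counts from the end

print(reversed_text)
print(every_other)
print(last_three)
```

Using a name such as text instead of str also avoids shadowing the built-in str type.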

01. "Patatokukashi"

Take the 1st, 3rd, 5th, and 7th characters of the string "パタトクカシーー" ("Patatokukashii") and concatenate them into a new string.

Answer

01.py


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 01.py

str = u'パタトクカシーー'
print(str[0::2])  # パトカー ("police car")

Comment

Another slice exercise, like 00. As before, the start position can be omitted, giving str[::2]. In addition, Japanese (Unicode) strings are written with a u prefix, as in u'hogehoge' (in a UTF-8 environment).

02. "Police car" + "Taxi" = "Patatokukashi"

Obtain the string "パタトクカシーー" ("Patatokukashii") by alternately joining the characters of "パトカー" ("police car") and "タクシー" ("taxi") from the beginning.

Answer

02.py


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 02.py

str1 = u'パトカー'  # "police car"
str2 = u'タクシー'  # "taxi"
str3 = u''

for a,b in zip(str1, str2):
    str3 = str3 + a + b

print str3

Comment

zip() takes one element from each of its arguments in turn and packs them into tuples. It is handy when a single for loop needs to walk over several sequences at once. print is suddenly no longer called as a function here; that is Python 2 notation. Sorry for the mixture.

As the article mentioned above also points out, appending to the end of a string on every loop iteration appears to be a problem for execution speed. It seems better to join the pieces afterward, as in print(''.join([a + b for a, b in zip(str1, str2)])). ''.join() concatenates the elements of its argument, inserting the string it is called on (here the empty string '') between them. Please note that the print style has changed again.
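
The join-based approach can be sketched as follows. This is a minimal illustration written for Python 3 (not the article's Python 2 environment), using the original Japanese strings パトカー ("police car") and タクシー ("taxi"); the name merged is my own.

```python
# Sketch for Python 3 (the article's own code targets Python 2).
str1 = "パトカー"
str2 = "タクシー"

# Build all the pieces first, then concatenate them once with join().
merged = "".join(a + b for a, b in zip(str1, str2))
print(merged)

# The separator is the string that join() is called on.
print("-".join(["a", "b", "c"]))
```

Note that zip() stops at the shorter input, so this only interleaves fully when both strings have the same length, as they do here.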

03. Pi

Break down the sentence "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics." into words, and create a list of the number of (alphabetic) characters in each word, in order of appearance.

Answer

03.py


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 03.py

str = "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."
str = str.replace('.', "")
str = str.replace(',', "")
str = str.split()

list = []

for word in str:
    list.append(len(word))

print list

Comment

After removing the periods and commas with replace(), I split the sentence into words with split(), get each length with len(), and append it to list. I wondered whether there was a better way to remove the periods and commas, but gave up. Since split() accepts a delimiter as an argument (the default is whitespace), I tried to specify them all at once, but could not.
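
One possible way to split on several delimiters at once is re.split() from the standard library, which accepts a regular expression rather than a fixed delimiter. The following is only a sketch under that assumption, written for Python 3; the names sentence, words, and lengths are mine.

```python
import re

sentence = ("Now I need a drink, alcoholic of course, after the heavy "
            "lectures involving quantum mechanics.")

# Split on any run of characters that is not a letter, then drop the
# empty string left behind by the trailing period.
words = [w for w in re.split(r"[^A-Za-z]+", sentence) if w]
lengths = [len(w) for w in words]
print(lengths)
```

Another option would be to keep split() and call word.strip(",.") on each word, which removes punctuation only at the edges of a word.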

04. Element symbol

Break down the sentence "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can." into words. Take the first character of the 1st, 5th, 6th, 7th, 8th, 9th, 15th, 16th, and 19th words and the first two characters of the other words, and create an associative array (dictionary or map type) from each extracted string to the position of its word (its number counting from the beginning).

Answer

04.py


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 04.py

str = "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can."
str = str.split()

dict = {}
single = [1, 5, 6, 7, 8, 9, 15, 16, 19]

for element in str:
    if str.index(element) + 1 in single:
        dict[element[:1]] = str.index(element) + 1
    else:
        dict[element[:2]] = str.index(element) + 1

#Sort by atomic number and print
for k, v in sorted(dict.items(), key=lambda x:x[1]):
    print k, v

Comment

As in 03, the sentence is split into words, which are processed one at a time in a for loop. Since only the beginning of each word is examined, the period handling is omitted. It bothers me that magnesium comes out as Mi this way, but perhaps that is unavoidable? It could be handled as a special case with a slice (element[:3:2]). single does not strictly need to be a separate variable, but str.index(element) + 1 appears three times, so I would like to tidy this part up. Would assigning it to a suitable variable solve it? Also, dictionaries do not guarantee order in the first place, so the output is sorted for readability.

Fix

Modified version


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 04.py

str = "Hi He Lied Because Boron Could Not Oxidize Fluorine.\
 New Nations Might Also Sign Peace Security Clause. Arthur King Can."
words_list = str.split()

dict = {}
single = [0, 4, 5, 6, 7, 8, 14, 15, 18]

for i in range(len(words_list)):
    clen = 1 if i in single else 2
    dict[words_list[i][:clen]] = i + 1

#Sort by atomic number and print
# for k, v in sorted(dict.items(), key=lambda x: x[1]):
#     print(k, v)

The main improvements are as follows.

I think the biggest change this time is that the for loop now runs over indices. In the previous code I wanted to try the for ... in style I had just learned, and as a result I ended up looking the index up again with index()...
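
As an alternative to both index() and range(len(...)), the built-in enumerate() yields the position together with each word. The following is a minimal sketch written for Python 3; the names sentence, symbols, position, and length are my own, and single here holds 1-based positions.

```python
sentence = ("Hi He Lied Because Boron Could Not Oxidize Fluorine. "
            "New Nations Might Also Sign Peace Security Clause. "
            "Arthur King Can.")
single = {1, 5, 6, 7, 8, 9, 15, 16, 19}  # 1-based positions that keep one letter

symbols = {}
# enumerate(..., start=1) yields (position, word) pairs directly.
for position, word in enumerate(sentence.split(), start=1):
    length = 1 if position in single else 2
    symbols[word[:length]] = position

# Sort by position (atomic number) and print.
for symbol, position in sorted(symbols.items(), key=lambda item: item[1]):
    print(symbol, position)
```

Starting the count at 1 keeps the positions identical to the numbers in the problem statement, so no + 1 adjustment is needed.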

05. n-gram

Create a function that creates an n-gram from a given sequence (string, list, etc.). Use this function to get the word bi-gram and the letter bi-gram from the sentence "I am an NLPer".

Answer

05.py


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 05.py

original = "I am an NLPer"

def ngram(input, n):
    #Character n-gram (Argument str)
    l = len(input)
    if type(input) == str:
        input = "$" * (n - 1) + input + "$" * (n - 1)
        for i in xrange(l + 1):
            print input[i:i+n]
    #Word n-gram (Argument list)
    elif type(input) == list:
        input = ["$"] * (n - 1) + input + ["$"] * (n - 1)
        for i in xrange(l + 1):
            print input[i:i+n]

ngram(original, 2)              #Character n-gram
original = original.split()
ngram(original, 2)              #Word n-gram

Comment

This was harder than I expected: there were many small ±1 adjustments that depend on the length of the input. I inserted $ before the beginning and after the end of the string. I wanted something like Java's overloading, but Python does not seem to provide overloading by default, so I branched on type() instead.

Comments and implementation

Comments and implementation from knok.

Implementation of ngram function by knok


def ngram(input, n):
    last = len(input) - n + 1
    ret = []
    for i in range(0, last):
        ret.append(input[i:i+n])
    return ret

By not inserting $ at the beginning and end, the whole thing can be implemented neatly in one pass. Both strings and lists support indexing and slicing, so the function does not need to be aware of the type.

Hmm, beautiful. Looking at my own code again makes me dizzy. Thank you very much.
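
To see the type-independent behavior concretely, here is a minimal sketch for Python 3 that applies the same idea to both a string and a list. The list-comprehension form and the names char_bigrams and word_bigrams are my own; the logic follows knok's function above.

```python
def ngram(sequence, n):
    # Same idea as knok's function, written as a list comprehension.
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]

original = "I am an NLPer"
char_bigrams = ngram(original, 2)          # slices of a str are str
word_bigrams = ngram(original.split(), 2)  # slices of a list are lists

print(char_bigrams)
print(word_bigrams)
```

When the input is shorter than n, the range is empty and the function returns an empty list.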

06. Set

Let X and Y be the sets of character bi-grams contained in "paraparaparadise" and "paragraph", respectively, and find the union, intersection, and difference of X and Y. In addition, check whether the bi-gram "se" is included in X and in Y.

Answer

06.py


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 06.py

str1 = "paraparaparadise"
str2 = "paragraph"

def ngram(input, n):
    l = len(input)
    list = []
    input = "$" * (n - 1) + input + "$" * (n - 1)
    for i in xrange(l + 1):
        list.append(input[i:i+n])
    return list

# Convert the n-gram lists to sets: duplicates are removed and set operations become available.
X = set(ngram(str1, 2))
Y = set(ngram(str2, 2))

print X.union(Y)            # Union
print X.intersection(Y)     # Intersection
print X.difference(Y)       # Difference

print "se" in X             # in: True if "se" is in X, False if not
print "se" in Y             # Same check for Y

Comment

See the code for the specific usage. I am happy that such operations can be written so intuitively.
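
Python sets also support operator forms of the same operations. The sketch below is written for Python 3 and, as an assumption of mine, uses the unpadded n-gram function from 05 (knok's version), so the sets do not contain the $-padded bi-grams that the code above produces.

```python
def ngram(sequence, n):
    # Unpadded n-grams, following knok's version from problem 05.
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]

X = set(ngram("paraparaparadise", 2))
Y = set(ngram("paragraph", 2))

print(X | Y)   # union, same as X.union(Y)
print(X & Y)   # intersection, same as X.intersection(Y)
print(X - Y)   # difference, same as X.difference(Y)
print("se" in X, "se" in Y)
```

Sets are unordered, so the printed order of the elements can vary from run to run.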

Fix

Modified version


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 06.py

from mymodule import ngram

str1 = "paraparaparadise"
str2 = "paragraph"

X = set(ngram(str1, 2))
Y = set(ngram(str2, 2))

# (rest omitted)

Following this article, I set things up so that the function I wrote in 05 can be reused.

07. Sentence generation by template

Implement a function that takes arguments x, y, z and returns the string "y at x is z". Furthermore, set x = 12, y = "temperature", z = 22.4, and check the execution result.

Answer

07.py


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 07.py

x = 12
y = u'temperature'
z = 22.4

def function(x, y, z):
    return unicode(x) + u'of time' + unicode(y) + u'Is' + unicode(z)

print function(x, y, z)

Comment

Since x and z are an int and a float, respectively, they must be converted with unicode() before being concatenated. Character encoding conversion looks like a deep topic once you dig into it, but it worked this time, which is what matters. I wondered whether zip() could be used here, but it does not look like it.

08. Ciphertext

Implement the function cipher that converts each character of the given character string with the following specifications.

  • Replace each lowercase letter with the character whose code is (219 - character code)

  • Output other characters as they are

Use this function to encrypt / decrypt English messages.

Answer

08.py


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 08.py

# Source: English Wikipedia, "Atbash"
str = "Atbash is a simple substitution cipher for the Hebrew alphabet."

def cipher(input):
    ret = ""
    for char in input:
        ret += chr(219-ord(char)) if char.islower() else char
    return ret

str = cipher(str)
print str
str = cipher(str)
print str

Comment

This is the so-called Atbash cipher, so the same function both encrypts and decrypts. chr() converts an ASCII code into the corresponding character (chr(97) -> 'a'). ord() does the opposite; given a Unicode character, it returns the Unicode code point. The Unicode version of chr() is unichr().

| Before conversion  | After conversion   | Function to use |
|--------------------|--------------------|-----------------|
| ASCII code         | ASCII character    | chr()           |
| Unicode code point | Unicode character  | unichr()        |
| ASCII character    | ASCII code         | ord()           |
| Unicode character  | Unicode code point | ord()           |

See also the official documentation.

The if branch is condensed into one line with the ternary operator.

Ternary operator


# value1 when the condition is true, value2 when it is false
value1 if condition else value2
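
Putting chr(), ord(), and the conditional expression together, the cipher can be sketched for Python 3 as below. In Python 3, chr() covers all Unicode code points, so unichr() is not needed. The explicit "a" <= c <= "z" test is my own choice, made so that only ASCII lowercase letters are converted.

```python
def cipher(text):
    # Lowercase ASCII letters map to chr(219 - code); the rest pass through.
    return "".join(chr(219 - ord(c)) if "a" <= c <= "z" else c for c in text)

message = "Atbash is a simple substitution cipher for the Hebrew alphabet."
encrypted = cipher(message)

print(ord("a"), chr(97))
print(encrypted)
print(cipher(encrypted) == message)  # applying the cipher twice restores the text
```

Because 219 = ord("a") + ord("z"), the mapping sends a to z, b to y, and so on, which is why applying it twice gives back the original.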

09. Typoglycemia

Create a program that, for a sequence of words separated by spaces, keeps the first and last letters of each word and randomly rearranges the order of the other letters. However, words of length 4 or less are not rearranged. Give it a suitable English sentence (for example, "I couldn't believe that I could actually understand what I was reading: the phenomenal power of the human mind.") and check the execution result.

Answer

09.py


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 09.py
import random

str = "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind ."
words = str.split()
shuffled_list = []

for word in words:
    if len(word) < 4:
        pass
    else:
        char_list = list(word)
        mid_list = char_list[1:-1]
        random.shuffle(mid_list)
        word = word[0] + "".join(mid_list) + word[-1]
    shuffled_list.append(word)

shuffled_str = " ".join(shuffled_list)
print shuffled_str

Comment

random.shuffle() shuffles the list of characters in place. Very convenient! This looks quite tedious to implement in C, so I am grateful that Python has such a rich library. Because the shuffle is completely random, the result may come out identical to the original string. One option would be to compare the strings and retry when they match.

Supplement

(Strings) There are two comparisons, == and is. Whereas == compares contents only, is checks whether both sides are the same object. For the string comparison implemented here, == is the right choice.

(Quotation marks) In Python, a string can be enclosed in either " or '. However, if the text uses ' for an English possessive or contraction, enclosing the whole string in ' causes an error because the quotation marks no longer pair up. As a workaround, either enclose the string in " or escape the apostrophe with a backslash, as in \'.
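
Both points can be illustrated with a short sketch for Python 3. I use lists for the == versus is comparison because two separate list() calls always create distinct objects, whereas whether two equal strings are the same object depends on the interpreter. The variable names are mine.

```python
word = "couldn't"         # double quotes, so the apostrophe needs no escape
same_word = 'couldn\'t'   # single quotes with an escaped apostrophe

middle = list(word[1:-1])
copy_of_middle = list(word[1:-1])

print(word == same_word)         # True: identical contents
print(middle == copy_of_middle)  # True: == compares contents
print(middle is copy_of_middle)  # False: two separate list objects
```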

Fix

Modified version


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 09.py

import random


def word_typoglycemia(word):
    if len(word) <= 4:
        return word

    mid_list = list(word[1:-1])
    while mid_list == list(word[1:-1]):
        random.shuffle(mid_list)
    return word[0] + "".join(mid_list) + word[-1]


def str_typoglycemia(str):
    shuffled_list = []
    for word in str.split():
        shuffled_list.append(word_typoglycemia(word))
    return " ".join(shuffled_list)


str = "I couldn't believe that I could actually understand \
 what I was reading : the phenomenal power of the human mind ."

print(str_typoglycemia(str))

The main improvements are as follows.

The big change is that the result can no longer coincide with the original, but it is a little unfortunate that the while loop is not guaranteed to terminate. That is very unlikely, though. (Even in the riskiest case, a 5-letter word with three distinct middle letters, the probability that the loop is still running after n shuffles is $\frac{1}{6^{n}}$.)
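
One case worth guarding against is a word whose middle letters are all identical: no shuffle can differ from the original, so the loop would never end. The following is a sketch of one possible guard, written for Python 3; the early return and the test word "beeeen" are my own additions, not part of the modified version above.

```python
import random

def word_typoglycemia(word):
    # Words of length 4 or less are returned unchanged, as the problem requires.
    if len(word) <= 4:
        return word
    middle = list(word[1:-1])
    # If every middle letter is the same, no shuffle can differ from the
    # original, so return early instead of looping forever.
    if len(set(middle)) < 2:
        return word
    shuffled = middle[:]
    while shuffled == middle:
        random.shuffle(shuffled)
    return word[0] + "".join(shuffled) + word[-1]

print(word_typoglycemia("phenomenal"))
print(word_typoglycemia("beeeen"))  # middle letters are all "e": unchanged
```

With this guard the loop ends with probability 1, although there is still no fixed upper bound on the number of shuffles.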

In conclusion

Continue to Chapter 2, Part 1.
