100 Language Processing Knock Chapter 1 in Python

This is the first chapter of 100 knocks on language processing.

The environment is Windows 10 with Python 3.6.0. I used the article linked here as a reference.

00. Reverse order of strings

Get a string with the letters "stressed" reversed.

# coding: utf-8
target = "stressed"
new_target = target[::-1]
print(new_target)

desserts

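When start and stop are omitted, the slice covers the whole string: it runs from index 0 up to 8 (the length) when the step is positive, and from the last character back to the first when the step is negative.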
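A small sketch of how Python fills in the omitted slice bounds. It relies on the standard `slice.indices()` method; the variable names are only for illustration.

```python
target = "stressed"

# slice.indices(length) returns the concrete (start, stop, step)
# that a slice would use on a sequence of that length.
forward = slice(None, None, 1).indices(len(target))
backward = slice(None, None, -1).indices(len(target))

print(forward)   # (0, 8, 1)
print(backward)  # (7, -1, -1)

# Both spellings below give the same reversed string.
print(target[::-1])
print("".join(reversed(target)))
```

The stop value of -1 in the backward case means "one step before index 0", which cannot be written literally in `target[7:-1:-1]`, so leaving the bounds out is the simplest spelling.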

01. "パタトクカシーー"

Take the 1st, 3rd, 5th, and 7th characters of the string "パタトクカシーー" and concatenate them into a new string.

# coding: utf-8
word = "パタトクカシーー"
new_word = word[::2]
print(new_word)

パトカー

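Don't forget the u prefix when running this under Python 2. In Python 3 every str literal is already Unicode, so the prefix is optional.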

02. "パトカー" + "タクシー" = "パタトクカシーー"

Get the string "パタトクカシーー" by alternately joining the characters of "パトカー" and "タクシー" from the beginning.

# coding: utf-8
word1 = u"パトカー"
word2 = u"タクシー"
mix_word = ""
for w1, w2 in zip(word1, word2):
    mix_word += w1 + w2
print(mix_word)

パタトクカシーー

--To zip up to the length of the longer input, use itertools.zip_longest

import itertools
target1 = '12345'
target2 = 'abc'
zipped = itertools.zip_longest(target1,target2)
print(list(zipped))

[('1', 'a'), ('2', 'b'), ('3', 'c'), ('4', None), ('5', None)]

--Set the fill value to something other than None

import itertools
target1 = '12345'
target2 = 'abc'
zipped = itertools.zip_longest(target1,target2,fillvalue = False )
print(list(zipped))

[('1', 'a'), ('2', 'b'), ('3', 'c'), ('4', False), ('5', False)]

--Zipping the result again with zip() restores the original grouping.

import itertools
target1 = '12345'
target2 = 'abc'
zipped = itertools.zip_longest(target1,target2,fillvalue = False )
zipped_list = list(zipped)
zizipped = zip(zipped_list[0],zipped_list[1],zipped_list[2],zipped_list[3],zipped_list[4])
print(list(zizipped))

[('1', '2', '3', '4', '5'), ('a', 'b', 'c', False, False)]

--Version using * to unpack the list

import itertools
target1 = '12345'
target2 = 'abc'
zipped = itertools.zip_longest(target1,target2,fillvalue = False )
zipped_list = list(zipped)
zizipped = zip(*zipped_list)
print(list(zizipped))

[('1', '2', '3', '4', '5'), ('a', 'b', 'c', False, False)]
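For comparison, the built-in zip stops at the shorter input, which is why itertools.zip_longest is needed above. The sketch below also shows one possible way to drop the fill values after unzipping; the sentinel name MISSING is made up for this example.

```python
import itertools

target1 = '12345'
target2 = 'abc'

# Built-in zip truncates to the shorter input.
short = list(zip(target1, target2))
print(short)  # [('1', 'a'), ('2', 'b'), ('3', 'c')]

# A unique sentinel avoids clashing with real values such as None or False.
MISSING = object()
pairs = list(itertools.zip_longest(target1, target2, fillvalue=MISSING))

# Unzip with *, then rebuild each string without the sentinel.
restored = [''.join(c for c in column if c is not MISSING)
            for column in zip(*pairs)]
print(restored)  # ['12345', 'abc']
```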

03. Pi

Break down the sentence "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics." into words, and create a list of the number of alphabetic characters in each word, in order of appearance.

words = "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."
result = []
new_words = words.translate(str.maketrans("","",",."))
for word in new_words.split(' '):
    word_length = len(word)
    result.append(word_length)
print(result)

[3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]

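strip removes any of the specified characters from both ends of the string.

str.translate(str.maketrans("","",".,"))

str.maketrans maps each character of the 1st argument to the corresponding character of the 2nd argument; the 3rd argument lists the characters you want to delete.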
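The difference between the two approaches can be seen on a string with punctuation in the middle. This is only an illustrative sketch; the sample text and variable names are made up.

```python
text = "drink, alcoholic. of,course."

# strip() only removes the given characters from both ends.
stripped = text.strip(",.")
print(stripped)  # drink, alcoholic. of,course

# maketrans("", "", chars) builds a table that deletes chars everywhere.
deleted = text.translate(str.maketrans("", "", ",."))
print(deleted)  # drink alcoholic ofcourse

# With non-empty 1st and 2nd arguments, characters are replaced one-to-one.
swapped = "abcab".translate(str.maketrans("ab", "xy"))
print(swapped)  # xycxy
```

In problem 03 both approaches give the same result because each word has punctuation only at its end.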

--Beautiful answer

words = "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."
result = [len(word.strip(",.")) for word in words.split(" ")]
print(result)

[3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]

import re
words = "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."
result = [len(word) for word in (re.sub(r"[,.]","",words).split(" "))]
print(result)

[3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]

04. Element symbol

Break down the sentence "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can." into words. Take the first character of the 1st, 5th, 6th, 7th, 8th, 9th, 15th, 16th, and 19th words and the first two characters of the other words, and create an associative array (dictionary type or map type) from each extracted string to the position of its word (its number counted from the beginning).

sentence = "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can."
words = [word.strip(',.') for word in sentence.split()]
dic = {word[0]:words.index(word) + 1 for word in words if words.index(word) in (0,4,5,6,7,8,14,15,18)}
dic.update({word[:2]:words.index(word) + 1 for word in words if words.index(word) not in (0,4,5,6,7,8,14,15,18)})
print(dic)

{'H': 1, 'B': 5, 'C': 6, 'N': 7, 'O': 8, 'F': 9, 'P': 15, 'S': 16, 'K': 19, 'He': 2, 'Li': 3, 'Be': 4, 'Ne': 10, 'Na': 11, 'Mi': 12, 'Al': 13, 'Si': 14, 'Cl': 17, 'Ar': 18, 'Ca': 20}

--Another solution 1

sentence = "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can."
words = [word.strip(',.') for word in sentence.split()]
link = {}
for i,v in enumerate(words,1):
    length = 1 if i in [1,5,6,7,8,9,15,16,19] else 2
    link.update({v[:length]:i})
print(link)

{'H': 1, 'He': 2, 'Li': 3, 'Be': 4, 'B': 5, 'C': 6, 'N': 7, 'O': 8, 'F': 9, 'Ne': 10, 'Na': 11, 'Mi': 12, 'Al': 13, 'Si': 14, 'P': 15, 'S': 16, 'Cl': 17, 'Ar': 18, 'K': 19, 'Ca': 20}

--Another solution 2

sentence = "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can."
# Build a dictionary that maps the first character (or first two characters) of each word to the index of that word
link = {w[:2-(i in (1,5,6,7,8,9,15,16,19))]:i for i,w in enumerate(sentence.split(),1)}
print(link)

{'H': 1, 'He': 2, 'Li': 3, 'Be': 4, 'B': 5, 'C': 6, 'N': 7, 'O': 8, 'F': 9, 'Ne': 10, 'Na': 11, 'Mi': 12, 'Al': 13, 'Si': 14, 'P': 15, 'S': 16, 'Cl': 17, 'Ar': 18, 'K': 19, 'Ca': 20}

This uses the fact that Boolean values behave as integers: True == 1 and False == 0.
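A short sketch of why the expression `2 - (i in (...))` works as a slice end. The example words and positions are chosen only for illustration.

```python
# bool is a subclass of int, so True and False act as 1 and 0.
print(issubclass(bool, int))  # True
print(True + True)            # 2

single = (1, 5, 6, 7, 8, 9, 15, 16, 19)
word = "Boron"
position = 5

# (position in single) is True here, so the slice end is 2 - 1 = 1.
print(word[:2 - (position in single)])  # B

# For a position outside the tuple the slice end stays 2.
print("He"[:2 - (2 in single)])         # He
```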

05. n-gram

Create a function that creates an n-gram from a given sequence (string, list, etc.). Use this function to get the word bi-grams and the character bi-grams from the sentence "I am an NLPer".

import re

sentence_string = "I am an NLPer"
sentence_list = sentence_string.split()
def n_gram(sequence,n):
    """Return character n-grams when given a string and word n-grams when given a list."""
    result = []
    if isinstance(sequence,str):
        sequence = list(re.sub("[,. ]","",sequence))
    for i in range(len(sequence)- n+1):
        result.append('-'.join(sequence[i:i+n]))        
    return result
print(n_gram(sentence_string,2))
print(n_gram(sentence_list,2))

['I-a', 'a-m', 'm-a', 'a-n', 'n-N', 'N-L', 'L-P', 'P-e', 'e-r']
['I-am', 'am-an', 'an-NLPer']

--Comprehension version (from shiracamus)

import re

sentence_string = "I am an NLPer"
sentence_list = sentence_string.split()

def n_gram(sequence, n):
    """Return character n-grams when given a string and word n-grams when given a list."""
    if isinstance(sequence, str):
        sequence = list(re.sub("[,. ]", "", sequence))
    return ['-'.join(sequence[i:i+n])
            for i in range(len(sequence) - n + 1)]

print(n_gram(sentence_string, 2))
print(n_gram(sentence_list, 2))

A list comprehension works here too.
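As a further variation, an n-gram function can return tuples instead of joined strings, which keeps the elements separate for later processing. This is a sketch, not part of the original solutions; the name n_gram_tuples is made up, and unlike the versions above it does not remove spaces or punctuation from strings.

```python
def n_gram_tuples(sequence, n):
    """Return the n-grams of a sequence as a list of tuples."""
    # zip over n views of the sequence, each shifted by one more position;
    # zip stops at the shortest view, so no index arithmetic is needed.
    return list(zip(*(sequence[i:] for i in range(n))))

sentence = "I am an NLPer"
print(n_gram_tuples(sentence.split(), 2))
# [('I', 'am'), ('am', 'an'), ('an', 'NLPer')]
print(n_gram_tuples("NLP", 2))
# [('N', 'L'), ('L', 'P')]
```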

06. Set

Find the sets of character bi-grams contained in "paraparaparadise" and "paragraph" as X and Y, respectively, and find the union, intersection, and difference of X and Y. In addition, find out whether the bi-gram 'se' is included in X and in Y.

import re
X = "paraparaparadise"
Y = "paragraph"
def n_gram(sequence,n):
    result = []
    if isinstance(sequence,str):
        sequence = list(re.sub("[,. ]","",sequence))
    for i in range(len(sequence)- n+1):
        result.append('-'.join(sequence[i:i+n]))
    return result

X = (set(n_gram(X,2)))
Y = (set(n_gram(Y,2)))
print("X:",X)
print("Y:",Y)
print("Union:",X | Y)
print("Intersection:",X & Y)
print("Difference set 1:",X - Y)
print("Difference set 2:",Y - X)
if 's-e' in X:
    print('se is included in X')
if 's-e' in Y:
    print('se is included in Y')

X: {'a-d', 'a-r', 'r-a', 'i-s', 's-e', 'd-i', 'p-a', 'a-p'}
Y: {'a-r', 'r-a', 'p-h', 'g-r', 'a-g', 'p-a', 'a-p'}
Union: {'a-r', 'i-s', 'p-a', 'a-p', 'a-d', 'r-a', 'p-h', 'g-r', 's-e', 'd-i', 'a-g'}
Intersection: {'p-a', 'a-p', 'a-r', 'r-a'}
Difference set 1: {'a-d', 'd-i', 's-e', 'i-s'}
Difference set 2: {'a-g', 'g-r', 'p-h'}
se is included in X

07. Sentence generation by template

Implement a function that takes arguments x, y, z and returns the string "y at x is z". Furthermore, set x = 12, y = "temperature", z = 22.4, and check the execution result.

def make_sentence(x, y, z):
    # Return the sentence instead of printing it, as the problem asks
    return "The {1} at {0}:00 is {2}".format(x, y, z)
print(make_sentence(x=12, y="temperature", z=22.4))

The temperature at 12:00 is 22.4
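Since the environment is Python 3.6, the same function could also be written with an f-string. A minimal sketch, assuming the English wording shown in the output above:

```python
def make_sentence(x, y, z):
    # f-strings (Python 3.6+) interpolate the variables directly.
    return f"The {y} at {x}:00 is {z}"

print(make_sentence(x=12, y="temperature", z=22.4))
# The temperature at 12:00 is 22.4
```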

08. Ciphertext

Implement the function cipher that converts each character of the given string according to the following specifications.
--If it is a lowercase letter, replace it with the character whose code is (219 - character code)
--Output other characters as they are
Use this function to encrypt / decrypt English messages.

import re
pat = re.compile(u"[a-z]") 
def cipher(string):
    return ''.join(chr(219-ord(c)) if pat.match(c) else c for c in string)

if __name__ == "__main__":
    sentence = u"Hello world!"
    ciphertext = cipher(sentence)
    print(sentence)
    print(ciphertext)
    print(cipher(ciphertext))

Hello world!
Hvool dliow!
Hello world!

--Regular expression non-use version (from shiracamus)


def cipher(string):
    return ''.join(chr(219 - ord(c)) if c.islower() else c for c in string)

if __name__ == "__main__":
    sentence = u"Hello world!"
    ciphertext = cipher(sentence)
    print(sentence)
    print(ciphertext)
    print(cipher(ciphertext))

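You can use str.islower() instead of a regular expression. Note that str.islower() returns True only when the string contains at least one cased character and all of its cased characters are lowercase. It is also True for non-ASCII lowercase letters such as 'é', which the pattern [a-z] does not match.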

chr(i) returns a string representing the character whose Unicode code point is the integer i. For example, chr(97) returns the string 'a' and chr(8364) returns the string '€'. It is the inverse of ord().

The valid range of arguments is 0 to 1,114,111 (0x10FFFF in hexadecimal). ValueError is raised if i is out of range.
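The constant 219 is the sum of the code points of 'a' and 'z', so subtracting a lowercase letter's code from it mirrors the alphabet, and applying the cipher twice restores the original. The sketch below shows the same mapping built as a translation table; the name cipher_table is made up for this example and it only handles ASCII lowercase letters.

```python
import string

# ord('a') + ord('z') == 97 + 122 == 219, so 219 - code mirrors the alphabet:
# 'a' <-> 'z', 'b' <-> 'y', and so on.
print(ord('a') + ord('z'))  # 219

lower = string.ascii_lowercase
# Translation table that maps each lowercase letter to its mirror.
table = str.maketrans(lower, lower[::-1])

def cipher_table(text):
    """Mirror ASCII lowercase letters; leave everything else unchanged."""
    return text.translate(table)

encrypted = cipher_table("Hello world!")
print(encrypted)                # Hvool dliow!
print(cipher_table(encrypted))  # Hello world!
```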

09. Typoglycemia

Create a program that randomly rearranges the letters of each word in a space-separated string, leaving the first and last letters of each word in place. However, words with a length of 4 or less are not rearranged. Give it an appropriate English sentence (e.g. "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind .") and check the execution result.

from random import shuffle
def change_order(sentence):
    produced_word_list = []
    word_list = sentence.split(' ')
    for word in word_list:
        if len(word) <= 4:
            produced_word_list.append(word)
        else:
            middle = list(word[1:-1])
            shuffle(middle)
            produced_word = word[0] + ''.join(middle) + word[-1]
            produced_word_list.append(produced_word)
    return ' '.join(produced_word_list)
sentence = "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind ."
print(change_order(sentence))

I cnud'olt bvieele that I cluod aucltaly uestradnnd what I was reading : the pnemonaehl power of the huamn mind .

--Another solution

import random
def change_order(sentence):
    produced_word_list = []
    word_list = sentence.split(' ')
    for word in word_list:
        if len(word) <= 4:
            produced_word_list.append(word)
        else:
            middle_list = list(word[1:-1])
            new_middle = ''
            while len(middle_list) > 0:
                rnd = random.randint(0,len(middle_list)-1)
                new_middle += middle_list.pop(rnd)
            new_word = word[0] + new_middle + word[-1]
            produced_word_list.append(new_word)
    return ' '.join(produced_word_list)
sentence = "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind ."
print(change_order(sentence))

I cl'oundt beevile that I culod aauctlly unnetdarsd what I was rdeaing : the pneoenhaml peowr of the haumn mind .

--Generator and random.shuffle version (from shiracamus)

import random

def change_order(sentence):
    def produced_words():
        word_list = sentence.split()
        for word in word_list:
            if len(word) <= 4:
                yield word
            else:
                middle = list(word[1:-1])
                random.shuffle(middle)
                yield word[0] + ''.join(middle) + word[-1]
    return ' '.join(produced_words())

sentence = "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind ."
print(change_order(sentence))
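Because the output is random, it differs on every run. One way to make a run repeatable is to pass a seeded random.Random instance. The sketch below also uses random.sample, which returns a new shuffled list instead of shuffling in place; the function name scramble and the rng parameter are made up for this example.

```python
import random

def scramble(word, rng=random):
    """Shuffle the inner letters of a word longer than 4 characters."""
    if len(word) <= 4:
        return word
    middle = list(word[1:-1])
    # random.sample returns a new shuffled list and leaves its input untouched.
    return word[0] + ''.join(rng.sample(middle, len(middle))) + word[-1]

rng = random.Random(0)  # fixed seed so repeated runs give the same result
original = "the phenomenal power of the human mind".split()
scrambled = [scramble(w, rng) for w in original]
print(' '.join(scrambled))
```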

Generator mystery

I wasn't comfortable with generators, so I took this opportunity to try to understand them, referring to "Classes and Iterators - Dive Into Python 3 (Japanese version)" and "Python iterator".

As for generators: is it right to understand that the for statement calls iter() and next() internally, and that these in turn call the __iter__ and __next__ methods of the object the generator function returns (?) An iterator object has two methods: the __iter__ method returns the iterator itself, and the __next__ method returns the next element.
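That understanding can be checked by writing out by hand roughly what a for statement does. This is a simplified sketch; count_up is a made-up generator function.

```python
def count_up(limit):
    """Generator function: calling it returns a generator object."""
    for i in range(limit):
        yield i

gen = count_up(3)

# A generator object is its own iterator.
print(iter(gen) is gen)  # True

# Roughly what "for value in count_up(3)" does behind the scenes.
collected = []
iterator = iter(count_up(3))      # calls __iter__()
while True:
    try:
        value = next(iterator)    # calls __next__()
    except StopIteration:         # raised when the generator is exhausted
        break
    collected.append(value)

print(collected)  # [0, 1, 2]
```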

Notes from "Classes and Iterators - Dive Into Python 3 (Japanese version)"

>>> import plural6
>>> r1 = plural6.LazyRules()
>>> r2 = plural6.LazyRules()
>>> r1.rules_filename                               ①
'plural6-rules.txt'
>>> r2.rules_filename
'plural6-rules.txt'
>>> r2.rules_filename = 'r2-override.txt'           ②
>>> r2.rules_filename
'r2-override.txt'
>>> r1.rules_filename
'plural6-rules.txt'
>>> r2.__class__.rules_filename                     ③
'plural6-rules.txt'
>>> r2.__class__.rules_filename = 'papayawhip.txt'  ④
>>> r1.rules_filename
'papayawhip.txt'
>>> r2.rules_filename                               ⑤
'r2-override.txt'

① Each instance of this class inherits the attribute rules_filename with the value defined in the class.
② Changing the attribute value of one instance does not affect the attribute value of other instances...
③ ...nor does it change the class attribute. You can refer to the class attribute (rather than the attribute of an individual instance) by using the special attribute __class__ to access the class itself.
④ When the class attribute is changed, an instance that still inherits the value (here, r1) is affected.
⑤ An instance that has overridden the attribute (here, r2) is not affected.

Instance attributes and class attributes are separate things.
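The plural6 module is not available here, so the following self-contained sketch reproduces the same behavior with a stand-in class. The class name LazyRulesSketch is made up; only the attribute name and values come from the session above.

```python
class LazyRulesSketch:
    # Class attribute: shared by every instance that does not override it.
    rules_filename = 'plural6-rules.txt'

r1 = LazyRulesSketch()
r2 = LazyRulesSketch()

# Assigning through an instance creates an instance attribute on r2 only.
r2.rules_filename = 'r2-override.txt'
print(r1.rules_filename)             # plural6-rules.txt
print('rules_filename' in vars(r1))  # False: r1 still reads the class attribute
print('rules_filename' in vars(r2))  # True: r2 has its own copy

# Changing the class attribute affects r1 but not r2.
LazyRulesSketch.rules_filename = 'papayawhip.txt'
print(r1.rules_filename)  # papayawhip.txt
print(r2.rules_filename)  # r2-override.txt
```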

`sample1.py`


class MyIterator(object):
    def __init__(self, *numbers):
        self._numbers = numbers
        self._i = 0
    def __iter__(self):
        # next() is implemented by this class itself, so return self as is
        return self
    def next(self):
        if self._i == len(self._numbers):
            raise StopIteration()
        value = self._numbers[self._i]
        self._i += 1
        return value

my_iterator = MyIterator(10, 20, 30)
for num in my_iterator:
    print 'hello %d' % num

`sample2.py`


class Fib:
    '''iterator that yields numbers in the Fibonacci sequence'''

    def __init__(self, max):
        self.max = max

    def __iter__(self):
        self.a = 0
        self.b = 1
        return self

    def __next__(self):
        fib = self.a
        if fib > self.max:
            raise StopIteration
        self.a, self.b = self.b, self.a + self.b
        return fib

Why is next the special method __next__ in one class and the ordinary method next in the other? Is the __next__ in the class called from the external built-in function next(), while the plain next in the class is called from inside the class (?)

I don't understand.
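The difference most likely comes from the Python version rather than from where the method is called. In Python 2 the iterator protocol method is named next; Python 3 renamed it to __next__ (PEP 3114), and in both versions it is called by the for statement or the built-in next(). sample1.py, with its print statement, is Python 2 code. A Python 3 port might look like the sketch below; the class name MyIterator3 is changed only for this example.

```python
class MyIterator3:
    def __init__(self, *numbers):
        self._numbers = numbers
        self._i = 0

    def __iter__(self):
        # The object is its own iterator, so return self.
        return self

    def __next__(self):
        # Python 3 looks up __next__; a method named only next is not used.
        if self._i == len(self._numbers):
            raise StopIteration
        value = self._numbers[self._i]
        self._i += 1
        return value

    # Optional alias so the class would also work under Python 2.
    next = __next__

lines = ['hello %d' % num for num in MyIterator3(10, 20, 30)]
print(lines)  # ['hello 10', 'hello 20', 'hello 30']
```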