[PYTHON] 100 Language Processing Knock Chapter 1

I don't know how much I can do. (cf. Knock 100 language processing 2015)

I added it with a memorandum because I was told how to do it in the comments.

00. Reverse order of strings

# coding: utf-8

s = "stressed"

print(s[::-1])

01. "Patatokukashi"

# coding: utf-8

s = "Patatoku Kashii"

print(s[::2])

02. "Police car" + "Taxi" = "Patatokukashi"

# coding: utf-8

s1 = 'Police car'
s2 = 'taxi'

s = ''.join([i+j for i, j in zip(s1, s2)])
print(s)

You can create an iterator from multiple iterables with zip ().

Postscript

# coding: utf-8

s1 = 'Police car'
s2 = 'taxi'

s = ''.join(i+j for i, j in zip(s1, s2))
print(s)

It is not necessary to create an unnecessary list by using the generator comprehension notation.

03. Pi

# coding: utf-8
import re

s = 'Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics.'

#Exclude commas and commas and then break down into a word-by-word list
s = re.sub('[,.]', '', s)
s = s.split()

#Count the number of characters and list
result = []
for w in s:
    result.append(len(w))
    
print(result)

Postscript

# coding: utf-8
s = 'Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics.'

#After breaking it down into a word-by-word list,.Exclude and count the number of characters
result = [len(w.rstrip(',.')) for w in s.split()]

print(result)

Processing such as initializing with an empty list and turning for is rewritten with comprehension notation. rstrip () removes the specified character from the right. Characters can be specified together.

04. Element symbol

# coding: utf-8
s = 'Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can.'.split()

target = [1, 5, 6, 7, 8, 9, 15, 16, 19]

result ={}

for i in range(len(s)):
    if i + 1 in target:
        result[i+1] = s[i][:1]
    else:
        result[i+1] = s[i][:2]
        
print(result)

Postscript

# coding: utf-8
s = 'Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can.'

target = 1, 5, 6, 7, 8, 9, 15, 16, 19

result = [w[: 1 if i in target else 2] for i, w in enumerate(s.split(), 1)]

print(result)

At first I wondered if else could be used in the inclusion notation of the [: 1 if i in target else 2] part, but this is the slice [: x] and the ternary operator ʻa if cond else b. A combination of `.

n-gram

# coding: utf-8

def n_gram(n, s):
    result = []
    for i in range(0, len(s)-n+1):
        result.append(s[i:i+n])
    return result

print(n_gram(2, 'I am an NLPer'))

Postscript

# coding: utf-8

def n_gram(n, s):
    return [s[i:i+n] for i in range(0, len(s)-n+1)]

print(n_gram(2, 'I am an NLPer'))

This is also for in the initialized list, so it can be rewritten in the inclusion notation. I'm not used to the comprehension, so I have the impression that it's difficult to write it in for and rewrite it.

06. Meeting

# coding: utf-8
#Union, intersection, and difference of two bigrams

def bi_gram(s):
    result = []
    for i in range(0, len(s)-1):
        result.append(s[i:i+2])
    return result
    
s1 = 'paraparaparadise'
s2 = 'paragraph'

X = set(bi_gram(s1))
Y = set(bi_gram(s2))
print("X = ", X)
print("Y = ", Y)
print("union: ", X | Y) # union
print("intersection: ", X & Y) # intersection
print("difference: ", X - Y) # difference

if "se" in X:
    print("X contain 'se'.")
else:
    print("X doesn't contain 'se'.")
    
if "se" in Y:
    print("Y contain 'se'.")
else:
    print("Y doesn't contain 'se'.")

07. Sentence generation by template

# coding: utf-8

def gen_sentence(x, y, z):
    return "{}of time{}Is{}".format(x, y, z)

x = 12
y = 'temperature'
z = 22.4
print(gen_sentence(x, y, z))

08. Ciphertext

# coding: utf-8

def cipher(S):
    result = []
    for i in range(len(S)):
        if(S[i].islower()):
            result.append(chr(219 - ord(S[i])))
        else:
            result.append(S[i])
    return ''.join(result)
    
S = "abcDe"
print(cipher(S))
print(cipher(cipher(S)))

Convert characters to Unicode code point integers with ʻord () . The inverse function is chr ()`.

Postscript

# coding: utf-8

def cipher(S):
    return ''.join(chr(219 - ord(c)) if c.islower() else c for c in S)

S = "abcDe"
print(cipher(S))
print(cipher(cipher(S)))

Those that conditionally branch to the initialized list and appendwithfor` can be replaced by ternary operators and comprehensions. If you know that it is such a thing, it seems that you can read it and write it.

Typoglycemia

# coding: utf-8

import numpy.random as rd

def gen_typo(S):
    if len(S) <= 4:
        return S
    else:
        idx = [0]
        idx.extend(rd.choice(range(1, len(S)-1), len(S)-2, replace=False))
        idx.append(len(S)-1)
        result = []
        for i in range(len(S)):
            result.append(S[idx[i]])
        return ''.join(result)

s = "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind ."
s = s.split()
print(' '.join([gen_typo(i) for i in s]))

There are various random sampling methods, but you can use numpy.random.choice to choose between restore extraction and non-restore extraction.

Postscript

# coding: utf-8

import random

def gen_typo(S):
    return ' '.join(
        s 
        #Returns words less than 4 in length
        if len(s) <= 4 
        #Shuffle words 5 or longer, leaving the first and second letters
        else s[0] + ''.join(random.sample(s[1:-1], len(s)-2)) + s[-1] 
        for s in S.split()
        )

S = "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind ."
print(gen_typo(S))

This is also rewritten with comprehensions and ternary operators.

Also, random.sample (population, k) randomly unrestores and extracts k elements from population (sequence or set).