Rehabilitation of Python and NLP skills starting with "100 Language Processing Knock 2015" (Chapter 1)

http://www.cl.ecei.tohoku.ac.jp/nlp100/ has been renewed, and it seems that the 2015 edition has been released.

I worked through it in my favorite language, Python (2 series). I know there are already many similar articles (for example http://qiita.com/tanaka0325/items/08831b96b684d7ecb2f7), but I am publishing this anyway as a memo of my own progress and to share it. Suggestions are welcome.

I want to keep going with Chapter 2 and beyond ... "rehabilitation" only means something if you keep at it!

Chapter 1: Warm-up

00. Reverse order of strings

Get a string in which the characters of the string "stressed" are arranged in reverse (from the end to the beginning).

Straightforward version


text = "stressed"
text_reverse = ""
n = len(text)
for i in xrange(n):
    text_reverse += text[n-i-1]
print text_reverse
#>>> desserts

Walk from the end of the string to the beginning using indices, the straightforward way. It also works when text = "".

Revised version


text = "stressed"
n = len(text)
text_reverse_list = [text[n-i-1] for i in xrange(n)]
text_reverse = ''.join(text_reverse_list)

print text_reverse
#>>> desserts

Postscript: apparently, building up a string by concatenating inside a for loop is poor in terms of execution speed and memory. So I switched to the approach of building a list of strings and then connecting them with join.
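The join approach from the postscript can also be written with the built-in reversed(), which yields the characters from last to first. The following is only a minimal sketch in Python 3 syntax (the article itself uses Python 2), and the function name reverse_with_join is made up for illustration:

```python
def reverse_with_join(text):
    # reversed() iterates over the characters from the end to the beginning;
    # join() then builds the result string in one go.
    return "".join(reversed(text))


print(reverse_with_join("stressed"))  # desserts
```

Like the index-based versions, this returns an empty string for an empty input.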

Slice


text = "stressed"
text_reverse = text[::-1]
print text_reverse
#>>> desserts

Simple with a slice. The syntax is string[start:end:step]; omitting start and end and giving a step of -1 walks the whole string backwards.
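To make the three parts of the slice syntax concrete, here is a small sketch in Python 3 syntax (the article itself uses Python 2). The variable names are chosen only for this illustration:

```python
text = "stressed"

# start:end picks the half-open index range [start, end).
first_three = text[0:3]

# A step of 2 keeps every second character, starting at index 0.
every_second = text[::2]

# A negative step walks backwards; with start and end omitted,
# the slice covers the whole string in reverse.
reversed_text = text[::-1]

print(first_three, every_second, reversed_text)
```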

01. "パタトクカシーー"

Take the 1st, 3rd, 5th, and 7th characters of the string "パタトクカシーー" and get the concatenated string.

python


text = u"パタトクカシーー"
text_concat = text[0] + text[2] + text[4] + text[6]
print text_concat
#>>>パトカー

I have given the string as unicode.
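The reason the unicode string matters: in Python 2 a plain "..." literal is a byte string, so indexing UTF-8 encoded Japanese text picks out individual bytes rather than characters. The same distinction can be observed in Python 3 by comparing a str with its encoded bytes. A small sketch, assuming the original Japanese string of the problem, "パタトクカシーー", and UTF-8 encoding:

```python
text = "パタトクカシーー"      # text (unicode) string
raw = text.encode("utf-8")    # the same data as a byte string

# The text string counts characters, the byte string counts bytes.
print(len(text), len(raw))

# Indexing the text string yields whole characters.
picked = text[0] + text[2] + text[4] + text[6]
print(picked)
```

Each of these katakana characters occupies three bytes in UTF-8, which is why the two lengths differ.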

python


text = u"パタトクカシーー"
text_concat = text[::2]
print text_concat
#>>>パトカー

I see, the same thing can be done with a slice.

02. "パトカー" + "タクシー" = "パタトクカシーー"

Obtain the string "パタトクカシーー" by alternately connecting the characters of "パトカー" (police car) and "タクシー" (taxi) from the beginning.

python


text1 = u"パトカー"
text2 = u"タクシー"
text_concat = ""
m = len(text1)
n = len(text2)
for i in xrange(m):
    if i<n:
        text_concat += text1[i] + text2[i]
        if i == m-1:
            text_concat += text2[i+1:]
    else:
        text_concat += text1[i:]
        break

print text_concat
#>>>パタトクカシーー

Walk through the two strings from the beginning. The problem does not require it, but I also considered the case where text1 and text2 differ in length (m != n): when one string runs out, the remainder of the other is appended as-is.
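The unequal-length handling described above can also be expressed with itertools.zip_longest, which pads the shorter input. This is only a sketch in Python 3 syntax (in Python 2 the function is named izip_longest), and the name interleave is hypothetical:

```python
from itertools import zip_longest


def interleave(text1, text2):
    # zip_longest pairs characters position by position and pads the
    # shorter string with fillvalue, so the leftover tail of the longer
    # string is appended unchanged.
    pairs = zip_longest(text1, text2, fillvalue="")
    return "".join(a + b for a, b in pairs)


print(interleave("パトカー", "タクシー"))
print(interleave("abc", "12345"))
```

Because the padding is symmetric, this also covers an empty first string, a case that a loop over text1 alone never enters.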

03. Pi

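Break down the sentence "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics." into words, and create a list of the number of (alphabetic) characters in each word, in order of appearance.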

python


sentence = "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."

word_length = [len(x.strip(',.')) for x in sentence.split()]
print word_length
#>>> [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]

- List comprehension

I like Python because things like this can be written neatly in one line.
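The title of the problem becomes clear once the word lengths are read as digits: they spell out the leading digits of pi. A short sketch in Python 3 syntax (the article itself uses Python 2) that joins the lengths into one digit string:

```python
sentence = ("Now I need a drink, alcoholic of course, "
            "after the heavy lectures involving quantum mechanics.")

# Strip the surrounding punctuation, then count the letters of each word.
word_length = [len(word.strip(",.")) for word in sentence.split()]

# Concatenate the counts into a single digit string.
digits = "".join(str(n) for n in word_length)
print(digits)
```

Reading the result with a decimal point after the first digit gives 3.14159265358979.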

04. Element symbol

Break down the sentence "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can." into words. Take the first character of the 1st, 5th, 6th, 7th, 8th, 9th, 15th, 16th, and 19th words and the first two characters of all other words, and create an associative array (dictionary type or map type) from each extracted string to the position of its word (its ordinal number counted from the beginning).

python


sentence = "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can."

word_list = [x.strip(',.') for x in sentence.split()]
word_dict = dict()
specified = [1, 5, 6, 7, 8, 9, 15, 16, 19]
for i, word in enumerate(word_list):
    if i in [x-1 for x in specified]:
        word_dict[word[:1]] = i+1
    else:
        word_dict[word[:2]] = i+1

print word_dict
#>>> {'Be': 4, 'C': 6, 'B': 5, 'Ca': 20, 'F': 9, 'S': 16, 'H': 1, 'K': 19, 'Al': 13, 'Mi': 12, 'Ne': 10, 'O': 8, 'Li': 3, 'P': 15, 'Si': 14, 'Ar': 18, 'Na': 11, 'N': 7, 'Cl': 17, 'He': 2}
print word_dict['Be']
#>>> 4

I store i + 1 in word_dict so that the values match the 1-based positions ("1st", "5th", ...) given in the problem. I only noticed after solving it, but what comes out is the atomic number, that is, the number of protons, of each element. It brings back memories of the "Suihei Riibe boku no fune" mnemonic for the periodic table.

Wait ... isn't magnesium 'Mg'?

python


'Mi': 12
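The same mapping can be built with a dict comprehension, using a set for the positions and enumerate starting at 1 so that no i + 1 adjustment is needed. This is only a sketch in Python 3 syntax (the article itself uses Python 2), and the name element_map is made up for illustration:

```python
def element_map(sentence, single_positions):
    # A set gives fast membership tests for the 1-based positions.
    single = set(single_positions)
    words = [w.strip(",.") for w in sentence.split()]
    # enumerate(..., 1) numbers the words from 1, matching the problem text.
    return {
        (word[:1] if pos in single else word[:2]): pos
        for pos, word in enumerate(words, 1)
    }


sentence = ("Hi He Lied Because Boron Could Not Oxidize Fluorine. "
            "New Nations Might Also Sign Peace Security Clause. "
            "Arthur King Can.")
symbols = element_map(sentence, [1, 5, 6, 7, 8, 9, 15, 16, 19])
print(symbols["Be"], symbols["Mi"])
```

The key 'Mi' still appears here, because it comes from the word "Might" in the given sentence rather than from the code.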

05. n-gram

Create a function that builds n-grams from a given sequence (string, list, etc.). Use this function to get the word bi-grams and the character bi-grams from the sentence "I am an NLPer".

python


def n_gram(sentence_str, n, type):
    result = set()
    if type == 'word':
        words = [x.strip(',.') for x in sentence_str.split()]
    elif type == 'letter':
        words = sentence_str
    m = len(words)
    for i in xrange(m-n+1):
        result.add(tuple(words[i:i+n]))
    return result

#If you also want frequency information
from collections import defaultdict
def n_gram_freq(sentence_str, n, type):
    result = defaultdict(int)
    if type == 'word':
        words = [x.strip(',.') for x in sentence_str.split()]
    elif type == 'letter':
        words = sentence_str
    m = len(words)
    for i in xrange(m-n+1):
        result[tuple(words[i:i+n])] += 1
    return result


sentence_str = "I am an NLPer"
#sentence_list = ['I', 'am', 'an', 'NLPer']

print n_gram(sentence_str, 2, 'word')
#>>> set([('am', 'an'), ('an', 'NLPer'), ('I', 'am')])
print n_gram(sentence_str, 2, 'letter')
#>>> set([('N', 'L'), ('m', ' '), ('e', 'r'), ('a', 'n'), ('I', ' '), ('n', ' '), ('L', 'P'), (' ', 'N'), (' ', 'a'), ('a', 'm'), ('P', 'e')])


print n_gram_freq(sentence_str, 2, 'word')
#>>> defaultdict(<type 'int'>, {('am', 'an'): 1, ('an', 'NLPer'): 1, ('I', 'am'): 1})
print n_gram_freq(sentence_str, 2, 'letter')
#>>>defaultdict(<type 'int'>, {('N', 'L'): 1, ('m', ' '): 1, ('e', 'r'): 1, ('a', 'n'): 1, ('I', ' '): 1, ('n', ' '): 1, ('L', 'P'): 1, (' ', 'N'): 1, (' ', 'a'): 2, ('a', 'm'): 1, ('P', 'e'): 1})

The functions take a string as an argument. A separate argument, type, selects between word n-grams ('word') and character n-grams ('letter').
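Since the problem asks for a function over any sequence, an alternative is to let the caller decide what the sequence is: pass a list of words for word n-grams or the string itself for character n-grams. A minimal sketch in Python 3 syntax (the article itself uses Python 2); n_gram_seq is a hypothetical name, and unlike the set-based function above it returns a list that keeps order and duplicates:

```python
def n_gram_seq(seq, n):
    # Works on any sliceable sequence: a string gives character n-grams,
    # a list of words gives word n-grams.
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


sentence = "I am an NLPer"
print(n_gram_seq(sentence.split(), 2))  # word bi-grams
print(n_gram_seq(sentence, 2))          # character bi-grams
```

Wrapping the result in set() or collections.Counter() would recover the set and frequency variants.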

06. Sets

Let X and Y be the sets of character bi-grams contained in "paraparaparadise" and "paragraph", respectively, and find the union, intersection, and difference of X and Y. In addition, find out whether the bi-gram 'se' is included in X and in Y.

python


str_x = "paraparaparadise"
str_y = "paragraph"

X = n_gram(str_x, 2, 'letter')
Y = n_gram(str_y, 2, 'letter')

print X.union(Y) #Union. Same as X | Y
#>>>set([('g', 'r'), ('p', 'h'), ('p', 'a'), ('s', 'e'), ('a', 'p'), ('a', 'g'), ('a', 'd'), ('i', 's'), ('r', 'a'), ('a', 'r'), ('d', 'i')])

print X.intersection(Y) #Intersection. Same as X & Y
#>>>set([('a', 'p'), ('r', 'a'), ('p', 'a'), ('a', 'r')])

print X.difference(Y) #Difference. Same as X - Y
#>>>set([('a', 'd'), ('s', 'e'), ('d', 'i'), ('i', 's')])

print tuple('se') in X
#>>> True
print tuple('se') in Y
#>>> False

I reused the function defined in the previous problem. Because that function stores each n-gram as a tuple, the given 'se' is converted to a tuple before testing membership.
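The set operations can equally be written with the operators |, & and -. The sketch below is in Python 3 syntax (the article itself uses Python 2) and defines its own small helper, char_bigrams, a hypothetical name, so that it runs on its own:

```python
def char_bigrams(text):
    # Set of character bi-grams, each stored as a 2-tuple.
    return {tuple(text[i:i + 2]) for i in range(len(text) - 1)}


X = char_bigrams("paraparaparadise")
Y = char_bigrams("paragraph")

print(X | Y)  # union
print(X & Y)  # intersection
print(X - Y)  # difference: in X but not in Y
print(Y - X)  # difference: in Y but not in X

print(("s", "e") in X, ("s", "e") in Y)
```

Note that the difference is not symmetric: X - Y and Y - X are different sets.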

07. Sentence generation by template

Implement a function that takes arguments x, y, z and returns the string "y at x is z". Furthermore, set x = 12, y = "temperature", z = 22.4, and check the execution result.

python


from string import Template

def generate_sentence(x,y,z):
    t = Template("${y_tmpl} at ${x_tmpl} is ${z_tmpl}")
    return t.safe_substitute(x_tmpl=x, y_tmpl=y, z_tmpl=z)

def generate_sentence_incomplete(x,y,z):
    # z_tmpl is left out on purpose to show what safe_substitute does
    t = Template("${y_tmpl} at ${x_tmpl} is ${z_tmpl}")
    return t.safe_substitute(x_tmpl=x, y_tmpl=y)

x, y, z = 12, "temperature", 22.4
print generate_sentence(x,y,z)
#>>> temperature at 12 is 22.4
print generate_sentence_incomplete(x,y,z)
#>>> temperature at 12 is ${z_tmpl}

I used Template.safe_substitute(), so a missing key leaves the placeholder in place instead of raising an error. From http://docs.python.jp/2/library/string.html#string.Template.safe_substitute:

Like substitute(), except that if placeholders are missing from mapping and kws, instead of raising a KeyError exception, the original placeholder will appear in the resulting string intact.
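The difference between the two methods can be checked directly. A small sketch in Python 3 syntax (the article itself uses Python 2), using an English template string chosen for this illustration:

```python
from string import Template

t = Template("${y_tmpl} at ${x_tmpl} is ${z_tmpl}")

# safe_substitute leaves the unknown placeholder in the output.
partial = t.safe_substitute(x_tmpl=12, y_tmpl="temperature")
print(partial)

# substitute raises KeyError for the same missing key.
try:
    t.substitute(x_tmpl=12, y_tmpl="temperature")
    missing_key = None
except KeyError as err:
    missing_key = err.args[0]
print(missing_key)
```

substitute() is the stricter choice when a missing value should be treated as a bug; safe_substitute() suits templates that are filled in over several steps.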

08. Ciphertext

Implement a function cipher that converts each character of the given string according to the following specification:

- If the character is a lowercase letter, replace it with the character whose code is (219 - character code)
- Output all other characters as they are

Use this function to encrypt and decrypt an English message.

python



def cipher(input_str):
    result = ""
    for letter in input_str:
        #If lowercase letters
        if letter.isalpha() and letter.islower():
            result += chr(219-ord(letter))
        else:
            result += letter
    return result

english_message = "This is a pen."

#encryption
print cipher(english_message)
#>>> Tsrh rh z kvm.

#Decryption
print cipher(cipher(english_message))
#>>> This is a pen.

I did not know about it before, but this is apparently called the Atbash cipher. From http://www.mitsubishielectric.co.jp/security/learn/info/misty/stage1.html:

Cryptography was also used in the Old Testament. One example is Atbash, a Hebrew substitution cipher. This cipher numbers the letters and swaps the position counted from the beginning with the position counted from the end. For the 26 letters of the alphabet, A is replaced with Z, B with Y, and so on.

That's why encryption and decryption can be achieved with the same function. (You can get the original string by applying the same function twice).

The built-in functions and string methods used are listed here.

Built-in functions


str.isalpha()
str.islower()
ord()
chr()
unichr()

# http://docs.python.jp/2.6/library/functions.html#ord
# http://docs.python.jp/2.6/library/functions.html#chr
# http://docs.python.jp/2.6/library/functions.html#unichr
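The constant 219 in the cipher above is ord('a') + ord('z'), which is why subtracting a lowercase letter's code from it mirrors the alphabet. The same mapping can be expressed as a translation table. This is only a sketch in Python 3 syntax, where str.maketrans is available (the article itself uses Python 2); ATBASH_TABLE and cipher_translate are hypothetical names:

```python
import string

# Map a..z onto z..a; every other character is left untouched.
ATBASH_TABLE = str.maketrans(string.ascii_lowercase,
                             string.ascii_lowercase[::-1])


def cipher_translate(input_str):
    return input_str.translate(ATBASH_TABLE)


message = "This is a pen."
encrypted = cipher_translate(message)
print(encrypted)
print(cipher_translate(encrypted))  # applying it twice restores the input
```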

09. Typoglycemia

Create a program that, for a sequence of words separated by spaces, keeps the first and last letters of each word and randomly rearranges the order of the remaining letters. However, words of length 4 or less are not rearranged. Give it a suitable English sentence (for example, "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind .") and check the execution result.

python


import random

def random_shuffled_str(input_str):
    input_list = list(input_str)
    random.shuffle(input_list)
    result = ''.join(input_list)
    return result

def typoglycemia(sentence):
    str_list = sentence.split()
    result_list = [x[0]+ random_shuffled_str(x[1:-1]) +x[-1] if len(x) > 4 else x for x in str_list]
    return ' '.join(result_list)

message = "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind ."

print typoglycemia(message)
#>>> I cunl'dot beveile that I cloud aallucty unsrndtead what I was rdaineg : the phaenmneol pweor of the huamn mind .

print typoglycemia(message)
#>>> I cu'dolnt bivelee that I cloud aculltay udetnnasrd what I was ridneag : the pheonaenml peowr of the hamun mind .

When you want to use else in a list comprehension, the condition

python


if len(x) > 4 else x

goes in this position, before the for. As for random.shuffle:

- It modifies the list (or similar) given as an argument in place.
- So I defined a helper function that returns the shuffled result, so that it can be used inside the list comprehension.

http://docs.python.jp/2/library/random.html

random.shuffle(x[, random]) Shuffle the sequence x in place. The optional argument random is a 0-argument function returning a random float in [0.0, 1.0); by default, this is the function random().

Note that for even rather small len(x), the total number of permutations of x is larger than the period of most random number generators; this implies that most permutations of a long sequence can never be generated.
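Because random.shuffle works in place, a helper is needed to use it inside an expression. One alternative is random.sample, which returns a new list and leaves its input alone. The following is a sketch in Python 3 syntax (the article itself uses Python 2); typoglycemia_sample and its rng parameter are made up for illustration:

```python
import random


def typoglycemia_sample(sentence, rng=random):
    result = []
    for word in sentence.split():
        if len(word) > 4:
            middle = list(word[1:-1])
            # sample() with k == len(middle) returns a shuffled copy.
            shuffled = "".join(rng.sample(middle, len(middle)))
            word = word[0] + shuffled + word[-1]
        result.append(word)
    return " ".join(result)


print(typoglycemia_sample("I could actually understand what I was reading"))
```

Passing a seeded random.Random instance as rng makes the output repeatable, which can help when checking the result by hand.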
