[PYTHON] 100 language processing knocks 03 ~ 05

03. Pi: Break the sentence "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics." into words, and create a list of the number of alphabetic characters in each word, in order of appearance.

`nlp03.py`


#!/usr/bin/env python
from collections import Counter
str = 'Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics'
li = []
count = Counter(map(len,str.split())).most_common()
for i in range(len(count)):
    li.append(count[i][0])
print(li)

Execution result [9, 1, 3, 5, 7, 2, 4, 6, 8]

I didn't know how to implement it without using a for loop.
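
One way to avoid the index-based loop, as a rough sketch: `most_common()` returns `(length, count)` pairs, so the lengths can be pulled out with a list comprehension or, with no `for` at all, by unzipping the pairs with `zip(*pairs)`. The names `sentence`, `pairs`, and `lengths_by_frequency` are chosen here for illustration, and the relative order of lengths that occur equally often may vary between Python versions.

```python
from collections import Counter

sentence = 'Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics'

# most_common() yields (length, count) pairs, most frequent first
pairs = Counter(map(len, sentence.split())).most_common()

# Option 1: keep only the length from each pair with a comprehension
lengths_by_frequency = [length for length, _count in pairs]

# Option 2: unzip the pairs into two tuples, with no loop syntax at all
lengths, counts = zip(*pairs)

print(lengths_by_frequency)
print(list(lengths))
```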

03. Pi (revised): I have corrected the mistakes that were pointed out. Thank you for the advice. I had misread the problem and written a program that outputs word lengths in order of how often they appear. I had also failed to remove "," and "." from the sentence.

`nlp03re.py`


#!/usr/bin/env python
seq = "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."
# Remove commas and periods so they are not counted as characters
seq = seq.replace(",", "").replace(".", "")
words = seq.split()
count = []
for i in words:
    count.append(len(i))
print(count)

Execution result [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]

The result is the digits of pi. I suspect there is a better way to write the part that strips "," and ".", though.
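
Two possible tidier ways to drop the punctuation, offered as a sketch rather than the definitive answer: strip punctuation from the ends of each word with `str.strip` and `string.punctuation`, or extract runs of letters directly with `re.findall`. The names `lengths_strip` and `lengths_regex` are illustrative.

```python
import re
import string

seq = "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."

# Option 1: strip leading/trailing punctuation from every word
lengths_strip = [len(word.strip(string.punctuation)) for word in seq.split()]

# Option 2: match runs of ASCII letters, ignoring everything else
lengths_regex = [len(word) for word in re.findall(r"[A-Za-z]+", seq)]

print(lengths_strip)
print(lengths_regex)
```

Both options handle any punctuation mark, not only the two that happen to appear in this sentence.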

04. Element symbols: Break the sentence "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can." into words. For words 1, 5, 6, 7, 8, 9, 15, 16, and 19, take the first character; for all other words, take the first two characters. Then create an associative array (dictionary or map type) from each extracted string to the word's position (its ordinal number counted from the beginning of the sentence).

`nlp04.py`


#!/usr/bin/env python
str = "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can."
number = [1, 5, 6, 7, 8, 9, 15, 16, 19]
dict = {}
strsp = str.split()
for i in range(len(strsp)):
    word = strsp[i]
    if i in number:
        dict[word[0:2]] = i
    else:
        dict[word[0:1]] = i
print(dict)

Execution result {'A': 17, 'B': 4, 'Co': 5, 'No': 6, 'H': 0, 'K': 18, 'Cl': 16, 'M': 11, 'L': 2, 'Ne': 9, 'P': 14, 'S': 13, 'Ox': 7, 'N': 10, 'Fl': 8, 'Ca': 19, 'Se': 15, 'He': 1}

04. Element symbols (revised): I have corrected the parts that were pointed out. As in the previous problem, I had forgotten to remove "," and ".", and I had numbered the words starting from 0. In addition, I fixed the part where the dict value, which has to be the word's position, had been set to the length of the word.

`nlp04.py`


#!/usr/bin/env python
sentence = "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can."
# 1-based positions of the words that contribute only their first character
number = [1, 5, 6, 7, 8, 9, 15, 16, 19]
symbols = {}
# Replace punctuation with spaces before splitting into words
words = sentence.replace(",", " ").replace(".", " ").split()
for i, word in enumerate(words, 1):
    if i in number:
        symbols[word[0:1]] = i
    else:
        symbols[word[0:2]] = i
print(symbols)

Execution result {'Be': 4, 'C': 6, 'B': 5, 'Ca': 20, 'F': 9, 'S': 16, 'H': 1, 'K': 19, 'Al': 13, 'Mi': 12, 'Ne': 10, 'O': 8, 'Li': 3, 'P': 15, 'Si': 14, 'Ar': 18, 'Na': 11, 'N': 7, 'Cl': 17, 'He': 2}
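
The same mapping can also be built with a dictionary comprehension. This is only a sketch: `sentence`, `single_char_positions`, and `symbols` are names chosen here for illustration, `re.findall` stands in for the chained `replace` calls, and a set is used for the position lookup.

```python
import re

sentence = "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can."

# 1-based positions of words that contribute only their first character
single_char_positions = {1, 5, 6, 7, 8, 9, 15, 16, 19}

# Extract alphabetic words, which drops the punctuation
words = re.findall(r"[A-Za-z]+", sentence)

# enumerate(words, 1) numbers the words starting from 1
symbols = {
    (word[:1] if i in single_char_positions else word[:2]): i
    for i, word in enumerate(words, 1)
}
print(symbols)
```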

05. n-gram: Create a function that builds n-grams from a given sequence (string, list, etc.). Use this function to obtain the word bi-grams and the character bi-grams of the sentence "I am an NLPer".

`nlp05.py`


#!/usr/bin/env python
def word_ngram(n, seq):
    # Split once, then slide a window of n words over the list
    words = seq.split()
    li = []
    for i in range(len(words) + 1 - n):
        li.append(words[i:i+n])
    return li
def char_ngram(n, seq):
    # Stop at len(seq) + 1 - n so that only complete n-grams are produced
    li = []
    for i in range(len(seq) + 1 - n):
        li.append(seq[i:i+n])
    return li
sentence = "I am an NLPer"
print(word_ngram(2, sentence))
print(char_ngram(2, sentence))

Execution result [['I', 'am'], ['am', 'an'], ['an', 'NLPer']] ['I ', ' a', 'am', 'm ', ' a', 'an', 'n ', ' N', 'NL', 'LP', 'Pe', 'er']

The character bigram considers a space as one character.
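
Because slicing works the same way on strings and lists, a single function can cover both cases, which is closer to what the problem statement asks for. A minimal sketch, where `ngram` is a name chosen here for illustration; it keeps only complete n-grams, so no shorter trailing piece is emitted.

```python
def ngram(n, seq):
    """Return all contiguous n-item slices of seq (a string, list, or tuple)."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

sentence = "I am an NLPer"

# Passing a list of words gives word n-grams
word_bigrams = ngram(2, sentence.split())

# Passing the string itself gives character n-grams (spaces included)
char_bigrams = ngram(2, sentence)

print(word_bigrams)
print(char_bigrams)
```

If `n` is larger than the sequence, the range is empty and the function returns an empty list.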