A Python beginner tries the 100 Language Processing Knock: 05-06

This is a continuation of the previous post.

A Python beginner tries the 100 Language Processing Knock: 00-04 https://qiita.com/earlgrey914/items/fe1d326880af83d37b22

The next installment is here: A Python beginner tries the 100 Language Processing Knock: 07-09 https://qiita.com/earlgrey914/items/a7b6781037bc0844744b


05. n-gram

Create a function that generates n-grams from a given sequence (a string, a list, etc.). Use this function to obtain the word bi-grams and character bi-grams of the sentence "I am an NLPer".

What even is an n-gram...? I feel like I've heard of it somewhere. I can't solve this problem without at least a basic understanding of n-grams, so let's start there!!

~ two minutes of googling later ~

An n-gram is a way of cutting natural-language text into chunks of N consecutive characters or N consecutive words.


Reference
https://www.pytry3g.com/entry/N-gram

~~I see.~~
~~So if you pass 1 it splits one character at a time, and if you pass 2 it splits two characters at a time.~~
~~A word bi-gram splits word by word,~~
~~and a character bi-gram should split every two characters, I guess.~~
~~So the answer is~~
~~■ Word bi-gram~~
~~["I", "am", "an", "NL", "Pe", "r"]~~
~~■ Character bi-gram~~
~~["I ","ma","an","NL","Pe","r"]~~
~~...is what I should output, right? Sorry if I'm wrong. I'll solve it on that assumption.~~

As expected, I was flat-out wrong, so I took a peek at the expected output of the model answer.

[['I', 'am'], ['am', 'an'], ['an', 'NLPer']]
['I ', ' a', 'am', 'm ', ' a', 'an', 'n ', ' N', 'NL', 'LP', 'Pe', 'er']

So it's fine as long as it comes out like this. I see.

Right. For now, I wrote the word bi-gram part.

enshu05.py


s = "I am an NLPer"

tango_bigram= []

def bigram(s):
    counter = 0
    list = s.split()

    for i in list:
        if counter < len(list)-1:
            tango_bigram.extend([[list[counter],list[counter+1]]])
            counter += 1
            
    return tango_bigram

print(bigram(s))
[['I', 'am'], ['am', 'an'], ['an', 'NLPer']]

As some of you may have noticed, at this point I started to see plenty of problems with how I write code.

- The variable naming is haphazard. English and Japanese are mixed, some variables are single characters like `s` and `i`, and others get names like `counter`.
- Here I wrote `tango_bigram` in snake case, but earlier (Exercise 04) I wrote `ichimoziList` in camel case, so it's inconsistent.
- My line-break rules are a mystery. My rules for where to put a half-width space are a mystery.

I want to fix these eventually, but for now I'm turning a blind eye. I'm the only one writing this code anyway. Then again, the code I write now is probably what I'll end up having to fix later, grumbling that "I can't read this."

In the previous exercise I used `append()` to add elements to a list, but here I used `extend()`. If you want to add multiple elements at once, `extend()` does the job. There also seems to be a `+=` notation, as in `l += [1, 2, 3]`, but my impression is that `extend()` is easier to read.


Reference URL
https://qiita.com/tag1216/items/416314cc75a099ad6149
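A quick toy comparison I ran to convince myself (my own scratch example, not from the reference above):

```python
l = [1]
l.append([2, 3])   # append() adds its argument as a single element
print(l)           # [1, [2, 3]]

l = [1]
l.extend([2, 3])   # extend() adds each element of the iterable
print(l)           # [1, 2, 3]

l = [1]
l += [2, 3]        # += on a list behaves like extend()
print(l)           # [1, 2, 3]
```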

So I wrote the character bi-gram in a similar way.

enshu05.py


s = "I am an NLPer"

tango_bigram= []
moji_bigram = []

def bigram(s):
    tango_counter = 0
    moji_counter = 0
    
    #Word gram processing
    list = s.split()
    for i in list:
        if tango_counter < len(list)-1:
            tango_bigram.extend([[list[tango_counter],list[tango_counter+1]]])
            tango_counter += 1

    #Character gram processing
    for i in s:
        if moji_counter < len(s)-1:
            moji_bigram.append(s[moji_counter] + s[moji_counter+1])
            moji_counter += 1
    return tango_bigram,moji_bigram

print(bigram(s))
([['I', 'am'], ['am', 'an'], ['an', 'NLPer']], ['I ', ' a', 'am', 'm ', ' a', 'an', 'n ', ' N', 'NL', 'LP', 'Pe', 'er'])

The readability is garbage!!! Oh well. **Is this Python's convention that a function has to be written above the code that calls it? I'm not used to that yet...**
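For example, module-level code runs top to bottom, so a call written above the def blows up (a tiny scratch example, unrelated to the exercise):

```python
# Running this file raises NameError: name 'greet' is not defined,
# because the call executes before the def statement has run.
greet()

def greet():
    print("hello")
```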

**Personally, with dynamic typing and blocks delimited by indentation, my impression of Python is that it's easy to write but hard to read.** Maybe that's just because I'm used to statically typed languages like Java, where blocks are delimited with {}... though Java also becomes noticeably harder to read when the indentation is sloppy.
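Incidentally, the problem statement asks for a function that works for any n and any sequence, while mine is hard-wired to bi-grams. A generic version using slicing might look something like this (just my own sketch, not the model answer):

```python
def ngram(seq, n):
    """Return the list of n-grams of a sequence (string, list, etc.)."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

s = "I am an NLPer"
print(ngram(s.split(), 2))  # word bi-grams: [['I', 'am'], ['am', 'an'], ['an', 'NLPer']]
print(ngram(s, 2))          # character bi-grams: ['I ', ' a', 'am', ...]
```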

A short break

Is there a good way to come up with variable names? When I googled it, I found an article like this.
Reference URL
https://qiita.com/Ted-HM/items/7dde25dcffae4cdc7923

06. Sets

Let X and Y be the sets of character bi-grams obtained from "paraparaparadise" and "paragraph", respectively, and obtain the union, intersection, and difference of X and Y. In addition, check whether the bi-gram 'se' is included in X and in Y.

**Somehow the Japanese is hard.** I more or less get what it's asking, but decoding the Japanese is harder than the programming.

The moment I saw this problem, I thought, "Huh? Sets? Isn't there a library I can import that does these calculations for me?" There's probably a library for the n-grams from earlier, too.

I believe that **the only things "you have to build yourself" are the things "only you could come up with"**, so if somebody has already built it, you should just use it.

However, this time the purpose is **learning**, so I will make it myself.

It's easy to get the character bi-grams of the two strings by tweaking the bigram function from Exercise 05. (The scoping of the bigram function and moji_bigram was sloppy, so I've fixed that here.)

para.py


str_paradise = "paraparaparadise"
str_paragraph = "paragraph"

def bigram(s):

    moji_bigram = []
    moji_counter = 0

    #Character gram processing
    for i in s:
        if moji_counter < len(s)-1:
            moji_bigram.append(s[moji_counter]+s[moji_counter+1])
            moji_counter += 1

    return moji_bigram

print(bigram(str_paradise))
print(bigram(str_paragraph))

['pa', 'ar', 'ra', 'ap', 'pa', 'ar', 'ra', 'ap', 'pa', 'ar', 'ra', 'ad', 'di', 'is', 'se']
['pa', 'ar', 'ra', 'ag', 'gr', 'ra', 'ap', 'ph']

So, how do I compute these set operations? Googling something like "Python set operations" suggests using the set type instead of a list.

https://note.nkmk.me/python-set/

What is a set type, you ask?

・No duplicate elements
・Elements have no order

Apparently. Perfect for this. https://note.nkmk.me/python-set/
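A quick check in the interpreter to see both properties (my own scratch example; the variable names are made up):

```python
bigrams = ['pa', 'ar', 'ra', 'ap', 'pa', 'ar', 'ra', 'ap']
s = set(bigrams)
print(s)          # duplicates removed, order not guaranteed, e.g. {'ra', 'ap', 'pa', 'ar'}
print('pa' in s)  # True -- membership tests are easy too
```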

It came together in no time.

enshu06.py


str_paradise = "paraparaparadise"
str_paragraph = "paragraph"

#A function that returns a list of characters bigram
def bigram(s):
    moji_bigram = []
    moji_counter = 0

    for i in s:
        if moji_counter < len(s)-1:
            moji_bigram.append(s[moji_counter]+s[moji_counter+1])
            moji_counter += 1

    return moji_bigram

#A function that converts a list to a set
def listToSet(list):
    moji_bigram_set = {}
    moji_bigram_set = set(list)
    return moji_bigram_set

#Create a list of bigram
str_paradise_list = bigram(str_paradise)
str_paragraph_list = bigram(str_paragraph)

#Convert bigram list to set and remove duplicates
paradise_set_X = listToSet(str_paradise_list)
paragraph_set_Y = listToSet(str_paragraph_list)

print("paradise_set_X")
print(paradise_set_X)
print("paragraph_set_Y")
print(paragraph_set_Y)

print("Union")
print(paradise_set_X | paragraph_set_Y)

print("Intersection")
print(paradise_set_X & paragraph_set_Y)

print("Difference set")
print(paradise_set_X - paragraph_set_Y)
paradise_set_X
{'ap', 'ar', 'pa', 'di', 'is', 'ra', 'se', 'ad'}
paragraph_set_Y
{'ap', 'ar', 'pa', 'ph', 'ag', 'ra', 'gr'}
Union
{'ap', 'ar', 'gr', 'pa', 'di', 'ph', 'is', 'ag', 'ra', 'se', 'ad'}
Intersection
{'ra', 'pa', 'ap', 'ar'}
Difference set
{'is', 'di', 'se', 'ad'}
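The problem also asks whether the bi-gram 'se' is included in X and Y. That check isn't in the script above, but with sets it's just the `in` operator (a minimal sketch using the same variables):

```python
print('se' in paradise_set_X)   # True  -- "se" comes from the end of "paradise"
print('se' in paragraph_set_Y)  # False
```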

Yeah, that was easy. It's hard to check my work without an official answer, though...

To be continued tomorrow!!!!

It took 2 hours to get through 05 and 06!!!!!!!!!!!! (This part is important.)
