This is a continuation of the previous post.
A Python novice tries the 100 Language Processing Knocks, 00-04 https://qiita.com/earlgrey914/items/fe1d326880af83d37b22
The next post is here: A Python novice tries the 100 Language Processing Knocks, 07-09 https://qiita.com/earlgrey914/items/a7b6781037bc0844744b
What is an n-gram, I wonder ... I've heard of it somewhere. I can't solve this problem unless I understand what an n-gram is! Let's start there!!
~ 2 minutes of googling ~
An N-gram is a way of splitting natural-language text into runs of N consecutive characters or N consecutive words.
Reference
https://www.pytry3g.com/entry/N-gram
~~I see. So if 1 is passed, the text is split one character at a time, and if 2 is passed, two characters at a time.~~
~~I guess a word bi-gram splits word by word, and a character bi-gram splits every two characters.~~
~~So the answer is~~
~~■ Word bi-gram~~
~~["I", "am", "an", "NL", "Pe", "r"]~~
~~■ Character bi-gram~~
~~["I ","ma","an","NL","Pe","r"]~~
~~Is this what I should output? Sorry if I'm wrong. I'll solve it on this premise.~~
Since I was just plain wrong, I peeked at the output of a model answer.
[['I', 'am'], ['am', 'an'], ['an', 'NLPer']]
['I ', ' a', 'am', 'm ', ' a', 'an', 'n ', ' N', 'NL', 'LP', 'Pe', 'er']
Output like this seems to be what is wanted. I see.
OK. For now, I've made the word bigram.
enshu05.py
s = "I am an NLPer"
tango_bigram= []
def bigram(s):
    counter = 0
    list = s.split()
    for i in list:
        if counter < len(list)-1:
            tango_bigram.extend([[list[counter],list[counter+1]]])
            counter += 1
    return tango_bigram
print(bigram(s))
[['I', 'am'], ['am', 'an'], ['an', 'NLPer']]
As you may have noticed, I'm starting to see various things that are bad about the way I write code.
--The variable naming is too sloppy. English and Japanese are mixed, and some variables are single characters such as s and i while others are whole words such as counter.
--Here I wrote tango_bigram in snake case, but earlier (Exercise 04) I wrote ichimoziList in camel case, so it's inconsistent.
--My rule for line breaks is a mystery. So is my rule for where to put spaces.
I want to fix this eventually, but for now I'm turning a blind eye. I'm only writing for myself, after all. Well, sooner or later I'll be the one who has to fix the code I wrote, thinking "I can't read this."
In the previous exercise I used append() to add to the list, but here I used extend().
If you want to add multiple elements to a list at once, you can use extend(). There also seems to be a notation that uses +=, such as l += [1, 2, 3], but my impression is that extend() is easier to understand.
Reference URL
https://qiita.com/tag1216/items/416314cc75a099ad6149
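To make the difference concrete, here is a minimal sketch comparing append(), extend() and += on small sample lists. The variable names `pairs` and `flat` are made up for illustration and do not appear in the exercise code.

```python
# Three ways to add items to a list (illustrative values only).
pairs = []

pairs.append(["I", "am"])        # adds ONE item: the inner list itself
pairs.extend([["am", "an"]])     # adds EACH item of the iterable (here, one inner list)
pairs += [["an", "NLPer"]]       # in-place concatenation, same effect as extend()

print(pairs)  # [['I', 'am'], ['am', 'an'], ['an', 'NLPer']]

# Passing a flat list to extend() unpacks it into separate items.
flat = []
flat.extend(["I", "am"])
print(flat)  # ['I', 'am']
```

Seen this way, wrapping a single pair in an extra list and passing it to extend() ends up doing the same thing as append() with the bare pair, so either spelling would probably work for the word bigram.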
So I wrote the character bigram in much the same way.
enshu05.py
s = "I am an NLPer"
tango_bigram= []
moji_bigram = []
def bigram(s):
    tango_counter = 0
    moji_counter = 0
    # Word bigram processing
    list = s.split()
    for i in list:
        if tango_counter < len(list)-1:
            tango_bigram.extend([[list[tango_counter],list[tango_counter+1]]])
            tango_counter += 1
    # Character bigram processing
    for i in s:
        if moji_counter < len(s)-1:
            moji_bigram.append(s[moji_counter] + s[moji_counter+1])
            moji_counter += 1
    return tango_bigram,moji_bigram
print(bigram(s))
([['I', 'am'], ['am', 'an'], ['an', 'NLPer']], ['I ', ' a', 'am', 'm ', ' a', 'an', 'n ', ' N', 'NL', 'LP', 'Pe', 'er'])
The readability is garbage!!! Hmm. **I'm not used to Python's rule that a function has to be defined above the code that calls it ...**
**Personally, my impression is that Python, being dynamically typed and using indentation to delimit blocks, is easy to write but hard to read.**
Maybe that's because I'm used to block delimiters using {} in a statically typed language like Java ...
Then again, Java also becomes significantly less readable if it isn't indented properly.
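On the ordering complaint: as far as I can tell, Python only needs a function to exist by the time the line calling it actually runs. A common workaround is to put the top-level flow in a function of its own and call it at the very bottom. The sketch below uses hypothetical names (`main`, `char_bigram`) that are not part of the exercise code.

```python
# The only top-level call is at the bottom, so the file can still be read
# top-down starting from main(). All names here are hypothetical.
def main():
    # char_bigram is looked up when main() RUNS, not when main() is defined,
    # so it is fine that its definition appears further down the file.
    return char_bigram("NLPer")


def char_bigram(text):
    # Every pair of adjacent characters in text.
    return [text[i:i + 2] for i in range(len(text) - 1)]


result = main()
print(result)  # ['NL', 'LP', 'Pe', 'er']
```

Moving the `result = main()` line above the two definitions would raise a NameError, which is the behaviour that feels unfamiliar when coming from Java.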
Is there any way to give a good variable name? When I googled it, there was an article like this.
Reference URL
https://qiita.com/Ted-HM/items/7dde25dcffae4cdc7923
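Putting the naming and readability complaints together, here is one possible tidy-up of the Exercise 05 logic. It is only a sketch: the function name `n_gram` and its parameter `n` are my own choices, and it swaps the counter loop for slicing, which works on both strings and lists.

```python
def n_gram(sequence, n=2):
    """Return every run of n consecutive items from a string or a list."""
    # The last valid start index is len(sequence) - n, hence the + 1.
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


sentence = "I am an NLPer"
word_bigrams = n_gram(sentence.split())  # windows over the list of words
char_bigrams = n_gram(sentence)          # windows over the characters

print(word_bigrams)
print(char_bigrams)
```

Because the result list is created inside the function, calling it twice does not accumulate results from the previous call the way a module-level list would.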
**Somehow the Japanese is difficult.** I get it, more or less. Deciphering the Japanese of the problem statement is harder than the programming.
The moment I saw this problem, I thought, "What? Sets? Isn't there a library I can import that does this kind of calculation?" Surely there's a library for the N-grams from earlier, too.
I believe that **the only things you have to build yourself are the things only you can come up with**, so if someone has already made something, you should use it.
However, this time the purpose is **learning**, so I will make it myself.
It's easy to get the bigrams of the two strings by tweaking the bigram function from Exercise 05.
(The scoping of the bigram function and moji_bigram was off, so I've fixed it.)
para.py
str_paradise = "paraparaparadise"
str_paragraph = "paragraph"
def bigram(s):
    moji_bigram = []
    moji_counter = 0
    # Character bigram processing
    for i in s:
        if moji_counter < len(s)-1:
            moji_bigram.append(s[moji_counter]+s[moji_counter+1])
            moji_counter += 1
    return moji_bigram
print(bigram(str_paradise))
print(bigram(str_paragraph))
['pa', 'ar', 'ra', 'ap', 'pa', 'ar', 'ra', 'ap', 'pa', 'ar', 'ra', 'ad', 'di', 'is', 'se']
['pa', 'ar', 'ra', 'ag', 'gr', 'ra', 'ap', 'ph']
So, how do you do the set operations? A quick google for something like "Python set calculation" suggests using the set type instead of the list type.
What is the set type? Apparently:
・ No duplicate elements
・ Elements are unordered
Perfect.
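Those two properties can be seen in a few lines. This is just an illustrative sketch with a short sample list; the names `bigrams` and `unique` are made up.

```python
bigrams = ["pa", "ar", "ra", "ap", "pa", "ar"]  # sample list with repeats

unique = set(bigrams)
print(len(bigrams), len(unique))  # 6 4

# Order is not part of a set's value, so these compare equal.
print(unique == {"ap", "ra", "ar", "pa"})  # True

# A set cannot be indexed; sort it when a stable order is needed.
print(sorted(unique))  # ['ap', 'ar', 'pa', 'ra']
```

One thing to watch out for: an empty set is written set(), because {} creates an empty dict.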
With that, it was finished in no time.
enshu06.py
str_paradise = "paraparaparadise"
str_paragraph = "paragraph"
# A function that returns a list of character bigrams
def bigram(s):
    moji_bigram = []
    moji_counter = 0
    for i in s:
        if moji_counter < len(s)-1:
            moji_bigram.append(s[moji_counter]+s[moji_counter+1])
            moji_counter += 1
    return moji_bigram
# A function that converts a list to a set
def listToSet(list):
    # set() drops duplicates; no need to pre-assign {} (which would be a dict)
    moji_bigram_set = set(list)
    return moji_bigram_set
# Create the lists of bigrams
str_paradise_list = bigram(str_paradise)
str_paragraph_list = bigram(str_paragraph)
# Convert the bigram lists to sets, removing duplicates
paradise_set_X = listToSet(str_paradise_list)
paragraph_set_Y = listToSet(str_paragraph_list)
print("paradise_set_X")
print(paradise_set_X)
print("paragraph_set_Y")
print(paragraph_set_Y)
print("Union")
print(paradise_set_X | paragraph_set_Y)
print("Intersection")
print(paradise_set_X & paragraph_set_Y)
print("Difference set")
print(paradise_set_X - paragraph_set_Y)
paradise_set_X
{'ap', 'ar', 'pa', 'di', 'is', 'ra', 'se', 'ad'}
paragraph_set_Y
{'ap', 'ar', 'pa', 'ph', 'ag', 'ra', 'gr'}
Union
{'ap', 'ar', 'gr', 'pa', 'di', 'ph', 'is', 'ag', 'ra', 'se', 'ad'}
Intersection
{'ra', 'pa', 'ap', 'ar'}
Difference set
{'is', 'di', 'se', 'ad'}
Yeah, that was easy. It's hard to check whether the answer is correct, though ...
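One way to gain some confidence without an answer key is to assert a few identities that must hold for any two sets. The following is only a sketch of that idea; `char_bigrams`, `X` and `Y` are hypothetical names, and the helper builds the set directly with a set comprehension instead of going through a list.

```python
def char_bigrams(text):
    # Set of every pair of adjacent characters in text.
    return {text[i:i + 2] for i in range(len(text) - 1)}


X = char_bigrams("paraparaparadise")
Y = char_bigrams("paragraph")

# The operators have method equivalents; both spellings must agree.
assert X | Y == X.union(Y)
assert X & Y == X.intersection(Y)
assert X - Y == X.difference(Y)

# Identities that hold for any two sets.
assert (X & Y) <= X and (X & Y) <= Y   # the intersection is a subset of both
assert (X - Y).isdisjoint(Y)           # the difference shares nothing with Y
assert (X - Y) | (X & Y) == X          # the two pieces rebuild X
assert len(X | Y) == len(X) + len(Y) - len(X & Y)

# Compare with the printed difference set, ignoring order.
assert X - Y == {"is", "di", "se", "ad"}
print("all checks passed")
```

Membership of a single bigram can be checked the same way with the `in` operator, for example `"se" in X`.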
To be continued tomorrow!!!!
Exercises 05 and 06 took 2 hours!!!! (important)