[PYTHON] You will be an engineer in 100 days --Day 68 --Programming --About TF-IDF

Click here for the posts up to yesterday

You will be an engineer in 100 days --Day 66 --Programming --About natural language processing

You will be an engineer in 100 days --Day 63 --Programming --Probability 1

You will be an engineer in 100 days --Day 59 --Programming --Algorithms

You will be an engineer in 100 days --Day 53 --Git --About Git

You will be an engineer in 100 days --Day 42 --Cloud --About cloud services

You will be an engineer in 100 days --Day 36 --Database --About the database

You will be an engineer in 100 days --Day 24 --Python --Basics of Python language 1

You will be an engineer in 100 days --Day 18 --Javascript --JavaScript basics 1

You will be an engineer in 100 days --Day 14 --CSS --CSS Basics 1

You will be an engineer in 100 days --Day 6 --HTML --HTML basics 1

This time, the topic is TF-IDF.

What is TF-IDF?

TF-IDF stands for term frequency-inverse document frequency. It is the product of TF (Term Frequency), how often a word occurs, and IDF (Inverse Document Frequency), how rare the word is.

TF (frequency of occurrence of the specified word in a document):
$$\mathrm{TF} = \frac{\text{number of occurrences of the specified word in the document}}{\text{number of occurrences of all words in the document}}$$

IDF (inverse document frequency, the rarity of the specified word):
$$\mathrm{IDF} = \log \frac{\text{total number of documents}}{\text{number of documents containing the specified word}}$$

TF-IDF (term frequency-inverse document frequency):
$$\mathrm{TFIDF} = \mathrm{TF} \times \mathrm{IDF}$$

reference: https://ja.wikipedia.org/wiki/Tf-idf
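To make the formulas concrete, here is a minimal sketch that computes them by hand. The toy corpus, the function names `tf`, `idf`, and `tfidf`, and the choice of the natural logarithm are all assumptions made for illustration; the definition above does not fix a log base.

```python
import math

# Hypothetical toy corpus, already split into words (an assumption for illustration)
docs = [
    ["i", "am", "a", "cat"],
    ["i", "am", "also"],
    ["please", "please", "be", "a", "cat"],
]

def tf(word, doc):
    """Occurrences of `word` in `doc` divided by the total number of words in `doc`."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """log(total documents / documents containing `word`).

    Assumes `word` occurs in at least one document; natural log is an assumption.
    """
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing)

def tfidf(word, doc, docs):
    """Product of the two quantities above."""
    return tf(word, doc) * idf(word, docs)

# "please" is frequent in one document and absent from the others
print(tfidf("please", docs[2], docs))
# "a" appears in two of the three documents, so it is less rare
print(tfidf("a", docs[2], docs))
```

A word that is frequent in one document and absent from the others ("please") scores higher than a word spread across documents ("a").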

Word count

First, let's count how many times each word appears in each sentence.

Prepare some sentences to count.

result_list = []
result_list.append('I am a cat')
result_list.append('I am a cat')
result_list.append('I am also')
result_list.append('Please, please be a cat')

You can count the frequency of occurrence of words with the following code.

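from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Treat every run of word characters as a token so that one-letter words
# such as "I" and "a" are kept (the default pattern drops them)
count_vectorizer = CountVectorizer(token_pattern=r'(?u)\b\w+\b')

# Learn the vocabulary, then build the sentence-by-word count matrix
count_vectorizer.fit(result_list)
X = count_vectorizer.transform(result_list)

# Vocabulary size and the word -> column index mapping
print(len(count_vectorizer.vocabulary_))
print(count_vectorizer.vocabulary_)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(X.toarray(), columns=count_vectorizer.get_feature_names_out())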

7
{'i': 5, 'am': 2, 'a': 0, 'cat': 4, 'also': 1, 'please': 6, 'be': 3}

   a  also  am  be  cat  i  please
0  1     0   1   0    1  1       0
1  1     0   1   0    1  1       0
2  0     1   1   0    0  1       0
3  1     0   0   1    1  0       2

Each row is a sentence and each column is a word, so each cell tells you how many times that word appears in that sentence.
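To see what this kind of counting involves, here is a rough hand-rolled equivalent using only the standard library. It is a sketch of the idea, not how CountVectorizer is actually implemented; the names `tokenized`, `vocabulary`, and `counts` are made up for this example.

```python
import re
from collections import Counter

sentences = [
    'I am a cat',
    'I am a cat',
    'I am also',
    'Please, please be a cat',
]

# Lowercase each sentence and pull out every run of word characters
tokenized = [re.findall(r'\b\w+\b', s.lower()) for s in sentences]

# The sorted set of all words decides the column order
vocabulary = sorted({word for words in tokenized for word in words})

# One row of counts per sentence, one column per vocabulary word
counts = []
for words in tokenized:
    word_counts = Counter(words)
    counts.append([word_counts[word] for word in vocabulary])

print(vocabulary)
for row in counts:
    print(row)
```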

Next, let's find TF-IDF.

You can find it with the following code.

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Keep one-letter words by matching every run of word characters
tfidf_vectorizer = TfidfVectorizer(token_pattern=r'(?u)\b\w+\b')

# Learn the vocabulary and IDF weights and compute the TF-IDF matrix in one step
X = tfidf_vectorizer.fit_transform(result_list)

# Vocabulary size and the word -> column index mapping
print(len(tfidf_vectorizer.vocabulary_))
print(tfidf_vectorizer.vocabulary_)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(X.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

7
{'i': 5, 'am': 2, 'a': 0, 'cat': 4, 'also': 1, 'please': 6, 'be': 3}

          a      also        am        be       cat         i    please
0  0.500000  0.000000  0.500000  0.000000  0.500000  0.500000  0.000000
1  0.500000  0.000000  0.500000  0.000000  0.500000  0.500000  0.000000
2  0.000000  0.742306  0.473804  0.000000  0.000000  0.473804  0.000000
3  0.264696  0.000000  0.000000  0.414698  0.264696  0.000000  0.829396

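The TF-IDF values computed by TfidfVectorizer lie between 0 and 1, because by default each row is normalized to unit length. Words that appear in many sentences get small values. A word that appears many times in one sentence but in few other sentences gets a large value and is considered important.

You can think of a word as rarer, and more characteristic of its sentence, the closer its value is to 1.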
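The numbers from TfidfVectorizer do not come from the plain formula at the top of this article. With its default settings (`smooth_idf=True`, `norm='l2'`), scikit-learn documents a smoothed IDF, ln((1 + n) / (1 + df)) + 1, multiplies it by the raw count, and then scales each row to length 1. The following is a hand-written sketch of that calculation under those assumptions; the helper names `smoothed_idf` and `tfidf_row` are made up here.

```python
import math
import re

sentences = [
    'I am a cat',
    'I am a cat',
    'I am also',
    'Please, please be a cat',
]
tokenized = [re.findall(r'\b\w+\b', s.lower()) for s in sentences]
vocabulary = sorted({word for words in tokenized for word in words})
n_docs = len(tokenized)

def smoothed_idf(word):
    # ln((1 + n) / (1 + df)) + 1, the form documented for smooth_idf=True
    df = sum(1 for words in tokenized if word in words)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(words):
    # Raw count times IDF, then scale the row to unit Euclidean length (norm='l2')
    raw = [words.count(word) * smoothed_idf(word) for word in vocabulary]
    length = math.sqrt(sum(value * value for value in raw))
    return [value / length for value in raw]

rows = [tfidf_row(words) for words in tokenized]

print(vocabulary)
for row in rows:
    print([round(value, 6) for value in row])
```

Because every row is scaled to length 1, no single value can exceed 1, which is why the values stay in the range 0 to 1.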

Summary

We looked at a method for turning sentences into vectors and a method for calculating how rare each word is. Once sentences are expressed as numbers, you can run all kinds of calculations on them.

These methods are used often in machine learning and similar fields, so make sure you at least remember their names.
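As one example of the calculations that become possible once sentences are numbers, here is a small sketch that compares sentences by cosine similarity on their word-count vectors. Cosine similarity is just one choice picked for illustration, and the function name `cosine_similarity` is defined here rather than taken from a library.

```python
import math
import re

sentences = [
    'I am a cat',
    'I am a cat',
    'I am also',
    'Please, please be a cat',
]
tokenized = [re.findall(r'\b\w+\b', s.lower()) for s in sentences]
vocabulary = sorted({word for words in tokenized for word in words})

# Word-count vector for each sentence
vectors = [[words.count(word) for word in vocabulary] for words in tokenized]

def cosine_similarity(u, v):
    # Dot product divided by the product of the two vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Identical sentences point in the same direction
print(cosine_similarity(vectors[0], vectors[1]))
# Sentences that share only some words are less similar
print(cosine_similarity(vectors[0], vectors[2]))
```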

32 days until you become an engineer

Author information

Otsu py's homepage: http://www.otupy.net/

Youtube: https://www.youtube.com/channel/UCaT7xpeq8n1G_HcJKKSOXMw

Twitter: https://twitter.com/otupython
