3. Natural language processing with Python 3-1. TF-IDF analysis, a tool for extracting important words [original definition]

⑴ The idea of TF-IDF

TF-IDF weights each word by how often it appears within a document (TF, term frequency) and discounts words that appear in many documents (IDF, inverse document frequency). Words that are frequent in one document but rare across the whole corpus therefore stand out as the important words of that document.

⑵ Definition of TF-IDF value

**The TF-IDF value is the occurrence frequency $ tf $ multiplied by the coefficient $ idf $, which is an indicator of rarity:**

$$ tfidf(t, d) = tf(t, d) \times idf(t) $$

**The occurrence frequency $ tf $ and the coefficient $ idf $ are defined as follows:**

$$ tf(t, d) = \frac{\text{number of occurrences of word } t \text{ in document } d}{\text{total number of words in document } d} $$

$$ idf(t) = \log_{10}\frac{N}{df(t)} $$

where $ N $ is the total number of documents and $ df(t) $ is the number of documents that contain the word $ t $.
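As a quick worked example: with $ N = 6 $ documents, a word that occurs twice in a five-word document and appears in two of the six documents gets

$$ tf = \frac{2}{5} = 0.4, \qquad idf = \log_{10}\frac{6}{2} \approx 0.477, \qquad tfidf \approx 0.4 \times 0.477 = 0.191 $$

(this corresponds to "Word 3" in document 1 of the sample data used below).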

⑶ Mechanism of calculation based on the original definition

# Import the numerical libraries
from math import log
import pandas as pd
import numpy as np

➀ Prepare the word data list

docs = [
        ["Word 1", "Word 3", "Word 1", "Word 3", "Word 1"],
        ["Word 1", "Word 1"],
        ["Word 1", "Word 1", "Word 1"],
        ["Word 1", "Word 1", "Word 1", "Word 1"],
        ["Word 1", "Word 1", "Word 2", "Word 2", "Word 1"],
        ["Word 1", "Word 3", "Word 1", "Word 1"]
        ]

N = len(docs)

words = list(set(w for doc in docs for w in doc))
words.sort()

print("Number of documents:", N)
print("Target words:", words)

Number of documents: 6
Target words: ['Word 1', 'Word 2', 'Word 3']
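Since IDF depends on how many documents contain each word, it can help to look at the document frequencies first (a small added check, not in the original article):

# Document frequency of each word: the number of documents that contain it
df_counts = {w: sum(w in doc for doc in docs) for w in words}
print(df_counts)   # {'Word 1': 6, 'Word 2': 1, 'Word 3': 2}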

➁ Define functions for the calculation

# Definition of the function tf: relative frequency of word t in document d
def tf(t, d):
    return d.count(t) / len(d)

# Definition of the function idf: rarity of word t across all documents
def idf(t):
    df = 0
    for doc in docs:
        df += t in doc    # True counts as 1 when t appears in doc
    return np.log10(N / df)

# Definition of the function tfidf: tf weighted by idf
def tfidf(t, d):
    return tf(t, d) * idf(t)
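As a quick sanity check (added here, not in the original article), the functions can be called directly on the first document:

# Sanity check on document 1: ["Word 1", "Word 3", "Word 1", "Word 3", "Word 1"]
d0 = docs[0]
print(tf("Word 3", d0))        # 2/5 = 0.4
print(idf("Word 3"))           # log10(6/2) ≈ 0.4771
print(tfidf("Word 3", d0))     # ≈ 0.1908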

➂ Observe the calculation result of TF

# Calculate tf
result = []
for i in range(N):
    temp = []
    d = docs[i]
    for j in range(len(words)):
        t = words[j]     
        temp.append(tf(t,d))
    result.append(temp)
       
pd.DataFrame(result, columns=words)


➃ Observe the calculation result of IDF

# Calculate idf
result = []
for j in range(len(words)):
    t = words[j]
    result.append(idf(t))

pd.DataFrame(result, index=words, columns=["IDF"])
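Note that "Word 1" appears in every document, so its IDF is zero under this definition and its TF-IDF vanishes no matter how frequent it is. A quick check (added for illustration):

print(idf("Word 1"))   # log10(6/6) = 0.0  — a word in every document gets zero weight
print(idf("Word 2"))   # log10(6/1) ≈ 0.7782  — the rarest word gets the largest weight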


➄ TF-IDF calculation

# Calculate tfidf
result = []
for i in range(N):
    temp = []
    d = docs[i]
    for j in range(len(words)):
        t = words[j]
        temp.append(tfidf(t,d))   
    result.append(temp)

pd.DataFrame(result, columns=words)
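The same table can also be built in one line with a nested comprehension; this is an equivalent alternative to the loops above, not part of the original article:

# Equivalent one-liner to the loops above
pd.DataFrame([[tfidf(t, d) for t in words] for d in docs], columns=words)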


⑷ Calculation with scikit-learn

# Import the scikit-learn TF-IDF library
from sklearn.feature_extraction.text import TfidfVectorizer
# One-dimensional list: each document is now a single string
docs = [
        "Word 1 Word 3 Word 1 Word 3 Word 1",
        "Word 1 word 1",
        "Word 1 Word 1 Word 1",
        "Word 1 Word 1 Word 1 Word 1",
        "Word 1 Word 1 Word 2 Word 2 Word 1",
        "Word 1 Word 3 Word 1 Word 1"
        ]

# Generate the model (smooth_idf=False turns off IDF smoothing)
vectorizer = TfidfVectorizer(smooth_idf=False)
X = vectorizer.fit_transform(docs)

# Represent as a data frame
values = X.toarray()
# get_feature_names() was removed in recent scikit-learn versions; get_feature_names_out() is the current API
feature_names = vectorizer.get_feature_names_out()
pd.DataFrame(values,
             columns = feature_names)

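To see where the difference from the hand calculation comes from, the IDF values the vectorizer actually uses can be inspected through its idf_ attribute (a short added check). With smooth_idf=False, scikit-learn computes the IDF as ln(N/df) + 1, and by default it also L2-normalizes each row (norm='l2'); both points are reproduced in ⑸ below.

# IDF values used internally by TfidfVectorizer
pd.DataFrame(vectorizer.idf_, index=feature_names, columns=["IDF"])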

⑸ Reproduce the result of scikit-learn

➀ Change the IDF formula

# Redefinition of the function idf
# Note: docs must again be the tokenized word lists from ⑶-➀
# (it was overwritten with plain strings in ⑷); tf and tfidf stay unchanged
def idf(t):
    df = 0
    for doc in docs:
        df += t in doc
    # return np.log10(N/df)      # original definition
    return np.log(N/df) + 1      # scikit-learn formula when smooth_idf=False

# Calculate idf again with the new formula
result = []
for j in range(len(words)):
    t = words[j]
    result.append(idf(t))

pd.DataFrame(result, index=words, columns=["IDF"])

With this change, the IDF becomes 1.0 for Word 1 (ln(6/6) + 1), about 2.791759 for Word 2, and about 2.098612 for Word 3.

➁ Observe the calculation result of TF-IDF

# Calculate tfidf again with the modified idf
result = []
for i in range(N):
    temp = []
    d = docs[i]
    for j in range(len(words)):
        t = words[j]
        temp.append(tfidf(t,d))   
    result.append(temp)

pd.DataFrame(result, columns=words)


➂ L2 normalization of the TF-IDF calculation results

# As a trial, L2-normalize only document 1 by hand according to the definition
# (0.60, 0.0, 0.839445 are the TF-IDF values of document 1 in the table above)
x = np.array([0.60, 0.000000, 0.839445])
x_norm = sum(x**2)**0.5          # L2 norm of the vector
x_norm = x/x_norm                # divide each element by the norm
print(x_norm)                    # ≈ [0.5815, 0.0, 0.8136]

# Square the elements and add them up to confirm the vector now has unit length
np.sum(x_norm**2)                # ≈ 1.0


# Import the scikit-learn normalization utility
from sklearn.preprocessing import normalize

# L2 normalization of the whole TF-IDF table
result_norm = normalize(result, norm='l2')

# Represent as a data frame
pd.DataFrame(result_norm, columns=words)
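As a final check (added here, not in the original), every row of the normalized table should now have an L2 norm of 1:

# Each document vector should now have unit length
print(np.linalg.norm(result_norm, axis=1))   # expect an array of ones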

