[Python] I tried calculating TF-IDF by hand

A memo on TF-IDF. Using TfidfVectorizer is much easier, but I implemented the calculation by hand as a study exercise. If you notice anything wrong, please let me know.

Documents to process

hoge.txt (each line is treated as one document)

white black red
white white black
white black black black
white
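
# Preparation

word_set is the vocabulary (here in alphabetical order), and doc_words holds the words of each document, one list per line of hoge.txt.
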
print(word_set)
['black', 'red', 'white']

print(doc_words)
[['white', 'black', 'red'], ['white', 'white', 'black'], ['white', 'black', 'black', 'black'], ['white']]
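
The code that builds word_set and doc_words is not shown in this memo. As a minimal sketch of one way to get the same two values, assuming each line of hoge.txt is one document and words are separated by spaces (the inline `text` string is a stand-in for reading the file, not the original code):

```python
# Hypothetical preparation step (not from the original memo).
# `text` stands in for the contents of hoge.txt.
text = """white black red
white white black
white black black black
white"""

# One document per line, split on whitespace; blank lines are skipped.
doc_words = [line.split() for line in text.splitlines() if line.strip()]

# Sorted vocabulary, so every word gets a stable index.
word_set = sorted({w for words in doc_words for w in words})

print(word_set)
print(doc_words)
```

To read the real file instead, `text` could be replaced with the result of `open("hoge.txt").read()`.
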
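def tokenizer(word_set, doc_words):
    """Replace every word with its index in word_set."""
    # Build the lookup once instead of calling word_set.index() for every word.
    word_to_id = {w: i for i, w in enumerate(word_set)}
    return [[word_to_id[w] for w in words] for words in doc_words]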

token_doc = tokenizer(word_set, doc_words)
#print(token_doc)

doc_num = len(token_doc)
#print(doc_num)
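import math

# IDF: for each word, count the documents that contain it (document frequency),
# then compute log(doc_num / count) + 1 with the natural logarithm.
IDF = []
for j in range(len(word_set)):
    count = 0
    for d_list in token_doc:
        if j in d_list:
            count += 1
    IDF.append(math.log(doc_num / count) + 1)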

# TF: raw count of each word in each document (not divided by document length).
TF_set = []
for doc in token_doc:
    TF = [0] * len(word_set)
    for t in doc:
        TF[t] += 1
    TF_set.append(TF)

# TF-IDF: multiply each count by the IDF of the corresponding word.
TF_IDF_set = []
for temp_TF in TF_set:
    TF_IDF = []
    for m in range(len(word_set)):
        TF_IDF.append(temp_TF[m] * IDF[m])
    TF_IDF_set.append(TF_IDF)
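
Written as formulas, the code above appears to compute the following, where N is the number of documents (doc_num), df(t) is the number of documents containing word t (count), and tf(t, d) is the raw count of t in document d. The symbols are my own notation, not from the original memo:

```latex
\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}(t)} + 1,
\qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)

% With N = 4 documents:
\mathrm{idf}(\text{black}) = \ln\tfrac{4}{3} + 1 \approx 1.2877, \quad
\mathrm{idf}(\text{red})   = \ln\tfrac{4}{1} + 1 \approx 2.3863, \quad
\mathrm{idf}(\text{white}) = \ln\tfrac{4}{4} + 1 = 1
```

For example, "black" appears three times in the third document, so its score there is 3 × 1.2877 ≈ 3.8630.
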

# Result

print(token_doc)
[[2, 0, 1], [2, 2, 0], [2, 0, 0, 0], [2]]

print(word_set)
['black', 'red', 'white']

print(TF_IDF_set)
[[1.2876820724517808, 2.386294361119891, 1.0], [1.2876820724517808, 0.0, 2.0], [3.8630462173553424, 0.0, 1.0], [0.0, 0.0, 1.0]]
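
As a cross-check, the same calculation can be sketched more compactly with collections.Counter. This is an alternative version, not the code above; the function name `tf_idf` and its return shape are my own choices for illustration:

```python
import math
from collections import Counter


def tf_idf(doc_words):
    """Raw-count TF times (ln(N / df) + 1), one row per document."""
    vocab = sorted({w for words in doc_words for w in words})
    n_docs = len(doc_words)
    # Document frequency: number of documents that contain each word.
    df = Counter(w for words in doc_words for w in set(words))
    idf = {w: math.log(n_docs / df[w]) + 1 for w in vocab}
    rows = []
    for words in doc_words:
        tf = Counter(words)  # missing words count as 0
        rows.append([tf[w] * idf[w] for w in vocab])
    return vocab, rows


doc_words = [
    ["white", "black", "red"],
    ["white", "white", "black"],
    ["white", "black", "black", "black"],
    ["white"],
]
vocab, rows = tf_idf(doc_words)
print(vocab)
for row in rows:
    print(row)
```

If I am reading the scikit-learn documentation correctly, TfidfVectorizer with smooth_idf=False and norm=None uses the same ln(N / df) + 1 weighting on raw counts, so it should give matching values. The default settings (IDF smoothing plus L2 normalization) will produce different numbers.
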
