[PYTHON] Summary of the corpus formats that gensim can serialize

gensim

A library of topic models implemented in Python. Its features are not covered in detail here. Instead, this article summarizes the file formats that gensim can write out once strings have been converted to the bag-of-words (BoW) format.

Execution code

The corpus is built and written out as in the official reference.

from gensim import corpora
from collections import defaultdict
from pprint import pprint

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]
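# Lowercase each document, split on whitespace, and drop stop words.
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# Count how often each token occurs across all documents.
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

# Keep only tokens that occur more than once, then build the id mapping and the BoW corpus.
texts = [[token for token in text if frequency[token] > 1]
         for text in texts]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]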

# Write the same corpus to disk in each format.
corpora.MmCorpus.serialize("./corpus.mm", corpus)
corpora.BleiCorpus.serialize("./corpus.blei", corpus)
corpora.LowCorpus.serialize("./corpus.low", corpus)
corpora.SvmLightCorpus.serialize("./corpus.svmlight", corpus)
corpora.UciCorpus.serialize("./corpus.uci", corpus)

pprint(texts)
print("\n")
pprint(dictionary.token2id)
print("\n")
pprint(corpus)

Output

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]


{'computer': 1,
 'eps': 8,
 'graph': 10,
 'human': 2,
 'interface': 0,
 'minors': 11,
 'response': 6,
 'survey': 4,
 'system': 5,
 'time': 7,
 'trees': 9,
 'user': 3}


[[(0, 1), (1, 1), (2, 1)],
 [(1, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
 [(0, 1), (3, 1), (5, 1), (8, 1)],
 [(2, 1), (5, 2), (8, 1)],
 [(3, 1), (6, 1), (7, 1)],
 [(9, 1)],
 [(9, 1), (10, 1)],
 [(9, 1), (10, 1), (11, 1)],
 [(4, 1), (10, 1), (11, 1)]]

Matrix Market format

corpus.mm


%%MatrixMarket matrix coordinate real general
9 12 28                                           
1 1 1
1 2 1
1 3 1
2 2 1
2 4 1
2 5 1
2 6 1
2 7 1
2 8 1
3 1 1
3 4 1
3 6 1
3 9 1
4 3 1
4 6 2
4 9 1
5 4 1
5 7 1
5 8 1
6 10 1
7 10 1
7 11 1
8 10 1
8 11 1
8 12 1
9 5 1
9 11 1
9 12 1
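To make the layout concrete: after the `%%MatrixMarket` header comes one line with the number of documents, terms, and non-zero entries, followed by one `document term value` triple per line, with both indices starting at 1. The following is only a rough sketch of reading that text back into BoW lists using plain Python. The function name `parse_matrix_market` and the shortened sample are made up for illustration, and this is not how gensim itself reads the file.

```python
def parse_matrix_market(text):
    """Turn Matrix Market coordinate text into a list of BoW documents."""
    # Skip blank lines and the %-prefixed header/comment lines.
    lines = [ln for ln in text.splitlines() if ln.strip() and not ln.startswith("%")]
    # First remaining line: number of documents, terms, and non-zero entries.
    num_docs, num_terms, num_nnz = (int(x) for x in lines[0].split())
    docs = [[] for _ in range(num_docs)]
    for ln in lines[1:]:
        doc_no, term_no, value = ln.split()
        # Indices in the file are 1-based; the BoW ids above are 0-based.
        docs[int(doc_no) - 1].append((int(term_no) - 1, float(value)))
    return docs


# Shortened sample (hypothetical): only documents 1 and 4 of the file above.
sample = """%%MatrixMarket matrix coordinate real general
4 9 6
1 1 1
1 2 1
1 3 1
4 3 1
4 6 2
4 9 1
"""
bow = parse_matrix_market(sample)
print(bow[0])  # [(0, 1.0), (1, 1.0), (2, 1.0)]
print(bow[3])  # [(2, 1.0), (5, 2.0), (8, 1.0)]
```

Apart from the values being floats, the two printed documents match the first and fourth entries of the BoW corpus in the Output section.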

Blei format

corpus.blei


3 0:1 1:1 2:1
6 1:1 3:1 4:1 5:1 6:1 7:1
4 0:1 3:1 5:1 8:1
3 2:1 5:2 8:1
3 3:1 6:1 7:1
1 9:1
2 9:1 10:1
3 9:1 10:1 11:1
3 4:1 10:1 11:1
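Each line of the file above is one document: the leading number is the count of distinct terms in that document, followed by `id:count` pairs with 0-based ids. A minimal sketch of decoding a single line, where `parse_blei_line` is a hypothetical helper name and not part of gensim:

```python
def parse_blei_line(line):
    """Decode one Blei-format line into a BoW list of (term_id, count)."""
    parts = line.split()
    num_unique = int(parts[0])  # number of distinct terms, not total words
    doc = []
    for pair in parts[1:]:
        term_id, count = pair.split(":")
        doc.append((int(term_id), int(count)))
    if len(doc) != num_unique:
        raise ValueError("leading count does not match the number of pairs")
    return doc


print(parse_blei_line("3 2:1 5:2 8:1"))  # [(2, 1), (5, 2), (8, 1)]
```

The fourth document has four words in total, but its line starts with 3 because only three distinct ids occur in it.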

corpus.blei.vocab


0
1
2
3
4
5
6
7
8
9
10
11

UCI format

corpus.uci


9                   
12                  
28                  
1 1 1
1 2 1
1 3 1
2 2 1
2 4 1
2 5 1
2 6 1
2 7 1
2 8 1
3 1 1
3 4 1
3 6 1
3 9 1
4 3 1
4 6 2
4 9 1
5 4 1
5 7 1
5 8 1
6 10 1
7 10 1
7 11 1
8 10 1
8 11 1
8 12 1
9 5 1
9 11 1
9 12 1

corpus.uci.vocab


0
1
2
3
4
5
6
7
8
9
10
11

Low format

corpus.low


9
0 1 2
1 3 4 5 6 7
0 3 5 8
2 5 5 8
3 6 7
9
9 10
9 10 11
4 10 11
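The first line of the file above is the number of documents, and each following line is one document written as a plain list of tokens, with a token repeated as many times as it occurs. The sketch below shows that expansion under stated assumptions: `bow_to_low_line` is a hypothetical helper, and the small mapping passed in the second call is taken from the `token2id` output earlier in this article.

```python
def bow_to_low_line(doc, id2word=None):
    """Expand a BoW document into one List-of-Words line."""
    words = []
    for term_id, count in doc:
        # Without a mapping, fall back to the id itself as the token.
        token = id2word[term_id] if id2word is not None else str(term_id)
        # A count of n becomes n repetitions of the token.
        words.extend([token] * int(count))
    return " ".join(words)


doc = [(2, 1), (5, 2), (8, 1)]
print(bow_to_low_line(doc))  # 2 5 5 8
print(bow_to_low_line(doc, {2: "human", 5: "system", 8: "eps"}))  # human system system eps
```

The first call reproduces the fourth document line of corpus.low. Counts are not stored explicitly in this format, only as repetitions.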

corpus.low.vocab


0
1
2
3
4
5
6
7
8
9
10
11

SvmLight format

corpus.svmlight


0 1:1 2:1 3:1
0 2:1 4:1 5:1 6:1 7:1 8:1
0 1:1 4:1 6:1 9:1
0 3:1 6:2 9:1
0 4:1 7:1 8:1
0 10:1
0 10:1 11:1
0 10:1 11:1 12:1
0 5:1 11:1 12:1
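In the output above, every line starts with a label (0 here, since no labels were given), followed by `feature:value` pairs whose feature numbers are the BoW ids shifted up by one. A rough sketch of that mapping, where `bow_to_svmlight_line` is a hypothetical helper written to mirror the output shown rather than gensim's own code:

```python
def bow_to_svmlight_line(doc, label=0):
    """Format a BoW document as one SVMlight line."""
    # SVMlight feature numbers start at 1, so every 0-based id is shifted by one.
    pairs = " ".join(f"{term_id + 1}:{value}" for term_id, value in doc)
    return f"{label} {pairs}"


print(bow_to_svmlight_line([(2, 1), (5, 2), (8, 1)]))  # 0 3:1 6:2 9:1
```

The printed line matches the fourth line of corpus.svmlight.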
