Python: Natural language vector representation

Vector representation of document

So far in natural language processing we have mainly looked at how to handle documents and sentences. From here we will analyze the relationships between documents and words.

Vector representation of document

The vector representation of a document expresses how words are distributed in that document as a vector.

For example, the sentence "I like tomatoes and cucumbers, especially tomatoes" can be converted into the following vector representation.

(I, like, tomatoes, and, cucumbers, especially) = (1, 1, 2, 1, 1, 1)

Only the number of occurrences of each word is recorded; the information about where each word appeared, that is, the sentence structure and word order, is lost. Such a vector representation method is called

Bag of Words (BOW).

There are three typical methods for converting to a vector representation.

(1) Count representation: as in the previous example, a method that focuses on the number of occurrences of each word in the document
(2) Binary representation: a method that focuses only on whether or not each word appears in the document, without worrying about the frequency of appearance
(3) tf-idf representation: a method that uses, for each word in the document, a weight calculated by the tf-idf method

Generally, tf-idf is used, but since it takes time to compute when there are many documents, binary or count representations are used in that case.
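As a rough sketch of the difference between the three, scikit-learn can produce all of them from the same word-separated documents (the documents below are made-up examples; CountVectorizer and TfidfVectorizer come from scikit-learn, which is also used later in this article):

# A minimal sketch of the three representations with scikit-learn (hypothetical documents).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["apple apple banana", "apple banana cherry"]

# (1) Count representation: how many times each word appears
print(CountVectorizer().fit_transform(docs).toarray())
# (2) Binary representation: 1 if the word appears at all, 0 otherwise
print(CountVectorizer(binary=True).fit_transform(docs).toarray())
# (3) tf-idf representation: counts reweighted by how rare each word is
print(TfidfVectorizer().fit_transform(docs).toarray())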

BOW count representation

In the count representation, a document is converted into a vector by counting the number of occurrences of each word in it.

1.John likes to watch movies. Mary likes movies too.
2.John also likes to watch football games.

Given the above two sentences, the Bag of Words of these two sentences is a vector whose elements are the numbers of occurrences of the words in each document.

["John","likes","to","watch","movies","Mary","too","also","football", "games"]

Counting the number of times each of the above words appears in each sentence gives:

1. [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]
2. [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]

These vectors represent the characteristics of the sentences.
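Before turning to a library, the same two vectors can be reproduced by hand with a few lines of plain Python (a small sketch; punctuation has been stripped from the sentences beforehand):

# A minimal hand-made count representation that reproduces the vectors above.
vocabulary = ["John", "likes", "to", "watch", "movies", "Mary", "too", "also", "football", "games"]

sentences = [
    "John likes to watch movies Mary likes movies too",
    "John also likes to watch football games",
]

for sentence in sentences:
    words = sentence.split()
    # Count how many times each vocabulary word appears in this sentence.
    print([words.count(w) for w in vocabulary])
# Expected: [1, 2, 1, 1, 2, 1, 1, 0, 0, 0] and [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]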

In Python, this can be computed automatically by using gensim, a machine learning library mainly for text analysis.

First,

dictionary = gensim.corpora.Dictionary(word-separated documents)

creates a dictionary (here named dictionary) of the words that appear in the documents in advance. Then

dictionary.doc2bow(each word of a word-separated document)

creates the Bag of Words, and the output is a list of (id, number of occurrences) pairs. You can look up the id number of each word with dictionary.token2id.

from gensim import corpora

#From the list of words for each document (documents), create a dictionary that maps the words to ids
dictionary = corpora.Dictionary(documents)

#Creating Bag of Words
bow_corpus = [dictionary.doc2bow(d) for d in documents]

Here is more detailed code.

from gensim import corpora
from janome.tokenizer import Tokenizer

text1 = "Of the thighs and thighs"
text2 = "Great food and scenery"
text3 = "My hobby is photography"

t = Tokenizer()
tokens1 = t.tokenize(text1, wakati=True)
tokens2 = t.tokenize(text2, wakati=True)
tokens3 = t.tokenize(text3, wakati=True)

documents = [tokens1, tokens2, tokens3]
#Create a word dictionary using corpora.
dictionary =corpora.Dictionary(documents)

#Display the id of each word
print(dictionary.token2id)

#Create Bag of Words
bow_corpus =[dictionary.doc2bow(d) for d in documents]

# A list of (id, number of occurrences) pairs is output.
print(bow_corpus)

print()
# Output the contents of bow_corpus in an easy-to-understand way
texts = [text1, text2, text3]
for i in range(len(bow_corpus)):
    print(texts[i])
    for j in range(len(bow_corpus[i])):
        index = bow_corpus[i][j][0]
        num = bow_corpus[i][j][1]
        print("\"", dictionary[index], "\"But" ,num, "Times", end=", ")
    print()
Output result

{'すもも': 0, 'も': 1, 'もも': 2, 'の': 3, 'うち': 4, '料理': 5, '景色': 6, '最高': 7, '私': 8, '趣味': 9, 'は': 10, '写真': 11, '撮影': 12, 'です': 13}
[[(0, 1), (1, 2), (2, 2), (3, 1), (4, 1)], [(1, 2), (5, 1), (6, 1), (7, 1)], [(3, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1)]]

すもももももももものうち
" すもも " appears 1 time(s), " も " appears 2 time(s), " もも " appears 2 time(s), " の " appears 1 time(s), " うち " appears 1 time(s), 
料理も景色も最高
" も " appears 2 time(s), " 料理 " appears 1 time(s), " 景色 " appears 1 time(s), " 最高 " appears 1 time(s), 
私の趣味は写真撮影です
" の " appears 1 time(s), " 私 " appears 1 time(s), " 趣味 " appears 1 time(s), " は " appears 1 time(s), " 写真 " appears 1 time(s), " 撮影 " appears 1 time(s), " です " appears 1 time(s), 

Weighting by BOW tf-idf (theory)

In the count representation covered in "BOW count representation", the number of occurrences of each word was treated as the feature of a sentence.

tf-idf is represented by the product of
tf (Term Frequency): the frequency of a word within a document, and
idf (Inverse Document Frequency): a value that shows how rare the word is across documents.

tf, idf, and tf-idf are defined by the following formulas, where $n_{t,d}$ is the number of occurrences of word $t$ in document $d$, $N$ is the total number of documents, and $\mathrm{df}(t)$ is the number of documents that contain $t$:

$$\mathrm{tf}(t,d) = \frac{n_{t,d}}{\sum_{s \in d} n_{s,d}}$$

$$\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}$$

$$\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d) \times \mathrm{idf}(t)$$

The formulas show that idf plays the role of decreasing the importance of general words that appear in many documents and increasing the importance of words that appear only in particular documents.

As a result, the weights of function words such as "です" and "ます", which appear in almost every sentence, are reduced, and the importance of words can be set appropriately.

In other words, with tf-idf, words whose occurrences are biased, appearing often in certain documents and rarely in others, are treated as more important.
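As a rough sketch, the formulas above can be implemented directly in a few lines of Python; this follows the plain textbook definition (scikit-learn, used in the next section, modifies it slightly). The documents here are hypothetical.

import math

# Hypothetical word-separated documents.
docs = [["apple", "apple", "banana"],
        ["apple", "cherry"],
        ["banana", "cherry"]]

def tf(term, doc):
    # Term frequency: occurrences of the term divided by the document length.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse document frequency: log of (number of documents / documents containing the term).
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tfidf("apple", docs[0], docs))   # weight of "apple" in the first document
print(tfidf("banana", docs[0], docs))  # weight of "banana" in the first document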

Weighting by BOW tf-idf (implementation)

Let's actually implement tf-idf, using the TfidfVectorizer class provided by scikit-learn.

TfidfVectorizer implements a slightly adjusted version of the formulas explained in "Weighting by BOW tf-idf (theory)", but the essentials are the same.

The implementation of the vector representation of documents using TfidfVectorizer is as follows.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

#When displaying, show values with two significant digits
np.set_printoptions(precision=2)

#Word-separated documents ("白 黒 赤" = "white black red", "白 白 黒" = "white white black", "赤 黒" = "red black")
docs = np.array([
    "白 黒 赤", 
    "白 白 黒", 
    "赤 黒"
])
vectorizer = TfidfVectorizer(use_idf=True, token_pattern="(?u)\\b\\w+\\b")
vecs = vectorizer.fit_transform(docs)

#Gets the words corresponding to the columns
print(vectorizer.get_feature_names())

# Gets the matrix that stores the tf-idf values
print(vecs.toarray())
#Output result
['白', '赤', '黒']
[[ 0.62  0.62  0.48]
[ 0.93  0.    0.36]
[ 0.    0.79  0.61]]

Row n of vecs.toarray() corresponds to the vector representation of the nth document in docs, and column n of vecs.toarray() corresponds to the nth word of the vocabulary shared by all documents.


Here is a supplementary explanation of the code.

vectorizer = TfidfVectorizer()
#This generates a converter for vector representation.
use_idf=False
#If you pass this, only tf weighting is performed (idf is not used).
TfidfVectorizer
#By default, single-character tokens are not treated as tokens.
#Passing token_pattern="(?u)\\b\\w+\\b" as an argument keeps them from being excluded.
"(?u)\\b\\w+\\b"
#This is a regular expression meaning "any string of one or more word characters"; you don't need to understand it in depth.
#(To be exact, it is a regular expression written with escape sequences.)
vectorizer.fit_transform()
#Converts the documents to vectors.
#The argument is an array of documents separated by whitespace characters.
#You can convert the output to a NumPy ndarray with toarray().
np.set_printoptions()
#A function that defines the display format of NumPy arrays; significant digits can be specified with precision.
#In this example, values are displayed with two significant digits.

Here is an example based on this

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

np.set_printoptions(precision=2)
docs = np.array([
    # "リンゴ" = apple, "ゴリラ" = gorilla, "ラッパ" = trumpet
    "リンゴ リンゴ", "リンゴ ゴリラ", "ゴリラ ラッパ"
])

#Convert to vector representation.
vectorizer = TfidfVectorizer(use_idf=True, token_pattern=u"(?u)\\b\\w+\\b")
vecs = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names())
print(vecs.toarray())
Output result

['ゴリラ', 'ラッパ', 'リンゴ']
[[ 0.    0.    1.  ]
 [ 0.71  0.    0.71]
 [ 0.61  0.8   0.  ]]

cos similarity

So far, we have vectorized documents so that they can be compared quantitatively. By comparing the vectors, you can analyze the similarity between documents. The measure used to show how close one vector is to another is

the cos similarity.

The cos similarity is expressed by the following formula and represents the cosine of the angle between two vectors (between 0 and 1 for vectors with non-negative elements, such as BOW vectors). The cos similarity is therefore large when the two vectors point in similar directions and small when they point in different directions.

Keep in mind that "when it's close to 1, it's similar, and when it's close to 0, it's not."

$$\cos(\vec{v}_1, \vec{v}_2) = \frac{\vec{v}_1 \cdot \vec{v}_2}{\|\vec{v}_1\| \, \|\vec{v}_2\|}$$

When implemented, it looks as follows. np.dot() computes the inner product, and np.linalg.norm() computes the norm (length) of a vector.

import numpy as np

def cosine_similarity(v1, v2):
    cos_sim = np.dot(v1, v2) / (np.linalg.norm(v1)*np.linalg.norm(v2))
    return cos_sim

Here is a usage example.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = np.array([
    # Same documents as in the tf-idf example above
    "リンゴ リンゴ", "リンゴ ゴリラ", "ゴリラ ラッパ"
])
vectorizer = TfidfVectorizer(use_idf=True, token_pattern=u"(?u)\\b\\w+\\b")
vecs = vectorizer.fit_transform(docs)
vecs = vecs.toarray()

#Define a function to find cos similarity
def cosine_similarity(v1, v2):
    cos_sim = np.dot(v1, v2) / (np.linalg.norm(v1)*np.linalg.norm(v2))
    return cos_sim

#Let's compare the similarity
print("%1.3F" % cosine_similarity(vecs[0], vecs[1]))
print("%1.3F" % cosine_similarity(vecs[0], vecs[2]))
print("%1.3F" % cosine_similarity(vecs[1], vecs[2]))

Vector representation of words

Word2Vec

In the previous section we represented documents as vectors; this time we will vectorize words. By expressing words as vectors, you can quantify how close their meanings are and search for synonyms.

A tool for vectorizing words that has come out of recent research is

Word2Vec.

To put it briefly, it is a tool that vectorizes words and compresses their dimensionality in order to capture the meaning and grammar of words.

It is said that the vocabulary Japanese speakers use on a daily basis ranges from tens of thousands to hundreds of thousands of words, yet Word2Vec can express each word as a vector of only about 200 dimensions.

Because Word2Vec makes relationships between words easy to express, it is also possible to perform operations between words such as:

"King" - "Man" + "Woman" = "Queen"
"Paris" - "France" + "Japan" = "Tokyo"
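For reference, with a trained gensim model such analogies are written by passing the words to add as positive and the words to subtract as negative to most_similar (a sketch; model is assumed to be a Word2Vec model trained on a corpus whose vocabulary contains these words):

# A sketch of the "King" - "Man" + "Woman" ≒ "Queen" style query.
# `model` is assumed to be a trained Word2Vec model containing these words.
results = model.most_similar(positive=["King", "Woman"], negative=["Man"], topn=5)
# (On gensim 4.x, call model.wv.most_similar instead.)
for word, similarity in results:
    print(word, similarity)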

Here, we give Word2Vec the document data of a news-article corpus called the "livedoor news corpus" and learn the relationships between words.

From now on, we will use Word2Vec to look up words that are highly relevant to the word "man" (男). The flow is as follows; the corresponding section names are listed as well.

1. Extract the news articles as a text corpus and divide them into texts and categories: "glob module", "with statement", "Taking out the corpus"
2. Word-separate the extracted documents by part of speech and make a list: "janome (3)"
3. Generate a model with Word2Vec: "Word2Vec (implementation)"
4. Look up words that are highly relevant to "man": "Word2Vec (implementation)"

glob module

The glob module is a convenient module for manipulating files and directories, and it lets you specify paths using wildcard patterns (special characters such as *).

File: The smallest unit of information operated and managed by the user, such as documents, photos, and music.
Directory: A container that holds files together.
Path: The location of a file or directory on your computer.

glob module

What makes it different from the os module (the module basically used for manipulating files and directories) is that you can search for files flexibly using special characters and strings. For example, if you use an asterisk * as in the following example, you can list all txt files in the test directory.

import glob

lis = glob.glob("test/*.txt")
print(lis)
#Output result
["test/sample.txt", "test/sample1.txt", "test/sample2.txt"]

You can also use special character ranges as in the example below. In this example, all sample(number).txt files under the test directory are listed.

import glob

lis = glob.glob("test/sample[0-9].txt")
print(lis)
#Output result
["test/sample1.txt", "test/sample2.txt"]

with statement

When reading a file in the usual way, you

open the file with open(),
read the file with read() and so on, and
close the file with close().

The second argument "r" of open () in the example below specifies the mode when opening a file. In this case, it means read, which means that it is read-only. In addition, for example, if it is write-only, specify "w" to mean write.

f = open("a.text", "r", encoding="utf-8")
data = f.read()
f.close()

However, with this notation, if you forget to write close(), or an error occurs partway through and the file is never closed, memory may be wasted.

So instead,

use the with statement.

With the with statement, the file is closed automatically, and even if an error occurs while the file is open, the necessary cleanup is handled properly. It is very convenient.

If you use the with statement, open the file as follows. In this example, the file is opened with UTF-8, the most widely used character encoding.

A character code is a number assigned to each character so that characters can be handled on a computer.

with open("a.text", "r", encoding="utf-8") as f:

Inside the with block, you read the file with, for example,

data = f.read()

If you use read(), all the data in the file is read in as a single string.
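Putting these pieces together, a minimal sketch that reads the same hypothetical a.text and splits it into lines (splitlines() appears again in "Taking out the corpus") looks like this:

# Read the whole file inside a with block; the file is closed automatically afterwards.
with open("a.text", "r", encoding="utf-8") as f:
    data = f.read()

# Split the string into a list of lines.
lines = data.splitlines()
print(len(lines), "lines read")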

Taking out the corpus

As covered earlier, a corpus is document or audio data to which some kind of information has been attached.

According to the download source, the livedoor news corpus is described as follows:

Of the news articles on "livedoor news" operated by NHN Japan Corporation, those covered by a Creative Commons license were collected, and the corpus was created by removing HTML tags as much as possible. It therefore does not require the data-formatting work that is normally necessary.

Taking the directory used here as an example, you can check the contents of the text directory and of each category's directory with the following code.

glob.glob("./5050_nlp_data/*")
glob.glob("./text/sports-watch/*")

This time, we take the livedoor news corpus out as a text corpus and use it by dividing it into texts and categories. You can take out and classify the corpus by writing as follows. At first glance it looks complicated, but the content is very simple.

splitlines() is a string method that splits a string at line breaks and returns a list of the lines.
import glob

def load_livedoor_news_corpus():
    category = {
        "dokujo-tsushin": 1,
        "it-life-hack":2,
        "kaden-channel": 3,
        "livedoor-homme": 4,
        "movie-enter": 5,
        "peachy": 6,
        "smax": 7,
        "sports-watch": 8,
        "topic-news":9
    }
    docs  = [] #The text of all articles is stored here.
    labels = [] #Treat categories 1-9 of articles stored in docs as labels.

    #Executes for directories of all categories.
    for c_name, c_id in category.items():
        # Embed the category name c_name obtained from category.items() into {c_name} with the format method.
        files = glob.glob("./5050_nlp_data/{c_name}/{c_name}*.txt".format(c_name=c_name))
        #Displays the number of files (number of articles) belonging to the category.
        print("category: ", c_name, ", ",  len(files))
        #For each article, get the URL, date, title, and body information as follows.
        for file in files:
            # Because the with statement is used, close() is unnecessary.
            with open(file, "r", encoding="utf-8") as f:
                #Split by newline character
                lines = f.read().splitlines()
                # After splitting, element 0 is the URL, element 1 the date, element 2 the title, and elements 3 onward the article body.
                url = lines[0]  
                datetime = lines[1]  
                subject = lines[2]
                #The text in the article will be put together on one line.
                body = "".join(lines[3:])
                #I will put together the title and the text.
                text = subject + body

            docs.append(text)
            labels.append(c_id)

    return docs, labels

#Get the text data of all articles and their labels (categories).
docs, labels = load_livedoor_news_corpus()

Here is an actual usage example.

import glob

def load_livedoor_news_corpus():
    category = {
        "dokujo-tsushin": 1,
        "it-life-hack":2,
        "kaden-channel": 3,
        "livedoor-homme": 4,
        "movie-enter": 5,
        "peachy": 6,
        "smax": 7,
        "sports-watch": 8,
        "topic-news":9
    }
    docs  = []
    labels = []
    
    
    #Copy the above code.
    for c_name, c_id in category.items():
        files = glob.glob("./5050_nlp_data/{c_name}/{c_name}*.txt".format(c_name=c_name))
        text = ""
        for file in files:
            with open(file, "r", encoding="utf-8") as f:
                lines = f.read().splitlines() 
                url = lines[0]  
                datetime = lines[1]  
                subject = lines[2]
                body = "".join(lines[3:])
                text = subject + body

            docs.append(text)
            labels.append(c_id)

    return docs, labels

docs, labels = load_livedoor_news_corpus()
print("\nlabel: ", labels[0], "\ndocs:\n", docs[0])
print("\nlabel: ", labels[1000], "\ndocs:\n", docs[1000])

Word2Vec (implementation)

Now that the preliminaries are in place, let's move on to Word2Vec, the main subject.

When using Word2Vec, import from the gensim module.
from gensim.models import word2vec

Generate a model by passing the list used for learning (the word-separated documents) as an argument to the Word2Vec function.

Use janome's Tokenizer, covered in "BOW count representation" and later sections, to word-separate the text in advance. When word-separating, look up the part of speech of each word.

In Japanese, word relevance can be analyzed using only nouns, verbs, adjectives, and adjectival verbs, so create a word-separated list containing only those parts of speech.

Word2Vec is used as follows.

model = word2vec.Word2Vec(list, size=a, min_count=b, window=c)
# where a, b, and c are numbers

The main arguments that Word2Vec often uses are:

size: The number of dimensions of the vector.
window: The words before and after this number are regarded as related words for learning.
min_count: Discard words that appear fewer than this number of times.

After training, if you call the most_similar() method on the model like

.most_similar(positive=["word"])

the words with the highest similarity to the given word are output.
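Note that the argument names and calls above follow the older gensim API used throughout this article. If you are on gensim 4.x, the equivalent calls look roughly like this (size was renamed to vector_size, and similarity queries go through model.wv):

# Equivalent calls on gensim 4.x (a sketch; `sentences` is the word-separated list as above).
model = word2vec.Word2Vec([sentences], vector_size=100, min_count=20, window=15)
print(model.wv.most_similar(positive=["word"]))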

As an example, let's use word2vec to output words that are highly relevant to the word "man" (男). For the argument of word2vec.Word2Vec, pass the list [sentences], and set size=100, min_count=20, window=15.

import glob
from janome.tokenizer import Tokenizer
from gensim.models import word2vec

#Loading and classifying livedoor news
def load_livedoor_news_corpus():
    category = {
        "dokujo-tsushin": 1,
        "it-life-hack":2,
        "kaden-channel": 3,
        "livedoor-homme": 4,
        "movie-enter": 5,
        "peachy": 6,
        "smax": 7,
        "sports-watch": 8,
        "topic-news":9
    }
    docs  = []
    labels = []

    for c_name, c_id in category.items():
        files = glob.glob("./5050_nlp_data/{c_name}/{c_name}*.txt".format(c_name=c_name))

        text = ""
        for file in files:
            with open(file, "r", encoding="utf-8") as f:
                lines = f.read().splitlines() 

                # The URL (element 0) and the date (element 1) are not relevant here, so they are not included in the text.
                url = lines[0]  
                datetime = lines[1]  
                subject = lines[2]
                body = "".join(lines[3:])
                text = subject + body

            docs.append(text)
            labels.append(c_id)

    return docs, labels

# Look up the part of speech of each word and create a list of only nouns, verbs, adjectives, and adjectival verbs
def tokenize(text):
    tokens = t.tokenize(",".join(text))
    word = []
    for token in tokens:
        part_of_speech = token.part_of_speech.split(",")[0]
 
        if part_of_speech in ["名詞", "動詞", "形容詞", "形容動詞"]:  # noun, verb, adjective, adjectival verb
            word.append(token.surface)            
    return word

#Classify by label and text
docs, labels = load_livedoor_news_corpus()
t = Tokenizer() #First create a Tokenizer instance
sentences = tokenize(docs[0:100])  #Limited due to the large amount of data
# Create the answer below
# For the word2vec.Word2Vec arguments, set size=100, min_count=20, window=15
model = word2vec.Word2Vec([sentences], size=100, min_count=20, window=15)
print(model.most_similar(positive=["男"]))  # "男" means "man"

Doc2Vec

Doc2Vec(1)

Doc2Vec is a technology for vectorizing sentences that applies Word2Vec.

We studied vectorizing sentences with BOW in "Vector representation of document". The big difference from BOW is that Doc2Vec can also take the word order of sentences into account as a feature.

The following is a review of the shortcomings of BOW learned in the vector representation of documents:

(1) There is no word order information.
(2) It is poor at expressing the meaning of words.

Doc2Vec makes up for these two shortcomings.

Doc2Vec(2)

Let's implement Doc2Vec. We will compare the similarity of docs[0], docs[1], docs[2], and docs[3] of the livedoor news corpus created in "Taking out the corpus".

The flow is as follows.

1. Word-separation
Use janome's Tokenizer to word-separate each text.

2. Create instances of the TaggedDocument class
Giving TaggedDocument the arguments words=<the word-separated elements> and tags=["tag"] creates an instance of the TaggedDocument class.
Tags are like document ids.
Store the TaggedDocuments in a list and pass the list to Doc2Vec.

3. Model generation with Doc2Vec
The training of the model is written as follows.
model = Doc2Vec(documents=list, min_count=1)
# min_count: use only words that appear at least this number of times for learning

4. Similarity output
The output of similarity is written as follows.
for i in range(4):
    print(model.docvecs.most_similar("d"+str(i)))

Here is a complete usage example.

import glob
from gensim.models.doc2vec import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from janome.tokenizer import Tokenizer

#Loading and classifying livedoor news
def load_livedoor_news_corpus():
    category = {
        "dokujo-tsushin": 1,
        "it-life-hack":2,
        "kaden-channel": 3,
        "livedoor-homme": 4,
        "movie-enter": 5,
        "peachy": 6,
        "smax": 7,
        "sports-watch": 8,
        "topic-news":9
    }
    docs  = []
    labels = []

    for c_name, c_id in category.items():
        files = glob.glob("./5050_nlp_data/{c_name}/{c_name}*.txt".format(c_name=c_name))

        text = ""
        for file in files:
            with open(file, "r", encoding="utf-8") as f:
                lines = f.read().splitlines() 

                # The URL (element 0) and the date (element 1) are not relevant here, so they are not included in the text.
                url = lines[0]  
                datetime = lines[1]  
                subject = lines[2]
                body = "".join(lines[3:])
                text = subject + body

            docs.append(text)
            labels.append(c_id)

    return docs, labels
docs, labels = load_livedoor_news_corpus()

#Doc2Vec processing
token = [] # A list that stores the word-separated result of each document in docs
training_docs = [] #List to store TaggedDocument
t = Tokenizer() #First create a Tokenizer instance
for i in range(4):
    
    # Word-separate docs[i] and store the result in token
    token.append(t.tokenize(docs[i], wakati=True))
    
    # Create a TaggedDocument instance from the result and store it in training_docs
    # The tag will be "d" + the document number
    training_docs.append(TaggedDocument(words=token[i], tags=["d" + str(i)]))

#Please create an answer below
model = Doc2Vec(documents=training_docs, min_count=1)

for i in range(4):
    print(model.docvecs.most_similar("d"+str(i)))

Classification of Japanese text

Classification of Japanese text

We classify the categories of Japanese text with a random forest. Again, we use the livedoor news corpus. The data given to the random forest is the vectorized news articles, divided into 9 categories. By representing articles as vectors, you can classify them by directly applying the methods learned in supervised learning.

The learning flow of this chapter is as follows.

1. Load and classify the livedoor news corpus: "Taking out the corpus"
2. Divide the data into training data and test data: "Theory and practice of the holdout method" in Introduction to Machine Learning
3. Vectorize the training and test data with tf-idf: "Weighting by BOW tf-idf (implementation)", "fit function"
4. Learn with a random forest: Supervised Classification, "Random Forest"
5. Implementation: "Categorize corpus categories by random forest"
6. Increase the accuracy: "Improving accuracy"

fit function

scikit-learn's conversion classes (StandardScaler, Normalizer, TfidfVectorizer, etc.) provide methods such as fit(), fit_transform(), and transform().

fit(): obtains statistics (maximum, minimum, average, etc.) of the passed data and keeps them in memory.
transform(): rewrites the data using the information acquired by fit().
fit_transform(): performs fit() and then transform().

The fit() function is used to learn parameters from the training dataset, and the transform() function reshapes the data based on the learned parameters. In other words, (1) for the training data, use the fit_transform() function, and (2) for the test data, use only the transform() function, because the conversion must be based on the result of fit() on the training data.
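As a minimal sketch of this rule with TfidfVectorizer (the documents and labels here are made-up placeholders):

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical word-separated documents and labels.
docs = ["白 黒 赤", "白 白 黒", "赤 黒", "白 赤"]
labels = [0, 1, 0, 1]

train_data, test_data, train_labels, test_labels = train_test_split(
    docs, labels, test_size=0.5, random_state=0)

vectorizer = TfidfVectorizer(token_pattern="(?u)\\b\\w+\\b")
# (1) Training data: learn the vocabulary and idf values, then convert.
train_vecs = vectorizer.fit_transform(train_data)
# (2) Test data: reuse what was learned from the training data, so only transform.
test_vecs = vectorizer.transform(test_data)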


Categorize corpus categories by random forest

Use what you have learned to classify the categories of the livedoor news corpus with a random forest.

As I wrote in "Classification of Japanese text", the flow is as follows.

1. Load and classify the livedoor news corpus: "Taking out the corpus"
2. Divide the data into training data and test data: "Theory and practice of the holdout method" in Introduction to Machine Learning
3. Vectorize the training and test data with tf-idf: "Weighting by BOW tf-idf (implementation)", "fit function"
4. Learn with a random forest: Supervised Classification, "Random Forest"
5. Implementation: "Categorize corpus categories by random forest"
6. Increase the accuracy: "Improving accuracy"
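A minimal sketch of steps 2 to 5 above, assuming docs and labels have already been created by load_livedoor_news_corpus() from "Taking out the corpus" (janome's word separation is passed in through the tokenizer argument):

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from janome.tokenizer import Tokenizer

t = Tokenizer()

def wakati(text):
    # Word-separate one raw Japanese document for TfidfVectorizer.
    return list(t.tokenize(text, wakati=True))

# 2. Split into training data and test data (docs, labels come from load_livedoor_news_corpus()).
train_data, test_data, train_labels, test_labels = train_test_split(
    docs, labels, test_size=0.2, random_state=0)

# 3. Vectorize with tf-idf: fit_transform on the training data, transform on the test data.
vectorizer = TfidfVectorizer(tokenizer=wakati)
train_vecs = vectorizer.fit_transform(train_data)
test_vecs = vectorizer.transform(test_data)

# 4. Learn with a random forest.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(train_vecs, train_labels)

# 5. Evaluate the predictions on the test data.
print(accuracy_score(test_labels, clf.predict(test_vecs)))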

Improving accuracy

We will work to improve the accuracy of the category prediction implemented in "Categorize corpus categories by random forest".

In the parameters of TfidfVectorizer(), you can set tokenizer= to a function; the text is then split by that function.

For example, if the following function is passed as the tokenizer= argument, the text is vectorized using only nouns, verbs, adjectives, and adjectival verbs (see the sketch after the function). This time we try to improve the category prediction accuracy by omitting particles and the like; depending on the model, it may be better to include particles.

from janome.tokenizer import Tokenizer
t=Tokenizer()
def tokenize(text):
    # text is a single document string here, so tokenize it directly
    tokens = t.tokenize(text)
    noun = []
    for token in tokens:
        # Take out the part of speech
        partOfSpeech = token.part_of_speech.split(",")[0]

        if partOfSpeech == "名詞":      # noun
            noun.append(token.surface)
        if partOfSpeech == "動詞":      # verb
            noun.append(token.surface)
        if partOfSpeech == "形容詞":    # adjective
            noun.append(token.surface)
        if partOfSpeech == "形容動詞":  # adjectival verb
            noun.append(token.surface)
    return noun
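For example (a sketch, assuming the train_data/test_data split and the scikit-learn imports from the random-forest sketch above), the function is then passed to TfidfVectorizer like this:

# Pass the part-of-speech filtering function through the tokenizer= argument.
vectorizer = TfidfVectorizer(tokenizer=tokenize)
train_vecs = vectorizer.fit_transform(train_data)
test_vecs = vectorizer.transform(test_data)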
