# [PYTHON] I took a look at the contents of sklearn (scikit-learn) (1) ~ What about the implementation of CountVectorizer? ~

## Introduction

In this article, I would like to take a look at the source code of sklearn. Recently, many books have been published that walk you through implementing machine learning algorithms yourself. I haven't read them, but I believe that if you read the sklearn source closely, you can learn to implement these algorithms yourself without buying such a book. Also, sklearn is a free package that is maintained daily by a large number of contributors, so the code is well optimized. It is carefully written code, and there's no reason not to use it as study material!!

That said, if a beginner suddenly opens the sklearn source, they honestly won't understand much. When I looked at it myself, I couldn't organize it in my head. So I waited, thinking "surely someone will explain the inside of the sklearn package?" ... but no such kind person appeared. So I decided to find some free time and read it myself!! I won't explain everything carefully; I'll just take a quick look. As a starting point, let's look at the relatively simple CountVectorizer.

## CountVectorizer

CountVectorizer is an algorithm that counts how often words occur. The word frequency is the number of times each word is used in a sentence, and it can be calculated easily with sklearn's CountVectorizer. The technique used to obtain these frequencies is called feature extraction. Feature extraction means turning the features of the training data into a vector; in this case, the word frequencies are the vector (the numerical values). (This description is quoted from another article.)
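Before reading the source, here is a minimal usage sketch so it's clear what we are trying to reproduce. The two sample documents are made up for illustration; `fit_transform`, `vocabulary_` and `toarray` are the usual sklearn / scipy calls.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["hoge hoge fuga", "fuga piyo"]  # made-up sample documents

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # scipy sparse matrix, one row per document

# Column index assigned to each word (cast to int for a plain comparison).
vocabulary = {term: int(idx) for term, idx in vectorizer.vocabulary_.items()}
counts = X.toarray().tolist()

print(sorted(vocabulary.items()))  # [('fuga', 0), ('hoge', 1), ('piyo', 2)]
print(counts)                      # [[1, 2, 0], [1, 0, 1]]
```

Each row is a document and each column is a word, so the first row says "fuga once, hoge twice, piyo never".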

## Let's take a look.

``````python
class CountVectorizer(_VectorizerMixin, BaseEstimator):
    def __init__(self, input='content', encoding='utf-8',
                 decode_error='strict', strip_accents=None,
                 lowercase=True, preprocessor=None, tokenizer=None,
                 stop_words=None, token_pattern=r"(?u)\b\w\w+\b",
                 ngram_range=(1, 1), analyzer='word',
                 max_df=1.0, min_df=1, max_features=None,
                 vocabulary=None, binary=False, dtype=np.int64):
        self.input = input
        self.encoding = encoding
        self.decode_error = decode_error
        self.strip_accents = strip_accents
        self.preprocessor = preprocessor
        self.tokenizer = tokenizer
        self.analyzer = analyzer
        self.lowercase = lowercase
        self.token_pattern = token_pattern
        self.stop_words = stop_words
        self.max_df = max_df
        self.min_df = min_df
        if max_df < 0 or min_df < 0:
            raise ValueError("negative value for max_df or min_df")
        self.max_features = max_features
        if max_features is not None:
            if (not isinstance(max_features, numbers.Integral) or
                    max_features <= 0):
                raise ValueError(
                    "max_features=%r, neither a positive integer nor None"
                    % max_features)
        self.ngram_range = ngram_range
        self.vocabulary = vocabulary
        self.binary = binary
        self.dtype = dtype
``````

That's what the constructor looks like. The class inherits from two classes, but in practice fit and fit_transform are all you need to use it. Also, every parameter has a default value, so there are no required parameters. Now let's look at fit first.
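One default in the signature that is worth a quick experiment is token_pattern. The sketch below applies the same regular expression with the standard `re` module; the sample sentence is made up, and lower-casing first is my assumption about the order of the steps (lowercase=True is the default).

```python
import re

# Default token_pattern copied from the constructor signature above.
token_pattern = re.compile(r"(?u)\b\w\w+\b")

text = "I saw a cat"  # made-up sample sentence
# Lower-case first, then pull out every run of two or more word characters.
tokens = token_pattern.findall(text.lower())

print(tokens)  # ['saw', 'cat']
```

So with the defaults, one-character words such as "I" and "a" never make it into the vocabulary.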

``````python
    def fit(self, raw_documents, y=None):
        """Learn a vocabulary dictionary of all tokens in the raw documents.

        Parameters
        ----------
        raw_documents : iterable
            An iterable which yields either str, unicode or file objects.

        Returns
        -------
        self
        """
        self._warn_for_unused_params()  # of interest; looked at below
        self.fit_transform(raw_documents)
        return self
``````

Interestingly, fit simply calls fit_transform. So the method that really matters is not fit but fit_transform. Before that, let's also look at _warn_for_unused_params, which is called on the line above it.
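The "fit just calls fit_transform" relationship can be sketched with a toy class. TinyVectorizer is a hypothetical name of mine, it splits on whitespace instead of using an analyzer, and it returns dense lists rather than a sparse matrix, so treat it as an illustration of the delegation only.

```python
class TinyVectorizer:
    """Hypothetical toy class; it only mirrors the fit / fit_transform relationship."""

    def fit_transform(self, raw_documents):
        # Learn the vocabulary and count the words in a single pass.
        vocabulary = {}
        rows = []
        for doc in raw_documents:
            counts = {}
            for word in doc.split():
                idx = vocabulary.setdefault(word, len(vocabulary))
                counts[idx] = counts.get(idx, 0) + 1
            rows.append(counts)
        self.vocabulary_ = vocabulary
        # Dense rows are fine for a toy example.
        return [[row.get(j, 0) for j in range(len(vocabulary))] for row in rows]

    def fit(self, raw_documents):
        # Same work as fit_transform; the matrix is simply thrown away.
        self.fit_transform(raw_documents)
        return self


docs = ["a b a", "b c"]
fitted = TinyVectorizer().fit(docs)
matrix = TinyVectorizer().fit_transform(docs)

print(fitted.vocabulary_)  # {'a': 0, 'b': 1, 'c': 2}
print(matrix)              # [[2, 1, 0], [0, 1, 1]]
```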

``````python
    def _warn_for_unused_params(self):

        if self.tokenizer is not None and self.token_pattern is not None:
            warnings.warn("The parameter 'token_pattern' will not be used"
                          " since 'tokenizer' is not None'")

        if self.preprocessor is not None and callable(self.analyzer):
            warnings.warn("The parameter 'preprocessor' will not be used"
                          " since 'analyzer' is callable'")

        if (self.ngram_range != (1, 1) and self.ngram_range is not None
                and callable(self.analyzer)):
            warnings.warn("The parameter 'ngram_range' will not be used"
                          " since 'analyzer' is callable'")
        if self.analyzer != 'word' or callable(self.analyzer):
            if self.stop_words is not None:
                warnings.warn("The parameter 'stop_words' will not be used"
                              " since 'analyzer' != 'word'")
            if self.token_pattern is not None and \
               self.token_pattern != r"(?u)\b\w\w+\b":
                warnings.warn("The parameter 'token_pattern' will not be used"
                              " since 'analyzer' != 'word'")
            if self.tokenizer is not None:
                warnings.warn("The parameter 'tokenizer' will not be used"
                              " since 'analyzer' != 'word'")
``````

As far as I can see, this method looks for combinations of parameters that don't make sense and warns about parameters that will be ignored. The important point is that this method does not belong to CountVectorizer itself but to _VectorizerMixin. So that class seems to take care of parameter checks and similar chores. Also, when a class is meant to be used through multiple inheritance, it is conventional to put Mixin at the end of its name, as in _VectorizerMixin. The suffix signals that the class is basically meant to be used in combination with other classes!! Now let's check the important fit_transform method.
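The mixin idea can be sketched in a few lines. All three class names below are hypothetical stand-ins of mine (they are not sklearn classes), and only the first of the warnings is reproduced.

```python
import warnings


class ParamCheckMixin:
    """Hypothetical mixin: it relies on attributes set by the class it is mixed into."""

    def _warn_for_unused_params(self):
        if self.tokenizer is not None and self.token_pattern is not None:
            warnings.warn("'token_pattern' will not be used since 'tokenizer' is not None")


class Base:
    """Hypothetical stand-in for BaseEstimator."""


class ToyVectorizer(ParamCheckMixin, Base):
    def __init__(self, tokenizer=None, token_pattern=r"(?u)\b\w\w+\b"):
        self.tokenizer = tokenizer
        self.token_pattern = token_pattern


# Record warnings instead of printing them, so they can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    ToyVectorizer(tokenizer=str.split)._warn_for_unused_params()

mro_names = [cls.__name__ for cls in ToyVectorizer.__mro__]
print(len(caught))  # 1
print(mro_names)    # ['ToyVectorizer', 'ParamCheckMixin', 'Base', 'object']
```

The mixin never defines tokenizer or token_pattern itself; it only works once it is combined with a class that sets them, which is exactly why it is not meant to be used alone.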

``````python
    def fit_transform(self, raw_documents, y=None):

        # Reject a bare string: raw_documents must be an iterable of documents.
        if isinstance(raw_documents, str):
            raise ValueError(
                "Iterable over raw text documents expected, "
                "string object received.")

        self._validate_params()  # checks that ngram_range is valid
        self._validate_vocabulary()  # of interest; looked at below
        max_df = self.max_df  # local aliases, so self. need not be repeated
        min_df = self.min_df
        max_features = self.max_features

        vocabulary, X = self._count_vocab(raw_documents,
                                          self.fixed_vocabulary_)  # of interest

        if self.binary:
            X.data.fill(1)

        if not self.fixed_vocabulary_:
            X = self._sort_features(X, vocabulary)

            n_doc = X.shape[0]
            max_doc_count = (max_df
                             if isinstance(max_df, numbers.Integral)
                             else max_df * n_doc)
            min_doc_count = (min_df
                             if isinstance(min_df, numbers.Integral)
                             else min_df * n_doc)
            if max_doc_count < min_doc_count:
                raise ValueError(
                    "max_df corresponds to < documents than min_df")
            X, self.stop_words_ = self._limit_features(X, vocabulary,
                                                       max_doc_count,
                                                       min_doc_count,
                                                       max_features)

            self.vocabulary_ = vocabulary

        return X  # the document-term count matrix
``````

Let's look at the first point of interest, the self._validate_vocabulary() method.

``````python
    def _validate_vocabulary(self):
        vocabulary = self.vocabulary  # the vocabulary passed to the constructor
        if vocabulary is not None:  # a vocabulary was supplied by the user
            if isinstance(vocabulary, set):
                vocabulary = sorted(vocabulary)
            if not isinstance(vocabulary, Mapping):
                # Not a dict (e.g. a list): build a term -> index mapping.
                vocab = {}
                for i, t in enumerate(vocabulary):
                    if vocab.setdefault(t, i) != i:  # duplicate term check
                        msg = "Duplicate term in vocabulary: %r" % t
                        raise ValueError(msg)
                vocabulary = vocab
            else:
                # Already a dict: indices must be unique and cover 0..n-1.
                indices = set(vocabulary.values())
                if len(indices) != len(vocabulary):
                    raise ValueError("Vocabulary contains repeated indices.")
                for i in range(len(vocabulary)):
                    if i not in indices:
                        msg = ("Vocabulary of size %d doesn't contain index "
                               "%d." % (len(vocabulary), i))
                        raise ValueError(msg)
            if not vocabulary:
                raise ValueError("empty vocabulary passed to fit")
            self.fixed_vocabulary_ = True  # the vocabulary is fixed
            self.vocabulary_ = dict(vocabulary)  # store it as a plain dict
        else:  # no vocabulary was passed to the constructor
            self.fixed_vocabulary_ = False  # it will be learned from the data
``````

This method also belongs to _VectorizerMixin. Basically, it checks whether a vocabulary (dictionary) was supplied and whether it is well formed. The vocabulary is an important element because each entry corresponds to one column of the output. If no vocabulary was passed as an initial parameter, self.fixed_vocabulary_ = False is executed. If one was passed, the two lines self.fixed_vocabulary_ = True and self.vocabulary_ = dict(vocabulary) are executed instead. Note that fitting never changes self.vocabulary (the constructor argument); the learned vocabulary is stored in self.vocabulary_, so these two lines run only when you pass a vocabulary yourself.
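To see the checks in action, here is a standalone sketch. to_vocabulary_dict is a hypothetical helper name of mine that follows the same steps as the method above but returns the dict instead of setting attributes.

```python
from collections.abc import Mapping


def to_vocabulary_dict(vocabulary):
    """Hypothetical helper mirroring the checks in _validate_vocabulary."""
    if isinstance(vocabulary, set):
        vocabulary = sorted(vocabulary)  # sets have no order, so sort first
    if not isinstance(vocabulary, Mapping):
        vocab = {}
        for i, term in enumerate(vocabulary):
            # setdefault returns the stored index; a mismatch means a repeat.
            if vocab.setdefault(term, i) != i:
                raise ValueError("Duplicate term in vocabulary: %r" % term)
        vocabulary = vocab
    else:
        indices = set(vocabulary.values())
        if len(indices) != len(vocabulary):
            raise ValueError("Vocabulary contains repeated indices.")
        for i in range(len(vocabulary)):
            if i not in indices:
                raise ValueError("Vocabulary of size %d doesn't contain index %d."
                                 % (len(vocabulary), i))
    if not vocabulary:
        raise ValueError("empty vocabulary passed to fit")
    return dict(vocabulary)


from_list = to_vocabulary_dict(["hoge", "fuga"])
from_set = to_vocabulary_dict({"piyo", "fuga"})
print(from_list)  # {'hoge': 0, 'fuga': 1}
print(from_set)   # {'fuga': 0, 'piyo': 1}

try:
    to_vocabulary_dict(["hoge", "hoge"])
except ValueError as exc:
    duplicate_error = str(exc)
    print(duplicate_error)  # Duplicate term in vocabulary: 'hoge'
```

A list keeps the order you gave it, a set is sorted first, and a dict is accepted as-is as long as its indices are exactly 0 to n-1.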

Now that we have seen how the vocabulary is checked, let's look at self._count_vocab(raw_documents, self.fixed_vocabulary_).

``````python
    def _count_vocab(self, raw_documents, fixed_vocab):
        """Create sparse feature matrix, and vocabulary where fixed_vocab=False
        """
        if fixed_vocab:  # a vocabulary was supplied
            vocabulary = self.vocabulary_
        else:  # no vocabulary yet: build one while counting
            # Add a new value when a new vocabulary item is seen
            vocabulary = defaultdict()
            # With this setting, vocabulary[word] automatically assigns the
            # next free index to an unseen word. Quite handy.
            vocabulary.default_factory = vocabulary.__len__

        analyze = self.build_analyzer()  # settings such as n-grams apply here
        j_indices = []
        indptr = []

        values = _make_int_array()
        indptr.append(0)
        for doc in raw_documents:  # one document at a time
            # doc is something like "hoge hogeee hogeeeee"
            feature_counter = {}
            for feature in analyze(doc):  # one word at a time
                # feature is something like "hoge"
                try:
                    # Integer index of the word, e.g. hoge: 0, hogeee: 1, ...
                    feature_idx = vocabulary[feature]
                    if feature_idx not in feature_counter:
                        feature_counter[feature_idx] = 1  # first occurrence
                    else:
                        feature_counter[feature_idx] += 1  # seen before: +1
                except KeyError:
                    # Ignore out-of-vocabulary items for fixed_vocab=True
                    continue

            j_indices.extend(feature_counter.keys())  # word indices (columns)
            values.extend(feature_counter.values())  # count of each word
            indptr.append(len(j_indices))
            # These three lists are the ingredients of a CSR sparse matrix.

        if not fixed_vocab:  # only when the vocabulary was built here
            # disable defaultdict behaviour
            vocabulary = dict(vocabulary)
            if not vocabulary:
                raise ValueError("empty vocabulary; perhaps the documents only"
                                 " contain stop words")

        if indptr[-1] > np.iinfo(np.int32).max:  # = 2**31 - 1
            if _IS_32BIT:
                raise ValueError(('sparse CSR array has {} non-zero '
                                  'elements and requires 64 bit indexing, '
                                  'which is unsupported with 32 bit Python.')
                                 .format(indptr[-1]))
            indices_dtype = np.int64

        else:
            indices_dtype = np.int32
        j_indices = np.asarray(j_indices, dtype=indices_dtype)
        indptr = np.asarray(indptr, dtype=indices_dtype)
        values = np.frombuffer(values, dtype=np.intc)

        X = sp.csr_matrix((values, j_indices, indptr),
                          shape=(len(indptr) - 1, len(vocabulary)),
                          dtype=self.dtype)
        X.sort_indices()
        return vocabulary, X  # the vocabulary and X (the sparse count matrix)
``````

I think many people are already familiar with this kind of algorithm for building a vocabulary. What it does is not that difficult. The part that is a bit complicated is how the counts are stored as a sparse matrix.
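The two tricky parts, the self-indexing defaultdict and the three lists that describe a CSR matrix, can be tried with plain Python lists. The documents below are made up and already tokenized, and csr_to_dense is a hypothetical helper I wrote only to display the result; the real code hands the three lists to scipy instead.

```python
from collections import defaultdict

docs = [["hoge", "hoge", "fuga"], ["fuga", "piyo"]]  # made-up, already tokenized

# Looking up an unseen word stores len(vocabulary) as its index.
vocabulary = defaultdict()
vocabulary.default_factory = vocabulary.__len__

j_indices, values, indptr = [], [], [0]
for doc in docs:
    feature_counter = {}
    for feature in doc:
        feature_idx = vocabulary[feature]
        feature_counter[feature_idx] = feature_counter.get(feature_idx, 0) + 1
    j_indices.extend(feature_counter.keys())  # column index of each non-zero entry
    values.extend(feature_counter.values())   # the counts themselves
    indptr.append(len(j_indices))             # where the next row starts

vocabulary = dict(vocabulary)  # turn off the auto-indexing behaviour


def csr_to_dense(values, j_indices, indptr, n_cols):
    """Expand the three CSR lists into nested lists (for display only)."""
    dense = []
    for row in range(len(indptr) - 1):
        line = [0] * n_cols
        for k in range(indptr[row], indptr[row + 1]):
            line[j_indices[k]] = values[k]
        dense.append(line)
    return dense


dense = csr_to_dense(values, j_indices, indptr, len(vocabulary))
print(vocabulary)  # {'hoge': 0, 'fuga': 1, 'piyo': 2}
print(j_indices)   # [0, 1, 1, 2]
print(values)      # [2, 1, 1, 1]
print(indptr)      # [0, 2, 4]
print(dense)       # [[2, 1, 0], [0, 1, 1]]
```

Row i of the matrix is described by the slice indptr[i]:indptr[i + 1] of j_indices and values, which is why indptr grows by one entry per document.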

Now for the final stage.

``````python
        if not self.fixed_vocabulary_:  # only when the vocabulary was learned
            X = self._sort_features(X, vocabulary)  # put the vocabulary in order

            n_doc = X.shape[0]
            max_doc_count = (max_df
                             if isinstance(max_df, numbers.Integral)
                             else max_df * n_doc)
            min_doc_count = (min_df
                             if isinstance(min_df, numbers.Integral)
                             else min_df * n_doc)
            if max_doc_count < min_doc_count:
                raise ValueError(
                    "max_df corresponds to < documents than min_df")
            X, self.stop_words_ = self._limit_features(X, vocabulary,
                                                       max_doc_count,
                                                       min_doc_count,
                                                       max_features)

            self.vocabulary_ = vocabulary  # the vocabulary is stored here

        return X
``````

The check above raises an error when max_doc_count < min_doc_count, because that combination makes no sense! self._limit_features() then reduces the number of dimensions (columns) according to how often each word appears. Finally, return X returns the count matrix as a scipy sparse matrix.
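The handling of min_df and max_df (an integer means a number of documents, a float means a fraction of the documents) can be sketched like this. doc_count_bounds and kept_terms are hypothetical helper names of mine, the document frequencies are made up, and the filter is a simplification: it ignores max_features and does not return the removed terms.

```python
import numbers


def doc_count_bounds(min_df, max_df, n_doc):
    """Hypothetical helper: turn min_df / max_df into absolute document counts."""
    max_doc_count = max_df if isinstance(max_df, numbers.Integral) else max_df * n_doc
    min_doc_count = min_df if isinstance(min_df, numbers.Integral) else min_df * n_doc
    if max_doc_count < min_doc_count:
        raise ValueError("max_df corresponds to < documents than min_df")
    return min_doc_count, max_doc_count


def kept_terms(doc_freq, min_doc_count, max_doc_count):
    """Keep the terms whose document frequency lies inside the bounds (simplified)."""
    return sorted(term for term, df in doc_freq.items()
                  if min_doc_count <= df <= max_doc_count)


# Made-up document frequencies: in how many of 4 documents each term appears.
doc_freq = {"the": 4, "hoge": 2, "piyo": 1}

low, high = doc_count_bounds(min_df=2, max_df=0.75, n_doc=4)
kept = kept_terms(doc_freq, low, high)
print(low, high)  # 2 3.0
print(kept)       # ['hoge']
```

With the defaults (min_df=1, max_df=1.0) the bounds become 1 and n_doc, so nothing is dropped by frequency.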

## At the end

I took a look at the inside of CountVectorizer in sklearn. The code wasn't that complicated, since no mathematical formulas appeared this time. Once you break it into parts, each step is surprisingly simple. I hope this article encourages you to read through the source yourself. Next, I think I'll try TF-IDF. I'll write it when I feel like it. That's all.