[Python] A look inside sklearn (scikit-learn) (1) ~ How is CountVectorizer implemented? ~

Introduction

In this article, I'd like to take a look inside sklearn. Recently, many books have been published that walk you through implementing machine learning algorithms yourself. I haven't read them, but I believe that if you read the sklearn source carefully, you can get used to it enough to implement things yourself without buying such a book. sklearn is also a free package maintained daily by a large number of contributors, so the code is well optimized and carefully written, and there is no reason not to use it to study programming. That said, a beginner who suddenly opens the sklearn source will probably not be able to follow it; even when I looked at it myself, I couldn't organize it in my head at first. I looked around for someone who had already explained the contents of the sklearn package, but couldn't find anything good. So I decided to find some free time and take a look myself. I won't explain everything in detail, just take a quick tour. As a starting point, let's look at the relatively simple CountVectorizer.

CountVectorizer

CountVectorizer counts the frequency of occurrence of words, that is, how many times each word appearing in a document has been used, and this can be calculated easily with sklearn's CountVectorizer. It is a form of feature extraction: the features of the training data are turned into a vector, and in this case the word occurrence counts are that vector (the numerical values).
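As a first impression, here is a minimal usage sketch with made-up example sentences; the document strings are placeholders for illustration, everything else is the public CountVectorizer API.

from sklearn.feature_extraction.text import CountVectorizer

# toy documents for illustration only
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse document-term matrix

print(vectorizer.vocabulary_)  # word -> column index, e.g. {'cat': 0, 'dog': 1, ...}
print(X.toarray())             # one row per document, one column per word

Each row of X holds the counts for one document, and each column corresponds to one word in the learned vocabulary.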

Let's take a look.

class CountVectorizer(_VectorizerMixin, BaseEstimator):
    def __init__(self, input='content', encoding='utf-8',
                 decode_error='strict', strip_accents=None,
                 lowercase=True, preprocessor=None, tokenizer=None,
                 stop_words=None, token_pattern=r"(?u)\b\w\w+\b",
                 ngram_range=(1, 1), analyzer='word',
                 max_df=1.0, min_df=1, max_features=None,
                 vocabulary=None, binary=False, dtype=np.int64):
        self.input = input
        self.encoding = encoding
        self.decode_error = decode_error
        self.strip_accents = strip_accents
        self.preprocessor = preprocessor
        self.tokenizer = tokenizer
        self.analyzer = analyzer
        self.lowercase = lowercase
        self.token_pattern = token_pattern
        self.stop_words = stop_words
        self.max_df = max_df
        self.min_df = min_df
        if max_df < 0 or min_df < 0:
            raise ValueError("negative value for max_df or min_df")
        self.max_features = max_features
        if max_features is not None:
            if (not isinstance(max_features, numbers.Integral) or
                    max_features <= 0):
                raise ValueError(
                    "max_features=%r, neither a positive integer nor None"
                    % max_features)
        self.ngram_range = ngram_range
        self.vocabulary = vocabulary
        self.binary = binary
        self.dtype = dtype

That's the constructor. This class inherits from two other classes, but in practice fit and fit_transform are enough when actually using it. Also, looking at the default values, there are no required parameters, so a bare CountVectorizer() works out of the box (a quick sketch follows below).
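A quick sketch with made-up documents, assuming only the public API: fit learns the vocabulary and transform counts against it, which ends up equivalent to calling fit_transform in one step.

from sklearn.feature_extraction.text import CountVectorizer

docs = ["hoge hoge fuga", "fuga piyo"]  # toy documents

vec = CountVectorizer()   # no arguments are required
vec.fit(docs)             # learn the vocabulary
X1 = vec.transform(docs)  # count occurrences against that vocabulary

X2 = CountVectorizer().fit_transform(docs)  # same thing in one step
assert (X1.toarray() == X2.toarray()).all()

Now, let's look at fit.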

    def fit(self, raw_documents, y=None):
        """Learn a vocabulary dictionary of all tokens in the raw documents.

        Parameters
        ----------
        raw_documents : iterable
            An iterable which yields either str, unicode or file objects.

        Returns
        -------
        self
        """
        self._warn_for_unused_params()  # point of interest: see below
        self.fit_transform(raw_documents)
        return self

Interestingly, fit_transform is used inside fit. So the important method is not fit itself but fit_transform. Before going there, let's also take a look at the method flagged above, _warn_for_unused_params.

    def _warn_for_unused_params(self):

        if self.tokenizer is not None and self.token_pattern is not None:
            warnings.warn("The parameter 'token_pattern' will not be used"
                          " since 'tokenizer' is not None'")

        if self.preprocessor is not None and callable(self.analyzer):
            warnings.warn("The parameter 'preprocessor' will not be used"
                          " since 'analyzer' is callable'")

        if (self.ngram_range != (1, 1) and self.ngram_range is not None
                and callable(self.analyzer)):
            warnings.warn("The parameter 'ngram_range' will not be used"
                          " since 'analyzer' is callable'")
        if self.analyzer != 'word' or callable(self.analyzer):
            if self.stop_words is not None:
                warnings.warn("The parameter 'stop_words' will not be used"
                              " since 'analyzer' != 'word'")
            if self.token_pattern is not None and \
               self.token_pattern != r"(?u)\b\w\w+\b":
                warnings.warn("The parameter 'token_pattern' will not be used"
                              " since 'analyzer' != 'word'")
            if self.tokenizer is not None:
                warnings.warn("The parameter 'tokenizer' will not be used"
                              " since 'analyzer' != 'word'")

As far as I can see, this method checks whether any of the given parameters will end up unused or contradictory. The important point here is that the method is not defined on CountVectorizer but on _VectorizerMixin, so it is this mixin that handles the parameter checks and more. When a class is designed to be combined with other classes through multiple inheritance, the suffix Mixin is added to its name, as in _VectorizerMixin: it signals that the class is basically meant to be used together with another class. A toy sketch of that idea follows.
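This is a toy sketch of the mixin pattern, not sklearn code; the class names are made up purely for illustration.

class GreetingMixin:
    # only adds behaviour; relies on attributes defined by the class it is mixed into
    def greet(self):
        return "hello, " + self.name

class Base:
    def __init__(self, name):
        self.name = name

class Greeter(GreetingMixin, Base):  # combine the mixin with a concrete base class
    pass

print(Greeter("sklearn").greet())  # -> hello, sklearn

In the same way, _VectorizerMixin supplies helper methods (parameter checks, the analyzer, and so on) that CountVectorizer combines with BaseEstimator. Now let's check the important fit_transform method.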

    def fit_transform(self, raw_documents, y=None):

        if isinstance(raw_documents, str):  # reject a single string; an iterable of documents is required
            raise ValueError(
                "Iterable over raw text documents expected, "
                "string object received.")

        self._validate_params()  # checks that ngram_range is valid
        self._validate_vocabulary()  # point of interest: see below
        max_df = self.max_df  # small point: copied into local variables to avoid writing self.xxx repeatedly
        min_df = self.min_df
        max_features = self.max_features

        vocabulary, X = self._count_vocab(raw_documents,
                                          self.fixed_vocabulary_)  # point of interest: see below
 
        if self.binary:
            X.data.fill(1)

        if not self.fixed_vocabulary_:
            X = self._sort_features(X, vocabulary) 

            n_doc = X.shape[0]
            max_doc_count = (max_df
                             if isinstance(max_df, numbers.Integral)
                             else max_df * n_doc)
            min_doc_count = (min_df
                             if isinstance(min_df, numbers.Integral)
                             else min_df * n_doc)
            if max_doc_count < min_doc_count:
                raise ValueError(
                    "max_df corresponds to < documents than min_df")
            X, self.stop_words_ = self._limit_features(X, vocabulary,
                                                       max_doc_count,
                                                       min_doc_count,
                                                       max_features)

            self.vocabulary_ = vocabulary

        return X  # returns the sparse document-term matrix


Let's take a look at the first point of interest, the self._validate_vocabulary() method.

    def _validate_vocabulary(self):
        vocabulary = self.vocabulary  # the vocabulary passed as a parameter (if any)
        if vocabulary is not None:  # a vocabulary was passed in
            if isinstance(vocabulary, set): 
                vocabulary = sorted(vocabulary)
            if not isinstance(vocabulary, Mapping):  # not a dict-like mapping: build one from the iterable
                vocab = {}
                for i, t in enumerate(vocabulary):
                    if vocab.setdefault(t, i) != i:  # detect duplicate terms in the vocabulary
                        msg = "Duplicate term in vocabulary: %r" % t
                        raise ValueError(msg)
                vocabulary = vocab
            else:  # already a mapping: check that its indices are consistent
                indices = set(vocabulary.values())
                if len(indices) != len(vocabulary):
                    raise ValueError("Vocabulary contains repeated indices.")
                for i in range(len(vocabulary)):
                    if i not in indices:
                        msg = ("Vocabulary of size %d doesn't contain index "
                               "%d." % (len(vocabulary), i))
                        raise ValueError(msg)
            if not vocabulary:
                raise ValueError("empty vocabulary passed to fit")
            self.fixed_vocabulary_ = True  # a vocabulary was supplied by the user
            self.vocabulary_ = dict(vocabulary)  # store it as a plain dict
        else:  # no vocabulary was passed as a parameter
            self.fixed_vocabulary_ = False  # the vocabulary will be learned from the documents

This method belongs to _VectorizerMixin. Basically, it checks whether a vocabulary (the dictionary of terms) has been provided and whether it is well formed. The vocabulary is an important element because it determines the output columns. If no vocabulary was given in the initial parameters, self.fixed_vocabulary_ = False is set and the vocabulary is learned from the documents instead. If a vocabulary was supplied, the two lines self.fixed_vocabulary_ = True and self.vocabulary_ = dict(vocabulary) are executed.
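Here is a small sketch of the two paths, with made-up documents; fixed_vocabulary_ and vocabulary_ are the attributes set by the method above.

from sklearn.feature_extraction.text import CountVectorizer

docs = ["hoge fuga fuga", "piyo hoge"]  # toy documents

# no vocabulary given: it is learned during fit
v_auto = CountVectorizer()
v_auto.fit(docs)
print(v_auto.fixed_vocabulary_)  # False
print(v_auto.vocabulary_)        # learned word -> index mapping

# vocabulary given up front: only these terms become columns
v_fixed = CountVectorizer(vocabulary={"hoge": 0, "fuga": 1})
v_fixed.fit(docs)
print(v_fixed.fixed_vocabulary_)  # True
print(v_fixed.vocabulary_)        # {'hoge': 0, 'fuga': 1}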

Now that we have seen how the vocabulary is validated, let's check self._count_vocab(raw_documents, self.fixed_vocabulary_).

    def _count_vocab(self, raw_documents, fixed_vocab):
        """Create sparse feature matrix, and vocabulary where fixed_vocab=False
        """
        if fixed_vocab:  # a vocabulary was supplied up front
            vocabulary = self.vocabulary_
        else:  # no vocabulary yet: build one as we go
            # Add a new value when a new vocabulary item is seen
            vocabulary = defaultdict()
            vocabulary.default_factory = vocabulary.__len__  # with this setting, vocabulary[word] automatically assigns a new index to an unseen word; quite handy

        analyze = self.build_analyzer()  # settings such as ngram_range are applied here
        j_indices = []
        indptr = []

        values = _make_int_array()
        indptr.append(0)
        for doc in raw_documents:  # iterate over the documents one by one
            # doc is something like "hoge hogeee hogeeeee"
            feature_counter = {}
            for feature in analyze(doc):  # each feature is one token (word)
                # feature is something like "hoge"
                try:
                    feature_idx = vocabulary[feature]  # word -> index, e.g. hoge: 0, hogee: 1, hogeee: 2
                    if feature_idx not in feature_counter:
                        feature_counter[feature_idx] = 1  # first occurrence of this word in this document
                    else:
                        feature_counter[feature_idx] += 1  # seen before in this document: count up
                except KeyError:
                    # Ignore out-of-vocabulary items for fixed_vocab=True
                    continue

            j_indices.extend(feature_counter.keys())  # column indices (word indices)
            values.extend(feature_counter.values())  # how often each word appears in this document
            indptr.append(len(j_indices))
            # these three arrays are exactly what is needed to build a sparse CSR matrix

        if not fixed_vocab:  # only when the vocabulary was learned from the documents
            # disable defaultdict behaviour
            vocabulary = dict(vocabulary)
            if not vocabulary:
                raise ValueError("empty vocabulary; perhaps the documents only"
                                 " contain stop words")

        if indptr[-1] > np.iinfo(np.int32).max:  # = 2**31 - 1
            if _IS_32BIT:
                raise ValueError(('sparse CSR array has {} non-zero '
                                  'elements and requires 64 bit indexing, '
                                  'which is unsupported with 32 bit Python.')
                                 .format(indptr[-1]))
            indices_dtype = np.int64

        else:
            indices_dtype = np.int32
        j_indices = np.asarray(j_indices, dtype=indices_dtype)
        indptr = np.asarray(indptr, dtype=indices_dtype)
        values = np.frombuffer(values, dtype=np.intc)

        X = sp.csr_matrix((values, j_indices, indptr),
                          shape=(len(indptr) - 1, len(vocabulary)),
                          dtype=self.dtype)
        X.sort_indices()
        return vocabulary, X  # the vocabulary and X (the sparse count matrix)

I think many people are familiar with this kind of dictionary-building algorithm; what it is doing is not that difficult. The slightly tricky part is that the counts are packed into a sparse matrix. A small standalone sketch of both ideas follows.
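The following is a standalone sketch, not sklearn code: first the defaultdict trick used to assign indices to new words, then a hand-built CSR matrix using the same (values, j_indices, indptr) layout as _count_vocab; the words and numbers are made up.

from collections import defaultdict

import numpy as np
import scipy.sparse as sp

# the defaultdict trick: default_factory is the dict's own __len__, so an
# unseen word gets the next free index the first time it is looked up
vocabulary = defaultdict()
vocabulary.default_factory = vocabulary.__len__
for word in ["hoge", "fuga", "hoge", "piyo"]:
    idx = vocabulary[word]
print(dict(vocabulary))  # {'hoge': 0, 'fuga': 1, 'piyo': 2}

# hand-built CSR matrix: doc 0 = "hoge hoge fuga", doc 1 = "fuga piyo"
values    = np.array([2, 1, 1, 1])  # counts
j_indices = np.array([0, 1, 1, 2])  # column (word index) for each count
indptr    = np.array([0, 2, 4])     # where each document's entries start
X = sp.csr_matrix((values, j_indices, indptr), shape=(2, 3))
print(X.toarray())  # [[2 1 0]
                    #  [0 1 1]]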

Now for the final stage, back in fit_transform.

        if not self.fixed_vocabulary_:  # runs when the vocabulary was learned from the documents
            X = self._sort_features(X, vocabulary)  # sort the vocabulary and reorder the columns to match

            n_doc = X.shape[0]
            max_doc_count = (max_df
                             if isinstance(max_df, numbers.Integral)
                             else max_df * n_doc)
            min_doc_count = (min_df
                             if isinstance(min_df, numbers.Integral)
                             else min_df * n_doc)
            if max_doc_count < min_doc_count:
                raise ValueError(
                    "max_df corresponds to < documents than min_df")
            X, self.stop_words_ = self._limit_features(X, vocabulary,
                                                       max_doc_count,
                                                       min_doc_count,
                                                       max_features)

            self.vocabulary_ = vocabulary  # store the learned vocabulary

        return X

The code above raises an error if max_doc_count < min_doc_count, which would be inconsistent. In self._limit_features(), the number of features is reduced according to document frequency (max_df, min_df) and max_features. Finally, return X returns the scipy sparse matrix.
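As a hedged example of that pruning step (toy documents, standard parameters only): with min_df=2, a word that appears in fewer than two documents is dropped from the vocabulary and recorded in stop_words_.

from sklearn.feature_extraction.text import CountVectorizer

docs = ["hoge fuga", "hoge piyo", "hoge fuga"]  # 'piyo' appears in only one document

v = CountVectorizer(min_df=2)  # keep only words present in at least 2 documents
X = v.fit_transform(docs)

print(v.vocabulary_)  # {'fuga': 0, 'hoge': 1} -> 'piyo' was pruned
print(v.stop_words_)  # {'piyo'}
print(X.shape)        # (3, 2)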

In closing

I took a look at the contents of CountVectorizer in sklearn. The program wasn't that complicated, since no mathematical formulas appeared this time. Once you break it into parts, each step is simple. Use this article as a starting point and take a look at how it is built yourself. Next, I think I'll try TF-IDF. I'll write it up when I feel like it. That's all.
