[PYTHON] Derivation and implementation of update equations for non-negative tensor factorization

The code I wrote

It's a modest piece of work, but I have published the code on GitHub.

References

- [1] Hirokazu Kameoka: "Introduction to Nonnegative Matrix Factorization: with Applications to Acoustic Signal Processing" (a visually intuitive explanation of NMF)
- [2] Hiroshi Sawada: "Basics of Nonnegative Matrix Factorization (NMF) and Its Application to Data/Signal Analysis" (an easy-to-follow commentary on NMF)
- [3] "Continued: Easy-to-Understand Pattern Recognition: An Introduction to Unsupervised Learning" (thanks to this book for the derivation of the formulas)
- [4] Tatsufumi Matsubayashi et al.: "Brand Choice Analysis of Purchasing Behavior Using Nonnegative Tensor Factorization" (describes how to apply NTF to real data)

References [1], [2], and [4] are currently available for free on the Internet, so they are highly recommended for readers who, like me, cannot freely download papers. In particular, if you are not yet familiar with Nonnegative Matrix Factorization (NMF), which underlies Nonnegative Tensor Factorization (NTF), references [1] and [2] are very easy to understand, so I recommend reading them before this article. Reference [3] develops the mathematics very carefully and is one of the most accessible books on Bayesian estimation, so it is also recommended.

My rough understanding

Nonnegative Tensor Factorization (NTF) is a decomposition that approximates a tensor by products of low-rank nonnegative factors. At the risk of oversimplifying, it is like factoring a large matrix into smaller matrices whose elements are all greater than or equal to 0. A tensor is an extension of the concepts of scalar, vector, and matrix, which can be regarded as 0th-, 1st-, and 2nd-order tensors, respectively. NTF is widely used in various research fields, and NMF is NTF restricted to 2nd-order tensors.
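As a quick illustration (a toy example of my own, not taken from the references), the order of a tensor corresponds to the number of axes of a NumPy array:

import numpy as np

# Order of a tensor = number of axes (ndim) of the array.
print(np.array(3.0).ndim)        # 0: scalar  (0th-order tensor)
print(np.zeros(4).ndim)          # 1: vector  (1st-order tensor)
print(np.zeros((4, 5)).ndim)     # 2: matrix  (2nd-order tensor)
print(np.zeros((4, 5, 6)).ndim)  # 3: 3rd-order tensor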

Even a huge amount of data that humans cannot grasp at a glance can be decomposed by NTF into low-rank factors that capture the relationships between multiple factors, turning the data into a form that is easy for humans to understand. For an intuitive picture, the rough sketch below shows that a 3rd-order tensor $ X $ can be approximately decomposed into three lower-rank factors $ T, U, V $ (matrices, vectors, etc.).

ntf_overview.png

NMF can only extract relationships between factors in two-dimensional data. NTF, on the other hand, can extract relationships among as many factors as the order of the tensor, so it can analyze more complicated relationships. This also means that data in four or more dimensions, which is hard to visualize directly, can be visualized indirectly in the form of trends.

NTF is formulated as the minimization of a distance (cost function) between the original tensor and the tensor reconstructed from the product of the approximate factors. Any optimization method could be used to solve this minimization, but it is known that the solution converges efficiently by repeatedly computing the expected value of the approximate tensor and minimizing the cost function (no inverse matrices, which tend to appear in tensor decompositions, need to be computed!), so this approach is commonly used. In practice, then, once you know this update formula, you can analyze data with NTF.

Derivation of update formula

For people who use machine learning methods such as NTF as mere tools, the process of manipulating the formulas tends to be neglected, but it is essential knowledge for extending the algorithm by designing cost functions and models that match the characteristics of the data. So I did my best to work through the derivation in my own understanding.

Roughly speaking, the update formula is derived from the condition that the partial derivative of the cost function is 0, which is where the cost function attains its minimum.

Definition of cost function

The $\beta$-divergence is often used as the cost function, but here we focus on one of its special cases, the generalized KL divergence. Let $u_{ki_{r}}^{(r)}$ be the elements of the vectors obtained by decomposing an $R$th-order tensor, with $k = 1, 2, ..., K$, $i_{r} = 1, 2, ..., I_{r}$, and $r = 1, 2, ..., R$. Here $K$ is the number of bases (the number of vectors per mode used for the approximation), $I_{r}$ is the number of elements of each vector, and $i_{r}$ is the index into each vector. Letting $L$ be the set of multi-indices ranging over the elements of every mode, the tensor obtained as the Kronecker product of the vectors is as follows.

\hat{x}_{\bf i} = \sum_{k=1}^{K} \prod_{r=1}^{R}
u_{ki_{r}}^{(r)}
\quad ({\bf i} = i_{1}i_{2}...i_{R} \in L) \quad (1)
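As a minimal sketch of equation (1) (the mode sizes and the number of bases $K$ below are arbitrary, and the function name reconstruct is my own), the reconstruction can be written with np.kron, in the same spirit as the implementation shown later:

import numpy as np
from functools import reduce

def reconstruct(factors):
    # factors[r] has shape (K, I_r): K basis vectors for mode r.
    # Returns x_hat with x_hat[i] = sum_k prod_r u^{(r)}_{k, i_r}, i.e. equation (1).
    shape = tuple(f.shape[1] for f in factors)
    x_hat = np.zeros(shape)
    for k in range(factors[0].shape[0]):
        # Kronecker product of the k-th vector of every mode, reshaped into a tensor.
        x_hat += reduce(np.kron, [f[k] for f in factors]).reshape(shape)
    return x_hat

rng = np.random.default_rng(0)
factors = [rng.random((2, n)) for n in (4, 5, 6)]  # K=2 bases, R=3 modes
print(reconstruct(factors).shape)                  # (4, 5, 6)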

The cost function based on the generalized KL divergence is:

D(X, \hat{X})
 = \sum_{{\bf i} \in L}(
   x_{\bf i}\log\frac{x_{\bf i}}{\hat{x}_{\bf i}}
    - x_{\bf i} + \hat{x}_{\bf i}) \quad (2)

This cost function is 0 when the tensor to be approximated, $X$, and the approximating tensor $\hat{X}$ are exactly equal. Substituting equation (1) into equation (2) and expanding the fraction inside the $\log$ into a difference gives the following equation.

D(X, \hat{X})
 = \sum_{{\bf i} \in L}(
   x_{\bf i}\log x_{\bf i}
   - x_{\bf i}\log \sum_{k=1}^{K}
   \prod_{r=1}^{R} u_{ki_{r}}^{(r)}
   - x_{\bf i} + \sum_{k=1}^{K}
   \prod_{r=1}^{R} u_{ki_{r}}^{(r)}) \quad (3)
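A minimal NumPy sketch of the cost function of equations (2) and (3); the small constant added inside the logarithm and the denominator is my own implementation choice to keep the value finite when elements are zero, not part of the derivation:

import numpy as np

def gkl_divergence(x, x_hat, eps=1e-12):
    # Generalized KL divergence D(X, X_hat) of equation (2), summed over all indices i in L.
    return np.sum(x * np.log((x + eps) / (x_hat + eps)) - x + x_hat)

rng = np.random.default_rng(0)
x = rng.random((4, 5, 6))
print(gkl_divergence(x, x))        # ~0 when X and X_hat coincide
print(gkl_divergence(x, x + 0.1))  # > 0 otherwise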

Partial derivative of cost function

We want to partially differentiate equation (3) with respect to $u_{ki_{r}}^{(r)}$, but the second term has the log-sum form $\log \sum_{k} f_{k}(u_{k{\bf i}})$, which makes the partial differentiation difficult. Therefore, we define the auxiliary variable

h_{k{\bf i}}^{0} = \frac{
  \prod_{r=1}^{R} u_{ki_{r}}^{0(r)}
}{
  \hat{x}_{\bf i}^{0}
} \quad (4)

which satisfies $\sum_{k=1}^{K} h_{k{\bf i}} = 1$ and $h_{k{\bf i}} \geq 0$. The superscript 0 on each variable indicates its value before the update. Multiplying the second term of equation (3) by $h_{k{\bf i}}/h_{k{\bf i}} (= 1)$ built from equation (4) and applying Jensen's inequality, an upper bound on the second term, now in the sum-log form $\sum_{k} \log f_{k}(u_{k{\bf i}})$, can be derived as follows.

- x_{\bf i} \log \sum_{k=1}^{K}
h_{k{\bf i}}^{0}
\frac{\prod_{r=1}^{R} u_{ki_{r}}^{(r)}}
{h_{k{\bf i}}^{0}} \leq - x_{\bf i}
\sum h_{k{\bf i}}^{0} \log
\frac{\prod_{r=1}^{R} u_{ki_{r}}^{(r)}}
{h_{k{\bf i}}^{0}} \quad (5)
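A small numerical check of my own (sizes and values are arbitrary) that the auxiliary variable of equation (4) sums to 1 over $k$ and that the Jensen bound of equation (5) holds at a single multi-index ${\bf i}$:

import numpy as np

rng = np.random.default_rng(1)
K, R = 3, 4
# Fix one multi-index i and collect u^{0(r)}_{k, i_r} into u0[k, r].
u0 = rng.random((K, R)) + 0.1
prod0 = np.prod(u0, axis=1)      # prod_r u^{0(r)}_{k, i_r} for each k
x_hat0 = prod0.sum()             # equation (1) at this index, before the update
h0 = prod0 / x_hat0              # equation (4)
print(h0.sum())                  # 1.0 (up to rounding), with h0 >= 0

# Jensen's inequality, equation (5), for new values u[k, r] and some x_i > 0.
x_i = 2.0
u = rng.random((K, R)) + 0.1
prod_new = np.prod(u, axis=1)    # prod_r u^{(r)}_{k, i_r}
lhs = -x_i * np.log(np.sum(h0 * prod_new / h0))
rhs = -x_i * np.sum(h0 * np.log(prod_new / h0))
print(lhs <= rhs + 1e-12)        # True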

Substituting equation (5) into equation (3), the upper bound of the cost function can be derived as in the following equation.

D(X, \hat{X}) \leq
\sum_{{\bf i} \in L}(
  x_{\bf i}\log x_{\bf i}
  - x_{\bf i} \sum h_{k{\bf i}}^{0} \log \frac{\prod_{r=1}^{R} u_{ki_{r}}^{(r)}}
  {h_{k{\bf i}}^{0}}
  - x_{\bf i} + \sum_{k=1}^{K}
  \prod_{r=1}^{R} u_{ki_{r}}^{(r)}) \\
  = \sum_{{\bf i} \in L}(
    - x_{\bf i} \sum h_{k{\bf i}}^{0} \log \prod_{r=1}^{R} u_{ki_{r}}^{(r)}
    + \sum_{k=1}^{K}
    \prod_{r=1}^{R} u_{ki_{r}}^{(r)}
   + C)
   \quad (6)

Here, the terms that do not contain $u_{ki_{r}}^{(r)}$ vanish in the partial differentiation that follows, so they are collected into the constant term $C$. To minimize this upper bound of the cost function, we set the partial derivative of equation (6) with respect to $u_{ki_{r}}^{(r)}$ to 0, which gives the following equation.

0 = \sum_{{\bf i} \in L_{-r}} (
  - x_{\bf i}
\frac{h_{k{\bf i}}^{0}}{u_{ki_{r}}^{(r)}} +
\prod_{\substack{\bar{r}=1 \\ \bar{r} \neq r} }^{R}
u_{ki_{\bar{r}}}^{(\bar{r})}
) \quad (7)

Here $L_{-r}$ is the set of multi-indices with the $r$th mode removed, i.e., $\bar{\bf i} = i_{1} \ldots i_{r-1} i_{r+1} \ldots i_{R} \in L_{-r}$. Rearranging equation (7) for $u_{ki_{r}}^{(r)}$, again using equation (4), gives the following update equation.

u_{ki_{r}}^{(r)} = u_{ki_{r}}^{0(r)} \cdot
\frac{
  \sum_{\bar{\bf i} \in L_{-r}}
  (x_{\bar{\bf i}}
    / \hat{x}_{\bar{\bf i}}^{0})
    \prod_{\substack{\bar{r}=1 \\ \bar{r} \neq r} }^{R}
    u_{ki_{\bar{r}}}^{0(\bar{r})}
}{
  \sum_{\bar{\bf i} \in L_{-r}}
  \prod_{\substack{\bar{r}=1 \\ \bar{r} \neq r} }^{R}
  u_{ki_{\bar{r}}}^{(\bar{r})}
} \quad (8)

The $u_{ki_{r}}^{(r)}$ obtained by minimizing this upper bound is then used to recompute $h_{k{\bf i}}^{0}$, and the upper bound is minimized again; repeating this sequential update brings $D(X, \hat{X})$ to its minimum. The vectors ${\bf u}_{k}^{(r)}$ obtained at that point are the approximate vectors into which the original tensor is factorized.
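Putting equations (1), (2), and (8) together, the following is a minimal self-contained sketch of the sequential multiplicative updates, written with np.einsum independently of the class shown in the next section; the function names, shapes, basis count, and iteration count are all my own choices for illustration:

import numpy as np

def update_mode(x, factors, r, eps=1e-12):
    # Apply the update equation (8) to the factor of mode r.
    # factors[q] has shape (K, I_q); x has shape (I_1, ..., I_R).
    R = len(factors)
    letters = 'abcdefgh'[:R]           # one index letter per mode ('z' stands for the basis index k)
    x_hat = np.einsum(','.join('z' + l for l in letters) + '->' + letters, *factors)  # equation (1)
    ratio = x / (x_hat + eps)          # x_i / x_hat_i
    others = [letters[q] for q in range(R) if q != r]
    other_factors = [factors[q] for q in range(R) if q != r]
    # Numerator of (8): sum over the other modes of (x / x_hat) times the other factors.
    numer = np.einsum(letters + ',' + ','.join('z' + l for l in others) + '->z' + letters[r],
                      ratio, *other_factors)
    # Denominator of (8): sum over the other modes of the product of the other factors.
    denom = np.prod([f.sum(axis=1) for f in other_factors], axis=0)
    return factors[r] * numer / (denom[:, None] + eps)

def factorize(x, n_bases, n_iter=200, seed=0):
    # Sequentially update every mode, as described above.
    rng = np.random.default_rng(seed)
    factors = [rng.random((n_bases, n)) for n in x.shape]
    for _ in range(n_iter):
        for r in range(len(factors)):
            factors[r] = update_mode(x, factors, r)
    return factors

# Toy usage on a random non-negative 3rd-order tensor.
x = np.random.default_rng(3).random((4, 5, 6))
factors = factorize(x, n_bases=2)
x_hat = np.einsum('za,zb,zc->abc', *factors)
print(np.sum(x * np.log((x + 1e-12) / (x_hat + 1e-12)) - x + x_hat))  # cost of equation (2)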

Implementation

Implemented in Python + NumPy. Use `pip` to install any missing libraries. If you are new to Python + NumPy, you are less likely to run into trouble if you use Anaconda, which includes many packages from the start.

If the input data is a matrix, NTF of a 2nd-order tensor, i.e. NMF, is also possible. I show only the code for the important update-formula part.

ntf.py


import numpy as np

# EPS is a small positive constant (defined elsewhere in the full source)
# used to avoid division by zero.

class NTF():
    # Various other member functions are omitted...
    # Update rule based on the generalized KL divergence (cf. equation (8)).
    def updateBasedOnGKLD(self, x, factor, index):
        # Create tensor partly.
        element = self.kronAlongIndex(factor, index)

        # Summation
        element = element.reshape(self.preshape[index])
        estimation = self.createTensorFromFactors()
        boost = x/(estimation + EPS)
        numer = self.sumAlongIndex(boost*element, factor, index)
        denom = np.sum(element)

        return numer/(denom + EPS)

    def kronAlongIndex(self, factor, index):
        # Kronecker product of all the factors except the one at index.
        element = np.array([1])
        for i1 in factor[:index]:
            element = np.kron(element, i1)
        for i1 in factor[index + 1:]:
            element = np.kron(element, i1)
        return element

    def sumAlongIndex(self, value, factor, index):
        for _ in np.arange(index):
            value = np.sum(value, axis=0)
        for _ in np.arange(index + 1, len(factor)):
            value = np.sum(value, axis=1)
        return value

    def createTensorFromFactors(self):
        tensor = self.composeTensor(self.factor)
        tensor = np.sum(tensor, axis=0)
        return tensor.reshape(self.shape)

    # The function that actually implements composeTensor.
    def composeTensorSerially(self, element):
        # Return a list (not a lazy map) so that np.sum in createTensorFromFactors works in Python 3.
        return list(map(self.kronAll, element))

    def kronAll(self, factor):
        element = np.array([1])
        for i1 in factor:
            element = np.kron(element, i1)
        return element

Even though it is Python, there are a lot of `for` statements and it looks messy, but it did not seem worth obsessing over here, so I left it as is. # If you know a cleaner or much faster way to write it, pull requests are welcome.

Experiment

I ran an experiment (or rather, a demo) that clusters random numbers generated from Gaussian distributions.

Experimental conditions

I scattered 100 samples each in a 3D space from the following two Gaussian distributions.

ID | Mean         | Covariance
A  | (10, 10, 20) | 5 \times E_{3}
B  | (30, 30, 30) | 5 \times E_{3}

Since we already know that the samples were drawn from two clusters, it is a somewhat contrived setup, but the number of bases is set to 2. In other words, the 3rd-order tensor is decomposed into approximate vectors with two bases.
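As a sketch of how such input data can be prepared (my own version; the random seed and the number of histogram bins are arbitrary, and the demo script in the repository may do this differently), the samples can be drawn from the two Gaussians in the table above and accumulated into a 3rd-order histogram tensor:

import numpy as np

rng = np.random.default_rng(0)
# 100 samples each from the two Gaussians of the table above (covariance 5 * E_3).
a = rng.multivariate_normal([10, 10, 20], 5 * np.eye(3), size=100)
b = rng.multivariate_normal([30, 30, 30], 5 * np.eye(3), size=100)
samples = np.vstack([a, b])

# Bin the samples into a 3rd-order histogram tensor; this tensor is the input to NTF.
x, edges = np.histogramdd(samples, bins=(10, 10, 10))
print(x.shape, x.sum())  # (10, 10, 10) 200.0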

To reproduce the experiment, run run_ntf_demo_3rd_order_2bases.py. Code for trying the same experiment with different parameters (e.g., the order of the tensor or the number of bases) is also provided under names like [run_ntf_demo_*.py](https://github.com/drumichiro/nmf-and-ntf/blob/master/run_ntf_demo_4th_order_2bases.py).

Result

The distribution (tensor) of samples reconstructed by the Kronecker product of the approximate vectors is shown on the right side of the figure below. The distribution of the original samples is shown on the left for comparison. Note that the values on each axis are the indices of the histogram bins, so they do not match the means and variances given in the experimental conditions.

Left: distribution of the original samples (ntf_demo_source_tensor.png)
Right: distribution of the samples reconstructed from the approximate vectors (ntf_demo_reconstruct.png)

You can see that the two distributions are close. The approximate vectors used to reconstruct the sample distribution are shown below.

ntf_demo_factors.png

The horizontal axis is the factor (mode) axis, and the vertical axis corresponds to the bases; you can see that each Gaussian distribution has been assigned to its own basis. With that, I was able to confirm that the code works as intended.

Supplement

I also applied NTF to coupon purchase history. That is a more practical example, so please see it here if you are interested.
