[PYTHON] I will explain about the neural network "Differentiable Neural Computing (DNC)" with external memory.

About this article

A paper submitted by DeepMind to Nature Hybrid computing using a neural network with dynamic external memory Describes the "Differentiable Neural Computing (DNC)" used in. The explanation of logic is the main, but I will also introduce an implementation example by Python --Chainer.

What is Differentiable Neural Computing (DNC)?

The desire to process sequential data with Neural Nets seems to have been around for a long time, and the most standard is Recurrent Neural Nets (RNNs). However, RNN has a "vanishing gradient problem", and the model that overcomes it is called Long Short Term Memory (LSTM). What is short and what is long is that the input data stays inside the Neural Net until it is output. You can think of this as something like "Memory" in a way. However, it can only be memorized for a "short time" from input to output. So it's called "short term momory". It seems that it is named Long Short Term Memory in the sense that "this short term memory can be stored for a long time". Then, I feel that the really long Meomory may be outside the learning machine instead of inside. I don't know if it was born from such a thought, but Differentiable Neural Computing (DNC) is to prepare a Memory outside the learning machine and let it learn including how to use the Memory. It means "Differentiable", but if you want to train with back propagation, you need to be able to calculate the differentiation. Therefore, it means that "differential calculation is possible" including operations on external memory.

Logic explanation

The big picture of DNC

The components of the DNC are the main body Controller and the external Memory. According to the paper, the overall data flow can be represented as shown in the figure below. DNC_concept.jpg DNC deals with sequential data, so the current time-step is represented as $ t $. The previous step is $ t-1 $, the next step is $ t + 1 $, and the subscript $ t $ in the figure indicates which time-step each variable was generated in.

Let's take a step-by-step look at the data flow. Numbers (1) to (4) are assigned in the figure, so follow them in this order. (1): Input data $ x_t $ is input from "Data Set". $ \ Boldsymbol {\ chi} \ _ t = [\ boldsymbol {x} \ _ t, \ boldsymbol, which is the sum of this and the output from Memory in the previous step $ \ boldsymbol {r} \ _ {t-1} $ Let {r} \ _ {t-1}] $ be the input to the Controller. (2): Since the output $ \ boldsymbol {h} \ _t $ from Controller is obtained, divide it into two routes. One is $ \ boldsymbol {v} \ _t $ for "Out Put" and the other is "interface vector" $ \ boldsymbol {\ xi} \ _t $ for controlling Memory. In the paper, these are the linear transformations of $ \ boldsymbol {h} \ _t $ $ \ boldsymbol {v} \ _t = W \ _y \ boldsymbol {h} \ _t $ and $ \ boldsymbol {\ xi} \ _ t = It is written as W \ _ {\ xi} \ boldsymbol {h} \ _ t $. (3): Data is written to Memory based on "interface vector" $ \ boldsymbol {\ xi} \ _ t $, and the state of Memory is updated. It also reads from Memory, giving you a "read vector" $ \ boldsymbol {r} _t $. (4): "read vector" $ \ boldsymbol {r} \ _ t $ is added to the output to "Out Put", while being sent to the input to the Controller in the next step. (5): Combine the output from Controller and the output from Memory and go to "Out Put" $ \ boldsymbol {y} \ _ t = \ boldsymbol {v} \ _ t + W \ _r \ boldsymbol {r} \ _ t $ Is output. As mentioned above, 1 step is completed by (1) to (5).

As a Controller, any learning machine that receives multidimensional input and returns multidimensional output can be used, but it seems that (Recurrent) Neural Net and LSTM are often used. (Recurrent) Neural Net and LSTM will be explained elsewhere, but in this article, I will explain "reading and writing external memory" which is the biggest feature of DNC.

About reading and writing external memory

Now let's talk about reading and writing external memory. It's a hassle, so I'll just call it Memory below.

Memory structure

First, let's understand the structure of Memory. As Memory, we use a matrix of $ N $ × $ W $. That is, consider that addresses are assigned from $ 1 $ to $ N $, and each address has a slot that can hold a $ W $ -dimensional numeric vector. Since the state of Memory is updated every moment, we will write the matrix representing Memory at time-step $ t $ as $ M \ _t $. However, the matrix dimensions $ N $ × $ W $ are assumed to be fixed, where $ N $ is always the total number of memory slots (total number of addresses) and $ W $ is the length of the slots (of the numeric vector to store). Dimension).

Details of interface vector

As mentioned in "Overall data flow", the operation of Memory by Controller is It is done through "interface vector" $ \ boldsymbol {\ xi} \ _t $. Since there are multiple tasks that can be considered, such as reading / writing / addressing, even if you say Memory operation in one bite, $ \ boldsymbol {\ xi} \ _t $ is decomposed into each component by function.

\begin{align}
\boldsymbol{\xi}_t = \Big[ \boldsymbol{k}_t^{r, 1},...,\boldsymbol{k}_t^{r, R}, \; \hat{\beta}_t^{r, 1},...,\hat{\beta}_t^{r, R}, \; \boldsymbol{k}_t^w, \; \hat{\beta_t}^w, \hat{\boldsymbol{e}}_t, \; \boldsymbol{\nu}_t, \; \hat{f}_t^1,...,\hat{f}_t^R, \; \hat{g}_t^a, \; \hat{g}_t^w, \; \hat{\boldsymbol{\pi}}_t^1,...,\hat{\boldsymbol{\pi}}_t^R \Big]
\end{align}

Those with the'$ \ hat {} $'symbol will undergo further scale conversion. There are many types and it is complicated, so I will summarize it once. The value in (,) is (dimension, number). ・ "Read key" $ (W, R) ;: ; \ boldsymbol {k} \ _ t ^ {r, 1}, ..., \ boldsymbol {k} \ _ t ^ {r, R} $ ・ "Read strength" $ (1, R) ;: ; \ beta \ _ t ^ {r, 1}, ..., \ beta \ _ t ^ {r, R} ; ; \ big (\ beta) \ _ t ^ {r, i} = \ text {oneplus} (\ hat {\ beta} \ _ t ^ {r, i}) \ big) $ ・ "Write key" $ (W, 1) ;: ; \ boldsymbol {k} \ _ t ^ w $ ・ "Write strength" $ (1,1) ;: ; \ beta \ _ t ^ w = \ text {oneplus} (\ hat {\ beta} \ _ t ^ w) $ ・ "Erase vector" $ (W, 1) ;: ; \ boldsymbol {e} \ _ t = \ sigma (\ hat {\ boldsymbol {e}} _t) $ ・ "Write vector" $ (W, 1) ;: ; \ boldsymbol {\ nu} \ _ t $ ・ "Free gate" $ (1, R) ;: ; f \ _ t ^ 1, ..., f \ _ t ^ R ; ; \ big (f \ _ t ^ i = \ sigma (\ hat { f} \ _ t ^ i) \ big) $ ・ "Allocation gate" $ (1, 1) ;: ; g ^ a \ _ t = \ sigma (\ hat {g} ^ a \ _ t) $ ・ "Write gate" $ (1, 1) ;: ; g ^ w \ _ t = \ sigma (\ hat {g} ^ w \ _ t) $ ・ "Read mode" $ (3, R) ;: ; \ boldsymbol {\ pi} \ _ t ^ 1, ..., \ boldsymbol {\ pi} \ _ t ^ R ; ; \ big (\ boldsymbol {\ pi} \ _ t ^ i = \ text {softmax} (\ hat {\ boldsymbol {\ pi}} \ _ t ^ i) \ big) $ The function used for scale conversion is

\begin{align}
& \text{oneplus}(x) = 1+\text{log}(1+e^x) \in [1, \infty) \\
& \sigma(x) = \frac{1}{1+e^{-x}} \in [0, 1] \\
& \text{softmax}(\boldsymbol{x}) = \frac{e^\boldsymbol{x}}{\sum_ie^{x_i}} \in [0, 1]
\end{align}

Is defined as. Note that if you take a vector value as an argument of $ \ text {oneplus} (x) $, $ \ sigma (x) $, $ \ text {exp} (x) $, it means the action of each component. please.

I'm just listing the definitions here, so I don't think you know anything. Below, we will explain in order how to use these components to control Memory.

Before we move on, let me add one point. $ R $ represents the number of reads from Memory. The DNC in the paper is set to read from Memory multiple times in one step, and the number of times is $ R $. On the other hand, writing is set only once per step. You can also see that the dimensions of the interface vector are determined by $ W $ and $ R $, which is $ WR + 3W + 5R + 3 $.

Memory read / write procedure (overview)

To read or write Memory, you need to specify the address of the Memory slot to read or write. However, in order to be able to learn by back propagation, differentiability, at least the continuity of operations, is required. Therefore, instead of specifying only one specific address, we give a width and weight which address is read / written intensively. Hereafter, we will call this weighting "read / write weighting", but the point is how to find these.

Once you've asked for "read / write weighting", the read / write operation itself is simple. I'll leave the details later, but let's get an overview. First, consider the case of reading. Now suppose you get "read weighting" $ \ boldsymbol {w} ^ r \ _t $. The dimension of this vector is $ N $, which is the same as the total number of Memory slots. Each component represents how "strongly" it reads the information in the Memory slot of the corresponding address. That is, the information read is expressed as $ M \ _t ^ T \ boldsymbol {w} ^ r \ _t $ as the product of the Memory matrix and the weight vector. As mentioned earlier, in the paper, reading is performed $ R $ times in one step, so $ R $ of "read weighting" is also $ \ {\ boldsymbol {w} \ _ t ^ {r, i} \} \ _ {i = 1, ..., R} $ Consists. Also, as a natural request to prevent amplification due to reading.

\begin{align}
& 0 \leqq \boldsymbol{w}_t^{r,i}[n] \leqq 1 \;\; (n=1,2,...,N)\\
& \sum_{n=1}^N \boldsymbol{w}_t^{r,i}[n] \leqq 1 \;\; (i=1,2,...,R)
\end{align}

Shall impose. Similarly, consider writing. Suppose you get "write weighting" $ \ boldsymbol {w_t ^ w} $. We also have a vector $ \ boldsymbol {\ nu} \ _t $ that represents the information we want to write. The dimension of $ \ boldsymbol {w_t ^ w} $ is $ N $, which means how "strongly" $ \ boldsymbol {\ nu} \ _t $ is written to the corresponding Memory slot. That is, the Memory matrix $ M \ _ {t-1} $ is written by adding the matrix $ \ boldsymbol {w_t ^ w} \ boldsymbol {\ nu} \ _t ^ T $ to update Memory. For each ingredient

\begin{align}
& 0 \leqq \boldsymbol{w}_t^w[n] \leqq 1 \;\; (n=1,2,...,N)\\
& \sum_{n=1}^N \boldsymbol{w}_t^w[n] \leqq 1 
\end{align}

The point of imposing is the same.

Memory read / write procedure (details)

In the dissertation, the following four steps are taken, including reading and writing to Memory itself. ① Update write weighting ② Write to Memory ③ Update read weighting ④ Read from Memory

Below, we will explain along this step, but even in the implementation example, the processing is summarized in this order, so please refer to it. In addition, read / write weighting $ \ {\ boldsymbol {w} \ _ {t-1} ^ {r, i} \} \ _ {i = 1, ..., R} in the previous time-step It is assumed that $ / $ \ boldsymbol {w} \ _ {t-1} ^ {w} $ has been obtained.

① Update write weighting

There are two ways to choose the address to write to, and write weighting consists of two elements. That is, (1) select the write destination slot based on the "key" input through the interface vector, and (2) select the slot where the used information remains based on the read status up to the previous time-step. , ,.

First, let's start with (1). Compute the similarity by matching the "write key" $ \ boldsymbol {k} \ _t ^ w $ in the interface vector $ \ boldsymbol {\ xi} \ _t $ with the information held in each Memory slot. We also use "write strength" $ \ beta_t ^ w $ as a parameter to adjust the sharpness of the weighting peak. This calculation is also common for finding read weighting, so for the $ N $ × $ W $ matrix $ M $, $ W $ order vector $ \ boldsymbol {k} $, and the scalar value $ \ beta $, Define the operation.

\begin{align}
\mathcal{C}(M, \boldsymbol{k}, \beta)[n] = \frac{\text{exp}\big(\mathcal{D}(\boldsymbol{k}, M[n,:])\beta \big)}{\sum_m\text{exp}\big(\mathcal{D}(\boldsymbol{k}, M[m,:])\beta \big)}
\end{align}

Here, $ \ mathcal {D} $ is the distance between two vectors, and there are various ways to take it, but the cosine similarity is according to the paper.

\begin{align}
\mathcal{D}(\boldsymbol{u}, \boldsymbol{v}) = \frac{\boldsymbol{u} \cdot \boldsymbol{v}}{||\boldsymbol{u}|| \; ||\boldsymbol{v}||}
\end{align}

I will leave it as. At the limit of $ \ beta \ rightarrow \ infinity $, $ \ mathcal {C} $ has only one sharp peak. By the way, using the operation defined here, write weighting by the method of (1) is

\begin{align}
\boldsymbol{c}_t^w=\mathcal{C}(M_{t-1}, \boldsymbol{k}_t^w, \beta_t^w)
\end{align}

Is calculated.

Next is the calculation by the method (2). It's a little complicated, but we will ask for it in the following order.

(2) -1. Configuration of "retention vector" $ \ boldsymbol {\ psi} \ _t $ It is natural to want to use the Memory slot whose information was read in the previous time-step $ t-1 $ as a used slot for writing. However, if you still have the information you need, don't overwrite it. In this way, the flag (weight because it is actually a continuous value from 0 to 1) is "free gate" $ f \ _ t ^ i \ whether the Memory slot read in the previous time-step can be really released. ; (i = 1, ..., R) $. That is, consider $ f \ _t ^ i \ boldsymbol {w} \ _ {t-1} ^ {r, i} $ in component units, and consider "$ f \ _t ^ iw \ _ {t-1} ^ { r, i} $ $ \ simeq 1 $ $ \ Leftrightarrow f \ _t ^ i \ simeq 1 $ and $ w \ _ {t-1} ^ {r, i} \ simeq 1 $ "is read and open OK So I will overwrite it. Also, "$ f \ _t ^ iw \ _ {t-1} ^ {r, i} $ $ \ simeq 0 $ $ \ Leftrightarrow f \ _t ^ i \ simeq 0 $ or $ w \ _ {t-1} If it is "^ {r, i} \ simeq 0 $", it cannot be overwritten because it has not been read in the previous time-step, or even if it has been read, the release flag is not set. Now, I was thinking about each reading, but I set the "retention vector" $ \ boldsymbol {\ psi} \ _t $, which is the in-use (non-overwriteable) flag for all $ R $ times.

\begin{align}
\boldsymbol{\psi}_t = \prod_{i=1}^R\big(1-f_t^i\boldsymbol{w}_{t-1}^{r, i}\big)
\end{align}

Defined in. The product is the product of each component. Since the product is calculated after calculating the difference from 1, the memory slot that is judged to be overwritable even once in $ R $ is judged to be overwritable. On the other hand, it is determined that it is in use (non-overwrite) $ \ psi \ _t \ simeq 1 $ only when it is determined that it cannot be overwritten for all $ R $ times.

(2) -2. Configuration of "usage vector" $ \ boldsymbol {u} \ _t $ In (2) -1, I thought about $ \ boldsymbol {\ psi} \ _t $ as a flag in use (correctly weight), but I was only thinking about reading in the previous time-step. In reality, if a write is made, it should be more in use, and you should consider the situation at the time-step before the last time-step. Based on these, the weight "memory usage vector" $ \ boldsymbol {u} _t $, which represents the degree of use of each Memory slot, is defined by the following update formula.

\begin{align}
\boldsymbol{u}_t &= \big(\boldsymbol{u}_{t-1} + \boldsymbol{w}_{t-1}^w - \boldsymbol{u}_{t-1} \circ \boldsymbol{w}_{t-1}^w \big) \circ \boldsymbol{\psi}_t \\
\boldsymbol{u}_0 &= 0, \;\; \boldsymbol{w}_0^w = 0
\end{align}

$ \ Circ $ represents the product of each component. The third item in (...) is a correction to adjust each component of $ \ boldsymbol {u} \ _t $ so that it does not exceed $ 1 $. When $ u \ _t = 1 $ is reached, the weight update by writing will stop, and even if $ u \ _t> 1 $, the value will be reduced by the update.

(2) -3. Configuration of "allocation weighting" Determines the address of the Memory slot to write to, based on the degree of use $ \ boldsymbol {u} \ _t $. Basically, it is written in the place where the degree of use is low, but it is inclined. First, let $ \ boldsymbol {\ phi} _t $ be the vector consisting of the indexes when the components of $ \ boldsymbol {u} \ _t $ are arranged in ascending order of value. In other words

\begin{align}
\boldsymbol{u}_t[\boldsymbol{\phi}_t[1]] \leqq \boldsymbol{u}_t[\boldsymbol{\phi}_t[2]] \leqq 
\boldsymbol{u}_t[\boldsymbol{\phi}_t[3]] \leqq \cdots \leqq 
\boldsymbol{u}_t[\boldsymbol{\phi}_t[N]]
\end{align}

is. Use this to "allocation weighting" $ \ boldsymbol {a} \ _t $

\begin{align}
\boldsymbol{a}_t[\boldsymbol{\phi}_t[n]] = \big(1-\boldsymbol{u}_t[\boldsymbol{\phi}_t[n]]\big) \prod_{m=1}^{n-1}\boldsymbol{u}_t[\boldsymbol{\phi}_t[m]]
\end{align}

Defined in. The product is the product of each component. Basically, it is $ 1- \ boldsymbol {u} \ _t $, so it is a weight that indicates the degree of used (writable).

As mentioned above, the results of (1) and (2) are integrated, and write weighting $ \ boldsymbol {w} \ _ t ^ w $ is written.

\begin{align}
\boldsymbol{w}_t^w = g_t^w\big(g_t^a\boldsymbol{a}_t+(1-g_t^a)\boldsymbol{c}_t^w\big)
\end{align}

Update as. Where $ g_t ^ a $ and $ g_t ^ w $ were entered through the interface vector $ \ boldsymbol {\ xi} _t $. Each works as a gate that controls "whether to write based on the used flag or" key "" and "whether to write in the first place".

② Write to Memory

Now that the write weighting update is complete, write to Memory. However, old information is deleted at the same time as writing. Weighting which Memory slot is targeted is done by "write weighting" $ \ boldsymbol {w} \ _t ^ w $. The data to be written is "write vector" $ \ boldsymbol {\ nu} \ _ t $, and the pattern to erase the data in slot is given by erase vector $ \ boldsymbol {e} \ t $. The two are input through the interface vector. The update by erasing / writing the Memory matrix $ M {t-1} $ is performed according to the following formula.

\begin{align}
& M_t[n,s] = M_{t-1}[n,s] \big(1-\boldsymbol{w}_t^w[n] \boldsymbol{e}_t[s] \big) + \boldsymbol{w}_t^w[n] \boldsymbol{\nu}_t[s] \\
& \Leftrightarrow M_t = M_{t-1} \circ \big(1-\boldsymbol{w}_t^w \boldsymbol{e}_t^T \big) + \boldsymbol{w}_t^w \boldsymbol{\nu}_t^T 
\end{align}

③ Update read weighting

There are two ways to select the destination address to read. In other words (1) Select the read destination slot based on the "key" entered through the interface vector, and (2) Select the slot according to the writing order.

Regarding (1), "read key" $ \ {\ boldsymbol {k} \ _ t ^ {r, i} \} \ _ {i = 1,2, .., as in the case of write weighting. Using R} $ and "read strength" $ \ {\ beta \ _t ^ {r, i} \} \ _ {i = 1,2, .., R} $,

\begin{align}
\boldsymbol{c}_t^{r, i} = \mathcal{C}\big(M_t, \boldsymbol{k}_t^{r,i}, \beta_t^{r,i}\big) \;\; (i=1,2,...,R)
\end{align}

To calculate.

(2) is a little long, but I will explain it in order. For convenience of explanation, explanations are given in the order of (2) -2 → (2) -1, but when implementing, the flow is (2) -1 → (2) -2.

(2) -2. Composition of "precedence weighting" Define the precedence weighting $ \ boldsymbol {p} _t $ with the following update expression.

\begin{align}
& \boldsymbol{p}_0 = 0 \\
& \boldsymbol{p}_t = \Big(1-\sum_{n=1}^N \boldsymbol{w}_t^w[n]\Big)\boldsymbol{p}_{t-1} + \boldsymbol{w}_t^w
\end{align}

Please note that $ \ boldsymbol {w} \ _ t ^ w $ has already been updated in ①. If "strong" writing is done in ②, it will be $ \ sum \ boldsymbol {w} \ _ t ^ w \ simeq 1 $, so the information of the previous time-step $ \ boldsymbol {p} \ _ {t- 1} $ is erased and saves where the write was done in the current time-step. If no writes were made, the weight of the last write was retained. You can also easily see that it is $ \ boldsymbol {p} \ _ t [n] \ leq 1 $, $ \ sum \ _n \ boldsymbol {p} _t [n] \ leq 1 $.

(2) -1. Configuration of "temporal link matrix" Constructs a $ N $ × $ N $ matrix $ L_t $ that represents the writing order to the Memory slot. When viewed in component units, $ L_t [n, m] \ simeq 1 $ means that "in Memory $ M_t $, the information in slot'n'is written after the information in slot'm'. The update formula is given below so that the relationship "is" can be expressed.

\begin{align}
& L_0[n,m]=0 \\
& L_t[n,n]=0 \\
& L_t[n,m]=\big(1-\boldsymbol{w}_t^w[n]-\boldsymbol{w}_t^w[m]\big)L_{t-1}[n,m] + \boldsymbol{w}_t^w[n]\boldsymbol{p}_{t-1}[m]
\end{align}

The diagonal component doesn't make sense, so fix it at $ 0 $. $ 0 \ leqq L \ _t [n, m] \ leqq 1 $ and $ \ sum_m L \ _t [n, m] \ leqq 1 $ can be confirmed immediately. I tried to check $ \ sum_n L \ _t [n, m] \ leqq 1 $, but couldn't show it. If anyone can show it, I would appreciate it if you could comment.

(2) -3. Configuration of "forward / backward weighting" Forward / backward weighting considering the writing order to Memory $ \ {\ boldsymbol {f} \ _ t ^ i \} \ _ {i = 1, .., R} $ / $ \ {\ boldsymbol {b } \ _t ^ i \} \ _ {i = 1, .., R} $ is a little rough

\begin{align}
&\boldsymbol{f}_t^i[n]=\sum_{m=1}^NL_t[n,m]\boldsymbol{w}_{t-1}^{r,i}[m] \; \Leftrightarrow \; \boldsymbol{f}_t^i=L_t\boldsymbol{w}_{t-1}^{r,i} \\
&\boldsymbol{b}_t^i[m]=\sum_{n=1}^N\boldsymbol{w}_{t-1}^{r,i}[n]L_t[n,m] \; \Leftrightarrow \; \boldsymbol{b}_t^i=L_t^T\boldsymbol{w}_{t-1}^{r,i} \\
\end{align}

It consists of. If $ \ sum_n L \ _t [n, m] \ leqq 1 $ holds, then $ 0 \ leqq \ boldsymbol {f} \ _t ^ i [n] \ leqq 1 $ and $ \ sum_n \ boldsymbol {f} \ _t ^ i [n] \ leqq1 $ holds. The same is true for $ \ boldsymbol {b} \ _ t ^ i $.

As mentioned above, by integrating (1) and (2), read weighting $ \ {\ boldsymbol {w} \ _ t ^ {i, r} \} \ _ {i = 1, ..., R} $ To update.

\begin{align}
\boldsymbol{w}_t^{r,i}=\boldsymbol{\pi}_t^i[1]\boldsymbol{b}_t^i + \boldsymbol{\pi}_t^i[2]\boldsymbol{c}_t^i + \boldsymbol{\pi}_t^i[3]\boldsymbol{f}_t^i \;\; (i=1,2,...,R)
\end{align}

$ \ {\ pi_t ^ i \} \ _ {i = 1, ..., R} $ is entered through the interface vector.

④ Read from Memory

Reading from Memory is easy. The read result "read vector" $ \ {\ boldsymbol {r} \ _t ^ i \} \ _ {i = 1, .., R} $ is

\begin{align}
\boldsymbol{r}_t^i[s] = \sum_{n=1}^N \boldsymbol{w}_t^{r,i}[n]M_t[n,s] \; \Leftrightarrow \; \boldsymbol{r}_t^i = M_t^T\boldsymbol{w}_t^{r,i} \;\; (i=1,2,...,R)
\end{align}

Is calculated.

This completes reading and writing Memory for time-step $ t $.

Python --Chainer implementation example

Here is an example of implementing the logic described above in python. The original code is Overview of DNC (Differentiable Neural Computers) + Implementation by Chainer It is in. The processing is grouped into functions and variable names are changed for easy comparison with the above explanation, but the contents are almost the same.

One thing to note is that the overall output from the Controller, $ \ boldsymbol {h} \ _t $, integrates not only the final output of the Controller, but all the output of the hidden layer in the paper. It looks like it is. However, for the sake of simplicity, the Controller used here is limited to a simple one with about two layers, and the output of the hidden layer is not taken out of the Controller.

DNC implementation

dnc.py


import numpy as np
import chainer
from chainer import functions as F
from chainer import links as L
from chainer import optimizers, Chain, Link, Variable


# controller of DNC
class SimpleLSTM(Chain): 
    
    def __init__(self, d_in, d_hidden, d_out):
        super(SimpleLSTM, self).__init__(
            l1 = L.LSTM(d_in, d_hidden),
            l2 = L.Linear(d_hidden, d_out),
            )
        
    def __call__(self, x):
        return self.l2(self.l1(x))
    
    def reset_state(self):
        self.l1.reset_state()
        
        
class DNC(Chain):
    
    def __init__(self, X, Y, N, W, R):
        self.X = X # input dimension
        self.Y = Y # output dimension
        self.N = N # number of memory slot
        self.W = W # dimension of one memory slot
        self.R = R # number of read heads
        self.d_ctr_in = W*R+X # input dimension into the controller
        self.d_ctr_out = Y+W*R+3*W+5*R+3 # output dimension from the controller
        self.d_ctr_hidden = self.d_ctr_out # dimension of hidden unit of the controller
        self.d_interface = W*R+3*W+5*R+3 # dimension of interface vector
        
        self.controller = SimpleLSTM(self.d_ctr_in, self.d_ctr_hidden, self.d_ctr_out)

        super(DNC, self).__init__(
            l_ctr = self.controller, 
            l_Wy = L.Linear(self.d_ctr_out, self.Y),
            l_Wxi = L.Linear(self.d_ctr_out, self.d_interface), 
            l_Wr = L.Linear(self.R * self.W, self.Y), 
            )
        
        self.reset_state()
            
    def reset_state(self):
        # initialize all the recurrent state
        self.l_ctr.reset_state() # controller
        self.u = Variable(np.zeros((self.N, 1)).astype(np.float32)) # usage vector (N, 1)
        self.p = Variable(np.zeros((self.N, 1)).astype(np.float32)) # precedence weighting (N, 1)
        self.L = Variable(np.zeros((self.N, self.N)).astype(np.float32)) # temporal memory linkage (N, N)                     
        self.Mem = Variable(np.zeros((self.N, self.W)).astype(np.float32)) # memory (N, W)
        self.r = Variable(np.zeros((1, self.R*self.W)).astype(np.float32)) # read vector (1, R * W)
        self.wr = Variable(np.zeros((self.N, self.R)).astype(np.float32)) # read weighting (N, R)
        self.ww = Variable(np.zeros((self.N, 1)).astype(np.float32)) # write weighting (N, 1)

        
    # utility functions
    
    def _cosine_similarity(self, u, v):
        # cosine similarity as a distance of two vectors
        # u, v: (1, -) Variable  -> (1, 1) Variable
        denominator = F.sqrt(F.batch_l2_norm_squared(u) * F.batch_l2_norm_squared(v))
        if (np.array_equal(denominator.data, np.array([0]))):
            return F.matmul(u, F.transpose(v))
        return F.matmul(u, F.transpose(v)) / F.reshape(denominator, (1, 1))

    
    def _C(self, Mem, k, beta):
        # similarity between rows of matrix Mem and vector k
        # Mem:(N, W) Variable, k:(1, W) Variable, beta:(1, 1) Variable -> (N, 1) Variable
        N, W = Mem.shape 
        ret_list = [0] * N
        for i in range(N):
            # calculate distance between i-th row of Mem and k
            ret_list[i] = self._cosine_similarity(F.reshape(Mem[i,:], (1, W)), k) * beta 
        # concat horizontally because softmax operates along the direction of axis=1 
        return F.transpose(F.softmax(F.concat(ret_list, 1))) 

    
    def _u2a(self, u):
        # convert usage vector u to allocation weighting a
        # u, a: (N, 1) Variable
        N = u.shape[0]
        phi = np.argsort(u.data.flatten()) # u.data[phi]: ascending
        a_list = [0] * N    
        cumprod = Variable(np.array([[1.0]]).astype(np.float32)) 
        for i in range(N): 
            a_list[phi[i]] = cumprod * (1.0 - F.reshape(u[phi[i]], (1, 1)))
            cumprod *= F.reshape(u[phi[i]], (1, 1))
        return F.concat(a_list, 0) 
    
    
    # operations of the DNC system
    
    def _controller_io(self, x):
        # input data from the Data Set : x (1, X) Variable  
        # out-put from the controller h is split into two ways : v (1, Y), xi(1, W*R+3*W+5*R+3) Variable
        chi = F.concat([x, self.r], 1) # total input to the controller 
        h = self.l_ctr(chi) # total out-put from the controller
        self.v = self.l_Wy(h)
        self.xi = self.l_Wxi(h)
        
        # interface vector xi is split into several components
        (self.kr, self.beta_r, self.kw, self.beta_w,
         self.e, self.nu, self.f, self.ga, self.gw, self.pi
         ) = F.split_axis(self.xi, np.cumsum(
             [self.W*self.R, self.R, self.W, 1, self.W, self.W, self.R, 1, 1]), 1) # details of the interface vector
        
        # rescale components
        self.kr = F.reshape(self.kr, (self.R, self.W)) # read key (R, W)
        self.beta_r = 1 + F.softplus(self.beta_r) # read strength (1, R)
        # self.kw : write key (1, W)
        self.beta_w = 1 + F.softplus(self.beta_w) # write strength (1, 1)
        self.e = F.sigmoid(self.e) # erase vector (1, W)
        # self.nu : write vector (1, W)
        self.f = F.sigmoid(self.f) #  free gate (1, R)
        self.ga = F.sigmoid(self.ga) # allcation gate (1, 1)
        self.gw = F.sigmoid(self.gw) # write gate (1, 1)
        self.pi = F.softmax(F.reshape(self.pi, (self.R, 3))) # read mode (R, 3)
        
        
    def _up_date_write_weighting(self):
        # calculate retention vector : psi (N, 1) 
        # here, read weighting : wr (N, R) must retain state one step former
        psi_mat = 1 - F.matmul(Variable(np.ones((self.N, 1)).astype(np.float32)), self.f) * self.wr # (N, R)
        self.psi = Variable(np.ones((self.N, 1)).astype(np.float32)) 
        for i in range(self.R):
            self.psi = self.psi * F.reshape(psi_mat[:,i],(self.N,1)) # (N, 1)
        # up date usage vector : u (N, 1)
        # here, write weighting : ww (N, 1) must retain state one step former
        self.u = (self.u + self.ww - (self.u * self.ww)) * self.psi 
        # calculate allocation weighting : a (N, 1)
        self.a = self._u2a(self.u) 
        # calculate write content weighting : cw (N, 1)
        self.cw = self._C(self.Mem, self.kw, self.beta_w) 
        # up date write weighting : ww (N, 1)
        self.ww = F.matmul(F.matmul(self.a, self.ga) + F.matmul(self.cw, 1.0 - self.ga), self.gw) 

        
    def _write_to_memory(self):
        # erase vector : e (1, W) deletes information on the Memory : Mem (N, W) 
        # and write vector : nu (1, W) is written there
        # write weighting : ww (N, 1) must be up-dated before this step
        self.Mem = self.Mem * (np.ones((self.N, self.W)).astype(np.float32) - F.matmul(self.ww, self.e)) + F.matmul(self.ww, self.nu)

        
    def _up_date_read_weighting(self):    
        # up date temporal memory linkage : L (N, N)
        ww_mat = F.matmul(self.ww, Variable(np.ones((1, self.N)).astype(np.float32))) # (N, N)
        # here, precedence wighting : p (N, 1) must retain state one step former
        self.L = (1.0 - ww_mat - F.transpose(ww_mat)) * self.L + F.matmul(self.ww, F.transpose(self.p)) # (N, N)
        self.L = self.L * (np.ones((self.N, self.N)) - np.eye(self.N)) # constrain L[i,i] == 0   
        # up date prcedence weighting : p (N, 1)
        self.p = (1.0 - F.matmul(Variable(np.ones((self.N, 1)).astype(np.float32)), F.reshape(F.sum(self.ww),(1, 1)))) * self.p + self.ww 
        # calculate forward weighting : fw (N, R)
        # here, read wighting : wr (N, R) must retain state one step former
        self.fw = F.matmul(self.L, self.wr) 
        # calculate backward weighting : bw (N, R)
        self.bw = F.matmul(F.transpose(self.L), self.wr)
        # calculate read content weighting : cr (N, R)
        self.cr_list = [0] * self.R
        for i in range(self.R):
            self.cr_list[i] = self._C(self.Mem, F.reshape(self.kr[i,:], (1, self.W)), F.reshape(self.beta_r[0,i],(1, 1))) # (N, 1)
        self.cr = F.concat(self.cr_list, 1) # (1, N * R)       
        # compose up-dated read weighting : wr (N, R)
        bcf_tensor = F.concat([
                            F.reshape(F.transpose(self.bw), (self.R, self.N, 1)),
                            F.reshape(F.transpose(self.cr), (self.R, self.N, 1)),
                            F.reshape(F.transpose(self.fw), (self.R, self.N, 1))
                            ], 2) # (R, N, 3)
        self.pi = F.reshape(self.pi, (self.R, 3, 1)) # (R, 3, 1)
        self.wr = F.transpose(F.reshape(F.batch_matmul(bcf_tensor, self.pi), (self.R, self.N))) # (N, R)
    
    
    def _read_from_memory(self): 
        # read information from the memory : Mem (N, W) and compose read vector : r (W, R) to reshape (1, W * R)
        # read weighting : wr (N, R) must be up-dated before this step
        self.r = F.reshape(F.matmul(F.transpose(self.Mem), self.wr), (1, self.R * self.W))

        
    def __call__(self, x):
        self._controller_io(x) # input data is processed through the controller
        self._up_date_write_weighting() 
        self._write_to_memory() # memory up-date
        self._up_date_read_weighting()  
        self._read_from_memory() # extract information from the memory
        self.y = self.l_Wr(self.r) + self.v # compose total out put y : (1, Y)
        return self.y

Example of use

A usage example is also provided, but the content is the same as that in Overview of DNC (Differentiable Neural Computers) + Implementation by Chainer. Input a random number of fixed-length one-hot vectors, and after the input is completed, echo the input data as output. Learning is performed for each set of inputs, with the squared error between the echoed data and the input data as loss. When you're done training, enter the next set of one-hot vectors.

dnc_echo_test.py


import dnc

import numpy as np
import chainer
from chainer import functions as F
from chainer import links as L
from chainer import optimizers, Chain, Link, Variable


def onehot(x, n):
    ret = np.zeros(n).astype(np.float32)
    ret[x] = 1.0
    return ret


X = 5
Y = 5
N = 10
W = 10
R = 2


model = dnc.DNC(X, Y, N, W, R)
optimizer = optimizers.Adam()
optimizer.setup(model)


n_data = 10000 # number of input data
loss = 0.0
acc = 0.0
acc_bool = []


for data_cnt in range(n_data):
    
    loss_frac = np.zeros((1, 2))    
    
    # prepare one pair of input and target data 
    # length of input data is randomly set
    len_content = np.random.randint(3, 6) 
    # generate one input data as a sequence of randam integers
    content = np.random.randint(0, X-1, len_content) 
    len_seq = len_content + len_content # the former is for input, the latter for the target
    x_seq_list = [float('nan')] * len_seq # input sequence
    t_seq_list = [float('nan')] * len_seq # target sequence
    
    for i in range(len_seq):
        # convert a format of input data
        if (i < len_content):
            x_seq_list[i] = onehot(content[i], X)
        elif (i == len_content):
            x_seq_list[i] = onehot(X-1, X)
        else:
            x_seq_list[i] = np.zeros(X).astype(np.float32)
        # convert a format of output data
        if (i >= len_content):
            t_seq_list[i] = onehot(content[i - len_content], X)         
    
    model.reset_state() # reset reccurent state per input data
    
    # input data is fed as a sequence
    for cnt in range(len_seq):
        x = Variable(x_seq_list[cnt].reshape(1, X))
        if (isinstance(t_seq_list[cnt], np.ndarray)):
            t = Variable(t_seq_list[cnt].reshape(1, Y))
        else:
            t = []

        y = model(x)
        
        if (isinstance(t, chainer.Variable)):
            loss += (y - t)**2
            acc_bool.append(np.argmax(y.data)==np.argmax(t.data))                        
            if (np.argmax(y.data)==np.argmax(t.data)): acc += 1
                
        if (cnt+1==len_seq):
            # training by back propagation
            model.cleargrads()
            loss.grad = np.ones(loss.shape, dtype=np.float32)
            loss.backward()
            optimizer.update()
            loss.unchain_backward()
            # print loss and accuracy
            if data_cnt < 50 or data_cnt >= 9950:
                print('(', data_cnt, ')', acc_bool, ' :: ', loss.data.sum()/loss.data.size/len_content, ' :: ', acc/len_content)
            loss_frac += [loss.data.sum()/loss.data.size/len_seq, 1.]
            loss = 0.0
            acc = 0.0
            acc_bool = []

test result

This is the result when it is repeated 10000 times. The bool value in [] is the bool value that indicates whether the maximum value in the output vector is set to 1 and the others are changed to 0, which matches the input value.

・ Results of the first 20 times

( 0 ) [True, False, False]  ::  0.197543557485  ::  0.3333333333333333
( 1 ) [False, False, False, False]  ::  0.209656882286  ::  0.0
( 2 ) [True, False, False]  ::  0.172263367971  ::  0.3333333333333333
( 3 ) [False, True, True]  ::  0.185363880793  ::  0.6666666666666666
( 4 ) [True, True, True, True]  ::  0.157090616226  ::  1.0
( 5 ) [False, False, False, False, False]  ::  0.191528530121  ::  0.0
( 6 ) [True, False, False, False, False]  ::  0.175649337769  ::  0.2
( 7 ) [False, False, False, True, True]  ::  0.173387451172  ::  0.4
( 8 ) [True, False, True, True]  ::  0.150813746452  ::  0.75
( 9 ) [False, True, False]  ::  0.163899072011  ::  0.3333333333333333
( 10 ) [False, False, False, False, False]  ::  0.183468780518  ::  0.0
( 11 ) [True, False, True, False]  ::  0.152743542194  ::  0.5
( 12 ) [False, False, True, False]  ::  0.170574557781  ::  0.25
( 13 ) [False, True, False, True, False]  ::  0.161617393494  ::  0.4
( 14 ) [False, False, False, False]  ::  0.168220555782  ::  0.0
( 15 ) [False, False, False]  ::  0.167814588547  ::  0.0
( 16 ) [False, True, False, False]  ::  0.158575570583  ::  0.25
( 17 ) [False, False, False, False]  ::  0.165678012371  ::  0.0
( 18 ) [False, False, False]  ::  0.165241924922  ::  0.0
( 19 ) [False, True, False]  ::  0.143808253606  ::  0.3333333333333333

・ Results of the last 20 times

( 9980 ) [True, True, True, True]  ::  0.000208107382059  ::  1.0
( 9981 ) [True, True, True, True, True]  ::  0.000164349582046  ::  1.0
( 9982 ) [True, True, True, True, True]  ::  0.000122650777921  ::  1.0
( 9983 ) [True, True, True]  ::  0.000181751077374  ::  1.0
( 9984 ) [True, True, True, True, True]  ::  0.000318505689502  ::  1.0
( 9985 ) [True, True, True, True, True]  ::  0.00023639023304  ::  1.0
( 9986 ) [True, True, True, True, True]  ::  0.000988183766603  ::  1.0
( 9987 ) [True, True, True, True, True]  ::  0.000226851813495  ::  1.0
( 9988 ) [True, True, True]  ::  0.000401457709571  ::  1.0
( 9989 ) [True, True, True, True]  ::  0.000256504747085  ::  1.0
( 9990 ) [True, True, True, True, True]  ::  0.000165695995092  ::  1.0
( 9991 ) [True, True, True, True]  ::  0.000123940082267  ::  1.0
( 9992 ) [True, True, True, True, True]  ::  0.000351718552411  ::  1.0
( 9993 ) [True, True, True, True]  ::  0.000147357559763  ::  1.0
( 9994 ) [True, True, True, True]  ::  0.000173216045368  ::  1.0
( 9995 ) [True, True, True, True]  ::  0.000108330522198  ::  1.0
( 9996 ) [True, True, True, True]  ::  0.00016659933608  ::  1.0
( 9997 ) [True, True, True]  ::  0.000255667418242  ::  1.0
( 9998 ) [True, True, True]  ::  0.000280433737983  ::  1.0
( 9999 ) [True, True, True, True, True]  ::  0.000443447269499  ::  1.0

The answer is perfectly correct. I only posted 20 times, but False was 0 times in the last 100 times. However, I haven't compared it with other methods, so it's unclear if it's because of DNC.

This time, the purpose was to introduce logic, so that's it.

Reference URL

・ I used the one on this page as an implementation example of Chainer. Overview of DNC (Differentiable Neural Computers) + Implementation by Chainer

・ I didn't think about GPU, batch processing, etc. this time, but I think the page below will be helpful. Learning and generating strings with (Chainer) DNC (Differentiable Neural Computers)

-There seems to be an implementation in TensorFlow https://github.com/Mostafa-Samir/DNC-tensorflow

・ The explanation of LSTM is easy to understand here. Understanding LSTM-with recent trends

Recommended Posts

I will explain about the neural network "Differentiable Neural Computing (DNC)" with external memory.
Persist the neural network built with PyBrain
I ran the neural network on the actual FPGA
I ran the TensorFlow tutorial with comments (first neural network: the beginning of the classification problem)