Understand the TensorFlow namespace and master shared variables

Intro

This is the Day 9 article of the TensorFlow Advent Calendar 2016.

TensorFlow was released in November 2015, and its "namespace" functionality has been supported from the very beginning. It is what TensorBoard uses for graph visualization, but of course it is not just for that: namespaces are extremely useful for managing identifiers. Such strong namespace support is reminiscent of C++, so let me quote from a C++ instructional book ("Self-Study C++"):

The purpose of namespaces is to localize identifier names and avoid name collisions. In the C++ programming environment, the names of variables, functions, and classes have continued to grow exponentially in number. Before namespaces were introduced, all of these names competed for space within a single global namespace, causing many conflicts.

Python's variable scoping, on the other hand, offers only the bare minimum: local, global, plus one extra (nonlocal). So it seems only natural that the Google engineers who wrote the core of TensorFlow in C++ would think of bringing C++-level namespace support into TensorFlow.

Name management is not much of an issue for a neural network with only a few layers, such as an MLP (Multi-Layer Perceptron). For deep CNNs and the large models found in RNNs, however, weight sharing comes into play, and a proper variable scope becomes necessary. Variable scope is also needed when scaling up (of which I admittedly have little experience) and porting code to a distributed setting such as multi-device (GPU) or cluster environments.

In this article, I would like to go through the related APIs in order to build a solid understanding of TensorFlow's "namespaces". (The programming environment is TensorFlow 0.11.0, Python 3.5.2, Ubuntu 16.04 LTS.)

Where TensorFlow variable scoping can trip you up

Variable scoping is not difficult if you read the documentation properly, but if your understanding is only "vague", the following questions tend to trip people up.

- There are tf.name_scope() and tf.variable_scope() for defining a scope, but what is the difference?
- I remember using tf.Variable() to define variables in TensorFlow, but when should tf.get_variable() be used?

Let me give the answers first. The answer to Question 1 is that tf.name_scope() is the more general-purpose scope definition, while tf.variable_scope() is a scope definition dedicated to managing variables (identifiers). The answer to Question 2 is that tf.Variable() is the more primitive (lower-level) way of defining a variable, whereas tf.get_variable() is a (higher-level) variable definition that takes the variable scope into account. (The related documentation, TensorFlow "HOW TO" - "Sharing Variables", explains shared variables in detail.)

Below, let's run some code and look into the details.

import tensorflow as tf

# tf.name_scope
with tf.name_scope("my_scope"):
    v1 = tf.get_variable("var1", [1], dtype=tf.float32)
    v2 = tf.Variable(1, name="var2", dtype=tf.float32)
    a = tf.add(v1, v2)

print(v1.name)  # var1:0
print(v2.name)  # my_scope/var2:0
print(a.name)   # my_scope/Add:0

First, I used tf.name_scope() and defined variables inside it. The identifiers that TensorFlow manages are printed in the second half; the output is shown as a comment to the right of each print statement. The "my_scope" prefix is properly applied to the variable v2 defined with tf.Variable() and to the addition operation a. The variable v1 defined with tf.get_variable(), on the other hand, splendidly ignores the scope.

# tf.variable_scope
with tf.variable_scope("my_scope"):
    v1 = tf.get_variable("var1", [1], dtype=tf.float32)
    v2 = tf.Variable(1, name="var2", dtype=tf.float32)
    a = tf.add(v1, v2)

print(v1.name)  # my_scope/var1:0
print(v2.name)  # my_scope_1/var2:0  ...The scope name has been updated.
print(a.name)   # my_scope_1/Add:0   ...The scope name after the update is maintained.

Next, I used tf.variable_scope(). (Note that this snippet continues on the same graph as the previous one.) The variable v1 defined with tf.get_variable() was created with "my_scope" prepended to its name, as intended. The variable v2 below it and the operation a, however, received "my_scope_1" (even though the code says tf.variable_scope("my_scope")). The reason is that they would have received "my_scope" in a fresh program, but the identifier "my_scope/var2:0" had already been taken by the previous snippet, so the name was automatically updated to "my_scope_1". (The statement `a = tf.add(v1, v2)` that follows the scope-name update ("my_scope" -> "my_scope_1") stays in the updated scope ("my_scope_1").)

It's getting complicated, so I'll sort it out a little.

- tf.name_scope() is a general-purpose name-scope definition. (As you may know, the output to TensorBoard uses these identifier settings.)
- tf.variable_scope() is a scope definition used for variable management. (The function name variable_scope says it all...)
- tf.get_variable() defines a variable while managing its identifier (is it new, or a duplicate?). Always use it together with tf.variable_scope().

The two snippets above were deliberately made confusing for the sake of the experiment, but there is nothing particularly difficult here as long as you follow the basic principle of using tf.get_variable() together with tf.variable_scope().
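
One additional trick worth knowing: tf.variable_scope() also accepts a previously captured scope object, which lets you re-enter exactly the same scope later instead of getting an auto-uniquified name like "my_scope_1". A minimal sketch (the scope and variable names below are chosen so that they do not collide with the snippets above):

with tf.variable_scope("shared") as scope:
    u1 = tf.get_variable("u", [1], dtype=tf.float32)

# passing the captured scope object re-enters "shared" itself;
# reuse=True tells get_variable to return the existing variable
with tf.variable_scope(scope, reuse=True):
    u2 = tf.get_variable("u", [1])

print(u1.name)  # shared/u:0
print(u2.name)  # shared/u:0 ...the same variable, no "shared_1"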

Now, let's see how to use shared variables with tf.get_variable().

with tf.variable_scope("my_scope2"):
    v4_init = tf.constant_initializer([4.])
    v4 = tf.get_variable("var4", shape=[1], initializer=v4_init)

print(v4.name)  # my_scope2/var4:0

First, we defined the variable v4 in the scope "my_scope2". tf.get_variable() defines a variable through a variable initializer; here a constant initializer is used so that v4 holds the value 4.0. The identifier seen by TensorFlow, "var4", is passed as the first argument.
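
As a quick check (a sketch; tf.initialize_all_variables() was the variable-initialization op in the TensorFlow 0.11 era), running the graph confirms that v4 indeed holds 4.0:

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    print(sess.run(v4))   # [ 4.]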

Next, try to allocate a variable with the same identifier.

with tf.variable_scope("my_scope2"):
    v5 = tf.get_variable("var4", shape=[1], initializer=v4_init)

ValueError: Variable my_scope2/var4 already exists, disallowed. Did you mean to set reuse=True in VarScope? Originally defined at:

  File "name_scope_ex.py", line 47, in <module>
    v4 = tf.get_variable("var4", shape=[1], initializer=v4_init)

As planned, a ValueError occurred. The error essentially says, "isn't it odd to grab a variable with an identifier that is already in use?" To re-use a variable under the same identifier, use the reuse option as follows.

with tf.variable_scope("my_scope2", reuse=True):
    v5 = tf.get_variable("var4", shape=[1])
print(v5.name)  # my_scope2/var4:0

Alternatively, you can do the following as well.

with tf.variable_scope("my_scope2"):
    tf.get_variable_scope().reuse_variables()
    v5 = tf.get_variable("var4", shape=[1])
print(v5.name)  # my_scope2/var4:0
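
Note that v4 and v5 are not merely two variables that happen to share a name: with reuse in effect, tf.get_variable() hands back the very same variable object. A quick sketch to confirm:

print(v4 is v5)  # True ...get_variable returned the existing Variable object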

So far, we have confirmed the basic functions of tf.variable_scope() and tf.get_variable().

Example of shared variables: an Autoencoder

Now let's look at an example that uses shared variables. The documentation, TensorFlow "HOW TO" - "Sharing Variables", points to its own usage examples, but both involve a considerable amount of code, so this time I will instead take up weight sharing in an autoencoder (hereafter, Autoencoder). The encoder side / decoder side of the Autoencoder can be expressed as follows.

y = f(\textbf{W}x + \textbf{b})  \\
\hat{x} = \tilde{f}(\tilde{\textbf{W}}y + \tilde{\textbf{b}})

The following weight sharing can be used in such a symmetric Autoencoder.

\tilde{\textbf{W}} = \textbf{W}^{\mathrm{T}}
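
Before writing the full classes, here is the core of the idea in isolation (a minimal sketch with assumed shapes and an assumed scope name "ae"): the decoder fetches the encoder's W through the variable scope with reuse=True and transposes it, so no second weight matrix is created.

n_in, n_out = 784, 625

with tf.variable_scope("ae"):
    W_enc = tf.get_variable("W", [n_in, n_out])                 # encoder weight W

with tf.variable_scope("ae", reuse=True):
    W_dec = tf.transpose(tf.get_variable("W", [n_in, n_out]))   # tied weight W^T

print(W_enc.name)  # ae/W:0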

Let's implement a network with the above configuration using TensorFlow's shared variables. First, define the Encoder class.

# Encoder Layer   
class Encoder(object):
    def __init__(self, input, n_in, n_out, vs_enc='encoder'):
        self.input = input
        with tf.variable_scope(vs_enc):
            weight_init = tf.truncated_normal_initializer(mean=0.0, stddev=0.05)
            W = tf.get_variable('W', [n_in, n_out], initializer=weight_init)
            bias_init = tf.constant_initializer(value=0.0)
            b = tf.get_variable('b', [n_out], initializer=bias_init)
        self.w = W
        self.b = b
    
    def output(self):
        linarg = tf.matmul(self.input, self.w) + self.b   # affine transform

        return tf.sigmoid(linarg)

The variable scope is set through the vs_enc option, and W is defined with tf.get_variable(). Next comes the Decoder class, which looks like this.

# Decoder Layer
class Decoder(object):
    def __init__(self, input, n_in, n_out, vs_dec='decoder'):
        self.input = input
        if vs_dec == 'decoder': # independent weight
            with tf.variable_scope(vs_dec):
                weight_init = tf.truncated_normal_initializer(mean=0.0, stddev=0.05)
                W = tf.get_variable('W', [n_in, n_out], initializer=weight_init)
        else:                   # weight sharing (tying)
            with tf.variable_scope(vs_dec, reuse=True):     # set reuse option
                W = tf.get_variable('W', [n_out, n_in])
                W = tf.transpose(W)

        with tf.variable_scope('decoder'):  # in either case, a new bias variable is needed
            bias_init = tf.constant_initializer(value=0.0)
            b = tf.get_variable('b', [n_out], initializer=bias_init)
        self.w = W
        self.b = b
    
    def output(self):
        linarg = tf.matmul(self.input, self.w) + self.b   # affine transform

        return tf.sigmoid(linarg)

Most of it is the same as the Encoder class, but the definition of the variable W is handled with a branch. The network-definition part is as follows.

# make neural network model
def make_model(x):
    enc_layer = Encoder(x, 784, 625, vs_enc='encoder')
    enc_out = enc_layer.output()
    dec_layer = Decoder(enc_out, 625, 784, vs_dec='encoder')
    dec_out = dec_layer.output()

    return enc_out, dec_out

If you specify vs_dec='decoder' when creating the Decoder object (or omit the option), a new weight variable W is allocated. If instead you pass vs_dec='encoder', the same variable scope that the Encoder used above, the weight variable W is reused as a shared variable. (When reusing it, W is transposed so that the shapes fit the network.)
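
For reference, the surrounding training setup might look roughly like the sketch below. This is a hypothetical reconstruction: the placeholder x, the cross-entropy loss, the optimizer choice and its learning rate, and the use of the standard MNIST tutorial reader are all assumptions, not necessarily the exact code that produced the numbers that follow.

x = tf.placeholder(tf.float32, [None, 784])
enc_out, dec_out = make_model(x)

# pixel-wise binary cross-entropy between the input and its reconstruction
eps = 1.e-10
loss = -tf.reduce_mean(tf.reduce_sum(
    x * tf.log(dec_out + eps) + (1. - x) * tf.log(1. - dec_out + eps), 1))
train_step = tf.train.AdagradOptimizer(0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    for i in range(10001):
        batch_x, _ = mnist.train.next_batch(100)   # assumes the MNIST tutorial reader
        sess.run(train_step, feed_dict={x: batch_x})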

Here is an example of running the computation on the MNIST data. First, the result without weight sharing is as follows.

Training...
  step, loss =      0:  0.732
  step, loss =   1000:  0.271
  step, loss =   2000:  0.261
  step, loss =   3000:  0.240
  step, loss =   4000:  0.234
  step, loss =   5000:  0.229
  step, loss =   6000:  0.219
  step, loss =   7000:  0.197
  step, loss =   8000:  0.195
  step, loss =   9000:  0.193
  step, loss =  10000:  0.189
loss (test) =  0.183986

When weight sharing is done, it is as follows.

Training...
  step, loss =      0:  0.707
  step, loss =   1000:  0.233
  step, loss =   2000:  0.215
  step, loss =   3000:  0.194
  step, loss =   4000:  0.186
  step, loss =   5000:  0.174
  step, loss =   6000:  0.167
  step, loss =   7000:  0.154
  step, loss =   8000:  0.159
  step, loss =   9000:  0.152
  step, loss =  10000:  0.152
loss (test) =  0.147831

With the weight-sharing setting, the loss (cross-entropy) decreases faster for the same number of training steps. Since the network has roughly half the degrees of freedom, this is the result one would expect.
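
For a rough sense of scale (assuming the 784-625-784 configuration used above): without tying, the encoder and decoder each learn a 784 x 625 weight matrix, i.e. 2 x 490,000 = 980,000 weights plus 625 + 784 bias terms; with tying, only a single 490,000-element weight matrix is trained, so the number of free parameters is indeed roughly halved.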

A slightly more complex model: a classifier made of two MLPs

Let's consider another, slightly more complicated model (although it is not all that complicated...). As above, MNIST is used as the data, but this time we perform multi-class classification. As the classifier, we use an MLP (Multi-Layer Perceptron) with two hidden layers and one read-out layer. The figure below is the TensorBoard graph.

Fig. TensorBoard graph of the two MLP networks (mnist_2nets_60.png)

(I'm not familiar with TensorBoard. Please take it as a rough image.)

First, define the classes of the hidden layer (fully connected layer) and the output layer.

# Full-connected Layer   
class FullConnected(object):
    def __init__(self, input, n_in, n_out, vn=('W', 'b')):
        self.input = input

        weight_init = tf.truncated_normal_initializer(mean=0.0, stddev=0.05)
        bias_init = tf.constant_initializer(value=0.0)
        W = tf.get_variable(vn[0], [n_in, n_out], initializer=weight_init)
        b = tf.get_variable(vn[1], [n_out], initializer=bias_init)
        self.w = W
        self.b = b
        self.params = [self.w, self.b]
    
    def output(self):
        linarg = tf.matmul(self.input, self.w) + self.b   # affine transform

        return tf.nn.relu(linarg)
#

# Read-out Layer
class ReadOutLayer(object):
    def __init__(self, input, n_in, n_out, vn=('W', 'b')):
        self.input = input

        weight_init = tf.random_normal_initializer(mean=0.0, stddev=0.05)
        bias_init = tf.constant_initializer(value=0.0)
        W = tf.get_variable(vn[0], [n_in, n_out], initializer=weight_init)
        b = tf.get_variable(vn[1], [n_out], initializer=bias_init)
        self.w = W
        self.b = b
        self.params = [self.w, self.b]
    
    def output(self):
        linarg = tf.matmul(self.input, self.w) + self.b   # affine transform

        return tf.nn.softmax(linarg)

The variable names are passed in through the vn option of each constructor; no variable-sharing is done inside these classes. Next, the part that defines the network is as follows.

# Create the model
def mk_NN_model(scope='mlp', reuse=False):
    '''
      args.:
        scope   : variable scope ID of networks
        reuse   : reuse flag of weights/biases
    '''
    with tf.variable_scope(scope, reuse=reuse):
        hidden1 = FullConnected(x, 784, 625, vn=('W_hid_1','b_hid_1'))
        h1out = hidden1.output()
        hidden2 = FullConnected(h1out, 625, 625, vn=('W_hid_2','b_hid_2'))
        h2out = hidden2.output()    
        readout = ReadOutLayer(h2out, 625, 10, vn=('W_RO', 'b_RO'))
        y_pred = readout.output()
     
    cross_entropy = -tf.reduce_sum(y_*tf.log(y_pred))
    
    # Regularization terms (weight decay)
    L2_sqr = tf.nn.l2_loss(hidden1.w) + tf.nn.l2_loss(hidden2.w)
    lambda_2 = 0.01
    # the loss and accuracy
    with tf.name_scope('loss'):
        loss = cross_entropy + lambda_2 * L2_sqr
    with tf.name_scope('accuracy'):
        correct_prediction = tf.equal(tf.argmax(y_pred,1), tf.argmax(y_,1))
        accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    
    return y_pred, loss, accuracy

This function takes the variable scope scope and the variable-sharing flag reuse as options. To share weights between the two MLP networks, set the reuse flag while using the same scope name, as follows:

    y_pred1, loss1, accuracy1 = mk_NN_model(scope='mlp1')
    y_pred2, loss2, accuracy2 = mk_NN_model(scope='mlp1', reuse=True)

If you do not want to share the weights, write it as follows (which is just the natural way of doing it...).

    y_pred1, loss1, accuracy1 = mk_NN_model(scope='mlp1')
    y_pred2, loss2, accuracy2 = mk_NN_model(scope='mlp2')
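
Whether sharing actually took place can be checked by listing the trainable variables after the graph has been built (a quick sketch):

for v in tf.trainable_variables():
    print(v.name)
# shared case (reuse=True)    : only one set of mlp1/... variables appears
# unshared case (scope='mlp2'): both mlp1/... and mlp2/... variables appear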

The following two cases were run as computational experiments.

  1. The training data is split in two and fed to the two classifiers 'mlp1' and 'mlp2'. The two classifiers are set up with weight sharing. Training of 'mlp1' and training of 'mlp2' are performed serially, and the test data is classified with the final parameters.
  2. The training data is split in two and fed to the two classifiers 'mlp1' and 'mlp2'. 'mlp1' and 'mlp2' are independent (non-shared) networks, and each is trained on its own. The test data is run through each classifier, and the saved outputs are combined to obtain the final classification result.

Since the point was to experiment with weight sharing, the two classifiers have to have the same number of layers and units. Identical classifiers would be boring, however, so they use different optimizers and slightly different learning rates.

First, the execution result of case No. 1 is as follows.

Training...
  Network No.1 :
  step, loss, accurary =      0: 178.722,   0.470
  step, loss, accurary =   1000:  22.757,   0.950
  step, loss, accurary =   2000:  15.717,   0.990
  step, loss, accurary =   3000:  10.343,   1.000
  step, loss, accurary =   4000:   9.234,   1.000
  step, loss, accurary =   5000:   8.950,   1.000
  Network No.2 :
  step, loss, accurary =      0:  14.552,   0.980
  step, loss, accurary =   1000:   7.353,   1.000
  step, loss, accurary =   2000:   5.806,   1.000
  step, loss, accurary =   3000:   5.171,   1.000
  step, loss, accurary =   4000:   5.043,   1.000
  step, loss, accurary =   5000:   4.499,   1.000
accuracy1 =   0.9757
accuracy2 =   0.9744

Note that although the loss rises slightly at the very start of training for Network No. 2, it is considerably smaller than the value at the start of training for No. 1. This shows that, as a result of weight sharing, No. 2 started from parameters that inherit the training result of No. 1. However, the final classification accuracy, accuracy2 = 0.9744, did not improve over accuracy1, so this "ensemble-learning-like" setup turned out to be a failure.

This is only natural when you think about it: the two training runs effectively use one and the same classifier. Simply splitting the training data in two and feeding each half in cannot be expected to improve accuracy the way an ensemble would.

The result of performing the correct ensemble with the independent classifier configuration of case No. 2 is as follows.

Training...
  Network No.1 :
  step, loss, accurary =      0: 178.722,   0.470
  step, loss, accurary =   1000:  15.329,   0.990
  step, loss, accurary =   2000:  12.242,   0.990
  step, loss, accurary =   3000:  10.827,   1.000
  step, loss, accurary =   4000:  10.167,   0.990
  step, loss, accurary =   5000:   8.178,   1.000
  Network No.2 :
  step, loss, accurary =      0: 192.382,   0.570
  step, loss, accurary =   1000:  10.037,   0.990
  step, loss, accurary =   2000:   7.590,   1.000
  step, loss, accurary =   3000:   5.855,   1.000
  step, loss, accurary =   4000:   4.678,   1.000
  step, loss, accurary =   5000:   4.693,   1.000
accuracy1 =   0.9751
accuracy2 =   0.9756
accuracy (model averaged) =   0.9810

As expected, the accuracy (correct-answer rate), which was around 0.975 for each individual classifier, improves slightly to about 0.981 with the model average.
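
The model average can be computed simply by averaging the two softmax outputs before taking the argmax; a sketch (reusing y_pred1, y_pred2, and the label placeholder y_ from above):

y_avg = 0.5 * (y_pred1 + y_pred2)
correct = tf.equal(tf.argmax(y_avg, 1), tf.argmax(y_, 1))
accuracy_avg = tf.reduce_mean(tf.cast(correct, tf.float32))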

(The code created this time has been uploaded to Gist.)

This may have wandered a little off the beaten track, but I hope you got an idea of how to use variable scope and shared variables. There is not much need to manage variables with variable scope in smaller models, but in larger models you will likely want variable scope and shared variables. This is a TensorFlow feature not often found in other deep learning frameworks, so please make use of it!
