This article describes an implementation of the paper "Expressive Number of Two or More Hidden Layer ReLU Neural Networks".
Consider a neural network with two hidden layers, ReLU activation, an n-dimensional input, and a one-dimensional output. If the numbers of hidden neurons are a1 and a2, counted from the input side, then for any a1 * a2 input/output data points there exist network parameters that express those data with zero error. Here, "expressing the input/output data with zero error" means that when each input is fed to the network, it returns exactly the corresponding output value. In short, **whenever the number of training data points in machine learning is at most a1 * a2, a convergence destination of learning always exists**. (Of course, the existence of such a destination does not by itself guarantee that training converges to it.)

For example, 10,000 training samples with one-dimensional outputs can all be expressed with zero error by a ReLU network with two hidden layers of 100 neurons each. This fact does not depend on the input dimension of the data, so the input may be of any dimension. (The case of a multi-dimensional output is described later.) The proof is in the original paper, so I omit it here; instead, this article introduces a program that actually displays one set of parameters realizing such a zero-error fit.
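In the notation of the program's output below (weights $W_1, W_2, W_3$ and biases $b_1, b_2, b_3$), such a network computes

$$
MP(x) = W_3 \operatorname{ReLU}\bigl(W_2 \operatorname{ReLU}(W_1 x + b_1) + b_2\bigr) + b_3,
$$

where $\operatorname{ReLU}$ is applied elementwise, $W_1$ is an $a_1 \times n$ matrix, $W_2$ is $a_2 \times a_1$, and $W_3$ is $1 \times a_2$. The statement above says that for any $a_1 a_2$ pairs $(x_i, y_i)$ whose inputs $x_i$ are pairwise distinct, the parameters can be chosen so that $MP(x_i) = y_i$ for every $i$.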
First, see an execution example. This is a ReLU neural network with a 5-dimensional input, hidden layers of 3 and 3 neurons (in order from the input side), and a 1-dimensional output; the program outputs parameters that represent 3 * 3 = 9 data points.
```
$ python3 NNH2.py
( 5 , 3 , 3 , 1 ) ReLU neural network
the number of data = 9
input =
[[93 59 58 20 80 57 35 21 38]
[ 4 91 47 69 98 85 68 2 15]
[14 60 31 86 37 12 23 69 42]
[ 4 14 52 98 72 60 67 51 90]
[27 12 6 32 76 63 49 41 28]]
output =
[ 1 81 17 65 25 33 45 77 10]
parameters
1st layer
W1 =
[[Fraction(823849691, 1) Fraction(4336051, 1) Fraction(28907, 1)
Fraction(149, 1) Fraction(1, 1)]
[Fraction(823849691, 1) Fraction(4336051, 1) Fraction(28907, 1)
Fraction(149, 1) Fraction(1, 1)]
[Fraction(823849691, 1) Fraction(4336051, 1) Fraction(28907, 1)
Fraction(149, 1) Fraction(1, 1)]]
b1 =
[[Fraction(-16778681974, 1)]
[Fraction(-60502822101, 2)]
[Fraction(-48495714637, 1)]]
2nd layer
W2 =
[[Fraction(148, 1) Fraction(-9237952317912, 35473138591)
Fraction(4049615396998340012232, 18010928872046123981)]
[Fraction(15800556778618364518367199397870934943209115691793, 1077111270972508432064314372084032376028236629)
Fraction(-11216317162890245084133171070123933902029519034081603343913003835232890, 434989745706342223538442515057047029191074444247675999926788518821)
Fraction(2686834406446276617746833568279126654074919365089537293846487174104542, 224870468718735937295513181927691380500164378143178284849730556247)]
[Fraction(843610077776665412761527367413211990104912799146270139270113712675305808525969059554219936165800, 154068217051841137536218687813904006692418581384372306328955832146429671252437697)
Fraction(-57625119985396507975392986118005960657818304694554844951150042194684795633743261029837087833785575305876950, 5262027877233168541064334417747189692030849998640803800770876843784630220816492104662661549)
Fraction(100319159657643248312549073786161213853603634088990731217920689495343417295177190966187025547162361565044553868359914, 11390289225651438251169009216834446835092039706616191741035899343815780751119047453235035761689895223)]]
b2 =
[[Fraction(-99, 1)]
[Fraction(-31282288621675736677206673372946695670689268182842, 4042938352270697695885621047346217759)]
[Fraction(-912723226773529956403403228639959057460803178243124784475781762180477870767754518136094, 13495462068042154164632946308945540668991317744154109409049607)]]
3rd layer
W3 =
[[ 1 -1 1]]
b3 =
[[16]]
check
MP(x) = [Fraction(1, 1), Fraction(81, 1), Fraction(17, 1), Fraction(65, 1), Fraction(25, 1), Fraction(33, 1), Fraction(45, 1), Fraction(77, 1), Fraction(10, 1)]
output= [ 1 81 17 65 25 33 45 77 10]
MP(x) - output = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```
To explain in order: first, since any 3 * 3 = 9 input/output data points can be expressed, the data are given as follows.
```
input =
[[93 59 58 20 80 57 35 21 38]
[ 4 91 47 69 98 85 68 2 15]
[14 60 31 86 37 12 23 69 42]
[ 4 14 52 98 72 60 67 51 90]
[27 12 6 32 76 63 49 41 28]]
output =
[ 1 81 17 65 25 33 45 77 10]
```
This means, for example, that the output corresponding to the input column [93, 4, 14, 4, 27] is 1, and that nine such input/output pairs are lined up as the columns of `input` and the entries of `output`. In this example the data are generated randomly, but you can also supply specific data, as sketched below.
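For instance, the data above could be hard-coded instead of generated randomly. This is only a sketch of the idea, assuming `idata` is stored as an `idim` × `N` NumPy array and `odata` as a length-`N` vector (the actual generator functions `randomidata`/`randomodata` appear later in this article):

```python
import numpy as np

# Hypothetical replacement for the random data generation:
# idata must be an idim x N matrix whose columns are the N input vectors
# (no two columns may be identical); odata holds the N corresponding outputs.
idata = np.array([[93, 59, 58, 20, 80, 57, 35, 21, 38],
                  [ 4, 91, 47, 69, 98, 85, 68,  2, 15],
                  [14, 60, 31, 86, 37, 12, 23, 69, 42],
                  [ 4, 14, 52, 98, 72, 60, 67, 51, 90],
                  [27, 12,  6, 32, 76, 63, 49, 41, 28]])
odata = np.array([1, 81, 17, 65, 25, 33, 45, 77, 10])
```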
For the given data, the program outputs the weights and biases of each layer, i.e. the parameters of a neural network that expresses the data: `W1` and `b1` for the first layer, `W2` and `b2` for the second layer, and `W3` and `b3` for the third layer. (`Fraction(n, m)` denotes the rational number $\frac{n}{m}$.)
```
1st layer
W1 =
[[Fraction(823849691, 1) Fraction(4336051, 1) Fraction(28907, 1)
Fraction(149, 1) Fraction(1, 1)]
[Fraction(823849691, 1) Fraction(4336051, 1) Fraction(28907, 1)
Fraction(149, 1) Fraction(1, 1)]
[Fraction(823849691, 1) Fraction(4336051, 1) Fraction(28907, 1)
Fraction(149, 1) Fraction(1, 1)]]
b1 =
[[Fraction(-16778681974, 1)]
[Fraction(-60502822101, 2)]
[Fraction(-48495714637, 1)]]
...
```
The numerators and denominators of the parameters become very large, but for each of the nine inputs the output of the neural network reduces exactly to the desired integer:
MP(x) = [Fraction(1, 1), Fraction(81, 1), Fraction(17, 1), Fraction(65, 1), Fraction(25, 1), Fraction(33, 1), Fraction(45, 1), Fraction(77, 1), Fraction(10, 1)]
Comparing this with the original output data,
MP(x) - output = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
you can see that the error is 0 for every data point. (The differences are converted to `float` for readability.)
This result does not depend on the input/output data, the input dimension, or the number of hidden neurons: the error should be exactly 0 for any choice of values.
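As an independent sanity check (a sketch of my own, not part of the published code), the printed parameters can be plugged straight into the network map using exact `Fraction` arithmetic, so that no rounding occurs:

```python
from fractions import Fraction

def relu(v):
    # elementwise ReLU on a plain Python list
    return [max(x, 0) for x in v]

def affine(W, b, v):
    # W: list of rows, b: list of biases -> W v + b
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def MP(x, W1, b1, W2, b2, W3, b3):
    # MP(x) = W3 ReLU(W2 ReLU(W1 x + b1) + b2) + b3
    h1 = relu(affine(W1, b1, x))
    h2 = relu(affine(W2, b2, h1))
    return affine(W3, b3, h2)

# e.g. feeding the first input column [93, 4, 14, 4, 27] (as Fractions), together
# with the printed W1, b1, ..., b3, should return exactly [Fraction(1, 1)].
```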
The algorithm follows the proof of Theorem 3 of the original paper and the proof of Theorem 3 of ["Representative ability of neural network based on the number of expressible data"](https://search.ieice.org/bin/index.php?category=D&lang=J&vol=J102-D&num=6&abst=). Specifically, the program computes one solution of the system of simultaneous equations that appears in the proof. It is rather involved because mutually recursive two-variable recurrence relations appear, so please see the original paper for the details. The Python implementation code is [here](https://github.com/nekonistyle/Python).
```python
# constants
idim = 5     # the number of input neurons
a1 = 3       # the number of 1st hidden neurons
a2 = 3       # the number of 2nd hidden neurons
N = a1 * a2  # the number of data (do not change)
```
`idim` is the input dimension, and `a1` and `a2` are the numbers of hidden neurons in the first and second hidden layers, respectively. `N = a1 * a2` is the number of data points and must not be changed.
```python
idata = randomidata(idim,idataRange) # input data must be unique
odata = randomodata(odataRange)
```
`idata` is the input data and must be an `idim` × `N` matrix with no two identical columns; `odata` is the output data and must be an `N`-dimensional vector.
In this example the random data are drawn from a limited range (`idataRange`, `odataRange`), but no such restriction is actually necessary: the parameters can be computed for arbitrary values.
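If you supply your own data, the requirement that no two columns of `idata` coincide is easy to verify; a small helper of my own (not part of the repository) might look like this:

```python
import numpy as np

def has_unique_columns(idata):
    # True iff no two columns of idata are identical
    cols = [tuple(col) for col in np.asarray(idata).T]
    return len(cols) == len(set(cols))
```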
```python
# select division operator ('Fraction' or '/')
divop = fractions.Fraction
# divop = lambda x,y: x / y
```
Since only the four basic arithmetic operations are used to compute the parameters, the division operator `divop` is defined as the exact rational operation `Fraction`, but it can be changed to floating-point division `/`. Be careful, however: the computation may divide by very large numbers, so with `/` the rounding error can become large.
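A toy illustration (not from the repository) of why the exact `Fraction` choice matters: rational arithmetic never rounds, whereas floating-point division rounds every intermediate result, and those errors can accumulate when very large numbers are involved.

```python
from fractions import Fraction

# Exact rational arithmetic: no rounding at any intermediate step.
print(Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10))  # True
# Floating-point arithmetic rounds each intermediate result.
print(0.1 + 0.2 == 0.3)                                      # False
```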
By Theorem 4 of the original paper, if the output is m-dimensional, $a_1 (a_2 \operatorname{div} m) + a_2 \operatorname{mod} m$ data points can be expressed. (This case is not implemented, but it can be handled in the same way by reusing the parameters of the one-dimensional-output construction.) In particular, if the number of second-layer hidden neurons is a multiple of the output dimension, any data set satisfying [number of training data] × [output dimension] ≤ [number of first-layer hidden neurons] × [number of second-layer hidden neurons] can be expressed. For example, with 10,000 training samples and a 10-dimensional output, all the data can be expressed by a neural network whose hidden layers have 400 and 250 neurons.
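For reference, the count from Theorem 4 is easy to evaluate; a small helper of my own (not part of the repository):

```python
def expressible_number(a1, a2, m):
    # Theorem 4: with a1 and a2 hidden neurons and an m-dimensional output,
    # a1 * (a2 div m) + (a2 mod m) data points can always be expressed.
    return a1 * (a2 // m) + a2 % m

print(expressible_number(3, 3, 1))       # 9, the setting of the example above
print(expressible_number(400, 250, 10))  # 10000, the setting mentioned in the text
```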
In this article I explained a program that, for a ReLU neural network with two hidden layers of a1 and a2 neurons, outputs one set of parameters expressing any given a1 * a2 data points. The parameters produced by this program tend to have very large absolute values, which appears to be a consequence of using parameter formulas that must work for arbitrarily skewed data. Fortunately, the parameters representing a given data set are in general not unique, so some data may also be representable by parameters with much smaller absolute values.