[PYTHON] List of activation functions (2020)

Target audience

I have summarized the activation functions that are out there, **including the recent Swish and Mish, and even tanhExp!** This article is aimed at readers who search for a list of activation functions but still can't find a good one. New functions will be added as soon as I find them. If you have information on new functions, or on the functions in the TODO list below, please let me know!

TODO list

-Check the supplementary information of the hardShrink function
-Check the supplementary information of the softShrink function
-Check the supplementary information of the Threshold function
-Check the supplementary information of the logSigmoid function
-Check the supplementary information of the tanhShrink function
-Check the supplementary information of the hardtanh function
-Check the supplementary information of the ReLU6 function
-Check the supplementary information of the CELU function
-Check the supplementary information of the softmin function
-Check the supplementary information of the logSoftmax function
-Investigate some unimplemented Pytorch functions
-What is the Multihead Attention function?
-How is the PReLU function trained?
-Implementation of the RReLU function
-What is the cumulative distribution function that appears in the GELU function?
-Implementation of the GELU function
-Implementation of the Softmax2d function

Table of contents

-[Step function (step)](#step function step)
-[Identity function (identity)](#identity function identity)
-[Bent Identity function](#bent-identity function)
-[hardShrink function](#hardshrink function)
-[softShrink function](#softshrink function)
-[Threshold function](#threshold function)
-[Sigmoid function (sigmoid)](#sigmoid function sigmoid)
-[hardSigmoid function](#hardsigmoid function)
-[logSigmoid function](#logsigmoid function)
-[tanh function](#tanh function)
-[tanhShrink function](#tanhshrink function)
-[hardtanh function](#hardtanh function)
-[ReLU function](#relu function)
-[ReLU6 function](#relu6 function)
-[leaky-ReLU function](#leaky-relu function)
-[ELU function](#elu function)
-[SELU function](#selu function)
-[CELU function](#celu function)
-[Softmax function (softmax)](#softmax function softmax)
-[softmin function](#softmin function)
-[logSoftmax function](#logsoftmax function)
-[softplus function](#softplus function)
-[softsign function](#softsign function)
-[Swish function](#swish function)
-[Mish function](#mish function)
-[tanhExp function](#tanhexp function)
-[Code example](#Code example)

Step function (step)

First, the step function. It is probably the activation function with the longest history. step.png It was used to implement the perceptron back in the day, but it is rarely seen in deep learning now. The reason is that its derivative is $0$ for all real numbers ($x \ne 0$), so the parameters cannot be optimized by backpropagation.

The forward propagation formula is

y = \left\{
  \begin{array}{cc}
    1 & (x \gt 0) \\
    0 & (x \le 0)
  \end{array}
\right.

and the backpropagation formula is, naturally,

\cfrac{\partial y}{\partial x} = 0

The incoming error gets multiplied by this, so nothing flows back at all. Because of this, backpropagation could not be applied, and deep learning was pushed into the shadows for a while.

Identity function

The identity function outputs the input as is. It is used as the activation function of the output layer in regression problems, and has no role in hidden layers. The point of using an activation function even here is **to keep the implementation uniform**, i.e. to avoid special-casing the output layer with conditional branches. identity.png Since the derivative is $1$, the error propagates to the previous layer unchanged. Since the squared error is used as the loss, the value propagated back becomes $y - t$ (see the short derivation at the end of this section).

The forward propagation formula is

y = x

And the back propagation is

\cfrac{\partial y}{\partial x} = 1

You can see that whatever error flows in simply flows straight through!
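Incidentally, the $y - t$ mentioned above follows directly: with the squared error $E = \frac{1}{2}(y - t)^2$,

\cfrac{\partial E}{\partial x} = \cfrac{\partial E}{\partial y} \cfrac{\partial y}{\partial x} = (y - t) \times 1 = y - t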

Bent Identity function

It is a function similar to [Identity function](# identity function identity). However, it is not straight but slightly curved. bent-identity.png

The forward propagation formula is

y = \cfrac{1}{2}(\sqrt{x^2 + 1} - 1) + x

and the backpropagation is

\cfrac{\partial y}{\partial x} = \cfrac{x}{2 \sqrt{x^2 + 1}} + 1

Somehow it resembles the [ReLU function](#relu function) a little (personal impression). I couldn't find any Japanese article introducing it at a glance, so it is probably a fairly minor activation function.

hardShrink function

From Pytorch, just an introduction for the time being. ** Check TODO supplementary information ** hard-shrink.png The forward propagation formula is

y = \left\{
  \begin{array}{cc}
    x & (x \lt -\lambda \quad \textrm{or} \quad \lambda \lt x) \\
    0 & (\textrm{otherwise})
  \end{array}
\right.

And the back propagation is

\cfrac{\partial y}{\partial x} = \left\{
  \begin{array}{cc}
    1 & (x \lt -\lambda \quad \textrm{or} \quad \lambda \lt x) \\
    0 & (\textrm{otherwise})
  \end{array}
\right.

The default value of $\lambda$ is $0.5$.

softShrink function

This is also just an introduction from Pytorch. ** Check TODO supplementary information ** soft-shrink.png The forward propagation formula is

y = \left\{
  \begin{array}{cc}
    x + \lambda & (x \lt -\lambda) \\
    x - \lambda & (x \gt \lambda) \\
    0 & (\textrm{otherwise})
  \end{array}
\right.

And the back propagation is

\cfrac{\partial y}{\partial x} = \left\{
  \begin{array}{cc}
    1 & (x \lt -\lambda \quad \textrm{or} \quad \lambda \lt x) \\
    0 & (\textrm{otherwise})
  \end{array}
\right.

The default value of $\lambda$ here is also $0.5$.

Threshold function

This is just an introduction from Pytorch. ** Check TODO supplementary information ** threshold.png The forward propagation formula is

y = \left\{
  \begin{array}{cc}
    x & (x \gt threshold) \\
    value & (\textrm{otherwise})
  \end{array}
\right.

And the back propagation is

\cfrac{\partial y}{\partial x} = \left\{
  \begin{array}{cc}
    1 & (x \gt threshold) \\
    0 & (\textrm{otherwise})
  \end{array}
\right.

Here, threshold and value are parameters that must be given in advance. For the graph I arbitrarily chose

threshold = -1 \\
value = -2


Sigmoid function (sigmoid)

The sigmoid function is an activation function that was often used when backpropagation first appeared. It is rarely used in hidden layers now, however, and mostly appears in the output layer of binary classification problems, for the reasons described below. sigmoid.png The forward propagation is

y = \cfrac{1}{1 + e^{-x}}

Backpropagation

\cfrac{\partial y}{\partial x} = y(1 - y)

The biggest advantage is that the derivative can be obtained easily from the output. On the other hand, the response to extremely large or small inputs is poor, and because the maximum value of the derivative is $0.25$, stacking layers causes the **vanishing gradient problem**. Also, since it involves an exponential and a division, its computational cost is inevitably higher than that of simple functions like the [ReLU function](#relu function).
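To get a feel for why the $0.25$ cap matters, the gradient that survives $n$ stacked sigmoid layers is at best $0.25^n$; a minimal NumPy illustration:

import numpy as np

x = np.linspace(-6, 6, 1201)
y = 1/(1 + np.exp(-x))
dy = y*(1 - y)                      # derivative written in terms of the output
print(dy.max())                     # 0.25, attained at x = 0

for n in (5, 10, 20):
    print(n, dy.max()**n)           # best-case gradient left after n sigmoid layers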

hardSigmoid function

The hardSigmoid function is a piecewise-linear approximation of the sigmoid function. hard-sigmoid.png Mathematically, the forward propagation is

y = \left\{
  \begin{array}{cc}
    1 & (x \gt 2.5) \\
    0.2x + 0.5 & (-2.5 \le x \le 2.5) \\
    0 & (x \lt -2.5)
  \end{array}
\right.

Backpropagation

\cfrac{\partial y}{\partial x} = \left\{
  \begin{array}{cc}
    0.2 & (-2.5 \le x \le 2.5) \\
    0 & (\textrm{otherwise})
  \end{array}
\right.

It looks like this. There is an article that examines it in detail, so if you want to know more, please have a look! There is apparently a rather involved theoretical reason why the slope of the linear part is $0.2$... ~~I read it but didn't understand it at all~~

logSigmoid function

This is also just an introduction from Pytorch. Takes the logarithm of the [sigmoid function](#sigmoid function sigmoid). ** Check TODO supplementary information ** log-sigmoid.png The forward propagation formula is

y = \log \left( \cfrac{1}{1 + e^{-x}} \right)

And backpropagation

\cfrac{\partial y}{\partial x} = \cfrac{1}{1 + e^x}

Note that the exponent in the denominator of the backward formula is $x$, not $-x$.

tanh function

The tanh function, one of the hyperbolic functions, was proposed as a way to fix the weakness that the maximum derivative of the [sigmoid function](#sigmoid function sigmoid) is only $0.25$. tanh.png As you can see in the figure, the maximum value of its derivative is $1$, which removes that particular cause of vanishing gradients. However, the derivative still becomes $0$ for extremely large or small inputs. The forward propagation is

y = \tanh x = \cfrac{e^x - e^{-x}}{e^x + e^{-x}}

Backpropagation

\cfrac{\partial y}{\partial x} = \textrm{sech}^2 x = \cfrac{1}{\cosh^2 x} = \cfrac{4}{(e^x + e^{-x})^2}

Recently it has also been used as a building block in promising newcomers such as the [Mish function](#mish function) and the [tanhExp function](#tanhexp function), so it seems to be drawing renewed attention.

tanhShrink function

This is also from Pytorch. It's just an introduction. ** Check TODO supplementary information ** tanh-shrink.png The forward propagation formula is

y = x - \tanh x

And the back propagation is

\cfrac{\partial y}{\partial x} = \tanh^2 x


hardtanh function

This is also Pytorch. Introductory only ... ** Check TODO supplementary information ** hard-tanh.png The forward propagation formula is

y = \left\{
  \begin{array}{cc}
    1 & (x \gt 1) \\
    -1 & (x \lt -1) \\
    x & (\textrm{otherwise})
  \end{array}
\right.

And the back propagation is

\cfrac{\partial y}{\partial x} = \left\{
  \begin{array}{cc}
    0 & (x \lt -1 \quad \textrm{or} \quad 1 \le x) \\
    1 & (\textrm{otherwise})
  \end{array}
\right.


ReLU function

The ReLU function (also known as the ramp function) is a comparatively recent activation function that currently dominates. Its strengths are its simplicity and fast computation. ReLU.png The forward propagation formula is

y = \left\{
  \begin{array}{cc}
    x & (x \gt 0) \\
    0 & (x \le 0)
  \end{array}
\right.

Backpropagation

\cfrac{\partial y}{\partial x} = \left\{
  \begin{array}{cc}
    1 & (x \gt 0) \\
    0 & (x \le 0)
  \end{array}
\right.

For positive inputs the gradient is always $1$, so the gradient does not vanish easily and layers can be stacked, but there is also the disadvantage that no learning happens at all for negative inputs. Also, the non-differentiability at $x = 0$ is simply ignored. Since backpropagation propagates gradients via the chain rule, an activation function should ideally be differentiable over all real numbers, but in practice the input almost never lands exactly on $x = 0$, and the gradient there is taken to be $0$ anyway, so it does not matter.

ReLU6 function

Only an introduction from Pytorch. ** Check TODO supplementary information ** ReLU6.png The forward propagation formula is

y = \left\{
  \begin{array}{cc}
    0 & (x \le 0) \\
    6 & (x \ge 6) \\
    x & (\textrm{otherwise})
  \end{array}
\right.

And backpropagation

\cfrac{\partial y}{\partial x} = \left\{
  \begin{array}{cc}
    0 & (x \le 0 \quad \textrm{or} \quad 6 \le x) \\
    1 & (\textrm{otherwise})
  \end{array}
\right.


leaky-ReLU function

The leaky-ReLU function outputs a linear function with a very small slope for negative inputs, to compensate for the ReLU function's drawback that learning does not proceed for negative inputs. leaky-ReLU.png You can hardly see it in the graph, but as the formula shows,

y = \left\{
  \begin{array}{cc}
    x & (x \gt 0) \\
    0.01x & (x \le 0)
  \end{array}
\right.

The output is different when the input is negative. Therefore, backpropagation

\cfrac{\partial y}{\partial x} = \left\{
  \begin{array}{cc}
    1 & (x \gt 0) \\
    0.01 & (x \le 0)
  \end{array}
\right.

The derivative is again discontinuous at $x = 0$ (the function itself is continuous but not differentiable there). Also, while researching I saw it mentioned in several places that using this did not seem to bring any particular benefit. That is a little surprising; you would think it would help at least a little...

ELU function

The ELU function has a graph similar in shape to the [ReLU function](#relu function), but is smoother around $x = 0$. ELU.png As you can see from the graph, even negative inputs do not give a gradient of exactly $0$ (it only approaches $0$ as $x \to -\infty$). The formula is

y = \left\{
  \begin{array}{cc}
    x & (x \ge 0) \\
    \alpha (e^x - 1) & (x \lt 0)
  \end{array}
\right.

And backpropagation

\cfrac{\partial y}{\partial x} = \left\{
  \begin{array}{cc}
    1 & (x \ge 0) \\
    \alpha e^x & (x \lt 0)
  \end{array}
\right.

~~It seems that $\alpha$ often takes the theoretically derived value used in the next [SELU function](#selu function) (probably).~~ **Revised 2020/6/2** The default value of $\alpha$ usually seems to be $1$, so I have replaced the graph with one for $\alpha = 1$. Sorry for spreading wrong information...

SELU function

The SELU function is simply the output of the [ELU function](#elu function) multiplied by $\lambda$. SeLU.png In formulas, the forward propagation is

y = \left\{
  \begin{array}{cc}
    \lambda x & (x \ge 0) \\
    \lambda \alpha (e^x - 1) & (x \lt 0)
  \end{array}
\right.

And backpropagation

\cfrac{\partial y}{\partial x} = \left\{
  \begin{array}{cc}
    \lambda & (x \ge 0) \\
    \lambda \alpha e^x & (x \lt 0)
  \end{array}
\right.

Everything is simply multiplied by $\lambda$. It seems that theoretically optimal parameter values can be derived, and those values are

\alpha = 1.67326\ldots, \quad \lambda = 1.0507\ldots

I may get around to reading the paper eventually... I will add more details once I do.
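As a rough numerical illustration of why those values are called self-normalizing (a minimal sketch assuming standard-normal inputs, not the derivation from the paper): pushing $N(0, 1)$ samples through SELU keeps the mean close to $0$ and the variance close to $1$.

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)          # standard-normal pre-activations

lambda_, alpha = 1.0507, 1.67326
y = np.where(x >= 0, lambda_*x, lambda_*alpha*(np.exp(x) - 1))

print(y.mean(), y.var())                    # both come out close to 0 and 1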

CELU function

This is also an introduction only from Pytorch. ** Check TODO supplementary information ** CeLU.png The forward propagation formula is

y = \left\{
  \begin{array}{cc}
    x & (x \ge 0) \\
    \alpha \left( e^{\frac{x}{\alpha}} - 1 \right) & (\textrm{otherwise})
  \end{array}
\right.

And the formula for backpropagation is

\cfrac{\partial y}{\partial x} = \left\{
  \begin{array}{cc}
    1 & (x \ge 0) \\
    e^{\frac{x}{\alpha}} & (\textrm{otherwise})
  \end{array}
\right.


Softmax function (softmax)

The softmax function is used as the activation function of the output layer in multi-class classification problems. Because of how it is computed, its output can be interpreted as a probability distribution. softmax.png Don't worry too much about the vertical axis of the graph; the only thing that matters is that integrating (summing, since the computer is discrete) gives $1$. Mathematically, it is

y_i = \cfrac{e^{x_i}}{\displaystyle\sum_{k=1}^{n}{e^{x_k}}} \quad (i = 1, 2, \ldots, n)

As for backpropagation, on its own it is

\left( \cfrac{\partial y}{\partial x} \right)_i = e^{x_i} \cfrac{\displaystyle\sum_{k=1}^{n}{e^{x_k}} - e^{x_i}}{\left( \displaystyle\sum_{k=1}^{n}{e^{x_k}} \right)^2}

However, if the **cross entropy error**

E = -\displaystyle\sum_{k=1}^{n}{t_k \log y_k}

is used as the loss, the backpropagation from the output layer back to the previous layer becomes

y - t

which is very simple. By the way, this is no coincidence: the cross entropy error is designed precisely so that, combined with the softmax function, the gradient comes out as $y - t$. I may walk through this with a computational graph someday.
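Here is a quick numerical check of that $y - t$ claim (a minimal sketch; the small helper functions are just for illustration):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())                  # subtract the max for numerical stability
    return e/e.sum()

def cross_entropy(y, t):
    return -np.sum(t*np.log(y))

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
t = np.eye(5)[2]                             # one-hot target

y = softmax(x)
analytic = y - t                             # the claimed gradient dE/dx

eps = 1e-6
numeric = np.array([(cross_entropy(softmax(x + eps*np.eye(5)[i]), t)
                     - cross_entropy(softmax(x - eps*np.eye(5)[i]), t))/(2*eps)
                    for i in range(5)])
print(np.max(np.abs(analytic - numeric)))    # tiny, so dE/dx really is y - t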

softmin function

This is also from Pytorch. It is the opposite of the [softmax function](#softmax function softmax): smaller values are assigned larger probabilities. ** Check TODO supplementary information ** softmin.png The forward propagation formula is

y_i = \cfrac{e^{-x_i}}{\displaystyle\sum_{k=1}^{n}{e^{-x_k}}} \quad (i = 1, 2, \ldots, n)

And the formula for backpropagation is

\left( \cfrac{\partial y}{\partial x} \right)_i = -e^{-x_i} \cfrac{\displaystyle\sum_{k=1}^{n}{e^{-x_k}} - e^{-x_i}}{\left( \displaystyle\sum_{k=1}^{n}{e^{-x_k}} \right)^2}

I wonder whether the error would propagate back just as cleanly if the cross entropy error were used here.

logSoftmax function

From Pytorch, this is the logarithm of the [softmax function](#softmax function softmax). ** Check TODO supplementary information ** log-softmax.png It looks almost straight. I wonder if it matches ... I think the code is correct. The forward propagation formula is

y_i = \log \left( \cfrac{e^{x_i}}{\displaystyle\sum_{k=1}^{n}{e^{x_k}}} \right)

And the back propagation is

\left( \cfrac{\partial y}{\partial x} \right)_i = \cfrac{\displaystyle\sum_{k=1}^{n}{e^{x_k}} - e^{x_i}}{\displaystyle\sum_{k=1}^{n}{e^{x_k}}}


softplus function

The softplus function has a similar name to [softmax function](#softmax function softmax), but is essentially similar to [ReLU function](#relu function). softplus.png

In the formula

y = \log{(1 + e^x)} = \ln{(1 + e^x)}

and the backpropagation is

\cfrac{\partial y}{\partial x} = \cfrac{e^x}{1 + e^x} = \cfrac{1}{1 + e^{-x}}

The function itself looks just like the [ReLU function](#relu function), and its derivative is exactly the [sigmoid function](#sigmoid function sigmoid).

By the way, $\ln x$ is written that way just to make explicit that the base of the logarithm is Napier's number $e$. In other words,

\ln x = \log_ex


softsign function

Again, the name is similar to [softmax function](#softmax function softmax), but in reality it is similar to [tanh function](#tanh function) (forward propagation). softsign.png

The forward propagation looks just like the tanh function, but the backpropagation is quite different: it has a much sharper peak. Writing the forward propagation as a formula,

y = \cfrac{x}{1 + |x|}

And backpropagation

\cfrac{\partial y}{\partial x} = \cfrac{1}{(1 + |x|)^2}

~~At $x = 0$ the derivative becomes a discontinuous function.~~ **Revised 2020/6/2** I was mistaken about the continuity: the derivative is in fact not discontinuous at $x = 0$. Since

\lim_{x \to \pm 0}{\cfrac{1}{(1 + |x|)^2}} = 1
\Leftrightarrow
\lim_{x \to 0}{\cfrac{1}{(1 + |x|)^2}} = 1 

And

\cfrac{\partial y}{\partial x} = \cfrac{1}{(1 + |0|)^2} = 1 \quad (\because x = 0)

So

\lim_{x \to 0}{\cfrac{1}{(1 + |x|)^2}} = \cfrac{1}{(1 + |0|)^2}

holds, the derivative is shown to be continuous at $x = 0$.

Swish function

This is the Swish function, which appeared in 2017 and is expected to be a successor to the [ReLU function](#relu function). Swish.png It looks just like the [ReLU function](#relu function), but unlike the [ELU function](#elu function) and the [SELU function](#selu function) it is smooth at $x = 0$; in fact it is a $C^{\infty}$ function. You can also see that it takes small negative values for negative inputs. A nice property is that it has a minimum value but no maximum. Expressing the forward propagation as a formula,

y = x \sigma_{sigmoid}(\beta x) = \cfrac{x}{1 + e^{-\beta x}}

In the graph above, $\beta = 1$. Incidentally, it seems that $\beta$ itself can be optimized by backpropagation (not implemented here; a rough sketch follows at the end of this section). The backpropagation is

\cfrac{\partial y}{\partial x} = \beta y + \sigma_{sigmoid}(\beta x)(1 - \beta y) = \beta y + \cfrac{1 - \beta y}{1 + e^{-\beta x}}

You can write it like this. You can catch a glimpse of the [sigmoid function](#sigmoid function sigmoid) in it.
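As a supplement, here is a minimal sketch of what training $\beta$ by gradient descent might look like. This is my own illustration, not something from the Swish paper; the class name, the learning rate eta, and updating beta directly inside backward are hypothetical simplifications.

import numpy as np

class SwishTrainableBeta:
    # Hypothetical sketch: Swish whose beta is nudged by gradient descent.
    def __init__(self, beta=1.0, eta=1e-2):
        self.beta = beta
        self.eta = eta                        # learning rate for beta

    def forward(self, x):
        self.x = x
        self.sigma = 1/(1 + np.exp(-self.beta*x))
        self.y = x*self.sigma
        return self.y

    def backward(self, grad):
        # dy/dx, the same expression as in the article
        dx = grad*(self.beta*self.y + self.sigma*(1 - self.beta*self.y))
        # dy/dbeta = x**2 * sigma(beta*x) * (1 - sigma(beta*x))
        dbeta = np.sum(grad*self.x**2*self.sigma*(1 - self.sigma))
        self.beta -= self.eta*dbeta           # simple SGD step on beta
        return dx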

Mish function

The Mish function, proposed in 2019, is another candidate successor to the [ReLU function](#relu function) and is even more recent than the [Swish function](#swish function). The paper reports that it often outperforms the Swish function (I haven't read it properly yet, but that is what it says). Mish.png It looks almost the same as the Swish function, but differs slightly. swish_vs_mish.png The rightmost graph shows the biggest difference: it plots the second derivative, i.e. how much the gradient itself changes. What it shows is that the Mish function changes more dynamically, especially near $x = 0$, so the error is transmitted more strongly in the gradient calculation there. The forward propagation formula is

y = x \tanh{(\varsigma(x))} = x \tanh{(\ln{(1 + e^x)})}

The back propagation is a little complicated

\cfrac{\partial y}{\partial x} = \cfrac{e^x \omega}{\delta^2}\\
\omega = 4(x + 1) + 4e^{2x} + e^{3x} + (4x + 6)e^x \\
\delta = 2e^x + e^{2x} + 2

For this reason, training takes longer than with the ReLU function. However, it often gives better accuracy than the ReLU function, so choose the activation function by weighing the trade-off between training time and accuracy.

tanhExp function

This is the tanhExp function, which @reppy4620 told me about! According to the paper it is from March 2020, so it is extremely recent. tanhExp.png As described in the paper, it is a member of the ReLU family (related to the [ReLU function](#relu function)). Apparently it outperforms the [Mish function](#mish function) on well-known datasets such as MNIST, CIFAR-10, and CIFAR-100 (I haven't read the paper yet). Comparing it with the [Swish function](#swish function): Swish_vs_tanhExp.png The forward outputs look almost identical, but the tanhExp function has a steeper slope around the origin, and the region where its derivative exceeds $1$ is narrower. Gradients are delicate: if the absolute value of the derivative stays below $1$ the gradient quickly vanishes, and if it stays above $1$ the gradient explodes, so the tanhExp function looks good in that respect as well. Next, comparing with the [Mish function](#mish function): Mish_vs_tanhExp.png The Mish function tracks the tanhExp function more closely than the Swish function does, but the tanhExp function still has the steeper slope around $0$. Let's look at the forward propagation as a formula.

y = x \tanh(e^x)

Like the Mish function, it uses the tanh function. The tanh function really does seem to be attracting renewed attention. The backpropagation is

\begin{align}
  \cfrac{\partial y}{\partial x} &= \tanh(e^x) + xe^x\textrm{sech}^2(e^x) \\
  &= \tanh(e^x) - xe^x(\tanh^2(e^x) - 1)
\end{align}

It is nice that this can be computed much more simply than for the Mish function.

Code example

Here is an example of the code used when drawing the graph. Please use it as a reference when implementing. I am using jupyter notebook.

activators.py



import numpy as np


class Activator():
    def __init__(self, *args,**kwds):
        pass
    

    def forward(self, *args,**kwds):
        raise Exception("Not Implemented")
    
    
    def backward(self, *args,**kwds):
        raise Exception("Not Implemented")
    
    
    def update(self, *args,**kwds):
        pass


class step(Activator):
    def forward(self, x, *args,**kwds):
        return np.where(x > 0, 1, 0)
    
    
    def backward(self, x, *args,**kwds):
        return np.zeros_like(x)


class identity(Activator):
    def forward(self, x, *args,**kwds):
        return x
    
    
    def backward(self, x, *args,**kwds):
        return np.ones_like(x)


class bentIdentity(Activator):
    def forward(self, x, *args,**kwds):
        return 0.5*(np.sqrt(x**2 + 1) - 1) + x
    
    
    def backward(self, x, *args,**kwds):
        return 0.5*x/np.sqrt(x**2 + 1) + 1


class hardShrink(Activator):
    def __init__(self, lambda_=0.5, *args,**kwds):
        self.lambda_ = lambda_
        super().__init__(*args,**kwds)
    
    
    def forward(self, x, *args,**kwds):
        return np.where((-self.lambda_ <= x) & (x <= self.lambda_),
                        0, x)
    
    
    def backward(self, x, *args,**kwds):
        return np.where((-self.lambda_ <= x) & (x <= self.lambda_),
                        0, 1)


class softShrink(Activator):
    def __init__(self, lambda_=0.5, *args,**kwds):
        self.lambda_ = lambda_
        super().__init__(*args,**kwds)
    
    
    def forward(self, x, *args,**kwds):
        return np.where(x < -self.lambda_, x + self.lambda_,
                        np.where(x > self.lambda_, x - self.lambda_, 0))
    
    
    def backward(self, x, *args,**kwds):
        return np.where((-self.lambda_ <= x) & (x <= self.lambda_),
                        0, 1)


class threshold(Activator):
    def __init__(self, threshold, value, *args,**kwds):
        self.threshold = threshold
        self.value = value
        super().__init__(*args,**kwds)
    
    
    def forward(self, x, *args,**kwds):
        return np.where(x > self.threshold, x, self.value)
    
    
    def backward(self, x, *args,**kwds):
        return np.where(x > self.threshold, 1, 0)


class sigmoid(Activator):
    def forward(self, x, *args,**kwds):
        return 1/(1 + np.exp(-x))
    
    
    def backward(self, x, y, *args,**kwds):
        return y*(1 - y)


class hardSigmoid(Activator):
    def forward(self, x, *args,**kwds):
        return np.clip(0.2*x + 0.5, 0, 1)
    
    
    def backward(self, x, *args,**kwds):
        return np.where((x > 2.5) | (x < -2.5), 0, 0.2)


class logSigmoid(Activator):
    def forward(self, x, *args,**kwds):
        return -np.log(1 + np.exp(-x))
    
    
    def backward(self, x, *args,**kwds):
        return 1/(1 + np.exp(x))


class act_tanh(Activator):
    def forward(self, x, *args,**kwds):
        return np.tanh(x)
    
    
    def backward(self, x, *args,**kwds):
        return 1 - np.tanh(x)**2


class hardtanh(Activator):
    def forward(self, x, *args,**kwds):
        return np.clip(x, -1, 1)
    
    
    def backward(self, x, *args,**kwds):
        return np.where((-1 <= x) & (x <= 1), 1, 0)


class tanhShrink(Activator):
    def forward(self, x, *args,**kwds):
        return x - np.tanh(x)
    
    
    def backward(self, x, *args,**kwds):
        return np.tanh(x)**2


class ReLU(Activator):
    def forward(self, x, *args,**kwds):
        return np.maximum(0, x)
    
    
    def backward(self, x, *args,**kwds):
        return np.where(x > 0, 1, 0)


class ReLU6(Activator):
    def forward(self, x, *args,**kwds):
        return np.clip(x, 0, 6)
    
    
    def backward(self, x, *args,**kwds):
        return np.where((0 < x) & (x < 6), 1, 0)


class leakyReLU(Activator):
    def __init__(self, alpha=1e-2, *args,**kwds):
        self.alpha = alpha
        super().__init__(*args,**kwds)
    
    
    def forward(self, x, *args,**kwds):
        return np.maximum(self.alpha * x, x)
    
    
    def backward(self, x, *args,**kwds):
        return np.where(x < 0, self.alpha, 1)


class ELU(Activator):
    def __init__(self, alpha=1., *args,**kwds):
        self.alpha = alpha
        super().__init__(*args,**kwds)
    
    
    def forward(self, x, *args,**kwds):
        return np.where(x >= 0, x, self.alpha*(np.exp(x) - 1))
    
    
    def backward(self, x, *args,**kwds):
        return np.where(x >= 0, 1, self.alpha*np.exp(x))


class SELU(Activator):
    def __init__(self, lambda_=1.0507, alpha=1.67326, *args,**kwds):
        self.lambda_ = lambda_
        self.alpha = alpha
        super().__init__(*args,**kwds)
    
    
    def forward(self, x, *args,**kwds):
        return np.where(x >= 0,
                        self.lambda_*x,
                        self.lambda_*self.alpha*(np.exp(x) - 1))
    
    
    def backward(self, x, *args,**kwds):
        return np.where(x >= 0, 
                        self.lambda_,
                        self.lambda_*self.alpha*np.exp(x))


class CELU(Activator):
    def __init__(self, alpha=1., *args,**kwds):
        self.alpha = alpha
        super().__init__(*args,**kwds)
    
    
    def forward(self, x, *args,**kwds):
        return np.where(x >= 0,
                        x,
                        self.alpha*(np.exp(x/self.alpha) - 1))
    
    
    def backward(self, x, *args,**kwds):
        return np.where(x >= 0, 1, np.exp(x/self.alpha))


class softmax(Activator):
    def forward(self, x, *args,**kwds):
        return np.exp(x)/np.sum(np.exp(x))
    
    
    def backward(self, x, *args,**kwds):
        return np.exp(x)*(np.sum(np.exp(x)) 
                          - np.exp(x))/np.sum(np.exp(x))**2


class softmin(Activator):
    def forward(self, x, *args,**kwds):
        return np.exp(-x)/np.sum(np.exp(-x))
    
    
    def backward(self, x, *args,**kwds):
        return -(np.exp(-x)*(np.sum(np.exp(-x)) - np.exp(-x))
                 /np.sum(np.exp(-x))**2)


class logSoftmax(Activator):
    def forward(self, x, *args,**kwds):
        return np.log(np.exp(x)/np.sum(np.exp(x)))
    
    
    def backward(self, x, *args,**kwds):
        y = np.sum(np.exp(x))
        return (y - np.exp(x))/y


class softplus(Activator):
    def forward(self, x, *args,**kwds):
        return np.logaddexp(x, 0)
    
    
    def backward(self, x, *args,**kwds):
        return 1/(1 + np.exp(-x))


class softsign(Activator):
    def forward(self, x, *args,**kwds):
        return x/(1 + np.abs(x))
    
    
    def backward(self, x, *args,**kwds):
        return 1/(1 + np.abs(x)) ** 2


class Swish(Activator):
    def __init__(self, beta=1, *args,**kwds):
        self.beta = beta
        super().__init__(*args,**kwds)
    
    
    def forward(self, x, *args,**kwds):
        return x/(1 + np.exp(-self.beta*x))
    
    
    def backward(self, x, y, *args,**kwds):
        return self.beta*y + (1 - self.beta*y)/(1 + np.exp(-self.beta*x))
    
    
    def d2y(self, x, *args,**kwds):
        return (-0.25*self.beta*(self.beta*x*np.tanh(0.5*self.beta*x) - 2)
                               *(1 - np.tanh(0.5*self.beta*x)**2))


class Mish(Activator):
    def forward(self, x, *args,**kwds):
        return x*np.tanh(np.logaddexp(x, 0))
    
    
    def backward(self, x, *args,**kwds):
        omega = (4*(x + 1) + 4*np.exp(2*x) 
                 + np.exp(3*x) + (4*x + 6)*np.exp(x))
        delta = 2*np.exp(x) + np.exp(2*x) + 2
        return np.exp(x)*omega/delta**2
    
    
    def d2y(self, x, *args,**kwds):
        omega = (2*(x + 2) 
                 + np.exp(x)*(np.exp(x)*(-2*np.exp(x)*(x - 1) - 3*x + 6)
                              + 2*(x + 4)))
        delta = np.exp(x)*(np.exp(x) + 2) + 2
        return 4*np.exp(x)*omega/delta**3


class tanhExp(Activator):
    def forward(self, x, *args,**kwds):
        return x*np.tanh(np.exp(x))
    
    
    def backward(self, x, *args,**kwds):
        tanh_exp = np.tanh(np.exp(x))
        return tanh_exp - x*np.exp(x)*(tanh_exp**2 - 1)
    
    
    def d2y(self, x, *args,**kwds):
        tanh_exp = np.tanh(np.exp(x))
        return (np.exp(x)*(-x + 2*np.exp(x)*x*tanh_exp - 2)
                         *(tanh_exp**2 - 1))


class maxout(Activator):
    def __init__(self, n_prev, n, k, wb_width=5e-2, *args,**kwds):
        self.n_prev = n_prev
        self.n = n
        self.k = k
        self.w = wb_width*np.random.rand(n_prev, n*k)
        self.b = wb_width*np.random.rand(n*k)
        
        super().__init__(*args,**kwds)
    
    
    def forward(self, x, *args,**kwds):
        self.x = x.copy()
        self.z = np.dot(self.w.T, x) + self.b
        self.z = self.z.reshape(self.n, self.k)
        self.y = np.max(self.z, axis=1)
        return self.y
    
    def backward(self, g, *args,**kwds):
        self.dw = np.sum(np.dot(self.w, self.x))
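        # NOTE: maxout (this backward in particular) is still a work in progress.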
test_activators.py



import numpy as np
import matplotlib.pyplot as plt

# Assumes the activator classes above (activators.py) are already defined
# in the same notebook session.


_act_dic = {"step": step,
            "identity": identity,
            "bent-identity": bentIdentity,
            "hard-shrink": hardShrink,
            "soft-shrink": softShrink,
            "threshold": threshold,
            "sigmoid": sigmoid,
            "hard-sigmoid": hardSigmoid,
            "log-sigmoid": logSigmoid,
            "tanh": act_tanh,
            "tanh-shrink": tanhShrink,
            "hard-tanh":hardtanh,
            "ReLU": ReLU,
            "ReLU6": ReLU6,
            "leaky-ReLU": leakyReLU,
            "ELU": ELU,
            "SELU": SELU,
            "CELU": CELU,
            "softmax": softmax,
            "softmin": softmin,
            "log-softmax": logSoftmax,
            "softplus": softplus,
            "softsign": softsign,
            "Swish": Swish,
            "Mish": Mish,
            "tanhExp": tanhExp,
           }


def get_act(name, *args,**kwds):
    for act in _act_dic:
        if name == act:
            activator = _act_dic[name](*args,**kwds)
            break
    else:
        raise ValueError(name, ": Unknown activator")
    
    return activator


def plot_graph(x, name, *args,**kwds):
    activator = get_act(name, *args,**kwds)
    
    y = activator.forward(x, *args,**kwds)
    dx = activator.backward(x, y, *args,**kwds)
    
    plt.plot(x, y, label="forward")
    plt.plot(x, dx, label="backward")
    plt.title(name)
    plt.xlabel("x")
    plt.ylabel("y")
    plt.grid()
    plt.legend(loc="best")
    plt.savefig("{}.png".format(name))
    plt.show()


def vs_plot(x, A, B):
    A_activator = get_act(A)
    B_activator = get_act(B)
    
    y_A = {}
    y_B = {}
    
    y_A["{} y".format(A)] = A_activator.forward(x)
    y_B["{} y".format(B)] = B_activator.forward(x)
    y_A["{} dy".format(A)] = A_activator.backward(x, 
                                                  y_A["{} y".format(A)])
    y_B["{} dy".format(B)] = B_activator.backward(x,
                                                  y_B["{} y".format(B)])
    y_A["{} d2y".format(A)] = A_activator.d2y(x, y_A["{} y".format(A)])
    y_B["{} d2y".format(B)] = B_activator.d2y(x, y_B["{} y".format(B)])
    
    fig, ax = plt.subplots(1, 3, figsize=(18, 6))
    for i, key in enumerate(y_A):
        ax[i].plot(x, y_A[key], label=key)
        ax[i].set_xlabel("x")
        ax[i].set_ylabel("y")
        ax[i].grid()
    for i, key in enumerate(y_B):
        ax[i].plot(x, y_B[key], label=key)
        ax[i].legend(loc="best")
    ax[0].set_title("forward")
    ax[1].set_title("backward")
    ax[2].set_title("second-order derivative")
    fig.tight_layout()
    fig.savefig("{}_vs_{}.png".format(A, B))
    plt.show()


x = np.arange(-5, 5, 5e-2)

plot_graph(x, "step")
plot_graph(x, "identity")
plot_graph(x, "bent-identity")
plot_graph(x, "hard-shrink")
plot_graph(x, "soft-shrink")
plot_graph(x, "threshold", -1, -2)
plot_graph(x, "sigmoid")
plot_graph(x, "hard-sigmoid")
plot_graph(x, "log-sigmoid")
plot_graph(x, "tanh")
plot_graph(x, "tanh-shrink")
plot_graph(x, "hard-tanh")
plot_graph(x, "ReLU")
plot_graph(x + 2, "ReLU6")
plot_graph(x, "leaky-ReLU")
plot_graph(x, "ELU")
plot_graph(x, "SELU")
plot_graph(x, "CELU")
plot_graph(x, "softmax")
plot_graph(x, "softmin")
plot_graph(x, "log-softmax")
plot_graph(x, "softplus")
plot_graph(x, "softsign")
plot_graph(x, "Swish")
plot_graph(x, "Mish")
plot_graph(x, "tanhExp")

vs_plot(x, "Swish", "Mish")
vs_plot(x, "Swish", "tanhExp")
vs_plot(x, "Mish", "tanhExp")

There are a few more that are still being implemented (maxout, for example). I think I'll add them soon...

Reference

-[Artificial intelligence] Different from the type of activation function. merit and demerit.

Addendum list & acknowledgments

-@reppy4620 gave me information about the tanhExp function! Thank you for kindly posting the paper link!

Deep learning series

-Introduction to Deep Learning ~ Basics ~
-Introduction to Deep Learning ~ Coding Preparation ~
-Introduction to Deep Learning ~ Forward Propagation ~
-Thorough understanding of im2col
