The relationship between brain science and unsupervised learning. Information-maximization unsupervised learning on MNIST: Google Colaboratory (PyTorch)

In this article, I explain brain science and unsupervised learning, and walk through an implementation of IIC, a method that achieves high performance on MNIST with unsupervised learning.

The paper this article focuses on is Invariant Information Clustering for Unsupervised Image Classification and Segmentation.

In this paper, an index called **mutual information** is used to classify handwritten digit images by unsupervised clustering.

It is called IIC (Invariant Information Clustering).

In this article, I explain brain science and unsupervised learning, the key points of IIC, and an implementation example on MNIST.

The table of contents is as follows.

  1. Brain science and unsupervised learning
  2. Focus on unsupervised learning in the artificial intelligence area
  3. Points of the proposed method IIC
  4. Implementation of IIC with MNIST (Google Colaboratory and PyTorch)

The implementation code "MNIST_IIC.ipynb" is available here: Implementation code repository

1. Brain science and unsupervised learning

The recent AI boom has been driven by deep learning technology.

However, most deep learning is supervised learning or reinforcement learning (such as AlphaGo).

Using deep learning for unsupervised learning is not common (dimensionality reduction, typified by autoencoders, does appear, but clustering is rare).

Turning to brain science, the book by Professor Kenji Doya of OIST (Okinawa Institute of Science and Technology Graduate University), "Invitation to Computational Neuroscience: Toward Understanding the Learning Mechanisms of the Brain", suggests the relationship between the brain and the three learning paradigms of artificial intelligence shown in the figure below.

l_bit201806011320437219.jpg

Image quote: Why can brain circuit modules be connected well--Professor Kenji Doya, Okinawa Institute of Science and Technology

It is the development of the cerebral cortex (especially the frontal lobe) that sets human intelligence apart from that of other organisms. **And in the cerebral cortex, the importance of unsupervised learning is suggested.**

The work of Hubel and Wiesel, which won the Nobel Prize in Physiology or Medicine in 1981, is well known among brain science researchers.

By inserting electrodes into nerve cells in a cat's brain and measuring their activity while presenting various symbols and moving bars, **they revealed that nerve cells in the visual cortex respond selectively to the presented object**: each cell is tuned to the position of the stimulus, the orientation of the bar (vertical, horizontal, diagonal), and so on.

The figure below depicts a neuron that responds strongly to a vertical bar.

1024px-Orientation_V1.svg.png Orientation selectivity

In other words, there are nerve cells tuned to horizontal bars and nerve cells tuned to vertical bars, and the activity of these neurons is combined in complex ways to recognize objects.

Then, in the 21st century, **a brain region that fires in response to face images, and even one that fires in response to a specific celebrity, were discovered.** Mechanism of face recognition - from the October 2017 issue of Nikkei Science

Because of this background (the existence of neurons in the cerebral cortex that respond selectively to specific objects), Google's **announcement in the early days of deep learning that a neuron responding to cat faces had emerged in the hidden layers of a deep network** was big news. Using large-scale brain simulations for machine learning and A.I.

It suggested that deep learning had acquired characteristics similar to those of real brains.

There is another well-known brain experiment that concerns unsupervised learning in particular.

The Blakemore and Cooper experiments. They showed that **if a kitten is raised for a period in an environment where it can see only vertical lines, it becomes unable to recognize horizontal lines for some time afterwards**.

The environment is like the picture below.

catexperiment.gif

Cats and Vision: is vision acquired or innate?

In other words, Hubel and Wiesel revealed the existence of neurons that respond selectively to vertical or horizontal bars, and Blakemore and Cooper showed that these neurons emerge in the cerebral cortex according to the scenes that enter the kitten's eyes during development.

The development of the kitten's visual cortex is thus a finding from brain science suggesting that **unsupervised learning plays an important role in object recognition**.

2. Focus on unsupervised learning in the artificial intelligence area

Among AI researchers as well, there are prominent remarks emphasizing the importance of unsupervised learning over supervised deep learning.

In 2017, Okanohara of PFN quoted Professor Geoffrey Hinton in his blog post "Five years after the counterattack of the neural network":

"There are 10^14 synapses in the brain, but a person lives only about 10^9 seconds. The number of parameters (these synapses) is far larger than the number of samples: to determine the weights we would need 10^5 constraints per second. This leads to the idea that the brain is doing a great deal of unsupervised learning."

Hinton's group has indeed pursued such research, announcing SimCLR (A Simple Framework for Contrastive Learning of Visual Representations) in February 2020.

Geoffrey Hinton & Google Brain Unsupervised Learning Algorithm Improves SOTA Accuracy on ImageNet by 7%

illustration-of-the-proposed-SimCLR-framework.gif

Image source and explanation of SimCLR: SimCLR: Improving the performance of self-supervised learning by contrastive learning

In SimCLR, **two differently transformed versions of the same image are fed to the network, which learns that they are the same while learning that other images are different; in this way, a method for extracting image features is acquired by unsupervised learning.**

In addition, at AAAI in February 2020, Yann LeCun said that the next innovation in deep learning will come not from supervised learning but from "unsupervised learning", which extracts features from data without correct-answer labels, and "self-supervised learning", which creates correct answers from the training data itself. "These are the same tasks that newborn babies perform on the world," LeCun explains: babies can learn by themselves without being given the "correct answer".

What are the current AI shortcomings pointed out by the three "Godfathers" of deep learning

If I keep talking about brain science and AI we will never reach the implementation, so I will stop here.

In addition, in April 2020 Hinton published the paper "Backpropagation and the brain" in Nature Reviews Neuroscience. Backpropagation and the brain

It introduces the NGRAD hypothesis ([Neural Gradient Representation by Activity Differences](https://syncedreview.com/2020/04/23/new-hinton-nature-paper-revisits-backpropagation-offers-insights-for-understanding-learning-in-the-cortex/)).

image-78.png

Other recommended past articles about brain science and AI:

● Conversation between brain scientists and IT engineers about DL and general-purpose artificial intelligence
● Business application of deep reinforcement learning and how to make AI understand natural language

3. Points of the proposed method IIC

Now, I will explain the key points of IIC (Invariant Information Clustering), proposed in the paper Invariant Information Clustering for Unsupervised Image Classification and Segmentation.

Note that the IIC paper itself contains no discussion of how the brain works; it is purely about the computational method.

There are two points in IIC.

**The first point is the input.** Two inputs are used: the data itself (this time, a handwritten digit image) and an appropriately transformed version of it. Each is fed into the network to obtain its own output.

IIC's deep learning network looks the same as one for supervised learning. The output layer has 10 neurons (corresponding to the digits 0 to 9), and a softmax turns the output into the probability that the input image is each of the digits 0 to 9.

**The second point is the loss function.** Since this is unsupervised learning, no teacher labels are used. Instead, we compute the **mutual information** between the output vector (10 elements) produced by feeding the handwritten digit image into the network and the output vector (10 elements) produced by feeding in the slightly, randomly transformed image, and we train the network to maximize it, as the formula below shows.
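
Writing the network as Φ and the random transformation as g, IIC's objective (as given in the paper) is:

$$ \max_{\Phi}\; I\big(\Phi(x),\, \Phi(g(x))\big) $$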

Therefore, to understand IIC you need to get comfortable with mutual information.

Below is a slide explaining mutual information.

図1.png

Some difficult-looking formulas are lined up, but **in the end, the mutual information is simply the information shared by two pieces of data**.

The key is the **joint distribution and the marginal distributions** inside the log, as shown below.
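
For two discrete random variables X and Y, the standard definition is:

$$ I(X;Y) = \sum_{x}\sum_{y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} $$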

In the slide above, Example 1 shows two independent coin tosses, and Example 2 shows two coin tosses that are not independent. (That is, in Example 2 the second coin always shows the same result as the first.)

The slide uses coins, which have two outcomes, heads and tails.

In the MNIST neural network, there are 10 outputs, 0 through 9.

Think of the two outputs of the MNIST network as the two coin tosses above.

In Example 1, where the two coins are independent, the probability of (tails, tails) is 0.25. Marginalizing, the probability that the first coin shows tails is P(tails, tails) + P(tails, heads) = 0.5, and similarly the probability that the second coin shows tails is 0.5. Plugging these into the log gives log(0.25 / (0.5 × 0.5)) = log 1 = 0.

Computing not only (tails, tails) but also (tails, heads), (heads, tails), and (heads, heads) gives 0 in every case, so the total is 0.

Therefore, if you toss two independent coins, the mutual information between the first and second results is 0.

Since the tosses are independent and the first result tells you nothing about the second, there is no information to share, so it should feel natural that the result is 0.

Now consider the mutual information in Example 2, where the second coin is not independent but always shows the same result as the first.

The probability of (tails, tails) is 0.5. Marginalizing, the probability that the first coin shows tails is P(tails, tails) + P(tails, heads) = 0.5, and likewise 0.5 for the second. The term inside the log becomes 0.5 / (0.5 × 0.5) = 2, and log 2 ≈ 0.69. Multiplying by the 0.5 probability of (tails, tails) gives about 0.35. (tails, heads) contributes 0, (heads, heads) contributes about 0.35 just like (tails, tails), and (heads, tails) contributes 0, so the total is about 0.69.

Therefore, for coin tosses that are not independent, the mutual information is larger than 0.

In other words, the first toss and the second toss share information. Since this time the second coin always matches the first, this should be intuitive.
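
As a sanity check, here is a minimal NumPy sketch (my own, not from the article's repository) that reproduces the two coin examples:

import numpy as np

def mutual_information(p_joint):
    """Mutual information (in nats) of a discrete joint distribution."""
    p_i = p_joint.sum(axis=1, keepdims=True)  #Marginal of the first coin
    p_j = p_joint.sum(axis=0, keepdims=True)  #Marginal of the second coin
    nonzero = p_joint > 0                     #Treat 0 * log 0 as 0
    ratio = p_joint[nonzero] / (p_i * p_j)[nonzero]
    return (p_joint[nonzero] * np.log(ratio)).sum()

#Example 1: two independent coins -> mutual information 0
p_independent = np.array([[0.25, 0.25],
                          [0.25, 0.25]])
print(mutual_information(p_independent))  # 0.0

#Example 2: the second coin always equals the first -> log 2 = 0.69
p_identical = np.array([[0.5, 0.0],
                        [0.0, 0.5]])
print(mutual_information(p_identical))  # 0.6931...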

In IIC, think of this coin tosser as the 10-way (0 to 9) output of the network for an MNIST image.

And the second coin corresponds to the MNIST image after a suitable transformation.

A suitable transformation here means rotating and stretching the image with an affine transform and adding a little noise to it.

**We want to train the network so that the slightly transformed image and the original image produce the same probability outputs over the digits 0 to 9.**

That is the intuition behind IIC.

An image that has undergone a slight transformation is, in effect, a simulation of another image of the same class.

There are two points here.

First, the outputs of the network do not correspond to the digits 0 to 9 in order; in fact, they need not correspond to digits at all. **The network simply groups similar images into the same class.**

The second point: you might worry that everything ends up in the same class, but recall the mutual information of the coin tosses. The second coin in Example 2 showed the same result as the first.

The mutual information is larger for matched coins that split evenly between heads and tails than for coins that always come up tails.

With coins that always come up tails, the probability of (tails, tails) is 1.0. Marginalizing, the probability that the first coin shows tails is P(tails, tails) + P(tails, heads) = 1.0, and likewise 1.0 for the second. The log term then becomes log(1.0 / (1.0 × 1.0)) = log 1 = 0.0.

**To maximize the mutual information, the probabilities should be spread evenly over the possible outcomes: evenly over two outcomes if there are two, and evenly over ten outcomes if there are ten.**
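
For reference, this follows from the standard decomposition of mutual information (a textbook fact, not something specific to the IIC paper):

$$ I(X;Y) = H(X) - H(X \mid Y) \le H(X) \le \log k $$

The maximum log k is attained when the two outputs always agree (so H(X|Y) = 0) and the marginal distribution over the k classes is uniform: agreement alone is not enough, the classes must also be used evenly.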

Therefore, computed over a mini-batch, the 10 classes into which the data naturally separates end up evenly populated.

These are the key points of IIC.

Now let's move on to the implementation.

4. Implementation of IIC with MNIST (Google Colaboratory and PyTorch)

The environment uses Google Colaboratory and the framework uses PyTorch.

Implement IIC for MNIST.

The implementation code "MNIST_IIC.ipynb" is available here: Implementation code repository

I implement it with reference to https://github.com/RuABraun/phone-clustering, but with many changes.

First, fix the random seeds.

#Fix the random number seeds
import os
import random
import numpy as np
import torch

SEED_VALUE = 1234  #This can be anything
os.environ['PYTHONHASHSEED'] = str(SEED_VALUE)
random.seed(SEED_VALUE)
np.random.seed(SEED_VALUE)
torch.manual_seed(SEED_VALUE)  #When using PyTorch

Next, check GPU availability. In Google Colaboratory, select "Change runtime type" from "Runtime" in the top menu and switch the hardware accelerator from None to GPU.

#When GPU is available, use GPU (in case of Google Colaboratory, specify GPU from runtime)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)  

#Use GPU. Confirm that cuda is output.

Download the MNIST images and wrap them in PyTorch DataLoaders, one for training and one for testing.

#Download the MNIST image and make it a DataLoader (Train and Test)
from torchvision import datasets, transforms

batch_size_train = 512

train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('.', train=True, download=True,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                   ])),
    batch_size=batch_size_train, shuffle=True, drop_last=True)
# drop_last discards the last mini-batch if it is smaller than the specified batch size


test_loader = torch.utils.data.DataLoader(
    datasets.MNIST('.', train=False, transform=transforms.Compose([
        transforms.ToTensor(),
    ])),
    batch_size=1024, shuffle=False)

Next is the IIC deep learning model. Its architecture is the same as for ordinary supervised learning, but it is made heavier than a plain supervised model so that the network can acquire richer representations.

The final output layer has the 10 classes that we hope will correspond to the digits 0 to 9, and, separately, a technique called overclustering is also used.

Overclustering classifies the data into more clusters than the 10 we expect.

So there are two final output heads, the 10-class version and the overclustering version, and the loss function computes the output of both and uses their sum.

The expectation is that if the overclustering head can capture fine-grained variations, the performance of the usual 10-class head improves as well.

In the IIC paper, this output layer is further multiplexed into multiple heads. This prevents failures caused by unlucky initializations of the output layer, but I omitted it here because it adds little insight and complicates the implementation.

#Deep learning model
import torch.nn as nn
import torch.nn.functional as F

OVER_CLUSTRING_Rate = 10  #Also prepare overclustering, which classifies into more clusters


class NetIIC(nn.Module):
    def __init__(self):
        super(NetIIC, self).__init__()

        self.conv1 = nn.Conv2d(1, 128, 5, 2, bias=False)
        self.bn1 = nn.BatchNorm2d(128)
        self.conv2 = nn.Conv2d(128, 128, 5, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(128)
        self.conv3 = nn.Conv2d(128, 128, 5, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(128)
        self.conv4 = nn.Conv2d(128, 256, 4, 1, bias=False)
        self.bn4 = nn.BatchNorm2d(256)
        
        # 10 classes, which we hope will correspond to the digits 0 to 9
        self.fc = nn.Linear(256, 10)

        # overclustering
        #By clustering into more classes than actually assumed, the network can capture fine-grained variations
        self.fc_overclustering = nn.Linear(256, 10*OVER_CLUSTRING_Rate)

    def forward(self, x):
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        x = F.relu(self.bn3(self.conv3(x)))
        x = F.relu(self.bn4(self.conv4(x)))
        x_prefinal = x.view(x.size(0), -1)
        y = F.softmax(self.fc(x_prefinal), dim=1)

        y_overclustering = F.softmax(self.fc_overclustering(
            x_prefinal), dim=1)  # overclustering

        return y, y_overclustering

Define the model weight initialization function.

import torch.nn.init as init


def weight_init(m):
    """Weight initialization"""
    if isinstance(m, nn.Conv2d):
        init.xavier_normal_(m.weight.data)
        if m.bias is not None:
            init.normal_(m.bias.data)
    elif isinstance(m, nn.BatchNorm2d):
        init.normal_(m.weight.data, mean=1, std=0.02)
        init.constant_(m.bias.data, 0)
    elif isinstance(m, nn.Linear):
        # Xavier
        #init.xavier_normal_(m.weight.data)

        # He 
        init.kaiming_normal_(m.weight.data)
        
        if m.bias is not None:
            init.normal_(m.bias.data)

Next, define the transformation that creates the image paired with each input image. A random affine transform rotates and stretches the image, and then noise is added to each pixel.

#Definition of a function that adds noise to data
import torchvision as tv
import torchvision.transforms.functional as TF


def perturb_imagedata(x):
    y = x.clone()
    batch_size = x.size(0)

    #Apply a random affine transformation (up to 15 degrees of rotation, up to 20% translation, scaling between 0.2 and 0.75)
    trans = tv.transforms.RandomAffine(15, (0.2, 0.2,), (0.2, 0.75,))
    for i in range(batch_size):
        y[i, 0] = TF.to_tensor(trans(TF.to_pil_image(y[i, 0])))

    #Add noise
    noise = torch.randn(batch_size, 1, x.size(2), x.size(3))
    div = torch.randint(20, 30, (batch_size,),
                        dtype=torch.float32).view(batch_size, 1, 1, 1)
    y += noise / div

    return y

And now the computation of mutual information, the heart of IIC. It computes the mutual information exactly as explained in section 3.

We want to maximize the mutual information, but to use it as a loss we multiply it by -1 and turn it into a minimization problem.

Also, following "Simple method to get MNIST correct answer rate of 97% or more by unsupervised learning (without transfer learning)", a coefficient term is added to the mutual information calculation, making it easier for the classes to spread apart.

That article carefully walks through an implementation in TensorFlow2 and is highly recommended for TensorFlow users.

#Definition of the loss function for IIC
#Reference: https://github.com/RuABraun/phone-clustering/blob/master/mnist_basic.py
import sys


def compute_joint(x_out, x_tf_out):

    # x_out and x_tf_out are torch.Size([512, 10]). Multiply them to obtain the joint distribution, torch.Size([512, 10, 10]).
    # torch.Size([512, 10, 1]) * torch.Size([512, 1, 10])
    p_i_j = x_out.unsqueeze(2) * x_tf_out.unsqueeze(1)
    # p_i_j is torch.Size([512, 10, 10])

    #Sum over the mini-batch ⇒ torch.Size([10, 10])
    p_i_j = p_i_j.sum(dim=0)

    #Add the transpose and divide by 2 (symmetrize) ⇒ torch.Size([10, 10])
    p_i_j = (p_i_j + p_i_j.t()) / 2.

    #Normalize ⇒ torch.Size([10, 10])
    p_i_j = p_i_j / p_i_j.sum()

    return p_i_j
    #In the end, p_i_j is the probability table, accumulated over the whole mini-batch, of the 10 × 10 = 100 patterns of (class of the original image, class of the transformed image)


def IID_loss(x_out, x_tf_out, EPS=sys.float_info.epsilon):
    # torch.Size([512, 10]); the trailing 10 is the number of classes, so 100 for the overclustering head
    bs, k = x_out.size()
    p_i_j = compute_joint(x_out, x_tf_out)  # torch.Size([10, 10])

    #From the joint distribution table, sum over the 10 classes of the transformed image (marginalize) to get the marginal distribution of the original image alone
    p_i = p_i_j.sum(dim=1).view(k, 1).expand(k, k)
    #From the joint distribution table, sum over the 10 classes of the original image (marginalize) to get the marginal distribution of the transformed image alone
    p_j = p_i_j.sum(dim=0).view(1, k).expand(k, k)

    #Keep values close to 0 out of the log, since the log diverges there
    #p_i_j[(p_i_j < EPS).data] = EPS
    #p_j[(p_j < EPS).data] = EPS
    #p_i[(p_i < EPS).data] = EPS
    #The reference GitHub implementation above raises an error with PyTorch 1.3 and later, so it is commented out
    # https://discuss.pytorch.org/t/pytorch-1-3-showing-an-error-perhaps-for-loss-computed-from-paired-outputs/68790/3

    #Clamp values below EPS so that the log does not diverge
    p_i_j = torch.where(p_i_j < EPS, torch.tensor(
        [EPS], device=p_i_j.device), p_i_j)
    p_j = torch.where(p_j < EPS, torch.tensor([EPS], device=p_j.device), p_j)
    p_i = torch.where(p_i < EPS, torch.tensor([EPS], device=p_i.device), p_i)

    #Compute the mutual information from the joint and marginal probabilities of the original and transformed images,
    #multiplied by -1 to turn maximization into a minimization problem
    """
I want to maximize the amount of mutual information
⇒ After all, x_out, x_tf_I want more information to be shared by out
⇒ The point is x_out, x_tf_I want out to be together

    p_i_j is x_out, x_tf_With the joint probability distribution of out, mini-batch is as much as possible, various patterns of 10 × 10, I am happy that it is uniform evenly
    
First half section, torch.log(p_i_j)Is p_If ij is close to 1, it will be a large value (close to 0).
If any of them is 1 and does not vary with 0, log0 will have a small value (large negative value).
In other words, the first half of the term

The latter term is a term for calculating which of the 10 original images or converted images will be marginalized.
If the first half term becomes smaller by subtracting the marginalized 10x10 pattern,
    x_out and x_tf_out did not share much information.
    """
    #Add the weight alpha, with reference to
    # https://qiita.com/Amanokawa/items/0aa24bc396dd88fb7d2a
    #Weakening the penalty on a spread-out joint distribution table makes the joint distribution easier to spread
    alpha = 2.0  # alpha = 1 corresponds to the paper and to the ordinary mutual information

    loss = -1*(p_i_j * (torch.log(p_i_j) - alpha *
                        torch.log(p_j) - alpha*torch.log(p_i))).sum()

    return loss

Now let's run the training.

#Implementation of training
total_epoch = 20


#model
model = NetIIC()
model.apply(weight_init)
model.to(device)

#Set the optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)


def train(total_epoch, model, train_loader, optimizer, device):

    model.train()
    scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
        optimizer, T_0=2, T_mult=2)

    for epoch in range(total_epoch):
        for batch_idx, (data, target) in enumerate(train_loader):

            #Create slightly transformed data to pair with the original
            data_perturb = perturb_imagedata(data)  #Affine transform plus noise

            #Send to the GPU if available
            data = data.to(device)
            data_perturb = data_perturb.to(device)

            #Reset the gradients
            optimizer.zero_grad()

            #Neural network output
            output, output_overclustering = model(data)
            output_perturb, output_perturb_overclustering = model(data_perturb)

            #Loss calculation
            loss1 = IID_loss(output, output_perturb)
            loss2 = IID_loss(output_overclustering,
                             output_perturb_overclustering)
            loss = loss1 + loss2

            #Update the parameters to reduce the loss
            loss.backward()
            optimizer.step()

            #Step the learning-rate scheduler (after optimizer.step(), as PyTorch 1.1+ expects)
            scheduler.step()

            #Log output
            if batch_idx % 10 == 0:
                print('Train Epoch {}:iter{} - \tLoss1: {:.6f}- \tLoss2: {:.6f}- \tLoss_total: {:.6f}'.format(
                    epoch, batch_idx, loss1.item(), loss2.item(), loss1.item()+loss2.item()))

    return model, optimizer


model_trained, optimizer = train(
    total_epoch, model, train_loader, optimizer, device)

During training, the CosineAnnealingWarmRestarts scheduler is used to vary the learning rate. This scheduler repeatedly lowers the learning rate and then raises it sharply, as shown below.

sgdr.jpg Figure: Quote https://www.kaggle.com/c/imet-2019-fgvc6/discussion/94783

When the learning rate falls and then jumps back up, the optimizer can escape local minima and move closer to the parameters of the global minimum.
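
To get a feel for this schedule, here is a minimal sketch (my own, not part of the notebook) that prints the learning rate step by step using a dummy parameter:

import torch

param = torch.nn.Parameter(torch.zeros(1))  #Dummy parameter, just to build an optimizer
optimizer = torch.optim.Adam([param], lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=2, T_mult=2)

for step in range(8):
    print(step, optimizer.param_groups[0]['lr'])  #Decays, then restarts at steps 2 and 6
    optimizer.step()
    scheduler.step()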

Finally, run inference on the test data with the trained model.

#Check the results of model classification clusters


def test(model, device, test_loader):
    model.eval()

    #Lists to store the results
    out_targs = []
    ref_targs = []

    with torch.no_grad():
        for data, target in test_loader:
            data = data.to(device)
            target = target.to(device)
            outputs, outputs_overclustering = model(data)

            #Add classification results to list
            out_targs.append(outputs.argmax(dim=1).cpu())
            ref_targs.append(target.cpu())

    #Concatenate the result lists
    out_targs = torch.cat(out_targs)
    ref_targs = torch.cat(ref_targs)

    return out_targs.numpy(), ref_targs.numpy()


out_targs, ref_targs = test(model_trained, device, test_loader)

Finally, build a frequency table of the outputs. The rows are the actual labels 0 to 9, and the columns are the predicted clusters.

import numpy as np
import scipy.stats as stats

#Make a confusion matrix
matrix = np.zeros((10, 10))

#Build a frequency table: rows are the digits 0 to 9, columns are the predicted clusters
for i in range(len(out_targs)):
    row = ref_targs[i]
    col = out_targs[i]
    matrix[row][col] += 1

np.set_printoptions(suppress=True)
print(matrix)

The output result is

[[   1.  978.    1.    0.    0.    0.    0.    0.    0.    0.]
 [   1.    0.    4. 1110.    2.    0.    2.    0.   13.    3.]
 [   0.    5.    4.    0.    0.    0.    0.    0. 1023.    0.]
 [   0.    0.    2.    0.    0.    4.  962.    0.   39.    3.]
 [   1.    0.    0.    0.  960.    0.    0.   19.    1.    1.]
 [   1.    1.    0.    0.    0.  866.   17.    0.    3.    4.]
 [ 940.    7.    0.    0.    0.    3.    0.    0.    4.    4.]
 [   0.    0.  921.    1.    0.    0.    1.   92.   13.    0.]
 [   0.    6.    0.    0.    0.    2.    4.    2.    2.  958.]
 [   0.    4.   14.    0.    2.    7.   27.  949.    2.    4.]]

For example, images of the digit 0 are mostly collected in cluster 1 (978 of them). For the digit 9, 949 images are collected in cluster 7.

Applying the same numbers in each estimated class to each correct label,

#Total number of data points
total_num = matrix.sum().sum()
print(total_num)

#Each digit is cleanly separated into its own cluster.
#For example, digit 0 mostly falls into cluster 1 (978 images); digit 9 mostly falls into cluster 7 (949 images).
#So summing the largest entry of each column gives the number of correct answers
correct_num_list = matrix.max(axis=0)
print(correct_num_list)
print(correct_num_list.sum())

print("Correct answer rate:", correct_num_list.sum()/total_num*100)

And the output is

10000.0
[ 940.  978.  921. 1110.  960.  866.  962.  949. 1023.  958.]
9667.0
Correct answer rate: 96.67

The correct answer rate was about 97%.
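
Incidentally, taking the maximum of each column can in principle assign the same digit to two different clusters. A stricter one-to-one matching can be computed with the Hungarian algorithm; here is a minimal sketch (my own addition, not in the notebook, assuming scipy >= 1.4). It should reproduce the same 9667 matches here, since the column maxima above already form a one-to-one assignment:

from scipy.optimize import linear_sum_assignment

#Find the one-to-one matching between true labels (rows) and clusters (columns) that maximizes the matches
row_ind, col_ind = linear_sum_assignment(matrix, maximize=True)
print("Matched accuracy:", matrix[row_ind, col_ind].sum() / matrix.sum() * 100)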

The implementation code "MNIST_IIC.ipynb" is available here: Implementation code repository

In closing

We introduced IIC (Invariant Information Clustering), an unsupervised learning method that exploits the maximization of mutual information.

The idea in IIC of making similar inputs derived from the same image match is very similar to the contrastive learning of SimCLR (A Simple Framework for Contrastive Learning of Visual Representations) in the recent paper by Hinton et al.

In IIC itself, mutual information appears only in the loss function, and the gap between the IIC paper and biological brains remains quite large.

However, mutual information can also be expressed as a Kullback-Leibler divergence, which (I feel) connects it to Karl Friston's free energy principle.
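
Concretely, mutual information is the KL divergence between the joint distribution and the product of the marginals (a standard identity, not specific to IIC):

$$ I(X;Y) = D_{\mathrm{KL}}\big(p(x,y)\,\|\,p(x)\,p(y)\big) $$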

● Explanation of the free energy principle: inference of perception, behavior, and the thoughts of others

IIC is a very interesting paper and method. I hope that progress in unsupervised learning and progress in brain science and neuroscience will become further integrated.

This time I applied IIC to image data; next time I will write an article on what happens when IIC is applied to text data processed with BERT (this is an interesting setting, because the contrast between object recognition, a mechanism shared by all organisms with a visual cortex, and natural language, which is unique to humans, is striking).

Thank you for reading.

[Disclaimer] This article is the author's own opinion and expression, not the official view of the company to which the author belongs.
