Since I went to the trouble, I made a GitHub repository. Please star it (yes, I'm begging): SSAM-SimilaritySearchAppwithMNIST
As the title says. MNIST is easy to obtain and small, which makes it handy here: put in an image from that MNIST everyone knows, and the app finds similar MNIST images. I'll write up the process of building the application as a memo to myself.
--Working directory
Any name is fine, but I got carried away and gave it a fancy name: SSAM-SimilaritySearchAppwithMNIST.
--Directory for MNIST
Next, create a folder to save the MNIST data in; otherwise you'd have to download it every time, which is a pain. Create a folder with a name like static/mnist_data.
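If you prefer to create the folder from Python, a one-liner like this works (my own suggestion; it also avoids the missing-folder error I mention later):

import os
os.makedirs('./static/mnist_data', exist_ok=True)  # does nothing if the folder already exists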
--The current directory structure looks like this
SSAM-SimilaritySearchAppwithMNIST
└── static
    └── mnist_data
It's famous enough that it hardly needs an introduction anymore, but for those who don't know: MNIST is a dataset of 28x28 handwritten digit images. Because it's such easy-to-use data, it's often used to evaluate the accuracy of machine learning models, but because it's also very clean data, accuracy generally comes out high. Be careful with articles that boast "I built a high-accuracy model on MNIST (like this one)!" It's safer to read them with the stance that plenty of articles look impressive on MNIST alone. (This is purely self-deprecation; I'm not taking shots at other MNIST articles, honest.)
I'm a little torn between whether it's easier to understand if I split the code into small pieces and explain each one, or write it all at once and add supplementary explanations afterwards. I'd like to give this more thought going forward, so I'd appreciate any opinions. (I don't even know whether Qiita has a comment function.)
For now, I at least think it's kinder to list the libraries used this time in one place, so here they are. Please pip install whatever you're missing as appropriate (probably something like pip install numpy python-mnist annoy scikit-learn, but do check the package names yourself).
import os
import gzip
import pickle
import numpy as np
import urllib.request
from mnist import MNIST
from annoy import AnnoyIndex
from sklearn.metrics import accuracy_score
I will write the code immediately. As I wrote above, I'm a little worried whether it's kind to divide the code into small pieces and write it while adding explanations one by one, or if it's kinder to write it all at once and add supplementary explanations later. I've written some comments carefully, so I'll write them all at once this time, but I'd like to hear your opinions here and there.
First, the code that downloads MNIST.
def load_mnist():
    # Download the MNIST data
    url_base = 'http://yann.lecun.com/exdb/mnist/'
    key_file = {
        'train_img': 'train-images-idx3-ubyte.gz',
        'train_label': 'train-labels-idx1-ubyte.gz',
        'test_img': 't10k-images-idx3-ubyte.gz',
        'test_label': 't10k-labels-idx1-ubyte.gz'
    }
    # Download the files (.gz format)
    for filename in key_file.values():
        file_path = f'./static/mnist_data/{filename}'  # path of the .gz file to download
        if os.path.isfile(file_path.replace('.gz', '')): continue  # skip if the decompressed file already exists
        urllib.request.urlretrieve(url_base + filename, file_path)
        # Decompress the .gz, then delete the .gz file
        with gzip.open(file_path, mode='rb') as f:
            mnist = f.read()
        # Save the decompressed data
        with open(file_path.replace('.gz', ''), 'wb') as w:
            w.write(mnist)
        os.remove(file_path)  # delete the .gz file
    # Read the MNIST data and return it as np.arrays
    mndata = MNIST('./static/mnist_data/')
    # train
    images, labels = mndata.load_training()
    train_images, train_labels = np.reshape(np.array(images), (-1, 28, 28)), np.array(labels)  # convert to np.array; MNIST images are 28x28
    # test
    images, labels = mndata.load_testing()
    test_images, test_labels = np.reshape(np.array(images), (-1, 28, 28)), np.array(labels)  # convert to np.array; MNIST images are 28x28
    return train_images, train_labels, test_images, test_labels
When you run the code above, it first downloads the compressed .gz files from where the MNIST data is hosted and puts them under static/mnist_data/. (If you didn't create ./static/mnist_data/ beforehand, you may get an error saying the folder doesn't exist. Sorry.) It then decompresses each .gz and deletes the .gz file, since it's no longer needed. The decompressed files are actually in a binary format that's a pain to handle directly, but
from mnist import MNIST
it turns out that if you use this bluntly-named library, it splits the data into train and test and turns it into arrays for you. To be honest, the one secret here is that I agonized over this part for several hours. And with that, the function that downloads the MNIST data was finished before I knew it.
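As a quick sanity check (MNIST ships 60,000 training images and 10,000 test images, so the shapes should come out like this):

train_images, train_labels, test_images, test_labels = load_mnist()
print(train_images.shape, test_images.shape)  # (60000, 28, 28) (10000, 28, 28)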
Quoting the definition from Wikipedia:
Nearest neighbor search (NNS) is a type of optimization problem for finding the closest point in a metric space, or a solution to it. It is also called proximity search, similarity search, or closest point search. The problem is: given a set S of points in a metric space M and a query point q ∈ M, find the point in S closest to q. In many cases, M is taken to be d-dimensional Euclidean space, and distance is measured by the Euclidean distance or the Manhattan distance. Different algorithms are used for low and high dimensions. ~ From Wikipedia
Put plainly, it's an algorithm that fixes some function for measuring similarity and then finds the points with the highest similarity under it. That's the idea.
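To make the definition concrete, here is a toy brute-force version in numpy (my own illustration, not from any library): compute the distance from q to every point of S and take the minimum.

import numpy as np

def nearest_neighbor(S, q):
    dists = np.linalg.norm(S - q, axis=1)  # Euclidean distance from q to every point of S
    return np.argmin(dists)                # index of the closest point

S = np.random.rand(1000, 784)  # 1000 flattened 28x28 "images"
q = np.random.rand(784)        # a query point
print(nearest_neighbor(S, q))

This naive version costs O(n·d) per query, which is exactly the cost the next section is about avoiding.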
The algorithm used this time is annoy, which does "approximate" nearest neighbor search. What does "approximate" mean here? Computing a nearest neighbor search exactly eats a lot of computational resources. For example, the day you decide to do the simplest thing and compare all pixel values against every stored image, you're in trouble. With images in particular, the work explodes because it grows as height x width x channels. MNIST is only 28x28 and grayscale (784 values per image), so it's no big deal, but the day dad gets enthusiastic about similarity search over full-HD 1920x1080 images (1920x1080x3, roughly 6.2 million values each), dad despairs and can't sleep. The computational-complexity onee-san will teach you the terror of combinatorial explosion; it's fascinating and well worth studying, so if you don't know it, go watch "Fukashigi no Kazoekata" (How to Count the Uncountable) and weep at onee-san's dedication. Together with onee-san! Let's count!
We use an approximate nearest neighbor search library called annoy to find similar images. I use it because I'm used to it and because I find its code relatively easy to read.
For more on the algorithm, the blog post by annoy's author, Nearest neighbors and vector models – part 2 – algorithms and data structures (https://erikbern.com/2015/10/01/nearest-neighbors-and-vector-models-part-2-how-to-search-in-high-dimensional-spaces.html), and the Japanese SlideShare deck 近似最近傍探索の最前線 (Forefront of approximate nearest neighbor search) are easy to follow. It helps to look at those before reading this article.
For those who can't be bothered, the short version: it's an algorithm that recursively partitions the space the data points live in, building several binary trees, so that neighbors can be found quickly, in roughly O(log n) per query, as sketched below. The catch is that the binary trees have to be built beforehand.
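As a toy illustration of that idea (this is my own sketch of recursive random-hyperplane partitioning, not annoy's actual implementation), something like the following builds one such binary tree and answers a query by walking down it:

import numpy as np

rng = np.random.default_rng(0)

def build_tree(points, indices, leaf_size=16):
    # Leaf: stop splitting once few enough points remain
    if len(indices) <= leaf_size:
        return {'leaf': indices}
    # Split by a random hyperplane passing through the mean of these points
    normal = rng.normal(size=points.shape[1])
    offset = points[indices].mean(axis=0) @ normal
    side = points[indices] @ normal > offset
    left, right = indices[side], indices[~side]
    if len(left) == 0 or len(right) == 0:  # degenerate split; just make a leaf
        return {'leaf': indices}
    return {'normal': normal, 'offset': offset,
            'left': build_tree(points, left, leaf_size),
            'right': build_tree(points, right, leaf_size)}

def query(node, q):
    # Walk down one side of each split: roughly O(log n) for balanced trees
    if 'leaf' in node:
        return node['leaf']  # candidate indices; compare q against just these
    side = 'left' if q @ node['normal'] > node['offset'] else 'right'
    return query(node[side], q)

points = rng.random((1000, 784))
tree = build_tree(points, np.arange(len(points)))
print(query(tree, points[0]))  # a small candidate set containing likely neighbors

A single tree can miss true neighbors that fall just on the other side of a split; annoy reduces that risk by building many such trees and merging their candidate sets.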
In fact, annoy's author benchmarks several approximate nearest neighbor libraries and recommends others over his own (New approximate nearest neighbor benchmarks). If you really want speed, Facebook's faiss is probably the better choice, but it seems it can only be installed from conda, so I don't feel like using it.
The explanation has run a little long. Let's try a neighborhood search with annoy right away.
def make_annoy_db(train_imgs):
    # Use the approximate nearest neighbor search library annoy
    # Declare the dimensionality and metric of the input data, then insert the data
    annoy_db = AnnoyIndex(28*28, metric='euclidean')  # MNIST images are 28x28, so the input dimension is 28*28; the metric is how similarity is computed
    for i, train_img in enumerate(train_imgs):
        annoy_db.add_item(i, train_img.flatten())  # register each index with its corresponding data
    annoy_db.build(n_trees=10)  # build the trees
    annoy_db.save('./static/mnist_db.ann')  # save the built database under static
Something like this, I suppose. I'm using the word "database" even though this isn't exactly a database, simply because I can't think of a better term. By the way, AnnoyIndex can only be given one-dimensional arrays, so to put in image data you need flatten(), a reshape, or hard-coding the size like I did here.
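For illustration, a throwaway snippet of mine showing the flattening:

img = np.zeros((28, 28))  # a dummy 28x28 image
vec = img.flatten()       # shape (784,), same as img.reshape(-1)
print(vec.shape)          # (784,) -- the 1-D form AnnoyIndex accepts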
Of course, since this is only an approximate nearest neighbor search, the similarity is not computed exactly and there will be some error. (annoy mitigates this by using multiple trees and the like; see the URLs above for details.) Let's check whether it's actually accurate.
Fortunately, MNIST already pairs each image with its correct label. For each test image, we'll pull the single most similar training image and check whether its label matches the correct answer.
train_imgs, train_lbls, test_imgs, test_lbls = load_mnist()
if not os.path.isfile('./static/mnist_db.ann'):
    make_annoy_db(train_imgs)  # build the annoy db if the .ann file doesn't exist yet
annoy_db = AnnoyIndex(28*28, metric='euclidean')
annoy_db.load('./static/mnist_db.ann')  # load the annoy database

# Check the accuracy: for each test image, fetch its nearest neighbor and compare labels
y_pred = [train_lbls[annoy_db.get_nns_by_vector(test_img.flatten(), 1)[0]] for test_img in test_imgs]
score = accuracy_score(test_lbls, y_pred)
print('acc:', score)
# Output: acc: 0.9595
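If you also want to see how close the matches actually are, get_nns_by_vector can return the distances too (these options are part of annoy's documented API; the snippet itself is just my illustration):

idxs, dists = annoy_db.get_nns_by_vector(test_imgs[0].flatten(), 5, include_distances=True)
print(idxs)   # indices of the 5 nearest training images
print(dists)  # their Euclidean distances, smallest first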
The accuracy came out very high. If I'd had worse luck and gotten 0.2525, I could at least have turned it into a "niko-niko" (25-25) joke, but reality isn't that sweet. (What on earth am I talking about?)
With real images, having the same correct label doesn't necessarily mean the images actually look similar (for example, two photos with different backgrounds, or a black cat and a white cat, won't be judged similar by a human brain), but restricted to MNIST, which is such a clean dataset, the retrieved images probably really are similar.
The whole code so far is as follows.
import os
import gzip
import pickle
import numpy as np
import urllib.request
from mnist import MNIST
from annoy import AnnoyIndex
from sklearn.metrics import accuracy_score


def load_mnist():
    # Download the MNIST data
    url_base = 'http://yann.lecun.com/exdb/mnist/'
    key_file = {
        'train_img': 'train-images-idx3-ubyte.gz',
        'train_label': 'train-labels-idx1-ubyte.gz',
        'test_img': 't10k-images-idx3-ubyte.gz',
        'test_label': 't10k-labels-idx1-ubyte.gz'
    }
    # Download the files (.gz format)
    for filename in key_file.values():
        file_path = f'./static/mnist_data/{filename}'  # path of the .gz file to download
        if os.path.isfile(file_path.replace('.gz', '')): continue  # skip if the decompressed file already exists
        urllib.request.urlretrieve(url_base + filename, file_path)
        # Decompress the .gz, then delete the .gz file
        with gzip.open(file_path, mode='rb') as f:
            mnist = f.read()
        # Save the decompressed data
        with open(file_path.replace('.gz', ''), 'wb') as w:
            w.write(mnist)
        os.remove(file_path)  # delete the .gz file
    # Read the MNIST data and return it as np.arrays
    mndata = MNIST('./static/mnist_data/')
    # train
    images, labels = mndata.load_training()
    train_images, train_labels = np.reshape(np.array(images), (-1, 28, 28)), np.array(labels)  # convert to np.array; MNIST images are 28x28
    # test
    images, labels = mndata.load_testing()
    test_images, test_labels = np.reshape(np.array(images), (-1, 28, 28)), np.array(labels)  # convert to np.array; MNIST images are 28x28
    return train_images, train_labels, test_images, test_labels


def make_annoy_db(train_imgs):
    # Use the approximate nearest neighbor search library annoy
    # Declare the dimensionality and metric of the input data, then insert the data
    annoy_db = AnnoyIndex(28*28, metric='euclidean')
    for i, train_img in enumerate(train_imgs):
        annoy_db.add_item(i, train_img.flatten())
    annoy_db.build(n_trees=10)  # build the trees
    annoy_db.save('./static/mnist_db.ann')


def main():
    # Load the MNIST images
    train_imgs, train_lbls, test_imgs, test_lbls = load_mnist()
    print(train_imgs.shape, train_lbls.shape, test_imgs.shape, test_lbls.shape)  # just to see how much data MNIST contains
    if not os.path.isfile('./static/mnist_db.ann'):
        make_annoy_db(train_imgs)  # build the annoy db if the .ann file doesn't exist yet
    annoy_db = AnnoyIndex(28*28, metric='euclidean')
    annoy_db.load('./static/mnist_db.ann')  # load the annoy database

    # Check the accuracy: for each test image, fetch its nearest neighbor and compare labels
    y_pred = [train_lbls[annoy_db.get_nns_by_vector(test_img.flatten(), 1)[0]] for test_img in test_imgs]
    score = accuracy_score(test_lbls, y_pred)
    print('acc:', score)


if __name__ == "__main__":
    main()
The article has somehow gotten long already, so I'll continue next time. This time, we got as far as searching for similar images with annoy. Next time, let's build an app with Flask that returns similar images when you select an image. (If I don't get bored or too busy.) Continued: -> Next time
Bonus: Nobody cares, but I love Emacs's org-mode, and I bravely started writing this article in org-mode, only to give up in the face of its poor compatibility with everything else. In the end I wrote it in Markdown. I think org-mode on its own is the strongest document-writing tool there is (or rather, isn't Markdown just awkward? Which idiot came up with line breaks via two trailing half-width spaces?), but sadly I can't say so once collaboration with others enters the picture. org-mode does have an export-as-Markdown feature and I thought that would save me, but the result didn't render as cleanly on Qiita as I'd hoped... sad. Emacs also has its own peculiar key bindings that are hard to adjust to if you're happy with VS Code, and while VS Code does have an org-mode extension, the all-important Tab key (which toggles the folded display of headings) doesn't work there, and I don't think it can export to HTML and the like. It makes me think of VHS vs. Betamax: the better format doesn't always win.
By the way, the cover of the August 2020 issue of Software Design made me sad. Emacs taking on Vim? Hasn't that battle been over for a long time now? Is Emacs done for? On that note, I'd like to close the article.