Summary of how to use MNIST in Python

2020/2 update

The content is a little out of date, so I rewrote it on my personal blog. https://kakedashi-engineer.appspot.com/2020/02/10/mnist/

What is MNIST

Overview

MNIST is a dataset of handwritten digit images, each 28 x 28 pixels. It is widely used in the deep learning community as a first benchmark.

Each pixel takes an integer value from 0 to 255. There are 70,000 images in total, of which 60,000 are training data and 10,000 are test data. The key point is that the training data and test data are separated from the start. You can use this split as it is, or hold out the last 10,000 training images as validation data. (If validation data is unfamiliar, try searching for early stopping.)
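As a minimal sketch of the hold-out described above, the following splits off the last 10,000 training samples. The function name split_validation and the zero-filled placeholder arrays are assumptions for illustration, not part of any library or of the real dataset.

```python
import numpy as np

def split_validation(x_train, y_train, n_val=10000):
    """Hold out the last n_val training samples as validation data."""
    x_tr, x_val = x_train[:-n_val], x_train[-n_val:]
    y_tr, y_val = y_train[:-n_val], y_train[-n_val:]
    return (x_tr, y_tr), (x_val, y_val)

# Placeholder arrays with the MNIST training shapes (zeros, not real images)
x_train = np.zeros((60000, 28, 28), dtype=np.uint8)
y_train = np.zeros(60000, dtype=np.uint8)

(x_tr, y_tr), (x_val, y_val) = split_validation(x_train, y_train)
print(x_tr.shape, x_val.shape)  # (50000, 28, 28) (10000, 28, 28)
```

Slicing returns views, so this split does not copy the image data.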

Data order

The data is not sorted by class. For example, the training labels begin and end like this:

[5 0 4 ..., 8 4 8]

There is no regular order.

Percentage of each number

One thing to note is the proportion of each digit. I had assumed that each digit had exactly 6,000 training images and 1,000 test images, but when I checked, the counts differ. In the training data there are 24% more images of "1" than of "5".

    0      1      2      3      4      5      6      7      8      9
[ 5923.  6742.  5958.  6131.  5842.  5421.  5918.  6265.  5851.  5949.] # training data
[  980.  1135.  1032.  1010.   982.   892.   958.  1028.   974.  1009.] # test data
[ 6903.  7877.  6990.  7141.  6824.  6313.  6876.  7293.  6825.  6958.] # training data + test data
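The counts above can be checked with a few lines of numpy. This sketch copies the numbers from the table and also shows np.bincount on a small toy label array (made up for illustration, not the full dataset), which is one way to produce such a table from real labels.

```python
import numpy as np

# Per-digit counts copied from the table above (index = digit)
train_counts = np.array([5923, 6742, 5958, 6131, 5842, 5421, 5918, 6265, 5851, 5949])
test_counts = np.array([980, 1135, 1032, 1010, 982, 892, 958, 1028, 974, 1009])

print(train_counts.sum(), test_counts.sum())  # 60000 10000

# How many more "1" images than "5" images there are in the training data
ratio = train_counts[1] / train_counts[5]
print(f"{(ratio - 1) * 100:.0f}%")  # 24%

# With a label array, np.bincount counts the occurrences of each digit
toy_labels = np.array([5, 0, 4, 1, 9, 2, 1, 3, 1, 4])  # toy example only
toy_counts = np.bincount(toy_labels, minlength=10)
print(toy_counts)  # [1 3 1 1 2 1 0 0 0 1]
```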

Normalization

There is one more thing to note. Some pixels, such as those in the corners of the image, are 0 in all 70,000 images. Therefore, if you try to normalize each pixel separately (for example, by that pixel's own range or standard deviation), division by zero occurs. In sample code, it is common to simply divide the whole array by 255.0 with numpy.
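The difference can be sketched on synthetic data. The array below is a random stand-in for flattened images (an assumption for illustration, not real MNIST), with one column forced to 0 to mimic a corner pixel that is always blank.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for flattened images: 100 samples x 784 pixels
x = rng.integers(0, 256, size=(100, 784), dtype=np.uint8)
x[:, 0] = 0  # mimic a corner pixel that is 0 in every image

# Per-pixel scaling: the range of pixel 0 is 0, so the division is 0 / 0
pixel_range = x.max(axis=0).astype(np.float64) - x.min(axis=0)
with np.errstate(divide="ignore", invalid="ignore"):
    per_pixel = (x - x.min(axis=0)) / pixel_range
print(np.isnan(per_pixel[:, 0]).all())  # True

# Global scaling: one constant for all pixels, so no division by zero
x_scaled = x.astype(np.float32) / 255.0
print(x_scaled.min() >= 0.0, x_scaled.max() <= 1.0)  # True True
```

If per-pixel scaling is really needed, one option is to add a small constant to the denominator or to leave constant pixels untouched.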

The introduction has turned out long. Next, I will briefly introduce how to use MNIST in Python.

1. THE MNIST DATABASE of handwritten digits (the original source)

You can download it from the homepage of Yann LeCun and others.

train-images-idx3-ubyte.gz: training set images (9912422 bytes)
train-labels-idx1-ubyte.gz: training set labels (28881 bytes)
t10k-images-idx3-ubyte.gz: test set images (1648877 bytes)
t10k-labels-idx1-ubyte.gz: test set labels (4542 bytes)
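These files use the IDX binary format: a 4-byte header (two zero bytes, a type code, and the number of dimensions), then each dimension size as a big-endian 32-bit integer, then the raw bytes. A minimal reader might look like the sketch below. The function name read_idx is an assumption, and the round-trip check uses a tiny synthetic file rather than a real download.

```python
import gzip
import os
import struct
import tempfile

import numpy as np

def read_idx(path):
    """Read a gzip-compressed IDX file of unsigned bytes into a NumPy array."""
    with gzip.open(path, "rb") as f:
        # Header: two zero bytes, type code (0x08 = unsigned byte), dimension count
        zero, dtype_code, ndim = struct.unpack(">HBB", f.read(4))
        if zero != 0 or dtype_code != 0x08:
            raise ValueError("not an unsigned-byte IDX file")
        # One big-endian 32-bit size per dimension
        shape = struct.unpack(">" + "I" * ndim, f.read(4 * ndim))
        data = np.frombuffer(f.read(), dtype=np.uint8)
    return data.reshape(shape)

# Round-trip check on a tiny synthetic file (not real MNIST data)
sample = np.arange(18, dtype=np.uint8).reshape(2, 3, 3)
header = struct.pack(">HBB", 0, 0x08, sample.ndim) + struct.pack(">3I", *sample.shape)
tmp_path = os.path.join(tempfile.mkdtemp(), "sample-idx3-ubyte.gz")
with gzip.open(tmp_path, "wb") as f:
    f.write(header + sample.tobytes())

loaded = read_idx(tmp_path)
print(loaded.shape)  # (2, 3, 3)
```

With the real files, read_idx("train-images-idx3-ubyte.gz") should give an array of shape (60000, 28, 28) and the label file an array of shape (60000,).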

Since this is binary data rather than text, it needs some processing before use. As you can see, it is divided into training data and test data, and the labels are stored in separate files from the images. Labels are integer values from 0 to 9, so many people will want to convert them to a 1-of-K (one-hot) representation. In that case, sklearn.preprocessing.LabelBinarizer is recommended.
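A minimal sketch of the 1-of-K conversion with LabelBinarizer follows. The label array is a small toy example, and fitting on all ten digits first is a choice made here so that every class gets a column even if a digit is missing from the batch.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

labels = np.array([5, 0, 4, 1, 9, 2, 1, 3, 1, 4])  # toy labels for illustration

binarizer = LabelBinarizer()
binarizer.fit(np.arange(10))           # fit on all ten classes, 0 to 9
one_hot = binarizer.transform(labels)  # one row per label, one column per class

print(one_hot.shape)  # (10, 10)
print(one_hot[0])     # [0 0 0 0 0 1 0 0 0 0]
```

binarizer.inverse_transform(one_hot) recovers the integer labels.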

2. sklearn

sklearn bundles several small benchmark datasets, so you can load them without downloading anything:

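python

from sklearn.datasets import load_iris, load_diabetes, load_digits, load_linnerud

# load_boston() was also listed here originally, but it was removed in scikit-learn 1.2.
iris = load_iris()
diabetes = load_diabetes()
digits = load_digits()      # accepts n_class to load only the first n classes
linnerud = load_linnerud()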
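Note that load_digits is not MNIST: it is a much smaller set of 8 x 8 digit images bundled with sklearn. A quick look at what the loader returns, as a sketch:

```python
from sklearn.datasets import load_digits

digits = load_digits()

print(digits.data.shape)    # (1797, 64)   flattened images
print(digits.images.shape)  # (1797, 8, 8) same data as 2-D images
print(digits.target[:10])   # [0 1 2 3 4 5 6 7 8 9]
```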

That is not all: you can also download data from mldata.org, a repository for machine learning data, and MNIST is available there. The data is large, so the download takes some time, but it is saved in data_home, so the second and later calls are fast. Note that mldata.org has since gone offline and fetch_mldata was removed in scikit-learn 0.22, so the following call works only with older versions.

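python

from sklearn.datasets import fetch_mldata  # available only in scikit-learn < 0.22

custom_data_home = "./scikit_learn_data"  # hypothetical cache directory; choose your own
mnist = fetch_mldata('MNIST original', data_home=custom_data_home)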

Very easy to use.
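On current scikit-learn versions, fetch_openml is the usual replacement for fetch_mldata. The following is a sketch under some assumptions: the helper names load_mnist_openml and split_train_test are made up here, "mnist_784" is the dataset name on OpenML, and the split assumes the OpenML copy keeps the conventional order with the 60,000 training images first. Verify that against your download before relying on it.

```python
import numpy as np
from sklearn.datasets import fetch_openml

def load_mnist_openml(data_home=None):
    """Download MNIST from OpenML (cached under data_home after the first call)."""
    # Labels come back as strings such as '5', so convert them to integers
    x, y = fetch_openml("mnist_784", version=1, return_X_y=True,
                        as_frame=False, data_home=data_home)
    return x.astype(np.uint8), y.astype(np.uint8)

def split_train_test(x, y, n_train=60000):
    """Split into the first n_train samples and the rest (assumed order)."""
    return (x[:n_train], y[:n_train]), (x[n_train:], y[n_train:])
```

Typical usage would be x, y = load_mnist_openml() followed by (x_train, y_train), (x_test, y_test) = split_train_test(x, y). The first call needs network access and can take a while.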

3. Other

Libraries such as TensorFlow and Theano seem to have their own functions to download MNIST. Their sample code shows how to use them. Chainer appears to use sklearn.
