[PYTHON] Various import methods of MNIST

(This article is a rewrite for Qiita of an article originally written here.)

Introduction

The handwritten-digit recognition dataset known as MNIST is a well-known dataset.

It is prepared so that it can be used from various libraries, but at the time I didn't understand why I wasn't reading any files from outside (looking back, I simply didn't understand how it worked), and when I searched the net there were several different ways to read it, so I was confused about how they related to each other.

I thought there might be other people like that, so I wrote this article to organize the information.

Premise

We start from the assumption that sklearn, tensorflow, and pytorch are installed. (I used Anaconda to prepare the environment.)

You don't need all of sklearn, tensorflow, and pytorch; each section below only requires the library it describes.

The OS is Mac OS X.
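If you want to confirm what's installed, here is a quick sanity check from the Python prompt (this assumes all three libraries are present; skip any you don't use):

>>> import sklearn, tensorflow, torch   # import only the ones you actually installed
>>> sklearn.__version__
>>> tensorflow.__version__
>>> torch.__version__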

Caution

It's the so-called handwriting recognition dataset, but there are actually two similar ones.

One is the handwriting recognition dataset that comes bundled with the sklearn installation.

The other is obtained by methods other than the above.

The first consists of 8x8 pixel images.

The second consists of 28x28 pixel images.

Both come up in searches such as "handwriting recognition" and "MNIST", and the images look much the same at a glance, so I got confused in various ways.

Standard installation of sklearn (8 x 8 size data)

Data set location

The sklearn standard dataset can be found at:

 /(Parts that differ depending on the environment)/lib/python3.7/site-packages/sklearn/datasets

For reference, here is the directory structure in my case. (I'm using Anaconda.)

$ ls /Users/hiroshi/opt/anaconda3/lib/python3.7/site-packages/sklearn/    # list the files and directories at this path

__check_build			dummy.py			model_selection
__init__.py			ensemble			multiclass.py
__pycache__			exceptions.py			multioutput.py
_build_utils			experimental			naive_bayes.py
_config.py			externals			neighbors
_distributor_init.py		feature_extraction		neural_network
_isotonic.cpython-37m-darwin.so	feature_selection		pipeline.py
base.py				gaussian_process		preprocessing
calibration.py			impute				random_projection.py
cluster				inspection			semi_supervised
compose				isotonic.py			setup.py
conftest.py			kernel_approximation.py		svm
covariance			kernel_ridge.py			tests
cross_decomposition		linear_model			tree
datasets			manifold			utils
decomposition			metrics
discriminant_analysis.py	mixture

And if you look at the datasets folder inside it, this is what you'll find:

$ ls /Users/hiroshi/opt/anaconda3/lib/python3.7/site-packages/sklearn/datasets

__init__.py					california_housing.py
__pycache__					covtype.py
_base.py					data
_california_housing.py				descr
_covtype.py					images
_kddcup99.py					kddcup99.py
_lfw.py						lfw.py
_olivetti_faces.py				olivetti_faces.py
_openml.py					openml.py
_rcv1.py					rcv1.py
_samples_generator.py				samples_generator.py
_species_distributions.py			setup.py
_svmlight_format_fast.cpython-37m-darwin.so	species_distributions.py
_svmlight_format_io.py				svmlight_format.py
_twenty_newsgroups.py				tests
base.py						twenty_newsgroups.py


Datasets other than the handwriting one are also provided here.

Next, let's go one level deeper, into the data folder:

$ ls /Users/hiroshi/opt/anaconda3/lib/python3.7/site-packages/sklearn/datasets/data
boston_house_prices.csv		diabetes_target.csv.gz		linnerud_exercise.csv
breast_cancer.csv		digits.csv.gz			linnerud_physiological.csv
diabetes_data.csv.gz		iris.csv			wine_data.csv

Here you will find the iris and boston_house_prices datasets that are often cited in articles dealing with sklearn.
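As an aside, these bundled datasets are loaded the same way as the digits below; for example, with load_iris (part of sklearn's public API):

>>> from sklearn.datasets import load_iris
>>> iris = load_iris()   # reads the bundled iris.csv shown above
>>> iris.data.shape
(150, 4)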

How to import a dataset

The code here is essentially the same as on sklearn's official page.

The following is done in an interactive session, started by running python from the terminal.

>>> from sklearn.datasets import load_digits
>>> import matplotlib.pyplot as plt
>>> digit = load_digits()
>>> digit.data.shape
(1797, 64)      # each 8x8 image is stored as one 64-element row

>>> plt.gray()
>>> digit.images[0]
array([[ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.],
       [ 0.,  0., 13., 15., 10., 15.,  5.,  0.],
       [ 0.,  3., 15.,  2.,  0., 11.,  8.,  0.],
       [ 0.,  4., 12.,  0.,  0.,  8.,  8.,  0.],
       [ 0.,  5.,  8.,  0.,  0.,  9.,  8.,  0.],
       [ 0.,  4., 11.,  0.,  1., 12.,  7.,  0.],
       [ 0.,  2., 14.,  5., 10., 12.,  0.,  0.],
       [ 0.,  0.,  6., 13., 10.,  0.,  0.,  0.]])
>>> plt.matshow(digit.images[0])
>>> plt.show()
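Incidentally, the correct labels are stored in digit.target, so you can check that the image above really is a handwritten 0:

>>> digit.target.shape
(1797,)
>>> digit.target[0]
0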

Then, the following screen will appear.

(Screenshot: the 8x8 digit image rendered by matplotlib)

Download the original data (28 x 28 size data)

The original MNIST data can be found here: http://yann.lecun.com/exdb/mnist/

However, what you can get there are binary files that cannot be used as they are.

So you have to process the data into a usable form yourself; but, as we'll see below, MNIST is such a famous dataset that various libraries provide tools for using it right away.

Of course, there should be a way to decode the binary data on your own, but I couldn't follow it that far and didn't think it was worth the time, so I won't touch on that method.

Download via sklearn (28 x 28 size)

Looking at articles on the net, older ones say to use

from sklearn.datasets import fetch_mldata

but the page that fetch_mldata tries to access is no longer available, so it now raises an error.

So these days it seems you should use fetch_openml as below. (See "Scikit-learn (sklearn) fetch_mldata error solution".)

This, too, is run in a python session started from the terminal.

>>> import matplotlib.pyplot as plt   # skip this if you've already imported it above
>>> from sklearn.datasets import fetch_openml
>>> digits = fetch_openml(name='mnist_784', version=1)
>>> digits.data.shape
(70000, 784)
>>> plt.imshow(digits.data[0].reshape(28,28), cmap=plt.cm.gray_r)
<matplotlib.image.AxesImage object at 0x1a299dd850>
>>> plt.show()
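One caveat, depending on your environment: newer versions of scikit-learn make fetch_openml return a pandas DataFrame by default, in which case the reshape above fails. If you hit that, passing as_frame=False should keep the plain numpy behavior:

>>> digits = fetch_openml(name='mnist_784', version=1, as_frame=False)   # as_frame=False: return numpy arrays, not a DataFrame
>>> digits.data.shape
(70000, 784)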

(Screenshot: the 28x28 digit image rendered by matplotlib)

tensorflow (28 x 28 size)

This is how the tensorflow tutorial loads the data:

>>> from tensorflow.examples.tutorials.mnist import input_data

This is supposed to work, but in my case I got the error below.

The upshot seems to be that the tutorials folder may not be downloaded when tensorflow is installed.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'tensorflow.examples.tutorials'

Looking at the contents of the actual directory:

$ ls /Users/hiroshi/opt/anaconda3/lib/python3.7/site-packages/tensorflow_core/examples/
__init__.py	__pycache__	saved_model

there is indeed no tutorials folder.

I referred to the following pages:

- ModuleNotFoundError: No module named 'tensorflow.examples' (Stack Overflow), 5th answer from the top
- Tensorflow GitHub page

First, go to the Tensorflow GitHub page, download the zip file anywhere, and unzip it.

(Screenshot: downloading the zip from the Tensorflow GitHub page)

Inside the resulting tensorflow-master folder, there is a tutorials folder at tensorflow-master/tensorflow/examples.

Copy this tutorials folder into /Users/hiroshi/opt/anaconda3/lib/python3.7/site-packages/tensorflow_core/examples/.

Once that's done:

>>> import matplotlib.pyplot as plt   # skip this if you've already imported it above
>>> from tensorflow.examples.tutorials.mnist import input_data
>>> mnist = input_data.read_data_sets("MNIST_data", one_hot=True)
>>> im = mnist.train.images[1]
>>> im = im.reshape(-1, 28)
>>> plt.imshow(im)
<matplotlib.image.AxesImage object at 0x64a4ee450>
>>> plt.show()

If everything went well, you should see an image here as well.
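For what it's worth, the object returned by read_data_sets also exposes the labels and a mini-batch interface (this is the old TF1 tutorial helper, so take this as a sketch for that version, continuing the same session):

>>> mnist.train.labels[1]                              # one-hot label of the image shown above
>>> batch_xs, batch_ys = mnist.train.next_batch(100)   # draw a mini-batch of 100 examples
>>> batch_xs.shape
(100, 784)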

keras (28 x 28 size)

>>> import matplotlib.pyplot as plt   # skip this if you've already imported it above
>>> import tensorflow as tf
>>> mnist = tf.keras.datasets.mnist
>>> mnist
>>> mnist_data = mnist.load_data()
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11493376/11490434 [==============================] - 1s 0us/step
>>> type(mnist_data[0])
<class 'tuple'>   # it comes back as a tuple
>>> len(mnist_data[0])
2
>>> len(mnist_data[0][0])
60000
>>> len(mnist_data[0][0][1])
28
>>> mnist_data[0][0][1].shape
(28, 28)

>>> plt.imshow(mnist_data[0][0][1],cmap=plt.cm.gray_r)
<matplotlib.image.AxesImage object at 0x642398550>
>>> plt.show()

I won't post the image again, but hopefully it shows up as before.
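By the way, rather than indexing into the nested tuples, load_data() is usually unpacked directly into train and test pairs; a minimal sketch:

>>> (x_train, y_train), (x_test, y_test) = mnist.load_data()
>>> x_train.shape
(60000, 28, 28)
>>> y_train[1]     # label of the image displayed above
0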

pytorch (28 x 28 size)

First of all, nothing can proceed without this import:

>>> from torchvision.datasets import MNIST

but I got an error; torchvision did not seem to be installed.

In my case, when I installed pytorch with conda, I had only run

conda install pytorch

and that alone was apparently not enough.

To include the companion packages, it seems you should run the following instead:

conda install pytorch torchvision -c pytorch 

You will be asked for confirmation, so press y.

After doing the above (if necessary), try running code similar to the following.

>>> import matplotlib.pyplot as plt   # skip this if you've already imported it above
>>> import torchvision.transforms as transforms
>>> from torch.utils.data import DataLoader
>>> from torchvision.datasets import MNIST
>>> mnist_data = MNIST('~/tmp/mnist', train=True, download=True, transform=transforms.ToTensor())
>>> data_loader = DataLoader(mnist_data,batch_size=4,shuffle=False)
>>> data_iter = iter(data_loader)
>>> images, labels = next(data_iter)   # in Python 3, use next(...) rather than .next()
>>> npimg = images[0].numpy()
>>> npimg = npimg.reshape((28, 28))
>>> plt.imshow(npimg, cmap='gray')
<matplotlib.image.AxesImage object at 0x12c841810>
>>> plt.show()
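The labels come back alongside the images, and if you want to see the whole mini-batch at once, torchvision.utils.make_grid is handy; a minimal sketch continuing the same session:

>>> labels[0]                        # label of the image shown above
tensor(5)
>>> import torchvision.utils as vutils
>>> grid = vutils.make_grid(images)                 # tile the 4-image batch into a single image
>>> plt.imshow(grid.numpy().transpose((1, 2, 0)))   # CHW -> HWC for matplotlib
>>> plt.show()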

Extra edition (from "Deep Learning from Scratch")

O'Reilly's "Deep Learning from Scratch" (https://www.amazon.co.jp/dp/4873117585) handles this independently, using files provided with the book.

Specifically, the data is loaded from within the folder downloaded from the GitHub page of files used in "Deep Learning from Scratch". (Of course, you need to prepare python, numpy, etc. in advance.)

Follow the steps below.

First, download or clone the folder from the GitHub page mentioned above.

Here, we'll download it as a zip and then unzip it.

(Screenshot: downloading the zip from the book's GitHub page)

This will create a folder called deep-learning-from-scratch-master.

**Each chapter has its own folder, so the workflow is to move into the folder for a chapter and load the data from there.**

The folders start from ch01, but since the MNIST data is first used in Chapter 3, I'll go into ch03.

$ pwd
/Volumes/SONY_64GB/deep-learning-from-scratch-master/ch03

Start python ...

>>> import sys,os
>>> sys.path.append(os.pardir)
>>> from dataset.mnist import load_mnist
>>> (x_train,t_train),(x_test,t_test) = load_mnist(flatten=True,normalize=False)
Downloading train-images-idx3-ubyte.gz ... 
Done
Downloading train-labels-idx1-ubyte.gz ... 
Done
Downloading t10k-images-idx3-ubyte.gz ... 
Done
Downloading t10k-labels-idx1-ubyte.gz ... 
Done
Converting train-images-idx3-ubyte.gz to NumPy Array ...
Done
Converting train-labels-idx1-ubyte.gz to NumPy Array ...
Done
Converting t10k-images-idx3-ubyte.gz to NumPy Array ...
Done
Converting t10k-labels-idx1-ubyte.gz to NumPy Array ...
Done
Creating pickle file ...
Done!
>>> print(x_train.shape)
(60000, 784)
>>> print(t_train.shape)
(60000,)
>>> print(x_test.shape)
(10000, 784)
>>> print(t_test.shape)
(10000,)
>>> 
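As a quick check, you can display one of the loaded images as well; since flatten=True makes each row a 784-vector, reshape it back to 28x28 first (a minimal sketch):

>>> import matplotlib.pyplot as plt
>>> img = x_train[0].reshape(28, 28)   # undo flatten=True
>>> plt.imshow(img, cmap=plt.cm.gray_r)
>>> plt.show()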


Referenced pages

MNIST original data

MNIST handwritten digit database, Yann LeCun, Corinna Cortes and Chris Burges → the page with the original MNIST data. The binary files can be downloaded here.

sklearn

Sklearn official documentation, digits page (→ click "datasets" at the top to see what other datasets sklearn ships with as standard)

Recognizing handwritten numbers with SVM from Scikit-learn (Qiita)

Scikit-learn (sklearn) fetch_mldata error solution (Qiita)

Understanding the MNIST data specifications

Handle handwritten digit data! How to use mnist with Python [For beginners]

7.5.3. Downloading datasets from the openml.org repository

Tensorflow

Basic usage of TensorFlow, Keras (model construction / training / evaluation / prediction)

ModuleNotFoundError: No module named 'tensorflow.examples' (Stack Overflow), 5th answer from the top

Tensorflow GitHub page

Keras

Basic usage of TensorFlow, Keras (model construction / training / evaluation / prediction)

Pytorch

Try MNIST with PyTorch

conda install pytorch torchvision -c pytorch says PackageNotFoundError: Dependencies missing in current osx-64 channels: - pytorch -> mkl >= 2018

Other

OpenML (especially the data list page)

"Deep Learning from Scratch" (https://www.amazon.co.jp/dp/4873117585)

GitHub page of files used in "Deep Learning from Scratch"
