Playing with handwritten digits in Python, Part 1

Image display of handwritten digit data

First, prepare the handwritten digit data. This time I will download and use the training data, named train, from Kaggle's Digit Recognizer task.

The full file is 73MB, which is a considerable amount of data, so to keep things easy to follow I picked 20 samples of each digit from 0 to 9, 200 in total. Please download the selected data from here.
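For reference, a subset like this could be produced roughly as follows. This is only a sketch: the helper name `pick_per_digit`, the fixed seed, and the random selection are my assumptions, not necessarily how the downloadable file was actually made, and a small synthetic array stands in for the real training data.

```python
import numpy as np

def pick_per_digit(data, per_digit=20, seed=0):
    """Return `per_digit` randomly chosen rows for each label 0-9.

    `data` is a 2-D array whose first column is the label.
    (Hypothetical helper; the seed and the random choice are assumptions.)
    """
    rng = np.random.default_rng(seed)
    picked = []
    for digit in range(10):
        rows = np.flatnonzero(data[:, 0] == digit)   # row indices of this digit
        chosen = rng.choice(rows, size=per_digit, replace=False)
        picked.append(data[np.sort(chosen)])         # keep original row order
    return np.vstack(picked)                         # grouped by label, 0 first

# Synthetic stand-in for the full data: 50 rows per digit, 784 pixel columns.
rng = np.random.default_rng(1)
labels = np.repeat(np.arange(10), 50)
rng.shuffle(labels)
fake = np.column_stack([labels, rng.integers(0, 256, size=(500, 784))])

small = pick_per_digit(fake)
print(small.shape)   # (200, 785)
```

With the real file, `fake` would be replaced by the array loaded from the downloaded CSV, and the result could be written back out with `np.savetxt`.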

This handwritten digit data is a CSV file in which each row looks like this:

```
8, 0, 0, 0, 128, ... , 54, 23, 0, 0
```

The first value is a label indicating which digit was written, and the values that follow are the grayscale data for 28x28 = 784 pixels.
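To make the row layout concrete, here is a small sketch that splits one row into its label and its 28x28 image. The row is made up for illustration, and I am assuming the pixels are stored row by row (row-major), which is what the `reshape` used later in this article also assumes.

```python
import numpy as np

# A made-up row in the same layout: the label first, then 784 pixel values.
row = np.zeros(785)
row[0] = 8                    # label: this sample is an "8"
row[1 + 28 * 3 + 5] = 128     # pixel at image row 3, column 5

label = int(row[0])
image = row[1:].reshape(28, 28)   # row-major: 28 consecutive values per image row

print(label, image.shape, image[3, 5])   # 8 (28, 28) 128.0
```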

First, import the required libraries.

```py
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
```

Then read the data, store it in a list, and sort it by label.

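```py
size = 28  # images are 28x28 pixels
# skiprows=1 skips the header row of the CSV file
raw_data = np.loadtxt('train_small.csv', delimiter=',', skiprows=1)

# store each row as a (label, 784-pixel vector) tuple
digit_data = []
for i in range(len(raw_data)):
    digit_data.append((raw_data[i, 0], raw_data[i, 1:785]))

digit_data.sort(key=lambda x: x[0])  # sort by label
```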

First of all, let's display what kind of images the loaded data contains, using matplotlib's pcolor plot.


```py
# draw digit images
plt.figure(figsize=(15, 15))
X, Y = np.meshgrid(range(size), range(size))  # pixel grid, shared by all images

for i in range(len(digit_data)):
    Z = digit_data[i][1].reshape(size, size)  # convert from vector to 28x28 matrix
    Z = Z[::-1, :]                            # flip vertically (pcolor's y axis points up)
    plt.subplot(10, 20, i + 1)                # lay out 200 cells
    plt.xlim(0, 27)
    plt.ylim(0, 27)
    plt.pcolor(X, Y, Z, cmap='gray')
    # hide tick labels (pass False; the old "off" string is no longer supported)
    plt.tick_params(labelbottom=False, labelleft=False)

plt.show()
```

digit_list.png

The 8th sample of "2" is amazing: it gives no impression of a "2" at all (laughs). Unless you were told it is a "2", even a human could not identify it ... This is the dataset we will be using this time.

Try correlating the data

Plot the correlation matrix

Let's treat each 28x28 = 784 pixel image as a 784-dimensional vector whose elements are grayscale intensities, and compute the correlation matrix between the images. How meaningful a plain correlation is remains open to question, but I feel that even this simple method can express the closeness of images to some extent. The result is a 200x200 matrix, which is impossible to take in as raw numbers, so I'll show it as a graph to get a picture of it.
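To make "correlation between two images" concrete, here is a small sketch that writes out the Pearson correlation coefficient by hand and checks it against `np.corrcoef`. The two vectors are random stand-ins rather than real digit images, and the helper name `pearson` is mine.

```python
import numpy as np

def pearson(u, v):
    """Pearson correlation of two 1-D vectors, written out by hand."""
    du = u - u.mean()
    dv = v - v.mean()
    return (du @ dv) / np.sqrt((du @ du) * (dv @ dv))

rng = np.random.default_rng(0)
a = rng.integers(0, 256, size=784).astype(float)   # stand-in for one image
b = 0.5 * a + rng.normal(0, 20, size=784)          # a noisy, dimmer copy of it

by_hand = pearson(a, b)
by_numpy = np.corrcoef(np.array([a, b]))[0, 1]     # each row is one variable
print(np.isclose(by_hand, by_numpy))               # True

# Changing overall brightness or contrast does not change the correlation.
print(np.isclose(pearson(a, 2 * a + 10), 1.0))     # True
```

Because the mean is subtracted and the result is normalized, two images of the same shape drawn with different ink density still correlate strongly, which is one reason this simple measure works at all.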

corr.png

It's a pretty spectacular graph (laughs)

The elements exactly on the diagonal compare each sample with itself, so their correlation is 1. At a glance, the diagonal blocks (correlation coefficients between samples of the same digit) look a little darker. "1" is definitely highly correlated.

The calculation in Python is done as follows.

```py
# stack the 200 pixel vectors into a (200, 784) ndarray, one image per row
A = np.array([vec for _, vec in digit_data])
Z = np.corrcoef(A)      # 200x200 correlation matrix between images (rows)

area_size = len(digit_data)
X, Y = np.meshgrid(range(area_size), range(area_size))
```
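The code that actually draws the unthresholded matrix is not shown in this article. A minimal sketch of how it could be drawn, assuming the same pcolor style as the other plots here; the synthetic `A_demo` (3 groups of 20 noisy samples), the colorbar, and the figure size are my assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for A: 3 "digits" x 20 samples, each a noisy copy of a template.
rng = np.random.default_rng(0)
templates = rng.random((3, 784))
A_demo = np.repeat(templates, 20, axis=0) + 0.5 * rng.normal(size=(60, 784))

Z_demo = np.corrcoef(A_demo)                      # 60x60 correlation matrix
n = len(Z_demo)
X_demo, Y_demo = np.meshgrid(range(n), range(n))

plt.figure(figsize=(6, 6))
plt.title("Correlation matrix (synthetic data)")
plt.pcolor(X_demo, Y_demo, Z_demo, cmap='Blues')
plt.colorbar(label="correlation coefficient")
plt.xticks([i * 20 for i in range(3)], range(3))  # one tick at the start of each block
plt.yticks([i * 20 for i in range(3)], range(3))
plt.show()
```

With the real data, `Z`, `X` and `Y` from the code above would take the place of the `_demo` variables, with 10 blocks instead of 3.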

Set a threshold to make it easier to see

To make it a little easier to see, let's set a threshold and plot entries whose correlation coefficient is greater than the threshold as 1 and all others as 0. I chose 0.5 and 0.6 as thresholds; they are arbitrary, picked after trying several values as the ones where the diagonal blocks start to emerge. In the 0.6 plot, there seems to be a visible difference between the diagonal blocks and the rest. It also seems to indicate that "9" and "7" are similar. You can also see that "2" has a particularly low correlation with other samples of "2".
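Rather than judging each threshold by eye, one could also count how many entries survive inside and outside the diagonal blocks. The following is a sketch of that idea on synthetic data; the helper names `binarize` and `block_hit_rates` and the noise level are my assumptions, not part of the original analysis.

```python
import numpy as np

def binarize(Z, thresh):
    """Return a 0/1 matrix: 1 where Z > thresh, 0 elsewhere."""
    return (Z > thresh).astype(float)

def block_hit_rates(Z, thresh, block=20):
    """Fraction of entries above `thresh` inside vs. outside the diagonal blocks."""
    n = len(Z)
    groups = np.arange(n) // block               # digit index of each sample
    same = groups[:, None] == groups[None, :]    # True inside the diagonal blocks
    not_self = ~np.eye(n, dtype=bool)            # ignore the trivial self-correlations
    hits = Z > thresh
    inside = hits[same & not_self].mean()
    outside = hits[~same].mean()
    return inside, outside

# Synthetic stand-in: 10 "digits" x 20 noisy copies of a random template.
rng = np.random.default_rng(0)
templates = rng.random((10, 784))
A_demo = np.repeat(templates, 20, axis=0) + 0.3 * rng.normal(size=(200, 784))
Z_demo = np.corrcoef(A_demo)

for t in (0.3, 0.5, 0.6, 0.7):
    inside, outside = block_hit_rates(Z_demo, t)
    print(f"thresh={t}: inside={inside:.2f} outside={outside:.2f}")
```

A threshold is useful when `inside` is still high while `outside` has already dropped; on the real `Z` the gap will be far less clean than on this synthetic data.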

corr2.png

corr4.png


```py
thresh = .5

plt.figure(figsize=(10, 10))
plt.xlim(0, area_size - 1)
plt.ylim(0, area_size - 1)
plt.title(f"Correlation matrix of digit character vectors (corr > {thresh})")

Z1 = Z.copy()
Z1[Z1 > thresh] = 1    # above the threshold -> 1
Z1[Z1 <= thresh] = 0   # everything else -> 0

plt.pcolor(X, Y, Z1, cmap='Blues', alpha=0.6)
# one tick per digit: each digit occupies a block of 20 samples
plt.xticks([(i * 20) for i in range(10)], range(10))
plt.yticks([(i * 20) for i in range(10)], range(10))
plt.grid(color='deeppink', linestyle='--')
plt.show()
```

Average value per block

Finally, let's show the average value for each block in a 10x10 graph.

corr3.png


```py
summary_Z = np.zeros((10, 10))

for i in range(10):
    for j in range(10):
        i1 = i * 20
        j1 = j * 20
        block = Z[i1:i1+20, j1:j1+20]
        if i == j:
            # The 20 diagonal elements are fixed at 1, so exclude them from the
            # average to keep the value from being inflated (400 - 20 = 380 entries).
            summary_Z[i, j] = (np.sum(block) - 20) / 380
        else:
            summary_Z[i, j] = np.sum(block) / 400

# average of each digit's grid
plt.figure(figsize=(10, 10))
plt.xlim(0, 10)
plt.ylim(0, 10)

sX, sY = np.meshgrid(range(11), range(11))  # 11 edges for 10 cells
plt.title("Average correlation within each digit-pair block")
plt.xticks(range(10), range(10))
plt.yticks(range(10), range(10))
plt.pcolor(sX, sY, summary_Z, cmap='Blues', alpha=0.6)
plt.show()
```
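The same per-block average can also be computed without explicit loops by reshaping the matrix into cells. This is a sketch of an alternative rather than the method used above; the helper name `block_means` is mine, and a random correlation matrix stands in for `Z`.

```python
import numpy as np

def block_means(Z, block=20):
    """Average Z over (block x block) cells, excluding the self-correlations."""
    k = len(Z) // block
    # Index Z as [cell row, row within cell, cell column, column within cell]
    # and sum over the two "within cell" axes.
    sums = Z.reshape(k, block, k, block).sum(axis=(1, 3))
    counts = np.full((k, k), float(block * block))
    # Diagonal cells contain `block` self-correlations: take them out of the average.
    idx = np.arange(k)
    sums[idx, idx] -= np.diag(Z).reshape(k, block).sum(axis=1)
    counts[idx, idx] -= block
    return sums / counts

rng = np.random.default_rng(0)
Z_demo = np.corrcoef(rng.random((200, 50)))   # any 200x200 correlation matrix will do

# The explicit double loop, for comparison.
loop = np.zeros((10, 10))
for i in range(10):
    for j in range(10):
        cell = Z_demo[i * 20:(i + 1) * 20, j * 20:(j + 1) * 20]
        loop[i, j] = (cell.sum() - 20) / 380 if i == j else cell.sum() / 400

print(np.allclose(block_means(Z_demo), loop))   # True
```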

### For the next step

This time I tried a rough analysis, in the sense that each image is treated as a 784-dimensional vector and the vectors are correlated as they are. However, image data is inherently two-dimensional, so I think a more plausible closeness between images could be expressed by also taking into account the values of neighboring pixels (above, below, left, and right). We have not even reached machine learning at this stage, yet the diagonal blocks still showed up properly. I will write something a little more serious in the next article.
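As one possible reading of "taking neighboring pixels into account", each image could be blurred slightly before correlating, so that strokes that are one pixel apart still overlap. This is purely an illustrative sketch of that idea on two artificial strokes; the 3x3 mean filter and the helper names `box_blur` and `corr` are my assumptions, not necessarily what the next article does.

```python
import numpy as np

def box_blur(img):
    """3x3 mean filter with zero padding, using only NumPy."""
    padded = np.pad(img, 1)                  # zero border of width 1
    h, w = img.shape
    out = np.zeros((h, w), dtype=float)
    for dy in range(3):                      # add up the 9 shifted copies
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w]
    return out / 9.0

def corr(u, v):
    """Correlation coefficient between two images, flattened to vectors."""
    return np.corrcoef(u.ravel(), v.ravel())[0, 1]

# A one-pixel-wide vertical stroke, and the same stroke shifted right by one pixel.
a = np.zeros((28, 28))
a[4:24, 10] = 255
b = np.zeros((28, 28))
b[4:24, 11] = 255

# The raw vectors do not overlap at all; the blurred ones do.
print(corr(a, b) < corr(box_blur(a), box_blur(b)))   # True
```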
