I started studying machine learning, but it was hard to get into. A big part of that was the TensorFlow tutorial: it opens with character recognition while giving off a "this is for beginners" air. Out of nowhere it says, in effect, "Let's build a function that maps input data of 784 elements (the number of pixels) to output data of 10 elements (the digits). We have a huge amount of hand-labeled data (55,000 records). This is the Hello World of machine learning." That was rough.
Still, there was nothing for it but to try, and along the way I more or less figured out how to do it, so I put together my own knock-off tutorial and am writing it down here as a memo. The Python version used was 3.5.1.
- A classifier that sorts body type into "thin", "normal", and "chubby" based on height and weight
- Ask a large number of people for their height and weight
- Note your (subjective) impression of each person's body shape in three categories: "thin", "normal", and "chubby"

172cm 62kg normal
181cm 55kg thin
...
Prepare enough of this teacher (training) data to capture the correlations in it. The more data there is, the more accurate the learning can be.
Y = f(X)
The pairs of $X$ and $Y$ are the teacher data: we want $Y$ but do not know how to compute it from $X$. $X$ and $Y$ are vectors (matrices), but don't overthink it; just picture the image below.
\left(
\begin{array}{ccc}
usually\\
Skinny\\
\vdots
\end{array}
\right)
= f\left(
\begin{array}{ccc}
172 & 62 \\
181 & 55 \\
\vdots & \vdots
\end{array}
\right)
For the teacher data this time, the format is CSV. I didn't have time to run a street survey, so I generated the data myself.
import numpy as np
import csv

def gen_data(n):
    # height: normal distribution around 160 cm
    h = 160 + (np.random.randn(n) * 10)
    # weight: BMI 22 for that height, plus noise
    w = (h / 100) ** 2 * 22 + (np.random.randn(n) * 10)
    bmi = w / (h / 100) ** 2
    # label by BMI: > 25 chubby, > 18.5 normal, otherwise skinny
    f = np.vectorize(lambda b: 'Chubby' if b > 25 else 'usually' if b > 18.5 else 'Skinny')
    return np.c_[h, w, f(bmi)]

fp = open('train.csv', 'w')
writer = csv.writer(fp)
writer.writerows(gen_data(100))
fp.close()
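The article only shows train.csv being generated, but the test.csv read further below can presumably be created the same way (a sketch of my own; the 50-record count follows the figure stated later):

```python
# Hypothetical: generate the 50-record test set the same way as train.csv
fp = open('test.csv', 'w')
writer = csv.writer(fp)
writer.writerows(gen_data(50))
fp.close()
```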
train.csv
172,62,usually
181,55,Skinny
...
Reading it back in is simple.
fp = open('train.csv', 'r')
train_data = np.array([r for r in csv.reader(fp)])
Plotting the data shows the following distribution.
train_plot = [train_data[train_data[:,2] == 'Skinny'],
train_data[train_data[:,2] == 'usually'],
train_data[train_data[:,2] == 'Chubby']]
plt.plot(train_plot[0][:,0], train_plot[0][:,1], 'o', label = 'skinny')
plt.plot(train_plot[1][:,0], train_plot[1][:,1], 'o', label = 'normal')
plt.plot(train_plot[2][:,0], train_plot[2][:,1], 'o', label = 'fat')
In addition to the teacher data, prepare test data for judging the result of the learning.
The machine-learned function $g$ is trained so that $Y \fallingdotseq g(X)$ holds on the teacher data, that is, so that it approaches the $Y = f(X)$ of the teacher data. The test data is used to check the performance of this $g$: pass it an $X$ that is different from the teacher data and see whether the expected result comes back.
Check: is $g(X_{\rm test})$ equal to $Y_{\rm test}$?
This time I prepared 100 teacher records and 50 test records.
Although this is called "design", at this point it only means deciding how many input and output elements there are and what their types are. This time there are two input elements,
[height: float, weight: float]
and three output elements,
[probability of thin: float, probability of normal: float, probability of chubby: float]
Each probability ranges from $0$ to $1$.
Convert the data read from CSV as follows.
train_x = np.array([[float(r[0]), float(r[1])] for r in train_data])
train_y = np.array([ [1,0,0] if r[2] =='Skinny' else [0,1,0] if r[2] == 'usually' else [0,0,1] for r in train_data])
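As a quick sanity check (my own sketch, not in the original), printing the first couple of converted rows shows how the pairs line up; the actual values depend on your generated CSV, but with the example rows shown above it would look like this:

```python
# Hypothetical check of the converted arrays
print(train_x[:2])  # e.g. [[172.  62.] [181.  55.]]
print(train_y[:2])  # e.g. [[0 1 0] [1 0 0]]
```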
$ Y = f (X) $ looks like this.
\left(
\begin{array}{ccc}
0 & 1 & 0 \\
1 & 0 & 0 \\
\vdots & \vdots & \vdots
\end{array}
\right)
= f\left(
\begin{array}{ccc}
172 & 62 \\
181 & 55 \\
\vdots & \vdots
\end{array}
\right)
This is the crux of it.
Assume the function $ g (X) $ to be machine-learned as follows.
Y = {\rm softmax}(WX+b)
The formula appears out of nowhere, but a linear regression of this form is all it takes for this problem. (If it does not get close to the expected result, see the notes on reviewing the model further below.)
Shown as a figure, we prepare a function that multiplies each $x_{i}$ by a weight $W_{i,j}$, adds a bias $b_{j}$, and applies ${\rm softmax}$ to obtain $y_{j}$. By nudging $W$ and $b$ little by little so that the output approaches the relationship between $X$ and $Y$ in the teacher data, the function should eventually become a decent classifier.
\left(
\begin{array}{ccc}
y_{1,1} & y_{1,2} & y_{1,3}\\
y_{2,1} & y_{2,2} & y_{2,3}\\
\vdots & \vdots & \vdots
\end{array}
\right)
= {\rm softmax}(
\left(
\begin{array}{cc}
x_{1,1} & x_{1,2} \\
x_{2,1} & x_{2,2} \\
\vdots & \vdots
\end{array}
\right)
\cdot
\left(
\begin{array}{ccc}
W_{1,1} & W_{1,2} & W_{1,3} \\
W_{2,1} & W_{2,2} & W_{2,3}
\end{array}
\right) +
\left(
\begin{array}{ccc}
b_{1} &
b_{2} &
b_{3}
\end{array}
\right))
Unrolling a single row,
\left(
\begin{array}{ccc}
y_{,1} & y_{,2} & y_{,3}
\end{array}
\right)
=
{\rm softmax}
\left(
\begin{array}{ccc}
(W_{1,1} x_{,1} + W_{2,1} x_{,2} + b_{1}), & (W_{1,2} x_{,1} + W_{2,2} x_{,2} + b_{2}), & (W_{1,3} x_{,1} + W_{2,3} x_{,2} + b_{3})
\end{array}
\right)

$y_{,1}$: probability of thin, $y_{,2}$: probability of normal, $y_{,3}$: probability of chubby
$x_{,1}$: height, $x_{,2}$: weight
That is what it comes down to.
That leaves ${\rm softmax}$.
The ${\rm softmax}$ function is handy when, as here, a neural network has to classify into several probability values. Written out, for $A = \left[\begin{array}{ccc} a_{1} & \ldots & a_{n} \end{array}\right]$ it is

{\rm softmax}(A) = \left[ \frac{e^{a_{1}}}{\sum_{j=1}^{n} e^{a_{j}}} \ldots \frac{e^{a_{n}}}{\sum_{j=1}^{n} e^{a_{j}}} \right]

but for now you can ignore the details.
Put simply, it normalizes the array $A$ so that its elements sum to $1$, pushing relatively large values toward $1$ and small values toward $0$.
This time we want a set of three values,
$[\,{\rm probability\ of\ thin},\ {\rm probability\ of\ normal},\ {\rm probability\ of\ chubby}\,]$.
For example, for someone who looks 80% likely to be thin and 20% likely to be normal, the answer would be $[\,0.8,\ 0.2,\ 0.0\,]$.
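As a quick numeric illustration (a sketch with numpy, not from the original article), feeding three arbitrary scores through softmax yields values that sum to 1, with the largest score getting most of the weight:

```python
import numpy as np

def softmax(a):
    e = np.exp(np.array(a, dtype=float))
    return e / e.sum()

scores = [2.0, 1.0, 0.1]             # arbitrary example scores
print(np.round(softmax(scores), 3))  # [0.659 0.242 0.099] -> sums to 1
```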
Incidentally, in machine learning, a function like ${\rm softmax}$ that normalizes values in this way is called an [activation function](https://ja.wikipedia.org/wiki/%E6%B4%BB%E6%80%A7%E5%8C%96%E9%96%A2%E6%95%B0).
The definition of the model in TensorFlow
import tensorflow as tf

# Input definition (height, weight)
x = tf.placeholder('float', [None, 2])
# Parameters to be adjusted by learning
w = tf.Variable(tf.ones([2, 3]))
b = tf.Variable(tf.zeros([3]))
# Output definition (probabilities of thin / normal / chubby)
y = tf.nn.softmax(tf.matmul(x, w) + b)
It comes out like the above. Some TensorFlow API calls have appeared, so here is a brief description of each.

| API | Description |
|---|---|
| tf.placeholder | Defines an input value. Teacher data and test data go here; the actual values have to be supplied at run time. The arguments are the type and the shape (number of dimensions). |
| tf.Variable | Defines a value that changes through learning. It is adjusted on every training step so that the error shrinks. The argument is the initial value. |
| tf.ones | Returns a matrix filled with 1s. |
| tf.zeros | Returns a matrix filled with 0s. |
| tf.matmul | Returns the result of a matrix multiplication. |
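For reference, a quick sketch (my own, not in the article) of what those initial values evaluate to; tf.ones and tf.zeros produce constant tensors, so they can be run directly in a session:

```python
import tensorflow as tf

sess = tf.Session()
print(sess.run(tf.ones([2, 3])))  # [[1. 1. 1.] [1. 1. 1.]] -> initial W
print(sess.run(tf.zeros([3])))    # [0. 0. 0.]              -> initial b
```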
With the setup so far, the initial values are just placeholders, so passing in $X$ only returns a meaningless result; the model still has to be steered toward the correct answers.
Substituting the initial values for $W$ and $b$ and the teacher data for $X$ into the $g(X)$ assumed in the previous section, $Y = {\rm softmax}(WX + b)$, gives
\left(
\begin{array}{ccc}
0.333 & 0.333 & 0.333 \\
0.333 & 0.333 & 0.333 \\
\vdots & \vdots & \vdots
\end{array}
\right)
= g\left(
\begin{array}{ccc}
172 & 62 \\
181 & 55 \\
\vdots & \vdots
\end{array}
\right)
as the result (with $W$ all ones and $b$ all zeros, every class gets the same score, so each probability comes out to $1/3$). Next, the error between this $Y$ and the true answers from the teacher data (call them $Y'$) is computed; if that error can be minimized, learning is done. The error between probability distributions is calculated with a formula called cross entropy.
loss = -\sum Y' {\rm log}(Y)
I chose this cross entropy as the way to measure the error because it comes out to $0$ for an exact match and grows larger the further apart the distributions are.
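A quick numeric sketch of that behavior (numpy, my own addition); the clipping is the same trick used in the code further below to avoid log(0):

```python
import numpy as np

def cross_entropy(y_true, y_pred):
    # -sum(Y' * log(Y)), clipped so that log(0) does not blow up
    return -np.sum(np.array(y_true) * np.log(np.clip(y_pred, 1e-10, 1.0)))

print(cross_entropy([0, 1, 0], [0.001, 0.998, 0.001]))  # ~0.002 (near-perfect match -> error close to 0)
print(cross_entropy([0, 1, 0], [0.1, 0.8, 0.1]))        # ~0.22  (close)
print(cross_entropy([0, 1, 0], [0.7, 0.2, 0.1]))        # ~1.61  (far off -> larger error)
```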
Next, the measured error has to be driven down while adjusting $W$ and $b$, the values defined as tf.Variable (values that change through learning) in the model.
The function that produces this error value (here $loss$, the cross entropy of $Y$ and $Y'$) is called the objective function.
There are various algorithms for optimizing an objective function:

- algorithms that converge quickly but are unlikely to find the optimal solution
- algorithms that tend to find the optimal solution but take a long time to converge
- algorithms that cannot find a solution at all for certain problems
- algorithms whose parameters are a pain to tune

TensorFlow ships with implementations of many optimization algorithms, though, so it is easy to switch between them and experiment.
This time I tried two of them, gradient descent (tf.train.GradientDescentOptimizer) and Adam (tf.train.AdamOptimizer); both reached a solution. In the end I chose Adam because it converged faster.
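Concretely, swapping algorithms is a one-line change (a sketch; `loss` is the cross-entropy objective defined just below, 0.05 is the learning rate discussed next, and only one of the two lines would actually be kept):

```python
# gradient descent
train_step = tf.train.GradientDescentOptimizer(0.05).minimize(loss)
# or Adam, which converged faster here
train_step = tf.train.AdamOptimizer(0.05).minimize(loss)
```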
TensorFlow's optimization algorithm changes the variables $W$ and the bias $b$ little by little so that the result of the objective function (the error value) shrinks.
Changing them all at once would not find the answer, so they have to change gradually, and the size of each change is controlled by a parameter. Too large and no answer is found; too small and finding it takes forever. This time the value passed to `AdamOptimizer` was 0.05.
The definition of learning in TensorFlow is as follows.
# Input area for the teacher data answers (Y')
y_ = tf.placeholder('float', [None, 3])
# Objective function: -sum(Y' log(Y))
# clipped (tf.clip_by_value) so that log(0) does not produce NaN
loss = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(tf.clip_by_value(y, 1e-10, 1.0))))
# Gradually adjust the variables so that the objective function is minimized
train_step = tf.train.AdamOptimizer(0.05).minimize(loss)

# Variable initialization
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)

train_feed_dict = {x: train_x, y_: train_y}
for i in range(100001):
    # Learning
    sess.run(train_step, feed_dict=train_feed_dict)
Running `sess.run` evaluates whatever is passed as its first argument. The data to map onto each `placeholder` needed for that evaluation is passed through the `feed_dict` argument. Since `train_step` needs the teacher data's $X$ and $Y'$, they are set as above. Training was run 100,000 times.
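Any other tensor can be evaluated the same way by feeding the placeholders it depends on (a sketch assuming the session, model, and converted data defined above):

```python
# current error value on the teacher data
print(sess.run(loss, feed_dict={x: train_x, y_: train_y}))
# predicted probabilities for the first five teacher records (y only depends on x)
print(sess.run(y, feed_dict={x: train_x[:5]}))
```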
Training takes time, and it would be sad to wait all that time only to get nothing, so I want to check during execution whether the model is working, that is, whether the error value is heading toward convergence.
TensorFlow provides TensorBoard as a visualization tool, but I will not touch it this time; instead I crudely dump progress to the console.
test_feed_dict = {x: test_x}
for i in range(100001):
    sess.run(train_step, feed_dict=train_feed_dict)
    if i % 10000 == 0 or (i % 1000 == 0 and i < 10000) or (i % 100 == 0 and i < 1000) or (i % 10 == 0 and i < 100):
        # Progress output: error value and accuracy on the test data
        test_y = ['Skinny' if max(a) == a[0] else 'usually' if max(a) == a[1] else 'Chubby' for a in sess.run(y, feed_dict=test_feed_dict)]
        bools = test_data[:, 2] == test_y
        print(i, sess.run(loss, feed_dict=train_feed_dict), str(sum(bools) / len(bools) * 100) + '%')
This prints the number of training steps, the error value against the teacher data, and the accuracy when the test data's $X$ is fed in ($Y'$ is not needed in `feed_dict`).
Adjust the output timing to taste.
10 89.8509 64.0%
20 80.4948 64.0%
30 73.6655 64.0%
40 68.4465 65.0%
50 64.4532 69.0%
60 61.0676 73.0%
70 58.317 73.0%
80 56.0346 74.0%
90 54.1317 74.0%
100 52.5213 74.0%
200 44.4377 79.0%
300 41.6028 79.0%
400 40.2241 80.0%
...
The output looks like this, so the learning seems to be working. I wanted something easier to grasp, so I also visualized it with `pyplot`.
import matplotlib.pyplot as plt

# height 130-190, weight 20-100: make every combination and evaluate y with the learned function
px, py = np.meshgrid(np.arange(130, 190+1, 1), np.arange(20, 100+1, 1))
graph_x = np.c_[px.ravel(), py.ravel()]
graph_y = sess.run(y, feed_dict={x: graph_x})
# convert y to a color-gradient value (-1 to 1)
pz = np.hsplit(np.array([sum(e * [-1, 0, 1]) for e in graph_y]), len(px))
plt.pcolor(px, py, pz)
plt.cool()
Here, $X$ is set to every combination of height (1 cm steps) and weight (1 kg steps) over the range of the teacher data, the learned model is evaluated on all of them, and the result is plotted as a color map.
I find this makes the state of the learning much easier to grasp.
Review the model design. Depending on the problem, a simple model like this one may not be able to handle it, and adjustments have to be made bit by bit.
I have not grasped the details myself, but to give a few simple examples:

- train on mini-batches (see the sketch below)
- adjust the parameters of the optimization algorithm
- try switching to another algorithm

The examples in TensorFlow Playground are easy to follow.
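For instance, mini-batch training (a sketch of my own; the batch size of 20 and the random sampling scheme are arbitrary choices, not from the article) would replace the full-batch loop above with something like:

```python
batch_size = 20  # hypothetical batch size
for i in range(100001):
    # pick a random subset of the teacher data at every step
    idx = np.random.choice(len(train_x), batch_size, replace=False)
    sess.run(train_step, feed_dict={x: train_x[idx], y_: train_y[idx]})
```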
Since $WX + b$ is a linear expression, it cannot handle nonlinear problems. However, by inserting a hidden layer (omitted in this model) between the input layer and the output layer, a combination of linear models becomes able to learn complicated nonlinear relationships. It is less time-efficient, but it has the potential to bring the final error down further.
Intermediate layer addition example
# Hidden layer
with tf.name_scope('hidden'):
    w0 = tf.Variable(tf.ones([2, 4]))  # receives the 2 x values from the input layer and converts them to 4 outputs
    b0 = tf.Variable(tf.zeros([4]))    # bias on each output
    h0 = tf.nn.relu(tf.matmul(x, w0) + b0)

# Output layer
with tf.name_scope('output'):
    w = tf.Variable(tf.ones([4, 3]))   # receives the 4 outputs from the hidden layer and converts them to 3 outputs
    b = tf.Variable(tf.zeros([3]))     # bias on each output
    y = tf.nn.softmax(tf.matmul(h0, w) + b)
It may also be worth preprocessing the input data, for example adding ${x_{1}}^2$ or ${\rm sin}(x_{2})$ as new $x_{i}$, turning the problem into one a linear model can fit.
Input data processing example
# Input layer
with tf.name_scope('input'):
    x = tf.placeholder('float', [None, 2])
    # preprocessing of the input values
    x1, x2 = tf.split(1, 2, x)
    x_ = tf.concat(1, [x, x1 ** 2, tf.sin(x2)])

# Output layer
with tf.name_scope('output'):
    w = tf.Variable(tf.ones([4, 3]))  # 4 inputs [x1, x2, x1**2, sin(x2)] -> 3 outputs [skinny, normal, fat]
    b = tf.Variable(tf.zeros([3]))
    y = tf.nn.softmax(tf.matmul(x_, w) + b)  # note: uses the preprocessed x_
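One caveat: tf.split and tf.concat are called above with the argument order of the TensorFlow 0.x releases this article was written against. In TensorFlow 1.0 and later, the same preprocessing would look roughly like this (a sketch):

```python
# TensorFlow 1.x-style argument order
x1, x2 = tf.split(x, 2, axis=1)
x_ = tf.concat([x, x1 ** 2, tf.sin(x2)], axis=1)
```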
The input information itself may also be insufficient. In this example, adding gender, age, and so on to the data might improve accuracy further, although that means starting over from data collection.
With this model, 100 records for 100,000 iterations took about 2 minutes of computation (Core i5 1.8 GHz).
...
70000 3.63972 99.0%
80000 3.27686 100.0%
90000 3.02285 100.0%
100000 2.80263 100.0%
The accuracy reached 100% at around the 80,000th iteration.
The graph looks good too.
Let's also look at the values of $W$ and $b$.
print ('w:',sess.run(w))
print ('b:',sess.run(b))
console
w: [[ 3.11868572 1.0388186 -0.9223755 ]
[-2.45032024 0.99802458 3.3779633 ]]
b: [-172.08648682 -3.14501309 158.91401672]
With that, the classifier
$({\rm probability\ of\ thin},\ {\rm probability\ of\ normal},\ {\rm probability\ of\ chubby})$
$= {\rm softmax}((3.12h - 2.45w - 172.09),\ (1.04h + w - 3.15),\ (-0.92h + 3.38w + 158.91))$
is complete, where $h$ is height, $w$ is weight, and the coefficients are rounded from the $W$ and $b$ printed above.
Check with ipython just in case
import numpy as np

# softmax is not in numpy, so define it ourselves
def softmax(a):
    e = np.exp(np.array(a))
    return e / np.sum(e)

# coefficients rounded from the learned w and b above
def taikei(h, w):
    return softmax([(3.12*h - 2.45*w - 172.09), (1.04*h + w - 3.15), (-0.92*h + 3.38*w + 158.91)])

print(np.round(taikei(172, 60), 2))
↓
[ 0. 1. 0.]
At 172 cm and 60 kg: 100% normal.
Now a borderline case, 172 cm and 74 kg:
print(np.round(taikei(172,74),2))
↓
[ 0. 0.26 0.74]
74% chubby.
import numpy as np
import tensorflow as tf
import csv
import matplotlib.pyplot as plt
import math
def read(path):
    fp = open(path, 'r')
    data = np.array([r for r in csv.reader(fp)])
    fp.close()
    return data

def convert(data):
    return [np.array([[float(r[0]), float(r[1])] for r in data]),
            np.array([[1,0,0] if r[2] == 'Skinny' else [0,1,0] if r[2] == 'usually' else [0,0,1] for r in data])]
train_data = read('train.csv')
test_data = read('test.csv')
#Teacher data display
plt.xlabel('height')
plt.ylabel('weight')
plt.xlim(130, 190)
plt.ylim(20, 100)
train_plot = [train_data[train_data[:,2] == 'Skinny'],
train_data[train_data[:,2] == 'usually'],
train_data[train_data[:,2] == 'Chubby']]
plt.plot(train_plot[0][:,0], train_plot[0][:,1], 'o', label = 'skinny')
plt.plot(train_plot[1][:,0], train_plot[1][:,1], 'o', label = 'normal')
plt.plot(train_plot[2][:,0], train_plot[2][:,1], 'o', label = 'fat')
plt.legend()
train_x, train_y = convert(train_data)
test_x, test_y = convert(test_data)
# Input layer
with tf.name_scope('input'):
    x = tf.placeholder('float', [None, 2])

# Output layer
with tf.name_scope('output'):
    w = tf.Variable(tf.ones([2, 3]))
    b = tf.Variable(tf.zeros([3]))
    y = tf.nn.softmax(tf.matmul(x, w) + b)
with tf.name_scope('train'):
    # Input area for the teacher data answers (Y')
    y_ = tf.placeholder('float', [None, 3])
    # Objective function: -sum(Y' log(Y))
    # clipped (tf.clip_by_value) so that log(0) does not produce NaN
    loss = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(tf.clip_by_value(y, 1e-10, 1.0))))
    # Gradually adjust the variables so that the objective function is minimized
    train_step = tf.train.AdamOptimizer(0.05).minimize(loss)

# Variable initialization
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)
train_feed_dict={x: train_x, y_: train_y}
test_feed_dict={x: test_x}
for i in range(100001):
    sess.run(train_step, feed_dict=train_feed_dict)
    if i % 10000 == 0 or (i % 1000 == 0 and i < 10000) or (i % 100 == 0 and i < 1000) or (i % 10 == 0 and i < 100):
        # Progress output: error value and accuracy on the test data
        test_y = ['Skinny' if max(a) == a[0] else 'usually' if max(a) == a[1] else 'Chubby' for a in sess.run(y, feed_dict=test_feed_dict)]
        bools = test_data[:, 2] == test_y
        print(i, sess.run(loss, feed_dict=train_feed_dict), str(sum(bools) / len(bools) * 100) + '%')
# Display the classification state
# height 130-190, weight 20-100: make every combination and evaluate y with the learned function
px, py = np.meshgrid(np.arange(130, 190+1, 1), np.arange(20, 100+1, 1))
graph_x = np.c_[px.ravel(), py.ravel()]
graph_y = sess.run(y, feed_dict={x: graph_x})
# convert y to a color-gradient value (-1 to 1)
pz = np.hsplit(np.array([sum(e * [-1, 0, 1]) for e in graph_y]), len(px))
plt.pcolor(px, py, pz)
plt.cool()
plt.pause(.01)
print ('w:',sess.run(w))
print ('b:',sess.run(b))
I can't say it felt easy, but it worked out somehow. This was my first time using Python seriously, and I was impressed by how easily it handles matrix operations. I would like to find more practical material to try machine learning on, but it is not easy to come by.