[Python] Searching for properties to get started with TensorFlow (Part 1)

TensorFlow was released by Google in November 2015. The official tutorial focuses on MNIST handwriting recognition: https://www.tensorflow.org/get_started/mnist/beginners

However, I wanted to do something more useful for my own life, so I settled on the theme of "finding properties".

Notice

- I intend to write this as a series (planned).
- There is a time lag between when I try things and when I publish the articles.

MNIST handwritten digit recognition

MNIST handwritten digit recognition is a well-known problem in artificial intelligence, particularly in image recognition. The task is to take an image of a handwritten digit from 0 to 9 as input and determine which digit it is. Roughly speaking, the higher the percentage of test data judged correctly, the better.

The number of data points and the dimensions of the input and output data are as follows.

num_of_data=70,000 (60,000 training data, 10,000 test data)
input=784 (28*28 pixels)
output=10 (labels from 0 to 9)

Dimensions of the image and label data (training, validation, test):

---- _images ----
(55000, 784)
(5000, 784)
(10000, 784)
<type 'numpy.ndarray'>
---- _labels ----
(55000, 10)
(5000, 10)
(10000, 10)
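
These shapes were printed with code along these lines. A minimal sketch, assuming the input_data.py helper that shipped with the old tutorial (module locations have moved between TensorFlow versions):

import input_data

# downloads MNIST on first run and returns train/validation/test DataSets
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

print('---- _images ----')
for ds in (mnist.train, mnist.validation, mnist.test):
    print(ds._images.shape)
print(type(mnist.train._images))

print('---- _labels ----')
for ds in (mnist.train, mnist.validation, mnist.test):
    print(ds._labels.shape)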

Property search: Problem setting

Let's consider what kind of problem to set for "property search", using "MNIST handwritten digit recognition" as a reference. This time we will focus on rental properties in particular.

When looking for a rental property, it is common to narrow down candidates by conditions such as the area you want to live in. But at some stage, the question "Given my wishes, such as the nearest station and the floor plan, how much monthly rent is reasonable to pay?" is bound to come up.

My thought was: if other property data can be used to derive something like a "reasonable rent guideline", it might tell you whether the rent of a property you are interested in is cheap or expensive.

Since every property has its own characteristics, real estate seems to be a domain where price prediction is difficult. But to me at least it was a more interesting problem than recognizing handwritten digits, so I will give it a try.

・I want to live in a 1LDK about a 10-minute walk from ◯◯ Station, in a building around 10 years old.
・I was told it costs ◯◯0,000 yen a month; is that a bargain?
・As a guideline it seems to be XX0,000 yen, so isn't ◯◯0,000 yen too expensive?

Consider the input and output data in a form similar to the MNIST data. For reference, I looked at "Assessment of the salaries of professional baseball players with a neural network".

input=34 (29+1+3+1, see below)
output=10

*type of input (the features prefixed with ## are candidates not included in the 34 dimensions this time; a sketch of assembling one vector follows this list)
・Nearest station: the 29 stations of the JR Yamanote Line (one-hot vector) *to be reduced if this turns out to be too many, or extended (other lines?) if too few.
Going around the loop from Shinagawa Station: Shinagawa Station is element 0 of the input, Osaki Station is element 1, ...
・Walking time to the nearest station in minutes (normalized to 0-1 by the maximum value in the data)
・Floor plan: 1LDK, 2DK, 2LDK (one-hot vector) *floor plans the author could plausibly live in
・Age of the building (normalized to 0-1 by the maximum value in the data)
## ・Security deposit: with or without
## ・Key money: with or without
## ・Area (normalized by the maximum value in the data)
## ・Condominium or apartment (one-hot vector)
…
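
As referenced above, here is a rough sketch of assembling one input vector. The station list (abbreviated here), helper names, and maximum values are my own assumptions for illustration:

# Hypothetical encoder for the 34-dim input described above:
# 29 station one-hot + walk minutes + 3 floor-plan one-hot + building age.
YAMANOTE_STATIONS = ['Shinagawa', 'Osaki', 'Gotanda']  # abbreviated; all 29, in loop order from Shinagawa
FLOOR_PLANS = ['1LDK', '2DK', '2LDK']

def encode_property(station, walk_min, floor_plan, age, max_walk, max_age):
    station_vec = [0.0] * len(YAMANOTE_STATIONS)
    station_vec[YAMANOTE_STATIONS.index(station)] = 1.0
    plan_vec = [0.0] * len(FLOOR_PLANS)
    plan_vec[FLOOR_PLANS.index(floor_plan)] = 1.0
    return (station_vec
            + [walk_min / float(max_walk)]   # walk time normalized to 0-1
            + plan_vec
            + [age / float(max_age)])        # building age normalized to 0-1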

*type of output (the binning rule is sketched in code after this list)
0: less than 75,000 yen
1: 75,000 to under 80,000 yen
2: 80,000 to under 85,000 yen
3: 85,000 to under 90,000 yen
4: 90,000 to under 95,000 yen
5: 95,000 to under 100,000 yen
6: 100,000 to under 105,000 yen
7: 105,000 to under 110,000 yen
8: 110,000 to under 115,000 yen
9: 115,000 yen or more
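
As a sketch, the labeling rule above expressed as code (boundaries in yen):

def rent_to_label(rent_yen):
    # class 0: under 75,000 yen; classes 1-8: 5,000-yen-wide bins;
    # class 9: 115,000 yen or more
    if rent_yen < 75000:
        return 0
    if rent_yen >= 115000:
        return 9
    return 1 + int((rent_yen - 75000) // 5000)

print(rent_to_label(82000))   # -> 2 (80,000 to under 85,000 yen)
print(rent_to_label(118000))  # -> 9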

Data collection

This time I will use the search results of a major comprehensive real-estate and housing information site, specifying conditions such as "JR Yamanote Line, 1LDK/2DK/2LDK, 100 listings per page".

Crawling with Python or similar collects records like these:

train_JR Yamanote Line/Tabata Station 14 2DK 15 11.8000001907 [Property URL]
train_JR Yamanote Line/Tabata Station 14 2DK 15 11.8000001907 [Property URL]
train_JR Yamanote Line/Tabata Station 14 2DK 15 11.8000001907 [Property URL]
train_JR Yamanote Line/Tabata Station 14 2DK 15 11.8000001907 [Property URL]
…

29,786 listings in total (at the time of the run).
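
The crawler itself is not the focus of this article, but a minimal sketch of the idea follows. Note that the URL handling and the CSS selectors are placeholders, since the actual site is deliberately left unnamed:

import requests
from bs4 import BeautifulSoup

def crawl_page(url):
    # placeholder selectors; the real site's markup will differ
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    for item in soup.select('.property-item'):
        station = item.select_one('.station').get_text(strip=True)     # e.g. "JR Yamanote Line/Tabata Station 14"
        plan    = item.select_one('.floor-plan').get_text(strip=True)  # e.g. "2DK"
        age     = item.select_one('.age').get_text(strip=True)         # e.g. "15"
        rent    = item.select_one('.rent').get_text(strip=True)        # e.g. "11.8" (in 10,000-yen units)
        link    = item.select_one('a')['href']
        print('train_%s %s %s %s %s' % (station, plan, age, rent, link))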

Data shaping / reading

Modify the input_data.py provided with the TensorFlow tutorial so that it can read the input data prepared as CSV directly.

Decide the split between training, validation, and test data (22,000 / 2,000 / 5,786). The split below separates the 24,000 training + validation rows from the 5,786 test rows; the 2,000-row validation set is carved out when the data is read, as in the tutorial.

split -l 24000 attributes.csv
(after splitting, rename the two pieces to train-attributes.csv and test-attributes.csv)
split -l 24000 labels.csv
(after splitting, rename the two pieces to train-labels.csv and test-labels.csv)

REAL_ESTATE_data $ tail -5 train-attributes.csv 
0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.06451612903225806,1,0,0,0.36666666666666664
0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.01935483870967742,0,1,0,0.6333333333333333
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0.0967741935483871,0,0,1,0.1111111111111111
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0.1032258064516129,1,0,0,0.32222222222222224
0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.05161290322580645,1,0,0,0.1
REAL_ESTATE_data $ tail -5 train-labels.csv 
8
9
9
6
9
REAL_ESTATE_data $ tail -5 test-attributes.csv 
0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0967741935483871,0,0,1,0.25555555555555554
0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.05806451612903226,0,1,0,0.34444444444444444
0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.12258064516129032,0,1,0,0.35555555555555557
0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.03870967741935484,0,1,0,0.5
0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.16129032258064516,1,0,0,0.022222222222222223
REAL_ESTATE_data $ tail -5 test-labels.csv 
9
9
9
9
9
REAL_ESTATE_data $ 

extract_images(filename) now simply reads the CSV as-is (even though the data is no longer "images").

import numpy

def extract_images(filename):
  print('Extracting images ', filename)
  # read the whole CSV into a float ndarray (genfromtxt defaults to dtype=float)
  data = numpy.genfromtxt(filename, delimiter=',')
  return data
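
The labels CSV holds one class index per line, while the shapes above show (N, 10) one-hot arrays, so the matching extract_labels can be adapted along these lines. This is a sketch; dense_to_one_hot() is the helper already present in the tutorial's input_data.py:

def extract_labels(filename, one_hot=False):
  print('Extracting labels ', filename)
  # each CSV line is a single class index from 0 to 9
  labels = numpy.genfromtxt(filename, delimiter=',').astype(numpy.uint8)
  if one_hot:
    return dense_to_one_hot(labels)  # reuse the tutorial's helper
  return labels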

In the DataSet class, the dimension conversion is simply dropped, since it is no longer needed.

class DataSet(object):
  
  def __init__(self, images, labels, fake_data=False):
    if fake_data:
      self._num_examples = 10000
    else:
      assert images.shape[0] == labels.shape[0], ( "images.shape: %s labels.shape: %s" % (images.shape, labels.shape))
      self._num_examples = images.shape[0]

    self._images = images
    self._labels = labels
    self._epochs_completed = 0
    self._index_in_epoch = 0
…

Model

Following the TensorFlow tutorial (for beginners), build a softmax regression model. Softmax regression generally seems to be a popular model when you want to assign probabilities that something is one of several different candidates. In the handwritten-digit example, for a given image it assigns, say, an 80% probability to "9", 4% to "8", and so on:

[0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.04, 0.80]
softmax: exponentiating its inputs and then normalizing them. 

exponentiation:
- one more unit of evidence increases the weight given to any hypothesis multiplicatively. 
- one less unit of evidence means that a hypothesis gets a fraction of its earlier weight. 
- No hypothesis ever has zero or negative weight. 

normalization:
- they add up to one, forming a valid probability distribution.
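
The model code itself is essentially the tutorial's, with 34 inputs in place of 784 pixels. A minimal sketch using the TF 0.x-era API this article runs on (newer TensorFlow versions need the tf.compat.v1 equivalents); mnist here is the property DataSets object loaded above:

import tensorflow as tf

x  = tf.placeholder("float", [None, 34])   # 34 property features
y_ = tf.placeholder("float", [None, 10])   # one-hot rent class

W = tf.Variable(tf.zeros([34, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(x, W) + b)     # exponentiate, then normalize

cross_entropy = -tf.reduce_sum(y_ * tf.log(y))
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)

sess = tf.Session()
sess.run(tf.initialize_all_variables())    # the initializer of that era

for _ in range(1000):
  batch_xs, batch_ys = mnist.train.next_batch(100)
  sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

# evaluation: fraction of test properties whose rent class is predicted correctly
correct = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct, "float"))
print(sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))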

First result

(agile_env)tensorflow $ python intro_mnist_refactor.py
##### prepare and read data set #####
Extracting images  REAL_ESTATE_data/train-attributes.csv
Extracting labels  REAL_ESTATE_data/train-labels.csv
Extracting images  REAL_ESTATE_data/test-attributes.csv
Extracting labels  REAL_ESTATE_data/test-labels.csv
##### init and run session #####
can't determine number of CPU cores: assuming 4
I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 4
can't determine number of CPU cores: assuming 4
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 4
##### training #####
##### evaluation #####
0.859834
(agile_env)tensorflow $ 

Since there were 5,786 test samples, just under 5,000 of them had their rent class judged correctly. Looking at this number alone, it seems to have done a decent job... or has it?

Looking at the data

You may have noticed it as early as the rent-labeling stage: label 9 (this time, rent of 115,000 yen or more) is overwhelmingly dominant ...! After reading the data set back in, aggregating like this ...

print mnist.train._labels.shape
print mnist.validation._labels.shape
print mnist.test._labels.shape

print numpy.sum(mnist.train._labels, axis = 0)
print numpy.sum(mnist.validation._labels, axis = 0)
print numpy.sum(mnist.test._labels, axis = 0)
(22000, 10): training
(2000, 10): validation
(5786, 10): test
[   127.    158.    199.    235.    314.    407.    442.    539.    598.  18981.]
[    7.    10.    25.    19.    33.    33.    38.    49.    47.  1739.]
[   48.    41.    51.    71.    84.   123.   113.   133.   141.  4981.]

There is far more data labeled 9 than anything else. The standard deviations are:

numpy.std(numpy.sum(mnist.train._labels, axis = 0))
numpy.std(numpy.sum(mnist.validation._labels, axis = 0))
numpy.std(numpy.sum(mnist.test._labels, axis = 0))
5595.73064041
513.175213743
1467.88052647

What about MNIST data?

(55000, 10): training
(5000, 10): validation
(10000, 10): test
[ 5444.  6179.  5470.  5638.  5307.  4987.  5417.  5715.  5389.  5454.]
[ 479.  563.  488.  493.  535.  434.  501.  550.  462.  495.]
[  980.  1135.  1032.  1010.   982.   892.   958.  1028.   974.  1009.]

The handwritten-digit data, by contrast, is well balanced across 0 to 9. The standard deviations are:

291.905806725
37.6218021897
59.1962836671

Next time

Next time, I plan to rerun the analysis after collecting data that is better balanced across labels 0 to 9, and to take a closer look at the contents of the data.
