[PYTHON] A story about predicting prefectures from the names of cities, wards, towns and villages with Jubatus

An old tale

When I first started learning machine learning, GitHub didn't exist yet. I would find source code somewhere on the net, compile it, and end up with either "Hmm, I'm not sure why, but it doesn't work..." or "Hmm, I'm not sure why, but it works...". Failing that, I would write my own implementation with no good way of telling whether it was correct, and experiment with that (definitely not efficient computer science).

These days, though, implementations of even quite complex theory can usually be found on GitHub. And being published on GitHub generally means the usage is clearly documented and anyone can run it. A plain, well-documented interface makes everything easier.

Jubatus is exactly that kind of framework: almost dream-like, in that you can do machine learning without knowing any of the complicated theory. http://jubat.us/ja/

So much for the flattery.

This article is a record of trying out Jubatus. The idea: type in the name of a city, ward, town, or village, and have it answer which prefecture that place name belongs to. Since the goal was simply to "try using it", honestly any task would have done. A nationwide address list with prefecture names was available as a CSV file (downloaded below), so I used that.

Preprocessing

I will use the data downloaded above, but the character encoding is Shift_JIS, which I would rather not deal with, so I convert it to UTF-8.

wget http://jusyo.jp/downloads/new/csv/csv_zenkoku.zip
unzip csv_zenkoku.zip
nkf -w zenkoku.csv > zenkoku_utf-8.csv

The data should now be readable as Japanese text. Incidentally, this conversion is unnecessary on Windows, but here I assume a Linux (CentOS) environment. (Getting Jubatus running on Windows is not straightforward in the first place, so this note is probably moot.)

Next, the first row (the column descriptions) just gets in the way, so I delete it.
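A minimal sketch of that step on a toy file (the column names here are made up; for the real data the same command would run against zenkoku_utf-8.csv):

```shell
# toy CSV with the same shape: one header row, then data rows
printf 'col_a,col_b\nGunma,Isesaki\nKanagawa,Kamakura\n' > toy.csv

# delete the first line in place (GNU sed); on the real file:
#   sed -i '1d' zenkoku_utf-8.csv
sed -i '1d' toy.csv
```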

With this, the data to feed Jubatus is ready; but left as-is, the rows are far too regularly ordered. Training on it directly would just produce a "Hokkaido lover" that answers "Hokkaido" no matter what you ask, so shuffle the lines in advance.

shuf zenkoku_utf-8.csv > shuffled_zenkoku.csv

Save this in a directory called data.

Configuration

Now that the data is in place, it's time to write the Jubatus settings in JSON. https://github.com/chase0213/address_classifier/blob/master/adrs_clf.json

AROW is used as the learning algorithm, for no particular reason.

The input here is a vector whose elements are strings, so the string_rules section specifies how those strings are handled. Since I'm not aiming for anything practical, for now I simply split each string into unigrams (single characters) and count them.

"string_rules": [
      { "key": "*", "type": "unigram", "sample_weight": "bin", "global_weight": "bin" }
]
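For context, a full jubaclassifier configuration containing that rule would look roughly like the following. Everything outside the string_rules section is my assumption about a typical config; check the linked adrs_clf.json for the author's actual file:

```json
{
  "method": "AROW",
  "converter": {
    "string_filter_types": {},
    "string_filter_rules": [],
    "num_filter_types": {},
    "num_filter_rules": [],
    "string_types": {},
    "string_rules": [
      { "key": "*", "type": "unigram", "sample_weight": "bin", "global_weight": "bin" }
    ],
    "num_types": {},
    "num_rules": []
  },
  "parameter": {
    "regularization_weight": 1.0
  }
}
```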

Of course, if you want something practical, this part needs real thought. (Then again, nothing practical can come of this anyway, since almost nothing was done in the preprocessing step.)

See the Jubatus official page (linked above) for the details of the configuration.

Starting the Jubatus server

After completing the settings, start the jubatus server.

$ jubaclassifier --configpath adrs_clf.json

If it starts without errors, it is running.

Training

With the configuration done, we finally enter the learning phase, i.e. training. https://github.com/chase0213/address_classifier/blob/master/train.py

When I tried training on all the data, the request timed out, so for now I fed it about 50,000 rows.

tnum = 50000
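The timeout could also be worked around by sending the training data in smaller batches instead of one giant train() call. A sketch, assuming a chunk size of 1000 (tune as needed; the actual client.train call is left as a comment since it needs a running server):

```python
def chunked(seq, size=1000):
    """Yield successive slices of `seq` so each RPC stays small."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

# usage sketch (assumes `client` and `train_data` as in train.py):
# for batch in chunked(train_data, 1000):
#     client.train(batch)
```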

Normally you would keep separate datasets for training and for evaluation. This time, that was too much hassle.

Nothing particularly difficult happens here, so if you've read this far, the code should be self-explanatory; I'll skip the walkthrough.
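For reference, the data-loading part can be sketched like this. The column positions are an assumption for illustration; verify the actual layout of shuffled_zenkoku.csv (and the linked train.py) before reusing it:

```python
import csv
import random

def load_training_pairs(path, limit=50000):
    """Read up to `limit` (prefecture, city) pairs from the CSV.

    Assumes the prefecture name is in the first column and the
    city/ward/town/village name in the second -- check this against
    the real file.
    """
    pairs = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            pairs.append((row[0], row[1]))
            if len(pairs) >= limit:
                break
    # online learners are order-sensitive, so shuffle before training
    random.shuffle(pairs)
    return pairs
```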

The only important point is this part:

# training data must be shuffled on online learning!
random.shuffle(train_data)

Since this script was adapted directly from the official sample, I kept that comment in deliberately: if you pass the training data without shuffling, the ordering of the data will bias the result. I don't understand the algorithm deeply enough to say anything precise, but presumably the data seen last ends up with the strongest influence. In this case the file was already shuffled, so skipping this step wouldn't hurt much; but if you forget it when reusing the code elsewhere, you're done for.

After shuffling, you can start learning.

# run train
client.train(train_data)

Classification

This part isn't particularly difficult either, so take a look at the code. https://github.com/chase0213/address_classifier/blob/master/detect.py

This time, I fed it three place names, "Isesaki", "Takasaki", and "Kamakura", and asked: which prefecture is this?!
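Picking the winning prefecture from a classify() result boils down to taking the highest-scoring estimate. A sketch, assuming each estimate carries `.label` and `.score` attributes as in the Jubatus Python client's per-datum results:

```python
def best_label(estimates):
    """Return the label of the highest-scoring estimate.

    `estimates` is assumed to be a non-empty list of objects with
    `.label` and `.score` attributes (the shape the Jubatus
    classifier client returns for each classified datum).
    """
    return max(estimates, key=lambda e: e.score).label
```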

Here are the results.

$ python detect.py
Gunma Prefecture Isesaki
Gunma Prefecture Takasaki
Kanagawa Prefecture Kamakura

Oh!! Correct!! Amazing!!!

......

You could just store the 50,000 "prefecture-city" pairs in a plain Python dict and ask it "which prefecture is this?". And since the full dataset is about 160,000 entries, even that lookup would only cover about a third of the place names.
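The dictionary lookup suggested above is trivial to sketch, and it shows exactly where it falls short: unseen names return nothing at all.

```python
def make_lookup(pairs):
    """Build an exact-match baseline from (prefecture, city) pairs."""
    table = {city: pref for pref, city in pairs}
    def predict(city):
        # None for any name not in the pairs -- no generalization
        return table.get(city)
    return predict
```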

I knew before starting that this example wasn't clever, but there is still one redeeming point: the ability to classify unknown data.

Classifiers (and machine learning in general) are about feeding in known data in order to predict unknown data, so even for place names that never appeared in the training data, the model can make a prediction (it will at least return an answer). Doing that with plain Python alone would be quite hard.

That's it for my first try at jubaclassifier.
