# I started machine learning with Python (and started posting to Qiita): Data preparation

## Nice to meet you

I'm an engineer at a mid-sized company. On a whim, I decided to start summarizing what I've been learning on my own here on Qiita. For now I'll write up the machine learning I'm studying. (If I burn out, I'll take a break with another topic that's easier to write about.) I plan to use mainly Python as the language.

## Please note!

  • I prefer learning by touching and experimenting rather than by theory, so this will be fairly rough and unsystematic.
  • I'm not much of a writer, so it may be hard to follow.
  • For those reasons there will probably be parts you want to object to, but please bear with me!

  • Gentle ribbing (tsukkomi) is welcome

## Summary in 3 lines

  • Let's study machine learning using Twitter data.
  • This time: preparing the data and the environment.
  • Next time, I plan to cluster tweets and compute their similarity.

## Dataset

When I first started studying, I worked through reference books and tutorials using the "iris" and "Titanic" datasets as-is, but I had no interest in the data itself, so nothing stuck at all...

So I changed my approach and decided to use **Twitter** data, which seems more fun to analyze. To make it even more interesting (?), this time I'll focus on tweets about "**Perfume**". (Note: I'm on team Nocchi.)

Incidentally, when I actually collected tweets containing "Perfume" and looked at them, there were fewer than I expected (about 250 per hour), and even fewer once I excluded tweets that looked like bots. Maybe many fans keep their tweets private. (I wonder if I could publish private tweets with the user names removed... no, impossible...)

As a result the dataset is a bit small, so I'm thinking of adding other artists for comparison later. (Candidates: **CAPSULE**, **Sakanaction**...)

As for what exactly to do with machine learning... **I'll figure it out as I go**.

## How to prepare the data

I had already built an environment that collects tweets about Perfume with the Streaming API and accumulates them in Elasticsearch. This time, to keep the study loop fast, I'll dump some of the data from Elasticsearch to a file (**es.log**) and run the machine learning on it with a local script (**tw_ml.py**).

It looks like the following.

(Architecture diagram: 構成図.jpg)

The data format is as follows. (It was saved as a Python dict literal, so the Unicode strings carry a `u` prefix.)

es.log (one record excerpt; data is dummy)

```python
[{u'_score': 1.0, u'_type': u'raw', u'_id': u'AVZkL6ZevipIIzTJxrL7', u'_source': {u'retweeted_status': u'True', u'text': u'Perfume\u306e\u597d\u304d\u306a\u6b4c', u'user': u'xxxxxxxxxx', u'date': u'2016-08-07T08:45:27', u'retweet_count': u'0', u'geo': u'None', u'favorite_count': u'0'}, u'_index': u'tweet'}]
```

To read es.log from Python, open it with `codecs` and use `ast.literal_eval` to turn it back into a list of dicts. For now there are 265 records; if that turns out to be too little data, I'll pull more from Elasticsearch.

tw_ml.py (excerpt)

```python
import codecs
import ast

# es.log is a Python-dict-literal dump, so literal_eval can parse it back safely
with codecs.open("es.log", "r", "utf-8") as f:
    es_dict = ast.literal_eval(f.read())
    print "doc:%d" % len(es_dict)  # doc:265
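Once parsed, each record's tweet body lives under `_source['text']` (per the dummy record above). A minimal sketch of pulling out just the texts for later vectorization; the helper name and the sample record are my own illustration, written to run under both Python 2 and 3:

```python
def extract_texts(es_docs):
    """Collect the tweet bodies from Elasticsearch-style hit dicts."""
    texts = []
    for doc in es_docs:
        source = doc.get(u'_source', {})
        text = source.get(u'text')
        if text:  # skip records with no tweet body
            texts.append(text)
    return texts

# Dummy record in the same shape as the es.log excerpt above
docs = [{u'_source': {u'text': u'Perfume\u306e\u597d\u304d\u306a\u6b4c',
                      u'user': u'xxxxxxxxxx'}}]
print(extract_texts(docs))  # one tweet body
```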

## Environment and main libraries

Script execution environment:

  • Windows 7 64-bit
  • Python 2.7

Main Python libraries:

  • numpy 1.11.1
  • scipy 0.12.0
  • scikit-learn 0.16.0
  • mecab-python 0.996

By the way, I use MeCab for Japanese morphological analysis, with "**mecab-ipadic-neologd**" as the dictionary. Without it, even "Kashiyuka", "A-chan", and "Nocchi" aren't recognized as words... lol

  • However, even with mecab-ipadic-neologd, "Nocchi" comes out in katakana as "notch"! Not good!!

tw_ml.py (excerpt)

```python
import MeCab as mc

# Point MeCab at the mecab-ipadic-neologd dictionary, with ChaSen output format
MECAB_OPT = "-Ochasen -d C:\\tmp\\mecab-ipadic-neologd\\"
t = mc.Tagger(MECAB_OPT)
```
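With `-Ochasen`, `Tagger.parse()` returns one token per line with tab-separated columns (surface form first), terminated by an `EOS` line. A small helper for picking out the surface forms, shown against a hard-coded sample so it runs without MeCab installed; the helper name and the sample lines are my own illustration, not real tagger output:

```python
def chasen_surfaces(parsed):
    """Extract surface forms from ChaSen-format (-Ochasen) tagger output."""
    surfaces = []
    for line in parsed.splitlines():
        if line == u'EOS' or not line.strip():
            continue  # end-of-sentence marker / blank line
        columns = line.split(u'\t')
        surfaces.append(columns[0])  # column 0 is the surface form
    return surfaces

# Illustrative sample in ChaSen's tab-separated layout
sample = (u'Perfume\t\u30d1\u30d5\u30e5\u30fc\u30e0\tPerfume\t\u540d\u8a5e\n'
          u'\u306e\t\u30ce\t\u306e\t\u52a9\u8a5e\n'
          u'\u6b4c\t\u30a6\u30bf\t\u6b4c\t\u540d\u8a5e\n'
          u'EOS')
print(chasen_surfaces(sample))  # surface forms only
```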

## So then...

...I was going to continue, but this has gotten long, so the actual machine learning will start next time! lol For now, I plan to try clustering tweets and computing their similarity.
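As a small preview of the similarity part, here is a minimal sketch of cosine similarity between two term-count vectors, in plain Python with no scikit-learn; the vectors are made-up examples, not real tweet data:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)

# Two toy "tweet" term-count vectors over the same 4-word vocabulary
v1 = [2, 1, 0, 1]
v2 = [1, 1, 1, 0]
print(round(cosine_similarity(v1, v2), 3))  # -> 0.707
```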

## Helpful links (mainly for how to write)

  • Use Font-Awesome for Qiita article headlines to improve the appearance #qiita
  • One year has passed since the second-year programmer posted to Qiita once a week
