[Python] Try using scikit-learn (1) - K-means clustering

Last time I gave a rough overview of the machine learning methods implemented in scikit-learn, and there seems to be some demand, so starting today I would like to work toward understanding and practicing those methods while writing sample machine learning code with scikit-learn.

First, let me revisit the K-means clustering example I did previously.

The K-means method is one of the most basic clustering algorithms; it is simple and fast, which also makes it ideal for getting started. I won't explain how the algorithm works every time, but easy-to-follow explanations are readily available elsewhere.

As before, stock price data is used as the target for clustering.

Stock price data has the following features:

  1. Anyone can get it for free
  2. It is real data that serves as an indicator of a company's performance
  3. It is quantitative, so it is easy to analyze

These properties make it easy to handle.

A company's performance and its stock price are closely related. In fact, it is said that there is a lag of about six months to three years between the two, because investors invest based on expected future performance.

In other words, future performance has already been priced into the stock. For example, when forecasting IT investment demand, a simple sales strategy would be to expect IT demand in areas where business performance is growing.

This time, I will analyze data for the following companies. All of them are in a line of business similar to our company (DTS).

| Code | Company name |
|:-----|:-------------|
| 9682 | DTS |
| 9742 | INES |
| 9613 | NTT DATA |
| 2327 | NS Solutions |
| 9640 | Saison Information Systems |
| 3626 | IT Holdings |
| 2317 | Systena |
| 4684 | Obic |
| 9739 | NSW |
| 4726 | Softbank Technology |
| 4307 | Nomura Research Institute |
| 9719 | SCSK |
| 4793 | Fujitsu BSC |
| 4812 | Dentsu International Information Services |
| 8056 | Nihon Unisys |

Return index

In finance, a return usually refers to the percentage change in an asset's price from a given starting day. A simple return index can be computed with pandas as follows:

```python
import pandas as pd

returns = pd.Series(close).pct_change()  # rate of increase/decrease per day
ret_index = (1 + returns).cumprod()      # cumulative product of (1 + return)
ret_index[0] = 1.0                       # set the first value to 1.0
```
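
For instance, running this on a short, made-up list of closing prices (the values are hypothetical, just for illustration):

```python
import pandas as pd

close = [1000, 1010, 1020, 1035, 1068]  # hypothetical closing prices

returns = pd.Series(close).pct_change()
ret_index = (1 + returns).cumprod()
ret_index[0] = 1.0

print(ret_index.tolist())
# approximately [1.0, 1.01, 1.02, 1.035, 1.068], i.e. close / close[0]
```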

When comparing multiple companies, the return index measures how each asset's value changes relative to a baseline of 1 on the starting day, which puts stocks with different price levels on the same scale.

For example, let's look at the return index for the last 30 days from the date of writing this article.

```python
# read time-series data from a CSV file
df = pd.read_csv(csvfile, index_col=0, parse_dates=True)
df = df[-30:]  # last 30 days

# list of return-index values
indexes = get_ret_index(df)['ret_index'].values.tolist()

# show DTS as an example
if stock == "9682":
    ts = df.index.values
    for t, v in zip(ts, indexes):
        print(t, v)
#=>
# 2015-02-23 1.0
# 2015-02-24 1.010054844606947
# 2015-02-25 1.020109689213894
# 2015-02-26 1.0351919561243146
# 2015-02-27 1.0680987202925045
# ...
# 2015-04-01 1.0237659963436931
# 2015-04-02 1.0530164533820843
# 2015-04-03 1.040219378427788
```

This is what the return index looks like: it starts at 1.0 on the first day and tracks the relative change from there.

This time, let's cluster the stocks using these 30 days of values as the features. In other words, each stock is represented as a 30-dimensional vector.
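
For reference, here is a minimal sketch of what the `get_ret_index` helper used above and the assembly of the feature vectors might look like. The per-ticker CSV naming (`stocks_<code>.csv`) and the `close` column name are assumptions for illustration:

```python
import pandas as pd

def get_ret_index(df):
    """Append a 'ret_index' column to a DataFrame that has a 'close' column."""
    returns = df['close'].pct_change()
    ret_index = (1 + returns).cumprod()
    ret_index.iloc[0] = 1.0  # baseline of 1 on the first day
    df = df.copy()
    df['ret_index'] = ret_index
    return df

codes = ["9682", "9742", "9613", "2327", "9640", "3626", "2317", "4684",
         "9739", "4726", "4307", "9719", "4793", "4812", "8056"]

features, names = [], []
for stock in codes:
    df = pd.read_csv("stocks_{}.csv".format(stock), index_col=0, parse_dates=True)
    df = df[-30:]  # last 30 days
    names.append(stock)
    features.append(get_ret_index(df)['ret_index'].values.tolist())
```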

K-Means clustering

Here, let k = 4.

```python
from sklearn.cluster import KMeans

# cluster the 30-dimensional return-index vectors into k=4 clusters
kmeans_model = KMeans(n_clusters=4, random_state=30).fit(features)
labels = kmeans_model.labels_
for label, name in zip(labels, names):
    print(label, name)
#=>
# 2 9742
# 1 9682
# 2 9613
# 1 2327
# 3 9640
# 1 3626
# 1 2317
# 2 4684
# 0 9739
# 0 4726
# 2 4307
# 1 9719
# 0 4793
# 0 4812
# 1 8056
```

Each ticker code is displayed together with the number of the cluster it belongs to.
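
To work with the result cluster by cluster (for example, for the plots below), the codes can be grouped by label. This grouping step is a small sketch of my own, using the output shown above:

```python
from collections import defaultdict

# group ticker codes by their cluster label
clusters = defaultdict(list)
for label, name in zip(labels, names):
    clusters[label].append(name)

for label in sorted(clusters):
    print(label, clusters[label])
#=>
# 0 ['9739', '4726', '4793', '4812']
# 1 ['9682', '2327', '3626', '2317', '9719', '8056']
# 2 ['9742', '9613', '4684', '4307']
# 3 ['9640']
```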

Visualization

The numbers alone are hard to interpret, so let's visualize the clusters.

```python
import matplotlib.pyplot as plt

# df holds the return-index series for the stocks in one cluster,
# indexed by the dates in ts
df = pd.DataFrame(df, index=ts)
plt.figure()
df.plot()
plt.subplots_adjust(bottom=0.20)
plt.legend(loc="best")
plt.savefig("cluster.png")
plt.close()
```
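
To produce one figure per cluster (df0.png through df3.png below), the plotting can be wrapped in a loop. A sketch, assuming the `clusters` grouping and the `features`, `names`, and `ts` variables from the snippets above:

```python
# one figure per cluster: df0.png, df1.png, df2.png, df3.png
for label in sorted(clusters):
    data = {name: features[names.index(name)] for name in clusters[label]}
    df = pd.DataFrame(data, index=ts)
    df.plot()
    plt.subplots_adjust(bottom=0.20)
    plt.legend(loc="best")
    plt.savefig("df{}.png".format(label))
    plt.close()
```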

First of all, this is cluster 0.

df0.png

Visualized this way, you can see that this cluster groups together the stocks that took a considerable downward swing in March.

Next is cluster number 1.

df1.png

This cluster gathers stocks whose prices were pushed up toward the end of the fiscal year, albeit with a fair range of price movement.

Cluster number 2.

df2.png

Here, the companies that gained value are grouped together. It can be said that these four companies performed well.

Cluster 3 seems to have picked out a company with somewhat irregular price movements.

df3.png

In this way, stocks with similar price movements were grouped into the same cluster. See the table above for the mapping between ticker codes and company names.

This time we targeted only SIer stocks, but the same approach could be used to find similar stocks from the data of all other listed companies.

Japan Exchange - Other statistical data: http://www.jpx.co.jp/markets/statistics-equities/misc/01.html

A list of all listed companies can be downloaded from the page above. I covered how to obtain stock price data in a previous article, so I will omit it here.

Summary

What can we learn from such an analysis?

One idea is to extract companies that show similar indicators from data spanning a wide range of industries: for trading, this can reveal cyclical patterns; for business strategy, it can help estimate hidden demand for IT investment. If clustering can be done mechanically across all industries, it saves the labor of having humans judge and pick out companies.

Alternatively, while we simply used the return index as the feature this time, in principle any indicator can be used. For example, the Nikkei Stock Average is calculated from the average of 225 stocks; if you wanted to build a similar index from only 20 stocks, machine learning could help there as well.
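
As a rough sketch of that last idea (entirely my own illustration, with synthetic data, not from the original analysis): given daily returns of a candidate set of stocks and of a target index, a linear model can learn weights for a small basket that approximates the index.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# hypothetical data: daily returns of 20 stocks over 250 trading days
rng = np.random.default_rng(0)
X = rng.normal(0, 0.01, size=(250, 20))
# hypothetical target: an index built as some weighted mix of those stocks
y = X @ rng.uniform(0, 1, size=20)

model = LinearRegression().fit(X, y)
print(model.coef_)        # learned weight of each stock in the basket
print(model.score(X, y))  # R^2: how closely the basket tracks the index
```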

In any case, nothing is as unreliable as analysis that relies on intuition and experience. Humans have cognitive biases and make emotional decisions. As the field of [Behavioral Economics](http://ja.wikipedia.org/wiki/%E8%A1%8C%E5%8B%95%E7%B5%8C%E6%B8%88%E5%AD%A6) shows, humans do not always make rational decisions. Mechanical analysis support is essential for eliminating human emotion and making rational decisions in financial data analysis.
