[Python] Try using scikit-learn (1) - K-means clustering

Last time I gave a rough overview of the machine learning methods implemented in scikit-learn, and there seems to be some demand, so starting today I would like to work toward understanding and practicing those methods while writing sample machine learning code with scikit-learn.

First, let me revisit the K-means clustering example I did previously.

The K-means method is one of the most basic clustering algorithms; it is simple and fast, which also makes it ideal for getting started. I won't explain how the algorithm works every time, but easy-to-follow explanations are readily available elsewhere.

As before, stock price data is used as the target for clustering.

Stock price data has the following features:

  1. Anyone can get it for free
  2. It is real data that serves as an indicator of a company's performance
  3. It is quantitative, so it is easy to analyze

These properties make it easy to handle.

A company's performance and its stock price are closely related. In fact, it is said that there is a lag of about six months to three years between the two, because investors invest based on expected future performance.

In other words, future performance has already been priced into the stock. For example, when forecasting IT investment demand, a simple sales strategy would be to expect IT demand in areas where business performance is growing.

This time, I will analyze data for the following companies. All of them are in a line of business similar to our company (DTS).

| Code | Company name |
|:-----|:-------------|
| 9682 | DTS |
| 9742 | INES |
| 9613 | NTT DATA |
| 2327 | NS Solutions |
| 9640 | Saison Information Systems |
| 3626 | IT Holdings |
| 2317 | Systena |
| 4684 | Obic |
| 9739 | NSW |
| 4726 | Softbank Technology |
| 4307 | Nomura Research Institute |
| 9719 | SCSK |
| 4793 | Fujitsu BSC |
| 4812 | Dentsu International Information Services |
| 8056 | Nihon Unisys |

Return index

In finance, a return usually refers to the percentage change in an asset's price from a given starting day. A simple return index can be computed with pandas as follows:

```python
import pandas as pd

returns = pd.Series(close).pct_change()  # rate of increase/decrease per day
ret_index = (1 + returns).cumprod()      # cumulative product of (1 + return)
ret_index[0] = 1.0                       # set the first value to 1.0
```
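
For instance, running this on a short, made-up list of closing prices (the values are hypothetical, just for illustration):

```python
import pandas as pd

close = [1000, 1010, 1020, 1035, 1068]  # hypothetical closing prices

returns = pd.Series(close).pct_change()
ret_index = (1 + returns).cumprod()
ret_index[0] = 1.0

print(ret_index.tolist())
# approximately [1.0, 1.01, 1.02, 1.035, 1.068], i.e. close / close[0]
```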

When comparing multiple companies, the return index measures how each asset's value changes relative to a baseline of 1 on the starting day, which puts stocks with different price levels on the same scale.

For example, let's look at the return index for the last 30 days from the date of writing this article.

```python
# read time-series data from a CSV file
df = pd.read_csv(csvfile, index_col=0, parse_dates=True)
df = df[-30:]  # last 30 days

# list of return-index values
indexes = get_ret_index(df)['ret_index'].values.tolist()

# show DTS as an example
if stock == "9682":
    ts = df.index.values
    for t, v in zip(ts, indexes):
        print(t, v)
#=>
# 2015-02-23 1.0
# 2015-02-24 1.010054844606947
# 2015-02-25 1.020109689213894
# 2015-02-26 1.0351919561243146
# 2015-02-27 1.0680987202925045
# ...
# 2015-04-01 1.0237659963436931
# 2015-04-02 1.0530164533820843
# 2015-04-03 1.040219378427788
```

This is what the return index looks like: it starts at 1.0 on the first day and tracks the relative change from there.

This time, let's cluster the stocks using these 30 days of values as the features. In other words, each stock is represented as a 30-dimensional vector.
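
For reference, here is a minimal sketch of what the `get_ret_index` helper used above and the assembly of the feature vectors might look like. The per-ticker CSV naming (`stocks_<code>.csv`) and the `close` column name are assumptions for illustration:

```python
import pandas as pd

def get_ret_index(df):
    """Append a 'ret_index' column to a DataFrame that has a 'close' column."""
    returns = df['close'].pct_change()
    ret_index = (1 + returns).cumprod()
    ret_index.iloc[0] = 1.0  # baseline of 1 on the first day
    df = df.copy()
    df['ret_index'] = ret_index
    return df

codes = ["9682", "9742", "9613", "2327", "9640", "3626", "2317", "4684",
         "9739", "4726", "4307", "9719", "4793", "4812", "8056"]

features, names = [], []
for stock in codes:
    df = pd.read_csv("stocks_{}.csv".format(stock), index_col=0, parse_dates=True)
    df = df[-30:]  # last 30 days
    names.append(stock)
    features.append(get_ret_index(df)['ret_index'].values.tolist())
```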

K-Means clustering

Here, let k = 4.

```python
from sklearn.cluster import KMeans

# cluster the 30-dimensional return-index vectors into k=4 clusters
kmeans_model = KMeans(n_clusters=4, random_state=30).fit(features)
labels = kmeans_model.labels_
for label, name in zip(labels, names):
    print(label, name)
#=>
# 2 9742
# 1 9682
# 2 9613
# 1 2327
# 3 9640
# 1 3626
# 1 2317
# 2 4684
# 0 9739
# 0 4726
# 2 4307
# 1 9719
# 0 4793
# 0 4812
# 1 8056
```

Each ticker code is displayed together with the number of the cluster it belongs to.
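
To work with the result cluster by cluster (for example, for the plots below), the codes can be grouped by label. This grouping step is a small sketch of my own, using the output shown above:

```python
from collections import defaultdict

# group ticker codes by their cluster label
clusters = defaultdict(list)
for label, name in zip(labels, names):
    clusters[label].append(name)

for label in sorted(clusters):
    print(label, clusters[label])
#=>
# 0 ['9739', '4726', '4793', '4812']
# 1 ['9682', '2327', '3626', '2317', '9719', '8056']
# 2 ['9742', '9613', '4684', '4307']
# 3 ['9640']
```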

Visualization

The numbers alone are hard to interpret, so let's visualize the clusters.

```python
import matplotlib.pyplot as plt

# df holds the return-index series for the stocks in one cluster,
# indexed by the dates in ts
df = pd.DataFrame(df, index=ts)
plt.figure()
df.plot()
plt.subplots_adjust(bottom=0.20)
plt.legend(loc="best")
plt.savefig("cluster.png")
plt.close()
```
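
To produce one figure per cluster (df0.png through df3.png below), the plotting can be wrapped in a loop. A sketch, assuming the `clusters` grouping and the `features`, `names`, and `ts` variables from the snippets above:

```python
# one figure per cluster: df0.png, df1.png, df2.png, df3.png
for label in sorted(clusters):
    data = {name: features[names.index(name)] for name in clusters[label]}
    df = pd.DataFrame(data, index=ts)
    df.plot()
    plt.subplots_adjust(bottom=0.20)
    plt.legend(loc="best")
    plt.savefig("df{}.png".format(label))
    plt.close()
```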

First of all, this is cluster 0.

df0.png

Visualized this way, you can see that this cluster groups together the stocks that took a considerable downward swing in March.

Next is cluster number 1.

df1.png

This cluster gathers stocks whose prices were pushed up toward the end of the fiscal year, albeit with a fair range of price movement.

Cluster number 2.

df2.png

Here, the companies that gained value are grouped together. It can be said that these four companies performed well.

Cluster 3 seems to have picked out a company with somewhat irregular price movements.

df3.png

In this way, stocks with similar price movements were grouped into the same cluster. See the table above for the mapping between ticker codes and company names.

This time we targeted only SIer stocks, but the same approach could be used to find similar stocks from the data of all other listed companies.

Japan Exchange - Other statistical data: http://www.jpx.co.jp/markets/statistics-equities/misc/01.html

A list of all listed companies can be downloaded from the page above. I covered how to obtain stock price data in a previous article, so I will omit it here.

Summary

What can we learn from such an analysis?

One idea is to extract companies that show similar indicators from data spanning a wide range of industries: for trading, this can reveal cyclical patterns; for business strategy, it can help estimate hidden demand for IT investment. If clustering can be done mechanically across all industries, it saves the labor of having humans judge and pick out companies.

Alternatively, while we simply used the return index as the feature this time, in principle any indicator can be used. For example, the Nikkei Stock Average is calculated from the average of 225 stocks; if you wanted to build a similar index from only 20 stocks, machine learning could help there as well.
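
As a rough sketch of that last idea (entirely my own illustration, with synthetic data, not from the original analysis): given daily returns of a candidate set of stocks and of a target index, a linear model can learn weights for a small basket that approximates the index.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# hypothetical data: daily returns of 20 stocks over 250 trading days
rng = np.random.default_rng(0)
X = rng.normal(0, 0.01, size=(250, 20))
# hypothetical target: an index built as some weighted mix of those stocks
y = X @ rng.uniform(0, 1, size=20)

model = LinearRegression().fit(X, y)
print(model.coef_)        # learned weight of each stock in the basket
print(model.score(X, y))  # R^2: how closely the basket tracks the index
```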

In any case, nothing is as unreliable as analysis that relies on intuition and experience. Humans have cognitive biases and make emotional decisions. As the field of [Behavioral Economics](http://ja.wikipedia.org/wiki/%E8%A1%8C%E5%8B%95%E7%B5%8C%E6%B8%88%E5%AD%A6) shows, humans do not always make rational decisions. Mechanical analysis support is essential for eliminating human emotion and making rational decisions in financial data analysis.
