In the previous article, regression analysis of the two stock indexes Topix and Nikkei225 showed that the points of the NT plot lie largely on two trend lines, and that, in chronological order, the data moved from the gently sloped Trend-1 (NT ratio = 10.06) to the steeper Trend-2 (NT ratio = 12.81).
** Figure. Reprint of the figure from the previous article (Topix vs. Nikkei225) **
The steep slope of Trend-2 is presumed to be related to the "Abenomics" economic policy, but the regression analysis did not reveal when it started. This time, I used machine learning methods to classify the data into Trend-1 and Trend-2 and tried to determine when Trend-2 began.
Trial.1 - K-Means Clustering
I decided to use scikit-learn as the Python module for machine learning. There are various possible approaches to classification, but I first tried the K-Means method, a typical example of clustering performed without labels.
Code is as follows.
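import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
# mypair: DataFrame with 'topix' and 'n225' columns (date index), prepared in the previous article
mypair.dropna(inplace=True)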
X = np.column_stack([mypair['topix'].values, mypair['n225'].values])
# K-means clustering process
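# Initial cluster centers: one data point from each period (2005-01-04 and 2013-01-04)
myinit = np.array([[mypair.loc['20050104', 'topix'], mypair.loc['20050104', 'n225']],
                   [mypair.loc['20130104', 'topix'], mypair.loc['20130104', 'n225']]])
# n_init=1: with explicit initial centers, repeated initializations are not meaningful
k_means = KMeans(init=myinit, n_clusters=2, n_init=1)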
k_means.fit(X) # ... compute k-means clustering
k_means_labels = k_means.labels_
k_means_cluster_centers = k_means.cluster_centers_
k_means_labels_unique = np.unique(k_means_labels)
colors = ['b', 'r']
n_clusters = 2
for k, col in zip(range(n_clusters), colors):
    my_members = k_means_labels == k
    cluster_center = k_means_cluster_centers[k]
    # points belonging to cluster k
    plt.plot(X[my_members, 0], X[my_members, 1], 'w', markerfacecolor=col, marker='.')
    # center of cluster k
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col, markeredgecolor='k', markersize=6)
plt.title('K-Means')
plt.grid(True)
In the end, this did not work. K-Means measures the (abstract) distance between data points and groups nearby points into the same cluster, so it does not seem suited to data like this, where each group is scattered along a line with a large aspect ratio.
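The behavior described above can be reproduced with a small sketch on synthetic data. Everything here is an assumption made for illustration: the points are random stand-ins for the NT plot (two noisy lines through the origin with slopes 10.06 and 12.81), not the real index data, and K-Means is run with its default initialization rather than the hand-picked initial centers used above.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in for the NT plot (NOT the real data):
# two groups scattered along lines with slopes 10.06 and 12.81.
x1 = rng.uniform(800, 1800, 500)
x2 = rng.uniform(800, 1800, 500)
X = np.vstack([
    np.column_stack([x1, 10.06 * x1 + rng.normal(0, 100, 500)]),
    np.column_stack([x2, 12.81 * x2 + rng.normal(0, 100, 500)]),
])
true_trend = np.repeat([0, 1], 500)  # 0: gentle line, 1: steep line

# Plain K-Means with two clusters (default k-means++ initialization).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Fraction of points whose cluster matches the true line,
# allowing for the two cluster names being swapped.
agreement = max(np.mean(labels == true_trend), np.mean(labels != true_trend))
print(f"agreement with the true trend: {agreement:.2f}")
```

Because Euclidean distance is dominated by the long direction of the data, K-Means tends to cut such a cloud into a "lower" and an "upper" half instead of separating the two lines, so the agreement stays well below 1.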
Trial.2 - Principal Component Analysis (PCA)
Looking at the K-Means plot, I wondered whether some kind of coordinate transformation could turn the linearly scattered data into "lumps" that could then be clustered. When I checked the literature, I found that principal component analysis (PCA) could be applied, so I decided to try classification with PCA.
** Figure. Plot after PCA processing **
Based on this plot, I decided to classify the data into two groups with the boundary line at Y = 0.
# PCA process
pca = PCA(n_components=2)
X_xf = pca.fit(X).transform(X)
plt.scatter(X_xf[:,0], X_xf[:,1])
plt.grid(True)
border_line = np.array([[-6000,0], [6000, 0]])
plt.plot(border_line[:,0], border_line[:,1],'r-', lw=1.6)
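# Color label = flipped sign of the second principal component (values -1 / +1).
# np.sign avoids the division by zero that x / abs(x) hits when x is exactly 0.
# Note: the sign of a PCA component is arbitrary, so check which side is which.
col_v = (-np.sign(X_xf[:, 1])).astype(int)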
mypair['color'] = col_v
mypair['color'].plot(figsize=(8,2), grid=True, lw=1.6) # color historical chart
plt.ylim([-1.2, 1.2])
# plot scatter w/ colors
plt.figure(figsize=(12,5))
plt.subplot(121)
plt.scatter(X_xf[:,0], X_xf[:,1], marker='o', c=col_v)
plt.grid(True)
plt.title('Topix vs. Nikkei225 (PCA processed)')
plt.subplot(122)
plt.scatter(X[:,0], X[:,1], marker='o', c=col_v)
plt.grid(True)
plt.title('Topix vs. Nikkei225 (raw values)')
The results are shown in the figure below. (Sorry, the "color" is hard to see.)
The left panel shows the points color-coded in the coordinate system transformed by PCA, and the right panel shows the same colors in the original coordinate system. The trends are grouped as originally intended, so taking Y = 0 as the boundary between the groups appears to be reasonably valid.
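As a rough check of the Y = 0 rule, the sketch below applies the same idea to synthetic data. The data is an assumption for illustration only (two noisy lines with slopes 10.06 and 12.81, not the real Topix / Nikkei225 values), and `side` and `accuracy` are names introduced here.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for the NT plot (NOT the real data).
x1 = rng.uniform(800, 1800, 500)
x2 = rng.uniform(800, 1800, 500)
X = np.vstack([
    np.column_stack([x1, 10.06 * x1 + rng.normal(0, 100, 500)]),
    np.column_stack([x2, 12.81 * x2 + rng.normal(0, 100, 500)]),
])
true_trend = np.repeat([0, 1], 500)  # 0: gentle line, 1: steep line

# Rotate into principal-component coordinates (PCA centers the data first).
X_xf = PCA(n_components=2).fit_transform(X)

# Classify by which side of the first principal axis a point falls on,
# i.e. by the sign of the second component (boundary at Y = 0).
side = (X_xf[:, 1] > 0).astype(int)

# The sign of a PCA component is arbitrary, so allow swapped labels.
accuracy = max(np.mean(side == true_trend), np.mean(side != true_trend))
print(f"points on the expected side of Y = 0: {accuracy:.2f}")
```

On data like this the split is good but not necessarily perfect: the first principal axis is chosen to maximize variance, not to separate the two trends, so some points near one end of the cloud can land on the wrong side.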
With PCA, the data could be classified into a blue group with a gentle slope and a red group with a steep slope. Let's plot this color series as a historical chart.
** Figure. Trend (color) transition (y = -1: Trend-1, y = + 1: Trend-2) **
The chart above shows ** Trend-1 ** until the latter half of 2009, followed by a short transition period. From the second quarter of 2011 the data is in ** Trend-2 **, with its large NT ratio, and this continues until 2014. If ** Trend-2 ** = "Abenomics", it can be inferred that Abenomics started in the first half of 2011. (The first half of 2011 reminds me of the Great East Japan Earthquake.)
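Instead of reading the switch dates off the chart, they could also be pulled out of the color series directly. The following is a minimal sketch under assumptions: `color` is a short hypothetical stand-in for `mypair['color']` with made-up values, not the actual classification result.

```python
import pandas as pd

# Hypothetical stand-in for mypair['color']:
# -1 = Trend-1, +1 = Trend-2, indexed by business day (made-up values).
dates = pd.date_range("2011-01-03", periods=10, freq="B")
color = pd.Series([-1, -1, -1, 1, -1, 1, 1, 1, 1, 1], index=dates)

# A switch occurs wherever the label differs from the previous day's label.
changed = color != color.shift()
changed.iloc[0] = False  # the first day has no predecessor to compare with

switches = color[changed]  # label taken on at each switch date
last_switch_to_trend2 = switches[switches == 1].index[-1]

print(switches)
print("last switch into Trend-2:", last_switch_to_trend2.date())
```

With the real series, the date of the last switch into +1 would give a concrete candidate for the start of Trend-2, and the number of switches before it indicates how long the transition period was.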
I would like to consider the application of other machine learning methods and the verification of this PCA method in the future. Also, when other economic data (for example, fossil fuel imports) are available, I would like to investigate the relationship with them.
- Data Scientist Training Reader (Gijutsu-Hyohron) http://gihyo.jp/book/2013/978-4-7741-5896-9