I clustered the dollar yen using the k-medoids method in python and calculated the correct answer rate.

Introduction

Starting with "Forecasting stock prices and exchange rates seems to be interesting" and "Let's read some papers", if you download and read some papers and academic conference materials that you may be interested in, personally I was impressed to find a material that seems to be extremely interesting. When it ends there, dame. I haven't touched the stock price and exchange rate data until now because I tried to implement it in my own way, not just reading it, but I got the data and tried to make it a form.

Target: For those who understand the basics of programming, such as arithmetic operations, if statements, and for statements, and are interested in analyzing stock prices and exchange rates. It is written so that first and second graders who started programming from university can read it. It is not for people who have advanced knowledge and can make various predictions and analyzes (I can't write first ... lol))

Overview: The flow up to clustering the past dollar-yen using the k-medoids method and finding the correct answer rate is described.

1 Data acquisition

This time, we have acquired daily data of dollar yen for the past 15 years and hourly data for the past 5 years. I didn't even know that I could get the data, so I took a lot of time here, but I happened to discover the existence of "oanda api" and managed to get it by referring to Qiita, which summarizes how to use it. The acquired daily data looks like this (see the figure below). スクリーンショット 2020-03-03 18.32.24.png excel is somewhat more intimate.

Get data from OANDA API

I refer to How to get a large amount of past exchange data from FX API (for machine learning) I got the data. It took a lot of time to use the API, so here are the steps. First of all, oanda api is an api provided by FX trader oanda. You need an id and key to use it, and you have to open a demo account for oanda. "Oanda's homepage" https://www.oanda.jp Select "Open new account"-> "Open new demo account" from the homepage. Enter various information in the free demo account opening form and issue a demo account ID. After that, an email with the ID and password will be sent, so log in to the demo account. There is "Account ID" in the account information in the middle part of the figure below. Enter from "API Access Management" at the bottom right and get a Personal Access Token. If you can get AccountID and AccessKey (PersonalAccessToken, you can use API with this.

Install a package called oandapy

pip install git+https://github.com/oanda/oandapy.git

Get currency exchange information. Load the required library and try to get the current dollar-yen rate.


import pandas as pd
import oandapy
import configparser
import datetime
from datetime import datetime, timedelta
import pytz

account_id = "xxxxx"
access_key = "xxxxx"

#oanda API call
oanda = oanda.API(access_token = access_key, environment = "practice")

#Get the dollar-yen rate for the current time
res = oanda.get_prices(instruments = "USD_JPY")

Output result ↓ ↓ ↓ {'prices': [{'ask': 107.321, 'bid': 107.317, 'instrument': 'USD_JPY', 'time': '2020-03-05T06:12:23.365940Z'}]} Due to the influence of the coronavirus, it has plummeted from the 112 yen level to the 107 yen level. From here on, go to How to get a large amount of past exchange data from FX API (for machine learning). It is written in a very easy-to-understand manner, so please refer to this article to obtain the data for the desired period.

2 Analysis method

Perform the analysis according to the following steps. Step1: Determine the forecast period, data collection period, and verification period Step2: Cluster the data collection period Step3: Of the verification period, trade for the forecast period Step4: Add the data of the transaction period you performed to the data collection period Step5: When the verification period is not over, return to Step2 and end when the verification period is over.

<img width = "600"src="https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/512508/db4b3cf1-8577-914c-fc19-7580deeab002.png ">

Each period has the following meanings: Data collection period: A period for referring to past price movements. (Generally training data) Verification period: Period for evaluating the correct answer rate. (Generally test data) Prediction interval: What will happen next month, next week, tomorrow? Period for forecasting (monthly, weekly, and daily)

As an example, the forecast period is monthly, the data collection period is 156 months from 2003 to the end of 2015, and the verification period is 36 months from 2016 to the end of 2018. (Once you understand it, just change the period) It looks like this in the figure. (Figure below)

Apply the k-medoid method to the data collection period in Step 2. The image is "When the data collection period is divided into several patterns, what pattern will the expected period be?" So first we classify the data collection period into several patterns. Since the price movements of the chart are basically classified into three patterns of rising, leveling, and falling, we classify them into three classes here.

Here is a summary of the price movements for 156 months (see the figure below).

The number of series is 156, and the reason why the number of elements varies is that there are differences such as a month ending in 30 days, a month ending in 31 days, and February. In addition, foreign exchange is not traded on holidays and New Year's Day, so the number of elements in each series is about 18 to 22.

In order to understand the characteristics of each series, divide each element by the value at the beginning of each period (element [0]). Then, it looks like this (see the figure below).

DTW distance

Next, I will classify these confused series into three categories. But the question here is how to classify? Is it possible to quantitatively evaluate each shape in time series? I thought, but it can be solved by using a measurement method called DTW distance. This returns a value when you put the two time series you want to compare in the DTW function, and you can use this to quantitatively evaluate between each time series.

For example, if there is the following time series from A to E

If you put the time series of A and B in the DTW function, it will be 297. スクリーンショット 2020-03-03 20.28.25.png The table below shows the DTW function for all combinations from A to E. スクリーンショット 2020-03-03 20.29.25.png The DTW distance of the closest time series A and B is 297, and the DTW distance of the farthest time series D and E is 2900. ..

DTW algorithm, code

The algorithm looks like this: Step1: Create a Cost matrix of length M x length N obtained by finding the difference between the absolute values of each point. Step2: Create an M × N Dist matrix, $ Dist {(1, 1)} = Cost {(1,1)} $, Initialize by substituting $ ∞ $ for other elements. Step3: In the 1st row and 1st column of the Dist matrix, add the value of the previous value and the value of the Cost matrix at the same position. Step4: Create a Dist matrix according to the following formula.

Dist_{(i, j)} = Cost{(i, j)} + min(Dist{(i, j-1)}, Dist{(i-1, j)}, Dist{(i-1, j-1)})

Step5: Let $ Dist {(M, N)} $ be the DTW distance. First, pay attention to the time series of A and B, and what happens when measuring the DTW distance? If you create a Cost matrix that finds the difference between the absolute values of each point from Step 1, initialize the Dist matrix from Step 2 and Step 3, and create a Dist matrix according to Step 4, it will be as follows. From Step 5, the DTW distance between time series A and B is calculated to be 297. The following is a function of these.

#dtw Function to find the distance
def dtw(x, y):
    #Create a distance matrix.
    X, Y = np.meshgrid(x, y)
    dist = abs(X-Y)  #Euclidean distance
    #print(dist)

    #Matrix initialization to calculate dtw
    #On the dtw algorithm, dtw[-1][-1]To refer to first,line,Original time series length for both columns+Secure a dtw matrix of 1
    dtw = np.full((len(y) + 1, len(x) + 1), np.inf)
    dtw[0, 0] = 0

    for i in range(1, len(y) + 1):
        for j in range(1, len(x) + 1):
            dtw[i, j] = min(
                dtw[i - 1, j],
                dtw[i, j - 1],
                dtw[i - 1, j - 1]) + dist[i - 1, j - 1]
    return dtw[-1, -1]

Use the dtw function to create a dtw matrix that stores the dtw distance values between each time series (a table containing all the combinations A to E above).

def make_dtw_matrix(data):
    dtw_matrix = [[0 for i in range(len(data))] for j in range(len(data))]
    for i in range(0, len(data)):
        for j in range(0, len(data)):
            dtw_matrix[i][j] = dtw(data[i], data[j])
    return dtw_matrix

k-medoids method

Next, we will finally perform clustering using the k-medoids method. The k-medoids method is a division optimization clustering method similar to the k-means method. In the k-means method, the base point is set to the centroid, but in the k-medoids method, other items are included in the assigned class. The points that minimize the sum of the distances to all points are used as the base points (medoids). Therefore, the k-means method has the disadvantage that it is easily affected by outliers from the calculation method of the centroid, but k- The medoids method has the advantage of reducing the effect of outliers because one of the unclassified data is assigned to the medoids. (I referred to the k-means method, but this article without knowing it. Can be read, so it's okay.)

The distance between time series is quantified by the DTW distance, and the quantified dtw matrix is classified using the k-medoids method.

It looks like this when classified into three classes. (Figure below) image.pngimage.pngimage.png Somehow it is divided into a flat class (53 series), a falling class (66 series), and an ascending class (37 series).

k-medoids method algorithm, code

The algorithm will proceed in the next step. Step1: Randomly select points for k classes. (Medoid) Step2: Assign each point to the nearest medoid class. Step3: The point where the total distance to all other points in each class is minimized is newly set as medoid. Step4: If there is no change, it ends, and if there is a change, it returns to Step2. The code is posted below.

As an example, what happens if we classify the time series A to E into two classes? I will do it. The dtw matrix is a symmetric matrix with 5 × 5 diagonal elements 0 in the above table.

First, select two series at random in Step 1. Here, let's select the series A and B as medoid. Corresponds to initialize_medoids in the code. A medoid is the one in the class that minimizes the sum of the distances to all the points in the class. So this time, we have two medoids. The medoids that store the medoids are medoids = \ [A, B ](programmatically, medoids = [0, 1]).

Step2 Assign each series (A to E) to the nearest medoid class. Corresponds to assign_to_nearest in the code. Here, the assigned class is referred to as a label, and since it is classified into two classes, we will prepare two labels, "0" and "1". Since we have not done anything yet, each of the current ones The label assigned to the series is label = \ [?,?,?,?,? ](Label = [∞, ∞, ∞, ∞, ∞] on the program).

Focusing on the dtw matrix, the series selected for medoid are A and B. The medoid with label 0 is now series A, and the medoid with label 1 is now series B. If you look at which medoid each series is closer to, you will see the red circle below.

Therefore, the label to which the medoid closest to each series is assigned is label = \ [0, 1, 0, 0, 1 ].

Step3: The point where the total distance to all other points in each class is minimized is newly set as medoid. Corresponds to update_medoids in the code. Focus on each series in the class, prepare a mindist that stores the value that minimizes the total distance from that series to all other series in the class, and medoids that store new medoids. It is in the state of mindist = \ [?,? ], Medoids = \ [?,? ]. Then, which label is assigned to A to E? , What is the total distance to other labels? Let's update the series with the smallest total distance to other labels to the new medoid. First series A. The series A is labeled 0, and the other series assigned to 0 correspond to the red circle below when checking the values from the series C and the series D. dtw matrix. Since 363 + 1388 = 1751, mindist = \ [1751,? ], Medoids = \ [0,?]. Series B. Series B is labeled 1, and the only series assigned to 1 is series E, so if you check the value from the dtw matrix, 1156. So mindist = \ [1751, 1156 ], medoids = \ [0, 1 ](so far, it hasn't changed) Series C. Series C is labeled 0, and the other series assigned to 0 are series A and series D, so if you check the values from the dtw matrix, the total is 1447 (363 + 1084). Since this is smaller than 1751, update mindist = \ [1447, 1156 ], medoids = \ [2, 1 ]. Series D has label 0, and if you check the values from the dtw matrix, the total is 2472, which is through because the mindist is not updated. Series E checks the value from the dtw matrix as well as label 1, and passes through because it does not update the mindist.

From the above, medoid has been updated from \ [0, 1 ] to \ [2, 1 ]. Allocate this new medoid to the medoid closest to each series using the same procedure as in Step 2. It can be divided into good feelings.

I think that the explanation is verbose and difficult to understand, but the k-medoids method is to repeat the above procedure until the medoid is not updated and classify it.

Below is a program of the k-medoids method written in python. (I wrote it so far, but it is easier to use it because the library is prepared ...)

#kmedoids method algorithm
def kmedoids(dtw_matrix, total_class_num):
    medoids = initialize_medoids(dtw_matrix, total_class_num)
    label = [0 for i in range(len(dtw_matrix))] #len(dtw_matrix)Contains 0 of the length of,So now all time series labels are 0
    for i in range(0, 100):
        new_label = assign_to_nearest(dtw_matrix, medoids)
        if new_label == label:
            break
        label = new_label
        medoids = update_medoids(dtw_matrix, label, total_class_num)
    return (label, medoids)

def update_medoids(dtw_matrix, label, total_class_num):
    n = len(dtw_matrix)
    mindists = [np.inf for i in range(total_class_num)]  #Array with inf for the number of classes. k=If 3[inf, inf, inf]
    medoids = [np.inf for i in range(total_class_num)]
    for i in range(0, n):
        ts_label = label[i]
        dist_total = 0
        for j in range(0, n):
            if label[j] == ts_label:
                dist_total += dtw_matrix[i][j]
        if dist_total < mindists[ts_label]:
            mindists[ts_label] = dist_total
            medoids[ts_label] = i
    return medoids

def assign_to_nearest(dtw_matrix, medoids):
    total_class_num = len(medoids)
    label = [0 for i in range(len(dtw_matrix))]
    for i in range(0, len(dtw_matrix)):
        mindist = np.inf
        nearest = 0
        for j in range(0, total_class_num):
            if dtw_matrix[i][medoids[j]] < mindist:
                mindist = dtw_matrix[i][medoids[j]]
                nearest = j
        label[i] = nearest
    return label

def initialize_medoids(dtw_matrix, total_class_num):
    medoids = list(range(len(dtw_matrix)))
    return medoids[0:total_class_num]

transaction

Then make a deal. Focus on the series immediately before the predicted series and check which class the series belongs to. If more than half of the series in that class are rising (falling), we predict that they will rise (fall) in the future and make a buy (sell) decision. If it actually rises even a little, it is counted as success, and the predicted period is stored in the data collection period and clustering is performed again.

In the previous example, when predicting January 2016, which is the first of the verification period (2016 to the end of 2018), the classified class of the series of December 2015, which is the previous series, is used. Confirmation. Transaction according to the class, after the end, the series of January 2016 is stored in the data collection period and clustered again, then the series of February 2016 is predicted. <img width = "600"src="https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/512508/db4b3cf1-8577-914c-fc19-7580deeab002.png ">

In this way, the correct answer rate was calculated by devising various data collection periods and verification periods, and changing the prediction period.

3 Analysis execution

When you finish writing the code, you can get rich if you can analyze it accurately! I was excited. But it wasn't that sweet. The results are as follows.

Prediction interval: monthly

image.png image.png image.png image.png image.png

Prediction interval: weekly

image.png image.png image.png image.png image.png

Prediction interval: daily

image.png image.png image.png image.png image.png

The number of classes was limited to 5 patterns from 3 to 7 (it did not change much even if I increased it) and the result was obtained. All the results have a correct answer rate of about 40 to 50%, which is a little disappointing. But there is still plenty of room for improvement.

After reviewing the trading method and considering and implementing two hypotheses, we were able to achieve an average correct answer rate of about 65% for each week. I will summarize this in another article.

Execution time is long!

It's very long. In cases where the prediction interval is daily, it takes about 100 minutes to execute once. I wanted to get results sooner. I'm using a MacBook Air (Early 2015), a 1.6GHz dual-core Intel Core i5 processor, and 8GB of memory that I haven't customized, but I bought a new computer the other day.

The specifications are CPU: Core i7-9850H (2.6GHz, 6 cores), memory: 16GB, NVIDIA GeForce MX150 (GPU).

In addition to this program, I'm thinking about comparing how much time difference will occur when a machine learning program based on books etc. is executed on a MacBook Air and a laptop computer equipped with a GPU.

References

How to get a large amount of past Forex data from FX API (for machine learning)Stock Price Forecast Using Similarity of Stock Price Fluctuation PatternsMarket Forecasting Using Price Fluctuation Patterns k-Medoids Clustering with Indexing Dynamic Time Warping Applied to Stock Market / _pdf) ・ Analysis of securities / exchange / virtual currency market by price fluctuation pattern

Recommended Posts

I clustered the dollar yen using the k-medoids method in python and calculated the correct answer rate.
I tried the least squares method in Python
Determine the threshold using the P tile method in python
I checked the reference speed when using python list, dictionary, and set type in.
I got an AttributeError when mocking the open method in python
I compared Node.js and Python in creating thumbnails using AWS Lambda
I wrote the queue in Python
I calculated "Levenshtein distance" using Python
I wrote the stack in Python
How to write the correct shebang in Perl, Python and Ruby scripts
I set the environment variable with Docker and displayed it in Python
The file name was bad in Python and I was addicted to import
Notes using cChardet and python3-chardet in Python 3.3.1.
Try using the Wunderlist API in Python
Try using the Kraken API in Python
Tweet using the Twitter API in Python
I tried using Bayesian Optimization in Python
CheckIO (Python)> The Most Wanted Letter I tried> Searching for the most frequent alphabets> Using Counter in collections, using filter () and sorted (), method by keying max (), string.ascii_lowercase
I compared the speed of regular expressions in Ruby, Python, and Perl (2013 version)
Note that I understand the least squares algorithm. And I wrote it in Python.
I want to get the file name, line number, and function name in Python 3.4
vprof --I tried using the profiler for Python
I tried web scraping using python and selenium
I tried object detection using Python and OpenCV
The answer of "1/2" is different between python2 and 3
Learn the design pattern "Template Method" in Python
Try using the BitFlyer Ligntning API in Python
I tried simulating the "birthday paradox" in Python
I wrote a class in Python3 and Java
To dynamically replace the next method in python
Learn the design pattern "Factory Method" in Python
About the difference between "==" and "is" in python
Correct half-width and full-width notation fluctuations in Python
Pre-process the index in Python using Solr's ScriptUpdateProcessor
I implemented the inverse gamma function in python
Try implementing the Monte Carlo method in Python
Try using ChatWork API and Qiita API in Python
I want to display the progress in Python!
Try using the DropBox Core API in Python
Object tracking using OpenCV3 and Python3 (tracking feature points specified by the mouse using the Lucas-Kanade method)
Approximate a Bezier curve through a specified point using the least squares method in Python
I compared the speed of the reference of the python in list and the reference of the dictionary comprehension made from the in list.
I want to replace the variables in the python template file and mass-produce it in another file.
In Python3.8 and later, the inverse mod can be calculated with the built-in function pow.
I tried the python version of "Consideration of Conner Davis's answer" Printing numbers from 1 to 100 without using loops, recursion, and goto "