[PYTHON] Work memorandum (pymongo) Part 3. I don't want to look it up again, so make a note of it (aggregate)

About this article

This article is a continuation of the article I posted the other day.

  1. Work memorandum (pymongo) Part 1. Basic operation
  2. Work memorandum (pymongo) Part 2. Convenient operation (bulk_write)

This time, I personally collected BitCoint price fluctuation data with a certain Rest API and it became too heavy, so I will write about when downsampling and reducing the number of documents.

What is aggregate

It is an aggregation process. In pymongo, aggregation processing is executed by passing a pipeline (conditional statement) to the aggregate function. The pileline is composed of stage and operator, each stage corresponds to "select", "group by", "where" etc. for SQL, and operator corresponds to "sum", "max", "min" etc. ..

Reference: Aggregate Pipeline (stage) Reference: Aggregate Pileline (operator)


Background

As a result of putting the acquisition result of Rest API into the collection as it is, I put a lot of data like the following in vain (51240). (If I left it for several days at 10-minute intervals, it would have accumulated before I knew it ...) Since it is annoying, I downsampled it to reduce the number of data.

Documents stored in the collection


client = MongoClient()
db = client["BitCoin"]["document"]
pprint(db.count()) #A function to get the number of documents in a collection
pprint(db.find_one())

"""
#Output result
51240
{'_id': ObjectId('5f328ad85ae5ac59aee515cb'),
 'best_ask': 1245419.0,
 'best_ask_size': 0.02,
 'best_bid': 1244658.0,
 'best_bid_size': 0.05,
 'ltp': 1245615.0,
 'product_code': 'BTC_JPY',
 'tick_id': 10956004,
 'timestamp': 1597115465.0,
 'total_ask_depth': 1364.44898005,
 'total_bid_depth': 1637.4300907,
 'volume': 126756.67774321,
 'volume_by_product': 6571.45287901
}
"""

The graph looks like this ... There are too many points and it's really annoying test.jpg


Pipeline settings for aggregate

For the time being, we grouped the data at 10-minute intervals into daily data, averaged each value, and downsampled.

Below is the aggregate pipeline used in pymongo.

pipeline


coin = "BTC_JPY"
interval = 60*60*24 # 24hour 
pipeline = [
    # match stage 
    {"$match": {"product_code": coin} },
    # group stage
    {"$group": {
        "_id":
        {"timestamp":
            {"$subtract":  ["$timestamp", { "$mod": ["$timestamp", interval]}]
             }
         ,
        "product_code": "$product_code"
        },
        "timestamp":{"$avg": "$timestamp"},
        "ltp": {"$avg": "$ltp"},
        "best_ask": {"$avg": "$best_ask"},
        "best_ask_size": {"$avg": "$best_ask_size"},
        "best_bid_size": {"$avg": "$best_bid_size"},
        "total_ask_depth": {"$avg": "$total_ask_depth"},
        "total_bid_depth": {"$avg": "$total_bid_depth"},
        "volume": {"$avg": "$volume"},
        "volume_by_product": {"$avg": "$volume_by_product"},
    }},
    # presentation stage
    {"$project": {
        "product_code": "$_id.product_code",
        "_id": 0, "timestamp": 1,"ltp": 1,
        "best_ask": 1,"best_ask_size":   1,
        "best_bid_size": 1,
        "total_ask_depth": 1,
        "total_bid_depth": 1,
        "volume": 1, "volume_by_product": 1,
        }
    }
]

I will explain the pipeline.

  1. Get the target to be aggregated ($ match) This time, I got the ones that match product_code. (You can specify it in the same way as find.)

{"$match": {"product_code": coin} }, ```

  1. Grouping ($ group) The product_code and timestamp were grouped on unix time so that they match at 1-day intervals, and the other values were averaged. The following two points can be mentioned.

  2. Set the target to be grouped in _id

  3. After _id, specify the key and calculation method (operator) you want to get, such as the average and maximum value.

    {"$group": {
        "_id": #Set the target to be grouped here
        {"timestamp":
            {"$subtract":  
                ["$timestamp", 
                    { "$mod": ["$timestamp", interval]}]
             }
         ,
        "product_code": "$product_code"
        },
        "timestamp":{"$avg": "$timestamp"},
        "ltp": {"$avg": "$ltp"},
    
    
  4. Specify the data to display ($ project) (You can do it with the same operation as project in find)

 {"$project": {
     "product_code": "$_id.product_code",
     "_id": 0, "timestamp": 1,"ltp": 1,
     "best_ask": 1,"best_ask_size":   1,
     "best_bid_size": 1,
     "total_ask_depth": 1,
     "total_bid_depth": 1,
     "volume": 1, "volume_by_product": 1,
     }
 }

Downsampling results

I compared the data downsampled by the previous pipeline with the original data. The red dot is before downsampling and the blue is after downsampling. You can see that the data is thinned out nicely.

import matplotlib.pyplot as plt

plt.figure()
for i in db.find( filter= {"product_code": coin
                          } ):
    plt.scatter(i["timestamp"], i["ltp"], marker=".", color="r")
for i in  db.aggregate(pipeline=pipeline):
    plt.scatter(i["timestamp"], i["ltp"], marker=".", color="b")
plt.grid()
plt.xlabel("Data[unixtime]")
plt.ylabel(coin)
plt.savefig("test2.jpg ")
plt.show()

test2.jpg


Well, there are still many things about aggregate, but there are too many, so this time I will stop here. I will add corrections, questions, and anything I want you to write.

Recommended Posts

Work memorandum (pymongo) Part 3. I don't want to look it up again, so make a note of it (aggregate)
I set up TensowFlow and was addicted to it, so make a note
I wanted to use the find module of Ansible2, but it took some time, so make a note
[Python] I tried to implement stable sorting, so make a note
[Fabric] I was addicted to using boolean as an argument, so make a note of the countermeasures.
I don't want to admit it ... The dynamical representation of Neural Networks
I tried to make a calculator with Tkinter so I will write it
I want to color a part of an Excel string in Python
[Hi Py (Part 1)] I want to make something for the time being, so first set a goal.
A note I looked up to make a command line tool in Python
I was addicted to trying Cython with PyCharm, so make a note
Suddenly I needed to work on a project using Python and Pyramid, so a note of how I'm studying
I want to make a game with Python
I don't want to take a coding test
I want to collect a lot of images, so I tried using "google image download"
A Python beginner made a chat bot, so I tried to summarize how to make it.
I stumbled upon using MoviePy, so make a note
I want to work with a robot in python.
I want to install a package of Php Redis
[Python] I want to make a nested list a tuple
I don't like to be frustrated with the release of Pokemon Go, so I made a script to detect the release and tweet it
[Python] I want to make a 3D scatter plot of the epicenter with Cartopy + Matplotlib!
I made a function to crop the image of python openCV, so please use it.
There was a doppelganger, so I tried to distinguish it with artificial intelligence (laughs) (Part 1)
I want to make a blog editor with django admin
I want to start a lot of processes from python
I want to make a click macro with pyautogui (desire)
NikuGan ~ I want to see a lot of delicious meat! !!
I want to make a click macro with pyautogui (outlook)
I stumbled when I tried to install Basemap, so a memorandum
I want to know the legend of the IT technology world
I want to make input () a nice complement in python
I tried to make a site that makes it easy to see the update information of Azure
I tried to find out the difference between A + = B and A = A + B in Python, so make a note
[Twitter] I want to make the downloaded past tweets (of my account) into a beautiful CSV
I want to clear up the question of the "__init__" method and the "self" argument of a Python class.
Don't write Python if you want to speed it up with Python
A memorandum when I tried to get it automatically with selenium
I tried to make a regular expression of "amount" using Python
I tried to make a regular expression of "time" using Python
Comparison of GCP computing services [I want to use it serverless]
Scraping and tabelog ~ I want to find a good restaurant! ~ (Work)
[Introduction] I want to make a Mastodon Bot with Python! 【Beginners】
I want to create a pipfile and reflect it in docker
I want to make a parameter list from CloudFormation code (yaml)
I tried to make a mechanism of exclusive control with Go
I got stuck trying to install various things with Mac El captain pip, so make a note