Dump, restore, and query Python class instances using MongoDB

Introduction

Since I started writing a lot of analysis scripts in Python, I have often wanted to save the state of a training run at an arbitrary point so that it can be restored later. Once the code grows past a certain size, I end up writing a class that bundles the data set, the parameters, and the operations needed for training, and I run everything per instance of that class. At that point I always wonder whether I can simply dump the whole instance. Object serialization packages such as pickle meet this need: any object can be dumped and loaded.

However, this gradually becomes unsatisfactory.

The wish list keeps growing. At that point standard serializers such as pickle are no longer enough, and you want to manage the instances in a database. However, a mechanism that makes this easy is hard to find.

Dump and restore alone you can write yourself, but once searching and sorting are involved, the persistence code grows larger than the analysis code it supports, which defeats the purpose.

I once wanted to build a utility for this problem, but I hesitated when I realized it would require a fairly large framework with several implementation issues to work through.

All things considered, it went far beyond a quick project (something implemented in a little over a day and released within about a week), so although I wanted to do it, I could not afford the time.

Recently, however, while studying MongoDB properly, I noticed that combining it with a Python package solves the problems above.

Once you identify the feature you need, it usually turns out that someone already provides it. From there, a lightweight wrapper seemed to be all that was needed, so I gave it a try. (I am slow to get moving unless things have come that far.)

Packaging

The implementation is called dbarchive, and its source code is available on GitHub.

A setup.py is included, so the package can be installed.

Design

What design do people who use analysis classes want? They want to think about dump / restore as little as possible. Ideally, everything would be handled just by inheriting from a single superclass. For query search, I wanted to follow an existing mechanism rather than invent my own (which also saves me from writing documentation), and I set the specifications along those lines.

The result is a package that can be used as follows.

How to use

Below is the sample code using dbarchive.

import numpy
import logging
from datetime import datetime
from dbarchive import Base

class Sample(Base):
    # maxval has a default so that Sample() can be called without arguments
    def __init__(self, maxval=10):
        self.base = "hoge"
        self.bin = numpy.arange(maxval)
        self.created = datetime.now()

# create two instances and save them to the database
print('create sample instance')
sample01 = Sample(10)
sample01.save()
sample02 = Sample(3)
sample02.save()

# fetch every saved instance and display its attributes
for sample in Sample.objects.all():
    print('sample: {}'.format(type(sample)))
    print('\tbase: {}'.format(sample.base))
    print('\tbin: {}'.format(sample.bin))
    print('\tcreated: {}'.format(sample.created))

# change an attribute and save again
sample01.bin = numpy.arange(20)
sample01.save()

for sample in Sample.objects.all():
    print('sample: {}'.format(type(sample)))
    print('\tbase: {}'.format(sample.base))
    print('\tbin: {}'.format(sample.bin))
    print('\tcreated: {}'.format(sample.created))

print("all task completed")

Let's walk through the source in detail. First, make the class you want to manage in the database inherit from dbarchive.Base.

class Sample(Base):
    def __init__(self, maxval=10):
        self.base = "hoge"
        self.bin = numpy.arange(maxval)
        self.created = datetime.now()

Inheriting from dbarchive.Base gives the class the utilities it needs to be saved to the database. After that, all you have to do is call save on the instance.

Note that the __init__ method of a class that inherits from Base must be callable without arguments. Instances are created automatically when they are loaded from the database, and that requires instantiating the class with no arguments, so this restriction is unavoidable.
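To illustrate why the no-argument constraint matters, here is a minimal sketch of a loader that first builds a blank instance and then overwrites its attributes. The restore function and the plain Sample class are hypothetical stand-ins written for this example; they are not dbarchive's actual implementation, and the stored record uses made-up values.

```python
from datetime import datetime


class Sample(object):
    """Stand-in for a persisted class; it does not inherit dbarchive.Base."""

    def __init__(self, maxval=10):
        # Every parameter has a default, so Sample() works without arguments.
        self.base = "hoge"
        self.bin = list(range(maxval))
        self.created = datetime.now()


def restore(cls, document):
    """Hypothetical loader: build a blank instance, then overwrite its attributes."""
    instance = cls()  # raises TypeError if __init__ has required arguments
    for name, value in document.items():
        setattr(instance, name, value)
    return instance


# A stored record as it might come back from the database (illustrative values).
stored = {"base": "hoge", "bin": [0, 1, 2], "created": datetime(2015, 1, 1)}
sample = restore(Sample, stored)
print(sample.bin)  # [0, 1, 2]
```

If __init__ had a required parameter, the cls() call in this sketch would fail before any stored value could be applied, which is the situation the restriction avoids.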

print('create sample instance')
sample01 = Sample(10)
sample01.save()
sample02 = Sample(3)
sample02.save()

When save is called, the class creates a collection (MongoDB's counterpart to a table) whose name ends in _table and stores the instance's values there.

Searching is done through a handler called objects that the class provides. Queries issued through objects basically follow Django conventions, so if you are used to Django you can use it without friction.
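As a rough picture of what storing an instance's values can involve, the sketch below flattens an instance's attributes into a dictionary and pickles anything that is not a simple value. The to_document helper, the NATIVE_TYPES tuple, and the "_pickled" key are all assumptions made for this example, and array.array stands in for a numpy array so that only the standard library is needed. How dbarchive actually lays out its documents is not shown here.

```python
import pickle
from array import array
from datetime import datetime

# Types assumed to be storable as-is; anything else is pickled to bytes.
NATIVE_TYPES = (type(None), bool, int, float, str, bytes, datetime)


def to_document(instance):
    """Hypothetical helper: turn an instance's attributes into a flat dict."""
    document = {}
    for name, value in vars(instance).items():
        if isinstance(value, NATIVE_TYPES):
            document[name] = value
        else:
            # Opaque values (arrays, models, ...) travel as a binary blob.
            document[name] = {"_pickled": pickle.dumps(value)}
    return document


class Sample(object):
    def __init__(self, maxval=10):
        self.base = "hoge"
        self.bin = array("i", range(maxval))  # stand-in for a numpy array
        self.created = datetime(2015, 1, 1)


document = to_document(Sample(3))
print(sorted(document))  # ['base', 'bin', 'created']
print(pickle.loads(document["bin"]["_pickled"]))  # array('i', [0, 1, 2])
```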

for sample in Sample.objects.all():
    print('sample: {}'.format(type(sample)))
    print('\tbase: {}'.format(sample.base))
    print('\tbin: {}'.format(sample.bin))
    print('\tcreated: {}'.format(sample.created))

The code above fetches and displays every instance saved so far. For details on building query sets with the objects handler, refer to the documentation of the query interface it follows.

To confirm that it can handle large binaries, I also prepared sample code that applies it to deep learning with Chainer.

See the dbarchive README for more details on other uses.
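To give a feel for what a Django-style lookup means on top of MongoDB, the sketch below translates keyword lookups such as created__gte into a MongoDB filter document. The to_mongo_filter function and its OPERATORS table are hypothetical and cover only a handful of lookups; this is not how dbarchive or its underlying package translates queries, only an illustration of the idea.

```python
from datetime import datetime

# Django-style lookup suffixes mapped to MongoDB comparison operators.
OPERATORS = {
    "exact": "$eq",
    "gt": "$gt",
    "gte": "$gte",
    "lt": "$lt",
    "lte": "$lte",
    "in": "$in",
}


def to_mongo_filter(**lookups):
    """Hypothetical translator from keyword lookups to a MongoDB filter dict."""
    query = {}
    for key, value in lookups.items():
        # "created__gte" -> field "created", suffix "gte"; "base" -> suffix "".
        field, _, suffix = key.partition("__")
        operator = OPERATORS.get(suffix or "exact")
        if operator is None:
            raise ValueError("unsupported lookup: {}".format(key))
        # Several lookups on one field are merged into a single operator dict.
        query.setdefault(field, {})[operator] = value
    return query


print(to_mongo_filter(base="hoge", created__gte=datetime(2015, 1, 1)))
```

A plain keyword becomes an equality match, and a double-underscore suffix selects the comparison operator, which is the convention Django users will recognize.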

Interface

The following tools are useful for checking the values saved in MongoDB.

MongoHub is well suited to personal use and works much like existing database client tools. If several people need to check the data through a web interface, use mongo-express.
