[PYTHON] [Development environment] How to create a data set close to the production DB

Overview

When creating a Web service, do you want data that is close to the database in the production environment even in your development environment? It will be. In your local environment, you may have forgotten to apply the patch to maintain the integrity of the data executed in the production DB, etc., so the data at hand may not be in the expected form. In addition, it is one way to prepare seed data, but maintenance of seed data is unexpectedly costly, and the characteristics of the data are different from the production data, so if you release it, there will be performance problems. Things also happen.

So, I tried to partially extract the data from the production DB and use it locally, so I will write a summary of what I did with the script. The script is written in Python, so I'm sorry if you can't read Python ...

** As a super-major premise, I think that there are many legally dangerous services if you bring the production data locally without processing anything, so be sure to put in appropriate masking processing. The masking method also depends on the characteristics of the data and will not be discussed in detail in this article. ** **

Overall flow

  1. Create a table that is a subset of the production DB table
  2. Create a dump of the subset table
  3. Import the dump locally

Caution) Although it is written as a production DB, of course, please do it in a place that does not affect the DB server that is running in production. Imagine working on another server that you restored from a snapshot. If you do it in the production environment, please make an entry in next year's Advent Calendar for people who did it in the production environment!

1. Create a table that is a subset of the production DB table

This is the part I think most about. I will explain with an example for easy understanding. Suppose you have three tables, books, users, and favs, and you have a Foreign Key from favs to books and users. Creating a subset table is the data you want to use locally when there are books: 1 million, users: 1 million, favs: 5 million records in the production DB books: 10,000, users: 10,000 , favs: Refers to creating a table with 50,000. For convenience, we'll call the subset tables books_small, users_small, and favs_small, respectively.

CREATE TABLE books_small SELECT * FROM books LIMIT 10000;
CREATE TABLE users_small SELECT * FROM users LIMIT 10000;
CREATE TABLE favs_small SELECT * FROM favs LIMIT 50000;

However, this alone is not successful because favs_small has a Foreign Key constraint on books_small and users_small. (Because favs_small may contain users that are not included in users_small but are included in users)

Furthermore, in this example, since the table dependency is simple, it is clear that if you execute favs after book, users, the dependency has already been created and it does not matter, but the resolution of the dependency between tables is 100 or more. It is not realistic for humans to do to a table.

In other words, there are the following two issues, and we will create a subset table to satisfy these two issues. i. I want to create a subset from a Foreign Key dependency when creating a ** table ** ii. I want to keep the Foreign Key constraint when ** putting data into a subset table **

i. I want to create from a Foreign Key dependency when creating a subset table

To achieve this, we need to create a table from the dependent side. Let's define a function of get_table_list () that returns the order in which the tables should be created.

from typing import List

import MySQLdb.cursors

global db
db = MySQLdb.connect()

def get_table_list() -> List[str]:
    """
Return table names in the order of create table and insert while paying attention to data dependency
    """
    global db

    def _get_list_of_referring_tables(table_name) -> List[str]:
        """
        `show create table`To get a list of table names with Foreign Keys
Recursively process the dependents of the dependent table to see them
        """
        tables = [table_name]
        cur.execute(f'show create table {table_name}')
        statement = cur.fetchone()['Create Table']
        references = re.findall(r'REFERENCES `(\w+)`', statement)
        if references:
            for referring_table in references:
                tables = _get_list_of_referring_tables(referring_table) + tables  #Before being dependent
        return tables

    # `show tables`The table list obtained by_Put it in the list.(I don't care about dependencies)
    cur = db.cursor()
    cur.execute("show tables")
    rows = cur.fetchall()
    table_list = []
    for row in rows:
        table_list.append(row['Tables_in_[database]'])

    #A list of tables that are meaningful in order so that the dependent tables always come first(Duplicate table name allowed)
    table_list_order_by_referred = []  
    for table_name in table_list:
        table_list_order_by_referred += _get_list_of_referring_tables(table_name)

    # table_list_order_by_Remove duplicates because referred contains duplicate table names
    #By erasing the duplicates in order from the front, the dependent ones come first
    unique_table_list_order_by_referred = []
    for table_name in table_list_order_by_referred:
        if table_name not in unique_table_list_order_by_referred:
            unique_table_list_order_by_referred.append(table_name)
    return unique_table_list_order_by_referred

Now in the table order obtained by get_table_list ()

CREATE TABLE books_small SELECT * FROM books LIMIT 10000;

Dependencies between tables are resolved if you do something like that.

ii. I want to keep the Foreign Key constraint when putting data in a subset table

Next is how to resolve the dependency when entering data. I wrote it earlier, but without thinking about anything

CREATE TABLE books_small SELECT * FROM books LIMIT 10000;
CREATE TABLE users_small SELECT * FROM users LIMIT 10000;
CREATE TABLE favs_small SELECT * FROM favs LIMIT 50000;

Then you will get angry with the error Cannot add or update a child row: a foreign key constraint fails. I want favs_small to contain only the books and users in books_small and users_small.

I think there are two options below.

CREATE TABLE favs_small 
SELECT * 
FROM favs 
WHERE book_id IN (SELECT id FROM books_small) 
AND user_id IN (SELECT id FROM users_small)
LIMIT 50000;
SET FOREIGN_KEY_CHECKS = 0
CREATE TABLE favs_small SELECT * FROM favs LIMIT 50000;
DELETE FROM favs_small WHERE book_id NOT IN (SELECT id FROM books_small);
DELETE FROM favs_small WHERE user_id NOT IN (SELECT id FROM users_small);
SET FOREIGN_KEY_CHECKS = 1

Either one is fine, but I felt that the cost of assembling the SQL statement was lower for the latter, so I chose that approach this time.

By the way

DELETE FROM favs_small WHERE book_id NOT IN (SELECT id FROM books_small);

At least when executing with MySQL, it takes a lot of time because it may be a poor construction of the execution plan of DELETE ... NOT IN ...

SELECT id FROM favs_small WHERE book_id NOT IN (SELECT id FROM books_small);
DELETE FROM favs_small WHERE id IN ([List of ids obtained above]);

I'm glad if you break it down into two queries and execute it.

So, if you realize it with Python, the code will look like this.


#Define the upper limit of the number of records in each table like this
TABLE_RECORD_LIMIT = {
  'users': 10000,
  'books': 10000,
  'favs': 50000,
}


def create_small_table():
    """
    [table_name]_Create a table called small and put the data to be dumped in it.
    """
    global db

    table_list = get_table_list()
    cur = db.cursor()
    for table_name in table_list:
        small_table_name = get_small_table_name(table_name)
        cur.execute(f'SHOW CREATE TABLE {table_name}')
        table_meta_data = cur.fetchone()['Create Table']
        # `table_name`Get a list of table names that
        references = re.findall(r'REFERENCES `(\w+)`', table_meta_data)

        limit_statement = ''
        if table_name in TABLE_RECORD_LIMIT:
            limit_statement = f'LIMIT {TABLE_RECORD_LIMIT[table_name]}'

        cur.execute('SET FOREIGN_KEY_CHECKS = 0')
        cur.execute(f'CREATE TABLE {small_table_name} SELECT * FROM {table_name} {limit_statement}')
        for parent_table in references:
            small_parent_table = get_small_table_name(parent_table)
            reference_column_name = get_reference_column_name(table_meta_data, parent_table)
            cur.execute(f"""
            SELECT id 
            FROM {small_table_name} 
            WHERE {reference_column_name} NOT IN (SELECT id FROM {small_parent_table})
            """)
            delete_id_list = ','.join([str(row['id']) for row in cur.fetchall()])
            if delete_id_list:
                cur.execute(f'DELETE FROM {small_table_name} WHERE id IN ({delete_id_list})')
        cur.execute('SET FOREIGN_KEY_CHECKS = 1')


def get_small_table_name(original_table_name):
    """
You can implement it as you like
Note that returning something longer than the original table name may violate the maximum table name length.
    """
    return original_table_name + '_small'
    # return original_table_name[:-2] + '_s'  #I implemented it here


def get_reference_column_name(table_meta_data, referring_table_name):
    """
    `SHOW CREATE TABLE`Table metadata obtained in(table_meta_data)From
Referenced table(referring_table_name)Get the column name pointing to
    """
    return re.findall(r'\(`(\w+)`\) REFERENCES `' + referring_table_name, table_meta_data)[0]

As a caveat, the upper limit of the number of favs that can be acquired is defined as 50000 at the very beginning, but since the record of constraint violation is deleted after fetching 50,000, the actual number that remains is 50,000 or less. .. If you want to get exactly 50,000 cases, you can do it by taking the first method out of the two options explained earlier.

2. Create a dump of the subset table

Now that you have a subset table with a consistent Foreign Key, all you have to do is dump the table without thinking. If you want to use mysqldump, grep the result of show tables with _small which is a subset table post-fix.

$ mysqldump -u user -p [database] books_small users_small favs_small hoge_small .... > hoge.dump

All you have to do is assemble a command.

If you want to write the dump by yourself, do your best, you can write like this. I think it would be convenient to put in the masking process here.

from functools import lru_cache

def create_small_db_dump():
    global db

    cur = db.cursor()
    table_list = get_table_list()
    BATCH_SIZE = 30000
    for table_name in table_list:
        small_table_name = get_small_table_name(table_name)
        offset = 0
        while True:
            cur.execute(f'SELECT * FROM {small_table_name} LIMIT {BATCH_SIZE} OFFSET {offset}')
            rows = cur.fetchall()
            if not rows:
                break
            create_insert_statement(table_name, rows)
            offset += batch_size


def create_insert_statement(table_name, rows):
    """
    :param table_name:Table name to insert
    :param rows:Select table*Result array
    :return:
    """
    global output_file

    statement = f'INSERT INTO {table_name} VALUES '
    for i, row in enumerate(rows):
        value_list = row.values()
        tmp = '('
        for value in value_list:
            tmp += convert_to_str(table_name, i, value)
            tmp += ','
        tmp = tmp[:-1] + '),'
        statement += tmp
    statement = statement[:-1] + ';'
    output_file.write(f'{statement}\n\n')


#How to mask the Nth column of which table
#If you want to do something a little complicated, you can use Lambda function or something.
MASKING_TARGET = {
    'users': {2: '***'},
}
def convert_to_str(table_name, i, value):
    """
Escape to an importable form
Masking processing is also done here
    """
    if table_name in MASKING_TARGET:
        if i in MASKING_TARGET[table_name]:
            return MASKING_TARGET[table_name][i]
    elif isinstance(value, str):
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        return f"'{escaped}'"
    elif isinstance(value, int):
        return str(value)
    elif isinstance(value, float):
        return str(value)
    elif isinstance(value, datetime.datetime):
        return f"'{str(value)}'"
    elif isinstance(value, datetime.date):
        return f"'{str(value)}'"
    elif value is None:
        return 'null'
    #Add patterns as needed
    else:
        raise Exception(f'Value Error. data: {value}, data class: {value._class_}')

# create_small_db_dump()At the stage of calling the original table_There are a mixture of small tables,
#I only want the information from the original table, so
# _get before making a small table_table_list()It is good to have a cache on the assumption that
# (Get if it's too implicit and scary_table_list()In_Please write the process of playing a small table)
@lru_cache(maxsize=1)
def get_table_list() -> List[str]:
    #The process written above

After that, you need to drop the existing DB or create table, so let's write it quickly. If you can follow up to this point, you should be able to write immediately.

def build_drop_and_create_database_statement(database_name):
    global output_file

    output_file.write(f'DROP DATABASE IF EXISTS {database_name};\n')
    output_file.write(f'CREATE DATABASE IF NOT EXISTS {database_name};\n')
    output_file.write(f'USE {database_name};\n\n')


def build_create_table_statement():
    global db

    table_list = get_table_list()
    cur = db.cursor()
    for table_name in table_list:
        cur.execute(f'show create table {table_name}')
        statement = cur.fetchone()['Create Table']
        output_file.write(f'{statement};\n\n')

3. Import the dump locally

If you come here, just bring the dump file locally and import it.

$ mysql -u user -p database < hoge.dump

Thank you for your hard work.

Impressions

At first, it was not a policy to create a subset table, but a policy to maintain data consistency while generating INSERT statements directly from the original table, but dump (long time)-> insert (long time) In the middle of the process, an error such as "Foreign Key integrity!" Or "This data has not been escaped!" Since the PDCA cycle was too long and inefficient, I switched to the strategy of first creating a consistent table of subsets with a small amount of data and dumping it honestly.

Since we are dealing with production data, there are many things to be careful about, but I hope that development efficiency will improve by being able to check the operation with data that is closer to reality in the development environment.

Recommended Posts

[Development environment] How to create a data set close to the production DB
[Note] How to create a Ruby development environment
[Note] How to create a Mac development environment
How to set up the development environment of ev3dev [Windows version]
How to create a Python virtual environment (venv)
How to set up a local development server
How to run a Django application on a Docker container (development and production environment)
Read the Python-Markdown source: How to create a parser
How to set up a Python environment using pyenv
How to create a submenu with the [Blender] plugin
How to set the development environment for each project with VSCode + Python extension + Miniconda
Build a Python environment and transfer data to the server
How to build a python2.7 series development environment with Vagrant
How to get into the python development environment with Vagrant
How to set up WSL2 on Windows 10 and create a study environment for Linux commands
How to create a Conda package
How to create a virtual bridge
How to create a Dockerfile (basic)
How to get only the data you need from a structured data set using a versatile method
How to create a config file
How to create a large amount of test data in MySQL? ??
How to create a clone from Github
How to create a git clone folder
How to create an NVIDIA Docker environment
How to create a repository from media
How to divide and process a data frame using the groupby function
[Python] What is a formal argument? How to set the initial value
Create a Python development environment locally at the fastest speed (for beginners)
How to create a wrapper that preserves the signature of the function to wrap
How to calculate the volatility of a brand
How to create sample CSV data with hypothesis
How to share a virtual environment [About requirements.txt]
Introduction to Python Let's prepare the development environment
I want to create a nice Python development environment for my new Mac
How to create a function object from a string
How to set Django DB to mongodb visual studio 2019
Randomly sample MNIST data to create a dataset
How to create a JSON file in Python
[For recording] Keras image system Part 1: How to create your own data set?
How to create data to put in CNN (Chainer)
How to create a shortcut command for LINUX
How to create a face image data set used in machine learning (1: Acquire candidate images using WebAPI service)
How to create a Kivy 1-line input box
How to create a multi-platform app with kivy
Set up a Python development environment on Marvericks
Install and set Jupyter Notebook to create a study note creation environment [Mac]
How to create a Rest Api in Django
Create a command to get the work log
Create a Python virtual development environment on Windows
[Python] How to create a local web server environment with SimpleHTTPServer and CGIHTTPServer
[Introduction to Python] How to get the index of data with a for statement
How to set up a Google Colab environment with Coursera's advanced machine learning courses
CodePicnic that can create a [Development | Run | Tutorial | Demo] environment that runs on the browser
Try to create a battle record table with matplotlib from the data of "Schedule-kun"
How to quickly create a machine learning environment using Jupyter Notebook with UbuntuServer 16.04 LTS
How to create a Python 3.6.0 environment by putting pyenv on Amazon Linux and Ubuntu
How to use Docker to containerize your application and how to use Docker Compose to run your application in a development environment
[It's not too late to learn Python from 2020] Part 2 Let's create a Python development environment
Create a shogi game record management application using Django 5 ~ Pass DB data to Template ~
A memo to create a virtual environment (venv) before Django
How to create an article from the command line