[PYTHON] Prepare a high-speed analysis environment by hitting mysql from the data analysis environment

I will write about how to connect to rds from a data analysis tool such as jupyter and get a data frame.

motivation

When you pull data from a database and want to analyze it

Create the result set you want to type SQL
Drop CSV on result set
Put CSV in the analysis environment
Read CSV and create data frame
With the data frame at hand, ready to go

I think that it often follows the process.

However

――If you want to try various data frames --When the desired data frame changes depending on the situation ――If you always want fresh data

Under such circumstances, the previous process will be repeated many times. As a result, workers may repeatedly jump between the database and the analytical environment, or incur some communication costs between the engineer who prepares the data and the data scientist who analyzes it.

Therefore, I thought that the cost could be reduced by connecting the analysis environment to the database as it is.

What happened

I often use Python for data analysis, but this time I tried to connect to the database using Python's MySQL client mysqlclient. (Mysqlclient is the recommended driver for Python's web framework Django)

Without a bastion server

import MySQLdb

connect = MySQLdb.connect(
    host='rds endpoint',
    user='username',
    db='Database name',
    passwd='password'
)

cursor = connect.cursor()
sql = "Write a sql statement such as select as a string"
cursor.execute(sql)
for row in cursor.fetchall():
    print(row)
cursor.close()
connect.close()

The result will be returned as a tuple. Be sure to add limit to the sql specification because the process will not end if the table is large.

If there is a bastion server

If you have a bastion server, use the sshtunnel library to dig an ssh tunnel with mysqlclient.

import MySQLdb
from sshtunnel import SSHTunnelForwarder

with SSHTunnelForwarder(
    ('IP address of bastion server',Step server port number),
    ssh_host_key=None,
    ssh_username='Username to access the bastion server',
    ssh_password=None,
    ssh_pkey='ssh Specify the private key path',
    remote_bind_address=('rds endpoint',rds port number),
    local_bind_address=('local host',Arbitrary port number)
) as tunnel:
    connect = MySQLdb.connect(
        host='local host',
        port=Local on_bind_Arbitrary port number determined by address,
        user='username',
        db='Database name',
        passwd='password'
    )
    cursor = connect.cursor()
    sql = "Write a sql statement such as select as a string"
    cursor.execute(sql)
    for row in cursor.fetchall():
        print(row)
    cursor.close()
    connect.close()

The result is also returned as a tuple.

Summary

――I want to try various data frames ――The data frame you want changes depending on the situation ――I always want fresh data

We were able to meet these demands and provide an environment in which analysts can quickly obtain the data they want.

In addition, since variables can be specified on the python side, there was a secondary merit of being able to flexibly deal with it, such as being able to expand expressions into character strings that specify sql.