I will write about how to connect to rds from a data analysis tool such as jupyter and get a data frame.
When you pull data from a database and want to analyze it
I think that it often follows the process.
However
――If you want to try various data frames --When the desired data frame changes depending on the situation ――If you always want fresh data
Under such circumstances, the previous process will be repeated many times. As a result, workers may repeatedly jump between the database and the analytical environment, or incur some communication costs between the engineer who prepares the data and the data scientist who analyzes it.
Therefore, I thought that the cost could be reduced by connecting the analysis environment to the database as it is.
I often use Python for data analysis, but this time I tried to connect to the database using Python's MySQL client mysqlclient. (Mysqlclient is the recommended driver for Python's web framework Django)
import MySQLdb
connect = MySQLdb.connect(
    host='rds endpoint',
    user='username',
    db='Database name',
    passwd='password'
)
cursor = connect.cursor()
sql = "Write a sql statement such as select as a string"
cursor.execute(sql)
for row in cursor.fetchall():
    print(row)
cursor.close()
connect.close()
The result will be returned as a tuple. Be sure to add limit to the sql specification because the process will not end if the table is large.
If you have a bastion server, use the sshtunnel library to dig an ssh tunnel with mysqlclient.
import MySQLdb
from sshtunnel import SSHTunnelForwarder
with SSHTunnelForwarder(
    ('IP address of bastion server',Step server port number),
    ssh_host_key=None,
    ssh_username='Username to access the bastion server',
    ssh_password=None,
    ssh_pkey='ssh Specify the private key path',
    remote_bind_address=('rds endpoint',rds port number),
    local_bind_address=('local host',Arbitrary port number)
) as tunnel:
    connect = MySQLdb.connect(
        host='local host',
        port=Local on_bind_Arbitrary port number determined by address,
        user='username',
        db='Database name',
        passwd='password'
    )
    cursor = connect.cursor()
    sql = "Write a sql statement such as select as a string"
    cursor.execute(sql)
    for row in cursor.fetchall():
        print(row)
    cursor.close()
    connect.close()
The result is also returned as a tuple.
――I want to try various data frames ――The data frame you want changes depending on the situation ――I always want fresh data
We were able to meet these demands and provide an environment in which analysts can quickly obtain the data they want.
In addition, since variables can be specified on the python side, there was a secondary merit of being able to flexibly deal with it, such as being able to expand expressions into character strings that specify sql.
Recommended Posts