[PYTHON] Try accessing AWS Redshift data using Oracle Cloud Infrastructure Data Science

The aim of this article

In February 2020, Oracle Cloud Infrastructure Data Science (OCI-Data Science) was released. Last time, I tried to query the data of the file on AWS S3 from OCI-Data Science (Qiita article: Try to access the file data of OCI Object Storage and AWS S3 using Oracle Cloud Infrastructure Data Science), In this article, I would like to carry out the procedure to access and acquire data on AWS Redshift using Python from OCI-Data Science. ..

References

Links that will be helpful for implementation

-Let's use Oracle Cloud Infrastructure Data Science (OCI-Data Science) -OCI-Data Science Official Document -Oracle Accelerated Data Science SDK (ADS) Official Documentation -Qiita article: I tried Redshift for the first time! -What can Amazon Redshift do? Explaining AWS Data Warehouse Services -Read data from Redshift and put it in pandas data frame -Qiita article: Connect to ubuntu16.04@AWS postgresql from remote python3 (psycopg2)

procedure

Follow the procedure below.

  1. OCI-Data Science settings
  2. Start Redshift, create table, register data 2-1. Create and start a Redshift cluster 2-2. Create a table in Redshift, register data in the table from a file in S3
  3. Access Redshift tables from OCI-Data Science

1. OCI-Data Science settings

Build a notebook environment for OCI-Data Science and make initial settings. Do the following with reference to "Let's use Oracle Cloud Infrastructure Data Science (OCI-Data Science)" in the above reference.

-Build a notebook environment after the basic settings of Oracle Cloud. -Use ** getting-started.ipynb ** to perform the initial work of the notebook environment (JupyerLab).

2. Start Redshift, create a table, and register data

2-1. Create and start a Redshift cluster

I created a Redshift cluster by referring to the above reference "Qiita article: I tried Redshift for the first time!".

2-2. Create a table in Redshift, register data in the table from a file in S3

This time, refer to "Trying to use Amazon Redshift" in "What can Amazon Redshift do? Explaining AWS data warehouse services" in the above reference, ** Creating an IAM role **, ** I registered the data by following the steps of Creating a table on Redshift **, ** Registering data from the csv file on S3 to the table on Redshift **.

This time, the following table is created and sample data is registered.

CREATE TABLE PURCHASE_ITEM (
CUST_ID integer,
AGE integer,
MARRIED VARCHAR(4000),
ADDRESS VARCHAR(4000),
CHILD VARCHAR(4000),
OCCUPATION VARCHAR(4000),
LASTCONTACT VARCHAR(4000),
LASTCALL integer,
CONTACT integer,
CONTACT_BEFORE_CAMPAIGN integer,
Purchased VARCHAR(4000)
);

3. Access Redshift tables from OCI-Data Science

Now, let's actually access the Redshift table from OCI-Data Science. This time, we will use sqlalchemy-redshift as described in the above reference" Reading data from Redshift and putting it in the pandas data frame ".

First, install the module ** sqlalchemy-redshift **.

pip install sqlalchemy-redshift

Then install the required libraries.

import redshift_sqlalchemy
from sqlalchemy import create_engine

Connect to Redshift.

engine = create_engine('{dialect}+{driver}://{user}:{pwd}@{url}:{port}/{db}'.format(
    dialect = 'redshift',
    driver = 'psycopg2',
    user='awsuser', #Redshift username
    pwd ='XXXXXX', #password
    url='redshift-cluster-1.XXX.XXX.redshift.amazonaws.com', #Redshift cluster screen endpoint
    port=5439, #Redshift port number
    db='dev' #Redshift database name
))

Here, the following ** Connection timed out ** error occurred. image.png

After investigating, it looks the same as the above reference material "Connecting to ubuntu16.04@AWS postgresql from remote python3 (psycopg2)", so set the ** VNC security group inbound rule ** of Redshift as follows. To do. image.png

After setting the security group, I was able to connect normally.

Read the data on Redshift and put it in a data frame.

import pandas as pd
redshift_data = pd.read_sql_query('SELECT * FROM PURCHASE_ITEM limit 100;', engine)

Let's check the result.

redshift_data.head()

image.png

Certainly, I was able to get it safely.

At the end

This time, I tried to access the data on AWS Redshift from OCI-Data Science using ** sqlalchemy-redshift **.

There may be a better connection method than this, so please give it a try. (And tell me ...)

OCI-Data Science is a target service of Oracle Free Trial that allows you to use free credit for up to $ 300 for up to 30 days. Therefore, we recommend that you try it out.

Let's Enjoy Data Science!

Recommended Posts

Try accessing AWS Redshift data using Oracle Cloud Infrastructure Data Science
Try "100 knocks on data science" ①
Try using AWS SageMaker Studio
Start data science on the cloud
Try using scanpy's data integration function (sc.tl.ingest)
Try using Python with Google Cloud Functions