[PYTHON] Settings when reading S3 files with pandas from Jupyter Notebook on AWS

Overview

If you are running Jupyter Notebook on an EC2 instance on AWS, you can read data in S3 directly by passing the S3 path to pd.read_csv(), as shown below.

import os
import pandas as pd

# Credentials must be set before the read
os.environ["AWS_ACCESS_KEY_ID"] = "XXX..."
os.environ["AWS_SECRET_ACCESS_KEY"] = "YYY..."
all_df = pd.read_csv("s3n://mybucket/path/to/dir/data.csv")

Note that AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY must be set as environment variables beforehand, as shown above.

To be honest, making these settings every time is a hassle. Many people type `import pandas as pd` purely out of habit, but few can memorize keys that are just random strings. Moreover, credentials are exactly the kind of private information you do not want to include when sharing a notebook with others, so deleting them by hand each time is both inefficient and dangerous.

Therefore, let's configure things in advance so that Jupyter Notebook can access S3 files without you having to think about it. There are several ways to do this, and I will introduce each one.

- Load the settings when Jupyter Notebook starts
- Set environment variables directly in the shell
- Write the keys in the boto profile

You only need to use one of these three methods. There are no major pros or cons among them, so pick whichever you prefer.

Methods

Method 1: Load the settings when Jupyter Notebook starts

It is assumed that Jupyter Notebook is already installed.

Save the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY settings in ~/.ipython/profile_default/startup/ with the file name 00_startup.ipy. The file name is arbitrary, but files prefixed with a number are executed in numeric order.

import os

os.environ["AWS_ACCESS_KEY_ID"] = "XXX..."
os.environ["AWS_SECRET_ACCESS_KEY"] = "YYY..."

The file layout looks like the following.

$ tree .ipython/profile_default/startup
.ipython/profile_default/startup
├── 00_startup.ipy
└── README

By doing this, the commands above are executed when Jupyter Notebook starts, so you no longer have to set `os.environ` from the notebook.
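With the startup file in place, a notebook cell can read from S3 directly. A minimal check, using the same hypothetical bucket and path as above:

import pandas as pd

# No os.environ lines are needed here; 00_startup.ipy has already set the keys.
all_df = pd.read_csv("s3n://mybucket/path/to/dir/data.csv")
all_df.head()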

Method 2: Set environment variables directly in the shell

This method sets the keys as shell environment variables. Write them in your shell's configuration file, such as .bash_profile or .zshrc, so they are loaded when the shell starts, or run them directly at the prompt.

export AWS_ACCESS_KEY_ID=XXX...
export AWS_SECRET_ACCESS_KEY=YYY...
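To confirm that a notebook launched from that shell actually inherits the variables, a quick check like the following should work:

import os

# Both should print True if the shell exported the variables
# before jupyter notebook was started.
print("AWS_ACCESS_KEY_ID" in os.environ)
print("AWS_SECRET_ACCESS_KEY" in os.environ)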

Method 3: Write the keys in the profile read by boto

Behind the scenes, pandas uses a Python AWS SDK called boto, and you can also write the keys directly in that package's profile. Save the following with the file name ~/.boto.

[Credentials]
aws_access_key_id = XXX...
aws_secret_access_key = YYY...
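To verify that boto picks up this profile, you can open an S3 connection without passing any keys. A minimal sketch, using a hypothetical bucket name:

import boto

# With no arguments, connect_s3() falls back to the [Credentials]
# section of ~/.boto.
conn = boto.connect_s3()
bucket = conn.get_bucket("mybucket")  # hypothetical bucket
print(bucket.name)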

== Addendum: 2016/10/03 == The method above is for boto2. boto3 also reads ~/.boto, but since boto3 searches for credentials in the order below, it is better to write them in ~/.aws/credentials, AWS's own credentials file, which also means the settings are shared with the awscli command.

The mechanism in which boto3 looks for credentials is to search through a list of possible locations and stop as soon as it finds credentials. The order in which Boto3 searches for credentials is:

  1. Passing credentials as parameters in the boto3.client() method
  2. Passing credentials as parameters when creating a Session object
  3. Environment variables
  4. Shared credential file (~/.aws/credentials)
  5. AWS config file (~/.aws/config)
  6. Assume Role provider
  7. Boto2 config file (/etc/boto.cfg and ~/.boto)
  8. Instance metadata service on an Amazon EC2 instance that has an IAM role configured.
For example, ~/.aws/credentials looks like the following:

[default]
aws_access_key_id=foo
aws_secret_access_key=bar
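A minimal sketch to confirm that boto3 finds these credentials without any keys in the code:

import boto3

# No keys are passed here; boto3 walks the search order above and
# stops at the [default] profile in ~/.aws/credentials.
s3 = boto3.client("s3")
print([b["Name"] for b in s3.list_buckets()["Buckets"]])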

== End of addendum ==

With this, when you specify an S3 path in pandas, boto is imported and the profile above is loaded. (source)

Note that pip install pandas alone will not install boto, since it is not included in pandas' dependencies. Run pip install boto before using this feature.

Summary

As described above, there are several ways to configure credentials when accessing S3 with pandas. All of them are easy to set up when building your environment, so be sure to configure one.

Supplement

These methods are not limited to Jupyter Notebook on AWS; you can connect to S3 from any environment as long as you have the AWS keys, so pd.read_csv("s3n://...") also works locally. Be careful, though: data sent from S3 to outside of AWS incurs transfer charges, and heavy data is hard to handle in the first place because of transfer time. If you want to run source code that assumes S3 access in a different environment, check this first. It may be wise not to set the AWS keys as the default on your local machine.

