I previously built a Spark environment on EMR and tried it out, but keeping it running all the time is costly, and it was a hassle to tear the cluster down when it was no longer needed and rebuild it when it was time to use it again. So this time I built a Spark environment on EC2. Since you can start and stop the server whenever you like, you can experiment with Spark at low cost. I also installed IPython Notebook so that I can work with Spark from there and analyze data easily.
A low-spec instance is enough, since it is only used to launch, stop, and delete the servers that run Spark. This time I used the cheapest type, t2.micro.
The spark-ec2 scripts are used via a git checkout, so first install git and fetch the necessary files.
sudo yum install -y git
git clone git://github.com/apache/spark.git -b branch-1.2
I checked out branch 1.2, which was the latest at the time (January 2015).
【reference】 Spark Lightning-fast cluster computing
Looking at spark/ec2/spark_ec2.py, it appears to read the AWS credentials from .boto if they are set there, and otherwise to fall back to environment variables such as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. This time, I put the access key and secret key in the boto configuration file.
~/.boto
[Credentials]
aws_access_key_id = <AWS_ACCESS_KEY_ID>
aws_secret_access_key = <AWS_SECRET_ACCESS_KEY>
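If you want to check that the credentials file is picked up, a quick test with boto (the library spark_ec2.py uses) looks like the following; listing the buckets is just an illustration:
import boto
# Uses the credentials from ~/.boto
conn = boto.connect_s3()
print(conn.get_all_buckets())  # lists your S3 buckets if the keys are valid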
~/spark/ec2/spark-ec2 --key-pair=<KEY_PAIR_NAME> --identity-file=<SSH_KEY_FILE_NAME> --region=ap-northeast-1 --spark-version=1.2.0 --copy-aws-credentials launch <CLUSTER_NAME>
[Remarks] It takes a few minutes for the cluster to start up; this time it took about 5 minutes. By default, two m1.large instances are launched. Adding "--copy-aws-credentials" makes the cluster inherit your AWS credential settings.
See below for detailed settings. Running Spark on EC2 http://spark.apache.org/docs/1.2.0/ec2-scripts.html
When you launch a Spark cluster, security groups are also created automatically, but in their default state some ports are open to the world, so change them if necessary.
Also, IPython Notebook uses ports 8888 and 9000, so add rules so that you can access them.
Cluster login
~/spark/ec2/spark-ec2 --key-pair=<KEY_PAIR_NAME> --identity-file=<SSH_KEY_FILE_NAME> --region=ap-northeast-1 login <CLUSTER_NAME>
Cluster stop
~/spark/ec2/spark-ec2 --region=ap-northeast-1 stop <CLUSTER_NAME>
Cluster startup
~/spark/ec2/spark-ec2 --key-pair=<KEY_PAIR_NAME> --identity-file=<SSH_KEY_FILE_NAME> --region=ap-northeast-1 start <CLUSTER_NAME>
After logging in to the master server, you can launch the PySpark shell with the following command.
/root/spark/bin/pyspark
For Scala, use spark-shell:
/root/spark/bin/spark-shell
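As a quick smoke test in the pyspark shell (sc is the SparkContext the shell already provides), something like the following distributes a small computation across the cluster:
# sc is provided by the pyspark shell
nums = sc.parallelize(range(1, 101))
print(nums.map(lambda x: x * x).reduce(lambda a, b: a + b))  # sum of squares 1..100 = 338350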
Create profile
ipython profile create myserver
Edit configuration file
~/.ipython/profile_myserver/ipython_notebook_config.py
c = get_config()
c.NotebookApp.ip = '*'  # or restrict this to the master's local IP
c.NotebookApp.open_browser = False
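The port does not have to be set because 8888 is the default, but if you want to pin it explicitly you can add the following line to the same file:
c.NotebookApp.port = 8888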
~/.ipython/profile_myserver/startup/00-myserver-setup.py
import os
import sys

# Location of the Spark installation on the spark-ec2 master node
os.environ['SPARK_HOME'] = '/root/spark/'

# Master URL written out by the spark-ec2 scripts (something like spark://<master>:7077)
CLUSTER_URL = open('/root/spark-ec2/cluster-url').read().strip()

spark_home = os.environ.get('SPARK_HOME', None)
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')

# Make PySpark and Py4J importable, then run the PySpark shell startup script,
# which creates the SparkContext (sc) for the notebook
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
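With this startup file in place, a notebook opened under the myserver profile should already have a SparkContext; a quick check in the first cell (purely for illustration) is:
# sc and CLUSTER_URL are created by the startup file above
print(sc)           # the SparkContext from pyspark/shell.py
print(CLUSTER_URL)  # the master URL read from /root/spark-ec2/cluster-url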
【reference】 How-to: Use IPython Notebook with Apache Spark
Start IPython Notebook
ipython notebook --pylab inline --profile=myserver
You can use it by accessing <Spark master server domain>:8888.
[Read from S3 and treat as RDD]
s3_file = sc.textFile("s3n://<BUCKET>/<DIR>")
s3_file = sc.textFile("s3n://<BUCKET>/<DIR>/<FILE>")
The point to note is to write "s3n://"; I could not read the data with "s3://". You can point to a directory as well as a single file.
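Once loaded, the usual RDD operations work; for example (the bucket and path are placeholders), counting the lines and peeking at a few records:
logs = sc.textFile("s3n://<BUCKET>/<DIR>")  # reads every file under the directory
print(logs.count())                         # total number of lines
for line in logs.take(3):                   # first few records, just to have a look
    print(line)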
[Read local file]
local_file = sc.textFile("file://" + "Local file path")
When you only specify a path, it refers to HDFS by default. Since you generally want to manage files on HDFS this is convenient, but if you want to refer to a local file you need to prefix the path with "file://".
Also, when referencing a local file from IPython Notebook, port 9000 is used, so open it in the security group as needed.
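A small sketch of the difference (the paths are placeholders): a bare path is resolved against HDFS, while a file:// path reads from the local filesystem. Note that for file:// the file has to be readable at the same path on the worker nodes as well.
hdfs_file = sc.textFile("/user/root/<FILE>")     # bare path: resolved against HDFS
local_file = sc.textFile("file:///root/<FILE>")  # explicit local path
print(hdfs_file.count())
print(local_file.count())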
[Read from S3 and treat as DataFrame]
This has nothing to do with Spark, but I will note it here anyway. If you want to handle S3 files with pandas, you need to configure boto.
~/.boto
[Credentials]
aws_access_key_id = <AWS_ACCESS_KEY_ID>
aws_secret_access_key = <AWS_SECRET_ACCESS_KEY>
df = pd.read_csv("s3://<BUCKET>/<DIR>/<FILE>")
Unlike the RDD case, write "s3://" instead of "s3n://", and you cannot specify a directory, only a single file.
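If you want to hand Spark results over to pandas instead, one simple pattern (the comma-split parsing here is just an illustration) is to take a sample of the RDD and build a DataFrame from it:
import pandas as pd
# s3_file is the RDD loaded earlier; take a sample and parse it into columns
rows = s3_file.map(lambda line: line.split(",")).take(1000)
df = pd.DataFrame(rows)
print(df.head())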
The Spark Web UI can be accessed at <Spark master server domain>:8080, and Ganglia at <Spark master server domain>:5080/ganglia.
The steps start from preparing an EC2 instance, but if you are working on your own, you can simply clone the source from GitHub onto your own PC instead. This time I set up a dedicated instance because the assumption is that multiple people will operate the Spark cluster.