I previously built a Spark environment on EMR and tried it out, but keeping it running all the time is costly, and it was a hassle to tear the cluster down when it was no longer needed and rebuild it when it was time to use it again. So this time I built a Spark environment on EC2. Since you can start and stop the server whenever you like, you can experiment with Spark at low cost. I also installed IPython Notebook so that I can work with Spark from there and analyze data easily.
A low-spec instance is enough, since it is only used to launch, stop, and delete the servers that run Spark. This time I used the cheapest type, t2.micro.
The spark-ec2 scripts are used via a git checkout, so first install git and fetch the necessary files.
sudo yum install -y git
git clone git://github.com/apache/spark.git -b branch-1.2
I checked out branch 1.2, which was the latest at the time (January 2015).
【reference】 Spark Lightning-fast cluster computing
Looking at spark/ec2/spark_ec2.py, it appears to read the AWS credentials from .boto if they are set there, and otherwise to fall back to environment variables such as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. This time, I put the access key and secret key in the boto configuration file.
~/.boto
[Credentials]
aws_access_key_id = <AWS_ACCESS_KEY_ID>
aws_secret_access_key = <AWS_SECRET_ACCESS_KEY>
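If you want to check that the credentials file is picked up, a quick test with boto (the library spark_ec2.py uses) looks like the following; listing the buckets is just an illustration:
import boto
# Uses the credentials from ~/.boto
conn = boto.connect_s3()
print(conn.get_all_buckets())  # lists your S3 buckets if the keys are valid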
~/spark/ec2/spark-ec2 --key-pair=<KEY_PAIR_NAME> --identity-file=<SSH_KEY_FILE_NAME> --region=ap-northeast-1 --spark-version=1.2.0 --copy-aws-credentials launch <CLUSTER_NAME>
[Remarks] It takes a few minutes for the cluster to start up; this time it took about 5 minutes. By default, two m1.large instances are launched. Adding "--copy-aws-credentials" makes the cluster inherit your AWS credential settings.
See below for detailed settings. Running Spark on EC2 http://spark.apache.org/docs/1.2.0/ec2-scripts.html
When you launch a Spark cluster, security groups are also created automatically, but in their default state some ports are open to the world, so change them if necessary.
Also, IPython Notebook uses ports 8888 and 9000, so add rules so that you can access them.
Cluster login
~/spark/ec2/spark-ec2 --key-pair=<KEY_PAIR_NAME> --identity-file=<SSH_KEY_FILE_NAME> --region=ap-northeast-1 login <CLUSTER_NAME>
Cluster stop
~/spark/ec2/spark-ec2 --region=ap-northeast-1 stop <CLUSTER_NAME>
Cluster startup
~/spark/ec2/spark-ec2 --key-pair=<KEY_PAIR_NAME> --identity-file=<SSH_KEY_FILE_NAME> --region=ap-northeast-1 start <CLUSTER_NAME>
After logging in to the master server, you can launch the PySpark shell with the following command.
/root/spark/bin/pyspark
For Scala, use spark-shell:
/root/spark/bin/spark-shell
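As a quick smoke test in the pyspark shell (sc is the SparkContext the shell already provides), something like the following distributes a small computation across the cluster:
# sc is provided by the pyspark shell
nums = sc.parallelize(range(1, 101))
print(nums.map(lambda x: x * x).reduce(lambda a, b: a + b))  # sum of squares 1..100 = 338350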
Create profile
ipython profile create myserver
Edit configuration file
~/.ipython/profile_myserver/ipython_notebook_config.py
c = get_config()
c.NotebookApp.ip = '*'  # or restrict this to the master's local IP
c.NotebookApp.open_browser = False
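The port does not have to be set because 8888 is the default, but if you want to pin it explicitly you can add the following line to the same file:
c.NotebookApp.port = 8888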
~/.ipython/profile_myserver/startup/00-myserver-setup.py
import os
import sys

# Location of the Spark installation on the spark-ec2 master node
os.environ['SPARK_HOME'] = '/root/spark/'

# Master URL written out by the spark-ec2 scripts (something like spark://<master>:7077)
CLUSTER_URL = open('/root/spark-ec2/cluster-url').read().strip()

spark_home = os.environ.get('SPARK_HOME', None)
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')

# Make PySpark and Py4J importable, then run the PySpark shell startup script,
# which creates the SparkContext (sc) for the notebook
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
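With this startup file in place, a notebook opened under the myserver profile should already have a SparkContext; a quick check in the first cell (purely for illustration) is:
# sc and CLUSTER_URL are created by the startup file above
print(sc)           # the SparkContext from pyspark/shell.py
print(CLUSTER_URL)  # the master URL read from /root/spark-ec2/cluster-url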
【reference】 How-to: Use IPython Notebook with Apache Spark
Start IPython Notebook
ipython notebook --pylab inline --profile=myserver
You can use it by accessing <Spark master server domain>:8888.
[Read from S3 and treat as RDD]
s3_file = sc.textFile("s3n://<BUCKET>/<DIR>")
s3_file = sc.textFile("s3n://<BUCKET>/<DIR>/<FILE>")
The point to note is to write "s3n://"; I could not read the data with "s3://". You can point to a directory as well as a single file.
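Once loaded, the usual RDD operations work; for example (the bucket and path are placeholders), counting the lines and peeking at a few records:
logs = sc.textFile("s3n://<BUCKET>/<DIR>")  # reads every file under the directory
print(logs.count())                         # total number of lines
for line in logs.take(3):                   # first few records, just to have a look
    print(line)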
[Read local file]
local_file = sc.textFile("file://" + "Local file path")
When you only specify a path, it refers to HDFS by default. Since you generally want to manage files on HDFS this is convenient, but if you want to refer to a local file you need to prefix the path with "file://".
Also, when referencing a local file from IPython Notebook, port 9000 is used, so open it in the security group as needed.
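A small sketch of the difference (the paths are placeholders): a bare path is resolved against HDFS, while a file:// path reads from the local filesystem. Note that for file:// the file has to be readable at the same path on the worker nodes as well.
hdfs_file = sc.textFile("/user/root/<FILE>")     # bare path: resolved against HDFS
local_file = sc.textFile("file:///root/<FILE>")  # explicit local path
print(hdfs_file.count())
print(local_file.count())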
[Read from S3 and treat as DataFrame]
This has nothing to do with Spark, but I will note it here anyway. If you want to handle S3 files with pandas, you need to configure boto.
~/.boto
[Credentials]
aws_access_key_id = <AWS_ACCESS_KEY_ID>
aws_secret_access_key = <AWS_SECRET_ACCESS_KEY>
df = pd.read_csv("s3://<BUCKET>/<DIR>/<FILE>")
Unlike the RDD case, write "s3://" instead of "s3n://", and you cannot specify a directory, only a single file.
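If you want to hand Spark results over to pandas instead, one simple pattern (the comma-split parsing here is just an illustration) is to take a sample of the RDD and build a DataFrame from it:
import pandas as pd
# s3_file is the RDD loaded earlier; take a sample and parse it into columns
rows = s3_file.map(lambda line: line.split(",")).take(1000)
df = pd.DataFrame(rows)
print(df.head())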
The Spark Web UI can be accessed at <Spark master server domain>:8080, and Ganglia at <Spark master server domain>:5080/ganglia.
The steps start from preparing an EC2 instance, but if you are working on your own, you can simply clone the source from GitHub onto your own PC instead. This time I set up a dedicated instance because the assumption is that multiple people will operate the Spark cluster.