[PYTHON] I tried using PySpark from Jupyter 4.x on EMR

This is a summary of the sticking points I ran into, and how I dealt with them, when installing Jupyter on a Spark cluster launched by Amazon EMR and using PySpark from it.

EMR launch

For this test, I launched the cluster with the following settings:

Applications: All Applications: Hadoop 2.6.0, Hive 1.0.0, Hue 3.7.1, Mahout 0.11.0, Pig 0.14.0, and Spark 1.5.0
Instance type: m3.xlarge
Number of instances: 1
Permission: Default

Preparation

If you include Hue, Hue takes port 8888, so Jupyter can no longer use its default port 8888. In that case, choose a different port for Jupyter and open it in the security group so that it is reachable from your PC.
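To confirm which ports are actually taken on the master node before choosing one for Jupyter, a small check like the following may help. This is only a sketch: `port_is_free` is a helper name I made up, and 8889 is just an example candidate, not a value from this setup. It is written to run on both Python 2.7 and Python 3.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if nothing is currently bound to (host, port)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # bind() fails with "address already in use" when another
        # process (for example Hue) is already listening on the port.
        sock.bind((host, port))
        return True
    except socket.error:
        return False
    finally:
        sock.close()


if __name__ == "__main__":
    # 8888 is Jupyter's default; 8889 is a hypothetical alternative.
    for candidate in (8888, 8889):
        print("port %d free: %s" % (candidate, port_is_free(candidate)))
```

A port that is free on the instance still has to be opened in the security group before you can reach it from your PC.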

Switching from Python 2.6 to Python 2.7

On the EC2 instance started by EMR, the default Python is 2.6.9, so switch it to 2.7. Python 2.7 is already installed, so you only need to change where the symbolic link points.

sudo unlink /usr/bin/python
sudo ln -s /usr/bin/python2.7 /usr/bin/python

Switching from pip 2.6 to pip 2.7

Upgrade pip, then change where the link points.

sudo pip install -U pip
sudo ln -s /usr/bin/pip-2.7 /usr/bin/pip

Jupyter installation

As of October 2015, this installs Jupyter 4.0.6.

sudo pip install jupyter

Launch Jupyter

jupyter-notebook

About creating a profile

Generate a template configuration file (it is written to ~/.jupyter/jupyter_notebook_config.py).

jupyter notebook --generate-config

~/.jupyter/jupyter_notebook_config.py


c = get_config()

c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8888

If you include Hue, set c.NotebookApp.port to a port other than 8888 that you have opened in the security group.

Profiles seem to have disappeared as of Jupyter 4.x. Instead, you can specify a configuration file with the --config option. For example:

jupyter-notebook --config='~/.ipython/profile_nbservers/ipython_config.py'

Alternatively, if you set the environment variable JUPYTER_CONFIG_DIR to a directory path, Jupyter reads jupyter_notebook_config.py from that directory.
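The lookup described above can be sketched roughly as follows. This is my approximation of the per-user behaviour, not Jupyter's actual code: `resolve_notebook_config` is a hypothetical helper, and it ignores any system-wide configuration directories that Jupyter may also consult.

```python
import os


def resolve_notebook_config(environ=None):
    """Approximate the per-user path of jupyter_notebook_config.py."""
    environ = os.environ if environ is None else environ
    # JUPYTER_CONFIG_DIR, when set, replaces the default ~/.jupyter.
    config_dir = environ.get("JUPYTER_CONFIG_DIR")
    if not config_dir:
        config_dir = os.path.join(os.path.expanduser("~"), ".jupyter")
    return os.path.join(config_dir, "jupyter_notebook_config.py")


if __name__ == "__main__":
    print(resolve_notebook_config())
```

Running this in the same shell you launch Jupyter from shows which file that shell's environment would point to.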

Make Spark available on Jupyter

Change spark.master from yarn to local. (If you don't do this, the SparkContext will stop.)

/usr/lib/spark/conf/spark-defaults.conf


# spark.master yarn
spark.master local
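To double-check that the edit took effect, you could parse the file and look at the value. The following is a minimal sketch that assumes the whitespace-separated "key value" form shown above; `read_spark_defaults` is a hypothetical helper, and the `spark.executor.memory` line in the sample is only there for illustration.

```python
def read_spark_defaults(text):
    """Parse spark-defaults.conf text into a dict, skipping comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or commented-out setting
        parts = line.split(None, 1)  # key, then the rest as the value
        settings[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return settings


if __name__ == "__main__":
    sample = (
        "# spark.master yarn\n"
        "spark.master local\n"
        "spark.executor.memory 2g\n"  # illustrative line only
    )
    print(read_spark_defaults(sample)["spark.master"])
```

On the cluster, you would pass in the contents of /usr/lib/spark/conf/spark-defaults.conf instead of the sample string.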

Previously I did the Spark setup in ~/.ipython/profile_<profile name>/startup/00-<profile name>-setup.py, but that no longer works either. Instead, set SPARK_HOME in the shell before launching Jupyter:

export SPARK_HOME='/usr/lib/spark'

Then run the following in a Jupyter Notebook cell (Python 2):

import os
import sys

spark_home = os.environ.get('SPARK_HOME', None)
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))

execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))

You could also save this as a file and load it from there.
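If you do move the setup into a file, one possible variation is to look up the py4j zip with a glob instead of hard-coding its version, so the same file keeps working if the bundled py4j changes. This is only a sketch under that assumption: `pyspark_sys_paths` is a name I made up, and it presumes the python/lib layout used above.

```python
import glob
import os


def pyspark_sys_paths(spark_home):
    """Return the sys.path entries PySpark needs under spark_home."""
    python_dir = os.path.join(spark_home, "python")
    # Match whichever py4j source zip ships with this Spark build.
    pattern = os.path.join(python_dir, "lib", "py4j-*-src.zip")
    zips = sorted(glob.glob(pattern))
    if not zips:
        raise ValueError("no py4j-*-src.zip found under %s" % python_dir)
    return [python_dir, zips[-1]]


# Example use in a notebook cell (needs a real Spark installation):
#   import sys
#   for path in reversed(pyspark_sys_paths(os.environ["SPARK_HOME"])):
#       sys.path.insert(0, path)
```

After the paths are in place, the rest of the cell above (running pyspark/shell.py) stays the same.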
