[PYTHON] Apache Spark Document Japanese Translation --Submitting Applications

Introduction

This is a Japanese translation of the Apache Spark documentation. Other pages will be added as they are translated, so please check those as well.

If you find anything wrong with the translation, please let us know in the comments or on Facebook.

Submitting Applications

The spark-submit script in Spark’s bin directory is used to launch applications on a cluster. It can use all of Spark’s supported cluster managers through a uniform interface so you don’t have to configure your application specially for each one.

The spark-submit script in Spark's bin directory is used to launch an application on a cluster. It can use all of the cluster managers supported by Spark through a single, uniform interface, so no special configuration of your application is needed for each of them.

Bundling Your Application's Dependencies

If your code depends on other projects, you will need to package them alongside your application in order to distribute the code to a Spark cluster. To do this, create an assembly jar (or “uber” jar) containing your code and its dependencies. Both sbt and Maven have assembly plugins. When creating assembly jars, list Spark and Hadoop as provided dependencies; these need not be bundled since they are provided by the cluster manager at runtime. Once you have an assembled jar you can call the bin/spark-submit script as shown here while passing your jar.

If your code depends on other projects, you need to package them together with your application in order to distribute the code to a Spark cluster. To do this, create a single assembly jar (or "uber" jar) containing your code and its dependencies. Both sbt and Maven have assembly plugins. When building the assembly jar, list Spark and Hadoop as provided dependencies; they are supplied by the cluster manager at runtime and do not need to be bundled. Once you have the assembled jar, you can call the bin/spark-submit script as shown here, passing it your jar.

For Python, you can use the --py-files argument of spark-submit to add .py, .zip or .egg files to be distributed with your application. If you depend on multiple Python files we recommend packaging them into a .zip or .egg.

In Python, spark-submit provides the --py-files argument to distribute .py, .zip, and .egg files together with your application. If you depend on several Python files, it is recommended to package them into a single .zip or .egg file.
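
As a rough sketch of that packaging step (the file and module names below are hypothetical, not taken from the original document), the standard zipfile module can build such an archive:

# build_deps.py -- bundle local helper modules into deps.zip for use with --py-files
import zipfile

with zipfile.ZipFile("deps.zip", "w") as zf:
    zf.write("helpers.py")           # a single-module dependency
    zf.write("mypkg/__init__.py")    # a package keeps its directory layout inside the zip
    zf.write("mypkg/util.py")

The resulting deps.zip would then be passed to spark-submit as --py-files deps.zip.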

Launching Applications with spark-submit

Once a user application is bundled, it can be launched using the bin/spark-submit script. This script takes care of setting up the classpath with Spark and its dependencies, and can support different cluster managers and deploy modes that Spark supports:

Once the user application is bundled, it can be launched with the bin/spark-submit script. This script takes care of setting up the classpath with Spark and its dependencies, and it supports all of the cluster managers and deploy modes that Spark supports:

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  ... # other options
  <application-jar> \
  [application-arguments]

Some of the commonly used options are:

Some commonly used options are:

--class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)

--class: Application entry point (class name: e.g. org.apache.spark.examples.SparkPi)

--master: The master URL for the cluster (e.g. spark://23.195.26.187:7077)

--master: URL of the master node for the cluster

--deploy-mode: Whether to deploy your driver program within the cluster or run it locally as an external client (either cluster or client)

--deploy-mode: Specifies whether to deploy the driver program inside the cluster or run it locally as an external client (cluster or client).

application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.

application-jar: The path to the bundled jar containing your application and all of its dependencies. The URL must be globally visible inside your cluster, for instance an hdfs:// path or a file:// path that is present on all nodes.

application-arguments: Arguments passed to the main method of your main class, if any

application-arguments: Any arguments to pass to the main method of your main class.

For Python applications, simply pass a .py file in the place of a JAR, and add Python .zip, .egg or .py files to the search path with --py-files.

For Python applications, simply specify a .py file instead of a JAR, and add Python .zip, .egg, or .py files to the search path with --py-files.
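
For reference, a minimal sketch of such a .py entry point is shown below (the file name, app name, and helpers module are hypothetical). Application arguments arrive in sys.argv, and modules shipped with --py-files are importable on both the driver and the executors:

# my_app.py -- hypothetical entry point submitted with spark-submit
import sys
from pyspark import SparkContext
import helpers  # hypothetical module distributed via --py-files

if __name__ == "__main__":
    sc = SparkContext(appName="MyApp")
    # application-arguments appear after the script name in sys.argv
    input_path = sys.argv[1] if len(sys.argv) > 1 else "data.txt"
    kept = sc.textFile(input_path).filter(helpers.keep).count()  # helpers.keep is a hypothetical predicate
    print(kept)
    sc.stop()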

To enumerate all options available to spark-submit run it with --help. Here are a few examples of common options:

To see all of the options available to spark-submit, run it with --help. Here are a few examples using common options.

Run application locally on 8 cores

./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[8] \
  /path/to/examples.jar \
  100

Run on a Spark standalone cluster

./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

Run on a YARN cluster

export HADOOP_CONF_DIR=XXX
# yarn-cluster can be replaced with yarn-client for client mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000

Run a Python application on a cluster

./bin/spark-submit \
  --master spark://207.184.161.138:7077 \
  examples/src/main/python/pi.py \
  1000
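
For context, the bundled pi.py example computes a Monte Carlo estimate of pi, with the trailing 1000 taken as the number of partitions. A simplified sketch along those lines (not the exact bundled script) is:

# Simplified Monte Carlo pi estimate, similar in spirit to examples/src/main/python/pi.py
import sys
from random import random
from operator import add
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="PythonPi")
    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    def sample(_):
        # draw a random point in the square [-1, 1] x [-1, 1] and test whether it falls inside the unit circle
        x, y = random() * 2 - 1, random() * 2 - 1
        return 1 if x * x + y * y < 1 else 0

    count = sc.parallelize(range(1, n + 1), partitions).map(sample).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))
    sc.stop()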

Master URLs

The master URL passed to Spark can be in one of the following formats:

Specify the master URL in one of the following formats.
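
The table of formats is omitted here; according to the upstream documentation for this version of Spark, the accepted values are roughly the following:

local: Run Spark locally with one worker thread (no parallelism at all).
local[K]: Run Spark locally with K worker threads (ideally set to the number of cores on your machine).
local[*]: Run Spark locally with as many worker threads as there are logical cores on your machine.
spark://HOST:PORT: Connect to the given Spark standalone cluster master, on the port the master is configured to use (7077 by default).
mesos://HOST:PORT: Connect to the given Mesos cluster, on the port the master is configured to use (5050 by default).
yarn-client: Connect to a YARN cluster in client mode. The cluster location is found from the HADOOP_CONF_DIR variable.
yarn-cluster: Connect to a YARN cluster in cluster mode. The cluster location is found from the HADOOP_CONF_DIR variable.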

Loading Configuration from a File

The spark-submit script can load default Spark configuration values from a properties file and pass them on to your application. By default it will read options from conf/spark-defaults.conf in the Spark directory. For more detail, see the section on loading default configurations.

The spark-submit script can load default Spark configuration values from a properties file and pass them on to your application. By default, the options are read from conf/spark-defaults.conf in the Spark directory. For more information, see the section on loading default configurations.

http://spark.apache.org/docs/latest/configuration.html#loading-default-configurations

Loading default Spark configurations this way can obviate the need for certain flags to spark-submit. For instance, if the spark.master property is set, you can safely omit the --master flag from spark-submit. In general, configuration values explicitly set on a SparkConf take the highest precedence, then flags passed to spark-submit, then values in the defaults file.

Loading the default Spark configuration in this way can remove the need for certain spark-submit flags. For instance, if the spark.master property is set, you can safely omit the --master flag from the command. In general, configuration values set explicitly on a SparkConf take the highest precedence, followed by flags passed to spark-submit, and then values in the defaults file.
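
As a small Python illustration of that precedence (the app name and master value here are only examples), a master set explicitly on SparkConf overrides both the --master flag and spark-defaults.conf:

# Values set explicitly on SparkConf win over spark-submit flags and spark-defaults.conf.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("MyApp").setMaster("local[4]")
sc = SparkContext(conf=conf)  # runs with master local[4] even if --master was also passed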

If you are ever unclear where configuration options are coming from, you can print out fine-grained debugging information by running spark-submit with the --verbose option.

If it is ever unclear where a configuration value came from, you can print fine-grained debugging information by running spark-submit with the --verbose option.

Advanced Dependency Management

When using spark-submit, the application jar along with any jars included with the --jars option will be automatically transferred to the cluster. Spark uses the following URL scheme to allow different strategies for disseminating jars:

When using spark-submit, the application jar, along with any jars specified with the --jars option, is automatically transferred to the cluster. Spark uses the following URL schemes to allow different strategies for distributing jars:

file: - Absolute paths and file:/ URIs are served by the driver’s HTTP file server, and every executor pulls the file from the driver HTTP server.

file: - Absolute paths and file:/ URIs are served by the driver's HTTP file server, and each executor pulls the file from the driver's HTTP server.

hdfs:, http:, https:, ftp: - these pull down files and JARs from the URI as expected

hdfs:, http:, https:, ftp: - Files and JARs are pulled down from the URI as expected.

local: - a URI starting with local:/ is expected to exist as a local file on each worker node. This means that no network IO will be incurred, and works well for large files/JARs that are pushed to each worker, or shared via NFS, GlusterFS, etc.

local: - A URI starting with local:/ is expected to exist as a local file on each worker node. This means no network IO is incurred, and it works well for large files and JARs that have already been pushed to each worker or are shared via NFS, GlusterFS, etc.

Note that JARs and files are copied to the working directory for each SparkContext on the executor nodes. This can use up a significant amount of space over time and will need to be cleaned up. With YARN, cleanup is handled automatically, and with Spark standalone, automatic cleanup can be configured with the spark.worker.cleanup.appDataTtl property.

Note that JARs and files are copied to the working directory of each SparkContext on the executor nodes. Over time this can consume a significant amount of space and will need to be cleaned up. With YARN, cleanup is handled automatically; with Spark standalone, automatic cleanup can be configured with the spark.worker.cleanup.appDataTtl property.

For python, the equivalent --py-files option can be used to distribute .egg, .zip and .py libraries to executors.

For Python, the equivalent --py-files option can be used to distribute .egg, .zip, and .py libraries to the executors.

More Information

Once you have deployed your application, the cluster mode overview describes the components involved in distributed execution, and how to monitor and debug applications.

Once you have deployed your application, the cluster mode overview describes the components involved in distributed execution and how to monitor and debug your application.

Original: cluster mode overview http://spark.apache.org/docs/latest/cluster-overview.html
Japanese translation: http://qiita.com/mychaelstyle/items/610b432a1ef1a7e3d2a0

My Facebook https://www.facebook.com/masanori.nakashima
