[PYTHON] Apache Spark Document Japanese Translation --Submitting Applications

Introduction

This is a Japanese translation of the Apache Spark documentation. Other pages will be added as they are translated, so please check those as well.

If you find anything wrong with the translation, please let us know in the comments or on Facebook.

Submitting Applications

The spark-submit script in Spark’s bin directory is used to launch applications on a cluster. It can use all of Spark’s supported cluster managers through a uniform interface so you don’t have to configure your application specially for each one.

The spark-submit script in Spark's bin directory is used to launch an application on a cluster. It can use all of the cluster managers supported by Spark through a single, uniform interface, so no special configuration of your application is needed for each of them.

Bundling Your Application's Dependencies

If your code depends on other projects, you will need to package them alongside your application in order to distribute the code to a Spark cluster. To do this, create an assembly jar (or “uber” jar) containing your code and its dependencies. Both sbt and Maven have assembly plugins. When creating assembly jars, list Spark and Hadoop as provided dependencies; these need not be bundled since they are provided by the cluster manager at runtime. Once you have an assembled jar you can call the bin/spark-submit script as shown here while passing your jar.

If your code depends on other projects, you need to package them together with your application in order to distribute the code to a Spark cluster. To do this, create a single assembly jar (or "uber" jar) containing your code and its dependencies. Both sbt and Maven have assembly plugins. When building the assembly jar, list Spark and Hadoop as provided dependencies; they are supplied by the cluster manager at runtime and do not need to be bundled. Once you have the assembled jar, you can call the bin/spark-submit script as shown here, passing it your jar.

For Python, you can use the --py-files argument of spark-submit to add .py, .zip or .egg files to be distributed with your application. If you depend on multiple Python files we recommend packaging them into a .zip or .egg.

In Python, spark-submit provides the --py-files argument to distribute .py, .zip, and .egg files together with your application. If you depend on several Python files, it is recommended to package them into a single .zip or .egg file.
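
As a rough sketch of that packaging step (the file and module names below are hypothetical, not taken from the original document), the standard zipfile module can build such an archive:

# build_deps.py -- bundle local helper modules into deps.zip for use with --py-files
import zipfile

with zipfile.ZipFile("deps.zip", "w") as zf:
    zf.write("helpers.py")           # a single-module dependency
    zf.write("mypkg/__init__.py")    # a package keeps its directory layout inside the zip
    zf.write("mypkg/util.py")

The resulting deps.zip would then be passed to spark-submit as --py-files deps.zip.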

Launching Applications with spark-submit

Once a user application is bundled, it can be launched using the bin/spark-submit script. This script takes care of setting up the classpath with Spark and its dependencies, and can support different cluster managers and deploy modes that Spark supports:

Once the user application is bundled, it can be launched with the bin/spark-submit script. This script takes care of setting up the classpath with Spark and its dependencies, and it supports all of the cluster managers and deploy modes that Spark supports:

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  ... # other options
  <application-jar> \
  [application-arguments]

Some of the commonly used options are:

Some commonly used options are:

--class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)

--class: Application entry point (class name: e.g. org.apache.spark.examples.SparkPi)

--master: The master URL for the cluster (e.g. spark://23.195.26.187:7077)

--master: URL of the master node for the cluster

--deploy-mode: Whether to deploy your driver program within the cluster or run it locally as an external client (either cluster or client)

--deploy-mode: Specifies whether to deploy the driver program inside the cluster or run it locally as an external client (cluster or client).

application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.

application-jar: The path to the bundled jar containing your application and all of its dependencies. The URL must be globally visible inside your cluster, for instance an hdfs:// path or a file:// path that is present on all nodes.

application-arguments: Arguments passed to the main method of your main class, if any

application-arguments: Any arguments to pass to the main method of your main class.

For Python applications, simply pass a .py file in the place of a JAR, and add Python .zip, .egg or .py files to the search path with --py-files.

For Python applications, simply specify a .py file instead of a JAR, and add Python .zip, .egg, or .py files to the search path with --py-files.
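
For reference, a minimal sketch of such a .py entry point is shown below (the file name, app name, and helpers module are hypothetical). Application arguments arrive in sys.argv, and modules shipped with --py-files are importable on both the driver and the executors:

# my_app.py -- hypothetical entry point submitted with spark-submit
import sys
from pyspark import SparkContext
import helpers  # hypothetical module distributed via --py-files

if __name__ == "__main__":
    sc = SparkContext(appName="MyApp")
    # application-arguments appear after the script name in sys.argv
    input_path = sys.argv[1] if len(sys.argv) > 1 else "data.txt"
    kept = sc.textFile(input_path).filter(helpers.keep).count()  # helpers.keep is a hypothetical predicate
    print(kept)
    sc.stop()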

To enumerate all options available to spark-submit run it with --help. Here are a few examples of common options:

To see all of the options available to spark-submit, run it with --help. Here are a few examples using common options.

Run application locally on 8 cores

./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[8] \
  /path/to/examples.jar \
  100

Run on a Spark standalone cluster

./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

Run on a YARN cluster

export HADOOP_CONF_DIR=XXX
# yarn-cluster can be replaced with yarn-client for client mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000

Run a Python application on a cluster

./bin/spark-submit \
  --master spark://207.184.161.138:7077 \
  examples/src/main/python/pi.py \
  1000
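
For context, the bundled pi.py example computes a Monte Carlo estimate of pi, with the trailing 1000 taken as the number of partitions. A simplified sketch along those lines (not the exact bundled script) is:

# Simplified Monte Carlo pi estimate, similar in spirit to examples/src/main/python/pi.py
import sys
from random import random
from operator import add
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="PythonPi")
    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    def sample(_):
        # draw a random point in the square [-1, 1] x [-1, 1] and test whether it falls inside the unit circle
        x, y = random() * 2 - 1, random() * 2 - 1
        return 1 if x * x + y * y < 1 else 0

    count = sc.parallelize(range(1, n + 1), partitions).map(sample).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))
    sc.stop()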

Master URLs

The master URL passed to Spark can be in one of the following formats:

Specify the master URL in one of the following formats.
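
The table of formats is omitted here; according to the upstream documentation for this version of Spark, the accepted values are roughly the following:

local: Run Spark locally with one worker thread (no parallelism at all).
local[K]: Run Spark locally with K worker threads (ideally set to the number of cores on your machine).
local[*]: Run Spark locally with as many worker threads as there are logical cores on your machine.
spark://HOST:PORT: Connect to the given Spark standalone cluster master, on the port the master is configured to use (7077 by default).
mesos://HOST:PORT: Connect to the given Mesos cluster, on the port the master is configured to use (5050 by default).
yarn-client: Connect to a YARN cluster in client mode. The cluster location is found from the HADOOP_CONF_DIR variable.
yarn-cluster: Connect to a YARN cluster in cluster mode. The cluster location is found from the HADOOP_CONF_DIR variable.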

Loading Configuration from a File

The spark-submit script can load default Spark configuration values from a properties file and pass them on to your application. By default it will read options from conf/spark-defaults.conf in the Spark directory. For more detail, see the section on loading default configurations.

The spark-submit script can load default Spark configuration values from a properties file and pass them on to your application. By default, the options are read from conf/spark-defaults.conf in the Spark directory. For more information, see the section on loading default configurations.

http://spark.apache.org/docs/latest/configuration.html#loading-default-configurations

Loading default Spark configurations this way can obviate the need for certain flags to spark-submit. For instance, if the spark.master property is set, you can safely omit the --master flag from spark-submit. In general, configuration values explicitly set on a SparkConf take the highest precedence, then flags passed to spark-submit, then values in the defaults file.

Loading the default Spark configuration in this way can remove the need for certain spark-submit flags. For instance, if the spark.master property is set, you can safely omit the --master flag from the command. In general, configuration values set explicitly on a SparkConf take the highest precedence, followed by flags passed to spark-submit, and then values in the defaults file.
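
As a small Python illustration of that precedence (the app name and master value here are only examples), a master set explicitly on SparkConf overrides both the --master flag and spark-defaults.conf:

# Values set explicitly on SparkConf win over spark-submit flags and spark-defaults.conf.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("MyApp").setMaster("local[4]")
sc = SparkContext(conf=conf)  # runs with master local[4] even if --master was also passed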

If you are ever unclear where configuration options are coming from, you can print out fine-grained debugging information by running spark-submit with the --verbose option.

If it is ever unclear where a configuration value came from, you can print fine-grained debugging information by running spark-submit with the --verbose option.

Advanced Dependency Management

When using spark-submit, the application jar along with any jars included with the --jars option will be automatically transferred to the cluster. Spark uses the following URL scheme to allow different strategies for disseminating jars:

When using spark-submit, the application jar, along with any jars specified with the --jars option, is automatically transferred to the cluster. Spark uses the following URL schemes to allow different strategies for distributing jars:

file: - Absolute paths and file:/ URIs are served by the driver’s HTTP file server, and every executor pulls the file from the driver HTTP server.

file: - Absolute paths and file:/ URIs are served by the driver's HTTP file server, and each executor pulls the file from the driver's HTTP server.

hdfs:, http:, https:, ftp: - these pull down files and JARs from the URI as expected

hdfs:, http:, https:, ftp: - Files and JARs are pulled down from the URI as expected.

local: - a URI starting with local:/ is expected to exist as a local file on each worker node. This means that no network IO will be incurred, and works well for large files/JARs that are pushed to each worker, or shared via NFS, GlusterFS, etc.

local: - A URI starting with local:/ is expected to exist as a local file on each worker node. This means no network IO is incurred, and it works well for large files and JARs that have already been pushed to each worker or are shared via NFS, GlusterFS, etc.

Note that JARs and files are copied to the working directory for each SparkContext on the executor nodes. This can use up a significant amount of space over time and will need to be cleaned up. With YARN, cleanup is handled automatically, and with Spark standalone, automatic cleanup can be configured with the spark.worker.cleanup.appDataTtl property.

Note that JARs and files are copied to the working directory of each SparkContext on the executor nodes. Over time this can consume a significant amount of space and will need to be cleaned up. With YARN, cleanup is handled automatically; with Spark standalone, automatic cleanup can be configured with the spark.worker.cleanup.appDataTtl property.

For python, the equivalent --py-files option can be used to distribute .egg, .zip and .py libraries to executors.

For Python, the equivalent --py-files option can be used to distribute .egg, .zip, and .py libraries to the executors.

More Information

Once you have deployed your application, the cluster mode overview describes the components involved in distributed execution, and how to monitor and debug applications.

Once you have deployed your application, the cluster mode overview describes the components involved in distributed execution and how to monitor and debug your application.

Original: cluster mode overview http://spark.apache.org/docs/latest/cluster-overview.html
Japanese translation: http://qiita.com/mychaelstyle/items/610b432a1ef1a7e3d2a0

My Facebook https://www.facebook.com/masanori.nakashima
