[PYTHON] Apache Spark Documentation Japanese Translation: Quick Start

This is a Japanese translation of the following page. Let's all use Spark. http://spark.apache.org/docs/latest/quick-start.html

Japanese translations of the guide to setting up Spark on AWS EC2 and of the Cluster Mode Overview are also available below, so please have a look.

Spark on AWS EC2
http://qiita.com/mychaelstyle/items/b752087a0bee6e41c182
Cluster Mode Overview
http://qiita.com/mychaelstyle/items/610b432a1ef1a7e3d2a0

If you find something strange in the Japanese translation, please let us know in the comments.

Quick Start

This tutorial provides a quick introduction to using Spark. We will first introduce the API through Spark’s interactive shell (in Python or Scala), then show how to write standalone applications in Java, Scala, and Python. See the programming guide for a more complete reference.

This tutorial is a brief introduction to using Spark. We first introduce the API through Spark's interactive shell (in Python or Scala), then show how to write standalone applications in Java, Scala, and Python. See the programming guide for a more complete reference.

http://spark.apache.org/docs/latest/programming-guide.html

To follow along with this guide, first download a packaged release of Spark from the Spark website. Since we won’t be using HDFS, you can download a package for any version of Hadoop.

To follow along with this guide, first download a packaged release of Spark from the Spark website. Because HDFS is not used here, a package built for any version of Hadoop will do.

Interactive Analysis with the Spark Shell

Interactive analysis using Spark Shell

Basics

Spark’s shell provides a simple way to learn the API, as well as a powerful tool to analyze data interactively. It is available in either Scala (which runs on the Java VM and is thus a good way to use existing Java libraries) or Python. Start it by running the following in the Spark directory:

The Spark shell offers a simple way to learn the API and is also a powerful tool for analyzing data interactively. It is available in Scala (which runs on the JVM and is therefore a good way to use existing Java libraries) or Python. Start the Spark shell by running the following command in the Spark directory.

`Scala`


./bin/spark-shell

`Python`


./bin/pyspark

Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs. Let’s make a new RDD from the text of the README file in the Spark source directory:

Spark's primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs. Let's create a new RDD from the text of the README file in the Spark source directory.

`Scala`


scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3

`Python`


>>> textFile = sc.textFile("README.md")

RDDs have actions, which return values, and transformations, which return pointers to new RDDs. Let’s start with a few actions:

RDDs have actions, which return values, and transformations, which return pointers to new RDDs. Let's start with a few actions.

`Scala`


scala> textFile.count() //Number of items (number of lines) in this RDD
res0: Long = 126

scala> textFile.first() //First item of RDD (first line)
res1: String = # Apache Spark

`Python`


>>> textFile.count() # Number of items in this RDD
126

>>> textFile.first() # First item in this RDD
u'# Apache Spark'

Now let’s use a transformation. We will use the filter transformation to return a new RDD with a subset of the items in the file.

Now let's use a transformation. We use the filter transformation to return a new RDD containing a subset of the items in the file.

`Scala`


scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4af09

`Python`


>>> linesWithSpark = textFile.filter(lambda line: "Spark" in line)

We can chain together transformations and actions:

You can also chain transformations and actions.

`Scala`


scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"?
res3: Long = 15

`Python`


>>> textFile.filter(lambda line: "Spark" in line).count() # How many lines contain "Spark"?
15
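The filter-then-count chain can also be tried without a Spark installation. The following is a minimal plain-Python sketch, not Spark code: a list stands in for the RDD, and `sample_lines` and `contains_spark` are made-up names and data used only for illustration.

```python
# Plain-Python sketch of a filter (transformation) followed by a count (action).
# sample_lines is hypothetical stand-in data, not the real README.
sample_lines = [
    "# Apache Spark",
    "Spark is a fast and general cluster computing system.",
    "It provides high-level APIs in Scala, Java, and Python.",
    "## Building",
]


def contains_spark(line):
    """Predicate equivalent to the lambda passed to filter in the shell."""
    return "Spark" in line


# "Transformation": build a new collection holding a subset of the items.
lines_with_spark = [line for line in sample_lines if contains_spark(line)]

# "Action": compute a plain value from the collection.
num_lines_with_spark = len(lines_with_spark)
print(num_lines_with_spark)
```

In the PySpark shell, the same predicate could presumably be passed as `textFile.filter(contains_spark).count()`.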

More on RDD Operations

Learn more about RDD operations.

RDD actions and transformations can be used for more complex computations. Let’s say we want to find the line with the most words:

RDD actions and transformations can be used for more complex computations. Say we want to find the line with the most words.

`Scala`


scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
res4: Int = 15

This first maps a line to an integer value, creating a new RDD. reduce is called on that RDD to find the largest line count. The arguments to map and reduce are Scala function literals (closures), and can use any language feature or Scala/Java library. For example, we can easily call functions declared elsewhere. We’ll use Math.max() function to make this code easier to understand:

This first maps each line to an integer value (its word count), creating a new RDD. reduce is then called on that RDD to find the largest word count. The arguments to map and reduce are Scala function literals (closures), and they can use any language feature or any Scala/Java library. For example, you can easily call functions declared elsewhere. Using the Math.max() function makes this code easier to understand.

`Scala`


scala> import java.lang.Math
import java.lang.Math

scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
res5: Int = 15
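The source shows this map-then-reduce step only in Scala. As a rough illustration of the same logic, here is a plain-Python sketch using the built-in `map` and `functools.reduce` on a list instead of an RDD. The sample lines and the helper names `word_count` and `larger` are assumptions made for this example.

```python
from functools import reduce

# Hypothetical stand-in data; the real README has many more lines.
sample_lines = [
    "# Apache Spark",
    "Spark is a fast and general cluster computing system.",
    "",
]


def word_count(line):
    """Map step: turn a line into the number of space-separated pieces."""
    return len(line.split(" "))


def larger(a, b):
    """Reduce step: keep the larger of two counts (like Math.max in Scala)."""
    return a if a > b else b


# Map every line to its word count, then reduce the counts to the maximum.
most_words = reduce(larger, map(word_count, sample_lines))
print(most_words)
```

Python's built-in `max` could replace `larger`, much as Math.max does in the Scala version. In the PySpark shell the equivalent chain would likely be `textFile.map(word_count).reduce(larger)`.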

One common data flow pattern is MapReduce, as popularized by Hadoop. Spark can implement MapReduce flows easily:

One common data flow pattern is MapReduce, which was popularized by Hadoop. Spark makes it easy to implement MapReduce flows.

`Scala`


scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts: spark.RDD[(String, Int)] = spark.ShuffledAggregatedRDD@71f027b8

Here, we combined the flatMap, map and reduceByKey transformations to compute the per-word counts in the file as an RDD of (String, Int) pairs. To collect the word counts in our shell, we can use the collect action:

Here, the flatMap, map, and reduceByKey transformations are combined to compute the per-word counts in the file as an RDD of (String, Int) pairs. To gather the word counts in the shell, you can use the collect action.

`Scala`


scala> wordCounts.collect()
res6: Array[(String, Int)] = Array((means,1), (under,2), (this,3), (Because,1), (Python,2), (agree,1), (cluster.,1), ...)
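To make the three steps of the word count concrete, the following plain-Python sketch mimics flatMap, map, and reduceByKey on an ordinary list. It is an illustration under stated assumptions, not Spark's implementation: the sample lines and the helpers `split_words` and `to_pair` are hypothetical, and sorting plus `itertools.groupby` stands in for the shuffle that groups pairs by key.

```python
from functools import reduce
from itertools import chain, groupby
from operator import add, itemgetter

# Hypothetical stand-in data for the lines of a text file.
sample_lines = ["to be or", "not to be"]


def split_words(line):
    """flatMap step: one line becomes many words."""
    return line.split(" ")


def to_pair(word):
    """map step: each word becomes a (word, 1) pair."""
    return (word, 1)


# flatMap: split every line and flatten the results into one stream of words.
words = chain.from_iterable(map(split_words, sample_lines))

# map: attach a count of 1 to every word.
pairs = map(to_pair, words)

# reduceByKey: bring equal keys together, then add up their values.
sorted_pairs = sorted(pairs, key=itemgetter(0))
word_counts = {
    key: reduce(add, (count for _, count in group))
    for key, group in groupby(sorted_pairs, key=itemgetter(0))
}
print(word_counts)
```

With PySpark, the same helpers should fit the chain `textFile.flatMap(split_words).map(to_pair).reduceByKey(add)`, followed by `collect()` to bring the pairs back to the shell.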

Caching

Spark also supports pulling data sets into a cluster-wide in-memory cache. This is very useful when data is accessed repeatedly, such as when querying a small “hot” dataset or when running an iterative algorithm like PageRank. As a simple example, let’s mark our linesWithSpark dataset to be cached:

Spark can also pull datasets into a cluster-wide in-memory cache. This is very useful when data is accessed repeatedly, for example when querying a small "hot" dataset or when running an iterative algorithm like PageRank.

As a simple example, let's mark the linesWithSpark dataset to be cached.

`Scala`


scala> linesWithSpark.cache()
res7: spark.RDD[String] = spark.FilteredRDD@17e51082

scala> linesWithSpark.count()
res8: Long = 15

scala> linesWithSpark.count()
res9: Long = 15
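The point of calling count() twice after cache() is that the data only needs to be computed once and can then be reused. The toy class below is a plain-Python analogy of that idea, not a description of how Spark implements caching. `CachedDataset`, `load_lines_with_spark`, and the sample lines are all hypothetical names and data.

```python
class CachedDataset:
    """Toy stand-in for a dataset marked as cached; not Spark's implementation."""

    def __init__(self, load):
        self._load = load      # function that recomputes the data from its source
        self._cached = None    # filled in the first time the data is needed
        self.loads = 0         # how many times the data was actually recomputed

    def _items(self):
        # Compute the data on first use, then serve it from memory.
        if self._cached is None:
            self._cached = list(self._load())
            self.loads += 1
        return self._cached

    def count(self):
        return len(self._items())


def load_lines_with_spark():
    # Hypothetical source data standing in for the filtered README lines.
    lines = ["# Apache Spark", "Building Spark", "## Docs"]
    return (line for line in lines if "Spark" in line)


lines_with_spark = CachedDataset(load_lines_with_spark)
first = lines_with_spark.count()    # computes and stores the data
second = lines_with_spark.count()   # reuses the stored data
print(first, second, lines_with_spark.loads)
```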

It may seem silly to use Spark to explore and cache a 100-line text file. The interesting part is that these same functions can be used on very large data sets, even when they are striped across tens or hundreds of nodes. You can also do this interactively by connecting bin/spark-shell to a cluster, as described in the programming guide.

It may seem silly to use Spark to explore and cache a text file of about 100 lines. The interesting part is that these same functions can be used on very large datasets, even when they are striped across tens or hundreds of nodes. You can also do this interactively by connecting bin/spark-shell to a cluster, as described in the programming guide.

Standalone Applications

Scala

Now say we wanted to write a standalone application using the Spark API. We will walk through a simple application in both Scala (with SBT), Java (with Maven), and Python.

Now let's look at how to write a standalone application that uses the Spark API. We will walk through a simple application in Scala (with SBT), Java (with Maven), and Python.

We’ll create a very simple Spark application in Scala. So simple, in fact, that it’s named SimpleApp.scala:

Let's create a very simple Spark application in Scala. It is so simple, in fact, that it is named SimpleApp.scala.

`SimpleApp.scala`


/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "YOUR_SPARK_HOME/README.md" //Path on your machine
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}

This program just counts the number of lines containing ‘a’ and the number containing ‘b’ in the Spark README. Note that you’ll need to replace YOUR_SPARK_HOME with the location where Spark is installed. Unlike the earlier examples with the Spark shell, which initializes its own SparkContext, we initialize a SparkContext as part of the program.

This program just counts the number of lines containing 'a' and the number of lines containing 'b' in the Spark README file. Replace YOUR_SPARK_HOME with the path where Spark is installed. Unlike the earlier examples with the Spark shell, which initializes its own SparkContext, here we initialize a SparkContext as part of the program.

We pass the SparkContext constructor a SparkConf object which contains information about our application.

We pass the SparkContext constructor a SparkConf object that contains information about the application.

Our application depends on the Spark API, so we’ll also include an sbt configuration file, simple.sbt which explains that Spark is a dependency. This file also adds a repository that Spark depends on:

Since our application depends on the Spark API, we also include an sbt configuration file, simple.sbt, which declares Spark as a dependency. This file also adds a repository that Spark depends on.

`simple.sbt`


name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"

For sbt to work correctly, we’ll need to layout SimpleApp.scala and simple.sbt according to the typical directory structure. Once that is in place, we can create a JAR package containing the application’s code, then use the spark-submit script to run our program.

For sbt to work properly, lay out SimpleApp.scala and simple.sbt according to the typical directory structure. Once that is in place, you can create a JAR package containing your application code and run the program with the spark-submit script.

# Your directory layout should look like this
$ find .
.
./simple.sbt
./src
./src/main
./src/main/scala
./src/main/scala/SimpleApp.scala

# Package a jar containing your application
$ sbt package
...
[info] Packaging {..}/{..}/target/scala-2.10/simple-project_2.10-1.0.jar

# Use spark-submit to run your application
$ YOUR_SPARK_HOME/bin/spark-submit \
  --class "SimpleApp" \
  --master local[4] \
  target/scala-2.10/simple-project_2.10-1.0.jar
...
Lines with a: 46, Lines with b: 23

Java

This example will use Maven to compile an application jar, but any similar build system will work. We’ll create a very simple Spark application, SimpleApp.java:

This example uses Maven to compile the application jar, but any similar build system will work. Let's create a very simple Spark application called SimpleApp.java.

`SimpleApp.java`


/* SimpleApp.java */
import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;

public class SimpleApp {
  public static void main(String[] args) {
    String logFile = "YOUR_SPARK_HOME/README.md"; // Should be some file on your system
    SparkConf conf = new SparkConf().setAppName("Simple Application");
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaRDD<String> logData = sc.textFile(logFile).cache();

    long numAs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("a"); }
    }).count();

    long numBs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("b"); }
    }).count();

    System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
  }
}

This program just counts the number of lines containing ‘a’ and the number containing ‘b’ in a text file. Note that you’ll need to replace YOUR_SPARK_HOME with the location where Spark is installed.

This program just counts the number of lines containing 'a' and the number of lines containing 'b' in a text file. Replace YOUR_SPARK_HOME with the path where Spark is installed.

As with the Scala example, we initialize a SparkContext, though we use the special JavaSparkContext class to get a Java-friendly one. We also create RDDs (represented by JavaRDD) and run transformations on them. Finally, we pass functions to Spark by creating classes that extend spark.api.java.function.Function. The Spark programming guide describes these differences in more detail.

As with the Scala example, we initialize a SparkContext, but here we use the special JavaSparkContext class to get a Java-friendly one. We also create RDDs (represented by JavaRDD) and run transformations on them. Finally, we pass functions to Spark by creating classes that extend spark.api.java.function.Function. The Spark programming guide describes these differences in more detail.

http://spark.apache.org/docs/latest/programming-guide.html

To build the program, we also write a Maven pom.xml file that lists Spark as a dependency. Note that Spark artifacts are tagged with a Scala version.

In order to build this program, describe the Spark dependency in Maven's pom.xml file as follows. Please note that Spark artifacts are tagged with the Scala version.

`pom.xml`


<project>
  <groupId>edu.berkeley</groupId>
  <artifactId>simple-project</artifactId>
  <modelVersion>4.0.0</modelVersion>
  <name>Simple Project</name>
  <packaging>jar</packaging>
  <version>1.0</version>
  <repositories>
    <repository>
      <id>Akka repository</id>
      <url>http://repo.akka.io/releases</url>
    </repository>
  </repositories>
  <dependencies>
    <dependency> <!-- Spark dependency -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.0.0</version>
    </dependency>
  </dependencies>
</project>

We layout these files according to the canonical Maven directory structure:

Lay out these files according to the canonical Maven directory structure.

$ find .
./pom.xml
./src
./src/main
./src/main/java
./src/main/java/SimpleApp.java

Now, we can package the application using Maven and execute it with ./bin/spark-submit.

You can now package the application with Maven and run it with the ./bin/spark-submit script.

# Package a jar containing your application
$ mvn package
...
[INFO] Building jar: {..}/{..}/target/simple-project-1.0.jar

# Use spark-submit to run your application
$ YOUR_SPARK_HOME/bin/spark-submit \
  --class "SimpleApp" \
  --master local[4] \
  target/simple-project-1.0.jar
...
Lines with a: 46, Lines with b: 23

Python

Now we will show how to write a standalone application using the Python API (PySpark). As an example, we’ll create a simple Spark application, SimpleApp.py:

Now let's see how to write a standalone application using the Python API (PySpark). As an example, create a simple Spark application called SimpleApp.py.

`SimpleApp.py`


"""SimpleApp.py"""
from pyspark import SparkContext

logFile = "YOUR_SPARK_HOME/README.md"  # Should be some file on your system
sc = SparkContext("local", "Simple App")
logData = sc.textFile(logFile).cache()

numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()

print("Lines with a: %i, lines with b: %i" % (numAs, numBs))

This program just counts the number of lines containing ‘a’ and the number containing ‘b’ in a text file. Note that you’ll need to replace YOUR_SPARK_HOME with the location where Spark is installed.

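This program just counts the number of lines containing 'a' and the number of lines containing 'b' in a text file. Replace YOUR_SPARK_HOME with the path where Spark is installed.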

As with the Scala and Java examples, we use a SparkContext to create RDDs. We can pass Python functions to Spark, which are automatically serialized along with any variables that they reference.

As in the Scala and Java examples, we use a SparkContext to create RDDs. You can pass Python functions to Spark, and they are automatically serialized along with any variables they reference.
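A function that references a variable from its surrounding scope is a closure, and that referenced variable is what travels with the function. The sketch below shows such a closure in plain Python, using the built-in `filter` on a list instead of an RDD. `make_contains` and the sample lines are hypothetical and are not part of SimpleApp.py.

```python
def make_contains(letter):
    """Return a predicate that remembers `letter` through its closure."""
    def contains(line):
        # `letter` is not an argument here; it is captured from make_contains.
        return letter in line
    return contains


# Hypothetical stand-in data for the lines of a text file.
sample_lines = ["alpha", "beta", "gamma", "echo"]

num_as = len(list(filter(make_contains("a"), sample_lines)))
num_bs = len(list(filter(make_contains("b"), sample_lines)))
print("Lines with a: %i, lines with b: %i" % (num_as, num_bs))
```

In a PySpark program, a call such as `logData.filter(make_contains("a")).count()` would presumably ship both the function and the captured letter to the workers.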

For applications that use custom classes or third-party libraries, we can also add code dependencies to spark-submit through its --py-files argument by packaging them into a .zip file (see spark-submit --help for details). SimpleApp is simple enough that we do not need to specify any code dependencies.

For applications that use custom classes or third-party libraries, you can also pass code dependencies to spark-submit by packaging them into a .zip file and supplying it with the --py-files option (see spark-submit --help for details). SimpleApp is simple enough that no code dependencies need to be specified.
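Packaging dependencies into a .zip file can be done with Python's standard zipfile module. The following is a minimal sketch under assumptions: the module name `helpers.py`, its contents, the archive name `deps.zip`, and the function `build_deps_zip` are all invented for illustration.

```python
import os
import tempfile
import zipfile


def build_deps_zip(module_sources, zip_path):
    """Write each {filename: source} entry into a .zip archive at zip_path."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for filename, source in module_sources.items():
            archive.writestr(filename, source)
    return zip_path


# Hypothetical helper module that an application might depend on.
helper_source = "def has_letter(line, letter):\n    return letter in line\n"

work_dir = tempfile.mkdtemp()
deps_zip = build_deps_zip(
    {"helpers.py": helper_source},
    os.path.join(work_dir, "deps.zip"),
)

# List the archive contents to confirm what was packaged.
with zipfile.ZipFile(deps_zip) as archive:
    packaged = archive.namelist()
print(packaged)
```

The resulting archive would then be handed to spark-submit through its --py-files argument; check spark-submit --help for the exact usage in your Spark version.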

We can run this application using the bin/spark-submit script:

You can run this application with the bin/spark-submit script.

# Use spark-submit to run your application
$ YOUR_SPARK_HOME/bin/spark-submit \
  --master local[4] \
  SimpleApp.py
...
Lines with a: 46, Lines with b: 23

Where to Go from Here

Congratulations on running your first Spark application! For an in-depth overview of the API, start with the Spark programming guide, or see “Programming Guides” menu for other components. For running applications on a cluster, head to the deployment overview. Finally, Spark includes several samples in the examples directory (Scala, Java, Python). You can run them as follows:

Congratulations on running your first Spark application!

For an in-depth overview of the API, start with the Spark programming guide, or see the "Programming Guides" menu for other components. http://spark.apache.org/docs/latest/programming-guide.html
To run applications on a cluster, head to the deployment overview. http://spark.apache.org/docs/latest/cluster-overview.html
Finally, Spark includes several samples in Scala, Java, and Python in the examples directory. You can run them as follows.

# For Scala and Java, use run-example:
./bin/run-example SparkPi

# For Python examples, use spark-submit directly:
./bin/spark-submit examples/src/main/python/pi.py

This is the end of the quick start.