[PYTHON] Japanese Translation of the Apache Spark Documentation

This is a Japanese translation of the following page. Let's all use Spark. http://spark.apache.org/docs/latest/quick-start.html

Japanese translations of the guide to setting up a Spark environment on AWS EC2 and of the cluster mode overview are also available below, so take a look.

Spark on AWS EC2
http://qiita.com/mychaelstyle/items/b752087a0bee6e41c182
Cluster Mode Overview
http://qiita.com/mychaelstyle/items/610b432a1ef1a7e3d2a0

If you find anything odd in the Japanese translation, please let us know in the comments.

Quick Start

This tutorial provides a quick introduction to using Spark. We will first introduce the API through Spark’s interactive shell (in Python or Scala), then show how to write standalone applications in Java, Scala, and Python. See the programming guide for a more complete reference.

This tutorial is a brief introduction to using Spark. We first introduce the API through Spark's interactive shell (Python or Scala), then show how to write standalone applications in Java, Scala, and Python. See the programming guide for a more detailed reference.

http://spark.apache.org/docs/latest/programming-guide.html

To follow along with this guide, first download a packaged release of Spark from the Spark website. Since we won’t be using HDFS, you can download a package for any version of Hadoop.

To follow this guide, first download a packaged release of Spark from the Spark website. The guide does not use HDFS, so a package built for any version of Hadoop will do.

Interactive Analysis with the Spark Shell

Interactive analysis using the Spark shell

Basics

Spark’s shell provides a simple way to learn the API, as well as a powerful tool to analyze data interactively. It is available in either Scala (which runs on the Java VM and is thus a good way to use existing Java libraries) or Python. Start it by running the following in the Spark directory:

The Spark shell is an interactive tool that makes it easy to learn the API and to analyze data interactively. It is available in Scala (which runs on the JVM and is therefore a good way to use existing Java libraries) or Python. Start the Spark shell by running the following command in the Spark directory.

`Scala`


./bin/spark-shell

`Python`


./bin/pyspark

Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs. Let’s make a new RDD from the text of the README file in the Spark source directory:

Spark's most important abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs. Let's create a new RDD from the text of the README file in the Spark source directory.

`Scala`


scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3

`Python`


>>> textFile = sc.textFile("README.md")

RDDs have actions, which return values, and transformations, which return pointers to new RDDs. Let’s start with a few actions:

RDDs have actions, which return a value, and transformations, which return a pointer to a new RDD. Let's start with a few actions.

`Scala`


scala> textFile.count() // Number of items (lines) in this RDD
res0: Long = 126

scala> textFile.first() // First item (first line) in this RDD
res1: String = # Apache Spark

`Python`


>>> textFile.count() # Number of items in this RDD
126

>>> textFile.first() # First item in this RDD
u'# Apache Spark'

Now let’s use a transformation. We will use the filter transformation to return a new RDD with a subset of the items in the file.

Now let's use a transformation. The filter transformation returns a new RDD containing a subset of the items in the file.

`Scala`


scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4af09

`Python`


>>> linesWithSpark = textFile.filter(lambda line: "Spark" in line)

We can chain together transformations and actions:

Transformations and actions can also be chained together.

`Scala`


scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"?
res3: Long = 15

`Python`


>>> textFile.filter(lambda line: "Spark" in line).count() # How many lines contain "Spark"?
15
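
To make the distinction between transformations and actions more concrete, here is a toy, single-machine stand-in written in plain Python. `MiniRDD` and the sample lines are hypothetical and are not part of PySpark; the sketch only mimics the idea that `filter` describes a new dataset without touching the data, while `count` and `first` walk the data and return values.

```python
class MiniRDD:
    """Toy, single-machine stand-in for an RDD (hypothetical, not PySpark)."""

    def __init__(self, source):
        # `source` is a zero-argument callable that returns a fresh iterator.
        self._source = source

    @classmethod
    def from_lines(cls, lines):
        items = list(lines)
        return cls(lambda: iter(items))

    def filter(self, predicate):
        # Transformation: no data is read here, a new MiniRDD is only described.
        return MiniRDD(lambda: (x for x in self._source() if predicate(x)))

    def count(self):
        # Action: walks the data and returns a plain value.
        return sum(1 for _ in self._source())

    def first(self):
        # Action: returns the first item (raises StopIteration if empty).
        return next(self._source())


# Made-up lines standing in for README.md.
readme = MiniRDD.from_lines([
    "# Apache Spark",
    "",
    "Spark is a fast and general cluster computing system.",
    "It provides high-level APIs in Scala, Java, and Python.",
])

lines_with_spark = readme.filter(lambda line: "Spark" in line)  # nothing computed yet
print(readme.count())            # action
print(readme.first())            # action
print(lines_with_spark.count())  # transformation chained with an action
```

Because each `MiniRDD` keeps only a recipe for producing its items, chaining `filter(...).count()` works the same way as in the shell session above, just without any distribution.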

More on RDD Operations

More about RDD operations

RDD actions and transformations can be used for more complex computations. Let’s say we want to find the line with the most words:

RDD actions and transformations can be used for more complex computations. Suppose we want to find the line with the most words.

`Scala`


scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
res4: Int = 15

This first maps a line to an integer value, creating a new RDD. reduce is called on that RDD to find the largest line count. The arguments to map and reduce are Scala function literals (closures), and can use any language feature or Scala/Java library. For example, we can easily call functions declared elsewhere. We’ll use Math.max() function to make this code easier to understand:

First, this maps each line to an integer value, creating a new RDD. reduce is then called on that RDD to find the largest word count. The arguments to map and reduce are Scala function literals (closures), and any language feature or Scala/Java library can be used in them. For example, you can easily call functions declared elsewhere. Using the Math.max() function makes this code easier to understand.

`Scala`


scala> import java.lang.Math
import java.lang.Math

scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
res5: Int = 15
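
This example is given only in Scala. As a rough illustration of the same map-then-reduce shape, here is a plain-Python sketch that uses the built-in `map` and `functools.reduce` on an in-memory list rather than an RDD. The sample lines are made up, and the built-in `max` plays the role that `Math.max` plays in the Scala version.

```python
from functools import reduce

# Hypothetical sample lines standing in for README.md.
lines = [
    "# Apache Spark",
    "Spark is a fast and general cluster computing system",
    "It provides high-level APIs",
]

# map: each line becomes the number of words it contains.
word_counts_per_line = map(lambda line: len(line.split(" ")), lines)

# reduce: repeatedly keep the larger of two counts.
most_words = reduce(max, word_counts_per_line)
print(most_words)
```

Passing a named function such as `max` instead of an inline conditional is the same readability choice the Scala example makes.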

One common data flow pattern is MapReduce, as popularized by Hadoop. Spark can implement MapReduce flows easily:

One common data flow pattern is MapReduce, which Hadoop made popular. Spark makes MapReduce flows easy to implement.

`Scala`


scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts: spark.RDD[(String, Int)] = spark.ShuffledAggregatedRDD@71f027b8

Here, we combined the flatMap, map and reduceByKey transformations to compute the per-word counts in the file as an RDD of (String, Int) pairs. To collect the word counts in our shell, we can use the collect action:

Here, the flatMap, map, and reduceByKey transformations are combined to compute the per-word counts in the file as an RDD of (String, Int) pairs. The collect action can be used to gather the word counts in the shell.

`Scala`


scala> wordCounts.collect()
res6: Array[(String, Int)] = Array((means,1), (under,2), (this,3), (Because,1), (Python,2), (agree,1), (cluster.,1), ...)
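
To see what each of the three steps contributes, the word count can be sketched in plain Python on a small in-memory list. This is not PySpark: `reduce_by_key` is a hypothetical helper written for this illustration, and the input lines are made up.

```python
from itertools import chain

lines = ["to be or", "not to be"]  # hypothetical input

# flatMap: split every line and flatten the results into one stream of words.
words = chain.from_iterable(line.split(" ") for line in lines)

# map: pair each word with the count 1.
pairs = ((word, 1) for word in words)


def reduce_by_key(func, key_value_pairs):
    """Combine all values that share a key using func, like (a, b) => a + b."""
    result = {}
    for key, value in key_value_pairs:
        result[key] = func(result[key], value) if key in result else value
    return result


# reduceByKey: add up the 1s for each distinct word.
word_counts = reduce_by_key(lambda a, b: a + b, pairs)
print(sorted(word_counts.items()))
```

The dictionary returned here corresponds loosely to what collect brings back to the shell: one (word, count) pair per distinct word.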

Caching

Spark also supports pulling data sets into a cluster-wide in-memory cache. This is very useful when data is accessed repeatedly, such as when querying a small “hot” dataset or when running an iterative algorithm like PageRank. As a simple example, let’s mark our linesWithSpark dataset to be cached:

Spark can also pull data sets into a cluster-wide in-memory cache. This is very useful when data is accessed repeatedly, for example when querying a small "hot" dataset or running an iterative algorithm such as PageRank.

As a simple example, let's mark the linesWithSpark dataset to be cached.

`Scala`


scala> linesWithSpark.cache()
res7: spark.RDD[String] = spark.FilteredRDD@17e51082

scala> linesWithSpark.count()
res8: Long = 15

scala> linesWithSpark.count()
res9: Long = 15
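
The effect of marking a dataset as cached can be imitated on a single machine. The following plain-Python sketch is only an analogy: `CountingSource` and `MiniDataset` are hypothetical classes, not Spark APIs. It assumes the behaviour described above, namely that the dataset is merely marked by `cache()` and that later actions reuse the stored result instead of reading the source again.

```python
class CountingSource:
    """Hypothetical data source that records how often it is read in full."""

    def __init__(self, lines):
        self._lines = list(lines)
        self.reads = 0

    def read(self):
        self.reads += 1
        return list(self._lines)


class MiniDataset:
    """Toy filtered dataset: recomputes from the source unless cache() was called."""

    def __init__(self, source, predicate):
        self._source = source
        self._predicate = predicate
        self._use_cache = False
        self._cached = None

    def cache(self):
        # Only marks the dataset; nothing is computed until an action runs.
        self._use_cache = True
        return self

    def _compute(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        result = [x for x in self._source.read() if self._predicate(x)]
        if self._use_cache:
            self._cached = result
        return result

    def count(self):
        # Action: triggers the computation (or reuses the cached result).
        return len(self._compute())


source = CountingSource(["# Apache Spark", "", "Spark is fast"])
lines_with_spark = MiniDataset(source, lambda line: "Spark" in line).cache()

first_count = lines_with_spark.count()   # reads the source and fills the cache
second_count = lines_with_spark.count()  # served from the cache
print(first_count, second_count, source.reads)
```

In this toy version the saving is one extra pass over a three-line list; the point of the real feature is that the same idea applies when the data is spread across many nodes.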

It may seem silly to use Spark to explore and cache a 100-line text file. The interesting part is that these same functions can be used on very large data sets, even when they are striped across tens or hundreds of nodes. You can also do this interactively by connecting bin/spark-shell to a cluster, as described in the programming guide.

It may seem silly to use Spark to explore and cache a 100-line text file. The interesting part is that these same functions work on very large data sets, even when they are striped across tens or hundreds of nodes. You can also run them interactively by connecting bin/spark-shell to a cluster, as described in the programming guide.

Standalone Applications

Scala

Now say we wanted to write a standalone application using the Spark API. We will walk through a simple application in Scala (with SBT), Java (with Maven), and Python.

Now let's see how to write a standalone application that uses the Spark API. We will look at a simple application in Scala (with SBT), Java (with Maven), and Python.

We’ll create a very simple Spark application in Scala. So simple, in fact, that it’s named SimpleApp.scala:

Let's create a very simple Spark application. It is so simple, in fact, that it is named SimpleApp.scala.

`SimpleApp.scala`


/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "YOUR_SPARK_HOME/README.md" // Path to a file on your machine
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}

This program just counts the number of lines containing ‘a’ and the number containing ‘b’ in the Spark README. Note that you’ll need to replace YOUR_SPARK_HOME with the location where Spark is installed. Unlike the earlier examples with the Spark shell, which initializes its own SparkContext, we initialize a SparkContext as part of the program.

This program just counts the number of lines containing 'a' and the number containing 'b' in the Spark README. Replace YOUR_SPARK_HOME with the path of the directory where Spark is installed. Unlike the earlier examples with the Spark shell, which initializes its own SparkContext, here we initialize a SparkContext as part of the program.

We pass the SparkContext constructor a SparkConf object which contains information about our application.

We pass the SparkContext constructor a SparkConf object that contains information about the application.

Our application depends on the Spark API, so we’ll also include an sbt configuration file, simple.sbt which explains that Spark is a dependency. This file also adds a repository that Spark depends on:

Because the application depends on the Spark API, we also include an sbt configuration file, simple.sbt, which declares Spark as a dependency. This file also adds a repository that Spark depends on.

`simple.sbt`


name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"

For sbt to work correctly, we'll need to lay out SimpleApp.scala and simple.sbt according to the typical directory structure. Once that is in place, we can create a JAR package containing the application's code, then use the spark-submit script to run our program.

For sbt to work correctly, lay out SimpleApp.scala and simple.sbt according to the typical directory structure. Once that is in place, you can build a JAR package containing the application's code and run the program with the spark-submit script.

# Your directory layout should look like this
$ find .
.
./simple.sbt
./src
./src/main
./src/main/scala
./src/main/scala/SimpleApp.scala

# Package a jar containing your application
$ sbt package
...
[info] Packaging {..}/{..}/target/scala-2.10/simple-project_2.10-1.0.jar

# Use spark-submit to run your application
$ YOUR_SPARK_HOME/bin/spark-submit \
  --class "SimpleApp" \
  --master local[4] \
  target/scala-2.10/simple-project_2.10-1.0.jar
...
Lines with a: 46, Lines with b: 23

Java

This example will use Maven to compile an application jar, but any similar build system will work. We’ll create a very simple Spark application, SimpleApp.java:

This example uses Maven to compile the application JAR, but any similar build system will work. Let's create a very simple Spark application called SimpleApp.java.

`SimpleApp.java`


/* SimpleApp.java */
import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;

public class SimpleApp {
  public static void main(String[] args) {
    String logFile = "YOUR_SPARK_HOME/README.md"; // Should be some file on your system
    SparkConf conf = new SparkConf().setAppName("Simple Application");
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaRDD<String> logData = sc.textFile(logFile).cache();

    long numAs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("a"); }
    }).count();

    long numBs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("b"); }
    }).count();

    System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
  }
}

This program just counts the number of lines containing ‘a’ and the number containing ‘b’ in a text file. Note that you’ll need to replace YOUR_SPARK_HOME with the location where Spark is installed.

This program just counts the number of lines containing 'a' and the number containing 'b' in a text file. Replace YOUR_SPARK_HOME with the path of the Spark installation directory.

As with the Scala example, we initialize a SparkContext, though we use the special JavaSparkContext class to get a Java-friendly one. We also create RDDs (represented by JavaRDD) and run transformations on them. Finally, we pass functions to Spark by creating classes that extend spark.api.java.function.Function. The Spark programming guide describes these differences in more detail.

As in the Scala example, we initialize a SparkContext, but here we use the special JavaSparkContext class to get a Java-friendly one. We also create RDDs (represented by JavaRDD) and run transformations on them. Finally, we pass functions to Spark by creating classes that extend spark.api.java.function.Function. The Spark programming guide describes these differences in more detail.

http://spark.apache.org/docs/latest/programming-guide.html

To build the program, we also write a Maven pom.xml file that lists Spark as a dependency. Note that Spark artifacts are tagged with a Scala version.

To build the program, declare the Spark dependency in a Maven pom.xml file as follows. Note that Spark artifacts are tagged with a Scala version.

`pom.xml`


<project>
  <groupId>edu.berkeley</groupId>
  <artifactId>simple-project</artifactId>
  <modelVersion>4.0.0</modelVersion>
  <name>Simple Project</name>
  <packaging>jar</packaging>
  <version>1.0</version>
  <repositories>
    <repository>
      <id>Akka repository</id>
      <url>http://repo.akka.io/releases</url>
    </repository>
  </repositories>
  <dependencies>
    <dependency> <!-- Spark dependency -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.0.0</version>
    </dependency>
  </dependencies>
</project>

We lay out these files according to the canonical Maven directory structure:

Lay out these files according to the canonical Maven directory structure.

$ find .
./pom.xml
./src
./src/main
./src/main/java
./src/main/java/SimpleApp.java

Now, we can package the application using Maven and execute it with ./bin/spark-submit.

We can then package the application with Maven and run it with the ./bin/spark-submit script.

# Package a jar containing your application
$ mvn package
...
[INFO] Building jar: {..}/{..}/target/simple-project-1.0.jar

# Use spark-submit to run your application
$ YOUR_SPARK_HOME/bin/spark-submit \
  --class "SimpleApp" \
  --master local[4] \
  target/simple-project-1.0.jar
...
Lines with a: 46, lines with b: 23

Python

Now we will show how to write a standalone application using the Python API (PySpark). As an example, we’ll create a simple Spark application, SimpleApp.py:

Now let's see how to write a standalone application using the Python API (PySpark). As an example, we create a simple Spark application called SimpleApp.py.

`SimpleApp.py`


"""SimpleApp.py"""
from pyspark import SparkContext

logFile = "YOUR_SPARK_HOME/README.md"  # Should be some file on your system
sc = SparkContext("local", "Simple App")
logData = sc.textFile(logFile).cache()

numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()

print("Lines with a: %i, lines with b: %i" % (numAs, numBs))

This program just counts the number of lines containing ‘a’ and the number containing ‘b’ in a text file. Note that you’ll need to replace YOUR_SPARK_HOME with the location where Spark is installed.

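This program just counts the number of lines containing 'a' and the number containing 'b' in a text file. Replace YOUR_SPARK_HOME with the path where Spark is installed.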

As with the Scala and Java examples, we use a SparkContext to create RDDs. We can pass Python functions to Spark, which are automatically serialized along with any variables that they reference.

As in the Scala and Java examples, we use a SparkContext to create RDDs. Python functions can be passed directly to Spark, and they are automatically serialized along with any variables they reference.

For applications that use custom classes or third-party libraries, we can also add code dependencies to spark-submit through its --py-files argument by packaging them into a .zip file (see spark-submit --help for details). SimpleApp is simple enough that we do not need to specify any code dependencies.

For applications that use custom classes or third-party libraries, code dependencies can be packaged into a .zip file and passed to spark-submit through its --py-files option (see spark-submit --help for details). SimpleApp is simple enough that no code dependencies need to be specified.
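
Packaging dependencies into a .zip file can be done with Python's standard `zipfile` module. The sketch below is one possible way to do it, not a method prescribed by Spark; the helper `zip_package` and the names `mylib` and `deps.zip` are made up for the illustration.

```python
import os
import tempfile
import zipfile


def zip_package(package_dir, zip_path):
    """Bundle the .py files of a package directory into a .zip archive.

    The package name is kept as the top-level folder inside the archive.
    """
    parent = os.path.dirname(os.path.abspath(package_dir))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(package_dir):
            for name in files:
                if name.endswith(".py"):
                    full_path = os.path.join(root, name)
                    archive.write(full_path, os.path.relpath(full_path, parent))
    return zip_path


# Demo with a throwaway package; `mylib` and `deps.zip` are hypothetical names.
work_dir = tempfile.mkdtemp()
package_dir = os.path.join(work_dir, "mylib")
os.makedirs(package_dir)
with open(os.path.join(package_dir, "__init__.py"), "w") as handle:
    handle.write("def greet():\n    return 'hello'\n")

zip_path = zip_package(package_dir, os.path.join(work_dir, "deps.zip"))
with zipfile.ZipFile(zip_path) as archive:
    archived_names = archive.namelist()
print(archived_names)
```

The resulting archive would then be named after `--py-files` on the spark-submit command line, alongside the application script.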

We can run this application using the bin/spark-submit script:

You can use the bin/spark-submit script to run this application.

# Use spark-submit to run your application
$ YOUR_SPARK_HOME/bin/spark-submit \
  --master local[4] \
  SimpleApp.py
...
Lines with a: 46, lines with b: 23
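
Before submitting the job, the counting logic itself can be checked locally without Spark. The following sketch assumes nothing about PySpark: `count_lines_containing` is a hypothetical helper that mirrors `filter(...).count()` from SimpleApp.py, and the temporary file is a made-up stand-in for README.md.

```python
import os
import tempfile


def count_lines_containing(lines, needle):
    """Count the lines that contain `needle`, like filter(...).count() in SimpleApp.py."""
    return sum(1 for line in lines if needle in line)


# Hypothetical stand-in for README.md, written to a temporary file.
handle, path = tempfile.mkstemp(suffix=".md")
with os.fdopen(handle, "w") as sample:
    sample.write("# Apache Spark\nbuild\nabc\n")

# Read the file as a list of lines, roughly what textFile provides as an RDD.
with open(path) as sample:
    log_data = sample.read().splitlines()
os.remove(path)

num_as = count_lines_containing(log_data, "a")
num_bs = count_lines_containing(log_data, "b")
print("Lines with a: %i, lines with b: %i" % (num_as, num_bs))
```

Keeping the predicate logic in an ordinary function like this makes it easy to verify on a few lines before running it against a real file through spark-submit.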

Where to Go from Here

Congratulations on running your first Spark application! For an in-depth overview of the API, start with the Spark programming guide, or see “Programming Guides” menu for other components. For running applications on a cluster, head to the deployment overview. Finally, Spark includes several samples in the examples directory (Scala, Java, Python). You can run them as follows:

Congratulations on running your first Spark application!

For an in-depth overview of the API, start with the Spark programming guide, or see the "Programming Guides" menu for other components. http://spark.apache.org/docs/latest/programming-guide.html
To run applications on a cluster, head to the deployment overview. http://spark.apache.org/docs/latest/cluster-overview.html
Finally, Spark includes several samples in Scala, Java, and Python in the examples directory. You can run them as follows.

# For Scala and Java, use run-example:
./bin/run-example SparkPi

# For Python examples, use spark-submit directly:
./bin/spark-submit examples/src/main/python/pi.py

That concludes the Quick Start.
