--Do the Spark tutorial on the following page
export SPARK_HOME='/usr/local/bin/spark-2.2.0-bin-hadoop2.7'
If you chose the without-hadoop package when installing, you also need to install Hadoop separately and set HADOOP_HOME.
When using PySpark, install it with
pip install pyspark
If the installed version differs from your Spark version it will not work well, so if you need a specific version, pin it like
pip install pyspark==2.2.1
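To check which version pip actually installed (and whether it matches the Spark under SPARK_HOME), a quick throwaway check like this works:
# check the pip-installed PySpark version; it should match the Spark build (e.g. 2.2.x)
import pyspark
print(pyspark.__version__)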
The directory structure for the Scala version looks like this. The contents are explained below.
$ tree
.
├── SimpleApp.scala
├── build.sbt
├── input.txt
└── run.sh
SimpleApp.scala is almost the same as the one in the tutorial; only the input has been changed slightly. It reads a text file and counts how many lines contain "a" and how many contain "p".
SimpleApp.scala
/* SimpleApp.scala */
import org.apache.spark.sql.SparkSession
object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "input.txt" // Should be some file on your system
    val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
    val logData = spark.read.textFile(logFile).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numPs = logData.filter(line => line.contains("p")).count()
    println(s"Lines with a: $numAs, Lines with p: $numPs")
    spark.stop()
  }
}
input.txt just contains some arbitrary sentences:
this is a pen
this is an apple
apple pen
pen pen
Running
sbt package
generates a jar file under target/ based on the contents of build.sbt. (The target/ directory is also created automatically at this point.) Example build.sbt:
build.sbt
name := "simple"
version := "1.0"
scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0"
In this case, the generated jar file will be target/scala-2.11/simple_2.11-1.0.jar.
run.sh is a script for execution.
run.sh
spark-submit \
--class SimpleApp \
--master local[4] \
--conf spark.driver.host=localhost \
target/scala-2.11/simple_2.11-1.0.jar
When you run
sh run.sh
a long stream of log output appears, but if it includes
Lines with a: 3, Lines with p: 4
somewhere, it worked. **Of the input lines, 3 contain "a" and 4 contain "p".**
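As a sanity check, the same counts can be reproduced with plain Python against input.txt (a throwaway sketch, not part of the tutorial):
# count lines of input.txt containing "a" and "p" without Spark
with open("input.txt") as f:
    lines = f.read().splitlines()
print(sum("a" in line for line in lines))  # 3
print(sum("p" in line for line in lines))  # 4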
The spark.driver.host=localhost part is not in the tutorial, but without it my environment threw the error
Exception in thread "main" java.lang.AssertionError: assertion failed: Expected hostname
so I added it.
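Instead of passing it with --conf on the command line, the same setting can also be made when building the session. A minimal sketch, shown here in PySpark (the Scala builder has the same config method):
# set spark.driver.host in code rather than via spark-submit --conf
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Simple Application") \
    .config("spark.driver.host", "localhost") \
    .getOrCreate()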
The Python version is a little simpler. The file structure is as follows:
$ tree
.
├── SimpleApp.py
├── input.txt
└── run.sh
SimpleApp.py
from pyspark.sql import SparkSession
logFile = "input.txt" # Should be some file on your system
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
logData = spark.read.text(logFile).cache()
numAs = logData.filter(logData.value.contains('a')).count()
numPs = logData.filter(logData.value.contains('p')).count()
print("Lines with a: %i, lines with p: %i" % (numAs, numPs))
spark.stop()
run.sh
spark-submit \
--master local[4] \
--conf spark.driver.host=localhost \
SimpleApp.py
input.txt is the same as before. If PySpark is installed via pip, you don't even need the script; it seems you can run it directly with
python SimpleApp.py
This time I'm running Spark through SparkSession. You can do the same thing with SparkContext. As for the difference between the two, SparkSession holds a SparkContext inside it and is the newer API, introduced after SparkContext.
--Reference: http://www.ne.jp/asahi/hishidama/home/tech/scala/spark/SparkSession.html
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
The line above creates the Spark session, specifying the application name with appName (getOrCreate returns an existing session if one is already running).
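Incidentally, the SparkContext mentioned above is held inside this session and is reachable as spark.sparkContext, so the older RDD API is still available. A small sketch that continues from the spark created above (not part of SimpleApp.py):
sc = spark.sparkContext                     # the SparkContext wrapped by the SparkSession
rdd = sc.parallelize([1, 2, 3, 4])          # build an RDD directly
print(rdd.map(lambda x: x * 2).collect())   # [2, 4, 6, 8]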
logData = spark.read.text(logFile).cache()
This reads the file. cache() appears to be a setting that keeps the data persisted in memory (it worked without this call). Here we read a plain text file, but structured data such as CSV files and SQL tables can also be read. The loaded data is distributed and computed as an RDD.
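For example, reading a CSV file instead of plain text would look roughly like this (the file name and columns are made up for illustration):
# hypothetical example: reading structured data with the same spark session
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.printSchema()   # column names and inferred types
print(df.count())  # number of rows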
RDDs provide various functions such as map and filter.
--Reference: https://dev.classmethod.jp/articles/apache-spark_rdd_investigation/
filter is a method that does literally that: it filters. Here the file has been read line by line, and count() is used to count how many lines contain a particular string.
numAs = logData.filter(logData.value.contains('a')).count()
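For comparison, the same count can also be written with the RDD API mentioned earlier; a sketch that reuses the running session (the DataFrame filter above is what SimpleApp.py actually uses):
# equivalent count using the RDD API instead of the DataFrame filter
rdd = spark.sparkContext.textFile("input.txt")         # RDD of lines
num_as = rdd.filter(lambda line: "a" in line).count()  # lines containing "a"
print(num_as)  # 3 for the input.txt above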
The SparkSession appears to keep an HTTP server running in the background (the Spark web UI), so stop it with stop() at the end.
spark.stop()
--I did the Spark tutorial
--If there is a mistake in the explanation, I will correct and update it as needed.