--Do the Spark tutorial on the following page
export SPARK_HOME='/usr/local/bin/spark-2.2.0-bin-hadoop2.7'
If you chose the without-hadoop package when installing, you also need to install Hadoop separately and set HADOOP_HOME.
When using PySpark, install it with
pip install pyspark
If the installed version differs from your Spark version it will not work well, so if you need a specific version, pin it like
pip install pyspark==2.2.1
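To check which version pip actually installed (and whether it matches the Spark under SPARK_HOME), a quick throwaway check like this works:
# check the pip-installed PySpark version; it should match the Spark build (e.g. 2.2.x)
import pyspark
print(pyspark.__version__)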
The directory structure for the Scala version looks like this. The contents are explained below.
$ tree
.
├── SimpleApp.scala
├── build.sbt
├── input.txt
└── run.sh
SimpleApp.scala is almost the same as the one in the tutorial; only the input has been changed slightly. It reads a text file and counts how many lines contain "a" and how many contain "p".
SimpleApp.scala
/* SimpleApp.scala */
import org.apache.spark.sql.SparkSession
object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "input.txt" // Should be some file on your system
    val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
    val logData = spark.read.textFile(logFile).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numPs = logData.filter(line => line.contains("p")).count()
    println(s"Lines with a: $numAs, Lines with p: $numPs")
    spark.stop()
  }
}
input.txt just contains some arbitrary sentences:
this is a pen
this is an apple
apple pen
pen pen
Running
sbt package
generates a jar file under target/ based on the contents of build.sbt. (The target/ directory is also created automatically at this point.) Example build.sbt:
build.sbt
name := "simple"
version := "1.0"
scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0"
In this case, the generated jar file will be target/scala-2.11/simple_2.11-1.0.jar.
run.sh is a script for execution.
run.sh
spark-submit \
--class SimpleApp \
--master local[4] \
--conf spark.driver.host=localhost \
target/scala-2.11/simple_2.11-1.0.jar
When you run
sh run.sh
a long stream of log output appears, but if it includes
Lines with a: 3, Lines with p: 4
somewhere, it worked. **Of the input lines, 3 contain "a" and 4 contain "p".**
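As a sanity check, the same counts can be reproduced with plain Python against input.txt (a throwaway sketch, not part of the tutorial):
# count lines of input.txt containing "a" and "p" without Spark
with open("input.txt") as f:
    lines = f.read().splitlines()
print(sum("a" in line for line in lines))  # 3
print(sum("p" in line for line in lines))  # 4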
The spark.driver.host=localhost part is not in the tutorial, but without it my environment threw the error
Exception in thread "main" java.lang.AssertionError: assertion failed: Expected hostname
so I added it.
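Instead of passing it with --conf on the command line, the same setting can also be made when building the session. A minimal sketch, shown here in PySpark (the Scala builder has the same config method):
# set spark.driver.host in code rather than via spark-submit --conf
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Simple Application") \
    .config("spark.driver.host", "localhost") \
    .getOrCreate()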
The Python version is a little simpler. The file structure is as follows:
$ tree
.
├── SimpleApp.py
├── input.txt
└── run.sh
SimpleApp.py
from pyspark.sql import SparkSession
logFile = "input.txt" # Should be some file on your system
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
logData = spark.read.text(logFile).cache()
numAs = logData.filter(logData.value.contains('a')).count()
numPs = logData.filter(logData.value.contains('p')).count()
print("Lines with a: %i, lines with p: %i" % (numAs, numPs))
spark.stop()
run.sh
spark-submit \
--master local[4] \
--conf spark.driver.host=localhost \
SimpleApp.py
input.txt is the same as before. If PySpark is installed via pip, you don't even need the script; it seems you can run it directly with
python SimpleApp.py
This time I'm running Spark through SparkSession. You can do the same thing with SparkContext. As for the difference between the two, SparkSession holds a SparkContext inside it and is the newer API, introduced after SparkContext.
--Reference: http://www.ne.jp/asahi/hishidama/home/tech/scala/spark/SparkSession.html
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
The line above creates the Spark session, specifying the application name with appName (getOrCreate returns an existing session if one is already running).
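Incidentally, the SparkContext mentioned above is held inside this session and is reachable as spark.sparkContext, so the older RDD API is still available. A small sketch that continues from the spark created above (not part of SimpleApp.py):
sc = spark.sparkContext                     # the SparkContext wrapped by the SparkSession
rdd = sc.parallelize([1, 2, 3, 4])          # build an RDD directly
print(rdd.map(lambda x: x * 2).collect())   # [2, 4, 6, 8]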
logData = spark.read.text(logFile).cache()
This reads the file. cache() appears to be a setting that keeps the data persisted in memory (it worked without this call). Here we read a plain text file, but structured data such as CSV files and SQL tables can also be read. The loaded data is distributed and computed as an RDD.
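For example, reading a CSV file instead of plain text would look roughly like this (the file name and columns are made up for illustration):
# hypothetical example: reading structured data with the same spark session
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.printSchema()   # column names and inferred types
print(df.count())  # number of rows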
RDDs provide various functions such as map and filter.
--Reference: https://dev.classmethod.jp/articles/apache-spark_rdd_investigation/
filter is a method that does literally that: it filters. Here the file has been read line by line, and count() is used to count how many lines contain a particular string.
numAs = logData.filter(logData.value.contains('a')).count()
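For comparison, the same count can also be written with the RDD API mentioned earlier; a sketch that reuses the running session (the DataFrame filter above is what SimpleApp.py actually uses):
# equivalent count using the RDD API instead of the DataFrame filter
rdd = spark.sparkContext.textFile("input.txt")         # RDD of lines
num_as = rdd.filter(lambda line: "a" in line).count()  # lines containing "a"
print(num_as)  # 3 for the input.txt above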
The SparkSession appears to keep an HTTP server running in the background (the Spark web UI), so stop it with stop() at the end.
spark.stop()
--I did the Spark tutorial
--If there is a mistake in the explanation, I will correct and update it as needed.