[Python, Scala] Do a tutorial for Apache Spark

Overview

--Work through the Spark tutorial from the official documentation page

Environment

export SPARK_HOME='/usr/local/bin/spark-2.2.0-bin-hadoop2.7'

If you chose the without-hadoop package when installing, you also need to install Hadoop separately and set HADOOP_HOME. When using PySpark, install it with something like

pip install pyspark

If its version differs from the installed Spark, it will not work properly, so if you want to pin a specific version, specify it like

pip install pyspark==2.2.1
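
As a quick sanity check, you can also confirm from Python which version pip actually picked up and compare it with the Spark under SPARK_HOME (a minimal sketch; the version mentioned in the comment is just an example):

import pyspark

# should match the Spark installation, e.g. 2.2.x in this setup
print(pyspark.__version__)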

Try running with Scala

The directory structure looks like this. The contents will be explained below.

$ tree
.
├── SimpleApp.scala
├── build.sbt
├── input.txt
└── run.sh

SimpleApp.scala is almost the same as in the tutorial; only the input has been changed slightly. It reads a text file and counts how many lines contain "a" and how many contain "p".

SimpleApp.scala


/* SimpleApp.scala */
import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "input.txt" // Should be some file on your system
    val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
    val logData = spark.read.textFile(logFile).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numPs = logData.filter(line => line.contains("p")).count()
    println(s"Lines with a: $numAs, Lines with p: $numPs")
    spark.stop()
  }
}

Put some arbitrary sentences in input.txt:

this is a pen
this is an apple
apple pen
pen pen

Running

sbt package

generates a jar file under target based on the contents of build.sbt. (The target directory is also created automatically at this point.) Example build.sbt:

build.sbt


name := "simple"

version := "1.0"

scalaVersion := "2.11.12"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0"

In this case, the generated jar file will be target/scala-2.11/simple_2.11-1.0.jar.

run.sh is a script for execution.

run.sh


spark-submit \
 --class SimpleApp \
 --master local[4] \
 --conf spark.driver.host=localhost \
 target/scala-2.11/simple_2.11-1.0.jar 

When you run sh run.sh, a long stream of output comes out, but if it includes

Lines with a: 3, Lines with p: 4

then it worked. **Of the input lines, 3 contain "a" and 4 contain "p".** The spark.driver.host=localhost part is not in the tutorial, but without it the following error came up in my environment, so I added it:


Exception in thread "main" java.lang.AssertionError: assertion failed: Expected hostname

Try running with Python

It's a little simpler with Python. The file structure is as follows:

$ tree
.
├── SimpleApp.py
├── input.txt
└── run.sh

SimpleApp.py


from pyspark.sql import SparkSession

logFile = "input.txt"  # Should be some file on your system
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
logData = spark.read.text(logFile).cache()
numAs = logData.filter(logData.value.contains('a')).count()
numPs = logData.filter(logData.value.contains('p')).count()

print("Lines with a: %i, lines with p: %i" % (numAs, numPs))

spark.stop()

run.sh


spark-submit \
  --master local[4] \
  --conf spark.driver.host=localhost \
  SimpleApp.py

input.txt is the same as before. If PySpark has been installed with pip, you apparently don't even need the script; it can be run directly with

python SimpleApp.py

Brief explanation

This time I'm running Spark using SparkSession. The same thing can be done with SparkContext. As for the difference between the two, SparkSession apparently holds a SparkContext inside it and is the newer API, introduced after SparkContext.

--Reference: http://www.ne.jp/asahi/hishidama/home/tech/scala/spark/SparkSession.html
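
For reference, the SparkContext held inside a SparkSession can be pulled out directly, so the older RDD-based API remains usable from the same session. A minimal Python sketch (the sample data is made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ContextDemo").getOrCreate()
sc = spark.sparkContext   # the SparkContext wrapped by this session
print(sc.parallelize(["this is a pen", "apple pen"]).count())
spark.stop()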

spark = SparkSession.builder.appName("SimpleApp").getOrCreate()

This line creates the SparkSession instance, specifying its appName.
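
The builder can also carry other settings, such as the master URL or the spark.driver.host workaround used in run.sh, instead of passing them to spark-submit. A sketch under that assumption (the values here are just examples, not from the tutorial):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("SimpleApp")
         .master("local[4]")                          # same role as --master local[4]
         .config("spark.driver.host", "localhost")    # same role as --conf spark.driver.host=localhost
         .getOrCreate())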

logData = spark.read.text(logFile).cache()

This reads the file. cache() seems to be a setting that keeps the data in memory (the code worked without it, too). This time we are reading a plain text file, but structured data such as CSV files and SQL tables can also be read.
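
For example, a CSV file could presumably be read like this (people.csv and the options are hypothetical, and spark is the session created above):

# header=True uses the first row as column names,
# inferSchema=True asks Spark to guess the column types
df = spark.read.csv("people.csv", header=True, inferSchema=True)
df.printSchema()
df.show(5)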

The data that is read is parallelized and processed as an RDD. RDDs provide various functions such as map and filter.

--Reference: https://dev.classmethod.jp/articles/apache-spark_rdd_investigation/
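
A minimal sketch of map and filter on an RDD (the sample lines are made up; spark is the session from above):

rdd = spark.sparkContext.parallelize(["this is a pen", "apple pen", "pen pen"])
lengths = rdd.map(lambda line: len(line))         # transform each line into its length
short = rdd.filter(lambda line: len(line) < 10)   # keep only the short lines
print(lengths.collect(), short.count())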

filter is, as the name says, a method for filtering. This time the file is read line by line, and count() counts how many lines contain a particular string.

numAs = logData.filter(logData.value.contains('a')).count()
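
As an aside, both counts could presumably be computed in a single pass with a DataFrame aggregation instead of two separate filter/count jobs (the alias names are just for illustration; logData is the DataFrame read above):

from pyspark.sql import functions as F

row = logData.agg(
    F.sum(F.col("value").contains("a").cast("int")).alias("lines_with_a"),
    F.sum(F.col("value").contains("p").cast("int")).alias("lines_with_p"),
).collect()[0]
print(row["lines_with_a"], row["lines_with_p"])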

The SparkSession seems to keep an HTTP server (the Spark web UI) running, so stop it with stop() at the end.

spark.stop()
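
If the script might fail partway through, it may be safer to make sure stop() always runs, for example with a plain try/finally (a sketch, not part of the tutorial):

spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
try:
    logData = spark.read.text("input.txt").cache()
    print(logData.filter(logData.value.contains("a")).count())
finally:
    spark.stop()   # release the session even if an error occurred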

Summary

--I worked through a Spark tutorial
--If there are mistakes in the explanation, I will correct and update them as needed
