I went to "Summer is in full swing! Spark + Python + Data Science Festival".

http://connpass.com/event/34680/

As usual, I arrived partway through, and since I also forgot my MacBook I took notes on my iPhone, so the writing below may be a bit rough.

Ibis: Super pandas, making large-scale data analysis easy

Mr. Yasuaki Ariga (@chezou) of Cloudera

http://www.slideshare.net/Cloudera_jp/ibis-pandas-summerds

Demo on Jupyter notebook

Scikit-learn takes over once the training data has been created

Compared to PySpark

spark-sklearn

Can be installed with pip install ibis-framework

If you want to use Impala, Cloudera Director is the recommended way to set it up.

Introduction to the recommendation systems at Ameba

Mr. Haruka Naito of CyberAgent

Recommendation system overview

The following three types of recommendation systems are used in Ameba.

Use of recommendation system

Overview

  1. Activity logs are collected into Hadoop
  2. Recommendation results are sent to HBase
  3. Feedback on the recommendation results (impressions, clicks, etc.)

Item to Item collaborative filtering

User-based collaborative filtering

Recommends items based on the ratings of users who are close (similar) to the target user

Item-based collaborative filtering

Recommends based on distances between items computed from user ratings; reasonable accuracy is possible even when an item has few ratings

Cosine similarity

Divide the co-occurrence count (the number of overlapping users) by the product of the square roots of each item's user count
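As a minimal sketch of that calculation for implicit 0/1 feedback (the matrix and item indices here are toy values, not from the talk):

```python
import numpy as np

# Toy user-item interaction matrix (rows = users, columns = items), implicit 0/1 feedback.
R = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
])

def item_cosine_similarity(R, i, j):
    """Cosine similarity between items i and j for binary feedback:
    co-occurrence count divided by the product of the square roots
    of each item's user count."""
    co_occurrence = np.sum(R[:, i] * R[:, j])                    # users who touched both items
    denom = np.sqrt(R[:, i].sum()) * np.sqrt(R[:, j].sum())
    return co_occurrence / denom if denom else 0.0

print(item_cosine_similarity(R, 0, 1))  # 2 / (sqrt(3) * sqrt(2)) ~= 0.816
```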

Practical tweaks for each use case

Keep it simple

Data is distributed to each worker with broadcast variables, which eliminates the need for complicated joins

They want to limit recommendation results to fresh (recently added) items

Create an item set (filter) in advance and filter the results
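To make those two points concrete, here is a rough PySpark sketch of broadcasting a pre-built fresh-item set and filtering recommendation results against it on each worker; the item ids, scores, and set contents are invented for illustration, not taken from the talk:

```python
from pyspark import SparkContext

sc = SparkContext(appName="broadcast-filter-example")

# Pre-built set of "fresh" item ids, assumed small enough to broadcast to every worker.
fresh_items = sc.broadcast({"item-101", "item-205", "item-309"})

# (item_id, score) pairs produced by some recommendation step (dummy data here).
recs = sc.parallelize([("item-101", 0.9), ("item-042", 0.8), ("item-205", 0.7)])

# Each worker reads the broadcast set locally, so no join (and no shuffle) is needed.
fresh_recs = recs.filter(lambda kv: kv[0] in fresh_items.value)
print(fresh_recs.collect())  # [('item-101', 0.9), ('item-205', 0.7)]
```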

Performance tuning & automation of recommendation engine using Spark

Mr. Nagato Kasaki, DMM.com Lab

A talk about operating the system after it was built

Overview of Spark utilization system

Spark has been used since February 2015.

The number of recommendation jobs grew from 13 to 168 while staying at 3 engineers; this was manageable because the work was automated

Resources grew only about 1.5x, from 230 CPUs / 580 GB to 360 CPUs / 900 GB

Processing time went from 3 hours to 4 hours

Installation automation

Because there are many services, the goal is to make it easy to start using the engine for a new service.

When a new service is to be added:

  1. Write a recipe
  2. Jenkins runs tests based on the recipe
  3. Check performance in staging
  4. Release to production

Since the ratio of users to items varies greatly by service, tuning also has to be done per service.

For a sense of scale: roughly 1 million users and 4 million items

They maintain an item matrix across all services, so cross-service recommendations also become possible

Ranking

Two types of algorithms are used, depending on the case

  1. Data shaping with Hive
  2. Recommendation computation only in Spark
  3. Output to the DB with Sqoop

The recipe defines the parameter settings for Hive, Spark, and Sqoop in JSON.
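The actual recipe format was not shown, but as a hypothetical illustration, a JSON recipe holding per-stage parameters for the Hive → Spark → Sqoop pipeline might look something like this (expressed as a Python dict; every key and value is invented):

```python
import json

# Hypothetical recipe: per-stage parameters for the Hive -> Spark -> Sqoop pipeline.
# Every key and value here is invented for illustration; the real format was not shown.
recipe = {
    "service": "example-service",
    "hive": {"query": "shape_activity_log.hql", "days": 30},
    "spark": {"algorithm": "item_to_item", "num_executors": 20, "executor_memory": "4g"},
    "sqoop": {"export_table": "recommendations", "target_db": "example_db"},
}

print(json.dumps(recipe, indent=2))
```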

Accuracy tuning is done by actually deploying changes and A/B testing them (there are academic evaluation metrics, but some things you only learn by trying). Performance issues are easier to spot, so performance is tuned up front:

  1. Find the bottleneck
  2. Eliminate data skew

Data partitioning sometimes fails because of the 80/20 rule (even when the data is split, it often ends up skewed). When the data can be partitioned well, a job can drop from 3 hours to 3 minutes.
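The talk did not show exactly how the skew was resolved, but one common way to "partition well" when a few keys dominate is to salt the hot keys before aggregating. The following PySpark sketch is only an illustration of that idea, with made-up key names and salt factor:

```python
import random
from pyspark import SparkContext

sc = SparkContext(appName="salting-example")

# Dummy (user_id, value) pairs where one "power user" dominates (the 80/20 problem).
events = sc.parallelize([("heavy_user", 1)] * 1000 + [("user_%d" % i, 1) for i in range(100)])

SALT = 8  # split each key into SALT sub-keys so a hot key no longer lands on one partition

salted = events.map(lambda kv: ((kv[0], random.randint(0, SALT - 1)), kv[1]))
partial = salted.reduceByKey(lambda a, b: a + b)          # aggregate per (key, salt)
totals = partial.map(lambda kv: (kv[0][0], kv[1])) \
                .reduceByKey(lambda a, b: a + b)          # merge the partial results per key
print(totals.take(5))
```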


(The notes below are still being edited.)

Lightning talks (LT)

What a Spark beginner got stuck on while building recommendations

Submitting a job every 15 minutes copies the JAR each time and exhausted disk space; they now submit while recreating the cluster.

When loading from BigQuery, the number of partitions is small, so the executors cannot be fully utilized; repartitioning is important.
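A minimal illustration of that repartitioning point (the BigQuery connector itself is omitted and the partition counts are arbitrary example values):

```python
from pyspark import SparkContext

sc = SparkContext(appName="repartition-example")

# Pretend this RDD came from BigQuery with too few partitions.
rdd = sc.parallelize(range(1000000), numSlices=2)
print(rdd.getNumPartitions())   # 2 -> most executors would sit idle

# Spread the data across more partitions so the whole cluster can work on it.
rdd = rdd.repartition(200)      # 200 is an arbitrary example value
print(rdd.getNumPartitions())   # 200
```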

Computing a direct (Cartesian) product is not feasible because there are too many users, so users are grouped into sets and processed together.

Recommendation engine performance tuning using Spark

Look at the DAG visualization

If the data is not distributed, distribute it; avoid shuffling large amounts of data

Cache RDDs that are used multiple times

There is an option not to serialize when CPU is the bottleneck

KryoSerializer is twice as fast
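A small sketch combining the caching and KryoSerializer points above; the configuration keys are standard Spark settings, but the app name and data are placeholders:

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("tuning-example")
        # Switch to Kryo serialization, generally much faster than Java serialization.
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))
sc = SparkContext(conf=conf)

squares = sc.parallelize(range(1000000)).map(lambda x: x * x)

# Cache the RDD because it is used more than once below.
# (On the JVM side, a deserialized storage level such as MEMORY_ONLY avoids
#  serialization cost, which matters when CPU is the bottleneck.)
squares.cache()

print(squares.count())
print(squares.sum())  # reuses the cached data instead of recomputing the map
```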

Recommended Posts

I went to "Summer is in full swing! Spark + Python + Data Science Festival".
Data science companion in python, how to specify elements in pandas
I tried to implement PLSA in Python
[Data science basics] I tried saving from csv to mysql with python
I tried to implement permutation in Python
I tried to implement PLSA in Python 2
I took Udemy's "Practical Python Data Science"
I tried to implement ADALINE in Python
I wanted to solve ABC159 in Python
I tried to implement PPO in Python
How to use is and == in Python
Books on data science to read in 2020
Python program is slow! I want to speed up! In such a case ...
I want to do Dunnett's test in Python
I was able to recurse in Python: lambda
I want to create a window in Python
I wrote "Introduction to Effect Verification" in Python
I tried to get CloudWatch data with Python
I want to merge nested dicts in Python
I tried to implement TOPIC MODEL in Python
I tried to implement selection sort in python
I want to display the progress in Python!
I want to use a python data source in Re: Dash to get query results
[Impression] [Data analysis starting from zero] Introduction to Python data science learned in business cases
[First data science ⑥] I tried to visualize the market price of restaurants in Tokyo
I want to write in Python! (1) Code format check
I tried to graph the packages installed in Python
I want to embed a variable in a Python string
I want to easily implement a timeout in python
Summary of tools needed to analyze data in Python
[Small story] In Python, i = i + 1 is slightly faster than i + = 1.
I want to write in Python! (2) Let's write a test
Even in JavaScript, I want to see Python `range ()`!
I tried to implement a pseudo pachislot in Python
I want to randomly sample a file in Python
I tried to implement Dragon Quest poker in Python
I was addicted to scraping with Selenium (+ Python) in 2020
I want to work with a robot in python.
I tried to implement GA (genetic algorithm) in Python
I want to write in Python! (3) Utilize the mock
I tried to summarize how to use pandas in python
I tried to analyze J League data with Python
I was able to repeat it in Python: lambda
I want to say that there is data preprocessing ~
I want to use the R dataset in python
I want to do something in Python when I finish
I want to manipulate strings in Kotlin like Python!
I want to be able to analyze data with Python (Part 3)
I want to initialize if the value is empty (python)
How to test that Exception is raised in python unittest
I tried to create API list.csv in Python from swagger.yaml
I tried to make various "dummy data" with Python faker
I tried to implement a one-dimensional cellular automaton in Python
I want to be able to analyze data with Python (Part 1)
I want to do something like sort uniq in Python
Various ways to calculate the similarity between data in python
How to generate exponential pulse time series data in python
I want to be able to analyze data with Python (Part 4)
I want to be able to analyze data with Python (Part 2)
I tried "How to get a method decorated in Python"
I tried to implement the mail sending function in Python