[Python] I tried building five patterns of analytics platforms in three years

It's been three years since I joined the company, so here is a summary of what I've built so far, along with my own notes.

Disclaimer

Due to the limited development periods, I was never able to understand and tune each system in depth. Each platform was built by one person with whatever information I could gather in the time available, so there are probably many places where you would say, "No, no, if you tuned it properly you wouldn't get that result." Please keep that in mind. Also, three years ago I had never touched Linux or SQL, so the systems I was in charge of at the beginning are probably rougher than the later ones.

History

First Generation: A Platform Based on Amazon RedShift

Overview

An analysis platform for viewing KPIs across our series of games in one place. Logs were loaded into RedShift with fluentd, aggregated by a daily batch, and the results were stored in separate RedShift tables. Ruby was used for the batch jobs and CakePHP for the web framework. Since fluentd is written in Ruby, using Ruby meant I could also write fluentd plugins; CakePHP was simply company culture.
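As an illustration, the daily rollup described above might look roughly like this (a minimal sketch in Python for consistency with the rest of this post, even though the actual batch was Ruby; the cluster endpoint and the table and column names are hypothetical):

```python
import psycopg2

# Roll one day's raw login log up into a daily KPI table.
# RedShift speaks the PostgreSQL wire protocol, so psycopg2 works as a client.
DAILY_DAU_SQL = """
INSERT INTO kpi_daily (log_date, game_id, dau)
SELECT log_date, game_id, COUNT(DISTINCT user_id)
FROM login_log
WHERE log_date = %s
GROUP BY log_date, game_id;
"""

def run_daily_batch(target_date):
    conn = psycopg2.connect(host="example-cluster.redshift.amazonaws.com",
                            port=5439, dbname="analytics",
                            user="batch", password="...")
    try:
        with conn.cursor() as cur:
            cur.execute(DAILY_DAU_SQL, (target_date,))
        conn.commit()
    finally:
        conn.close()

if __name__ == "__main__":
    run_daily_batch("2014-01-01")
```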

Development period

Approximately 6 months, from October 2013 to March 2014 (in parallel with other work, including the verification period)

Impressions

Everything was new to me, so all I really remember is desperately trying to learn it...

Second Generation: A Hive-Based Platform on Amazon EMR

Why I migrated

RedShift's maintenance cost was high relative to our small data volume (at the time only Dense Storage nodes were available, so the cost performance was poor), and aggregation took longer than I expected. On the pretext that "Hadoop is popular these days, let's evaluate it," I verified that the aggregation could be done at lower cost, and we adopted it.

Overview

An analysis platform that also covers KPIs for games outside the series. A senior colleague recommended, "Python has rich statistical libraries, so if you're an analytics engineer, why not Python?" So I used Python for the batch jobs and Django for the web framework.

Development period

Approximately 14 months, from April 2014 to June 2015 (in parallel with other work, including the verification period; visualization was handled by someone else)

Impressions

I don't think the aggregation time changed dramatically (I didn't keep detailed notes...), but with EMR the total cost dropped to about half, because the cluster only needs to run for the duration of the aggregation. I also think picking up Python here turned out to be the right call for the Spark and machine learning work that came later. It's easy to write!
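What makes the cost savings possible is the transient-cluster pattern: start a cluster, run the aggregation step, and let it terminate. A minimal sketch with boto3 (which postdates the period described here; the region, instance types, and S3 paths are hypothetical):

```python
import boto3

emr = boto3.client("emr", region_name="ap-northeast-1")

# Launch a transient cluster that runs one Hive script and then shuts down,
# so we only pay for the machines while the aggregation is running.
emr.run_job_flow(
    Name="daily-kpi-batch",
    ReleaseLabel="emr-4.7.0",
    Applications=[{"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m3.xlarge",
        "SlaveInstanceType": "m3.xlarge",
        "InstanceCount": 3,
        # Terminate as soon as all steps finish.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "daily-aggregation",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hive-script", "--run-hive-script",
                     "--args", "-f", "s3://example-bucket/hql/daily_kpi.hql"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```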

Third Generation: A Spark-Based Platform on Amazon EMR

Why I migrated

I wanted to try new technology, and Spark was becoming a hot topic, so I decided to get in early. If the cache worked, the aggregation time could be expected to shrink; a shorter aggregation time meant a lower aggregation cost; and since Spark can be driven from Python, migration would be easy.

Overview

Since this was just a switch from Hive to Spark, it is almost the same as the second generation. I built an RDD-based version on Spark 1.2 and a DataFrame-based version on Spark 1.4.
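For a feel of the difference, here is the same per-game DAU count written both ways (a minimal sketch; the log layout is hypothetical, and it assumes a running pyspark shell where sc and sqlContext are predefined):

```python
from pyspark.sql import functions as F

# RDD style (Spark 1.2): parse tab-separated lines, then count distinct
# users per game by hand.
logins = sc.textFile("s3://example-bucket/logs/login/2015-06-01/")
dau_rdd = (logins.map(lambda line: line.split("\t"))
                 .map(lambda f: (f[0], f[1]))        # (game_id, user_id)
                 .distinct()
                 .map(lambda pair: (pair[0], 1))
                 .reduceByKey(lambda a, b: a + b))

# DataFrame style (Spark 1.4): the same aggregation, declaratively.
df = sqlContext.read.json("s3://example-bucket/logs/login/2015-06-01/")
dau_df = df.groupBy("game_id").agg(F.countDistinct("user_id").alias("dau"))
```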

Development period

Approximately 6 months, from June 2015 to December 2015 (in parallel with other work, including the verification period)

Impressions

Documentation was scarce, so building it was a matter of trial and error. Even when I went to seminars, most of the companies around me were still at the evaluation stage, so there was hardly anyone to consult... I think I acquired the skill of deciphering Google Translate output in order to get through unfamiliar English articles. With Spark, iterative processing is fast if you use the cache (for example, caching the login log when splitting DAU into various segments), but the first pass, before the cache is populated, takes a long time.
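The caching pattern mentioned above looks roughly like this (a minimal sketch; the column names are hypothetical, and it again assumes a pyspark shell with sqlContext predefined):

```python
from pyspark.sql import functions as F

logins = sqlContext.read.json("s3://example-bucket/logs/login/2015-06/")
logins.cache()  # materialized by the first action, then served from memory

# The first aggregation is slow because it also builds the cache;
# the following ones reuse the cached login log and are fast.
dau_by_os = logins.groupBy("os").agg(
    F.countDistinct("user_id").alias("dau")).collect()
dau_by_country = logins.groupBy("country").agg(
    F.countDistinct("user_id").alias("dau")).collect()
dau_by_level = logins.groupBy("player_level").agg(
    F.countDistinct("user_id").alias("dau")).collect()
```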

Fourth Generation: A Platform Based on MySQL and Distributed Processing Across Servers

Why I migrated

While building all of this, I started to wonder, "Wait, with this amount of data, do we even need Hive or Spark in the first place?" I reviewed the aggregation algorithms, thought the DB table design through properly, and verified that the current data volume posed no problem. Also, given Spark's cold-cache penalty on the first pass, it looked like processing on the DB and the servers could actually shorten the aggregation time.

Overview

The processing is the same as the previous generation, but changed to run as multiple processes in Python. Instead of loading all the logs into the DB, I loaded only the logs needed for aggregation and deleted them once they were no longer needed (the log files remain in S3).
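The "multiple processes" part might look like this (a minimal sketch; the DSN, table names, and game IDs are hypothetical):

```python
from multiprocessing import Pool

import MySQLdb

# One worker per game title, each running its aggregation SQL against MySQL.
DAU_SQL = """
INSERT INTO kpi_daily (log_date, game_id, dau)
SELECT log_date, game_id, COUNT(DISTINCT user_id)
FROM login_log
WHERE log_date = %s AND game_id = %s
GROUP BY log_date, game_id
"""

def aggregate_one(job):
    target_date, game_id = job
    # Each process opens its own connection; DB connections
    # must not be shared across processes.
    conn = MySQLdb.connect(host="db.example.internal", user="batch",
                           passwd="...", db="analytics")
    try:
        cur = conn.cursor()
        cur.execute(DAU_SQL, (target_date, game_id))
        conn.commit()
    finally:
        conn.close()

if __name__ == "__main__":
    jobs = [("2016-01-01", gid) for gid in ("game_a", "game_b", "game_c")]
    with Pool(processes=4) as pool:
        pool.map(aggregate_one, jobs)
```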

Development period

Approximately 3 months, from December 2015 to February 2016 (in parallel with other work, including the verification period)

Impressions

Compared to Spark, the aggregation time was cut to less than half, and there was no additional cost because we reused a server and DB that were already running for other purposes. My feeling is that this approach holds up to a few GB of logs a day.

Fifth Generation: A BigQuery-Based Platform (Under Construction)

Why I migrated

The fourth generation could absorb some growth in data volume, but I was worried it would stop being usable once the scope of analysis widened and the data volume jumped all at once; and whenever an ad hoc analysis involved a lot of data, I found myself launching a Spark cluster... I verified that both the daily batch and ad hoc analysis would be easier if done with BigQuery.

Overview

The processing is the same as the previous generation. For the batch, I used Luigi, a Python framework for building data pipelines.
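A Luigi task for this kind of daily batch might look like the following (a minimal sketch; it uses the modern google-cloud-bigquery client as an illustration, and the dataset, table names, and SQL are hypothetical):

```python
import luigi
from google.cloud import bigquery

class DailyKpi(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        # A marker file records that this day has been aggregated,
        # so Luigi skips already-completed days on re-runs.
        return luigi.LocalTarget("markers/daily_kpi_%s.done" % self.date)

    def run(self):
        client = bigquery.Client()
        sql = """
            CREATE OR REPLACE TABLE `example_dataset.kpi_daily_{day}` AS
            SELECT game_id, COUNT(DISTINCT user_id) AS dau
            FROM `example_dataset.login_log_{day}`
            GROUP BY game_id
        """.format(day=self.date.strftime("%Y%m%d"))
        client.query(sql).result()  # wait for the job to finish
        with self.output().open("w") as f:
            f.write("done")

# Run with: python daily_kpi.py DailyKpi --date 2016-02-01
if __name__ == "__main__":
    luigi.run()
```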

Development period

From February 2016 to the present

Impressions (so far)

Creating tables takes a long time. According to Google, improving this will take a while, so for now I'm just hoping. The limit of 1,000 tables per query is also hard to work with at the moment: for a title that has been live for more than 1,000 days, you can't simply query each user's cumulative spend as of day X and day Y. You have to work around the 1,000-table limit, for example by creating a table for each day that holds "cumulative spend up to the previous day + spend for the current day".
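That workaround might look like this (a minimal sketch, written in today's standard SQL via the Python client; the table names are hypothetical). Because each day's cumulative table is built from the previous day's cumulative table plus that day's purchases, no query ever has to touch more than two tables:

```python
import datetime

from google.cloud import bigquery

CUMULATIVE_SQL = """
CREATE OR REPLACE TABLE `example_dataset.cumulative_spend_{today}` AS
SELECT user_id, SUM(amount) AS cumulative_amount
FROM (
  SELECT user_id, cumulative_amount AS amount
  FROM `example_dataset.cumulative_spend_{yesterday}`
  UNION ALL
  SELECT user_id, amount
  FROM `example_dataset.purchase_log_{today}`
) AS combined
GROUP BY user_id
"""

def build_cumulative(day):
    # Fold today's purchases into yesterday's running total.
    prev = day - datetime.timedelta(days=1)
    sql = CUMULATIVE_SQL.format(today=day.strftime("%Y%m%d"),
                                yesterday=prev.strftime("%Y%m%d"))
    bigquery.Client().query(sql).result()

if __name__ == "__main__":
    build_cumulative(datetime.date(2016, 2, 1))
```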

Those problems aside, aggregation is fast, and even with all the data loaded there is almost no management cost, so the ease of ad hoc analysis is a big advantage. The first 1 TB of query processing each month is free, so with a small data volume the aggregation cost is close to zero.

Summary

For aggregating logs on the order of a few GB:

| Item | Introduction cost | Operating cost | Ease of ad hoc analysis |
|---|---|---|---|
| RedShift |  | ✕ (with SSD nodes, closer to △) |  |
| Hive |  |  |  |
| Spark |  |  |  |
| SQL and distributed processing |  | ◎ |  |
| BigQuery | △ (table creation is slow) | ? (final cost still unknown; probably ◯) |  |

[Introduction cost] ◯: almost no cost if you can write SQL / △: there are points to consider beyond just writing SQL

[Operating cost] ◎: no additional cost, since it reuses a server and DB already running for other purposes / ◯: ¥50,000 or less per month / △: a little under ¥100,000 per month / ✕: over ¥100,000 per month

[Ease of ad hoc analysis] ◯: no problem as long as you can write SQL / △: some pain points, such as slow responses or having to think about the cache


Depending on the data volume and the purpose of the analysis, these choices may not fit your case, and there are surely situations where on-premises beats the cloud; in the end it comes down to the right tool for the job. Personally, I like Spark and BigQuery. Spark is nice because it has a machine learning library and is flexible, being scriptable from Scala and Python as well as SQL; now if only it were faster. BigQuery needs no fine tuning and is cheap, so it's refreshingly easy to use; I hope its current problems get fixed soon. Looking back, we did a major rebuild roughly once every six months. I never got to dig deeply into any single system, but I'm grateful for an environment that let me experience this much in three years.
