[PYTHON] The story of Airflow's webserver and DAGs that take a long time to load

Christmas is near, so let's talk about Airflow's web server.

What I want to say

Modules that make up Airflow

The webserver, the star of this post, is one of the modules that make up Airflow. There are other modules as well; for more information, Astronomer's article is easy to follow.

What is the webserver

The webserver accepts requests from the management UI (shown below), some CLI commands, the [API](https://airflow.apache.org/docs/stable/api.html), and so on.

Internally, it runs on Flask + Gunicorn, and the endpoints used by the UI are defined [here](https://github.com/apache/airflow/blob/1.10.2/airflow/www/views.py).

Airflow UI (The figure is from Airflow official page)

The problem of DAGs that take a long time to load

The webserver not only accepts requests, but also **reads DAG files on a regular basis**.

As a result, if **reading** the DAG files takes a long time, problems can occur on the webserver side. Sometimes a warning of sorts is shown as well.

What does "takes a long time to load" mean?

It is easy to confuse **a DAG that takes a long time to load** with **a DAGRun that takes a long time to execute**, but they are different things, and the former is the problem discussed here.

To give an example, this is a **DAG that takes a long time to load**:

    # Top level of the DAG file: executed every time the file is parsed
    sleep(10000000)
    start = DummyOperator(task_id='start')

And this is a **DAG whose DAGRun takes a long time to execute**:

        def hoge():
            # Inside the callable: executed only when the task runs
            sleep(1000000)
        slow_task = PythonOperator(
            task_id='query_' + str(i),
            python_callable=hoge,
        )

Loading can be slow when there are a large number of tasks, or when external resources are accessed **outside of a task**, that is, at the top level of the DAG file.
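The difference between load-time and run-time cost can be measured without Airflow at all. The following is a minimal sketch, assuming that parsing a DAG file amounts to executing its top-level Python code. The two source strings, the `parse_seconds` helper, and the shortened 0.2 second sleeps are all hypothetical and exist only for illustration.

```python
import time

# Hypothetical stand-ins for two DAG files, kept as source strings so the
# sketch runs without Airflow installed. The sleeps are shortened to 0.2 s.
SLOW_TO_LOAD = '''
import time
time.sleep(0.2)  # top level: runs whenever the file is parsed

def task():
    return "done"
'''

SLOW_TO_RUN = '''
import time

def task():
    time.sleep(0.2)  # inside the callable: runs only when the task executes
    return "done"
'''


def parse_seconds(source):
    """Run a file's top-level code, as a DAG parser would, and time it."""
    namespace = {}
    started = time.perf_counter()
    exec(compile(source, "<dag file>", "exec"), namespace)
    return time.perf_counter() - started, namespace


load_heavy_seconds, _ = parse_seconds(SLOW_TO_LOAD)
run_heavy_seconds, run_heavy_ns = parse_seconds(SLOW_TO_RUN)

print(f"slow-to-load file parsed in {load_heavy_seconds:.2f}s")
print(f"slow-to-run file parsed in {run_heavy_seconds:.2f}s")
```

Only the first file pays the sleep at parse time. The second one pays it later, when `task()` is actually called, which is the situation that corresponds to a slow DAGRun.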

Flow of webserver parsing DAG

For those who are worried about the detailed flow:

  1. When the webserver starts, it is [set up to restart its child processes (gunicorn workers)](https://github.com/apache/airflow/blob/4a344f13d26ecbb627bb9968895b290bfd86e4da/airflow/cli/commands/webserver_command.py#L146) periodically (*)
  2. [A DagBag object is created](https://github.com/apache/airflow/blob/cb8b2a1dc64c3ea6ba445893c65c6c953dfb476a/airflow/www/views.py#L92) when the endpoint file is loaded
  3. While the DagBag object is being created, the DAG files are parsed
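The three steps above can be sketched as a toy model. This is not Airflow's code: `ToyDagBag`, `start_worker`, and `run_master` are hypothetical names, and the sketch only assumes that every worker restart builds a fresh DagBag, which in turn parses every DAG file again.

```python
class ToyDagBag:
    """Hypothetical stand-in for DagBag: parses every DAG file on creation."""

    def __init__(self, dag_sources):
        self.dags = {}
        for filename, source in dag_sources.items():
            namespace = {}
            exec(source, namespace)  # step 3: parsing runs top-level code
            self.dags[filename] = namespace


def start_worker(dag_sources):
    """Step 2: loading the endpoint module creates a DagBag."""
    return ToyDagBag(dag_sources)


def run_master(dag_sources, restarts):
    """Step 1: every periodic worker restart triggers a fresh parse."""
    parsed_files = 0
    for _ in range(restarts):
        worker_dagbag = start_worker(dag_sources)
        parsed_files += len(worker_dagbag.dags)
    return parsed_files


toy_dag_files = {"a.py": "dag_id = 'a'", "b.py": "dag_id = 'b'"}
total_parses = run_master(toy_dag_files, restarts=3)
print(total_parses)
```

In this model the parsing cost is paid once per file per restart, so a single slow DAG file slows down every worker refresh.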

Relationship with the number of tasks

I tried this on Cloud Composer (Airflow 1.10.2) with a DAG made up only of BigQueryOperator tasks.

If only the Graph View or Tree View is slow, try changing default_dag_run_display_number.
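As a rough sketch of how that setting could be overridden: default_dag_run_display_number lives in the webserver section of the Airflow configuration, and Airflow reads overrides from environment variables named `AIRFLOW__{SECTION}__{KEY}`. The helper `airflow_env_var` below is a hypothetical convenience, and the value 5 is an arbitrary illustration, not a recommendation. Managed services such as Cloud Composer have their own way of setting configuration overrides, so whether an environment variable is the right mechanism depends on the deployment.

```python
def airflow_env_var(section, key):
    """Build the AIRFLOW__{SECTION}__{KEY} name for a config override."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


env_name = airflow_env_var("webserver", "default_dag_run_display_number")

# Hypothetical override: 5 is an arbitrary illustrative value.
overrides = {env_name: "5"}
print(overrides)
```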

A brighter future

Several improvements have been proposed for this DAG loading problem.

Cloud Composer implements [an option to make DAG loading asynchronous on the webserver](https://cloud.google.com/composer/docs/how-to/accessing/airflow-web-interface#asynchronous-load), and it has also been [ported](https://issues.apache.org/jira/browse/AIRFLOW-4924) to Airflow 1.10.4.

It's still a draft, but [AIP-24 DAG Persistence in DB using JSON for Airflow Webserver and (optional) Scheduler](https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-24+DAG+Persistence+in+DB+using+JSON+for+Airflow+Webserver+and+%28optional%29+Scheduler?focusedCommentId=123898950) proposes a more significant change.

It proposes several options. (Apparently there is also an argument that the webserver should not hold state in the first place.)

Cloud Composer webserver

A note about the Cloud Composer webserver:

By the way, on Astronomer.io you can change the vCPU and memory size.
