[PYTHON] What I was careful about when implementing Airflow with docker-compose

I set up Airflow as part of a personal study project, so I am writing down what I noticed along the way [^1]. I hope this helps fewer people get stuck on the same problems.

Premise

- Use LocalExecutor
- MySQL container and Airflow container
- Access Redshift from Airflow

Docker image

Install airflow==1.10.10 on top of the python:3.7.7 image.

- Write the Dockerfile yourself (and `entrypoint.sh` along with it)
- For learning purposes, refrain from using puckel/docker-airflow (it is insanely helpful as a reference, though)
- The `master` and `latest` tags of the official Docker image are the 2.0 development version [^2]

AIRFLOW_EXTRAS

These are extras (plug-ins) that extend Airflow, ranging from database backends such as MySQL to access to GCP. The official Dockerfile says:

ARG AIRFLOW_EXTRAS="async,aws,azure,celery,dask,elasticsearch,gcp,kubernetes,mysql,postgres,redis,slack,ssh,statsd,virtualenv"

`crypto` is practically required, since it is needed to generate the `FERNET_KEY`. I use MySQL as the backend DB and psycopg2 to connect to Redshift, so I also need the extras related to those.

entrypoint.sh

As described in the docs (https://airflow.apache.org/docs/stable/howto/secure-connections.html) and in puckel, generate a `FERNET_KEY` to encrypt your connections. This is safer than hard-coding it in `airflow.cfg`.

: "${AIRFLOW__CORE__FERNET_KEY:=${FERNET_KEY:=$(python -c "from cryptography.fernet import Fernet; FERNET_KEY = Fernet.generate_key().decode(); print(FERNET_KEY)")}}"

Confirming DB startup with the nc (netcat) command

Even if you specify depends_on: mysql in docker-compose.yml, Docker only waits for the container to start; it does not confirm that the DB inside it is ready to accept connections. puckel's `entrypoint.sh` uses the nc command to check whether a connection to the DB can actually be established. This kind of detail is very helpful.

wait_for_port() {
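  # TRY_LOOP (the maximum number of attempts) must be defined earlier in entrypoint.sh, e.g. TRY_LOOP="20"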
  local name="$1" host="$2" port="$3"
  local j=0
  while ! nc -z "$host" "$port" >/dev/null 2>&1 < /dev/null; do
    j=$((j+1))
    if [ $j -ge $TRY_LOOP ]; then
      echo >&2 "$(date) - $host:$port still not reachable, giving up"
      exit 1
    fi
    echo "$(date) - waiting for $name... $j/$TRY_LOOP"
    sleep 5
  done
}
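Incidentally, the same readiness check can also be done from Python instead of shell. The following is just a minimal sketch (not part of puckel's entrypoint), assuming the MYSQL_HOST and MYSQL_PORT environment variables set in the docker-compose.yml shown later:

import os
import socket
import sys
import time

# Readiness check equivalent to wait_for_port above, written in Python.
# MYSQL_HOST / MYSQL_PORT are assumed to be set as in docker-compose.yml below.
host = os.environ.get("MYSQL_HOST", "mysql")
port = int(os.environ.get("MYSQL_PORT", "3306"))
try_loop = 20  # maximum number of attempts, like TRY_LOOP in entrypoint.sh

for attempt in range(1, try_loop + 1):
    try:
        # A successful TCP connection means the DB port is reachable
        with socket.create_connection((host, port), timeout=3):
            break
    except OSError:
        print(f"waiting for {host}:{port}... {attempt}/{try_loop}")
        time.sleep(5)
else:
    sys.exit(f"{host}:{port} still not reachable, giving up")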

DB settings

MySQL preferences

Set the following environment variables for the MySQL container (they appear in the docker-compose.yml example below): MYSQL_ROOT_PASSWORD, MYSQL_USER, MYSQL_PASSWORD, MYSQL_DATABASE. These are settings on the DB side, but Airflow connects with this same user and password.

my.cnf

As described in Airflow's database backend documentation (https://airflow.apache.org/docs/stable/howto/initialize-database.html), setting `explicit_defaults_for_timestamp = 1` is necessary when using MySQL. In addition, add settings for handling multi-byte characters.

[mysqld]
character-set-server=utf8mb4
explicit_defaults_for_timestamp=1

[client]
default-character-set=utf8mb4
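To confirm that these options actually took effect, you can query the server from the Airflow container. A minimal sketch, assuming the mysqlclient (MySQLdb) driver pulled in by the mysql extra and the user/password defined in the docker-compose.yml below:

import MySQLdb  # provided by mysqlclient, pulled in by the "mysql" extra

# Connection values assume the MySQL service defined in docker-compose.yml below
conn = MySQLdb.connect(host="mysql", user="airflow", passwd="airflow", db="airflow")
cur = conn.cursor()
cur.execute("SHOW VARIABLES LIKE 'explicit_defaults_for_timestamp'")
print(cur.fetchone())  # expect ('explicit_defaults_for_timestamp', 'ON')
cur.execute("SHOW VARIABLES LIKE 'character_set_server'")
print(cur.fetchone())  # expect ('character_set_server', 'utf8mb4')
cur.close()
conn.close()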

Access to DB

AIRFLOW__CORE__SQL_ALCHEMY_CONN

The default is sqlite, but change it according to your DB and driver. For how to write the URI, refer to the SQLAlchemy documentation (https://docs.sqlalchemy.org/en/13/core/engines.html).

The host is the name specified by container_name in docker-compose.yml, and the port is normally 3306 for MySQL and 5432 for PostgreSQL. The user name and DB name are the ones set above.
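If you want to make sure the URI is well-formed before handing it to Airflow, you can test it with SQLAlchemy directly. A minimal sketch, assuming SQLAlchemy 1.3 (the version linked above) and the same values as in the docker-compose.yml below:

from sqlalchemy import create_engine

# Same URI as AIRFLOW__CORE__SQL_ALCHEMY_CONN in the docker-compose.yml below:
# dialect+driver://user:password@host:port/database
engine = create_engine("mysql+mysqldb://airflow:airflow@mysql:3306/airflow")

with engine.connect() as connection:
    print(connection.execute("SELECT 1").scalar())  # prints 1 if the DB is reachable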

Use of environment variables

Where to write various Airflow settings

As described in the documentation, settings can be written in several places, and environment variables take precedence. I want to use environment variables for things like AWS credentials. For settings that stay static, such as DB access, `airflow.cfg` is also fine. At the production level, the team should agree on the rules properly.

  1. set as an environment variable
  2. set as a command environment variable
  3. set in airflow.cfg
  4. command in airflow.cfg
  5. Airflow’s built in defaults

The higher ones have priority.
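If you are unsure which of these sources won for a given option, you can ask Airflow for the effective value. A minimal sketch using the configuration module of Airflow 1.10:

import os

from airflow.configuration import conf

# Environment variables follow the AIRFLOW__{SECTION}__{KEY} naming convention,
# e.g. AIRFLOW__CORE__SQL_ALCHEMY_CONN overrides [core] sql_alchemy_conn in airflow.cfg
print(os.environ.get("AIRFLOW__CORE__SQL_ALCHEMY_CONN"))

# conf.get() returns the value after the precedence rules above are applied
print(conf.get("core", "sql_alchemy_conn"))
print(conf.get("core", "executor"))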

Where to define environment variables

There are also several options here. I prioritize them as follows.

  1. Dockerfile: for values that rarely change, such as defaults
  2. `entrypoint.sh`: same as above
  3. docker-compose.yml: can also be set for other containers; more flexible
  4. .env file: if specified as `env_file` in docker-compose.yml, it is read when the container starts. Write credentials that you do not want to commit to Git here.

DB settings and SqlAlchemy settings are in docker-compose.yml

Since the DB and Airflow share the same values, it is easier to manage and maintain them if they are written in the same place.

version: "3.7"
services:
    mysql:
        image: mysql:5.7
        container_name: mysql
        environment:
            - MYSQL_ROOT_PASSWORD=password
            - MYSQL_USER=airflow
            - MYSQL_PASSWORD=airflow
            - MYSQL_DATABASE=airflow
        volumes:
            - ./mysql.cnf:/etc/mysql/conf.d/mysql.cnf:ro
        ports:
            - "3306:3306"
    airflow:
        build: .
        container_name: airflow
        depends_on:
            - mysql
        environment:
            - AIRFLOW_HOME=/opt/airflow
            - AIRFLOW__CORE__LOAD_EXAMPLES=False
            - AIRFLOW__CORE__EXECUTOR=LocalExecutor
            - AIRFLOW__CORE__SQL_ALCHEMY_CONN=mysql+mysqldb://airflow:airflow@mysql:3306/airflow
            - MYSQL_PORT=3306
            - MYSQL_HOST=mysql
# (the rest is omitted)

AWS and Redshift connections are in the .env file

Redshift aside, AWS access keys and secret keys are highly confidential, so I don't want to write them in docker-compose.yml or `entrypoint.sh`. `airflow.cfg` could be considered, but in practice this is something to discuss with the development team.

**In any case, typing them in by hand in the GUI is not exactly modern.**

When writing them, refer to the documentation and use the following format.

| Conn Id | Conn Type | Login | Password | Host | Port | Schema | Environment Variable |
|---|---|---|---|---|---|---|---|
| redshift_conn_id | postgres | awsuser | password | your-cluster-host | 5439 | dev | AIRFLOW_CONN_REDSHIFT_CONN_ID=postgres://awsuser:password@your-cluster-host:5439/dev |
| aws_conn_id | aws | your-access-key | your-secret-key | | | | AIRFLOW_CONN_AWS_CONN_ID=aws://your-access-key:your-secret-key@ |

Even if the ID is lowercase, the environment variable name will be uppercase.

For the AWS key, you need to add `@` at the end even though there is no host; otherwise an error occurs. Also, if the key contains a colon or a slash it will not parse correctly, so it is better to regenerate the key.
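Once the connections are defined this way, DAG code only needs the connection IDs. A minimal sketch using hooks bundled with Airflow 1.10, with the conn IDs from the table above:

from airflow.contrib.hooks.aws_hook import AwsHook
from airflow.hooks.postgres_hook import PostgresHook

# Redshift speaks the PostgreSQL protocol, so PostgresHook can use the
# connection defined by AIRFLOW_CONN_REDSHIFT_CONN_ID
redshift = PostgresHook(postgres_conn_id="redshift_conn_id")
records = redshift.get_records("SELECT 1")

# AwsHook picks up the access key / secret key from AIRFLOW_CONN_AWS_CONN_ID
aws = AwsHook(aws_conn_id="aws_conn_id")
credentials = aws.get_credentials()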

Conversely, when I wanted to know the URI format of a connection that had been entered in the GUI or elsewhere, I could output it as follows.

from airflow.hooks.base_hook import BaseHook

conn = BaseHook.get_connection('postgres_conn_id')
print(f"AIRFLOW_CONN_{conn.conn_id.upper()}='{conn.get_uri()}'")

Setting Airflow Variables (key-value pairs) with environment variables

As with connections, you can set key-value pairs (Variables). The options are:

  1. Set with GUI
  2. Set with .py code.
  3. Set with environment variables

For code,

from airflow.models import Variable
Variable.set(key="foo", value="bar")

For environment variables

| Key | Value | Environment Variable |
|---|---|---|
| foo | bar | AIRFLOW_VAR_FOO=bar |

Even if the key is lowercase, the environment variable name will be uppercase.
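Either way you set it, the DAG code that reads the Variable looks the same. A minimal sketch:

import os

from airflow.models import Variable

# With AIRFLOW_VAR_FOO=bar in the environment (or after Variable.set above),
# the value can be read by key
print(Variable.get("foo"))                # -> bar
print(os.environ.get("AIRFLOW_VAR_FOO"))  # -> bar, only when set as an env var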

The end

Finally, let me introduce the repository.

[^1]: I took Udacity's Data Engineer Nanodegree, where I touched Cassandra, Redshift, Spark, and Airflow. It was supposed to take 5 months, but I finished in 3, so paying month by month seems the better deal. They also run 50%-off offers regularly, so I recommend signing up when one comes around. ~~Otherwise it's too expensive~~

[^2]: When I tried `apache/airflow:1.10.10` while writing this article, it seemed to work relatively smoothly. If you run `docker run -it --name test -p 8080 -d apache/airflow:1.10.10`, the container starts with bash open, so you can operate it flexibly, for example with `docker exec test airflow initdb`.
