[PYTHON] What is the ETL processing framework cliboa?

What is ETL processing?

ETL is an acronym for extract, transform, and load. ETL processing means applying those three operations to some data (a text file, a CSV file, and so on): extracting it, processing it, and loading the result somewhere.
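
As a rough illustration of the three stages, here is a minimal sketch that uses only the Python standard library. The function names, the sample columns (name, amount), and the upper-casing rule are assumptions made up for this example; they have nothing to do with cliboa itself.

```python
import csv
import io


def extract(csv_text):
    """Extract: read rows from CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Transform: upper-case the name and convert the amount to an integer."""
    return [{"name": r["name"].upper(), "amount": int(r["amount"])} for r in rows]


def load(rows):
    """Load: write the processed rows back out as CSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["name", "amount"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical sample data standing in for a real source file.
SAMPLE = "name,amount\nalice,10\nbob,20\n"
result = load(transform(extract(SAMPLE)))
print(result)
```

In a real job, extract and load would talk to external storage rather than to in-memory strings, but the shape of the pipeline stays the same.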

What is cliboa?

cliboa is an application framework that BrainPad designed and implemented from the common infrastructure parts of the ETL processing functions it had developed and operated in-house.

GitHub https://github.com/BrainPad/cliboa

PyPI https://pypi.org/project/cliboa/

Definition of ETL processing in cliboa

In cliboa, extract is defined as downloading data from some storage location, transform is defined as processing the downloaded data, and load is defined as uploading the processed data to any storage location.

[Figure: conceptual diagram of ETL processing in cliboa]

Features of cliboa

- Implemented in Python.
- A simple ETL process can be run just by writing a YAML file.
- Additional steps can be implemented in Python.
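
To picture how a scenario file can drive Python code, the sketch below runs a list of steps in order, looking each one up by class name and passing it its arguments. This is not cliboa's internal implementation or API: the step classes (Upper, AddSuffix), the STEP_CLASSES registry, and run_scenario are all hypothetical, and the scenario is written as the Python list a YAML parser would typically produce.

```python
class Upper:
    """Hypothetical step: upper-case every string in the data."""

    def execute(self, data):
        return [s.upper() for s in data]


class AddSuffix:
    """Hypothetical step: append a suffix to every string in the data."""

    def __init__(self, suffix):
        self.suffix = suffix

    def execute(self, data):
        return [s + self.suffix for s in data]


# Hypothetical registry mapping the "class" value to a Python class.
STEP_CLASSES = {"Upper": Upper, "AddSuffix": AddSuffix}


def run_scenario(scenario, data):
    """Instantiate each step with its arguments and run the steps in order."""
    for entry in scenario:
        step_cls = STEP_CLASSES[entry["class"]]
        step = step_cls(**entry.get("arguments", {}))
        data = step.execute(data)
    return data


# The structure mirrors a scenario file: a list of steps with class and arguments.
scenario = [
    {"class": "Upper"},
    {"class": "AddSuffix", "arguments": {"suffix": ".csv"}},
]
print(run_scenario(scenario, ["a", "b"]))  # ['A.csv', 'B.csv']
```

With this kind of design, adding a new processing step means writing one more class and referring to it by name in the scenario.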

Quick start

Required environment

cliboa runs on Linux distributions such as Debian, Ubuntu, and CentOS.

Installation method

With Python version 3.0 or higher available, install cliboa with the pip command.

$ sudo pip3 install cliboa

After the installation is complete, the cliboadmin command becomes available. Run cliboadmin in any directory.

$ cd /usr/local
$ cliboadmin init sample
$ cd sample
$ cliboadmin create simple-etl

Program structure

The program structure initialized by cliboadmin is as follows.

sample
|-- bin
|   `-- clibomanager.py
|-- common
|   |-- __init__.py
|   |-- environment.py
|   |-- scenario
|   `-- scenario.yml
|-- conf
|-- logs
|-- project
|   `-- simple-etl
|       |-- scenario
|       `-- scenario.yml
`-- requirements.txt

Install PyPI packages

The Python packages required to run cliboa are listed in requirements.txt, so pass that file to the pip command to install them.

$ cd sample
$ pip3 install -r requirements.txt

Write an ETL processing scenario

As an example, write the following process in project/simple-etl/scenario.yml.

Processing content: download test.csv.gz from the SFTP server, decompress the downloaded file, and upload the decompressed test.csv to the SFTP server.

scenario:
- step:
  class: SftpDownload
  arguments:
    host: localhost
    user: root
    password: pass
    src_dir: /usr/local
    src_pattern: test.csv.gz
    dest_dir: /tmp
- step:
  class: FileDecompress
  arguments:
    src_dir: /tmp
    src_pattern: test.*\.csv.*\.gz
- step:
  class: SftpUpload
  arguments:
    host: localhost
    user: root
    password: pass
    src_dir: /tmp
    src_pattern: test.*\.csv
    dest_dir: /usr/local
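
To make the middle step concrete, the sketch below shows roughly what decompressing the files selected by src_dir and src_pattern amounts to, using only the standard library. It is not cliboa's FileDecompress implementation. The function name decompress_matching is hypothetical, and two behaviors are assumptions: the pattern is matched against the whole file name, and the output name is the input name without its .gz suffix.

```python
import gzip
import os
import re
import shutil
import tempfile


def decompress_matching(src_dir, src_pattern):
    """Gunzip each file in src_dir whose name matches src_pattern.

    Assumptions: the pattern must match the whole file name, and the
    output name is the input name without its ".gz" suffix.
    """
    regex = re.compile(src_pattern)
    written = []
    for name in sorted(os.listdir(src_dir)):
        if not (regex.fullmatch(name) and name.endswith(".gz")):
            continue
        src = os.path.join(src_dir, name)
        dest = os.path.join(src_dir, name[:-3])  # drop ".gz"
        with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        written.append(dest)
    return written


# Demo in a throwaway directory instead of the /tmp used in the scenario.
with tempfile.TemporaryDirectory() as work_dir:
    with gzip.open(os.path.join(work_dir, "test.csv.gz"), "wb") as f:
        f.write(b"id,name\n1,foo\n")
    created = decompress_matching(work_dir, r"test.*\.csv.*\.gz")
    print([os.path.basename(p) for p in created])  # ['test.csv']
```

Note that the same regular expression written in the scenario, test.*\.csv.*\.gz, matches test.csv.gz but not an unrelated file such as other.txt.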

Run

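Before execution, make sure that an SFTP server is reachable at localhost with the user and password written in the scenario, and that test.csv.gz is placed under /usr/local.

Execute with the following commands.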

$ cd sample
$ bin/clibomanager.py simple-etl

The run succeeded if both of the following hold after execution.

- test.csv.gz placed under /usr/local has been decompressed under /tmp as test.csv.
- test.csv exists under /usr/local.
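
The two success conditions can also be checked from Python. The helper below is a small sketch; the name check_result and its return format are hypothetical, and the demo uses temporary directories in place of /tmp and /usr/local so that it can run anywhere.

```python
import os
import tempfile


def check_result(work_dir, upload_dir, name="test.csv"):
    """Report whether the decompressed and the uploaded file both exist."""
    return {
        "decompressed": os.path.isfile(os.path.join(work_dir, name)),
        "uploaded": os.path.isfile(os.path.join(upload_dir, name)),
    }


# Demo: simulate a run where only the decompression has happened so far.
with tempfile.TemporaryDirectory() as work, tempfile.TemporaryDirectory() as dest:
    open(os.path.join(work, "test.csv"), "w").close()
    print(check_result(work, dest))  # {'decompressed': True, 'uploaded': False}
```

After a real run of the scenario above, the equivalent call would be check_result("/tmp", "/usr/local"), and both values should be True.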
