[LINUX] Batch design and Python

Introduction

This is the Day 25 article of Python Advent Calendar 2020.

We are in the serverless era, but I think many systems still run batches on on-premises servers or on cloud (IaaS) instances.

This article assumes batches that run on a server. The first half summarizes the key points of batch design, based on anti-patterns. The second half gives tips for batch development in Python.

**Note: the content of this article is only one way of thinking. Not every idea fits every system, and following what is written here does not by itself make a design sufficient.**

Batch design

Batch processing is a method that runs a series of processing steps over a set of data in one go. The term dates back to the mainframe era.

On Unix-like operating systems, batches that process data in bulk are often run at a specified date and time using cron. A batch is also sometimes called a job. Large-scale systems manage a large number of jobs, so a dedicated job management server is built and job management software is installed on it.

In principle, depending on the system, operation is possible by hand without any batch processing. In practice, however, batch processing is indispensable once labor cost, processing time, and reliability are taken into account. There are also cases where the need was not recognized during requirements definition and only becomes apparent after the system goes into operation.

Batch design therefore requires a bird's-eye view of the entire system, and should take into account the basic points below as well as operational concerns such as the anti-patterns that follow.

Basic points

-[x] Give variables used in batches descriptive names (avoid names such as x or i) to improve maintainability.
-[x] Keep each method simple; do not combine multiple processes in one.
-[x] Separate settings that change per environment, such as DB connection settings, from the executable by using a config file.
-[x] Put general-purpose processing such as a log module into shared utilities as needed.
-[x] When jobs are scheduled with cron or similar, design with overruns in mind (a job running past its expected time window).
-[x] Add created_at and updated_at columns and set them when registering or updating DB records.
-[x] Decide how often to commit within a transaction (weigh throughput against rollback according to the data volume).
-[x] Build in support for reruns (keep the recovery procedure simple).

Anti-pattern

The following are batch-related anti-patterns I have run into each time I suddenly took over the operation of a system.

--Log files that are not rotated

The batch is written in Python and logs with the logging module. However, because the program does not rotate its logs, output keeps going to the same log file indefinitely.

**If you know the Linux logging tools (rsyslog, together with logrotate for the rotation itself), you can solve this with OS configuration alone, without implementing rotation in the program.**
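If rotation does have to live in the program, the standard library's logging.handlers module can do it. The following is a minimal sketch, not the original batch's code: the logger name, the file name, and the size limits are assumptions chosen only for illustration.

```python
import logging
import logging.handlers
import os
import tempfile


def build_rotating_logger(log_path, max_bytes=1024, backup_count=3):
    """Return a logger that rotates log_path once it would exceed max_bytes."""
    logger = logging.getLogger("batch_rotation_example")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers if called twice
    handler = logging.handlers.RotatingFileHandler(
        log_path, maxBytes=max_bytes, backupCount=backup_count, encoding="utf-8"
    )
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


# Illustration only: write to a temporary directory instead of a real log directory.
log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, "batch.log")
logger = build_rotating_logger(log_path)
for i in range(200):
    logger.info("processed record %d", i)

# The current file plus at most backup_count older files remain on disk.
rotated_files = sorted(os.listdir(log_dir))
```

For time-based rotation (for example, one file per day), logging.handlers.TimedRotatingFileHandler from the same module is the usual alternative.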

--Duplicated config files

Batches a, b, and c are each stored in their own directory, as shown below. Each batch has a different purpose, but the DB settings for data linkage are identical. The webhook URL used for notifications when an exception such as a program error occurs is also written in each config, and it too is the same everywhere.

.
|-- a
|   `-- config
|-- b
|   `-- config
`-- c
    `-- config

**For example, when a server migration changes the webhook URL, every config has to be rewritten.**
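One possible way to avoid this duplication is to keep the shared settings in a single place and let each batch declare only what differs. The sketch below assumes hypothetical setting names and placeholder hosts and URLs; it is not the structure of the system described above.

```python
# Settings shared by every batch live in one place (all values are placeholders).
COMMON_CONFIG = {
    "db_host": "db.example.internal",
    "db_name": "linkage",
    "webhook_url": "https://example.com/hooks/batch-alerts",
}

# Each batch declares only what is specific to it.
BATCH_OVERRIDES = {
    "a": {"log_name": "a.log"},
    "b": {"log_name": "b.log"},
    "c": {"log_name": "c.log", "db_name": "linkage_archive"},
}


def load_config(batch_name):
    """Merge the shared settings with the overrides of one batch."""
    config = dict(COMMON_CONFIG)  # copy, so the shared settings stay untouched
    config.update(BATCH_OVERRIDES.get(batch_name, {}))
    return config


config_a = load_config("a")
config_c = load_config("c")
```

With this layout, a changed webhook URL is edited once in COMMON_CONFIG and every batch picks it up.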

--Batch overruns

One batch starts and stops instances. One day, starting and stopping took longer than usual, probably because AWS as a whole was under load. The preceding batch therefore ran long, overlapped with the batch scheduled after it, and the later batch failed.

**The batch design did not account for overruns caused by unexpected system anomalies.**
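One common guard against overlapping runs is an exclusive lock file that a batch takes at startup. The following is a sketch for Unix-like systems only (it relies on the standard fcntl module); the AlreadyRunning exception, the single_instance helper, and the lock path are hypothetical names introduced for illustration.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


class AlreadyRunning(Exception):
    """Raised when another run still holds the lock."""


@contextmanager
def single_instance(lock_path):
    """Hold an exclusive, non-blocking lock on lock_path while the block runs."""
    lock_file = open(lock_path, "w")
    try:
        try:
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            raise AlreadyRunning(lock_path)
        yield
    finally:
        lock_file.close()  # closing the file releases a lock held through it


# Illustration: the nested block plays the role of the next scheduled run.
lock_path = os.path.join(tempfile.mkdtemp(), "example_batch.lock")
results = []
with single_instance(lock_path):
    results.append("first run started")
    try:
        with single_instance(lock_path):
            results.append("second run started")
    except AlreadyRunning:
        results.append("second run skipped")
```

Whether the later run should be skipped, delayed, or reported is a design decision; the lock only makes the overlap visible instead of letting both runs collide.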

--A DB with no record of when rows were updated

I needed to investigate a DB that had been updated by batch processing, but the investigation was difficult because update times were not recorded. Including columns such as created_at and updated_at in the DB design, and setting them when a batch registers or updates records, makes investigation much easier when a failure occurs.

**If you cannot tell when data was registered or updated, operability and maintainability drop sharply.**
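As a rough illustration of the created_at / updated_at idea, here is a sketch using the standard sqlite3 module with an in-memory database. The stock table, its columns other than the two timestamps, and the upsert_stock helper are assumptions made for this example.

```python
import sqlite3
from datetime import datetime, timezone


def now_utc():
    """Current time as an ISO 8601 string (what a real batch would pass in)."""
    return datetime.now(timezone.utc).isoformat(timespec="seconds")


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stock ("
    " item_id TEXT PRIMARY KEY,"
    " quantity INTEGER NOT NULL,"
    " created_at TEXT NOT NULL,"
    " updated_at TEXT NOT NULL)"
)


def upsert_stock(conn, item_id, quantity, timestamp):
    """Update the row if it exists (touching only updated_at), else insert it."""
    cur = conn.execute(
        "UPDATE stock SET quantity = ?, updated_at = ? WHERE item_id = ?",
        (quantity, timestamp, item_id),
    )
    if cur.rowcount == 0:
        conn.execute(
            "INSERT INTO stock (item_id, quantity, created_at, updated_at)"
            " VALUES (?, ?, ?, ?)",
            (item_id, quantity, timestamp, timestamp),
        )
    conn.commit()


# Fixed timestamps are used here so the result is reproducible.
upsert_stock(conn, "A001", 10, "2020-12-24T01:00:00+00:00")
upsert_stock(conn, "A001", 7, "2020-12-25T01:00:00+00:00")
row = conn.execute(
    "SELECT quantity, created_at, updated_at FROM stock WHERE item_id = ?",
    ("A001",),
).fetchone()
```

After the second run, created_at still shows when the row first appeared and updated_at shows the latest batch that touched it, which is exactly the information an investigation needs.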

--Batches that notify about everything

Info messages for events that need no confirmation, and error notifications for every temporary communication or connection failure, are meaningless to whoever has taken over operation. Unnecessary error notifications are also harmful to operation: once they become routine noise, they turn into the boy who cried wolf and real problems get overlooked.

**Notify only about errors that affect the service and messages that require confirmation by the operator.**
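With the logging module, one way to express this rule is a handler that forwards only records at ERROR level or above to the notification channel. In the sketch below, NotifyHandler is a hypothetical class and the list sent_messages stands in for a real webhook call.

```python
import logging


class NotifyHandler(logging.Handler):
    """Forward log records to a notification function (a stand-in for a webhook)."""

    def __init__(self, send, level=logging.ERROR):
        super().__init__(level=level)  # records below this level are ignored
        self.send = send

    def emit(self, record):
        self.send(self.format(record))


sent_messages = []  # stands in for the real notification channel

logger = logging.getLogger("batch_notify_example")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.handlers.clear()
logger.propagate = False
logger.addHandler(NotifyHandler(sent_messages.append))

logger.info("batch started")                # below ERROR: not notified
logger.warning("retrying after a timeout")  # below ERROR: not notified
logger.error("DB registration failed")      # notified
```

Info and warning messages can still go to the log file through a separate handler; only the notification path is filtered.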

--Batches designed without scalability in mind

When batch processing time affects the service and the volume grows after go-live, sequential processing hits a limit unless parallel processing and scalability were considered in the design.

**If tuning can handle it, there is no problem, but a design that ignores scalability has a large impact later.**
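If the per-item work is independent, moving from sequential to parallel processing can be a small change. The following sketch uses concurrent.futures from the standard library; process_item is a placeholder for the real work. Threads suit I/O-bound work such as API calls or DB access, while CPU-bound work would call for ProcessPoolExecutor instead.

```python
from concurrent.futures import ThreadPoolExecutor


def process_item(item_id):
    """Placeholder for the per-item work (for example, one API call or DB update)."""
    return item_id * 2


item_ids = list(range(10))

# Sequential version
sequential_results = [process_item(i) for i in item_ids]

# Parallel version: max_workers can be raised as the volume grows.
with ThreadPoolExecutor(max_workers=4) as executor:
    # executor.map returns results in the same order as the inputs
    parallel_results = list(executor.map(process_item, item_ids))
```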

--Unnecessary libraries installed with pip

During a server migration, pip freeze is run on the source server to create requirements.txt. When installing from that requirements.txt on the new server, an error occurs for a library that is not even used.

**Do not install unnecessary libraries.**

Documentation

When the operator in charge of a system changes through a handover or similar, **document maintenance** matters all the more for keeping operation going, especially when knowledge of the system rests with particular individuals.

Suppose the system is set up to alert the operator when something goes wrong. If alerts are not configured properly, someone who suddenly takes over operation will not even notice a failure. If they are, the first step is to determine whether a notification is informational or an error.

If it is an error, the next step is to find the batch's log file. When nobody knows where it is written, the investigation has to start from whatever information the notification contains. Worse, sometimes the batch does not log at all.

When the person in charge of operation changes, there should at least be documents that give an overview of the whole system, such as a system-wide batch schedule and a list of batches. The more data linkage between systems there is, the more care is required.

Development style

There are various ways to develop in Python; the following is one example that improves development efficiency.

Build a Python image with Docker in your local environment and mount the directory containing the source into the container.

First, create a Dockerfile and build it. Then start a container with the docker run command, mounting the directory that contains the source.

--Creating a Dockerfile

FROM python:3

WORKDIR /usr/src/app

COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

CMD [ "python", "./your-daemon-or-script.py" ]

--Build

$ docker build -t python3/test .

--Starting the container

$ docker run -v <source directory>:/batch -it python3/test /bin/bash

**After that, you can edit the program files in the source directory with your editor.**

Tips

Here are some tips for batch development in Python.

config

If there are multiple environments, such as development and production, creating a file such as config.py and importing it keeps the settings from becoming complicated.

sys.argv is the list of command-line arguments passed to the Python script. sys.argv[0] is the name of the script, and sys.argv[1] is the first argument.

For example, with sys.argv you can switch settings per environment as follows. There are many other ways to check arguments based on the values in sys.argv.

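import sys

args = sys.argv
# Stop with a usage message when the environment name is missing
if len(args) < 2:
    sys.exit(f'usage: {args[0]} <env>')
env = args[1]

if env == 'local':
    pass  # load the settings for the local environment here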
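As one of the other ways to check arguments, the standard argparse module can validate the environment name and produce a usage message automatically. This is a sketch under assumed names: the SETTINGS table, its keys, and the host names are placeholders, not part of the original batch.

```python
import argparse

# Hypothetical per-environment settings; keys and values are placeholders.
SETTINGS = {
    "local": {"db_host": "localhost"},
    "dev": {"db_host": "dev-db.example.internal"},
    "prod": {"db_host": "prod-db.example.internal"},
}


def parse_env(argv=None):
    """Return the environment name; argv=None means 'read sys.argv'."""
    parser = argparse.ArgumentParser(description="Example batch")
    parser.add_argument("env", choices=sorted(SETTINGS), help="target environment")
    return parser.parse_args(argv).env


# An explicit list is passed here for illustration; a real batch would call parse_env().
env = parse_env(["dev"])
settings = SETTINGS[env]
```

An unknown environment name makes argparse print the allowed choices and exit with status 2, so a typo in a cron entry fails fast instead of running with the wrong settings.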

logfile

A batch that writes logs may fail with an error because the log file does not exist.

You can prevent this by creating the log file when the specified file does not exist.

import os
import config  # the project's own config.py, which defines base_dir

log_file = config.base_dir + 'log/batch.log'
# Create an empty log file if it does not exist yet
if not os.path.exists(log_file):
    with open(log_file, 'w') as f:
        f.write('')

Exclusion list

Add the item to the list to be processed only when it is not in the exclude list (exclude_list).

if item_id not in exclude_list:
    stock_list.append({"item_id": item_id})

Convert rows to dictionaries and store them in a list

Convert each row of the DB result (result_set) into a dictionary and add it to the list.

for row in result_set:
    row_dict = {"id": row[0], "name": row[1], "age": row[2]}
    target_list.append(row_dict)

Combine multiple lists and add to a list

With plain lists the positions can drift out of step, so it is generally better to use dictionaries.

for (z, x, y) in zip(list1, list2, list3):
    temp_list.append([z, x, y])
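To make the dictionary variant concrete, here is a small sketch; the three input lists and their contents are made up for illustration. Each value is stored under a key, so later code does not depend on remembering which position holds which field.

```python
ids = [1, 2, 3]
names = ["Tulle", "Mike", "Tama"]
ages = [0, 3, 5]

# zip pairs up the elements by position; each group becomes one dictionary.
records = [
    {"id": i, "name": n, "age": a}
    for i, n, a in zip(ids, names, ages)
]
```

Note that zip stops at the shortest input, so lists of different lengths are silently truncated; checking the lengths first is worthwhile when the data comes from different sources.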

uuid

Generating UUIDs is easy with the standard library.

import uuid

def make_sys_id():
    # uuid4() generates a random UUID
    return str(uuid.uuid4())

#Execution example
>>> make_sys_id()
'ac441afe-fc2d-4ebb-a9cf-18a49c77ec71'

hash

Compute a hash value with MD5.

import hashlib

serialized = 'hoge'
md5 = hashlib.md5(serialized.encode('utf-8')).hexdigest()

#Execution example
>>> print(md5)
ea703e7aa1efda0064eaa507d9e8ab7e

date

Examples of date handling.

import datetime
from dateutil.relativedelta import relativedelta  # third-party package: python-dateutil

# Today's date
today_tmp = datetime.date.today()
today = today_tmp.strftime('%Y%m%d')

>>> print(today)
20201225

# Tomorrow's date
tomorrow_tmp = today_tmp + datetime.timedelta(days=1)
# Yesterday's date
yesterday_tmp = today_tmp - datetime.timedelta(days=1)

# One month before tomorrow's date
one_month_before = tomorrow_tmp - relativedelta(months=1)
one_month_before = one_month_before.strftime('%Y%m%d')

>>> print(one_month_before)
20201126

# One month after yesterday's date
one_month_later = yesterday_tmp + relativedelta(months=1)
one_month_later = one_month_later.strftime('%Y%m%d')

>>> print(one_month_later)
20210124

Batch return value

Use this when you want the batch to end with an exit status that reflects the result of the processing in a try/except block. There are other ways to set the exit status as well.

import os

# Example
# Note: os._exit() ends the process immediately, without running cleanup handlers
try:
    # Contents to be processed (example: DB registration)
    pass
    # Successful completion
    os._exit(0)
except Exception:
    # Abnormal termination
    os._exit(99)
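As one of those other ways, the exit status can be computed by a function and handed to sys.exit() at the very end of the script. sys.exit() raises SystemExit, so finally blocks and buffered output are still handled, unlike with os._exit(). The sketch below uses hypothetical names (run_batch, register_records, failing_step) and reuses 99 as the error status.

```python
import logging

EXIT_OK = 0
EXIT_ERROR = 99  # same error status as in the os._exit() example


def run_batch(process):
    """Run process() and translate the outcome into an exit status."""
    try:
        process()
    except Exception:
        logging.exception("batch failed")  # logs the message with the traceback
        return EXIT_ERROR
    return EXIT_OK


def register_records():
    """Placeholder for the real work (for example, DB registration)."""


def failing_step():
    raise RuntimeError("simulated failure")


ok_status = run_batch(register_records)
error_status = run_batch(failing_step)

# In a real script the last lines would be:
#     import sys
#     sys.exit(run_batch(register_records))
```

Keeping sys.exit() outside the try block matters: a bare except would otherwise catch the SystemExit it raises.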

Self-made exception class

Create your own exception class and throw it with raise. This is indispensable for try/except handling in batch processing.

import logging
import traceback


class BatchError(Exception):
    def __init__(self, m):
        super().__init__(m)
        self.message = m

    def __str__(self):
        return self.message


# Example
try:
    # Contents to be processed (example: DB connection)
    pass
except Exception:
    e = traceback.format_exc()
    logging.error(e)
    logging.error('Processing will end because it cannot connect to the DB')
    raise BatchError("DB connection failure")

debug

An example of how to debug. My personal recommendation is the pysnooper library.

pysnooper

import pysnooper

It is easy to use: add the @pysnooper.snoop() decorator to the function you want to debug. When you run the batch, details such as the contents of variables are output.

pprint

pprint is useful when you want to display data such as JSON in a readable layout.

from pprint import pprint
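A short sketch of how this looks in practice; the record dictionary is made-up sample data. pformat returns the same layout as a string, which is handy when the output should go to a log instead of standard output.

```python
from pprint import pformat, pprint

# Made-up nested data, similar to what a decoded JSON response might look like.
record = {
    "id": 1,
    "name": "Tulle",
    "tags": ["cat", "batch"],
    "history": [
        {"date": "20201224", "status": "ok"},
        {"date": "20201225", "status": "ok"},
    ],
}

pprint(record, width=60)               # pretty-prints to standard output
formatted = pformat(record, width=60)  # same text as a string, e.g. for logging
```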

Japanese encoding measures

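Depending on the OS environment, Japanese output may fail in Python 3 when the default encoding is not UTF-8 (for example, an ANSI code page). Re-wrapping standard output as UTF-8 works around this; the line below requires importing io and sys.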

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

Other

Most introductory books you read while learning Python probably do not cover the more recent ways of writing Python. To get better at Python, you need to keep up with new information yourself.

f string

f-strings were added in Python 3.6. They are formatted string literals: substitutions that previously required the format() method can be written directly in the string literal.

>>> word = "WORLD"
>>> f'HELLO {word}'
'HELLO WORLD'

>>> from datetime import datetime
>>> today = datetime(year=2020, month=5, day=6)
>>> f"{today:%B %d, %Y}"
'May 06, 2020'

Annotation

Python is a dynamically typed language, but type annotations have been available since Python 3.5. The following code raises an error because Python does not implicitly convert between types: a str and an int cannot be concatenated.

>>> def test(word):
...  return 'Hello' + word
... 
>>> test(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in test
TypeError: can only concatenate str (not "int") to str

You can improve the maintainability of your code by using type annotations. In the following example, the argument name is expected to be a str and the return value is expected to be a str. Note that these are annotations only; no type checking is performed at run time.

>>> def greeting(name: str) -> str:
...     return 'Hello ' + name
... 
>>> greeting("apple")
'Hello apple'
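To illustrate that annotations are not enforced, here is a small sketch; the repeat function is hypothetical. A static checker such as mypy would flag the call below, but the interpreter runs it without complaint.

```python
def repeat(word: str, times: int) -> str:
    return word * times


# An int is passed where a str is annotated; Python simply computes 3 * 2.
result = repeat(3, 2)

# The annotations are stored as plain metadata on the function.
annotations = repeat.__annotations__
```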

dataclasses

dataclasses was added in Python 3.7. It provides a decorator and functions that automatically add generated special methods such as __init__() to user-defined classes.

With it, there is no need to write the conventional __init__ initialization shown below.

>>> class Animal:
...     def __init__(self, type, age, name):
...         self.type = type
...         self.age = age
...         self.name = name
... 
>>> cat = Animal("cat", 0, "Tulle")
>>> print(cat.type, cat.age, cat.name)
cat 0 Tulle

You can easily do the same with dataclasses.

>>> from dataclasses import dataclass
>>> @dataclass
... class Animal:
...     type: str
...     age: int
...     name: str
... 
>>> cat = Animal("cat", 0, "Tulle")
>>> print(cat)
Animal(type='cat', age=0, name='Tulle')
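Beyond __init__ and the readable repr, a dataclass also gets value-based equality, and the module's asdict function converts an instance to a dictionary. The sketch below extends the Animal example with a hypothetical tags field to show a default value.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class Animal:
    type: str
    age: int
    name: str
    # Mutable defaults must go through default_factory (tags is a made-up field).
    tags: list = field(default_factory=list)


cat = Animal("cat", 0, "Tulle")
same_cat = Animal("cat", 0, "Tulle")

is_equal = cat == same_cat  # compares field by field, not by identity
as_dict = asdict(cat)       # plain dict, convenient for logging or JSON output
```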

In conclusion

In the coming era, systems will increasingly be built on container technology, so approaches to batch design will change as well.

However, no matter how advanced the technology, serverless included, if operation is not taken into account, all that remains are problems and technical debt.

Technology is only a means. What matters is designing batches appropriately so that they do not get in the way of the service, in order to keep the service running as a business.
