Summary of Python implementation know-how and tips that AI engineers want to be careful about

In this article, I have summarized the points, tips, and know-how that data scientists and AI engineers should be aware of when implementing programs in Python.


● Added May 16, 2020: Mr. Shimizukawa, author of the book "Self-propelled Programmer", kindly posted a detailed supplement to this article and answers to my questions. Please see his comment on this article.


The book was published in April 2020.

【Qiita】 I wrote a book that teaches machine learning implementations and algorithms in a well-balanced manner

[Book] [Introduction to machine learning for those who aim to become AI engineers: Learn the flow of algorithms while implementing (Takuya Shimizu, Yutaro Ogawa, Gijutsu-Hyoronsha)](https://www.amazon.co.jp/dp/4297112094/)

This post is a **"summary of things AI engineers should be aware of when implementing programs in Python"**, a topic I could not cover in the book above.

The points here are simply what I, the author, pay attention to. They are not the only right answer, but I hope you find them helpful.

First I list the contents of this article, then I explain each item in turn.

Python implementation points that AI engineers want to be careful about

Level 1
1.1 Observe the naming conventions for variables, functions, classes and methods
1.2 Naming method 1: Remove redundant parts from variable names and method names
1.3 Import description follows the rules
1.4 Random number seeds are fixed to ensure reproducibility
1.5 Program is executed as a function

Level 2
2.1 Naming method 2: Name with reverse notation to make it easier to read
2.2 Being aware of S in SOLID, make functions and methods short with a single responsibility
2.3 Add type hints to functions and methods
2.4 docstring for classes, methods and functions
2.5 When saving a trained model, save information such as preprocessing and hyperparameters together.

Level 3
3.1 Naming method 3: Give a name that understands your responsibilities with appropriate English words and part of speech
3.2 Implement exception handling appropriately
3.3 Implement logs properly
3.4 Function and method arguments should be 3 or less
3.5 Use * args, ** kwargs properly

Level 4
4.1 Short if statement with ternary operator
4.2 Implement preprocessing and model classes in sklearn compliance
4.3 Make good use of decorators
4.4 Unify editor settings for team development
4.5 Prepare a template for pull request on GitHub and describe notes

Commentary

Level 1

1.1 Observe the naming conventions for variables, functions, classes and methods

Naming functions and variables is hard. How you choose names matters too, but first of all, follow the naming conventions as a rule. (The conventions are described in "PEP 8: Style Guide for Python Code".)

● Variables, functions, methods, modules ⇒ Lowercase letters only, with words separated by underscores as needed. Example: lower_case_with_underscores

● Class names ⇒ Capitalize the first letter of each word and join the words without underscores. Example: CapWords

● Private variables used only within a class ⇒ Prefix the variable name with an underscore. Example: _single_leading_underscore

● Private methods used only within a class ⇒ Prefix the method name with an underscore. Example: _single_leading_underscore(self, ...)

● Constants ⇒ Uppercase letters only, with words separated by underscores. Example: ALL_CAPS_WITH_UNDERSCORES

● Package names ⇒ Lowercase only. Example: lowers

**Remark 1**: Difference between functions and methods. A function is an independent procedure that does not belong to a class. A method is a function defined inside a class.

**Remark 2**: Modules and packages. A package is the top-level unit, and modules are the files inside a package. For example, sklearn is a package, and linear_model in sklearn.linear_model can be thought of as a module (strictly speaking it is itself a sub-package, that is, a directory of modules, but it is imported in the same way).
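As a rough illustration, the conventions above can be put together in one small module. This is only a sketch: the names MAX_RETRY_COUNT, DataLoader, and count_batches are invented for this example and do not come from any library.

```python
"""A toy module illustrating the naming conventions listed above."""

MAX_RETRY_COUNT = 3  # constant: ALL_CAPS_WITH_UNDERSCORES


class DataLoader:  # class name: CapWords
    def __init__(self, batch_size: int):
        self.batch_size = batch_size  # public variable: lower_case_with_underscores
        self._cursor = 0  # private variable: single leading underscore

    def _advance_cursor(self) -> None:  # private method: single leading underscore
        self._cursor += self.batch_size

    def load_next_batch(self) -> range:  # public method: lower_case_with_underscores
        batch = range(self._cursor, self._cursor + self.batch_size)
        self._advance_cursor()
        return batch


def count_batches(total_size: int, batch_size: int) -> int:  # function name
    return -(-total_size // batch_size)  # ceiling division
```

Note that the leading underscore is only a convention: Python does not prevent outside code from touching _cursor, it just signals that it should not.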

1.2 Naming method 1: Remove redundant parts from variable names and method names

For example, if the class class_1 has a variable max_length, do not name the member variable class_1_max_length. Simply name it max_length.

The reason is that when this member variable is accessed from outside the class, writing

class_1.class_1_max_length = 10

repeats the class name redundantly.

class_1.max_length = 10

is better.

Name member variables by imagining how they will read as "class_name.variable_name" when used.

1.3 Import description follows the rules

Here are three things to keep in mind when importing external classes and functions.

● Order of imports: Write the imports in three groups, separated by blank lines, as follows.

import standard library modules
(blank line)
import third-party packages (installed from PyPI with pip install)
(blank line)
import modules you wrote for this project
(blank line)

● How to write imports (1): Import at the module level. For example, if module module_1 of package pkg contains the class class_1, write

from pkg import module_1

rather than

from pkg.module_1 import class_1

Then use the class through the module in your program, for example my_class = module_1.class_1().

● How to write imports (2): Do not import multiple packages in a single import statement. Instead of

import pkg, pkg2

write

import pkg
import pkg2

However, importing multiple modules from one package on one line is OK. For example:

from pkg import module_1, module_2

**Remark**: My own imports are often rather sloppy, which is not good. Autoformatters and similar tools will fix up import statements for you, so I recommend using them.
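The grouping and module-level import rules above might look like this in a real file. This is a minimal sketch that uses only the standard library so that it runs anywhere. The third-party and project groups are shown as comments, and my_project is a hypothetical package name.

```python
# Group 1: standard library
import json
from concurrent import futures  # import the module, not the class inside it

# Group 2: third-party packages would go here, e.g. `import numpy as np`

# Group 3: your own project modules would go here,
# e.g. `from my_project import preprocessing` (hypothetical name)


def square_all(values):
    """Square each value in a thread pool, reaching the class via its module."""
    with futures.ThreadPoolExecutor(max_workers=2) as executor:
        return list(executor.map(lambda v: v * v, values))


print(json.dumps(square_all([1, 2, 3])))  # [1, 4, 9]
```

Writing futures.ThreadPoolExecutor at the call site shows the reader where the class comes from, which is the benefit of importing at the module level.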

1.4 Random number seeds are fixed to ensure reproducibility

Data science and AI implementations involve many sources of randomness, so always fix the random number seeds to make the program reproducible.

The implementation example is as follows.


import os
import random
import numpy as np
import torch

SEED_VALUE = 1234  #This can be anything
os.environ['PYTHONHASHSEED'] = str(SEED_VALUE)
random.seed(SEED_VALUE)
np.random.seed(SEED_VALUE)
torch.manual_seed(SEED_VALUE)  #When using PyTorch

However, if you use a GPU with PyTorch, you also need to set

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

Without these settings, reproducibility is not guaranteed when running on the GPU.

Note that torch.backends.cudnn.deterministic = True slows down computation on the GPU. For that reason, I personally give priority to execution speed and do not insist on reproducible training on the GPU.

For details on ensuring reproducibility with PyTorch, see the "Reproducibility" page of the PyTorch documentation.

In scikit-learn, each algorithm that uses randomness accepts a random seed (the random_state argument), so fix that as well.

Example: scikit-learn

from sklearn.linear_model import LogisticRegression

SEED_VALUE = 1234  #This can be anything
clf = LogisticRegression(random_state=SEED_VALUE)
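To see what fixing a seed buys you without installing anything, here is a small sketch using only the standard library's random module. The helper draw_samples is invented for this illustration. The same idea applies to the NumPy, PyTorch, and scikit-learn seeds above.

```python
import random

SEED_VALUE = 1234  # any fixed integer works


def draw_samples(seed: int, n: int = 5) -> list:
    """Return n pseudo-random floats after fixing the global seed."""
    random.seed(seed)
    return [random.random() for _ in range(n)]


first_run = draw_samples(SEED_VALUE)
second_run = draw_samples(SEED_VALUE)
print(first_run == second_run)  # True: same seed, same sequence
```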

1.5 Program is executed as a function

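Code written directly at the top level of a script or notebook cell, rather than inside a function, runs slower, whether you run it in Jupyter Notebook or from the Python command line. (In CPython, local variables inside a function are accessed faster than global variables.)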

For example, on Google Colaboratory

import time
import numpy as np

start = time.time()

data = np.random.rand(5000)
sum = 0

for i in range(len(data)):
    for j in range(len(data)):
        sum += data[j]

elapsed_time = time.time() - start
print(elapsed_time)

This took 8.4 seconds to run.

Here the whole program is written and executed directly in a single cell.

Next, let's run it in a single cell as before, but wrap the main for loops in a function.

import time
import numpy as np


def main():
    data = np.random.rand(5000)
    sum = 0

    for i in range(len(data)):
        for j in range(len(data)):
            sum += data[j]
    return 0


start = time.time()
main()
elapsed_time = time.time() - start
print(elapsed_time)

The result is 6.2 seconds. What took about 8 seconds now takes about 6.

So even in Jupyter Notebook, do not run long-running processing as flat top-level code. Wrap it in a function such as main() and call that function.

The same applies when implementing a Python file that is run from the command line, for example hogehoge.py.

To keep it from being slow when run with python hogehoge.py, write hogehoge.py as follows.

# imports (fuga is a placeholder for whatever modules you need)
import fuga

# Functions and classes used within the main function
def piyo():
    pass  # your code here

# main function
def main():
    pass  # your code here

if __name__ == "__main__":
    main()

At the end, the call to main() is placed inside the statement if __name__ == "__main__":.

This prevents main() from being executed when another program runs import hogehoge.

Without this if statement, main() would be executed merely by importing the file.

Level 2

2.1 Naming method 2: Name with reverse notation to make it easier to read

For example, if you need three variables for the length of a, the length of b, and the length of c, name them length_a, length_b, length_c rather than a_length, b_length, c_length.

This is called reverse notation.

With reverse notation the variable names all begin with the same word, which makes the program easier to read.

Conversely, if your focus is on a rather than on length, write a_length, a_width, a_max_length.

My rule of thumb is: "Put the thing you are focusing on first, so that related variables share the same beginning."

2.2 Be aware of S in SOLID and keep the number of lines of functions and methods as short as possible.

SOLID is a set of software design principles advocated by Robert C. Martin (one of the authors of the Manifesto for Agile Software Development), who wrote books such as [Agile Software Development: Principles, Patterns, and Practices, Clean Code, Clean Coder, and Clean Architecture](https://www.amazon.co.jp/%E6%9C%AC-%E3%83%AD%E3%83%90%E3%83%BC%E3%83%88%E3%83%BBC%E3%83%BB%E3%83%9E%E3%83%BC%E3%83%81%E3%83%B3/s?rh=n%3A465392%2Cp_27%3A%E3%83%AD%E3%83%90%E3%83%BC%E3%83%88%E3%83%BBC%E3%83%BB%E3%83%9E%E3%83%BC%E3%83%81%E3%83%B3).

At the level of this post, you do not need to keep all of SOLID in mind.

However, do pay close attention to the first letter of SOLID, **S: Single responsibility principle**.

Single responsibility means "a function, class, or method should fulfill only one responsibility."

How large **"single"** is, that is, the right balance between abstraction and concreteness, is hard to define. **The point is that functions, classes, and methods should be as short as possible and should not be affected by changes in higher-level concepts.**

The work of data scientists and AI engineers is highly procedural and flow-like: "data preprocessing, training, inference, ...".

As a result, the implemented program also becomes procedural, and a single main class or method becomes bloated. One class or method ends up doing many things from top to bottom (it has many responsibilities).

That is acceptable as long as the work stays inside Jupyter Notebook, but code in this state makes it hard to bring data science and AI into system development.

AI systems are SoE (Systems of Engagement) and are mostly developed in an agile manner, unlike SoR (Systems of Record), which are built with waterfall development where requirements and external/internal design are fixed precisely up front.

In agile development, CI (Continuous Integration), which in practice means automated testing, is a basic practice. And because it is agile, you actually build something, check how it works at the prototype level, find improvements (kaizen), and aim for something better.

When making such kaizen changes, the larger the responsibility of a single class or method, the more lines of code must be rewritten.

The more lines of code you rewrite, the wider the impact.

Many new unit tests then have to be written.

At the same time, many of the unit tests you already wrote are thrown away.

**If development proceeds while unit tests are frequently replaced on this scale, unit tests stop being written properly and the quality of the system deteriorates.**

In addition, kaizen rewrites affect unexpected places and make bugs more likely.

Data scientists and AI engineers should also actively embrace kaizen in agile development. Keep the functions, classes, and methods you implement short, each with a single responsibility, and aim for an **implementation that is resilient to kaizen changes**.

Rather than one long function, class, or method, **it is preferable to have so many short functions, classes, and methods that you worry they might be too short**.

Some books and articles say that "a function or method should be at most 5 lines long." I think a 5-line limit is too short and strict for AI implementations and is counterproductive.

Instead of a line-count rule, aim for a length that fulfills a single responsibility, an implementation where kaizen changes have a narrow impact, and an implementation where unit tests rarely have to be discarded.

One trick is to state the responsibility in a comment at the top of the function, class, or method: """This method is responsible for ●●."""

Writing the responsibility down explicitly is good practice at first.
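As a rough sketch of what this looks like in practice, the "preprocessing, prediction" flow can be split into short functions that each state their responsibility. Everything here is a toy invented for illustration, including the function names, the sample data, and the threshold rule. It is not taken from a real pipeline.

```python
from statistics import mean, pstdev


def load_values(raw_lines):
    """This function is responsible for parsing raw text lines into floats."""
    return [float(line) for line in raw_lines if line.strip()]


def standardize(values):
    """This function is responsible for scaling values to zero mean, unit variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def predict_labels(features, threshold=0.0):
    """This function is responsible for turning features into 0/1 labels."""
    return [1 if f > threshold else 0 for f in features]


def main():
    """This function is responsible only for wiring the steps together."""
    raw_lines = ["1.0", "2.0", "", "3.0", "6.0"]
    return predict_labels(standardize(load_values(raw_lines)))


print(main())  # [0, 0, 0, 1]
```

If the scaling rule changes during kaizen, only standardize and its unit test need to be touched. The parsing and labeling tests survive unchanged.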

2.3 Add type hints to functions and methods

If you split functions and methods with "the S of SOLID: single responsibility" in mind, you end up with many functions and methods.

Many functions and methods are hard to keep track of. It is fine while you are writing the code, but when you revisit it three months later, or when someone else tries to use it, you get confused:

"What goes into the arguments of this function, and what comes out?"

Therefore, add type hints when implementing a function. Type hints are written as follows.

def calc_billing_amount(amount: int, price: int) -> int:
    billing_amount = amount*price
    return billing_amount

Write the type after each argument name so that the type of the argument is clear.

Likewise, write the return type so that the type of the function's output is clear.

As the name implies, a type hint is only a hint and is not enforced. So in the function above you can pass a float instead of an int as the first argument, amount:

calc_billing_amount(0.5, 100)

This runs without any error. Be careful about this point.
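The following short sketch confirms this behavior. It repeats the function from above and shows that the hints are stored on the function as metadata (in its __annotations__ attribute) rather than checked at runtime. Catching such mismatches would require a separate static checker such as mypy, which this article does not otherwise cover.

```python
def calc_billing_amount(amount: int, price: int) -> int:
    return amount * price


# The hints say int, but Python does not check them at runtime.
result = calc_billing_amount(0.5, 100)
print(result, type(result).__name__)  # 50.0 float

# The hints are still available to tools through __annotations__.
print(calc_billing_amount.__annotations__)
```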

To use a list or a dictionary in a type hint, or to allow several types (for example, either int or float), write as follows.

from typing import Dict, List, Union


def calc_billing_amount(
    amount_list: List[int], price_dictionary: Dict[str, Union[int, float]]
) -> int:
    billing_amount = 0
    for index, (key, value) in enumerate(price_dictionary.items()):
        billing_amount += amount_list[index] * value

    return int(billing_amount)

from typing import List, Dict, Union imports List and Dict for type hints on lists and dictionaries, and Union for cases where any of several types is acceptable.

For example, a list whose elements are int is written List[int]. A dictionary whose keys are str and whose values may be either int or float is written Dict[str, Union[int, float]].

Running

amount = [3, 10]
price = {"item1": 100, "item2": 30.5}
calc_billing_amount(amount, price)

returns 605.

To use a class you defined yourself as a type hint, write as follows.

class User:
    def __init__(self, name: str, user_type: str):
        self.name = name
        self.user_type = user_type


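def print_user_type(user: "User") -> None:
    print(user.user_type)

Here I define my own class User and a function print_user_type that takes a User as its argument. (The function only prints and returns nothing, so its return type is None.)

With Python 3.7 or later you can write User instead of "User" by adding from __future__ import annotations, but at the time of writing Google Colaboratory still runs Python 3.6, so I recommend the style above. (Strictly speaking, the quotes are only required when the class is referenced before it has been defined, for example inside the class itself.)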

Use the class and function with these type hints as usual:

taro = User("taro", "admin")
print_user_type(taro)

The output shows admin.

Writing type hints is a hassle, but if you split your implementation into many single-responsibility classes and methods and leave the hints out, you will be in trouble later.

At work in particular, team members will often read and use the code you write. **Keep in mind an implementation that others can easily use and improve (kaizen).**

2.4 docstring for classes, methods and functions

A docstring is a description of the specification and usage of a class, method, or function.

With many single-responsibility classes and functions, it becomes hard to understand them when reusing them.

Type hints alone are not enough, so write a docstring as a more detailed explanation.

That said, it is tedious, so I think a one-line docstring is fine for private methods and for methods with only a few lines.

On the other hand, for classes and methods used by other team members, main classes that play an important role in the AI system, classes with long processing, and so on, write a detailed docstring whenever a detailed explanation would make them easier for others to use.

You can write docstrings in any way you like, but usually one of the following styles is used:

・reStructuredText
・Google style
・NumPy style

The reason for using one of these three is that Sphinx can then generate documentation automatically. So use a docstring notation that Sphinx supports.

For a detailed explanation of the three styles, see "Example of 3 styles of docstring".

I prefer reStructuredText because Google style and NumPy style tend to take up many lines.

The docstring in reStructuredText is written as follows.

class User:
    """This class represents an account user of this system.

    :param name: User account name
    :param user_type: Account type (admin or normal)

    :Example:

    >>> taro = User("taro", "admin")
    """

    def __init__(self, name: str, user_type: str):
        self.name = name
        self.user_type = user_type

    def print_user_type(self) -> None:
        """Print the user type with a print statement.

        :return: None (user_type is printed as a string)

        :Example:

        >>> taro = User("taro", "admin")
        >>> taro.print_user_type()
        admin
        """
        print(self.user_type)

Including an Example shows others immediately how to use the code, which makes it easy to use, so I like it. How much detail to write for examples, arguments, and return values depends on the situation.

With docstrings written as above, editors such as VS Code display the docstring when you hover the mouse cursor over the relevant part of the program, as in the figure below, which makes the program easier to understand. (In the figure, the mouse cursor is over print_user_type().)

(Figure 1: the docstring displayed on hover in VS Code)
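For comparison, here is a sketch of the same kind of information written in Google style, one of the other two styles mentioned above. The function format_user_label is made up for this illustration. As an optional extra, the sketch also runs the docstring's Example with the standard doctest module, which is one way to notice when an example has gone stale.

```python
import doctest


def format_user_label(name: str, user_type: str = "normal") -> str:
    """Build a display label for an account user.

    Args:
        name: User account name.
        user_type: Account type ("admin" or "normal").

    Returns:
        A label of the form "name (user_type)".

    Example:
        >>> format_user_label("taro", "admin")
        'taro (admin)'
    """
    return f"{name} ({user_type})"


# Parse the docstring into a doctest and run it against the real function.
test = doctest.DocTestParser().get_doctest(
    format_user_label.__doc__,
    {"format_user_label": format_user_label},  # names visible to the example
    "format_user_label",
    None,
    0,
)
results = doctest.DocTestRunner(verbose=False).run(test)
print(results.failed, results.attempted)  # 0 failures out of 1 example
```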

2.5 When saving a trained model, save information such as preprocessing and hyperparameters together.

In AI, machine learning, and deep learning, do not save only the trained model. Also save the information needed to reproduce the training, such as the preprocessing pipeline and the model's hyperparameter settings, together with every object required for inference.

How you save it does not matter as long as all of this information is included. Here are examples of saving with scikit-learn and PyTorch.

Example: for scikit-learn

from datetime import datetime, timedelta, timezone

import numpy as np
from joblib import dump, load
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Prepare iris as sample data
X, y = load_iris(return_X_y=True)

# Preprocessing (standardize, then add squared-term features)
preprocess_pipeline = Pipeline(steps=[("standard_scaler", StandardScaler())])
preprocess_pipeline.steps.append(("polynomial_features_2", PolynomialFeatures(2)))

# Apply the preprocessing
X_preprocessed = preprocess_pipeline.fit_transform(X)

# Prepare the estimator
C = 1.2  # Hyperparameter setting
model = LogisticRegression(random_state=0, C=C)

# Train
model.fit(X_preprocessed, y)

# Performance on the training data
accuracy_training = model.score(X_preprocessed, y)

# Prepare to save the various data
JST = timezone(timedelta(hours=+9), "JST")  # Japan Standard Time
now = datetime.now(JST).strftime("%Y%m%d_%H%M%S")  # Current time

training_info = {
    "training_data": "iris",
    "model_type": "LogisticRegression",
    "hyper_param_logreg_C": C,
    "accuracy_training": accuracy_training,
    "save_date": now,
}

save_data = {
    "preprocess_pipeline": preprocess_pipeline,
    "trained_model": model,
    "training_info": training_info,
}
filename = "./iris_model_" + now + ".joblib"

# Save
dump(save_data, filename)

To load what was saved this way, write

load_data = load(filename)

# Unpack the loaded contents
preprocess_pipeline = load_data["preprocess_pipeline"]
model = load_data["trained_model"]
print(load_data["training_info"])

In this example training_info is printed, so the output is {'training_data': 'iris', 'model_type': 'LogisticRegression', 'hyper_param_logreg_C': 1.2, 'accuracy_training': 0.9866666666666667, 'save_date': '20200503_205145'}.

Example: for PyTorch (reference: the PyTorch tutorial "Saving and Loading Models")

PATH = './checkpoint_' + str(epoch) + '.pt'

torch.save({
    'epoch': epoch,
    'total_epoch': total_epoch,
    'model_state_dict': model.state_dict(),
    'scheduler_state_dict': scheduler.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss_train': loss_train,
    'loss_eval': loss_eval,
}, PATH)

When loading

#Create objects such as models first
model = TheModelClass()  #This is the same model I saved
scheduler = TheSchedulerClass()  #This is the same Scheduler Class you saved
optimizer = TheOptimizerClass()  #This is the same Optimizer Class you saved

#Load and give
checkpoint = torch.load(PATH)
model.load_state_dict(checkpoint['model_state_dict'])
scheduler.load_state_dict(checkpoint['scheduler_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
total_epoch = checkpoint['total_epoch']
epoch = checkpoint['epoch']
loss_train = checkpoint['loss_train']
loss_eval = checkpoint['loss_eval']

# Switch the mode depending on whether you will continue training or run inference
model.train()
# model.eval()

If you save the deep learning dataset and data loader to checkpoint, the save file will be too big, so save them separately.

#Data set, data loader storage
torch.save(trainset, './trainset.pt')
torch.save(trainloader, './dataloader.pt')

When loading

trainset = torch.load('./trainset.pt')
trainloader = torch.load('./dataloader.pt')

That is all for saving and loading in PyTorch.

Whether you use scikit-learn or PyTorch, the exact way you save does not matter.

However, be careful not to end up saving only the model and later finding yourself saying:

**"How was this model trained!?"**
**"How well does this model perform!?"**
**"I don't have the fitted preprocessing pipeline needed to feed data into this model!!"**

**Remark 1**: In the scikit-learn example, preprocessing and the model are not combined into one pipeline because preprocessing may be carried out on a different resource and its result then sent to the model API. In particular, when you want to process large amounts of data quickly, keeping preprocessing separate makes it easy to distribute only the preprocessing elsewhere. There is also the case where you want to reuse just this preprocessing pipeline, unchanged, when training different models.

**Remark 2**: I will explain later what to watch out for when creating your own preprocessing class. For now, note that when loading a preprocessing pipeline that uses your own classes, you must import not only the standard scikit-learn preprocessing classes but also your own classes before loading. If a class contained in the saved object has not been imported at load time, loading fails with an error.
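One way to guard against the "I saved only the model" situation is a small save helper that refuses to write an incomplete bundle. The sketch below is an assumption-laden illustration, not the method used above: it swaps joblib for the standard library's pickle so that it runs without extra packages, the helper names save_bundle and load_bundle are invented, and plain dictionaries stand in for a fitted pipeline and model.

```python
import pickle
import tempfile
from pathlib import Path

REQUIRED_KEYS = ("preprocess_pipeline", "trained_model", "training_info")


def save_bundle(bundle: dict, path: Path) -> None:
    """Refuse to save unless every required entry is present."""
    missing = [key for key in REQUIRED_KEYS if key not in bundle]
    if missing:
        raise ValueError(f"bundle is missing: {missing}")
    with open(path, "wb") as f:
        pickle.dump(bundle, f)


def load_bundle(path: Path) -> dict:
    """Load a bundle previously written by save_bundle."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Plain Python objects stand in for a fitted pipeline and a trained model.
bundle = {
    "preprocess_pipeline": {"mean": 5.8, "scale": 0.8},
    "trained_model": {"coef": [0.4, -1.2]},
    "training_info": {"training_data": "iris", "hyper_param_C": 1.2},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "model_bundle.pkl"
    save_bundle(bundle, path)
    restored = load_bundle(path)

print(restored["training_info"])
```

As with joblib, only load files you created yourself, because unpickling untrusted data can execute arbitrary code.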

Level 3

3.1 Naming method 3: Give a name that understands your responsibilities with appropriate English words and part of speech

The more finely you split classes, methods, and functions, the more important their names become.

Ideally, the name alone should tell the reader "What does this do? In other words, what is its responsibility, what goes in, and what comes out?"

However, this is hard for Japanese speakers.

There is a common view in programming that comments are unnecessary because well-named code serves as its own comment, but it is hard for Japanese speakers to grasp the differences and nuances between English words.

Also, in data science and AI the algorithms themselves are complex, so it is hard for a first-time reader to understand what a class or method does from an uncommented implementation.

Even so, choose names that convey the meaning as clearly as possible.

At a minimum, I want to follow these rules:

[1] Use nouns for class names and variable names

[2] Start method and function names with a verb

[3] Boolean member variables may start with a word such as is, has, or can. (Example) is_admin, has_item, can_drive, etc.

[4] Use a naming tool such as codic: https://codic.jp/engine

(Screenshot: codic)
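Rules [1] to [3] might be applied as in the following sketch. The class TrainingJob and the function can_resume are hypothetical names chosen only to show the parts of speech.

```python
class TrainingJob:  # noun: what the object is
    def __init__(self, epoch_count: int):
        self.epoch_count = epoch_count  # noun
        self.is_finished = False  # boolean: reads as a yes/no question
        self.has_checkpoint = False  # boolean: has_...

    def run_epoch(self) -> None:  # verb first: what the method does
        self.epoch_count -= 1
        self.is_finished = self.epoch_count <= 0

    def save_checkpoint(self) -> None:  # verb first
        self.has_checkpoint = True


def can_resume(job: TrainingJob) -> bool:  # function returning a boolean: can_...
    return job.has_checkpoint and not job.is_finished
```

With names like these, a call site such as if can_resume(job): job.run_epoch() reads almost like an English sentence.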

3.2 Implement exception handling appropriately

Error (exception) handling with try-except is tedious and not fun at all, and (I think) it is a pain for data scientists and AI engineers.

It does not matter if the project never leaves Jupyter Notebook, but error handling is important when building AI into a system. An unhandled exception stops the entire system process.

Therefore, in implementation code, make sure there is a try at the top level (that is, avoid having code that runs outside any try:).

Be sure to read the official Python tutorial: "8. Errors and Exceptions".

Example: When defining a function to divide

def func_division(a, b):
    ret = a/b
    return ret

With this definition, a call such as ans = func_division(10, 0) raises an error and the entire program stops.

So write it like this instead:

def func_division(a, b):
    try:
        ret = a / b
        return ret
    except Exception:
        print("Exception occurred")

However, this is still not enough. If you can predict which exceptions the processing may raise, catch each of them explicitly and handle it appropriately.

def func_division(a, b):
    try:
        ret = a / b
        return ret
    except ZeroDivisionError as err:
        print('A division by zero exception has occurred:', err)
    except Exception:
        print("An unexpected exception has occurred")

With this definition,

ans = func_division(10, 0) prints "A division by zero exception has occurred: division by zero", and

ans = func_division("hoge", "fuga"), where non-numeric input is given, prints "An unexpected exception has occurred".

AI will be built into real systems more and more, so (I think) people who can only write programs at the Jupyter Notebook level will struggle.

The point is to keep in mind that "a try is at the top level (that is, avoid having code that runs outside any try:)".

**Remark**: Keep the scope of each try narrow. try-except is tedious, but that does not mean you should put a lot of processing (many lines) into one try, because then you cannot tell where the exception occurred. Wrap one meaningful unit of processing in each try-except. Also, when a method has many lines, it is often a sign that it should be split into separate methods.
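The two ideas, a narrow try around the one step that can fail and a try at the top level, can be combined as in this sketch. The function parse_learning_rate and its default value are invented for illustration. Re-raising at the top level is only one possible policy, and a real service might return an error response instead.

```python
import logging

logger = logging.getLogger(__name__)


def parse_learning_rate(text: str) -> float:
    """Narrow try: only the conversion that can fail is inside it."""
    try:
        value = float(text)
    except ValueError:
        logger.warning("could not parse learning rate %r, using default", text)
        return 0.01
    return value


def main(raw_values):
    """Top-level try: no exception leaves the program unrecorded."""
    try:
        return [parse_learning_rate(v) for v in raw_values]
    except Exception:
        # logger.exception records the message together with the traceback
        logger.exception("unexpected error in main")
        raise


print(main(["0.1", "abc"]))  # [0.1, 0.01]
```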

3.3 Implement logs properly

Data scientists and AI engineers commonly check state by printing with print statements, but print statements become a nuisance once the code is built into a system, so write proper logs instead.

Roughly, logging is used as follows.

import logging

logger = logging.getLogger(__name__)

#Create a value to put in log appropriately
total_epoch = 1000
epoch = 100
loss_train = 5.44444

#Contents to be recorded in log
log_list = [total_epoch, epoch, loss_train]

#Record in log
logger.info(
    "total_epoch: {0[0]}, epoch: {0[1]}, loss_train: {0[2]:.2f}".format(log_list)
)

#Output the contents recorded in the log and check (for confirmation now. Originally unnecessary)
print("total_epoch: {0[0]}, epoch: {0[1]}, loss_train: {0[2]:.2f}".format(log_list))

Checking the logged content with the print statement gives total_epoch: 1000, epoch: 100, loss_train: 5.44.

Here, {0[2]:.2f} means "take element 2 of the list passed to .format and show it to two decimal places."

There are many ways to write this in Python, for loggers and print statements alike. When there are many variables to write out, I put them in a list as above.

Besides logger.info there are logger.debug, logger.warning, logger.error, and so on. Choose the log level according to the situation.

Something like the example above is enough as a minimum, but logging is a deep topic.

It is a good idea to read the official logging documentation as well.

Python Logging
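To make the log levels concrete, here is a small self-contained sketch. It writes to an in-memory stream instead of a file so that the result can be printed. The logger name "training" and the messages are made up for the example.

```python
import io
import logging

log_stream = io.StringIO()  # stand-in for a log file
handler = logging.StreamHandler(log_stream)
handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))

logger = logging.getLogger("training")
logger.setLevel(logging.INFO)  # DEBUG is dropped, INFO and above are kept
logger.addHandler(handler)
logger.propagate = False  # keep these records out of the root logger

logger.debug("batch shapes look fine")  # below the threshold: not recorded
logger.info("epoch: %d, loss_train: %.2f", 100, 5.44444)
logger.warning("loss_eval is increasing")
logger.error("checkpoint file not found")

print(log_stream.getvalue())
# INFO:training:epoch: 100, loss_train: 5.44
# WARNING:training:loss_eval is increasing
# ERROR:training:checkpoint file not found
```

Passing the values as arguments (the %d and %.2f placeholders) lets the logging module skip the string formatting entirely when a message is filtered out by its level.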


Additional advice from @paulxll:

I received an example using f-strings (assuming Python 3.8). Please refer to it as well.

#Create a value to put in log appropriately
total_epoch = 1000
epoch = 100
loss_train = 5.44444

#Record in log
logger.info(f"{total_epoch=}, {epoch=}, {loss_train=:.2f}")

Remark: The above requires Python 3.8 or later. For version 3.7 or earlier, write

logger.info(f"total_epoch: {total_epoch}, epoch: {epoch}, loss_train: {loss_train:.2f}")

(Thank you for the advice.)


3.4 Function and method arguments should be 3 or less

Keep the number of arguments of a function or method to three at most. I avoid four or more.

With many arguments, it becomes hard to understand how to use the function, and preparing and maintaining unit tests becomes a burden.

If you need to pass many arguments, put them in a dictionary variable such as hogehoge_config and pass that single dictionary to the function.

Example: a function with a complicated calculation (I explained above that you should write exception handling, but I omit it here for brevity)

def func_many_calculation(a, b, c, d, e):
    ret = a*b*c/d/e
    return ret

If this is defined and called as ans = func_many_calculation(10, 2, 3, 5, 2), there are too many arguments and it is cumbersome. It can also be a source of mistakes.

So, for example, define it as

def func_many_calculation(func_config):
    a = func_config["a"]
    b = func_config["b"]
    c = func_config["c"]
    d = func_config["d"]
    e = func_config["e"]
    ret = a*b*c/d/e
    return ret

Then put the values to pass into a dictionary, func_config = {"a": 10, "b": 2, "c": 3, "d": 5, "e": 2},

and call it as ans = func_many_calculation(func_config).

However, this makes the function definition overly verbose, so define it as

def func_many_calculation(a, b, c, d, e):
    ret = a*b*c/d/e
    return ret

That is, write out all the arguments in the function definition,

and reduce the number of arguments at the call site:

func_config = {"a": 10, "b": 2, "c": 3, "d": 5, "e": 2}
ans = func_many_calculation(**func_config)

This is a good approach.

**Remark**: The notation **func_config may look unfamiliar. Here ** is the unpacking operator for dictionaries. func_many_calculation(**func_config) means func_many_calculation(a=func_config["a"], b=func_config["b"], ..., e=func_config["e"]), that is, each key is passed as a keyword argument. Unpacking dictionaries with ** is convenient, so learn to use it.
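A quick way to convince yourself of what ** does is to compare it with the equivalent keyword call, as in this small sketch. The extra key f is deliberately invalid and is only there to show the error.

```python
def func_many_calculation(a, b, c, d, e):
    return a * b * c / d / e


# Keys are matched to parameter names, so their order does not matter.
func_config = {"e": 2, "d": 5, "c": 3, "b": 2, "a": 10}

ans_unpacked = func_many_calculation(**func_config)
ans_keywords = func_many_calculation(a=10, b=2, c=3, d=5, e=2)
print(ans_unpacked, ans_unpacked == ans_keywords)  # 6.0 True

# A key that is not a parameter name is rejected with TypeError.
try:
    func_many_calculation(**func_config, f=1)
except TypeError as err:
    print("TypeError:", err)
```

Because the matching is by name, a typo in a dictionary key fails loudly at the call instead of silently shifting values into the wrong positions.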

3.5 Use * args, ** kwargs properly

Surely no data scientist or AI engineer says, "I don't know what *args and **kwargs are, but I use them all the time!"

On the other hand, I think many have had this experience:

"I don't really know what they are, but when I look at repositories that implement papers, I see *args and **kwargs a lot."

So let's make friends with *args and **kwargs.

Now that the unpacking of dictionary variables has been explained in 3.4, *args and **kwargs are nothing to be afraid of.

args is an abbreviation for arguments, and kwargs is an abbreviation for keyword arguments.

Here * is the unpacking operation for list variables (and other sequences such as tuples), and ** is the unpacking operation for dictionary variables, as explained in 3.4.
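Before looking at function definitions, here is a minimal sketch of * unpacking at the call site. The helper add_three is a hypothetical function introduced only for this illustration.

```python
def add_three(x, y, z):
    """Hypothetical helper: add three numbers."""
    return x + y + z


values = [1, 2, 3]

# add_three(*values) is the same call as add_three(1, 2, 3).
total_from_list = add_three(*values)

# * and ** can be combined: positional values first, then keyword values.
total_mixed = add_three(*[1, 2], **{"z": 3})
print(total_from_list, total_mixed)
```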

When * and ** are used on the parameters of a function definition, they behave as follows.

For example

def func_args_kwargs(*args, **kwargs):
    print(args)
    if len(args) >= 2:
        print(args[1])
    print(kwargs)
    flg_a = kwargs.pop("flg_a", False)
    print(flg_a)

If you define a function like this, then

func_args_kwargs(10, 20)

puts the input arguments 10 and 20 into args, and the output is

(10, 20)
20
{}
False

Next, if we also pass a dictionary and execute

func_args_kwargs(10, **{"flg_a": True})

the positional argument goes into args and the unpacked dictionary goes into kwargs, so the output is

(10,)
{'flg_a': True}
True

*args and **kwargs are called variable-length (variadic) arguments.

I will explain why variable-length arguments such as *args and **kwargs are used.

There are three reasons to use them.

The first reason is that the function can still be executed even if extra arguments are passed to it.

For example

def func_args_kwargs2(a, *args, **kwargs):
    print(a)

When you run func_args_kwargs2(10, 20, 30), the output is 10, and it runs without error.

The second reason is that if you extend the function later and need more arguments, you can receive them with *args and **kwargs without rewriting the function's argument definition.

The third reason is to receive optional arguments, which may or may not be passed when the function or method is called, through *args and **kwargs.

In this case, set a default value for when the argument is not received.

Example:

def func_args_kwargs3(a, *args, **kwargs):
    b = kwargs.pop("b", 2.0)
    print(a*b)

With this definition,

func_args_kwargs3(3.0)

outputs 6.0. The default value 2.0 is used for b inside the function.

func_args_kwargs3(3, **{"b": 4.0})

outputs 12.0, because b inside the function is 4.0, the value given as an argument.

That is how *args and **kwargs work.

**Remarks**: Personally, I rarely use *args and **kwargs in my own implementations. They are common in OSS and in published paper implementations. In that kind of development many people use the code, so there is a high degree of uncertainty. I suspect that, as in the first reason, *args and **kwargs are used so that the code still works when unnecessary arguments are passed.
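One place where *args and **kwargs tend to appear in published code is a wrapper that passes its arguments through to another function unchanged. The following is a minimal sketch of that pattern; train_model and run_with_log are hypothetical names and do not come from any particular library.

```python
def train_model(data, epochs=10, lr=0.01):
    """Hypothetical training function that just reports its settings."""
    return {"n_samples": len(data), "epochs": epochs, "lr": lr}


def run_with_log(func, *args, **kwargs):
    """Call func with whatever arguments were given, logging the call."""
    print(f"calling {func.__name__} with args={args} kwargs={kwargs}")
    return func(*args, **kwargs)


# The wrapper does not need to know the parameters of train_model.
result = run_with_log(train_model, [1, 2, 3], epochs=5)
print(result)
```

If train_model gains a new parameter later, run_with_log does not have to change, which matches the second reason above.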

Level 4

4.1 Get used to the if statement with the ternary operator

In Python, short if statements are often written on one line with the ternary operator (conditional expression). Let's get used to it. (I don't use it much when writing books, but I do use it in implementation code because it reduces the number of lines and makes the code easier to read.)

# Determine whether it is even or odd
num = 10

if num % 2 == 0:
    print("Even")
else:
    print("Odd")

The above can be written on one line as follows.

# Determine whether it is even or odd
num = 10

print("Even") if num % 2 == 0 else print("Odd")

The print example above is not a good one. For an assignment, it looks like this:

num = 10
a = "Even" if num % 2 == 0 else "Odd"
# a = "Even" if num % 2 == 0 else a = "Odd"  # This raises a SyntaxError

print(a)

Additional advice from @Nabetani:

print("Even") if num % 2 == 0 else print("Odd")
# Rather than the line above, I feel that
print("Even" if num % 2 == 0 else "Odd")
# is preferable.

In my case:

- Use foo if cond else bar when you need the value of the conditional expression.
- Avoid foo if cond else bar when you do not need the value, and use if cond: foo else: bar instead.

That is the advice.
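To illustrate the advice above, the following sketch uses a conditional expression where the value is needed (inside a list comprehension) and a normal if statement where only a side effect is wanted. The variable names are chosen for illustration.

```python
nums = [1, 2, 3, 4]

# The value is needed: a conditional expression fits well.
labels = ["Even" if n % 2 == 0 else "Odd" for n in nums]
print(labels)

# Only a side effect is wanted: a normal if statement reads better.
warnings = []
for n in nums:
    if n > 3:
        warnings.append(f"{n} is larger than 3")
print(warnings)
```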


4.2 Implement preprocessing and model classes in sklearn compliance

An sklearn-compliant class is one that inherits scikit-learn's BaseEstimator, TransformerMixin, ClassifierMixin, etc. and is implemented so that it can be handled by scikit-learn's Pipeline class together with other scikit-learn objects.

If you make your own preprocessing classes and models sklearn compliant, they are convenient because they can be incorporated into a scikit-learn Pipeline.

For example, a preprocessing class is written as follows. It inherits TransformerMixin and BaseEstimator.

from sklearn.base import BaseEstimator, ClassifierMixin, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted, check_X_y


class TemplateTransformer(TransformerMixin, BaseEstimator):
    """Sample pre-processing class"""

    def __init__(self, demo_param='demo'):
        self.demo_param = demo_param  # Parameters used later are stored in __init__

    def fit(self, X, y=None):
        """Learn what is needed for preprocessing. y is accepted as y=None even when it is not used."""

        X = check_array(X, accept_sparse=True)  # check_array validates the input the sklearn way

        # Learn something here. As an example, the parameter n_features_ is learned
        self.n_features_ = X.shape[1]

        # Return the fitted transformer itself
        return self

    def transform(self, X):
        """Apply preprocessing to argument X"""

        # Check that the parameter learned in fit (here n_features_) exists
        check_is_fitted(self, 'n_features_')

        # Validate the input the sklearn way
        X = check_array(X, accept_sparse=True)

        # Some conversion process (hogehoge is a placeholder for your own function)
        X_transformed = hogehoge(X)

        return X_transformed

For a model class, inherit ClassifierMixin and BaseEstimator. The following sketches supervised learning, but unsupervised learning is written in the same way.

from sklearn.base import BaseEstimator, ClassifierMixin, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted, check_X_y


class TemplateClassifier(ClassifierMixin, BaseEstimator):
    """Model class sample"""

    def __init__(self, demo_param="demo"):
        self.demo_param = demo_param  # Parameters used later are stored in __init__

    def fit(self, X, y):
        """Implement training. For unsupervised learning etc., where y does not exist, accept it as y=None"""

        # Validate the input the sklearn way
        X, y = check_X_y(X, y)

        # Some training (piyopiyo is a placeholder that returns the trained predictor)
        self.fugafuga = piyopiyo(X, y)

        # Return the trained learner (model) itself
        return self

    def predict(self, X):
        """Predict for unseen data"""

        # Check that the parameter learned in fit (here fugafuga) exists before inference
        check_is_fitted(self, ["fugafuga"])

        # Validate the input the sklearn way
        X = check_array(X)

        # Infer
        y_predicted = self.fugafuga(X)

        return y_predicted

When implementing sklearn-compliant classes, I recommend starting from the template published by scikit-learn and modifying it.

Developing scikit-learn estimators

Click here for sklearn-compliant implementation template
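As a concrete, runnable counterpart to the templates above, here is a minimal sketch of a custom preprocessing class used inside a Pipeline. The class MeanCenterer and the toy data are hypothetical examples made up for this illustration; only the scikit-learn classes and functions imported at the top are real.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils.validation import check_array, check_is_fitted


class MeanCenterer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that subtracts the column means learned in fit."""

    def fit(self, X, y=None):
        X = check_array(X)
        # Learned attributes end with an underscore by scikit-learn convention.
        self.mean_ = X.mean(axis=0)
        return self

    def transform(self, X):
        check_is_fitted(self, "mean_")
        X = check_array(X)
        return X - self.mean_


X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

# The custom class can be chained with built-in scikit-learn steps.
pipeline = Pipeline([("center", MeanCenterer()), ("scale", StandardScaler())])
X_out = pipeline.fit_transform(X)
print(X_out)
```

Because MeanCenterer inherits TransformerMixin, it gets fit_transform for free, which is what lets Pipeline treat it like any built-in step.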

4.3 Make good use of decorators

Decorators are the things written like @hogehoge. You often see them attached to methods and functions, and I suspect many people wonder what they are but just move on.

But let's make friends with decorators.

There are decorators provided by Python itself and decorators you write yourself.

Commonly seen standard decorators are @property, @staticmethod, @classmethod and @abstractmethod.

Let's check them one by one.

@property, defined without a setter, makes a member variable read-only from outside the class.

(Example)

class User:
    def __init__(self, name: str, user_type: str):
        self.name = name
        self.__user_type = user_type
        
    @property
    def user_type(self):
        return self.__user_type

In this way, user_type of the User class is defined with @property.

Then

taro = User("taro", "admin")
print(taro.user_type)

runs without any problem and admin is output.

However, when you try to change the member variable defined with @property, as in

taro.user_type = "normal"

you get an error (AttributeError).

In this way, you can define variables that cannot be changed from outside.
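The behavior described above can be confirmed with the following sketch, which reuses the User class from the example and catches the error so that the script runs to the end. The variable assignment_failed is introduced only for this illustration.

```python
class User:
    def __init__(self, name: str, user_type: str):
        self.name = name
        self.__user_type = user_type

    @property
    def user_type(self):
        return self.__user_type


taro = User("taro", "admin")
print(taro.user_type)

# No setter is defined, so assignment raises AttributeError.
try:
    taro.user_type = "normal"
    assignment_failed = False
except AttributeError:
    assignment_failed = True
print(assignment_failed)
```

Note that this is protection by convention: the underlying attribute still exists under a mangled name, so @property guards against accidental changes rather than deliberate ones.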

(Example) @staticmethod, @classmethod

class User:
    def __init__(self, name: str, user_type: str):
        self.name = name
        self.user_type = user_type

    @staticmethod
    def say_hello(name):
        print("Hello " + name)

@staticmethod and @classmethod make a method callable on the class itself, without instantiating the class as an object.

If you define the class as above and run

User.say_hello("Hanako")

the output is Hello Hanako.

You can look up the difference between @staticmethod and @classmethod when you actually need it.
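For reference, the difference can be sketched briefly as follows: a static method receives neither the instance nor the class, while a class method receives the class as its first argument. The class attribute default_type and the method create_default are hypothetical additions for this illustration.

```python
class User:
    default_type = "normal"

    def __init__(self, name: str, user_type: str):
        self.name = name
        self.user_type = user_type

    @staticmethod
    def say_hello(name):
        # Receives neither the instance nor the class.
        return "Hello " + name

    @classmethod
    def create_default(cls, name):
        # Receives the class as cls, so it can build an instance.
        return cls(name, cls.default_type)


greeting = User.say_hello("Hanako")
hanako = User.create_default("Hanako")
print(greeting, hanako.user_type)
```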

Another standard Python decorator you'll find is @abstractmethod.

If a method of an abstract class has @abstractmethod, a child class that inherits the class must implement that method. If it is not implemented, an error occurs when the child class is instantiated.

Use @abstractmethod when defining an abstract class and forcing a method definition in an inherited child class.
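A minimal sketch of this behavior, using the standard abc module, is shown below. The class names BaseModel, ConstantModel and IncompleteModel are hypothetical. Note that the parent class has to inherit from ABC for @abstractmethod to be enforced.

```python
from abc import ABC, abstractmethod


class BaseModel(ABC):
    @abstractmethod
    def predict(self, x):
        """Child classes must implement this method."""


class ConstantModel(BaseModel):
    def predict(self, x):
        return 0


class IncompleteModel(BaseModel):
    pass  # predict is not implemented


print(ConstantModel().predict(1))

# Instantiating a class that lacks the abstract method raises TypeError.
try:
    IncompleteModel()
    instantiation_failed = False
except TypeError:
    instantiation_failed = True
print(instantiation_failed)
```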

I think understanding this much is generally enough for data science and AI work.

When you create a web application with Django, you will see Django's own decorators, but it is fine to look into them at that time.

Next, I will explain self-made decorators.

Suppose there is a process that can be executed only when user_type is admin in the User class.

class User:
    def __init__(self, name: str, user_type: str):
        self.name = name
        self.user_type = user_type
        
    def func_admin_can_do(self):
        if self.user_type == "admin":
            # Processing that only admin can do
            print("I'm admin.")
        else:
            print("cannot do this func with auth error.")

You can write it like this, but if there are many other processes that can be executed only when user_type is admin, it is troublesome to check with an if statement each time.

So if you use a decorator, it looks like this:

def admin_only(func):
    """Decorator definition"""
    def wrapper(self, *args, **kwargs):
        if self.user_type == "admin":
            # Processing that only admin can do
            return func(self, *args, **kwargs)
        else:
            print("cannot do this func with auth error.")

    return wrapper


class User:
    def __init__(self, name: str, user_type: str):
        self.name = name
        self.user_type = user_type

    @admin_only
    def func_admin_can_do(self):
        # Processing that only admin can do
        print("I'm admin.")

With this definition, simply adding the @admin_only decorator makes the process run only when the user is admin.

Use decorators when you would otherwise write the same thing over and over, for example when many methods have to check whether the user is an admin before processing.
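The following sketch runs the decorator with both an admin and a non-admin user. It is a variant of the code above, not the only way to write it: the method returns its message instead of printing it, and functools.wraps is added (my addition, not part of the original example) so that the decorated method keeps its original name.

```python
import functools


def admin_only(func):
    """Decorator that runs func only when the user is an admin."""

    @functools.wraps(func)  # keep the original method name and docstring
    def wrapper(self, *args, **kwargs):
        if self.user_type == "admin":
            return func(self, *args, **kwargs)
        print("cannot do this func with auth error.")
        return None

    return wrapper


class User:
    def __init__(self, name: str, user_type: str):
        self.name = name
        self.user_type = user_type

    @admin_only
    def func_admin_can_do(self):
        return "I'm admin."


taro = User("taro", "admin")
hanako = User("hanako", "normal")
admin_result = taro.func_admin_can_do()
normal_result = hanako.func_admin_can_do()
print(admin_result, normal_result)
```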

4.4 Unify editor settings for team development

Using an auto formatter when coding is convenient because the code is formatted automatically. For Python, black is the most popular these days.

However, if team members use different formatters, formatting-only changes will rewrite files more often and clutter the git commits.

Therefore, when developing as a team, unify the auto formatter.

For example, create a folder ".vscode" directly under the repository and put the file "settings.json" in it.

In settings.json, write "python.formatting.provider": "black" so that the code is formatted with black.

When members code, they open this folder with VS Code. The settings in .vscode/settings.json in the repository are applied, so everyone has the same coding style.

This article is very good for setting up VS Code for Python.

[VSCode setting memo to explode the implementation of deep learning model](http://shunk031.hatenablog.com/entry/how-to-setup-vscode-for-developing-deep-learning-model?utm_campaign=piqcy&utm_medium=email&utm_source=Revue%20newsletter)

4.5 Prepare a template for pull request on GitHub and describe notes

I prepare the contents mentioned in this post, along with other points I want people to be careful about, as a pull request template on GitHub.

Create a folder ".github" directly under the repository, and create a file "PULL_REQUEST_TEMPLATE.md" in it.

This PULL_REQUEST_TEMPLATE.md is displayed as the template for the description when a pull request is created.

Written as a template, the contents mentioned in this post look like this:


Level 1
- [ ] Do you follow the naming conventions for variables, functions, classes and methods?
- [ ] Naming method 1: Are redundant parts removed from variable names and method names?
- [ ] Does the description of import follow the rules?
- [ ] Is the seed of random numbers fixed to ensure reproducibility?
- [ ] Is the program executed as a function?

Level 2
- [ ] Naming method 2: Is it easy to read because it is named by reverse notation?
- [ ] Are the functions and methods short with a single responsibility, conscious of S in SOLID?
- [ ] Do functions and methods have type hints?
- [ ] Are docstrings written for classes, methods and functions?
- [ ] When saving the trained model, do you save information such as preprocessing and hyperparameters together?

Level 3
- [ ] Naming method 3: Is the name given with appropriate English words and parts of speech so that the responsibility is understood?
- [ ] Is exception handling properly implemented?
- [ ] Is logging properly implemented?
- [ ] Is the number of function and method arguments 3 or less?
- [ ] Are `*args` and `**kwargs` used properly?

Level 4
- [ ] Are short if statements written with the ternary operator?
- [ ] Are preprocessing and model classes implemented in sklearn compliance?
- [ ] Are decorators used properly?
- [ ] Are the editor settings unified within the team?
- [ ] Have you prepared and used a template for pull requests?

Summary

The above is a summary of points that data scientists and AI engineers should be aware of when implementing in Python, and implementation know-how and tips.

I wrote this post, but there are some rules that I do not follow completely myself. Also, I am young and have little experience, so I would appreciate advice from more experienced people on what could be done better.

It is also not good to be too strict and restrictive. It may be good for each team to make its own rules on these points so that "everyone on the team can develop comfortably and quality is guaranteed."

Thank you for reading.


**Recent serialization list**
[1] [Implementation explanation] How to use Japanese version BERT with Google Colaboratory (PyTorch)
[2] [Implementation explanation] Livedoor news classification in Japanese version BERT: Google Colaboratory (PyTorch)
[3] [Implementation explanation] Brain science and unsupervised learning. Classify MNIST by information amount maximization clustering
[4] [Implementation explanation] Classify livedoor news by Japanese BERT x unsupervised learning (information amount maximization clustering)


[Remarks] The AI Technology Department development team that I lead is looking for members. Click here if you are interested

[Disclaimer] The content of this article itself is the opinion / transmission of the author, not the official opinion of the company to which the author belongs.

