This article is the day 18 entry of the Next Co., Ltd. (Lifull) Advent Calendar 2016.
Hello, this is Ninomiya of digital marketing U.
Recently, several departments within the company have started projects that use Python for statistical processing and numerical calculation.
Until now, R was the main language, with @wakuteka of the same group at the center, and documents and know-how had accumulated around it, but there seems to have been a need for Python as well.
(Of course, this is not meant to disparage R. For interactive analysis and visualization, R with the tidyverse family of libraries is easier to use, so we use each tool where it fits.)
I use Python as a hobby, and I have had the opportunity to be involved in several of these projects through reviews and advice (though that may be an exaggeration).
I would like to summarize the knowledge gained in that process and the articles that I referred to.
However, since we proceeded through trial and error, there may be a better way, and the content of the article does not cover the entire development. If you find such a point, I would be grateful if you could let me know in the comments.
The article "Development environment created with the intention of writing Python seriously" was helpful.
Anaconda, a well-known Python distribution that bundles statistics libraries, is not used in production. (I do use it in development and analysis environments.)
The reason is described in the article "Flowchart whether pyenv is needed":
When Anaconda's /bin is put on the PATH, the tools bundled with Anaconda (openssl / curl / python) shadow the ones provided by the OS. It also assumes bash too heavily, so with zsh it does not work without various fixes.
I was worried about this kind of behavior in actual operation.
Because I needed to review code, I relearned the coding standards and how to write docstrings.
Department-specific libraries for numerical calculation and the like can be installed with pip from the group's git repository.
At that time, I referred to these articles.
For coding standards, I think PEP 8 and the Google Python Style Guide are the ones to refer to.
However, checking conformance to a coding standard by eye is difficult, so I also use flake8 and autopep8 as appropriate. PEP 8 is a relatively strict standard, so we discuss how far to apply it as we go.
Here, I referred to the following article.
I also have Google-style docstrings written to make the inputs and outputs of functions and methods easier to understand. There also seems to be a NumPy style.
When using type annotations with Python 3.5 or later, it looks like this:
def function_with_pep484_type_annotations(param1: int, param2: str) -> bool:
    """Example function with PEP 484 type annotations.
 
    Args:
        param1: The first parameter.
        param2: The second parameter.
 
    Returns:
        The return value. True for success, False otherwise.
  
    """
When type annotations are not used, the types are written in the docstring instead:
def function_with_types_in_docstring(param1, param2):
    """Example function with types documented in the docstring.
 
    Args:
        param1 (int): The first parameter.
        param2 (str): The second parameter.
 
    Returns:
        bool: The return value. True for success, False otherwise.
 
    """
However, the code I reviewed contained a function that returned multiple values as a tuple, and (as far as I could find) Google-style docstrings do not seem to define a way to document multiple return values under Returns. Based on a Stack Overflow answer, I had it written as follows.
import pandas as pd


def _postprocess_data(output_data, market):
    """Format into data for alert and file output.

    Args:
        output_data (pd.DataFrame): Data frame after calculation.
        market (str): Real estate market name.

    Returns:
        tuple: Returns the following values as multiple values.

            - output_data (pd.DataFrame): Output data.
            - monthly_data (pd.DataFrame): Monthly data.
    """
I have not yet tried type annotations or static analysis with mypy, but I would like to take the opportunity to try them.
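As a rough sketch of what that might look like, here is a tuple-returning function written with PEP 484 annotations instead of types in the docstring. The function name, its arguments, and the sample data are hypothetical and exist only for illustration; this is not code from the projects described here, and it has only been written with mypy in mind, not taken from a real mypy setup.

```python
from typing import Dict, List, Tuple


def split_by_threshold(
    prices: Dict[str, float], threshold: float
) -> Tuple[List[str], List[str]]:
    """Split market names by whether their price reaches a threshold.

    Args:
        prices: Mapping of market name to price.
        threshold: Boundary value.

    Returns:
        tuple: Returns the following values as multiple values.

            - high: Markets at or above the threshold, sorted by name.
            - low: Markets below the threshold, sorted by name.
    """
    # Hypothetical logic: the point is the annotated tuple return type.
    high = sorted(name for name, price in prices.items() if price >= threshold)
    low = sorted(name for name, price in prices.items() if price < threshold)
    return high, low


# Unpack the two return values, as a caller of a tuple-returning function would.
high, low = split_by_threshold({"tokyo": 120.0, "osaka": 80.0}, 100.0)
```

With the return type declared as `Tuple[List[str], List[str]]`, a static checker such as mypy has enough information to flag a caller that unpacks the wrong number of values or treats `high` as something other than a list of strings.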
It was a small project, so I only wrote a few simple tests with unittest.
There seem to be other test frameworks besides unittest, so I would like to find an opportunity to try them as well.
Right now my workflow is roughly "paste a function worked out by trial and error in Jupyter Notebook into an editor." I would like to be able to switch to something TDD-like where it fits.
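A minimal sketch of the kind of simple unittest test meant here, assuming a small numerical function has been moved from a notebook into a module. The function `moving_average` and the test class are hypothetical examples, not code from the actual projects.

```python
import unittest


def moving_average(values, window):
    """Return the simple moving average of a sequence.

    Hypothetical function under test, standing in for code moved
    out of a notebook.

    Args:
        values (list of float): Input series.
        window (int): Number of points per average. Must be positive.

    Returns:
        list of float: One average per full window.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


class MovingAverageTest(unittest.TestCase):
    def test_full_windows(self):
        self.assertEqual(
            moving_average([1.0, 2.0, 3.0, 4.0], 2), [1.5, 2.5, 3.5]
        )

    def test_window_longer_than_input(self):
        # No full window fits, so the result is empty.
        self.assertEqual(moving_average([1.0], 2), [])

    def test_invalid_window(self):
        with self.assertRaises(ValueError):
            moving_average([1.0, 2.0], 0)
```

Saved in a file, tests like these can be run with `python -m unittest`. Writing the edge cases (a window longer than the input, an invalid window) before pasting the notebook code in is one small step toward the TDD-like workflow.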
When doing data analysis in Python, you will probably use pandas, which brings in a data frame type like the one in R.
In R, libraries such as dplyr and tidyr let you express the flow of data processing concisely with the pipe operator, but doing the same with pandas seems to take some getting used to. (Also, unlike R, where everything is expressed as a data frame, I am still working out by trial and error when to use a data frame and when to use a dict.)
That said, this article shows a good way to write pandas code, so please read it if you are just starting to use it.
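As a rough illustration of the pipeline style mentioned above, pandas method chaining can approximate a dplyr-like flow. This is only a sketch: the sample data, the column names, and the helper `add_unit_price` are assumptions made up for this example, and it is not necessarily the style the linked article recommends.

```python
import pandas as pd

# Hypothetical sample data; the column names are assumptions.
df = pd.DataFrame({
    "market": ["tokyo", "tokyo", "osaka", "osaka"],
    "price": [100.0, 120.0, 80.0, None],
    "area": [20.0, 30.0, 40.0, 50.0],
})


def add_unit_price(frame):
    """Return a copy with a price-per-area column (hypothetical helper)."""
    return frame.assign(unit_price=frame["price"] / frame["area"])


summary = (
    df
    .dropna(subset=["price"])          # drop rows with a missing price
    .pipe(add_unit_price)              # pass the frame through our own function
    .groupby("market", as_index=False)["unit_price"]
    .mean()                            # roughly group_by + summarise in dplyr
    .sort_values("unit_price", ascending=False)
    .reset_index(drop=True)
)
```

Wrapping the chain in parentheses lets each step sit on its own line, which reads similarly to a sequence of piped dplyr verbs, and `pipe` keeps user-defined steps inside the same chain.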
That was a quick summary of what I have learned through trial and error (which is still ongoing) in Python projects that use data analysis libraries. I hope it helps someone reading this.
This article does not cover every aspect of development, and there may be better approaches. If you notice one, I would be grateful if you could let me know in the comments.
Also, please keep an eye on our Advent Calendar.