Python Statistical Techniques: Statistical Analysis with Python

This is the article for day 13 of the Open and Reproducible Science Advent Calendar 2019.

Purpose of the article

When it comes to statistical analysis, R is the usual answer. But some people, like me, want to do everything in Python alone. For such people, this article introduces useful libraries and techniques for performing statistical analysis with Python. There are many articles and books on the same theme, so I will focus on things that are not often introduced elsewhere, and link to other articles for the details of each library and technique wherever possible.

Target audience: people who have used Python but are not confident with it

Jupyter Notebook / Lab

This is a classic that has been introduced in many other articles, but let me cover it here anyway.

Why do we use program-based statistics software like Python and R instead of GUI-based software like SPSS? Because it is ~~cheaper~~ **because it ensures the reproducibility of results**. In other words, you can check later exactly which operations and analyses produced a given result.

Jupyter Notebook is a tool for writing programs while checking the execution results step by step; it corresponds to R Markdown in R or live scripts in MATLAB. It makes it easy to look back at which program produced each analysis result, and it is a must-have tool for data analysis in Python. For details, the articles around here are easy to follow.

Jupyter Lab is the evolution of Jupyter Notebook; version 1.0 was finally released in June 2019. There is no decisive difference from Notebook, but it is easier to use in many small ways. For more information, I recommend the articles around here.

Personally, I like the ruler feature. Python has a convention (PEP 8) that a line should be at most 79 characters long. You don't have to follow it, but code is easier to read later if you keep line lengths in check, and the ruler makes it easy to see where the 79-character limit falls.

In Notebook this could be displayed with an extension (reference), but in Lab it can be enabled just by tweaking the settings (reference); if I recall correctly, adding `"rulers": [79]` under the Notebook section of the Advanced Settings Editor does it.

Divide into functions

As mentioned above, I recommend writing the main analysis code in a .ipynb file with Jupyter Notebook / Lab. However, complicated code should be separated out as functions and written in a separate .py file. Just `import`-ing it from the .ipynb greatly improves readability.

See this article for how to create your own function and make it `import`-able.
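For example, a minimal sketch (the file name and function are hypothetical):

# my_stats.py -- complicated helpers live here, not in the notebook
import numpy as np


def standardize(x):
    """Return x as z-scores (mean 0, SD 1)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)

Then, from a .ipynb in the same directory:

from my_stats import standardize

z_scores = standardize([3.1, 2.7, 4.2, 3.8])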

However, for updates to the .py file to be reflected in the .ipynb, you normally need to restart the Kernel, which is quite annoying because all variables get reset. To avoid that, put the following code at the head of the .ipynb.

%reload_ext autoreload
%autoreload 2

With this, the changes in the .py file will be reflected immediately without restarting the Kernel.

Leave Docstrings

A docstring is a note on how to use a function, written in a fixed format.

When I reread an analysis script about a year after writing it, the contents often look like gibberish. I can decipher it if I take the time, but if I have kept docstrings, looking back is easy. If you read [this article](https://qiita.com/simonritchie/items/49e0813508cad4876b5a#%E3%81%9D%E3%82%82%E3%81%9D%E3%82%82docstring%E3%81%A3%E3%81%A6) on how to write them, you are all set.
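For instance, in the NumPy style that also appears later in this article (the function itself is just a hypothetical illustration):

import numpy as np


def cohens_d(x, y):
    """Compute Cohen's d for two independent samples.

    Parameters
    ----------
    x, y : array_like
        The two samples to compare.

    Returns
    -------
    float
        Standardized mean difference using the pooled SD.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1)
                  + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)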

Leave a comment ...?

In addition to docstrings, leave comments as appropriate so that the contents of the analysis are easy to understand later. ...Or so it is often said, but whether to comment at all seems to be a matter of debate (reference). In the end, the ideal seems to be code whose contents are clear from the variable and function names alone, without any comments.

Personal opinion: at least in the .ipynb, the code should read smoothly without any comments (add headings instead). If you suspect you won't be able to read something easily when you look back at it later, separate it out as a function and leave a comment or docstring there.

Use Git (at least moderately)

It's not limited to Python, but ...

Git is a system for managing versions of source code. Speaking of Git, the site GitHub is famous; roughly speaking, it is like cloud storage for Git-managed data. If you manage your code with Git, you can view the source code as it was at any point in the past.

The articles around here explain how to use Git and GitHub clearly.

Recording your change history frequently with Git seems to be an essential habit for programmers. ...But for a lazy person like me, committing constantly is hard. At the very least, record the history **whenever you present analysis results: at lab meetings, study sessions, academic conferences, in papers, and so on**. That way you can always look back at exactly which code produced the results you presented.

Debugging with breakpoint()

The first step in fixing a program error is finding its cause. The debugger lets you inspect the variables around the error and step through the code line by line. See this article for how to use Python's debugger. Note, however, that the methods introduced there cannot be used from Jupyter Notebook as-is; rewrite them as follows (reference).

# import pdb; pdb.set_trace()    # <- instead of this
from IPython.core.debugger import Pdb; Pdb().set_trace()    # <- this

If you are using Python 3.7 or later, the following is all you need (reference).

# import pdb; pdb.set_trace()    # <- not this
# from IPython.core.debugger import Pdb; Pdb().set_trace()    # <- not this either
breakpoint()    # <- just this

breakpoint() is super convenient.
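A minimal sketch of how it's used (the function is hypothetical):

def divide_all(values, divisor):
    # Execution pauses here; inspect `values` and `divisor` at the pdb prompt
    breakpoint()
    return [v / divisor for v in values]


divide_all([1, 2, 3], 0)  # the debugger opens before the ZeroDivisionError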

Read variables from another file

I knew that I could `import` a function defined in another .py file, but until recently I didn't know that variables (constants) can be defined and imported the same way. See this article for how to do it. If you have a large number of parameters to write out, this method makes them easier to find later.
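A minimal sketch (the file and parameter names are hypothetical):

# params.py -- analysis parameters collected in one place
N_SUBJECTS = 24
ALPHA = 0.05
CONDITIONS = ['control', 'treatment']

Then, from the .ipynb:

from params import N_SUBJECTS, ALPHA, CONDITIONS

print(f'n = {N_SUBJECTS}, alpha = {ALPHA}, conditions = {CONDITIONS}')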

Load Google Spreadsheet

Many people use Google Forms to conduct surveys. To analyze the responses stored in Google Sheets, you would usually download them as a .csv file and then load that. However, with libraries such as `gspread` and `oauth2client`, you can load spreadsheets directly into Python. This is recommended if you want to check on the incoming responses as you go but find downloading the file every time a chore.

The method is introduced here.

It is convenient to wrap it in a function like the one below.

import pandas as pd
from oauth2client.service_account import ServiceAccountCredentials
import gspread


def fetch_spread_sheet(spreadsheet_name, worksheet_num, json_path):
    """Load the specified Google Sheets worksheet as a DataFrame.

    Parameters
    ----------
    spreadsheet_name : str
        The name of the spreadsheet to load.
    worksheet_num : int
        The index of the worksheet to load within the spreadsheet.
    json_path : str
        Path to the JSON key file downloaded from the Google Drive API
        manager.

    Returns
    -------
    pandas.DataFrame
        All values in the worksheet.
    """
    scopes = ['https://www.googleapis.com/auth/drive']
    credentials = ServiceAccountCredentials.from_json_keyfile_name(
        json_path, scopes=scopes)
    gc = gspread.authorize(credentials)
    workbook = gc.open(spreadsheet_name)
    sheet = workbook.get_worksheet(worksheet_num)
    return pd.DataFrame(sheet.get_all_values())
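Usage looks like this (the sheet name and key-file path are placeholders):

df = fetch_spread_sheet('survey_responses', 0, 'service_account_key.json')
df.head()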

Statistical library Pingouin

Pingouin (French for "penguin") is a statistical package for Python (official site). It is a new package, first released in April 2018.

Speaking of Python statistics packages, there are StatsModels and scipy.stats. Pingouin's selling point, by contrast, is being "simple yet exhaustive". For example, scipy.stats's ttest_ind performs a t-test and returns the t-value and p-value. pingouin.ttest, on the other hand, returns the t-value, p-value, **degrees of freedom, effect size, 95% confidence interval, statistical power, and Bayes factor** all at once.
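A minimal sketch of the difference (random data for illustration):

import numpy as np
from scipy import stats
import pingouin as pg

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=30)
y = rng.normal(0.5, 1.0, size=30)

# scipy.stats: t-value and p-value only
t_value, p_value = stats.ttest_ind(x, y)

# pingouin: a one-row DataFrame with dof, Cohen's d, CI95%, power, BF10
print(pg.ttest(x, y))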

Check the official site for the full list of Pingouin functions. It covers analyses that other libraries don't offer, or that are cumbersome to do elsewhere.
