Organizing Python tools to speed up the initial action in data analysis competitions

Purpose

In analysis competitions such as Kaggle and Signate, fast initial action matters, so this post collects frequently used Jupyter Notebook templates to speed it up. Updated from time to time.

Changelog

2020.5.25 Random number seed setting changed (wrapped in a function)

Table of contents

  1. import template
  2. Data read by platform
  3. Automatic loading of libraries
  4. Performance profile (%%time / %lprun)
  5. matplotlib Japanese localization (easy)
  6. Maximum Pandas display
  7. Pandas memory reduction
  8. Fixed random number seed

1. import template


import os
import sys
import random

import pandas as pd
import numpy as np
import pandas_profiling as pdp
import lightgbm as lgb
from numba import jit

import matplotlib.pyplot as plt
import matplotlib
from matplotlib.dates import DateFormatter
%matplotlib inline
import seaborn as sns

def seed_everything(seed):
    random.seed(seed)
    np.random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    if "tensorflow" in sys.modules:
        import tensorflow as tf
        tf.random.set_seed(seed)

seed_everything(28)

# Specify the maximum numbers to display (100 rows and 50 columns here)
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 50)

%load_ext autoreload
%autoreload 2
import os

# windows
if os.name == 'nt':
    path = '../input/data/'

    import japanize_matplotlib
    sns.set(font="IPAexGothic")

elif os.name == 'posix':
    # Kaggle
    if 'KAGGLE_DATA_PROXY_TOKEN' in os.environ.keys():
        path = '/kaggle/input/'

    # Google Colab
    else:
        from google.colab import drive
        drive.mount('/content/drive')
        !ls drive/My\ Drive/'Colab Notebooks'/xxx #xxx rewrite
        path = "./drive/My Drive/Colab Notebooks/xxx/input/data/" #xxx rewrite
        #Check the remaining time of the session
        !cat /proc/uptime | awk '{print $1 /60 /60 /24 "days (" $1 / 60 / 60 "h)"}'

print(os.name)
print(path)

2. Data read by platform

Identify the platform on which Python is running.

import os

# Windows
if os.name == 'nt':
    pass  # xxx

elif os.name == 'posix':
    # Kaggle
    if 'KAGGLE_DATA_PROXY_TOKEN' in os.environ.keys():
        pass  # xxx

    # Google Colab
    else:
        pass  # xxx

print(os.name)

3. Automatic loading of libraries

Even if you edit a library module, it is automatically reloaded at run time.

%load_ext autoreload
%autoreload 2

Reference: https://qiita.com/Accent/items/f6bb4d4b7adf268662f4

4. Performance profile

If you want to speed things up, it is important to find the bottleneck first. These are intended for use in notebooks such as Jupyter. Easy: %%time is useful if you want to know the processing time of a whole cell. Detailed: %lprun is useful if you want to know the processing time of each line.

4.1 %%time Put it at the beginning of the cell. It displays the execution time of the entire cell.

%%time
def func(num):
    sum = 0
    for i in range(num):
        sum += i

    return sum

out = func(10000)
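
Outside a notebook, a plain-Python equivalent of %%time is time.perf_counter; a minimal sketch timing the same function:

```python
import time

def func(num):
    total = 0
    for i in range(num):
        total += i
    return total

start = time.perf_counter()
out = func(10000)
elapsed = time.perf_counter() - start
print(out)           # 49995000
print(elapsed >= 0)  # True; elapsed seconds vary by machine
```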

4.2 %lprun It outputs the execution time of each line. %%prun reports at the function level, so it can be hard to interpret; %lprun is easier to read line by line.

Below, three steps. Step 0: installation. Step 1: load. Step 2: execution.

Step 0: Installation

Skip if already installed. Command for Google Colab and Kaggle kernels.

!pip install line_profiler

Step 1: Load

%load_ext line_profiler

Step 2: Execution

def func(num):
    sum = 0
    for i in range(num):
        sum += i

    return sum

%lprun -f func out = func(10000)

5. matplotlib Japanese localization (easy)

If you are using a cloud platform such as Google Colab, it can be difficult to change the system settings. japanize_matplotlib makes this relatively easy because it installs the Japanese fonts and packages automatically.

Note that seaborn also sets the font, so execute sns.set last.

import seaborn as sns
import japanize_matplotlib
sns.set(font="IPAexGothic") ###Be sure to run last

6. Maximum Pandas display

When displaying a pandas DataFrame, rows and columns beyond a certain number are abbreviated with (...). Set the maximum display counts to control this truncation.

# Specify the maximum numbers to display (100 rows and 50 columns here)
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 50)
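
As a quick check of the effect: when a DataFrame exceeds display.max_rows, pandas truncates the output and appends the full shape as a footer (a minimal sketch):

```python
import pandas as pd

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 50)

df = pd.DataFrame({'x': range(200)})  # more rows than display.max_rows
# When truncated, the repr ends with the full shape, e.g. "[200 rows x 1 columns]"
truncated = '[200 rows x 1 columns]' in repr(df)
print(truncated)  # True
```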

7. Pandas memory reduction

The type of each column is automatically chosen from the range of its values in the data frame.

Reference by @gemartin https://www.kaggle.com/gemartin/load-data-reduce-memory-usage

# Original code from https://www.kaggle.com/gemartin/load-data-reduce-memory-usage by @gemartin
# Modified to support timestamp type, categorical type
# Modified to add option to use float16 or not. feather format does not support float16.
from pandas.api.types import is_datetime64_any_dtype as is_datetime
from pandas.api.types import is_categorical_dtype

def reduce_mem_usage(df, use_float16=False):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        if is_datetime(df[col]) or is_categorical_dtype(df[col]):
            # skip datetime type or categorical type
            continue
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if use_float16 and c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df
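
As a sanity check of the idea, the same np.iinfo bound check lets an int64 column whose values fit in int8 be downcast, cutting its memory by roughly 8x (a minimal standalone sketch, not the full function above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(100, dtype=np.int64)})
before = df['a'].memory_usage(index=False)  # 100 values x 8 bytes

# Same int8 bound check that reduce_mem_usage performs
c_min, c_max = df['a'].min(), df['a'].max()
if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
    df['a'] = df['a'].astype(np.int8)

after = df['a'].memory_usage(index=False)   # 100 values x 1 byte
print(df['a'].dtype)      # int8
print(before // after)    # 8
```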

8. Fixed random number seed

If you do not fix the random seed, the prediction results change on every run and the effect of your changes is hard to see, so fix it.

Module settings related to random numbers

import os
import sys
import random

import numpy as np

def seed_everything(seed):
    random.seed(seed)
    np.random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    if "tensorflow" in sys.modules:
        import tensorflow as tf
        tf.random.set_seed(seed)
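
As a quick reproducibility check, re-seeding before each draw should yield identical random numbers (a standalone sketch with the TensorFlow branch omitted):

```python
import os
import random

import numpy as np

def seed_everything(seed):
    random.seed(seed)
    np.random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)

seed_everything(28)
a = np.random.rand(3)
seed_everything(28)
b = np.random.rand(3)
print(np.array_equal(a, b))  # True: same seed, same draws
```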

LightGBM parameters related to random numbers

lgb_params = {
    'random_state':28,
    'bagging_fraction_seed':28,
    'feature_fraction_seed':28,
    'data_random_seed':28,
    'seed':28
}
