[PYTHON] I tried AdaNet on tabular data

Introduction

I recently learned about AdaNet (https://github.com/tensorflow/adanet), a library that automatically builds deep learning models. Most of the available samples use image data and there are few for tabular data, so I am leaving this memo as a note.

The code below is based on this article: https://towardsdatascience.com/modeling-banks-churn-rate-with-adanet-a-scalable-flexible-auto-ensemble-learning-framework-700fa1e6df74

Environment

I used Google Colaboratory.

What I did

Library installation

Install with pip and you're done.

!pip install adanet

The output contained some ERROR lines about conflicting package requirements, but the installation completed and everything worked fine in the end.

(omitted)
Successfully built rednose termstyle
ERROR: datascience 0.10.6 has requirement coverage==3.7.1, but you'll have coverage 4.5.4 which is incompatible.
ERROR: datascience 0.10.6 has requirement folium==0.2.1, but you'll have folium 0.8.3 which is incompatible.
ERROR: coveralls 0.5 has requirement coverage<3.999,>=3.6, but you'll have coverage 4.5.4 which is incompatible.
Installing collected packages: nose, termstyle, colorama, rednose, coverage, mock, adanet
  Found existing installation: coverage 3.7.1
    Uninstalling coverage-3.7.1:
      Successfully uninstalled coverage-3.7.1
Successfully installed adanet-0.8.0 colorama-0.4.3 coverage-4.5.4 mock-3.0.5 nose-1.3.7 rednose-1.3.0 termstyle-0.1.11
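
Because pip reported conflicts, it can be reassuring to confirm which versions actually ended up installed. The following is a minimal sketch using only the standard library (it assumes Python 3.8 or later for importlib.metadata). The helper name installed_version is my own and is not part of pip or AdaNet.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Packages named in the pip output above.
for name in ("adanet", "coverage", "folium", "datascience"):
    print(name, installed_version(name))
```

A result of None simply means the package is not installed in the current environment.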

Preparation

Create a directory to save model information while building the model.

!mkdir ./models
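
Note that mkdir reports an error when the directory already exists, for example when the notebook is run a second time. As a sketch of an alternative, the directory can be created from Python with os.makedirs. The helper name ensure_dir is hypothetical, and the example works in a temporary location so that it leaves nothing behind.

```python
import os
import tempfile


def ensure_dir(path):
    """Create the directory if needed; do nothing when it already exists."""
    os.makedirs(path, exist_ok=True)
    return os.path.isdir(path)


# In the notebook the path would be './models'; a temporary base directory
# is used here purely for demonstration.
with tempfile.TemporaryDirectory() as base:
    model_dir = os.path.join(base, "models")
    created_first = ensure_dir(model_dir)
    created_again = ensure_dir(model_dir)  # second call is a harmless no-op
    print(created_first, created_again)
```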

Library import

First, import the required libraries.

from __future__ import division
from __future__ import print_function

import functools
import os
import shutil

import adanet
from adanet.examples import simple_dnn
import matplotlib.pyplot as plt
import tensorflow as tf
import pandas as pd
from sklearn.model_selection import train_test_split


# The random seed to use.
RANDOM_SEED = 42

LOG_DIR = './models'

Data loading

This time, we will use the breast cancer dataset provided by scikit-learn. See the scikit-learn documentation for details: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer


from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
df_X = pd.DataFrame(data.data, columns=data.feature_names)
df_y = pd.DataFrame(data.target, columns=['target'])
x_train, x_test, y_train, y_test = train_test_split(df_X, df_y, test_size=0.2, random_state=100)
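
For reference, the sizes produced by this split can be worked out by hand. The sketch below assumes the dataset has 569 rows (the size documented for scikit-learn's breast cancer dataset) and that a fractional test_size is rounded up, which is how I understand train_test_split to behave. split_sizes is a hypothetical helper, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding test up."""
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test


n_train, n_test = split_sizes(569, 0.2)
print(n_train, n_test)  # 455 114
```

The actual sizes can be confirmed in the notebook with len(x_train) and len(x_test).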

The target variable in this data is either 0 or 1, so this is a binary classification problem.

df_y['target'].unique()
array([0, 1])
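
A quick look at the class balance helps put accuracy figures in context. The sketch below uses the class counts documented for scikit-learn's breast cancer dataset (212 malignant samples labeled 0 and 357 benign samples labeled 1) instead of loading the data, so treat the numbers as an assumption to verify with df_y['target'].value_counts().

```python
from collections import Counter

# Class counts as documented for the dataset (assumed here, not computed).
labels = [0] * 212 + [1] * 357

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = majority_count / len(labels)

print(counts[0], counts[1])         # 212 357
print(majority_class)               # 1
print(round(baseline_accuracy, 4))  # 0.6274
```

A classifier that always answers with the majority class would reach this baseline on the full dataset, so a useful model needs to do clearly better.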

Preparation of functions used in AdaNet

Write a function that feeds the data to AdaNet. The features are passed as a dict to TensorFlow's tf.data.Dataset.from_tensor_slices.

FEATURES_KEY = "x"
_NUM_LAYERS_KEY = "num_layers"

def input_fn(partition, training, batch_size):
  """Generate an input function for the Estimator."""

  def _input_fn():

    if partition == "train":
      dataset = tf.data.Dataset.from_tensor_slices(({FEATURES_KEY: x_train}, y_train))
    else:
      dataset = tf.data.Dataset.from_tensor_slices(({FEATURES_KEY: x_test},  y_test))

    # repeat is called after shuffling, to prevent separate epochs from blending together.
    if training:
      dataset = dataset.shuffle(10 * batch_size, seed=RANDOM_SEED).repeat()

    dataset = dataset.batch(batch_size)
    iterator = dataset.make_one_shot_iterator()
    features, labels = iterator.get_next()
    return features, labels

  return _input_fn
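
To make the order of operations in the pipeline concrete, here is a pure-Python sketch of what shuffle, repeat and batch do to a sequence of rows. It only imitates the behaviour and does not use tf.data. The function name batches and the toy sizes are illustrative assumptions, and for simplicity it shuffles a whole epoch at once rather than using a bounded shuffle buffer.

```python
import itertools
import random


def batches(rows, batch_size, training, seed=42):
    """Yield lists of rows, mimicking shuffle -> repeat -> batch."""
    rng = random.Random(seed)

    def stream():
        while True:
            epoch = list(rows)
            if training:
                rng.shuffle(epoch)  # reshuffle at the start of every epoch
            yield from epoch
            if not training:
                return  # evaluation makes a single pass over the data

    it = stream()
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return
        yield batch


# Evaluation: one pass, the last batch is smaller.
eval_batches = list(batches(range(10), batch_size=4, training=False))
print([len(b) for b in eval_batches])  # [4, 4, 2]

# Training: the stream never ends, so we take only the first five batches.
train_batches = list(itertools.islice(batches(range(10), 4, training=True), 5))
print([len(b) for b in train_batches])  # [4, 4, 4, 4, 4]
```

Because the training stream repeats forever, the number of training steps has to be limited from outside, which is the role of max_steps in the training code.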

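Next, prepare the classes for the deep learning part: a Builder that constructs a single DNN subnetwork, and a Generator that proposes candidate subnetworks at each iteration.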


class _SimpleDNNBuilder(adanet.subnetwork.Builder):
  """Builds a DNN subnetwork for AdaNet."""

  def __init__(self, optimizer, layer_size, num_layers, learn_mixture_weights,
               seed):

    self._optimizer = optimizer
    self._layer_size = layer_size
    self._num_layers = num_layers
    self._learn_mixture_weights = learn_mixture_weights
    self._seed = seed

  def build_subnetwork(self,
                       features,
                       labels,
                       logits_dimension,
                       training,
                       iteration_step,
                       summary,
                       previous_ensemble=None):

    input_layer = tf.to_float(features[FEATURES_KEY])
    kernel_initializer = tf.glorot_uniform_initializer(seed=self._seed)
    last_layer = input_layer
    for _ in range(self._num_layers):
      last_layer = tf.layers.dense(
          last_layer,
          units=self._layer_size,
          activation=tf.nn.relu,
          kernel_initializer=kernel_initializer)
    logits = tf.layers.dense(
        last_layer,
        units=logits_dimension,
        kernel_initializer=kernel_initializer)
    persisted_tensors = {_NUM_LAYERS_KEY: tf.constant(self._num_layers)}
    return adanet.Subnetwork(
        last_layer=last_layer,
        logits=logits,
        complexity=self._measure_complexity(),
        persisted_tensors=persisted_tensors)

  def _measure_complexity(self):
    """Approximates Rademacher complexity as the square-root of the depth."""
    return tf.sqrt(tf.to_float(self._num_layers))

  def build_subnetwork_train_op(self, subnetwork, loss, var_list, labels,
                                iteration_step, summary, previous_ensemble):
    return self._optimizer.minimize(loss=loss, var_list=var_list)

  def build_mixture_weights_train_op(self, loss, var_list, logits, labels,
                                     iteration_step, summary):
    if not self._learn_mixture_weights:
      return tf.no_op()
    return self._optimizer.minimize(loss=loss, var_list=var_list)

  @property
  def name(self):
    if self._num_layers == 0:
      # A DNN with no hidden layers is a linear model.
      return "linear"
    return "{}_layer_dnn".format(self._num_layers)  

class SimpleDNNGenerator(adanet.subnetwork.Generator):
  """Generates two DNN subnetworks at each iteration."""

  def __init__(self,
               optimizer,
               layer_size=32,
               learn_mixture_weights=False,
               seed=None):

    self._seed = seed
    self._dnn_builder_fn = functools.partial(
        _SimpleDNNBuilder,
        optimizer=optimizer,
        layer_size=layer_size,
        learn_mixture_weights=learn_mixture_weights)

  def generate_candidates(self, previous_ensemble, iteration_number,
                          previous_ensemble_reports, all_reports, config):
    """See `adanet.subnetwork.Generator`."""

    num_layers = 0
    seed = self._seed
    if previous_ensemble:
      num_layers = tf.contrib.util.constant_value(
          previous_ensemble.weighted_subnetworks[
              -1].subnetwork.persisted_tensors[_NUM_LAYERS_KEY])
    if seed is not None:
      seed += iteration_number
    return [
        self._dnn_builder_fn(num_layers=num_layers, seed=seed),
        self._dnn_builder_fn(num_layers=num_layers + 1, seed=seed),
    ]
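
The search strategy of the generator above can be traced without TensorFlow. The following sketch mirrors its rule (offer one candidate with the depth of the most recently added subnetwork and one that is a layer deeper), together with the naming rule and the square-root complexity measure of the builder. The function names are my own, and which candidate wins an iteration is supplied by hand here, since in practice AdaNet decides that from the evaluation.

```python
import math


def candidate_depths(last_depth=None):
    """Depths proposed for one iteration, given the last subnetwork's depth."""
    depth = 0 if last_depth is None else last_depth
    return [depth, depth + 1]


def subnetwork_name(num_layers):
    """Same naming rule as the builder: depth 0 is a linear model."""
    return "linear" if num_layers == 0 else "{}_layer_dnn".format(num_layers)


def complexity(num_layers):
    """Square root of the depth, as in _measure_complexity."""
    return math.sqrt(num_layers)


# First iteration: there is no previous ensemble yet.
first = candidate_depths()
print([subnetwork_name(d) for d in first])   # ['linear', '1_layer_dnn']

# Suppose (hypothetically) the deeper candidate was selected in iteration 0.
second = candidate_depths(last_depth=1)
print([subnetwork_name(d) for d in second])  # ['1_layer_dnn', '2_layer_dnn']

print([complexity(d) for d in (0, 1, 4)])    # [0.0, 1.0, 2.0]
```

The search space therefore grows by at most one hidden layer per iteration.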

Next is the execution part. The parameters at the top set the learning rate passed to RMSPropOptimizer and the total number of training steps. Since this is a binary classification problem, binary_classification_head is used as the head of adanet.Estimator.


# AdaNet parameters
LEARNING_RATE = 0.001
TRAIN_STEPS = 100000 
BATCH_SIZE = 32 

LEARN_MIXTURE_WEIGHTS = False
ADANET_LAMBDA = 0 
BOOSTING_ITERATIONS = 5

def train_and_evaluate(learn_mixture_weights=LEARN_MIXTURE_WEIGHTS,
                       adanet_lambda=ADANET_LAMBDA):
  """Trains an `adanet.Estimator` to classify each sample as 0 or 1."""

  estimator = adanet.Estimator(
      # Since this is binary classification, we use a binary
      # classification head, which optimizes a sigmoid cross-entropy loss.
      head=tf.contrib.estimator.binary_classification_head(
          loss_reduction=tf.losses.Reduction.SUM_OVER_BATCH_SIZE),

      # Define the generator, which defines our search space of subnetworks
      # to train as candidates to add to the final AdaNet model.
      subnetwork_generator=SimpleDNNGenerator(
          optimizer=tf.train.RMSPropOptimizer(learning_rate=LEARNING_RATE),
          learn_mixture_weights=learn_mixture_weights,
          seed=RANDOM_SEED),

  
      # Lambda is the strength of the complexity regularization.
      adanet_lambda=adanet_lambda,
      # The number of train steps per iteration.
      max_iteration_steps=TRAIN_STEPS // BOOSTING_ITERATIONS,

      # The evaluator will evaluate the model on the full training set to
      # compute the overall AdaNet loss (train loss + complexity
      # regularization) to select the best candidate to include in the
      # final AdaNet model.
      evaluator=adanet.Evaluator(
          input_fn=input_fn("train", training=False, batch_size=BATCH_SIZE)),

      # The report materializer will evaluate the subnetworks' metrics
      # using the full training set to generate the reports that the generator
      # can use in the next iteration to modify its search space.
      report_materializer=adanet.ReportMaterializer(
          input_fn=input_fn("train", training=False, batch_size=BATCH_SIZE)),

      # Configuration for Estimators.
      config=tf.estimator.RunConfig(
          save_checkpoints_steps=50000,
          save_summary_steps=50000,
          tf_random_seed=RANDOM_SEED))

  # Train and evaluate using the tf.estimator tooling.
  train_spec = tf.estimator.TrainSpec(
      input_fn=input_fn("train", training=True, batch_size=BATCH_SIZE),
      max_steps=TRAIN_STEPS)
  eval_spec = tf.estimator.EvalSpec(
      input_fn=input_fn("test", training=False, batch_size=BATCH_SIZE),
      steps=None)
  return tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

def ensemble_architecture(result):
  """Extracts the ensemble architecture from evaluation results."""

  architecture = result["architecture/adanet/ensembles"]
  # The architecture is a serialized Summary proto for TensorBoard.
  summary_proto = tf.summary.Summary.FromString(architecture)
  return summary_proto.value[0].tensor.string_val[0]


results, _ = train_and_evaluate()
print("Loss:", results["average_loss"])
print("Results:", results)
print("Architecture:", ensemble_architecture(results))
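
The step budget implied by these parameters can be checked with simple arithmetic. This sketch reuses the constants from the code above. The figure of 455 training rows is an assumption (the 569 rows of the dataset minus the 20% test split), so the number of passes over the data is only a rough estimate.

```python
TRAIN_STEPS = 100000
BOOSTING_ITERATIONS = 5
BATCH_SIZE = 32
N_TRAIN_ROWS = 455  # assumed: 569 rows minus the 20% test split

# Steps each AdaNet iteration may train before candidates are compared.
steps_per_iteration = TRAIN_STEPS // BOOSTING_ITERATIONS

# Rough number of passes over the training data within one iteration.
epochs_per_iteration = steps_per_iteration * BATCH_SIZE / N_TRAIN_ROWS

print(steps_per_iteration)          # 20000
print(round(epochs_per_iteration))  # 1407
```

The value 20000 matches the 'global_step': 20000 that appears in the evaluation results of this run.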

The result is shown below. Because the run prints a lot to standard output, only the last part is included. Accuracy is about 93%, and the architecture reported at the end is a single linear subnetwork.

(omitted)
Loss: 0.1722714
Results: {'accuracy': 0.9298246, 'accuracy_baseline': 0.5701754, 'architecture/adanet/ensembles': b'\n1\n\x13architecture/adanetB\x10\x08\x07\x12\x00B\n| linear |J\x08\n\x06\n\x04text', 'auc': 0.99623233, 'auc_precision_recall': 0.9973264, 'average_loss': 0.1722714, 'best_ensemble_index_0': 0, 'iteration': 0, 'label/mean': 0.5701754, 'loss': 0.16755146, 'precision': 1.0, 'prediction/mean': 0.49755728, 'recall': 0.8769231, 'global_step': 20000}
Architecture: b'| linear |'
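
The reported metrics are enough to reconstruct the confusion matrix on the test set. The sketch below assumes 114 test rows (20% of 569, rounded up) and that rounding recovers the exact counts. confusion_from_metrics is a hypothetical helper, not part of TensorFlow or AdaNet.

```python
def confusion_from_metrics(n, label_mean, recall, precision, accuracy):
    """Recover (tp, fp, fn, tn) from aggregate binary-classification metrics."""
    positives = round(n * label_mean)   # rows whose true label is 1
    tp = round(positives * recall)      # positives that were found
    fn = positives - tp                 # positives that were missed
    fp = round(tp / precision) - tp     # predicted 1 but actually 0
    tn = round(n * accuracy) - tp       # correct predictions that are not tp
    return tp, fp, fn, tn


# Values copied from the evaluation results above.
tp, fp, fn, tn = confusion_from_metrics(
    n=114,
    label_mean=0.5701754,
    recall=0.8769231,
    precision=1.0,
    accuracy=0.9298246,
)
print(tp, fp, fn, tn)  # 57 0 8 49
```

Under these assumptions, all 8 errors are samples with label 1 that were predicted as 0, and no sample with label 0 was misclassified.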

Summary

I tried AdaNet on tabular data. My impression is that you still write most of the program yourself, while AdaNet takes care of deciding the structure of the network, such as its layers and nodes.
