[PYTHON] Estimating cancer patients from CT images with CNTK

Outline

Cancer detection from medical images is a field in which deep learning has produced results from an early stage. Whether or not a patient has cancer can be modeled simply as a machine learning problem that classifies the input data into two classes. With enough data, it is possible to build and use a model from scratch. This time, however, I would like to introduce a method that uses a general ImageNet image recognition model for transfer learning.

The method originally comes from the notebook below. Because it is published in the Cortana Intelligence Gallery, I have been asked how to run it in Azure ML. Given that, I would first like to explain things following its content. https://gallery.cortanaintelligence.com/Notebook/Medical-Image-Recognition-for-the-Kaggle-Data-Science-Bowl-2017-with-CNTK-and-LightGBM-1

Kaggle is a community site for data scientists, and it hosted the Data Science Bowl 2017 contest. One of the themes of that contest was cancer detection from CT images. The contest provides about 1,500 real CT scans, and participants compete on their analysis logic. Currently (as of March 2017), the top recognition rate is over 99%. https://www.kaggle.com/c/data-science-bowl-2017

Method

Accurate analysis really calls for 3D convolution, but this article uses a model trained on ImageNet, which is meant for 2D images, even though that is a bit of a stretch. The outline of the process is shown in the figure.

(Figure: zu.png)

LightGBM training takes only a few seconds, so most of the run time is spent on feature extraction with the ResNet model.
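To feed CT slices to a network trained on 3-channel images, the code later in this article stacks three consecutive slices into one pseudo-RGB image. The following is a minimal sketch of that grouping step only, assuming a volume shaped (slices, height, width). group_slices is a hypothetical helper name, and the scaling, cropping and resizing done in get_data_id are left out.

```python
import numpy as np

def group_slices(volume, channels=3):
    """Group consecutive slices into (groups, channels, H, W) images.

    Mirrors the loop bounds used in get_data_id: range(0, n - 3, 3),
    so trailing slices that do not fit the stride are dropped.
    """
    n_slices = volume.shape[0]
    starts = range(0, n_slices - channels, channels)
    return np.stack([volume[i:i + channels] for i in starts])

# Toy volume: 10 slices of 4x4 pixels (real slices are far larger).
volume = np.arange(10 * 4 * 4).reshape(10, 4, 4)
batch = group_slices(volume)
print(batch.shape)  # (3, 3, 4, 4)
```

Each group then plays the role of one "color" image for the ResNet feature extractor. The sketch assumes the scan has more than three slices; otherwise there is nothing to stack.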

Advance preparation

- CNTK installation: see below. I am using CNTK 2.0 RC1. https://analyticsai.wordpress.com/2017/04/04/cntk2-0b15%e3%82%a4%e3%83%b3%e3%82%b9%e3%83%88%e3%83%bc%e3%83%ab/

- Data: download from Kaggle. The data set is large, so this takes a long time. https://www.kaggle.com/c/data-science-bowl-2017/data

- Python: CNTK normally runs in an Anaconda environment. Install pydicom, which is used to read the CT images. It can be installed with pip (the code below imports it as dicom, the module name used by pydicom releases before 1.0):
  pip install pydicom

- LightGBM: see here for installation. https://analyticsai.wordpress.com/2017/04/04/lightgbm/

- Pretrained model: because this is transfer learning, an ImageNet model is required. A variety of trained models are available for CNTK and can be downloaded freely. https://www.microsoft.com/en-us/research/product/cognitive-toolkit/model-gallery/

Download the ResNet-152 model from there. (It appears to have been converted from the Caffe model.) https://github.com/Microsoft/CNTK/tree/master/Examples/Image/Classification/ResNet#cntk-examples-imageclassificationresnet

The original article appears to download it from here. https://migonzastorage.blob.core.windows.net/deep-learning/models/cntk/imagenet/ResNet_152.model

Folder structure

\data\stage1\ -- CT scan data
\data\features\features0001 -- Feature values calculated from the scan data

The features folder must be created in advance.

Store ResNet_152.model in the models folder.

\local\models -- ResNet model files
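The folders described above can be created from Python rather than by hand. This is a minimal sketch, assuming the stage1 and features folders sit side by side under one data folder, as the path variables in the listing suggest. make_folders is a hypothetical helper name, and the temporary directory only stands in for your real data path.

```python
import os
import tempfile

def make_folders(data_path, experiment_number="0001"):
    """Create the stage1 and features folders under data_path if missing."""
    stage1 = os.path.join(data_path, "stage1")
    features = os.path.join(data_path, "features",
                            "features" + experiment_number)
    for folder in (stage1, features):
        os.makedirs(folder, exist_ok=True)  # no error if it already exists
    return stage1, features

# Demonstrated in a temporary directory; pass your own data folder instead.
with tempfile.TemporaryDirectory() as tmp:
    stage1, features = make_folders(tmp)
    created = os.path.isdir(stage1) and os.path.isdir(features)
print(created)
```

Because exist_ok=True is used, running it again on an existing layout does nothing.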

Work

Extract the downloaded CT image data into the stage1 folder.

The original Python code does not work out of the box, so I modified it as shown below. To be honest, I got it running by making various corrections along the way. With CNTK moving to RC, the value returned for the intermediate layer has changed from a numpy.array to a Python list. This affects the transfer learning code, which needs to be modified accordingly.

With the approach in this article as it is, it is possible to estimate "who" has cancer, but not "where" the cancer is. Machine learning here is aimed at making predictions, so in most cases it cannot answer the question "why is this happening?". Estimating the location requires a different method, which I would like to write about next time.

The code is attached below. It is a modified version of the code on the site above.
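Before the full listing, the list-versus-array point can be illustrated without CNTK. The sketch below uses a hand-made Python list as a stand-in for the value returned for the intermediate layer. The shapes are assumptions chosen for illustration, not values taken from the model.

```python
import numpy as np

# Hypothetical stand-in for the evaluated layer output: a Python list with
# one array per input image (the shape of each array is an assumption).
layer_output = [np.zeros((2048, 1, 1), dtype=np.float32) for _ in range(4)]

# A list has no .shape attribute, so wrap it in np.array before
# printing its shape, saving it with np.save, or averaging it.
feats = np.array(layer_output)
print(type(layer_output).__name__, feats.shape)
```

In the listing, this corresponds to wrapping net.eval(batch) in np.array(...) inside calc_features.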

# Load libraries
import sys, os
import numpy as np
import dicom  # pydicom releases before 1.0 are imported as "dicom"
import glob
from cntk.device import set_default_device, gpu
from matplotlib import pyplot as plt
import cv2
import pandas as pd
import time
# sklearn.cross_validation was removed in scikit-learn 0.20;
# newer releases provide train_test_split in sklearn.model_selection.
from sklearn import cross_validation
from cntk import load_model
from cntk.ops import combine
from cntk.io import MinibatchSource, ImageDeserializer, StreamDef, StreamDefs
from lightgbm.sklearn import LGBMRegressor

print("System version:", sys.version, "\n")
#print("CNTK version:",pkg_resources.get_distribution("cntk").version)

# Put here the number of your experiment
EXPERIMENT_NUMBER = '0001'

# Put here the path to the downloaded ResNet model
MODEL_PATH = '/local/models/ResNet_152.model'

# Put here the path where you downloaded all Kaggle data
DATA_PATH = '/'

# Paths and variables
STAGE1_LABELS = DATA_PATH + 'stage1_labels.csv'
STAGE1_SAMPLE_SUBMISSION = DATA_PATH + 'stage1_sample_submission.csv'
STAGE1_FOLDER = DATA_PATH + 'stage1/'
FEATURE_FOLDER = DATA_PATH + 'features/features' + EXPERIMENT_NUMBER + '/'
SUBMIT_OUTPUT = 'submit' + EXPERIMENT_NUMBER + '.csv'
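
# Timer class: use as "with Timer() as t:" and read t.interval afterwards
class Timer(object):
    def __enter__(self):
        self.start()
        return self

    def __exit__(self, *args):
        self.stop()

    def start(self):
        # Stored as "begin" so the start() method is not overwritten.
        # time.perf_counter() replaces time.clock(), removed in Python 3.8.
        self.begin = time.perf_counter()

    def stop(self):
        self.end = time.perf_counter()
        self.interval = self.end - self.begin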

def get_3d_data(path):
    # Read every DICOM file of one scan and stack the slices in order
    slices = [dicom.read_file(path + '/' + s) for s in os.listdir(path)]
    slices.sort(key=lambda x: int(x.InstanceNumber))
    return np.stack([s.pixel_array for s in slices])


def get_data_id(path, plot_data=False):
    sample_image = get_3d_data(path)
    # Zero out pixels that hold the value -2000
    sample_image[sample_image == -2000] = 0
    if plot_data:
        f, plots = plt.subplots(4, 5, sharex='col', sharey='row', figsize=(10, 8))

    batch = []
    cnt = 0
    dx = 40   # border cropped from each side
    ds = 512  # assumed slice width and height
    # Group three consecutive slices into one 3-channel image
    for i in range(0, sample_image.shape[0] - 3, 3):
        tmp = []
        for j in range(3):
            img = sample_image[i + j]
            img = 255.0 / np.amax(img) * img              # scale to 0-255
            img = cv2.equalizeHist(img.astype(np.uint8))
            img = img[dx: ds - dx, dx: ds - dx]           # crop the border
            img = cv2.resize(img, (224, 224))             # ResNet input size
            tmp.append(img)

        tmp = np.array(tmp)
        batch.append(np.array(tmp))

        if plot_data:
            if cnt < 20:
                plots[cnt // 5, cnt % 5].axis('off')
                plots[cnt // 5, cnt % 5].imshow(tmp[0, :, :], cmap='gray')
            cnt += 1

    if plot_data: plt.show()

    batch = np.array(batch, dtype='int')
    return batch

def get_extractor():
    # Cut the pretrained network at the node named "z.x" and use it as output
    node_name = "z.x"
    loaded_model = load_model(MODEL_PATH)
    node_in_graph = loaded_model.find_by_name(node_name)
    output_nodes = combine([node_in_graph.owner])
    return output_nodes


def calc_features(verbose=False):
    net = get_extractor()
    for folder in glob.glob(STAGE1_FOLDER + '*'):
        foldername = os.path.basename(folder)
        # Skip scans whose features were already saved
        if os.path.isfile(FEATURE_FOLDER + foldername + '.npy'):
            if verbose: print("Features in %s already computed" % (FEATURE_FOLDER + foldername))
            continue
        batch = get_data_id(folder)
        if verbose:
            print("Batch size:")
            print(batch.shape)
        # eval() returns a Python list in CNTK RC, so convert it to an array
        feats = np.array(net.eval(batch))
        if verbose:
            print(feats.shape)
            print("Saving features in %s" % (FEATURE_FOLDER + foldername))
        np.save(FEATURE_FOLDER + foldername, feats)

def train_lightgbm():
    df = pd.read_csv(STAGE1_LABELS)

    # One averaged feature vector per patient
    x = np.array([np.mean(np.load(FEATURE_FOLDER + '%s.npy' % str(id)), axis=1).flatten()
                  for id in df['id'].tolist()])
    y = df['cancer'].values  # .values works where as_matrix() has been removed
    print(x.shape)

    trn_x, val_x, trn_y, val_y = cross_validation.train_test_split(
        x, y, random_state=42, stratify=y, test_size=0.20)
    clf = LGBMRegressor(max_depth=50,
                        num_leaves=21,
                        n_estimators=5000,
                        min_child_weight=1,
                        learning_rate=0.001,
                        nthread=24,
                        subsample=0.80,
                        colsample_bytree=0.80,
                        seed=42)
    clf.fit(trn_x, trn_y, eval_set=[(val_x, val_y)], verbose=True,
            eval_metric='l2', early_stopping_rounds=300)
    return clf


def compute_training(verbose=True):
    with Timer() as t:
        clf = train_lightgbm()
    if verbose: print("Training took %.03f sec.\n" % t.interval)
    return clf

def compute_prediction(clf, verbose=True):
    df = pd.read_csv(STAGE1_SAMPLE_SUBMISSION)
    x = np.array([np.mean(np.load((FEATURE_FOLDER + '%s.npy') % str(id)), axis=1).flatten()
                  for id in df['id'].tolist()])

    with Timer() as t:
        pred = clf.predict(x)
    if verbose: print("Prediction took %.03f sec.\n" % t.interval)
    df['cancer'] = pred
    return df


def save_results(df):
    df.to_csv(SUBMIT_OUTPUT, index=False)


# Run the whole pipeline: features, training, prediction, output
calc_features(verbose=True)

clf = compute_training(verbose=True)

df = compute_prediction(clf)
print("Results:")
print(df.head())  # print explicitly so it also shows outside a notebook

save_results(df)
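CT scans generally differ in their number of slices, so the per-image features have to be reduced to one fixed-length vector per patient before LightGBM can use them. The listing above does this with np.mean. The following is a minimal sketch of the idea, assuming features shaped (images, 2048, 1, 1). The arrays actually saved may be laid out differently (the listing averages over axis=1), and pool_patient_features is a hypothetical name.

```python
import numpy as np

def pool_patient_features(feats):
    """Average per-image features into a single flat vector per patient."""
    flat = feats.reshape(feats.shape[0], -1)  # (images, features)
    return flat.mean(axis=0)                  # (features,)

# Two hypothetical patients with different numbers of 3-slice images.
patient_a = np.ones((40, 2048, 1, 1))
patient_b = np.full((55, 2048, 1, 1), 3.0)

x = np.array([pool_patient_features(p) for p in (patient_a, patient_b)])
print(x.shape)  # (2, 2048)
```

After pooling, every patient contributes one row of equal length, which is the form a tabular learner such as LightGBM expects.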

The result is written to a file. Part of the output file is shown below.

id,cancer
026470d51482c93efc18b9803159c960,0.40692197700808486
031b7ec4fe96a3b035a8196264a8c8c3,0.3037382335250943
03bd22ed5858039af223c04993e9eb22,0.2324142770435242
06a90409e4fcea3e634748b967993531,0.20775132462023793
07b1defcfae5873ee1f03c90255eb170,0.3054943422884507
0b20184e0cd497028bdd155d9fb42dc9,0.21613283326493204
12db1ea8336eafaf7f9e3eda2b4e4fef,0.2073817116259871
159bc8821a2dc39a1e770cb3559e098d,0.28326504404734154
174c5f7c33ca31443208ef873b9477e5,0.23168542729470515
1753250dab5fc81bab8280df13309733,0.21749273848859857
1cf8e778167d20bf769669b4be96592b,0.18329884419207843
1e62be2c3b6430b78ce31a8f023531ac,0.3276063617078295
1f6333bc3599f683403d6f0884aefe00,0.2092160345076576
1fdbc07019192de4a114e090389c8330,0.2276893908994118
2004b3f761c3f5dffb02204f1247b211,0.20776188822078875

For example, the person with the id 2703df8c469906a06a45c0d7ff501199 has a predicted value of 0.5561665263809848, so this case can be considered quite suspicious.
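To look for suspicious cases such as this one, the output can be sorted by predicted value. This is a minimal sketch using only the standard library. The three rows are copied from the output above, rank_by_score is a hypothetical helper, and in practice the submit file would be opened instead of the in-memory sample.

```python
import csv
import io

# A few rows copied from the output file shown in this article.
sample = """id,cancer
026470d51482c93efc18b9803159c960,0.40692197700808486
031b7ec4fe96a3b035a8196264a8c8c3,0.3037382335250943
1cf8e778167d20bf769669b4be96592b,0.18329884419207843
"""

def rank_by_score(csv_file):
    """Return (id, score) pairs sorted from highest to lowest score."""
    rows = [(r["id"], float(r["cancer"])) for r in csv.DictReader(csv_file)]
    return sorted(rows, key=lambda row: row[1], reverse=True)

ranked = rank_by_score(io.StringIO(sample))
for patient_id, score in ranked:
    print(patient_id, round(score, 3))
```

Since the model is a regressor trained with an L2 metric, the values are best read as relative scores for ranking rather than as calibrated probabilities.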
