[PYTHON] Deep learning for compound formation?


Introduction

This is a repost of something I wrote a while ago. It may read more like rough notes than an article.

New molecule generation?

Creating new molecules. In particular, how have people so far gone about designing "useful new molecules with the desired physical properties"? In drug discovery, for example, I believe molecules are designed using not only the basic theory of chemistry but also empirical rules, the Tanimoto coefficient, quantum chemistry calculations, and so on (there are probably other methods too; I will look into this myself).

Recently there seems to be a big push to have machine learning do this work.

Among these efforts, the one I focused on this time is the paper from the Aspuru-Guzik group at Harvard University, "Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules" (1), which had already been cited 213 times when I wrote this, together with the program built on it, "Chemical VAE" (2).

banner.png

What is Chemical VAE?

Chemical VAE is a variational autoencoder (VAE) that operates on SMILES strings. It builds on the same idea of turning sequences into vectors as word2vec and Seq2Seq (the SMILES version is called SMILES2vec). Please see my previous article for more on that.

The following is the flow of new molecule generation as I understood it from reading the paper.

First, the encoder converts the character string that represents a compound (its SMILES) into a vector, producing a latent space (a vector space). Each position in this space corresponds to a SMILES string, and it seems that the closer two positions are (positions are written z below), the more similar the structures. The decoder then turns the vector back into a string that is as close as possible to the original. The encoder and decoder are trained together so that this encoding and decoding round trip works.

smv.jpg
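
To make the encode/decode round trip more concrete, here is a minimal sketch of how a SMILES string might be turned into the one-hot matrix an encoder takes as input, and how such a matrix is turned back into text. The character set CHARSET, the fixed length MAX_LEN, and both helper names are illustrative assumptions, not the ones chemical_vae actually uses.

```python
# Illustrative character set; a real model derives this from its training data.
CHARSET = [' ', '(', ')', '1', '=', 'C', 'N', 'O', 'c', 'n', 'o']
MAX_LEN = 12  # assumed fixed input length; shorter strings are space-padded


def smiles_to_one_hot(smiles, charset=CHARSET, max_len=MAX_LEN):
    """Return a max_len x len(charset) matrix of 0/1 values."""
    if len(smiles) > max_len:
        raise ValueError("SMILES longer than max_len")
    padded = smiles.ljust(max_len)
    index = {ch: i for i, ch in enumerate(charset)}
    matrix = []
    for ch in padded:
        row = [0] * len(charset)
        row[index[ch]] = 1  # raises KeyError if the character is not in charset
        matrix.append(row)
    return matrix


def one_hot_to_smiles(matrix, charset=CHARSET):
    """Pick the highest-scoring character per position and strip the padding."""
    chars = [charset[row.index(max(row))] for row in matrix]
    return ''.join(chars).strip()


one_hot = smiles_to_one_hot('c1ccccc1O')
print(len(one_hot), len(one_hot[0]))
print(one_hot_to_smiles(one_hot))
```

A trained decoder outputs a score for every character at every position rather than clean 0/1 values, so picking the best character per position can produce a string that differs from the input.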

After that (or rather, at the same time), a neural network learns the physical property values of the molecules corresponding to points in the vector space, which gives a function f(z) as shown in the figure below.

スクリーンショット 2018-12-31 17.15.12.png
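
As a rough way to write down what training "at the same time" could mean, the encoder, decoder, and property network can share a single objective. The formula below is a sketch in my own notation: q_phi is the encoder, p_theta the decoder, f_psi the property network f(z), y the property values, and lambda a weighting factor. The exact form and the weighting are assumptions here, not values taken from the paper.

```latex
\mathcal{L}(x, y) =
  \underbrace{-\,\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]}_{\text{reconstruction}}
  + \underbrace{D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)}_{\text{latent regularization}}
  + \lambda\,\underbrace{\mathbb{E}_{q_\phi(z \mid x)}\big[\lVert f_\psi(z) - y \rVert^2\big]}_{\text{property prediction}}
```

The third term is what would tie positions in the latent space to property values, so that nearby points tend to have similar predicted properties.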

(Note) I found the paper particularly hard to follow from this point on, so what follows is close to my own guesswork.

A new molecule is meaningless if it does not have the physical properties you want. Although the technique introduced here generates new molecules, it seems to rest on a rule of thumb from pharmaceutical research: "molecules whose structures resemble known molecules with good properties may (often?) have good properties too." In other words, a known molecule with good properties is encoded into the learned latent space, the neighborhood of its position is searched, and new molecules are looked for there. The trained decoder then generates SMILES for those points. Finally, RDKit is used to check whether each generated string is a valid molecule.

oc-2017-00572f_0005.gif
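
The "search around the position" step can be pictured as adding random noise of a chosen length to the latent vector of the known molecule. This is a minimal sketch in plain Python. The function perturb_latent and its parameters are hypothetical names, and treating the noise as a random direction scaled to a fixed norm is an assumption about how such a search could work, not a description of chemical_vae internals.

```python
import math
import random


def perturb_latent(z, noise_norm, rng=random):
    """Return a copy of z shifted by a random vector of length noise_norm."""
    direction = [rng.gauss(0.0, 1.0) for _ in z]
    length = math.sqrt(sum(d * d for d in direction))
    if length == 0.0:
        return list(z)  # degenerate draw: no direction to move in
    scale = noise_norm / length
    return [zi + scale * di for zi, di in zip(z, direction)]


def distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
z_center = [0.0] * 196  # same dimensionality as the latent space in this article
neighbors = [perturb_latent(z_center, 3.0, rng) for _ in range(5)]
for z_new in neighbors:
    print(round(distance(z_center, z_new), 3))
```

Each neighbor would then be passed to the decoder, and the decoded string checked for validity.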

The figure below appears to show the result. The central molecule is outlined with a square, and the other molecules are arranged according to their positions relative to it in the latent space.

Figure2.jpg

That is the flow as far as I understand it.

I actually tried an example on GitHub.

To run the example, install chemical_vae with conda or pip.

First, the import part

intro_to_chemvae.ipynb


# tensorflow backend
from os import environ
environ['KERAS_BACKEND'] = 'tensorflow'
# vae stuff
from chemvae.vae_utils import VAEUtils
from chemvae import mol_utils as mu
# import scientific py
import numpy as np
import pandas as pd
# rdkit stuff
from rdkit.Chem import AllChem as Chem
from rdkit.Chem import PandasTools
# plotting stuff
import matplotlib.pyplot as plt
import matplotlib as mpl
from IPython.display import SVG, display
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

The dataset is the ZINC dataset. It contains SMILES strings and physical properties: QED (a drug-likeness score), SAS (synthetic accessibility score), and logP (octanol-water partition coefficient).

In addition:
・ smiles_1 specifies the central molecule.
・ noise is the distance from the central molecule in the latent space (z).
・ Random sampling is performed within that range of z. A single trial may not find any point whose SMILES is valid, so I repeated it 500 times with a for loop.
・ Reconstruction is the central molecule vectorized by the trained encoder and then output by the trained decoder. (The result below does not look like it worked, but changing the training method and parameters should help.)

# Load the pretrained model shipped with the repository
vae = VAEUtils(directory='../models/zinc_properties')
# Central molecule, in canonical SMILES form
smiles_1 = mu.canon_smiles('CSCC(=O)NNC(=O)c1c(C)oc(C)c1C')
# z-distance around the central molecule to sample from
noise = 3.0

for i in range(500):
   # Encode the central molecule into the latent space and decode it again
   X_1 = vae.smiles_to_hot(smiles_1, canonize_smiles=True)
   z_1 = vae.encode(X_1)
   X_r = vae.decode(z_1)

   print('{:20s} : {}'.format('Input', smiles_1))
   print('{:20s} : {}'.format('Reconstruction', vae.hot_to_smiles(X_r, strip=True)[0]))

   print('{:20s} : {} with norm {:.3f}'.format('Z representation', z_1.shape, np.linalg.norm(z_1)))

   # Predict the properties from the latent vector
   print('Properties (qed,SAS,logP):')
   y_1 = vae.predict_prop_Z(z_1)[0]
   print(y_1)
   print('Searching molecules randomly sampled from {:.2f} std (z-distance) from the point'.format(noise))

・ Output result

Using TensorFlow backend.
Standarization: estimating mu and std values ...done!
Input                : CSCC(=O)NNC(=O)c1c(C)oc(C)c1C
Reconstruction       : CH1nCNc1Cs)Nccccc(CCc1)c3
Z representation     : (1, 196) with norm 9.901
Properties (qed,SAS,logP):
[0.72396696 2.1183593  2.1463375 ]
Searching molecules randomly sampled from 3.00 std (z-distance) from the point

Finally, we collect the unique molecules that were found and that RDKit judged to be valid.

   
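   # (continues inside the for loop above)
   # Decode points sampled around z_1 and keep the valid, unique results
   df = vae.z_to_smiles(z_1, decode_attempts=100, noise_norm=noise)
   print('Found {:d} unique mols, out of {:d}'.format(len(set(df['smiles'])), sum(df['count'])))
   print('SMILES\n', df.smiles)
   # Append to the CSV only when at least one molecule was found
   if sum(df['count']) != 0:
      df1 = pd.DataFrame(df.smiles)
      df1.to_csv("result1.csv", mode='a', index=False, header=False)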

Output result below

result1.csv


ON cCO=COCC(O)ccN2cs2c
CCCCCCNc-1cO-SCOCCcccc1
CC1CCcC(-nOcc1ccccCCC1)c1 O
OC (C)C(=Occc3cccccccc)CB
CCC1oNCc2cCcccccc2cccc1 1 1
CO CC(1c(O=O1O(1cO)nC))1
C=C1nn(=O)SnNccccccocc1C
CC C@Cs(=CN=11cccc2cc1Cc)2c1
C OcCCc(CO)c1nccc=Occccc1C
O1Cc(c1CCO)CNCC=BBOCCCN
CC ON(FCNN(C)ccc(Ocn1)1)l
C1ccnccnccccccccncccscc1
CC(CScc(c1cOn1nc1CCl)C)1
CCCCc(-ncccc21nc1c1c2)1CCC
C cnc(Cnncncc(C())Cl)Cl1 1

Some of these probably are not real molecules even though they passed the RDKit check. Still, the check does seem to filter out many syntax errors. RDKit is quite capable.
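
For reference, a validity check of this kind is usually a call to RDKit's SMILES parser, which returns None when it cannot build a molecule. The sketch below assumes RDKit is installed (it is already imported earlier in this article). The helper name is_valid_smiles is hypothetical, and whether chemical_vae applies exactly this check internally is an assumption.

```python
try:
    from rdkit import Chem
except ImportError:  # lets the snippet load even without RDKit
    Chem = None


def is_valid_smiles(smiles):
    """True/False from RDKit's parser, or None if RDKit is unavailable."""
    if Chem is None:
        return None
    # MolFromSmiles returns None when parsing or sanitization fails
    return Chem.MolFromSmiles(smiles) is not None


# Two well-formed strings, one unclosed ring, one unclosed branch
for smi in ['CCO', 'c1ccccc1', 'C1CC', 'C(C']:
    print(smi, is_valid_smiles(smi))
```

Passing this check only means RDKit could parse and sanitize the string. It says nothing about whether the molecule is stable or can be synthesized.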

That's all.

References

1) Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules https://pubs.acs.org/doi/abs/10.1021/acscentsci.7b00572
2) chemical_vae https://github.com/aspuru-guzik-group/chemical_vae
3) Compound formation by deep learning (drugs, organic luminescent molecules) https://ritsuan.com/blog/8480/
