[PYTHON] A belated look at the molecular competition (1)

Overview

I am working through a molecular competition to practice RDKit and machine learning. For now, this post only explores the data.

References

This is basically my own work memo. The competition was held last year and several summaries already exist, so you may be better off reading those.

  1. [Predicting Scalar Coupling Constants using Machine Learning](https://medium.com/@liztersahakyan/predicting-scalar-coupling-constants-using-machine-learning-c213af14e862)
  2. [Review of Kaggle Molecular Competition: Overview of Competition / Application of GCN](https://rishigami.hatenablog.com/entry/2019/08/29/090244)

Overview of the competition

https://www.kaggle.com/c/champs-scalar-coupling/

The task is to predict scalar_coupling_constant, the value of a magnetic interaction between two atoms in a molecule.

Number of data items and item name

For now, I read each file with pandas and count the distinct values in each column with nunique.
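A minimal sketch of that check. The helper name `summarize` and the tiny in-memory frame are hypothetical; the frame stands in for something like `pd.read_csv('../data/train.csv')`, and its molecule names and values are made up for illustration.

```python
import pandas as pd

def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Return the dtype and the number of distinct values for every column."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'n_unique': df.nunique(),
    })

# Toy stand-in for the real training file (values are invented).
df_train = pd.DataFrame({
    'id': [0, 1, 2, 3],
    'molecule_name': ['mol_a', 'mol_a', 'mol_a', 'mol_b'],
    'atom_index_0': [1, 1, 2, 1],
    'atom_index_1': [0, 2, 0, 0],
    'type': ['1JHC', '2JHH', '1JHC', '1JHN'],
    'scalar_coupling_constant': [84.8, -11.3, 84.8, 32.7],
})

n_records = len(df_train)      # number of records
summary = summarize(df_train)  # distinct values per column
print(n_records)
print(summary)
```

On the real files, the `n_unique` value for `molecule_name` gives the number of molecules and the one for `type` gives the number of interaction types.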

Test/training data

There are only about 130,000 molecules, but an interaction value is required for each pair of atoms, so the test and training data together contain about 8 million records.

Other data

molecule_name is the primary key

- dipole_moments.csv: dipole moment (x, y, z components)
- magnetic_shielding_tensors.csv: the 9 combinations of X, Y, Z (XX, XY, ...)
- potential_energy.csv: potential energy (electronic energy)

molecule_name and atom_index are the primary keys

- mulliken_charges.csv: Mulliken charges (as far as I understand, the charge shared between two atoms is simply split in half between them)
  - molecule_name
  - atom_index: 29 distinct values (the largest molecule has 29 atoms)
  - mulliken_charge

- structures.csv (the same data is also provided as xyz files)
  - molecule_name
  - atom_index
  - atom: 5 species (H, N, O, F, C)
  - x
  - y
  - z

Same records as the training data (4.7 million rows)

- scalar_coupling_contributions.csv: the sum of the fc, sd, pso, and dso columns equals scalar_coupling_constant in the training data.
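The relationship between the four contributions and the target can be checked with a join. This is a rough sketch, assuming both tables identify an atom pair by `molecule_name`, `atom_index_0`, and `atom_index_1`; the function name `max_sum_error` and the toy numbers are hypothetical.

```python
import pandas as pd

KEYS = ['molecule_name', 'atom_index_0', 'atom_index_1']  # assumed join keys
PARTS = ['fc', 'sd', 'pso', 'dso']

def max_sum_error(df_train: pd.DataFrame, df_contrib: pd.DataFrame) -> float:
    """Largest |fc + sd + pso + dso - scalar_coupling_constant| over all pairs."""
    # Keep only the key and contribution columns so shared columns do not clash.
    merged = df_train.merge(df_contrib[KEYS + PARTS], on=KEYS, how='inner')
    total = merged[PARTS].sum(axis=1)
    return float((total - merged['scalar_coupling_constant']).abs().max())

# Toy data (invented values) in place of the real CSV files.
df_train = pd.DataFrame({
    'molecule_name': ['mol_a', 'mol_a'],
    'atom_index_0': [1, 2],
    'atom_index_1': [0, 0],
    'scalar_coupling_constant': [84.0, -11.0],
})
df_contrib = pd.DataFrame({
    'molecule_name': ['mol_a', 'mol_a'],
    'atom_index_0': [1, 2],
    'atom_index_1': [0, 0],
    'fc': [83.0, -11.5],
    'sd': [0.25, 0.25],
    'pso': [0.5, 0.5],
    'dso': [0.25, -0.25],
})

err = max_sum_error(df_train, df_contrib)
print(err)
```

On real data the error will not be exactly zero because of floating-point rounding, so compare it against a small tolerance rather than testing for equality.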

Data distribution

Number of molecules

Molecules with about 15 atoms are the most common (though this view probably does not tell us much).

import pandas as pd
import matplotlib.pyplot as plt

path_structure = '../data/structures.csv'
df_structure = pd.read_csv(path_structure)

# atom_index is 0-based, so the atom count of a molecule is max(atom_index) + 1
n_atoms = df_structure.groupby('molecule_name')['atom_index'].max() + 1

# Number of molecules for each atom count
df_check = n_atoms.value_counts().sort_index()

plt.xlabel('N of atoms', fontsize=12)
plt.ylabel('count', fontsize=12)
plt.bar(df_check.index, df_check.values)
plt.show()

001_NumberOfAtomsDistribution.png

Test/training data

Compare the number of records in each of the 8 interaction types to check whether the test and training data follow the same distribution → the split appears to be random.

[Test data] 002_TypeDistributionTest.png

[Training data] 003_TypeDistributionTrain.png
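The same comparison can be done numerically instead of by eye. A minimal sketch, assuming both tables have a `type` column; the helper `type_share` and the toy frames are hypothetical.

```python
import pandas as pd

def type_share(df: pd.DataFrame) -> pd.Series:
    """Fraction of records that belong to each interaction type."""
    return df['type'].value_counts(normalize=True).sort_index()

# Toy stand-ins for the real training and test tables.
df_train = pd.DataFrame({'type': ['1JHC', '1JHC', '2JHH', '3JHH']})
df_test = pd.DataFrame({
    'type': ['1JHC', '2JHH', '1JHC', '3JHH', '1JHC', '1JHC', '2JHH', '3JHH'],
})

# Align the two share tables on type; a type missing on one side becomes 0.
comparison = pd.DataFrame({
    'train': type_share(df_train),
    'test': type_share(df_test),
}).fillna(0.0)
comparison['abs_diff'] = (comparison['train'] - comparison['test']).abs()
print(comparison)
```

Small values in `abs_diff` for every type are consistent with a random split, although they do not prove it.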

Distribution of values by type of interaction

  1. 1JHC has a large variance.
  2. The value of 1JHC is about 80, followed by 1JHN at about 50. The others are around 0.
  3. 1JHN has two peaks.

004_TypeDistributionSCC.png
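Observations like these can be backed up with per-type summary statistics. A rough sketch, assuming the training table has `type` and `scalar_coupling_constant` columns; the helper `stats_by_type` is hypothetical and the toy values are invented, not taken from the competition data.

```python
import pandas as pd

def stats_by_type(df: pd.DataFrame) -> pd.DataFrame:
    """Count, mean, spread, and range of the target for each interaction type."""
    return (
        df.groupby('type')['scalar_coupling_constant']
          .agg(['count', 'mean', 'std', 'min', 'max'])
    )

# Toy stand-in for the real training table (made-up values).
df_train = pd.DataFrame({
    'type': ['1JHC', '1JHC', '1JHN', '1JHN', '2JHH', '2JHH'],
    'scalar_coupling_constant': [80.0, 100.0, 40.0, 60.0, -12.0, -8.0],
})

stats = stats_by_type(df_train)
print(stats)
```

Summary statistics cannot reveal a two-peak shape, so a per-type histogram is still needed for that, for example `df_train.loc[df_train['type'] == '1JHN', 'scalar_coupling_constant'].hist(bins=100)`.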
