Challenge molecular competitions for practicing RDKit and machine learning. Data assessment for the time being.
Basically my work memo. Since it was last year's competition, there are already various summaries, so it might be better to look there.
https://www.kaggle.com/c/champs-scalar-coupling/ Find the value of magnetic interaction called scalar_coupling_constant
For the time being, read it with pandas and nunique.
Although the number of molecules is 130,000, there is a value of the interaction required between each atom, so a total of 8 million records of test and training data.
--dipole_moments.csv: Dipole moment (x, y, z data) --magnetic_shielding_tensors.csv: 8 combinations of X, Y, Z (XX, XY, ...) --potential energy: electron energy
--mulliken_charges.csv: muliken charge (should be the one who simply halved the charge distribution between two points) - molecule_name --atom_index: 29 (Molecules with a maximum atomic number of 29) - mulliken_charge
--structures.csv (There are other xyz data) - molecule_name - atom_index --atom: 5 species (H, N, O, F, C) - x - y - z
The sum of fc, sd, pso, and dso is the training data scalar_coupling_constant.
You can see that there are many molecules of about 15 atoms (maybe it doesn't make much sense to see them)
path_structure = '../data/structures.csv'
df_structure = pd.read_csv(path_structure)
df_check = (
df_structure
.groupby('molecule_name')
.max('atom_index')
.reset_index()
.loc[:, ['molecule_name', 'atom_index']]
.groupby('atom_index').count()
)
plt.xlabel('N of atoms', fontsize=12)
plt.ylabel('count', fontsize=12)
plt.bar(df_check.index, df_check['molecule_name'])
plt.show()
Check if the test data and training data have the same distribution by the number of 8 types of interactions → It seems to be divided randomly
[test data]
[Training data]
1.1 Large variance of JHC 2. The value of 1JHC is about 80. Next, 1JHN is about 50. Others are around 0. 3.1 JHN has two peaks
Recommended Posts