[PYTHON] Fast calculation of only specific descriptors with mordred

Introduction

Previously, in "Calculate one molecule and one descriptor in mordred", I looked at how to calculate a single descriptor for a single molecule with mordred. That approach makes it possible to calculate only the descriptors you want to use.

However, when there are a large number of molecules, using that method to calculate only specific descriptors is inconvenient in the following ways.

- The calculation is slow because it is not parallelized. By contrast, the pandas method of the Calculator class runs the calculation in parallel when you specify the number of CPU cores.

- The results are not returned as a DataFrame, so collecting them takes extra effort.

This time, I investigated how to calculate only the descriptors you want to use for a large number of molecules while still using the pandas method of the Calculator class.

Environment

Source

It can be done as follows.


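from mordred import Calculator, descriptors

# Build a dummy Calculator only to get instances of every available descriptor.
calc_dummy = Calculator(descriptors, ignore_3D=False)

# Names of the descriptors to calculate.
my_desc_names = ["SpAD_A", "SRW10"]

# Keep only the descriptor instances whose name was requested.
my_descs = [desc for desc in calc_dummy.descriptors if str(desc) in my_desc_names]

# Build the real Calculator from the selected descriptor instances.
calc_real = Calculator(my_descs, ignore_3D=False)

# mols (the RDKit Mol objects to process) and args.output (the output
# directory) are assumed to be defined elsewhere in the script.
df = calc_real.pandas(mols, nproc=3)
df.to_csv(args.output + "/mordred.csv")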

Commentary

A careful reading of the Calculator class source shows that its constructor accepts a list of Descriptor instances in place of the descriptors module.

Taking advantage of this, the code picks out the instances whose names match the descriptors you want to calculate (listed in my_desc_names) and collects them in a list.

The dummy Calculator object, calc_dummy, exists only to obtain a list of descriptor instances. (There must be a more elegant way, so try digging into the source.)
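One possible refinement of the name lookup, as a rough sketch: index the instances by name in a dictionary, so that a misspelled name raises an error instead of being silently skipped, and the result follows the order in which the names were requested. The function name select_descriptors and the FakeDescriptor class are hypothetical; the stand-in class is there only so the sketch runs without mordred installed, and the names given to it are placeholders.

```python
def select_descriptors(all_descs, wanted_names):
    """Return the descriptors named in wanted_names, in that order.

    Each descriptor is identified by str(desc). Raises KeyError listing
    every requested name that was not found.
    """
    # If two descriptors share a name, the later one wins.
    by_name = {str(desc): desc for desc in all_descs}
    missing = [name for name in wanted_names if name not in by_name]
    if missing:
        raise KeyError(f"unknown descriptor names: {missing}")
    return [by_name[name] for name in wanted_names]


class FakeDescriptor:
    """Hypothetical stand-in for a mordred descriptor instance (demo only)."""

    def __init__(self, name):
        self.name = name

    def __str__(self):
        return self.name


all_descs = [FakeDescriptor(n) for n in ["ABC", "SpAD_A", "SRW09", "SRW10"]]
my_descs = select_descriptors(all_descs, ["SRW10", "SpAD_A"])
print([str(d) for d in my_descs])
```

With mordred itself, the idea would be to pass calc_dummy.descriptors as all_descs and hand the returned list to the Calculator constructor.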

Next, a second Calculator object is created by passing that list of descriptor instances to the constructor of the Calculator class.

Finally, running the calculation with this Calculator object yields a DataFrame that contains results for only the specified descriptors.

Conclusion

This technique is quite useful when you only need to make predictions with a model that has already been built, because in that case you only need to calculate the descriptors the model uses. I think the difference in processing time will be large, especially for lists of tens or hundreds of thousands of compounds.
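For the prediction-only case, the list of descriptor names could be recovered from the feature table used to train the model rather than typed by hand. The following is only a sketch under an assumption not made in this article: that the table was saved as a CSV whose header row holds the descriptor names, with an unnamed index column as written by DataFrame.to_csv. The function name descriptor_names_from_csv is hypothetical, and the table contents are placeholder values.

```python
import csv
import io


def descriptor_names_from_csv(fileobj):
    """Return the non-empty column names from the header row of a CSV file.

    Empty header cells, such as the unnamed index column written by
    DataFrame.to_csv, are skipped.
    """
    header = next(csv.reader(fileobj))
    return [name for name in header if name != ""]


# Hypothetical feature table: an index column plus two descriptor columns.
# The numeric values are placeholders, not real descriptor values.
training_csv = io.StringIO(",SpAD_A,SRW10\n0,1.0,2.0\n1,3.0,4.0\n")
my_desc_names = descriptor_names_from_csv(training_csv)
print(my_desc_names)
```

In a real script, an open file object for the saved table would take the place of the StringIO buffer.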
