Analogue-based approaches in anti-cancer compound modelling: the relevance of QSAR models

Background QSAR is among the most extensively used computational methodologies for analogue-based design. The application of various descriptor classes, such as quantum chemical, molecular mechanics, conceptual density functional theory (DFT)- and docking-based descriptors, for predicting anti-cancer activity is well known. Although in vitro assays for anti-cancer activity are available against many different cell lines, most computational studies target an insufficient number of cell lines. Hence, statistically robust and extensive QSAR studies against 29 different cancer cell lines, together with a comparative account, have been carried out.
Results The predictive models were built for 266 compounds with experimental data against 29 different cancer cell lines, employing a minimal number of mutually independent descriptors. Robust statistical analysis shows high correlation and cross-validation coefficient values and provides a range of QSAR equations. The comparative performance of each class of descriptors was assessed, and the effect of the number of descriptors (1-10) on the statistical parameters was tested. Charge-based descriptors were found in 20 out of 39 models (approx. 50%), valency-based descriptors in 14 (approx. 36%) and bond order-based descriptors in 11 (approx. 28%), more often than other descriptors. The use of conceptual DFT descriptors does not improve the statistical quality of the models in most cases.
Conclusion Models were analysed with the number of descriptors increased from 1 to 10; it is interesting to note that in most cases 3 descriptor-based models are adequate. The study reveals that quantum chemical descriptors are the most important class of descriptors in modelling these series of compounds, followed by electrostatic, constitutional, geometrical, topological and conceptual DFT descriptors. The best statistical values were obtained for the cell lines of nasopharyngeal cancer (2 cell lines, average R2 = 0.90), followed by those of melanoma (4 cell lines, average R2 = 0.81).


Background
Cancer has long been a serious threat to human health and life and has become the leading disease-related cause of death in the human population [1]. Radiation therapy and surgery are successful as treatments only when the cancer is found at an early, localized stage. Chemotherapy, in contrast, is the mainstay in the treatment of malignancies because of its ability to cure widespread or metastatic cancers. Natural products have been the major source of anti-cancer drugs. According to a review on new chemical entities, approximately 74% of anti-cancer drugs were either natural products or natural product-related synthetic compounds or their mimetics [2]. Computational methodologies have emerged as an indispensable tool for any drug discovery program, playing a key role from hit identification to lead optimization. QSPR/QSAR is among the most practical tools used in analogue/ligand-based drug design and has been extensively reviewed for the prediction of various properties such as ADME [3], toxicity [4,5], carcinogenicity [6], retention time [7], stability [8] and other physicochemical properties, apart from biological activity [9][10][11][12]. This theoretical method follows the axiom that the variance in the activities or physicochemical properties of chemical compounds is determined by the variance in their molecular structures [13][14][15].
Computational methods aid not only in the design and interpretation of hypothesis-driven experiments in the field of cancer research but also in the rapid generation of new hypotheses. QSAR has been widely applied for the activity prediction of diverse series of biological and/or chemical compounds, including anti-cancer drugs [16][17][18][19][20][21]. A number of quantum chemical descriptors (such as charge, molecular orbital, dipole moment, etc.) and molecular property descriptors (such as steric, hydrophobic coefficient, etc.) have been successfully applied to establish 2D QSAR models for predicting activities of compounds [22][23][24]. Density functional theory (DFT)-based descriptors have found immense usefulness in the prediction of the reactivity of atoms and molecules, and their application in the development of QSAR has been recently reviewed [25][26][27][28][29][30]. QSAR has been instrumental in the development of various popular drugs, as discussed in detail earlier [31].
For a given cancer type, a number of cell lines are available on which in vitro evaluation of biological activity can be performed, but the results of this evaluation vary with the cell line employed for the assay. It therefore becomes difficult for a computational chemist to choose experimental data from the pool of available biological activities for a single scaffold type in order to proceed with analogue-based design. Although in vitro assays for anti-cancer activity are available against many different cell lines, most computational studies target only one particular cell line, which may not be a reliable approach. A study that considers all the available experimental data to build predictive models will guide medicinal chemists to design new and potent compounds more reliably. Also, analyzing the descriptors obtained for models against all the cell lines may suggest the importance of a particular class of descriptor in modelling anti-cancer activity against a cancer type. Such statistically robust and extensive QSAR studies against many different cancer cell lines have not been reported yet. Hence, we performed comprehensive QSAR modelling studies on 266 anti-cancer compounds against 29 different cancer cell lines. Descriptor analysis of all the QSAR models was performed to derive commonality among the various cell lines belonging to a cancer type. The experimental data considered in the study were from in vitro cell line-based assays, and it is difficult to get reliable target-based information from such studies unless they are meticulously validated. Since the aim of the present study was to evaluate the potential of simple 2D-based descriptors in anti-cancer compound modelling, the biological target-related aspects were not considered. This study provides one of the most comprehensive accounts of the structure-activity relationship of a large number of molecules against 29 different cancer cell lines.
Besides building statistically significant models, this study aims to assess the role and relevance of computationally demanding conceptual DFT descriptors compared with conventional descriptors. The strengths and limitations of QSAR models in treating a complex area such as the development of anti-cancer compounds are important to note, and the present study shows a systematic way of developing and applying QSAR equations effectively. Table 1 shows the names of the scaffolds considered, the different cell lines [32][33][34][35][36][37][38][39][40][41], the number of molecules corresponding to each cell line and the target of action or the molecular mechanism of the scaffolds.

Results and discussion
Two different schemes were adopted to develop statistically significant QSAR models. In the first scheme, 10 QSAR models were developed for the 10 scaffolds used in this study (i.e. scaffold-based QSAR models), whereas in the second scheme 29 different QSAR models were developed based on the availability of IC50 values against 29 cancer cell lines by combining all the scaffolds (i.e. cell line-based QSAR models). The parent structures of all the scaffolds, with the number of compounds and the names of the cell lines, are represented in Scheme 1.
Table 1 Details of scaffolds considered in the study and the cell lines against which their anticancer activity was reported, along with the number of molecules in each cell line and its molecular target/mechanism of action if studied
It is vitally necessary to avoid oversimplification of the QSAR modelling process and to employ statistically robust approaches for model development. The selection of the best model was based on the values of the correlation coefficient obtained from the correlation of approximately 300 descriptors (constitutional, geometrical, topological, electrostatic, quantum chemical, etc.) in different combinations. On the one hand, the uniqueness of a compound and its total chemical information cannot be described by very few descriptors; on the other hand, a large number of descriptors will create confusion and reduce the statistical robustness and predictive ability of the model. The effect of the number of descriptors on the correlation coefficient values for all the models was tested on the training set by correlating 1-10 descriptors separately and is presented in Figure 1a (for cell line-based models) and Figure 1b (for scaffold-based models). We observed that in various models three descriptors are sufficient for a good correlation, and using more than three descriptors has only a small effect on the statistical quality of the models in most cases. Although models based on more than six descriptors may provide high correlation and cross-validation coefficient values, these values may be misleading and thus may not be very useful for further prediction of IC50 values. Before the compounds were divided into training and test sets, three-, four- and five-descriptor models were selected. When the statistical performance of the selected models was compared, three-descriptor models were found to be optimum, as they provide very acceptable correlation in most cases.
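The dependence of the correlation coefficient on the number of descriptors can be illustrated with a minimal sketch. It uses greedy forward selection with ordinary least squares as a stand-in for the heuristic method, which is an assumption and not the procedure actually implemented in CODESSA. The descriptor matrix `X` and activity vector `y` are synthetic, and the helper names (`solve`, `r_squared`, `forward_selection`) are hypothetical.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def r_squared(X, y, cols):
    """R2 of a least-squares model (with intercept) using the descriptor columns in cols."""
    rows = [[1.0] + [row[c] for c in cols] for row in X]
    p = len(cols) + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    pred = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

def forward_selection(X, y, max_k):
    """Add, one at a time, the descriptor that raises R2 the most."""
    chosen, curve = [], []
    for _ in range(max_k):
        rest = [c for c in range(len(X[0])) if c not in chosen]
        best = max(rest, key=lambda c: r_squared(X, y, chosen + [c]))
        chosen.append(best)
        curve.append(r_squared(X, y, chosen))
    return chosen, curve

# Synthetic data (not from the paper): 60 compounds, 8 descriptors,
# activity determined by the first three descriptors plus small noise.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(60)]
y = [3 * r[0] + 3 * r[1] + 3 * r[2] + rng.gauss(0, 0.1) for r in X]

chosen, curve = forward_selection(X, y, max_k=6)
for k, r2 in enumerate(curve, start=1):
    print(f"{k} descriptors: R2 = {r2:.3f}")
```

On such data the R2 curve rises steeply up to three descriptors and then flattens, which is the kind of plateau that motivates stopping at a small number of descriptors.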
For every model, the compounds were divided into training and test sets by randomly assigning around 20% of the compounds to the test set. Two independent test sets were constructed to rule out chance correlation (statistical data for the second test set are reported in Additional file 1 Table S83). Both test sets showed similar statistical performance, indicating that the developed models are adequate. Final QSAR models were generated within the training set and were used to predict the activity of the test set compounds. The low average residuals obtained for both the training and test set compounds in all the models indicate that the developed models are valuable and are capable of establishing the relationship between structure and activity for the various anti-cancer scaffolds used in this study.
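A random division of this kind can be sketched as follows. The function name `split_train_test`, the compound identifiers and the seeds are hypothetical, and the sketch assumes that the two test sets are simply two independent random draws of about 20% of the compounds; the study itself used the Project Leader application for this step.

```python
import random

def split_train_test(compound_ids, test_fraction=0.2, seed=0):
    """Randomly assign about test_fraction of the compounds to the test set."""
    rng = random.Random(seed)
    ids = list(compound_ids)
    rng.shuffle(ids)
    n_test = max(1, round(test_fraction * len(ids)))
    return sorted(ids[n_test:]), sorted(ids[:n_test])

# Hypothetical identifiers for a model built on 40 compounds.
compounds = [f"cpd{i:03d}" for i in range(1, 41)]

# Two independent partitions, obtained with different seeds.
train_a, test_a = split_train_test(compounds, seed=1)
train_b, test_b = split_train_test(compounds, seed=2)

print(len(train_a), len(test_a))
print(len(train_b), len(test_b))
```

Fitting the model on each training set and comparing the statistics of the two test sets then gives a simple check against chance correlation.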
In order to assess and compare the predictive power and the stability of the QSAR models, several widely applied statistical and other parameters are reported: R2, Rcv2, s2, F and AE (for details about these parameters, see the footnote to Table 2). Table 2 contains the regression summary for the cell line-based QSAR models along with the regression equation, the name of the cell line and the type of cancer. Most of the cell line-based QSAR models where the activity range is broad (M1, M2, M4, M5, M6, M8, M9, M11, M12 and M20) show higher statistical quality (R2 ~ 0.80, Rcv2 ~ 0.75) and seem valuable for the current class of compounds. The statistical quality of a few other cell line-based models (M10, M15, M19 and M21) is also reasonable (R2 ~ 0.75, Rcv2 ~ 0.70), and these models can be used for prediction. However, the lower statistical quality of the M17, M23 and M26 models (R2 ~ 0.60, Rcv2 ~ 0.50) shows that extra care is required before these models are used for prediction. M29 cannot be used for prediction because of the insignificant statistical results obtained for this model (R2 = 0.46, Rcv2 = 0.43). The poor result for M29 is probably due to the involvement of 118 compounds and 5 different scaffolds in this model. Increasing the number of descriptors for M29 does not improve the quality of the model much (with 10 descriptors, R2 ~ 0.7), which indicates that the currently used descriptors are not good enough for developing the structure-activity relationship for this model and that one needs to try or develop additional descriptors. However, the involvement of single scaffolds in this model provides a good statistical quality (DU145/S10 in Table 3 .60) although the residuals are lower in all the 11 models as per expectations. The statistical details and descriptor types for cell line-based QSAR models are depicted in Figure 2a.
Scheme 1 266 compounds which have IC50 values, represented as different scaffolds (S1-S10), with the number of compounds in each scaffold in parentheses and the different cell lines against which the cytotoxicity values were reported (please see Tables S1-S10 in Additional file 1 for the structures of all the compounds with their in vitro IC50 values against various cell lines).
Figure 1 Effect of number of descriptors on the correlation coefficient of (a) cell line-based QSAR models, (b) scaffold-based QSAR models.
The regression summary for the scaffold-based QSAR models, along with the regression equation, the name of the cell line and the type of cancer, is given in Table 3. We observed good statistical quality with higher regression coefficient values in all the scaffold-based QSAR models, probably because a smaller number of compounds and only one scaffold are involved in the development of each of these models. The range of activity of the compounds in four models (S1, S2, S5 and S6) is narrow, so these models were moved to the end of Table 3 and will not be very reliable. The models built on compounds with a narrow activity range show lower regression coefficient values than those built on compounds with a broad activity range. All the scaffold-based models with broad activity range compounds seem reasonable and can be used for prediction. The statistical details and descriptor types for scaffold-based QSAR models are depicted in Figure 2b.
The observed and predicted activities with residuals and descriptor values for all the developed models are presented in Additional file 1 (Tables S12 to S46). Outliers are those compounds which do not fit the developed QSAR models. Most of these QSAR models have no outliers; in some cases, at most one outlier is present, owing to the large deviation between its observed and predicted activities. The occurrence of outliers is due not only to the possibility that the compounds may act by different mechanisms or interact with the receptor in different binding modes but also to the intrinsic noise associated with both the original data and the methodological aspects adopted for the construction of the models. Figure 3a,b shows the plots of experimental against predicted IC50 values for the cell line- and scaffold-based QSAR models, respectively (the plots for the 11 cell line- and 4 scaffold-based models that have a narrow activity range are presented in Figure S1a,b, respectively, of Additional file 1). The average residuals for the test and training set compounds presented in this figure clearly show that the compounds of the test set are closer to the line than the compounds of the training set. Rigorous validation of the applicability of the generated QSAR models was done by constructing a second independent test set. As expected, the statistical performance of the second test set is similar to that of the first test set. The observed and predicted activities with residuals and descriptor values for all the developed models for the second test set of compounds are presented in Additional file 1 (Tables S48-S82).
Please refer to the footnote of Table 2 for definition of the statistical parameters as well as other abbreviations.
Footnote to Table 2: R2 is the square of the correlation coefficient and represents the statistical significance of the model. Rcv2 is the cross-validated R2, a measure of the quality of the QSAR model. O is the number of outliers for the model. s2 is the square of the standard deviation. F is the Fisher statistic, the ratio between explained and unexplained variance for a given number of degrees of freedom, thereby indicating a factual correlation or the significance level for QSAR models. AE is the average of the absolute differences between experimental and predicted IC50 values. TR is the number of molecules in the training set, TE is the number of molecules in the test set, and PD is the number of molecules for which activity was not reported and was predicted by the QSAR model.
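The statistical parameters reported for the models (R2, Rcv2, s2, F and AE) can be made concrete with a small sketch. It assumes a single-descriptor least-squares model for brevity (the models in this study use three descriptors), takes Rcv2 to be the leave-one-out value 1 - PRESS/SS_tot, and uses made-up descriptor and activity values. The function names are hypothetical.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for a single-descriptor model."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return slope, my - slope * mx

def regression_summary(x, y, k=1):
    """Return R2, leave-one-out Rcv2, s2, F and AE for a k-descriptor fit (k = 1 here)."""
    n = len(y)
    slope, intercept = fit_line(x, y)
    pred = [slope * xi + intercept for xi in x]
    mean_y = sum(y) / n
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    # Leave-one-out: refit without compound i, then predict compound i.
    press = 0.0
    for i in range(n):
        s_i, b_i = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (s_i * x[i] + b_i)) ** 2
    s2 = ss_res / (n - k - 1)  # residual variance
    return {
        "R2": 1 - ss_res / ss_tot,
        "Rcv2": 1 - press / ss_tot,
        "s2": s2,
        "F": ((ss_tot - ss_res) / k) / s2,  # explained / unexplained variance
        "AE": sum(abs(yi - pi) for yi, pi in zip(y, pred)) / n,
    }

# Made-up descriptor values and activities for six compounds.
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
activity = [1.2, 1.9, 3.2, 3.8, 5.1, 6.0]

stats = regression_summary(descriptor, activity)
for name, value in stats.items():
    print(f"{name} = {value:.3f}")
```

For a least-squares fit, Rcv2 computed this way can never exceed R2, so a large gap between the two is a warning sign of an over-fitted model.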
In the developed QSAR models, 78 descriptors (42 quantum chemical, 18 electrostatic, 8 constitutional, 7 geometrical and 3 topological) were used in different combinations. Figure 4 depicts the details of all the 78 descriptors, their types and their occurrence in the models. The inter-correlation of the descriptors appearing in all the developed models was taken into account, and the descriptors were found to be reasonably orthogonal (see Additional file 1 Table S47 for details). In general, quantum chemical descriptors occurred frequently in the developed QSAR models. Charge-based descriptors (such as Maximum partial charge for a H atom, Minimum net atomic charge for a H atom, Relative positive charged surface area, Maximum net atomic charge for a C atom, etc.) were present in 20 of 39 models (approx. 50%), thereby accounting for a major proportion of the overall descriptor space. These were followed by valency-based descriptors (such as Minimum valency of O atom, Minimum valency of a C atom, Average valency of a N atom, Maximum valency of a H atom, etc.), present in 14 models (approx. 36%), and then by bond order-based descriptors (such as Minimum (>0.1) bond order of a H atom, Maximum bond order of a N atom, Average bond order of a C atom, Maximum PI-PI bond order, etc.), present in 11 models (approx. 28%). This indicates the role of charge-based, valency-based and bond order-based descriptors in modelling the present set of compounds. We tested the conceptual DFT descriptors on all the above models and found that these descriptors are not important for this class of compounds.
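A check of descriptor inter-correlation of the kind mentioned above can be sketched as a pairwise Pearson correlation over the descriptors of one model. The descriptor labels, their values and the 0.8 threshold are illustrative assumptions; the paper does not state the cut-off it used, and the third descriptor is a deliberately redundant copy added to show a flagged pair.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equally long value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def correlated_pairs(descriptors, threshold=0.8):
    """List descriptor pairs whose absolute correlation reaches the threshold."""
    flagged = []
    for (name_a, va), (name_b, vb) in combinations(descriptors.items(), 2):
        r = pearson(va, vb)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged

# Hypothetical descriptor values for six compounds.
charge = [0.10, 0.12, 0.15, 0.11, 0.18, 0.14]
model = {
    "max_partial_charge_H": charge,
    "min_valency_O": [1.95, 2.01, 1.90, 2.05, 1.98, 1.93],
    "redundant_copy": [2 * v + 0.5 for v in charge],  # linear copy of the first
}

flagged = correlated_pairs(model)
for name_a, name_b, r in flagged:
    print(f"{name_a} vs {name_b}: r = {r}")
```

A model whose descriptor set produces an empty list here would count as reasonably orthogonal under the assumed threshold.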
Cell lines considered in the current study correspond to 14 different cancer types (Additional file 1 Table S84). Among them, eight cancer types have experimental data with more than one cell line. Thus, comparative
Figure 2 Regression summary (correlation coefficient R2, cross-validation coefficient Rcv2 and average residual AE values) for (a) cell line-based QSAR models, (b) scaffold-based QSAR models.

Conclusions
Within the present study, we assessed the predictive power of QSAR approaches to model anti-cancer compounds. A total of 39 QSAR models, 10 for different scaffolds and 29 for different cell lines, were built for this purpose. Although the analysis covered models in which the number of descriptors was increased from 1 to 10, it is interesting to note that in most cases 3 descriptor-based models are adequate. The study reveals that quantum chemical descriptors are the most important class of descriptors, followed by electrostatic, constitutional, geometrical, topological and conceptual DFT descriptors. Charge-based descriptors were the most prevalent, followed by valency-based and bond order-based descriptors. Thus, the current study highlights the importance of analogue-based design approaches in modelling anti-cancer compounds. Notably, we did not make any assumptions about the site of interaction or mechanism of action of these compounds, yet were able to develop statistically robust models for all experimentally tested compounds, where the correlation coefficient (R2) and cross-validation coefficient (Rcv2) values are higher and the average residuals (AE) are lower in most cases. The best statistical values were obtained for the cell lines of nasopharyngeal cancer (2 cell lines, average R2 = 0.90), followed by those of melanoma (4 cell lines, average R2 = 0.81).

Methods
Details of the scaffolds considered in the study, along with the cell lines against which experimental IC50 values are reported and the number of compounds for each cell line, are given in Table 1. Two different schemes (scaffold- and cell line-based) were followed for performing QSAR studies. Scaffold-based QSAR studies were carried out based on the availability of compounds in the various scaffolds (S1-S10) collected from ten different studies. The cell line that provided the best regression summary was used for making the scaffold-based QSAR models. See Tables S1-S10 in Additional file 1 for the structures and the corresponding activity values of all the compounds. Scheme 2 provides a schematic illustration of the workflow adopted in the manuscript for building and validating the various QSAR models. A total of 266 compounds belonging to 10 different chemical scaffolds were collected along with their anti-cancer activity against 29 cancer cell lines (Scheme 1). All the structures were initially optimized using the semi-empirical AM1 procedure and later subjected to energy evaluations at the B3LYP/6-31G(d) level on the AM1 geometries [42]. Important descriptors were obtained from these B3LYP calculations by using the CODESSA [43] program in conjunction with the Gaussian output files. The 300 descriptors obtained using the CODESSA program can be divided into different classes such as constitutional, topological, geometrical, quantum chemical and thermodynamic (see Table S11 for the details of all the descriptors).
Bohari et al. Organic and Medicinal Chemistry Letters 2011, 1:3 http://www.orgmedchemlett.com/content/1/1/3
For each compound these descriptors were calculated, and non-significant descriptors were identified by the heuristic method and eliminated. The inter-correlation of the descriptors in all the models was tested. Models in which the descriptors were highly inter-correlated were then replaced and refined so that the descriptors employed in a given model are virtually orthogonal to each other. In order to find the minimum number of descriptors defining activity, we systematically developed 3, 4 and 5 descriptor-based models for all sets of compounds, using the heuristic method. It was found that three descriptor-based models are fairly satisfactory. Then all the compounds were divided into two independent test sets (approx. 20%) and training sets (approx. 80%) using the Project Leader application associated with Scigress Explorer [44]. The statistical quality of the model was assessed by various parameters such as R2, Rcv2, AE, s and F, for both the test and training sets. The validation of the QSAR models was done by examining the prediction of activity on the test set, i.e. R2, Rcv2 and AE. The effect of the number of descriptors on the correlation coefficient was examined on the training set of molecules by running the heuristic method at 1-10 descriptors. Two different training and test sets were developed to rule out chance correlation. Scheme 2 illustrates the steps taken for developing the final QSAR models in a schematic fashion.
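The elimination of non-significant descriptors can be pictured with a rough sketch. The criteria actually applied by the heuristic method are not listed here, so the two filters below (near-constant values, and a weak one-descriptor correlation with activity) and their thresholds are assumptions chosen for illustration. The function `prefilter` and the descriptor labels are hypothetical.

```python
from math import sqrt
from statistics import pvariance

def pearson(x, y):
    """Pearson correlation coefficient between two equally long value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def prefilter(descriptors, activity, min_variance=1e-6, min_abs_r=0.1):
    """Keep descriptors that vary across compounds and correlate with activity."""
    kept = {}
    for name, values in descriptors.items():
        if pvariance(values) < min_variance:
            continue  # (near-)constant descriptor carries no information
        if abs(pearson(values, activity)) < min_abs_r:
            continue  # no one-descriptor relationship with activity
        kept[name] = values
    return kept

# Made-up activities and descriptor values for six compounds.
activity = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
descriptors = {
    "d_constant": [1.0] * 6,
    "d_informative": [0.9, 2.1, 2.9, 4.2, 5.1, 5.8],
    "d_uncorrelated": [1.0, 0.0, -1.0, -1.0, 0.0, 1.0],
}

kept = prefilter(descriptors, activity)
print(list(kept))
```

Only the descriptors surviving such a filter would then be passed on to the inter-correlation test and to the construction of the 3, 4 and 5 descriptor-based models.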

Additional material
Additional file 1: The additional data file available with the online version of the article contains the following information: (a) Structures of all the compounds used in this study (Tables S1-S10); (b) Full names of all the descriptors involved in the study (Table S11); (c) The predicted activity and descriptor values for all the models, the first test set (Tables S12-S46); (d) Inter-correlation analysis of the descriptors (Table S47); (e) The predicted activity and descriptor values for all the models, the second test set (Tables S48-S82); (f) Regression summary for cell line-based and scaffold-based QSAR models pertaining to the second test set (Tables S83a and S83b); (g) Comparative statistical significance of various cancer types (Table S84); (h) Plots of experimental against predicted IC50 values for the QSAR models where the activity range was narrow, based on cell lines and scaffolds (Figure S1a,b).