2D and 3D-QSAR study on 4-anilinoquinozaline derivatives as potent apoptosis inducer and efficacious anticancer agent

Background Apoptosis is known as programmed cell death that plays an important role in tumor biology. Methods In this study, apoptosis-inducing activity is predicted by using a QSAR modeling approach for a series of 4-anilinoquinozaline derivatives. 2D-QSAR model for the prediction of apoptosis-inducing activity was obtained by applying multiple linear regression giving r2 = 0.8225 and q2 = 0.7626, principal component regression giving r2 = 0.7539 and q2 = 0.6669 and partial least squares giving r2 = 0.8237 and q2 = 0.6224. Results QSAR study revealed that alignment-independent descriptors and distance-based topology index are the most important descriptors in predicting apoptosis-inducing activity. 3D-QSAR study was performed using k-nearest neighbor molecular field analysis (kNN-MFA) approach for both electrostatic and steric fields. Three different kNN-MFA 3D-QSAR methods (SW-FB, SA, and GA) were used for the development of models and tested successfully for internal (q2 > 0.62) and external (predictive r2 > 0.52) validation criteria. Thus, 3D-QSAR models showed that electrostatic effects dominantly determine the binding affinities. Conclusions The QSAR models developed in this study would be useful for the development of new apoptosis inducer as anticancer agents.


Introduction
Cancer is a disease of cell characterized by progressive, persistent, abnormal, purposeless, and uncontrolled proliferation of tissues. Currently, cancer is most dominating cause of death in world [1,2]. Apoptosis is a cellular process, that organisms use for digestion of excessive cells to control cell numbers. Caspases is a family of cysteine proteases, which are produced as zymogenes, plays a vital role in the beginning and ending of apoptosis [3]. Small drug molecules can inhibit or activate caspases [4]. In apoptosis, cell shrinks, deformed, and looses its contacts to neighboring cells that resulted into fragmented cells called "apoptotic bodies" [5]. Human body removes apoptotic bodies without causing any inflammatory response. Process of apoptosis is controlled by a diverse range of cell signals [6]. In humans during cell development, many cells are produced by mitosis in excess which eventually undergo programmed cell death and their by contribute to sculpturing many organs and tissues [7]. Apoptosis is an essential process for the maintenance of tissue homeostasis. During the last decade, a number of compounds are identified to induce apoptosis. Compounds that promote apoptosis are considered as important medicaments for the treatment of cancer. Anticancer efficacy of many therapeutic agents is correlated to their apoptosis inducer ability, so identification of apoptosis-inducing activity plays an important role in discovery and development of potential anticancer agent [8]. Many anticancer drugs like camptothecins, such as topotecan and irinotecan, [9] and vinca alkaloids, such as vincristine and vinblastine [10], kill tumors to some extent through induction of apoptosis [11]. Recently, extensive advances are achieved in the field of apoptosis-based therapeutics. Many new drug candidates are currently being developed and most of them are in clinical state as potential apoptosis inducer agents. In an effort for search of new potent apoptosis-inducing agents, we have performed 2D-and 3D-QSAR study on 4-anilinoquinozaline derivatives for quantifying the necessary structural and physicochemical requirements of this series of compounds as potent apoptosis inducer and efficacious anticancer agents.

Materials and methods
QSAR is the study of the quantitative relationship between the experimental activity of a set of compounds and their physicochemical properties using statistical methods. The experimental information associated with biological activity, which is used as dependent variables in building a QSAR model. In this study, all computational work (2D-and 3D-QSAR) was performed using Vlife MDS QSAR plus software on a HP computer with Core2 Duo processor and a window XP operating system.

2D-QSAR modeling and dataset
Apoptosis-inducing activity data EC 50 (μM) were taken from the published work [12]. The experimental EC 50 values were evaluated by Cai Sui Xiong et al. in a caspase-based HTS assay in human breast cancer cells (T47D). The negative logarithm of the measured EC 50 (μM) [pEC 50 = -log (EC 50 )] was used as dependent variable for 2D-and 3D-QSAR analysis and it is listed in Table 1. Since some compounds showed insignificant activity, such compounds were excluded from the dataset. Compounds were sketched using 2D draw application and converted to 3D structures. Energy minimization and geometry optimization were conducted using Merck molecular force field as force field and charge, maximum number of cycles were 1,000, convergence criterion (RMS gradient) was 0.01 and medium's dielectric constant of 1 by batch energy minimization method. Energy-minimized geometry was used for calculation of descriptors, a total of 208 2D descriptors were calculated which encoded different aspects of molecular structure and consists of electronic, thermodynamic, spatial, and structural descriptors, e.g., retention index (chi), atomic valence connectivity index (chiV), path count, chain path count, cluster, path cluster, element count, estate number, semi-empirical, molecular weight, molecular refractivity, logP, and topological index. Various Baumann alignment-independent (AI) descriptors were also calculated.

Selection of training and test set
The dataset of 32 molecules was divided into training set (24 compounds) and test set (8 compounds) by Sphere Exclusion (SE) method [13] for multiple linear regression (MLR), principal component regression (PCR), and partial least squares (PLS) model with dissimilarity value of 3.2 using pEC 50 activity field as dependent variable and various 2D descriptors as independent variables (Additional file 1). The unicolumn statistics of test and training sets (Table 2) showed the accurate selection of test and training sets, as the maximum of the training set was more than that of the test set and the minimum of the training set was less than or equal to that of the test set.

Regression analysis
Dataset of 32 molecules was subjected to regression analysis using MLR, PCR, and PLS as model building methods. QSAR models were generated using pEC 50 values as the dependent variable and various descriptors values as independent variables. The cross-correlation limit was set at 0.5, number of variables in the final equation at six in MLR and five in PCR and six in PLS, and term selection criteria as r 2 , F-test 'in,' at 4 and 'out' at 3.99, r 2 , and F-test. Variance cutoff was set at 0, scaling to autoscaling, and number of random iterations to 10. Statistical measures were used for the evaluation of QSAR models were the number of compounds in regression n, regression coefficient r 2 , number of descriptors in a model k, F-test (Fisher test value) for statistical significance F, cross-validated correlation coefficient q 2 , predictive squared correlation coefficients pred_r 2 , coefficient of correlation of predicted data set pred_r 2 se and standard error (SE) of estimation r 2 se and q 2 se.

MLR analysis
MLR is a method used for modeling linear relationship between a dependent variable Y (pEC 50 ) and independent variable X (2D descriptors). MLR is based on least squares: the model is fit such that sum-of-squares of differences of observed and a predicted value is minimized. MLR estimates values of regression coefficients (r 2 ) by applying least squares curve fitting method. The model creates a relationship in the form of a straight line (linear) that best approximates all the individual data points. In regression analysis, conditional mean of dependant variable (pEC 50 ) Y depends on (descriptors) X. MLR analysis extends this idea to include more than one independent variable.
Regression equation takes the form where Y is dependent variable, 'b's are regression coefficients for corresponding 'x's (independent variable), 'c' is a regression constant or intercept [14,15].

PCR method
PCR is a data compression method based on the correlation among dependent and independent variables. PCR provides a method for finding structure in datasets. Its aim is to group correlated variables, replacing the original descriptors by new set called principal components (PCs). These PCs uncorrelated and are built as a simple linear combination of original variables. It rotates the data into a new set of axes such that first few axes reflect most of the variations within the data. First PC (PC 1 ) is defined in the direction of maximum variance of the whole dataset. Second PC (PC 2 ) is the direction that describes the maximum variance in orthogonal subspace to PC 1 . Subsequent components are taken orthogonal to those previously chosen and describe maximum of remaining variance, by plotting the data on new set of axes, it can spot major underlying structures automatically. Value of each point, when rotated to a given axis, is called the PC value. PCA selects a new set of axes for the data. These are selected in decreasing order of variance within the data. Purpose of principal component PCR is the estimation of values of a dependent variable on the basis of selected PCs of independent variables [16].

PLS regression method
PLS analysis is a popular regression technique which can be used to relate one or more dependent variable (Y) to several independent (X) variables. PLS relates a matrix Y of dependent variables to a matrix X of molecular structure descriptors. PLS is useful in situations where the number of independent variables exceeds the number of observation, when X data contain colinearties or when N is less than 5 M, where N is number of compound and M is number of dependant variable. PLS creates orthogonal components using existing correlations between independent variables and corresponding outputs while also keeping most of the variance of independent variables. Main aim of PLS regression is to predict the activity (Y) from X and to describe their common structure [17]. PLS is probably the least restrictive of various multivariate extensions of MLR model. PLS is a method for constructing predictive models when factors are many and highly collinear.

Validation of QSAR model
The best way to evaluate quality of regression model is internal validation of QSAR model. Mostly leave-oneout (LOO) cross validation, one object (one biological activity value) is eliminated from training set and training dataset is divided into subsets (number of subsets = number of data points) of equal size. Model is build using these subsets and dependent variable value of the data point that was not included in the subset is determined, which is a predicted value. Mean of predicted will be same for r 2 and LOO q 2 (cross-validated correlation coefficient value) since all the data point will be sequentially considered as predicted in LOO subset. Same procedure is repeated after elimination of another object until all objects have been eliminated once. LOO cross validation resulted in three statistically significant models for each regression method [18]. To calculate q 2 following equation was used.
where Y pred , Y act , and Y mean are predicted, actual, and mean values of the pIC 50 , respectively. Σ(Y pred -Y act ) 2 is the predictive residual error sum of squares (PRESS). Definitive validity of model is examined by mean of external validation also, which evaluates how well equation generalizes. Training set is used to derive an adjustment model that is used after to predict activities of test set members. The predicted power of equations was validated using cross-validated squared correlation coefficient (q 2 ) and by predictive squared correlation coefficients pred_r 2 which is used as a diagnostic tool. To calculate the predictive r 2 (r 2 pred ) following equation was used.
where Y Pred(Test) and Y Test are predicted and observed activity values, respectively, of test set compounds, and Y Training is the mean activity value of training set. Statistical significance of these models was further supported by 'fitness plot' obtained for each model; this is a plot of experimental versus predicted activity of training and test set compounds and provides an idea about how fit the model was trained and how well it predicts activity of external test set (Figure 1). Nearness of experimental to predicted activity reported in Table 1 also adds to this fact. Contribution charts for all the significant models are presented in Figure 2, which gives percentage contribution of descriptors used in deriving the QSAR models.

3D-QSAR modeling and dataset
Dataset of 32 molecules was divided into training (24 compounds) and test (8 compounds) set by SE method having dissimilarities values of 7.9 with pEC 50 activity field as dependent variable and various 3D descriptors calculated for the compounds as independent variables (Additional file 2).

Molecular modeling and alignment
Conformational search was carried out by systemic conformational search method (grid search), which generates all possible conformations, by systematically varying each of the torsion angles of a molecule by some increment, keeping the bond lengths and bond angles fixed and lowest energy conformers were selected. All the compounds were aligned by template-based method. In template-based alignment method, a template structure was defined and used as a basis for alignment of a set of molecules. In this study, all the compounds were aligned against minimum energy conformation of most active compound number using quinazoline ring as template.

Calculation of field descriptors
Electrostatic and steric field descriptors were calculated with cutoffs of 10.0 kcal/mol for electrostatic and 30.0 kcal/mol for steric, and charge type was selected as by Gasteiger-Marsili [19]. The dielectric constant was set to 1.0, considering distance-dependent dielectric function. Probe setting was carbon atom with charge 1.0. A total of 2,080 field descriptors (1,040 for each electrostatic and steric) were calculated for all the compounds in separate columns. 3D-QSAR analysis was performed after exclusion of all the invariable columns, as they do not contribute to QSAR.

k-Nearest neighbor molecular field analysis (kNN-MFA)
The kNN methodology relies on a simple distance learning approach whereby an unknown member is classified according to the majority of its kNN in training set. The nearness is measured by an appropriate distance metric. In kNN-MFA method, several models were generated for selected members of training and test sets. Once training and test sets are generated, kNN methodology is applied to descriptors generated over the grid [20]. The steric and electrostatic interaction energies are computed at lattice points of the grid using a methyl probe of charge +1. These interaction energy values are considered for relationship generation and utilized as descriptors to decide nearness between molecules.

kNN-MFA with stepwise forward-backward (SW-FB) variable selection method
kNN-MFA models were developed using SW-FB method with cross-correlation limit set to 0.5 and term selection criterion as q 2 . F-test 'in' was set to 4.0, and Ftest 'out' to 3.99. As some additional parameters, variance cutoff was set at 2 kcal/mol Å and scaling to autoscaling; additionally, kNN parameter setting was done within the range of 2-5 and prediction method was selected as the distance-based weighted average.

kNN-MFA with genetic algorithm (GA)
GA first described by Holland [21], mimics natural evolution and selection. In biological systems, genetic information that determines the individuality of an organism is stored in chromosomes. Chromosomes are replicated and passed onto the next generation with selection criteria depending on fitness.

kNN-MFA with simulated annealing (SA)
SA is the simulation of a physical process, 'annealing', which involves heating the system to a high temperature and then gradually cooling it down to a preset temperature (e.g., room temperature). During this process, the system samples possible configurations distributed according to the Boltzmann distribution so that at equilibrium, low energy states are the most populated.

Generation of 2D-QSAR models
Descriptors used in generation of 2D-QSAR models are given in Table 3 with detail description. 2D-QSAR study of 4-anilinoquinozaline derivatives resulted in several QSAR models. Statistically significant QSAR models were selected for discussion.
Model where n = 24 training and 8 test , DF = 30, r 2 = 0.824, q 2 = 0.622, F-test = 31.137, r 2 se = 0.435, q 2 se = 0.515, pred_r 2 = 0.589 In above QSAR models, r 2 is a correlation coefficient that has been multiplied by 100 gives explained variance in biological activity. Predictive ability of generated QSAR models was evaluated by q 2 employing LOO method. F value reflects ratio of variance explained by models and variance due to error in regression. High F value indicates that model is statistically significant. Low SE of estimation indicted by r 2 se and q 2 se suggested that models are statistically significant. Predictive ability of QSAR model was also confirmed by external validation of test set compounds denoted by pred_r 2 , and it was found in agreement with accepted criteria of more than 0.3. Among these three models, PLS has come out with very good results as compare to other two models.
Results of PLS analysis showed very good predictive ability as indicted by r 2 , q 2 , F-test, and pred_r 2 values (Table 4).

Interpretaion of 2D-QSAR models
MLR (Model-1) and PLS (Model-3) indicate positive contribution of Baumann's [22] AI topological descriptors T_C_C_4, T_2_F_5, T_2_Cl_1 and negative contribution of chi3Cluster where as PCR model indicates T_2_F_5 is the count of number of double bounded atoms (i.e. any double bonded atom, T_2) separated from fluorine atom by five bonds in a molecule (C_C_C_C_C_C_F) T_2_Cl_1 T_2_Cl_1 is the count of number of double bounded atoms (i.e. any double bonded atom, T_2) separated from chlorine atom by single bonds in a molecule (C_C_Cl) T_N_N_7 T_N_N_7 is a count of number of nitrogen atoms (single, double or triple bonded) separated from any nitrogen atom (single or double bonded) by seven bonds in a molecule (N_C_C_C_C_C_C_C_N) T_N_O_5 T_N_O_5  positive contribution of RadiusOfGyration along with SKMostHydrophobic and negative contribution of T_N_O_1 and MomInertiaY. AI descriptors can be generated considering topology of the molecule, atom type, and bond. For calculation of AI descriptors, every atom in the molecule was assigned at least one and at most three attributes. First attribute is 'T-attribute' to thoroughly characterize topology of the molecule. Second attribute is atom type, atom symbol is used here. Third attribute is assigned to atoms taking part in a double or triple bond. After all atoms have been assigned their respective attributes, selective distance count statistics for all combinations of different attributes are computed. A selective distance count statistic 'XY2' (e.g., TOPO2N3) counts all the fragments between start atom with attribute 'X' (e.g., '2' doublebonded atom) and end atom with attribute 'Y' (e.g., 'N') separated by graph distance 3. Graph distance can be defined as the smallest number of atoms along the path connecting two atoms in molecular structure. In this study, to calculate AI descriptors, we have used following attributes: 2 (double-bonded atom), 3 (triple-bonded atom), C, N, O, S, H, F, Cl, and Br the distance range of 0 to 7. RadiusOfGyration and MomInertiaY are distance-based topological descriptors. Topological indices are numerical values associated with chemical constitutions for the purpose of correlating chemical structure with biological activity. Distance-based topological descriptors are defined by their atom types and topological distance which signifies basic connectivity of atoms in the molecules. SKMostHydrophobic is a thermodynamic descriptor to characterize the hydrophobicity of a molecule. SlogP estimates logP by summing the contribution of atom-weighted solvent accessible surface areas and correction factors. SKMostHydrophobic contributed positively to the PCR model means the group, which increases hydrophobic nature, may cause increase apoptosis inducing activity. An estate contribution descriptor SssCH2E-index, which represents the electro-topological state indices for number of -CH 2 group connected with two single bonds, is inversely proportional to the activity.

Generation and interpretaion of 3D-QSAR models
3D-QSAR modeling was performed using kNN-MFA method that adopts a kNN principle for generating relationships between molecular fields and apoptosis-inducing activity. The kNN-MFA models (4-6) were generated using training set of 24 compounds and 3D-QSAR models were validated using a test set of 8 compounds. The steric (S) and electrostatic (E) descriptors specify the regions, where variation in the structural features of different compounds in training set leads to increase or decrease in activities. The number accompanied by descriptors represents its position in 3D MFA grid. The stepwise forward backward variable selection method resulted in several statistically significant models, of which model-4 is considered as the best one. The model selection criterion is the value of q 2 , internal predictive ability of model, and that of pred_r 2   the value of LOO cross-validation squared correlation coefficient q 2 = 0.653 suggested goodness of the prediction. Model having predictive squared correlation coefficient (pred_r 2 ) > 0.52, in agreement with the accepted criteria as more than 0.4 which means 52% predictive power for the external test set ( Table 5). The predicted versus the experimental value for training and test sets are depicted in Figure 3. The best 3D-QSAR (kNN-MFA) models for the prediction of apoptosis-inducing activity were obtained by applying SA giving q 2 = 0.693 and pred_r 2 = 0.537, and GA giving q 2 = 0.622 and pred_r 2 = 0.563. 3D-QSAR models (4-6) obtained showed that electrostatic and steric interactions plays major role in determination of apoptosis inducing activity. E_895 and E_805 in model-4, E_509, E_875, and E_147 in model-5 are electrostatic field descriptors, similarly S_526 in model-5 and S_477 and S_664 in model-6 are steric descriptors (Figure 4). Negative value in electrostatic field descriptors indicates that negative electronic potential is required to increase apoptosis-inducing activity and more electronegative groups are preferred in that position, positive range indicates that group that imparting positive electrostatic potential is favorable for apoptosis-inducing activity so less electronegative group is preferred in that region. Similarly, negative range in steric descriptors indicates that negative steric potential is favorable for activity, and less bulky substituents group is preferred in that region, positive value of steric descriptors reveals that positive steric potential is favorable for increase in apoptosisinducing activity of 4-anilinoquinozalines and more bulky group is preferred in that region.

Conclusions
Apoptosis plays a pivotal role in the cytotoxic activity of most chemotherapeutic drugs. QSAR modeling resulted in identification of common structural features responsible for prediction of apoptosis-inducing activity for 4anilinoquinozaline derivatives. 2D-QSAR studies revealed that AI descriptors were major contributing descriptors. Descriptor values obtained in this study helped in quantification of the structural features of 4anilinoquinozaline derivatives. The overall degree of prediction was found to be around 82% in case of MLR and PLS. Among the three 2D-QSAR models (MLR, PCR, and PLS), results of PLS analysis showed significant predictive power and reliability as compare to other two methods. The master grid obtained for the various kNN-MFA models showed positive value in electrostatic field descriptors, which indicates that positive electronic potential is required to increase apoptosis-inducing activity. Negative range in steric descriptors indicates that less bulky group is preferred in that region. 3D-QSAR results suggested the importance of some

Additional material
Additional file 1: