
Comparison of Different Classifiers on Ames Data
Posted by dhoneycutt, Oct 13, 2010 12:39 PM

I just made a posting on the Accelrys Blog describing a few calculations I did on an Ames mutagenicity data set recently published and made available by Katja Hansen et al. Here, I provide some more details on how the calculations were done. The results may help you decide among different methods when you need to build a classification model.

 

The learner components I applied to the data were: Learn Good Molecules (Bayesian), Learn Cross-validated RP Tree Model (RP Tree), Learn RP Forest Model (RP Forest), Learn Molecular Property (kNN), and Learn R Support Vector Machine Model (R SVM). In all cases, I used the ECFP_4 fingerprint as the sole independent variable, largely because that's what Hansen et al. used for their Bayesian model. (Normally, I'd use a larger-radius fingerprint such as ECFP_6.)

 

For the RP Forest, I used 2000 trees, with the Number of Descriptors set to the fraction 0.25. This means that at each tree node, a random one-fourth of all the fingerprint features seen in the data are considered as potential splitting criteria. In a standard random forest model, the default is to use the square root of the number of descriptors as the number to consider, which for this particular data set with ECFP_4 corresponds to the fraction 0.025. However, because each "descriptor" here is a fingerprint feature bit whose only possible values are 0 or 1, the information per descriptor is lower than with a continuous property. Hence, it seemed appropriate to increase the fraction considered, and empirically, this gave a better model.
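(For readers who want to try the same idea outside PP, here is a minimal R sketch using the randomForest package rather than the RP Forest component; fp is assumed to be a 0/1 matrix of fingerprint bits and y a factor of activities, both placeholder names.)

    library(randomForest)
    p <- ncol(fp)                                          # number of fingerprint "descriptors"
    # Default-style setting: roughly sqrt(p) descriptors considered at each split
    rf_default <- randomForest(fp, y, ntree = 2000, mtry = floor(sqrt(p)))
    # Analogue of the 0.25 fraction: one-fourth of the descriptors considered at each split
    rf_quarter <- randomForest(fp, y, ntree = 2000, mtry = max(1, floor(0.25 * p)))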

 

For Learn Molecular Property, I used the "k-Nearest-Neighbor" model option, which uses Tanimoto similarity to predict a test compound's activity based on its proximity to compounds in the training data. Note that the purpose of Learn Molecular Property is to build regression models, yet we're dealing with a classification problem. But we can turn any regression model into a classifier simply by applying a cutoff. In this particular case, we set the property Activity to 1 for active compounds or 0 for inactive ones when building the model. The kNN prediction then represents the model's assessment of the probability that the test compound is active, and we can use this prediction to get a ROC score as for any other score-based classifier.
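(To make this concrete, here is a rough R sketch of a Tanimoto-based kNN score and a rank-based ROC AUC. It is only an illustration, not the internals of Learn Molecular Property: train_fp and test_fp are placeholder 0/1 fingerprint matrices, train_y and test_y are 0/1 activity vectors, and k = 5 is an arbitrary example value.)

    # Tanimoto similarity between one query fingerprint and every row of a 0/1 matrix
    tanimoto <- function(q, M) {
      inter <- as.vector(M %*% q)
      union <- rowSums(M) + sum(q) - inter
      ifelse(union == 0, 0, inter / union)
    }

    # kNN score = fraction of active compounds among the k nearest neighbors
    knn_score <- function(train_fp, train_y, test_fp, k = 5) {
      apply(test_fp, 1, function(q) {
        nn <- order(tanimoto(q, train_fp), decreasing = TRUE)[1:k]
        mean(train_y[nn])
      })
    }

    # Rank-based ROC AUC (equivalent to the Mann-Whitney U statistic)
    auc <- function(y, s) {
      r <- rank(s); n1 <- sum(y == 1); n0 <- sum(y == 0)
      (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
    }

    scores <- knn_score(train_fp, train_y, test_fp, k = 5)
    auc(test_y, scores)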

 

With any of the learners in the R Statistics collection, it can be tricky to use sparse fingerprints (FP) such as ECFP_4. The reason is that R does not natively handle such fingerprints. To get the FP into a form that R can handle, we need to convert it such that there is one property for each FP feature in the data set. For data records where the feature is present, the property value is 1; for others, the value is 0. The problem is that for large data sets, the number of resulting properties can be so large that either R is overwhelmed or the time required to build the model in R becomes prohibitively long. In order to reduce the number of properties passed to R, we must either "fold" the fingerprint to a fixed size (such as 256 bits, corresponding to 256 binary properties in R) or perform feature selection on the FP. To do the latter, we use the Fingerprint to Properties component. This uses a Bayesian analysis to pre-process the data and keep only the N most important features (where N=200 or 400 in the runs I did).
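(As a rough illustration of the "one binary property per feature" conversion and of folding, here is an R sketch. It assumes feature_ids is a list with one integer vector of hashed ECFP feature IDs per molecule; how those IDs get exported from PP is left out.)

    # Fold sparse fingerprint feature IDs into a fixed-width 0/1 matrix (e.g., 256 columns)
    fold_fp <- function(feature_ids, n_bits = 256) {
      t(sapply(feature_ids, function(ids) {
        v <- integer(n_bits)
        v[unique(ids %% n_bits) + 1L] <- 1L   # colliding features simply share a column
        v
      }))
    }

    fp256 <- fold_fp(feature_ids, n_bits = 256)   # 256 binary "properties" ready to pass to R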

 

The broader point to keep in mind here is that by comparison to native PP models, R models start at a disadvantage when using fingerprints as descriptors. So the relatively poor performance of the R SVM in the table below does not necessarily reflect a weakness in the SVM algorithm.

 

One other complication with the R SVM learner is that there are a couple of parameters -- Gamma and Cost -- that need to be tuned in order to get the best model. To help with this, the SVM component in PP has built-in cross-validation to automatically choose the best combination of Gamma and Cost from a list of values that you provide. Or if you're in a hurry, as I was, and already have an independent way to test the model quality, the SVM learner can use the all-data model rather than cross-validation to do the parameter selection. (For large data sets, the R svm() function is typically much slower than other learners, and I didn't want to wait the 2+ hours that cross-validation would have taken.)
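(For reference, that kind of grid search can be approximated directly in R with the e1071 package's tune() function; fp_train and y_train are placeholder names for the feature-selected fingerprint matrix and the activity factor, and the Gamma/Cost grids below are just examples.)

    library(e1071)
    # 5-fold cross-validated grid search over Gamma and Cost for a radial-kernel SVM
    tuned <- tune(svm, train.x = fp_train, train.y = y_train,
                  kernel = "radial",
                  ranges = list(gamma = 10^(-3:-1), cost = c(1, 10, 100)),
                  tunecontrol = tune.control(cross = 5))
    summary(tuned)                 # cross-validated error for each Gamma/Cost pair
    best_svm <- tuned$best.model   # refit on all the training data with the winning parameters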

 

Here are the results. These are ROC scores averaged over the five training/test splits provided by Hansen et al. The standard error in each case is 0.01 or less.

Method      ROC Score
Bayesian    0.82
RP Tree     0.78
RP Forest   0.82
R SVM       0.72
kNN         0.84


These results largely speak for themselves, but I do want to point out one thing. Observe how good the kNN results are. If you read the paper, you'll see that Hansen et al. got significantly worse results with their kNN model. The main difference appears to be in the descriptors. They used DragonX descriptors with (apparently) a Euclidean distance, while I used ECFP_4 with a Tanimoto distance.

 

There's much more that could be done with this data set, but I need to get back to my day job, so I'll leave these as "exercises for the reader":

- Can we improve on these results with different descriptors -- either a different fingerprint or a combination of fingerprint and numeric properties?
- Can we improve the SVM results by including more FP features (at the price of more compute time) or by using folding instead of feature pre-selection?
- How do the other classification learners measure up, such as Learn R Logistic Regression Model, Learn R Linear Discriminant Analysis Model, Learn R Neural Net Model, and Learn R Mixture Discriminant Analysis Model?
- Given that we can use a regression model as a classifier, and given how well the kNN model did, what sort of results does a consensus model from Learn Molecular GFA Model give?

 

Finally, a question for readers: This posting is rather different from the typical one in which someone posts a problem and others respond to help solve it. This one is more like a mini-application note with a few hints and tips folded in. Do you find this type of posting useful or not?

  • wvanhoo, Oct 1, 2009 11:58 PM (in response to dhoneycutt)
    Dana,

    Thanks for this! To answer your last question first: yes, I find this post useful, and I would like to see more of them! It would be even more useful if you could attach the protocol you used to generate the table. I would be interested in playing with different fingerprints/learners to add some rows to your table; a common starting point would guarantee a like-for-like comparison.

    Cheers,

    Willem van Hoorn
    Pfizer, UK
  • nmalcolm (Accelrys), Oct 2, 2009 12:09 AM (in response to dhoneycutt)
    Hi Dana,

    Thanks for the post, great stuff.

    Here is a simple protocol I built to look at this data... really, the only useful bit is the automatic downloading of the data if it is not already available.

    Noj
  • b.sherborne, Oct 2, 2009 12:22 AM (in response to dhoneycutt)
    Dana

    I second Willem's post.
    This sort of post is a great opportunity to expose the depth of the science within Accelrys and the community on shared data (and protocols!)

    Any comment on the training:test (80:20) splits?
    Given the high performance of kNN, my instinct would be to compare performances on splits like 40:60 to see how the various methods degrade.

    Best regards


    Brad
  • gxf (Accelrys)
    Dana,
    This is quite a thorough analysis and quite useful. I have a personal interest in exploring the use of electronic descriptors to improve QSAR. In work on homogeneous polymerization catalysts, I obtained a predictive r^2 of around 0.3-0.4 with only structural descriptors like your fingerprints, but >0.8 when I added some QM results. These were from VAMP semiempirical calcs, which took only around 3-5 minutes per structure.

    Obviously, your molecules and your analyses are quite different from mine, but I'd like to ask whether you think electronic descriptors could help here. You obtained quite good ROC scores in this case, but perhaps we could try this for some systems that did less well.
  • leeherman
    Dana

    I support the previous posts about the usefulness of seeing this kind of analysis.

    A few comments:

    1. I think this work demonstrates that PP needs a native SVM learner, just as it now has for the Forest of Trees. I still think that having an easy interface with R is useful, but once a method becomes part of the standard repertoire, a native version is a must.

    2. WRT the parameters in SVM, whether you adapt the current R learner or create a native one, I believe that incorporating automatic estimation of Gamma and Cost via regularization (as in Hastie's svmpath) is a must. Why run experiments every time when the parameters can essentially be estimated on the fly?

    3. I'm not so concerned with small method-to-method variations. What I worry about is the generality of the method outside the training set. My experience is that no matter how well the current model does (even under cross validation), it's a crap shoot once the next compounds come along, unless they're extremely similar to what's already there. So cross-validation-by-(structure)-class becomes especially important. Cross validation by random subsetting is a cheat. I've seen this in my own attempts to build general Ames models.

    4. My favorite thing in your post is the analysis of numbers of descriptors chosen for Random Trees. It's very useful to challenge the conventional wisdom about defaults. A plot of AUC vs. number (or percentage) of descriptors would be nice.
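    (Something along these lines in R would generate that plot, with randomForest standing in for the PP learner; fp_train, fp_test, y_train, and y_test are placeholder names, with activities coded 0/1.)

        library(randomForest)
        aucfun <- function(y, s) {           # rank-based ROC AUC; y is 0/1, s is the score
          r <- rank(s); n1 <- sum(y == 1); n0 <- sum(y == 0)
          (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
        }
        # Scan the fraction of descriptors tried at each split and record the test-set AUC
        fracs <- c(0.025, 0.05, 0.1, 0.25, 0.5)
        aucs  <- sapply(fracs, function(f) {
          rf <- randomForest(fp_train, factor(y_train), ntree = 500,
                             mtry = max(1, floor(f * ncol(fp_train))))
          aucfun(y_test, predict(rf, fp_test, type = "prob")[, "1"])
        })
        plot(fracs, aucs, type = "b", log = "x",
             xlab = "Fraction of descriptors per split", ylab = "ROC AUC")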

    Lee
  • jmetz
    Dana,

    Many thanks for posting your study. Obviously this topic is and should be generating lots of interest due to its general importance.

    As a follow-up to Lee Herman's comment about cross validation via random subsetting: I completely agree. Random subsetting just creates two sets of "kissing cousins," providing little understanding of the true predictive power of the model.

    I suggest a slightly different approach, which deliberately attempts to make the model "fail"; it is an outgrowth of my research presented at the Accelrys 2008 US UGM (PPT attached).

    Take the entire data set and cluster the molecules. Put the odd-numbered clusters and their members into the training set and the even-numbered clusters and their members into the prediction set (not used to adjust the model!). In this way, one has deliberately created a prediction set that is "diverse" from the training set and for which accurate predictions are likely harder to make. If a model has a high ROC AUC for this "diverse" set, then the results are more impressive than for the case of "kissing cousins."
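    (In R, this split might be sketched roughly as follows; fp is assumed to be a 0/1 fingerprint matrix for the whole data set, and the clustering method and cut height are placeholders rather than the exact procedure from the UGM presentation.)

        # Cluster on a Tanimoto-style distance, then alternate clusters between training and prediction sets
        d  <- dist(fp, method = "binary")                # "binary" distance = 1 - Tanimoto for 0/1 data
        cl <- cutree(hclust(d, method = "average"), h = 0.6)
        train_idx <- which(cl %% 2 == 1)                 # odd-numbered clusters -> training set
        test_idx  <- which(cl %% 2 == 0)                 # even-numbered clusters -> prediction set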

    I would be very interested to learn of the performance of models constructed using physical properties which are as "orthogonal" (non-correlated) as possible.

    Again, there are many research possibilities here worth following up on! I too would very much like to see more informal discussions like this one on the forum!

    Regards,
    Jim Metz

    Abbott Laboratories
  • nmalcolm (Accelrys), Oct 3, 2009 11:42 AM (in response to dhoneycutt)
    I'd definitely agree with Jim's point on training/test set splits. I generally advocate his "alternate clusters" method. Using this sort of scheme means that the predicted r2 for the test set should give a much better indication of how the model will perform for true predictions.

    This is one of the split methods I've recently encoded in a suite of protocols for QSAR model building.
