Good Models Require Good Data

Posted by dhoneycutt on Oct 1, 2009 6:23:55 PM

In my last posting, I touted ROC analysis as one of the best ways to evaluate and compare different methods for building classification models. To do a true apples-to-apples comparison, it also helps to have a good reference data set. In this regard, Katja Hansen et al. have done data modelers a favor by publishing a "Benchmark Data Set for in Silico Prediction of Ames Mutagenicity." Not only did they vet and make available the data, but they also provide data splits for cross-validation to help modelers ensure that their method comparisons have a common basis.
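To make the "common basis" idea concrete, here is a minimal sketch of scoring a classifier's predictions by ROC score (area under the ROC curve) on a fixed set of cross-validation folds. It is not the Pipeline Pilot implementation, nor the procedure used by Hansen et al.; the function names, the `fold_ids` layout, and the toy numbers are my own assumptions for illustration.

```python
def roc_auc(labels, scores):
    """ROC score (AUC) as the fraction of positive/negative pairs
    in which the positive is ranked higher; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auc_per_fold(labels, scores, fold_ids):
    """Score each predefined fold separately, then average.

    fold_ids[i] is the (hypothetical) fold assignment of example i.
    Reusing the same assignments for every method is what keeps
    the comparison apples-to-apples.
    """
    aucs = []
    for fold in sorted(set(fold_ids)):
        idx = [i for i, f in enumerate(fold_ids) if f == fold]
        aucs.append(roc_auc([labels[i] for i in idx],
                            [scores[i] for i in idx]))
    return sum(aucs) / len(aucs), aucs


# Toy data (made up): 1 = mutagenic, 0 = non-mutagenic.
labels = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.6, 0.8, 0.1, 0.7, 0.3]
fold_ids = [0, 0, 0, 0, 1, 1, 1, 1]
mean_auc, fold_aucs = auc_per_fold(labels, scores, fold_ids)
```

The pairwise loop is quadratic in the number of examples, which is fine for a sketch; for a data set of realistic size a rank-based or library implementation would be the better choice.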


The authors compare several techniques, including the Bayesian classifier in Pipeline Pilot. Data junkie that I am, I couldn't resist throwing the Ames data at this and a few other Pipeline Pilot learners. Here are the results I got using the ECFP_4 molecular fingerprint as the descriptor:


Method       ROC Score
RP Tree      0.78
RP Forest    0.82
R SVM        0.72

These results show a few things. The best ROC score in the table (0.82, for RP Forest) is comparable to those reported by Hansen et al. for the various classifiers they investigated. (The best score they obtained was 0.86 for an SVM model.) The results are consistent with the common observation that forest models tend to give better predictive performance than single-tree models. Finally, they support the view that molecular fingerprints are good descriptors for building classification models.


If you want more of the statistical details, I provide them in a posting on the Pipeline Pilot Forum at the Accelrys Community site. (Registration is free.)
