Due to the response to my recent post about how the Hit Explorer Operating System (HEOS) collaborative program is assisting in the treatment of neglected diseases, I've invited Frederic Bost, director of information services at SCYNEXIS, to talk a little more about HEOS and the project. It is with great pleasure that I welcome Fred to our blog!
Thank you Frank, it's great to have this opportunity to talk to your readers. We couldn't think of a better case for the HEOS® cloud-based collaborative platform than what we've seen with the committed scientific community engaged in the Drugs for Neglected Diseases initiative (DNDi). The project is grand in scope and comprises scientists spread over five continents representing different cultures, disciplines, processes and companies. In this way, it's a macrocosmic example of what happens in industrial pharma research.
Collaboration requires all team members to interact equally and as needed, regardless of their physical location, disciplinary background or expertise. We've set out to develop a platform that invites all scientists involved in a project to contribute any information that might benefit the team, especially when those scientists don't have the opportunity to interact frequently face-to-face. HEOS ensures that scientists can share whatever they deem relevant, be it a data point, a comment on another's work, an annotation, a document, a link from the web or a Pipeline Pilot protocol. The science or the data should never be compromised by external factors. For that reason, we embrace the motto of the DNDi -- and extend it: The Best Science (and the best supporting software) for the Most Neglected.
What does true collaboration look like? Here's an example from the DNDi project: The non-profit organization started a research program against an endemic disease by collecting small compound sets from volunteer large pharmaceutical and biotech companies. Assays were run by an expert screening company in Europe. While several of the programs proved to be dead ends, one showed promise. The non-profit organization hired an integrated drug discovery contract research organization (CRO) to produce additional analogs using high-throughput screening. Using HEOS, the biotech that provided the initial compounds was able to continue to manage the project while the CRO confirmed the most promising hits and leads. The managing biotech was also able to track in vivo studies performed by a US university.
As the program moved along, several ADME, safety and pharmacokinetic teams got involved in the project. Several peer organizations were also consulted on certain decisions. All these efforts successfully delivered a compound ready for the clinic that is today showing great promise in treating a disease for which a new treatment hasn't been produced in decades.
Managing this type of program, whether in a non-profit setting or an industrial one, demands flexible, rich features that can accommodate the needs of each partner at each stage of research. The platform must capture data, keep it secure and consolidate it so that it is available in real time to authorized team members when they need it. Data must also be curated, validated and harmonized according to the rules the project team has established, and provided in a common language that enables scientists to compare results, whatever their origin. And because of the power of embedded Accelrys tools, HEOS can also provide the scientific analysis tools necessary to support the team in its decision process. All of these capabilities enable scientists to compare results and make decisions as a team.
It's been fascinating and rewarding to serve this community of passionate scientists fighting against endemic diseases. Together they have participated in an evolution, creating an agile networking environment that combines competencies and science from many places to achieve a common goal. HEOS has quite simply helped the DNDi's virtual teams function as if the world were much smaller than it really is.
A few months ago, I posted on model applicability domains (MAD) and why they are important, with a promise to say more in a future posting. Now that Pipeline Pilot 8.0 is out, the future is here, and I can say what I have been chomping at the bit to say for the past three months: we now have extensive MAD support for all model types in Pipeline Pilot, including R models.
At last week's Accelrys North American User Group Meeting, I presented research results on using some of these MAD measures to quantify the performance of both regression and binary classification models. Specifically, we can use measures of distance from test data to training data to compute error bars for regression models. For classification models, we can compute performance measures such as ROC scores, sensitivity, and specificity. The key point is that the model performance metrics calculated with the aid of the distance measures vary from test sample to test sample according to how close each sample is to the training data. The metrics are not just averages over the entire test set. (See the following picture.)
RMS error versus distance quartile for regression models of 4 different data sets (automobile, LogP, fathead minnow toxicity, hERG). Results averaged over 1000 training/test resamplings. Error bars show standard deviation.
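The quartile analysis behind a plot like this can be sketched in a few lines. This is a minimal NumPy illustration (the function name and data are mine, not Pipeline Pilot's), assuming you already have a distance-to-training-set measure for each test sample along with its prediction error:

```python
import numpy as np

def rms_error_by_distance_quartile(distances, errors):
    """Bin test-sample prediction errors by quartile of their distance
    to the training set, then report the RMS error within each bin."""
    distances = np.asarray(distances, dtype=float)
    errors = np.asarray(errors, dtype=float)
    # Quartile boundaries of the distance distribution
    q = np.percentile(distances, [25, 50, 75])
    bins = np.digitize(distances, q)  # 0..3 = quartile index
    return [np.sqrt(np.mean(errors[bins == i] ** 2)) for i in range(4)]
```

For a well-behaved model one would typically see the RMS error rise from the first quartile (test samples nearest the training data) to the fourth.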
The basic idea and qualitative conclusions of this research are not new and have been reported by other researchers (see my previous posting for links). But some of the details are indeed new.
For binary classification models, one such interesting detail is that not only can the distance measures indicate the expected model performance for particular samples, but it may be possible in some cases to use these measures to improve the predictive performance of the model. The way we do this is by varying—according to the distance from a sample to the training data—the score cutoff that we apply to the model prediction to distinguish between the two classes. For some models, this gives better combined sensitivity and specificity than we get from applying a single cutoff value to all test samples. (The improvement was seen for one balanced data set but not for two imbalanced ones, so more work needs to be done to see whether the results for the balanced data were a fluke.)
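The distance-dependent cutoff idea can be sketched as follows. This is a hypothetical illustration (names and the near/far median split are mine; the actual protocol details are in the forum posting), where the cutoff for each distance bin is chosen on a validation set to maximize sensitivity plus specificity (Youden's J):

```python
import numpy as np

def best_cutoff(scores, labels):
    """Cutoff maximizing sensitivity + specificity (Youden's J) on a
    validation set; labels: 1 = active, 0 = inactive."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    best_c, best_j = scores.min(), -1.0
    for c in np.unique(scores):
        pred = scores >= c
        sens = pred[labels == 1].mean()   # fraction of actives caught
        spec = (~pred)[labels == 0].mean()  # fraction of inactives rejected
        if sens + spec - 1.0 > best_j:
            best_j, best_c = sens + spec - 1.0, c
    return best_c

def cutoffs_by_distance(scores, labels, distances):
    """One cutoff for samples near the training data, another for far
    ones, split at the median distance to the training set."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    near = np.asarray(distances) <= np.median(distances)
    return (best_cutoff(scores[near], labels[near]),
            best_cutoff(scores[~near], labels[~near]))
```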
For more details on this research, and on MAD support capabilities in Pipeline Pilot 8.0 and how to use them, please see my posting in the Pipeline Pilot forum.
As I wrote last week, “Chemistry for a Sustainable World” was the theme of the ACS spring meeting. In this blog, I report on the research of three other speakers from the Monday morning CINF session. All of these authors are trying to find ways to discover optimal materials from the huge selection of possibilities.
Heard of genomics? Proteomics? These methods generate a lot of data in the analysis of genes or proteins. Information management tools are needed to handle the large amounts of data. The goal, ultimately, is to be able to design new genes or proteins on the basis of the available information. The number of materials to search (the 'design space') is truly enormous. Searching for optimal materials efficiently requires tools like those discussed by three of my fellow speakers in the session.
Prof. Krishna Rajan (Iowa State), discussed his "omics" approach to materials. In his presentation he discussed "a new alternative strategy, based on statistical learning. It systematically integrates diverse attributes of chemical and electronic structure descriptors of atoms with descriptors ... to capture complexity in crystal geometry and bonding. ... we have been able to discover ... the chemical design rules governing the stability of these compounds..." To me, this is one of the key objectives for computational materials science: the development of these design rules. Design rules, empirical evidence, atomic-level insight - call it what you will, this sort of approach is necessary to make custom-designed materials feasible.
Prof. Geoffrey Hutchison (U. Pittsburgh) really did talk about "Finding a needle through the haystack." He discussed the "reverse design problem." Sure, we can predict the properties of any material that we can think up. But what we really want to know is what material will give us these properties. His group uses a combination of database searching, computation, and genetic algorithm optimization to search the haystack. It's a very efficient way to search these huge design spaces.
Dr. Berend Rinderspacher (US Army Research Lab) also discussed the reverse design problem. He pointed out that there are around 10^200 compounds of a size typical for, e.g., electro-optical chromophores. He unveiled a general optimization algorithm based on an interpolation of property values, which has the additional advantage of handling multiple constraints, and showed applications to optimizing electro-optic chromophores and organometallic clusters.
Terrific work by all the speakers in this session, who are using all methods at their disposal - whether based on informatics or atomistic modeling - to come up with better ways of looking for better materials. Next blog: a summary of my own contribution to the session.
The Bayesian learner in Pipeline Pilot is a so-called naïve Bayesian classifier. The "naïve" refers to the assumption that any particular feature contributes a specific amount to the likelihood of a sample being assigned to a given class, irrespective of the presence of any other features. For example, the presence of an NH2 group in a compound has the same effect on predicted activity whether or not there is also an OH or COOH group elsewhere in the compound. In other words, a naïve Bayesian classifier ignores interaction effects.
We know that in reality, interaction effects are quite common. Yet, empirically, naïve Bayesian classification models are surprisingly accurate (not to mention that they are lightning-fast to train).
But perhaps there are cases where a model with interactions would be better. How might we make the Bayesian learner less naïve? If we use molecular fingerprints as descriptors, one simple approach is to create a new fingerprint by pairing off the original fingerprint features and adding them to the list. We can then train the model on the new fingerprint with its expanded feature list.
A sparse molecular fingerprint (such as the Accelrys extended-connectivity fingerprints) consists of a list of feature IDs. These IDs are simply integers corresponding to certain substructural units. E.g., "16" might refer to an aliphatic carbon, while "7137126" might refer to an aryl amino group. So if our original fingerprint has the following features: 16 85 784 12662 ...
our fingerprint-with-interactions would have the above features with the following ones in addition: 16$85 16$784 16$12662 85$784 85$12662 ...
The "$" is just an arbitrary separator between the feature IDs. A Bayesian learner works by simply counting the features present in the two classes of samples (e.g., "active" vs. "inactive"), so the feature labels are unimportant, as long as they are unique.
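In Python terms, the pairing scheme above can be sketched like this (a minimal illustration; the function name is mine, and the "$"-joined labels mirror the convention just described):

```python
from itertools import combinations

def add_interaction_features(feature_ids):
    """Augment a sparse fingerprint (a list of integer feature IDs) with
    all pairwise interaction features, joined by a '$' separator."""
    base = sorted(set(feature_ids))
    pairs = ["%d$%d" % (a, b) for a, b in combinations(base, 2)]
    # Original features plus the new pairwise ones, all as unique strings
    return [str(f) for f in base] + pairs
```

Note that the expanded feature list grows quadratically with fingerprint size, so training is slower than with the plain naïve model.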
One approach to building a predictive model is to choose a powerful technique such as a neural network (NN) or support vector machine (SVM) algorithm and then tune the model-building parameters to maximize the predictive performance. Over the past 15 years or so, an increasingly popular alternative is to combine the predictions of multiple different techniques into a consensus or ensemble model, without necessarily optimizing each individual model within the ensemble. This is the approach that won the million dollar Netflix Prize last year, as well as the zero dollar challenge from the November 2009 Pipeline Pilot newsletter. I'll be talking about the latter; for details on the Netflix Prize solution, go here.
In brief, the Pipeline Pilot Challenge was to find the model-building technique that gives the best ROC score for a particular classification problem. When we formulated the problem, we figured people would apply the various different learner components in Pipeline Pilot, and probably come up with a solution involving an SVM, Bayesian, or recursive partitioning (RP) model.
But winner Lee Herman took a clever alternative approach. He built four different models using four dissimilar techniques: Bayesian, RP (a multi-tree forest), mixture discriminant analysis, and SVM. For making predictions on the test set, he summed the predictions from each of the models to get a composite score. This ensemble model gave a better ROC score than any of the individual models contributing to it. For details, see Lee's protocol on the Pipeline Pilot forum (registration is free).
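The summing step can be sketched as follows. This is a hypothetical illustration, not Lee's actual protocol (see the forum for that); in particular, the min-max rescaling of each model's scores before summation is my assumption, added so that no single model's score range dominates the composite:

```python
import numpy as np

def ensemble_score(model_scores):
    """Combine per-model prediction scores for the same test samples into
    a single composite score by summation. Each model's scores are first
    rescaled to [0, 1]."""
    total = np.zeros_like(np.asarray(model_scores[0], dtype=float))
    for scores in model_scores:
        s = np.asarray(scores, dtype=float)
        rng = s.max() - s.min()
        total += (s - s.min()) / rng if rng > 0 else s * 0.0
    return total
```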
Why does this work? In essence, each type of model captures some aspect of the relationship between the descriptors and what we wish to predict, while having its own distinct errors and biases. To the extent that the errors are uncorrelated between models, they cancel rather than reinforce each other. Thus the accuracy of the whole becomes greater than the greatest accuracy of any of its parts. It's as if many wrongs can make a right.
But there's more to it than this. Even a good model makes poor predictions for samples that are too different from the samples in the training data used to build the model. In other words, the training data define the model's applicability domain.
For example, suppose we wish to model per-capita crawfish consumption as a function of several variables, including the distance from the Mississippi River. Suppose also that our training and test sets consist solely of Louisiana residents. Even if we find that the model has good predictive ability for the test set, we would not expect it to do a good job predicting crawfish consumption in, say, Oregon (though it might do an OK job for parts of Mississippi). In other words, locations in Oregon lie outside the MAD. (See map.)
This idea appears obvious, yet models in statistical software packages often lack the ability to automatically define their own MAD and flag predictions outside the MAD as questionable. (In linear regression models, confidence and prediction bands serve this role to some extent. The bands become wider as we move away from the center of the training data.) The onus is generally on the user of the model to ensure that it is applied correctly. When the person applying the model is the same one who built it, and is thus familiar with the training data and the model's limitations, this is not too big a problem. But when the creator and user of a model are two different people separated in space or time, a model's awareness of its own applicability domain can be critical to the proper use of the model.
In the life sciences, it appears that the need to take the MAD into account when making predictions was first recognized for QSAR models of toxicity such as TOPKAT. TOPKAT introduced the notion of the optimum prediction space (OPS) defined by the ranges of the training set descriptors in principal component space. But the OPS is just one of several MAD measures discussed in the literature (e.g., see here, here, here, and here).
To summarize some of my own recent work in this area: In various numerical experiments, I have reproduced the research results of others who found that the distance from a test sample to samples in the training set correlates well with the model prediction error. ("Distance" can be defined in several different ways, and a lengthy essay could be written on this subject alone. But I'll spare you for now.) This gives us the potential to estimate MAD-dependent error bars even for learning methods that do not intrinsically support them. A few of the model-building (learner) components in Pipeline Pilot now support OPS and other MAD measures, and we're working on adding more of these.
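One of the simplest such distance measures is the mean distance from a test sample to its k nearest training samples. A minimal NumPy sketch (an illustration only, not the Pipeline Pilot implementation; names are mine):

```python
import numpy as np

def knn_distance_to_training(X_train, X_test, k=3):
    """For each test sample, the mean Euclidean distance to its k nearest
    training samples -- one simple distance-based MAD measure. Larger
    values suggest the sample lies farther from the applicability domain."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    out = []
    for x in X_test:
        d = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        out.append(np.sort(d)[:k].mean())
    return np.array(out)
```

In practice one would calibrate this measure on held-out data (as in the quartile analysis above) before using it to flag questionable predictions.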
I hope I have convinced you of the importance of paying attention to the applicability domain when making predictions with a model. I'll have more to say on this in a future posting.
Business Intelligence has been around for nearly 30 years. So, that means that businesses have pretty much mastered all of their data mining and management issues. Right? Well, then why do R&D enterprises still struggle with integrating and fully leveraging their scientific data for the knowledge it contains? Accelrys’ VP of Marketing, Bill Stevens, was recently interviewed by Mary Jo Nott, Executive Editor of the BeyeNetwork, on their Executive Spotlight Program. Bill, a BI industry veteran, exposes the unique characteristics of scientific data, and explains why it has eluded the BI umbrella of solutions for so long.
Part of my job is creating and maintaining learner components for building statistical models in Accelrys's Pipeline Pilot product. A statistical model is an empirically derived equation or set of rules for predicting some unknown property (say the toxicity of a chemical compound) from a set of known properties (say descriptors derived from the compound's structure).
A statistical model--as contrasted to a mechanistic model--is built from a specific set of data, called the training set, using a specific learning algorithm (such as linear least-squares, recursive partitioning, etc.). The quality of the model is crucially dependent on the quality of the training data.
Pipeline Pilot makes it really easy to build statistical models from your data. All it takes is dropping in a data reader component, choosing an appropriate learner component, and specifying the variables you wish to use. Because of this ease, you may be tempted to build models from a data set before taking a look at the data.
Don't do it!
Here's why: more often than you might think, data sets are dirty. Some values are missing or invalid. What you thought was a scalar property appears as an array in the data. A few extreme outliers are present which (depending on the learner) may seriously skew the results. Extra commas in your CSV file have shifted some values to the wrong columns. You're trying to build a classification model, but all data records have been assigned the same class. You thought that your data set contained only small organic molecules, but somehow a few organometallics got in there. Unbeknownst to you, the creator of the data set used 99 as a missing value tag. And so on.
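A few of these pitfalls can be caught programmatically before you train anything. Here is a minimal sketch with pandas (the function name and the sentinel values to watch for are hypothetical; adapt them to your own data):

```python
import numpy as np
import pandas as pd

SENTINELS = {99, -999}  # hypothetical missing-value tags to watch for

def quick_data_checks(df, class_column=None):
    """A few cheap sanity checks to run before training a model.
    Returns a dict of issues; an empty dict means nothing obvious."""
    issues = {}
    missing = df.isna().sum()
    if missing.any():
        issues["missing"] = missing[missing > 0].to_dict()
    for col in df.select_dtypes(include=[np.number]):
        vals = df[col].dropna()
        hit = SENTINELS & set(vals.unique())
        if hit:
            issues.setdefault("sentinels", {})[col] = sorted(hit)
        # Flag extreme outliers: > 5 standard deviations from the mean
        if vals.std() > 0 and ((vals - vals.mean()).abs() > 5 * vals.std()).any():
            issues.setdefault("outliers", []).append(col)
    if class_column is not None and df[class_column].nunique() < 2:
        issues["single_class"] = class_column
    return issues
```

Checks like these are no substitute for actually looking at the data, but they catch the most common surprises cheaply.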
Pairs Plot of a Contaminated Data Set
I am sometimes called upon to diagnose problems that customers or colleagues have when trying to build a model. Often the root of the problem is that something is wrong with the input data. In many such cases, just looking at the data in a table makes the problem obvious. Other times, simple analysis (such as univariate analysis) or plots (such as pairs plots) show what's wrong.
The more worrisome cases are the ones we may never hear about. Not all problems with a training data set will make a learner fail or produce obviously incorrect results. So even if you have gone ahead and successfully built a model before looking at the data, you should still look at the data afterward.
Whether you build models in Pipeline Pilot, R, Weka, or some other program, remember to Look before you Learn.