A colleague stumbled across a conversation over at In the Pipeline earlier this month that illustrates the fluidity of dialogue in blogs (something that we aspire to with this blog). The post muses on how the “publish or perish” paradigm in academia can lead to disastrous competitiveness among grad students and post docs. The comments, however, quickly diverge into a thoughtful discussion about how academics preserve critical research and IP and whether ELNs offer a viable solution.
The conversation starts when one commenter calls notebooks “the bane of organic chemistry,” continues as various commenters chime in to offer their own experiences with commercial ELNs and home-grown solutions, and concludes with the somewhat snarky suggestion that “if only the Scripps group had deployed an enterprise-grade ELN… they probably could have avoided all this unpleasantness.”
I’ll have more to say on this subject in a future post. For now, I encourage you to check out the discussion and would love to hear your thoughts on it.
And in the interest of full disclosure, the post in question concerns work performed in the lab of Peter Schultz at Scripps in 2004. Schultz, of course, was one of the co-founders of Symyx in 1994. We are sorry that this bizarre episode has brought such unwarranted attention to his fine research team.
One approach to building a predictive model is to choose a powerful technique such as a neural network (NN) or support vector machine (SVM) algorithm and then tune the model-building parameters to maximize the predictive performance. Over the past 15 years or so, an increasingly popular alternative is to combine the predictions of multiple different techniques into a consensus or ensemble model, without necessarily optimizing each individual model within the ensemble. This is the approach that won the million-dollar Netflix Prize last year, as well as the zero-dollar challenge from the November 2009 Pipeline Pilot newsletter. I'll be talking about the latter; for details on the Netflix Prize solution, go here.
In brief, the Pipeline Pilot Challenge was to find the model-building technique that gives the best ROC score for a particular classification problem. When we formulated the problem, we figured people would apply the various different learner components in Pipeline Pilot, and probably come up with a solution involving an SVM, Bayesian, or recursive partitioning (RP) model.
But winner Lee Herman took a clever alternative approach. He built four different models using four dissimilar techniques: Bayesian, RP (a multi-tree forest), mixture discriminant analysis, and SVM. For making predictions on the test set, he summed the predictions from each of the models to get a composite score. This ensemble model gave a better ROC score than any of the individual models contributing to it. For details, see Lee's protocol on the Pipeline Pilot forum (registration is free).
Why does this work? In essence, each type of model captures some aspect of the relationship between the descriptors and what we wish to predict, while having its own distinct errors and biases. To the extent that the errors are uncorrelated between models, they cancel rather than reinforce each other. Thus the accuracy of the whole becomes greater than the greatest accuracy of any of its parts. It's as if many wrongs can make a right.
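Lee's actual protocol ran inside Pipeline Pilot, but the sum-of-scores idea is easy to sketch in a few lines of Python with scikit-learn. The dataset is synthetic and the learners (naive Bayes, a random forest, an SVM) are stand-ins for the techniques Lee used, so treat this as an illustration of the principle rather than a reproduction of his winning entry:

```python
# Sketch of a sum-of-scores ensemble (not Lee's actual Pipeline Pilot
# protocol): train several dissimilar classifiers, sum their class
# probabilities on the test set, and compare the ROC AUC of the
# ensemble to that of each individual model.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = [
    GaussianNB(),                            # Bayesian
    RandomForestClassifier(random_state=0),  # multi-tree forest
    SVC(probability=True, random_state=0),   # SVM
]

ensemble_score = 0.0
for model in models:
    model.fit(X_train, y_train)
    p = model.predict_proba(X_test)[:, 1]    # P(class = 1)
    print(type(model).__name__, roc_auc_score(y_test, p))
    ensemble_score = ensemble_score + p      # sum of scores

print("Ensemble", roc_auc_score(y_test, ensemble_score))
```

Because each model's errors are only partly correlated with the others', the summed score typically matches or beats the best single model, which is exactly the effect described above.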
A roundtable discussion took place near the close of this year’s HCA meeting in San Francisco. The topics of Data Analysis and Management, Image Analysis, and Computational Biology were folded into a single discussion. This roundtable was facilitated by Karel Kozak. Participants included:
Karel Kozak (Swiss Fed. Institute Of Technology)
Lisa Smith (Merck)
Peter Horvath (Swiss Fed. Institute Of Technology)
Achim Kirsch (PE/Evotec)
Ghislain Bonamy (Novartis GNF)
Abhay Kini (GE Healthcare)
Jonathan Sexton (North Carolina Central University)
Mark Bray (Broad Institute)
Chris Wood (Stowers Institute for Medical Research)
Pierre Turpin (Molecular Devices)
Mark Collins (ThermoFisher/Cellomics)
The opening shot from Schmerck (Lisa Smith, from Schering, now Merck) was fired at the vendors. The bullet in question? “Why were tools for pattern recognition and machine learning on image data not addressed more rapidly in vendor systems?” Vendors replied with their own question: “Why is this a better approach than algorithmic quantification of a known endpoint?” The upshot of the ensuing discussion was that end-users want the ability to extract additional information from their data beyond what the designed analysis algorithm delivers, i.e., look for natural classes in the data, spot outliers, correlate to the chemical structure of test compounds, etc. This does not necessarily have to be correlated to known biological endpoints; it can be purely exploratory. Vendors said, “that’s why we need companies like Accelrys and products like Pipeline Pilot.” The marketplace needs a third-party environment that provides turnkey or near-turnkey access to the data, plus an exploratory environment like PLP in which users can develop methods to ask “what-if” questions of their data. When users clearly demonstrate that these techniques have merit, they will find their way into the instrument vendors’ products.
One other aspect of the discussion that became apparent is that many, if not most, HCS users have no idea what the difference is between PCA, classification, support vector machines, genetic algorithms, self-organizing maps, etc., let alone where or when to apply these methods. What they want, and need, is a kind of wizard that walks them through determining what they want to learn from their data and then selects, internally, the best method to do it. An analogy was drawn to curve-fitting programs that apply hundreds or thousands of models to a data set and tell the user which ones produced the best fit. This idea of “opening up to the wider science community methods previously available only to discipline experts,” specifically in computational biology, is by no means in its infancy (see The Future of Computational Science, Scientific Computing World, May/June 2004).
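At its core, the curve-fitting analogy amounts to a model-selection loop: cross-validate a set of candidate learners and report which one fits best. Here is a toy Python/scikit-learn sketch of that idea (the candidate list and scoring metric are my own choices, not any vendor's wizard):

```python
# Toy illustration of the "wizard" idea: try several candidate learners
# via cross-validation and report which one fits the data best.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

candidates = {
    "Naive Bayes": GaussianNB(),
    "Random forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(random_state=0),
}

# Score each candidate with 5-fold cross-validated ROC AUC.
scores = {name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
          for name, model in candidates.items()}

best = max(scores, key=scores.get)
print(f"Best method: {best} (AUC = {scores[best]:.3f})")
```

A real wizard would also interrogate the user about the question being asked (classification vs. clustering vs. outlier detection) before choosing which candidates to run, but the select-by-validated-score loop is the engine underneath.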
Remember A Company That Cannot Be Named that conducted the study described here? It implemented a suite of different systems that tightened the timeline between discovering a compound and collecting data to inform next-step decisions. But where does the credit go? Is the transformation due to the specific systems companies purchase, or the way companies choose to implement the systems they’ve chosen?
We think it’s the latter. I mean, honestly, pharma now has more tools than ever to choose from to expedite just about any task an R&D scientist needs to do. What’s challenging is integrating all these information sources so that they benefit the entire organization rather than one group of scientists. In fact, the lack of a well-thought-out strategy for implementing informatics can actually inhibit R&D productivity by creating silos that make it hard to find information or create comprehensive reports to inform decisions.
Increasingly, we are seeing companies provide a self-service information buffet to scientists. The whole idea is to provide an environment that’s not limited by what’s on the menu or the speed of service—an environment that puts scientists in control rather than expecting them to always consume the same mass-produced, out-of-the-box burger.
How might such an information buffet look? Here’s one enticing smorgasbord. The screen below shows a dashboard for viewing stability study results for different formulations. Scientists select data of interest from multiple experiments, and Isentris assembles various charts on the fly. Bar charts indicate the relative composition of the formulations. The line charts illustrate the stability of micelles as the surfactant in each formulation destabilizes.
We’d love to hear which analogy best fits the way you are providing information to researchers. Do your scientists prefer to grab information from buffets, fast-food joints, or sit-down restaurants—and how do these approaches affect how you purchase and implement informatics? And to get a taste of an Isentris informatics buffet, check out this 10-minute video demonstration.
Everywhere you go these days, you (often literally) run into people with their heads/brains buried in mobile devices. But will these "minicomputers," as a recent article in Pharmaceutical Manufacturing called them, find utility in the lab beyond phoning colleagues or catching up on email?
James Jack, who developed Symyx's ChemMobi app, thinks so. "The iPhone is... more a computer with the ability to make phone calls than a phone," Jack said recently. "It seemed perfectly natural to try and put some meaningful chemistry on there." [Editor's note: In July 2010 Symyx merged with Accelrys, Inc.]
ChemMobi lets scientists use an iPhone or iPod touch to access over 30 million chemical structures and related properties, supplier information, and MSDS summaries. So far, at least, scientists seem interested. Since July, the app has been downloaded over 2,000 times by users in 19 countries. And Jack was lauded by the Chemspider blog last summer for the creativity and progressive thinking that went into this app.
But what do you think? Is your mobile device a tool for working or an escape from the daily grind? And, depending on your answer, what types of scientific apps would you use if they were available?
Happily, the trend continued through 2009 for a total of 4621 DFT references in ACS Journals. Here are a few of my favorite publications, though not all are drawn from the ACS citations. Yes, of course, these use Accelrys DFT packages, but they are still pretty cool articles:
Let me and my readers know what you think are the most interesting DFT articles from 2009.
†Strictly speaking, this was not QSAR (Quantitative Structure-Activity Relationship), because they didn't actually base predictions on the structure. I use the term here more generally to refer to relationships that predict complex properties, like catalytic activity, on the basis of simpler properties, like work function.
During 2009, we touted Symyx Notebook's strength at connecting scientists in different disciplines across the enterprise. Three webinars presented at the end of the year give you a chance to verify these claims yourself. Check out recordings highlighting Symyx Notebook's utility at transforming enterprise R&D, as well as specific workflows for analytical chemistry and biology.
A list of other recorded webinars can be found here covering such topics as collaborating in Isentris, getting started with DiscoveryGate, and intelligent structure editing in Symyx Draw. Holler, too, if there are other learning sessions you’d like to see offered in the coming months, and I'll pass on your thoughts to our webinar team.
Today's guest author, Pierre Allemand (VP of Life Sciences in Europe), is not the type to keep his opinions to himself. In this column, he argues that enterprise ELNs require a different evaluation process that forces vendors to earn your business. What do you think? If you’ve bought an ELN, would you do things differently—or ask for different things from vendors?
As much as vendors and customers love the idea of an out-of-the-box ELN (says Pierre), there is no such thing--not if you’re talking about an enterprise ELN that meets the specific needs of scientists in different disciplines while enabling data to be accessible by everyone who needs it. A vendor may hand you a box, but in reality, that box is like Tiffany meets IKEA—beautifully engineered parts that still require some assembly.
Research is still the first step in the evaluation process, particularly given that there’s no shortage of ELNs to choose from (obviously, this blog is a great place to learn about Symyx’s ELN). But what next? The standard approach is to send out an RFP, collect vendor responses, view a bunch of canned demos, and select a solution, which only then is customized for you. Everyone has heard stories of buying something based on the demo, only to discover after installation that the product doesn’t work as “advertised.”
Because ELNs (particularly a true enterprise solution) are tied so tightly to your business, I recommend that you spend more time up front developing a relationship with vendors. Ask the vendors to survey your operation. We’ve done everything from two-day workshops to full-blown department or enterprise audits. Yes, we’ve gotten flak for it. Think about it, though. One of the key problems in R&D informatics is that companies tend to let technology drive workflows, rather than using technology to support better or more efficient ways of working. Through an audit, we can get a more complete picture of what you are trying to accomplish. And for the same time and money you’d invest gathering specs for an RFP, you get greater insight into how different vendors operate.
You’re left at the end of this stage with a bunch of reports from various vendors on how to approach your informatics situation. This may not seem much different from what you’d get after an RFP, but the difference is in what’s on the paper. You see how each vendor has defined your problem and their different approaches. You’re in control. The pros and cons are easier to see, you can talk with vendors and ask for revisions, and, ultimately, you define your solution based on your needs rather than adjusting your needs to vendor conceptions.
Following these conversations, you would select maybe your two top vendors and ask them each to run a workshop to illustrate how they’d implement their approach. This workshop is more than a demo—it covers not just what the solution looks like and how it works, but how it is implemented. You know what you’re getting and what is expected of you (and the vendor) to make it work.
From here, you’d select a vendor. But the work isn’t over. We highly recommend a pilot or agile implementation that introduces the system to a subset of early adopters. These scientists can not only put the system through its paces but also serve as champions when you finally roll the system out more widely.
But there's more to it than this. Even a good model makes poor predictions for samples that are too different from the samples in the training data used to build the model. In other words, the training data define the model's applicability domain (MAD).
For example, suppose we wish to model per-capita crawfish consumption as a function of several variables, including the distance from the Mississippi River. Suppose also that our training and test sets consist solely of Louisiana residents. Even if we find that the model has good predictive ability for the test set, we would not expect it to do a good job predicting crawfish consumption in, say, Oregon (though it might do an OK job for parts of Mississippi). In other words, locations in Oregon lie outside the MAD. (See map.)
This idea appears obvious, yet models in statistical software packages often lack the ability to automatically define their own MAD and flag predictions outside the MAD as questionable. (In linear regression models, confidence and prediction bands serve this role to some extent. The bands become wider as we move away from the center of the training data.) The onus is generally on the user of the model to ensure that it is applied correctly. When the person applying the model is the same one who built it, and is thus familiar with the training data and the model's limitations, this is not too big a problem. But when the creator and user of a model are two different people separated in space or time, a model's awareness of its own applicability domain can be critical to the proper use of the model.
In the life sciences, it appears that the need to take the MAD into account when making predictions was first recognized for QSAR models of toxicity such as TOPKAT. TOPKAT introduced the notion of the optimum prediction space (OPS) defined by the ranges of the training set descriptors in principal component space. But the OPS is just one of several MAD measures discussed in the literature (e.g., see here, here, here, and here).
To summarize some of my own recent work in this area: In various numerical experiments, I have reproduced the research results of others who found that the distance from a test sample to samples in the training set correlates well with the model prediction error. ("Distance" can be defined in several different ways, and a lengthy essay could be written on this subject alone. But I'll spare you for now.) This gives us the potential to estimate MAD-dependent error bars even for learning methods that do not intrinsically support them. A few of the model-building (learner) components in Pipeline Pilot now support OPS and other MAD measures, and we're working on adding more of these.
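One simple distance-based MAD check can be sketched in a few lines of Python. This is an illustration of the general nearest-neighbor-distance idea, not TOPKAT's OPS or any particular Pipeline Pilot component, and the neighbor count and percentile threshold are arbitrary choices:

```python
# Sketch of a distance-to-training-set applicability-domain check.
# A prediction is flagged when the test sample's mean distance to its
# nearest training neighbors exceeds what is typical within the
# training set itself.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))                   # training descriptors
X_test = np.vstack([rng.normal(size=(5, 5)),          # in-domain samples
                    rng.normal(size=(5, 5)) + 10.0])  # far outside the domain

nn = NearestNeighbors(n_neighbors=5).fit(X_train)

# Typical within-training distance: each training point's mean distance
# to its neighbors (skip the first neighbor: the point itself, at 0).
d_train, _ = nn.kneighbors(X_train)
threshold = np.percentile(d_train[:, 1:].mean(axis=1), 95)

d_test, _ = nn.kneighbors(X_test)
outside = d_test.mean(axis=1) > threshold             # True = outside the MAD
print(outside)
```

The shifted test samples land far from every training point and get flagged, which is exactly the "Oregon" situation in the crawfish example: the model may still emit a number there, but the flag tells the user not to trust it.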
I hope I have convinced you of the importance of paying attention to the applicability domain when making predictions with a model. I'll have more to say on this in a future posting.
Join us at the annual CHI PepTalk 2010 Conference being held at the Hotel Del Coronado in San Diego, January 11-15, 2010. At booth #32, Accelrys Protein Engineering and Antibody Modeling experts will showcase new features and enhancements found in Accelrys Discovery Studio 2.5.5 that both enable and improve the modeling of antibody structure and function. The advanced computational technology supported by the Discovery Studio environment allows scientists to explore the antibody landscape in silico prior to costly experimental implementation, thus greatly reducing the time and expense involved in bringing such products to market, while increasing scientific productivity.
On Monday, January 11th at 3:35 pm, Dr. Shikha Varma-O'Brien of Accelrys will present "Modeling the 3-Dimensional Structures of Antibody and their Interaction Interface to Antigen." She will discuss and demonstrate how Accelrys Discovery Studio not only contains the tools necessary to construct modeling frameworks for antibodies, but also enables structure-based prediction of antibody physical properties, with the goal of uncovering novel antibody designs.
A webinar on Thursday, January 14, will highlight the new functionality in the all-new DiscoveryGate. Titled “Keeping IT Simple,” the webinar focuses on DiscoveryGate’s new search, filter, and reporting options to speed synthesis design and planning.
Power users and research IT administrators can learn about transitioning from legacy ISIS installations to Isentris in a webinar on Tuesday, January 26. The webinar explores not just benefits of transitioning, but showcases two new packages Symyx has developed to ease the transition.
To register for either webinar or view recordings of past webinars, visit the Symyx events page.
One thing research organizations always need is information on how to justify their investment in informatics. These pretty pictures compiled by a VP of chemistry at A Company That Cannot Be Named powerfully illustrate the impact of informatics.
The company used program event-based analysis to map all the data collected for individual compounds over time. The red mountain creeping up the chart marks when a compound was first registered. The colored dots to the right of the red line track the discovery of property information and other details about the initial compound over time. Viewed this way, the problem is obvious—scientists don’t have all the information they need to make next-step decisions. That highlighted result, for example, was obtained too late to inform work on all those circled compounds.
This organization implemented a suite of laboratory operational systems, including an ELN, a chemical registration system, a compound management system, and assay management, analysis, and reporting applications. They had qualitative aims like improving quality, timeliness, workflow integration, and productivity. And here’s what the mountain looks like now.
Two things to note about this map. First, the time between discovering a compound and capturing additional detail is much tighter (that red line is almost purple now due to all the data crowded against it). More information is available sooner to inform next-step decisions. Second, the slope of this peak is steeper, indicating that compounds are being developed or selected more rapidly and showcasing the impact that this company’s investment in informatics has had on scientific decision making.
There is increasing pressure to deliver lighter, more efficient and less expensive materials more frequently and faster than ever before. Fortunately, the integration of Materials Studio applications such as CASTEP and the Pipeline Pilot platform opens a range of possibilities for the discovery of new materials.
The experts at Accelrys have developed a new framework that screens complex systems and properties across numerous materials and applications. This system is currently being applied to fuel cell catalysts to find alternatives to costly materials such as platinum. Dr. Jacob Gavartin and Dr. Gerhard Goldbeck-Wood will discuss this approach and its application in detail during next week’s webinar: