
Accelrys Blog

January 2010

When Comments Add Commentary

Posted by dcurran Jan 29, 2010
A colleague stumbled across a conversation over at In the Pipeline earlier this month that illustrates the fluidity of dialogue in blogs (something that we aspire to with this blog). The post muses on how the “publish or perish” paradigm in academia can lead to disastrous competitiveness among grad students and post docs. The comments, however, quickly diverge into a thoughtful discussion about how academics preserve critical research and IP and whether ELNs offer a viable solution.

 The conversation starts when one commenter calls notebooks “the bane of organic chemistry,” continues as various commenters chime in to offer their own experiences with commercial ELNs and home-grown solutions, and concludes with the somewhat snarky suggestion that “if only the Scripps group had deployed an enterprise-grade ELN… they probably could have avoided all this unpleasantness.”

I’ll have more to say on this subject in a future post. For now, I encourage you to check out the discussion and would love to hear your thoughts on it.

And in the interest of full disclosure, the post in question concerns work performed in the lab of Peter Schultz at Scripps in 2004. Schultz, of course, was one of the co-founders of Symyx in 1994. We are sorry that this bizarre episode has brought such unwarranted attention to his fine research team.
Categories: Electronic Lab Notebook Tags: eln, academic-labs

Smart Phone Apps Revisited

Posted by Dominic.John Jan 25, 2010

Less than a week after our recent query about whether smart-phone-based apps are truly useful in the lab, Allison Proffitt over at Bio-ITWorld has compiled a list of available life-science apps for the iPhone. Most, like Symyx ChemMobi, are free; those that aren't cost at most about $3.00.


Our question still stands, though. Are phone-based apps for work or play? Have you used any of these apps—or do you know about any other scientific apps that perhaps Allison missed?

Categories: Lab Operations & Workflows, Scientific Databases Tags: mobile-apps, chemmobi

One approach to building a predictive model is to choose a powerful technique such as a neural network (NN) or support vector machine (SVM) algorithm and then tune the model-building parameters to maximize the predictive performance. Over the past 15 years or so, an increasingly popular alternative has been to combine the predictions of multiple different techniques into a consensus or ensemble model, without necessarily optimizing each individual model within the ensemble. This is the approach that won the million-dollar Netflix Prize last year, as well as the zero-dollar challenge from the November 2009 Pipeline Pilot newsletter. I'll be talking about the latter; for details on the Netflix Prize solution, go here.


In brief, the Pipeline Pilot Challenge was to find the model-building technique that gives the best ROC score for a particular classification problem. When we formulated the problem, we figured people would apply the various different learner components in Pipeline Pilot, and probably come up with a solution involving an SVM, Bayesian, or recursive partitioning (RP) model.


But winner Lee Herman took a clever alternative approach. He built four different models using four dissimilar techniques: Bayesian, RP (a multi-tree forest), mixture discriminant analysis, and SVM. For making predictions on the test set, he summed the predictions from each of the models to get a composite score. This ensemble model gave a better ROC score than any of the individual models contributing to it. For details, see Lee's protocol on the Pipeline Pilot forum (registration is free).


Why does this work? In essence, each type of model captures some aspect of the relationship between the descriptors and what we wish to predict, while having its own distinct errors and biases. To the extent that the errors are uncorrelated between models, they cancel rather than reinforce each other. Thus the accuracy of the whole becomes greater than the greatest accuracy of any of its parts. It's as if many wrongs can make a right.
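To make the errors-cancel intuition concrete, here is a minimal pure-Python sketch. The data and the three "models" are entirely hypothetical (simulated predictors with independent noise, standing in for dissimilar learners such as Bayesian, RP, and SVM models); averaging them beats each one alone.

```python
import random

random.seed(42)

# Toy regression targets and three "models" whose predictions carry
# independent noise of the same magnitude.
truth = [random.uniform(0, 10) for _ in range(1000)]

def noisy_model(y, sigma):
    """Simulate a model: the truth plus independent Gaussian error."""
    return [v + random.gauss(0, sigma) for v in y]

preds = [noisy_model(truth, 1.0) for _ in range(3)]

# Ensemble: average the three predictions for each sample.
ensemble = [sum(ps) / len(ps) for ps in zip(*preds)]

def mae(pred):
    """Mean absolute error against the true values."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)

individual_maes = [mae(p) for p in preds]
print("individual MAE:", [round(m, 3) for m in individual_maes])
print("ensemble MAE:  ", round(mae(ensemble), 3))
```

Because the three error streams are uncorrelated, the ensemble's error shrinks roughly as one over the square root of the number of models, so the composite score comes out better than any single contributor.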

Categories: Data Mining & Knowledge Discovery Tags: qsar, statistics, pipeline-pilot, data-mining, data-modeling, classification-models, roc-analysis

A roundtable discussion took place near the close of this year’s HCA meeting in San Francisco. The topics of Data Analysis and Management, Image Analysis, and Computational Biology were folded into a single discussion. This roundtable was facilitated by Karel Kozak. Participants included:


Karel Kozak (Swiss Fed. Institute Of Technology)


Lisa Smith (Merck)


Peter Horvath (Swiss Fed. Institute Of Technology)


Achim Kirsch (PE/Evotec)


Ghislain Bonamy (Novartis GNF)


Abhay Kini (GE Healthcare)


Jonathan Sexton (North Carolina Central University)


Mark Bray (Broad Institute)


Chris Wood (Stowers Institute for Medical Research)


Pierre Turpin (Molecular Devices)


Mark Collins (ThermoFisher/Cellomics)


The opening shot from Schmerck (Lisa Smith, formerly of Schering, now Merck) was fired at the vendors. The bullet in question? Why have tools for pattern recognition and machine learning on image data not been addressed more rapidly in vendor systems? Vendors replied with their own question: why is this a better approach than algorithmic quantification of a known endpoint? The result of the ensuing discussion was that end-users want the ability to extract any additional information from their data that is not derived by the designed analysis algorithm, i.e., look for natural classes in the data, spot outliers, correlate to the chemical structure of test compounds, etc. This does not necessarily have to be correlated to known biological endpoints; it can be purely exploratory. The vendors’ response: that’s why we need companies like Accelrys and products like Pipeline Pilot. The marketplace needs a third-party environment that provides turnkey or near-turnkey access to the data, plus an exploratory environment like Pipeline Pilot in which users can develop methods to ask “what-if” questions of their data. When users clearly demonstrate that these techniques have merit, they will find their way into the instrument vendors’ products.


One other aspect of the above discussion which became apparent is that many, if not most, HCS users have no idea what the difference is between PCA, classification, support vector machines, genetic algorithms, self-organizing maps, etc., let alone where or when to apply these methods. What they want, and need, is a kind of wizard that walks them through a process of determining what they want to learn from their data and then selects the best method internally. An analogy was drawn to curve-fitting programs that apply hundreds or thousands of models to a data set and tell the user which ones produced the best fit. This idea of “opening up to the wider science community methods previously available only to discipline experts”, specifically in computational biology, is by no means in its infancy (see The Future of Computational Science, Scientific Computing World: May / June 2004).
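The selection loop behind such a "wizard" can be sketched in a few lines: fit several candidate model families, score each on held-out data, and report the winner. This is a hypothetical pure-Python illustration of the idea only, not how any vendor product actually works; the data and model names are invented.

```python
import random

random.seed(0)

# Hypothetical 1-D data with a linear trend plus noise.
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [2.0 * x + 1.0 + random.gauss(0, 0.5) for x in xs]

train_x, train_y = xs[:150], ys[:150]
test_x, test_y = xs[150:], ys[150:]

def fit_constant(x, y):
    """Baseline model: always predict the training mean."""
    m = sum(y) / len(y)
    return lambda v: m

def fit_linear(x, y):
    """Ordinary least squares for y = a*x + b."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    a = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    b = my - a * mx
    return lambda v: a * v + b

def holdout_rmse(model):
    """Root-mean-square error on the held-out test split."""
    se = sum((model(xi) - yi) ** 2 for xi, yi in zip(test_x, test_y))
    return (se / len(test_x)) ** 0.5

# Fit every candidate, score each, and report the best fit to the user.
candidates = {"constant": fit_constant, "linear": fit_linear}
scores = {name: holdout_rmse(fit(train_x, train_y))
          for name, fit in candidates.items()}
best = min(scores, key=scores.get)
print("scores:", {k: round(v, 3) for k, v in scores.items()})
print("best model:", best)
```

A real wizard would loop over far more model families (SVMs, trees, SOMs, and so on) and use cross-validation rather than a single holdout split, but the shape of the automation is the same.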


The momentum in machine vision (learning, clustering, modeling, predictive science, and ease of use) was foreshadowed at the HCA East conference in 2009 and will likely continue to be the area that enables researchers in High Content Screening and Analysis to make better-informed decisions earlier in the discovery process.


Special thanks to contributing author Kurt Scudder.

Categories: Bioinformatics, Data Mining & Knowledge Discovery, Trend Watch Tags: clustering, machine-learning, high-content-screening, predictive-science, computational-biology, image-informatics, exploratory-analysis, self-organizing-maps, support-vector-machines

Remember A Company That Cannot Be Named that conducted the study described here? It implemented a suite of different systems that tightened the timeline between discovering a compound and collecting data to inform next-step decisions. But where does the credit go? Is the transformation due to the specific systems companies purchase, or the way companies choose to implement the systems they’ve chosen?


We think it’s the latter. I mean, honestly, pharma now has more tools than ever to choose from to expedite just about any task an R&D scientist needs to do. What’s challenging is integrating all these information sources so that they benefit the entire organization rather than one group of scientists. In fact, the lack of a well-thought out strategy for implementing informatics can actually inhibit R&D productivity by creating silos that make it hard to find information or create comprehensive reports to inform decisions.


Increasingly, we are seeing companies provide a self-service information buffet to scientists. The whole idea is to provide an environment that’s not limited by what’s on the menu or the speed of service—an environment that puts scientists in control rather than expecting them to always consume the same mass-produced, out-of-the-box burger.


How might such an information buffet look? Here’s one enticing smorgasbord. The screen below shows a dashboard for viewing stability study results for different formulations. Scientists select data of interest from multiple experiments, and Isentris assembles various charts on the fly. Bar charts indicate the relative composition of the formulations. The line charts illustrate the stability of micelles as the surfactant in each formulation destabilizes.


[Screenshot: formulation stability dashboard]

We’d love to hear which analogy best fits the way you are providing information to researchers. Do your scientists prefer to grab information from buffets, fast-food joints, or sit-down restaurants—and how do these approaches affect how you purchase and implement informatics? And to get a taste of an Isentris informatics buffet, check out this 10 minute video demonstration.

Categories: Lab Operations & Workflows Tags: webinars, roi, workflow-integration, isentris

Everywhere you go these days, you (often literally) run into people with their heads buried in mobile devices. But will these "minicomputers," as a recent article in Pharmaceutical Manufacturing called them, find utility in the lab beyond phoning colleagues or catching up on email?


James Jack, who developed Symyx's ChemMobi app, thinks so. "The iPhone is... more a computer with the ability to make phone calls than a phone," Jack said recently. "It seemed perfectly natural to try and put some meaningful chemistry on there." [Editor's note: In July 2010 Symyx merged with Accelrys, Inc.]


ChemMobi lets scientists use an iPhone or iTouch to access over 30 million chemical structures and related properties, supplier information, and MSDS summaries. So far, at least, scientists seem interested. Since July, the app has been downloaded over 2,000 times by users in 19 countries. And Jack was lauded by the Chemspider blog last summer for the creativity and progressive thinking that went into this app.


But what do you think? Is your mobile device a tool for working or an escape from the daily grind? And, depending on your answer, what types of scientific apps would you use if they were available?

Categories: Scientific Databases Tags: discoverygate, mobile-apps

DFT Redux

Posted by gxf Jan 14, 2010

I thought I'd start the year with an easy blog, simply following up on my earlier ramblings of 25 October 2009: DFT Goes (Even More) Mainstream. In that article I discussed the success of density functional theory (DFT), using the annual number of publications as a metric. The numbers showed that publications grew by over 25% per annum, but the results for 2009 were naturally incomplete.

Happily, the trend continued through 2009, for a total of 4621 DFT references in ACS journals. Here are a few of my favorite publications, though not all are drawn from the ACS citations. Yes, of course, these use Accelrys DFT packages, but they are still pretty cool articles:

Let me and my readers know what you think are the most interesting DFT articles from 2009.


†Strictly speaking, this was not QSAR, Quantitative Structure-Activity Relationship, because they didn't actually base predictions on the structure. I use the term here more generally to refer to relationships that predict complex properties, like catalytic activity, on the basis of simpler properties, like work function.

Categories: Materials Informatics, Modeling & Simulation Tags: materials-studio, computational-chemistry, density-functional-theory, molecular-modeling

Get educated on ELNs

Posted by AccelrysTeam Jan 13, 2010
During 2009, we touted Symyx Notebook's strength at connecting scientists in different disciplines across the enterprise. Three Webinars presented at the end of the year give you a chance to verify these claims yourself. Check out recordings highlighting Symyx Notebook's utility at transforming enterprise R&D, as well as specific workflows for analytical chemistry and biology.

A list of other recorded webinars can be found here covering such topics as collaborating in Isentris, getting started with DiscoveryGate, and intelligent structure editing in Symyx Draw. Holler, too, if there are other learning sessions you’d like to see offered in the coming months, and I'll pass on your thoughts to our webinar team.
Categories: Electronic Lab Notebook Tags: webinars, eln, demos, symyx-notebook-by-accelrys

How to Buy (and Sell) ELNs

Posted by dcurran Jan 11, 2010

Today's guest author, Pierre Allemand (VP of Life Sciences in Europe), is not the type to keep his opinions to himself. In this column, he argues that enterprise ELNs require a different evaluation process that forces vendors to earn your business. What do you think? If you’ve bought an ELN, would you do things differently—or ask for different things from vendors?

As much as vendors and customers love the idea of an out-of-the-box ELN (says Pierre), there is no such thing--not if you’re talking about an enterprise ELN that meets the specific needs of scientists in different disciplines while enabling data to be accessible by everyone who needs it. A vendor may hand you a box, but in reality, that box is like Tiffany meets IKEA: beautifully engineered parts that still require some assembly.


Research is still the first step in the evaluation process, particularly given that there’s no shortage of ELNs to choose from  (obviously, this blog is a great place to learn about Symyx’s ELN). But what next? The standard approach is to send out an RFP, collect vendor responses, view a bunch of canned demos, and select a solution, which only then is customized for you. Everyone has heard stories of buying something based on the demo, only to discover after installation that the product doesn’t work as “advertised.”


Because ELNs (particularly a true enterprise solution) are tied so tightly to your business, I recommend that you spend more time up front developing a relationship with vendors. Ask the vendors to survey your operation. We’ve done everything from two-day workshops to full-blown department or enterprise audits. Yes, we’ve gotten flak for it. Think about it, though. One of the key problems in R&D informatics is that companies tend to let technology drive workflows, rather than using technology to support better or more efficient ways of working. Through an audit, we can get a more complete picture of what you are trying to accomplish. And for the same time and money you’d invest gathering specs for an RFP, you get greater insight into how different vendors operate.


You’re left at the end of this stage with a bunch of reports from various vendors on how to approach your informatics situation. This may not seem much different than what you’d get after an RFP, but the difference is in what’s on the paper. You see how each vendor has defined your problem and their different approaches. You’re in control. The pros and cons are easier to see, you can talk with vendors and ask for revisions, and, ultimately, you define your solution based on your needs rather than adjusting your needs to vendor conceptions.


Following these conversations, you would select maybe your two top vendors and ask them each to run a workshop to illustrate how they’d implement their approach. This workshop is more than a demo—it covers not just what the solution looks like and how it works, but how it is implemented. You know what you’re getting and what is expected of you (and the vendor) to make it work.


From here, you’d select a vendor. But the work isn’t over. We highly recommend a pilot or agile implementation that introduces the system to a subset of early adopters. These scientists can not only put the system through its paces, but serve as champions when you finally roll the system out more widely.

Categories: Electronic Lab Notebook Tags: eln, buying-software, guest-authors

Mad about MAD

Posted by dhoneycutt Jan 8, 2010
Over the past year or so, I have spent a great deal of time working with model applicability domains (MAD). Here I explain some of the what and the why.

When we build a statistical model—whether with linear regression, Bayesian classification, recursive partitioning, or some other method—we want to ensure that the model is a good one. If the goal is to make predictions with the model, then "good" means "able to make accurate predictions." We usually use cross-validation or test set validation to convince ourselves that a model is good in this sense.

But there's more to it than this. Even a good model makes poor predictions for samples that are too different from the samples in the training data used to build the model. In other words, the training data define the model's applicability domain.

For example, suppose we wish to model per-capita crawfish consumption as a function of several variables, including the distance from the Mississippi River. Suppose also that our training and test sets consist solely of Louisiana residents. Even if we find that the model has good predictive ability for the test set, we would not expect it to do a good job predicting crawfish consumption in, say, Oregon (though it might do an OK job for parts of Mississippi). In other words, locations in Oregon lie outside the MAD. (See map.)

This idea appears obvious, yet models in statistical software packages often lack the ability to automatically define their own MAD and flag predictions outside the MAD as questionable. (In linear regression models, confidence and prediction bands serve this role to some extent. The bands become wider as we move away from the center of the training data.) The onus is generally on the user of the model to ensure that it is applied correctly. When the person applying the model is the same one who built it, and is thus familiar with the training data and the model's limitations, this is not too big a problem. But when the creator and user of a model are two different people separated in space or time, a model's awareness of its own applicability domain can be critical to the proper use of the model.

In the life sciences, it appears that the need to take the MAD into account when making predictions was first recognized for QSAR models of toxicity such as TOPKAT. TOPKAT introduced the notion of the optimum prediction space (OPS) defined by the ranges of the training set descriptors in principal component space. But the OPS is just one of several MAD measures discussed in the literature (e.g., see here, here, here, and here).

To summarize some of my own recent work in this area: In various numerical experiments, I have reproduced the research results of others who found that the distance from a test sample to samples in the training set correlates well with the model prediction error. ("Distance" can be defined in several different ways, and a lengthy essay could be written on this subject alone. But I'll spare you for now.) This gives us the potential to estimate MAD-dependent error bars even for learning methods that do not intrinsically support them. A few of the model-building (learner) components in Pipeline Pilot now support OPS and other MAD measures, and we're working on adding more of these.
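As a concrete (if deliberately simplified) illustration of a distance-based MAD check, the sketch below flags a query point as outside the domain when its distance to the nearest training sample exceeds the largest nearest-neighbor distance within the training set. The descriptor vectors and the threshold choice are hypothetical, one simple option among the many measures in the MAD literature; real implementations such as the OPS work in principal component space and are more sophisticated.

```python
import math

# Hypothetical 2-D descriptor vectors for the training set.
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]

def dist(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def min_dist_to_train(x, training):
    """Distance from a query point to its nearest training sample."""
    return min(dist(x, t) for t in training)

# Threshold: the largest nearest-neighbor distance *within* the
# training set. Queries farther from the data than any training
# point is from its own neighbors are treated as out of domain.
threshold = max(
    min(dist(t, u) for j, u in enumerate(train) if j != i)
    for i, t in enumerate(train)
)

def in_domain(x):
    """True if the query lies within this simple applicability domain."""
    return min_dist_to_train(x, train) <= threshold

print(in_domain((0.6, 0.4)))    # near the training data
print(in_domain((10.0, 10.0)))  # far outside: flag as questionable
```

Hooked up to a model, a check like this lets predictions outside the domain be flagged as questionable automatically, even when the model's creator and its user are different people.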

I hope I have convinced you of the importance of paying attention to the applicability domain when making predictions with a model. I'll have more to say on this in a future posting.
Categories: Data Mining & Knowledge Discovery Tags: statistics, pipeline-pilot, data-mining, data-modeling, toxicology, machine-learning, model-applicability-domain, model-validation
Join us at the annual CHI PepTalk 2010 Conference being held at the Hotel Del Coronado in San Diego, January 11-15, 2010. At booth #32, Accelrys Protein Engineering and Antibody Modeling experts will showcase new features and enhancements in Accelrys Discovery Studio 2.5.5 that enable and improve the modeling of antibody structure and function. The advanced computational technology in the Discovery Studio environment allows scientists to explore the antibody landscape in silico prior to costly experimental work, greatly reducing the time and expense involved in bringing such products to market while increasing scientific productivity.

On Monday, January 11th at 3:35 pm, Dr. Shikha Varma-O'Brien of Accelrys will present "Modeling the 3-Dimensional Structures of Antibody and their Interaction Interface to Antigen." She will discuss and demonstrate how Accelrys Discovery Studio not only contains the tools necessary to construct antibody modeling frameworks, but also enables structure-based prediction of antibody physical properties, with the goal of uncovering novel antibody designs.
Categories: Modeling & Simulation Tags: discovery-studio, proteins, antibodies
A webinar on Thursday, January 14, will highlight the new functionality in the all new DiscoveryGate. Titled “Keeping IT Simple,” the webinar focuses on DiscoveryGate’s new search, filter, and reporting options to speed synthesis design and planning.

Power users and research IT administrators can learn about transitioning from legacy ISIS installations to Isentris in a webinar on Tuesday, January 26. The webinar not only explores the benefits of transitioning but also showcases two new packages Symyx has developed to ease the transition.

 To register for either webinar or view recordings of past webinars, visit the Symyx events page.
Categories: Lab Operations & Workflows, Scientific Databases Tags: webinars, discoverygate, isentris

One thing research organizations always need is information on how to justify their investment in informatics. These pretty pictures compiled by a VP of chemistry at A Company That Cannot Be Named powerfully illustrate the impact of informatics.




The company used program event-based analysis to map all the data collected for individual compounds over time. The red mountain creeping up the chart marks when a compound was first registered. The colored dots to the right of the red line track the discovery of property information and other details about the initial compound over time. Viewed this way, the problem is obvious: scientists don’t have all the information they need to make next-step decisions. That highlighted result, for example, was obtained too late to inform work on all those circled compounds.


This organization implemented a suite of laboratory operational systems, including an ELN, a chemical registration system, a compound management system, and assay management, analysis, and reporting applications. They had qualitative aims like improving quality, timeliness, workflow integration, and productivity. And here’s what the mountain looks like now.





Two things to note about this map. First, the time between discovering a compound and capturing additional detail is much tighter (that red line is almost purple now due to all the data crowded against it). More information is available sooner to inform next-step decisions. Second, the slope of this peak is steeper, indicating that compounds are being developed or selected more rapidly and showcasing the impact that this company’s investment in informatics has had on scientific decision making.

Categories: Executive Insights, Lab Operations & Workflows Tags: reporting, case-studies, assay-management, chemical-registration, compound-management, eln, roi, workflow-integration

Materials Studio Webinar Series Part V: Exploring New Fuel Cell Materials


There is increasing pressure to deliver lighter, more efficient and less expensive materials more frequently and faster than ever before. Fortunately, the integration of Materials Studio applications such as CASTEP and the Pipeline Pilot platform opens a range of possibilities for the discovery of new materials.


The experts at Accelrys have developed a new framework that screens complex systems and properties across numerous materials and applications. This system is currently being applied to fuel cell catalysts to find alternatives to costly materials such as platinum. Dr. Jacob Gavartin and Dr. Gerhard Goldbeck-Wood will discuss this approach and its application in detail during next week’s webinar:


Exploring New Fuel Cell Materials: High Throughput Calculations and Data Analysis with Materials Studio 5.0 and Pipeline Pilot
January 13, 8am PST / 4pm GMT


Register today!

Categories: Materials Informatics Tags: catalysis, pipeline-pilot, materials-studio, fuel-cells