
Accelrys Blog

February 2010

In the first part of this series, we discussed the basic collection of cloud offerings and what type of value they provide to IT, Developers, and Customers.  In this post, we’ll focus more on the business issues when leveraging the cloud.


One of the biggest hurdles to leveraging Cloud Services is securing and transporting the data.  There’s no single answer or solution to resolve these issues, and there is no shortage of webinars, papers, conferences, etc. that focus on this, so I don’t think I need to dig into that (just yet).  But what’s important to recognize is that all of the Cloud vendors, security experts, and network providers are working to provide an answer that both meets your business and technical requirements and earns your trust.  The best way to get over that hump is to learn more and try it out.


First, try the cloud on non-critical but impactful tasks.  Then start to increase your usage of critical data, connect directly to internal data, and perform tasks that provide real business value.  This isn’t an original approach since it’s pretty much the typical evaluation or Proof of Concept (POC), but that’s exactly the point!  Driving a project like this is about more than just technology, as you’ll most likely involve many people within your organization, such as Legal, Finance, IT, and Security, in order to plan and complete the project.  There will be lots of concerns from these various groups, many reasonable and some that just require lots of education.  So make sure you invest in educating them on the basics of the Cloud first.  This will make the rest of the process much smoother, but not easier.


Second, you’ll need to consider network bandwidth usage and data storage costs.  All of the cloud vendors charge some sort of fee for uploading, downloading, and storing your data.  When you first look at this, it’s pennies per GB, but when dealing with large data volumes and data transactions (e.g. reading and writing across the network) those costs can get pretty high.  So your first thought will be that cloud pricing is extremely high, but what you may not be factoring in is all the things the cloud vendor is doing for you beyond just the price of the disk, network, cooling, and power.  The cloud vendors typically offer a high SLA, and that includes data replication, de-duplication, resiliency, continuity, and more.  Not to mention the staff, planning, and operations to make all of that happen.  If you added all of that to the internal per-GB storage cost of your own infrastructure, you’d most likely see that the Cloud is more affordable, but that assumes you’re meeting the same level of SLA and process as the Cloud vendors, which most organizations are not.  That said, there are some applications and data that may not be a good fit for many of the cloud vendors because of the special nature of the application, massive data size with high volume transactions, high throughput requirements, legal requirements, and more.  But this is starting to be the exception rather than the rule.
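To make the "pennies per GB" arithmetic concrete, here is a minimal sketch of how per-GB storage and transfer fees add up as data volumes grow. The prices and volumes are hypothetical placeholders chosen for illustration, not any vendor's actual rates, and the function name is made up for this example.

```python
def monthly_cloud_cost(stored_gb, transferred_gb,
                       storage_price_per_gb, transfer_price_per_gb):
    """Return a monthly bill for storing and moving data.

    All prices are hypothetical placeholders, not any vendor's rates.
    """
    storage_cost = stored_gb * storage_price_per_gb
    transfer_cost = transferred_gb * transfer_price_per_gb
    return storage_cost + transfer_cost


# Hypothetical per-GB prices, in dollars (assumptions for illustration only).
STORAGE_PRICE = 0.15
TRANSFER_PRICE = 0.10

# A small pilot project versus a data-heavy production workload.
small = monthly_cloud_cost(10, 5, STORAGE_PRICE, TRANSFER_PRICE)
large = monthly_cloud_cost(50_000, 20_000, STORAGE_PRICE, TRANSFER_PRICE)

print(f"small project: ${small:,.2f} per month")
print(f"large project: ${large:,.2f} per month")
```

The same per-GB prices that look negligible for a pilot become a real budget line at production scale, which is why the comparison against your fully loaded internal cost matters.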


Many organizations are making the leap of putting their most trusted data into the cloud, and some are doing it without realizing the significance. Email and Sales Force Automation have been leading the charge in hosted applications and Software as a Service (SaaS) deployments.  Now think of it this way: if you can store all of your communications and customer records on the cloud, why can’t you do more?  By taking this leap, businesses start to build trust in external parties maintaining and operating their business-critical services.  In a recent report, Goldman Sachs notes that customers see a “shift towards cloud unstoppable”. The trend towards cloud services and applications won’t be a complete rip and replace; businesses will look to the cloud as an extension of their overall enterprise architecture and infrastructure.


The many Cloud/IaaS vendors in the market today are already moving towards mass commodity price points and common functionality.  And that’s great if you want to take a piece of existing traditional on-premise software and simply deploy it to the cloud.  What you have to look out for are pitfalls in the software license, security, deployment architectures, and the fact that you’re still responsible for managing that software in the “Cloud”.  So the next layer to look for is a Services vendor that can deliver you the application.  This can at times come from the vendor directly or through a partner network supported by the vendor.  Each has its own value proposition and differs in how flexible it can be in delivering additional custom services.  Again, this type of application + service model isn’t new, as the Application Service Provider (ASP) model has been around for years.  What’s new is that these ASPs can still provide lots of value and cost reduction to the customer, but now by leveraging computing and storage that is provided by a “Cloud” offering (e.g. AWS).


In the next part of this blog series, we’ll focus more on the various services models that will be available to customers based on a cloud version of Pipeline Pilot.


To view all Conrad's Cloud Series posts, please visit:

393 Views 1 References Permalink Categories: The IT Perspective Tags: pipeline-pilot, cloud-computing, software-as-a-service
One of the most successful uses of quantum mechanical modeling methods is to predict spectra. These methods are capable of yielding good predictions of UV/Visible, NMR, Infrared, Raman, THz, and EELS (electron energy loss spectroscopy) to name just a few. Spectroscopy (according to Wikipedia) is the "study of the interaction between radiation and matter as a function of wavelength ... or frequency." How does this help chemists? We can use the spectra to determine the structure of new molecules or materials; to determine the composition of mixtures; or to follow the course of a chemical reaction in situ. How does modeling help with this? In a number of ways, but I'll cover just 2.

One way modeling comes into play is by working with experimental results to remove ambiguities. When a chemist is trying to determine the structure of a new material, he or she takes a spectrum, or two, or three. His or her knowledge of the ingredients together with the spectra gives a pretty good idea what the chemical or crystal structure is. In a lot of cases the data are sufficient only to narrow this down to 3-4 possible structures. Molecular modeling resolves this ambiguity by predicting the spectrum of each possibility; the predicted spectrum that matches the experimental one presumably corresponds to the "right" structure. Modeling is even more valuable when investigating defect structures like this work on Mg2.5VMoO8.

Another use is telling experimentalists where to look for the spectral peaks of a new compound. This can be especially important when trying to detect the spectra of new, novel, or poorly characterized materials. Experimental terahertz (THz) spectroscopy, for example, examines the spectral range of 3-120 cm⁻¹, and can be used for detection and identification of a wide assortment of compounds, including explosives like HMX. It's a lot safer to investigate these materials by modeling than in the lab.

A recent blog by Dr. Damian Allis highlights the importance of doing the simulations correctly. (By the way, Damian, congrats on getting to page 1000.) A lot of work for the past 40-odd years has gone into predicting spectra of isolated - or gas phase - molecules. But materials like HMX are crystalline, and calculations on the isolated molecules make for poor comparison with crystals. The recent work underscores how important it is to simulate crystals using crystals. And it's not just for THz spectra. Recent work on NMR leads to the same conclusion. A couple of programs can do this. Damian's blog focuses on DMol3 and Crystal06, but we should also mention CASTEP and Gaussian as other applications capable of predicting a wide variety of properties for solids.

Let's keep modeling - but be careful out there: short cuts will lead to poor results, and molecular modeling will end up taking the rap for user error.
635 Views 1 References Permalink Categories: Materials Informatics, Modeling & Simulation Tags: materials-studio, computational-chemistry, quantum-chemistry, spectroscopy

I last left you with a list of terms and examples surrounding the term "cloud computing;" now it's time for a little context.  Utility Computing, such as Amazon Web Services (AWS) Elastic Compute Cloud (EC2), provides a customer with the ability to spin up new machines on demand.  From the customer side, you don’t care what machine it’s on, but you do get to define the type of resources you want to consume, such as CPU cores and memory.  So far this sounds just like Hosting, right?  Correct!  What’s different is that you don’t have to sign a long-term contract for that resource AND you’re not tied to that actual hardware, since in the background it’s really just a Virtual Machine.  Now this is where it gets interesting.  Hosting has been around for a while, but now that Server Virtualization technologies such as Microsoft Hyper-V and VMware vSphere have matured, they enable the flexibility and architectures of Cloud Computing.  And since this Server Virtualization is available to Enterprises, this is where you hear the term “Private Cloud” being added to the Enterprise mix.


Now let me quickly tackle a common question.  “What’s the difference between Amazon Web Services, Microsoft Azure, and Salesforce?  Aren’t they all the same?”  First off, this is a great question, but it’s really comparing apples, oranges, and tomatoes.  Yes, those are all fruits, but each provides something very different to the consumer.  Where Clouds are different from fruit is that you can layer some of the clouds to deliver a service.  Remember that AWS is a Utility.  Microsoft Azure is a resource targeted towards developers.  Developers are different from IT and therefore have different requirements.  They like to write applications that typically consume some data and provide a User Interface.  They don’t want to be bothered with patch management, monitoring systems, deployment of servers, etc.  Microsoft Azure abstracts this from the developer.  They instead write to the “Fabric” of the Cloud Computing platform that Microsoft manages, which allows the developer to focus on what they do best.  Finally, with Salesforce it’s even further abstracted.  You still have developers who can write applications based on the platform, but the developer is given even more constraints on what they can develop and how it can be implemented.


OK, enough of the Cloud Tutorial, but hopefully you now have an understanding that there are many different types of clouds and how they can be used.  Are there challenges to adoption? You bet!  But there are always challenges when adopting technology.  While the above was about the technology, there are a number of business issues, concerns, and questions that need to be addressed as well.  For many organizations, one of the biggest hurdles is securing and transporting the data.


In the coming weeks, we’ll provide an update on our roadmap for leveraging, supporting, and providing guidance on using Cloud Computing and Virtualization technologies.  Accelrys has already been moving forward to partner with a number of Cloud vendors, Service Providers, and third-party software vendors to ensure our customers have the power of choice, delivery models, and a clear path to leverage Accelrys products in the cloud.


If you’re in any stage of interest, planning, evaluating, or deploying Accelrys products or other scientific applications in the Cloud, we’d love to hear from you!  As the leading provider of Scientific Informatics Solutions, we’re interested in supporting our customers no matter where their environment is – at home or in the cloud.  Visit our forums to continue the discussion:


In the next part of this blog series, I’ll focus on the Business Issues found with leveraging the cloud.


To view all Conrad's Cloud Series posts, please visit:

461 Views 0 References Permalink Categories: The IT Perspective Tags: pipeline-pilot, materials-studio, discovery-studio, cloud-computing, virtualization

Previous entries here and here on the In the Pipeline discussion that deviated to discuss the utility of ELNs in academia have sparked conversation within our company. As usual, Pierre Allemand (VP, life sciences, Europe) wasn’t shy in expressing his POV, so I invited him to author a post. Note that we’ll be running several entries in the next few weeks on hosted solutions, so stay tuned.


It is clear from this whole sad story that not having an ELN is much worse than the potential negatives of having one. But as the comments we’ve summarized indicate (and as my friend Paul Collins noted in his comment on our first entry on this subject), academics have legitimate concerns about the security of electronic records and having the resources to maintain systems over time.


A logical option for academics and other small labs is a hosted ELN. Under this framework, an academic site would access ELN capabilities by subscription, and the application and associated data would be managed, maintained, and secured by an outside vendor.


Security is probably the biggest concern cited about hosted solutions. But this situation demonstrates that paper notebooks aren’t failsafe: they can be lost or stolen or damaged. I would argue that it is easier to get into a research lab and steal a paper notebook than to attempt to hack into an electronic system and copy information. Then again, an internal electronic system isn’t impenetrable, as BMS learned earlier this month.


What makes a hosted solution different is that a research site shifts the onus of securing electronic records to a vendor with specific IT expertise. Serious vendors understand the industry standards and services available to protect data. We at this company, for instance, would never consider running hosted services on our own computers. Developing software and hosting it are two different things. You might love your iPhone, but how would you feel if Apple started running its own cellular network?


Today, there are myriad services set up for financial, banking, human resources, customer relations, and other hosted and cloud-based services that specialize in securing sensitive data. Why reinvent the wheel for something as important as research IP?

379 Views 1 References Permalink Categories: Electronic Lab Notebook, The IT Perspective Tags: hosted-informatics, cloud-computing, eln, academic-labs, hosted-eln

The scientific community is seeing an explosion of outsourcing, collaboration, massive data production and consumption, and financial pressures.  Driven by these challenges, Research & Development Information Technology (R&D IT) and even the scientists themselves are looking to the potential of Cloud Computing to enable an increase in science innovation and allow R&D IT to provide higher-value service along with reduced costs.  Cloud Computing isn’t a “silver bullet” to solve these challenges, but it does provide the tools to address many of these key business drivers.

I’m sure many of you have seen the benefits of the cloud such as cost reduction, cost management, on demand, and scalability.  But what does this mean in the context of a Scientist using a product such as Pipeline Pilot?  Before we can get into the specifics of how Cloud Computing will provide value to a Science organization, let’s first get the terminology straight.  This won’t be a deep dive into each area, but just a quick primer.

First off let’s just all agree that “Cloud Computing” is a pretty generic term and it actually comes in many different forms.  Here are some terms loosely used and thrown around with common examples:

  • Platform Virtualization - Virtualization of computers or operating systems. It hides the physical characteristics of a computing platform from users, instead showing another abstract computing platform.
    • VMware vSphere, Microsoft Hyper-V, Citrix XenServer

  • Grid Computing - Combination of computer resources from multiple administrative domains applied to a common task, usually to a scientific, technical or business problem that requires a great number of computer processing cycles or the need to process large amounts of data.
    • Microsoft HPC, Sun Grid

  • Managed Hosting - A dedicated hosting service, dedicated server, or managed hosting service is a type of Internet hosting in which the client leases an entire server not shared with anyone.

  • Utility Computing (Cloud) - packaging of computing resources, such as computation and storage, as a metered service similar to a traditional public utility (aka Infrastructure as a Service)
    • Amazon EC2, Rackspace Cloud, GoGrid

  • Platform as a Service (PaaS) - a computing platform and/or solution stack as a service, generally consuming cloud infrastructure and supporting cloud applications.
    • Microsoft Azure Services, Google App, Rackspace Cloud Apps

  • Software as a Service (SaaS) - model of software deployment whereby an Application Service Provider (ASP)  licenses an application to customers for use as a service on demand

  • Software plus Services (S+S) - combining hosted services with capabilities that are best achieved with locally running software.
    • Microsoft Exchange Hosted Services, Google Message Labs

That’s a pretty quick and dirty listing of terms, so I'll add a little context next time...


To view all Conrad's Cloud Series posts, please visit:

409 Views 0 References Permalink Categories: The IT Perspective Tags: pipeline-pilot, hosted-informatics, cloud-computing, grid-computing, platform-as-a-service, software-as-a-service, virtualization

SMi ELN Conference Reviewed

Posted by dcurran Feb 16, 2010

Several reviews of January’s SMi ELN conference have appeared in online venues over the past two weeks. There was a LinkedIn conversation about the conference last week in the Electronic Laboratory Notebook Group, and a review of the first day of the conference was also published last week. I talked to Francis Benett, who presented at SMi on our company's behalf, and Paul Collins, who attended, to get their take on the event.


Francis pointed out that the talks generating the most interest had to do with semantic Web searching over unstructured data. “The general issue seems to be that organizations have multiple notebooks from different vendors and not all of them are easily searchable,” said Francis. “A couple of talks discussed semantic Web capabilities: Jeremy Frey from the University of Southampton and Ian Menzies of AstraZeneca, and these were very well received.”


While the online reviewers noted the event’s sparse attendance, numbers have not been forthcoming. Francis and Paul said there were probably 25 people at most in attendance, including vendors. Clearly, travel budgets have dwindled, though Paul pointed out that another factor in the poor attendance may have been repetition in the talks. “Half the talks were just updates on last year,” Paul said.


What makes an ELN conference (or any conference, for that matter) worth attending? I’m sure our Symposium organizers would be interested in your take. Also, if you’d like a copy of Francis’s presentation, leave a comment and I’ll send you one. We’re working on getting a SlideShare site up; when that’s ready, I’ll post a copy of it there.

491 Views 0 References Permalink Categories: Electronic Lab Notebook Tags: conferences, eln, presentations

A new year, a new website, and a new mechanism to support our users!


We have just launched an initiative to help scientists around the world with support for Discovery Studio and its integration with Pipeline Pilot through the New Discovery Studio Open Hour!


Discovery Studio Open Hour is an open session hosted by an Accelrys scientist to answer any questions or queries you might have with regard to Discovery Studio. These sessions are completely open and FREE to attend and do not require any previous registration at all.


Drop in anytime and stay as long as you like. These sessions are open for 1 hour and you can drop in for 10 minutes or you can stay for the entire hour. Science is never black or white, so if you feel like brainstorming an idea or need to get advice on a workflow, dial in and you’ll be connected to an expert! New to Discovery Studio and don’t really know how to take advantage of this powerful architecture? Our support scientists will help demonstrate how customized solutions can be easily developed. And with the growing number of custom DS scripts and protocols on our Accelrys Community forum, you may just find what you are looking for and the DS Open Hour would be a great time for further discussions!


Have a question about the FREE DSVisualizer? This might be the best forum to get started or ask questions and have fun learning about tips and tricks with the product!


For now, sessions are hosted on the second Tuesday of each month in 2010 (Jan 12, Feb 9, March 9, April 13, May 11, June 8, July 13, Aug 10, Sept 14, Oct 12, Nov 9, Dec 14), and the website has all the WebEx and conference call details you’ll need.  Add the dates to your calendar …  we’ll see you then!

361 Views 0 References Permalink Categories: Modeling & Simulation, News from Accelrys Tags: pipeline-pilot, discovery-studio, ds-visualizer
The 2009 Molecular Medicine Tri-Conference blasted off like a rocket with John Crowley’s  keynote, “When Drug Research is Personal.”  His family’s struggle is the inspiration for the motion picture Extraordinary Measures.  It was a profoundly moving experience to witness this father’s story of his family’s search for a cure for Pompe disease.  This journey eventually led to the founding of Novazyme Pharmaceuticals.    This is the kind of story that encourages us at both the human and scientific level.

I had a really tough time choosing which talks to attend but mostly settled on Molecular Diagnostics, Personalized Diagnostics, Cancer Profiling and Pathways, and Informatics Systems.  It was painful to miss the RNA Interference, Cancer Biologics, and Translational Medicine sessions.  Many talks totally rocked.  Here are some of my favorites, in no particular order.  These talks come to mind because the material was fascinating, the delivery was exceptional, and they were all in areas for which I have a passionate scientific interest.

  • Single Molecule Real Time Biology: New technologies Enabling a More Complete Characterization of Disease Biology, Eric Schadt, Ph.D., Chief Scientific Officer, Pacific Biosciences

  • The Onco-SNP and Cancer Risk: microRNA Binding Site Polymorphisms as Biomarkers, Joanne B. Weidhaas, Ph.D., Assistant Professor, Therapeutic Radiology, Yale University

  • Expression Based Patient Stratification for Cancer Prognostics, Peter J. van der Spek, Ph.D., Department of Bioinformatics, Erasmus MC - Medical Faculty

  • Consumers and Their Genomes, Brian Naughton, Ph.D., Founding Scientist, 23andMe

  • Systematic Discovery of Cancer Gene Fusions using Paired End Transcriptome Sequencing, Chandan Kumar, Ph.D., Michigan Center for Translational Pathology, University of Michigan

  • Enterprise Scientific Workflow Environment Drives Innovation, Daniel J. Chin, Ph.D., Senior Principal Research Scientist, Roche Palo Alto

This year I presented a poster on biomarkers à la Pipeline Pilot™, attended talks, and caught up with professional colleagues.  The Outrageous Character awards affectionately (and respectfully) go to Eric Schadt and Peter van der Spek.  The Thank You award goes to Daniel for his kind words about our work together.  The Exquisite Explanation awards go to Joanne and Chandan.  They did an amazing job of bridging any gaps in the audience’s varied backgrounds by presenting technical concepts in essential simplicity, which was truly beautiful.  Brian gets the award for my Favorite DTC Genetics Company.  I have spent many hours studying my own SNP data (and that of my family members) thanks to 23andMe.  I have derived much pleasure from connecting with relatives, all over the world, that I found through the 23andMe site.  I am very grateful that I was able to get this type of genetic information AND the raw data, too.
650 Views 0 References Permalink Categories: Bioinformatics Tags: biomarkers, conferences, personalized-medicine, translational-medicine, genomics, cancer-profiling, molecular-diagnostics, pompe-disease

Calling DFT to Order

Posted by gxf Feb 15, 2010

One of the most interesting developments in density functional theory (DFT) in recent years is the emergence of the so-called "Order-N" methods. What's that mean? Quantum chemists and physicists classify the computational cost of a method by how rapidly it scales with the number of electrons (or the number of molecular orbitals.) This can get into a real jargon of computational chemistry, but here are some examples:  



ONETEP gets its speed by using localized molecular orbitals (MOs). Top: a conventional MO is spatially delocalized, hence it interacts with many other MOs. Bottom: localized MOs do not interact, hence less computational effort is required to evaluate matrix elements.



Consider the N^4 case as an example. This means that if you double the size of the system that you're modeling, say from a single amino acid to a DNA base pair, the cost (i.e., CPU time) goes up by roughly 16x. That makes many of these approaches prohibitive for systems with a large number of atoms. The good news is that it doesn't really need to cost this much. The atomic orbitals that constitute the molecular orbitals have finite ranges, so clever implementations can hold down the scaling. The holy grail is to develop methods that scale as N^1, or simply N, hence the expression "Order-N" or "linear scaling." Using such a method, doubling the size of the system simply doubles the amount of CPU time.
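The scaling arithmetic above can be checked with a couple of lines. This is only an illustrative sketch: `cost_ratio` is a made-up helper name, and real timings also depend on prefactors and implementation details that a pure power law ignores.

```python
def cost_ratio(size_factor, order):
    """Relative cost increase when the system grows by size_factor
    and the method scales as N**order (prefactors ignored)."""
    return size_factor ** order


# Doubling the system size under different scaling laws.
for order in (4, 3, 1):
    print(f"order N^{order}: doubling the system costs {cost_ratio(2, order)}x")
```

For the N^4 case this reproduces the 16x figure quoted above, while a linear-scaling method gives just 2x.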


My favorite Order-N method is ONETEP (not surprising, considering that it's distributed by Accelrys). As explained in their publications, this approach achieves its speed by using orbitals that can be spatially localized more than conventional molecular orbitals. As a result of localization, there's a lot of sparsity in the DFT calculation, meaning a lot of terms go to zero and don't need to be evaluated. Consequently, it's possible to perform DFT calculations on systems with 1000s of atoms. Because of its ability to treat systems of this size, it's ideally suited for nanotechnology applications. Some recent examples include silicon nanorods (Si766H462) or building quasicrystals (Penrose tiles) with 10,5-coronene.


Why bring this up now? CECAM (Centre Européen de Calcul Atomique et Moléculaire) is hosting a workshop on linear-scaling DFT with ONETEP April 13-16 in Cambridge, UK. This is a chance for experienced modelers and newcomers to learn from the experts. Plus they'll have access to the Cambridge Darwin supercomputer cluster, so attendees will have fun running some really big calculations. What kind of materials would you want to study if you had access to this sort of technology?

450 Views 0 References Permalink Categories: Materials Informatics, Modeling & Simulation Tags: nanotechnology, atomic-scale-modeling, linear-scaling, onetep, order-n

Well, we think you would probably agree…it was time for a re-model.  We’re excited to open the doors to the new and improved



[Screenshots: Old Site and New Site]



In addition to a new look and feel, we think you’ll find it easier and faster to locate information.  We’ve organized content by a number of different categories – by Area of Science, by Scientific Need, by Industry and by Product – and added helpful ‘Next Steps’ and contextually relevant resources on every page.


One of our favorite additions is the use of video throughout the site.  These videos feature members of our Accelrys team, including our Chief Science Officer, Frank Brown; our head of Research and Development, Matt Hahn; and Lalitha Subramanian, who leads our Contract Research group.  We think video is a great way for you to get to know our products and our team.


Check out the Flash video demonstrating Pipeline Pilot – our scientific informatics platform.  It’s a quick 3 minute overview that sums up the measurable impact that Pipeline Pilot can have on your research process (Homepage - first video on the left).  This library will continue to grow with not just interviews, but product demos, so check back often.  And of course, there is our Blog which features active commentary from our team on a range of topics and trends impacting the scientific community.


We hope you like the new site.  Surf’s Up!

360 Views 0 References Permalink Categories: News from Accelrys Tags: pipeline-pilot
I am really excited that Prof Robert Langer is going to join the Accelrys UGM in Boston to deliver a plenary address. Though he probably needs no introduction, Langer received the Charles Stark Draper Prize, considered the equivalent of the Nobel Prize for engineers, and the 2008 Millennium Prize, the world’s largest technology prize. Check out a recent video to hear more about his own journey, which took him from a Chemical Engineering degree to the forefront of medical research on delivery systems and tissue engineering.

As the UGM sets out to discuss the latest advances in materials and pharmaceuticals research, Prof Langer brings it all together. Drugs on their own are powerful, but simply putting them into pills seems a bit like putting a powerful engine in a car without a steering wheel. Langer has pioneered ways in which materials, especially polymers, can be used to steer drug delivery to much greater effect in curing disease. With the development of nanotechnology over the last decade, the drug and the material can work together in more and more sophisticated ways, and I really look forward to the materials and life science interaction at the UGM.
481 Views 0 References Permalink Categories: News from Accelrys, Trend Watch, Materials Informatics Tags: materials, nanotechnology, pharmaceuticals, chemical-engineering, polymers, tissue-engineering
After simple combustion, and the nuclear option, the relationship between materials and energy is as topical as ever. Taking a new turn in the 21st century, the couple have matured into exploring more subtle ways to relate to each other. What am I talking about? Well, there are so many ways in which materials and energy affect each other: energy generation, storage, conservation, and the efficient use of energy. In all of these, insights at the atomistic and quantum level help us to design cleaner energy sources and find less wasteful ways of using energy. To find out more on how modelling supports the discovery and understanding of new materials for fuel cells and batteries, please check out the Materials Studio 5.0 Webinar Series.  Following the recent webinar on fuel cell catalysts (for which you can still access the recording), we have two more webinars scheduled on the topic:

February 17th, 2pm GMT/6am PST: Atomic-Scale Insights into Materials for Clean Energy. The webinar will be given by Prof Saiful Islam from University of Bath, who is a renowned expert in the field: check out the interviews, podcasts and publications.

March 16th, 3pm GMT/8am PDT: High-throughput Quantum Chemistry and Virtual Screening for Lithium Ion Battery Electrolyte Materials. George Fitzgerald will include results from a collaboration with Mitsubishi Chemical Inc which was also published in The Journal of Power Sources.
449 Views 0 References Permalink Categories: Materials Informatics, Modeling & Simulation Tags: energy, catalysis, atomic-scale-modeling, materials-studio, quantum-chemistry, fuel-cells, batteries, virtual-screening

The Bayesian learner in Pipeline Pilot is a so-called naïve Bayesian classifier. The "naïve" refers to the assumption that any particular feature contributes a specific amount to the likelihood of a sample being assigned to a given class, irrespective of the presence of any other features. For example, the presence of an NH2 group in a compound has the same effect on predicted activity whether or not there is also an OH or COOH group elsewhere in the compound. In other words, a naïve Bayesian classifier ignores interaction effects.
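
To make the independence assumption concrete, here is a minimal Python sketch. It is not the estimator Pipeline Pilot uses; the function names, the add-one smoothing, and the toy feature labels are all assumptions made for illustration. Each feature gets one fixed log-odds weight, and a sample's score is just the sum of the weights of the features it contains.

```python
import math

def feature_weights(samples, labels):
    """Per-feature log-odds weights from simple presence counts.

    samples: list of sets of feature labels; labels: list of bools (True = active).
    Add-one smoothing is an illustrative choice, not Pipeline Pilot's method.
    """
    n_active = sum(1 for y in labels if y)
    n_inactive = len(labels) - n_active
    weights = {}
    for f in set().union(*samples):
        in_active = sum(1 for s, y in zip(samples, labels) if y and f in s)
        in_inactive = sum(1 for s, y in zip(samples, labels) if not y and f in s)
        p_active = (in_active + 1) / (n_active + 2)
        p_inactive = (in_inactive + 1) / (n_inactive + 2)
        weights[f] = math.log(p_active / p_inactive)
    return weights

def score(sample, weights):
    """Naive score: the sum of independent per-feature contributions."""
    return sum(weights.get(f, 0.0) for f in sample)

# Toy data with made-up feature labels
samples = [{"NH2", "OH"}, {"NH2"}, {"OH"}, {"COOH"}]
labels = [True, True, False, False]
w = feature_weights(samples, labels)
# NH2 adds the same amount whether or not OH is also present
print(score({"NH2", "OH"}, w) - score({"OH"}, w), w["NH2"])
```

Because the score is a plain sum, no combination of features can contribute more or less than its parts, which is exactly what "ignores interaction effects" means here.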


We know that in reality, interaction effects are quite common. Yet, empirically, naïve Bayesian classification models are surprisingly accurate (not to mention that they are lightning-fast to train).


But perhaps there are cases where a model with interactions would be better. How might we make the Bayesian learner less naïve? If we use molecular fingerprints as descriptors, one simple approach is to create a new fingerprint by pairing off the original fingerprint features and adding the resulting pairs to the feature list. We can then train the model on the new fingerprint with its expanded feature list.


A sparse molecular fingerprint (such as the Accelrys extended-connectivity fingerprints) consists of a list of feature IDs. These IDs are simply integers corresponding to certain substructural units. E.g., "16" might refer to an aliphatic carbon, while "7137126" might refer to an aryl amino group. So if our original fingerprint has the following features:


our fingerprint-with-interactions would have the above features with the following ones in addition:


The "$" is just an arbitrary separator between the feature IDs. A Bayesian learner works by simply counting the features present in the two classes of samples (e.g., "active" vs. "inactive"), so the feature labels are unimportant, as long as they are unique.
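
The pairing step can be sketched in a few lines of Python. This is an illustration only: the function name is hypothetical, the third feature ID in the example is made up, and sorting the IDs before pairing is my assumption for keeping each pair's label stable regardless of the order in which the features are listed.

```python
from itertools import combinations

def add_pair_features(feature_ids, sep="$"):
    """Return the original feature IDs plus one combined label per unordered pair."""
    singles = sorted(set(feature_ids))
    # Sorting first means the pair (a, b) always gets the same label
    pairs = [f"{a}{sep}{b}" for a, b in combinations(singles, 2)]
    return [str(f) for f in singles] + pairs

# 16 and 7137126 are the IDs mentioned in the text; 1234 is a made-up ID
print(add_pair_features([16, 7137126, 1234]))
```

A fingerprint with n features gains n(n-1)/2 pair features, so the expanded list grows quadratically. That is worth keeping in mind for large fingerprints.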


To test the approach, I applied it to models of the Ames mutagenicity data that I discussed in a previous posting, and to an MAO inhibitor data set. Does it work? The short answer is, "Yes, with caveats." Read my posting on the Pipeline Pilot forum for details (registration is free).

Categories: Data Mining & Knowledge Discovery Tags: qsar, statistics, pipeline-pilot, data-mining, data-modeling, toxicology, classification-models, machine-learning

A recent news article by the University of Texas at Dallas (UTD) highlighted joint work by the Department of Materials Science and Engineering and Accelrys on critical surface reactions of silicon. The research points the way to "improve semiconductor devices’ performance in health care and solar power applications in particular."


Who cares? Anybody who uses chips, solar cells, or any other device containing semiconductors (in other words, all of us).



Insertion of a nitrogen atom is predicted to occur preferentially at the step edge of Si(111)



How does the latest research help? A typical semiconductor device consists of a metal oxide semiconductor layer (e.g., HfO2) deposited on a silicon substrate. As explained by co-author Dr. Mat Halls, formation of an SiO2 interlayer between the silicon substrate and metal oxide can decrease semiconductor performance. One approach to solving this is to introduce a nitride barrier to prevent the growth of interfacial SiO2. The ability to introduce such heteroatoms into the topmost layers of Si affords additional opportunities to tune the surface properties by enhancing chemical reactivity at these sites to form functional surfaces. But how do you get the nitrogen to stick to the surface?


The latest research, published in Nature Materials, used infrared spectroscopy to explore the possible formation mechanisms of nitride on silicon surfaces terminated by hydrogen. Calculations using density functional theory (DFT) demonstrated how stepped edges are important to the formation of the nitride layers. The reaction mechanism on the stepped surface provides a means of controlling the reaction. As the authors wrote: "The ability to control the reaction ... enables the realization of applications ... including sensing, electrical and thermal transport, and molecular computing." This is a beautiful demonstration of the complementarity of theory and experiment. One deals with facts but requires interpretation. The other provides detailed explanations at the atomic level but sometimes requires an anchor to the "real world." Together they can do more. Wouldn't it be great if all viewpoints could be reconciled this well?

Categories: Materials Informatics Tags: kinetics, quantum-chemistry, semiconductors, surface-chemistry

Authors make JCIM Top 50

Posted by keith.taylor Feb 5, 2010
The Journal of Chemical Information and Modeling celebrates its 50th anniversary this year, and the anniversary web site includes a list of the 50 most-cited articles published by the journal since its inception. Symyx authors contributed to four of those papers. In July 2010 Symyx merged with Accelrys, Inc., and a good number of the authors mentioned below are still with the company.

The papers are:

#11: Atom Pairs As Molecular-Features In Structure Activity Studies - Definition And Applications
Raymond E. Carhart, Dennis H. Smith, R. Venkataraghavan
J. Chem. Inf. Comput. Sci., 1985, 25(2), pp 64-73. DOI: 10.1021/ci00046a002

#18: Prediction Of Human Intestinal Absorption Of Drug Compounds From Molecular Structure
Matthew D. Wessel, Peter C. Jurs, John W. Tolan, and Steven M. Muskal
J. Chem. Inf. Comput. Sci., 1998, 38(4), pp 726-735. DOI: 10.1021/ci980029a

#31: Traditional Topological Indexes Vs Electronic, Geometrical, And Combined Molecular Descriptors In QSAR/QSPR Research
Alan R. Katritzky, Ekaterina V. Gordeeva
J. Chem. Inf. Comput. Sci., 1993, 33(6), pp 835-857. DOI: 10.1021/ci00016a005

#40: Description Of Several Chemical-Structure File Formats Used By Computer-Programs Developed At Molecular Design Limited
Arthur Dalby, James G. Nourse, W. Douglas Hounshell, Ann K. I. Gushurst, David L. Grier, Burton A. Leland, John Laufer
J. Chem. Inf. Comput. Sci., 1992, 32(3), pp 244-255. DOI: 10.1021/ci00007a012

Congratulations to these present and former Symyx (now Accelrys) employees for their contributions to the field of cheminformatics!
Categories: Cheminformatics Tags: publications

While investigating the costs of joins between tables in Oracle, I came across the following, seemingly curious, result. I had two tables that were identical in content and layout, each with indexes on the same columns, but when I ran the same search on both tables, the query on one table was consistently more than 25% faster than the same query on the other. "You must have done something differently," you cry. Well, it wasn't exactly obvious...


Let's start at the beginning. I produced two identical tables, each containing a 10,000-record sample of CAP (Chemicals Available for Purchase), using the same Pipeline Pilot protocol. The tables differed in name only: one was CapSample, the other CapSample2. I created indexes on the CLogP and Num_H_Acceptors columns of both tables and then timed the SQL query:



SELECT count(*) FROM CapSample WHERE CLogP>5 and Num_H_Acceptors>10


over 1,000 iterations on each table (replacing CapSample with CapSample2 as appropriate).  My intention was to then measure the time of the search taking CLogP from one table and Num_H_Acceptors from the other table, joining them by the primary key CardRef column.  However the search on CapSample consistently took about 3.85 seconds per 1000 iterations while the same search on CapSample2 consistently took about 2.79 seconds.  I was the only user on the machine and I kept re-running and switching between CapSample and CapSample2 and the results were consistent.  Weird!
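
For readers who want to repeat this kind of measurement, here is a minimal Python timing sketch. It is not the harness I used; `time_query` is a hypothetical helper that accepts any DB-API 2.0 connection (for Oracle, for example, one opened with a driver such as python-oracledb). So that the block runs on its own, the demo below swaps in an in-memory SQLite table filled with made-up values, which says nothing about Oracle's timings or plans.

```python
import sqlite3
import time

def time_query(conn, sql, iterations=1000):
    """Run `sql` repeatedly; return (elapsed seconds, last fetched row)."""
    cur = conn.cursor()
    row = None
    start = time.perf_counter()
    for _ in range(iterations):
        cur.execute(sql)
        row = cur.fetchone()
    return time.perf_counter() - start, row

# Stand-in demo: SQLite with synthetic rows, NOT the real CAP data or Oracle
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CapSample "
    "(CardRef INTEGER PRIMARY KEY, CLogP REAL, Num_H_Acceptors INTEGER)"
)
conn.executemany(
    "INSERT INTO CapSample VALUES (?, ?, ?)",
    [(i, i % 8, i % 13) for i in range(10000)],
)
conn.execute("CREATE INDEX idx_clogp ON CapSample (CLogP)")
conn.execute("CREATE INDEX idx_hacc ON CapSample (Num_H_Acceptors)")

sql = "SELECT count(*) FROM CapSample WHERE CLogP>5 and Num_H_Acceptors>10"
elapsed, row = time_query(conn, sql, iterations=100)
print(f"{elapsed:.3f} s per 100 iterations, count = {row[0]}")
```

Running the same function against each table in turn, several times and in alternating order, is what makes a consistent difference like the one below stand out from noise.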


The first thing was to examine the execution plans.  Aha! They were different.  Both were using hash joins on the two indexes, but the order of the two index range scan searches was different for the two tables.  Obviously, the CapSample2 order was better.  But why wasn't it choosing it for CapSample?  At this point, I noticed a note at the end of the explain plan output for CapSample2:






- dynamic sampling used for this statement


This wasn't there for CapSample. Why not? Because I'd imported CapSample the day before and only created CapSample2 today! During the night, statistics had been gathered automatically on CapSample. However, I had added the indexes to both tables only after creating CapSample2, so the indexes on CapSample had no statistics, even though the table did.


All I had to do was gather default statistics for both tables again.  Then, being careful to slightly change my SQL so that I didn't hit any cached plans, I re-explained the queries on both tables and bingo! I got consistent results and they matched those for the fast search of CapSample2.  Running the searches on both tables now gave me the 2.79 seconds I'd seen earlier.


As a final sanity check, I re-timed the search over CapSample using the original SQL and I got the original time of 3.85 seconds again. I was hitting the cached plan: Oracle used it even though the statistics had changed. It seems weird running two queries that look identical except for an extra space character and finding that one runs over 25% faster than the other, but that's what happens when you have cached plans.


So the moral(s) of this tale are:


1. When you change tables significantly or add indexes, gather table statistics for the changed tables and gather index statistics for changed or new indexes.


2. Oracle’s dynamic sampling can be very good.  However, you might want to gather proper statistics immediately after changes if you are automatically gathering statistics on your tables.  Otherwise, you could find the plan changes later (when cached plans are replaced).


3. Remember to either clear cached plans or change the SQL statement slightly after you have gathered new statistics to avoid hitting old cached plans.
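
As a rough illustration of morals 1 and 3, the Python sketch below only builds the statement text; it does not connect to a database. The helper names are hypothetical. `DBMS_STATS.GATHER_TABLE_STATS` with `cascade => TRUE` and `ALTER SYSTEM FLUSH SHARED_POOL` are standard Oracle features, but check the options against the documentation for your version. I also assume the tables were created with unquoted names, and I tag the SQL with a comment rather than the extra space I used above.

```python
def gather_stats_plsql(table_name):
    """PL/SQL block that gathers statistics for a table and, via cascade, its indexes."""
    if not table_name.isidentifier():
        raise ValueError(f"unexpected table name: {table_name!r}")
    return (
        "BEGIN\n"
        "  DBMS_STATS.GATHER_TABLE_STATS(\n"
        "    ownname => USER,\n"
        f"    tabname => '{table_name.upper()}',\n"
        "    cascade => TRUE);\n"
        "END;"
    )

def vary_sql_text(sql, tag):
    """Append a comment so the text no longer matches a previously cached statement."""
    return f"{sql} /* {tag} */"

# Clears cached plans for the whole instance; needs privileges, use with care
FLUSH_SHARED_POOL = "ALTER SYSTEM FLUSH SHARED_POOL"

for table in ("CapSample", "CapSample2"):
    print(gather_stats_plsql(table))

print(vary_sql_text(
    "SELECT count(*) FROM CapSample WHERE CLogP>5 and Num_H_Acceptors>10",
    "after_stats",
))
```

Changing the statement text affects only your own query, so it is usually the gentler option on a shared machine; flushing the shared pool forces every session to re-parse.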



Categories: Data Mining & Knowledge Discovery Tags: statistics, pipeline-pilot, chemicals-available-for-purchase, oracle

Picking up on my last entry, I thought that the comments to that In the Pipeline post nicely outlined some of the very real reasons why academia (or, more accurately, many non-large-pharma research teams) have resisted ELNs. Here's what I found illuminating:

  • Despite the snarky comment suggesting otherwise, ELNs were barely an option back in 2004. And even if we could wormhole back in time with a modern ELN, that ELN would be (as the snarky commenter suggested) designed by and for large enterprises. While vendors often make discounts available for smaller or academic sites, the ELN delivered is still intended for a large research team, making it difficult to implement and leaving scientists in a small lab paying for lots of unused functionality.

  • Several commenters offer their own experiences with ELNs, both good and bad. Clearly, at least a few academics have put ELNs in place. Besides cost, implementation headaches were the biggest complaint reported by these commenters and this shouldn’t surprise. Unlike industry, where most organizations have at least one IT specialist to aid researchers, academic labs experience turnover at least every four years. Can researchers really expect graduate students or post-docs working on research grants and papers to also implement, update, and maintain an ELN?

  • Many academics retort when approached about ELNs that the proof is in the paper. Well, the sad saga discussed here certainly dismisses that argument. Several commenters note that they have set up makeshift electronic archives, usually by scanning notebook pages as images or PDFs. But one wise commenter points out that these methods are only replacing paper (and not awfully well at that) and they certainly aren’t providing the benefits an electronic lab environment can deliver, such as the ability to search on methods, clone experiments, and build on past studies.

The same commenter who notes the benefits of ELNs points out that if cost were off the table and academics could get over a cultural bias that privileges paper notebooks and views ELN merely as another “hole for information,” “ELN is perfect for academic labs.” I’d agree—academics collaborate and have a real need to share methods across time and people. What say you? Do you have any experience using or implementing an ELN in an academic setting? What barriers do you see to adoption?

Categories: Electronic Lab Notebook Tags: eln, guest-authors, academic-labs