Traditionally, development has taken a paper-based, stepwise approach to delivering pharmaceutically active substances, first to clinical trial patients and then to the mass market. I have often heard the process development environment described as a relay race: research hands a baton (or API) to development, and development has to run the middle leg as fast and as safely as it can before handing the baton (now a product) to manufacturing. Development not only has to get the baton to the final runner safely; it also has to prepare the track so that the runner has a smooth, safe, fast and glorious finish. Times are changing, however.
In 2006 the FDA launched the Quality by Design (QbD) initiative, which takes a more business-oriented approach to releasing pharmaceutical products to market. With the goal of supporting continuous product improvement, the FDA guidelines recommend an approach that breaks the “relay race” concept, in which the baton is a simple hand-off. QbD relies on an organization demonstrating a full understanding of the “design space” of the product and of the production processes that affect the “critical quality attributes” (CQAs). If an organization can satisfactorily demonstrate which variables, and how much variation within the design space, do not affect the CQAs, it can operate within that space without additional FDA approval. The opportunities are huge: drug manufacturers can alter products and processes within permissible limits without delay, delivering higher-quality, lower-cost drugs to market faster.
The QbD approach is thus a radical change to the baton hand-off, requiring a new approach not only to the development processes but also to the communication between development and manufacturing. To understand the design space, a multivariate experimental approach is required during development, in which experiments are performed in parallel with different variables to help elucidate the “design space” that maintains the critical quality attributes of the drug. This typically requires more experimentation up front to improve insight into the design space, and less of a serial approach with the traditional pass/fail methodology for product progression.
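To make "experiments performed in parallel with different variables" a little more concrete, here is a minimal sketch of one simple multivariate layout, a full factorial design, in which every combination of variable levels becomes one run. The variable names and levels are hypothetical and chosen only for illustration; real QbD studies often use more economical designs than this.

```python
from itertools import product

# Hypothetical process variables and levels (assumptions for illustration only).
factors = {
    "temperature_c": [20, 40, 60],
    "ph": [5.0, 7.0],
    "mixing_speed_rpm": [100, 300],
}


def full_factorial(factors):
    """Return one dict per experimental run, covering every combination of levels."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


runs = full_factorial(factors)
print(len(runs))  # 3 * 2 * 2 = 12 runs
print(runs[0])    # {'temperature_c': 20, 'ph': 5.0, 'mixing_speed_rpm': 100}
```

Each run can be carried out independently of the others, which is what allows the work to proceed in parallel rather than as a series of pass/fail steps.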
The pharma industry has been slow to embrace QbD for various reasons; however, recent moves by the FDA mean that QbD is no longer merely a recommendation. Pharma development has been built on well-defined processes within expensive, tried-and-tested environments. Embracing QbD requires changes to mindsets, processes and supporting infrastructure, which in itself generates initial cost and risk. The activation energy to move pharma development into full-swing QbD will come from two places: first, regulation from the FDA, and second, software vendors that can provide an infrastructure to support the new process and the laboratory data management needs of QbD.
One of the common themes in the next gen sequencing (NGS) community is the large amount of data we generate, analyze, and try to store somewhere. While we are blessed with plenty of data, let's pause for a moment and consider another scientific domain - high energy physics. ATLAS, a particle detector experiment at CERN, generates "about 25 megabytes per event (raw; zero suppression reduces this to 1.6 MB) times 23 events per beam crossing, times 40 million beam crossings per second in the center of the detector, for a total of 23 petabyte/second of raw data."
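The arithmetic in that quote is easy to check. The sketch below simply multiplies the three quoted figures, assuming decimal units (1 MB = 10^6 bytes, 1 PB = 10^15 bytes).

```python
# Decimal units are an assumption; the quote does not say which convention it uses.
MB = 10**6
PB = 10**15

raw_bytes_per_event = 25 * MB
events_per_crossing = 23
crossings_per_second = 40 * 10**6

raw_bytes_per_second = raw_bytes_per_event * events_per_crossing * crossings_per_second
print(raw_bytes_per_second / PB)  # 23.0 petabytes per second
```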
Still think we have a lot of data? Real-time imagery from satellites, financial transaction data, and web content also provide plenty.
Well, that's nice to know, but we deal with NGS data. So let's focus there. Illumina's HiSeq 2000 produces "up to 25 Gb per day for a 2 x 100 bp run." What's the day-to-day impact of producing and/or consuming this volume of data?
Single files can be big - hundreds of gigabytes for compressed, mapped reads stored in a BAM file. That means that copying, computing MD5 checksums, querying, and visualizing all take more time, computing resources, and disk space than we'd like. And transferring data, whether from NCBI's Sequence Read Archive (at least for now - SRA is being discontinued) or to the cloud, e.g., Amazon Web Services, isn't exactly quick. NCBI's use of Aspera Connect and AWS Import/Export do help. As a friend at a sequencing company once told me, NGS data volume "affects the whole organization".
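As a small illustration of why file size matters even for something as routine as a checksum, here is a minimal sketch of computing an MD5 digest for a file too large to read into memory in one go. The function name and the chunk size are my own choices, not part of any particular NGS tool.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file by reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" at end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Memory use stays flat regardless of file size, but every byte still has to come off the disk, so the running time grows with the size of the file.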
Oh, and as Lincoln Stein has pointed out in his popular graph (as used by Monya Baker), it's getting worse.
So an NGS software solution needs to do more than just provide algorithms for analysis; it needs to help end users manage their data. It needs to have the ability to organize and query data that resides on multiple disks, with the flexibility to adapt when storage is added, removed, or rearranged.
We often run surveys to better understand what our customers want. One recent result confused me, and since it was statistically very significant I would welcome your feedback. When we asked a community of scientists, 83% of the 171 modelers (Materials Studio and Discovery Studio users) who replied said they were interested in using an ELN. I came to one of two conclusions: either people were hoping that voting positively would improve their odds of winning the free iPad on offer, or there is genuine interest among modelers in using an ELN.
No one in the industry disputes that an ELN is more productive than paper notebooks for documentation, IP and information sharing, but what specifically would the modeling community like from an ELN?
Track and recall tweaks to models?
Document what you did, when and why for any run?
Proof of invention around models?
Ability to capture and later re-initiate models?
Ability to share the results of the models with colleagues?
Ability to deploy models to scientists' desktops using the ELN?
As a modeler, where would you see high value and use for an ELN?
Some diseases are associated with sequence variants - an 'A' instead of a 'T' in the case of sickle-cell disease - so looking for variants is a pretty common task, especially with sequencing data. Often that's the reason for the experiment in the first place. There are several common types of variants. Single Nucleotide Polymorphisms (SNPs) are single-basepair changes. A SNP may be innocuous (a synonymous SNP, also known as a silent mutation) or it may have a life-threatening impact, e.g., the translated protein now has a different amino acid (a missense SNP) or is prematurely truncated (a nonsense SNP). An indel (meaning an insertion or deletion) can cause either an extra or missing amino acid (if the length of the indel is a multiple of 3) or result in completely different amino acids (a frameshift mutation). A third kind of variant is a Copy-Number Variation (CNV). A CNV indicates that there are either too many or too few copies of one or more regions of DNA; both can be problematic.
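The SNP and indel categories above can be sketched in a few lines. This is a toy illustration rather than a real annotation tool: it assumes the standard genetic code, a single codon taken from the coding strand, and function names of my own invention.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp(codon, offset, new_base):
    """Classify a single-base change at a 0-based offset within one coding-strand codon."""
    mutated = codon[:offset] + new_base + codon[offset + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    if after == before:
        return "synonymous"
    if after == "*":
        return "nonsense"
    # Simplification: every other amino acid change, including loss of a stop, lands here.
    return "missense"


def classify_indel(length):
    """An indel whose length is a multiple of 3 keeps the reading frame intact."""
    return "in-frame" if length % 3 == 0 else "frameshift"


print(classify_snp("GAG", 1, "T"))  # missense: GAG (Glu) becomes GTG (Val)
print(classify_snp("GAA", 2, "G"))  # synonymous: GAA and GAG both encode Glu
print(classify_snp("TAC", 2, "A"))  # nonsense: TAC (Tyr) becomes TAA (stop)
print(classify_indel(3), classify_indel(2))  # in-frame frameshift
```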
Okay, so let's assume we have a hypothesis that a disease we're studying is associated with one or more SNPs. How might we test that? Joining the wave of those adopting next gen sequencing, we could sequence healthy tissue and diseased tissue. Reads are short DNA sequences produced by sequencing machines. They can be paired, unpaired, base space, color space, ... Once we've got our reads, we map (align) the reads to a set of reference sequences using any of a number of programs. Next we compare the set of mapped reads' nucleotides at each reference sequence location to see if there are any single-basepair differences. Assuming we find SNPs, what might we conclude? We could check a SNP database, e.g., dbSNP, to see if the SNP is novel. We could use a viewer like GBrowse and investigate other annotations in the same location. For example, is a given SNP inside a gene region or upstream from it, potentially impacting the expression of the gene? We might find something really interesting.
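To give a feel for the "compare the mapped reads' nucleotides at each reference location" step, here is a deliberately naive sketch. It assumes reads are already mapped without gaps and are given as (0-based start position, sequence) pairs; the thresholds and the tiny data set are made up. Real SNP callers also weigh base qualities, mapping qualities, and much more.

```python
from collections import Counter, defaultdict


def call_snps(reference, aligned_reads, min_depth=3, min_fraction=0.7):
    """Report positions where the most common read base differs from the reference."""
    pileup = defaultdict(Counter)
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                pileup[pos][base] += 1

    snps = []
    for pos in sorted(pileup):
        counts = pileup[pos]
        depth = sum(counts.values())
        base, count = counts.most_common(1)[0]
        if depth >= min_depth and base != reference[pos] and count / depth >= min_fraction:
            snps.append((pos, reference[pos], base))
    return snps


# Made-up reference and reads: three reads carry a T at position 4, one carries the reference A.
reference = "ACGTACGTAC"
reads = [(0, "ACGTTCG"), (2, "GTTCGTA"), (3, "TTCGTAC"), (1, "CGTACGT")]
print(call_snps(reference, reads))  # [(4, 'A', 'T')]
```

Raising `min_fraction` to 0.8 makes the call at position 4 disappear, a small reminder of how much the result depends on the choices a program makes.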
Or we could have found nothing at all. Nada. Garbage. Why?
After starting with our tissue samples we went through several steps, including sequencing and a number of computations. Let's assume for now that the sequencing was perfect - bad assumption, but bear with me - and focus on the computations. Computer programs are implementations of algorithms or processing steps. Hence, programs can differ either in the implementation (two ways of expressing the same thing) or in the underlying algorithm (expressing two different things). Since variant analysis - and mapping and next gen sequencing, for that matter - is an active area for research, we really shouldn't expect that two programs would find exactly the same set of SNPs. Be kind of boring if that happened, actually.
So, the SNPs we found may not have anything at all to do with our biological samples. Instead they may be mere artifacts of the calculation we performed. Arghh! What now?
A common technique is to perform the calculations more than once, using different programs. So, maybe we map the reads with both Bowtie and BWA. Or maybe we use several different SNP-calling programs. We can then compare the lists of SNPs we get and see which ones are found by all of the programs or by most of them or only a few or just one. If a SNP is only found in one list, does that mean it's bogus? Not necessarily. Maybe that particular algorithm is really good. So how do we know? Is this an art or a science? I think it's both.
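Comparing the lists is the easy part. The sketch below counts how many programs reported each SNP, assuming each SNP is represented as a (chromosome, position, reference base, alternate base) tuple; the caller names and the calls themselves are invented for illustration.

```python
from collections import Counter


def snp_support(calls_by_program):
    """Count how many programs reported each SNP."""
    support = Counter()
    for snps in calls_by_program.values():
        # set() guards against a program listing the same SNP twice.
        support.update(set(snps))
    return support


# Invented calls from three hypothetical SNP callers.
calls = {
    "caller_a": {("chr1", 100, "A", "T"), ("chr1", 250, "G", "C")},
    "caller_b": {("chr1", 100, "A", "T"), ("chr2", 40, "C", "G")},
    "caller_c": {("chr1", 100, "A", "T"), ("chr1", 250, "G", "C")},
}

support = snp_support(calls)
consensus = sorted(snp for snp, n in support.items() if n == len(calls))
singletons = sorted(snp for snp, n in support.items() if n == 1)
print(consensus)   # [('chr1', 100, 'A', 'T')]
print(singletons)  # [('chr2', 40, 'C', 'G')]
```

The counts only say how many programs agree. Deciding what to do with the singletons is where the art comes in.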