Some diseases are associated with sequence variants - an 'A' instead of a 'T' in the case of sickle-cell disease - so looking for variants is a pretty common task, especially with sequencing data. Often that's the reason for the experiment in the first place. There are several common types of variants. Single Nucleotide Polymorphisms (SNPs) are single basepair changes. A SNP may be innocuous (a synonymous SNP, also known as a silent mutation) or it may have a life-threatening impact, e.g., the translated protein now has a different amino acid (a missense SNP) or is prematurely truncated (a nonsense SNP). An indel (meaning an insertion or deletion) can either add or remove whole amino acids (if the length of the indel is a multiple of 3) or shift the reading frame so that the downstream amino acids are completely different (a frameshift mutation). A third kind of variant is a Copy-Number Variation (CNV). A CNV indicates that there are either too many or too few copies of one or more regions of DNA; either can be problematic.
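To make those categories concrete, here is a rough Python sketch that labels a single-base change within one codon as synonymous, missense, or nonsense, and an indel as in-frame or frameshift. The function names (`classify_snp`, `classify_indel`) are made up for illustration. The sketch assumes the standard genetic code, a coding-strand DNA codon, and a reference codon that is not itself a stop codon.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_snp(ref_codon, offset, alt_base):
    """Label a single-base change inside one codon (hypothetical helper).

    ref_codon: three-letter coding-strand DNA codon, e.g. "GAG"
    offset:    0, 1, or 2, the position within the codon that changes
    alt_base:  the variant base at that position
    Assumes ref_codon is not a stop codon.
    """
    alt_codon = ref_codon[:offset] + alt_base + ref_codon[offset + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if alt_aa == ref_aa:
        return "synonymous"   # same amino acid: a silent mutation
    if alt_aa == "*":
        return "nonsense"     # premature stop: truncated protein
    return "missense"         # a different amino acid


def classify_indel(length):
    """Label an indel by whether it preserves the reading frame."""
    return "in-frame" if length % 3 == 0 else "frameshift"


print(classify_snp("GAG", 1, "T"))  # GAG (Glu) -> GTG (Val)
print(classify_indel(2))
```

Real annotation tools also have to work out which codon a genomic position falls in, which strand the gene is on, and whether the position is coding at all. None of that is attempted here.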
Okay, so let's assume we have a hypothesis that a disease we're studying is associated with one or more SNPs. How might we test that? Joining the wave of those adopting next gen sequencing, we could sequence healthy tissue and diseased tissue. The sequencing machines produce reads, which are short DNA sequences. They can be paired, unpaired, base space, color space, ... Once we've got our reads, we map (align) them to a set of reference sequences using any of a number of programs. Next we compare the mapped reads' nucleotides at each reference sequence location against the reference base to see if there are any single-basepair differences. Assuming we find SNPs, what might we conclude? We could check a SNP database, e.g., dbSNP, to see if the SNP is novel. We could use a viewer like GBrowse and investigate other annotations in the same location. For example, is a given SNP inside a gene region or upstream from it, potentially impacting the expression of the gene? We might find something really interesting.
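The "compare nucleotides at each reference location" step can be sketched in a few lines of Python. This is a deliberately naive toy, not how any real SNP caller works. It assumes the reads are already mapped, ungapped, and given as (0-based start, sequence) pairs that lie entirely within the reference. The names `pileup` and `call_snps` and the thresholds `min_depth` and `min_fraction` are invented for this example.

```python
from collections import Counter, defaultdict


def pileup(aligned_reads):
    """Count the bases observed at each reference position.

    aligned_reads: iterable of (start, sequence) pairs; start is 0-based
    and reads are assumed to align without gaps.
    """
    counts = defaultdict(Counter)
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            counts[start + offset][base] += 1
    return counts


def call_snps(reference, aligned_reads, min_depth=3, min_fraction=0.8):
    """Report positions where the dominant read base differs from the reference.

    min_depth and min_fraction are arbitrary illustrative thresholds.
    Returns a list of (position, reference_base, variant_base) tuples.
    """
    snps = []
    for pos, bases in sorted(pileup(aligned_reads).items()):
        depth = sum(bases.values())
        base, n = bases.most_common(1)[0]
        if depth >= min_depth and base != reference[pos] and n / depth >= min_fraction:
            snps.append((pos, reference[pos], base))
    return snps


reference = "ACGTACGTAC"
reads = [(0, "ACGTTC"), (2, "GTTCGT"), (3, "TTCGTAC")]
print(call_snps(reference, reads))  # [(4, 'A', 'T')]
```

Even in this toy the answer depends on choices like the depth and fraction cutoffs. Keep that in mind for what follows.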
Or we could have found nothing at all. Nada. Garbage. Why?
After starting with our tissue samples we went through several steps, including sequencing and a number of computations. Let's assume for now that the sequencing was perfect - bad assumption, but bear with me - and focus on the computations. Computer programs are implementations of algorithms or processing steps. Hence, programs can differ either in the implementation (two ways of expressing the same thing) or in the underlying algorithm (expressing two different things). Since variant analysis - and mapping and next gen sequencing, for that matter - is an active area of research, we really shouldn't expect that two programs would find exactly the same set of SNPs. Be kind of boring if that happened, actually.
So, the SNPs we found may not have anything at all to do with our biological samples. Instead they may be mere artifacts of the calculation we performed. Arghh! What now?
A common technique is to perform the calculations more than once, using different programs. So, maybe we map the reads with both Bowtie and BWA. Or maybe we use several different SNP-calling programs. We can then compare the lists of SNPs we get and see which ones are found by all of the programs or by most of them or only a few or just one. If a SNP is only found in one list, does that mean it's bogus? Not necessarily. Maybe that particular algorithm is really good. So how do we know? Is this an art or a science? I think it's both.
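Comparing the lists is the easy part to automate. Here is a minimal Python sketch that tallies how many programs reported each SNP and then groups the SNPs by that count. It assumes each program's calls have already been normalized to (chromosome, position, reference base, variant base) tuples. The program names, the coordinates, and the helper functions `tally_support` and `group_by_support` are all made up for illustration.

```python
from collections import Counter


def tally_support(calls_by_program):
    """Count how many programs reported each SNP.

    calls_by_program: dict mapping a program name to an iterable of
    SNPs, each a (chrom, pos, ref, alt) tuple.
    """
    support = Counter()
    for snps in calls_by_program.values():
        support.update(set(snps))  # set(): count each program at most once
    return support


def group_by_support(calls_by_program):
    """Group SNPs by the number of programs that found them."""
    groups = {}
    for snp, n in tally_support(calls_by_program).items():
        groups.setdefault(n, set()).add(snp)
    return groups


# Hypothetical call sets from three hypothetical SNP callers.
calls = {
    "caller_a": [("chr1", 100, "A", "T"), ("chr1", 250, "G", "C")],
    "caller_b": [("chr1", 100, "A", "T"), ("chr2", 75, "C", "T")],
    "caller_c": [("chr1", 100, "A", "T"), ("chr1", 250, "G", "C")],
}

for n, snps in sorted(group_by_support(calls).items(), reverse=True):
    print(n, sorted(snps))
```

The tally only tells you where the programs agree. Deciding what to do with a SNP that only one program found is still the judgment call described above.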