Wang Zheng Yuan: Exome Sequencing Comes of Age

Tuesday, March 12, 2013

With genomes, as with haute cuisine, less sometimes is more. Sure, researchers using next-generation DNA sequencers can whip out several full human genomes’ worth of sequence data in a single run. But interpreting those data is another story: assembling the reads, identifying where the genome differs from the reference sequence and determining which of those variants, if any, might underlie some interesting biology.
Geneticists can make their lives easier by concentrating their efforts on that small fraction of the genome that encodes protein—the so-called “exome.” Representing just 1% or so of the overall sequence, an exome is like a Reader’s Digest condensed version of the genome—short, to the point, and less expensive than the full-length original.
Exome sequencing represents an “effective compromise between the competing goals of genome-wide comprehensiveness and cost-control,” wrote University of Washington genomicist Jay Shendure in a 2011 editorial to a special issue of Genome Biology on the topic of exome sequencing. [1] Basically, because a sequencer can only push out a finite number of bases, the more samples that can be combined per run through sample “barcoding,” the less each sample costs, and the more deeply each sample can be read.
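The arithmetic behind that trade-off can be sketched in a few lines of Python. This is a back-of-the-envelope illustration only: the run output (600 Gb), exome target size (50 Mb) and on-target fraction (0.75) are assumed values chosen for the example, not figures from any vendor, and the function names are hypothetical.

```python
# Illustrative constants (assumptions, not vendor specifications).
RUN_OUTPUT_BASES = 600_000_000_000   # assumed yield of one sequencing run
GENOME_SIZE_BASES = 3_100_000_000    # approximate human genome size
EXOME_SIZE_BASES = 50_000_000        # assumed capture target size


def mean_depth(run_output_bases, n_samples, target_size_bases, on_target_fraction=1.0):
    """Approximate mean fold-coverage per sample when one run is split evenly."""
    if n_samples <= 0 or target_size_bases <= 0:
        raise ValueError("n_samples and target_size_bases must be positive")
    usable_bases = run_output_bases * on_target_fraction
    return usable_bases / (n_samples * target_size_bases)


def max_samples(run_output_bases, target_size_bases, desired_depth, on_target_fraction=1.0):
    """Largest number of evenly pooled samples that still reach desired_depth."""
    if target_size_bases <= 0 or desired_depth <= 0:
        raise ValueError("target_size_bases and desired_depth must be positive")
    usable_bases = run_output_bases * on_target_fraction
    return int(usable_bases // (target_size_bases * desired_depth))


if __name__ == "__main__":
    # Whole genomes at 30-fold versus exomes at 100-fold from the same run.
    print(max_samples(RUN_OUTPUT_BASES, GENOME_SIZE_BASES, 30))
    print(max_samples(RUN_OUTPUT_BASES, EXOME_SIZE_BASES, 100, on_target_fraction=0.75))
```

Under these assumed numbers, the same run accommodates 6 whole genomes at 30-fold coverage or 90 barcoded exomes at 100-fold coverage, which is the sense in which pooling lowers per-sample cost while allowing deeper reads.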
Another benefit of exome sequencing is interpretability. Whole-genome sequencing generates a lot of data, but for the majority of base pairs, researchers don’t know their relationship to disease, explains Yaping Yang, laboratory director for the Whole Genome Laboratory at Baylor College of Medicine, a clinical laboratory that offers an exome-sequencing service. “Therefore, they are not helpful in making molecular diagnoses.”
In effect, exome sequencing is just a special form of targeted sequencing, an application in which researchers capture specific genomic segments for sequencing analysis. The difference is that instead of capturing, say, all the genes implicated in a particular cellular pathway, exome sequencing selects every exon of every protein-coding gene, and sometimes 5’ and 3’ untranslated sequences, as well—a collection numbering in the tens of megabases.
If PubMed is any guide, getting at those megabases has become very popular indeed: The database lists nearly 1,300 references including the term ‘exome,’ almost all of them published since 2011. Commercial vendors now offer tools to help researchers get in on the act. If you’ve been thinking about spicing up your genetics work with, as Genome Biology called it, “that special ‘exome factor’,” read on. [2] We’ll help you identify a solution that meets your needs.

Solution-based hybridization

Commercial options for exome capture fall into two basic categories, solution-based hybridization and PCR. Kits for the former are available from Agilent Technologies, Illumina and Roche NimbleGen. Clinically focused PCR-based kits can be obtained from Agilent. (Roche NimbleGen used to offer hybridization capture on planar microarrays but has discontinued that line.)
Solution hybridization-based techniques all follow the same basic protocol: Mix fragmented genomic DNA with biotinylated capture oligonucleotides, hybridize, capture hybrids on streptavidin-conjugated microbeads, wash away the unbound material and release what’s left. The differences lie largely in the details.
Agilent’s SureSelect Human All Exon V5 and V5+UTRs kits use pools of several hundred-thousand 120-mer biotinylated RNA capture probes to enrich exonic sequences (as well as, optionally, 5’ and 3’ untranslated regions and up to 6 Mb of custom sequence).
According to Olle Ericsson, Agilent’s marketing director for DNA Sequencing, RNA-DNA hybrids are stronger than comparably sized DNA-DNA hybrids, leading to more efficient capture. In addition, the system’s use of very long oligos means that even sequences containing short insertions and deletions (or “indels”) can be captured efficiently.
Version 5, the newest iteration of the SureSelect exome line, was launched last fall at the American Society for Human Genetics national meeting. According to Ericsson, V5 features updated content as well as a new, streamlined workflow that reduces sample preparation time by about a half-day, largely thanks to a shorter hybridization step: “If you start a prep on day one, you will have [the samples] ready to sequence on day two” (as opposed to the following morning).
The SureSelect system is available in two forms. SureSelect XT enables barcoding and mixing of samples after capture, and SureSelect XT2 barcodes (or “indexes”) samples before capture. The choice of which to use “depends on how you prefer to set up your workflow,” Ericsson says. To do pre-capture indexing, all the samples must be available at the same time. If samples tend to dribble in one or two at a time, post-capture indexing might make more sense. (Pre-capture pooling approaches also typically perform slightly worse than post-capture pooling, he adds.)
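Whichever point in the workflow the indexes are added, their purpose is the same: once pooled samples have been sequenced together, each read is assigned back to its sample by its barcode. A minimal sketch of that sorting step follows. It assumes exact barcode matches and uses made-up six-base barcodes and a hypothetical `demultiplex` function; production pipelines rely on the sequencer vendor's software and usually tolerate sequencing errors in the index.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample):
    """Group (barcode, sequence) pairs by sample name.

    Reads whose barcode is not in barcode_to_sample are collected under
    the key "undetermined" rather than being dropped.
    """
    bins = defaultdict(list)
    for barcode, sequence in reads:
        sample = barcode_to_sample.get(barcode, "undetermined")
        bins[sample].append(sequence)
    return dict(bins)


if __name__ == "__main__":
    # Hypothetical barcodes and reads, for illustration only.
    barcodes = {"ACGTAC": "sample_A", "TGCATG": "sample_B"}
    pooled_reads = [
        ("ACGTAC", "GATTACA"),
        ("TGCATG", "CCGGTTA"),
        ("ACGTAC", "TTAGGCA"),
        ("NNNNNN", "ACACACA"),
    ]
    for sample, sequences in demultiplex(pooled_reads, barcodes).items():
        print(sample, len(sequences))
```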
Also based on solution capture is Roche NimbleGen’s SeqCap EZ Exome Library v3.0, which uses a capture library comprising some 2.1 million DNA oligonucleotides averaging 80 bases in length. That design, says Thomas Albert, global head of technology innovation at Roche Applied Science, gives the company considerable flexibility in terms of how it positions its capture probes.
“We can put more probes in some places than in others, or make them longer, or shift them around in different ways—these are all things we can do because we have a larger number of smaller probes,” Albert says.
The reason that is necessary, he explains, is non-uniformity—variations in melting-temperature, secondary structure and cross-hybridization efficiency across the genome such that, though an exome might be sequenced to, say, 100-fold coverage, some regions will be over-represented and others possibly skipped altogether. That means crucial variants could be overlooked or misinterpreted, leading to more sequencing and rising costs.
According to Albert, the SeqCap EZ Exome Library v3.0 was released about a year ago and captures some 64 Mb of genome sequence. More recently, the company has added the ability to capture 32 Mb of 5’ and 3’ untranslated region (UTR) sequences (SeqCap EZ Exome +UTR Library) or up to 50 Mb of custom content (SeqCap EZ Exome Plus).
Illumina offers two solution-capture systems, the stand-alone TruSeq™ Exome Enrichment Kit, which captures 62 Mb of genomic sequence using more than 340,000 95-mer probes, and the Nextera® Exome Enrichment Kit, which integrates TruSeq into a “streamlined, automation-friendly workflow [that] combines [enzyme-based fragmentation and] library preparation and exome enrichment steps, and can be easily completed in 2.5 days with minimum hands-on time,” according to product literature.

PCR-based kits

On the PCR front, Agilent launched its PCR-based HaloPlex Exome kit in February 2013, coinciding with the annual Advances in Genome Biology & Technology (AGBT) 2013 conference. Aimed mostly at the clinical market, the kit offers a simpler workflow and less input DNA (200 ng vs. 1 to 3 µg) than SureSelect, says Ericsson. In particular, he says, the HaloPlex protocol eliminates the need for mechanical shearing of the genomic template, integrating the library-preparation step into the PCR process itself.
Agilent also launched at AGBT a dedicated software package called SureCall. Unlike the more flexible (and powerful) GeneSpring software used with SureSelect, SureCall converts raw sequence data directly into mutation lists “that are classified according to industry guidelines,” Ericsson says.

In the clinic

Exomes might be simpler than whole genomes, but data interpretation is still a challenge in exome analysis, especially when medical decisions depend on the outcome. Such is the case at the growing number of clinical laboratories now offering exome-sequencing services.
At the Whole Genome Laboratory at Baylor College of Medicine, turnaround time on the lab’s exome-sequencing service is about 15 weeks, says Yang, “because the analyses and interpretation are so complicated.” The lab must consider a patient’s clinical presentation, prior testing and exome-sequencing data before it can issue a report.
Baylor’s service mainly is used to diagnose patients who are suspected of having genetic disorders that the referring physicians cannot pin down. Most are pediatric patients with neurological deficits, Yang says, and its “pick-up” rate—the fraction of cases in which a likely causative genetic mutation can be identified—ranges from 25% to 30%.
“If the clinical phenotype is not caused by a genetic defect, no matter how hard we try, we are not going to find mutations,” she says. And, of course, some mutations fall outside of the exome.
Yang’s lab sequences those exomes using three Illumina HiSeq sequencers, typically combining three samples per lane, 48 samples per run, for about 13 or 14 gigabases, or 150-fold mean coverage, per exome on average. For exome capture, the lab uses a custom Roche NimbleGen solution-hybridization design named VCRome, developed at Baylor’s Human Genome Sequencing Center. According to Yang, that system covers more than 95% of desired bases at 20-fold coverage or higher, with a capture specificity of 70% to 80%.
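For readers unfamiliar with these metrics, the sketch below shows one common way to compute them from per-base read depths. It is an illustration under stated assumptions, not Baylor's pipeline: the function names are hypothetical, the depth values are toy data, and "capture specificity" is taken here to mean the fraction of reads that map to the target, although some groups define it per base instead.

```python
def coverage_summary(depths, min_depth=20):
    """Return (mean depth, fraction of target bases covered at >= min_depth)."""
    n_bases = len(depths)
    if n_bases == 0:
        raise ValueError("depths must not be empty")
    mean = sum(depths) / n_bases
    breadth = sum(1 for d in depths if d >= min_depth) / n_bases
    return mean, breadth


def capture_specificity(on_target_reads, total_reads):
    """Fraction of sequenced reads that fall on the capture target (assumed definition)."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return on_target_reads / total_reads


if __name__ == "__main__":
    # Toy per-base depths for a five-base "target"; real inputs would come
    # from a depth report over tens of megabases.
    toy_depths = [0, 10, 20, 150, 320]
    mean, breadth = coverage_summary(toy_depths, min_depth=20)
    print(mean, breadth)
    print(capture_specificity(75, 100))
```

The toy data illustrate why mean coverage alone can mislead: the mean depth is 100-fold, yet only three of the five bases reach the 20-fold threshold.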

Which method to choose

On the face of it, any solution-based capture approach should work equally well for exome analysis. But do they?
To find out, Michael Snyder, professor and chair of genetics at Stanford University and director of the Stanford Center for Genomics and Personalized Medicine, and his team in 2011 compared the performance of all three commercial approaches. [3] The results, they report:
… suggest that the Nimblegen platform, which is the only one to use high-density overlapping baits, covers fewer genomic regions than the other platforms but requires the least amount of sequencing to sensitively detect small variants. Agilent and Illumina are able to detect a greater total number of variants with additional sequencing. Illumina captures untranslated regions, which are not targeted by the Nimblegen and Agilent platforms. [3]
(Today, UTR coverage is an option on all three platforms.)
In short, says Snyder, “They all worked pretty well.” But his lab prefers SureSelect, he says, “because it has a nice balance.”
Snyder says exome sequencing offers two advantages over whole-genome sequencing, even in this age of falling whole-genome prices. First, the lower cost means larger populations can be studied than might otherwise be possible. “Most projects tend to be budget-driven,” he says. But exomes also enable deeper sequencing than whole genomes—a consequence of the fact that, again, a sequencer can only push out so many bases. Snyder’s 2011 study routinely identified several thousand variants in exomes that were missed in the corresponding whole-genome sequences. “And because they’re in exomes, they tend to be things that you care about, because they’re coding,” he says.
On the other hand, whole-genome sequencing captures everything and can more effectively identify structural rearrangements that exome sequencing might miss. As a result, when dealing with clinical samples, Snyder’s lab tends to err on the side of caution and capture both an exome and a whole genome. “That’s to get us the extra coverage,” Snyder says. “We feel more confident about our calls.”
References
[1] Shendure, J, “Next-generation human genetics,” Genome Biology, 12:408, 2011.
[2] Stower, H, “The exome factor,” Genome Biology, 12:407, 2011.
[3] Clark, MJ, et al., “Performance comparison of exome DNA sequencing technologies,” Nature Biotechnology, 29:908-14, 2011.

Comments:

Boris U. • A very nice overview, thank you very much! Because exome sequencing produces dramatically fewer data than whole-genome sequencing, it is significantly cheaper and easier to handle. And even with fewer data, it is not unusual to identify over 2 million single nucleotide polymorphisms (SNPs) and 1 million short insertions and deletions (INDELs) per sample.

15 days ago

Piero C. • The issue I have here is that regulatory regions are often outside the exons. It is OK to survey the exons to pick up a low hanging fruit, but the analysis is not necessarily finished with the exome.

11 days ago

Tim H. • The ultimate issue is that exomes are compromises by definition; they can be useful but are really chosen because of ancillary "costs," not for being qualitatively better than whole genomes. They will remain useful for very limited questions and those that require very deep sequencing for some time. But the lack of regulatory regions and actual lack of total exome coverage, along with other things, will likely force most people to migrate from them - or at least do both - as fast as they practically can. The direct sequencing costs are so low these days that those costs drive the choice less and less. And the informatics costs continue to drop as well.

10 days ago

Louis F. • The comparison to the Reader's Digest condensed version of a book is both cute and apt. The issue is, then, do we want to read the Reader's Digest condensed version of a patient's genome? The current answer is probably "it depends", because of the cost compromise, as discussed in the article, and also on the specific situation. But, as Piero and Tim said, the limitations of exome sequencing are real and we clearly will have to go beyond that ASAP. Victor Hugo put so much more in "Les Miserables" than what we see in the play or than what Reader's Digest would offer.

And then again, the other point in the article remains valid: handling all the data coming from WGS is not only onerous and difficult, but we just don't know what to do with it (at this time anyway, and I think it will be quite a good long while yet before we find some real utility of the whole thing). From that perspective, it might be possible that the Reader's Digest version is not actually a compromise but is very close to what we really need - or can make use of at this time. It wouldn't be too hard to add 5'-flanking and other regulatory sequences to those exome libraries, the same way that the UTR sequences were added. This could be the best compromise, until we learn how to make use of the expansive intergenic sequences that still represent the very vast majority of the genome.

8 days ago

Austin S. • I think the key point is that with WGS, "we don't know what to do with it". And from the clinical standpoint, even with exome data alone, there is a lot of "we don't know what to do with it". I like the buffet analogy.

WGS is great from a research standpoint, but the diversity within a population, coupled with the diversity within an individual and the difficulty of understanding the relevance of that diversity, makes this even less tractable. Not to say that it isn't a worthwhile pursuit, but just a really hard problem.

8 days ago

Piero C. • Hi Austin, I agree that the genome is complex and we do not know many things. But we often see that GWAS have discovered important SNPs outside the exome, and in those regions we may be able to look for something (ENCODE segments, other signals...).

8 days ago

Boris U. • There is no question that WGS can provide significantly more information than exome sequencing. The decision comes down to costs. While significant cost differentials exist and exome sequencing can deliver useful information, exomes will remain popular. Once the cost of WGS comes down sufficiently, researchers will begin migrating from exomes.

8 days ago

Austin S. • I don't disagree with these statements, but it does boil down to what the ultimate goal is IMHO. There will always be a compromise between breadth (WGS) and depth (targeted resequencing), and whether in the sample prep or the analysis, it is an unavoidable compromise. Information about complexity is lost either way.

8 days ago

Mark S. • Yes, a very nice short summary of WES technologies as currently implemented. All diagnostic tests have some rate of return, where costs and successful diagnoses are usually in competition. In that sense there is nothing new about the decision to choose WGS vs WES. I am a researcher, but as our case cohorts are typically unexplained patients, even research findings end up leading to medical diagnoses with some potential to improve patient management. For now we still start with WES, which for clear high-penetrance monogenic disorders generates an interpretable result fairly often. Operationally the cost of WGS is still high enough that it makes sense to try WES first, and only move to WGS in cases that remain unsolved by WES.