Saturday, August 25, 2012

NGS alignment software

http://lh3lh3.users.sourceforge.net/NGSalign.shtml

Software

Currently, this page includes only software I am familiar with. Most of these programs aim to align next-generation sequencing (NGS) data and have been developed since 2007. I may extend the list when I have time. Several notes:
  • The programs are listed in alphabetical order within each category.
  • Features shown in brackets are optional and may affect efficiency.
  • The version number shown for each program is the one I have checked; it may not be the latest.

Indexing Reads with Hash Tables

  • CloudBurst [PMID:19357099]. RMAP-like algorithm that works in a cloud.
    • Platform: Illumina
    • Features: support cloud computation
    • Availability: open source
  • Cross_match [1.080730]. The latest cross_match has been substantially improved for short read alignment. Its speed is comparable to that of other aligners, and it might be the best choice for local alignment.
    • Platform: Illumina; 454
    • Features: gapped alignment (maximum 2 gaps in the fast mode); local alignment
    • Availability: source code free for academic use
  • Eland [1.0]. Probably the first short read aligner. Eland has substantially influenced many aligners in this category and still outperforms many of its followers. Although it is no longer the fastest, it is close to the fastest and has the smallest memory footprint. Eland itself works for 32bp single-end reads only. Additional Perl scripts in GAPipeline extend its capabilities.
    • Platform: Illumina
    • Features: PET mapping; mapping quality; SNP caller; counting suboptimal occurrences.
    • Advantages: fast; lightweight
    • Availability: source code free for machine buyers.
  • MAQ [0.7.1, PMID: 18714091]. This is my program to align short reads and to call variants. It has been used in several high-profile papers.
    • Platform: Illumina; SOLiD (partial)
    • Features: PET mapping; quality aware; gapped alignment for PET; mapping quality; adapter trimming; partial counting of occurrences; SNP caller
    • Advantages: feature rich; proven in publications
    • Limitation: up to 128bp reads; no gapped alignment for single-end reads
    • Availability: GPL
  • mrFast/mrsFAST [0.5.1]. An aligner specifically designed for reporting all hits.
    • Platform: Illumina
    • Features: all hits; up to 3X faster than MAQ (not tested by me); gapped alignment (mrFAST)
    • Availability: free binary
  • RazerS [20081029, PMID: 19592482]. q-gram filtration; based on the SeqAn library.
    • Availability: free source code
  • RMAP [0.41, PMID: 18307793;19736251]. One of the earliest short read aligners.
    • Platform: Illumina
    • Features: quality aware; [gapped alignment]; best unique hits
    • Availability: GPL
  • SeqMap [1.0.8, PMID: 18697769]. An Eland-like program.
    • Platform: Illumina
    • Features: [gapped alignment]
    • Limitation: not counting suboptimal hits
    • Availability: GPL
  • SHRiMP [1.10, PMID: 19461883]. Q-gram based algorithm.
    • Platform: SOLiD; Illumina; 454
    • Features: SOLiD mapping; gapped alignment; potential support for mapping quality
    • Limitations: a little slow
    • Availability: GPL
  • ZOOM [1.2.5, PMID: 18684737]. An Eland-like algorithm improved by the use of spaced seeds. ZOOM supports longer reads and is faster than Eland, although it uses more memory. ZOOM is feature rich, but some features may come at the cost of speed.
    • Platform: Illumina; SOLiD
    • Features: PET mapping; SOLiD mapping; [gapped alignment]; [mapping quality]; [quality aware]
    • Advantage: fast; feature rich
    • Limitation: up to 224bp reads; gapped alignment comes at a cost in speed
    • Availability: commercial
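
The approach shared by the programs in this category can be sketched roughly as follows: build a hash table from the reads, then scan the reference once and look up each window in that table. This is a toy illustration in Python, not the algorithm of any program listed above. The function names, the use of the first `seed_len` bases as the seed, and the mismatch limit are assumptions made for the example.

```python
from collections import defaultdict


def index_reads(reads, seed_len):
    """Map each read's leading seed to the ids of the reads that start with it."""
    table = defaultdict(list)
    for read_id, read in enumerate(reads):
        if len(read) >= seed_len:
            table[read[:seed_len]].append(read_id)
    return table


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def scan_reference(reference, reads, seed_len, max_mismatches):
    """Scan the reference once; verify every read whose seed matches the window."""
    table = index_reads(reads, seed_len)
    hits = []
    for pos in range(len(reference) - seed_len + 1):
        for read_id in table.get(reference[pos:pos + seed_len], ()):
            read = reads[read_id]
            window = reference[pos:pos + len(read)]
            # Skip windows that run off the end of the reference.
            if len(window) != len(read):
                continue
            mismatches = hamming(read, window)
            if mismatches <= max_mismatches:
                hits.append((read_id, pos, mismatches))
    return hits
```

A seed taken only from the read prefix misses hits that have a mismatch inside the seed. Real aligners typically combine several seed templates to avoid this; that step is omitted here for brevity.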

Indexing Genome with Hash Tables

  • BFAST [0.3.1, PMID: 19907642].
    • Platform: Illumina; SOLiD
    • Availability: open source
    • Comment on paper: the evaluation of Bowtie and BWA may be questionable.
  • gnumap [PMID: 19861355].
    • Platform: Illumina
    • Features: "Assigning a proportion of the read to relevant genomic matches based on the relative likelihood that the read maps to each location". (I do not know how this compares with randomly distributing repetitive reads.)
    • Comment on paper: It is possible to achieve paired-end mapping by indexing reads only. MAQ does it this way.
    • Availability: open source (?)
  • MOM [0.1, PMID: 19228804].
    • Platform: Illumina; (?)
    • Features: counting suboptimal occurrences; local alignment
    • Availability: free
  • Mosaik [1.0]. Mosaik has been used in several high-profile publications and delivers good performance.
    • Platform: Illumina; 454; SOLiD
    • Advantages: long reads
    • Availability: open source
  • NovoAlign [2.0]. NovoAlign competes with MAQ on speed and feature set, and may be more accurate than MAQ. It also implements several important features missing in MAQ.
    • Platform: Illumina
    • Features: PET mapping; gapped alignment; mapping quality; quality aware; adapter trimming; MAQ format
    • Advantages: highly accurate; gapped alignment; feature rich
    • Requirements: >8GB RAM for paired-end mapping against the human genome.
    • Availability: proprietary; academic free binary (no multi-threading support)
  • PASS [0.5, PMID: 19218350].
    • Platform: Illumina; SOLiD; 454
    • Features: PET mapping
    • Advantages: long reads
    • Requirement: >15GB RAM against human genome
    • Availability: source code free to academic users
  • PerM [0.1.0, PMID: 19675096]
    • Platform: Illumina; SOLiD
    • Advantages: fast
    • Availability: GPL
    • Limitation: apparently no paired-end mapping.
    • Requirement: 4.5 bytes per reference base
    • Comment on paper: PerM is very fast. The authors attribute its speed to the use of spaced seeds with higher weight. This is one reason but, to me, not the leading one. I think PerM is fast mainly because, when building the index, it aligns the genome against itself for a specified read length; during alignment, PerM then aligns a repetitive read once rather than to each copy. The cost is that a user needs to build a huge index for each read length.
  • SOAPv1 [1.11, PMID: 18227114]. The first published short read aligner.
    • Platform: Illumina
    • Features: PET mapping; adapter trimming; gapped alignment; SNP caller; counting occurrences
    • Advantages: feature rich
    • Requirements: >14GB RAM against human genome
    • Availability: GPL
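
The opposite layout, hashing the reference instead of the reads, can be sketched in the same toy style. The sketch below assumes contiguous k-mer seeds and only returns candidate positions, leaving verification out. `index_genome` and `candidate_positions` are hypothetical names, not functions from any program listed above.

```python
from collections import defaultdict


def index_genome(reference, k):
    """Map every k-mer in the reference to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_positions(read, index, k):
    """Candidate start positions for the read, from every k-mer seed it contains."""
    candidates = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            # Shift the seed hit back to where the read itself would start.
            start = pos - offset
            if start >= 0:
                candidates.add(start)
    return sorted(candidates)
```

Because such a table holds an entry for every reference position, it grows with the genome, which is consistent with the memory requirements quoted for several programs in this category.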

Merge Sorting

  • Slider [0.6, PMID: 18974170]. A very clever short read aligner specifically designed for Illumina reads. It is able to use the second-best base call, which potentially improves the accuracy of SNP finding.
    • Platform: Illumina
    • Features: uses the second-best base call
    • Advantages: fast; potentially more accurate in SNP discovery
    • Requirements: >160GB disk space
    • Availability: free source code
  • Slider II [1.1].
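
The general sort-and-merge idea behind this category can be illustrated with a minimal sketch: sort the reads, sort all reference windows of the same length, and walk both sorted lists in a single pass. It assumes exact matches and reads of one fixed length, and it is not Slider's actual implementation; `merge_align` is a hypothetical name.

```python
def merge_align(reference, reads, read_len):
    """Find exact matches by sorting both sides and walking them in one merge pass."""
    windows = sorted(
        (reference[pos:pos + read_len], pos)
        for pos in range(len(reference) - read_len + 1)
    )
    sorted_reads = sorted((read, read_id) for read_id, read in enumerate(reads))

    hits = []
    i = 0  # cursor into the sorted reference windows
    for read, read_id in sorted_reads:
        # Skip windows that sort before this read.
        while i < len(windows) and windows[i][0] < read:
            i += 1
        # Collect every window equal to the read without moving the main cursor,
        # so that duplicate reads still see the same windows.
        j = i
        while j < len(windows) and windows[j][0] == read:
            hits.append((read_id, windows[j][1]))
            j += 1
    return sorted(hits)
```

Both lists can be sorted on disk and streamed, so nothing in this scheme requires a large in-memory index.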

Indexing Genome with Suffix Array/BWT

  • Bowtie [0.9.9, PMID: 19261174]. This is probably the fastest short read aligner to date. Under the default options, Bowtie does not guarantee finding the best hit or tell whether the hit it finds is unique, but it is possible to improve this behaviour at the cost of speed.
    • Platform: Illumina
    • Features: partial PET mapping; quality aware; [mapping quality]
    • Advantages: very fast
    • Availability: GPL
  • BWA [0.5.1, PMID: 19451168]. Another aligner written by me. Given high-quality reads, it is an order of magnitude faster than MAQ while achieving similar alignment accuracy.
    • Platform: Illumina; SOLiD; 454; Sanger
    • Features: PET mapping (short reads only); gapped alignment; mapping quality; counting suboptimal occurrences (short reads only); SAM output
    • Advantages: fast
    • Limitations: the short-read algorithm is slow for long reads and for reads with a high error rate
    • Availability: GPL
  • SOAPv2 [2.19, PMID: 19497933]. A marvelous program developed by the group who wrote BWT-SW.
    • Platform: Illumina
    • Features: PET mapping; mapping quality; counting occurrences
    • Advantages: fast
    • Availability: academic free binary
  • segemehl [0.0.7, PMID: 19750212].
    • Platform: Illumina
    • Features: accurate
    • Limitation: large memory requirement; no paired-end mapping
    • Comment on paper: the authors show that MAQ and BWA are less accurate, probably because they counted ambiguous alignments.
  • vmatch [SpringerLink].
    • Availability: academic free binary
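
Exact matching on a BWT index is usually done with backward search, which can be sketched as below. This is a textbook version written for clarity: it builds the suffix array by naive sorting and stores full occurrence tables, which no practical aligner would do for a genome. All function names are hypothetical and do not come from any program listed above.

```python
def build_bwt(text):
    """Return the suffix array and BWT of text, with '$' appended as a terminator."""
    text += "$"
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT character is the one preceding each sorted suffix (wrapping to '$').
    bwt = "".join(text[i - 1] for i in suffix_array)
    return suffix_array, bwt


def build_fm_tables(bwt):
    """Build C (count of smaller characters) and Occ (prefix counts per character)."""
    alphabet = sorted(set(bwt))
    c_table, total = {}, 0
    for ch in alphabet:
        c_table[ch] = total
        total += bwt.count(ch)
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, ch in enumerate(bwt):
        for a in alphabet:
            occ[a][i + 1] = occ[a][i] + (1 if a == ch else 0)
    return c_table, occ


def backward_search(pattern, suffix_array, c_table, occ):
    """Return sorted reference positions where pattern occurs exactly."""
    lo, hi = 0, len(suffix_array)  # half-open interval of matching suffixes
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])
```

Each pattern character costs a constant number of table lookups, so the search time depends on the read length rather than on the genome size. Handling mismatches and gaps requires extending this search, which is not shown.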

Recommendation

First of all, as I am the key developer of two short read aligners (BWA and MAQ), it is really hard for me to give an unbiased evaluation. Please bear this fact in mind when reading through my comments below.
For Illumina reads, I would recommend my program BWA. BWA implements most of the major features of a practical aligner. It has a relatively small memory footprint and is highly efficient, with little tradeoff in accuracy. BWA outputs alignments in the SAM format. Users may use SAMtools to sort and merge alignments and to make variant calls. One potential concern is that BWA has not been widely used so far. It may be less robust than publication-proven aligners such as Eland and MAQ.
[Update: With the help of paired-end reads, MAQ is able to find some SNPs at the edge of highly repetitive regions. However, BWA cannot. Nonetheless, I still prefer BWA given its speed and the fact that SNPs that can be called from repeats are rare and more likely to be false positives.]
Mapping inconsistent read pairs with NovoAlign is recommended for PET-based structural variation detection, where alignment accuracy is the leading factor in reducing false positive calls. NovoAlign is the most accurate aligner to date.
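
Since SAM is a tab-delimited text format, alignments from BWA or any other aligner that writes SAM can be filtered with a few lines of Python. The sketch below reads only the first six mandatory columns and keeps mapped reads above a mapping quality threshold. The function names and the `min_mapq` parameter are assumptions made for illustration, not part of BWA or SAMtools.

```python
def parse_sam_line(line):
    """Split one SAM alignment line into its leading fields; None for header lines."""
    if line.startswith("@"):
        return None
    fields = line.rstrip("\n").split("\t")
    return {
        "qname": fields[0],
        "flag": int(fields[1]),
        "rname": fields[2],
        "pos": int(fields[3]),  # 1-based leftmost mapping position
        "mapq": int(fields[4]),
        "cigar": fields[5],
    }


def confident_alignments(lines, min_mapq):
    """Yield mapped records whose mapping quality is at least min_mapq."""
    for line in lines:
        record = parse_sam_line(line)
        if record is None:
            continue
        unmapped = record["flag"] & 0x4  # flag bit 0x4 marks an unmapped read
        if not unmapped and record["mapq"] >= min_mapq:
            yield record
```

For real data sets, SAMtools remains the better choice; a script like this is mainly useful for quick inspection of a small file.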
