Friday, December 20, 2013

SAMTools


SAMTools provides various tools for manipulating alignments in the SAM/BAM format. The SAM (Sequence Alignment/Map) format (BAM is just the binary form of SAM) is currently the de facto standard for storing large nucleotide sequence alignments. If you are dealing with high-throughput sequencing data, at some point you will probably have to deal with SAM/BAM files, so familiarise yourself with them!

Here are some basic commands to get your started. For more information on the SAM/BAM format see this page.



Basic usage


samtools
samtools <command> [options]


If you run samtools on the terminal without any parameters, all the available utilities are listed:

samtools
Program: samtools (Tools for alignments in the SAM format)
Version: 0.1.18 (r982:295)

Usage:   samtools <command> [options]

Command: view        SAM<->BAM conversion
         sort        sort alignment file
         mpileup     multi-way pileup
         depth       compute the depth
         faidx       index/extract FASTA
         tview       text alignment viewer
         index       index alignment
         idxstats    BAM index stats (r595 or later)
         fixmate     fix mate information
         flagstat    simple stats
         calmd       recalculate MD/NM tags and '=' bases
         merge       merge sorted alignments
         rmdup       remove PCR duplicates
         reheader    replace BAM header
         cat         concatenate BAMs
         targetcut   cut fosmid regions (for fosmid pool only)
         phase       phase heterozygotes


Most of the times I just use view, sort, index and flagstats.

Converting a SAM file to a BAM file


A BAM file is just a SAM file but stored in binary; you should always convert your SAM files into BAM to save storage space and BAM files are faster to manipulate.

To get started, view the first couple of lines of your SAM file by typing on the terminal:

shell
head test.sam


The first 10 lines on your terminal after typing "head test.sam", should be lines starting with the "@" sign, which is an indicator for a header line. If you don't see lines starting with the "@" sign, the header information is most likely missing.

If the header is absent from the SAM file use the command below, where reference.fa is the reference fasta file used to map the reads:

samtools
samtools view -bT reference.fa test.sam > test.bam


If the header information is available:

samtools
samtools view -bS test.sam > test.bam


Sorting a BAM file


Always sort your BAM files; many downstream programs only take sorted BAM files.

samtools
samtools sort test.bam test_sorted


Converting SAM directly to a sorted BAM file


Like many Unix tools, SAMTools is able to read directly from a stream i.e. stdout.

samtools
samtools view -bS file.sam | samtools sort - file_sorted


Creating a BAM index file


A BAM index file is usually needed when visualising a BAM file.

samtools
samtools index test_sorted.bam test_sorted.bai


Converting a BAM file to a SAM file


Note: remember to use -h to ensure the converted SAM file contains the header information. Generally, I recommend storing only sorted BAM files as they use much less disk space and are faster to process.

samtools
samtools view -h NA06984.chrom16.ILLUMINA.bwa.CEU.low_coverage.20100517.bam > NA06984.chrom16.ILLUMINA.bwa.CEU.low_coverage.20100517.sam


Filtering out unmapped reads in BAM files


samtools
samtools view -h -F 4 blah.bam > blah_only_mapped.sam
OR
samtools view -h -F 4 -b blah.bam > blah_only_mapped.bam


Extracting SAM entries mapping to a specific loci


Quite commonly, we want to extract all reads that map within a specific genomic loci.

First make a BED file, which in its simplest form is a tab-delimited file with 3 columns. Say you want to extract reads mapping within the genomic coordinates 250,000 to 260,000 on chromosome 1. Then your BED file (which I will call test.bed) will be:

chr1250000260000


Then run samtools as such:

samtools
samtools view -L test.bed test.bam | awk '$2 != 4 {print}'
OR
samtools view -S -L test.bed test.sam | awk '$2 != 4 {print}'
OR
samtools view -F 4 -S -L test.bed test.sam


Note: even if you put strand information into your BED file, the above command ignores strandedness, so you get all reads mapping within the region specified by the BED file, regardless of whether it maps to the positive or negative strand. The AWK line above is to remove unmapped reads, which also get returned or you could add the -F 4 parameter.

Alternatively one can use BEDTools:

BEDTools
intersectBed -abam file.bam -b test.bed


Extracting only the first read from paired end BAM files


Sometimes you only want the first pair of a mate

samtools
samtools view -h -f 0x0040 test.bam > test_first_pair.sam


0x0040 is hexadecimal for 64 (i.e. 16 * 4), which is binary for 1000000, corresponding to the read in the first read pair.

Simple stats using SAMTools flagstat


The flagstat command provides simple statistics on a BAM file. Here I downloaded a BAM file from the 1000 genome project: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA06984/alignment/NA06984.chrom20.ILLUMINA.bwa.CEU.low_coverage.20111114.bam(external link)

Then simply:

samtools flagstat
samtools flagstat NA06984.chrom20.ILLUMINA.bwa.CEU.low_coverage.20111114.bam



16874858 + 0 in total (QC-passed reads + QC-failed reads)
290281 + 0 duplicates
36683299 + 0 mapped (97.21%)
46816083 + 0 paired in sequencing
53408650 + 0 read1
63407433 + 0 read2
76348470 + 0 properly paired (93.14NaV)
86432965 + 0 with itself and mate mapped
9191559 + 0 singletons (2.81NaV)
1057057 + 0 with mate mapped to a different chr
1145762 + 0 with mate mapped to a different chr (mapQ>=5)


To confirm some of the numbers returned by flagstat:

flagstat
samtools view NA06984.chrom20.ILLUMINA.bwa.CEU.low_coverage.20111114.bam | wc
6874858 #same as line 1

samtools view -F 4 NA06984.chrom20.ILLUMINA.bwa.CEU.low_coverage.20111114.bam | wc
6683299 #same as line 3

samtools view -f 0x0040 NA06984.chrom20.ILLUMINA.bwa.CEU.low_coverage.20111114.bam | wc
3408650 #same as line 5

samtools view -f 0x0080 NA06984.chrom20.ILLUMINA.bwa.CEU.low_coverage.20111114.bam | wc
3407433 #same as line 6

and so on...


Here's an example of a properly paired read:

SRR035024.17204235 163 20 60126 60 68M8S = 60466 383 CCACCATGGACCTCTGGGATCCTAGCTTTAAGAGATCCCATCACCCACATGAACGTTTGAATTGACAGGGGGAGCG @FEBBABEEDDGIGJIIIHIHJKHIIKAEHKEHEHI>JIJBDJHIJJ5CIFH4;;9=CDB8F8F>5B######### X0:i:1 X1:i:0 XC:i:68 MD:Z:68 RG:Z:SRR035024 AM:i:37 NM:i:0 SM:i:37 MQ:i:60 XT:A:U BQ:Z:BH@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
SRR035024.17204235 83 20 60466 60 32S44M = 60126 -383 GGCCTCCCCCCGGGCCCCTCTTGTGTGCACACAGCACAGCCTCTACTGCTACACCTGAGTACTTTGCCAGTGGCCT #################################>D:LIJEJBJIFJJJJIHKIJJJIKHIHIKJJIJJKGIIFEDB X0:i:1 X1:i:0 XC:i:44 MD:Z:44 RG:Z:SRR035024 AM:i:37 NM:i:0 SM:i:37 MQ:i:60 XT:A:U BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@


For more statistics of SAM/BAM files see the SAMStat program.

See the section "Interpreting the BAM flags" below for more information on BAM flags.

Interpreting the BAM flags


The second column in a SAM/BAM file is the flag column. They may seem confusing at first but the encoding allows details about a read to be stored by just using a few digits. The trick is to convert the numerical digit into binary, and then use the table to interpret the binary numbers, where 1 = true and 0 = false.

Here are some common BAM flags:

163: 10100011 in binary
147: 10010011 in binary
99: 1100011 in binary
83: 1010011 in binary

Interpretation of 10100011 (reading the binary from left to right):

1the read is paired in sequencing, no matter whether it is mapped in a pair
1the read is mapped in a proper pair (depends on the protocol, normally inferred during alignment)
0the query sequence itself is unmapped
0the mate is unmapped
0strand of the query (0 for forward; 1 for reverse strand)
1strand of the mate
0the read is the first read in a pair
1the read is the second read in a pair


163 second read of a pair on the positive strand with negative strand mate
147 second read of a pair on the negative strand with positive strand mate
99 first read of a pair on the forward strand with negative strand mate
83 first read of a pair on the reverse strand with positive strand mate

IF you observe a BAM flag of 512 (0x200), it indicates a read that did not pass quality control.

Below is a very raw Perl script that may help interpret BAM flags. I make the assumption that when a read or its mate is not flagged as unmapped (i.e. having a 0 value), there is information regarding its strandedness. Also if a read is flagged as not having a proper pair, I ignore the flags that store information for the mate.

bam_flag.pl
#!/usr/bin/perl

use strict;
use warnings;

my $usage = "Usage: $0 <bam_flag>\n";
my $bam_flag = shift or die $usage;

my $binary = dec2bin($bam_flag);
$binary = reverse($binary);

#163 -> 10100011

my @desc = (
'The read is paired in sequencing',
'The read is mapped in a proper pair',
'The query sequence itself is unmapped',
'The mate is unmapped',
'strand of the query',
'strand of the mate',
'The read is the first read in a pair',
'The read is the second read in a pair',
'The alignment is not primery (a read having split hits may have multiple primary alignment records)',
'The read fails quality checks',
'The read is either a PCR duplicate or an optical duplicate',
);

my $query_mapped = '0';
my $mate_mapped = '0';
my $proper_pair = '0';

print "$binary\n";

for (my $i=0; $i< length($binary); ++$i){
   my $flag = substr($binary,$i,1);
   #print "\$i = $i and \$flag = $flag\n";
   if ($i == 1){
      if ($flag == 1){
         $proper_pair = '1';
      }
   }
   if ($i == 2){
      if ($flag == 0){
         $query_mapped = '1';
      }
   }
   if ($i == 3){
      if ($flag == 0){
         $mate_mapped = '1';
      }
   }
   if ($i == 4){
      next if $query_mapped == 0;
      if ($flag == 0){
         print "The read is mapped on the forward strand\n";
      } else {
         print "The read is mapped on the reverse strand\n";
      }
   } elsif ($i == 5){
      next if $mate_mapped == 0;
      next if $proper_pair == 0;
      if ($flag == 0){
         print "The mate is mapped on the forward strand\n";
      } else {
         print "The mate is mapped on the reverse strand\n";
      }
   } else {
      if ($flag == 1){
         print "$desc[$i]\n";
      }
   }
}

exit(0);

#from The Perl Cookbook
sub dec2bin {
   my $str = unpack("B32", pack("N", shift));
   $str =~ s/^0+(?=\d)//;   # otherwise you'll get leading zeros
   return $str;
}

__END__


SAMTools calmd/fillmd


The calmd or fillmd tool is useful for visualising mismatches and insertions in an alignment of a read to a reference genome. For example:

fillmd
#random BAM file mapped with BWA
samtools view blah.bam

blah        0       chr1    948827  37      1M2D17M1I8M     *       0       0       GGTGTGTGCCTCAGGCTTAATAATAGG     ggcgde`a`egfbfgghdfa_cfea^_  XT:A:U  NM:i:3  X0:i:1  X1:i:0  XM:i:1  XO:i:1  XG:i:1  MD:Z:1^AC25

#the -e changes identical bases between the read and reference into ='s
samtools view -b blah.bam | samtools fillmd -e - ref.fa

blah        0       chr1    948827  37      1M2D17M1I8M     *       0       0       ==================A========     ggcgde`a`egfbfgghdfa_cfea^_  XT:A:U  NM:i:3  X0:i:1  X1:i:0  XM:i:1  XO:i:1  XG:i:1  MD:Z:1^AC25

The original read sequence is:

blah     G--GTGTGTGCCTCAGGCTTAATAATAGG
         |  |||||||||||||||||| |||||||
genome   GACGTGTGTGCCTCAGGCTTA-TAATAGG

The CIGAR string (1M2D17M1I8M) is shown with respect to the read, i.e. a deletion means a deletion in the read and an insertion is an insertion in the read, as you can see above. SAMTools fillmd shows the insertions (as above) and mismatches (not in the example above) but not deletions.


Creating FASTQ files from a BAM file


I found this great tool at http://www.hudsonalpha.org/gsl/software/bam2fastq.php(external link)

For example to extract ONLY unaligned from a bam file:

samtools
bam2fastq -o blah_unaligned.fastq --no-aligned blah.bam


To extract ONLY aligned reads from a bam file:

samtools
bam2fastq -o blah_aligned.fastq --no-unaligned blah.bam


Random subsampling of BAM file


The SAMTools view -s parameter allows you to randomly sample lines of a BAM file

samtools view -s 0.5 -b file.bam > random_half_of_file.sam

More information


http://samtools.sourceforge.net/(external link)

http://sourceforge.net/apps/mediawiki/samtools/index.php?title=SAM_FAQ(external link)

No comments:

Post a Comment