Friday, November 30, 2012

snpEff: Command line options


If you type the command without any arguments, it shows all available options:
Usage: snpEff [eff]    [options] genome_version [variants_file]
   or: snpEff download [options] genome_version
   or: snpEff build    [options] genome_version
   or: snpEff dump     [options] genome_version
   or: snpEff cds      [options] genome_version
    
There are four main 'commands': calcualte effects (eff, which is the default), build database (build), dump database (dump), test cds in database (cds).
Calculate variant effects: snpEff [eff]
If you type the command without any arguments, it shows all available options ("java -jar snpEff.jar eff"):
Usage: snpEff [eff] genome_version [variants_file]

Input file: Default is STDIN

Options:
 -a , -around            : Show N codons and amino acids around change (only in coding regions). Default is 0 codons.
 -i format               : Input format [ vcf, txt, pileup, bed ]. Default: VCF.
 -o format               : Ouput format [ txt, vcf, gatk, bed, bedAnn ]. Default: VCF.
 -interval               : Use a custom interval file (you may use this option many times)
 -chr string             : Prepend 'string' to chromosome name (e.g. 'chr1' instead of '1'). Only on TXT output.
 -s,  -stats             : Name of stats file (summary). Default is 'snpEff_summary.html'
 -t                      : Use multiple threads (implies '-noStats'). Default 'off'

Sequence change filter options:
 -del                    : Analyze deletions only
 -ins                    : Analyze insertions only
 -hom                    : Analyze homozygous variants only
 -het                    : Analyze heterozygous variants only
 -minQ X, -minQuality X  : Filter out variants with quality lower than X
 -maxQ X, -maxQuality X  : Filter out variants with quality higher than X
 -minC X, -minCoverage X : Filter out variants with coverage lower than X
 -maxC X, -maxCoverage X : Filter out variants with coverage higher than X
 -nmp                    : Only MNPs (multiple nucleotide polymorphisms)
 -snp                    : Only SNPs (single nucleotide polymorphisms)

Results filter options:
 -fi bedFile                     : Only analyze changes that intersect with the intervals specified in this file (you may use this option many times)
 -no-downstream                  : Do not show DOWNSTREAM changes
 -no-intergenic                  : Do not show INTERGENIC changes
 -no-intron                      : Do not show INTRON changes
 -no-upstream                    : Do not show UPSTREAM changes
 -no-utr                         : Do not show 5_PRIME_UTR or 3_PRIME_UTR changes

Annotations filter options:
 -canon                          : Only use canonical transcripts.
 -onlyReg                        : Only use regulation tracks.
 -onlyTr file.txt                : Only use the transcripts in this file. Format: One transcript ID per line.
 -reg name                       : Regulation track to use (this option can be used add several times).
 -treatAllAsProteinCoding bool   : If true, all transcript are treated as if they were protein conding. Default: Auto
 -ud, -upDownStreamLen           : Set upstream downstream interval length (in bases)

Generic options:
 -0                      : File positions are zero-based (same as '-inOffset 0 -outOffset 0')
 -1                      : File positions are one-based (same as '-inOffset 1 -outOffset 1')
 -c , -config            : Specify config file
 -h , -help              : Show this help and exit
 -if, -inOffset          : Offset input by a number of bases. E.g. '-inOffset 1' for one-based input files
 -of, -outOffset         : Offset output by a number of bases. E.g. '-outOffset 1' for one-based output files
 -noLog                  : Do not report usage statistics to server
 -noStats                : Do not create stats (summary) file
 -q , -quiet             : Quiet mode (do not show any messages or errors)
 -v , -verbose           : Verbose mode
Options
Option Note
-a, -around Show N codons and amino acids around change (only in coding regions). Default is 0 codons (i.e. by default is turned off).
-i Input format: [ txt, vcf, pileup, bed ]
  • vcf: Input file is in VCF format. Implies '-inOffset 1'
  • txt: Input file is in TXT format. Implies '-inOffset 1'
  • pileup: Input file is in PILEUP format. Implies '-inOffset 1'. WARNING: This format is deprecated.
  • bed: Only intervals are provided (no variants). This is used when you want to know were an interval hits. Implies '-inOffset 0'
-interval Add custom interval file. You may use this option many times to add many interval files.
-o Output format: [ txt, vcf, bed, bedAnn ]
  • vcf: Output in VCF format. Implies '-inOffset 1'
  • txt: Output in TXT format. Implies '-inOffset 1'
  • bed: Only minimal information is added to the 'Name' column. Format: "Effect_1 | Gene_1 | Biotype_1 ; Effect_2 | Gene_2 | Biotype_2 ; ... ". This is used when you want to know were an interval hits. Implies '-outOffset 0'
  • bedAnn: Output annotation's info in BED format (as opposed to variant's info). This option will output annotations intersecting each variant, information will be added in the 'name' column. Implies '-outOffset 0'
-s, -stats Name of stats file (summary). Default is 'snpEff_summary.html'.
-chr Prepend 'chr' before printing a chromosome name (e.g. 'chr7' instead of '7').
-t Use multiple threads (implies '-noStats'). If active, tries to use available cores in the computer. Default 'off'

Sequence change filter options
Option Note
-del Analyze deletions only (filter out insertions, SNPs and MNPs).
-hom Analyze homozygous sequence changes only (filter out heterozygous changes).
-het Analyze heterozygous sequence changes only (filter out homozygous changes). Note that this option may not be valid when using VCF4 files, since there might be more than two changes per line, the notion of heterozygous change is lost.
-ins Analyze insertions only (filter out deletions, SNPs and MNPs).
-minC, -minCoverage Filter out sequence changes with coverage lower than X.
-maxC, -maxCoverage Filter out sequence changes with coverage higher than X.
-minQ, -minQuality Filter out sequence changes with quality lower than X.
-maxQ, -maxQuality Filter out sequence changes with quality higher than X.
-mnp Analyze MNPs only (filter out insertions, deletions and SNPs).
-snp Analyze SNPs only (filter out insertions, deletions and MNPs).

Results filter options
Option Note
-fi {bedFile} Only analyze changes intersecting intervals in file (you may use this option many times)
-no-downstream Do not show DOWNSTREAM changes
-no-intergenic Do not show INTERGENIC changes
-no-intron Do not show INTRON changes
-no-upstream Do not show UPSTREAM changes
-no-utr Do not show 5_PRIME_UTR or 3_PRIME_UTR changes

Annotations filter options
Option Note
-canon Only annotate using "canonical" transcripts. Canonical transcripts are defined as the transcript having the longest CDS.
-treatAllAsProteinCoding {val} If value is 'true', report all transcript as if they were conding. Default: Auto, i.e. if transcripts any marked as 'protein_coding' the set to 'false', if no transcripts are marked as 'protein_coding' then set to 'true'.
-ud, -upDownStreamLen Set upstream downstream interval length (in bases). If set to zero or negative, then no UPSTREAM or DOWNSTREAM effects are reported.
-onlyReg Only use regulation tracks
-reg {name}Regulation track to use (this option can be used add several times).
-onlyTr {file.txt}Only use the transcripts in this file. Format: One transcript ID per line.

Generic options
Option Note
-0 Indicates that input and output positions are zero-based. Tha means the the first base in a chromosome is base number 0. This is equivalent to '-inOffset 0 outOffset 0'
-1 Indicates that input and output positions are one-based. Tha means the the first base in a chromosome is base number 1. This is equivalent to '-inOffset 1 outOffset 1'. This is the default.
-c, -config Specifies the location of a configuration file. Default location is in current directory.
-h, -help Print help and exit.
-if, -inOffset Offset all position in input files by a number of bases. E.g. '-inOffset 1' for one-based input files.
-of, -outOffset Offset all outputs by a number of bases. E.g. '-outOffset 1' for one-based outputs.
-v, -verbose Verbose mode.
-q, -quiet Quiet mode (do not show any messages or errors).
-noLog Do not report usage statistics to server.



Download a database: snpEff download
Download and install a database. A list of databases is available at the download page.
Usage: snpEff download [options] genome_version

Generic options:
 -c , -config            : Specify config file
 -h , -help              : Show this help and exit
 -v , -verbose           : Verbose mode
 -noLog                  : Do not report usage statistics to server
E.g. to downlaod GRCh37.64, just run:
java -jar snpEff.jar download GRCh37.64

Build database: snpEff build
If you type the command without any arguments, it shows all available options ("java -jar snpEff.jar build"):
Usage: snpEff build [options] genome_version

Build DB options:
 -embl                   : Use Embl format.
 -genbank                : Use GenBank format.
 -gff2                   : Use GFF2 format (obsolete).
 -gff3                   : Use GFF3 format.
 -gtf22                  : Use GTF 2.2 format.
 -refseq                 : Use RefSeq table from UCSC.
 -txt                    : Use TXT format (obsolete).
 -onlyReg                : Only build regulation tracks.
 -cellType type          : Only build regulation tracks for cellType "type".

Generic options:
 -0                      : File positions are zero-based (same as '-inOffset 0 -outOffset 0')
 -1                      : File positions are one-based (same as '-inOffset 1 -outOffset 1')
 -c , -config            : Specify config file
 -h , -help              : Show this help and exit
 -if, -inOffset          : Offset input by a number of bases. E.g. '-inOffset 1' for one-based input files
 -of, -outOffset         : Offset output by a number of bases. E.g. '-outOffset 1' for one-based output files
 -noLog                  : Do not report usage statistics to server
 -q , -quiet             : Quiet mode (do not show any messages or errors)
 -v , -verbose           : Verbose mode
Option Note
-emblUse Embl format. It will look gene information in a file called './data/GENOME/genes.embl' which is assumed to be in EMBL format (assuming 'data_dir=./data/' in your snpEff.config file).
-genbankUse GenBank format. It will look gene information in a file called './data/GENOME/genes.gb' which is assumed to be in GenBank format (assuming 'data_dir=./data/' in your snpEff.config file).
-gff3Use GFF3 format. It will look gene information in a file called './data/GENOME/genes.gff' which is assumed to be in GFF3 format (assuming 'data_dir=./data/' in your snpEff.config file).
-gff2Use GFF2 format. It will look gene information in a file called './data/GENOME/genes.gff' which is assumed to be in GFF2 format (assuming 'data_dir=./data/' in your snpEff.config file).
WARNING: GFF2 format is obsolete and should not be used.
-gtf22Use GFT 2.2 format. It will look gene information in a file called './data/GENOME/genes.gtf' which is assumed to be in GTF 2.2 format (assuming 'data_dir=./data/' in your snpEff.config file).
-refseqUse refSeq table. It will look gene information in a file called './data/GENOME/genes.txt' which is assumed to be a RefSeq table from UCSC (assuming 'data_dir=./data/' in your snpEff.config file).

Adding Genomic Annotations Using SnpEff and VariantAnnotator

http://gatkforums.broadinstitute.org/discussion/comment/1574/#Comment_1574

IMPORTANT ANNOUNCEMENT: Our testing has shown that not all combinations of snpEff/database versions produce high-quality results. Please see the Current Recommended Best Practices When Running SnpEff and Analysis of SnpEff Annotations Across Versions sections below to familiarize yourself with our recommended best practices BEFORE running snpEff.

Contents

Introduction

Until recently we were using an in-house annotation tool for genomic annotation, but the burden of keeping the database current and our lack of ability to annotate indels has led us to employ the use of a third-party tool instead. After reviewing many external tools (including annoVar, VAT, and Oncotator), we decided that SnpEff best meets our needs as it accepts VCF files as input, can annotate a full exome callset (including indels) in seconds, and provides continually-updated transcript databases. We have implemented support in the GATK for parsing the output from the SnpEff tool and annotating VCFs with the information provided in it.

SnpEff Setup and Usage

  • Download the SnpEff core program. If you want to be able to run VariantAnnotator on the SnpEff output, you'll need to download a version of SnpEff that VariantAnnotator supports from this page (currently supported versions are listed below). If you just want the most recent version of SnpEff and don't plan to run VariantAnnotator on its output, you can get it from here.
  • Unzip the core program
  • Open the file snpEff.config in a text editor, and change the "database_repository" line to the following:
database_repository = http://sourceforge.net/projects/snpeff/files/databases/
  • Download one or more databases using SnpEff's built-in download command:
java -jar snpEff.jar download GRCh37.64
A list of available databases is here. The human genome databases have GRCh or hg in their names. You can also download the databases directly from the SnpEff website, if you prefer.
  • The download command by default puts the databases into a subdirectory called data within the directory containing the SnpEff jar file. If you want the databases in a different directory, you'll need to edit the data_dir entry in the file snpEff.config to point to the correct directory.
  • Run SnpEff on the file containing your variants, and redirect its output to a file. SnpEff supports many input file formats including VCF 4.1, BED, and SAM pileup. Full details and command-line options can be found on the SnpEff home page.

Supported SnpEff Versions

  • If you want to take advantage of SnpEff integration in the GATK, you'll need to run SnpEff version 2.0.5 (note: the newer version 2.0.5d is currently unsupported by the GATK, as we haven't yet had a chance to test it)

Current Recommended Best Practices When Running SnpEff

These best practices are based on our analysis of various snpEff/database versions as described in detail in the Analysis of SnpEff Annotations Across Versions section below.
  • We recommend using only the GRCh37.64 database with SnpEff 2.0.5. The more recent GRCh37.65 database produces many false-positive Missense annotations due to a regression in the ENSEMBL Release 65 GTF file used to build the database. This regression has been acknowledged by ENSEMBL and is supposedly fixed as of 1-30-2012, however as we have not yet tested the fixed version of the database we continue to recommend using only GRCh37.64 for now.
  • We recommend always running with "-onlyCoding true" with human databases (eg., the GRCh37.* databases). Setting "-onlyCoding false" causes snpEff to report all transcripts as if they were coding (even if they're not), which can lead to nonsensical results. The "-onlyCoding false" option should only be used with databases that lack protein coding information.
  • Do not trust annotations from versions of snpEff prior to 2.0.4. Older versions of snpEff (such as 2.0.2) produced many incorrect annotations due to the presence of a certain number of nonsensical transcripts in the underlying ENSEMBL databases. Newer versions of snpEff filter out such transcripts.

Analysis of SnpEff Annotations Across Versions

  • Analysis of the SNP annotations produced by snpEff across various snpEff/database versions: File:SnpEff snps comparison of available versions.pdf
    • Both snpEff 2.0.2 + GRCh37.63 and snpEff 2.0.5 + GRCh37.65 produce an abnormally high Missense:Silent ratio, with elevated levels of Missense mutations across the entire spectrum of allele counts. They also have a relatively low (~70%) level of concordance with the 1000G Gencode annotations when it comes to Silent mutations. This suggests that these combinations of snpEff/database versions incorrectly annotate many Silent mutations as Missense.
    • snpEff 2.0.4 RC3 + GRCh37.64 and snpEff 2.0.5 + GRCh37.64 produce a Missense:Silent ratio in line with expectations, and have a very high (~97%-99%) level of concordance with the 1000G Gencode annotations across all categories.
  • Comparison of SNP annotations produced using the GRCh37.64 and GRCh37.65 databases with snpEff 2.0.5: File:SnpEff snps ensembl 64 vs 65.pdf
    • The GRCh37.64 database gives good results provided you run snpEff with the "-onlyCoding true" option. The "-onlyCoding false" option causes snpEff to mark all transcripts as coding, and so produces many false-positive Missense annotations.
    • The GRCh37.65 database gives results that are as poor as those you get with the "-onlyCoding false" option on the GRCh37.64 database. This is due to a regression in the ENSEMBL release 65 GTF file used to build snpEff's GRCh37.65 database. The regression has been acknowledged by ENSEMBL and is due to be fixed shortly.
  • Analysis of the INDEL annotations produced by snpEff across snpEff/database versions: File:SnpEff indels.pdf
    • snpEff's indel annotations are highly concordant with those of a high-quality set of genomic annotations from the 1000 Genomes project. This is true across all snpEff/database versions tested.

Example SnpEff Usage with a VCF Input File

Below is an example of how to run SnpEff version 2.0.5 with a VCF input file and have it write its output in VCF format as well. Notice that you need to explicitly specify the database you want to use (in this case, GRCh37.64). This database must be present in a directory of the same name within the data_dir as defined in snpEff.config.
java -Xmx4G -jar snpEff.jar eff -v -onlyCoding true -i vcf -o vcf GRCh37.64 1000G.exomes.vcf > snpEff_output.vcf
In this mode, SnpEff aggregates all effects associated with each variant record together into a single INFO field annotation with the key EFF. The general format is:
EFF=Effect1(Information about Effect1),Effect2(Information about Effect2),etc.
And here is the precise layout with all the subfields:
EFF=Effect1(Effect_Impact|Effect_Functional_Class|Codon_Change|Amino_Acid_Change|Gene_Name|Gene_BioType|Coding|Transcript_ID|Exon_ID),Effect2(etc...
It's also possible to get SnpEff to output in a (non-VCF) text format with one Effect per line. See the SnpEff home page for full details.

Adding SnpEff Annotations using VariantAnnotator

Once you have a SnpEff output VCF file, you can use the VariantAnnotator walker to add SnpEff annotations based on that output to the input file you ran SnpEff on.
There are two different options for doing this:

Option 1: Annotate with only the highest-impact effect for each variant

NOTE: This option works only with supported SnpEff versions. VariantAnnotator run as described below will refuse to parse SnpEff output files produced by other versions of the tool, or which lack a SnpEff version number in their header.
The default behavior when you run VariantAnnotator on a SnpEff output file is to parse the complete set of effects resulting from the current variant, select the most biologically-significant effect, and add annotations for just that effect to the INFO field of the VCF record for the current variant. This is the mode we plan to use in our Production Data-Processing Pipeline.
When selecting the most biologically-significant effect associated with the current variant, VariantAnnotator does the following:
  • Prioritizes the effects according to the categories (in order of decreasing precedence) "High-Impact", "Moderate-Impact", "Low-Impact", and "Modifier", and always selects one of the effects from the highest-priority category. For example, if there are three moderate-impact effects and two high-impact effects resulting from the current variant, the annotator will choose one of the high-impact effects and add annotations based on it. See below for a full list of the effects arranged by category.
  • Within each category, ties are broken using the functional class of each effect (in order of precedence: NONSENSE, MISSENSE, SILENT, or NONE). For example, if there is both a NON_SYNONYMOUS_CODING (MODERATE-impact, MISSENSE) and a CODON_CHANGE (MODERATE-impact, NONE) effect associated with the current variant, the annotator will select the NON_SYNONYMOUS_CODING effect. This is to allow for more accurate counts of the total number of sites with NONSENSE/MISSENSE/SILENT mutations. See below for a description of the functional classes SnpEff associates with the various effects.
  • Effects that are within a non-coding region are always considered lower-impact than effects that are within a coding region.
Example Usage:
java -jar dist/GenomeAnalysisTK.jar \
     -T VariantAnnotator \
     -R /humgen/1kg/reference/human_g1k_v37.fasta \
     -A SnpEff \       
     --variant 1000G.exomes.vcf \        (file to annotate)
     --snpEffFile snpEff_output.vcf \    (SnpEff VCF output file generated by running SnpEff on the file to annotate)
     -L 1000G.exomes.vcf \
     -o out.vcf
VariantAnnotator adds some or all of the following INFO field annotations to each variant record:
  • SNPEFF_EFFECT - The highest-impact effect resulting from the current variant (or one of the highest-impact effects, if there is a tie)
  • SNPEFF_IMPACT - Impact of the highest-impact effect resulting from the current variant (HIGH, MODERATE, LOW, or MODIFIER)
  • SNPEFF_FUNCTIONAL_CLASS - Functional class of the highest-impact effect resulting from the current variant (NONE, SILENT, MISSENSE, or NONSENSE)
  • SNPEFF_CODON_CHANGE - Old/New codon for the highest-impact effect resulting from the current variant
  • SNPEFF_AMINO_ACID_CHANGE - Old/New amino acid for the highest-impact effect resulting from the current variant
  • SNPEFF_GENE_NAME - Gene name for the highest-impact effect resulting from the current variant
  • SNPEFF_GENE_BIOTYPE - Gene biotype for the highest-impact effect resulting from the current variant
  • SNPEFF_TRANSCRIPT_ID - Transcript ID for the highest-impact effect resulting from the current variant
  • SNPEFF_EXON_ID - Exon ID for the highest-impact effect resulting from the current variant
Example VCF records annotated using SnpEff and VariantAnnotator:
1   874779  .   C   T   279.94  . AC=1;AF=0.0032;AN=310;BaseQRankSum=-1.800;DP=3371;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=1.4493;InbreedingCoeff=-0.0045;
MQ=54.49;MQ0=10;MQRankSum=0.982;QD=13.33;ReadPosRankSum=-0.060;SB=-120.09;SNPEFF_AMINO_ACID_CHANGE=G215;SNPEFF_CODON_CHANGE=ggC/ggT;
SNPEFF_EFFECT=SYNONYMOUS_CODING;SNPEFF_EXON_ID=exon_1_874655_874840;SNPEFF_FUNCTIONAL_CLASS=SILENT;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=SAMD11;
SNPEFF_IMPACT=LOW;SNPEFF_TRANSCRIPT_ID=ENST00000342066
1   874816  .   C   CT  2527.52 .   AC=15;AF=0.0484;AN=310;BaseQRankSum=-11.876;DP=4718;FS=48.575;HRun=1;HaplotypeScore=91.9147;InbreedingCoeff=-0.0520;
MQ=53.37;MQ0=6;MQRankSum=-1.388;QD=5.92;ReadPosRankSum=-1.932;SB=-741.06;SNPEFF_EFFECT=FRAME_SHIFT;SNPEFF_EXON_ID=exon_1_874655_874840;
SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=SAMD11;SNPEFF_IMPACT=HIGH;SNPEFF_TRANSCRIPT_ID=ENST00000342066

Option 2: Annotate with all effects for each variant

VariantAnnotator also has the ability to take the EFF field from the SnpEff VCF output file containing all the effects aggregated together and copy it verbatim into the VCF to annotate.
Here's an example of how to do this:
java -jar dist/GenomeAnalysisTK.jar \
     -T VariantAnnotator \
     -R /humgen/1kg/reference/human_g1k_v37.fasta \      
     -E resource.EFF \
     --variant 1000G.exomes.vcf \      (file to annotate)
     --resource snpEff_output.vcf \    (SnpEff VCF output file generated by running SnpEff on the file to annotate)
     -L 1000G.exomes.vcf \
     -o out.vcf
Of course, in this case you can also use the VCF output by SnpEff directly, but if you are using VariantAnnotator for other purposes anyway the above might be useful.

List of Genomic Effects

Below are the possible genomic effects recognized by SnpEff, grouped by biological impact. Full descriptions of each effect are available on this page.

High-Impact Effects

  • SPLICE_SITE_ACCEPTOR
  • SPLICE_SITE_DONOR
  • START_LOST
  • EXON_DELETED
  • FRAME_SHIFT
  • STOP_GAINED
  • STOP_LOST

Moderate-Impact Effects

  • NON_SYNONYMOUS_CODING
  • CODON_CHANGE (note: this effect is used by SnpEff only for MNPs, not SNPs)
  • CODON_INSERTION
  • CODON_CHANGE_PLUS_CODON_INSERTION
  • CODON_DELETION
  • CODON_CHANGE_PLUS_CODON_DELETION
  • UTR_5_DELETED
  • UTR_3_DELETED

Low-Impact Effects

  • SYNONYMOUS_START
  • NON_SYNONYMOUS_START
  • START_GAINED
  • SYNONYMOUS_CODING
  • SYNONYMOUS_STOP
  • NON_SYNONYMOUS_STOP

Modifiers

  • NONE
  • CHROMOSOME
  • CUSTOM
  • CDS
  • GENE
  • TRANSCRIPT
  • EXON
  • INTRON_CONSERVED
  • UTR_5_PRIME
  • UTR_3_PRIME
  • DOWNSTREAM
  • INTRAGENIC
  • INTERGENIC
  • INTERGENIC_CONSERVED
  • UPSTREAM
  • REGULATION
  • INTRON

Functional Classes

SnpEff assigns a functional class to certain effects, in addition to an impact:
  • NONSENSE: assigned to point mutations that result in the creation of a new stop codon
  • MISSENSE: assigned to point mutations that result in an amino acid change, but not a new stop codon
  • SILENT: assigned to point mutations that result in a codon change, but not an amino acid change or new stop codon
  • NONE: assigned to all effects that don't fall into any of the above categories (including all events larger than a point mutation)
The GATK prioritizes effects with functional classes over effects of equal impact that lack a functional class when selecting the most significant effect in VariantAnnotator. This is to enable accurate counts of NONSENSE/MISSENSE/SILENT sites.

---- note by WZY ----
this is an example script to run snpEff:
java -jar snpEff.jar download hg19
java -Xmx2G -jar snpEff.jar -v -chr chr21 -i vcf -o vcf hg19 input.vcf > output.vcf


GATK pipeline











NGS1.png
  1. Global config  : Set up global configuration of the pipeline.
  2. Mapping : Align short sequences to the human reference genome sequence database.
  3. Fixmate : Fixing the mate pairs information to ensure that all mate-pair information is in sync between each read and it's mate pair.
  4. Filter : Filtering for mapping, pairing, and proper paired
  5. Remove duplicate : Examines aligned records in the BAM file to locate duplicate reads and remove them.
  6. Filter low mapping quality : Filter low mapping quality reads
  7. Create intervals : Collect regions around potential indels and clusters of mismatches. Determine small suspicious intervals which are likely in need of realignment.
  8. Realignment : Run the realigner over the intervals to create a cleaned version of the BAM file.
  9. Analysis of covariates : Determine the covariates affecting base quality scores in the BAM file.
  10. Recalibration : Walking through the BAM file and rewrite the quality scores.
  11. Recalculate analysis of covariates : Determine the covariates affecting base quality scores in the realigned recalibrated BAM file for the comparison.
  12. Depth of coverage : Determine coverage summarized by mean, median, quartiles, and/or percentage of bases covered.
  13. HsMetrics : Calculates a set of Hybrid Selection specific metrics from an aligned BAM file..
  14. Cleanup : Remove all intermediate alignment and BAM files. Keep only first aligned and last realigned-recalibrated BAM files.
  15. Calling variants
  16. Generate genotype
  17. Annotation snpEff
  18. Annotation Annovar

Technical considerations for functional sequencing assays


Technical considerations for functional sequencing assays

RNA-Seq

The measurement of RNA expression is a foundation of many experiments done in biomedical research. It is therefore natural that the sequencing of long and short RNA both for quantification and discovery is the most popular functional sequencing assay (Fig. 1a). Quantification of mRNA transcripts in RNA-seq is performed by calculating values reported in units of reads per kilobase per million mapped reads (RPKM) with a paired-end fragment equivalent, fragments per kilobase per million reads (FPKM), also commonly used for each gene (Fig. 1b). RPKM normalizes for differences in gene size and makes the comparison of genes within the same sample meaningful in terms of molar equivalents (Fig. 1b). As RNA-seq is not based on predetermined DNA probes to known genes, it is a powerful tool for the discovery of new exons, splice junctions, transcripts and genes as well as new small RNAs (Fig. 1c). The reads can be used to assemble transcripts that result from gene rearrangements and can also help to identify disease-associated genomic abnormalities. Properly filtered RNA-seq reads can be mined for sequence variants and RNA-editing events with tuned analysis and filtering pipelines (Fig. 1d).

RNA-seq flow chart
Zeng W, Mortazavi A. (2012) Technical considerations for functional sequencing assays. Nat Immunol 13(9), 802-7. [abstract]

RNA-seq pipeline


Wednesday, November 28, 2012

CRAVAT tool

Developed by a group from John Hopkins University, CRAVAT is a online tool to perform the following analysis:

  Cancer driver analysis
  Functional effect analysis
  Annotation only

They provide some help documents here:
http://www.cravat.us/help.jsp?chapter=analysis_tools&article=chasm

interesting tool.

Annotations Using SnpEff and VariantAnnotator

Adding Genomic Annotations Using SnpEff and VariantAnnotator - GATK-Forum

http://gatkforums.broadinstitute.org/discussion/comment/1574/#Comment_1574


Annotation tools for SNP/indel calling results:
snpEff is for the indel annotation and because it's running on java and the process of annotation is pretty fast, it's very amenable to run off your personal desktop.

other tools: wAnnovar, Seattle Seq

Kevin's GATTACA blog

Kevin's GATTACA blog
 http://kevin-gattaca.blogspot.com

A lot of information relevant to NGS data process, for example:
http://kevin-gattaca.blogspot.com/2011/07/genomecoveragebed-to-look-at-coverage.html

talks about looking at coverage of your WGS from BAM files.

Many other posts available.

BAM QC tool

BAM QC tool from QualiMap (http://qualimap.org) provides a coverage plot. See example output here:
http://qualimap.bioinfo.cipf.es/samples/ERR089819_result/qualimapReport.html

 This is a QC tool for BAM files.
(likewise, fastQC is a QC tool for fastQ files)



Summary 

Globals

Reference size 100,286,002
Number of reads 35,576,180
Mapped reads 30,983,200 / 87.09%
Unmapped reads 4,592,980 / 12.91%
Paired reads 30,983,200 / 87.09%
Mapped reads, only first in pair 15,491,600 / 43.54%
Mapped reads, only second in pair 15,491,600 / 43.54%
Mapped reads, both in pair 30,983,200 / 87.09%
Mapped reads, singletons 0 / 0%
Read min/max/mean length 100 / 100 / 100
Clipped reads 0 / 0%
Duplication rate 16.39%

ACGT Content

Number/percentage of A's 1,003,585,776 / 32.43%
Number/percentage of C's 541,692,060 / 17.5%
Number/percentage of T's 1,006,351,599 / 32.52%
Number/percentage of G's 543,136,344 / 17.55%
Number/percentage of N's 0 / 0%
GC Percentage 35.05%

Coverage

Mean 30.86
Standard Deviation 25.01

Mapping Quality

Mean Mapping Quality 255

Insert size

Mean 407.73
Median 400

Per chromosome statistics

Name Length Mapped bases Mean coverage Standard deviation
I 15072421 467401058 31.01 36.49
II 15279324 456673231 29.89 9.13
III 13783682 416692735 30.23 9.81
IV 17493784 545502943 31.18 10.46
V 20924143 635041765 30.35 18.85
X 17718854 553489637 31.24 8.75
MtDNA 13794 19964410 1,447.33 346.72

Input data and parameters 

Alignment

BAM file: /home/kokonech/qualimap_release_data/alignments/ERR089819.bam
Program: Bowtie (0.12.7)
Command line: "bowtie -S -m 1 -k 1 -p 16 -X 3000 --fr --chunkmbs 2056 /data/scratch/ebwts/c_elegans_ws200 -1 ERR089819_1.fastq -2 ERR089819_2.fastq"
Number of windows: 400
Analysis date: Tue Jul 24 12:04:58 CEST 2012
Draw chromosome limits: yes

Coverage across reference 

Coverage Histogram 

Coverage Histogram (0-50X) 

Duplication Rate Histogram 

Genome Fraction Coverage 

Mapped Reads Nucleotide Content 

Mapped Reads GC-content Distribution 

Mapping Quality Across Reference 

Mapping Quality Histogram 

Insert Size Across Reference 

Insert Size Histogram 

GenomeWeb InSequence

http://twitter.com/InSequence

GenomeWeb InSequence

@InSequence

GenomeWeb premium newsletter covering the newest tools and applications in next-generation sequencing, from the established vendors to emerging startups.

Tweets

Early Proton Users Report Good Data, Rapid Exomes; Look Forward to Sample Prep and Automation Improvements
New Products: Qiagen's Sample Prep Products, Beckman Coulter's SPRIselect, Bowtie 2, BaseSpace Apps, and more
Intel, PacBio Researchers Among Team Taking on Exploratory Project Pairing SMRT Sequencing with Electronic Tags
PacBio Books Four Systems in Q3; Plans Q4 Chemistry Upgrade to Enable Longer Average Read Length
Illumina's 21 Percent Increase in Q3 Revenue Driven by MiSeq, HiSeq 2500, Sequencing Services
Columbia Team Pursues Integrated Electronic Approaches for Measuring Multiplexed Nanopore Signals
ZS Genetics Shows Single Base ID by Electron Microscopy, but Path to Sequencing is Still Long
George Church Fielding X Prize Team; Says Sequencing Strategy May Combine Many Technologies
In Pilot Project, FDA Places MiSeqs in State and Federal Labs to Track Food-Borne Pathogens
In Sequence Survey: Illumina Holds Two-Thirds of Sequencing Market, Splits Desktop Share with Ion PGM
In Print: Last Week's Sequencing-Related Papers of Note