Objective
Recalibrate variant quality scores and produce a call set filtered for the desired levels of sensitivity and specificity.
Prerequisites
- TBD
Caveats
This document provides a typical usage example, including parameter values. However, the values given may not be representative of the latest Best Practices recommendations. When in doubt, please consult the FAQ document on VQSR training sets and parameters, which overrides this document.
Steps
- Prepare recalibration parameters for SNPs
a. Specify which call sets the program should use as resources to build the recalibration model
b. Specify which annotations the program should use to evaluate the likelihood of SNPs being real
c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches
d. Determine additional model parameters
- Build the SNP recalibration model
- Apply the desired level of recalibration to the SNPs in the call set
- Prepare recalibration parameters for Indels
a. Specify which call sets the program should use as resources to build the recalibration model
b. Specify which annotations the program should use to evaluate the likelihood of Indels being real
c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches
d. Determine additional model parameters
- Build the Indel recalibration model
- Apply the desired level of recalibration to the Indels in the call set
1. Prepare recalibration parameters for SNPs
a. Specify which call sets the program should use as resources to build the recalibration model
For each training set, we use key-value tags to qualify whether the set contains known sites, training sites, and/or truth sites. We also use a tag to specify the prior likelihood that those sites are true (using the Phred scale).
- True sites training resource: HapMap
- True sites training resource: Omni
- Non-true sites training resource: 1000G
- Known sites resource, not used in training: dbSNP
The default prior likelihood assigned to all other variants is Q2 (36.90%). This low value reflects the fact that the philosophy of the GATK callers is to produce a large, highly sensitive callset that needs to be heavily refined through additional filtering.
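(As a quick check on that figure: a Phred-scaled prior Q corresponds to an error probability of 10^(-Q/10), so a prior probability of being true of 1 - 10^(-Q/10). For Q2 this gives 1 - 10^(-0.2) ≈ 1 - 0.6310 ≈ 0.3690, i.e. the 36.90% quoted above. By the same arithmetic, the prior=15.0 used for HapMap corresponds to roughly 96.8% confidence that a HapMap site is a true variant.)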
b. Specify which annotations the program should use to evaluate the likelihood of SNPs being real
These annotations are included in the information generated for each variant call by the caller. If an annotation is missing (typically because it was omitted from the calling command), it can be added using the VariantAnnotator tool, as shown in the sketch after this list.
- Coverage (DP)
- QualByDepth (QD)
- FisherStrand (FS)
- MappingQualityRankSumTest (MQRankSum)
- ReadPosRankSumTest (ReadPosRankSum)
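For example, if FisherStrand (FS) were missing from your call set, a minimal VariantAnnotator invocation along these lines should add it. The file names here are placeholders, and the BAM used for calling is needed for read-based annotations:

java -jar GenomeAnalysisTK.jar \
    -T VariantAnnotator \
    -R reference.fa \
    -I recal_reads.bam \
    -V raw_variants.vcf \
    -A FisherStrand \
    -o raw_variants_annotated.vcf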
c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches
- First tranche threshold 100.0
- Second tranche threshold 99.9
- Third tranche threshold 99.0
- Fourth tranche threshold 90.0
2. Build the SNP recalibration model
Action
Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T VariantRecalibrator \
-R reference.fa \
-input raw_variants.vcf \
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf \
-resource:omni,known=false,training=true,truth=true,prior=12.0 omni.vcf \
-resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf \
-an DP \
-an QD \
-an FS \
-an MQRankSum \
-an ReadPosRankSum \
-mode SNP \
-tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
-recalFile recalibrate_SNP.recal \
-tranchesFile recalibrate_SNP.tranches \
-rscriptFile recalibrate_SNP_plots.R
Expected Result
This creates several files. The most important file is the recalibration report, called recalibrate_SNP.recal, which contains the recalibration data. This is what the program will use in the next step to generate a VCF file in which the variants are annotated with their recalibrated quality scores. There is also a file called recalibrate_SNP.tranches, which contains the quality score thresholds corresponding to the tranches specified in the original command. Finally, if your installation of R and the other required libraries was done correctly, you will also find some PDF files containing plots. These plots illustrate the distribution of variants according to certain dimensions of the model.
For detailed instructions on how to interpret these plots, please refer to the online GATK documentation.
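As a quick sanity check, the tranches file is plain text, so you can peek at it directly. It should contain one line per requested tranche, reporting (among other fields) the target truth sensitivity and the minimum VQSLOD for that tranche; the exact columns can vary between GATK versions, so treat this as a sanity check rather than a specification:

head recalibrate_SNP.tranches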
3. Apply the desired level of recalibration to the SNPs in the call set
Action
Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T ApplyRecalibration \
-R reference.fa \
-input raw_variants.vcf \
-mode SNP \
--ts_filter_level 99.0 \
-recalFile recalibrate_SNP.recal \
-tranchesFile recalibrate_SNP.tranches \
-o recalibrated_snps_raw_indels.vcf
Expected Result
This creates a new VCF file, called recalibrated_snps_raw_indels.vcf, which contains all the variants from the original raw_variants.vcf file, but now the SNPs are annotated with their recalibrated quality scores (VQSLOD) and marked as either PASS or a tranche-specific filter status, depending on whether or not they are included in the selected tranche.
Here we are taking the second lowest of the tranches specified in the original recalibration command. This means that we are applying to our data set the level of sensitivity that would allow us to retrieve 99% of true variants from the truth training sets of HapMap and Omni SNPs. If we wanted to be more specific (and therefore have less risk of including false positives, at the cost of missing real sites), we could take the very lowest tranche, which would only retrieve 90% of the truth training sites. If we wanted to be more sensitive (and therefore less specific, at the risk of including more false positives), we could take the higher tranches. In our Best Practices documentation, we recommend taking the second highest tranche (99.9%), which provides the highest sensitivity you can get while still being acceptably specific.
4. Prepare recalibration parameters for Indels
a. Specify which call sets the program should use as resources to build the recalibration model
For each training set, we use key-value tags to qualify whether the set contains known sites, training sites, and/or truth sites. We also use a tag to specify the prior likelihood that those sites are true (using the Phred scale).
- Known and true sites training resource: Mills
The default prior likelihood assigned to all other variants is Q2 (36.90%). This low value reflects the fact that the philosophy of the GATK callers is to produce a large, highly sensitive callset that needs to be heavily refined through additional filtering.
b. Specify which annotations the program should use to evaluate the likelihood of Indels being real
These annotations are included in the information generated for each variant call by the caller. If an annotation is missing (typically because it was omitted from the calling command), it can be added using the VariantAnnotator tool.
- Coverage (DP)
- FisherStrand (FS)
- MappingQualityRankSumTest (MQRankSum)
- ReadPosRankSumTest (ReadPosRankSum)
c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches
- First tranche threshold 100.0
- Second tranche threshold 99.9
- Third tranche threshold 99.0
- Fourth tranche threshold 90.0
d. Determine additional model parameters
- Maximum number of Gaussians (-maxGaussians): 4
5. Build the Indel recalibration model
Action
Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T VariantRecalibrator \
-R reference.fa \
-input recalibrated_snps_raw_indels.vcf \
-resource:mills,known=true,training=true,truth=true,prior=12.0 mills.vcf \
-an DP \
-an FS \
-an MQRankSum \
-an ReadPosRankSum \
-mode INDEL \
-tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
--maxGaussians 4 \
-recalFile recalibrate_INDEL.recal \
-tranchesFile recalibrate_INDEL.tranches \
-rscriptFile recalibrate_INDEL_plots.R
Expected Result
This creates several files. The most important file is the recalibration report, called recalibrate_INDEL.recal, which contains the recalibration data. This is what the program will use in the next step to generate a VCF file in which the variants are annotated with their recalibrated quality scores. There is also a file called recalibrate_INDEL.tranches, which contains the quality score thresholds corresponding to the tranches specified in the original command. Finally, if your installation of R and the other required libraries was done correctly, you will also find some PDF files containing plots. These plots illustrate the distribution of variants according to certain dimensions of the model.
For detailed instructions on how to interpret these plots, please refer to the online GATK documentation.
6. Apply the desired level of recalibration to the Indels in the call set
Action
Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T ApplyRecalibration \
-R reference.fa \
-input recalibrated_snps_raw_indels.vcf \
-mode INDEL \
--ts_filter_level 99.0 \
-recalFile recalibrate_INDEL.recal \
-tranchesFile recalibrate_INDEL.tranches \
-o recalibrated_variants.vcf
Expected Result
This creates a new VCF file, called recalibrated_variants.vcf, which contains all the variants from the recalibrated_snps_raw_indels.vcf file, but now the Indels are also annotated with their recalibrated quality scores (VQSLOD) and marked as either PASS or a tranche-specific filter status, depending on whether or not they are included in the selected tranche.
Here we are taking the second lowest of the tranches specified in the original recalibration command. This means that we are applying to our data set the level of sensitivity that would allow us to retrieve 99% of true variants from the Mills truth training set. If we wanted to be more specific (and therefore have less risk of including false positives, at the cost of missing real sites), we could take the very lowest tranche, which would only retrieve 90% of the truth training sites. If we wanted to be more sensitive (and therefore less specific, at the risk of including more false positives), we could take the higher tranches. In our Best Practices documentation, we recommend taking the second highest tranche (99.9%), which provides the highest sensitivity you can get while still being acceptably specific.
Geraldine Van der Auwera, PhD
Comments
-
avidLearner Posts: 7, Member ✭
Hello, I have a few questions that I would like to clarify:
- Is the file 1000G_phase1.snps.high_confidence.vcf available in the resource bundle? (I looked it up and it wasn't there!) If not, which file is recommended with the updated VariantRecalibrator parameters?
- Just to clarify, are the parameters described above pertinent to whole genomes or whole exomes? (I am assuming whole genomes, but this has not been mentioned explicitly anywhere.) I understand that it is recommended to use --maxGaussians 4 --percentBad 0.05 for a small number of exome samples. In this case, is a -minNumBad of 1000 recommended, or should a lower value be set?
- The VariantRecalibrator documentation parameters (http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantrecalibration_VariantRecalibrator.html) are slightly different from the parameters described here. Is this tutorial the latest best practices guideline that can be followed for VQSR?
-
Geraldine_VdAuwera Posts: 4,410, Administrator, GSA Member, admin
Hi there,
- The filenames are a little different from what we have in the bundle, I will correct that soon (sorry) but basically you want the 1000 Genomes Phase1 SNPs that should be in our bundle.
- These parameters are for WGS. I will add a note to explain that.
- The parameters given in tutorial documents are not necessarily up-to-date with our Best Practices recommendations. These can be found in the Best Practices document on our website, which consolidates (or points to) the actual parameter recommendations. For VQSR there is a specific FAQ document, which is linked to from the BP document. We will try to make that clearer in the documentation.
Geraldine Van der Auwera, PhD
-
avidLearner Posts: 7, Member ✭
Hello Geraldine, I searched for the 1000 Genomes Phase 1 SNPs VCF in ftp://ftp.broadinstitute.org/bundle/2.3/b37/ but no luck! Please tell me if I'm looking in the wrong place.
-
Geraldine_VdAuwera Posts: 4,410, Administrator, GSA Member, admin
Hey @avidLearner, sorry to get back to you so late. You've been looking in version 2.3 of the bundle, but you should be looking in the latest version, which is 2.5.
Geraldine Van der Auwera, PhD
-
avidLearner Posts: 7, Member ✭
Hi Geraldine, yes I figured it out. I was initially accessing ftp://ftp.broadinstitute.org/bundle/ through Internet Explorer and for some reason it wasn't loading the updated version 2.5. But when I accessed the resource bundle through wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.5/ I was able to locate the latest version. Sorry for the trouble. Thanks for the help!
-
yingzhang, minneapolis, Posts: 4, Member ✭
This is a really useful article on variant recalibration, and I think it makes more sense compared to the brief overview at http://www.broadinstitute.org/gatk/guide/topic?name=best-practices
But I do have a question, which might only be my own over-reacting.
In the overview, you state:
=======
We've found that the best results are obtained with the VQSR when it is used to build separate adaptive error models for SNPs versus INDELs:
========
snp.model <- BuildErrorModelWithVQSR(raw.vcf, SNP)
indel.model <- BuildErrorModelWithVQSR(raw.vcf, INDEL)
recalibratedSNPs.rawIndels.vcf <- ApplyRecalibration(raw.vcf, snp.model, SNP)
analysisReady.vcf <- ApplyRecalibration(recalibratedSNPs.rawIndels.vcf, indel.model, INDEL)
The difference between the overview and this FAQ is how one should build the error model for indels.
In the overview, the indel error model is built from raw_variants.vcf, while in this article it is built from recalibrated_snps_raw_indels.vcf.
I assume there should be no cross-talk between the INDEL and SNP types. But if there is, then the indel model built from raw_variants.vcf will differ from the indel model built from recalibrated_snps_raw_indels.vcf.
Should I follow the overview or this FAQ?
And when you say "best results", what else have you tested?
Best, Ying
-
Geraldine_VdAuwera Posts: 4,410, Administrator, GSA Member, admin
Hi Ying,
We're currently rewriting the Best Practices document to make it more readable, so hopefully that will help clarify things. In the meantime, let me try to clear up some of the confusion, which is very common, about how the recalibration works.
When you specify either INDEL or SNP for the model, the model is built using only the specified type of variant, ignoring the other type. Then, when you apply the recalibration (specifying the appropriate variant type), the tool outputs recalibrated variants for that type and emits the other variant type without modification. So if you specified SNPs, your output contains recalibrated SNPs and unmodified indels. You can then repeat the process on that output file, specifying INDEL this time, so that the tool will output recalibrated indels, as well as the already recalibrated SNPs from the previous round.
This replaces the old way of proceeding, where we recalibrated indels and SNPs in separate files, because it is faster and more efficient.
Does that make more sense?
Geraldine Van der Auwera, PhD
-
yingzhang, minneapolis, Posts: 4, Member ✭
Thank you Geraldine. Your comments are really explicit.
-
avilella Posts: 6, Member ✭
Hi @Geraldine_VdAuwera,
Does the two-step process described below also work in version 1.6? (I am using v1.6-22-g3ec78bd)
"When you specify either INDEL or SNP for the model, the model is built using only the specified type of variant, ignoring the other type. Then, when you apply the recalibration (specifying the appropriate variant type), the tool output recalibrated variants for that type, and emits the other variant type without modification. So if you specified SNPs, your output contains recalibrated SNPs and unmodified indels. You can then repeat the process on that output file, specifying INDEL this time, so that the tool will output recalibrated indels, as well as the already recalibrated SNPs from the previous round."
Or is 1.6 still using the separate process?
"old way of proceeding, where we recalibrated indels and SNPs in separate files"
-
splaisan Posts: 14, Member ✭
Further in the workflow, I noticed the now fixed --maxGaussians 4 (not yet fixed in the d. text part) and that the tranche argument followed by a range notation was not accepted on my machine; is this normal?
Thanks.

# fails even with single or double quotes around []
-tranche [100.0, 99.9, 99.0, 90.0] \

# works with
-tranche 100.0 \
-tranche 99.9 \
-tranche 99.0 \
-tranche 90.0 \
Stephane
-
splaisan Posts: 14, Member ✭
Inspired by the broadE 2013 video #5 by Ryan Poplin (maybe a question for Ryan!), great video BTW. Context: I took Illumina NA18507 chr21 mappings (originally to hg18) where most of the bam header is missing and poorly complies with Picard+GATK. I first extracted the reads to FASTQ to remap them to hg19 and then followed the full Best Practices pipeline.
Q: With such data, should I add one @RG corresponding to each original lane present in the hg18 bam file (therefore exporting each lane to a separate FASTQ), in order to take advantage of the lane bias detection of the recalibration? Or will the effect be minimal and not justify the long run?
THX
-
Geraldine_VdAuwera Posts: 4,410, Administrator, GSA Member, admin
Yes, you should definitely assign RGs per original lane for best recalibration results. It is totally worth the extra effort, because different lanes are likely to suffer from different error modes.
Keep in mind also that when reverting to FASTQ, it's important to randomize the reads in order to avoid biasing the aligner. I forget if you're the person I recently discussed this with on this forum; if not I recommend you check out our tutorial on reverting bams to Fastq.
Geraldine Van der Auwera, PhD
-
splaisan Posts: 14, Member ✭
Hi again, about the -an parameters.
When the HC has been used without specifying annotation types (as is the case in the separate tutorial), do we always get the 'an' types described above? Is this the full list, or some best-practice defaults? Is there a rationale for which ones to use in genome sequencing, and what about the mirror question for the UG? Thanks
-
Geraldine_VdAuwera Posts: 4,410, Administrator, GSA Member, admin
There is a set of core annotations that are used by default for both UG and HC. They typically don't change across versions. They are the "Standard" annotations; you can get a list by running VariantAnnotator with the -list argument, or if you want to check if a specific annotation is in the standard set, you can look it up on its tech doc page, toward the top of the page, where it says "Type:". In the current defaults, the only difference between UG and HC is that HC additionally uses ClippingRankSumTest and DepthPerSampleHC.
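For reference, that listing call looks roughly like this (depending on the GATK version, the reference argument may or may not be required):

java -jar GenomeAnalysisTK.jar -T VariantAnnotator -R reference.fa -list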
The choice of the default set (and any other recommendation we may make in the documentation) is based largely on empirical observations of what annotations are consistently informative and useful for filtering in our work. The caveat is that most of our work is done on specific data types (e.g. well-behaved Illumina exomes) so if you're using different data types you may need to experiment with other annotations. But we try to make our recommendations as broadly applicable as possible.
Geraldine Van der Auwera, PhD
-
vsvinti Posts: 31, Member ✭
Hello there, wondering what would be a sensible option for -numBad. You say "The default value is 1000, which is geared for smaller datasets. If you have a dataset that's on the large side, you may need to increase this value considerably."
I have ~ 1,100 samples, and I set -numBad to 10,000. Would that be reasonable? I am working with whole exome data. Your previous recommendations had "-percentBad 0.01 -minNumBad 1000". Does the new numBad replace both of these ? Cheers!
-
Geraldine_VdAuwera Posts: 4,410, Administrator, GSA Member, admin
It's hard to say up front because it depends a lot on your data, but I think 10K out of 900K is a good place to start, though I would recommend doing several runs with increasing numbers, and choosing the one that gives the best-looking plots. The good news is we are working on a way to have the recalibrator calculate the optimal number automatically, to take the guesswork out of this process...
Geraldine Van der Auwera, PhD
-
vsvinti Posts: 31, Member ✭
Hi there, another question on VQSR training sets / options. This page suggests the following:

-resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf \
-resource:omni,known=false,training=true,truth=false,prior=12.0 omni.vcf \
-resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf \
-an DP -an QD -an FS -an MQRankSum -an ReadPosRankSum \
-mode SNP -numBad 1000 \
-tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0

While the section on "What VQSR training sets / arguments should I use for my specific project?" says:

--numBadVariants 1000 \
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
-resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.sites.vcf \
-resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \
-an DP -an QD -an FS -an MQRankSum -an ReadPosRankSum \
Both of these pages are on the current best practices page. For omni, the first suggests truth=false, while the latter says truth=true. I assume one of them is wrong - please advise which is correct. Thanks.
-
Geraldine_VdAuwera Posts: 4,410, Administrator, GSA Member, admin
Hi @vsvinti,
Thanks for pointing that out, I understand it is confusing. For the question of which document is correct, the general rule is that tutorials are only meant to provide usage examples and do not necessarily include the latest Best Practices parameter values. Though in this case it actually is a mistake in the command line; if you look at the parameter description earlier in the text, Omni is correctly identified as a truth set. I'll fix this in the text.
Geraldine Van der Auwera, PhD
-
vsvinti Posts: 31, Member ✭
Thanks for that. I have a general tendency to consider anything I find in the Best Practices as the updates to go by. Since the tutorial link came first, I thought it was the new update. I wonder if there's a way to point out in the Best Practices which pages contain the latest option recommendations, and which don't. I generally don't read all of the subsections if I think I've found the answer. Thanks.
-
vsvinti Posts: 31, Member ✭
How about for indels? The tutorial only has Mills (all set to true), while the methods article has:

-resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.sites.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \
-
Geraldine_VdAuwera Posts: 4,410, Administrator, GSA Member, admin
The methods article is always right over the tutorial. For the record, the difference it makes is very minor -- if you use the dbsnp file as known=true, that's what the recalibrator will use to annotate variants as known or novel. Otherwise it will use the Mills set. But this does not play any role in the recalibration model calculations.
Geraldine Van der Auwera, PhD
-
Geraldine_VdAuwera Posts: 4,410, Administrator, GSA Member, admin
Oh I see what you mean. Yes, it's essentially the same thing, although for exomes you should specify a list of intervals (which you should already be doing for previous steps as well). And for the ti/tv-based tranche plots to look good, you may need to adjust the expected Ti/Tv ratio parameter.
Geraldine Van der Auwera, PhD
-
rrahman Posts: 5, Member ✭
Yes, I took care of the intervals in the previous steps already, and thanks again for pointing out the expected Ti/Tv ratio parameter.
Cheers Rubayte