Saturday, January 11, 2014

(GATK forum) Recalibrate variant quality scores = run VQSR

Objective

Recalibrate variant quality scores and produce a callset filtered for the desired levels of sensitivity and specificity.

Prerequisites

  • TBD

Caveats

This document provides a typical usage example including parameter values. However, the values given may not be representative of the latest Best Practices recommendations. When in doubt, please consult the FAQ document on VQSR training sets and parameters, which overrides this document.

Steps

  1. Prepare recalibration parameters for SNPs
    a. Specify which call sets the program should use as resources to build the recalibration model
    b. Specify which annotations the program should use to evaluate the likelihood of SNPs being real
    c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches
    d. Determine additional model parameters
  2. Build the SNP recalibration model
  3. Apply the desired level of recalibration to the SNPs in the call set
  4. Prepare recalibration parameters for Indels
    a. Specify which call sets the program should use as resources to build the recalibration model
    b. Specify which annotations the program should use to evaluate the likelihood of Indels being real
    c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches
    d. Determine additional model parameters
  5. Build the Indel recalibration model
  6. Apply the desired level of recalibration to the Indels in the call set

1. Prepare recalibration parameters for SNPs

a. Specify which call sets the program should use as resources to build the recalibration model

For each training set, we use key-value tags to qualify whether the set contains known sites, training sites, and/or truth sites. We also use a tag to specify the prior likelihood that those sites are true (using the Phred scale).
  • True sites training resource: HapMap
This resource is a SNP call set that has been validated to a very high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). We will also use these sites later on to choose a threshold for filtering variants based on sensitivity to truth sites. The prior likelihood we assign to these variants is Q15 (96.84%).
  • True sites training resource: Omni
This resource is a set of polymorphic SNP sites produced by the Omni genotyping array. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).
  • Non-true sites training resource: 1000G
This resource is a set of high-confidence SNP sites produced by the 1000 Genomes Project. The program will consider that the variants in this resource may contain true variants as well as false positives (truth=false), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q10 (90.00%).
  • Known sites resource, not used in training: dbSNP
This resource is a SNP call set that has not been validated to a high degree of confidence (truth=false). The program will not use the variants in this resource to train the recalibration model (training=false). However, the program will use these to stratify output metrics such as Ti/Tv ratio by whether variants are present in dbsnp or not (known=true). The prior likelihood we assign to these variants is Q2 (36.90%).
The default prior likelihood assigned to all other variants is Q2 (36.90%). This low value reflects the fact that the philosophy of the GATK callers is to produce a large, highly sensitive callset that needs to be heavily refined through additional filtering.
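The Phred-scaled priors above convert to probabilities as p = 1 - 10^(-Q/10). As a quick sanity check on the percentages quoted for each resource (plain Python, not a GATK tool):

```python
def phred_to_prob(q):
    # Phred scale: Q = -10 * log10(P(site is false)),
    # so P(site is true) = 1 - 10^(-Q/10).
    return 1.0 - 10.0 ** (-q / 10.0)

# Priors used for the SNP resources above.
for name, q in [("hapmap", 15.0), ("omni", 12.0), ("1000G", 10.0), ("dbsnp", 2.0)]:
    print(f"{name}: Q{q:g} -> {phred_to_prob(q):.2%}")
# hapmap: Q15 -> 96.84%
# omni: Q12 -> 93.69%
# 1000G: Q10 -> 90.00%
# dbsnp: Q2 -> 36.90%
```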

b. Specify which annotations the program should use to evaluate the likelihood of SNPs being real

These annotations are included in the information generated for each variant call by the caller. If an annotation is missing (typically because it was omitted from the calling command) it can be added using the VariantAnnotator tool.
  • Coverage (DP)
Total (unfiltered) depth of coverage.
  • QualByDepth (QD)
Variant confidence (from the QUAL field) / unfiltered depth of non-reference samples.
  • FisherStrand (FS)
Phred-scaled p-value using Fisher's Exact Test to detect strand bias (the variation being seen on only the forward or only the reverse strand) in the reads. More bias is indicative of false positive calls.
  • MappingQualityRankSumTest (MQRankSum)
The u-based z-approximation from the Mann-Whitney Rank Sum Test for mapping qualities (reads with ref bases vs. those with the alternate allele). Note that the mapping quality rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.
  • ReadPosRankSumTest (ReadPosRankSum)
The u-based z-approximation from the Mann-Whitney Rank Sum Test for the distance from the end of the read for reads with the alternate allele. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.
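For intuition, the two rank-sum annotations boil down to the Mann-Whitney U statistic and its normal approximation. The sketch below is a simplified, tie-uncorrected version of that computation (illustrative only; GATK's implementation handles ties and sign conventions more carefully):

```python
from math import sqrt

def rank_sum_z(ref_vals, alt_vals):
    # Simplified u-based z-approximation of the Mann-Whitney rank-sum test.
    # No tie or continuity correction (GATK's version is more elaborate).
    combined = sorted((v, grp) for grp, vals in ((0, ref_vals), (1, alt_vals))
                      for v in vals)
    # Sum of 1-based ranks held by the alt-allele group.
    r_alt = sum(i + 1 for i, (_, grp) in enumerate(combined) if grp == 1)
    n_ref, n_alt = len(ref_vals), len(alt_vals)
    u = r_alt - n_alt * (n_alt + 1) / 2           # U statistic for the alt group
    mean_u = n_ref * n_alt / 2
    sd_u = sqrt(n_ref * n_alt * (n_ref + n_alt + 1) / 12)
    return (u - mean_u) / sd_u

# Alt reads with much lower mapping quality than ref reads -> negative z,
# which is the pattern suggestive of a false positive call.
print(rank_sum_z([60, 60, 60], [20, 20, 20]))  # about -1.96
```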

c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches

  • First tranche threshold 100.0
  • Second tranche threshold 99.9
  • Third tranche threshold 99.0
  • Fourth tranche threshold 90.0
Tranches are essentially slices of variants, ranked by VQSLOD, bounded by the threshold values specified in this step. The threshold values themselves refer to the sensitivity we can obtain when we apply them to the call sets that the program uses to train the model. The idea is that the lowest tranche is highly specific but less sensitive (there are very few false positives but potentially many false negatives, i.e. missing calls), and each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls. This allows us to filter variants based on how sensitive we want the call set to be, rather than applying hard filters and then only evaluating how sensitive the call set is using post hoc methods.
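The tranche idea can be sketched in a few lines: rank variants by VQSLOD, then find the score cutoff that retains the target fraction of truth sites. This is a toy illustration of the concept, not VariantRecalibrator's actual algorithm:

```python
def tranche_cutoff(variants, target_sensitivity):
    # variants: list of (vqslod, is_truth_site) pairs.
    # Returns the VQSLOD cutoff that retains target_sensitivity of truth sites.
    ranked = sorted(variants, key=lambda v: v[0], reverse=True)
    total_truth = sum(1 for _, is_truth in ranked if is_truth)
    needed = target_sensitivity * total_truth
    seen = 0
    for vqslod, is_truth in ranked:
        seen += is_truth
        if seen >= needed:
            return vqslod   # everything at or above this score is in the tranche
    return ranked[-1][0]

calls = [(5.0, True), (4.0, True), (3.0, False), (2.0, True), (1.0, False)]
print(tranche_cutoff(calls, 1.00))  # 2.0: keeping all truth sites needs a low cutoff
print(tranche_cutoff(calls, 0.66))  # 4.0: a stricter cutoff keeps 2 of 3 truth sites
```

Note how higher sensitivity targets force the cutoff down the VQSLOD ranking, sweeping in more false positives along the way, exactly as the tranche plot shows.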

2. Build the SNP recalibration model

Action

Run the following GATK command:
java -jar GenomeAnalysisTK.jar \ 
    -T VariantRecalibrator \ 
    -R reference.fa \ 
    -input raw_variants.vcf \ 
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf \ 
    -resource:omni,known=false,training=true,truth=true,prior=12.0 omni.vcf \ 
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G.vcf \ 
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf \ 
    -an DP \ 
    -an QD \ 
    -an FS \ 
    -an MQRankSum \ 
    -an ReadPosRankSum \ 
    -mode SNP \ 
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \ 
    -recalFile recalibrate_SNP.recal \ 
    -tranchesFile recalibrate_SNP.tranches \ 
    -rscriptFile recalibrate_SNP_plots.R 

Expected Result

This creates several files. The most important file is the recalibration report, called recalibrate_SNP.recal, which contains the recalibration data. This is what the program will use in the next step to generate a VCF file in which the variants are annotated with their recalibrated quality scores. There is also a file called recalibrate_SNP.tranches, which contains the quality score thresholds corresponding to the tranches specified in the original command. Finally, if your installation of R and the other required libraries was done correctly, you will also find some PDF files containing plots. These plots illustrate the distribution of variants according to certain dimensions of the model.
For detailed instructions on how to interpret these plots, please refer to the online GATK documentation.

3. Apply the desired level of recalibration to the SNPs in the call set

Action

Run the following GATK command:
java -jar GenomeAnalysisTK.jar \ 
    -T ApplyRecalibration \ 
    -R reference.fa \ 
    -input raw_variants.vcf \ 
    -mode SNP \ 
    --ts_filter_level 99.0 \ 
    -recalFile recalibrate_SNP.recal \ 
    -tranchesFile recalibrate_SNP.tranches \ 
    -o recalibrated_snps_raw_indels.vcf 

Expected Result

This creates a new VCF file, called recalibrated_snps_raw_indels.vcf, which contains all the original variants from the original raw_variants.vcf file, but now the SNPs are annotated with their recalibrated quality scores (VQSLOD) and either PASS or FILTER depending on whether or not they are included in the selected tranche.
Here we are taking the second lowest of the tranches specified in the original recalibration command. This means that we are applying to our data set the level of sensitivity that would allow us to retrieve 99% of true variants from the truth training sets of HapMap and Omni SNPs. If we wanted to be more specific (and therefore have less risk of including false positives, at the risk of missing real sites) we could take the very lowest tranche, which would only retrieve 90% of the truth training sites. If we wanted to be more sensitive (and therefore less specific, at the risk of including more false positives) we could take the higher tranches. In our Best Practices documentation, we recommend taking the second highest tranche (99.9%) which provides the highest sensitivity you can get while still being acceptably specific.
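To see how the chosen tranche played out, you can tally the FILTER column of the output VCF. A minimal sketch (the tranche filter name in the example data is an assumed illustration of GATK's VQSRTranche naming pattern):

```python
from collections import Counter

def filter_summary(vcf_lines):
    # Count VCF records by their FILTER field (column 7), skipping header lines.
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        counts[line.rstrip("\n").split("\t")[6]] += 1
    return counts

example = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t50.0\tPASS\t.",
    "1\t200\t.\tC\tT\t30.0\tVQSRTrancheSNP99.00to99.90\t.",
]
print(filter_summary(example))
```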

4. Prepare recalibration parameters for Indels

a. Specify which call sets the program should use as resources to build the recalibration model

For each training set, we use key-value tags to qualify whether the set contains known sites, training sites, and/or truth sites. We also use a tag to specify the prior likelihood that those sites are true (using the Phred scale).
  • Known and true sites training resource: Mills
This resource is an Indel call set that has been validated to a high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).
The default prior likelihood assigned to all other variants is Q2 (36.90%). This low value reflects the fact that the philosophy of the GATK callers is to produce a large, highly sensitive callset that needs to be heavily refined through additional filtering.

b. Specify which annotations the program should use to evaluate the likelihood of Indels being real

These annotations are included in the information generated for each variant call by the caller. If an annotation is missing (typically because it was omitted from the calling command) it can be added using the VariantAnnotator tool.
  • Coverage (DP)
Total (unfiltered) depth of coverage.
  • FisherStrand (FS)
Phred-scaled p-value using Fisher's Exact Test to detect strand bias (the variation being seen on only the forward or only the reverse strand) in the reads. More bias is indicative of false positive calls.
  • MappingQualityRankSumTest (MQRankSum)
The u-based z-approximation from the Mann-Whitney Rank Sum Test for mapping qualities (reads with ref bases vs. those with the alternate allele). Note that the mapping quality rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.
  • ReadPosRankSumTest (ReadPosRankSum)
The u-based z-approximation from the Mann-Whitney Rank Sum Test for the distance from the end of the read for reads with the alternate allele. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.

c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches

  • First tranche threshold 100.0
  • Second tranche threshold 99.9
  • Third tranche threshold 99.0
  • Fourth tranche threshold 90.0
Tranches are essentially slices of variants, ranked by VQSLOD, bounded by the threshold values specified in this step. The threshold values themselves refer to the sensitivity we can obtain when we apply them to the call sets that the program uses to train the model. The idea is that the lowest tranche is highly specific but less sensitive (there are very few false positives but potentially many false negatives, i.e. missing calls), and each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls. This allows us to filter variants based on how sensitive we want the call set to be, rather than applying hard filters and then only evaluating how sensitive the call set is using post hoc methods.

d. Determine additional model parameters

  • Maximum number of Gaussians (-maxGaussians) 4
This is the maximum number of Gaussians (i.e. clusters of variants that have similar properties) that the program should try to identify when it runs the variational Bayes algorithm that underlies the machine learning method. In essence, this limits the number of different "profiles" of variants that the program will try to identify. This number should only be increased for datasets that include a very large number of variants.

5. Build the Indel recalibration model

Action

Run the following GATK command:
java -jar GenomeAnalysisTK.jar \ 
    -T VariantRecalibrator \ 
    -R reference.fa \ 
    -input recalibrated_snps_raw_indels.vcf \ 
    -resource:mills,known=true,training=true,truth=true,prior=12.0 mills.vcf \ 
    -an DP \ 
    -an FS \ 
    -an MQRankSum \ 
    -an ReadPosRankSum \ 
    -mode INDEL \ 
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \ 
    --maxGaussians 4 \ 
    -recalFile recalibrate_INDEL.recal \ 
    -tranchesFile recalibrate_INDEL.tranches \ 
    -rscriptFile recalibrate_INDEL_plots.R 

Expected Result

This creates several files. The most important file is the recalibration report, called recalibrate_INDEL.recal, which contains the recalibration data. This is what the program will use in the next step to generate a VCF file in which the variants are annotated with their recalibrated quality scores. There is also a file called recalibrate_INDEL.tranches, which contains the quality score thresholds corresponding to the tranches specified in the original command. Finally, if your installation of R and the other required libraries was done correctly, you will also find some PDF files containing plots. These plots illustrate the distribution of variants according to certain dimensions of the model.
For detailed instructions on how to interpret these plots, please refer to the online GATK documentation.

6. Apply the desired level of recalibration to the Indels in the call set

Action

Run the following GATK command:
java -jar GenomeAnalysisTK.jar \ 
    -T ApplyRecalibration \ 
    -R reference.fa \ 
    -input recalibrated_snps_raw_indels.vcf \ 
    -mode INDEL \ 
    --ts_filter_level 99.0 \ 
    -recalFile recalibrate_INDEL.recal \ 
    -tranchesFile recalibrate_INDEL.tranches \ 
    -o recalibrated_variants.vcf 

Expected Result

This creates a new VCF file, called recalibrated_variants.vcf, which contains all the original variants from the original recalibrated_snps_raw_indels.vcf file, but now the Indels are also annotated with their recalibrated quality scores (VQSLOD) and either PASS or FILTER depending on whether or not they are included in the selected tranche.
Here we are taking the second lowest of the tranches specified in the original recalibration command. This means that we are applying to our data set the level of sensitivity that would allow us to retrieve 99% of true variants from the Mills truth training set of indels. If we wanted to be more specific (and therefore have less risk of including false positives, at the risk of missing real sites) we could take the very lowest tranche, which would only retrieve 90% of the truth training sites. If we wanted to be more sensitive (and therefore less specific, at the risk of including more false positives) we could take the higher tranches. In our Best Practices documentation, we recommend taking the second highest tranche (99.9%), which provides the highest sensitivity you can get while still being acceptably specific.
Geraldine Van der Auwera, PhD

Comments

  • avidLearner (Posts: 7, Member)
    Hello, I have a few questions that I would like to clarify:
    1. Is the file 1000G_phase1.snps.high_confidence.vcf available in the resource bundle? (I looked it up and it wasn't there!) If not, which file is recommended with the updated VariantRecalibrator parameters?
    2. Just to clarify, are the parameters described above pertinent to whole genomes or whole exomes? (I am assuming whole genomes but this has not been mentioned explicitly anywhere.) I understand that it is recommended to use --maxGaussians 4 --percentBad 0.05 for few exome samples. In this case, is -minNumBad of 1000 recommended or should a lower value be set?
    3. The VariantRecalibrator documentation parameters (http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantrecalibration_VariantRecalibrator.html) are slightly different from the parameters described here. Is this tutorial the latest best practices guideline that can be followed for VQSR?
    Thanks for your help!
  • Geraldine_VdAuwera (Posts: 4,410, Administrator, GSA Member)
    Hi there,
    • The filenames are a little different from what we have in the bundle, I will correct that soon (sorry) but basically you want the 1000 Genomes Phase1 SNPs that should be in our bundle.
    • These parameters are for WGS. I will add a note to explain that.
    • The parameters given in tutorial documents are not necessarily up-to-date with our Best Practices recommendations. These can be found in the Best Practices document on our website that concentrates (or points to) the actual parameter recommendations. For VQSR there is a specific FAQ document, which is linked to by the BP document. We will try to make that more clear in the documentation.
    Geraldine Van der Auwera, PhD
  • avidLearner (Posts: 7, Member)
    Hello Geraldine, I searched for the 1000 Genomes Phase 1 SNPs VCF in ftp://ftp.broadinstitute.org/bundle/2.3/b37/ but no luck! Please tell me if I'm looking in the wrong place.
  • Geraldine_VdAuwera (Posts: 4,410, Administrator, GSA Member)
    Hey @avidLearner, sorry to get back to you so late. You've been looking in the version 2.3 of the bundle, but you should be looking in the latest version, which is 2.5.
    Geraldine Van der Auwera, PhD
  • avidLearner (Posts: 7, Member)
    Hi Geraldine, yes I figured it out. I was initially accessing ftp://ftp.broadinstitute.org/bundle/ through Internet Explorer and for some reason it wasn't loading the updated version 2.5. But when I accessed the resource bundle through wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.5/ I was able to locate the latest version. Sorry for the trouble. Thanks for the help!
  • yingzhang (minneapolis, Posts: 4, Member)
    edited August 2013
    This is a really useful article on the variant recalibration. And I think it makes more sense compared to the brief overview at http://www.broadinstitute.org/gatk/guide/topic?name=best-practices
    But I do have a question, which might only be my own over-reacting.
    In the overview, you state:
    =======
    We've found that the best results are obtained with the VQSR when it is used to build separate adaptive error models for SNPs versus INDELs:
    snp.model <- BuildErrorModelWithVQSR(raw.vcf, SNP)
    indel.model <- BuildErrorModelWithVQSR(raw.vcf, INDEL)
    recalibratedSNPs.rawIndels.vcf <- ApplyRecalibration(raw.vcf, snp.model, SNP)
    analysisReady.vcf <- ApplyRecalibration(recalibratedSNPs.rawIndels.vcf, indel.model, INDEL)
    ========
    The difference between the overview and this FAQ is that how one should build the error model for indel.
    In overview, it builds the error model for indel from the raw_variant.vcf; while in this article the indel model is built from the recalibrated_snps_raw_indels.vcf.
    I assume there should be no cross-talk between the INDEL and SNP types. But if there is such a case, then the indel model from raw_variant.vcf will differ from the indel model from the recalibrated_snps_raw_indels.vcf.
    Should I follow the overview or this FAQ?
    And when you say "best results", what else have you tested?
    Best, Ying
  • Geraldine_VdAuwera (Posts: 4,410, Administrator, GSA Member)
    Hi Ying,
    We're currently rewriting the Best Practices document to make it more readable, so hopefully that will help clarify things. In the meantime, let me try to clear up some of the confusion, which is very common, about how the recalibration works.
    When you specify either INDEL or SNP for the model, the model is built using only the specified type of variant, ignoring the other type. Then, when you apply the recalibration (specifying the appropriate variant type), the tool outputs recalibrated variants for that type, and emits the other variant type without modification. So if you specified SNPs, your output contains recalibrated SNPs and unmodified indels. You can then repeat the process on that output file, specifying INDEL this time, so that the tool will output recalibrated indels, as well as the already recalibrated SNPs from the previous round.
    This replaces the old way of proceeding, where we recalibrated indels and SNPs in separate files, because it is faster and more efficient.
    Does that make more sense?
    Geraldine Van der Auwera, PhD
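The two-pass behaviour described in the reply above can be sketched as follows (hypothetical record and model shapes, purely to show which records each pass touches):

```python
def apply_recalibration(records, model, mode):
    # Recalibrate records of the given type; pass the other type through unchanged.
    out = []
    for rec in records:
        if rec["type"] == mode:
            rec = dict(rec, vqslod=model(rec), recalibrated=True)
        out.append(rec)
    return out

raw = [{"type": "SNP"}, {"type": "INDEL"}]
toy_model = lambda rec: 1.0  # stand-in for the trained mixture model
# Pass 1: SNPs recalibrated, indels emitted untouched.
pass1 = apply_recalibration(raw, toy_model, "SNP")
# Pass 2, run on the pass-1 output: indels recalibrated, SNPs kept from pass 1.
pass2 = apply_recalibration(pass1, toy_model, "INDEL")
```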
  • yingzhang (minneapolis, Posts: 4, Member)
    Thank you Geraldine. Your comments are really explicit.
    @Geraldine_VdAuwera said: Hi Ying,
    We're currently rewriting the Best Practices document to make it more readable, so hopefully that will help clarify things. In the meantime, let me try to clear up some of the confusion, which is very common, about how the recalibration works.
    When you specify either INDEL or SNP for the model, the model is built using only the specified type of variant, ignoring the other type. Then, when you apply the recalibration (specifying the appropriate variant type), the tool output recalibrated variants for that type, and emits the other variant type without modification. So if you specified SNPs, your output contains recalibrated SNPs and unmodified indels. You can then repeat the process on that output file, specifying INDEL this time, so that the tool will output recalibrated indels, as well as the already recalibrated SNPs from the previous round.
    This replaces the old way of proceeding, where we recalibrated indels and SNPs in separate files, because it is faster and more efficient.
    Does that make more sense?
  • avilella (Posts: 6, Member)
    Hi @Geraldine_VdAuwera,
    Is this two step process described below also working for version 1.6? (I am using v1.6-22-g3ec78bd)
    "When you specify either INDEL or SNP for the model, the model is built using only the specified type of variant, ignoring the other type. Then, when you apply the recalibration (specifying the appropriate variant type), the tool output recalibrated variants for that type, and emits the other variant type without modification. So if you specified SNPs, your output contains recalibrated SNPs and unmodified indels. You can then repeat the process on that output file, specifying INDEL this time, so that the tool will output recalibrated indels, as well as the already recalibrated SNPs from the previous round."
    Or is 1.6 still using the separate process?
    "old way of proceeding, where we recalibrated indels and SNPs in separate files"
  • Geraldine_VdAuwera (Posts: 4,410, Administrator, GSA Member)
    With 1.6 you should recalibrate variants separately (the old way).
    FYI version 2.7 has some key improvements to VQSR that make it work better and more consistently. We really don't recommend using older versions.
    Geraldine Van der Auwera, PhD
  • avilella (Posts: 6, Member)
    edited September 2013
    Can I recalibrate each of the chromosomes as separate vcf, each time pointing to the whole resource files? Will it make a huge difference compared to merging the chromosome vcfs and recalibrating only once?
  • Geraldine_VdAuwera (Posts: 4,410, Administrator, GSA Member)
    It's much better to recalibrate all the chromosomes together because it gives the VQSR model more data, which will make it more accurate and reliable. In many cases individual chromosomes don't even have enough variants for recalibration to work at all.
    Geraldine Van der Auwera, PhD
  • prepagam (Posts: 8, Member)
    Can I check whether VQSR should be done on each population separately, as with the Unified Genotyper? I have 40 samples but from many populations - so probably not enough data for VQSR if it's run by population.
  • Geraldine_VdAuwera (Posts: 4,410, Administrator, GSA Member)
    Hi @prepagam,
    You should run VQSR jointly only on samples that were called jointly; so if you called your samples in subsets by population, you also have to recalibrate them according to the same subsets.
    Geraldine Van der Auwera, PhD
  • splaisan (Posts: 14, Member)
    Further in the workflow, I noticed the now-fixed --maxGaussians 4 (not yet fixed in the "d." text part), and that the -tranche argument followed by a range notation was not accepted on my machine; is this normal?
    # fails even with single or double quotes around []
        -tranche [100.0, 99.9, 99.0, 90.0] \ 
    # works with
        -tranche 100.0 \
        -tranche 99.9 \
        -tranche 99.0 \
        -tranche 90.0 \
    thanks
    Stephane
  • Geraldine_VdAuwera (Posts: 4,410, Administrator, GSA Member)
    Hi @splaisan,
    You are correct, the way you got it to work is the right way to do it. I'll rephrase it in the doc since it seems to be confusing people. Thanks for pointing this out!
    Geraldine Van der Auwera, PhD
  • splaisan (Posts: 14, Member)
    sad, the bracket notation was cool! ;-)
  • Geraldine_VdAuwera (Posts: 4,410, Administrator, GSA Member)
    Yep. Maybe if we find time we'll amend how the arguments can be passed in... (but don't hold your breath)
    Geraldine Van der Auwera, PhD
  • splaisan (Posts: 14, Member)
    Inspired by the broadE 2013 video #5 by Ryan Poplin (maybe a question for Ryan!) - great video BTW. Context: I took Illumina NA18507 chr21 mappings (originally to hg18) where most of the bam header is missing and poorly complies with Picard+GATK. I first extracted the reads to FASTQ to remap them to hg19 and then followed the full Best Practices pipeline.
    Q: With such data, should I add one @RG corresponding to each original lane present in the hg18 bam file (therefore exporting each lane to a separate FASTQ) in order to take advantage of the lane bias detection of the recalibration? Or will the effect be minimal and not justify the long run?
    THX
  • Geraldine_VdAuwera (Posts: 4,410, Administrator, GSA Member)
    Yes, you should definitely assign RGs per original lane for best recalibration results. It is totally worth the extra effort, because different lanes are likely to suffer from different error modes.
    Keep in mind also that when reverting to FASTQ, it's important to randomize the reads in order to avoid biasing the aligner. I forget if you're the person I recently discussed this with on this forum; if not I recommend you check out our tutorial on reverting bams to Fastq.
    Geraldine Van der Auwera, PhD
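The randomization mentioned above amounts to shuffling the read records before writing FASTQ, so the aligner never sees the position-sorted order (a sketch only; in the real pipeline Picard's revert/convert tooling covers this, as described in the tutorial Geraldine references):

```python
import random

def shuffle_reads(records, seed=0):
    # Return the reads in randomized order; a fixed seed keeps runs reproducible.
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled

reads = [f"read{i}" for i in range(6)]
print(shuffle_reads(reads))  # same reads, randomized order
```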
  • splaisan (Posts: 14, Member)
    Hi again, about the -an parameters.
    When the HC has been used without specifying annotation types (as is the case in the separate tutorial), do we always get the -an types described above? Is this the full list or some Best Practice defaults? Is there a rationale for which ones to use in genome sequencing, and what about the mirror question for the UG? Thanks
  • Geraldine_VdAuwera (Posts: 4,410, Administrator, GSA Member)
    There is a set of core annotations that are used by default for both UG and HC. They typically don't change across versions. They are the "Standard" annotations; you can get a list by running VariantAnnotator with the -list argument, or if you want to check if a specific annotation is in the standard set, you can look it up on its tech doc page, e.g. here toward the top of the page, where it says "Type:". In the current defaults, the only difference between UG and HC is that HC additionally uses ClippingRankSumTest and DepthPerSampleHC.
    The choice of the default set (and any other recommendation we may make in the documentation) is based largely on empirical observations of what annotations are consistently informative and useful for filtering in our work. The caveat is that most of our work is done on specific data types (e.g. well-behaved Illumina exomes) so if you're using different data types you may need to experiment with other annotations. But we try to make our recommendations as broadly applicable as possible.
    Geraldine Van der Auwera, PhD
  • vsvinti (Posts: 31, Member)
    Hello there, I'm wondering what would be a sensible option for -numBad. You say "The default value is 1000, which is geared for smaller datasets. If you have a dataset that's on the large side, you may need to increase this value considerably."
    I have ~ 1,100 samples, and I set -numBad to 10,000. Would that be reasonable? I am working with whole exome data. Your previous recommendations had "-percentBad 0.01 -minNumBad 1000". Does the new numBad replace both of these ? Cheers!
  • Geraldine_VdAuwera (Posts: 4,410, Administrator, GSA Member)
    Yes, the new -numBad argument replaces the other two.
    The important metric for setting numBad is not really the number of samples, but more so the number of variants initially called.
    Geraldine Van der Auwera, PhD
  • vsvinti (Posts: 31, Member)
    Right, of course. So for ~ 900,000 variants, would 10,000 suffice for numBad (or how does one determine the 'ideal' number)?
  • Geraldine_VdAuwera (Posts: 4,410, Administrator, GSA Member)
    It's hard to say up front because it depends a lot on your data, but I think 10K out of 900K is a good place to start, though I would recommend doing several runs with increasing numbers, and choosing the one that gives the best-looking plots. The good news is we are working on a way to have the recalibrator calculate the optimal number automatically to take the guesswork out of this process...
    Geraldine Van der Auwera, PhD
  • vsvinti (Posts: 31, Member)
    That's great, thanks!
  • vsvinti (Posts: 31, Member)
    Hi there, another question on VQSR training sets / options. This page suggests the following:
        -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf \
        -resource:omni,known=false,training=true,truth=false,prior=12.0 omni.vcf \
        -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G.vcf \
        -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf \
        -an DP -an QD -an FS -an MQRankSum -an ReadPosRankSum \
        -mode SNP -numBad 1000 \
        -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0
    While the section on "What VQSR training sets / arguments should I use for my specific project?" says:
        --numBadVariants 1000 \
        -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
        -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.sites.vcf \
        -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.vcf \
        -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \
        -an DP -an QD -an FS -an MQRankSum -an ReadPosRankSum \
    Both of these pages are on the current best practices page. For omni, the first suggests truth=false, while the latter says truth=true. I assume one of them is wrong - please advise which is correct. Thanks.
  • Geraldine_VdAuwera (Administrator, GSA Member)
    Hi @vsvinti,
    Thanks for pointing that out, I understand it is confusing. As for which document is correct, the general rule is that tutorials are only meant to provide usage examples and do not necessarily include the latest Best Practices parameter values. In this case, though, it is actually a mistake in the command line; if you look at the parameter description earlier in the text, Omni is correctly identified as a truth set. I'll fix this in the text.
    Geraldine Van der Auwera, PhD
  • vsvinti (Member)
    edited November 2013
    Thanks for that. I tend to treat anything I find in the Best Practices as the recommendations to go by, and since the tutorial link came first, I assumed it contained the latest updates. I wonder if there's a way to point out in the Best Practices which pages contain the latest option recommendations and which don't; I generally don't read all the subsections once I think I've found the answer. Thanks.
  • vsvinti (Member)
    How about for indels? The tutorial only has Mills (all set to true), while the methods article has:
        -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.sites.vcf \
        -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \
  • Geraldine_VdAuwera (Administrator, GSA Member)
    "I wonder if there's a way to point out in best practices which pages contain the latest option recommendations, and which don't"
    I'll try to think of something to make this clearer.
    Geraldine Van der Auwera, PhD
  • Geraldine_VdAuwera (Administrator, GSA Member)
    The methods article always takes precedence over the tutorial. For the record, the difference it makes is very minor: if you use the dbsnp file as known=true, that's what the recalibrator will use to annotate variants as known or novel; otherwise it will use the Mills set. But this does not play any role in the recalibration model calculations.
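    Put together, the indel recalibration step the methods article describes might look like the following sketch; the annotation list, input name, and output names are assumptions, not prescribed values:

    ```shell
    # Indel recalibration per the methods article: Mills as the training and
    # truth resource, dbsnp as known-only (used for known/novel annotation,
    # not for the model). Input and output file names are placeholders.
    java -jar GenomeAnalysisTK.jar \
      -T VariantRecalibrator \
      -R reference.fasta \
      -input raw_variants.vcf \
      -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.sites.vcf \
      -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \
      -an DP -an FS -an MQRankSum -an ReadPosRankSum \
      -mode INDEL \
      -recalFile indels.recal \
      -tranchesFile indels.tranches \
      -rscriptFile indels.plots.R
    ```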
    Geraldine Van der Auwera, PhD
  • vsvinti (Member)
    Cheers.
  • rrahman (Member)
    Should we add the -numBad parameter to the indel recalibration model as well for whole-exome datasets? Or anything else? Thanks in advance.
  • Geraldine_VdAuwera (Administrator, GSA Member)
    Hi @rrahman,
    The -numBad parameter has been removed from VQSR in the latest version (2.8), so unless you're using an older version, you should not try to use this parameter.
    Geraldine Van der Auwera, PhD
  • rrahman (Member)
    So, for version 2.8, shall I use the exact same recalibration model for both SNPs and indels on whole-exome datasets?
  • Geraldine_VdAuwera (Administrator, GSA Member)
    I'm not sure what you mean by "exact recalibration model". Basically, follow the instructions as written in the article above.
    Geraldine Van der Auwera, PhD
  • rrahman (Member)
    I meant the same parameters as shown in the above section. I was a bit confused because in the earlier comments you said they were for whole-genome datasets. But anyway, thanks for the clarification.
  • Geraldine_VdAuwera (Administrator, GSA Member)
    Oh, I see what you mean. Yes, it's essentially the same thing, although for exomes you should specify a list of intervals (which you should already be doing for the previous steps as well). And for the Ti/Tv-based tranche plots to look good, you may need to adjust the expected Ti/Tv ratio parameter.
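    As a sketch of those two exome adjustments, one might add an intervals file and the expected-Ti/Tv argument to the same SNP recalibration command used for whole genomes; the interval file name, the 3.0 value, and the output names here are illustrative placeholders, not recommendations:

    ```shell
    # Same SNP recalibration as for whole genomes, restricted to the exome
    # capture targets (-L) and with the expected novel Ti/Tv ratio adjusted
    # (--target_titv) so the tranche plots make sense. All file names and
    # the 3.0 value are placeholders.
    java -jar GenomeAnalysisTK.jar \
      -T VariantRecalibrator \
      -R reference.fasta \
      -input raw_variants.vcf \
      -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf \
      -resource:omni,known=false,training=true,truth=true,prior=12.0 omni.vcf \
      -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G.vcf \
      -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf \
      -an DP -an QD -an FS -an MQRankSum -an ReadPosRankSum \
      -mode SNP \
      -L exome_targets.interval_list \
      --target_titv 3.0 \
      -recalFile exome_snps.recal \
      -tranchesFile exome_snps.tranches
    ```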
    Geraldine Van der Auwera, PhD
  • rrahman (Member)
    Yes, I already took care of the intervals in the previous steps, and thanks again for pointing out the expected Ti/Tv ratio parameter.
    Cheers Rubayte
