Saturday, January 11, 2014

(GATK forum) Recalibrate variant quality scores = run VQSR

Objective

Recalibrate variant quality scores and produce a callset filtered for the desired levels of sensitivity and specificity.

Prerequisites

  • TBD

Caveats

This document provides a typical usage example including parameter values. However, the values given may not be representative of the latest Best Practices recommendations. When in doubt, please consult the FAQ document on VQSR training sets and parameters, which overrides this document.

Steps

  1. Prepare recalibration parameters for SNPs
    a. Specify which call sets the program should use as resources to build the recalibration model
    b. Specify which annotations the program should use to evaluate the likelihood of SNPs being real
    c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches
    d. Determine additional model parameters
  2. Build the SNP recalibration model
  3. Apply the desired level of recalibration to the SNPs in the call set
  4. Prepare recalibration parameters for Indels
    a. Specify which call sets the program should use as resources to build the recalibration model
    b. Specify which annotations the program should use to evaluate the likelihood of Indels being real
    c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches
    d. Determine additional model parameters
  5. Build the Indel recalibration model
  6. Apply the desired level of recalibration to the Indels in the call set

1. Prepare recalibration parameters for SNPs

a. Specify which call sets the program should use as resources to build the recalibration model

For each training set, we use key-value tags to qualify whether the set contains known sites, training sites, and/or truth sites. We also use a tag to specify the prior likelihood that those sites are true (using the Phred scale).
  • True sites training resource: HapMap
This resource is a SNP call set that has been validated to a very high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). We will also use these sites later on to choose a threshold for filtering variants based on sensitivity to truth sites. The prior likelihood we assign to these variants is Q15 (96.84%).
  • True sites training resource: Omni
This resource is a set of polymorphic SNP sites produced by the Omni genotyping array. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).
  • Non-true sites training resource: 1000G
This resource is a set of high-confidence SNP sites produced by the 1000 Genomes Project. The program will consider that the variants in this resource may contain true variants as well as false positives (truth=false), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q10 (90.00%).
  • Known sites resource, not used in training: dbSNP
This resource is a SNP call set that has not been validated to a high degree of confidence (truth=false). The program will not use the variants in this resource to train the recalibration model (training=false). However, the program will use these to stratify output metrics such as Ti/Tv ratio by whether variants are present in dbsnp or not (known=true). The prior likelihood we assign to these variants is Q2 (36.90%).
The default prior likelihood assigned to all other variants is Q2 (36.90%). This low value reflects the fact that the philosophy of the GATK callers is to produce a large, highly sensitive callset that needs to be heavily refined through additional filtering.
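The Phred-scaled priors above convert to probabilities as p = 1 - 10^(-Q/10). As a quick sanity check on the percentages quoted for each resource (plain Python, not a GATK tool):

```python
def phred_to_prob(q):
    # Phred scale: Q = -10 * log10(P(site is false)),
    # so P(site is true) = 1 - 10^(-Q/10).
    return 1.0 - 10.0 ** (-q / 10.0)

# Priors used for the SNP resources above.
for name, q in [("hapmap", 15.0), ("omni", 12.0), ("1000G", 10.0), ("dbsnp", 2.0)]:
    print(f"{name}: Q{q:g} -> {phred_to_prob(q):.2%}")
# hapmap: Q15 -> 96.84%
# omni: Q12 -> 93.69%
# 1000G: Q10 -> 90.00%
# dbsnp: Q2 -> 36.90%
```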

b. Specify which annotations the program should use to evaluate the likelihood of SNPs being real

These annotations are included in the information generated for each variant call by the caller. If an annotation is missing (typically because it was omitted from the calling command) it can be added using the VariantAnnotator tool.
  • Coverage (DP)
Total (unfiltered) depth of coverage.
  • QualByDepth (QD)
Variant confidence (from the QUAL field) / unfiltered depth of non-reference samples.
  • FisherStrand (FS)
Phred-scaled p-value using Fisher's Exact Test to detect strand bias (the variation being seen on only the forward or only the reverse strand) in the reads. More bias is indicative of false positive calls.
  • MappingQualityRankSumTest (MQRankSum)
The u-based z-approximation from the Mann-Whitney Rank Sum Test for mapping qualities (reads with ref bases vs. those with the alternate allele). Note that the mapping quality rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.
  • ReadPosRankSumTest (ReadPosRankSum)
The u-based z-approximation from the Mann-Whitney Rank Sum Test for the distance from the end of the read for reads with the alternate allele. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.
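For intuition, the two rank-sum annotations boil down to the Mann-Whitney U statistic and its normal approximation. The sketch below is a simplified, tie-uncorrected version of that computation (illustrative only; GATK's implementation handles ties and sign conventions more carefully):

```python
from math import sqrt

def rank_sum_z(ref_vals, alt_vals):
    # Simplified u-based z-approximation of the Mann-Whitney rank-sum test.
    # No tie or continuity correction (GATK's version is more elaborate).
    combined = sorted((v, grp) for grp, vals in ((0, ref_vals), (1, alt_vals))
                      for v in vals)
    # Sum of 1-based ranks held by the alt-allele group.
    r_alt = sum(i + 1 for i, (_, grp) in enumerate(combined) if grp == 1)
    n_ref, n_alt = len(ref_vals), len(alt_vals)
    u = r_alt - n_alt * (n_alt + 1) / 2           # U statistic for the alt group
    mean_u = n_ref * n_alt / 2
    sd_u = sqrt(n_ref * n_alt * (n_ref + n_alt + 1) / 12)
    return (u - mean_u) / sd_u

# Alt reads with much lower mapping quality than ref reads -> negative z,
# which is the pattern suggestive of a false positive call.
print(rank_sum_z([60, 60, 60], [20, 20, 20]))  # about -1.96
```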

c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches

  • First tranche threshold 100.0
  • Second tranche threshold 99.9
  • Third tranche threshold 99.0
  • Fourth tranche threshold 90.0
Tranches are essentially slices of variants, ranked by VQSLOD, bounded by the threshold values specified in this step. The threshold values themselves refer to the sensitivity we can obtain when we apply them to the call sets that the program uses to train the model. The idea is that the lowest tranche is highly specific but less sensitive (there are very few false positives but potentially many false negatives, i.e. missing calls), and each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls. This allows us to filter variants based on how sensitive we want the call set to be, rather than applying hard filters and then only evaluating how sensitive the call set is using post hoc methods.
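The tranche idea can be sketched in a few lines: rank variants by VQSLOD, then find the score cutoff that retains the target fraction of truth sites. This is a toy illustration of the concept, not VariantRecalibrator's actual algorithm:

```python
def tranche_cutoff(variants, target_sensitivity):
    # variants: list of (vqslod, is_truth_site) pairs.
    # Returns the VQSLOD cutoff that retains target_sensitivity of truth sites.
    ranked = sorted(variants, key=lambda v: v[0], reverse=True)
    total_truth = sum(1 for _, is_truth in ranked if is_truth)
    needed = target_sensitivity * total_truth
    seen = 0
    for vqslod, is_truth in ranked:
        seen += is_truth
        if seen >= needed:
            return vqslod   # everything at or above this score is in the tranche
    return ranked[-1][0]

calls = [(5.0, True), (4.0, True), (3.0, False), (2.0, True), (1.0, False)]
print(tranche_cutoff(calls, 1.00))  # 2.0: keeping all truth sites needs a low cutoff
print(tranche_cutoff(calls, 0.66))  # 4.0: a stricter cutoff keeps 2 of 3 truth sites
```

Note how higher sensitivity targets force the cutoff down the VQSLOD ranking, sweeping in more false positives along the way, exactly as the tranche plot shows.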

2. Build the SNP recalibration model

Action

Run the following GATK command:
java -jar GenomeAnalysisTK.jar \ 
    -T VariantRecalibrator \ 
    -R reference.fa \ 
    -input raw_variants.vcf \ 
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf \ 
    -resource:omni,known=false,training=true,truth=true,prior=12.0 omni.vcf \ 
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G.vcf \ 
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf \ 
    -an DP \ 
    -an QD \ 
    -an FS \ 
    -an MQRankSum \ 
    -an ReadPosRankSum \ 
    -mode SNP \ 
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \ 
    -recalFile recalibrate_SNP.recal \ 
    -tranchesFile recalibrate_SNP.tranches \ 
    -rscriptFile recalibrate_SNP_plots.R 

Expected Result

This creates several files. The most important file is the recalibration report, called recalibrate_SNP.recal, which contains the recalibration data. This is what the program will use in the next step to generate a VCF file in which the variants are annotated with their recalibrated quality scores. There is also a file called recalibrate_SNP.tranches, which contains the quality score thresholds corresponding to the tranches specified in the original command. Finally, if your installation of R and the other required libraries was done correctly, you will also find some PDF files containing plots. These plots illustrate the distribution of variants according to certain dimensions of the model.
For detailed instructions on how to interpret these plots, please refer to the online GATK documentation.

3. Apply the desired level of recalibration to the SNPs in the call set

Action

Run the following GATK command:
java -jar GenomeAnalysisTK.jar \ 
    -T ApplyRecalibration \ 
    -R reference.fa \ 
    -input raw_variants.vcf \ 
    -mode SNP \ 
    --ts_filter_level 99.0 \ 
    -recalFile recalibrate_SNP.recal \ 
    -tranchesFile recalibrate_SNP.tranches \ 
    -o recalibrated_snps_raw_indels.vcf 

Expected Result

This creates a new VCF file, called recalibrated_snps_raw_indels.vcf, which contains all the original variants from the original raw_variants.vcf file, but now the SNPs are annotated with their recalibrated quality scores (VQSLOD) and either PASS or FILTER depending on whether or not they are included in the selected tranche.
Here we are taking the second lowest of the tranches specified in the original recalibration command. This means that we are applying to our data set the level of sensitivity that would allow us to retrieve 99% of true variants from the truth training sets of HapMap and Omni SNPs. If we wanted to be more specific (and therefore have less risk of including false positives, at the risk of missing real sites) we could take the very lowest tranche, which would only retrieve 90% of the truth training sites. If we wanted to be more sensitive (and therefore less specific, at the risk of including more false positives) we could take the higher tranches. In our Best Practices documentation, we recommend taking the second highest tranche (99.9%) which provides the highest sensitivity you can get while still being acceptably specific.
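To see how the chosen tranche played out, you can tally the FILTER column of the output VCF. A minimal sketch (the tranche filter name in the example data is an assumed illustration of GATK's VQSRTranche naming pattern):

```python
from collections import Counter

def filter_summary(vcf_lines):
    # Count VCF records by their FILTER field (column 7), skipping header lines.
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        counts[line.rstrip("\n").split("\t")[6]] += 1
    return counts

example = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t50.0\tPASS\t.",
    "1\t200\t.\tC\tT\t30.0\tVQSRTrancheSNP99.00to99.90\t.",
]
print(filter_summary(example))
```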

4. Prepare recalibration parameters for Indels

a. Specify which call sets the program should use as resources to build the recalibration model

For each training set, we use key-value tags to qualify whether the set contains known sites, training sites, and/or truth sites. We also use a tag to specify the prior likelihood that those sites are true (using the Phred scale).
  • Known and true sites training resource: Mills
This resource is an Indel call set that has been validated to a high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).
The default prior likelihood assigned to all other variants is Q2 (36.90%). This low value reflects the fact that the philosophy of the GATK callers is to produce a large, highly sensitive callset that needs to be heavily refined through additional filtering.

b. Specify which annotations the program should use to evaluate the likelihood of Indels being real

These annotations are included in the information generated for each variant call by the caller. If an annotation is missing (typically because it was omitted from the calling command) it can be added using the VariantAnnotator tool.
  • Coverage (DP)
Total (unfiltered) depth of coverage.
  • FisherStrand (FS)
Phred-scaled p-value using Fisher's Exact Test to detect strand bias (the variation being seen on only the forward or only the reverse strand) in the reads. More bias is indicative of false positive calls.
  • MappingQualityRankSumTest (MQRankSum)
The u-based z-approximation from the Mann-Whitney Rank Sum Test for mapping qualities (reads with ref bases vs. those with the alternate allele). Note that the mapping quality rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.
  • ReadPosRankSumTest (ReadPosRankSum)
The u-based z-approximation from the Mann-Whitney Rank Sum Test for the distance from the end of the read for reads with the alternate allele. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.

c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches

  • First tranche threshold 100.0
  • Second tranche threshold 99.9
  • Third tranche threshold 99.0
  • Fourth tranche threshold 90.0
Tranches are essentially slices of variants, ranked by VQSLOD, bounded by the threshold values specified in this step. The threshold values themselves refer to the sensitivity we can obtain when we apply them to the call sets that the program uses to train the model. The idea is that the lowest tranche is highly specific but less sensitive (there are very few false positives but potentially many false negatives, i.e. missing calls), and each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls. This allows us to filter variants based on how sensitive we want the call set to be, rather than applying hard filters and then only evaluating how sensitive the call set is using post hoc methods.

d. Determine additional model parameters

  • Maximum number of Gaussians (-maxGaussians) 4
This is the maximum number of Gaussians (i.e. clusters of variants that have similar properties) that the program should try to identify when it runs the variational Bayes algorithm that underlies the machine learning method. In essence, this limits the number of different "profiles" of variants that the program will try to identify. This number should only be increased for datasets that include a very large number of variants.

5. Build the Indel recalibration model

Action

Run the following GATK command:
java -jar GenomeAnalysisTK.jar \ 
    -T VariantRecalibrator \ 
    -R reference.fa \ 
    -input recalibrated_snps_raw_indels.vcf \ 
    -resource:mills,known=true,training=true,truth=true,prior=12.0 mills.vcf \ 
    -an DP \ 
    -an FS \ 
    -an MQRankSum \ 
    -an ReadPosRankSum \ 
    -mode INDEL \ 
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \ 
    --maxGaussians 4 \ 
    -recalFile recalibrate_INDEL.recal \ 
    -tranchesFile recalibrate_INDEL.tranches \ 
    -rscriptFile recalibrate_INDEL_plots.R 

Expected Result

This creates several files. The most important file is the recalibration report, called recalibrate_INDEL.recal, which contains the recalibration data. This is what the program will use in the next step to generate a VCF file in which the variants are annotated with their recalibrated quality scores. There is also a file called recalibrate_INDEL.tranches, which contains the quality score thresholds corresponding to the tranches specified in the original command. Finally, if your installation of R and the other required libraries was done correctly, you will also find some PDF files containing plots. These plots illustrate the distribution of variants according to certain dimensions of the model.
For detailed instructions on how to interpret these plots, please refer to the online GATK documentation.

6. Apply the desired level of recalibration to the Indels in the call set

Action

Run the following GATK command:
java -jar GenomeAnalysisTK.jar \ 
    -T ApplyRecalibration \ 
    -R reference.fa \ 
    -input recalibrated_snps_raw_indels.vcf \ 
    -mode INDEL \ 
    --ts_filter_level 99.0 \ 
    -recalFile recalibrate_INDEL.recal \ 
    -tranchesFile recalibrate_INDEL.tranches \ 
    -o recalibrated_variants.vcf 

Expected Result

This creates a new VCF file, called recalibrated_variants.vcf, which contains all the original variants from the original recalibrated_snps_raw_indels.vcf file, but now the Indels are also annotated with their recalibrated quality scores (VQSLOD) and either PASS or FILTER depending on whether or not they are included in the selected tranche.
Here we are taking the second lowest of the tranches specified in the original recalibration command. This means that we are applying to our data set the level of sensitivity that would allow us to retrieve 99% of true variants from the Mills truth training set of indels. If we wanted to be more specific (and therefore have less risk of including false positives, at the risk of missing real sites) we could take the very lowest tranche, which would only retrieve 90% of the truth training sites. If we wanted to be more sensitive (and therefore less specific, at the risk of including more false positives) we could take the higher tranches. In our Best Practices documentation, we recommend taking the second highest tranche (99.9%), which provides the highest sensitivity you can get while still being acceptably specific.
Geraldine Van der Auwera, PhD

Comments

  • avidLearner (Posts: 7, Member)
    Hello, I have a few questions that I would like to clarify:
    1. Is the file 1000G_phase1.snps.high_confidence.vcf available in the resource bundle? (I looked it up and it wasn't there!) If not, which file is recommended with the updated VariantRecalibrator parameters?
    2. Just to clarify, are the parameters described above pertinent to whole genomes or whole exomes? (I am assuming whole genomes but this has not been mentioned explicitly anywhere.) I understand that it is recommended to use --maxGaussians 4 --percentBad 0.05 for few exome samples. In this case, is -minNumBad of 1000 recommended or should a lower value be set?
    3. The VariantRecalibrator documentation parameters (http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantrecalibration_VariantRecalibrator.html) are slightly different from the parameters described here. Is this tutorial the latest best practices guideline that can be followed for VQSR?
    Thanks for your help!
  • Geraldine_VdAuwera (Posts: 4,410, Administrator, GSA Member)
    Hi there,
    • The filenames are a little different from what we have in the bundle, I will correct that soon (sorry) but basically you want the 1000 Genomes Phase1 SNPs that should be in our bundle.
    • These parameters are for WGS. I will add a note to explain that.
    • The parameters given in tutorial documents are not necessarily up-to-date with our Best Practices recommendations. These can be found in the Best Practices document on our website that concentrates (or points to) the actual parameter recommendations. For VQSR there is a specific FAQ document, which is linked to by the BP document. We will try to make that more clear in the documentation.
    Geraldine Van der Auwera, PhD
  • avidLearner (Posts: 7, Member)
    Hello Geraldine, I searched for the 1000 Genomes Phase 1 SNPs VCF in ftp://ftp.broadinstitute.org/bundle/2.3/b37/ but no luck! Please tell me if I'm looking in the wrong place.
  • Geraldine_VdAuwera (Posts: 4,410, Administrator, GSA Member)
    Hey @avidLearner, sorry to get back to you so late. You've been looking in the version 2.3 of the bundle, but you should be looking in the latest version, which is 2.5.
    Geraldine Van der Auwera, PhD
  • avidLearner (Posts: 7, Member)
    Hi Geraldine, yes I figured it out. I was initially accessing ftp://ftp.broadinstitute.org/bundle/ through Internet Explorer and for some reason it wasn't loading the updated version 2.5. But when I accessed the resource bundle through wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.5/ I was able to locate the latest version. Sorry for the trouble. Thanks for the help!
  • yingzhang (minneapolis, Posts: 4, Member)
    edited August 2013
    This is a really useful article on the variant recalibration. And I think it makes more sense compared to the brief overview at http://www.broadinstitute.org/gatk/guide/topic?name=best-practices
    But I do have a question, which might only be my own over-reacting.
    In the overview, you state:
    =======
    We've found that the best results are obtained with the VQSR when it is used to build separate adaptive error models for SNPs versus INDELs:
    snp.model <- BuildErrorModelWithVQSR(raw.vcf, SNP)
    indel.model <- BuildErrorModelWithVQSR(raw.vcf, INDEL)
    recalibratedSNPs.rawIndels.vcf <- ApplyRecalibration(raw.vcf, snp.model, SNP)
    analysisReady.vcf <- ApplyRecalibration(recalibratedSNPs.rawIndels.vcf, indel.model, INDEL)
    ========
    The difference between the overview and this FAQ is that how one should build the error model for indel.
    In overview, it builds the error model for indel from the raw_variant.vcf; while in this article the indel model is built from the recalibrated_snps_raw_indels.vcf.
    I assume there should be no cross-talk between the INDEL and SNP types. But if there is such a case, then the indel model from raw_variant.vcf will differ from the indel model from the recalibrated_snps_raw_indels.vcf.
    Should I follow the overview or this FAQ?
    And when you say "best results", what else have you tested?
    Best, Ying
  • Geraldine_VdAuwera (Posts: 4,410, Administrator, GSA Member)
    Hi Ying,
    We're currently rewriting the Best Practices document to make it more readable, so hopefully that will help clarify things. In the meantime, let me try to clear up some of the confusion, which is very common, about how the recalibration works.
    When you specify either INDEL or SNP for the model, the model is built using only the specified type of variant, ignoring the other type. Then, when you apply the recalibration (specifying the appropriate variant type), the tool outputs recalibrated variants for that type, and emits the other variant type without modification. So if you specified SNPs, your output contains recalibrated SNPs and unmodified indels. You can then repeat the process on that output file, specifying INDEL this time, so that the tool will output recalibrated indels, as well as the already recalibrated SNPs from the previous round.
    This replaces the old way of proceeding, where we recalibrated indels and SNPs in separate files, because it is faster and more efficient.
    Does that make more sense?
    Geraldine Van der Auwera, PhD
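The two-pass behaviour described in the reply above can be sketched as follows (hypothetical record and model shapes, purely to show which records each pass touches):

```python
def apply_recalibration(records, model, mode):
    # Recalibrate records of the given type; pass the other type through unchanged.
    out = []
    for rec in records:
        if rec["type"] == mode:
            rec = dict(rec, vqslod=model(rec), recalibrated=True)
        out.append(rec)
    return out

raw = [{"type": "SNP"}, {"type": "INDEL"}]
toy_model = lambda rec: 1.0  # stand-in for the trained mixture model
# Pass 1: SNPs recalibrated, indels emitted untouched.
pass1 = apply_recalibration(raw, toy_model, "SNP")
# Pass 2, run on the pass-1 output: indels recalibrated, SNPs kept from pass 1.
pass2 = apply_recalibration(pass1, toy_model, "INDEL")
```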
  • yingzhang (minneapolis, Posts: 4, Member)
    Thank you Geraldine. Your comments are really explicit.
    @Geraldine_VdAuwera said: Hi Ying,
    We're currently rewriting the Best Practices document to make it more readable, so hopefully that will help clarify things. In the meantime, let me try to clear up some of the confusion, which is very common, about how the recalibration works.
    When you specify either INDEL or SNP for the model, the model is built using only the specified type of variant, ignoring the other type. Then, when you apply the recalibration (specifying the appropriate variant type), the tool output recalibrated variants for that type, and emits the other variant type without modification. So if you specified SNPs, your output contains recalibrated SNPs and unmodified indels. You can then repeat the process on that output file, specifying INDEL this time, so that the tool will output recalibrated indels, as well as the already recalibrated SNPs from the previous round.
    This replaces the old way of proceeding, where we recalibrated indels and SNPs in separate files, because it is faster and more efficient.
    Does that make more sense?
  • avilella (Posts: 6, Member)
    Hi @Geraldine_VdAuwera,
    Is this two step process described below also working for version 1.6? (I am using v1.6-22-g3ec78bd)
    "When you specify either INDEL or SNP for the model, the model is built using only the specified type of variant, ignoring the other type. Then, when you apply the recalibration (specifying the appropriate variant type), the tool output recalibrated variants for that type, and emits the other variant type without modification. So if you specified SNPs, your output contains recalibrated SNPs and unmodified indels. You can then repeat the process on that output file, specifying INDEL this time, so that the tool will output recalibrated indels, as well as the already recalibrated SNPs from the previous round."
    Or is 1.6 still using the separate process?
    "old way of proceeding, where we recalibrated indels and SNPs in separate files"
  • Geraldine_VdAuwera (Posts: 4,410, Administrator, GSA Member)
    With 1.6 you should recalibrate variants separately (the old way).
    FYI version 2.7 has some key improvements to VQSR that make it work better and more consistently. We really don't recommend using older versions.
    Geraldine Van der Auwera, PhD
  • avilella (Posts: 6, Member)
    edited September 2013
    Can I recalibrate each of the chromosomes as separate vcf, each time pointing to the whole resource files? Will it make a huge difference compared to merging the chromosome vcfs and recalibrating only once?
  • Geraldine_VdAuwera (Posts: 4,410, Administrator, GSA Member)
    It's much better to recalibrate all the chromosomes together because it gives the VQSR model more data, which will make it more accurate and reliable. In many cases individual chromosomes don't even have enough variants for recalibration to work at all.
    Geraldine Van der Auwera, PhD
  • prepagam (Posts: 8, Member)
    Can I check whether VQSR should be done on each population separately, as with the Unified Genotyper? I have 40 samples but from many populations - so probably not enough data for VQSR if it's run by population.
  • Geraldine_VdAuwera (Posts: 4,410, Administrator, GSA Member)
    Hi @prepagam,
    You should run VQSR jointly only on samples that were called jointly; so if you called your samples in subsets by population, you also have to recalibrate them according to the same subsets.
    Geraldine Van der Auwera, PhD
  • splaisan (Posts: 14, Member)
    Further in the workflow, I noticed the now-fixed --maxGaussians 4 (not yet fixed in the "d." text part), and that the -tranche argument followed by a range notation was not accepted on my machine; is this normal?
    # fails even with single or double quotes around []
        -tranche [100.0, 99.9, 99.0, 90.0] \ 
    # works with
        -tranche 100.0 \
        -tranche 99.9 \
        -tranche 99.0 \
        -tranche 90.0 \
    thanks
    Stephane
  • Geraldine_VdAuwera (Posts: 4,410, Administrator, GSA Member)
    Hi @splaisan,
    You are correct, the way you got it to work is the right way to do it. I'll rephrase it in the doc since it seems to be confusing people. Thanks for pointing this out!
    Geraldine Van der Auwera, PhD
  • splaisan (Posts: 14, Member)
    sad, the bracket notation was cool! ;-)
  • Geraldine_VdAuwera (Posts: 4,410, Administrator, GSA Member)
    Yep. Maybe if we find time we'll amend how the arguments can be passed in... (but don't hold your breath)
    Geraldine Van der Auwera, PhD
  • splaisan (Posts: 14, Member)
    Inspired by the broadE 2013 video #5 by Ryan Poplin (maybe a question for Ryan!) - great video BTW. Context: I took Illumina NA18507 chr21 mappings (originally to hg18) where most of the bam header is missing and poorly complies with Picard+GATK. I first extracted the reads to FASTQ to remap them to hg19 and then followed the full Best Practices pipeline.
    Q: With such data, should I add one @RG corresponding to each original lane present in the hg18 bam file (therefore exporting each lane to a separate FASTQ) in order to take advantage of the lane bias detection of the recalibration? Or will the effect be minimal and not justify the long run?
    THX
  • Geraldine_VdAuwera (Posts: 4,410, Administrator, GSA Member)
    Yes, you should definitely assign RGs per original lane for best recalibration results. It is totally worth the extra effort, because different lanes are likely to suffer from different error modes.
    Keep in mind also that when reverting to FASTQ, it's important to randomize the reads in order to avoid biasing the aligner. I forget if you're the person I recently discussed this with on this forum; if not I recommend you check out our tutorial on reverting bams to Fastq.
    Geraldine Van der Auwera, PhD
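The randomization mentioned above amounts to shuffling the read records before writing FASTQ, so the aligner never sees the position-sorted order (a sketch only; in the real pipeline Picard's revert/convert tooling covers this, as described in the tutorial Geraldine references):

```python
import random

def shuffle_reads(records, seed=0):
    # Return the reads in randomized order; a fixed seed keeps runs reproducible.
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled

reads = [f"read{i}" for i in range(6)]
print(shuffle_reads(reads))  # same reads, randomized order
```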
  • splaisan (Posts: 14, Member)
    Hi again, about the -an parameters.
    When the HC has been used without specifying annotation types (as is the case in the separate tutorial), do we always get the -an types described above? Is this the full list or some Best Practice defaults? Is there a rationale for which ones to use in genome sequencing, and what about the mirror question for the UG? Thanks
  • Geraldine_VdAuwera (Posts: 4,410, Administrator, GSA Member)
    There is a set of core annotations that are used by default for both UG and HC. They typically don't change across versions. They are the "Standard" annotations; you can get a list by running VariantAnnotator with the -list argument, or if you want to check if a specific annotation is in the standard set, you can look it up on its tech doc page, e.g. here toward the top of the page, where it says "Type:". In the current defaults, the only difference between UG and HC is that HC additionally uses ClippingRankSumTest and DepthPerSampleHC.
    The choice of the default set (and any other recommendation we may make in the documentation) is based largely on empirical observations of what annotations are consistently informative and useful for filtering in our work. The caveat is that most of our work is done on specific data types (e.g. well-behaved Illumina exomes) so if you're using different data types you may need to experiment with other annotations. But we try to make our recommendations as broadly applicable as possible.
    Geraldine Van der Auwera, PhD
  • vsvinti (Posts: 31, Member)
    Hello there, I'm wondering what would be a sensible option for -numBad. You say "The default value is 1000, which is geared for smaller datasets. If you have a dataset that's on the large side, you may need to increase this value considerably."
    I have ~ 1,100 samples, and I set -numBad to 10,000. Would that be reasonable? I am working with whole exome data. Your previous recommendations had "-percentBad 0.01 -minNumBad 1000". Does the new numBad replace both of these ? Cheers!
  • Geraldine_VdAuwera (Posts: 4,410, Administrator, GSA Member)
    Yes, the new -numBad argument replaces the other two.
    The important metric for setting numBad is not really the number of samples, but more so the number of variants initially called.
    Geraldine Van der Auwera, PhD
  • vsvinti (Posts: 31, Member)
    Right, of course. So for ~ 900,000 variants, would 10,000 suffice for numBad (or how does one determine the 'ideal' number)?
  • Geraldine_VdAuwera (Posts: 4,410, Administrator, GSA Member)
    It's hard to say up front because it depends a lot on your data, but I think 10K out of 900K is a good place to start, though I would recommend doing several runs with increasing numbers, and choosing the one that gives the best-looking plots. The good news is we are working on a way to have the recalibrator calculate the optimal number automatically to take the guesswork out of this process...
    Geraldine Van der Auwera, PhD
  • vsvinti (Posts: 31, Member)
    That's great, thanks!
  • vsvinti (Posts: 31, Member)
    Hi there, another question on VQSR training sets / options. This page suggests the following:
        -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf \
        -resource:omni,known=false,training=true,truth=false,prior=12.0 omni.vcf \
        -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G.vcf \
        -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf \
        -an DP -an QD -an FS -an MQRankSum -an ReadPosRankSum \
        -mode SNP -numBad 1000 \
        -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0
    While the section on "What VQSR training sets / arguments should I use for my specific project?" says:
        --numBadVariants 1000 \
        -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
        -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.sites.vcf \
        -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.vcf \
        -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \
        -an DP -an QD -an FS -an MQRankSum -an ReadPosRankSum \
    Both of these pages are on the current best practices page. For omni, the first suggests truth=false, while the latter says truth=true. I assume one of them is wrong - please advise which is correct. Thanks.
  • Geraldine_VdAuwera (Administrator, GSA Member)
    Hi @vsvinti,
    Thanks for pointing that out, I understand it is confusing. As for which document is correct, the general rule is that tutorials are only meant to provide usage examples and do not necessarily include the latest Best Practices parameter values. In this case, though, it is actually a mistake in the command line; if you look at the parameter description earlier in the text, Omni is correctly identified as a truth set. I'll fix this in the text.
    Geraldine Van der Auwera, PhD
  • vsvinti (Member)
    edited November 2013
    Thanks for that. I tend to treat anything I find in the Best Practices as the recommendations to go by, and since the tutorial link came first, I assumed it contained the latest updates. I wonder if there's a way to point out in the Best Practices which pages contain the latest option recommendations and which don't; I generally don't read all the subsections once I think I've found the answer. Thanks.
  • vsvinti (Member)
    How about for indels? The tutorial only has Mills (all set to true), while the methods article has:
        -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.sites.vcf \
        -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \
  • Geraldine_VdAuwera (Administrator, GSA Member)
    "I wonder if there's a way to point out in best practices which pages contain the latest option recommendations, and which don't"
    I'll try to think of something to make this clearer.
    Geraldine Van der Auwera, PhD
  • Geraldine_VdAuwera (Administrator, GSA Member)
    The methods article always takes precedence over the tutorial. For the record, the difference it makes is very minor: if you use the dbsnp file as known=true, that's what the recalibrator will use to annotate variants as known or novel; otherwise it will use the Mills set. But this does not play any role in the recalibration model calculations.
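    Put together, the indel recalibration step the methods article describes might look like the following sketch; the annotation list, input name, and output names are assumptions, not prescribed values:

    ```shell
    # Indel recalibration per the methods article: Mills as the training and
    # truth resource, dbsnp as known-only (used for known/novel annotation,
    # not for the model). Input and output file names are placeholders.
    java -jar GenomeAnalysisTK.jar \
      -T VariantRecalibrator \
      -R reference.fasta \
      -input raw_variants.vcf \
      -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.sites.vcf \
      -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \
      -an DP -an FS -an MQRankSum -an ReadPosRankSum \
      -mode INDEL \
      -recalFile indels.recal \
      -tranchesFile indels.tranches \
      -rscriptFile indels.plots.R
    ```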
    Geraldine Van der Auwera, PhD
  • vsvinti (Member)
    Cheers.
  • rrahman (Member)
    Should we add the -numBad parameter to the indel recalibration model as well for whole-exome datasets? Or anything else? Thanks in advance.
  • Geraldine_VdAuwera (Administrator, GSA Member)
    Hi @rrahman,
    The -numBad parameter has been removed from VQSR in the latest version (2.8), so unless you're using an older version, you should not try to use this parameter.
    Geraldine Van der Auwera, PhD
  • rrahman (Member)
    So, for version 2.8, shall I use the exact same recalibration model for both SNPs and indels on whole-exome datasets?
  • Geraldine_VdAuwera (Administrator, GSA Member)
    I'm not sure what you mean by "exact recalibration model". Basically, follow the instructions as written in the article above.
    Geraldine Van der Auwera, PhD
  • rrahman (Member)
    I meant the same parameters as shown in the above section. I was a bit confused because in the earlier comments you said they were for whole-genome datasets. But anyway, thanks for the clarification.
  • Geraldine_VdAuwera (Administrator, GSA Member)
    Oh, I see what you mean. Yes, it's essentially the same thing, although for exomes you should specify a list of intervals (which you should already be doing for the previous steps as well). And for the Ti/Tv-based tranche plots to look good, you may need to adjust the expected Ti/Tv ratio parameter.
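    As a sketch of those two exome adjustments, one might add an intervals file and the expected-Ti/Tv argument to the same SNP recalibration command used for whole genomes; the interval file name, the 3.0 value, and the output names here are illustrative placeholders, not recommendations:

    ```shell
    # Same SNP recalibration as for whole genomes, restricted to the exome
    # capture targets (-L) and with the expected novel Ti/Tv ratio adjusted
    # (--target_titv) so the tranche plots make sense. All file names and
    # the 3.0 value are placeholders.
    java -jar GenomeAnalysisTK.jar \
      -T VariantRecalibrator \
      -R reference.fasta \
      -input raw_variants.vcf \
      -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf \
      -resource:omni,known=false,training=true,truth=true,prior=12.0 omni.vcf \
      -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G.vcf \
      -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf \
      -an DP -an QD -an FS -an MQRankSum -an ReadPosRankSum \
      -mode SNP \
      -L exome_targets.interval_list \
      --target_titv 3.0 \
      -recalFile exome_snps.recal \
      -tranchesFile exome_snps.tranches
    ```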
    Geraldine Van der Auwera, PhD
  • rrahman (Member)
    Yes, I already took care of the intervals in the previous steps, and thanks again for pointing out the expected Ti/Tv ratio parameter.
    Cheers Rubayte
