Saturday, January 11, 2014

(GATK forum) Call variants on a diploid genome with the HaplotypeCaller

Objective

Call variants on a diploid genome with the HaplotypeCaller, producing a raw (unfiltered) VCF.

Prerequisites

  • TBD

Steps

  1. Determine the basic parameters of the analysis
  2. Call variants in your sequence data

1. Determine the basic parameters of the analysis

If you do not specify these parameters yourself, the program will use default values. However we recommend that you set them explicitly because it will help you understand how the results are bounded and how you can modify the program's behavior.
  • Genotyping mode (–genotyping_mode)
This specifies how we want the program to determine the alternate alleles to use for genotyping. In the default DISCOVERY mode, the program will choose the most likely alleles out of those it sees in the data. In GENOTYPE_GIVEN_ALLELES mode, the program will only use the alleles passed in from a VCF file (using the -alleles argument). This is useful if you just want to determine if a sample has a specific genotype of interest and you are not interested in other alleles.
  • Emission confidence threshold (–stand_emit_conf)
This is the minimum confidence threshold (phred-scaled) at which the program should emit sites that appear to be possibly variant.
  • Calling confidence threshold (–stand_call_conf)
This is the minimum confidence threshold (phred-scaled) at which the program should emit variant sites as called. If a site's associated genotype has a confidence score lower than the calling threshold, the program will emit the site as filtered and will annotate it as LowQual. This threshold separates high confidence calls from low confidence calls.
The terms called and filtered are tricky because they can mean different things depending on context. In ordinary language, people often say a site was called if it was emitted as variant. But in the GATK's technical language, saying a site was called means that that site passed the confidence threshold test. For filtered, it's even more confusing, because in ordinary language, when people say that sites were filtered, they usually mean that those sites successfully passed a filtering test. However, in the GATK's technical language, the same phrase (saying that sites were filtered) means that those sites failed the filtering test. In effect, it means that those would be filtered out if the filter was used to actually remove low-confidence calls from the callset, instead of just tagging them. In both cases, both usages are valid depending on the point of view of the person who is reporting the results. So it's always important to check what is the context when interpreting results that include these terms.

2. Call variants in your sequence data

Action

Run the following GATK command:
java -jar GenomeAnalysisTK.jar \ 
    -T HaplotypeCaller \ 
    -R reference.fa \ 
    -I preprocessed_reads.bam \  # can be reduced or not
    -L 20 \ 
    --genotyping_mode DISCOVERY \ 
    -stand_emit_conf 10 \ 
    -stand_call_conf 30 \ 
    -o raw_variants.vcf 

Expected Result

This creates a VCF file called raw_variants.vcf, containing all the sites that the HaplotypeCaller evaluated to be potentially variant. Note that this file contains both SNPs and Indels.
Although you now have a nice fresh set of variant calls, the variant discovery stage is not over. The distinctions made by the caller itself between low-confidence calls and the rest is very primitive, and should not be taken as a definitive guide for filtering. The GATK callers are designed to be very lenient in calling variants, so it is extremely important to apply one of the recommended filtering methods (variant recalibration or hard-filtering), in order to move on to downstream analyses with the highest-quality call set possible.
Post edited by Geraldine_VdAuwera on
 0  0  0  265
Geraldine Van der Auwera, PhD

Comments

  • mikemike Posts: 103Member
    Hi,
    I am trying to revisit the documents for HyplotypeCaller.
    I found a webpage seems also taking about HyplotypeCaller: http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_haplotypecaller_HaplotypeCaller.html
    which one is the one to follow: this one or the one I copied and pasted here above? Since there are some difference between these two documents: e.g., one has --genotyping_mode DISCOVERY and the other does not have.
    This page seems a new one for HaplotypeCaller, I used to use --enable_experimental_downsampling -dcov 10, -minPruning 10 etc, are those options still working? I check the command with -h, minPruning seems still there, but not --enable_experimental_downsampling. And also I noticed that -I reduced_reads.bam as input in this document, does that mean I need first run ReduceReads as mentioned in the following URL? http://gatkforums.broadinstitute.org/discussion/2802/howto-compress-read-data-with-reducereads#latest
    ReduceReads is equivalent of --enable_experimental_downsampling option?
    Thanks a lot for your help!
    Mike
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 4,410Administrator, GSA Member admin
    Hi Mike,
    These are two different kinds of pages.
    The http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_haplotypecaller_HaplotypeCaller.html page is the technical document that describes the capabilities of the tool and all its possible options. The command line shown there is just a general usage example, it does not represent THE best way to use it.
    This page here is a tutorial that explains at the beginner level what are the important options and what they mean. For processes that involves several steps, like BQSR, the tutorial explains how to run the tools in succession. The tutorial (howto) pages do not include advanced options, but those options are still available for users who know of them and know how to use them.
    The --enable_experimental_downsampling option is gone since version 2.4 (or 2.5 maybe?) because the then-experimental downsampler replaced the previous downsampler. So there is no longer any reason to specify which downsampler to use. The default one is the new and improved one.
    Finally, ReduceReads is an additional processing step that compresses your data. It is quite different from downsampling. It is optional, so you don't have to use it. The input file here is named like that because this content was extracted from another document that contained all the steps. I will change the name to avoid any further confusion.
    Geraldine Van der Auwera, PhD
  • mikemike Posts: 103Member
    Dear Geraldine:
    Thanks so much for your comments and info, which are very helpful!
    Last time I run HyplotyperCaller, it runs forever and I started to use option -minPruning, which finally works in speed for my old dataset until version 2.2.4 or 2.3-9-ge5ebf34 so that I can finish it in reasonable time frame. For the new version 2.6-4, is it still necesssary to use this option? Of curse it may depend upon dataset, I just wonder about general guidance here.
    still kind of confused about the genotyping_mode DISCOVERY, not sure the difference DISCOVERY vs GENOTYPE_GIVEN_ALLELES etc. Is it good for finding the rare-allelic variant?
    Thanks again for your great help!
    Mike
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 4,410Administrator, GSA Member admin
    Hi Mike,
    HaplotypeCaller gets faster with every release, but you may still need to use -minPruning with version 2.6. Best thing to do is try it both ways and see how it goes.
    As explained elsewhere, GENOTYPE_GIVEN_ALLELES mode is meant for when you know the alleles you are looking for. Let's say you are a clinician, you have patients with a genetic disease, and you know that usually this disease is caused by a specific mutated allele of a known gene. If you have your patients' exome data, you can very rapidly run in GENOTYPE_GIVEN_ALLELES mode to find out whether your patients have that allele or not. Or if there is a set of candidate genes with known disease-causing alleles, you can use the same thing to find out which of the alleles your patients have. You're not trying to discover something new, you're trying to identify something that is known. Make sense?
    Geraldine Van der Auwera, PhD
  • mikemike Posts: 103Member
    Thx so much for your info, Appreciated very much as usual! Mike
  • michaelchaomichaelchao Posts: 8Member
    Hi I have been running the nightly version of GATK (due to an error in ReduceReads stage). I am now using the following HaplotypeCaller command to call variants:
    java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human_g1k_v37.fasta -I reduced_reads_1.bam --dbsnp dbsnp_137.b37.vcf --genotyping_mode DISCOVERY --output_mode EMIT_VARIANTS_ONLY --stand_emit_conf 10 --stand_call_conf 30 -o raw_variants_1.vcf
    With the GATK version nightly-2013-07-16-g515d8e5, I have the following error message:
    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version nightly-2013-07-16-g515d8e5):
    ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
    ERROR Please do not post this error to the GATK forum
    ERROR
    ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: Argument with name 'output_mode' isn't defined.
    ERROR ------------------------------------------------------------------------------------------
    With the GATK version 2.6-4-g3e5ff60, I have a similar but different error message:
    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 2.6-4-g3e5ff60):
    ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
    ERROR Please do not post this error to the GATK forum
    ERROR
    ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: Argument with name 'stand_emit_conf' isn't defined.
    ERROR ------------------------------------------------------------------------------------------
    Thank you for your help on this.
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 4,410Administrator, GSA Member admin
    Hi there,
    I'm not sure what's going on with the output_mode argument, but for the other one the issue is that there was a typo in the tutorial. The argument was given with two dashes instead of one. I fixed it now.
    Geraldine Van der Auwera, PhD
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 4,410Administrator, GSA Member admin
    edited July 2013
    Addendum: the error you're seeing with the output_mode argument is because that argument has been removed in our development version (which will eventually be released as version 2.7), and that functionality has been replaced by another argument, --emitRefConfidence (or -ERC for short). This gives you the option to get reference confidence scores for non-variant sites. It is a new experimental feature and we're interested in feedback about it, so you are welcome to try it out. For more details, see the announcement here: http://gatkforums.broadinstitute.org/discussion/2940/reference-confidence-calculation-gvcf-available-in-haplotypecaller-in-gatk-2-7
    Post edited by Geraldine_VdAuwera on
    Geraldine Van der Auwera, PhD
  • modi2020modi2020 Posts: 12Member
    Hi Geraldine,
    Just a simple question about the –stand_emit_conf and –stand_call_conf.
    Say I want to chose –stand_call_conf as 20 and –stand_emit_conf as 10. I know that anything below –stand_emit_conf will be removed entirely and anything above 20 will be considered a high confidence call. Now would calls with quality scores between 10 and 20 be considered low confidence calls?
    Thank you
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 4,410Administrator, GSA Member admin
    Yes, that is correct.
    Geraldine Van der Auwera, PhD
  • sannesanne NorwayPosts: 9Member
    Dear Geraldine, I had hoped I wouldn't have to ask your advice so soon already, but I've run into an issue with the haplotypecaller (you recently gave me some very valuable feedback on RG, indel realignment and base recalibration!). My question about the haplotypecaller concerns the multithreading. On the website (http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_haplotypecaller_HaplotypeCaller.html) it says that the program can be run in multi-threaded mode using -nct. I've tried this using just 5 samples (i.e. input 5 bam files) and I see the following line in the output: "MicroScheduler - Running the GATK in parallel mode with 24 total threads, 1 CPU thread(s) for each of 24 data thread(s), of 64 processors available on this machine". The program runs fine, but it won't use more CPUs than when I run it without the -nct option. When I click on the link of the –nct option on the website, it goes to the option to specify number of threads with -nt, but when trying this I get an error message that the haplotypecaller does not support multi-threading. So now I'm a little confused:) I hope there is a solution, because the analysis is very slow this way. Especially for the base recalibration I have to call SNPs on all my 40 genomes at once (to make the SNP reference file), which might become too impractical with this speed. Many thanks! Sanne
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 4,410Administrator, GSA Member admin
    Hi @sanne,
    Don't worry, there is no formal quota to the number of questions you can ask per period of time :)
    I see what you mean regarding the link; actually it is taking you to the correct option (-nct), but due to a quirk in the website layout, that option is hidden by the black bar at the top of the screen, and you just see the -nt option a few lines below. If you scroll a little upward after clicking the link you will see it. I need to fix that in the website code.
    Regarding what you're seeing, let me consult with one of our software engineers on whether this is the expected behavior or not.
    Geraldine Van der Auwera, PhD
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 4,410Administrator, GSA Member admin
    @sanne, can you tell me how you are checking how many cores are in use when you run this? Are you using top (table of processes) or something else?
    Geraldine Van der Auwera, PhD
  • sannesanne NorwayPosts: 9Member
    @Geraldine, yes I'm using top. And the expected run time of the program (checked after approx 30 min) is not very different whether I run with -nct 3 or -nct 24 or without -nct option specified.
  • sannesanne NorwayPosts: 9Member
    @Geraldine, sorry I should be more specific - the estimated run time when running with -nct 24 was twice what it took with -nct 3 (9 versus 4.5 days). The only other setting I changed was -stand_call_conf and -stand_emit_conf, which were set to 30 in the run with -nct 24, and to 4.0 in the run with -nct 3.
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 4,410Administrator, GSA Member admin
    Oh, I see. Well, the confidence threshold shouldn't have any effect on runtime. As for the time differences that you're seeing, it's difficult to comment on. Using the estimated runtime as proxy can be problematic, especially for a big job, and after a relatively short time of actual running. It may be that the extra start-up overhead involved by the high level of parallelization you requested with -nct 24 skewed the estimate significantly. It would be more informative to evaluate actual runtimes, or (considering you may not want to let something run for a week before making a decision) evaluate the estimated runtimes after at least several hours or actual run.
    Geraldine Van der Auwera, PhD
  • sannesanne NorwayPosts: 9Member
    Hi Geraldine, the 4.5 days was actual run time, the 9 days was the estimate it gave me after 3 days of running. However, I killed that run yesterday and started it again, and this time it only took 20 hours! I gave it the exact same command, so it seems something just went wrong in the one before. So sorry to have bothered you with this false alarm! But I am very happy now that I know I can just go ahead and do more runs without it taking weeks :) Cheers!
  • blueskypyblueskypy Posts: 108Member
    hi, Geraldine, If the input files include multiple sample bam files, are the QUAL, FILTER, and INFO of the output vcf file computed using the pooled aligned reads from all samples? how is the genotyping different from running HC with just one sample bam file?
    Between running HC with all samples of a cohort together and running HC with each sample separately, is it possible the genotyping results are different for some samples? If so, when could that happen?
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 4,410Administrator, GSA Member admin
    Hi @blueskypy,
    When you do multisample calling, the fields that refer to the variant site itself, such as QUAL, FILTER, and INFO, are all computed based on pooled reads from all samples, yes. The HC's variant calling process (which is not the same thing as genotyping, strictly speaking) starts with a local de novo assembly step that uses all the reads, regardless of what sample they come from, so it is quite different from running HC on bam files individually. This can indeed lead to genotype calls that are different for samples that are run individually vs. with others. The reason is that when you use reads from all the samples, you can end up resolving that big pile of reads into different haplotypes than if you were just using the reads from a single sample. This is normally a good thing because if you are processing a cohort of samples, there is a population effect that compensates for potentially poor sequencing that might affect some of the individuals. But if you're processing wildly different samples together that can potentially confuse the assembly process, so it's important to only process samples together if they form a cohort that makes sense.
    Geraldine Van der Auwera, PhD
  • blueskypyblueskypy Posts: 108Member
    hi, Geraldine, Thanks for your explanation! I agree with you totally! However, my boss just does not like the idea that the genotying result of a sample may depend on the composition of the cohort, and thinks that each sample should run with HC separately. So I wonder if you can recommend any docs or papers that illustrate the benefit of calling all the samples of a cohort together vs separately?
    Thanks,
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 4,410Administrator, GSA Member admin
    I'm not aware of any papers that have done this comparison in detail, but if you look at the 1000 Genomes publications, all that was done with multisample calling. Note that now if you have decent quality sequencing data, any differences should be minimal. The problem with single-sample calling is that you won't be able to perform variant recalibration, and will be stuck with hard filtering, which is an inferior method.
    Geraldine Van der Auwera, PhD
  • splaisansplaisan Posts: 14Member
    Dear Geraldine, I am experiencing strange errors with the command adapted from above. The whole workflow until this point worked like a charm (thanks for your great pages with code, this is really the way to go for me) The env variables are ok as they were used until this point already. I use a freshly installed GATK. Any idea what could be wrong?
    Thanks and cheers from BE.
    java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \
    >     -T HaplotypeCaller \
    >     -R ${reference} \
    >     -I ${result}/reduced_reads.bam \
    >     -L chr21 \
    >     --genotyping_mode DISCOVERY \
    >     --output_mode EMIT_VARIANTS_ONLY \
    >     -stand_emit_conf 10 \
    >     -stand_call_conf 30 \
    >     -o ${result}/raw_variants.vcf
    ##### ERROR ------------------------------------------------------------------------------------------
    ##### ERROR A USER ERROR has occurred (version 2.7-2-g6bda569): 
    ##### ERROR
    ##### ERROR This means that one or more arguments or inputs in your command are incorrect.
    ##### ERROR The error message below tells you what is the problem.
    ##### ERROR
    ##### ERROR If the problem is an invalid argument, please check the online documentation guide
    ##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
    ##### ERROR
    ##### ERROR Visit our website and forum for extensive documentation and answers to 
    ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ##### ERROR
    ##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
    ##### ERROR
    ##### ERROR MESSAGE: Argument with name 'output_mode' isn't defined.
    ##### ERROR ------------------------------------------------------------------------------------------
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 4,410Administrator, GSA Member admin
    Hi @splaisan,
    Actually this is my bad; the output_mode argument was deprecated in 2.7 and I forgot to update the tutorial. I will fix that asap, thanks for pointing it out.
    Geraldine Van der Auwera, PhD
  • splaisansplaisan Posts: 14Member
    Thanks,
    I removed the guilty parameter and it worked. I will see later if a replacing argument comes to shine.
    Does HaplotypeCaller fully support ReduceReads bam files now? I read an old post about this where caution was suggested for this at the time under dev application (just to make sure I am not doing some mistake!).
    Thanks again for the great DOC!
    Stephane
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 4,410Administrator, GSA Member admin
    Hi Stephane,
    For HC, by default it will only emit variants, and there is no replacing argument as such, but there is a different argument that now allows you to emit reference calls with confidence scores. Have a look at this document:
    http://www.broadinstitute.org/gatk/guide/article?id=2940
    HC now handles reduced bams just fine, yes. If that caution note is in one of our docs (as opposed to a user's question/discussion thread) let me know and I will update it.
    Geraldine Van der Auwera, PhD
  • splaisansplaisan Posts: 14Member
    This new argument recalls me of Complete Genomics calling ref positions which is extremely useful when comparing several genomes with one having a variant and often not knowing if others were ref or just not-sequenced at this given position. Thanks for pointing to that page.
  • splaisansplaisan Posts: 14Member
    When calling variants on a full genome scale, is it correct to split the calling based on the same full bam file over 22+ parallel jobs using the L parameter or should I call against the full target genome in a single run?
    I fear that the first approach will ignore the important information for those reads mapping alternatively to several possible chromosomes and bias the calls towards the -L chosen target.
    What is the best practice for parallelization of the calling step using HaplotypeCaller?
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 4,410Administrator, GSA Member admin
    Calling based on the same full bam file over 22+ parallel jobs using the -L parameter is absolutely fine. GATK ignores alternate mappings anyway; it will only consider primary alignments.
    Geraldine Van der Auwera, PhD
  • mprasadmprasad FrancePosts: 2Member
    Hello Geraldine,
    I was doing pretty well so far following instructions in Best Practices and troubleshooting the occassional hitch, but I now have a problem I am not sure how to tackle. I tried to use HaplotypeCaller to call variants in a preprocessed BAM file that I generated from a fastq file of a single patient's exome using the guidelines in best practices. I used the parameters recommended in Best Practices, and obtained an output file with around 1.1 million variants. I then ran the VariantFiltration tool on snps, again with the recommended parameters, hoping that I would end up with a more reasonable number of variants. But I still had ~900,000 snps that passed the filter, which can't be right. I repeated the variant calling process with UnifiedGenotyper just to see if there was a difference but I ended up with similar numbers. I am wondering if there is something wrong with my bam file (This is my first ever experience with handling raw exome data)? Any suggestions/comments will be much appreciated. Below is the command I used for HaplotypeCaller
    java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R '/home/megana/Desktop/Sequence_Analysis_Tools/GATK_Tools/resource_bundle/human_g1k_v37.fasta' -I '/home/megana/Desktop/HDS_05/reduced_HDS_05_5.bam' --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 -o raw_variants_HDS_05_5.vcf
    Running the VariantFiltration with the following command
    java -jar GenomeAnalysisTK.jar -T VariantFiltration -R '/home/megana/Desktop/Sequence_Analysis_Tools/GATK_Tools/resource_bundle/human_g1k_v37.fasta' -V '/home/megana/Desktop/HDS_05/raw_snps_HDS-05_5.vcf' --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || HaplotypeScore > 13.0 || MappingQualityRankSum < -12.5 || ReadPosRankSum < -8.0" --filtername "GATKbestpractices" -o '/home/megana/Desktop/HDS_05/filtered_snps_HDS-05_5.vcf'
    led to several warnings such as
    WARN  14:25:11,496 Interpreter - ![63,84]: 'QD < 2.0 || FS > 60.0 || MQ < 40.0 || HaplotypeScore > 13.0 || MappingQualityRankSum < -12.5 || ReadPosRankSum < -8.0;' undefined variable MappingQualityRankSum
    but based on another thread I don't think this is a problem.
    Thank you for your help and for the wonderful tool/resources the team has made available to the community.
    Megana
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 4,410Administrator, GSA Member admin
    Hi Megana,
    You should try calling your variants with an intervals file corresponding to the exome capture targets of your experiment (use the -L argument). This will restrict the analysis to only the targeted exons and should eliminate variants called in off-target regions. See if that helps...
    Geraldine Van der Auwera, PhD
  • mprasadmprasad FrancePosts: 2Member
    Hi Geraldine,
    That worked like a charm- I now have ~44,000 variants. Thanks for the prompt response!
    Megana
  • chunxuanchunxuan Posts: 11Member
    edited December 2013
    @Geraldine_VdAuwera said: I'm not aware of any papers that have done this comparison in detail, but if you look at the 1000 Genomes publications, all that was done with multisample calling. Note that now if you have decent quality sequencing data, any differences should be minimal. The problem with single-sample calling is that you won't be able to perform variant recalibration, and will be stuck with hard filtering, which is an inferior method.
    Hi @Geraldine_VdAuwera, could you kindly explain a little bit more why HaplotypeCaller on single sample could not be followed by VariantRecalibrator? I didn't found related info in other places.
    EDIT: I found in FAQ section...
    Post edited by chunxuan on

No comments:

Post a Comment