ANNOVAR

Becoming ANNOVAR Guru...

ANNOVAR offers several essential functionalities; combined with the right settings and the right database files, it lets creative users accomplish a wide range of tasks. Below is a list of problems that ANNOVAR may help solve.

1. I sequenced ten patients with a rare Mendelian disease and I want to find the causal gene.

The auto_annovar.pl program can help achieve this goal. The mode of inheritance is usually unknown, so the disease could be recessive with compound heterozygotes, or caused by a single dominant de novo mutation.

Since the sample size is large, we can start with something less stringent and see what is shared among the 10 subjects. Run auto_annovar.pl with "-step 1,4-6 -model dominant" to generate a list of candidate genes, then compare the 10 subjects to find over-represented genes for follow-up.
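The comparison across subjects can be done with a few lines of scripting. The following is a minimal sketch, assuming you have already extracted one list of candidate gene names per subject from the auto_annovar.pl output; the function name, subject labels and gene names are hypothetical and not part of ANNOVAR.

```python
from collections import Counter


def shared_genes(genes_by_subject, min_subjects):
    """Return (gene, n_subjects) pairs for genes seen in at least min_subjects subjects."""
    counts = Counter()
    for genes in genes_by_subject.values():
        # Count each gene once per subject, even if it carries several variants.
        counts.update(set(genes))
    hits = [(gene, n) for gene, n in counts.items() if n >= min_subjects]
    # Most frequently shared genes first; ties broken alphabetically.
    return sorted(hits, key=lambda item: (-item[1], item[0]))


# Hypothetical candidate genes for three subjects.
example = {
    "subject01": ["GENE_A", "GENE_B", "GENE_C"],
    "subject02": ["GENE_A", "GENE_C", "GENE_A"],
    "subject03": ["GENE_A", "GENE_D"],
}
ranked = shared_genes(example, min_subjects=2)
print(ranked)
```

Lowering min_subjects is one way to relax the search if no gene is shared by all subjects.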

If the number of samples is smaller (say 3 samples), finding the gene becomes considerably more difficult, so some fishing experiments need to be done. Users can first try running auto_annovar.pl with all default parameters, take the step9 output file and run it through the AVSIFT pipeline in ANNOVAR, and then manually examine the reduced list of variants and their genes under dominant and recessive modes. If nothing obvious shows up, relax the threshold a bit and try again.

2. I found a susceptibility locus in GWAS that functions as a strong eQTL and I sequenced the region, but which is the causal variant?

The only real information here is that the variant probably regulates gene expression in cis, so it probably disrupts a transcription factor binding site, a microRNA target site, or some other type of regulatory region. If there are additional association data (for example, if you actually sequenced, say, 50 persons), you can use genotype imputation on the original GWAS sample set to further trim down the potential causal variants. (It is very important to note that P-values no longer make sense here, as the truly causal variant may have a higher P-value and a lower allele frequency; look at odds ratios as well to identify top candidates.) Then run the list of variants from the sequencing run through ANNOVAR using the TFBS, microRNA target, most conserved site and RNA Evofold annotations to help further trim down the causal variants. Finally, clone the segment into a luciferase vector and see how it works.

3. I called genotypes and indels with two algorithms from next-gen sequencing data. How do I compare them?

It is actually not that hard to write your own script to do that. Nevertheless, ANNOVAR can help with this task: just scan one file (as query) against the other (as generic database) and you will see all overlapping variants. One additional step is to classify the overlapping variants into het-het, het-hom and hom-hom between the two algorithms.
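The classification step can be sketched as follows. This is only an illustration, assuming each call set has been loaded into a dictionary keyed by (chr, start, end, ref, alt) with a zygosity label of "het" or "hom" as the value; the function name and the example calls are hypothetical.

```python
from collections import Counter


def classify_overlap(calls_a, calls_b):
    """Count overlapping variants by zygosity pair: het-het, het-hom, hom-hom."""
    counts = Counter()
    for variant, zyg_a in calls_a.items():
        zyg_b = calls_b.get(variant)
        if zyg_b is None:
            continue  # variant called by the first algorithm only
        # Sort the pair so that het/hom and hom/het fall into one class.
        counts["-".join(sorted((zyg_a, zyg_b)))] += 1
    return counts


# Hypothetical calls from two algorithms.
calls_a = {
    ("1", 100, 100, "A", "G"): "het",
    ("1", 200, 200, "C", "T"): "hom",
    ("2", 300, 300, "G", "A"): "hom",
    ("3", 50, 50, "T", "C"): "het",
}
calls_b = {
    ("1", 100, 100, "A", "G"): "het",
    ("1", 200, 200, "C", "T"): "het",
    ("2", 300, 300, "G", "A"): "hom",
    ("4", 75, 75, "G", "C"): "het",
}
summary = classify_overlap(calls_a, calls_b)
print(dict(summary))
```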

4. I identified SNPs from one population (possibly a mixture of cases) and SNPs from another (possibly a mixture of controls). How do I compare them?

This is essentially the same as the comparison of the 1000 Genomes Project 2009 data and 2010 data that I described on this page. ANNOVAR can be very helpful here and takes a few minutes to solve the puzzle. It is unlikely that you can use a system command such as "grep" to do the pairwise comparison when dealing with 10 million variants.

For example, I generated the variant calls for the ~50 individuals provided by Complete Genomics. I can just concatenate all calls (many millions) into a single file called "cg46.avdb" to serve as a "database", then directly annotate a given list of variants against this database using "-dbtype generic -genericdb cg46.avdb". It is as simple as that.
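To see why a single-pass lookup scales where repeated "grep" calls do not, consider the sketch below. It is not how ANNOVAR is implemented internally; it only illustrates the idea, assuming both files are tab-delimited with chr, start, end, ref and alt as the first five columns. The function names and example lines are hypothetical.

```python
def variant_key(line):
    """Use the first five tab-separated columns (chr, start, end, ref, alt) as the key."""
    return tuple(line.rstrip("\n").split("\t")[:5])


def split_by_database(query_lines, database_lines):
    """Split query variants into those present in the database and those absent."""
    # Build the lookup table once; each query line is then a constant-time check.
    known = {variant_key(line) for line in database_lines if line.strip()}
    found, missing = [], []
    for line in query_lines:
        if not line.strip():
            continue
        target = found if variant_key(line) in known else missing
        target.append(line.rstrip("\n"))
    return found, missing


# Hypothetical database and query content.
database = ["1\t100\t100\tA\tG\n", "2\t500\t500\tC\tT\n"]
query = ["1\t100\t100\tA\tG\tsample1\n", "3\t900\t900\tG\tA\tsample1\n"]
found, missing = split_by_database(query, database)
print(len(found), len(missing))
```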

5. I genotyped a sample by Illumina SNP array and I sequenced its whole genome. What is the sensitivity of genotype calling from sequencing data? How does it change when sequencing coverage drops?

The ANNOVAR program that implements this function will be available in a future release. It converts the variants file into a PED file and a MAP file for comparison to SNP array data, and it considers the subset of SNPs from the SNP array that carry non-reference alleles.
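Until that program is available, the calculation itself can be sketched by hand. The example below is an illustration under stated assumptions, not the forthcoming ANNOVAR program: it defines sensitivity as the fraction of array sites with a non-reference genotype that were also called as variants from sequencing. The function name and all data are hypothetical.

```python
def calling_sensitivity(array_genotypes, reference_alleles, sequencing_variants):
    """Fraction of non-reference array sites that also appear among sequencing calls.

    array_genotypes: {(chr, pos): (allele1, allele2)} from the SNP array
    reference_alleles: {(chr, pos): reference base} for the same sites
    sequencing_variants: set of (chr, pos) called as variant from sequencing
    """
    nonref_sites = [
        site for site, genotype in array_genotypes.items()
        if any(allele != reference_alleles[site] for allele in genotype)
    ]
    if not nonref_sites:
        return None  # sensitivity is undefined without any non-reference site
    detected = sum(1 for site in nonref_sites if site in sequencing_variants)
    return detected / len(nonref_sites)


# Hypothetical data: three of four array sites carry a non-reference allele.
array_genotypes = {
    ("1", 100): ("A", "G"),
    ("1", 200): ("C", "C"),
    ("2", 300): ("T", "T"),
    ("2", 400): ("G", "A"),
}
reference_alleles = {("1", 100): "A", ("1", 200): "C", ("2", 300): "C", ("2", 400): "G"}
sequencing_variants = {("1", 100), ("2", 300)}
sensitivity = calling_sensitivity(array_genotypes, reference_alleles, sequencing_variants)
print(sensitivity)
```

Repeating the calculation on call sets made from down-sampled reads would show how the value changes as coverage drops.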

6. I have a list of SNPs from the Affymetrix Genome-wide 6.0 array. The annotation on alleles for this array is notoriously wrong. How can I be sure which allele is on the forward strand and which allele is on the reverse strand?

Just list the regions for all SNPs in tab-delimited format (chr, start, end as the 3 columns), and use retrieve_seq_from_fasta.pl to identify the nucleotide on the forward strand in the genome build of interest (hg18 or hg19). This is the most reliable way to solve the problem. Do not trust resources such as dbSNP, 1000 Genomes or HapMap, as their "forward allele" may not be the real allele on the forward strand in a given genome build.
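Once the forward-strand reference base is known for each SNP, deciding the strand of the array alleles is a simple check. The sketch below assumes that the reference base is one of the two SNP alleles; the function name and return labels are hypothetical. Note that A/T and C/G SNPs look the same on both strands, so the reference base alone cannot resolve them.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def infer_strand(ref_base, allele_a, allele_b):
    """Guess whether the two array alleles are reported on the forward or reverse strand."""
    ref = ref_base.upper()
    alleles = {allele_a.upper(), allele_b.upper()}
    flipped = {COMPLEMENT[allele] for allele in alleles}
    if alleles == flipped:
        return "ambiguous"  # A/T or C/G SNP: identical on both strands
    if ref in alleles:
        return "forward"
    if ref in flipped:
        return "reverse"
    return "mismatch"  # e.g. the reference base is N


print(infer_strand("A", "A", "G"))
print(infer_strand("a", "T", "C"))
print(infer_strand("A", "A", "T"))
```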

7. I need to remove known dbSNPs from all variant calls in a VCF file but I want to keep the VCF file format.

Many users have asked about this, since they need to keep the VCF format for downstream analysis. In fact, this is easily doable with the -includeinfo argument.

First use convert2annovar.pl with the -includeinfo argument to convert the VCF file to an ANNOVAR input file, then do all the necessary analysis with ANNOVAR (filtering variants, etc.). Then simply use the Linux system command "cut -f 5-" (the number could be 4, 5, 6, etc., depending on how many columns precede the original VCF fields) to convert the ANNOVAR output back into VCF format with all the original fields intact. This is possible because both the ANNOVAR output and VCF are tab-delimited formats.
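For users who prefer a script over the "cut" command, the same column-stripping step can be sketched as below. The function name and the example line are hypothetical; the example assumes five ANNOVAR columns precede the VCF fields, so the first kept column is 6, but the right value depends on your own output file.

```python
def drop_leading_columns(lines, first_kept):
    """Keep tab-separated columns from first_kept (1-based) onward, like `cut -f N-`."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        kept.append("\t".join(fields[first_kept - 1:]))
    return kept


# Hypothetical line: five ANNOVAR columns followed by the original VCF fields.
annovar_lines = ["1\t100\t100\tA\tG\t1\t100\trs1\tA\tG\t50\tPASS\tDP=10\n"]
vcf_lines = drop_leading_columns(annovar_lines, first_kept=6)
print(vcf_lines[0])
```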

8. I need to design a custom array with Illumina but they ask for the entire sequence rather than two alleles...

Most people annotate novel SNPs as chr, position, reference allele and alternative allele, but array manufacturers typically need a flanking sequence (say 101bp surrounding the actual allele). In your Excel file, make two columns holding position-50 and position-1. Then copy and paste the chromosome and these two positions into a tab-delimited text file. Then run "retrieve_seq_from_fasta.pl region.50bp -format tab -seqdir humandb/hg18_seq/ -tabout", and copy and paste the results back into Excel. All lines should already be in the same order as the input. Do the same thing for the right flanking sequence.
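If you would rather not build the region files in Excel, the coordinates can be generated with a short script. This is a minimal sketch: the function name and the example SNP are hypothetical, and the right flank is assumed to run from position+1 to position+50, mirroring the left flank described above.

```python
def flank_regions(snps, flank=50):
    """Build tab-delimited (chr, start, end) lines for the left and right flanks of each SNP."""
    left, right = [], []
    for chrom, pos in snps:
        left.append(f"{chrom}\t{pos - flank}\t{pos - 1}")   # position-50 .. position-1
        right.append(f"{chrom}\t{pos + 1}\t{pos + flank}")  # assumed: position+1 .. position+50
    return left, right


# Hypothetical SNP on chromosome 1 at position 1000.
left, right = flank_regions([("1", 1000)])
print(left[0])
print(right[0])
```

Each list can be written to its own text file, one line per SNP, and passed to retrieve_seq_from_fasta.pl as in the command above; the two 50bp flanks plus the allele itself give the 101bp sequence.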