Download ANNOVAR

ANNOVAR is an efficient software tool that uses up-to-date information to functionally annotate genetic variants detected from diverse genomes.

1 ANNOVAR main package
2 Additional databases
3 Version history
4 Credit

 

ANNOVAR main package

ANNOVAR (2012Feb23 version) can be downloaded here.

ANNOVAR is written in Perl and can be run as a standalone application on diverse hardware systems where standard Perl modules are installed. For Windows users: since some functionality of ANNOVAR requires external programs such as "gzip" and "grep", it is best not to use Windows unless you also install these programs (try MSYS, which I like the most, but there are other options such as Cygwin). For Mac users: ALWAYS make sure that the lines in your ANNOVAR input files end with "\n" instead of "\r"; use the command perl -pe 's/\r/\n/g' < oldfile > newfile to convert the file format first!
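Before converting, it can help to check whether a file contains carriage returns at all. The sketch below counts them and performs the same "\r" to "\n" substitution with tr instead of perl. The function names and file names are illustrative assumptions, not part of ANNOVAR.

```shell
#!/bin/sh
# Hypothetical helpers (not part of ANNOVAR) for checking and fixing
# classic Mac line endings before annotation.

# Print the number of carriage-return characters in file $1.
count_cr() {
    tr -cd '\r' < "$1" | wc -c | tr -d ' '
}

# Write a copy of $1 to $2 with every "\r" replaced by "\n".
# Same substitution as: perl -pe 's/\r/\n/g' < oldfile > newfile
cr_to_lf() {
    tr '\r' '\n' < "$1" > "$2"
}

# Demo on a throwaway file that uses "\r" as the line terminator.
tmpdir=$(mktemp -d)
printf '1\t1000\t1000\tA\tG\r1\t2000\t2000\tC\tT\r' > "$tmpdir/oldfile"
cr_to_lf "$tmpdir/oldfile" "$tmpdir/newfile"
echo "CR before: $(count_cr "$tmpdir/oldfile"), after: $(count_cr "$tmpdir/newfile")"
```

Note that this substitution turns a Windows-style "\r\n" ending into two newlines, so it is meant for files that use "\r" alone.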

Additional databases

Most of the databases that ANNOVAR uses can be directly retrieved from the UCSC Genome Browser Annotation Database. In general, users can use "-downdb" in ANNOVAR to download these files. As of Feb2012, there are 6418 databases for hg19, 6443 databases for hg18, 1841 databases for mm9, etc.

Several very commonly used annotation databases for human genomes are additionally provided by me, as described below. In general, users can use "-downdb -webfrom annovar" in ANNOVAR to download these files directly.
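A download session might look like the sketch below. It only prints the commands (a dry run) so they can be reviewed before anything is fetched. The humandb/ target directory, the choice of tables, and the helper name are assumptions for illustration; -buildver selects the genome build.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) ANNOVAR download commands.
# Assumptions: annotate_variation.pl is on PATH when the commands are run,
# and humandb/ is the local database directory (a hypothetical choice).

# Print one download command for build $1, table $2, directory $3.
downdb_cmd() {
    printf 'annotate_variation.pl -downdb -webfrom annovar -buildver %s %s %s\n' "$1" "$2" "$3"
}

# Table names taken from the list of databases on this page.
for table in ljb_all esp5400_all snp135; do
    downdb_cmd hg19 "$table" humandb/
done
```

Once the printed list looks right, the same lines can be run by hand or piped to sh.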

Genome Build | Table Name | Explanation | Additional Comments
hg18 | avsift | whole-exome SIFT scores for non-synonymous variants | file updated 2011Mar01, index updated 2012Feb22
hg19 | avsift | same as above | file updated 2011Mar01, index updated 2012Feb22
hg18 | ljb_sift | whole-exome LJBSIFT scores (which corresponds to 1-SIFT!) | file and index updated 2012Feb22
hg19 | ljb_sift | same as above | file and index updated 2012Feb22
hg18 | ljb_pp2 | whole-exome PolyPhen version 2 scores | file updated 2011May11, index updated 2012Feb22
hg19 | ljb_pp2 | same as above | file updated 2011May11, index updated 2012Feb22
hg18 | ljb_phylop | whole-exome PhyloP scores | file updated 2011May11, index updated 2012Feb22
hg19 | ljb_phylop | same as above | file updated 2011May11, index updated 2012Feb22
hg18 | ljb_lrt | whole-exome LRT scores | file updated 2011May11, index updated 2012Feb22
hg19 | ljb_lrt | same as above | file updated 2011May11, index updated 2012Feb22
hg18 | ljb_mt | whole-exome MutationTaster scores | file updated 2011May11, index updated 2012Feb22
hg19 | ljb_mt | same as above | file updated 2011May11, index updated 2012Feb22
hg18 | ljb_gerp++ | whole-exome GERP++ scores | file and index updated 2012Feb22
hg19 | ljb_gerp++ | same as above | file and index updated 2012Feb22
hg18 | ljb_all | whole-exome LJBSIFT, PolyPhen, PhyloP, LRT, MutationTaster, GERP++ scores | file and index updated 2012Feb22
hg19 | ljb_all | same as above | file and index updated 2012Feb22
hg18 | cg46 | alternative allele frequency in 46 unrelated human subjects sequenced by Complete Genomics | index updated 2012Feb22
hg19 | cg46 | same as above | index updated 2012Feb22
hg18 | cg69 | allele frequency in 69 human subjects sequenced by Complete Genomics | index updated 2012Feb22
hg19 | cg69 | same as above | index updated 2012Feb22
hg18 | esp5400_aa | alternative allele frequency in African Americans in the NHLBI-ESP project with 5400 exomes | index updated 2012Feb22
hg19 | esp5400_aa | same as above | index updated 2012Feb22
hg18 | esp5400_ea | alternative allele frequency in European Americans in the NHLBI-ESP project with 5400 exomes | index updated 2012Feb22
hg19 | esp5400_ea | same as above | index updated 2012Feb22
hg18 | esp5400_all | alternative allele frequency in all subjects in the NHLBI-ESP project with 5400 exomes | index updated 2012Feb22
hg19 | esp5400_all | same as above | index updated 2012Feb22
hg18 | 1000g (3 data sets) | alternative allele frequency data in 1000 Genomes Project | Read here for details, index updated 2012Feb22
hg18 | 1000g2010 (3 data sets) | same as above | Read here for details, index updated 2012Feb22
hg18 | 1000g2010jul (3 data sets) | same as above | Read here for details, index updated 2012Feb22
hg19 | 1000g2010nov | same as above | Read here for details, index updated 2012Feb22
hg19 | 1000g2011may | same as above | Read here for details, index updated 2012Feb22
hg19 | 1000g2012feb | same as above | Read here for details, file and index updated 2012Mar08
hg18 | snp128 | dbSNP with ANNOVAR index files | index updated 2012Feb22
hg18 | snp129 | same as above | index updated 2012Feb22
hg18 | snp130 | same as above | index updated 2012Feb22
hg19 | snp130 | same as above | index updated 2012Feb22
hg18 | snp131 | same as above | index updated 2012Feb22
hg19 | snp131 | same as above | index updated 2012Feb22
hg18 | snp132 | same as above | index updated 2012Feb22
hg19 | snp132 | same as above | index updated 2012Feb22
hg19 | snp135 | same as above | file and index updated 2012Feb22
hg18 | refGene | FASTA sequences for all annotated transcripts in RefSeq Gene | file updated 2012Feb22
hg19 | refGene | same as above | file updated 2012Feb22
hg18 | knownGene | FASTA sequences for all annotated transcripts in UCSC Known Gene | file updated 2012Feb22
hg19 | knownGene | same as above | file updated 2012Feb22
hg18 | ensGene | FASTA sequences for all annotated transcripts in ENSEMBL Gene | file updated 2012Feb22
hg19 | ensGene | same as above | file updated 2012Feb22
hg18 | gerp++elem | conserved genomic regions by GERP++ | this is region-based score, not base-level score
hg19 | gerp++elem | same as above | same as above
mm9 | gerp++elem | same as above | same as above
hg18 | gerp++ | whole-genome GERP++ scores | HUGE SIZE. Currently not available
hg19 | gerp++ | same as above | same as above
mm9 | gerp++ | same as above | same as above
other | <other> | other experimental databases | Currently not available
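Because ljb_sift stores 1-SIFT rather than the SIFT score itself, values from the avsift and ljb_sift tables are on opposite scales. A minimal sketch of the conversion, assuming a plain numeric score as input (the function name is hypothetical and not part of ANNOVAR):

```shell
#!/bin/sh
# Convert an LJB_SIFT value (listed above as 1-SIFT) back to the
# original SIFT scale. Hypothetical helper, not part of ANNOVAR.
ljb_to_sift() {
    awk -v s="$1" 'BEGIN { printf "%.2f\n", 1 - s }'
}

# Example: an LJB_SIFT value of 0.95 corresponds to a SIFT score of 0.05.
ljb_to_sift 0.95
```

The two-decimal output format is only a display choice for this sketch.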

 

Version history

The idea was conceived in 2009, motivated by several whole-genome and whole-exome sequencing papers.

On 2010Feb15, first public release of ANNOVAR.

On 2010Mar07, new release (subversion 322) fixed -regionanno issues.

On 2010Mar27, a major updated release was uploaded.

On 2010Mar30, updated the auto_annovar script and improved ANNOVAR memory management so that it runs in environments with limited memory.

On 2010Jun02, the functionality of ANNOVAR is greatly improved, and now includes an optional step to implement SIFT-based annotation of non-synonymous SNPs (that is, predict whether non-synonymous SNPs are detrimental or tolerated), as well as the ability to examine GFF3 databases.

On 2010Jun06, several bugs were fixed, and the convert2annovar.pl program was added. ANNOVAR can now handle the March 2010 release of the 1000 Genomes Project data.

On 2010Jun30, several functions were enhanced and bugs were fixed. This version fixed a problem downloading Ensembl annotations, added the functionality to handle VCF files as annotation databases directly, improved the functionality of the -downdb operation, and fixed gene-based annotation issues due to errors in the FASTA files provided by UCSC. An update is also provided for convert2annovar.pl, which fixes an issue when handling pileup format files with indels.

On 2010Aug06, added the summarize_annovar.pl program to convert whole-genome variant data into an Excel file that users can examine using Excel "filter" functions to identify causal mutations. Made major changes to the retrieve_seq_from_fasta.pl file so that it can handle several different types of input files, knows how to handle whole-genome sequence files for several irregularly formatted model organisms (such as chimp), and produces FASTA records with time stamps. Several known minor bug fixes for the annotate_variation.pl program are also implemented. convert2annovar.pl can now handle MAQ genotype calling output files.

On 2010Sep29, minor bug fixes and function enhancements. convert2annovar.pl can now handle CASAVA and VCF version 4 genotype call files, but these functionalities are not mature yet and are being rigorously tested.

On 2010Dec02, added support for defining custom precedence in gene-based annotation, changed default precedence to exonic=splicing > ncrna > utr5=utr3 > intronic > upstream=downstream > intergenic; fixed bugs in annotating intronic variants between two UTR-exons as UTR-variants; fixed bugs in reporting amino acid change for reverse strand insertions; added support for 1000G hg19 coordinate (Nov 2010 release); added support for SIFT hg19 coordinate; changed exonic variant annotation (adding cDNA level annotations to amino acid annotations) per user requests; fixed bugs in handling lower-case letters in --genericdbfile.

On 2011Jan17, added the -colsWanted argument for users to choose the desired output columns in the DB file, added chrX data to the 1000G Nov 2010 data set (use -downdb to re-download the data set), updated gene definition and FASTA file for human and mouse, changed filter operation to handle SNPs with 3 or 4 alleles annotated in dbSNP, changed "stop lost" to "stop loss" in exonic annotation, fixed a bug in summarize_annovar.pl in handling older 1000G files, fixed a bug in convert2annovar.pl in handling insertions for VCF4 files, changed the default 1000G file to 2010jul for hg18 in summarize_annovar.pl.

On 2011Jan31, fixed the "counts cannot be inferred" issue in convert2annovar.pl, more informative conversion for SamTools pileup file in convert2annovar.pl, added ability to handle the newer version of SOLiD GFF file in convert2annovar.pl, added protein level annotation for exonic deletion, fixed the bug in handling negative strand in dbSNP records. On 2011Jan31 3PM PST, a small bug was discovered and the package was re-uploaded.

On 2011Feb11, fixed a bug in dbSNP filtering that was introduced in the 2011-01-31 version.

On 2011Feb20, changed convert2annovar.pl for more informative handling of pileup files and VCF4 files, changed exonic annotation for frameshift stopgain/stoploss mutations by printing amino acids before stop codon, changed "database annotation error" warning (due to for example co-existence of chr6 and chr6_cox_hap1), ANNOVAR now only examines the first occurrence of a transcript if the transcript is mapped to multiple locations with discordant sequence length, added functionality to perform gene-based annotations using GENCODE or other gene annotation systems, region-based annotation no longer prints Score=0 in the second column, changed output file name for region-based annotation using mceXway, tfbs, band, segdup keywords, fixed a bug in filter-based annotation for block substitution on single nucleotide, retrieve_seq_from_fasta.pl: added warning message for sequences that occur multiple times with discordant lengths, retrieve_seq_from_fasta.pl: no longer processes 'alternative haplotype' chromosomes such as chr6_cox_hap1 by default, fixed a bug in having negative values in cDNA positions when annotating long indels, fixed the bug in not printing out normalized scores when annotating phastCons regions. (Note that a small issue was found after uploading, so an updated file was uploaded on 2011Feb22).

On 2011May06, fixed the problem downloading bosTau4 sequence for cow genome, fixed the -separate argument that printed the line column twice in exonic annotation, the ./. genotype in VCF file is annotated as "unknown" in updated convert2annovar.pl, fixed a bug in retrieve_seq_from_db.pl in handling ENSEMBL gene for yeast, added -exonsort argument to sort exon number in output line for gene-based annotation, replaced Em: with Em. for very rare scenarios where UCSC Gene name is prefixed with Em:, fixed auto_annovar bug in handling wrong mce file name due to changes in annotate_variation.pl, fixed problem on handling snp132 files due to different file format, updated convert2annovar.pl to enhance functionality to handle VCF files, updated summarize_annovar.pl to incorporate additional scoring methods in Excel output, added ljb scoring system in filter-based annotation.

On 2011Jun18, improved the annotation of splicing variants, added -reverse argument to better control -score_threshold argument, added coding_change.pl program to print out protein sequence before and after mutation, added -exonsort argument to annotate_variation.pl to make results stable, added -bedfile argument for region based annotation using BED files as database, fixed a bug in processing VCF files in annotate_variation.pl directly, fixed issues in convert2annovar.pl to handle zygosity status in mpileup file generated by Samtools, added functions to process BED file directly in region annotation

On 2011Sep11, significant speedup of filter operation for certain databases (dbSNP, SIFT, PolyPhen, etc), added warning message if user inputs wrong reference allele for exonic mutations, added exon number to splicing annotation in gene-based annotation, changed ncRNA to ncRNA_exon and ncRNA_intron in gene-based annotation, added support for cg69 (complete genomics) database and GERP++ database

On 2011Oct02, fixed the cDNA off-by-one error for splicing annotation for acceptor site splicing variants, fixed bug in summarize_annovar.pl when -step argument is used, ANNOVAR now prints out examples when exonic SNPs have WRONG reference alleles specified in your input file, fixed the bug on indexing-based filter search on dbSNP (indexing-based search now requires '-webfrom annovar' when -downdb is used), fixed certain ncRNA annotation errors (such as ncRNA_UTR5, ncRNA_exonic) when the variant hits both coding and noncoding gene, fixed the bug to annotate ncRNA_exonic with exonic_variant_function, only coding transcripts will be used in gene-based annotation if a gene has coding and noncoding transcripts

On 2011Nov20, mRNA FASTA sequences without complete ORF annotation will no longer be used in exonic annotation, fixed the bug in specifying ensgene in command line in auto_annovar and summarize_annovar, fixed the problem in handling dbSNP132 in hg19 coordinate, slightly changed the "exonic SNPs have WRONG reference alleles" warning message to be clearer, retrieve_seq_from_fasta.pl now reports transcripts whose ORFs have premature stop codons, fixed the hg18_cg69 and hg19_cg69 allele frequency error, convert2annovar.pl supports GFF3 files generated by 5500SOLiD and the LifeScope software.

On 2012Feb23, added esp5400_ea, esp5400_aa, esp5400_all keywords for allele frequencies in 5400 exomes, added ljb_sift, ljb_gerp++, ljb_all databases for faster/easier retrieval of whole-exome functional scores, updated mRNA sequence files for hg18 and hg19 gene definitions, all custom databases have newer/faster index and the default -indexfilter argument is now 0.9, added -otherinfo argument for -filter operation to print additional information in annotation, slight changes to convert2annovar.pl to better handle CASAVA files, fixed the problem in handling UCSC genes whose names contain spaces, fixed the bug that -reverse does not work for "-dbtype avsift", other minor bug fixes.

On 2012Mar08, added ability to handle 1000G 2012feb version, fixed bug in -allallele argument in convert2annovar.pl when handling more than two alternative alleles in VCF files, slight change to handle latest knowngene annotation due to format change of kgXref file, -verbose now prints out noncoding transcripts that are ignored in analysis in gene-based annotation.

Credit

The ANNOVAR software was originally designed by Dr. Kai Wang. Other developers and significant contributors currently include Dr. German Gaston Leparc and Paul Leo. The index-based filter operation was designed by Allen Day, Marine Huang and Stephen Weinberg at Ion Flux. Many ANNOVAR users have provided valuable feedback, bug reports and suggestions to improve the functionality of ANNOVAR.