ANNOVAR is is an efficient software tool to utilize update-to-date information to functionally annotate genetic variants detected from diverse genomes.
The latest version of ANNOVAR can be downloaded here.
ANNOVAR is written in pure Perl and can be run as a standalone application on diverse hardware systems where standard Perl modules are installed.
Most of the databases that ANNOVAR uses can be directly retrieved from UCSC Genome Browser Annotation Database. In general, users can use "-downdb" in ANNOVAR to download these files. As of Feb2012, there are 6418 databases for hg19, 6443 databases for hg18, 1841 databases for mm9, etc.
Several very commonly used annotation databases for human genomes are additionally provided by me as described below. In general, users can use "-downdb -webfrom annovar " in ANNOVAR directly to download these files.
Idea was conceived in 2009, motivated by several whole-genome sequencing paper and whole-exome sequencing paper.
On 2010Feb15, first public release of ANNOVAR.
On 2010Mar07, new release (subversion 322) fixed -regionanno issues.
On 2010Mar27, major updated release is uploaded.
On 2010Mar30, updated the auto_annovar script and improved ANNOVAR memory management so that it runs in environment with limited memory.
On 2010Jun02, the functionality of ANNOVAR is greatly improved, and now includes an optional step to implement SIFT-based annotation of non-synonymous SNPs (that is, predict whether non-synonymous SNPs are detrimental or tolerated), as well as the ability to examine GFF3 databases.
On 2010Jun06, several bugs have been fixed, and the convert2annovar.pl program has been added. ANNOVAR can now handle March 2010 release of the 1000 Genomes Project data.
On 2010Jun30, several functions were enhanced and bugs were fixed. This version fixed a problem downloading Ensembl annotations, added the functionality to handle VCF file as annotation database directly, improved the functionality of -downdb operation, fixed gene-based annotation issues due to errors in the FASTA files provided by UCSC. An update is also provided for convert2annovar.pl. This fixed an issue when handling pileup format files with indels.
On 2010Aug06, added the summarize_annovar.pl program to convert whole-genome variants data into an Excel file that users can examine using Excel "filter" functions to identify causal mutations. Major changes to the retrieve_seq_from_fasta.pl file such that it can handle several different types of input files, and that it knows how to handle whole-genome sequence files for several irregularly formatted model organisms (such as chimp), and that is produce FASTA records with time stamps. Several known minor bug fixes for the annotate_variation.pl program are also implemented. convert2annovar.pl can now handle MAQ genotype calling output files.
On 2010Sep29, minor bug fixes and function enhancement. convert2annovar.pl can now handle CAVASA and VCF version 4 genotype call files, but these functionalities are not mature yet and are being rigorously tested.
On 2010Dec02, added support for defining custom precedence in gene-based annotation, changed defult precedence as exonic=splicing > ncrna > utr5=utr3 > intronic > upstream=downstream > intergenic; fixed bugs in annotating intronic variants between two UTR-exons as UTR-variants; fixed bugs in reporting amino acid change for reverse strand insertions; added support for 1000G hg19 coordinate (Nov 2010 release); added support for SIFT hg19 coordinate; changed exonic variant annotation (adding cDNA level annotations to amino acid annotations) per user requests; fixed bugs in handling lower-case letters in --genericdbfile.
On 2011Jan17, added -colsWanted argument for users to choose the desired output column in DB file, added chrX data to 1000G Nov 2011 data set (use -downdb to re-download the data set), updated gene definition and FASTA file for human and mouse, changed filter operation to handle SNPs with 3 or 4 alleles annotated in dbSNP, changed "stop lost" to "stop loss" in exonic annotation, fixed a bug in summarize_annovar.pl in handling older 1000G files, fixed a bug in convert2annovar.pl in handling insertions for VCF4 files, changed default 1000G file as 2010jul for hg18 in summarize_annovar.pl.
On 2011Jan31, fixed the "counts cannot be inferred" issue in convert2annovar.pl, more informative conversion for SamTools pileup file in convert2annovar.pl, added ability to handle the newer version of SOLiD GFF file in convert2annovar.pl, added protein level annotation for exonic deletion, fixed the bug in handling negative strand in dbSNP records. On 2011Jan31 3PM PST, a small bug was discovered and the package was re-uploaded.
On 2011Feb11, fixed a bug that was introducted in the 2011-01-31 version to handle dbSNP filtering.
On 2011Feb20, changed convert2annovar.pl for more informative handling of pileup files and VCF4 files, changed exonic annotation for frameshift stopgain/stoploss mutations by printing amino acids before stop codon, changed "database annotation error" warning (due to for example co-existence of chr6 and chr6_cox_hap1), ANNOVAR now only examine the first occurence of a transcript, if the transcript is mapped to multiple locations with discordant sequence length, added functionality to perform gene-based annotations using GENCODE or other gene annotation systems, region-based annotation no longer prints Score=0 in the second column, changed output file name for region-based annotation using mceXway, tfbs, band, segdup keywords, fixed a bug in filter-based annotation for block substitution on single nucleotide, retrieve_seq_from_fasta.pl: added warning message to sequence that occur multiple times with discordant lengths, retrieve_seq_from_fasta.pl: no longer process 'alternative haplotype' chromosomes such as chr6_cox_hap1 by default, fixed a bug in having negative values in cDNA positions when annotating long indels, fixed the bug in not printing out normalized scores when annotating phastCons regions. (Note that a small issue was found after uploading, so an updated file was uploaded on 2011Feb22).
On 2011May06, fixed the problem downloading bosTau4 sequence for cow genome, fixed the -separate argument that print line column twice in exonic annotation, the ./. genotype in VCF file is annotated as "unknown" in updated convert2annovar.pl, fixed a bug in retrieve_seq_from_db.pl in handling ENSEMBL gene for yeast, added -exonsort argument to sort exon number in output line for gene-based annotation, replaced Em: to Em. for very rare scenarios where UCSC Gene name is prefixed with Em:, fixed auto_annovar bug in handling wrong mce file name due to changes in annotate_variation.pl, fixed problem on handling snp132 files due to different file format, updated convert2annovar.pl to enhance functionality to handle VCF files, updated summarize_annovar.pl to incorporate additional scoring methods in Excel output, added ljb scoring system in filter-based annotation
On 2011Jun18, improved the annotation of splicing variants, added -reverse argument to better control -score_threshold argument, added coding_change.pl program to print out protein sequence before and after mutation, added -exonsort argument to annotate_variation.pl to make results stable, added -bedfile argument for region based annotation using BED files as database, fixed a bug in processing VCF files in annotate_variation.pl directly, fixed issues in convert2annovar.pl to handle zygosity status in mpileup file generated by Samtools, added functions to process BED file directly in region annotation
On 2011Sep11, significant speedup of filter operation for certain databases (dbSNP, SIFT, PolyPhen, etc), added warning message if user inputs wrong reference allele for exonic mutations, added exon number to splicing annotation in gene-based annotation, changed ncRNA to ncRNA_exon and ncRNA_intron in gene-based annotation, added support for cg69 (complete genomics) database and GERP++ database
On 2011Oct02, fixed the cDNA off-by-one error for splicing annotation for acceptor site splicing variants, fixed bug in summarize_annovar.pl when -step argument is used, ANNOVAR now prints out examples when exonic SNPs have WRONG reference alleles specified in your input file, fixed the bug on indexing-based filter search on dbSNP (indexing-based search now requires '-webfrom annovar' when -downdb is used), fixed certain ncRNA annotation errors (such as ncRNA_UTR5, ncRNA_exonic) when the variant hits both coding and noncoding gene, fixed the bug to annotate ncRNA_exonic with exonic_variant_function, only coding transcripts will be used in gene-based annotation if a gene has coding and noncoding transcripts
On 2011Nov20, mRNA FASTA sequences without complete ORF annotation will no longer be used in exonic annotation, fixed the bug in specifying ensgene in command line in auto_annovar and summarize_annovar, fixed the problem in handling dbSNP132 in hg19 coordinate, slightly changed the "exonic SNPs have WRONG reference alleles" warning message to be more clear, retrieve_seq_from_fasta.pl now reports transcripts whose ORF have premature stop codon, fixed the hg18_cg69 and hg19_cg69 allele frequency error, convert2annovar.pl supports GFF3 files generated by 5500SOLiD and the LifeScope software
On 2012Feb23, added esp5400_ea, esp5400_aa, esp5400_all keywords for allele frequencies in 5400 exomes, added ljb_sift, ljb_gerp++, ljb_all databases for faster/easier retrieval of whole-exome functional scores, updated mRNA sequence files for hg18 and hg19 gene definitions, all custom databases have newer/faster index and default -indexfilter argument is now 0.9, add -otherinfo argument for -filter operation to print additional information in annotation, slight changes to convert2annovar.pl to better handle CASAVA files, fixed the problem in handling UCSC genes whose names contain space fixed the bug that -reverse does not work for "-dbtype avsift" other minor bug fixes
On 2012Mar08, added ability to handle 1000G 2012feb version, fixed bug in -allallele argument in convert2annovar.pl when handling more than two alternative alleles in VCF files, slight change to handle latest knowngene annotation due to format change of kgXref file, -verbose now print out noncoding transcripts that are ignored in analysis in gene-based annotation
On 2012May25, -downdb works for 1000g2012apr now, mutations in beginning or end of transcript are no longer reported as splicing variants, added -seq_padding argument to pad flanking amino-acid sequence around indels, added -indel_splicing_threshold argument to better annotate splicing variants around indels, the -colWanted argument now works on BED database files, fixed problem with -colWanted argument if the desired column contains comma, minor fix to region annotation when the region in database itself is a zero-length insertion, fixed the bug in complaining "wholegene," in annotation output, enhanced handling of errors in VCF4 files (such as presence of N in alleles) in convert2annvar.pl, added -infoasscore argument for printing entire INFO field in VCF database files
On 2012Oct23, added -veresp argument to summarize_annovar.pl to suppert esp6500 data set, added -aamatrixfile argument to print out amino acid substitution scores such as Gratham scores, changed UCSC download from FTP to HTTP to help users with firewall settings, fixed a problem handling genericdb file when chr prefix is present for chromosomes, fixed a problem downloading index for gerp++gt2 files, added variants_reduction.pl program
On 2013Feb11, mitochondria genome is now supported, the -zerostart argument is no longer supported, better handling of GFF3 files with undefined scores, added -gff3attr argument so that attribute field from GFF3 file can be printed in output, changed summarize_annovar.pl to take -alltranscript argument to print out all isoforms for exonic variants, summarize_annovar.pl now takes esp6500si and snp137NonFlagged as databases, exonic variant near intron/exon boundary are no longer reported as splicing, unless -exonicsplicing is set, fixed a minor issue in finding tar program in BSD-derived operating system, convert2annovar.pl now handles *.gz file or handles stdin as input file name, convert2annovar.pl accepts -comment argument to keep comment lines in VCF4 file in output.
On 2013Feb21, fixed a bug that exonic variants at exon end were annotated as splicing when -exonicsplicing is not set.
On 2013May09, fixed a bug in line count of exonic_variant_function when handling more than 5 million variants,
table_annovar.pl is implemented to replace summarize_annovar.pl
On 2013Jun21, fixed a bug in table_annovar to have empty output when input is in five-column format, fixed a bug in table_annovar fo avsift output, fixed a bug when handling start position for multi-allelic SNPs in dbSNP, fixed a bug when scanning indels with multi-allelic variants in VCF DB file, fixed a bug when chr prefix is present in filter database, fixed a bug in annotate_variation.pl to report position of coding insertions in negative strand, small change to retrieve_seq_from_fasta so it handles zebrafish correctly
On 2013Jul28, much improved VCF conversion function in convert2annovar.pl, improved functionality of table_annovar including support for ljb2* databases, new databases such as nci60 and popfreq_all are now supported, disabled -sortout in table_annovar due to many bugs, updated Gratham matrix due to inconsistencies with original publication
On 2013Aug23, convert2annovar.pl no longer complains when VCF file does not have a valid header, fixed a small bug in convert2annovar.pl to handle certain classes of indels, table_annovar now works on non-human species, minor fix in annovar to handle certain mouse mutations, ccdsGene annotation uses transcript ID as gene name due to lack of gene name in previous versions, implement dup keyword in exonic variant annotation to better conform to HGVS standards
On 2014Jul14, table_annovar now supports -tempdir argument, table_annovar now supports VCF input format and write to output VCF file, table_annovar now use separate column for splicing/UTR notations, convert2annovar can generate all possible SNVs/indels in a genomic region or in a transcript, convert2annovar can generate ANNOVAR input files for list of dbSNP identifiers, improve convert2annovar to better handle block indel/substitution in VCF4, changed 'stopgain SNV' and 'stoploss SNV' to 'stopgain' and 'stoploss' as they apply to both SNVs and indels, add -withfilter argument to convert2annovar to print out FILTER field in output for VCF files, fixed bug to handle VCFdbfile when AF record is in scientific notation, add UTR cdot annotation, improved table_annovar to print out correct column headers when the database file has the information, fixed a convert2annovar bug when VCF file does not have valid VCF header but -includeinfo is specified, add details for splicing variant when -separate flag is used, minor bug fix for dup annotation for insertions in negative strand, minor change to default parameters in table_annovar.pl for ljb2_pp2, fixed an error in convert2annovar to handle multi-sample VCF files
The ANNOVAR software is originally designed by Dr. Kai Wang. Other developers and significant contributors include Dr. German Gaston Leparc, Paul Leo and Jamie Teer. The index-based filter operation were designed by Allen Day, Marine Huang and Stephen Weinberg at Ion Flux. Mehdi Pirooznia from Johnas Hopkins provided 1000g2012apr population-specific allele frequency files, Magali Olivier from International Agency for Research on Cancer and the COSMIC teamprovided COSMIC61 database and instructions on generating COSMIC63 database, Jun Ding from NIH/National Institute of Aging compiled human mitochondria gene annotation file on 1000 Genomes Project reference (GRCh37), Konrad Herbst from German Cancer Research Center compiled human mitochondria gene annotation file on UCSC hg19 coordiante (AF347015.1 or NC001807). Numerous other ANNOVAR users have provided feedbacks, bug reports, code snipets and suggestions to improve the functionality of ANNOVAR and I am indebted to them for their invaluable help.