ANNOVAR: Functional annotation of genetic variants from high-throughput sequencing data
ANNOVAR is an efficient software tool that uses up-to-date information to functionally annotate genetic variants detected from diverse genomes (including human genome hg18, hg19, as well as mouse, worm, fly, yeast and many others). Given a list of variants with chromosome, start position, end position, reference nucleotide and observed nucleotides, ANNOVAR can perform gene-based, region-based and filter-based annotation.
TABLE_ANNOVAR is a script within the ANNOVAR package that is very popular among users. Given a list of variants from whole-exome or whole-genome sequencing, it will generate an Excel-compatible file with gene annotation, amino acid change annotation, SIFT scores, PolyPhen scores, LRT scores, MutationTaster scores, PhyloP conservation scores, GERP++ conservation scores, dbSNP identifiers, 1000 Genomes Project allele frequencies, NHLBI-ESP 6500 exome project allele frequencies and other information.
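As a rough illustration of the variant description mentioned above (chromosome, start position, end position, reference nucleotide and observed nucleotides), the following Python sketch parses one such line into a small record. It is not part of ANNOVAR: the `Variant` type, the `parse_variant_line` helper, the whitespace-delimited layout and the example line are assumptions made for illustration only.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """Hypothetical record holding the five fields described in the text."""
    chrom: str
    start: int
    end: int
    ref: str
    obs: str


def parse_variant_line(line: str) -> Variant:
    """Parse one line of the form: chrom start end ref obs [extra columns].

    Assumes whitespace-delimited columns; any columns after the fifth
    are ignored in this sketch.
    """
    fields = line.split()
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 columns, got {len(fields)}")
    chrom, start, end, ref, obs = fields[:5]
    start_pos, end_pos = int(start), int(end)
    if end_pos < start_pos:
        raise ValueError("end position precedes start position")
    return Variant(chrom, start_pos, end_pos, ref, obs)


# Illustrative line only; not taken from any real data set.
example = parse_variant_line("1 948921 948921 T C")
print(example)
```

Keeping the positions as integers makes later range checks straightforward, while the alleles stay as plain strings so that longer insertions or deletions fit the same record.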
On a modern desktop computer (3 GHz Intel Xeon CPU, 8 GB memory), for 4.7 million variants, ANNOVAR requires ~4 minutes to perform gene-based functional annotation, or ~15 minutes to perform the stepwise "variants reduction" procedure, making it practical to handle hundreds of human genomes in a day.
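The throughput claim can be checked with some back-of-the-envelope arithmetic. The Python sketch below assumes that one genome corresponds to the 4.7 million variants quoted above, that genomes are processed one after another on a single machine, and that no time is lost between runs; the function names are hypothetical.

```python
VARIANTS_PER_GENOME = 4_700_000   # figure quoted in the text
MINUTES_PER_DAY = 24 * 60


def genomes_per_day(minutes_per_genome: float) -> float:
    """Genomes processed in 24 hours if runs are strictly sequential."""
    return MINUTES_PER_DAY / minutes_per_genome


def variants_per_second(minutes_per_genome: float) -> float:
    """Average annotation rate implied by the quoted run time."""
    return VARIANTS_PER_GENOME / (minutes_per_genome * 60)


gene_based = genomes_per_day(4)    # ~4 minutes per genome
reduction = genomes_per_day(15)    # ~15 minutes per genome

print(f"gene-based annotation: {gene_based:.0f} genomes/day, "
      f"{variants_per_second(4):,.0f} variants/s")
print(f"variants reduction:    {reduction:.0f} genomes/day, "
      f"{variants_per_second(15):,.0f} variants/s")
```

Under these assumptions, gene-based annotation alone works out to 360 genomes per day, and the stepwise reduction procedure to 96 genomes per day.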
BIOBASE is responsible for the exclusive worldwide marketing and distribution of the ANNOVAR tool to commercial users. ANNOVAR will be distributed standalone and as a complement to Genome Trax™, which includes data from HGMD® and TRANSFAC®. With ANNOVAR and Genome Trax™ combined, users can identify and annotate known disease-causing inherited mutations in whole-genome or whole-exome data sets.
: 2015Jan06: I wrote a new article with some thoughts and guidelines on processing VCF files and assigning dbSNP identifiers to variants. You can take a look by clicking the "VCF Processing" menu to the left.
: 2014Dec22: I have created new dbSNP files with left-realigned indels and made them available to ANNOVAR users, including avsnp138 in hg19 coordinates and avsnp142 in hg19 and hg38 coordinates. Please read the "VCF Processing" menu to the left.
: 2014Dec16: An updated 1000g2014oct is available now that addresses some issues with indel mismatch. Please read the "VCF Processing" menu to the left. In addition, esp6500siv2_all, esp6500siv2_aa and esp6500siv2_ea are available to download now, for both hg19 and hg38.
: 2014Nov12: A new ANNOVAR version is available, with significantly reduced memory usage for filter annotation, improved compatibility with unconventional chromosome names for species such as tomato, and a fix for a problem in exon numbering for splice variants on the reverse strand. Registered users will receive an email with a link to download.
: 2014Nov05: Updated refGene, knownGene and ensGene files for hg18/hg19/hg38 are available to download. Users can always build these themselves with some effort, though.
: 2014Nov01: 1000 Genomes Project 2014Oct version is available to download now (use -downdb 1000g2014oct), which now finally includes chrX and chrY markers for all populations and five subpopulations (AFR,AMR,EAS,EUR,SAS).
: 2014Nov01: ExAC 65000 exomes allele frequency data is available to download now (use "-downdb exac02" for version 0.2), which includes all populations and seven subpopulations (AFR,AMR,EAS,FIN,NFE,OTH,SAS).
: 2014Oct02: Clinvar 20140929 (hg19 only) is available to download now.
: 2014Sep25: Updated ljb26 databases from dbNSFP, indexed by ANNOVAR, are available to download now (the previous version dated Sep15 had column heading errors, so if you downloaded the file before 2014Sep25, you need to download it again). 1000Genomes 2014 September version (based on 20130502 alignment phase 3 version 5, with high coverage exome sequencing data) is available to download now. Use '-downdb -build hg19 1000g2014sep' to download it. Currently it has ALL populations and ethnicity-specific files for five sub-populations (AFR,AMR,EAS,EUR,SAS)!
: 2014Sep15: ljb26 databases from dbNSFP, indexed by ANNOVAR, are available to download now. 1000Genomes 2014 August version (with high coverage exome sequencing data) is available to download now. About 36M variants were already in the 2012 version, while 45M are new. Use '-downdb -build hg19 1000g2014aug' to download it. Currently it has ALL populations and ethnicity-specific files for five sub-populations (AFR,AMR,EAS,EUR,SAS)!
: 2014Sep10: Cosmic70 is available to download now. Clinvar 20140902 (both hg19 and hg38) are available to download now.
: 2014Jul22: An updated table_annovar is available that fixes an issue with invalid characters (space, semicolon, equal sign) in INFO field in VCF output files. If you downloaded ANNOVAR between 7/14-7/22, you can click here to download this file only.
: 2014Jul14: A new ANNOVAR version is available with several new functionalities, including the ability to input VCF files and generate annotated VCF files, generate input files for all possible SNVs/indels in a region or a transcript to facilitate back-converting cDNA/protein changes to genomic coordinates, generate UTR cDNA annotation, etc. Registered users should receive an email within a week with an updated link; otherwise you can re-register to get the link immediately.
: 2014Jul12: CLINVAR database (clinvar_20140702) is available to download now in hg19 and hg38 coordinates, with 80491 SNPs and 7686 indels.
: 2014Jul12: Pre-built FASTA files are available for refGene and knownGene in hg38 coordinate. Use '-downdb -webfrom annovar -buildver hg38' to download each.
: 2014Apr30: CLINVAR databases (clinvar_20140303/clinvar_20140211/clinvar_20131105) have minor bug fixes (previous version displays only one annotation when a mutation has multiple significance annotations). Please redownload them.
: 2014Mar10: Per user requests, whole-genome CADD scores that are within 1% highest percentile (3.3GB) or 10% highest scores (33GB) are available to download by keyword caddgt20 and caddgt10, respectively.
: 2014Feb24: ljb23 (version 2.3) database is available to download in ANNOVAR: Compared to version 2, it includes both raw/original scores and converted (0-1 scale, higher scores are more damaging) scores to reduce confusion, and updates scores for some methods. Additionally, we introduce two new scores, MetaSVM and MetaLR, which have the best performance in finding Mendelian disease variants over all other methods we tested (some details here). Use this updated table_annovar.pl to annotate ljb23.
: 2014Feb24: Per user requests, whole-genome CADD database (350GB) is available to download, see instructions here. My test shows that it is 8X faster than tabix on a variant file from exome sequencing. Updated CLINVAR (-dbtype clinvar_20140211) is available to download with 48K variants. dbSNP138 and its NonFlagged versions are available to download. COSMIC68 and COSMIC68WGS databases are available to download now. Large portions of the website tutorial were rewritten to bring them up to date.
: 2013Nov17: COSMIC67 database is available to download. Use "-downdb cosmic67 . -webfrom annovar -build hg19" or "-downdb cosmic67wgs . -webfrom annovar -build hg19"
: 2013Nov11: CLINVAR database is available to download. Use "-downdb clinvar_20131105 . -webfrom annovar -build hg19". Annotations include Variant Clinical Significance (unknown, untested, non-pathogenic, probable-non-pathogenic, probable-pathogenic, pathogenic, drug-response, histocompatibility, other) and Variant disease name.
: 2013Aug23: New ANNOVAR version is available. Registered users will get an email with download links soon. convert2annovar.pl no longer complains when VCF file does not have a valid header, fixed a small bug in convert2annovar.pl to handle certain classes of indels, table_annovar now works on non-human species, minor fix in annovar to handle certain mouse mutations, ccdsGene annotation uses transcript ID as gene name due to lack of gene name in previous versions, implement dup keyword in exonic variant annotation to better conform to HGVS standards. (A bug was identified in convert2annovar when handling multi-sample VCF files with -allsample argument as output files are empty, so this file was replaced on 9/11).
: 2013Jul28: New ANNOVAR version is available. Registered users will get an email with download links soon. The convert2annovar.pl can handle VCF file with many samples now and can address the multiple alternative allele issue appropriately.
: 2013Jul27: NCI-60 exome allele frequency data is available for ANNOVAR users analyzing cancer somatic mutations. Read details here which used ANNOVAR for variant annotation. Use argument "-downdb -buildver hg19 nci60" to download. COSMIC65 is also available for ANNOVAR users to download now.
: 2013Jun21: New ANNOVAR version is available. Registered users will get an email with download links soon. The LJB version 2 databases are now available from ANNOVAR. These include whole-exome SIFT scores, PolyPhen2 HDIV scores, PolyPhen2 HVAR scores, LRT scores, MutationTaster scores, MutationAssessor scores, FATHMM scores, GERP++ scores, PhyloP scores and SiPhy scores.
: 2013May20: COSMIC64 is updated to fix a bug in position for certain indels, use -downdb cosmic64 to download.
: 2013May08: New ANNOVAR version is available. The most important change is the replacement of summarize_annovar by table_annovar (instruction here), which allows better flexibility for users to specify annotation tasks.
: 2013Apr08: COSMIC64 is uploaded, use -downdb cosmic64 to download.
: 2013Mar07: COSMIC63 is uploaded, use -downdb cosmic63 to download. It includes both coding and non-coding variants, and is double the size of version 61.
: 2013Feb21: New ANNOVAR version is available, which fixed a bug that exonic variants at exon end were annotated as splicing when -exonicsplicing is not set. Registered users will get an email notification on Feb 27, as an email server issue has caused this delay. But as usual, whenever you do a "annotate_variation.pl -downdb null ." you will know if new version is available.
: 2013Feb11: New ANNOVAR version is available. Registered users will get an email with download links soon. Changes include: the mitochondrial genome is now supported, the -zerostart argument is no longer supported, better handling of GFF3 files with undefined scores, added -gff3attr argument so that the attribute field from a GFF3 file can be printed in output, changed summarize_annovar.pl to take -alltranscript argument to print out all isoforms for exonic variants, summarize_annovar.pl now takes esp6500si and snp137NonFlagged as databases, exonic variants near an intron/exon boundary are no longer reported as splicing unless -exonicsplicing is set, fixed a minor issue in finding the tar program on BSD-derived operating systems, convert2annovar.pl now handles *.gz files and stdin as input, convert2annovar.pl accepts -comment argument to keep comment lines from a VCF4 file in output.
: 2013Jan24: The updated summarize_annovar.pl can take arguments such as "-verdbsnp 137NonFlagged -veresp 6500si".
: 2013Jan22: The ESP6500si database is updated, to fix a bug in annotating insertions (previously there was a one-bp error in position for insertions when the reference allele is a single base).
: 2013Jan07: The dbSNP version 137 is available from ANNOVAR now! Use keyword snp137 to download and annotate. The COSMIC version 61 is available from ANNOVAR now! It helps cancer researchers identify whether their somatic mutations have been previously observed, how many times they were observed, and in which cancer tissues they were observed. Use keyword cosmic61 to download and annotate by filter-based annotation.
: 2012Nov04: The NHLBI 6500 Exome data sets with indels and chrY calls are available from ANNOVAR now! Use keyword esp6500si_ea, esp6500si_aa and esp6500si_all to download and annotate.
: 2012Oct23: New ANNOVAR version is available. Registered users will get an email with download links. I also updated large portions of the website to provide updated information to ANNOVAR beginners. The major changes include: added -veresp argument to summarize_annovar.pl to support the esp6500 data set, added -aamatrixfile argument to print out amino acid substitution scores such as Grantham scores, changed UCSC download from FTP to HTTP to help users with firewall settings, fixed a problem handling genericdb file when chr prefix is present for chromosomes, fixed a problem downloading index for gerp++gt2 files, added variants_reduction.pl program. Updated Oct25: the previous program cannot handle -veresp argument correctly, please download again from the same URL link. Update Nov01: I updated summarize_annovar.pl to take -alltranscript argument to print out all isoforms for exonic variants and to fix slight problems in variants_reduction.pl. Please download again with the same URL link.
: 2012Jun24: The NHLBI 6500 Exome data sets are re-uploaded, as the previous version (2012Jun21) had only chr22 data. Please download again.
: 2012Jun21: The NHLBI 6500 Exome data sets are available to download now. Use commands like "annotate_variation.pl -downdb esp6500_ea humandb -webfrom annovar -buildver hg19". You can change hg19 to hg18 or change "ea" to "aa" or "all". The whole-genome GERP++ scores are available to download now but I only include those with RS>=2! Use commands like "annotate_variation.pl -downdb gerp++gt2 humandb/ -webfrom annovar -buildver hg19" to download and use "annotate_variation.pl -filter inputfile humandb/ -dbtype gerp++gt2 -buildver hg19" to annotate your inputfile. See download page.
: 2012Jun21: A slight bug fix to convert2annovar.pl is available to download.
: 2012May25: The 1000 Genomes Project 2012 April data sets are available to download (this is based on phase 1 release v3 called from 20101123 alignment). The populations include ALL, AMR, AFR, ASN and EUR. Use the latest version of ANNOVAR and "-downdb 1000g2012apr" to download and "-filter -dbtype 1000g2012apr_eur" and so on to annotate. Additionally, 9 NonFlagged dbSNP data sets are available to download. See download page for details.
: 2012May25: A new version of ANNOVAR is available. Existing users will receive an email with link to download. The -seq_padding argument and -indel_splicing_threshold arguments were added, and a bug to report beginning/end of transcript as splicing variants was fixed, thanks to Jamie Teer @ NIH. The dbtype of 1000g2012apr is now supported with five populations (based on files from here), thanks to Mehdi Pirooznia @ Hopkins.
: 2012Apr17: New mRNA FASTA files were uploaded for hg18 and hg19 (refseq, knowngene, ensgene), given recent updates in gene annotations. Users can always generate the latest files themselves using retrieve_seq_from_fasta.pl. Updated hg18/hg19 SNP130/131/132/135 index files are uploaded, as the previous version had a minor issue that may miss a tiny fraction of SNPs during filter-based operation.
: 2012Mar08: New ANNOVAR is available with minor feature enhancements. The variation database 1000g2012feb is now available for ANNOVAR users (for 1000 Genomes Project Feb 2012 variant call release, with 38 million SNPs and 3.8 million indels).
: 2012Feb23: New ANNOVAR is available with cumulative bug fixes and many function enhancements. All indexes for ANNOVAR annotation databases have been updated to further improve speed for whole-exome sequencing data, see here for details. New summarize_annovar generates more informative results.
: 2011Dec20: Whole-exome GERP++ scores can be downloaded and annotated by ANNOVAR now for both hg18/hg19. Additionally, allele frequency data for the 5400 exomes from NHLBI (for European Americans, African Americans and all ethnicity) can be downloaded and annotated by ANNOVAR now for both hg18/hg19.
: 2011Dec20: A new generation of variants annotator called ANNOVAR++ is being developed and will be tested by certain avid users. Most known limitations in ANNOVAR will be solved by using this fundamentally new framework for annotation. Users will be able to specify their own customized workflow (summarize_annovar, auto_annovar, index_annovar, etc) in the future.
: 2011Nov20: New version of ANNOVAR is released. Major changes include: mRNA FASTA sequences without complete ORF annotation will no longer be used in exonic annotation, retrieve_seq_from_fasta.pl now reports transcripts whose ORFs have a premature stop codon, fixed the hg18_cg69 and hg19_cg69 allele frequency error and others. See the download page.
: 2011Oct02: The last version of ANNOVAR introduced some bugs related to ncRNA annotation, which subsequently affect exonic/splicing annotation. An updated version is released. Please report bugs to me if you still see problems.
: 2011Sep11: New version of ANNOVAR is released with significant speedup of filter operation for certain databases (dbSNP, SIFT, PolyPhen, 1000G, etc), thanks to Ion Flux for the speed improvements. In the previous version of ANNOVAR, filter-based annotation for ex1.human (12 variants) required ~10 minutes for snp132, sift or polyphen. In the new version, it takes 1 second only! Performance improvements for larger query files will be less apparent. To use the new version, it is necessary to re-download the databases by -downdb. See details here. (Updated 2011Sep14: A user reported that the previously uploaded program could not download index files correctly; this has been fixed. Please download the annovar program again).
: 2011Jun18: New Version of ANNOVAR is released with some function enhancements. New mRNA FASTA files were uploaded for hg18 and hg19 (refseq, knowngene, ensgene), given recent update in gene annotations.
: 2011Jun18: The 1000g2010nov file was updated to include indel calls. Now it has 26.1 million SNPs (released by 1000G in Nov 2010 based on Aug 2010 alignments) and 3.7 million indels (released by 1000G in Feb 2011 based on Aug 2010 alignments). A new 1000g2011may file was provided with 39 million SNPs. Read details here.
: 2011May06: New version of ANNOVAR is released with minor bug fixes and feature enhancements. Whole-exome pre-computed PolyPhen v2, MutationTaster, LRT, PhyloP scores are available as ANNOVAR annotation database to give more detailed annotation of non-synonymous mutations in humans, in addition to SIFT. Use "-downdb ljb_pp2 -webfrom annovar", "-downdb ljb_lrt -webfrom annovar", "-downdb ljb_mt -webfrom annovar", "-downdb ljb_phylop -webfrom annovar" to download them. Add "-buildver hg19" to download them in hg19 coordinate. The annotation database ljb refers to Liu, Jian, Boerwinkle paper in Human Mutation with pubmed ID 21520341. Cite this paper if you use the scores; higher scores (0-1) represent functionally more deleterious predictions. (2011May11: There is a bug in the hg18_lrt_pp2 file which has been fixed now; if you download before this date, please download file again. Please report other bugs).
: 2011May03: Forty-six whole genomes (variant calls and allele frequency information) from Complete Genomics are now available as an ANNOVAR annotation database. Users need to use "-downdb cg46 -webfrom annovar" (with either '-buildver hg18' or '-buildver hg19') to download the file. For filter-based annotation, use "-dbtype generic -genericdbfile hg18_cg46.txt" for annotation. The -score_threshold argument can be used to apply a MAF threshold.
: 2011Apr18: New mRNA FASTA files were uploaded for hg18 and hg19 (refseq, knowngene, ensgene), given recent updates in gene annotations. Users can always generate the latest files themselves using retrieve_seq_from_fasta.pl.
: 2011Mar25: dbSNP version 132 in hg19 coordinates, with >30 million SNPs (more than double the size of dbSNP131), is available. Download the files from the download page, or use "-downdb -webfrom annovar" in ANNOVAR to download directly (as the file is from ANNOVAR not UCSC).
: 2011Mar18: dbSNP version 131 and 132 in hg18 coordinates! There is a huge community demand to have the latest dbSNP in hg18 (NCBI 36), but unfortunately dbSNP elected to work on hg19 only. Dr. Leparc lifted over the latest dbSNP files and provided the dbSNP131 and dbSNP132 files in hg18 coordinates for use in ANNOVAR. Download the files from the download page, or use "-downdb -webfrom annovar" in ANNOVAR to download directly (-webfrom is required as the file is from ANNOVAR website).
: 2011Mar01: Small update to AVSIFT database based on updated annotations at http://sift-dna.org/.
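Several of the news entries above quote download and filter commands of the same shape. As a minimal sketch, the Python helpers below assemble those two command lines from their parts, using only the arguments that appear in the 2012Jun21 entry (-downdb, -webfrom, -buildver, -filter, -dbtype). The helper names `downdb_command` and `filter_command` and their default values are hypothetical, and the sketch only prints the commands rather than running them.

```python
import shlex
from typing import List


def downdb_command(dbname: str, db_dir: str = "humandb/",
                   buildver: str = "hg19", webfrom: str = "annovar") -> List[str]:
    """Build the argument list for downloading a database (not executed here)."""
    return ["annotate_variation.pl", "-downdb", dbname, db_dir,
            "-webfrom", webfrom, "-buildver", buildver]


def filter_command(inputfile: str, dbtype: str, db_dir: str = "humandb/",
                   buildver: str = "hg19") -> List[str]:
    """Build the argument list for filter-based annotation (not executed here)."""
    return ["annotate_variation.pl", "-filter", inputfile, db_dir,
            "-dbtype", dbtype, "-buildver", buildver]


def as_shell(args: List[str]) -> str:
    """Render an argument list as a shell-safe command string."""
    return " ".join(shlex.quote(arg) for arg in args)


download = downdb_command("gerp++gt2")
annotate = filter_command("inputfile", "gerp++gt2")

print(as_shell(download))
print(as_shell(annotate))
```

Building the command as a list of arguments, rather than a single string, keeps file names with spaces intact if the list is later passed to something like `subprocess.run`.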
If you have questions, comments or concerns, contact