Preparation of local annotation databases

  1. Download gene annotation databases
  2. Download region annotation databases from UCSC
  3. Download additional region annotation databases
  4. Download 1000 Genomes Project, dbSNP, SIFT, PolyPhen, MutationTaster and other variant annotation databases
  5. Download Complete Genomics variation database
  6. Users can specify additional filter-based annotation databases

ANNOVAR requires "annotation databases" saved on local disk for annotating genetic variants. The -downdb argument can be used to download the required annotation databases automatically from the UCSC Genome Browser or the ANNOVAR database repository, assuming that the computer is connected to the Internet. Several different types of annotation databases can be downloaded with the command below:

annotate_variation.pl -downdb [optional arguments] <table-name> <output-directory-name>

1. Download gene annotation databases

The command below downloads gene-based annotation databases for the human genome and saves these files to a local directory called humandb/. The keyword "refGene" tells the program that RefSeq gene-related annotations need to be downloaded. (Note that the ANNOVAR package already includes a RefSeq annotation database for the human hg18 genome build to help users get started.)

[kai@beta ~/]$ annotate_variation.pl -downdb -buildver hg18 -webfrom annovar refGene humandb
NOTICE: Downloading annotation database http://www.openbioinformatics.org/annovar/download/hg18_refGene.txt.gz ... OK
NOTICE: Downloading annotation database http://www.openbioinformatics.org/annovar/download/hg18_refLink.txt.gz ... OK
NOTICE: Downloading annotation database http://www.openbioinformatics.org/annovar/download/hg18_refGeneMrna.fa.gz ... OK
NOTICE: Uncompressing downloaded files
NOTICE: Finished downloading annotation files for hg18 build version, with files saved at the 'humandb' directory

Now check the humandb/ directory. A few files will be saved in this directory, namely hg18_refGene.txt, hg18_refLink.txt and hg18_refGeneMrna.fa. Note that this procedure may not work in Windows (because the files must be unpacked by GUNZIP), so Windows users may need to unpack and rename the files themselves (if this is the case, the program will print out instructions). If the program seems to run forever at the "downloading" stage, stop it, add the --verbose argument and run it again; this helps determine whether the network is too slow for downloading.
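As a quick way to confirm that the download step finished, the three file names listed above can be checked in a small loop. This is only a sketch: the function name check_refgene_db is made up for illustration and is not part of ANNOVAR, and it only tests that each file exists and is non-empty.

```shell
# Report which of the expected RefSeq files are missing from a database
# directory. Exit status is 0 when all three are present, 1 otherwise.
# check_refgene_db is a hypothetical helper, not an ANNOVAR command.
check_refgene_db() (
    build=$1
    dbdir=$2
    status=0
    for suffix in refGene.txt refLink.txt refGeneMrna.fa; do
        path="$dbdir/${build}_$suffix"
        if [ ! -s "$path" ]; then
            printf 'missing or empty: %s\n' "$path"
            status=1
        fi
    done
    exit "$status"
)

# Example (run after the download above):
#   check_refgene_db hg18 humandb && echo "gene annotation files look complete"
```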

Note that by default the "--buildver hg18" argument is in effect, so the human genome is assumed (hg18 is the UCSC annotation for NCBI human genome build 36). To annotate against another genome build, change the -buildver argument.

Besides RefSeq genes, several other gene annotations can be downloaded and used in gene-based annotation (see the table below). For example, the following commands will download the knownGene and ensGene gene annotations.

[kai@beta ~]$ annotate_variation.pl -downdb -buildver hg18 -webfrom annovar knownGene humandb
NOTICE: Downloading annotation database http://www.openbioinformatics.org/annovar/download/hg18_knownGene.txt.gz ... OK
NOTICE: Downloading annotation database http://www.openbioinformatics.org/annovar/download/hg18_kgXref.txt.gz ... OK
NOTICE: Downloading annotation database http://www.openbioinformatics.org/annovar/download/hg18_knownGeneMrna.fa.gz ... OK
NOTICE: Uncompressing downloaded files
NOTICE: Finished downloading annotation files for hg18 build version, with files saved at the 'humandb' directory

[kai@beta ~]$ annotate_variation.pl -downdb -buildver hg18 -webfrom annovar ensGene humandb
NOTICE: Downloading annotation database http://www.openbioinformatics.org/annovar/download/hg18_ensGene.txt.gz ... OK
NOTICE: Downloading annotation database http://www.openbioinformatics.org/annovar/download/hg18_ensGeneMrna.fa.gz ... OK
NOTICE: Uncompressing downloaded files
NOTICE: Finished downloading annotation files for hg18 build version, with files saved at the 'humandb' directory

Notes: In versions of ANNOVAR before May 2013, the gene definitions were downloaded from the UCSC Genome Browser by default, yet the mRNA files were downloaded from the ANNOVAR website. This created a problem because the two sites may not be well synchronized. After May 2013, users should add "-webfrom annovar" to the command line, so that the gene definitions are downloaded from ANNOVAR to ensure consistency with the mRNA files. Users can still omit "-webfrom annovar" to download the latest files from the UCSC Genome Browser, and then use retrieve_seq_from_fasta.pl in the ANNOVAR package to generate the most up-to-date FASTA files.
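To keep all three pre-built gene tables consistent, it can help to generate the download commands in one place. The sketch below only prints the commands (it does not run them), using the same arguments as the transcripts above. The function name and the default values are illustrative assumptions.

```shell
# Print (do not run) one download command per pre-built gene table, always
# with "-webfrom annovar" so gene definitions and mRNA files come from the
# same site. print_gene_downloads is a hypothetical helper name.
print_gene_downloads() {
    build=${1:-hg18}
    dbdir=${2:-humandb}
    for table in refGene knownGene ensGene; do
        printf 'annotate_variation.pl -downdb -buildver %s -webfrom annovar %s %s\n' \
            "$build" "$table" "$dbdir"
    done
}

print_gene_downloads hg18 humandb
```

Once the printed commands look right, they can be executed by piping the output to sh.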

Several commonly used table names are described below. ANNOVAR developers only provide pre-built download versions for refGene, knownGene and ensGene.

table-name in command line | UCSC table name | Explanation
refGene | refGene, refLink, refMrna (FASTA sequences may be provided by the ANNOVAR developer, or can be generated by the user) | RefSeq transcript annotations and mRNA sequences in FASTA format
knownGene | knownGene, kgXref, knownGeneMrna (FASTA sequences may be provided by the ANNOVAR developer, or can be generated by the user) | UCSC Gene annotation (more comprehensive than RefSeq annotation) and mRNA sequences in FASTA format
ensGene | ensGene, ensGeneMrna (FASTA sequences may be provided by the ANNOVAR developer, or can be generated by the user) | Ensembl Gene annotation (more comprehensive than RefSeq annotation) and mRNA sequences in FASTA format
wgEncodeGencodeManualV3 | wgEncodeGencodeManualV3 | GENCODE manual annotation for hg18
wgEncodeGencodeManualV4 | wgEncodeGencodeManualV4 | GENCODE manual annotation for hg19
wgEncodeGencodeAutoV3, wgEncodeGencodePolyaV3 | wgEncodeGencodeAutoV3, wgEncodeGencodePolyaV3 | other GENCODE annotations
(other) | check the UCSC Table Browser for other gene definition/prediction systems | other gene definitions/predictions

Note that as of August 2010, all the FASTA sequences can be provided by the ANNOVAR developer or generated by the user. The mRNA/cDNA sequences (as refSeqMrna.fa files) at the UCSC database may NOT match the true genomic sequence for a given reference genome, because they are "observed" sequences from some random individual in some random population by some random sequencing runs, rather than the "theoretical" sequences based on genomic DNA that match exactly to the specific genome build of interest. For human (hg18, hg19), I provide pre-built FASTA sequences that users can download with the command above. The pre-built FASTA sequences may not be up to date, so some users may want to build the FASTA sequences themselves. For other organisms, users need to build the FASTA sequences themselves. The basic concept is simple: given the whole-genome DNA sequence and a gene definition file, the retrieve_seq_from_fasta.pl program can automatically generate the correct mRNA sequences for all these genes.

To understand this more, try to handle the chimp genome:

[kai@beta ~/]$ annotate_variation.pl -downdb -buildver panTro2 gene chimpdb
NOTICE: Downloading annotation database ftp://hgdownload.cse.ucsc.edu/goldenPath/panTro2/database/refGene.txt.gz ... OK
NOTICE: Downloading annotation database ftp://hgdownload.cse.ucsc.edu/goldenPath/panTro2/database/refLink.txt.gz ... OK
NOTICE: Downloading annotation database http://www.openbioinformatics.org/annovar/download/panTro2_refGeneMrna.fa.gz ... Failed
NOTICE: Uncompressing downloaded files
NOTICE: Finished downloading annotation files for panTro2 build version, with files saved at the 'chimpdb' directory
WARNING: Some files cannot be downloaded, including http://www.openbioinformatics.org/annovar/download/panTro2_refGeneMrna.fa.gz
--------------------------------IMPORTANT---------------------------------
--------------------------------------------------------------------------
NOTICE: the FASTA file http://www.openbioinformatics.org/annovar/download/panTro2_refGeneMrna.fa.gz is not available to download but can be generated by the ANNOVAR software. PLEASE RUN THE FOLLOWING TWO COMMANDS CONSECUTIVELY TO GENERATE THE FASTA FILES:
annotate_variation.pl --buildver panTro2 --downdb seq chimpdb/panTro2_seq
retrieve_seq_from_fasta.pl chimpdb/panTro2_refGene.txt -seqdir chimpdb/panTro2_seq -format refGene -outfile chimpdb/panTro2_refGeneMrna.fa
--------------------------------------------------------------------------
--------------------------------------------------------------------------

The above command will run but will print a warning message: the FASTA sequences are not provided on the ANNOVAR website, so users need to build them. Just follow the exact instructions and run the two commands:

[kai@beta ~/]$ annotate_variation.pl --buildver panTro2 --downdb seq chimpdb/panTro2_seq
NOTICE: Downloading annotation database ftp://hgdownload.cse.ucsc.edu/goldenPath/panTro2/bigZips/chromFa.zip ... Failed
NOTICE: Downloading annotation database ftp://hgdownload.cse.ucsc.edu/goldenPath/panTro2/bigZips/chromFa.tar.gz ... OK
NOTICE: Uncompressing downloaded files
NOTICE: Finished downloading annotation files for panTro2 build version, with files saved at the 'chimpdb/panTro2_seq' directory

[kai@beta ~/]$ retrieve_seq_from_fasta.pl chimpdb/panTro2_refGene.txt -seqdir chimpdb/panTro2_seq -format refGene -outfile chimpdb/panTro2_refGeneMrna.fa
NOTICE: Finished reading 1 sequences from chimpdb/panTro2_seq/12/chr12_random.fa
NOTICE: Finished reading 1 sequences from chimpdb/panTro2_seq/22/chr22.fa
NOTICE: Finished reading 1 sequences from chimpdb/panTro2_seq/14/chr14.fa
......
......
NOTICE: Finished writting FASTA for 1337 genomic regions to chimpdb/panTro2_refGeneMrna.fa.

So after running the above commands, the gene annotation database for the chimp genome would be complete, accurate and most up-to-date.
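The same two-step recipe applies to other builds once the build name and database directory are substituted. The sketch below prints the two commands in the form ANNOVAR itself suggests in the warning above. Nothing is executed, and the function name is a made-up illustration.

```shell
# Print the two commands for building <build>_refGeneMrna.fa, following the
# pattern in ANNOVAR's own warning message. Nothing is executed here.
# print_fasta_build_cmds is a hypothetical helper name.
print_fasta_build_cmds() {
    build=$1
    dbdir=$2
    printf 'annotate_variation.pl --buildver %s --downdb seq %s/%s_seq\n' \
        "$build" "$dbdir" "$build"
    printf 'retrieve_seq_from_fasta.pl %s/%s_refGene.txt -seqdir %s/%s_seq -format refGene -outfile %s/%s_refGeneMrna.fa\n' \
        "$dbdir" "$build" "$dbdir" "$build" "$dbdir" "$build"
}

print_fasta_build_cmds panTro2 chimpdb
```

Whether the -seqdir form is appropriate depends on how UCSC packages the genome, as the exercises below illustrate.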

Exercise: Try to run the same procedure described above for rheMac2 (Macaque), and see how this differs from panTro2. UCSC did not use the same file naming convention or directory structure for different genomes, which makes the life of programmers more complicated. ANNOVAR can handle many genomes, but there may be a genome for which ANNOVAR cannot retrieve sequences automatically; if that is the case, please report it to me and I will investigate and add the functionality.

Exercise: Try to run the same procedure above for sacCer2 (yeast) and see how this differs.

Exercise: Try to run the same procedure above for bosTau6 (cow). Note that as of April 2012, UCSC has not split the FASTA file for bosTau6 genome sequence into individual chromosomes. Therefore, users need to use "-seqfile bosTau6.fa", rather than "-seqdir cowdb/bosTau6_seq", in the retrieve_seq_from_fasta.pl command.

Exercise: Try to run the same procedure above for rn5 (rat). Again, users need to supply a FASTA file rather than a FASTA directory.
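Because some genomes come as one FASTA file and others as a directory of per-chromosome files, a wrapper script can pick the matching argument automatically. This is a minimal sketch under that assumption; seq_option is a hypothetical helper, and only the -seqdir and -seqfile arguments mentioned above are used.

```shell
# Emit "-seqdir PATH" for a directory and "-seqfile PATH" for a regular file,
# so the result can be spliced into a retrieve_seq_from_fasta.pl command.
# seq_option is a hypothetical helper name.
seq_option() {
    if [ -d "$1" ]; then
        printf '%s %s\n' -seqdir "$1"
    elif [ -f "$1" ]; then
        printf '%s %s\n' -seqfile "$1"
    else
        printf 'not a file or directory: %s\n' "$1" >&2
        return 1
    fi
}

# Example:
#   seq_option cowdb/bosTau6.fa      # prints: -seqfile cowdb/bosTau6.fa
#   seq_option chimpdb/panTro2_seq   # prints: -seqdir chimpdb/panTro2_seq
```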

The above procedure will only work if the gene-based annotations exist in UCSC for the particular species or the particular build. For example, if you want to use ANNOVAR on pigs, since RefSeq gene and UCSC Gene are not available for pigs, you have to use "annotate_variation.pl --downdb -buildver susScr2 ensgene pigdb" instead and use "-dbtype ensgene" for the gene-based annotation.

When running gene-based annotation, ANNOVAR may complain that "WARNING: A total of 99 sequences cannot be found in hg19_refGeneMrna.fa (example: NM_001195278 NM_001195252 NM_001194947)". This usually means that the sequence database (FASTA file) is outdated, because the gene annotation is updated on a weekly basis. Try to generate a new FASTA file for all genes yourself for hg19, using these two commands: "annotate_variation.pl --downdb seq humandb/hg19_seq/ -build hg19; retrieve_seq_from_fasta.pl humandb/hg19_refGene.txt -seqdir humandb/hg19_seq/ -format refGene -outfile humandb/hg19_refGeneMrna.fa ".
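Before regenerating the FASTA file, it may be useful to see exactly which transcripts lack a sequence. The sketch below compares the two files directly. It assumes that the transcript ID is the second tab-delimited column of the gene table (as in UCSC's refGene layout) and that each FASTA header starts with the transcript ID, optionally followed by a space or "#" and further text; both are assumptions about the file layout rather than documented guarantees.

```shell
# List transcript IDs present in a refGene-style table but absent from the
# FASTA file. missing_transcripts is a hypothetical helper name.
# Assumptions: ID is column 2 of the table; FASTA headers begin ">ID".
missing_transcripts() (
    genefile=$1
    fasta=$2
    workdir=$(mktemp -d) || exit 1
    cut -f 2 "$genefile" | sort -u > "$workdir/ids"
    awk '/^>/ { id = substr($1, 2); sub(/#.*/, "", id); print id }' "$fasta" |
        sort -u > "$workdir/seqs"
    comm -23 "$workdir/ids" "$workdir/seqs"    # lines only in the gene table
    rm -rf "$workdir"
)

# Example:
#   missing_transcripts humandb/hg19_refGene.txt humandb/hg19_refGeneMrna.fa | head
```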

 

2. Download region annotation databases from UCSC

Besides gene-based annotations, several common table-names for region-based annotations are summarized below. Thousands of additional databases are available from UCSC; the ones below are just examples that users can test with.

UCSC table name (a few examples are shown below) | Explanation
cytoBand | the approximate location of bands seen on Giemsa-stained chromosomes
tfbsConsSites | transcription factor binding sites conserved in the human/mouse/rat alignment, based on transfac Matrix Database (v7.0)
wgRna | snoRNA and miRNA annotations
targetScanS | TargetScan generated miRNA target site predictions
genomicSuperDups | segmental duplications in the genome
phastConsElements*way | conserved elements produced by the phastCons program based on a whole-genome alignment of vertebrates. Depending on the species used, it could be 17way, 28way, 30way, 44way, etc., so users have to specify the *way in the command line argument. For the human genome hg18 build, the recommended value is mce28way. If the user wants to limit the conservation measures to mammals, then the complete table name must be specified, for example phastConsElements28wayPlacMammal, phastConsElements44wayPrimates, phastConsElements44wayPlacental, etc. ANNOVAR will attempt to download these tables directly (see the "other" annotation type below).
evofold | conserved functional RNA, through RNA secondary structure predictions made with the EvoFold program
dgv | Database of Genomic Variants, which contains annotations for reported structural variations
omimGene | canonical UCSC genes that have been associated with identifiers in the Online Mendelian Inheritance in Man (OMIM) database. As advised by UCSC, the results "should be treated with skepticism and any conclusions based on them should be carefully scrutinized using independent resources", including manual inspection of primary literature. This table is no longer available; read the FAQ for more information.
gwasCatalog | published GWAS results on diverse human diseases
(other) | all other databases, using the URL ftp://hgdownload.cse.ucsc.edu/goldenPath/<build-version>/database/<table-name>.txt.gz, where <build-version> and <table-name> are specified by the user. If the table does not exist in the UCSC databases, an error will be thrown by the program.

The users can use the following example commands to download these databases:

annotate_variation.pl -downdb -buildver hg18 cytoBand humandb/
annotate_variation.pl -downdb -buildver hg18 tfbsConsSites humandb/
annotate_variation.pl -downdb -buildver hg18 phastConsElements28way humandb/
annotate_variation.pl -downdb -buildver hg18 genomicSuperDups humandb/
annotate_variation.pl -downdb -buildver hg18 wgRna humandb/
annotate_variation.pl -downdb -buildver hg18 evofold humandb/
annotate_variation.pl -downdb -buildver hg18 dgv humandb/
annotate_variation.pl -downdb -buildver hg18 gwasCatalog humandb/
annotate_variation.pl -downdb -buildver hg18 phastConsElements28wayPlacMammal humandb/

Several new files will be generated in the humandb/ directory. The file names are mostly self-evident.
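For tables handled through the "(other)" row, the download location follows the URL pattern given in the table, so it is easy to check by hand whether a table exists before asking ANNOVAR to fetch it. The sketch below builds that URL, plus the local file name one would expect. The local naming scheme (<build>_<table>.txt) is an assumption inferred from names such as hg18_refGene.txt, and both function names are hypothetical.

```shell
# URL that an "(other)" UCSC table is fetched from, per the pattern above.
ucsc_table_url() {
    printf 'ftp://hgdownload.cse.ucsc.edu/goldenPath/%s/database/%s.txt.gz\n' "$1" "$2"
}

# Assumption: the unpacked file is saved as <dbdir>/<build>_<table>.txt,
# mirroring names such as humandb/hg18_refGene.txt seen earlier.
local_table_path() {
    printf '%s/%s_%s.txt\n' "${3%/}" "$1" "$2"
}

ucsc_table_url hg18 cytoBand
local_table_path hg18 cytoBand humandb/
```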

3. Download additional region databases outside of UCSC genome browser

In addition to the UCSC Genome Browser, many other users or research groups have compiled their own genome annotations based on their own algorithms or data (for example, ChIP-Seq peak regions). Some of them made the data publicly available, and these data resources can be interrogated by ANNOVAR as well.

Per users' request, I have added a few keywords to -downdb and provided several databases that may appeal to more users. These databases are downloaded from ANNOVAR's website. For example, if you are interested in scanning variants against conserved regions/elements identified by GERP++ (rather than the phastCons elements in UCSC), you can use:

[kaiwang@biocluster ~]$ annotate_variation.pl -downdb gerp++elem . -webfrom annovar

[kaiwang@biocluster ~]$ annotate_variation.pl ex1.human humandb/ -regionanno -dbtype gerp++elem
NOTICE: The --buildver is set as 'hg18' by default
NOTICE: Reading annotation database humandb/hg18_gerp++elem.txt ... Done with 1354034 regions
NOTICE: Finished region-based annotation on 12 genetic variants in ex1.human
NOTICE: Output files were written to ex1.human.hg18_gerp++elem

Note that by default the RS score is printed out. If the P-value is desired, use the -colsWanted argument and set it to 4 (column numbers run from 0 to 4).

[kaiwang@biocluster ~/]$ annotate_variation.pl ex1.human humandb/ -regionanno -dbtype gerp++elem -colsWanted 4
NOTICE: The --buildver is set as 'hg18' by default
NOTICE: Reading annotation database humandb/hg18_gerp++elem.txt ... Done with 1354034 regions
NOTICE: Finished region-based annotation on 12 genetic variants in ex1.human
NOTICE: Output files were written to ex1.human.hg18_gerp++elem

You can do the same thing using "-buildver" of "hg19" or "mm9". Note that GERP++elem is a region-based database containing conserved genomic elements.

In the future, many more region annotation databases will be provided to ANNOVAR users. If you have a database that you want to share with others, please drop me an email and I will be glad to host it.

 

 

4. Download 1000 Genomes Project, dbSNP, SIFT, PolyPhen, MutationTaster and other filter-based annotation databases

Currently ANNOVAR can utilize variant information from the 1000 Genomes Project, dbSNP or other databases. The table below lists some of the databases provided by ANNOVAR developers, but it is not a comprehensive list. For the most comprehensive list, go to the download page of the website.

table-name in command line Data set Name buildver Explanation
1000g 1000 Genomes Project Pilot Data SNPs (2009 April call set) hg18 1000 Genomes Project Pilot 1 allele frequency data on the CEU, YRI and JPTCHB populations, updated in April 2009. Files are in simple tab-delimited text file format. Three files will be downloaded for 3 ethnicity groups. Therefore, when analyzing the data, users need to use "1000g_ceu", "1000g_yri" or "1000g_jptchb" as -dbtype to specify the actual population to be scanned.
1000g2010 1000 Genomes Project Pilot Data SNPs + indels (2010 March call set) hg18 1000 Genomes Project Pilot 1 allele frequency data on the CEU, YRI and JPTCHB populations, updated in March 2010, including the indel frequencies. I reformatted this file to be consistent with the April 2009 release, and it will be downloaded from the ANNOVAR website, not the 1000G website!
1000g2010jul 1000 Genomes Project Pilot data 2010 July release hg18 Similar to above, but the Pilot release data is July 2010.
...... ......   I will try to keep up with the updates from the 1000G website, but it seems that they will no longer release PILOT project data. ANY FUTURE DATA WILL LIKELY BE IN hg19 COORDINATES FROM PHASE 1, 2, etc. Therefore, 1000g2010jul might be the last hg18 data that ANNOVAR users can use, and they are unfortunately limited to CEU+YRI+CHB+JPT only.
1000g2010nov 1000 Genomes Project PHASE 1 2010 November release hg19, indel added June 2011

THIS IS NOT THE PILOT PROJECT. THIS IS THE FULL PHASE 1 PROJECT WITH 629 SUBJECTS FROM DIVERSE POPULATIONS USING August 2010 ALIGNMENTS. Use "-buildver hg19 1000g2010nov" to download this database. When analyzing the data, users need to use "1000g2010nov_all" as -dbtype to specify the population to be scanned (it is not possible to scan CEU or YRI only).

The variant calls were released in November 2010, hence the name of the db in ANNOVAR. The variants were based on alignment indexes generated in August 2010, and subsequently these files were all moved to the August directory on the 1000G website (read their README file here).

In February 2011, indel calls from the same alignment (August 2010) were released by 1000G, and subsequently these calls were deposited into the 2010_08/ directory on the 1000G website. For whatever it is worth, I decided to get the indel calls and append them to the hg19_ALL.2010_11.txt file provided in the ANNOVAR package. This file was updated in June 2011, so if you downloaded it before this date, it is best to download it again to take advantage of the indel calls.

1000g2011may 1000 Genomes Project 2011 May release hg19, no indel yet

THIS IS PHASE 1 LOW-COVERAGE DATA ON 1094 SUBJECTS USING NOVEMBER 2010 ALIGNMENTS. I generated this data set using this file. Based on their README file, there are 38.88M sites, including 30.36M novel sites.

The raw data are available in the 20101123/ directory in 1000G, but it really has absolutely nothing to do with the November 2010 variant call release above. That call set was in the 2010_11/ directory in 1000G!!! But if you read the README file in the 2010_11/ directory, you'll know that they decided to move the whole content of 2010_11/ to the 2010_08/ directory instead (read the row above for 1000g2010nov for more explanation). I know it is confusing, but this is life and you'll just have to live with it.

1000g2012feb 1000 Genomes Project 2012 Feb release hg19, 38M SNPs, 3.8M indels variant calls from 1092 samples for SNPs, short indels. I generated this data set using this file. It is located in the 20110521/ directory in 1000G FTP site, but do not confuse it with the 2011may database in ANNOVAR.
1000g20**** 1000 Genomes Project hg19 see download page for the most updated list of 1000G data sets
snp132 snp132 hg18/hg19 dbSNP version 132 (hg19, or use "-buildver hg18 -webfrom annovar" to download hg18 version)
snp131 snp131 hg18/hg19 dbSNP version 131 (hg19, or use "-buildver hg18 -webfrom annovar" to download hg18 version)
snp130 snp130 hg18/hg19 dbSNP version 130 (available for both hg18 and hg19. use "-buildver hg18" or "-buildver hg19 -webfrom annovar " to choose the genome build)
snp129 snp129 hg18 dbSNP version 129 (the most popular dbSNP build used in many sequencing papers, hg18 ONLY)
snp125 snp125 hg17 dbSNP version 125 (the last hg17/NCBI35-based dbSNP build)
snp**** snp**** hg19 see download page for the most updated list of dbSNP data sets
avsift ANNOVAR-SIFT database hg18/hg19

Based on the SQLite database provided by SIFT developers (http://sift-dna.org/), I generated ANNOVAR-specific files called hg18_avsift.txt and hg19_avsift.txt with eight fields per line (chr, start, end, reference allele, observed allele, SIFT score, reference amino acid, observed amino acids). Sometimes one mutation could cause changes in multiple peptides (transcriptional isoforms), and the SIFT score for each peptide may be different; in this case, the lowest score is used. For example, hg18 coordinate chr22:14646995 G->C has a SIFT score of 0.94 for ENSP00000347095, 0.08 for ENSP00000348134 and 0.8 for ENSP00000352595. ANNOVAR uses 0.08 for annotation.

November 2010 change: (1) hg19 support is added (2) deleted a few rows in hg18 when reference allele is identical to observed allele

March 2011 change: updated SIFT scores and added chrY scores

ljb_pp2 PolyPhen2 hg18/hg19 use "-downdb ljb_pp2 -webfrom annovar" to download the database. Read this paper for details and cite it if you use the database.
ljb_sift LJBSIFT hg18/hg19 I call this score as LJBSIFT, because it is calculated as 1-SIFT. The higher the score, the more important the mutation. use "-downdb ljb_sift -webfrom annovar" to download the database. Read this paper for details and cite it if you use the database.
ljb_mt MutationTaster hg18/hg19 use "-downdb ljb_mt -webfrom annovar" to download the database. Read this paper for details and cite it if you use the database.
ljb_phylop PhyloP conservation score hg18/hg19 use "-downdb ljb_phylop -webfrom annovar" to download the database. Read this paper for details and cite it if you use the database.
ljb_lrt LRT hg18/hg19 use "-downdb ljb_lrt -webfrom annovar" to download the database. Read this paper for details and cite it if you use the database.
ljb_gerp++ GERP++ score for exonic variants hg18/hg19 use "-downdb ljb_gerp++ -webfrom annovar" to download the database. Read this paper for details and cite it if you use the database.
ljb_all all scores above from LJB database hg18/hg19 use "-downdb ljb_all -webfrom annovar" to download the database. Read this paper for details and cite it if you use the database. The "-otherinfo" argument needs to be specified during annotation for the information to be printed out in the output file.
esp5400_ea 5400 NHLBI exomes (European Americans) hg18/hg19 use "-downdb esp5400_ea -webfrom annovar" to download the database. Use -dbtype generic for annotation. In future versions of ANNOVAR, the keyword esp5400_ea will be added. The allele frequencies are "alternative allele frequency", not "minor allele frequency".
esp5400_aa 5400 NHLBI exomes (African Americans) hg18/hg19 use "-downdb esp5400_aa -webfrom annovar" to download the database. Use -dbtype generic for annotation
esp5400_all 5400 NHLBI exomes (all ethnicity) hg18/hg19 use "-downdb esp5400_all -webfrom annovar" to download the database. Use -dbtype generic for annotation
esp**** NHLBI exomes hg18/hg19 see download page for the most updated list of NHLBI exome data sets
<OTHER> <OTHER>   see download page for the most updated list of OTHER data sets. This table shows only some examples and explains what they are.

The users can use the following example commands to download these databases:

annotate_variation.pl -downdb -buildver hg18 -webfrom annovar 1000g humandb/
annotate_variation.pl -downdb -buildver hg18 -webfrom annovar snp130 humandb/
annotate_variation.pl -downdb -buildver hg18 -webfrom annovar avsift humandb/

After executing these commands, several output files will appear in the humandb/ directory. For the 1000G data, the output files will look like hg18_CEU.sites.2009_04.txt, indicating that these data were compiled in April 2009. For dbSNP data, the file hg18_snp130.txt will be generated in the humandb/ directory. For AVSIFT, the file hg18_avsift.txt will be generated in the humandb/ directory.
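The avsift row in the table above notes that when one variant has several per-isoform SIFT scores, the lowest score is kept. The sketch below illustrates that reduction on a tab-delimited file whose first six columns are chr, start, end, reference allele, observed allele and score. It is an illustration of the rule only, not the script used to build the ANNOVAR files, and the function name is hypothetical.

```shell
# Keep the lowest score for each (chr, start, end, ref, obs) combination.
# Input: tab-delimited rows with the score in column 6.
# lowest_score_per_variant is a hypothetical helper name.
lowest_score_per_variant() {
    awk -F '\t' -v OFS='\t' '
        {
            key = $1 OFS $2 OFS $3 OFS $4 OFS $5
            if (!(key in best)) {
                order[++n] = key          # remember first-seen order
                best[key] = $6
            } else if ($6 + 0 < best[key] + 0) {
                best[key] = $6            # numerically smaller score wins
            }
        }
        END { for (i = 1; i <= n; i++) print order[i], best[order[i]] }
    ' "$@"
}
```

Fed the three isoform scores from the example in the table (0.94, 0.08 and 0.8 for chr22:14646995 G->C), this keeps the single row with 0.08.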

Notes: ANNOVAR was developed when the April 2009 build of the 1000 Genomes Project was available, so by default the table-name "1000g" will download this data set. In March 2010, an update from 1000G was issued, so to handle these new files, the "1000g2010" table-name must be used in the command line. In July 2010, another update of the pilot project was released, so users need to use the "1000g2010jul" table-name. In November 2010, another update of the project (non-pilot) was released, so users need to use the "1000g2010nov" table-name, as well as "-buildver hg19" because this release was based on hg19 coordinates.

To download the 2010, 2011 and 2012 releases of the 1000 Genomes Project data, use the commands below:

[kai@beta ~/]$ annotate_variation.pl -downdb 1000g2010 humandb/
[kai@beta ~/]$ annotate_variation.pl -downdb 1000g2010jul humandb/
[kai@beta ~/]$ annotate_variation.pl -downdb 1000g2010nov humandb/ -buildver hg19
[kai@beta ~/]$ annotate_variation.pl -downdb 1000g2011may humandb/ -buildver hg19
[kai@beta ~/]$ annotate_variation.pl -downdb 1000g2012apr humandb/ -buildver hg19

The humandb/ directory will have a few additional files that look like hg18_CEU.sites.2010_03.txt. These files will be used by ANNOVAR for annotation.

 

Advanced notes on 1000 Genomes Project Data processing

1. Tri-allelic SNPs or multi-allelic variants: for whatever reason, 1000G data currently do not specify a separate allele count (AC field) for each alternative allele of a multi-allelic variant, which violates the VCF4 format specification. I therefore have to treat both alternative alleles as having identical allele frequency (for example, AF=0.23 for both A and C below):

1 3216108 rs59508799 G A,C 7818.35 PASS AC=110;AF=0.2331;AN=472;BaseQRankSum=0.000;BaseQRankSumZ=6.2032227.76;VQSLOD=1.8460;set=ALL2

In the 2011 May release of 1000G SNP data, there are 89168 tri-allelic variants that I converted using the above rule. In other words, the above line was treated as two different entries in the database, both with an allele frequency of 0.23. See below:

[kaiwang@biocluster ~/]$ fgrep -w 3216108 hg19_ALL.sites.2011_05.txt
1 3216108 G A 0.2331
1 3216108 G C 0.2331
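The conversion rule above can be sketched in a few lines of awk: every alternative allele in the ALT column becomes its own row, and the single AF value from the INFO column is copied to each row. This is an illustration under the standard VCF column order (ALT is column 5, INFO is column 8), not the actual script used to build the ANNOVAR files. The function name is hypothetical and the output is tab-delimited.

```shell
# Expand multi-allelic VCF records into one row per alternative allele,
# copying the record's AF value to every allele.
# Output columns: chr, pos, ref, alt, AF.
# split_multiallelic is a hypothetical helper name.
split_multiallelic() {
    awk -F '\t' -v OFS='\t' '
        /^#/ { next }                        # skip VCF header lines
        {
            af = ""
            n = split($8, info, ";")         # INFO is the 8th VCF column
            for (i = 1; i <= n; i++)
                if (info[i] ~ /^AF=/) af = substr(info[i], 4)
            m = split($5, alt, ",")          # ALT is the 5th VCF column
            for (j = 1; j <= m; j++)
                print $1, $2, $4, alt[j], af
        }
    ' "$@"
}
```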

2. Indel calls: indels from 1000G were typically released at a different time than SNP calls, but I tried to combine them into one single file for the convenience of ANNOVAR users. Depending on the release, sometimes more than 10 million indels are called in 1000G (yet only 20-30 million SNPs are called), indicating that there is an extremely high false positive rate in the calls. Therefore, I only pick the most confident ones that pass all sorts of filters for the database file that I compile. So do not be surprised if a particular indel was "reported" in 1000G but is not found in the file that I provide to users.

If you absolutely need to use all potential indels regardless of confidence, then just use the VCF file from 1000G directly in ANNOVAR for filter annotation. Read this for details.

3. VCF filter: The VCF files from 1000G may contain different quality filters. In the newer versions of ANNOVAR database file, only the FILTER of "PASS" will be used. For example, among the 41.6 million variant calls in the VCF file from 1000G, only 38.9 million variants have "PASS" filter, yet others have filters such as "TruthSensitivityTranche99.90to100.00", etc. Only the 38.9 million variants are in the hg19_ALL.sites.2011_05.txt file used by ANNOVAR.
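For users who start from a raw 1000G VCF file and want the same restriction, the PASS rule can be approximated with a one-line awk filter on the FILTER column (column 7 in the VCF format). This is a sketch of the idea only, and keep_pass is a hypothetical function name.

```shell
# Keep VCF header lines and records whose FILTER column (7th) is "PASS".
# keep_pass is a hypothetical helper name.
keep_pass() {
    awk -F '\t' '/^#/ || $7 == "PASS"' "$@"
}

# Example:
#   keep_pass input.vcf > input.pass.vcf
```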

Indexed databases are provided starting from September 2011

Starting from September 2011, indexed databases are provided to ANNOVAR users to speed up filter-based operations. To use this feature, users should re-download all the databases provided by me using the -downdb operation (and the "-webfrom annovar" argument). In February 2012, an update to all the index files was posted to provide an additional ~2-5X speed-up for whole-exome sequencing data, based on extensive simulation tests.

For example, running "annotate_variation.pl -downdb avsift humandb/" will re-download the SIFT database as well as a *.idx file associated with the SIFT database. The *.idx file will be used by ANNOVAR to speed up filter-based operations.

Technical Note: One thing to note for snp131, snp132, etc.: given that UCSC constantly updates these files, it is very difficult for me to maintain a constantly updated index file. Therefore, if users do specify "-webfrom annovar", then both the snp files and the index files will be downloaded from the ANNOVAR website, not the UCSC site, and the snp files may sometimes be slightly outdated but are definitely fine to use. If users do not specify "-webfrom annovar", then index files will not be downloaded and the speed may not be as good.
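To see which local databases would benefit from a re-download, one can list which *.txt files have a companion index. The sketch below assumes the index sits next to the database and is named by appending ".idx" to the database file name. That naming is an assumption (the text above only says "*.idx"), and report_indexes is a hypothetical helper.

```shell
# For every *.txt database in a directory, report whether a companion index
# exists. Assumes the index is named "<database>.idx" in the same directory.
# report_indexes is a hypothetical helper name.
report_indexes() {
    for db in "$1"/*.txt; do
        [ -e "$db" ] || continue             # directory had no *.txt files
        if [ -e "$db.idx" ]; then
            printf 'indexed\t%s\n' "$db"
        else
            printf 'no index\t%s\n' "$db"
        fi
    done
}

# Example:
#   report_indexes humandb
```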

 

5. Download Complete Genomics variation database

Complete Genomics (CG) has released variants from 69 genomes, so this is a great resource for CG users to filter variants against and remove platform-specific artifacts. I will try to keep up to date with CG in the future if they release more data.

dbtype | buildver | Explanation
cg46 | hg18 | a diversity panel representing 9 different populations (variants called on hg18 coordinates)
cg46 | hg19 | Before Sep 2011: lifted over from the hg18 data above. After Sep 2011: compiled from variant calls directly provided by CG
cg69 | hg18 | a diversity panel representing 9 different populations; a Yoruba trio; a Puerto Rican trio; a 17-member, 3-generation pedigree
cg69 | hg19 | same as above but mapped and called on hg19 coordinates and directly provided by CG

 

6. Users can supply additional filter-based annotation databases

In addition to downloading annotation databases from the Internet, users can supply their own annotation databases. Several types of databases can be supplied. The "generic" format can be used for filter-based annotations, while the "gff3" format can be used for region-based annotations or gene-based annotations.

table-name | Data set name | Explanation
generic | any filter-based data set conforming to generic format (for use with --filter operation) | Users can generate their own variant databases in a simple format (chr, start, end, reference allele, observed allele, and any other columns), and ANNOVAR can process this database using the -dbtype generic argument. For example, some users may want to compute whole-exome PolyPhen scores and use ANNOVAR to annotate variants using these scores.
gff3 | any annotation data set conforming to Generic Feature Format 3 (GFF3), a current gold standard for model-organism sequence feature annotations (for use with -regionanno operation) | Users can supply a GFF3 formatted database file, and ANNOVAR will perform region-based annotation of the query against this file. A detailed description of the GFF3 format can be found at the Sequence Ontology website: http://www.sequenceontology.org/gff3.shtml. It has become the standard for many model organism databases for sequence feature exchange, so users can annotate their variants against any annotation database that exists in GFF3 format.
vcf | any custom VCF file with population frequency data on alleles | VCF format is adopted by the 1000 Genomes Project to present variation data. The file may contain called alleles and their frequencies in a population, but may also contain individual genotypes for each subject in a population. ANNOVAR will examine the annotated mutations in a population.
bed | a BED file with chr, start and end position | Users can supply a custom BED file for region-based annotation. For example, after an exome sequencing experiment you generated variant calls, but are only interested in the calls located in the "target region" of the exome enrichment array; in this case, you can use the BED file provided by the array manufacturer to select the subset of variants located within target regions.
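Before pointing ANNOVAR at a home-made "generic" database, a quick sanity check of the five leading columns can catch formatting slips. The sketch below assumes the file is tab-delimited and that start and end are positive integers with start not greater than end. These checks are assumptions about what a well-formed file looks like, not ANNOVAR's own validation, and validate_generic is a hypothetical helper name.

```shell
# Check that every line has at least five tab-delimited columns
# (chr, start, end, ref, obs) and plausible coordinates.
# Prints one message per offending line; exit status is 1 if any were found.
# validate_generic is a hypothetical helper name.
validate_generic() {
    awk -F '\t' '
        NF < 5 {
            print "line " NR ": fewer than 5 columns"; bad = 1; next
        }
        $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ {
            print "line " NR ": non-numeric start/end"; bad = 1; next
        }
        $2 + 0 > $3 + 0 {
            print "line " NR ": start is after end"; bad = 1
        }
        END { exit bad + 0 }
    ' "$@"
}

# Example:
#   validate_generic humandb/hg18_generic.txt && echo "format looks fine"
```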