ANNOVAR
Home
Download
Quick Start-up Guide
Prepare Database
Prepare Input File
Annotation
Gene-based
Region-based
Filter-based
Accessary Programs
FAQ

ANNOVAR Accessary Programs

  1. Automation of variant reduction procedure
  2. Conversion of whole-exome data into an Excel file for manual filtering
  3. Conversion of input file formats
  4. Retrieval of nucleotide and protein sequences from a particular genomic region

The ANNOVAR package contains several accessary programs to help users convert file formats or perform additional functions. These accessary programs are described below.

1. Automation of variant reduction procedure

A program called auto_annovar.pl is provided to automate the procedure to identify a small subset of most likely causal variants, from a large list of variants (possibly generated from whole-genome sequencing). An example flow chart is given below, demonstrating how ANNOVAR can be useful in pinpointing a handful of candidate genes (including the true causal gene) from 4.7 million variants (4.2 million SNPs, 0.5 million indels) for a rare Mendelian disease. The input data includes all SNVs in subject NA18107 generated by Illumina, as well as two variants known to cause Miller syndrome.

pipeline

The command to run the "varians reduction" method is shown below. It only works on human variation data, and the program will first check to make sure that all human (hg18 genome build) related annotation databases are already downloaded and are available at the humandb/ directory.

[kai@beta ~/]$ auto_annovar.pl -model recessive ex2.human humandb/

The output files will look like ex2.human.step1.filtered, etc. A total of 10 steps will be performed, and the user need to take a data and apply a genetic model (recessive versus dominant verus X-linked) to further trim down the number of candidate genes. In the example for the synthetic dataset for Miller syndrome shown above, applying a recessive model reduced the candidate to 20, so that these genes can be resequenced by Sanger sequencing in additional cases to confirm which gene is really the causal gene.

step1:identify splicing and exonic variants, but remove synonymous mutations (these
subsets of variants are more likely to be functional; obviously, skip this step
if the purpose is to look for non-coding functional variants)

step2:identify the variants located in genomic regions that are annotated as Most
Conserved Elements (more likely to be functional)

step3:identify variants that are not located in segmental duplications regions (less
likely to be affected by genotype calling issues)

step4/step5/step6: remove variants observed in the 1000 Genomes Project (CEU, YRI, JPTCHB,
respectively) because these variants are less likely to be causing Mendelian
diseases. Howeve, I do see MANY known mendelian disease variants in 1000G, so
some MAF threshold should really be applied here!

step7: remove variants observed in the dbSNP 130 because these variants are
less likely to be causing Mendelian diseases. Howeve, I do see MANY known
mendelian disease variants in dbSNP130!

step8: collect the remaining variants, map them to genes, and prepare the next step

step9: identify a list of genes that are likely to be affected by deleterious mutations

step10 (optional, by default OFF): remove some genes that are regarded as "dispensible"

Most people probably want to tweak the pipeline described above. For example, some may want to eliminate the step that look for Most conserved regions (with the reason that an indel within a non-conserved region could still be highly functional). In this case, using the "-step 1,3-9" will skip the step 2. Similarly, using the "-step 1-3,8-9" argument will not use 1000G or dbSNP data for filtering out potentially benign variants. By default, step 10 will not be executed, but users can add "-step 1-10 -dispensable file" argument to run step 10 as well by supplying a dispensable file, or probably easier to simply run step 1-9 then manually remove some genes from the output file.

If users stops the program when it is running (by Ctrl+C), but want to resume the process by taking advantage of files that have already been generated in previous steps, the "-skip" arugment can be used. For example, suppose that the program was stopped when running step 5, but all the output files for step 4 are already generated. In this case, just run the program using "-step 5-9 -skip" argument, and it will resume from step 5, skipping step 1-4 without re-generating output files for step 1-4. Similarly, if you run the program by default "-model recessive" but want to check out what will happen for dominant model, just run the program with "-step 9 -skip" and you will get a new output file instantly.

 

2. Conversion of whole-exome data into an Excel file for manual filtering

Several researchers expressed interests to perform the filtering steps themselves by testing different combinations of filtering procedure. Excel provides a good platform to do that, even though it has a limit of 1 million lines. In any case, since currently most people focus on exome, Excel can handle exome data just fine. I therefore implemented a functionality to convert whole genome data into an Excel file, such that users can click a few mouse buttons to identify a small subset of candidate genes.

Currently the program is still rudimentary, but it has all the essential functinality now. Below I show how to use it on the ex1.human file as the input variant file:

[kai@node-r2-u16-c19-p13-o11 ~/project/annotate_variation]$ summarize_annovar.pl ex1.human humandb/ -outfile sum

NOTICE: Running step 1 with system command <annotate_variation.pl -geneanno -buildver hg18 -outfile sum ex1.human humandb/>
NOTICE: Reading gene annotation from humandb/hg18_refGene.txt ... Done with 35709 transcripts (including 4745 without coding sequence annotation) for 22032 unique genes
NOTICE: Reading FASTA sequences from humandb/hg18_refGeneMrna.fa ... Done with 13 sequences
NOTICE: Finished gene-based annotation on 12 genetic variants in ex1.human
NOTICE: Output files were written to sum.variant_function, sum.exonic_variant_function

NOTICE: Running step 2 with system command <annotate_variation.pl -regionanno -dbtype mce44way -buildver hg18 -outfile sum ex1.human humandb/>
NOTICE: Reading annotation database humandb/hg18_phastConsElements44way.txt ... Done with 4878296 regions
NOTICE: Finished region-based annotation on 12 genetic variants in ex1.human
NOTICE: Output files were written to sum.hg18_mce44way

NOTICE: Running step 3 with system command <annotate_variation.pl -regionanno -dbtype segdup -buildver hg18 -outfile sum ex1.human humandb/>
NOTICE: Reading annotation database humandb/hg18_genomicSuperDups.txt ... Done with 51809 regions
NOTICE: Finished region-based annotation on 12 genetic variants in ex1.human
NOTICE: Output files were written to sum.hg18_genomicSuperDups

NOTICE: Running step 4 with system command <annotate_variation.pl -filter -dbtype 1000g_ceu -buildver hg18 -outfile sum ex1.human humandb/>
NOTICE: Variants matching filtering criteria are written to sum.hg18_CEU.sites.2009_04_dropped, other variants are written to sum.hg18_CEU.sites.2009_04_filtered
NOTICE: Processing next batch with 12 variants
NOTICE: Scanning filter database humandb/hg18_CEU.sites.2009_04.txt...Done

NOTICE: Running step 5 with system command <annotate_variation.pl -filter -dbtype 1000g_yri -buildver hg18 -outfile sum ex1.human humandb/>
NOTICE: Variants matching filtering criteria are written to sum.hg18_YRI.sites.2009_04_dropped, other variants are written to sum.hg18_YRI.sites.2009_04_filtered
NOTICE: Processing next batch with 12 variants
NOTICE: Scanning filter database humandb/hg18_YRI.sites.2009_04.txt...Done

NOTICE: Running step 6 with system command <annotate_variation.pl -filter -dbtype 1000g_jptchb -buildver hg18 -outfile sum ex1.human humandb/>
NOTICE: Variants matching filtering criteria are written to sum.hg18_JPTCHB.sites.2009_04_dropped, other variants are written to sum.hg18_JPTCHB.sites.2009_04_filtered
NOTICE: Processing next batch with 12 variants
NOTICE: Scanning filter database humandb/hg18_JPTCHB.sites.2009_04.txt...Done

NOTICE: Running step 7 with system command <annotate_variation.pl -filter -dbtype snp130 -buildver hg18 -outfile sum ex1.human humandb/>
NOTICE: Variants matching filtering criteria are written to sum.hg18_snp130_dropped, other variants are written to sum.hg18_snp130_filtered
NOTICE: Processing next batch with 12 variants
NOTICE: Scanning filter database humandb/hg18_snp130.txt...Done

NOTICE: Running step 8 with system command <annotate_variation.pl -filter -dbtype avsift -buildver hg18 -sift 0 -outfile sum ex1.human humandb/>
NOTICE: Variants matching filtering criteria are written to sum.hg18_avsift_dropped, other variants are written to sum.hg18_avsift_filtered
NOTICE: Processing next batch with 12 variants
NOTICE: Scanning filter database humandb/hg18_avsift.txt...Done
NOTICE: Finished loading filterstep database file

NOTICE: Final whole-genome summary was written to sum.genome_summary.csv file
NOTICE: Final whole-genome summary was written to sum.genome_summary.csv file

Now open the sum.exome_summary.csv file in Excel 2007. Click the "DATA" tab at the menu bar, then click the big "Filter" button. Then click any one of the headings such as 1000G_CEU or SIFT to filter out variants, essentially by clicking the check boxes. For SIFT score, make sure to use "less than 0.05 OR equal to (blank)" so that variants without SIFT score do not get filtered out. It should be straightfoward to do, but it may need a little practice for users not familiar with Excel.

excel

 

3. Conversion of input file format

The convert2annovar.pl program can be uesd to convert various file formats into ANNOVAR input file format. This topic has been discussed in detail in the "ANNOVAR Input Files" section.

 

3. Retrieval of nucleotide and protein sequences from a particular genomic region

The retrieve_seq_from_fasta.pl program can be used to retrieve genomic nucleotide sequences or cDNA sequences, or translated amino acid sequences (this functionality is currently being developed and will be released in future ANNOVAR version) from many user-specified genomic regions. It can take several different types of region files, hereafter referred to as "simple", "tab", "refGene", "ensGene", "knownGene".

A few examples are given below to illustrate the use of this program. Before running the example, first download the genomic sequences for whole human genome. They will be saved in the humandb/hg18seq/ directory.

[kai@beta ~/]$ annotate_variation.pl -downdb seq humandb/hg18seq/

1. simple input files

The file list simple regions in the first column of each line (other columns can be present but will not be used). For example,

[kai@beta ~/example]$ cat example.simple_region
chr10:4000000-4000100
chr10:7000000-8000000

This file contains two genomic regions. To retrieve the sequence for these two regions (100bp and 1Mb, respectively), use

[kai@beta ~/example]$ retrieve_seq_from_fasta.pl -format simple -seqdir ../humandb/hg18_seq/ example.simple_region
NOTICE: The output file is written to example.simple_region.fa (use --outfile to change this)
NOTICE: Finished reading 1 sequences from ../humandb/hg18_seq/chr10.fa
NOTICE: Finished writting FASTA for 2 genomic regions to example.simple_region.fa.
[kai@node-r2-u16-c19-p13-o11 ~/project/annotate_variation/example]$ head -n 2 example.simple_region.fa
>chr10:4000000-4000100 Comment: this DNA sequence is generated by ANNOVAR on Thu Aug 5 21:58:08 2010, based on regions speficied in example.simple_region and sequence files stored at ../humandb/hg18_seq/.
CACCATAATCCGTCTCGCCATTCTTTCCCAAGGGGCTTTATTCGTTCTATCTCCATGCTCTTCTCAACATCACCTGCCACTGTTGGCTCGTGGACTTTTT

2. tab-delimited input files

The file list chr, start and end position in tab delimited format as the first 3 columns of each line (other columns can be present but will not be used). An example is given below. Note that the -outfile can be used to specify an output file name.

[kai@beta ~/example]$ cat example.tab_region
chr10 4000000 4000100
chr10 7000000 8000000

[kai@beta ~//example]$ retrieve_seq_from_fasta.pl -format tab -seqdir ../humandb/hg18_seq/ -outfile example.fa example.tab_region
NOTICE: Finished reading 1 sequences from ../humandb/hg18_seq/chr10.fa
NOTICE: Finished writting FASTA for 2 genomic regions to example.fa.

3. refGene input files

The file is in UCSC refGene format that contains exon start and end positions. The output will be mRNA/cDNA sequences, rather than genomic seqences.

[kai@beta ~/humandb]$ head hg19_refGene.txt
971 NR_024227 chr19 - 50595745 50595866 50595866 50595866 1 50595745, 50595866, 0 SNAR-A6unk unk -1,
971 NR_024227 chr19 - 50601082 50601203 50601203 50601203 1 50601082, 50601203, 0 SNAR-A6unk unk -1,
629 NM_001014809 chr4 - 5822491 5894785 5823486 5894696 14 5822491,5827220,5830215,5837641,5838491,5841248,5843034,5844819,5851118,5853134,5857869,5862752,5868394,5894315, 5823578,5827386,5830395,5837812,5838633,5841405,5843155,5844888,5851199,5853196,5858034,5862937,5868483,5894785, CRMP1 cmpl cmpl 1,0,0,0,2,1,0,0,0,1,1,2,0,0,
808 NM_001029883 chr2 - 29284557 29297127 29287734 29297127 2 29284557,29293459, 29287933,29297127, C2orf71 cmpl cmpl 2,0,
705 NM_024329 chr1 + 15736390 15756839 15736467 15755220 4 15736390,15752366,15753645,15755088, 15736775,15752514,15753780,15756839, 0 EFHD2 cmpl cmpl 0,2,0,0,

[kai@beta ~/humandb]$ retrieve_seq_from_fasta.pl -format refGene -seqdir hg19_seq/ -outfile example.fa hg19_refGene.txt
NOTICE: Finished reading 1 sequences from hg19_seq/chr1_gl000191_random.fa
NOTICE: Finished reading 1 sequences from hg19_seq/chr22.fa
NOTICE: Finished reading 1 sequences from hg19_seq/chr14.fa
...
...
NOTICE: Finished writting FASTA for 36824 genomic regions to example.fa.

4. knownGene input files

The handling of this type of input files is very similar to the refGene input files. Future versions of ANNOVAR may merge these input files together.

5. ensGene inputfiles

The handling of this type of input files is very similar to the refGene input files. Future versions of ANNOVAR may merge these input files together.

6. Others (such as Gencode) input files

Use the genericGene as the -format argument.