ANNOVAR Accessory Programs
The ANNOVAR package contains several accessory programs that help users convert file formats or perform additional functions. These accessory programs are described below.
1. Variants_Reduction: prioritizing causal variants
In October 2012, a new program, variants_reduction.pl, was added to the ANNOVAR package to replace the old auto_annovar.pl. The new program is more flexible: it lets users choose a customized filtering procedure, and it should be more useful than the old program for identifying causal variants from next-generation sequencing data. If you are familiar with the annotate_variation.pl program, variants_reduction.pl should not be hard to use. One example is shown below:
    [kaiwang@biocluster ~/]$ variants_reduction.pl sample.avinput humandb/ -protocol nonsyn_splicing,genomicSuperDups,phastConsElements46way,1000g2012apr_all,esp5400_ea,esp5400_aa,snp135NonFlagged,dominant -operation g,rr,r,f,f,f,f,m -out reduce -buildver hg19
This command applies a series of filtering procedures to identify a small subset of variants/genes that are likely to be related to disease. The procedures are: identifying nonsynonymous and splicing variants, removing variants in segmental duplication regions, keeping variants in conserved genomic regions based on the 46-way alignment, removing variants observed in the 1000 Genomes Project 2012 April release, in ESP5400 European Americans or in ESP5400 African Americans, removing variants observed in the dbSNP135 Non Flagged set, and finally applying a dominant disease model. The -operation argument specifies which operation is used for each entry in -protocol, in the same order: gene-based (g), reverse region-based (rr), region-based (r), filter-based (f), filter-based (f), filter-based (f), filter-based (f) and model-based (m), respectively. The output is written to a set of files whose names start with reduce.
Another example command is given below:
    [kaiwang@biocluster ~/]$ variants_reduction.pl sample.avinput humandb -buildver hg19 -protocol nonsyn_splicing,1000g2012apr_all,esp6500_ea,esp6500_aa,snp135NonFlagged,cg46,ljb_sift,ljb_pp2,dominant -operation g,f,f,f,f,f,f,f,m -outfile reduce -genetype knowngene -maf_threshold 0.01
This command performs a similar set of operations to the one above, but additionally removes any variants observed in the CG46 database. The MAF threshold is applied to all of the 1000G, ESP6500 and CG46 databases. Furthermore, variants predicted to be likely benign by SIFT or PolyPhen are removed. Finally, UCSC Known Gene, rather than RefSeq Gene (the default), is used for gene-based annotation.
As these examples show, the user specifies which operations ANNOVAR performs and which specific database each operation uses. Users currently have only limited ability to select custom thresholds, such as a different MAF for different databases. The program is not yet mature and will undergo additional changes in future versions to improve its functionality and to make it compatible with the Windows operating system.
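Because each entry in -protocol is matched by position to one code in -operation, it can help to check the two lists before launching a long run. The following is a minimal sketch of such a check, not part of ANNOVAR: the function name pair_protocol_with_operation is hypothetical, and the code table contains only the five operation codes described in this section.

```python
# Operation codes as described in this section; the helper is a hypothetical
# pre-flight check and is not shipped with ANNOVAR.
OPERATION_NAMES = {
    "g": "gene-based",
    "rr": "reverse region-based",
    "r": "region-based",
    "f": "filter-based",
    "m": "model-based",
}


def pair_protocol_with_operation(protocol, operation):
    """Pair each protocol entry with its operation name, by position.

    Raises ValueError if the two comma-separated lists differ in length
    or if an operation code is not one of the codes listed above.
    """
    protocols = protocol.split(",")
    operations = operation.split(",")
    if len(protocols) != len(operations):
        raise ValueError(
            f"{len(protocols)} protocol entries but {len(operations)} operation codes"
        )
    unknown = [op for op in operations if op not in OPERATION_NAMES]
    if unknown:
        raise ValueError(f"unrecognized operation codes: {unknown}")
    return [(p, OPERATION_NAMES[op]) for p, op in zip(protocols, operations)]


if __name__ == "__main__":
    # Arguments copied from the first variants_reduction.pl example.
    steps = pair_protocol_with_operation(
        "nonsyn_splicing,genomicSuperDups,phastConsElements46way,"
        "1000g2012apr_all,esp5400_ea,esp5400_aa,snp135NonFlagged,dominant",
        "g,rr,r,f,f,f,f,m",
    )
    for database, kind in steps:
        print(f"{kind:22s}{database}")
```

Running the script lists each filtering step next to its operation type, which makes an off-by-one between the two arguments easy to spot.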
2. Table_Annovar: Conversion of whole-genome data into an Excel file
Versions of ANNOVAR before May 2013 included the summarize_annovar program. It takes an input file and generates a tab-delimited annotation file in which each column represents one type of annotation. This program has been popular among ANNOVAR users because it allows easy viewing of the results in Excel or other tools. However, summarize_annovar fixes the number and type of annotations, which severely limits the user's ability to perform custom annotations. In May 2013, I released the table_annovar.pl program to address this limitation. Since the program is new, it may have some bugs; if you encounter any, please report them to me. Below I show how to use it with the ex1.human file as the input variant file:
    [kaiwang@biocluster ~/]$ table_annovar.pl ex1.human humandb/ -protocol refGene,phastConsElements44way,genomicSuperDups,esp6500si_all,1000g2012apr_all,snp135,avsift,ljb_all -operation g,r,r,f,f,f,f,f -nastring NA
The output file is written to ex1.human.multianno.txt. This is a tab-delimited file in which each row represents one variant and each column represents one annotation task. If you are familiar with summarize_annovar, you will see that the output file is similar. However, table_annovar lets users specify exactly which columns or annotation tasks are required, and lets them select multiple versions of the same analysis (such as multiple gene-definition systems or multiple dbSNP databases).
Users can open the file in Excel 2007 (select "tab-delimited" when opening the file). Click the "DATA" tab in the menu bar, then click the big "Filter" button. Then click any one of the headings, such as 1000G_CEU or SIFT, and filter out variants by ticking the check boxes. For the SIFT score, make sure to use "less than 0.05 OR equal to (blank)" so that variants without a SIFT score are not filtered out. This should be straightforward, but it may take a little practice for users not familiar with Excel.
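The same "less than 0.05 OR blank" rule can be applied outside Excel. The sketch below is only an illustration under several assumptions: the column name "SIFT" and the other header names in the demo are placeholders (check the header line of your own output file), the demo rows are invented, and "NA" is treated as missing because the command above passes -nastring NA.

```python
import csv
import io


def keep_row(value, threshold=0.05, missing=("", "NA")):
    """Keep a variant if its score is missing or strictly below the threshold."""
    if value is None or value.strip() in missing:
        return True
    return float(value) < threshold


def filter_rows(handle, column, threshold=0.05):
    """Read a tab-delimited table with a header line and return the rows to keep."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if keep_row(row.get(column), threshold)]


if __name__ == "__main__":
    # Invented demo data; a real run would pass open("ex1.human.multianno.txt").
    demo = io.StringIO(
        "Chr\tStart\tSIFT\n"
        "1\t100\t0.01\n"
        "1\t200\tNA\n"
        "1\t300\t0.80\n"
        "1\t400\t\n"
    )
    for row in filter_rows(demo, "SIFT"):
        print(row["Chr"], row["Start"], row["SIFT"] or "(blank)")
```

Treating missing scores as "keep" mirrors the Excel advice: a variant without a SIFT prediction has not been shown to be benign, so it stays in the candidate list.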
Next, try adding the "-csvout" argument to the above command and run the program again. This time, a CSV file is generated that can be loaded directly into Excel. Then try adding the "-sortout" argument to the above command. The output file will already be sorted by chromosome and by start position for downstream processing.
Next, let's try something more complicated: generating gene-based annotations with different gene definition systems and filter-based annotations with different versions of dbSNP:
    [kaiwang@biocluster ~/]$ table_annovar.pl ex1_hg19.human humandb/ -buildver hg19 -protocol refGene,knownGene,ensGene,wgEncodeGencodeManualV4,gerp++elem,phastConsElements46way,genomicSuperDups,esp6500si_all,1000g2012apr_all,1000g2012apr_eur,1000g2012apr_amr,1000g2012apr_asn,1000g2012apr_afr,cg46,cosmic64,snp129,snp132,snp137,avsift,ljb_all -operation g,g,g,g,r,r,r,f,f,f,f,f,f,f,f,f,f,f,f,f -csvout
Examine the results to see how consistent the different annotation approaches/versions are.
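One way to examine consistency is to count how often two annotation columns agree. The sketch below assumes a comma-separated file with a header line, as produced with -csvout; the column names refGene_function and knownGene_function and the demo rows are hypothetical placeholders, so substitute the header names found in your own output.

```python
import csv
import io
from collections import Counter


def tally_agreement(handle, column_a, column_b):
    """Count rows where two columns hold the same value versus different values."""
    counts = Counter()
    for row in csv.DictReader(handle):
        key = "agree" if row[column_a] == row[column_b] else "differ"
        counts[key] += 1
    return counts


if __name__ == "__main__":
    # Invented demo data with hypothetical column names.
    demo = io.StringIO(
        "Chr,Start,refGene_function,knownGene_function\n"
        "1,100,exonic,exonic\n"
        "1,200,intronic,exonic\n"
        "2,300,intergenic,intergenic\n"
    )
    print(tally_agreement(demo, "refGene_function", "knownGene_function"))
```

A plain string comparison is the simplest possible definition of agreement; rows that differ are the ones worth inspecting by hand.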
3. Conversion of input file format
The convert2annovar.pl program can be used to convert various file formats into the ANNOVAR input file format. This topic is discussed in detail in the "ANNOVAR Input Files" section.
4. Retrieval of nucleotide and protein sequences from a particular genomic region
The retrieve_seq_from_fasta.pl program can be used to retrieve genomic nucleotide sequences, cDNA sequences or translated amino acid sequences (this functionality is currently being developed and will be released in a future ANNOVAR version) from many user-specified genomic regions. It accepts several different types of region files, hereafter referred to as "simple", "tab", "refGene", "ensGene" and "knownGene". A few examples are given below to illustrate the use of this program.
Before running the examples, first download the genomic sequences for the whole human genome. They will be saved in the humandb/hg18seq/ directory.
    [kai@beta ~/]$ annotate_variation.pl -downdb seq humandb/hg18seq/
1. simple input files
The file lists simple regions in the first column of each line (other columns can be present but will not be used). For example,
    [kai@beta ~/example]$ cat example.simple_region
This file contains two genomic regions. To retrieve the sequence for these two regions (100bp and 1Mb, respectively), use
    [kai@beta ~/example]$ retrieve_seq_from_fasta.pl -format simple -seqdir ../humandb/hg18seq/ example.simple_region
2. tab-delimited input files
The file lists the chr, start and end position in tab-delimited format as the first 3 columns of each line (other columns can be present but will not be used). An example is given below. Note that the -outfile argument can be used to specify an output file name.
    [kai@beta ~/example]$ cat example.tab_region
3. refGene input files
The file is in UCSC refGene format, which contains exon start and end positions. The output will be mRNA/cDNA sequences rather than genomic sequences.
    [kai@beta ~/humandb]$ head hg19_refGene.txt
4. knownGene input files
This type of input file is handled in much the same way as refGene input files. Future versions of ANNOVAR may merge these input files together.
5. ensGene input files
This type of input file is handled in much the same way as refGene input files. Future versions of ANNOVAR may merge these input files together.
6. Other (such as Gencode) input files
Use genericGene as the -format argument.
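For the tab-delimited region format described in item 2 above, it can be useful to validate a region file before passing it to retrieve_seq_from_fasta.pl. The sketch below is a hypothetical helper, not part of ANNOVAR. It reads only the first three columns, as the description states, and it adds one assumption of its own: that a region whose start is greater than its end is an error. The demo rows are invented.

```python
import io


def read_tab_regions(handle):
    """Yield (chrom, start, end) from the first three tab-separated columns.

    Extra columns are ignored and blank lines are skipped. Rejecting
    start > end is an assumption of this sketch, not documented behavior.
    """
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(f"line {line_number}: expected at least 3 columns")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if start > end:
            raise ValueError(f"line {line_number}: start {start} is after end {end}")
        yield chrom, start, end


if __name__ == "__main__":
    # Invented demo data; a real run would pass open("example.tab_region").
    demo = io.StringIO("1\t1000\t1100\tcomment\n2\t5000\t6000\n")
    for region in read_tab_regions(demo):
        print(region)
```

Because the function is a generator, a malformed line raises an error only when it is reached, so wrapping the call in list() is an easy way to validate the whole file up front.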