|
ANNOVAR Function III: Filter-based annotation
An important and probably highly desirable feature is that ANNOVAR can help identify subsets of variants based on comparison to other variant databases, for example, variants annotated in dbSNP or variants annotated in 1000 Genome Project. In addition, I provide a "avsift" data set, which contains pre-computed SIFT scores for all possible whole-genome non-synonymous mutations, so that users can scan non-synonymous mutations against avsift to filer out "benign" mutations. The exact variant, with same start and end positions, and with same observed alleles, will be identified. These functionalities mentioned above can be performed using the --filter operation in ANNOVAR. The major difference between --filter and --regionanno above is that that --filter operation works on mutations (nucleotide changes), but --regionanno operation works on chromosome locations. For example, --region compare variants with things like chr1:1000-1000, but --filter compare variants with things like A->G change at the position chr1:1000-1000. 1. 1000 Genomes Project (2009 release) annotations ANNOVAR can annotate variants based on annotated allele frequency in CEU populations used in the 1000 Genome project. This analysis requires downloading the annotation files from the 1000Genome project website (the command is "annotate_variation.pl -downdb 1000g humandb/") The current version of program download the 2009 April annotations. [kai@beta ~/]$ annotate_variation.pl -filter -dbtype 1000g_ceu ex1.human humandb/ The ex1.human file contains a few common SNPs that were annotated in the 1000G project on CEU populations. The output file ex1.human.hg18_1000g_ceu_filtered contains a list of SNPs not reported in 1000G_CEU. The output file ex1.human.hg18_1000g_ceu_dropped contains the list of SNPs that are reported and their non-reference allele frequencies (as the second column below): [kai@beta ~/]$ cat ex1.human.hg18_1000g_ceu_dropped Similarly, change 1000g_ceu to 1000g_yri and 1000g_jptchb can be used to check the allele frequencies in YRI and JPT/CHB populations. In general, these commands are highly efficient, requiring several minutes to scan 4 million input genetic variants. Additionally, users may use settings such as "-chr 1-9" and "-chr 10-22,X" to run ANNOVAR in selected chromosomes only to further speed up searchers, if knowing that input variants are in specific chromosomes only. 2. 1000 Genomes Project (2010 March/July/November and 2011 release) annotations The procedure above was originally developed for the April 2009 release of 1000G. In March 2010, a new release of 1000G data is available, so the new keyword "1000g2010" must be used, if the users want to use the new 1000G data for the annotation. Similarly, the new keyword "1000g2010jul" and "1000g2010nov" must be used to handle additional releases. The 2010 November release is no longer a pilot release, but a full project release in hg19 coordinate. [kai@beta ~/]$ annotate_variation.pl -downdb 1000g2010 humandb/ The results are below: [kai@beta ~/]$ cat ex1.human.hg18_1000g2010_ceu_dropped We will note that this result slightly differ from those generated on the 2009 release of 1000G data. This is because the 2009 and 2010 release used different genotype calling algorithms, potentially resulting in false negative calls in both sets. Users should be aware of these differences. In general, I recommend using the 2010 release, not just because it is newer, but also because it contains indel calls. Also note that as far as I can tell, 1000G 2010 March release does not provide chromosome X or chromosome Y calls for SNPs. They do provide chromosome X calls for indels. This should be kept in mind when using ANNOVAR for annotation. The 2010 July and Nov release does contain sex chromosome calls. Technical notes: Most casual ANNOVAR users can safely ignore these notes. If you want to know in detail how ANNOVAR works, read below. 1000 Genomes Pilot Proejct continuously to put out updates to their variant calls. The keyword 1000g2010jul was added in July 2010 to handle the new version of data: [kaiwang@biocluster ~/]$ annotate_variation.pl -filter -dbtype 1000g2010jul_ceu ex1.human ~/project/annotate_variation/humandb/ comparing the 2010 July and 2010 March release, we can see that one additional SNP has been detected in the July release. Similarly, the non-pilot project also released variant calls in their website (in hg19 coordinate), and users need to use 1000g2010nov to handle the new data, with the '-buildver hg19' argument as well. The 2010Nov data does not separate CEU/YRI/ASN populations though, as they are based on all 1000G populations. The dbtype should include "all" as suffix as the database are for all subjects without specific popualtion identifiers. [kaiwang@biocluster ~/]$ annotate_variation.pl -filter inputfile humandb/ -dbtype 1000g2010nov_all -buildver hg19
Technical notes: ANNOVAR has the ability to handle VCF file directly. Therefore, you do not need to rely on the datasets that I compile, you can just directly interrogate 1000G data. For example, using 2010 March release of 1000G data
ANNOVAR can identify the variant that are already reported in dbSNP and also identify the corresponding rs identifiers. This can be a filtering step, similar to what used in the exome sequening paper to exclude non-pathogenic SNPs in Miller syndrome. [kai@beta ~/]$ annotate_variation.pl -filter -dbtype snp130 ex1.human humandb/ Two output files are generated. The ex1.human.hg18_snp130_filtered file contains SNPs not in dbSNP. The ex1.human.hg18_snp130_dropped file contains variants that are annotated in dbSNP, and print out their rs identifiers (as the second column) NOTE: dbSNP 129 is generally regarded as the last "clean" dbSNP without "contamination" from 1000 Genomes Project and other large-scale next-generation sequencing projects. Many published papers utilize dbSNP129 only. NOTE: in UCSC Genome Browswer, the hg18 tracks contain information from dbSNP 126 through 130, but not 131 and above (presumably dbSNP has the same change). ANNOVAR cannot annotate dbSNP 131 in hg18 coordinate. If you absolutely want to use dbSNP 131 in hg18, you can try lifeOver, but I am not sure how reliable that will be. NOTE: The dbSNP filtering utilized the snp128, snp129, snp130, snp131, snp132, etc table from UCSC Genome Browser Database. There is typically a delay for compiling the "UCSC version" of the dbSNP, compared to dbSNP database itself. As of January 2011, snp132 is not yet available so users cannot use dbSNP 132; if a user absolutely wants to use version 132, then you can simply compile a generic db using data downloaded from dbSNP, and then use "-format generic" in ANNOVAR to perform the filtering. In the future, if 132 is made available from UCSC, ANNOVAR will be able to handle it automatically, but not now. Advanced Notes: Since January 2011, per users' request, ANNOVAR now handles tri-allelic or quad-allelic SNPs. For example, rs12931472 can have four alleles (A, C, G, T) with wildtype as A, so any non-A mutation will be filtered by ANNOVAR, and rs12931472 will be printed out during filtering. In previous versions of ANNOVAR, only di-allelic SNPs are handled. Advanced Notes: These annotations may be assigned to "SNPs" in dbSNP: 'unknown','single','in-del','het','microsatellite','named','mixed','mnp','insertion','deletion'. ANNOVAR will only care about 'single', 'deletion', 'in-del', 'insertion' and ignore others. 'single' SNP accounts for the vast majority of dbSNP entries. Now try an exercise to perform dbSNP-based annotation on the ex2.human file included in the ANNOVAR package. [kaiwang@biocluster ~/]$ annotate_variation.pl -filter -dbtype snp130 ex2.human ~/project/annotate_variation/humandb/ As we can see, the vast majority of variants identified in this human subjects are already found in dbSNP version 130. 4. SIFT/PolyPhen functional importance score annotations SIFT is a very popular non-synonymous SNP annotation software that assigns a "functional importance" score to SNPs, with a default cutoff threshold of 0.05 (SNPs with SIFT score higher than this threshold are regarded as "benign"). The SIFT website can be accessed here. The SIFT system utilize a SQLite database as background to speed up the searchers. Based on the SQL tables, I provide a text file called hg18_avsift.txt (~2GB in size) that contains whole-genome SIFT scores for all non-synonymous mutations, and implemented ANNOVAR to handle this file. Obviously, this can be only applied to human genomes, and can only be applied to hg18 (NCBI build 36) coordinates. In November 2010, I additionally provide the hg19_avsift.txt (~2GB in size) file for annotating variants in hg19 coordinate. To utilize this function, users need to first download the AVSIFT dataset. Only hg18 (NCBI 36) or hg19 (NCBI build 37) is currently supported. It does not work on any other genome build, or any model organisms! [kai@beta ~/]$ annotate_variation.pl -downdb avsift humandb/ Next, users can simply scan an input variant list file against the AVSIFT dataset, similar to the filtering procedure for 1000G and dbSNP data: [kai@beta ~/]$ annotate_variation.pl -filter -dbtype avsift ex1.human humandb/ Two output files will be generated: The first file contains variants and the corresponding SIFT scores for "benign variants" (by default, SIFT>0.05, but this can be changed using --sift_threshold argument). For the example above, only one non-synonymous variant is predicted as benign. The second file contains the remaining variants, that either have no SIFT annotations (introgenic or intronic variants), or have SIFT scores less than 0.05 (or whatever threshold that uesr specified). In most cases, users probably will want to supply only non-synonymous SNPs to this procedure (for example, running the --geneanno operation to generate a list of exonic variants first, before running AVSIFT annotation). When using "-sift_threshold 0" argument, all non-synonymous variants (but excluding indels, excluding non-sense mutations) and their SIFT scores will be printed out in the "dropped" output file. [kai@beta ~/]$ annotate_variation.pl -filter -dbtype avsift ex1.human humandb/ -sift 0 The ex1.human.hg18_avsift_dropped file contains all four non-synonymous SNPs, yet the ex1.human.hg18_avsift_filtered file contains all other variants (insertions, deletions, intronic SNPs, intergenic ones and block substitutions). PolyPhen2, MutationTaster, LRT, PhyloP In May 2011, several scoring files for exome data analysis have been made available in ANNOVAR, including pre-computed PolyPhen v2, MutationTaster, LRT, PhyloP scores. Use "-downdb ljb_pp2 -webfrom annovar", "-downdb ljb_lrt -webfrom annovar", "-downdb ljb_mt -webfrom annovar", "-downdb ljb_phylop -webfrom annovar" to download them. Add "-buildver hg19" to download them in hg19 coordinate. The annotation database ljb refers to Liu, Jian, Boerwinkle paper in Human Mutation with pubmed ID 21520341. Cite this paper if you use the scores. However, unlike SIFT, here higher scores (0-1) represent functionally more important predictions!!! Unlike SIFT (which has -sift_threshold 0.05 by default), there is no such "default" scores for these annotations here. Read the paper to understand appropriate threshold to use for each score. When running the program below, any variants with PolyPhen2 annotation will be printed out in the DROP file, and all other variants are in the FILTERED file: [kaiwang@biocluster ~/]$ annotate_variation.pl ex1.human humandb/ -filter -dbtype ljb_pp2 If your goal is to put all "benign" variants into the DROP file, and all "important" variants or variants without PolyPhen score into the FILTERED file, then the --reverse argument need to be specified (this tells ANNOVAR that instead of writting variants with higher score to DROP file, writting variants with lower score to the file): [kaiwang@biocluster ~/]$ annotate_variation.pl ex1.human humandb/ -filter -dbtype ljb_pp2 -reverse -score_threshold 0.85 So the DROP file contains "benign" variants with (score<=0.85).
5. Generic mutation annotations Users have the flexibility to supply a custom-made annotation file, and let ANNOVAR perform filter-based annotation on this annotation file. The "-dbtype" should be specified as "generic". The file format is very simple: first five columns are chr, start, end, reference allele, observed allele, the sixth column (functional score) is optional, other columns are optional as well asn will be ignored. The first ten lines of an example database file is given below: [kai@beta ~/]$ head humandb/hg18_example_db_generic.txt Running the above command, we can see that one variant is present in the generic database hg18_example_db_generic.txt, and its function score (as 0) is printed as the second column in the output file ex1.human.hg18_generic_dropped. When the database file contains the functional importance score column (sixth column in the generic format files), users can use --score_threshold argument to control output: only database records with a higher functional score will be used in filtering query variants. For example, when adding "-score_threshold 0.5" argument to the above command, the output file will be empty. It is also possible to treat "avsift" files as if it is a generic database file, and use the "--dbtype generic --genericdbfile hg18_avsift.txt --score_threshold 0.05" argument to perform SIFT-based annotations of variants. It is also possible to treat ljb_pp2, ljb_lrt, etc as if they are generic database file, and use the "-dbtype generic -genericdbfile hg18_ljb_pp2.txt" argument to perform PolyPhen annotation of variants. It is also possible to make two files that each contains a list of variants (five columns in each file), then scan one file against the other, to find the shared variants. This is useful, for example, when comparing two genotype-calling algorithms on the same set of alignment data.
|
||||||||||||