ANNOVAR
Home
Download
Quick Start-up Guide
Prepare Database
Prepare Input File
Annotation
Gene-based
Region-based
Filter-based
Accessary Programs
FAQ

ANNOVAR Function III: Filter-based annotation

  1. 1000 Genomes Project (2009 release) annotations
  2. 1000 Genomes Project (2010 March/July/November and 2011 release) annotations
  3. dbSNP annotations
  4. SIFT/PolyPhen functional importance score annotations
  5. Generic mutation annotations

An important and probably highly desirable feature is that ANNOVAR can help identify subsets of variants based on comparison to other variant databases, for example, variants annotated in dbSNP or variants annotated in 1000 Genome Project. In addition, I provide a "avsift" data set, which contains pre-computed SIFT scores for all possible whole-genome non-synonymous mutations, so that users can scan non-synonymous mutations against avsift to filer out "benign" mutations. The exact variant, with same start and end positions, and with same observed alleles, will be identified.

These functionalities mentioned above can be performed using the --filter operation in ANNOVAR. The major difference between --filter and --regionanno above is that that --filter operation works on mutations (nucleotide changes), but --regionanno operation works on chromosome locations. For example, --region compare variants with things like chr1:1000-1000, but --filter compare variants with things like A->G change at the position chr1:1000-1000.

1. 1000 Genomes Project (2009 release) annotations

ANNOVAR can annotate variants based on annotated allele frequency in CEU populations used in the 1000 Genome project. This analysis requires downloading the annotation files from the 1000Genome project website (the command is "annotate_variation.pl -downdb 1000g humandb/") The current version of program download the 2009 April annotations.

[kai@beta ~/]$ annotate_variation.pl -filter -dbtype 1000g_ceu ex1.human humandb/
NOTICE: The --buildver is set as 'hg18' by default
NOTICE: Variants matching filtering criteria are written to ex1.human.hg18_1000g_ceu_dropped, other variants are written to ex1.human.hg18_1000g_ceu_filtered
NOTICE: Processing next batch with 12 variants
NOTICE: Scanning filter database humandb/hg18_CEU.sites.2009_04.txt...Done

The ex1.human file contains a few common SNPs that were annotated in the 1000G project on CEU populations. The output file ex1.human.hg18_1000g_ceu_filtered contains a list of SNPs not reported in 1000G_CEU. The output file ex1.human.hg18_1000g_ceu_dropped contains the list of SNPs that are reported and their non-reference allele frequencies (as the second column below):

[kai@beta ~/]$ cat ex1.human.hg18_1000g_ceu_dropped
1000g_ceu 0.017544 1 67478546 67478546 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
1000g_ceu 0.482456 1 84647761 84647761 C T comments: rs6576700 or SNP_A-1780419, a SNP in Affymetrix SNP arrays
1000g_ceu 0.877193 1 161003087 161003087 C T comments: rs1000050, a SNP in Illumina SNP arrays

Similarly, change 1000g_ceu to 1000g_yri and 1000g_jptchb can be used to check the allele frequencies in YRI and JPT/CHB populations.

In general, these commands are highly efficient, requiring several minutes to scan 4 million input genetic variants. Additionally, users may use settings such as "-chr 1-9" and "-chr 10-22,X" to run ANNOVAR in selected chromosomes only to further speed up searchers, if knowing that input variants are in specific chromosomes only.

2. 1000 Genomes Project (2010 March/July/November and 2011 release) annotations

The procedure above was originally developed for the April 2009 release of 1000G. In March 2010, a new release of 1000G data is available, so the new keyword "1000g2010" must be used, if the users want to use the new 1000G data for the annotation. Similarly, the new keyword "1000g2010jul" and "1000g2010nov" must be used to handle additional releases. The 2010 November release is no longer a pilot release, but a full project release in hg19 coordinate.

[kai@beta ~/]$ annotate_variation.pl -downdb 1000g2010 humandb/

[kai@beta ~/]$ annotate_variation.pl -filter -dbtype 1000g2010_ceu ex1.human humandb/
NOTICE: The --buildver is set as 'hg18' by default
NOTICE: Variants matching filtering criteria are written to ex1.human.hg18_1000g2010_ceu_dropped, other variants are written to ex1.human.hg18_1000g2010_ceu_filtered
NOTICE: Processing next batch with 12 variants
NOTICE: Scanning filter database humandb/hg18_CEU.sites.2010_03.txt...Done

The results are below:

[kai@beta ~/]$ cat ex1.human.hg18_1000g2010_ceu_dropped
1000g2010_ceu 0.508 1 84647761 84647761 C T comments: rs6576700 or SNP_A-1780419, a SNP in Affymetrix SNP arrays
1000g2010_ceu 0.883 1 161003087 161003087 C T comments: rs1000050, a SNP in Illumina SNP arrays
1000g2010_ceu 0.508 2 233848107 233848107 A G comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease
1000g2010_ceu 0.083 16 49303427 49303427 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2

We will note that this result slightly differ from those generated on the 2009 release of 1000G data. This is because the 2009 and 2010 release used different genotype calling algorithms, potentially resulting in false negative calls in both sets. Users should be aware of these differences. In general, I recommend using the 2010 release, not just because it is newer, but also because it contains indel calls.

Also note that as far as I can tell, 1000G 2010 March release does not provide chromosome X or chromosome Y calls for SNPs. They do provide chromosome X calls for indels. This should be kept in mind when using ANNOVAR for annotation. The 2010 July and Nov release does contain sex chromosome calls.

Technical notes: Most casual ANNOVAR users can safely ignore these notes. If you want to know in detail how ANNOVAR works, read below.

Some users may wonder how exactly ANNOVAR handles the different variant call sets in the 1000 Genomes Project. In March 2010 release of 1000 Genomes Project Pilot data, a SNP call file and an indel call file are provided for each of the three HapMap populations. The SNP call file contains consensus calls from 3 groups: a SNP is called only if at least 2 out of 3 groups call it. The indel call file contains many indel calls and it is generated by "Dindel on MAQ Illumina only BAMs ... using candidate indel set". The vast majority of indels (>90%) are annotated with QC-flags, so I decided to take only the most confident set of indels (that is, indels without any QC flags), and then generate a new "sites" file that contains both SNPs and indels. This is what users will get when "-downdb 1000g2010" is used in ANNOVAR (for example, hg18_CEU.sites.2010_03.txt, etc.). Unlike the 2009 release of 1000G data, these files are not downloaded from 1000G directly but from ANNOVAR website.

The counts of SNPs and indels in March 2010 release in each of the HapMap populations are given below:

CEU: SNPs=7,725,713 indels=751,528 total=8,477,241
YRI: SNPs=10,556,876 indels=978,444 total=11,535,320
JPTCHB: SNPs= 6,109,233 indels=687,884 total=6,797,117

In comparison, the counts of SNPs in Aprial 2009 release in each of the HapMap populations are given below:

CEU: SNPs=9,633,115
YRI: SNPs=13,759,844
JPTCHB: SNPs=10,970,708

So it appears that 2010 release contains fewer number of SNPs compared to the 2009 release, though it provides indel calls.

The next natural question is how much overlap there is between the 2009 release and the 2010 release. This can be done easily by ANNOVAR. Just make an ANNOVAR input file using 2009 release data, and then scan the 2010 release using the -filter operation in ANNOVAR (command line is "annotate_variation.pl -filter -dbtype 1000g2010_ceu ex10.human humandb/"). It takes just 4 minutes in my computer for CEU population, while adding "-batchsize 20m" argument further improves the speed.

CEU: overlapping SNPs= 6,145,332
YRI: overlapping SNPs=9,155,523
JPTCHB: overlapping SNPs=5,393,065

So the overlap with 2009 release is around 80%-90% for SNPs in the 1000G 2010 release, for each of the three populations.

1000 Genomes Pilot Proejct continuously to put out updates to their variant calls. The keyword 1000g2010jul was added in July 2010 to handle the new version of data:

[kaiwang@biocluster ~/]$ annotate_variation.pl -filter -dbtype 1000g2010jul_ceu ex1.human ~/project/annotate_variation/humandb/

[kaiwang@biocluster ~/]$ cat ex1.human.hg18_CEU.sites.2010_07_dropped
1000g2010jul_ceu 0.508 1 84647761 84647761 C T comments: rs6576700 or SNP_A-1780419, a SNP in Affymetrix SNP arrays
1000g2010jul_ceu 0.883 1 161003087 161003087 C T comments: rs1000050, a SNP in Illumina SNP arrays
1000g2010jul_ceu 0.508 2 233848107 233848107 A G comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease
1000g2010jul_ceu 0.083 16 49303427 49303427 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2

comparing the 2010 July and 2010 March release, we can see that one additional SNP has been detected in the July release.

Similarly, the non-pilot project also released variant calls in their website (in hg19 coordinate), and users need to use 1000g2010nov to handle the new data, with the '-buildver hg19' argument as well. The 2010Nov data does not separate CEU/YRI/ASN populations though, as they are based on all 1000G populations.

The dbtype should include "all" as suffix as the database are for all subjects without specific popualtion identifiers.

[kaiwang@biocluster ~/]$ annotate_variation.pl -filter inputfile humandb/ -dbtype 1000g2010nov_all -buildver hg19

[kaiwang@biocluster ~/]$ annotate_variation.pl -filter inputfile humandb/ -dbtype 1000g2011may_all -buildver hg19

 

 

Technical notes: ANNOVAR has the ability to handle VCF file directly. Therefore, you do not need to rely on the datasets that I compile, you can just directly interrogate 1000G data. For example, using 2010 March release of 1000G data

[kai@beta ~/]$ annotate_variation.pl -filter -dbtype vcf -vcfdbfile hg18_CEU.SRP000031.2010_03.sites.vcf.txt ex1.human humandb/
[kai@beta ~/]$ cat ex1.human.hg18_vcf_dropped
vcf 0.508333333333333 1 84647761 84647761 C T comments: rs6576700 or SNP_A-1780419, a SNP in Affymetrix SNP arrays
vcf 0.883333333333333 1 161003087 161003087 C T comments: rs1000050, a SNP in Illumina SNP arrays
vcf 0.0833333333333333 16 49303427 49303427 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2


You can do the same thing for 2011 May release of 1000G data:

[kaiwang@biocluster ~/]$ annotate_variation.pl inputfile ./ -vcfdbfile ALL.wgs.phase1.projectConsensus.snps.sites.vcf -filter -dbtype vcf

So there is no real need to wait for me to update the latest 1000G data in ANNOVAR. Any user can just run ANNOVAR on VCF file downloaded from 1000G yourself.

This is especially important for indels: when I compile indels, I ONLY USE THE MOST CONFIDENT CALLS FROM 1000G THAT PASS ALL FILTERS. Some users may want to scan all potential indels called by 1000G regardless of how confident it is, and in this case, it is best that you ues the VCF file directly in ANNOVAR for annotation.

 

3. dbSNP annotations

ANNOVAR can identify the variant that are already reported in dbSNP and also identify the corresponding rs identifiers. This can be a filtering step, similar to what used in the exome sequening paper to exclude non-pathogenic SNPs in Miller syndrome.

[kai@beta ~/]$ annotate_variation.pl -filter -dbtype snp130 ex1.human humandb/
NOTICE: The --buildver is set as 'hg18' by default
NOTICE: Variants matching filtering criteria are written to ex1.human.hg18_snp130_dropped, other variants are written to ex1.human.hg18_snp130_filtered
NOTICE: Processing next batch with 12 variants
NOTICE: Scanning filter database humandb/hg18_snp130.txt...Done

[kai@beta ~/]$ cat ex1.human.hg18_snp130_dropped
snp130 rs35561142 1 11326183 11326183 - AT comments: rs35561142, a 2-bp insertion
snp130 rs59770105 1 13133880 13133881 TC - comments: rs59770105, a 2-bp deletion
snp130 rs11209026 1 67478546 67478546 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
snp130 rs6576700 1 84647761 84647761 C T comments: rs6576700 or SNP_A-1780419, a SNP in Affymetrix SNP arrays
snp130 rs1000050 1 161003087 161003087 C T comments: rs1000050, a SNP in Illumina SNP arrays
snp130 rs1801002 13 19661686 19661686 G - comments: rs1801002 (del35G), a frameshift mutation in GJB2, associated with hearing loss
snp130 rs2066844 16 49303427 49303427 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
snp130 rs2066845 16 49314041 49314041 G C comments: rs2066845 (G908R), a non-synonymous SNP in NOD2
snp130 rs2066847 16 49321279 49321279 - C comments: rs2066847 (c.3016_3017insC), a frameshift SNP in NOD2

Two output files are generated. The ex1.human.hg18_snp130_filtered file contains SNPs not in dbSNP. The ex1.human.hg18_snp130_dropped file contains variants that are annotated in dbSNP, and print out their rs identifiers (as the second column)

NOTE: dbSNP 129 is generally regarded as the last "clean" dbSNP without "contamination" from 1000 Genomes Project and other large-scale next-generation sequencing projects. Many published papers utilize dbSNP129 only.

NOTE: in UCSC Genome Browswer, the hg18 tracks contain information from dbSNP 126 through 130, but not 131 and above (presumably dbSNP has the same change). ANNOVAR cannot annotate dbSNP 131 in hg18 coordinate. If you absolutely want to use dbSNP 131 in hg18, you can try lifeOver, but I am not sure how reliable that will be.

NOTE: The dbSNP filtering utilized the snp128, snp129, snp130, snp131, snp132, etc table from UCSC Genome Browser Database. There is typically a delay for compiling the "UCSC version" of the dbSNP, compared to dbSNP database itself. As of January 2011, snp132 is not yet available so users cannot use dbSNP 132; if a user absolutely wants to use version 132, then you can simply compile a generic db using data downloaded from dbSNP, and then use "-format generic" in ANNOVAR to perform the filtering. In the future, if 132 is made available from UCSC, ANNOVAR will be able to handle it automatically, but not now.

Advanced Notes: Since January 2011, per users' request, ANNOVAR now handles tri-allelic or quad-allelic SNPs. For example, rs12931472 can have four alleles (A, C, G, T) with wildtype as A, so any non-A mutation will be filtered by ANNOVAR, and rs12931472 will be printed out during filtering. In previous versions of ANNOVAR, only di-allelic SNPs are handled.

Advanced Notes: These annotations may be assigned to "SNPs" in dbSNP: 'unknown','single','in-del','het','microsatellite','named','mixed','mnp','insertion','deletion'. ANNOVAR will only care about 'single', 'deletion', 'in-del', 'insertion' and ignore others. 'single' SNP accounts for the vast majority of dbSNP entries.

Now try an exercise to perform dbSNP-based annotation on the ex2.human file included in the ANNOVAR package.

[kaiwang@biocluster ~/]$ annotate_variation.pl -filter -dbtype snp130 ex2.human ~/project/annotate_variation/humandb/
[kaiwang@biocluster ~/]$ wc -l ex2.human.hg18_snp130_dropped
4065419 ex2.human.hg18_snp130_dropped
[kaiwang@biocluster ~/]$ wc -l ex2.human.hg18_snp130_filtered
635937 ex2.human.hg18_snp130_filtered

As we can see, the vast majority of variants identified in this human subjects are already found in dbSNP version 130.

4. SIFT/PolyPhen functional importance score annotations

SIFT is a very popular non-synonymous SNP annotation software that assigns a "functional importance" score to SNPs, with a default cutoff threshold of 0.05 (SNPs with SIFT score higher than this threshold are regarded as "benign"). The SIFT website can be accessed here. The SIFT system utilize a SQLite database as background to speed up the searchers. Based on the SQL tables, I provide a text file called hg18_avsift.txt (~2GB in size) that contains whole-genome SIFT scores for all non-synonymous mutations, and implemented ANNOVAR to handle this file. Obviously, this can be only applied to human genomes, and can only be applied to hg18 (NCBI build 36) coordinates. In November 2010, I additionally provide the hg19_avsift.txt (~2GB in size) file for annotating variants in hg19 coordinate.

To utilize this function, users need to first download the AVSIFT dataset. Only hg18 (NCBI 36) or hg19 (NCBI build 37) is currently supported. It does not work on any other genome build, or any model organisms!

[kai@beta ~/]$ annotate_variation.pl -downdb avsift humandb/

Next, users can simply scan an input variant list file against the AVSIFT dataset, similar to the filtering procedure for 1000G and dbSNP data:

[kai@beta ~/]$ annotate_variation.pl -filter -dbtype avsift ex1.human humandb/
NOTICE: The --buildver is set as 'hg18' by default
NOTICE: Variants matching filtering criteria are written to ex1.human.hg18_avsift_dropped, other variants are written to ex1.human.hg18_avsift_filtered
NOTICE: Processing next batch with 12 variants
NOTICE: Scanning filter database humandb/hg18_avsift.txt...Done

[kai@beta ~/]$ cat ex1.human.hg18_avsift_dropped
avsift 0.06 2 233848107 233848107 T C comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease

Two output files will be generated: The first file contains variants and the corresponding SIFT scores for "benign variants" (by default, SIFT>0.05, but this can be changed using --sift_threshold argument). For the example above, only one non-synonymous variant is predicted as benign. The second file contains the remaining variants, that either have no SIFT annotations (introgenic or intronic variants), or have SIFT scores less than 0.05 (or whatever threshold that uesr specified). In most cases, users probably will want to supply only non-synonymous SNPs to this procedure (for example, running the --geneanno operation to generate a list of exonic variants first, before running AVSIFT annotation).

When using "-sift_threshold 0" argument, all non-synonymous variants (but excluding indels, excluding non-sense mutations) and their SIFT scores will be printed out in the "dropped" output file.

[kai@beta ~/]$ annotate_variation.pl -filter -dbtype avsift ex1.human humandb/ -sift 0
NOTICE: The --buildver is set as 'hg18' by default
NOTICE: Variants matching filtering criteria are written to ex1.human.hg18_avsift_dropped, other variants are written to ex1.human.hg18_avsift_filtered
NOTICE: Processing next batch with 12 variants
NOTICE: Scanning filter database humandb/hg18_avsift.txt...Done

[kai@beta ~/]$ cat ex1.human.hg18_avsift_dropped
avsift 0 1 67478546 67478546 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
avsift 0.06 2 233848107 233848107 T C comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease
avsift 0 16 49303427 49303427 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
avsift 0 16 49314041 49314041 G C comments: rs2066845 (G908R), a non-synonymous SNP in NOD2

[kai@beta ~/]$ cat ex1.human.hg18_avsift_filtered
13 19695176 20003944 0 - comments: a 342kb deletion encompassing GJB6, associated with hearing loss
1 11326183 11326183 - AT comments: rs35561142, a 2-bp insertion
1 13133880 13133881 TC - comments: rs59770105, a 2-bp deletion
1 161003087 161003087 C T comments: rs1000050, a SNP in Illumina SNP arrays
13 19661686 19661686 G - comments: rs1801002 (del35G), a frameshift mutation in GJB2, associated with hearing loss
16 49321279 49321279 - C comments: rs2066847 (c.3016_3017insC), a frameshift SNP in NOD2
1 84647761 84647761 C T comments: rs6576700 or SNP_A-1780419, a SNP in Affymetrix SNP arrays
1 105293754 105293754 A ATAAA comments: rs10552169, a block substitution

The ex1.human.hg18_avsift_dropped file contains all four non-synonymous SNPs, yet the ex1.human.hg18_avsift_filtered file contains all other variants (insertions, deletions, intronic SNPs, intergenic ones and block substitutions).

PolyPhen2, MutationTaster, LRT, PhyloP

In May 2011, several scoring files for exome data analysis have been made available in ANNOVAR, including pre-computed PolyPhen v2, MutationTaster, LRT, PhyloP scores. Use "-downdb ljb_pp2 -webfrom annovar", "-downdb ljb_lrt -webfrom annovar", "-downdb ljb_mt -webfrom annovar", "-downdb ljb_phylop -webfrom annovar" to download them. Add "-buildver hg19" to download them in hg19 coordinate. The annotation database ljb refers to Liu, Jian, Boerwinkle paper in Human Mutation with pubmed ID 21520341. Cite this paper if you use the scores. However, unlike SIFT, here higher scores (0-1) represent functionally more important predictions!!! Unlike SIFT (which has -sift_threshold 0.05 by default), there is no such "default" scores for these annotations here. Read the paper to understand appropriate threshold to use for each score.

When running the program below, any variants with PolyPhen2 annotation will be printed out in the DROP file, and all other variants are in the FILTERED file:

[kaiwang@biocluster ~/]$ annotate_variation.pl ex1.human humandb/ -filter -dbtype ljb_pp2
NOTICE: The --buildver is set as 'hg18' by default
NOTICE: Variants matching filtering criteria are written to ex1.human.hg18_ljb_pp2_dropped, other variants are written to ex1.human.hg18_ljb_pp2_filtered
NOTICE: Processing next batch with 12 unique variants in 12 input lines
NOTICE: Scanning filter database humandb/hg18_ljb_pp2.txt...Done

[kaiwang@biocluster ~/]$ cat ex1.human.hg18_ljb_pp2_dropped
ljb_pp2 0.999 1 67478546 67478546 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
ljb_pp2 0.0 2 233848107 233848107 A G comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease
ljb_pp2 0.97 16 49303427 49303427 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
ljb_pp2 0.508 16 49314041 49314041 G C comments: rs2066845 (G908R), a non-synonymous SNP in NOD2

If your goal is to put all "benign" variants into the DROP file, and all "important" variants or variants without PolyPhen score into the FILTERED file, then the --reverse argument need to be specified (this tells ANNOVAR that instead of writting variants with higher score to DROP file, writting variants with lower score to the file):

[kaiwang@biocluster ~/]$ annotate_variation.pl ex1.human humandb/ -filter -dbtype ljb_pp2 -reverse -score_threshold 0.85
NOTICE: The --buildver is set as 'hg18' by default
NOTICE: Variants matching filtering criteria are written to ex1.human.hg18_ljb_pp2_dropped, other variants are written to ex1.human.hg18_ljb_pp2_filtered
NOTICE: Processing next batch with 12 unique variants in 12 input lines
NOTICE: Scanning filter database humandb/hg18_ljb_pp2.txt... Done

[kaiwang@biocluster ~/]$ cat ex1.human.hg18_ljb_pp2_dropped
ljb_pp2 0.0 2 233848107 233848107 A G comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease
ljb_pp2 0.508 16 49314041 49314041 G C comments: rs2066845 (G908R), a non-synonymous SNP in NOD2

So the DROP file contains "benign" variants with (score<=0.85).

 

5. Generic mutation annotations

Users have the flexibility to supply a custom-made annotation file, and let ANNOVAR perform filter-based annotation on this annotation file. The "-dbtype" should be specified as "generic". The file format is very simple: first five columns are chr, start, end, reference allele, observed allele, the sixth column (functional score) is optional, other columns are optional as well asn will be ignored. The first ten lines of an example database file is given below:

[kai@beta ~/]$ head humandb/hg18_example_db_generic.txt
16 49303382 49303382 T A 1
16 49303382 49303382 T C 1
16 49303382 49303382 T G 0.04
16 49303383 49303383 G A 0.04
16 49303383 49303383 G C 0.05
16 49303383 49303383 G T 0.05
16 49303384 49303384 G A 0.05
16 49303384 49303384 G C 0.01
16 49303384 49303384 G T 0.01
16 49303385 49303385 G A 0.71

[kai@beta ~/]$ annotate_variation.pl -filter -dbtype generic -genericdbfile hg18_example_db_generic.txt ex1.human humandb/
NOTICE: The --buildver is set as 'hg18' by default
NOTICE: Variants matching filtering criteria are written to ex1.human.hg18_generic_dropped, other variants are written to ex1.human.hg18_generic_filtered
NOTICE: Processing next batch with 12 variants
NOTICE: Scanning filter database humandb/hg18_example_db_generic.txt...Done

[kai@beta ~/]$ cat ex1.human.hg18_generic_dropped
generic 0 16 49303427 49303427 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2

Running the above command, we can see that one variant is present in the generic database hg18_example_db_generic.txt, and its function score (as 0) is printed as the second column in the output file ex1.human.hg18_generic_dropped.

When the database file contains the functional importance score column (sixth column in the generic format files), users can use --score_threshold argument to control output: only database records with a higher functional score will be used in filtering query variants. For example, when adding "-score_threshold 0.5" argument to the above command, the output file will be empty.

It is also possible to treat "avsift" files as if it is a generic database file, and use the "--dbtype generic --genericdbfile hg18_avsift.txt --score_threshold 0.05" argument to perform SIFT-based annotations of variants.

It is also possible to treat ljb_pp2, ljb_lrt, etc as if they are generic database file, and use the "-dbtype generic -genericdbfile hg18_ljb_pp2.txt" argument to perform PolyPhen annotation of variants.

It is also possible to make two files that each contains a list of variants (five columns in each file), then scan one file against the other, to find the shared variants. This is useful, for example, when comparing two genotype-calling algorithms on the same set of alignment data.