|
ANNOVAR Function II: Region-based annotation
- Most conserved element annotation
- Transcription factor binding site annotation
- Identify cytogenetic band for genetic variants
- Identify variants located in segmental duplications
- Identify previously reported structural variants in DGV (Database of Genomic Variants)
- Identify variants reported in previously published GWAS (Genome-wide association studies
- Identify variants in ENCODE annotated regions (transcribed regions, H3K4Me1 regions, H3K4Me3 regions, H3K27Ac regions, DNaseI hypersensitivity regions, transcription factor ChIP-Seq regions, etc)
- Identify dbSNP variants in user-specified regions
- Identify variants in other genomic regions annotated with other functions
- Annotating custom-made databases conforming to GFF3 (Generic Feature Format version 3)
- Identifying variants specified in regions in BED file
Besides gene-based annotations, ANNOVAR has several other utilities, such as region-based annotation. This function is issued by the --regionanno argument (by default, --geneanno is ON)
One of the particularly useful function is to identify variants at conserved genomic regions. For this analysis, users can choose several different annotation tracks. For example, when handling human genomes, users may want to choose such as the 17-way or 28-way or 44-way most conserved track (for NCBI 36 genome assembly). ANNOVAR will attempt to identify the subset of variants that either fall within the conserved regions (for SNPs and short in-dels), or overlap with these conserved regions (for large-scale CNVs, and the extent of overlap is user-configurable).
Similarly, users can select other annotation tracks that are already provided by the UCSC Genome Browser annotation databases. Most of these annotation tracks have similar file formats, but sometimes they differ (for example, different number of columns in the file). ANNOVAR will try to be smart in guessing the correct column headers, and usually it works well. Some of the particularly interesting annotation tracks include microRNA, predicted transcription factor binding sites, predicted microRNA targets, predicted long RNAs, predicted evolutionarily conserved RNA secondary structures, etc. Several examples are shown below:
Most conserved element annotation
Here ANNOVAR uses phastCons 44-way alignments to annotate variants that fall within conserved genomic regions. Here the --regionanno argument need to be supplied so that the program knows what to do. In addition, the --dbtype need to be specified so that the program knows which annotation database file to interrogate. Make sure that the annotation database is already downloaded (the command is " annotate_variation.pl -downdb mce44way humandb/ ").
[kai@beta ~/]$ annotate_variation.pl -downdb mce44way humandb/
[kai@beta ~/]$ annotate_variation.pl -regionanno -dbtype mce44way ex1.human humandb/
NOTICE: The --buildver is set as 'hg18' by default
NOTICE: Reading annotation database humandb/hg18_phastConsElements44way.txt ... Done with 4878296 regions
NOTICE: Finished region-based annotation on 12 genetic variants in ex1.human
NOTICE: Output files were written to ex1.human.hg18_mce44way
[kai@beta ~/]$ cat ex1.human.hg18_mce44way
[kaiwang@biocluster ~/]$ cat ex1.human.hg18_phastConsElements44way
mce44way Score=388;Name=lod=50 1 67478546 67478546 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
mce44way Score=421;Name=lod=68 16 49314041 49314041 G C comments: rs2066845 (G908R), a non-synonymous SNP in NOD2
mce44way Score=392;Name=lod=52 16 49321279 49321279 - C comments: rs2066847 (c.3016_3017insC), a frameshift SNP in NOD2
mce44way Score=381;Name=lod=47 13 19661686 19661686 G - comments: rs1801002 (del35G), a frameshift mutation in GJB2, associated with hearing loss
mce44way Score=566;Name=lod=262 13 19695176 20003944 0 - comments: a 342kb deletion encompassing GJB6, associated with hearing loss
The output is saved in the varlist.hg18_phastConsElements44way file. The first column in the output is "mce44way" indicating the type of annotation. The second column is the normalized score assigned by UCSC Genome Browser, and this score range from 0 to 1000. (Note that the --score_threshold or --normscore_threshold can also be used to filter out specific variants with low conservation scores.) The second column also shows "Name=lod=x" which is used to tell the user the LOD score for the region. All other columns are identical as those in the input file. Only variants that actually are located within a conserved region will be printed in the output file. As a result, only 5 variants are in the output file.
Technical Notes: The exact columns that should be in output file (Name=?) can be controlled by the -colWanted argument. The exact column for the scores (Score=?) can be controlled by the -scorecolumn argument. For some annotation (such as most conserved elements), ANNOVAR already sets up default values, such as in the example above. For some other less used annotation, the default from ANNOVAR may not work for the user, and you can use the -colWanted and -scorecolumn argument to fine-tune the annotation output.
For other genome build, the user needs to specify the correct mceXway argument. For example, when dealing with hg19 coordinate, the -dbtype should be probably set as mce46way (this depends on the user's preference of course).
Transcription factor binding site annotation
Similarly, annotation of TFBS can be done by the commands below, assuming that database is already downloaded (the command is " annotate_variation.pl -downdb tfbs humandb/ "). Users should be aware that there are many different types of TFBS annotations that ANNOVAR can use. This is merely the one that ANNOVAR uses the "tfbs" keyword. See FAQ entry for more explanation.
[kai@beta ~/]$ annotate_variation.pl -downdb tfbs humandb/
[kai@beta ~/]$ annotate_variation.pl -dbtype tfbs -regionanno ex1.human humandb/
NOTICE: The --buildver is set as 'hg18' by default
NOTICE: Reading annotation database humandb/hg18_tfbsConsSites.txt ... Done with 3837187 regions
NOTICE: Finished region-based annotation on 12 genetic variants in ex1.human
NOTICE: Output files were written to ex1.human.hg18_tfbs
[kai@beta ~/]$ cat ex1.human.hg18_tfbsConsSites
tfbs Score=878;Name=V$FREAC3_01 13 19661686 19661686 G - comments: rs1801002 (del35G), a frameshift mutation in GJB2, associated with hearing loss
tfbs Score=1000;Name=V$AML1_01,V$MZF1_01 13 19695176 20003944 0 - comments: a 342kb deletion encompassing GJB6, associated with hearing loss
The second column represents the normalized score and transcription factor name. Note that a deletion within the GJB2 gene is annotated as TFBS site, but this is what is annotated in the annotation database (it use a position-weight matrices to predict binding sites). So some caution should be taken when interpreting the results and users may want to filter the prediction data (for example, only consider variants in 1kb upstream regions of genes as truly disrupting potential TFBS).
Identify cytogenetic band for genetic variants
To identify Giemsa-stained chromosomes bands, the "-dbtype band" can be used. The second column in the output file below represent cytogenetic bands. When a variant spans multiple bands, they will be connected by a dash (for example, 1q21.1-q23.3).
[kai@beta ~/]$ annotate_variation.pl -downdb band humandb/
[kai@beta ~/]$ annotate_variation.pl -dbtype band -regionanno ex1.human humandb/
NOTICE: The --buildver is set as 'hg18' by default
NOTICE: Reading annotation database humandb/hg18_cytoBand.txt ... Done with 862 regions
NOTICE: Finished region-based annotation on 12 genetic variants in ex1.human
NOTICE: Output files were written to ex1.human.hg18_band
[kai@beta ~/]$ cat ex1.human.hg18_cytoBand
band 1q23.3 1 161003087 161003087 C T comments: rs1000050, a SNP in Illumina SNP arrays
band 1p31.1 1 84647761 84647761 C T comments: rs6576700 or SNP_A-1780419, a SNP in Affymetrix SNP arrays
band 1p36.21 1 13133880 13133881 TC - comments: rs59770105, a 2-bp deletion
band 1p36.22 1 11326183 11326183 - AT comments: rs35561142, a 2-bp insertion
band 1p21.1 1 105293754 105293754 A ATAAA comments: rs10552169, a block substitution
band 1p31.3 1 67478546 67478546 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
band 2q37.1 2 233848107 233848107 T C comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease
band 16q12.1 16 49303427 49303427 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
band 16q12.1 16 49314041 49314041 G C comments: rs2066845 (G908R), a non-synonymous SNP in NOD2
band 16q12.1 16 49321279 49321279 - C comments: rs2066847 (c.3016_3017insC), a frameshift SNP in NOD2
band 13q12.11 13 19661686 19661686 G - comments: rs1801002 (del35G), a frameshift mutation in GJB2, associated with hearing loss
band 13q12.11 13 19695176 20003944 0 - comments: a 342kb deletion encompassing GJB6, associated with hearing loss
Identify variants located in segmental duplications
Genetic variants that are mapped to segmental duplications are most likely sequence alignment errors and should be treated with extreme caution. Sometimes they manifest as SNPs with high fold coverage and probably high confidence score, but they may actually represent two non-polymorphic sites in the genomes that happen to have the same flanking sequence. To identify variants in these regions, use the command below. Again the first command download annotation databases, yet the second command identify variants in segmental duplications.
[kai@beta ~/]$ annotate_variation.pl -downdb segdup humandb/
[kai@beta ~/]$ annotate_variation.pl -dbtype segdup -regionanno ex1.human humandb/
NOTICE: The --buildver is set as 'hg18' by default
NOTICE: Reading annotation database humandb/hg18_genomicSuperDups.txt ... Done with 51809 regions
NOTICE: Finished region-based annotation on 12 genetic variants in ex1.human
NOTICE: Output files were written to ex1.human.hg18_segdup
[kai@beta ~/]$ cat ex1.human.hg18_genomicSuperDups
segdup Score=0.996154;Name=1:13065149 1 13133880 13133881 TC - comments: rs59770105, a 2-bp deletion
The "Name" field in output represents the other "matching" segment in genome (which is located in the same chromosome at chr1:13065149). The "Score" field is the sequence identity with indels between two genomic segments (fracMatchIndel in the UCSC database table).
Identify previously reported structural variants in DGV (Database of Genomic Variants)
By using the --regionanno operation and the --dbtype of dgv, ANNOVAR can also conveniently annotate deletions and duplications and compare them to previously published variants. For example,
[kai@beta ~/]$ annotate_variation.pl -downdb dgv humandb/
[kai@beta ~/]$ annotate_variation.pl -regionanno -dbtype dgv ex1.human humandb/
NOTICE: The --buildver is set as 'hg18' by default
NOTICE: Reading annotation database humandb/hg18_dgv.txt ... Done with 49988 regions
NOTICE: Finished region-based annotation on 12 genetic variants in ex1.human
NOTICE: Output files were written to ex1.human.hg18_dgv
[kai@beta ~/]$ cat ex1.human.hg18_dgv
dgv Score=0;Name=48150 1 161003087 161003087 C T comments: rs1000050, a SNP in Illumina SNP arrays
dgv Score=0;Name=2298,31604,38855,48105,0256,3283,6779,0677,30369 1 13133880 13133881 TC - comments: rs59770105, a 2-bp deletion
dgv Score=0;Name=53401,55057,9214,3004,9215,10411,25208,49187,3905 13 19695176 20003944 0 - comments: a 342kb deletion encompassing GJB6, associated with hearing loss
The above command identifies 3 variants located in previously reported structural variation regions in Database of Genomic Variants. Their scores are all zero, because score is not defined for DGV track. The Name represent DGV database identifiers for CNVs. The GJB6 deletion overlaps with a few different entries (identifiers are 53401,55057,9214,3004,9215,10411,25208,49187,3905) in the DGV.
For structural variants such as deletions and duplications, the -minqueryoverlap may be sometimes useful. It requires that at least this percentage of query be overlapped with a database entry.
[kai@beta ~/]$ annotate_variation.pl -regionanno -dbtype dgv ex1.human humandb/ -minqueryfrac 0.5
NOTICE: The --buildver is set as 'hg18' by default
NOTICE: Reading annotation database humandb/hg18_dgv.txt ... Done with 49988 regions
NOTICE: Finished region-based annotation on 12 genetic variants in ex1.human
NOTICE: Output files were written to ex1.human.hg18_dgv
[kai@beta ~/]$ cat ex1.human.hg18_dgv
dgv Score=0;Name=48150 1 161003087 161003087 C T comments: rs1000050, a SNP in Illumina SNP arrays
dgv Score=0;Name=2298,31604,38855,48105,0256,3283,6779,0677,30369 1 13133880 13133881 TC - comments: rs59770105, a 2-bp deletion
dgv Score=0;Name=9214 13 19695176 20003944 0 - comments: a 342kb deletion encompassing GJB6, associated with hearing loss
As we can see from the results above, adding a "-minqueryfrac 0.5" argument reduces the number of database hits. To understand this more, check the genome browser shots for this region:
The above figure shows that this query region overlaps with several CNVs reported in DGV, but only one of them (identifier is 9214) contains more than 50% of the query.
Identify variants reported in previously published GWAS (Genome-wide association studies)
[kai@beta ~/]$ annotate_variation.pl -downdb gwascatalog humandb/
[kai@beta ~/]$ annotate_variation.pl -regionanno -dbtype gwascatalog ex1.human humandb/
NOTICE: The --buildver is set as 'hg18' by default
NOTICE: Reading annotation database humandb/hg18_gwasCatalog.txt ... Done with 2930 regions
NOTICE: Finished region-based annotation on 12 genetic variants in ex1.human
NOTICE: Output files were written to ex1.human.hg18_gwascatalog
[kai@beta ~/]$ cat ex1.human.hg18_gwascatalog
gwascatalog Score=0;Name=Ankylosing spondylitis,Crohn's disease,Ulcerative colitis,Inflammatory bowel disease 1 67478546 67478546 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
gwascatalog Score=0;Name=Crohn's disease 2 233848107 233848107 T C comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease
The above command identifies 2 variants previously reported in GWAS (note that the IL23R variant is known to be associated with multiple diseases already, but the gwascatalog track is not that comprehensive yet).
Identify variants in ENCODE annotated regions (transcribed regions, H3K4Me1 regions, H3K4Me3 regions, H3K27Ac regions, DNaseI hypersensitivity regions, transcription factor ChIP-Seq regions, etc)
ENCODE now provides huge amounts of data in Genome Browser tracks that ANNOVAR can annotate against. Some specific examples are shown below, but obviously, there are hundreds of ENCODE annotation tracks that can be used in ANNOVAR.
To check whether the variants are located in transcribed regions in the RNA-Seq data for GM12878 cell lines:
[kaiwang@biocluster ~/project/annotate_variation]$ annotate_variation.pl -downdb wgEncodeCaltechRnaSeqRawSignalRep1Gm12878CellLongpolyaBb12x75 humandb/
[kaiwang@biocluster ~/project/annotate_variation]$ annotate_variation.pl -regionanno -dbtype wgEncodeCaltechRnaSeqRawSignalRep1Gm12878CellLongpolyaBb12x75 ex1.human humandb/
[kaiwang@biocluster ~/project/annotate_variation]$ wc -l ex1.human.hg18_wgEncodeCaltechRnaSeqRawSignalRep1Gm12878CellLongpolyaBb12x75
7 ex1.human.hg18_wgEncodeCaltechRnaSeqRawSignalRep1Gm12878CellLongpolyaBb12x75
So 7 of of the 12 regions in ex1.human are transcribed in the GM12878 cell lines. Note that here we used RawSignal; it may make sense to use summarized signals that impose a specific expression activity threshold to eliminate lowly-expressed genes.
To check whether the variants are located in enhancer regions, based on H3K4Me1 (or H3K27Ac if you want) chromatin marks:
[kaiwang@biocluster ~/project/annotate_variation]$ annotate_variation.pl -downdb wgEncodeBroadChipSeqPeaksGm12878H3k4me1 humandb/
NOTICE: The --buildver is set as 'hg18' by default
NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done
NOTICE: Downloading annotation database ftp://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/wgEncodeBroadChipSeqPeaksGm12878H3k4me1.txt.gz ... OK
NOTICE: Uncompressing downloaded files
NOTICE: Finished downloading annotation files for hg18 build version, with files saved at the 'humandb' directory
[kaiwang@biocluster ~/project/annotate_variation]$ annotate_variation.pl -regionanno -dbtype wgEncodeBroadChipSeqPeaksGm12878H3k4me1 ex1.human humandb/ -scorecolumn 5
NOTICE: The --buildver is set as 'hg18' by default
NOTICE: Reading annotation database humandb/hg18_wgEncodeBroadChipSeqPeaksGm12878H3k4me1.txt ... Done with 66675 regions
NOTICE: Finished region-based annotation on 12 genetic variants in ex1.human
NOTICE: Output files were written to ex1.human.hg18_wgEncodeBroadChipSeqPeaksGm12878H3k4me1
[kaiwang@biocluster ~/project/annotate_variation]$ cat ex1.human.hg18_wgEncodeBroadChipSeqPeaksGm12878H3k4me1
wgEncodeBroadChipSeqPeaksGm12878H3k4me1 Score=1000;Name=. 16 49303427 49303427 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
wgEncodeBroadChipSeqPeaksGm12878H3k4me1 Score=1000;Name=. 13 19695176 20003944 0 - comments: a 342kb deletion encompassing GJB6, associated with hearing loss
2 out of the 12 variants are considered to be in enhancers. The Name is "." because ENCODE did not assign a name to these regions. Note that in the command above, we used -scorecolumn 5, to tell the program that the fifth column is the score column.
To check whehter the variants are located in DNase I hypersensitivity sites from ENCODE:
[kaiwang@biocluster ~/project/annotate_variation]$ annotate_variation.pl -downdb wgEncodeRegDnaseClustered humandb/
NOTICE: The --buildver is set as 'hg18' by default
NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done
NOTICE: Downloading annotation database ftp://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/wgEncodeRegDnaseClustered.txt.gz ... OK
NOTICE: Uncompressing downloaded files
NOTICE: Finished downloading annotation files for hg18 build version, with files saved at the 'humandb' directory
[kaiwang@biocluster ~/project/annotate_variation]$ annotate_variation.pl -regionanno -dbtype wgEncodeRegDnaseClustered ex1.human humandb/ -scorecolumn 5
NOTICE: The --buildver is set as 'hg18' by default
NOTICE: Reading annotation database humandb/hg18_wgEncodeRegDnaseClustered.txt ... Done with 970658 regions
NOTICE: Finished region-based annotation on 12 genetic variants in ex1.human
NOTICE: Output files were written to ex1.human.hg18_wgEncodeRegDnaseClustered
[kaiwang@biocluster ~/project/annotate_variation]$ cat ex1.human.hg18_wgEncodeRegDnaseClustered
wgEncodeRegDnaseClustered Score=446;Name=5 16 49303427 49303427 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
wgEncodeRegDnaseClustered Score=1000;Name=25,64,9,15,56,45,19,23,13 13 19695176 20003944 0 - comments: a 342kb deletion encompassing GJB6, associated with hearing loss
As an exercise, the users should try to annotate CTCF binding site (hint: use -dbtype wgEncodeBroadChipSeqPeaksGm12878Ctcf).
To annotate ENCODE transcription factor ChIP-Seq data:
[kaiwang@biocluster ~/project/annotate_variation]$ annotate_variation.pl -downdb wgEncodeRegTfbsClustered humandb/
NOTICE: The --buildver is set as 'hg18' by default
NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done
NOTICE: Downloading annotation database ftp://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/wgEncodeRegTfbsClustered.txt.gz ... OK
NOTICE: Uncompressing downloaded files
NOTICE: Finished downloading annotation files for hg18 build version, with files saved at the 'humandb' directory
[kaiwang@biocluster ~/project/annotate_variation]$ annotate_variation.pl -regionanno -dbtype wgEncodeRegTfbsClustered ex1.human humandb/ -scorecolumn 5
NOTICE: The --buildver is set as 'hg18' by default
NOTICE: Reading annotation database humandb/hg18_wgEncodeRegTfbsClustered.txt ... Done with 1582526 regions
NOTICE: Finished region-based annotation on 12 genetic variants in ex1.human
NOTICE: Output files were written to ex1.human.hg18_wgEncodeRegTfbsClustered
[kaiwang@biocluster ~/project/annotate_variation]$ cat ex1.human.hg18_wgEncodeRegTfbsClustered
wgEncodeRegTfbsClustered Score=305;Name=NRSF 16 49303427 49303427 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
wgEncodeRegTfbsClustered Score=1000;Name=BAF155,EBF,IRF4,BATF,TCF12,Max,BAF170,PU.1,JunD,c-Jun,GR,Egr-1,BCL3,PAX5-N19,BCL11A,NFKB,Ini1,PAX5-C20 13 19695176 20003944 0 - comments: a 342kb deletion encompassing GJB6, associated with hearing loss
Note that the 342kb deletion encompass multiple binding sites, just because of its large size.
Identify dbSNP variants in user-specified regions
This is straightfoward, by using -dbtype snp130 together with -regionanno opeartion. Note that -regionanno only cares about region overlap, whereas -filter cares about exact region and exact base pair identities.
[kaiwang@biocluster ~/project/annotate_variation]$ annotate_variation.pl ex1.human humandb/ -region -dbtype snp130
NOTICE: The --buildver is set as 'hg18' by default
NOTICE: Reading annotation database humandb/hg18_snp130.txt ... Done with 18833531 regions
NOTICE: Finished region-based annotation on 12 genetic variants in ex1.human
NOTICE: Output files were written to ex1.human.hg18_snp130
Now check the ex1.human.hg18_snp130 output file. For each input line, it listed whether this line contains one or more known dbSNP entries. The 300kb deletion has many SNPs inside.
Identify variants in other genomic regions annotated with other functions
If you use UCSC database for the annotation, in principle, the vast majority of tracks (hundreds) conforming to the standard file format can be handled by ANNOVAR.
For example, say I want to identify whether a variant is located in a "hot spot" of DNAse I hypersensitivity sites in the GM12878 cell line (which is a model cell line). These sites mark activated cis-regulatory regions including promoters, enhancers, insulators. The following command will do it:
[kai@beta ~/]$ annotate_variation.pl -downdb wgEncodeUwDnaseSeqHotspotsRep2Gm12878 humandb/
[kai@beta ~/]$ annotate_variation.pl -region ex1.human humandb/ -dbtype wgEncodeUwDnaseSeqHotspotsRep2Gm12878
[kai@beta ~/]$ cat ex1.human.hg18_wgEncodeUwDnaseSeqHotspotsRep2Gm12878
wgEncodeUwDnaseSeqHotspotsRep2Gm12878 Score=549 13 19695176 20003944 0 - comments: a 342kb deletion encompassing GJB6, associated with hearing loss
Users can change "hotspot" to "peaks" to find the most significant hypersensitivity sites among the zones, and change GM12878 to dozens of other cell lines to identify these regions in other lines. The possibility is limited only by the current database annotation, as well as the genomic regions that have been assayed.
Annotating custom-made databases conforming to GFF3 (Generic Feature Format version 3)
ANNOVAR also offer some rudimentary ability to annotate variants against GFF3-formatted annotation databases, using the region-based annotation procedure. In this case, the -dbtype is 'gff3', but users need to specify a -gff3dbfile argument as well to supply the actual database file to be scanned. The GFF3 format specification is described here: http://www.sequenceontology.org/gff3.shtml. One example database is provided in the ANNOVAR package:
[kai@beta ~/]$ head humandb/hg18_example_db_gff3.txt
##gff-version 3
chr1 tfloc TF_binding_site 83024 83039 849 - . ID=pos83031;Name=V$FREAC3_01
chr1 tfloc TF_binding_site 229352 229367 849 - . ID=pos229359;Name=V$FREAC3_01
chr1 tfloc TF_binding_site 405674 405689 849 + . ID=pos405681;Name=V$FREAC3_01
chr1 tfloc TF_binding_site 564503 564518 849 - . ID=pos564510;Name=V$FREAC3_01
chr1 tfloc TF_binding_site 1103849 1103864 847 - . ID=pos1103856;Name=V$FREAC3_01
chr1 tfloc TF_binding_site 1311968 1311983 917 - . ID=pos1311975;Name=V$FREAC3_01
chr1 tfloc TF_binding_site 2160490 2160505 818 - . ID=pos2160497;Name=V$FREAC3_01
chr1 tfloc TF_binding_site 2229446 2229461 924 - . ID=pos2229453;Name=V$FREAC3_01
chr1 tfloc TF_binding_site 2229639 2229654 842 - . ID=pos2229646;Name=V$FREAC3_01
To examine which query fall into the regions specified in GFF3 database:
[kai@beta ~/]$ annotate_variation.pl -regionanno -dbtype gff3 -gff3dbfile hg18_example_db_gff3.txt ex1.human humandb/
NOTICE: The --buildver is set as 'hg18' by default
NOTICE: Reading annotation database humandb/hg18_example_db_gff3.txt ... Done with 25691 regions
NOTICE: Finished region-based annotation on 12 genetic variants in ex1.human
NOTICE: Output files were written to ex1.human.hg18_gff3
[kai@beta ~/]$ cat ex1.human.hg18_gff3
gff3 Score=878;Name=19661681 13 19661686 19661686 G - comments: rs1801002 (del35G), a frameshift mutation in GJB2, associated with hearing loss
gff3 Score=859;Name=19695580 13 19695176 20003944 0 - comments: a 342kb deletion encompassing GJB6, associated with hearing loss
Based on the above results, two variants disrupt annotated regions in the GFF3 database file. Note that the "Name" in output corresponds to "ID" in the original GFF3 file.
Identifying variants in regions specified in BED files
Sometimes you may get a BED file from somewhere and want to know if some of the variants fall within the regions specified in BED. For example, after an exome sequencing experiments, you identified many variants, but want to focus on variants that only fall within the designed exome capture regions. Typically, capture array manufacturers will provide the regions in BED file. ANNOVAR provides the means to use the BED file as database directly.
[kaiwang@biocluster ~/]$ annotate_variation.pl ex1.human . -bedfile hg18_SureSelect_All_Exon_G3362_with_names.bed -dbtype bed -regionanno
NOTICE: The --buildver is set as 'hg18' by default
NOTICE: Reading annotation database hg18_SureSelect_All_Exon_G3362_with_names.bed ... Done with 165637 regions
NOTICE: Finished region-based annotation on 12 genetic variants in ex1.human
NOTICE: Output files were written to ex1.human.hg18_bed
[kaiwang@biocluster ~/]$ cat ex1.human.hg18_bed
bed Name=NA 1 67478546 67478546 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
bed Name=NA 2 233848107 233848107 A G comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease
bed Name=NA 16 49303427 49303427 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
bed Name=NA 16 49314041 49314041 G C comments: rs2066845 (G908R), a non-synonymous SNP in NOD2
bed Name=NA 16 49321279 49321279 - C comments: rs2066847 (c.3016_3017insC), a frameshift SNP in NOD2
bed Name=NA 13 19661686 19661686 G - comments: rs1801002 (del35G), a frameshift mutation in GJB2, associated with hearing loss
bed Name=NA 13 19695176 20003944 0 - comments: a 342kb deletion encompassing GJB6, associated with hearing loss
In the output, "Name=NA" because there is no such annotation in a BED file.
|