ANNOVAR
Home
Download
Quick Start-up Guide
Prepare Input File
Annotation
Gene-based
Region-based
Filter-based
Accessary Programs
VCF Processing
FAQ

Preparation of input files with variants

  1. Standard format of ANNOVAR input file
  2. Format conversion script: facilitating the generation of ANNOVAR input files
    1. VCF4 genotype calling format
    2. dbSNP identifiers
    3. All possible variants in a genomic region
    4. All possible variants in exonic region of a transcript
    5. Old VCF4 format (obselete)
    6. Samtools genotype-calling pileup format
    7. Complete Genomics genotype calling format
    8. GFF3-SOLiD format
    9. SOAPsnp format
    10. MAQ genotype calling format
    11. CASAVA genotype calling format

Standard format of ANNOVAR input file

ANNOVAR takes text-based input files, where each line corresponds to one variant. On each line, the first five space- or tab- delimited columns represent chromosome, start position, end position, the reference nucleotides and the observed nucleotides. Additional columns can be supplied and will be printed out in identical form. For convenience, users can use “0” to fill in the reference nucleotides, if this information is not readily available. Insertions, deletions or block substitutions can be readily represented by this simple file format, by using “–” to represent a null nucleotide. One example is given below (this example is included as ex1.human file in the ANNOVAR package), with extra columns that serve as comments on the variants. By default, 1-based coordinate system will be assumed; if --zerostart argument is issued, a half-open zero-based coordinate system will be used in ANNOVAR instead.

The ANNOVAR package contains a few example input files. For example, the content of the ex1.human file is below:

[kai@biocluster ~/]$ cat example/ex1.avinput
1 948921 948921 T C comments: rs15842, a SNP in 5' UTR of ISG15
1 1404001 1404001 G T comments: rs149123833, a SNP in 3' UTR of ATAD3C
1 5935162 5935162 A T comments: rs1287637, a splice site variant in NPHP4
1 162736463 162736463 C T comments: rs1000050, a SNP in Illumina SNP arrays
1 84875173 84875173 C T comments: rs6576700 or SNP_A-1780419, a SNP in Affymetrix SNP arrays
1 13211293 13211294 TC - comments: rs59770105, a 2-bp deletion
1 11403596 11403596 - AT comments: rs35561142, a 2-bp insertion
1 105492231 105492231 A ATAAA comments: rs10552169, a block substitution
1 67705958 67705958 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
2 234183368 234183368 A G comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease
16 50745926 50745926 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
16 50756540 50756540 G C comments: rs2066845 (G908R), a non-synonymous SNP in NOD2
16 50763778 50763778 - C comments: rs2066847 (c.3016_3017insC), a frameshift SNP in NOD2
13 20763686 20763686 G - comments: rs1801002 (del35G), a frameshift mutation in GJB2, associated with hearing loss
13 20797176 21105944 0 - comments: a 342kb deletion encompassing GJB6, associated with hearing loss

The first five space or tab delimited fields are Chromosome ("chr" prefix is optional), Start, End, Reference Allelel, Alternative Allele. The rest of the columns are completely optional

The example above contains several genetic variants. The first variant is a single nucleotide variant, with a substitution of C in reference genome to T. The third variant is a 2-bp deletion, with the observed nucleotides being represented by "-". The fourth variant is a 2-bp insertion, since the reference nucleotide in the reference genome is represented by “–”. The last variant is a large-scale deletion, but the reference allele is represented by “0”, eliminating the need to include reference nucleotides on this line.

Another example is shown below. Note that the first five columns conform to the specification above, yet all other columns are totally optional and the user can put anything there.

[kai@beta ~/]$ cat ex3.avinput
7 92570705 92570705 T C 7 3 43 D G SAMD9 1.56
7 98870495 98870495 G A 26 16 62 R C PTCD1 3.06
7 99835402 99835402 C T 13 6 46 P L PILRA 1.75
7 100122289 100122289 - CCT 5 3 60 EQ ERQ GIGYF1 3.98
7 100209410 100209410 G A 15 8 53 R H ZAN 1.81
7 100473466 100473466 A G 38 13 34 T A MUC17 0.60
7 105066159 105066160 TC - 19 3 16 E X ATXN7L1 4.92
7 113306419 113306419 C T 15 6 40 S N PPP1R3A 1.05
7 115411632 115411632 C T 14 5 36 D N TFEC -0.45
7 119702880 119702880 T C 29 3 10 C R KCND2 5.00
7 120216091 120216091 C G 20 10 50 A P TSPAN12 5.32
7 120555712 120555712 T C 10 3 30 L S C7orf58 3.00
7 128099699 128099699 C G 7 5 71 I M FAM71F2 0.42
7 128221650 128221650 T C 13 3 23 L P CCDC136 1.26

In some cases, users may want to specify only positions but not the actual nucleotides. In that case, "0" can be used to fill in the 4th and 5th column. ANNOVAR can still run on this input file, but obviously there is no output on amino acid changes. Additionally, the observed amino acid will be assumed to be of equal length of the wildtype allele (as specified by the start and end position at each line).

If ANNOVAR encounters an invalid input line, it will write the invalid line into a file called $outfile.invalid_input where $outfile is specified by the --outfile argument. If all input lines are of valid format, this output file will not exist. Therefore, even if the input file contains empty lines or invalid format, ANNOVAR can still proceed with the next input line.

The download package contains several example input files. The users can check them out.

Format conversion script: facilitating the generation of ANNOVAR input files

The convert2annovar.pl script provide some very rudimentary utility to convert other "genotype calling" format into ANNOVAR format. Currently, the program can handle Samtools genotype-calling pileup format, Illumina export format from GenomeStudio, SOLiD GFF genotype-calling format, Complete Genomics variant format, and VCF format.

1. VCF4 genotype calling format (updated July 2013)

The "-format vcf4 " argument should be specified to convert VCF files to ANNOVAR input format. This functionality was re-implemented to address many users' comments to handle multiple alternative alleles and to handle multiple input samples in a single VCF file.

The ANNOVAR package should contain an example VCF file in the example/ directory. We can use this file as an example:

[kaiwang@biocluster ~/]$ cat example/ex2.vcf
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
16 50745926 rs2066844 C T 80 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T G 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1230288 . T . 50 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTCT G,GTACT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3

More detailed explanation of this VCF file is given at http://www.1000genomes.org/node/101, as I basically copied the file from there with minor modification. You can see that seven loci are contained within the file, together with many comment lines.

Now let's do the conversion:

[kaiwang@biocluster ~/]$ convert2annovar.pl -format vcf4 example/ex2.vcf > ex2.avinput
WARNING to old ANNOVAR users: this program no longer does line-to-line conversion for multi-sample VCF files. If you want to include all variants in output, use '-format vcf4old' or use '-format vcf4 -allsample -withfreq' instead.
NOTICE: Finished reading 25 lines from VCF file
NOTICE: A total of 7 locus in VCF file passed QC threshold, representing 6 SNPs (3 transitions and 3 transversions) and 3 indels/substitutions
NOTICE: Finished writing 2 SNPs (1 transitions and 1 transversions) and 1 indels/substitutions for 1 sample (but input contains 3 samples)
WARNING: Skipped 1 invalid alternative alleles found in input file

[kaiwang@biocluster ~/]$ cat ex2.avinput
20 1110696 1110696 A G het 67 6
20 1110696 1110696 A T het 67 6
20 1234568 1234570 TCT - het 50 4

The above command takes ex2.vcf as input file, and generate the ex2.avinput as output file. The 3 extra columns are zygosity status, genotype quality and read depth.

If you read the screen message carefully, it tells that only 1 out of 3 samples have been processed in this VCF file. By default, only the first sample in VCF file will be written to output file. The input contains seven loci, but many of them do not have non-reference genotypes for the first sample, and that is why the output contains only 3 variants.

Instead of using redirection, we can also do this to write the output into the ex2.avinput file:

[kaiwang@biocluster ~/]$ convert2annovar.pl -format vcf4 example/ex2.vcf -outfile ex2.avinput

Now, if we want to write all 3 samples into the output file (as three separate output files), we can add the -allsample argument

[kaiwang@biocluster ~/]$ convert2annovar.pl -format vcf4 example/ex2.vcf -outfile ex2 -allsample
NOTICE: output files will be written to ex2.<samplename>.avinput
NOTICE: Finished reading 25 lines from VCF file
NOTICE: A total of 7 locus in VCF file passed QC threshold, representing 6 SNPs (3 transitions and 3 transversions) and 3 indels/substitutions
NOTICE: Finished writing 10 SNPs (6 transitions and 4 transversions) and 3 indels/substitutions for 3 samples
WARNING: Skipped 1 invalid alternative alleles found in input file

[kaiwang@biocluster ~/]$ cat ex2.NA00001.avinput
20 1110696 1110696 A G het 67 6
20 1110696 1110696 A T het 67 6
20 1234568 1234570 TCT - het 50 4

[kaiwang@biocluster ~/]$ cat ex2.NA00002.avinput
16 50745926 50745926 C T het 80 8
20 14370 14370 G A het 29 8
20 17330 17330 T A het 3 5
20 1110696 1110696 A G het 67 0
20 1110696 1110696 A T het 67 0
20 1234567 1234570 GTCT GTACT het 50 2

[kaiwang@biocluster ~/]$ cat ex2.NA00003.avinput
16 50745926 50745926 C T hom 80 5
20 14370 14370 G A hom 29 5
20 1110696 1110696 A T hom 67 4
20 1234568 1234570 TCT - hom 50 3

Clearly, the input VCF file has seven loci. One of them has unknown alternative allele so it is excluded from output. The other two each has two alternative alleles, so in total there are 9 variants, including 6 SNPs and 3 indels. Three output files are generated.

Pay attention to chr20:1110696 in NA0002. There are two alternative alleles for this locus, so two output lines are present, each for one mutation. Both of them are heterozygotes:

20 1110696 1110696 A G het 67 0
20 1110696 1110696 A T het 67 0

If you need the output file to contain the all information as the VCF file, you can add -includeinfo argument:

[kaiwang@biocluster ~/]$ convert2annovar.pl -format vcf4 example/ex2.vcf -outfile ex2 -allsample -include
NOTICE: output files will be written to ex2.<samplename>.avinput
NOTICE: Finished reading 25 lines from VCF file
NOTICE: A total of 7 locus in VCF file passed QC threshold, representing 6 SNPs (3 transitions and 3 transversions) and 3 indels/substitutions
NOTICE: Finished writing 10 SNPs (6 transitions and 4 transversions) and 3 indels/substitutions for 3 samples
WARNING: Skipped 1 invalid alternative alleles found in input file

[kaiwang@biocluster ~/]$ cat ex2.NA00001.avinput
20 1110696 1110696 A G 20 1110696 rs6040355 A G 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27
20 1110696 1110696 A T 20 1110696 rs6040355 A T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27
20 1234568 1234570 TCT - 20 1234567 microsat1 GTCT G 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4

If you need the zygosity, quality and read coverage information in the output line as well, add the -withzyg argument:

[kaiwang@biocluster ~/]$ convert2annovar.pl -format vcf4 example/ex2.vcf -outfile ex2 -allsample -include -withzyg
NOTICE: output files will be written to ex2.<samplename>.avinput
NOTICE: Finished reading 25 lines from VCF file
NOTICE: A total of 7 locus in VCF file passed QC threshold, representing 6 SNPs (3 transitions and 3 transversions) and 3 indels/substitutions
NOTICE: Finished writing 10 SNPs (6 transitions and 4 transversions) and 3 indels/substitutions for 3 samples
WARNING: Skipped 1 invalid alternative alleles found in input file

[kaiwang@biocluster ~/]$ cat ex2.NA00001.avinput
20 1110696 1110696 A G het 67 6 20 1110696 rs6040355 A G 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27
20 1110696 1110696 A T het 67 6 20 1110696 rs6040355 A T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27
20 1234568 1234570 TCT - het 50 4 20 1234567 microsat1 GTCT G 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4

Now, here is one very very important argument: -withfreq. When -withfreq is set, it will print out the allele frequency of each SNP in the VCF file, based on all samples within the file. Because we are not looking at all samples as a whole, the individual genotypes will not be considered here, so the output file should contain all loci from the input file:

[kaiwang@biocluster ~/]$ convert2annovar.pl -format vcf4 example/ex2.vcf -outfile ex2.avinput -allsample -withfreq
NOTICE: Finished reading 25 lines from VCF file
NOTICE: A total of 7 locus in VCF file passed QC threshold, representing 6 SNPs (3 transitions and 3 transversions) and 3 indels/substitutions
NOTICE: Finished writing 17 SNPs (8 transitions and 9 transversions) and 6 indels/substitutions for 3 samples
WARNING: Skipped 1 invalid alternative alleles found in input file

[kaiwang@biocluster ~/]$ cat ex2.avinput
16 50745926 50745926 C T 0.5 80 5
20 14370 14370 G A 0.5 29 5
20 17330 17330 T A 0.1667 3 3
20 1110696 1110696 A G 0.5 67 0
20 1110696 1110696 A T 0.6667 67 4
20 1230237 1230237 T G 0 47 2
20 1230288 1230288 T 0 0 50 2
20 1234568 1234570 TCT - 0.75 50 3
20 1234567 1234570 GTCT GTACT 0.5 50 2

In practice, you probably want to add -includeinfo so that all genotype records for all samples are included in the final output file:

[kaiwang@biocluster ~/]$ convert2annovar.pl -format vcf4 example/ex2.vcf -outfile ex2.avinput -allsample -withfreq -include
NOTICE: Finished reading 25 lines from VCF file
NOTICE: A total of 7 locus in VCF file passed QC threshold, representing 6 SNPs (3 transitions and 3 transversions) and 3 indels/substitutions
NOTICE: Finished writing 17 SNPs (8 transitions and 9 transversions) and 6 indels/substitutions for 3 samples
WARNING: Skipped 1 invalid alternative alleles found in input file

[kaiwang@biocluster ~/]$ cat ex2.avinput
16 50745926 50745926 C T 0.5 80 5 16 50745926 rs2066844 C T 80 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 14370 14370 G A 0.5 29 5 20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 17330 T A 0.1667 3 3 20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 1110696 A G 0.5 67 0 20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1110696 1110696 A T 0.6667 67 4 20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 1230237 T G 0 47 2 20 1230237 . T G 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1230288 1230288 T 0 0 50 2 20 1230288 . T . 50 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234568 1234570 TCT - 0.75 50 3 20 1234567 microsat1 GTCT G,GTACT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
20 1234567 1234570 GTCT GTACT 0.5 50 2 20 1234567 microsat1 GTCT G,GTACT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3

Again, note that there are two lines for chr20:1110696, because there are two alleles at the same locus. Similarly, there are two lines (chr20:1234568 and chr20:1234567) for the same VCF record, because there are two alleles (indels) at the same locus.

 

If you need the VCF header information, just add the -comment argument:

[kaiwang@biocluster ~/]$ convert2annovar.pl -format vcf4 example/ex2.vcf -outfile ex2 -allsample -include -comment
NOTICE: output files will be written to ex2.<samplename>.avinput
NOTICE: Finished reading 25 lines from VCF file
NOTICE: A total of 7 locus in VCF file passed QC threshold, representing 6 SNPs (3 transitions and 3 transversions) and 3 indels/substitutions
NOTICE: Finished writing 10 SNPs (6 transitions and 4 transversions) and 3 indels/substitutions for 3 samples
WARNING: Skipped 1 invalid alternative alleles found in input file

[kaiwang@biocluster ~/]$ cat ex2.NA00001.avinput
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001
20 1110696 1110696 A G 20 1110696 rs6040355 A G 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27
20 1110696 1110696 A T 20 1110696 rs6040355 A T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27
20 1234568 1234570 TCT - 20 1234567 microsat1 GTCT G 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4

Note that the output file contains only VCF information for that particular sample (in other word, only the GT:GQ:DP:HQ information for sample NA00001 is in the ex2.NA00001.avinput output file).

Therefore, in practice, you could use convert2annovar.pl program as a way to split a huge VCF files into many parts, each for one particular sample, by back-converting the ANNOVAR input file into a VCF file for a particular sample. In a sense, it is similar to the vcf-subset program in VCFtools, but in this case ANNOVAR will be far more efficient, especially when handling large VCF files like those over 1TB. See below:

[kaiwang@biocluster ~/]$ grep -P '^#' example1.NA00001.avinput > example1.NA00002.vcf
[kaiwang@biocluster ~/]$ grep -v -P '^#' example1.NA00001.avinput | cut -f 6- >> example1.NA00002.vcf
[kaiwang@biocluster ~/]$ grep -P '^#' example1.NA00002.avinput > example1.NA00002.vcf
[kaiwang@biocluster ~/]$ grep -v -P '^#' example1.NA00002.avinput | cut -f 6- >> example1.NA00002.vcf
[kaiwang@biocluster ~/]$ grep -P '^#' example1.NA00003.avinput > example1.NA00003.vcf
[kaiwang@biocluster ~/]$ grep -v -P '^#' example1.NA00003.avinput | cut -f 6- >> example1.NA00003.vcf

[kaiwang@biocluster ~/]$ cat example1.NA00003.vcf
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 1/1:43:5:.,.
20 1110696 rs6040355 A T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 2/2:35:4
20 1234567 microsat1 GTCT G 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 1/1:40:3

dbSNP identifiers

Many users have a list of dbSNP rs identifiers and want to annotate functinoality of these SNPs. This can be achieved by convert2annovar.pl by the -format rsid argument:

[kaiwang@biocluster ~/]$ cat example/snplist.txt
rs74487784
rs41534544
rs4308095
rs12345678

[kaiwang@biocluster ~/]$ convert2annovar.pl -format rsid example/snplist.txt -dbsnpfile humandb/hg19_snp138.txt > snnplist.avinput
NOTICE: Scanning dbSNP file humandb/hg19_snp138.txt...
NOTICE: input file contains 4 rs identifiers, output file contains information for 4 rs identifiers
WARNING: 1 rs identifiers have multiple records (due to multiple mapping) and they are all written to output

[kaiwang@biocluster ~/]$ cat snplist.avinput
chr2 186229004 186229004 C T rs4308095
chr7 6026775 6026775 T C rs41534544
chr7 6777183 6777183 G A rs41534544
chr9 3901666 3901666 T C rs12345678
chr22 24325095 24325095 A G rs74487784

As you can see above, the new file has the first five columns as chr, start, end, ref, alt, and the sixth column as dbSNP identifier. The LOG message tells us that 1 SNP (rs41534544) has multiple mappings to genome, and as a result, it has two entries in the output file.

All possible variants in a genomic region

Suppose that I am interested in annotating all SNP, 1-bp insertions and 1-bp deletions in a 3-bp genomic region chr1:2000001-2000003. This can be easily done in convert2annovar.pl now:

[kaiwang@biocluster ~/]$ convert2annovar.pl -format region -seqdir humandb/hg19_seq/ chr1:2000001-2000003
NOTICE: Reading region from STDIN ... Done with 1 regions from 1 chromosomes
NOTICE: Finished reading 1 sequences from humandb/hg19_seq/chr1.fa
NOTICE: Finished writting FASTA for 1 genomic regions to stdout
1 2000001 2000001 A C
1 2000001 2000001 A G
1 2000001 2000001 A T
1 2000002 2000002 T A
1 2000002 2000002 T C
1 2000002 2000002 T G
1 2000003 2000003 C A
1 2000003 2000003 C G
1 2000003 2000003 C T

For adding x-bp insertions and deletions to this, use "-inssize x" and "-delsize x", where x is an integer.

[kaiwang@biocluster ~/]$ convert2annovar.pl -format region -seqdir humandb/hg19_seq/ chr1:2000001-2000003 -inssize 1 -delsize 2
NOTICE: Reading region from STDIN ... Done with 1 regions from 1 chromosomes
NOTICE: Finished reading 1 sequences from humandb/hg19_seq/chr1.fa
NOTICE: Finished writting FASTA for 1 genomic regions to stdout
1 2000001 2000001 A C
1 2000001 2000001 A G
1 2000001 2000001 A T
1 2000002 2000002 T A
1 2000002 2000002 T C
1 2000002 2000002 T G
1 2000003 2000003 C A
1 2000003 2000003 C G
1 2000003 2000003 C T
1 2000001 2000001 AT -
1 2000002 2000002 TC -
1 2000001 2000001 - A
1 2000001 2000001 - C
1 2000001 2000001 - G
1 2000001 2000001 - T
1 2000002 2000002 - A
1 2000002 2000002 - C
1 2000002 2000002 - G
1 2000002 2000002 - T
1 2000003 2000003 - A
1 2000003 2000003 - C
1 2000003 2000003 - G
1 2000003 2000003 - T

If you only want 2-bp deletions but not SNVs and insertions, use "-subsize 0 -delsize 2". By default, -subsize is set as 1 to indicate that SNVs are always desired.

All possible variants in exonic region of a transcript

Similar to the '-format region' above, users can generate ANNOVAR input files for all possible variants in exons (plus splicing variants) of a transcript. Let's take NM_022162 as an example:

[kaiwang@biocluster ~/]$ convert2annovar.pl -format transcript NM_022162 -gene humandb/hg19_refGene.txt -seqdir humandb/hg19_seq/ > NM_022162.avinput
NOTICE: 12 regions will be analyzed for possible mutations
NOTICE: Reading region from STDIN ... Done with 1 regions from 1 chromosomes
NOTICE: Finished reading 1 sequences from humandb/hg19_seq/chr16.fa
NOTICE: Finished writting FASTA for 1 genomic regions to stdout
NOTICE: Reading region from STDIN ... Done with 1 regions from 1 chromosomes
NOTICE: Finished reading 1 sequences from humandb/hg19_seq/chr16.fa
NOTICE: Finished writting FASTA for 1 genomic regions to stdout
NOTICE: Reading region from STDIN ... Done with 1 regions from 1 chromosomes
NOTICE: Finished reading 1 sequences from humandb/hg19_seq/chr16.fa
NOTICE: Finished writting FASTA for 1 genomic regions to stdout
NOTICE: Reading region from STDIN ... Done with 1 regions from 1 chromosomes
NOTICE: Finished reading 1 sequences from humandb/hg19_seq/chr16.fa
NOTICE: Finished writting FASTA for 1 genomic regions to stdout
NOTICE: Reading region from STDIN ... Done with 1 regions from 1 chromosomes
NOTICE: Finished reading 1 sequences from humandb/hg19_seq/chr16.fa
NOTICE: Finished writting FASTA for 1 genomic regions to stdout
NOTICE: Reading region from STDIN ... Done with 1 regions from 1 chromosomes
NOTICE: Finished reading 1 sequences from humandb/hg19_seq/chr16.fa
NOTICE: Finished writting FASTA for 1 genomic regions to stdout
NOTICE: Reading region from STDIN ... Done with 1 regions from 1 chromosomes
NOTICE: Finished reading 1 sequences from humandb/hg19_seq/chr16.fa
NOTICE: Finished writting FASTA for 1 genomic regions to stdout
NOTICE: Reading region from STDIN ... Done with 1 regions from 1 chromosomes
NOTICE: Finished reading 1 sequences from humandb/hg19_seq/chr16.fa
NOTICE: Finished writting FASTA for 1 genomic regions to stdout
NOTICE: Reading region from STDIN ... Done with 1 regions from 1 chromosomes
NOTICE: Finished reading 1 sequences from humandb/hg19_seq/chr16.fa
NOTICE: Finished writting FASTA for 1 genomic regions to stdout
NOTICE: Reading region from STDIN ... Done with 1 regions from 1 chromosomes
NOTICE: Finished reading 1 sequences from humandb/hg19_seq/chr16.fa
NOTICE: Finished writting FASTA for 1 genomic regions to stdout
NOTICE: Reading region from STDIN ... Done with 1 regions from 1 chromosomes
NOTICE: Finished reading 1 sequences from humandb/hg19_seq/chr16.fa
NOTICE: Finished writting FASTA for 1 genomic regions to stdout
NOTICE: Reading region from STDIN ... Done with 1 regions from 1 chromosomes
NOTICE: Finished reading 1 sequences from humandb/hg19_seq/chr16.fa
NOTICE: Finished writting FASTA for 1 genomic regions to stdout

By default, splicing site (2-bp outside of each exon) is also processed. If you do not want them, or want to chagne the definition of splicing, use the -splicing_threshold argument.

Similar to the '-format region' above, '-delsize' and '-dupsize' and '-subsize' are all supported, so that you can customize the output file.

One great utility of using '-format region' is that users can back-convert protein or cDNA mutation nomenclature back to genomic coordinate. To do this, one just need to generate all possible mutations for a transcript, and then annotate the resulting *.avinput file, and then search through this file for matches.

 

 

VCF4 genotype calling format (OBSELETE)

As of July 2014, this is completely obselete now. Users are advised to use "-format vcf4 -allsample -withfreq" if they want the output file to have the same set of loci as the input file, regardless of genotype identities of the samples.

The convert2annovar.pl program has been rewritten in July 2013 for improved handling of VCF files. However, if users want "old-style" output, then you can still use "-format vcf4old" to achieve that. More detailed examples are explained below:

For example, for VCF4 file containing genotype calls,

[kaiwang@biocluster ~/]$ convert2annovar.pl 84060.snp.vcf -format vcf4old | head
NOTICE: Detected that the VCF4 file is generated by GATK UnifiedGenotyper
NOTICE: column 6-10 represent heterozygosity status, quality score, read depth, RMS mapping quality, quality by depth
1 55 55 T G het 34.82 2 14.16 17.41
1 2646 2646 G A het 40.91 4 7.50 10.23
1 5074 5074 T G het 37.82 1 12.00 37.82
1 7401 7401 C A het 38.19 2 26.16 19.10
1 9907 9907 C T het 38.80 2 12.02 19.40
1 9993 9993 G A het 36.80 1 18.00 36.80
1 59133 59133 A G hom 756.78 175 5.36 4.32
1 59374 59374 A G het 1189.97 39 18.30 30.51
1 59760 59760 T C hom 169.09 132 2.92 1.28
1 81399 81399 G T het 35.63 1 37.00 35.63

The NOTICE line gives a brief format description for column 6-10 in the output file. Use -includeinfo if the user wants to have all the information.

The indel calling format may be substantially different from the SNP calling format. For example,

[kaiwang@biocluster ~/]$ convert2annovar.pl 84060.indel.vcf -format vcf4old | head
NOTICE: Detected that the VCF4 file is generated by GATK IndelGenotyper
NOTICE: column 6-9 represent heterozygosity status, quality score, read depth, read count supporting indel call, RMS mapping quality
1 2525324 2525324 - C het . 12 5 29.0
1 3553373 3553373 C - het . 6 6 60.0
1 6093012 6093012 G - het . 32 31 59.64516
1 8928691 8928691 - C het . 13 7 64.28571
1 9227566 9227573 CCCAGGCA - het . 36 12 35.833332
1 9704618 9704618 - C het . 6 2 49.5
1 10279794 10279794 T - het . 9 4 62.5
1 10343439 10343439 - T het . 8 7 61.42857
1 10347705 10347705 T - het . 9 6 54.833332
1 11003310 11003310 - G het . 8 8 47.0

As another example, suppose we want to process 1000 Genomes Project genotype calls

[kaiwang@biocluster ~/]$ convert2annovar.pl CEU.low_coverage.2010_07.sites.vcf -format vcf4old -includeinfo | head
WARNING: the support for '--format vcf4' is not well developed yet and may contain bugs.
1 533 533 G C . 423 AA=.;AC=6;AN=120;DP=423
1 41342 41342 T A . 188 AA=.;AC=29;AN=120;DP=188
1 41791 41791 G A . 192 AA=.;AC=5;AN=120;DP=192
1 44449 44449 T C . 166 AA=C;AC=2;AN=120;DP=166
1 44539 44539 C T . 131 AA=T;AC=2;AN=120;DP=131
1 44571 44571 G C . 135 AA=g;AC=7;AN=120;DP=135
1 45162 45162 C T . 166 AA=c;AC=20;AN=120;DP=166;HM2
1 52066 52066 T C . 159 AA=C;AC=18;AN=120;DP=159
1 53534 53534 G A . 243 AA=G;AC=18;AN=120;DP=243
1 75891 75891 T C . 182 AA=.;AC=11;AN=120;DP=182

In the above command, we used -includeinfo argument, so that the additional information are printed out in the same line.

Similarly, we can process indel calls

[kaiwang@biocluster ~/]$ convert2annovar.pl CEU.low_coverage.2010_07.indel.sites.vcf -format vcf4old -i | head
WARNING: the support for '--format vcf4' is not well developed yet and may contain bugs.
1 1000154 1000157 CACA - 100 615 AF=0.115095;HP=1;NF=16;NR=13;NS=52;CA=0;DP=615
1 1000907 1000907 A - 48 281 AF=0.0772696;HP=1;NF=2;NR=9;NS=51;CA=0;DP=281
1 1000951 1000951 G - 100 291 AF=0.447771;HP=5;DB;NF=10;NR=20;NS=50;CA=M;DP=291
1 1010786 1010786 - G 100 306 AF=0.774334;HP=1;DB;NF=21;NR=27;NS=51;CA=0;DP=306
1 1026158 1026158 - GGGGG 100 591 AF=0.115637;HP=1;NF=5;NR=2;NS=52;CA=0;DP=591
1 1028861 1028863 CTC - 10 522 AF=0.0107751;HP=1;NF=5;NR=4;NS=52;CA=0;DP=522
1 1040517 1040517 - A 100 426 AF=0.0577672;HP=1;NF=7;NR=4;NS=52;CA=0;DP=426
1 1043690 1043690 - G 100 427 AF=0.065857;HP=3;DB;NF=5;NR=8;NS=51;CA=0;DP=427
1 1049375 1049375 - ACACACCTGAGCACACACACCTGTGC 100 643 AF=0.0808078;HP=1;NF=1;NR=1;NS=51;CA=M;DP=643
1 1055460 1055460 C - 100 342 AF=0.936429;HP=2;NF=13;NR=24;NS=51;CA=M;DP=342

Technical discussions (for advanced ANNOVAR users, open for more discussions if you want to contribute):

1. Multi-allelic calls: Sometimes, multi-allelic variants are called in VCF files. For example, one variant is below (A is changed to both C and G):

1 156706559 . A C,G 114 . DP=20;AF1=1;CI95=1,1;DP4=0,0,1,19;MQ=60;FQ=-63 GT:PL:GQ 1/2:237,126,90,162,0,138:99

By default, convert2annovar.pl will only regard C as the mutation. If you add -allallele argument, then the program will print out two output lines, one for C and one for G.

There are exceptions. For example,

1 11297762 . T C,A 98 . DP=19;AF1=1;CI95=1,1;DP4=0,0,17,1;MQ=60;FQ=-78 GT:PL:GQ 1/1:131,51,0,120,28,117:99

Here the subject is called as homozygous for the first alternative allele (genotype 1/1. i.e. C/C), but since there was one read containing A, samtools still keep both alleles in the VCF file (but gives a very low probabilities for it). In this case, it is reasonable to assume that this is a CC genotype and ignore the second A allele.

Remember that convert2annovar.pl has the ability to maintain the VCF file format, such that you can process any VCF file in ANNOVAR by annotation, filtering, etc, and ultimately, in the end, you still get a VCF file that is a subset of the original VCF file. Read details here. Obviously, for multi-allelic variant calls, if you add -allallele argument, one input line becomes two output lines so that you need to do some clean-up of the final results before converting back to VCF file.

2. Ambiguous indels: Sometimes, indels can be found in a stretch of identical nucleotides, so that the exact location of indel cannot be identified. Similarly, some insertions have identical first and last nucleotide, so that there are two choices to specify the location of insertion. ANNOVAR will use a simple convention that the leftmost coordinate and nucleotide should be always used, whenever ambiguity exists. I believe that this is really the only good way to solve the inconsistency in indel definitions in all variant calling systems, but obviously most other people do not think so.

Some other tools address this issue by several means: First, they can just print out the whole block of nucleotide including a different nucleotide (for example, A) before a polymer (for example, TT). Then it is up to the individual researchers to decide how they want to handle this scenario. For example, the same indel can be annotated as two versions below:

Experimental sample VCF file: chr1  11086097 AT   ATT

1KGP Dindel in VCF format:      chr1 11086097  A  AT

Second, some software use an external reference (such as a pre-existing library of "known indels") and then try to match any observed indel to these "known indels" and then use the location of the "known indels" when slightly inconsisteny exists. This has some theoretical advantages in some situations but it has its own set of issues as well.

In convert2annovar.pl, the way to represent this variant is to treat it as a simple one-base insertion. That is, if you run convert2annovar.pl (from either of the two variants above), you will see

variant from convert2annovar.pl:  chr1  11086097 11086097 -  T

and any subsequent annotation will be based on this new single-base-insertion variant.

Similar, suppose you have a change from C to CTC. It could be that there is a insertion of "TC" after C, or that there is a "CT" insertion before C. convert2annovar.pl will treat this as an insertion of "TC" after C.

convert2annovar.pl per se will not do this type of conversion. In other word, it will work on the variant that you supply in the input file. If you supply AT->ATT, or A->AT, or -->T, convert2annovar.pl will not attemp to alter them. So if you are doing filter-based annotation, and the database file contains only -->T, then the first two variants will be regarded as "un-annotated" variants.

3. Flagging in VCF4 file: Although convert2annovar.pl has some nice built-in ability to print out a subset of higher quality VCF4 calls ( for example, coverage>20), individual investigators may want to filter the VCF4 file themselves by other more sophisticated software tools, before feeding the raw variant calls in VCF4 to convert2annovar.pl The way this is done is by putting a "FILTER" flag in the VCF4 file, and the use -filter argument in convert2annovar.pl to process only calls with a user-specified flag.

For example, I can put this argument in GATK VariantFiltration walker: --clusterWindowSize 10 --filterExpression "MQ0 >= 4 && ((MQ0 / (1.0 * DP)) > 0.1)" --filterName "HARD_TO_VALIDATE" -B:mask,VCF lane1_1.fq.indel.vcf --filterExpression "QUAL < 30.0 || QD < 5.0 || HRun > 5 || SB > -0.10" --filterName GATKStandard --filterExpression "DP < 6" --filterName "DPlt6"

Then the new VCF4 file will have different filter annotations and I can use "convert2annovar.pl -filter PASS" subsequently to convert to annovar input format.

 

Samtools genotype-calling pileup format

This section is also obselete now, and in fact samtools now use mpileup, rather than the "old" pileup.

Note that there are many different pileup formats, but here we are dealing with the (now-obselete as of 2011) "genotype-calling" pileup which contains the variant calls in one of the columns. A more detailed description is given at the Samtools website. An example to generate the "genotype-calling" pileup file is shown below:

samtools pileup -vcf ref.fa aln.bam > raw.pileup

The commands generates pileup files that contain the consensus calls with the model implemented in MAQ (there are certainly many other specified SNP callers available as well that users can freely choose). An example genotype-calling pileup format generated from SamTools is illustrated below:

chr1 556674 G G 54 0 60 16 a,.....,...,.... (B%A+%7B;0;%=B<:
chr1 556675 C C 55 0 60 16 ,,..A..,...,.... CB%%5%,A/+,%....
chr1 556676 C C 59 0 60 16 g,.....,...,.... .B%%.%.?.=/%...1
chr1 556677 G G 75 0 60 16 ,$,.....,...,.... .B%%9%5A6?)%;?:<
chr1 556678 G K 60 60 60 24 ,$.....,...,....^~t^~t^~t^~t^~t^~t^~t^~t^~t B%%B%<A;AA%??<=??;BA%B89
chr1 556679 C C 61 0 60 23 .....a...a....,,,,,,,,, %%1%&?*:2%*&)(89/1A@B@@
chr1 556680 G K 88 93 60 23 ..A..,..A,....ttttttttt %%)%7B:B0%55:7=>>A@B?B;
chr1 556681 C C 102 0 60 25 .$....,...,....,,,,,,,,,^~,^~. %%3%.B*4.%.34.6./B=?@@>5.
chr1 556682 A A 70 0 60 24 ...C,...,....,,,,,,,,,,. %:%(B:A4%7A?;A><<999=<<'
chr1 556683 G G 99 0 60 24 ....,...,....,,,,,,,,,,. %A%3B@%?%C?AB@BB/./-1A7?

The columns are chromosome, 1-based coordinate, reference base, consensus base (IUPAC nomenclature for nucleotides), consensus quality, SNP quality, maximum mapping quality of the reads covering the sites, the number of reads covering the site, read bases and base qualities.

The convert2annovar.pl program can convert the pileup file format to ANNOVAR input files. By default, the "-snpqual 20" argument will be imposed, so that only SNPs reaching quality score >=20 will be processed and written to output files. The output varlist file contains the called mutations in ANNOVAR format (non-mutations are obviously not in the output file).

In the 2011 Januaray version of ANNOVAR, the format for handling pileup file has been quite mature/fixed. Note that the first five columns conform to the standard ANNOVAR input format, yet the sixth and following columns give information on the alleles.

[kaiwang@biocluster ~/]$ convert2annovar.pl 84060.pileup -coverage 10 | head
NOTICE: the default --format argument is set as 'pileup'
NOTICE: the default --snpqual argument for pileup format is set as 20
NOTICE: Column 6-9 in output are heterozygosity status, SNP quality, total reads, reads with mutation
1 20139 20140 CA - het 563 53 10
1 59374 59374 A G hom 129 39 37
1 798677 798677 T G het 30 26 4
1 798785 798785 G A hom 132 38 37
1 798791 798791 C T hom 156 46 45
1 799544 799544 G A het 35 39 7
1 799550 799550 G C het 64 38 8
1 799595 799595 T C het 24 28 5
1 861034 861034 A C het 46 14 4

The NOTICE line aboves tells the user what the columns 6-9 means in the output. In the first line, we see an indel with depth coverage of 53, and 10 of them support the indel. In the second line, we see a SNP with depth coverage of 39, and 37 of them supports the alternative allele (G). These additional numbers after column 6 helps user decide whether the variant calls are reliable or not.

The -fraction argument can be used to filter out variants whose alternative allele has too low percentage among all reads. For example, if we suppose that all variant calls must be supported by at least 40% reads covering a site, we can use:

[kaiwang@biocluster ~/]$ convert2annovar.pl 84060.pileup -coverage 10 -fraction 0.4 | head
NOTICE: the default --format argument is set as 'pileup'
NOTICE: the default --snpqual argument for pileup format is set as 20
NOTICE: Column 6-9 in output are heterozygosity status, SNP quality, total reads, reads with mutation
1 59374 59374 A G hom 129 39 37
1 798785 798785 G A hom 132 38 37
1 798791 798791 C T hom 156 46 45
1 871781 871781 G A het 198 13 8
1 873954 873954 C G het 211 16 7
1 873964 873964 A C het 172 17 8
1 877423 877423 A C hom 84 19 19
1 877664 877664 A G hom 63 12 12
1 878502 878502 T C hom 69 14 14
1 881483 881483 C A het 228 24 10

As can be seen by comparing the two output files, the first line of indel is no longer in output, because 10/53<40%.

Some additional useful arguments include: -altcov, which specifies the minimum coverage for the alternative allele (the -coverage specifies coverage for all reads regardless of whether they support reference allele or alternative allele); -maxcoverage, which specifies the maximum coverage level to print out this variant; --includeinfo, which specifies that all information in the input line should be included in the output line by appending them after the printed columns.

After the program finishes, it will print out some statistics. Normally, for whole-genome sequencing on humans, the heterozytoes:homozygotes ratio should be around 2:1, the transitions:transversions ratio should be 2:1. (ANNOVAR version before Sep 2010 has a bug in the ratio calculation and it has been fixed now).

Adanced notes: When the chromosome is "M", ANNOVAR will not print out "hom" or "het", instead, it will print out a number between 0 and 1 that suggest the fraction of reads that support alternative alleles. Use -chrmt argument if mitochondria is not annotated as M in your alignment.

 

Complete Genomics genotyping calling format

The complete genomics company provides many genotyping-calling files for their customers. Among them is an var*ASM.tsv file that looks like below.

[kai@beta ~/]$ head -n 20 var-GS000000088-ASM.tsv
#BUILD 1.5.0.5
#GENERATED_AT 2009-Nov-03 19:52:21.722927
#GENERATED_BY dbsnptool
#TYPE VAR-ANNOTATION
#VAR_ANN_SET /Proj/Pipeline/Production_Data/REF/HUMAN-F_06-REF/dbSNP.csv
#VAR_ANN_TYPE dbSNP
#VERSION 0.3
>locus ploidy haplotype chromosome begin end varType reference alleleSeq totalScore hapLink xRef
1 2 all chr1 0 959 no-call = ?
2 2 all chr1 959 972 = = =
3 2 all chr1 972 1001 no-call = ?
4 2 all chr1 1001 1008 = = =
5 2 all chr1 1008 1114 no-call = ?
6 2 all chr1 1114 1125 = = =
7 2 all chr1 1125 1191 no-call = ?
8 2 all chr1 1191 1225 = = =
9 2 all chr1 1225 1258 no-call = ?
10 2 all chr1 1258 1267 = = =
12 2 all chr1 1267 1275 no-call = ?

The convert2annovar.pl program can be used to convert this file to ANNOVAR format, using the "-format cg" argument. The output file looks like this:

[kai@beta ~/]$ head var-GS000000088-ASM.tsv.snp
1 28095 28095 A G snp 67 dbsnp:rs806727 hom
1 31844 31844 A G snp 133 dbsnp:rs806721 hom
1 37155 37155 T G snp 51 dbsnp:rs2691275 het
1 44449 44449 T C snp 74 het
1 45679 45679 G A snp 191 dbsnp:rs3020699 het
1 45713 45713 C G snp 191 het
1 45789 45789 T C snp 138 dbsnp:rs3020698 hom
1 46662 46662 T C snp 69 dbsnp:rs2691309 het
1 47109 47109 C G snp 56 dbsnp:rs2691313 het
1 47815 47815 A C snp 67 dbsnp:rs2691334 hom

An example command line session is given below:

[kai@node-r1-u35-c2-p14-o4 ~/]$ convert2annovar.pl -format cg -out GS000000455.query var-GS000000455-ASM.tsv
NOTICE: Converting variants from var-GS000000455-ASM.tsv
NOTICE: Done with 25667914 lines

[kai@node-r1-u35-c2-p14-o4 ~/]$ wc -l GS000000455.query
3728645 GS000000455.query

In this example, 25.6 million lines from the var*ASM.tsv file from Complete Genomics data are processed, and 3.7 million variants are written to the output file in ANNOVAR input format.

GFF3-SOLiD format

Sometimes variant calls are in GFF3 format, and they can be converted to ANNOVAR input format. (This input file should not be confused with a GFF3 annotation database, as they serve different purposes. Here we are dealing with input files only.) For example, SOLiD provides SNP variant calls in the following format:

[kai@beta ~/]$ head -n 20 var/Yoruban_snp_18x.gff
##gff-version 3
##solid-gff-version 0.3
##source-version 2
##type DNA
##date 2009-03-13
##time 0:0:0
##feature-ontology http://song.cvs.sourceforge.net/*checkout*/song/ontology/sofa.obo?revision=1.141
##reference-file
##input-files /data/results3/yoruban_strikes_back/files_for_Aaron/NA18507_18x_SOLiD_SNP_calls_hg18_dbSNP129_annotated.txt
##run-path
1 AB_SOLiD SNP caller SNP 997 997 1 . . coverage=3;ref_base=A; ref_score=0.3359;ref_confi=0.9528;ref_single=0/0;ref_paired=1/1;consen_base=G; consen_score=0.6641;consen_confi=0.9420;consen_single=0/0;consen_paired=2/2
1 AB_SOLiD SNP caller SNP 1371 1371 1 . . coverage=2;ref_base=A; ref_score=0.0000;ref_confi=0.0000;ref_single=0/0;ref_paired=0/0;consen_base=G; consen_score=1.0000;consen_confi=0.8717;consen_single=0/0;consen_paired=2/2
1 AB_SOLiD SNP caller SNP 2061 2061 1 . . coverage=2;ref_base=G; ref_score=0.0000;ref_confi=0.0000;ref_single=0/0;ref_paired=0/0;consen_base=C; consen_score=1.0000;consen_confi=0.9138;consen_single=0/0;consen_paired=2/2
1 AB_SOLiD SNP caller SNP 4770 4770 1 . . coverage=2;ref_base=A; ref_score=0.0000;ref_confi=0.0000;ref_single=0/0;ref_paired=0/0;consen_base=G; consen_score=1.0000;consen_confi=0.8699;consen_single=0/0;consen_paired=2/2
1 AB_SOLiD SNP caller SNP 4793 4793 1 . . coverage=16;ref_base=A; ref_score=0.0689;ref_confi=0.9384;ref_single=0/0;ref_paired=1/1;consen_base=G; consen_score=0.6858;consen_confi=0.8494;consen_single=0/0;consen_paired=11/10
1 AB_SOLiD SNP caller SNP 5074 5074 1 . . coverage=4;ref_base=T; ref_score=0.5165;ref_confi=0.9355;ref_single=2/2;ref_paired=0/0;consen_base=K; consen_score=0.4835;consen_confi=0.8759;consen_single=1/1;consen_paired=1/1
1 AB_SOLiD SNP caller SNP 6241 6241 1 . . coverage=5;ref_base=T; ref_score=0.4450;ref_confi=0.9383;ref_single=0/0;ref_paired=2/2;consen_base=Y; consen_score=0.3646;consen_confi=0.7688;consen_single=0/0;consen_paired=2/2
1 AB_SOLiD SNP caller SNP 9089 9089 1 . . coverage=5;ref_base=T; ref_score=0.4260;ref_confi=0.9483;ref_single=1/1;ref_paired=1/1;consen_base=W; consen_score=0.3639;consen_confi=0.8100;consen_single=1/1;consen_paired=1/1
1 AB_SOLiD SNP caller SNP 9131 9131 1 . . coverage=8;ref_base=C; ref_score=0.7547;ref_confi=0.9306;ref_single=3/3;ref_paired=3/3;consen_base=Y; consen_score=0.2453;consen_confi=0.9073;consen_single=0/0;consen_paired=2/2
1 AB_SOLiD SNP caller SNP 18426 18426 1 . . coverage=3;ref_base=A; ref_score=0.0000;ref_confi=0.0000;ref_single=0/0;ref_paired=0/0;consen_base=G; consen_score=1.0000;consen_confi=0.8163;consen_single=0/0;consen_paired=3/3

The conversion can be done using "-format gff3-solid" argument.

[kai@beta ~/]$ convert2annovar.pl var/Yoruban_snp_18x.gff -format gff3-solid | head
1 997 997 A G hom
1 1371 1371 A G hom
1 2061 2061 G C hom
1 4770 4770 A G hom
1 4793 4793 A G hom
1 5074 5074 T G het
1 6241 6241 T C het
1 9089 9089 T A het
1 9131 9131 C T het
1 18426 18426 A G hom

Adding the --includeinfo argument will print out an additional column with the detailed attribute of the calls.

SOAPsnp format

The Short Oligonucleotide Analysis Package (SOAP) suite is developed by BGI, and SOAPsnp is a component that generates variant calls. An example of the genotype call file is given below:

chr10 84026 G R 55 A 32 9 9 G 29 3 5 14 0.275000 1.42857 1 81
chr10 84541 C M 45 C 27 5 5 A 25 3 4 9 0.285714 1.11111 1 4
chr10 284953 A G 76 G 33 26 26 A 0 0 0 26 1.00000 1.00000 1 9472
chr10 313283 A R 99 G 32 15 15 A 28 8 8 23 0.162302 1.00000 1 28330
chr10 363048 T Y 99 T 26 12 12 C 30 10 10 22 0.461435 1.00000 1 14012
chr10 377060 G A 55 A 33 11 11 G 0 0 0 11 1.00000 1.00000 1 7654
chr10 384714 G A 74 A 33 18 18 G 0 0 0 18 1.00000 1.00000 1 552
chr10 418503 A G 34 G 34 4 4 A 0 0 0 4 1.00000 1.00000 1 7377
chr10 434997 C Y 73 T 26 5 5 C 26 4 4 9 0.682540 1.00000 1 64
chr10 435061 C T 36 T 30 16 16 C 0 0 0 16 1.00000 1.00000 1 64

The convert2annovar.pl program can handle this format, using the "-format soapsnp" argument. An example of the output file is given below:

10 84026 84026 G A het
10 84541 84541 C A het
10 284953 284953 A G hom
10 313283 313283 A G het
10 363048 363048 T C het
10 377060 377060 G A hom
10 384714 384714 G A hom
10 418503 418503 A G hom
10 434997 434997 C T het
10 435061 435061 C T hom

Note that is --includeinfo argument is used, all the information from input file will be included in the output file.

MAQ genotype calling format

The convert2annovar.pl program can handle this format, using the "-format maq " argument. Both SNPs and indels can be correctly processed.

CASAVA genotype calling format

The convert2annovar.pl program can handle this format, using the "-format casava " argument and also specifying the chromosome by "-chr" argument, since CASAVA call file per se does not contain chromosome information. Both SNPs and indels can be correctly processed. This function is not tested rigorously yet. Please report bugs to me.