ANNOVAR

Preparation of input files with variants

  1. Standard format of ANNOVAR input file
  2. Format conversion script: facilitating the generation of ANNOVAR input files
    1. VCF4 genotype calling format
    2. Samtools genotype-calling pileup format
    3. Complete Genomics genotype calling format
    4. GFF3-SOLiD format
    5. SOAPsnp format
    6. MAQ genotype calling format
    7. CASAVA genotype calling format

Standard format of ANNOVAR input file

ANNOVAR takes text-based input files in which each line corresponds to one variant. On each line, the first five space- or tab-delimited columns give the chromosome, start position, end position, reference nucleotides and observed nucleotides. Additional columns can be supplied and will be printed out unchanged. For convenience, users can enter "0" for the reference nucleotides if this information is not readily available. Insertions, deletions and block substitutions are represented in this simple format by using "-" to denote a null nucleotide. One example is given below (it is included as the ex1.human file in the ANNOVAR package), with extra columns that serve as comments on the variants. By default, a 1-based coordinate system is assumed; if the --zerostart argument is given, ANNOVAR uses a half-open zero-based coordinate system instead.

The ANNOVAR package contains a few example input files. For example, the content of the ex1.human file is below:

[kai@beta ~/]$ cat ex1.human
1 161003087 161003087 C T comments: rs1000050, a SNP in Illumina SNP arrays
1 84647761 84647761 C T comments: rs6576700 or SNP_A-1780419, a SNP in Affymetrix SNP arrays
1 13133880 13133881 TC - comments: rs59770105, a 2-bp deletion
1 11326183 11326183 - AT comments: rs35561142, a 2-bp insertion
1 105293754 105293754 A ATAAA comments: rs10552169, a block substitution
1 67478546 67478546 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
2 233848107 233848107 T C comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease
16 49303427 49303427 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
16 49314041 49314041 G C comments: rs2066845 (G908R), a non-synonymous SNP in NOD2
16 49321279 49321279 - C comments: rs2066847 (c.3016_3017insC), a frameshift SNP in NOD2
13 19661686 19661686 G - comments: rs1801002 (del35G), a frameshift mutation in GJB2, associated with hearing loss
13 19695176 20003944 0 - comments: a 342kb deletion encompassing GJB6, associated with hearing loss

The fields of this file are explained in the table below:

Chr  Start      End        Ref  Obs    Comments
1    161003087  161003087  C    T      comments: rs1000050, a SNP in Illumina SNP arrays
1    84647761   84647761   C    T      comments: rs6576700 or SNP_A-1780419, a SNP in Affymetrix SNP arrays
1    13133880   13133881   TC   -      comments: rs59770105, a 2-bp deletion
1    11326183   11326183   -    AT     comments: rs35561142, a 2-bp insertion
1    105293754  105293754  A    ATAAA  comments: rs10552169, a block substitution
16   49303427   49303427   C    T      comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
16   49314041   49314041   G    C      comments: rs2066845 (G908R), a non-synonymous SNP in NOD2
16   49321279   49321279   -    C      comments: rs2066847 (c.3016_3017insC), a frameshift SNP in NOD2
13   19661686   19661686   G    -      comments: rs1801002 (del35G), a frameshift mutation in GJB2, associated with hearing loss
13   19695176   20003944   0    -      comments: a 342kb deletion encompassing GJB6, associated with hearing loss

 

The example above contains several genetic variants. The first variant is a single nucleotide variant, a substitution of C in the reference genome by T. The third variant is a 2-bp deletion, with the observed nucleotides represented by "-". The fourth variant is a 2-bp insertion, with the reference nucleotides represented by "-". The last variant is a large-scale deletion whose reference allele is given as "0", which removes the need to list the reference nucleotides on this line.
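The conventions above can be illustrated with a short Python sketch that splits one input line and labels the variant type. The function name classify_variant and the labels it returns are illustrative assumptions, not part of ANNOVAR.

```python
def classify_variant(line):
    """Split one ANNOVAR-style input line and label the variant type.

    Hypothetical helper for illustration only; ANNOVAR itself is written in Perl.
    """
    fields = line.split()  # columns may be space- or tab-delimited
    if len(fields) < 5:
        raise ValueError("expected at least five columns: " + repr(line))
    chrom, start, end, ref, obs = fields[:5]
    if ref == "-":
        kind = "insertion"           # null reference allele
    elif obs == "-":
        kind = "deletion"            # null observed allele (ref may be "0")
    elif ref == "0" or obs == "0":
        kind = "unspecified"         # nucleotides not supplied
    elif len(ref) == 1 and len(obs) == 1:
        kind = "SNV"
    else:
        kind = "block substitution"
    return {
        "chrom": chrom,
        "start": int(start),
        "end": int(end),
        "ref": ref,
        "obs": obs,
        "kind": kind,
        "extra": " ".join(fields[5:]),  # optional free-form columns
    }
```

Applied to the lines of ex1.human, this labels the third line as a deletion, the fourth as an insertion and the last line (reference allele "0") as a deletion.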

Another example is shown below. Note that the first five columns conform to the specification above, yet all other columns are totally optional and the user can put anything there.

[kai@beta ~/]$ cat ex4.human
7 92570705 92570705 T C 7 3 43 D G SAMD9 1.56
7 98870495 98870495 G A 26 16 62 R C PTCD1 3.06
7 99835402 99835402 C T 13 6 46 P L PILRA 1.75
7 100122289 100122289 - CCT 5 3 60 EQ ERQ GIGYF1 3.98
7 100209410 100209410 G A 15 8 53 R H ZAN 1.81
7 100473466 100473466 A G 38 13 34 T A MUC17 0.60
7 105066159 105066160 TC - 19 3 16 E X ATXN7L1 4.92
7 113306419 113306419 C T 15 6 40 S N PPP1R3A 1.05
7 115411632 115411632 C T 14 5 36 D N TFEC -0.45
7 119702880 119702880 T C 29 3 10 C R KCND2 5.00
7 120216091 120216091 C G 20 10 50 A P TSPAN12 5.32
7 120555712 120555712 T C 10 3 30 L S C7orf58 3.00
7 128099699 128099699 C G 7 5 71 I M FAM71F2 0.42
7 128221650 128221650 T C 13 3 23 L P CCDC136 1.26

In some cases, users may want to specify only positions but not the actual nucleotides. In that case, "0" can be used to fill in the 4th and 5th columns. ANNOVAR can still run on this input file, but there will be no output on amino acid changes. Additionally, the observed allele will be assumed to have the same length as the wild-type allele (as specified by the start and end positions on each line).

If ANNOVAR encounters an invalid input line, it writes that line into a file called $outfile.invalid_input, where $outfile is specified by the --outfile argument. If all input lines are valid, this file will not exist. ANNOVAR does not stop at an invalid line: even if the input file contains empty or malformed lines, it proceeds with the next input line.
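As a rough illustration of this behaviour, the Python sketch below separates valid from invalid lines before a file is passed to ANNOVAR. The validity rules used here (five columns, integer positions with start not greater than end, alleles made of A/C/G/T/N or "-" or "0") are assumptions for illustration; the checks ANNOVAR actually performs may differ.

```python
import re

# Assumed allele pattern: nucleotides, "-" for a null allele, or "0" for unknown.
ALLELE = re.compile(r"^(?:[ACGTNacgtn]+|-|0)$")

def is_valid(line):
    """Return True if the line looks like a usable ANNOVAR input line."""
    fields = line.split()
    if len(fields) < 5:
        return False                      # also rejects empty lines
    _chrom, start, end, ref, obs = fields[:5]
    if not (start.isdigit() and end.isdigit()):
        return False
    if int(start) > int(end):
        return False
    return bool(ALLELE.match(ref)) and bool(ALLELE.match(obs))

def split_lines(lines):
    """Partition lines into (valid, invalid) without stopping at bad lines."""
    valid, invalid = [], []
    for line in lines:
        (valid if is_valid(line) else invalid).append(line)
    return valid, invalid
```

A wrapper script could write the second list to a separate file only when it is non-empty, mirroring the fact that the .invalid_input file does not exist when every line is valid.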

The download package contains several example input files. The users can check them out.

Format conversion script: facilitating the generation of ANNOVAR input files

The convert2annovar.pl script provides some very rudimentary utilities to convert other "genotype calling" formats into ANNOVAR format. Currently, the program can handle Samtools genotype-calling pileup format, Illumina export format from GenomeStudio, SOLiD GFF genotype-calling format, Complete Genomics variant format, and VCF format, as well as the SOAPsnp, MAQ and CASAVA formats described below.

1. VCF4 genotype calling format

The "-format vcf4" argument should be specified. Both SNPs and indels can be processed.

For example, for a VCF4 file containing genotype calls,

[kaiwang@biocluster ~/]$ convert2annovar.pl 84060.snp.vcf -format vcf4 | head
NOTICE: Detected that the VCF4 file is generated by GATK UnifiedGenotyper
NOTICE: column 6-10 represent heterozygosity status, quality score, read depth, RMS mapping quality, quality by depth
1 55 55 T G het 34.82 2 14.16 17.41
1 2646 2646 G A het 40.91 4 7.50 10.23
1 5074 5074 T G het 37.82 1 12.00 37.82
1 7401 7401 C A het 38.19 2 26.16 19.10
1 9907 9907 C T het 38.80 2 12.02 19.40
1 9993 9993 G A het 36.80 1 18.00 36.80
1 59133 59133 A G hom 756.78 175 5.36 4.32
1 59374 59374 A G het 1189.97 39 18.30 30.51
1 59760 59760 T C hom 169.09 132 2.92 1.28
1 81399 81399 G T het 35.63 1 37.00 35.63

The NOTICE lines give a brief description of columns 6-10 in the output. Use the -includeinfo argument to keep all the information from the input line.

The indel calling format may be substantially different from the SNP calling format. For example,

[kaiwang@biocluster ~/]$ convert2annovar.pl 84060.indel.vcf -format vcf4 | head
NOTICE: Detected that the VCF4 file is generated by GATK IndelGenotyper
NOTICE: column 6-9 represent heterozygosity status, quality score, read depth, read count supporting indel call, RMS mapping quality
1 2525324 2525324 - C het . 12 5 29.0
1 3553373 3553373 C - het . 6 6 60.0
1 6093012 6093012 G - het . 32 31 59.64516
1 8928691 8928691 - C het . 13 7 64.28571
1 9227566 9227573 CCCAGGCA - het . 36 12 35.833332
1 9704618 9704618 - C het . 6 2 49.5
1 10279794 10279794 T - het . 9 4 62.5
1 10343439 10343439 - T het . 8 7 61.42857
1 10347705 10347705 T - het . 9 6 54.833332
1 11003310 11003310 - G het . 8 8 47.0

As another example, suppose we want to process 1000 Genomes Project genotype calls:

[kaiwang@biocluster ~/]$ convert2annovar.pl CEU.low_coverage.2010_07.sites.vcf -format vcf4 -includeinfo | head
WARNING: the support for '--format vcf4' is not well developed yet and may contain bugs.
1 533 533 G C . 423 AA=.;AC=6;AN=120;DP=423
1 41342 41342 T A . 188 AA=.;AC=29;AN=120;DP=188
1 41791 41791 G A . 192 AA=.;AC=5;AN=120;DP=192
1 44449 44449 T C . 166 AA=C;AC=2;AN=120;DP=166
1 44539 44539 C T . 131 AA=T;AC=2;AN=120;DP=131
1 44571 44571 G C . 135 AA=g;AC=7;AN=120;DP=135
1 45162 45162 C T . 166 AA=c;AC=20;AN=120;DP=166;HM2
1 52066 52066 T C . 159 AA=C;AC=18;AN=120;DP=159
1 53534 53534 G A . 243 AA=G;AC=18;AN=120;DP=243
1 75891 75891 T C . 182 AA=.;AC=11;AN=120;DP=182

In the above command, we used the -includeinfo argument, so that the additional information is printed out on the same line.

Similarly, we can process indel calls (here -i is used as an abbreviation of -includeinfo):

[kaiwang@biocluster ~/]$ convert2annovar.pl CEU.low_coverage.2010_07.indel.sites.vcf -format vcf4 -i | head
WARNING: the support for '--format vcf4' is not well developed yet and may contain bugs.
1 1000154 1000157 CACA - 100 615 AF=0.115095;HP=1;NF=16;NR=13;NS=52;CA=0;DP=615
1 1000907 1000907 A - 48 281 AF=0.0772696;HP=1;NF=2;NR=9;NS=51;CA=0;DP=281
1 1000951 1000951 G - 100 291 AF=0.447771;HP=5;DB;NF=10;NR=20;NS=50;CA=M;DP=291
1 1010786 1010786 - G 100 306 AF=0.774334;HP=1;DB;NF=21;NR=27;NS=51;CA=0;DP=306
1 1026158 1026158 - GGGGG 100 591 AF=0.115637;HP=1;NF=5;NR=2;NS=52;CA=0;DP=591
1 1028861 1028863 CTC - 10 522 AF=0.0107751;HP=1;NF=5;NR=4;NS=52;CA=0;DP=522
1 1040517 1040517 - A 100 426 AF=0.0577672;HP=1;NF=7;NR=4;NS=52;CA=0;DP=426
1 1043690 1043690 - G 100 427 AF=0.065857;HP=3;DB;NF=5;NR=8;NS=51;CA=0;DP=427
1 1049375 1049375 - ACACACCTGAGCACACACACCTGTGC 100 643 AF=0.0808078;HP=1;NF=1;NR=1;NS=51;CA=M;DP=643
1 1055460 1055460 C - 100 342 AF=0.936429;HP=2;NF=13;NR=24;NS=51;CA=M;DP=342

Technical discussions (for advanced ANNOVAR users, open for more discussions if you want to contribute):

1. Multi-allelic calls: Sometimes, multi-allelic variants are called in VCF files. For example, one variant is below (A is changed to both C and G):

1 156706559 . A C,G 114 . DP=20;AF1=1;CI95=1,1;DP4=0,0,1,19;MQ=60;FQ=-63 GT:PL:GQ 1/2:237,126,90,162,0,138:99

By default, convert2annovar.pl will only regard C as the mutation. If you add -allallele argument, then the program will print out two output lines, one for C and one for G.

There are exceptions. For example,

1 11297762 . T C,A 98 . DP=19;AF1=1;CI95=1,1;DP4=0,0,17,1;MQ=60;FQ=-78 GT:PL:GQ 1/1:131,51,0,120,28,117:99

Here the subject is called as homozygous for the first alternative allele (genotype 1/1, i.e. C/C), but since there was one read containing A, samtools still keeps both alleles in the VCF file (while giving a very low probability to the second one). In this case, it is reasonable to assume that this is a CC genotype and to ignore the second allele, A.
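The default and -allallele behaviours described above can be sketched in Python as follows. This is only an illustration of the described output for single-nucleotide records, not the code used by convert2annovar.pl; the function name and the all_alleles parameter are hypothetical.

```python
def vcf_snv_to_annovar(vcf_line, all_alleles=False):
    """Turn one VCF data line into ANNOVAR-style lines (SNVs only).

    all_alleles=False keeps only the first ALT allele;
    all_alleles=True emits one output line per ALT allele.
    """
    fields = vcf_line.split()
    chrom, pos, _id, ref, alt = fields[:5]
    alts = alt.split(",")
    if not all_alleles:
        alts = alts[:1]
    out = []
    for allele in alts:
        if len(ref) != 1 or len(allele) != 1:
            continue  # indels need coordinate adjustment, not covered here
        out.append(f"{chrom} {pos} {pos} {ref} {allele}")
    return out
```

For the A to C,G record shown above, the default call yields one line for C, and all_alleles=True yields one line for C and one for G.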

Remember that convert2annovar.pl can preserve the VCF file format, so that you can process any VCF file in ANNOVAR (annotation, filtering, etc.) and in the end still obtain a VCF file that is a subset of the original VCF file. Read details here. For multi-allelic variant calls, if you add the -allallele argument, one input line becomes two output lines, so you need to clean up the final results before converting back to a VCF file.

2. Ambiguous indels: Sometimes, indels can be found in a stretch of identical nucleotides, so that the exact location of indel cannot be identified. Similarly, some insertions have identical first and last nucleotide, so that there are two choices to specify the location of insertion. ANNOVAR will use a simple convention that the leftmost coordinate and nucleotide should be always used, whenever ambiguity exists. I believe that this is really the only good way to solve the inconsistency in indel definitions in all variant calling systems, but obviously most other people do not think so.

Some other tools address this issue by several means: First, they can just print out the whole block of nucleotide including a different nucleotide (for example, A) before a polymer (for example, TT). Then it is up to the individual researchers to decide how they want to handle this scenario. For example, the same indel can be annotated as two versions below:

Experimental sample VCF file: chr1  11086097 AT   ATT

1KGP Dindel in VCF format:      chr1 11086097  A  AT

Second, some software uses an external reference (such as a pre-existing library of "known indels"), tries to match any observed indel to these "known indels", and then uses the location of the "known indel" when a slight inconsistency exists. This has some theoretical advantages in some situations, but it has its own set of issues as well.

In convert2annovar.pl, the way to represent this variant is to treat it as a simple one-base insertion. That is, if you run convert2annovar.pl (from either of the two variants above), you will see

variant from convert2annovar.pl:  chr1  11086097 11086097 -  T

and any subsequent annotation will be based on this new single-base-insertion variant.

Similarly, suppose you have a change from C to CTC. It could be that there is an insertion of "TC" after C, or that there is a "CT" insertion before C. convert2annovar.pl will treat this as an insertion of "TC" after C.
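The C to CTC convention can be sketched in Python for the simple case where the reference allele is an exact prefix of the alternative allele. This is an illustrative assumption about how such a record could be rewritten, not the script's actual code. It does not shift an indel leftward through a repeat, so it is not a substitute for a proper normalization tool.

```python
def simple_insertion(pos, ref, alt):
    """Rewrite a VCF-style insertion (ALT starts with REF) in ANNOVAR style.

    Returns (start, end, ref, obs), where the inserted bases are reported
    after the last reference base that is shared by REF and ALT.
    """
    if not (len(alt) > len(ref) and alt.startswith(ref)):
        raise ValueError("ALT must extend REF on the right")
    anchor = pos + len(ref) - 1          # last base shared by REF and ALT
    return anchor, anchor, "-", alt[len(ref):]
```

With this sketch, A to AT at position 11086097 becomes the single-base insertion "11086097 11086097 - T", and C to CTC becomes an insertion of "TC" after the C.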

convert2annovar.pl per se will not do this type of conversion. In other words, it will work on the variant that you supply in the input file. If you supply AT->ATT, or A->AT, or -->T, convert2annovar.pl will not attempt to alter them. So if you are doing filter-based annotation and the database file contains only -->T, then the first two variants will be regarded as "un-annotated" variants.

3. Flagging in VCF4 file: Although convert2annovar.pl has some built-in ability to print out a subset of higher-quality VCF4 calls (for example, coverage>20), individual investigators may want to filter the VCF4 file themselves with other, more sophisticated software tools before feeding the variant calls to convert2annovar.pl. This is done by putting a flag in the FILTER field of the VCF4 file and then using the -filter argument in convert2annovar.pl to process only calls with a user-specified flag.

For example, I can put this argument in GATK VariantFiltration walker: --clusterWindowSize 10 --filterExpression "MQ0 >= 4 && ((MQ0 / (1.0 * DP)) > 0.1)" --filterName "HARD_TO_VALIDATE" -B:mask,VCF lane1_1.fq.indel.vcf --filterExpression "QUAL < 30.0 || QD < 5.0 || HRun > 5 || SB > -0.10" --filterName GATKStandard --filterExpression "DP < 6" --filterName "DPlt6"

Then the new VCF4 file will have different filter annotations and I can use "convert2annovar.pl -filter PASS" subsequently to convert to annovar input format.
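The idea of selecting calls by their FILTER flag can be sketched in Python as follows. In a VCF file the FILTER field is the seventh tab-delimited column. Treating a semicolon-separated FILTER value as a list of flags, any one of which may match, is an assumption of this sketch and not a statement about how -filter is implemented.

```python
def keep_by_filter(vcf_lines, flag="PASS"):
    """Return the VCF data lines whose FILTER column contains the given flag."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            continue                      # skip header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue                      # not a complete VCF record
        if flag in fields[6].split(";"):  # FILTER is column 7
            kept.append(line)
    return kept
```

With flag="PASS" this keeps only records that passed every filter; with a name such as "DPlt6" it would select the records that were flagged by that filter.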

2. Samtools genotype-calling pileup format

Note that there are many different pileup formats, but here we are dealing with the (now-obsolete as of 2011) "genotype-calling" pileup, which contains the variant calls in one of the columns. A more detailed description is given at the Samtools website. An example of how to generate the "genotype-calling" pileup file is shown below:

samtools pileup -vcf ref.fa aln.bam > raw.pileup

This command generates a pileup file that contains the consensus calls made with the model implemented in MAQ (there are certainly many other specialized SNP callers that users can freely choose). An example of the genotype-calling pileup format generated by Samtools is shown below:

chr1 556674 G G 54 0 60 16 a,.....,...,.... (B%A+%7B;0;%=B<:
chr1 556675 C C 55 0 60 16 ,,..A..,...,.... CB%%5%,A/+,%....
chr1 556676 C C 59 0 60 16 g,.....,...,.... .B%%.%.?.=/%...1
chr1 556677 G G 75 0 60 16 ,$,.....,...,.... .B%%9%5A6?)%;?:<
chr1 556678 G K 60 60 60 24 ,$.....,...,....^~t^~t^~t^~t^~t^~t^~t^~t^~t B%%B%<A;AA%??<=??;BA%B89
chr1 556679 C C 61 0 60 23 .....a...a....,,,,,,,,, %%1%&?*:2%*&)(89/1A@B@@
chr1 556680 G K 88 93 60 23 ..A..,..A,....ttttttttt %%)%7B:B0%55:7=>>A@B?B;
chr1 556681 C C 102 0 60 25 .$....,...,....,,,,,,,,,^~,^~. %%3%.B*4.%.34.6./B=?@@>5.
chr1 556682 A A 70 0 60 24 ...C,...,....,,,,,,,,,,. %:%(B:A4%7A?;A><<999=<<'
chr1 556683 G G 99 0 60 24 ....,...,....,,,,,,,,,,. %A%3B@%?%C?AB@BB/./-1A7?

The columns are chromosome, 1-based coordinate, reference base, consensus base (IUPAC nomenclature for nucleotides), consensus quality, SNP quality, maximum mapping quality of the reads covering the sites, the number of reads covering the site, read bases and base qualities.

The convert2annovar.pl program can convert the pileup file format to ANNOVAR input files. By default, the "-snpqual 20" argument will be imposed, so that only SNPs reaching quality score >=20 will be processed and written to output files. The output varlist file contains the called mutations in ANNOVAR format (non-mutations are obviously not in the output file).
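To make the column layout and the SNP quality threshold concrete, here is a small Python sketch that parses the first eight pileup columns and keeps SNP sites with SNP quality of at least 20. It handles SNP lines only (indel lines, which use "*" as the reference base, are skipped), and the names PileupSite, parse_pileup_line and passing_snps are hypothetical.

```python
from collections import namedtuple

# First eight columns of the genotype-calling pileup format described above.
PileupSite = namedtuple(
    "PileupSite",
    "chrom pos ref consensus cons_qual snp_qual map_qual depth",
)

def parse_pileup_line(line):
    """Parse the leading columns of one pileup line."""
    f = line.split()
    return PileupSite(f[0], int(f[1]), f[2], f[3],
                      int(f[4]), int(f[5]), int(f[6]), int(f[7]))

def passing_snps(lines, snpqual=20):
    """Keep SNP sites whose consensus differs from the reference base
    and whose SNP quality reaches the threshold."""
    sites = (parse_pileup_line(line) for line in lines)
    return [s for s in sites
            if s.ref != "*"
            and s.consensus.upper() != s.ref.upper()
            and s.snp_qual >= snpqual]
```

On the ten example lines above, only positions 556678 and 556680 (consensus K, SNP quality 60 and 93) would be kept.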

As of the January 2011 version of ANNOVAR, the handling of pileup files is fairly mature and stable. Note that the first five columns conform to the standard ANNOVAR input format, while the sixth and following columns give information on the alleles.

[kaiwang@biocluster ~/]$ convert2annovar.pl 84060.pileup -coverage 10 | head
NOTICE: the default --format argument is set as 'pileup'
NOTICE: the default --snpqual argument for pileup format is set as 20
NOTICE: Column 6-9 in output are heterozygosity status, SNP quality, total reads, reads with mutation
1 20139 20140 CA - het 563 53 10
1 59374 59374 A G hom 129 39 37
1 798677 798677 T G het 30 26 4
1 798785 798785 G A hom 132 38 37
1 798791 798791 C T hom 156 46 45
1 799544 799544 G A het 35 39 7
1 799550 799550 G C het 64 38 8
1 799595 799595 T C het 24 28 5
1 861034 861034 A C het 46 14 4

The NOTICE line above tells the user what columns 6-9 mean in the output. In the first line, we see an indel with a depth of coverage of 53, and 10 of these reads support the indel. In the second line, we see a SNP with a depth of coverage of 39, and 37 of these reads support the alternative allele (G). These additional columns help users decide whether the variant calls are reliable.

The -fraction argument can be used to filter out variants whose alternative allele is supported by too small a fraction of the reads. For example, to require that every variant call be supported by at least 40% of the reads covering a site, we can use:

[kaiwang@biocluster ~/]$ convert2annovar.pl 84060.pileup -coverage 10 -fraction 0.4 | head
NOTICE: the default --format argument is set as 'pileup'
NOTICE: the default --snpqual argument for pileup format is set as 20
NOTICE: Column 6-9 in output are heterozygosity status, SNP quality, total reads, reads with mutation
1 59374 59374 A G hom 129 39 37
1 798785 798785 G A hom 132 38 37
1 798791 798791 C T hom 156 46 45
1 871781 871781 G A het 198 13 8
1 873954 873954 C G het 211 16 7
1 873964 873964 A C het 172 17 8
1 877423 877423 A C hom 84 19 19
1 877664 877664 A G hom 63 12 12
1 878502 878502 T C hom 69 14 14
1 881483 881483 C A het 228 24 10

As can be seen by comparing the two output files, the first line of indel is no longer in output, because 10/53<40%.
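The arithmetic behind this filter can be sketched in Python, using columns 8 and 9 of the converted output (total reads and reads with mutation). Whether the real threshold comparison is inclusive is an assumption of this sketch.

```python
def passes_fraction(annovar_line, fraction):
    """Check whether the alternative allele is supported by at least
    the given fraction of reads (columns 8 and 9 of the pileup output)."""
    fields = annovar_line.split()
    total_reads = int(fields[7])
    alt_reads = int(fields[8])
    return total_reads > 0 and alt_reads / total_reads >= fraction
```

For the indel line "1 20139 20140 CA - het 563 53 10", the supporting fraction is 10/53, or about 0.19, so the line is dropped at a threshold of 0.4, whereas "1 59374 59374 A G hom 129 39 37" (37/39) is kept.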

Some additional useful arguments include: -altcov, which specifies the minimum coverage for the alternative allele (the -coverage specifies coverage for all reads regardless of whether they support reference allele or alternative allele); -maxcoverage, which specifies the maximum coverage level to print out this variant; --includeinfo, which specifies that all information in the input line should be included in the output line by appending them after the printed columns.

After the program finishes, it prints out some statistics. Normally, for whole-genome sequencing on humans, the heterozygotes:homozygotes ratio should be around 2:1 and the transitions:transversions ratio should be around 2:1. (ANNOVAR versions before September 2010 had a bug in the ratio calculation, which has since been fixed.)

Advanced notes: When the chromosome is "M", ANNOVAR will not print out "hom" or "het"; instead, it will print out a number between 0 and 1 giving the fraction of reads that support the alternative allele. Use the -chrmt argument if the mitochondrial genome is not named M in your alignment.
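These two ratios can also be recomputed from the converted output as a sanity check. The Python sketch below counts het/hom calls from column 6 and classifies single-nucleotide changes as transitions (A<->G, C<->T) or transversions. It is an independent illustration, not the statistics code of convert2annovar.pl.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def summarize(lines):
    """Count het/hom calls and transitions/transversions in ANNOVAR-style
    lines whose sixth column is 'het' or 'hom'."""
    counts = {"het": 0, "hom": 0, "ti": 0, "tv": 0}
    for line in lines:
        f = line.split()
        ref, obs, zyg = f[3].upper(), f[4].upper(), f[5]
        if zyg in ("het", "hom"):
            counts[zyg] += 1
        is_snv = (len(ref) == 1 and len(obs) == 1
                  and ref in "ACGT" and obs in "ACGT" and ref != obs)
        if is_snv:  # indels are counted for het/hom but not for ti/tv
            counts["ti" if (ref, obs) in TRANSITIONS else "tv"] += 1
    return counts
```

The ratios are then counts["het"] / counts["hom"] and counts["ti"] / counts["tv"], provided the denominators are non-zero.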

 

3. Complete Genomics genotype calling format

Complete Genomics provides many genotype-calling files for their customers. Among them is a var*ASM.tsv file that looks like the one below.

[kai@beta ~/]$ head -n 20 var-GS000000088-ASM.tsv
#BUILD 1.5.0.5
#GENERATED_AT 2009-Nov-03 19:52:21.722927
#GENERATED_BY dbsnptool
#TYPE VAR-ANNOTATION
#VAR_ANN_SET /Proj/Pipeline/Production_Data/REF/HUMAN-F_06-REF/dbSNP.csv
#VAR_ANN_TYPE dbSNP
#VERSION 0.3
>locus ploidy haplotype chromosome begin end varType reference alleleSeq totalScore hapLink xRef
1 2 all chr1 0 959 no-call = ?
2 2 all chr1 959 972 = = =
3 2 all chr1 972 1001 no-call = ?
4 2 all chr1 1001 1008 = = =
5 2 all chr1 1008 1114 no-call = ?
6 2 all chr1 1114 1125 = = =
7 2 all chr1 1125 1191 no-call = ?
8 2 all chr1 1191 1225 = = =
9 2 all chr1 1225 1258 no-call = ?
10 2 all chr1 1258 1267 = = =
12 2 all chr1 1267 1275 no-call = ?

The convert2annovar.pl program can be used to convert this file to ANNOVAR format, using the "-format cg" argument. The output file looks like this:

[kai@beta ~/]$ head var-GS000000088-ASM.tsv.snp
1 28095 28095 A G snp 67 dbsnp:rs806727 hom
1 31844 31844 A G snp 133 dbsnp:rs806721 hom
1 37155 37155 T G snp 51 dbsnp:rs2691275 het
1 44449 44449 T C snp 74 het
1 45679 45679 G A snp 191 dbsnp:rs3020699 het
1 45713 45713 C G snp 191 het
1 45789 45789 T C snp 138 dbsnp:rs3020698 hom
1 46662 46662 T C snp 69 dbsnp:rs2691309 het
1 47109 47109 C G snp 56 dbsnp:rs2691313 het
1 47815 47815 A C snp 67 dbsnp:rs2691334 hom

An example command line session is given below:

[kai@node-r1-u35-c2-p14-o4 ~/]$ convert2annovar.pl -format cg -out GS000000455.query var-GS000000455-ASM.tsv
NOTICE: Converting variants from var-GS000000455-ASM.tsv
NOTICE: Done with 25667914 lines

[kai@node-r1-u35-c2-p14-o4 ~/]$ wc -l GS000000455.query
3728645 GS000000455.query

In this example, 25.7 million lines from the Complete Genomics var*ASM.tsv file are processed, and 3.7 million variants are written to the output file in ANNOVAR input format.

4. GFF3-SOLiD format

Sometimes variant calls are in GFF3 format, and they can be converted to ANNOVAR input format. (This input file should not be confused with a GFF3 annotation database, as they serve different purposes. Here we are dealing with input files only.) For example, SOLiD provides SNP variant calls in the following format:

[kai@beta ~/]$ head -n 20 var/Yoruban_snp_18x.gff
##gff-version 3
##solid-gff-version 0.3
##source-version 2
##type DNA
##date 2009-03-13
##time 0:0:0
##feature-ontology http://song.cvs.sourceforge.net/*checkout*/song/ontology/sofa.obo?revision=1.141
##reference-file
##input-files /data/results3/yoruban_strikes_back/files_for_Aaron/NA18507_18x_SOLiD_SNP_calls_hg18_dbSNP129_annotated.txt
##run-path
1 AB_SOLiD SNP caller SNP 997 997 1 . . coverage=3;ref_base=A; ref_score=0.3359;ref_confi=0.9528;ref_single=0/0;ref_paired=1/1;consen_base=G; consen_score=0.6641;consen_confi=0.9420;consen_single=0/0;consen_paired=2/2
1 AB_SOLiD SNP caller SNP 1371 1371 1 . . coverage=2;ref_base=A; ref_score=0.0000;ref_confi=0.0000;ref_single=0/0;ref_paired=0/0;consen_base=G; consen_score=1.0000;consen_confi=0.8717;consen_single=0/0;consen_paired=2/2
1 AB_SOLiD SNP caller SNP 2061 2061 1 . . coverage=2;ref_base=G; ref_score=0.0000;ref_confi=0.0000;ref_single=0/0;ref_paired=0/0;consen_base=C; consen_score=1.0000;consen_confi=0.9138;consen_single=0/0;consen_paired=2/2
1 AB_SOLiD SNP caller SNP 4770 4770 1 . . coverage=2;ref_base=A; ref_score=0.0000;ref_confi=0.0000;ref_single=0/0;ref_paired=0/0;consen_base=G; consen_score=1.0000;consen_confi=0.8699;consen_single=0/0;consen_paired=2/2
1 AB_SOLiD SNP caller SNP 4793 4793 1 . . coverage=16;ref_base=A; ref_score=0.0689;ref_confi=0.9384;ref_single=0/0;ref_paired=1/1;consen_base=G; consen_score=0.6858;consen_confi=0.8494;consen_single=0/0;consen_paired=11/10
1 AB_SOLiD SNP caller SNP 5074 5074 1 . . coverage=4;ref_base=T; ref_score=0.5165;ref_confi=0.9355;ref_single=2/2;ref_paired=0/0;consen_base=K; consen_score=0.4835;consen_confi=0.8759;consen_single=1/1;consen_paired=1/1
1 AB_SOLiD SNP caller SNP 6241 6241 1 . . coverage=5;ref_base=T; ref_score=0.4450;ref_confi=0.9383;ref_single=0/0;ref_paired=2/2;consen_base=Y; consen_score=0.3646;consen_confi=0.7688;consen_single=0/0;consen_paired=2/2
1 AB_SOLiD SNP caller SNP 9089 9089 1 . . coverage=5;ref_base=T; ref_score=0.4260;ref_confi=0.9483;ref_single=1/1;ref_paired=1/1;consen_base=W; consen_score=0.3639;consen_confi=0.8100;consen_single=1/1;consen_paired=1/1
1 AB_SOLiD SNP caller SNP 9131 9131 1 . . coverage=8;ref_base=C; ref_score=0.7547;ref_confi=0.9306;ref_single=3/3;ref_paired=3/3;consen_base=Y; consen_score=0.2453;consen_confi=0.9073;consen_single=0/0;consen_paired=2/2
1 AB_SOLiD SNP caller SNP 18426 18426 1 . . coverage=3;ref_base=A; ref_score=0.0000;ref_confi=0.0000;ref_single=0/0;ref_paired=0/0;consen_base=G; consen_score=1.0000;consen_confi=0.8163;consen_single=0/0;consen_paired=3/3

The conversion can be done using "-format gff3-solid" argument.

[kai@beta ~/]$ convert2annovar.pl var/Yoruban_snp_18x.gff -format gff3-solid | head
1 997 997 A G hom
1 1371 1371 A G hom
1 2061 2061 G C hom
1 4770 4770 A G hom
1 4793 4793 A G hom
1 5074 5074 T G het
1 6241 6241 T C het
1 9089 9089 T A het
1 9131 9131 C T het
1 18426 18426 A G hom

Adding the --includeinfo argument will print out an additional column with the detailed attribute of the calls.
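The relationship between the SOLiD GFF input and the converted output shown above can be reproduced with a short Python sketch. It reads ref_base and consen_base from the attribute column and decodes the IUPAC consensus code into an observed allele and a het/hom status. The logic is inferred from the example input and output on this page rather than from the script's source, and the function name is hypothetical.

```python
# IUPAC codes for one or two nucleotides.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
}

def solid_gff_to_annovar(gff_line):
    """Convert one tab-delimited SOLiD GFF3 SNP line to an ANNOVAR-style line.

    Returns None when there is not exactly one non-reference allele.
    """
    cols = gff_line.rstrip("\n").split("\t")
    chrom, start, end, attr_text = cols[0], cols[3], cols[4], cols[8]
    attrs = {}
    for item in attr_text.split(";"):
        if "=" in item:
            key, value = item.strip().split("=", 1)
            attrs[key] = value.strip()
    ref = attrs["ref_base"].upper()
    alleles = IUPAC[attrs["consen_base"].upper()]
    non_ref = [base for base in alleles if base != ref]
    if len(non_ref) != 1:
        return None   # no variant, or two non-reference alleles
    zygosity = "hom" if len(alleles) == 1 else "het"
    return f"{chrom} {start} {end} {ref} {non_ref[0]} {zygosity}"
```

For the site at position 5074 (ref_base=T, consen_base=K), K stands for G or T, so the sketch reports "1 5074 5074 T G het", in agreement with the converted output above.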

5. SOAPsnp format

The Short Oligonucleotide Analysis Package (SOAP) suite is developed by BGI, and SOAPsnp is a component that generates variant calls. An example of the genotype call file is given below:

chr10 84026 G R 55 A 32 9 9 G 29 3 5 14 0.275000 1.42857 1 81
chr10 84541 C M 45 C 27 5 5 A 25 3 4 9 0.285714 1.11111 1 4
chr10 284953 A G 76 G 33 26 26 A 0 0 0 26 1.00000 1.00000 1 9472
chr10 313283 A R 99 G 32 15 15 A 28 8 8 23 0.162302 1.00000 1 28330
chr10 363048 T Y 99 T 26 12 12 C 30 10 10 22 0.461435 1.00000 1 14012
chr10 377060 G A 55 A 33 11 11 G 0 0 0 11 1.00000 1.00000 1 7654
chr10 384714 G A 74 A 33 18 18 G 0 0 0 18 1.00000 1.00000 1 552
chr10 418503 A G 34 G 34 4 4 A 0 0 0 4 1.00000 1.00000 1 7377
chr10 434997 C Y 73 T 26 5 5 C 26 4 4 9 0.682540 1.00000 1 64
chr10 435061 C T 36 T 30 16 16 C 0 0 0 16 1.00000 1.00000 1 64

The convert2annovar.pl program can handle this format, using the "-format soapsnp" argument. An example of the output file is given below:

10 84026 84026 G A het
10 84541 84541 C A het
10 284953 284953 A G hom
10 313283 313283 A G het
10 363048 363048 T C het
10 377060 377060 G A hom
10 384714 384714 G A hom
10 418503 418503 A G hom
10 434997 434997 C T het
10 435061 435061 C T hom

Note that if the --includeinfo argument is used, all the information from the input file will be included in the output file.

6. MAQ genotype calling format

The convert2annovar.pl program can handle this format, using the "-format maq" argument. Both SNPs and indels can be correctly processed.

7. CASAVA genotype calling format

The convert2annovar.pl program can handle this format, using the "-format casava" argument and also specifying the chromosome with the "-chr" argument, since the CASAVA call file per se does not contain chromosome information. Both SNPs and indels can be correctly processed. This function has not been rigorously tested yet. Please report bugs to me.