|
Preparation of input files with variants
Standard format of ANNOVAR input file ANNOVAR takes text-based input files, where each line corresponds to one variant. On each line, the first five space- or tab- delimited columns represent chromosome, start position, end position, the reference nucleotides and the observed nucleotides. Additional columns can be supplied and will be printed out in identical form. For convenience, users can use “0” to fill in the reference nucleotides, if this information is not readily available. Insertions, deletions or block substitutions can be readily represented by this simple file format, by using “–” to represent a null nucleotide. One example is given below (this example is included as ex1.human file in the ANNOVAR package), with extra columns that serve as comments on the variants. By default, 1-based coordinate system will be assumed; if --zerostart argument is issued, a half-open zero-based coordinate system will be used in ANNOVAR instead. The ANNOVAR package contains a few example input files. For example, the content of the ex1.human file is below: [kai@beta ~/]$ cat ex1.human The field of this file is explained in the table below:
The example above contains several genetic variants. The first variant is a single nucleotide variant, with a substitution of C in reference genome to T. The third variant is a 2-bp deletion, with the observed nucleotides being represented by "-". The fourth variant is a 2-bp insertion, since the reference nucleotide in the reference genome is represented by “–”. The last variant is a large-scale deletion, but the reference allele is represented by “0”, eliminating the need to include reference nucleotides on this line. Another example is shown below. Note that the first five columns conform to the specification above, yet all other columns are totally optional and the user can put anything there. [kai@beta ~/]$ cat ex4.human In some cases, users may want to specify only positions but not the actual nucleotides. In that case, "0" can be used to fill in the 4th and 5th column. ANNOVAR can still run on this input file, but obviously there is no output on amino acid changes. Additionally, the observed amino acid will be assumed to be of equal length of the wildtype allele (as specified by the start and end position at each line). If ANNOVAR encounters an invalid input line, it will write the invalid line into a file called $outfile.invalid_input where $outfile is specified by the --outfile argument. If all input lines are of valid format, this output file will not exist. Therefore, even if the input file contains empty lines or invalid format, ANNOVAR can still proceed with the next input line. The download package contains several example input files. The users can check them out. Format conversion script: facilitating the generation of ANNOVAR input files The convert2annovar.pl script provide some very rudimentary utility to convert other "genotype calling" format into ANNOVAR format. Currently, the program can handle Samtools genotype-calling pileup format, Illumina export format from GenomeStudio, SOLiD GFF genotype-calling format, Complete Genomics variant format, and VCF format. 1. VCF4 genotype calling format The "-format vcf4 " argument should be specified. Both SNPs and indels can be processed. For example, for VCF4 file containing genotype calls, [kaiwang@biocluster ~/]$ convert2annovar.pl 84060.snp.vcf -format vcf4 | head The NOTICE line gives a brief format description for column 6-10 in the output file. Use -includeinfo if the user wants to have all the information. The indel calling format may be substantially different from the SNP calling format. For example, [kaiwang@biocluster ~/]$ convert2annovar.pl 84060.indel.vcf -format vcf4 | head As another example, suppose we want to process 1000 Genomes Project genotype calls [kaiwang@biocluster ~/]$ convert2annovar.pl CEU.low_coverage.2010_07.sites.vcf -format vcf4 -includeinfo | head In the above command, we used -includeinfo argument, so that the additional information are printed out in the same line. Similarly, we can process indel calls [kaiwang@biocluster ~/]$ convert2annovar.pl CEU.low_coverage.2010_07.indel.sites.vcf -format vcf4 -i | head Technical discussions (for advanced ANNOVAR users, open for more discussions if you want to contribute): 1. Multi-allelic calls: Sometimes, multi-allelic variants are called in VCF files. For example, one variant is below (A is changed to both C and G): 1 156706559 . A C,G 114 . DP=20;AF1=1;CI95=1,1;DP4=0,0,1,19;MQ=60;FQ=-63 GT:PL:GQ 1/2:237,126,90,162,0,138:99 By default, convert2annovar.pl will only regard C as the mutation. If you add -allallele argument, then the program will print out two output lines, one for C and one for G. There are exceptions. For example, 1 11297762 . T C,A 98 . DP=19;AF1=1;CI95=1,1;DP4=0,0,17,1;MQ=60;FQ=-78 GT:PL:GQ 1/1:131,51,0,120,28,117:99 Here the subject is called as homozygous for the first alternative allele (genotype 1/1. i.e. C/C), but since there was one read containing A, samtools still keep both alleles in the VCF file (but gives a very low probabilities for it). In this case, it is reasonable to assume that this is a CC genotype and ignore the second A allele. Remember that convert2annovar.pl has the ability to maintain the VCF file format, such that you can process any VCF file in ANNOVAR by annotation, filtering, etc, and ultimately, in the end, you still get a VCF file that is a subset of the original VCF file. Read details here. Obviously, for multi-allelic variant calls, if you add -allallele argument, one input line becomes two output lines so that you need to do some clean-up of the final results before converting back to VCF file. 2. Ambiguous indels: Sometimes, indels can be found in a stretch of identical nucleotides, so that the exact location of indel cannot be identified. Similarly, some insertions have identical first and last nucleotide, so that there are two choices to specify the location of insertion. ANNOVAR will use a simple convention that the leftmost coordinate and nucleotide should be always used, whenever ambiguity exists. I believe that this is really the only good way to solve the inconsistency in indel definitions in all variant calling systems, but obviously most other people do not think so. Some other tools address this issue by several means: First, they can just print out the whole block of nucleotide including a different nucleotide (for example, A) before a polymer (for example, TT). Then it is up to the individual researchers to decide how they want to handle this scenario. For example, the same indel can be annotated as two versions below: Experimental sample VCF file: chr1 11086097 AT ATT 1KGP Dindel in VCF format: chr1 11086097 A AT Second, some software use an external reference (such as a pre-existing library of "known indels") and then try to match any observed indel to these "known indels" and then use the location of the "known indels" when slightly inconsisteny exists. This has some theoretical advantages in some situations but it has its own set of issues as well. In convert2annovar.pl, the way to represent this variant is to treat it as a simple one-base insertion. That is, if you run convert2annovar.pl (from either of the two variants above), you will see variant from convert2annovar.pl: chr1 11086097 11086097 - T and any subsequent annotation will be based on this new single-base-insertion variant. Similar, suppose you have a change from C to CTC. It could be that there is a insertion of "TC" after C, or that there is a "CT" insertion before C. convert2annovar.pl will treat this as an insertion of "TC" after C. convert2annovar.pl per se will not do this type of conversion. In other word, it will work on the variant that you supply in the input file. If you supply AT->ATT, or A->AT, or -->T, convert2annovar.pl will not attemp to alter them. So if you are doing filter-based annotation, and the database file contains only -->T, then the first two variants will be regarded as "un-annotated" variants. 3. Flagging in VCF4 file: Although convert2annovar.pl has some nice built-in ability to print out a subset of higher quality VCF4 calls ( for example, coverage>20), individual investigators may want to filter the VCF4 file themselves by other more sophisticated software tools, before feeding the raw variant calls in VCF4 to convert2annovar.pl The way this is done is by putting a "FILTER" flag in the VCF4 file, and the use -filter argument in convert2annovar.pl to process only calls with a user-specified flag. For example, I can put this argument in GATK VariantFiltration walker: --clusterWindowSize 10 --filterExpression "MQ0 >= 4 && ((MQ0 / (1.0 * DP)) > 0.1)" --filterName "HARD_TO_VALIDATE" -B:mask,VCF lane1_1.fq.indel.vcf --filterExpression "QUAL < 30.0 || QD < 5.0 || HRun > 5 || SB > -0.10" --filterName GATKStandard --filterExpression "DP < 6" --filterName "DPlt6" Then the new VCF4 file will have different filter annotations and I can use "convert2annovar.pl -filter PASS" subsequently to convert to annovar input format. 2. Samtools genotype-calling pileup format Note that there are many different pileup formats, but here we are dealing with the (now-obselete as of 2011) "genotype-calling" pileup which contains the variant calls in one of the columns. A more detailed description is given at the Samtools website. An example to generate the "genotype-calling" pileup file is shown below: samtools pileup -vcf ref.fa aln.bam > raw.pileup The commands generates pileup files that contain the consensus calls with the model implemented in MAQ (there are certainly many other specified SNP callers available as well that users can freely choose). An example genotype-calling pileup format generated from SamTools is illustrated below: chr1 556674 G G 54 0 60 16 a,.....,...,.... (B%A+%7B;0;%=B<: The columns are chromosome, 1-based coordinate, reference base, consensus base (IUPAC nomenclature for nucleotides), consensus quality, SNP quality, maximum mapping quality of the reads covering the sites, the number of reads covering the site, read bases and base qualities. The convert2annovar.pl program can convert the pileup file format to ANNOVAR input files. By default, the "-snpqual 20" argument will be imposed, so that only SNPs reaching quality score >=20 will be processed and written to output files. The output varlist file contains the called mutations in ANNOVAR format (non-mutations are obviously not in the output file). In the 2011 Januaray version of ANNOVAR, the format for handling pileup file has been quite mature/fixed. Note that the first five columns conform to the standard ANNOVAR input format, yet the sixth and following columns give information on the alleles. [kaiwang@biocluster ~/]$ convert2annovar.pl 84060.pileup -coverage 10 | head The NOTICE line aboves tells the user what the columns 6-9 means in the output. In the first line, we see an indel with depth coverage of 53, and 10 of them support the indel. In the second line, we see a SNP with depth coverage of 39, and 37 of them supports the alternative allele (G). These additional numbers after column 6 helps user decide whether the variant calls are reliable or not. The -fraction argument can be used to filter out variants whose alternative allele has too low percentage among all reads. For example, if we suppose that all variant calls must be supported by at least 40% reads covering a site, we can use: [kaiwang@biocluster ~/]$ convert2annovar.pl 84060.pileup -coverage 10 -fraction 0.4 | head As can be seen by comparing the two output files, the first line of indel is no longer in output, because 10/53<40%. Some additional useful arguments include: -altcov, which specifies the minimum coverage for the alternative allele (the -coverage specifies coverage for all reads regardless of whether they support reference allele or alternative allele); -maxcoverage, which specifies the maximum coverage level to print out this variant; --includeinfo, which specifies that all information in the input line should be included in the output line by appending them after the printed columns. After the program finishes, it will print out some statistics. Normally, for whole-genome sequencing on humans, the heterozytoes:homozygotes ratio should be around 2:1, the transitions:transversions ratio should be 2:1. (ANNOVAR version before Sep 2010 has a bug in the ratio calculation and it has been fixed now). Adanced notes: When the chromosome is "M", ANNOVAR will not print out "hom" or "het", instead, it will print out a number between 0 and 1 that suggest the fraction of reads that support alternative alleles. Use -chrmt argument if mitochondria is not annotated as M in your alignment.
3. Complete Genomics genotyping calling format The complete genomics company provides many genotyping-calling files for their customers. Among them is an var*ASM.tsv file that looks like below. [kai@beta ~/]$ head -n 20 var-GS000000088-ASM.tsv The convert2annovar.pl program can be used to convert this file to ANNOVAR format, using the "-format cg" argument. The output file looks like this: [kai@beta ~/]$ head var-GS000000088-ASM.tsv.snp An example command line session is given below: [kai@node-r1-u35-c2-p14-o4 ~/]$ convert2annovar.pl -format cg -out GS000000455.query var-GS000000455-ASM.tsv In this example, 25.6 million lines from the var*ASM.tsv file from Complete Genomics data are processed, and 3.7 million variants are written to the output file in ANNOVAR input format. Sometimes variant calls are in GFF3 format, and they can be converted to ANNOVAR input format. (This input file should not be confused with a GFF3 annotation database, as they serve different purposes. Here we are dealing with input files only.) For example, SOLiD provides SNP variant calls in the following format: [kai@beta ~/]$ head -n 20 var/Yoruban_snp_18x.gff The conversion can be done using "-format gff3-solid" argument. [kai@beta ~/]$ convert2annovar.pl var/Yoruban_snp_18x.gff -format gff3-solid | head Adding the --includeinfo argument will print out an additional column with the detailed attribute of the calls. The Short Oligonucleotide Analysis Package (SOAP) suite is developed by BGI, and SOAPsnp is a component that generates variant calls. An example of the genotype call file is given below: chr10 84026 G R 55 A 32 9 9 G 29 3 5 14 0.275000 1.42857 1 81 The convert2annovar.pl program can handle this format, using the "-format soapsnp" argument. An example of the output file is given below: 10 84026 84026 G A het Note that is --includeinfo argument is used, all the information from input file will be included in the output file. 6. MAQ genotype calling format The convert2annovar.pl program can handle this format, using the "-format maq " argument. Both SNPs and indels can be correctly processed. 7. CASAVA genotype calling format The convert2annovar.pl program can handle this format, using the "-format casava " argument and also specifying the chromosome by "-chr" argument, since CASAVA call file per se does not contain chromosome information. Both SNPs and indels can be correctly processed. This function is not tested rigorously yet. Please report bugs to me.
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||