Using convert_bim_allele.pl for changing allele coding in PLINK-formatted BIM files
The convert_bim_allele.pl program is designed to convert the allele coding scheme between (1) dbSNP forward strand ACGT allele (2) Illumina A/B allele or Illumina 1/2 allele (3) Illumina TOP strand and ACGT allele. In principle, it can be applied to any arrays including Affymetrix arrays as well, as long as the appropriate library files are provided. Note that for all program arguments, the single dash (-) and double dash (--) has the same effect, and the argument name can be abbreviated to the first a few letters as long as there is no ambiguity (so --argument, -argument, --arg and -argum has the same effect).
In genetic association analysis, allele coding discordance is one commonly encountered problem that many researchers struggle with, especially when two different groups compare results or perform meta-analysis between cohorts. This issue arise between different people and different computer algorithms tend to use different allele coding schemes: some use ACGT alleles based on forward strand of the human genome assembly, some use Illumina's TOP alleles, and some use Illumina's AB alleles. This software is written to convert the allele coding shemes so that results from different groups are comparable to each other. The input file for this program is a BIM file that was generated by the PLINK program, since the file has a simple format and can be easily manipulated. If the user has a PED file instead, it's easy to convert it to BIM file using PLINK, then convert back to PED file.
Below I briefly describe several commonly used allele coding schemes. To make things easier to understand, let's first look at a screen shot from dbSNP. Simply search "rs1004491" in dbSNP, and click the search result, and this screen will show up. It tells that the SNP rs1004491 is compiled based on about a dozen individual submissions (with ss identifier) from various sites. The SNP is mapped to NCBI build 36, with refSNP allele as C/T in dbSNP.
This is the most commonly used allele coding scheme, although it has many intrinsic problems that I'll describe below. The scheme assumes that a "forward strand" for a given genome is already known, and that the two alternative alleles are defined with respect to the forward strand (as opposed to the reverse strand). For example, suppose that a SNP has C allele in the forward strand in reference genome sequence (and has a G allele in the reverse strand); an alternative sequence is found in the population that has A allele in the reverse strand (and hence T allele in the forward strand), then the SNP is then coded with C and T alleles. In the above example, you can see that this SNP is a C/T SNP based on forward strand (or a A/G SNP based on reverse strand).
One obvious problem with this coding scheme is that one must know the "forward strand" with certainty, but this is usually not the case. Even for the well studied human genome, there are still many holes to be filled. It is possible that a sequence is assembled into the forward strand in 2004 human genome assembly, but it becomes reverse strand in the 2006 human genome assembly. It is also possible that a sequence cannot be assigned into either forward or reverse strand, so it's annotated as something like chr1_random in many genome databases. It is also possible that a stretch of forward strand becomes reverse strand in a particular subject due to inversion in this region. These types of situations becomes much worse for other organisms that have only low fold coverage genome sequences: it's simply not feasible to assign many contigs correctly into either a forward or a reverse strand, so even if a SNP is known based on sequencing many subjects, the correct "forward strand" coding cannot be inferred confidently.
Illumina's A/B allele coding, or TOP/BOT strand definition, is explained in here in detail by Illumina. The allele coding method solves the problem aformentioned, that is, the alleles are not dependent on the specific genome assembly, but are based on the actual polymorphism itself. Briefly, if one of the two polymorphism is A or T, and the other one is C or G, then the A or T is refered to as A allele, and the C or G is refered to as B allele, and the strand with A or T is refered to as TOP and BOT strand, respectively. If the polymorphism is A/T or C/G, then walk through the surrouding sequence (the two nucleotides up or downstream of the SNP) to find a pair of unambiguous nucleotides, and then a similar rule is applied: if A or T is on 5' side of the SNP, then it's a TOP strand otherwise it's a BOT strand. For TOP strand, A and B allele denote A and T (or C and G), respectively; whereas for BOT strand, A and B allele denote T and A (or G and C), respectively. The Illumina's coding scheme does not rely on definition of a forward strand (hence correct genome assembly), so it almost always ensure consistency between genome builds and ensures instant allele designation for newly sequenced genomic sequences or unassembled genomic sequences. As the above figure shows, dbSNP has already adopted the Illumina's coding scheme in their annotations: fwd/B means that "forward" strand of dbSNP corresponds to "BOT" strand of Illumina's coding scheme. Also note that there is a "fwd/T" in the figure above: it clearly indicates an annotation error: such errors are indeed quite prevalent! Sometimes, people often use 1/2 to denote Illumina's A/B allele, since numeric coding is more convenient in many scenarios and since some old association software only recognize numeric coded alleles.
When exporting genotypes from the Illumina BeadStudio software, the user can choose AB genotypes, or ACGT genotypes (commonly refered to as "TOP alleles"), or forward strand genotype in newer version of the software. The TOP alleles is the allele on the TOP strand, which may or may not be the forward strand: see the example above, the "fwd/B" means that dbSNP's forward strand corresponds to Illumina's BOT strand, so the "TOP allele" is the opposite as the "forward strand allele". Unfortunately many users simply do not know or understand what is "TOP allele": they simply take for granted that "TOP" means "forward" and then complain that there are many discordances when merging two different data sets (one coded as forward strand and one exported from BeadStudio). The convert_bim_allele.pl program that I describe in this article will solve problems like this.
Note that Affymetrix also has a A/B allele designation but as far as I can tell, nobody actually uses it in any circumstance except when doing CNV analysis (for example, in PennCNV-Affy, the A/B alleles are explicitly used to relate frequency information of B allele from external databases to the observed B alleles). Virtually all SNP genotype calling algorithms for the Affymetrix platform generate actual ACGT allele calls in forward strand. Unfortunately, annotating alleles in forward strand is a difficult task, and there are many annotation errors in the library file provided by Affymetrix, or those produced by any third-party software. The user will have to exercise caution when merging genotyping calls from different software together, when using the Affymetrix platform. For example, when merging two data sets on the same platform but on different subjects, it's always a good idea to calculate allele frequencies for all SNPs in these two data sets and compare them to ensure consistent frequency spectrums.
Unfortunately, many errors exist in HapMap "foward strand" haplotype data as well, and given that genotype imputation has become very popular these days, it is important to recognize this fact and handle it appropriately. For Illumina Infinium assays, since no A/T and C/G SNP exists, typically this is not a big problem and an "autoflip" function can be used in imputation software to handle the allele-flipping issues. For Illumina GodenGate assays and for Affymetrix assays, this is a practical issue, since one will have to change "correct" genotype calls (in forward strand) into incorrect ones simply to accormodate the erraneous HapMap data for the purpose of imputation. Essentially, the user will have to do extra work when merging imputated genotypes from HapMap with another genotype data set, or when performing meta-analysis involving both imputated data and genotyped data. Hopefully, the convert_bim_allele.pl program may make this easier.
Some association or linkage software requires that alleles be coded as nuemric values, therefore one may want to convert ACGT alleles into 1/2 alleles. Softwares such as PLINK can perform this task, using --recode12 argument. However, the 1/2 coding by PLINK cannot be decoded back to ACGT, since 1 and 2 denote minor/major allele, respectively, and these annotations differ between data sets. When getting a genotype data file with 1/2 allele coding from other people, it is always necessary to ask whether the alleles are coded as Illumina's A/B alleles or simply recoded using PLINK; if it's the latter, it is best to ask for original genotype data before doing anything with the data.
The program requires two main input files, an PLINK-formatted BIM file, a SNPTable file mapping different allele coding schemes. Their formats are briefly described below.
The BIM file can be generated by the PLINK software using the --make-bed argument, see details here.
An example file is shown below:
[kai@beta ~/project/]$ head 45snp.bim
Each line contains information for one marker with four fields, including chromosome, marker name, genetic position, physical position, and two alleles (when PLINK makes the file, the minor allele is always listed before major allele, but obviously this arrangments may differ between different data sets).
The SNPTable file is a tab-delimited file that annotates the allele designation for each SNP. For Illumina arrays, one can easily generate such a file from the BeadStudio software: simply click the "SNPTable" tab after opening a project file, then use "column chooser" to select the "Name", "SNP", "ILMN Strand" and "Customer Strand" columns only, then click "export" to save the output to a file with the four columns. In the GenGen software package, I have provided a SNPTable file for HumanHap550&610 arrays, as well as a file for HumanHap1M v1 and v3 arrays. Annotation file for other arrays can be generated by the user using the above procedure.
To explain this in more detail, see the first 10 lines of a SNPTable file below:
[kai@beta ~/project/ head hh550_610.snptable
The first line is a header line that explains the meaning of each column, and each subsequent columns annotate one SNP (the SNPs shown above do not have rs identifiers, but they are indeed SNPs in the Illumina array).
For arrays with indels, the alleles are shown as D (deletion) and I (insertion), rather than ACGT. For example, see below:
[kai@beta ~/project/]$ fgrep D ~/project/lib/hh1m.snptable | head