SNP allele composition within CNV
In a diploid genome without CNVs, each SNP genotype call is composed of two allele calls from two homologous chromosomes. If using A and B to denote the two possible alleles for a diallelic marker, then the three possible genotype calls are AA, AB and BB, respectively.
However, unlike a regular SNP genotype call that may be given by genotype calling software, in a CNV region, the "CNV-based SNP genotype call" could be composed of zero, one, three or even more allele calls. An example is given in the PennCNV paper in Genome Research. The figure is reproduced below:
As we can see, when there is a 3-copy CNV, there are four possible SNP allele compositions within the CNV: AAA, AAB, ABB and BBB. For each location in the genome that has a genotyped SNP, the CNV-based SNP genotype call can be inferred from the B Allele Frequency (BAF) values. The above figure does not show deletions, but when there is a single-copy deletion, the two possible CNV-based SNP genotypes are A and B.
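To make the enumeration concrete, here is a minimal Python sketch that lists the possible allele compositions for a given copy number, together with the idealized BAF of each (the fraction of B alleles). The function names are my own for illustration and are not part of PennCNV; real BAF clusters are noisy and will not sit exactly on these values.

```python
from itertools import combinations_with_replacement

def allele_compositions(copy_number):
    """List the possible SNP allele compositions for a given copy number."""
    return ["".join(c) for c in combinations_with_replacement("AB", copy_number)]

def expected_baf(composition):
    """Idealized BAF: fraction of B alleles (undefined when there are no copies)."""
    if not composition:
        return None
    return composition.count("B") / len(composition)

for cn in (1, 2, 3):
    print(cn, [(g, round(expected_baf(g), 2)) for g in allele_compositions(cn)])
```

For CN=3 this lists AAA, AAB, ABB and BBB, and for CN=1 it lists A and B, matching the description above.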
Note: I did not use the term "CNV genotype" in the above description. Practically, "CNV genotype" is better used to describe the CN state of the two homologous chromosomes; for example, in a common CNV region, a CNV genotype can be 1+0, 1+1 or 0+1. This term has nothing to do with the actual SNP allele composition within the CNV.
Note: An often-touted concept in some other CNV calling software is the so-called "allele-specific CNV call". This is a very confusing term, because when people use it, they are actually referring to a scenario such as "2 A alleles, 1 B allele" for a particular marker within a particular duplication, as opposed to "2 copies on the paternal chromosome, 1 copy on the maternal chromosome", which would be the more appropriate meaning. In PennCNV, if a user wants to get a so-called "allele-specific CNV call", it is simply the BAF value multiplied by the CN estimate. It is as simple as that. (People make a big deal about "allele-specific CNV call" because most other algorithms do not consider the BAF in the calling procedure.)
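As a rough illustration of "BAF multiplied by CN", the sketch below rounds the product to the nearest whole number of B alleles and assigns the remaining copies to the A allele. The function name and the rounding step are my own assumptions for illustration; this is not code from PennCNV.

```python
def allele_specific_copies(baf, cn):
    """Return (A copies, B copies) for one marker from its BAF and the CN estimate."""
    if cn == 0:
        return (0, 0)  # homozygous deletion: no alleles present
    # BAF * CN approximates the number of B alleles; clamp to the valid range.
    b_copies = min(cn, max(0, round(baf * cn)))
    return (cn - b_copies, b_copies)

print(allele_specific_copies(0.33, 3))  # (2, 1), i.e. AAB
print(allele_specific_copies(0.98, 1))  # (0, 1), i.e. B
```

Markers whose BAF falls midway between two clusters are ambiguous, and a careful analysis would flag them rather than round.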
When family data is available, that is, when father, mother and child are all genotyped and their CNV calls are generated by PennCNV, it is possible (but not always deterministic) to disambiguate the SNP allele composition in each of the two homologous chromosomes in the parental genomes. This is especially true when multiple children are genotyped in the same family. An example is shown below:
The boxes below each individual in the pedigree denote the BAF values in the CNV region, and each blue dot represents one SNP in the corresponding individual. The different colors (cyan, yellow, orange, green) for the SNP alleles denote different homologous chromosomes. There are a total of four parental chromosomes, and the allele transmission pattern is inferred based on the pedigree structure, the CNV calls, and the signal intensity values for each of the six children.
To make this a little clearer, we can examine the first SNP in the figure. The father (subject 2, with 3 copies) has a SNP allele composition of AAB, while the mother (subject 1, with the normal 2 copies) has a SNP allele composition of AA. There are two major possibilities for the father: the SNP allele composition is either AA+B or A+AB in the two homologous chromosomes. Without offspring data, we could not tell. However, based on one child (subject 4), who has a SNP allele composition of AAB, we can confidently tell that the father has A in one homologous chromosome and AB in the other. Similar logic can be applied to other SNPs and other offspring in the family.
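The reasoning for this first SNP can be sketched in a few lines of Python. The sketch assumes a 3-copy father, a normal 2-copy mother, and a 3-copy child who inherited the duplicated paternal chromosome; the function names are hypothetical and this is not how PennCNV itself is implemented.

```python
def father_partitions(father_alleles):
    """All ways to split a 3-copy composition into (1-copy, 2-copy) chromosomes."""
    results = set()
    for i, single in enumerate(father_alleles):
        rest = father_alleles[:i] + father_alleles[i + 1:]
        results.add((single, "".join(sorted(rest))))
    return results

def consistent_partitions(father_alleles, mother_alleles, child_alleles):
    """Keep the father partitions that can explain a 3-copy child."""
    keep = set()
    for single, double in father_partitions(father_alleles):
        for maternal in set(mother_alleles):
            # Assumption: the child carries one maternal allele plus the
            # duplicated (2-copy) paternal chromosome.
            if sorted(maternal + double) == sorted(child_alleles):
                keep.add((single, double))
    return keep

print(father_partitions("AAB"))                   # A+AB and B+AA (set order may vary)
print(consistent_partitions("AAB", "AA", "AAB"))  # {('A', 'AB')}
```

Only the A+AB split survives once the AAB child is taken into account, which is the conclusion reached above.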
The PennCNV package provides a convenience script to help infer the total SNP allele composition for each marker within a CNV. Once the user has this information, it is relatively straightforward to infer the SNP allele composition within each homologous chromosome for each marker within a CNV. Suppose that the signal intensity file names for the above eight individuals are sample1.txt, sample2.txt, ..., sample8.txt, respectively. Suppose that we have already run PennCNV on this family, and inferred the CN for each individual in this region as 2, 3, 2, 3, 3, 2, 3, 3, respectively. Now we can run the command below:
[kaiwang@cc ~/penncnv/example]$ infer_snp_allele.pl -pfb example.pfb -hmm example.hmm -allcn 23233233 -start rs11716390 -end rs17039742 -out tempfile sample1.txt sample2.txt sample3.txt sample4.txt sample5.txt sample6.txt sample7.txt sample8.txt
In the command, eight signal files are provided, and the --allcn argument is provided with eight digits, corresponding to eight CN estimates for these eight files.
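Because the digits in --allcn must line up with the order of the signal files, it can help to build the argument list from a single file-to-CN mapping. The helper below is a hypothetical convenience wrapper, not part of PennCNV; it only reuses the options shown in the command above and assumes each CN fits in a single digit.

```python
def build_allcn_command(pfb, hmm, start, end, out, cn_by_file):
    """Assemble an infer_snp_allele.pl argument list from a {file: CN} mapping."""
    files = list(cn_by_file)
    cns = list(cn_by_file.values())
    # Assumption: one digit per file, so every CN must be between 0 and 9.
    if any(not 0 <= cn <= 9 for cn in cns):
        raise ValueError("each CN must be a single digit")
    allcn = "".join(str(cn) for cn in cns)
    return ["infer_snp_allele.pl", "-pfb", pfb, "-hmm", hmm, "-allcn", allcn,
            "-start", start, "-end", end, "-out", out, *files]

cn_estimates = [2, 3, 2, 3, 3, 2, 3, 3]
cn_by_file = {f"sample{i}.txt": cn for i, cn in enumerate(cn_estimates, start=1)}
cmd = build_allcn_command("example.pfb", "example.hmm",
                          "rs11716390", "rs17039742", "tempfile", cn_by_file)
print(" ".join(cmd))
```

The printed line reproduces the command shown above, and the list could be passed to subprocess.run if the script is on the PATH.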
Again, note that the procedure of inferring SNP alleles in each homologous chromosome may not be deterministic. Below is one example (from the same pedigree) showing that we cannot tell the two chromosomes apart from the CNV calls per se. Since the mother is homozygous for the normal copy number, we cannot tell which maternal chromosome each of the children inherited. (In principle, it is possible to tell by using neighboring SNP markers in phasing software, but this is far outside the scope of our discussion here.)
De novo CNVs (those in offspring but not in parents) are especially interesting in many family-based studies, since many people take for granted that de novo CNVs are the culprit variants responsible for diseases in offspring (which may or may not be correct). PennCNV provides a trio-based calling algorithm that helps eliminate false positive de novo CNV calls, which arise when the father or mother has the CNV but it was missed by the individual-based calling algorithm. However, since de novo CNVs are very important to study, researchers in many cases want even higher confidence in de novo CNVs when selecting CNVs for experimental validation. Therefore, PennCNV provides a special script for the sole purpose of assigning P-values to de novo CNV calls.
The basic concept is quite simple and is described in the 2007 Genome Research paper on PennCNV. Here I reproduce the Supplementary Table 2 in that paper below for illustrating the concept:
The above table shows the BAF and LRR for the father, mother and child, respectively, for 50 SNP markers in a CNV region. PennCNV gives a deletion CNV call for the child, but not for the father or the mother. So it is potentially a de novo CNV, but are we confident that this is really a de novo CNV? In the Genome Research paper, we describe the scenario as below:
"Family information can be also used to extract more biological knowledge from detected CNVs, such as inferring the parental origin of predicted de novo CNVs. To illustrate this, consider a scenario where the father and mother genotypes at a SNP marker are AA and AB, respectively, and the PennCNV algorithm identified a de novo deletion in the offspring encompassing this SNP. If the offspring genotype call is BB (or when B Allele Frequency indicates that the actual genotype is B in the presence of “No Call” genotype), we can infer that the de novo event happened on the paternal chromosome. Similarly, when the father, mother and offspring genotypes are AA, BB and AA, respectively, we can infer that the de novo event happened on the maternal chromosome."
So let us take a look at the above example with 50 SNPs within the CNV region in all family members. A total of 13 SNPs are informative for this analysis, and they are marked in bold font. We can unambiguously determine that the de novo event occurred on the paternal chromosome for all 13 SNPs, and on the maternal chromosome for zero SNPs. If we do a binomial test (against an expectation of 0.5), we get a two-sided P-value of 0.0002, which means that this is highly unlikely to be a random observation. Therefore, we have high confidence that the predicted de novo CNV is a bona fide de novo CNV.
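The two steps above (classifying each informative SNP, then testing the counts) can be sketched in Python as follows. This is only an illustration of the logic for a single-copy deletion in the offspring; the function names are my own, and the actual script in PennCNV may work differently in its details. The exact two-sided test here sums the probabilities of all outcomes that are no more likely than the observed one.

```python
from math import comb

def deleted_parental_chromosome(father_gt, mother_gt, child_allele):
    """For a 1-copy de novo deletion, return 'F', 'M' or '?' for the affected chromosome."""
    in_father = child_allele in father_gt
    in_mother = child_allele in mother_gt
    if in_mother and not in_father:
        return "F"  # the remaining allele is maternal, so the paternal copy was lost
    if in_father and not in_mother:
        return "M"  # the remaining allele is paternal, so the maternal copy was lost
    return "?"      # uninformative marker

def binomial_two_sided_p(n_paternal, n_maternal, p=0.5):
    """Exact two-sided binomial P-value for the paternal/maternal split."""
    trials = n_paternal + n_maternal
    pmf = [comb(trials, k) * p**k * (1 - p)**(trials - k) for k in range(trials + 1)]
    observed = pmf[n_paternal]
    return min(1.0, sum(x for x in pmf if x <= observed * (1 + 1e-9)))

# The two scenarios quoted from the paper, plus one uninformative marker.
print(deleted_parental_chromosome("AA", "AB", "B"))  # F
print(deleted_parental_chromosome("AA", "BB", "A"))  # M
print(deleted_parental_chromosome("AB", "AB", "A"))  # ?

# Counts reported in the text: 13 paternal, 0 maternal.
print(f"{binomial_two_sided_p(13, 0):.4f}")  # 0.0002
```

With 13 of 13 informative SNPs on one side, the P-value is 2 × 0.5^13 ≈ 0.00024, which agrees with the 0.0002 reported above.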
The PennCNV package provides a convenience script to automate the entire process above, and assigns P-values to predicted de novo CNV calls (currently for autosomes only). In the example/ directory in the PennCNV distribution, we can test the following command (since we know that there is a de novo CNV call in chr3 for the offspring already):
[kaiwang@cc ~/penncnv/example]$ infer_snp_allele.pl -pfb example.pfb -hmm example.hmm -denovocn 1 father.txt mother.txt offspring.txt -start rs11716390 -end rs17039742 -out tempfile
In the command line, the --denovocn argument tells the program to analyze a de novo CNV with CN=1 in the offspring (assuming that the father and mother have the normal copy number). The above example shows that among the 50 markers, 13 support paternal origin and 0 support maternal origin, with a P-value of 0.0002. To get the CNV-based genotype calls, we can examine the tempfile:
[kaiwang@cc ~/penncnv/example]$ cat tempfile
The first line of the output file gives the column headers. Each line contains tab-delimited fields, representing LRR for father, mother, offspring, then BAF for father, mother, offspring, then CNV-based genotype calls for father, mother, offspring, and then the parental origin (F for father, M for mother, ? for unknown). You can probably load the output into Excel (use "separate by space") for easier visual inspection.
Note that you can alternatively use "runex.pl 13" and "runex.pl 14" in the example/ directory in PennCNV to run the two types of analysis described in this page.