PennCNV was originally developed for the Illumina HumanHap550 array but has been extended to other arrays. The hhall.hg18.pfb, hhall.hmm and hhall.hg18.gcmodel files are provided in the lib/ folder of PennCNV. They can be used to process arrays such as HumanHap300, HumanHap550, HumanHap650, HumanHap1M, HumanCNV370, Human610 and Human1. For Affymetrix arrays, the PennCNV-Affy package contains library files for commonly used marker sets. In a nutshell, PennCNV can process an array if the markers on this array are annotated in the PFB file.
For other Illumina and Affymetrix SNP arrays (for example, the Illumina Omni 2.5M array or the Affymetrix cytogenetics array), you need to compile your own PFB file that specifies the chromosome coordinates and PFB value for each marker on the array. Use the compile_pfb.pl program included in the PennCNV package to do that.
For other types of arrays, such as oligonucleotide arrays, and for situations where the user does not have access to the raw data but can access processed “signal” values, please check the Input Files section in the tutorial for potential solutions. Besides commonly used SNP arrays, users have also reported success on Agilent non-SNP arrays, Perlegen SNP arrays and Affymetrix 100K SNP arrays.
The PFB file is described in the Input Files section in the tutorial. If you already have hundreds of signal files (with LRR/BAF values) generated on non-human species, or on non-European populations, then you can use the compile_pfb.pl program to generate a PFB file from a collection of signal files.
Yes. In principle, it can also handle oligonucleotide arrays without SNP markers. There are two ways to represent non-polymorphic probes: (1) if the marker name contains the word “cnv” (for Illumina arrays) or “CN” (for Affymetrix arrays), the marker will be treated as non-polymorphic by PennCNV; (2) if the allele frequency annotation in the PFB file for a marker is more than 1, the marker will be treated as non-polymorphic by PennCNV. The second approach is suitable for custom-designed arrays.
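The two rules above can be sketched in a few lines of Python. This is only an illustration of the logic as described in this answer, not PennCNV's actual code: the function name is hypothetical, the marker names are made up, and the plain case-sensitive substring match is an assumption.

```python
def is_nonpolymorphic(marker_name, pfb_value):
    """Return True if a marker would be treated as a non-polymorphic probe.

    Hypothetical helper that mirrors the two rules described above.
    """
    # Rule 1: the name contains "cnv" (Illumina) or "CN" (Affymetrix).
    # Assumption: a plain, case-sensitive substring match.
    if "cnv" in marker_name or "CN" in marker_name:
        return True
    # Rule 2: a PFB annotation greater than 1 flags a non-polymorphic marker.
    return pfb_value > 1


# Made-up marker names and PFB values, for illustration only.
markers = [("rs0000001", 0.84), ("cnvi0000001", 0.5),
           ("CN_000001", 0.5), ("custom_probe_7", 2)]
for name, pfb in markers:
    print(name, is_nonpolymorphic(name, pfb))
```

For a custom array, the second rule means that setting the PFB column to a value such as 2 for every intensity-only probe is enough to mark those probes as non-polymorphic, whatever they are called.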
No. One may think that replacing the signal intensity of an autosomal segment with the intensity of male chrX creates an artificial heterozygous deletion in autosomes, but this is not correct. Autosomes and chrX have different signal intensity distributions. When building reference clustering files, Illumina used both males and females; as a result, the expected ZERO value of LRR in chrX corresponds to ~1.5 copies of chrX, not 2 copies!
To further understand this, one can open BeadStudio and examine by eye the mean intensity of chrX from multiple female individuals. The mean is not ZERO, but higher than ZERO, indicating that female chrX does not behave like a 2-copy autosome. For more details and discussion, refer to Supplementary Figure 1 of the 2007 PennCNV paper in Genome Research.
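The effect of a ~1.5-copy reference can be illustrated with a small calculation. The sketch below assumes an idealized model in which signal intensity is exactly proportional to copy number; real arrays respond non-linearly, so the actual numbers differ, but the direction of the shift is the same. The function name is hypothetical.

```python
import math


def idealized_lrr(observed_copies, reference_copies):
    # LRR is log2(observed intensity / expected intensity). Here intensity is
    # assumed to be proportional to copy number, which is a simplification.
    return math.log2(observed_copies / reference_copies)


print(round(idealized_lrr(2, 1.5), 3))  # female chrX against a ~1.5-copy reference
print(round(idealized_lrr(1, 1.5), 3))  # male chrX against a ~1.5-copy reference
print(round(idealized_lrr(1, 2.0), 3))  # heterozygous deletion on an autosome
```

Under this simplified model, female chrX sits above zero and male chrX does not match the value expected for a heterozygous autosomal deletion, which is why male chrX intensities cannot stand in for an autosomal deletion.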
Some users just want to adjust signal intensity values, without generating CNV calls by PennCNV. The genomic_wave.pl program in the PennCNV package can be used for this purpose. The input file must have a field in the header line that says "*.Log R Ratio". The -adjust argument can be used to generate a new file with updated Log R Ratio measures. This procedure can also be used to adjust Agilent or Nimblegen arrays. Email me for a script to generate a GC model file for these custom arrays.
Most probably it has low signal quality. Please check the QC&Annotation section in the tutorial to see how to take advantage of the filter_cnv.pl program for automatic QC analysis.
It could also be due to heterosomatic CNVs, where a fraction of cells lose or gain one chromosomal region (such as one chromosome arm), so PennCNV gives many CNV calls specifically within this region. The 2007 Genome Research paper on PennCNV has some discussion of this, with illustrations in the supplementary figures. In this case, just delete this chromosome in this individual from the analysis.
This can be done by the following command:
[kai@node-r2-u30-c7-p13-o21]$ filter_cnv.pl sampleall.rawcnv -numsnp 10 -length 100k -out sampleall.largecnv
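The size criteria in the command above amount to a simple per-call filter. The sketch below is not filter_cnv.pl itself: it assumes that each call line carries "numsnp=" and "length=" fields (with commas as thousands separators in the length), that both thresholds are inclusive, and it uses made-up call lines and a hypothetical function name. Check these assumptions against your own call file.

```python
import re

# Assumed layout of the two fields of interest in a call line.
CALL_PATTERN = re.compile(r"numsnp=(\d+)\s+length=([\d,]+)")


def keep_call(line, min_numsnp=10, min_length=100_000):
    """Return True if a call line meets both thresholds (assumed inclusive)."""
    match = CALL_PATTERN.search(line)
    if match is None:
        return False  # unrecognized line: drop it rather than guess
    numsnp = int(match.group(1))
    length = int(match.group(2).replace(",", ""))  # "120,501" -> 120501
    return numsnp >= min_numsnp and length >= min_length


# Made-up call lines, for illustration only.
calls = [
    "chr1:1000000-1120500 numsnp=25 length=120,501 state2,cn=1 sample1.txt startsnp=rsA endsnp=rsB",
    "chr2:5000000-5004040 numsnp=6 length=4,041 state5,cn=3 sample1.txt startsnp=rsC endsnp=rsD",
]
large_calls = [c for c in calls if keep_call(c)]
print(len(large_calls))
```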
If you also generate CNV calls by cnvPartition or QuantiSNP in BeadStudio/GenomeStudio, you can export the CNV calls to an XML file (see illustration described in this tutorial), then use the convert_cnv.pl program (with the "-intype xml -outtype penncnv" arguments) to convert the XML file to PennCNV format. If you also generate CNV calls by the Birdseye program, the convert_cnv.pl program can also convert the Birdseye calls to standard PennCNV format. Next, you can use the compare_cnv.pl program (use the -m argument to read the manual) to compare the CNV calls generated by different programs, provided they are all in PennCNV format. A two-column list file should be prepared that specifies the correspondence between file names in the PennCNV call file and in the call file generated by the other algorithm (but converted to PennCNV format).
The CNV calls in immunoglobulin regions are most likely cell line artifacts, so they should be removed as part of the QC procedure. The scan_region.pl program can be used to do this conveniently:
scan_region.pl cnvcall imm_region -minqueryfrac 0.5 > cnvcall.imm
fgrep -v -f cnvcall.imm cnvcall > cnvcall.clean
The first command scans the cnvcall file against known immunoglobulin regions, and any CNV call that overlaps with an immunoglobulin region is written to the cnvcall.imm file. The --minqueryfrac argument means that at least 50% of the length of the CNV call must overlap with the immunoglobulin region, which excludes cases where a very large CNV call happens to encompass the immunoglobulin region. The second command uses the fgrep program to remove these calls from the cnvcall file and generate a cleaned cnvcall.clean file. The imm_region file contains the immunoglobulin regions. For the 2006 human genome assembly, these four regions can be put into the file:
The same techniques described above can be used. For telomeric regions, one can treat the first and last 100kb or 500kb of each chromosome as the telomeric region. For example, for the 500kb threshold, you can put the following regions into a file and then use scan_region.pl to remove CNV calls:
For centromeric regions, the following definition can be used (NCBI36 2006 human genome assembly). In fact, you may want to add 100kb (or 500kb) to both the left and right of these regions, just to make sure that centromeric CNVs are identified comprehensively.
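The region-based filtering used for immunoglobulin, telomeric and centromeric regions comes down to two operations: optionally padding each region, then asking what fraction of a CNV call lies inside it. The sketch below illustrates that logic only; it is not scan_region.pl. The function names and coordinates are hypothetical, coordinates are assumed to be 1-based and inclusive, and the threshold is assumed to be inclusive.

```python
def pad_region(start, end, margin):
    """Extend a region by `margin` bases on both sides (never below position 1)."""
    return max(1, start - margin), end + margin


def query_overlap_fraction(cnv_start, cnv_end, region_start, region_end):
    """Fraction of the CNV call (the query) that is covered by the region."""
    overlap = min(cnv_end, region_end) - max(cnv_start, region_start) + 1
    cnv_length = cnv_end - cnv_start + 1
    return max(0, overlap) / cnv_length


def should_remove(cnv, regions, min_query_frac=0.5, margin=0):
    """cnv and regions are (chrom, start, end) tuples."""
    chrom, start, end = cnv
    for r_chrom, r_start, r_end in regions:
        if r_chrom != chrom:
            continue
        p_start, p_end = pad_region(r_start, r_end, margin)
        if query_overlap_fraction(start, end, p_start, p_end) >= min_query_frac:
            return True
    return False


# Hypothetical coordinates, for illustration only.
regions = [("chr1", 1_000_000, 2_000_000)]
print(should_remove(("chr1", 1_900_000, 2_050_000), regions))  # mostly inside the region
print(should_remove(("chr1", 500_000, 5_000_000), regions))    # large call that merely spans it
print(should_remove(("chr1", 2_050_000, 2_150_000), regions, margin=100_000))  # caught after padding
```

The second example shows why a query-fraction threshold is useful: a very large call that happens to span the region is kept, because only a small part of it lies inside.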
Yes. By default PennCNV only works on autosomes, without generating CNV calls for chrX and chrY. Several important issues for sex chromosome CNV calling are:
Sometimes, Illumina BeadStudio/GenomeStudio may export signal intensity values with weird characters. This could occur in a non-English version of BeadStudio, in a non-English version of Windows, with non-human SNP arrays, or for other reasons. For example, the LRR values for several markers in a file may display as "ABCDE" rather than a number; PennCNV will ignore these values and leave these markers out of the analysis. If the values are "-inf" instead, PennCNV will treat them as -5. This should usually affect only a few markers for each sample.
At other times, however, all the signal intensity values are wrong, so PennCNV will not work at all. For example, all the decimal points in LRR/BAF may have been replaced by commas, so the values are not valid numbers. In that case, users can run "perl -pe 's/,/./g' < inputfile > outputfile" to generate a new signal intensity file for CNV calling. One example is shown below:
[kaiwang@cc ~]$ head -n 3 sample.split1
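The handling of malformed values described in this answer can be summarized in a small sketch. It is not PennCNV's own parsing code: the function name is hypothetical, and treating other non-finite values (such as "NaN") as unusable is an assumption made for the example.

```python
import math


def parse_signal_value(text, decimal_comma=False):
    """Convert one LRR/BAF field to a float, or None if it is unusable."""
    text = text.strip()
    if decimal_comma:
        text = text.replace(",", ".")  # same idea as perl -pe 's/,/./g'
    if text.lower() == "-inf":
        return -5.0  # "-inf" is treated as -5
    try:
        value = float(text)
    except ValueError:
        return None  # e.g. "ABCDE": the marker is ignored
    # Assumption: remaining non-finite values are also treated as unusable.
    return value if math.isfinite(value) else None


print(parse_signal_value("-0.1234"))
print(parse_signal_value("ABCDE"))
print(parse_signal_value("-inf"))
print(parse_signal_value("0,4981"))                      # invalid without conversion
print(parse_signal_value("0,4981", decimal_comma=True))  # valid after conversion
```

Note that the Perl one-liner replaces every comma on every line, so it is only appropriate when commas appear nowhere else in the file (for example, not as a field separator).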
Whenever whole-genome data is available, it is always best to use it for CNV calling, even if the interest is in one particular gene or region. If you do not have access to whole-genome data and you know the region or genes that you are interested in, it is still best to acquire data for the region plus flanking markers (for example, if the region itself has 50 markers, try to ask for 150 markers surrounding the region).
If you want to find all CNVs (possibly with different sizes) in this region, use the --test argument for HMM-based CNV calling, together with the --nomedianadjust argument. The latter argument is very important because by default a median adjustment procedure is used (to make the median LRR of whole-genome markers zero), and this procedure should not be applied to candidate regions. If you know that a common CNV exists in this candidate region, with known start and end markers and known deletion and duplication frequencies, you can also use the --validate argument for validation-based CNV calling to reduce false negative rates (but again the --nomedianadjust argument should be used for the operation).
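A toy example shows why median adjustment is harmful for a small candidate region. The sketch below uses made-up LRR values and a hypothetical function that only mimics the idea of median adjustment (shifting values so that their median is zero); it is not PennCNV's implementation.

```python
from statistics import median


def median_adjust(lrr_values):
    """Shift LRR values so that their median becomes zero."""
    m = median(lrr_values)
    return [v - m for v in lrr_values]


# Made-up LRR values for a candidate region lying mostly inside a deletion.
region_lrr = [-0.62, -0.55, -0.58, -0.60, -0.51, 0.02, -0.03]
adjusted = median_adjust(region_lrr)

print(median(region_lrr))  # clearly negative: the deletion signal
print(median(adjusted))    # zero: the deletion now looks like normal copy number
```

With whole-genome data the median is dominated by normal-copy markers, so the adjustment is harmless; with only a candidate region, the CNV itself can determine the median and be adjusted away.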
The 2009Aug27 version of PennCNV added a script for validating de novo CNVs and assigning P-values to de novo calls. If you want to know whether a particularly interesting de novo CNV is real or not, or if you want to select a set of the most confident de novo CNVs for experimental validation, then this program should definitely be used. Check it out here.
In practice, a user can multiply BAF by the CN estimate to get allele-specific CNV calls. Believe it or not, it is as simple as that. People treat it as a big deal simply because most other software does not have the concept of BAF. Please read the web page on infer_snp_allele.pl for a more thorough description of this issue.
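The multiplication can be written out as a short sketch. This is not the infer_snp_allele.pl program: the function name is hypothetical, and rounding to the nearest integer and clamping to the range 0..CN are assumptions made to cope with noisy BAF values.

```python
def allele_specific_copies(baf, copy_number):
    """Split a total copy number into (A copies, B copies) using the BAF."""
    b_copies = int(baf * copy_number + 0.5)        # round to the nearest integer
    b_copies = min(max(b_copies, 0), copy_number)  # keep within 0..copy_number
    return copy_number - b_copies, b_copies


print(allele_specific_copies(0.67, 3))  # duplication: one A copy, two B copies
print(allele_specific_copies(0.02, 1))  # deletion: the remaining copy is A
print(allele_specific_copies(0.50, 2))  # normal heterozygote: one A, one B
```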
The word "state" is used internally in the HMM algorithm, and its six-state definition is borrowed from the original QuantiSNP paper. You can simply ignore the "state" and focus on the CN=1 part, which means a deletion with a copy number of 1.
First, compile your own PFB file (see format description here). Next, use the "--lastchr 29" argument in the latest version of PennCNV, which means that the last autosomal chromosome is 29 (rather than 22).
The PennCNV package includes pre-built executables for Windows, but they only work with 32-bit Perl (version 5.8.8 or 5.10.1). See the kext/ directory: it has two sub-directories, 5.8.8 and 5.10.1, each containing the appropriate DLL files for PennCNV. If you install, say, 5.12.1 or 5.8.9, PennCNV will not work. You have to install the correct Perl version, or compile the executable yourself.
Free software for CNV calling from SNP arrays includes QuantiSNP, cnvPartition, BirdSuite, dChip, CNAT, CNAG, GenoCN and CORKEN. There are two recent free review articles on SNP arrays and CNVs here and here. Quite a few companies sell CNV calling software as well, including GoldenHelix, Partek and Nexus Copy Number.
PennCNV considers the normal copy number to be 2, and only gives integer estimates of copy number. For tumors, it is best to use an algorithm that specifically handles tumor samples, gives continuous estimates of copy number, and accounts for aneuploidy levels as well as sample heterogeneity levels. This paper gives an overview of these issues and provides software, although it only works in BeadStudio. Other similar software includes GAP, SOMATICs and GenoCN.
Some citations can be found here.
The most widely used CNV database is the Database of Genomic Variants (DGV). Other resources include the UCSC Genome Browser (with the Structural Var track turned on), CNVVdb, CHOPPY, the DECIPHER database and the new NCBI dbVar database.