VISG PUBLICATIONS & SOFTWARE
The following published research and developed software are some of the outputs from the VISG collaboration. To access reports and presentations where no links are provided, please contact Phillip.Wilcox@scionresearch.com.
Methods for estimating gene copy number variation
Duplications and deletions are common in eukaryotic genomes but difficult to quantify on a genome-wide basis. A constant challenge with whole-genome sequencing is to efficiently determine gene copy number variation (CNV), both for specific genes and genome-wide. VISG has funded development of a read-depth-based method to estimate copy number (CN) at known CN-variable loci in next-generation (re)sequencing data; the method has been evaluated using the 1000 Genomes Project data. It involves fitting a three-component normal mixture model (in a custom R function) to sequence data for known copy-number-variable loci, with the components corresponding to deletion, normal CN (CN = 2) and duplication. Evaluations showed that at 7 of 9 complex loci there was >90% concordance with microarray data. The usefulness of this method for identifying linked tag SNPs was also evaluated in different population subsets of the 1000 Genomes Project. The approach has since been extended to more complex loci and to genome-wide estimation of CNV, where its estimates compare well with those from hybridisation-based methods. The method is now being extended to horticultural species, including polyploid species.
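The mixture-model step described above can be illustrated with a minimal sketch. This is not the CNVrd code (which is written in R); it is a plain expectation-maximisation fit in Python, and it assumes that read depth has already been normalised so that CN = 2 corresponds to a value near 1.0. The function names, starting values and simulated data are all hypothetical.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def responsibilities(x, weights, means, sds):
    """Posterior probability that observation x belongs to each component."""
    dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
    total = sum(dens)
    if total == 0.0:  # x is far from every component: fall back to uniform
        return [1.0 / len(dens)] * len(dens)
    return [d / total for d in dens]


def fit_cn_mixture(depths, init_means=(0.5, 1.0, 1.5), n_iter=200, min_sd=0.01):
    """Fit a three-component normal mixture to normalised read depths by EM.

    The starting means (assumed values) stand for deletion, normal CN and
    duplication on a scale where CN = 2 has depth 1.0.
    """
    k = len(init_means)
    means, sds, weights = list(init_means), [0.1] * k, [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: component membership probabilities for every sample
        resp = [responsibilities(x, weights, means, sds) for x in depths]
        # M-step: update each component from its weighted observations
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-8:  # empty component: keep its previous parameters
                continue
            means[j] = sum(r[j] * x for r, x in zip(resp, depths)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, depths)) / nj
            sds[j] = max(math.sqrt(var), min_sd)
            weights[j] = nj / len(depths)
    return means, sds, weights


def call_copy_number(depths, weights, means, sds):
    """Assign each sample to the most probable of the three CN classes."""
    labels = ("deletion", "normal", "duplication")
    # Name components by the rank of their fitted mean, lowest first
    order = sorted(range(len(means)), key=lambda j: means[j])
    name = {j: labels[rank] for rank, j in enumerate(order)}
    calls = []
    for x in depths:
        r = responsibilities(x, weights, means, sds)
        calls.append(name[max(range(len(r)), key=lambda j: r[j])])
    return calls


# Simulated (not real) normalised depths: 20 deletions, 150 normal, 30 duplications
rng = random.Random(1)
depths = ([rng.gauss(0.5, 0.05) for _ in range(20)]
          + [rng.gauss(1.0, 0.05) for _ in range(150)]
          + [rng.gauss(1.5, 0.05) for _ in range(30)])
means, sds, weights = fit_cn_mixture(depths)
calls = call_copy_number(depths, weights, means, sds)
```

Real loci are rarely this well separated, and more complex loci would need more than three components; the sketch only shows the shape of the calculation.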
Software and presentations
- The analysis pipeline, CNVrd, runs on Linux, was released under the GPLv2 license, and is available at http://code.google.com/p/cnvrdfortagsnps/.
- A second version, CNVrd2, has been produced that includes genome-wide estimation – see https://github.com/CNVrd2/CNVrd2. This version has also been submitted to Bioconductor.
- Is copy number variation at the FCGR locus able to be tagged by single nucleotide polymorphisms?
Methods for quantitative trait locus (QTL) detection in allopolyploids
Detecting genes contributing to trait variation in biparental populations (i.e. QTL detection) in allopolyploids has in the past been limited to single allele dosage-based methods originally developed for diploids. Our approach estimates probabilities of different linkage phases among linked codominant marker loci, and incorporates this uncertainty in QTL detection in a Bayesian model selection framework. This approach extends a Bayesian approximation method developed by Dr Rod Ball for QTL mapping in diploids (Ball 2001 Genetics 159 (3) 1351-1364). It also provides a method to obtain model-averaged estimates of QTL effects, which adjust for the overestimation of gene substitution effects that typically occurs in QTL detection analyses.
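To illustrate why model averaging shrinks effect estimates, here is a minimal Python sketch of BIC-based model weighting. It is not the ‘polyploids’ package and it does not model linkage-phase uncertainty. It assumes the convention BIC = -2 log L + k log n (smaller is better), under which the posterior probability of a model is approximately proportional to its prior times exp(-BIC/2). The function names and all numbers are hypothetical.

```python
import math


def posterior_model_probs(bics, priors=None):
    """Approximate posterior model probabilities from BIC values.

    Uses P(model | data) proportional to prior(model) * exp(-BIC / 2),
    with equal priors unless others are supplied.
    """
    if priors is None:
        priors = [1.0 / len(bics)] * len(bics)
    best = min(bics)  # subtract the smallest BIC for numerical stability
    unnorm = [p * math.exp(-0.5 * (b - best)) for b, p in zip(bics, priors)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def model_averaged_effect(effects, probs):
    """Average an effect over models; it is 0 in models that omit the QTL."""
    return sum(e * p for e, p in zip(effects, probs))


# Hypothetical candidate models: no QTL, QTL at marker A, QTL at marker B
bics = [210.0, 206.0, 208.0]
effect_at_a = [0.0, 0.8, 0.0]  # estimated effect of marker A in each model

probs = posterior_model_probs(bics)
averaged = model_averaged_effect(effect_at_a, probs)
```

Because the models without a QTL at marker A contribute an effect of zero, the averaged estimate is always smaller in magnitude than the estimate from the selected model alone, which is the adjustment for overestimation described above.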
Reports and Presentations
A manuscript describing the method can be obtained from Gail.Timmerman-Vaughan@plantandfood.co.nz.
Presentations (at MapNet workshops) describing the method development include:
- Hidden Markov Model (HMM) peeling and the alignment problem for QTL mapping in allo-polyploids.
- Bayesian QTL mapping for allo-polyploids and diploids with Bayes QTLBIC.
The QTL mapping method on which this work is based was published in: Ball, R. D. (2001). Bayesian methods for quantitative trait loci mapping based on model selection: Approximate analysis using the Bayesian Information Criterion. Genetics, 159(3), 1351-1364.
An R package, ‘polyploids’, has been developed (with an accompanying vignette). Contact: Gail.Timmerman-Vaughan@plantandfood.co.nz.
Bayesian Methods for Designing Association Genetics Studies
Throughout the plant and animal kingdoms, the genetic architecture of most heritable characteristics is dominated by genes that individually contribute only small amounts to overall genetic variance. To detect the effects of such genes with sufficient evidence for association, very large experimental populations and hierarchical designs are needed. Bayesian methods developed by Ball (Genetics 2005) for unstructured populations calculate the sample sizes required to achieve a minimum Bayes Factor (a direct measure of evidence for an effect) for specific polymorphisms used in association genetics. VISG has advanced this method to design, and estimate the power of, case-control experiments in human medical genetics. R scripts have been written for both unstructured populations (‘ldDesign’) and case-control experiments (‘ccDesign’); they include a module for estimating Bayes Factors corresponding to the values of F statistics from a one-way ANOVA in association genetics experiments with codominant biallelic markers.
Ball, R. D. (2011). Experimental Designs for Robust Detection of Effects in Genome-Wide Case–Control Studies. Genetics, 189(4), 1497-1514. doi: 10.1534/genetics.111.131698.
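The link between an F statistic and a Bayes Factor can be sketched with a generic BIC approximation. This is not the calculation used in ldDesign or ccDesign, whose priors and formulas are given in the papers cited here; it only assumes that the Bayes Factor for a marker effect is approximately exp(-ΔBIC/2) when comparing a one-way ANOVA on three genotype classes with a null model. The sample-size search is a crude plug-in that ignores sampling variability and power. All function names and numbers are hypothetical.

```python
import math


def log_bf_from_f(f_stat, n, groups=3):
    """Approximate log Bayes Factor (effect vs. no effect) from a one-way ANOVA F.

    BIC approximation: BF is about (RSS0 / RSS1)^(n/2) * n^(-df1/2), where
    RSS0 / RSS1 = 1 + F * df1 / df2. With a codominant biallelic marker there
    are three genotype groups, so df1 = 2.
    """
    df1 = groups - 1
    df2 = n - groups
    if df2 <= 0:
        raise ValueError("n must exceed the number of genotype groups")
    return 0.5 * n * math.log1p(f_stat * df1 / df2) - 0.5 * df1 * math.log(n)


def plug_in_f(var_explained, n, groups=3):
    """F value implied if the marker explains exactly this share of variance."""
    df1, df2 = groups - 1, n - groups
    return (var_explained / (1.0 - var_explained)) * df2 / df1


def min_sample_size(var_explained, min_bf, groups=3, max_n=1_000_000):
    """Smallest n whose plug-in Bayes Factor reaches min_bf (None if not found)."""
    target = math.log(min_bf)
    for n in range(groups + 1, max_n + 1):
        if log_bf_from_f(plug_in_f(var_explained, n, groups), n, groups) >= target:
            return n
    return None


# A marker explaining 1% of the variance, with a Bayes Factor of 10^6 required
n_needed = min_sample_size(var_explained=0.01, min_bf=1e6)
```

Even under this simplified approximation, a marker explaining 1% of the variance needs a sample of several thousand individuals to reach a Bayes Factor of one million, which is consistent with the need for very large experimental populations noted above.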
Other publications relating to use of the Bayes Factor and design of experimental populations are:
- Ball, R. D. (2005). Experimental designs for reliable detection of linkage disequilibrium in unstructured random population association studies. Genetics, 170(2), 859-873.
- Ball, R. D. (2007). Statistical analysis and experimental design. In N. Oraguzie, E. H. A. Rikkerink, S. E. Gardiner & H. N. De Silva (Eds.), Association mapping in plants (pp. 133-196).
- Ball, R. D. (2013). Designing a GWAS: Power, sample size, and data structure. Methods in Molecular Biology (Vol. 1019, pp. 37-98).
- Ball, R. D. (2013). Statistical analysis of genomic data. Methods in Molecular Biology (Vol. 1019, pp. 171-192). Springer Science+Business Media B.V.
Software, reports and presentations
VISG has produced an R package, ccDesign, as an addendum to an existing R package, ldDesign: http://cran.r-project.org/web/packages/ldDesign/index.html
Presentations describing the method include:
- VISG Experimental Designs Project 2011: Experimental Design for Genome-wide case-control studies.
- Experimental design for genome-wide case-control studies.
A Galaxy Pipeline for Detecting Evidence of Signatures of Selection
Large genome-wide genotypic data sets and next-generation (re)sequencing data sets contain information that could reveal evidence of selection in regions not previously known to be associated with trait variation. Preparing data sets for analysis using a suite of analytical methods is time-consuming and challenging. VISG is developing a Galaxy pipeline to prepare and mine data sets for evidence of selection.
This includes implementing the pipeline in a parallel computing environment. The pipeline is currently being adapted for different biological contexts, including both phase-unknown data (i.e. outbred diploids such as humans, livestock, and various annual and perennial plant species) and phase-known data (haploid tissues and diploid inbred lines of a range of crop plants and experimental animal species).
Reports and Presentations
- As part of this tool development we worked on Connecting Genetics Researchers to New Zealand eScience Infrastructure.
- Connecting Genetics Researchers to NESI.
Further Development of a Galaxy pipeline to Design High Resolution Melting assays from NGS data
High resolution melting (HRM) is a PCR-based genotyping technology that reveals differences in DNA sequence without requiring a priori knowledge of the actual sequence differences. VISG is co-funding further development of a Galaxy-based pipeline that uses next-generation (re)sequencing data to design primers for regions containing putative polymorphisms.
The original pipeline was described in the following paper: Baldwin et al. 2012. A Toolkit for bulk PCR-based marker design from next-generation sequence data: application for development of a framework linkage map in bulb onion (Allium cepa L.). BMC Genomics 2012, 13:637.