Abstract
In this article, we investigate the problem of detecting boundaries of DNA copy number variation (CNV) regions using the DNA-sequencing data from multiple subject samples. Genomic features along the linear realization of the actual genome are correlated, especially within vicinity of a locus, so are the sequencing reads along the genome. It is then crucial to take the correlated structure of such high-throughput genomic data into consideration when modeling DNA-sequencing data for CNV detection from statistical and computational viewpoints. We use the framework of a fused Lasso latent feature model to solve the problem, and propose a modified information criterion for selecting the tuning parameter when search for common CNVs is shared by multiple subjects. Simulation studies and application on multiple subjects' next-generation sequencing data, downloaded from the 1000 Genome Project, showed that the proposed approach can effectively identify individual CNVs of a single subject profile and common CNVs shared by multiple subjects.
Original language | English (US) |
---|---|
Pages (from-to) | 1128-1140 |
Number of pages | 13 |
Journal | Journal of Computational Biology |
Volume | 25 |
Issue number | 10 |
DOIs | |
State | Published - Oct 2018 |
Keywords
- CNVs
- DNA-sequencing
- Lasso latent feature model
- fused Lasso
- modified Bayesian information criterion.
ASJC Scopus subject areas
- Modeling and Simulation
- Molecular Biology
- Genetics
- Computational Mathematics
- Computational Theory and Mathematics