A machine-learning approach to combined evidence validation of genome assemblies

Jeong-Hyeon Choi; Sun Kim; Haixu Tang; Justen Andrews; Don G. Gilbert; John K. Colbourne

doi:10.1093/bioinformatics/btm608

Research output: Contribution to journal › Article › peer-review

18 Scopus citations

Abstract

Motivation: While it is common to refer to 'the genome sequence' as if it were a single, complete and contiguous DNA string, it is in fact an assembly of millions of small, partially overlapping DNA fragments. Sophisticated computer algorithms (assemblers and scaffolders) merge these DNA fragments into contigs, and place these contigs into sequence scaffolds using the paired-end sequences derived from large-insert DNA libraries. Each step in this automated process is susceptible to producing errors; hence, the resulting draft assembly represents (in practice) only a likely assembly that requires further validation. Knowing which parts of the draft assembly are likely free of errors is critical if researchers are to draw reliable conclusions from the assembled sequence data.

Results: We develop a machine-learning method to detect assembly errors in sequence assemblies. Several in silico measures for assembly validation have been proposed by various researchers. Using three benchmarking Drosophila draft genomes, we evaluate these techniques along with some new measures that we propose, including the good-minus-bad coverage (GMB), the good-to-bad-ratio (RGB), the average Z-score (AZ) and the average absolute Z-score (ASZ). Our results show that the GMB measure performs better than the others in both its sensitivity and its specificity for assembly error detection. Nevertheless, no single method performs sufficiently well to reliably detect genomic regions requiring attention for further experimental verification. To utilize the advantages of all these measures, we develop a novel machine learning approach that combines these individual measures to achieve a higher prediction accuracy (i.e. greater than 90%). Our combined evidence approach avoids the difficult and often ad hoc selection of many parameters the individual measures require, and significantly improves the overall precisions on the benchmarking data sets.

Original language	English (US)
Pages (from-to)	744-750
Number of pages	7
Journal	Bioinformatics
Volume	24
Issue number	6
DOIs	https://doi.org/10.1093/bioinformatics/btm608
State	Published - Mar 2008
Externally published	Yes

ASJC Scopus subject areas

Statistics and Probability
Biochemistry
Molecular Biology
Computer Science Applications
Computational Theory and Mathematics
Computational Mathematics

Cite this

@article{a2dad40f5b924f769bec514cde657a38,

title = "A machine-learning approach to combined evidence validation of genome assemblies",

abstract = "Motivation: While it is common to refer to 'the genome sequence' as if it were a single, complete and contiguous DNA string, it is in fact an assembly of millions of small, partially overlapping DNA fragments. Sophisticated computer algorithms (assemblers and scaffolders) merge these DNA fragments into contigs, and place these contigs into sequence scaffolds using the paired-end sequences derived from large-insert DNA libraries. Each step in this automated process is susceptible to producing errors; hence, the resulting draft assembly represents (in practice) only a likely assembly that requires further validation. Knowing which parts of the draft assembly are likely free of errors is critical if researchers are to draw reliable conclusions from the assembled sequence data. Results: We develop a machine-learning method to detect assembly errors in sequence assemblies. Several in silico measures for assembly validation have been proposed by various researchers. Using three benchmarking Drosophila draft genomes, we evaluate these techniques along with some new measures that we propose, including the good-minus-bad coverage (GMB), the good-to-bad-ratio (RGB), the average Z-score (AZ) and the average absolute Z-score (ASZ). Our results show that the GMB measure performs better than the others in both its sensitivity and its specificity for assembly error detection. Nevertheless, no single method performs sufficiently well to reliably detect genomic regions requiring attention for further experimental verification. To utilize the advantages of all these measures, we develop a novel machine learning approach that combines these individual measures to achieve a higher prediction accuracy (i.e. greater than 90%). Our combined evidence approach avoids the difficult and often ad hoc selection of many parameters the individual measures require, and significantly improves the overall precisions on the benchmarking data sets.",

author = "Jeong-Hyeon Choi and Sun Kim and Haixu Tang and Justen Andrews and Gilbert, {Don G.} and Colbourne, {John K.}",

note = "Funding Information: We are grateful to anonymous reviewers for their valuable comments. This research was supported in part by the Indiana METACyt Initiative of Indiana University, funded in part through a major grant from the Lilly Endowment, Inc. and by NSF Career DBI-0237901. Computer support was provided by an allocation TG-MCB060059N through the TeraGrid Advanced Support, by the University Information Technology Services (UITS) and by The Center for Genomics and Bioinformatics computing group. We thank Richard Repasky (UITS) who helped conceive this project.",

year = "2008",

month = mar,

doi = "10.1093/bioinformatics/btm608",

language = "English (US)",

volume = "24",

pages = "744--750",

journal = "Bioinformatics",

issn = "1367-4803",

publisher = "Oxford University Press",

number = "6",

}

TY - JOUR

T1 - A machine-learning approach to combined evidence validation of genome assemblies

AU - Choi, Jeong-Hyeon

AU - Kim, Sun

AU - Tang, Haixu

AU - Andrews, Justen

AU - Gilbert, Don G.

AU - Colbourne, John K.

N1 - Funding Information: We are grateful to anonymous reviewers for their valuable comments. This research was supported in part by the Indiana METACyt Initiative of Indiana University, funded in part through a major grant from the Lilly Endowment, Inc. and by NSF Career DBI-0237901. Computer support was provided by an allocation TG-MCB060059N through the TeraGrid Advanced Support, by the University Information Technology Services (UITS) and by The Center for Genomics and Bioinformatics computing group. We thank Richard Repasky (UITS) who helped conceive this project.

PY - 2008/3

Y1 - 2008/3

N2 - Motivation: While it is common to refer to 'the genome sequence' as if it were a single, complete and contiguous DNA string, it is in fact an assembly of millions of small, partially overlapping DNA fragments. Sophisticated computer algorithms (assemblers and scaffolders) merge these DNA fragments into contigs, and place these contigs into sequence scaffolds using the paired-end sequences derived from large-insert DNA libraries. Each step in this automated process is susceptible to producing errors; hence, the resulting draft assembly represents (in practice) only a likely assembly that requires further validation. Knowing which parts of the draft assembly are likely free of errors is critical if researchers are to draw reliable conclusions from the assembled sequence data. Results: We develop a machine-learning method to detect assembly errors in sequence assemblies. Several in silico measures for assembly validation have been proposed by various researchers. Using three benchmarking Drosophila draft genomes, we evaluate these techniques along with some new measures that we propose, including the good-minus-bad coverage (GMB), the good-to-bad-ratio (RGB), the average Z-score (AZ) and the average absolute Z-score (ASZ). Our results show that the GMB measure performs better than the others in both its sensitivity and its specificity for assembly error detection. Nevertheless, no single method performs sufficiently well to reliably detect genomic regions requiring attention for further experimental verification. To utilize the advantages of all these measures, we develop a novel machine learning approach that combines these individual measures to achieve a higher prediction accuracy (i.e. greater than 90%). Our combined evidence approach avoids the difficult and often ad hoc selection of many parameters the individual measures require, and significantly improves the overall precisions on the benchmarking data sets.

AB - Motivation: While it is common to refer to 'the genome sequence' as if it were a single, complete and contiguous DNA string, it is in fact an assembly of millions of small, partially overlapping DNA fragments. Sophisticated computer algorithms (assemblers and scaffolders) merge these DNA fragments into contigs, and place these contigs into sequence scaffolds using the paired-end sequences derived from large-insert DNA libraries. Each step in this automated process is susceptible to producing errors; hence, the resulting draft assembly represents (in practice) only a likely assembly that requires further validation. Knowing which parts of the draft assembly are likely free of errors is critical if researchers are to draw reliable conclusions from the assembled sequence data. Results: We develop a machine-learning method to detect assembly errors in sequence assemblies. Several in silico measures for assembly validation have been proposed by various researchers. Using three benchmarking Drosophila draft genomes, we evaluate these techniques along with some new measures that we propose, including the good-minus-bad coverage (GMB), the good-to-bad-ratio (RGB), the average Z-score (AZ) and the average absolute Z-score (ASZ). Our results show that the GMB measure performs better than the others in both its sensitivity and its specificity for assembly error detection. Nevertheless, no single method performs sufficiently well to reliably detect genomic regions requiring attention for further experimental verification. To utilize the advantages of all these measures, we develop a novel machine learning approach that combines these individual measures to achieve a higher prediction accuracy (i.e. greater than 90%). Our combined evidence approach avoids the difficult and often ad hoc selection of many parameters the individual measures require, and significantly improves the overall precisions on the benchmarking data sets.

UR - http://www.scopus.com/inward/record.url?scp=40749116115&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=40749116115&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btm608

DO - 10.1093/bioinformatics/btm608

M3 - Article

C2 - 18204064

AN - SCOPUS:40749116115

SN - 1367-4803

VL - 24

SP - 744

EP - 750

JO - Bioinformatics

JF - Bioinformatics

IS - 6

ER -
