TY - JOUR
T1 - Umpire 2.0
T2 - Simulating realistic, mixed-type, clinical data for machine learning [version 2; peer review: 1 approved, 1 approved with reservations]
AU - Coombes, Caitlin E.
AU - Abrams, Zachary B.
AU - Nakayiza, Samantha
AU - Brock, Guy
AU - Coombes, Kevin R.
N1 - Funding Information:
Grant information: This work was supported by the National Cancer Institute [P30CA016058] and the National Center for Advancing Translational Sciences [UL1TR002733].
Publisher Copyright:
© 2021. Coombes CE et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
PY - 2021
Y1 - 2021
N2 - The Umpire 2.0 R-package offers a streamlined, user-friendly workflow to simulate complex, heterogeneous, mixed-type data with known subgroup identities, dichotomous outcomes, and time-to-event data, while providing ample opportunities for fine-tuning and flexibility. Here, we describe how we have expanded the core Umpire 1.0 R-package, developed to simulate gene expression data, to generate clinically realistic, mixed-type data for use in evaluating unsupervised and supervised machine learning (ML) methods. As the availability of large-scale clinical data for ML has increased, clinical data has posed unique challenges, including widely variable size, individual biological heterogeneity, data collection and measurement noise, and mixed data types. Developing and validating ML methods for clinical data requires data sets with known ground truth, generated from simulation. Umpire 2.0 addresses challenges to simulating realistic clinical data by providing the user with a series of modules to generate survival parameters and subgroups, apply meaningful additive noise, and discretize to single or mixed data types. Umpire 2.0 provides broad functionality across sample sizes, feature spaces, and data types, allowing the user to simulate correlated, heterogeneous, binary, continuous, categorical, or mixed-type data from the scale of a small clinical trial to data on thousands of patients drawn from electronic health records. The user may generate elaborate simulations by varying parameters in order to compare algorithms or interrogate operating characteristics of an algorithm in both supervised and unsupervised ML.
AB - The Umpire 2.0 R-package offers a streamlined, user-friendly workflow to simulate complex, heterogeneous, mixed-type data with known subgroup identities, dichotomous outcomes, and time-to-event data, while providing ample opportunities for fine-tuning and flexibility. Here, we describe how we have expanded the core Umpire 1.0 R-package, developed to simulate gene expression data, to generate clinically realistic, mixed-type data for use in evaluating unsupervised and supervised machine learning (ML) methods. As the availability of large-scale clinical data for ML has increased, clinical data has posed unique challenges, including widely variable size, individual biological heterogeneity, data collection and measurement noise, and mixed data types. Developing and validating ML methods for clinical data requires data sets with known ground truth, generated from simulation. Umpire 2.0 addresses challenges to simulating realistic clinical data by providing the user with a series of modules to generate survival parameters and subgroups, apply meaningful additive noise, and discretize to single or mixed data types. Umpire 2.0 provides broad functionality across sample sizes, feature spaces, and data types, allowing the user to simulate correlated, heterogeneous, binary, continuous, categorical, or mixed-type data from the scale of a small clinical trial to data on thousands of patients drawn from electronic health records. The user may generate elaborate simulations by varying parameters in order to compare algorithms or interrogate operating characteristics of an algorithm in both supervised and unsupervised ML.
KW - clinical data
KW - clinical informatics
KW - clustering
KW - machine learning
KW - mixed data
KW - mixed-type data
KW - supervised machine learning
KW - unsupervised machine learning
UR - http://www.scopus.com/inward/record.url?scp=85117287345&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85117287345&partnerID=8YFLogxK
U2 - 10.12688/F1000RESEARCH.25877.2
DO - 10.12688/F1000RESEARCH.25877.2
M3 - Article
AN - SCOPUS:85117287345
SN - 2046-1402
VL - 9
SP - 1
EP - 28
JO - F1000Research
JF - F1000Research
ER -