Nonparametric distributed learning architecture for big data: Algorithm and applications

Scott Bruce; Zeda Li; Hsiang Chieh Yang; Subhadeep Mukhopadhyay

doi:10.1109/TBDATA.2018.2810187

Research output: Contribution to journal › Article › peer-review

4 Scopus citations

Abstract

Dramatic increases in the size and complexity of modern datasets have made traditional centralized statistical inference prohibitive. In addition to computational challenges associated with big data learning, the presence of numerous data types (e.g., discrete, continuous, categorical, etc.) makes automation and scalability difficult. A question of immediate concern is how to design a data-intensive statistical inference architecture without changing the basic statistical modeling principles developed for small data over the last century. To address this problem, we present MetaLP, a flexible, distributed statistical modeling framework suitable for large-scale data analysis, where statistical inference meets big data computing. This framework consists of three key components that work together to provide a holistic solution for big data learning: (i) partitioning massive data into smaller datasets for parallel processing and efficient computation, (ii) modern nonparametric learning based on a specially designed, orthonormal data transformation leading to mixed data algorithms, and finally (iii) combining heterogeneous local inferences from partitioned data using meta-analysis techniques to arrive at the global inference for the original big data. We present an application of this general theory in the context of a nonparametric two-sample inference algorithm for Expedia personalized hotel recommendations based on 10 million search result records.
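The three components named in the abstract (partition, local inference, meta-analytic combination) can be illustrated with a minimal Python sketch. It is not the MetaLP algorithm: the local statistic here is a plain difference in group means rather than the paper's LP-transform-based statistic, and the combination step is a standard fixed-effect, inverse-variance weighted average, chosen only for brevity. All function names (`partition`, `local_inference`, `combine`, `distributed_two_sample`) and the `(group, value)` record layout are assumptions made for this illustration.

```python
import random
import statistics


def partition(records, k, seed=0):
    """Randomly split records into k roughly equal subsets (hypothetical helper)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def local_inference(subset):
    """Local two-sample estimate on one subset.

    Each record is assumed to be a (group, value) pair with group in {0, 1}.
    Returns the difference in group means and its estimated variance.
    This stands in for the paper's LP-based statistic, which is not shown here.
    """
    a = [v for g, v in subset if g == 0]
    b = [v for g, v in subset if g == 1]
    estimate = statistics.mean(b) - statistics.mean(a)
    variance = statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    return estimate, variance


def combine(local_results):
    """Fixed-effect meta-analysis: inverse-variance weighted average.

    Returns the pooled estimate and the variance of the pooled estimate.
    """
    weights = [1.0 / var for _, var in local_results]
    total = sum(weights)
    pooled = sum(w * est for w, (est, _) in zip(weights, local_results)) / total
    return pooled, 1.0 / total


def distributed_two_sample(records, k):
    """Partition, infer locally on each subset, then combine."""
    return combine([local_inference(s) for s in partition(records, k)])
```

In a real deployment the calls to `local_inference` would run in parallel on separate workers. A fixed-effect combination also assumes the subsets estimate one common effect; when local results are heterogeneous, which the abstract explicitly mentions, a random-effects combination would be the more natural choice.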

Original language	English (US)
Article number	8303780
Pages (from-to)	166-179
Number of pages	14
Journal	IEEE Transactions on Big Data
Volume	5
Issue number	2
DOIs	https://doi.org/10.1109/TBDATA.2018.2810187
State	Published - Jun 1 2019
Externally published	Yes

Keywords

LP transformation
Nonparametric mixed data modeling
data-parallelism
distributed statistical learning
heterogeneity
meta-analysis

ASJC Scopus subject areas

Information Systems
Information Systems and Management

Cite this

@article{4af10e3c4f474ea39c84ae540a24619d,

title = "Nonparametric distributed learning architecture for big data: Algorithm and applications",

abstract = "Dramatic increases in the size and complexity of modern datasets have made traditional centralized statistical inference prohibitive. In addition to computational challenges associated with big data learning, the presence of numerous data types (e.g., discrete, continuous, categorical, etc.) makes automation and scalability difficult. A question of immediate concern is how to design a data-intensive statistical inference architecture without changing the basic statistical modeling principles developed for small data over the last century. To address this problem, we present MetaLP, a flexible, distributed statistical modeling framework suitable for large-scale data analysis, where statistical inference meets big data computing. This framework consists of three key components that work together to provide a holistic solution for big data learning: (i) partitioning massive data into smaller datasets for parallel processing and efficient computation, (ii) modern nonparametric learning based on a specially designed, orthonormal data transformation leading to mixed data algorithms, and finally (iii) combining heterogeneous local inferences from partitioned data using meta-analysis techniques to arrive at the global inference for the original big data. We present an application of this general theory in the context of a nonparametric two-sample inference algorithm for Expedia personalized hotel recommendations based on 10 million search result records.",

keywords = "LP transformation, Nonparametric mixed data modeling, data-parallelism, distributed statistical learning, heterogeneity, meta-analysis",

author = "Scott Bruce and Zeda Li and Yang, {Hsiang Chieh} and Subhadeep Mukhopadhyay",

note = "Publisher Copyright: {\textcopyright} 2015 IEEE.",

year = "2019",

month = jun,

day = "1",

doi = "10.1109/TBDATA.2018.2810187",

language = "English (US)",

volume = "5",

pages = "166--179",

journal = "IEEE Transactions on Big Data",

issn = "2332-7790",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

number = "2",

}

TY - JOUR

T1 - Nonparametric distributed learning architecture for big data

T2 - Algorithm and applications

AU - Bruce, Scott

AU - Li, Zeda

AU - Yang, Hsiang Chieh

AU - Mukhopadhyay, Subhadeep

PY - 2019/6/1

Y1 - 2019/6/1

N2 - Dramatic increases in the size and complexity of modern datasets have made traditional centralized statistical inference prohibitive. In addition to computational challenges associated with big data learning, the presence of numerous data types (e.g., discrete, continuous, categorical, etc.) makes automation and scalability difficult. A question of immediate concern is how to design a data-intensive statistical inference architecture without changing the basic statistical modeling principles developed for small data over the last century. To address this problem, we present MetaLP, a flexible, distributed statistical modeling framework suitable for large-scale data analysis, where statistical inference meets big data computing. This framework consists of three key components that work together to provide a holistic solution for big data learning: (i) partitioning massive data into smaller datasets for parallel processing and efficient computation, (ii) modern nonparametric learning based on a specially designed, orthonormal data transformation leading to mixed data algorithms, and finally (iii) combining heterogeneous local inferences from partitioned data using meta-analysis techniques to arrive at the global inference for the original big data. We present an application of this general theory in the context of a nonparametric two-sample inference algorithm for Expedia personalized hotel recommendations based on 10 million search result records.

KW - LP transformation

KW - Nonparametric mixed data modeling

KW - data-parallelism

KW - distributed statistical learning

KW - heterogeneity

KW - meta-analysis

UR - http://www.scopus.com/inward/record.url?scp=85140807329&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85140807329&partnerID=8YFLogxK

U2 - 10.1109/TBDATA.2018.2810187

DO - 10.1109/TBDATA.2018.2810187

M3 - Article

AN - SCOPUS:85140807329

SN - 2332-7790

VL - 5

SP - 166

EP - 179

JO - IEEE Transactions on Big Data

JF - IEEE Transactions on Big Data

IS - 2

M1 - 8303780

ER -
