Abstract:
Multiple additive regression trees (MART) have been widely used in the literature for various classification tasks. However, the overfitting effects of MART across heterogeneous and highly imbalanced big data structures within distributed environments have not yet been investigated. In this work, we utilize distributed MART with hybrid loss to resolve overfitting effects during the training of disease classification models in a case study with 10 heterogeneous and distributed clinical datasets. Lexical and semantic analysis methods were utilized to match heterogeneous terminologies with 80% overlap. Data augmentation was used to resolve class imbalance, yielding virtual data with a goodness of fit of 0.01 and a correlation difference of 0.02. Our results highlight the favorable performance of the proposed distributed MART on the augmented data, with an average increase of 7.3% in accuracy, 6.8% in sensitivity, and 10.4% in specificity for a specific loss function topology.
Published in: 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC)
Date of Conference: 01-05 November 2021
Date Added to IEEE Xplore: 09 December 2021
PubMed ID: 34891606
- IEEE Keywords
- Index Terms
- Classification Task
- Heterogeneous Data
- Regression Tree
- Gradient Boosting
- Multiple Trees
- Multiple Regression Trees
- Loss Function
- Goodness Of Fit
- Big Data
- Data Augmentation
- Distribution Of Dataset
- Virtual Data
- Lexical Analysis
- Overfitting Effect
- Data Quality
- Decision-making Process
- Support Vector Machine
- Selection Effects
- Dropout Rate
- Data Pre-processing
- Joint Variables
- Ensemble Of Trees
- Distributed Learning
- General Data Protection Regulation
- Data Harmonization
- Huber Loss
- Incremental Learning
- Stochastic Gradient Descent
- Training Instances
- Training Stage