NEAUA - Machine Learning Approaches to Optimize the Integration of Sociodemographic Factors in the Prediction of Cancer-Specific Survival Among Patients with High-Risk Prostate Cancer

Back to 2025 Abstracts

Machine Learning Approaches to Optimize the Integration of Sociodemographic Factors in the Prediction of Cancer-Specific Survival Among Patients with High-Risk Prostate Cancer
Ismail Ajjawi, BS¹, Isaac E. Kim, BS², Shayan Smani, BS¹, Peter Palencia, MD¹, Gabriela M. Diaz, MD¹, Ho-Joon Lee, MD¹, Isaac Y. Kim, MD¹, Preston Sprenkle, MD¹, Michael S. Leapman, MD¹.
¹Yale School of Medicine, New Haven, CT, USA, ²The Warren Alpert Medical School of Brown University, Providence, RI, USA.

BACKGROUND: Although social and demographic factors are associated with prostate cancer (PCa) outcomes, these variables are not generally included in clinical risk stratification tools. We aimed to evaluate whether machine learning-based approaches can optimize the inclusion of sociodemographic factors when predicting cancer-specific survival in patients with high-risk PCa.
METHODS: We conducted a retrospective analysis using the SEER database to identify patients with high-risk PCa diagnosed from 2010 to 2020. Two random forest models were developed: one based on clinical/pathologic factors (age, stage, PSA, Gleason grade, time to treatment, year of diagnosis) and a combined model that added sociodemographic factors (race, income, marital status, region, urbanicity). Model performance for predicting cancer-specific survival (CSS) was assessed using 5-fold cross-validation, with hyperparameter optimization to tune the models. Performance was evaluated using AUC, Brier scores, sensitivity, and specificity. A similar analysis was conducted using XGBoost models. Decision curve analysis was used to evaluate the clinical usefulness of both models.
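The four performance metrics named in the methods can be illustrated with a minimal sketch in plain Python. The abstract does not state which software was used, so the function names, the 0.5 classification threshold, and the toy labels and probabilities below are hypothetical and serve only to show how each metric is defined.

```python
# Illustrative only: not the authors' code. Labels are 1 for the event
# of interest and 0 otherwise; probabilities are model-predicted risks.

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted risk and observed outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def auc(y_true, y_prob):
    """Probability that a random event case is ranked above a random
    non-event case (ties count as one half)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = 0.0
    for pp in pos:
        for pn in neg:
            if pp > pn:
                wins += 1.0
            elif pp == pn:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def sensitivity_specificity(y_true, y_prob, threshold=0.5):
    """Classify as positive when predicted risk >= threshold (assumed cutoff)."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
    fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < threshold)
    tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < threshold)
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical toy data, not drawn from SEER.
toy_labels = [1, 0, 1, 0, 0, 1, 0, 0]
toy_risks = [0.9, 0.2, 0.6, 0.4, 0.1, 0.3, 0.7, 0.05]

if __name__ == "__main__":
    print("AUC:", auc(toy_labels, toy_risks))
    print("Brier:", brier_score(toy_labels, toy_risks))
    print("Sens/Spec:", sensitivity_specificity(toy_labels, toy_risks))
```

In a cross-validated setting such as the one described, these metrics would typically be computed on each held-out fold and then summarized across folds; lower Brier scores and higher AUC indicate better performance.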
RESULTS: We identified 80,858 eligible patients with high-risk PCa. The clinical-only random forest model (AUC 0.54) was improved by adding sociodemographic features (AUC 0.72, p<0.001). Brier scores were lower, indicating better performance, for the combined model (0.045 vs. 0.053, p<0.001), and both sensitivity (0.75 vs. 0.61, p<0.001) and specificity (0.80 vs. 0.68, p<0.001) improved. Similar trends were observed in the XGBoost models. Variable importance analysis revealed that Gleason grade was the strongest predictor of outcome, but sociodemographic factors were also informative, with income and region having predictive power comparable to PSA. Decision curve analysis revealed higher net benefit for the combined model across clinically relevant thresholds of 5% to 20% for CSS.
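For readers unfamiliar with decision curve analysis, the net benefit at a threshold probability p_t is conventionally defined as TP/n - (FP/n) × p_t/(1 - p_t), and is compared against a "treat all" strategy. The following is a minimal sketch of that standard formula; it is not the authors' implementation, and the function names and toy data are hypothetical.

```python
# Illustrative only: standard net-benefit formula used in decision
# curve analysis, evaluated on made-up data.

def net_benefit(y_true, y_prob, threshold):
    """Net benefit of acting on patients whose predicted risk >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= threshold)
    weight = threshold / (1.0 - threshold)  # harm of a false positive relative to a true positive
    return tp / n - (fp / n) * weight

def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: treat every patient as positive."""
    n = len(y_true)
    events = sum(y_true)
    weight = threshold / (1.0 - threshold)
    return events / n - ((n - events) / n) * weight

# Hypothetical toy data, not drawn from SEER.
dca_labels = [1, 0, 1, 0, 0, 1, 0, 0]
dca_risks = [0.9, 0.2, 0.6, 0.4, 0.1, 0.3, 0.7, 0.05]

if __name__ == "__main__":
    # Thresholds spanning the 5% to 20% range reported in the results.
    for pt in (0.05, 0.10, 0.15, 0.20):
        print(pt, net_benefit(dca_labels, dca_risks, pt),
              net_benefit_treat_all(dca_labels, pt))
```

A model is considered clinically useful at a given threshold when its net benefit exceeds both the "treat all" value and zero (the "treat none" strategy).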
CONCLUSIONS: Population-level prediction of PCa-specific survival using machine learning models was significantly improved by integrating sociodemographic variables. These findings highlight the relevance of considering both clinical and sociodemographic factors to provide more holistic estimates of cancer risk.