TY  - JOUR
TI  - Machine learning or morphometric scaling? A systematic review of methodological confounds and the generalizability of sex classification in neuroimaging
AU  - Sapuan, Abdul Halim
AU  - Jamaludin, Iqbal
AU  - Abdul Majid, Zafri Azran
AU  - Mohd Tamrin, Mohd Izzuddin
AU  - Che Azemin, Mohd Zulfaezal
AU  - Turaev, Sherzod
PY  - 2026
JO  - Exploration of Neuroprotective Therapy
VL  - 6
SP  - 1004141
DO  - 10.37349/ent.2026.1004141
UR  - https://www.explorationpub.com/Journals/ent/Article/1004141
AB  - Background: This systematic review critically evaluates whether machine learning (ML) identifies biologically meaningful sex-related brain architecture or merely exploits methodological artifacts and allometric scaling. While ML models achieve high classification accuracies, it remains unclear if these reflect stable, mechanistically informative dimorphism or are driven by confounds such as total intracranial volume (TIV) and site-specific noise. We examine how imaging modalities, algorithms, and population strata influence both classification outcomes and biological interpretability. Methods: Following Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, we searched Web of Science, PubMed, and Scopus through January 2024. Included studies [healthy humans, 3T magnetic resonance imaging (MRI), ML-based sex classification] were assessed for risk of bias, focusing on data leakage, validation strategies, and confound management. Results: Thirty-five studies (n > 110,000) were included. While reported accuracies reached 98.06% for T1-weighted MRI, 96.0% for diffusion MRI (dMRI), and 94.72% for functional MRI (fMRI), performance was highly dependent on population characterization and age. Deep learning consistently outperformed traditional ML (TML) but showed high sensitivity to methodological artifacts. Notably, studies failing to correct for TIV reported potentially inflated accuracies, suggesting that many models identify physical scale rather than intrinsic neuroanatomical dimorphism. Discussion: High classification accuracies are often bolstered by methodological confounds and a lack of cross-site validation. There is a significant discrepancy between ML-driven predictive power and biological inference validity. Current pipelines do not yet allow for robust, generalizable inference about brain sex. To move beyond statistical separation toward mechanistic understanding, the field must prioritize TIV-corrected benchmarks and diverse non-WEIRD (Western, Educated, Industrialized, Rich, Democratic) datasets. We conclude that while ML is a powerful pattern detector, its results must be interpreted with caution regarding biological dimorphism.
ER  -