From: A deep learning framework for classifying autism spectrum disorder from children’s facial images using a multi-scale ViT architecture and edge computing

Ablation results showing the impact of removing key components from the proposed MS-ViT-based model on performance and efficiency.

| Category | Model variant | Accuracy (%) | Sensitivity (%) | Specificity (%) | AUC-ROC | Inference time (ms) |
|---|---|---|---|---|---|---|
| Optimization | FP32 (no pruning, no quantization) | 96.12 | 95.45 | 97.32 | 0.9807 | 312 |
| Optimization | Quantization-only (INT8 PTQ, no pruning) | 96.00 | 95.20 | 97.10 | 0.9795 | 205 |
| Optimization | Pruning-only (structured channels + brief fine-tune) | 96.55 | 95.90 | 97.70 | 0.9860 | 246 |
| Optimization | QAT (quantization-aware training) | 96.70 | 96.00 | 97.80 | 0.9872 | 181 |
| Architecture | No data augmentation | 94.55 | 93.40 | 95.80 | 0.9612 | 181 |
| Architecture | No multi-scale processing | 93.68 | 92.00 | 94.90 | 0.9475 | 178 |
| Architecture | Shallow ViT backbone | 91.45 | 90.10 | 93.20 | 0.9311 | 169 |
| Architecture | No ROI extraction | 92.83 | 91.50 | 94.10 | 0.9402 | 183 |
| Deployed (final) | Full model (MS-ViT + Edge: pruning + PTQ + Augmented) | 96.85 | 96.09 | 97.92 | 0.9874 | 181 |

MS-ViT: multi-scale vision transformer; AUC-ROC: area under the receiver operating characteristic curve; ms: milliseconds; FP32: 32-bit floating point; INT8: 8-bit integer; PTQ: post-training quantization; QAT: quantization-aware training; ROI: region of interest.
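The "Quantization-only" row refers to symmetric INT8 post-training quantization. A minimal pure-Python sketch of the arithmetic involved (per-tensor symmetric scaling is a common PTQ scheme; the scale rule and the example weights below are illustrative assumptions, not details from the paper):

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 post-training quantization.

    Maps float weights to integers in [-127, 127] using one scale
    derived from the maximum absolute weight in the tensor.
    """
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the INT8 values."""
    return [qi * scale for qi in q]


# Illustrative weights (not from the paper)
w = [0.42, -1.27, 0.05, 0.90, -0.33]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)

# Round-trip error is bounded by half of one quantization step
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
assert max_err <= s / 2 + 1e-9
```

Storing INT8 weights cuts memory 4x versus FP32 and enables integer arithmetic on edge hardware, which is consistent with the latency drop from 312 ms to 205 ms in the table.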
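The "Pruning-only" row refers to structured channel pruning, which removes whole channels rather than individual weights so the resulting tensors stay dense. A minimal sketch assuming L1-norm channel importance (a common selection criterion; the paper does not state which criterion it uses):

```python
def prune_channels(channel_weights, keep_ratio):
    """Structured pruning: rank channels by L1 norm and keep the
    top `keep_ratio` fraction, dropping whole channels at once.

    Returns the sorted indices of the channels to keep.
    """
    norms = [sum(abs(w) for w in ch) for ch in channel_weights]
    k = max(1, int(len(channel_weights) * keep_ratio))
    keep = sorted(range(len(norms)), key=lambda i: norms[i], reverse=True)[:k]
    return sorted(keep)


# Illustrative 4-channel layer (values not from the paper)
layer = [[0.9, -0.8], [0.01, 0.02], [0.5, 0.4], [-0.03, 0.02]]
kept = prune_channels(layer, keep_ratio=0.5)
assert kept == [0, 2]  # low-norm channels 1 and 3 are removed
```

A brief fine-tune after pruning, as the table row notes, lets the remaining channels recover the pruned capacity, which matches the small accuracy gain (96.55% vs 96.12%) alongside the 312 ms to 246 ms latency reduction.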