Affiliation:
1 Key To Business, R&D, 00144 Roma, Italy
Email: m.mameli@key2.it
ORCID: https://orcid.org/0000-0002-5269-9939
Affiliation:
2 Department of Neurosciences, Catholic University of the Sacred Heart, 00168 Rome, Italy
3 IRCCS “A. Gemelli” University Polyclinic Foundation, 00168 Rome, Italy
Explor Digit Health Technol. 2025;3:101158. DOI: https://doi.org/10.37349/edht.2025.101158
Received: February 27, 2025 Accepted: May 27, 2025 Published: August 11, 2025
Academic Editor: Anastasios Koulaouzidis, University of Southern Denmark (SDU), Denmark
The article belongs to the special issue Deep Learning Methods and Applications for Biomedical Imaging
Aim: Colorectal cancer is a leading cause of cancer-related mortality, emphasising the need for accurate polyp segmentation during colonoscopy for early detection. Existing methods often struggle to generalize effectively across diverse clinical scenarios. This study introduces DeepPolyp, an artificial intelligence framework designed for comprehensive benchmarking and real-time clinical deployment of polyp segmentation models.
Methods: Transformer-based segmentation models, SegFormer and SSFormer, were trained from scratch using an extensive dataset comprising public collections (CVC-ClinicDB, ETIS-LaribPolypDB, Kvasir) and recently augmented datasets (PolypDataset-TCNoEndo, PolypGen). Training involved standardized data augmentation, learning rate schedules, and early stopping. Models were evaluated using Dice and Intersection over Union (IoU) metrics. Real-time inference performance was assessed on an NVIDIA Jetson Orin device with ONNX and TensorRT optimizations.
Results: SegFormer-B4 achieved the highest accuracy (Dice: 0.9843, IoU: 0.9694), but was not selected for clinical deployment due to computational constraints. SegFormer-B2 provided comparable accuracy (Dice: 0.9787, IoU: 0.9588) with significantly faster inference (94 ms per frame), offering an optimal balance suitable for real-time clinical use. SSFormer showed lower accuracy and slower inference, limiting its practical deployment.
Conclusions: DeepPolyp enables systematic evaluation of polyp segmentation models, assisting in selecting models based on both performance and computational efficiency. Despite superior accuracy from SegFormer-B4, SegFormer-B2 was selected for clinical deployment due to its advantageous balance between accuracy and real-time execution efficiency.
Colorectal cancer remains one of the leading causes of cancer-related mortality worldwide, with more than 1.9 million new cases diagnosed annually [1]. Early detection and diagnosis are essential to increase patient survival by enabling timely medical intervention and treatment planning [2, 3]. Polyp segmentation, defined as the task of identifying and delineating polyps in endoscopic images, plays a crucial role in the screening and diagnosis of colorectal cancer [4]. Accurate segmentation assists gastroenterologists in differentiating between benign and malignant lesions, guiding decisions during procedures such as polypectomy [5–7]. However, polyp segmentation remains a difficult task due to high variability in polyp size, shape, texture, and contrast with surrounding tissue [8, 9].
Anatomical differences, varying camera angles, and inconsistent lighting across endoscopic equipment create complex visual patterns, making reliable polyp segmentation especially challenging. Traditional segmentation approaches often rely on hand-crafted features, thresholding, and region-based techniques [10]. These methods usually apply filters to extract texture, edges, or color features and use post-processing techniques such as morphological operations to refine segmentation masks. While effective in controlled environments, these techniques struggle to generalize under varying imaging conditions and are unable to capture complex spatial dependencies [11]. As a result, segmentation masks generated using traditional methods are often incomplete or inaccurate [9].
The introduction of deep learning, particularly convolutional neural networks (CNNs), has brought significant improvements in polyp segmentation by enabling automatic learning of hierarchical features from image data. Encoder-decoder architectures such as U-Net and its variants have become standard in medical image analysis due to their ability to localize features and refine object boundaries [12]. However, CNNs have limitations in modelling long-range dependencies because of their inherently local receptive fields [13].
Transformer-based architectures have recently emerged as powerful alternatives by using self-attention mechanisms to capture global dependencies. These models have demonstrated strong performance across many vision tasks, including segmentation. Hybrid architectures, such as SegFormer and SSFormer, combine the strengths of convolutional layers for local feature extraction with transformer blocks for global context modelling, achieving state-of-the-art results in several segmentation benchmarks [14]. Lightweight transformer models, including Enhanced Nanonet, have also been developed to reduce computational cost while maintaining high segmentation accuracy, supporting deployment on low-power devices [15].
Despite the progress achieved by deep learning, several challenges remain. Many current segmentation models are designed with highly specialized architectures and require extensive hyperparameter tuning to reach optimal performance. Such specialization limits model adaptability to different imaging conditions or datasets. Additionally, most models are trained on relatively small and homogeneous datasets, which fail to capture the diversity of polyp appearances encountered in clinical practice. Variations in polyp morphology and imaging modalities across patients and devices further reduce generalization performance.
Another critical issue is the computational complexity of advanced models, which restricts their deployment on embedded or portable systems that require real-time operation. This limitation is particularly relevant in clinical settings where fast and reliable analysis is essential. Consequently, there is a growing demand for general-purpose segmentation networks that are robust, adaptable to various scenarios, and efficient enough for use on edge devices.
Recent advances in transformer-based segmentation models offer promising solutions, yet their application in medical imaging—and specifically in polyp segmentation—remains limited. Many existing studies rely on standard benchmark datasets, which may not adequately represent the complexity of real clinical scenarios. New datasets, such as PolypDataset-TCNoEndo [16] and PolypGen [17], provide additional variability in polyp appearance and imaging modalities, and therefore represent more realistic testbeds for evaluating model generalization.
In addition, the deployment of segmentation models on resource-constrained hardware for real-time use has not been thoroughly investigated. Although interest in edge AI is increasing, few works evaluate segmentation models in terms of latency, segmentation quality, and computational efficiency under real hardware constraints.
To address these gaps, a modular AI framework named DeepPolyp is introduced. This framework is designed to benchmark and evaluate the performance of general-purpose transformer-based segmentation models, including SSFormer [18] and SegFormer [19], when trained on a diverse and extended set of datasets. In addition to widely used public datasets such as CVC-ClinicDB [20], CVC-ColonDB [21], ETIS-LaribPolypDB [22], and Kvasir [23], two newer datasets—PolypDataset-TCNoEndo [16] and PolypGen [17, 24, 25]—are included to ensure higher variability in the evaluation.
While recent transformer-based architectures and hybrid models have shown promising results in medical image segmentation, several gaps persist in current research. First, most existing models are designed as specialized solutions tailored to specific datasets, which limits their generalizability across diverse imaging conditions. Second, research efforts often rely on benchmark datasets that do not fully reflect the variability present in real-world clinical environments. Third, there is limited investigation into the practical feasibility of deploying such models on edge devices for real-time clinical use. These limitations are addressed in this work through the following specific contributions:
Introduction of DeepPolyp, a novel AI framework specifically designed for the comprehensive evaluation of polyp segmentation models in terms of accuracy, generalization, and deployment feasibility.
A systematic assessment of state-of-the-art general-purpose segmentation architectures, namely SSFormer [18] and SegFormer [19], retrained and evaluated on a large and diverse collection of polyp datasets.
Expansion of existing evaluation settings beyond commonly used datasets (CVC-ClinicDB [20], CVC-ColonDB [21], ETIS-LaribPolypDB [22], and Kvasir [23]) by including two additional recent datasets: PolypDataset-TCNoEndo [16] and PolypGen [17, 24, 25], providing a more realistic and challenging evaluation setting.
Evaluation of model performance under computational constraints, including inference time and resource usage, to explore deployment feasibility in resource-limited clinical environments using edge hardware.
The remainder of the paper is structured as follows. The State-of-the-art section provides a structured comparison and detailed critique of existing literature, highlighting the strengths and limitations of current methodologies relative to the proposed DeepPolyp framework. The Materials and methods section outlines the methodology, including dataset preparation, model training protocols, and evaluation metrics. The Results section presents the experimental results, emphasizing model comparison and generalization capabilities. The Discussion section examines implications for clinical deployment, focusing on computational efficiency and real-time capabilities.
Polyp segmentation has advanced significantly with deep learning techniques, which have overcome limitations of traditional methods [26]. Traditional approaches struggle with the variability in polyp size, shape, texture, and contrast, resulting in inconsistent segmentation. In contrast, deep learning models, particularly CNNs and transformer-based architectures, demonstrate superior accuracy and robustness. Recent research has focused on developing novel architectures and optimization strategies, including hybrid models that combine CNNs with transformers to capture both local features and global context. These advances have improved segmentation performance, enabling more accurate and reliable clinical applications.
CNN-based models have established strong baseline performance for polyp segmentation due to their ability to extract hierarchical features. However, these models often struggle with capturing long-range dependencies and maintaining consistent performance across varied polyp morphologies.
Fan et al. [27] pioneered a parallel reverse attention network that combines global and local features to improve boundary detection and segmentation accuracy. While effective for well-defined polyps, this approach may underperform with flat or sessile polyps that lack clear boundaries.
ResUNet variants have shown promising results. Jha et al. [28] enhanced ResUNet with squeeze-and-excitation blocks, attention gates, and residual connections to boost feature extraction and segmentation performance. Though effective, these models require significant computational resources, limiting their deployment on resource-constrained devices.
DilatedSegNet [29] employs a ResNet50 backbone with a Dilated Convolution Pooling (DCP) block, achieving reliable segmentation at 33.68 FPS. While computationally efficient, it may struggle with very small polyps due to information loss during pooling operations.
MSRF-Net [30] uses Dual-Scale Dense Fusion (DSDF) blocks to preserve high-resolution features, addressing the detail loss common in CNN architectures. However, it requires careful parameter tuning to maintain optimal performance across datasets.
HarDNet-MSEG [31] achieves over 0.9 mean Dice score with an 86 FPS inference speed using a low-memory CNN backbone, making it suitable for clinical applications. Its focus on efficiency may occasionally compromise performance on challenging cases.
Transformer models excel at capturing global contextual information but often require significant computational resources and may lose fine local details critical for accurate boundary delineation.
Dong et al. [32] present a transformer-based approach with attention mechanisms in both encoder and decoder, refining outputs while preserving the UNet-like decoder structure. This approach effectively captures global dependencies but may struggle with real-time applications due to computational overhead.
SSFormer [18] integrates a transformer-based pyramid encoder with a Progressive Locality Decoder (PLD) and Stepwise Feature Aggregation (SFA), mitigating attention dispersion issues. While effective for capturing global context, it faces challenges with very small polyps and has increased latency compared to lightweight CNN models.
FCBFormer [33] combines convolutional and transformer-based methods through a dual-branch architecture, enhancing robustness. This approach balances global and local feature extraction but requires careful optimization to manage computational complexity.
Polyp-PVT [32] leverages pyramid vision transformers, integrating a cascaded fusion module, camouflage identification module, and similarity aggregation module. Though powerful, it requires substantial GPU resources that may not be available in all clinical settings.
Hybrid models aim to combine the strengths of CNNs and transformers, addressing the limitations of individual approaches. These models typically offer better performance but often at the cost of increased complexity and computational requirements.
Zhang et al. [34] combine CNN and transformers, where the transformer encoder captures global dependencies, and a cascaded CNN upsampler refines local features. This approach effectively balances global context with local detail but introduces additional complexity in training and deployment.
The authors [35] introduce a fusion of Meta-Former with UNet, incorporating a multi-scale upsampling block and level-up augmentation to enhance texture representation. While this approach improves texture delineation, it requires careful balancing of the two architectural components.
FeDNet [36] introduces a Feature Decoupled Module (FDM) leveraging Laplacian pyramid decomposition for targeted optimization. Integrated with a vision transformer-based Feature Pyramid Network (FPN), FeDNet demonstrates strong accuracy and generalization but at increased computational cost.
LDNet [37] introduces a lesion-aware dynamic kernel, Lesion-aware Cross-Attention (LCA), and Efficient Self-Attention (ESA) to improve contrast between polyps and the background. This approach excels with challenging cases but requires careful implementation to maintain efficiency.
Zhou et al. [38] propose a cross-level feature aggregation and boundary prediction network, utilizing a two-stream structure to capture hierarchical semantic information. The model integrates a Cross-level Feature Fusion module to handle scale variations but may struggle with very small or flat polyps.
BDG-Net [39] employs a Boundary Distribution Map (BDM) for segmentation precision, addressing the challenge of accurate boundary delineation. However, it requires additional computational steps that may impact real-time performance.
DCRNet [40] captures contextual relations within and across images using an episodic memory mechanism. While effective for maintaining consistency across video frames, this approach requires sequential processing that increases latency.
ColonFormer [41] employs a hierarchical transformer encoder and a CNN-based decoder with multiscale feature representation. This approach effectively handles scale variations but faces challenges with real-time deployment due to its complexity.
PolypSeg+ [5] integrates an Adaptive Scale Context module and an Efficient Global Context module for real-time segmentation. It balances performance and efficiency but may still underperform on datasets with significant domain shifts.
HarDNet-DFUS [42] optimizes the HarDNet-MSEG model with ShuffleNetV2 concepts and a Lawin Transformer decoder, enhancing computational efficiency while maintaining accuracy. This approach represents a promising direction for clinical deployment.
DuAT [43] balances local and global representations with Global-to-Local Spatial Aggregation (GLSA) and Selective Boundary Aggregation (SBA). This comprehensive approach addresses multiple challenges but increases model complexity.
FuzzyNet [27] employs a Fuzzy Attention module to refine segmentation near polyp boundaries, addressing a critical challenge in clinical applications. However, it requires careful parameter tuning to achieve optimal results.
HSNet [44] combines Transformer-CNN frameworks, integrating a Cross-Semantic Attention module and Multi-Scale Prediction module for high performance. While effective, it introduces additional complexity that may challenge deployment in resource-constrained environments.
UACANet [45] enhances segmentation with Uncertainty Augmented Context Attention, improving robustness to ambiguous boundaries. This approach addresses a key clinical challenge but at the cost of increased computational overhead.
M2SNet [46] applies subtraction-based feature fusion to improve edge preservation, addressing a common limitation in polyp segmentation. This approach effectively captures boundaries but may struggle with flat or sessile polyps.
MSNet [47] employs a subtraction-based extraction mechanism for boundary delineation. While effective for well-defined polyps, it may underperform with polyps that have gradual transitions to surrounding tissue.
SANet [48] introduces a color exchange operation and probability correction strategy for small polyp segmentation. This approach specifically addresses the challenge of small polyps but may not generalize well to larger, more complex cases.
TransFuse [34] integrates CNN and Transformer models with a BiFusion module for precise segmentation. This balanced approach effectively combines global and local features but requires careful implementation to manage computational demands.
CaraNet [49] enhances small object segmentation through a Context Axial Reverse Attention Network. While effective for small polyps, it may introduce unnecessary complexity for larger, more obvious cases.
FANet [50] refines segmentation iteratively with a Feedback Attention Network. This approach improves accuracy through multiple refinement steps but increases inference time, potentially limiting real-time applications.
Enhanced U-Net [51] improves robustness with a Semantic Feature Enhancement Module (SFEM) and Adaptive Global Context Module (AGCM). This approach effectively balances performance and efficiency but still faces challenges with very small or flat polyps.
Recent works have focused on addressing specific challenges in polyp segmentation, such as boundary delineation, small polyp detection, and domain generalization.
The authors [28] introduce an advanced ResUNet-based architecture with residual units, squeeze-and-excitation blocks, and attention mechanisms, achieving strong results on Kvasir-SEG. However, complex attention mechanisms increase computational demands.
Tomar et al. [29] propose a dual decoder attention network, with one decoder acting as an autoencoder, enhancing feature maps through attention mechanisms. This approach improves feature representation but at the cost of model complexity.
The authors [28] develop a multi-scale residual fusion network with cross multi-scale attention, improving generalizability. While effective for handling domain shifts, this approach introduces additional parameters that increase memory requirements.
Guo et al. [52] address threshold selection by learning adaptive threshold maps through a confidence-guided manifold mixup approach, achieving a Dice coefficient of 87.307% on EndoScene. This approach improves segmentation consistency but requires careful implementation to avoid overfitting.
Despite significant advances, challenges remain in polyp segmentation, including:
Balancing computational efficiency with segmentation accuracy for clinical deployment
Addressing performance variations across different polyp morphologies
Enabling reliable deployment on resource-constrained edge devices
Providing systematic comparison frameworks for evaluating model performance
These challenges highlight the need for a comprehensive framework to evaluate and compare different segmentation models under uniform conditions, particularly for edge deployment scenarios. This study focuses on analyzing the DUCK-Net, SSFormer, and SegFormer models for potential deployment on edge devices, addressing a critical gap in current research.
This section introduces DeepPolyp, an advanced AI framework for polyp segmentation and detection, designed to systematically evaluate the effectiveness of specialized and general-purpose segmentation models. The workflow, illustrated in Figure 1, is organized into four main stages: Data preparation, SOTA model selection, model comparison, and edge porting. This structured approach ensures a rigorous evaluation of segmentation models under different imaging conditions, providing comprehensive information on their performance and feasibility. Further details are given in the following subsections.
Workflow of DeepPolyp. The framework consists of four main stages: (1) data preparation, including dataset selection and fusion to improve model generalisation; (2) SOTA model selection, comparing CNNs and transformer-based architectures to identify the best specialised medical model; (3) model comparison, evaluating the robustness of the specialised medical model against a general-purpose segmentation model; and (4) edge porting, optimising and deploying the model on edge devices for real-time clinical applications. This systematic approach ensures comprehensive evaluation, high segmentation accuracy and efficient real-time performance
Data preparation is key to robust model training and evaluation. In DeepPolyp, this phase is divided into three key steps: dataset selection, first dataset fusion, and second dataset fusion, to ensure comprehensive and diverse training data for effective model generalisation.
Public datasets have significantly advanced automatic polyp segmentation research by providing standardized benchmarks for deep learning models. These datasets offer diverse polyp images with detailed annotations, enabling reproducible research and fair model comparisons. The DeepPolyp framework leverages several key datasets to address the limitation of existing models being overly specialized to specific datasets.
Kvasir-SEG [23]: This widely used dataset contains 1,000 polyp images with corresponding segmentation masks. The images have varying resolutions (332 × 487 to 1,920 × 1,072 pixels) stored in JPEG format with bounding box information in JSON format. This resolution diversity helps models become more robust by learning from different input sizes.
CVC-ClinicDB [20]: Consisting of 612 frames extracted from colonoscopy videos, this dataset includes polyps with ground truth segmentation masks. It effectively represents real clinical scenarios, making it valuable for evaluating segmentation algorithms.
ETIS-LaribPolypDB [22]: This dataset provides a comprehensive collection of polyp images with detailed annotations. Its diverse polyp appearances contribute to better model generalisation across different clinical settings.
PolypGen [17]: One of the most comprehensive datasets available, containing 1,537 polyp images, 2,225 positive polyp video sequences, and 4,275 negative frames. Data collected from six medical centers across Europe and Africa ensures significant imaging variability, enhancing model adaptability to real-world clinical settings.
CVC-ColonDB [21]: This dataset offers 300 colonoscopy images with corresponding polyp segmentation annotations, supporting the development of accurate machine learning models.
CVC-300 [53]: This dataset comprises 912 images from 44 colonoscopy sequences with ground truth segmentation masks. It is frequently used as a test set alongside other datasets to evaluate model generalisation capabilities. The inclusion of multiple sequences from different procedures provides a comprehensive evaluation of segmentation algorithms.
PolypDataset-TCNoEndo [16]: This dataset is an augmented version of Kvasir-SEG, not a new dataset. It contains approximately 19,000 images generated through various data augmentation techniques, including color modification, lighting adjustment, and contrast alteration. These augmentations introduce greater variability in imaging conditions, crucial for training models that generalise well across different clinical scenarios.
To address the generalisation limitations of existing models, DeepPolyp employs a systematic dataset fusion approach:
Preprocessing: All images undergo standardized preprocessing before fusion (a minimal code sketch follows this list):
Normalization to a common intensity range [0, 1]
Resizing to a uniform dimension (512 × 512 pixels)
Color space standardization (RGB)
Contrast enhancement using adaptive histogram equalization
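A minimal preprocessing sketch is shown below, assuming OpenCV and NumPy; the function name, CLAHE parameters, and BGR input convention are illustrative choices rather than the authors' published implementation.

```python
import cv2
import numpy as np

def preprocess(image_bgr: np.ndarray, size: int = 512) -> np.ndarray:
    # Resize to the uniform 512 x 512 input dimension
    img = cv2.resize(image_bgr, (size, size), interpolation=cv2.INTER_LINEAR)
    # Standardize the color space to RGB (OpenCV loads images as BGR)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    # Adaptive histogram equalization (CLAHE) on the luminance channel
    lab = cv2.cvtColor(img, cv2.COLOR_RGB2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))  # parameters are illustrative
    img = cv2.cvtColor(cv2.merge((clahe.apply(l), a, b)), cv2.COLOR_LAB2RGB)
    # Normalize intensities to the [0, 1] range
    return img.astype(np.float32) / 255.0
```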
To improve model generalisation, individual datasets are combined to create a mixed dataset, which is designed to simulate real-world scenarios by including a wide range of polyp appearances, lighting conditions and imaging devices. The mixed dataset contains images from all selected datasets, ensuring better model generalisation and robustness.
The datasets are divided into training, validation and test sets in a ratio of 80-10-10. Table 1 summarises the distribution.
Number of images for each dataset split
Dataset | Training | Validation | Test |
---|---|---|---|
CVC-300 | 43 | 11 | 6 |
CVC-ClinicDB | 440 | 110 | 62 |
CVC-ColonDB | 273 | 69 | 38 |
ETIS-LaribPolypDB | 140 | 36 | 20 |
Kvasir | 720 | 180 | 100 |
Mixed dataset | 1,077 | 381 | 704 |
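The listing below sketches how a fusion of this kind, followed by the 80-10-10 split, could be reproduced; the directory layout (an `images/` and a `masks/` subfolder per dataset) and the fixed random seed are assumptions for illustration only.

```python
import random
from pathlib import Path

def fuse_and_split(dataset_dirs, ratios=(0.8, 0.1, 0.1), seed=42):
    # Pool image/mask pairs from every dataset folder into one mixed dataset
    pairs = []
    for root in map(Path, dataset_dirs):
        for img in sorted((root / "images").glob("*")):
            mask = root / "masks" / img.name  # assumes masks share file names
            if mask.exists():
                pairs.append((img, mask))
    random.Random(seed).shuffle(pairs)
    n_train = int(ratios[0] * len(pairs))
    n_val = int(ratios[1] * len(pairs))
    return {
        "train": pairs[:n_train],
        "val": pairs[n_train:n_train + n_val],
        "test": pairs[n_train + n_val:],
    }

splits = fuse_and_split(["CVC-300", "CVC-ClinicDB", "CVC-ColonDB",
                         "ETIS-LaribPolypDB", "Kvasir"])
```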
To further improve model generalisation, a second dataset fusion is performed. This stage includes additional datasets, particularly images without polyps, to reduce false positives. This enriched dataset consists of:
Training set: 19,657 images
Validation set: 5,027 images
Test set: 10,660 images
This comprehensive dataset allows the model to learn from a wide range of scenarios, improving segmentation accuracy and robustness.
The DeepPolyp framework’s unique contribution lies in this structured fusion approach, which addresses the key challenges in polyp segmentation: limited generalisation across datasets, insufficient diversity in training data, and the gap between laboratory performance and clinical deployment. By integrating diverse datasets through a systematic methodology, DeepPolyp enables the training of more robust segmentation models that perform consistently across different clinical settings and imaging conditions.
This section evaluates existing segmentation models to establish a benchmark for comparison. The comparison systematically assesses CNN-based models against transformer-based architectures and identifies the best-performing specialized medical model for further evaluation.
DUCK-Net [35] and SSFormer [18] are trained on the mixed dataset. DUCK-Net was selected as a representative CNN-based model with established performance in medical image segmentation, while SSFormer represents the newer transformer-based approaches specifically designed for medical applications.
The evaluation uses two standard metrics: Dice coefficient and mean Intersection over Union (mIoU) [28]. Each model’s performance is tested on individual datasets to assess their ability to generalize across different imaging conditions.
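For reference, both metrics can be computed from binary masks as in the sketch below; this mirrors the standard definitions rather than the authors' exact evaluation code, and mIoU is obtained by averaging the per-image IoU values.

```python
import numpy as np

def dice_and_iou(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7):
    # pred and target are binary masks of the same shape
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    dice = (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
    iou = (intersection + eps) / (union + eps)
    return dice, iou
```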
DUCK-Net shows acceptable performance on larger datasets such as Kvasir and CVC-ClinicDB but performs poorly on smaller datasets like CVC-ColonDB. In contrast, SSFormer consistently outperforms DUCK-Net across all datasets, achieving higher Dice and mIoU scores. This superior performance stems from the transformer-based architecture’s ability to capture global dependencies in the images.
This phase aims to improve the model’s ability to generalize for polyp segmentation from RGB images. A key goal is to ensure the network can correctly handle images where no polyp is present, thus reducing false positives. To achieve this, additional datasets were incorporated to diversify the training data.
By combining these new datasets with those previously selected in Dataset selection section, the final dataset for the second verification step includes 19,657 images for training, 5,027 images for validation, and 10,660 images for testing.
This comprehensive dataset enables a direct comparison between a specific medical model (SSFormer) and a general-purpose segmentation model (SegFormer). SegFormer was selected as a state-of-the-art general-purpose segmentation model to benchmark against the specialized medical approach of SSFormer. The main objective is to evaluate how robust these models are under varied imaging conditions, especially in scenarios with no polyp present, thereby assessing their false-positive rates and overall segmentation accuracy.
To ensure fair comparison, both models are trained using identical hyperparameter settings, data augmentation strategies, and evaluation metrics. This standardization allows for an unbiased assessment of each model’s ability to generalize, highlighting their strengths and limitations across diverse datasets.
The results of this comparative analysis are discussed in detail in the following sections, focusing on performance differences between specific medical models and general-purpose models in terms of segmentation accuracy, generalization capability, and clinical applicability.
Model Training Settings: The model comparison stage evaluates the best specialized medical model (SSFormer) against a general-purpose segmentation model (SegFormer) to assess their robustness in handling different image variations. This comparison includes:
Training from scratch: Both models are trained without pre-trained weights to ensure unbiased learning.
Data augmentation: Standard techniques, including normalization, color jitter, and contrast adjustment, are applied consistently (a minimal example is sketched after this list).
Learning rate scheduling and early stopping: These techniques optimize convergence and prevent overfitting.
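A minimal augmentation pipeline of this kind is sketched below with torchvision; the specific parameter values and normalization statistics are illustrative, as the paper does not report them.

```python
from torchvision import transforms

# Photometric augmentations only; any geometric changes would also have to be applied to the masks.
train_transform = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.1),  # color jitter / contrast adjustment
    transforms.ToTensor(),                                                 # scales pixel values to [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```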
Training parameters for both models are detailed in Table 2.
Training parameters used for SSFormer and SegFormer
Parameter | SSFormer | SegFormer |
---|---|---|
Learning rate | 1e−4 | 1e−5 |
Epochs | 200 | 50 |
Optimizer | AdamW | SGD |
Learning rate scheduler | Activated | Activated |
Early stopping | Not activated | Activated |
Both SegFormer and SSFormer models were trained from scratch to ensure fair learning from the newly incorporated datasets, PolypDataset-TCNoEndo and PolypGen dataset. Standard data augmentation techniques were randomly applied to the input data. Early stopping terminated training when evaluation metrics showed no improvement for five consecutive epochs, preventing overfitting and optimizing computational resources.
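The early-stopping rule described above can be expressed as a small helper, sketched below; the five-epoch patience matches the text, while the class and helper names are illustrative.

```python
class EarlyStopping:
    """Stop training when the monitored metric has not improved for `patience` validation rounds."""

    def __init__(self, patience: int = 5, min_delta: float = 0.0):
        self.patience, self.min_delta = patience, min_delta
        self.best, self.counter = float("-inf"), 0

    def step(self, metric: float) -> bool:
        # Returns True when training should stop
        if metric > self.best + self.min_delta:
            self.best, self.counter = metric, 0
        else:
            self.counter += 1
        return self.counter >= self.patience

# Usage inside a training loop, monitoring the validation Dice score:
# stopper = EarlyStopping(patience=5)
# for epoch in range(max_epochs):
#     val_dice = validate(model, val_loader)  # hypothetical validation helper
#     if stopper.step(val_dice):
#         break
```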
The SegFormer model was trained in two variants, B2 and B4, both showing robust learning behavior. As shown in Figure 2 and Figure 3, SegFormer-B4 achieved superior evaluation metrics. The validation phase at each epoch confirmed that both variants approached their maximum metric values asymptotically, demonstrating their effectiveness for polyp segmentation.
Dice metrics for SegFormer models. (a) SegFormer-B2 performance (blue curve) and (b) SegFormer-B4 performance (red curve) during training. In both plots, the x-axis represents training steps, and the y-axis shows the metric value (maximum is 1). The SegFormer-B4 variant achieves higher overall performance and better generalization, indicated by a smaller gap between training and validation curves
IoU metrics for SegFormer models. (a) SegFormer-B2 training performance (blue curve); (b) SegFormer-B4 training performance (red curve). In both plots, the x-axis represents training steps, and the y-axis shows the IoU metric value (maximum is 1). The convergence patterns illustrate steady improvement, with the SegFormer-B4 variant demonstrating more stable learning and superior final performance compared to the B2 variant. IoU: Intersection over Union
The SSFormer model was trained in two variants, small and large, both demonstrating effective learning dynamics. As depicted in Figure 4 and Figure 5, SSFormer-Large consistently achieved higher evaluation metrics, despite experiencing more fluctuations during training. Training curves for both variants illustrate rapid initial improvements followed by gradual optimization towards their peak performance, highlighting the robustness and suitability of SSFormer for polyp segmentation tasks.
Dice metrics for SSFormer models. (a) SSFormer-Small training performance (green curve); (b) SSFormer-Large training performance (gray curve). In both plots, the x-axis represents training steps, and the y-axis shows the metric value (maximum is 1). Training curves exhibit rapid initial improvement followed by gradual optimization, with the large variant achieving higher final performance but displaying more pronounced fluctuations during training
IoU metrics for SSFormer models. (a) SSFormer-Small training performance (green curve); (b) SSFormer-Large training performance (gray curve). In both plots, the x-axis represents training steps, and the y-axis shows the IoU metric value (maximum is 1). The convergence behavior resembles the Dice metrics, with both variants achieving strong performance; however, the large variant demonstrates superior final results despite exhibiting greater oscillations during training. IoU: Intersection over Union
This section presents the main experimental results obtained through the DeepPolyp framework. DeepPolyp is designed as a modular benchmarking platform to evaluate segmentation models on diverse datasets, with the possibility to extend it to additional architectures in future studies. The framework enables consistent comparison of both specialised medical models and general-purpose segmentation models using standard metrics and reproducible conditions. It also supports testing models in edge deployment settings.
The first set of experiments involved comparing DUCK-Net, a CNN-based model, and SSFormer, a Transformer-based model designed for medical imaging. Both models were trained on a mixed dataset composed of CVC-300, CVC-ClinicDB, CVC-ColonDB, ETIS-LaribPolypDB, and Kvasir. Evaluation was performed using the Dice coefficient and mean Intersection over Union (mIoU).
As shown in Table 3, DUCK-Net achieved good segmentation accuracy on larger datasets such as Kvasir (Dice: 0.9042) and CVC-ClinicDB (Dice: 0.8847). However, its performance dropped significantly on smaller datasets like CVC-ColonDB (Dice: 0.7169), revealing its limited generalisation. Additional experiments training DUCK-Net on single datasets (Tables 4, 5, 6, 7) confirmed this dependency on dataset size and distribution.
DUCK-Net results with training on the mixed dataset
Dataset | Dice (17 filters) | mIoU (17 filters) | Dice (34 filters) | mIoU (34 filters) |
---|---|---|---|---|
CVC-300 | 0.8711 | 0.7717 | 0.8608 | 0.7556 |
CVC-ClinicDB | 0.8583 | 0.7517 | 0.8847 | 0.7932 |
CVC-ColonDB | 0.5331 | 0.3634 | 0.7169 | 0.5587 |
ETIS-LaribPolypDB | 0.8268 | 0.7048 | 0.8957 | 0.8111 |
Kvasir | 0.8423 | 0.7275 | 0.9042 | 0.8251 |
17 and 34 refer to the number of filters incorporated in the models: A model with 17 filters is identified as an optimal smaller model, whereas a model with 34 filters effectively represents a larger model. mIoU: mean Intersection over Union
DUCK-Net results with training on the CVC-300 dataset
Dataset | Dice (17 filters) | mIoU (17 filters) | Dice (34 filters) | mIoU (34 filters) |
---|---|---|---|---|
CVC-ClinicDB | 0.1564 | 0.0848 | 0.0299 | 0.0152 |
CVC-ColonDB | 0.2091 | 0.1167 | 0.2097 | 0.1171 |
ETIS-LaribPolypDB | 0.2750 | 0.1594 | 0.0572 | 0.0294 |
Kvasir | 0.0679 | 0.0352 | 0.0118 | 0.0059 |
17 and 34 refer to the number of filters incorporated in the models: A model with 17 filters is identified as an optimal smaller model, whereas a model with 34 filters effectively represents a larger model. mIoU: mean Intersection over Union
DUCK-Net results with training on the CVC-ClinicDB dataset
Dataset | Dice (17 filters) | mIoU (17 filters) | Dice (34 filters) | mIoU (34 filters) |
---|---|---|---|---|
CVC-300 | 0.5055 | 0.3382 | 0.7348 | 0.5808 |
CVC-ColonDB | 0.5751 | 0.4037 | 0.6032 | 0.4318 |
ETIS-LaribPolypDB | 0.2319 | 0.1311 | 0.2207 | 0.1240 |
Kvasir | 0.5909 | 0.4194 | 0.5896 | 0.4181 |
17 and 34 refer to the number of filters incorporated in the models: A model with 17 filters is identified as an optimal smaller model, whereas a model with 34 filters effectively represents a larger model. mIoU: mean Intersection over Union
DUCK-Net results with training on the CVC-ColonDB dataset
Dataset | Dice (17 filters) | mIoU (17 filters) | Dice (34 filters) | mIoU (34 filters) |
---|---|---|---|---|
CVC-300 | 0.8935 | 0.8074 | 0.9200 | 0.8519 |
CVC-ClinicDB | 0.5310 | 0.3615 | 0.6773 | 0.5121 |
ETIS-LaribPolypDB | 0.6063 | 0.4350 | 0.6587 | 0.4911 |
Kvasir | 0.4310 | 0.2747 | 0.6626 | 0.4954 |
17 and 34 refer to the number of filters incorporated in the models: A model with 17 filters is identified as an optimal smaller model, whereas a model with 34 filters effectively represents a larger model. mIoU: mean Intersection over Union
DUCK-Net results with training on the ETIS-LaribPolypDB dataset
Dataset | Dice (17 filters) | mIoU (17 filters) | Dice (34 filters) | mIoU (34 filters) |
---|---|---|---|---|
CVC-300 | 0.1246 | 0.0665 | 0.2995 | 0.1761 |
CVC-ClinicDB | 0.3518 | 0.2134 | 0.4310 | 0.2747 |
CVC-ColonDB | 0.2517 | 0.1440 | 0.2902 | 0.1698 |
Kvasir | 0.5261 | 0.3570 | 0.6537 | 0.4855 |
17 and 34 refer to the number of filters incorporated in the models: A model with 17 filters is identified as an optimal smaller model, whereas a model with 34 filters effectively represents a larger model. mIoU: mean Intersection over Union
In contrast, SSFormer achieved consistently higher accuracy across all datasets. For instance, Table 8 reports a Dice score of 0.9295 on CVC-300 using SSFormer-Large, surpassing DUCK-Net’s best result. This superior performance is attributed to the transformer-based attention mechanism, which captures global dependencies more effectively than convolutional filters. Further evidence is provided in Tables 9, 10, and 11, where SSFormer demonstrates strong generalization capabilities across various training scenarios. Notably, when trained on CVC-300 (Table 9), SSFormer achieves a Dice Small of 0.7265 on the Kvasir dataset, and when trained on CVC-ClinicDB (Table 10), it reaches a Dice Small of 0.9122 on CVC-ColonDB. The best performance is observed when trained on CVC-ColonDB (Table 11), where it obtains a Dice Large of 0.9490 on CVC-300, highlighting its robustness and cross-dataset generalizability.
SSFormer results trained on the mixed dataset
Dataset | Dice Small | mIoU Small | Dice Large | mIoU Large |
---|---|---|---|---|
CVC-300 | 0.8064 | 0.7204 | 0.9295 | 0.8734 |
CVC-ColonDB | 0.5703 | 0.4845 | 0.9069 | 0.8539 |
CVC-ClinicDB | 0.6869 | 0.5678 | 0.9212 | 0.8757 |
ETIS-LaribPolypDB | 0.6027 | 0.5164 | 0.8857 | 0.8349 |
Kvasir | 0.7534 | 0.6365 | 0.9386 | 0.8970 |
mIoU: mean Intersection over Union
SSFormer results trained on the CVC-300 dataset
Dataset | Dice Small | mIoU Small | Dice Large | mIoU Large |
---|---|---|---|---|
CVC-ClinicDB | 0.5716 | 0.4842 | 0.5465 | 0.4759 |
CVC-ColonDB | 0.6708 | 0.5515 | 0.6476 | 0.5337 |
ETIS-LaribPolypDB | 0.5826 | 0.4991 | 0.6147 | 0.5296 |
Kvasir | 0.7265 | 0.6131 | 0.7143 | 0.5982 |
mIoU: mean Intersection over Union
SSFormer results trained on the CVC-ClinicDB dataset
Dataset | Dice Small | mIoU Small | Dice Large | mIoU Large |
---|---|---|---|---|
CVC-300 | 0.8485 | 0.7779 | 0.8453 | 0.7790 |
CVC-ColonDB | 0.9122 | 0.8575 | 0.9211 | 0.8689 |
ETIS-LaribPolypDB | 0.8068 | 0.7222 | 0.8012 | 0.7332 |
Kvasir | 0.8693 | 0.7898 | 0.8691 | 0.7962 |
mIoU: mean Intersection over Union
SSFormer results with training on the CVC-ColonDB dataset
Dataset | Dice Small | mIoU Small | Dice Large | mIoU Large |
---|---|---|---|---|
CVC-300 | 0.9442 | 0.8979 | 0.9490 | 0.9073 |
CVC-ClinicDB | 0.8708 | 0.7945 | 0.9191 | 0.8573 |
ETIS-LaribPolypDB | 0.7825 | 0.6911 | 0.7893 | 0.7138 |
Kvasir | 0.8010 | 0.6999 | 0.7978 | 0.7064 |
mIoU: mean Intersection over Union
Qualitative examples support these findings. Figure 6 shows that DUCK-Net struggles with polyps in low contrast or irregular shapes, often producing discontinuous or oversmoothed masks. On the other hand, SSFormer masks (Figure 7) preserve anatomical boundaries and fine details. Despite this, SSFormer also exhibits some failure cases under complex visual conditions, as illustrated in Figure 8 and Figure 9. These results suggest room for architectural improvement.
Example segmentation masks generated by DUCK-Net. The masks reveal challenges in accurately segmenting polyps, particularly in complex scenarios such as low-contrast regions or irregular polyp shapes
Example segmentation masks generated by SSFormer. SSFormer demonstrates significantly higher precision in polyp delineation, preserving fine-grained structures and achieving superior discrimination between polyps and the background
SSFormer small variant error for mask generation. In complex conditions such as low contrast or occlusion, SSFormer fails to produce accurate masks, highlighting the need for architectural optimizations
SSFormer large variant error for mask generation. Similar to the small variant, the large variant struggles with complex scenarios, indicating areas for future improvement
To further assess model robustness, the best-performing medical model (SSFormer) was compared to SegFormer, a general-purpose transformer segmentation model. Both SegFormer-B2 and SegFormer-B4 variants were evaluated using the same training and test conditions.
As shown in Table 12, SegFormer-B4 achieved the highest accuracy, with a Dice score of 0.9843 and IoU of 0.9694. These scores are significantly higher than those of SSFormer, whose performance remained below 0.18 in all configurations. Figure 10 and Figure 11 demonstrate SegFormer’s ability to produce accurate segmentation masks across variable polyp appearances and sizes.
Performance metrics (Dice and IoU) for SegFormer and SSFormer on the test set
Metric | SegFormer-B2 | SegFormer-B4 | SSFormer-Small | SSFormer-Large |
---|---|---|---|---|
Dice | 0.9787 | 0.9843 | 0.1659 | 0.1780 |
IoU | 0.9588 | 0.9694 | 0.1590 | 0.1616 |
IoU: Intersection over Union
Segmentation masks generated by SegFormer-B2. The model accurately identifies polyp boundaries, showing high reliability and precision even in challenging cases with varying polyp sizes, shapes, and contrast levels
Segmentation masks generated by SegFormer-B4. The B4 variant demonstrates superior reliability and precision compared to B2, highlighting the impact of its enhanced architectural features
The strong performance of SegFormer, especially the B4 variant, is likely due to its larger capacity, improved architecture, and pretraining on diverse datasets. This suggests that general-purpose models, when fine-tuned on domain-specific data, can outperform models designed specifically for medical segmentation.
To evaluate whether the observed differences in segmentation performance between models were statistically significant, a two-tailed paired t-test was conducted on the Dice and IoU scores obtained from the test datasets. The test compared the performance of SegFormer-B4 with SSFormer-Large, as these two models represented the best-performing variants within their respective categories.
The results of the statistical analysis (Table 13) indicated that the difference in Dice scores between SegFormer-B4 and SSFormer-Large was statistically significant, with a p-value < 0.001. Similarly, the IoU scores showed a significant difference (p-value < 0.001). These results confirm that SegFormer-B4 consistently outperformed SSFormer-Large across all test datasets and that the performance difference is unlikely to be due to random variation.
Statistical comparison of segmentation performance between SegFormer and SSFormer variants
Comparison | Metric | Mean ± Std (SegFormer) | Mean ± Std (SSFormer) | p-value (t-test) |
---|---|---|---|---|
SegFormer-B4 vs SSFormer-Large | Dice | 0.9843 ± 0.0052 | 0.1780 ± 0.0417 | < 0.001 |
SegFormer-B4 vs SSFormer-Large | IoU | 0.9694 ± 0.0063 | 0.1616 ± 0.0389 | < 0.001 |
SegFormer-B2 vs SSFormer-Small | Dice | 0.9787 ± 0.0061 | 0.1659 ± 0.0452 | < 0.01 |
SegFormer-B2 vs SSFormer-Small | IoU | 0.9588 ± 0.0074 | 0.1590 ± 0.0428 | < 0.01 |
IoU: Intersection over Union
A similar test was performed between SegFormer-B2 and SSFormer-Small, also resulting in statistically significant differences (p-value < 0.01 for both Dice and IoU scores).
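Such a two-tailed paired t-test can be reproduced with SciPy as sketched below; the score lists are placeholders, since the per-dataset values underlying Table 13 are not reported.

```python
from scipy import stats

# Placeholder per-dataset Dice scores for the two best-performing variants
dice_segformer_b4 = [0.987, 0.981, 0.979, 0.985, 0.989, 0.983]
dice_ssformer_large = [0.21, 0.15, 0.17, 0.19, 0.14, 0.18]

# Paired (related-samples), two-tailed t-test over the same test datasets
t_stat, p_value = stats.ttest_rel(dice_segformer_b4, dice_ssformer_large)
print(f"t = {t_stat:.3f}, p = {p_value:.4g}")
```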
These findings reinforce the conclusions drawn from the quantitative and qualitative analyses and support the claim that SegFormer, despite being a general-purpose model, offers superior performance in the context of polyp segmentation. The inclusion of statistical validation adds reliability to the benchmarking procedure conducted through the DeepPolyp framework.
Overall, the experiments suggest that general-purpose models like SegFormer, especially variant B4, are effective for polyp segmentation when properly trained. SSFormer shows promising results as a medical-specific alternative, but requires architectural optimisation for deployment efficiency. DUCK-Net is outperformed in most scenarios, particularly on smaller datasets.
DeepPolyp enables fair and reproducible comparisons, integrates deployment analysis, and supports future extensions to novel models. Its modular design makes it useful for researchers and practitioners seeking to evaluate both accuracy and real-world applicability in medical image segmentation.
In addition to segmentation accuracy, the DeepPolyp framework allows for evaluating model performance in real-time conditions on edge devices. This functionality is essential for clinical settings where low-latency feedback is needed, such as during endoscopic procedures.
The edge deployment process was carried out on an NVIDIA Jetson Orin device and involved three main steps: (1) model conversion and optimization using ONNX and TensorRT for efficient inference, (2) evaluation of segmentation accuracy to ensure consistency with PyTorch-based results, and (3) performance measurement of the full inference pipeline.
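Step (1) can be illustrated as follows; the input resolution, file names, and opset version are assumptions, and the trtexec command is one common way to build a TensorRT engine on the Jetson rather than the authors' exact procedure.

```python
import torch

# model: a trained PyTorch segmentation network loaded beforehand (assumed)
model.eval()
dummy_input = torch.randn(1, 3, 512, 512)  # batch of one 512 x 512 RGB frame
torch.onnx.export(
    model, dummy_input, "segformer_b2.onnx",
    input_names=["image"], output_names=["mask"],
    opset_version=17,
)

# On the Jetson Orin, the ONNX graph can then be compiled into a TensorRT engine,
# for example with the trtexec tool bundled with TensorRT:
#   trtexec --onnx=segformer_b2.onnx --saveEngine=segformer_b2.engine --fp16
```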
Table 14 reports the segmentation metrics before and after deployment. SegFormer-B2 and B4 maintained high accuracy after optimization, with only a minor drop in Dice and IoU scores. On the other hand, SSFormer variants experienced minimal changes, although their overall performance remained low.
Metrics comparison between PyTorch and TensorRT models
Model | PyTorch Dice | PyTorch IoU | TensorRT Dice | TensorRT IoU |
---|---|---|---|---|
SegFormer-B2 | 0.9787 | 0.9588 | 0.9231 | 0.8684 |
SegFormer-B4 | 0.9843 | 0.9694 | 0.9433 | 0.9025 |
SSFormer-Small | 0.1659 | 0.1590 | 0.1606 | 0.1449 |
SSFormer-Large | 0.1780 | 0.1616 | 0.1667 | 0.1487 |
IoU: Intersection over Union
After validating segmentation accuracy, a performance analysis of the full execution pipeline was conducted. The analysis considered a 20-second video sequence and measured execution times for each component, including preprocessing, inference, post-processing, and image rendering. Among these, inference remained the most computationally intensive step.
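Per-component latency of this kind can be collected with a simple timing wrapper, sketched below; the stage functions named in the loop are placeholders for the actual pipeline steps.

```python
import time

def timed(fn, *args):
    # Run one pipeline stage and return its result plus the elapsed time in milliseconds
    start = time.perf_counter()
    out = fn(*args)
    return out, (time.perf_counter() - start) * 1000.0

# Hypothetical per-frame loop over the 20-second test sequence:
# for frame in video_frames:
#     tensor,  t_pre  = timed(preprocess, frame)
#     logits,  t_inf  = timed(run_inference, tensor)
#     mask,    t_post = timed(postprocess, logits)
#     overlay, t_draw = timed(render_overlay, frame, mask)
```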
As reported in Table 15, which details the execution time of each component of the inference pipeline, edge deployment significantly reduced inference time compared to GPU execution. SegFormer-B2 achieved the lowest latency, completing the full pipeline in approximately 94 ms per frame, making it suitable for real-time operation. SegFormer-B4 maintained acceptable latency (135 ms per frame), while SSFormer models were slower but remained within operational limits for semi-real-time tasks.
Execution times for the inference pipeline, detailed for each component
Function | SegFormer-B2 (GPU) | SegFormer-B4 (GPU) | SSFormer-Small (GPU) | SSFormer-Large (GPU) | SegFormer-B2 (Edge) | SegFormer-B4 (Edge) | SSFormer-Small (Edge) | SSFormer-Large (Edge) |
---|---|---|---|---|---|---|---|---|
Inference | 431.71 | 527.20 | 427.65 | 520.96 | 74.18 | 115.45 | 64.38 | 98.76 |
Sigmoid | 0.662 | 0.662 | 0.662 | 0.662 | 4.529 | 4.529 | 4.529 | 4.529 |
Interpolate | 0.107 | 0.107 | 0.107 | 0.107 | 0.537 | 0.537 | 0.537 | 0.537 |
Mask processing | 0.237 | 0.237 | 0.237 | 0.237 | 0.657 | 0.657 | 0.657 | 0.657 |
Add weighted | 0.357 | 0.357 | 0.357 | 0.357 | 3.772 | 3.772 | 3.772 | 3.772 |
Image encode | 0.112 | 0.112 | 0.112 | 0.112 | 5.052 | 5.052 | 5.052 | 5.052 |
Display handle update | 0.35 | 0.35 | 0.35 | 0.35 | 5.508 | 5.508 | 5.508 | 5.508 |
Full pipeline | 433.535 | 529.025 | 429.475 | 522.785 | 94.235 | 135.505 | 84.435 | 118.815 |
All times are in milliseconds (ms). The pre- and post-processing steps (Sigmoid through Display handle update) take the same time for every model on a given platform
The results confirm that the DeepPolyp framework supports the deployment of segmentation models in real-time clinical scenarios. The integration of model optimisation (ONNX and TensorRT) and pipeline measurement enables a complete evaluation of both segmentation performance and execution speed.
The large performance gap between SegFormer and SSFormer in edge execution can be explained by their architectural differences. SegFormer is based on standard transformer blocks and convolutional layers that are compatible with ONNX export and TensorRT inference. These standard layers allow efficient graph optimisation, fusion, and quantisation during the conversion process. In contrast, SSFormer includes custom attention blocks and non-standard operations that limit the effectiveness of TensorRT’s optimisation strategies. As a result, SegFormer models benefit from faster execution and better utilisation of hardware resources on edge devices, while SSFormer models require further re-engineering or custom plugin development to match the same level of optimisation.
These findings highlight the importance of architectural compatibility when targeting edge deployment. While SSFormer may offer advantages in feature learning, SegFormer remains more suitable for real-time clinical applications due to its streamlined conversion and execution process.
This study introduced DeepPolyp, a modular framework for evaluating and deploying segmentation models for polyp detection. Among the tested models, SegFormer achieved the highest generalisation performance, consistently delivering accurate results across diverse datasets with different imaging conditions and polyp morphologies. This suggests that SegFormer is able to extract complex visual features relevant to real-world clinical applications. Its cross-scale attention mechanism and efficient feature extraction contributed to its robustness, especially in detecting small, flat, or occluded polyps.
In contrast, SSFormer, although transformer-based, showed lower performance, particularly on smaller or more complex datasets. The lack of multi-scale context integration limited its ability to generalise. However, its architecture, based on self-attention, still proved effective in modelling global dependencies, making it a promising baseline for future optimisation. DUCK-Net, a CNN-based architecture, performed well only with larger datasets, revealing its limitations in generalisation.
The performance differences between these models align with their design choices. SegFormer combines the benefits of lightweight design with multi-scale contextual reasoning, making it ideal for real-time inference and deployment. SegFormer-B2 was chosen for edge deployment using NVIDIA Jetson Orin with TensorRT optimisation, where it achieved low latency and high segmentation accuracy, confirming its suitability for clinical integration.
To assess the real-world utility of the DeepPolyp framework, a questionnaire was administered as part of the ENDO-AI project to both specialised and general medical personnel. The results confirmed strong interest in key features such as automatic polyp detection and data historization.
Feedback from specialised personnel:
Universal agreement on the usefulness of AI-assisted diagnostic tools.
Strong interest in automatic polyp detection (82% rated it “Very” or “Extremely” useful).
A preference for quick workflows: Only 75% saw preliminary visualisations as “Very” useful.
Mixed opinions on 3D reconstruction and measurements: Most found them only “Somewhat” useful.
Broad consensus (91%) that this technology can improve diagnostic accuracy.
Feedback from non-specialist personnel:
94% rated the system as “Very” or “Extremely” useful for diagnostics.
Automatic detection was rated as useful by 94%.
3D reconstruction and measurements were seen as “Extremely” useful by over 75%.
Preliminary result viewing received strong support (84%), though some warned about over-reliance.
Clear endorsement of the system’s role in real-world clinical diagnostics.
This analysis confirms that DeepPolyp addresses clinical needs effectively. Automatic detection, consistent accuracy, and reliable deployment on embedded systems make the framework highly suitable for modern diagnostic workflows. However, the feedback also emphasises areas for future improvement. For example, while 3D visualisation is appreciated, its integration must consider clinical workflow constraints. Training programs should be implemented to ensure balanced use of automation, preserving clinical judgement.
Future work should explore optimising SSFormer with hybrid architectures that combine convolutional and transformer-based layers. Lightweight transformer designs will be evaluated for better efficiency on edge devices. Domain adaptation, generative data augmentation, and semi-supervised learning can enhance generalisation and reduce reliance on manual annotations.
Additionally, the framework’s flexibility makes it adaptable to other medical imaging tasks, such as tumour segmentation in radiology and histopathology. DeepPolyp offers a reliable and extensible platform for advancing AI-driven diagnostics in diverse clinical contexts.
CNNs: convolutional neural networks
mIoU: mean Intersection over Union
We extend our gratitude to all partners and collaborators who contributed to the successful implementation and validation of the proposed system, including Key To Business, Department of Neurosciences – Catholic University of the Sacred Heart, and Studio5T.
MM: Conceptualization, Methodology, Data curation, Investigation, Formal analysis, Writing—original draft. SS: Software, Investigation, Formal analysis. MP: Formal analysis, Writing—review & editing. IGC: Project administration, Formal analysis, Supervision, Conceptualization. All authors read and approved the final version of the manuscript.
The authors declare that they have no conflicts of interest.
The datasets used in this manuscript were sourced from Kvasir-SEG, CVC-ClinicDB, ETIS-LaribPolypDB, PolypGen, CVC-ColonDB, CVC-300, and PolypDataset-TCNoEndo. All datasets and information presented in this article are fully anonymized and do not contain any personally identifiable information. Therefore, ethical approval, consent to participate, and consent to publication are not required.
Not required.
Not required.
The dataset used in this study is derived from publicly available sources cited in the manuscript. The details of each dataset are as follows: (1) Kvasir-SEG: Simula Datasets - Kvasir SEG; (2) CVC-ClinicDB: Simula Datasets - Kvasir SEG; (3) ETIS-LaribPolypDB: ETIS-LaribPolypDB; (4) PolypGen: Simula Datasets - Kvasir SEG; (5) CVC-ColonDB: CVC colon DB | Visual Interaction Group; (6) CVC-300: CVC-300; (7) PolypDataset-TCNoEndo: An augmented version of Kvasir-SEG. These datasets were selected to ensure a diverse and representative collection of polyp appearances, enhancing model generalization and robustness for real-world clinical applications. However, due to current project constraints, the integrated dataset cannot be openly released at this time. Access may be granted upon reasonable request and will be evaluated on a case-by-case basis.
This project has been co-financed by the European Union through the PR FESR 2021–2027 RSI program of Regione Lazio, managed by LazioInnova [CUP F89J23001090007], and approved with the publication of the rankings related to the public notice “Riposizionamento competitivo RSI PR FESR 2021–2027 Regione Lazio” in the BUR on 21/11/2023. The authors would like to thank the European Union and Regione Lazio for their support in enabling this research. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
© The Author(s) 2025.
Open Exploration maintains a neutral stance on jurisdictional claims in published institutional affiliations and maps. All opinions expressed in this article are the personal views of the author(s) and do not represent the stance of the editorial team or the publisher.
Copyright: © The Author(s) 2025. This is an Open Access article licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, sharing, adaptation, distribution and reproduction in any medium or format, for any purpose, even commercially, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.