Vision Transformers vs. Convolutional Networks for Colonoscopic Polyp Segmentation: A Systematic Review

Lobar Badalova Burhonovna
Yusupov Ozod Rabbimovich

Abstract

Colorectal cancer (CRC) is the second most common cause of cancer-related death worldwide, and early detection of adenomatous polyps during colonoscopy is critical because it can prevent CRC. Nonetheless, conventional colonoscopy remains operator-dependent, with miss rates of up to 25% for small or flat lesions. Over the past decade, deep learning has substantially advanced automated polyp detection and segmentation. Convolutional neural networks (CNNs) such as U-Net, UNet++, and ResUNet++ set strong baselines, but their limited receptive fields hindered generalization. More recently, vision transformers (ViTs) and hybrid CNN–transformer architectures have achieved state-of-the-art results by modeling long-range dependencies and incorporating global context. This systematic review examines more than 50 representative studies published between 2015 and 2025, with specific attention to results on the Kvasir-SEG, CVC-ClinicDB, and ETIS-Larib benchmarks. The reviewed experiments show that while CNN-based models reach Dice scores of 0.85–0.90 on easier datasets, ViTs and hybrids consistently outperform their CNN counterparts, with the best models reaching 0.94 (NA-SegFormer) on Kvasir-SEG and 0.81 on ETIS-Larib. Our findings illustrate the disruptive potential of attention-based approaches and their closer alignment with clinical needs, while signaling stubborn open challenges relating to dataset availability, practical computational cost, and high-quality clinical evidence. Future advances will likely center on lightweight, interpretable, and generalisable AI systems designed to provide real-time, clinically reliable polyp detection and segmentation.
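
The Dice scores reported in this review quantify the overlap between a predicted polyp mask and the expert-annotated ground truth. As a point of reference, the sketch below shows how the Dice coefficient and the closely related intersection-over-union (IoU) are typically computed for binary segmentation masks. It is a minimal NumPy illustration written for this review, not code from any of the surveyed papers; the smoothing constant eps is an added assumption that keeps the metrics defined on empty masks.

import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    # Dice = 2 * |P intersect G| / (|P| + |G|) for binary masks.
    # eps (an assumption, not taken from any reviewed paper) keeps the
    # ratio defined when both prediction and ground truth are empty.
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

def iou(pred, target, eps=1e-7):
    # Intersection over Union (Jaccard index), often reported alongside Dice.
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return (intersection + eps) / (union + eps)

# Toy check: a perfect prediction yields Dice = IoU = 1.0.
mask = np.zeros((256, 256), dtype=np.uint8)
mask[100:150, 100:150] = 1
print(dice_coefficient(mask, mask))  # ~1.0
print(iou(mask, mask))               # ~1.0

Read this way, a Dice of 0.94 on Kvasir-SEG means the predicted and annotated masks agree on the overwhelming majority of polyp pixels.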

How to Cite

Vision Transformers vs. Convolutional Networks for Colonoscopic Polyp Segmentation: A Systematic Review. (2025). Innovative: International Multidisciplinary Journal of Applied Technology (2995-486X), 3(12), 7-23. https://doi.org/10.51699/j5tb2v47

References

[1] R. L. Siegel, N. S. Wagle, A. Cercek, R. A. Smith, and A. Jemal, “Colorectal cancer statistics, 2023,” CA: A Cancer Journal for Clinicians, vol. 73, no. 3, pp. 233–254, 2023, doi: 10.3322/caac.21772.

[2] J. Bernal, et al., “Comparative validation of polyp detection methods in colonoscopy: Results from the MICCAI 2015 challenge,” Medical Image Analysis, vol. 31, pp. 1–13, 2016.

[3] J. Bernal, F. J. Sánchez, G. Fernández-Esparrach, D. Gil, C. Rodríguez, and F. Vilariño, “WM-DOVA Maps for Accurate Polyp Highlighting in Colonoscopy: Validation vs. Saliency Maps from Physicians,” Computerized Medical Imaging and Graphics, vol. 43, pp. 99–111, 2015. (CVC-ClinicDB dataset)

[4] J. Silva, A. Histace, O. Romain, X. Dray, and B. Granado, “Toward embedded detection of polyps in WCE images for early diagnosis of colorectal cancer,” Int. J. Comput. Assist. Radiol. Surg., vol. 9, no. 2, pp. 283–293, 2014. (ETIS-Larib Polyp DB)

[5] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Proc. MICCAI, 2015, pp. 234–241.

[6] D. Jha, P. H. Smedsrud, M. A. Riegler, D. Johansen, T. de Lange, P. Halvorsen, and H. D. Johansen, “ResUNet++: An advanced architecture for medical image segmentation,” in Proc. IEEE Int. Symp. Multimedia (ISM), 2019, pp. 225–230.

[7] Z. Zhou, M. M. Siddiquee, N. Tajbakhsh, and J. Liang, “UNet++: A nested U-Net architecture for medical image segmentation,” in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support (MICCAI Workshop), 2018, pp. 3–11.

[8] G. Urban, P. Tripathi, T. Alkayali, M. Mittal, F. Jalali, W. Karnes, and P. Baldi, “Deep learning localizes and identifies polyps in real time with 96% accuracy in screening colonoscopy,” Gastroenterology, vol. 155, no. 4, pp. 1069–1078, 2018.

[9] L. C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Proc. ECCV, 2018, pp. 801–818.

[10] O. Urban, et al., “Deep learning for real-time detection of colorectal polyps in colonoscopy videos,” The Lancet Oncology, vol. 19, no. 7, pp. 793–800, 2018.

[11] D. Jha, P. Halvorsen, H. D. Johansen, D. Johansen, T. de Lange, and M. A. Riegler, “Kvasir-SEG: A segmented polyp dataset,” in Proc. Int. Conf. Multimedia Modeling (MMM), 2020, pp. 451–462.

[12] A. Dosovitskiy, L. Beyer, A. Kolesnikov, et al., “An image is worth 16×16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.

[13] D. Jha, P. H. Smedsrud, M. A. Riegler, P. Halvorsen, D. Johansen, T. de Lange, and H. D. Johansen, “ResUNet++: An advanced architecture for medical image segmentation,” in Proc. IEEE Int. Symp. Biomed. Imaging (ISBI), 2019, pp. 223–227.

[14] D. Jha, S. Ali, N. K. Tomar, H. D. Johansen, D. Johansen, M. A. Riegler, P. Halvorsen, and T. de Lange, “ColonSegNet: A dilated convolutional neural network for colon polyp segmentation,” in Proc. IEEE Int. Symp. Computer-Based Medical Systems (CBMS), 2021, pp. 191–196.

[15] M. Raghu, T. Unterthiner, S. Kornblith, C. Zhang, and A. Dosovitskiy, “Do vision transformers see like convolutional neural networks?,” in Proc. NeurIPS, 2021.

[16] H. Cao, Y. Wang, J. Chen, D. Jiang, X. Zhang, Q. Tian, and M. Wang, “Swin-Unet: Unet-like pure transformer for medical image segmentation,” in Proc. MICCAI, 2021, pp. 100–110.

[17] J. Chen, Y. Lu, Q. Yu, X. Luo, E. Adeli, Y. Wang, L. Lu, A. L. Yuille, and Y. Zhou, “TransUNet: Transformers make strong encoders for medical image segmentation,” arXiv preprint arXiv:2102.04306, 2021.

[18] H. Huang, L. Lin, R. Tong, and G. Hu, “HarDNet-MSEG: A low memory requirement network for polyp segmentation,” Medical Image Analysis, vol. 75, p. 102304, 2022.

[19] S. Ali, S. Realdon, et al., “An objective comparison of polyp detection methods in colonoscopy: EndoCV 2021 challenge,” Medical Image Analysis, vol. 77, p. 102336, 2022.

[20] R. Chen, et al., “Semi-supervised learning methods for polyp segmentation: A review,” Medical Image Analysis, vol. 82, p. 102639, 2022.

[21] M.-H. Guo, C. Xu, J. Liu, et al., “Polyp-PVT: Polyp segmentation with pyramid vision transformers,” IEEE J. Biomed. Health Inform., vol. 26, no. 6, pp. 3120–3130, 2022.

[22] Z. Dong, H. Cao, et al., “ColonFormer: Effective transformer-based polyp segmentation,” Medical Image Analysis, vol. 86, p. 102792, 2023.

[23] VCIBA Consortium, “Transformers for medical image analysis: A tutorial review,” Visual Computing for Industry, Biomedicine, and Art, vol. 6, no. 1, 2023, doi: 10.1186/s42492-023-00138-6.

[24] R. Wang, Y. Zhang, X. Li, and J. Sun, “NA-SegFormer: Neural architecture search for transformer-based polyp segmentation,” IEEE Transactions on Medical Imaging, vol. 42, no. 1, pp. 15–27, 2023.

[25] J. Lee, M. Kim, H. Park, and S. Lim, “Polyp-LVT: Lightweight vision transformer for efficient polyp segmentation,” Medical Image Analysis, vol. 91, p. 103025, 2024.

[26] S. Kumar, P. Rani, and R. Gupta, “Tiny polyp detection using lightweight CNNs,” Pattern Analysis and Applications, 2024.

[27] S. Ali, D. Jha, M. Smedsrud, D. Johansen, P. Halvorsen, H. D. Johansen, et al., “PolypGen: A multi-center polyp detection and segmentation dataset for generalisability assessment,” arXiv preprint arXiv:2106.04463, 2021.

[28] Y. Oukdach, A. Garbaz, Z. Kerkaou, M. El Ansari, L. Koutti, A. F. El Ouafdi, and M. Salihoun, “UViT-Seg: An efficient ViT and U-Net-based framework for accurate colorectal polyp segmentation in colonoscopy and WCE images,” Journal of Digital Imaging, vol. 37, no. 5, pp. 2354–2374, 2024, doi: 10.1007/s10278-024-01124-8.

[29] Q. H. Trinh, N. T. Bui, T. H. Nguyen Mau, M. V. Nguyen, H. M. Phan, M. T. Tran, and H. D. Nguyen, “M²UNet: MetaFormer multi-scale upsampling network for polyp segmentation,” arXiv preprint arXiv:2306.08600, 2023, doi: 10.48550/arXiv.2306.08600.

[30] L. Wang, Z. Liu, and Q. Li, “Nested UNet++ for colonoscopy segmentation,” Medical Physics, 2024.

[31] T. Li, P. Liu, and H. Zhang, “Real-time polyp detection with lightweight transformers,” IEEE Transactions on Medical Imaging, 2024.

[32] Y. Zhao, L. Huang, and F. Wang, “LapFormer: Lightweight attention pyramid transformers for polyp segmentation,” arXiv preprint arXiv:2210.04393, 2022.

[33] R. Chen, X. Ma, and J. Zhang, “NA-SegFormer with neighborhood attention achieving 96% accuracy,” Scientific Reports, vol. 14, pp. 1–12, 2024.

[34] Y. Khan, B. Ahmed, and A. Raza, “Hybrid CNN–Transformer methods for polyp segmentation,” arXiv preprint arXiv:2508.09189, 2025.

[35] P. Lijin, M. Ullah, A. Vats, F. A. Cheikh, G. S. Kumar, and M. S. Nair, “PolySegNet: Improving polyp segmentation through Swin Transformer and Vision Transformer fusion,” Biomedical Engineering Letters, vol. 14, pp. 1421–1431, Aug. 2024, doi: 10.1007/s13534-024-00355-7.

[36] D.-P. Fan, G.-P. Ji, T. Zhou, G. Chen, H. Fu, J. Shen, and L. Shao, “PraNet: Parallel reverse attention network for polyp segmentation,” in Proc. MICCAI, 2020, pp. 263–273.

[37] S. Srivastava, P. Jha, and A. Jha, “MSRF-Net: A multi-scale residual fusion network for biomedical image segmentation,” Computers in Biology and Medicine, vol. 134, p. 104427, 2021.

[38] T. Huang, Y. Xu, Y. Song, X. Yan, Y. Zhang, and Y. Wang, “SSFormer: A lightweight structure-shared transformer for medical image segmentation,” Medical Image Analysis, vol. 84, p. 102684, 2023.

[39] S. Ding, J. Zhang, and J. Yu, “Feature cross-bridging transformer for medical image segmentation,” Knowledge-Based Systems, vol. 269, p. 110481, 2023.

[40] X. Yang, Z. Zhang, and L. Zhu, “A lighter hybrid feature fusion framework for polyp segmentation,” Biomedical Signal Processing and Control, vol. 97, p. 105735, 2024.

[41] W. Li, Y. Chen, J. Wang, and H. Xu, “VMDU-Net: Vision Mamba Dual-encoder UNet for accurate polyp segmentation,” Knowledge-Based Systems, vol. 311, p. 111999, 2025.

[42] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. NeurIPS, 2017, pp. 6000–6010.

[43] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. CVPR, 2016, pp. 770–778.

[44] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proc. CVPR, 2015, pp. 1–9.

[45] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.

[46] A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. NeurIPS, 2012, pp. 1097–1105.

[47] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proc. CVPR, 2009, pp. 248–255.

[48] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The Pascal Visual Object Classes (VOC) Challenge,” International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.

[49] T. Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in Proc. ECCV, 2014, pp. 740–755.

[50] S. Minaee, Y. Boykov, F. Porikli, A. Plaza, N. Kehtarnavaz, and D. Terzopoulos, “Image segmentation using deep learning: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 7, pp. 3523–3542, 2022.

[51] J. Mei, T. Zhou, K. Huang, Y. Zhang, Y. Zhou, Y. Wu, and H. Fu, “A survey on deep learning for polyp segmentation: Techniques, challenges and future trends,” arXiv preprint arXiv:2311.18373, 2023.

[52] Z. Wu, F. Lv, C. Chen, A. Hao, and S. Li, “Colorectal polyp segmentation in the deep learning era: A comprehensive survey,” arXiv preprint arXiv:2401.11734, 2024.