The adoption of Vision Transformers (ViTs) represents a significant advancement in 3D medical image segmentation. However, state-of-the-art architectures require extremely large and complex models with substantial computing resources. We present SegFormer3D, a hierarchical Transformer that achieves competitive performance with 33× fewer parameters and a 13× reduction in GFLOPS compared to the current state-of-the-art. SegFormer3D computes attention across multiscale volumetric features and uses an all-MLP decoder to produce highly accurate segmentation masks. We benchmark on three widely used datasets (Synapse, BraTS, ACDC), demonstrating that lightweight models are a valuable research direction for 3D medical imaging.
We compare SegFormer3D (highlighted in green) against state-of-the-art baselines on three medical imaging datasets. Despite being significantly lighter, our model produces highly accurate segmentation masks across all datasets.
**BraTS.** Segmentation of tumor regions (Whole Tumor, Enhancing Tumor, Tumor Core) from multi-modal MRI scans.
| Method | Params (M) | Avg DSC (%) | WT | ET | TC |
|---|---|---|---|---|---|
| nnFormer | 150.5 | 86.4 | 91.3 | 81.8 | 86.0 |
| SegFormer3D (Ours) | 4.5 | 82.1 | 89.9 | 74.2 | 82.2 |
| UNETR | 92.49 | 71.1 | 78.9 | 58.5 | 76.1 |
| TransUNet | 96.07 | 64.4 | 70.6 | 54.2 | 68.4 |
**Synapse.** Segmentation of 8 abdominal organs: the spleen, left and right kidneys, liver, stomach, gallbladder, pancreas, and aorta.

| Method | Params (M) | Avg DSC (%) |
|---|---|---|
| nnFormer | 150.5 | 86.57 |
| SegFormer3D (Ours) | 4.5 | 82.15 |
| MISSFormer | -- | 81.96 |
| UNETR | 92.49 | 79.56 |
| TransUNet | 96.07 | 77.48 |
**ACDC.** Segmentation of cardiac structures: Right Ventricle (RV), Myocardium (Myo), and Left Ventricle (LV).

| Method | Params (M) | Avg DSC (%) | RV | Myo | LV |
|---|---|---|---|---|---|
| nnFormer | 150.5 | 92.06 | 90.94 | 89.58 | 95.65 |
| SegFormer3D (Ours) | 4.5 | 90.96 | 88.50 | 88.86 | 95.53 |
| LeViT-UNet-384 | 52.17 | 90.32 | 89.55 | 87.64 | 93.76 |
| TransUNet | 96.07 | 89.71 | 88.86 | 84.54 | 95.73 |
**Hierarchical Transformer encoder.** A 4-stage hierarchical Transformer encoder extracts multiscale volumetric features from 3D medical images, using overlapped patch merging to preserve spatial continuity across patch boundaries and efficient self-attention to cut computational complexity from O(n²) to O(n²/r), where n is the sequence length and r is the reduction ratio.
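As a concrete illustration, the following PyTorch sketch shows one way to implement 3D overlapped patch merging: a strided convolution whose kernel is larger than its stride, so neighboring patches share voxels. The channel width, kernel size, and stride here are illustrative placeholders, not the exact SegFormer3D configuration.

```python
import torch
import torch.nn as nn

class OverlapPatchEmbed3D(nn.Module):
    """Overlapped patch merging for volumes: a strided 3D convolution with
    kernel > stride, so adjacent patches overlap and spatial continuity is
    preserved (unlike non-overlapping ViT patchification).
    Hyperparameters are illustrative, not the published configuration."""
    def __init__(self, in_ch=4, embed_dim=32, kernel=7, stride=4):
        super().__init__()
        self.proj = nn.Conv3d(in_ch, embed_dim, kernel_size=kernel,
                              stride=stride, padding=kernel // 2)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                      # x: (B, C, D, H, W)
        x = self.proj(x)                       # (B, E, D', H', W')
        d, h, w = x.shape[2:]
        tokens = x.flatten(2).transpose(1, 2)  # (B, N, E), N = D'*H'*W'
        return self.norm(tokens), (d, h, w)

# e.g. a 4-channel 64^3 MRI crop -> 16^3 tokens at the first stage
tokens, dhw = OverlapPatchEmbed3D()(torch.randn(1, 4, 64, 64, 64))
```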
**Efficient self-attention.** Addresses the sequence-length bottleneck of 3D volumes by compressing the key-value pairs with reduction ratios of [4×, 2×, 1×, 1×] across the four stages, maintaining accuracy while dramatically reducing computation.
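Below is a minimal sketch of this key-value compression, under two stated assumptions: the reduction is implemented as a strided convolution over the volume (so a per-axis ratio shrinks the key/value sequence by the ratio cubed), and attention itself is delegated to PyTorch's built-in `nn.MultiheadAttention` rather than a fused custom kernel.

```python
import torch
import torch.nn as nn

class EfficientSelfAttention3D(nn.Module):
    """Self-attention with spatially reduced keys/values: the KV sequence is
    shortened before attention, so the score matrix shrinks from n x n to
    n x (n/R). Whether the ratio applies per axis (as here) or to the
    flattened sequence is an assumption of this sketch."""
    def __init__(self, dim, num_heads=4, sr_ratio=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.sr_ratio = sr_ratio
        if sr_ratio > 1:
            # strided conv merges sr_ratio^3 voxels into one KV token
            self.sr = nn.Conv3d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
            self.norm = nn.LayerNorm(dim)

    def forward(self, x, dhw):                            # x: (B, N, C)
        kv = x
        if self.sr_ratio > 1:
            b, n, c = x.shape
            d, h, w = dhw
            kv = x.transpose(1, 2).reshape(b, c, d, h, w)
            kv = self.sr(kv).flatten(2).transpose(1, 2)   # (B, N/R, C)
            kv = self.norm(kv)
        out, _ = self.attn(x, kv, kv)                     # queries keep full length
        return out

# 16^3 = 4096 query tokens attend to only 4^3 = 64 compressed KV tokens
attn = EfficientSelfAttention3D(dim=32, sr_ratio=4)
y = attn(torch.randn(1, 16**3, 32), (16, 16, 16))
```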
**All-MLP decoder.** Instead of a complex convolutional decoder, a simple all-MLP decoder aggregates the multiscale encoder features through linear projections and upsampling, producing highly accurate masks with very few parameters.
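A hypothetical sketch of such a decoder follows; here 1×1×1 convolutions act as per-voxel linear layers, and the stage widths (32, 64, 160, 256) are placeholders rather than the published configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AllMLPDecoder3D(nn.Module):
    """All-MLP decoder sketch: project every stage's features to a common
    width (1x1x1 conv == per-voxel linear layer), trilinearly upsample to
    the finest stage's resolution, concatenate, fuse with one more linear
    layer, and map to per-class logits. A final trilinear upsampling to the
    input resolution would follow in practice."""
    def __init__(self, in_dims=(32, 64, 160, 256), dim=256, num_classes=4):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv3d(c, dim, 1) for c in in_dims)
        self.fuse = nn.Conv3d(dim * len(in_dims), dim, 1)
        self.head = nn.Conv3d(dim, num_classes, 1)

    def forward(self, feats):              # feats[0] is the finest stage
        target = feats[0].shape[2:]
        ups = [F.interpolate(p(f), size=target, mode='trilinear',
                             align_corners=False)
               for p, f in zip(self.proj, feats)]
        return self.head(self.fuse(torch.cat(ups, dim=1)))

feats = [torch.randn(1, c, s, s, s)
         for c, s in zip((32, 64, 160, 256), (16, 8, 4, 2))]
logits = AllMLPDecoder3D()(feats)          # (1, num_classes, 16, 16, 16)
```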
**No fixed positional encoding.** The network learns positional cues automatically instead of relying on a fixed positional encoding, so it scales gracefully to the varied input resolutions common in medical imaging.
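One mechanism that achieves this, borrowed from the 2D SegFormer's Mix-FFN and assumed here to carry over to 3D: a depthwise convolution inside the feed-forward block, whose zero padding leaks enough boundary information for the network to infer voxel positions.

```python
import torch
import torch.nn as nn

class MixFFN3D(nn.Module):
    """Feed-forward block with a 3x3x3 depthwise conv between the two linear
    layers. The conv's zero padding provides implicit positional cues, so no
    fixed positional encoding is needed. A sketch following the 2D SegFormer
    Mix-FFN design; the exact SegFormer3D variant is an assumption."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.dwconv = nn.Conv3d(hidden, hidden, 3, padding=1, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x, dhw):                 # x: (B, N, C)
        b, n, c = x.shape
        d, h, w = dhw
        x = self.fc1(x)                        # (B, N, hidden)
        x = x.transpose(1, 2).reshape(b, -1, d, h, w)
        x = self.dwconv(x).flatten(2).transpose(1, 2)
        return self.fc2(self.act(x))

ffn = MixFFN3D(dim=32, hidden=128)
y = ffn(torch.randn(1, 16**3, 32), (16, 16, 16))
```

Because position is inferred rather than hard-coded, the same weights can be applied to volumes whose resolution differs from the training crops, which is common across medical imaging protocols.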
| Architecture | Parameters (M) | GFLOPs | Reduction vs nnFormer |
|---|---|---|---|
| nnFormer | 150.5 | 213.4 | -- |
| TransUNet | 96.07 | 88.91 | -- |
| UNETR | 92.49 | 75.76 | -- |
| SwinUNETR | 62.83 | 384.2 | -- |
| SegFormer3D (Ours) | 4.51 | 17.5 | 33× params, 13× GFLOPs |
```bibtex
@inproceedings{perera2024segformer3d,
  title={SegFormer3D: an Efficient Transformer for 3D Medical Image Segmentation},
  author={Perera, Shehan and Navard, Pouyan and Yilmaz, Alper},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}
```