SegFormer3D

An Efficient Transformer for 3D Medical Image Segmentation

Shehan Perera* Pouyan Navard* Alper Yilmaz
Photogrammetric Computer Vision Lab, The Ohio State University
CVPRW 2024

Abstract

The adoption of Vision Transformers (ViTs) represents a significant advancement in 3D medical image segmentation. However, state-of-the-art architectures rely on extremely large, complex models that demand substantial computing resources. We present SegFormer3D, a hierarchical Transformer that achieves competitive performance with 33× fewer parameters and a 13× reduction in GFLOPs compared to the current state of the art. SegFormer3D computes attention across multiscale volumetric features and uses an all-MLP decoder to produce highly accurate segmentation masks. We benchmark on three widely used datasets (Synapse, BraTS, ACDC), demonstrating that lightweight models are a valuable research direction for 3D medical imaging.

33× fewer parameters · 13× lower GFLOPs · 4.5M total parameters · 17.5 GFLOPs

Qualitative Results

We compare SegFormer3D (highlighted in green) against state-of-the-art baselines on three medical imaging datasets. Despite being significantly lighter, our model produces highly accurate segmentation masks across all datasets.

Brain Tumor Segmentation (BraTS)

Segmentation of tumor regions (Whole Tumor, Enhancing Tumor, Tumor Core) from multi-modal MRI scans.

BraTS Segmentation Results

Multi-Organ CT Segmentation (Synapse)

Segmentation of eight abdominal organs: spleen, left and right kidneys, liver, stomach, gallbladder, pancreas, and aorta.

Synapse Segmentation Results

Automated Cardiac Diagnosis (ACDC)

Segmentation of cardiac structures: Right Ventricle (RV), Myocardium (Myo), and Left Ventricle (LV).

ACDC Segmentation Results

Quantitative Results

BraTS Performance

Method Params (M) Avg DSC (%) WT ET TC
nnFormer 150.5 86.4 91.3 81.8 86.0
SegFormer3D (Ours) 4.5 82.1 89.9 74.2 82.2
UNETR 92.49 71.1 78.9 58.5 76.1
TransUNet 96.07 64.4 70.6 54.2 68.4

Synapse Multi-Organ Performance

Method Params (M) Avg DSC (%)
nnFormer 150.5 86.57
SegFormer3D (Ours) 4.5 82.15
MISSFormer -- 81.96
UNETR 92.49 79.56
TransUNet 96.07 77.48

ACDC Cardiac Segmentation Performance

Method Params (M) Avg DSC (%) RV Myo LV
nnFormer 150.5 92.06 90.94 89.58 95.65
SegFormer3D (Ours) 4.5 90.96 88.50 88.86 95.53
LeViT-UNet-384 52.17 90.32 89.55 87.64 93.76
TransUNet 96.07 89.71 88.86 84.54 95.73

Method Overview

🏗️ Hierarchical Encoder

A 4-stage hierarchical Transformer encoder extracts multiscale volumetric features from 3D medical images. It uses overlapped patch merging to preserve spatial continuity and efficient self-attention to cut the cost of attention from O(n²) to O(n²/r).
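The overlapped patch merging that feeds each encoder stage can be sketched as a strided 3D convolution whose kernel is larger than its stride, so neighbouring patches share voxels. This is a minimal sketch, not the official implementation; the channel width, kernel, and stride values here are illustrative.

```python
import torch
import torch.nn as nn

class OverlapPatchEmbed3D(nn.Module):
    # Sketch of overlapped patch merging for one encoder stage: a
    # strided Conv3d with kernel > stride, so adjacent patches overlap
    # and spatial continuity is preserved. Values are illustrative.
    def __init__(self, in_ch, out_ch, patch, stride):
        super().__init__()
        self.proj = nn.Conv3d(in_ch, out_ch, kernel_size=patch,
                              stride=stride, padding=patch // 2)
        self.norm = nn.LayerNorm(out_ch)

    def forward(self, x):
        x = self.proj(x)                              # (B, C, D, H, W)
        d, h, w = x.shape[2:]
        x = self.norm(x.flatten(2).transpose(1, 2))   # (B, N, C) tokens
        return x, (d, h, w)

vol = torch.randn(1, 4, 64, 64, 64)   # e.g. 4 MRI modalities
stage1 = OverlapPatchEmbed3D(in_ch=4, out_ch=32, patch=7, stride=4)
tokens, (d, h, w) = stage1(vol)
print(tokens.shape)  # torch.Size([1, 4096, 32]), i.e. 16x16x16 tokens
```

Later stages would repeat the same module with smaller kernels and stride 2, halving resolution while widening channels.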

⚡ Efficient Self-Attention

Addresses the sequence-length bottleneck of 3D volumes by compressing the key-value pairs with reduction ratios of [4, 2, 1, 1] across the four stages, maintaining accuracy while dramatically reducing computation.
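The key-value compression can be sketched as a strided 3D convolution applied before attention, so only the k/v sequence shrinks while queries keep full resolution. This assumes the SegFormer-style spatial-reduction design; the module and argument names are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn

class EfficientSelfAttention3D(nn.Module):
    # Sketch of spatial-reduction attention: keys/values are
    # downsampled by a stride-`sr` Conv3d, so for volumetric input the
    # k/v sequence length shrinks by sr**3 while queries stay full-length.
    def __init__(self, dim, heads, sr):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.sr = sr
        if sr > 1:
            self.reduce = nn.Conv3d(dim, dim, kernel_size=sr, stride=sr)
            self.norm = nn.LayerNorm(dim)

    def forward(self, x, dhw):
        # x: (B, N, C) tokens; dhw: (D, H, W) with N == D*H*W
        b, n, c = x.shape
        kv = x
        if self.sr > 1:
            d, h, w = dhw
            vol = x.transpose(1, 2).reshape(b, c, d, h, w)
            kv = self.norm(self.reduce(vol).flatten(2).transpose(1, 2))
        out, _ = self.attn(x, kv, kv, need_weights=False)
        return out

attn = EfficientSelfAttention3D(dim=32, heads=4, sr=4)
y = attn(torch.randn(1, 8 * 8 * 8, 32), (8, 8, 8))
print(y.shape)  # torch.Size([1, 512, 32]); k/v were reduced to 8 tokens
```

With sr=4, the 512 key/value tokens collapse to 2×2×2 = 8, which is what makes attention affordable at the finest stage.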

🎯 All-MLP Decoder

Instead of a complex convolutional decoder, we use a simple all-MLP decoder that aggregates multiscale features through linear projections and upsampling, producing accurate masks with minimal parameters.
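The aggregation step can be sketched as follows: each stage's feature map is linearly projected (a 1×1×1 conv acts per-voxel like a linear layer) to a common width, upsampled to the finest resolution, concatenated, fused, and classified. Widths and class count below are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AllMLPDecoder3D(nn.Module):
    # Sketch of an all-MLP decoder: per-stage linear projection to a
    # shared width, trilinear upsampling to the finest stage, then a
    # linear fuse layer and a per-voxel classification head.
    def __init__(self, in_chs, embed=64, classes=3):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv3d(c, embed, 1) for c in in_chs)
        self.fuse = nn.Conv3d(embed * len(in_chs), embed, 1)
        self.head = nn.Conv3d(embed, classes, 1)

    def forward(self, feats):
        size = feats[0].shape[2:]   # finest stage's (D, H, W)
        ups = [F.interpolate(p(f), size=size, mode='trilinear',
                             align_corners=False)
               for p, f in zip(self.proj, feats)]
        return self.head(self.fuse(torch.cat(ups, dim=1)))

# Four pyramid levels: resolution halves, channels widen (illustrative).
feats = [torch.randn(1, c, 16 // 2 ** i, 16 // 2 ** i, 16 // 2 ** i)
         for i, c in enumerate([32, 64, 160, 256])]
dec = AllMLPDecoder3D([32, 64, 160, 256])
masks = dec(feats)
print(masks.shape)  # torch.Size([1, 3, 16, 16, 16])
```

Because every learned layer is a 1×1×1 projection, the decoder's parameter count stays tiny regardless of input volume size.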

🔄 Mix-FFN Module

Enables the network to learn positional cues automatically, without fixed positional encodings, which improves scalability across the varying input resolutions common in medical imaging.

Model Efficiency Comparison

Architecture Parameters (M) GFLOPs Reduction vs SOTA
nnFormer 150.5 213.4 --
TransUNet 96.07 88.91 --
UNETR 92.49 75.76 --
SwinUNETR 62.83 384.2 --
SegFormer3D (Ours) 4.51 17.5 33× params, 13× GFLOPs
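As a sanity check, the headline reduction factors can be recomputed from the table above, taking nnFormer (the largest baseline) as the reference; the GFLOPs ratio comes out near 12×, which the abstract rounds up to 13×.

```python
# Recompute the reduction factors from the efficiency table,
# relative to nnFormer.
nnformer_params, nnformer_gflops = 150.5, 213.4
ours_params, ours_gflops = 4.51, 17.5

print(f"{nnformer_params / ours_params:.1f}x fewer parameters")  # 33.4x fewer parameters
print(f"{nnformer_gflops / ours_gflops:.1f}x fewer GFLOPs")      # 12.2x fewer GFLOPs
```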

Citation

@inproceedings{perera2024segformer3d,
  title={SegFormer3D: an Efficient Transformer for 3D Medical Image Segmentation},
  author={Perera, Shehan and Navard, Pouyan and Yilmaz, Alper},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  year={2024}
}