VasoMIM: Vascular Anatomy-Aware Masked Image Modeling for Medical Image Segmentation

De-Xing Huang1,2, Xiao-Hu Zhou1,2*, Mei-Jiang Gui1,2, Xiao-Liang Xie1,2, Shi-Qi Liu1,2, Shuang-Yi Wang1,2, Tian-Yu Xiang1,2, Rui-Ze Ma1,3, Nu-Fang Xiao1,4, Zeng-Guang Hou1,2*
1 Institute of Automation, Chinese Academy of Sciences
2 University of Chinese Academy of Sciences
3 Taiyuan University of Technology
4 Hunan University of Science and Technology
arXiv 2025

*Corresponding Authors

Abstract

Accurate vessel segmentation in X-ray angiograms is crucial for numerous clinical applications. However, the scarcity of annotated data presents a significant challenge, which has driven the adoption of self-supervised learning (SSL) methods such as masked image modeling (MIM) to leverage large-scale unlabeled data for learning transferable representations. Unfortunately, conventional MIM often fails to capture vascular anatomy because of the severe class imbalance between vessel and background pixels, leading to weak vascular representations. To address this, we introduce Vascular anatomy-aware Masked Image Modeling (VasoMIM), a novel MIM framework tailored for X-ray angiograms that explicitly incorporates anatomical knowledge into the pre-training process. Specifically, it comprises two complementary components: an anatomy-guided masking strategy and an anatomical consistency loss. The former preferentially masks vessel-containing patches to focus the model on reconstructing vessel-relevant regions. The latter enforces consistency in vascular semantics between the original and reconstructed images, thereby improving the discriminability of vascular representations. Empirically, VasoMIM achieves state-of-the-art performance across three datasets. These findings highlight its potential to facilitate X-ray angiogram analysis.

Overview


Comparison of conventional MIM and VasoMIM. (a) Conventional MIM masks patches based on general rules (e.g., random masking) and learns to reconstruct patches via minimizing pixel-level loss. (b) VasoMIM guides patch masking with vascular anatomy and enforces anatomical consistency during reconstruction, enabling the model to learn richer vascular representations. Dark gray patches are vessel-relevant regions.

VasoMIM Framework


Overall framework of VasoMIM. During pre-training, each X-ray angiogram is first processed by the Frangi filter to extract its vascular anatomy. From this anatomy, we derive a patch-wise vascular anatomical distribution $f$ to guide the masking process. Finally, the model is optimized by minimizing $\mathcal{L}_{\rm train}$, a combination of the standard pixel-wise reconstruction loss $\mathcal{L}_{\rm rec.}$ and the proposed anatomical consistency loss $\mathcal{L}_{\rm cons.}$.
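The anatomy-guided masking step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the per-pixel vesselness map is precomputed (in VasoMIM it comes from the Frangi filter, e.g. `skimage.filters.frangi`), and the function name, patch size, and masking ratio are placeholders.

```python
import numpy as np

def anatomy_guided_mask(vesselness, patch_size=16, mask_ratio=0.75, rng=None):
    """Sample a patch mask weighted by a per-pixel vesselness map.

    Averaging vesselness over patches yields a patch-wise distribution f;
    patches are then drawn for masking with probability proportional to f,
    so vessel-containing patches are masked preferentially.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = vesselness.shape
    ph, pw = h // patch_size, w // patch_size

    # Patch-wise vascular anatomical distribution f: mean vesselness per patch.
    v = vesselness[:ph * patch_size, :pw * patch_size]
    f = v.reshape(ph, patch_size, pw, patch_size).mean(axis=(1, 3)).ravel()
    f = (f + 1e-8) / (f + 1e-8).sum()  # epsilon keeps every patch samplable

    # Preferentially mask vessel-containing patches (without replacement).
    n_mask = int(mask_ratio * ph * pw)
    idx = rng.choice(ph * pw, size=n_mask, replace=False, p=f)
    mask = np.zeros(ph * pw, dtype=bool)
    mask[idx] = True
    return mask.reshape(ph, pw)
```

With a vesselness map concentrated on a few patches, almost all of the sampled mask lands on those patches, which is the intended bias toward vessel-relevant regions.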

Main Results


Comparison of state-of-the-art methods on ARCADE, CAXF, and XCAV. All methods are reimplemented using their official codebases. The best and second-best results are highlighted in red and yellow, respectively. Results are reported as “${\rm mean}\pm{\rm std}$” over three random seeds, except for the Frangi filter. Models specialized in medical imaging are marked.

Visualization


Qualitative results on ARCADE, CAXF, and XCAV. VasoMIM achieves more precise vessel segmentation with fewer false positives.


Patch-wise masking ratio over the pre-training process, i.e., $\frac{1}{E}\sum_{j=1}^E\mathbb{I}\left(\text{Patch $x_i$ is masked in epoch $j$}\right)$. For each case, we show the input image, the vascular anatomy extracted by the Frangi filter, and the patch-wise masking ratio over pre-training. Red indicates a higher ratio; blue, a lower one.
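The statistic in the caption is just the fraction of epochs in which each patch was masked. A minimal numpy sketch, assuming the per-epoch boolean masks have been recorded (the array names and toy sizes here are illustrative):

```python
import numpy as np

# Toy example: boolean mask history for N patches over E pre-training epochs.
rng = np.random.default_rng(0)
E, N = 100, 16                       # epochs, patches per image (illustrative)
masks = rng.random((E, N)) < 0.75    # True where a patch was masked that epoch

# Patch-wise masking ratio: (1/E) * sum_j I(patch x_i is masked in epoch j).
ratio = masks.mean(axis=0)           # one value in [0, 1] per patch
```

Plotting `ratio` over the patch grid gives the red/blue heatmap shown in the figure.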

BibTeX
