An effective CNN and Transformer complementary network for medical image segmentation
Feiniu Yuan, Zhengxiao Zhang, Zhijun Fang
The Transformer network was originally proposed for natural language processing. Because of its powerful ability to represent long-range dependencies, it has been extended to vision tasks in recent years. To fully exploit the advantages of both Transformers and Convolutional Neural Networks (CNNs), we propose a CNN and Transformer Complementary Network (CTC-Net) for medical image segmentation. We first design two encoders, one based on Swin Transformers and one on Residual CNNs, to produce complementary features in the Transformer and CNN domains, respectively. We then concatenate these complementary features cross-wise in a Cross-domain Fusion Block (CFB) that blends them effectively. In addition, we compute the correlation between features from the CNN and Transformer domains, and apply channel attention to the self-attention features produced by the Transformers to capture dual attention information. We combine cross-domain fusion, feature correlation and dual attention into a Feature Complementary Module (FCM) that improves the representation ability of the features. Finally, we design a Swin Transformer decoder to further improve the representation of long-range dependencies, and use skip connections between the Transformer decoded features and the complementary features to extract spatial details, contextual semantics and long-range information. Skip connections are applied at different levels to enhance multi-scale invariance. Experimental results show that our CTC-Net significantly surpasses state-of-the-art image segmentation models based on CNNs, on Transformers, and even on combined Transformer and CNN designs for medical image segmentation. It achieves superior performance on different medical applications, including multi-organ segmentation and cardiac segmentation.
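The three ingredients of the FCM named above (cross-domain fusion, feature correlation, and channel attention on Transformer features) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the layer details, so everything below is an assumption made for illustration. In particular, the function names, the squeeze-and-excitation style of channel attention, the cosine-similarity form of the correlation, the plain channel concatenation followed by a 1x1 channel mix standing in for the CFB, and the final summation are all hypothetical choices.

```python
import numpy as np

def channel_attention(x, w1, w2):
    """Assumed squeeze-and-excitation style gating on x of shape (C, H, W)."""
    squeezed = x.mean(axis=(1, 2))                # global average pool -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)       # bottleneck + ReLU -> (C // r,)
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))  # sigmoid gates in (0, 1) -> (C,)
    return x * gates[:, None, None]               # rescale each channel

def feature_correlation(f_cnn, f_trans):
    """Assumed correlation: per-pixel cosine similarity over channels -> (H, W)."""
    num = (f_cnn * f_trans).sum(axis=0)
    den = np.linalg.norm(f_cnn, axis=0) * np.linalg.norm(f_trans, axis=0) + 1e-8
    return num / den

def cross_domain_fusion(f_cnn, f_trans, w_mix):
    """Assumed stand-in for the CFB: concatenate channels, then a 1x1 channel mix."""
    stacked = np.concatenate([f_cnn, f_trans], axis=0)  # (2C, H, W)
    return np.einsum("oc,chw->ohw", w_mix, stacked)     # back to (C, H, W)

def complementary_features(f_cnn, f_trans, params):
    """Hypothetical combination of the three branches by summation."""
    fused = cross_domain_fusion(f_cnn, f_trans, params["w_mix"])
    corr = feature_correlation(f_cnn, f_trans)
    correlated = corr[None, :, :] * f_cnn               # weight CNN features by agreement
    attended = channel_attention(f_trans, params["w1"], params["w2"])
    return fused + correlated + attended

# Toy inputs standing in for same-scale outputs of the two encoders.
rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
f_cnn = rng.standard_normal((C, H, W))
f_trans = rng.standard_normal((C, H, W))
params = {
    "w_mix": rng.standard_normal((C, 2 * C)) * 0.1,
    "w1": rng.standard_normal((C // r, C)) * 0.1,
    "w2": rng.standard_normal((C, C // r)) * 0.1,
}
out = complementary_features(f_cnn, f_trans, params)
print(out.shape)  # (8, 4, 4)
```

The sketch keeps the output at the same shape as each input so that, under these assumptions, it could be passed to a decoder stage through a skip connection at the matching scale.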