Deep BI-RADS Network for Improved Cancer Detection from Mammograms

Gil Ben-Artzi
Feras Daragma
Shahar Mahpod

School of Computer Science, Ariel University, Israel

2024 IEEE International Conference on Pattern Recognition (ICPR)




Highlights

  • Combines textual BI-RADS descriptors with visual mammogram data for improved cancer detection.
  • Employs iterative attention layers for effective multi-modal fusion.
  • Achieves higher classification performance compared to image-only models.
  • Demonstrates an AUC of 0.872 on the CBIS-DDSM dataset.

Abstract

While state-of-the-art models for breast cancer detection leverage multi-view mammograms for enhanced diagnostic accuracy, they often focus solely on visual mammography data. However, radiologists document valuable lesion descriptors that carry additional information and can enhance mammography-based breast cancer screening. A key question is whether deep learning models can benefit from these expert-derived features. To address this question, we introduce a novel multi-modal approach that combines textual BI-RADS lesion descriptors with visual mammogram content. Our method employs iterative attention layers to effectively fuse these different modalities, significantly improving classification performance over image-only models. Experiments on the CBIS-DDSM dataset show substantial improvements across all metrics, with an AUC of 0.872, demonstrating the contribution of handcrafted features to end-to-end learning.


Results on the CBIS-DDSM dataset

Model                  AUC    Accuracy  Specificity  Precision  Recall  F1-Score
[15]                   0.680  0.661     0.670        0.638      0.651   0.644
[23]                   0.811  0.723     0.750        0.686      0.698   0.692
Ours (no descriptors)  0.711  0.664     0.650        0.676      0.619   0.634
Ours                   0.872  0.760     0.773        0.760      0.743   0.751




Method

Training: Our model uses iterative attention layers to fuse BI-RADS textual descriptors with mammogram images. This multi-modal fusion improves the model's ability to distinguish benign from malignant lesions.
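As a rough illustration (not the released code), the PyTorch sketch below shows one way such a fusion could look: categorical BI-RADS fields are embedded as tokens, and image tokens cross-attend to them before a self-attention step. The field count, vocabulary sizes, dimensions, and class names are assumptions made for the example.

```python
import torch
import torch.nn as nn


class DescriptorEncoder(nn.Module):
    """Embed categorical BI-RADS descriptor fields as a token sequence.
    The number of fields and vocabulary sizes below are illustrative."""

    def __init__(self, vocab_sizes=(8, 8, 16, 8), dim=256):
        super().__init__()
        # One embedding table per descriptor field
        # (e.g., mass shape, mass margin, calcification type, distribution).
        self.embeddings = nn.ModuleList(nn.Embedding(v, dim) for v in vocab_sizes)

    def forward(self, descriptor_ids):
        # descriptor_ids: (batch, num_fields) integer codes per lesion.
        tokens = [emb(descriptor_ids[:, i]) for i, emb in enumerate(self.embeddings)]
        return torch.stack(tokens, dim=1)  # (batch, num_fields, dim)


class FusionLayer(nn.Module):
    """One fusion step: image tokens cross-attend to descriptor tokens,
    then self-attend. Stacking several such layers gives an iterative
    attention fusion of the two modalities."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, img_tokens, desc_tokens):
        # Cross attention: visual queries, textual keys/values.
        x, _ = self.cross_attn(img_tokens, desc_tokens, desc_tokens)
        img_tokens = self.norm1(img_tokens + x)
        # Self attention over the fused visual tokens.
        x, _ = self.self_attn(img_tokens, img_tokens, img_tokens)
        return self.norm2(img_tokens + x)
```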

Inference: At test time, the model leverages the learned multi-modal representation to produce more accurate predictions.
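As a toy example of the data flow at test time, reusing the classes sketched above with random tensors (shapes only):

```python
# Continues the sketch above (torch, nn, DescriptorEncoder, FusionLayer).
encoder = DescriptorEncoder(dim=256)
fusion = FusionLayer(dim=256)

img_tokens = torch.randn(1, 196, 256)          # e.g., a 14x14 feature map, flattened
desc_ids = torch.tensor([[2, 1, 0, 3]])        # illustrative descriptor codes
fused = fusion(img_tokens, encoder(desc_ids))  # (1, 196, 256)

# A classification head would pool the fused tokens to score
# benign vs. malignant; shown here only to make the shapes concrete.
prob = torch.sigmoid(nn.Linear(256, 1)(fused.mean(dim=1)))
```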
Deep BI-RADS Network Architecture

Figure 1: The Deep BI-RADS Network architecture. The model processes both CC and MLO mammogram views along with their corresponding BI-RADS descriptors through parallel branches. Each branch contains encoder blocks that reduce spatial resolution while increasing feature channels, followed by multi-attention layers that fuse visual and textual information through Cross, Self, and View attention mechanisms.
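To make the Cross, Self, and View attention of Figure 1 concrete, the sketch below extends the FusionLayer above with a cross-view step in which the CC and MLO branches attend to each other. Treating View attention as standard cross-attention between the two views is our assumption for illustration, not necessarily the authors' exact design.

```python
import torch.nn as nn  # FusionLayer is defined in the sketch above


class ViewAttention(nn.Module):
    """Cross-view attention: tokens from one mammogram view query the
    other view, letting the CC and MLO branches exchange information."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, own_tokens, other_tokens):
        x, _ = self.attn(own_tokens, other_tokens, other_tokens)
        return self.norm(own_tokens + x)


class MultiAttentionBlock(nn.Module):
    """One multi-attention layer in the spirit of Figure 1: Cross and Self
    attention fuse each view with its descriptors, then View attention
    exchanges information between the CC and MLO branches."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.fuse_cc = FusionLayer(dim, heads)
        self.fuse_mlo = FusionLayer(dim, heads)
        self.view_cc = ViewAttention(dim, heads)
        self.view_mlo = ViewAttention(dim, heads)

    def forward(self, cc_tokens, mlo_tokens, cc_desc, mlo_desc):
        cc = self.fuse_cc(cc_tokens, cc_desc)
        mlo = self.fuse_mlo(mlo_tokens, mlo_desc)
        # Each branch attends to the other view's fused tokens.
        return self.view_cc(cc, mlo), self.view_mlo(mlo, cc)
```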



References

[15] Mo, Y., Han, C., Liu, Y., Liu, M., Shi, Z., Lin, J., Zhao, B., Huang, C., Qiu, B., Cui, Y., et al.: HoVer-Trans: Anatomy-aware HoVer-Transformer for ROI-free breast cancer diagnosis in ultrasound images. IEEE Transactions on Medical Imaging (2023)

[23] van Tulder, G., Tong, Y., Marchiori, E.: Multi-view analysis of unregistered medical images using cross-view transformers. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 104–113. Springer (2021)