CMR-Analysis (including machine learning)
Yidong Zhao, MSc
PhD Candidate
Delft University of Technology, Netherlands
Qian Tao, PhD
Assistant Professor
Delft University of Technology, Zuid-Holland, Netherlands
Despite the impressive performance of deep learning in cardiac magnetic resonance imaging (CMR) segmentation, well-trained neural networks may still fail at test time [1]. Monte Carlo (MC) Dropout [2] can provide pixel-level uncertainty estimation; however, it has been reported to cause "silent failure" (i.e., failure to detect failure) [3]. Additionally, pixel-level uncertainty demands intensive visual inspection by radiologists/cardiologists. This study investigates the feasibility of detecting AI failure at the slice level using our recently proposed Bayesian deep learning by Checkpoint Ensemble [4]. The purpose is to enable easy and robust AI quality control in the clinic.
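Both Bayesian approaches reduce to averaging the softmax output of multiple stochastic predictions: repeated forward passes with dropout enabled (MC Dropout) or predictions from several saved training checkpoints (Checkpoint Ensemble). A minimal sketch of the resulting pixel-level uncertainty map, using predictive entropy as the uncertainty measure (the function name and toy dimensions are illustrative assumptions, not the authors' exact implementation):

```python
import numpy as np

def predictive_entropy(prob_maps):
    """Pixel-wise predictive entropy from an ensemble of softmax maps.

    prob_maps: shape (T, C, H, W) -- T stochastic forward passes
    (MC Dropout) or T checkpoints (Checkpoint Ensemble), C classes.
    Returns an (H, W) uncertainty map; higher = more uncertain.
    """
    mean_prob = prob_maps.mean(axis=0)  # average over the T members: (C, H, W)
    eps = 1e-12                         # guard against log(0)
    return -(mean_prob * np.log(mean_prob + eps)).sum(axis=0)

# Toy example: 5 ensemble members, 2 classes, a 4x4 image.
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 2, 4, 4))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax over classes
unc = predictive_entropy(probs)
```

For a two-class problem the entropy lies between 0 (all members agree) and ln 2 (maximal disagreement), so regions where the ensemble members disagree light up in the uncertainty map.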
Methods:
The Bayesian segmentation networks using MC Dropout and Checkpoint Ensemble were trained on the public ACDC dataset [5], comprising 100 short-axis cine steady-state free precession (SSFP) CMR scans acquired on Siemens Aera 1.5T and Siemens Trio Tim 3.0T scanners, and tested on the M&Ms challenge dataset [6], which consists of cine SSFP scans of 150 patients acquired on Siemens MAGNETOM Avanto 1.5T and Philips Achieva 1.5T scanners. Using manual annotation as ground truth, we labeled a slice as an AI failure if the Dice coefficient was below 85% for the left/right ventricle (LV/RV) or 80% for the myocardium, and the 95th-percentile Hausdorff distance exceeded 3 mm. We derived three segmentation confidence scores, indicating success or failure at the slice level for the LV, RV, and myocardium respectively, by averaging the pixel-level uncertainty over the predicted region of each cardiac structure. We evaluated the proposed segmentation confidence scores by the area under the curve (AUC) of the receiver operating characteristic (ROC).
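The slice-level scoring and evaluation described above might be sketched as follows. The exact confidence formulation here (negative mean uncertainty over the predicted region) and the rank-based AUC are illustrative assumptions; the abstract only specifies averaging pixel-level uncertainty over the predicted structure and reporting ROC AUC:

```python
import numpy as np

def slice_confidence(uncertainty_map, pred_mask):
    """Slice-level confidence for one cardiac structure: the negative mean
    pixel uncertainty over the predicted region (higher = more confident).
    Hypothetical formulation; the sign convention is a sketch assumption."""
    if not pred_mask.any():              # empty prediction: fall back to whole slice
        return -uncertainty_map.mean()
    return -uncertainty_map[pred_mask].mean()

def is_failure(dice, hd95, dice_thr, hd_thr=3.0):
    """Label a slice as AI failure per the stated criterion: Dice below the
    structure threshold (0.85 for LV/RV, 0.80 for myocardium) and
    95th-percentile Hausdorff distance above 3 mm."""
    return dice < dice_thr and hd95 > hd_thr

def roc_auc(labels, scores):
    """Rank-based ROC AUC (Mann-Whitney U form): the fraction of
    (failure, success) pairs in which the failure slice gets the
    higher failure score, counting ties as half."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties

# Toy usage: low confidence (high mean uncertainty) should rank failures first.
unc_map = np.array([[0.1, 0.9], [0.2, 0.3]])
mask = np.array([[True, False], [False, True]])
conf = slice_confidence(unc_map, mask)   # -(0.1 + 0.3) / 2
```

In practice one failure score per structure (e.g., the negated confidence) is ranked against the per-slice failure labels to produce the reported ROC AUC values.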
Results:
Table 1 reports the AUC values, showing that the segmentation confidence scores from both Bayesian methods can effectively detect segmentation failure on the LV (93.4% vs. 93.8%) and myocardium (82.6% vs. 84.3%). The segmentation confidence score based on our Checkpoint Ensemble detected segmentation failure on the RV better than MC Dropout did (AUC 91.6% vs. 87.9%). Figure 1 (a)-(c) shows three examples of segmentation results and the corresponding RV segmentation confidence scores from MC Dropout (first row) and Checkpoint Ensemble (second row). In all three cases, RV segmentation failed, but the failure was not detected by the segmentation confidence score from MC Dropout ("silent failure"). In contrast, all three failure cases were detected by the segmentation confidence score based on Checkpoint Ensemble.
Conclusion:
We proposed and validated a segmentation confidence score to detect AI CMR segmentation failure at the slice level. The confidence score based on our Bayesian Checkpoint Ensemble demonstrated higher sensitivity than traditional MC Dropout, especially on the RV, which is more challenging to segment. High sensitivity in failure detection prevents silent failures and potentially translates to reliable AI in clinical practice.