I. Introduction
Speech enhancement (SE) aims to suppress interference in a noisy mixture while preserving the target speech. With the advent of data-driven deep learning approaches, SE models can learn the relationship between speech and noise directly from large datasets of paired clean and noisy speech, which gives them strong noise suppression ability. More recent approaches favor time-frequency (TF) domain models [1], [2], [3], which achieve excellent performance by enhancing the real and imaginary components of the noisy speech spectrum. DCCRN [1] is a popular model for TF domain speech enhancement, and many subsequent works focus on refining it [4], [5], [6], [7], [8], [9], [10], [11], [12].