Inter-Frame Correlation for Deep Filter Estimation
in Speech Dereverberation

Abstract

Speech dereverberation is critical for intelligibility when speech is recorded by a distant microphone, where long-tail reflections degrade temporal and spectral details. Unlike additive noise, late reverberation remains tightly correlated with the target signal, which makes it harder to suppress. Conventional deep learning approaches treat dereverberation as part of a broader enhancement pipeline: raw complex STFT frames are fed into a network that must disentangle reverberation and noise from the target signal, which often results in large models and suboptimal convergence. We propose IF-CorrNet, a correlation-to-filter architecture that explicitly exploits inter-frame STFT correlations. For each time–frequency bin, we compute correlations with adjacent frames and use these features to estimate multi-frame dereverberation filters. By estimating filters explicitly from inter-frame correlations, IF-CorrNet simplifies the learning process and improves robustness, reducing overfitting on mismatched real-world recordings relative to existing models. Experimental results on the REVERB Challenge dataset demonstrate consistent improvements in reverberation and noise reduction over conventional methods.

Model Architecture

Figure 1: IF-CorrNet Architecture. The proposed IF-CorrNet consists of three main components: (1) an Inter-Frame Correlation Module that captures temporal dependencies by computing correlation features across adjacent frames, (2) a Backbone TF-modelling Network, based on a modified Locoformer, that processes the temporal and frequency sequences, and (3) a Multi-Frame Filter Estimation Module that predicts spectral filters spanning multiple frames. Unlike conventional single-frame masking approaches, our method explicitly models the temporal structure of reverberation through inter-frame correlations, enabling more accurate and robust speech dereverberation.
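The first and third components can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the exact correlation definition, the number of frames, and the use of past frames only are assumptions made for illustration, and the function names (stack_past_frames, inter_frame_correlation, apply_multiframe_filter) are hypothetical. The backbone network that would map correlation features to filter coefficients is omitted.

```python
import numpy as np


def stack_past_frames(X, num_taps):
    """Return S with S[f, t, k] = X[f, t - k], zero-padded before t = 0.

    X is a complex STFT of shape (F, T). Using only past frames is an
    assumption of this sketch; the source only says "adjacent frames".
    """
    num_frames = X.shape[1]
    padded = np.pad(X, ((0, 0), (num_taps - 1, 0)))
    return np.stack(
        [
            padded[:, num_taps - 1 - k : num_taps - 1 - k + num_frames]
            for k in range(num_taps)
        ],
        axis=-1,
    )


def inter_frame_correlation(X, num_taps):
    """Assumed feature: C[f, t, k] = X[f, t] * conj(X[f, t - k]).

    For k = 0 this is the power |X[f, t]|^2; for k > 0 it is the
    instantaneous correlation between the current and a past frame.
    """
    return X[..., None] * np.conj(stack_past_frames(X, num_taps))


def apply_multiframe_filter(X, W):
    """Multi-frame filtering: Y[f, t] = sum_k W[f, t, k] * X[f, t - k].

    W has shape (F, T, K). In the full system W would be predicted by the
    network; here it is simply an input.
    """
    return np.sum(W * stack_past_frames(X, W.shape[-1]), axis=-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((257, 100)) + 1j * rng.standard_normal((257, 100))

    features = inter_frame_correlation(X, num_taps=5)

    # Stand-in for the network output: a filter that passes the current frame.
    W = np.zeros(features.shape, dtype=complex)
    W[..., 0] = 1.0
    Y = apply_multiframe_filter(X, W)
    print(features.shape, Y.shape)
```

A single-frame mask is the special case of one tap (K = 1), which is why a multi-frame filter can in principle subtract energy that leaked in from earlier frames while a mask can only rescale the current bin.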

SRMR Results on RealData of REVERB Challenge Dataset

Speech-to-Reverberation Modulation Energy Ratio (SRMR) is a non-intrusive metric that measures speech quality in reverberant conditions. Because it requires no clean reference signal, it can be computed on real recordings. Higher SRMR scores indicate better perceptual quality with less reverberation.

Method	SRMR
Input (Reverberant)	3.180
SF-Raw + SF-Mask	7.245
SF-Raw + MF-Filter	7.225
SF-Raw + Mapping	6.628
IF-Corr + MF-Filter (Ours)	7.548
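
To make the differences in the table easier to read, the short sketch below computes each method's SRMR gain over the reverberant input and the margin of the proposed configuration over the strongest baseline. The scores are copied from the table; the variable names are only illustrative.

```python
# SRMR scores copied from the table above.
srmr = {
    "Input (Reverberant)": 3.180,
    "SF-Raw + SF-Mask": 7.245,
    "SF-Raw + MF-Filter": 7.225,
    "SF-Raw + Mapping": 6.628,
    "IF-Corr + MF-Filter (Ours)": 7.548,
}

INPUT = "Input (Reverberant)"
OURS = "IF-Corr + MF-Filter (Ours)"

# Absolute SRMR gain of each method over the unprocessed input.
gains = {
    name: round(score - srmr[INPUT], 3)
    for name, score in srmr.items()
    if name != INPUT
}

# Strongest baseline, i.e. the best method other than the input and ours.
baselines = {k: v for k, v in srmr.items() if k not in (INPUT, OURS)}
best_baseline = max(baselines, key=baselines.get)
margin = round(srmr[OURS] - srmr[best_baseline], 3)

for name, gain in gains.items():
    print(f"{name}: +{gain:.3f}")
print(f"Margin over {best_baseline}: +{margin:.3f}")
```

On these numbers the proposed configuration gains 4.368 SRMR over the input and leads the best baseline (SF-Raw + SF-Mask) by 0.303.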

Real-world Recording Samples (RealData)

Comparative results on the REVERB Challenge RealData evaluation set. Select a sample to compare the input with different architectural configurations.
