1. INTRODUCTION
Past research leveraging CIF [1] has made strides in closing the gap between speech and text, as evidenced by studies such as [2], [3], [4], [5], [6]. However, these approaches are confined to token-level integration, demanding strict architectural agreement between the speech and text components in terms of their output token sets. While it is desirable to break through these architectural constraints, obtaining acoustic word embeddings (AWEs) for streaming ASR remains challenging: existing methods are either architecturally unsuitable for streaming [7] or require explicit timestamp durations for each word [8].
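To make the token-level integration constraint concrete, the following is a minimal sketch of the integrate-and-fire step at the core of CIF [1]: per-frame weights are accumulated until a threshold is crossed, at which point a token-level embedding is emitted. The function name, tensor shapes, and the assumption that the encoder supplies weights `alpha` in [0, 1] are illustrative, not the exact formulation of [1].

```python
import torch

def cif(hidden, alpha, threshold=1.0):
    """Sketch of Continuous Integrate-and-Fire (CIF).

    hidden: (T, D) encoder states; alpha: (T,) per-frame weights in [0, 1].
    Accumulate alpha frame by frame; when the running sum crosses
    `threshold`, "fire" a token embedding as the weighted sum of the
    contributing frames, carrying leftover weight into the next token.
    """
    fired = []
    accum = 0.0
    integrated = torch.zeros_like(hidden[0])
    for t in range(hidden.size(0)):
        a = alpha[t].item()
        if accum + a < threshold:
            accum += a
            integrated = integrated + a * hidden[t]
        else:
            # Split this frame's weight at the firing boundary.
            used = threshold - accum
            fired.append(integrated + used * hidden[t])
            accum = a - used                 # leftover starts the next token
            integrated = accum * hidden[t]
    return torch.stack(fired) if fired else torch.empty(0, hidden.size(1))

# Toy usage: 20 frames of 4-dim states; a real model predicts alpha.
h = torch.randn(20, 4)
alpha = torch.sigmoid(torch.randn(20))
tokens = cif(h, alpha)                        # (num_fired, D) embeddings
```

Because each fired vector is trained against a unit in the text model's token inventory, the speech and text sides must share that inventory, which is precisely the token-level coupling the above passage describes.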