Xinyuan Qian - IEEE Xplore Author Profile

Showing 1-25 of 31 results


Transformer and its derivatives have achieved success in diverse tasks across computer vision, natural language processing, and speech processing. To reduce the complexity of computations within the multi-head self-attention mechanism in Transformer, Selective State Space Models (i.e., Mamba) were proposed as an alternative. Mamba exhibited its effectiveness in natural language processing and comp...
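
As context for this abstract: a selective state-space model replaces the O(T^2) pairwise attention computation with a linear-time recurrent scan whose parameters depend on the input at each step. Below is a minimal sketch of that recurrence, assuming illustrative shapes and input-dependent B and C matrices; the names are hypothetical and this is not the authors' implementation.

    # Minimal sketch of a selective state-space recurrence (Mamba-style).
    # All shapes and names are illustrative, not the paper's implementation.
    import numpy as np

    def selective_scan(x, A, B_t, C_t):
        """x: (T, D) inputs; A: (N,) state decay; B_t, C_t: (T, N) input-dependent."""
        T, D = x.shape
        N = A.shape[0]
        h = np.zeros((N, D))              # hidden state carried across time
        y = np.zeros_like(x)
        for t in range(T):                # O(T), vs O(T^2) for full self-attention
            h = A[:, None] * h + np.outer(B_t[t], x[t])   # h_t = A h_{t-1} + B_t x_t
            y[t] = C_t[t] @ h                             # y_t = C_t h_t
        return y

The loop is linear in sequence length, which is what makes Mamba-style models attractive for long sequences; practical implementations replace this Python loop with a hardware-friendly parallel scan.
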
Speech enhancement plays an essential role in various applications, and the integration of visual information has been demonstrated to bring substantial advantages. However, the majority of current research concentrates on the examination of facial and lip movements, which can be compromised or entirely inaccessible in scenarios where occlusions occur or when the camera view is distant. Whereas co...
Automated recognition of bird vocalizations (BVs) is essential for biodiversity monitoring through passive acoustic monitoring (PAM), yet deep learning (DL) models encounter substantial challenges in open environments. These include difficulties in detecting unknown classes, extracting species-specific features, and achieving robust cross-corpus recognition. To address these challenges, this lette...
Acoustic Impulse Response (AIR) provides crucial spatial information about the environment, significantly enhancing audio immersion. However, achieving high perceptual quality while computing AIR in real-time for interactive audio-video media (IAVM) presents a challenging problem. This study proposes the Mesh to Parametric AIR (M2PAIR), a method for computing AIR designed for IAVM. M2PAIR integrat...
Audio-visual target speaker extraction (AV-TSE) aims to extract a specific person's speech from the audio mixture given auxiliary visual cues. Previous methods usually search for the target voice through speech-lip synchronization. However, this strategy mainly focuses on the existence of target speech, while ignoring the variations of the noise characteristics, i.e., interference speaker and the...
Audio-visual active speaker detection (AV-ASD) aims to identify which visible face is speaking in a scene with one or more persons. Most existing AV-ASD methods prioritize capturing speech-lip correspondence. However, there is a noticeable gap in addressing the challenges from real-world AV-ASD scenarios. Due to the presence of low-quality noisy videos in such cases, AV-ASD systems without a selec...
Sound Source Localization (SSL) is an enabling technology for applications such as surveillance and robotics. While traditional Signal Processing (SP)-based SSL methods provide analytic solutions under specific signal and noise assumptions, recent Deep Learning (DL)-based methods have significantly outperformed them. However, their success depends on extensive training data a...
Audio and visual signals complement each other in human speech perception, and the same applies to automatic speech recognition. The visual signal is less evident than the acoustic signal, but more robust in a complex acoustic environment, as far as speech perception is concerned. It remains a challenge how to effectively exploit the interaction between audio and visual signals for automatic speec...
Particle filters (PFs) have been widely used in speaker tracking due to their capability in modeling a non-linear process or a non-Gaussian environment. However, particle filters are limited by several issues. For example, pre-defined handcrafted measurements are often used, which can limit the model performance. In addition, the transition and update models are often preset, which makes PF less flex...
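
For readers unfamiliar with the baseline method: a bootstrap particle filter alternates predict, reweight, and resample steps. The following is a minimal 1-D sketch, assuming a preset Gaussian random-walk transition and a preset Gaussian measurement likelihood (the kind of fixed models the abstract critiques); all names and values are illustrative.

    # Minimal bootstrap particle filter step for 1-D speaker tracking.
    # Transition and measurement models are preset Gaussians (illustrative only).
    import numpy as np

    rng = np.random.default_rng(0)

    def pf_step(particles, weights, z, q_std=0.1, r_std=0.5):
        # Predict: propagate particles through a preset random-walk transition.
        particles = particles + rng.normal(0.0, q_std, size=particles.shape)
        # Update: reweight by the Gaussian likelihood of measurement z.
        weights = weights * np.exp(-0.5 * ((z - particles) / r_std) ** 2)
        weights = weights / weights.sum()
        # Resample when the effective sample size collapses.
        if 1.0 / np.sum(weights ** 2) < 0.5 * len(particles):
            idx = rng.choice(len(particles), size=len(particles), p=weights)
            particles = particles[idx]
            weights = np.full(len(particles), 1.0 / len(particles))
        return particles, weights

    # Usage sketch: particles = rng.normal(0.0, 1.0, 500); weights = np.full(500, 1/500)
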
Audio-visual deepfake detection is the process of identifying and detecting deepfakes that have been generated using both audio and visual content with AI algorithms. Most existing methods primarily focus on the overall authenticity while neglecting the position of forgeries in time. This can be particularly problematic, as even a small alteration in a clip can significantly impact its meaning. Su...
Cross-modal Retrieval (CMR) is formulated for scenarios where the queries and retrieval results are of different modalities. Existing CMR studies mainly focus on the common contextualized information between text transcripts and images, and the synchronized event information in audio-visual recordings. Unlike all previous works, in this article, we investigate the geome...
Talking face generation, also known as speech-to-lip generation, reconstructs facial motions around the lips given coherent speech input. Previous studies revealed the importance of lip-speech synchronization and visual quality. Despite much progress, they hardly focus on the content of lip movements, i.e., the visual intelligibility of the spoken words, which is an important aspect of generati...
GelSight sensors that estimate contact geometry and force by reconstructing the deformation of their soft elastomer from images would yield poor force measurements when the elastomer deforms uniformly or reaches deformation saturation. Here we present an L³ F-TOUCH sensor that considerably enhances the three-axis force sensing capability of typical GelSight sensors. Specifically, the L³ ...
In conjunction with huge recent progress in camera and computer vision technology, camera-based sensors have increasingly shown considerable promise in relation to tactile sensing. In comparison to competing technologies (be they resistive, capacitive, or magnetic based), they offer super-high resolution while suffering from fewer wiring problems. The human tactile system is composed of various t...
The use of Transformer represents a recent success in speech enhancement. However, as its core component, self-attention suffers from quadratic complexity, which is computationally prohibitive for long speech recordings. Moreover, it allows each time frame to attend to all time frames, neglecting the strong local correlations of speech signals. This study presents a simple yet effective sparse self...
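
The locality idea in this abstract can be illustrated with a banded attention mask in which each time frame attends only to frames within ±w of itself, reducing the effective cost from T^2 pairs to O(T·w). A single-head sketch follows, with an illustrative window size; this shows the general banded-mask technique, not the paper's exact sparsity pattern.

    # Sketch of local (banded) self-attention: each frame attends only to
    # frames within +/- w, reflecting the local correlation of speech signals.
    # Illustrative only; not the paper's exact sparsity pattern.
    import numpy as np

    def local_attention(Q, K, V, w=8):
        T, d = Q.shape
        scores = Q @ K.T / np.sqrt(d)                   # (T, T) attention logits
        idx = np.arange(T)
        mask = np.abs(idx[:, None] - idx[None, :]) > w  # True outside the band
        scores[mask] = -np.inf                          # block distant frames
        weights = np.exp(scores - scores.max(axis=1, keepdims=True))
        weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
        return weights @ V

A dense mask is used here for brevity; an efficient implementation would compute only the in-band scores to realize the O(T·w) cost.
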
Replay speech poses a growing threat to speaker verification systems, thus the detection of replay speech becomes increasingly important. A critical factor differentiating replay speech from genuine speech is the representation of device information. Replay speech carries physical device information that originates from the recording device, playback device, and environmental noise. In this work, a dev...
The traditional graphic user interface in healthcare-oriented consumer electronics faces challenges such as high operational complexity, time-consuming operations, and a high risk of infection. The adoption of voice user interface (VUI) could promote network automation with enhanced efficiency, reduced complexity, and lower operating expense in various applications. Given noisy operational environments, ...
Audio-visual signals can be used jointly for robotic perception as they complement each other. Such multi-modal sensory fusion has a clear advantage, especially under noisy acoustic conditions. Speaker localization, as an essential robotic function, was traditionally solved as a signal processing problem that now increasingly finds deep learning solutions. The question is how to fuse audio-visual ...
Speech enhancement plays an essential role in a wide range of speech processing applications. Recent studies on speech enhancement tend to investigate how to effectively capture the long-term contextual dependencies of speech signals to boost performance. However, these studies generally neglect the time-frequency (T-F) distribution information of speech spectral components, which is equally impor...
The adoption of voice user interface (VUI) will promote network automation with enhanced efficiency, reduced complexity, and lower operating expense in Industry 5.0. Given the noisy environments, speech denoising is indispensable for the VUI in the Internet of Things (IoT) or Industrial IoT (IIoT). Despite Transformer's recent success in speech denoising, the adopted full self-attention suffers from quad...
Speaker extraction seeks to extract the clean speech of a target speaker from a multi-talker mixture speech. There have been studies to use a pre-recorded speech sample or face image of the target speaker as the speaker cue. In human communication, co-speech gestures that are naturally timed with speech also contribute to speech perception. In this work, we explore the use of co-speech gestures se...
Generalized Cross Correlation (GCC) has been the most popular localization technique over the past decades and can be extended with a beamforming method, e.g., Steered Response Power (SRP), when multiple microphone pairs exist. Considering the promising results of Deep Learning (DL) strategies over classical approaches, in this work, instead of directly using GCC, SRP is d...
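
The classical baseline referenced here, GCC with the phase transform (GCC-PHAT), whitens the cross-power spectrum of one microphone pair and picks the lag of maximum correlation as the time difference of arrival; SRP-PHAT then sums such correlations over all pairs at the delays implied by each candidate direction. A standard single-pair sketch follows (function and argument names are illustrative):

    # Standard GCC-PHAT for one microphone pair: phase-transform-weighted
    # cross-power spectrum, inverse-transformed to a correlation over lags.
    import numpy as np

    def gcc_phat(x1, x2, fs, max_tau=None):
        n = len(x1) + len(x2)                           # zero-pad to avoid wrap-around
        X1, X2 = np.fft.rfft(x1, n=n), np.fft.rfft(x2, n=n)
        R = X1 * np.conj(X2)                            # cross-power spectrum
        cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n=n) # PHAT weighting
        max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
        cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
        tau = (np.argmax(np.abs(cc)) - max_shift) / fs  # TDoA estimate in seconds
        return tau, cc
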
Audio-visual tracking of an unknown number of concurrent speakers in 3D is a challenging task, especially when sound and video are collected with a compact sensing platform. In this paper, we propose a tracker that builds on generative and discriminative audio-visual likelihood models formulated in a particle filtering framework. We localize multiple concurrent speakers with a de-emphasized acoust...
In this work, we present the development of a new database, namely the Sound Localization and Classification (SLoClas) corpus, for studying and analyzing sound localization and classification. The corpus contains a total of 23.27 hours of data recorded using a 4-channel microphone array. 10 classes of sounds are played over a loudspeaker at a distance of 1.5 meters from the array, by varying the Direction-o...
Robotic audition is a basic sense that helps robots perceive the surroundings and interact with humans. Sound Source Localization (SSL) is an essential module for a robotic system. However, the performance of most sound source localization techniques degrades in noisy and reverberant environments due to inaccurate Time Difference of Arrival (TDoA) estimation. In robotic sound source localization, ...
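
To make the TDoA sensitivity concrete: under the far-field assumption, a TDoA estimate tau from a two-microphone pair maps to a direction of arrival via theta = arcsin(c * tau / d), where c is the speed of sound and d the microphone spacing, so an inaccurate tau in noisy, reverberant rooms translates directly into angular error. A tiny worked example with illustrative values:

    # Far-field TDoA-to-DoA conversion for a two-microphone array:
    # theta = arcsin(c * tau / d). All values below are illustrative.
    import numpy as np

    c = 343.0        # speed of sound (m/s)
    d = 0.10         # microphone spacing (m), assumed
    tau = 1.2e-4     # TDoA estimate (s), e.g. from a GCC-PHAT peak as sketched above
    theta = np.degrees(np.arcsin(np.clip(c * tau / d, -1.0, 1.0)))
    print(f"DoA: {theta:.1f} degrees off broadside")   # about 24.3 degrees here
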