MambaGesture2: Co-Speech Gesture Generation via Hierarchical Fusion and Spatiotemporal Aggregation

Demo Video

Watch our demo video to see MambaGesture2 in action!

Abstract

Co-speech gesture generation is crucial for producing synchronized and realistic human gestures, thereby enhancing the animation of lifelike avatars in virtual environments. While diffusion models have demonstrated remarkable generative capabilities, pairing them with transformer backbones incurs the quadratic complexity of attention and thus high resource consumption. Furthermore, because gesture generation is a time-series task, current models often struggle to capture multi-scale temporal dynamics. To address these challenges, we introduce MambaGesture2, a novel framework that integrates a Mamba-based network, Hierarchical U-Net Gesture Mamba (HUG-Mamba), with a multi-modality feature fusion module, SEAD. HUG-Mamba combines the sequential modeling strengths of Mamba with a U-Net architecture, significantly enhancing the temporal coherence of generated gestures. Specifically, we propose the Temporal-Stratified Fusion (TSF) module, which employs multi-scale learning to capture temporal information at diverse resolutions, and we design the Spatial-Temporal Cascaded Aggregation (STCA) module to improve both spatial and temporal modeling. Our approach is rigorously evaluated on the multi-modal BEAT2 and SHOW datasets, demonstrating consistent improvements across a comprehensive set of metrics and achieving state-of-the-art performance in co-speech gesture generation. For more information, please visit the project link: https://fcchit.github.io/mambagesture2.
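To make the overall architecture concrete, below is a minimal PyTorch sketch of a hierarchical U-Net denoiser of the kind described in the abstract: a gesture sequence and fused conditioning features are processed at full and downsampled temporal resolutions with skip connections. All class names, dimensions, and hyperparameters are illustrative assumptions rather than the authors' implementation, and a bidirectional GRU stands in for the Mamba sequence block (and for TSF/STCA) so the example stays self-contained and runnable.

```python
# Hypothetical sketch only: names, shapes, and layers are assumptions,
# not the released MambaGesture2 code.
import torch
import torch.nn as nn


class SeqBlock(nn.Module):
    """Stand-in for a Mamba block: any sequence model over (B, T, C)."""
    def __init__(self, dim):
        super().__init__()
        self.rnn = nn.GRU(dim, dim // 2, batch_first=True, bidirectional=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        out, _ = self.rnn(x)
        return self.norm(x + out)  # residual connection


class HUGMambaSketch(nn.Module):
    """U-Net over time: encode -> downsample -> bottleneck -> upsample with skip."""
    def __init__(self, gesture_dim=141, cond_dim=256, width=256):
        super().__init__()
        self.in_proj = nn.Linear(gesture_dim + cond_dim, width)
        self.enc = SeqBlock(width)
        self.down = nn.Conv1d(width, width, kernel_size=4, stride=2, padding=1)
        self.mid = SeqBlock(width)
        self.up = nn.ConvTranspose1d(width, width, kernel_size=4, stride=2, padding=1)
        self.dec = SeqBlock(width)
        self.out_proj = nn.Linear(width, gesture_dim)

    def forward(self, noisy_gesture, cond):
        # noisy_gesture: (B, T, gesture_dim); cond: (B, T, cond_dim) fused speech/text/style
        x = self.in_proj(torch.cat([noisy_gesture, cond], dim=-1))
        skip = self.enc(x)                                    # full temporal resolution
        x = self.down(skip.transpose(1, 2)).transpose(1, 2)   # coarser time scale
        x = self.mid(x)
        x = self.up(x.transpose(1, 2)).transpose(1, 2)        # back to full resolution
        x = self.dec(x + skip)                                # U-Net skip connection
        return self.out_proj(x)                               # denoised gesture estimate


if __name__ == "__main__":
    model = HUGMambaSketch()
    pose = torch.randn(2, 64, 141)   # batch of noisy gesture sequences
    cond = torch.randn(2, 64, 256)   # fused conditioning features (e.g., from SEAD)
    print(model(pose, cond).shape)   # torch.Size([2, 64, 141])
```

The multi-scale structure (full-resolution and downsampled branches joined by skip connections) is what the TSF and STCA modules refine in the actual model; here it is reduced to a single down/up stage purely for illustration.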