arxiv:2510.22603

Mitigating Attention Sinks and Massive Activations in Audio-Visual Speech Recognition with LLMS

Published on Oct 26

· Submitted by

Umberto Cappellazzo on Oct 28

Imperial College London

Upvote

Authors:

Umberto Cappellazzo ,

Abstract

Attention sinks and massive activations in multimodal speech recognition are identified and mitigated using a decorrelation loss, improving word error rate under high feature downsampling.

AI-generated summary

Large language models (LLMs) have recently advanced auditory speech recognition (ASR), visual speech recognition (VSR), and audio-visual speech recognition (AVSR). However, understanding of their internal dynamics under fine-tuning remains limited. In natural language processing, recent work has revealed attention sinks, tokens that attract disproportionately high attention, and associated massive activations in which some features of sink tokens exhibit huge activation in LLMs. In this work, we are the first to study these phenomena in multimodal speech recognition. Through a detailed analysis of audio-visual LLMs, we identify attention sinks and massive activations not only at the BOS token but also at intermediate low-semantic tokens across ASR, VSR, and AVSR. We show that massive activations originate in the MLP layers and correspond to fixed feature indices across all sink tokens. We further show that intermediate sink tokens exhibit high cosine similarity to the BOS token, thereby amplifying attention and activation. Building on these insights, we introduce a simple decorrelation loss that reduces cosine similarity between BOS and other tokens, effectively mitigating intermediate sinks and massive activations. Furthermore, our method improves word error rate (WER) under high audio-visual feature downsampling while remaining stable at lower downsampling rates.

View arXiv page View PDF GitHub 40 Add to collection

Community

hisoka94

Paper author Paper submitter about 19 hours ago

We study the attention sinks and massive activations phenomena for audio-visual LLMs. We identify attention sinks and massive activations not only at the BOS token but also at intermediate
low-semantic tokens across ASR, VSR, and AVSR tasks. We show that massive activations originate in the MLP layers and correspond to fixed feature indices across all sink tokens. We further show that
intermediate sink tokens exhibit high cosine similarity to the BOS token, thereby amplifying attention and activation. Building on these insights, we introduce a simple decorrelation loss that reduces cosine
similarity between BOS and other tokens, effectively mitigating intermediate sinks and massive activations. Furthermore, our method improves word error rate (WER) under high audio-visual feature
downsampling while remaining stable at lower downsampling rates.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2510.22603 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2510.22603 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2510.22603 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.