Every Question Has Its Own Value: Reinforcement Learning with Explicit Human Values Paper • 2510.20187 • Published 5 days ago • 17
StreamingVLM: Real-Time Understanding for Infinite Video Streams Paper • 2510.09608 • Published 17 days ago • 48
Glyph: Scaling Context Windows via Visual-Text Compression Paper • 2510.17800 • Published 7 days ago • 60
Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis Paper • 2509.09595 • Published Sep 11 • 48
Embody 3D: A Large-scale Multimodal Motion and Behavior Dataset Paper • 2510.16258 • Published 10 days ago • 6
Spatial Forcing: Implicit Spatial Representation Alignment for Vision-language-action Model Paper • 2510.12276 • Published 13 days ago • 141
OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM Paper • 2510.15870 • Published 10 days ago • 77
Latent Diffusion Model without Variational Autoencoder Paper • 2510.15301 • Published 11 days ago • 45
Ponimator: Unfolding Interactive Pose for Versatile Human-human Interaction Animation Paper • 2510.14976 • Published 11 days ago • 3
From Pixels to Words -- Towards Native Vision-Language Primitives at Scale Paper • 2510.14979 • Published 11 days ago • 64
UniFusion: Vision-Language Model as Unified Encoder in Image Generation Paper • 2510.12789 • Published 13 days ago • 16
D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI Paper • 2510.05684 • Published 20 days ago • 132
UniVideo: Unified Understanding, Generation, and Editing for Videos Paper • 2510.08377 • Published 18 days ago • 66
Rolling Forcing: Autoregressive Long Video Diffusion in Real Time Paper • 2509.25161 • Published 28 days ago • 23
OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models Paper • 2509.17627 • Published Sep 22 • 65