deepseek-ai/DeepSeek-OCR is out! ๐ฅ my take โคต๏ธ > pretty insane it can parse and re-render charts in HTML > it uses CLIP and SAM features concatenated, so better grounding > very efficient per vision tokens/performance ratio > covers 100 languages
๐ New blog: Maintain the unmaintainable โ 1M+ Python LOC, 400+ models
How do you stop a million-line library built by thousands of contributors from collapsing under its own weight? At ๐ค Transformers, we do it with explicit software-engineering tenets, principles that make the codebase hackable at scale.
๐ Inside the post: โ One Model, One File: readability first โ you can still open a modeling file and see the full logic, top to bottom. โ Modular Transformers: visible inheritance that cuts maintenance cost by ~15ร while keeping models readable. โ Config-Driven Performance: FlashAttention, tensor parallelism, and attention scheduling are config-level features, not rewrites.
Written with @lysandre,@pcuenq and @yonigozlan, this is a deep dive into how Transformers stays fast, open, and maintainable.
IBM just released small swiss army knife for the document models: granite-docling-258M on Hugging Face ๐ฅ
> not only a document converter but also can do document question answering, understand multiple languages ๐คฏ > best part: released with Apache 2.0 license ๐ use it with your commercial projects! > it supports transformers, vLLM and MLX from the get-go! ๐ค > built on SigLIP2 & granite-165M
first vision language model built off openai/gpt-oss-20b just dropped! ๐ฅ
InternVL3.5 comes with 32 models ๐คฏ pre-trained, fine-tuned, aligned in various sizes OpenGVLab/internvl35-68ac87bd52ebe953485927fb comes with gpt-oss or Qwen3 for LLM part โคต๏ธ
Okay this is insane... WebGPU-accelerated semantic video tracking, powered by DINOv3 and Transformers.js! ๐คฏ Demo (+ source code): webml-community/DINOv3-video-tracking
This will revolutionize AI-powered video editors... which can now run 100% locally in your browser, no server inference required (costs $0)! ๐
How does it work? ๐ค 1๏ธโฃ Generate and cache image features for each frame 2๏ธโฃ Create a list of embeddings for selected patch(es) 3๏ธโฃ Compute cosine similarity between each patch and the selected patch(es) 4๏ธโฃ Highlight those whose score is above some threshold
... et voilร ! ๐ฅณ
You can also make selections across frames to improve temporal consistency! This is super useful if the object changes its appearance slightly throughout the video.