AI & ML interests

None defined yet.

andito 
posted an update 2 months ago
view post
Post
1878
Finally, our new paper is out! "𝗙𝗶𝗻𝗲𝗩𝗶𝘀𝗶𝗼𝗻: 𝗢𝗽𝗲𝗻 𝗗𝗮𝘁𝗮 𝗜𝘀 𝗔𝗹𝗹 𝗬𝗼𝘂 𝗡𝗲𝗲𝗱"! 🥳
FineVision: Open Data Is All You Need (2510.17269)

If you've ever trained a VLM, you know this problem: nobody shares their data mixtures. It's a black box, making replicating SOTA work impossible.
We wanted to change that.

FineVision unifies 200 sources into 24 million samples. With 17.3 million images and 9.5 billion answer tokens, it's the largest open resource of its kind.

In the paper, we share how we built it:
🔍 finding and cleaning data at scale
🧹 removing excessive duplicates across sources
🤗 decontaminating against 66 public benchmarks

My favorite part is Figure 6 (in the video!). It's our visual diversity analysis. It shows that FineVision isn't just bigger; it's more balanced and conceptually richer than other open datasets.
NVIDIA's Eagle 2 paper highlighted just how critical this visual diversity is, and our results confirm it: models trained on FineVision consistently outperform those trained on any other open dataset on 11 benchmarks!

🎉 To celebrate the paper, I’m also releasing a concatenated and shuffled version of the full dataset! 👉HuggingFaceM4/FineVision_full_shuffled

It’s ready to stream, so you can start training your own models right away:

from datasets import load_dataset
d = load_dataset("HuggingFaceM4/FineVision_full_shuffled", split="train", streaming=True)
print(next(iter(d)))

A big shoutout to the first authors: Luis Wiedmann and Orr Zohar. They are rockstars!
merve 
posted an update 2 months ago
view post
Post
7276
deepseek-ai/DeepSeek-OCR is out! 🔥 my take ⤵️
> pretty insane it can parse and re-render charts in HTML
> it uses CLIP and SAM features concatenated, so better grounding
> very efficient per vision tokens/performance ratio
> covers 100 languages
·
multimodalart 
posted an update 2 months ago
view post
Post
7579
Want to iterate on a Hugging Face Space with an LLM?

Now you can easily convert any HF entire repo (Model, Dataset or Space) to a text file and feed it to a language model!

multimodalart/repo2txt
merve 
posted an update 3 months ago
view post
Post
6798
large AI labs open-sourced a ton of models last week 🔥
here's few picks, find even more here merve/sep-16-releases-68d13ea4c547f02f95842f05 🤝
> IBM released a new Docling model with 258M params based on Granite (A2.0) 📝 ibm-granite/granite-docling-258M
> Xiaomi released 7B audio LM with base and instruct variants (MIT) XiaomiMiMo/mimo-audio-68cc7202692c27dae881cce0
> DecartAI released Lucy Edit, open Nano Banana 🍌 (NC) decart-ai/Lucy-Edit-Dev
> OpenGVLab released a family of agentic computer use models (3B/7B/32B) with the dataset 💻 OpenGVLab/scalecua-68c912cf56f7ff4c8e034003
> Meituan Longcat released thinking version of LongCat-Flash 💭 meituan-longcat/LongCat-Flash-Thinking
  • 2 replies
·
merve 
posted an update 3 months ago
view post
Post
3382
IBM just released small swiss army knife for the document models: granite-docling-258M on Hugging Face 🔥

> not only a document converter but also can do document question answering, understand multiple languages 🤯
> best part: released with Apache 2.0 license 👏 use it with your commercial projects!
> it supports transformers, vLLM and MLX from the get-go! 🤗
> built on SigLIP2 & granite-165M

model: ibm-granite/granite-docling-258M
demo: ibm-granite/granite-docling-258m-demo 💗
merve 
posted an update 3 months ago
merve 
posted an update 3 months ago
view post
Post
1002
fan-favorite vision LM Florence-2 is now officially supported in transformers 🤗

find all the models in
florence-community
org 🫡
merve 
posted an update 4 months ago
merve 
posted an update 4 months ago
merve 
posted an update 4 months ago
view post
Post
6286
large AI labs have dropped so many open models last week 🔥 don't miss out on them

→ Apple released on-device vision LMs apple/fastvlm-68ac97b9cd5cacefdd04872e & apple/mobileclip2-68ac947dcb035c54bcd20c47
→ OpenGVLab released InternVL3.5, 32 new vision LMs with one based on gpt-oss! (OS) OpenGVLab/internvl35-68ac87bd52ebe953485927fb
→ MSFT released a killer small TTS model (OS) microsoft/VibeVoice-1.5B

find more herehttps://huggingface.co/collections/merve/august-29-releases-68b5a3754cfb8abf59e2b486
  • 1 reply
·
merve 
posted an update 4 months ago
merve 
posted an update 5 months ago
view post
Post
3318
GPT-4.1-mini level model right in your iPhone 🤯

openbmb/MiniCPM-V-4 is only 4B while surpassing GPT-4.1-mini in vision benchmarks 🔥

allows commercial use as well!
merve 
posted an update 5 months ago
view post
Post
1190
we're all sleeping on this OCR model rednote-hilab/dots.ocr 🔥

dots.ocr is a new 3B model with sota performance, support for 100 languages & allowing commercial use! 🤯

single e2e model to extract image, convert tables, formula, and more into markdown 📝
try it MohamedRashad/Dots-OCR
merve 
posted an update 5 months ago
view post
Post
706
massive releases and tons of Flux 1. Krea LoRas past week!
here's some of the picks, find more models in collection 🫡 merve/releases-august-2-6890c14248203522b7d0267f

LLMs 💬
> Tencent dropped tencent/Hunyuan-7B-Instruct
> Qwen released Qwen/Qwen3-Coder-30B-A3B-Instruct, 30B MoE with 3B params for coding (OS)

vision/multimodal
> RedNote released rednote-hilab/dots.ocr - 3B OCR model (OS)
> Cohere released CohereLabs/command-a-vision-07-2025 - 112B (dense!) VLM for 6 languages
> StepFun-AI shipped stepfun-ai/step3 - 321B MoE VLM (OS)
> Skywork shipped Skywork/Skywork-UniPic-1.5B - new any-to-any model (image+text → image+text) (OS)
merve 
posted an update 5 months ago
merve 
posted an update 5 months ago
view post
Post
3676
past week in open AI was insane 🔥 here's some of picks, find more here merve/releases-july-25-688768ca47fe3693407e02d1

💬 LLMs & VLMs
> Qwen/Qwen3-235B-A22B-Thinking-2507 had a new update (OS)
> Qwen/Qwen3-Coder-480B-A35B-Instruct is out with 480B total 35B active params 🤯 (OS)
> AllenAI dropped an update to allenai/olmOCR-7B-0725 📝
> InternLM released internlm/Intern-S1 - 235B Qwen3 MoE + 6B InternViT encoder (OS)
> OmniSVG/OmniSVG is a new SVG generation VLM (OS)

🖼️ image/video/3D generation
> WanAI released Wan2.2 series - both T2V and I2V 14B models for high-quality video generation (OS) multimodalart/wan-22-688767e313337b434ed55112
> Tencent dropped tencent/HunyuanWorld-1 - image-to-3D scene generation model
  • 1 reply
·
merve 
posted an update 5 months ago
view post
Post
4403
🤯 241B VLM with apache-2.0 license internlm/Intern-S1

internlm released Intern-S1: multimodal reasoning model based on 235B MoE Qwen3 and 6B InternViT 😍

benchmarks look great (👑 best model ✅ best open model)