Single page image inference using plain transformers not working
The example is not working.
The model loads, but then nothing happens.
torch==2.6.0+cu124
flash-attn==2.7.4.post1
transformers==4.51.3
Ubuntu 22.04.5 LTS, Python 3.10.12
Update: updating transformers didn't help.
Same here (without flash-attn).
@zoldaten @cassandragemini I just tested this on a RedHat system with the same library versions on Python 3.11, and it works both with and without flash-attn. Could you share more details about your environment so we can narrow down the difference? Also, what exactly do you get at "model loaded but nothing happens." — is it hanging?
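For the environment details, a quick snippet like this (just a sketch to collect versions) would help:

import sys, platform
import torch, transformers

print("python:", sys.version)
print("platform:", platform.platform())
print("torch:", torch.__version__, "| cuda available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)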
| "model loaded but nothing happens.", is it hanging
I see the model download from HF, and then the CPU cores get loaded. And nothing else: no errors, just busy CPUs.
I see there are a few things we can debug. Which Python version are you using?
Also, can you perhaps debug by adding prints before model load, after model load, and before generate, so we can narrow down whether it's in model loading or generation?
Does it work on CPU?
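For example, roughly like this (just a sketch of the instrumentation I mean; the model id matches the snippets below, "page.png" is a placeholder path):

import time
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

t0 = time.time()
print("before model load")
processor = AutoProcessor.from_pretrained("ibm-granite/granite-docling-258M")
model = AutoModelForVision2Seq.from_pretrained("ibm-granite/granite-docling-258M").to("cpu")
print(f"after model load ({time.time() - t0:.1f}s)")

image = load_image("page.png")  # placeholder path
messages = [{"role": "user", "content": [{"type": "image"},
                                         {"type": "text", "text": "Convert this page to markdown."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

print(f"before generate ({time.time() - t0:.1f}s)")
out = model.generate(**inputs, max_new_tokens=32)  # keep this small while debugging
print(f"after generate ({time.time() - t0:.1f}s), output shape: {out.shape}")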
import torch
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image
from pathlib import Path

#DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DEVICE = "cpu"

# Load images
image = load_image("123.png")
#image = load_image("https://arxiv.org/pdf/2408.09869")

# Initialize processor and model
processor = AutoProcessor.from_pretrained("ibm-granite/granite-docling-258M")
model = AutoModelForVision2Seq.from_pretrained(
    "ibm-granite/granite-docling-258M",
    #torch_dtype=torch.bfloat16,
    #_attn_implementation="flash_attention_2" if DEVICE == "cuda" else "sdpa",
).to(DEVICE)
print(1)

# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to markdown."}
        ]
    },
]

# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = inputs.to(DEVICE)
print(2)

# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=8192)
prompt_length = inputs.input_ids.shape[1]
trimmed_generated_ids = generated_ids[:, prompt_length:]
doctags = processor.batch_decode(
    trimmed_generated_ids,
    skip_special_tokens=False,
)[0].lstrip()
print(f"DocTags: \n{doctags}\n")

# Populate document
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
# create a docling document
doc = DoclingDocument.load_from_doctags(doctags_doc, document_name="Document")
print(f"Markdown:\n{doc.export_to_markdown()}\n")
I see prints 1 and 2.
Which python version are you using?
I don't have my computer to hand, so I don't have many details, but on my side it was with Python 3.13. I tried both CUDA (with an RTX 2060) and CPU, and in both cases the GPU/CPU was fully loaded but nothing happened.
@asnassar as I pointed out, Python 3.10.12. I also tried 3.11.
BTW this code works:
from docling.document_converter import DocumentConverter
source = "333.png" # document per local path or URL
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())
Can you let me know what prints out when you run this:
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

print("Loading processor")
processor = AutoProcessor.from_pretrained("ibm-granite/granite-docling-258M")

print("Loading model")
model = AutoModelForVision2Seq.from_pretrained(
    "ibm-granite/granite-docling-258M",
    torch_dtype=torch.bfloat16,
)

image = load_image("https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/assets/new_arxiv.png")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to markdown."}
        ]
    },
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

device = torch.device("cpu")  # Force CPU
print(f"Using device: {device}")
inputs = inputs.to(device)
model = model.to(device)
print("moved to device")

print("Testing forward pass")
with torch.no_grad():
    outputs = model(**inputs)
print("Forward pass completed")

print("Testing model.generate with minimal settings")
with torch.no_grad():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=10,
        do_sample=False,
        use_cache=False,
        pad_token_id=processor.tokenizer.eos_token_id
    )
print("Generated IDs:", generated_ids.shape)

print("Testing with different attention implementation")
model.config._attn_implementation = "eager"
with torch.no_grad():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=10,
        do_sample=False,
        use_cache=False
    )
print("Generated IDs:", generated_ids.shape)

print("Testing with manual generation loop")
with torch.no_grad():
    input_ids = inputs["input_ids"]
    for i in range(5):  # Generate 5 tokens manually
        outputs = model(input_ids=input_ids)
        next_token = outputs.logits[:, -1, :].argmax(dim=-1)
        input_ids = torch.cat([input_ids, next_token.unsqueeze(1)], dim=1)
        print(f"Generated token {i+1}: {next_token.item()}")

print("Done")
Using device: cpu
moved to device
Testing forward pass
Forward pass completed
Testing model.generate with minimal settings
Generated IDs: torch.Size([1, 1151])
Testing with different attention implementation
I'm also seeing the hang during generate:
(dmf) ghart@Mac [granite-docling-258M]$ pip freeze | grep transformers
sentence-transformers==5.1.0
transformers==4.50.3
(dmf) ghart@Mac [granite-docling-258M]$ pip freeze | grep torch
torch==2.5.1
torchvision==0.20.1
(dmf) ghart@Mac [granite-docling-258M]$ python --version
Python 3.11.9
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image
import torch

model_path = "/Users/ghart/models/ibm-granite/granite-docling-258M"

processor = AutoProcessor.from_pretrained(model_path)
image = load_image("/Users/ghart/Pictures/sample-image.png")
model = AutoModelForVision2Seq.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to markdown."}
        ]
    },
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

with torch.no_grad():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=10,
        do_sample=False,
        use_cache=False,
        pad_token_id=processor.tokenizer.eos_token_id
    )
print("Generated IDs:", generated_ids.shape)
If I switch to using mps, it terminates cleanly. I'm on an M3 MacBook Pro w/ 64GB.
It's also possible that I'm just impatient with the CPU version. If I set max_new_tokens=1000, it runs for a long time, even with mps (still running for me).
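For reference, the device switch I'm toggling is roughly this (a sketch only; model and inputs come from the script above, and the cuda -> mps -> cpu fallback order is just my preference):

import torch

# Pick the fastest available backend, falling back to CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
print(f"Using device: {device}")

model = model.to(device)
inputs = inputs.to(device)
# Keep max_new_tokens small while testing; CPU generation is very slow.
generated_ids = model.generate(**inputs, max_new_tokens=10)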
@zoldaten thanks for your responses. Can you try:
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

print("Loading processor")
processor = AutoProcessor.from_pretrained("ibm-granite/granite-docling-258M")

print("Loading model")
model = AutoModelForVision2Seq.from_pretrained(
    "ibm-granite/granite-docling-258M",
    torch_dtype=torch.bfloat16,
)

image = load_image("https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/assets/new_arxiv.png")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to markdown."}
        ]
    },
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

device = torch.device("cpu")  # Force CPU
print(f"Using device: {device}")
inputs = inputs.to(device)
model = model.to(device)
print("moved to device")

print("Testing forward pass")
with torch.no_grad():
    outputs = model(**inputs)
print("Forward pass completed")

print("Testing model.generate with minimal settings")
with torch.no_grad():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=10,
        do_sample=False,
        use_cache=False,
        pad_token_id=processor.tokenizer.eos_token_id
    )
print("Generated IDs:", generated_ids.shape)

print("Testing with different attention implementation")
model.config._attn_implementation = "eager"
with torch.no_grad():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=10,
        do_sample=False,
        use_cache=False,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id
    )
print("Generated IDs:", generated_ids.shape)

print("Testing with manual generation loop")
with torch.no_grad():
    input_ids = inputs["input_ids"]
    for i in range(5):  # Generate 5 tokens manually
        outputs = model(input_ids=input_ids)
        next_token = outputs.logits[:, -1, :].argmax(dim=-1)
        input_ids = torch.cat([input_ids, next_token.unsqueeze(1)], dim=1)
        print(f"Generated token {i+1}: {next_token.item()}")

print("Done")
@gabegoodhart on Mac, to be honest, I'd go with the MLX version; it's much faster there, and it's what we use in the docling pipeline.
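Roughly like this with mlx-vlm (a sketch from memory, not tested here; argument names and order may differ slightly between mlx-vlm versions, and "page.png" is a placeholder path):

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "ibm-granite/granite-docling-258M-mlx"
model, processor = load(model_path)
config = load_config(model_path)

# One image per prompt; "page.png" is a placeholder
prompt = apply_chat_template(processor, config, "Convert this page to markdown.", num_images=1)
output = generate(model, processor, prompt, ["page.png"], max_tokens=4096, verbose=False)
print(output)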
Using device: cpu
moved to device
Testing forward pass
Forward pass completed
Testing model.generate with minimal settings
Generated IDs: torch.Size([1, 1151])
Testing with different attention implementation
@asnassar sorry for the misstatement. The full output is:
Using device: cpu
moved to device
Testing forward pass
Forward pass completed
Testing model.generate with minimal settings
Generated IDs: torch.Size([1, 1151])
Testing with different attention implementation
Generated IDs: torch.Size([1, 1151])
Testing with manual generation loop
Generated token 1: 100327
Generated token 2: 100260
Generated token 3: 27
Generated token 4: 1092
Generated token 5: 62
Done
But it took a long time to finish.
It also works with CUDA.
OK, it seems it's working. I set
_attn_implementation="eager" if DEVICE == "cuda" else "sdpa",
and reduced max_new_tokens to 1000.
Not as fast as expected, but it works. Thanks!
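For anyone landing here later, the working variant of my original snippet boils down to roughly this (a sketch based on the settings above; "page.png" is a placeholder path):

import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained("ibm-granite/granite-docling-258M")
model = AutoModelForVision2Seq.from_pretrained(
    "ibm-granite/granite-docling-258M",
    torch_dtype=torch.bfloat16,
    _attn_implementation="eager" if DEVICE == "cuda" else "sdpa",
).to(DEVICE)

image = load_image("page.png")  # placeholder path
messages = [{"role": "user", "content": [{"type": "image"},
                                         {"type": "text", "text": "Convert this page to markdown."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(DEVICE)

# 1000 new tokens is enough for a simple page and keeps CPU runs bearable
generated_ids = model.generate(**inputs, max_new_tokens=1000)
doctags = processor.batch_decode(
    generated_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=False
)[0].lstrip()
print(doctags)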
A few recommendations here:
- If you only have a CPU, expect it to be slow with transformers/pytorch. To verify that anything works, please use a very low max_new_tokens setting.
- If you run on Apple Silicon, better not to use the torch mps backend, it is very flaky. Use the MLX version instead, it is ultra fast at 200-300 tokens/sec. See here: https://huggingface.co/ibm-granite/granite-docling-258M-mlx
- If you have an Nvidia GPU, we recommend trying vLLM (see sample); it has a bit of warmup cost on first load, but inference is much faster and more stable. A rough sketch follows below.
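The vLLM point could look roughly like this (a sketch only, untested here; it uses vLLM's offline multimodal prompt format, and "page.png" is a placeholder path):

from vllm import LLM, SamplingParams
from transformers import AutoProcessor
from PIL import Image

model_id = "ibm-granite/granite-docling-258M"

# Build the prompt with the model's own chat template
processor = AutoProcessor.from_pretrained(model_id)
messages = [{"role": "user", "content": [{"type": "image"},
                                         {"type": "text", "text": "Convert this page to markdown."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

llm = LLM(model=model_id, limit_mm_per_prompt={"image": 1})
sampling = SamplingParams(temperature=0.0, max_tokens=1024)
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": Image.open("page.png")}},  # placeholder path
    sampling,
)
print(outputs[0].outputs[0].text)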