Single-page image inference using plain transformers not working

#10
by zoldaten - opened

The example is not working.
The model loads, but then nothing happens.

torch==2.6.0+cu124
flash-attn==2.7.4.post1
transformers==4.51.3

*Ubuntu 22.04.5 LTS, Python 3.10.12
**Update: upgrading transformers didn't help

Same here (without flash-attn).

IBM Granite org
edited Sep 18

@zoldaten @cassandragemini I just tested this on a Red Hat system with the same library versions on Python 3.11, and it works both with and without flash-attn. Could you share more details about your environment so we can narrow down the difference? Also, what exactly do you get for "model loaded but nothing happens." Is it hanging?

| "model loaded but nothing happens.", is it hanging
i see model download process from HF and loaded CPU cores. and nothing. no errors but working CPUs.

IBM Granite org

I see there are a few things we can debug. Which Python version are you using?
Also, can you debug by adding prints before model load, after model load, and before generate, so we can narrow down whether it's in model loading or generation?
Does it work on CPU?

import torch
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image
from pathlib import Path

#DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DEVICE = "cpu"

# Load images
image = load_image("123.png")
#image = load_image("https://arxiv.org/pdf/2408.09869")

# Initialize processor and model
processor = AutoProcessor.from_pretrained("ibm-granite/granite-docling-258M")
model = AutoModelForVision2Seq.from_pretrained(
    "ibm-granite/granite-docling-258M",
    #torch_dtype=torch.bfloat16,
    #_attn_implementation="flash_attention_2" if DEVICE == "cuda" else "sdpa",
).to(DEVICE)
print(1)
# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to markdown."}
        ]
    },
]

# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = inputs.to(DEVICE)
print(2)

# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=8192)
prompt_length = inputs.input_ids.shape[1]
trimmed_generated_ids = generated_ids[:, prompt_length:]
doctags = processor.batch_decode(
    trimmed_generated_ids,
    skip_special_tokens=False,
)[0].lstrip()

print(f"DocTags: \n{doctags}\n")


# Populate document
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
# create a docling document
doc = DoclingDocument.load_from_doctags(doctags_doc, document_name="Document")
print(f"Markdown:\n{doc.export_to_markdown()}\n")

I see prints 1 and 2.

IBM Granite org

Which Python version are you using?

I don't have my computer to hand, so I don't have a lot of details, but on my side it was with Python 3.13. I tried both CUDA (with an RTX 2060) and CPU, and in both cases the GPU/CPU was fully loaded but nothing happened.

@asnassar As I pointed out, Python 3.10.12. I also tried 3.11.

BTW this code works:

from docling.document_converter import DocumentConverter

source = "333.png"  # document per local path or URL
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())

IBM Granite org

Can you let me know what prints out when you run this:

import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

print("Loading processor")
processor = AutoProcessor.from_pretrained("ibm-granite/granite-docling-258M")

print("Loading model")
model = AutoModelForVision2Seq.from_pretrained(
    "ibm-granite/granite-docling-258M",
    torch_dtype=torch.bfloat16,
)
image = load_image("https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/assets/new_arxiv.png")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to markdown."}
        ]
    },
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

device = torch.device("cpu")  # Force CPU
print(f"Using device: {device}")

inputs = inputs.to(device)
model = model.to(device)
print("moved to device")

print("Testing forward pass")
with torch.no_grad():
    outputs = model(**inputs)
print("Forward pass completed")

print("Testing model.generate with minimal settings")
with torch.no_grad():
    generated_ids = model.generate(
        **inputs, 
        max_new_tokens=10,
        do_sample=False,
        use_cache=False,
        pad_token_id=processor.tokenizer.eos_token_id
    )
    print("Generated IDs:", generated_ids.shape)

print("Testing with different attention implementation")
model.config._attn_implementation = "eager"
with torch.no_grad():
    generated_ids = model.generate(
        **inputs, 
        max_new_tokens=10,
        do_sample=False,
        use_cache=False
    )
    print("Generated IDs:", generated_ids.shape)

print("Testing with manual generation loop")
with torch.no_grad():
    input_ids = inputs["input_ids"]
    for i in range(5):  # Generate 5 tokens manually
        outputs = model(input_ids=input_ids)
        next_token = outputs.logits[:, -1, :].argmax(dim=-1)
        input_ids = torch.cat([input_ids, next_token.unsqueeze(1)], dim=1)
        print(f"Generated token {i+1}: {next_token.item()}")

print("Done")

Using device: cpu
moved to device
Testing forward pass
Forward pass completed
Testing model.generate with minimal settings
Generated IDs: torch.Size([1, 1151])
Testing with different attention implementation

IBM Granite org

I'm also seeing the hang during generate:

(dmf) ghart@Mac [granite-docling-258M]$ pip freeze | grep transformers
sentence-transformers==5.1.0
transformers==4.50.3
(dmf) ghart@Mac [granite-docling-258M]$ pip freeze | grep torch
torch==2.5.1
torchvision==0.20.1
(dmf) ghart@Mac [granite-docling-258M]$ python --version
Python 3.11.9

from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image
import torch

model_path = "/Users/ghart/models/ibm-granite/granite-docling-258M"
processor = AutoProcessor.from_pretrained(model_path)
image = load_image("/Users/ghart/Pictures/sample-image.png")
model = AutoModelForVision2Seq.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to markdown."}
        ]
    },
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

with torch.no_grad():
    generated_ids = model.generate(
        **inputs, 
        max_new_tokens=10,
        do_sample=False,
        use_cache=False,
        pad_token_id=processor.tokenizer.eos_token_id
    )
    print("Generated IDs:", generated_ids.shape)
IBM Granite org

If I switch to using mps, it terminates cleanly. I'm on an M3 MacBook Pro w/ 64GB.
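
A minimal sketch (not from the thread) of the change that moves the earlier script onto mps; the only assumption is that mps is available via torch.backends:

import torch

# Builds on `model` and `inputs` from the script above: pick mps when
# available, otherwise fall back to CPU, and move everything there.
DEVICE = "mps" if torch.backends.mps.is_available() else "cpu"
model = model.to(DEVICE)
inputs = inputs.to(DEVICE)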

IBM Granite org

It's also possible that I'm just impatient with the CPU version. If I set max_new_tokens=1000, it runs for a long time, even with mps (still running for me).
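
One way to tell a genuine hang apart from slow CPU generation (a sketch, not something used in the thread) is to attach a TextStreamer so decoded tokens are printed as soon as they are produced:

from transformers import TextStreamer

# Assumes `processor`, `model`, and `inputs` from the scripts above.
# If tokens keep appearing, generation is just slow; if nothing prints, it is hanging.
streamer = TextStreamer(processor.tokenizer, skip_prompt=True)
generated_ids = model.generate(**inputs, max_new_tokens=100, streamer=streamer)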

IBM Granite org

@zoldaten Thanks for your responses. Can you try:

import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

print("Loading processor")
processor = AutoProcessor.from_pretrained("ibm-granite/granite-docling-258M")

print("Loading model")
model = AutoModelForVision2Seq.from_pretrained(
    "ibm-granite/granite-docling-258M",
    torch_dtype=torch.bfloat16,
)
image = load_image("https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/assets/new_arxiv.png")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to markdown."}
        ]
    },
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

device = torch.device("cpu")  # Force CPU
print(f"Using device: {device}")

inputs = inputs.to(device)
model = model.to(device)
print("moved to device")

print("Testing forward pass")
with torch.no_grad():
    outputs = model(**inputs)
print("Forward pass completed")

print("Testing model.generate with minimal settings")
with torch.no_grad():
    generated_ids = model.generate(
        **inputs, 
        max_new_tokens=10,
        do_sample=False,
        use_cache=False,
        pad_token_id=processor.tokenizer.eos_token_id
    )
    print("Generated IDs:", generated_ids.shape)

print("Testing with different attention implementation")
model.config._attn_implementation = "eager"
with torch.no_grad():
    generated_ids = model.generate(
        **inputs, 
        max_new_tokens=10,
        do_sample=False,
        use_cache=False,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id
    )
    print("Generated IDs:", generated_ids.shape)

print("Testing with manual generation loop")
with torch.no_grad():
    input_ids = inputs["input_ids"]
    for i in range(5):  # Generate 5 tokens manually
        outputs = model(input_ids=input_ids)
        next_token = outputs.logits[:, -1, :].argmax(dim=-1)
        input_ids = torch.cat([input_ids, next_token.unsqueeze(1)], dim=1)
        print(f"Generated token {i+1}: {next_token.item()}")

print("Done")
IBM Granite org

@gabegoodhart On Mac, to be honest, I'd go with the MLX version; it's much faster on Mac, and it's what we use in the docling pipeline.
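
For reference, a rough sketch of running the MLX checkpoint from the command line, assuming mlx-vlm is installed (flag names may differ between mlx-vlm versions, so treat this as an approximation rather than the official instructions):

# Hypothetical invocation; check the granite-docling-258M-mlx model card for the exact command.
python -m mlx_vlm.generate --model ibm-granite/granite-docling-258M-mlx \
    --image 123.png --prompt "Convert this page to markdown." --max-tokens 1000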

Using device: cpu
moved to device
Testing forward pass
Forward pass completed
Testing model.generate with minimal settings
Generated IDs: torch.Size([1, 1151])
Testing with different attention implementation

@asnassar Sorry for the misstatement; the full output is:

Using device: cpu
moved to device
Testing forward pass
Forward pass completed
Testing model.generate with minimal settings
Generated IDs: torch.Size([1, 1151])
Testing with different attention implementation
Generated IDs: torch.Size([1, 1151])
Testing with manual generation loop
Generated token 1: 100327
Generated token 2: 100260
Generated token 3: 27
Generated token 4: 1092
Generated token 5: 62
Done

but it took a long time to finish.
It also works with CUDA.

OK, it seems to be working.
I set

_attn_implementation="eager" if DEVICE == "cuda" else "sdpa",

and reduced max_new_tokens to 1000.
Not as fast as expected, but it works. Thanks!
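
For reference, a consolidated sketch of the configuration that ended up working in this thread (the first script above with the two changes applied; everything else, including the processor, messages, and inputs, stays the same):

# Model load with the attention implementation switch described above.
model = AutoModelForVision2Seq.from_pretrained(
    "ibm-granite/granite-docling-258M",
    torch_dtype=torch.bfloat16,
    _attn_implementation="eager" if DEVICE == "cuda" else "sdpa",
).to(DEVICE)

# Reduced token budget so a CPU run finishes in reasonable time.
generated_ids = model.generate(**inputs, max_new_tokens=1000)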

IBM Granite org
edited Sep 19

A few recommendations here:

  1. If you only have a CPU, expect it to be slow with transformers/PyTorch. To verify that anything works, please use a very low max_new_tokens setting.
  2. If you run on Apple Silicon, it is better not to use the torch mps backend; it is very flaky. Use the MLX version instead, it is ultra fast at 200-300 tokens/sec. See here: https://huggingface.co/ibm-granite/granite-docling-258M-mlx
  3. If you have an Nvidia GPU, we recommend trying vLLM (see the sample; a rough sketch follows this list). It has a bit of warmup cost on first load, but inference is much faster and more stable.
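
A minimal sketch of what the vLLM path might look like (this is not the official sample referenced above; the multimodal input format and arguments such as limit_mm_per_prompt are assumptions and may vary across vLLM versions):

from vllm import LLM, SamplingParams
from transformers import AutoProcessor
from PIL import Image

# Build the chat prompt with the HF processor, as in the transformers scripts above.
processor = AutoProcessor.from_pretrained("ibm-granite/granite-docling-258M")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to markdown."}
        ]
    },
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# Assumed vLLM multimodal usage: pass the PIL image alongside the prompt.
llm = LLM(model="ibm-granite/granite-docling-258M", limit_mm_per_prompt={"image": 1})
sampling = SamplingParams(temperature=0.0, max_tokens=1000)
image = Image.open("123.png")  # same local test image as in the first script
outputs = llm.generate({"prompt": prompt, "multi_modal_data": {"image": image}}, sampling)
print(outputs[0].outputs[0].text)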
