Single-page image inference using plain transformers not working

#10
by zoldaten - opened

The example is not working.
The model loads, but then nothing happens.

torch==2.6.0+cu124
flash-attn==2.7.4.post1
transformers==4.51.3

*Ubuntu 22.04.5 LTS, Python 3.10.12
**Update: upgrading transformers didn't help

Same here (without flash-attn).

IBM Granite org
edited Sep 18

@zoldaten @cassandragemini I just tested this on a Red Hat system with the same library versions on Python 3.11, and it works both with and without flash-attn. Could you share more details about your environment so we can narrow down the difference? Also, what exactly do you get for "model loaded but nothing happens." Is it hanging?

| "model loaded but nothing happens.", is it hanging
i see model download process from HF and loaded CPU cores. and nothing. no errors but working CPUs.

IBM Granite org

I see there are a few things we can debug. Which Python version are you using?
Also, can you debug by adding prints before model load, after model load, and before generate, so we can narrow down whether it's in model loading or generation?
Does it work on CPU?

import torch
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image
from pathlib import Path

#DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DEVICE = "cpu"

# Load images
image = load_image("123.png")
#image = load_image("https://arxiv.org/pdf/2408.09869")

# Initialize processor and model
processor = AutoProcessor.from_pretrained("ibm-granite/granite-docling-258M")
model = AutoModelForVision2Seq.from_pretrained(
    "ibm-granite/granite-docling-258M",
    #torch_dtype=torch.bfloat16,
    #_attn_implementation="flash_attention_2" if DEVICE == "cuda" else "sdpa",
).to(DEVICE)
print(1)
# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to markdown."}
        ]
    },
]

# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = inputs.to(DEVICE)
print(2)

# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=8192)
prompt_length = inputs.input_ids.shape[1]
trimmed_generated_ids = generated_ids[:, prompt_length:]
doctags = processor.batch_decode(
    trimmed_generated_ids,
    skip_special_tokens=False,
)[0].lstrip()

print(f"DocTags: \n{doctags}\n")


# Populate document
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
# create a docling document
doc = DoclingDocument.load_from_doctags(doctags_doc, document_name="Document")
print(f"Markdown:\n{doc.export_to_markdown()}\n")

I see prints 1 and 2.

IBM Granite org

Which Python version are you using?

I don't have my computer to hand, so I don't have a lot of details, but on my side it was with Python 3.13. I tried both CUDA (with an RTX 2060) and CPU, and in both cases the GPU/CPU was fully loaded but nothing happened.

@asnassar As I pointed out, Python 3.10.12. I also tried 3.11.

BTW this code works:

from docling.document_converter import DocumentConverter

source = "333.png"  # document per local path or URL
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())

IBM Granite org

Can you let me know what prints out when you run this:

import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

print("Loading processor")
processor = AutoProcessor.from_pretrained("ibm-granite/granite-docling-258M")

print("Loading model")
model = AutoModelForVision2Seq.from_pretrained(
    "ibm-granite/granite-docling-258M",
    torch_dtype=torch.bfloat16,
)
image = load_image("https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/assets/new_arxiv.png")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to markdown."}
        ]
    },
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

device = torch.device("cpu")  # Force CPU
print(f"Using device: {device}")

inputs = inputs.to(device)
model = model.to(device)
print("moved to device")

print("Testing forward pass")
with torch.no_grad():
    outputs = model(**inputs)
print("Forward pass completed")

print("Testing model.generate with minimal settings")
with torch.no_grad():
    generated_ids = model.generate(
        **inputs, 
        max_new_tokens=10,
        do_sample=False,
        use_cache=False,
        pad_token_id=processor.tokenizer.eos_token_id
    )
    print("Generated IDs:", generated_ids.shape)

print("Testing with different attention implementation")
model.config._attn_implementation = "eager"
with torch.no_grad():
    generated_ids = model.generate(
        **inputs, 
        max_new_tokens=10,
        do_sample=False,
        use_cache=False
    )
    print("Generated IDs:", generated_ids.shape)

print("Testing with manual generation loop")
with torch.no_grad():
    input_ids = inputs["input_ids"]
    for i in range(5):  # Generate 5 tokens manually
        outputs = model(input_ids=input_ids)
        next_token = outputs.logits[:, -1, :].argmax(dim=-1)
        input_ids = torch.cat([input_ids, next_token.unsqueeze(1)], dim=1)
        print(f"Generated token {i+1}: {next_token.item()}")

print("Done")

Using device: cpu
moved to device
Testing forward pass
Forward pass completed
Testing model.generate with minimal settings
Generated IDs: torch.Size([1, 1151])
Testing with different attention implementation

IBM Granite org

I'm also seeing the hang during generate:

(dmf) ghart@Mac [granite-docling-258M]$ pip freeze | grep transformers
sentence-transformers==5.1.0
transformers==4.50.3
(dmf) ghart@Mac [granite-docling-258M]$ pip freeze | grep torch
torch==2.5.1
torchvision==0.20.1
(dmf) ghart@Mac [granite-docling-258M]$ python --version
Python 3.11.9

from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image
import torch

model_path = "/Users/ghart/models/ibm-granite/granite-docling-258M"
processor = AutoProcessor.from_pretrained(model_path)
image = load_image("/Users/ghart/Pictures/sample-image.png")
model = AutoModelForVision2Seq.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to markdown."}
        ]
    },
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

with torch.no_grad():
    generated_ids = model.generate(
        **inputs, 
        max_new_tokens=10,
        do_sample=False,
        use_cache=False,
        pad_token_id=processor.tokenizer.eos_token_id
    )
    print("Generated IDs:", generated_ids.shape)
IBM Granite org

If I switch to using mps, it terminates cleanly. I'm on an M3 MacBook Pro w/ 64GB.
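
A minimal sketch (not from the thread) of the change that moves the earlier script onto mps; the only assumption is that mps is available via torch.backends:

import torch

# Builds on `model` and `inputs` from the script above: pick mps when
# available, otherwise fall back to CPU, and move everything there.
DEVICE = "mps" if torch.backends.mps.is_available() else "cpu"
model = model.to(DEVICE)
inputs = inputs.to(DEVICE)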

IBM Granite org

It's also possible that I'm just impatient with the CPU version. If I set max_new_tokens=1000, it runs for a long time, even with mps (still running for me).
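
One way to tell a genuine hang apart from slow CPU generation (a sketch, not something used in the thread) is to attach a TextStreamer so decoded tokens are printed as soon as they are produced:

from transformers import TextStreamer

# Assumes `processor`, `model`, and `inputs` from the scripts above.
# If tokens keep appearing, generation is just slow; if nothing prints, it is hanging.
streamer = TextStreamer(processor.tokenizer, skip_prompt=True)
generated_ids = model.generate(**inputs, max_new_tokens=100, streamer=streamer)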

IBM Granite org

@zoldaten Thanks for your responses. Can you try:

import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

print("Loading processor")
processor = AutoProcessor.from_pretrained("ibm-granite/granite-docling-258M")

print("Loading model")
model = AutoModelForVision2Seq.from_pretrained(
    "ibm-granite/granite-docling-258M",
    torch_dtype=torch.bfloat16,
)
image = load_image("https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/assets/new_arxiv.png")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to markdown."}
        ]
    },
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

device = torch.device("cpu")  # Force CPU
print(f"Using device: {device}")

inputs = inputs.to(device)
model = model.to(device)
print("moved to device")

print("Testing forward pass")
with torch.no_grad():
    outputs = model(**inputs)
print("Forward pass completed")

print("Testing model.generate with minimal settings")
with torch.no_grad():
    generated_ids = model.generate(
        **inputs, 
        max_new_tokens=10,
        do_sample=False,
        use_cache=False,
        pad_token_id=processor.tokenizer.eos_token_id
    )
    print("Generated IDs:", generated_ids.shape)

print("Testing with different attention implementation")
model.config._attn_implementation = "eager"
with torch.no_grad():
    generated_ids = model.generate(
        **inputs, 
        max_new_tokens=10,
        do_sample=False,
        use_cache=False,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id
    )
    print("Generated IDs:", generated_ids.shape)

print("Testing with manual generation loop")
with torch.no_grad():
    input_ids = inputs["input_ids"]
    for i in range(5):  # Generate 5 tokens manually
        outputs = model(input_ids=input_ids)
        next_token = outputs.logits[:, -1, :].argmax(dim=-1)
        input_ids = torch.cat([input_ids, next_token.unsqueeze(1)], dim=1)
        print(f"Generated token {i+1}: {next_token.item()}")

print("Done")
IBM Granite org

@gabegoodhart On Mac, to be honest, I'd go with the MLX version; it's much faster on Mac, and it's what we use in the docling pipeline.
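
For reference, a rough sketch of running the MLX checkpoint from the command line, assuming mlx-vlm is installed (flag names may differ between mlx-vlm versions, so treat this as an approximation rather than the official instructions):

# Hypothetical invocation; check the granite-docling-258M-mlx model card for the exact command.
python -m mlx_vlm.generate --model ibm-granite/granite-docling-258M-mlx \
    --image 123.png --prompt "Convert this page to markdown." --max-tokens 1000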

Using device: cpu
moved to device
Testing forward pass
Forward pass completed
Testing model.generate with minimal settings
Generated IDs: torch.Size([1, 1151])
Testing with different attention implementation

@asnassar Sorry for the misstatement; the full output is:

Using device: cpu
moved to device
Testing forward pass
Forward pass completed
Testing model.generate with minimal settings
Generated IDs: torch.Size([1, 1151])
Testing with different attention implementation
Generated IDs: torch.Size([1, 1151])
Testing with manual generation loop
Generated token 1: 100327
Generated token 2: 100260
Generated token 3: 27
Generated token 4: 1092
Generated token 5: 62
Done

but it took a long time to finish.
It also works with CUDA.

OK, it seems to be working.
I set

_attn_implementation="eager" if DEVICE == "cuda" else "sdpa",

and reduced max_new_tokens to 1000.
Not as fast as expected, but it works. Thanks!
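
For reference, a consolidated sketch of the configuration that ended up working in this thread (the first script above with the two changes applied; everything else, including the processor, messages, and inputs, stays the same):

# Model load with the attention implementation switch described above.
model = AutoModelForVision2Seq.from_pretrained(
    "ibm-granite/granite-docling-258M",
    torch_dtype=torch.bfloat16,
    _attn_implementation="eager" if DEVICE == "cuda" else "sdpa",
).to(DEVICE)

# Reduced token budget so a CPU run finishes in reasonable time.
generated_ids = model.generate(**inputs, max_new_tokens=1000)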

IBM Granite org
edited Sep 19

A few recommendations here:

  1. If you only have a CPU, expect it to be slow with transformers/PyTorch. To verify that anything works, please use a very low max_new_tokens setting.
  2. If you run on Apple Silicon, it is better not to use the torch mps backend; it is very flaky. Use the MLX version instead, it is ultra fast at 200-300 tokens/sec. See here: https://huggingface.co/ibm-granite/granite-docling-258M-mlx
  3. If you have an Nvidia GPU, we recommend trying vLLM (see the sample; a rough sketch follows this list). It has a bit of warmup cost on first load, but inference is much faster and more stable.
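
A minimal sketch of what the vLLM path might look like (this is not the official sample referenced above; the multimodal input format and arguments such as limit_mm_per_prompt are assumptions and may vary across vLLM versions):

from vllm import LLM, SamplingParams
from transformers import AutoProcessor
from PIL import Image

# Build the chat prompt with the HF processor, as in the transformers scripts above.
processor = AutoProcessor.from_pretrained("ibm-granite/granite-docling-258M")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to markdown."}
        ]
    },
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# Assumed vLLM multimodal usage: pass the PIL image alongside the prompt.
llm = LLM(model="ibm-granite/granite-docling-258M", limit_mm_per_prompt={"image": 1})
sampling = SamplingParams(temperature=0.0, max_tokens=1000)
image = Image.open("123.png")  # same local test image as in the first script
outputs = llm.generate({"prompt": prompt, "multi_modal_data": {"image": image}}, sampling)
print(outputs[0].outputs[0].text)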
