How to use it correctly with online serving via the vLLM OpenAI-compatible server?

#55
by dhruvil237 - opened

I'm using the command below, but I'm not sure if it's set up correctly:
vllm serve deepseek-ai/DeepSeek-OCR --no-enable-prefix-caching --mm-processor-cache-gb 0 --logits-processors vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor

then calling it this way:
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-OCR",
    messages=message,
    temperature=0.0,
    max_tokens=500,
    # ngram logit processor args
    extra_body={
        "ngram_size": 30,
        "window_size": 90,
        "whitelist_token_ids": [128821, 128822],
        "skip_special_tokens": False,  # whitelist: <td>, </td>
    }
)

I am not sure whether the parameters I'm passing are actually having any effect.
Can someone explain why those parameters are required and whether this setup is correct?

Corrected serving command:
vllm serve deepseek-ai/DeepSeek-OCR --no-enable-prefix-caching --mm-processor-cache-gb 0 --logits-processors vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor --enable-log-requests --gpu-memory-utilization 0.4 --chat-template /home/ubuntu/llm-ocr-exp/template_deepseek_ocr.jinja

Inference:

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-OCR",
    messages=message,
    temperature=0.0,
    max_tokens=500,
    # Per-request args for the n-gram logits processor. With the OpenAI-compatible
    # server they must be nested under "vllm_xargs", not placed directly in extra_body.
    extra_body={
        "vllm_xargs": {
            "ngram_size": 30,    # length of the repeated n-gram to suppress
            "window_size": 90,   # how far back to look for repeats
            # "whitelist_token_ids": [128821, 128822],  # tokens allowed to repeat
        },
        "skip_special_tokens": False,  # whitelist tokens: <td>, </td>
    }
)

@dhruvilHV Can I see your --chat-template /home/ubuntu/llm-ocr-exp/template_deepseek_ocr.jinja?
