kimyoungjune committed
Commit 485cbee · verified · 1 Parent(s): d6d66f1

Update README.md

Files changed (1)
  1. README.md +111 -2
README.md CHANGED
@@ -1,5 +1,114 @@
- Model checkpoints will be released on July 24.
-
  ---
  license: cc-by-nc-4.0
+ base_model:
+ - Qwen/Qwen3-1.7B
+ - google/siglip2-so400m-patch16-384
+ library_name: transformers
+ tags:
+ - multimodal
+ - OCR
+ - ncsoft
+ - ncai
+ - varco
+ pipeline_tag: image-text-to-text
+ language:
+ - en
+ - ko
  ---
+
+ # VARCO-VISION-2.0-1.7B-OCR
+
+ <div align="center">
+ <img src="./varco-vision.png" width="100%" style="background-color:white; padding:10px;" />
+ </div>
+
+ ## Introduction
+ **VARCO-VISION-2.0-1.7B-OCR** is a lightweight yet powerful OCR-specialized model derived from VARCO-VISION-2.0-1.7B, designed to deliver efficient and accurate text recognition in real-world scenarios. Unlike conventional vision-language models (VLMs) that primarily focus on transcribing visible text, this model performs both recognition and spatial localization by detecting bounding boxes around each character, enabling structured, layout-aware OCR outputs.
+
+ The model supports both Korean and English, making it well-suited for multilingual environments where mixed-script documents are common. Each recognized character is paired with its precise position in the image, formatted as `<char>{characters}</char><bbox>{x1}, {y1}, {x2}, {y2}</bbox>`, where the coordinates correspond to the top-left (`x1`, `y1`) and bottom-right (`x2`, `y2`) corners of the character's bounding box.
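+
+ For downstream processing, the tagged output can be parsed with a simple regular expression. The snippet below is a minimal sketch of such parsing; the `parse_ocr_output` helper and the `sample_output` string are illustrative, not part of the model's API.
+
+ ```python
+ import re
+
+ # Matches the documented pattern: <char>{characters}</char><bbox>{x1}, {y1}, {x2}, {y2}</bbox>
+ OCR_PATTERN = re.compile(
+     r"<char>(.*?)</char><bbox>\s*([\d.]+),\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)\s*</bbox>",
+     re.DOTALL,
+ )
+
+ def parse_ocr_output(text):
+     """Return a list of (characters, (x1, y1, x2, y2)) tuples."""
+     return [
+         (chars, tuple(float(v) for v in coords))
+         for chars, *coords in OCR_PATTERN.findall(text)
+     ]
+
+ # Illustrative string in the documented format (not actual model output)
+ sample_output = "<char>VARCO</char><bbox>10, 20, 120, 48</bbox><char>OCR</char><bbox>130, 20, 210, 48</bbox>"
+ for chars, box in parse_ocr_output(sample_output):
+     print(chars, box)
+ ```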
+
+ While VARCO-VISION-2.0-14B demonstrates strong OCR capabilities as part of its broader multimodal reasoning skills, deploying such a large model for single-task use cases can be computationally inefficient. VARCO-VISION-2.0-1.7B-OCR addresses this with a task-optimized design that retains high accuracy while significantly reducing resource requirements, making it ideal for real-time or resource-constrained applications.
+
+ ![image/gif](https://huggingface.co/NCSOFT/VARCO-VISION-2.0-1.7B-OCR/resolve/main/examples.gif)
+
+ ## 🚨 News 🎙️
+ - 📰 2025-07-28: We released VARCO-VISION-2.0-1.7B-OCR at [link](https://huggingface.co/NCSOFT/VARCO-VISION-2.0-1.7B-OCR).
+ - 📰 2025-07-28: We released VARCO-VISION-2.0-1.7B at [link](https://huggingface.co/NCSOFT/VARCO-VISION-2.0-1.7B).
+ - 📰 2025-07-18: We updated the checkpoint of VARCO-VISION-2.0-14B for improved performance.
+ - 📰 2025-07-16: We released VARCO-VISION-2.0-14B at [link](https://huggingface.co/NCSOFT/VARCO-VISION-2.0-14B).
+ - 📰 2025-07-16: We released GME-VARCO-VISION-Embedding at [link](https://huggingface.co/NCSOFT/GME-VARCO-VISION-Embedding).
+
+ ## VARCO-VISION-2.0 Family
+ | Model Name | Base Models (Vision / Language) | HF Link |
+ | :------------------------: | :-----------------------------: | :-----: |
+ | VARCO-VISION-2.0-14B | [siglip2-so400m-patch16-384](https://huggingface.co/google/siglip2-so400m-patch16-384) / [Qwen3-14B](https://huggingface.co/Qwen/Qwen3-14B) | [link](https://huggingface.co/NCSOFT/VARCO-VISION-2.0-14B) |
+ | VARCO-VISION-2.0-1.7B | [siglip2-so400m-patch16-384](https://huggingface.co/google/siglip2-so400m-patch16-384) / [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) | [link](https://huggingface.co/NCSOFT/VARCO-VISION-2.0-1.7B) |
+ | VARCO-VISION-2.0-1.7B-OCR | [siglip2-so400m-patch16-384](https://huggingface.co/google/siglip2-so400m-patch16-384) / [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) | [link](https://huggingface.co/NCSOFT/VARCO-VISION-2.0-1.7B-OCR) |
+ | GME-VARCO-VISION-Embedding | [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) | [link](https://huggingface.co/NCSOFT/GME-VARCO-VISION-Embedding) |
+
+ ## Model Architecture
+ VARCO-VISION-2.0 follows the architecture of [LLaVA-OneVision](https://arxiv.org/abs/2408.03326).
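+
+ In `transformers`, the checkpoint loads as `LlavaOnevisionForConditionalGeneration` (see the usage example below). The snippet below is a minimal sketch for inspecting the vision/language composition; it assumes the published `config.json` follows the standard LLaVA-OneVision layout, and the exact `model_type` strings printed may differ from the comments.
+
+ ```python
+ from transformers import AutoConfig
+
+ config = AutoConfig.from_pretrained("NCSOFT/VARCO-VISION-2.0-1.7B-OCR")
+ print(type(config).__name__)            # expected: LlavaOnevisionConfig
+ print(config.vision_config.model_type)  # vision encoder (SigLIP2 family)
+ print(config.text_config.model_type)    # language model (Qwen3 family)
+ ```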
+
+ ## Evaluation
+
+ ### OCR Benchmark
+ | Benchmark | CLOVA OCR | PaddleOCR | EasyOCR | VARCO-VISION-2.0-1.7B-OCR |
+ | :-------: | :-------: | :-------: | :-----: | :-----------------------: |
+ | CORD | *93.9* | 91.4 | 77.8 | **95.6** |
+ | ICDAR2013 | *94.4* | 92.0 | 85.0 | **95.5** |
+ | ICDAR2015 | **84.1** | 73.7 | 57.9 | *75.4* |
+
+ **Bold** marks the best result per benchmark; *italics* mark the second best.
+
+ ## Usage
+ To use this model, we recommend installing `transformers` version **4.53.1 or higher**.
+ Additionally, for best results, we recommend **upscaling input images so that the longer side is at least 2,304 pixels** if they are smaller, as shown in the example below.
+
+ ```python
+ import torch
+ from PIL import Image
+ from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration
+
+ model_name = "NCSOFT/VARCO-VISION-2.0-1.7B-OCR"
+ model = LlavaOnevisionForConditionalGeneration.from_pretrained(
+     model_name,
+     torch_dtype=torch.float16,
+     attn_implementation="sdpa",
+     device_map="auto",
+ )
+ processor = AutoProcessor.from_pretrained(model_name)
+
+ # PIL expects a filesystem path, not a file:// URI
+ image = Image.open("/path/to/image.jpg")
+
+ # Upscale the image so the longer side is at least 2,304 pixels (OCR performance boost)
+ w, h = image.size
+ target_size = 2304
+ if max(w, h) < target_size:
+     scaling_factor = target_size / max(w, h)
+     new_w = int(w * scaling_factor)
+     new_h = int(h * scaling_factor)
+     image = image.resize((new_w, new_h))
+
+ # The text prompt is left empty: the OCR-specialized model transcribes the image without further instructions
+ conversation = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image", "image": image},
+             {"type": "text", "text": ""},
+         ],
+     },
+ ]
+
+ inputs = processor.apply_chat_template(
+     conversation,
+     add_generation_prompt=True,
+     tokenize=True,
+     return_dict=True,
+     return_tensors="pt",
+ ).to(model.device, torch.float16)
+
+ generate_ids = model.generate(**inputs, max_new_tokens=1024)
+ generate_ids_trimmed = [
+     out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generate_ids)
+ ]
+ # Keep special tokens so the <char>/<bbox> tags remain in the decoded output
+ output = processor.decode(generate_ids_trimmed[0], skip_special_tokens=False)
+ print(output)
+ ```
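+
+ To inspect the result visually, the per-character boxes can be drawn back onto the (upscaled) `image` used above. The sketch below reuses the `image` and `output` variables from the example; the regular expression and the normalized-coordinate fallback are our assumptions, since the card specifies only the `(x1, y1, x2, y2)` corner order, not whether values are pixels or normalized.
+
+ ```python
+ import re
+ from PIL import ImageDraw
+
+ draw = ImageDraw.Draw(image)
+ pattern = re.compile(
+     r"<char>(.*?)</char><bbox>\s*([\d.]+),\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)\s*</bbox>"
+ )
+ for chars, x1, y1, x2, y2 in pattern.findall(output):
+     x1, y1, x2, y2 = map(float, (x1, y1, x2, y2))
+     # Assumption: if coordinates look normalized (all <= 1), scale them to pixel space
+     if max(x1, y1, x2, y2) <= 1.0:
+         x1, x2 = x1 * image.width, x2 * image.width
+         y1, y2 = y1 * image.height, y2 * image.height
+     draw.rectangle([x1, y1, x2, y2], outline="red", width=2)
+ image.save("image_with_boxes.jpg")
+ ```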