# ChatbotRAG - Complete Summary

## System Overview

The ChatbotRAG system has been comprehensively upgraded with the following advanced features:

### ✨ Key Features

1. **Multiple Inputs Support** (/index)
   - Index up to 10 texts + 10 images at once
   - Embeddings are averaged automatically

2. **Advanced RAG Pipeline** (/chat)
   - Query Expansion
   - Multi-Query Retrieval
   - Reranking with semantic similarity
   - Contextual Compression
   - Better Prompt Engineering

3. **PDF Support** (/upload-pdf)
   - Parse PDFs into chunks
   - Automatic chunking with overlap
   - Index into the RAG system

4. **Multimodal PDF** (/upload-pdf-multimodal) ⭐ NEW
   - Extract text + image URLs from the PDF
   - Link images to text chunks
   - Return images together with text in chat
   - Ideal for user guides with screenshots

---

## System Architecture

```
┌───────────────────────────────────────────────────────────────┐
│                      FastAPI Application                      │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐         │
│  │   Indexing   │  │   Search     │  │   Chat       │         │
│  │   Endpoints  │  │   Endpoints  │  │   Endpoint   │         │
│  └──────────────┘  └──────────────┘  └──────────────┘         │
│                                                               │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌──────────────────────────────────────────────────┐         │
│  │              Advanced RAG Pipeline               │         │
│  │  • Query Expansion                               │         │
│  │  • Multi-Query Retrieval                         │         │
│  │  • Reranking                                     │         │
│  │  • Contextual Compression                        │         │
│  └──────────────────────────────────────────────────┘         │
│                                                               │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐         │
│  │   Jina CLIP  │  │   Qdrant     │  │   MongoDB    │         │
│  │   v2         │  │   Vector DB  │  │   Documents  │         │
│  └──────────────┘  └──────────────┘  └──────────────┘         │
│                                                               │
│  ┌──────────────┐  ┌──────────────┐                           │
│  │   PDF        │  │  Multimodal  │                           │
│  │   Parser     │  │  PDF Parser  │                           │
│  └──────────────┘  └──────────────┘                           │
│                                                               │
└───────────────────────────────────────────────────────────────┘
```
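
As a rough illustration of the Multi-Query Retrieval stage in the diagram, the sketch below merges the hits returned for several query variants and keeps the best score per document. The `Hit` type and the `merge_hits` helper are hypothetical names chosen for this example; this summary does not show how `advanced_rag.py` actually does it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One retrieved chunk (hypothetical shape, for illustration only)."""
    doc_id: str
    score: float


def merge_hits(result_lists: list[list[Hit]], top_k: int = 5) -> list[Hit]:
    """Merge hits from several query variants, keeping each doc's best score."""
    best: dict[str, Hit] = {}
    for hits in result_lists:
        for hit in hits:
            current = best.get(hit.doc_id)
            if current is None or hit.score > current.score:
                best[hit.doc_id] = hit
    # Highest score first; ties broken by doc_id so the order is deterministic.
    ranked = sorted(best.values(), key=lambda h: (-h.score, h.doc_id))
    return ranked[:top_k]


if __name__ == "__main__":
    original = [Hit("a", 0.62), Hit("b", 0.55)]   # hits for the user's query
    expanded = [Hit("b", 0.71), Hit("c", 0.48)]   # hits for an expanded query
    for hit in merge_hits([original, expanded], top_k=2):
        print(hit.doc_id, hit.score)
```

Keeping the maximum score per document is only one possible merge rule; summing scores or fusing ranks are common alternatives.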

---

## Key Files

### Core System
- **main.py** - FastAPI application with all endpoints
- **embedding_service.py** - Jina CLIP v2 embedding
- **qdrant_service.py** - Qdrant vector DB operations
- **advanced_rag.py** - Advanced RAG pipeline

### PDF Processing
- **pdf_parser.py** - Basic PDF parser (text only)
- **multimodal_pdf_parser.py** - Multimodal PDF parser (text + images)
- **batch_index_pdfs.py** - Batch indexing script

### Documentation
- **ADVANCED_RAG_GUIDE.md** - Advanced RAG features guide
- **PDF_RAG_GUIDE.md** - PDF usage guide
- **MULTIMODAL_PDF_GUIDE.md** - Multimodal PDF guide ⭐
- **QUICK_START_PDF.md** - Quick start for PDF
- **chatbot_guide_template.md** - Template for user guide PDF

### Testing
- **test_advanced_features.py** - Test advanced features
- **test_pdf_chatbot.py** - Test PDF chatbot (example in docs)

---

## API Endpoints

### 1. Indexing

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/index` | POST | Index texts + images (max 10 each) |
| `/documents` | POST | Add text document |
| `/upload-pdf` | POST | Upload PDF (text only) |
| `/upload-pdf-multimodal` | POST | Upload PDF with images ⭐ |

### 2. Search

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/search` | POST | Hybrid search (text + image) |
| `/search/text` | POST | Text-only search |
| `/search/image` | POST | Image-only search |
| `/rag/search` | POST | RAG knowledge base search |

### 3. Chat

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/chat` | POST | Chat with Advanced RAG |

### 4. Management

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/documents/pdf` | GET | List all PDFs |
| `/documents/pdf/{id}` | DELETE | Delete PDF document |
| `/delete/{doc_id}` | DELETE | Delete document |
| `/document/{doc_id}` | GET | Get document by ID |
| `/history` | GET | Get chat history |
| `/stats` | GET | Collection statistics |
| `/` | GET | Health check + API docs |

---

## Use Cases & Recommendations

### Case 1: Text-Only Guide PDF

**Scenario:** FAQ, policy document, text guide

**Solution:** `/upload-pdf`

```bash
curl -X POST "http://localhost:8000/upload-pdf" \
  -F "file=@your_document.pdf" \
  -F "title=FAQ"
```

### Case 2: Guide PDF with Images ⭐ (Your Case)

**Scenario:** User guide with screenshots, tutorial with diagrams

**Solution:** `/upload-pdf-multimodal`

```bash
curl -X POST "http://localhost:8000/upload-pdf-multimodal" \
  -F "file=@user_guide_with_images.pdf" \
  -F "title=User Guide" \
  -F "category=guide"
```

**Benefits:**
- ✓ Extracts text + image URLs
- ✓ Links images to text chunks
- ✓ The chatbot returns images in its response
- ✓ Visual context for users

### Case 3: Multiple Social Media Posts

**Scenario:** Index multiple posts with texts and images

**Solution:** `/index` with multiple inputs

```python
import requests

data = {
    'id': 'post123',
    'texts': ['Post text 1', 'Post text 2'],  # Up to 10 texts
}
# Each ('images', file) pair adds one image to the upload (up to 10).
files = [
    ('images', open('img1.jpg', 'rb')),
    ('images', open('img2.jpg', 'rb')),
]
requests.post('http://localhost:8000/index', data=data, files=files)
```
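
The feature list says embeddings for multiple inputs are averaged automatically. A minimal sketch of what that step might look like follows; `average_embeddings` is a hypothetical helper, and the final L2 normalization is an assumption, since this summary does not say whether `embedding_service.py` normalizes the result.

```python
import math


def average_embeddings(vectors: list[list[float]]) -> list[float]:
    """Average equal-length vectors and L2-normalize the result."""
    if not vectors:
        raise ValueError("at least one vector is required")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    # Element-wise mean across all input vectors.
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    # A zero vector cannot be normalized; return it unchanged (assumption).
    return mean if norm == 0 else [x / norm for x in mean]
```

Normalizing after averaging keeps cosine-similarity scores comparable between documents indexed from one input and documents indexed from many.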

### Case 4: Complex Queries

**Scenario:** Complex questions that need high accuracy

**Solution:** Advanced RAG with full options

```python
{
    'message': 'Complex question',
    'use_rag': True,
    'use_advanced_rag': True,
    'use_reranking': True,
    'use_compression': True,
    'score_threshold': 0.5,
    'top_k': 5
}
```
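
The `use_compression` option above enables contextual compression. This summary does not show how `advanced_rag.py` implements it, so the following is only a minimal sketch of the general idea, assuming a simple word-overlap heuristic. `compress_context` is a hypothetical helper, not part of the project.

```python
import re


def compress_context(query: str, context: str, max_sentences: int = 2) -> str:
    """Keep the sentences that share the most words with the query."""
    query_terms = set(re.findall(r"\w+", query.lower()))
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", context) if s.strip()]
    scored = []
    for position, sentence in enumerate(sentences):
        terms = set(re.findall(r"\w+", sentence.lower()))
        overlap = len(query_terms & terms)
        if overlap > 0:
            scored.append((overlap, position, sentence))
    # Take the best-overlapping sentences, then restore document order.
    best = sorted(scored, key=lambda item: (-item[0], item[1]))[:max_sentences]
    return " ".join(s for _, _, s in sorted(best, key=lambda item: item[1]))
```

A real pipeline would more likely score sentences with embedding similarity; word overlap is used here only to keep the example dependency-free.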

---

## Recommended Workflow

### Initial Setup

1. **Create the user guide PDF**
   - Use the template: `chatbot_guide_template.md`
   - Customize the content for your system
   - Add image URLs (screenshots, diagrams)
   - Convert to PDF: `pandoc template.md -o guide.pdf`

2. **Upload PDF**
   ```bash
   curl -X POST "http://localhost:8000/upload-pdf-multimodal" \
     -F "file=@chatbot_user_guide.pdf" \
     -F "title=ChatbotRAG User Guide" \
     -F "category=user_guide"
   ```

3. **Verify**
   ```bash
   curl http://localhost:8000/documents/pdf
   # Check "type": "multimodal_pdf" and "total_images"
   ```

### Daily Use

1. **Chat with users**
   ```python
   response = requests.post('http://localhost:8000/chat', json={
       'message': user_question,
       'use_rag': True,
       'use_advanced_rag': True,
       'hf_token': 'your_token'
   })
   ```

2. **Display response + images**
   ```python
   # Text answer
   print(response.json()['response'])

   # Images (if any)
   for ctx in response.json()['context_used']:
       if ctx['metadata'].get('has_images'):
           for url in ctx['metadata']['image_urls']:
               # Display image in your UI
               print(f"Image: {url}")
   ```
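
If the same image is attached to several retrieved chunks, the loop above prints it more than once. A small helper like the following could collect each URL once, in first-seen order. `collect_image_urls` is a hypothetical name, and the `sample` dictionary only mimics the response fields used above (`context_used`, `metadata`, `has_images`, `image_urls`).

```python
def collect_image_urls(chat_response: dict) -> list[str]:
    """Return unique image URLs from a /chat response, in first-seen order."""
    urls: list[str] = []
    seen: set[str] = set()
    for ctx in chat_response.get("context_used", []):
        metadata = ctx.get("metadata", {})
        if not metadata.get("has_images"):
            continue
        for url in metadata.get("image_urls", []):
            if url not in seen:
                seen.add(url)
                urls.append(url)
    return urls


# Hand-written sample shaped like the response fields used in this guide.
sample = {
    "response": "Click the Settings icon.",
    "context_used": [
        {"metadata": {"has_images": True, "image_urls": ["https://example.com/a.png"]}},
        {"metadata": {"has_images": False}},
        {"metadata": {"has_images": True,
                      "image_urls": ["https://example.com/a.png", "https://example.com/b.png"]}},
    ],
}
print(collect_image_urls(sample))
```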

### Updating Content

1. **Update the PDF** - Edit and re-export
2. **Delete the old PDF**
   ```bash
   curl -X DELETE http://localhost:8000/documents/pdf/old_doc_id
   ```
3. **Upload the new PDF**
   ```bash
   curl -X POST http://localhost:8000/upload-pdf-multimodal -F "file=@new_guide.pdf"
   ```

---

## Performance Tips

### 1. Chunking

**Default:**
- chunk_size: 500 words
- chunk_overlap: 50 words

**Optimized:**
```python
# In multimodal_pdf_parser.py
parser = MultimodalPDFParser(
    chunk_size=400,      # Shorter for faster retrieval
    chunk_overlap=40,
    min_chunk_size=50
)
```
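
To show how `chunk_size`, `chunk_overlap`, and `min_chunk_size` interact, here is a minimal word-based chunking sketch. `chunk_words` is a hypothetical stand-in, not the code in `multimodal_pdf_parser.py`. In particular, treating `min_chunk_size` as "drop a trailing fragment shorter than this" is an assumption.

```python
def chunk_words(text: str, chunk_size: int = 500, chunk_overlap: int = 50,
                min_chunk_size: int = 50) -> list[str]:
    """Split text into word-based chunks that overlap by chunk_overlap words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far each new chunk advances
    chunks: list[str] = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        # Drop a short trailing fragment, but always keep the first chunk.
        if chunks and len(window) < min_chunk_size:
            break
        chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

With `chunk_size=400` and `chunk_overlap=40`, each chunk starts 360 words after the previous one, so a 1,000-word document yields three chunks.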

### 2. Retrieval

**Recommended settings:**
```python
{
    'top_k': 5,              # 3-7 is optimal
    'score_threshold': 0.5,   # 0.4-0.6 is good
    'use_reranking': True,    # Always enable
    'use_compression': True   # Keeps context relevant
}
```
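
To make the interaction between `top_k` and `score_threshold` concrete, the sketch below reranks candidates by cosine similarity, drops those under the threshold, and keeps the best `top_k`. `cosine` and `rerank` are hypothetical helpers written for this example. The real pipeline would work on Jina CLIP vectors and Qdrant results rather than hand-written lists.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rerank(query_vec: list[float], candidates: list[tuple[str, list[float]]],
           top_k: int = 5, score_threshold: float = 0.5) -> list[tuple[float, str]]:
    """Score (text, vector) candidates, drop weak ones, keep the top_k."""
    scored = [(cosine(query_vec, vec), text) for text, vec in candidates]
    kept = [(score, text) for score, text in scored if score >= score_threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:top_k]
```

The threshold is applied before the `top_k` cut, so a high `score_threshold` can return fewer than `top_k` results, or none. That is why lowering the threshold is the first suggestion when the chatbot finds nothing.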

### 3. LLM

**For factual answers:**
```python
{
    'temperature': 0.3,   # Low for accuracy
    'max_tokens': 512,    # Concise answers
    'top_p': 0.9
}
```

---

## Troubleshooting

### Issue 1: Images are not detected

**Solution:**
- Verify that the PDF contains image URLs (http://, https://)
- Check the format: markdown `![](url)` or HTML `<img src>`
- Test regex:
  ```python
  from multimodal_pdf_parser import MultimodalPDFParser
  parser = MultimodalPDFParser()
  urls = parser.extract_image_urls("![](https://example.com/img.png)")
  print(urls)  # Should return ['https://example.com/img.png']
  ```
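
If you want to check a string without importing the parser, a standalone sketch of the two supported formats might look like this. The patterns and the `find_image_urls` name are assumptions for illustration; the actual regex inside `extract_image_urls` may differ.

```python
import re

# Markdown image: ![alt](url) with an http(s) URL.
MARKDOWN_IMAGE = re.compile(r"!\[[^\]]*\]\((https?://[^\s)]+)\)")
# HTML image: <img ... src="url" ...> with single or double quotes.
HTML_IMAGE = re.compile(r"<img\b[^>]*?\bsrc=[\"'](https?://[^\"']+)[\"']", re.IGNORECASE)


def find_image_urls(text: str) -> list[str]:
    """Return http(s) image URLs found in markdown or HTML image syntax."""
    matches = [(m.start(), m.group(1)) for m in MARKDOWN_IMAGE.finditer(text)]
    matches += [(m.start(), m.group(1)) for m in HTML_IMAGE.finditer(text)]
    # Sort by position so URLs come back in document order.
    return [url for _, url in sorted(matches)]
```

Relative paths such as `![](local.png)` are deliberately not matched, mirroring the advice above that image URLs must start with `http://` or `https://`.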

### Issue 2: Chatbot cannot find the information

**Solution:**
- Lower score_threshold: `0.3-0.5`
- Increase top_k: `5-10`
- Enable Advanced RAG
- Rephrase question

### Issue 3: Responses are too slow

**Solution:**
- Reduce top_k
- Disable compression if it is not needed
- Use basic RAG instead of advanced RAG for simple queries

---

## Next Steps

### Immediate (Now)

1. ✓ The system is ready!
2. Create your guide PDF
3. Upload it via `/upload-pdf-multimodal`
4. Test with real questions

### Short Term (1-2 weeks)

1. Collect user feedback
2. Fine-tune parameters (top_k, threshold)
3. Add more PDFs (FAQ, tutorials, etc.)
4. Monitor chat history to improve content

### Long Term (Later)

1. **Hybrid Search with BM25**
   - Combine dense + sparse retrieval
   - Better for keyword queries

2. **Cross-Encoder Reranking**
   - Replace embedding similarity
   - More accurate ranking

3. **Image Processing**
   - Download and process actual images
   - Use Jina CLIP for image embeddings
   - True multimodal embeddings (text + image vectors)

4. **RAG-Anything Integration** (if needed)
   - For complex PDFs with tables, charts
   - Vision encoder for embedded images
   - Advanced document understanding
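
For the hybrid search item, one common way to combine a dense (embedding) ranking with a sparse (BM25) ranking is reciprocal rank fusion. The sketch below is only one possible approach, not something the current system implements. `reciprocal_rank_fusion` is a hypothetical helper, and `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of doc IDs; each list contributes 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc_id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    dense = ["a", "b", "c"]    # ranking from vector search
    sparse = ["b", "d", "a"]   # ranking from a keyword (BM25) search
    print(reciprocal_rank_fusion([dense, sparse]))
```

Rank fusion needs only the order of results, so it avoids having to rescale BM25 scores and cosine similarities onto a common range.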

---

## Comparison Matrix

| Approach | Text | Images | URLs | Complexity | Your Case |
|----------|------|--------|------|------------|-----------|
| Basic RAG | ✓ | ✗ | ✗ | Low | ✗ |
| PDF Parser | ✓ | ✗ | ✗ | Low | ✗ |
| **Multimodal PDF** | ✓ | ✗ | ✓ | **Medium** | **✓** |
| RAG-Anything | ✓ | ✓ | ✓ | High | Overkill |

**Recommendation:** **Multimodal PDF** is the best fit for your case.

---

## Conclusion

### What You Have

- ✅ **Multiple Inputs**: Index 10 texts + 10 images
- ✅ **Advanced RAG**: Query expansion, reranking, compression
- ✅ **PDF Support**: Parse and index PDFs
- ✅ **Multimodal PDF**: Extract text + image URLs and link them together
- ✅ **Complete Documentation**: Guides, examples, troubleshooting

### What's Next?

1. **Create a guide PDF** with your own content (including image URLs)
2. **Upload** it via `/upload-pdf-multimodal`
3. **Test** with real questions
4. **Iterate** - fine-tune based on feedback

### Files to Read

**For PDFs with images (your case):**
- [MULTIMODAL_PDF_GUIDE.md](MULTIMODAL_PDF_GUIDE.md) ⭐⭐⭐
- [PDF_RAG_GUIDE.md](PDF_RAG_GUIDE.md)

**For Advanced RAG:**
- [ADVANCED_RAG_GUIDE.md](ADVANCED_RAG_GUIDE.md)

**Quick Start:**
- [QUICK_START_PDF.md](QUICK_START_PDF.md)

---

**Your system is now very capable. Just upload a PDF and start chatting! 🚀📄🤖**