# ChatbotRAG - Complete Summary
## System Overview
The ChatbotRAG system has been comprehensively upgraded with advanced features:
### ✨ Key Features
1. **Multiple Inputs Support** (/index)
- Index up to 10 texts + 10 images at once
- Embeddings are averaged automatically
2. **Advanced RAG Pipeline** (/chat)
- Query Expansion
- Multi-Query Retrieval
- Reranking with semantic similarity
- Contextual Compression
- Better Prompt Engineering
3. **PDF Support** (/upload-pdf)
- Parse PDFs into chunks
- Automatic chunking with overlap
- Index into the RAG system
4. **Multimodal PDF** (/upload-pdf-multimodal) ⭐ NEW
- Extract text + image URLs from PDFs
- Link images with text chunks
- Return images together with text in chat
- Ideal for user guides with screenshots
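The automatic averaging in `/index` can be pictured as follows. This is a minimal sketch, not the code in `embedding_service.py`: the function name `average_embeddings` is hypothetical, and re-normalizing the mean to unit length is an assumption that is common when vectors are compared by cosine similarity.

```python
import math

def average_embeddings(vectors):
    """Average equal-length vectors element-wise, then L2-normalize the result."""
    if not vectors:
        raise ValueError("need at least one embedding")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all embeddings must have the same dimension")
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    # A zero vector cannot be normalized; return it unchanged.
    return mean if norm == 0 else [x / norm for x in mean]

# Toy 2-dimensional vectors standing in for one text and one image embedding.
combined = average_embeddings([[1.0, 0.0], [0.0, 1.0]])
```

With this approach a post that has several texts and images is stored as a single vector, so one search hit corresponds to one post.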
---
## System Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                     FastAPI Application                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │   Indexing   │  │    Search    │  │     Chat     │       │
│  │  Endpoints   │  │  Endpoints   │  │   Endpoint   │       │
│  └──────────────┘  └──────────────┘  └──────────────┘       │
│                                                             │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────────────────────────────────────────────────┐   │
│  │                Advanced RAG Pipeline                 │   │
│  │  • Query Expansion                                   │   │
│  │  • Multi-Query Retrieval                             │   │
│  │  • Reranking                                         │   │
│  │  • Contextual Compression                            │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                             │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │  Jina CLIP   │  │    Qdrant    │  │   MongoDB    │       │
│  │      v2      │  │  Vector DB   │  │  Documents   │       │
│  └──────────────┘  └──────────────┘  └──────────────┘       │
│                                                             │
│  ┌──────────────┐  ┌──────────────┐                         │
│  │     PDF      │  │  Multimodal  │                         │
│  │    Parser    │  │  PDF Parser  │                         │
│  └──────────────┘  └──────────────┘                         │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
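One way to picture the Multi-Query Retrieval stage: each expanded query returns its own hit list, and the lists are merged before reranking. This is a minimal sketch, assuming hits are `(doc_id, score)` pairs and that a document found by several queries keeps its best score; `merge_hits` is a hypothetical helper, not a function from `advanced_rag.py`.

```python
def merge_hits(hit_lists):
    """Merge per-query hit lists, keeping the highest score for each document."""
    best = {}
    for hits in hit_lists:
        for doc_id, score in hits:
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
    # Highest score first; ties are broken by doc_id for a stable order.
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))

# Hits for the original query and for one expanded variant (made-up scores).
merged = merge_hits([
    [("doc_a", 0.62), ("doc_b", 0.48)],
    [("doc_b", 0.71), ("doc_c", 0.40)],
])
```

Here `doc_b` appears in both lists and is kept once with its higher score, so the merged list contains three distinct documents.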
---
## Important Files
### Core System
- **main.py** - FastAPI application with all endpoints
- **embedding_service.py** - Jina CLIP v2 embedding
- **qdrant_service.py** - Qdrant vector DB operations
- **advanced_rag.py** - Advanced RAG pipeline
### PDF Processing
- **pdf_parser.py** - Basic PDF parser (text only)
- **multimodal_pdf_parser.py** - Multimodal PDF parser (text + images)
- **batch_index_pdfs.py** - Batch indexing script
### Documentation
- **ADVANCED_RAG_GUIDE.md** - Advanced RAG features guide
- **PDF_RAG_GUIDE.md** - PDF usage guide
- **MULTIMODAL_PDF_GUIDE.md** - Multimodal PDF guide ⭐
- **QUICK_START_PDF.md** - Quick start for PDF
- **chatbot_guide_template.md** - Template for user guide PDF
### Testing
- **test_advanced_features.py** - Test advanced features
- **test_pdf_chatbot.py** - Test PDF chatbot (example in docs)
---
## API Endpoints
### 1. Indexing
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/index` | POST | Index texts + images (max 10 each) |
| `/documents` | POST | Add text document |
| `/upload-pdf` | POST | Upload PDF (text only) |
| `/upload-pdf-multimodal` | POST | Upload PDF with images ⭐ |
### 2. Search
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/search` | POST | Hybrid search (text + image) |
| `/search/text` | POST | Text-only search |
| `/search/image` | POST | Image-only search |
| `/rag/search` | POST | RAG knowledge base search |
### 3. Chat
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/chat` | POST | Chat with Advanced RAG |
### 4. Management
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/documents/pdf` | GET | List all PDFs |
| `/documents/pdf/{id}` | DELETE | Delete PDF document |
| `/delete/{doc_id}` | DELETE | Delete document |
| `/document/{doc_id}` | GET | Get document by ID |
| `/history` | GET | Get chat history |
| `/stats` | GET | Collection statistics |
| `/` | GET | Health check + API docs |
---
## Use Cases & Recommendations
### Case 1: Text-Only Guide PDF
**Scenario:** FAQ, policy document, text guide
**Solution:** `/upload-pdf`
```bash
# Replace your_document.pdf with the path to your PDF
curl -X POST "http://localhost:8000/upload-pdf" \
  -F "file=@your_document.pdf" \
  -F "title=FAQ"
```
### Case 2: Guide PDF with Images ⭐ (Your Case)
**Scenario:** User guide with screenshots, tutorial with diagrams
**Solution:** `/upload-pdf-multimodal`
```bash
curl -X POST "http://localhost:8000/upload-pdf-multimodal" \
-F "file=@user_guide_with_images.pdf" \
-F "title=User Guide" \
-F "category=guide"
```
**Benefits:**
- ✓ Extracts text + image URLs
- ✓ Links images with text chunks
- ✓ Chatbot returns images in its response
- ✓ Visual context for users
### Case 3: Multiple Social Media Posts
**Scenario:** Index many posts, each with texts and images
**Solution:** `/index` with multiple inputs
```python
import requests

data = {
    'id': 'post123',
    'texts': ['Post text 1', 'Post text 2'],  # Up to 10 texts
}
# Repeat the 'images' field once per file (up to 10 images)
files = [
    ('images', open('img1.jpg', 'rb')),
    ('images', open('img2.jpg', 'rb')),
]
requests.post('http://localhost:8000/index', data=data, files=files)
```
### Case 4: Complex Queries
**Scenario:** Complex questions that need high accuracy
**Solution:** Advanced RAG with full options
```python
{
    'message': 'Complex question',
    'use_rag': True,
    'use_advanced_rag': True,
    'use_reranking': True,
    'use_compression': True,
    'score_threshold': 0.5,
    'top_k': 5
}
```
---
## Recommended Workflow
### Initial Setup
1. **Create the user guide PDF**
- Start from the template: `chatbot_guide_template.md`
- Customize the content for your system
- Add image URLs (screenshots, diagrams)
- Convert to PDF: `pandoc template.md -o guide.pdf`
2. **Upload PDF**
```bash
curl -X POST "http://localhost:8000/upload-pdf-multimodal" \
  -F "file=@chatbot_user_guide.pdf" \
  -F "title=ChatbotRAG User Guide" \
  -F "category=user_guide"
```
3. **Verify**
```bash
curl http://localhost:8000/documents/pdf
# Check for "type": "multimodal_pdf" and "total_images"
```
### Daily Use
1. **Chat with the user**
```python
import requests

# user_question holds the message typed by the user
response = requests.post('http://localhost:8000/chat', json={
    'message': user_question,
    'use_rag': True,
    'use_advanced_rag': True,
    'hf_token': 'your_token'
})
```
2. **Display response + images**
```python
# Text answer
print(response.json()['response'])
# Images (if any)
for ctx in response.json()['context_used']:
    if ctx['metadata'].get('has_images'):
        for url in ctx['metadata']['image_urls']:
            # Display image in your UI
            print(f"Image: {url}")
```
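The loop above can be wrapped in a small helper so that the UI layer receives a flat list of images. This is a sketch that assumes only the response fields used above (`context_used`, `metadata`, `has_images`, `image_urls`); the helper name `collect_image_urls` and the sample response are hypothetical.

```python
def collect_image_urls(chat_response):
    """Return the image URLs from a /chat response, in order, without duplicates."""
    urls = []
    for ctx in chat_response.get("context_used", []):
        metadata = ctx.get("metadata", {})
        if not metadata.get("has_images"):
            continue
        for url in metadata.get("image_urls", []):
            if url not in urls:
                urls.append(url)
    return urls

# A hand-written response with the same shape as the example above.
sample = {
    "response": "Click the Settings button.",
    "context_used": [
        {"metadata": {"has_images": True,
                      "image_urls": ["https://example.com/a.png"]}},
        {"metadata": {"has_images": False}},
        {"metadata": {"has_images": True,
                      "image_urls": ["https://example.com/a.png",
                                     "https://example.com/b.png"]}},
    ],
}
image_urls = collect_image_urls(sample)
```

Deduplicating is useful because overlapping chunks may reference the same screenshot.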
### Updating Content
1. **Update the PDF** - Edit and re-export
2. **Delete the old PDF**
```bash
curl -X DELETE http://localhost:8000/documents/pdf/old_doc_id
```
3. **Upload the new PDF**
```bash
curl -X POST http://localhost:8000/upload-pdf-multimodal -F "file=@new_guide.pdf"
```
---
## Performance Tips
### 1. Chunking
**Default:**
- chunk_size: 500 words
- chunk_overlap: 50 words
**Optimized:**
```python
# In multimodal_pdf_parser.py
parser = MultimodalPDFParser(
    chunk_size=400,  # Shorter for faster retrieval
    chunk_overlap=40,
    min_chunk_size=50
)
```
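To make `chunk_size` and `chunk_overlap` concrete, here is a minimal word-based chunker. It is a sketch under the assumption that both values are counted in words, as the defaults above state; `chunk_words` is a hypothetical function and not the implementation in `multimodal_pdf_parser.py`.

```python
def chunk_words(text, chunk_size=500, chunk_overlap=50):
    """Split text into chunks of at most chunk_size words that share chunk_overlap words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

# Ten numbered "words", windows of 4 words with 1 word of overlap.
chunks = chunk_words(" ".join(str(n) for n in range(10)), chunk_size=4, chunk_overlap=1)
```

Each chunk repeats the last word of the previous one, which is what keeps a sentence that straddles a boundary retrievable from either side.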
### 2. Retrieval
**Good settings:**
```python
{
    'top_k': 5,               # 3-7 is optimal
    'score_threshold': 0.5,   # 0.4-0.6 is good
    'use_reranking': True,    # Always enable
    'use_compression': True   # Keeps context relevant
}
```
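How `score_threshold` and `top_k` interact is easiest to see on a toy ranking. The sketch below assumes the threshold is applied to similarity scores before the list is cut to `top_k`; the actual order of operations in the pipeline may differ, and `select_contexts` is a hypothetical name.

```python
def select_contexts(scored_chunks, top_k=5, score_threshold=0.5):
    """Keep chunks scoring at least score_threshold, best first, at most top_k of them."""
    kept = [(text, score) for text, score in scored_chunks if score >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]

# Made-up similarity scores for four retrieved chunks.
candidates = [("chunk A", 0.82), ("chunk B", 0.47), ("chunk C", 0.65), ("chunk D", 0.51)]
selected = select_contexts(candidates, top_k=2, score_threshold=0.5)
```

Lowering the threshold to 0.4 would admit chunk B as a candidate, but with `top_k=2` the result stays the same, which is why the two settings are usually tuned together.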
### 3. LLM
**For factual answers:**
```python
{
    'temperature': 0.3,  # Low for accuracy
    'max_tokens': 512,   # Concise answers
    'top_p': 0.9
}
```
---
## Troubleshooting
### Issue 1: Images are not detected
**Solution:**
- Verify that the PDF contains image URLs (http://, https://)
- Check the format: markdown `![](url)` or HTML `<img src>`
- Test regex:
```python
from multimodal_pdf_parser import MultimodalPDFParser
parser = MultimodalPDFParser()
urls = parser.extract_image_urls("![](https://example.com/img.png)")
print(urls) # Should return ['https://example.com/img.png']
```
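If the parser is not at hand, the same check can be approximated with standalone regular expressions. This is an assumption about what counts as a detectable image (an absolute `http://` or `https://` URL in markdown or `<img>` syntax), not the pattern used inside `MultimodalPDFParser`; `find_image_urls` is a hypothetical helper.

```python
import re

# Markdown images: ![alt](https://...)  and HTML images: <img src="https://...">
MARKDOWN_IMAGE = re.compile(r'!\[[^\]]*\]\((https?://[^\s)]+)\)')
HTML_IMAGE = re.compile(r'<img[^>]*\bsrc=["\'](https?://[^"\']+)["\']', re.IGNORECASE)

def find_image_urls(text):
    """Return absolute image URLs found in markdown or HTML image syntax."""
    return MARKDOWN_IMAGE.findall(text) + HTML_IMAGE.findall(text)

sample = (
    "Step 1 ![](https://example.com/img.png) "
    'Step 2 <img src="https://example.com/shot.jpg" alt="x"> '
    "Step 3 ![local](images/local.png)"
)
found = find_image_urls(sample)
```

The relative path in step 3 is skipped, which matches the first bullet above: only `http://` and `https://` URLs are picked up.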
### Issue 2: Chatbot cannot find the information
**Solution:**
- Lower score_threshold: `0.3-0.5`
- Increase top_k: `5-10`
- Enable Advanced RAG
- Rephrase question
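These adjustments can also be applied programmatically when a query returns no context. This is a minimal sketch, assuming the caller retries `/chat` with each settings dict in turn; the step sizes are illustrative, the floor of 0.3 and ceiling of 10 simply mirror the ranges listed above, and `relaxed_settings` is a hypothetical helper.

```python
def relaxed_settings(top_k=5, score_threshold=0.5, min_threshold=0.3, max_top_k=10):
    """Yield progressively looser retrieval settings, starting with the given ones."""
    while True:
        yield {"top_k": top_k, "score_threshold": score_threshold,
               "use_advanced_rag": True}
        if score_threshold <= min_threshold and top_k >= max_top_k:
            return  # nothing left to relax
        # Round to avoid floating-point drift such as 0.30000000000000004.
        score_threshold = max(min_threshold, round(score_threshold - 0.1, 2))
        top_k = min(max_top_k, top_k + 2)

attempts = list(relaxed_settings())
```

A caller would stop at the first attempt whose response contains context, so the looser and slower settings are only used when the stricter ones find nothing.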
### Issue 3: Responses are too slow
**Solution:**
- Reduce top_k
- Disable compression if it is not needed
- Use basic RAG instead of advanced RAG for simple queries
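The speed versus accuracy trade-off can be encoded as two presets. The option names come from the `/chat` examples in this document; the preset values and the helper `build_chat_payload` are illustrative assumptions, and deciding which queries count as simple is left to the caller.

```python
# Fast preset: basic RAG, no compression, fewer chunks.
FAST = {"use_rag": True, "use_advanced_rag": False,
        "use_compression": False, "top_k": 3}
# Accurate preset: the full Advanced RAG options.
ACCURATE = {"use_rag": True, "use_advanced_rag": True, "use_reranking": True,
            "use_compression": True, "score_threshold": 0.5, "top_k": 5}

def build_chat_payload(message, fast=False):
    """Return a /chat request body using the fast or the accurate preset."""
    preset = FAST if fast else ACCURATE
    return {"message": message, **preset}

payload = build_chat_payload("Where is the export button?", fast=True)
```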
---
## Next Steps
### Immediate (Now)
1. ✓ The system is ready!
2. Create your guide PDF
3. Upload it via `/upload-pdf-multimodal`
4. Test with real questions
### Short Term (1-2 weeks)
1. Collect user feedback
2. Fine-tune parameters (top_k, threshold)
3. Add more PDFs (FAQ, tutorials, etc.)
4. Monitor chat history to improve content
### Long Term (Later)
1. **Hybrid Search with BM25**
- Combine dense + sparse retrieval
- Better for keyword queries
2. **Cross-Encoder Reranking**
- Replace embedding similarity
- More accurate ranking
3. **Image Processing**
- Download and process the actual images
- Use Jina CLIP for image embeddings
- True multimodal embeddings (text + image vectors)
4. **RAG-Anything Integration** (if needed)
- For complex PDFs with tables, charts
- Vision encoder for embedded images
- Advanced document understanding
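For item 1, a common way to combine a dense ranking with a sparse (BM25) ranking is Reciprocal Rank Fusion. The sketch below illustrates that technique on two already-ranked ID lists; it is not part of the current system, the constant `k=60` is the value conventionally used for RRF, and `reciprocal_rank_fusion` is a hypothetical name.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs; each list contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by doc_id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

dense_ranking = ["doc_a", "doc_b", "doc_c"]   # e.g. from vector search
sparse_ranking = ["doc_b", "doc_d", "doc_a"]  # e.g. from BM25
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
```

Because only ranks are used, the dense and sparse scores do not need to be on the same scale, which makes this a low-effort first step toward hybrid search.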
---
## Comparison Matrix
| Approach | Text | Images | URLs | Complexity | Your Case |
|----------|------|--------|------|------------|-----------|
| Basic RAG | ✓ | ✗ | ✗ | Low | ✗ |
| PDF Parser | ✓ | ✗ | ✗ | Low | ✗ |
| **Multimodal PDF** | ✓ | ✗ | ✓ | **Medium** | **✓** |
| RAG-Anything | ✓ | ✓ | ✓ | High | Overkill |
**Recommendation:** **Multimodal PDF** is the best fit for your case.
---
## Conclusion
### What Do You Have?
✅ **Multiple Inputs**: Index 10 texts + 10 images
✅ **Advanced RAG**: Query expansion, reranking, compression
✅ **PDF Support**: Parse and index PDFs
✅ **Multimodal PDF**: Extract text + image URLs and link them together
✅ **Complete Documentation**: Guides, examples, troubleshooting
### What's Next?
1. **Create a guide PDF** with your own content (including image URLs)
2. **Upload** it via `/upload-pdf-multimodal`
3. **Test** with real questions
4. **Iterate** - fine-tune based on feedback
### Files to Read
**For PDFs with images (your case):**
- [MULTIMODAL_PDF_GUIDE.md](MULTIMODAL_PDF_GUIDE.md) ⭐⭐⭐
- [PDF_RAG_GUIDE.md](PDF_RAG_GUIDE.md)
**For Advanced RAG:**
- [ADVANCED_RAG_GUIDE.md](ADVANCED_RAG_GUIDE.md)
**Quick Start:**
- [QUICK_START_PDF.md](QUICK_START_PDF.md)
---
**Your system is now very capable! Just upload a PDF and start chatting! 🚀📄🤖**