ChatbotRAG - Complete Summary

System Overview

The ChatbotRAG system has been fully upgraded with the following advanced features:

✨ Key Features

Multiple Inputs Support (/index)
- Index up to 10 texts + 10 images at once
- Embeddings are averaged automatically
Advanced RAG Pipeline (/chat)
- Query Expansion
- Multi-Query Retrieval
- Reranking with semantic similarity
- Contextual Compression
- Better Prompt Engineering
PDF Support (/upload-pdf)
- Parse PDFs into chunks
- Automatic chunking with overlap
- Index into the RAG system
Multimodal PDF (/upload-pdf-multimodal) ⭐ NEW
- Extract text + image URLs from the PDF
- Link images to text chunks
- Return images alongside text in chat
- Well suited to user guides with screenshots
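
The automatic embedding averaging mentioned for /index can be pictured with a short sketch. This is not the code in embedding_service.py: the function name `average_embeddings` is hypothetical, and the L2 normalization step is an assumption (a common choice for cosine-similarity search) that this summary does not confirm.

```python
import math
from typing import List, Sequence


def average_embeddings(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average equal-length vectors element-wise, then L2-normalize the mean."""
    if not vectors:
        raise ValueError("at least one embedding is required")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all embeddings must have the same dimension")

    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    # Assumption: normalize so a dot product behaves like cosine similarity.
    return [x / norm for x in mean] if norm > 0 else mean


# Toy 2-dimensional vectors standing in for one text and one image embedding.
text_vec = [1.0, 0.0]
image_vec = [0.0, 1.0]
combined = average_embeddings([text_vec, image_vec])
print(combined)
```

Averaging yields one vector per document, which keeps the index small but blends the individual texts and images together.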

System Architecture

┌─────────────────────────────────────────────────────────────┐
│                    FastAPI Application                       │
├─────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │   Indexing   │  │   Search     │  │   Chat       │      │
│  │   Endpoints  │  │   Endpoints  │  │   Endpoint   │      │
│  └──────────────┘  └──────────────┘  └──────────────┘      │
│                                                               │
├─────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌──────────────────────────────────────────────────────┐   │
│  │            Advanced RAG Pipeline                      │   │
│  │  • Query Expansion                                    │   │
│  │  • Multi-Query Retrieval                              │   │
│  │  • Reranking                                          │   │
│  │  • Contextual Compression                             │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                               │
├─────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │   Jina CLIP  │  │   Qdrant     │  │   MongoDB    │      │
│  │   v2         │  │   Vector DB  │  │   Documents  │      │
│  └──────────────┘  └──────────────┘  └──────────────┘      │
│                                                               │
│  ┌──────────────┐  ┌──────────────┐                         │
│  │   PDF        │  │  Multimodal  │                         │
│  │   Parser     │  │  PDF Parser  │                         │
│  └──────────────┘  └──────────────┘                         │
│                                                               │
└─────────────────────────────────────────────────────────────┘
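
The retrieval stages in the diagram (multi-query retrieval followed by reranking) can be sketched roughly as below. This is an illustration under stated assumptions, not the logic in advanced_rag.py: the names `cosine` and `merge_and_rerank` are hypothetical, and each hit is assumed to be a dict carrying an `id` and a `vector`.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_and_rerank(
    query_vec: Sequence[float],
    result_lists: List[List[Dict]],
    top_k: int = 5,
) -> List[Dict]:
    """Deduplicate hits from several query variants, then rerank them
    by similarity to the original query vector."""
    unique: Dict[str, Dict] = {}
    for hits in result_lists:
        for hit in hits:
            unique.setdefault(hit["id"], hit)  # keep the first copy of each id
    ranked = sorted(
        unique.values(),
        key=lambda h: cosine(query_vec, h["vector"]),
        reverse=True,
    )
    return ranked[:top_k]


# Hits returned for the original query and for one expanded variant.
query = [1.0, 0.0]
from_original = [{"id": "a", "vector": [0.9, 0.1]}, {"id": "b", "vector": [0.1, 0.9]}]
from_expanded = [{"id": "b", "vector": [0.1, 0.9]}, {"id": "c", "vector": [0.7, 0.7]}]

top = merge_and_rerank(query, [from_original, from_expanded], top_k=2)
print([hit["id"] for hit in top])
```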

Key Files

Core System

main.py - FastAPI application with all endpoints
embedding_service.py - Jina CLIP v2 embedding
qdrant_service.py - Qdrant vector DB operations
advanced_rag.py - Advanced RAG pipeline

PDF Processing

pdf_parser.py - Basic PDF parser (text only)
multimodal_pdf_parser.py - Multimodal PDF parser (text + images)
batch_index_pdfs.py - Batch indexing script

Documentation

ADVANCED_RAG_GUIDE.md - Advanced RAG features guide
PDF_RAG_GUIDE.md - PDF usage guide
MULTIMODAL_PDF_GUIDE.md - Multimodal PDF guide ⭐
QUICK_START_PDF.md - Quick start for PDF
chatbot_guide_template.md - Template for user guide PDF

Testing

test_advanced_features.py - Test advanced features
test_pdf_chatbot.py - Test PDF chatbot (example in docs)

API Endpoints

1. Indexing

Endpoint	Method	Description
`/index`	POST	Index texts + images (max 10 each)
`/documents`	POST	Add text document
`/upload-pdf`	POST	Upload PDF (text only)
`/upload-pdf-multimodal`	POST	Upload PDF with images ⭐

2. Search

Endpoint	Method	Description
`/search`	POST	Hybrid search (text + image)
`/search/text`	POST	Text-only search
`/search/image`	POST	Image-only search
`/rag/search`	POST	RAG knowledge base search

3. Chat

Endpoint	Method	Description
`/chat`	POST	Chat with Advanced RAG

4. Management

Endpoint	Method	Description
`/documents/pdf`	GET	List all PDFs
`/documents/pdf/{id}`	DELETE	Delete PDF document
`/delete/{doc_id}`	DELETE	Delete document
`/document/{doc_id}`	GET	Get document by ID
`/history`	GET	Get chat history
`/stats`	GET	Collection statistics
`/`	GET	Health check + API docs

Use Cases & Recommendations

Case 1: Text-Only Guide PDF

Scenario: FAQ, policy document, text guide

Solution: /upload-pdf

curl -X POST "http://localhost:8000/upload-pdf" \
  -F "file=@YOUR_FILE.pdf" \
  -F "title=FAQ"

Case 2: Guide PDF with Images ⭐ (Your Case)

Scenario: User guide with screenshots, tutorial with diagrams

Solution: /upload-pdf-multimodal

curl -X POST "http://localhost:8000/upload-pdf-multimodal" \
  -F "file=@user_guide_with_images.pdf" \
  -F "title=User Guide" \
  -F "category=guide"

Benefits:

✓ Extract text + image URLs
✓ Link images to text chunks
✓ Chatbot returns images in its response
✓ Visual context for users

Case 3: Multiple Social Media Posts

Scenario: Index many posts with texts and images

Solution: /index with multiple inputs

import requests

data = {
    'id': 'post123',
    'texts': ['Post text 1', 'Post text 2'],  # Up to 10 texts
}
files = [
    ('images', open('img1.jpg', 'rb')),
    ('images', open('img2.jpg', 'rb')),  # Up to 10 images
]
requests.post('http://localhost:8000/index', data=data, files=files)
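
The snippet above opens image files without closing them. One way to handle that, sketched here with the standard library only, is to open every file under a `contextlib.ExitStack` so all handles are released once the request is done. The helper name `open_image_parts` is hypothetical, and the temporary files stand in for real images.

```python
import tempfile
from contextlib import ExitStack
from pathlib import Path
from typing import BinaryIO, List, Sequence, Tuple

MAX_IMAGES = 10  # limit stated for /index in this summary


def open_image_parts(
    paths: Sequence[Path], stack: ExitStack
) -> List[Tuple[str, BinaryIO]]:
    """Open each image under `stack`; every handle closes when the stack exits."""
    if len(paths) > MAX_IMAGES:
        raise ValueError(f"/index accepts at most {MAX_IMAGES} images")
    return [("images", stack.enter_context(open(p, "rb"))) for p in paths]


# Demo with throwaway files in place of real images.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name in ("img1.jpg", "img2.jpg"):
        path = Path(tmp) / name
        path.write_bytes(b"fake image bytes")
        paths.append(path)

    with ExitStack() as stack:
        files = open_image_parts(paths, stack)
        field_names = [field for field, _ in files]
        # The upload would happen here, while the files are still open:
        # requests.post('http://localhost:8000/index', data=data, files=files)

    all_closed = all(handle.closed for _, handle in files)

print(field_names, all_closed)
```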

Case 4: Complex Queries

Scenario: Complex questions that need high accuracy

Solution: Advanced RAG with full options

{
    'message': 'Complex question',
    'use_rag': True,
    'use_advanced_rag': True,
    'use_reranking': True,
    'use_compression': True,
    'score_threshold': 0.5,
    'top_k': 5
}
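
To avoid retyping these options on every call, the payload can be built from a small dataclass. This is a client-side sketch only: the class `ChatOptions` is hypothetical, its defaults copy the values shown above rather than any confirmed server defaults, and the range checks are assumptions about sensible input.

```python
from dataclasses import asdict, dataclass


@dataclass
class ChatOptions:
    """Options for a /chat request; defaults mirror the Case 4 example."""
    message: str
    use_rag: bool = True
    use_advanced_rag: bool = True
    use_reranking: bool = True
    use_compression: bool = True
    score_threshold: float = 0.5
    top_k: int = 5

    def to_payload(self) -> dict:
        # Assumption: scores are similarities in the range [0, 1].
        if not 0.0 <= self.score_threshold <= 1.0:
            raise ValueError("score_threshold must be between 0 and 1")
        if self.top_k < 1:
            raise ValueError("top_k must be at least 1")
        return asdict(self)


payload = ChatOptions(message="Complex question").to_payload()
print(payload)
```

The resulting dict can be passed as the `json=` argument of the `requests.post` call to /chat.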

Recommended Workflow

Initial Setup

Create the user guide PDF
- Use the template: chatbot_guide_template.md
- Customize the content for your system
- Add image URLs (screenshots, diagrams)
- Convert to PDF: pandoc template.md -o guide.pdf

Upload PDF

curl -X POST "http://localhost:8000/upload-pdf-multimodal" \
  -F "file=@chatbot_user_guide.pdf" \
  -F "title=ChatbotRAG User Guide" \
  -F "category=user_guide"

Verify

curl http://localhost:8000/documents/pdf
# Check "type": "multimodal_pdf" and "total_images"

Daily Use

Chat with users

import requests

# user_question holds the text entered by the user
response = requests.post('http://localhost:8000/chat', json={
    'message': user_question,
    'use_rag': True,
    'use_advanced_rag': True,
    'hf_token': 'your_token'
})

Display response + images

# Parse the JSON body once and reuse it
result = response.json()

# Text answer
print(result['response'])

# Images (if any)
for ctx in result['context_used']:
    if ctx['metadata'].get('has_images'):
        for url in ctx['metadata']['image_urls']:
            # Display image in your UI
            print(f"Image: {url}")
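
If the UI needs a flat, de-duplicated list of images, the loop above can be wrapped in a helper. The sketch below assumes only the response fields already used in this section (`context_used`, `metadata`, `has_images`, `image_urls`); the function name `collect_image_urls` and the sample response are hypothetical.

```python
from typing import Any, Dict, List


def collect_image_urls(chat_response: Dict[str, Any]) -> List[str]:
    """Return image URLs from all context chunks, in order, without duplicates."""
    urls: List[str] = []
    seen = set()
    for ctx in chat_response.get("context_used", []):
        metadata = ctx.get("metadata", {})
        if not metadata.get("has_images"):
            continue
        for url in metadata.get("image_urls", []):
            if url not in seen:
                seen.add(url)
                urls.append(url)
    return urls


# Hand-written sample shaped like the fields used above.
sample = {
    "response": "Click the Upload button.",
    "context_used": [
        {"metadata": {"has_images": True,
                      "image_urls": ["https://example.com/upload.png"]}},
        {"metadata": {"has_images": False}},
        {"metadata": {"has_images": True,
                      "image_urls": ["https://example.com/upload.png",
                                     "https://example.com/done.png"]}},
    ],
}
image_urls = collect_image_urls(sample)
print(image_urls)
```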

Updating Content

Update the PDF - edit and re-export

Delete the old PDF

curl -X DELETE http://localhost:8000/documents/pdf/old_doc_id

Upload the new PDF

curl -X POST http://localhost:8000/upload-pdf-multimodal -F "file=@new_guide.pdf"

Performance Tips

1. Chunking

Default:

chunk_size: 500 words
chunk_overlap: 50 words

Tuned:

# In multimodal_pdf_parser.py
parser = MultimodalPDFParser(
    chunk_size=400,      # Shorter for faster retrieval
    chunk_overlap=40,
    min_chunk_size=50
)
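
To make the three parameters concrete, here is a minimal word-based chunker with overlap. It is a sketch and not the implementation in multimodal_pdf_parser.py: the function name `chunk_words` is hypothetical, and the handling of `min_chunk_size` (dropping a trailing fragment that is too short) is an assumption about what that parameter means.

```python
from typing import List


def chunk_words(
    text: str,
    chunk_size: int = 500,
    chunk_overlap: int = 50,
    min_chunk_size: int = 50,
) -> List[str]:
    """Split text into word windows of `chunk_size`, each sharing
    `chunk_overlap` words with the previous window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        # Assumption: skip a too-short tail unless it is the only chunk.
        if len(window) < min_chunk_size and chunks:
            break
        chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Small numbers so the overlap is easy to see.
text = " ".join(f"w{i}" for i in range(12))
chunks = chunk_words(text, chunk_size=5, chunk_overlap=2, min_chunk_size=2)
for chunk in chunks:
    print(chunk)
```

Each chunk starts `chunk_size - chunk_overlap` words after the previous one, so smaller chunks with the same overlap ratio produce more, shorter entries in the index.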

2. Retrieval

Recommended settings:

{
    'top_k': 5,              # 3-7 is optimal
    'score_threshold': 0.5,   # 0.4-0.6 is good
    'use_reranking': True,    # Always enable
    'use_compression': True   # Keeps context relevant
}
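
The effect of `score_threshold` and `top_k` can be illustrated on a plain list of scored hits. This is a client-side sketch for intuition only; in the real system the filtering presumably happens inside the vector search, and the name `select_hits` and the hit structure are hypothetical.

```python
from typing import Dict, List


def select_hits(
    hits: List[Dict], score_threshold: float = 0.5, top_k: int = 5
) -> List[Dict]:
    """Keep hits scoring at least `score_threshold`, best first, at most `top_k`."""
    kept = [hit for hit in hits if hit["score"] >= score_threshold]
    kept.sort(key=lambda hit: hit["score"], reverse=True)
    return kept[:top_k]


hits = [
    {"id": "chunk-1", "score": 0.82},
    {"id": "chunk-2", "score": 0.47},
    {"id": "chunk-3", "score": 0.65},
    {"id": "chunk-4", "score": 0.51},
]
strict = select_hits(hits, score_threshold=0.5, top_k=2)
loose = select_hits(hits, score_threshold=0.4, top_k=5)
print([h["id"] for h in strict])
print([h["id"] for h in loose])
```

A higher threshold removes weak matches before they reach the LLM, while `top_k` caps how much context is sent regardless of score.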

3. LLM

For factual answers:

{
    'temperature': 0.3,   # Low for accuracy
    'max_tokens': 512,    # Concise answers
    'top_p': 0.9
}

Troubleshooting

Issue 1: Images are not detected

Solution:

Verify that the PDF contains image URLs (http://, https://)
Check the format: markdown ![](url) or HTML <img src>

Test regex:

from multimodal_pdf_parser import MultimodalPDFParser
parser = MultimodalPDFParser()
urls = parser.extract_image_urls("![](https://example.com/img.png)")
print(urls)  # Should return ['https://example.com/img.png']
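
If the parser is not importable where you are debugging, the two formats can also be checked with a standalone regular expression. The patterns below are an approximation written for this summary, not the expression used inside MultimodalPDFParser, and the name `find_image_urls` is hypothetical.

```python
import re
from typing import List

# Markdown image: ![alt](http...)
MARKDOWN_IMAGE = r"!\[[^\]]*\]\((https?://[^\s)]+)"
# HTML image: <img ... src="http..."> with single or double quotes
HTML_IMAGE = r"""<img[^>]*?\bsrc=["'](https?://[^"']+)["']"""

IMAGE_URL = re.compile(MARKDOWN_IMAGE + "|" + HTML_IMAGE, re.IGNORECASE)


def find_image_urls(text: str) -> List[str]:
    """Return http(s) image URLs in document order, without duplicates."""
    urls: List[str] = []
    for match in IMAGE_URL.finditer(text):
        url = match.group(1) or match.group(2)
        if url not in urls:
            urls.append(url)
    return urls


sample = (
    "See ![login](https://example.com/login.png) and "
    '<img src="https://example.com/menu.png" alt="menu"> then '
    "![](https://example.com/login.png) again."
)
found = find_image_urls(sample)
print(found)
```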

Issue 2: The chatbot cannot find information

Solution:

Lower score_threshold: 0.3-0.5
Increase top_k: 5-10
Enable Advanced RAG
Rephrase question
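
The first suggestion can be automated by retrying with progressively lower thresholds. The sketch below assumes a search callable that accepts `score_threshold` and `top_k`; `search_with_relaxation`, `fake_search`, and the sample scores are hypothetical stand-ins, not part of the ChatbotRAG API.

```python
from typing import Callable, Dict, List, Sequence, Tuple

INDEXED = [{"id": "faq-1", "score": 0.42}, {"id": "faq-2", "score": 0.31}]


def fake_search(query: str, score_threshold: float, top_k: int) -> List[Dict]:
    """Stand-in for a real search call; ignores the query text."""
    hits = [hit for hit in INDEXED if hit["score"] >= score_threshold]
    return hits[:top_k]


def search_with_relaxation(
    search: Callable[..., List[Dict]],
    query: str,
    thresholds: Sequence[float] = (0.5, 0.4, 0.3),
    top_k: int = 5,
) -> Tuple[float, List[Dict]]:
    """Try each threshold in order and return the first non-empty result."""
    for threshold in thresholds:
        hits = search(query, score_threshold=threshold, top_k=top_k)
        if hits:
            return threshold, hits
    return thresholds[-1], []


used_threshold, results = search_with_relaxation(
    fake_search, "How do I reset my password?"
)
print(used_threshold, [hit["id"] for hit in results])
```

Starting strict and relaxing only when nothing is found keeps precision high for well-covered questions while still returning something for rarer ones.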

Issue 3: Responses are too slow

Solution:

Reduce top_k
Disable compression if it is not needed
Use basic RAG instead of advanced RAG for simple queries

Next Steps

Immediate (Now)

✓ The system is ready!
Create your guide PDF
Upload it via /upload-pdf-multimodal
Test with real questions

Short Term (1-2 weeks)

Collect user feedback
Fine-tune parameters (top_k, threshold)
Add more PDFs (FAQ, tutorials, etc.)
Monitor chat history to improve content

Long Term (Later)

Hybrid Search with BM25
- Combine dense + sparse retrieval
- Better for keyword queries
Cross-Encoder Reranking
- Replace embedding similarity
- More accurate ranking
Image Processing
- Download and process actual images
- Use Jina CLIP for image embeddings
- True multimodal embeddings (text + image vectors)
RAG-Anything Integration (If needed)
- For complex PDFs with tables, charts
- Vision encoder for embedded images
- Advanced document understanding
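
For the hybrid search item, one common way to combine a dense ranking with a BM25 ranking is Reciprocal Rank Fusion (RRF). The sketch below shows that technique as a possible starting point; this summary does not say which fusion method would be used, and the document ids and the constant `k=60` (a conventional default) are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]], k: int = 60
) -> List[str]:
    """Fuse ranked id lists: each list adds 1 / (k + rank) to a document's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


dense_ranking = ["a", "b", "c"]   # e.g. from vector search
sparse_ranking = ["c", "a", "d"]  # e.g. from BM25 keyword search
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
```

RRF uses only rank positions, so the dense and sparse scores do not need to be on the same scale.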

Comparison Matrix

Approach	Text	Images	URLs	Complexity	Your Case
Basic RAG	✓	✗	✗	Low	✗
PDF Parser	✓	✗	✗	Low	✗
Multimodal PDF	✓	✗	✓	Medium	✓
RAG-Anything	✓	✓	✓	High	Overkill

Recommendation: Multimodal PDF is the best fit for your case.

Conclusion

What You Have

✅ Multiple Inputs: Index 10 texts + 10 images
✅ Advanced RAG: Query expansion, reranking, compression
✅ PDF Support: Parse and index PDFs
✅ Multimodal PDF: Extract text + image URLs and link them together
✅ Complete Documentation: Guides, examples, troubleshooting

What to Do Next

Create a guide PDF with your own content (including image URLs)
Upload it via /upload-pdf-multimodal
Test with real questions
Iterate - fine-tune based on feedback

Files to Read

For PDFs with images (your case):

MULTIMODAL_PDF_GUIDE.md ⭐⭐⭐
PDF_RAG_GUIDE.md

For Advanced RAG:

ADVANCED_RAG_GUIDE.md

Quick Start:

QUICK_START_PDF.md

Your system is now very capable. Just upload a PDF and start chatting! 🚀📄🤖