# RAG_Mini
---
# Enterprise-Ready RAG System with Gradio Interface
This is a powerful, enterprise-grade Retrieval-Augmented Generation (RAG) system that transforms your documents into an interactive, intelligent knowledge base. Users can upload their own documents (PDF and TXT files), build a searchable vector index, and ask questions in natural language to receive accurate, context-aware answers sourced directly from the provided material.
The entire application is wrapped in a clean, user-friendly web interface powered by Gradio.
## ✨ Features
- **Intuitive Web UI**: A simple, clean interface built with Gradio for uploading documents and chatting.
- **Multi-Document Support**: Natively handles PDF and TXT files.
- **Advanced Text Splitting**: Uses a `HierarchicalSemanticSplitter` that first splits documents into large parent chunks (for context) and then into smaller child chunks (for precise search), respecting semantic boundaries (see the sketch after this list).
- **Hybrid Search**: Combines dense vector search (FAISS) with sparse keyword search (BM25) for robust, accurate retrieval.
- **Reranking for Accuracy**: Employs a Cross-Encoder model to rerank the retrieved documents, ensuring the most relevant context is passed to the language model.
- **Persistent Knowledge Base**: Automatically saves the built vector index and metadata, so an existing knowledge base can be loaded instantly on startup.
- **Modular & Extensible Codebase**: The project is organized into services for loading, splitting, embedding, and generation, making it easy to maintain and extend.
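To make the parent/child idea concrete, here is a minimal sketch of hierarchical splitting. It packs sentences greedily rather than doing true semantic splitting, and the function names and default sizes are illustrative, not the project's actual `HierarchicalSemanticSplitter` API:
```python
# Illustrative sketch only: the real HierarchicalSemanticSplitter also handles
# overlap and semantic boundaries; names and defaults here are hypothetical.
import re
from typing import List, Tuple

def split_on_sentences(text: str, max_chars: int) -> List[str]:
    """Greedily pack sentence-like segments into chunks of at most max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

def hierarchical_split(text: str,
                       parent_chunk_size: int = 2000,
                       child_chunk_size: int = 400) -> List[Tuple[str, List[str]]]:
    """Return (parent, children) pairs: parents supply wide context for the LLM,
    children are the small units that get embedded and searched."""
    parents = split_on_sentences(text, parent_chunk_size)
    return [(parent, split_on_sentences(parent, child_chunk_size)) for parent in parents]
```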
## 🏗️ System Architecture
The RAG pipeline follows a multi-step process to ensure high-quality answers:
1. **Load**: Documents are loaded from various formats and parsed into a standardized `Document` object, preserving metadata such as source and page number.
2. **Split**: The raw text is processed by the `HierarchicalSemanticSplitter`, creating parent and child text chunks. This provides both broad context and fine-grained detail.
3. **Embed & Index**: The child chunks are converted into vector embeddings by a `SentenceTransformer` model and indexed in a FAISS vector store. A parallel BM25 index is built for keyword search.
4. **Retrieve**: When a user asks a question, a hybrid search is performed against the FAISS and BM25 indices to retrieve the most relevant child chunks (see the first sketch after the diagram).
5. **Fetch Context**: The parent chunks corresponding to the retrieved child chunks are fetched, so the LLM receives a wider, more complete context.
6. **Rerank**: A Cross-Encoder model re-evaluates the relevance of the parent chunks against the query, pushing the best matches to the top (see the second sketch after the diagram).
7. **Generate**: The top-ranked, reranked documents are combined with the user's query into a final prompt, which is sent to a Large Language Model (LLM) to generate a coherent answer.
```
[User Uploads Docs] -> [Loader] -> [Splitter] -> [Embedder & Vector Store] -> [Knowledge Base Saved]
[User Asks Question] -> [Hybrid Search] -> [Get Parent Docs] -> [Reranker] -> [LLM] -> [Answer & Sources]
```
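To make steps 3 and 4 concrete, here is a hedged sketch of alpha-weighted hybrid retrieval over a FAISS index and a BM25 index. The embedding model name and the score-fusion details are assumptions, not necessarily what `core/vector_store.py` implements:
```python
# Sketch of alpha-weighted hybrid retrieval. Assumes normalized embeddings so
# that inner product equals cosine similarity; the fusion scheme is an
# assumption, not necessarily what core/vector_store.py does.
import faiss
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

child_chunks = [
    "FAISS performs dense vector search over embeddings.",
    "BM25 scores documents by keyword overlap.",
    "Parent chunks give the LLM wider context.",
]

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # illustrative model
embeddings = embedder.encode(child_chunks, normalize_embeddings=True).astype(np.float32)

index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product over unit vectors
index.add(embeddings)

bm25 = BM25Okapi([chunk.lower().split() for chunk in child_chunks])

def hybrid_search(query: str, alpha: float = 0.7, top_k: int = 2):
    """Blend scores as alpha * dense + (1 - alpha) * bm25 and return the best chunks."""
    query_vec = embedder.encode([query], normalize_embeddings=True).astype(np.float32)
    dense_scores, dense_ids = index.search(query_vec, len(child_chunks))
    dense = np.zeros(len(child_chunks), dtype=np.float32)
    dense[dense_ids[0]] = dense_scores[0]
    sparse = np.asarray(bm25.get_scores(query.lower().split()), dtype=np.float32)
    if sparse.max() > 0:
        sparse /= sparse.max()  # crude rescale of BM25 scores into [0, 1]
    fused = alpha * dense + (1 - alpha) * sparse
    best = np.argsort(fused)[::-1][:top_k]
    return [(child_chunks[i], float(fused[i])) for i in best]

print(hybrid_search("How does keyword search work?"))
```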
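Step 6 can likewise be sketched with the `CrossEncoder` class from sentence-transformers; the model name here is a common public choice, not necessarily the one set in `configs/config.yaml`:
```python
# Reranking sketch: a cross-encoder scores each (query, document) pair jointly,
# which is slower than bi-encoder retrieval but considerably more accurate.
# The model name is an assumption; the project reads its own from config.yaml.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, parent_chunks: list[str], top_k: int = 3) -> list[str]:
    scores = reranker.predict([(query, chunk) for chunk in parent_chunks])
    ranked = sorted(zip(parent_chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```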
## 🛠️ Tech Stack
- **Backend**: Python 3.9+
- **UI**: Gradio
- **LLM & Embedding Framework**: Hugging Face Transformers, Sentence-Transformers
- **Vector Search**: FAISS (Facebook AI Similarity Search)
- **Keyword Search**: rank-bm25
- **PDF Parsing**: PyMuPDF (`fitz`); a loading sketch follows this list.
- **Configuration**: PyYAML
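As a taste of how the loader stage can work with PyMuPDF, here is a minimal sketch that extracts text page by page while keeping source and page-number metadata. The returned dict shape is illustrative; the real `Document` structure lives in `core/schema.py`:
```python
# Minimal PyMuPDF loading sketch. The dict shape is illustrative; the project's
# core/loader.py and core/schema.py define the actual Document structure.
import fitz  # PyMuPDF

def load_pdf(path: str) -> list[dict]:
    pages = []
    with fitz.open(path) as pdf:
        for page_number, page in enumerate(pdf, start=1):
            pages.append({
                "text": page.get_text(),
                "metadata": {"source": path, "page": page_number},
            })
    return pages
```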
## 🚀 Getting Started
Follow these steps to set up and run the project on your local machine.
### 1. Prerequisites
- Python 3.9 or higher
- `pip` for package management
### 2. Create a `requirements.txt` file
Before proceeding, it is important to have a `requirements.txt` file so that others can easily install the necessary dependencies. With your virtual environment activated, run:
```bash
pip freeze > requirements.txt
```
This saves all the packages from your environment into the file; make sure it is committed to your GitHub repository. The key packages it should contain are: `gradio`, `torch`, `transformers`, `sentence-transformers`, `faiss-cpu`, `rank_bm25`, `PyMuPDF`, `pyyaml`, `numpy`.
### 3. Installation & Setup
**1. Clone the repository:**
```bash
git clone https://github.com/YOUR_USERNAME/YOUR_REPOSITORY_NAME.git
cd YOUR_REPOSITORY_NAME
```
**2. Create and activate a virtual environment (recommended):**
```bash
# For Windows
python -m venv venv
.\venv\Scripts\activate

# For macOS/Linux
python3 -m venv venv
source venv/bin/activate
```
**3. Install the required packages:**
```bash
pip install -r requirements.txt
```
**4. Configure the system:**
Review the `configs/config.yaml` file. You can change the models, chunk sizes, and other parameters here. The default settings are a good starting point.
> **Note:** The first time you run the application, the models specified in the config file will be downloaded from Hugging Face. This may take some time depending on your internet connection.
### 4. Running the Application
To start the Gradio web server, run the `main.py` script:
```bash
python main.py
```
The application will be available at **`http://localhost:7860`**.
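For reference, an entry point like `main.py` typically amounts to building the Gradio app and launching it. The `build_app` factory below is hypothetical, but `launch()`'s `server_name` and `server_port` arguments are real Gradio options if you need a different host or port:
```python
# Hypothetical sketch of an entry point; the real main.py may be organized differently.
import gradio as gr
from ui.app import build_app  # hypothetical factory that returns a gr.Blocks

if __name__ == "__main__":
    demo: gr.Blocks = build_app()
    demo.launch(server_name="0.0.0.0", server_port=7860)
```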
## 📖 How to Use
The application has two primary workflows:
**1. Build a New Knowledge Base:**
- Drag and drop one or more `.pdf` or `.txt` files into the "Upload New Docs to Build" area.
- Click the **"Build New KB"** button.
- The system status will show the progress (Loading -> Splitting -> Indexing).
- Once complete, the status will confirm that the knowledge base is ready, and the chat window will appear.
**2. Load an Existing Knowledge Base:**
- If you have previously built a knowledge base, simply click the **"Load Existing KB"** button.
- The system will load the saved FAISS index and metadata from the `storage` directory.
- The chat window will appear, and you can start asking questions immediately.
**Chatting with Your Documents:**
- Once the knowledge base is ready, type your question into the chat box at the bottom and press Enter or click "Submit".
- The model will generate an answer based on the documents you provided.
- The sources used to generate the answer will be displayed below the chat window.
## 📂 Project Structure
```
.
├── configs/
│   └── config.yaml        # Main configuration file for models, paths, etc.
├── core/
│   ├── embedder.py        # Handles text embedding.
│   ├── llm_interface.py   # Handles reranking and answer generation.
│   ├── loader.py          # Loads and parses documents.
│   ├── schema.py          # Defines data structures (Document, Chunk).
│   ├── splitter.py        # Splits documents into chunks.
│   └── vector_store.py    # Manages FAISS & BM25 indices.
├── service/
│   └── rag_service.py     # Orchestrates the entire RAG pipeline.
├── storage/               # Default location for saved indices (auto-generated).
│   └── ...
├── ui/
│   └── app.py             # Contains the Gradio UI logic.
├── utils/
│   └── logger.py          # Logging configuration.
├── assets/
│   └── 1.png              # Screenshot of the application.
├── main.py                # Entry point to run the application.
└── requirements.txt       # Python package dependencies.
```
## 🔧 Configuration Details (`config.yaml`)
You can customize the RAG pipeline by modifying `configs/config.yaml` (a loading sketch follows this list):
- **`models`**: Specify the Hugging Face models for embedding, reranking, and generation.
- **`vector_store`**: Define the paths where the FAISS index and metadata are saved.
- **`splitter`**: Control the `HierarchicalSemanticSplitter` behavior.
  - `parent_chunk_size`: The target size for the larger context chunks.
  - `parent_chunk_overlap`: The overlap between parent chunks.
  - `child_chunk_size`: The target size for the smaller, searchable chunks.
- **`retrieval`**: Tune the retrieval and reranking process.
  - `retrieval_top_k`: How many initial candidates hybrid search retrieves.
  - `rerank_top_k`: How many final documents are passed to the LLM after reranking.
  - `hybrid_search_alpha`: The weighting between vector search (`alpha`) and BM25 search (`1 - alpha`); `1.0` is pure vector search, `0.0` is pure keyword search.
- **`generation`**: Set parameters for the final answer generation, such as `max_new_tokens`.
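A short sketch of reading such a file with PyYAML; the key names mirror this section, but the file's exact layout is an assumption:
```python
# Sketch of loading configs/config.yaml with PyYAML. Key names mirror the
# section above; the actual file layout may differ.
import yaml

with open("configs/config.yaml", "r", encoding="utf-8") as handle:
    config = yaml.safe_load(handle)

alpha = config["retrieval"]["hybrid_search_alpha"]   # 1.0 = pure vector search
parent_size = config["splitter"]["parent_chunk_size"]
print(f"alpha={alpha}, parent_chunk_size={parent_size}")
```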
## 🛣️ Future Roadmap
- [ ] Support for more document types (e.g., `.docx`, `.pptx`, `.html`).
- [ ] Implement response streaming for a more interactive chat experience.
- [ ] Integrate with other vector databases such as ChromaDB or Pinecone.
- [ ] Create API endpoints for programmatic access to the RAG service.
- [ ] Add more advanced logging and monitoring for enterprise use.
## 🤝 Contributing
Contributions are welcome! If you have ideas for improvements or find a bug, please open an issue or submit a pull request.
## 📜 License
This project is licensed under the MIT License. See the `LICENSE` file for details.