# Repository Explorer - Vectorization Feature

## Overview

The Repository Explorer now includes **simple vectorization** to enhance the chatbot's ability to answer questions about loaded repositories. This feature uses semantic search to find the most relevant code sections for user queries.

## How It Works

### 1. **Content Chunking**

- Repository content is split into overlapping chunks (~500 lines each, with 50 lines of overlap)
- Each chunk maintains metadata (repo ID, line numbers, chunk index)
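The chunking step above can be sketched roughly as follows. This is an illustrative sketch only; the function name and chunk fields are assumptions, not the Space's actual code:

```python
def chunk_lines(lines, repo_id, chunk_size=500, overlap=50):
    """Split repository lines into overlapping chunks with metadata.

    Hypothetical sketch of the chunking step described above.
    """
    chunks = []
    step = chunk_size - overlap
    for index, start in enumerate(range(0, len(lines), step)):
        end = min(start + chunk_size, len(lines))
        chunks.append({
            "repo_id": repo_id,
            "chunk_index": index,
            "start_line": start + 1,  # 1-based line numbers for references
            "end_line": end,
            "text": "\n".join(lines[start:end]),
        })
        if end == len(lines):
            break  # last chunk reached; avoid zero-length tails
    return chunks
```

The 50-line overlap means a function that straddles a chunk boundary still appears whole in at least one chunk.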
### 2. **Embedding Creation**

- Uses the lightweight `all-MiniLM-L6-v2` SentenceTransformer model
- Creates a vector embedding for each chunk
- Embeddings capture the semantic meaning of the code content
### 3. **Semantic Search**

- When you ask a question, the system retrieves the 3 most relevant chunks
- Uses cosine similarity to rank chunks by relevance
- Returns both similarity scores and line-number references
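The ranking step can be sketched with plain numpy. In the real feature the vectors come from `all-MiniLM-L6-v2`; the toy vectors in the test are arbitrary, and the function name is an assumption:

```python
import numpy as np

def top_k_chunks(query_vec, chunk_vecs, k=3):
    """Rank chunks by cosine similarity to a query embedding.

    Illustrative sketch of the search step, not the Space's actual code.
    """
    q = np.asarray(query_vec, dtype=np.float64)
    m = np.asarray(chunk_vecs, dtype=np.float64)
    q = q / np.linalg.norm(q)
    m = m / np.linalg.norm(m, axis=1, keepdims=True)
    sims = m @ q                          # cosine similarity per chunk
    order = np.argsort(sims)[::-1][:k]    # indices of the k best matches
    return [(int(i), float(sims[i])) for i in order]
```

Returning `(index, score)` pairs keeps the line-number metadata lookup separate from the similarity math.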
### 4. **Enhanced Responses**

- The chatbot combines the general repository analysis with the most relevant code sections
- Provides specific code examples and implementation details
- References exact line numbers for better context
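One way the combined response context might be assembled (a hypothetical sketch; the section markers mirror the example output format shown in this document):

```python
def build_context(analysis, results, chunks):
    """Combine the general repo analysis with the top-ranked chunks.

    Illustrative sketch only; `results` is a list of (chunk_index, score)
    pairs and `chunks` carries the line-number metadata.
    """
    parts = [analysis, "=== MOST RELEVANT CODE SECTIONS ==="]
    for rank, (idx, score) in enumerate(results, start=1):
        c = chunks[idx]
        parts.append(
            f"--- Relevant Section {rank} "
            f"(similarity: {score:.3f}, lines {c['start_line']}-{c['end_line']}) ---"
        )
        parts.append(c["text"])
    return "\n".join(parts)
```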
## Installation

The vectorization feature requires additional dependencies:

```bash
pip install sentence-transformers numpy
```

These are already included in the updated `requirements.txt`.
## Testing

Run the test script to verify everything is working:

```bash
python test_vectorization.py
```

This will test:

- ✅ Dependencies import correctly
- ✅ The SentenceTransformer model loads
- ✅ Embedding creation works
- ✅ Similarity calculations function
- ✅ Integration with the repo explorer works
## Features

### ✅ **What's Included**

- **Simple setup**: Uses a lightweight, fast embedding model
- **Automatic chunking**: Smart content splitting with overlap for context
- **Semantic search**: Finds relevant code based on meaning, not just keywords
- **Graceful fallback**: If vectorization fails, falls back to text-only analysis
- **Memory efficient**: In-memory storage suited to single-repository exploration
- **Clear feedback**: Status messages show when vectorization is active
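The graceful fallback can be sketched as a simple wrapper. The function name, store interface, and return strings below are all illustrative assumptions:

```python
def answer_question(question, repo_text, vector_store=None):
    """Answer with vector search when available, else text-only analysis.

    Hypothetical sketch of the fallback behaviour described above.
    """
    try:
        if vector_store is None:
            raise RuntimeError("vectorization unavailable")
        sections = vector_store.search(question, k=3)
        return f"answer using {len(sections)} relevant sections"
    except Exception:
        # Graceful degradation: the text-only path still works.
        return "answer using full-text analysis only"
```

The key design point is that any failure in the vector path (missing dependency, model download error) degrades to the original behaviour instead of raising.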
### **How to Use**

1. Load any repository in the Repository Explorer tab
2. Look for "Vector embeddings created" in the status message
3. Ask questions - the chatbot will automatically use vector search
4. Responses will include "MOST RELEVANT CODE SECTIONS" with similarity scores
### **Example Output**

When you ask "How do I use this repository?", you might get:

```
=== MOST RELEVANT CODE SECTIONS ===

--- Relevant Section 1 (similarity: 0.847, lines 25-75) ---
# Installation and Usage
...actual code from those lines...

--- Relevant Section 2 (similarity: 0.792, lines 150-200) ---
def main():
    """Main usage example"""
...actual code from those lines...
```
## Technical Details

- **Model**: `all-MiniLM-L6-v2` (384-dimensional embeddings)
- **Chunk size**: 500 lines with a 50-line overlap
- **Search**: Top 3 most similar chunks per query
- **Storage**: In-memory (cleared when a new repository is loaded)
- **Fallback**: Graceful degradation to text-only analysis if vectorization fails
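The in-memory, non-persistent storage described above could look roughly like this (an illustrative sketch, not the Space's actual implementation):

```python
import numpy as np

class InMemoryVectorStore:
    """Minimal in-memory store: embeddings live in RAM and are replaced
    whenever a new repository is loaded, matching the behaviour above."""

    def __init__(self):
        self.chunks = []
        self.vectors = None

    def load_repository(self, chunks, vectors):
        # Loading a new repository discards any previous state.
        self.chunks = chunks
        self.vectors = np.asarray(vectors, dtype=np.float32)

    def clear(self):
        self.chunks, self.vectors = [], None
```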
## Benefits

1. **Better Context**: Finds relevant code sections even for natural-language queries
2. **Specific Examples**: Provides actual code snippets related to your question
3. **Line References**: Shows exactly where information comes from
4. **Semantic Understanding**: Understands intent, not just keyword matches
5. **Fast Setup**: The lightweight model downloads quickly on first use

## Limitations

- **Single Repository**: The vector store is cleared when a new repository is loaded
- **Memory Usage**: Keeps all embeddings in memory (suitable for the exploration use case)
- **Model Size**: ~80 MB one-time download for the embedding model
- **No Persistence**: Vectors are recreated each time you load a repository

This simple vectorization approach significantly improves the chatbot's ability to provide relevant, code-specific answers while keeping the implementation straightforward and fast.