Blog Data Utilities
This directory contains utilities for loading, processing, and maintaining blog post data for the RAG system.
Available Tools
blog_utils.py
This Python module contains utility functions for:
- Loading blog posts from the data directory
- Processing and enriching metadata (adding URLs, titles, etc.)
- Getting statistics about the documents
- Creating and updating vector embeddings
- Loading existing vector stores
update_blog_data.py
This script allows you to:
- Update the blog data when new posts are published
- Process new blog posts
- Update the vector store
- Track changes over time
Legacy Notebooks (Reference Only)
The following notebooks are kept for reference, but the functionality has been moved to Python modules:
- utils_data_loading.ipynb - Contains the original utility functions
- update_blog_data.ipynb - Demonstrates the update workflow
How to Use
Updating Blog Data
When new blog posts are published, follow these steps:
- Add the markdown files to the data/ directory
- Run the update script:
cd /home/mafzaal/source/lets-talk
uv run python update_blog_data.py
You can also force recreation of the vector store:
uv run python update_blog_data.py --force-recreate
This will:
- Load all blog posts (including new ones)
- Update the vector embeddings
- Save statistics for tracking
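The exact statistics saved by update_blog_data.py are not listed here. As a minimal sketch of what tracking could look like, the hypothetical compute_stats helper below summarizes a list of post texts into a small dictionary that could be written to JSON after each run. The function name and the field names are assumptions for illustration, not the script's actual output.

```python
from datetime import datetime, timezone


def compute_stats(posts):
    """Summarize a list of post texts into a JSON-serializable dict.

    Hypothetical helper: field names are illustrative only.
    """
    lengths = [len(text) for text in posts]
    return {
        # UTC timestamp so runs can be compared over time
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "total_documents": len(posts),
        "total_characters": sum(lengths),
        # Guard against division by zero when no posts are loaded
        "avg_characters": sum(lengths) / len(lengths) if lengths else 0,
    }
```

Appending one such record per run to a file would give a simple history of how the corpus grows.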
Customizing the Process
You can customize the process by editing the .env file:
DATA_DIR=data/ # Directory containing blog posts
VECTOR_STORAGE_PATH=./db/vectorstore_v3 # Path to vector store
EMBEDDING_MODEL=Snowflake/snowflake-arctic-embed-l # Embedding model
QDRANT_COLLECTION=thedataguy_documents # Collection name
BLOG_BASE_URL=https://thedataguy.pro/blog/ # Base URL for blog
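How blog_utils.py reads these values is not shown here. One common pattern, sketched below, is to read each setting from the process environment and fall back to a default. This assumes the .env values have already been loaded into the environment (for example by a tool such as python-dotenv); load_settings and DEFAULTS are hypothetical names.

```python
import os

# Defaults mirror the values listed above. Whether the real module uses
# these same defaults is an assumption.
DEFAULTS = {
    "DATA_DIR": "data/",
    "VECTOR_STORAGE_PATH": "./db/vectorstore_v3",
    "EMBEDDING_MODEL": "Snowflake/snowflake-arctic-embed-l",
    "QDRANT_COLLECTION": "thedataguy_documents",
    "BLOG_BASE_URL": "https://thedataguy.pro/blog/",
}


def load_settings(env=None):
    """Return settings, preferring environment values over defaults."""
    env = os.environ if env is None else env
    return {key: env.get(key, default) for key, default in DEFAULTS.items()}
```

Passing a plain dictionary as env makes the function easy to exercise without touching the real environment, for example load_settings({"DATA_DIR": "posts/"}).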
In the Chainlit App
The Chainlit app (app.py) now uses the utility functions from the blog_utils.py module. If loading through the module fails, it falls back to importing from the notebooks and initializing the components directly.
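The fallback logic in app.py is not reproduced here. A generic way to express "try the module first, then an alternative" is sketched below. load_first_available is a hypothetical helper, and the module names passed to it are up to the caller.

```python
import importlib


def load_first_available(module_names):
    """Return the first module that imports cleanly, or None if all fail."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Covers ModuleNotFoundError too; try the next candidate
            continue
    return None
```

A caller that receives None can then fall back to initializing the components directly.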
Adding Custom Processing
To add custom processing for blog posts:
- Edit the update_document_metadata function in blog_utils.py
- Add any additional enrichment or processing steps
- Update the vector store using the update_blog_data.py script
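The signature of update_document_metadata is not shown in this document, so the sketch below uses a separate, hypothetical enrich_metadata function to illustrate one enrichment step: deriving a post URL from the file path. It assumes each document carries a metadata dictionary with a "source" path laid out as data/<slug>/index.md; both the layout and the "post_slug" field name are assumptions.

```python
from pathlib import PurePosixPath


def enrich_metadata(metadata, base_url="https://thedataguy.pro/blog/"):
    """Return a copy of metadata with 'post_slug' and 'url' fields added.

    Hypothetical helper, not the actual update_document_metadata.
    """
    enriched = dict(metadata)  # leave the caller's dict untouched
    source = PurePosixPath(metadata["source"])
    # Assumed layout: data/<slug>/index.md, so the slug is the parent folder
    slug = source.parent.name
    enriched["post_slug"] = slug
    enriched["url"] = f"{base_url.rstrip('/')}/{slug}/"
    return enriched
```

Because stored metadata changes, the vector store needs to be rebuilt afterwards, for example with the --force-recreate option described above.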
Future Improvements
- Add a scheduled update process that automatically picks up new blog posts
- Add tracking of embedding models and versions
- Add webhook support to automatically update when new posts are published