
Blog Data Utilities

This directory contains utilities for loading, processing, and maintaining blog post data for the RAG system.

Available Tools

blog_utils.py

This Python module contains utility functions for:

  • Loading blog posts from the data directory
  • Processing and enriching metadata (adding URLs, titles, etc.)
  • Getting statistics about the documents
  • Creating and updating vector embeddings
  • Loading existing vector stores
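
As a rough illustration of the metadata enrichment step, the sketch below derives a URL and a title from a post's file path. The function name `enrich_metadata`, the `<slug>/index.md` layout, the metadata keys, and the slug-to-title rule are assumptions made for illustration; they are not taken from blog_utils.py.

```python
from pathlib import PurePosixPath

# Assumed default; mirrors the BLOG_BASE_URL example under "Customizing the Process".
BLOG_BASE_URL = "https://thedataguy.pro/blog/"


def enrich_metadata(source_path: str, data_dir: str = "data/") -> dict:
    """Return metadata with a URL and a title derived from the file path.

    Hypothetical helper: assumes a post is stored either as
    <data_dir>/<slug>/index.md or as <data_dir>/<slug>.md.
    """
    relative = PurePosixPath(source_path).relative_to(PurePosixPath(data_dir))
    # Use the folder name for index.md files, otherwise the file name without extension.
    slug = relative.parent.name if relative.name == "index.md" else relative.stem
    return {
        "source": source_path,
        "url": f"{BLOG_BASE_URL}{slug}/",
        "post_title": slug.replace("-", " ").title(),
    }
```

Calling `enrich_metadata("data/my-first-post/index.md")` would give a URL ending in `my-first-post/` and the title "My First Post". The real module may read titles from the post content or front matter instead.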

update_blog_data.py

This script allows you to:

  • Update the blog data when new posts are published
  • Process new blog posts
  • Update the vector store
  • Track changes over time

Legacy Notebooks (Reference Only)

The following notebooks are kept for reference only; their functionality has been moved to the Python modules above:

  • utils_data_loading.ipynb - Contains the original utility functions
  • update_blog_data.ipynb - Demonstrates the update workflow

How to Use

Updating Blog Data

When new blog posts are published, follow these steps:

  1. Add the markdown files to the data/ directory

  2. Run the update script:

    cd /home/mafzaal/source/lets-talk
    uv run python update_blog_data.py
    

    You can also force recreation of the vector store:

    uv run python update_blog_data.py --force-recreate
    

Running the update script will:

  • Load all blog posts (including new ones)
  • Update the vector embeddings
  • Save statistics for tracking
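
For readers who want to see how such a script might be wired together, here is a minimal sketch of the command-line handling and the statistics step. Only the `--force-recreate` flag comes from this README; the function names, the statistics file name, and the JSON layout are assumptions, not the actual contents of update_blog_data.py.

```python
import argparse
import json
from datetime import datetime, timezone
from pathlib import Path


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the command line; argv=None means 'use sys.argv'."""
    parser = argparse.ArgumentParser(description="Update blog data (illustrative sketch).")
    parser.add_argument(
        "--force-recreate",
        action="store_true",
        help="Force recreation of the vector store.",
    )
    return parser.parse_args(argv)


def save_stats(stats: dict, stats_dir: Path) -> Path:
    """Write stats to a timestamped JSON file so runs can be compared later.

    Hypothetical layout: one file per run, named blog_stats_<UTC timestamp>.json.
    """
    stats_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    out_path = stats_dir / f"blog_stats_{stamp}.json"
    out_path.write_text(json.dumps(stats, indent=2), encoding="utf-8")
    return out_path
```

Keeping one timestamped file per run is one simple way to track changes over time; the real script may store its statistics differently.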

Customizing the Process

You can customize the process by editing the .env file:

DATA_DIR=data/                             # Directory containing blog posts
VECTOR_STORAGE_PATH=./db/vectorstore_v3    # Path to vector store
EMBEDDING_MODEL=Snowflake/snowflake-arctic-embed-l  # Embedding model
QDRANT_COLLECTION=thedataguy_documents     # Collection name
BLOG_BASE_URL=https://thedataguy.pro/blog/ # Base URL for blog
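
The sketch below shows one way these settings could be read in Python, falling back to the example values above when a variable is unset. The `BlogConfig` class and `load_config` function are hypothetical names, and the sketch assumes the .env file has already been loaded into the process environment (for example with python-dotenv's `load_dotenv()`); the project's own configuration code may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class BlogConfig:
    """Hypothetical container for the settings listed in the .env file."""

    data_dir: str
    vector_storage_path: str
    embedding_model: str
    qdrant_collection: str
    blog_base_url: str


def load_config(env: Mapping[str, str] = os.environ) -> BlogConfig:
    """Build a BlogConfig from a mapping, using the README's example values as defaults."""
    return BlogConfig(
        data_dir=env.get("DATA_DIR", "data/"),
        vector_storage_path=env.get("VECTOR_STORAGE_PATH", "./db/vectorstore_v3"),
        embedding_model=env.get("EMBEDDING_MODEL", "Snowflake/snowflake-arctic-embed-l"),
        qdrant_collection=env.get("QDRANT_COLLECTION", "thedataguy_documents"),
        blog_base_url=env.get("BLOG_BASE_URL", "https://thedataguy.pro/blog/"),
    )
```

Passing the environment as an argument keeps the function easy to test: `load_config({"DATA_DIR": "posts/"})` overrides only the data directory and leaves the other settings at their defaults.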

In the Chainlit App

The Chainlit app (app.py) now uses the utility functions from the blog_utils.py module. If that import fails, it falls back to importing from the notebook and, failing that, to direct initialization.
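
A fallback chain of this kind can be sketched with a small helper that tries several module names in order. The helper `import_first_available` is a hypothetical name and is not taken from app.py; importing a notebook by name would also need a notebook import hook, which is outside this sketch.

```python
import importlib
from types import ModuleType
from typing import Optional, Sequence, Tuple


def import_first_available(
    candidates: Sequence[str],
) -> Tuple[Optional[ModuleType], Optional[str]]:
    """Return the first importable module and its name, or (None, None).

    Only ImportError (including ModuleNotFoundError) triggers the next
    candidate; other exceptions raised while importing are left to propagate.
    """
    for name in candidates:
        try:
            return importlib.import_module(name), name
        except ImportError:
            continue
    return None, None
```

An app could call `import_first_available(["blog_utils", "utils_data_loading"])` and, if the result is `(None, None)`, build its components directly instead.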

Adding Custom Processing

To add custom processing for blog posts:

  1. Edit the update_document_metadata function in blog_utils.py
  2. Add any additional enrichment or processing steps
  3. Update the vector store using the update_blog_data.py script
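
As an example of the kind of step that could be added in step 2, the sketch below attaches a word count and an estimated reading time to each document's metadata. The `Doc` class is a stand-in for whatever document type blog_utils.py actually uses, and `add_reading_time`, the metadata keys, and the reading speed of 200 words per minute are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import Iterable, List

WORDS_PER_MINUTE = 200  # assumed average reading speed


@dataclass
class Doc:
    """Stand-in document type: text content plus a metadata dictionary."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def add_reading_time(docs: Iterable[Doc]) -> List[Doc]:
    """Add word_count and reading_time_minutes to each document's metadata in place."""
    enriched = []
    for doc in docs:
        word_count = len(doc.page_content.split())
        doc.metadata["word_count"] = word_count
        # Round up, and report at least one minute even for very short posts.
        doc.metadata["reading_time_minutes"] = max(
            1, math.ceil(word_count / WORDS_PER_MINUTE)
        )
        enriched.append(doc)
    return enriched
```

A step like this would be called from inside update_document_metadata, after which the vector store needs to be updated so the new metadata is stored alongside the embeddings.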

Future Improvements

  • Add a scheduled update process that automatically includes new blog posts
  • Add tracking of embedding models and versions
  • Add webhook support to automatically update when new posts are published