lets_talk / BLOG_DATA_UTILS.md

mafzaal

Refactor blog data utilities and configuration

9681c5d 6 months ago

Blog Data Utilities

This directory contains utilities for loading, processing, and maintaining blog post data for the RAG system.

Available Tools

`blog_utils.py`

This Python module contains utility functions for:

Loading blog posts from the data directory
Processing and enriching metadata (adding URLs, titles, etc.)
Getting statistics about the documents
Creating and updating vector embeddings
Loading existing vector stores
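
As a rough illustration of the first three items, the sketch below loads markdown files, attaches a title and URL to each, and computes simple statistics. It uses only the standard library. The function names (`load_posts`, `post_stats`), the dictionary layout, and the URL scheme (base URL plus file name without extension) are assumptions made for illustration, not the actual contents of `blog_utils.py`.

```python
from pathlib import Path


def load_posts(data_dir, base_url):
    """Read every markdown file under data_dir into a dict with metadata."""
    posts = []
    for path in sorted(Path(data_dir).rglob("*.md")):
        text = path.read_text(encoding="utf-8")
        # Assumption: the first "# " heading, if any, is the post title;
        # otherwise fall back to the file name without extension.
        title = next(
            (line[2:].strip() for line in text.splitlines() if line.startswith("# ")),
            path.stem,
        )
        posts.append(
            {
                "content": text,
                "metadata": {
                    "source": str(path),
                    "title": title,
                    # Assumption: the URL slug is the file name without extension.
                    "url": base_url.rstrip("/") + "/" + path.stem + "/",
                    "content_length": len(text),
                },
            }
        )
    return posts


def post_stats(posts):
    """Summarize the loaded posts (count, total and average length)."""
    lengths = [p["metadata"]["content_length"] for p in posts]
    return {
        "total_documents": len(posts),
        "total_characters": sum(lengths),
        "avg_length": sum(lengths) / len(lengths) if lengths else 0,
    }
```

Calling `load_posts("data/", "https://thedataguy.pro/blog/")` and passing the result to `post_stats` would give a small summary dictionary that can be saved alongside the vector store.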

`update_blog_data.py`

This script allows you to:

Update the blog data when new posts are published
Process new blog posts
Update the vector store
Track changes over time

Legacy Notebooks (Reference Only)

The following notebooks are kept for reference only; their functionality has been moved to the Python modules described above:

utils_data_loading.ipynb - Contains the original utility functions
update_blog_data.ipynb - Demonstrates the update workflow

How to Use

Updating Blog Data

When new blog posts are published, follow these steps:

Add the markdown files to the data/ directory

Run the update script:

cd /home/mafzaal/source/lets-talk
uv run python update_blog_data.py

You can also force recreation of the vector store:

uv run python update_blog_data.py --force-recreate

This will:

Load all blog posts (including new ones)
Update the vector embeddings
Save statistics for tracking
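
The command-line side of this workflow might look roughly like the sketch below. Only the `--force-recreate` flag comes from this document; the function names, the `./stats` output directory, and the statistics file naming are hypothetical, and the loading and embedding steps are left as a placeholder.

```python
import argparse
import json
from datetime import datetime, timezone
from pathlib import Path


def parse_args(argv=None):
    """Parse the command line; --force-recreate is the only flag sketched here."""
    parser = argparse.ArgumentParser(description="Update blog data (sketch)")
    parser.add_argument(
        "--force-recreate",
        action="store_true",
        help="Rebuild the vector store from scratch",
    )
    return parser.parse_args(argv)


def save_stats(stats, stats_dir):
    """Write stats to a timestamped JSON file and return its path."""
    out_dir = Path(stats_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    out_path = out_dir / f"blog_stats_{stamp}.json"
    out_path.write_text(json.dumps(stats, indent=2), encoding="utf-8")
    return out_path


def main(argv=None, stats_dir="./stats"):
    """Run the sketched workflow and return the path of the stats file."""
    args = parse_args(argv)
    # Placeholder: real code would load the posts and update the embeddings here.
    stats = {"force_recreate": args.force_recreate, "total_documents": 0}
    return save_stats(stats, stats_dir)
```

Writing one timestamped file per run is one simple way to track changes over time, since successive files can be compared directly.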

Customizing the Process

You can customize the process by editing the .env file:

DATA_DIR=data/                             # Directory containing blog posts
VECTOR_STORAGE_PATH=./db/vectorstore_v3    # Path to vector store
EMBEDDING_MODEL=Snowflake/snowflake-arctic-embed-l  # Embedding model
QDRANT_COLLECTION=thedataguy_documents     # Collection name
BLOG_BASE_URL=https://thedataguy.pro/blog/ # Base URL for blog
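
One way these settings could be read in Python is sketched below. The variable names and default values are taken from the list above; the `load_config` helper is hypothetical. Note that `os.environ` does not read a `.env` file by itself, so a loader such as python-dotenv's `load_dotenv()` would have to run first; whether the project does this is an assumption.

```python
import os

# Defaults mirror the values listed above.
DEFAULTS = {
    "DATA_DIR": "data/",
    "VECTOR_STORAGE_PATH": "./db/vectorstore_v3",
    "EMBEDDING_MODEL": "Snowflake/snowflake-arctic-embed-l",
    "QDRANT_COLLECTION": "thedataguy_documents",
    "BLOG_BASE_URL": "https://thedataguy.pro/blog/",
}


def load_config(env=None):
    """Return the settings, preferring values from env over the defaults."""
    env = os.environ if env is None else env
    return {key: env.get(key, default) for key, default in DEFAULTS.items()}
```

Passing a plain dictionary as `env` makes the helper easy to exercise without touching the real environment.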

In the Chainlit App

The Chainlit app (app.py) has been updated to use the utility functions from the blog_utils.py module. If loading them fails, the app falls back to the notebook import and direct initialization.
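
A fallback chain of this kind can be expressed generically, as in the sketch below. The `first_available` helper and the loader names in the usage note are hypothetical; the actual logic in app.py is not shown in this document.

```python
def first_available(loaders):
    """Call each (name, loader) pair in order; return (name, result) of the first success."""
    errors = []
    for name, loader in loaders:
        try:
            return name, loader()
        except Exception as exc:  # broad on purpose: any failure triggers the next fallback
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all loaders failed: " + "; ".join(errors))
```

An app could pass a list such as `[("module", load_from_module), ("notebook", load_from_notebook), ("direct", init_directly)]`, where the three callables are placeholders for the real loading routines, and log which name was returned.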

Adding Custom Processing

To add custom processing for blog posts:

Edit the update_document_metadata function in blog_utils.py
Add any additional enrichment or processing steps
Update the vector store using the update_blog_data.py script
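
As an example of an additional enrichment step, the sketch below adds a word count and an estimated reading time to a metadata dictionary. The `add_reading_time` helper, its parameters, and the 200 words-per-minute default are assumptions for illustration; the real signature of `update_document_metadata` is not shown here, so how the step would be wired in is left open.

```python
import math


def add_reading_time(metadata, content, words_per_minute=200):
    """Return a copy of metadata with word count and estimated reading time."""
    word_count = len(content.split())
    enriched = dict(metadata)  # copy so the caller's dict is not mutated
    enriched["word_count"] = word_count
    # Round up so that any non-empty post takes at least one minute.
    enriched["reading_time_minutes"] = math.ceil(word_count / words_per_minute)
    return enriched
```

Because new metadata fields only reach the vector store when documents are re-indexed, a change like this would be followed by a run of the update script.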

Future Improvements

Add a scheduled update process that automatically picks up new blog posts
Track which embedding model and version produced each vector store
Add webhook support so the data is updated automatically when new posts are published