ming committed
Commit 6e48ad3 · 1 Parent(s): 56b5c90

Add support for sshleifer/distilbart-cnn-6-6 model for V2 API


- Added BART model detection and specific input handling
- Updated default model from t5-small to sshleifer/distilbart-cnn-6-6
- BART models now receive direct text input without prefixes (see the sketch below)
- Updated warmup and health check methods for BART compatibility
- Updated Dockerfile and README documentation

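The core of the change is the input-format difference between the two model families. A minimal standalone sketch of that difference (assuming `transformers` is available; the model IDs and the 512/1024 token limits are taken from the diffs below, the sample text is placeholder):

```python
from transformers import AutoTokenizer

text = "The stock market rallied today after strong earnings reports."

# T5-family models are trained with a task prefix on the input.
t5_tokenizer = AutoTokenizer.from_pretrained("t5-small")
t5_inputs = t5_tokenizer(
    "summarize: " + text, return_tensors="pt", max_length=512, truncation=True
)

# BART/DistilBART summarizers take the raw article text directly.
bart_tokenizer = AutoTokenizer.from_pretrained("sshleifer/distilbart-cnn-6-6")
bart_inputs = bart_tokenizer(
    text, return_tensors="pt", max_length=1024, truncation=True
)
```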
Dockerfile CHANGED
@@ -7,7 +7,7 @@ ENV PYTHONDONTWRITEBYTECODE=1 \
     PYTHONPATH=/app \
     ENABLE_V1_WARMUP=false \
     ENABLE_V2_WARMUP=true \
-    HF_MODEL_ID=t5-small \
+    HF_MODEL_ID=sshleifer/distilbart-cnn-6-6 \
     HF_HOME=/tmp/huggingface
 
 # Set work directory
README.md CHANGED
@@ -82,7 +82,7 @@ The service uses the following environment variables:
 - `ENABLE_V1_WARMUP`: Enable V1 warmup (default: `false`)
 
 ### V2 Configuration (HuggingFace)
-- `HF_MODEL_ID`: HuggingFace model ID (default: `t5-small`)
+- `HF_MODEL_ID`: HuggingFace model ID (default: `sshleifer/distilbart-cnn-6-6`)
 - `HF_DEVICE_MAP`: Device mapping (default: `auto` for GPU fallback to CPU)
 - `HF_TORCH_DTYPE`: Torch dtype (default: `auto`)
 - `HF_HOME`: HuggingFace cache directory (default: `/tmp/huggingface`)
@@ -121,7 +121,7 @@ This app is optimized for deployment on Hugging Face Spaces using Docker SDK.
 ```bash
 ENABLE_V1_WARMUP=false
 ENABLE_V2_WARMUP=true
-HF_MODEL_ID=t5-small
+HF_MODEL_ID=sshleifer/distilbart-cnn-6-6
 HF_HOME=/tmp/huggingface
 ```
 
@@ -134,7 +134,7 @@ HF_HOME=/tmp/huggingface
 - **Startup time**: ~30-60 seconds (when V1 warmup enabled)
 
 ### V2 (HuggingFace Streaming) - Primary on HF Spaces
-- **V2 Model**: t5-small (~250MB download)
+- **V2 Model**: sshleifer/distilbart-cnn-6-6 (~300MB download)
 - **Memory usage**: ~500MB RAM (when V2 warmup enabled)
 - **Inference speed**: Real-time token streaming
 - **Startup time**: ~30-60 seconds (includes model download when V2 warmup enabled)
@@ -144,7 +144,7 @@ HF_HOME=/tmp/huggingface
 - **V2 warmup enabled by default** (`ENABLE_V2_WARMUP=true`)
 - **HuggingFace Spaces**: V2-only deployment (no Ollama)
 - **Local development**: V1 endpoints work if Ollama is running externally
-- **t5-small model**: Optimized for HuggingFace Spaces free tier
+- **distilbart-cnn-6-6 model**: Optimized for HuggingFace Spaces free tier with CNN/DailyMail fine-tuning
 
 ## 🛠️ Development
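For a quick end-to-end check of the new default outside the service, something like this should work (a sketch assuming `transformers` is installed; the sample article is placeholder):

```python
from transformers import pipeline

# Downloads ~300MB of weights on first run, matching the README estimate.
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-6-6")

article = (
    "The city council voted on Tuesday to approve a new transit plan "
    "that expands bus service and adds two light-rail lines by 2030."
)
print(summarizer(article, max_length=60, min_length=10)[0]["summary_text"])
```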
app/core/config.py CHANGED
@@ -34,7 +34,7 @@ class Settings(BaseSettings):
     max_tokens_default: int = Field(default=256, env="MAX_TOKENS_DEFAULT", ge=1)
 
     # V2 HuggingFace Configuration
-    hf_model_id: str = Field(default="t5-small", env="HF_MODEL_ID")
+    hf_model_id: str = Field(default="sshleifer/distilbart-cnn-6-6", env="HF_MODEL_ID")
     hf_device_map: str = Field(default="auto", env="HF_DEVICE_MAP")  # "auto" for GPU fallback to CPU
     hf_torch_dtype: str = Field(default="auto", env="HF_TORCH_DTYPE")  # "auto" for automatic dtype selection
     hf_cache_dir: str = Field(default="/tmp/huggingface", env="HF_HOME")  # HuggingFace cache directory
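The `env=` keyword on `Field` is pydantic v1-style `BaseSettings`; under that assumption, the environment variable still overrides the new baked-in default, which is what the Dockerfile change relies on. A minimal sketch:

```python
import os
from pydantic import BaseSettings, Field  # pydantic v1-style settings (assumption)

class Settings(BaseSettings):
    hf_model_id: str = Field(default="sshleifer/distilbart-cnn-6-6", env="HF_MODEL_ID")

print(Settings().hf_model_id)   # sshleifer/distilbart-cnn-6-6 (code default)
os.environ["HF_MODEL_ID"] = "t5-small"
print(Settings().hf_model_id)   # t5-small (env var wins at instantiation)
```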
app/services/hf_streaming_summarizer.py CHANGED
@@ -93,8 +93,15 @@ class HFStreamingSummarizer:
             logger.warning("⚠️ HuggingFace model not initialized, skipping warmup")
             return
 
-        # Use T5 format for warmup
-        test_prompt = "summarize: This is a test."
+        # Determine appropriate test prompt based on model type
+        if "t5" in settings.hf_model_id.lower():
+            test_prompt = "summarize: This is a test."
+        elif "bart" in settings.hf_model_id.lower():
+            # BART models expect direct text input
+            test_prompt = "This is a test article for summarization."
+        else:
+            # Generic fallback
+            test_prompt = "This is a test article for summarization."
 
         try:
             # Run in executor to avoid blocking
@@ -175,6 +182,15 @@ class HFStreamingSummarizer:
                 max_length=512,
                 truncation=True
             )
+        elif "bart" in settings.hf_model_id.lower():
+            # BART models (including DistilBART) expect direct text input
+            # No prefixes or chat templates needed
+            inputs = self.tokenizer(
+                text,
+                return_tensors="pt",
+                max_length=1024,
+                truncation=True
+            )
         else:
             # Other models use chat template
             messages = [
@@ -267,8 +283,16 @@ class HFStreamingSummarizer:
             return False
 
         try:
-            # Quick test generation with T5 format
-            test_input = self.tokenizer("summarize: test", return_tensors="pt")
+            # Determine appropriate test input based on model type
+            if "t5" in settings.hf_model_id.lower():
+                test_input_text = "summarize: test"
+            elif "bart" in settings.hf_model_id.lower():
+                # BART models expect direct text input
+                test_input_text = "This is a test article."
+            else:
+                test_input_text = "This is a test article."
+
+            test_input = self.tokenizer(test_input_text, return_tensors="pt")
             test_input = test_input.to(self.model.device)
 
             with torch.no_grad():
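The same `"t5"`/`"bart"` substring check now appears three times (warmup, generation, health check). A hypothetical consolidation, not part of this commit, could look like:

```python
def resolve_test_prompt(model_id: str) -> str:
    """Pick a warmup/health-check prompt for the configured model.

    Hypothetical helper; mirrors the branching added in this commit.
    """
    model_id = model_id.lower()
    if "t5" in model_id:
        return "summarize: This is a test."  # T5 requires the task prefix
    # BART/DistilBART and the generic fallback both take raw text
    return "This is a test article for summarization."
```

Substring matching on the model ID is simple but somewhat brittle; inspecting the loaded model's `config.model_type` (e.g. `"t5"` or `"bart"`) would be a more robust discriminator if this logic grows.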