Commit 80ea70f · 1 Parent(s): 6c96c54

fix: Backend ignores client max_tokens to verify Android app hypothesis


This is a diagnostic commit to verify that the Android app's max_tokens=256
is causing early stopping, NOT a backend issue.

Changes Made:

1. **Remove Client max_tokens Constraint** (app/api/v3/scrape_summarize.py:121-123)
- BEFORE: min(max(text_length // 3, 300), payload.max_tokens, 1024)
- AFTER: min(max(text_length // 3, 300), 1024)
- Backend now IGNORES the client's max_tokens value
- Always uses the adaptive calculation for quality (see the sketch after this list)

2. **Remove Incompatible Generation Parameters** (app/services/hf_streaming_summarizer.py)
- Removed temperature, top_p from gen_kwargs (lines ~375, ~670)
- Removed length_penalty from gen_kwargs (line ~391)
- temperature/top_p are ignored with do_sample=False (greedy decoding), and length_penalty only applies to beam search (num_beams > 1); see the sketch after this list
- Eliminates the transformers warning:
"generation flags are not valid and may be ignored: ['temperature', 'top_p', 'length_penalty']"

3. **Update Tests** (tests/test_v3_api.py)
- test_adaptive_tokens_medium_article: Now expects 600-700 tokens (not 450-512)
- test_user_max_tokens_ignored_for_quality: renamed from test_user_max_tokens_respected; now expects the client's max_tokens to be ignored
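
For reference, a minimal sketch of what the backend does after both changes (the helper names below are illustrative, not the actual function names in the repo; the real gen_kwargs in hf_streaming_summarizer.py carries additional overrides shown in the diffs further down):

```python
def adaptive_token_budget(text: str) -> tuple[int, int]:
    """Adaptive max/min token budget; the client's max_tokens no longer participates."""
    adaptive_max = min(
        max(len(text) // 3, 300),  # at least 300 tokens, scale ~33% of input chars
        1024,                      # hard cap to avoid excessive generation
    )
    adaptive_min = int(adaptive_max * 0.6)  # 60% of max, to encourage complete thoughts
    return adaptive_max, adaptive_min


def greedy_gen_kwargs(max_new_tokens: int, min_new_tokens: int) -> dict:
    """Generation kwargs for greedy decoding. temperature, top_p and length_penalty
    are omitted because transformers ignores (and warns about) them when
    do_sample=False and num_beams=1."""
    return {
        "max_new_tokens": max_new_tokens,
        "min_new_tokens": min_new_tokens,
        "do_sample": False,
        "num_beams": 1,
        "no_repeat_ngram_size": 3,
        "repetition_penalty": 1.05,
    }
```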

Expected Impact for 4237-char Article (from logs):
- BEFORE: adaptive_max=256 (constrained by client)
- AFTER: adaptive_max=1024 (4237 // 3 = 1412, capped at 1024)
- Improvement: 4x more tokens!

Logs Will Show:
```
text_length=4237, requested_max=256, adaptive_max=1024, adaptive_min=614
```
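
Those values follow directly from the adaptive formula (a quick check, assuming the 60% minimum rule shown in the diff and tests below; requested_max is the client's value, which is logged but no longer used in the calculation):

```python
text_length = 4237
adaptive_max = min(max(text_length // 3, 300), 1024)  # 4237 // 3 = 1412 -> capped at 1024
adaptive_min = int(adaptive_max * 0.6)                # int(1024 * 0.6) = 614
print(f"text_length={text_length}, adaptive_max={adaptive_max}, adaptive_min={adaptive_min}")
```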

No More Warnings:
The transformers warning about invalid generation flags will disappear.

Test Results:
- All V3 tests passing (16/16) ✅
- Tests updated to reflect new behavior

Purpose of This Commit:
This is a DIAGNOSTIC commit to test the hypothesis that the Android app's
hardcoded max_tokens=256 is the root cause of early stopping. If summaries
improve after this deployment, it confirms the Android app needs updating.

Next Steps:
1. Deploy to HuggingFace Spaces
2. Test with Android app (same URL)
3. If summaries are longer and more complete → Android app is the issue
4. Then update Android app to send max_tokens=1024 or omit it entirely
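
For step 4, the simplest client-side change is to drop max_tokens from the request body entirely. A sketch in Python (the Space hostname and streaming handling are placeholders, not part of this commit; the Android app would do the equivalent in its own HTTP client):

```python
import requests

resp = requests.post(
    "https://<your-space>.hf.space/api/v3/scrape-and-summarize/stream",  # placeholder host
    json={"url": "https://example.com/article"},  # no max_tokens: backend chooses adaptively
    stream=True,
    timeout=120,
)
for chunk in resp.iter_content(chunk_size=None):
    print(chunk.decode("utf-8", errors="replace"), end="")
```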

Related commits:
- 5e83010: Initial adaptive calculation
- 6b2de93: Enhanced token allocation
- 6c96c54: Model config overrides

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

app/api/v3/scrape_summarize.py CHANGED

```diff
@@ -116,10 +116,10 @@ async def _stream_generator(text: str, payload, metadata: dict, request_id: str)
 
     # Calculate adaptive token limits based on text length
     # Formula: scale tokens with input length, but enforce min/max bounds
+    # Note: Ignores client's max_tokens to ensure quality (client often sends too-low values)
     text_length = len(text)
     adaptive_max_tokens = min(
         max(text_length // 3, 300),  # At least 300 tokens, scale ~33% of input chars
-        payload.max_tokens,  # Respect user's max if specified
         1024,  # Cap at 1024 to avoid excessive generation
     )
     # Calculate minimum length (60% of max) to encourage complete thoughts
```
app/services/hf_streaming_summarizer.py CHANGED

```diff
@@ -372,8 +372,7 @@ class HFStreamingSummarizer:
             "streamer": streamer,
             "max_new_tokens": max_new_tokens,
             "do_sample": False,
-            "temperature": temperature,
-            "top_p": top_p,
+            # Note: temperature, top_p removed - incompatible with greedy decoding
             "pad_token_id": pad_id,
             "eos_token_id": eos_id,
         }
@@ -389,8 +388,8 @@ class HFStreamingSummarizer:
         gen_kwargs["min_new_tokens"] = max(
             50, min(max_new_tokens // 2, 200)
         )
-        # Use slightly positive length_penalty to favor complete sentences
-        gen_kwargs["length_penalty"] = 1.2
+        # Note: length_penalty removed - only works with beam search (num_beams > 1)
+        # Using greedy decoding (num_beams=1) for speed
         # Reduce premature EOS in some checkpoints (optional)
         gen_kwargs["no_repeat_ngram_size"] = 3
         gen_kwargs["repetition_penalty"] = 1.05
@@ -668,15 +667,13 @@ class HFStreamingSummarizer:
             "streamer": streamer,
             "max_new_tokens": max_new_tokens,
             "do_sample": False,
-            "temperature": temperature,
-            "top_p": top_p,
+            # Note: temperature, top_p, length_penalty removed - incompatible with greedy decoding
             "pad_token_id": pad_id,
             "eos_token_id": eos_id,
             "num_return_sequences": 1,
             "num_beams": 1,
             "num_beam_groups": 1,
             "min_new_tokens": calculated_min_tokens,
-            "length_penalty": 1.2,
             "no_repeat_ngram_size": 3,
             "repetition_penalty": 1.05,
             # CRITICAL: Override model config defaults that cause early stopping
```
tests/test_v3_api.py CHANGED

```diff
@@ -315,7 +315,7 @@ def test_adaptive_tokens_medium_article(client: TestClient):
     with patch(
         "app.services.article_scraper.article_scraper_service.scrape_article"
     ) as mock_scrape:
-        # Medium article: ~2000 chars -> should get 500 tokens (2000 // 4)
+        # Medium article: ~2000 chars -> should get 666 tokens (2000 // 3)
         mock_scrape.return_value = {
             "text": "Medium article content. " * 80,  # ~2000 chars
             "title": "Medium Article",
@@ -341,8 +341,9 @@ def test_adaptive_tokens_medium_article(client: TestClient):
         )
 
         assert response.status_code == 200
-        # For 2000 chars with default max_tokens=512, should get ~500 tokens
-        assert 450 <= captured_kwargs.get("max_new_tokens", 0) <= 512
+        # Now ignores client's max_tokens, uses adaptive calculation
+        # For 2000 chars: 2000 // 3 = 666 tokens (client's 512 is ignored)
+        assert 600 <= captured_kwargs.get("max_new_tokens", 0) <= 700
         # min_length should be 60% of max_new_tokens
         expected_min = int(captured_kwargs["max_new_tokens"] * 0.6)
         assert captured_kwargs.get("min_length", 0) == expected_min
@@ -386,8 +387,8 @@ def test_adaptive_tokens_long_article(client: TestClient):
         assert captured_kwargs.get("min_length", 0) == expected_min
 
 
-def test_user_max_tokens_respected(client: TestClient):
-    """Test that user-specified max_tokens is respected when lower than adaptive."""
+def test_user_max_tokens_ignored_for_quality(client: TestClient):
+    """Test that user-specified max_tokens is IGNORED to ensure quality summaries."""
     with patch(
         "app.services.article_scraper.article_scraper_service.scrape_article"
     ) as mock_scrape:
@@ -411,15 +412,15 @@ def test_user_max_tokens_respected(client: TestClient):
         "app.services.hf_streaming_summarizer.hf_streaming_service.summarize_text_stream",
         side_effect=mock_stream,
     ):
-        # User requests only 400 tokens
+        # User requests only 400 tokens, but backend will ignore and use adaptive
         response = client.post(
             "/api/v3/scrape-and-summarize/stream",
             json={"url": "https://example.com/long", "max_tokens": 400},
         )
 
         assert response.status_code == 200
-        # Should respect user's limit of 400
-        assert captured_kwargs.get("max_new_tokens", 0) <= 400
+        # Ignores user's 400, uses adaptive (4000 // 3 = 1333, capped at 1024)
+        assert captured_kwargs.get("max_new_tokens", 0) == 1024
         # min_length should still be 60% of the actual max used
         expected_min = int(captured_kwargs["max_new_tokens"] * 0.6)
         assert captured_kwargs.get("min_length", 0) == expected_min
```