avinashHuggingface108 committed on
Commit a4bd75a Β· 1 Parent(s): 8a9a9e9

πŸš€ Deploy SmolVLM2 Video Highlights API


- FastAPI server with background processing
- SmolVLM2 + Whisper AI integration
- Docker deployment configuration
- Complete video highlights generation system
- REST API for mobile app integration

Dockerfile ADDED
@@ -0,0 +1,33 @@
+ # Use Python 3.9 slim image
+ FROM python:3.9-slim
+
+ # Set working directory
+ WORKDIR /app
+
+ # Install system dependencies
+ RUN apt-get update && apt-get install -y \
+     ffmpeg \
+     git \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Copy requirements first for better caching
+ COPY requirements.txt fastapi_requirements.txt ./
+ RUN pip install --no-cache-dir -r requirements.txt && \
+     pip install --no-cache-dir -r fastapi_requirements.txt
+
+ # Copy application code
+ COPY . .
+
+ # Create necessary directories
+ RUN mkdir -p outputs temp samples src
+
+ # Expose port
+ EXPOSE 7860
+
+ # Set environment variables
+ ENV PYTHONPATH=/app
+ ENV GRADIO_SERVER_NAME=0.0.0.0
+ ENV GRADIO_SERVER_PORT=7860
+
+ # Run the FastAPI app
+ CMD ["python", "-m", "uvicorn", "highlights_api:app", "--host", "0.0.0.0", "--port", "7860"]
README.md CHANGED
@@ -1,11 +1,81 @@
 ---
- title: Smolvlm2 Video Highlights
- emoji: πŸ‘€
- colorFrom: red
- colorTo: gray
+ title: SmolVLM2 Video Highlights
+ emoji: 🎬
+ colorFrom: blue
+ colorTo: purple
 sdk: docker
 pinned: false
- short_description: video highlights
+ license: apache-2.0
+ app_port: 7860
 ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # 🎬 SmolVLM2 Video Highlights API
+
+ **Generate intelligent video highlights using SmolVLM2 + Whisper AI**
+
+ This is a FastAPI service that combines visual analysis (SmolVLM2) with audio transcription (Whisper) to automatically create highlight videos from longer content.
+
+ ## πŸš€ Features
+
+ - **Visual Analysis**: SmolVLM2-2.2B-Instruct analyzes video frames for interesting content
+ - **Audio Processing**: Whisper transcribes speech in 99+ languages
+ - **Smart Scoring**: Combines visual and audio analysis for intelligent highlights
+ - **REST API**: Upload videos and download processed highlights
+ - **Background Processing**: Non-blocking video processing with job tracking
+
+ ## πŸ”— API Endpoints
+
+ - `POST /upload-video` - Upload a video for processing
+ - `GET /job-status/{job_id}` - Check processing status
+ - `GET /download/{filename}` - Download generated highlights
+ - `GET /docs` - Interactive API documentation
+
+ ## πŸ“± Usage
+
+ ### Via API
+ ```bash
+ # Upload video
+ curl -X POST -F "video=@your_video.mp4" https://avinashhuggingface108-smolvlm2-video-highlights.hf.space/upload-video
+
+ # Check status
+ curl https://avinashhuggingface108-smolvlm2-video-highlights.hf.space/job-status/YOUR_JOB_ID
+
+ # Download highlights
+ curl -O https://avinashhuggingface108-smolvlm2-video-highlights.hf.space/download/FILENAME.mp4
+ ```
+
+ ### Via Android App
+ Use the provided Android client code to integrate with your mobile app.
+
+ ## βš™οΈ Configuration
+
+ Default settings:
+ - **Interval**: 20 seconds (analyze every 20s)
+ - **Min Score**: 6.5 (quality threshold)
+ - **Max Highlights**: 3 (maximum highlight segments)
+ - **Whisper Model**: base (balances accuracy and speed)
+ - **Timeout**: 35 seconds per segment
+
+ ## πŸ› οΈ Technology Stack
+
+ - **SmolVLM2-2.2B-Instruct**: Vision-language model for visual content analysis
+ - **OpenAI Whisper**: Speech-to-text in 99+ languages
+ - **FastAPI**: Modern web framework for APIs
+ - **FFmpeg**: Video processing and manipulation
+ - **PyTorch**: Deep learning framework with MPS acceleration
+
+ ## 🎯 Perfect For
+
+ - Social media content creators
+ - Educational video processing
+ - Meeting/lecture summarization
+ - Sports highlight generation
+ - Entertainment content curation
+
+ ## License
+
+ Apache 2.0 - Free for commercial and personal use
+
+ ## 🀝 Contributing
+
+ Built with ❀️ using Hugging Face Transformers and open-source AI models.
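The upload β†’ poll β†’ download flow shown with curl above can also be scripted. The following is a minimal Python client sketch, not part of the commit; it assumes the `requests` package, a local `your_video.mp4`, and the response fields (`job_id`, `status`, `progress`, `highlights_url`) defined in `highlights_api.py` below.

```python
import time
import requests

BASE_URL = "https://avinashhuggingface108-smolvlm2-video-highlights.hf.space"

# Upload a local video; the server replies with a job_id (AnalysisResponse)
with open("your_video.mp4", "rb") as f:
    resp = requests.post(f"{BASE_URL}/upload-video",
                         files={"video": ("your_video.mp4", f, "video/mp4")})
resp.raise_for_status()
job_id = resp.json()["job_id"]

# Poll the job until it reaches a terminal state
while True:
    status = requests.get(f"{BASE_URL}/job-status/{job_id}").json()
    print(f"{status['status']}: {status.get('progress', 0)}% - {status.get('message', '')}")
    if status["status"] in ("completed", "failed"):
        break
    time.sleep(10)

# Download the generated highlights via the URL returned by the API
if status["status"] == "completed" and status.get("highlights_url"):
    video_bytes = requests.get(BASE_URL + status["highlights_url"]).content
    with open("highlights.mp4", "wb") as out:
        out.write(video_bytes)
```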
audio_enhanced_highlights_final.py ADDED
@@ -0,0 +1,564 @@
+ #!/usr/bin/env python3
+ """
+ Audio-Enhanced Video Highlights Generator
+ Combines SmolVLM2 visual analysis with Whisper audio transcription
+ Supports 99+ languages including Telugu, Hindi, English
+ """
+
+ import os
+ import sys
+ import cv2
+ import argparse
+ import json
+ import subprocess
+ import threading
+ import time
+ import tempfile
+ from pathlib import Path
+ from PIL import Image
+ from typing import List, Dict, Optional
+ import logging
+
+ # Add src directory to path for imports
+ sys.path.append(str(Path(__file__).parent / "src"))
+
+ try:
+     from src.smolvlm2_handler import SmolVLM2Handler
+ except ImportError:
+     print("❌ SmolVLM2Handler not found. Make sure to install dependencies first.")
+     sys.exit(1)
+
+ try:
+     import whisper
+     WHISPER_AVAILABLE = True
+     print("βœ… Whisper available for audio transcription")
+ except ImportError:
+     WHISPER_AVAILABLE = False
+     print("❌ Whisper not available. Install with: pip install openai-whisper")
+     sys.exit(1)
+
+ logging.basicConfig(level=logging.INFO)
+ logger = logging.getLogger(__name__)
+
+ class AudioVisualAnalyzer:
+     """Comprehensive analyzer combining visual and audio analysis"""
+
+     def __init__(self, whisper_model_size="base", timeout_seconds=30):
+         """Initialize with SmolVLM2 and Whisper models"""
+         print("πŸ”§ Initializing Audio-Visual Analyzer...")
+
+         # Initialize SmolVLM2 for visual analysis
+         self.vlm_handler = SmolVLM2Handler()
+         self.timeout_seconds = timeout_seconds
+
+         # Initialize Whisper for audio analysis
+         if WHISPER_AVAILABLE:
+             print(f"πŸ“₯ Loading Whisper model ({whisper_model_size})...")
+             self.whisper_model = whisper.load_model(whisper_model_size)
+             print("βœ… Whisper model loaded successfully")
+         else:
+             self.whisper_model = None
+             print("⚠️ Whisper not available - audio analysis disabled")
+
+     def extract_audio_segments(self, video_path: str, segments: List[Dict]) -> List[str]:
+         """Extract audio for specific video segments"""
+         audio_files = []
+         temp_dir = tempfile.mkdtemp()
+
+         for i, segment in enumerate(segments):
+             start_time = segment['start_time']
+             duration = segment['duration']
+
+             audio_path = os.path.join(temp_dir, f"segment_{i}.wav")
+
+             # Extract audio segment using FFmpeg
+             cmd = [
+                 'ffmpeg', '-i', video_path,
+                 '-ss', str(start_time),
+                 '-t', str(duration),
+                 '-vn',                    # No video
+                 '-acodec', 'pcm_s16le',   # Uncompressed audio
+                 '-ar', '16000',           # 16kHz sample rate for Whisper
+                 '-ac', '1',               # Mono
+                 '-f', 'wav',              # Force WAV format
+                 '-y',                     # Overwrite
+                 audio_path
+             ]
+
+             try:
+                 result = subprocess.run(cmd, check=True, capture_output=True, text=True)
+                 if os.path.exists(audio_path) and os.path.getsize(audio_path) > 0:
+                     audio_files.append(audio_path)
+                     logger.info(f"πŸ“„ Extracted audio segment {i+1}: {duration:.1f}s")
+                 else:
+                     logger.warning(f"⚠️ Audio segment {i+1} is empty or missing")
+                     audio_files.append(None)
+             except subprocess.CalledProcessError as e:
+                 logger.warning(f"⚠️ No audio stream in segment {i+1} (this is normal for silent videos)")
+                 audio_files.append(None)
+
+         return audio_files
+
+     def transcribe_audio_segment(self, audio_path: str) -> Dict:
+         """Transcribe audio segment with Whisper"""
+         if not WHISPER_AVAILABLE or not audio_path or not os.path.exists(audio_path):
+             return {"text": "", "language": "unknown", "confidence": 0.0}
+
+         try:
+             result = self.whisper_model.transcribe(
+                 audio_path,
+                 language=None,  # Auto-detect language
+                 task="transcribe"
+             )
+
+             return {
+                 "text": result.get("text", "").strip(),
+                 "language": result.get("language", "unknown"),
+                 "confidence": 1.0  # Whisper doesn't provide confidence scores
+             }
+         except Exception as e:
+             logger.error(f"❌ Audio transcription failed: {e}")
+             return {"text": "", "language": "unknown", "confidence": 0.0}
+
+     def analyze_visual_content(self, frame_path: str) -> Dict:
+         """Analyze visual content using SmolVLM2 with robust error handling"""
+         max_retries = 2
+         retry_count = 0
+
+         while retry_count < max_retries:
+             try:
+                 def generate_with_timeout():
+                     prompt = ("Analyze this video frame for interesting, engaging, or highlight-worthy content. "
+                               "Rate the excitement/interest level from 1-10 and explain what makes it noteworthy. "
+                               "Focus on action, emotion, important moments, or visually striking elements.")
+                     return self.vlm_handler.generate_response(frame_path, prompt)
+
+                 # Run with timeout protection
+                 thread_result = [None]
+                 exception_result = [None]
+
+                 def target():
+                     try:
+                         thread_result[0] = generate_with_timeout()
+                     except Exception as e:
+                         exception_result[0] = e
+
+                 thread = threading.Thread(target=target)
+                 thread.daemon = True
+                 thread.start()
+                 thread.join(self.timeout_seconds)
+
+                 if thread.is_alive():
+                     logger.warning(f"⏰ Visual analysis timed out after {self.timeout_seconds}s (attempt {retry_count + 1})")
+                     retry_count += 1
+                     if retry_count >= max_retries:
+                         return {"description": "Analysis timed out after multiple attempts", "score": 6.0}
+                     continue
+
+                 if exception_result[0]:
+                     error_msg = str(exception_result[0])
+                     if "probability tensor" in error_msg or "inf" in error_msg or "nan" in error_msg:
+                         logger.warning(f"⚠️ Model inference error, retrying (attempt {retry_count + 1}): {error_msg}")
+                         retry_count += 1
+                         if retry_count >= max_retries:
+                             return {"description": "Model inference failed after retries", "score": 6.0}
+                         continue
+                     else:
+                         raise exception_result[0]
+
+                 response = thread_result[0]
+                 if not response or len(response.strip()) == 0:
+                     logger.warning(f"⚠️ Empty response, retrying (attempt {retry_count + 1})")
+                     retry_count += 1
+                     if retry_count >= max_retries:
+                         return {"description": "No meaningful response after retries", "score": 6.0}
+                     continue
+
+                 # Extract score from response
+                 score = self.extract_score_from_text(response)
+                 return {"description": response, "score": score}
+
+             except Exception as e:
+                 error_msg = str(e)
+                 logger.warning(f"⚠️ Visual analysis error (attempt {retry_count + 1}): {error_msg}")
+                 retry_count += 1
+                 if retry_count >= max_retries:
+                     return {"description": f"Analysis failed after {max_retries} attempts: {error_msg}", "score": 6.0}
+
+         # Fallback if all retries failed
+         return {"description": "Analysis failed after all retry attempts", "score": 6.0}
+
+     def extract_score_from_text(self, text: str) -> float:
+         """Extract numeric score from analysis text"""
+         import re
+
+         # Look for patterns like "8/10", "score: 7", "rate: 6.5", etc.
+         patterns = [
+             r'(\d+(?:\.\d+)?)\s*/\s*10',                              # "8/10" or "7.5/10"
+             r'(?:score|rating|rate)(?:\s*[:=]\s*)(\d+(?:\.\d+)?)',    # "score: 8" or "rating=7.5"
+             r'(\d+(?:\.\d+)?)\s*(?:out of|/)\s*10',                   # "8 out of 10"
+             r'(?:^|\s)(\d+(?:\.\d+)?)(?:\s*[/]\s*10)?(?:\s|$)',       # Just numbers
+         ]
+
+         for pattern in patterns:
+             matches = re.findall(pattern, text.lower())
+             if matches:
+                 try:
+                     score = float(matches[0])
+                     return min(max(score, 1.0), 10.0)  # Clamp between 1-10
+                 except ValueError:
+                     continue
+
+         return 6.0  # Default score if no pattern found
+
+     def calculate_combined_score(self, visual_score: float, audio_text: str, audio_lang: str) -> float:
+         """Calculate combined score from visual and audio analysis"""
+         # Start with visual score
+         combined_score = visual_score
+
+         # Audio content scoring
+         if audio_text:
+             audio_bonus = 0.0
+             text_lower = audio_text.lower()
+
+             # Positive indicators
+             excitement_words = ['amazing', 'incredible', 'wow', 'fantastic', 'awesome', 'perfect', 'excellent']
+             action_words = ['goal', 'win', 'victory', 'success', 'breakthrough', 'achievement']
+             emotion_words = ['happy', 'excited', 'thrilled', 'surprised', 'shocked', 'love']
+
+             # Telugu positive indicators (basic)
+             telugu_positive = ['అద్భుతం', 'చాలా బాగుంది', 'డాడ్', 'సూపర్']
+
+             # Count positive indicators
+             for word_list in [excitement_words, action_words, emotion_words, telugu_positive]:
+                 for word in word_list:
+                     if word in text_lower:
+                         audio_bonus += 0.5
+
+             # Length bonus for substantial content
+             if len(audio_text) > 50:
+                 audio_bonus += 0.3
+             elif len(audio_text) > 20:
+                 audio_bonus += 0.1
+
+             # Language diversity bonus
+             if audio_lang in ['te', 'telugu']:    # Telugu content
+                 audio_bonus += 0.2
+             elif audio_lang in ['hi', 'hindi']:   # Hindi content
+                 audio_bonus += 0.2
+
+             combined_score += audio_bonus
+
+         # Clamp final score
+         return min(max(combined_score, 1.0), 10.0)
+
+     def analyze_segment(self, video_path: str, segment: Dict, temp_frame_path: str) -> Dict:
+         """Analyze a single video segment with both visual and audio"""
+         start_time = segment['start_time']
+         duration = segment['duration']
+
+         logger.info(f"πŸ” Analyzing segment at {start_time:.1f}s ({duration:.1f}s duration)")
+
+         # Visual analysis
+         visual_analysis = self.analyze_visual_content(temp_frame_path)
+
+         # Audio analysis
+         audio_files = self.extract_audio_segments(video_path, [segment])
+         audio_analysis = {"text": "", "language": "unknown", "confidence": 0.0}
+
+         if audio_files and audio_files[0]:
+             audio_analysis = self.transcribe_audio_segment(audio_files[0])
+             # Cleanup temporary audio file
+             try:
+                 os.unlink(audio_files[0])
+             except:
+                 pass
+
+         # Combined scoring
+         combined_score = self.calculate_combined_score(
+             visual_analysis['score'],
+             audio_analysis['text'],
+             audio_analysis['language']
+         )
+
+         return {
+             'start_time': start_time,
+             'duration': duration,
+             'visual_score': visual_analysis['score'],
+             'visual_description': visual_analysis['description'],
+             'audio_text': audio_analysis['text'],
+             'audio_language': audio_analysis['language'],
+             'combined_score': combined_score,
+             'selected': False
+         }
+
+ def extract_frames_at_intervals(video_path: str, interval_seconds: float = 10.0) -> List[Dict]:
+     """Extract frames at regular intervals from video"""
+     cap = cv2.VideoCapture(video_path)
+     if not cap.isOpened():
+         raise ValueError(f"Cannot open video file: {video_path}")
+
+     fps = cap.get(cv2.CAP_PROP_FPS)
+     total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
+     duration = total_frames / fps
+
+     logger.info(f"πŸ“Ή Video: {duration:.1f}s, {fps:.1f} FPS, {total_frames} frames")
+
+     segments = []
+     current_time = 0
+
+     while current_time < duration:
+         segment_duration = min(interval_seconds, duration - current_time)
+         segments.append({
+             'start_time': current_time,
+             'duration': segment_duration,
+             'frame_number': int(current_time * fps)
+         })
+         current_time += interval_seconds
+
+     cap.release()
+     return segments
+
+ def save_frame_at_time(video_path: str, time_seconds: float, output_path: str) -> bool:
+     """Save a frame at specific time with robust frame extraction"""
+     cap = cv2.VideoCapture(video_path)
+     if not cap.isOpened():
+         return False
+
+     try:
+         fps = cap.get(cv2.CAP_PROP_FPS)
+         total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
+         frame_number = int(time_seconds * fps)
+
+         # Ensure frame number is within valid range
+         frame_number = min(frame_number, total_frames - 1)
+         frame_number = max(frame_number, 0)
+
+         # Try to extract frame with fallback options
+         for attempt in range(3):
+             try:
+                 # Try exact frame first
+                 test_frame = frame_number + attempt
+                 if test_frame >= total_frames:
+                     test_frame = frame_number - attempt
+                 if test_frame < 0:
+                     test_frame = frame_number
+
+                 cap.set(cv2.CAP_PROP_POS_FRAMES, test_frame)
+                 ret, frame = cap.read()
+
+                 if ret and frame is not None and frame.size > 0:
+                     # Validate frame data
+                     if len(frame.shape) == 3 and frame.shape[2] == 3:  # Valid color frame
+                         success = cv2.imwrite(output_path, frame)
+                         if success:
+                             cap.release()
+                             return True
+
+             except Exception as e:
+                 logger.warning(f"Frame extraction attempt {attempt + 1} failed: {e}")
+                 continue
+
+         cap.release()
+         return False
+
+     except Exception as e:
+         logger.error(f"Critical error in frame extraction: {e}")
+         cap.release()
+         return False
+
+ def create_highlights_video(video_path: str, selected_segments: List[Dict], output_path: str):
+     """Create highlights video from selected segments"""
+     if not selected_segments:
+         logger.error("❌ No segments selected for highlights")
+         return False
+
+     # Create temporary files for each segment
+     temp_files = []
+     temp_dir = tempfile.mkdtemp()
+
+     for i, segment in enumerate(selected_segments):
+         temp_file = os.path.join(temp_dir, f"segment_{i}.mp4")
+
+         cmd = [
+             'ffmpeg', '-i', video_path,
+             '-ss', str(segment['start_time']),
+             '-t', str(segment['duration']),
+             '-c', 'copy',  # Copy streams without re-encoding
+             '-y', temp_file
+         ]
+
+         try:
+             subprocess.run(cmd, check=True, capture_output=True)
+             temp_files.append(temp_file)
+             logger.info(f"βœ… Created segment {i+1}/{len(selected_segments)}")
+         except subprocess.CalledProcessError as e:
+             logger.error(f"❌ Failed to create segment {i+1}: {e}")
+             continue
+
+     if not temp_files:
+         logger.error("❌ No valid segments created")
+         return False
+
+     # Create concat file
+     concat_file = os.path.join(temp_dir, "concat.txt")
+     with open(concat_file, 'w') as f:
+         for temp_file in temp_files:
+             f.write(f"file '{temp_file}'\n")
+
+     # Concatenate segments
+     cmd = [
+         'ffmpeg', '-f', 'concat', '-safe', '0',
+         '-i', concat_file,
+         '-c', 'copy',
+         '-y', output_path
+     ]
+
+     try:
+         subprocess.run(cmd, check=True, capture_output=True)
+         logger.info(f"βœ… Highlights video created: {output_path}")
+
+         # Cleanup
+         for temp_file in temp_files:
+             try:
+                 os.unlink(temp_file)
+             except:
+                 pass
+         try:
+             os.unlink(concat_file)
+             os.rmdir(temp_dir)
+         except:
+             pass
+
+         return True
+     except subprocess.CalledProcessError as e:
+         logger.error(f"❌ Failed to create highlights video: {e}")
+         return False
+
+ def main():
+     parser = argparse.ArgumentParser(description="Audio-Enhanced Video Highlights Generator")
+     parser.add_argument("video_path", help="Path to input video file")
+     parser.add_argument("--output", "-o", default="audio_enhanced_highlights.mp4",
+                         help="Output highlights video path")
+     parser.add_argument("--interval", "-i", type=float, default=10.0,
+                         help="Analysis interval in seconds (default: 10.0)")
+     parser.add_argument("--min-score", "-s", type=float, default=7.0,
+                         help="Minimum score for highlights (default: 7.0)")
+     parser.add_argument("--max-highlights", "-m", type=int, default=5,
+                         help="Maximum number of highlights (default: 5)")
+     parser.add_argument("--whisper-model", "-w", default="base",
+                         choices=["tiny", "base", "small", "medium", "large"],
+                         help="Whisper model size (default: base)")
+     parser.add_argument("--timeout", "-t", type=int, default=30,
+                         help="Timeout for each analysis in seconds (default: 30)")
+     parser.add_argument("--save-analysis", action="store_true",
+                         help="Save detailed analysis to JSON file")
+
+     args = parser.parse_args()
+
+     # Validate input
+     if not os.path.exists(args.video_path):
+         print(f"❌ Video file not found: {args.video_path}")
+         sys.exit(1)
+
+     print("🎬 Audio-Enhanced Video Highlights Generator")
+     print(f"πŸ“ Input: {args.video_path}")
+     print(f"πŸ“ Output: {args.output}")
+     print(f"⏱️ Analysis interval: {args.interval}s")
+     print(f"🎯 Minimum score: {args.min_score}")
+     print(f"πŸ† Max highlights: {args.max_highlights}")
+     print(f"πŸŽ™οΈ Whisper model: {args.whisper_model}")
+     print()
+
+     try:
+         # Initialize analyzer
+         analyzer = AudioVisualAnalyzer(
+             whisper_model_size=args.whisper_model,
+             timeout_seconds=args.timeout
+         )
+
+         # Extract segments for analysis
+         segments = extract_frames_at_intervals(args.video_path, args.interval)
+         print(f"πŸ“Š Analyzing {len(segments)} segments...")
+
+         analyzed_segments = []
+         temp_frame_path = "temp_frame.jpg"
+
+         for i, segment in enumerate(segments):
+             print(f"\nπŸ” Segment {i+1}/{len(segments)} (t={segment['start_time']:.1f}s)")
+
+             # Save frame for visual analysis
+             if save_frame_at_time(args.video_path, segment['start_time'], temp_frame_path):
+                 # Analyze segment
+                 analysis = analyzer.analyze_segment(args.video_path, segment, temp_frame_path)
+                 analyzed_segments.append(analysis)
+
+                 print(f"   πŸ‘οΈ Visual: {analysis['visual_score']:.1f}/10")
+                 print(f"   πŸŽ™οΈ Audio: '{analysis['audio_text'][:50]}...' ({analysis['audio_language']})")
+                 print(f"   🎯 Combined: {analysis['combined_score']:.1f}/10")
+             else:
+                 print(f"   ❌ Failed to extract frame")
+
+         # Cleanup temp frame
+         try:
+             os.unlink(temp_frame_path)
+         except:
+             pass
+
+         if not analyzed_segments:
+             print("❌ No segments analyzed successfully")
+             sys.exit(1)
+
+         # Select best segments
+         analyzed_segments.sort(key=lambda x: x['combined_score'], reverse=True)
+         selected_segments = [s for s in analyzed_segments if s['combined_score'] >= args.min_score]
+         selected_segments = selected_segments[:args.max_highlights]
+
+         print(f"\nπŸ† Selected {len(selected_segments)} highlights:")
+         for i, segment in enumerate(selected_segments):
+             print(f"{i+1}. t={segment['start_time']:.1f}s, score={segment['combined_score']:.1f}")
+             if segment['audio_text']:
+                 print(f"   Audio: \"{segment['audio_text'][:100]}...\"")
+
+         if not selected_segments:
+             print(f"❌ No segments met minimum score of {args.min_score}")
+             sys.exit(1)
+
+         # Create highlights video
+         print(f"\n🎬 Creating highlights video...")
+         success = create_highlights_video(args.video_path, selected_segments, args.output)
+
+         if success:
+             print(f"βœ… Audio-enhanced highlights created: {args.output}")
+
+             # Save analysis if requested
+             if args.save_analysis:
+                 analysis_file = args.output.replace('.mp4', '_analysis.json')
+                 with open(analysis_file, 'w') as f:
+                     json.dump({
+                         'input_video': args.video_path,
+                         'output_video': args.output,
+                         'settings': {
+                             'interval': args.interval,
+                             'min_score': args.min_score,
+                             'max_highlights': args.max_highlights,
+                             'whisper_model': args.whisper_model,
+                             'timeout': args.timeout
+                         },
+                         'segments': analyzed_segments,
+                         'selected_segments': selected_segments
+                     }, f, indent=2)
+                 print(f"πŸ“Š Analysis saved: {analysis_file}")
+         else:
+             print("❌ Failed to create highlights video")
+             sys.exit(1)
+
+     except KeyboardInterrupt:
+         print("\n⏹️ Operation cancelled by user")
+         sys.exit(1)
+     except Exception as e:
+         print(f"❌ Error: {e}")
+         sys.exit(1)
+
+ if __name__ == "__main__":
+     main()
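The selection step in `main()` above is easy to see in isolation: segments are sorted by `combined_score`, anything below `--min-score` is dropped, and at most `--max-highlights` survive. A self-contained sketch with dummy scores (the dict keys mirror what `AudioVisualAnalyzer.analyze_segment()` returns):

```python
# Standalone sketch of the highlight-selection step, using dummy analysis results.
analyzed_segments = [
    {"start_time": 0.0,  "duration": 10.0, "combined_score": 5.8},
    {"start_time": 10.0, "duration": 10.0, "combined_score": 8.4},
    {"start_time": 20.0, "duration": 10.0, "combined_score": 7.1},
    {"start_time": 30.0, "duration": 10.0, "combined_score": 9.0},
]
min_score = 7.0      # --min-score
max_highlights = 2   # --max-highlights

# Sort best-first, drop anything below the threshold, keep at most max_highlights
analyzed_segments.sort(key=lambda s: s["combined_score"], reverse=True)
selected = [s for s in analyzed_segments if s["combined_score"] >= min_score][:max_highlights]

for s in selected:
    print(f"t={s['start_time']:.1f}s score={s['combined_score']:.1f}")
# -> t=30.0s score=9.0
# -> t=10.0s score=8.4
```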
fastapi_requirements.txt ADDED
@@ -0,0 +1,9 @@
+ # FastAPI Dependencies for SmolVLM2 Video Highlights API
+ # Add these to your existing requirements.txt
+
+ fastapi==0.104.1
+ uvicorn[standard]==0.24.0
+ python-multipart==0.0.6
+ pydantic==2.5.0
+ python-jose[cryptography]==3.3.0
+ passlib[bcrypt]==1.7.4
highlights_api.py ADDED
@@ -0,0 +1,345 @@
+ #!/usr/bin/env python3
+ """
+ FastAPI Wrapper for Audio-Enhanced Video Highlights
+ Converts your SmolVLM2 + Whisper system into a web API for Android apps
+ """
+
+ from fastapi import FastAPI, UploadFile, File, HTTPException, BackgroundTasks
+ from fastapi.responses import FileResponse, JSONResponse
+ from fastapi.middleware.cors import CORSMiddleware
+ from pydantic import BaseModel
+ import os
+ import sys
+ import tempfile
+ import uuid
+ import json
+ import asyncio
+ from pathlib import Path
+ from typing import Optional
+ import logging
+
+ # Add src directory to path for imports
+ sys.path.append(str(Path(__file__).parent / "src"))
+
+ try:
+     from audio_enhanced_highlights_final import AudioVisualAnalyzer, extract_frames_at_intervals, save_frame_at_time, create_highlights_video
+ except ImportError:
+     print("❌ Cannot import audio_enhanced_highlights_final.py")
+     sys.exit(1)
+
+ # Configure logging
+ logging.basicConfig(level=logging.INFO)
+ logger = logging.getLogger(__name__)
+
+ # FastAPI app
+ app = FastAPI(
+     title="SmolVLM2 Video Highlights API",
+     description="Generate intelligent video highlights using SmolVLM2 + Whisper",
+     version="1.0.0"
+ )
+
+ # Enable CORS for Android apps
+ app.add_middleware(
+     CORSMiddleware,
+     allow_origins=["*"],  # In production, specify your Android app's domain
+     allow_credentials=True,
+     allow_methods=["*"],
+     allow_headers=["*"],
+ )
+
+ # Request/Response models
+ class AnalysisRequest(BaseModel):
+     interval: float = 20.0
+     min_score: float = 6.5
+     max_highlights: int = 3
+     whisper_model: str = "base"
+     timeout: int = 35
+
+ class AnalysisResponse(BaseModel):
+     job_id: str
+     status: str
+     message: str
+
+ class JobStatus(BaseModel):
+     job_id: str
+     status: str    # "processing", "completed", "failed"
+     progress: int  # 0-100
+     message: str
+     highlights_url: Optional[str] = None
+     analysis_url: Optional[str] = None
+
+ # Global storage for jobs (in production, use Redis/database)
+ active_jobs = {}
+ completed_jobs = {}
+
+ # Create output directories
+ os.makedirs("outputs", exist_ok=True)
+ os.makedirs("temp", exist_ok=True)
+
+ @app.get("/")
+ async def root():
+     return {
+         "message": "SmolVLM2 Video Highlights API",
+         "version": "1.0.0",
+         "endpoints": {
+             "upload": "/upload-video",
+             "status": "/job-status/{job_id}",
+             "download": "/download/{filename}"
+         }
+     }
+
+ @app.post("/upload-video", response_model=AnalysisResponse)
+ async def upload_video(
+     background_tasks: BackgroundTasks,
+     video: UploadFile = File(...),
+     interval: float = 20.0,
+     min_score: float = 6.5,
+     max_highlights: int = 3,
+     whisper_model: str = "base",
+     timeout: int = 35
+ ):
+     """
+     Upload a video and start processing highlights
+     """
+     # Validate file
+     if not video.filename.lower().endswith(('.mp4', '.avi', '.mov', '.mkv')):
+         raise HTTPException(status_code=400, detail="Only video files are supported")
+
+     # Generate unique job ID
+     job_id = str(uuid.uuid4())
+
+     try:
+         # Save uploaded video
+         temp_video_path = f"temp/{job_id}_{video.filename}"
+         with open(temp_video_path, "wb") as f:
+             content = await video.read()
+             f.write(content)
+
+         # Store job info
+         active_jobs[job_id] = {
+             "status": "processing",
+             "progress": 0,
+             "message": "Video uploaded, starting analysis...",
+             "video_path": temp_video_path,
+             "settings": {
+                 "interval": interval,
+                 "min_score": min_score,
+                 "max_highlights": max_highlights,
+                 "whisper_model": whisper_model,
+                 "timeout": timeout
+             }
+         }
+
+         # Start processing in background
+         background_tasks.add_task(
+             process_video_highlights,
+             job_id,
+             temp_video_path,
+             interval,
+             min_score,
+             max_highlights,
+             whisper_model,
+             timeout
+         )
+
+         return AnalysisResponse(
+             job_id=job_id,
+             status="processing",
+             message="Video uploaded successfully. Processing started."
+         )
+
+     except Exception as e:
+         logger.error(f"Upload failed: {e}")
+         raise HTTPException(status_code=500, detail=f"Upload failed: {str(e)}")
+
+ @app.get("/job-status/{job_id}", response_model=JobStatus)
+ async def get_job_status(job_id: str):
+     """
+     Get the status of a processing job
+     """
+     # Check active jobs
+     if job_id in active_jobs:
+         job = active_jobs[job_id]
+         return JobStatus(
+             job_id=job_id,
+             status=job["status"],
+             progress=job["progress"],
+             message=job["message"]
+         )
+
+     # Check completed jobs
+     if job_id in completed_jobs:
+         job = completed_jobs[job_id]
+         return JobStatus(
+             job_id=job_id,
+             status=job["status"],
+             progress=100,
+             message=job["message"],
+             highlights_url=job.get("highlights_url"),
+             analysis_url=job.get("analysis_url")
+         )
+
+     raise HTTPException(status_code=404, detail="Job not found")
+
+ @app.get("/download/{filename}")
+ async def download_file(filename: str):
+     """
+     Download generated files
+     """
+     file_path = f"outputs/{filename}"
+     if not os.path.exists(file_path):
+         raise HTTPException(status_code=404, detail="File not found")
+
+     return FileResponse(
+         file_path,
+         media_type='application/octet-stream',
+         filename=filename
+     )
+
+ async def process_video_highlights(
+     job_id: str,
+     video_path: str,
+     interval: float,
+     min_score: float,
+     max_highlights: int,
+     whisper_model: str,
+     timeout: int
+ ):
+     """
+     Background task to process video highlights
+     """
+     try:
+         # Update status
+         active_jobs[job_id]["progress"] = 10
+         active_jobs[job_id]["message"] = "Initializing AI models..."
+
+         # Initialize analyzer
+         analyzer = AudioVisualAnalyzer(
+             whisper_model_size=whisper_model,
+             timeout_seconds=timeout
+         )
+
+         active_jobs[job_id]["progress"] = 20
+         active_jobs[job_id]["message"] = "Extracting video segments..."
+
+         # Extract segments
+         segments = extract_frames_at_intervals(video_path, interval)
+         total_segments = len(segments)
+
+         active_jobs[job_id]["progress"] = 30
+         active_jobs[job_id]["message"] = f"Analyzing {total_segments} segments..."
+
+         # Analyze segments
+         analyzed_segments = []
+         temp_frame_path = f"temp/{job_id}_frame.jpg"
+
+         for i, segment in enumerate(segments):
+             # Update progress
+             progress = 30 + int((i / total_segments) * 50)  # 30-80%
+             active_jobs[job_id]["progress"] = progress
+             active_jobs[job_id]["message"] = f"Analyzing segment {i+1}/{total_segments}"
+
+             # Save frame for visual analysis
+             if save_frame_at_time(video_path, segment['start_time'], temp_frame_path):
+                 # Analyze segment
+                 analysis = analyzer.analyze_segment(video_path, segment, temp_frame_path)
+                 analyzed_segments.append(analysis)
+
+         # Cleanup temp frame
+         try:
+             os.unlink(temp_frame_path)
+         except:
+             pass
+
+         active_jobs[job_id]["progress"] = 85
+         active_jobs[job_id]["message"] = "Selecting best highlights..."
+
+         # Select best segments
+         analyzed_segments.sort(key=lambda x: x['combined_score'], reverse=True)
+         selected_segments = [s for s in analyzed_segments if s['combined_score'] >= min_score]
+         selected_segments = selected_segments[:max_highlights]
+
+         if not selected_segments:
+             raise Exception(f"No segments met minimum score of {min_score}")
+
+         active_jobs[job_id]["progress"] = 90
+         active_jobs[job_id]["message"] = f"Creating highlights video with {len(selected_segments)} segments..."
+
+         # Create output filenames
+         highlights_filename = f"{job_id}_highlights.mp4"
+         analysis_filename = f"{job_id}_analysis.json"
+         highlights_path = f"outputs/{highlights_filename}"
+         analysis_path = f"outputs/{analysis_filename}"
+
+         # Create highlights video
+         success = create_highlights_video(video_path, selected_segments, highlights_path)
+
+         if not success:
+             raise Exception("Failed to create highlights video")
+
+         # Save analysis
+         analysis_data = {
+             'job_id': job_id,
+             'input_video': video_path,
+             'output_video': highlights_path,
+             'settings': {
+                 'interval': interval,
+                 'min_score': min_score,
+                 'max_highlights': max_highlights,
+                 'whisper_model': whisper_model,
+                 'timeout': timeout
+             },
+             'segments': analyzed_segments,
+             'selected_segments': selected_segments,
+             'summary': {
+                 'total_segments': len(analyzed_segments),
+                 'selected_segments': len(selected_segments),
+                 'processing_time': "Completed successfully"
+             }
+         }
+
+         with open(analysis_path, 'w') as f:
+             json.dump(analysis_data, f, indent=2)
+
+         # Mark as completed
+         completed_jobs[job_id] = {
+             "status": "completed",
+             "message": f"Successfully created highlights with {len(selected_segments)} segments",
+             "highlights_url": f"/download/{highlights_filename}",
+             "analysis_url": f"/download/{analysis_filename}",
+             "summary": analysis_data['summary']
+         }
+
+         # Remove from active jobs
+         del active_jobs[job_id]
+
+         # Cleanup temp video
+         try:
+             os.unlink(video_path)
+         except:
+             pass
+
+     except Exception as e:
+         logger.error(f"Processing failed for job {job_id}: {e}")
+
+         # Mark as failed
+         completed_jobs[job_id] = {
+             "status": "failed",
+             "message": f"Processing failed: {str(e)}",
+             "highlights_url": None,
+             "analysis_url": None
+         }
+
+         # Remove from active jobs
+         if job_id in active_jobs:
+             del active_jobs[job_id]
+
+         # Cleanup
+         try:
+             os.unlink(video_path)
+         except:
+             pass
+
+ if __name__ == "__main__":
+     import uvicorn
+     uvicorn.run(app, host="0.0.0.0", port=8000)
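A quick way to smoke-test these routes locally is FastAPI's `TestClient`. The sketch below is illustrative only and assumes the full dependency set is installed, since importing `highlights_api` also imports the analyzer module (Whisper, SmolVLM2 handler) at load time.

```python
# Local smoke test for the API routes defined above.
from fastapi.testclient import TestClient

from highlights_api import app

client = TestClient(app)

# The root endpoint lists the available routes
resp = client.get("/")
assert resp.status_code == 200
print(resp.json()["endpoints"])

# Unknown job IDs return 404 from /job-status/{job_id}
resp = client.get("/job-status/does-not-exist")
assert resp.status_code == 404
```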
requirements.txt ADDED
@@ -0,0 +1,32 @@
+ # SmolVLM2 Video Testing Requirements
+
+ # Core ML and Vision Libraries
+ torch>=2.0.0
+ torchvision>=0.15.0
+ transformers>=4.40.0
+ accelerate>=0.27.0
+ pillow>=10.0.0
+
+ # Video Processing
+ opencv-python>=4.8.0
+ imageio>=2.31.0
+ imageio-ffmpeg>=0.4.9
+
+ # Audio Transcription (required by audio_enhanced_highlights_final.py)
+ openai-whisper
+
+ # Hugging Face Integration
+ huggingface-hub>=0.20.0
+ datasets>=2.16.0
+
+ # Utilities
+ numpy>=1.24.0
+ matplotlib>=3.7.0
+ tqdm>=4.65.0
+ requests>=2.31.0
+
+ # Development Tools
+ jupyter>=1.0.0
+ ipykernel>=6.25.0
+ black>=23.0.0
+ flake8>=6.0.0
+
+ # Optional: For better performance on Apple Silicon
+ # Install with: pip install --upgrade --force-reinstall --no-cache-dir torch torchvision --index-url https://download.pytorch.org/whl/cpu
src/smolvlm2_handler.py ADDED
@@ -0,0 +1,281 @@
+ #!/usr/bin/env python3
+ """
+ SmolVLM2 Model Handler
+ Handles loading and inference with the SmolVLM2-2.2B-Instruct model
+ """
+
+ import torch
+ from transformers import AutoModelForImageTextToText, AutoProcessor
+ from PIL import Image
+ import requests
+ from typing import List, Union, Optional
+ import logging
+ import warnings
+
+ # Configure logging
+ logging.basicConfig(level=logging.INFO)
+ logger = logging.getLogger(__name__)
+
+ # Suppress some warnings for cleaner output
+ warnings.filterwarnings("ignore", category=UserWarning)
+
+ class SmolVLM2Handler:
+     """Handler for SmolVLM2 model operations"""
+
+     def __init__(self, model_name: str = "HuggingFaceTB/SmolVLM2-2.2B-Instruct", device: str = "auto"):
+         """
+         Initialize SmolVLM2 model
+
+         Args:
+             model_name: HuggingFace model identifier
+             device: Device to use ('auto', 'cpu', 'cuda', 'mps')
+         """
+         self.model_name = model_name
+         self.device = self._get_device(device)
+         self.model = None
+         self.processor = None
+
+         logger.info(f"Initializing SmolVLM2 on device: {self.device}")
+         self._load_model()
+
+     def _get_device(self, device: str) -> str:
+         """Determine the best device to use"""
+         if device == "auto":
+             if torch.cuda.is_available():
+                 return "cuda"
+             elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
+                 return "mps"  # Apple Silicon GPU
+             else:
+                 return "cpu"
+         return device
+
+     def _load_model(self):
+         """Load the model and processor"""
+         try:
+             logger.info("Loading processor...")
+             self.processor = AutoProcessor.from_pretrained(
+                 self.model_name,
+                 trust_remote_code=True
+             )
+
+             logger.info("Loading model...")
+             self.model = AutoModelForImageTextToText.from_pretrained(
+                 self.model_name,
+                 torch_dtype=torch.float16 if self.device != "cpu" else torch.float32,
+                 trust_remote_code=True,
+                 device_map=self.device if self.device != "cpu" else None
+             )
+
+             if self.device == "cpu":
+                 self.model = self.model.to(self.device)
+
+             logger.info("βœ… Model loaded successfully!")
+
+         except Exception as e:
+             logger.error(f"❌ Failed to load model: {e}")
+             raise
+
+     def process_image(self, image_input: Union[str, Image.Image]) -> Image.Image:
+         """
+         Process image input into PIL Image
+
+         Args:
+             image_input: File path, URL, or PIL Image
+
+         Returns:
+             PIL Image object
+         """
+         if isinstance(image_input, str):
+             if image_input.startswith(('http://', 'https://')):
+                 # Download from URL
+                 image = Image.open(requests.get(image_input, stream=True).raw)
+             else:
+                 # Load from file path
+                 image = Image.open(image_input)
+         elif isinstance(image_input, Image.Image):
+             image = image_input
+         else:
+             raise ValueError("Image input must be file path, URL, or PIL Image")
+
+         # Convert to RGB if necessary
+         if image.mode != 'RGB':
+             image = image.convert('RGB')
+
+         return image
+
+     def generate_response(
+         self,
+         image_input: Union[str, Image.Image, List[Image.Image]],
+         text_prompt: str,
+         max_new_tokens: int = 512,
+         temperature: float = 0.7,
+         do_sample: bool = True
+     ) -> str:
+         """
+         Generate response from image(s) and text prompt
+
+         Args:
+             image_input: Single image or list of images
+             text_prompt: Text prompt/question
+             max_new_tokens: Maximum tokens to generate
+             temperature: Sampling temperature
+             do_sample: Whether to use sampling
+
+         Returns:
+             Generated text response
+         """
+         try:
+             # Process images
+             if isinstance(image_input, list):
+                 images = [self.process_image(img) for img in image_input]
+             else:
+                 images = [self.process_image(image_input)]
+
+             # Create proper conversation format for SmolVLM2
+             messages = [
+                 {
+                     "role": "user",
+                     "content": [{"type": "text", "text": text_prompt}]
+                 }
+             ]
+
+             # Add image content to the message
+             for img in images:
+                 messages[0]["content"].insert(0, {"type": "image", "image": img})
+
+             # Apply chat template
+             try:
+                 prompt = self.processor.apply_chat_template(
+                     messages,
+                     add_generation_prompt=True
+                 )
+             except Exception:
+                 # Fallback to simple format if chat template fails
+                 image_tokens = "<image>" * len(images)
+                 prompt = f"{image_tokens}{text_prompt}"
+
+             # Prepare inputs
+             inputs = self.processor(
+                 images=images,
+                 text=prompt,
+                 return_tensors="pt"
+             ).to(self.device)
+
+             # Generate response with robust parameters
+             with torch.no_grad():
+                 try:
+                     generated_ids = self.model.generate(
+                         **inputs,
+                         max_new_tokens=max_new_tokens,
+                         temperature=max(0.1, min(temperature, 1.0)),  # Clamp temperature
+                         do_sample=do_sample,
+                         top_p=0.9,
+                         repetition_penalty=1.1,
+                         pad_token_id=self.processor.tokenizer.eos_token_id,
+                         eos_token_id=self.processor.tokenizer.eos_token_id,
+                         use_cache=True
+                     )
+                 except RuntimeError as e:
+                     if "probability tensor" in str(e) or "nan" in str(e) or "inf" in str(e):
+                         # Retry with more conservative parameters
+                         logger.warning("Retrying with conservative parameters due to probability tensor error")
+                         generated_ids = self.model.generate(
+                             **inputs,
+                             max_new_tokens=min(max_new_tokens, 256),
+                             temperature=0.3,
+                             do_sample=False,  # Use greedy decoding
+                             pad_token_id=self.processor.tokenizer.eos_token_id,
+                             eos_token_id=self.processor.tokenizer.eos_token_id,
+                             use_cache=True
+                         )
+                     else:
+                         raise
+
+             # Decode only the new tokens (skip input)
+             input_length = inputs['input_ids'].shape[1]
+             new_tokens = generated_ids[0][input_length:]
+
+             generated_text = self.processor.tokenizer.decode(
+                 new_tokens,
+                 skip_special_tokens=True
+             ).strip()
+
+             # Return meaningful response even if empty
+             if not generated_text:
+                 return "I can see the image but cannot generate a specific description."
+
+             return generated_text
+
+         except Exception as e:
+             logger.error(f"❌ Error during generation: {e}")
+             raise
+
+     def analyze_video_frames(
+         self,
+         frames: List[Image.Image],
+         question: str,
+         max_frames: int = 8
+     ) -> str:
+         """
+         Analyze video frames and answer questions
+
+         Args:
+             frames: List of PIL Image frames
+             question: Question about the video
+             max_frames: Maximum number of frames to process
+
+         Returns:
+             Analysis result
+         """
+         # Sample frames if too many
+         if len(frames) > max_frames:
+             step = len(frames) // max_frames
+             sampled_frames = frames[::step][:max_frames]
+         else:
+             sampled_frames = frames
+
+         logger.info(f"Analyzing {len(sampled_frames)} frames")
+
+         # Create a simple prompt for video analysis (don't add image tokens manually)
+         video_prompt = f"These are frames from a video. {question}"
+
+         return self.generate_response(sampled_frames, video_prompt)
+
+     def get_model_info(self) -> dict:
+         """Get information about the loaded model"""
+         return {
+             "model_name": self.model_name,
+             "device": self.device,
+             "model_type": type(self.model).__name__,
+             "processor_type": type(self.processor).__name__,
+             "loaded": self.model is not None and self.processor is not None
+         }
+
+ def test_model():
+     """Test the model with a simple example"""
+     try:
+         # Initialize model
+         vlm = SmolVLM2Handler()
+
+         print("πŸ“‹ Model Info:")
+         info = vlm.get_model_info()
+         for key, value in info.items():
+             print(f"   {key}: {value}")
+
+         # Test with a simple image (create a test image)
+         test_image = Image.new('RGB', (224, 224), color='blue')
+         test_prompt = "What color is this image?"
+
+         print(f"\nπŸ” Testing with prompt: '{test_prompt}'")
+         response = vlm.generate_response(test_image, test_prompt)
+         print(f"πŸ“ Response: {response}")
+
+         print("\nβœ… Model test completed successfully!")
+
+     except Exception as e:
+         print(f"❌ Model test failed: {e}")
+         raise
+
+ if __name__ == "__main__":
+     test_model()
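For reference, this is roughly how the pipeline above drives the handler on a single frame. The sketch is illustrative, not part of the commit; it assumes a local `sample.mp4` (placeholder path) and that the model weights can be downloaded from the Hugging Face Hub on first use.

```python
# Sketch: analyze one video frame with SmolVLM2Handler, mirroring the highlights pipeline.
import cv2
from PIL import Image

from src.smolvlm2_handler import SmolVLM2Handler

handler = SmolVLM2Handler()  # defaults to HuggingFaceTB/SmolVLM2-2.2B-Instruct

cap = cv2.VideoCapture("sample.mp4")
ret, frame = cap.read()
cap.release()

if ret:
    # OpenCV returns BGR arrays; convert to an RGB PIL image for the processor
    pil_frame = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    answer = handler.generate_response(pil_frame, "Describe this frame in one sentence.")
    print(answer)
```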