- DEVELOPER_GUIDE.md +96 -117
- lec2note/__pycache__/types.cpython-310.pyc +0 -0
- lec2note/api/main.py +8 -16
- lec2note/ingestion/__pycache__/audio_extractor.cpython-310.pyc +0 -0
- lec2note/ingestion/__pycache__/audio_extractor.cpython-313.pyc +0 -0
- lec2note/ingestion/__pycache__/whisper_runner.cpython-313.pyc +0 -0
- lec2note/ingestion/audio_extractor.py +4 -2
- lec2note/processing/__pycache__/processor.cpython-310.pyc +0 -0
- lec2note/processing/processor.py +1 -1
- lec2note/scripts/__pycache__/run_pipeline.cpython-310.pyc +0 -0
- lec2note/scripts/__pycache__/run_pipeline.cpython-313.pyc +0 -0
- lec2note/scripts/run_pipeline.py +2 -10
- lec2note/segmentation/__pycache__/chunk_merger.cpython-310.pyc +0 -0
- lec2note/segmentation/__pycache__/semantic_segmenter.cpython-310.pyc +0 -0
- lec2note/segmentation/__pycache__/sentence_chunker.cpython-310.pyc +0 -0
- lec2note/segmentation/chunk_merger.py +68 -0
- lec2note/segmentation/sentence_chunker.py +80 -0
- lec2note/segmentation/visual_merger.py +63 -0
- lec2note/synthesis/__pycache__/assembler.cpython-310.pyc +0 -0
- lec2note/synthesis/assembler.py +3 -6
- lec2note/utils/__pycache__/logging_config.cpython-313.pyc +0 -0
- lec2note/vision/__pycache__/frame_extractor.cpython-310.pyc +0 -0
- lec2note/vision/__pycache__/image_sampler.cpython-310.pyc +0 -0
- lec2note/vision/__pycache__/keyframe_extractor.cpython-310.pyc +0 -0
- lec2note/vision/frame_extractor.py +76 -0
- lec2note/vision/image_comparator.py +58 -0
- lec2note/vision/image_sampler.py +28 -0
- lec2note/vision/keyframe_extractor.py +24 -0
DEVELOPER_GUIDE.md
CHANGED
@@ -1,149 +1,132 @@

*(The previous revision of the guide is replaced wholesale below; only fragments of the removed text survive extraction. Its outline covered the same ground: §4.1 was titled "Hybrid Segmentation"; §4.2 listed per-chunk helpers `extract_keyframes(chunk)`, `run_ocr_on_frames(frames)`, and `extract_and_transcribe_audio(chunk)`; §4.3 used `synchronize_text_and_frames(subtitles, frames)` and `generate_note_chunk(synchronized_data)`; §4.7 showed `subtitles.jsonl` and `chunk_schema.json` excerpts.)*
# Lec2Note Developer Guide

## 1. Project Overview
Lec2Note aims to deliver an end-to-end solution for **"automatically generating notes from video lectures"**.
Using multimodal analysis, it deeply fuses the video frames with the audio content to produce notes that are **richly illustrated and clearly structured**.

### 1.1 Core Pipeline
1. **Subtitle-based fine-grained chunking**: treat every ASR sentence as an independent *micro-chunk* and capture a keyframe precisely at the sentence's end timestamp.
2. **Hierarchical merge strategy**
   - **Stage 1 · Visual pre-merge**: merge consecutive *micro-chunks* based on visual similarity alone.
   - **Stage 2 · Semantic merge**: merge the visually merged chunks further based on textual semantic similarity.
3. **Multimodal information extraction and sampling**
   - **Text**: concatenate all subtitle text contained in the topic chunk.
   - **Images**: sample the topic chunk's keyframes (at most 6).
4. **Per-chunk note generation**: call a multimodal LLM to generate an independent note for each topic chunk.
5. **Global note synthesis**: call the LLM again to consolidate, deduplicate, and polish all chunk notes.

### 1.2 Target Use Cases
- Recorded academic courses
- Meeting / seminar recordings
- Internal corporate training videos

---
## 2. Technology Stack

| Layer | Main technologies | Notes |
|------|----------|------|
| **Language** | Python 3.9+ | Core codebase is Python |
| **Video / image processing** | OpenCV, Pillow | Frame capture, image processing, image similarity (SSIM / pHash) |
| **OCR** | PaddleOCR / Tesseract | Extract slide text from keyframes |
| **ASR** | Whisper / Faster-Whisper | Sentence-level timestamps |
| **Semantic analysis** | Sentence Transformers | Text semantic similarity |
| **LLM** | Gemini-2.5-pro | Multimodal model for note generation |
| **Web framework** | FastAPI | RESTful & WebSocket services |
| **Task orchestration** | Prefect / Celery | Batch processing and retry mechanics |
| **Database** | SQLite (dev) / PostgreSQL (prod) | Metadata and task state |
| **Containers** | Docker & Docker Compose | One-command deployment |

---
## 3. Directory Layout and Modules

```text
Lec2Note/
├── docs/                  # Design docs & meeting notes
├── lec2note/              # Source package (Python)
│   ├── ingestion/         # Audio processing & ASR
│   ├── vision/            # Video frame processing
│   │   ├── frame_extractor.py
│   │   ├── image_comparator.py
│   │   ├── image_sampler.py
│   │   └── ocr_processor.py
│   ├── segmentation/      # Chunking and merging
│   │   ├── sentence_chunker.py
│   │   └── chunk_merger.py
│   ├── processing/        # Multimodal fusion & LLM generation
│   ├── synthesis/         # Global note assembly & export
│   ├── assets/            # Static templates (Markdown/HTML)
│   └── api/               # FastAPI routes
├── scripts/               # CLI scripts & task scheduling
├── tests/                 # PyTest unit & integration tests
├── Dockerfile
├── docker-compose.yml
└── README.md
```

---
## 4. Core Features

### 4.1 Chunking and Merging Strategy
1. **Sentence chunks**: `sentence_chunker.run()` turns the ASR output into *sentence_chunks* of the form `{start, end, text, keyframe_path}`.
2. **Hierarchical merge**: `chunk_merger.run()` first pre-merges visually, then merges semantically, yielding *topic_chunks*.
3. **Keyframe sampling**: `image_sampler.sample()` uniformly samples keyframes, keeping at most 6.
4. **Output**: the result is `final_chunks` carrying text plus representative screenshots, as sketched below.
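A minimal driver sketch of steps 1-4; the input file names are hypothetical:

```python
from lec2note.ingestion.whisper_runner import WhisperRunner
from lec2note.segmentation.chunk_merger import ChunkMerger

# ChunkMerger.run wires all four steps together: sentence chunking,
# visual pre-merge, semantic merge, and keyframe sampling (max 6 images).
subtitles = WhisperRunner.transcribe("lecture.wav")
final_chunks = ChunkMerger.run(subtitles, "lecture.mp4")
for fc in final_chunks:
    print(fc.start, fc.end, [p.name for p in fc.images])
```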
### 4.2 Information Extraction
Run OCR on the sampled screenshots to enrich the textual information.
### 4.3 Text-Image Fusion and Generation
- **Prompt construction**: insert `[IMAGE_n]` placeholders into the text and attach the corresponding image list (see the sketch below).
- **LLM call**: `processor.generate_note_chunk()` uses a multimodal LLM to generate the note in Markdown.
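A sketch of the placeholder scheme; the diff to `processor.py` later in this change shows the concrete token is `[IMG{n}]`, and the message structure here is illustrative (OpenAI-style parts):

```python
subtitle_text = "Welcome to the Lec2Note course. ..."  # concatenated chunk text
image_urls = ["frames/kf_3.20s.png", "frames/kf_15.80s.png"]  # sampled keyframes

# Append one numbered placeholder per image so the LLM can reference
# each frame unambiguously in the generated Markdown.
placeholder_subs = subtitle_text
for idx, _ in enumerate(image_urls, start=1):
    placeholder_subs += f"\n\n[IMG{idx}]"

# Multimodal message payload: one text part plus one image part per placeholder.
content = [{"type": "text", "text": placeholder_subs}]
content += [{"type": "image_url", "image_url": {"url": u}} for u in image_urls]
```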
### 4.4 Note Synthesis
`assembler.merge()` collects the text of every `note_chunk`, builds a new prompt, and calls the LLM to deduplicate, reorder, and polish, producing the complete note document.
### 4.5 API Overview

| Method | Path | Purpose |
|------|------|------|
| `POST` | `/upload` | Upload a video → returns a task ID |
| `GET` | `/status/{id}` | Query task progress |
| `GET` | `/notes/{id}` | Fetch the generated illustrated notes |
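A usage sketch with Python `requests`; the host/port, upload field name, and response field `id` are assumptions — check `lec2note/api/main.py` for the exact schema:

```python
import requests

BASE = "http://localhost:8000"  # assumed dev address

# Upload a lecture video, then poll status and fetch the notes.
with open("lecture.mp4", "rb") as f:
    task_id = requests.post(f"{BASE}/upload", files={"file": f}).json()["id"]

status = requests.get(f"{BASE}/status/{task_id}").json()
notes_md = requests.get(f"{BASE}/notes/{task_id}").text
```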
### 4.6 Internal Module Interfaces

| Module | Key class / method | Input | Output | Notes |
|------|--------------|------|------|------|
| `ingestion.whisper_runner` | `WhisperRunner.transcribe()` | audio path | `[{start, end, text}, …]` | sentence-level ASR result |
| `vision.frame_extractor` | `FrameExtractor.capture_at()` | video path, timestamp list | image path list | precise frame capture |
| `vision.image_comparator` | `ImageComparator.get_similarity()` | two image paths | similarity (0-1) | pHash / SSIM |
| `vision.image_sampler` | `ImageSampler.sample()` | image path list, `max_n` | sampled path list | uniform sampling |
| `segmentation.sentence_chunker` | `SentenceChunker.run()` | subtitle list, video path | *sentence_chunks* | micro-chunk generation |
| `segmentation.chunk_merger` | `ChunkMerger.run()` | subtitle list, video path | `final_chunks` | hierarchical merge |
| `processing.processor` | `Processor.generate_note()` | `final_chunk`, subtitle list | `NoteChunk` | LLM call |
| `synthesis.assembler` | `Assembler.merge()` | `NoteChunk` list | Markdown/HTML | global assembly |
### 4.7 Data Format Example

```json
{
  "start": 0.0,
  "end": 25.4,
  "text": "Welcome to the Lec2Note course. Today we introduce multimodal note generation. First, we will walk through the system's core pipeline...",
  "representative_frames": [
    "frames/kf_3.20s.png",
    "frames/kf_15.80s.png",
    "frames/kf_22.10s.png"
  ]
}
```
### 4.8 Robustness: Error Handling and Retries
- **Task atomicity**: every step is defined as an independent task.
- **Automatic retries**: network/LLM failures are retried with exponential backoff.
- **Failure isolation**: a single failed task does not block the overall pipeline; failures are logged for later inspection. A minimal backoff sketch follows.
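A minimal sketch of the retry policy in plain Python, rather than the Prefect/Celery mechanisms listed in §2:

```python
import random
import time


def call_with_backoff(fn, *, retries=5, base=1.0, cap=30.0):
    """Retry ``fn`` on failure with exponential backoff plus jitter."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:  # noqa: BLE001 - isolate the failing task
            if attempt == retries - 1:
                raise
            # delay doubles each attempt, jittered, capped at ``cap`` seconds
            time.sleep(min(cap, base * 2 ** attempt) * random.uniform(0.5, 1.5))
```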
---

## 5. Development Environment Setup
```bash
# Clone the repository
git clone [email protected]:your_org/lec2note.git

# … (unchanged lines omitted by this diff, ending with: export OPENAI_API_KEY="YOUR_KEY") …

pytest -q
```
### 5.1 Quick Local Pipeline Run
```bash
python -m lec2note.scripts.run_pipeline \
    --video example.mp4 \
    --output notes.md
```
---

## 6. Deployment Guide

### 6.1 Docker Compose
```bash
docker compose up -d --build
```
lec2note/__pycache__/types.cpython-310.pyc
CHANGED
Binary files a/lec2note/__pycache__/types.cpython-310.pyc and b/lec2note/__pycache__/types.cpython-310.pyc differ

lec2note/api/main.py
CHANGED
```diff
@@ -18,7 +18,8 @@ from lec2note.ingestion.audio_extractor import AudioExtractor
 from lec2note.ingestion.whisper_runner import WhisperRunner
 from lec2note.segmentation.visual_segmenter import VisualSegmenter
 from lec2note.segmentation.semantic_segmenter import SemanticSegmenter
-…
+# new merger handles images; no global keyframe extraction
+from lec2note.segmentation.chunk_merger import ChunkMerger
 from lec2note.vision.ocr_processor import OcrProcessor
 from lec2note.processing.processor import Processor
 from lec2note.synthesis.assembler import Assembler
@@ -41,23 +42,14 @@ def _run_pipeline(job_id: str, video_path: Path):
     job["status"] = "asr"
     subtitles = WhisperRunner.transcribe(wav)
 
-    job["status"] = "…
-…
-…
-    job["status"] = "semantic_refine"
-    final_chunks_dict = SemanticSegmenter.refine(slide_chunks, subtitles)
-
-    # attach images to chunks
-    keyframes = KeyframeExtractor.run(video_path)
-    final_chunks: list[FinalChunk] = []
-    for ch in final_chunks_dict:
-        fc = FinalChunk(start=ch["start"], end=ch["end"], images=keyframes)
-        final_chunks.append(fc)
+    job["status"] = "chunk_merging"
+    final_chunks = ChunkMerger.run(subtitles, video_path)
 
     job["status"] = "ocr"
-    # run OCR for all …
-    for …
-…
+    # run OCR for all images in final_chunks
+    # for fc in final_chunks:
+    #     for img in fc.images:
+    #         OcrProcessor.run(img)
 
     job["status"] = "generate_notes"
     note_chunks = [Processor.generate_note(fc, subtitles) for fc in final_chunks]
```
lec2note/ingestion/__pycache__/audio_extractor.cpython-310.pyc
CHANGED
Binary files a/lec2note/ingestion/__pycache__/audio_extractor.cpython-310.pyc and b/lec2note/ingestion/__pycache__/audio_extractor.cpython-310.pyc differ

lec2note/ingestion/__pycache__/audio_extractor.cpython-313.pyc
ADDED
Binary file (3.69 kB).

lec2note/ingestion/__pycache__/whisper_runner.cpython-313.pyc
ADDED
Binary file (2.65 kB).

lec2note/ingestion/audio_extractor.py
CHANGED
```diff
@@ -55,7 +55,7 @@ class AudioExtractor:
         Returns
         -------
         Path
-            Path of the generated …
+            Path of the generated WAV file, named after the input video.
         """
 
         video_path = Path(video_fp).expanduser().resolve()
@@ -65,7 +65,9 @@ class AudioExtractor:
 
         out_dir = Path(output_dir or video_path.parent).expanduser().resolve()
         out_dir.mkdir(parents=True, exist_ok=True)
-…
+
+        # Use the same filename as the video but with .wav extension
+        audio_path = out_dir / f"{video_path.stem}.wav"
 
         # FFmpeg command
         cmd = [
```
CHANGED
|
Binary files a/lec2note/processing/__pycache__/processor.cpython-310.pyc and b/lec2note/processing/__pycache__/processor.cpython-310.pyc differ
|
|
|
lec2note/processing/processor.py
CHANGED
|
@@ -54,7 +54,7 @@ class Processor: # noqa: D101
|
|
| 54 |
# insert numbered placeholders into subtitles for reference
|
| 55 |
placeholder_subs = subtitle_text
|
| 56 |
for idx, _ in enumerate(synced["images"], start=1):
|
| 57 |
-
placeholder_subs += f"\n\n[IMG{idx}]
|
| 58 |
|
| 59 |
# Prompt with explicit mapping guidance
|
| 60 |
prompt_text = (
|
|
|
|
| 54 |
# insert numbered placeholders into subtitles for reference
|
| 55 |
placeholder_subs = subtitle_text
|
| 56 |
for idx, _ in enumerate(synced["images"], start=1):
|
| 57 |
+
placeholder_subs += f"\n\n[IMG{idx}]"
|
| 58 |
|
| 59 |
# Prompt with explicit mapping guidance
|
| 60 |
prompt_text = (
|
lec2note/scripts/__pycache__/run_pipeline.cpython-310.pyc
CHANGED
Binary files a/lec2note/scripts/__pycache__/run_pipeline.cpython-310.pyc and b/lec2note/scripts/__pycache__/run_pipeline.cpython-310.pyc differ

lec2note/scripts/__pycache__/run_pipeline.cpython-313.pyc
ADDED
Binary file (2.56 kB).

lec2note/scripts/run_pipeline.py
CHANGED
```diff
@@ -15,10 +15,9 @@ from lec2note.utils.logging_config import setup_logging
 from lec2note.ingestion.whisper_runner import WhisperRunner
 from lec2note.segmentation.visual_segmenter import VisualSegmenter
 from lec2note.segmentation.semantic_segmenter import SemanticSegmenter
-from lec2note.…
+from lec2note.segmentation.chunk_merger import ChunkMerger
 from lec2note.processing.processor import Processor
 from lec2note.synthesis.assembler import Assembler
-from lec2note.types import FinalChunk
 
 
 def main():  # noqa: D401
@@ -35,14 +34,7 @@ def main():  # noqa: D401
     wav = AudioExtractor.extract(video_path)
     subtitles = WhisperRunner.transcribe(wav)
 
-…
-    final_chunks_dict = SemanticSegmenter.refine(slide_chunks, subtitles)
-
-    keyframes = KeyframeExtractor.run(video_path)
-    final_chunks: list[FinalChunk] = []
-    for ch in final_chunks_dict:
-        fc = FinalChunk(start=ch["start"], end=ch["end"], images=keyframes)
-        final_chunks.append(fc)
+    final_chunks = ChunkMerger.run(subtitles, video_path)
 
     note_chunks = [Processor.generate_note(fc, subtitles) for fc in final_chunks]
     markdown = Assembler.merge(note_chunks)
```
lec2note/segmentation/__pycache__/chunk_merger.cpython-310.pyc
ADDED
Binary file (2.02 kB).

lec2note/segmentation/__pycache__/semantic_segmenter.cpython-310.pyc
CHANGED
Binary files a/lec2note/segmentation/__pycache__/semantic_segmenter.cpython-310.pyc and b/lec2note/segmentation/__pycache__/semantic_segmenter.cpython-310.pyc differ

lec2note/segmentation/__pycache__/sentence_chunker.cpython-310.pyc
ADDED
Binary file (2.03 kB).

lec2note/segmentation/chunk_merger.py
ADDED
@@ -0,0 +1,68 @@
```python
from __future__ import annotations

"""Hierarchical chunk merger implementing the strategy described in DEVELOPER_GUIDE.md.

Steps
-----
1. *Visual pre-merge* – use :pyclass:`~lec2note.segmentation.visual_segmenter.VisualSegmenter`
   to obtain slide-level chunks purely based on keyframe similarity.
2. *Semantic merge* – further merge / split those chunks according to subtitle
   semantic similarity via :pyclass:`~lec2note.segmentation.semantic_segmenter.SemanticSegmenter`.
3. *Image sampling* – collect all keyframes belonging to each final topic chunk
   and uniformly sample at most **6** images using
   :pyclass:`~lec2note.vision.image_sampler.ImageSampler`.

The output is a list of :pyclass:`lec2note.types.FinalChunk` dataclass instances
which are ready for downstream multimodal processing.
"""

# refactored: use VisualMerger instead of VisualSegmenter
import logging
from pathlib import Path
from typing import List, Dict

from lec2note.segmentation.visual_merger import VisualMerger
from lec2note.segmentation.semantic_segmenter import SemanticSegmenter
from lec2note.segmentation.sentence_chunker import SentenceChunker
from lec2note.types import FinalChunk
from lec2note.vision.image_sampler import ImageSampler

logger = logging.getLogger(__name__)

__all__ = ["ChunkMerger"]


class ChunkMerger:  # noqa: D101
    @classmethod
    def run(
        cls,
        subtitles: List[Dict],
        video_fp: str | Path,
    ) -> List[FinalChunk]:
        """Return list of topic-level FinalChunk objects ready for note generation."""
        video_path = Path(video_fp).expanduser().resolve()
        logger.info("[ChunkMerger] start merging pipeline on %s", video_path.name)

        # 1. micro-chunks with keyframes
        micro_chunks = SentenceChunker.run(subtitles, video_path)

        # 2. visual merge (merge micro_chunks by image similarity)
        visual_chunks = VisualMerger.merge(micro_chunks)

        # 3. semantic merge – refine by subtitle semantics
        topic_chunks_dict = SemanticSegmenter.refine(visual_chunks, subtitles)

        # 4. map micro to topic & sample images
        final_chunks: List[FinalChunk] = []
        for ch in topic_chunks_dict:
            imgs = [mc["keyframe_path"] for mc in micro_chunks if ch["start"] <= mc["start"] < ch["end"]]
            imgs_sampled = ImageSampler.sample(imgs, max_n=6)
            fc = FinalChunk(start=ch["start"], end=ch["end"], images=[Path(p) for p in imgs_sampled])
            final_chunks.append(fc)
        logger.info("[ChunkMerger] produced %d final topic chunks", len(final_chunks))
        return final_chunks
```
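Note the half-open interval test `ch["start"] <= mc["start"] < ch["end"]` in step 4: each micro-chunk's keyframe is assigned to exactly one topic chunk, so adjacent topic chunks never share a sampled image.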
lec2note/segmentation/sentence_chunker.py
ADDED
@@ -0,0 +1,80 @@
```python
from __future__ import annotations

"""Generate sentence-level micro-chunks from subtitles with keyframes.

This module takes the subtitle list (each element a dict with ``start``, ``end`` and
``text`` fields) together with the original video path, and produces a list of
micro-chunks. A micro-chunk is a dict containing:

* ``start`` – float seconds
* ``end`` – float seconds
* ``text`` – sentence text
* ``keyframe_path`` – saved image path of the representative frame captured at
  ``end`` timestamp of the sentence. This single frame will later be used by
  image-level integration modules.

The frame capture is delegated to :pyfunc:`lec2note.vision.frame_extractor.FrameExtractor.capture_at`.
"""

import logging
from pathlib import Path
from typing import List, Dict

from lec2note.vision.frame_extractor import FrameExtractor

__all__ = ["SentenceChunker"]

logger = logging.getLogger(__name__)


class SentenceChunker:  # noqa: D101
    @classmethod
    def run(
        cls,
        subtitles: List[Dict],
        video_fp: str | Path,
        *,
        output_dir: str | Path | None = None,
    ) -> List[Dict]:
        """Generate micro-chunks aligned with subtitle sentences.

        Parameters
        ----------
        subtitles
            List of subtitle dicts from ASR with ``start``, ``end``, ``text`` keys.
        video_fp
            Path to input video.
        output_dir
            Directory to store extracted keyframes. If *None*, a ``frames``
            sub-directory next to the video file is used.
        """
        video_path = Path(video_fp).expanduser().resolve()
        if not video_path.exists():
            raise FileNotFoundError(video_path)

        micro_chunks: List[Dict] = []
        timestamps: List[float] = [s["end"] for s in subtitles]
        keyframe_paths = FrameExtractor.capture_at(video_path, timestamps, output_dir=output_dir)
        # ensure same length
        if len(keyframe_paths) != len(subtitles):
            logger.warning(
                "[SentenceChunker] expected %d keyframes but got %d",
                len(subtitles),
                len(keyframe_paths),
            )

        for idx, sub in enumerate(subtitles):
            chunk = {
                "start": sub["start"],
                "end": sub["end"],
                "text": sub["text"],
                "keyframe_path": str(keyframe_paths[idx]) if idx < len(keyframe_paths) else "",
            }
            micro_chunks.append(chunk)
        logger.info("[SentenceChunker] generated %d micro-chunks", len(micro_chunks))
        return micro_chunks
```
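A usage sketch with hypothetical ASR output; the keyframe filename assumes the default `cap` prefix and a 30 fps video (the real index depends on the video's actual FPS):

```python
from lec2note.segmentation.sentence_chunker import SentenceChunker

subtitles = [
    {"start": 0.0, "end": 3.2, "text": "Welcome to the Lec2Note course"},
    {"start": 3.2, "end": 6.7, "text": "Today we introduce multimodal note generation"},
]
micro = SentenceChunker.run(subtitles, "lecture.mp4")
# e.g. [{"start": 0.0, "end": 3.2, "text": "...", "keyframe_path": ".../frames/cap_000096.png"}, ...]
```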
lec2note/segmentation/visual_merger.py
ADDED
@@ -0,0 +1,63 @@
```python
from __future__ import annotations

"""Merge adjacent sentence-level chunks by visual similarity.

Input: *micro_chunks* – list of dicts from SentenceChunker, each with
``start``, ``end``, ``text`` and ``keyframe_path``.

Algorithm:
1. Iterate in temporal order.
2. Compare keyframe of current sentence with keyframe of *buffer* (last kept
   micro chunk of current visual block) using
   :pyfunc:`lec2note.vision.image_comparator.ImageComparator.get_similarity`.
3. If similarity ≥ threshold (default 0.9) → merge (extend ``end`` of buffer),
   else flush buffer to output and start new buffer.
4. After merge, the **only keyframe kept for a visual block is that of the
   *last* sentence**, naturally satisfied because buffer always holds last
   sentence's keyframe.

Return: list of ``{start, end}`` dicts representing visual-level chunks, ready
for semantic refinement.
"""

import logging
from pathlib import Path
from typing import List, Dict

from lec2note.vision.image_comparator import ImageComparator

logger = logging.getLogger(__name__)

__all__ = ["VisualMerger"]


class VisualMerger:  # noqa: D101
    @classmethod
    def merge(
        cls,
        micro_chunks: List[Dict],
        *,
        sim_threshold: float = 0.9,
    ) -> List[Dict]:
        if not micro_chunks:
            return []

        visual_chunks: List[Dict] = []
        buffer = micro_chunks[0].copy()
        for mc in micro_chunks[1:]:
            # compare buffer keyframe (last sentence in current block) with mc keyframe
            try:
                sim = ImageComparator.get_similarity(buffer["keyframe_path"], mc["keyframe_path"])
            except Exception as exc:  # noqa: BLE001
                logger.warning("[VisualMerger] similarity calc failed: %s", exc)
                sim = 0.0  # force split
            if sim >= sim_threshold:
                # merge: extend end and replace keyframe/path to current (last)
                buffer["end"] = mc["end"]
                buffer["keyframe_path"] = mc["keyframe_path"]
            else:
                visual_chunks.append({"start": buffer["start"], "end": buffer["end"]})
                buffer = mc.copy()
        visual_chunks.append({"start": buffer["start"], "end": buffer["end"]})
        logger.info("[VisualMerger] merged %d micro → %d visual chunks", len(micro_chunks), len(visual_chunks))
        return visual_chunks
```
lec2note/synthesis/__pycache__/assembler.cpython-310.pyc
CHANGED
Binary files a/lec2note/synthesis/__pycache__/assembler.cpython-310.pyc and b/lec2note/synthesis/__pycache__/assembler.cpython-310.pyc differ

lec2note/synthesis/assembler.py
CHANGED
```diff
@@ -28,7 +28,6 @@ class Assembler:  # noqa: D101
         raw_md = "\n\n".join(body_parts)
         logger.info("[Assembler] merging %d note chunks", len(chunks))
 
-        # LLM post-polishing: optional, controlled by an environment variable
         logger.info("[Assembler] polishing with LLM…")
         try:
             if not os.getenv("OPENAI_API_KEY"):
@@ -60,11 +59,9 @@ class Assembler:  # noqa: D101
                 }
             ],
         )
-…
-…
-…
-        else:
-            polished = raw_md
+            polished = response.choices[0].message.content.strip()
+        except Exception:  # noqa: BLE001
+            polished = raw_md
 
         logger.info("[Assembler] final document length %d chars", len(polished))
         return TEMPLATE.format(content=polished)
```
lec2note/utils/__pycache__/logging_config.cpython-313.pyc
ADDED
Binary file (1.08 kB).

lec2note/vision/__pycache__/frame_extractor.cpython-310.pyc
ADDED
Binary file (1.96 kB).

lec2note/vision/__pycache__/image_sampler.cpython-310.pyc
ADDED
Binary file (1.24 kB).

lec2note/vision/__pycache__/keyframe_extractor.cpython-310.pyc
CHANGED
Binary files a/lec2note/vision/__pycache__/keyframe_extractor.cpython-310.pyc and b/lec2note/vision/__pycache__/keyframe_extractor.cpython-310.pyc differ

lec2note/vision/frame_extractor.py
ADDED
@@ -0,0 +1,76 @@
```python
from __future__ import annotations

"""Lightweight frame extractor utility used by SentenceChunker.

This wrapper around OpenCV provides a single classmethod ``capture_at`` which
accepts a list of timestamps and saves each captured frame as PNG into the
specified directory. Returned value is the list of saved ``Path`` objects in
exactly the same order as input timestamps.

Unlike :pyfunc:`lec2note.vision.keyframe_extractor.KeyframeExtractor` which
searches the whole video to locate slide changes, this extractor is precise and
only grabs frames at given times; therefore it is computationally cheaper.
"""

import logging
from pathlib import Path
from typing import List

import cv2  # type: ignore

__all__ = ["FrameExtractor"]

logger = logging.getLogger(__name__)


class FrameExtractor:  # noqa: D101
    @classmethod
    def capture_at(
        cls,
        video_fp: str | Path,
        timestamps: List[float],
        *,
        output_dir: str | Path | None = None,
        image_prefix: str = "cap",
    ) -> List[Path]:
        """Capture video frames at given timestamps.

        Parameters
        ----------
        video_fp
            Input video path.
        timestamps
            Seconds (float) where frames should be captured.
        output_dir
            Directory to store PNG images; default to ``frames`` next to video.
        image_prefix
            Prefix of output filenames.
        """
        video_path = Path(video_fp).expanduser().resolve()
        if not video_path.exists():
            raise FileNotFoundError(video_path)

        save_dir = Path(output_dir or video_path.parent / "frames").resolve()
        save_dir.mkdir(parents=True, exist_ok=True)

        cap = cv2.VideoCapture(str(video_path))
        fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
        saved: List[Path] = []
        for ts in timestamps:
            frame_idx = int(ts * fps)
            cap.set(cv2.CAP_PROP_POS_FRAMES, frame_idx)
            success, frame = cap.read()
            if not success:
                logger.warning("[FrameExtractor] failed reading frame at %.2fs", ts)
                continue
            out_fp = save_dir / f"{image_prefix}_{frame_idx:06d}.png"
            cv2.imwrite(str(out_fp), frame)
            saved.append(out_fp)
        cap.release()
        logger.info("[FrameExtractor] captured %d frames", len(saved))
        return saved
```
lec2note/vision/image_comparator.py
ADDED
@@ -0,0 +1,58 @@
```python
from __future__ import annotations

"""Compute similarity between two images.

Two complementary metrics are provided:

* **SSIM** – structural similarity on grayscale images.
* **dHash distance** – perceptual hash Hamming distance.

The public API exposes a single :pyfunc:`ImageComparator.get_similarity` method
returning a float in [0, 1] where **1.0** means identical slides and **0.0**
means completely different. Internally a simple weighted combination of SSIM
and inverted-normalised dHash distance is used.
"""

from pathlib import Path
from typing import Tuple

import cv2  # type: ignore
import imagehash  # type: ignore
from PIL import Image
from skimage.metrics import structural_similarity as ssim  # type: ignore

__all__ = ["ImageComparator"]


class ImageComparator:  # noqa: D101
    @staticmethod
    def _load_grayscale(fp: Path):
        img = cv2.imread(str(fp))
        if img is None:
            raise FileNotFoundError(fp)
        return cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    @classmethod
    def _ssim(cls, fp1: Path, fp2: Path) -> float:
        g1, g2 = cls._load_grayscale(fp1), cls._load_grayscale(fp2)
        score = ssim(g1, g2)
        return float(score)

    @staticmethod
    def _dhash_dist(fp1: Path, fp2: Path) -> int:
        h1, h2 = imagehash.dhash(Image.open(fp1)), imagehash.dhash(Image.open(fp2))
        return h1 - h2  # type: ignore[return-value]

    @classmethod
    def get_similarity(cls, fp1: str | Path, fp2: str | Path) -> float:
        """Return similarity in range [0, 1]. Higher is more similar."""
        p1, p2 = Path(fp1).expanduser().resolve(), Path(fp2).expanduser().resolve()
        ssim_val = cls._ssim(p1, p2)
        dh_dist = cls._dhash_dist(p1, p2)
        dh_norm = max(0.0, 1.0 - dh_dist / 64)  # 64-bit hash
        return 0.7 * ssim_val + 0.3 * dh_norm
```
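For intuition, take hypothetical values SSIM = 0.95 and a dHash Hamming distance of 4: the combined score is 0.7 × 0.95 + 0.3 × (1 − 4/64) = 0.665 + 0.28125 ≈ 0.946, just above `VisualMerger`'s default 0.9 threshold, so the two frames would be merged into the same visual block.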
lec2note/vision/image_sampler.py
ADDED
@@ -0,0 +1,28 @@
```python
from __future__ import annotations

"""Utility to uniformly sample a subset of images from a list.

Used to limit the number of representative keyframes per topic chunk to a small
constant (default 6) for efficient downstream multi-modal prompting.
"""

from pathlib import Path
from typing import List

__all__ = ["ImageSampler"]


class ImageSampler:  # noqa: D101
    @staticmethod
    def sample(paths: List[str | Path], max_n: int = 6) -> List[str]:
        """Return (at most) *max_n* paths evenly sampled from *paths* list."""
        if len(paths) <= max_n:
            return [str(Path(p)) for p in paths]
        step = len(paths) / max_n
        idxs = [int(i * step) for i in range(max_n)]
        return [str(Path(paths[i])) for i in idxs]
```
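Worked example: with 10 input paths and `max_n=6`, `step = 10/6 ≈ 1.67` and the selected indices are `[0, 1, 3, 5, 6, 8]` — evenly spread, always including the first frame but not necessarily the last.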
lec2note/vision/keyframe_extractor.py
CHANGED
```diff
@@ -15,6 +15,23 @@ import logging
 from pathlib import Path
 from typing import List
 
+# optional progress bar
+try:
+    from tqdm.auto import tqdm  # type: ignore
+except ImportError:  # pragma: no cover
+    def tqdm(iterable=None, **kwargs):  # type: ignore
+        """Fallback tqdm when the package is not installed."""
+        if iterable is None:
+            class _Dummy:  # noqa: D401
+                def update(self, n=1):
+                    pass
+
+                def close(self):
+                    pass
+
+            return _Dummy()
+        return iterable
+
 __all__ = ["KeyframeExtractor"]
 
 
@@ -67,6 +84,10 @@ class KeyframeExtractor:
         frame_idx = 0
         saved_paths: List[Path] = []
 
+        # progress bar setup
+        total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT)) or None
+        pbar = tqdm(total=total_frames, desc="[KeyframeExtractor] extracting", unit="frame")
+
         while True:
             success, frame = cap.read()
             if not success:
@@ -79,6 +100,9 @@ class KeyframeExtractor:
             prev_frame = frame
 
             frame_idx += 1
+            pbar.update(1)
+
+        pbar.close()
         logging.getLogger(__name__).info("[KeyframeExtractor] saved %d keyframes to %s", len(saved_paths), save_dir)
 
         cap.release()
```