LRU1 committed on
Commit
429e139
·
0 Parent(s):

visual seg -> semantic seg

.env ADDED
@@ -0,0 +1,3 @@
1
+ export OPENAI_API_KEY="YOUR_OPENAI_API_KEY"
2
+ export OPENAI_API_BASE="https://openrouter.ai/api/v1"
3
+ export LOG_LEVEL=DEBUG
.gitignore ADDED
@@ -0,0 +1,2 @@
1
+ /data/*
2
+ /test/*
DEVELOPER_GUIDE.md ADDED
@@ -0,0 +1,183 @@
1
+ # Lec2Note Developer Guide (DEVELOPER GUIDE)
+
+ ## 1. Project Overview
+
+ Lec2Note aims to provide an end-to-end solution for automatically turning video lectures into notes. Using multimodal analysis, it fuses the video frames with the audio content to produce well-structured notes that combine text and images.
+
+ ### Core Pipeline
+
+ 1. **Hybrid Segmentation**: a two-level strategy. Coarse chunks are first derived from slide transitions to establish the overall structure; the transcript within each chunk is then analysed semantically and split or merged again, so that every final chunk is logically complete and self-contained.
+ 2. **Multimodal Information Extraction**
+    - Video stream: extract keyframe images within each chunk (covering animations, annotations and other dynamic content) and run OCR on them.
+    - Audio stream: extract and transcribe (ASR) the audio for the corresponding time range, producing a time-stamped transcript.
+ 3. **Synchronization**: associate the transcript with the matching keyframes, building the context of "this text describes that image".
+ 4. **Multimodal Generation**: feed the aligned text-image blocks into a multimodal large language model (LLM) to generate notes with section headings, key-point summaries and core screenshots.
+ 5. **Structured Output**: aggregate everything and export structured Markdown/HTML/Notion pages.
+
+ ### Target Use Cases
+
+ - Recorded academic courses
+ - Conference / seminar recordings
+ - Internal corporate training videos
22
+
23
+ ## 2. Tech Stack
+
+ | Layer | Main technologies | Notes |
+ | ---------------- | -------------------------------- | ---------------------------------- |
+ | Language | Python 3.9+ | Core codebase is implemented in Python |
+ | Video / image processing | OpenCV, Pillow | Keyframe extraction, frame-change detection, image processing |
+ | OCR | PaddleOCR / Tesseract | Extracts slide text from keyframe images |
+ | ASR | Whisper / Faster-Whisper | High-accuracy speech recognition, optional GPU acceleration |
+ | LLM | OpenAI GPT-4V / LLaVA | Multimodal models that read text + images and write the notes |
+ | Web framework | FastAPI | RESTful & WebSocket services |
+ | Task orchestration | Prefect / Celery | Batch processing with retry support |
+ | Database | SQLite (dev) / PostgreSQL (prod) | Stores metadata and job state |
+ | Containers | Docker & Docker Compose | One-command deployment |
37
+ ## 3. 目录结构与模块划分
38
+
39
+ ```text
40
+ Lec2Note/
41
+ ├── docs/ # 设计文档 & 会议记录
42
+ ├── lec2note/ # 源码包 (Python)
43
+ │ ├── ingestion/ # 音频处理 & ASR
44
+ │ │ ├── audio_extractor.py
45
+ │ │ └── whisper_runner.py
46
+ │ ├── vision/ # 视频画面处理模块
47
+ │ │ ├── keyframe_extractor.py
48
+ │ │ └── ocr_processor.py
49
+ │ ├── segmentation/ # 【更新】混合式分块模块
50
+ │ │ ├── visual_segmenter.py
51
+ │ │ └── semantic_segmenter.py
52
+ │ ├── processing/ # 多模态信息融合与 LLM 生成
53
+ │ ├── synthesis/ # 全局笔记整合与导出
54
+ │ ├── assets/ # 静态模板 (Markdown/HTML)
55
+ │ └── api/ # FastAPI 路由
56
+ ├── scripts/ # CLI 脚本 & 任务调度
57
+ ├── tests/ # PyTest 单元与集成测试
58
+ ├── Dockerfile
59
+ ├── docker-compose.yml
60
+ └── README.md
61
+ ```
62
+
63
+ ## 4. Core Features
+
+ ### 4.1 Hybrid Segmentation
+
+ This is a two-stage process designed to produce logically coherent content chunks (a minimal sketch follows below):
+
+ - **Coarse visual segmentation**: call `visual_segmenter.run(video_fp)`, which uses OpenCV to analyse inter-frame differences, locate the exact timestamps of slide transitions, and produce preliminary `slide_chunks`.
+ - **Semantic refinement**: call `semantic_segmenter.refine(slide_chunks, subtitles)` on the previous result:
+   - Split: if a chunk is too long, split it into smaller, more focused `sub_chunks` wherever the semantic similarity of its ASR text changes.
+   - Merge: if several consecutive chunks are too short and semantically related, merge them into one logical unit.
+ - **Output**: a final list of optimised, logically independent `final_chunks`.
74
+
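+ A minimal sketch of how the two stages are expected to chain together, based on the signatures listed in section 4.6 (`subtitles` is the time-stamped transcript returned by `WhisperRunner.transcribe`; the variable names are illustrative):
+
+ ```python
+ from lec2note.segmentation.visual_segmenter import VisualSegmenter
+ from lec2note.segmentation.semantic_segmenter import SemanticSegmenter
+
+ # Stage 1: coarse chunks at slide transitions -> [{"start": 0.0, "end": 42.3}, ...]
+ slide_chunks = VisualSegmenter.run("example.mp4")
+
+ # Stage 2: split/merge the coarse chunks using the ASR transcript
+ final_chunks = SemanticSegmenter.refine(slide_chunks, subtitles)
+ ```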
75
+ ### 4.2 Information Extraction
+
+ For each `final_chunk`:
+
+ ```python
+ extract_keyframes(chunk) # extract keyframes
+ run_ocr_on_frames(frames) # run OCR on the keyframes
+ extract_and_transcribe_audio(chunk) # ASR transcription
83
+ ```
84
+
85
+ ### 4.3 Text-Image Fusion & Generation (Processing)
+
+ ```python
+ synchronize_text_and_frames(subtitles, frames) # align subtitles with keyframes
+ generate_note_chunk(synchronized_data) # generate the note chunk with the LLM
90
+ ```
91
+
92
+ ### 4.4 Note Synthesis
+
+ - Aggregate all `note_chunk`s and have the LLM summarize and polish them into a complete set of notes (written in the same language as the lecture).
+ - Export the result as the final Markdown file (see the usage sketch below).
96
+
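+ A minimal usage sketch of the synthesis step, assuming `note_chunks` is the list produced by `Processor.generate_note` (see `lec2note/synthesis/assembler.py` for the actual implementation):
+
+ ```python
+ from lec2note.synthesis.assembler import Assembler
+
+ markdown = Assembler.merge(note_chunks)  # concatenate chunks and optionally polish via the LLM
+ Assembler.save(markdown, "notes.md")     # write the final document to disk
+ ```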
97
+ ### 4.5 API
98
+
99
+ | Method & Path | Description |
+ | ------------------ | ------------------------------ |
+ | `POST /upload` | Upload a video → returns a job ID |
+ | `GET /status/{id}` | Query job progress (e.g. "visual_segmentation") |
+ | `GET /notes/{id}` | Fetch the generated notes (text + images) |
104
+
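+ A quick way to exercise these endpoints from Python (a sketch, assuming the API is served locally on port 8000; `httpx` is already in `requirements.txt`):
+
+ ```python
+ import httpx
+
+ base = "http://localhost:8000"
+
+ # upload the lecture video and get a job ID back
+ with open("example.mp4", "rb") as f:
+     files = {"video": ("example.mp4", f, "video/mp4")}
+     job_id = httpx.post(f"{base}/upload", files=files, timeout=None).json()["id"]
+
+ # poll the job status until it reports "completed"
+ print(httpx.get(f"{base}/status/{job_id}").json())
+
+ # download the generated Markdown notes
+ open("notes.md", "wb").write(httpx.get(f"{base}/notes/{job_id}").content)
+ ```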
105
+ ### 4.6 Internal Module Interfaces
+
+ | Module | Key class / method | Input | Output | Notes |
+ | --------------------------------- | ---------------------------------------------------------------------------- | ----------------- | ----------------------------------------- | --------------------------------------------------- |
+ | `ingestion.audio_extractor` | `AudioExtractor.extract(video_fp: str) -> Path` | Video file path | Path to `audio.wav` | Splits the audio track with FFmpeg and normalizes it to 16 kHz mono |
+ | `ingestion.whisper_runner` | `WhisperRunner.transcribe(audio_fp: Path, lang: str = "zh") -> List[Dict]` | Audio file path | `[{"start":0.0,"end":3.2,"text":"…"}, …]` | Returns a time-stamped subtitle list (JSON-serialized to `.jsonl`) |
+ | `vision.keyframe_extractor` | `KeyframeExtractor.run(video_fp: str, threshold: float = 0.6) -> List[Path]` | Video file path | List of keyframe image paths | A frame whose similarity to the previous one drops below the threshold is treated as a new slide |
+ | `vision.ocr_processor` | `OcrProcessor.run(img_fp: Path, lang: str = "ch") -> str` | Image path | Text found in the image | Uses PaddleOCR; GPU is auto-detected |
+ | `segmentation.visual_segmenter` | `VisualSegmenter.run(video_fp) -> List[Dict]` | Video file path | `slide_chunks` | Returns a coarse list of `{start,end}` segments |
+ | `segmentation.semantic_segmenter` | `SemanticSegmenter.refine(slide_chunks, subtitles) -> List[Dict]` | Coarse segments & subtitle list | `final_chunks` | Splits/merges again based on textual semantic similarity |
+ | `processing.processor` | `Processor.generate_note(chunk, subtitles) -> NoteChunk` | `final_chunk` & subtitle list | `NoteChunk(note:str,images:List[str])` | Calls the LLM to generate the note for a single chunk |
+ | `synthesis.assembler` | `Assembler.merge(chunks: List[NoteChunk]) -> str` | List of NoteChunk | Markdown/HTML string | Assembles the full document and fills the template |
117
+
118
+ ### 4.7 Data Format Examples
119
+
120
+ ```jsonc
121
+ // subtitles.jsonl (excerpt)
122
+ {"start": 0.0, "end": 3.2, "text": "欢迎来到 Lec2Note 课程"}
123
+ {"start": 3.2, "end": 6.7, "text": "今天我们介绍多模态笔记生成"}
124
+ ```
125
+
126
+ ```jsonc
127
+ // chunk_schema.json (excerpt)
128
+ {
129
+ "id": 1,
130
+ "start": 0.0,
131
+ "end": 120.5,
132
+ "images": ["kf_0001.png", "kf_0002.png"],
133
+ "subtitles": [0, 1, 2, 3]
134
+ }
135
+ ```
136
+
137
+ ```jsonc
138
+ // note_chunk (sample output of Processor.generate_note)
139
+ {
140
+ "note": "### 多模态分析简介\n- 本节介绍了……",
141
+ "images": ["kf_0001.png"]
142
+ }
143
+ ```
144
+
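+ The subtitle file is plain JSON Lines, so it can be loaded with the standard library (a sketch; the file path is illustrative):
+
+ ```python
+ import json
+
+ with open("subtitles.jsonl", encoding="utf-8") as f:
+     subtitles = [json.loads(line) for line in f if line.strip()]
+ # -> [{"start": 0.0, "end": 3.2, "text": "..."}, ...]
+ ```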
145
+ ## 5. Development Environment Setup
+
+ ```bash
+ # Clone the repository
+ git clone [email protected]:your_org/lec2note.git
+ cd lec2note
+
+ # (Optional) create a virtual environment
+ python -m venv .venv && source .venv/bin/activate
+
+ # Install dependencies
+ pip install -r requirements.txt
+
+ # Set environment variables
+ export OPENAI_API_KEY="YOUR_KEY"
+
+ # Run the unit tests
+ pytest -q
+ ```
+
+ ### Quick local pipeline run
166
+
167
+ ```bash
168
+ python -m lec2note.scripts.run_pipeline \
169
+ --video example.mp4 \
170
+ --output notes.md
171
+ ```
172
+
173
+ ## 6. Deployment Guide
174
+
175
+ ### 6.1 Docker Compose
176
+
177
+ ```bash
178
+ docker compose up -d --build
179
+ ```
180
+
181
+
182
+
183
+
README.md ADDED
@@ -0,0 +1,44 @@
1
+ # Lec2Note2
2
+
3
+ **Lec2Note2** is an end-to-end solution for automatically generating notes from video lectures. It builds on multimodal analysis, fusing video keyframes, OCR text and ASR subtitles, and uses a large language model to produce structured notes.
4
+
5
+ ## Features
6
+
7
+ - Hybrid segmentation (visual + semantic)
+ - Multimodal information extraction: keyframes, OCR, ASR
+ - Text-image synchronization and LLM-based note generation
+ - Asynchronous job API via FastAPI
+ - One-command deployment with Docker
12
+
13
+ ## Quickstart
14
+
15
+ ```bash
16
+ # Clone the repository
+ git clone <repo-url> Lec2Note2 && cd Lec2Note2
+
+ # (Optional) create a virtual environment
+ python -m venv .venv && source .venv/bin/activate
+
+ # Install dependencies
+ pip install -r requirements.txt
+
+ # Run the unit tests
+ pytest -q
+
+ # Run the local pipeline
29
+ python -m lec2note.scripts.run_pipeline \
30
+ --video example.mp4 \
31
+ --output notes.md
32
+ ```
33
+
34
+ ## API
35
+
36
+ | Method | Path | Description |
37
+ | ------ | ------------ | --------------------- |
38
+ | POST | /upload | Upload a video, returns a job ID |
+ | GET | /status/{id} | Query job progress |
+ | GET | /notes/{id} | Fetch the generated notes |
41
+
42
+ ## License
43
+
44
+ MIT License
lec2note/__pycache__/types.cpython-310.pyc ADDED
Binary file (1.12 kB). View file
 
lec2note/api/main.py ADDED
@@ -0,0 +1,109 @@
1
+ """FastAPI interface for Lec2Note2 pipeline."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import uuid
6
+ import shutil
7
+ from pathlib import Path
8
+ from typing import Dict
9
+ import threading
10
+ import logging
11
+
12
+ logger = logging.getLogger(__name__)
13
+
14
+ from fastapi import FastAPI, UploadFile, File, HTTPException
15
+ from fastapi.responses import JSONResponse, FileResponse
16
+
17
+ from lec2note.ingestion.audio_extractor import AudioExtractor
18
+ from lec2note.ingestion.whisper_runner import WhisperRunner
19
+ from lec2note.segmentation.visual_segmenter import VisualSegmenter
20
+ from lec2note.segmentation.semantic_segmenter import SemanticSegmenter
21
+ from lec2note.vision.keyframe_extractor import KeyframeExtractor
22
+ from lec2note.vision.ocr_processor import OcrProcessor
23
+ from lec2note.processing.processor import Processor
24
+ from lec2note.synthesis.assembler import Assembler
25
+ from lec2note.types import FinalChunk
26
+
27
+ app = FastAPI(title="Lec2Note2 API")
28
+
29
+ DATA_ROOT = Path("/tmp/lec2note_jobs")
30
+ DATA_ROOT.mkdir(parents=True, exist_ok=True)
31
+
32
+ _jobs: Dict[str, Dict] = {}
33
+
34
+
35
+ def _run_pipeline(job_id: str, video_path: Path):
36
+ job = _jobs[job_id]
37
+ try:
38
+ job["status"] = "extract_audio"
39
+ wav = AudioExtractor.extract(video_path)
40
+
41
+ job["status"] = "asr"
42
+ subtitles = WhisperRunner.transcribe(wav)
43
+
44
+ job["status"] = "visual_segmentation"
45
+ slide_chunks = VisualSegmenter.run(video_path)
46
+
47
+ job["status"] = "semantic_refine"
48
+ final_chunks_dict = SemanticSegmenter.refine(slide_chunks, subtitles)
49
+
50
+ # attach images to chunks
51
+ keyframes = KeyframeExtractor.run(video_path)
52
+ final_chunks: list[FinalChunk] = []
53
+ for ch in final_chunks_dict:
54
+ fc = FinalChunk(start=ch["start"], end=ch["end"], images=keyframes)
55
+ final_chunks.append(fc)
56
+
57
+ job["status"] = "ocr"
58
+ # run OCR for all keyframes (simplified)
59
+ for img in keyframes:
60
+ OcrProcessor.run(img)
61
+
62
+ job["status"] = "generate_notes"
63
+ note_chunks = [Processor.generate_note(fc, subtitles) for fc in final_chunks]
64
+
65
+ job["status"] = "synthesis"
66
+ md = Assembler.merge(note_chunks)
67
+ output_fp = DATA_ROOT / f"{job_id}.md"
68
+ Assembler.save(md, output_fp)
69
+ job["output"] = output_fp
70
+ job["status"] = "completed"
71
+ logger.info("[API] job %s completed -> %s", job_id, output_fp)
72
+ except Exception as exc: # noqa: BLE001
73
+ job["status"] = f"error: {exc}"
74
+
75
+
76
+ @app.post("/upload")
77
+ async def upload_video(video: UploadFile = File(...)) -> JSONResponse: # noqa: D401
+ """Upload a video and start background processing."""
+ if video.content_type not in {"video/mp4", "video/x-matroska", "video/mkv", "video/x-msvideo", "video/avi"}:
80
+ raise HTTPException(status_code=400, detail="Unsupported video format")
81
+ job_id = str(uuid.uuid4())
82
+ logger.info("[API] received upload %s (size ≈ %.1f MB)", video.filename, video.size/1e6)
83
+ job_dir = DATA_ROOT / job_id
84
+ job_dir.mkdir(parents=True, exist_ok=True)
85
+ video_path = job_dir / video.filename
86
+ with video_path.open("wb") as f:
87
+ shutil.copyfileobj(video.file, f)
88
+
89
+ _jobs[job_id] = {"status": "queued"}
90
+ logger.info("[API] job %s queued", job_id)
91
+ threading.Thread(target=_run_pipeline, args=(job_id, video_path), daemon=True).start()
92
+ return JSONResponse({"id": job_id})
93
+
94
+
95
+ @app.get("/status/{job_id}")
96
+ def get_status(job_id: str): # noqa: D401
97
+ if job_id not in _jobs:
98
+ raise HTTPException(status_code=404, detail="Job not found")
99
+ return JSONResponse({"status": _jobs[job_id]["status"]})
100
+
101
+
102
+ @app.get("/notes/{job_id}")
103
+ def get_notes(job_id: str): # noqa: D401
104
+ if job_id not in _jobs:
105
+ raise HTTPException(status_code=404, detail="Job not found")
106
+ job = _jobs[job_id]
107
+ if job.get("status") != "completed":
108
+ raise HTTPException(status_code=400, detail="Job not completed")
109
+ return FileResponse(job["output"], media_type="text/markdown", filename="notes.md")
lec2note/ingestion/__pycache__/audio_extractor.cpython-310.pyc ADDED
Binary file (2.77 kB). View file
 
lec2note/ingestion/__pycache__/audio_extractor.cpython-312.pyc ADDED
Binary file (3.37 kB). View file
 
lec2note/ingestion/__pycache__/whisper_runner.cpython-310.pyc ADDED
Binary file (1.94 kB). View file
 
lec2note/ingestion/__pycache__/whisper_runner.cpython-312.pyc ADDED
Binary file (2.17 kB). View file
 
lec2note/ingestion/audio_extractor.py ADDED
@@ -0,0 +1,93 @@
1
+ """Audio extraction utility.
2
+
3
+ This module provides `AudioExtractor` which wraps FFmpeg to extract the audio
4
+ track from a video lecture and convert it to a mono 16-kHz WAV file. The output
5
+ file is deterministic, making downstream ASR reproducible.
6
+
7
+ Example
8
+ -------
9
+ >>> from pathlib import Path
10
+ >>> from lec2note.ingestion.audio_extractor import AudioExtractor
11
+ >>> wav = AudioExtractor.extract("example.mp4")
12
+ >>> assert Path(wav).exists()
13
+ """
14
+
15
+ from __future__ import annotations
16
+
17
+ import subprocess
18
+ import logging
19
+ from pathlib import Path
20
+ from typing import Optional
21
+
22
+ logger = logging.getLogger(__name__)
23
+
24
+ __all__ = ["AudioExtractor"]
25
+
26
+
27
+ class AudioExtractor:
+ """Extract and normalize the audio track.
+
+ Attributes
+ ----------
+ sample_rate : int
+ Target sample rate, 16 kHz by default.
+ channels : int
+ Number of channels, 1 (mono) by default.
+ codec : str
+ Output codec, fixed to pcm_s16le.
38
+ """
39
+
40
+ sample_rate: int = 16_000
41
+ channels: int = 1
42
+ codec: str = "pcm_s16le"
43
+
44
+ @classmethod
45
+ def extract(cls, video_fp: str | Path, output_dir: Optional[str | Path] = None) -> Path:
+ """Extract the audio track from a video file and convert it to WAV.
+
+ Parameters
+ ----------
+ video_fp : str | Path
+ Path to the input video file.
+ output_dir : str | Path, optional
+ Output directory. If ``None``, the video's own directory is used.
+
+ Returns
+ -------
+ Path
+ Path to the generated ``audio.wav``.
59
+ """
60
+
61
+ video_path = Path(video_fp).expanduser().resolve()
62
+ logger.info("[AudioExtractor] extracting audio from %s", video_path)
63
+ if not video_path.exists():
64
+ raise FileNotFoundError(video_path)
65
+
66
+ out_dir = Path(output_dir or video_path.parent).expanduser().resolve()
67
+ out_dir.mkdir(parents=True, exist_ok=True)
68
+ audio_path = out_dir / "audio.wav"
69
+
70
+ # FFmpeg command
71
+ cmd = [
72
+ "ffmpeg",
73
+ "-y", # overwrite
74
+ "-i",
75
+ str(video_path),
76
+ "-ac",
77
+ str(cls.channels),
78
+ "-ar",
79
+ str(cls.sample_rate),
80
+ "-vn", # no video
81
+ "-acodec",
82
+ cls.codec,
83
+ str(audio_path),
84
+ ]
85
+
86
+ try:
87
+ subprocess.run(cmd, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
88
+ logger.info("[AudioExtractor] saved wav to %s", audio_path)
89
+ except (subprocess.CalledProcessError, FileNotFoundError) as err:
90
+ msg = "FFmpeg failed; make sure FFmpeg is installed and available on PATH."
91
+ raise RuntimeError(msg) from err
92
+
93
+ return audio_path
lec2note/ingestion/whisper_runner.py ADDED
@@ -0,0 +1,50 @@
1
+ """Thin wrapper around OpenAI Whisper for ASR transcription."""
2
+
3
+ from __future__ import annotations
4
+ from pathlib import Path
5
+ import logging
6
+
7
+ logger = logging.getLogger(__name__)
8
+
9
+ from typing import List, Dict, Optional, Any
10
+
11
+ import torch
12
+ from whisper import load_model # type: ignore
13
+
14
+ __all__ = ["WhisperRunner"]
15
+
16
+
17
+ class WhisperRunner: # noqa: D101
18
+ model_name: str = "base"
19
+
20
+ @classmethod
21
+ def transcribe(cls, audio_fp: str | Path, lang: str = "zh") -> List[Dict[str, Any]]:
22
+ """Transcribe ``audio_fp`` and return list with start/end/text.
23
+
24
+ Notes
25
+ -----
26
+ - Automatically selects GPU if available.
27
+ - The function is *blocking* and can be called inside a Prefect task.
28
+ """
29
+ audio_path = Path(audio_fp).expanduser().resolve()
30
+ if not audio_path.exists():
31
+ raise FileNotFoundError(audio_path)
32
+
33
+ device = "cuda" if torch.cuda.is_available() else "cpu"
34
+ logger.info("[Whisper] loading model %s on %s", cls.model_name, device)
35
+ model = load_model(cls.model_name, device=device)
36
+
37
+ logger.info("[Whisper] transcribing %s", audio_path.name)
38
+ result = model.transcribe(str(audio_path), language=lang)
39
+ segments = result.get("segments", [])
40
+
41
+ # convert to our schema
42
+ logger.info("[Whisper] got %d segments", len(segments))
43
+ return [
44
+ {
45
+ "start": round(seg["start"], 2),
46
+ "end": round(seg["end"], 2),
47
+ "text": seg["text"].strip(),
48
+ }
49
+ for seg in segments
50
+ ]
lec2note/processing/__pycache__/processor.cpython-310.pyc ADDED
Binary file (4.92 kB). View file
 
lec2note/processing/processor.py ADDED
@@ -0,0 +1,97 @@
1
+ """Processing pipeline: synchronize subtitles & images, generate note chunk via LLM."""
2
+
3
+ from __future__ import annotations
4
+
5
+ from typing import List, Dict, Any
6
+ import base64, mimetypes
7
+ from pathlib import Path
8
+ import os
9
+ import logging
10
+
11
+ logger = logging.getLogger(__name__)
12
+
13
+ from openai import OpenAI
14
+ from tenacity import retry, stop_after_attempt, wait_fixed # robust retry
15
+
16
+ from lec2note.types import FinalChunk, NoteChunk
17
+
18
+ __all__ = ["Processor"]
19
+
20
+
21
+ class Processor: # noqa: D101
22
+ model_name: str = os.getenv("OPENAI_MODEL", "google/gemini-2.5-pro")
23
+
24
+ @staticmethod
25
+ def _img_to_data_uri(img_path: Path) -> str:
26
+ mime, _ = mimetypes.guess_type(img_path)
27
+ b64 = base64.b64encode(img_path.read_bytes()).decode()
28
+ return f"data:{mime};base64,{b64}"
29
+
30
+ @staticmethod
31
+ @retry(stop=stop_after_attempt(3), wait=wait_fixed(2))
32
+ def _call_llm(messages: List[Dict[str, Any]]) -> str:
33
+ if not os.getenv("OPENAI_API_KEY"):
34
+ raise EnvironmentError("OPENAI_API_KEY not set")
35
+ client = OpenAI(
36
+ base_url=os.getenv("OPENAI_API_BASE"),
37
+ api_key=os.getenv("OPENAI_API_KEY"),
38
+ )
39
+ response = client.chat.completions.create(
40
+ model=Processor.model_name,
41
+ temperature=0.2,
42
+ messages=messages,
43
+ extra_headers={"X-Title": "Lec2Note2"},
44
+ )
45
+ note = response.choices[0].message.content.strip()
46
+ logger.debug("[Processor] LLM returned %d chars", len(note))
47
+ return note
48
+
49
+ @classmethod
51
+ def _build_messages(cls, synced: Dict[str, Any]) -> List[Dict[str, Any]]:
52
+ subtitle_text = " ".join(synced["text"])
53
+
54
+ # insert numbered placeholders into subtitles for reference
55
+ placeholder_subs = subtitle_text
56
+ for idx, _ in enumerate(synced["images"], start=1):
57
+ placeholder_subs += f"\n\n[IMG{idx}] ← 与下方第 {idx} 张图片对应"
58
+
59
+ # Prompt with explicit mapping guidance
60
+ prompt_text = (
61
+ "**Role**: You are an expert academic assistant tasked with creating a definitive set of study notes from a lecture.\n\n"
62
+ "**Primary Objective**: Generate a **comprehensive and detailed** note segment in Markdown. Do not omit details or simplify concepts excessively. Your goal is to capture the full context of the lecture segment.\n\n"
63
+ "**Key Instructions**:\n\n"
64
+ "1. **Capture Emphasized Points**: Pay close attention to the subtitles. Identify and highlight key points that the speaker seems to emphasize, such as repeated phrases, direct statements of importance (e.g., 'the key is...', 'remember that...'), and core definitions.\n\n"
65
+ "2. **Integrate Visuals (Formulas & Tables)**: You MUST analyze the accompanying images. If an image contains crucial information like **formulas, equations, tables, code snippets, or important diagrams**, you must accurately transcribe it into the Markdown note to support the text. Follow these formats:\n"
66
+ " - For **formulas and equations**, use LaTeX notation (e.g., enclose with `$` or `$$`).\n"
67
+ " - For **tables**, recreate them using Markdown table syntax.\n"
68
+ " - For **code**, use Markdown code blocks with appropriate language identifiers.\n\n"
69
+ "3. **Structure and Format**: Organize the notes logically. Use headings, subheadings, lists, and bold text to create a clear, readable, and well-structured document.\n\n"
70
+ "4. **Language**: The notes should align with the subtitles.\n\n"
71
+ "5. **Image Mapping**: Stop referencing the images and try to use formulas, tables, code snippets, or important diagrams to describe the images.\n\n"
72
+ "---BEGIN LECTURE MATERIALS---\n"
73
+ f"**Subtitles (placeholders inserted)**:\n{placeholder_subs}"
74
+ )
75
+
76
+ parts: List[Dict[str, Any]] = [
77
+ {"type": "text", "text": prompt_text}
78
+ ]
79
+ for idx, img_fp in enumerate(synced["images"][:10], start=1): # send at most the first 10 images to the LLM
80
+ parts.append({
+ "type": "image_url",
+ "image_url": {
+ "url": cls._img_to_data_uri(Path(img_fp)),
+ "detail": "auto", # "detail" controls image resolution (low/high/auto); images map to the [IMGn] placeholders by their order
85
+ },
86
+ })
87
+ return [{"role": "user", "content": parts}]
88
+
89
+ @classmethod
90
+ def generate_note(cls, chunk: FinalChunk, subtitles: List[Dict]) -> NoteChunk:
91
+ """Generate a single NoteChunk from FinalChunk data."""
92
+ # collect text for this chunk
93
+ texts = [s["text"] for s in subtitles if chunk.start <= s["start"] < chunk.end]
94
+ synced = {"text": texts, "images": chunk.images}
95
+ messages = cls._build_messages(synced)
96
+ note = cls._call_llm(messages)
97
+ return NoteChunk(note=note, images=chunk.images)
lec2note/scripts/__pycache__/run_pipeline.cpython-310.pyc ADDED
Binary file (2.13 kB). View file
 
lec2note/scripts/__pycache__/run_pipeline.cpython-312.pyc ADDED
Binary file (2.9 kB). View file
 
lec2note/scripts/run_pipeline.py ADDED
@@ -0,0 +1,54 @@
1
+ """CLI entry to run Lec2Note2 pipeline end-to-end.
2
+
3
+ Usage
4
+ -----
5
+ python -m lec2note.scripts.run_pipeline --video path.mp4 --output notes.md
6
+ """
7
+
8
+ from __future__ import annotations
9
+
10
+ import argparse
11
+ from pathlib import Path
12
+
13
+ from lec2note.ingestion.audio_extractor import AudioExtractor
14
+ from lec2note.utils.logging_config import setup_logging
15
+ from lec2note.ingestion.whisper_runner import WhisperRunner
16
+ from lec2note.segmentation.visual_segmenter import VisualSegmenter
17
+ from lec2note.segmentation.semantic_segmenter import SemanticSegmenter
18
+ from lec2note.vision.keyframe_extractor import KeyframeExtractor
19
+ from lec2note.processing.processor import Processor
20
+ from lec2note.synthesis.assembler import Assembler
21
+ from lec2note.types import FinalChunk
22
+
23
+
24
+ def main(): # noqa: D401
25
+ setup_logging()
26
+ parser = argparse.ArgumentParser(description="Run Lec2Note2 pipeline")
27
+ parser.add_argument("--video", required=True, help="Path to input video")
28
+ parser.add_argument("--output", required=True, help="Path to output markdown")
29
+ args = parser.parse_args()
30
+
31
+ video_path = Path(args.video).expanduser().resolve()
32
+ if not video_path.exists():
33
+ raise FileNotFoundError(video_path)
34
+
35
+ wav = AudioExtractor.extract(video_path)
36
+ subtitles = WhisperRunner.transcribe(wav)
37
+
38
+ slide_chunks = VisualSegmenter.run(video_path)
39
+ final_chunks_dict = SemanticSegmenter.refine(slide_chunks, subtitles)
40
+
41
+ keyframes = KeyframeExtractor.run(video_path)
42
+ final_chunks: list[FinalChunk] = []
43
+ for ch in final_chunks_dict:
44
+ fc = FinalChunk(start=ch["start"], end=ch["end"], images=keyframes)
45
+ final_chunks.append(fc)
46
+
47
+ note_chunks = [Processor.generate_note(fc, subtitles) for fc in final_chunks]
48
+ markdown = Assembler.merge(note_chunks)
49
+ Assembler.save(markdown, args.output)
50
+ print(f"Saved markdown to {args.output}")
51
+
52
+
53
+ if __name__ == "__main__": # pragma: no cover
54
+ main()
lec2note/segmentation/__pycache__/semantic_segmenter.cpython-310.pyc ADDED
Binary file (1.75 kB). View file
 
lec2note/segmentation/__pycache__/visual_segmenter.cpython-310.pyc ADDED
Binary file (2.14 kB). View file
 
lec2note/segmentation/semantic_segmenter.py ADDED
@@ -0,0 +1,55 @@
1
+ """Refine slide chunks based on subtitle semantic similarity."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import logging
6
+
7
+ from typing import List, Dict
8
+
9
+ from sentence_transformers import SentenceTransformer, util # type: ignore
10
+
11
+ logger = logging.getLogger(__name__)
12
+
13
+ __all__ = ["SemanticSegmenter"]
14
+
15
+
16
+ class SemanticSegmenter: # noqa: D101
17
+ _model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
18
+
19
+ @classmethod
20
+ def refine(cls, slide_chunks: List[Dict], subtitles: List[Dict]) -> List[Dict]:
21
+ """Split long chunks or merge short ones by semantic change."""
22
+ if not slide_chunks:
23
+ logger.warning("[SemanticSegmenter] empty slide_chunks input")
24
+ return []
25
+
26
+ # Build text per chunk
27
+ chunk_texts: List[str] = []
28
+ for ch in slide_chunks:
29
+ txt = []
30
+ for s in subtitles:
31
+ if ch["start"] <= s["start"] < ch["end"]:
32
+ txt.append(s["text"])
33
+ chunk_texts.append(" ".join(txt))
34
+
35
+ embeddings = cls._model.encode(chunk_texts, convert_to_tensor=True)
36
+
37
+ refined: List[Dict] = []
38
+ buffer = slide_chunks[0].copy()
39
+ buf_emb = embeddings[0]
40
+ for i in range(1, len(slide_chunks)):
41
+ sim = float(util.cos_sim(buf_emb, embeddings[i]))
42
+ duration = buffer["end"] - buffer["start"]
43
+ if duration > 120 and sim < 0.8: # too long and not similar => split
44
+ refined.append(buffer)
45
+ buffer = slide_chunks[i].copy()
46
+ buf_emb = embeddings[i]
47
+ elif duration < 10 and sim > 0.9: # too short and similar => merge
48
+ buffer["end"] = slide_chunks[i]["end"]
49
+ else:
50
+ refined.append(buffer)
51
+ buffer = slide_chunks[i].copy()
52
+ buf_emb = embeddings[i]
53
+ refined.append(buffer)
54
+ logger.info("[SemanticSegmenter] refined %d→%d chunks", len(slide_chunks), len(refined))
55
+ return refined
lec2note/segmentation/visual_segmenter.py ADDED
@@ -0,0 +1,52 @@
1
+ """Visual segmentation based on keyframe timestamps.
2
+
3
+ This module identifies slide boundaries by extracting keyframes first (via
4
+ ``lec2note.vision.keyframe_extractor``), then converting frame indices to time
5
+ range based on video FPS.
6
+ """
7
+
8
+ from __future__ import annotations
9
+
10
+ import logging
11
+ from pathlib import Path
12
+ from typing import List, Dict
13
+
14
+ import cv2 # type: ignore
15
+
16
+ from lec2note.vision.keyframe_extractor import KeyframeExtractor
17
+ from lec2note.types import SlideChunk
18
+
19
+ __all__ = ["VisualSegmenter"]
20
+
21
+ logger = logging.getLogger(__name__)
22
+
23
+
24
+ class VisualSegmenter: # noqa: D101
25
+ @classmethod
26
+ def run(cls, video_fp: str | Path) -> List[Dict]: # slide_chunks list of dict
27
+ """Return list of ``{start, end}`` slide-level chunks."""
28
+ video_path = Path(video_fp).expanduser().resolve()
29
+ logger.info("[VisualSegmenter] start visual segmentation on %s", video_path.name)
30
+ keyframes = KeyframeExtractor.run(video_path,threshold=0.2)
31
+ if not keyframes:
32
+ # fallback single chunk whole video
33
+ cap = cv2.VideoCapture(str(video_path))
34
+ duration = cap.get(cv2.CAP_PROP_FRAME_COUNT) / cap.get(cv2.CAP_PROP_FPS)
35
+ cap.release()
36
+ return [{"start": 0.0, "end": duration}]
37
+
38
+ # Determine timestamp for each keyframe: assume filename kf_idx order matches frame order
39
+ cap = cv2.VideoCapture(str(video_path))
40
+ fps = cap.get(cv2.CAP_PROP_FPS)
41
+ cap.release()
42
+
43
+ indices = [int(p.stem.split("_")[1]) for p in keyframes]
44
+ indices.sort()
45
+ times = [idx / fps for idx in indices]
46
+ times.append(float("inf")) # sentinel for last end
47
+
48
+ slide_chunks: List[Dict] = []
49
+ for i in range(len(times) - 1):
50
+ slide_chunks.append({"start": times[i], "end": times[i + 1]})
51
+ logger.info("[VisualSegmenter] generated %d slide chunks", len(slide_chunks))
52
+ return slide_chunks
lec2note/synthesis/__pycache__/assembler.cpython-310.pyc ADDED
Binary file (2.28 kB). View file
 
lec2note/synthesis/assembler.py ADDED
@@ -0,0 +1,76 @@
1
+ """Assembler merges note chunks into final Markdown document."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import logging
+ import os
+ from pathlib import Path
7
+ logger = logging.getLogger(__name__)
8
+
9
+ from typing import List
10
+ from openai import OpenAI
11
+ from lec2note.types import NoteChunk
12
+
13
+ __all__ = ["Assembler"]
14
+
15
+ TEMPLATE = """# 讲座笔记
16
+
17
+ {content}
18
+ """
19
+
20
+
21
+ class Assembler: # noqa: D101
22
+ @staticmethod
23
+ def merge(chunks: List[NoteChunk]) -> str:
24
+ """Concatenate note chunks and wrap with template."""
25
+ body_parts = []
26
+ for c in chunks:
27
+ body_parts.append(c.note)
28
+ raw_md = "\n\n".join(body_parts)
29
+ logger.info("[Assembler] merging %d note chunks", len(chunks))
30
+
31
+ # Optional LLM post-polishing, configured via environment variables; falls back to the raw concatenation on failure
32
+ logger.info("[Assembler] polishing with LLM…")
33
+ try:
34
+ if not os.getenv("OPENAI_API_KEY"):
35
+ raise EnvironmentError("OPENAI_API_KEY not set")
36
+
37
+ client=OpenAI(
38
+ base_url=os.getenv("OPENAI_API_BASE"),
39
+ api_key=os.getenv("OPENAI_API_KEY"),
40
+ )
41
+ response = client.chat.completions.create(
42
+ model=os.getenv("OPENAI_MODEL", "gpt-4o-mini"),
43
+ temperature=0.3,
44
+ messages=[
45
+ {
46
+ "role": "user",
47
+ "content": (
48
+ "You are an expert academic editor and content synthesizer. Your task is to transform a collection of fragmented and repetitive lecture notes into a single, coherent, and logically structured study guide.\n\n"
49
+ "**Context:** These notes were generated by summarizing different segments of a single video lecture. As a result, they are not chronologically ordered and contain significant overlap and redundancy.\n\n"
50
+ "**Primary Goal:** Create a comprehensive, well-organized, and de-duplicated final document from the provided fragments.\n\n"
51
+ "**Key Instructions:**\n"
52
+ "1. **De-duplicate and Consolidate:** Identify all repetitive definitions and explanations. Merge them into a single, comprehensive section for each core concept. For instance, fundamental terms like 'State vs. Observation', 'Policy', and the notation aside (s_t vs x_t) are likely defined multiple times; these must be consolidated.\n"
53
+ "2. **Reorganize and Structure:** Do NOT preserve the original order. Instead, create a new, logical structure for the entire document. Use clear headings and subheadings (e.g., using Markdown's #, ##, ###) to build a clear narrative, starting from fundamental definitions and progressing to more complex topics.\n"
54
+ "3. **Synthesize and Enhance:** Where different fragments explain the same concept with slightly different examples or details (e.g., one note uses a 'cheetah' example, another uses a 'robot'), synthesize these details to create a richer, more complete explanation under a single heading.\n"
55
+ "4. **Polish and Format:** Ensure the final text is grammatically correct, flows naturally, and uses consistent, clean Markdown formatting (e.g., for tables, code blocks, and mathematical notation).\n\n"
56
+ "**Constraint:** Ensure all unique concepts and key details from the original notes are preserved in the final document. The goal is to lose redundancy, not information.\n\n"
57
+ "Here are the fragmented notes to process:\n\n"
58
+ f"{raw_md}"
59
+ ),
60
+ }
61
+ ],
62
+ )
63
+ polished = response.choices[0].message.content.strip()
64
+ except Exception: # noqa: BLE001
+ polished = raw_md # fall back to the unpolished concatenation
68
+
69
+ logger.info("[Assembler] final document length %d chars", len(polished))
70
+ return TEMPLATE.format(content=polished)
71
+
72
+ @staticmethod
73
+ def save(markdown: str, output: str | Path) -> Path:
74
+ out_path = Path(output).expanduser().resolve()
75
+ out_path.write_text(markdown, encoding="utf-8")
76
+ return out_path
lec2note/types.py ADDED
@@ -0,0 +1,36 @@
1
+ """Shared dataclass definitions used across modules."""
2
+
3
+ from __future__ import annotations
4
+
5
+ from dataclasses import dataclass, field
6
+ from pathlib import Path
7
+ from typing import List, Dict, Any
8
+
9
+ __all__ = [
10
+ "SlideChunk",
11
+ "FinalChunk",
12
+ "NoteChunk",
13
+ "Chunk",
14
+ ]
15
+
16
+
17
+ @dataclass
18
+ class SlideChunk: # noqa: D101
19
+ start: float # seconds
20
+ end: float
21
+
22
+
23
+ @dataclass
24
+ class FinalChunk(SlideChunk): # noqa: D101
25
+ images: List[Path] = field(default_factory=list)
26
+ subtitles: List[int] = field(default_factory=list) # indices in subtitles list
27
+
28
+
29
+ @dataclass
30
+ class NoteChunk: # noqa: D101
31
+ note: str
32
+ images: List[Path]
33
+
34
+
35
+ # alias used by older code
36
+ Chunk = FinalChunk
lec2note/utils/__pycache__/logging_config.cpython-310.pyc ADDED
Binary file (849 Bytes). View file
 
lec2note/utils/logging_config.py ADDED
@@ -0,0 +1,25 @@
1
+ """Global logging configuration for Lec2Note2.
2
+
3
+ Call ``setup_logging()`` once at program start to enable consistent log format.
4
+ """
5
+
6
+ from __future__ import annotations
7
+
8
+ import logging
9
+ import sys
10
+
11
+ __all__ = ["setup_logging"]
12
+
13
+
14
+ def setup_logging(level: int = logging.INFO) -> None: # noqa: D401
15
+ """Configure root logger with sane defaults if not already configured."""
16
+ if logging.getLogger().handlers:
17
+ # Already configured by caller / framework
18
+ return
19
+
20
+ logging.basicConfig(
21
+ level=level,
22
+ format="[%(asctime)s] %(levelname)-8s %(name)s: %(message)s",
23
+ datefmt="%Y-%m-%d %H:%M:%S",
24
+ handlers=[logging.StreamHandler(sys.stdout)],
25
+ )
lec2note/vision/__pycache__/keyframe_extractor.cpython-310.pyc ADDED
Binary file (2.9 kB). View file
 
lec2note/vision/keyframe_extractor.py ADDED
@@ -0,0 +1,85 @@
1
+ """Keyframe extraction based on frame similarity.
2
+
3
+ For lecture-slide videos, most inter-frame differences come from slide transitions. This module
+ detects such "scene changes" by combining a structural-similarity (SSIM) check with a perceptual-hash
+ (dHash) comparison, and saves a keyframe at each detected transition.
5
+ """
6
+
7
+ from __future__ import annotations
8
+
9
+ import cv2 # type: ignore
10
+ import numpy as np
11
+ from skimage.metrics import structural_similarity as ssim # type: ignore
12
+ import imagehash
13
+ from PIL import Image
14
+ import logging
15
+ from pathlib import Path
16
+ from typing import List
17
+
18
+ __all__ = ["KeyframeExtractor"]
19
+
20
+
21
+ class KeyframeExtractor:
22
+ """Extract keyframes when similarity drops below threshold."""
23
+
24
+ @staticmethod
25
+ def _is_new_slide(prev: np.ndarray, curr: np.ndarray, *, ssim_th: float = 0.95, dhash_th: int = 8) -> bool:
26
+ """Return True if curr frame is considered a new slide."""
27
+ # SSIM on down-scaled grayscale
28
+ gray_prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
29
+ gray_curr = cv2.cvtColor(curr, cv2.COLOR_BGR2GRAY)
30
+ ssim_val = ssim(gray_prev, gray_curr)
31
+ if ssim_val < ssim_th:
32
+ return True
33
+
34
+ # perceptual hash (dHash) comparison
35
+ h1 = imagehash.dhash(Image.fromarray(prev))
36
+ h2 = imagehash.dhash(Image.fromarray(curr))
37
+ if h1 - h2 > dhash_th:
38
+ return True
39
+ return False
40
+
41
+ @classmethod
42
+ def run(cls, video_fp: str | Path, threshold: float = 0.6, output_dir: str | Path | None = None) -> List[Path]:
43
+ """Return list of saved keyframe image paths.
44
+
45
+ Parameters
46
+ ----------
47
+ video_fp : str | Path
48
+ 视频文件路径。
49
+ threshold : float
50
+ 相似度阈值,低于此值认定为新幻灯片。
51
+ output_dir : str | Path, optional
52
+ 保存关键帧的目录,默认与视频同级的 ``frames`` 目录。
53
+ """
54
+ video_path = Path(video_fp).expanduser().resolve()
55
+ if not video_path.exists():
56
+ raise FileNotFoundError(video_path)
57
+
58
+ save_dir = Path(output_dir or video_path.parent / "frames").resolve()
59
+ save_dir.mkdir(parents=True, exist_ok=True)
60
+
61
+ cap = cv2.VideoCapture(str(video_path))
62
+ success, prev_frame = cap.read()
63
+ if not success:
64
+ cap.release()
65
+ raise RuntimeError("Cannot read video")
66
+
67
+ frame_idx = 0
68
+ saved_paths: List[Path] = []
69
+
70
+ while True:
71
+ success, frame = cap.read()
72
+ if not success:
73
+ break
74
+ if cls._is_new_slide(prev_frame, frame, ssim_th=threshold):
75
+ # new slide: save current frame
76
+ out_fp = save_dir / f"kf_{frame_idx:04d}.png"
77
+ cv2.imwrite(str(out_fp), frame)
78
+ saved_paths.append(out_fp)
79
+ prev_frame = frame
80
+
81
+ frame_idx += 1
82
+ logging.getLogger(__name__).info("[KeyframeExtractor] saved %d keyframes to %s", len(saved_paths), save_dir)
83
+
84
+ cap.release()
85
+ return saved_paths
lec2note/vision/ocr_processor.py ADDED
@@ -0,0 +1,30 @@
1
+ """OCR processor using PaddleOCR."""
2
+
3
+ from __future__ import annotations
4
+
5
+ from pathlib import Path
6
+ from typing import List
7
+
8
+ from paddleocr import PaddleOCR # type: ignore
9
+
10
+ __all__ = ["OcrProcessor"]
11
+
12
+
13
+ class OcrProcessor: # noqa: D101
14
+ # Initialize the model once so the GPU can be reused
15
+ _ocr = PaddleOCR(use_angle_cls=True, lang="ch")
16
+
17
+ @classmethod
18
+ def run(cls, img_fp: str | Path, lang: str = "ch") -> str:
19
+ """Perform OCR and return concatenated text."""
20
+ img_path = Path(img_fp).expanduser().resolve()
21
+ if not img_path.exists():
22
+ raise FileNotFoundError(img_path)
23
+
24
+ result = cls._ocr.ocr(str(img_path), cls=True)
25
+ # PaddleOCR returns nested list
26
+ texts: List[str] = []
27
+ for line in result: # type: ignore
28
+ for _box, (text, _score) in line: # each detection is [bounding_box, (text, confidence)]
+ texts.append(text)
30
+ return "\n".join(texts)
requirements.txt ADDED
@@ -0,0 +1,19 @@
1
+ fastapi==0.110.2
2
+ uvicorn==0.28.1
3
+ pydantic==2.7.1
4
+ numpy==1.26.4
5
+ opencv-python==4.6.0.66
6
+ pillow==10.3.0
7
+ paddleocr==2.7.0.3
8
+ torch<=2.3.0
9
+ prefect==2.19.6 # latest stable as of 2024-06, 2.17.2 not published
10
+ openai-whisper==20231117 # wheels available on PyPI
11
+ python-multipart==0.0.9
12
+ rich==13.7.1
13
+ pytest==8.2.0
14
+ sentence-transformers==2.7.0
15
+ openai>=1.35.0 # 新 SDK,支持 OpenRouter & httpx 0.28+
16
+ httpx>=0.28,<0.30
17
+ anyio>=3.7,<4.0
18
+ scikit-image==0.25.1
19
+ imagehash==4.3.1