LRU1 committed on
Commit
429e139
·
0 Parent(s):

visual seg -> semantic seg

.env ADDED
@@ -0,0 +1,3 @@
1
+ export OPENAI_API_KEY="YOUR_OPENAI_API_KEY"
2
+ export OPENAI_API_BASE="https://openrouter.ai/api/v1"
3
+ export LOG_LEVEL=DEBUG
.gitignore ADDED
@@ -0,0 +1,2 @@
1
+ /data/*
2
+ /test/*
DEVELOPER_GUIDE.md ADDED
@@ -0,0 +1,183 @@
1
+ # Lec2Note Developer Guide (DEVELOPER GUIDE)
+
+ ## 1. Project Overview
+
+ Lec2Note aims to provide an end-to-end solution for automatically turning video lectures into notes. Using multimodal analysis, it fuses the video frames with the audio content to produce well-structured notes that combine text and images.
+
+ ### Core Pipeline
+
+ 1. **Hybrid Segmentation**: a two-level strategy. Coarse chunks are first derived from slide transitions to establish the overall structure; the transcript within each chunk is then analysed semantically and split or merged again, so that every final chunk is logically complete and self-contained.
+ 2. **Multimodal Information Extraction**
+    - Video stream: extract keyframe images within each chunk (covering animations, annotations and other dynamic content) and run OCR on them.
+    - Audio stream: extract and transcribe (ASR) the audio for the corresponding time range, producing a time-stamped transcript.
+ 3. **Synchronization**: associate the transcript with the matching keyframes, building the context of "this text describes that image".
+ 4. **Multimodal Generation**: feed the aligned text-image blocks into a multimodal large language model (LLM) to generate notes with section headings, key-point summaries and core screenshots.
+ 5. **Structured Output**: aggregate everything and export structured Markdown/HTML/Notion pages.
+
+ ### Target Use Cases
+
+ - Recorded academic courses
+ - Conference / seminar recordings
+ - Internal corporate training videos
22
+
23
+ ## 2. Tech Stack
+
+ | Layer | Main technologies | Notes |
+ | ---------------- | -------------------------------- | ---------------------------------- |
+ | Language | Python 3.9+ | Core codebase is implemented in Python |
+ | Video / image processing | OpenCV, Pillow | Keyframe extraction, frame-change detection, image processing |
+ | OCR | PaddleOCR / Tesseract | Extracts slide text from keyframe images |
+ | ASR | Whisper / Faster-Whisper | High-accuracy speech recognition, optional GPU acceleration |
+ | LLM | OpenAI GPT-4V / LLaVA | Multimodal models that read text + images and write the notes |
+ | Web framework | FastAPI | RESTful & WebSocket services |
+ | Task orchestration | Prefect / Celery | Batch processing with retry support |
+ | Database | SQLite (dev) / PostgreSQL (prod) | Stores metadata and job state |
+ | Containers | Docker & Docker Compose | One-command deployment |
37
+ ## 3. 目录结构与模块划分
38
+
39
+ ```text
40
+ Lec2Note/
41
+ ├── docs/ # 设计文档 & 会议记录
42
+ ├── lec2note/ # 源码包 (Python)
43
+ │ ├── ingestion/ # 音频处理 & ASR
44
+ │ │ ├── audio_extractor.py
45
+ │ │ └── whisper_runner.py
46
+ │ ├── vision/ # 视频画面处理模块
47
+ │ │ ├── keyframe_extractor.py
48
+ │ │ └── ocr_processor.py
49
+ │ ├── segmentation/ # 【更新】混合式分块模块
50
+ │ │ ├── visual_segmenter.py
51
+ │ │ └── semantic_segmenter.py
52
+ │ ├── processing/ # 多模态信息融合与 LLM 生成
53
+ │ ├── synthesis/ # 全局笔记整合与导出
54
+ │ ├── assets/ # 静态模板 (Markdown/HTML)
55
+ │ └── api/ # FastAPI 路由
56
+ ├── scripts/ # CLI 脚本 & 任务调度
57
+ ├── tests/ # PyTest 单元与集成测试
58
+ ├── Dockerfile
59
+ ├── docker-compose.yml
60
+ └── README.md
61
+ ```
62
+
63
+ ## 4. Core Features
+
+ ### 4.1 Hybrid Segmentation
+
+ This is a two-stage process designed to produce logically coherent content chunks (a minimal sketch follows below):
+
+ - **Coarse visual segmentation**: call `visual_segmenter.run(video_fp)`, which uses OpenCV to analyse inter-frame differences, locate the exact timestamps of slide transitions, and produce preliminary `slide_chunks`.
+ - **Semantic refinement**: call `semantic_segmenter.refine(slide_chunks, subtitles)` on the previous result:
+   - Split: if a chunk is too long, split it into smaller, more focused `sub_chunks` wherever the semantic similarity of its ASR text changes.
+   - Merge: if several consecutive chunks are too short and semantically related, merge them into one logical unit.
+ - **Output**: a final list of optimised, logically independent `final_chunks`.
74
+
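+ A minimal sketch of how the two stages are expected to chain together, based on the signatures listed in section 4.6 (`subtitles` is the time-stamped transcript returned by `WhisperRunner.transcribe`; the variable names are illustrative):
+
+ ```python
+ from lec2note.segmentation.visual_segmenter import VisualSegmenter
+ from lec2note.segmentation.semantic_segmenter import SemanticSegmenter
+
+ # Stage 1: coarse chunks at slide transitions -> [{"start": 0.0, "end": 42.3}, ...]
+ slide_chunks = VisualSegmenter.run("example.mp4")
+
+ # Stage 2: split/merge the coarse chunks using the ASR transcript
+ final_chunks = SemanticSegmenter.refine(slide_chunks, subtitles)
+ ```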
75
+ ### 4.2 Information Extraction
+
+ For each `final_chunk`:
+
+ ```python
+ extract_keyframes(chunk) # extract keyframes
+ run_ocr_on_frames(frames) # run OCR on the keyframes
+ extract_and_transcribe_audio(chunk) # ASR transcription
83
+ ```
84
+
85
+ ### 4.3 Text-Image Fusion & Generation (Processing)
+
+ ```python
+ synchronize_text_and_frames(subtitles, frames) # align subtitles with keyframes
+ generate_note_chunk(synchronized_data) # generate the note chunk with the LLM
90
+ ```
91
+
92
+ ### 4.4 Note Synthesis
+
+ - Aggregate all `note_chunk`s and have the LLM summarize and polish them into a complete set of notes (written in the same language as the lecture).
+ - Export the result as the final Markdown file (see the usage sketch below).
96
+
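+ A minimal usage sketch of the synthesis step, assuming `note_chunks` is the list produced by `Processor.generate_note` (see `lec2note/synthesis/assembler.py` for the actual implementation):
+
+ ```python
+ from lec2note.synthesis.assembler import Assembler
+
+ markdown = Assembler.merge(note_chunks)  # concatenate chunks and optionally polish via the LLM
+ Assembler.save(markdown, "notes.md")     # write the final document to disk
+ ```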
97
+ ### 4.5 API
98
+
99
+ | Method & Path | Description |
+ | ------------------ | ------------------------------ |
+ | `POST /upload` | Upload a video → returns a job ID |
+ | `GET /status/{id}` | Query job progress (e.g. "visual_segmentation") |
+ | `GET /notes/{id}` | Fetch the generated notes (text + images) |
104
+
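+ A quick way to exercise these endpoints from Python (a sketch, assuming the API is served locally on port 8000; `httpx` is already in `requirements.txt`):
+
+ ```python
+ import httpx
+
+ base = "http://localhost:8000"
+
+ # upload the lecture video and get a job ID back
+ with open("example.mp4", "rb") as f:
+     files = {"video": ("example.mp4", f, "video/mp4")}
+     job_id = httpx.post(f"{base}/upload", files=files, timeout=None).json()["id"]
+
+ # poll the job status until it reports "completed"
+ print(httpx.get(f"{base}/status/{job_id}").json())
+
+ # download the generated Markdown notes
+ open("notes.md", "wb").write(httpx.get(f"{base}/notes/{job_id}").content)
+ ```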
105
+ ### 4.6 Internal Module Interfaces
+
+ | Module | Key class / method | Input | Output | Notes |
+ | --------------------------------- | ---------------------------------------------------------------------------- | ----------------- | ----------------------------------------- | --------------------------------------------------- |
+ | `ingestion.audio_extractor` | `AudioExtractor.extract(video_fp: str) -> Path` | Video file path | Path to `audio.wav` | Splits the audio track with FFmpeg and normalizes it to 16 kHz mono |
+ | `ingestion.whisper_runner` | `WhisperRunner.transcribe(audio_fp: Path, lang: str = "zh") -> List[Dict]` | Audio file path | `[{"start":0.0,"end":3.2,"text":"…"}, …]` | Returns a time-stamped subtitle list (JSON-serialized to `.jsonl`) |
+ | `vision.keyframe_extractor` | `KeyframeExtractor.run(video_fp: str, threshold: float = 0.6) -> List[Path]` | Video file path | List of keyframe image paths | A frame whose similarity to the previous one drops below the threshold is treated as a new slide |
+ | `vision.ocr_processor` | `OcrProcessor.run(img_fp: Path, lang: str = "ch") -> str` | Image path | Text found in the image | Uses PaddleOCR; GPU is auto-detected |
+ | `segmentation.visual_segmenter` | `VisualSegmenter.run(video_fp) -> List[Dict]` | Video file path | `slide_chunks` | Returns a coarse list of `{start,end}` segments |
+ | `segmentation.semantic_segmenter` | `SemanticSegmenter.refine(slide_chunks, subtitles) -> List[Dict]` | Coarse segments & subtitle list | `final_chunks` | Splits/merges again based on textual semantic similarity |
+ | `processing.processor` | `Processor.generate_note(chunk, subtitles) -> NoteChunk` | `final_chunk` & subtitle list | `NoteChunk(note:str,images:List[str])` | Calls the LLM to generate the note for a single chunk |
+ | `synthesis.assembler` | `Assembler.merge(chunks: List[NoteChunk]) -> str` | List of NoteChunk | Markdown/HTML string | Assembles the full document and fills the template |
117
+
118
+ ### 4.7 Data Format Examples
119
+
120
+ ```jsonc
121
+ // subtitles.jsonl (excerpt)
122
+ {"start": 0.0, "end": 3.2, "text": "欢迎来到 Lec2Note 课程"}
123
+ {"start": 3.2, "end": 6.7, "text": "今天我们介绍多模态笔记生成"}
124
+ ```
125
+
126
+ ```jsonc
127
+ // chunk_schema.json (excerpt)
128
+ {
129
+ "id": 1,
130
+ "start": 0.0,
131
+ "end": 120.5,
132
+ "images": ["kf_0001.png", "kf_0002.png"],
133
+ "subtitles": [0, 1, 2, 3]
134
+ }
135
+ ```
136
+
137
+ ```jsonc
138
+ // note_chunk (sample output of Processor.generate_note)
139
+ {
140
+ "note": "### 多模态分析简介\n- 本节介绍了……",
141
+ "images": ["kf_0001.png"]
142
+ }
143
+ ```
144
+
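+ The subtitle file is plain JSON Lines, so it can be loaded with the standard library (a sketch; the file path is illustrative):
+
+ ```python
+ import json
+
+ with open("subtitles.jsonl", encoding="utf-8") as f:
+     subtitles = [json.loads(line) for line in f if line.strip()]
+ # -> [{"start": 0.0, "end": 3.2, "text": "..."}, ...]
+ ```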
145
+ ## 5. Development Environment Setup
+
+ ```bash
+ # Clone the repository
+ git clone [email protected]:your_org/lec2note.git
+ cd lec2note
+
+ # (Optional) create a virtual environment
+ python -m venv .venv && source .venv/bin/activate
+
+ # Install dependencies
+ pip install -r requirements.txt
+
+ # Set environment variables
+ export OPENAI_API_KEY="YOUR_KEY"
+
+ # Run the unit tests
+ pytest -q
+ ```
+
+ ### Quick local pipeline run
166
+
167
+ ```bash
168
+ python -m lec2note.scripts.run_pipeline \
169
+ --video example.mp4 \
170
+ --output notes.md
171
+ ```
172
+
173
+ ## 6. Deployment Guide
174
+
175
+ ### 6.1 Docker Compose
176
+
177
+ ```bash
178
+ docker compose up -d --build
179
+ ```
180
+
181
+
182
+
183
+
README.md ADDED
@@ -0,0 +1,44 @@
1
+ # Lec2Note2
2
+
3
+ **Lec2Note2** is an end-to-end solution for automatically generating notes from video lectures. It builds on multimodal analysis, fusing video keyframes, OCR text and ASR subtitles, and uses a large language model to produce structured notes.
4
+
5
+ ## Features
6
+
7
+ - Hybrid segmentation (visual + semantic)
+ - Multimodal information extraction: keyframes, OCR, ASR
+ - Text-image synchronization and LLM-based note generation
+ - Asynchronous job API via FastAPI
+ - One-command deployment with Docker
12
+
13
+ ## Quickstart
14
+
15
+ ```bash
16
+ # Clone the repository
+ git clone <repo-url> Lec2Note2 && cd Lec2Note2
+
+ # (Optional) create a virtual environment
+ python -m venv .venv && source .venv/bin/activate
+
+ # Install dependencies
+ pip install -r requirements.txt
+
+ # Run the unit tests
+ pytest -q
+
+ # Run the local pipeline
29
+ python -m lec2note.scripts.run_pipeline \
30
+ --video example.mp4 \
31
+ --output notes.md
32
+ ```
33
+
34
+ ## API
35
+
36
+ | Method | Path | Description |
37
+ | ------ | ------------ | --------------------- |
38
+ | POST | /upload | Upload a video, returns a job ID |
+ | GET | /status/{id} | Query job progress |
+ | GET | /notes/{id} | Fetch the generated notes |
41
+
42
+ ## License
43
+
44
+ MIT License
lec2note/__pycache__/types.cpython-310.pyc ADDED
Binary file (1.12 kB). View file
 
lec2note/api/main.py ADDED
@@ -0,0 +1,109 @@
1
+ """FastAPI interface for Lec2Note2 pipeline."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import uuid
6
+ import shutil
7
+ from pathlib import Path
8
+ from typing import Dict
9
+ import threading
10
+ import logging
11
+
12
+ logger = logging.getLogger(__name__)
13
+
14
+ from fastapi import FastAPI, UploadFile, File, HTTPException
15
+ from fastapi.responses import JSONResponse, FileResponse
16
+
17
+ from lec2note.ingestion.audio_extractor import AudioExtractor
18
+ from lec2note.ingestion.whisper_runner import WhisperRunner
19
+ from lec2note.segmentation.visual_segmenter import VisualSegmenter
20
+ from lec2note.segmentation.semantic_segmenter import SemanticSegmenter
21
+ from lec2note.vision.keyframe_extractor import KeyframeExtractor
22
+ from lec2note.vision.ocr_processor import OcrProcessor
23
+ from lec2note.processing.processor import Processor
24
+ from lec2note.synthesis.assembler import Assembler
25
+ from lec2note.types import FinalChunk
26
+
27
+ app = FastAPI(title="Lec2Note2 API")
28
+
29
+ DATA_ROOT = Path("/tmp/lec2note_jobs")
30
+ DATA_ROOT.mkdir(parents=True, exist_ok=True)
31
+
32
+ _jobs: Dict[str, Dict] = {}
33
+
34
+
35
+ def _run_pipeline(job_id: str, video_path: Path):
36
+ job = _jobs[job_id]
37
+ try:
38
+ job["status"] = "extract_audio"
39
+ wav = AudioExtractor.extract(video_path)
40
+
41
+ job["status"] = "asr"
42
+ subtitles = WhisperRunner.transcribe(wav)
43
+
44
+ job["status"] = "visual_segmentation"
45
+ slide_chunks = VisualSegmenter.run(video_path)
46
+
47
+ job["status"] = "semantic_refine"
48
+ final_chunks_dict = SemanticSegmenter.refine(slide_chunks, subtitles)
49
+
50
+ # attach images to chunks
51
+ keyframes = KeyframeExtractor.run(video_path)
52
+ final_chunks: list[FinalChunk] = []
53
+ for ch in final_chunks_dict:
54
+ fc = FinalChunk(start=ch["start"], end=ch["end"], images=keyframes)
55
+ final_chunks.append(fc)
56
+
57
+ job["status"] = "ocr"
58
+ # run OCR for all keyframes (simplified)
59
+ for img in keyframes:
60
+ OcrProcessor.run(img)
61
+
62
+ job["status"] = "generate_notes"
63
+ note_chunks = [Processor.generate_note(fc, subtitles) for fc in final_chunks]
64
+
65
+ job["status"] = "synthesis"
66
+ md = Assembler.merge(note_chunks)
67
+ output_fp = DATA_ROOT / f"{job_id}.md"
68
+ Assembler.save(md, output_fp)
69
+ job["output"] = output_fp
70
+ job["status"] = "completed"
71
+ logger.info("[API] job %s completed -> %s", job_id, output_fp)
72
+ except Exception as exc: # noqa: BLE001
73
+ job["status"] = f"error: {exc}"
74
+
75
+
76
+ @app.post("/upload")
77
+ async def upload_video(video: UploadFile = File(...)) -> JSONResponse: # noqa: D401
+ """Upload a video and start background processing."""
+ if video.content_type not in {"video/mp4", "video/x-matroska", "video/mkv", "video/x-msvideo", "video/avi"}:
80
+ raise HTTPException(status_code=400, detail="Unsupported video format")
81
+ job_id = str(uuid.uuid4())
82
+ logger.info("[API] received upload %s (size ≈ %.1f MB)", video.filename, video.size/1e6)
83
+ job_dir = DATA_ROOT / job_id
84
+ job_dir.mkdir(parents=True, exist_ok=True)
85
+ video_path = job_dir / video.filename
86
+ with video_path.open("wb") as f:
87
+ shutil.copyfileobj(video.file, f)
88
+
89
+ _jobs[job_id] = {"status": "queued"}
90
+ logger.info("[API] job %s queued", job_id)
91
+ threading.Thread(target=_run_pipeline, args=(job_id, video_path), daemon=True).start()
92
+ return JSONResponse({"id": job_id})
93
+
94
+
95
+ @app.get("/status/{job_id}")
96
+ def get_status(job_id: str): # noqa: D401
97
+ if job_id not in _jobs:
98
+ raise HTTPException(status_code=404, detail="Job not found")
99
+ return JSONResponse({"status": _jobs[job_id]["status"]})
100
+
101
+
102
+ @app.get("/notes/{job_id}")
103
+ def get_notes(job_id: str): # noqa: D401
104
+ if job_id not in _jobs:
105
+ raise HTTPException(status_code=404, detail="Job not found")
106
+ job = _jobs[job_id]
107
+ if job.get("status") != "completed":
108
+ raise HTTPException(status_code=400, detail="Job not completed")
109
+ return FileResponse(job["output"], media_type="text/markdown", filename="notes.md")
lec2note/ingestion/__pycache__/audio_extractor.cpython-310.pyc ADDED
Binary file (2.77 kB). View file
 
lec2note/ingestion/__pycache__/audio_extractor.cpython-312.pyc ADDED
Binary file (3.37 kB). View file
 
lec2note/ingestion/__pycache__/whisper_runner.cpython-310.pyc ADDED
Binary file (1.94 kB). View file
 
lec2note/ingestion/__pycache__/whisper_runner.cpython-312.pyc ADDED
Binary file (2.17 kB). View file
 
lec2note/ingestion/audio_extractor.py ADDED
@@ -0,0 +1,93 @@
1
+ """Audio extraction utility.
2
+
3
+ This module provides `AudioExtractor` which wraps FFmpeg to extract the audio
4
+ track from a video lecture and convert it to a mono 16-kHz WAV file. The output
5
+ file is deterministic, making downstream ASR reproducible.
6
+
7
+ Example
8
+ -------
9
+ >>> from pathlib import Path
10
+ >>> from lec2note.ingestion.audio_extractor import AudioExtractor
11
+ >>> wav = AudioExtractor.extract("example.mp4")
12
+ >>> assert Path(wav).exists()
13
+ """
14
+
15
+ from __future__ import annotations
16
+
17
+ import subprocess
18
+ import logging
19
+ from pathlib import Path
20
+ from typing import Optional
21
+
22
+ logger = logging.getLogger(__name__)
23
+
24
+ __all__ = ["AudioExtractor"]
25
+
26
+
27
+ class AudioExtractor:
+ """Extract and normalize the audio track.
+
+ Attributes
+ ----------
+ sample_rate : int
+ Target sample rate, 16 kHz by default.
+ channels : int
+ Number of channels, 1 (mono) by default.
+ codec : str
+ Output codec, fixed to pcm_s16le.
38
+ """
39
+
40
+ sample_rate: int = 16_000
41
+ channels: int = 1
42
+ codec: str = "pcm_s16le"
43
+
44
+ @classmethod
45
+ def extract(cls, video_fp: str | Path, output_dir: Optional[str | Path] = None) -> Path:
+ """Extract the audio track from a video file and convert it to WAV.
+
+ Parameters
+ ----------
+ video_fp : str | Path
+ Path to the input video file.
+ output_dir : str | Path, optional
+ Output directory. If ``None``, the video's own directory is used.
+
+ Returns
+ -------
+ Path
+ Path to the generated ``audio.wav``.
59
+ """
60
+
61
+ video_path = Path(video_fp).expanduser().resolve()
62
+ logger.info("[AudioExtractor] extracting audio from %s", video_path)
63
+ if not video_path.exists():
64
+ raise FileNotFoundError(video_path)
65
+
66
+ out_dir = Path(output_dir or video_path.parent).expanduser().resolve()
67
+ out_dir.mkdir(parents=True, exist_ok=True)
68
+ audio_path = out_dir / "audio.wav"
69
+
70
+ # FFmpeg command
71
+ cmd = [
72
+ "ffmpeg",
73
+ "-y", # overwrite
74
+ "-i",
75
+ str(video_path),
76
+ "-ac",
77
+ str(cls.channels),
78
+ "-ar",
79
+ str(cls.sample_rate),
80
+ "-vn", # no video
81
+ "-acodec",
82
+ cls.codec,
83
+ str(audio_path),
84
+ ]
85
+
86
+ try:
87
+ subprocess.run(cmd, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
88
+ logger.info("[AudioExtractor] saved wav to %s", audio_path)
89
+ except (subprocess.CalledProcessError, FileNotFoundError) as err:
90
+ msg = "FFmpeg failed; make sure FFmpeg is installed and available on PATH."
91
+ raise RuntimeError(msg) from err
92
+
93
+ return audio_path
lec2note/ingestion/whisper_runner.py ADDED
@@ -0,0 +1,50 @@
1
+ """Thin wrapper around OpenAI Whisper for ASR transcription."""
2
+
3
+ from __future__ import annotations
4
+ from pathlib import Path
5
+ import logging
6
+
7
+ logger = logging.getLogger(__name__)
8
+
9
+ from typing import List, Dict, Optional, Any
10
+
11
+ import torch
12
+ from whisper import load_model # type: ignore
13
+
14
+ __all__ = ["WhisperRunner"]
15
+
16
+
17
+ class WhisperRunner: # noqa: D101
18
+ model_name: str = "base"
19
+
20
+ @classmethod
21
+ def transcribe(cls, audio_fp: str | Path, lang: str = "zh") -> List[Dict[str, Any]]:
22
+ """Transcribe ``audio_fp`` and return list with start/end/text.
23
+
24
+ Notes
25
+ -----
26
+ - Automatically selects GPU if available.
27
+ - The function is *blocking* and can be called inside a Prefect task.
28
+ """
29
+ audio_path = Path(audio_fp).expanduser().resolve()
30
+ if not audio_path.exists():
31
+ raise FileNotFoundError(audio_path)
32
+
33
+ device = "cuda" if torch.cuda.is_available() else "cpu"
34
+ logger.info("[Whisper] loading model %s on %s", cls.model_name, device)
35
+ model = load_model(cls.model_name, device=device)
36
+
37
+ logger.info("[Whisper] transcribing %s", audio_path.name)
38
+ result = model.transcribe(str(audio_path), language=lang)
39
+ segments = result.get("segments", [])
40
+
41
+ # convert to our schema
42
+ logger.info("[Whisper] got %d segments", len(segments))
43
+ return [
44
+ {
45
+ "start": round(seg["start"], 2),
46
+ "end": round(seg["end"], 2),
47
+ "text": seg["text"].strip(),
48
+ }
49
+ for seg in segments
50
+ ]
lec2note/processing/__pycache__/processor.cpython-310.pyc ADDED
Binary file (4.92 kB). View file
 
lec2note/processing/processor.py ADDED
@@ -0,0 +1,97 @@
1
+ """Processing pipeline: synchronize subtitles & images, generate note chunk via LLM."""
2
+
3
+ from __future__ import annotations
4
+
5
+ from typing import List, Dict, Any
6
+ import base64, mimetypes
7
+ from pathlib import Path
8
+ import os
9
+ import logging
10
+
11
+ logger = logging.getLogger(__name__)
12
+
13
+ from openai import OpenAI
14
+ from tenacity import retry, stop_after_attempt, wait_fixed # robust retry
15
+
16
+ from lec2note.types import FinalChunk, NoteChunk
17
+
18
+ __all__ = ["Processor"]
19
+
20
+
21
+ class Processor: # noqa: D101
22
+ model_name: str = os.getenv("OPENAI_MODEL", "google/gemini-2.5-pro")
23
+
24
+ @staticmethod
25
+ def _img_to_data_uri(img_path: Path) -> str:
26
+ mime, _ = mimetypes.guess_type(img_path)
27
+ b64 = base64.b64encode(img_path.read_bytes()).decode()
28
+ return f"data:{mime};base64,{b64}"
29
+
30
+ @staticmethod
31
+ @retry(stop=stop_after_attempt(3), wait=wait_fixed(2))
32
+ def _call_llm(messages: List[Dict[str, Any]]) -> str:
33
+ if not os.getenv("OPENAI_API_KEY"):
34
+ raise EnvironmentError("OPENAI_API_KEY not set")
35
+ client = OpenAI(
36
+ base_url=os.getenv("OPENAI_API_BASE"),
37
+ api_key=os.getenv("OPENAI_API_KEY"),
38
+ )
39
+ response = client.chat.completions.create(
40
+ model=Processor.model_name,
41
+ temperature=0.2,
42
+ messages=messages,
43
+ extra_headers={"X-Title": "Lec2Note2"},
44
+ )
45
+ note = response.choices[0].message.content.strip()
46
+ logger.debug("[Processor] LLM returned %d chars", len(note))
47
+ return note
48
+
49
+ @classmethod
51
+ def _build_messages(cls, synced: Dict[str, Any]) -> List[Dict[str, Any]]:
52
+ subtitle_text = " ".join(synced["text"])
53
+
54
+ # insert numbered placeholders into subtitles for reference
55
+ placeholder_subs = subtitle_text
56
+ for idx, _ in enumerate(synced["images"], start=1):
57
+ placeholder_subs += f"\n\n[IMG{idx}] ← 与下方第 {idx} 张图片对应"
58
+
59
+ # Prompt with explicit mapping guidance
60
+ prompt_text = (
61
+ "**Role**: You are an expert academic assistant tasked with creating a definitive set of study notes from a lecture.\n\n"
62
+ "**Primary Objective**: Generate a **comprehensive and detailed** note segment in Markdown. Do not omit details or simplify concepts excessively. Your goal is to capture the full context of the lecture segment.\n\n"
63
+ "**Key Instructions**:\n\n"
64
+ "1. **Capture Emphasized Points**: Pay close attention to the subtitles. Identify and highlight key points that the speaker seems to emphasize, such as repeated phrases, direct statements of importance (e.g., 'the key is...', 'remember that...'), and core definitions.\n\n"
65
+ "2. **Integrate Visuals (Formulas & Tables)**: You MUST analyze the accompanying images. If an image contains crucial information like **formulas, equations, tables, code snippets, or important diagrams**, you must accurately transcribe it into the Markdown note to support the text. Follow these formats:\n"
66
+ " - For **formulas and equations**, use LaTeX notation (e.g., enclose with `$` or `$$`).\n"
67
+ " - For **tables**, recreate them using Markdown table syntax.\n"
68
+ " - For **code**, use Markdown code blocks with appropriate language identifiers.\n\n"
69
+ "3. **Structure and Format**: Organize the notes logically. Use headings, subheadings, lists, and bold text to create a clear, readable, and well-structured document.\n\n"
70
+ "4. **Language**: The notes should align with the subtitles.\n\n"
71
+ "5. **Image Mapping**: Stop referencing the images and try to use formulas, tables, code snippets, or important diagrams to describe the images.\n\n"
72
+ "---BEGIN LECTURE MATERIALS---\n"
73
+ f"**Subtitles (placeholders inserted)**:\n{placeholder_subs}"
74
+ )
75
+
76
+ parts: List[Dict[str, Any]] = [
77
+ {"type": "text", "text": prompt_text}
78
+ ]
79
+ for idx, img_fp in enumerate(synced["images"][:10], start=1): # send at most the first 10 images to the LLM
80
+ parts.append({
+ "type": "image_url",
+ "image_url": {
+ "url": cls._img_to_data_uri(Path(img_fp)),
+ "detail": "auto", # "detail" controls image resolution (low/high/auto); images map to the [IMGn] placeholders by their order
85
+ },
86
+ })
87
+ return [{"role": "user", "content": parts}]
88
+
89
+ @classmethod
90
+ def generate_note(cls, chunk: FinalChunk, subtitles: List[Dict]) -> NoteChunk:
91
+ """Generate a single NoteChunk from FinalChunk data."""
92
+ # collect text for this chunk
93
+ texts = [s["text"] for s in subtitles if chunk.start <= s["start"] < chunk.end]
94
+ synced = {"text": texts, "images": chunk.images}
95
+ messages = cls._build_messages(synced)
96
+ note = cls._call_llm(messages)
97
+ return NoteChunk(note=note, images=chunk.images)
lec2note/scripts/__pycache__/run_pipeline.cpython-310.pyc ADDED
Binary file (2.13 kB). View file
 
lec2note/scripts/__pycache__/run_pipeline.cpython-312.pyc ADDED
Binary file (2.9 kB). View file
 
lec2note/scripts/run_pipeline.py ADDED
@@ -0,0 +1,54 @@
1
+ """CLI entry to run Lec2Note2 pipeline end-to-end.
2
+
3
+ Usage
4
+ -----
5
+ python -m lec2note.scripts.run_pipeline --video path.mp4 --output notes.md
6
+ """
7
+
8
+ from __future__ import annotations
9
+
10
+ import argparse
11
+ from pathlib import Path
12
+
13
+ from lec2note.ingestion.audio_extractor import AudioExtractor
14
+ from lec2note.utils.logging_config import setup_logging
15
+ from lec2note.ingestion.whisper_runner import WhisperRunner
16
+ from lec2note.segmentation.visual_segmenter import VisualSegmenter
17
+ from lec2note.segmentation.semantic_segmenter import SemanticSegmenter
18
+ from lec2note.vision.keyframe_extractor import KeyframeExtractor
19
+ from lec2note.processing.processor import Processor
20
+ from lec2note.synthesis.assembler import Assembler
21
+ from lec2note.types import FinalChunk
22
+
23
+
24
+ def main(): # noqa: D401
25
+ setup_logging()
26
+ parser = argparse.ArgumentParser(description="Run Lec2Note2 pipeline")
27
+ parser.add_argument("--video", required=True, help="Path to input video")
28
+ parser.add_argument("--output", required=True, help="Path to output markdown")
29
+ args = parser.parse_args()
30
+
31
+ video_path = Path(args.video).expanduser().resolve()
32
+ if not video_path.exists():
33
+ raise FileNotFoundError(video_path)
34
+
35
+ wav = AudioExtractor.extract(video_path)
36
+ subtitles = WhisperRunner.transcribe(wav)
37
+
38
+ slide_chunks = VisualSegmenter.run(video_path)
39
+ final_chunks_dict = SemanticSegmenter.refine(slide_chunks, subtitles)
40
+
41
+ keyframes = KeyframeExtractor.run(video_path)
42
+ final_chunks: list[FinalChunk] = []
43
+ for ch in final_chunks_dict:
44
+ fc = FinalChunk(start=ch["start"], end=ch["end"], images=keyframes)
45
+ final_chunks.append(fc)
46
+
47
+ note_chunks = [Processor.generate_note(fc, subtitles) for fc in final_chunks]
48
+ markdown = Assembler.merge(note_chunks)
49
+ Assembler.save(markdown, args.output)
50
+ print(f"Saved markdown to {args.output}")
51
+
52
+
53
+ if __name__ == "__main__": # pragma: no cover
54
+ main()
lec2note/segmentation/__pycache__/semantic_segmenter.cpython-310.pyc ADDED
Binary file (1.75 kB). View file
 
lec2note/segmentation/__pycache__/visual_segmenter.cpython-310.pyc ADDED
Binary file (2.14 kB). View file
 
lec2note/segmentation/semantic_segmenter.py ADDED
@@ -0,0 +1,55 @@
1
+ """Refine slide chunks based on subtitle semantic similarity."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import logging
6
+
7
+ from typing import List, Dict
8
+
9
+ from sentence_transformers import SentenceTransformer, util # type: ignore
10
+
11
+ logger = logging.getLogger(__name__)
12
+
13
+ __all__ = ["SemanticSegmenter"]
14
+
15
+
16
+ class SemanticSegmenter: # noqa: D101
17
+ _model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
18
+
19
+ @classmethod
20
+ def refine(cls, slide_chunks: List[Dict], subtitles: List[Dict]) -> List[Dict]:
21
+ """Split long chunks or merge short ones by semantic change."""
22
+ if not slide_chunks:
23
+ logger.warning("[SemanticSegmenter] empty slide_chunks input")
24
+ return []
25
+
26
+ # Build text per chunk
27
+ chunk_texts: List[str] = []
28
+ for ch in slide_chunks:
29
+ txt = []
30
+ for s in subtitles:
31
+ if ch["start"] <= s["start"] < ch["end"]:
32
+ txt.append(s["text"])
33
+ chunk_texts.append(" ".join(txt))
34
+
35
+ embeddings = cls._model.encode(chunk_texts, convert_to_tensor=True)
36
+
37
+ refined: List[Dict] = []
38
+ buffer = slide_chunks[0].copy()
39
+ buf_emb = embeddings[0]
40
+ for i in range(1, len(slide_chunks)):
41
+ sim = float(util.cos_sim(buf_emb, embeddings[i]))
42
+ duration = buffer["end"] - buffer["start"]
43
+ if duration > 120 and sim < 0.8: # too long and not similar => split
44
+ refined.append(buffer)
45
+ buffer = slide_chunks[i].copy()
46
+ buf_emb = embeddings[i]
47
+ elif duration < 10 and sim > 0.9: # too short and similar => merge
48
+ buffer["end"] = slide_chunks[i]["end"]
49
+ else:
50
+ refined.append(buffer)
51
+ buffer = slide_chunks[i].copy()
52
+ buf_emb = embeddings[i]
53
+ refined.append(buffer)
54
+ logger.info("[SemanticSegmenter] refined %d→%d chunks", len(slide_chunks), len(refined))
55
+ return refined
lec2note/segmentation/visual_segmenter.py ADDED
@@ -0,0 +1,52 @@
1
+ """Visual segmentation based on keyframe timestamps.
2
+
3
+ This module identifies slide boundaries by extracting keyframes first (via
4
+ ``lec2note.vision.keyframe_extractor``), then converting frame indices to time
5
+ range based on video FPS.
6
+ """
7
+
8
+ from __future__ import annotations
9
+
10
+ import logging
11
+ from pathlib import Path
12
+ from typing import List, Dict
13
+
14
+ import cv2 # type: ignore
15
+
16
+ from lec2note.vision.keyframe_extractor import KeyframeExtractor
17
+ from lec2note.types import SlideChunk
18
+
19
+ __all__ = ["VisualSegmenter"]
20
+
21
+ logger = logging.getLogger(__name__)
22
+
23
+
24
+ class VisualSegmenter: # noqa: D101
25
+ @classmethod
26
+ def run(cls, video_fp: str | Path) -> List[Dict]: # slide_chunks list of dict
27
+ """Return list of ``{start, end}`` slide-level chunks."""
28
+ video_path = Path(video_fp).expanduser().resolve()
29
+ logger.info("[VisualSegmenter] start visual segmentation on %s", video_path.name)
30
+ keyframes = KeyframeExtractor.run(video_path,threshold=0.2)
31
+ if not keyframes:
32
+ # fallback single chunk whole video
33
+ cap = cv2.VideoCapture(str(video_path))
34
+ duration = cap.get(cv2.CAP_PROP_FRAME_COUNT) / cap.get(cv2.CAP_PROP_FPS)
35
+ cap.release()
36
+ return [{"start": 0.0, "end": duration}]
37
+
38
+ # Determine timestamp for each keyframe: assume filename kf_idx order matches frame order
39
+ cap = cv2.VideoCapture(str(video_path))
40
+ fps = cap.get(cv2.CAP_PROP_FPS)
41
+ cap.release()
42
+
43
+ indices = [int(p.stem.split("_")[1]) for p in keyframes]
44
+ indices.sort()
45
+ times = [idx / fps for idx in indices]
46
+ times.append(float("inf")) # sentinel for last end
47
+
48
+ slide_chunks: List[Dict] = []
49
+ for i in range(len(times) - 1):
50
+ slide_chunks.append({"start": times[i], "end": times[i + 1]})
51
+ logger.info("[VisualSegmenter] generated %d slide chunks", len(slide_chunks))
52
+ return slide_chunks
lec2note/synthesis/__pycache__/assembler.cpython-310.pyc ADDED
Binary file (2.28 kB). View file
 
lec2note/synthesis/assembler.py ADDED
@@ -0,0 +1,76 @@
1
+ """Assembler merges note chunks into final Markdown document."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import logging
+ import os
+ from pathlib import Path
7
+ logger = logging.getLogger(__name__)
8
+
9
+ from typing import List
10
+ from openai import OpenAI
11
+ from lec2note.types import NoteChunk
12
+
13
+ __all__ = ["Assembler"]
14
+
15
+ TEMPLATE = """# 讲座笔记
16
+
17
+ {content}
18
+ """
19
+
20
+
21
+ class Assembler: # noqa: D101
22
+ @staticmethod
23
+ def merge(chunks: List[NoteChunk]) -> str:
24
+ """Concatenate note chunks and wrap with template."""
25
+ body_parts = []
26
+ for c in chunks:
27
+ body_parts.append(c.note)
28
+ raw_md = "\n\n".join(body_parts)
29
+ logger.info("[Assembler] merging %d note chunks", len(chunks))
30
+
31
+ # Optional LLM post-polishing, configured via environment variables; falls back to the raw concatenation on failure
32
+ logger.info("[Assembler] polishing with LLM…")
33
+ try:
34
+ if not os.getenv("OPENAI_API_KEY"):
35
+ raise EnvironmentError("OPENAI_API_KEY not set")
36
+
37
+ client=OpenAI(
38
+ base_url=os.getenv("OPENAI_API_BASE"),
39
+ api_key=os.getenv("OPENAI_API_KEY"),
40
+ )
41
+ response = client.chat.completions.create(
42
+ model=os.getenv("OPENAI_MODEL", "gpt-4o-mini"),
43
+ temperature=0.3,
44
+ messages=[
45
+ {
46
+ "role": "user",
47
+ "content": (
48
+ "You are an expert academic editor and content synthesizer. Your task is to transform a collection of fragmented and repetitive lecture notes into a single, coherent, and logically structured study guide.\n\n"
49
+ "**Context:** These notes were generated by summarizing different segments of a single video lecture. As a result, they are not chronologically ordered and contain significant overlap and redundancy.\n\n"
50
+ "**Primary Goal:** Create a comprehensive, well-organized, and de-duplicated final document from the provided fragments.\n\n"
51
+ "**Key Instructions:**\n"
52
+ "1. **De-duplicate and Consolidate:** Identify all repetitive definitions and explanations. Merge them into a single, comprehensive section for each core concept. For instance, fundamental terms like 'State vs. Observation', 'Policy', and the notation aside (s_t vs x_t) are likely defined multiple times; these must be consolidated.\n"
53
+ "2. **Reorganize and Structure:** Do NOT preserve the original order. Instead, create a new, logical structure for the entire document. Use clear headings and subheadings (e.g., using Markdown's #, ##, ###) to build a clear narrative, starting from fundamental definitions and progressing to more complex topics.\n"
54
+ "3. **Synthesize and Enhance:** Where different fragments explain the same concept with slightly different examples or details (e.g., one note uses a 'cheetah' example, another uses a 'robot'), synthesize these details to create a richer, more complete explanation under a single heading.\n"
55
+ "4. **Polish and Format:** Ensure the final text is grammatically correct, flows naturally, and uses consistent, clean Markdown formatting (e.g., for tables, code blocks, and mathematical notation).\n\n"
56
+ "**Constraint:** Ensure all unique concepts and key details from the original notes are preserved in the final document. The goal is to lose redundancy, not information.\n\n"
57
+ "Here are the fragmented notes to process:\n\n"
58
+ f"{raw_md}"
59
+ ),
60
+ }
61
+ ],
62
+ )
63
+ polished = response.choices[0].message.content.strip()
64
+ except Exception: # noqa: BLE001
+ polished = raw_md # fall back to the unpolished concatenation
68
+
69
+ logger.info("[Assembler] final document length %d chars", len(polished))
70
+ return TEMPLATE.format(content=polished)
71
+
72
+ @staticmethod
73
+ def save(markdown: str, output: str | Path) -> Path:
74
+ out_path = Path(output).expanduser().resolve()
75
+ out_path.write_text(markdown, encoding="utf-8")
76
+ return out_path
lec2note/types.py ADDED
@@ -0,0 +1,36 @@
1
+ """Shared dataclass definitions used across modules."""
2
+
3
+ from __future__ import annotations
4
+
5
+ from dataclasses import dataclass, field
6
+ from pathlib import Path
7
+ from typing import List, Dict, Any
8
+
9
+ __all__ = [
10
+ "SlideChunk",
11
+ "FinalChunk",
12
+ "NoteChunk",
13
+ "Chunk",
14
+ ]
15
+
16
+
17
+ @dataclass
18
+ class SlideChunk: # noqa: D101
19
+ start: float # seconds
20
+ end: float
21
+
22
+
23
+ @dataclass
24
+ class FinalChunk(SlideChunk): # noqa: D101
25
+ images: List[Path] = field(default_factory=list)
26
+ subtitles: List[int] = field(default_factory=list) # indices in subtitles list
27
+
28
+
29
+ @dataclass
30
+ class NoteChunk: # noqa: D101
31
+ note: str
32
+ images: List[Path]
33
+
34
+
35
+ # alias used by older code
36
+ Chunk = FinalChunk
lec2note/utils/__pycache__/logging_config.cpython-310.pyc ADDED
Binary file (849 Bytes). View file
 
lec2note/utils/logging_config.py ADDED
@@ -0,0 +1,25 @@
1
+ """Global logging configuration for Lec2Note2.
2
+
3
+ Call ``setup_logging()`` once at program start to enable consistent log format.
4
+ """
5
+
6
+ from __future__ import annotations
7
+
8
+ import logging
9
+ import sys
10
+
11
+ __all__ = ["setup_logging"]
12
+
13
+
14
+ def setup_logging(level: int = logging.INFO) -> None: # noqa: D401
15
+ """Configure root logger with sane defaults if not already configured."""
16
+ if logging.getLogger().handlers:
17
+ # Already configured by caller / framework
18
+ return
19
+
20
+ logging.basicConfig(
21
+ level=level,
22
+ format="[%(asctime)s] %(levelname)-8s %(name)s: %(message)s",
23
+ datefmt="%Y-%m-%d %H:%M:%S",
24
+ handlers=[logging.StreamHandler(sys.stdout)],
25
+ )
lec2note/vision/__pycache__/keyframe_extractor.cpython-310.pyc ADDED
Binary file (2.9 kB). View file
 
lec2note/vision/keyframe_extractor.py ADDED
@@ -0,0 +1,85 @@
1
+ """Keyframe extraction based on frame similarity.
2
+
3
+ For lecture-slide videos, most inter-frame differences come from slide transitions. This module
+ detects such "scene changes" by combining a structural-similarity (SSIM) check with a perceptual-hash
+ (dHash) comparison, and saves a keyframe at each detected transition.
5
+ """
6
+
7
+ from __future__ import annotations
8
+
9
+ import cv2 # type: ignore
10
+ import numpy as np
11
+ from skimage.metrics import structural_similarity as ssim # type: ignore
12
+ import imagehash
13
+ from PIL import Image
14
+ import logging
15
+ from pathlib import Path
16
+ from typing import List
17
+
18
+ __all__ = ["KeyframeExtractor"]
19
+
20
+
21
+ class KeyframeExtractor:
22
+ """Extract keyframes when similarity drops below threshold."""
23
+
24
+ @staticmethod
25
+ def _is_new_slide(prev: np.ndarray, curr: np.ndarray, *, ssim_th: float = 0.95, dhash_th: int = 8) -> bool:
26
+ """Return True if curr frame is considered a new slide."""
27
+ # SSIM on down-scaled grayscale
28
+ gray_prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
29
+ gray_curr = cv2.cvtColor(curr, cv2.COLOR_BGR2GRAY)
30
+ ssim_val = ssim(gray_prev, gray_curr)
31
+ if ssim_val < ssim_th:
32
+ return True
33
+
34
+ # perceptual hash (dHash) comparison
35
+ h1 = imagehash.dhash(Image.fromarray(prev))
36
+ h2 = imagehash.dhash(Image.fromarray(curr))
37
+ if h1 - h2 > dhash_th:
38
+ return True
39
+ return False
40
+
41
+ @classmethod
42
+ def run(cls, video_fp: str | Path, threshold: float = 0.6, output_dir: str | Path | None = None) -> List[Path]:
43
+ """Return list of saved keyframe image paths.
44
+
45
+ Parameters
46
+ ----------
47
+ video_fp : str | Path
48
+ 视频文件路径。
49
+ threshold : float
50
+ 相似度阈值,低于此值认定为新幻灯片。
51
+ output_dir : str | Path, optional
52
+ 保存关键帧的目录,默认与视频同级的 ``frames`` 目录。
53
+ """
54
+ video_path = Path(video_fp).expanduser().resolve()
55
+ if not video_path.exists():
56
+ raise FileNotFoundError(video_path)
57
+
58
+ save_dir = Path(output_dir or video_path.parent / "frames").resolve()
59
+ save_dir.mkdir(parents=True, exist_ok=True)
60
+
61
+ cap = cv2.VideoCapture(str(video_path))
62
+ success, prev_frame = cap.read()
63
+ if not success:
64
+ cap.release()
65
+ raise RuntimeError("Cannot read video")
66
+
67
+ frame_idx = 0
68
+ saved_paths: List[Path] = []
69
+
70
+ while True:
71
+ success, frame = cap.read()
72
+ if not success:
73
+ break
74
+ if cls._is_new_slide(prev_frame, frame, ssim_th=threshold):
75
+ # new slide: save current frame
76
+ out_fp = save_dir / f"kf_{frame_idx:04d}.png"
77
+ cv2.imwrite(str(out_fp), frame)
78
+ saved_paths.append(out_fp)
79
+ prev_frame = frame
80
+
81
+ frame_idx += 1
82
+ logging.getLogger(__name__).info("[KeyframeExtractor] saved %d keyframes to %s", len(saved_paths), save_dir)
83
+
84
+ cap.release()
85
+ return saved_paths
lec2note/vision/ocr_processor.py ADDED
@@ -0,0 +1,30 @@
1
+ """OCR processor using PaddleOCR."""
2
+
3
+ from __future__ import annotations
4
+
5
+ from pathlib import Path
6
+ from typing import List
7
+
8
+ from paddleocr import PaddleOCR # type: ignore
9
+
10
+ __all__ = ["OcrProcessor"]
11
+
12
+
13
+ class OcrProcessor: # noqa: D101
14
+ # Initialize the model once so the GPU can be reused
15
+ _ocr = PaddleOCR(use_angle_cls=True, lang="ch")
16
+
17
+ @classmethod
18
+ def run(cls, img_fp: str | Path, lang: str = "ch") -> str:
19
+ """Perform OCR and return concatenated text."""
20
+ img_path = Path(img_fp).expanduser().resolve()
21
+ if not img_path.exists():
22
+ raise FileNotFoundError(img_path)
23
+
24
+ result = cls._ocr.ocr(str(img_path), cls=True)
25
+ # PaddleOCR returns nested list
26
+ texts: List[str] = []
27
+ for line in result: # type: ignore
28
+ for _box, (text, _score) in line: # each detection is [bounding_box, (text, confidence)]
+ texts.append(text)
30
+ return "\n".join(texts)
requirements.txt ADDED
@@ -0,0 +1,19 @@
1
+ fastapi==0.110.2
2
+ uvicorn==0.28.1
3
+ pydantic==2.7.1
4
+ numpy==1.26.4
5
+ opencv-python==4.6.0.66
6
+ pillow==10.3.0
7
+ paddleocr==2.7.0.3
8
+ torch<=2.3.0
9
+ prefect==2.19.6 # latest stable as of 2024-06, 2.17.2 not published
10
+ openai-whisper==20231117 # wheels available on PyPI
11
+ python-multipart==0.0.9
12
+ rich==13.7.1
13
+ pytest==8.2.0
14
+ sentence-transformers==2.7.0
15
+ openai>=1.35.0 # 新 SDK,支持 OpenRouter & httpx 0.28+
16
+ httpx>=0.28,<0.30
17
+ anyio>=3.7,<4.0
18
+ scikit-image==0.25.1
19
+ imagehash==4.3.1