---
title: Web MCP
emoji: 🔎
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 5.36.2
app_file: app.py
pinned: false
short_description: Search & fetch the web with per-tool analytics
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/5f17f0a0925b9863e28ad517/tfYtTMw9FgiWdyyIYz6A6.png
---
# Web MCP Server
A Model Context Protocol (MCP) server that exposes two composable tools—`search` (Serper metadata) and `fetch` (single-page extraction)—alongside a live analytics dashboard that tracks daily usage for each tool. The UI runs on Gradio and can be reached directly or via MCP-compatible clients like Claude Desktop and Cursor.
## Highlights
- Dual MCP tools with shared rate limiting (`360 requests/hour`) and structured JSON responses.
- Daily analytics split by tool: the **Analytics** tab renders "Daily Search" (left) and "Daily Fetch" (right) bar charts covering the last 14 days.
- Persistent request counters keyed by UTC date and tool: `{"YYYY-MM-DD": {"search": n, "fetch": m}}`, with automatic migration from legacy totals.
- Pluggable storage: respects `ANALYTICS_DATA_DIR`, otherwise falls back to `/data` (if writable) or `./data` for local development.
- Ready-to-serve Gradio app with MCP endpoints exposed via `gr.api` for direct client consumption.
## Requirements
- Python 3.10 or newer (required by Gradio 5.x).
- Serper API key (`SERPER_API_KEY`) with access to the Search and News endpoints.
- Dependencies listed in `requirements.txt`, including `filelock` and `pandas` for analytics storage.
Install everything with:
```bash
pip install -r requirements.txt
```
## Configuration
1. Export your Serper API key:
```bash
export SERPER_API_KEY="your-api-key"
```
2. (Optional) Override the analytics storage path:
```bash
export ANALYTICS_DATA_DIR="/path/to/persistent/storage"
```
If unset, the app automatically prefers `/data` when available, otherwise `./data`.
3. (Optional) Control private/local address policy for `fetch`:
- `FETCH_ALLOW_PRIVATE` — set to `1`/`true` to disable the SSRF guard entirely (not recommended except for trusted, local testing).
- `FETCH_PRIVATE_ALLOWLIST` — comma/space separated host patterns allowed even if they resolve to private/local IPs, e.g.:
```bash
export FETCH_PRIVATE_ALLOWLIST="*.corp.local, my-proxy.internal"
```
If neither is set, the fetcher refuses URLs whose host resolves to private, loopback, link-local, multicast, reserved, or unspecified addresses. It also re-checks the final redirect target; a sketch of this kind of guard follows below.
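A minimal sketch of such a guard, assuming `fnmatch`-style allowlist patterns and standard-library address checks (not the app's exact implementation):
```python
import fnmatch
import ipaddress
import socket

def host_is_blocked(host: str, allowlist: list[str]) -> bool:
    """Return True if the host is not allowlisted and any resolved
    address is private/local."""
    # Allowlisted hosts bypass the check entirely.
    if any(fnmatch.fnmatch(host, pat) for pat in allowlist):
        return False
    for *_, sockaddr in socket.getaddrinfo(host, None):
        ip = ipaddress.ip_address(sockaddr[0])
        if (ip.is_private or ip.is_loopback or ip.is_link_local
                or ip.is_multicast or ip.is_reserved or ip.is_unspecified):
            return True
    return False
```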
The request counters live in `<DATA_DIR>/request_counts.json`, guarded by a file lock to support concurrent MCP calls.
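A sketch of the locked read-modify-write cycle, including the directory fallback described above (assumed structure, not the app's exact code):
```python
import json
import os
from pathlib import Path

from filelock import FileLock

def data_dir() -> Path:
    # Fallback order described above: env override, then /data, then ./data.
    override = os.environ.get("ANALYTICS_DATA_DIR")
    if override:
        return Path(override)
    return Path("/data") if os.access("/data", os.W_OK) else Path("./data")

def record(tool: str, day: str) -> None:
    path = data_dir() / "request_counts.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    # The lock file sits next to the JSON so concurrent MCP calls serialize.
    with FileLock(str(path) + ".lock"):
        counts = json.loads(path.read_text()) if path.exists() else {}
        bucket = counts.setdefault(day, {"search": 0, "fetch": 0})
        bucket[tool] = bucket.get(tool, 0) + 1
        path.write_text(json.dumps(counts, indent=2))
```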
## Running Locally
Launch the Gradio server (with MCP support enabled) via:
```bash
python app.py
```
This starts a local UI at `http://localhost:7860` and exposes the MCP SSE endpoint at `http://localhost:7860/gradio_api/mcp/sse`.
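To verify the endpoint is up, you can open the SSE stream directly; the exact events you see depend on the Gradio and MCP versions in use:
```bash
curl -N http://localhost:7860/gradio_api/mcp/sse
```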
### Connecting From MCP Clients
- **Claude Desktop** – update `claude_desktop_config.json`:
```json
{
  "mcpServers": {
    "web-search": {
      "command": "python",
      "args": ["/absolute/path/to/app.py"],
      "env": {
        "SERPER_API_KEY": "your-api-key"
      }
    }
  }
}
```
- **URL-based MCP clients** – run `python app.py`, then point the client to `http://localhost:7860/gradio_api/mcp/sse`.
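For example, a Cursor-style `mcp.json` typically looks like the sketch below; the exact file location and schema vary by client, and clients that only speak stdio can bridge to the SSE endpoint with a proxy such as `mcp-remote`:
```json
{
  "mcpServers": {
    "web-search": {
      "url": "http://localhost:7860/gradio_api/mcp/sse"
    }
  }
}
```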
## Tool Reference
### `search`
- **Purpose**: Retrieve metadata-only results from Serper (general web or news).
- **Inputs**:
- `query` *(str, required)* – search terms.
- `search_type` *("search" | "news", default "search")* – switch to `news` for recency-focused results.
- `num_results` *(int, default 4, range 1–20)* – number of hits to return.
- **Output**: JSON containing the query echo, result count, timing, and an array of entries with `position`, `title`, `link`, `domain`, and optional `source`/`date` for news.
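An illustrative response shape, based on the fields listed above (key names are approximations, not guaranteed to match the app's exact output):
```json
{
  "query": "model context protocol",
  "result_count": 2,
  "duration_ms": 412,
  "results": [
    {"position": 1, "title": "Example result", "link": "https://example.com/a", "domain": "example.com"},
    {"position": 2, "title": "Example news item", "link": "https://example.org/b", "domain": "example.org", "source": "Example News", "date": "2024-05-01"}
  ]
}
```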
### `fetch`
- **Purpose**: Download a single URL and extract the readable article text via Trafilatura.
- **Inputs**:
- `url` *(str, required)* – must start with `http://` or `https://`.
- `timeout` *(int, default 20 seconds)* – client timeout for the HTTP request.
- **Output**: JSON with the original and final URL, domain, HTTP status, title, ISO timestamp of the fetch, word count, cleaned `content`, and duration.
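An illustrative response shape (again, key names are approximations):
```json
{
  "url": "https://example.com/article",
  "final_url": "https://example.com/article",
  "domain": "example.com",
  "status": 200,
  "title": "Example Article",
  "fetched_at": "2024-05-01T12:00:00Z",
  "word_count": 843,
  "content": "Cleaned article text...",
  "duration_ms": 1287
}
```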
Both tools increment their respective analytics buckets on every invocation, including validation failures and rate-limit denials, ensuring the dashboard mirrors real traffic.
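For quick manual testing outside an MCP client, the tools can also be called through the regular Gradio API. The endpoint names below are assumptions derived from the tool names; check the app's "Use via API" page for the exact routes:
```python
from gradio_client import Client

client = Client("http://localhost:7860")

# Endpoint names (/search, /fetch) are assumed; verify them on the
# app's "Use via API" page if these don't match.
results = client.predict(query="model context protocol",
                         search_type="search",
                         num_results=4,
                         api_name="/search")
page = client.predict(url="https://example.com", timeout=20, api_name="/fetch")
print(results)
print(page)
```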
## Analytics Dashboard
Open the **Analytics** tab in the Gradio UI to inspect daily activity:
- **Daily Search Count** (left column) – bar chart for the past 14 days of `search` tool requests.
- **Daily Fetch Count** (right column) – bar chart for the past 14 days of `fetch` tool requests.
- Tooltips reveal the display label (e.g., `Sep 17`), raw count, and ISO date key.
Counts are stored as plain JSON, so the file can be backed up or moved to external storage for long-term tracking. Totals in the legacy integer-only format are migrated automatically on the first write.
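The migration itself could look like the sketch below; attributing legacy totals to `search` is an assumption, since the README does not say how the app splits them:
```python
def migrate_counts(counts: dict) -> dict:
    """Upgrade legacy {"YYYY-MM-DD": int} entries to per-tool dicts."""
    migrated = {}
    for day, value in counts.items():
        if isinstance(value, int):
            # Assumption: legacy totals predate the fetch tool, so they are
            # credited to search. The real app may split them differently.
            migrated[day] = {"search": value, "fetch": 0}
        else:
            migrated[day] = value
    return migrated
```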
## Rate Limiting & Error Handling
- Global moving-window limit of `360` requests per hour shared across both tools (powered by `limits`).
- Standardized error payloads for missing parameters, invalid URLs, Serper errors, HTTP failures, and rate-limit hits; each error path still increments the analytics counters.
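A minimal sketch of a shared moving-window limiter with the `limits` package (the app's actual storage backend and limiter key may differ):
```python
from limits import parse
from limits.storage import MemoryStorage
from limits.strategies import MovingWindowRateLimiter

limiter = MovingWindowRateLimiter(MemoryStorage())
HOURLY = parse("360/hour")

def allow_request() -> bool:
    # hit() records the request and returns False once the window is full.
    return limiter.hit(HOURLY, "global")
```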
## Troubleshooting
- **`SERPER_API_KEY is not set`** – export the key in the environment where the server runs.
- **`Rate limit exceeded`** – pause requests or reduce client concurrency.
- **Empty extraction** – some sites block bots; try another URL.
- **Storage permissions** – ensure the chosen data directory is writable; adjust `ANALYTICS_DATA_DIR` if necessary.
## Licensing & Contributions
Feel free to fork and adapt for your MCP workflows. Contributions are welcome—open a PR or issue with proposed analytics enhancements, additional tooling, or documentation tweaks.