Apply for community grant: Academic project (gpu and storage)

#1
by Vilin97 - opened
University of Washington Math AI Lab org

Theorem Search: A Semantic Retrieval Engine for Mathematical Theorems

UW Math AI Lab
Author: Vasily Ilin
Affiliation: UW Math AI Lab


1. Motivation and Background

Mathematical research is one of the few disciplines where knowledge is completely self-contained: every result can be expressed as a finite, verifiable statement—the theorem. Yet finding existing results remains surprisingly inefficient. Mathematicians rely on keyword searches, informal recall, or LLMs, none of which provide reliable, fine-grained access to theorems themselves. The lack of a dedicated retrieval tool often leads to duplicated effort, missed connections across subfields, and slower progress in proving new results.

At the same time, large-scale natural-language models and embedding models have transformed how text is represented and retrieved. However, these advances have largely bypassed mathematics, where data is structured but symbol-heavy, and where statements must be interpreted precisely rather than approximately, and citation is necessary.
We aim to bring modern retrieval techniques to mathematics by creating a semantic search engine for theorems—a tool that allows researchers to instantly locate relevant results, definitions, and lemmas from the global mathematical corpus of arXiv papers.
The interface will be similar to LeanSearch.


2. Research Objectives

Our goal is to build and evaluate the first large-scale theorem search engine. The system will let users enter a natural-language or symbolic query and receive semantically similar theorems from mathematical literature. Specifically, we will:

  • Collect and structure a large corpus of theorems from the arXiv mathematics repository.
  • Extract theorem statements (Theorem, Lemma, Proposition, Corollary) and their immediate context.
  • Embed theorem text using state-of-the-art open-source text embedders, storing the representations in a vector database for efficient retrieval.
  • Develop an online interface where users can search for theorems by meaning rather than by keyword.
  • Release the dataset and website publicly to support both mathematical and AI research communities.

This project is both a scientific and infrastructural contribution: it produces a reusable dataset, an operational search tool, and empirical insights into how modern embeddings represent mathematical content.


3. Methodology

3.1 Theorem Extraction

We will scrape arXiv papers in mathematics and process their LaTeX source to extract all theorem-like environments.
Each extracted entry will include the statement, local definitions, paper metadata, and section context.

3.2 Embedding and Indexing

Each theorem will be represented by a fixed-length embedding vector using a strong general-purpose text encoder such as
math-similarity/Bert-MLM_arXiv-MP-class_zbMath.
We will benchmark different encoders for mathematical text and select the one that best preserves meaning under cosine similarity.
All embeddings will be indexed in a vector database such as FAISS or ScaNN, enabling sub-second retrieval across millions of theorems.

3.3 Query and Inference Pipeline

At inference time, a user’s query (for example, “existence of weak solutions to elliptic PDEs with measure data”) is encoded into the same embedding space.
Optionally, the user may set filters on the publication year and the math subfield.
The system then retrieves and ranks the most semantically similar theorems, returning their text (formatted as LaTeX), the paper title, and the arXiv link.

3.4 Evaluation

We will evaluate retrieval quality using human-annotated relevance judgments and comparison against keyword search.
Metrics such as precision@k, recall, and mean reciprocal rank will quantify how well the system retrieves conceptually related results.
A small user study with active mathematicians will assess perceived usefulness in literature review and proof discovery.


4. Expected Outcomes and Impact

Accelerating Mathematical Research

Theorem Search will make it dramatically faster for mathematicians to identify related results, locate missing lemmas, and survey prior work.
Instead of scanning dozens of papers, a researcher can find conceptually similar theorems in seconds.
This lowers the cost of entering new areas, supports interdisciplinary connections, and speeds up the proof-writing process.

Enabling New AI Benchmarks

Mathematics provides an ideal testbed for measuring reasoning and factual accuracy in language models.
Our theorem corpus—paired with its semantic embeddings—can serve as a benchmarking and pretraining resource for models that aim to understand mathematical statements, logical entailment, and formal reasoning.
The project thus bridges pure mathematics and machine learning, supporting both communities.

Public Release

We will make the dataset of extracted theorems and the search website publicly available.
The interface will provide direct access to search results with links to the original arXiv papers, creating an open and enduring infrastructure for mathematical discovery.

Sign up or log in to comment