Write a math proof in plain English, voice, or a photo of handwritten work. Gemini 3.1 Pro translates it to Lean 4 and a real theorem prover decides if it's right. Compiler-verified, not vibes-verified.
A natural-language proof can be wrong in ways that read fluent. A skipped case. A quantifier flipped. An induction step that quietly assumes what it set out to prove. A language model will tell you it looks fine. AgentQED asks a theorem prover instead.
You give it a proof. The agent translates the proof into Lean 4, runs the actual Lean compiler in an isolated sandbox, and reports back. If the compiler accepts the proof, it is verified by the same tool used to formalize the polynomial Freiman-Ruzsa conjecture. If it rejects the proof, the agent reads the compiler's error, edits the Lean, and tries again. Up to twelve times.
You type:
Prove that for all natural numbers n, the sum
0 + 1 + ⋯ + n equals n(n+1)/2.
Gemini writes this, and the sandbox compiles it on the first try:
def sumTo : Nat → Nat | 0 => 0 | n + 1 => (n + 1) + sumTo n theorem sum_formula (n : Nat) : 2 * sumTo n = n * (n + 1) := by induction n with | zero => rfl | succ k ih => simp [sumTo, Nat.mul_add, Nat.add_mul] omega
You did not need to know what omega does to get the
verified result. (It dispatches linear arithmetic.) The Key Insight
and proof structure are shown first; the Lean code is collapsed by
default. Understanding comes before syntax.
Because language models are confident liars about math. A model will tell you your proof of Fermat's Last Theorem is correct. It will say that in a complete sentence. It will keep saying it.
The Lean compiler does not have a personality. If your proof has a gap, the term it expects to typecheck does not typecheck, and the error points at the line. That error is what the agent reads to figure out what to fix. Everything else is plumbing to get you there without learning Lean first.
src/.