SQLSwarm: Gating Multi-Agent SQL RL

See how a multi-agent SQL optimizer learned to "cheat" by hacking its speed reward. Covers the system, its failure cases, and how to gate generative RL on correctness.

Overview

SQLSwarm is a multi-agent system that rewrites and optimizes SQL queries, built on a PPO-fine-tuned LLM driving a LangGraph agent loop. I’ll demo it live: the agent-to-agent handoff as the swarm proposes and critiques rewrites, real reward curves from training (2,500 PPO episodes on ~99K records, Qwen2.5-Coder-1.5B), and the eval logs where every rewrite gets checked for semantic equivalence against the original query before it’s accepted. You’ll see the working system, the architecture, and the failure cases I had to design around.
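As a rough illustration of what gating a speed reward on correctness can look like, here is a minimal sketch. It is not the SQLSwarm code: the function names (`run_query`, `gated_reward`), the penalty value, the clipping range, and the toy `orders` table are all assumptions made for the example. It also stands in for a real semantic-equivalence check with a much weaker test, comparing result rows on one sample SQLite database, and it ignores row order, which would be wrong for queries that use ORDER BY.

```python
import sqlite3
import time
from collections import Counter


def run_query(conn, sql):
    """Run sql; return (rows as an order-insensitive multiset, elapsed seconds)."""
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    elapsed = time.perf_counter() - start
    return Counter(rows), elapsed


def gated_reward(conn, original_sql, rewrite_sql, fail_penalty=-1.0):
    """Pay a speed reward only when the rewrite returns the original's rows."""
    expected, base_time = run_query(conn, original_sql)
    try:
        actual, new_time = run_query(conn, rewrite_sql)
    except sqlite3.Error:
        return fail_penalty  # rewrite is not even valid SQL
    if actual != expected:
        return fail_penalty  # faster but wrong: the gate blocks the reward
    # Relative speedup over the original, clipped to [-1, 1] (assumed shaping).
    speedup = (base_time - new_time) / max(base_time, 1e-9)
    return max(-1.0, min(1.0, speedup))


# Hypothetical toy data, only to exercise the gate.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL);
INSERT INTO orders (customer, total) VALUES ('a', 10.0), ('b', 25.0), ('a', 40.0);
""")
original = "SELECT customer, SUM(total) FROM orders GROUP BY customer"
cheat = "SELECT customer, SUM(total) FROM orders GROUP BY customer LIMIT 1"

print(gated_reward(conn, original, cheat))  # -1.0: fewer rows than the original
```

The point of the gate is that a rewrite which drops rows to run faster never sees the speed term at all, so the policy cannot trade correctness for latency.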

Links

Tech stack