The wall
Not everything I make has AI in it, but enough does that the same question keeps coming up: which model should I actually use? There are comparison tools baked into various platforms, but nothing standalone, nothing free, and nothing that lets you throw a real task at a stack of models with your own files and prompts.
I just wanted that.
How it works
Pick your models, write a prompt, optionally upload files and schemas, hit run. Every model streams its response side by side, colour-coded so you can tell them apart. When they finish, an AI judge (Claude Sonnet 4.6 by default, but you can swap it) reads the responses blind and scores them on accuracy, clarity, and completeness. Responses are shuffled before the judge sees them, so a model isn’t punished for being in the wrong column. Scores accumulate on a leaderboard across runs. Cost per response is tracked down to fractions of a cent.
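The blind-judging step can be sketched roughly like this. It is an illustration, not the app's actual code: the `ModelResponse` and `Score` shapes, the "Response A" labels, and the function names are all assumptions. The idea is to shuffle, swap model IDs for neutral labels, and keep a private map so scores can be attached to the right model afterwards.

```typescript
// Hypothetical shapes; the real app's types are not shown in this post.
type ModelResponse = { modelId: string; text: string };
type Score = { accuracy: number; clarity: number; completeness: number };

// Fisher-Yates shuffle that returns a new array. `rand` is injectable for tests.
function shuffle<T>(items: readonly T[], rand: () => number = Math.random): T[] {
  const out = [...items];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

// Shuffle the responses and replace model IDs with neutral labels.
// Only `anonymous` would be sent to the judge; `labelToModel` stays private.
function blind(responses: readonly ModelResponse[], rand: () => number = Math.random) {
  const labelToModel = new Map<string, string>();
  const anonymous = shuffle(responses, rand).map((r, i) => {
    const label = `Response ${String.fromCharCode(65 + i)}`; // A, B, C, ...
    labelToModel.set(label, r.modelId);
    return { label, text: r.text };
  });
  return { anonymous, labelToModel };
}

// Map the judge's per-label scores back onto model IDs.
function unblind(
  scores: Record<string, Score>,
  labelToModel: Map<string, string>,
): Record<string, Score> {
  const byModel: Record<string, Score> = {};
  for (const [label, score] of Object.entries(scores)) {
    const modelId = labelToModel.get(label);
    if (modelId !== undefined) byModel[modelId] = score;
  }
  return byModel;
}
```

Because the judge only ever sees labels in a random order, neither a model's name nor its column position can leak into the score.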
Under the hood
Next.js 16 with App Router, Tailwind, Framer Motion. OpenRouter as the gateway, which gives one API key access to every provider. Deployed on Cloudflare Workers, squeezed under the 2 MiB limit. No database, no auth, no sign-up. Paste a key and go.
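To show what "one API key, every provider" means in practice, here is a minimal sketch of building a streaming request for OpenRouter's OpenAI-compatible chat completions endpoint. The `buildRequest` helper and the `ChatMessage` type are names made up for this example, and the model slug in the usage comment is a placeholder, not a real model.

```typescript
// Hypothetical helper; the app's real request code is not shown in this post.
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

const OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions";

// Build the URL and fetch options for one model. Only `model` changes
// between providers; the key and the request shape stay the same.
function buildRequest(apiKey: string, model: string, messages: ChatMessage[]) {
  return {
    url: OPENROUTER_URL,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      // `stream: true` asks for tokens as they are generated.
      body: JSON.stringify({ model, messages, stream: true }),
    },
  };
}

// Usage (placeholder slug):
// const { url, init } = buildRequest(key, "provider/model-name", messages);
// const res = await fetch(url, init);
```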
The interesting bit was the streaming. One browser request fans out to several models on the server, and each model’s tokens have to make it back through a single stream tagged correctly. Each model gets its own fetch on the server, every chunk gets stamped with the model ID, and the client splits them apart on arrival. If the user closes the tab, every upstream fetch cancels cleanly. Took longer than the rest of the app combined.
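A stripped-down sketch of that fan-out, under a few assumptions: chunks are framed as one JSON object per line, `open` stands in for whatever starts an upstream request (it should pass the signal on to `fetch`), and the runtime calls the stream's `cancel()` when the client disconnects. None of these names come from the real codebase.

```typescript
// All names here are hypothetical; the app's real code is not shown in this post.
type TaggedChunk =
  | { modelId: string; text: string }
  | { modelId: string; done: true }
  | { modelId: string; error: string };

// Opens one upstream token stream for one model; must honour `signal`.
type OpenStream = (modelId: string, signal: AbortSignal) => Promise<ReadableStream<Uint8Array>>;

const encoder = new TextEncoder();

// One JSON object per line, so the client can split on "\n".
function encodeChunk(chunk: TaggedChunk): Uint8Array {
  return encoder.encode(JSON.stringify(chunk) + "\n");
}

// Server side: merge several upstream streams into one tagged stream.
function fanOut(modelIds: string[], open: OpenStream): ReadableStream<Uint8Array> {
  const abort = new AbortController();
  return new ReadableStream<Uint8Array>({
    start(controller) {
      const pump = async (modelId: string): Promise<void> => {
        const decoder = new TextDecoder();
        try {
          const reader = (await open(modelId, abort.signal)).getReader();
          while (!abort.signal.aborted) {
            const { done, value } = await reader.read();
            if (done) break;
            const text = decoder.decode(value, { stream: true });
            controller.enqueue(encodeChunk({ modelId, text }));
          }
          if (!abort.signal.aborted) controller.enqueue(encodeChunk({ modelId, done: true }));
        } catch (err) {
          // Report real failures; stay quiet if the client has already left.
          if (!abort.signal.aborted) controller.enqueue(encodeChunk({ modelId, error: String(err) }));
        }
      };
      // Close the merged stream only once every model has finished.
      void Promise.all(modelIds.map(pump)).then(() => {
        if (!abort.signal.aborted) controller.close();
      });
    },
    // Runs when the consumer goes away; one abort cancels every upstream.
    cancel() {
      abort.abort();
    },
  });
}

// Client side: turn arriving bytes back into tagged chunks, tolerating
// network chunk boundaries that fall in the middle of a line.
class ChunkSplitter {
  private buffer = "";
  private decoder = new TextDecoder();

  push(bytes: Uint8Array): TaggedChunk[] {
    this.buffer += this.decoder.decode(bytes, { stream: true });
    const lines = this.buffer.split("\n");
    this.buffer = lines.pop() ?? ""; // keep the unfinished tail for next time
    return lines.filter((l) => l.length > 0).map((l) => JSON.parse(l) as TaggedChunk);
  }
}
```

Sharing a single `AbortController` across every upstream request is what makes the tab-close case simple: one `abort()` call and all of them stop. The client feeds each network read into `ChunkSplitter.push` and routes the resulting chunks to the right column by `modelId`.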
The hackathon turned out to be a scam, which I learnt after putting most of my time into the build. A small group was lifting ideas and stretching the timeline until they got caught. Model Bench just kept being useful. A few hundred people a month use it now, mostly me.