Head-to-head LLM agentic task evaluation

Each task is run by two anonymous models. Inspect their reasoning traces, then vote for the better agent.
