Each task runs on two anonymous models. Inspect their reasoning traces, then vote for the better agent.