selfradiance 3 days ago [-]
[flagged]
JitseLambrichts 2 days ago [-]
That's a valid point. While the GSM8K gains show promise for structured reasoning, we're also curious to see how the council approach scales to more open-ended and less structured tasks like summarization. I'm planning to run some tests on those scenarios once I have a bit more free time to properly evaluate the results.
BloodAndCode 3 days ago [-]
[flagged]
JitseLambrichts 2 days ago [-]
Great question. Currently, we don’t track a formal disagreement rate between the judge and candidate 1, so I can’t give a reliable percentage yet. By design, the judge synthesizes advisor outputs and is instructed not to introduce new ideas, which usually leads to refinement rather than completely new conclusions. That said, advisors run independently and can produce different lines of reasoning, so the judge can still diverge from candidate 1 when another advisor’s argument is stronger. This keeps the final output grounded in the council’s collective input rather than the judge acting as a standalone model.
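For anyone who wants to measure this themselves, here is a minimal sketch of how a disagreement rate could be computed from logged runs. It is not the project's code: `CouncilRun`, `normalize`, `judge_disagreement_rate`, and `judge_novel_rate` are hypothetical names, and it assumes each run logs the advisors' final answers (candidate 1 first) alongside the judge's final answer, compared by simple string normalization as would suit GSM8K-style numeric answers.

```python
from dataclasses import dataclass


@dataclass
class CouncilRun:
    """One logged council run (hypothetical record layout)."""
    candidate_answers: list[str]  # advisor final answers, candidate 1 first
    judge_answer: str             # final answer emitted by the judge


def normalize(answer: str) -> str:
    """Crude normalization so that '1,000' and ' 1000 ' compare equal."""
    return answer.strip().lower().replace(",", "")


def judge_disagreement_rate(runs: list[CouncilRun]) -> float:
    """Fraction of runs where the judge's answer differs from candidate 1."""
    if not runs:
        return 0.0
    differing = sum(
        1 for r in runs
        if normalize(r.judge_answer) != normalize(r.candidate_answers[0])
    )
    return differing / len(runs)


def judge_novel_rate(runs: list[CouncilRun]) -> float:
    """Fraction of runs where the judge's answer matches no advisor at all.

    If the judge really only synthesizes advisor output, this should stay
    close to zero.
    """
    if not runs:
        return 0.0
    novel = sum(
        1 for r in runs
        if normalize(r.judge_answer)
        not in {normalize(a) for a in r.candidate_answers}
    )
    return novel / len(runs)


# Made-up example data, purely to show the calls.
runs = [
    CouncilRun(["42", "41"], "42"),       # judge agrees with candidate 1
    CouncilRun(["10", "12"], "12"),       # judge sides with another advisor
    CouncilRun(["7", "7"], "8"),          # judge answer matches no advisor
    CouncilRun(["1,000", "1000"], "1000"),  # equal after normalization
]
print(judge_disagreement_rate(runs))  # 0.5
print(judge_novel_rate(runs))         # 0.25
```

Tracking the second number next to the first would separate "the judge preferred another advisor" from "the judge introduced something none of the advisors said", which is the distinction described above. Exact string matching is only a stand-in; open-ended tasks would need a different comparison.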