What is the level of agreement between LLM judges (e.g., GPT-o3 mini judge) and human clinicians on objective criteria, as measured by ICC and score differences?
Research Question
What is the level of agreement between LLM judges (e.g., GPT-o3 mini judge) and human clinicians on objective criteria, as measured by ICC and score differences?
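As a minimal sketch of the two measures named in the question, the Python below computes an intraclass correlation coefficient and a mean score difference for paired ratings. It assumes the ICC(2,1) form (two-way random effects, absolute agreement, single rater); the question does not say which ICC variant is intended, so that choice is an assumption. The function names `icc2_1` and `mean_difference`, the column order (LLM judge first, clinician second), and the toy scores are all hypothetical and serve only as illustration.

```python
def icc2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    scores: list of rows, one row per rated item, one column per rater.
    The ICC variant is an assumption; the source does not specify one.
    """
    n = len(scores)        # number of rated items
    k = len(scores[0])     # number of raters
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares: items (rows), raters (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def mean_difference(scores):
    """Mean of (LLM judge score - clinician score) over all items.

    Assumes column 0 holds the LLM judge and column 1 the clinician.
    """
    return sum(row[0] - row[1] for row in scores) / len(scores)


if __name__ == "__main__":
    # Made-up toy ratings: [llm_judge_score, clinician_score] per item.
    toy = [[4, 4], [3, 2], [5, 5], [2, 3], [4, 5]]
    print("ICC(2,1):", icc2_1(toy))
    print("Mean difference (judge - clinician):", mean_difference(toy))
```

Reporting both numbers is useful because they capture different things: a judge that scores every item exactly one point higher than the clinician still ranks the items identically, yet an absolute-agreement ICC penalizes the offset and the mean difference exposes it directly.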
2026