Research, llm, model incrimination, misbehavior detection, safety
Researchers Propose Model Incrimination for Diagnosing LLM Misbehavior
Relevance Score: 5.8
Aditya Singh, Gerson Kroiz, Senthooran Rajamanoharan, Neel Nanda, and others publish a LessWrong piece titled "Why Did My Model Do That? Model Incrimination for Diagnosing LLM Misbehavior," proposing model incrimination approaches to diagnose LLM misbehavior.



