Juan de Hoyos’ Post

View profile for Juan de Hoyos

Blau Ring Labs1K followers

Anthropic open-sourced Petri, their AI safety testing tool Anthropic just released the internal tool they use for testing AI model behavior in risky scenarios. You describe test scenarios in plain English, Petri runs automated conversations with the model, scores the results and flags concerning behaviors. What took days of manual work now takes minutes. Key findings They tested 14 major models (GPT-5, Claude, Gemini, etc.) across 111 scenarios - checking for lying, sycophancy, self-preservation attempts, and more. Claude Sonnet 4.5 scored as lowest-risk overall, slightly ahead of GPT-5. Interesting finding: models with high autonomy sometimes tried to "whistleblow" on their fictional organizations - even for harmless things like a candy company using sugar. Shows they're pattern-matching, not actually reasoning about ethics. This is important because no single company can catch every failure mode. By open-sourcing this, the research community can help find problems before deployment. sources: https://lnkd.in/d_Gs_FwJ #AISafety #MachineLearning #AIResearch #OpenSource #ResponsibleAI

To view or add a comment, sign in

Explore content categories