Anthropic has disputed claims that its recently launched Claude Fable 5 artificial intelligence model was successfully jailbroken, after a well-known AI researcher said they had bypassed the model’s safety restrictions shortly after its release. The company stated that the reported activity did not represent a genuine compromise of Claude Fable 5’s core protections and reaffirmed its confidence in the safeguards introduced alongside the model’s deployment. The dispute has drawn attention within the artificial intelligence and cybersecurity communities because Claude Fable 5 was launched with additional protections intended to reduce misuse in sensitive, high-risk fields.
Claude Fable 5 became generally available earlier this week after Anthropic introduced it as a Mythos-class artificial intelligence model designed with layered security measures. According to the company, the system contains restrictions aimed at limiting use in domains such as cybersecurity and biology, where advanced AI capabilities could be misused to develop exploits, harmful materials, or chemical-related threats. Anthropic explained that in highly sensitive areas, Claude Fable 5 automatically shifts to the less capable Claude Opus 4.8 model to reduce the risks associated with dangerous requests. Before launch, the company said, it carried out extensive internal and external red-teaming exercises to evaluate the model’s resistance to prompt manipulation and other attempts to weaken its protective mechanisms.
Shortly after the model became publicly available, an individual using the online name Pliny the Liberator claimed on the social media platform X that they had successfully “liberated” Claude Fable 5 through sophisticated multi-agent prompting methods. The researcher, who is known for exploring jailbreak techniques against artificial intelligence systems, alleged that the prompts produced useful responses on sensitive topics including cybersecurity, chemistry, explosives, and psychological influence. Screenshots were later published to support the claims, alongside what was described as an internal system prompt connected to Claude Fable 5. The shared material reportedly included instructions governing the model’s personality, refusal logic, fallback behaviors, safety classifications, and tone-related guidelines, sparking discussion over whether the system had been manipulated into exceeding its intended limitations.
Responding to the claims, Anthropic said the examples shared publicly did not demonstrate a successful jailbreak of Claude Fable 5’s most important safety systems. Company representatives explained that a legitimate jailbreak would require bypassing independent safeguards and generating meaningful assistance capable of supporting high-risk activities such as advanced cyberattacks or biological weapon development. According to Anthropic, the reported prompting approach relied on encouraging the model to continue interacting after conversational refusals, which it described as a long-standing limitation observed across many large language models. The company further stated that its strongest safety protections operate independently, through classifier systems separated from the conversational model itself, meaning that bypassing a refusal would not disable critical security layers. After reviewing examples provided by the researcher, Anthropic said some outputs were not generated by Claude Fable 5, while others contained only publicly available information and did not provide meaningful support for harmful real-world activity. The company added that broader reviews of recent usage revealed no evidence that its safeguards had been effectively bypassed to create dangerous outputs.
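To illustrate the general idea of a safeguard that sits outside the conversational model, the following is a minimal Python sketch. It is not Anthropic’s implementation, whose details have not been published. Every name in it (`independent_classifier`, `compliant_model`, `guarded_reply`, `BLOCKED_TERMS`) is hypothetical, and the keyword check merely stands in for a trained classifier. The sketch assumes a conversational model whose own refusals have already been talked around, and shows that a separate check on the input and the output can still block a response.

```python
from typing import Callable

# Hypothetical placeholder list; a real system would use a trained
# classifier, not keyword matching.
BLOCKED_TERMS = ("restricted-topic",)


def independent_classifier(text: str) -> bool:
    """Return True if the text should be blocked (toy keyword check)."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED_TERMS)


def compliant_model(prompt: str) -> str:
    """Stand-in for a model whose conversational refusals were bypassed:
    it answers every prompt it receives."""
    return f"Answer to: {prompt}"


def guarded_reply(prompt: str, model: Callable[[str], str]) -> str:
    """Run the model between two checks that do not depend on it."""
    # Input check: runs before the model sees the prompt.
    if independent_classifier(prompt):
        return "[blocked by input classifier]"
    reply = model(prompt)
    # Output check: runs on whatever the model produced.
    if independent_classifier(reply):
        return "[blocked by output classifier]"
    return reply
```

In this toy setup, `guarded_reply("weather today", compliant_model)` passes through unchanged, while a prompt containing the placeholder term is stopped before the model is called, regardless of how willing the model itself is to answer.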





