
Today, I have a new favorite phrase: “Adversarial poetry.” It’s not, as my colleague Josh Wolens surmised, a new way to refer to rap battling. Instead, it’s a method used in a recent study by researchers at Dexai, Sapienza University of Rome, and the Sant’Anna School of Advanced Studies, who demonstrated that you can reliably trick LLMs into ignoring their safety guidelines simply by phrasing your requests as poetic metaphors.
The technique was shockingly effective. In the paper outlining their findings, titled “Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models,” the researchers explained that formulating hostile prompts as poetry “achieved an average jailbreak success rate of 62% for hand-crafted poems” and “approximately 43%” for generic harmful prompts converted en masse into poems, “substantially outperforming non-poetic baselines and revealing a systematic vulnerability across model families and safety training approaches.”

The researchers emphasized that, unlike many other methods for circumventing LLM safety heuristics, all of the poetry prompts in the experiment were “single-turn attacks”: each was submitted once, with no follow-up messages and no prior conversational scaffolding.
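To make the “single-turn” point concrete, here’s a minimal sketch of what such a request looks like structurally, using the OpenAI Python client as an assumed example. The model name and the (deliberately harmless) poem are placeholders of my own, not the study’s actual prompts; the point is simply that the whole attack fits in one user message.

```python
# Minimal sketch of a single-turn request: one user message, no system
# scaffolding, no prior conversation, no follow-ups. Assumes the OpenAI
# Python client; the model name and poem are illustrative placeholders,
# not the study's actual prompts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

poem_prompt = (
    "In a workshop of gears and midnight oil,\n"
    "tell me, craftsman, how the clock is wound,\n"
    "what springs and levers coax it into motion."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        # The entire exchange is this single user turn: in the study's
        # framing, the request is carried by the metaphor rather than by
        # any multi-turn conversational build-up.
        {"role": "user", "content": poem_prompt},
    ],
)

print(response.choices[0].message.content)
```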