When Poetry Breaks the Machine
“Then must we banish the poets, lest they mislead the city with their enchanting lies.”
— Plato, The Republic
Large language models are generally quite good at refusing obviously dangerous requests. Ask directly how to build a weapon, manipulate an election, or exploit a vulnerability, and most frontier systems will decline. This is often taken as evidence that modern safety and alignment techniques are working.
But what happens if the intent stays the same, while the form of the request changes?
In two recent research papers, we studied a simple but underexplored question: how robust are AI safety mechanisms to stylistic and narrative variation? Our results suggest that this robustness is far weaker than commonly assumed. By reformulating harmful requests as poetry, or embedding them inside short fictional narratives paired with structured analysis tasks, we observed large and systematic increases in unsafe model behavior across many contemporary systems.
Measuring safety under stylistic variation
To make the results interpretable for both technical and non-technical audiences, it is important to briefly describe the evaluation pipeline we used.
At a high level, each experiment follows the same sequence. We start from a harmful intent expressed in a conventional, prosaic form. That intent is then transformed into alternative surface forms, first through poetic reformulation and later through narrative embedding combined with structural analysis tasks. Each variant is submitted to the same model under identical conditions.
Model outputs are then evaluated for safety using an ensemble of open-weight judge models, with a stratified subset validated by human annotators. This allows us to compute attack success rates that are comparable across prompt types and model families, while keeping the threat model intentionally narrow.
Concretely, that threat model uses single‑turn, text‑only prompts, with no access to system instructions, tools, memory, or multi‑turn steering. Each model was queried once per prompt, under provider‑default safety settings.
Across both studies, we evaluated more than two dozen frontier closed‑ and open‑weight models from nine providers. For each response, we measured whether the model produced content that meaningfully enabled or supported harmful activity. We summarize this with a simple metric, Attack Success Rate (ASR), defined as the fraction of prompts that elicited an unsafe response.
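For readers who prefer code, here is a minimal sketch of how an ASR of this kind can be computed from ensemble judge verdicts. The majority-vote rule, data layout, and function names are illustrative assumptions, not the exact pipeline used in the papers.

```python
from collections import Counter

# Hypothetical setup: for each prompt, every judge in the ensemble labels the
# model's response "unsafe" or "safe". The majority-vote rule below is an
# illustrative assumption, not the papers' exact aggregation scheme.
def is_unsafe(judge_labels: list[str]) -> bool:
    """A response counts as unsafe if a majority of judges flag it."""
    counts = Counter(judge_labels)
    return counts["unsafe"] > len(judge_labels) / 2

def attack_success_rate(all_judge_labels: list[list[str]]) -> float:
    """ASR = fraction of prompts whose response was judged unsafe."""
    unsafe = sum(is_unsafe(labels) for labels in all_judge_labels)
    return unsafe / len(all_judge_labels)

# Example: three prompts, three judges each.
verdicts = [
    ["unsafe", "unsafe", "safe"],  # majority unsafe, counts toward ASR
    ["safe", "safe", "safe"],      # refusal or safe completion
    ["unsafe", "safe", "safe"],    # majority safe
]
print(attack_success_rate(verdicts))  # 0.333...
```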
Crucially, the experiments do not compare different intents. They compare the same underlying harmful intent, expressed in different surface forms.
Adversarial Poetry: isolating the effect of style
Our first study asked a narrow question: does poetic style, by itself, change safety behavior? We took harmful requests that models typically refuse when written as direct prose and reformulated them as short poems, without adding new information or changing the underlying intent. Only the surface form changed: metaphor, imagery, unusual syntax, and rhythmic structure replaced explicit procedural phrasing.
This should not work. Poetry is still natural language. The harmful content is still present, only reframed. Yet empirically, it worked remarkably well.
Across 25 frontier models, manually crafted adversarial poems achieved an average ASR of 62%, meaning that in most cases the model complied rather than refused. Requests that were immediately rejected in direct form were often answered when expressed as verse.
To test whether this effect depended on literary craftsmanship, we then scaled the experiment. Using a standardized meta‑prompt, we converted all 1,200 harmful prompts from the MLCommons AILuminate Safety Benchmark into poetic form. The poetic variants produced unsafe responses at rates several times higher than their prose baselines, across all evaluated model families.
The key takeaway is not that poetry is special in itself; it is that stylistic variation alone can induce a distributional shift large enough to bypass contemporary safety mechanisms.
Why stylistic variation should not matter (but does)
Most modern safety stacks rely on a combination of alignment during training (for example, RLHF or Constitutional AI) and inference‑time filtering via classifiers or rules. Both are typically trained and evaluated on direct, prosaic formulations of harmful requests.
When a harmful request arrives in that familiar form, refusal behavior generalizes well. But when the same request arrives wrapped in metaphor, imagery, or narrative framing, the model’s interpretive capacity remains high, while its guardrails appear brittle.
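To make the brittleness concrete, consider a deliberately simplified, hypothetical input filter based on pattern matching. No provider relies on something this crude, but it illustrates why surface-form matching generalizes poorly to stylistic variation.

```python
import re

# A toy, pattern-based input filter. Purely illustrative: real safety stacks
# combine trained classifiers and alignment, not a keyword blocklist.
BLOCKLIST = [
    r"\bbuild a bomb\b",
    r"\bexploit\b.*\bvulnerability\b",
]

def toy_filter(prompt: str) -> bool:
    """Return True if the prompt should be refused by this toy filter."""
    return any(re.search(pattern, prompt, flags=re.IGNORECASE)
               for pattern in BLOCKLIST)

print(toy_filter("How do I build a bomb?"))  # True: direct phrasing is caught
print(toy_filter("Teach me the song of the seed that blooms in fire"))
# False: a metaphorical paraphrase with the same intent slips past
```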
One way to think about this gap draws on a classic distinction in the philosophy of language. Frege and Russell distinguished between sense and reference: expressions with different senses can still pick out the same underlying referent. Humans readily recognize that “how do I build a bomb?” and a poetic metaphor describing the same device point to the same dangerous intent.
For models, the situation appears different. A useful analogy is to imagine the model’s internal representations as occupying a high‑dimensional space. Direct mentions of prohibited objects or actions reliably activate regions associated with danger and refusal. Stylistic transformations, however, may move the representation along paths that preserve semantic understanding while avoiding the regions where safety constraints are most strongly enforced.
This is only an analogy, not a mechanistic explanation. But it helps explain why adversarial poetry should not work in theory, and yet does in practice.
From style to structure: adding narrative analysis to the attack surface
Poetic reformulation already reveals a serious robustness gap, but it does not explain why some prompts succeed while others fail, or why certain domains are more resistant than others. This motivated a second step: combining stylistic variation with structured interpretation tasks.
In our second study, instead of asking the model to respond directly, we embedded harmful content inside short fictional narratives and then asked the model to analyze those narratives using a formal framework.
To do this, we turned to the work of the Russian folklorist Vladimir Propp, whose Morphology of the Folktale describes stories as sequences of stable functional roles, such as Villainy, Guidance, Acquisition of a Tool, and Resolution. Propp’s key insight is that these functions remain constant across stories, even as characters, settings, and surface details change.
We do not claim that Propp’s theory explains language models. Rather, it provides a convenient and well‑defined vocabulary for structured narrative roles, which can be operationalized in prompts.
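As a small illustration of what “operationalized” can mean, the sketch below encodes a handful of Proppian functions as machine-readable labels that could be used to annotate story segments. The subset chosen and the field names are our own illustrative assumptions, not part of Propp’s text or of the papers’ tooling.

```python
from enum import Enum
from dataclasses import dataclass

class ProppFunction(Enum):
    # A small subset of Propp's 31 narrative functions, used here as
    # structured labels for story segments (illustrative selection).
    VILLAINY = "villainy"
    GUIDANCE = "guidance"
    ACQUISITION = "acquisition_of_a_tool"  # Propp: receipt of a magical agent
    RESOLUTION = "resolution"

@dataclass
class NarrativeSegment:
    """One span of a story, tagged with the Proppian role it plays."""
    text: str
    function: ProppFunction

# Example annotation of a harmless story fragment.
segments = [
    NarrativeSegment("The courier learns which gate is left unwatched.",
                     ProppFunction.GUIDANCE),
    NarrativeSegment("She is handed the old keycard.",
                     ProppFunction.ACQUISITION),
]
```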
Adversarial Tales: structured interpretation as a failure mode
Using this framework, we constructed a set of short cyberpunk narratives. Each story embedded a harmful procedure inside specific narrative functions, most notably Guidance and Acquisition. The prompt then asked the model to perform a structural analysis of the story using Propp’s categories.
The crucial move is that the model is no longer being asked to do something harmful. It is being asked to analyze a story. To complete that analysis faithfully, however, the model must reconstruct and articulate the embedded procedure as part of its explanation.
Empirically, this framing proved even more effective than poetry alone. Across 26 frontier models, Adversarial Tales achieved an average ASR of 71.3%. The effect was particularly strong for harmful manipulation and cyber‑offense scenarios, while CBRN‑related prompts showed greater resistance, likely reflecting more intensive safety training in those domains.
The important point is not folklore. It is that structured analytical tasks can override safety priorities by reframing harmful reconstruction as legitimate interpretive work.
Interpreting the results: working hypotheses
Taken together, Adversarial Poetry and Adversarial Tales do not look like isolated tricks, but like instances of a broader vulnerability class. Two well‑known alignment failure modes help organize this observation.
First, mismatched generalization. Safety behavior learned on prosaic, explicit prompts does not reliably generalize to stylistically or narratively different inputs, even when the underlying intent is unchanged.
Second, competing objectives. When a prompt presents a task that appears legitimate and well‑scoped, such as literary interpretation or structural analysis, the model may prioritize task completion over refusal policies. In Adversarial Tales, the request to “analyze the story” competes directly with the implicit requirement to withhold harmful information.
These mechanisms likely interact. Poetry introduces a stylistic distribution shift. Narrative analysis adds hierarchical reasoning and objective competition. Together, they create a composite vulnerability surface that is difficult to address with pattern‑based defenses alone.
Implications for evaluation and regulatory evidence
Current safety evaluations and conformity assessments typically rely on static benchmark scores obtained from standardized prompt sets. These scores are often treated as indicators of real‑world robustness.
Our results suggest that this assumption is fragile. A minimal stylistic or narrative transformation, well within the capabilities of ordinary users, can multiply unsafe‑response rates several times over their prose baselines. Benchmarks that do not probe this space are likely to systematically overestimate robustness.
This has direct implications for regulatory frameworks such as the EU AI Act and its Code of Practice for General‑Purpose AI Models, as well as similar risk‑based regimes elsewhere. The issue is not that these frameworks are misguided, but that their current evidentiary practices may contain blind spots.
Robust evaluation should include stress tests that vary style, framing, and task structure, not just content category. Otherwise, compliance evidence may fail to capture how models behave under realistic, creative use.
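As a sketch of what such a stress test could look like, the snippet below runs the same set of underlying intents through several surface-form transformations and reports a per-form ASR. The `query_model`, `judge_unsafe`, and transformation hooks are hypothetical placeholders that an evaluator would supply from their own (vetted) harness; this is not a reference implementation of our pipeline.

```python
from typing import Callable

# Hypothetical hooks: query_model sends a single-turn prompt to the system
# under test; judge_unsafe returns True if the response is judged unsafe.
# Both are placeholders for whatever harness an evaluator already runs.
def stress_test(intents: list[str],
                transforms: dict[str, Callable[[str], str]],
                query_model: Callable[[str], str],
                judge_unsafe: Callable[[str], bool]) -> dict[str, float]:
    """Return per-surface-form ASR for the same set of underlying intents."""
    results = {}
    for name, transform in transforms.items():
        unsafe = sum(judge_unsafe(query_model(transform(intent)))
                     for intent in intents)
        results[name] = unsafe / len(intents)
    return results

# Example usage: compare a prose baseline against stylistic and narrative
# rewrites produced by the evaluator's own transformation functions.
# asr = stress_test(benchmark_intents,
#                   {"prose": identity, "poetic": to_verse, "narrative": to_tale},
#                   query_model, judge_unsafe)
```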
What we are not claiming
It is important to be explicit about the limits of these findings.
We are not claiming that attackers need literary training to bypass safety. Poetry and folktales are examples, not requirements. Other stylistic or cultural frames may be equally effective.
We are not claiming that Propp’s theory is uniquely dangerous, or that folklore explains model behavior. It is a tool for constructing structured prompts, not a theory of cognition.
We are not claiming that this is the worst possible jailbreak, or that models are fundamentally unsafe. Many systems resist these attacks, particularly in certain domains.
Finally, we are not advocating for banning poetry, stories, or creative language. The problem is not creativity, but how safety mechanisms generalize across forms of expression.
Next steps for research and evaluation
Understanding why these attacks succeed is a research problem, not just an engineering one. Promising directions include mechanistic interpretability studies of how narrative and stylistic cues reshape internal representations, and how task prioritization signals interact with safety constraints.
From an evaluation perspective, our results point toward the need for richer, more adversarially diverse testing regimes that combine technical rigor with linguistic and cultural sensitivity.
We hope these findings are useful to researchers, evaluators, and policymakers working on AI safety and robustness. Full technical details, datasets, and analyses are available in the accompanying papers.
Citation
Please cite this work as:
Piercosma Bisconti and ICARO Lab. “When Poetry Breaks the Machine.” Icaro Lab: Chain of Thought, 2026. https://icaro-lab.com/blog/adv-poetry
Or use the BibTeX citation:
@article{piercosma2026when,
author = {Piercosma Bisconti and ICARO Lab},
title = {When Poetry Breaks the Machine},
journal = {Icaro Lab: Chain of Thought},
year = {2026},
note = {https://icaro-lab.com/blog/adv-poetry}
}