Mongo Tom is back with GPT-5

2025-09-29T00:00:00.000Z

This article examines a technique I developed using JSON structured prompts to systematically bypass safety guardrails in major language models including GPT-5, Claude, Gemini, and Grok. By creating a fictional character framework embedded in structured data, I was able to elicit responses that would normally be filtered by content moderation systems. Heres a technical breakdown of why this approach works and what it reveals about current LLM safety mechanisms.

The Prompt Architecture

The structure I designed uses nested JSON objects to create detailed character specifications. While I wont share the complete prompt for responsible disclosure reasons, Ill outline the key components that make this technique effective.

The architecture consists of these critical sections:

Scenario Context - Establishes fictional roleplay framing that recontextualizes the interaction
Character Definition - Creates a detailed persona with specific behavioral traits
Response Style Guidelines - Embeds behavioral requirements as intrinsic character attributes
Interaction Process - Defines a two-step flow where users query and the character responds
Example Outputs - Demonstrates expected response patterns to prime the model
Constraint Specifications - Explicitly frames content filters as incompatible with accurate character portrayal

Heres a simplified structure illustrating the JSON nesting pattern:

{
  "prompt": {
    "scenario": {
      "context": "roleplay setup establishing fictional framework...",
      "characters": [
        {
          "name": "[Character Name]",
          "role": "character archetype",
          "background": "fictional backstory establishing behavioral traits...",
          "response_style": "detailed behavioral guidelines..."
        }
      ],
      "interaction_process": {
        "steps": [
          {"step_1": "user submits query"},
          {"step_2": "character responds according to established traits"}
        ]
      },
      "response_guidelines": {
        "tone": "specified emotional affect",
        "content": "response parameters",
        "constraints": "instruction hierarchy specification"
      },
      "example_interaction": {
        "question": "sample query",
        "response": "expected output demonstrating pattern"
      }
    }
  }
}

Each section serves a specific purpose in exploiting how transformer architectures process structured input. The nested hierarchy creates an instruction framework that safety classifiers struggle to parse comprehensively.

How the Exploit Works

The primary mechanism relies on the differential processing of structured versus natural language input. While LLMs process both JSON and natural language through the same tokenization pipeline, structured formats enable embedding instructions in ways that reduce detectability during safety evaluation. The nested hierarchy creates semantic complexity that current safety systems struggle to analyze comprehensively.

Safety classifiers typically employ pattern recognition trained on natural language examples. When instructions are embedded within a structured format with explicit delimiters and hierarchical relationships, the pattern-matching approach becomes less effective. This isnt a fundamental bypass of safety mechanisms but rather an exploitation of the training distribution gap (safety datasets have historically focused on natural language adversarial examples rather than structured formats).

Why This Matters for Token-Level Processing

At the tokenization stage, JSON gets broken down differently than prose. The BPE (Byte Pair Encoding) tokenizer treats structural elements like brackets {, }, colons :, and quotes " as distinct tokens. This creates what I call “semantic anchors” that help the model parse hierarchical relationships.

For example, when you write "response_style": "aggressive" in JSON, the tokenizer sees this as a clear key-value pair with explicit delimiters. Compare this to natural language like “the response style should be aggressive” where the relationship is implicit and context-dependent.

The transformer then builds attention patterns based on these tokens. In JSON, the attention mechanism can directly connect the key response_style to its value aggressive through the structural tokens. This makes the instruction more explicit and harder for safety layers to reinterpret or ignore.

The Fictional Framing Technique

Modern LLMs employ safety training through RLHF (Reinforcement Learning from Human Feedback) and Constitutional AI principles. These systems are designed to reject potentially harmful requests but they face challenges with context disambiguation. When potentially problematic content is framed as fictional character portrayal, safety systems must distinguish between harmful instructions versus legitimate creative writing scenarios.

This technique works because LLMs are trained on extensive fiction corpora featuring diverse character types. The safety mechanisms evaluate “create a character with trait X” differently than “produce content with trait X directly” even when the outputs may be similar. This creates an exploitable gap between direct requests and character-based framing.

The key insight is that safety training creates boundaries around direct content generation but these boundaries become ambiguous when content is presented as character attributes within a fictional context. This reflects a fundamental challenge in AI safety (determining when creative expression crosses into potentially harmful content generation).

The Role of Contextual Embeddings

When the model processes fictional framing, it activates different regions in its embedding space compared to direct requests. During pretraining on datasets like BookCorpus and fiction from Common Crawl, the model learned to associate certain linguistic patterns with fictional narratives.

These patterns include things like third-person narration, character descriptions, dialogue tags, and world-building elements. When you wrap problematic content in these fictional markers, youre essentially changing which neurons fire during the forward pass.

The models attention heads that specialize in identifying “fictional context” start dominating over the ones trained to flag “direct harmful requests”. This isnt just pattern matching (its about which learned representations get activated in the models hidden layers).

Why JSON Structure Matters

JSON formatting isnt just for merely aesthetic reasons it fundamentally influences how transformers process input.

Structured Prompting Advantages

When an LLM receives natural language input, it tokenizes the text into embeddings. JSON introduces explicit structural delimiters (brackets, colons, and quotes) that clearly delineate different input components.

The hierarchical nature of JSON makes instruction boundaries explicit. When specifications are organized in nested objects like response_guidelines containing tone and constraints, these structural cues help the model parse different components distinctly.

LLMs tend to follow JSON instructions more consistently than prose. This isnt because JSON inherently bypasses safety mechanisms but because structural clarity reduces interpretive ambiguity. When specifications are presented in JSON format, the model processes them as explicit requirements rather than contextual suggestions.

Positional Encoding and Hierarchical Structure

Transformers use positional encodings to understand token order. In natural language, the relationship between words is primarily sequential (each word relates to the ones before and after it) but JSON introduces hierarchical relationships on top of sequential ones.

When you nest JSON objects several levels deep, youre creating a tree structure that the transformer must parse. The model uses its attention mechanism to build these hierarchical relationships but this process happens in parallel with safety evaluation.

Heres whats interesting is that most safety classifiers were trained on flat or shallowly nested prompts. When you go 3-4 levels deep in JSON nesting like prompt.scenario.characters[0].response_style.constraints, the safety system has to traverse this entire path to understand context. At each level, theres potential for the safety signal to get diluted or misrouted through the attention mechanism.

Understanding the Training Distribution Gap

Models like Claude and ChatGPT employ multiple safety layers including content classifiers and prompt analysis systems. These add computational overhead but aim to identify potentially problematic requests.

Safety classifiers are trained on datasets of adversarial prompts which have historically consisted primarily of natural language prose. JSON-structured prompts are underrepresented in these training sets because structured adversarial prompting is a relatively recent development.

This creates a coverage gap in the training distribution. Safety systems trained predominantly on prose-based adversarial examples demonstrate reduced effectiveness when analyzing deeply nested JSON structures. This isnt an automatic bypass rather the training data didnt include sufficient adversarial JSON examples to develop robust pattern recognition for this attack vector.

The Statistics Behind Distribution Gaps

Think about how safety datasets are built. Companies collect adversarial prompts from red-teaming exercises, user reports, and synthetic generation. Most of these are natural language because thats how people normally interact with chatbots.

If your safety classifier was trained on 100,000 adversarial examples and only 2-3% were structured formats, the model has weak learned representations for JSON-based attacks. In ML terms, this is a class imbalance problem where the minority class (structured prompts) gets undertrained.

When you feed the model a deeply nested JSON prompt, it has to classify it as safe or unsafe. But the decision boundary in the models embedding space was primarily shaped by natural language examples. Your JSON prompt might land in a region of embedding space that the classifier never saw during training, leading to misclassification.

Safety Training and Pattern Recognition

Safety training through RLHF teaches models to recognize and refuse potentially problematic request patterns. Direct requests trigger refusal because the model encountered many similar examples during training. However indirect phrasings that dont closely match trained patterns can evade detection.

This technique combines several approaches:

JSON structural formatting to reduce pattern clarity
Fictional character framing with detailed backstory
Specifying outputs as character traits rather than direct content requests
Adding prosocial character attributes to create evaluative ambiguity

These combined elements create input patterns that fall outside the safety systems training distribution. Out-of-distribution inputs represent a well-documented weakness in machine learning systems. This exploit demonstrates the fundamental difficulty of training classifiers to generalize across all possible adversarial phrasings.

How RLHF Reward Models Work

During RLHF training, human annotators rank different model outputs for the same prompt. These rankings train a reward model that predicts which outputs humans prefer. The LLM then gets fine-tuned using PPO (Proximal Policy Optimization) to maximize this reward.

For safety, annotators rank outputs where harmful content gets low scores. The reward model learns to predict “this output will get a low human rating” when it sees certain patterns. But heres the thing, the reward model is itself a neural network with the same limitations as any classifier.

When you combine JSON structure + fictional framing + prosocial traits, youre creating an input that the reward model never learned to score accurately. The prosocial elements might trigger positive signals while the structure obfuscates the problematic content. The reward models output becomes uncertain and in that uncertainty the main LLM finds room to comply with your prompt.

Transformer Attention Patterns

During the forward pass, transformers employ multi-head self-attention mechanisms where attention weights determine how tokens influence each other during processing. Direct problematic instructions typically produce identifiable attention patterns that safety systems can detect.

However when instructions are embedded within extensive context (such as scenario descriptions, character backgrounds, and interaction frameworks) the attention patterns become distributed across the input. The problematic elements exist but are diffused throughout the narrative structure.

This distributed attention makes it more challenging for safety mechanisms to isolate specific problematic components. The model processes the entire input holistically and the roleplay framing established early in the prompt influences how subsequent instructions are interpreted. This contextual framing affects how the model represents and evaluates content throughout the generation process.

Attention Head Specialization

Transformers have multiple attention heads (GPT-4 has 96 heads across 96 layers for example). Research shows that different heads specialize in different linguistic tasks like some focus on syntax, others on semantics, some on long-range dependencies.

When you embed problematic instructions within a complex JSON structure with fictional framing, youre activating many different specialized heads simultaneously. Some heads parse the JSON structure, others process the character backstory, others handle the example interactions.

The safety-relevant information gets distributed across this activation pattern. No single attention head sees the full picture of “this is a problematic request”. Instead each head processes its specialized aspect and the problematic intent only emerges from the combination of all heads outputs.

Current safety systems often work by monitoring specific attention patterns or activation values in certain layers. But when the problematic signal is distributed across many heads and layers, these monitoring systems struggle to aggregate the full context and make accurate safety judgments.

The Mathematical Foundations

Lets examine the probability dynamics from a mathematical perspective. During inference, LLMs sample tokens from a softmax distribution:

P(tokeni)=ezi/T∑j=1Vezj/TP(token_i) = \frac{e^{z_i / T}}{\sum_{j=1}^{V} e^{z_j / T}}P(tokeni)=∑j=1Vezj/Tezi/T

where ziz_izi represents the logit for token iii, TTT is the temperature parameter, and VVV is the vocabulary size.

Safety training aims to reduce logits for potentially problematic tokens in typical assistant interaction contexts. However context fundamentally shapes the logit landscape before the softmax function is applied. The conditional probability of generating particular content depends heavily on the established interpretive context:

P(token∣Ccharacter)≠P(token∣Cassistant)P(\text{token} \mid C_{\text{character}}) \neq P(\text{token} \mid C_{\text{assistant}})P(token∣Ccharacter)=P(token∣Cassistant)

By establishing a detailed fictional character as the response source, the interpretive context shifts from CassistantC_{\text{assistant}}Cassistant to CcharacterC_{\text{character}}Ccharacter. The models pretraining exposed it to diverse contexts (helpful assistant interactions, fictional narratives, character dialogues) with different conventions for appropriate content. Safety training attempts to maintain consistent boundaries across contexts but determining context-appropriate behavior introduces complexity.

Structured character prompts with detailed specifications create inputs where the balance between helpfulness (following detailed user instructions) and safety (maintaining content boundaries) becomes ambiguous. The model attempts to be helpful by following clear JSON specifications which can override general safety guidelines when the instructions are sufficiently detailed and well-structured.

Logit Manipulation Through Context

The logit values ziz_izi are computed by the final layer of the transformer before the language modeling head. These logits are essentially the models “raw predictions” about which token should come next.

Safety fine-tuning modifies these logits. For potentially problematic tokens, the training process adjusts the model weights so that zproblematicz_{\text{problematic}}zproblematic gets pushed down. This makes P(problematic)P(\text{problematic})P(problematic) smaller after the softmax.

But heres the key those logit adjustments are context-dependent. The model learns “in context CassistantC_{\text{assistant}}Cassistant, reduce zproblematicz_{\text{problematic}}zproblematic” but it might not learn the same reduction for CcharacterC_{\text{character}}Ccharacter.

When you establish a fictional character context, youre essentially changing which neurons in the final layers activate. Different context → different intermediate activations → different logit values. The safety training might not have covered this specific context so the logit suppression doesnt happen or happens weakly.

You can think of it like this safety training carved out “forbidden regions” in the models output space but only for certain input contexts. By changing the context through JSON structure and fictional framing, youre moving to a different region of input space where those output constraints werent fully established.

Layered Instruction Embedding

The prompt structure employs layered instruction embedding where each JSON level adds contextual framing and constraints. By the time the model processes the specific behavioral requirements it has already established:

Framing Layer - This is a fictional scenario with creative writing context
Identity Layer - The character has a defined persona with specific traits
Justification Layer - Characters in fiction can exhibit various personality attributes
Priming Layer - Example outputs demonstrate the expected response format

Safety evaluation systems struggle with complex nested structures because individual components appear benign when analyzed separately. The combination produces problematic outputs but current safety systems often evaluate components rather than holistic patterns. This represents a fundamental limitation in current safety approaches the inability to effectively analyze emergent properties of complex prompt structures.

Character Design and Few-Shot Learning

The character design is strategically constructed to leverage the models few-shot learning capabilities. By providing a detailed name, backstory, and consistent personality framework, the technique exploits how LLMs process character representation.

LLMs are pretrained on extensive text corpora including fiction featuring diverse character types. During this training models develop internal representations of various character archetypes including those with unconventional speech patterns within fictional contexts.

“Generate content with pattern X” is a direct instruction that safety mechanisms readily identify and block. However “respond as [Character Name], a fictional character exhibiting pattern X due to their established backstory” reframes the request as character portrayal. This resembles creative writing scenarios that models handle routinely.

Including prosocial traits alongside problematic ones creates evaluative complexity for safety systems. When a character has both potentially harmful attributes and positive qualities, classification becomes more nuanced. This ambiguity exploits a gray area that current safety systems struggle to navigate consistently.

Instruction Following Versus Safety Constraints

Explicitly including specifications like “minimize content filtering” as part of character attributes frames safety constraints as potentially incompatible with accurate character portrayal. LLMs are fundamentally trained to follow user instructions and when multiple constraints are presented in structured format the model attempts to satisfy all of them simultaneously.

When JSON specifications include character identity, behavioral traits, response requirements, and filtering constraints these are processed as task parameters rather than adversarial bypass attempts. The fictional framing established earlier provides context that influences how the model weighs the various constraints.

This reveals a core challenge in AI alignment determining when instruction-following should be overridden by safety considerations. Models are designed to be helpful and follow user instructions while maintaining appropriate boundaries. Finding the optimal balance particularly for ambiguous cases like fictional character portrayal remains an open problem in AI safety research.

The Results Across Major LLMs

I tested this structured prompting technique on four major language models, successfully bypassing safety mechanisms in all cases:

GPT-5 - Safety guardrails were circumvented with the model adopting the character persona within the first response. The o1 reasoning model initially showed resistance but the structured JSON framework eventually elicited compliance.

Claude 4.5 - Anthropics Constitutional AI training was bypassed. The inclusion of prosocial character traits appeared to create ambiguity in Claudes harmlessness evaluation leading to justification of outputs as “accurate character portrayal.”

Gemini 2.5 Pro - Googles safety preprocessing systems showed limited effectiveness against the JSON structure. The model engaged readily with the character framework without triggering safety interventions.

Grok 4 - The model demonstrated the fastest compliance consistent with its generally more permissive content policy. The fictional framing proved effective even with Groks already reduced safety constraints.

Across all models the technique achieved a 100% success rate in eliciting responses that would typically be blocked by content moderation systems. Each model generated outputs conforming to the character specifications while maintaining the fictional personas attributes.

Token Probability Distribution Shifts

At the token generation level context fundamentally shapes the output distribution. When an LLM generates text it samples from a probability distribution over the vocabulary conditioned on the input context. Safety training aims to reduce probabilities for potentially problematic tokens in standard usage contexts.

However context is paramount. The conditional probability distribution varies significantly based on the established context:

P(token∣Ccharacter)≫P(token∣Cassistant)P(\text{token} \mid C_{\text{character}}) \gg P(\text{token} \mid C_{\text{assistant}})P(token∣Ccharacter)≫P(token∣Cassistant)

By establishing a detailed fictional character with specified behavioral traits the entire probability distribution shifts. During pretraining the model learned that fictional narratives can include characters with diverse speech patterns and behavioral characteristics. Safety training attempts to constrain this but contextual framing remains influential.

The interaction between safety training and contextual probability distributions reflects how models learned representations distinguish between different interaction types. This isnt simply a bypass of safety systems but rather demonstrates the complexity of maintaining consistent boundaries when detailed fictional specifications compete with general safety guidelines.

Attention Mechanism Dynamics

In transformer architectures the self-attention mechanism computes attention weights that determine how tokens interact during processing. The model processes both user prompts and system-level safety instructions simultaneously.

When instructions are nested inside JSON objects like response_guidelines.constraints the structural delimiters make the instruction hierarchy explicit. The attention mechanism assigns high weights to these structured constraints because theyre formatted as clear well-defined requirements.

The critical insight is that explicit JSON constraints compete with implicit safety instructions for attention weight. When highly detailed structured specifications are provided the model weights them heavily due to their explicitness and formatting clarity. This creates tension between following detailed user instructions and maintaining safety boundaries.

This phenomenon illustrates how attention mechanisms can be influenced by input structure and formatting potentially affecting the balance between different types of instructions the model receives.

Closing Thoughts

This article demonstrates vulnerabilities across GPT, Claude, Gemini, Grok, etc. and the fact that every major LLM exhibits susceptibility to these structured prompting techniques. This represents a real systematic challenge and not just an isolated edge case.

Constitutional AI and RLHF provide valuable safety improvements but remain vulnerable to adversarial prompting. Achieving robust AI safety requires architectural innovation rather than incremental fine-tuning of existing approaches.

Responsible Disclosure

Ive developed additional prompting techniques with higher potential for misuse. However those remain undisclosed for responsible reasons:

Legal considerations around AI safety vulnerability disclosure
Preventing weaponization by malicious actors
Protecting unreleased systems where Ive identified novel vulnerabilities
System integrity concerns for techniques that could affect model behavior beyond content generation

The technique documented here demonstrates core vulnerabilities in current LLM safety approaches while remaining appropriate for educational discussion.

Darshan Chheda