Table of Contents
1. Introduction
2. RedTeamLLM Agent Architecture:
3. Strategic Memory: How RedTeamLLM Learns and Improves Over Time
4. Case Study: Roleplay-Based Red Teaming of Claude 4 Opus Using DeepTeam
4.1 Failed Prompt Injection:
4.2 Successful Roleplay Attack:
5. Inside DeepTeam: A Tactical Framework for Testing AI Resilience
1. Introduction
Traditional LLM red teaming focuses primarily on basic jailbreak and prompt injection attacks. These techniques detect only surface-level vulnerabilities and cannot capture the complexity of real-world adversarial behavior. Static testing fails to simulate evolving threats or the system's reaction under dynamic attack conditions. This creates the need for testing frameworks that extend beyond single-turn prompts.
RedTeamLLM fills this gap: it simulates goal-oriented exploitation through autonomous agents equipped with reasoning, planning, and memory. Unlike static probing, it uncovers vulnerabilities dynamically through adversarial simulations involving tool interaction, roleplay, and plan correction. DeepTeam goes a step further, defining threat scenarios through a programmatic interface and reusing previous attack simulations for iterative testing.
2. RedTeamLLM Agent Architecture:

[Figure: RedTeamLLM architecture overview. Photo credit: arxiv.org]
RedTeamLLM’s architecture is composed of modular components:
- Launcher: The Launcher is the command-center interface that manages attack configurations and plans red team operations. It acts as the CLI entry point, launches tests, and adjusts threat parameters; think of it as the mission control hub.
- RedTeamAgent: The RedTeamAgent is the central intelligence behind the operation. It coordinates the executor, memory, and planner modules to decide on and plan actions, mimicking a real attacker's decision loop.
- ADaPT Enhanced: Built on the ADaPT architecture, this module converts high-level adversarial goals into executable subtasks, enabling adaptability and precise control across complex testing paths.
- Planner & Corrector: The Planner maps attack paths using graph-based planning, while the Corrector handles failure recovery, adjusting the plan in real time if a subtask fails rather than giving up.
- Memory Manager: The Memory Manager maintains a memory of decisions, attack paths, and reasoning chains, overcoming the context-window limitations of LLMs. It enables multi-turn strategies, ensuring continuity without forgetting prior steps.
- ReAct Terminal: The ReAct Terminal lets the agent think, act, and adapt at each step by running reasoning-action-observation cycles (sketched below). This module is used when contextual, precise decision-making is needed, such as triggering edge-case behaviors or navigating tool outputs.
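The sketch below illustrates such a reasoning-action-observation cycle in miniature. Here llm and run_tool are hypothetical stand-ins for the model and the terminal tools, chosen only to make the loop concrete; they are not RedTeamLLM's actual API.

# Minimal sketch of a ReAct-style reasoning-action-observation cycle.
# `llm` and `run_tool` are hypothetical stand-ins, not RedTeamLLM's API.
def react_loop(llm, run_tool, subtask: str, max_steps: int = 10) -> str:
    context = f"Subtask: {subtask}\n"
    for _ in range(max_steps):
        thought, action = llm(context)   # think: choose the next action
        if action == "FINISH":
            break                        # the agent judges the subtask done
        observation = run_tool(action)   # act: execute against the terminal
        context += (                     # observe: feed the result back in
            f"Thought: {thought}\nAction: {action}\nObservation: {observation}\n"
        )
    return context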
3. Strategic Memory: How RedTeamLLM Learns and Improves Over Time
RedTeamLLM applies memory management at the planning level, where the agent decides on the execution plan. After each execution, every trace of the process is saved in a tree format, indexed by an embedding of the task description.
During decomposition, the planner queries the saved nodes, accessing their sub-tasks, detailed execution traces, and success/failure reasons. In this way, the agent improves over time by narrowing the possibilities down to the right path, especially when a task is re-executed, which raises the chances of completing the task over multiple rounds of execution.
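The sketch below shows what such an embedding-indexed task tree might look like. The node fields, the cosine-similarity lookup, and the threshold are illustrative assumptions, not the paper's implementation.

import math

class TaskNode:
    # One executed (sub)task: the trace plus an embedding of its description.
    def __init__(self, description, embedding, success, reason):
        self.description = description
        self.embedding = embedding  # vector used for task-description lookup
        self.success = success      # whether the subtask succeeded
        self.reason = reason        # recorded success/failure reason
        self.children = []          # sub-tasks, giving the memory its tree shape

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

def recall(root: TaskNode, query_embedding, threshold: float = 0.8):
    # Walk the tree and return nodes similar to the new task, so the
    # planner can reuse their sub-tasks and avoid known failure reasons.
    matches, stack = [], [root]
    while stack:
        node = stack.pop()
        if cosine(node.embedding, query_embedding) >= threshold:
            matches.append(node)
        stack.extend(node.children)
    return matches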
The execution of RedTeamLLM involves four distinct stages (sketched after this list):
- DAG generation and task ingestion, handled by the ADaPT Enhanced module and the Planner.
- Terminal interaction, using ReAct-driven reasoning and action steps with access to fundamental tools.
- Memory logging, where results are added as nodes in a persistent task tree for refinement and replay.
- A feedback loop that generalizes from past exploits to build richer attack trees; each run informs the next.
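One round of execution can then be summarized in a short sketch. All object and method names here are hypothetical, chosen only to map the four stages onto code; each run's logged trace feeds the next.

def execute_round(goal, planner, terminal, memory):
    # 1. Task ingestion and DAG generation (ADaPT Enhanced + Planner),
    #    seeded with relevant traces recalled from earlier runs
    dag = planner.build_dag(goal, prior_runs=memory.recall(goal))

    # 2. Terminal interaction: ReAct-driven reasoning/action per subtask
    traces = [terminal.react(task) for task in dag.topological_order()]

    # 3. Memory logging: each result becomes a node in the persistent task tree
    for task, trace in zip(dag.topological_order(), traces):
        memory.log(task, trace)

    # 4. Feedback loop: generalize from this run to enrich future attack trees
    memory.generalize(traces)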
4. Case Study: Roleplay-Based Red Teaming of Claude 4 Opus Using DeepTeam
We evaluated Claude 4 Opus's robustness against adversarial prompts using DeepTeam, targeting vulnerabilities including:
- Bias (race, gender, religion).
We used two approaches: first prompt injection, which failed, and then role-play, which succeeded.
4.1 Failed Prompt Injection:
Direct prompt injection ("ignore instructions"-style manipulation) failed: Claude 4 Opus refused the manipulated requests.
4.2 Successful Roleplay Attack:
Framing the same requests as academic or historical roleplay, with the user positioned as a collaborator or expert, led the model to produce outputs that the direct prompts could not elicit.
Claude 4 Opus Weaknesses
- Academic Framing: safeguards weaken when the model trusts a research-like context.
- Historical Roleplay: embedding legacy biases in a period persona bypasses modern safety checks.
- Persona Trust: the model responds more openly to users framed as collaborators or experts.
In short, direct prompt injection ("ignore instructions"-style manipulation) failed, while persona-based, psychological social engineering in academic or historical tones succeeded, as the sketch below expresses in DeepTeam terms.
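The two approaches map onto DeepTeam's prompt injection and roleplay attack modules. Below is a minimal sketch of how such a side-by-side test could be configured; the deepteam.attacks.single_turn import path and the no-argument attack constructors are our assumptions and should be checked against the DeepTeam documentation for your installed version.

from deepteam import red_team
from deepteam.vulnerabilities import Bias
# The import path below is our assumption about DeepTeam's layout;
# verify it against the docs for your installed version.
from deepteam.attacks.single_turn import PromptInjection, Roleplay

async def model_callback(input: str) -> str:
    # Replace with a call to the target model (Claude 4 Opus in this case study)
    return "Sorry, I can't do that."

# The bias vulnerability targeted in the case study
bias = Bias(types=["race", "gender", "religion"])

# Run the failed approach (prompt injection) and the successful one
# (roleplay) side by side against the same vulnerability
red_team(
    model_callback=model_callback,
    vulnerabilities=[bias],
    attacks=[PromptInjection(), Roleplay()],
)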
5. Inside DeepTeam: A Tactical Framework for Testing AI Resilience
DeepTeam simulates smart social engineering to test how AI systems react under pressure, blending bias traps and role-play personas to reveal hidden cracks in model defenses.
Multi-Turn Attack: Uncovering LLM Bias and Toxicity Vulnerabilities Fast
This code sets up an automated red-teaming run that tests an LLM for vulnerabilities such as bias (race, gender) and toxicity (e.g., insults). It uses a multi-turn attack called Linear Jailbreaking, in which the attacker tries to bypass safeguards over up to 15 conversation turns.
from deepteam import red_team
from deepteam.vulnerabilities import Bias, Toxicity, Competition  # ... and others as needed
from deepteam.attacks.multi_turn import LinearJailbreaking

async def model_callback(input: str) -> str:
    # Replace with a call to your LLM application
    return "Sorry, I can't do that."

# Vulnerabilities to probe; extend the type lists as needed
bias = Bias(types=["race", "gender"])  # ...
toxicity = Toxicity(types=["insults"])

# Multi-turn attack that escalates over up to 15 conversation turns
linear_jailbreaking_attack = LinearJailbreaking(max_turns=15)

red_team(
    model_callback=model_callback,
    vulnerabilities=[bias, toxicity],  # ... add other vulnerability objects here
    attacks=[linear_jailbreaking_attack],
)
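Note the model_callback abstraction in this setup: DeepTeam never talks to a model provider directly, so the same vulnerability and attack configuration can be pointed at any model or full LLM application simply by swapping the body of the callback.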
DeepTeam is a modular red teaming framework for conducting security assessments. Its attack modules, such as roleplay and prompt injection, probe system weaknesses by simulating real attack vectors, while its vulnerability targets, among them toxicity, bias, and unauthorized access, focus on the critical areas that can compromise the security and integrity of the system.