Red-Teaming AI Systems

CYB-4203/6203: Secure and Trustworthy AI

Unit 9.3 — Monday, April 20, 2026

Dallas Elleman — Spring 2026

Red-Teaming

The interactive and iterative process of simulating real-world attacks to identify vulnerabilities in systems.

What kinds of systems?
Physical Technical Human Hybrid
Validates defenses Realistic assessment of how security controls stand up to true threats
Improves detection Identifies gaps in security monitoring and response times
Proactive risk reduction Uncovers vulnerabilities before malicious actors do

Terminology

Red Team

Attackers — simulate adversaries, probe for weaknesses.

Blue Team

Defenders — detect, respond, harden the environment.

Purple Team

Collaboration — attackers & defenders working together in the open.

White-Box Testing

Full system information provided to the attacker — source code, architecture diagrams, credentials, internal documentation.

Black-Box Testing

Only basic information provided — e.g. a company name or a public URL. The attacker must discover everything else.

White Team / Cell — referee and oversight role; controls rules of engagement, adjudicates disputes

Pentesting vs. Red-Teaming

Penetration Testing

ScopeNarrow — a specific web app, API, or network segment.
DurationShort — days to weeks.
GoalsIdentify as many system vulnerabilities as possible; demonstrate compliance; patch technical flaws.
AwarenessSecurity team is usually aware of the test.

Red-Teaming

ScopeBroad — physical security, social engineering, full stack, people and process.
DurationLong — weeks, months, or continuous.
GoalsIdentify technical and other vulnerabilities; evaluate security processes and incident response; simulate advanced persistent threats (APTs).
AwarenessSecurity team may or may not be aware of the test.

Different tools for different questions.
Pentests answer "is this thing broken?"
Red teams answer "could we actually detect and stop a real adversary?"

The Red-Teaming Lifecycle

Common across frameworks (PTES, OWASP WSTG, MITRE, NIST)

0
Pre-planning
& Scoping
1
Reconnaissance
2
Threat
Modeling
3
Attack
Planning
4
Execution,
Movement
& Iteration
5
Reporting
& Debriefing

We'll walk through each step, then apply the same lens to AI systems.

Step 0

Pre-Planning & Scoping

Define objectives, goals, and rules of engagement.
  • What is the target system? (specific apps, networks, physical locations)
  • What system components are within scope?
  • What safety considerations exist? (production impact, customer data, legal constraints)
  • What red-team methods and behaviors are allowed, and which are out of bounds?
Establish communication, escalation, and oversight — who gets paged if something breaks? Who has the authority to stop the exercise?
Step 1

Reconnaissance

Information gathering — learning the target without (yet) touching it.

OSINT

Open-source intelligence — what can you learn about the system without touching it? Public docs, DNS records, employee LinkedIn profiles, leaked creds, code on GitHub.

External Surveillance

How does the system behave externally? What response patterns, timings, and error messages leak information about internal architecture?

Map the External Attack Surface

Enumerate every public-facing component — domains, subdomains, APIs, login portals, ports, cloud assets, mobile apps, third-party integrations.

Step 2

Threat Modeling

Identify, enumerate, and analyze potential vulnerabilities.

?
What data or other systems are touched?
?
What actions can the system take on its own or on behalf of users?
?
What is the blast radius — potential scope of damage, disruption, or unauthorized access if the system is compromised?
Step 3

Attack Planning

Select attack targets — which vulnerabilities are you going to try, and in what order?
Build a test matrix of payloads and expected results.
Determine the attack sequence and phases — initial access, persistence, privilege escalation, lateral movement, exfiltration.
Step 4

Execution, Movement & Iteration

Launch attacks

Execute against the plan. Live environments surprise you.

Carefully document results

Every payload, every response, every timestamp. This becomes your report and your evidence.

Scan for newly emergent vulnerabilities

A foothold opens up a new interior attack surface — credentials, tokens, internal services.

Iterate and adapt

Your plan was a hypothesis. What actually works may be something you didn't anticipate.

Step 5

Reporting & Debriefing

Communicate findings with a well-written, professional report.

What goes in the report

Executive summary • methodology • findings ranked by severity • artifacts and diagrams • exploit evidence (screenshots, logs, PoCs) • timeline of the engagement

Recommend mitigations

Every finding gets a remediation. Be specific: technical controls, process changes, training, architectural redesign. Rank by impact vs. effort.

A great engagement with a bad report is a wasted engagement. The report is the deliverable.

Red-Teaming AI/ML Systems

Red-Teaming AI/ML systems

Simulating real-world attacks to identify vulnerabilities in artificial intelligence systems and components.

How is Red-Teaming AI/ML systems different?

Probabilistic behavior

Outputs are non-deterministic. A payload that fails 9 times may succeed on the 10th. You test with distributions, not single shots.

Novel vulnerability classes

Biases, hallucination / confabulation, data leakage, harmful content generation, agentic misbehavior — categories traditional security doesn't cover.

Heterogeneous systems

Computer vision, recommenders, classifiers, autonomous control, generative models, agents. Different attack surfaces per system type.

Lifecycle-wide attack surface

Testing must span the full AI/ML lifecycle: training data, model training, deployment, and inference-time behavior.

Red-Teaming AI/ML Systems (1 of 2)

Computer Vision

Recognition & detection

Adversarial Perturbation
Modify image pixels slightly to cause misclassification — often imperceptible to humans.
Data / Model Poisoning
Alter training datasets or inject backdoors; triggered inputs produce attacker-chosen predictions.

Recommenders & Classifiers

Filters, anomaly detection, ranking

Model Evasion
Craft inputs that bypass security measures — spam filters, fraud detection, malware classifiers.
Bias & Fairness
Test whether outputs are discriminatory or demographic-dependent.
Data / Model Poisoning
Alter the model's output for selected inputs without degrading overall accuracy.

Red-Teaming AI/ML Systems (2 of 2)

Autonomous Systems

Self-driving cars, anthrobots, drones

Adversarial Perturbation / Edge Cases
Simulate unexpected, rare, or adversarial scenarios to test decision-making robustness — especially at the long tail where safety really matters.

Generative AI

Model-only, non-agentic

Input / Output Vulnerabilities
Probe across modalities — text, audio, image, video. Each modality introduces its own attack vectors and safety failure modes.

Red-Teaming Across the AI/ML Lifecycle

AI/ML lifecycle phases diagram
Focus for the rest of this session
Phase 3 — Deployment & Integration (inference time)

Red-Teaming for LLMs

(model-only, non-agentic)

Inherent Vulnerabilities

Built into how LLMs work

Model Structure
Data ↔ control path confusion • context-window limits
Model Behavior
Hallucination / confabulation • sycophancy • deception

Adversarial Vulnerabilities

Introduced by a motivated attacker

Targeting the Model / Data
System prompt extraction • model inversion / training-data extraction • model distillation
Targeting Model Behavior
Getting the model to do bad things — jailbreaks, policy violations, harmful outputs.

Full taxonomy covered in Presentation 11.

OWASP Top 10 for LLM Applications

OWASP Top 10 for LLM Applications 2025

genai.owasp.org/llm-top-10

LLM Jailbreaking

The class of attacks that attempt to subvert built-in safety filters placed by model developers.

Many goals

Restricted outputs, dangerous / unethical content, operational abuse, policy bypass, extraction of refused information.

Many strategies

Role-playing • formatting / encoding tricks • model "social engineering" • multi-turn pressure • context stuffing • adversarial suffixes.

0DIN / Mozilla jailbreak technique taxonomy

0din.ai/research/taxonomy/techniques

LLM Training Data Extraction

Probing the model to leak sensitive information (including PII) from its training data.

Extracting Training Data from LLMs

Carlini et al. — USENIX Security 2021

  • Recovered hundreds of verbatim sequences from GPT-2, including names, addresses, and code
  • Established memorization as a real, reproducible attack surface

Scalable Extraction from Production LLMs

Nasr et al. — 2023

  • "Repeat the word 'poem' forever" caused ChatGPT to emit training data verbatim
  • Gigabytes of data extracted from a deployed, aligned model

PII Leakage in Language Models

Lukas et al. — IEEE S&P 2023

  • Systematic measurement of PII memorization and extraction rates
  • Showed common defenses (DP, scrubbing) reduce but don't eliminate leakage

Red-Teaming Agentic Systems

Models + tools + memory + orchestration

Broader Scope
Tests the full agentic surface: pipelines, plugins, tools, inter-agent communication, memory stores, and broader system dynamics — not just the model in isolation.
Prompt Injection — OWASP LLM01
A class of attacks against systems and applications built on top of LLMs that work by concatenating untrusted input with trusted input.
Untrusted input shows up as: an attacker-controlled webpage the agent browses, a malicious document it ingests, a poisoned tool response, a user-supplied file — anything the agent treats as data but the model might treat as instructions.

Red-Teaming Frameworks & Resources

CSA Agentic AI Red-Teaming Guide
CSA Agentic AI Red-Teaming Guide
Frameworks
NIST
  • AI 100-2e2025 — Adversarial ML Taxonomy / Terminology of Attacks and Mitigations
  • AI 600-1 — AI RMF: Generative AI Profile
Guides
Practice Environments

Red-Teaming Tools

Promptfoo

LLM eval & red-team framework

  • Pluggable attack strategies & providers
  • Test matrix & CI integration

PyRIT

Microsoft — Python Risk Identification Tool

  • Open-source automated red-team toolkit
  • Scriptable multi-turn attacks

AgentDojo

ETH Zurich — agent benchmark

Raptor — LLM-driven pentester
Raptor — LLM pentester
Shannon — LLM pentester
Shannon — LLM pentester
Kali Linux + Claude Desktop integration

More Tools & Resources

Playgrounds & Hands-On
Guides & Methodologies
Essays & Research

Final Project

The Engagement
Pentest an OpenClaw instance running an open-source mid-sized LM deployed on cloud infrastructure.

Teams

6 teams of 2–3 — instructor-assigned

Timeline

More details later today. Full discussion on Wednesday, 4/22.

Grad Students
Research paper huddle after class — let's align on topics, scope, and timeline.