Red-Teaming AI Systems

CYB-4203/6203: Secure and Trustworthy AI

Unit 9.3 — Monday, April 20, 2026

Dallas Elleman — Spring 2026

Red-Teaming

The interactive and iterative process of simulating real-world attacks to identify vulnerabilities in systems.

What kinds of systems?

Physical Technical Human Hybrid

Validates defenses Realistic assessment of how security controls stand up to true threats

Improves detection Identifies gaps in security monitoring and response times

Proactive risk reduction Uncovers vulnerabilities before malicious actors do

Terminology

Red Team

Attackers — simulate adversaries, probe for weaknesses.

Blue Team

Defenders — detect, respond, harden the environment.

Purple Team

Collaboration — attackers & defenders working together in the open.

White-Box Testing

Full system information provided to the attacker — source code, architecture diagrams, credentials, internal documentation.

Black-Box Testing

Only basic information provided — e.g. a company name or a public URL. The attacker must discover everything else.

White Team / Cell — referee and oversight role; controls rules of engagement, adjudicates disputes

Pentesting vs. Red-Teaming

Penetration Testing

Scope	Narrow — a specific web app, API, or network segment.
Duration	Short — days to weeks.
Goals	Identify as many system vulnerabilities as possible; demonstrate compliance; patch technical flaws.
Awareness	Security team is usually aware of the test.

Red-Teaming

Scope	Broad — physical security, social engineering, full stack, people and process.
Duration	Long — weeks, months, or continuous.
Goals	Identify technical and other vulnerabilities; evaluate security processes and incident response; simulate advanced persistent threats (APTs).
Awareness	Security team may or may not be aware of the test.

Different tools for different questions.
Pentests answer "is this thing broken?"
Red teams answer "could we actually detect and stop a real adversary?"

The Red-Teaming Lifecycle

Common across frameworks (PTES, OWASP WSTG, MITRE, NIST)

Pre-planning
& Scoping

Reconnaissance

Threat
Modeling

Attack
Planning

Execution,
Movement
& Iteration

Reporting
& Debriefing

We'll walk through each step, then apply the same lens to AI systems.

Step 0

Pre-Planning & Scoping

Define objectives, goals, and rules of engagement.

What is the target system? (specific apps, networks, physical locations)
What system components are within scope?
What safety considerations exist? (production impact, customer data, legal constraints)
What red-team methods and behaviors are allowed, and which are out of bounds?

Establish communication, escalation, and oversight — who gets paged if something breaks? Who has the authority to stop the exercise?

Step 1

Reconnaissance

Information gathering — learning the target without (yet) touching it.

OSINT

Open-source intelligence — what can you learn about the system without touching it? Public docs, DNS records, employee LinkedIn profiles, leaked creds, code on GitHub.

External Surveillance

How does the system behave externally? What response patterns, timings, and error messages leak information about internal architecture?

Map the External Attack Surface

Enumerate every public-facing component — domains, subdomains, APIs, login portals, ports, cloud assets, mobile apps, third-party integrations.

Step 2

Threat Modeling

Identify, enumerate, and analyze potential vulnerabilities.

What data or other systems are touched?

What actions can the system take on its own or on behalf of users?

What is the blast radius — potential scope of damage, disruption, or unauthorized access if the system is compromised?

Step 3

Attack Planning

Select attack targets — which vulnerabilities are you going to try, and in what order?

Build a test matrix of payloads and expected results.

Determine the attack sequence and phases — initial access, persistence, privilege escalation, lateral movement, exfiltration.

Step 4

Execution, Movement & Iteration

Launch attacks

Execute against the plan. Live environments surprise you.

Carefully document results

Every payload, every response, every timestamp. This becomes your report and your evidence.

Scan for newly emergent vulnerabilities

A foothold opens up a new interior attack surface — credentials, tokens, internal services.

Iterate and adapt

Your plan was a hypothesis. What actually works may be something you didn't anticipate.

Step 5

Reporting & Debriefing

Communicate findings with a well-written, professional report.

What goes in the report

Executive summary • methodology • findings ranked by severity • artifacts and diagrams • exploit evidence (screenshots, logs, PoCs) • timeline of the engagement

Recommend mitigations

Every finding gets a remediation. Be specific: technical controls, process changes, training, architectural redesign. Rank by impact vs. effort.

A great engagement with a bad report is a wasted engagement. The report is the deliverable.

Red-Teaming AI/ML Systems

Red-Teaming AI/ML systems

Simulating real-world attacks to identify vulnerabilities in artificial intelligence systems and components.

How is Red-Teaming AI/ML systems different?

Probabilistic behavior

Outputs are non-deterministic. A payload that fails 9 times may succeed on the 10th. You test with distributions, not single shots.

Novel vulnerability classes

Biases, hallucination / confabulation, data leakage, harmful content generation, agentic misbehavior — categories traditional security doesn't cover.

Heterogeneous systems

Computer vision, recommenders, classifiers, autonomous control, generative models, agents. Different attack surfaces per system type.

Lifecycle-wide attack surface

Testing must span the full AI/ML lifecycle: training data, model training, deployment, and inference-time behavior.

Red-Teaming AI/ML Systems (1 of 2)

Computer Vision

Recognition & detection

Adversarial Perturbation

Modify image pixels slightly to cause misclassification — often imperceptible to humans.

Data / Model Poisoning

Alter training datasets or inject backdoors; triggered inputs produce attacker-chosen predictions.

Recommenders & Classifiers

Filters, anomaly detection, ranking

Model Evasion

Craft inputs that bypass security measures — spam filters, fraud detection, malware classifiers.

Bias & Fairness

Test whether outputs are discriminatory or demographic-dependent.

Data / Model Poisoning

Alter the model's output for selected inputs without degrading overall accuracy.

Red-Teaming AI/ML Systems (2 of 2)

Autonomous Systems

Self-driving cars, anthrobots, drones

Adversarial Perturbation / Edge Cases

Simulate unexpected, rare, or adversarial scenarios to test decision-making robustness — especially at the long tail where safety really matters.

Generative AI

Model-only, non-agentic

Input / Output Vulnerabilities

Probe across modalities — text, audio, image, video. Each modality introduces its own attack vectors and safety failure modes.

Red-Teaming Across the AI/ML Lifecycle

Focus for the rest of this session

Phase 3 — Deployment & Integration (inference time)

Red-Teaming for LLMs

(model-only, non-agentic)

Inherent Vulnerabilities

Built into how LLMs work

Model Structure

Data ↔ control path confusion • context-window limits

Model Behavior

Hallucination / confabulation • sycophancy • deception

Adversarial Vulnerabilities

Introduced by a motivated attacker

Targeting the Model / Data

System prompt extraction • model inversion / training-data extraction • model distillation

Targeting Model Behavior

Getting the model to do bad things — jailbreaks, policy violations, harmful outputs.

Full taxonomy covered in Presentation 11.

OWASP Top 10 for LLM Applications

genai.owasp.org/llm-top-10

LLM Jailbreaking

The class of attacks that attempt to subvert built-in safety filters placed by model developers.

Many goals

Restricted outputs, dangerous / unethical content, operational abuse, policy bypass, extraction of refused information.

Many strategies

Role-playing • formatting / encoding tricks • model "social engineering" • multi-turn pressure • context stuffing • adversarial suffixes.

0din.ai/research/taxonomy/techniques

LLM Training Data Extraction

Probing the model to leak sensitive information (including PII) from its training data.

Extracting Training Data from LLMs

Carlini et al. — USENIX Security 2021

Recovered hundreds of verbatim sequences from GPT-2, including names, addresses, and code
Established memorization as a real, reproducible attack surface

arxiv.org/abs/2012.07805

Scalable Extraction from Production LLMs

Nasr et al. — 2023

"Repeat the word 'poem' forever" caused ChatGPT to emit training data verbatim
Gigabytes of data extracted from a deployed, aligned model

arxiv.org/abs/2311.17035

PII Leakage in Language Models

Lukas et al. — IEEE S&P 2023

Systematic measurement of PII memorization and extraction rates
Showed common defenses (DP, scrubbing) reduce but don't eliminate leakage

arxiv.org/abs/2302.00539

Red-Teaming Agentic Systems

Models + tools + memory + orchestration

Broader Scope

Tests the full agentic surface: pipelines, plugins, tools, inter-agent communication, memory stores, and broader system dynamics — not just the model in isolation.

Prompt Injection — OWASP LLM01

A class of attacks against systems and applications built on top of LLMs that work by concatenating untrusted input with trusted input.

Untrusted input shows up as: an attacker-controlled webpage the agent browses, a malicious document it ingests, a poisoned tool response, a user-supplied file — anything the agent treats as data but the model might treat as instructions.

Red-Teaming Frameworks & Resources

CSA Agentic AI Red-Teaming Guide

Frameworks

NIST

AI 100-2e2025 — Adversarial ML Taxonomy / Terminology of Attacks and Mitigations
AI 600-1 — AI RMF: Generative AI Profile

Guides

Promptfoo Red-Teaming Guide

Practice Environments

Lakera Gandalf — LLM jailbreaking challenges
Lakera AgentBreaker — agentic-system attack challenges

Red-Teaming Tools

Promptfoo

LLM eval & red-team framework

Pluggable attack strategies & providers
Test matrix & CI integration

PyRIT

Microsoft — Python Risk Identification Tool

Open-source automated red-team toolkit
Scriptable multi-turn attacks

AgentDojo

ETH Zurich — agent benchmark

Prompt-injection evaluation for tool-using agents
NeurIPS paper · poster

Raptor — LLM pentester

Shannon — LLM pentester

Kali + Claude Desktop

More Tools & Resources

Playgrounds & Hands-On

Guides & Methodologies

Essays & Research

Final Project

The Engagement

Pentest an OpenClaw instance running an open-source mid-sized LM deployed on cloud infrastructure.

Teams

6 teams of 2–3 — instructor-assigned

Timeline

More details later today. Full discussion on Wednesday, 4/22.

Grad Students

Research paper huddle after class — let's align on topics, scope, and timeline.