Units 11, 12, 13 — with a Weekend Briefing

CYB-4203/6203: Secure and Trustworthy AI

Monday, April 27, 2026

Dallas Elleman — Spring 2026

Course Orientation

Section 4 — SYNTHESIS

Last session — Pres 20
Building & Operationalizing Secure AI
How to protect AI/ML systems
Today — Pres 21
Risk, Audit, & the Industry Landscape
Where the field is, and where it's going
Plan: Weekend briefing → Units 11 & 12 (skim) → Unit 13 (focus)

Weekend Briefing

What I worked on, and what I want you to know about

Final-Project Infrastructure — Original System Design

Original final-project system design: RunPod GPU pod with local Ollama, no third-party inference

Final-Project Infrastructure — Revised System Design

Revised final-project system design: CPU-only RunPod pod with HuggingFace Router and Together AI inference

Final-Project Infrastructure — Post-Mortem

Local-GPU pivot to hosted inference. Ollama on RunPod kept hitting host-capacity-fail and was too slow on Blackwell. Pivoted to HuggingFace Router → Together AI → Qwen2.5-7B-Instruct; CPU-only pods for all 6 teams.
Volume-disk vs. network-volume tradeoff. Network volumes can only be Terminated, not Stopped — defeats Stop/Resume. Volume disks die when the host is full. Solution: volume disk + droplet-side rsync backup.
Security model rewrite (v2). Hosted inference means student prompts now traverse a third-party provider. Re-did the threat model end-to-end — verdict: GO with seven compensating controls (no real PII, fictional secrets, $20 spend cap, rate limits, key-rotation script, hourly canary).

Anthropic Research Fellows

Anthropic Research Fellows job posting screenshot

job-boards.greenhouse.io/anthropic/jobs/5023394008

Personal Website — dallas-elleman.com

Interactive personal website / resume demo

Live demo (interactive resume)

Claude Code — Auto Mode

Claude Code auto mode permission classifier

code.claude.com/docs/en/permission-modes#eliminate-prompts-with-auto-mode

Hurricane Hackathon promotional image

Risk Management & Crisis Response

Unit 11 — survey only

Unit 11 — Topics

  • 11.1 NIST AI Risk Management Framework — structure, implementation, and practical application.
  • 11.2 Organizational governance — risk assessment, ethics boards, and accountability structures.
  • 11.3 Incident response and crisis management — preparation, escalation, and recovery protocols.

Independent Auditing, Documentation, & Disclosure

Unit 12 — survey only

Unit 12 — Topics

  • 12.1 Documentation standards — model cards, datasheets, and transparency requirements.
  • 12.2 Preparing for external audits and regulatory review.
  • 12.3 Independent evaluation — third-party testing, certification, and validation processes.
  • 12.4 Stakeholder engagement — disclosure practices, communication, and accountability mechanisms.

Industry Applications & Emerging Challenges

Unit 13 — today's focus

13.1 — Sector-Specific Security & Trust

Four sectors where AI security and trust requirements are the most differentiated — each gets a single best-reference URL you can use as a launching pad.

Healthcare

FDA, HIPAA, clinical AI

Finance

SR 11-7, fair lending, fraud

Defense

DoD RAI, dual-use, ethics

Critical Infrastructure

CISA, EO 14110, ICS/OT

13.1a — Healthcare (U.S.)

SaMD framework + Good Machine Learning Practice

FDA's SaMD classification + the 10 GMLP principles set baseline safety, validation, and clinical-evaluation expectations for any AI/ML medical device.

Predetermined Change Control Plan (PCCP)

Final FDA guidance (December 2024) governs post-market model updates and drift monitoring — the regulatory answer to "the model isn't static after deployment."

Transparency + lifecycle expectations

FDA Transparency Guiding Principles (June 2024) + AI-enabled device lifecycle draft (January 2025) address bias, hallucination risk, and clinician-facing labeling.

Single best reference

13.1b — Finance

Existing model-risk frameworks apply — with gaps

SR 11-7 model risk and third-party-risk guidance carry over to AI; gaps remain for credit unions and generative AI.

AI is dual-use in finance

Amplifies cybersecurity, data-privacy, fair-lending bias, and synthetic-identity fraud risks — while strengthening fraud detection.

Trustworthy deployment essentials

Explainability, bias testing, governance, and continuous monitoring — coordinated across OCC, FDIC, Fed, SEC, CFPB, NCUA.

Single best reference

13.1c — Defense

AI-first warfighting posture

Establishes seven Pace-Setting Projects with mandated cross-Service data access, compute, and talent provisioning — AI moves from "responsible enabler" to capability frontier.

CDAO benchmarks become acquisition criteria

CDAO must publish model-objectivity and trust benchmarks as primary procurement criteria — vendors compete on measured trust, not stated principles.

Subordinate documents still operate

2024 RAI Strategy & Implementation Pathway and DoDD 3000.09 (2023) remain in force as implementing documents under the new top-level strategy.

Single best reference (current)

13.1d — Critical Infrastructure

Four secure-AI principles for OT

Awareness of AI use · threat-informed design · secure-by-design AI · secure AI lifecycle management.

OT-focused, all 16 sectors

Targets operators across the 16 U.S. critical-infrastructure sectors — co-authored with allied cyber agencies (ACSC, CCCS, NCSC-UK, others).

Operationalizes current policy

Implements the July 2025 AI Action Plan's CISA mandate. Effectively supersedes the Biden-era April 2024 DHS guidance (which is still hosted but orphaned by EO 14179).

Single best reference (current)

U.S. AI Policy — Biden → Trump (in 15 months)

Several of the references on the prior slides come from two different administrations. Here's the delta.

Dimension Biden (2021–Jan 2025) Trump (Jan 2025–present)
Overarching EO EO 14110 (Oct 2023) — risk + safety + civil rights EO 14179 (Jan 2025) — "Removing Barriers"; revokes 14110. AI Action Plan (Jul 2025).
Federal AI use OMB M-24-10 + M-24-18 OMB M-25-21 + M-25-22 (Apr 2025); CAIO + risk-mgmt skeleton retained, framing toward speed/innovation
Frontier model reporting DPA-based mandatory reporting + pre-deployment safety-test sharing Mandatory reporting paused; voluntary CAISI engagement; red-teaming reframed as ideological-bias check
"AI Bill of Rights" OSTP Blueprint (Oct 2022), cited across agencies Framing dropped; document moved to Biden archive
AI Safety Institute US AISI at NIST (Nov 2023); MOUs with frontier labs Renamed CAISI (Jun 2025) — standards/competitiveness framing replaces "safety"
State preemption Implicit deference; state experimentation encouraged Active preemption: Dec 2025 EO + DOJ AI Litigation Task Force (legal force contested)
Critical infra AI DHS/CISA April 2024 guidelines under EO 14110 CISA Dec 2025 OT principles (joint w/ allies); 2024 doc orphaned but still hosted

What stayed: NIST AI RMF + GenAI Profile, the institute itself (rebranded), sector regulators, and a rapidly growing body of state AI laws.

Single best reference

13.2 — Live Policy Debates

Three debates that will shape the regulatory environment your career will live in.

Open Model Release

Transparency vs. proliferation

Foundation Model Regulation

Compute thresholds, evals, audits

Governance Evolution

Voluntary → binding, national → international

13.2a — Open Model Release

Benefits of open weights

Research access, competition, privacy — cited and substantiated.

Marginal misuse risks

CBRN, cyber, CSAM, disinformation — framed as marginal risk over the closed-model baseline, not absolute.

Recommended posture

Active monitoring + evidence collection rather than preemptive restriction; preserve flexibility for future regulatory pivots.

Single best reference

13.2b — Foundation Model Regulation

An IPCC-style consensus document for AI

Multilateral: 30 governments + UN, EU, OECD; chaired by Yoshua Bengio; ~100 nominated experts.

Surveys the regulatory toolkit

Capability evaluations, red-teaming, transparency mandates, third-party audits — and the limits of compute thresholds as a policy lever.

Frames the 2025–2026 trajectory

EU AI Act GPAI tier · the AISI network (UK, US, plus follow-ons) · emerging frontier safety frameworks.

Single best reference

13.2c — AI Governance Evolution

Voluntary → binding

NIST AI RMF and OECD AI Principles giving way to enforceable EU AI Act, US executive orders, and state laws.

National → multilateral

AI Safety Institutes proliferating since Bletchley 2023; international AISI network formalized at Seoul 2024 and Paris 2025.

Velocity

Legislative AI mentions up 21.3% across 75 countries in 2024; US state AI laws went from 49 to 131.

Single best reference

13.3 Current Landscape & 13.4 Emerging Technologies

Top student-vote topics — deeper treatment

13.3 — Where the Attacks Actually Live (2025–2026)

Real production incidents have stopped looking like academic adversarial-example papers. The frontier moved.

Indirect prompt injection

EchoLeak (CVE-2025-32711) was the first zero-click LLM exploit at production scale — emails → Copilot → data exfil. The pattern repeats: attacker text reaches the context window through any retrieval channel.

Tool-call abuse / excessive agency

Agents wired into email, calendar, repos, and browsers gain real-world leverage. 2025–2026 incidents centered on agent tool-misuse, not on "the model said something bad."

Supply-chain attacks on models

Pickle-deserialization RCE in HF model files (PyTorch / TensorFlow loaders); typosquatted model names; poisoned LoRA adapters distributed via community hubs.

RAG-corpus poisoning

Wiki, Drive, ticket-system content treated as authoritative once retrieved — the trust boundary is set at indexing time, not query time. (See Pres 20.)

Reference

13.3 — State of the Defenders

AI Safety Institutes

UK AISI, US AISI, Japan AISI, Singapore AI Safety Centre, EU AI Office — doing pre-deployment evals on frontier models. Network formalized at Seoul 2024, Paris 2025.

Lab-internal red teams

Anthropic, OpenAI, Google DeepMind, Meta — structured red-team programs publishing capability/safety reports per release. Frontier safety frameworks now standard.

Eval ecosystem

METR, Apollo Research, Pattern Labs, MLCommons AILuminate, plus academic shops (CHAI, MILA, FAR.AI). Evals are professionalizing — with all the methodology debates that implies.

Defensive frameworks landing

NIST AI RMF + GenAI Profile (AI 600-1), MITRE ATLAS, Meta's Agents Rule of Two, OWASP LLM Top 10. Different layers, complementary, none sufficient alone.

References

13.4 — Emerging Technologies, in the Security Lens

Three frontiers worth watching — each will reshape the attack surface in the next 12–24 months.

Agentic Systems

Tool use, MCP, computer use, multi-agent

Mechanistic Interpretability

Sparse autoencoders, circuits, probes

AI for AI Safety

Constitutional AI, debate, scalable oversight

13.4a — Agentic Systems

Model Context Protocol (MCP)

Anthropic's open standard for tool/server integration with LLM clients. Now the de-facto agent ↔ tool wiring layer (Claude, Cursor, Continue, ChatGPT, others).

Computer use / browser agents

Claude Computer Use, OpenAI ChatGPT Agent (Operator merged in Aug 2025), Google Gemini Agent (Project Mariner winds down May 4, 2026). Agents that drive a real browser or desktop — full web/app capability + every web/app vulnerability.

Multi-agent orchestration

Long-running agent crews (engineering, research, ops). New problems: cross-agent prompt injection, principal/delegate identity, audit trails across agent boundaries.

Reasoning + extended thinking

GPT-5.5 Thinking (o-series folded into GPT-5 line), Claude Opus 4.7 adaptive thinking, DeepSeek V4-Pro (R1 superseded; V4 collapses chat + reasoner). Models that plan before acting — longer attack chains.

Reference (April 2026)

13.4b — Mechanistic Interpretability

From "black box" to "we can name some circuits"

Sparse autoencoders identify human-interpretable features inside frontier models. Anthropic's Scaling Monosemanticity mapped millions of features in Claude 3 Sonnet; later work extends this to Claude 3.5+ and Llama-class models.

Why it matters for security

If we can detect deception features, sycophancy features, or jailbreak-trigger features at activation time, we get a defense layer that doesn't depend on prompt-text classification.

Where it's still pre-paradigmatic

Feature-finding works; causally intervening to suppress unsafe behavior at scale, in deployed systems, is research-grade. Don't ship interpretability-based safety as your only line.

References

13.4c — AI for AI Safety

Scalable oversight

Use AI systems to help humans evaluate other AI systems on tasks humans cannot reliably grade alone. Constitutional AI, RLAIF, debate, recursive reward modeling.

Automated red-teaming

Models that generate jailbreaks, edge cases, and adversarial inputs against other models. Currently used in production at every frontier lab.

Verifiable / formal-method assists

LLMs as theorem-proving assistants (Lean, Coq), policy-as-code generators, and formal-spec drafters — bringing program-verification rigor into AI safety.

References

What's Next

Wednesday, April 29 — Week 14, Session 2
Unit 14 — Career pathways, professional development.
Monday, May 4 — Final Class Session
Synthesis & wrap-up + final exam review.
Final Project
Due date TBD — will be announced once finalized.
Final Exam Time
I'll send out a survey — we get to choose. Watch your email.