01
CYB-4203 / 6203 · Unit 8 · Presentation 17

Differential Privacy
& Federated Learning

Secure & Trustworthy AI · Spring 2026 · Dallas Elleman

Title

Opening. Today we're covering two of the central privacy-enhancing technologies in modern AI/ML: differential privacy and federated learning. We'll start with DP and the history that led us there, then move to federated learning in the second half.

02

The Illusion of Invisibility

We want to collect and analyze data to cure diseases, to train AI, to plan our cities.

But the more detailed that data is, the higher the risk of privacy harms for the data sources — i.e., people.

The field that tries to resolve this paradox is called privacy-enhancing technology.
And to understand where it is now, we have to go back to 1996.

The Illusion of Invisibility

Set up the core tension of the whole unit: the usefulness of data versus the harm that detail creates. Don't belabor specific examples — let the students connect it to things they already care about.

Transition: "The field that tries to resolve this paradox is called privacy-enhancing technology, and to understand where it is now, we have to go back to 1996."

03

1996

Time magazine cover from 1996
The internet in 1996
Oasis, 1996
Governor William F. Weld of Massachusetts, 1996
Gov. William F. Weld
Governor of Massachusetts

The internet is a toddler. Anonymization means taking a black marker to the obvious fields.

1996 — Setting the Scene

Ground the students in the era. The internet is new, spreadsheets are the primary tool, and the standard approach to "anonymization" is: delete names, delete addresses, delete Social Security numbers, and you're done.

This matters because the failure we're about to see isn't a failure of the people involved — they did what everyone thought was enough. It's a failure of the *assumption* that deletion equals anonymity.

04

Direct Identifiers

Group Insurance Commission logo

The Group Insurance Commission in Massachusetts decides (with good intentions) to release anonymized data on state employees' hospital visits — the idea was to let researchers crunch the numbers and find ways to lower healthcare costs.

GIC Hospital Visit Records (1996)
Name | Address | SSN | Ethnicity | Sex | Birth Date | ZIP | Visit Date | Diagnosis | Procedure | Medication | Charge
William F. Weld | 15 Louisburg Sq, Boston | 042-31-8876 | White | M | 1945-07-31 | 02138 | 1996-05-14 | Hypertension | Blood panel | Lisinopril | $1,840
Margaret O'Shea | 88 Beacon St, Boston | 019-44-2210 | White | F | 1962-03-11 | 02116 | 1996-02-22 | Asthma | Pulmonary test | Albuterol | $620
David Chen | 412 Mass Ave, Cambridge | 127-88-5501 | Asian | M | 1978-09-04 | 02139 | 1996-08-03 | Migraine | MRI | Sumatriptan | $2,410
Rosa Alvarez | 27 Blue Hill Ave, Roxbury | 553-12-6049 | Hispanic | F | 1955-12-19 | 02119 | 1996-06-30 | Diabetes, Type II | A1C panel | Metformin | $980
James Whitaker | 9 Elm St, Worcester | 204-76-1183 | Black | M | 1949-11-02 | 01609 | 1996-04-17 | Arthritis | X-ray, knee | Naproxen | $1,275
Linda Petrov | 56 Walnut Ave, Springfield | 396-55-7720 | White | F | 1970-06-24 | 01108 | 1996-07-09 | Depression | Psych eval | Fluoxetine | $540
Marcus Johnson | 301 Seaver St, Dorchester | 611-09-3354 | Black | M | 1983-01-15 | 02121 | 1996-09-28 | Fractured radius | Cast, ortho | Oxycodone | $3,120
Emily Tanaka | 74 Harvard St, Brookline | 445-22-9918 | Asian | F | 1990-04-07 | 02446 | 1996-11-05 | Pneumonia | Chest X-ray | Azithromycin | $1,680
Governor William F. Weld
Gov. Weld · a state employee,
so his record is in the release.

To anonymize the data, they deleted direct identifiers — names, addresses, Social Security numbers — of patients and called it good.

Direct Identifiers

The Group Insurance Commission wanted to help researchers study healthcare costs. Their approach to anonymization was the standard of the era: delete the fields that *obviously* identify someone — name, address, SSN — and publish the rest.

Click "Remove Direct Identifiers" to show what the released version looked like. The direct-ID columns go black. Weld's portrait blurs — because now he's just another row in a sea of "anonymous" patients.

This looked safe. It wasn't. The next few slides show why.

05
Latanya Sweeney
MIT researcher, 1996

Latanya Sweeney, a researcher at MIT, saw a problem — and decided to prove a point.

She bought the voter registration list (a public record) from Cambridge, MA for $20.

She then cross-referenced this voter list with the released medical records, found Governor Weld's personal medical records in the “de-identified” data set, and mailed them to him to demonstrate the problem.

Cambridge, MA Voter Registration List — public record, $20
Name | Address | ZIP | Birth Date | Sex | Date Registered | Party | Last Voted
William F. Weld | 15 Louisburg Sq, Boston | 02138 | 1945-07-31 | M | 1968-09-12 | Republican | 1996-03-05
Anne Driscoll | 210 Broadway, Cambridge | 02139 | 1952-02-18 | F | 1974-10-01 | Democrat | 1996-03-05
Robert Kim | 47 Garden St, Cambridge | 02138 | 1965-11-22 | M | 1987-08-14 | Independent | 1994-11-08
Sarah Goldberg | 12 Brattle St, Cambridge | 02138 | 1971-04-09 | F | 1989-06-03 | Democrat | 1996-03-05
Michael O'Brien | 305 Mass Ave, Cambridge | 02139 | 1948-08-30 | M | 1970-02-19 | Democrat | 1995-11-07
Priya Natarajan | 88 Prospect St, Cambridge | 02139 | 1980-06-17 | F | 1998-09-20 | Democrat | 1996-03-05
Thomas Reilly | 59 Hancock St, Cambridge | 02139 | 1955-01-04 | M | 1977-11-08 | Republican | 1996-03-05
Janet Williams | 22 Oxford St, Cambridge | 02138 | 1963-10-25 | F | 1984-04-15 | Independent | 1994-11-08

Latanya Sweeney — Proving the Point

Sweeney, then an MIT researcher (now one of the world's leading privacy scholars), saw the GIC release and spotted an obvious hole. She decided to make the vulnerability concrete by picking the most visible possible target: the governor.

She walked into the Cambridge city office and paid $20 for the public voter registration list. Now she had two spreadsheets. One had diagnoses. The other had names. Both had three fields in common: birth date, ZIP code, and gender.

Walk the class through the voter table on screen. Note Weld's row: 1945-07-31, male, 02138. Those three values will come back in the next two slides.

Payoff: Sweeney mailed Weld his own medical records to prove the vulnerability. This is one of the most cited demonstrations in the history of privacy research.

06

The Linkage Attack

Linkage attack diagram

Cross-reference two safe-looking datasets on their shared columns — and anonymity vanishes.

The Linkage Attack

The technique has a name: linkage attack. You take two datasets that each look safe on their own, find the columns they share, and use one to re-identify records in the other.

The key insight: anonymity isn't a property of a single dataset. It's a property of a dataset *in the context of everything else that's publicly known*. And in 1996, as now, an enormous amount of data is publicly known.
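
If students want to see the join itself, here is a minimal sketch of the linkage step. The rows are a subset of the illustrative tables from the previous slides, and the function name `link` is a placeholder, not anything Sweeney published. It assumes exact matching on birth date, ZIP, and sex.

```python
# Released "de-identified" medical rows: (birth_date, zip, sex, diagnosis).
medical = [
    ("1945-07-31", "02138", "M", "Hypertension"),
    ("1962-03-11", "02116", "F", "Asthma"),
    ("1978-09-04", "02139", "M", "Migraine"),
]

# Public voter rows: (name, birth_date, zip, sex).
voters = [
    ("William F. Weld", "1945-07-31", "02138", "M"),
    ("Anne Driscoll", "1952-02-18", "02139", "F"),
    ("Robert Kim", "1965-11-22", "02138", "M"),
]

def link(medical_rows, voter_rows):
    """Return (name, diagnosis) pairs whose quasi-identifiers match exactly one voter."""
    by_qi = {}
    for name, birth, zip_code, sex in voter_rows:
        by_qi.setdefault((birth, zip_code, sex), []).append(name)
    matches = []
    for birth, zip_code, sex, diagnosis in medical_rows:
        names = by_qi.get((birth, zip_code, sex), [])
        if len(names) == 1:  # a unique match is a re-identification
            matches.append((names[0], diagnosis))
    return matches

print(link(medical, voters))
```

Neither table contains a name next to a diagnosis. The pairing only exists after the join.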

07
Interactive Demo
Finding the Governor

Click each filter in sequence. Watch the two tables converge.

Anonymized Medical Records
Birth Date | ZIP | Gender | Diagnosis
Cambridge Voter List ($20)
Name | Birth Date | ZIP | Gender
Medical records remaining: 20
Voter records remaining: 30

Interactive: Finding the Governor

Walk through the filters one at a time. Let the class call out what they notice. After the first filter, we're down to 6 matches on each side. After gender, 3. After ZIP, 1.

The mathematical inevitability of that final "1" is the whole lesson. There was only ever going to be one match.

08
87%

of the US population can be uniquely identified using only ZIP code, birth date, and gender. — Sweeney, 2000

Fields like these are called quasi-identifiers — data points that, while not unique alone, can be combined with other public or private information to re-identify individuals.

Data scientists realized they needed a new approach. They couldn't just delete names — they needed a mathematical guarantee of safety.

Common Quasi-Identifiers by Category

Geographic Information
  • ZIP codes, census tracts, counties
  • Business or facility names
  • Movement patterns or locations
  • School districts or neighborhoods
Demographic Details
  • Age ranges or birth years
  • Gender or sex
  • Education level or occupation
  • Race or ethnicity
  • Income brackets
Temporal Data
  • Service, admission, or discharge dates
  • Purchase timelines or patterns
  • Employment history or tenure
  • Subscription duration
  • Event attendance dates
Lifestyle Indicators
  • Vehicle ownership details
  • Educational institutions attended
  • Homeownership status
  • Purchase categories or preferences
  • Leisure activities or memberships

87%

Let that number breathe. Sweeney's 2000 paper quantified what we just saw: three public fields are enough to re-identify the vast majority of Americans.

The crowd we were supposed to be hiding in — it isn't that big. Or maybe we're all much more unique than we thought.

09

k-Anonymity: The Spartacus Strategy

I am Spartacus scene — slaves standing up one by one to claim the identity
“I am Spartacus.”

The Romans demand the slaves identify Spartacus. One by one, every slave stands and claims, “I am Spartacus.” If everyone looks like the target, no one can be singled out.

This is k-anonymity: transform the data so that every record is indistinguishable from at least k − 1 others on its quasi-identifiers. Instead of age 32, store 30–40. Instead of ZIP 02138, store 021**. Suppress the outliers you can’t generalize.

If k = 5, every person in the release is hiding in a crowd of at least five identical-looking records. The Romans can’t pick the right one.

The Spartacus Strategy — k-Anonymity

The first serious attempt to fix the linkage attack was k-anonymity, introduced by Sweeney herself a few years after the Weld incident.

The intuition is the Spartacus scene. If everyone in the group looks the same, the Romans can't pick out the real target.

Mechanically, k-anonymity uses two tools: generalization (turning age 32 into "30–40", turning ZIP 12345 into "123**") and suppression (deleting outliers entirely — if you're the only 104-year-old in the dataset, you get cut).
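
The two tools can be sketched in a few lines. This is a simplified illustration, not a production anonymizer: the bucket widths (non-overlapping decades such as 30-39, 3-digit ZIP prefixes), the sample rows, and the function names are all assumptions made for the example.

```python
from collections import Counter

def generalize(age, zip_code):
    """Coarsen quasi-identifiers: age to a decade range, ZIP to a 3-digit prefix."""
    low = (age // 10) * 10
    return (f"{low}-{low + 9}", zip_code[:3] + "**")

def k_anonymize(rows, k):
    """rows: list of (age, zip, diagnosis). Generalize, then suppress small groups."""
    generalized = [(generalize(age, z), dx) for age, z, dx in rows]
    group_sizes = Counter(qi for qi, _ in generalized)
    # Suppression: drop every row whose group has fewer than k members.
    return [(qi, dx) for qi, dx in generalized if group_sizes[qi] >= k]

# Hypothetical rows; the 104-year-old has no one to hide behind.
rows = [
    (32, "02138", "Asthma"),
    (35, "02139", "Migraine"),
    (38, "02141", "Diabetes"),
    (104, "02138", "Arthritis"),
]
for record in k_anonymize(rows, k=3):
    print(record)
```

With k = 3, the three patients in their thirties end up in one indistinguishable group and the outlier is cut.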

10
Interactive Demo
k-Anonymizing the Governor's Dataset

The same 20 medical records from Finding the Governor. Two tools: generalization (replace precise values with buckets) and suppression (drop rows that can't be hidden in a group of k).

Raw Medical Records (k = 1)
Birth Date | ZIP | Gender | Diagnosis
Generalization

Replace precise values with buckets. Exact birth date → birth decade. Full ZIP → ZIP prefix (021**). Ages become ranges. The data loses detail but each row becomes indistinguishable from others in its bucket.

Suppression

Some rows can't be hidden in a group of k — they're too rare. Drop them entirely. Outliers pay the price for everyone else's anonymity.

Interactive: k-Anonymizing the Medical Records

Same 20 records students saw in the Finding the Governor demo. At k=1 it's the raw table. Click k=3 and the birth dates become 20-year buckets, ZIPs collapse to 021**, and the two records that can't find a group of 3 get suppressed.

At k=5 the buckets widen further. More rows fit into each group, but the data is even blurrier.

Emphasize the two techniques explicitly: generalization (replace with a bucket) and suppression (drop the outlier). These are the only tools k-anonymity has.

Set up the next slide: "So is this safe? Let's look at one more example and see what still leaks."

11
Interactive Demo
k-Anonymity & the Homogeneity Attack
Raw Hospital Dataset (k = 1)
Age | ZIP | Gender | Diagnosis

Interactive: k-Anonymity

Step through the three k values. At k=1 we see raw data. At k=3, age and ZIP are generalized but everyone is still present. At k=5, the generalization is heavier and the two outliers are suppressed.

Then hit Reveal. The homogeneity attack: the five males in the first group all have Heart Disease. The attacker doesn't need to know *which* male is the target. They know what condition he has, because everyone in his hiding group has the same condition.

k-anonymity was patched with l-diversity and t-closeness, but the bigger problem was the curse of dimensionality: in modern high-dimensional data, you'd have to suppress so much that the dataset becomes useless.
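
The homogeneity attack described above is easy to check for mechanically. Here is a rough sketch that counts distinct sensitive values per group, which is the idea behind the simplest ("distinct") form of l-diversity. The released rows and the function name are made up for illustration.

```python
from collections import defaultdict

def homogeneous_groups(rows):
    """rows: list of (quasi_identifier_tuple, sensitive_value).
    Return the groups in which every record shares one sensitive value."""
    values = defaultdict(set)
    for qi, sensitive in rows:
        values[qi].add(sensitive)
    return {qi: next(iter(vals)) for qi, vals in values.items() if len(vals) == 1}

# Hypothetical 5-anonymous release: both groups have five members.
males = ("40-59", "021**", "M")
females = ("40-59", "021**", "F")
released = [(males, "Heart Disease")] * 5 + [
    (females, "Asthma"),
    (females, "Flu"),
    (females, "Asthma"),
    (females, "Migraine"),
    (females, "Flu"),
]

print(homogeneous_groups(released))
```

Both groups satisfy k = 5, but only the second hides anything about diagnosis. A group with a single distinct value fails l-diversity for any l of 2 or more.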

12

Differential Privacy

2006

Dwork, McSherry, Nissim, Smith

Stop protecting the data.
Start protecting the process.

Instead of trying to sanitize a dataset until it is “safe to release,” define a mathematical property of the release mechanism itself: the output of any query should look effectively the same whether or not any single individual is in the database.

That property is called ε-differential privacy. The parameter ε controls how much any one person can influence the answer — and therefore how much they can be leaked.

Play →
Differentially Private Tetris
Feel the effect of ε on gameplay — low budget means noisy controls.

Frosted Glass

You can see the shape. You cannot see the pores.

A diverse crowd
Privacy Budget (ε): ε = 1.00

Compute the true answer — then deliberately blur it with mathematical noise. Low ε = heavy blur = strong privacy. High ε = crisp detail = weak privacy.

Differential Privacy (2006) — Frosted Glass

Dwork, McSherry, Nissim, and Smith published the foundational paper that created the field. The key move was a change of framing: instead of trying to sanitize a dataset so it was "safe to release," they defined a mathematical property of the release mechanism itself.

Plausible deniability: the output of a query should be effectively the same whether or not you, specifically, are in the database.
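
For students who want the formal statement behind "effectively the same," the standard definition can be written as follows. The symbols are the usual textbook ones and do not appear on the slides: M is the randomized release mechanism, D and D' are any two databases that differ in one person's record, and S is any set of possible outputs.

```latex
\[
  \Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon} \cdot \Pr[\,M(D') \in S\,]
  \quad \text{for all neighboring } D, D' \text{ and all output sets } S .
\]
```

When ε is close to 0, the factor e^ε is close to 1 and the two output distributions are nearly identical. As ε grows, the bound loosens and a single person's presence can change the output more.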

The frosted-glass metaphor is the intuition that makes DP click for most people. Through frosted glass, you can see the big picture — beach vs. forest, crowd vs. empty street — but not individual faces. Use the slider to show: low epsilon = heavy blur, high epsilon = clear. Same image, same underlying data; what changes is the privacy budget.

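The mechanism: compute the true answer, then add random noise drawn from a carefully chosen distribution (Laplace, for the basic case). The noise scale is the query's sensitivity (the most one person can change the answer) divided by epsilon, so a smaller epsilon means more noise.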
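
A minimal sketch of that mechanism for a counting query, using only the standard library. It assumes a sensitivity of 1 (adding or removing one person changes a count by at most 1). The function names are placeholders, and the fixed seed is there only to make the demo repeatable, which a real deployment would not do.

```python
import random

def laplace_noise(scale, rng):
    """Draw from Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Laplace mechanism: true answer plus noise with scale = sensitivity / epsilon."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)  # seeded only so the sketch is repeatable
for eps in (0.1, 1.0, 10.0):
    print(eps, round(private_count(1000, eps, rng), 1))
```

Running it a few times with different seeds shows the same pattern as the slider: answers at ε = 0.1 wander by tens, while answers at ε = 10 stay within a fraction of the true count.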

13
Interactive Demo
The Epsilon Dial
Privacy Budget (ε): ε = 1.00
The balance point: moderate privacy, moderate utility
True Mean: $59,400
Reported Mean: $59,400
Privacy Error: ±$0

Interactive: The Epsilon Dial

Low epsilon = thick frosted glass = strong privacy, weak utility. High epsilon = thin glass = weak privacy, strong utility.

Click re-roll a few times so students see the noise is actually random. This is the key distinguishing feature from k-anonymity: there is no single "anonymized dataset" — each query draws fresh noise.

Emphasize: you cannot have perfect privacy *and* perfect accuracy. Someone has to decide where to set the dial.
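
To put numbers on the dial, here is a small worked calculation for a noisy mean. It rests on assumptions that are not taken from the demo: salaries are clipped to a cap of $200,000, the number of records (100) is public, and Laplace noise is used, so one record can shift the mean by at most cap / n.

```python
def expected_error(cap, n, epsilon):
    """Expected |noise| for a Laplace-noised mean of n values clipped to [0, cap]."""
    sensitivity = cap / n          # one record can shift the mean by at most this
    return sensitivity / epsilon   # mean absolute value of Laplace(0, b) is b

for eps in (0.01, 0.1, 1.0, 10.0):
    print(f"epsilon={eps:<5} expected error = ${expected_error(200_000, 100, eps):,.0f}")
```

Each tenfold increase in ε cuts the typical error by a factor of ten. The same is true of n: a larger dataset buys accuracy at a fixed privacy level.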

14

Differential Privacy in the Wild

📱
Apple

Your iPhone adds noise before data ever leaves the device. Apple learns trending emoji. It never sees your keystrokes.

🗺️
Google Maps

"Popular Times" tracks device density with noise. They know the grocery store is busy. They don't know you're in it.

🏛️
2020 US Census

For the first time, the published census counts have mathematical noise baked in, to defeat reconstruction attacks. (State population totals were still reported exactly.)

🏦
Bank Collaboration (SMPC)

Two banks detect money laundering together without ever sharing customer lists. The secret files talk. The humans don't see.

DP in the Wild

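Four real deployments. The first three use DP; the bank example uses secure multi-party computation (SMPC), a related but different privacy-enhancing technology. Apple's local DP is the one students are most likely to have encountered: noise is added on-device before anything is transmitted.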

The 2020 Census is the most consequential. Reconstruction attacks had become feasible enough that the Bureau switched methodology entirely. We'll come back to the consequences of that decision on the next slide.
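
To make "noise added on-device" concrete, here is a sketch of randomized response, the textbook local-DP mechanism. It is not Apple's actual algorithm, which is considerably more elaborate. The function names and the simulated population are illustrative.

```python
import math
import random

def randomized_response(truth, epsilon, rng):
    """Report the true bit with probability e^eps / (1 + e^eps), else the flipped bit."""
    p_truth = math.exp(epsilon) / (1 + math.exp(epsilon))
    return truth if rng.random() < p_truth else not truth

def estimate_rate(reports, epsilon):
    """Debias the observed 'yes' rate to estimate the true population rate."""
    p_truth = math.exp(epsilon) / (1 + math.exp(epsilon))
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth)) / (2 * p_truth - 1)

rng = random.Random(0)  # seeded only so the sketch is repeatable
truths = [True] * 6000 + [False] * 14000   # hypothetical: 30% of devices say "yes"
reports = [randomized_response(t, 1.0, rng) for t in truths]
print(round(estimate_rate(reports, 1.0), 3))
```

Every individual report is deniable because it may have been flipped, yet the aggregate estimate lands close to the true 30%. The collector never needs the raw answers.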

15

Minority Erasure: The Costs of Differential Privacy

If a group has 1,000 people, adding ±5 is a rounding error.

If a group has three people...

...they can mathematically cease to exist.

The Dark Side — Minority Erasure

This is the honest reckoning. DP isn't a free lunch. The noise doesn't affect all groups equally.

Large groups absorb the noise without trouble. Small groups, which in demographic data usually means minority communities, can be noised into oblivion. A group of 3 plus a noise draw of -4 gives -1, which is reported as 0.

This isn't hypothetical. The 2020 Census noise levels were controversial for exactly this reason.
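
The arithmetic behind erasure can be worked out exactly under a simplified model: Laplace noise with sensitivity 1, then rounding and clamping negative results to zero. This is an illustrative model, not the Census Bureau's production algorithm. The group sizes are the ones used in the demo on the next slide.

```python
import math

def erasure_probability(true_count, epsilon):
    """P(count + Laplace(0, 1/epsilon) rounds to zero or below), for true_count >= 1."""
    # The count is erased when the noise is below 0.5 - true_count.
    # For x < 0 the Laplace CDF is 0.5 * exp(x / scale), with scale = 1 / epsilon.
    return 0.5 * math.exp(-(true_count - 0.5) * epsilon)

for n in (950, 12, 7, 5, 3):
    print(n, [round(erasure_probability(n, eps), 3) for eps in (0.1, 0.5, 1.0)])
```

The probability falls off exponentially with group size, which is why the group of 950 is effectively never erased while the group of 3 disappears in a meaningful share of draws at low ε.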

16
Interactive Demo
Minority Erasure
Privacy Budget (ε): ε = 1.00
Groups Tracked: 5
Groups Erased: 0
True Population: 977

Interactive: Minority Erasure

Lower the epsilon and re-roll the noise a few times. The majority group (950) barely moves. The minority groups (12, 7, 5, 3) often collapse to zero at low epsilon.

Census data decides school placement, voting districts, and federal aid distribution. If the noise washes out a small community, they lose funding and political representation. The paradox: the people who most need protection from identification are also the ones most harmed by the loss of accuracy.

Researchers call the active work in this space "equitable differential privacy." It's unsolved.

17

What's at Stake

Funding & Representation

Census data decides where schools get built, where voting-district lines are drawn, and where federal aid lands. If the privacy noise washes out a small community, they lose funding and political representation — even though the people are still there.

Voting Rights Act § 203

Federal law requires jurisdictions to provide ballots and voter materials in a non-English language whenever census data shows a language-minority population above a fixed threshold. If DP noise pushes those counts below the line, the trigger fails — and voters lose translated ballots entirely, disenfranchising communities the protection was designed for.

Read the memo →
Census Distortion Program: The Perils of Differentially Private Census Data
Meredith Strohm Gunter, Weldon Cooper Center for Public Service, University of Virginia · Memorandum to Gov. Ralph Northam, January 23, 2020 · PDF

Minority Erasure — What's at Stake

Two concrete examples of how noise in released counts becomes harm in the real world.

Funding and representation. Federal funding allocations (Title I schools, Medicaid matching, highway funds, all tied to population counts), political boundary drawing, and decisions about where new public infrastructure gets built all run off census data. Small communities noised to zero become invisible to those processes.

Voting Rights Act Section 203. Requires bilingual ballots where a language-minority population exceeds 5% of eligible voters OR 10,000 people in a jurisdiction. The trigger depends entirely on the count being accurate. DP noise on the small end of those populations can push them below the threshold and strip the legal protection.

The memo on the link is from 2020, when the Bureau was finalizing its DP approach. Virginia's state demographer laid out concrete distortion cases and asked the governor to push back. Worth a click after class.

18

Differential Privacy, AI/ML, and the “Matthew Effect”

“For to everyone who has, more will be given… but from the one who has not, even what they have will be taken away.”
— Matthew 25:29

Majority wins, minority loses

The model learns the majority class well, because its signal is strong enough to survive the noise. But it can fail to learn the minority class at all: their signal is the same order of magnitude as the noise added to protect them.

Outliers clipped

DP training mechanisms (e.g. DP-SGD) clip per-example gradients before adding noise. Rare and informative examples — the edge cases — get their influence capped. The model never hears from them.

Bias, amplified

The result: an AI system that is more biased than the same model trained without DP. The model doesn't just inherit bias from the data; the privacy mechanism itself makes it worse.

Erasure, not protection

If you're too unique, the algorithm literally erases you, distorting reality. The same mechanism that promises to protect you from identification also protects the model from learning you exist.

The Matthew Effect

The name comes from sociology of science, ultimately from Matthew 25:29: the rich get richer, the poor get poorer. When applied to DP training, it captures the failure mode exactly.

Mechanically: DP training clips each example's gradient norm so that no single example can influence the model too much, then adds noise to the combined gradient at every step. Both moves disproportionately damage signal from rare classes and outliers.
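
One DP-SGD update can be sketched in plain Python to show where the clipping and the noise enter. This is a simplified illustration with made-up gradients and parameter names (`max_norm`, `noise_multiplier`). It omits the privacy accounting that a real library performs.

```python
import math
import random

def clip(grad, max_norm):
    """Scale a per-example gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    factor = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * factor for g in grad]

def dp_sgd_step(weights, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One step: clip each example's gradient, sum, add Gaussian noise, average, update."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    summed = [sum(col) for col in zip(*clipped)]
    sigma = noise_multiplier * max_norm  # noise is calibrated to the clipping bound
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    n = len(per_example_grads)
    return [w - lr * g / n for w, g in zip(weights, noisy)]

rng = random.Random(0)  # seeded only so the sketch is repeatable
grads = [[3.0, 4.0], [0.3, 0.4]]  # the first example is an outlier with norm 5
print(clip(grads[0], 1.0))
print(dp_sgd_step([0.0, 0.0], grads, lr=1.0, max_norm=1.0, noise_multiplier=1.0, rng=rng))
```

The outlier's gradient is scaled from norm 5 down to norm 1 before it is counted, so a rare but informative example gets no more say than a routine one, and the added noise is the same size as its entire clipped contribution.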

Walk through the four cards. Land on the last one: the fairness story and the privacy story are coupled. You can't fix one without thinking about the other.

19

Differential Privacy, Compliance, and Privacy Theater

EU
GDPR 2018

The General Data Protection Regulation sets strict rules for collection, purpose limitation, and the right to erasure. “Anonymized” data — data that can no longer be linked back to a person — falls outside its scope entirely, creating a strong incentive to really anonymize, not just redact names.

QC
Quebec Law 25 2022

Requires that personal data be irreversibly anonymized before release or reuse — the black-marker method doesn't qualify. Penalties reach $25M CAD or 4% of global revenue. This effectively forces organizations toward techniques like differential privacy.

The loophole

“We use differential privacy.”
Great. What did you set ε to?

The law says “anonymize.” It doesn’t say which technique counts, let alone what ε to use. A company can announce DP for the PR win, ship it with ε = 14 or 20, and claim compliance while offering almost no real protection. Fancy math, safe door wide open. This is privacy theater.

Compliance & Privacy Theater

Regulation is catching up. GDPR (2018) is the landmark: it creates an explicit legal category for "anonymized" data that no longer identifies anyone, and carves that category out of the regulation entirely. So there is a large compliance incentive to achieve real anonymization.

Quebec's Law 25 (2022) goes further and specifies irreversibly anonymized. Under that wording, the old delete-the-names approach doesn't meet the bar. Penalties are severe: up to CAD $25M or 4% of global revenue. Companies operating in Quebec are effectively pushed toward techniques like DP.

The catch: the law mandates the outcome (anonymization) but not the technique, and certainly not the parameters. A company can adopt DP in name, configure it with a huge epsilon, and claim compliance while leaking freely. This is the privacy theater trap: the move the law was trying to block reappears in a new costume.
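
If a student asks why ε = 14 or 20 counts as "almost no real protection," this short calculation helps. It uses the standard reading of the guarantee: one ε-DP release can multiply an attacker's odds that a given person is in the data by at most e^ε. The function name and the 1% prior are illustrative choices.

```python
import math

def max_posterior(prior, epsilon):
    """Upper bound on an attacker's belief after seeing one epsilon-DP output."""
    odds = prior / (1 - prior) * math.exp(epsilon)
    return odds / (1 + odds)

for eps in (0.1, 1.0, 14.0, 20.0):
    print(f"epsilon={eps:<5} odds multiplier <= {math.exp(eps):,.1f}  "
          f"belief from 1% can reach {max_posterior(0.01, eps):.4f}")
```

At ε = 1 an attacker who started at 1% can end up at most a little under 3%. At ε = 14 the bound allows near certainty, so the guarantee no longer rules out much of anything.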

Pedagogical payoff: privacy-as-math-guarantee only works if someone is watching the parameter, not just the label.

20

Privacy is not a switch.
It's a dial.

The hand turning it balances safety, accuracy, and fairness.

Further reading

The Reframe

We've gone from the black marker to injecting measured chaos. From redaction to perturbation. We've accepted that we have to damage the data a little to save the people in it.

The key conceptual shift: privacy isn't a binary. It's a dial that balances three things pulling against each other.

Leave them with the question: is there a point where being *too* private actually hurts? Can you be so hidden from the systems that are supposed to help you that you get left behind? That's a real tradeoff. It doesn't have a clean answer.

21

Intro to Federated Learning

Explorable →
Federated Learning
Google PAIR · pair.withgoogle.com

Intro to Federated Learning

Transition slide into the federated learning half of the unit. Open the Google PAIR explorable live and walk through the first scroll-interactive together. Return to the deck afterward for the FL content.
