whitepaper · v0.1 draft

Toward a Typed Framework for American Privacy

Last revised 2026-04-21

00 / Abstract

Privacy in America is governed by a patchwork of constitutional doctrine, sectoral statutes, common-law torts, state-level initiatives, and industry self-regulation. Each tradition carries its own vocabulary, its own threat model, and its own unit of harm. Synthesis across traditions happens informally, in footnotes, at high cost.

This paper proposes a typed graph substrate for reasoning about American privacy at scale. Nodes represent conceptual primitives --- rights, actors, mechanisms, precedents, threat models. Edges carry curves: functions describing how one quantity varies with another. Curve bodies are drawn from a closed registry; the model never invents the math. A reasoning seam built on Anthropic's Structured Outputs API lets large language models propose nodes, edges, and curve parameters while remaining inside the typed schema.

The goal is not a canonical ontology. It is a substrate on which many ontologies can be tested, revised, and composed.

01 / Introduction

[DRAFT]

02 / Why a Graph

[DRAFT]

03 / Substrate

The substrate is deliberately generic. Two content kinds exist at v0: Node and Edge. Each is persisted as a single YAML file, addressed by an ID that is prefixed by its storage kind (node_ or edge_) and, in the simplest case, carries a six-hex body (§3.1).

3.1 Nodes

A node is any named concept the graph reasons about. Each node carries:

id --- node_<body>, where body is one or more lowercase alphanumeric segments joined by _ or -. Three generator functions cover the common cases: new_id (six-hex random), slug_id (human-readable slug), content_id (slug plus six-hex content hash). The regex is the source of truth; drift against it fails validation.
kind --- open string at v0; tightens to a closed Literal enum once the domain taxonomy commits.
name, description --- human-readable.
metadata --- arbitrary JSON for citations, statute numbers, dates.
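
The ID rules above can be sketched as follows. This is a minimal illustration, not the engine's code: the regex is one reading of "lowercase alphanumeric segments joined by _ or -", and the function signatures, the slugging rule, and the use of SHA-256 for the content hash are all assumptions.

```python
import hashlib
import re
import secrets

# Assumed reading of the rule: "node_" prefix, then one or more lowercase
# alphanumeric segments joined by "_" or "-".
NODE_ID_RE = re.compile(r"^node_[a-z0-9]+(?:[_-][a-z0-9]+)*$")


def new_id() -> str:
    """Random ID: six hex characters."""
    return f"node_{secrets.token_hex(3)}"


def slug_id(name: str) -> str:
    """Human-readable ID: lowercase, runs of other characters become '-'."""
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    if not slug:
        raise ValueError(f"cannot derive a slug from {name!r}")
    return f"node_{slug}"


def content_id(name: str, content: bytes) -> str:
    """Slug plus the first six hex characters of a hash of the content.

    SHA-256 is an assumption; the paper only says "six-hex content hash".
    """
    digest = hashlib.sha256(content).hexdigest()[:6]
    return f"{slug_id(name)}-{digest}"


def validate_id(candidate: str) -> str:
    """Return the ID unchanged, or raise if it drifts from the regex."""
    if NODE_ID_RE.fullmatch(candidate) is None:
        raise ValueError(f"invalid node id: {candidate!r}")
    return candidate
```

Routing every generator's output through the same regex keeps the "regex is the source of truth" rule testable: a generator that drifts fails validation rather than silently widening the ID space.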

3.2 Edges

An edge connects two nodes and carries a curve:

id --- edge_<body>, same shape as node IDs (§3.1).
kind --- open string at v0.
from_node / to_node --- references to node IDs. Referential integrity is checked at build time, both by the Python loader and by Astro's reference("nodes") binding.
curve --- see §04.
metadata

Edges are first-class files rather than embedded records because they carry content --- the curve. A design that stores edges as scalar weights on node records would force the curve to live either inline (bloating node files) or in a secondary structure (splitting the edge's identity across two places).
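
A minimal sketch of the two record shapes and of the build-time referential check on the Python side. The class names, field defaults, and error-message format are assumptions for illustration; the actual loader may differ.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class Curve:
    name: str                 # a name from the closed registry (section 04)
    params: dict[str, float]


@dataclass(frozen=True)
class Node:
    id: str
    kind: str
    name: str
    description: str = ""
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class Edge:
    id: str
    kind: str
    from_node: str
    to_node: str
    curve: Curve | None = None   # assumed optional: an edge may declare no curve
    metadata: dict[str, Any] = field(default_factory=dict)


def dangling_references(nodes: list[Node], edges: list[Edge]) -> list[str]:
    """Return one message per edge endpoint that names no known node."""
    known = {n.id for n in nodes}
    problems: list[str] = []
    for e in edges:
        for end in ("from_node", "to_node"):
            ref = getattr(e, end)
            if ref not in known:
                problems.append(f"{e.id}: {end} -> {ref} not found")
    return problems
```

A build would fail when the returned list is non-empty. Because the curve lives on the edge record, a node file never has to change when an edge's curve does.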

04 / Curves

4.1 Closed-set registry

A curve is a pair (name, params) drawn from a closed registry. Curve bodies live in code; the model never invents curve math. The v0 registry (rendered from src/lib/curves.ts at build time, drift-checked against apai/curves/registry.py):

name	params	y(x)
`constant`	`{c}`	`y = c`
`linear`	`{a, b}`	`y = a*x + b`
`sigmoid`	`{L, k, x0}`	`y = L / (1 + exp(-k*(x - x0)))`
`step`	`{high, low, threshold}`	`y = low if x < threshold else high`
`exponential`	`{a, k}`	`y = a * exp(k*x)`
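
The table can be read as a lookup from name to (required params, body). The sketch below is one way to write that in Python; the dictionary layout and the error messages are assumptions, and evaluate_curve is used here only because the roadmap (§08) names it.

```python
import math
from typing import Callable, Mapping

Params = Mapping[str, float]

# name -> (required parameter names, curve body). Mirrors the table above.
REGISTRY: dict[str, tuple[frozenset[str], Callable[[Params, float], float]]] = {
    "constant": (frozenset({"c"}), lambda p, x: p["c"]),
    "linear": (frozenset({"a", "b"}), lambda p, x: p["a"] * x + p["b"]),
    "sigmoid": (
        frozenset({"L", "k", "x0"}),
        lambda p, x: p["L"] / (1 + math.exp(-p["k"] * (x - p["x0"]))),
    ),
    "step": (
        frozenset({"high", "low", "threshold"}),
        lambda p, x: p["low"] if x < p["threshold"] else p["high"],
    ),
    "exponential": (frozenset({"a", "k"}), lambda p, x: p["a"] * math.exp(p["k"] * x)),
}


def evaluate_curve(name: str, params: Params, x: float) -> float:
    """Evaluate a registry curve; unknown names and wrong params are errors.

    Overflow for extreme exponents is not handled in this sketch.
    """
    if name not in REGISTRY:
        raise KeyError(f"unknown curve {name!r}; the registry is closed")
    required, body = REGISTRY[name]
    if set(params) != required:
        raise ValueError(
            f"{name} expects params {sorted(required)}, got {sorted(params)}"
        )
    return body(params, x)
```

For example, evaluate_curve("linear", {"a": 2.0, "b": 1.0}, 3.0) evaluates 2*3 + 1. A name outside the five entries raises instead of being interpreted, which is what "closed" means in practice.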

4.2 Semantics by node kind

The registry is shape-generic. Its meaning --- what the y-axis represents --- is fixed by the kind of the node the curve belongs to. That meaning is carried by an implicit (kind → semantic) map, which is machine-readable; the model does not choose what a curve means, the substrate does.

Illustrative assignments, pending operator ratification (see §10):

person → privacy tolerance. A scalar position in [0, 1] on the §9.1 axis. §4.1 shapes apply; linear and sigmoid cover most interior cases.
society → privacy tolerance distribution. A distribution over the same [0, 1] axis. The §4.1 registry does not yet express this shape (see §07, q. b).
law → optional. May carry an erosion curve (fraction of original protection still in force as a function of time-since-enactment or times-challenged) or may carry nothing; not every node needs a curve.
source → impact weight. A scalar in [0, 1] applied when the source contributes to a population-curve reconstruction (see §9.3).

Two commitments:

Not every node kind carries a curve. The (kind → semantic) map assigns one only where one is meaningful.
The map is drift-checked Python ↔ TypeScript, like the §4.1 registry. An edge whose terminal node kind has no assignment must either adopt a default or declare no curve.
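
A sketch of the (kind → semantic) map and the resolution rule in the second commitment. The semantic labels, the function name, and the way a default is supplied are assumptions; the assignments themselves are the illustrative ones listed above and remain pending ratification.

```python
from typing import Optional

# Illustrative assignments only (pending operator ratification).
KIND_SEMANTICS: dict[str, str] = {
    "person": "privacy_tolerance",
    "society": "privacy_tolerance_distribution",
    "law": "erosion",            # optional: a law node may carry no curve
    "source": "impact_weight",
}


def y_axis_meaning(
    terminal_kind: str, has_curve: bool, default: Optional[str] = None
) -> Optional[str]:
    """Resolve what y means for a curve, or None when no curve is declared.

    A curve on a kind with no assignment must name a default explicitly;
    otherwise it is rejected.
    """
    if not has_curve:
        return None
    semantic = KIND_SEMANTICS.get(terminal_kind, default)
    if semantic is None:
        raise ValueError(
            f"kind {terminal_kind!r} has no assignment: adopt a default or declare no curve"
        )
    return semantic
```

The point of the shape is that the caller never passes a meaning chosen by the model: it passes a kind, and the map answers.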

4.3 Drift prevention

The Python engine and the TypeScript site both hand-author the registry; they are not generated from a shared spec. At test time the Python side writes the registry to engine/tests/_curves.json; the TypeScript side reads that file and asserts equality against its own CURVE_NAMES and required-param sets. Drift is caught in CI, not at runtime.
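
The export-and-compare step might look like the sketch below. It is written entirely in Python for illustration: in the described setup the comparison runs on the TypeScript side, and the JSON layout (name mapped to a sorted list of required params) is an assumption.

```python
import json
from pathlib import Path

# Hypothetical stand-in for the Python registry: name -> required params.
PY_REGISTRY: dict[str, list[str]] = {
    "constant": ["c"],
    "linear": ["a", "b"],
    "sigmoid": ["L", "k", "x0"],
    "step": ["high", "low", "threshold"],
    "exponential": ["a", "k"],
}


def export_registry(path: Path) -> None:
    """Write the registry in a stable, sorted form so diffs are meaningful."""
    payload = {name: sorted(params) for name, params in PY_REGISTRY.items()}
    path.write_text(json.dumps(payload, indent=2, sort_keys=True) + "\n", encoding="utf-8")


def registry_drift(path: Path, other: dict[str, list[str]]) -> list[str]:
    """Compare the exported file with another side's registry.

    Returns one message per curve that is missing or whose params differ.
    """
    exported = json.loads(path.read_text(encoding="utf-8"))
    problems: list[str] = []
    for name in sorted(set(exported) | set(other)):
        if name not in other:
            problems.append(f"{name}: missing on the other side")
        elif name not in exported:
            problems.append(f"{name}: missing in export")
        elif sorted(exported[name]) != sorted(other[name]):
            problems.append(f"{name}: params differ")
    return problems
```

A CI test would assert that the returned list is empty. Comparing sorted parameter sets rather than raw text keeps the check insensitive to ordering differences between the two hand-authored sources.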

05 / Reasoning Seam

Reasoning is delegated to large language models via Anthropic's Structured Outputs API with strict tool use. The single call pattern the engine exposes is:

client.structured(
    model=Model.OPUS,
    system_blocks=system_blocks_for_query(extra_context=graph_snapshot),
    user_message="Propose a curve for the edge from katz_v_us to reasonable_expectation.",
    tool={
        "name": "propose_curve",
        "description": "Emit a curve drawn from the registry.",
        "input_schema": { ... },
    },
)

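Under strict: true, schema conformance is enforced by the API rather than by the client: the model's tool input is constrained to the provided JSON schema before it is returned. No retry loop for malformed output is written on our side; every completed tool call is, by construction, a structurally legal curve. Structural legality is all the schema guarantees. A response cut off by the token limit or ended by a refusal may not contain a complete tool call, and constraints the schema does not express (parameter ranges, for instance) still need a check in the engine.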

System blocks default to an ephemeral-cached preamble, so once graph context grows past the caching threshold the seam is already shaped to take advantage of it.
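
The call above elides input_schema. One way such a schema could be derived from the registry is sketched below. Everything here is hypothetical: the "curve" wrapper property, the use of anyOf with a single-value enum per variant, and the helper names are illustration only, and whether strict mode accepts each keyword should be checked against the API documentation.

```python
import json

# Hypothetical mirror of the registry: name -> required params (section 4.1).
REQUIRED_PARAMS: dict[str, list[str]] = {
    "constant": ["c"],
    "linear": ["a", "b"],
    "sigmoid": ["L", "k", "x0"],
    "step": ["high", "low", "threshold"],
    "exponential": ["a", "k"],
}


def curve_variant(name: str, params: list[str]) -> dict:
    """Schema for one registry entry: a fixed name plus exactly its params."""
    return {
        "type": "object",
        "properties": {
            "name": {"type": "string", "enum": [name]},
            "params": {
                "type": "object",
                "properties": {p: {"type": "number"} for p in params},
                "required": list(params),
                "additionalProperties": False,
            },
        },
        "required": ["name", "params"],
        "additionalProperties": False,
    }


def propose_curve_schema() -> dict:
    """Top-level object schema whose 'curve' must match one registry entry."""
    variants = [curve_variant(n, p) for n, p in REQUIRED_PARAMS.items()]
    return {
        "type": "object",
        "properties": {"curve": {"anyOf": variants}},
        "required": ["curve"],
        "additionalProperties": False,
    }


if __name__ == "__main__":
    print(json.dumps(propose_curve_schema(), indent=2))
```

Generating the schema from the registry, rather than writing it by hand, means a sixth curve would appear in the tool definition the moment it is admitted, and nowhere before.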

[DRAFT]

06 / Node Kinds

[DRAFT]

Candidate sets to evaluate:

right --- a specific protected interest (Fourth Amendment, informational self-determination).
precedent --- a decided case that crystallized or shifted doctrine.
statute --- a codified rule (HIPAA, CCPA, VPPA).
law --- umbrella for codified-or-established legal rules; may resolve to statute or precedent once the split is committed. Under §4.2 it optionally carries an erosion curve.
actor --- an entity that can collect, process, or disclose (government agency, data broker, platform).
person --- a named individual. Carries an implicit privacy-tolerance curve (§4.2, §9.1). Most instances in the graph are public figures, litigants, or counterparties; the framework does not silently construct per-citizen dossiers.
society --- a population or cohort. Carries an implicit privacy-tolerance distribution (§4.2, §9.2). Typically one-per-jurisdiction at first (US, California, etc.); may split by cohort later.
source --- a primary source used to reconstruct a population curve (§9.3): court filing, breach notice, transparency report, journalistic investigation. Carries an impact-weight curve (§4.2).
mechanism --- the means by which privacy is impinged or preserved (geofence warrant, k-anonymity, differential privacy).
threat_model --- a specific adversary-capability pairing.
jurisdiction --- where rules apply (federal, state, sectoral).

Each needs a definition, an ID prefix, at least three canonical examples, and a rule for what edge kinds may connect it to what other node kinds. Not every kind carries a curve; see §4.2.

07 / Open Questions

Labeled a–j so other sections can cite specific questions stably (e.g. "§07, q. b"). Resolved questions stay in place with a Resolved marker, not removed.

Resolved. a. Is privacy a scalar or a bundle? Scalar at v0. Resolved by axiom_scalar-tolerance: the American rule-space (Constitution, Bill of Rights, sectoral statutes, torts) is rich enough to carry the structure a bundle would otherwise encode. Expansion to parallel axes is deferred until the scalar demonstrably fails to distinguish a case the framework needs to distinguish.
b. Shape mismatch: distribution curves. §4.1 is scalar → scalar. §9.2's population curve is a distribution, and its reshape under a shock is distribution → distribution. Two candidate resolutions: (i) a second composable "distribution" registry; (ii) bin the distribution and apply §4.1 per bin. Neither committed.
c. Representing genuine disagreement. Competing doctrinal readings of Katz, for instance. Multiple edges with different curves? One edge with a distribution of curve parameters? An explicit disputed_by meta-edge?
d. Keeping the graph current. As case law and statute evolve, what is the cost of an out-of-date curve on an edge, and what signal marks one as stale? Does every edge carry a last_validated_at trigger?
e. Registry admission for DP families. When do the f-DP and Rényi DP families enter §4.1, and how are they named so they compose with the existing shape-generic set without masquerading as members of it? The general form of this question --- the rule by which a new curve enters the registry at all --- has no stated answer.
f. Operationalizing individual tolerance. §9.1 describes anchor points vividly. §9.3 commits to reconstructing populations from primary sources. What remains open: for a named individual (a litigant, a public figure, a counterparty) --- how is their position on the axis assigned? Inferred from behavioral signals? Self-reported in filings? Left null unless explicitly claimed? Partially answered by axiom_transparency-first: revealed tolerance is taken at face value, because the framework separately requires disclosure to be legible --- collapsing the gap between revealed and hypothetical. What remains open is the assignment mechanism itself.
g. Rigorous primary-source reconstruction. §9.3 sketches a workflow: AI parses primary sources, emits typed mass-movement claims, the population curve is the weighted aggregate. The methodology is not rigorous yet. Adversarial sources, missing-data biases, confounded signals, and the weighting function itself are all unresolved.
h. Site ↔ engine boundary. The site is a static Astro build. The engine evaluates curves in Python. How does the site show live curve values --- are they precomputed at build time and baked into the bundle, or is there a runtime service? Neither is currently built.
i. Source of truth for the kind taxonomy. When kinds narrow from open string to Literal[...] (roadmap v0.1), where does the canonical list live? Python only, with TypeScript mirror? Both hand-authored with drift check like §4.1? A third shared artifact?
j. Governance loop formalization. §10 commits to propose-then-review (reasoning outputs never auto-canonicalize). The mechanism --- directory split, status field, dedicated review UI, PR-based --- is not yet chosen.
Resolved. What does y represent, by default, on an edge? Answered by §4.2: the y-axis is fixed by an implicit (node kind → semantic) map. A given node kind either has an assignment or does not carry a curve.

08 / Roadmap

v0 (done) --- typed substrate, closed-set curve registry (5 shapes), reasoning seam over Structured Outputs, site scaffold, drift-checked Python↔TypeScript parity.
v0.1 --- commit node-kind taxonomy (§06). Narrow kind: str to Literal[...]. Add per-kind validators. Commit the (kind → semantic) map from §4.2, drift-checked like §4.1. Commit the proposed-vs-canonical split (§10) as a concrete mechanism (§07, q. j).
v0.2 --- seed ~30 canonical nodes + edges covering Fourth Amendment doctrine, common-law privacy torts, and major federal sectoral statutes. All seeding flows through propose → operator review → canonicalize.
v0.3 --- first real reasoning query: given the current graph, propose a missing edge and justify it. Proposals land in the proposed tier only; the analysis page renders them alongside their justifications for operator review.
v1.0 --- graph size justifies Kùzu embedded graph + MCP tools; evaluate_curve is exposed as a callable tool rather than a prompt hint.

09 / The Privacy Curve

The curve formalism of §04 is shape-generic. One load-bearing application of it --- and one of the reasons this framework exists --- is the curve that describes privacy tolerance: how much of their own data a given person will cede under what conditions, and how much intrusion they will absorb before they resist.

Privacy tolerance is not a single number. The v0 axis is scalar (§07, q. a), but a person's position on it varies with context, and two people who both report "I care about privacy" can mean structurally different things. Bucketing them under one number hides the structure that any policy, product, or doctrinal argument eventually runs into.

9.1 Individual tolerance --- anchor points

A person's privacy tolerance is a curve over context: y = f(x) with y ∈ [0, 1], where x is an adversary or situation parameter. The anchor points below summarize what a person's curve tends to average to; they are calibration points, not per-context values.

Normalize the position to a tolerance axis on [0, 1]. Five anchors, two at the endpoints and three in the interior, calibrate it:

0.00 --- surveillance-welcoming. Would consent to arbitrary state observation: a camera in the bathroom, a knock on the door answered with cooperation and boot-licking. The concept of state intrusion as a loss has not formed.
0.25 --- default-normie. Uses Facebook, Google, Instagram as given. Unfamiliar with ad blockers. Often poor, often ESL, often on a stock Android phone. "Digital privacy" is not a category they have an opinion on because the category has not been made legible to them.
0.50 --- aware but unbothered. Recognizes a cookie banner. Has an ad blocker installed. Clicks deny cookies sometimes, not always. Has heard of cryptocurrency, vaguely. Registers the topic; does not spend cycles on it.
0.75 --- operationally private. VPN for all traffic. SimpleLogin aliases behind Proton. Torrents games; has since childhood. Treats compelled disclosure as an adversarial act --- "accessing my digital data is akin to digital rape" is a representative framing from this cohort. Wants a warrant attached to every request.
1.00 --- off-grid. Legally dead. Lives alone with one trusted counterpart in a cabin in the woods. All communication is in-person or over hardware they built themselves with their own E2E-encrypted radio stack. Blocks all outgoing wireless and RF signals. No grid electricity. Never appears in public.

The anchors are extremes, intended for calibration. Most real positions are interior. The point of the curve is that the interior is where policy, product, and jurisprudence actually bite --- and the interior has shape. It is not a straight line from endpoint to endpoint.

9.2 The population curve

A curve over one individual is a tolerance. A curve over the population is a distribution of tolerances. Both are first-class objects in the graph; they are not the same object.

Policy built on the mean --- most Americans are somewhere around 0.4, so design for 0.4 --- is the wrong shape. Three specific failure modes:

Assumes unimodality. The empirical distribution is almost certainly not unimodal: a large mass in the 0.25–0.50 band, a secondary peak in the 0.70–0.85 technical-privacy band, a long right tail toward 1. A single-peaked model misses both peaks.
Flattens variance. Two populations with identical means and radically different variances receive identical policy. Variance is what determines how many people the policy visibly fails.
Treats position as stable. An individual's position moves in response to specific events --- a Dobbs-era data request, a breach, a geofence warrant served on a neighbor. Shocks propagate asymmetrically. The same event that moves the 0.50 cohort to 0.60 may move the 0.75 cohort to 0.95.
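
The variance failure mode can be made concrete with a toy calculation. The two populations below are invented for illustration, and the idea that a policy "fits" anyone within 0.10 of its design point is an assumption made only to have something to count.

```python
from statistics import fmean, pvariance


def share_outside(positions: list[float], design_point: float, band: float = 0.10) -> float:
    """Fraction of positions further than `band` from the design point."""
    misses = sum(1 for p in positions if abs(p - design_point) > band)
    return misses / len(positions)


# Two invented populations with the same mean (about 0.4) on the tolerance axis.
narrow = [0.35] * 3 + [0.40] * 4 + [0.45] * 3
bimodal = [0.25] * 7 + [0.75] * 3

for label, population in (("narrow", narrow), ("bimodal", bimodal)):
    mean = fmean(population)
    print(
        label,
        f"mean={mean:.2f}",
        f"variance={pvariance(population):.4f}",
        f"outside_band={share_outside(population, mean):.0%}",
    )
```

Designing for the mean serves everyone in the narrow population and no one in the bimodal one, although a summary that reports only the mean cannot tell the two apart.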

In the substrate, the population curve is a society node whose edges to events (breaches, rulings, disclosures) describe how the distribution reshapes under a given shock.

Open shape issue. The §4.1 registry is scalar → scalar. A distribution over [0, 1] is not a scalar y, and a "reshape under shock" is a distribution → distribution transform, not a scalar curve. The registry as it stands cannot express these. Two plausible resolutions, neither yet committed: (a) a second, composable "distribution" registry alongside §4.1; (b) representing a distribution as a vector of scalar curves (PDF bins), so §4.1 shapes apply per bin. Tracked as §07, q. b.
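
Resolution (b) can be sketched as follows, purely to show what "one scalar curve per bin" would mean. The four-bin layout, the linear coefficients, and the clip-then-renormalize step are all assumptions; in particular, renormalization is an extra design choice that the per-bin reading does not supply on its own.

```python
from typing import Callable

ScalarCurve = Callable[[float], float]


def linear(a: float, b: float) -> ScalarCurve:
    """The section 4.1 linear shape, y = a*x + b."""
    return lambda x: a * x + b


# Four equal-width bins over [0, 1]. Each bin's unnormalized mass is a scalar
# curve of shock intensity x in [0, 1]. Coefficients are invented.
BIN_CURVES: list[ScalarCurve] = [
    linear(-0.10, 0.20),   # bin [0.00, 0.25)
    linear(-0.20, 0.45),   # bin [0.25, 0.50)
    linear(0.10, 0.25),    # bin [0.50, 0.75)
    linear(0.20, 0.10),    # bin [0.75, 1.00]
]


def distribution_at(x: float) -> list[float]:
    """Evaluate every bin curve at x, clip negatives, renormalize to sum 1."""
    raw = [max(curve(x), 0.0) for curve in BIN_CURVES]
    total = sum(raw)
    if total == 0.0:
        raise ValueError("all bin masses are zero; the distribution is undefined")
    return [mass / total for mass in raw]
```

Here distribution_at(0.0) is the pre-shock shape and distribution_at(1.0) the post-shock shape, with mass moving from the lower bins toward the upper ones. The sketch also exposes the cost of this option: the bins are coupled only through renormalization, so nothing in the registry itself guarantees a valid distribution.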

9.3 Constructing the distribution from primary sources

There is no path to a census of American privacy tolerance. The people whose position matters most --- the 0.25 default-normie cohort --- are, by construction, the population least equipped to report their own position: the category of "digital privacy" has not been made legible to them. Self-report surveys over-sample the already-mobilized.

The substrate's answer is to construct the population curve by aggregating primary sources: court filings, breach disclosures, consent-rate telemetry, legislative testimony, platform transparency reports, journalistic investigations. Each source is a node; its impact weight (per §4.2) is a curve encoding how much that source should shape the reconstructed population curve, as a function of source properties (recency, reach, methodological rigor, adversarial incentive of the publisher).

Parsing is delegated. Primary sources are typically unstructured; extracting a tolerance signal from them is an AI-reasoning task that falls under the same Structured Outputs seam as §05. The seam emits a typed claim --- "this source moves the mass at [0.25, 0.50] by approximately X under conditions Y" --- and the population curve is the aggregate of such claims, weighted by source impact.
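
A toy version of "the aggregate of such claims, weighted by source impact". The claim fields, the binning, and the additive update followed by renormalization are assumptions chosen for brevity; they are one possible aggregation rule, not a proposed methodology (see §07, q. g).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MassClaim:
    lo: float       # lower edge of the affected band on the tolerance axis
    hi: float       # upper edge of the affected band
    delta: float    # claimed change in mass for the band as a whole
    weight: float   # impact weight of the source, in [0, 1]


def aggregate(baseline: list[float], claims: list[MassClaim]) -> list[float]:
    """Apply weighted claims to equal-width bins over [0, 1], then renormalize."""
    n = len(baseline)
    width = 1.0 / n
    masses = list(baseline)
    for claim in claims:
        # A bin is affected when its centre falls inside the claimed band.
        covered = [i for i in range(n) if claim.lo <= (i + 0.5) * width < claim.hi]
        for i in covered:
            masses[i] += claim.weight * claim.delta / len(covered)
    masses = [max(m, 0.0) for m in masses]
    total = sum(masses)
    if total == 0.0:
        raise ValueError("claims removed all mass; the distribution is undefined")
    return [m / total for m in masses]
```

With a flat four-bin baseline, a claim of +0.10 on [0.25, 0.50) from a source weighted 0.5 raises the second bin relative to the others; the same claim from a source weighted 0.0 changes nothing.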

This is where the framework diverges from textbook statistics. We are not sampling a population that can be sampled. We are reconstructing a distribution from adversarially collected observational data. The methodology for doing so rigorously is itself an open research problem (§07, q. g).

9.4 Why the curve

A single reported statistic --- "64% of Americans are concerned about privacy" --- is a projection onto one number. It throws away almost everything that determines whether a given intervention lands. A curve keeps:

Shape. Bimodal, unimodal, and long-tailed are structurally different policy environments, even at the same reported mean.
Response. The slope at a given position predicts how much of the population moves under a marginal change in pressure. Sigmoid regions and linear regions imply different interventions.
Tails. The people who file the lawsuits, build the circumvention tools, and author the manifestos live in the tails. They are not outliers to round off. They are a coupled, disproportionately loud minority whose actions reshape the body.

The framework is built to represent these curves as first-class citizens of the graph, not as footnotes to a one-dimensional concern score. That is why the project exists.

10 / Governance

Reasoning outputs (§05) never land in the canonical graph directly. The substrate distinguishes two states:

Proposed. Everything the reasoning seam emits: proposed curves, proposed nodes, proposed edges, proposed (kind → semantic) assignments. Proposed records are persistent, inspectable, and typed. They are not loaded by graph consumers.
Canonical. Records the operator has reviewed and accepted. Canonical records are what §3, §4, and §9 describe; what the site renders; what the engine evaluates.

Canonicalization is a manual act. The reasoning seam produces artifacts for review, not mutations of the canonical graph. This is the load-bearing guardrail against a common failure mode: reasoning that drifts, compounds on its own outputs, and is discovered only long after it has seeded the graph.

The mechanism --- whether by directory split (data/proposed/ vs data/nodes/), by a status field on each record, by a dedicated review UI, or by a PR-based workflow --- is an open question (§07, q. j). The commitment is to the distinction, not to the implementation.
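
Of the four candidate mechanisms, the status-field option is the simplest to sketch. The code below is one illustration of the distinction, not a decision: the record shape, the reviewed_by field, and the function names are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass, replace
from enum import Enum


class Status(str, Enum):
    PROPOSED = "proposed"
    CANONICAL = "canonical"


@dataclass(frozen=True)
class Record:
    id: str
    status: Status = Status.PROPOSED   # everything the seam emits starts here
    reviewed_by: str | None = None


def canonicalize(record: Record, operator: str) -> Record:
    """Manual act: returns a canonical copy stamped with the reviewing operator."""
    if not operator:
        raise ValueError("canonicalization requires a named operator")
    return replace(record, status=Status.CANONICAL, reviewed_by=operator)


def load_for_consumers(records: list[Record]) -> list[Record]:
    """Graph consumers see canonical records only."""
    return [r for r in records if r.status is Status.CANONICAL]
```

Because records are immutable and the only path to CANONICAL goes through a function that demands an operator, the seam cannot promote its own output by accident.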

The operator's review cadence is not constrained. A proposed record may sit for hours or days before canonicalization. The substrate does not push on the operator; the operator pulls when ready.

11 / References

[DRAFT]