DEC-0028: Exact Corpus Membership and Proxy Coverage
Decision
in_corpus: true now means one thing only:
- the specific LIB-ID has an exact, manifest-backed corpus ingest in corpus/{LIB-ID}/ and corpus/_manifest.yaml
Approximate or proxy textual coverage must not be represented as
in_corpus: true.
When a KB library book is only partially or approximately represented by related corpus texts, that state should be recorded with:
corpus_proxy_ids: ["CORPUS-...."]
That proxy field means:
- there is related corpus material that may help with research or contextual interpretation
- but the exact library book or edition has not itself been ingested
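A minimal sketch of how the two fields could be read together, assuming each KB record is loaded as a plain dictionary. The function name coverage_kind and the record IDs below are hypothetical placeholders, not real entries.

```python
from typing import Any


def coverage_kind(record: dict[str, Any]) -> str:
    """Classify a KB library record as 'exact', 'proxy', or 'none'."""
    # Exact membership is the only state that in_corpus: true may express.
    if record.get("in_corpus") is True:
        return "exact"
    # A non-empty proxy list means related material exists, nothing more.
    if record.get("corpus_proxy_ids"):
        return "proxy"
    return "none"


# Hypothetical records for illustration only.
exact_book = {"id": "LIB-EXAMPLE-A", "in_corpus": True}
proxy_book = {
    "id": "LIB-EXAMPLE-B",
    "in_corpus": False,
    "corpus_proxy_ids": ["CORPUS-EXAMPLE"],
}
uncovered_book = {"id": "LIB-EXAMPLE-C", "in_corpus": False}
```

Checking in_corpus with `is True` keeps a stray truthy value (for example a non-empty string) from being read as exact membership.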
Context
An audit on 2026-04-03 found 13 KB library books marked in_corpus: true with
no corresponding corpus/{LIB-ID}/ directory and no manifest entry. The false
positives came from a one-shot migration that trusted approximate entries in
corpus/_lib-mapping.yaml as if they were exact corpus membership.
Affected books:
LIB-0093, LIB-0136, LIB-0137, LIB-0138, LIB-0157, LIB-0177, LIB-0182, LIB-0183, LIB-0222, LIB-0249, LIB-0253, LIB-0288, LIB-0289
Those approximate matches included translation substitutions, subset texts, and single-work proxies for omnibus editions.
Policy
Exact membership
Use in_corpus: true only when:
- the LIB book has an exact corpus ingest
- the ingest is reflected in corpus/_manifest.yaml
- the book can honestly participate in corpus-based activation and readiness logic
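The first two conditions can be checked mechanically. The sketch below assumes the manifest has already been parsed into a set of LIB IDs, since the layout of corpus/_manifest.yaml is not specified here. The names has_exact_ingest and demo, and the LIB-TEST IDs, are hypothetical.

```python
import tempfile
from pathlib import Path


def has_exact_ingest(corpus_root: Path, lib_id: str, manifest_ids: set[str]) -> bool:
    """True only if corpus/{LIB-ID}/ exists AND the manifest lists the ID."""
    return (corpus_root / lib_id).is_dir() and lib_id in manifest_ids


def demo() -> list[bool]:
    """Exercise the check against a throwaway corpus directory."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "LIB-TEST-1").mkdir()
        return [
            # Directory present and listed in the manifest: exact member.
            has_exact_ingest(root, "LIB-TEST-1", {"LIB-TEST-1"}),
            # Directory present but missing from the manifest: not a member.
            has_exact_ingest(root, "LIB-TEST-1", set()),
            # Listed in the manifest but no directory: not a member.
            has_exact_ingest(root, "LIB-TEST-2", {"LIB-TEST-2"}),
        ]
```

Requiring both signals means a stale directory or a stale manifest entry alone cannot set in_corpus: true.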
Proxy coverage
Use corpus_proxy_ids when:
- a related corpus text overlaps materially with the library book
- the overlap is useful for research
- but the exact library book, translation, or edition is not present
Proxy coverage may support:
- exploratory research
- rough thematic assistance
- future ingest prioritization
Proxy coverage must not automatically justify:
- episode readiness
- source-essay generation from exact-book assumptions
- claims that a book was "scanned and embedded"
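One way to keep proxy coverage out of readiness logic is to route the two uses through separate functions. This is a sketch under the assumption that records are plain dictionaries. episode_ready and research_proxy_ids are hypothetical names, not existing functions in the reporting scripts.

```python
from typing import Any


def episode_ready(record: dict[str, Any]) -> bool:
    """Readiness depends on exact membership only; proxy IDs are never consulted."""
    return record.get("in_corpus") is True


def research_proxy_ids(record: dict[str, Any]) -> list[str]:
    """Related corpus IDs that may inform exploratory research."""
    return list(record.get("corpus_proxy_ids") or [])


# Hypothetical record: proxy coverage only.
proxy_only = {
    "id": "LIB-EXAMPLE-B",
    "in_corpus": False,
    "corpus_proxy_ids": ["CORPUS-EXAMPLE"],
}
```

Here episode_ready(proxy_only) is False while research_proxy_ids(proxy_only) still returns the proxy list, so the research aid is preserved without feeding readiness.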
Consequences
Positive
- Corpus readiness metrics become trustworthy again.
- in_corpus regains hard-fact semantics.
- Approximate coverage is preserved without overstating provenance.
Tradeoffs
- Some legacy essay and prompt artifacts now carry provenance taint and should be interpreted more carefully.
- At least one approved artifact tied to proxy coverage (LIB-0253) deserves a follow-up provenance review.
Implementation
- flip the 13 false positives back to in_corpus: false
- add corpus_proxy_ids where the heuristic mapping is still useful context
- update reporting scripts so exact corpus coverage comes from the manifest, not _lib-mapping.yaml
- keep _lib-mapping.yaml as a heuristic research aid, not a constitutional source of truth for corpus membership
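The repair and reporting steps might look roughly like the sketch below. It assumes the manifest is available as a set of LIB IDs and that _lib-mapping.yaml has been loaded as a dictionary from LIB ID to a list of corpus IDs. Both shapes are assumptions, and repair_records and exact_coverage_count are hypothetical names. For brevity the sketch checks the manifest only and omits the corpus/{LIB-ID}/ directory check.

```python
from typing import Any


def repair_records(
    records: list[dict[str, Any]],
    manifest_ids: set[str],
    lib_mapping: dict[str, list[str]],
) -> list[dict[str, Any]]:
    """Return corrected copies; the input records are left unmodified."""
    repaired = []
    for rec in records:
        fixed = dict(rec)
        # A false positive: marked in_corpus but absent from the manifest.
        if fixed.get("in_corpus") is True and fixed["id"] not in manifest_ids:
            fixed["in_corpus"] = False
            # Preserve the heuristic mapping as proxy context, if any exists.
            proxies = lib_mapping.get(fixed["id"], [])
            if proxies:
                fixed["corpus_proxy_ids"] = list(proxies)
        repaired.append(fixed)
    return repaired


def exact_coverage_count(records: list[dict[str, Any]], manifest_ids: set[str]) -> int:
    """Count exact coverage from the manifest, never from the heuristic mapping."""
    return sum(1 for rec in records if rec["id"] in manifest_ids)


# Hypothetical data for illustration only.
sample_records = [
    {"id": "LIB-EXAMPLE-A", "in_corpus": True},
    {"id": "LIB-EXAMPLE-B", "in_corpus": True},
]
sample_manifest = {"LIB-EXAMPLE-A"}
sample_mapping = {"LIB-EXAMPLE-B": ["CORPUS-EXAMPLE"]}
sample_repaired = repair_records(sample_records, sample_manifest, sample_mapping)
```

Returning copies rather than mutating in place makes it easier to diff the repaired records against the originals before anything is written back.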