Decision DEC-0028

Exact Corpus Membership and Proxy Coverage

human · Status: accepted

Decision

in_corpus: true now means one thing only:

  • the specific LIB-ID has an exact corpus ingest, backed by both a corpus/{LIB-ID}/ directory and a matching entry in corpus/_manifest.yaml

Approximate or proxy textual coverage must not be represented as in_corpus: true.

When a KB library book is only partially or approximately represented by related corpus texts, that state should be recorded with:

  • corpus_proxy_ids: ["CORPUS-...."]

That proxy field means:

  • there is related corpus material that may help with research or contextual interpretation
  • but the exact library book or edition has not itself been ingested
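
As a rough illustration of the two states, the sketch below models a KB library record as a plain Python dict and classifies its coverage. The record layout, the `coverage_state` helper, and the sample IDs are assumptions made for illustration; only the `in_corpus` and `corpus_proxy_ids` field names come from this decision.

```python
from typing import Any


def coverage_state(record: dict[str, Any]) -> str:
    """Classify a KB library record as 'exact', 'proxy', or 'none'.

    Hypothetical helper: the record layout is assumed, not taken
    from the actual KB schema.
    """
    # Exact membership wins; proxies never upgrade a book to 'exact'.
    if record.get("in_corpus") is True:
        return "exact"
    if record.get("corpus_proxy_ids"):
        return "proxy"
    return "none"


# Illustrative records only; these LIB and CORPUS IDs are made up.
exact_book = {"id": "LIB-9001", "in_corpus": True}
proxy_book = {
    "id": "LIB-9002",
    "in_corpus": False,
    "corpus_proxy_ids": ["CORPUS-EXAMPLE"],
}
absent_book = {"id": "LIB-9003", "in_corpus": False}
```

In this sketch a proxied book keeps in_corpus: false, so the proxy field adds context without changing membership.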

Context

An audit on 2026-04-03 found 13 KB library books marked in_corpus: true with no corresponding corpus/{LIB-ID}/ directory and no manifest entry. The false positives came from a one-shot migration that trusted approximate entries in corpus/_lib-mapping.yaml as if they were exact corpus membership.

Affected books:

  • LIB-0093
  • LIB-0136
  • LIB-0137
  • LIB-0138
  • LIB-0157
  • LIB-0177
  • LIB-0182
  • LIB-0183
  • LIB-0222
  • LIB-0249
  • LIB-0253
  • LIB-0288
  • LIB-0289

Those approximate matches included translation substitutions, subset texts, and single-work proxies for omnibus editions.

Policy

Exact membership

Use in_corpus: true only when:

  • the LIB book has an exact corpus ingest
  • the ingest is reflected in corpus/_manifest.yaml
  • the book can honestly participate in corpus-based activation and readiness logic
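
The first two conditions can be checked mechanically. A minimal sketch follows, assuming the manifest has already been parsed into a set of LIB-IDs (the layout of corpus/_manifest.yaml is not specified here, so the parsing step is left out). The function name and parameters are hypothetical.

```python
from pathlib import Path


def is_exact_member(lib_id: str, corpus_root: Path, manifest_ids: set[str]) -> bool:
    """Return True only if the book has both an ingest directory and a manifest entry.

    Assumptions: corpus_root points at the corpus/ directory, and
    manifest_ids holds the LIB-IDs listed in corpus/_manifest.yaml.
    """
    has_ingest_dir = (corpus_root / lib_id).is_dir()
    has_manifest_entry = lib_id in manifest_ids
    return has_ingest_dir and has_manifest_entry
```

The third condition, honest participation in activation and readiness logic, is a judgment call that a check like this cannot make on its own.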

Proxy coverage

Use corpus_proxy_ids when:

  • a related corpus text overlaps materially with the library book
  • the overlap is useful for research
  • but the exact library book, translation, or edition is not present

Proxy coverage may support:

  • exploratory research
  • rough thematic assistance
  • future ingest prioritization

Proxy coverage must not automatically justify:

  • episode readiness
  • source-essay generation from exact-book assumptions
  • claims that a book was "scanned and embedded"
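
One way to keep these two uses apart in code is to let readiness checks read only in_corpus, while research helpers may also read corpus_proxy_ids. The sketch below is an assumption about how such gating could look; `episode_ready`, `research_corpus_ids`, and the record layout are hypothetical names, not existing project functions.

```python
from typing import Any


def episode_ready(source_books: list[dict[str, Any]]) -> bool:
    """Readiness requires every source book to be an exact corpus member.

    Proxy IDs are deliberately ignored. Treating an empty list as
    'not ready' is an assumption of this sketch.
    """
    return bool(source_books) and all(
        book.get("in_corpus") is True for book in source_books
    )


def research_corpus_ids(book: dict[str, Any]) -> list[str]:
    """Corpus identifiers usable for exploratory research on a book.

    Includes the book's own ingest when it is exact, plus any proxies.
    """
    ids: list[str] = []
    if book.get("in_corpus") is True:
        ids.append(book["id"])
    ids.extend(book.get("corpus_proxy_ids", []))
    return ids
```

With this split, adding a proxy to a book widens what research can draw on but can never flip a readiness check.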

Consequences

Positive

  • Corpus readiness metrics become trustworthy again.
  • in_corpus regains hard-fact semantics.
  • Approximate coverage is preserved without overstating provenance.

Tradeoffs

  • Some legacy essay and prompt artifacts now carry provenance taint and should be interpreted more carefully.
  • At least one approved artifact tied to proxy coverage (LIB-0253) deserves a follow-up provenance review.

Implementation

  1. flip the 13 false positives back to in_corpus: false
  2. add corpus_proxy_ids where the heuristic mapping is still useful context
  3. update reporting scripts so exact corpus coverage comes from the manifest, not _lib-mapping.yaml
  4. keep _lib-mapping.yaml as a heuristic research aid, not an authoritative source of truth for corpus membership
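
Steps 1 to 3 could be sketched roughly as below. This is an illustration under stated assumptions, not the project's migration script: records are plain dicts with an `id` field, the manifest and _lib-mapping.yaml are assumed to be already parsed into a set and a dict, and both function names are hypothetical. The sketch copies every mapped proxy, whereas step 2 calls for judging whether each mapping is still useful, so a real run would presumably include a review pass.

```python
import copy
from typing import Any


def correct_membership(
    records: list[dict[str, Any]],
    manifest_ids: set[str],
    heuristic_mapping: dict[str, list[str]],
) -> list[dict[str, Any]]:
    """Return corrected copies of the records; the inputs are not modified."""
    corrected = []
    for record in records:
        fixed = copy.deepcopy(record)
        lib_id = fixed["id"]
        if fixed.get("in_corpus") is True and lib_id not in manifest_ids:
            # Step 1: flip the false positive back to in_corpus: false.
            fixed["in_corpus"] = False
            # Step 2: keep the heuristic match as proxy context, if any.
            proxies = heuristic_mapping.get(lib_id, [])
            if proxies:
                fixed["corpus_proxy_ids"] = list(proxies)
        corrected.append(fixed)
    return corrected


def exact_coverage(records: list[dict[str, Any]], manifest_ids: set[str]) -> float:
    """Step 3: fraction of books with exact coverage, read from the manifest only."""
    if not records:
        return 0.0
    exact = sum(1 for record in records if record["id"] in manifest_ids)
    return exact / len(records)
```

Note that `exact_coverage` never consults the heuristic mapping, which is the point of step 3.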