Formal Foundations and Indexing

The Selectivity Theorem

The "MORK theory" paper provides a formal analysis of when ZAM/MORK yields substantial advantage. The selectivity exponent:

\[\gamma(p) = -\log_N\!\left(\frac{|P(p)|}{N}\right)\]
  • Variables: \(p\) = a prefix in the trie, \(|P(p)|\) = size of the posting list for prefix \(p\), \(N\) = total number of atoms
  • Meaning: Measures the normalized information content of a prefix — how much it narrows the search space
  • Source: Goertzel (2025), Articulating Conditions Where ZAM/MORK Yield Benefit

For a \(k\)-leg join over prefixes \(p_1, \ldots, p_k\), under \(\varepsilon\)-independence assumptions:

  • If \(\sum \gamma(p_i) > 1\): the selectivity analysis predicts \(O(1)\) expected candidate intersections — ZAM feeds a constant number of candidates to the unifier
  • If \(\sum \gamma(p_i) < 1\): the expected intersection grows as \(N^{1-\sum\gamma}\), still sublinear

Under a hierarchical generative model (tree-structured conditionals with exponential decay of prefix probabilities), even moderately deep bindings across a few legs push the selectivity sum past 1. Real-world AtomSpaces — logic proofs, program ASTs, semantic parses, knowledge graphs — are approximately hierarchical: symbols follow heavy-tailed distributions, structures are compositionally generated, and query legs bind weakly dependent positions.

Slot-Centric Indexing and the Break-Even Rule

Instead of \(k!\) permutations, MORK maintains at most \(k\) slot-centric prefix views — one per argument position — plus selectively promoted "hot" pair views. Key structure (binary relation):

  • Canonical: R/_1/<EXPR_ID>/_2/<C_ID> → payload
  • Flip view: R/_2/<C_ID>/_1/<EXPR_ID> → pointer to canonical

The probabilistic break-even rule:

\[p_{s+1}(M - L) > \alpha\]
  • Variables: \(p_{s+1}\) = probability a query anchors on the \((s{+}1)\)-th slot; \(M\) = mining cost when anchor is missing; \(L\) = direct prefix-view hit cost; \(\alpha\) = storage cost of one additional view
  • Meaning: Materialize the next slot view only when expected cost reduction exceeds storage cost

Worked example: with \(p_1=0.4, p_2=0.6, L=1, M=50\), the flip view yields a 29.4× expected speedup for roughly 2× index entries. For hot multi-slot queries, the general problem is submodular coverage under a storage budget, solvable greedily. (Source: Goertzel 2025, MORK Slots)