Hyperon AI Algorithms+Semantic Parsing (LLM/NLP)+Semantic Parsing Full

Approved by Ursula Addison on 2026-05-14

Responsible: Leung Man Hin, Roman Treutlein (Hyperon-era NLP); Linas Vepstas (Link Grammar, learn, matrix); Ben Goertzel (architecture, grammar induction research)

Papers: Goertzel, Suarez-Madrigal & Yu (2020), Guiding Symbolic Natural Language Grammar Induction via Transformer-Based Sequence Probabilities (arXiv:2005.12533); Goertzel et al. (2010), An NLP Architecture for Embodied AGI (AGI-10); Goertzel (2025), Hyperon Whitepaper §7.2; Sleator & Temperley (1993), Parsing English with a Link Grammar

Status: The legacy NLP pipeline (Link Grammar + lg-atomese + learn) is operational and mature. Hyperon-era approaches — LLM-assisted NL-to-MeTTa conversion, SENF canonical forms, and Symbolic Transformer Heads — are at varying stages from operational demo to active development to proposed.

This card provides technical depth beyond the concise Semantic Parsing index card. Semantic parsing in the Hyperon ecosystem spans two generations: a mature OpenCog-era pipeline (Link Grammar → RelEx → AtomSpace) and emerging Hyperon-era approaches that combine LLM capabilities with formal symbolic structures. The goal in both cases is the same: convert natural language into grounded, logically manipulable representations that PLN can reason over.

Related cards: PLN Full (reasoning over parsed representations), AtomSpace Full (storage substrate), WILLIAM Full (compression-driven template mining), MeTTa-Motto (LLM integration)

Legacy Pipeline: Link Grammar → AtomSpace

The OpenCog NLP pipeline, still operational, processes natural language through multiple stages:

Link Grammar

A context-free, dependency-style parser that produces typed link graphs between words. Instead of phrase-structure trees, Link Grammar identifies typed binary connections (links) between word pairs, constrained by a dictionary of connector rules. Independently maintained (v5.13.0) by original OpenCog contributors outside the Hyperon project. Supports English, Thai, Russian, Arabic, Persian, German. Multi-threaded, UTF-8, cloud-ready.

LG is formally equivalent to pregroup grammar, which is modeled using compact bilinear categories (Lambek 2008). This positions LG within the same categorical framework used in quantum NLP (DisCoCat). LG connectors map 1-1 to CCG categories — e.g., LG's O- & S+ corresponds to CCG's (S\NP)/NP — though the mapping is "pointless" in isolation because it produces equally messy CCG. The key motivation for the LG-CCG equivalence was that CCG's transparent syntax-semantics interface could enable auto-generation of RelEx2Logic rules, potentially replacing hundreds of hand-coded mappings. Coordination remains LG's "Achilles heel" — CCG handles it naturally via (X\X)/X. (mailing-list-backed: Link-grammar-word-grammar-and-CCG, 2014)

The No-Links-Crossing Constraint

LG enforces planarity: no links may cross. This is a known fundamental limitation that creates a gap between surface syntax (SSynt) and deep syntax (DSynt). For sentences like "the dog was black," the parser must choose between a root-to-verb link and a more semantically intuitive adjective-complement analysis, because drawing both would require links to cross. This SSynt-to-DSynt gap was the core architectural reason RelEx existed as a post-processing layer: "graph-write rules" converting surface parses to semantic representations. Richard Hudson (creator of Word Grammar) confirmed that any learning mechanism will inevitably discover crossed links, reinforcing that planarity is theoretically imperfect. (mailing-list-backed: Link-crossing-and-copulas-LG-5.5.1, 2014)
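
The planarity constraint can be illustrated with a minimal sketch. This is not Link Grammar's implementation: the (left index, right index, label) representation, the function names, and the example labels are all assumptions made for illustration. Two links cross when exactly one endpoint of one link lies strictly between the endpoints of the other.

```python
from itertools import combinations

# Hypothetical representation: (left word index, right word index, link label).
Link = tuple[int, int, str]


def links_cross(a: Link, b: Link) -> bool:
    """Return True when the two links interleave, i.e. they would cross if drawn above the sentence."""
    a_lo, a_hi = sorted(a[:2])
    b_lo, b_hi = sorted(b[:2])
    return a_lo < b_lo < a_hi < b_hi or b_lo < a_lo < b_hi < a_hi


def is_planar(links: list[Link]) -> bool:
    """A linkage is planar when no pair of links crosses."""
    return not any(links_cross(a, b) for a, b in combinations(links, 2))


# Words: 0=the, 1=dog, 2=was, 3=black (labels are illustrative only).
chain = [(0, 1, "D"), (1, 2, "S"), (2, 3, "P")]
interleaved = [(0, 2, "X"), (1, 3, "Y")]

print(is_planar(chain))        # True
print(is_planar(interleaved))  # False
```

Links that merely share an endpoint or nest inside one another are allowed; only interleaved endpoints count as a crossing.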

lg-atomese

C++ bridge exposing Link Grammar dictionaries and parse results as AtomSpace Atoms. Enables dynamic dictionary maintenance via learning algorithms — interleaved learning and parsing rather than batch. Independently maintained as part of the classical OpenCog stack.

RelEx

Java dependency extractor (v1.6.3) that took Link Grammar output and produced shallow semantic relations plus OpenCog Atomese. Included anaphora resolution and FrameNet mapping. Legacy, unmaintained since ~2016.

RelEx2Logic was deliberately designed to perform minimal semantic interpretation, leaving deeper semantics to PLN. This was a lesson from the predecessor RelEx2Frame, which attempted deeper semantics using FrameNet but "got unwieldy and buggy and was pretty much non-usable." The deliberate shallowness was architectural: extract argument structure only, let PLN handle reasoning. By late 2017, R2L output was acknowledged as inadequate for PLN reasoning — "the biggest issue is that the PLN solver doesn't work very well on the output that R2L is generating" — which motivated the eventual pivot to LLM-based NL-to-MeTTa conversion. (mailing-list-backed: General-discussion-of-linguistic-representations-in-OpenCog, On-converting-natural-language-text-to-atomese, 2014–2017)

The Catena Mechanism

To progressively absorb RelEx's complexity into LG itself, Linas introduced h/d (head/dependent) direction markers on LG links, making LG more Word-Grammar-like. This disambiguated cases where dependency direction is unclear (e.g., VJ links to conjunctions) and "allows a large portion of the complexity of RelEx to be ripped out" by moving headedness information into the parser. (mailing-list-backed: A-catena-proposal, 2014)

learn

Neuro-symbolic structure learning in Guile Scheme. Originally focused on NL but broadened to learn any structural patterns via frequentist counting over hypergraph representations. Produces symbolic vector representations (comparable to Word2Vec/GloVe but as explicit, queryable graphs). Key dependency: the matrix library, which provides sparse vector/matrix operations for the AtomSpace and computes correlation, mutual information, and cosine similarity over million-dimensional sparse vectors.
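
To make the idea of explicit, queryable sparse vectors concrete, here is a small sketch of cosine similarity over vectors stored as dictionaries. It is written in Python for illustration only and does not reflect the matrix library's Scheme API; the context keys and counts are invented.

```python
import math


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors keyed by context label."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical counts of the contexts each word was observed in.
cat = {"the-": 3.0, "sat+": 4.0}
dog = {"the-": 3.0, "sat+": 4.0, "barked+": 2.0}
ran = {"quickly+": 5.0}

print(cosine(cat, dog))
print(cosine(cat, ran))
```

Because only non-zero entries are stored, the cost depends on the number of observed contexts rather than on the nominal dimension of the space, and each dimension keeps a readable name.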

The learn pipeline's theoretical backbone is MST (Maximum Spanning Tree) parsing via pointwise mutual information, achieving ~85% accuracy on unsupervised dependency recovery (Deniz Yuret, 1998 PhD thesis). The grammar learning target has the formal structure of a sheaf — the same mathematical structure that professional linguists hand-author when building parse-rule lexicons. A key discovery: removing infrequent disjuncts as "noise" actually discarded most of the grammatical signal — the "noise" was the signal. Asymmetric MI (I(X,Y)/H(X)) was proposed as superior to symmetric MI, with a generalization from MST to maximum spanning DAG (MSDAG). (mailing-list-backed: Unsupervised-Lang-linasvepstas-works, Revised-report-on-connector-sets, Language-learning-status, 2014–2017)
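
The MST idea can be sketched in a few lines, assuming a toy corpus and a crude PMI estimate from within-sentence co-occurrence counts. This is not the learn pipeline's code: the function names and the counting scheme are assumptions, and the sketch ignores the no-links-crossing constraint and the asymmetric-MI and MSDAG refinements mentioned above.

```python
import math
from collections import Counter


def pmi_table(sentences: list[list[str]]) -> dict[tuple[str, str], float]:
    """Estimate PMI for ordered word pairs that co-occur in a sentence."""
    word_counts: Counter = Counter()
    pair_counts: Counter = Counter()
    for sentence in sentences:
        for i, left in enumerate(sentence):
            word_counts[left] += 1
            for right in sentence[i + 1:]:
                pair_counts[(left, right)] += 1
    n_words = sum(word_counts.values())
    n_pairs = sum(pair_counts.values())
    return {
        (left, right): math.log2(
            (count / n_pairs)
            / ((word_counts[left] / n_words) * (word_counts[right] / n_words))
        )
        for (left, right), count in pair_counts.items()
    }


def mst_parse(sentence: list[str], pmi: dict[tuple[str, str], float]) -> list[tuple[int, int]]:
    """Grow a maximum spanning tree over word positions with Prim's algorithm."""
    in_tree = {0}
    links = []
    while len(in_tree) < len(sentence):
        best = None  # (score, left index, right index, newly attached index)
        for i in in_tree:
            for j in range(len(sentence)):
                if j in in_tree:
                    continue
                lo, hi = min(i, j), max(i, j)
                score = pmi.get((sentence[lo], sentence[hi]), float("-inf"))
                if best is None or score > best[0]:
                    best = (score, lo, hi, j)
        links.append((best[1], best[2]))
        in_tree.add(best[3])
    return sorted(links)


corpus = [["the", "dog", "barked"], ["the", "cat", "slept"], ["a", "dog", "slept"]]
print(mst_parse(corpus[0], pmi_table(corpus)))
```

The result is an unlabeled tree of word-position pairs; in the pipeline described above, trees of this kind are the raw material from which disjuncts are then counted.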

generate

C++ network generation library using sheaf theory and connector semantics. Generates parse trees, deduction chains, and plans from weighted constraints. Linas advocated Meaning-Text Theory (Mel'čuk) as the strongest framework for generation, noting it "shows you exactly how to convert general ideas into explicit, grammatical sentences" and underpins several commercial linguistics products. (mailing-list-backed: General-discussion-of-linguistic-representations-in-OpenCog, 2014)

Hyperon-Era Approaches

NL-to-MeTTa via LLMs

The current strategy replaces hand-built pipeline stages with LLM-assisted conversion, validated against formal reasoning:

nl2pln_demo: Converts NL sentences to PLN/MeTTa representations via LLMs (Anthropic API), stores in a knowledge base, enables backward-chaining queries with proof traces. Operational demo with interactive KB shell.
metta-nl-corpus: Dagster pipeline generating and validating NL-to-MeTTa expression pairs from the SNLI dataset. Uses LLMs for generation, MeTTa inference engine for validation. Three validation paths: entailment (transitive reasoning), contradiction (logical bottom), neutral. Target: 20k silver + 10k gold labeled pairs. Under active development.
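
The three validation paths can be pictured with a toy classifier. This is a hedged sketch, not the metta-nl-corpus implementation or MeTTa code: facts are reduced to (sub, super) inheritance pairs, "logical bottom" is approximated by a hand-supplied set of mutually exclusive concepts, and all names are hypothetical.

```python
def transitive_closure(facts: set[tuple[str, str]]) -> set[tuple[str, str]]:
    """Repeatedly chain (a, b) and (b, d) into (a, d) until nothing new appears."""
    closure = set(facts)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def classify(
    premise_facts: set[tuple[str, str]],
    disjoint: set[frozenset],
    hypothesis: tuple[str, str],
) -> str:
    """Label a hypothesis as entailment, contradiction, or neutral."""
    derived = transitive_closure(premise_facts)
    if hypothesis in derived:
        return "entailment"
    sub, sup = hypothesis
    for a, b in derived:
        # The hypothesis clashes with a derived fact about the same subject.
        if a == sub and frozenset((b, sup)) in disjoint:
            return "contradiction"
    return "neutral"


facts = {("man_1", "man"), ("man", "person"), ("man_1", "sleeping")}
disjoint = {frozenset(("sleeping", "awake"))}

print(classify(facts, disjoint, ("man_1", "person")))  # entailment
print(classify(facts, disjoint, ("man_1", "awake")))   # contradiction
print(classify(facts, disjoint, ("man_1", "tall")))    # neutral
```

A generated NL-to-MeTTa pair would count as validated when the label derived by inference agrees with the SNLI label for that sentence pair.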

SENF (Semantic Elegant Normal Form)

Proposed canonical representation that collapses idiomatic NL variations into a single graph structure by combining LLM semantic intuition with formal rewrite rules. A direct architectural successor to an earlier (2017) proposal to use Lojban as a "logical normal form" intermediate layer ("RelEx2Lojban"). Ben Goertzel argued Lojban maps directly to predicate logic (hence PLN-friendly Atomese), covers all everyday semantics, and could enable automatic mapping rule generation via parallel English/Lojban corpora. The approach was not adopted — Linas was "virulently anti-lojban" — but the core insight (needing a canonical semantic normal form between NL surface forms and logical representations) directly prefigures SENF. (mailing-list-backed: Replacing-Relex2Logic-with-Relex2Lojban, 2017)

Dependent Types for NL Semantics

Recent architectural direction: using dependent type theory for knowledge representation, where quantifier dependencies are localized and directly grounded in observations via the Curry-Howard correspondence. This aligns with MeTTa's native gradual type system.

Unified Parsing and Reasoning (Unrealized Vision)

Ben Goertzel proposed implementing Word Grammar parsing directly in PLN, where learning language would simply mean "learning PLN rules," and the pattern miner would replace a separate grammar learner. Each word would be a ConceptNode, syntactic relationships InheritanceLinks to CategoryNodes, and parsing would mean finding the most likely network of links satisfying WG constraints. The practical concern: LG is sub-millisecond while PLN-based parsing might take seconds. The compromise: use LG as a heuristic to guide PLN-based WG refinement. This vision remains conceptually relevant to the Hyperon era as an ultimate convergence target. (mailing-list-backed: Link-grammar-word-grammar-and-CCG, 2014)

Symbolic Heads and Grammar Induction

Symbolic Transformer Heads (Proposed)

The whitepaper (§7.2) describes augmenting Transformer layers with a discrete template memory, bridging neural attention with symbolic structure:

Training: Text is parsed into AtomSpace graphs; frequent subgraphs are mined via WILLIAM to create key-value template pairs. Two losses train each layer to align with these templates:

Contrastive alignment loss — trains each layer to project hidden states into a key space matching template keys:

\[\mathcal{L}_{\text{align}} = -\log \frac{\exp(\mathrm{sim}(q_i, k^+) / \tau)}{\sum_j \exp(\mathrm{sim}(q_i, k_j) / \tau)}\]

Variables: \(q_i\) = query projection of hidden state \(h_i\), \(k^+\) = positive (matching) template key, \(k_j\) = all candidate keys, \(\tau\) = temperature, \(\mathrm{sim}\) = similarity function
Meaning: Encourages hidden states to cluster near their corresponding symbolic template keys in the shared embedding space
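
As a numerical illustration of the alignment loss, the following sketch evaluates it for one query in plain Python. It assumes cosine similarity for \(\mathrm{sim}\) and assumes the positive key is one of the candidate keys; the whitepaper summary above does not fix either choice, and the function names are hypothetical.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align_loss(query: list[float], keys: list[list[float]], positive_index: int, tau: float = 0.1) -> float:
    """-log( exp(sim(q, k+)/tau) / sum_j exp(sim(q, k_j)/tau) ), computed stably in log space."""
    logits = [cosine(query, key) / tau for key in keys]
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[positive_index]


keys = [[1.0, 0.0], [0.0, 1.0]]
query = [1.0, 0.0]
print(align_loss(query, keys, positive_index=0))  # small: query sits on its template key
print(align_loss(query, keys, positive_index=1))  # large: query sits on the wrong key
```

Lowering \(\tau\) sharpens the distribution over keys, so a mismatched query is penalized more heavily.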

Reconstruction loss — ensures retrieved template values integrate meaningfully:

\[\mathcal{L}_{\text{recon}} = \| h_i - \hat{h}_i \|^2 \quad\text{where}\quad \hat{h}_i = \sum_m \alpha_m v_m\]

Variables: \(h_i\) = original hidden state, \(\hat{h}_i\) = reconstructed state from top-\(m\) template values \(v_m\) weighted by attention \(\alpha_m\)
Meaning: Ensures symbolic template retrieval faithfully reconstructs the neural representation, preventing the symbolic pathway from becoming decorative

At runtime, each token projects its hidden state to a query, retrieves top-\(m\) templates by similarity, computes attention over them, and injects the weighted combination back into the Transformer's residual stream.
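
The runtime path and the reconstruction loss can be sketched together. The sketch below makes several assumptions that the description above leaves open: the query projection is the identity, similarity is a dot product, attention is a softmax over the retrieved scores, and injection is a plain residual addition. Function names are hypothetical.

```python
import math


def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def retrieve(query, keys, values, top_m=2, tau=1.0):
    """Pick the top-m templates by similarity and blend their values with softmax weights."""
    scores = [dot(query, key) for key in keys]
    top = sorted(range(len(keys)), key=lambda j: scores[j], reverse=True)[:top_m]
    peak = max(scores[j] for j in top)
    weights = [math.exp((scores[j] - peak) / tau) for j in top]
    total = sum(weights)
    alphas = [w / total for w in weights]
    dim = len(values[0])
    h_hat = [sum(alpha * values[j][d] for alpha, j in zip(alphas, top)) for d in range(dim)]
    return top, alphas, h_hat


def recon_loss(h: list[float], h_hat: list[float]) -> float:
    """Squared L2 distance between the hidden state and its template reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(h, h_hat))


def inject(h: list[float], h_hat: list[float]) -> list[float]:
    """Add the retrieved combination back into the residual stream."""
    return [a + b for a, b in zip(h, h_hat)]


keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
h = [2.0, 0.0]

top, alphas, h_hat = retrieve(h, keys, values, top_m=2)
print(top, alphas, h_hat)
print(recon_loss(h, h_hat), inject(h, h_hat))
```

Restricting attention to the top-\(m\) templates keeps the per-token cost independent of the total size of the template memory once candidates have been scored.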

Transformer-Guided Grammar Induction (Published Research)

Goertzel, Suarez-Madrigal & Yu (2020) demonstrated using BERT as a sentence probability oracle to guide symbolic link-grammar induction:

\[P(S) = \sqrt{P_f(S) \cdot P_b(S)}\]

Variables: \(P_f(S) = \prod_{i=0}^{N} P(w_i \mid w_0, \ldots, w_{i-1})\) = forward sentence probability, \(P_b(S) = \prod_{i=N}^{0} P(w_i \mid w_{i+1}, \ldots, w_N)\) = backward sentence probability
Meaning: Combined bidirectional sentence probability from BERT masked predictions, used to validate proposed grammar rules by comparing \(P(S)\) for sentences generated from proposed rules vs. mutated rules
Source: Goertzel, Suarez-Madrigal & Yu (2020), arXiv:2005.12533
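
A small sketch shows how the combined probability can be computed once the per-token conditional probabilities are available. Obtaining those probabilities from BERT is not reproduced here; the lists below are made-up numbers, and the function names and the acceptance rule are assumptions for illustration.

```python
import math


def sentence_probability(forward_probs: list[float], backward_probs: list[float]) -> float:
    """Geometric mean of forward and backward sentence probabilities, computed in log space."""
    log_pf = sum(math.log(p) for p in forward_probs)
    log_pb = sum(math.log(p) for p in backward_probs)
    return math.exp(0.5 * (log_pf + log_pb))


def prefers_proposed(proposed: float, mutated: float) -> bool:
    """Toy acceptance test: keep a rule when its sentence is more probable than the mutated one."""
    return proposed > mutated


# Hypothetical per-token conditionals P(w_i | context) for two candidate sentences.
p_proposed = sentence_probability([0.5, 0.5], [0.25, 1.0])
p_mutated = sentence_probability([0.5, 0.1], [0.05, 1.0])

print(p_proposed)  # 0.25
print(prefers_proposed(p_proposed, p_mutated))  # True
```

Working in log space avoids underflow, since products of many small per-token probabilities quickly fall below floating-point range for long sentences.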

Pipeline: (1) compute word-sentence probability matrix via BERT; (2) word-sense disambiguation via clustering; (3) cluster word-senses into categories; (4) incremental grammar rule induction validated by sentence probability comparison.

Status and Resources

System Interfaces

PLN: Semantic parses feed directly into PLN as ground-level atoms for uncertain reasoning. NL-to-MeTTa quality determines PLN's ability to reason about natural language content.
Pattern Mining: Mined subgraph patterns from parsed text serve as template keys for Symbolic Heads. A planned feedback loop: pattern miner finds candidates, PLN evaluates them, and the ones that pass guide the next round of mining. (mailing-list-backed: A-very-small-LG-relex-bug, 2014)
WILLIAM: Compression-driven mining identifies the most reusable symbolic templates from parsed corpora.
MeTTa-Motto: LLM integration library providing the neural NL understanding that complements symbolic parsing.

Implementation Anchors

link-grammar — C parser, v5.13.0, multi-language, actively maintained. LGPL.
lg-atomese — C++ bridge to AtomSpace, v2.0, production use.
learn — Guile Scheme structure learning, active V2 development.
matrix — Guile Scheme sparse vector/matrix library for AtomSpace.
generate — C++ network generation via sheaf theory.
nl2pln_demo — LLM-assisted NL-to-PLN/MeTTa conversion demo.
metta-nl-corpus — Dagster pipeline for NL-to-MeTTa annotation with validation.
bio-semantic-parser — Full-stack NL-to-MeTTa pipeline for biological data.
Legacy: relex (Java, unmaintained since ~2016)

Current Status

Operational: Link Grammar parser (v5.13.0); lg-atomese bridge (v2.0); learn/matrix structure learning; nl2pln_demo; bio-semantic-parser
Active development: metta-nl-corpus (SNLI-based annotation pipeline); learn Version Two
Proposed: SENF canonical forms; Symbolic Transformer Heads; dependent-type NL semantics; full integration of grammar induction with MORK-native pattern mining

Historical Design Rationale (mailing-list-backed, opencog-ml 2014–2022)

Link Grammar in C (not Python/Scheme): LG v1.0 dates to 1991 (Temperley & Sleator, Carnegie Mellon). C retained because the core data structure requires direct CPU cache-line tuning for deeply recursive parsing. Current parser ~100× faster than original; Python would be 10-20× slower. (Linas Vepstas, Oct 2021)
5-example grammar learning threshold: Unsupervised grammar induction discovers the correct grammatical form once a word has been observed 5 or more times. "From first principles, 5 is minimum for beating random chance on MST parses." (Linas Vepstas, Apr 2019)
Structure preservation property: The LG→disjuncts pipeline recovers the input grammar at high F1 when fed LG-English parses — establishing that it's "structure-preserving." (Linas Vepstas, Jun 2019)
Convergence hypothesis ("Linas Claim"): Non-lexical input converges to the same lexical output as MI-weighted input given sufficient sampling. Requires 10-100× larger training sets for visible convergence. (Linas Vepstas, Jun 2019)
Morpho-syntax unity: LG formalism can learn both morphology and syntax "in one gulp." Demonstrated for Tagalog, Hebrew, Amharic. More powerful than FSTs for non-concatenative languages (Semitic). (GSoC 2014, Linas Vepstas)
Anaphora as downstream reasoning: Hobbs algorithm operates on AtomSpace output (post-RelEx), not in the parser. Separation enables integration with PLN for selection restriction filtering. (Hujie Wang, May 2014)
Word Grammar algebraic formalization: Ben Goertzel's algebraic formalization of WG maps directly to SHIQ description logic. Richard Hudson (WG creator) engaged directly, publishing a rethink of WG word order rules in the Journal of Linguistics. (Algebraic-view-of-word-grammar, 2014)
Instance/lemma representation tension: R2L must use word instances ("Mike@111", "eats@222") to distinguish different events, but SuReal needs lemmas with tense/number features for morphological output. This tension is what SENF normalization is designed to resolve. (sureal-and-normalization, 2016)

Open Problems / Research Directions

Defining SENF formally — what rewrite rules normalize NL variations to canonical MeTTa form?
Scaling Symbolic Heads from proposed design to demonstrated system with mined AtomSpace templates
Validating NL-to-MeTTa conversion quality at scale — the metta-nl-corpus pipeline addresses this but needs larger gold-standard datasets
Bridging Link Grammar's typed links to MeTTa's type system — enabling the legacy parser to feed directly into Hyperon reasoning
Dependent-type representations for NL semantics — formalizing the Curry-Howard grounding approach
Unified parsing-reasoning convergence — can PLN-based Word Grammar parsing become practical with MORK-level performance?

Primary Sources

Goertzel, B., Suarez-Madrigal, A., Yu, S. (2020). Guiding Symbolic Natural Language Grammar Induction via Transformer-Based Sequence Probabilities. arXiv:2005.12533.
Sleator, D. and Temperley, D. (1993). Parsing English with a Link Grammar. CMU-CS-91-196.
Goertzel, B. (2025). Hyperon for AGI⇒ASI Whitepaper, §7.2: Symbolic Heads.