Responsible: Leung Man Hin, Roman Treutlein (Hyperon-era NLP); Linas Vepstas (Link Grammar, learn, matrix); Ben Goertzel (architecture, grammar induction research)
Papers: Goertzel, Suarez-Madrigal & Yu (2020), Guiding Symbolic Natural Language Grammar Induction via Transformer-Based Sequence Probabilities (arXiv:2005.12533); Goertzel et al. (2010), An NLP Architecture for Embodied AGI (AGI-10); Goertzel (2025), Hyperon Whitepaper §7.2; Sleator & Temperley (1993), Parsing English with a Link Grammar
Status: The legacy NLP pipeline (Link Grammar + lg-atomese + learn) is operational and mature. Hyperon-era approaches — LLM-assisted NL-to-MeTTa conversion, SENF canonical forms, and Symbolic Transformer Heads — are at varying stages from operational demo to active development to proposed.
This card provides technical depth beyond the concise Semantic Parsing index card. Semantic parsing in the Hyperon ecosystem spans two generations: a mature OpenCog-era pipeline (Link Grammar → RelEx → AtomSpace) and emerging Hyperon-era approaches that combine LLM capabilities with formal symbolic structures. The goal in both cases is the same: convert natural language into grounded, logically manipulable representations that PLN can reason over.
Related cards: PLN Full (reasoning over parsed representations), AtomSpace Full (storage substrate), WILLIAM Full (compression-driven template mining), MeTTa-Motto (LLM integration)
The OpenCog NLP pipeline, still operational, processes natural language through multiple stages:
Link Grammar (LG): A context-free, dependency-style parser that produces typed link graphs between words. Instead of phrase-structure trees, Link Grammar identifies typed binary connections (links) between word pairs, constrained by a dictionary of connector rules. Independently maintained (v5.13.0) by original OpenCog contributors outside the Hyperon project. Supports English, Thai, Russian, Arabic, Persian, and German. Multi-threaded, UTF-8, cloud-ready.
LG is formally equivalent to pregroup grammar, which is modeled using compact bilinear categories (Lambek 2008). This places LG within the same categorical framework used in quantum NLP (DisCoCat). LG connectors map 1-1 to CCG categories: for example, a transitive verb's S- & O+ (subject link to the left, object link to the right) corresponds to CCG's (S\NP)/NP. The mapping was called "pointless" in isolation, because translating the LG dictionary directly produces an equally messy CCG lexicon. The key motivation for the LG-CCG equivalence was that CCG's transparent syntax-semantics interface could enable auto-generation of RelEx2Logic rules, potentially replacing hundreds of hand-coded mappings. Coordination remains LG's "Achilles heel", whereas CCG handles it naturally via (X\X)/X. (mailing-list-backed: Link-grammar-word-grammar-and-CCG, 2014)
LG enforces planarity: no links may cross. This is a known fundamental limitation that creates a gap between surface syntax (SSynt) and deep syntax (DSynt). For sentences like "the dog was black," the parser must choose between a root-to-verb link and a more semantically intuitive adjective-complement analysis, because drawing both would require links to cross. This SSynt-to-DSynt gap was the core architectural reason RelEx existed as a post-processing layer: its "graph-write rules" converted surface parses to semantic representations. Richard Hudson (creator of Word Grammar) confirmed that any learning mechanism will inevitably discover crossed links, reinforcing that planarity is theoretically imperfect. (mailing-list-backed: Link-crossing-and-copulas-LG-5.5.1, 2014)
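The no-crossing constraint can be stated as a small check. The following is a minimal sketch, not Link Grammar's own implementation: it assumes a parse is given as a list of links, each a pair of word positions, and the function names `links_cross` and `is_planar` are hypothetical.

```python
from itertools import combinations

def links_cross(a, b):
    """Return True if two links (pairs of word positions) cross."""
    (i, j), (k, l) = sorted(a), sorted(b)
    # Arcs drawn above the sentence cross when exactly one endpoint of one
    # arc lies strictly inside the other arc. Shared endpoints and fully
    # nested arcs do not count as crossings.
    return i < k < j < l or k < i < l < j

def is_planar(links):
    """Return True if no pair of links in the parse crosses."""
    return not any(links_cross(a, b) for a, b in combinations(links, 2))
```

For a four-word sentence, `is_planar([(0, 1), (1, 2), (2, 3)])` holds, while a parse containing both `(0, 2)` and `(1, 3)` is rejected, which is the situation the copula example runs into.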
lg-atomese: C++ bridge exposing Link Grammar dictionaries and parse results as AtomSpace Atoms. Enables dynamic dictionary maintenance via learning algorithms, so that learning and parsing can be interleaved rather than run as separate batches. Independently maintained as part of the classical OpenCog stack.
RelEx: Java dependency extractor (v1.6.3) that took Link Grammar output and produced shallow semantic relations plus OpenCog Atomese. Included anaphora resolution and FrameNet mapping. Legacy, unmaintained since ~2016.
RelEx2Logic was deliberately designed to perform minimal semantic interpretation, leaving deeper semantics to PLN. This was a lesson from the predecessor RelEx2Frame, which attempted deeper semantics using FrameNet but "got unwieldy and buggy and was pretty much non-usable." The deliberate shallowness was architectural: extract argument structure only, let PLN handle reasoning. By late 2017, R2L output was acknowledged as inadequate for PLN reasoning — "the biggest issue is that the PLN solver doesn't work very well on the output that R2L is generating" — which motivated the eventual pivot to LLM-based NL-to-MeTTa conversion. (mailing-list-backed: General-discussion-of-linguistic-representations-in-OpenCog, On-converting-natural-language-text-to-atomese, 2014–2017)
To progressively absorb RelEx's complexity into LG itself, Linas introduced h/d (head/dependent) direction markers on LG links, making LG more Word-Grammar-like. This disambiguated cases where dependency direction is unclear (e.g., VJ links to conjunctions) and "allows a large portion of the complexity of RelEx to be ripped out" by moving headedness information into the parser. (mailing-list-backed: A-catena-proposal, 2014)
learn: Neuro-symbolic structure learning in Guile Scheme. Originally focused on NL but broadened to learn any structural patterns via frequentist counting over hypergraph representations. Produces symbolic vector representations (comparable to Word2Vec/GloVe but as explicit, queryable graphs). Key dependency: the matrix library, which provides sparse vector/matrix operations for the AtomSpace and computes correlation, mutual information, and cosine similarity over million-dimensional sparse vectors.
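To illustrate why sparse storage matters at this dimensionality, here is a minimal sketch of cosine similarity over sparse vectors. It is plain Python for illustration only and is not the matrix library's Scheme API; the dict-of-weights representation and the feature names in the usage note are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {feature: weight}."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector only
    # Only features present in both vectors contribute to the dot product,
    # so the cost depends on the number of non-zeros, not the dimension.
    dot = sum(w * v[f] for f, w in u.items() if f in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)
```

With hypothetical connector-style features, `cosine({"the-": 3.0, "barks+": 4.0}, {"the-": 3.0, "meows+": 4.0})` gives 9 / 25 = 0.36: the two words share one context out of two.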
The learn pipeline's theoretical backbone is MST (Maximum Spanning Tree) parsing via pointwise mutual information, achieving ~85% accuracy on unsupervised dependency recovery (Deniz Yuret, 1998 PhD thesis). The grammar learning target has the formal structure of a sheaf — the same mathematical structure that professional linguists hand-author when building parse-rule lexicons. A key discovery: removing infrequent disjuncts as "noise" actually discarded most of the grammatical signal — the "noise" was the signal. Asymmetric MI (I(X,Y)/H(X)) was proposed as superior to symmetric MI, with a generalization from MST to maximum spanning DAG (MSDAG). (mailing-list-backed: Unsupervised-Lang-linasvepstas-works, Revised-report-on-connector-sets, Language-learning-status, 2014–2017)
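The MST-over-PMI idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the learn pipeline's code: pairs are counted between all words of a sentence (real systems typically use a window and ordered pairs), the tree is built with Prim's algorithm, and the no-crossing constraint is not enforced. The names `train_pmi` and `mst_parse` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def train_pmi(sentences):
    """Estimate PMI for unordered word pairs that co-occur in a sentence."""
    word_counts, pair_counts = Counter(), Counter()
    for sent in sentences:
        word_counts.update(sent)
        pair_counts.update(
            frozenset(p) for p in combinations(sent, 2) if p[0] != p[1]
        )
    n_words = sum(word_counts.values())
    n_pairs = sum(pair_counts.values())
    pmi = {}
    for pair, count in pair_counts.items():
        x, y = tuple(pair)
        p_xy = count / n_pairs
        p_x, p_y = word_counts[x] / n_words, word_counts[y] / n_words
        pmi[pair] = math.log2(p_xy / (p_x * p_y))
    return pmi

def mst_parse(sentence, pmi, missing=-1e9):
    """Maximum spanning tree over word positions (Prim's algorithm)."""
    def score(i, j):
        # Unseen pairs get a very low score so they are linked only as a last resort.
        return pmi.get(frozenset((sentence[i], sentence[j])), missing)

    in_tree, links = {0}, []
    while len(in_tree) < len(sentence):
        # Greedily attach the outside word with the highest-PMI link into the tree.
        i, j = max(
            ((i, j) for i in in_tree
             for j in range(len(sentence)) if j not in in_tree),
            key=lambda edge: score(*edge),
        )
        in_tree.add(j)
        links.append((min(i, j), max(i, j)))
    return sorted(links)
```

The resulting links are unlabeled and undirected; in the approach described above, the connector sets (disjuncts) observed on each word across many such parses are what the grammar is then learned from.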
C++ network generation library using sheaf theory and connector semantics. Generates parse trees, deduction chains, and plans from weighted constraints. Linas advocated Meaning-Text Theory (Melcuk) as the strongest framework for generation, noting it "shows you exactly how to convert general ideas into explicit, grammatical sentences" and underpins several commercial linguistics products. (mailing-list-backed: General-discussion-of-linguistic-representations-in-OpenCog, 2014)
The current strategy replaces hand-built pipeline stages with LLM-assisted conversion, validated against formal reasoning:
SENF: Proposed canonical representation that collapses idiomatic NL variations into a single graph structure by combining LLM semantic intuition with formal rewrite rules. A direct architectural successor to an earlier (2017) proposal to use Lojban as a "logical normal form" intermediate layer ("RelEx2Lojban"). Ben Goertzel argued Lojban maps directly to predicate logic (hence PLN-friendly Atomese), covers all everyday semantics, and could enable automatic mapping rule generation via parallel English/Lojban corpora. The approach was not adopted (Linas was "virulently anti-lojban"), but the core insight, that a canonical semantic normal form is needed between NL surface forms and logical representations, directly prefigures SENF. (mailing-list-backed: Replacing-Relex2Logic-with-Relex2Lojban, 2017)
Recent architectural direction: using dependent type theory for knowledge representation, where quantifier dependencies are localized and directly grounded in observations via the Curry-Howard correspondence. This aligns with MeTTa's native gradual type system.
Ben Goertzel proposed implementing Word Grammar parsing directly in PLN, where learning language would simply mean "learning PLN rules," and the pattern miner would replace a separate grammar learner. Each word would be a ConceptNode, syntactic relationships InheritanceLinks to CategoryNodes, and parsing would mean finding the most likely network of links satisfying WG constraints. The practical concern: LG is sub-millisecond while PLN-based parsing might take seconds. The compromise: use LG as a heuristic to guide PLN-based WG refinement. This vision remains conceptually relevant to the Hyperon era as an ultimate convergence target. (mailing-list-backed: Link-grammar-word-grammar-and-CCG, 2014)
The whitepaper (§7.2) describes augmenting Transformer layers with a discrete template memory, bridging neural attention with symbolic structure:
Training: Text is parsed into AtomSpace graphs; frequent subgraphs are mined via WILLIAM to create key-value template pairs. Two losses train each layer to align with these templates:
Contrastive alignment loss, which trains each layer to project hidden states into a key space matching template keys:
\[\mathcal{L}_{\text{align}} = -\log \frac{\exp(\mathrm{sim}(q_i, k^+) / \tau)}{\sum_j \exp(\mathrm{sim}(q_i, k_j) / \tau)}\]
Reconstruction loss, which ensures retrieved template values integrate meaningfully:
\[\mathcal{L}_{\text{recon}} = \| h_i - \hat{h}_i \|^2 \quad\text{where}\quad \hat{h}_i = \sum_m \alpha_m v_m\]
At runtime, each token projects its hidden state to a query, retrieves top-\(m\) templates by similarity, computes attention over them, and injects the weighted combination back into the Transformer's residual stream.
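The two losses and the runtime retrieval step can be sketched together. This is a minimal illustration, not the whitepaper's implementation, and it makes several assumptions the text does not specify: \(\mathrm{sim}\) is cosine similarity, the query \(q_i\) is passed in already projected, the attention weights \(\alpha_m\) are a temperature-scaled softmax over the retrieved similarities, and injection is a plain residual addition. All function names are hypothetical, and vectors are assumed non-zero.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Assumed choice for sim(., .); inputs must be non-zero vectors.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def align_loss(q, keys, pos_index, tau=0.1):
    """L_align = -log softmax_j(sim(q, k_j) / tau) at the positive key k+."""
    logits = [cosine(q, k) / tau for k in keys]
    return -math.log(softmax(logits)[pos_index])

def retrieve(q, keys, values, m=2, tau=0.1):
    """Top-m templates by similarity, weights alpha, and h_hat = sum_m alpha_m v_m."""
    sims = [cosine(q, k) for k in keys]
    top = sorted(range(len(keys)), key=lambda j: sims[j], reverse=True)[:m]
    alphas = softmax([sims[j] / tau for j in top])
    dim = len(values[0])
    h_hat = [sum(a * values[j][d] for a, j in zip(alphas, top)) for d in range(dim)]
    return top, alphas, h_hat

def recon_loss(h, h_hat):
    """L_recon = ||h - h_hat||^2 (squared Euclidean distance)."""
    return sum((x - y) ** 2 for x, y in zip(h, h_hat))

def inject(h, h_hat):
    # Assumed injection rule: add the retrieved combination to the residual stream.
    return [x + y for x, y in zip(h, h_hat)]
```

In training, `align_loss` would be minimized for the template key that WILLIAM mined from the token's context, while `recon_loss` compares the hidden state with the `h_hat` returned by `retrieve`. A real implementation would use batched tensors and learned projection matrices rather than Python lists.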
Goertzel, Suarez-Madrigal & Yu (2020) demonstrated using BERT as a sentence probability oracle to guide symbolic link-grammar induction:
\[P(S) = \sqrt{P_f(S) \cdot P_b(S)}\]
Pipeline: (1) compute word-sentence probability matrix via BERT; (2) word-sense disambiguation via clustering; (3) cluster word-senses into categories; (4) incremental grammar rule induction validated by sentence probability comparison.
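The geometric mean in \(P(S)\) is easiest to compute in log space, since sentence probabilities are products of many small numbers. The sketch below assumes \(P_f\) and \(P_b\) are forward and backward sentence probability estimates, each given as a list of per-token log-probabilities; how those are obtained from BERT is not shown, and the function name `sentence_probability` is hypothetical.

```python
import math

def sentence_probability(forward_logprobs, backward_logprobs):
    """P(S) = sqrt(P_f(S) * P_b(S)), computed from per-token log-probabilities."""
    log_pf = sum(forward_logprobs)   # log P_f(S)
    log_pb = sum(backward_logprobs)  # log P_b(S)
    # sqrt(a * b) = exp(0.5 * (log a + log b)); avoids underflow in the product.
    return math.exp(0.5 * (log_pf + log_pb))
```

For step (4), a candidate grammar rule could then be judged by comparing `sentence_probability` on sentences it generates against sentences it excludes; the exact comparison criterion used in the paper is not reproduced here.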