Legacy Pipeline
The OpenCog NLP pipeline, still operational, processes natural language through multiple stages:
Link Grammar
A context-free, dependency-style parser that produces typed link graphs between words. Instead of phrase-structure trees, Link Grammar identifies typed binary connections (links) between word pairs, constrained by a dictionary of connector rules. Independently maintained (v5.13.0) by original OpenCog contributors outside the Hyperon project. Supports English, Thai, Russian, Arabic, Persian, German. Multi-threaded, UTF-8, cloud-ready.
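A minimal sketch of how typed connectors pair up into links, under simplifying assumptions: the three-word dictionary `TOY_DICT` and both helper functions are hypothetical, connector types must match exactly (real Link Grammar also matches subscripts and requires a whole disjunct to be satisfied), and this is not the Link Grammar API.

```python
# Hypothetical toy dictionary: each word lists its connectors.
# "X+" looks for a partner to the right, "X-" for a partner to the left.
TOY_DICT = {
    "the": ["D+"],
    "dog": ["D-", "S+"],
    "barks": ["S-"],
}


def connectors_match(left_connector, right_connector):
    """True if a right-pointing connector on the left word pairs with a
    left-pointing connector of the same type on a word to its right."""
    return (
        left_connector.endswith("+")
        and right_connector.endswith("-")
        and left_connector[:-1] == right_connector[:-1]
    )


def candidate_links(words):
    """Return every (left_index, right_index, link_type) whose connectors pair up.

    This only enumerates candidate links; it does not check planarity,
    connectivity, or that every connector of a word is used.
    """
    links = []
    for i, left_word in enumerate(words):
        for j in range(i + 1, len(words)):
            for lc in TOY_DICT[left_word]:
                for rc in TOY_DICT[words[j]]:
                    if connectors_match(lc, rc):
                        links.append((i, j, lc[:-1]))
    return links


links = candidate_links(["the", "dog", "barks"])
```

For this toy sentence the result contains a D link between "the" and "dog" and an S link between "dog" and "barks", which is the typed link graph described above in miniature.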
LG is formally equivalent to pregroup grammar, which is modeled using compact bilinear categories (Lambek 2008). This places LG within the same categorical framework used in quantum NLP (DisCoCat). LG connectors map one-to-one to CCG categories: for example, the transitive-verb disjunct S- & O+ in LG corresponds to (S\NP)/NP in CCG. The mapping is "pointless" in isolation, however, because a mechanical translation produces equally messy CCG. The key motivation for pursuing the LG-CCG equivalence was that CCG's transparent syntax-semantics interface could enable auto-generation of RelEx2Logic rules, potentially replacing hundreds of hand-coded mappings. Coordination remains LG's "Achilles heel," whereas CCG handles it naturally via the conjunction category (X\X)/X. (mailing-list-backed: Link-grammar-word-grammar-and-CCG, 2014)
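As a worked illustration of the pregroup and CCG views of a transitive verb (standard textbook material, not taken from the cited thread), assume the example sentence "dogs chase cats", with n the noun type, s the sentence type, and the pregroup contractions n · n^r → 1 and n^l · n → 1:

```latex
% Pregroup reduction for the assumed example "dogs chase cats"
\begin{align*}
\underbrace{n}_{\text{dogs}} \cdot \underbrace{n^{r}\, s\, n^{l}}_{\text{chase}} \cdot \underbrace{n}_{\text{cats}}
  &= (n\, n^{r})\; s\; (n^{l}\, n) \\
  &\to 1 \cdot s \cdot 1 = s
\end{align*}
% The same sentence in CCG, with the verb category (S\NP)/NP
\begin{align*}
(S\backslash NP)/NP \;\; NP &\Rightarrow S\backslash NP && \text{(forward application)} \\
NP \;\; S\backslash NP &\Rightarrow S && \text{(backward application)}
\end{align*}
```

In both derivations the verb consumes one noun argument on each side and leaves a sentence, which is the sense in which the two lexical entries correspond.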
The No-Links-Crossing Constraint
LG enforces planarity: no two links may cross. This is a known, fundamental limitation that creates a gap between surface syntax (SSynt) and deep syntax (DSynt). For sentences like "the dog was black," the parser must choose between a root-to-verb link and a more semantically intuitive adjective-complement analysis, because drawing both would require links to cross. This SSynt-to-DSynt gap was the core architectural reason RelEx existed as a post-processing layer of "graph-write rules" that convert surface parses into semantic representations. Richard Hudson (creator of Word Grammar) confirmed that any learning mechanism will inevitably discover crossed links, reinforcing the view that planarity is a theoretically imperfect constraint. (mailing-list-backed: Link-crossing-and-copulas-LG-5.5.1, 2014)
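The planarity constraint can be stated as a small check over word positions. The sketch below is illustrative only: the function names are hypothetical, links are given as pairs of word indices, and it is not how the Link Grammar parser itself enforces the constraint.

```python
from itertools import combinations


def links_cross(a, b):
    """Two links cross when exactly one endpoint of one link lies
    strictly between the endpoints of the other."""
    (i, j), (k, l) = sorted(a), sorted(b)
    return i < k < j < l or k < i < l < j


def is_planar(links):
    """True if no pair of links in the parse crosses."""
    return not any(links_cross(a, b) for a, b in combinations(links, 2))


chain = [(0, 1), (1, 2), (2, 3)]   # adjacent links, drawn side by side
nested = [(0, 3), (1, 2)]          # one link fully inside another
crossing = [(0, 2), (1, 3)]        # endpoints interleave
```

Links that share an endpoint or nest inside one another are allowed; only interleaved endpoints, as in `crossing`, violate the constraint.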
lg-atomese
C++ bridge exposing Link Grammar dictionaries and parse results as AtomSpace Atoms. Enables dynamic dictionary maintenance via learning algorithms, so that learning and parsing can be interleaved rather than run as separate batch steps. Independently maintained as part of the classical OpenCog stack.
RelEx
Java dependency extractor (v1.6.3) that took Link Grammar output and produced shallow semantic relations plus OpenCog Atomese. Included anaphora resolution and FrameNet mapping. Legacy, unmaintained since ~2016.
RelEx2Logic (R2L) was deliberately designed to perform minimal semantic interpretation, leaving deeper semantics to PLN. This was a lesson from its predecessor, RelEx2Frame, which attempted deeper semantics using FrameNet but "got unwieldy and buggy and was pretty much non-usable." The shallowness was an architectural choice: extract argument structure only and let PLN handle reasoning. By late 2017, however, R2L output was acknowledged as inadequate for PLN reasoning ("the biggest issue is that the PLN solver doesn't work very well on the output that R2L is generating"), which motivated the eventual pivot to LLM-based NL-to-MeTTa conversion. (mailing-list-backed: General-discussion-of-linguistic-representations-in-OpenCog, On-converting-natural-language-text-to-atomese, 2014–2017)
The Catena Mechanism
To progressively absorb RelEx's complexity into LG itself, Linas Vepstas introduced h/d (head/dependent) direction markers on LG links, making LG more Word-Grammar-like. The markers disambiguate cases where dependency direction is unclear (e.g., VJ links to conjunctions), and moving headedness information into the parser "allows a large portion of the complexity of RelEx to be ripped out." (mailing-list-backed: A-catena-proposal, 2014)
learn
Neuro-symbolic structure learning in Guile Scheme. Originally focused on NL but broadened to learn any structural patterns via frequentist counting over hypergraph representations. Produces symbolic vector representations (comparable to Word2Vec/GloVe but as explicit, queryable graphs). Key dependency: the matrix library, which provides sparse vector/matrix operations over the AtomSpace for computing correlation, mutual information, and cosine similarity on million-dimensional sparse vectors.
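To make the sparse-vector idea concrete, here is a minimal Python sketch of cosine similarity over vectors stored as dictionaries of nonzero entries. It is an illustration only: the real matrix library is written in Guile Scheme and operates on the AtomSpace, and the function name, the example words, and the feature counts below are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors stored as {feature: weight}.

    Only features present in both vectors contribute to the dot product,
    so the cost depends on the number of nonzero entries, not on the
    nominal (possibly million-dimensional) size of the space.
    """
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical observation counts of connector-like features for three words.
dog = {"D-": 3.0, "S+": 4.0}
cat = {"D-": 6.0, "S+": 8.0}
barks = {"S-": 5.0}

sim_dog_cat = cosine_similarity(dog, cat)
sim_dog_barks = cosine_similarity(dog, barks)
```

Because the keys are explicit feature names rather than anonymous dimensions, each vector stays inspectable and queryable, which is the contrast with dense Word2Vec/GloVe embeddings drawn above.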
The learn pipeline's theoretical backbone is MST (Maximum Spanning Tree) parsing driven by pointwise mutual information, which achieves ~85% accuracy on unsupervised dependency recovery (Deniz Yuret, 1998 PhD thesis). The grammar learning target has the formal structure of a sheaf, the same mathematical structure that professional linguists hand-author when building parse-rule lexicons. A key discovery was that removing infrequent disjuncts as "noise" actually discarded most of the grammatical signal: the "noise" was the signal. Asymmetric MI, I(X,Y)/H(X), was proposed as superior to symmetric MI, along with a generalization from MST to maximum spanning DAG (MSDAG). (mailing-list-backed: Unsupervised-Lang-linasvepstas-works, Revised-report-on-connector-sets, Language-learning-status, 2014–2017)
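The MST idea can be sketched in a few lines of Python. This is a simplified illustration, not the learn pipeline's code: it assumes a score is available for every word pair in the sentence, uses Prim's algorithm for the spanning tree, and ignores the no-crossing constraint; the function names and the example scores are hypothetical.

```python
import math


def pmi(pair_count, left_count, right_count, total):
    """Pointwise mutual information in bits: log2(P(x, y) / (P(x) * P(y)))."""
    p_xy = pair_count / total
    p_x = left_count / total
    p_y = right_count / total
    return math.log2(p_xy / (p_x * p_y))


def maximum_spanning_tree(n_words, score):
    """Prim's algorithm over word positions 0..n_words-1.

    `score` maps (i, j) with i < j to a weight such as PMI. Assumes the
    scored pairs connect all words; otherwise max() raises ValueError.
    """
    in_tree = {0}
    edges = []
    while len(in_tree) < n_words:
        # Best-scoring edge with exactly one endpoint already in the tree.
        best = max(
            (edge for edge in score if (edge[0] in in_tree) != (edge[1] in in_tree)),
            key=lambda edge: score[edge],
        )
        edges.append(best)
        in_tree.update(best)
    return sorted(edges)


# Hypothetical PMI-like scores for a three-word sentence.
scores = {(0, 1): 3.0, (1, 2): 2.0, (0, 2): 0.5}
tree = maximum_spanning_tree(3, scores)
```

With these scores the tree keeps the two strongest pairs, (0, 1) and (1, 2), and drops the weak (0, 2) pair; each kept edge is read as an unlabeled dependency link between the two words.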
generate
C++ network generation library using sheaf theory and connector semantics. Generates parse trees, deduction chains, and plans from weighted constraints. Linas Vepstas advocated Meaning-Text Theory (Mel'čuk) as the strongest framework for generation, noting that it "shows you exactly how to convert general ideas into explicit, grammatical sentences" and that it underpins several commercial linguistics products. (mailing-list-backed: General-discussion-of-linguistic-representations-in-OpenCog, 2014)