Motivation
Two of the most productive paradigms in modern AI are built on self-supervised prediction along a single axis of structure in natural data.
Large language models exploit sequential structure. Given a prefix of tokens, the objective is to predict the next token. The training signal requires no external labels — the text provides both input and target. Scaling this objective over large corpora produces models that exhibit a broad range of capabilities, including in-context learning, retrieval, and chain-of-thought problem solving (Brown et al., 2020; Wei et al., 2022).
The Joint Embedding Predictive Architecture (JEPA) exploits spatial and temporal structure. Given visible regions of an image or video, the objective is to predict the abstract representation of masked regions — prediction in representation space rather than pixel space, which forces the encoder to retain only predictable structure and discard irrelevant variation. V-JEPA acquires representations encoding physical regularities from video without explicit supervision (Bardes et al., 2024).
In both cases, each prediction step operates within a single level of abstraction: a token predicts the next token; a patch predicts a neighboring patch. Neither paradigm's prediction objective explicitly models the relationship between abstraction levels — the generative hierarchy by which deeper structures produce surface observations.
SONDE is an architecture whose prediction objective operates on this axis: prediction across depth levels.
Depth Structure in Natural Data
Many domains contain data that is organized across levels of abstraction. In well-structured software, a function body implements a function's contract, which serves a module's interface, which instantiates the system's architecture. In scientific exposition, specific data supports an experimental result, which tests a hypothesis, which is derived from a theoretical framework. These are not universal patterns — poorly structured code, for instance, may exhibit no clean hierarchy — but where they exist, the hierarchical relationship is a structural property of the data itself, not an imposed annotation.
This relationship between abstraction levels is distinct from the axes exploited by existing self-supervised paradigms. Sequential prediction (LLMs) operates along the time axis within a single level. Spatial prediction (JEPA) operates across positions within a single level. Depth structure is a relationship between levels: each surface is produced by a deeper layer, and each deeper layer is itself a surface relative to the layer beneath it.
In certain domains, depth structure can be recovered from data without manual labeling. In code, abstract syntax trees and scope nesting provide explicit, programmatically extractable decomposition into depth levels. The degree to which this generalizes to other domains (scientific papers, legal documents, mathematical proofs) varies: section structure in papers provides an approximate depth hierarchy, but the boundaries are not always clean. Code is the primary training domain for SONDE precisely because its depth decomposition is unambiguous and mechanically extractable.
Architecture
SONDE is trained from scratch without pretrained weights. The architecture consists of five components.
ByteEncoder. A multi-scale convolutional network operating on raw byte sequences. Parallel convolution kernels of sizes 3, 5, 7, and 9 capture patterns at different scales. Sinusoidal positional encoding preserves sequence order. Learned attention pooling aggregates the convolutional features into a fixed-dimensional vector per depth level. The encoder runs once per sample; this is the computationally dominant step.
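The shape of this component can be sketched as follows (a minimal PyTorch sketch; the module structure, ReLU activations, and final linear projection are assumptions not specified above):

```python
import torch
import torch.nn as nn

class ByteEncoder(nn.Module):
    """Multi-scale CNN over raw bytes -> one fixed vector per level (sketch)."""
    def __init__(self, dim=256, kernel_sizes=(3, 5, 7, 9), max_len=512):
        super().__init__()
        self.embed = nn.Embedding(256, dim)          # one embedding per byte value
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, k, padding=k // 2) for k in kernel_sizes
        )
        # Fixed sinusoidal positional encoding
        pos = torch.arange(max_len).unsqueeze(1)
        freq = torch.exp(torch.arange(0, dim, 2) * (-torch.log(torch.tensor(10000.0)) / dim))
        pe = torch.zeros(max_len, dim)
        pe[:, 0::2] = torch.sin(pos * freq)
        pe[:, 1::2] = torch.cos(pos * freq)
        self.register_buffer("pe", pe)
        self.attn_score = nn.Linear(dim * len(kernel_sizes), 1)  # learned attention pooling
        self.out = nn.Linear(dim * len(kernel_sizes), dim)

    def forward(self, byte_ids):                     # (batch, seq_len) byte ids in [0, 256)
        x = self.embed(byte_ids) + self.pe[: byte_ids.size(1)]
        x = x.transpose(1, 2)                        # (batch, dim, seq_len) for Conv1d
        feats = torch.cat([torch.relu(c(x)) for c in self.convs], dim=1)
        feats = feats.transpose(1, 2)                # (batch, seq_len, dim * n_kernels)
        w = torch.softmax(self.attn_score(feats), dim=1)  # attention weights over positions
        pooled = (w * feats).sum(dim=1)              # weighted sum -> fixed-size vector
        return self.out(pooled)                      # (batch, dim)
```

One forward pass per depth level yields the four level vectors consumed downstream.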
CrossLevelRefiner. A transformer operating on the set of 4 level vectors (one per depth level), performing 5 iterative passes of multi-head cross-attention with gated residual updates. Each level attends to all other levels on every pass. Weights are shared across all passes (recurrent application). This is computationally inexpensive: the input is 4 vectors of dimension 256, not a long sequence.
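A minimal sketch of the recurrent refiner, assuming a sigmoid gate over the concatenated input and attended output (the exact gating form is not specified above):

```python
import torch
import torch.nn as nn

class CrossLevelRefiner(nn.Module):
    """Recurrent cross-attention over the 4 level vectors (sketch).

    One attention block is applied for n_passes iterations with shared
    weights; a learned gate controls how much of each update is kept.
    """
    def __init__(self, dim=256, heads=8, n_passes=5):
        super().__init__()
        self.n_passes = n_passes
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.gate = nn.Linear(dim * 2, dim)           # gated residual update

    def forward(self, levels):                        # (batch, 4, dim)
        x = levels
        for _ in range(self.n_passes):                # shared weights across passes
            attended, _ = self.attn(x, x, x)          # each level attends to all levels
            g = torch.sigmoid(self.gate(torch.cat([x, attended], dim=-1)))
            x = self.norm(x + g * attended)           # keep a gated fraction of the update
        return x
```

Because the sequence length is 4, each pass costs a 4×4 attention map, not a long-sequence one.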
Projection Heads. Four independent 2-layer MLPs (one per depth level, 512 hidden dimensions), projecting refined representations onto the unit hypersphere for cosine-similarity-based comparison. This normalization follows standard practice in contrastive representation learning (Chen et al., 2020).
Cross-Depth Predictor. A 3-layer transformer that receives the projected representations of unmasked levels as input and produces a predicted representation of the masked level. Learned depth-level embeddings encode which levels are visible and which is the prediction target.
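A sketch of the predictor, assuming a learned mask token marks the target slot (the masking mechanism inside the predictor is an assumption):

```python
import torch
import torch.nn as nn

class CrossDepthPredictor(nn.Module):
    """Predict the masked level's representation from the visible levels (sketch)."""
    def __init__(self, dim=256, heads=8, n_layers=3, n_levels=4):
        super().__init__()
        self.level_embed = nn.Embedding(n_levels, dim)   # learned depth-level embeddings
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim_feedforward=512, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)

    def forward(self, visible, masked_idx):
        # visible: (batch, n_levels, dim); the masked slot's content is ignored
        n_levels = visible.size(1)
        ids = torch.arange(n_levels, device=visible.device)
        x = visible + self.level_embed(ids)              # tag each slot with its depth level
        x[:, masked_idx] = self.mask_token[0] + self.level_embed(ids[masked_idx])
        out = self.transformer(x)
        return out[:, masked_idx]                        # predicted representation of the masked level
```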
EMA Target Encoders. Exponential moving average copies of the encoder and projection heads, updated with momentum coefficient 0.996. Target representations are computed through these parameters, which do not receive gradient updates. This mechanism prevents representational collapse — the degenerate solution in which all representations converge to a constant — and is shared with BYOL (Grill et al., 2020) and JEPA.
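The EMA rule itself is simple; a pure-Python sketch over named parameters (in practice the same rule is applied elementwise to the target encoder's tensors):

```python
def ema_update(target_params, online_params, momentum=0.996):
    """Move each target parameter a small step toward its online counterpart.

    target_params / online_params: dicts mapping parameter names to values.
    Returns the updated target parameters; no gradients flow through this.
    """
    return {
        name: momentum * target_params[name] + (1.0 - momentum) * online_params[name]
        for name in target_params
    }
```

The update runs once per optimizer step, after the online parameters have changed.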
Training Procedure
Given a depth-structured tuple (function body, function signature, class context, module context), one level is selected uniformly at random and masked. The remaining levels are encoded, iteratively refined, and projected. The predictor estimates the masked level's representation. The loss function is:
L = 0.5 × L_cosine + 0.5 × L_InfoNCE
where L_cosine = 1 − cos(ŷ, y) measures the distance between predicted and target representations, and L_InfoNCE is temperature-scaled cross-entropy (τ = 0.07) computed over the batch, pushing representations of different functions apart while pulling representations of the same function together (Oord et al., 2018). Gradients propagate through the encoder, refiner, projection heads, and predictor. Target encoders update via EMA only.
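The combined loss can be sketched in NumPy, assuming the projected representations are already unit-normalized so that dot products are cosine similarities:

```python
import numpy as np

def sonde_loss(pred, target, tau=0.07):
    """0.5 * cosine loss + 0.5 * InfoNCE over the batch (sketch).

    pred, target: (batch, dim) arrays, assumed L2-normalized by the
    projection heads. The matching index is the InfoNCE positive.
    """
    sims = pred @ target.T                       # (batch, batch) cosine similarities
    l_cosine = np.mean(1.0 - np.diag(sims))      # 1 - cos(pred_i, target_i)
    logits = sims / tau                          # temperature-scaled
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    l_infonce = -np.mean(np.diag(log_probs))     # cross-entropy toward the diagonal
    return 0.5 * l_cosine + 0.5 * l_infonce
```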
Masking Strategy
The choice of which level to mask determines the inference direction:
- Masking deep levels, predicting from shallow — inductive inference (inferring generative process from observations)
- Masking shallow levels, predicting from deep — deductive inference (predicting realizations from specifications)
- Masking intermediate levels — abductive inference (inferring mediating structure from endpoints)
During training, the masked level is selected uniformly at random, training all three directions simultaneously.
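A sketch of the sampling and the mapping from mask position to inference direction, assuming level 0 is the shallowest (surface) level and the last level the deepest:

```python
import random

def sample_mask(n_levels=4, rng=random):
    """Choose the masked level uniformly at random, as in training."""
    return rng.randrange(n_levels)

def inference_direction(masked_idx, n_levels=4):
    """Label the inference direction a given mask position exercises."""
    if masked_idx == n_levels - 1:
        return "inductive"   # deepest level masked, predicted from shallower
    if masked_idx == 0:
        return "deductive"   # shallowest level masked, predicted from deeper
    return "abductive"       # intermediate level masked
```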
Data
The training domain is code. Depth levels are extracted programmatically via AST parsing: function body (level 0), function signature (level 1), enclosing class or scope (level 2), and module-level context (level 3). Levels are non-overlapping by construction. The dataset consists of 6,000 depth-structured tuples curated from Lean 4 theorem prover libraries and open-source code repositories, split 80/20 by repository to prevent data leakage between training and evaluation.
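For Python source, the stdlib `ast` module suffices to illustrate the extraction. This is an illustrative sketch (the function name and return format are hypothetical), not the actual multi-language pipeline:

```python
import ast

def extract_levels(source, func_name):
    """Split one function into four depth levels (Python-only sketch).

    Returns (body, signature, class_context, module_context) as source text,
    illustrating that the decomposition is mechanical, not annotated.
    """
    tree = ast.parse(source)
    module_ctx = "\n".join(
        ast.get_source_segment(source, n) or ""
        for n in tree.body if isinstance(n, (ast.Import, ast.ImportFrom))
    )
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == func_name:
            body = "\n".join(ast.get_source_segment(source, s) or "" for s in node.body)
            args = ", ".join(a.arg for a in node.args.args)
            signature = f"def {node.name}({args})"
            break
    class_ctx = next(
        (f"class {n.name}" for n in ast.walk(tree)
         if isinstance(n, ast.ClassDef)
         and any(isinstance(c, ast.FunctionDef) and c.name == func_name
                 for c in n.body)),
        "",
    )
    return body, signature, class_ctx, module_ctx
```

Each returned string maps to one depth level, and the four levels are disjoint spans of the source.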
Experimental Results
Seven iterations were conducted, each isolating a specific architectural or data variable. All results below are reported on held-out test sets from repositories absent from training data.
v1–v4: Establishing Learnability
v1 (12K samples, no regularization) produced a train-set coherence gap of 0.462. On held-out repositories, the gap was not significantly above zero, indicating complete overfitting.
v2 introduced regularization (dropout 0.2, weight decay, early stopping). 80K samples, 1.4M parameters, 128-dimensional embeddings, cosine loss only. Evaluated on 14 unseen repositories: coherence gap 0.117, retrieval@1 1.5%, anomaly AUC 0.707. The coherence gap — defined as the difference in mean cosine similarity between same-function cross-level pairs and different-function cross-level pairs — was the first metric to exceed what a randomly initialized encoder produces (gap ≈ 0) on held-out data.
v3 held training data constant and modified only the architecture: 256-dimensional embeddings, attention pooling replacing mean pooling, InfoNCE contrastive loss added to cosine loss, sinusoidal positional encoding. 9.5M parameters. Coherence gap 0.327 (2.8× v2), retrieval@1 51.0% (34× v2), anomaly AUC 0.896. Because the training data was identical to v2, the improvement is attributable to architectural changes.
v4 added Wikipedia text alongside code (115K training samples total, 29K test). Coherence gap 0.591 (1.8× v3). Note: this comparison is confounded — v4 changed both the domain composition and total data volume relative to v3. However, the within-domain code coherence gap in v4 exceeded v3's converged result, suggesting that the additional domain contributed structure rather than noise. A controlled comparison holding total data volume constant was not conducted.
| Version | Gap | Retrieval @1 | Anomaly AUC | Controlled Variable |
|---|---|---|---|---|
| v1 | 0.462 (train) | — | — | Baseline (overfit) |
| v2 | 0.117 | 1.5% | 0.707 | Regularization |
| v3 | 0.327 | 51.0% | 0.896 | Architecture (data held constant) |
| v4 | 0.591 | — | — | Multi-domain data |
v5–v6: Byte Encoding and Iterative Refinement
v5 simultaneously changed two variables: the encoder (CNN byte encoder replacing text encoder, eliminating all pretrained components) and the training data (6,000 dense curated tuples with strictly non-overlapping depth levels, replacing 80K samples with less controlled level separation). Coherence gap: 0.424; retrieval@1: 31.5%; anomaly AUC: 0.900. Because two variables changed, the result does not cleanly attribute improvement to either the encoder or the data. However, achieving a gap of 0.424 with roughly 13× fewer samples than v3 (gap 0.327, 80K samples) is consistent with the hypothesis that clean depth-level separation in training data is at least as important as data volume.
v6 introduced the CrossLevelRefiner: iterative cross-attention over compact level vectors with shared weights across passes. Same data as v5.
| Metric | v5 | v6 |
|---|---|---|
| Coherence gap | 0.424 | 0.947 |
| Same-function cosine similarity | 0.452 | 0.966 |
| Retrieval @1 | 31.5% | 96.8% |
| Retrieval @5 | 53.0% | 100% |
| Anomaly AUC | 0.900 | 0.9996 |
Training set: 4,807 functions. Test set: 1,193 functions from unseen repositories. 10.1M trainable parameters. Training time: 68 minutes on a single consumer GPU.
At v6, depth-level representations from the same function achieve mean cosine similarity 0.966; representations from different functions achieve 0.038. Retrieval@1: given a function body, the model retrieves the correct function signature from among all 1,193 test candidates with 96.8% accuracy (100% within top 5). Anomaly detection: mismatched depth-level pairs (e.g., a function body paired with an unrelated class definition) are distinguished from matched pairs with AUC 0.9996.
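Retrieval@k over cosine similarities is straightforward to compute; a NumPy sketch (function name hypothetical):

```python
import numpy as np

def retrieval_at_k(queries, candidates, k=1):
    """Fraction of queries whose true match (same row index) appears in the
    top-k cosine-similarity candidates. Inputs assumed L2-normalized.
    """
    sims = queries @ candidates.T                  # (n, n) cosine similarities
    top_k = np.argsort(-sims, axis=1)[:, :k]       # top-k candidate indices per query
    truth = np.arange(len(queries))[:, None]
    return float(np.mean((top_k == truth).any(axis=1)))
```

For the v6 numbers above, `queries` would hold body representations and `candidates` the corresponding signature representations.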
v7: Generation and Cross-Domain Transfer
v6 established that the learned representations encode depth structure. v7 tested whether this structure generalizes beyond the training domain.
The v6 encoder and refiner are frozen. A dense decoder is trained on top: 64 learned latent tokens are concatenated with the 3 visible-level representations (total sequence length: 67), refined through 7 shared-weight transformer passes, then expanded via MLP to 512 byte positions with local convolutional refinement. Attention cost per pass: 67² = 4,489 operations, versus 512² = 262,144 for naive sequence-length attention (58× reduction).
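A sketch of the decoder's shape (latent count, pass count, and output length follow the description above; the hidden channel width and per-latent expansion are assumptions):

```python
import torch
import torch.nn as nn

class DenseDecoder(nn.Module):
    """Latent-token decoder over frozen SONDE representations (sketch).

    Attention runs over 67 tokens (64 latents + 3 visible levels) rather
    than 512 byte positions; bytes are produced only by the final expansion.
    """
    def __init__(self, dim=256, n_latents=64, n_passes=7, out_len=512, ch=64):
        super().__init__()
        self.per_latent = out_len // n_latents            # byte positions per latent
        self.latents = nn.Parameter(torch.randn(1, n_latents, dim) * 0.02)
        self.n_passes = n_passes
        self.attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.expand = nn.Linear(dim, self.per_latent * ch)          # MLP expansion
        self.refine = nn.Conv1d(ch, 256, kernel_size=5, padding=2)  # local refinement -> byte logits
        self.ch = ch

    def forward(self, visible_levels):                    # (batch, 3, dim)
        batch = visible_levels.size(0)
        x = torch.cat([self.latents.expand(batch, -1, -1), visible_levels], dim=1)  # (batch, 67, dim)
        for _ in range(self.n_passes):                    # shared weights across passes
            attended, _ = self.attn(x, x, x)
            x = self.norm(x + attended)
        lat = x[:, : self.latents.size(1)]                # latent tokens only
        b = self.expand(lat)                              # (batch, n_latents, per_latent * ch)
        b = b.view(batch, -1, self.ch).transpose(1, 2)    # (batch, ch, out_len)
        return self.refine(b)                             # (batch, 256, out_len) byte logits
```

The 256 output channels are logits over byte values at each of the 512 positions.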
The evaluation task is depth ordering: given two samples from the same depth-structured tuple, predict which originates from the deeper level. This is a binary classification task that requires the model to have learned a consistent notion of "deeper" versus "shallower."
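The task reduces to comparing scalar scores for the two samples; a minimal sketch, with `deeper_score` standing in for the model's judgment (hypothetical interface):

```python
def depth_ordering_accuracy(pairs, deeper_score):
    """Accuracy on the depth-ordering task (sketch).

    pairs: list of (shallow_sample, deep_sample) tuples from the same
    depth-structured tuple. deeper_score: any scalar scoring function
    where higher means "judged to come from a deeper level".
    """
    correct = sum(deeper_score(deep) > deeper_score(shallow)
                  for shallow, deep in pairs)
    return correct / len(pairs)
```

Any scorer that correlates with depth will score above chance here, which is why the surface-feature confound discussed below matters.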
| Evaluation Setting | Accuracy |
|---|---|
| Random baseline | 50.0% |
| Code (training distribution) | 92.0% |
| Code (held-out test repositories) | 91.6% |
| Newton's manuscripts (zero-shot transfer) | 75.6% |
The model was trained exclusively on source code. The zero-shot evaluation domain — Newton's manuscripts — consists of 17th-century natural philosophy, theology, and alchemical writings. While both domains use English characters (and code contains English-language identifiers), the domains share no syntactic structure, formatting conventions, or subject matter. The model achieves 75.6% accuracy (25.6pp above chance; p < 0.001, binomial test against H₀: accuracy = 0.5). A limitation of this evaluation: the binomial test establishes that performance exceeds chance, but does not rule out the possibility that the model exploits surface features (e.g., text length, lexical complexity) that correlate with depth without reflecting genuine depth structure. Controlled experiments with length-matched and complexity-matched pairs would strengthen the cross-domain claim. The in-domain train-test gap is minimal (92.0% → 91.6%), indicating robust generalization within the training domain.
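The significance claim can be checked with a one-sided exact binomial test; since the number of Newton evaluation pairs is not stated here, n is left as a parameter (pure-Python sketch):

```python
from math import comb

def binomial_p_value(k, n, p0=0.5):
    """One-sided p-value for H0: accuracy = p0 against accuracy > p0.

    P(X >= k) for X ~ Binomial(n, p0). The number of Newton evaluation
    pairs is not stated in the text, so n must be supplied.
    """
    return sum(comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(k, n + 1))
```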
Isaac Newton
The Regulae Philosophandi, added to the second edition of the Principia (1713), articulate a methodological commitment to parsimony in causal explanation. Rule I: "We are to admit no more causes of natural things than such as are both true and sufficient to explain their appearances." Rule II: "Therefore to the same natural effects we must, as far as possible, assign the same causes." Whether these rules describe Newton's actual investigative practice or represent a post-hoc rationalization is debated in Newton scholarship (Westfall, 1980; Cohen, 1999). What is not debated is the structure of the results: Newton's published work repeatedly reduced large surfaces of apparently distinct phenomena to compact generating structures at depth.
Planetary orbits, projectile trajectories, tidal patterns, and the precession of the equinoxes — phenomena that had been catalogued and modeled independently for centuries — were reduced to a single law: F = Gm₁m₂/r². The visible spectrum, previously treated as a property of surfaces and pigments, was reduced to the differential refrangibility of corpuscular light.
The unpublished manuscripts extend this pattern beyond the published works. Newton's alchemical notebooks (estimated at over a million words by the Newton Project, though exact word counts depend on what is included) investigate whether chemical transmutation provides evidence of a unifying active principle in matter. His theological chronologies attempt to recover what he considered the original, uncorrupted doctrine beneath centuries of textual alteration. In both cases, the operative structure is the same as in the Principia: a surface (observed chemistry, received scripture) is treated as the product of a deeper process, and the investigation works downward.
This iterated descent through levels of abstraction is the cognitive operation that SONDE formalizes as a computational objective. Newton's manuscripts, which record fifty years of this process in written form, now serve as an out-of-distribution evaluation set. A model trained exclusively on source code achieves 75.6% accuracy on depth ordering within these manuscripts (p < 0.001 by binomial test) — a statistically significant result in a domain entirely absent from the training distribution.
Core Thesis
The experimental results raise a question the current data cannot resolve but which the architecture was designed to investigate: whether generative depth structure is domain-specific or domain-general.
Under the domain-specific view, depth structure in code (function → module → architecture) and depth structure in natural language (claim → argument → framework) are independent regularities sharing only a superficial hierarchical form. A model trained on one domain should not transfer to another, and multi-domain training should dilute within-domain performance by forcing the model to average over incompatible structural patterns.
Under the domain-general view, there exist structural invariants in the relationship between abstraction levels — regularities in how a surface relates to what produced it — that hold across domains. If such invariants exist, they constitute something more fundamental than domain-specific patterns: not a set of facts common to all fields, but a set of organizational principles governing how knowledge at any level of abstraction generates knowledge at adjacent levels. The question is whether such principles are real or whether the appearance of cross-domain structure is an artifact of superficial similarities.
Two observations bear on this question, neither individually conclusive. First, multi-domain training (v4) improved within-domain code coherence beyond what code-only training achieved. This result is confounded by the simultaneous increase in total data volume, but the domain-specific view predicts that adding unrelated data should at minimum dilute domain-specific representations, which was not observed. Second, code-trained representations (v7) transferred to 17th-century natural philosophy at 75.6% accuracy — well above chance, though the possibility that the model exploits surface correlates of depth (rather than depth structure itself) has not been ruled out.
These observations are consistent with the existence of shared depth structure. They do not establish it. Two domains cannot demonstrate universality. The nature and formal properties of any such shared structure remain uncharacterized. Whether depth invariants, if they exist, are fully recoverable from finite observational data is an open theoretical question bearing on fundamental limits of unsupervised representation learning (Locatello et al., 2019; Morioka & Hyvärinen, 2024). Rigorous investigation requires evaluation across a substantially broader set of domains (mathematics, legal reasoning, biological systems, musical structure) and formal analysis distinguishing genuinely shared structural properties from superficial hierarchical similarity.
Theoretical Context
Learning generative depth structure from observational data confronts established impossibility results. The Causal Hierarchy Theorem (Bareinboim et al., 2022) proves that observational distributions generically do not determine interventional or counterfactual quantities. Markov equivalence entails that multiple distinct causal graphs can produce identical conditional independence structures, rendering the generating graph unidentifiable from observational data alone. Locatello et al. (2019) demonstrated that unsupervised disentanglement of independent latent factors is impossible without inductive biases that constrain the model class or the data distribution.
Recent theoretical work has identified conditions under which these impossibilities can be circumvented. Morioka and Hyvärinen (ICML 2024) proved identifiability of causal representations from purely observational data under a grouping structure assumption. Richens and Everitt (ICLR 2024) showed that decision-making agents satisfying regret bounds must learn approximate causal models of their environment. The VAR architecture (Tian et al., NeurIPS 2024 Best Paper) demonstrated that next-scale prediction — predicting across spatial resolution levels — outperforms next-token prediction for visual generation, providing empirical evidence that cross-level prediction is a productive self-supervised signal.
Code constitutes a favorable domain for depth prediction because it satisfies several conditions that the identifiability literature suggests are enabling: explicit hierarchical structure recoverable via AST parsing, and training across independent codebases constituting multi-environment data — a condition that provably enables causal identifiability under mild assumptions (Peters et al., 2016). Revision histories could in principle provide temporal and quasi-interventional signals, though the current implementation of SONDE uses only static code snapshots and does not exploit version history.
To our knowledge, no prior architecture implements cross-depth representation prediction as a self-supervised training objective. Related work includes H-JEPA (LeCun, 2022), which proposes hierarchical prediction across temporal scales but has not been implemented; PrediRep (Ororbia & Friston, 2024), which performs cross-level prediction in a predictive coding framework but does not scale beyond 5–7 layers; VAR, which predicts across spatial resolutions rather than abstraction depth; and DreamCoder (Ellis et al., 2021), which learns hierarchical program libraries but operates in constrained symbolic domains.
Specifications
| Parameter | Value |
|---|---|
| Architecture | CNN byte encoder + cross-level refiner + cross-depth predictor |
| Trainable parameters | 10.1M (20M including EMA targets) |
| Embedding dimension | 256 |
| Hidden dimension | 512 |
| Attention heads | 8 |
| Depth levels | 4 |
| Refiner passes | 5 (weight-shared) |
| Predictor layers | 3 |
| Training data | 6,000 depth-structured tuples (Lean 4 + open-source code) |
| Train/test split | 80/20 by repository |
| Training time | 68 minutes on a single consumer GPU |
| Loss | 0.5 × L_cosine + 0.5 × L_InfoNCE (τ = 0.07) |
| EMA momentum | 0.996 |
| Dropout | 0.15 |
| Optimizer | AdamW (lr = 5 × 10⁻⁵) |