Benchmark Results
We ran 132 benchmark tasks across 4 models and 40+ experimental conditions on a SysML v2 corpus. Two patterns emerged consistently enough to stake claims on. Both are testable on a second corpus — and we intend to test them.
Representation > Retrieval
Lead evidence: O4 (d=1.01, N=10), O12 (d=0.75), O8 (d=0.64)
Ablation: graph tool penalty from one tool (sysml_inspect), not graph tools inherently
Null results that constrain the claim: O5, O6, O9, O14
Representation matters more than retrieval
For AI systems working with structured knowledge, the form in which information is presented to the model matters more than the mechanism used to find it.
Every retrieval intervention we tested produced null results:
- Vector search added nothing (O5, exact tie 0.880 vs 0.880)
- Graph traversal added nothing at 2–3 hops (O9) or 4–5 hops (O14)
- Planning tools added nothing on hard tasks (O6, +0.035)
Every representation and guidance intervention produced large effects:
- Rendering model elements into structured views nearly doubled accuracy on explanation tasks (O4, d=1.01, N=10) — and at 4× lower cost
- One sentence of tool selection guidance eliminated a 13-point penalty on discovery tasks (O12, d=0.75)
- CLI tool-based search dominated bulk context injection on discovery (O8, d=0.64)
Schema ablation identified the specific mechanism behind the apparent graph tool penalty: one tool (sysml_inspect) created a selection trap on discovery tasks. Removing it recovered performance to 0.925, above the search-only baseline. The penalty came from tool design, not from graph tools inherently.
Aggregate Benchmarks Lie
Lead evidence: O1 (per-task d: −0.400 to +0.800)
Supporting: O10 (bimodal scaling), O8 (task-type interaction), O4 (range +1.000 to −0.200)
Aggregate benchmarks hide task-level structure
Our aggregate tool comparison showed no significant difference (O1, p=0.391). A naive reading: tools don't matter. But per-task analysis revealed effect sizes ranging from −0.400 to +0.800: enormous, opposite-signed effects that cancel in the mean.
- CLI outperforms RAG on discovery by 29 points but RAG is roughly equivalent on reasoning (O8). An aggregate would hide both signals.
- Scaling collapse is bimodal: 5 of 20 tasks score 1.000 on the larger corpus while 11 of 20 score below 0.333 (O10). The mean of 0.423 describes no actual task.
- O4's render advantage ranges from +1.000 (task E4, perfect cliff at N=10) to −0.200 (task E5). The mean of +0.335 understates the wins and hides the reversal.
The methodological contribution: per-task analysis with paired effect sizes is necessary to surface real patterns in tool-augmented LLM evaluation.
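The per-task analysis can be sketched as a paired effect size (Cohen's d_z: mean of per-run differences over their SD) computed per task before any pooling. The run scores below are hypothetical, shaped like the D11/D13 pattern above, not the benchmark's actual data:

```python
from statistics import mean, stdev

def paired_d(a, b):
    """Paired Cohen's d_z: mean of per-run differences over their SD."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / stdev(diffs)

# Hypothetical per-run scores for two tasks under two tool sets,
# shaped like the D11 (search wins) / D13 (graph wins) pattern.
runs = {
    "D11": {"search": [1.0, 1.0, 0.8, 1.0, 1.0], "graph": [0.2, 0.4, 0.2, 0.0, 0.2]},
    "D13": {"search": [0.6, 0.4, 0.6, 0.8, 0.6], "graph": [1.0, 1.0, 0.8, 1.0, 1.0]},
}

for task, scores in runs.items():
    print(task, round(paired_d(scores["search"], scores["graph"]), 2))

# Pooling all runs first shrinks the signal: opposite-signed effects cancel.
all_search = [s for t in runs.values() for s in t["search"]]
all_graph = [s for t in runs.values() for s in t["graph"]]
print("pooled mean diff:", round(mean(all_search) - mean(all_graph), 2))
```

The per-task d values are large and opposite-signed, while the pooled mean difference is small, which is exactly the cancellation pattern the aggregate test cannot see.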
14 observations from the benchmark. Five are presented in detail below; expand each for the full analysis. All p-values are uncorrected; when corrected for 14 simultaneous tests, none remain significant across the full set.
O12 — Context engineering outperforms tool restriction d=0.75 · p=0.009 · power 0.80
Guided graph scores 0.887 vs 0.750 unguided, matching the 2-tool baseline (0.880).
p=0.009 (paired t, uncorrected)
d=0.75, N=16 tasks
Power: 0.80
The naive response to "too many tools hurt performance" is to restrict the tool set. The better response is a sentence in the system prompt. When agents are instructed to start with search and read_file, escalating to graph tools only when search is insufficient, the 13-point discovery penalty from over-tooling disappears entirely. Performance with 6 tools matches and marginally exceeds the 2-tool baseline (0.887 vs 0.880).
The affected tasks (D11, D12, D16, D6) are those where unguided agents select structurally complex tools for attribute-lookup tasks that search handles trivially. The agent doesn't need graph traversal to find a part's mass. It needs search. But without guidance, it reaches for the most powerful tool available, and the overhead of using it (more tokens, more turns, more opportunities to go off track) costs accuracy.
This is the only adequately powered observation in the study (power=0.80). It is also the lowest nominal p-value (0.009). If we had to pick one finding to bet on replicating, this would be it.
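The exact guidance sentence is not quoted in this writeup, so the snippet below is an illustrative paraphrase of the described policy (start with search and read_file, escalate only when search is insufficient), not the prompt actually used:

```python
# Illustrative paraphrase only: the actual guidance sentence used in O12
# is not quoted in this writeup.
GUIDANCE = (
    "Start with search and read_file; escalate to the graph tools "
    "(trace, check, query, inspect) only when search is insufficient."
)

def build_system_prompt(base_prompt: str, guided: bool) -> str:
    """Append the ~50-token tool-selection hint when running a +guided config."""
    return base_prompt + ("\n\n" + GUIDANCE if guided else "")
```

The point of the sketch is the cost asymmetry: the +guided configurations change roughly 50 schema tokens of prompt, not the tool set itself.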
O4 — Pre-rendered views outperform agent-assembled context d=1.01 · p=0.025 · 4× cheaper
| Config | Score | Cost |
|---|---|---|
| Pre-rendered | 0.893 | $6.23 |
| Agent-assembled | 0.558 | $24.76 |
E4: perfect 1.0→0.0 cliff across all 10 runs.
p=0.025 (t-test), p=0.031 (Wilcoxon)
d=1.01, N=10 runs × 8 tasks
On explanation tasks (N=10 runs per task), pre-rendered model views scored 0.893 vs 0.558 for letting the agent assemble its own context. A 34-point gap. The effect strengthened at N=10 (d=1.01, up from d=0.83 at N=5). Task E4 shows a perfect 1.0→0.0 cliff: rendering enables it entirely, graph assembly cannot solve it across 10 attempts.
The advantage is explanation-specific. On discovery tasks, pre-rendering scored 0.719, worse than search (0.880). Pre-rendering the wrong view adds noise, not signal. This matters: it means pre-rendering is not a universal improvement. It is a task-dependent one, and the task type determines whether it helps or hurts.
The cost difference is striking: $6.23 total for 10 render runs (3.8 avg turns) vs $24.76 for 10 graph runs (8.7 avg turns) — 4× cheaper. The pre-rendered view does the work at index time that the agent would otherwise do at query time, and it does it once instead of per-query.
This has the largest effect size in the study (d=1.01) with near-adequate power (0.69 at N=10, needs 10 tasks for 80%). The effect strengthened with more data, which is what you want to see from a real signal.
E1–E8 explanation tasks · render vs graph · N=10 · E5 is the reversal task
O8 — Retrieval strategy interacts with task type d=0.64 · p=0.021
| Task type | CLI | RAG |
|---|---|---|
| Discovery | 0.855 | 0.566 |
| Reasoning | 0.323 | 0.459 |
Discovery: p=0.021 (paired t), d=0.64
Reasoning: p=0.403 (not significant)
CLI tool-based search dominated on discovery tasks (+29 points over RAG, p=0.021, d=0.64, N=16 tasks). RAG edged ahead on cross-file reasoning (+14 points, p=0.403, not significant), likely because it injects all relevant context at once, avoiding the failure mode where the agent runs out of turns before it can chain together enough tool calls to answer multi-step questions.
The CLI advantage on discovery is driven by 5 tasks where RAG scores 0.000: tasks requiring iterative tool-mediated retrieval that single-shot context injection cannot perform. The model needs to search, read a result, search again based on what it found, and repeat. RAG gives it everything at once, which is the right coverage but the wrong format for these tasks.
Neither retrieval architecture is universally better. This suggests that the right approach is not to pick one, but to route queries to the right strategy based on task type.
Discovery: CLI +29 pts over RAG (p=0.021) · Reasoning: no significant difference
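That routing idea can be sketched directly from the O8 per-task-type means in the table above. Classifying an incoming query into a task type is assumed to happen upstream and is the genuinely hard part; this sketch only shows the selection step:

```python
# O8 per-task-type mean scores from this writeup.
SCORES = {
    "discovery": {"cli": 0.855, "rag": 0.566},
    "reasoning": {"cli": 0.323, "rag": 0.459},
}

def route(task_type: str) -> str:
    """Pick the retrieval strategy with the higher observed mean score."""
    by_strategy = SCORES[task_type]
    return max(by_strategy, key=by_strategy.get)

print(route("discovery"))  # → cli
print(route("reasoning"))  # → rag
```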
O1 — Tool-task interaction is heterogeneous aggregate p=0.391 · per-task d up to 0.80
| Task | Search | Graph |
|---|---|---|
| D11 | 1.000 | 0.200 |
| D6 | 1.000 | 0.400 |
| D13 | 0.600 | 1.000 |
| D10 | 0.700 | 1.000 |
Aggregate: p=0.391 (not significant). Per-task: up to 0.80 difference.
Graph tools appeared to hurt discovery tasks, help layer tasks, and stay near-neutral on reasoning. The aggregate difference is not statistically significant (paired t-test p=0.391, N=16) because the effect is task-dependent. Schema ablation on discovery tasks (N=5) later revealed that the discovery penalty was caused by one tool (sysml_inspect) creating a selection trap, not by graph tools inherently. With that tool removed, the remaining graph configuration scored 0.925 vs 0.872 for search-only on discovery tasks.
The pattern holds across all four models tested, making it one of the most robust qualitative observations in the benchmark despite the null aggregate test. This is the lead evidence for Thesis 2: the aggregate null is not "tools don't matter." It is "tools matter enormously, but in opposite directions on different tasks, and the average hides everything interesting."
D1–D16 discovery tasks · search vs graph · Sonnet · opacity ∝ magnitude · aggregate p=0.391
O10 — Corpus scale is the dominant difficulty factor 0.880 → 0.423 at 5× scale
| Corpus | Score |
|---|---|
| 19 files | 0.880 |
| 95 files, search | 0.423 |
| 95 files, graph | 0.389 |
| 95 files, +vectors | 0.409 |
Failure modes: 55% budget exhaustion, 27% reasoning errors, 0% search failure.
Performance roughly halves from 19 to 95 files (0.880 to 0.423). At scale, additional tools did not help — the bottleneck is reasoning depth and turn budget, not retrieval quality. 11 of 20 scaling tasks fall below 0.333. The distribution is bimodal: easy tasks remain easy, hard tasks become impossible.
The failure mode is revealing: 55% of the time the agent ran out of turns before finishing. 27% were reasoning errors. 0% were search failures. The agent can find the information. It just can't process enough of it within the turn budget to reach the right answer. This suggests the path forward is better orchestration rather than better search.
This is the observation that keeps us honest. Our other results come from a 19-file corpus. Real engineering repositories are hundreds or thousands of files. The scaling problem is unsolved by any method we tested.
20 scaling tasks (95-file corpus) · graph score · bimodal: 6× at 1.000, 11× at 0.000 · dashed line = mean
Remaining observations summarized. Full details in the benchmark repository.
Other observations
| ID | Summary | Classification |
|---|---|---|
| O2 | Model quality gap: Sonnet consistently outperformed OpenAI models | Descriptive |
| O3 | o3-mini is the only model where graph tools help on reasoning (+0.056) | Exploratory (power=0.08) |
| O5 | Vector search: exact tie with keyword search on small corpus (0.880 vs 0.880) | Null |
| O6 | Planning tools (sysml_stat, sysml_plan): +0.035 on hard tasks, not significant | Null |
| O7 | RFLP layer tasks: cli_full showed slight advantage (~0.25 effect) | Exploratory |
| O9 | Graph tools at 2–3 hops: no benefit (d=0.16, power=0.07) | Null |
| O11 | Turn budget is a partial bottleneck but not the whole story | Descriptive |
| O13 | Few-shot examples hurt mini models (GPT-4o-mini, o3-mini) | Exploratory |
| O14 | Graph tools at 4–5 hops: not significant (d=0.44, power=0.19) | Null (underpowered) |
The observations above are our interpretation of the data. Below is the data itself: 4,700+ scored benchmark runs across 4 models and 12 tool configurations. Filter, compare, and draw your own conclusions.
Score heatmap — tasks × tool sets
Data from sysml-bench-results · 4,726 scored runs · gitlab.com/nomograph/sysml-bench-results
Corpus
Eve Online Mining Frigate SysML v2 model. 19 files, 798 elements, 1,515 relationships.
Models
Claude Sonnet 3.5, GPT-4o, GPT-4o-mini, o3-mini.
Replication
N=3 exploratory sweeps. N=5–10 for key comparisons (O4 at N=10, O1/O12 at N=5). Temperature 0.3 for all CLI runs.
Caveat
All results exploratory. None survive multiple comparison correction across 14 observations.
sysml-bench is an exploratory benchmark evaluating how tool-augmented LLMs perform on structured engineering tasks in SysML v2. 132 tasks across 8 categories (discovery, reasoning, explanation, layer, boundary, vector-sensitive, structural trace, corpus scaling), tested with 4 models and 40+ experimental conditions.
The study generated 14 observations. Three achieved nominal statistical significance. None survive correction for running 14 tests simultaneously. It is a well-characterized exploratory study that identifies patterns and estimates effect sizes. The contribution is the benchmark methodology, the identification of task-tool interaction as a key variable, and the effect size estimates that make confirmatory follow-up designable.
Scoring
Per-field structured scoring. Each task defines expected fields with typed scorers: Bool (exact match), Float (numeric within tolerance), Str (exact string match), StrContains (case-insensitive substring), ListStr (F1 score with 0.8 threshold, binarized). Task score = mean of field scores. Condition score = mean of task scores across N runs.
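The scorer types can be sketched as follows. The Float tolerance value is an assumption (the harness does not publish it here), and the actual implementation may normalize strings differently:

```python
def score_bool(expected: bool, got: bool) -> float:
    return 1.0 if expected == got else 0.0

def score_float(expected: float, got: float, tol: float = 0.01) -> float:
    # Tolerance value is assumed; the harness does not publish it here.
    return 1.0 if abs(expected - got) <= tol else 0.0

def score_str_contains(expected: str, got: str) -> float:
    """Case-insensitive substring match."""
    return 1.0 if expected.lower() in got.lower() else 0.0

def score_list_str(expected: list, got: list, threshold: float = 0.8) -> float:
    """F1 between expected and predicted string sets, binarized at 0.8."""
    exp, pred = set(expected), set(got)
    tp = len(exp & pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(exp)
    f1 = 2 * precision * recall / (precision + recall)
    return 1.0 if f1 >= threshold else 0.0

def task_score(field_scores: list) -> float:
    """Task score = mean of field scores."""
    return sum(field_scores) / len(field_scores)
```

The binarization in `score_list_str` is what creates the cliff effects discussed under threat T5: a prediction recovering 3 of 5 expected items has F1 = 0.75 and scores 0.0.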
Tool sets
| Tool Set | Tools | Schema Tokens | Description |
|---|---|---|---|
| cli_search | 2 | 268 | search + read_file |
| cli_graph | 6 | 1,116 | search + trace + check + query + inspect + read_file |
| cli_render | 7 | 1,303 | cli_graph + sysml_render |
| cli_full | 9 | 1,485 | cli_render + sysml_stat + sysml_plan |
| +guided | varies | +~50 | System prompt with tool selection hint |
| +vectors | varies | +0 | Adds fastembed HNSW vector index |
Known confounds (partially resolved)
cli_search (2 tools, 268 tokens) vs cli_graph (6 tools, 1,116 tokens) confounds
tool count, schema overhead, and selection complexity. Schema ablation on discovery
tasks (N=5) resolved the primary confound: removing sysml_inspect from
cli_graph recovered performance to 0.925 (above cli_search at 0.872), demonstrating
that the penalty was from selection confusion caused by one tool, not from schema
overhead or tool count. The remaining confound is that the ablation configuration
(5 tools) still differs from cli_search (2 tools) in tool count.
Ground truth
Created by the primary author from SysML v2 model inspection. Two corrections applied during experimentation: D16 (35.0→37.0), R5 (3→2). Structural trace scoring schema corrected in v2 (ST2/ST7 scores changed 0.542→0.865). No independent verification. Single-author ground truth is a limitation.
Holm-Bonferroni
Step-down correction at α=0.05 across 14 observations. Controls family-wise error rate.
Effect size conventions
Cohen's d: 0.2 = small, 0.5 = medium, 0.8 = large. Calibrated for behavioral science; benchmark score differences may have different practical significance.
Multiple comparison correction
14 observations tested at α=0.05 yields a family-wise error rate of about 51%. When corrected (Holm-Bonferroni step-down), no observation remains significant:
| Rank | Observation | Raw p | Holm threshold | Survives? |
|---|---|---|---|---|
| 1 | O12 (guidance) | 0.009 | 0.0036 | No |
| 2 | O8 (discovery) | 0.021 | 0.0038 | No |
| 3 | O4 (N=10) | 0.025 | 0.0042 | No |
If O4 and O12 are designated as the only two hypotheses under test (Holm-Bonferroni at m=2), both survive: O12 adjusted p=0.018, O4 adjusted p=0.025. This designation was made after seeing the data. It becomes pre-registered only if declared before collecting new data on a second corpus.
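The step-down arithmetic in the table can be reproduced directly. The p-values for the 11 null observations are not listed here, so placeholders stand in for them; any value above the rank-1 threshold gives the same outcome:

```python
def holm(pvalues: dict, alpha: float = 0.05) -> dict:
    """Holm-Bonferroni step-down: i-th smallest p is tested at alpha/(m - i)."""
    m = len(pvalues)
    ranked = sorted(pvalues.items(), key=lambda kv: kv[1])
    survives = {}
    for i, (name, p) in enumerate(ranked):
        if p > alpha / (m - i):
            # Step-down: once one test fails, all larger p-values fail too.
            survives.update({n: False for n, _ in ranked[i:]})
            break
        survives[name] = True
    return survives

# Family-wise error rate at m=14 uncorrected tests:
print(round(1 - 0.95 ** 14, 2))  # → 0.51

# Placeholder p-values for the 11 null observations (exact values not listed here).
pvals = {"O12": 0.009, "O8": 0.021, "O4": 0.025}
pvals.update({f"O_null{i}": 0.5 for i in range(11)})
print(holm(pvals)["O12"])                  # → False (0.009 > 0.05/14 ≈ 0.0036)
print(holm({"O12": 0.009, "O4": 0.025}))   # both survive at m=2
```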
Power analysis
Only one observation (O12) has enough statistical power to reliably detect its effect. The power analysis tells us exactly how large a follow-up study needs to be. This is itself a contribution: it makes the confirmatory work designable.
| Observation | Effect size (d) | Current tasks | Power | Tasks for 80% | Tasks for 80% (α=0.025) |
|---|---|---|---|---|---|
| O12 (guided graph) | 0.75 | 16 | 0.80 | 17 | 21 |
| O8 (CLI vs RAG) | 0.64 | 16 | 0.70 | 20 | 25 |
| O4 (render vs assembly) | 1.01 | 10×8 | 0.69 | 10 | 12 |
| O1 (heterogeneity) | 0.22 | 16 | 0.13 | 163 | 210 |
| O14 (graph 4-5 hops) | 0.44 | 8 | 0.19 | 42 | 53 |
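The sample sizes in the table can be approximated with the standard normal closed form n = ((z_{1-α/2} + z_{power}) / d)². This matches the table at moderate-to-large n (e.g. the O1 and O8 rows) and undershoots at small n, where the exact noncentral-t calculation adds a few tasks:

```python
from math import ceil
from statistics import NormalDist

def tasks_for_power(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Normal-approximation sample size for a two-sided paired t-test.

    n = ((z_{1-alpha/2} + z_{power}) / d) ** 2. The exact noncentral-t
    calculation adds a few tasks at small n, which is where this sketch
    and the table above diverge.
    """
    z = NormalDist().inv_cdf
    return ceil(((z(1 - alpha / 2) + z(power)) / d) ** 2)

print(tasks_for_power(0.22))  # → 163, matching the O1 row
print(tasks_for_power(0.64))  # → 20, matching the O8 row
print(tasks_for_power(0.75))  # → 14 (table reports 17 via the exact test)
```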
Six identified threats. The single-corpus limitation is the most fundamental.
T1: Single corpus. All primary observations derive from one 19-file SysML v2 model. The corpus is small enough that exhaustive search may substitute for structured traversal, potentially explaining why graph tools show no advantage. All claims are scoped to "on our benchmark corpus."
T2: Confounded tool sets (partially resolved). cli_search vs cli_graph differs in tool count, schema overhead, and selection complexity. Schema ablation resolved the primary confound: the penalty was from selection confusion caused by one tool, not schema overhead or tool count.
T3: Multiple comparisons. 14 observations at α=0.05 yields ~51% family-wise error rate. No observation survives full correction.
T4: Underpowered tests. Only O12 achieves 80% power. O4 (0.69 at N=10) and O8 (0.70) are near-adequate. Most other observations would need 30–300+ tasks to detect their effects reliably.
T5: Scoring methodology (binarization validated). ListStr binarization at 0.8 creates cliff effects, but continuous scoring confirmed the three key findings are robust. StrContains scoring for explanation tasks may be too lenient.
T6: Ground truth. Created and verified by a single author. Two corrections applied mid-experiment. No inter-rater reliability assessment.
Pick any two tool configurations and see how they compare task by task. Each line connects the same task scored under both configurations — the direction and length tell you which tool set won and by how much.
Data from sysml-bench-results · 4,726 scored runs
Three arXiv preprints in preparation from this data. Paper A argues that representation matters more than retrieval, with O4 and O12 as the designated primary hypothesis pair and a pre-registered confirmatory design on a second corpus. Paper B argues that aggregate benchmarks hide task-level structure, with O1, O10, and O8 as lead evidence and a methodological contribution on per-task analysis with paired effect sizes. Paper C presents sysml-bench itself as a community benchmark artifact for MBSE AI evaluation, analogous to SWE-bench for software engineering. All three frame the current study as exploratory. We will link them here when they are posted. A practitioner-focused version of Paper A is being submitted to GVSETS 2026.
Confirmatory study
Pre-register O4 and O12 as primary hypotheses before collecting data on a second SysML v2 corpus. Design task sets for 80% power: 20+ explanation tasks, 20+ discovery tasks. This converts the post-hoc designation into genuine pre-registration.
Completed since initial publication: N=10 replication of O4 (effect strengthened to d=1.01), continuous scoring validation (binarization does not distort key findings), and schema ablation (identified sysml_inspect as the cause of the graph tool penalty via selection confusion). Remaining: second corpus for confirmatory replication.
Community artifact
The benchmark harness, task definitions, ground truth, and scoring code are available as a community artifact for future comparison. The goal is for sysml-bench to serve the MBSE AI community the way SWE-bench serves software engineering: a shared, reproducible evaluation surface.