Benchmark Results
We ran 132 benchmark tasks across 4 models and 40+ experimental conditions on a SysML v2 corpus. Two patterns emerged consistently enough to stake claims on. Both are testable on a second corpus — and we intend to test them.
Representation > Retrieval
Lead evidence: O4 (d=1.01, N=10), O12 (d=0.75), O8 (d=0.64)
Ablation: graph tool penalty from one tool (sysml_inspect), not graph tools inherently
Null results that constrain the claim: O5, O6, O9, O14
Representation matters more than retrieval
For AI systems working with structured knowledge, the form in which information is presented to the model matters more than the mechanism used to find it.
Every retrieval intervention we tested produced null results:
- Vector search added nothing (O5, exact tie 0.880 vs 0.880)
- Graph traversal added nothing at 2–3 hops (O9) or 4–5 hops (O14)
- Planning tools added nothing on hard tasks (O6, +0.035)
Every representation and guidance intervention produced large effects:
- Rendering model elements into structured views nearly doubled accuracy on explanation tasks (O4, d=1.01, N=10) — and at 4× lower cost
- One sentence of tool selection guidance eliminated a 13-point penalty on discovery tasks (O12, d=0.75)
- CLI tool-based search dominated bulk context injection on discovery (O8, d=0.64)
Schema ablation identified the specific mechanism behind the apparent graph tool penalty: one tool (sysml_inspect) created a selection trap on discovery tasks. Removing it recovered performance to 0.925, above the search-only baseline. The penalty came from tool design, not from graph tools inherently.
Aggregate Benchmarks Lie
Lead evidence: O1 (per-task d: −0.400 to +0.800)
Supporting: O10 (bimodal scaling), O8 (task-type interaction), O4 (range +1.000 to −0.200)
Aggregate benchmarks hide task-level structure
Our aggregate tool comparison showed no significant difference (O1, p=0.391). A naive reading: tools don't matter. But per-task analysis revealed effect sizes ranging from −0.400 to +0.800: enormous, opposite-signed effects that cancel in the mean.
- CLI outperforms RAG on discovery by 29 points but RAG is roughly equivalent on reasoning (O8). An aggregate would hide both signals.
- Scaling collapse is bimodal: 5 of 20 tasks score 1.000 on the larger corpus while 11 of 20 score below 0.333 (O10). The mean of 0.423 describes no actual task.
- O4's render advantage ranges from +1.000 (task E4, perfect cliff at N=10) to −0.200 (task E5). The mean of +0.335 understates the wins and hides the reversal.
The methodological contribution: per-task analysis with paired effect sizes is necessary to surface real patterns in tool-augmented LLM evaluation.
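The per-task analysis can be sketched as a paired effect size (Cohen's d_z: mean of per-run differences over their SD) computed per task before any pooling. The run scores below are hypothetical, shaped like the D11/D13 pattern above, not the benchmark's actual data:

```python
from statistics import mean, stdev

def paired_d(a, b):
    """Paired Cohen's d_z: mean of per-run differences over their SD."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / stdev(diffs)

# Hypothetical per-run scores for two tasks under two tool sets,
# shaped like the D11 (search wins) / D13 (graph wins) pattern.
runs = {
    "D11": {"search": [1.0, 1.0, 0.8, 1.0, 1.0], "graph": [0.2, 0.4, 0.2, 0.0, 0.2]},
    "D13": {"search": [0.6, 0.4, 0.6, 0.8, 0.6], "graph": [1.0, 1.0, 0.8, 1.0, 1.0]},
}

for task, scores in runs.items():
    print(task, round(paired_d(scores["search"], scores["graph"]), 2))

# Pooling all runs first shrinks the signal: opposite-signed effects cancel.
all_search = [s for t in runs.values() for s in t["search"]]
all_graph = [s for t in runs.values() for s in t["graph"]]
print("pooled mean diff:", round(mean(all_search) - mean(all_graph), 2))
```

The per-task d values are large and opposite-signed, while the pooled mean difference is small, which is exactly the cancellation pattern the aggregate test cannot see.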
14 observations from the benchmark. Five are presented in detail below; expand each for the full analysis. All p-values are uncorrected; when corrected for 14 simultaneous tests, none remain significant across the full set.
O12 — Context engineering outperforms tool restriction d=0.75 · p=0.009 · power 0.80
Guided graph scores 0.887 vs 0.750 unguided, matching the 2-tool baseline (0.880).
p=0.009 (paired t, uncorrected)
d=0.75, N=16 tasks
Power: 0.80
The naive response to "too many tools hurt performance" is to restrict the tool set. The better response is a sentence in the system prompt. When agents are instructed to start with search and read_file, escalating to graph tools only when search is insufficient, the 13-point discovery penalty from over-tooling disappears entirely. Performance with 6 tools matches and marginally exceeds the 2-tool baseline (0.887 vs 0.880).
The affected tasks (D11, D12, D16, D6) are those where unguided agents select structurally complex tools for attribute-lookup tasks that search handles trivially. The agent doesn't need graph traversal to find a part's mass. It needs search. But without guidance, it reaches for the most powerful tool available, and the overhead of using it (more tokens, more turns, more opportunities to go off track) costs accuracy.
This is the only adequately powered observation in the study (power=0.80). It is also the lowest nominal p-value (0.009). If we had to pick one finding to bet on replicating, this would be it.
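The exact guidance sentence is not quoted in this writeup, so the snippet below is an illustrative paraphrase of the described policy (start with search and read_file, escalate only when search is insufficient), not the prompt actually used:

```python
# Illustrative paraphrase only: the actual guidance sentence used in O12
# is not quoted in this writeup.
GUIDANCE = (
    "Start with search and read_file; escalate to the graph tools "
    "(trace, check, query, inspect) only when search is insufficient."
)

def build_system_prompt(base_prompt: str, guided: bool) -> str:
    """Append the ~50-token tool-selection hint when running a +guided config."""
    return base_prompt + ("\n\n" + GUIDANCE if guided else "")
```

The point of the sketch is the cost asymmetry: the +guided configurations change roughly 50 schema tokens of prompt, not the tool set itself.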
O4 — Pre-rendered views outperform agent-assembled context d=1.01 · p=0.025 · 4× cheaper
| Config | Score | Cost |
|---|---|---|
| Pre-rendered | 0.893 | $6.23 |
| Agent-assembled | 0.558 | $24.76 |
E4: perfect 1.0→0.0 cliff across all 10 runs.
p=0.025 (t-test), p=0.031 (Wilcoxon)
d=1.01, N=10 runs × 8 tasks
On explanation tasks (N=10 runs per task), pre-rendered model views scored 0.893 vs 0.558 for letting the agent assemble its own context. A 34-point gap. The effect strengthened at N=10 (d=1.01, up from d=0.83 at N=5). Task E4 shows a perfect 1.0→0.0 cliff: rendering enables it entirely, graph assembly cannot solve it across 10 attempts.
The advantage is explanation-specific. On discovery tasks, pre-rendering scored 0.719, worse than search (0.880). Pre-rendering the wrong view adds noise, not signal. This matters: it means pre-rendering is not a universal improvement. It is a task-dependent one, and the task type determines whether it helps or hurts.
The cost difference is striking: $6.23 total for 10 render runs (3.8 avg turns) vs $24.76 for 10 graph runs (8.7 avg turns) — 4× cheaper. The pre-rendered view does the work at index time that the agent would otherwise do at query time, and it does it once instead of per-query.
This has the largest effect size in the study (d=1.01) with near-adequate power (0.69 at N=10, needs 10 tasks for 80%). The effect strengthened with more data, which is what you want to see from a real signal.
E1–E8 explanation tasks · render vs graph · N=10 · E5 is the reversal task
O8 — Retrieval strategy interacts with task type d=0.64 · p=0.021
| Task type | CLI | RAG |
|---|---|---|
| Discovery | 0.855 | 0.566 |
| Reasoning | 0.323 | 0.459 |
Discovery: p=0.021 (paired t), d=0.64
Reasoning: p=0.403 (not significant)
CLI tool-based search dominated on discovery tasks (+29 points over RAG, p=0.021, d=0.64, N=16 tasks). RAG edged ahead on cross-file reasoning (+14 points, p=0.403, not significant), likely because it injects all relevant context at once, avoiding the failure mode where the agent runs out of turns before it can chain together enough tool calls to answer multi-step questions.
The CLI advantage on discovery is driven by 5 tasks where RAG scores 0.000: tasks requiring iterative tool-mediated retrieval that single-shot context injection cannot perform. The model needs to search, read a result, search again based on what it found, and repeat. RAG gives it everything at once, which is the right coverage but the wrong format for these tasks.
Neither retrieval architecture is universally better. This suggests that the right approach is not to pick one, but to route queries to the right strategy based on task type.
Discovery: CLI +29 pts over RAG (p=0.021) · Reasoning: no significant difference
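That routing idea can be sketched directly from the O8 per-task-type means in the table above. Classifying an incoming query into a task type is assumed to happen upstream and is the genuinely hard part; this sketch only shows the selection step:

```python
# O8 per-task-type mean scores from this writeup.
SCORES = {
    "discovery": {"cli": 0.855, "rag": 0.566},
    "reasoning": {"cli": 0.323, "rag": 0.459},
}

def route(task_type: str) -> str:
    """Pick the retrieval strategy with the higher observed mean score."""
    by_strategy = SCORES[task_type]
    return max(by_strategy, key=by_strategy.get)

print(route("discovery"))  # → cli
print(route("reasoning"))  # → rag
```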
O1 — Tool-task interaction is heterogeneous aggregate p=0.391 · per-task d up to 0.80
| Task | Search | Graph |
|---|---|---|
| D11 | 1.000 | 0.200 |
| D6 | 1.000 | 0.400 |
| D13 | 0.600 | 1.000 |
| D10 | 0.700 | 1.000 |
Aggregate: p=0.391 (not significant). Per-task: up to 0.80 difference.
Graph tools appeared to hurt discovery tasks, help layer tasks, and stay near-neutral on reasoning. The aggregate difference is not statistically significant (paired t-test p=0.391, N=16) because the effect is task-dependent. Schema ablation on discovery tasks (N=5) later revealed that the discovery penalty was caused by one tool (sysml_inspect) creating a selection trap, not by graph tools inherently. With that tool removed, the remaining graph configuration scored 0.925 vs 0.872 for search-only on discovery tasks.
The pattern holds across all four models tested, making it one of the most robust qualitative observations in the benchmark despite the null aggregate test. This is the lead evidence for Thesis 2: the aggregate null is not "tools don't matter." It is "tools matter enormously, but in opposite directions on different tasks, and the average hides everything interesting."
D1–D16 discovery tasks · search vs graph · Sonnet · opacity ∝ magnitude · aggregate p=0.391
O10 — Corpus scale is the dominant difficulty factor 0.880 → 0.423 at 5× scale
| Corpus | Score |
|---|---|
| 19 files | 0.880 |
| 95 files, search | 0.423 |
| 95 files, graph | 0.389 |
| 95 files, +vectors | 0.409 |
Failure modes: 55% budget exhaustion, 27% reasoning errors, 0% search failure.
Performance roughly halves from 19 to 95 files (0.880 to 0.423). At scale, additional tools did not help — the bottleneck is reasoning depth and turn budget, not retrieval quality. 11 of 20 scaling tasks fall below 0.333. The distribution is bimodal: easy tasks remain easy, hard tasks become impossible.
The failure mode is revealing: 55% of the time the agent ran out of turns before finishing. 27% were reasoning errors. 0% were search failures. The agent can find the information. It just can't process enough of it within the turn budget to reach the right answer. This suggests the path forward is better orchestration rather than better search.
This is the observation that keeps us honest. Our other results come from a 19-file corpus. Real engineering repositories are hundreds or thousands of files. The scaling problem is unsolved by any method we tested.
20 scaling tasks (95-file corpus) · graph score · bimodal: 6× at 1.000, 11× at 0.000 · dashed line = mean
Remaining observations summarized. Full details in the benchmark repository.
Other observations
| ID | Summary | Classification |
|---|---|---|
| O2 | Model quality gap: Sonnet consistently outperformed OpenAI models | Descriptive |
| O3 | o3-mini is the only model where graph tools help on reasoning (+0.056) | Exploratory (power=0.08) |
| O5 | Vector search: exact tie with keyword search on small corpus (0.880 vs 0.880) | Null |
| O6 | Planning tools (sysml_stat, sysml_plan): +0.035 on hard tasks, not significant | Null |
| O7 | RFLP layer tasks: cli_full showed slight advantage (~0.25 effect) | Exploratory |
| O9 | Graph tools at 2–3 hops: no benefit (d=0.16, power=0.07) | Null |
| O11 | Turn budget is a partial bottleneck but not the whole story | Descriptive |
| O13 | Few-shot examples hurt mini models (GPT-4o-mini, o3-mini) | Exploratory |
| O14 | Graph tools at 4–5 hops: not significant (d=0.44, power=0.19) | Null (underpowered) |
The observations above are our interpretation of the data. Below is the data itself: 4,700+ scored benchmark runs across 4 models and 12 tool configurations. Filter, compare, and draw your own conclusions.
Score heatmap — tasks × tool sets
Data from sysml-bench-results · 4,726 scored runs · gitlab.com/nomograph/sysml-bench-results
Corpus
Eve Online Mining Frigate SysML v2 model. 19 files, 798 elements, 1,515 relationships.
Models
Claude Sonnet 3.5, GPT-4o, GPT-4o-mini, o3-mini.
Replication
N=3 exploratory sweeps. N=5–10 for key comparisons (O4 at N=10, O1/O12 at N=5). Temperature 0.3 for all CLI runs.
Caveat
All results exploratory. None survive multiple comparison correction across 14 observations.
sysml-bench is an exploratory benchmark evaluating how tool-augmented LLMs perform on structured engineering tasks in SysML v2. 132 tasks across 8 categories (discovery, reasoning, explanation, layer, boundary, vector-sensitive, structural trace, corpus scaling), tested with 4 models and 40+ experimental conditions.
The study generated 14 observations. Three achieved nominal statistical significance. None survive correction for running 14 tests simultaneously. It is a well-characterized exploratory study that identifies patterns and estimates effect sizes. The contribution is the benchmark methodology, the identification of task-tool interaction as a key variable, and the effect size estimates that make confirmatory follow-up designable.
Scoring
Per-field structured scoring. Each task defines expected fields with typed scorers: Bool (exact match), Float (numeric within tolerance), Str (exact string match), StrContains (case-insensitive substring), ListStr (F1 score with 0.8 threshold, binarized). Task score = mean of field scores. Condition score = mean of task scores across N runs.
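The scorer types can be sketched as follows. The Float tolerance value is an assumption (the harness does not publish it here), and the actual implementation may normalize strings differently:

```python
def score_bool(expected: bool, got: bool) -> float:
    return 1.0 if expected == got else 0.0

def score_float(expected: float, got: float, tol: float = 0.01) -> float:
    # Tolerance value is assumed; the harness does not publish it here.
    return 1.0 if abs(expected - got) <= tol else 0.0

def score_str_contains(expected: str, got: str) -> float:
    """Case-insensitive substring match."""
    return 1.0 if expected.lower() in got.lower() else 0.0

def score_list_str(expected: list, got: list, threshold: float = 0.8) -> float:
    """F1 between expected and predicted string sets, binarized at 0.8."""
    exp, pred = set(expected), set(got)
    tp = len(exp & pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(exp)
    f1 = 2 * precision * recall / (precision + recall)
    return 1.0 if f1 >= threshold else 0.0

def task_score(field_scores: list) -> float:
    """Task score = mean of field scores."""
    return sum(field_scores) / len(field_scores)
```

The binarization in `score_list_str` is what creates the cliff effects discussed under threat T5: a prediction recovering 3 of 5 expected items has F1 = 0.75 and scores 0.0.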
Tool sets
| Tool Set | Tools | Schema Tokens | Description |
|---|---|---|---|
| cli_search | 2 | 268 | search + read_file |
| cli_graph | 6 | 1,116 | search + trace + check + query + inspect + read_file |
| cli_render | 7 | 1,303 | cli_graph + sysml_render |
| cli_full | 9 | 1,485 | cli_render + sysml_stat + sysml_plan |
| +guided | varies | +~50 | System prompt with tool selection hint |
| +vectors | varies | +0 | Adds fastembed HNSW vector index |
Known confounds (partially resolved)
cli_search (2 tools, 268 tokens) vs cli_graph (6 tools, 1,116 tokens) confounds
tool count, schema overhead, and selection complexity. Schema ablation on discovery
tasks (N=5) resolved the primary confound: removing sysml_inspect from
cli_graph recovered performance to 0.925 (above cli_search at 0.872), demonstrating
that the penalty was from selection confusion caused by one tool, not from schema
overhead or tool count. The remaining confound is that the ablation configuration
(5 tools) still differs from cli_search (2 tools) in tool count.
Ground truth
Created by the primary author from SysML v2 model inspection. Two corrections applied during experimentation: D16 (35.0→37.0), R5 (3→2). Structural trace scoring schema corrected in v2 (ST2/ST7 scores changed 0.542→0.865). No independent verification. Single-author ground truth is a limitation.
Holm-Bonferroni
Step-down correction at α=0.05 across 14 observations. Controls family-wise error rate.
Effect size conventions
Cohen's d: 0.2 = small, 0.5 = medium, 0.8 = large. Calibrated for behavioral science; benchmark score differences may have different practical significance.
Multiple comparison correction
14 observations tested at α=0.05 yields a family-wise error rate of about 51%. When corrected (Holm-Bonferroni step-down), no observation remains significant:
| Rank | Observation | Raw p | Holm threshold | Survives? |
|---|---|---|---|---|
| 1 | O12 (guidance) | 0.009 | 0.0036 | No |
| 2 | O8 (discovery) | 0.021 | 0.0038 | No |
| 3 | O4 (N=10) | 0.025 | 0.0042 | No |
If O4 and O12 are designated as the only two hypotheses under test (Holm-Bonferroni at m=2), both survive: O12 adjusted p=0.018, O4 adjusted p=0.025. This designation was made after seeing the data. It becomes pre-registered only if declared before collecting new data on a second corpus.
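The step-down arithmetic in the table can be reproduced directly. The p-values for the 11 null observations are not listed here, so placeholders stand in for them; any value above the rank-1 threshold gives the same outcome:

```python
def holm(pvalues: dict, alpha: float = 0.05) -> dict:
    """Holm-Bonferroni step-down: i-th smallest p is tested at alpha/(m - i)."""
    m = len(pvalues)
    ranked = sorted(pvalues.items(), key=lambda kv: kv[1])
    survives = {}
    for i, (name, p) in enumerate(ranked):
        if p > alpha / (m - i):
            # Step-down: once one test fails, all larger p-values fail too.
            survives.update({n: False for n, _ in ranked[i:]})
            break
        survives[name] = True
    return survives

# Family-wise error rate at m=14 uncorrected tests:
print(round(1 - 0.95 ** 14, 2))  # → 0.51

# Placeholder p-values for the 11 null observations (exact values not listed here).
pvals = {"O12": 0.009, "O8": 0.021, "O4": 0.025}
pvals.update({f"O_null{i}": 0.5 for i in range(11)})
print(holm(pvals)["O12"])                  # → False (0.009 > 0.05/14 ≈ 0.0036)
print(holm({"O12": 0.009, "O4": 0.025}))   # both survive at m=2
```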
Power analysis
Only one observation (O12) has enough statistical power to reliably detect its effect. The power analysis tells us exactly how large a follow-up study needs to be. This is itself a contribution: it makes the confirmatory work designable.
| Observation | Effect size (d) | Current tasks | Power | Tasks for 80% | Tasks for 80% (α=0.025) |
|---|---|---|---|---|---|
| O12 (guided graph) | 0.75 | 16 | 0.80 | 17 | 21 |
| O8 (CLI vs RAG) | 0.64 | 16 | 0.70 | 20 | 25 |
| O4 (render vs assembly) | 1.01 | 10×8 | 0.69 | 10 | 12 |
| O1 (heterogeneity) | 0.22 | 16 | 0.13 | 163 | 210 |
| O14 (graph 4-5 hops) | 0.44 | 8 | 0.19 | 42 | 53 |
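The sample sizes in the table can be approximated with the standard normal closed form n = ((z_{1-α/2} + z_{power}) / d)². This matches the table at moderate-to-large n (e.g. the O1 and O8 rows) and undershoots at small n, where the exact noncentral-t calculation adds a few tasks:

```python
from math import ceil
from statistics import NormalDist

def tasks_for_power(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Normal-approximation sample size for a two-sided paired t-test.

    n = ((z_{1-alpha/2} + z_{power}) / d) ** 2. The exact noncentral-t
    calculation adds a few tasks at small n, which is where this sketch
    and the table above diverge.
    """
    z = NormalDist().inv_cdf
    return ceil(((z(1 - alpha / 2) + z(power)) / d) ** 2)

print(tasks_for_power(0.22))  # → 163, matching the O1 row
print(tasks_for_power(0.64))  # → 20, matching the O8 row
print(tasks_for_power(0.75))  # → 14 (table reports 17 via the exact test)
```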
Six identified threats. The single-corpus limitation is the most fundamental.
T1: Single corpus. All primary observations derive from one 19-file SysML v2 model. The corpus is small enough that exhaustive search may substitute for structured traversal, potentially explaining why graph tools show no advantage. All claims are scoped to "on our benchmark corpus."
T2: Confounded tool sets (partially resolved). cli_search vs cli_graph differs in tool count, schema overhead, and selection complexity. Schema ablation resolved the primary confound: the penalty was from selection confusion caused by one tool, not schema overhead or tool count.
T3: Multiple comparisons. 14 observations at α=0.05 yields ~51% family-wise error rate. No observation survives full correction.
T4: Underpowered tests. Only O12 achieves 80% power. O4 (0.69 at N=10) and O8 (0.70) are near-adequate. Most other observations would need 30–300+ tasks to detect their effects reliably.
T5: Scoring methodology (binarization validated). ListStr binarization at 0.8 creates cliff effects, but continuous scoring confirmed the three key findings are robust. StrContains scoring for explanation tasks may be too lenient.
T6: Ground truth. Created and verified by a single author. Two corrections applied mid-experiment. No inter-rater reliability assessment.
Pick any two tool configurations and see how they compare task by task. Each line connects the same task scored under both configurations — the direction and length tell you which tool set won and by how much.
Data from sysml-bench-results · 4,726 scored runs
Three arXiv preprints in preparation from this data. Paper A argues that representation matters more than retrieval, with O4 and O12 as the designated primary hypothesis pair and a pre-registered confirmatory design on a second corpus. Paper B argues that aggregate benchmarks hide task-level structure, with O1, O10, and O8 as lead evidence and a methodological contribution on per-task analysis with paired effect sizes. Paper C presents sysml-bench itself as a community benchmark artifact for MBSE AI evaluation, analogous to SWE-bench for software engineering. All three frame the current study as exploratory. We will link them here when they are posted. A practitioner-focused version of Paper A is being submitted to GVSETS 2026.
Confirmatory study
Pre-register O4 and O12 as primary hypotheses before collecting data on a second SysML v2 corpus. Design task sets for 80% power: 20+ explanation tasks, 20+ discovery tasks. This converts the post-hoc designation into genuine pre-registration.
Completed since initial publication: N=10 replication of O4 (effect strengthened to d=1.01), continuous scoring validation (binarization does not distort key findings), and schema ablation (identified sysml_inspect as the cause of the graph tool penalty via selection confusion). Remaining: second corpus for confirmatory replication.
Community artifact
The benchmark harness, task definitions, ground truth, and scoring code are available as a community artifact for future comparison. The goal is for sysml-bench to serve the MBSE AI community the way SWE-bench serves software engineering: a shared, reproducible evaluation surface.