Results

What we've measured so far. All data is open and all experiments are designed for independent replication.

sysml-bench
14 observations

AI on SysML v2 Model Comprehension

tasks: 132
models: 4
conditions: 40+
replications: 3-10

Key finding

Representation dominates retrieval.

Pre-rendered views scored 0.893 versus 0.558 for agent-assembled context (effect size d=1.01). Tool guidance eliminated a 13-point penalty. Vector search and graph traversal produced null results.
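To make the effect size concrete, here is a minimal sketch of how a standardized mean difference is commonly computed. It assumes that d denotes Cohen's d with a pooled standard deviation, which the summary above does not state. The function name `cohens_d` and the `implied_pooled_sd` figure are illustrative; the implied value is back-derived from the rounded numbers above and is not a reported result.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d with a pooled sample standard deviation (assumed definition)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


# Figures taken from the summary above.
pre_rendered = 0.893
agent_assembled = 0.558
reported_d = 1.01

# Under the pooled-SD assumption, d = gap / pooled_sd, so the reported
# numbers imply a pooled standard deviation of roughly gap / d.
gap = pre_rendered - agent_assembled
implied_pooled_sd = gap / reported_d

print(f"gap = {gap:.3f}, implied pooled SD = {implied_pooled_sd:.3f}")
```

A gap of about a third of the score range with a pooled spread of similar size is what a d near 1 means in practice: the two conditions differ by roughly one standard deviation.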

gkg-bench
simulation

GitLab Knowledge Graph for SDLC Queries

fixtures: 62
models: 3
conditions: 5
replications: 10-20

Key finding (simulation)

GKG improves accuracy over baseline by 77% in relative terms (+21pp absolute).

Sonnet 4, n=20, 31 test fixtures. Multi-hop queries show the largest effect. Worked examples in tool descriptions are essential: accuracy is 0% without them. Runs against a DuckDB simulation, not production GKG.
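The headline pairs a relative gain (77%) with an absolute gain (21 percentage points). As a back-of-the-envelope sketch, the two figures together imply approximate accuracy levels, assuming the 77% is measured relative to the baseline accuracy. The implied values below are derived from the rounded headline numbers and are not reported results; the variable names are illustrative.

```python
# Figures taken from the headline above.
relative_gain = 0.77      # 77% improvement relative to baseline (assumed meaning)
absolute_gain_pp = 21.0   # +21 percentage points

# relative_gain = absolute_gain / baseline, so baseline = absolute_gain / relative_gain.
implied_baseline_pp = absolute_gain_pp / relative_gain
implied_with_gkg_pp = implied_baseline_pp + absolute_gain_pp

print(f"implied baseline accuracy: ~{implied_baseline_pp:.1f}%")
print(f"implied accuracy with GKG: ~{implied_with_gkg_pp:.1f}%")
```

Because both inputs are rounded, the implied levels are only accurate to within a point or so.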

lever canary
pilot

Feedback Signal Properties for LLM Code Repair

tasks: 8
models: 4
treatments: 4
replications: 1

Pilot finding

Precision matters more than brevity.

Naming what failed and what was expected keeps accuracy up. Brief feedback uses 47% fewer tokens at the same accuracy. Current-generation models hit the ceiling on these tasks, so more runs are needed.
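As an illustration of what "naming what failed and what was expected" could look like, here is a hypothetical sketch of a precise but brief feedback message next to a vague one. The function names and message formats are assumptions made for this example and are not the pilot's actual treatments.

```python
def precise_feedback(test_name, expected, actual):
    """Hypothetical precise signal: names the failing check and the expected value."""
    return f"FAILED {test_name}: expected {expected!r}, got {actual!r}"


def vague_feedback():
    """Hypothetical imprecise signal: reports failure without saying what or why."""
    return "Some tests failed."


# Example usage with a made-up failing test.
message = precise_feedback("test_add", 4, 5)
print(message)
print(vague_feedback())
```

The precise message stays on a single line, which is one way a signal can be both specific and short.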