We release: 67,000+ trajectories from 3,800 resolved issues across 1,800+ Python repos. That is about 3x more successful trajectories and 1.5x more repos than our previous dataset. Trajectories are long: 64 turns on average, up to 100 turns and 131k tokens of context.
> RFT on this data, SWE-bench Verified: Qwen3-30B-Instruct: 25.7% → 50.3% Pass@1. Qwen3-235B-Instruct: 46.2% → 61.7% Pass@1. Also strong gains on SWE-rebench September.
> We also ran extensive evals: OpenHands with both 100-turn and 500-turn limits, comparing models under each limit, on SWE-bench Verified and several months of SWE-rebench.
> We also check the tests written by the models: how often the tests are correct, and how often the final patch passes its own tests. This yields a pool of tests for verifiers and auto-graders.
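A minimal sketch of how such a test pool might be scored. The record fields (`passes_on_gold`, `passes_on_base`, `patch_passes_own_test`) and the correctness criterion (a model-written test counts as correct if it passes on the gold patch but fails on the unpatched repo) are assumptions for illustration, not the release's actual schema:

```python
def score_tests(records):
    """Score model-written tests from per-trajectory boolean outcomes.

    records: list of dicts, one per trajectory, with hypothetical keys:
      passes_on_gold       - test passes when the gold patch is applied
      passes_on_base       - test passes on the unpatched repo
      patch_passes_own_test - the model's final patch passes its own test
    """
    # A "correct" test discriminates: passes on gold, fails on base.
    correct = [r for r in records
               if r["passes_on_gold"] and not r["passes_on_base"]]
    self_consistent = [r for r in records if r["patch_passes_own_test"]]
    n = len(records)
    return {
        "test_correct_rate": len(correct) / n,
        "self_pass_rate": len(self_consistent) / n,
    }

records = [
    {"passes_on_gold": True,  "passes_on_base": False, "patch_passes_own_test": True},
    {"passes_on_gold": True,  "passes_on_base": True,  "patch_passes_own_test": True},
    {"passes_on_gold": False, "passes_on_base": False, "patch_passes_own_test": False},
    {"passes_on_gold": True,  "passes_on_base": False, "patch_passes_own_test": True},
]
print(score_tests(records))  # {'test_correct_rate': 0.5, 'self_pass_rate': 0.75}
```

Tests that clear the discrimination bar can then be reused as cheap verifiers or auto-grader signals for new patches on the same issue.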
The geofractal getting-started guide is now available, along with bulk ablation for fusion, simple towers, oscillator capacity, and substructure systemic associative capacity. Many formulas were tested: 92 tests for collectives, oscillation bulk experiments, and more. All of them either coalesce into the correct behavior or fail in directly visible ways, which means the system is robust enough to declare some tools functionally valid but not yet scalable.
This is likely one of its final growing phases before full production capacity is ramped up. The architecture is not for novices; it is meant for experts to get ideas, borrow code, use library capacity, or simply tell AI what to do. Most files in current production carry good descriptions for AI integration.
The wide router compiler organizes similar towers into stacked, staged combinations before compiling with torch.compile. This is experimental, but it has shown speedups across multiple wide-model structures and will serve its purpose going forward.
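A minimal sketch of the "stack similar towers" idea, under the assumption that "similar" means identically shaped `nn.Linear` towers: their weights are stacked so all towers run as one batched matmul, giving the compiler a single large kernel instead of N small ones. The class name and grouping are illustrative, not the actual wide router compiler:

```python
import torch
import torch.nn as nn

class StackedTowers(nn.Module):
    """Fuse N same-shape Linear towers into one batched matmul."""

    def __init__(self, towers):
        super().__init__()
        # Stack per-tower parameters: weight (n, out, in), bias (n, out).
        self.weight = nn.Parameter(torch.stack([t.weight for t in towers]))
        self.bias = nn.Parameter(torch.stack([t.bias for t in towers]))

    def forward(self, x):
        # x: (n_towers, batch, in) -> (n_towers, batch, out) in one bmm.
        return torch.baddbmm(self.bias.unsqueeze(1), x,
                             self.weight.transpose(1, 2))

towers = [nn.Linear(8, 4) for _ in range(3)]
fused = StackedTowers(towers)
# The fused module can then be handed to torch.compile as a single unit:
# fused = torch.compile(fused)

x = torch.randn(3, 2, 8)
out = fused(x)
# Matches running each tower separately on its own input slice.
ref = torch.stack([towers[i](x[i]) for i in range(3)])
assert torch.allclose(out, ref, atol=1e-6)
```

The payoff is that torch.compile traces one batched op rather than many small per-tower ops, which is where the observed speedups with wide models would come from.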
From AI demos to production systems: what breaks when agents become autonomous?
A recurring lesson from production AI deployments is that most failures are system failures, not model failures.
As organizations move beyond pilots, challenges increasingly shift toward:
• Agent identity and permissioning
• Trust boundaries between agents and human operators
• Governance and auditability for autonomous actions
• Security treated as a first-class architectural constraint
This recent Fortune article highlights how enterprises are navigating that transition, including work with AWS's AI Innovation Lab.
Open question for the community: What architectural patterns or tooling are proving effective for managing identity, permissions, and safety in autonomous or semi-autonomous agent systems in production?