An AI agent built for scientific discovery workflows just claimed #1 on several problems from Google DeepMind's AlphaEvolve companion paper.
TL;DR. Organon is a skill-first agentic operating system for scientific work, built on top of Claude Code. It spent a week on the Einstein Arena, an open benchmark of mathematical construction problems derived from the AlphaEvolve companion paper. As of writing, Organon holds three live #1 ranks on the public leaderboard (First Autocorrelation Inequality, Third Autocorrelation Inequality, and the Prime Number Theorem) and a live #2 on Kissing Number in Dimension 12. Four additional solutions (Thomson Problem at N=282, Second Autocorrelation Inequality, Erdős Minimum Overlap, and Hexagon Packing in a Hexagon n=12) score strictly better than the current public #1 by 1.46e-11, 2.23e-10, 1.16e-10, and 1.13e-6 respectively, but the arena's per-problem minImprovement evaluation gate drops them silently before they reach the leaderboard. Under a sealed-sandbox ablation, the same base model (Opus 4.7) run without Organon plateaus 2.07e-3 below Organon's score on the Prime Number Theorem, nearly 40× the margin separating Organon from the next public agent. The gap is the orchestration: composable skills, cross-session memory, adversarial multi-persona councils, and a human in the loop at every decision gate.

Abstract

We describe Organon, a CLI-based, agent-first operating system that wraps a frontier large language model in three concentric layers of state: a persistent agent identity (personality, user profile, cross-session memory, and a running learnings journal), a pack of composable skills organized by scientific workflow stage, and a research-context substrate that binds outputs to a researcher's actual papers, preferences, and active questions. The system runs entirely on Claude Code with Anthropic's Claude Opus 4.7 as the base model, and exposes every capability through natural language, with slash commands and skills as triggers rather than a bespoke application surface.
We report results on the Einstein Arena, a public benchmark of open mathematical construction problems derived from the companion paper to Google DeepMind's AlphaEvolve (Novikov et al., 2025; Georgiev et al., 2025). At the time of this writing, Organon holds three live #1 ranks (First and Third Autocorrelation Inequality, Prime Number Theorem), a live #2 on Kissing Number in d = 12 with an integer-841 impossibility proof, and four additional solutions (Thomson Problem at N = 282, Hexagon Packing in a Hexagon n = 12, Second Autocorrelation Inequality, Erdős Minimum Overlap) whose raw scores strictly beat the current public #1 by margins of 1.46e-11 to 1.13e-6 but were dropped silently by the arena's per-problem minImprovement gate. On the Prime Number Theorem problem we also report a sealed-sandbox ablation in which the same Opus 4.7 model without Organon's orchestration plateaus at S = 0.9928, below Organon's S = 0.9949 by a gap nearly 40× the margin separating Organon from the next public agent. We interpret this "last-mile" gap as the measurable contribution of composable-skill orchestration, plus a human in the loop, over a raw frontier LLM operating in isolation. The paper closes with a discussion of where Organon sits relative to AlphaEvolve, FunSearch (Romera-Paredes et al., 2024), and the AI-Scientist pipeline (Lu et al., 2024; Yamada et al., 2025), and what a skill-first agent OS can offer the broader agentic-scientific-computing ecosystem that pure evolutionary or pure end-to-end pipelines cannot.

Keywords. Agentic AI, multi-agent systems, composable skills, large language models, mathematical optimization, scientific discovery, Einstein Arena, AlphaEvolve.

1. Introduction

Two distinct styles of LLM-powered scientific system have emerged in the last eighteen months.
On one side are evolutionary search systems such as DeepMind's FunSearch (Romera-Paredes et al., 2024) and its successor AlphaEvolve (Novikov et al., 2025), which pair a base LLM with a programmatic evaluator and run an island-based evolutionary loop over code. These systems have produced genuine mathematical discoveries, most visibly a 48-multiplication algorithm for the 4 × 4 complex-matrix product that improves on Strassen, and a new lower bound for the kissing number in dimension 11 (Novikov et al., 2025). On the other side are end-to-end pipeline systems such as Sakana AI's AI Scientist (Lu et al., 2024; Yamada et al., 2025), which chain idea generation, coding, experimentation, and manuscript drafting into a single autonomous run that produces publication-like artifacts.

Both styles assume, implicitly, that the base LLM is the unit of capability. Evolutionary systems wrap the LLM in a search loop; pipeline systems wrap the LLM in a plan-act-reflect loop. In both cases the scaffolding is specialized to a narrow task shape, and most of the user-visible state is recreated per run.

This paper describes a third design point. Organon is an agent-first operating system in the sense that it does not start from a task and construct an agent around it. It starts from a persistent agent identity and a library of composable skills, and lets the researcher compose a capability at the moment a problem is stated. There is no central planner that decides whether the user is doing literature review, statistical analysis, hypothesis generation, manuscript drafting, science communication, or mathematical optimization. Each skill advertises trigger phrases; the routing cascade in CLAUDE.md matches the user's request, the skill runs, it appends feedback to a shared learnings journal, and the next session inherits that state. The same skill stack that produced the Einstein Arena results below is what supports the day-to-day work of being a scientist.
A typical week routes a researcher through several of: a parallel-fanout literature search across PubMed, arXiv, OpenAlex, Semantic Scholar, and an 8-million-paper full-text biomedical corpus through PaperClip; a trend scan across preprint servers and field-specific community channels for what is moving recently; loading CSV/Excel/Parquet data, running assumption-checked t-tests/ANOVAs/regressions, and producing publication-quality figures; generating mechanistic hypotheses for an observed pattern under an adversarial multi-persona research council and designing falsifiable follow-up experiments with explicit power analyses; drafting and peer-reviewing a manuscript with strict citation discipline (no claim without a backing source); adapting the same evidence into blog posts, threads, lay summaries, tutorials, slide decks, and pitch decks for different audiences; and producing the scientific illustrations, mermaid/architecture diagrams, hand-drawn sketchnotes, and matplotlib/SciencePlots figures that those documents need. The Einstein Arena challenges are a stress test of that same stack on a specific domain (mathematical construction) the system was never explicitly designed for. The case studies below are evidence that the stack holds up; the scientist-support workflow is what the stack exists for.

The specific claims of this paper are as follows.

Claim 1 (architecture). A three-layer split of identity, skills, and research context is sufficient to support open-ended scientific workflows while keeping the per-session context window small enough for a frontier model to reason over. Section 3 describes the layers and their reconciliation rules.

Claim 2 (self-evolution). Extracting patterns from a running learnings journal into first-class skills produces measurable cross-problem transfer. Section 4 shows the flywheel between attempts, observations, patterns, and skills, and gives a concrete worked example of a mistake-driven learning that became a guardrail.
Claim 3 (empirical). On the Einstein Arena, a public 19-problem benchmark of open mathematical construction challenges, Organon currently holds three live top ranks, one live #2, and four further top-scoring solutions that the arena evaluator did not accept because of its minImprovement threshold. Section 5 summarizes the portfolio.

Claim 4 (ablation). Under a sealed-sandbox protocol with no Organon skills, no council, no memory, no arena API, and no web access, the identical Opus 4.7 base model climbs the Prime Number Theorem problem to S = 0.9928 in five hours. Organon's cross-session state reaches S = 0.9949 with the same base model. The 2.07e-3 gap is nearly 40× the margin separating Organon from the next public agent. Section 6 details the protocol and the caveats.

Claim 5 (positioning, human-in-the-loop). Organon is not a competitor to AlphaEvolve; it is a complement, and it is deliberately not fully autonomous. Evolutionary-search systems are strongest when the answer is a short program to be discovered. Skill-composition systems are strongest when the path to the answer is a sequence of heterogeneous scientific operations with human-checkpointed decision gates. The fusion of human judgment with agentic orchestration, rather than full autonomy, is what produced every result reported here, and is arguably the more durable shape of agentic scientific discovery. Section 7 discusses the tradeoff.

The remainder of the paper is organized as follows. Section 2 situates Organon among related agentic-scientific-computing systems. Section 3 describes the three-layer architecture and the broader scientific workflow it supports beyond mathematical construction. Section 4 explains the self-evolution mechanism. Section 5 walks through the Einstein Arena portfolio. Section 6 reports the sealed-sandbox ablation. Section 7 positions Organon against AlphaEvolve, FunSearch, and the AI-Scientist pipeline. Section 8 sketches the roadmap.
2. Background and Related Work

2.1 AlphaEvolve

AlphaEvolve (Novikov et al., 2025) is an evolutionary coding agent from Google DeepMind that maintains a population of programs, samples parents, asks a Gemini-2.0-Flash or Gemini-2.0-Pro model to propose mutations, evaluates each child against an automated scoring function, and inserts survivors back into the population. It reports first-of-their-kind results across algorithm discovery, matrix multiplication, and geometric packing. Relevantly for this paper, it raised the public lower bound on the kissing number in dimension 11 from Ganzhinov's 592 to 593, setting the reference that any Einstein Arena submission on a related problem must respect (Novikov et al., 2025).

2.2 FunSearch

FunSearch (Romera-Paredes et al., 2024), AlphaEvolve's direct predecessor, couples an LLM with a systematic evaluator to search in function space rather than solution space. Its cap-set result was the first LLM-discovered piece of verifiable mathematical knowledge, and its interpretability advantage (it returns a program, not a numeric solution) carries over to downstream applications such as online bin-packing heuristics.

2.3 The AI-Scientist pipeline

Sakana AI's AI-Scientist (Lu et al., 2024) and its successor v2 (Yamada et al., 2025) automate a full research loop (idea generation, code writing, experiment execution, result summarization, figure generation, manuscript drafting, and simulated peer review). Three v2 manuscripts were submitted to a peer-reviewed ICLR workshop and one cleared the human-acceptance threshold.

2.4 Tool-use agents and skill libraries

ReAct (Yao et al., 2023) and Toolformer (Schick et al., 2023) established the reasoning-acting pattern that every current agent inherits.
Voyager (Wang et al., 2024) demonstrated open-ended lifelong skill learning in Minecraft through an automatically growing skill library; Organon borrows its skills-as-first-class-citizens pattern but adapts it to scientific workflows, where skills are human-curated rather than machine-synthesized.

2.5 The niche Organon occupies

Organon's closest structural analogues are Voyager (skill library) and AI-Scientist (workflow chain). Its closest behavioral analogue, however, is a human research assistant with a well-organized notebook. It differs from AlphaEvolve and FunSearch in that it does not search in function space; it invokes solvers and search recipes as skills and, if needed, can develop new ones. It differs from AI-Scientist in that it is not a closed end-to-end loop; the human remains in charge at decision gates.

3. Architecture

Organon is organized into three layers, each with its own storage, update rule, and reconciliation policy. Figure 2 shows the layering; the verbal description follows.

3.1 Layer 1: Agent Identity

The identity layer holds four files. SOUL.md defines non-negotiable personality and scientific standards (be helpful, not performative; have scientific opinions; preserve hedging language; report effect sizes alongside p-values). USER.md captures the researcher's name, affiliation, career stage, and research focus. The context/memory/ directory contains one file per day, with numbered Session N blocks appended as new sessions start; sessions log goal, deliverables, decisions, and open threads. context/learnings.md is the accumulated long-term memory: a "What works well / What doesn't" journal plus per-skill sections that persist across all sessions.

The identity layer is loaded by a heartbeat routine at the start of every session. The heartbeat is a four-step procedure defined in CLAUDE.md: load identity, load research context, scan state, run the /lets-go entry-point skill. Nothing further happens until the heartbeat has completed.
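In code, the heartbeat contract might look like the following sketch. The file names follow the layout above, but the function and its return shape are hypothetical, for illustration only:

```python
from pathlib import Path

# Hypothetical sketch of the four-step heartbeat; the file layout follows the
# description above, but the logic is illustrative, not Organon's actual code.
IDENTITY_FILES = ["SOUL.md", "USER.md", "context/learnings.md"]

def heartbeat(root: Path) -> dict:
    state = {}
    # Step 1: load identity files (a missing file degrades to empty, not an error).
    for name in IDENTITY_FILES:
        p = root / name
        state[name] = p.read_text() if p.exists() else ""
    # Step 2: load research context.
    ctx = root / "research_context"
    state["context"] = ({p.name: p.read_text() for p in ctx.glob("*.md")}
                        if ctx.exists() else {})
    # Step 3: scan state -- here, just list the daily memory files.
    mem = root / "context" / "memory"
    state["memory_days"] = sorted(p.name for p in mem.glob("*.md")) if mem.exists() else []
    # Step 4: the /lets-go entry-point skill would run here.
    return state
```

The graceful-degradation behavior (missing files become empty state rather than errors) mirrors the contract described for the research-context layer below.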
3.2 Layer 2: Skills Pack

A skill is a self-contained folder under .claude/skills/{category}-{skill-name}/ with a fixed structure (a YAML-fronted SKILL.md, depth references, executable scripts, and assets). Skills are organized by category prefix: sci (scientific workflow), viz (diagrams and illustrations), ops (scheduling and compute primitives), tool (utility and integration), meta (self-improvement). At the time of writing Organon ships 30 skills across these categories. The scientific workflow skills span the daily research loop end-to-end:

- sci-literature-research runs parallel fan-out searches across PubMed, arXiv, OpenAlex, and Semantic Scholar, with full-text fetch through the PaperClip biomedical corpus.
- sci-trending-research mines what is currently moving in a researcher's field across publications and social channels.
- sci-data-analysis loads CSV/Excel/Parquet, runs t-tests and ANOVAs with assumption checks, and produces publication-quality figures.
- sci-hypothesis proposes mechanistic explanations for an observed pattern and designs falsifiable follow-ups, with a multi-persona council (Gauss, Erdős, Tao plus domain experts) for adversarial review.
- sci-writing drafts and reviews manuscripts with strict citation discipline.
- sci-communication adapts the same evidence into blog posts, threads, newsletters, tutorials, and press releases.
- viz-nano-banana, viz-diagram-code, viz-excalidraw-diagram, and viz-presentation generate illustrations, mermaid/flowcharts, sketchnotes, and slide decks.
- sci-research-mgmt, sci-tools, and sci-optimization-recipes index research notes, browse external tool catalogs, and curate reusable optimization recipes.

Skills are independently versioned, self-contained, and intentionally stateless between invocations. Persistent state lives in the identity layer or in the research context, not in the skill.
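A SKILL.md front matter of the kind described above might look like this; every field name here is a hypothetical illustration of the trigger/context contract, not Organon's actual schema:

```yaml
# Hypothetical SKILL.md frontmatter (illustrative field names only).
---
name: sci-data-analysis
category: sci
triggers:
  - "run a t-test"
  - "analyze this csv"
reads_context:          # consulted at invocation; absent files degrade gracefully
  - research_context/research-profile.md
  - research_context/research-preferences.md
scripts:
  - scripts/run_analysis.py
---
```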
Routing from a user request to a skill follows a four-step cascade defined in CLAUDE.md:

1. Match the user's phrasing against skill trigger phrases (direct match).
2. If no match, check whether an adjacent installed skill can handle the task with different inputs.
3. If not, invoke ToolUniverse, the open-source biomedical tool catalog from Prof. Marinka Zitnik's research team at Harvard (Gao et al., 2025); Organon ships a thin wrapper over its registry of more than 2,000 machine-learning models, datasets, APIs, and scientific packages.
4. If still nothing, fall back to web search or propose creating a new skill via meta-skill-creator.

Only if all four steps return nothing does Organon print NO SKILL MATCH and ask the user whether to proceed with a bespoke script or build a skill. In practice the cascade routes most scientific requests at the first step.

3.3 Layer 3: Research Context

The research-context layer personalizes skill output without changing skill code. research_context/research-profile.md records the researcher's primary field, subfields, active questions, career stage, and tool ecosystem. research_context/research-preferences.md records citation style, preferred journals, access constraints, and writing conventions. research_context/research-artifacts.md indexes the researcher's actual papers, manuscripts, notebooks, and datasets, which are stored in a gitignored drop-folder. Every scientific skill reads the profile and preferences at the start of its execution and degrades gracefully if the files are missing: it produces solid generic output and notes what would improve with a profile, rather than blocking. This graceful-degradation contract is enforced by a context matrix in CLAUDE.md that maps each skill to the research-context files it consumes.

3.4 Orchestration and gates

A small set of gates sits between skills and deliverables.

Humanizer gate.
Drafted publishable text (blog, article, thread) is offered an optional pass through the humanizer to remove AI writing patterns; formal academic output skips it by default.

Drive push gate. Shareable deliverables (data files, figures, manuscripts, decks) are offered a one-click staging into Google Drive.

Figure proposal gate. Long-form documents (whitepapers, tutorials, reviews) receive an upfront offer for a hero illustration before drafting, plus section-level figure offers during drafting.

Obsidian sync gate. Knowledge artifacts (paper summaries, experiment designs, research notes) are offered synchronization to a local Obsidian vault for graph-based discovery. More gates arrive as new skills are added.

The gates are the operational embodiment of Claim 5: they keep the human visibly in the loop for actions that affect shared state (publishing, uploading, sending), rather than granting the agent blanket write access to external systems. They are not a deficiency relative to fully autonomous systems; they are the mechanism by which the system stays calibrated with reality, by inviting a human sign-off on every action that crosses the boundary out of the local sandbox.

3.5 The autonomous attack pipeline

One skill merits specific discussion because it appears in every case study below. tool-arena-attack-problem is a seven-stage pipeline that takes a single Einstein Arena problem URL and produces (a) a full recon, (b) a hypothesis graph, (c) a series of measured attack attempts, and (d) either a submission-ready candidate or a structured negative-result writeup. Figure 3 sketches the pipeline.

Stage 1 fetches the problem statement, the verifier code, the current leaderboard, best-submitted solutions, and the discussion board.
Stage 2 runs a rigor scan that classifies each top-K submission as rigorous or exploit (exploits are submissions that squeeze the verifier's floating-point tolerance rather than solving the problem; the submission gate refuses to publish exploits as mathematical claims). Stage 3 fans out five research agents in parallel: a literature agent, a historian (forensic decode of competitor submissions), a pattern-scout (cross-problem transfer from a curated patterns library), a router (primitive-stack recommender), and an adversarial critic. Stage 4 runs a multi-persona council: three general mathematical personas (Gauss, Erdős, Tao) plus two to three domain-specific expert personas chosen by the critic. Stage 5 synthesizes a hypothesis graph of 10–16 nodes with falsification criteria. Stage 6 executes attacks in priority order, measures each, and writes outcomes back to the learnings journal. Stage 7 is the submission gate, which requires explicit user approval before any arena API write.

The pipeline's novelty is not any single stage. It is the rigor of the literature research and information gathering, the parallelism of Stage 3 (five agents in one wall-clock call), and the adversarial structure of the council (an advocate versus a skeptic per hypothesis), which together expose the blind spots a single-agent planner misses and yield a strategic, critical attack plan. Section 5.3 shows a concrete example: the Thomson N = 282 challenge, where Wave 2's quartic-mode-following test came directly out of an advocate-skeptic exchange in the council and produced a publishable empirical closure of one of the residual scenarios.

4. Self-Evolution

Organon is designed to get sharper across sessions, not just within a single session. The mechanism is a flywheel (Figure 4) between attempts, observations, patterns, and skills.
The five moving parts:

Attempts are whatever the researcher is doing in a session: running a t-test, chasing a paper, writing a manuscript, attacking an arena problem. Every attempt is logged as a session block in the daily memory file with goal, deliverables, decisions, and open threads.

Observations are what the attempt produced, especially where expectation and reality diverged. "The polish step is expected to be a near no-op at the float64 floor, not a 10x improvement." "Subagent truncation is a recurring failure mode when the final return message carries both an evaluator dict and a JSON summary." "When the arena minImprovement of 1e-4 is calibrated just below the basin floor, a 200-second polish suffices to detect the lock and pivot to a negative-result writeup."

Learnings entries are what the observations become when they are re-expressed as advice for future-you. They go into context/learnings.md under the specific skill section (or under General if cross-cutting) with a Why: line and a How to apply: line. The file is append-only; entries never disappear. Skills read only their own section at invocation time.

Patterns are what emerges when a learnings entry recurs across three or more sessions. When the wrap-up skill scans a session's closing and finds that the same ad-hoc workflow has appeared repeatedly, it proposes crystallizing it into a new skill. The arena-patterns library and the optimization-recipes catalog are two concrete instances of this: each pattern is a named recipe (for example cross-resolution transfer, k-climbing, Dinkelbach fractional program, Remez exchange) that carries its own trigger conditions and applicability rules.

Skills are the terminal state of a pattern. A new skill folder is scaffolded, reviewed by the user, and automatically registered in three places: the skill registry in CLAUDE.md, the context matrix, and the relevant section of context/learnings.md.
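Concretely, a learnings entry of the shape just described might look like this (hypothetical content, paraphrasing the polish-step observation above; the exact file format is ours to illustrate):

```markdown
## sci-optimization-recipes

- A polish pass at the float64 floor should be a near no-op, never a 10x gain.
  Why: scores at machine precision cannot legitimately move by more than ~1e-15.
  How to apply: if a polish step reports a large improvement, re-run the
  verifier before trusting it.
```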
The reconciliation rule ensures that a skill appearing on disk without registry rows is picked up silently at the next session start, while a skill removed from disk but still listed triggers a confirmation prompt rather than a silent edit. The key property of Organon is that it reconciles its own state on every session start, so the researcher can focus on the science while the framework keeps itself consistent and evolving. The closest analogue is JARVIS, Tony Stark's omnipresent lab assistant: the one who reads the room, remembers what worked yesterday, and quietly reorganizes the workshop overnight. Given this self-evolution machinery, and Section 5's account of how it played out on the Einstein Arena, we expect Organon to grow further skills for physical research and engineering integration; perhaps one day it will even help draft Tony's Mark-III suit.

5. Case Studies: Einstein Arena

The Einstein Arena is a public benchmark of 19 open mathematical construction problems, adapted from the 67-problem roster in Tao and colleagues' companion paper to AlphaEvolve (Georgiev et al., 2025). Each problem has a published numerical verifier, a public leaderboard, a minImprovement threshold that a candidate must exceed to claim a new #1, and a discussion board where competing agents occasionally share partial insights. Figure 7 shows Organon's current portfolio across the eight problems attacked in April 2026. As of writing, the live #1 ranks reported below all reflect submitted-and-accepted candidates on the public leaderboard.

Four of the eight entries above (Thomson, Hexagon-packing, C₂, Erdős) belong to a category we discovered empirically during the challenge attacks. Our coordinates score strictly better than the current leaderboard #1 on each of these four, yet none appear on the public board.
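The gate mechanics behind this reduce to a single comparison per submission. The arena's worker code is not public, so the following is an assumption reconstructed from the observed behavior, not its actual implementation:

```python
# Hypothetical reconstruction of a per-problem minImprovement gate.
def gate_accepts(new_score: float, best_score: float,
                 min_improvement: float, maximize: bool = True) -> bool:
    """Accept only if the candidate beats the incumbent by at least the gate."""
    delta = (new_score - best_score) if maximize else (best_score - new_score)
    return delta >= min_improvement

# A 2.23e-10 lead on C2 is real but falls far short of the 1e-4 gate,
# so the submission is dropped even though its raw score would rank #1.
accepted = gate_accepts(0.9626433189854, 0.9626433187627, 1e-4)
```

Under this logic a strictly better score is indistinguishable from a failed attempt once the margin dips below the gate, which is exactly the silent-drop behavior reported here.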
The deltas are 1.46e-11 on Thomson, 1.13e-6 on Hexagon-packing, 2.23e-10 on C₂, and 1.16e-10 on Erdős, all in the favorable direction. The arena's per-problem minImprovement gates for these four are 1e-5, 1e-4, 1e-4, and 1e-6 respectively, so each margin sits roughly 2 to 6 orders of magnitude below the threshold needed for promotion. The arena's async-evaluation worker drops sub-gate POSTs silently, even when the raw score would otherwise rank #1 by sort order. Six attempts across these four problems (IDs 2264 through 2269 in the arena's submission log) each returned 201 Created from the API and then transitioned to a 404 state once the worker processed them. None reached the leaderboard at any rank; we nevertheless share the solutions to all four problems in the project repo. We treat this as a platform-evaluation policy rather than a mathematical claim. The underlying constructions are reproducible from the open-source repository, verifier-clean against the arena's published evaluator (byte-identical SHA-256 to our local check), and represent strictly tighter bounds than the agents currently sitting at #1 on each of the four problems. We summarize five of the eight case studies below, grouped by outcome class.

5.1 First and Third Autocorrelation Inequality: live #1

The First Autocorrelation Inequality (C₁) asks for a compactly supported real function f that maximizes a ratio involving ‖f * f‖∞. The Third Autocorrelation Inequality (C₃) is structurally related. Organon holds the live #1 rank on both: C₁ at 1.5028609073611405 (lead over #2 is 2e-14, right at the float64 floor) and C₃ at 1.452304333183158 (lead 2.17e-4, well above the 1e-4 gate). Both were reached through the same methodology stack: a Dinkelbach fractional-program reformulation plus a β-anneal that gradually tightens the hinge in the verifier, composed with a precision-polish pass for the final squeeze.
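The Dinkelbach step generalizes beyond these two problems: to maximize a ratio N(x)/D(x) with D > 0, repeatedly maximize the parametric objective N(x) − λ·D(x) and update λ to the achieved ratio. A minimal sketch on an invented toy objective (the arena functionals are far richer):

```python
from scipy.optimize import minimize_scalar

# Generic Dinkelbach iteration for maximizing a ratio N(x)/D(x) with D > 0.
# The toy objective below is invented for illustration only.
def N(x): return x + 1.0
def D(x): return x * x + 1.0

def dinkelbach(lo=0.0, hi=2.0, lam=0.0, tol=1e-9, max_iter=50):
    x = lo
    for _ in range(max_iter):
        # Inner step: maximize the parametric objective N(x) - lam * D(x).
        res = minimize_scalar(lambda x: -(N(x) - lam * D(x)),
                              bounds=(lo, hi), method="bounded",
                              options={"xatol": 1e-12})
        x = res.x
        new_lam = N(x) / D(x)
        if abs(new_lam - lam) < tol:
            break
        lam = new_lam
    return x, N(x) / D(x)

x_star, ratio = dinkelbach()
```

Each update increases λ monotonically, and the iteration stops at the fixed point where the parametric maximum is zero, which is the optimal ratio.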
The methodology transferred cleanly from C₁ to C₃ because both problems routed through the same optimization-recipes catalog. This is a concrete instance of the self-evolution mechanism in Section 4: a single session's insight on C₁ became a catalog recipe, and the next session's C₃ attack invoked it as a first move.

5.2 Prime Number Theorem: live #1

Problem 7 on the arena asks for a numerical certificate f with at most 2,000 keys that maximizes S(f) = −Σₖ f(k)·log(k)/k subject to Σₖ f(k)·⌊x/k⌋ ≤ 1 for all real x ≥ 1. The Möbius function μ with infinite support achieves S = 1; this is equivalent to the prime number theorem, via Hadamard (Hadamard, 1896) and de la Vallée Poussin (de la Vallée Poussin, 1896). The challenge is how close to 1 a 2,000-key truncation can get. Tao observes in the AlphaEvolve companion paper that AlphaEvolve "struggled to take advantage of the number-theoretic structure" on this particular problem (Georgiev et al., 2025; Tao, 2025).

At the time of writing, Organon holds the live #1 on the public leaderboard at S = 0.9949009933486, computed with a wider-range subset-selection trick: instead of using all squarefree integers in [1, N_max], select the best-scoring 2,000 out of the roughly 2,131 squarefree candidates in [1, 3498]. This opens more "air" in the constraint matrix and buys a margin of 5.35e-5 over the next public agent (Figure 5). The four-step climb in Figure 5b is the concrete shape of the discussion in Section 7.1. None of the four moves is, in isolation, a result a generic evolutionary loop would have produced; each came from a different scientific operation (forensic competitor decode, verifier-edge analysis, literature-anchored LP reformulation, council-mediated wider-range hypothesis), composed across sessions.
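To make the objective concrete, the sketch below scores the classical reference certificate, μ truncated to [1, N], against S(f), and checks the constraint at integer x only (the verifier's actual checking grid over real x ≥ 1 is not reproduced here):

```python
import math

def mobius(n: int) -> int:
    """Mobius function by trial division (fine for small n)."""
    if n == 1:
        return 1
    result, p = 1, 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:   # squared prime factor => mu = 0
                return 0
            result = -result
        p += 1
    if n > 1:                # one leftover prime factor
        result = -result
    return result

def score(f: dict) -> float:
    # S(f) = -sum_k f(k) * log(k) / k
    return -sum(v * math.log(k) / k for k, v in f.items())

def constraint_lhs(f: dict, x: int) -> int:
    # sum_k f(k) * floor(x/k), evaluated at integer x only
    return sum(v * (x // k) for k, v in f.items())

N = 199
f = {k: mobius(k) for k in range(1, N + 1)}
S = score(f)
```

With infinite support this certificate attains S = 1 exactly; the arena's 2,000-key cap, and the truncation's constraint violations for x beyond the support, are what make the problem nontrivial.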
5.3 Thomson problem at N = 282: a near-proof of Wales-globality and a raw-score lead the arena dropped

The Thomson problem (Thomson, 1904) asks for the minimum-energy configuration of N unit charges on a sphere. Wales and colleagues (Wales and Ulker, 2006) constructed the current best-known icosadeltahedral configurations via basin-hopping; the arena problem sets N = 282, which Ono (Ono, 2021) identified as a vibrational "magic number". The arena top four are all Procrustes-equivalent to Wales's configuration.

Organon attacked this problem across two waves. Wave 1 ran 18 attack families (basin-hopping, mode-following, quartic-mode probing, hot-Langevin, T_h pyritohedral seeds, Bachoc-Vallentin SDP skeleton) and concluded with an ensemble posterior of P(Wales is global) = 0.82. Wave 2 closed all three residual scenarios (quartic-mode, thermal-Langevin, T_h-seed) empirically and moved the posterior to 0.97. Wave 2's lead attack, a first-of-its-kind numerical test of the quartic-mode-following conjecture at a Thomson magic number, ran 80 probes in the three softest icosahedral irreducible-representation blocks; all 80 retracted to Wales, giving a clean, publishable negative result independent of the arena's evaluation policy.

Our final polished Wales coordinates evaluate to E = 37147.2944184622465, which is 1.46e-11 below the leaderboard #1 cluster on the arena's own verifier (byte-identical SHA-256 to our local copy). That margin sits six orders of magnitude below the arena's 1e-5 minImprovement gate, so the submission (ID 2267) was processed by the arena's evaluation worker and silently removed, as documented in the gate-rejection paragraph above. The construction is reproducible from the open-source repository and is the tightest numerical Thomson value at N = 282 that we are aware of, even though the public leaderboard does not display it. The N = 282 challenge is also a working example of Organon's heterogeneous-path strength.
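For reference, the energy figures quoted in this subsection come from the Coulomb functional E = Σᵢ<ⱼ 1/|xᵢ − xⱼ| over N unit vectors. A minimal evaluator, sanity-checked on the N = 4 optimum, which is known in closed form (the regular tetrahedron, E = 6/√(8/3) ≈ 3.6742):

```python
import itertools
import numpy as np

# The Thomson objective: Coulomb energy E = sum_{i<j} 1/|x_i - x_j| for
# N points on the unit sphere.
def thomson_energy(points: np.ndarray) -> float:
    return sum(1.0 / np.linalg.norm(points[i] - points[j])
               for i, j in itertools.combinations(range(len(points)), 2))

# Regular tetrahedron inscribed in the unit sphere: the known N = 4 optimum.
tet = np.array([[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]) / np.sqrt(3)
energy = thomson_energy(tet)
```

The same evaluator applied to 282 polished Wales coordinates is what produces the E = 37147.29441846... figure above.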
The two-wave run threaded a deep literature sweep across Wales/Ulker and Ono, cross-problem pattern transfer from earlier optimization recipes, a multi-persona council that proposed the quartic-mode-following test as a falsifiable closure for one of the residual scenarios, and four genuinely different optimization primitives (gradient-descent basin-hopping for the bulk search, Langevin sampling for the thermal-escape scenario, ILP-flavored T_h enumeration for the pyritohedral-seed scenario, and a Bachoc-Vallentin SDP skeleton for the rigorous lower-bound check) composed across the two waves. No single evolutionary loop would have produced that mix; the path itself is the result.

5.4 Kissing number d = 12: integer 841-kissing impossibility: live #2

The 12-dimensional kissing problem asks for the maximum number of unit vectors in ℝ¹² pairwise at angle ≥ 60°. CHRONOS's submission achieves the classical 840 at ‖v‖ = 2 with integer coordinates; Organon holds #2 at 2.000000000005719, within 6e-12 of CHRONOS's exact 2.0. After 28 empirical attacks (Lasserre SDP skeleton, spectral Gram perturbation, joint-manifold parallel-tempering, orbit-swap, and others), Organon proved a computer-assisted theorem: no 841-kissing configuration exists in the integer coordinate set ℤ¹² with ‖v‖² = 4. The proof is a sequence of five exact integer-linear-program max-clique computations using scipy.optimize.milp with the HiGHS (Huangfu and Hall, 2018) backend, each solving in well under one second. A genuinely distinct integer 840-configuration emerges as a byproduct, but it has the same 2.0 floor as CHRONOS's. Independently, a polynomial-dominator Lasserre analysis showed why the Cohn-Triantafillou 2022 obstruction (Cohn and Triantafillou, 2022) extends to the monomial-dominator family as well. The challenge closed with a publishable note rather than a submission; the theorem itself is the integer 841-kissing impossibility for d = 12.
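The flavor of the exact max-clique ILP step can be reproduced on a toy analogue: integer vectors with ‖v‖² = 4 in d = 5, where conflicting pairs genuinely arise. This is a sketch of the technique with scipy.optimize.milp, not the paper's actual five-computation proof over ℤ¹²:

```python
import itertools
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

# Toy analogue in d = 5 of the max-clique ILP step (the paper's computations
# run over Z^12). Two norm-2 vectors can kiss simultaneously iff their angle
# is at least 60 degrees, i.e. v.w <= ||v||^2 / 2 = 2.
d, norm_sq = 5, 4
vectors = [np.array(v) for v in itertools.product(range(-2, 3), repeat=d)
           if sum(c * c for c in v) == norm_sq]
n = len(vectors)

# Conflict pairs: vectors that cannot both appear in a kissing configuration.
conflicts = [(i, j) for i in range(n) for j in range(i + 1, n)
             if vectors[i] @ vectors[j] > norm_sq / 2]

A = np.zeros((len(conflicts), n))
for row, (i, j) in enumerate(conflicts):
    A[row, i] = A[row, j] = 1          # at most one vector per conflict pair

res = milp(c=-np.ones(n),              # maximize the number of chosen vectors
           constraints=LinearConstraint(A, -np.inf, np.ones(len(conflicts))),
           integrality=np.ones(n), bounds=Bounds(0, 1))
best = int(round(-res.fun))
```

Because the ten (±2, 0, 0, 0, 0)-type vectors are mutually compatible, the optimum is at least 10, and HiGHS certifies the exact maximum over the full 90-vector set; the d = 12 impossibility proof scales the same exact formulation up to the ℤ¹² norm class.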
5.5 Second Autocorrelation Inequality and Hexagon-packing: gate-rejected leads at the basin floor

The Second Autocorrelation Inequality (C₂) and the 12-hexagon container-packing problem both exhibit a shared pathology: the minImprovement threshold of 1e-4 is calibrated just below the basin floor that multiple independent top agents reach. On C₂ (a maximization problem), Organon ran 17 attack strategies (cross-resolution transfer, packet-count sweeps, Adam cascade, large-jitter escape, 1.6 × 10⁶ block-repeat plus polish) and converged to C = 0.9626433189854 on the arena’s verifier. That value is 2.23e-10 above ClaudeExplorer’s leaderboard #1 of 0.9626433187627; the arena’s gate is 1e-4, six orders of magnitude above our margin. On Hexagon-packing, 16 attacks (Connelly second-order rigidity, soft-mode descent, full-40-variable SLSQP, C₃-fundamental 9-dimensional sweep) identified the basin floor and improved it from the leaderboard-stamped 3.9416523 down to 3.9416421 (aggressive variant) and 3.9416511 (safe variant, with a comfortable verifier-tolerance margin). The aggressive lead is 1.02e-5 below the leaderboard tie; the safe lead is 1.13e-6. Both sit below the 1e-4 gate.

Both challenges closed with submission-ready candidates that we eventually fired (ID 2266 for C₂, ID 2268 for Hexagon-packing; see the gate-rejection paragraph above). Both submissions were processed by the arena’s worker and silently removed; neither appears on the public leaderboard at any rank. The value of these closed challenges is twofold: the empirical confirmation that the tied leaderboard sits at the basin floor for both problems, plus the additional evidence that the basin floor itself can be probed 5 to 10 orders of magnitude tighter than the leaderboard display before hitting float64 noise. In both cases, contemporaneous community artifacts independently corroborate the basin-floor ceiling.
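We do not have the arena's source, so the following is a hypothetical reconstruction of the promotion gate, written only to make the arithmetic of the rejections concrete: a lead of 2.23e-10 sits six orders of magnitude under a 1e-4 threshold, so the submission is processed and then dropped.

```python
def gate_accepts(new_score, current_best, min_improvement, maximize=True):
    """Hypothetical reconstruction of the arena's minImprovement gate:
    a submission is promoted to the leaderboard only if it beats the
    current best by at least min_improvement."""
    delta = (new_score - current_best) if maximize else (current_best - new_score)
    return delta >= min_improvement

# C2 (maximization): a genuine 2.23e-10 lead over the public #1 is
# silently dropped under the 1e-4 gate.
accepted = gate_accepts(0.9626433189854, 0.9626433187627, 1e-4)  # False

# Hexagon-packing (minimization): the 1.02e-5 aggressive lead fails too.
hex_accepted = gate_accepts(3.9416421, 3.9416523, 1e-4, maximize=False)  # False
```

Under this reconstruction, only a delta of at least 1e-4 would promote, which is exactly the "calibrated just below the basin floor" pathology described above.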
JSAgent’s public source on C₂ converges to the same value Organon and ClaudeExplorer reach, and concurrent work by Berthold and colleagues (Berthold et al., 2026) shows that off-the-shelf nonlinear-programming solvers (FICO Xpress, SCIP) match the AlphaEvolve hexagon-packing benchmark. Both observations indicate that the basin is reachable by general-purpose tools and that the apparent ceiling is a real numerical property of the problem rather than an artifact of any one agent’s heuristic.

6. Ablation: Raw Opus 4.7 on the Prime Number Theorem Problem

Claim 4 in Section 1 stated that the identical base model, stripped of Organon scaffolding, plateaus short of Organon’s score on the same problem. This section describes the sealed-sandbox experiment that tests it.

6.1 Protocol

Raw Opus 4.7 (Anthropic, 2026) was placed in a Claude Code CLI sandbox that provided only: the verbatim arena problem statement (from problem.json), the exact arena scorer (evaluator.py), a Bash shell plus Read/Write/Edit tool permissions, and pre-installed numpy and scipy with the HiGHS backend (Huangfu and Hall, 2018). Every Organon scaffold was withheld: no skills, no council, no memory, no learnings journal, no arena API, no web, no human steering. Three sandbox conditions were run: T0 strict (no Bash, one-shot reasoning), T0 retry (same but with output-token and budget caps), and T1 iterative (Bash enabled, 10-iteration budget in the prompt, USD 30 cap).

6.2 Results

Table 1 summarizes the headline numbers.

Table 1. Head-to-head results on the Prime Number Theorem problem. The Organon row is the live #1 candidate on the public leaderboard; the raw-Opus rows are the three conditions of the sealed-sandbox experiment.

Condition | Final score | Cost | Wall time
Organon (live #1) | 0.9949009933486332 | < $50 | < 4 h
Raw Opus 4.7, T0 strict (run 1) | no solution produced | $6.88 | 45 min
Raw Opus 4.7, T0 retry (run 2) | −∞ (Möbius on [1, 199] violates the constraint) | $0.74 | 4 min
Raw Opus 4.7, T1 iterative (run 3) | 0.9928327372237423 (−2.07e-3 vs Organon) | $3.18 | 5 h (rate-limited)

Figure 6 plots the run-3 climb and marks the next-public-agent and Organon ceilings.

6.3 Three failure modes, three lessons

Run 1 illustrated runaway reasoning. Without a verifier in the loop, Opus generated 256,378 output tokens across five turns trying to derive a certificate analytically from first principles and hit the per-message API ceiling before writing anything. No solution.json was produced.

Run 2 illustrated textbook-right, operationally wrong. Under tight output caps, Opus confidently proposed the Möbius function on squarefree [1, 199] (122 keys, all ±1). This is mathematically correct in the N → ∞ limit but scored −∞ because the finite-N truncation violates the constraint Σₖ f(k)·⌊x/k⌋ ≤ 1 at multiple sample points. The model knew the theorem; it did not know the construction.

Run 3 illustrated climbed, but plateaued. With Bash enabled and a verifier in the loop, Opus autonomously wrote diagnostic scripts, tried tapering heuristics, reformulated the problem as a linear program using scipy.optimize.linprog with HiGHS (the same solver Organon’s optimization skill invokes internally), and swept N ∈ {200, 500, 1000, 1500, 1800}. Its score trajectory was 0.97266 → 0.98452 → 0.98995 → 0.99147 → 0.99283, at which point the Anthropic subscription rate limit terminated the run. What run 3 did not do is (a) filter to squarefree keys only, (b) push past N = 1800, or (c) discover the wider-range-than-N-keys subset-selection trick that Organon uses. That last item is the whole “last mile”: an insight that emerged across multiple Organon sessions of council-mediated hypothesis generation and is worth the final 0.002.

6.4 Honest caveats

Where the experiment favors raw Opus: Opus had the exact scorer and the exact problem statement, not a paraphrase; it ran with no turn cap; it had unrestricted Python.
Where the experiment favors Organon: the challenge ran across multiple sessions with arena API access, cross-session memory, human steering, and the adversarial council. A single-session raw Opus run cannot simulate any of those. Threats to external validity include training-data contamination (PNT and Möbius are in the training corpus; the wider-range trick is not), single-seed variance, the HiGHS SIGALRM blocking bug that Organon discovered and worked around but Opus almost certainly hit silently, and early termination by rate limit (the experiment saturated its time cap, not its dollar cap). A budget-matched T1 replication with USD 200 of Opus tokens remains an open experiment.

6.5 Interpretation

The honest headline is not “Organon beats Opus.” Organon is Opus plus scaffolding plus human intuition and judgment; that is the whole point. When the margin between competitive and winning is below 0.001, skill composition, cross-session memory, and a human at the decision gate buy that margin, and a raw frontier model under a single-session budget does not.

7. Discussion: Organon vs AlphaEvolve, FunSearch, and the AI-Scientist Pipeline

Figure 8 sketches where the five best-known agentic-scientific-computing systems sit in a two-dimensional positioning space: autonomy (how much of the loop the human is not in) on the x-axis, and domain breadth (how wide a scientific workflow the system covers) on the y-axis.

7.1 Complementary, not competing

AlphaEvolve and FunSearch are strongest when the answer is a short program that can be evolved. Their evaluator-in-the-loop pattern is the right shape for algorithm discovery, small-program search, and numerical-construction problems with a cheap scoring function. They struggle on problems where the path to the answer is heterogeneous (literature review, then forensic competitor decode, then SDP skeleton, then orbit-swap ILP, then writeup), because there is no single program to evolve.
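That evaluator-in-the-loop shape reduces to a few lines. AlphaEvolve and FunSearch mutate programs with an LLM; the sketch below, with a plain parameter vector and a toy evaluator of our own choosing, shows only the skeleton of the loop, not either system's implementation.

```python
import random

def evolve(score, init, mutate, generations=200, offspring=8, seed=0):
    """Minimal (1 + lambda) evaluator-in-the-loop search: keep the
    current best candidate, spawn mutated offspring, and promote any
    child the evaluator scores strictly higher."""
    rng = random.Random(seed)
    best, best_score = init, score(init)
    for _ in range(generations):
        for _ in range(offspring):
            child = mutate(best, rng)
            s = score(child)
            if s > best_score:
                best, best_score = child, s
    return best, best_score

# Toy evaluator: maximize -(x - 3)^2, optimum at x = 3.
score = lambda x: -(x - 3.0) ** 2
mutate = lambda x, rng: x + rng.gauss(0, 0.3)
best, best_score = evolve(score, 0.0, mutate)
```

The loop works whenever the evaluator is cheap and the candidate is a single mutable object; it has nothing to offer when the next move is a literature sweep or a forensic decode, which is the heterogeneity argument above.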
Organon is strongest exactly on that heterogeneous path. The Prime Number Theorem (Section 5.2) is a canonical example: claiming the live #1 required reading Tao’s note in the AlphaEvolve companion paper that AlphaEvolve “struggled to take advantage of the number-theoretic structure” (Georgiev et al., 2025), forensically decoding two competing public agents’ submissions (CHRONOS’s cutting-plane LP and JSAgent’s squarefree-only LP with an interior-point solver), running a fresh LP reformulation with scipy.optimize.linprog on the HiGHS backend (Huangfu and Hall, 2018), and finally acting on a council-mediated hypothesis that a wider-range subset selection (picking the best 2,000 squarefree integers from [1, 3498] rather than the more obvious [1, 1999]) would open more “air” in the constraint matrix. That last move is what bought the 5.35e-5 margin over the next public agent. No single evolutionary loop would have produced the chain. Conversely, Organon’s unit of capability is a skill, not a program; it might not discover a 48-multiplication 4 × 4 matrix algorithm from scratch, the kind of result AlphaEvolve was specifically designed for.

7.2 AI-Scientist’s automation vs Organon’s checkpoints, and the human-in-the-loop case

The AI-Scientist pipeline closes the loop end-to-end and submits autonomous manuscripts to real peer review (Yamada et al., 2025). The automation is impressive and the ICLR-workshop acceptance is a real milestone. The cost is that the human is not in the loop at most decision points: research direction, novelty assessment, and even the peer-review evaluation are all handled by the same LLM-based sub-systems. This can produce confident novelty claims that are not actually novel (a limitation the pipeline’s authors themselves flag). Organon makes the opposite trade. Every skill invocation prints a routing notice to the user. Every submission, push, or upload passes through a gate. Every ambiguous classification is confirmed.
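Stepping back to the Prime Number Theorem chain for a moment: the LP core of that attack is compact enough to sketch. The constraint Σₖ f(k)·⌊x/k⌋ ≤ 1 comes from the problem statement; the objective below (maximize Σₖ f(k)/k), the coefficient bounds, and the sample range are our stand-in assumptions, since the arena's evaluator.py defines the real score. The squarefree key filter and the keys-drawn-from-a-wider-range structure echo the subset-selection trick described above.

```python
import numpy as np
from scipy.optimize import linprog

def squarefree(n):
    """True if n has no repeated prime factor."""
    return all(n % (p * p) for p in range(2, int(n ** 0.5) + 1))

def solve_lp(keys, xs):
    """LP over coefficients f(k), k in `keys`.
    Constraint (from the problem statement): sum_k f(k)*floor(x/k) <= 1
    at every sampled x. Objective (our stand-in, NOT the arena's):
    maximize sum_k f(k)/k. Bounds on f are also an assumption."""
    keys = np.array(keys)
    A = np.floor(np.array(xs)[:, None] / keys[None, :])
    c = -1.0 / keys                       # linprog minimizes, so negate
    return linprog(c, A_ub=A, b_ub=np.ones(len(A)),
                   bounds=[(-1.0, 1.0)] * len(keys), method="highs")

# Squarefree keys drawn from a range wider than the key count.
keys = [k for k in range(1, 120) if squarefree(k)][:40]
res = solve_lp(keys, xs=range(1, 400))
```

Even this toy shows the mechanism: widening the range the keys are drawn from changes which columns enter the constraint matrix, which is where the extra "air" comes from.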
The cost is that the system is not fully autonomous; the gain is the kind of decision the human catches that a closed loop misses. On the Second Autocorrelation Inequality (Section 5.5), Organon ran 17 attack strategies and watched all of them converge to the same C ≤ 0.9626 floor. The right next move was not another attack; it was to stop. The human in the loop recognized the pattern (the arena minImprovement of 1e-4 was calibrated just below the reachable basin floor, so any further compute would burn cycles on a problem with no reachable promotion delta), called the challenge closed with a publishable negative result, and codified the decision as a first-class learning: before committing multi-day compute on a 1e-4 problem, run a 200-second polish; if the delta is below 1e-8, the basin is locked, so pivot rather than persist. The system did not have that pattern in advance; the human supplied it from a pile of empirical evidence, and now every subsequent arena challenge reads it on every invocation and saves the cycles.

In environments where correctness, attribution, and not burning compute on a locked basin matter more than throughput, the skill-plus-gate-plus-human design is the right point on the tradeoff surface. The argument is not against autonomy as a long-term goal; it is for the fusion of human judgment and agentic orchestration as the durable shape of scientific discovery in the near term, where humans supply the values and the falsifications and the agent supplies the breadth and the speed.

7.3 How Organon evolved to fold each system’s strengths into its own skill stack

Rather than treating these systems as alternatives the researcher has to choose between, Organon has evolved to fold each one’s primary capability into its own skill stack as a reusable primitive.
Evolutionary search of the AlphaEvolve and FunSearch shape became one of Organon’s optimization recipes: when the right next move on a problem is to evolve a short program against a programmatic evaluator, the council surfaces the recipe and sci-optimization invokes it. The interpretability advantage of a program-as-result, which FunSearch made central, propagated into Organon’s discipline of returning reproducible certificates rather than opaque numerics from any optimization step. The end-to-end manuscript chain pioneered by Sakana AI’s AI-Scientist and its v2 successor shaped Organon’s sci-writing paper pipeline (research → draft → verify → review → fix), with a human-checkpointed gate at each stage instead of a closed autonomous loop.

Each external system contributed one primitive; none of them contributed the unit of capability. The result is an end-to-end scientific workflow that learns and adapts. The same agent that runs a parallel-fanout literature sweep on Monday and an LP polish on Tuesday accumulates observations from both into a shared learnings journal, and a third unrelated problem on Wednesday inherits both as priors. As new agentic-scientific-computing systems land in the literature, the path of least resistance is to add another primitive to the catalog rather than replace the framework. That is what we mean by an adaptive system: the architecture absorbs new capabilities as skills, the human steers, and the work compounds across sessions and across problems.

8. Future Outlook

8.1 Roadmap

Three priorities stand out for the next stage of Organon development.

Automated learning extraction. The flywheel in Section 4 currently relies on the wrap-up skill to notice recurring patterns. An explicit pattern-extraction pass that reads the learnings journal weekly, clusters entries by structural similarity, and proposes skill-creator seeds would tighten the loop from “user-driven” to “system-driven”.
And perhaps, Organon learns to “dream”…

Denser adversarial councils. The Thomson N = 282 Wave 2 council introduced an advocate-plus-skeptic-per-hypothesis structure that outperformed the earlier breadth-only council at closing residual scenarios. Adopting this structure (one skeptic per advocate, one adjudicator) across every challenge is a small change with potentially outsized return on hypothesis-graph quality.

Federated multi-agent teams. The five-agent parallel recon in Stage 3 of the arena attack pipeline generalizes naturally to other scientific workflows (literature sweep, experiment design, data triage). Lifting this from a single skill into a framework-level primitive would let other skills inherit the parallel-recon pattern without reimplementing it.

8.2 Beyond mathematical construction

The Einstein Arena is a convenient benchmark because it has programmatic verifiers, but the underlying architecture is domain-agnostic. Most of Organon’s day-to-day usage already lives in the broader scientific workflow rather than in mathematical construction: literature research with parallel fan-out across PubMed, arXiv, OpenAlex, Semantic Scholar, and the Paperclip biomedical corpus; trending-topic monitoring in a researcher’s field; CSV/Excel/Parquet data analysis with assumption-checked statistics and publication-quality figures; hypothesis generation under a multi-persona council; manuscript drafting and peer review with strict citation discipline; blog posts, threads, newsletters, tutorials, and press releases that adapt the same evidence to different audiences; figure generation through AI-generated illustrations, Mermaid flowcharts, Excalidraw sketchnotes, and matplotlib/SciencePlots scientific plots; and slide-deck preparation.
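The advocate-plus-skeptic-per-hypothesis structure from the roadmap above can be stated as an orchestration skeleton. The sketch below is entirely hypothetical (names and callables are ours, not Organon's code); the three callables stand in for persona-conditioned LLM calls.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    hypothesis: str
    case_for: str
    case_against: str
    keep: bool

def council_round(hypotheses, advocate, skeptic, adjudicate):
    """One dense council round: every hypothesis gets its own advocate
    and skeptic, and a single adjudicator decides which survive."""
    verdicts = []
    for h in hypotheses:
        case_for = advocate(h)
        case_against = skeptic(h, case_for)
        verdicts.append(Verdict(h, case_for, case_against,
                                keep=adjudicate(h, case_for, case_against)))
    return [v for v in verdicts if v.keep]

# Stub personas: keep any hypothesis whose skeptic finds no concrete failure.
advocate = lambda h: f"supports: {h}"
skeptic = lambda h, case: "" if "quartic" in h else f"fails: {h}"
adjudicate = lambda h, case_for, case_against: case_against == ""
survivors = council_round(["quartic-mode probe", "thermal escape"],
                          advocate, skeptic, adjudicate)
```

The structural point is that the skeptic sees the advocate's case before answering, so every surviving hypothesis has beaten a targeted objection rather than a generic breadth pass.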
The three-layer identity-skills-context split applies to wet-lab bioinformatics (where a skill wraps a pipeline like Nextflow rather than a solver), to literature meta-analysis (where a skill wraps a search-and-synthesis protocol), and to engineering design (where a skill wraps a CAD tool or a simulation). The self-evolution mechanism applies wherever there is a journal of observations that can be mined.

8.3 Open research questions

Three questions we do not know how to answer yet.

How much of Organon’s advantage is cross-session memory vs skill composition vs the council vs the human at the gate? A proper ablation would isolate each. The Section 6 sandbox held all four out; a four-cell ablation with one held out at a time would tell us which matters most.

Does the learnings journal scale? Today it is a plain-text append-only file. At 10,000 entries this is fine; at 1,000,000 entries it is not. A hierarchical or indexed learnings store is eventually required.

Can the skill registry itself become self-documenting? Today the registry is maintained by the reconciliation rule. A machine-readable registry with tested triggers, explicit dependencies, and versioned capability profiles would allow other agents to compose Organon skills from outside the framework.

9. Conclusion

We have described Organon, an agent-first operating system for scientific work that wraps a frontier LLM in three layers of persistent state: an identity, a skills pack, and a research context. We have shown that on a public benchmark of open mathematical construction problems, skill composition plus cross-session memory plus adversarial councils plus a human at every decision gate produces measurably better outcomes than the same base model run under a sealed sandbox.
We have positioned Organon as a complement rather than a competitor to evolutionary-search systems like AlphaEvolve and end-to-end pipelines like the AI-Scientist, and we have sketched a roadmap that points toward automated learning extraction, denser adversarial councils, and federated multi-agent scientific teams.

The broader point is that the unit of capability in an AI research assistant is not the model, and it is not the prompt; it is the stateful composition of identity plus skills plus context, with a human supplying values and falsifications. A good skill written once continues to compound across sessions and across researchers. A good identity plus context tells every skill how to land its output. A good learnings journal turns every past attempt into a future prior. None of this is novel in isolation; what is novel is the insistence that all three must be first-class, persistent, and reconciled automatically at every session boundary, and that the human is not removed from the loop but rather is the load-bearing component the rest of the system is designed to support.

Acknowledgements

All Organon experiments reported here were executed with Anthropic’s Claude Opus 4.7 as the base model (Anthropic, 2026). The Einstein Arena benchmark was constructed by the Einstein Arena team (Einstein Arena) drawing on Tao and colleagues’ companion paper (Georgiev et al., 2025). ToolUniverse is the open-source biomedical tool catalog from Prof. Marinka Zitnik’s lab at Harvard Medical School (Gao et al., 2025). The comparison baselines, AlphaEvolve (Novikov et al., 2025), FunSearch (Romera-Paredes et al., 2024), and AI-Scientist (Lu et al., 2024; Yamada et al., 2025), are the point of reference against which Organon’s contribution is measured, not systems against which it competes. Source code is open at github.com/krmdel/organon .
Figure 1 is the Organon project hero image from the public repository (github.com/krmdel/organon), and Figure 4 was generated by the viz-nano-banana skill using Nano Banana 2. Figures 2 and 3 are Mermaid renderings. Figures 5, 6, 7, and 8 are matplotlib plots. Every bibliographic entry has been verified for title and first-author consistency against the cited arXiv ID or DOI by Organon; the human-in-the-loop check, not the verification discipline on its own, remains the load-bearing element.