Summary
Workday’s main AI bottleneck is not model quality. It is that Hopper still needs to become a production-grade semantic and evaluation layer for a metadata-heavy platform where important behavior is only partially machine-readable.[1][2][3] This proposal recommends an internal acceleration program to make Hopper measurable, governable, and integrated with tools: build a real evaluation harness, encode XO and metadata semantics into structured artifacts, connect Hopper cleanly to ConteXtO-style actions, and run the whole system with regression gates and human review where needed.[4][5][6][7]
The value proposition is straightforward: if Hopper becomes agent-operable, Workday can reduce developer iteration count, lower SME dependence, and compound AI productivity gains across workflows. If it remains mostly document retrieval plus informal heuristics, results will stay brittle.[8][9][3]
This proposal argues that Hopper is the highest-leverage place to invest if Workday wants AI to materially improve development workflows in XO and metadata-heavy systems. Workday’s public architecture material describes a platform where a meaningful part of application behavior lives in metadata, frameworks, and DSL semantics rather than plain source code.[1][2] That means LLMs will not be reliable unless Hopper can provide machine-usable semantics, not just loosely retrieved documents.[10][3][11]
The recommended engagement has five parts:
- Build an evaluation harness around real internal workflows so Hopper is measured on faithfulness, correctness, task completion, and regression resistance.
- Build a structured semantic layer for XO and related domains, including invariants, constraints, examples, counterexamples, and provenance.
- Build a high-throughput ingestion and review pipeline so institutional knowledge can be encoded quickly without losing quality control.
- Integrate Hopper with ConteXtO-style tools using retrieval-based tool selection, clear answer-vs-tool policies, and closed-loop traceability.
- Roll the system out under enterprise controls: least privilege, auditability, human approval for risky actions, and explicit ownership.
These workstreams reflect current best practices in RAG evaluation, agent evaluation, tool-surface design, and interoperable agent infrastructure.[12][4][5][6][7][13]
The commercial logic is that Isara’s advantage should come from speed with correctness defenses, not from claims about prompts or models. The first milestone should therefore function as a measurable bake-off: equal access, equal guardrails, and clear success metrics such as validated semantic artifacts per week, golden-set improvement, and end-to-end iteration reduction.[8][9][6][14][15]
Why Hopper is the leverage point
Two failure modes recur in real agent deployments:
- Missing stable action surfaces. API-first systems are materially easier for agents to use reliably than GUI-first systems, because APIs reduce step count, ambiguity, and state fragility.[16] This is consistent with broader engineering guidance to minimize brittle interface-level automation where lower-level interfaces exist.[17]
- Missing machine-checkable semantics. Domain-specific languages and layered frameworks accumulate implicit constraints, ordering effects, and hidden state that are difficult for both novice engineers and agents to infer correctly unless those rules are made explicit.[10]
For Workday’s platform, this means Hopper is not an accessory. It is the system that turns brittle prompt-and-retry behavior into grounded engineering behavior. Retrieval-augmented generation is useful because it combines generation with external knowledge, but enterprise experience shows that outcomes are driven as much by content design, evaluation, and monitoring as by model choice.[3][11]
There is a second-order scaling problem as well. As tool inventories and DSL domains grow, prompt stuffing becomes unreliable; retrieval-based selection and context compression become necessary if the agent layer is going to remain accurate and manageable.[12][4] Workday’s published direction around agent interoperability and MCP-style tool access makes this especially relevant.[5]
Target end state
The target Hopper end state is a semantics-and-decision system rather than a document chatbot. It should answer questions, propose changes, recommend actions, and justify each result with traceable sources, rule references, and confidence boundaries.
That requires three layers working together.
Curated semantic layer
Hopper should maintain a canonical, versioned representation of XO semantics and best practices as structured artifacts rather than free-form prose. These artifacts should cover:
- Constructs and meanings.
- Constraints and boundary conditions.
- Precedence and override behavior.
- Domain-framework invariants.
- Common failure modes and remediation patterns.
- Recommended and discouraged implementation patterns.
Each artifact should include provenance, scope, validity conditions, and tests in the form of examples, counterexamples, and expected outputs. This turns institutional knowledge into reviewable and regressible work products.[3]
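As a concrete illustration, one plausible shape for such an artifact is a small typed record. The field names below are illustrative assumptions for this proposal, not Workday’s actual schema:

```python
from dataclasses import dataclass


@dataclass
class SemanticCard:
    """One versioned, reviewable unit of XO/metadata semantics (illustrative schema)."""
    card_id: str
    construct: str            # the DSL construct or rule this card covers
    invariant: str            # the rule, stated as a checkable claim
    examples: list            # inputs that satisfy the invariant
    counterexamples: list     # inputs that violate it, with expected failures
    provenance: list          # source document IDs and SME sign-offs
    valid_from_version: str   # platform version where this rule holds
    status: str = "draft"     # draft -> approved -> deprecated

    def is_testable(self) -> bool:
        # A card is regressible only if it carries both positive and negative cases.
        return bool(self.examples) and bool(self.counterexamples)
```

A card with examples but no counterexamples would fail `is_testable` and stay out of the regression suite until an SME supplies the missing cases.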
Retrieval and reasoning layer
Hopper should use modular retrieval-augmented generation to ground explanations and recommendations in internal sources.[11][18] The retrieval system should be designed for control rather than maximum recall:
- Retrieve domain modules first, then retrieve within-module detail only when needed.
- Use hybrid retrieval where appropriate: keyword, vector, and optional graph-linked retrieval.
- Attach compact agent-facing summaries to each module so the model sees a minimal working set instead of an unbounded prompt.
Where multi-hop reasoning across related concepts is common, graph augmentation may be valuable because it preserves relationships that pure similarity search can miss.[19]
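A minimal sketch of the module-first pattern above, using simple lexical overlap as a stand-in for real hybrid (keyword, vector, graph-linked) scoring; the module structure is an assumption for illustration:

```python
def score(query_terms, text):
    # Crude lexical overlap; a production system would combine keyword,
    # vector, and optionally graph-linked retrieval as described above.
    return len(set(query_terms) & set(text.lower().split()))


def retrieve(query, modules, k_modules=1, k_chunks=2):
    """Two-stage retrieval: pick domain modules first, then drill into detail."""
    terms = query.lower().split()
    # Stage 1: rank modules by their compact agent-facing summaries.
    ranked = sorted(modules, key=lambda m: score(terms, m["summary"]), reverse=True)
    picked = ranked[:k_modules]
    # Stage 2: retrieve within-module chunks only for the picked modules,
    # so the model sees a minimal working set instead of an unbounded prompt.
    hits = []
    for mod in picked:
        chunks = sorted(mod["chunks"], key=lambda c: score(terms, c), reverse=True)
        hits.extend(chunks[:k_chunks])
    return hits
```

The design choice being illustrated is the staging, not the scoring function: only the chunks inside the selected module ever reach the prompt.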
Evaluation and verification layer
Hopper quality must be measured continuously at both the component level and the workflow level. Formal RAG evaluation work provides useful starting dimensions such as context relevance, answer relevance, and faithfulness.[6] For agentic workflows, evaluation should favor verifiable end states wherever possible, complemented by rubric-based grading when binary checks are not available.[7]
In practice, verification should include:
- Golden-set scoring on real Workday developer workflows.
- Regression gating on every material Hopper change.
- Differential-style checks where two implementations or environments can be compared.[20]
- Reflection or critique passes as a secondary filter, not a substitute for evaluation.[21]
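The regression-gating idea above can be sketched as a per-task comparison against a stored baseline; the score format (task ID mapped to a score in [0, 1]) is an illustrative assumption:

```python
def regression_gate(baseline, candidate, tolerance=0.0):
    """Compare per-task golden-set scores and block changes that regress.

    baseline, candidate: dicts mapping task_id -> score in [0, 1].
    Returns (passes, regressions); a material Hopper change ships only
    if no task drops more than `tolerance` below its baseline score.
    """
    regressions = {
        task: (baseline[task], candidate.get(task, 0.0))
        for task in baseline
        if candidate.get(task, 0.0) < baseline[task] - tolerance
    }
    return len(regressions) == 0, regressions
```

Run on every material change to prompts, retrieval logic, schemas, or semantic artifacts, this turns "did we regress?" from a review debate into a mechanical check.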
Scope of work
This engagement is structured as an internal accelerator program with five workstreams.
1. Evaluation harness
A continuously running system that answers: did Hopper reduce iterations and increase correctness for real XO and metadata work?
This workstream is the control system for the whole program. It should force every claim about Hopper performance to resolve into measured outcomes on representative internal tasks rather than anecdote or demo quality. It should also make regressions visible quickly enough that the team can tighten retrieval, semantics, and tool behavior before bad patterns spread.
Deliverables:
- A golden task suite spanning feature build, modification, debugging, troubleshooting, and “what would change if…” analysis.
- Scoring dashboards for retrieval quality, faithfulness, answer relevance, task completion, iteration count, tool-call success rate, and human review time.
- Automated regression gates for prompts, retrieval logic, schemas, semantic artifacts, and tool-routing policies.
2. Semantic knowledge architecture
A typed, versioned semantic layer for XO and adjacent metadata domains.
This layer is what turns scattered institutional knowledge into durable system behavior. The goal is not to collect more prose, but to encode the rules, invariants, and edge conditions that agents actually need in a form that can be reviewed, versioned, and tested. If this layer is weak, the rest of the Hopper stack will remain brittle no matter how strong the models or tooling become.
Deliverables:
- An internal schema for semantic cards with invariants, constraints, examples, counterexamples, and provenance links.
- A known-edge-semantics registry for ambiguous or failure-prone areas.
- Governance rules for domain ownership, approval, deprecation, and change history.
3. Knowledge ingestion and maintenance pipeline
A production pipeline for converting existing institutional knowledge into structured Hopper artifacts and keeping them current.
This workstream determines whether semantic coverage can grow fast enough to matter. The pipeline should make it cheap to convert existing materials into structured artifacts while preserving review quality and provenance. It should also create a repeatable path for keeping Hopper aligned with changing frameworks, tools, and internal practices over time.
Deliverables:
- Ingestion from documents, design notes, runbooks, bug writeups, and internal Q&A into both retrieval chunks and semantic cards.
- Parallel candidate generation with verification stages to accelerate throughput while filtering low-confidence outputs.
- SME review workflows optimized around approval or rejection of structured entries, not essay review.
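The verification-stage filtering above could look like the following sketch, where each generated candidate carries assumed `confidence` and `provenance` fields; only well-sourced, high-confidence candidates reach SME review:

```python
def triage(candidates, min_confidence=0.8):
    """Split generated semantic-card candidates into a review queue and a reject pile.

    Each candidate is a dict with 'card', 'confidence', and 'provenance'
    (illustrative fields). Nothing enters Hopper without a human approval
    step; this filter only controls what reaches reviewers, so throughput
    scales without burying SMEs in low-quality output.
    """
    review_queue, rejected = [], []
    for cand in candidates:
        if cand.get("provenance") and cand.get("confidence", 0.0) >= min_confidence:
            review_queue.append(cand)
        else:
            rejected.append(cand)
    return review_queue, rejected
```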
4. Integration with the tool layer
Hopper must interoperate cleanly with ConteXtO-style tools and related internal action surfaces.
This is where Hopper stops being a retrieval layer and becomes an operational system. The integration needs to make tool use legible, bounded, and verifiable so the model knows when to answer, when to act, and how to justify that choice. Done well, this reduces prompt bloat, lowers tool misuse, and creates a clean trace from user intent to observed result.
Deliverables:
- A consistent interface for deciding when Hopper should answer directly and when it should invoke a tool.
- Retrieval-based tool selection rather than broad prompt stuffing of tool specs.[4]
- Closed-loop traceability from question to retrieval citations to tool calls to results to post-checks to final recommendation.
- Tool-definition hygiene standards covering naming, parameter contracts, invalid-state prevention, and edge-case examples.[13]
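Retrieval-based tool selection plus an answer-vs-tool policy can be sketched as follows; the lexical scoring is a placeholder for real retrieval, and the tool fields are illustrative, not ConteXtO’s actual interface:

```python
def select_tools(query, tools, k=2, min_score=1):
    """Retrieve only the top-k relevant tool specs instead of stuffing
    every tool definition into the prompt."""
    terms = set(query.lower().split())
    scored = []
    for tool in tools:
        text = (tool["name"] + " " + tool["description"]).lower()
        s = len(terms & set(text.split()))
        if s >= min_score:
            scored.append((s, tool))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [tool for _, tool in scored[:k]]


def route(query, tools):
    """Answer-vs-tool policy: act only when a relevant tool is retrieved."""
    selected = select_tools(query, tools)
    if not selected:
        # No relevant action surface: answer directly, with citations.
        return {"mode": "answer", "tools": []}
    # Relevant tools found: propose tool calls, subject to post-checks.
    return {"mode": "tool", "tools": [t["name"] for t in selected]}
```

The point of the sketch is the shape of the decision: tool specs are retrieved like documents, and "no match" degrades safely to answer mode rather than forcing a tool call.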
5. Governance, security, and operational rollout
The engagement should operate within enterprise controls from the start.
This workstream is what makes the program deployable inside a real enterprise environment rather than a lab setting. The intent is to build auditability, least privilege, and approval boundaries into the operating model from day one instead of bolting them on after the fact. That matters both for risk control and for making the eventual automation envelope explicit and defensible.
Deliverables:
- Least-privilege patterns for data retrieval and action execution.
- Full provenance on answers and recommendations.
- Audit logs for retrieval paths, tool calls, and high-risk decisions.
- Human approval gates for action mode until evidence supports broader automation.
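A minimal sketch of an approval gate with audit logging, assuming a simple two-level risk label and a human approver callback; the record fields are illustrative:

```python
import json
import time


def execute_with_gate(action, risk, approver=None, log=None):
    """Run an action only if it is low-risk or explicitly approved,
    and append an audit record either way.

    action:   zero-argument callable performing the change
    risk:     'low' or 'high' (illustrative labels)
    approver: callable returning True/False for high-risk actions
              (the human-in-the-loop gate)
    log:      list collecting JSON audit records
    """
    approved = risk == "low" or (approver is not None and approver())
    record = {"ts": time.time(), "risk": risk, "approved": approved,
              "executed": approved}
    result = action() if approved else None
    if log is not None:
        log.append(json.dumps(record))
    return result
```

Note that the audit record is written whether or not the action runs: denied high-risk attempts are evidence, and they are exactly what later justifies (or blocks) widening the automation envelope.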
Phased delivery plan
The work should be managed by exit criteria rather than fixed calendar promises.
Phase A: Alignment and baselining
Exit criterion: measurable starting point.
- Define the golden set of representative workflows and questions.
- Establish baseline metrics and dashboards.
- Define autonomy boundaries, review gates, and success criteria.
Phase B: Semantic corpus and governance
Exit criterion: first production-usable semantic corpus.
- Build the semantic schema.
- Stand up ingestion and review workflows.
- Populate initial high-value domains with approved artifacts.
Phase C: Retrieval and answer hardening
Exit criterion: measurable lift on the golden set.
- Tune retrieval precision and recall against real internal questions.
- Add source-required answer formatting and uncertainty labeling.
- Introduce automated grading and regression gates.
Phase D: Action-layer integration
Exit criterion: end-to-end reduction in workflow iterations.
- Connect Hopper decisions to tool execution paths.
- Add post-condition checks and trace logging.
- Restrict action mode to high-confidence scenarios with strong checks.
Phase E: Continuous improvement
Exit criterion: predictable weekly gains.
- Feed failures and new edge cases back into the semantic backlog.
- Run nightly or weekly semantic regression suites.
- Expand domain coverage without sacrificing measured faithfulness.
Operating model
This engagement is designed to strengthen, not replace, the existing Hopper team.
Workday responsibilities
- Product ownership for workflow priorities and adoption.
- DSL and metadata SMEs.
- Access to representative tasks, artifacts, and runtime validation paths.
- Designated reviewers for domain approvals and policy decisions.
Isara responsibilities
- Evaluation infrastructure and measurement discipline.
- Knowledge engineering and ingestion pipelines.
- Retrieval and tool-selection architecture.
- Multi-agent orchestration for high-throughput artifact generation.
- Integration engineering and governance implementation.
Success metrics
Recommended top-line metrics:
- Reduction in iteration count for difficult internal coding-agent workflows.
- Developer time saved by workflow category, measured against internal baselines.
- First-try correctness on canonical semantics questions.
- Faithfulness and citation quality on Hopper responses.
- Tool-call success rate and post-condition pass rate.
- Weekly throughput of validated semantic artifacts and regression coverage.
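As a sketch of how several of these metrics could be computed from per-run traces; the trace fields here are illustrative assumptions about what the harness records:

```python
def summarize(runs):
    """Aggregate per-run trace records into top-line metrics.

    Each run is a dict with 'correct_first_try' (bool), 'iterations' (int),
    and 'tool_calls' (list of dicts with 'ok' and 'post_ok' booleans).
    These field names are illustrative, not an existing Workday format.
    """
    n = len(runs)
    calls = [c for r in runs for c in r["tool_calls"]]
    return {
        "first_try_correct": sum(r["correct_first_try"] for r in runs) / n,
        "mean_iterations": sum(r["iterations"] for r in runs) / n,
        "tool_call_success": sum(c["ok"] for c in calls) / max(len(calls), 1),
        "postcondition_pass": sum(c["post_ok"] for c in calls) / max(len(calls), 1),
    }
```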
Why Isara can outperform internal-only execution
Our differentiator is not better prompting. It is a combination of proprietary research, proven multi-agent systems expertise, and experience deploying bespoke AI systems inside high-value, regulated environments.[22]
Our core technology is frontier multi-agent coordination: orchestrating dozens to hundreds of specialized agents to decompose hard problems, gather and cross-reference information, and synthesize structured, citation-backed outputs that exceed what a single model can do alone.[22] That matters here because Hopper is not a toy summarization problem. It is a systems problem involving semantics, evaluation, tool routing, and high-stakes edge cases across multiple internal domains. The relevant claim is not simply that we can run more agents. It is that we have proprietary research and deployment experience in making complex agent systems actually work in production.[14][15][22]
The second part of our argument is deployment maturity. We do our best work inside client environments, with auditable outputs, full usage visibility, and compliance-aware deployment patterns rather than a generic SaaS model.[22] For Workday, that is a materially stronger fit than a vendor whose primary mode is an externally hosted copiloting layer. If Hopper is going to touch sensitive internal workflows, tools, and domain knowledge, our ability to deploy into Workday-controlled environments with clear auditability is part of the technical value, not just a commercial convenience.[5][22]
The third part of our argument is system design fit. We build systems around a client’s workflows, data, and operational constraints rather than adapting a general-purpose product to approximate them.[22] That is exactly the shape of the Hopper problem. Workday does not need a generic enterprise chatbot. It needs a system tailored to XO semantics, metadata-heavy development, internal tools, and internal governance. Our value comes from shaping the system to Workday’s actual bottlenecks so the deployment compounds from real usage and feedback over time.[3][22]
Our strongest claim should still be treated as falsifiable, not rhetorical. The first milestone should function as a constrained bake-off with equal access, equal guardrails, and equal review bandwidth. Decision criteria should be explicit: validated semantic artifacts per week, regression improvement on the golden set, end-to-end decrease in iterations for target workflows, and evidence that the resulting system can operate within Workday’s control, audit, and security requirements. If we do not show a meaningful lead on those measures, the program should not expand.[6][7][22]
Commercial structure
Recommended engagement options:
- Fixed-scope build: defined deliverables for evaluation harness, semantic pipeline, tool integration, and initial domain coverage, with change control for added domains.
- Time-and-materials acceleration: embedded team model with weekly measurable outputs against agreed metrics.
- Outcome-based structure: base fee plus success fee tied to improvements in agreed metrics such as iteration count reduction, task completion improvement, or faithfulness gains on the golden set.
Assumptions
- Workday can provide access to enough non-sensitive representative tasks and artifacts to define a realistic harness.
- There is a practical way to validate recommendations against real behavior or a faithful simulation.
- Knowledge sources used for Hopper can be access-controlled, versioned, and governed.
- Internal stakeholders are available to review and approve semantic artifacts at a sustainable cadence.
Key risks and mitigations
- Risk: document-search plateau. Mitigation: structured semantics, executable examples, and regression-gated evaluation rather than relying on RAG over prose alone.[3][11]
- Risk: tool-surface explosion. Mitigation: retrieval-based tool selection, strict tool-definition hygiene, and context compression.[12][4][13]
- Risk: apparent speedup hides rework. Mitigation: optimize for end-to-end completion and verified post-conditions, not proxy metrics alone.[9][7]
- Risk: unsafe automation. Mitigation: least privilege, action gating, auditability, and human approval until measured evidence supports expansion.[5]
References
Why Workday Is Different by Design, and Why It Matters | Workday US
Exploring Workday’s Architecture | Workday Technology Blog | Medium
Optimizing and Evaluating Enterprise Retrieval-Augmented Generation (RAG): A Content Design Perspective
RAG-MCP: Mitigating Prompt Bloat in LLM Tool Selection via Retrieval-Augmented Generation
Building Enterprise Intelligence: A Guide to AI Agent Protocols for Multi-Agent Systems | Workday US
The Impact of AI on Developer Productivity: Evidence from GitHub Copilot
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
When too many tools become too much context - WRITER
AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
Self-Consistency Improves Chain of Thought Reasoning in Language Models
Retrieval-Augmented Generation for Large Language Models: A Survey
Improving Retrieval Augmented Generation accuracy with GraphRAG | AWS Machine Learning Blog
Reflexion: Language Agents with Verbal Reinforcement Learning
Isara Laboratories — AI Research for Financial Institutions