How to sleep soundly when lives depend on you
Testing at Orchestra Health
“If we’d had Orchestra last year when we did this case, he’d probably still be alive.”
This was a comment made by a customer after we completed their first live demo.
By live demo, I mean we took in real data, with real patients and real information available then, and showed them what might have gone differently if they’d used Orchestra. As a sales tactic, we’ve found it to be effective.
This time, however, was slightly different, as we were conducting the demo as a postmortem, literally, for the first time. The patient in this scenario had died intraoperatively after experiencing cardiac arrest. The question was: why?
Preventing scenarios like this is a core reason Orchestra Health exists. When we're working optimally as a patient readiness platform, we understand what is happening, who it's happening with, how it will happen, and, most importantly, where things may go wrong.
In this case, the patient had suffered a cardiac incident under anesthesia five years prior. The incident had been reported, documented, and shared … only to be subsequently buried by five years of patient data, clinic notes, lab results, imaging studies, referrals, and medication changes. The signal was there; it just wasn't surfaced.
How Do You Test an Agentic, Non-Deterministic System When Life Hangs in the Balance?
Orchestra allows health systems to leverage agentic systems to understand what needs to happen next in a fraction of the time it takes with conventional means. For some of our customers, it is integral to their entire operation.
This creates a problem: these systems are non-deterministic. The same input does not always produce the same output. Model behavior changes with updates from our sub-processors. Context windows, token sampling, and inference variability all introduce unpredictability.
We need to iterate on these systems while preventing regressions pre-deployment and identifying failures post-deployment. The post-deployment criterion is particularly important. We are responsible for the quality of our outputs to a standard that our sub-processors—OpenAI, Anthropic, and others—are not. Their liability ends at the API. Ours does not.
To address this, we have divided our testing into categories, each with specific criteria to determine appropriate use cases.
Imperative Testing (Structured Output)
When: Pre-deployment
This is the most straightforward and most familiar method. It resembles traditional test suites.
The approach: structure prompts to deliver structured outputs. Maintain a vetted library of inputs with known expected outputs. Run the suite on each deployment candidate and compare results.
How it works in practice:
We maintain test cases derived from real clinical scenarios. Each case includes:
Input data (patient records, procedure types, clinical context)
Expected output structure (risk flags, missing items, recommendations)
Validation criteria (which fields must match, which can vary, and which indicate failure)
For a prompt designed to identify medication contraindications, the test case might include a patient on Ozempic scheduled for a procedure requiring discontinuation. The expected output is a flag indicating that the Ozempic must be held. If the system fails to produce that flag, the test fails.
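The validation criteria above can be sketched as a small harness. This is an illustrative reconstruction, not Orchestra's actual test framework; the `TestCase` fields and the `hold_glp1_agonist` flag name are assumptions chosen to mirror the Ozempic example.

```python
from dataclasses import dataclass, field

@dataclass
class TestCase:
    input_data: dict        # patient records, procedure type, clinical context
    must_match: dict        # fields that must equal the expected value exactly
    must_contain: dict      # list fields that must include the expected items
    forbidden: set = field(default_factory=set)  # fields whose presence indicates failure

def evaluate(case: TestCase, output: dict) -> list[str]:
    """Compare a deployment candidate's structured output against expectations."""
    failures = []
    for key, expected in case.must_match.items():
        if output.get(key) != expected:
            failures.append(f"{key}: expected {expected!r}, got {output.get(key)!r}")
    for key, items in case.must_contain.items():
        missing = set(items) - set(output.get(key, []))
        if missing:
            failures.append(f"{key}: missing {sorted(missing)}")
    for key in case.forbidden:
        if key in output:
            failures.append(f"{key}: present but indicates failure")
    return failures

# The Ozempic case: the expected output is a flag to hold the medication.
case = TestCase(
    input_data={"medications": ["semaglutide (Ozempic)"], "procedure": "colonoscopy"},
    must_match={},
    must_contain={"medication_flags": ["hold_glp1_agonist"]},
)
candidate_output = {"medication_flags": ["hold_glp1_agonist"], "risk_level": "moderate"}
assert evaluate(case, candidate_output) == []  # no failures: the test passes
```

Separating "must match" from "must contain" is one way to soften the brittleness noted below: fields that legitimately vary between runs are simply left out of the criteria.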
Benefits:
Fast execution
Low cost
Deterministic pass/fail criteria
Integrates with existing CI/CD pipelines
Limitations:
Brittle. Non-deterministic outputs may fail tests despite being clinically valid.
Limited coverage. Test cases cannot anticipate every real-world scenario.
False confidence. Passing tests does not guarantee correctness on novel inputs.
We use imperative testing as a baseline. It catches blatant regressions and ensures structural consistency. It does not catch subtle degradations in clinical reasoning.
Hash Checks
When: Post-deployment
We use a testing method we call Hash Checks when producing a solution is complex but validating one is cheap, much as a hash is expensive to invert but trivial to verify. In these scenarios, solutions produced by an LLM or other ML model can be validated inexpensively by deterministic means.
How it works in practice:
We have multiple small systems that detect show-stopper conditions that are often buried across multiple sources but can be confirmed by the patient with a simple yes/no question. There are hundreds of these conditions to check in any given case, which makes asking every question of every patient impractical and, counterintuitively, less accurate.
After detecting these conditions, we can confirm via a notification sent to the patient or their past provider. Using this two-step system, we reduce the cognitive overhead on the patient, remove steps from the process, and increase the coverage of conditions we’re searching for.
One drawback to this approach is that delivery is not guaranteed, so you need to design an explicit failure state into your system. Delivery can fail because the imperative checks may disagree with the AI output (the LLM's answer fails the static checks and therefore is never surfaced to the user), creating a conflict the system must handle.
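The detect-then-confirm flow, including the conflict failure state, can be sketched like this. The function and field names (`hash_check`, `imperative_ok`, `send_confirmation`) are illustrative assumptions, not Orchestra's internals.

```python
from enum import Enum

class Resolution(Enum):
    CONFIRMED = "confirmed"      # patient or provider answered yes
    CLEARED = "cleared"          # answered no
    CONFLICT = "conflict"        # imperative checks disagree with the model
    UNDELIVERED = "undelivered"  # confirmation never arrived

def hash_check(model_findings: list[dict], imperative_ok, send_confirmation) -> dict:
    """Route each model-detected condition through a cheap deterministic gate,
    then a yes/no confirmation. Every finding ends in an explicit state rather
    than being silently dropped, since delivery is not guaranteed."""
    results = {}
    for finding in model_findings:
        condition = finding["condition"]
        if not imperative_ok(finding):
            # The LLM's answer fails the static checks: do not surface it,
            # but record the conflict for engineering review.
            results[condition] = Resolution.CONFLICT
            continue
        answer = send_confirmation(condition)  # yes/no question to patient/provider
        if answer is None:
            results[condition] = Resolution.UNDELIVERED  # designed failure state
        else:
            results[condition] = Resolution.CONFIRMED if answer else Resolution.CLEARED
    return results

# Example: a buried prior cardiac event, corroborated by enough sources to pass
# the static gate, confirmed by the patient.
states = hash_check(
    [{"condition": "prior_anesthesia_cardiac_event", "source_count": 3}],
    imperative_ok=lambda f: f["source_count"] >= 2,  # assumed static rule
    send_confirmation=lambda c: True,                # patient says yes
)
assert states["prior_anesthesia_cardiac_event"] is Resolution.CONFIRMED
```

The point of the enum is that a conflict or a lost notification is a first-class outcome, not an exception path.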
Benefits:
Highly accurate, complex results delivered to users
Broader range of checks
Limitations:
Solutions must be confirmable through deterministic means
Delivery is not guaranteed
Human-in-the-Loop
When: Post-deployment
For sensitive scenarios, we engage a human to validate output before delivering the final result to the customer.
The approach depends on the risk level:
Sampled review: A percentage of outputs are routed to human reviewers. This provides ongoing monitoring of system performance without requiring review of every case.
Full review: All outputs for designated high-risk scenarios are routed to human reviewers before delivery. The system does not release the output until a reviewer approves it.
How it works in practice:
When Orchestra generates a readiness assessment for a high-acuity case—such as cardiac surgery, transplant, or complex oncologic resection—the output is retained. A clinician from our reviewer network receives the case, reviews the source data against the system’s output, and either approves, modifies, or rejects the assessment.
Approved assessments are delivered to the customer. Modified assessments are delivered with corrections. Rejected assessments are flagged for engineering review.
The review serves two purposes:
Validating sensitive outputs. The customer receives an assessment that has been verified by a qualified clinician, not just generated by a model.
Flagging faulty outputs. Rejected and modified assessments feed back into our monitoring systems. Patterns of modification indicate systematic issues. Rejections indicate potential model failures.
Benefits:
Highest confidence for critical outputs
Generates labeled data for model improvement
Provides clinical validation our sub-processors cannot offer
Limitations:
Slow. Human review adds latency.
Expensive. Implementation and reviewer capacity are costly.
We can deliver this human-in-the-loop validation at scale because of the clinical network we have built over the past year. This network—physicians, nurses, pharmacists, and other clinicians—provides the capacity to review outputs across specialties and time zones. It also provides the domain expertise to catch errors that a general-purpose reviewer would miss.
For our customers, this means they receive clinically validated patient readiness assessments without requiring their own staff to perform the validation. For us, it means we have a real-time signal on system performance from the people most qualified to evaluate it.
Outcome Tracking
When: Post-deployment (delayed)
The ultimate test of a patient readiness system is patient outcomes. Did the flagged risks materialize? Did the missed risks cause harm? Did the assessments improve decision-making?
The approach: track downstream outcomes and correlate them with system outputs.
How it works in practice:
For patients whose assessments flagged specific risks, we track whether those risks manifested. A patient flagged for cardiac risk who experiences a cardiac event during surgery validates the flag. A patient flagged for bleeding risk who proceeds without incident may indicate an overly sensitive threshold—or may indicate that the flag prompted appropriate precautions.
For patients whose assessments did not flag risks, we track adverse events. An unflagged patient who experiences a preventable complication indicates a miss.
Over time, these correlations inform threshold tuning, prompt refinement, and overall system calibration.
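The flag/outcome correlation can be sketched as a simple bucketing step. Field names are assumptions; the key detail, taken from the text, is that a flagged case with no event is recorded as ambiguous rather than as a false positive, because the flag may have prompted the precautions that prevented the event.

```python
from collections import Counter

def correlate(cases: list[dict]) -> Counter:
    """Bucket each case by (flagged, adverse_event) to inform threshold tuning."""
    buckets = Counter()
    for case in cases:
        flagged, event = case["flagged"], case["adverse_event"]
        if flagged and event:
            buckets["validated_flag"] += 1
        elif flagged and not event:
            buckets["ambiguous"] += 1   # confounded: flag may have prevented the event
        elif event:
            buckets["miss"] += 1        # unflagged patient with a complication
        else:
            buckets["true_clear"] += 1
    return buckets

cases = [
    {"flagged": True,  "adverse_event": True},   # cardiac flag, cardiac event
    {"flagged": True,  "adverse_event": False},  # bleeding flag, no incident
    {"flagged": False, "adverse_event": True},   # preventable complication, missed
]
assert correlate(cases)["miss"] == 1
```

Only the "miss" bucket is unambiguous ground truth; the other buckets feed calibration rather than pass/fail judgments.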
Benefits:
Ground truth evaluation
Directly measures clinical value
Informs continuous improvement
Limitations:
Long feedback loops. Outcomes may not be known for days, weeks, or months.
Confounded by interventions. A flagged risk that does not materialize may mean the flag was wrong, or may mean the flag prompted actions that prevented the event.
Requires data access. Not all customers share outcome data.
We use outcome tracking as the long-term signal. It does not catch issues in real time, but it tells us whether the system is doing what it is supposed to do: improving patient safety.
The Testing Stack
No single method is sufficient. Each addresses a different failure mode, operates on a different timescale, and provides a different type of signal.
Imperative Testing
Timing: Pre-deployment
Catches: Structural errors, obvious regressions
Misses: Subtle reasoning failures
Hash Checks
Timing: Post-deployment
Catches: Buried show-stopper conditions confirmable by deterministic means
Misses: Anything that cannot be reduced to a cheap yes/no confirmation
Human-in-the-Loop
Timing: Post-deployment
Catches: Clinical errors, edge cases
Misses: Cases outside review coverage; scale limits sampling
Outcome Tracking
Timing: Delayed
Catches: Real-world effectiveness
Misses: Immediate issues
We run all of them. They are not redundant. They are complementary.
Testing in practice
The patient who died had a documented cardiac incident from five years prior. The documentation existed. It was in the record. A human reviewer with unlimited time could have found it.
The problem was not missing data. The problem was buried data. Five years of subsequent records—hundreds of notes, thousands of data points—had pushed the relevant signal below the threshold of human attention.
Our system surfaced it. It did so because we designed it to look for exactly this kind of risk, and because we tested it against cases where these risks were present and missed.
That customer now uses Orchestra. Their coordinators no longer rely on their own ability to read through five years of records and hope they catch the relevant detail. The system flags it. A human confirms it. The surgical team knows before the patient enters the OR.


