Skip to main content
Jonathan Andrei
Back to all posts
Jun. 202611 min read

FlakeWarden: 90.7% Accuracy and a 0% Safety False-Positive Rate on Flaky-Test Triage, on UiPath Maestro

Flaky tests are the most corrosive failure mode in CI: a red build might be a real regression or just noise, and engineers eventually start ignoring red builds until a real bug ships. FlakeWarden answers the only question that matters (real defect, flaky, or environment) with a deterministic flake-scorer for the clear cases and a grounded UiPath Agent Builder classifier for the ambiguous ones, orchestrated through Maestro with a human approving every change. 90.7% accuracy on a 150-case corpus, with 0% safety false-positive rate enforced by mechanism.

AgentHackUiPathUiPath MaestroUiPath Agent BuilderAgentic AITest AutomationFlaky TestsPython
I created this post and the FlakeWarden project for the UiPath AgentHack 2026 (Track 3, UiPath Test Cloud). #AgentHack

Flaky tests are the most corrosive failure mode in CI. When a red build might be a real regression or just noise, an engineer either burns 15 to 45 minutes triaging every failure or, worse, starts ignoring red builds until a genuine regression ships. Google's continuous-testing study reported that about 16% of their tests had some flakiness and that about 84% of pass-to-fail transitions came from flaky tests. As an illustrative model: at a 5% flake rate a 2,000-test suite produces around 100 spurious failures per full run, which at 15 to 45 minutes of triage each is tens of engineer-hours per cycle.

FlakeWarden looks at a failing test's execution history and the surrounding evidence and answers the only question that matters: is this a real defect, a flaky test, or an environment problem? It then routes each failure to the right action, with a human in charge of every change.

FlakeWarden architecture: UiPath Test Cloud feeds execution history into a Maestro process containing a deterministic flake-scorer, a grounded Agent Builder classifier, and a governed action router. Confident verdicts skip the LLM; ambiguous ones go through the classifier. Every flaky fix or quarantine is gated by Action Center before the Orchestrator applies it.
Deterministic where it must be exact, generative where the context is messy. The boundary is the architecture.

The deterministic-vs-generative boundary

  • Deterministic flake-scorer: auditable statistics over run history. Handles the clear cases and never guesses. About a third of failures (52 of 150 in the eval) resolve here with no model call.
  • Grounded Agent Builder classifier: RAG over stack traces, DOM diffs, commit messages, and runner logs. Reasons over only the ambiguous failures and labels each one real_defect, flaky, or environment with evidence-cited reasoning.
  • UiPath Maestro orchestrates the two plus a Repair Agent. Every fix, quarantine, or baseline change passes through a mandatory Action Center human-review gate, never an autonomous mutation.

The headline number: 0% safety false-positive rate

Measured on a 150-failure labeled corpus: 90.7% overall accuracy and a 0% safety-direction false-positive rate, meaning no real regression is ever auto-hidden as flaky or environment. The 0% rate is enforced by mechanism, not luck: the scorer only auto-resolves a defect on a positive selector fingerprint; a flaky-looking history with any regression hint is double-checked by the classifier; and the classifier tie-breaks toward real_defect on split evidence. eval/negative_control.py runs as a CI gate that fails the build if a planted defect is ever auto-healed.

The remaining 12% is noise-direction error (a flaky or environment failure escalated as a defect). That wastes a little triage time but hides nothing, which is the right tail to land on for a CI guardrail.

Agent Builder Triage Classifier verdict screen showing label real_defect on Checkout_E2E_007 with 0.95 confidence, the structured output schema fields, and grounded reasoning citing the failing assertion and the recent commit message
real_defect verdict on Checkout_E2E_007 (0.95 confidence). Notice the empty proposed_fix: the agent refuses to suggest a fix when the failure is a real bug, even though it could.
Same agent producing a flaky verdict on Cart_UI_012 with a proposed selector/synchronization fix
flaky verdict on Cart_UI_012 with a proposed_fix. Only flaky verdicts populate the fix; the action still requires Action Center approval.
Agent Builder verdict screen showing the environment label on Payments_API_003 with a confidence of 0.97 and grounded reasoning pointing to an external API timeout
environment verdict on Payments_API_003 (0.97). Same agent, same schema, three different correct labels across the verified happy paths.

Built end-to-end with Claude Code driving the UiPath uip CLI

The entire solution (the deterministic scorer, the classifier interface, the eval harness, the Maestro orchestration, packaging, and deployment) was scaffolded and iterated using Claude Code driving the UiPath uip CLI (UiPath for Coding Agents). The session ran uip login, uip skills, uip tools install, uip agent deploy, and uip maestro bpmn registry/init/validate against an EU UiPath Labs tenant, with the orchestrator process artifact (Solution.1.agent.Agent) deployed and the BPMN passing uip maestro bpmn validate.

Terminal showing Claude Code driving the uip CLI: package, publish, deploy commands running against the UiPath Automation Cloud tenant
Coding-agent loop with the uip CLI. Every command in the loop has a deterministic exit code, which is what makes an agent-driven build trustworthy.
UiPath Orchestrator processes page showing FlakeWarden solution package published v1.0.0 and deployed as an Orchestrator process named Solution.1.agent.Agent
Triage Classifier agent v1.0.0 deployed as an Orchestrator process. The agent itself is live; the deployment-plumbing step left is wiring the BPMN's Orchestrator.StartAgentJob call to a serverless robot.

Maestro orchestration (authored entirely through the CLI)

Maestro process: Start → agent call (Orchestrator.StartAgentJob on the deployed Triage Classifier) → verdict extraction → exclusive gateway on the label → three routed branches: flaky → human-gated heal, real_defect → escalate, environment → re-run. No graphical canvas: every node, edge, and registry entry was authored as the uip CLI's BPMN/JSON output and validated locally before deploy.

Terminal output of uip maestro bpmn validate against the FlakeWarden BPMN, returning 0 errors and confirming the registry is consistent
uip maestro bpmn validate passes against the CLI-authored process. The validator catches missing registry entries and dangling gateway branches; a green output means the orchestration is at least structurally trustworthy.
Maestro process run trace showing the path taken by an ambiguous failure: scorer escalation → agent classifier returning a flaky verdict → routing into the human-gated heal branch with the proposed fix attached
An ambiguous failure walks the path: scorer → classifier → flaky verdict with a proposed_fix → routed into the human-gated heal branch. The Action Center task is what stands between the agent and any change to the test suite.

What was hard

An early version of the evaluation was circular: the corpus encoded the exact signals the model read back. I caught it in an adversarial self-review and de-rigged the corpus (noisy signals, decoupled vocabulary, genuine conflict cases), re-measuring at an honest 90.7% instead of a tautological number. The lesson: evaluation-driven development is only honest if the eval set is decoupled from the model under test.

Making the 0% safety rate real, not designed: it is now enforced by a flaky-band regression guard plus a tie-break toward real_defect, so a real bug can never be silently healed. eval/negative_control.py runs as a CI gate; if a planted defect ever gets routed away from the defect branch, the build fails.

UiPath platform learning curve: binding prompt arguments uses the @ picker, not {{ }} typed text, and the in-product agent evaluator could not route its model in the EU tenant (HTTP 417 data-residency). The agent itself runs fine; the evaluator model routing is a real gap I am submitting as product feedback.

Honest limitations

The 150-case corpus is synthetic-but-adversarial: a solo builder cannot ship a real enterprise's CI history. Production requires connecting the live Test Manager results API in place of the seeded corpus, then running a shadow-mode prospective study to report real-world accuracy against a gold-standard labeled set. The architecture, governance gates, and eval methodology are production-shaped; the data is the gap. The headline 0% safety rate also stands or falls with the mechanism that enforces it; on a different evidence distribution that mechanism needs re-validation, not blind reuse.

Deterministic where it must be exact, generative where the context is messy, and a human in charge of every change. The headline number isn't the 90.7%, it's the 0% safety false-positive rate, because that is the number that decides whether a CI agent is trustworthy enough to live in front of every red build.
Related project

FlakeWarden: Agentic Flaky-Test Triage With a 0% Safety False-Positive Rate, on UiPath Maestro

View the project