Bug Boomerang
An autonomous pipeline that takes a Linear bug ticket and turns it into a tested, reviewed draft PR — with zero human intervention. Label a ticket, get a PR.
What is Bug Boomerang?
A bug gets filed. A developer reads it. They investigate the codebase. They figure out the root cause. They write a plan. They code a fix. They review their own work. They set up a test environment. They test the fix. They create a PR. They post an update on the ticket.
That's two hours minimum for a trivial bug. Half a day for anything non-trivial. And most of that time isn't thinking — it's scaffolding. Cloning branches, setting up environments, running audits, writing PR descriptions. Mechanical work that eats developer hours but requires just enough context to resist automation.
Bug Boomerang takes a Linear bug ticket and turns it into a tested, reviewed draft PR — with zero human intervention. Not "assists with." Not "suggests." It does the whole thing.
Label a ticket, get a PR
A bug label gets applied to a Linear issue. Bug Boomerang picks it up, investigates the codebase, identifies the root cause, writes a fix plan, codes the solution, reviews its own work in a self-correcting loop, provisions a production-like sandbox, validates the fix in a real browser, and creates a draft PR. The developer's first touch is reviewing a PR that already works. 15–30 minutes. A few dollars in compute.
The Pipeline
The Self-Correcting Code Loop
This is where it gets interesting. A coding agent writes the fix based on the plan. Then a separate review agent examines the patch with structured output: is the fix correct? Are there issues? Each finding has a priority, a title, and a detailed explanation.
If the reviewer says it's correct, the pipeline moves on. If it finds problems, the findings get fed back to the coder as context for the next iteration. The coder doesn't see its own previous attempt in isolation — it sees what the reviewer thought was wrong and addresses it specifically.
Between rounds, automated audit scripts check for rule violations and schema consistency. If the code breaks project conventions, that feedback enters the loop too.
```
Coder writes fix (from plan + prior review feedback)
        ↓
Audit scripts run (rules + schema checks)
        ↓
Review agent examines patch (structured JSON output)
        ↓
"patch is correct" → PASS → continue to sandbox
        ↓
Findings with priorities → FAIL → feed back to coder
        ↓
Repeat (up to 5 rounds)
        ↓
5 rounds exhausted → escalate to human with full context
```

This isn't "generate code and hope." It's a feedback loop where two independent AI agents converge on a correct solution through structured critique.
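The loop above can be sketched in a few lines. `runCoder`, `runAudits`, and `runReviewer` are hypothetical stand-ins for the real agent calls (they are injected, so the control flow itself is testable); the reviewer's `{ correct, findings }` shape mirrors the structured output described above but is an assumption.

```javascript
// Sketch of the coder/reviewer convergence loop. The three agent calls are
// hypothetical injected functions, not Bug Boomerang's actual API.
async function codeReviewLoop({ runCoder, runAudits, runReviewer, maxRounds = 5 }) {
  let feedback = []; // reviewer + audit findings from the prior round
  for (let round = 1; round <= maxRounds; round++) {
    const patch = await runCoder(feedback);        // coder sees prior critique, not just its own output
    const auditFindings = await runAudits(patch);  // rule + schema checks between rounds
    const review = await runReviewer(patch);       // structured JSON: { correct, findings: [...] }
    if (review.correct && auditFindings.length === 0) {
      return { status: 'pass', patch, rounds: round };
    }
    feedback = [...auditFindings, ...review.findings];
  }
  return { status: 'escalate', feedback };         // rounds exhausted → human with full context
}
```

The key design point: feedback is cumulative input to the next coding round, so the coder addresses the reviewer's specific objections instead of regenerating blind.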
The Sandbox: A Production-Like World in 60 Seconds
For bugs that touch the UI, Bug Boomerang provisions an ephemeral test environment from scratch:
Database isolation via Neon branching
The sandbox gets a private copy of the production database using Neon's copy-on-write branching. Not a mock. Not a seed script. A real database with real data, isolated so nothing the test does can touch production. The branch gets deleted when the test is done.
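For orientation, the branch lifecycle maps onto two calls against Neon's public REST API. This is a hedged sketch based on that API's documented endpoint shapes, not Bug Boomerang's actual client; the functions only build request descriptors, leaving the HTTP call to the caller.

```javascript
// Hedged sketch of the Neon branch lifecycle (create before the test, delete
// after). Endpoint paths follow Neon's public v2 REST API; body fields are
// illustrative assumptions.
const NEON_API = 'https://console.neon.tech/api/v2';

function createBranchRequest(projectId, apiKey, branchName) {
  return {
    method: 'POST',
    url: `${NEON_API}/projects/${projectId}/branches`,
    headers: { Authorization: `Bearer ${apiKey}`, 'Content-Type': 'application/json' },
    body: JSON.stringify({ branch: { name: branchName } }), // copy-on-write child of the parent branch
  };
}

function deleteBranchRequest(projectId, apiKey, branchId) {
  return {
    method: 'DELETE',
    url: `${NEON_API}/projects/${projectId}/branches/${branchId}`,
    headers: { Authorization: `Bearer ${apiKey}` },
  };
}
```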
Full application stack via Vercel Sandbox
The fix branch gets deployed into a sandboxed Vercel environment — API server, web frontend, reverse proxy for same-origin auth, CDN schemas, the works. Environment variables are injected, dependencies installed, services started.
Auth health verification
Before any test runs, the sandbox validates itself: Can a user sign in? Does the session persist? Do the schemas load? Does the API respond? Only when every health check passes does testing begin.
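The gate logic is simple to sketch: run each probe in order and abort on the first failure. The check names here are illustrative; the real health checks and their order are not specified beyond the list above.

```javascript
// Minimal sketch of the sandbox self-check gate: every probe must pass before
// browser testing starts. A throwing probe counts as a failure.
async function runHealthChecks(checks) {
  for (const { name, probe } of checks) {
    const ok = await probe().catch(() => false);
    if (!ok) return { healthy: false, failed: name }; // first failure aborts the gate
  }
  return { healthy: true };
}
```

A caller would pass something like `[{ name: 'sign-in', probe: trySignIn }, { name: 'api', probe: pingApi }]`, where each probe resolves to a boolean.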
Browser testing with a real browser
A Browser Use agent opens the sandbox in a real browser. It logs in, navigates to the affected area, and tries to reproduce the original bug. Not a unit test. A real browser clicking real buttons on a real application backed by a real database.
The output is structured: PASS or FAIL, with a failure type (code issue or environment issue), a summary, and specific findings.
FAIL → loop back
If the fix failed because of a code issue, the entire pipeline restarts: discovery runs again with the browser agent's feedback as additional context. New plan, new code, new review, new test. Up to 3 full restarts.
If the environment itself is broken — escalation. Bug Boomerang doesn't waste cycles debugging infrastructure.
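The routing decision after a browser run reduces to one small function. The field names (`verdict`, `failureType`) mirror the structured output described above but are assumptions about its exact shape.

```javascript
// Sketch of the post-browser-test routing: PASS ships a draft PR, environment
// failures escalate immediately, code failures restart the pipeline up to a cap.
function routeBrowserResult(result, restartsSoFar, maxRestarts = 3) {
  if (result.verdict === 'PASS') return 'create-draft-pr';
  if (result.failureType === 'environment') return 'escalate'; // don't debug infrastructure
  return restartsSoFar < maxRestarts
    ? 'restart-pipeline'  // rediscover with the browser agent's feedback as context
    : 'escalate';         // restarts exhausted → human with full context
}
```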
This entire environment — database branch, application stack, auth validation — spins up automatically, exists for exactly as long as the test needs it, and tears itself down afterward. No developer provisioned anything. No DevOps ticket was filed.
Escalation: Knowing When to Stop
Bug Boomerang doesn't pretend it can fix everything. It has explicit stop conditions:
| Condition | What happens |
|---|---|
| Code review exhausted (5 rounds) | Stops, tags configured users on Linear, bumps priority to High |
| Browser agent crashed | Stops, escalates with full context of what it tried |
| Environment failure | Stops, distinguishes infra problems from code problems |
| Frontend validation failed 3 times | Stops, provides all browser feedback to the human |
When it escalates, the human gets everything: what was discovered, what was planned, what was coded, what the reviewer said, what the browser saw. Full context, not just "it didn't work."
And if a human posts a comment on the Linear issue while a run is active, Bug Boomerang treats it as new input. It supersedes the current run, starts fresh, and carries the human's feedback forward as additional context. The human doesn't need to learn a new interface — they just comment on the ticket like they normally would.
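The supersede-on-comment behaviour can be sketched as a small state transition. The in-memory `runs` map is a stand-in for the persisted state; the field names are illustrative.

```javascript
// Sketch of comment-supersedes-run: a human comment on the Linear issue marks
// the active run superseded and seeds a fresh run that carries the comment
// forward as additional context.
function onIssueComment(runs, issueId, commentBody) {
  const active = runs.get(issueId);
  if (active && active.status === 'active') {
    active.status = 'superseded'; // the in-flight run stops at its next checkpoint
  }
  const next = {
    status: 'active',
    context: active ? [...active.context, commentBody] : [commentBody],
  };
  runs.set(issueId, next);
  return next;
}
```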
The Toolchain Is Pluggable
Every stage runs through a configurable toolchain. The agents, models, and providers are specified in config, not code:
```yaml
toolchain:
  discovery:
    model: "claude-sonnet-4-5-20250929"
    provider: "anthropic"
  coder:
    model: "claude-opus-4-6"
    provider: "anthropic"
  review:
    model: "o3"
    provider: "openai"
    outputMode: "json"
```

Want to A/B test GPT-5.3-Codex against Claude Opus? Change a config value. Every tool also supports environment variable overrides: `XENA_CODER_BIN`, `XENA_CODER_ARGS_JSON`, `XENA_CODER_TIMEOUT_MS`. The pipeline doesn't care which model writes the code — it cares that the review loop produces correct output.
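One plausible way the override layering could work, sketched here as an assumption: per-tool environment variables beat the YAML file. The variable names follow the `XENA_<TOOL>_*` pattern shown above; the merge order and field names are illustrative.

```javascript
// Hedged sketch of config resolution: env vars named XENA_<TOOL>_BIN,
// XENA_<TOOL>_ARGS_JSON, and XENA_<TOOL>_TIMEOUT_MS override the file config.
function resolveTool(name, fileConfig, env) {
  const prefix = `XENA_${name.toUpperCase()}_`;
  const resolved = { ...fileConfig };
  if (env[`${prefix}BIN`]) resolved.bin = env[`${prefix}BIN`];
  if (env[`${prefix}ARGS_JSON`]) resolved.args = JSON.parse(env[`${prefix}ARGS_JSON`]);
  if (env[`${prefix}TIMEOUT_MS`]) resolved.timeoutMs = Number(env[`${prefix}TIMEOUT_MS`]);
  return resolved;
}
```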
State Survives Everything
Every run is persisted to disk as a deterministic state machine. Every stage transition is recorded. Every agent output is saved as an artifact:
```
.xena/
  issues/{issueId}.json           # Run state machine
  artifacts/{issueId}/{runId}/    # Discovery, plan, review outputs
  sandbox-meta/{sandboxId}.json   # Environment metadata
  events.json                     # Dedup window (7-day)
```

If the server crashes mid-run, the state is there when it comes back. This isn't just crash recovery — it's auditability. You can trace exactly what happened at every stage of every run.
What Makes This Different
End-to-end, not assisted
From ticket to tested PR. Not a code suggestion tool. Not a co-pilot. The whole pipeline, autonomously.
Self-correcting code loop
Two independent AI agents (coder + reviewer) iterate through structured critique. Up to 5 rounds. Audit scripts between each.
Real browser, real database
Neon DB branches + Vercel Sandbox + Browser Use. Production-like testing with production-like data, isolated and ephemeral.
Honest about its limits
Explicit stop conditions. Full-context escalation. No silent failures. No half-finished branches quietly rotting.
Tech Stack
| Component | Technology |
|---|---|
| Runtime | Node.js (CommonJS) |
| HTTP | Express |
| Trigger | Linear webhooks (signature-verified) |
| Code Agents | Configurable — Claude, GPT, any provider |
| Review | Structured JSON output with per-finding priorities |
| Database Isolation | Neon copy-on-write branching |
| Sandbox | Vercel Sandbox SDK |
| Browser Testing | Browser Use cloud API |
| Git | gh CLI for PR operations |
| Notifications | Slack via GitHub webhook loop |
| State | Deterministic file-based state machine |
| Config | YAML + environment variable overrides |
How It Connects
Bug Boomerang shares DNA with the rest of the runtime family:
- Xena — the autonomous event operator. Bug Boomerang's webhook handling and state persistence patterns come from Xena's architecture.
- Noktua — the desktop agent. Same configurable toolchain philosophy — the framework is permanent, the components are variables.
- The coder/reviewer agents use the same pluggable model client pattern. When a better coding model ships, you change a string in config.
References
- Neon — Serverless Postgres with copy-on-write branching: neon.tech
- Vercel Sandbox — Ephemeral compute environments: Vercel docs
- Browser Use — Cloud browser automation: browser-use.com
- Linear — Issue tracking with webhook events: linear.app
Teddy
A pre-coding discovery agent that explores codebases like a dog digging for bones — relentlessly, systematically, and with zero quit. Produces structured context packages that give downstream coding agents first-attempt accuracy.
AI QA
Write tests in English. Run them in real browsers. On PR merge, AI analyses the blast radius, generates new tests, selects a regression suite, and smoke-tests the deployment. Failures trigger Bug Boomerang automatically.