# Xena
An autonomous event operator that turns webhooks into intelligent actions — context-aware, memory-driven, and built to close the loop.
## What is Xena?
Xena is an autonomous event operator. It sits behind a single webhook endpoint and turns inbound events — emails, Linear tickets, GitHub notifications, Slack messages, scheduled reminders — into intelligent, context-aware actions.
Most agent frameworks focus on chat. Xena focuses on work. When an event arrives, Xena doesn't just respond — it builds context from memory, decides what to do, executes tools, and follows up if the work isn't done yet.
### The core idea
One webhook. Every event gets the same pipeline: verify → deduplicate → extract → build context → reason → act → follow up. No special-casing per source. No brittle integrations. One path, every time.
## Why does this exist?
The agent ecosystem is full of demos that reply to a prompt. Real operational work doesn't look like that. Real work looks like:
- An email lands asking you to create a Linear ticket, research something, and reply with the outcome
- A Linear issue gets labelled and needs triaging, context from previous conversations, and a status update sent to the requester
- A reminder fires 30 minutes later to check if that delegated research task finished — and if not, reschedule and let the requester know
Xena handles all of this through a two-agent architecture: a cheap context agent curates what matters, then a frontier reasoning agent decides and acts.
## The Two-Agent Pipeline

### How It Works
**1. Event arrives**
Any source — email via AgentMail, Linear webhook, GitHub event, Slack message, or a fired reminder — hits `POST /webhook`. Xena verifies the signature (Svix, HMAC-SHA256, or shared secret), deduplicates via content hash (300s TTL), and detects the source.
**2. Context Agent curates**
A cheap model (via DSPy's Recursive Language Model pattern) searches memory, pulls active reminders, extracts entities, and builds a scored `ContextPackage`. This is where Xena figures out what matters — not just what happened, but what it already knows about this thread, this person, this project.
The context agent runs with bounded iterations (`max_iterations=10`, `max_llm_calls=20`) and always falls back to deterministic assembly if the LLM path fails.
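A minimal sketch of that bounded-loop-with-fallback contract. The `llm_step` callable is hypothetical, standing in for the DSPy-driven curation pass, and the `ContextPackage` fields are simplified for illustration:

```python
from dataclasses import dataclass, field

MAX_ITERATIONS = 10   # mirrors max_iterations=10 above
MAX_LLM_CALLS = 20    # mirrors max_llm_calls=20 above

@dataclass
class ContextPackage:
    entities: list = field(default_factory=list)
    memory_hits: list = field(default_factory=list)
    fallback: bool = False  # True when assembled deterministically

def build_context(event: dict, llm_step) -> ContextPackage:
    """Run the curation loop; fall back to deterministic assembly on failure.

    `llm_step` takes (event, package_so_far) and returns (package, done),
    or raises when the model path fails.
    """
    package = ContextPackage()
    llm_calls = 0
    try:
        for _ in range(MAX_ITERATIONS):
            if llm_calls >= MAX_LLM_CALLS:
                break
            package, done = llm_step(event, package)
            llm_calls += 1
            if done:
                break
        return package
    except Exception:
        # Deterministic fallback: a minimal package built without the LLM.
        return ContextPackage(entities=sorted(event.keys()), fallback=True)
```

The key property is that `build_context` always returns a usable package: the LLM path is best-effort, the deterministic path is the guarantee.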
**3. Xena reasons and acts**
The frontier model receives the context package, gets injected with the best-matching skill (from `skills/*.md`), and has exactly one tool shape: `execute(path, payload)`. It picks a path from the tool registry — create a Linear issue, reply to an email, set a reminder, delegate to Manus — and executes.
Tool rounds are bounded (default 1, elevated to 3+ for async follow-up loops). Validation errors don't burn rounds. Output truncation triggers graceful fallback summaries.
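The round-bounding rule can be sketched like this. The `decide` and `execute` callables and the observation shape are assumptions for illustration, not Xena's actual interfaces:

```python
def run_tool_rounds(decide, execute, max_rounds: int = 1) -> list[dict]:
    """Hypothetical sketch of the bounded tool-round loop.

    `decide` returns (path, payload) or None when the model is done;
    `execute` returns an observation dict. A `validation_error`
    observation does not consume a round, matching the contract above.
    """
    observations = []
    rounds_used = 0
    while rounds_used < max_rounds:
        action = decide(observations)
        if action is None:
            break
        path, payload = action
        obs = execute(path, payload)
        observations.append(obs)
        if obs.get("status") != "validation_error":
            rounds_used += 1  # only real executions burn a round
    return observations
```

Not burning rounds on validation errors means a model that forgot a required field gets to correct itself without losing its one shot at acting.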
**4. Delivery assurance kicks in**
If Xena produced a response but didn't execute a tool for an external channel, the dead-letter rail catches it and delivers anyway. If the entire invocation failed, the failure notification rail tries the requester's channel, then raw payload fallback, then creates a Linear issue as last resort.
No silent failures. Ever.
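The escalation order might be modelled as a simple chain of rails. All names here are hypothetical; each rail is a callable that raises on failure:

```python
def notify_failure(requester_channel, raw_fallback, create_linear_issue,
                   message: str) -> str:
    """Hypothetical escalation chain: channel -> raw payload -> Linear issue.

    Returns the name of the first rail that succeeded, so a caller can
    log which fallback actually delivered the notification.
    """
    rails = [
        ("channel", requester_channel),
        ("raw_fallback", raw_fallback),
        ("linear_issue", create_linear_issue),
    ]
    for name, rail in rails:
        try:
            rail(message)
            return name
        except Exception:
            continue  # this rail failed; try the next one
    raise RuntimeError("all delivery rails exhausted")
```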
**5. The loop closes (or continues)**
For non-instant work, Xena follows a courtesy loop contract: quick acknowledgment → expected timing → completion/delay/failure update. Async work gets a reminder that re-enters as a webhook event, carrying full lineage via `context_ref`. The cycle repeats until the work is done.
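A fired reminder is just another webhook payload. A hypothetical event shape carrying that lineage might look like this (the field names are illustrative, not Xena's actual schema):

```python
import uuid

def make_reminder_event(parent_event_id: str, context_ref: str,
                        due_at: str) -> dict:
    """Hypothetical shape of a reminder that re-enters the pipeline.

    The fired reminder is an ordinary webhook event; `context_ref`
    carries lineage back to the original request so the courtesy
    loop can close against the right thread.
    """
    return {
        "id": str(uuid.uuid4()),
        "source": "reminder",
        "type": "reminder.fired",
        "due_at": due_at,
        "context_ref": context_ref,        # lineage to the originating thread
        "parent_event_id": parent_event_id,
    }
```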
## DSPy and the Recursive LM
This is where the computer science gets interesting.
DSPy is a framework from Stanford that treats LLM prompts as optimisable programs rather than hand-written strings (Khattab et al., 2023). Instead of crafting a prompt by hand and hoping it works, you define a signature — a structured input/output contract — and let the framework compile, execute, and eventually optimise the prompt based on real-world outcomes.
```python
class BuildContext(dspy.Signature):
    """Given a webhook event, build a curated context package."""

    event_payload: str = dspy.InputField()
    event_type: str = dspy.InputField()
    event_source: str = dspy.InputField()
    context_package: str = dspy.OutputField()
```

DSPy handles prompt generation, output parsing, and, critically, the optimisation loop. You feed it examples of good and bad outputs, and it automatically tunes the prompts to produce better results. This isn't fine-tuning the model. It's optimising the instructions the model receives. The model stays the same; the prompts evolve.
### The RLM Pattern (Recursive Language Model)
DSPy's RLM module makes the LLM call iterative and tool-augmented. A standard LLM call is fire-and-forget. RLM makes it a reasoning loop:
**1. Receive the event**
The RLM gets the event payload and the output contract (the BuildContext signature).
**2. Access tools**
It gets memory search and reminder lookup as callable tools.
**3. Reason and retrieve**
The model decides what information it needs, calls tools to retrieve it, examines the results, and decides whether it has enough context or needs another pass.
**4. Iterate (up to 10 rounds)**
It refines queries, pulls in related threads, discovers linked reminders — autonomously curating until it's satisfied.
**5. Output a structured briefing**
The final ContextPackage includes entities, memory, reminders, urgency scoring, quality metrics, and follow-up flags.
This is fundamentally different from a "retrieve then generate" RAG pipeline. In typical RAG, you run one search query, stuff results into a prompt, and hope the model can work with whatever came back. With RLM, the model decides its own retrieval strategy. It can search for a person's name, realise it also needs their recent conversation history, run a second search, discover a related reminder, and weave all of that into a coherent briefing.
The "recursive" part means the model reflects on its own intermediate outputs and course-corrects. If the first search returns noise, it refines the query. If it finds a reference to a conversation thread, it pulls that in. It's an LLM running inside a reasoning loop with access to real data — not generating from static knowledge, but actively exploring and curating.
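The retrieve-reflect loop described above can be caricatured in a few lines. Here a trivial rule-following policy stands in for the model, because the point is the shape of the loop, not the intelligence inside it; all names and the `see:` pointer convention are illustrative:

```python
def rlm_curate(event: str, search, max_rounds: int = 10) -> list[str]:
    """Toy version of the retrieve-reflect loop.

    `search` stands in for the memory-search tool. The "model" here
    simply follows any `see:<query>` pointer found in results until
    nothing new appears or the round budget runs out.
    """
    gathered: list[str] = []
    queries = [event]
    for _ in range(max_rounds):
        if not queries:
            break  # satisfied: nothing left to retrieve
        results = search(queries.pop(0))
        for r in results:
            if r in gathered:
                continue
            gathered.append(r)
            # Reflect: a result can point at more context to pull in.
            if r.startswith("see:"):
                queries.append(r.removeprefix("see:"))
    return gathered
```

Even this toy shows the difference from one-shot RAG: the second search query (`thread-42` in the test below) does not exist until the first round's results suggest it.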
### Two Models, Not One
The RLM uses two models simultaneously to keep costs in the pennies range:
| Role | Model | Job |
|---|---|---|
| Primary | Cheap, fast (e.g. Fireworks-hosted) | Reasoning, retrieval strategy, final assembly |
| Sub-model | Even cheaper | Tool execution, intermediate processing |
This means sophisticated multi-step retrieval that would be impossible with a single static prompt — at a fraction of the cost of throwing a frontier model at the entire problem.
### The Feedback Loop
Every event generates an outcome trace: did Xena use the right skill? Was the context package helpful or full of noise? These traces feed back into DSPy's optimisation pipeline. Over time, the context agent's prompts automatically improve — retrieval gets more precise, noise gets filtered more aggressively, briefings get tighter.
The system learns from its own operational history — not by retraining a model, but by evolving the instructions it gives to the model. This is the difference between a system that works on day one and slowly degrades, versus one that gets measurably better every week.
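For illustration, an outcome trace in the append-only JSONL style the document describes might be written like this. The record fields are assumptions, not Xena's actual schema:

```python
import json

def record_outcome(path: str, event_id: str, skill: str,
                   context_quality: float, success: bool) -> dict:
    """Append one hypothetical outcome trace as a JSONL line.

    Records like this are the raw material for prompt optimisation:
    which skill fired, how clean the context package was, and whether
    the action ultimately succeeded.
    """
    trace = {
        "event_id": event_id,
        "skill": skill,
        "context_quality": context_quality,  # e.g. a signal-to-noise score
        "success": success,
    }
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(trace) + "\n")
    return trace
```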
## Skills: Behavioural Programs, Not API Wrappers
Skills aren't "call this endpoint." They're encoded judgment.
A communication skill doesn't say "use the email API." It says: when you see an inbound email, lead with the answer, match the sender's tone, and if you need to delegate research, tell them you're on it before you start.
A follow-up skill doesn't say "set a reminder." It says: when a delegated task comes back, check its status. If it's done, close the loop with the requester. If it's still running, tell them it's taking longer than expected and check again later.
Skills are:
- Written as markdown files in `skills/*.md` with frontmatter metadata
- Selected dynamically via rubric scoring against event context
- Injected into Xena's system prompt before reasoning begins
- The mechanism by which behaviour adapts to situation — no hardcoded branching
This is the separation between what you can do (tools) and how you should behave (skills). Tools are capabilities. Skills are judgment.
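Rubric scoring against event context could be as simple as weighted tag overlap. A toy sketch, in which the frontmatter fields (`tags`, `weight`) are hypothetical:

```python
def select_skill(skills: dict[str, dict], event_context: set[str]) -> str:
    """Pick the skill whose frontmatter tags best match the event context.

    `skills` maps skill name to parsed frontmatter, e.g.
    {"communication": {"tags": ["email", "reply"], "weight": 1.0}}.
    """
    def score(meta: dict) -> float:
        overlap = len(set(meta.get("tags", [])) & event_context)
        return overlap * meta.get("weight", 1.0)
    return max(skills, key=lambda name: score(skills[name]))
```

The winning skill's markdown body is what gets injected into the system prompt; the scoring itself stays deterministic.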
## The Memory System
Xena's memory is COALA-aligned — inspired by the Cognitive Architectures for Language Agents paper — with six distinct stores:
| Store | Purpose | Retrieval |
|---|---|---|
| Episodic | Raw event stream (interactions.jsonl) | BM25 + recency weighting |
| Semantic | Extracted summaries and entities | BM25 + semantic scoring |
| Working | Bounded recent context | Recency window |
| Graph | Entity relationships and provenance | Depth-bounded traversal |
| Conversations | Thread reconstruction | Thread metadata lookup |
| Reminders | Scheduled future actions | SQLite with due-time queries |
All stores are append-only JSONL (except reminders, which live in SQLite). BM25 indexes are built with `bm25s` and rebuilt on debounce. Recency weighting uses a configurable half-life (default 72 hours, weight 0.35).
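Those defaults imply exponential decay: a 72-hour-old hit contributes half the recency bonus of a brand-new one. A sketch of how the recency term might combine with BM25; the additive blend is an assumption, not the documented formula:

```python
import math

HALF_LIFE_HOURS = 72.0   # default half-life from the config above
RECENCY_WEIGHT = 0.35    # default weight from the config above

def recency_score(age_hours: float) -> float:
    """Exponential decay with a 72h half-life, scaled by the 0.35 weight."""
    return RECENCY_WEIGHT * math.pow(0.5, age_hours / HALF_LIFE_HOURS)

def blended_score(bm25: float, age_hours: float) -> float:
    # Assumed blend: lexical relevance plus a recency bonus.
    return bm25 + recency_score(age_hours)
```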
A Night Agent runs on a schedule to prune stale data, deduplicate, rebuild indexes, and optionally summarise threads into semantic memory.
## The Tool System
Xena exposes a single action shape to the LLM: `execute(path, payload)`.
No function-calling sprawl. No 47-tool menus. One shape, routed by path:
```
execute("linear.issue.create", { title, description, team_id })
execute("linear.comment.create", { issue_id, body })
```

GraphQL adapter with scoped workspace resolution.
```
execute("communication.email.reply", { thread_id, body })
execute("communication.email.send", { to, subject, body })
```

AgentMail adapter handling threads, labels, and message lifecycle.
```
execute("research.web.search", { query })
execute("research.web.extract", { url })
```

Template-driven HTTP adapter for delegated research tasks.
```
execute("reminder.pipeline.set", { due_at, context_ref, payload })
execute("reminder.pipeline.list", {})
execute("reminder.pipeline.cancel", { reminder_id })
```

Internal scheduler that re-enters via webhook — this is how recursion works.
Tools are defined in a YAML registry (`tools.yaml`, version 2). Adding a new integration means writing an adapter and registering paths in a YAML file. The model never knows the difference. The architecture never changes.
The router handles unknown paths, missing fields, missing adapters, and adapter exceptions — all returned as structured observations, never thrown as exceptions. Every tool call is explicit in the catalog and recorded as a durable interaction event.
```yaml
# tools.yaml (version 2) — excerpt
version: 2
adapters:
  linear:
    type: linear
    credentials: [XENA_AI_LINEAR_API_KEY, LINEAR_API_KEY]
  agentmail:
    type: agentmail
    credentials: [AGENT_MAIL_API_KEY]
specs:
  - path: linear.issue.create
    adapter: linear
    required: [title, team_id]
  - path: communication.email.reply
    adapter: agentmail
    required: [thread_id, body]
```

Paths follow a clean taxonomy: `communication.email.reply`, `project.issue.create`, `reminder.pipeline.set`. The model sees one tool shape; the registry resolves the rest.
## What Makes This Different
**One pipeline, every event**
Not a chatbot. Not a per-integration hack. Every event source hits the same verify → context → reason → act path. Adding a new source means adding a signature check, not rebuilding the pipeline.
**Context before reasoning**
The cheap context agent does the heavy lifting of deciding what matters, so the expensive frontier model only sees curated, scored, token-budgeted context. It's DSPy's RLM pattern applied to operational work, not just RAG.
**Recursive via reminders**
Async work doesn't need a separate orchestration layer. Reminders are first-class webhook events carrying full lineage. Set → fire → re-enter pipeline → check status → close or reschedule. Simple recursion through the same infra.
**Delivery guarantees, not hopes**
Dead-letter rails catch orphaned responses. Failure notifications escalate through channel → raw fallback → Linear issue. Infrastructure is deterministic; only judgment is probabilistic.
## Tech Stack
| Component | Technology |
|---|---|
| Runtime | Python 3.12, FastAPI, async/await |
| Context Agent | DSPy with RLM (Recursive Language Model) pattern |
| Reasoning Agent | Frontier LLM (GPT-4o/Claude) via provider-abstracted ModelClient |
| Memory Retrieval | bm25s (Okapi BM25) |
| Reminders | SQLite-backed scheduler with webhook re-entry |
| Tool Registry | YAML-defined, adapter-pattern execution |
| Observability | Append-only JSONL traces + feedback records |
## The Design Philosophy

**The framework is permanent. Everything else is a variable.**
When a better model drops, you change an environment variable. When a new tool is needed, you register it. When the world changes, the skills evolve. Nothing structural changes. Ever.
The AI industry is obsessed with making models smarter. But the bottleneck isn't intelligence — it's architecture. The smartest model in the world is useless if it forgets everything between calls, can't follow up on its own work, and leaves people hanging when something breaks.
Xena inverts the priority: memory before intelligence (identity comes from what you remember, not how smart you are), behaviour before tools (judgment about how to act matters more than what you can access), context before routing (one pipeline that handles everything, because the context tells you what to do), and infrastructure before reasoning (deterministic guarantees for things that must never fail).
## References
- DSPy — Programming (not prompting) language models: dspy.ai / Khattab et al., 2023
- COALA — Cognitive Architectures for Language Agents: Sumers et al., 2023
- Okapi BM25 — Probabilistic relevance retrieval: Wikipedia
- RLM Pattern — Reasoning Language Model in DSPy: DSPy docs