# Xena
An autonomous event operator that turns webhooks into intelligent actions — context-aware, memory-driven, and built to close the loop.
## What is Xena?
Xena is an autonomous event operator. It sits behind a single webhook endpoint and turns inbound events — emails, Linear tickets, GitHub notifications, Slack messages, scheduled reminders — into intelligent, context-aware actions.
Most agent frameworks focus on chat. Xena focuses on work. When an event arrives, Xena doesn't just respond — it builds context from memory, decides what to do, executes tools, and follows up if the work isn't done yet.
### The core idea
One webhook. Every event gets the same pipeline: verify → deduplicate → extract → build context → reason → act → follow up. No special-casing per source. No brittle integrations. One path, every time.
## Why does this exist?
The agent ecosystem is full of demos that reply to a prompt. Real operational work doesn't look like that. Real work looks like:
- An email lands asking you to create a Linear ticket, research something, and reply with the outcome
- A Linear issue gets labelled and needs triaging, context from previous conversations, and a status update sent to the requester
- A reminder fires 30 minutes later to check if that delegated research task finished — and if not, reschedule and let the requester know
Xena handles all of this through a two-agent architecture: a cheap context agent curates what matters, then a frontier reasoning agent decides and acts.
## The Two-Agent Pipeline

### How It Works
**1. Event arrives**
Any source — email via AgentMail, Linear webhook, GitHub event, Slack message, or a fired reminder — hits `POST /webhook`. Xena verifies the signature (Svix, HMAC-SHA256, or shared secret), deduplicates via content hash (300s TTL), and detects the source.
**2. Context Agent curates**
A cheap model (via DSPy's Recursive Language Model pattern) searches memory, pulls active reminders, extracts entities, and builds a scored `ContextPackage`. This is where Xena figures out what matters — not just what happened, but what it already knows about this thread, this person, this project.
The context agent runs with bounded iterations (`max_iterations=10`, `max_llm_calls=20`) and always falls back to deterministic assembly if the LLM path fails.
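A minimal sketch of that bounded-loop-with-fallback contract. The `llm_step` callable is hypothetical, standing in for the DSPy-driven curation pass, and the `ContextPackage` fields are simplified for illustration:

```python
from dataclasses import dataclass, field

MAX_ITERATIONS = 10   # mirrors max_iterations=10 above
MAX_LLM_CALLS = 20    # mirrors max_llm_calls=20 above

@dataclass
class ContextPackage:
    entities: list = field(default_factory=list)
    memory_hits: list = field(default_factory=list)
    fallback: bool = False  # True when assembled deterministically

def build_context(event: dict, llm_step) -> ContextPackage:
    """Run the curation loop; fall back to deterministic assembly on failure.

    `llm_step` takes (event, package_so_far) and returns (package, done),
    or raises when the model path fails.
    """
    package = ContextPackage()
    llm_calls = 0
    try:
        for _ in range(MAX_ITERATIONS):
            if llm_calls >= MAX_LLM_CALLS:
                break
            package, done = llm_step(event, package)
            llm_calls += 1
            if done:
                break
        return package
    except Exception:
        # Deterministic fallback: a minimal package built without the LLM.
        return ContextPackage(entities=sorted(event.keys()), fallback=True)
```

The key property is that `build_context` always returns a usable package: the LLM path is best-effort, the deterministic path is the guarantee.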
**3. Xena reasons and acts**
The frontier model receives the context package, gets injected with the best-matching skill (from `skills/*.md`), and has exactly one tool shape: `execute(path, payload)`. It picks a path from the tool registry — create a Linear issue, reply to an email, set a reminder, delegate to Manus — and executes.
Tool rounds are bounded (default 1, elevated to 3+ for async follow-up loops). Validation errors don't burn rounds. Output truncation triggers graceful fallback summaries.
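The round-bounding rule can be sketched like this. The `decide` and `execute` callables and the observation shape are assumptions for illustration, not Xena's actual interfaces:

```python
def run_tool_rounds(decide, execute, max_rounds: int = 1) -> list[dict]:
    """Hypothetical sketch of the bounded tool-round loop.

    `decide` returns (path, payload) or None when the model is done;
    `execute` returns an observation dict. A `validation_error`
    observation does not consume a round, matching the contract above.
    """
    observations = []
    rounds_used = 0
    while rounds_used < max_rounds:
        action = decide(observations)
        if action is None:
            break
        path, payload = action
        obs = execute(path, payload)
        observations.append(obs)
        if obs.get("status") != "validation_error":
            rounds_used += 1  # only real executions burn a round
    return observations
```

Not burning rounds on validation errors means a model that forgot a required field gets to correct itself without losing its one shot at acting.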
**4. Delivery assurance kicks in**
If Xena produced a response but didn't execute a tool for an external channel, the dead-letter rail catches it and delivers anyway. If the entire invocation failed, the failure notification rail tries the requester's channel, then raw payload fallback, then creates a Linear issue as last resort.
No silent failures. Ever.
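The escalation order might be modelled as a simple chain of rails. All names here are hypothetical; each rail is a callable that raises on failure:

```python
def notify_failure(requester_channel, raw_fallback, create_linear_issue,
                   message: str) -> str:
    """Hypothetical escalation chain: channel -> raw payload -> Linear issue.

    Returns the name of the first rail that succeeded, so a caller can
    log which fallback actually delivered the notification.
    """
    rails = [
        ("channel", requester_channel),
        ("raw_fallback", raw_fallback),
        ("linear_issue", create_linear_issue),
    ]
    for name, rail in rails:
        try:
            rail(message)
            return name
        except Exception:
            continue  # this rail failed; try the next one
    raise RuntimeError("all delivery rails exhausted")
```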
**5. The loop closes (or continues)**
For non-instant work, Xena follows a courtesy loop contract: quick acknowledgment → expected timing → completion/delay/failure update. Async work gets a reminder that re-enters as a webhook event, carrying full lineage via `context_ref`. The cycle repeats until the work is done.
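A fired reminder is just another webhook payload. A hypothetical event shape carrying that lineage might look like this (the field names are illustrative, not Xena's actual schema):

```python
import uuid

def make_reminder_event(parent_event_id: str, context_ref: str,
                        due_at: str) -> dict:
    """Hypothetical shape of a reminder that re-enters the pipeline.

    The fired reminder is an ordinary webhook event; `context_ref`
    carries lineage back to the original request so the courtesy
    loop can close against the right thread.
    """
    return {
        "id": str(uuid.uuid4()),
        "source": "reminder",
        "type": "reminder.fired",
        "due_at": due_at,
        "context_ref": context_ref,        # lineage to the originating thread
        "parent_event_id": parent_event_id,
    }
```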
## DSPy and the Recursive LM
This is where the computer science gets interesting.
DSPy is a framework from Stanford that treats LLM prompts as optimisable programs rather than hand-written strings (Khattab et al., 2023). Instead of crafting a prompt by hand and hoping it works, you define a signature — a structured input/output contract — and let the framework compile, execute, and eventually optimise the prompt based on real-world outcomes.
```python
class BuildContext(dspy.Signature):
    """Given a webhook event, build a curated context package."""

    event_payload: str = dspy.InputField()
    event_type: str = dspy.InputField()
    event_source: str = dspy.InputField()
    context_package: str = dspy.OutputField()
```

DSPy handles prompt generation, output parsing, and, critically, the optimisation loop. You feed it examples of good and bad outputs, and it automatically tunes the prompts to produce better results. This isn't fine-tuning the model. It's optimising the instructions the model receives. The model stays the same; the prompts evolve.
### The RLM Pattern (Recursive Language Model)
DSPy's RLM module makes the LLM call iterative and tool-augmented. A standard LLM call is fire-and-forget. RLM makes it a reasoning loop:
**1. Receive the event**
The RLM gets the event payload and the output contract (the BuildContext signature).
**2. Access tools**
It gets memory search and reminder lookup as callable tools.
**3. Reason and retrieve**
The model decides what information it needs, calls tools to retrieve it, examines the results, and decides whether it has enough context or needs another pass.
**4. Iterate (up to 10 rounds)**
It refines queries, pulls in related threads, discovers linked reminders — autonomously curating until it's satisfied.
**5. Output a structured briefing**
The final ContextPackage includes entities, memory, reminders, urgency scoring, quality metrics, and follow-up flags.
This is fundamentally different from a "retrieve then generate" RAG pipeline. In typical RAG, you run one search query, stuff results into a prompt, and hope the model can work with whatever came back. With RLM, the model decides its own retrieval strategy. It can search for a person's name, realise it also needs their recent conversation history, run a second search, discover a related reminder, and weave all of that into a coherent briefing.
The "recursive" part means the model reflects on its own intermediate outputs and course-corrects. If the first search returns noise, it refines the query. If it finds a reference to a conversation thread, it pulls that in. It's an LLM running inside a reasoning loop with access to real data — not generating from static knowledge, but actively exploring and curating.
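The retrieve-reflect loop described above can be caricatured in a few lines. Here a trivial rule-following policy stands in for the model, because the point is the shape of the loop, not the intelligence inside it; all names and the `see:` pointer convention are illustrative:

```python
def rlm_curate(event: str, search, max_rounds: int = 10) -> list[str]:
    """Toy version of the retrieve-reflect loop.

    `search` stands in for the memory-search tool. The "model" here
    simply follows any `see:<query>` pointer found in results until
    nothing new appears or the round budget runs out.
    """
    gathered: list[str] = []
    queries = [event]
    for _ in range(max_rounds):
        if not queries:
            break  # satisfied: nothing left to retrieve
        results = search(queries.pop(0))
        for r in results:
            if r in gathered:
                continue
            gathered.append(r)
            # Reflect: a result can point at more context to pull in.
            if r.startswith("see:"):
                queries.append(r.removeprefix("see:"))
    return gathered
```

Even this toy shows the difference from one-shot RAG: the second search query (`thread-42` in the test below) does not exist until the first round's results suggest it.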
### Two Models, Not One
The RLM uses two models simultaneously to keep costs in the pennies range:
| Role | Model | Job |
|---|---|---|
| Primary | Cheap, fast (e.g. Fireworks-hosted) | Reasoning, retrieval strategy, final assembly |
| Sub-model | Even cheaper | Tool execution, intermediate processing |
This means sophisticated multi-step retrieval that would be impossible with a single static prompt — at a fraction of the cost of throwing a frontier model at the entire problem.
### The Feedback Loop
Every event generates an outcome trace: did Xena use the right skill? Was the context package helpful or full of noise? These traces feed back into DSPy's optimisation pipeline. Over time, the context agent's prompts automatically improve — retrieval gets more precise, noise gets filtered more aggressively, briefings get tighter.
The system learns from its own operational history — not by retraining a model, but by evolving the instructions it gives to the model. This is the difference between a system that works on day one and slowly degrades, versus one that gets measurably better every week.
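For illustration, an outcome trace in the append-only JSONL style the document describes might be written like this. The record fields are assumptions, not Xena's actual schema:

```python
import json

def record_outcome(path: str, event_id: str, skill: str,
                   context_quality: float, success: bool) -> dict:
    """Append one hypothetical outcome trace as a JSONL line.

    Records like this are the raw material for prompt optimisation:
    which skill fired, how clean the context package was, and whether
    the action ultimately succeeded.
    """
    trace = {
        "event_id": event_id,
        "skill": skill,
        "context_quality": context_quality,  # e.g. a signal-to-noise score
        "success": success,
    }
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(trace) + "\n")
    return trace
```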
## Skills: Behavioural Programs, Not API Wrappers
Skills aren't "call this endpoint." They're encoded judgment.
A communication skill doesn't say "use the email API." It says: when you see an inbound email, lead with the answer, match the sender's tone, and if you need to delegate research, tell them you're on it before you start.
A follow-up skill doesn't say "set a reminder." It says: when a delegated task comes back, check its status. If it's done, close the loop with the requester. If it's still running, tell them it's taking longer than expected and check again later.
Skills are:
- Written as markdown files in `skills/*.md` with frontmatter metadata
- Selected dynamically via rubric scoring against event context
- Injected into Xena's system prompt before reasoning begins
- The mechanism by which behaviour adapts to situation — no hardcoded branching
This is the separation between what you can do (tools) and how you should behave (skills). Tools are capabilities. Skills are judgment.
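Rubric scoring against event context could be as simple as weighted tag overlap. A toy sketch, in which the frontmatter fields (`tags`, `weight`) are hypothetical:

```python
def select_skill(skills: dict[str, dict], event_context: set[str]) -> str:
    """Pick the skill whose frontmatter tags best match the event context.

    `skills` maps skill name to parsed frontmatter, e.g.
    {"communication": {"tags": ["email", "reply"], "weight": 1.0}}.
    """
    def score(meta: dict) -> float:
        overlap = len(set(meta.get("tags", [])) & event_context)
        return overlap * meta.get("weight", 1.0)
    return max(skills, key=lambda name: score(skills[name]))
```

The winning skill's markdown body is what gets injected into the system prompt; the scoring itself stays deterministic.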
## The Memory System
Xena's memory is COALA-aligned — inspired by the Cognitive Architectures for Language Agents paper — with six distinct stores:
| Store | Purpose | Retrieval |
|---|---|---|
| Episodic | Raw event stream (interactions.jsonl) | BM25 + recency weighting |
| Semantic | Extracted summaries and entities | BM25 + semantic scoring |
| Working | Bounded recent context | Recency window |
| Graph | Entity relationships and provenance | Depth-bounded traversal |
| Conversations | Thread reconstruction | Thread metadata lookup |
| Reminders | Scheduled future actions | SQLite with due-time queries |
All stores are append-only JSONL (except reminders, which live in SQLite). BM25 indexes are built with `bm25s` and rebuilt on debounce. Recency weighting uses a configurable half-life (default 72 hours, weight 0.35).
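Those defaults imply exponential decay: a 72-hour-old hit contributes half the recency bonus of a brand-new one. A sketch of how the recency term might combine with BM25; the additive blend is an assumption, not the documented formula:

```python
import math

HALF_LIFE_HOURS = 72.0   # default half-life from the config above
RECENCY_WEIGHT = 0.35    # default weight from the config above

def recency_score(age_hours: float) -> float:
    """Exponential decay with a 72h half-life, scaled by the 0.35 weight."""
    return RECENCY_WEIGHT * math.pow(0.5, age_hours / HALF_LIFE_HOURS)

def blended_score(bm25: float, age_hours: float) -> float:
    # Assumed blend: lexical relevance plus a recency bonus.
    return bm25 + recency_score(age_hours)
```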
A Night Agent runs on a schedule to prune stale data, deduplicate, rebuild indexes, and optionally summarise threads into semantic memory.
## The Tool System
Xena exposes a single action shape to the LLM: `execute(path, payload)`.
No function-calling sprawl. No 47-tool menus. One shape, routed by path:
```
execute("linear.issue.create", { title, description, team_id })
execute("linear.comment.create", { issue_id, body })
```

GraphQL adapter with scoped workspace resolution.
```
execute("communication.email.reply", { thread_id, body })
execute("communication.email.send", { to, subject, body })
```

AgentMail adapter handling threads, labels, and message lifecycle.
```
execute("research.web.search", { query })
execute("research.web.extract", { url })
```

Template-driven HTTP adapter for delegated research tasks.
```
execute("reminder.pipeline.set", { due_at, context_ref, payload })
execute("reminder.pipeline.list", {})
execute("reminder.pipeline.cancel", { reminder_id })
```

Internal scheduler that re-enters via webhook — this is how recursion works.
Tools are defined in a YAML registry (`tools.yaml`, version 2). Adding a new integration means writing an adapter and registering paths in a YAML file. The model never knows the difference. The architecture never changes.
The router handles unknown paths, missing fields, missing adapters, and adapter exceptions — all returned as structured observations, never thrown as exceptions. Every tool call is explicit in the catalog and recorded as a durable interaction event.
```yaml
# tools.yaml (version 2) — excerpt
version: 2
adapters:
  linear:
    type: linear
    credentials: [XENA_AI_LINEAR_API_KEY, LINEAR_API_KEY]
  agentmail:
    type: agentmail
    credentials: [AGENT_MAIL_API_KEY]
specs:
  - path: linear.issue.create
    adapter: linear
    required: [title, team_id]
  - path: communication.email.reply
    adapter: agentmail
    required: [thread_id, body]
```

Paths follow a clean taxonomy: `communication.email.reply`, `project.issue.create`, `reminder.pipeline.set`. The model sees one tool shape; the registry resolves the rest.
## What Makes This Different
**One pipeline, every event**
Not a chatbot. Not a per-integration hack. Every event source hits the same verify → context → reason → act path. Adding a new source means adding a signature check, not rebuilding the pipeline.
**Context before reasoning**
The cheap context agent does the heavy lifting of deciding what matters, so the expensive frontier model only sees curated, scored, token-budgeted context. It's DSPy's RLM pattern applied to operational work, not just RAG.
**Recursive via reminders**
Async work doesn't need a separate orchestration layer. Reminders are first-class webhook events carrying full lineage. Set → fire → re-enter pipeline → check status → close or reschedule. Simple recursion through the same infra.
**Delivery guarantees, not hopes**
Dead-letter rails catch orphaned responses. Failure notifications escalate through channel → raw fallback → Linear issue. Infrastructure is deterministic; only judgment is probabilistic.
## Tech Stack
| Component | Technology |
|---|---|
| Runtime | Python 3.12, FastAPI, async/await |
| Context Agent | DSPy with RLM (Recursive Language Model) pattern |
| Reasoning Agent | Frontier LLM (GPT-4o/Claude) via provider-abstracted ModelClient |
| Memory Retrieval | bm25s (Okapi BM25) |
| Reminders | SQLite-backed scheduler with webhook re-entry |
| Tool Registry | YAML-defined, adapter-pattern execution |
| Observability | Append-only JSONL traces + feedback records |
## The Design Philosophy

**The framework is permanent. Everything else is a variable.**
When a better model drops, you change an environment variable. When a new tool is needed, you register it. When the world changes, the skills evolve. Nothing structural changes. Ever.
The AI industry is obsessed with making models smarter. But the bottleneck isn't intelligence — it's architecture. The smartest model in the world is useless if it forgets everything between calls, can't follow up on its own work, and leaves people hanging when something breaks.
Xena inverts the priority: memory before intelligence (identity comes from what you remember, not how smart you are), behaviour before tools (judgment about how to act matters more than what you can access), context before routing (one pipeline that handles everything, because the context tells you what to do), and infrastructure before reasoning (deterministic guarantees for things that must never fail).
## References
- DSPy — Programming (not prompting) language models: dspy.ai / Khattab et al., 2023
- COALA — Cognitive Architectures for Language Agents: Sumers et al., 2023
- Okapi BM25 — Probabilistic relevance retrieval: Wikipedia
- RLM Pattern — Reasoning Language Model in DSPy: DSPy docs