Building Agents That Actually Ship

A graph-based test orchestration system that decomposes every testable action into atomic particles, arranges them in a dependency graph, and runs tests in parallel against ephemeral database snapshots.

What is PTF?

You write a test. It needs a logged-in coach, a client, a nutrition plan, a day, a meal, and a food item — just to test that editing a food item works. So you write setup code. 40 lines of fixtures. Then another test needs most of the same setup but slightly different. So you copy-paste and tweak. Then the schema changes and 30 tests break because they all had their own version of the same fixture chain.

Now multiply that by every feature, every role, every edge case. The test suite becomes a maintenance project of its own. Tests run sequentially because they share state. Fixtures drift. Phantom failures appear. Nobody trusts the suite, so nobody runs it.

PTF (the Particle Testing Framework) takes a different approach. Every testable action in the system is decomposed into an atomic particle: a single, versioned, composable unit. These particles are arranged in a directed acyclic graph (DAG) that encodes their dependencies. A test is just a path through the graph. Execution is parallelised against ephemeral databases, each cloned in milliseconds from a pre-seeded snapshot.

The test suite writes itself from the graph

Define particles once. The graph encodes every valid combination. Any path through the graph is a runnable test — even paths nobody explicitly wrote. PR changes map to affected particles, which map to required test paths. The system knows exactly what to test, provisions isolated databases in under 100ms, and runs everything in parallel. No fixtures. No shared state. No sequential bottlenecks.

Architecture

The Core Idea: Particles + Graph + Snapshots

What's a Particle?

A particle is the smallest testable unit in the system. Not a test — a building block for tests. Each particle has a type, dependencies, a handler, and optional performance thresholds.

Traditional:  "Test that a coach can add food to a meal"
                        ↓
Particle:     login.coach.nutrition.plan.day.meal.food.add
                        ↓
Encoded:      L1.P2.D3.E4.C5.G6.I7.A1

Eight particle types cover the full taxonomy:

Type	Purpose	Example
Gate	Authentication state	`login` — validates session exists
Permission	Role-based access	`coach` — RBAC check
Domain	Feature area boundary	`nutrition` — feature namespace
Entity	Core data object	`plan` — the nutrition plan model
Container	Grouping construct	`day` — groups meals within a plan
Group	Sub-grouping	`meal` — groups food items
Item	Leaf-level data	`food` — individual food entry
Action	CRUD operation	`add`, `edit`, `delete`
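A particle definition might be modelled roughly as follows. This is a sketch rather than PTF's actual schema: the field names (`name`, `code`, `dependsOn`, `handler`, `perf`) and the assumption that thresholds are in milliseconds are illustrative, based only on the description above.

```typescript
// The eight particle types from the table above.
type ParticleType =
  | 'gate' | 'permission' | 'domain' | 'entity'
  | 'container' | 'group' | 'item' | 'action';

// Hypothetical shape of a particle definition (field names are assumptions).
interface ParticleDefinition {
  name: string;          // human-readable name, e.g. 'food'
  code: string;          // encoded form, e.g. 'I7'
  type: ParticleType;
  dependsOn: string[];   // codes of particles that must appear upstream
  handler: (ctx: Record<string, unknown>) => Promise<unknown>;
  perf?: { p95_http: number; timeout: number }; // optional thresholds (assumed ms)
}

// Example: the `food` item particle, which requires a `meal` (G6) upstream.
const food: ParticleDefinition = {
  name: 'food',
  code: 'I7',
  type: 'item',
  dependsOn: ['G6'],
  // Stub handler: a real one would create or look up a food item.
  handler: async (ctx) => ({ ...ctx, foodId: 'food-1' }),
  perf: { p95_http: 200, timeout: 2000 },
};
```

Keeping dependencies as plain codes means the graph can be assembled from the definitions alone, without running any handler.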

Each particle is content-addressable: it is identified by a SHA-256 hash of its definition. Change the schema, handler, or assertions and the hash changes with it, so drift detection is built in.

The Dependency Graph

Particles form a directed acyclic graph. A test path is only valid if every upstream dependency is satisfied. The graph catches impossible tests before they run:

L.S1.P2.S1.D3.I7.A1   ← INVALID

food.add requires meal, which requires day, which requires plan.
Missing intermediaries = automatic rejection before execution.

This is the DAG doing what fixtures never could — encoding the actual structure of the system. The graph is the system specification.
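The rejection rule can be sketched as a check that every particle's dependencies appear earlier in the path. The `requires` table below is a hypothetical encoding of the login.coach.nutrition.plan.day.meal.food.add chain, and `validatePath` is an illustrative name, not PTF's API.

```typescript
// Hypothetical dependency table: each particle code lists the codes that
// must appear earlier in the path.
const requires: Record<string, string[]> = {
  L: [],
  P2: ['L'],
  D3: ['P2'],
  E4: ['D3'],
  C5: ['E4'],
  G6: ['C5'],
  I7: ['G6'],
  A1: ['I7'],
};

interface ValidationResult {
  valid: boolean;
  errors: string[];
}

// Strip state markers (S0/S1) so only particle codes remain.
function particleCodes(path: string): string[] {
  return path.split('.').filter((token) => !/^S[01]$/.test(token));
}

// Walk the path in order and report every direct dependency that is missing.
function validatePath(path: string): ValidationResult {
  const seen = new Set<string>();
  const errors: string[] = [];
  for (const code of particleCodes(path)) {
    const deps = requires[code];
    if (deps === undefined) {
      errors.push(`unknown particle ${code}`);
      continue;
    }
    for (const dep of deps) {
      if (!seen.has(dep)) errors.push(`${code} requires ${dep}`);
    }
    seen.add(code);
  }
  return { valid: errors.length === 0, errors };
}
```

Because the check needs only the path string and the dependency table, it can run before any database is provisioned.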

State Encoding

Every particle in a path carries a state marker:

Encoding	Meaning
`S1`	Condition met (e.g., logged in, has role)
`S0`	Condition NOT met (negative test)
`V1`	Create
`V2`	Read
`V3`	Update
`V4`	Delete

Positive test: L.S1.P2.S1.D3.E4.C5.G6.I7.A1, expect 201.
Negative test: L.S1.P2.S0.D3.E4.C5.G6.I7.A1, expect 403 (no coach permission).

Same path notation, same runner. The encoding tells the system what to expect. Negative tests don't need separate infrastructure — they're just paths with S0 markers.
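One way to read the notation is to attach each S0/S1 marker to the particle that precedes it. The parser below is a minimal sketch under that assumption; `PathStep`, `parsePath`, and `firstUnmet` are hypothetical names.

```typescript
interface PathStep {
  code: string;               // particle code, e.g. 'P2'
  state: 'S0' | 'S1' | null;  // marker that follows the particle, if any
}

// Split a dotted path and fold each state marker into the preceding particle.
function parsePath(path: string): PathStep[] {
  const steps: PathStep[] = [];
  for (const token of path.split('.')) {
    if (token === 'S0' || token === 'S1') {
      const last = steps[steps.length - 1];
      if (last === undefined) throw new Error('state marker without a particle');
      last.state = token;
    } else {
      steps.push({ code: token, state: null });
    }
  }
  return steps;
}

// First particle whose condition is deliberately unmet,
// or undefined for a positive path.
function firstUnmet(path: string): PathStep | undefined {
  return parsePath(path).find((step) => step.state === 'S0');
}
```

A runner could use `firstUnmet` to decide where in the path it should stop and assert a rejection instead of a success.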

The Function Registry

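Each particle maps to real system functions at two levels: a direct function call for unit tests, and an HTTP endpoint for integration tests.

// Unit level: each particle calls the application function directly.
// `auth`, `foodItem`, and `payload` are assumed to be in scope.
const unitRegistry = {
  'L': {
    query: () => auth.getCurrentSession()
  },
  'I7.A1': {
    query: () => foodItem.create(payload)
  }
};

// Integration level: the same particle keys hit HTTP endpoints instead.
// `baseUrl`, `planId`, `dayId`, and `mealId` are assumed to be in scope.
const httpRegistry = {
  'L': {
    http: () => fetch(`${baseUrl}/api/auth/session`)
  },
  'I7.A1': {
    http: () =>
      fetch(`${baseUrl}/api/plans/${planId}/days/${dayId}/meals/${mealId}/foods`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(payload)
      })
  }
};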

Same path definition. Same dependency validation. Same state encoding. The only difference is whether the particle calls a function directly or hits an HTTP endpoint. Unit tests and integration tests share the same graph — no duplication.
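A runner that switches between the two levels could look like the sketch below. The handlers are stubs standing in for real application calls and HTTP requests, and `Level`, `Registry`, and `runParticles` are hypothetical names.

```typescript
type Level = 'query' | 'http';
type Handler = () => Promise<unknown>;
type Registry = Record<string, Partial<Record<Level, Handler>>>;

// Stub handlers: a real registry would call application code or endpoints.
const registry: Registry = {
  'L': {
    query: async () => ({ via: 'function', session: 'abc' }),
    http: async () => ({ via: 'http', status: 200 }),
  },
  'I7.A1': {
    query: async () => ({ via: 'function', created: true }),
    http: async () => ({ via: 'http', status: 201 }),
  },
};

// Run the particles of a path in order, at the requested level.
async function runParticles(keys: string[], level: Level): Promise<unknown[]> {
  const results: unknown[] = [];
  for (const key of keys) {
    const handler = registry[key]?.[level];
    if (!handler) throw new Error(`no ${level} handler registered for ${key}`);
    results.push(await handler());
  }
  return results;
}
```

Calling `runParticles(['L', 'I7.A1'], 'query')` and `runParticles(['L', 'I7.A1'], 'http')` exercises the same path twice, once per level, without a second test definition.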

Parallel Execution via Snapshots

This is where PTF goes from clever to fast.

The Problem with Sequential Tests

Traditional flow (sequential, shared state):
  create client → create plan → assign plan → edit food
       5s      +      3s      +      2s      +     2s    = 12s

Every test builds on the last. State leaks between runs. One failure cascades. Parallelism is impossible because everything shares the same database.

The Snapshot Solution

PTF pre-seeds databases at graph checkpoints — natural save points in user journeys:

Snapshot	State	What It Covers
`S0_base`	Fresh tenant, seeded config	Auth, onboarding, initial setup
`S1_coach_active`	Coach account, profile complete	Dashboard, settings, preferences
`S2_client_exists`	Coach + client relationship	Client CRUD, relationship management
`S3_plan_assigned`	Client has active plan	Plan operations, day/meal structure
`S4_populated`	Rich data: meals, foods, history	Reporting, analytics, bulk operations
`S5_multi_client`	5+ clients, varied states	Pagination, filtering, edge cases

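Each snapshot is a PostgreSQL template database. Cloning a small template takes under 100ms. Note that `CREATE DATABASE ... TEMPLATE` makes a physical copy of the template rather than a copy-on-write clone (unless the underlying filesystem provides one), so clone time grows with template size, and the template must have no other active connections while it is being copied: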

-- Clone for a test run (< 100ms)
CREATE DATABASE test_run_abc123
  TEMPLATE s3_plan_assigned_template;

-- Destroy after test
DROP DATABASE test_run_abc123;

For any test path, the orchestrator selects the nearest upstream snapshot, clones it, runs the test against the clone, and destroys it. In the example above, the edit-food test no longer waits for 10 seconds of sequential setup: it starts from a clone of `S3_plan_assigned` and takes about 2 seconds, while other tests run alongside it, each against its own isolated database. No shared state. No fixture drift. No phantom failures.
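Snapshot selection can be sketched as picking the checkpoint that pre-builds the most of a path without adding state the path does not ask for. The mapping from snapshots to particle codes (`provides`) is an assumption made for illustration, as are the function names; the generated SQL follows the clone and drop statements shown above.

```typescript
interface Snapshot {
  name: string;        // checkpoint name from the snapshot table
  provides: string[];  // particle codes already materialised (assumed mapping)
}

const base: Snapshot = { name: 'S0_base', provides: [] };
const snapshots: Snapshot[] = [
  base,
  { name: 'S1_coach_active', provides: ['L', 'P2'] },
  { name: 'S3_plan_assigned', provides: ['L', 'P2', 'D3', 'E4'] },
];

// Choose the usable snapshot that covers the largest part of the path.
function nearestSnapshot(pathCodes: string[]): Snapshot {
  const wanted = new Set(pathCodes);
  let best = base;
  for (const snap of snapshots) {
    const usable = snap.provides.every((code) => wanted.has(code));
    if (usable && snap.provides.length > best.provides.length) best = snap;
  }
  return best;
}

// Build the clone and cleanup statements for one test run.
function cloneSql(snapshot: Snapshot, runId: string): { create: string; drop: string } {
  // Identifiers cannot be bound as query parameters, so restrict the run id.
  if (!/^[a-z0-9_]+$/.test(runId)) throw new Error(`unsafe run id: ${runId}`);
  const db = `test_run_${runId}`;
  const template = `${snapshot.name.toLowerCase()}_template`;
  return {
    create: `CREATE DATABASE ${db} TEMPLATE ${template};`,
    drop: `DROP DATABASE ${db};`,
  };
}
```

The statements are returned as strings so that the selection logic stays testable without a running database; executing them is left to whichever PostgreSQL client the orchestrator uses.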

PR triggers test selection

A PR is raised. PTF identifies changed files, maps them to affected domains, queries the graph for required particles, and constructs the minimum set of test paths needed. No redundant tests — only what the change could have broken.
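The selection step can be sketched as two lookups: changed files to affected particles, then affected particles to test paths. The directory prefixes and the list of known paths below are hypothetical; a real mapping would come from the project layout and the graph.

```typescript
// Hypothetical mapping from source directories to domain particles.
const domainByPrefix: Record<string, string> = {
  'src/nutrition/': 'D3',
  'src/auth/': 'L',
};

// Known test paths (particle codes only; state markers omitted for brevity).
const knownPaths: string[] = [
  'L.P2.D3.E4.C5.G6.I7.A1',
  'L.P2.D3.E4',
  'L',
];

// Map changed files to the particles whose domain they belong to.
function affectedParticles(changedFiles: string[]): Set<string> {
  const hit = new Set<string>();
  for (const file of changedFiles) {
    for (const prefix of Object.keys(domainByPrefix)) {
      if (file.startsWith(prefix)) hit.add(domainByPrefix[prefix]);
    }
  }
  return hit;
}

// Keep only the paths that pass through at least one affected particle.
function selectPaths(changedFiles: string[]): string[] {
  const hit = affectedParticles(changedFiles);
  return knownPaths.filter((path) => path.split('.').some((code) => hit.has(code)));
}
```

A change under `src/nutrition/` selects the two paths that pass through `D3` and skips the auth-only path.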

Snapshot selection and cloning

For each test path, the orchestrator finds the nearest upstream snapshot checkpoint. It clones the template database in under 100ms with `CREATE DATABASE ... TEMPLATE`. Each test gets its own isolated, deterministic database.

Parallel execution across instances

Tests fan out across cloud instances, each running multiple database clones. A single instance can handle 20 concurrent databases. Five instances = 100 tests running simultaneously. The orchestration layer distributes work and collects results.
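The fan-out can be illustrated with a small concurrency-limited runner. This is a generic worker-pool sketch, not PTF's orchestration layer; `runWithLimit` is a hypothetical helper, and the capacity figures are the ones quoted above.

```typescript
// Run async tasks with at most `limit` in flight at once,
// returning results in the original task order.
async function runWithLimit<T>(
  tasks: Array<() => Promise<T>>,
  limit: number,
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;

  // Each worker claims the next unstarted task until none remain.
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  }

  const workerCount = Math.min(limit, tasks.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Capacity from the text: 5 instances x 20 databases each.
const INSTANCES = 5;
const DBS_PER_INSTANCE = 20;
const totalSlots = INSTANCES * DBS_PER_INSTANCE;
```

Passing `totalSlots` as the limit caps the number of live database clones at the stated capacity, regardless of how many paths were selected.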

Results, metrics, and cleanup

Every test reports pass/fail with structured output. Performance metrics (P50, P95, P99) are recorded per particle for regression trending. All ephemeral databases are destroyed. The system is clean for the next run.

Composability: The Power Feature

Because particles are atomic and the graph defines relationships, any valid path is a runnable test — even if nobody explicitly wrote it.

QA wants to test: "Coach edits a food item in a plan assigned to a client who has dietary restrictions."

If the particles exist in the graph:

login → coach → nutrition → plan → client_assignment → dietary_flags → food → edit

PTF constructs the path, validates it against the DAG, selects the nearest snapshot, and runs — no new test code required. The graph covers the combinatorial space that manual test writing never could.

Automatic Test Discovery

The graph doesn't just validate paths — it reveals untested ones. PTF can traverse all valid paths and identify coverage gaps: "There are 47 valid paths through the nutrition domain. 38 have explicit tests. 9 are untested." Those 9 aren't hypothetical — they're real user journeys that the system supports but nobody thought to test.
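Gap discovery amounts to enumerating every root-to-leaf path and subtracting the ones that already have tests. The graph slice below is a hypothetical fragment of the nutrition domain, and the function names are illustrative.

```typescript
// Hypothetical slice of the graph: each node lists the nodes that may follow it.
const edges: Record<string, string[]> = {
  plan: ['day'],
  day: ['meal'],
  meal: ['food'],
  food: ['add', 'edit', 'delete'],
  add: [],
  edit: [],
  delete: [],
};

// Depth-first enumeration of every path from `node` to a leaf.
// Terminates because the graph is acyclic.
function allPaths(node: string, prefix: string[] = []): string[] {
  const here = [...prefix, node];
  const children = edges[node] ?? [];
  if (children.length === 0) return [here.join('.')];
  const paths: string[] = [];
  for (const child of children) paths.push(...allPaths(child, here));
  return paths;
}

// Valid paths that have no explicit test yet.
function untested(root: string, tested: Set<string>): string[] {
  return allPaths(root).filter((path) => !tested.has(path));
}
```

In a larger graph the number of paths can grow quickly, so a real implementation would probably bound the depth or enumerate per domain.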

Negative Testing

Negative tests use the same infrastructure with inverted expectations:

Category	Encoding	Expected
Auth failure	`L.S0.*`	401
Permission denied	`.P{n}.S0.`	403
Missing dependency	Skip intermediate particle	400/404
Invalid data	Append `.INVALID` modifier	400
Rate limited	Append `.RATELIMIT` modifier	429

A positive test proves the system works. A negative test proves the system correctly rejects invalid operations. Same path notation, same runner, same parallel execution. The S0 marker on a particle tells the system to expect failure at that point.
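The table above can be turned into a small lookup from path to expected status. This sketch covers the auth, permission, invalid-data, and rate-limit rows; the missing-dependency row is left out because it depends on graph validation rather than on the path string. The precedence between rules (earliest unmet condition first, then modifiers) is an assumption.

```typescript
// Expected HTTP status for an encoded path, following the negative-test table.
function expectedStatus(path: string, successStatus: number = 201): number {
  const tokens = path.split('.');

  // An S0 marker belongs to the particle immediately before it.
  for (let i = 1; i < tokens.length; i++) {
    if (tokens[i] !== 'S0') continue;
    const owner = tokens[i - 1];
    if (owner.startsWith('L')) return 401; // auth failure
    if (owner.startsWith('P')) return 403; // permission denied
  }

  // Modifiers are appended to the end of the path.
  if (tokens.includes('RATELIMIT')) return 429;
  if (tokens.includes('INVALID')) return 400;

  return successStatus;
}
```

Because the expectation is derived from the path alone, the same runner can assert positive and negative outcomes without a separate test harness.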

Performance as a First-Class Concern

Every particle carries optional performance thresholds:

perf: {
  p50_http: 100,   // 50th percentile
  p95_http: 200,   // 95th percentile — the threshold
  p99_http: 500,   // 99th percentile
  timeout: 2000    // hard cutoff
}

After each test run, metrics are stored per particle. PTF tracks trends over time:

P95 exceeds threshold for 3 consecutive runs — Warning
P95 exceeds 150% of baseline — Alert
Timeout hit — Failure + investigation flag
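The three rules can be sketched as a single evaluation over a particle's run history. The nearest-rank percentile and the precedence order (failure, then alert, then warning) are assumptions for illustration; the source does not say how PTF computes percentiles or resolves overlapping rules.

```typescript
// Nearest-rank percentile over a list of durations.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

type Verdict = 'ok' | 'warning' | 'alert' | 'failure';

interface RunMetric {
  p95: number;        // measured P95 for this run
  timedOut: boolean;  // true if the hard timeout was hit
}

// `history` is ordered oldest to newest; the last entry is the current run.
function evaluate(history: RunMetric[], threshold: number, baseline: number): Verdict {
  const current = history[history.length - 1];
  if (current === undefined) return 'ok';
  if (current.timedOut) return 'failure';           // timeout hit
  if (current.p95 > baseline * 1.5) return 'alert'; // above 150% of baseline
  const lastThree = history.slice(-3);
  const sustained =
    lastThree.length === 3 && lastThree.every((run) => run.p95 > threshold);
  return sustained ? 'warning' : 'ok';              // 3 consecutive breaches
}
```

Storing one `RunMetric` per particle per run is enough to drive all three rules and to compare runs before and after a given PR.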

Performance regressions are correlated with specific PRs automatically. A report such as "P95 on food.add jumped 47% since PR #1847" is possible because every particle run is metered.

Particle Versioning and Drift Detection

Each particle definition generates a content-addressable hash:

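import { createHash } from 'node:crypto';

// JSON.stringify drops functions and is key-order sensitive: hash a canonical, serialisable form.
function versionParticle(def: ParticleDefinition): string {
  return createHash('sha256').update(JSON.stringify(def)).digest('hex').slice(0, 8);
}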

On every PR, PTF:

Parses changed files for schema/handler modifications
Regenerates hashes for affected particles
Compares against the registry
Flags test paths containing changed particles

⚠️  Particle version changed: food.add (a3f8c2d1 → b7e2f4a9)

Affected test paths (12):
  - L1.P2.D3.E4.C5.G6.I7.A1  (food.add basic)
  - L1.P2.D3.E4.C5.G6.I7.A1.V2  (food.add + read)
  ...

Action required: Review and approve particle migration

No silent breakage. If a particle changes, every test that uses it is flagged. The system knows the difference between "the test still works" and "the test works but the particle underneath changed."
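The comparison step can be sketched as a diff between stored and freshly computed hashes, followed by a lookup of the paths that use each changed particle. The record shapes and the `detectDrift` name are assumptions; hashes are passed in as plain strings so the sketch does not depend on how they were computed.

```typescript
interface DriftReport {
  particle: string;
  from: string;             // hash stored in the registry
  to: string;               // hash computed from the current definition
  affectedPaths: string[];  // test paths that include this particle
}

function detectDrift(
  stored: Record<string, string>,
  current: Record<string, string>,
  pathsByParticle: Record<string, string[]>,
): DriftReport[] {
  const reports: DriftReport[] = [];
  for (const particle of Object.keys(current)) {
    const from = stored[particle];
    const to = current[particle];
    // New particles (no stored hash) are not drift; only changed hashes are.
    if (from !== undefined && from !== to) {
      reports.push({
        particle,
        from,
        to,
        affectedPaths: pathsByParticle[particle] ?? [],
      });
    }
  }
  return reports;
}
```

Each report carries enough information to print a warning like the one shown above and to block the PR until the migration is approved.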

What Makes This Different

Graph-native test orchestration

Tests are paths through a DAG, not scripts in a folder. The graph validates, discovers, and optimises test execution automatically.

True parallel isolation

PostgreSQL template cloning gives every test its own database in under 100ms. No shared state, no fixture drift, no sequential bottlenecks.

Composable by construction

Define particles once, combine infinitely. Any valid path through the graph is a runnable test — including paths nobody explicitly wrote.

Performance regression built in

Every particle is metered. Every run is tracked. Regressions correlate to PRs automatically. Performance isn't a separate concern — it's part of every test.

Tech Stack

Component	Technology
Language	TypeScript
Test Runner	Vitest
HTTP Client	undici / supertest
Database	PostgreSQL
Ephemeral DBs	PostgreSQL template cloning
Graph Layout	Dagre (automatic hierarchical)
Visualisation	ReactFlow
State Management	Zustand
Frontend	TanStack Start + Tailwind
Data Fetching	TanStack Query

How It Connects

PTF shares philosophy with the rest of the toolchain:

Bug Boomerang — autonomous bug fix pipeline. Bug Boomerang creates PRs; PTF validates them. Every bug fix adds a regression test particle.
AI QA — browser-based QA testing. AI QA validates UI behaviour; PTF validates data integrity. Different layers, complementary coverage.
The same ephemeral database pattern (snapshot → clone → test → destroy) appears in Bug Boomerang's sandbox environment. PTF takes that pattern and makes it the foundation of the entire test suite.

References

PostgreSQL Template Databases: cloning a database from a template with CREATE DATABASE ... TEMPLATE (PostgreSQL docs)
ReactFlow: graph visualisation for React (reactflow.dev)
Dagre: directed graph layout engine (GitHub)
Vitest: fast TypeScript test runner (vitest.dev)

Particle Testing Framework