Methodology
- Rule: `rule-bag-policy@7` — 8 nodes (input → 2 string filters → AND → product → set-property mutator → lookup-replace mutator → calc → output). Real-world shape, exercises every evaluator family.
- Request: `fixtures/scenarios/s-bag-3pc-markup15.json` — 3-piece bag, 15% markup, route LHR→DXB.
- Hardware: a single laptop, NVMe SSD, .NET 9.
- Warmup: 100 iterations before timing begins.
- Modes: warm (default) reuses cached sources; cold rebuilds them per request to surface DF round-trip cost.
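The percentile columns below follow the usual drill: discard the warmup iterations, time each request, sort, take nearest-rank percentiles. A minimal sketch of that reduction (Python, for illustration only; the actual numbers come from the RuleForge CLI bench command):

```python
def percentiles(samples_ms, ps=(50, 95, 99)):
    """Nearest-rank percentiles over per-request latencies in milliseconds."""
    s = sorted(samples_ms)
    out = {}
    for p in ps:
        # nearest-rank: ceil(p/100 * n), 1-indexed into the sorted samples
        k = max(1, -(-p * len(s) // 100))
        out[f"p{p}"] = s[k - 1]
    return out
```

Nearest-rank is deliberate: it always reports an observed latency rather than an interpolated one, which matters for tail metrics like p99.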
Headline numbers
| Mode | Source | p50 | p95 | p99 | req/s |
|---|---|---|---|---|---|
| Warm, 1 worker | local file | 0.07 ms | 0.09 ms | 0.14 ms | 13,333 |
| Warm, 1 worker | local dfdb | 0.07 ms | 0.08 ms | 0.12 ms | 13,889 |
| Warm, 1 worker | prod DF (Render) | 0.06 ms | 0.08 ms | 0.10 ms | 14,706 |
| Warm, 16 workers | local dfdb | 0.13 ms | 0.23 ms | 1.45 ms | 73,529 |
| Cold (fresh source per req) | local dfdb | 2.51 ms | 3.68 ms | 6.21 ms | 375 |
| Cold (fresh source per req) | prod DF (Render) | 1474 ms | 1664 ms | 1838 ms | 1 |
What the numbers mean
Warm steady-state is source-agnostic
The first three rows are statistically identical because the engine's per-(ruleId, version) and per-referenceSet caches mean DocumentForge is hit once, ever, per pinned snapshot. Whether DF is a local file, a loopback HTTP node, or a Render-hosted instance is irrelevant once caches are warm. The 70µs cost is dominated by JSONPath traversal and filter evaluation — pure CPU work.
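The warm-path behavior amounts to a cache keyed by the pinned tuple: the backing store is consulted at most once per key, and every subsequent request is a dictionary hit. A toy illustration (Python; the class and method names are illustrative, not the engine's actual API):

```python
class SnapshotCache:
    """Fetch a rule snapshot at most once per (rule_id, version) key.
    Versions are immutable, so entries never need invalidation."""

    def __init__(self, fetch):
        self._fetch = fetch      # stands in for a DocumentForge round-trip
        self._cache = {}
        self.fetch_count = 0

    def get(self, rule_id, version):
        key = (rule_id, version)
        if key not in self._cache:
            self.fetch_count += 1
            self._cache[key] = self._fetch(rule_id, version)
        return self._cache[key]
```

Because the key includes the version, publishing a new version populates a new entry instead of invalidating an old one, which is why the warm numbers are identical across sources.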
Cold-path matters only at boot or cache miss
The last two rows show what happens when the engine has to refetch from DF every request. Local dfdb on loopback completes in 2.5ms — three SQL queries plus a reference-set fetch. Cross-region DF goes 600× slower because of TLS handshake + the Render instance cold-starting. In production this happens at boot, on publish (cache invalidation), and on pod restart — never on every request.
Concurrency scales well
16 workers reach 73K req/s — roughly 5× the single-worker number (sub-linear, but with p99 still under 1.5 ms). The runner is lock-free on the hot path: rule snapshots and reference sets live in `ConcurrentDictionary` caches; the DAG walker holds zero shared mutable state.
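The walker's "wait for deps" queue can be pictured as follows. This is a Python illustration of the queue-and-gate idea, not the engine's C# code, and it assumes the graph was already cycle-checked at validate time (as the real walker's is):

```python
from collections import deque

def walk(nodes, deps, evaluate):
    """Single-pass DAG evaluation: a node runs only once all of its
    dependencies have produced results; no node is ever re-evaluated.
    nodes: iterable of node ids; deps: id -> list of dependency ids."""
    results = {}
    queue = deque(nodes)
    while queue:
        node = queue.popleft()
        if all(d in results for d in deps.get(node, [])):
            results[node] = evaluate(node, [results[d] for d in deps.get(node, [])])
        else:
            queue.append(node)  # a dependency isn't ready yet; revisit later
    return results
```

All state lives in locals (`results`, `queue`), which is what lets many workers run walks concurrently without coordination.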
What RuleForge does to keep the hot path fast
- Source caching. `DocumentForgeRuleSource` caches every `(ruleId, version)` tuple indefinitely (versions are immutable). Environment bindings cache for 30 seconds.
- JSONPath subset, hand-rolled. No regex, no AST allocation. Five tokens, walked iteratively.
- Tight evaluator types. Each filter evaluator is a static method with no allocations on the happy path beyond the resolved-values list.
- Single-pass DAG walk. Topological-ish queue with a "wait for deps" gate. No re-evaluation. Cycle check happens once, at validate time.
- Shared `JsonSerializerOptions`. One global instance with camelCase + null-omit + case-insensitive reads. STJ caches metadata against it; no per-call cost.
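A hand-rolled subset walker of that kind consumes the path character by character with no regex and no AST. A Python sketch of the shape (the real engine's five-token subset and its error semantics may differ):

```python
def resolve_path(doc, path):
    """Walk a tiny JSONPath subset ($.a.b[0].c) iteratively.
    Supports: root '$', dot member access, integer array index."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    current, i = doc, 1
    while i < len(path):
        if path[i] == ".":                    # member access: .name
            j = i + 1
            while j < len(path) and path[j] not in ".[":
                j += 1
            current = current[path[i + 1:j]]
            i = j
        elif path[i] == "[":                  # array index: [n]
            j = path.index("]", i)
            current = current[int(path[i + 1:j])]
            i = j + 1
        else:
            raise ValueError(f"unexpected character at position {i}")
    return current
```

The point of the exercise is allocation discipline: the only objects created per call are the substrings for member names and indices.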
Co-location matters more than embedding
Going further — embedding DocumentForge as an in-process library — would shave the 2.5ms cold path to ~50µs. That's a 50× win on a code path that almost never fires, and not worth the deploy-story coupling. Run dfdb as a sidecar on the same host (Docker Compose, Render private service, k8s pod) and you get 99% of the benefit at 5% of the integration cost.
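A co-located setup is a few lines of configuration. A hypothetical Docker Compose sketch (service names, image tag, ports, and the environment variable are assumptions for illustration, not the project's published setup):

```yaml
services:
  ruleforge:
    build: .
    ports: ["8080:8080"]
    environment:
      # hypothetical setting: a loopback-equivalent hop to the sidecar,
      # not a cross-region call
      DF_BASE_URL: http://dfdb:5000
    depends_on: [dfdb]
  dfdb:
    image: documentforge/dfdb:latest   # hypothetical image name
    expose: ["5000"]
```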
Reproducing
```shell
git clone https://github.com/tailwind-retailing/ruleforge.git
cd ruleforge
dotnet build

# Warm bench, local file source
dotnet run --project src/RuleForge.Cli -- bench \
  --endpoint /v1/ancillary/bag-policy \
  --request '@fixtures/scenarios/s-bag-3pc-markup15.json' \
  --warmup 100 --iterations 2000

# Concurrent bench against local dfdb
dotnet run --project src/RuleForge.Cli -- bench \
  --endpoint /v1/ancillary/bag-policy \
  --request '@fixtures/scenarios/s-bag-3pc-markup15.json' \
  --df --df-base http://localhost:5000 \
  --warmup 100 --iterations 10000 --concurrency 16
```
Where you'd profile next
If your rule graphs balloon past 50 nodes or your reference sets exceed ~1k rows, the next bottlenecks, in order, are:
- Reference-set linear scan inside lookup-replace — currently O(rows). Index by the `matchOn` columns to make it O(log n).
- Calc-node expression parsing — NCalc parses fresh per call. Cache the compiled `LogicalExpression` by expression string.
- Trace allocation in debug mode — per-node TraceEntry + ctx snapshots add allocations. Production mode (no `--debug` / no `?debug=true`) skips this entirely.
None of these matter at the current scale; flagged here so you know where to dig if you find yourself off the cliff.
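The compiled-expression cache is a one-line memoization in most stacks. A Python sketch of the pattern (NCalc's `LogicalExpression` is .NET; Python's builtin `compile` stands in here for the parse step, and the sandboxed `eval` is purely illustrative):

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def compile_expr(expr: str):
    """Parse/compile once per distinct expression string; reuse thereafter.
    Stand-in for caching a compiled NCalc expression keyed by source text."""
    return compile(expr, "<calc-node>", "eval")

def evaluate(expr: str, variables: dict):
    # re-evaluating with new variables reuses the cached compilation
    return eval(compile_expr(expr), {"__builtins__": {}}, variables)
```

Since rule graphs contain a small, fixed set of expression strings, the cache hit rate is effectively 100% after the first request per snapshot.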