Methodology
- Rule: `rule-bag-policy@7` — 8 nodes (input → 2 string filters → AND → product → set-property mutator → lookup-replace mutator → calc → output). Real-world shape, exercises every evaluator family.
- Request: `fixtures/scenarios/s-bag-3pc-markup15.json` — 3-piece bag, 15% markup, route LHR→DXB.
- Hardware: a single laptop, NVMe SSD, .NET 9.
- Warmup: 100 iterations before timing begins.
- Modes: warm (default) reuses cached sources; cold rebuilds them per request to surface DF round-trip cost.
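The percentile columns below follow the usual drill: discard the warmup iterations, time each request, sort, take nearest-rank percentiles. A minimal sketch of that reduction (Python, for illustration only; the actual numbers come from the RuleForge CLI bench command):

```python
def percentiles(samples_ms, ps=(50, 95, 99)):
    """Nearest-rank percentiles over per-request latencies in milliseconds."""
    s = sorted(samples_ms)
    out = {}
    for p in ps:
        # nearest-rank: ceil(p/100 * n), 1-indexed into the sorted samples
        k = max(1, -(-p * len(s) // 100))
        out[f"p{p}"] = s[k - 1]
    return out
```

Nearest-rank is deliberate: it always reports an observed latency rather than an interpolated one, which matters for tail metrics like p99.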
Headline numbers
| Mode | Source | p50 | p95 | p99 | req/s |
|---|---|---|---|---|---|
| Warm, 1 worker | local file | 0.07 ms | 0.09 ms | 0.14 ms | 13,333 |
| Warm, 1 worker | local dfdb | 0.07 ms | 0.08 ms | 0.12 ms | 13,889 |
| Warm, 1 worker | prod DF (Render) | 0.06 ms | 0.08 ms | 0.10 ms | 14,706 |
| Warm, 16 workers | local dfdb | 0.13 ms | 0.23 ms | 1.45 ms | 73,529 |
| Cold (fresh source per req) | local dfdb | 2.51 ms | 3.68 ms | 6.21 ms | 375 |
| Cold (fresh source per req) | prod DF (Render) | 1474 ms | 1664 ms | 1838 ms | 1 |
What the numbers mean
Warm steady-state is source-agnostic
The first three rows are statistically identical because the engine's per-(ruleId, version) and per-referenceSet caches mean DocumentForge is hit once, ever, per pinned snapshot. Whether DF is a local file, a loopback HTTP node, or a Render-hosted instance is irrelevant once caches are warm. The 70µs cost is dominated by JSONPath traversal and filter evaluation — pure CPU work.
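The warm-path behavior amounts to a cache keyed by the pinned tuple: the backing store is consulted at most once per key, and every subsequent request is a dictionary hit. A toy illustration (Python; the class and method names are illustrative, not the engine's actual API):

```python
class SnapshotCache:
    """Fetch a rule snapshot at most once per (rule_id, version) key.
    Versions are immutable, so entries never need invalidation."""

    def __init__(self, fetch):
        self._fetch = fetch      # stands in for a DocumentForge round-trip
        self._cache = {}
        self.fetch_count = 0

    def get(self, rule_id, version):
        key = (rule_id, version)
        if key not in self._cache:
            self.fetch_count += 1
            self._cache[key] = self._fetch(rule_id, version)
        return self._cache[key]
```

Because the key includes the version, publishing a new version populates a new entry instead of invalidating an old one, which is why the warm numbers are identical across sources.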
Cold-path matters only at boot or cache miss
The last two rows show what happens when the engine has to refetch from DF every request. Local dfdb on loopback completes in 2.5ms — three SQL queries plus a reference-set fetch. Cross-region DF goes 600× slower because of TLS handshake + the Render instance cold-starting. In production this happens at boot, on publish (cache invalidation), and on pod restart — never on every request.
Concurrency scales well
16 workers reach 73K req/s — roughly 5× the single-worker number (sub-linear, but with p99 still under 1.5 ms). The runner is lock-free on the hot path: rule snapshots and reference sets live in `ConcurrentDictionary` caches; the DAG walker holds zero shared mutable state.
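The walker's "wait for deps" queue can be pictured as follows. This is a Python illustration of the queue-and-gate idea, not the engine's C# code, and it assumes the graph was already cycle-checked at validate time (as the real walker's is):

```python
from collections import deque

def walk(nodes, deps, evaluate):
    """Single-pass DAG evaluation: a node runs only once all of its
    dependencies have produced results; no node is ever re-evaluated.
    nodes: iterable of node ids; deps: id -> list of dependency ids."""
    results = {}
    queue = deque(nodes)
    while queue:
        node = queue.popleft()
        if all(d in results for d in deps.get(node, [])):
            results[node] = evaluate(node, [results[d] for d in deps.get(node, [])])
        else:
            queue.append(node)  # a dependency isn't ready yet; revisit later
    return results
```

All state lives in locals (`results`, `queue`), which is what lets many workers run walks concurrently without coordination.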
What RuleForge does to keep the hot path fast
- Source caching. `DocumentForgeRuleSource` caches every `(ruleId, version)` tuple indefinitely (versions are immutable). Environment bindings cache for 30 seconds.
- JSONPath subset, hand-rolled. No regex, no AST allocation. Five tokens, walked iteratively.
- Tight evaluator types. Each filter evaluator is a static method with no allocations on the happy path beyond the resolved-values list.
- Single-pass DAG walk. Topological-ish queue with a "wait for deps" gate. No re-evaluation. Cycle check happens once, at validate time.
- Shared `JsonSerializerOptions`. One global instance with camelCase + null-omit + case-insensitive reads. STJ caches metadata against it; no per-call cost.
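A hand-rolled subset walker of that kind consumes the path character by character with no regex and no AST. A Python sketch of the shape (the real engine's five-token subset and its error semantics may differ):

```python
def resolve_path(doc, path):
    """Walk a tiny JSONPath subset ($.a.b[0].c) iteratively.
    Supports: root '$', dot member access, integer array index."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    current, i = doc, 1
    while i < len(path):
        if path[i] == ".":                    # member access: .name
            j = i + 1
            while j < len(path) and path[j] not in ".[":
                j += 1
            current = current[path[i + 1:j]]
            i = j
        elif path[i] == "[":                  # array index: [n]
            j = path.index("]", i)
            current = current[int(path[i + 1:j])]
            i = j + 1
        else:
            raise ValueError(f"unexpected character at position {i}")
    return current
```

The point of the exercise is allocation discipline: the only objects created per call are the substrings for member names and indices.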
Co-location matters more than embedding
Going further — embedding DocumentForge as an in-process library — would shave the 2.5ms cold path to ~50µs. That's a 50× win on a code path that almost never fires, and not worth the deploy-story coupling. Run dfdb as a sidecar on the same host (Docker Compose, Render private service, k8s pod) and you get 99% of the benefit at 5% of the integration cost.
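A co-located setup is a few lines of configuration. A hypothetical Docker Compose sketch (service names, image tag, ports, and the environment variable are assumptions for illustration, not the project's published setup):

```yaml
services:
  ruleforge:
    build: .
    ports: ["8080:8080"]
    environment:
      # hypothetical setting: a loopback-equivalent hop to the sidecar,
      # not a cross-region call
      DF_BASE_URL: http://dfdb:5000
    depends_on: [dfdb]
  dfdb:
    image: documentforge/dfdb:latest   # hypothetical image name
    expose: ["5000"]
```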
Reproducing
```shell
git clone https://github.com/tailwind-retailing/ruleforge.git
cd ruleforge
dotnet build

# Warm bench, local file source
dotnet run --project src/RuleForge.Cli -- bench \
  --endpoint /v1/ancillary/bag-policy \
  --request '@fixtures/scenarios/s-bag-3pc-markup15.json' \
  --warmup 100 --iterations 2000

# Concurrent bench against local dfdb
dotnet run --project src/RuleForge.Cli -- bench \
  --endpoint /v1/ancillary/bag-policy \
  --request '@fixtures/scenarios/s-bag-3pc-markup15.json' \
  --df --df-base http://localhost:5000 \
  --warmup 100 --iterations 10000 --concurrency 16
```
Where you'd profile next
If your rule graphs balloon past 50 nodes or your reference sets exceed ~1k rows, the next bottlenecks, in order, are:
- Reference-set linear scan inside lookup-replace — currently O(rows). Index by the `matchOn` columns to make it O(log n).
- Calc-node expression parsing — NCalc parses fresh per call. Cache the compiled `LogicalExpression` by expression string.
- Trace allocation in debug mode — per-node TraceEntry + ctx snapshots add allocations. Production mode (no `--debug` / no `?debug=true`) skips this entirely.
None of these matter at the current scale; flagged here so you know where to dig if you find yourself off the cliff.
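The compiled-expression cache is a one-line memoization in most stacks. A Python sketch of the pattern (NCalc's `LogicalExpression` is .NET; Python's builtin `compile` stands in here for the parse step, and the sandboxed `eval` is purely illustrative):

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def compile_expr(expr: str):
    """Parse/compile once per distinct expression string; reuse thereafter.
    Stand-in for caching a compiled NCalc expression keyed by source text."""
    return compile(expr, "<calc-node>", "eval")

def evaluate(expr: str, variables: dict):
    # re-evaluating with new variables reuses the cached compilation
    return eval(compile_expr(expr), {"__builtins__": {}}, variables)
```

Since rule graphs contain a small, fixed set of expression strings, the cache hit rate is effectively 100% after the first request per snapshot.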