Elite Intelligence Desk · Backtest Build Roadmap

🆕 2026-05-15 LATE NIGHT · Council Mode 3 Redesign: 4-Tool Evidence Ladder

Literature research + Council Mode 3 → backtest replaced with multi-tool evidence pipeline

Trigger: literature research showed engine scored 1/9 met, 2/9 partial, 6/9 not met on industry preconditions for parity backtest. We have the Lean wall-clock-coupling property without Lean's BacktestClock framework. Lopez de Prado's warning: "backtesting before parameters are forward-test-validated = discovery on data the strategy already saw = highest overfitting risk."

Council Mode 3 verdict (all 5 advisors converged): don't redesign the backtest — replace it with a 4-tool evidence ladder.

#	Tool	Speed	Accuracy	Status
1	Forward-test (continuous)	Calendar-bound	HIGHEST	PRIMARY edge discovery
2	CPCV + DSR/PBO on shadow log	Instant on existing data	HIGH	NEW — Sat 2026-05-17
3	Directional WS-replay (Option C)	Fast (1-2d per A/B)	MEDIUM (relative only)	EXISTING — niche A/B tool
4	Quantower historical tick-replay	UNKNOWN — depends on probe	MEDIUM-HIGH (genuine OOS)	GATED on variance probe

Immediate next step (Sat 2026-05-17 ~3-4h): variance characterization probe at 2× / 5× / 10× replay speeds × 3 iterations on a 1-hour window from research_data/live/20260513_engine_input.jsonl. Decision gate: CV ≤ 0.3 at 5× → ship Tool 4 (6 months OOS data in ~36 days). CV > 0.5 at 5× → Tool 4 dies cleanly, stick with Tools 1+2+3.
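The decision gate above can be sketched as a small check. This is a minimal sketch, assuming CV means coefficient of variation (sample standard deviation over mean) of some per-iteration metric such as fire count; the roadmap does not name the metric, and it does not define an outcome for 0.3 < CV ≤ 0.5, so that band is labeled INCONCLUSIVE here as an assumption. The function names and the sample counts are hypothetical.

```python
from statistics import mean, stdev

SHIP_MAX_CV = 0.3  # roadmap gate: CV <= 0.3 at 5x -> ship Tool 4
KILL_MIN_CV = 0.5  # roadmap gate: CV > 0.5 at 5x -> drop Tool 4


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the absolute mean."""
    if len(samples) < 2:
        raise ValueError("need at least two iterations to estimate variance")
    mu = mean(samples)
    if mu == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return stdev(samples) / abs(mu)


def tool4_gate(cv_at_5x):
    """Map the 5x CV onto the roadmap's decision gate."""
    if cv_at_5x <= SHIP_MAX_CV:
        return "SHIP_TOOL_4"
    if cv_at_5x > KILL_MIN_CV:
        return "DROP_TOOL_4"
    # Assumption: the roadmap leaves the middle band undefined.
    return "INCONCLUSIVE"


# Hypothetical fire counts, 3 iterations per replay speed (not real data).
probe = {2: [13, 13, 14], 5: [12, 14, 13], 10: [9, 15, 6]}

for speed, counts in probe.items():
    print(f"{speed}x: CV = {coefficient_of_variation(counts):.3f}")
print(tool4_gate(coefficient_of_variation(probe[5])))
```

Only the 5× result feeds the gate; the 2× and 10× runs show how variance moves with replay speed.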

Cancelled (formally): parity backtest, jsdom replay extension, Quantower-substrate replay extension, multi-day stitching at compressed speeds.

Pivot Trigger for the backtest project: month 6 (2026-11-13) — if forward-test + CPCV show no edge in any regime, backtest project is formally cancelled. Tools 3+4 only ever activate when forward-test surfaces a specific A/B question on a strategy with demonstrated edge.

✅ PHASE E · Backtest Reframed (Option C) · historical context

5 council sessions · 4 measurement tracks · framework redefined 2026-05-15 (superseded by late-night 4-tool ladder above)

Full arc: memory/project_session_handoff_2026-05-14_late_night.md · Module inventory: project_wall_clock_modules_inventory_2026-05-14.md · Production survey: project_engine_architecture_survey_2026-05-14.md

⭐ Mission · What this backtest is, and is not

Backtest is a directional probe for the engine — relative comparisons, edge-case stress tests, sanity checks. NOT a strict-determinism instrument. Aggregate metrics use bootstrap 95% CIs to absorb the engine's bounded variance.

✓ Use Backtest For

Relative A/B comparisons (config X vs Y)
Edge-case stress tests on engine logic
Engine variance characterization across replays
Sanity checks on logic changes
Aggregate metrics with bootstrap CIs

✗ Do NOT Use For

Strict per-fire parity claims
Replacing live forward-test for calibration
Phase 4 sample-acceleration (10wk → 1-2wk promise)
Absolute outcome predictions
Anything before engine confidence is settled (M3 rule)

⏱ Timeline · Honest Cut 2026-05-14 · post real-Chrome pivot 3 wk

Stage	Effort	Cumul	Calendar	Status
B0	shipped	—	05-13	✅
~~B0.5 jsdom (3 sess)~~	~~3 sess~~	—	parked	🔴
B0.6 real-Chrome WS (G1+G2)	1 sess	1 sess	05-14 overnight	✅
B0.6 G3/G4 close (substrate WS debug)	0.5–1 sess	~2 sess	05-14 → 05-15	🟡
B1 (replay harden + headless option)	2–3 d	~5 d	05-16 → 05-19	—
B2	1 d	~6 d	05-20	—
B3	1 d	~7 d	05-21	—
B4	1–2 d	~9 d	05-22 → 05-24	—
B5	2–3 d	~12 d	05-25 → 05-27	—
B6	1 d	~13 d	05-28	—
B7	1 d	~14 d	05-29	—
B8	4–5 d	~19 d	05-31 → 06-05	—

🟢 Upside

~2026-05-30 · ~2.5 wk · substrate WS fix is small

⚪ Centerline

2026-06-05 · ~3 wk · 2 d ahead of pre-pivot estimate

🔴 Pessimistic

~2026-06-15 · ~4.5 wk · additional state-warming issues

Phase A · 4 Gates

Stage 0 · live capture ✅
Engine determinism ✅
Real-Chrome WS replay ✅
Gate 1 · first-fire ✅ 13 fires emitted 05-14
Gate 2 · schema ✅ by construction
Gate 3 · value parity 🟡 partial (20:45 setupIds match · EV gap)
Gate 4 · ≥95% match 🟡 2/14 (substrate-warmup next)

Parked: jsdom replay path (engine OOM at 30-50K events, 3 sessions sunk). Pivoted to real Chrome + tiny Node WS server. Same engine code, same fields, no DOM emulation hacks. Speed ceiling established: 40× stable, 50× disconnects.

Substrate Capture

Shipped · 05-13 04:30 IL · Quantower → JSONL · IB sessions verified

0.6 · Real-Chrome WS Replay ✅ Gate 1+2

05-14 overnight: Node WS server → isolated Chrome → 61,618 events / 17.4h replay in 26 min wall-clock. 13 fires. 40× stable / 50× cliff. Substrate-warmup debug next.

1·2 · Replay Harden + Reconcile

Next · 3–4 d total · deterministic + parity locked into regression suite

3·4 · Resolver + Cost Model

2–3 d · slippage + commission + NO_FILL honored · sensitivity curve

5·6 · Report CIs + Walk-Forward

3–4 d · bootstrap 95% CI · train/test/holdout · hard refuse on tuned-window eval

7·8 · A/B + Candidate Setups

5–6 d · ORB + VWAP-2σ + Gap-Fill · Q1 evidence file

🎉 2026-05-14 BREAKTHROUGH · Real Chrome WS Replay

jsdom path dead · Real Chrome + tiny Node WS server works · Gate 1+2 PASSED in one session

✅ The new primitive works. Node WS server (backtest_harness/tools/replay_ws_server.js) streams recorded JSONL → isolated Chrome running index-v2-replay.html. 61,618 events in 26 min wall-clock, zero disconnects, 13 fires emitted. First fires from a backtest replay in this entire effort.

🟡 Speed cliff = 40×. Validated 2× / 10× / 40× stable. 50× = deterministic WS-disconnect mid-stream. 40× = 17h day in 26 min, fast enough for real iteration. Headless-Chrome multiplier (3–5×) reserved for B1 if needed (would give ~120× effective at the 3× end).

🟡 Cold-start gap remains. Backtest fires ib-brk-L + ib-ext-L at 20:45 UTC, matching live's late cluster (Gate 3 partial). But EV values diverge (backtest negative ARMED vs live positive SIGNAL) because the cold bar buffer misclassifies the day as "compression." Fix: Quantower substrate with 9h warmup. The first attempt hit a WS-disconnect storm and needs Chrome Network-tab inspection.
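The speed figures above follow from simple ratios, sketched here as a sanity check. The helper names are hypothetical, and the 180-day figure for "6 months" is an assumption used to reproduce the roadmap's Tool 4 estimate.

```python
def speedup(replayed_hours, wall_clock_minutes):
    """Replay speed multiple: replayed time divided by wall-clock time."""
    return replayed_hours * 60 / wall_clock_minutes


def wall_clock_days(calendar_days, speed):
    """Wall-clock days needed to replay a span of calendar days at a given speed."""
    return calendar_days / speed


# 17.4 h of recorded events replayed in 26 min of wall-clock time.
observed = speedup(17.4, 26)

# Headless-Chrome multiplier of 3x to 5x on top of the stable 40x.
headless_low, headless_high = 40 * 3, 40 * 5

# Assumption: "6 months" taken as 180 calendar days, replayed at 5x.
tool4_days = wall_clock_days(180, 5)

print(f"observed speedup: {observed:.1f}x")
print(f"headless range: {headless_low}x to {headless_high}x")
print(f"6 months at 5x: {tool4_days:.0f} wall-clock days")
```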

🚦 Binding Constraints (7)

C1 Parity gate before any setup-decision use. B2 reconciliation must pass.
C2 Cost & fill model non-optional. Zero-cost inflates PF 0.2–0.4 per setup.
C3 Bootstrap 95% CI on every metric. "PF 1.4" lies · "PF 1.4 ± 0.5" honest.
C4 Walk-forward split train/test/holdout. No metric on tuned window.
C5 Deterministic replay: same input → byte-identical output.
C6 Never mix backtest output with live JSONLs. Separate dir, separate prefix.
C7 Live engine untouched. Shims live in harness layer only.
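C3 can be illustrated with a percentile bootstrap on profit factor. This is a minimal sketch, not the project's weekly_report.py: the function names, the resample count, and the sample R-multiples are hypothetical. The 30-trade floor mirrors the INSUFFICIENT_SAMPLE rule listed under B5, and the fixed seed keeps the output repeatable in the spirit of C5.

```python
import random

MIN_TRADES = 30  # below this the report returns INSUFFICIENT_SAMPLE (B5 rule)


def profit_factor(r_multiples):
    """Gross gains divided by gross losses (infinite if there are no losses)."""
    gains = sum(r for r in r_multiples if r > 0)
    losses = -sum(r for r in r_multiples if r < 0)
    return gains / losses if losses > 0 else float("inf")


def bootstrap_ci(values, stat, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample with replacement, take the tail quantiles."""
    rng = random.Random(seed)  # fixed seed so reruns give identical intervals
    n = len(values)
    stats = sorted(stat(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


def report_pf(r_multiples):
    """Format PF with its 95% CI, or refuse when the sample is too small."""
    if len(r_multiples) < MIN_TRADES:
        return "INSUFFICIENT_SAMPLE"
    lo, hi = bootstrap_ci(r_multiples, profit_factor)
    return f"PF {profit_factor(r_multiples):.2f} (95% CI {lo:.2f} to {hi:.2f})"


# Hypothetical per-trade R-multiples, 36 trades (not real results).
sample_r = [1.5, -1.0, 2.0, -1.0, -1.0, 0.8] * 6
print(report_pf(sample_r))
print(report_pf(sample_r[:10]))
```

A wide interval on a few dozen trades is the point of the constraint: the bare point estimate hides how little the sample supports it.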

🪲 B0.5 Risks · Post-Pass-2 · After 2026-05-13 evening (5)

R1 ✅ CONFIRMED MATERIAL · dominant blocker. Two layers: (a) yestClose/yestHigh/yestLow/settlement = 0 → engine reads massive implied gap → wrong directional bias (backtest 100% LONG vs live 98% SHORT); (b) recentBarDeltasOfficial empty → plan validator can't compute stop/target → 0/111 valid plans. Pass 3 (B0.5.7d + e) targets both.
R2 ✅ MEASURED + STABLE. 78 scripts load with 7 shim categories. No new surface across 3,604 and 1,190 frame replays. B1 estimate 2–3 d holds.
R3 ⚠ DEMOTED · not the issue. Density fix (50→10 quote sampling) lifted parity 50.5→63.1% on same data. So density was the constraint, not tick-fidelity. Stream-replay rebuild NOT warranted.
R4 ✅ CONFIRMED · Cold-start divergence. Backtest fires LONG (recent-momentum bounce context) vs live SHORT (multi-day trend context). Warmup-window idea was a wrong-shaped fix; what we need is prior-day data seeding via R1 enrichment.
R5 🆕 NEW · Pass-2 finding · Intent vs plan parity divergence. Engine emits decisionState=SHADOW early in the pipeline. Live bridge only persists fires that passed planValidate(). Apples-to-apples = valid-plan fires only. Comparing raw intents to bridge-filtered fires overstates parity.
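R5 can be made concrete with a parity calculation that filters backtest fires to valid plans before matching. This is a minimal sketch under stated assumptions: the record field names (ts, setupId, entry, stop, target), the match key, and the definition of parity as the fraction of live fires matched are all hypothetical; only the "entry/stop/target > 0" validity rule comes from the roadmap.

```python
def is_valid_plan(fire):
    """A plan is valid when entry, stop and target are all present and > 0."""
    return all((fire.get(k) or 0) > 0 for k in ("entry", "stop", "target"))


def fire_key(fire):
    """Hypothetical match key: timestamp plus setup id."""
    return (fire["ts"], fire["setupId"])


def parity(live_fires, backtest_fires, valid_only=True):
    """Fraction of live fires that have a matching backtest fire."""
    if valid_only:
        backtest_fires = [f for f in backtest_fires if is_valid_plan(f)]
    live_keys = {fire_key(f) for f in live_fires}
    bt_keys = {fire_key(f) for f in backtest_fires}
    if not live_keys:
        return 0.0
    return len(live_keys & bt_keys) / len(live_keys)


# Hypothetical records (not real fires): the second backtest fire has no stop.
live = [
    {"ts": "20:45", "setupId": "ib-brk-L", "entry": 100.0, "stop": 95.0, "target": 110.0},
    {"ts": "20:45", "setupId": "ib-ext-L", "entry": 101.0, "stop": 96.0, "target": 111.0},
]
backtest = [
    {"ts": "20:45", "setupId": "ib-brk-L", "entry": 100.0, "stop": 95.0, "target": 110.0},
    {"ts": "20:45", "setupId": "ib-ext-L", "entry": 101.0, "stop": 0, "target": 111.0},
]

print("raw intent parity:", parity(live, backtest, valid_only=False))
print("valid-plan parity:", parity(live, backtest, valid_only=True))
```

The gap between the two numbers is the overstatement R5 warns about.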

🔗 Cross-Phase Deps · Hidden cost surfaces (3)

B5↔P3.9 Regime breakdown needs a regime classifier. That's Phase 3.9 work (not yet built). Two paths: (a) skip regime columns in v1, add when P3.9 lands · (b) pull P3.9 forward at a cost of +3–5 d. Decision at B5 start.
B1↔B0.5 B1 effort is conditional on B0.5 findings. Estimate 2–3 days assumes typical shim surface. If B0.5 surfaces many unexpected browser deps → expand to 4–5 d.
B5,B8↔Q1 Q1 council target date depends on B5 + B8 landing on time · regime breakdown availability shapes evidence quality.

🎯 North-Star Alignment vs master ROADMAP

Serves

T1 E[R] > 0 at fire
T2 mean realized R (accel)
T6 no-fire discipline

Does NOT serve

T4 regime stability (can't re-roll history)
T5 time stability (needs live forward)

📐 Detailed Phases · B0 shipped · B0.5 active · B1–B8 conditional (9)

B0 Substrate Capture: Quantower Backtest → NQEliteBacktestStrategy → JSONL · 50-quote event-paced · Symbol.LastDateTime · IB ranges verified May 12 (Asia 105pt · LDN 66pt · NY 181pt).
B0.5 Parity Harness · Council pivot to 3-stage architecture: after 6 patches couldn't lift parity past 3.6%, council pivoted to a research-backed (NautilusTrader/Lean/Lopez de Prado) sequenced approach. Pre-stage ✅: engine determinism proven (byte-identical fire output, 2 runs same substrate). Stage 0 ✅: Python bridge patched to capture every engine-bound WS message to <date>_engine_input.jsonl · bridge restarted · capture growing. Stage 1 ✅ built: replay_recorded.js replays captured events through harness, diffs vs live · smoke test 78/78 scripts, 0 errors. Awaiting meaningful capture window with live fires. Verdict gate ≥95% on ev_calibration_log per Memory 5. Stages 2 (snapshot-to-event expander) + 3 (warmup phase) conditional on Stage 1 verdict.
B1 Engine Replay Hardening: productionize harness · stub every browser dep · determinism test (3× byte-diff) · warmup period · multi-day stitched replay. 2–3 d conditional on R2.
B2 Reconciliation Gate: extend parity to 3 reference days · all must pass tolerance · lock into regression suite. 1 d.
B3 Resolver Adapter: extend resolve_shadow_outcomes.py for backtest substrate · same buckets (TARGET/STOP/DRIFT/NO_FILL) · output to research_data/backtest/resolved/. 1 d.
B4 Cost & Fill Model: slippage default 1 tick · NQ commission · limit fill: touch ≠ fill, trade-through ≥1 tick = fill · stop +1 tick adverse · NO_FILL honored · sensitivity curve 0/1/2/3 ticks. 1–2 d.
B5 Report + CIs: extend weekly_report.py · bootstrap 95% CI per metric · <30 trades → INSUFFICIENT_SAMPLE · regime cols (P3.9 dep). 2–3 d.
B6 Walk-Forward Harness: 60/20/20 split · tune only on train · hard refuse "tuned metric on tuned window" · rolling N-fold variant. 1 d.
B7 Multi-Setup A/B Harness: N config variants in parallel · side-by-side variant × setup × CI · stat-significance test. 1 d.
B8 Candidate Setups: implement ORB · VWAP-Reversion-2σ · Gap-Fill (shadow-only) · run all 8 setups against 30-d substrate · Q1 evidence file. 4–5 d · the only legitimately week-sized phase.
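The B4 fill rules can be sketched as three small functions. This is a minimal sketch, not the project's resolver: the function names and the example prices are hypothetical, the $4.00 commission is a placeholder because the roadmap does not state a figure, and applying slippage once per round trip is an assumption. The NQ tick size (0.25 points) and tick value ($5.00) are the standard contract specs.

```python
TICK = 0.25        # NQ minimum price increment, in index points
TICK_VALUE = 5.00  # USD per tick per NQ contract


def limit_filled(side, limit_price, bar_low, bar_high, through_ticks=1):
    """Touch is not a fill: price must trade through the limit by through_ticks."""
    if side == "LONG":
        return bar_low <= limit_price - through_ticks * TICK
    return bar_high >= limit_price + through_ticks * TICK


def stop_exit_price(side, stop_price, adverse_ticks=1):
    """Stops are assumed to fill adverse_ticks worse than the stop level."""
    if side == "LONG":
        return stop_price - adverse_ticks * TICK
    return stop_price + adverse_ticks * TICK


def net_pnl_usd(side, entry, exit_price, slippage_ticks=1, commission_usd=0.0):
    """Net P&L for one contract; slippage charged once per round trip (assumption)."""
    direction = 1 if side == "LONG" else -1
    gross_ticks = direction * (exit_price - entry) / TICK
    return (gross_ticks - slippage_ticks) * TICK_VALUE - commission_usd


# A bar that only touches the limit is NO_FILL; one tick through is a fill.
print(limit_filled("LONG", 18000.00, bar_low=18000.00, bar_high=18005.00))
print(limit_filled("LONG", 18000.00, bar_low=17999.75, bar_high=18005.00))

# Sensitivity curve over 0/1/2/3 ticks of slippage, hypothetical $4.00 commission.
for ticks in (0, 1, 2, 3):
    pnl = net_pnl_usd("LONG", 18000.00, 18010.00, ticks, commission_usd=4.00)
    print(f"slippage {ticks} ticks: {pnl:.2f} USD")
```

Running the same trade list through all four slippage settings shows how much of a setup's edge survives realistic costs, which is what C2 asks for.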

🧠 Architectural Lessons · B0.5 First Pass · each would have wasted days · captured 2026-05-13 (11)

L1 Script-tag injection is the only valid loader. vm.runInContext and eval() both isolate const/let per call → silent broken engine state with no error. Only document.createElement('script') + appendChild in runScripts: 'dangerously' preserves cross-script global lexical scope the way browsers do.
L2 Real jsdom elements for fake DOM, not Proxies. Proxy-based fakes pass property reads but fail jsdom's internal instanceof Element checks when MutationObserver / innerHTML accessors fire. document.createElement('div') with assigned ID works.
L3 Force-prime dualConnState.data.connected + runtimeOps.startupReady post-load. Both are top-level const/var in engine-pipeline.js: lexically scoped, NOT on window. Until set, every signal blocks with "Startup lock." Inject a probe script that mutates them inside the engine's scope.
L4 Per-frame heartbeat bump is mandatory. Engine's dualConnState.data.lastMessageAt must be kept fresh per frame or the staleness gate fires after a couple of seconds.
L5 FakeWebSocket must auto-open + auto-fire one system message on next microtask. Defer via Promise.resolve().then() so the caller has a chance to assign onopen/onmessage handlers first.
L6 LightweightCharts CDN dep needs a chainable-no-op Proxy stub. Otherwise engine-chart.js crashes during init and breaks downstream loads.
L7 Backtest output format differs from live shadow JSONL. Live = {rxAt, payload:{...}}. Backtest = raw live_decision_log.v1 entries (inner payload only). Parity-diff tool must normalize.
L8 "Use replay-time" bugs have multiple write paths. Found DateTime.UtcNow in 3 sites that fed engine-bound data (JSON timestamp field, per-1m bar accumulator clock, tick-tempo timestamp); fixing only the obvious one would have left 2 silent corruptions. Always grep ALL DateTime.UtcNow references in any data-producing path.
L9 Substrate density is an independent correctness dimension. Even with right timestamps, sparse sampling causes engine to fire "intents" without enough data for plan validation. Density and correctness must BOTH be right. Sampling went 50 → 10 quote events for ~5× density.
L10 Raw intent parity ≠ valid-plan parity. Engine emits decisionState=SHADOW early in the pipeline. Live bridge only persists fires that passed planValidate() (entry/stop/target > 0). Always filter to valid plans before computing parity. Raw 63% vs valid-plan 0% on the same data.
L11 🚨 The localStorage KEY matters as much as the field filter. nqelite.live_decision_log.v1 is the engine's diagnostic audit log (BLOCKED/CONFLICT included). ev_calibration_log is the validated-fire log that the live bridge persists to disk as _calibration_entry.jsonl. Always use ev_calibration_log for parity. Spent hours comparing audit-log to fire-log before catching this: 63% phantom vs 3.6% true. Before celebrating any parity number, verify which log is being read.
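The normalization that L7 calls for can be sketched as a small unwrapping step. This is a minimal sketch, assuming each JSONL line is one JSON object and that the live wrapper has exactly the {rxAt, payload:{...}} shape described above; the function names and the sample lines are hypothetical.

```python
import json


def normalize(record):
    """Return the inner decision payload for either record shape."""
    # Live shape: {"rxAt": ..., "payload": {...}} -> unwrap the payload.
    if (
        isinstance(record, dict)
        and "rxAt" in record
        and isinstance(record.get("payload"), dict)
    ):
        return record["payload"]
    # Backtest shape: already the inner payload.
    return record


def load_jsonl_lines(lines):
    """Parse JSONL lines, skip blanks, and normalize each record."""
    return [normalize(json.loads(line)) for line in lines if line.strip()]


# Hypothetical sample lines (not real log entries).
live_lines = ['{"rxAt": 1, "payload": {"setupId": "ib-brk-L"}}', ""]
backtest_lines = ['{"setupId": "ib-brk-L"}']

print(load_jsonl_lines(live_lines) == load_jsonl_lines(backtest_lines))
```

Running both sources through the same loader means the parity diff compares like with like; per L11, which log the lines came from still has to be checked separately.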

🪧 Honesty Governor on Estimates · added 2026-05-13 after Frank-grade catch

First cut had 5.5-wk estimates inflated by uniform "couple of days each" buffers — lazy padding that looked plausible per row but doubled the aggregate. 2026-05-13 evening update: B0.5 first pass came in at 1 session against the 1–2 sess estimate — on schedule. Centerline 2026-06-07 holds. If a stage routinely takes longer than its honest estimate, diagnose why rather than retroactively re-justify the padding.

Mirror of BACKTEST_ROADMAP.md · every change to that file propagates here (Memory 1 rule, 2026-05-13) · Elite Intelligence Desk