⭐ Mission
What this backtest is — and is not
Backtest is a directional probe for the engine — relative comparisons, edge-case stress tests, sanity checks. NOT a strict-determinism instrument. Aggregate metrics use bootstrap 95% CIs to absorb the engine's bounded variance.
✓ Use Backtest For
- Relative A/B comparisons (config X vs Y)
- Edge-case stress tests on engine logic
- Engine variance characterization across replays
- Sanity checks on logic changes
- Aggregate metrics with bootstrap CIs
✗ Do NOT Use For
- Strict per-fire parity claims
- Replacing live forward-test for calibration
- Phase 4 sample-acceleration (10wk → 1-2wk promise)
- Absolute outcome predictions
- Anything before engine confidence is settled (M3 rule)
⏱ Timeline · Honest Cut
2026-05-14 · post real-Chrome pivot
3 wk
| Stage | Effort | Cumul | Calendar | |
|---|---|---|---|---|
| B0 | shipped | — | 05-13 | ✅ |
| jsdom replay path | parked | | | 🔴 |
| B0.6 real-Chrome WS (G1+G2) | 1 sess | 1 sess | 05-14 overnight | ✅ |
| B0.6 G3/G4 close (substrate WS debug) | 0.5–1 sess | ~2 sess | 05-14 → 05-15 | 🟡 |
| B1 (replay harden + headless option) | 2–3 d | ~5 d | 05-16 → 05-19 | — |
| B2 | 1 d | ~6 d | 05-20 | — |
| B3 | 1 d | ~7 d | 05-21 | — |
| B4 | 1–2 d | ~9 d | 05-22 → 05-24 | — |
| B5 | 2–3 d | ~12 d | 05-25 → 05-27 | — |
| B6 | 1 d | ~13 d | 05-28 | — |
| B7 | 1 d | ~14 d | 05-29 | — |
| B8 | 4–5 d | ~19 d | 05-31 → 06-05 | — |
🟢 Upside
~2026-05-30 · ~2.5 wk · substrate WS fix is small
⚪ Centerline
2026-06-05 · ~3 wk · 2 d ahead of pre-pivot estimate
🔴 Pessimistic
~2026-06-15 · ~4.5 wk · additional state-warming issues
Phase A · 4 Gates
Stage 0 · live capture ✅
Engine determinism ✅
Real-Chrome WS replay ✅
Gate 1 · first-fire ✅ 13 fires emitted 05-14
Gate 2 · schema ✅ by construction
Gate 3 · value parity 🟡 partial (20:45 setupIds match · EV gap)
Gate 4 · ≥95% match 🟡 2/14 (substrate-warmup next)
Parked: jsdom replay path (engine OOM at 30-50K events, 3 sessions sunk). Pivoted to real Chrome + tiny Node WS server. Same engine code, same fields, no DOM emulation hacks. 40× speed cliff established.
🎉 2026-05-14 BREAKTHROUGH — Real Chrome WS Replay
jsdom path dead · Real Chrome + tiny Node WS server works · Gate 1+2 PASSED in one session
✅ The new primitive works
Node WS server (`backtest_harness/tools/replay_ws_server.js`) streams recorded JSONL → isolated Chrome running `index-v2-replay.html`. 61,618 events in 26 min wall-clock, zero disconnects, 13 fires emitted. First fires from a backtest replay in this entire effort.
🟡 Speed cliff = 40×
Validated 2× / 10× / 40× stable. 50× = deterministic WS-disconnect mid-stream. 40× = 17h day in 26 min, fast enough for real iteration. Headless-Chrome multiplier (3–5×) reserved for B1 if needed (would give ~120–200× effective).
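The speed arithmetic can be checked with a short sketch. It assumes each recorded event carries a millisecond timestamp and that pacing means dividing the recorded inter-event gap by the speed multiplier. Both function names are hypothetical, and `replay_ws_server.js` may pace differently.

```python
from typing import Iterable, Iterator


def wall_clock_minutes(recorded_hours: float, speed: float) -> float:
    """Wall-clock minutes needed to replay `recorded_hours` at `speed`x."""
    return recorded_hours * 60.0 / speed


def paced_delays(timestamps_ms: Iterable[int], speed: float) -> Iterator[float]:
    """Yield the sleep (in seconds) before each event after the first.

    Assumption: each recorded gap is compressed by `speed`. Negative gaps
    (out-of-order timestamps) are clamped to zero rather than reordered.
    """
    prev = None
    for ts in timestamps_ms:
        if prev is not None:
            yield max(ts - prev, 0) / 1000.0 / speed
        prev = ts


if __name__ == "__main__":
    print(wall_clock_minutes(17, 40))                        # 25.5
    print(list(paced_delays([0, 4000, 4000, 12000], 40)))    # [0.1, 0.0, 0.2]
    # Headless multiplier of 3-5x on top of 40x:
    print(40 * 3, 40 * 5)                                    # 120 200
```

A 17 h day at 40× comes to 25.5 min under this model, consistent with the observed 26 min wall-clock.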
🟡 Cold-start gap remains
Backtest fires `ib-brk-L` + `ib-ext-L` at 20:45 UTC, matching live's late cluster (Gate 3 partial). But EV values diverge (backtest negative ARMED vs live positive SIGNAL) because the cold bar buffer misclassifies the day as "compression." Fix: Quantower substrate with 9h warmup. The first attempt hit a WS-disconnect storm; needs Chrome Network-tab inspection.
🚦 Binding Constraints
7
- C1 · Parity gate before any setup-decision use. B2 reconciliation must pass.
- C2 · Cost & fill model non-optional. Zero-cost inflates PF 0.2–0.4 per setup.
- C3 · Bootstrap 95% CI on every metric. "PF 1.4" lies · "PF 1.4 ± 0.5" honest.
- C4 · Walk-forward split train/test/holdout. No metric on tuned window.
- C5 · Deterministic replay: same input → byte-identical output.
- C6 · Never mix backtest output with live JSONLs. Separate dir, separate prefix.
- C7 · Live engine untouched. Shims live in harness layer only.
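As an illustration of C3, here is a minimal percentile-bootstrap sketch for profit factor. It is not the `weekly_report.py` implementation. It assumes trades arrive as a list of realized R multiples, and the function names, resample count, and fixed seed (chosen so repeated runs agree, in the spirit of C5) are illustrative. The 30-trade floor mirrors the `INSUFFICIENT_SAMPLE` rule listed under B5.

```python
import random
from typing import List, Optional, Tuple

MIN_TRADES = 30  # below this, report INSUFFICIENT_SAMPLE instead of a CI


def profit_factor(r_multiples: List[float]) -> float:
    """Gross wins / gross losses; inf when there are no losing trades."""
    wins = sum(r for r in r_multiples if r > 0)
    losses = -sum(r for r in r_multiples if r < 0)
    return wins / losses if losses > 0 else float("inf")


def bootstrap_pf_ci(
    r_multiples: List[float],
    n_resamples: int = 2000,   # assumed default, not from the source
    alpha: float = 0.05,       # 95% interval
    seed: int = 0,             # fixed seed keeps the report reproducible
) -> Optional[Tuple[float, float, float]]:
    """Percentile bootstrap CI: (point estimate, lower, upper), or None if too few trades."""
    n = len(r_multiples)
    if n < MIN_TRADES:
        return None
    rng = random.Random(seed)
    # Resample trades with replacement and recompute the metric each time.
    stats = sorted(
        profit_factor(rng.choices(r_multiples, k=n)) for _ in range(n_resamples)
    )
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return profit_factor(r_multiples), lo, hi
```

Reporting the triple rather than the point estimate is what turns "PF 1.4" into "PF 1.4 with an interval"; a `None` result maps to `INSUFFICIENT_SAMPLE`.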
🪲 B0.5 Risks · Post-Pass-2
After 2026-05-13 evening
5
- R1 · ✅ CONFIRMED MATERIAL: dominant blocker. Two layers: (a) `yestClose`/`yestHigh`/`yestLow`/`settlement` = 0 → engine reads massive implied gap → wrong directional bias (backtest 100% LONG vs live 98% SHORT); (b) `recentBarDeltasOfficial` empty → plan validator can't compute stop/target → 0/111 valid plans. Pass 3 (B0.5.7d + e) targets both.
- R2 · ✅ MEASURED + STABLE. 78 scripts load with 7 shim categories. No new surface across 3,604 and 1,190 frame replays. B1 estimate 2–3 d holds.
- R3 · ⚠ DEMOTED: not the issue. Density fix (50→10 quote sampling) lifted parity 50.5→63.1% on same data. So density was the constraint, not tick-fidelity. Stream-replay rebuild NOT warranted.
- R4 · ✅ CONFIRMED: cold-start divergence. Backtest fires LONG (recent-momentum bounce context) vs live SHORT (multi-day trend context). The warmup-window idea was the wrong-shaped fix; what we need is prior-day data seeding via R1 enrichment.
- R5 · 🆕 NEW · Pass-2 finding: intent vs plan parity divergence. Engine emits `decisionState=SHADOW` early in the pipeline. Live bridge only persists fires that passed `planValidate()`. Apples-to-apples = valid-plan fires only. Comparing raw intents to bridge-filtered fires overstates parity.
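The R5 distinction can be sketched as a filter applied before parity is computed. The valid-plan rule (entry/stop/target > 0) is the one this document gives for `planValidate()`. The match key (setup id plus minute bucket) and the `ts_ms` field name are assumptions made for illustration, not the harness's actual matching rule.

```python
from typing import Dict, Iterable, Tuple


def is_valid_plan(fire: Dict) -> bool:
    """Valid-plan rule as described in this document: entry, stop, target all > 0."""
    return all(float(fire.get(k) or 0) > 0 for k in ("entry", "stop", "target"))


def fire_key(fire: Dict) -> Tuple[str, int]:
    """Assumed match key: setup id plus minute bucket of a millisecond timestamp."""
    return fire["setupId"], int(fire["ts_ms"]) // 60_000


def parity(live: Iterable[Dict], backtest: Iterable[Dict], valid_only: bool = True) -> float:
    """Share of live fires that have a backtest fire with the same key.

    Live fires come from the bridge and are already plan-filtered, so only
    the backtest side needs the valid-plan filter for a like-for-like diff.
    """
    live_keys = {fire_key(f) for f in live}
    bt = [f for f in backtest if is_valid_plan(f)] if valid_only else list(backtest)
    bt_keys = {fire_key(f) for f in bt}
    return len(live_keys & bt_keys) / len(live_keys) if live_keys else 0.0
```

Running the same data with `valid_only=False` and `valid_only=True` shows how far raw intents overstate parity.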
🔗 Cross-Phase Deps
Hidden cost surfaces
2
- B5↔P3.9 · Regime breakdown needs a regime classifier, which is Phase 3.9 work (not yet built). Two paths: (a) skip regime columns in v1, add when P3.9 lands · (b) pull P3.9 forward (+3–5 d). Decision at B5 start.
- B1↔B0.5 · B1 effort is conditional on B0.5 findings. The 2–3 d estimate assumes a typical shim surface. If B0.5 surfaces many unexpected browser deps → expand to 4–5 d.
- B5,B8↔Q1 · Q1 council target date depends on B5 + B8 landing on time · regime breakdown availability shapes evidence quality.
🎯 North-Star Alignment
vs master ROADMAP
Serves
- T1 E[R] > 0 at fire
- T2 mean realized R (accel)
- T6 no-fire discipline
Does NOT serve
- T4 regime stability (can't re-roll history)
- T5 time stability (needs live forward)
📐 Detailed Phases
B0 shipped · B0.5 active · B1–B8 conditional
9
- B0 · Substrate Capture: Quantower Backtest → `NQEliteBacktestStrategy` → JSONL · 50-quote event-paced · `Symbol.LastDateTime` · IB ranges verified May 12 (Asia 105pt · LDN 66pt · NY 181pt).
- B0.5 · Parity Harness · Council pivot to 3-stage architecture: after 6 patches couldn't lift parity past 3.6%, council pivoted to a research-backed (NautilusTrader/Lean/Lopez de Prado) sequenced approach. Pre-stage ✅: engine determinism proven (byte-identical fire output, 2 runs same substrate). Stage 0 ✅: Python bridge patched to capture every engine-bound WS message to `<date>_engine_input.jsonl` · bridge restarted · capture growing. Stage 1 ✅ built: `replay_recorded.js` replays captured events through harness, diffs vs live · smoke test 78/78 scripts, 0 errors. Awaiting meaningful capture window with live fires. Verdict gate ≥95% on `ev_calibration_log` per Memory 5. Stages 2 (snapshot-to-event expander) + 3 (warmup phase) conditional on Stage 1 verdict.
- B1 · Engine Replay Hardening: productionize harness · stub every browser dep · determinism test (3× byte-diff) · warmup period · multi-day stitched replay. 2–3 d conditional on R2.
- B2 · Reconciliation Gate: extend parity to 3 reference days · all must pass tolerance · lock into regression suite. 1 d.
- B3 · Resolver Adapter: extend `resolve_shadow_outcomes.py` for backtest substrate · same buckets (TARGET/STOP/DRIFT/NO_FILL) · output to `research_data/backtest/resolved/`. 1 d.
- B4 · Cost & Fill Model: slippage default 1 tick · NQ commission · limit fill: touch ≠ fill, trade-through ≥1 tick = fill · stop +1 tick adverse · NO_FILL honored · sensitivity curve 0/1/2/3 ticks. 1–2 d.
- B5 · Report + CIs: extend `weekly_report.py` · bootstrap 95% CI per metric · <30 trades → `INSUFFICIENT_SAMPLE` · regime cols (P3.9 dep). 2–3 d.
- B6 · Walk-Forward Harness: 60/20/20 split · tune only on train · hard refuse "tuned metric on tuned window" · rolling N-fold variant. 1 d.
- B7 · Multi-Setup A/B Harness: N config variants in parallel · side-by-side variant × setup × CI · stat-significance test. 1 d.
- B8 · Candidate Setups: implement ORB · VWAP-Reversion-2σ · Gap-Fill (shadow-only) · run all 8 setups against 30-d substrate · Q1 evidence file. 4–5 d, the only legitimately week-sized phase.
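The B4 fill rules are specific enough to sketch. The snippet below is a hedged illustration, not the planned implementation. It assumes bar-level high/low data and `LONG`/`SHORT` side labels, and its function names are hypothetical. `TICK = 0.25` is the NQ minimum price increment.

```python
TICK = 0.25  # NQ minimum price increment, in index points


def limit_filled(side: str, limit_price: float, bar_low: float, bar_high: float,
                 through_ticks: int = 1) -> bool:
    """Touch != fill: the market must trade THROUGH the limit by >= through_ticks.

    Returning False corresponds to the NO_FILL bucket.
    """
    if side == "LONG":    # buy limit: price must trade below the limit
        return bar_low <= limit_price - through_ticks * TICK
    if side == "SHORT":   # sell limit: price must trade above the limit
        return bar_high >= limit_price + through_ticks * TICK
    raise ValueError(f"unknown side: {side}")


def stop_exit_price(side: str, stop_price: float, slip_ticks: int = 1) -> float:
    """Stop exits fill slip_ticks worse than the stop level (adverse slippage)."""
    if side == "LONG":    # long position exits by selling: worse means lower
        return stop_price - slip_ticks * TICK
    if side == "SHORT":   # short position exits by buying: worse means higher
        return stop_price + slip_ticks * TICK
    raise ValueError(f"unknown side: {side}")


if __name__ == "__main__":
    # Sensitivity curve: exit price of a long stop at 100.00 for 0/1/2/3 ticks.
    print([stop_exit_price("LONG", 100.0, t) for t in (0, 1, 2, 3)])
    # [100.0, 99.75, 99.5, 99.25]
```

Re-running the resolver with `slip_ticks` and `through_ticks` swept over 0/1/2/3 gives the sensitivity curve; a setup whose edge disappears at 1 tick is running on zero-cost inflation (C2).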
🧠 Architectural Lessons — B0.5 First Pass
each would have wasted days · captured 2026-05-13
7
- L1 · Script-tag injection is the only valid loader. `vm.runInContext` and `eval()` both isolate `const`/`let` per call → silent broken engine state with no error. Only `document.createElement('script')` + `appendChild` in `runScripts: 'dangerously'` preserves cross-script global lexical scope the way browsers do.
- L2 · Real jsdom elements for fake DOM, not Proxies. Proxy-based fakes pass property reads but fail jsdom's internal `instanceof Element` checks when MutationObserver / innerHTML accessors fire. `document.createElement('div')` with assigned ID works.
- L3 · Force-prime `dualConnState.data.connected` + `runtimeOps.startupReady` post-load. Both are top-level `const`/`var` in engine-pipeline.js: lexically scoped, NOT on window. Until set, every signal blocks with "Startup lock." Inject a probe script that mutates them inside the engine's scope.
- L4 · Per-frame heartbeat bump is mandatory. Engine's `dualConnState.data.lastMessageAt` must be kept fresh per frame or the staleness gate fires after a couple of seconds.
- L5 · FakeWebSocket must auto-open + auto-fire one system message on next microtask. Defer via `Promise.resolve().then()` so the caller has a chance to assign `onopen`/`onmessage` handlers first.
- L6 · LightweightCharts CDN dep needs a chainable-no-op Proxy stub. Otherwise engine-chart.js crashes during init and breaks downstream loads.
- L7 · Backtest output format differs from live shadow JSONL. Live = `{rxAt, payload:{...}}`. Backtest = raw `live_decision_log.v1` entries (inner payload only). Parity-diff tool must normalize.
- L8 · "Use replay-time" bugs have multiple write paths. Found `DateTime.UtcNow` in 3 sites that fed engine-bound data (JSON timestamp field, per-1m bar accumulator clock, tick-tempo timestamp); fixing only the obvious one would have left 2 silent corruptions. Always grep ALL `DateTime.UtcNow` references in any data-producing path.
- L9 · Substrate density is an independent correctness dimension. Even with right timestamps, sparse sampling causes engine to fire "intents" without enough data for plan validation. Density and correctness must BOTH be right. Sampling went 50 → 10 quote events for ~5× density.
- L10 · Raw intent parity ≠ valid-plan parity. Engine emits `decisionState=SHADOW` early in the pipeline. Live bridge only persists fires that passed `planValidate()` (entry/stop/target > 0). Always filter to valid plans before computing parity. Raw 63% vs valid-plan 0% on the same data.
- L11 · 🚨 The localStorage KEY matters as much as the field filter. `nqelite.live_decision_log.v1` is the engine's diagnostic audit log (BLOCKED/CONFLICT included). `ev_calibration_log` is the validated-fire log that the live bridge persists to disk as `_calibration_entry.jsonl`. Always use `ev_calibration_log` for parity. Spent hours comparing audit-log to fire-log before catching this: 63% phantom vs 3.6% true. Before celebrating any parity number, verify which log is being read.
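The normalization called for in L7 can be sketched in a few lines. This assumes only the two shapes described there (a live envelope with `rxAt` and `payload`, and a bare backtest entry). The function names are hypothetical, and this is not the parity-diff tool's actual code.

```python
import json
from typing import Dict, Iterable, Iterator


def normalize_record(record: Dict) -> Dict:
    """Return the inner decision entry regardless of which writer produced it."""
    if isinstance(record.get("payload"), dict):
        return record["payload"]   # live shape: {rxAt, payload: {...}}
    return record                  # backtest shape: inner payload only


def read_jsonl(lines: Iterable[str]) -> Iterator[Dict]:
    """Parse JSONL lines, skipping blanks, and normalize each record."""
    for line in lines:
        line = line.strip()
        if line:
            yield normalize_record(json.loads(line))
```

Normalizing at read time keeps the diff logic shape-agnostic. It does not address L11: the caller still has to point the reader at the right log.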
🪧 Honesty Governor on Estimates
added 2026-05-13 after Frank-grade catch
First cut had 5.5-wk estimates inflated by uniform "couple of days each" buffers — lazy padding that looked plausible per row but doubled the aggregate. 2026-05-13 evening update: B0.5 first pass came in at 1 session against the 1–2 sess estimate — on schedule. Centerline 2026-06-07 holds. If a stage routinely takes longer than its honest estimate, diagnose why rather than retroactively re-justify the padding.