Backtest System · Build Roadmap
Measurement instrument · index-v2-replay.html + backtest_harness/ · SUBORDINATE to engine roadmap · backtest exists to serve engine confidence
STATUS: 🆕 REDESIGNED 2026-05-15 LATE NIGHT · 4-tool ladder
LITERATURE: 1/9 met · 2/9 partial · 6/9 NOT MET for parity
TOOL 1: forward-test, continuous
TOOL 2: CPCV+DSR/PBO, Sat 05-17
TOOL 3: directional WS-replay (niche)
TOOL 4: Quantower tick-replay, GATED on variance probe
2026-05-15 late night (council redesign)
🆕 2026-05-15 LATE NIGHT · Council Mode 3 Redesign: 4-Tool Evidence Ladder
Literature research + Council Mode 3 → backtest replaced with multi-tool evidence pipeline

Trigger: literature research showed engine scored 1/9 met, 2/9 partial, 6/9 not met on industry preconditions for parity backtest. We have the Lean wall-clock-coupling property without Lean's BacktestClock framework. Lopez de Prado's warning: "backtesting before parameters are forward-test-validated = discovery on data the strategy already saw = highest overfitting risk."

Council Mode 3 verdict (all 5 advisors converged): don't redesign the backtest — replace it with a 4-tool evidence ladder.

# | Tool | Speed | Accuracy | Status
1 | Forward-test (continuous) | Calendar-bound | HIGHEST | PRIMARY edge discovery
2 | CPCV + DSR/PBO on shadow log | Instant on existing data | HIGH | NEW · Sat 2026-05-17
3 | Directional WS-replay (Option C) | Fast (1-2d per A/B) | MEDIUM (relative only) | EXISTING · niche A/B tool
4 | Quantower historical tick-replay | UNKNOWN, depends on probe | MEDIUM-HIGH (genuine OOS) | GATED on variance probe

Immediate next step (Sat 2026-05-17 ~3-4h): variance characterization probe at 2× / 5× / 10× replay speeds × 3 iterations on a 1-hour window from research_data/live/20260513_engine_input.jsonl. Decision gate: CV ≤ 0.3 at 5× → ship Tool 4 (6 months OOS data in ~36 days). CV > 0.5 at 5× → Tool 4 dies cleanly, stick with Tools 1+2+3.
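The decision gate above can be sketched in a few lines. This is a minimal illustration, not the probe itself: it assumes the CV is taken over total fire count per iteration and uses the sample standard deviation, neither of which the roadmap specifies. The fire counts are made up, and the function names are hypothetical. The roadmap defines no outcome for 0.3 < CV ≤ 0.5, so the sketch reports that band as undecided rather than guessing.

```python
from statistics import mean, stdev

def coefficient_of_variation(samples):
    """Sample standard deviation divided by the absolute mean."""
    m = mean(samples)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return stdev(samples) / abs(m)

def probe_gate(fire_counts_at_5x, ship_max=0.3, kill_min=0.5):
    """Apply the roadmap's thresholds to the 5x iterations."""
    cv = coefficient_of_variation(fire_counts_at_5x)
    if cv <= ship_max:
        return cv, "SHIP_TOOL_4"
    if cv > kill_min:
        return cv, "DROP_TOOL_4"
    return cv, "UNDECIDED"  # band not defined in the roadmap

# Hypothetical fire counts: 3 iterations at each replay speed.
runs = {2: [13, 12, 14], 5: [13, 9, 15], 10: [13, 4, 20]}
for speed, counts in runs.items():
    print(f"{speed}x CV = {coefficient_of_variation(counts):.3f}")
print(probe_gate(runs[5]))
```

With only 3 iterations per speed the CV estimate is itself noisy, so a result near either threshold is better treated as a reason to add iterations than as a verdict.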

Cancelled (formally): parity backtest, jsdom replay extension, Quantower-substrate replay extension, multi-day stitching at compressed speeds.

Pivot Trigger for the backtest project: month 6 (2026-11-13) — if forward-test + CPCV show no edge in any regime, backtest project is formally cancelled. Tools 3+4 only ever activate when forward-test surfaces a specific A/B question on a strategy with demonstrated edge.

PHASE E · Backtest Reframed (Option C) · historical context
5 council sessions · 4 measurement tracks · framework redefined 2026-05-15 (superseded by late-night 4-tool ladder above)
OPT C REFRAME

Backtest = directional probe, NOT a parity instrument

Phase A redefined: "≥95% per-fire byte-identical" → "aggregate parity within bounded variance + CIs". Tooling preserved as-is for relative comparisons. Phase 4 sample acceleration NOT delivered: live forward-test remains the calibration path. M5 #9 fully preserved: zero engine code touched.

PHASE A REDEFINED
0 engine edits
0 backtest edits
1 new analysis tool
Use Backtest For
  • Relative A/B comparisons — config X vs Y, threshold X vs Y
  • Edge-case stress tests — does engine handle gap-fill, etc.
  • Engine variance characterization — repeated replays, CV measurement
  • Sanity checks on engine logic changes
Do NOT Use For
  • Strict per-fire parity claims (unreachable on this engine)
  • Replacing live forward-test for calibration
  • Phase 4 sample-accumulation acceleration (the 10wk → 1-2wk promise)
  • Absolute outcome predictions
ℹ Variance Numbers
  • Total fire CV=0.29, valid-plan CV=0.20
  • Strong setups CV=0.20 — deterministic core
  • Marginal setups CV=0.94 — flips in/out
  • Per-fire score/EV: nearly identical when fires happen
Ships Next
  • Bootstrap CIs in weekly_report.py ~2-3h
  • Build resolve_shadow_outcomes.py — engine-output analysis
  • Run on May 7-13 → realized-R CIs per setup
  • Engage Phase 3 tuning with CI-honest table
Full arc: memory/project_session_handoff_2026-05-14_late_night.md · Module inventory: project_wall_clock_modules_inventory_2026-05-14.md · Production survey: project_engine_architecture_survey_2026-05-14.md
Mission · What this backtest is, and is not
Backtest is a directional probe for the engine — relative comparisons, edge-case stress tests, sanity checks. NOT a strict-determinism instrument. Aggregate metrics use bootstrap 95% CIs to absorb the engine's bounded variance.
✓ Use Backtest For
  • Relative A/B comparisons (config X vs Y)
  • Edge-case stress tests on engine logic
  • Engine variance characterization across replays
  • Sanity checks on logic changes
  • Aggregate metrics with bootstrap CIs
✗ Do NOT Use For
  • Strict per-fire parity claims
  • Replacing live forward-test for calibration
  • Phase 4 sample-acceleration (10wk → 1-2wk promise)
  • Absolute outcome predictions
  • Anything before engine confidence is settled (M3 rule)
Timeline · Honest Cut 2026-05-14 · post real-Chrome pivot 3 wk
Stage | Effort | Cumul | Calendar
B0 | shipped | - | 05-13
B0.5 jsdom (3 sess) | 3 sess | - | parked 🔴
B0.6 real-Chrome WS (G1+G2) | 1 sess | 1 sess | 05-14 overnight
B0.6 G3/G4 close (substrate WS debug) | 0.5–1 sess | ~2 sess | 05-14 → 05-15 🟡
B1 (replay harden + headless option) | 2–3 d | ~5 d | 05-16 → 05-19
B2 | 1 d | ~6 d | 05-20
B3 | 1 d | ~7 d | 05-21
B4 | 1–2 d | ~9 d | 05-22 → 05-24
B5 | 2–3 d | ~12 d | 05-25 → 05-27
B6 | 1 d | ~13 d | 05-28
B7 | 1 d | ~14 d | 05-29
B8 | 4–5 d | ~19 d | 05-31 → 06-05
🟢 Upside
~2026-05-30 · ~2.5 wk · substrate WS fix is small
⚪ Centerline
2026-06-05 · ~3 wk · 2 d ahead of pre-pivot estimate
🔴 Pessimistic
~2026-06-15 · ~4.5 wk · additional state-warming issues
Phase A · 4 Gates
  • Stage 0 · live capture ✅
  • Engine determinism ✅
  • Real-Chrome WS replay ✅
  • Gate 1 · first-fire ✅ 13 fires emitted 05-14
  • Gate 2 · schema ✅ by construction
  • Gate 3 · value parity 🟡 partial (20:45 setupIds match · EV gap)
  • Gate 4 · ≥95% match 🟡 2/14 (substrate-warmup next)
Parked: jsdom replay path (engine OOM at 30-50K events, 3 sessions sunk). Pivoted to real Chrome + tiny Node WS server. Same engine code, same fields, no DOM emulation hacks. Speed cliff established: 40× stable, disconnects at 50×.
0
Substrate Capture
Shipped · 05-13 04:30 IL · Quantower → JSONL · IB sessions verified
0.6
Real-Chrome WS Replay ✅ Gate 1+2
05-14 overnight: Node WS server → isolated Chrome → 61,618 events / 17.4h replay in 26 min wall-clock. 13 fires. 40× stable / 50× cliff. Substrate-warmup debug next.
3·4
Resolver + Cost Model
2–3 d · slippage + commission + NO_FILL honored · sensitivity curve
5·6
Report CIs + Walk-Forward
3–4 d · bootstrap 95% CI · train/test/holdout · hard refuse on tuned-window eval
7·8
A/B + Candidate Setups
5–6 d · ORB + VWAP-2σ + Gap-Fill · Q1 evidence file
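The "hard refuse on tuned-window eval" rule in the Walk-Forward card can be enforced in code rather than by convention. The sketch below is one possible shape, assuming a chronological split by trading day and the 60/20/20 proportions given for phase B6. All names (`split_days`, `evaluate`, `TunedWindowError`) are hypothetical and do not come from the harness.

```python
from dataclasses import dataclass

class TunedWindowError(RuntimeError):
    """Raised when a metric is requested on the window used for tuning."""

@dataclass(frozen=True)
class WalkForwardSplit:
    train: list
    test: list
    holdout: list

def split_days(days, train_pct=60, test_pct=20):
    """Chronological split with no shuffling, so later days never leak backward."""
    ordered = sorted(days)
    n_train = len(ordered) * train_pct // 100
    n_test = len(ordered) * test_pct // 100
    return WalkForwardSplit(
        train=ordered[:n_train],
        test=ordered[n_train:n_train + n_test],
        holdout=ordered[n_train + n_test:],
    )

def evaluate(metric_fn, split, window, tuned_on="train"):
    """Compute metric_fn on one window, refusing the window that was tuned on."""
    if window == tuned_on:
        raise TunedWindowError(
            f"refusing to report a metric on the tuned window '{window}'"
        )
    return metric_fn(getattr(split, window))

split = split_days(range(30))            # 30 hypothetical trading days
print(evaluate(len, split, "holdout"))   # allowed
try:
    evaluate(len, split, "train")        # refused
except TunedWindowError as exc:
    print(exc)
```

Raising an exception, instead of printing a warning, means a report script cannot silently emit a tuned-window number.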
🎉 2026-05-14 BREAKTHROUGH: Real Chrome WS Replay
jsdom path dead · Real Chrome + tiny Node WS server works · Gate 1+2 PASSED in one session
✅ The new primitive works: Node WS server (backtest_harness/tools/replay_ws_server.js) streams recorded JSONL → isolated Chrome running index-v2-replay.html. 61,618 events in 26 min wall-clock, zero disconnects, 13 fires emitted. First fires from a backtest replay in this entire effort.
🟡 Speed cliff above 40×: validated 2× / 10× / 40× stable. 50× = deterministic WS-disconnect mid-stream. 40× = 17h day in 26 min, fast enough for real iteration. Headless-Chrome multiplier (3–5×) reserved for B1 if needed (would give ~120× effective at the 3× end).
🟡 Cold-start gap remains: backtest fires ib-brk-L + ib-ext-L at 20:45 UTC, matching live's late cluster (Gate 3 partial). But EV values diverge (backtest negative ARMED vs live positive SIGNAL) because the cold bar buffer mis-classifies the day as "compression." Fix: Quantower substrate with 9h warmup. The first attempt hit a WS-disconnect storm, which needs Chrome Network-tab inspection.
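The replay figures quoted in this section can be checked with simple arithmetic. The helper names below are illustrative, and treating the headless multiplier as stacking multiplicatively on the 40× base is an assumption implied by the "~120×" figure, not something that has been measured.

```python
def wall_clock_minutes(recorded_hours, speed):
    """Wall-clock minutes needed to replay a recording at a given speed-up."""
    return recorded_hours * 60 / speed

def effective_speed(base_speed, headless_multiplier):
    """Combined speed-up, assuming the multiplier stacks on the base speed."""
    return base_speed * headless_multiplier

# 17.4 h of recorded events at 40x.
print(round(wall_clock_minutes(17.4, 40), 1))            # 26.1 minutes

# Headless range of 3x to 5x on top of the 40x base.
print(effective_speed(40, 3), effective_speed(40, 5))    # 120 200

# Average delivery rate during the 26-minute run.
print(round(61_618 / (26 * 60), 1))                      # 39.5 events/s
```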
🚦 Binding Constraints 7
  • C1 · Parity gate before any setup-decision use. B2 reconciliation must pass.
  • C2 · Cost & fill model non-optional. Zero-cost inflates PF 0.2–0.4 per setup.
  • C3 · Bootstrap 95% CI on every metric. "PF 1.4" lies · "PF 1.4 ± 0.5" honest.
  • C4 · Walk-forward split train/test/holdout. No metric on tuned window.
  • C5 · Deterministic replay: same input → byte-identical output.
  • C6 · Never mix backtest output with live JSONLs. Separate dir, separate prefix.
  • C7 · Live engine untouched. Shims live in harness layer only.
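Constraint C3 can be illustrated with a percentile bootstrap on profit factor. This is a sketch under stated assumptions, not the code in weekly_report.py: the function names, the 2,000-resample default, and the sample R-multiples are all hypothetical. The 30-trade floor mirrors the INSUFFICIENT_SAMPLE rule listed for phase B5.

```python
import random

MIN_TRADES = 30  # threshold stated for phase B5

def profit_factor(r_multiples):
    """Gross gains divided by gross losses, both measured in R."""
    gains = sum(r for r in r_multiples if r > 0)
    losses = -sum(r for r in r_multiples if r < 0)
    return float("inf") if losses == 0 else gains / losses

def bootstrap_ci(r_multiples, stat_fn=profit_factor, n_resamples=2000,
                 alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for stat_fn."""
    if len(r_multiples) < MIN_TRADES:
        return {"status": "INSUFFICIENT_SAMPLE", "n": len(r_multiples)}
    rng = random.Random(seed)  # fixed seed keeps the report reproducible
    stats = sorted(
        stat_fn(rng.choices(r_multiples, k=len(r_multiples)))
        for _ in range(n_resamples)
    )
    lo_idx = int(n_resamples * alpha / 2)
    hi_idx = n_resamples - 1 - lo_idx
    return {
        "status": "OK",
        "n": len(r_multiples),
        "point": stat_fn(r_multiples),
        "ci_low": stats[lo_idx],
        "ci_high": stats[hi_idx],
    }

# Hypothetical realized-R values for 40 resolved fires.
trades = [1.5, -1.0, 2.0, -1.0, -1.0, 1.0, -1.0, 2.5] * 5
print(bootstrap_ci(trades))
print(bootstrap_ci(trades[:10]))
```

A plain percentile bootstrap treats trades as independent. If fires cluster in time, a block bootstrap may be the more honest choice.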
🪲 B0.5 Risks · Post-Pass-2 After 2026-05-13 evening 5
  • R1 · CONFIRMED MATERIAL, dominant blocker. Two layers: (a) yestClose/yestHigh/yestLow/settlement = 0 → engine reads massive implied gap → wrong directional bias (backtest 100% LONG vs live 98% SHORT); (b) recentBarDeltasOfficial empty → plan validator can't compute stop/target → 0/111 valid plans. Pass 3 (B0.5.7d + e) targets both.
  • R2 · MEASURED + STABLE. 78 scripts load with 7 shim categories. No new surface across 3,604 and 1,190 frame replays. B1 estimate 2–3 d holds.
  • R3 · DEMOTED, not the issue. Density fix (50→10 quote sampling) lifted parity 50.5→63.1% on same data. So density was the constraint, not tick-fidelity. Stream-replay rebuild NOT warranted.
  • R4 · CONFIRMED: cold-start divergence. Backtest fires LONG (recent-momentum bounce context) vs live SHORT (multi-day trend context). The warmup-window idea was the wrong-shaped fix; what we need is prior-day data seeding via R1 enrichment.
  • R5 · 🆕 NEW · Pass-2 finding: intent vs plan parity divergence. Engine emits decisionState=SHADOW early in the pipeline. Live bridge only persists fires that passed planValidate(). Apples-to-apples = valid-plan fires only. Comparing raw intents to bridge-filtered fires overstates parity.
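A small example makes the R5 point concrete. The sketch assumes parity means "share of live fires the backtest reproduced" and matches fires on setup ID plus a one-minute time bucket. Both are illustrative choices, since the harness's actual matching rule is not stated here. The field names `ts` and `planValid` and the records themselves are hypothetical.

```python
def fire_key(fire, bucket_s=60):
    """Match fires on setup and time bucket (hypothetical matching rule)."""
    return fire["setupId"], fire["ts"] // bucket_s

def parity(backtest_fires, live_fires, valid_only=True):
    """Share of live fires that the backtest reproduced."""
    if valid_only:
        backtest_fires = [f for f in backtest_fires if f["planValid"]]
    reproduced = {fire_key(f) for f in backtest_fires}
    live_keys = {fire_key(f) for f in live_fires}
    if not live_keys:
        return None
    return len(reproduced & live_keys) / len(live_keys)

# The live bridge only ever persists valid-plan fires (ts = seconds since 00:00 UTC).
live = [
    {"setupId": "ib-brk-L", "ts": 74700},
    {"setupId": "ib-ext-L", "ts": 74710},
]
backtest = [
    {"setupId": "ib-brk-L", "ts": 74705, "planValid": True},
    {"setupId": "ib-ext-L", "ts": 74720, "planValid": False},  # intent only
]
print(parity(backtest, live, valid_only=False))  # 1.0, raw intents overstate
print(parity(backtest, live, valid_only=True))   # 0.5, apples-to-apples
```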
🔗 Cross-Phase Deps Hidden cost surfaces 3
  • B5↔P3.9 · Regime breakdown needs a regime classifier. That's Phase 3.9 work (not yet built). Two paths: (a) skip regime columns in v1, add when P3.9 lands · (b) pull P3.9 forward, +3–5 d. Decision at B5 start.
  • B1↔B0.5 · B1 effort is conditional on B0.5 findings. Estimate 2–3 days assumes typical shim surface. If B0.5 surfaces many unexpected browser deps → expand to 4–5 d.
  • B5,B8↔Q1 · Q1 council target date depends on B5 + B8 landing on time · regime breakdown availability shapes evidence quality.
🎯 North-Star Alignment vs master ROADMAP
Serves
  • T1 E[R] > 0 at fire
  • T2 mean realized R (accel)
  • T6 no-fire discipline
Does NOT serve
  • T4 regime stability (can't re-roll history)
  • T5 time stability (needs live forward)
📐 Detailed Phases B0 shipped · B0.5 active · B1–B8 conditional 10
  • B0 · Substrate Capture: Quantower Backtest → NQEliteBacktestStrategy → JSONL · 50-quote event-paced · Symbol.LastDateTime · IB ranges verified May 12 (Asia 105pt · LDN 66pt · NY 181pt).
  • B0.5 · Parity Harness · Council pivot to 3-stage architecture: after 6 patches couldn't lift parity past 3.6%, council pivoted to research-backed (NautilusTrader/Lean/Lopez de Prado) sequenced approach. Pre-stage ✅: engine determinism proven (byte-identical fire output, 2 runs same substrate). Stage 0 ✅: Python bridge patched to capture every engine-bound WS message to <date>_engine_input.jsonl · bridge restarted · capture growing. Stage 1 ✅ built: replay_recorded.js replays captured events through harness, diffs vs live · smoke test 78/78 scripts, 0 errors. Awaiting meaningful capture window with live fires. Verdict gate ≥95% on ev_calibration_log per Memory 5. Stages 2 (snapshot-to-event expander) + 3 (warmup phase) conditional on Stage 1 verdict.
  • B1 · Engine Replay Hardening: productionize harness · stub every browser dep · determinism test (3× byte-diff) · warmup period · multi-day stitched replay. 2–3 d conditional on R2.
  • B2 · Reconciliation Gate: extend parity to 3 reference days · all must pass tolerance · lock into regression suite. 1 d.
  • B3 · Resolver Adapter: extend resolve_shadow_outcomes.py for backtest substrate · same buckets (TARGET/STOP/DRIFT/NO_FILL) · output to research_data/backtest/resolved/. 1 d.
  • B4 · Cost & Fill Model: slippage default 1 tick · NQ commission · limit fill: touch ≠ fill, trade-through ≥1 tick = fill · stop +1 tick adverse · NO_FILL honored · sensitivity curve 0/1/2/3 ticks. 1–2 d.
  • B5 · Report + CIs: extend weekly_report.py · bootstrap 95% CI per metric · <30 trades → INSUFFICIENT_SAMPLE · regime cols (P3.9 dep). 2–3 d.
  • B6 · Walk-Forward Harness: 60/20/20 split · tune only on train · hard refuse "tuned metric on tuned window" · rolling N-fold variant. 1 d.
  • B7 · Multi-Setup A/B Harness: N config variants in parallel · side-by-side variant × setup × CI · stat-significance test. 1 d.
  • B8 · Candidate Setups: implement ORB · VWAP-Reversion-2σ · Gap-Fill (shadow-only) · run all 8 setups against 30-d substrate · Q1 evidence file. 4–5 d, the only legitimately week-sized phase.
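The B4 fill rules are compact enough to express directly. The sketch below is a hedged illustration, not the planned implementation: function names are hypothetical, bar highs and lows stand in for whatever price series the resolver actually uses, and commission is left as a parameter because no figure is given above. The 0.25-point tick is the standard NQ minimum increment.

```python
TICK = 0.25  # NQ minimum price increment, in index points

def limit_entry_fill(side, limit_px, bar_low, bar_high, through_ticks=1):
    """Return the fill price, or None for NO_FILL.

    A touch is not a fill: price must trade through the limit by at
    least `through_ticks` ticks before the order counts as filled.
    """
    need = through_ticks * TICK
    if side == "LONG" and bar_low <= limit_px - need:
        return limit_px
    if side == "SHORT" and bar_high >= limit_px + need:
        return limit_px
    return None

def stop_exit_price(side, stop_px, adverse_ticks=1):
    """Stops fill `adverse_ticks` worse than the stop level."""
    slip = adverse_ticks * TICK
    return stop_px - slip if side == "LONG" else stop_px + slip

def realized_r(side, entry, exit_px, risk_pts, commission_pts=0.0):
    """Realized R after costs; commission_pts is a hypothetical parameter."""
    sign = 1 if side == "LONG" else -1
    return (sign * (exit_px - entry) - commission_pts) / risk_pts

# Sensitivity curve for a stopped-out long: entry 20000, stop 19990 (10 pt risk).
for ticks in (0, 1, 2, 3):
    exit_px = stop_exit_price("LONG", 19990.0, adverse_ticks=ticks)
    print(ticks, round(realized_r("LONG", 20000.0, exit_px, 10.0), 3))
```

Returning `None` for an untouched or merely touched limit keeps NO_FILL as an explicit outcome instead of folding it into a zero-R trade.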
🧠 Architectural Lessons — B0.5 First Pass each would have wasted days · captured 2026-05-13 7
🪧 Honesty Governor on Estimates added 2026-05-13 after Frank-grade catch
First cut had 5.5-wk estimates inflated by uniform "couple of days each" buffers — lazy padding that looked plausible per row but doubled the aggregate. 2026-05-13 evening update: B0.5 first pass came in at 1 session against the 1–2 sess estimate — on schedule. Centerline 2026-06-07 holds. If a stage routinely takes longer than its honest estimate, diagnose why rather than retroactively re-justify the padding.