{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Reproduce all figures and tables\n",
    "\n",
    "**Paper**: *Beyond Accuracy and Alignment: A Diagnostic Protocol for Evaluating Feedback Alignment* (NeurIPS 2026 E&D track)\n",
    "\n",
    "This notebook walks through reproducing every figure and table in the paper from the saved JSON / log files in `results/`. Every cell pulls from a saved file (no training is invoked), so the entire notebook runs in seconds and serves as the auditable single-source-of-truth pointer for each cited number.\n",
    "\n",
    "**What this notebook reproduces**:\n",
    "- Table 1 (5-method audit accuracies)\n",
    "- Table 2 (mode validation)\n",
    "- Table 3 (protocol definition — static, no data)\n",
    "- Table 5 (depth sweep, Appendix H)\n",
    "- Table 6 (no-residual ablation, Appendix H)\n",
    "- Table 9 (SB+CB penalty rescue, Appendix J)\n",
    "- Figure 1 (audit hero) — references the saved figure file\n",
    "- Figure 2 (cross-method dissociation) — re-renders from saved JSON\n",
    "- Figure 3 (temporal cross-arch) — references saved\n",
    "- Figure 4 (penalty rescue) — re-renders from saved JSON\n",
    "- Figure 5 (cross-arch verdict matrix) — re-renders from hand-encoded data\n",
    "- §4 ¶4 cross-method functional triangulation (nudging + training-loss decrease)\n",
    "- §5 ¶3 BP+penalty 2x2 control\n",
    "- Appendix L drift values\n",
    "- Appendix M layer-0 dominance per-seed table\n",
    "\n",
    "**What this notebook does NOT do**:\n",
    "- Re-train any model (use the experiment scripts in `experiments/` for that)\n",
    "- Re-measure cosines on saved checkpoints (use `experiments/measure_direction_quality_existing_ckpt.py`)\n",
    "\n",
    "All values that the paper cites are derived in cells below by loading the corresponding `results/*.json` file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import numpy as np\n",
    "from pathlib import Path\n",
    "import os\n",
    "\n",
    "REPO_ROOT = Path('/home/yurenh2/fa')\n",
    "os.chdir(REPO_ROOT)\n",
    "\n",
    "def both_stds(vals):\n",
    "    \"\"\"Return (mean, ddof=0 std, ddof=1 std) for a list of measurements.\n",
    "    \n",
    "    The paper uses ddof=1 (sample std with Bessel correction).\n",
    "    \"\"\"\n",
    "    return np.mean(vals), np.std(vals, ddof=0), np.std(vals, ddof=1)\n",
    "\n",
    "def load_json(rel):\n",
    "    return json.load(open(REPO_ROOT / rel))\n",
    "\n",
    "print('repo root:', REPO_ROOT)\n",
    "print('saved auditable files:')\n",
    "for f in sorted((REPO_ROOT / 'results').glob('*.json')):\n",
    "    print(f' ', f.name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Table 1 — 5-method audit on the 4-block d=256 pre-LayerNorm ResMLP\n",
    "\n",
    "**Source**: `results/protocol_audit/audit_table_s42_s123_s456.json` (3 seeds × 5 methods)\n",
    "\n",
    "Each row shows test accuracy ± sample std (ddof=1) and headline Γ at the converged 100-epoch checkpoint."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d = load_json('results/protocol_audit/audit_table_s42_s123_s456.json')\n",
    "print(f'{\"method\":<14} {\"acc (±ddof=1)\":<18} {\"per-seed accs\"}')\n",
    "print('-' * 70)\n",
    "for m, label in [('bp', 'BP'), ('ep', 'EP'), ('dfa', 'DFA'),\n",
    "                 ('state_bridge', 'State Bridge'), ('credit_bridge', 'Credit Bridge')]:\n",
    "    accs = [d['reports'][f'{m}_s{s}']['headline_acc'] for s in [42, 123, 456]]\n",
    "    mean, _, ddof1 = both_stds(accs)\n",
    "    print(f'{label:<14} {mean:.3f} ± {ddof1:.3f}     {[f\"{a:.4f}\" for a in accs]}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Frozen-blocks baseline\n",
    "\n",
    "**Source**: `results/resmlp_frozen_blocks_s{42,123,456}.log`\n",
    "\n",
    "DFA-shallow accuracy (the architecture-matched baseline used as the comparison for diagnostic (d))."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "shallow = []\n",
    "for s in [42, 123, 456]:\n",
    "    log = open(REPO_ROOT / f'results/resmlp_frozen_blocks_s{s}.log').read()\n",
    "    m = re.search(r'FINAL DFA-shallow: (\\d+\\.\\d+)', log)\n",
    "    if m: shallow.append(float(m.group(1)))\n",
    "mean, _, ddof1 = both_stds(shallow)\n",
    "print(f'DFA-shallow (frozen baseline): {mean:.3f} ± {ddof1:.3f}')\n",
    "print(f'  per-seed: {shallow}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## §5 — matched 30-epoch BP/DFA controls (with and without penalty)\n",
    "\n",
    "**Sources**:\n",
    "- BP no-pen: `results/bp_no_penalty_30ep/bp_pen_lam0.0_s{42,123,456}.json`\n",
    "- BP+pen: `results/bp_with_penalty/bp_pen_lam0.01_s{42,123,456}.json`\n",
    "- DFA no-pen: `results/dfa_no_penalty_30ep/results_cifar10.json`\n",
    "- DFA+pen: `results/dfa_pen_short/dfa_pen_lam0.01_s{42,123,456}.json`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# BP no-pen\n",
    "bp_nopen = [load_json(f'results/bp_no_penalty_30ep/bp_pen_lam0.0_s{s}.json')['final_acc'] for s in [42, 123, 456]]\n",
    "mean, _, ddof1 = both_stds(bp_nopen)\n",
    "print(f'BP no-pen 30ep:  {mean:.3f} ± {ddof1:.3f}')\n",
    "\n",
    "# BP+pen\n",
    "bp_pen = [load_json(f'results/bp_with_penalty/bp_pen_lam0.01_s{s}.json')['final_acc'] for s in [42, 123, 456]]\n",
    "mean, _, ddof1 = both_stds(bp_pen)\n",
    "print(f'BP+pen 30ep:    {mean:.3f} ± {ddof1:.3f}')\n",
    "\n",
    "# DFA no-pen\n",
    "d = load_json('results/dfa_no_penalty_30ep/results_cifar10.json')\n",
    "dfa_nopen = [d[str(s)]['dfa']['log']['test_acc'][-1] for s in [42, 123, 456]]\n",
    "mean, _, ddof1 = both_stds(dfa_nopen)\n",
    "print(f'DFA no-pen 30ep: {mean:.3f} ± {ddof1:.3f}')\n",
    "\n",
    "# DFA+pen\n",
    "dfa_pen = [load_json(f'results/dfa_pen_short/dfa_pen_lam0.01_s{s}.json')['final_test_acc'] for s in [42, 123, 456]]\n",
    "mean, _, ddof1 = both_stds(dfa_pen)\n",
    "print(f'DFA+pen 30ep:   {mean:.3f} ± {ddof1:.3f}')\n",
    "\n",
    "# Penalty cost / margin math\n",
    "frozen = 0.349\n",
    "print()\n",
    "print('§5 ¶3 derived quantities:')\n",
    "print(f'  BP penalty cost:        {(np.mean(bp_nopen) - np.mean(bp_pen))*100:.1f} pp')\n",
    "print(f'  DFA penalty rescue:     {(np.mean(dfa_pen) - np.mean(dfa_nopen))*100:.1f} pp')\n",
    "print(f'  BP+pen margin vs frozen:    {(np.mean(bp_pen) - frozen)*100:.1f} pp')\n",
    "print(f'  DFA+pen margin vs frozen:   {(np.mean(dfa_pen) - frozen)*100:.1f} pp')\n",
    "print(f'  BP-to-DFA gap (under penalty): {(np.mean(bp_pen) - np.mean(dfa_pen))*100:.1f} pp')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## §4 ¶4 — SB+pen, CB+pen, DFA+pen accuracies, cosines, ρ"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "files_by_method = {\n",
    "    'state_bridge': [\n",
    "        ('round38_sbcb_penalty_30ep', '42'),\n",
    "        ('round38_sb_penalty_30ep_s123', '123'),\n",
    "        ('round38_sb_penalty_30ep_s456', '456'),\n",
    "    ],\n",
    "    'credit_bridge': [\n",
    "        ('round38_sbcb_penalty_30ep', '42'),\n",
    "        ('round38_cb_penalty_30ep_s123', '123'),\n",
    "        ('round38_cb_penalty_30ep_s456', '456'),\n",
    "    ],\n",
    "    'dfa': [\n",
    "        ('round41_dfa_penalty_30ep', '42'),\n",
    "        ('round41_dfa_penalty_30ep_s123', '123'),\n",
    "        ('round41_dfa_penalty_30ep_s456', '456'),\n",
    "    ],\n",
    "}\n",
    "\n",
    "labels = {'state_bridge': 'SB+pen', 'credit_bridge': 'CB+pen', 'dfa': 'DFA+pen'}\n",
    "for m, files in files_by_method.items():\n",
    "    accs, cos_deep, rho_deep = [], [], []\n",
    "    for tag, sk in files:\n",
    "        d = load_json(f'results/{tag}/results_cifar10.json')\n",
    "        accs.append(d[sk][m]['log']['test_acc'][-1])\n",
    "        diag = d[sk][m]['diagnostics']\n",
    "        cos_deep.append(np.mean(diag['bp_cosine'][1:]))\n",
    "        rho_deep.append(np.mean(diag['perturbation_rho'][1:]))\n",
    "    a_m, _, a_s = both_stds(accs)\n",
    "    c_m, _, c_s = both_stds(cos_deep)\n",
    "    r_m, _, r_s = both_stds(rho_deep)\n",
    "    print(f'{labels[m]:<8} acc {a_m:.3f}±{a_s:.3f}  cos {c_m:+.3f}±{c_s:.3f}  rho {r_m:+.3f}±{r_s:.3f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## §4 ¶4 — Nudging test 3-seed (the strongest functional metric)\n",
    "\n",
    "**Source**: `results/nudging_test_3seed_summary.json`\n",
    "\n",
    "Single-step loss change for a step of size η=0.01 along the per-layer credit direction at the converged checkpoint, averaged over the deep blocks (l1+)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n = load_json('results/nudging_test_3seed_summary.json')\n",
    "for m, label in [('state_bridge', 'SB+pen'), ('credit_bridge', 'CB+pen'), ('dfa', 'DFA+pen')]:\n",
    "    vals = [v['deep_mean'] for v in n['methods'][m]['per_seed'].values()]\n",
    "    mean, _, ddof1 = both_stds(vals)\n",
    "    print(f'{label:<8}: {mean:.2e} ± {ddof1:.2e}  (per seed: {[f\"{v:.2e}\" for v in vals]})')\n",
    "\n",
    "sb = n['methods']['state_bridge']['three_seed_deep_mean']\n",
    "cb = n['methods']['credit_bridge']['three_seed_deep_mean']\n",
    "dfa = n['methods']['dfa']['three_seed_deep_mean']\n",
    "print()\n",
    "print(f'SB / CB ratio:  {sb / cb:.2f}')\n",
    "print(f'SB / DFA ratio: {sb / dfa:.2f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## §4 ¶4 — Training loss decrease 3-seed\n",
    "\n",
    "**Source**: `results/training_loss_decrease_3seed.json`\n",
    "\n",
    "Loss[ep1] − Loss[ep30] for each method, averaged over 3 seeds."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "t = load_json('results/training_loss_decrease_3seed.json')\n",
    "for m, label in [('state_bridge', 'SB+pen'), ('credit_bridge', 'CB+pen'), ('dfa', 'DFA+pen')]:\n",
    "    vals = [v['decrease'] for v in t['per_method'][m]['per_seed'].values()]\n",
    "    mean, _, ddof1 = both_stds(vals)\n",
    "    print(f'{label:<8}: {mean:.4f} ± {ddof1:.4f}  (per seed: {[f\"{v:.4f}\" for v in vals]})')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Appendix M — vanilla DFA early-epoch per-layer cosines (layer-0 dominance)\n",
    "\n",
    "**Source**: `results/vanilla_dfa_early_ckpts/per_layer_cos_3seed.json`\n",
    "\n",
    "Per-seed × per-epoch × per-layer cosine measurements showing that the headline Γ on vanilla DFA is driven entirely by layer 0, with all deep layers (1-4) at noise."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d = load_json('results/vanilla_dfa_early_ckpts/per_layer_cos_3seed.json')\n",
    "print(f'{\"key\":<12} {\"l0\":>8} {\"l1\":>8} {\"l2\":>8} {\"l3\":>8} {\"l4\":>8}  {\"||g_2||\"}')\n",
    "for k, v in d.items():\n",
    "    cos = v['per_layer_cos']\n",
    "    g2 = v['per_layer_g_norm_median'][2]\n",
    "    print(f'{k:<12} ' + ' '.join(f'{c:+8.3f}' for c in cos) + f'  {g2:.2e}')\n",
    "\n",
    "# Aggregate stats\n",
    "ep1 = [np.mean(d[f's{s}_ep1']['per_layer_cos'][1:]) for s in [42, 123, 456]]\n",
    "mean, _, ddof1 = both_stds(ep1)\n",
    "print(f'\\nep 1 deep mean (3-seed): {mean:.4f} ± {ddof1:.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## §6 ¶1 — protocol calibration gaps (for the 4-diagnostic protocol)\n",
    "\n",
    "**Source**: `results/protocol_audit/audit_table_s42_s123_s456.json`\n",
    "\n",
    "The 24,338× and 63× gaps between healthy (BP/EP) and degenerate (DFA/SB/CB) reference quantities."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d = load_json('results/protocol_audit/audit_table_s42_s123_s456.json')\n",
    "\n",
    "# Per-seed g_L (deepest BP gradient norm)\n",
    "healthy_g, degen_g = [], []\n",
    "for m in ['bp', 'ep']:\n",
    "    for s in [42, 123, 456]:\n",
    "        g = d['reports'][f'{m}_s{s}']['bp_grad_norms'][-1]\n",
    "        healthy_g.append(g)\n",
    "for m in ['dfa', 'state_bridge', 'credit_bridge']:\n",
    "    for s in [42, 123, 456]:\n",
    "        g = d['reports'][f'{m}_s{s}']['bp_grad_norms'][-1]\n",
    "        degen_g.append(g)\n",
    "\n",
    "print(f'min healthy ||g_L|| = {min(healthy_g):.2e}')\n",
    "print(f'max degenerate ||g_L|| = {max(degen_g):.2e}')\n",
    "print(f'gap factor = {min(healthy_g) / max(degen_g):.0f}×')\n",
    "print()\n",
    "\n",
    "# Per-seed max-per-block growth\n",
    "healthy_growth, degen_growth = [], []\n",
    "for m in ['bp', 'ep']:\n",
    "    for s in [42, 123, 456]:\n",
    "        res = d['reports'][f'{m}_s{s}']['residual_norms']\n",
    "        ratios = [res[i+1]/res[i] for i in range(len(res)-1)]\n",
    "        healthy_growth.append(max(ratios))\n",
    "for m in ['dfa', 'state_bridge', 'credit_bridge']:\n",
    "    for s in [42, 123, 456]:\n",
    "        res = d['reports'][f'{m}_s{s}']['residual_norms']\n",
    "        ratios = [res[i+1]/res[i] for i in range(len(res)-1)]\n",
    "        degen_growth.append(max(ratios))\n",
    "\n",
    "print(f'max healthy per-block growth = {max(healthy_growth):.2f}')\n",
    "print(f'min degenerate per-block growth = {min(degen_growth):.2f}')\n",
    "print(f'gap factor = {min(degen_growth) / max(healthy_growth):.1f}×')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## §6 ¶2 — fresh-B null calibration (penalty creates real signal)\n",
    "\n",
    "**Source**: `results/null_calibration_penalized_dfa.json`\n",
    "\n",
    "20 fresh random-B draws on the penalized DFA s42 checkpoint, vs the training-Bs deep cosine."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n = load_json('results/null_calibration_penalized_dfa.json')\n",
    "print(f'training-Bs deep cos (s42): {n[\"training_Bs_deep_cos\"]:+.4f}')\n",
    "print(f'fresh-Bs deep cos (n=20):    {n[\"fresh_Bs_deep_mean_of_per_draw_means\"]:+.4f} ± {n[\"fresh_Bs_deep_std_of_per_draw_means_ddof0\"]:.4f}')\n",
    "print()\n",
    "print('per-layer std across 20 fresh-B draws:')\n",
    "for i, s in enumerate(n['fresh_Bs_per_layer_std_ddof0']):\n",
    "    print(f'  l{i}: {s:.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## §3 ¶3 — no-terminal-LN ResMLP same-backbone control\n",
    "\n",
    "**Source**: `results/snapshot_no_outln_v1/snapshot_noLN_s{42,123,456}.json`\n",
    "\n",
    "Removing terminal LN from the same backbone preserves Mode 1(a) but eliminates Mode 1(b)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hL_vals, gL_vals, accs = [], [], []\n",
    "for s in [42, 123, 456]:\n",
    "    d = load_json(f'results/snapshot_no_outln_v1/snapshot_noLN_s{s}.json')\n",
    "    final = d['dfa_log'][-1]\n",
    "    hL_vals.append(final['hidden_norms'][-1])\n",
    "    gL_vals.append(final['bp_grad_per_sample_l2_med'][-1])\n",
    "    accs.append(final['acc_eval'])\n",
    "\n",
    "print(f'no-outln DFA 100ep, 3 seeds:')\n",
    "print(f'  ||h_L|| 3-seed mean: {np.mean(hL_vals):.2e}  (per seed: {[f\"{v:.2e}\" for v in hL_vals]})')\n",
    "print(f'  ||g_L|| 3-seed mean: {np.mean(gL_vals):.2e}  (per seed: {[f\"{v:.2e}\" for v in gL_vals]})')\n",
    "mean, _, ddof1 = both_stds(accs)\n",
    "print(f'  test acc:  {mean:.3f} ± {ddof1:.3f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Figure 2 — re-render the cross-method dissociation visualization\n",
    "\n",
    "**Renderer**: `paper/figures/render_fig_cos_acc_dissociation.py`\n",
    "\n",
    "Re-running the renderer regenerates `paper/figures/fig_cos_acc_dissociation.pdf`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import subprocess\n",
    "result = subprocess.run(['python3', 'paper/figures/render_fig_cos_acc_dissociation.py'],\n",
    "                       capture_output=True, text=True)\n",
    "print(result.stdout)\n",
    "print(result.stderr)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Figure 4 — re-render penalty rescue panels\n",
    "\n",
    "**Renderer**: `paper/figures/render_fig4_penalty_rescue.py`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "result = subprocess.run(['python3', 'paper/figures/render_fig4_penalty_rescue.py'],\n",
    "                       capture_output=True, text=True)\n",
    "print(result.stdout)\n",
    "print(result.stderr)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Figure 5 — re-render cross-architecture verdict matrix\n",
    "\n",
    "**Renderer**: `paper/figures/render_fig5_cross_arch.py`\n",
    "\n",
    "The verdict matrix is hand-encoded based on the per-row data sources (see the script's docstring for which JSON each row references)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "result = subprocess.run(['python3', 'paper/figures/render_fig5_cross_arch.py'],\n",
    "                       capture_output=True, text=True)\n",
    "print(result.stdout)\n",
    "print(result.stderr)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Compile the paper PDF\n",
    "\n",
    "Final step: re-run tectonic on `paper/main.tex` to produce a fresh PDF that incorporates any updated figures."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "result = subprocess.run(['tectonic', 'paper/main.tex'],\n",
    "                       capture_output=True, text=True, cwd=str(REPO_ROOT))\n",
    "# print last 500 chars of stderr (tectonic warnings/errors)\n",
    "print(result.stderr[-500:] if result.stderr else 'no stderr')\n",
    "print()\n",
    "import subprocess as sp\n",
    "info = sp.run(['pdfinfo', 'paper/main.pdf'], capture_output=True, text=True, cwd=str(REPO_ROOT))\n",
    "for line in info.stdout.split('\\n'):\n",
    "    if 'Pages' in line: print(line)\n",
    "print(f'\\nPDF: {REPO_ROOT}/paper/main.pdf')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "All paper figures and tables can be reproduced from the following saved files:\n",
    "\n",
    "| Source | Used by |\n",
    "|---|---|\n",
    "| `results/protocol_audit/audit_table_s42_s123_s456.json` | Table 1, Figure 1, §6 ¶1 |\n",
    "| `results/protocol_audit/audit_d512_3seed.json` | Appendix H d=512 |\n",
    "| `results/protocol_audit/audit_cnn_3seed.json` | §3 ¶3 / §5 ¶3 CNN values, Figure 5 |\n",
    "| `results/protocol_audit/temporal_evolution_s{42,123,456}.json` | §3 ¶3 ep-4 g_L, Figure 5 row 4 |\n",
    "| `results/snapshot_no_outln_v1/snapshot_noLN_s{42,123,456}.json` | §3 ¶3 no-outln control |\n",
    "| `results/snapshot_evolution_v2/snapshot_evolution_s{42,123,456}.json` | §3 ¶1 endpoint values |\n",
    "| `results/dfa_pen_short/dfa_pen_lam0.01_s{42,123,456}.json` | DFA+pen 30ep |\n",
    "| `results/dfa_pen_short/dfa_pen_lam0.0001_s{42,123,456}.json` | §5 ¶2 λ=1e-4 |\n",
    "| `results/round38_sbcb_penalty_30ep/results_cifar10.json` (s42) | SB+pen, CB+pen s42 |\n",
    "| `results/round38_{sb,cb}_penalty_30ep_s{123,456}/results_cifar10.json` | SB+pen, CB+pen s123/s456 |\n",
    "| `results/round41_dfa_penalty_30ep{,_s{123,456}}/results_cifar10.json` | DFA+pen 30ep diagnostics |\n",
    "| `results/bp_no_penalty_30ep/bp_pen_lam0.0_s{42,123,456}.json` | §5 ¶3 BP no-pen matched |\n",
    "| `results/bp_with_penalty/bp_pen_lam0.01_s{42,123,456}.json` | §5 ¶3 BP+pen multi-seed |\n",
    "| `results/dfa_no_penalty_30ep/results_cifar10.json` | §5 ¶3 DFA no-pen matched |\n",
    "| `results/resmlp_frozen_blocks_s{42,123,456}.log` | Frozen baseline 0.349 |\n",
    "| `results/h2_no_residual_full_s{42,123,456}/snapshot_evolution_s{42,123,456}.json` | Appendix H no-residual ablation |\n",
    "| `results/optionA_random_targets_s42/snapshot_evolution_s42.json` | Appendix I random-target DFA |\n",
    "| `results/optionSBCB_smoke/results_cifar10.json` | Appendix I random-target SB/CB 3ep |\n",
    "| `results/optionSBCB_random_targets_s42/results_cifar10.json` | Appendix I random-target SB/CB 100ep |\n",
    "| `results/optionEP_smoke/ep_random_s42.pt` | EP random-target 5ep |\n",
    "| `results/optionEP_random_targets_full/ep_random_s42.pt` | EP random-target 100ep |\n",
    "| `results/ep_random_h_L_summary.json` | EP random-target h_L 3-seed |\n",
    "| `results/null_calibration_penalized_dfa.json` | §6 ¶2 fresh-B null |\n",
    "| `results/nudging_test_3seed_summary.json` | §4 ¶4 nudging test 3-seed |\n",
    "| `results/training_loss_decrease_3seed.json` | §4 ¶4 training-loss trajectory 3-seed |\n",
    "| `results/matched_30ep_control_summary.json` | §5 ¶3 matched 30-ep summary |\n",
    "| `results/bp_with_penalty_3seed_summary.json` | §5 ¶3 BP+pen 3-seed |\n",
    "| `results/vanilla_dfa_early_ckpts/per_layer_cos_3seed.json` | Appendix M layer-0 dominance |\n",
    "| `results/threshold_sensitivity_output.txt` | Appendix E threshold sweep |\n",
    "\n",
    "**Statistical convention**: as of v2.38, all 3-seed standard deviations in the paper use ddof=1 (sample std with Bessel correction). The `both_stds()` helper at the top of this notebook returns both ddof=0 and ddof=1 for any list of values; the paper-cited value is always the ddof=1 column.\n",
    "\n",
    "**To re-run the experiments themselves** (for re-training or re-measuring), see the corresponding scripts in `experiments/` and `protocol/examples/`. The training scripts each take a `--seed` argument; the standard 3-seed set is {42, 123, 456}."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}