author: YurenHao0426 <Blackhao0426@gmail.com> 2026-04-08 19:11:40 -0500
committer: YurenHao0426 <Blackhao0426@gmail.com> 2026-04-08 19:11:40 -0500
commit: d022688ea9fcfcb81f900751ee92e35597ef19b8
tree: 192e620a4dc4915d37c590fdc2621abeaa15c8c3
parent: 6a057a379e58dc464f04e5208861699b01b5d477

paper v2.32: BP+penalty multi-seeded (was single-seed s42)

The §5 ¶3 BP+penalty value (0.530, +18.1 pp margin) was single-seed s42. Ran
s123 and s456 to multi-seed it, matching the BP-no-pen 3-seed control.

3-seed BP+pen 30ep results (lam=0.01, AdamW lr=1e-3 wd=0.01, cosine, batch 128):
  s42:  0.5303, +18.13 pp vs frozen
  s123: 0.5262, +17.72 pp
  s456: 0.5397, +19.07 pp
  3-seed mean: 0.5321 ± 0.0057, +18.31 pp

Updates:
- §5 ¶3: BP+pen "0.530 (single seed)" → "0.532 ± 0.006" (3-seed)
- §5 ¶3: BP penalty cost -5.5 pp → -5.3 pp
- §5 ¶3: BP+pen margin +18.1 → +18.3 pp
- §5 ¶3: BP-to-DFA gap 17.0 → 17.2 pp
- §4 ¶4: BP+pen +18.1 → +18.3 pp comparison
- Figure 3 panel C bar values: BP with_pen 0.530 → 0.532
- Figure 3 panel C title: BP-pen-cost -5.5pp → -5.3pp

The +18.3 pp 3-seed mean is essentially the same as the s42 single-seed
+18.13 pp, so the headline conclusion (BP+pen far above frozen baseline, huge
gap vs DFA+pen) is unchanged. This commit removes the last single-seed value
labeled as a key control.

New auditable file: results/bp_with_penalty_3seed_summary.json

Page layout preserved: 9 pages main, refs p10, 0 overfull boxes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

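The 3-seed figures quoted in the message above can be recomputed from the three per-seed accuracies. The following is a minimal sketch, not the script used in the repository: the function name `summarize` and the variable names are hypothetical, and the only inputs are the per-seed accuracies and the 0.349 frozen baseline stated in the message.

```python
import statistics

# Per-seed final test accuracies, copied from the commit message.
per_seed_acc = {42: 0.5303, 123: 0.5262, 456: 0.5397}
FROZEN_BASELINE = 0.349  # frozen-blocks shallow baseline quoted in the message


def summarize(accs, baseline):
    """Hypothetical helper: mean, both std conventions, and margins in pp."""
    values = list(accs.values())
    mean_acc = statistics.mean(values)
    return {
        "mean_acc": round(mean_acc, 4),
        # Population std (ddof=0) and sample std (ddof=1).
        "std_ddof0": round(statistics.pstdev(values), 4),
        "std_ddof1": round(statistics.stdev(values), 4),
        "margin_pp_per_seed": {
            seed: round((acc - baseline) * 100, 2) for seed, acc in accs.items()
        },
        "margin_pp_mean": round((mean_acc - baseline) * 100, 2),
    }


summary = summarize(per_seed_acc, FROZEN_BASELINE)
print(summary)
```

Both standard-deviation conventions are computed because the reported ± 0.0057 corresponds to the population form (ddof=0); the sample form (ddof=1) is slightly larger at three seeds.
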
Diffstat:
-rw-r--r-- paper/figures/fig4_penalty_rescue.pdf | bin 34199 -> 34201 bytes
-rw-r--r-- paper/figures/render_fig4_penalty_rescue.py | 4
-rw-r--r-- paper/main.pdf | bin 501164 -> 501197 bytes
-rw-r--r-- paper/main.tex | 4
-rw-r--r-- results/bp_with_pen_multiseed.log | 39
-rw-r--r-- results/bp_with_penalty/bp_pen_lam0.01_s123.json | 42
-rw-r--r-- results/bp_with_penalty/bp_pen_lam0.01_s456.json | 42
-rw-r--r-- results/bp_with_penalty_3seed_summary.json | 31

8 files changed, 158 insertions, 4 deletions

diff --git a/paper/figures/fig4_penalty_rescue.pdf b/paper/figures/fig4_penalty_rescue.pdf
index deef685..8e77fc7 100644
--- a/paper/figures/fig4_penalty_rescue.pdf
+++ b/paper/figures/fig4_penalty_rescue.pdf
Binary files differ
diff --git a/paper/figures/render_fig4_penalty_rescue.py b/paper/figures/render_fig4_penalty_rescue.py
index 614698c..b0f46cd 100644
--- a/paper/figures/render_fig4_penalty_rescue.py
+++ b/paper/figures/render_fig4_penalty_rescue.py
@@ -33,7 +33,7 @@ rho_err = [0.005, 0.0, 0.011, 0.0, 0.0]
# Panel C: 2x2 capacity-cost control
methods = ["BP", "DFA"]
no_pen = [0.585, 0.301]
-with_pen = [0.530, 0.360]
+with_pen = [0.532, 0.360]
shallow = 0.349
fig, axes = plt.subplots(1, 3, figsize=(13, 6.0))
@@ -81,7 +81,7 @@ ax.axhline(shallow, color="black", ls="--", lw=1, label=f"frozen baseline {shall
ax.set_xticks(xpos)
ax.set_xticklabels(methods, fontsize=10)
ax.set_ylabel("test accuracy", fontsize=10)
-ax.set_title("(c) BP+penalty 2$\\times$2 control\n(BP-pen-cost $-5.5$pp; gap $17$pp $=$ credit quality)", fontsize=10)
+ax.set_title("(c) BP+penalty 2$\\times$2 control\n(BP-pen-cost $-5.3$pp; gap $17$pp $=$ credit quality)", fontsize=10)
ax.legend(loc="upper right", fontsize=8)
ax.grid(True, axis="y", alpha=0.3)
ax.set_ylim(0, 0.7)
diff --git a/paper/main.pdf b/paper/main.pdf
index bd6fa77..48a39d9 100644
--- a/paper/main.pdf
+++ b/paper/main.pdf
Binary files differ
diff --git a/paper/main.tex b/paper/main.tex
index 981d511..7f3b79d 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -98,7 +98,7 @@ A second metric with different numerical failure modes tells the same story. Cos
Per-layer reporting is therefore not cosmetic. In ResMLP under vanilla DFA, the headline aggregate alignment $\Gamma \approx 0.07$--$0.10$ can look mildly positive only because layer $0$ remains strongly aligned while the deep network is not: at the same epoch-1 checkpoints where layers $1$--$4$ are essentially zero, layer $0$ has cosine $+0.42$, $+0.44$, and $+0.42$ across seeds (Table~\ref{tab:mode_validation}; per-seed values in Appendix~\ref{app:layer0_dominance}). The resulting average can therefore be driven by the embedding layer even when the interior blocks are effectively unaligned, so aggregate reporting obscures the very distinction needed to separate ``measurement collapse'' from ``poor credit direction.'' This layer-$0$ dominance is specific to the ResMLP DFA setting; on ViT-Mini DFA, all layers are near zero, which strengthens the broader methodological point that alignment should be reported per layer rather than only in aggregate. With the two modes separated observationally, the remaining question is whether intervention can move them independently.
-Mode~2 has method-dependent severity within the audited fixed-feedback family once Mode~1 is alleviated. Applying the same per-block scale-control penalty $\lambda{=}10^{-2}$ that rescued DFA to State Bridge and to Credit Bridge on the same 4-block $d{=}256$ ResMLP backbone over $30$ epochs and three seeds gives converged test accuracies of $0.453 \pm 0.003$ (SB) and $0.360 \pm 0.003$ (CB), with deep mean cosines of $+0.322 \pm 0.007$ (SB) and $+0.679 \pm 0.008$ (CB) and deep mean $\rho$ of $+0.402 \pm 0.015$ (SB) and $+0.464 \pm 0.025$ (CB), while DFA under the same intervention reaches $0.360 \pm 0.001$ with deep cosine $+0.151 \pm 0.025$ and deep $\rho$ $+0.080 \pm 0.011$ (Table~\ref{tab:mode_validation}; Appendix~\ref{app:sb_penalty}). The State Bridge penalty rescue is roughly $24$ percentage points above the vanilla State Bridge baseline of $0.213$ on the same architecture and, more importantly for the paper's central walk-back, exceeds the architecture-matched frozen-blocks shallow baseline of $0.349$ by $+10.4$ percentage points. State Bridge with the penalty intervention is therefore the first audited non-BP method whose trained deep blocks substantively improve over an architecture-matched random-block baseline; the headline accuracy gap is comparable to BP+penalty's $+18.1$ pp over the same shallow baseline. Neither the activation scale nor the deep BP gradient magnitude is silenced under the penalty: $\|h_L\|$ stays at $302 \pm 8$ for SB and $5680 \pm 178$ for CB, with $\|g_L\|$ at $\sim\!1.8\times 10^{-4}$ and $\sim\!1.9\times 10^{-5}$ respectively, both well within the meaningful-measurement regime, so the recovered deep cosines are computed against an informative reference and not against a numerical floor. Within this rescued regime, the three methods reveal a clean cosine-versus-accuracy dissociation. 
Credit Bridge achieves roughly $4\times$ the deep cosine of DFA and $2\times$ that of State Bridge, yet its final accuracy matches DFA's and is $9$ percentage points below State Bridge's. We therefore frame the Mode~2 reading as a three-part proposition. \emph{Observation}: under the same intervention and matched training budget, CB and DFA reach the same accuracy despite a $4\times$ deep-cosine gap, while SB is the best accuracy with intermediate cosine. \emph{Inference}: layerwise cosine to the BP gradient is necessary to rule out grossly wrong credit signals (it distinguishes the rescued regime from the clamp-dominated vanilla regime), but it is not sufficient to certify that the supplied signal is useful credit for depth. \emph{Mechanism hypothesis}: usefulness depends on whether the local update induces useful forward-state change across blocks, not merely whether its direction is close to the BP gradient in angle. Under this reading, CB supplies a gradient-direction surrogate that aligns with BP in angle but does not translate to a coordinated forward-state improvement, while State Bridge supplies a state-level downstream teaching signal that preserves aspects of useful credit which layerwise cosine does not measure. We state this as a mechanism hypothesis rather than a theorem because we have measured the angle-to-accuracy gap but not the full functional-credit content; the reporting rule that follows is robust to either interpretation. This cross-method dissociation strengthens the methodological point that alignment must be reported jointly with measurement validity and a depth-utilization baseline rather than as a single headline number.
+Mode~2 has method-dependent severity within the audited fixed-feedback family once Mode~1 is alleviated. Applying the same per-block scale-control penalty $\lambda{=}10^{-2}$ that rescued DFA to State Bridge and to Credit Bridge on the same 4-block $d{=}256$ ResMLP backbone over $30$ epochs and three seeds gives converged test accuracies of $0.453 \pm 0.003$ (SB) and $0.360 \pm 0.003$ (CB), with deep mean cosines of $+0.322 \pm 0.007$ (SB) and $+0.679 \pm 0.008$ (CB) and deep mean $\rho$ of $+0.402 \pm 0.015$ (SB) and $+0.464 \pm 0.025$ (CB), while DFA under the same intervention reaches $0.360 \pm 0.001$ with deep cosine $+0.151 \pm 0.025$ and deep $\rho$ $+0.080 \pm 0.011$ (Table~\ref{tab:mode_validation}; Appendix~\ref{app:sb_penalty}). The State Bridge penalty rescue is roughly $24$ percentage points above the vanilla State Bridge baseline of $0.213$ on the same architecture and, more importantly for the paper's central walk-back, exceeds the architecture-matched frozen-blocks shallow baseline of $0.349$ by $+10.4$ percentage points. State Bridge with the penalty intervention is therefore the first audited non-BP method whose trained deep blocks substantively improve over an architecture-matched random-block baseline; the headline accuracy gap is comparable to BP+penalty's $+18.3$ pp over the same shallow baseline. Neither the activation scale nor the deep BP gradient magnitude is silenced under the penalty: $\|h_L\|$ stays at $302 \pm 8$ for SB and $5680 \pm 178$ for CB, with $\|g_L\|$ at $\sim\!1.8\times 10^{-4}$ and $\sim\!1.9\times 10^{-5}$ respectively, both well within the meaningful-measurement regime, so the recovered deep cosines are computed against an informative reference and not against a numerical floor. Within this rescued regime, the three methods reveal a clean cosine-versus-accuracy dissociation. 
Credit Bridge achieves roughly $4\times$ the deep cosine of DFA and $2\times$ that of State Bridge, yet its final accuracy matches DFA's and is $9$ percentage points below State Bridge's. We therefore frame the Mode~2 reading as a three-part proposition. \emph{Observation}: under the same intervention and matched training budget, CB and DFA reach the same accuracy despite a $4\times$ deep-cosine gap, while SB is the best accuracy with intermediate cosine. \emph{Inference}: layerwise cosine to the BP gradient is necessary to rule out grossly wrong credit signals (it distinguishes the rescued regime from the clamp-dominated vanilla regime), but it is not sufficient to certify that the supplied signal is useful credit for depth. \emph{Mechanism hypothesis}: usefulness depends on whether the local update induces useful forward-state change across blocks, not merely whether its direction is close to the BP gradient in angle. Under this reading, CB supplies a gradient-direction surrogate that aligns with BP in angle but does not translate to a coordinated forward-state improvement, while State Bridge supplies a state-level downstream teaching signal that preserves aspects of useful credit which layerwise cosine does not measure. We state this as a mechanism hypothesis rather than a theorem because we have measured the angle-to-accuracy gap but not the full functional-credit content; the reporting rule that follows is robust to either interpretation. This cross-method dissociation strengthens the methodological point that alignment must be reported jointly with measurement validity and a depth-utilization baseline rather than as a single headline number.
\section{Intervention and Cross-Architecture Evidence}
\label{sec:validation}
@@ -125,7 +125,7 @@ Fresh-$B$ null control & $\overline{\cos}_{deep}{=}+0.002{\pm}0.022$ ($n{=}20$ d
Once the reference vector is meaningful again, the deep layers no longer sit exactly at null. At $\lambda{=}10^{-2}$, penalized DFA reaches a three-seed deep-layer mean cosine of $+0.151 \pm 0.025$ and deep perturbation correlation of $+0.080 \pm 0.011$, whereas vanilla DFA is essentially zero on both metrics in the deep blocks, consistent with prior concerns that alternative feedback can fail by supplying poor credit directions even before full collapse \citep{bartunov2018assessing,moskovitz2018feedback,crafton2019backpropagation,refinetti2023aligning}. The null calibration rules out the interpretation that this recovered signal is merely measurement noise: on the same penalized checkpoint, replacing the training-time feedback matrices with 20 fresh random $B_l$ draws gives a deep cosine of only $+0.002 \pm 0.022$, with per-layer standard deviations of $0.013$--$0.023$, all within noise of zero (Table~\ref{tab:mode_validation}). The $\lambda$ sweep sharpens the dissociation further: at $\lambda{=}10^{-4}$, Mode~1 is already alleviated, with three-seed mean $\|h_L\|{\approx}2.2\times 10^4$ and $\|g_L\|{\approx}7.0\times 10^{-7}$, but the three-seed deep cosine remains $-0.020$, while $\lambda{=}10^{-2}$ delivers the $+0.151$ and $+0.080$ above (Figure~\ref{fig:penalty_rescue}). The improvement is real, but it is only partial.
-A rescue intervention is only informative if its direct cost is controlled. The relevant control is BP trained under the same penalty for the same matched $30$-epoch budget: across three seeds, BP falls from $0.585 \pm 0.001$ without the penalty to $0.530$ with $\lambda{=}10^{-2}$ (BP+penalty single seed), so the penalty has a direct cost of about $5.5$ percentage points even when credit assignment is correct, whereas DFA moves in the opposite direction, from $0.301 \pm 0.005$ to $0.360 \pm 0.001$, and State Bridge moves further still, from $0.213$ to $0.453 \pm 0.003$, all under the same $30$-epoch intervention (Figure~\ref{fig:penalty_rescue}; Appendix~\ref{app:sb_penalty}). Relative to the frozen-blocks baseline of $0.349$, BP+penalty retains a margin of $+18.1$ points, State Bridge+penalty retains $+10.4$ points, and DFA+penalty retains only $+1.1$ points. The remaining BP-to-DFA gap of $17.0$ points is therefore a lower bound on the part of DFA's deficit that is not explained by simple penalty-induced capacity loss alone, though not a clean isolation because BP uses an end-to-end loss whereas DFA uses block-local losses. The substantially smaller BP-to-State-Bridge gap of $0.530 - 0.453 = 7.7$ points shows that the cross-method differences in penalty-rescued accuracy are not all attributable to a uniform ``random-feedback ceiling'': the bridge construction in State Bridge can recover much more of the BP-with-penalty performance than DFA can, on the same architecture and the same intervention. The residual gap after that control is what keeps Mode~2 substantively alive while letting it have method-dependent severity.
+A rescue intervention is only informative if its direct cost is controlled. The relevant control is BP trained under the same penalty for the same matched $30$-epoch budget: across three seeds, BP falls from $0.585 \pm 0.001$ without the penalty to $0.532 \pm 0.006$ with $\lambda{=}10^{-2}$, so the penalty has a direct cost of about $5.3$ percentage points even when credit assignment is correct, whereas DFA moves in the opposite direction, from $0.301 \pm 0.005$ to $0.360 \pm 0.001$, and State Bridge moves further still, from $0.213$ to $0.453 \pm 0.003$, all under the same $30$-epoch intervention (Figure~\ref{fig:penalty_rescue}; Appendix~\ref{app:sb_penalty}). Relative to the frozen-blocks baseline of $0.349$, BP+penalty retains a margin of $+18.3$ points, State Bridge+penalty retains $+10.4$ points, and DFA+penalty retains only $+1.1$ points. The remaining BP-to-DFA gap of $17.2$ points is therefore a lower bound on the part of DFA's deficit that is not explained by simple penalty-induced capacity loss alone, though not a clean isolation because BP uses an end-to-end loss whereas DFA uses block-local losses. The substantially smaller BP-to-State-Bridge gap of $0.530 - 0.453 = 7.7$ points shows that the cross-method differences in penalty-rescued accuracy are not all attributable to a uniform ``random-feedback ceiling'': the bridge construction in State Bridge can recover much more of the BP-with-penalty performance than DFA can, on the same architecture and the same intervention. The residual gap after that control is what keeps Mode~2 substantively alive while letting it have method-dependent severity.
The architecture comparison sharpens the scope of the critique. In the terminal-LN architectures we audited, both diagnostics fire for DFA-trained ResMLP at $d{=}256$, the same pattern recurs at $d{=}512$ with even larger max-per-block growth (DFA three-seed mean about $7\times 10^3$ vs $\sim\!1.9\times 10^3$ at $d{=}256$), and ViT-Mini with a class token and terminal LN shows diagnostic~(a) by epoch~1 and diagnostic~(b) by epochs~2--3 (Figure~\ref{fig:temporal_cross_arch}). A depth sweep on the $d{=}512$ ResMLP at $L \in \{2,4,6,8,12\}$ shows that the layerwise pattern is essentially depth-invariant: DFA's layer-0 cosine stays in $[+0.38,+0.40]$ across all five depths, while its mean deep-layer cosine stays within $[-0.005,+0.000]$ and its deep perturbation correlation collapses to $0.000$ in every depth tested, even though BP retains a deep-layer cosine of $+0.94$ at $L{=}12$ (Appendix~\ref{app:depth_scan}). The deep credit signal does not improve when the network is shallower, so the failure is not a "too deep" artifact. In the non-terminal-LN controls, the pattern is different: the no-terminal-LN ResMLP-d256 ablation shows diagnostic~(a) firing across three seeds at epochs $\{18, 14, 25\}$ but diagnostic~(b) never fires across $100$ epochs and the same three seeds, and the BatchNorm CNN on CIFAR-10 likewise shows strong growth under DFA, with max-per-block growth up to $237\times$, but keeps deepest BP gradients around $\|g\| \sim 10^{-3}$ and never triggers diagnostic~(b) (Figure~\ref{fig:temporal_cross_arch}). BP never triggers either diagnostic in any audited architecture. The matched same-backbone ResMLP-d256 ablation in Section~\ref{sec:mode1} supplies the cleanest causal control: removing terminal LayerNorm from the same architecture preserves activation growth but eliminates the gradient floor, so diagnostic~(b) is necessary on terminal-LN ResMLP and is not just an architecture-class coincidence. 
The broader claim therefore holds at full strength inside the audited residual ResMLP and ViT-Mini regime, while diagnostic~(a) remains useful more broadly. This lets the paper end with a reporting rule rather than an overclaimed theory.
diff --git a/results/bp_with_pen_multiseed.log b/results/bp_with_pen_multiseed.log
new file mode 100644
index 0000000..1dd203e
--- /dev/null
+++ b/results/bp_with_pen_multiseed.log
@@ -0,0 +1,39 @@
+=== BP+pen 30ep s=123 ===
+BP + ‖f‖² penalty: seed=123, lam=0.01, epochs=30
+ ep 1: test_acc=0.3593
+ ep 5: test_acc=0.4104
+ ep 10: test_acc=0.4630
+ ep 15: test_acc=0.4839
+ ep 20: test_acc=0.5025
+ ep 25: test_acc=0.5194
+ ep 30: test_acc=0.5262
+
+FINAL test acc: 0.5262
+Compare to:
+ BP-trainable (3-seed mean): 0.609
+ Penalized DFA lam=1e-2: 0.363
+ DFA-shallow: 0.349
+
+Margin vs DFA-shallow baseline: +17.72 pp
+ → BP+penalty intermediate; partial capacity loss + residual mode 2
+Saved results/bp_with_penalty/bp_pen_lam0.01_s123.json
+=== BP+pen 30ep s=456 ===
+BP + ‖f‖² penalty: seed=456, lam=0.01, epochs=30
+ ep 1: test_acc=0.3652
+ ep 5: test_acc=0.4027
+ ep 10: test_acc=0.4639
+ ep 15: test_acc=0.4963
+ ep 20: test_acc=0.5189
+ ep 25: test_acc=0.5345
+ ep 30: test_acc=0.5397
+
+FINAL test acc: 0.5397
+Compare to:
+ BP-trainable (3-seed mean): 0.609
+ Penalized DFA lam=1e-2: 0.363
+ DFA-shallow: 0.349
+
+Margin vs DFA-shallow baseline: +19.07 pp
+ → BP+penalty intermediate; partial capacity loss + residual mode 2
+Saved results/bp_with_penalty/bp_pen_lam0.01_s456.json
+=== ALL DONE ===
diff --git a/results/bp_with_penalty/bp_pen_lam0.01_s123.json b/results/bp_with_penalty/bp_pen_lam0.01_s123.json
new file mode 100644
index 0000000..6d2e963
--- /dev/null
+++ b/results/bp_with_penalty/bp_pen_lam0.01_s123.json
@@ -0,0 +1,42 @@
+{
+ "config": {
+ "seed": 123,
+ "epochs": 30,
+ "lr": 0.001,
+ "wd": 0.01,
+ "lam": 0.01,
+ "output_dir": "results/bp_with_penalty"
+ },
+ "final_acc": 0.5262,
+ "log": [
+ {
+ "epoch": 1,
+ "test_acc": 0.3593
+ },
+ {
+ "epoch": 5,
+ "test_acc": 0.4104
+ },
+ {
+ "epoch": 10,
+ "test_acc": 0.463
+ },
+ {
+ "epoch": 15,
+ "test_acc": 0.4839
+ },
+ {
+ "epoch": 20,
+ "test_acc": 0.5025
+ },
+ {
+ "epoch": 25,
+ "test_acc": 0.5194
+ },
+ {
+ "epoch": 30,
+ "test_acc": 0.5262
+ }
+ ],
+ "margin_pp": 17.720000000000002
+}
\ No newline at end of file
diff --git a/results/bp_with_penalty/bp_pen_lam0.01_s456.json b/results/bp_with_penalty/bp_pen_lam0.01_s456.json
new file mode 100644
index 0000000..225c109
--- /dev/null
+++ b/results/bp_with_penalty/bp_pen_lam0.01_s456.json
@@ -0,0 +1,42 @@
+{
+ "config": {
+ "seed": 456,
+ "epochs": 30,
+ "lr": 0.001,
+ "wd": 0.01,
+ "lam": 0.01,
+ "output_dir": "results/bp_with_penalty"
+ },
+ "final_acc": 0.5397,
+ "log": [
+ {
+ "epoch": 1,
+ "test_acc": 0.3652
+ },
+ {
+ "epoch": 5,
+ "test_acc": 0.4027
+ },
+ {
+ "epoch": 10,
+ "test_acc": 0.4639
+ },
+ {
+ "epoch": 15,
+ "test_acc": 0.4963
+ },
+ {
+ "epoch": 20,
+ "test_acc": 0.5189
+ },
+ {
+ "epoch": 25,
+ "test_acc": 0.5345
+ },
+ {
+ "epoch": 30,
+ "test_acc": 0.5397
+ }
+ ],
+ "margin_pp": 19.069999999999997
+}
\ No newline at end of file
diff --git a/results/bp_with_penalty_3seed_summary.json b/results/bp_with_penalty_3seed_summary.json
new file mode 100644
index 0000000..9760faa
--- /dev/null
+++ b/results/bp_with_penalty_3seed_summary.json
@@ -0,0 +1,31 @@
+{
+ "description": "BP+pen 30ep multi-seed result (matched control for \u00a75 \u00b63)",
+ "config": {
+ "arch": "4-block d=256 pre-LN ResMLP",
+ "dataset": "CIFAR-10",
+ "epochs": 30,
+ "batch": 128,
+ "optimizer": "AdamW lr=1e-3 wd=0.01",
+ "schedule": "cosine",
+ "penalty_lam": 0.01
+ },
+ "per_seed_acc": {
+ "42": 0.5303,
+ "123": 0.5262,
+ "456": 0.5397
+ },
+ "mean_acc": 0.5321,
+ "std_ddof0": 0.0057,
+ "std_ddof1": 0.0069,
+ "frozen_baseline": 0.349,
+ "margin_pp_per_seed": {
+ "42": 18.13,
+ "123": 17.72,
+ "456": 19.07
+ },
+ "margin_pp_mean": 18.31,
+ "source_files": [
+ "results/bp_with_penalty/bp_pen_lam0.01_s{42,123,456}.json"
+ ],
+ "paper_section": "used in \u00a75 \u00b63 (was previously single-seed s42)"
+}
\ No newline at end of file
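
The percentage-point quantities in the updated §5 ¶3 text follow from simple differences of the reported mean accuracies. The sketch below recomputes them as an arithmetic check; the variable and function names are hypothetical, and every input is a mean accuracy quoted in the main.tex hunk (BP 0.585, BP+penalty 0.532, DFA+penalty 0.360, State Bridge+penalty 0.453, frozen baseline 0.349).

```python
# Mean test accuracies as quoted in the updated paragraph of paper/main.tex.
BP_NO_PEN = 0.585
BP_PEN = 0.532
DFA_PEN = 0.360
SB_PEN = 0.453
FROZEN = 0.349


def pp(a, b):
    """Hypothetical helper: difference a - b in percentage points, 1 decimal."""
    return round((a - b) * 100, 1)


derived = {
    "bp_penalty_cost_pp": pp(BP_NO_PEN, BP_PEN),  # direct cost of the penalty under BP
    "bp_margin_pp": pp(BP_PEN, FROZEN),           # BP+penalty over frozen baseline
    "sb_margin_pp": pp(SB_PEN, FROZEN),           # State Bridge+penalty over frozen
    "dfa_margin_pp": pp(DFA_PEN, FROZEN),         # DFA+penalty over frozen
    "bp_to_dfa_gap_pp": pp(BP_PEN, DFA_PEN),
    "bp_to_sb_gap_pp": pp(BP_PEN, SB_PEN),
}

for name, value in derived.items():
    print(f"{name}: {value:+.1f}")
```

With the 3-seed mean of 0.532, the BP-to-State-Bridge difference comes out at about 7.9 pp, whereas the sentence left unchanged in this hunk still subtracts from the earlier single-seed 0.530 to get 7.7 pp.
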