faeval.git/protocol/examples, branch master

Sync experiment+protocol scripts with v2.32 corrected control values

2026-04-09T00:24:06+00:00

The pre-v2.31 unsourced values BP=0.609 and DFA=0.308 (which v2.31 fixed
to 0.585 and 0.301 via matched 30-ep controls) were also hardcoded as
"compare to" comments in 5 helper scripts:

  experiments/bp_with_penalty_control.py
  experiments/dfa_residual_penalty_test.py
  experiments/resmlp_frozen_blocks_baseline.py
  protocol/examples/threshold_d_sensitivity.py
  protocol/examples/plot_penalty_rescue.py

These are non-paper-input scripts (their output goes to stdout, not to
the paper), so the stale values didn't cause numerical errors in the
paper itself. But the unsourced BP+pen=0.609 bug fixed in v2.31 came
from exactly this kind of hardcoded "for-comparison" comment that was
never measured. Update the comments now to remove the same trap from
future runs.

Each script now references the matched 30-ep 3-seed values from
results/bp_no_penalty_30ep, results/dfa_no_penalty_30ep,
results/dfa_pen_short, and results/bp_with_penalty.

protocol/EVIDENCE_SUMMARY.md and PAPER_OUTLINE.md still have stale
numbers — these are project scratch documents and not user-facing.
Deferred to a separate sweep if needed.

Co-Authored-By: Claude Opus 4.6 (1M context)

Add penalty lambda 3-seed summary script + checkpoint save in penalty test

2026-04-08T05:07:39+00:00

- New script: protocol/examples/penalty_lam_3seed_summary.py
  Loads existing penalty JSON files for lam=1e-3 and lam=1e-2 across
  seeds, computes 3-seed mean margin vs DFA-shallow baseline, and
  explicitly checks the (d) verdict at 2pp threshold per seed and
  in aggregate. Reports MIXED if seeds disagree.

  Current result: lam=1e-2 has 3 seeds (margin +1.38 ± 0.05 pp, all
  FIRE), lam=1e-3 has 1 seed (+2.31 pp, PASSES). Awaiting s123/s456
  for lam=1e-3.

- experiments/dfa_residual_penalty_test.py: now saves model checkpoint
  + Bs alongside JSON log so post-hoc protocol can be applied without
  re-running. Closes the pitfall #6.5 self-disclosure (auxiliary nets
  must be saved for post-hoc Gamma to be reconstructible).
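
A minimal sketch of the per-seed and aggregate verdict logic described
for the summary script. It is not the code in
penalty_lam_3seed_summary.py: the function names, the seed labels, the
tie rule at exactly 2 pp, and the margins below are all assumptions
made for illustration.

```python
from statistics import mean

THRESHOLD_PP = 2.0  # default (d) threshold named in the commit message


def d_verdict(margin_pp, threshold_pp=THRESHOLD_PP):
    # Assumption: a margin exactly at the threshold counts as PASS.
    return "PASS" if margin_pp >= threshold_pp else "FIRE"


def summarize(margins_pp, threshold_pp=THRESHOLD_PP):
    """Per-seed and aggregate (d) verdicts for one lambda setting."""
    per_seed = {seed: d_verdict(m, threshold_pp) for seed, m in margins_pp.items()}
    distinct = set(per_seed.values())
    # Report MIXED whenever the seeds do not all agree.
    aggregate = distinct.pop() if len(distinct) == 1 else "MIXED"
    return {
        "mean_margin_pp": mean(margins_pp.values()),
        "per_seed": per_seed,
        "aggregate": aggregate,
    }


# Illustrative margins only (in pp); not the values stored in the JSON logs.
agree = summarize({"seed_a": 1.3, "seed_b": 1.4, "seed_c": 1.5})
split = summarize({"seed_a": 2.3, "seed_b": 1.9, "seed_c": 2.1})
print(agree["aggregate"], split["aggregate"])
```

Keeping the per-seed verdicts next to the aggregate means a MIXED
result can be traced back to the seed that disagrees.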

Add penalty λ sweep figure: shows λ-dependence of (d) verdict

2026-04-08T05:03:49+00:00

3-panel figure: vanilla DFA + penalty at λ=1e-3 (green) + penalty at
λ=1e-2 (blue):
  (a) ‖h_L‖: vanilla 4e8, both penalties ~4e4 (similar)
  (b) ‖g_2‖: vanilla 5e-10, penalties 7e-7 to 1e-6 (above floor)
  (c) acc: vanilla 0.31, λ=1e-2 0.36, λ=1e-3 0.37; horizontal lines
       at DFA-shallow 0.349 and 2pp threshold 0.371

Visual: at λ=1e-3 the test acc curve crosses ABOVE the 2pp threshold
line; at λ=1e-2 it stays below. This is the (d) lambda-dependence
finding from the round 18 follow-up.

Add (d) frozen-baseline threshold sensitivity — IMPORTANT new finding

2026-04-08T04:59:35+00:00

Critical observation: at lambda=1e-3 (single seed), penalized DFA margin
above shallow baseline is +2.3 pp — which PASSES (d) at the 2 pp default
threshold. At lambda=1e-2 (3 seeds), the margin is +1.4 pp — FIRES (d)
at 2 pp.

So the (d) verdict on penalized DFA depends on BOTH the lambda choice
AND the threshold choice. This is a significantly weaker claim than
'two failure modes are separable via (d)'.
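
The joint dependence can be made concrete with a small threshold sweep.
This is a sketch under stated assumptions: the two margins are the
rounded values quoted above, while the 1 pp and 3 pp thresholds and the
helper names are hypothetical and only there to show the verdict
flipping.

```python
# Rounded margins (pp above the shallow baseline) quoted in this message.
MARGINS_PP = {"lam=1e-3": 2.3, "lam=1e-2": 1.4}


def d_fires(margin_pp, threshold_pp):
    # Assumption: (d) fires when the margin falls short of the threshold.
    return margin_pp < threshold_pp


def sweep(margins_pp, thresholds_pp):
    """Map each threshold to the (d) verdict for every lambda setting."""
    return {
        t: {name: ("FIRE" if d_fires(m, t) else "PASS") for name, m in margins_pp.items()}
        for t in thresholds_pp
    }


# 2.0 is the default threshold; 1.0 and 3.0 are hypothetical alternatives.
table = sweep(MARGINS_PP, [1.0, 2.0, 3.0])
for threshold, row in table.items():
    print(f"threshold {threshold:.1f} pp: {row}")
```

Only the default 2 pp threshold separates the two lambda settings in
this sketch; at 1 pp both pass and at 3 pp both fire.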

The honest framing per round 18 lesson: there is a real tradeoff between
penalty strength and depth utilization. Weaker penalty preserves more
depth contribution but also more scale pathology. Stronger penalty kills
depth contribution. The protocol surfaces this tradeoff but doesn't
establish the second failure mode by itself.

Compared to (a) 63x and (b) 24338x separation gaps, (d) is the LEAST
robust diagnostic and the most sensitive to threshold choice. Need to
flag this prominently in the paper.

Add CNN third-architecture audit: BN, no terminal LN

2026-04-08T04:42:06+00:00

5 methods × 3 seeds on the SmallCNN (3 conv + BN + 1 FC + head, no
terminal LN) using existing checkpoints in results/cnn_baseline/.

Key findings:

  BP CNN:            0.866 acc, max/block 1.3, trustworthy
  State Bridge CNN:  0.633 acc, max/block 2.4, trustworthy
  EP CNN:            0.512 acc, max/block 12,  trustworthy
  DFA CNN:           0.566 acc, max/block 237, walked back via (a)
  Credit Bridge CNN: 0.325 acc, max/block 96,  walked back via (a)

CRITICAL: diagnostic (b) ||g_L|| floor NEVER fires on CNN for any method.
The deepest BP grad is at ~1e-5 to 6e-1, all well above the 1e-7 floor.

This is the cleanest confirmation that terminal LayerNorm is the
structural cause of the catastrophic gradient collapse in (b). Without
out_ln, the BP grad does NOT collapse to the floor, even on DFA. The
scale pathology (a) still appears on DFA and CB, but the gradient
collapse pathology (b) is specific to terminal-LN architectures.

DFA CNN's accuracy (56.6%) is much higher than DFA ResMLP (30.8%) or
DFA ViT (23.7%) — partially because the scale pathology is less
catastrophic without the LN-driven gradient cancellation amplifying
it. This is the cross-architecture mechanism story made concrete.

Add minimal worked example: end-to-end protocol usage tutorial

2026-04-08T04:33:49+00:00

5-epoch DFA training on CIFAR-10 + apply protocol + interpret verdict.
Self-contained, runs on CPU in <2 minutes. Demonstrates the API a future
paper author would use:

  1. train your model (any FA-style method)
  2. build eval_batches from your test loader
  3. call diagnose(model, eval_batches, headline_acc, frozen_baseline_acc)
  4. read report.verdict; walk back if 'needs walk-back'

Not run during this session to avoid GPU contention with the in-flight
direction-quality and ViT/ResNet experiments.
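
The call pattern in steps 2-4 can be sketched as follows. The diagnose
function here is a local stand-in, not the protocol's implementation:
it keeps the argument order listed above but checks only the
frozen-baseline margin, and the Report fields other than verdict, the
threshold_pp parameter, and the placeholder model and loader are all
assumptions.

```python
from dataclasses import dataclass


@dataclass
class Report:
    verdict: str      # 'trustworthy' or 'needs walk-back'
    margin_pp: float  # hypothetical extra field, for illustration


def diagnose(model, eval_batches, headline_acc, frozen_baseline_acc, threshold_pp=2.0):
    """Stand-in: the real protocol also inspects the model on eval_batches."""
    margin_pp = 100.0 * (headline_acc - frozen_baseline_acc)
    verdict = "trustworthy" if margin_pp >= threshold_pp else "needs walk-back"
    return Report(verdict=verdict, margin_pp=margin_pp)


# Step 1 placeholder: a trained model would go here.
model = None
# Step 2: take a few (inputs, labels) batches from the test loader.
test_loader = [([0.0, 0.0], 1), ([1.0, 1.0], 0), ([0.5, 0.5], 1)]
eval_batches = list(test_loader)[:2]
# Step 3: accuracies quoted elsewhere in this log (vanilla DFA, DFA-shallow).
report = diagnose(model, eval_batches, headline_acc=0.31, frozen_baseline_acc=0.349)
# Step 4: read the verdict and walk back the headline claim if needed.
if report.verdict == "needs walk-back":
    print(f"walk back: margin {report.margin_pp:+.1f} pp vs frozen baseline")
```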

Add §4 penalty rescue figure: visual two-failure-modes story

2026-04-08T04:32:44+00:00

3-panel side-by-side showing per-epoch trajectories of vanilla DFA vs
DFA + lambda*||f||^2 penalty:

  (a) ||h_L||:    vanilla 4e8 vs penalty 4e4 (4 OOM rescue)
  (b) ||g_L||:    vanilla 5e-10 vs penalty ~1e-6 (~3 OOM rescue)
  (d) test acc:   vanilla 0.31 vs penalty 0.36 vs frozen baseline 0.349 vs BP 0.61

The visual story: (a) and (b) show the penalty pulling the diagnostics
back into the healthy regime, but (d) shows the rescue translates to
only +1 pp above the DFA-shallow baseline and 24 pp below BP-trainable.
The two failure modes (scale + direction) are visually separable: scale
is fixed, direction is not.

Together with figure_audit_5method.png and figure_cross_arch_temporal_s42.png,
this is the third paper-ready figure for §3-§4.

Add §2/§3 hero figure: 5-method audit horizontal bar chart

2026-04-08T04:29:13+00:00

4-panel layout (one per diagnostic), 5 methods sorted bottom-to-top by
ascending accuracy, color-coded healthy (BP/EP, blue) vs degenerate
(DFA/SB/CB, red), with threshold lines drawn:

  (a) max per-block growth (log scale, threshold 50x)
  (b) ||g_L|| (log scale, floor 1e-7)
  (c) cross-batch stability (linear, ceiling 0.30)
  (d) headline acc (linear, frozen baseline 0.349)

The visual layout makes it immediately obvious that:
  - (a) and (b) cleanly split healthy from degenerate (4-7 OOM gap)
  - (c) is bimodal and doesn't cleanly split — confirms it's a sub-mode
    discriminator, not a primary detector
  - (d) shows BP above the frozen baseline by ~25 pp while DFA/CB/SB
    are at or below it
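
For reference, the four threshold checks drawn in the panels can be
sketched as below. The constants come from this log; the function name,
the direction of each comparison, and the two input rows are
assumptions chosen for illustration, not measured results.

```python
GROWTH_THRESHOLD = 50.0      # (a) max per-block growth
GRAD_FLOOR = 1e-7            # (b) ||g_L|| floor
STABILITY_CEILING = 0.30     # (c) cross-batch stability ceiling
FROZEN_BASELINE_ACC = 0.349  # (d) frozen (DFA-shallow) baseline accuracy
MARGIN_PP = 2.0              # (d) required margin over the baseline


def fired_diagnostics(max_block_growth, g_L_norm, stability, headline_acc):
    """Return the letters of the diagnostics that fire for one run."""
    fired = []
    if max_block_growth > GROWTH_THRESHOLD:
        fired.append("a")
    if g_L_norm < GRAD_FLOOR:
        fired.append("b")
    if stability > STABILITY_CEILING:
        fired.append("c")
    if 100.0 * (headline_acc - FROZEN_BASELINE_ACC) < MARGIN_PP:
        fired.append("d")
    return fired


# Illustrative inputs only: one healthy-looking run, one degenerate-looking run.
healthy = fired_diagnostics(1.3, 1e-5, 0.10, 0.61)
degenerate = fired_diagnostics(1e3, 5e-10, 0.50, 0.31)
print(healthy, degenerate)
```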

Add d=512 ResMLP audit table (3 seeds): cross-width validation

2026-04-08T04:26:32+00:00

Same protocol applied to the 4-block d=512 ResMLP variant (vs the d=256
default). 4 methods × 3 seeds = 12 conditions:

  BP @ d=512:           trustworthy on all 3 seeds (acc 0.60-0.61)
  DFA @ d=512:          walked back on all 3 seeds via (a)+(b)
  State Bridge @ d=512: walked back on all 3 seeds via (a)+(b), with
                         drift sub-mode on s123 (stability 0.879)
  Credit Bridge @ d=512: walked back on all 3 seeds via (a)+(b)

Width effect: max-per-block growth is HIGHER at d=512 (6e3-7e4) than at
d=256 (~1e3). Larger width amplifies the explosion. The protocol
verdicts are robust to this — same binary outcome, more extreme
quantitative numbers.

This is the cross-width validation: the protocol's findings are not
d=256-specific. The §3 audit results generalize across the width
dimension.

Partial protocol audit on penalized DFA: (a)+(b) pass, (d) still fires

2026-04-08T04:17:45+00:00

3-seed analysis of DFA + lambda=1e-2 ||f||^2 penalty using only the data
already in the existing penalty JSON logs (no checkpoint or full layer
norms needed):

  (a) per-block growth: avg ~8x per block (geom mean), well below 50x
      threshold. PASS likely (with small caveat that max could differ
      from mean).
  (b) BP grad floor:    g_2 = 8e-7 to 1e-6 across 3 seeds, 8-10x above
      the 1e-7 floor. PASS exact.
  (d) frozen baseline:  margin = 1.35-1.45 pp (mean 1.38) < 2 pp
      required. FIRE on all 3 seeds.
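
The caveat on (a) is worth spelling out: a geometric mean needs only
the first and last layer norms, but it can hide a single block that
exceeds the threshold. A small sketch, assuming a hypothetical list of
layer norms (the values and helper names are made up for illustration
and are not taken from the logs):

```python
def per_block_ratios(norms):
    """Growth factor of each block: norm after the block / norm before it."""
    return [after / before for before, after in zip(norms, norms[1:])]


def geom_mean_growth(norms):
    """Geometric-mean growth per block, from the end-to-end ratio alone."""
    n_blocks = len(norms) - 1
    return (norms[-1] / norms[0]) ** (1.0 / n_blocks)


# Hypothetical norms for a 4-block network: three mild blocks, one extreme.
norms = [1.0, 2.0, 4.0, 8.0, 4096.0]

mean_growth = geom_mean_growth(norms)      # 4096 ** (1/4), i.e. about 8x
max_growth = max(per_block_ratios(norms))  # 512x from the last block

print(f"geom mean {mean_growth:.1f}x, max {max_growth:.0f}x")
print("max exceeds 50x threshold:", max_growth > 50.0)
```

In this constructed case the geometric mean sits near 8x while the
maximum is far above 50x, which is why the (a) result from the logs is
only a likely PASS until the per-block norms are available.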

Aggregate partial verdict: protocol catches the SECOND failure mode
(direction quality / passive blocks) on penalized DFA even though it
PASSES the scale-related diagnostics. This is the cleanest possible
evidence that the two failure modes are separable: the penalty fixes
the scale failure but not the direction failure, and the (d)
diagnostic, the test for the second failure mode, still fires after
the penalty rescue.

This is the §4 'two failure modes' evidence that doesn't depend on the
direction-quality direct test (which is still running). The (d)
diagnostic alone shows the separation.