faeval.git/results/protocol_audit, branch master

Add penalty λ sweep figure: shows λ-dependence of (d) verdict

2026-04-08T05:03:49+00:00

3-panel figure: vanilla DFA + penalty at λ=1e-3 (green) + penalty at
λ=1e-2 (blue):
  (a) ‖h_L‖: vanilla 4e8, both penalties ~4e4 (similar)
  (b) ‖g_2‖: vanilla 5e-10, penalties 7e-7 to 1e-6 (above floor)
  (c) acc: vanilla 0.31, λ=1e-2 0.36, λ=1e-3 0.37; horizontal lines
       at DFA-shallow 0.349 and 2pp threshold 0.371

Visual: at λ=1e-3 the test acc curve crosses ABOVE the 2pp threshold
line; at λ=1e-2 it stays below. This is the (d) lambda-dependence
finding from the round 18 follow-up.

Add CNN third-architecture audit: BN, no terminal LN

2026-04-08T04:42:06+00:00

5 methods × 3 seeds on the SmallCNN (3 conv + BN + 1 FC + head, no
terminal LN) using existing checkpoints in results/cnn_baseline/.

Key findings:

  BP CNN:           0.866 acc, max/block 1.3, trustworthy
  State Bridge CNN: 0.633 acc, max/block 2.4, trustworthy
  EP CNN:           0.512 acc, max/block 12, trustworthy
  DFA CNN:          0.566 acc, max/block 237, walked back via (a)
  Credit Bridge CNN: 0.325 acc, max/block 96, walked back via (a)

CRITICAL: diagnostic (b) ||g_L|| floor NEVER fires on CNN for any method.
The deepest BP grad is at ~1e-5 to 6e-1, all well above the 1e-7 floor.

This is the cleanest confirmation that terminal LayerNorm is the
structural cause of the catastrophic gradient collapse in (b). Without
out_ln, the BP grad does NOT collapse to the floor, even on DFA. The
scale pathology (a) still appears on DFA and CB, but the gradient
collapse pathology (b) is specific to terminal-LN architectures.

DFA CNN's accuracy (56.6%) is much higher than DFA ResMLP (30.8%) or
DFA ViT (23.7%) — partially because the scale pathology is less
catastrophic without the LN-driven gradient cancellation amplifying
it. This is the cross-architecture mechanism story made concrete.

Add §4 penalty rescue figure: visual two-failure-modes story

2026-04-08T04:32:44+00:00

3-panel side-by-side showing per-epoch trajectories of vanilla DFA vs
DFA + lambda*||f||^2 penalty:

  (a) ||h_L||:    vanilla 4e8 vs penalty 4e4 (4 OOM rescue)
  (b) ||g_L||:    vanilla 5e-10 vs penalty ~1e-6 (4 OOM rescue)
  (d) test acc:   vanilla 0.31 vs penalty 0.36 vs frozen baseline 0.349 vs BP 0.61

The visual story: (a) and (b) show the penalty pulling the diagnostics
back into the healthy regime, but (d) shows the rescue translates to
only +1 pp above the DFA-shallow baseline and 24 pp below BP-trainable.
The two failure modes (scale + direction) are visually separable: scale
is fixed, direction is not.

Together with figure_audit_5method.png and figure_cross_arch_temporal_s42.png,
this is the third paper-ready figure for §3-§4.

Add §2/§3 hero figure: 5-method audit horizontal bar chart

2026-04-08T04:29:13+00:00

4-panel layout (one per diagnostic), 5 methods sorted bottom-to-top by
ascending accuracy, color-coded healthy (BP/EP, blue) vs degenerate
(DFA/SB/CB, red), with threshold lines drawn:

  (a) max per-block growth (log scale, threshold 50x)
  (b) ||g_L|| (log scale, floor 1e-7)
  (c) cross-batch stability (linear, ceiling 0.30)
  (d) headline acc (linear, frozen baseline 0.349)

The visual layout makes it immediately obvious that:
  - (a) and (b) cleanly split healthy from degenerate (4-7 OOM gap)
  - (c) is bimodal and doesn't cleanly split — confirms it's a sub-mode
    discriminator, not a primary detector
  - (d) shows BP above the frozen baseline by ~25 pp while DFA/CB/SB
    are at or below it

Add d=512 ResMLP audit table (3 seeds): cross-width validation

2026-04-08T04:26:32+00:00

Same protocol applied to the 4-block d=512 ResMLP variant (vs the d=256
default). 4 methods × 3 seeds = 12 conditions:

  BP @ d=512:           trustworthy on all 3 seeds (acc 0.60-0.61)
  DFA @ d=512:          walked back on all 3 seeds via (a)+(b)
  State Bridge @ d=512: walked back on all 3 seeds via (a)+(b), with
                         drift sub-mode on s123 (stability 0.879)
  Credit Bridge @ d=512: walked back on all 3 seeds via (a)+(b)

Width effect: max-per-block growth is HIGHER at d=512 (6e3-7e4) than at
d=256 (~1e3). Larger width amplifies the explosion. The protocol
verdicts are robust to this — same binary outcome, more extreme
quantitative numbers.

This is the cross-width validation: the protocol's findings are not
d=256-specific. The §3 audit results generalize across the width
dimension.

Add §3 cross-architecture temporal evolution figure

2026-04-08T04:14:02+00:00

3-column 3-row plot:
  rows: ||h_L||, ||g_L||, test accuracy
  cols: ResMLP (with LN) | ViT-Mini (cls + LN) | StudentNet (no LN)

BP and DFA trajectories overlaid. Floor threshold drawn on the ||g_L||
row. Visualizes the cross-architecture causal control: with-LN
architectures both show ||g_L|| collapse below 1e-7 (DFA hits the floor
within 5 epochs); without-LN architecture shows ||g_L|| stays in the
healthy regime even though ||h_L|| still grows (catastrophic vs mild).

Cross-architecture temporal validation: 3 archs x 3 seeds x 2 methods

2026-04-08T04:03:05+00:00

ResMLP (4-block d=256, with out_ln, CIFAR-10):
  s42:  DFA (a) ep 8,  (b) ep 4,  acc 0.308
  s123: DFA (a) ep 11, (b) ep 4,  acc 0.320
  s456: DFA (a) ep 8,  (b) ep 3,  acc 0.300

ViT-Mini (4-block d=128, cls token + terminal LN, CIFAR-10):
  s42:  DFA (a) ep 1,  (b) ep 3,  acc 0.256
  s123: DFA (a) ep 1,  (b) ep 2,  acc 0.202
  s456: DFA (a) ep 1,  (b) ep 3,  acc 0.253

StudentNet (4-block d=128, NO terminal LN, synthetic alpha=1.0):
  s42:  DFA (a) ep 18, (b) NEVER, acc 0.332
  s123: DFA (a) ep 14, (b) NEVER, acc 0.314
  s456: DFA (a) ep 25, (b) NEVER, acc 0.336

BP: never fires on any seed x any architecture (9/9 sanity passes).

Key cross-architecture finding: diagnostic (b) is specifically the LN-
driven failure mode. Without out_ln, the BP grad never crosses the 1e-7
floor, even though (a) still fires (the residual stream still grows, just
without the LN-cancellation pathology that drives the BP grad to the
floor). This is the causal architectural control: (b) specifically tests
'is terminal-LN gradient cancellation active?' and (a) tests 'is the
residual stream growing without bound?'. They are linked but separable.

This is the §3 cross-architecture validation evidence.

Protocol diagnostic (a): use max per-block growth, not max/min ratio

2026-04-08T04:00:54+00:00

Old metric: max(||h||) / max(||h_0||, eps). False-positives on ViT-style
architectures because the cls token at layer 0 (right after patch_embed)
has anomalously small magnitude (~0.3-1.5), inflating the ratio even on
healthy BP-trained ViTs.

New metric: max_l(||h_{l+1}|| / ||h_l||) — the largest single-block
residual amplification. Architecture-invariant.

Calibration:
  - BP-trained, late training: <5x per block
  - BP ViT, early epochs (cls token resolving): 13-25x max
  - DFA-trained ResMLP/ViT: 100-4000x per block
Threshold raised from 10 to 50 to sit cleanly between healthy-early-
training (max 25) and failure-regime (min 100).

Re-verifications:
  - smoke test (BP/DFA/EP): all 3 verdicts unchanged
  - random init (3 seeds): trustworthy on all 3
  - 5-method audit table single-seed: identical verdicts
  - decision-utility ablation: identical (still 0/5 by S1, 3/5 by S_full)
  - temporal evolution 3-seed: (b) now fires first at ep 3-4, (a) at ep
    8-11. Both well before training ends. The 'protocol fires ~92 epochs
    early' story still holds.
  - ViT temporal evolution: BP no longer false-fires; DFA fires (a) ep 1,
    (b) ep 3 — protocol works on the second architecture.

Temporal evolution 3-seed: protocol fires at DFA epoch 3-4 on all seeds

2026-04-08T03:51:27+00:00

  s42: (a)+(b) fire at epoch 4, DFA final acc 0.3076
  s123: (a)+(b) fire at epoch 4, DFA final acc 0.3203
  s456: (a)+(b) fire at epoch 3, DFA final acc 0.2998

BP never fires on any seed (final acc 0.61-0.63).

The 'protocol catches it 96 epochs early' finding is fully reproducible
across seeds.

Add temporal diagnostic evolution: protocol fires at epoch 4 of DFA

2026-04-08T03:49:53+00:00

Replays per-epoch logged data from results/snapshot_evolution_v2/ through
the protocol thresholds.

Result: diagnostics (a) ||h_l|| explosion AND (b) ||g_L|| at floor BOTH
first fire at epoch 4 of DFA training. At that point, DFA test acc is
0.308 — its final value at epoch 100 is also 0.308. The protocol could
have walked back the headline 96 epochs before training finished.

DFA's gamma hovers at 0.087-0.107 for all 100 epochs. A reviewer looking
at acc+gamma would conclude 'DFA is hovering at 31% acc with ~0.10
alignment, both reasonable'. Wrong on both counts.

BP never fires any diagnostic at any epoch. Stays bounded at ||h_L||~200,
||g_L||~3-5e-5, accuracy climbs to 0.61.

This is the temporal validation of decision utility: the protocol catches
the pathology AS IT HAPPENS, not just retrospectively.