| Age | Commit message | Author |
|
Single-document overview of every result the protocol package has
produced so far, with reproducibility commands and the file/memory entry
where each result is recorded. Organized by paper section (§1 protocol,
§2 audit, §3 decision utility, §4 temporal validation, §5 pitfalls).
Includes the headline tables (3-seed audit, cross-architecture, penalty
sweep) ready for the paper, and an explicit status field for each
ongoing experiment.
This is a reading guide for anyone (codex, future-me, the user) who
needs to know what evidence is ready and how to reproduce it.
|
|
3-column 3-row plot:
rows: ||h_L||, ||g_L||, test accuracy
cols: ResMLP (with LN) | ViT-Mini (cls + LN) | StudentNet (no LN)
BP and DFA trajectories overlaid. Floor threshold drawn on the ||g_L||
row. Visualizes the cross-architecture causal control: both with-LN
architectures show ||g_L|| collapsing below 1e-7 (DFA hits the floor
within 5 epochs); the architecture without LN keeps ||g_L|| in the
healthy regime even though ||h_L|| still grows (catastrophic vs mild).
|
|
For each diagnostic, sweeps threshold across orders of magnitude on the
3-seed audit data and reports the verdict at each value.
Key calibration findings (3 seeds):
Diagnostic (a) max per-block growth:
Healthy max (BP/EP): 11.0
Degenerate min (DFA/SB/CB): 694
Separation gap: 63x
Default threshold 50 sits comfortably in the middle.
Any threshold in [12, 693] gives the same verdicts.
Diagnostic (b) ||g_L|| at floor:
Healthy min (BP/EP): 1.02e-4
Degenerate max (DFA/SB/CB): 4.18e-9
Separation gap: 24,338x
Default threshold 1e-7 sits comfortably in the middle.
Any threshold in [4.2e-9, 1.0e-4] gives the same verdicts.
Diagnostic (c) cross-batch stability:
NOT a clean binary discriminator across seeds. BP s456=0.114
near threshold; DFA s42=0.047 (noise sub-mode) doesn't fire;
SB s456=0.035 (noise sub-mode) doesn't fire. (c) is for sub-mode
interpretation, not binary detection.
This is the calibration evidence answering the E&D reviewer question
'why these specific thresholds?'.
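A minimal sketch of the separation check behind these numbers, using only the extremes quoted above. The helper names (`separation`, `separates`) and the strict/non-strict firing convention are assumptions, not the sweep script's actual API.

```python
def separation(healthy_max, degenerate_min):
    """Ratio between the mildest degenerate value and the worst healthy value."""
    return degenerate_min / healthy_max


def separates(threshold, healthy, degenerate, fires_above=True):
    """True if every healthy value passes and every degenerate value fires.

    Assumed convention: a diagnostic fires at or beyond the threshold.
    """
    if fires_above:
        return max(healthy) < threshold <= min(degenerate)
    return max(degenerate) < threshold <= min(healthy)


# Diagnostic (a): max per-block growth, extremes as quoted in the message.
gap_a = separation(healthy_max=11.0, degenerate_min=694.0)
print(f"(a) separation gap: {gap_a:.0f}x")

# Sweep the threshold across orders of magnitude.
for thr in (1, 12, 50, 693, 1000):
    print(thr, separates(thr, healthy=[11.0], degenerate=[694.0]))

# Diagnostic (b) fires when ||g_L|| falls BELOW the floor.
ok_b = separates(1e-7, healthy=[1.02e-4], degenerate=[4.18e-9], fires_above=False)
print("(b) default floor separates:", ok_b)
```

Only thresholds between the healthy maximum and the degenerate minimum return True, which is the sense in which the default sits in the middle.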
|
|
The existing snapshot_evolution_vit.py and vit_frozen_blocks_baseline.py
do not save model checkpoints — they only emit per-epoch JSON logs. This
makes it impossible to apply the diagnostic protocol to a trained ViT
post-hoc, since the protocol needs an actual model object.
This script trains a 4-block d=128 ViT-Mini with block-level DFA on
CIFAR-10 (same training rule as snapshot_evolution_vit.py) for 60 epochs
and saves:
- the final state_dict
- the random feedback Bs (so the protocol can also verify bug 4 on
this checkpoint)
- test_acc and config
Output: results/vit_dfa_checkpoints/dfa_vit_s{seed}.pt
|
|
All 3 verified on the real DFA s42 checkpoint:
Bug 4: the training Bs give Γ=+0.068, while 10 fresh Bs draws give
Γ=+0.0043±0.007. The 'alignment' is the network adapting to specific Bs.
Bug 5: 4 valid aggregation strategies give Γ in [-0.028, +0.074]. The
spread is 0.10 (3.45x ratio) and **the sign flips** between
strategies. Pick the wrong aggregation and DFA is anti-aligned;
pick the right one and DFA looks aligned.
Bug 6: Γ_layer0 = +0.429 dominates the mean +0.068. Hidden layers 1-4 are
all near zero or slightly negative. Mean of hidden layers only is
-0.022 (negative!). The deep blocks the paper claims to be
'training' have Γ ≈ 0 or below.
Bugs 5 and 6 are causally linked: 'median over layers' strategies pick a
negative deep layer; 'mean over layers' is dominated by the positive l0.
The catalog under-reported bug 5 (it said 2.5x, actual is 3.45x with sign
flip).
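A small sketch of how the aggregation choice can flip the sign. Only Γ_layer0 = +0.429, the overall mean (+0.068) and the hidden-only mean (-0.022) come from the message above. The four hidden-layer values are hypothetical placeholders chosen to reproduce those aggregates, and the three strategies shown are illustrative, not the four used in the audit.

```python
from statistics import mean, median

# Layer 0 is the quoted value; layers 1-4 are HYPOTHETICAL placeholders
# picked so the aggregates match the quoted +0.068 and -0.022.
gamma_per_layer = [0.429, -0.010, -0.028, -0.030, -0.021]

strategies = {
    "mean over all layers": mean(gamma_per_layer),
    "median over all layers": median(gamma_per_layer),
    "mean over hidden layers only": mean(gamma_per_layer[1:]),
}

for name, value in strategies.items():
    print(f"{name:30s} {value:+.3f}")
```

The mean is pulled positive by layer 0 alone, while the median and the hidden-only mean land on the negative deep layers.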
|
|
Demonstrates the practical use case of the protocol — not as a post-hoc
audit but as an in-training abort condition. Walks through the existing
per-epoch trace and shows when the protocol would have triggered an early
stop on DFA training and what the saved compute would be.
Result: DFA on 4-block d=256 ResMLP fires diagnostic (b) at epoch 4 with
test acc 0.3076. The final acc at epoch 100 is *also* 0.3076 (identical).
Stopping at epoch 4 saves 96% of compute with zero headline acc loss.
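A sketch of the abort condition as a replay over a per-epoch trace. The record layout and the function names are assumptions. Only the epoch-4 crossing, the 1e-7 floor and the 100-epoch budget come from this log; the trace values are made up.

```python
def first_fire_epoch(trace, floor=1e-7):
    """Return the first epoch whose logged ||g_L|| is below the floor, else None."""
    for record in trace:
        if record["g_L"] < floor:
            return record["epoch"]
    return None


def compute_saved(stop_epoch, total_epochs):
    """Fraction of the epoch budget not spent if training aborts at stop_epoch."""
    return (total_epochs - stop_epoch) / total_epochs


# HYPOTHETICAL trace; real runs would read the per-epoch JSON logs instead.
trace = [
    {"epoch": 1, "g_L": 3e-4},
    {"epoch": 2, "g_L": 2e-5},
    {"epoch": 3, "g_L": 6e-7},
    {"epoch": 4, "g_L": 8e-8},
    {"epoch": 5, "g_L": 5e-9},
]

stop = first_fire_epoch(trace)
print(f"abort at epoch {stop}, saving {compute_saved(stop, 100):.0%} of compute")
```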
|
|
ResMLP (4-block d=256, with out_ln, CIFAR-10):
s42: DFA (a) ep 8, (b) ep 4, acc 0.308
s123: DFA (a) ep 11, (b) ep 4, acc 0.320
s456: DFA (a) ep 8, (b) ep 3, acc 0.300
ViT-Mini (4-block d=128, cls token + terminal LN, CIFAR-10):
s42: DFA (a) ep 1, (b) ep 3, acc 0.256
s123: DFA (a) ep 1, (b) ep 2, acc 0.202
s456: DFA (a) ep 1, (b) ep 3, acc 0.253
StudentNet (4-block d=128, NO terminal LN, synthetic alpha=1.0):
s42: DFA (a) ep 18, (b) NEVER, acc 0.332
s123: DFA (a) ep 14, (b) NEVER, acc 0.314
s456: DFA (a) ep 25, (b) NEVER, acc 0.336
BP: never fires on any seed x any architecture (9/9 sanity passes).
Key cross-architecture finding: diagnostic (b) is specifically the LN-
driven failure mode. Without out_ln, the BP grad never crosses the 1e-7
floor, even though (a) still fires (the residual stream still grows, just
without the LN-cancellation pathology that drives the BP grad to the
floor). This is the causal architectural control: (b) specifically tests
'is terminal-LN gradient cancellation active?' and (a) tests 'is the
residual stream growing without bound?'. They are linked but separable.
This is the §3 cross-architecture validation evidence.
|
|
Old metric: max(||h||) / max(||h_0||, eps). False-positives on ViT-style
architectures because the cls token at layer 0 (right after patch_embed)
has anomalously small magnitude (~0.3-1.5), inflating the ratio even on
healthy BP-trained ViTs.
New metric: max_l(||h_{l+1}|| / ||h_l||) — the largest single-block
residual amplification. Architecture-invariant.
Calibration:
- BP-trained, late training: <5x per block
- BP ViT, early epochs (cls token resolving): 13-25x max
- DFA-trained ResMLP/ViT: 100-4000x per block
Threshold raised from 10 to 50 to sit cleanly between healthy-early-
training (max 25) and failure-regime (min 100).
Re-verifications:
- smoke test (BP/DFA/EP): all 3 verdicts unchanged
- random init (3 seeds): trustworthy on all 3
- 5-method audit table single-seed: identical verdicts
- decision-utility ablation: identical (still 0/5 by S1, 3/5 by S_full)
- temporal evolution 3-seed: (b) now fires first at ep 3-4, (a) at ep
8-11. Both well before training ends. The 'protocol fires ~92 epochs
early' story still holds.
- ViT temporal evolution: BP no longer false-fires; DFA fires (a) ep 1,
(b) ep 3 — protocol works on the second architecture.
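The two metrics side by side, as a sketch over a plain list of per-layer norms. The function names and the example norms are hypothetical; the formulas and the thresholds (old 10, new 50) are the ones stated above.

```python
def old_growth(norms, eps=1e-12):
    """Old metric: max(||h||) / max(||h_0||, eps)."""
    return max(norms) / max(norms[0], eps)


def max_block_growth(norms, eps=1e-12):
    """New metric: max over l of ||h_{l+1}|| / ||h_l||."""
    return max(nxt / max(cur, eps) for cur, nxt in zip(norms, norms[1:]))


# HYPOTHETICAL healthy ViT-style norms: small cls token at layer 0.
healthy_vit = [0.5, 6.0, 14.0, 30.0, 60.0]
# HYPOTHETICAL degenerate norms with one block amplifying by 2000x.
degenerate = [1.0, 3.0, 900.0, 2.0e5, 4.0e8]

for name, norms in (("healthy_vit", healthy_vit), ("degenerate", degenerate)):
    print(name, old_growth(norms), max_block_growth(norms))
```

On the healthy example the old ratio (120) exceeds the old threshold of 10 purely because of the small layer-0 norm, while the per-block maximum (12) stays under 50.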
|
|
Each bug from the catalog has a synthetic reproducer that runs in <1 sec
without GPU:
Bug 1: x.norm(-1) on a 2x2 tensor returns 1.143 (L_{-1} of whole tensor)
instead of [5, 10] (per-row L_2 along dim=-1).
Bug 2: F.cosine_similarity(a, b) with ||b||=5e-10 returns +0.000905
instead of the true +0.018101. The clamp (eps=1e-8) replaces the
5e-10 norm in the divisor, inflating it 20x and shrinking the
cosine by the same factor.
Bug 3: 5e-10 in fp16 -> 0 (it is below the smallest fp16 subnormal,
~6e-8). Downstream F.cosine_similarity returns NaN. bf16 works
because it shares fp32's exponent range.
Bugs 4-6 (Bs reproducibility, aggregation, layer-0 dominance) require a
trained network and are demonstrated inside audit_table and
ablation_decision_utility.
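The arithmetic behind bugs 1-3 can be checked without PyTorch. This is a stdlib-only sketch, not the catalog's reproducers. The tensor [[3, 4], [6, 8]] is an assumption (it is one 2x2 tensor that yields the quoted 1.143 and [5, 10]). The vectors for bug 2 are hypothetical and only reproduce the 20x factor, and the per-norm clamp is emulated by hand.

```python
import math
import struct

# Bug 1: x.norm(-1) passes p=-1 (an L_{-1} norm over the whole tensor),
# not dim=-1 (a per-row L2 norm).
x = [[3.0, 4.0], [6.0, 8.0]]  # ASSUMED tensor; rows have L2 norms 5 and 10
l_minus1 = sum(abs(v) ** -1 for row in x for v in row) ** -1
row_l2 = [math.sqrt(sum(v * v for v in row)) for row in x]
print(round(l_minus1, 3), row_l2)  # 1.143 [5.0, 10.0]


# Bug 2: clamping each norm at eps inflates a tiny divisor.
def cosine(a, b, eps=0.0):
    """Cosine with each norm clamped from below at eps (eps=0 means no clamp)."""
    dot = sum(p * q for p, q in zip(a, b))
    na = max(math.sqrt(sum(p * p for p in a)), eps)
    nb = max(math.sqrt(sum(q * q for q in b)), eps)
    return dot / (na * nb)


a = [1.0, 0.0]
b = [3e-10, 4e-10]  # HYPOTHETICAL vector with ||b|| = 5e-10
true_cos = cosine(a, b)
clamped_cos = cosine(a, b, eps=1e-8)
print(true_cos / clamped_cos)  # about 20: the clamp swaps 5e-10 for 1e-8

# Bug 3: 5e-10 is not representable in IEEE half precision.
as_fp16 = struct.unpack("<e", struct.pack("<e", 5e-10))[0]
print(as_fp16)  # 0.0
```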
|
|
s42: (a)+(b) fire at epoch 4, DFA final acc 0.3076
s123: (a)+(b) fire at epoch 4, DFA final acc 0.3203
s456: (a)+(b) fire at epoch 3, DFA final acc 0.2998
BP never fires on any seed (final acc 0.61-0.63).
The 'protocol catches it 96 epochs early' finding is fully reproducible
across seeds.
|
|
Replays per-epoch logged data from results/snapshot_evolution_v2/ through
the protocol thresholds.
Result: diagnostics (a) ||h_l|| explosion AND (b) ||g_L|| at floor BOTH
first fire at epoch 4 of DFA training. At that point, DFA test acc is
0.308 — its final value at epoch 100 is also 0.308. The protocol could
have walked back the headline 96 epochs before training finished.
DFA's gamma hovers at 0.087-0.107 for all 100 epochs. A reviewer looking
at acc+gamma would conclude 'DFA is hovering at 31% acc with ~0.10
alignment, both reasonable'. Wrong on both counts.
BP never fires any diagnostic at any epoch. Stays bounded at ||h_L||~200,
||g_L||~3-5e-5, accuracy climbs to 0.61.
This is the temporal validation of decision utility: the protocol catches
the pathology AS IT HAPPENS, not just retrospectively.
|
|
3-seed random init ResMLP gives chance accuracy (~10%) but the protocol
verdict is 'trustworthy' on all 3 seeds:
- residual norms ~8.7 across all layers (no growth, bounded)
- BP gradient norms ~8e-3 (healthy, well above 1e-7 floor)
- cross-batch stability 0.08-0.18 (in the BP/EP range)
This is the answer to the likely reviewer question: 'is your protocol just
flagging anything that doesn't perform well?' Answer: no. Random init is
at chance and the protocol passes it. The walked-back trained methods are
walked back because of the *measurements*, not because of the accuracy.
Notable: random init g-norms (8e-3) are actually HIGHER than BP-trained
ones (4e-4) — BP training reduces the gradient magnitude as loss decreases.
So the protocol distinguishes 3 distinct regimes: (1) untrained healthy,
(2) trained-and-still-healthy (BP/EP), (3) trained-into-pathology (DFA/SB/CB).
|
|
3 seeds × 5 methods × 4 diagnostics = 60 measurements. Key reproducibility
findings:
- BP: trustworthy on all 3 seeds (acc 0.61-0.62, h_L ~200, g_L ~3-4e-4)
- EP: trustworthy on all 3 seeds (acc 0.29-0.36, h_L 3-8e3, g_L ~1e-4)
- DFA, SB, CB: walked back on all 3 seeds × all 3 of (a)/(b)/(d)
Diagnostic (c) is bimodal across seeds — confirms the prior memory finding:
- DFA s42=0.047 (noise), s123=0.436 (drift), s456=-0.005 (noise)
- SB s42=0.992 (drift), s123=0.561 (drift), s456=0.035 (noise)
- CB s42=0.352 (drift), s123=0.250 (~edge), s456=0.518 (drift)
(c) catches different methods on different seeds. (a)/(b)/(d) catch all 3
failing methods on all 3 seeds — robust binary detection.
|
|
Builds on the 5-method audit JSON. For each method, evaluates 7 reporting
strategies (S0=acc only, S1=+Γ field standard, S2-S5=+single diagnostic,
S_full=full protocol), and emits the verdict each strategy would have
reached.
Result: 3 of 5 methods (DFA/SB/CB) are walked back by S_full but NOT by S1.
Each of (a)scale, (b)floor, (d)frozen is independently sufficient for
binary detection of those 3 failures. Diagnostic (c)stability adds
sub-mode discrimination (drift vs noise) but not new positive detections.
This is the §3 protocol decision-utility evidence.
|
|
Trains both vanilla DFA (lam=0) and penalized DFA (lam=1e-2) from the same
seed, then directly measures the per-layer cosine between DFA's local
credit signal e_T @ B_l^T and the BP gradient at hidden layers. Uses the
training Bs (not fresh ones, per the Bs-specificity finding from earlier).
The penalized run is the key measurement: in that condition the BP grad is
~10^-7 (well above the eps=1e-8 floor), so a near-zero cosine here would
be the direct evidence of the second failure mode (direction-quality
ceiling) that codex round 13 hypothesized.
Pre-registered prediction: penalized cos(DFA, BP) ~ 0.01-0.05 -> direction
quality is the second, separable failure mode. Saves the penalized
checkpoint so the diagnostic protocol can be re-applied to it (where (a)
and (b) should pass, (d) should still fail).
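For orientation, the measured quantity can be sketched on a toy linear readout in pure Python. Everything here is hypothetical: the shapes, the random draws, and the readout W standing in for the real network's backward path. It only illustrates cos(e_T @ B_l^T, BP gradient), not the training script.

```python
import math
import random


def matvec_t(e, B):
    """Compute e @ B^T for e of length C and B of shape (d, C)."""
    return [sum(ei * bi for ei, bi in zip(e, row)) for row in B]


def cosine(u, v):
    """Plain cosine with no eps clamp; both vectors must be non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


rng = random.Random(0)
C, d = 10, 64  # HYPOTHETICAL number of classes and hidden width

e_T = [rng.gauss(0, 1) for _ in range(C)]                      # output error
W_t = [[rng.gauss(0, 1) for _ in range(C)] for _ in range(d)]  # readout W^T, shape (d, C)
B_l = [[rng.gauss(0, 1) for _ in range(C)] for _ in range(d)]  # fixed random feedback

bp_grad = matvec_t(e_T, W_t)       # gradient at h for logits = W h
dfa_random = matvec_t(e_T, B_l)    # DFA credit with an unrelated B_l
dfa_aligned = matvec_t(e_T, W_t)   # limiting case B_l = W^T

print("random B_l :", cosine(dfa_random, bp_grad))
print("B_l = W^T  :", cosine(dfa_aligned, bp_grad))
```

A cosine near zero with a healthy gradient norm is the signature the pre-registered prediction describes; a cosine near one would mean the feedback direction matches BP.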
|
|
5-method audit table on 4-block d=256 ResMLP CIFAR-10 seed 42:
- BP: trustworthy (acc 0.615, h_L=2e2, g_L=4e-4, stab 0.099)
- DFA: walked back via (a)+(b)+(d) — h_L=4e8, g_L=4e-9, undercuts frozen
- State Bridge: walked back via all 4 diagnostics — stability 0.992 is the
cleanest possible drift-dominated case
- Credit Bridge: walked back via all 4 — stability 0.352, also drift mode
- EP: trustworthy (acc 0.359, h_L=3e3, g_L=2e-4, stab -0.036) — paper's
internal control case
This is the §2 audit evidence for the main-track paper. Confirms that
standard headline acc + Γ silently fails on 3 of 5 methods on this
architecture, while the 4-diagnostic protocol catches all three.
|
|
Codex round 15 #1 priority for the E&D-track paper:
- protocol/protocol.py: 4 diagnostics (residual norms, BP grad norms,
cross-batch direction stability, and a frozen-baseline comparator)
- protocol/report.py: DiagnosticReport with per-diagnostic verdicts and
pretty-printer
- protocol/smoke_test.py: validates BP/DFA/EP checkpoints produce the
expected verdicts (BP/EP trustworthy; DFA walked back via residual
explosion + BP grad at floor)
- protocol/README.md: usage, audit cases, threshold rationale
- protocol/CHECKLIST.md: 6 evaluation pipeline pitfalls (norm(-1),
cosine_similarity eps clamp, fp16 underflow, Bs reproducibility,
aggregation, layer-0 dominance)
- protocol/REPORTING_TEMPLATE.md: per-method fillable form for FA papers
|
|
methods)
BP/DFA/SB/CB: added seeds 2048,3000,4000,5000,6000 (L=4 only, all 3 alphas)
Total: 1290 rows (was 990)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
CNN CIFAR-10 (5 seeds, fixed Gamma):
BP: acc=86.8%, Gamma=0.970, rho=0.603
DFA: acc=56.7%, Gamma=0.896, rho=0.061
EP: acc=50.6%, Gamma=0.484, rho=0.450
SB: acc=63.3%, Gamma=1.000 (BP self-cos, feedback nets not saved), rho=0.601
CB: acc=31.8%, Gamma=1.000 (BP self-cos), rho=0.226
DFA Gamma=0.896 is notably high — CNN DFA credit aligns well with BP gradients.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
layers
Old code detached hidden states between layers, making layers 0-2 disconnected
from the loss (gradient = None → 0). Fixed by keeping the forward graph connected.
BP CNN Gamma per-layer now: [0.985, 0.990, 0.987, 0.967] (was [0, 0, 0, 0.967])
But gradient norms are ~1e-17 (genuine numerical precision issue with CNN architecture).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
EP synthetic: 30 JSONs + 30 checkpoints (10 seeds × 3α)
EP CIFAR persample: 6 seeds × 4 layers × 256 samples = 6144 rows added
Synth cross-state: 150 EP rows added (990 total)
cifar_persample_all.csv: 30720 rows (was 24576, +6144 EP)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
Updated synth_cross_state_distance.csv with 150 EP rows (990 total).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
EP CIFAR d_BP: L0=2.0×, L4=26.7× (much closer to BP than DFA=162×/2.5M×)
EP synthetic: no checkpoints saved (ep_synthetic.py didn't save .pt)
CNN summary: 20 rows confirmed correct
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
EP synthetic: 15 rows (3α × 5 seeds)
Synth cross-state: 840 rows (3α × 2L × 4 methods × 5 seeds × (L+1) layers)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
EP CIFAR d=256: s(1e-6)=100%, mean_norm=1.41e-04
EP produces networks where ALL samples have non-zero BP gradients,
unlike DFA (0.4%), SB (21%), CB (3%). EP is closer to BP (98.7%).
Updated clean_sparsity_summary.csv (980 rows, now includes EP).
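One reading of s(1e-6), sketched under the assumption that it is the fraction of samples whose per-sample BP gradient norm exceeds 1e-6. The function name and the example norms are hypothetical.

```python
def active_fraction(grad_norms, tau=1e-6):
    """Fraction of samples whose per-sample gradient norm exceeds tau."""
    return sum(norm > tau for norm in grad_norms) / len(grad_norms)


# HYPOTHETICAL per-sample norms: two samples above the cutoff, two below.
example_norms = [1.4e-4, 2.0e-4, 5.0e-9, 3.0e-8]
print(f"s(1e-6) = {active_fraction(example_norms):.0%}")  # s(1e-6) = 50%
```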
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
Non-BP methods produce radically different representations:
DFA L0: 162×, L4: 2.5M× relative to BP hidden norms
SB L0: 3.2×, L4: 1.1M×
CB L0: 59×, L4: 1.4M×
(BP vs itself = 0)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
EP Synthetic (fixed): Gamma=+0.13~0.20, rho=+0.25
EP CIFAR d=256: Gamma=+0.007, rho=+0.051
EP CIFAR d=512: Gamma=+0.000, rho=-0.002
EP CNN: Gamma=+0.248, rho=+0.492
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
EP d=256 (5 seeds): acc=31.9%, Gamma=+0.007 (was -0.13), rho=+0.051 (was -0.037)
Sign correction: -(h_nudge - h_free)/β aligns EP credit with BP gradient direction.
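A toy check of the sign convention, assuming a quadratic loss and a nudged phase that takes one step of size β down the loss. Both are hypothetical stand-ins for the real EP dynamics.

```python
import math


def cosine(u, v):
    """Plain cosine between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


target = [1.0, -2.0, 0.5]
h_free = [0.2, 0.1, -0.3]
beta = 0.1

# BP gradient of L(h) = 0.5 * ||h - target||^2 with respect to h.
bp_grad = [h - t for h, t in zip(h_free, target)]

# HYPOTHETICAL nudged phase: h moves toward lower loss.
h_nudge = [h - beta * g for h, g in zip(h_free, bp_grad)]

raw = [(n - f) / beta for n, f in zip(h_nudge, h_free)]  # (h_nudge - h_free)/β
corrected = [-r for r in raw]                            # -(h_nudge - h_free)/β

print("without negation:", cosine(raw, bp_grad))
print("with negation   :", cosine(corrected, bp_grad))
```

In this toy case the raw difference is exactly anti-aligned with the BP gradient and the negated one is exactly aligned, matching the direction of the fix.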
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
EP nudge moves h toward lower loss (opposite to BP grad which points toward loss increase).
Without negation, Gamma is negative and rho is -0.25.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
EP synthetic: acc high (92-96%) but Gamma negative (-0.13 to -0.20), rho=-0.25
EP credit direction may be inverted, or the diagnostics may have an issue.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
CNN CIFAR-10 (5 seeds):
BP: 86.8%±0.3%, Gamma=0.238, rho=0.250
DFA: 56.7%±2.0%, Gamma=0.216, rho=0.017
SB: 63.3%±0.5%, Gamma=0.045, rho=0.298
CB: 31.8%±6.2%, Gamma=0.013, rho=0.033
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
State bridge: per-layer StateBridgeNet predicting h3 from flattened h_l
Credit bridge: per-layer ValueNet with terminal + bridge consistency + DFA warmup
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
ReLU MLP (L=4 d=256):
BP: acc=61.1%, Gamma=1.000, rho=0.998
DFA: acc=30.7%, Gamma=0.104, rho=-0.001
SB: acc=15.5%, Gamma=0.300, rho=0.159
CB: acc=28.7%, Gamma=0.298, rho=0.007
Note: SB/CB Gamma uses BP gradient as proxy (feedback nets not checkpointed).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
DFA now uses regenerated DFA Bs for credit; SB/CB use BP as proxy (feedback nets not saved).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
EP (L=4 d=256): acc≈30%, Gamma≈0, rho≈0 — EP credit signal weak on feedforward MLP
GELU ablation (ReLU variant): 4 methods × 5 seeds complete
CNN BP+DFA: 5 seeds each, BP + DFA on SmallCNN
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
Note: existing ResidualMLP already uses GELU. This adds ResidualMLPReLU variant.
Ablation compares ReLU vs GELU for BP/DFA/SB/CB.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
BP: s(1e-6)=92.7%, norm=2.70e-04, r_inf=0.159, PR=0.300
DFA: s(1e-6)=0.1%, norm=5.31e-08
SB: s(1e-6)=20.3%, norm=2.33e-06
CB: s(1e-6)=1.2%, norm=9.88e-08
Same pattern as d=256, confirming width-independence of the sparsity gap.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
bp s=456: acc=0.5999, rho=0.9881, nse=0.4764
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
BP: 60.6%±0.3%, rho=0.989
DFA: 30.8%±0.5%, rho=0.003
State Bridge: 21.2%±3.7%, rho=0.119
Credit Bridge: 30.1%±0.5%, rho=0.002
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|