| Age | Commit message | Author |
|
|
|
not saved
Discovered in our own cnn_baseline.py: when the random feedback Bs (for
DFA) or bridge predictor (for SB/CB) are not persisted alongside the
model checkpoint, post-hoc Gamma computation cannot reconstruct the
local credit signal. Instead of erroring, the script falls back to
cos(BP_grad, BP_grad) = 1.0 and records that as Gamma. A reader who
doesn't notice the small 'Gamma_note' field will interpret 1.0 as perfect
alignment.
Recommendation: always save aux nets alongside checkpoints; if they're
missing, report Gamma as N/A, not 1.0.
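A minimal sketch of the recommended behaviour, in plain Python so it runs without a checkpoint. The names `gamma_or_na` and `format_gamma` are hypothetical and are not taken from cnn_baseline.py; the only point is that a missing credit signal yields N/A instead of a self-cosine of 1.0.

```python
import math
from typing import Optional, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Plain cosine similarity; NaN if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return float("nan")
    return dot / (nu * nv)


def gamma_or_na(bp_grad: Sequence[float],
                local_credit: Optional[Sequence[float]]) -> Optional[float]:
    """Return Gamma, or None when the local credit signal cannot be rebuilt."""
    if local_credit is None:
        # Feedback Bs / bridge predictor were not checkpointed. Do NOT fall
        # back to cos(bp_grad, bp_grad), which is 1.0 by construction.
        return None
    return cosine(bp_grad, local_credit)


def format_gamma(gamma: Optional[float]) -> str:
    """Render Gamma for a results table; None becomes 'N/A'."""
    return "N/A" if gamma is None else f"{gamma:+.3f}"


print(format_gamma(gamma_or_na([1.0, 2.0], None)))        # missing aux nets
print(format_gamma(gamma_or_na([1.0, 0.0], [1.0, 1.0])))  # credit available
```

Returning `None` rather than a float forces every downstream table or plot to handle the missing case explicitly.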
|
|
|
|
5 methods × 3 seeds on the SmallCNN (3 conv + BN + 1 FC + head, no
terminal LN) using existing checkpoints in results/cnn_baseline/.
Key findings:
BP CNN: 0.866 acc, max/block 1.3, trustworthy
State Bridge CNN: 0.633 acc, max/block 2.4, trustworthy
EP CNN: 0.512 acc, max/block 12, trustworthy
DFA CNN: 0.566 acc, max/block 237, walked back via (a)
Credit Bridge CNN: 0.325 acc, max/block 96, walked back via (a)
CRITICAL: diagnostic (b) ||g_L|| floor NEVER fires on CNN for any method.
The deepest BP grad is at ~1e-5 to 6e-1, all well above the 1e-7 floor.
This is the cleanest confirmation that terminal LayerNorm is the
structural cause of the catastrophic gradient collapse in (b). Without
out_ln, the BP grad does NOT collapse to the floor, even on DFA. The
scale pathology (a) still appears on DFA and CB, but the gradient
collapse pathology (b) is specific to terminal-LN architectures.
DFA CNN's accuracy (56.6%) is much higher than DFA ResMLP (30.8%) or
DFA ViT (23.7%) — partially because the scale pathology is less
catastrophic without the LN-driven gradient cancellation amplifying
it. This is the cross-architecture mechanism story made concrete.
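As a quick check of the CNN verdicts, the sketch below applies the 50x threshold for diagnostic (a) to the max/block values listed above. The helper `fires_scale` is a hypothetical name, not the protocol's actual API, and diagnostic (b) is left out because it never fires on the CNN.

```python
GROWTH_THRESHOLD = 50.0  # diagnostic (a) threshold used by the protocol

# Max per-block growth per method, copied from the key findings above.
cnn_max_per_block = {
    "BP": 1.3,
    "State Bridge": 2.4,
    "EP": 12.0,
    "DFA": 237.0,
    "Credit Bridge": 96.0,
}


def fires_scale(max_growth: float, threshold: float = GROWTH_THRESHOLD) -> bool:
    """Diagnostic (a): fire when the largest single-block growth exceeds the threshold."""
    return max_growth > threshold


walked_back = sorted(m for m, g in cnn_max_per_block.items() if fires_scale(g))
trustworthy = sorted(m for m, g in cnn_max_per_block.items() if not fires_scale(g))
print("walked back via (a):", walked_back)
print("trustworthy:", trustworthy)
```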
|
|
5-epoch DFA training on CIFAR-10 + apply protocol + interpret verdict.
Self-contained, runs on CPU in <2 minutes. Demonstrates the API a future
paper author would use:
1. train your model (any FA-style method)
2. build eval_batches from your test loader
3. call diagnose(model, eval_batches, headline_acc, frozen_baseline_acc)
4. read report.verdict; walk back if 'needs walk-back'
Not run during this session to avoid GPU contention with the in-flight
direction-quality and ViT/ResNet experiments.
|
|
3-panel side-by-side showing per-epoch trajectories of vanilla DFA vs
DFA + lambda*||f||^2 penalty:
(a) ||h_L||: vanilla 4e8 vs penalty 4e4 (4 OOM rescue)
(b) ||g_L||: vanilla 5e-10 vs penalty ~1e-6 (~3 OOM rescue)
(d) test acc: vanilla 0.31 vs penalty 0.36 vs frozen baseline 0.349 vs BP 0.61
The visual story: (a) and (b) show the penalty pulling the diagnostics
back into the healthy regime, but (d) shows the rescue translates to
only +1 pp above the DFA-shallow baseline and 24 pp below BP-trainable.
The two failure modes (scale + direction) are visually separable: scale
is fixed, direction is not.
Together with figure_audit_5method.png and figure_cross_arch_temporal_s42.png,
this is the third paper-ready figure for §3-§4.
|
|
separability, figures section
|
|
4-panel layout (one per diagnostic), 5 methods sorted bottom-to-top by
ascending accuracy, color-coded healthy (BP/EP, blue) vs degenerate
(DFA/SB/CB, red), with threshold lines drawn:
(a) max per-block growth (log scale, threshold 50x)
(b) ||g_L|| (log scale, floor 1e-7)
(c) cross-batch stability (linear, ceiling 0.30)
(d) headline acc (linear, frozen baseline 0.349)
The visual layout makes it immediately obvious that:
- (a) and (b) cleanly split healthy from degenerate (4-7 OOM gap)
- (c) is bimodal and doesn't cleanly split — confirms it's a sub-mode
discriminator, not a primary detector
- (d) shows BP above the frozen baseline by ~25 pp while DFA/CB/SB
are at or below it
|
|
Same protocol applied to the 4-block d=512 ResMLP variant (vs the d=256
default). 4 methods × 3 seeds = 12 conditions:
BP @ d=512: trustworthy on all 3 seeds (acc 0.60-0.61)
DFA @ d=512: walked back on all 3 seeds via (a)+(b)
State Bridge @ d=512: walked back on all 3 seeds via (a)+(b), with
drift sub-mode on s123 (stability 0.879)
Credit Bridge @ d=512: walked back on all 3 seeds via (a)+(b)
Width effect: max-per-block growth is HIGHER at d=512 (6e3-7e4) than at
d=256 (~1e3). Larger width amplifies the explosion. The protocol
verdicts are robust to this — same binary outcome, more extreme
quantitative numbers.
This is the cross-width validation: the protocol's findings are not
d=256-specific. The §3 audit results generalize across the width
dimension.
|
|
3-seed result on the existing dfa_s{42,123,456}.pt checkpoints from
results/confirmatory/checkpoints_A2/, computing per-layer cosine of
DFA's local credit signal e_T@B_l^T vs the true BP gradient at h_l.
Key findings:
per-layer cos (3-seed mean):
l0: +0.42 (high — embedding alignment)
l1: +0.006 (essentially zero)
l2: -0.015 (essentially zero)
l3: -0.004 (essentially zero)
l4: -0.004 (essentially zero)
layer-mean across all 5: +0.07-0.10
The deep blocks (l1-l4) have essentially zero alignment with BP grad in
the vanilla scale-failure regime. Layer 0 dominates the headline.
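The layer-0 dominance can be reproduced from the five 3-seed means listed above. This sketch only re-aggregates those quoted numbers; it does not load any checkpoint, and the choice of mean vs median is shown purely to illustrate how the aggregation changes the sign.

```python
from statistics import mean, median

# Per-layer cos(DFA credit, BP grad), l0..l4, 3-seed means quoted above.
per_layer_cos = [0.42, 0.006, -0.015, -0.004, -0.004]

all_mean = mean(per_layer_cos)         # dominated by l0
hidden_mean = mean(per_layer_cos[1:])  # l1..l4 only
all_median = median(per_layer_cos)     # picks a deep layer

print(f"mean over all layers:   {all_mean:+.4f}")
print(f"mean over l1..l4 only:  {hidden_mean:+.4f}")
print(f"median over all layers: {all_median:+.4f}")
```

The all-layer mean is positive only because of l0; both the hidden-only mean and the median are slightly negative.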
The script reconstructs the training-time random Bs by replaying the RNG
sequence (torch.manual_seed + ResidualMLP construction + randn draws),
since the existing checkpoints don't save Bs. For the still-running
direction-quality experiment which DOES save Bs, the script auto-detects
the dict format and uses the saved Bs directly.
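The RNG-replay idea can be illustrated with the standard library. This is an analogy only: the real script replays torch.manual_seed plus the ResidualMLP construction, whereas the sketch uses `random.Random`, and `draw_feedback` and `n_model_draws` are hypothetical names. It shows why the replay must consume exactly the same number of draws before the Bs are sampled.

```python
import random


def draw_feedback(seed, n_layers, rows, cols, n_model_draws):
    """Replay a seeded RNG: first the model-init draws, then the feedback matrices."""
    rng = random.Random(seed)
    # Stand-in for model construction, which consumes RNG state before the Bs.
    for _ in range(n_model_draws):
        rng.gauss(0.0, 1.0)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]
        for _ in range(n_layers)
    ]


training_Bs = draw_feedback(42, n_layers=4, rows=3, cols=2, n_model_draws=10)
replayed_Bs = draw_feedback(42, n_layers=4, rows=3, cols=2, n_model_draws=10)
# One fewer model-init draw shifts the stream, so the Bs no longer match.
shifted_Bs = draw_feedback(42, n_layers=4, rows=3, cols=2, n_model_draws=9)

print("exact replay matches:", training_Bs == replayed_Bs)
print("shifted replay matches:", training_Bs == shifted_Bs)
```

Because any change to the model constructor silently shifts the stream, saving the Bs in the checkpoint is the more robust option.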
|
|
3-seed analysis of DFA + lambda=1e-2 ||f||^2 penalty using only the data
already in the existing penalty JSON logs (no checkpoint or full layer
norms needed):
(a) per-block growth: avg ~8x per block (geom mean), well below 50x
threshold. PASS likely (with small caveat that max could differ
from mean).
(b) BP grad floor: g_2 = 8e-7 to 1e-6 across 3 seeds, roughly 10x above the
1e-7 floor. PASS exact.
(d) frozen baseline: margin = 1.35-1.45 pp (mean 1.38) < 2 pp
required. FIRE on all 3 seeds.
Aggregate partial verdict: protocol catches the SECOND failure mode
(direction quality / passive blocks) on penalized DFA even though it
PASSES the scale-related diagnostics. This is the cleanest possible
evidence that the two failure modes are separable: the penalty fixes
the scale failure but not the direction failure. The protocol's (d)
diagnostic is the right test for the second failure mode and it still
fires after the penalty rescue.
This is the §4 'two failure modes' evidence that doesn't depend on the
direction-quality direct test (which is still running). The (d)
diagnostic alone shows the separation.
|
|
Single-document overview of every result the protocol package has
produced so far, with reproducibility commands and the file/memory entry
where each result is recorded. Organized by paper section (§1 protocol,
§2 audit, §3 decision utility, §4 temporal validation, §5 pitfalls).
Includes the headline tables (3-seed audit, cross-architecture, penalty
sweep) ready for the paper, and an explicit status field for each
ongoing experiment.
This is a reading guide for anyone (codex, future-me, the user) who
needs to know what evidence is ready and how to reproduce it.
|
|
3-column 3-row plot:
rows: ||h_L||, ||g_L||, test accuracy
cols: ResMLP (with LN) | ViT-Mini (cls + LN) | StudentNet (no LN)
BP and DFA trajectories overlaid. Floor threshold drawn on the ||g_L||
row. Visualizes the cross-architecture causal control: with-LN
architectures both show ||g_L|| collapse below 1e-7 (DFA hits the floor
within 5 epochs); without-LN architecture shows ||g_L|| stays in the
healthy regime even though ||h_L|| still grows (catastrophic vs mild).
|
|
For each diagnostic, sweeps threshold across orders of magnitude on the
3-seed audit data and reports the verdict at each value.
Key calibration findings (3 seeds):
Diagnostic (a) max per-block growth:
Healthy max (BP/EP): 11.0
Degenerate min (DFA/SB/CB): 694
Separation gap: 63x
Default threshold 50 sits comfortably in the middle.
Any threshold in [12, 693] gives the same verdicts.
Diagnostic (b) ||g_L|| at floor:
Healthy min (BP/EP): 1.02e-4
Degenerate max (DFA/SB/CB): 4.18e-9
Separation gap: 24,338x
Default threshold 1e-7 sits comfortably in the middle.
Any threshold in [4.2e-9, 1.0e-4] gives the same verdicts.
Diagnostic (c) cross-batch stability:
NOT a clean binary discriminator across seeds. BP s456=0.114
near threshold; DFA s42=0.047 (noise sub-mode) doesn't fire;
SB s456=0.035 (noise sub-mode) doesn't fire. (c) is for sub-mode
interpretation, not binary detection.
This is the calibration evidence answering the E&D reviewer question
'why these specific thresholds?'.
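The calibration argument can be sketched as a sweep over powers of ten, using only the extreme values quoted above. `verdicts_unchanged` is a hypothetical helper; it assumes (a) fires when growth exceeds the threshold and (b) fires when ||g_L|| falls below it.

```python
def verdicts_unchanged(healthy_extreme, degenerate_extreme, threshold, fires_above):
    """True if every healthy run passes and every degenerate run fires at this threshold."""
    if fires_above:  # diagnostic (a): fire when value > threshold
        return healthy_extreme <= threshold < degenerate_extreme
    # diagnostic (b): fire when value < threshold
    return degenerate_extreme < threshold <= healthy_extreme


# (a) max per-block growth: healthy max 11.0, degenerate min 694.
a_ok = [k for k in range(0, 5)
        if verdicts_unchanged(11.0, 694.0, 10.0 ** k, fires_above=True)]
# (b) ||g_L||: healthy min 1.02e-4, degenerate max 4.18e-9.
b_ok = [k for k in range(-10, -2)
        if verdicts_unchanged(1.02e-4, 4.18e-9, 10.0 ** k, fires_above=False)]

print("(a) powers of ten giving the same verdicts:", a_ok)
print("(b) powers of ten giving the same verdicts:", b_ok)
print("(a) separation gap:", round(694 / 11.0), "x")
```

Both defaults (50 and 1e-7) fall inside the stable ranges, which is the sense in which they "sit comfortably in the middle".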
|
|
The existing snapshot_evolution_vit.py and vit_frozen_blocks_baseline.py
do not save model checkpoints — they only emit per-epoch JSON logs. This
makes it impossible to apply the diagnostic protocol to a trained ViT
post-hoc, since the protocol needs an actual model object.
This script trains a 4-block d=128 ViT-Mini with block-level DFA on
CIFAR-10 (same training rule as snapshot_evolution_vit.py) for 60 epochs
and saves:
- the final state_dict
- the random feedback Bs (so the protocol can also verify bug 4 on
this checkpoint)
- test_acc and config
Output: results/vit_dfa_checkpoints/dfa_vit_s{seed}.pt
|
|
All 3 verified on the real DFA s42 checkpoint:
Bug 4: training Bs gives Γ=+0.068, 10 fresh Bs draws give Γ=+0.0043±0.007.
The 'alignment' is the network adapting to specific Bs.
Bug 5: 4 valid aggregation strategies give Γ in [-0.028, +0.074]. The
spread is 0.10 (3.45x ratio) and **the sign flips** between
strategies. Pick the wrong aggregation and DFA is anti-aligned;
pick the right one and DFA looks aligned.
Bug 6: Γ_layer0 = +0.429 dominates the mean +0.068. Hidden layers 1-4 are
all near zero or slightly negative. Mean of hidden layers only is
-0.022 (negative!). The deep blocks the paper claims to be
'training' have Γ ≈ 0 or below.
Bugs 5 and 6 are causally linked: 'median over layers' strategies pick a
negative deep layer; 'mean over layers' is dominated by the positive l0.
The catalog under-reported bug 5 (it said 2.5x, actual is 3.45x with sign
flip).
|
|
Demonstrates the practical use case of the protocol — not as a post-hoc
audit but as an in-training abort condition. Walks through the existing
per-epoch trace and shows when the protocol would have triggered an early
stop on DFA training and what the saved compute would be.
Result: DFA on 4-block d=256 ResMLP fires diagnostic (b) at epoch 4 with
test acc 0.3076. The final acc at epoch 100 is *also* 0.3076 (identical).
Stopping at epoch 4 saves 96% of compute with zero headline acc loss.
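A minimal sketch of the abort condition, assuming a per-epoch list of ||g_L|| values. The trace below is made up for illustration (it is not the logged data); only the 1e-7 floor, the epoch-4 firing point, and the 100-epoch budget come from the result above.

```python
G_FLOOR = 1e-7  # diagnostic (b) floor


def first_fire_epoch(g_norms, floor=G_FLOOR):
    """Return the 1-based epoch at which ||g_L|| first drops below the floor, else None."""
    for epoch, g in enumerate(g_norms, start=1):
        if g < floor:
            return epoch
    return None


def compute_saved(fire_epoch, total_epochs):
    """Fraction of epochs skipped by stopping right after fire_epoch."""
    return (total_epochs - fire_epoch) / total_epochs


# Illustrative (made-up) trace: collapses below the floor at epoch 4.
trace = [3e-4, 2e-5, 4e-7, 6e-8, 5e-9] + [5e-10] * 95

fire = first_fire_epoch(trace)
saved = compute_saved(fire, len(trace))
print(f"(b) fires at epoch {fire}; stopping there saves {saved:.0%} of epochs")
```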
|
|
ResMLP (4-block d=256, with out_ln, CIFAR-10):
s42: DFA (a) ep 8, (b) ep 4, acc 0.308
s123: DFA (a) ep 11, (b) ep 4, acc 0.320
s456: DFA (a) ep 8, (b) ep 3, acc 0.300
ViT-Mini (4-block d=128, cls token + terminal LN, CIFAR-10):
s42: DFA (a) ep 1, (b) ep 3, acc 0.256
s123: DFA (a) ep 1, (b) ep 2, acc 0.202
s456: DFA (a) ep 1, (b) ep 3, acc 0.253
StudentNet (4-block d=128, NO terminal LN, synthetic alpha=1.0):
s42: DFA (a) ep 18, (b) NEVER, acc 0.332
s123: DFA (a) ep 14, (b) NEVER, acc 0.314
s456: DFA (a) ep 25, (b) NEVER, acc 0.336
BP: never fires on any seed x any architecture (9/9 sanity passes).
Key cross-architecture finding: diagnostic (b) is specifically the LN-
driven failure mode. Without out_ln, the BP grad never crosses the 1e-7
floor, even though (a) still fires (the residual stream still grows, just
without the LN-cancellation pathology that drives the BP grad to the
floor). This is the causal architectural control: (b) specifically tests
'is terminal-LN gradient cancellation active?' and (a) tests 'is the
residual stream growing without bound?'. They are linked but separable.
This is the §3 cross-architecture validation evidence.
|
|
Old metric: max(||h||) / max(||h_0||, eps). False-positives on ViT-style
architectures because the cls token at layer 0 (right after patch_embed)
has anomalously small magnitude (~0.3-1.5), inflating the ratio even on
healthy BP-trained ViTs.
New metric: max_l(||h_{l+1}|| / ||h_l||) — the largest single-block
residual amplification. Architecture-invariant.
Calibration:
- BP-trained, late training: <5x per block
- BP ViT, early epochs (cls token resolving): 13-25x max
- DFA-trained ResMLP/ViT: 100-4000x per block
Threshold raised from 10 to 50 to sit cleanly between healthy-early-
training (max 25) and failure-regime (min 100).
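The difference between the two metrics can be sketched on per-layer norm lists. Both example lists are hypothetical, chosen to fall in the ranges quoted above (small cls-token norm at layer 0, about 20x early-training growth for the healthy case); the function names are not the protocol's.

```python
def old_ratio(norms, eps=1e-8):
    """Old metric: max(||h||) / max(||h_0||, eps)."""
    return max(norms) / max(norms[0], eps)


def max_per_block_growth(norms, eps=1e-8):
    """New metric: max_l(||h_{l+1}|| / ||h_l||), the largest single-block amplification."""
    return max(b / max(a, eps) for a, b in zip(norms, norms[1:]))


# Hypothetical healthy BP ViT: tiny cls-token norm at layer 0, bounded afterwards.
healthy_vit = [0.5, 10.0, 14.0, 18.0, 20.0]
# Hypothetical DFA failure: the residual stream is amplified ~200x per block.
exploding = [10.0, 2000.0, 4e5, 8e7, 4e8]

for name, norms in [("healthy ViT", healthy_vit), ("exploding", exploding)]:
    print(name,
          "| old ratio:", old_ratio(norms), "(old threshold 10)",
          "| per-block max:", max_per_block_growth(norms), "(new threshold 50)")
```

On the healthy list the old ratio is 40, which would have fired at the old threshold of 10, while the per-block maximum is 20 and passes at 50.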
Re-verifications:
- smoke test (BP/DFA/EP): all 3 verdicts unchanged
- random init (3 seeds): trustworthy on all 3
- 5-method audit table single-seed: identical verdicts
- decision-utility ablation: identical (still 0/5 by S1, 3/5 by S_full)
- temporal evolution 3-seed: (b) now fires first at ep 3-4, (a) at ep
8-11. Both well before training ends. The 'protocol fires ~92 epochs
early' story still holds.
- ViT temporal evolution: BP no longer false-fires; DFA fires (a) ep 1,
(b) ep 3 — protocol works on the second architecture.
|
|
Each bug from the catalog has a synthetic reproducer that runs in <1 sec
without GPU:
Bug 1: x.norm(-1) on a 2x2 tensor returns 1.143 (L_{-1} of whole tensor)
instead of [5, 10] (per-row L_2 along dim=-1).
Bug 2: F.cosine_similarity(a, b) with ||b||=5e-10 returns +0.000905
instead of the true +0.018101. The clamp (eps=1e-8) inflates the
divisor 20x, so the cosine is underestimated 20x.
Bug 3: 5e-10 in fp16 -> 0 (underflows smallest subnormal ~6e-8).
Downstream F.cosine_similarity returns NaN. bf16 works because it
shares fp32's exponent range.
Bugs 4-6 (Bs reproducibility, aggregation, layer-0 dominance) require a
trained network and are demonstrated inside audit_table and
ablation_decision_utility.
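The arithmetic behind bugs 1-3 can be checked without PyTorch. The sketch below is a pure-Python stand-in, not the catalog's reproducers: the tensor [[3, 4], [6, 8]] is an assumption chosen because it matches the quoted 1.143 and [5, 10]; the per-norm eps clamp is a simplified model of the clamp in F.cosine_similarity; and bf16 is emulated by truncating an fp32 bit pattern, whereas real conversions round.

```python
import math
import struct

# Bug 1: p = -1 over the whole tensor vs per-row L2.
x = [[3.0, 4.0], [6.0, 8.0]]  # assumed tensor; reproduces the quoted numbers
norm_p_minus1 = sum(abs(v) ** -1 for row in x for v in row) ** -1.0
row_l2 = [math.sqrt(sum(v * v for v in row)) for row in x]
print("L_{-1} of whole tensor:", round(norm_p_minus1, 3), "| per-row L2:", row_l2)


# Bug 2: an eps clamp on a tiny norm inflates the divisor.
def cosine(u, v, eps=0.0):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (max(nu, eps) * max(nv, eps))


true_cos, scale = 0.018101, 5e-10
a = [1.0, 0.0]
b = [scale * true_cos, scale * math.sqrt(1.0 - true_cos ** 2)]  # ||b|| = 5e-10
cos_exact = cosine(a, b)
cos_clamped = cosine(a, b, eps=1e-8)
print("exact:", cos_exact, "| clamped:", cos_clamped)


# Bug 3: 5e-10 underflows in fp16 but survives in (emulated) bf16.
def to_fp16(v):
    return struct.unpack("<e", struct.pack("<e", v))[0]


def to_bf16_truncated(v):
    bits = struct.unpack(">I", struct.pack(">f", v))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


print("fp16:", to_fp16(5e-10), "| bf16 (truncated):", to_bf16_truncated(5e-10))
```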
|
|
s42: (a)+(b) fire at epoch 4, DFA final acc 0.3076
s123: (a)+(b) fire at epoch 4, DFA final acc 0.3203
s456: (a)+(b) fire at epoch 3, DFA final acc 0.2998
BP never fires on any seed (final acc 0.61-0.63).
The 'protocol catches it 96 epochs early' finding is fully reproducible
across seeds.
|
|
Replays per-epoch logged data from results/snapshot_evolution_v2/ through
the protocol thresholds.
Result: diagnostics (a) ||h_l|| explosion AND (b) ||g_L|| at floor BOTH
first fire at epoch 4 of DFA training. At that point, DFA test acc is
0.308 — its final value at epoch 100 is also 0.308. The protocol could
have walked back the headline 96 epochs before training finished.
DFA's gamma hovers at 0.087-0.107 for all 100 epochs. A reviewer looking
at acc+gamma would conclude 'DFA is hovering at 31% acc with ~0.10
alignment, both reasonable'. Wrong on both counts.
BP never fires any diagnostic at any epoch. Stays bounded at ||h_L||~200,
||g_L||~3-5e-5, accuracy climbs to 0.61.
This is the temporal validation of decision utility: the protocol catches
the pathology AS IT HAPPENS, not just retrospectively.
|
|
3-seed random init ResMLP gives chance accuracy (~10%) but the protocol
verdict is 'trustworthy' on all 3 seeds:
- residual norms ~8.7 across all layers (no growth, bounded)
- BP gradient norms ~8e-3 (healthy, well above 1e-7 floor)
- cross-batch stability 0.08-0.18 (in the BP/EP range)
This is the answer to the likely reviewer question: 'is your protocol just
flagging anything that doesn't perform well?' Answer: no. Random init is
at chance and the protocol passes it. The walked-back trained methods are
walked back because of the *measurements*, not because of the accuracy.
Notable: random init g-norms (8e-3) are actually HIGHER than BP-trained
ones (4e-4) — BP training reduces the gradient magnitude as loss decreases.
So the protocol distinguishes 3 distinct regimes: (1) untrained healthy,
(2) trained-and-still-healthy (BP/EP), (3) trained-into-pathology (DFA/SB/CB).
|
|
3 seeds × 5 methods × 4 diagnostics = 60 measurements. Key reproducibility
findings:
- BP: trustworthy on all 3 seeds (acc 0.61-0.62, h_L ~200, g_L ~3-4e-4)
- EP: trustworthy on all 3 seeds (acc 0.29-0.36, h_L 3-8e3, g_L ~1e-4)
- DFA, SB, CB: walked back on all 3 seeds × all 3 of (a)/(b)/(d)
Diagnostic (c) is bimodal across seeds — confirms the prior memory finding:
- DFA s42=0.047 (noise), s123=0.436 (drift), s456=-0.005 (noise)
- SB s42=0.992 (drift), s123=0.561 (drift), s456=0.035 (noise)
- CB s42=0.352 (drift), s123=0.250 (~edge), s456=0.518 (drift)
(c) catches different methods on different seeds. (a)/(b)/(d) catch all 3
failing methods on all 3 seeds — robust binary detection.
|
|
Builds on the 5-method audit JSON. For each method, evaluates 7 reporting
strategies (S0=acc only, S1=+Γ field standard, S2-S5=+single diagnostic,
S_full=full protocol), and emits the verdict each strategy would have
reached.
Result: 3 of 5 methods (DFA/SB/CB) are walked back by S_full but NOT by S1.
Each of (a)scale, (b)floor, (d)frozen is independently sufficient for
binary detection of those 3 failures. Diagnostic (c)stability adds
sub-mode discrimination (drift vs noise) but not new positive detections.
This is the §3 protocol decision-utility evidence.
|
|
Trains both vanilla DFA (lam=0) and penalized DFA (lam=1e-2) from the same
seed, then directly measures the per-layer cosine between DFA's local
credit signal e_T @ B_l^T and the BP gradient at hidden layers. Uses the
training Bs (not fresh ones, per the Bs-specificity finding from earlier).
The penalized run is the key measurement: in that condition the BP grad is
~10^-7 (well above the eps=1e-8 floor), so a near-zero cosine here would
be the direct evidence of the second failure mode (direction-quality
ceiling) that codex round 13 hypothesized.
Pre-registered prediction: penalized cos(DFA, BP) ~ 0.01-0.05 -> direction
quality is the second, separable failure mode. Saves the penalized
checkpoint so the diagnostic protocol can be re-applied to it (where (a)
and (b) should pass, (d) should still fail).
|
|
5-method audit table on 4-block d=256 ResMLP CIFAR-10 seed 42:
- BP: trustworthy (acc 0.615, h_L=2e2, g_L=4e-4, stab 0.099)
- DFA: walked back via (a)+(b)+(d) — h_L=4e8, g_L=4e-9, undercuts frozen
- State Bridge: walked back via all 4 diagnostics — stability 0.992 is the
cleanest possible drift-dominated case
- Credit Bridge: walked back via all 4 — stability 0.352, also drift mode
- EP: trustworthy (acc 0.359, h_L=3e3, g_L=2e-4, stab -0.036) — paper's
internal control case
This is the §2 audit evidence for the main-track paper. Confirms that
standard headline acc + Γ silently fails on 3 of 5 methods on this
architecture, while the 4-diagnostic protocol catches all three.
|
|
Codex round 15 #1 priority for the E&D-track paper:
- protocol/protocol.py: 4 diagnostics (residual norms, BP grad norms,
cross-batch direction stability, and a frozen-baseline comparator)
- protocol/report.py: DiagnosticReport with per-diagnostic verdicts and
pretty-printer
- protocol/smoke_test.py: validates BP/DFA/EP checkpoints produce the
expected verdicts (BP/EP trustworthy; DFA walked back via residual
explosion + BP grad at floor)
- protocol/README.md: usage, audit cases, threshold rationale
- protocol/CHECKLIST.md: 6 evaluation pipeline pitfalls (norm(-1),
cosine_similarity eps clamp, fp16 underflow, Bs reproducibility,
aggregation, layer-0 dominance)
- protocol/REPORTING_TEMPLATE.md: per-method fillable form for FA papers
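For orientation, here is a hypothetical sketch of how per-diagnostic results could be combined into a verdict. It is not the code in protocol/protocol.py or protocol/report.py: `SketchReport` and `sketch_diagnose` are invented names, the thresholds are the ones quoted elsewhere in this log (50x, 1e-7, 0.30, 2 pp), and treating (c) as a sub-mode label rather than a detector is an assumption based on the calibration notes.

```python
from dataclasses import dataclass, field
from typing import List, Optional

GROWTH_MAX = 50.0         # (a) max per-block growth
G_FLOOR = 1e-7            # (b) ||g_L|| floor
STABILITY_CEILING = 0.30  # (c) cross-batch stability
MIN_MARGIN_PP = 2.0       # (d) required margin over the frozen baseline, in pp


@dataclass
class SketchReport:
    fired: List[str] = field(default_factory=list)
    sub_mode: Optional[str] = None

    @property
    def verdict(self) -> str:
        return "needs walk-back" if self.fired else "trustworthy"


def sketch_diagnose(max_block_growth, g_L_norm, margin_pp, stability=None):
    """Combine scalar diagnostics into a verdict (illustrative only)."""
    report = SketchReport()
    if max_block_growth > GROWTH_MAX:
        report.fired.append("a")
    if g_L_norm < G_FLOOR:
        report.fired.append("b")
    if margin_pp < MIN_MARGIN_PP:
        report.fired.append("d")
    # (c) only labels the failure sub-mode; it adds no detections here.
    if report.fired and stability is not None:
        report.sub_mode = "drift" if stability > STABILITY_CEILING else "noise"
    return report


# Penalized-DFA-like inputs (approximate values from this log): only (d) fires.
penalized = sketch_diagnose(max_block_growth=8.0, g_L_norm=9e-7, margin_pp=1.38)
# Illustrative healthy inputs.
healthy = sketch_diagnose(max_block_growth=1.3, g_L_norm=1e-4, margin_pp=25.0)
print(penalized.verdict, penalized.fired)
print(healthy.verdict, healthy.fired)
```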
|
|
methods)
BP/DFA/SB/CB: added seeds 2048,3000,4000,5000,6000 (L=4 only, all 3 alphas)
Total: 1290 rows (was 990)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
CNN CIFAR-10 (5 seeds, fixed Gamma):
BP: acc=86.8%, Gamma=0.970, rho=0.603
DFA: acc=56.7%, Gamma=0.896, rho=0.061
EP: acc=50.6%, Gamma=0.484, rho=0.450
SB: acc=63.3%, Gamma=1.000 (BP self-cos, feedback nets not saved), rho=0.601
CB: acc=31.8%, Gamma=1.000 (BP self-cos), rho=0.226
DFA Gamma=0.896 is notably high — CNN DFA credit aligns well with BP gradients.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
layers
Old code detached hidden states between layers, making layers 0-2 disconnected
from the loss (gradient = None → 0). Fixed by keeping the forward graph connected.
BP CNN Gamma per-layer now: [0.985, 0.990, 0.987, 0.967] (was [0, 0, 0, 0.967])
But gradient norms are ~1e-17 (genuine numerical precision issue with CNN architecture).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
EP synthetic: 30 JSONs + 30 checkpoints (10 seeds × 3α)
EP CIFAR persample: 6 seeds × 4 layers × 256 samples = 6144 rows added
Synth cross-state: 150 EP rows added (990 total)
cifar_persample_all.csv: 30720 rows (was 24576, +6144 EP)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
Updated synth_cross_state_distance.csv with 150 EP rows (990 total).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
EP CIFAR d_BP: L0=2.0×, L4=26.7× (much closer to BP than DFA=162×/2.5M×)
EP synthetic: no checkpoints saved (ep_synthetic.py didn't save .pt)
CNN summary: 20 rows confirmed correct
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
EP synthetic: 15 rows (3α × 5 seeds)
Synth cross-state: 840 rows (3α × 2L × 4 methods × 5 seeds × (L+1) layers)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
EP CIFAR d=256: s(1e-6)=100%, mean_norm=1.41e-04
EP produces networks where ALL samples have non-zero BP gradients,
unlike DFA (0.4%), SB (21%), CB (3%). EP is closer to BP (98.7%).
Updated clean_sparsity_summary.csv (980 rows, now includes EP).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
Non-BP methods produce radically different representations:
DFA L0: 162×, L4: 2.5M× relative to BP hidden norms
SB L0: 3.2×, L4: 1.1M×
CB L0: 59×, L4: 1.4M×
(BP vs itself = 0)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
EP Synthetic (fixed): Gamma=+0.13~0.20, rho=+0.25
EP CIFAR d=256: Gamma=+0.007, rho=+0.051
EP CIFAR d=512: Gamma=+0.000, rho=-0.002
EP CNN: Gamma=+0.248, rho=+0.492
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
EP d=256 (5 seeds): acc=31.9%, Gamma=+0.007 (was -0.13), rho=+0.051 (was -0.037)
Sign correction: -(h_nudge - h_free)/β aligns EP credit with BP gradient direction.
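A toy check of the sign convention, assuming a quadratic loss and a nudged state that takes one small step toward lower loss. This is a simplification of EP's nudged phase rather than the repository's implementation; `ep_credit` is a hypothetical helper.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def ep_credit(h_free, h_nudge, beta):
    """Sign-corrected EP credit: -(h_nudge - h_free) / beta."""
    return [-(n - f) / beta for f, n in zip(h_free, h_nudge)]


beta = 0.1
target = [0.0, 1.0, 0.5]
h_free = [1.0, -2.0, 0.5]

# Toy loss L(h) = 0.5 * ||h - target||^2, so the BP gradient is h - target.
bp_grad = [h - t for h, t in zip(h_free, target)]
# Assumed nudge: a small step toward lower loss, i.e. against the gradient.
h_nudge = [h - beta * g for h, g in zip(h_free, bp_grad)]

raw_diff = [(n - f) / beta for f, n in zip(h_free, h_nudge)]
corrected = ep_credit(h_free, h_nudge, beta)

print("cos(raw difference, BP grad):", round(cosine(raw_diff, bp_grad), 6))
print("cos(corrected credit, BP grad):", round(cosine(corrected, bp_grad), 6))
```

In this toy setting the uncorrected difference is exactly anti-aligned with the BP gradient, which is the pattern the negative Gamma values pointed to.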
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
EP nudge moves h toward lower loss (opposite to BP grad which points toward loss increase).
Without negation, Gamma is negative and rho is -0.25.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
EP synthetic: acc high (92-96%) but Gamma negative (-0.13 to -0.20), rho=-0.25
EP credit direction may be inverted, or the diagnostics may have an issue.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
CNN CIFAR-10 (5 seeds):
BP: 86.8%±0.3%, Gamma=0.238, rho=0.250
DFA: 56.7%±2.0%, Gamma=0.216, rho=0.017
SB: 63.3%±0.5%, Gamma=0.045, rho=0.298
CB: 31.8%±6.2%, Gamma=0.013, rho=0.033
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
State bridge: per-layer StateBridgeNet predicting h3 from flattened h_l
Credit bridge: per-layer ValueNet with terminal + bridge consistency + DFA warmup
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
ReLU MLP (L=4 d=256):
BP: acc=61.1%, Gamma=1.000, rho=0.998
DFA: acc=30.7%, Gamma=0.104, rho=-0.001
SB: acc=15.5%, Gamma=0.300, rho=0.159
CB: acc=28.7%, Gamma=0.298, rho=0.007
Note: SB/CB Gamma uses BP gradient as proxy (feedback nets not checkpointed).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|