summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorYurenHao0426 <Blackhao0426@gmail.com>2026-04-08 18:47:26 -0500
committerYurenHao0426 <Blackhao0426@gmail.com>2026-04-08 18:47:26 -0500
commitffbb53cb59eeea47f7967c4b4654cf2ee73395a9 (patch)
treeb01bfe2b75ab21072cee6eedbf8996b9db93d6c8
parent348e0f4e19be654febc9061ae873493a58080f91 (diff)
paper v2.31.13: §6 ¶3 (c) ranges replaced with audit-data ranges
The (c) calibration ranges "0.05-0.18 healthy, 0.5-0.99 drift-dominated" overstated the separation. Re-aggregated from results/protocol_audit/audit_table_s42_s123_s456.json: Healthy (BP+EP) 6 values: [-0.036, -0.024, 0.087, 0.099, 0.114, 0.120] range = [-0.036, 0.120], median 0.093 (NOT "0.05 to 0.18" — has negative values, max < 0.18) Degen (DFA+SB+CB) 9 values: [-0.005, 0.035, 0.047, 0.250, 0.352, 0.436, 0.518, 0.561, 0.992] range = [-0.005, 0.992], median 0.352 (only 5/9 above 0.30, only 3/9 above 0.50) The (c) discriminator has substantial overlap between healthy and degen distributions on this metric — the paper already calls (c) a "sub-mode discriminator" not a binary detector, so the loose calibration is acknowledged in framing but the numerical ranges should match the data. Updated to: "healthy methods cluster near zero with all six BP/EP values in [-0.04, +0.12], while drift-dominated cases reach high tails up to +0.99, and 5/9 degenerate values exceed the 0.30 default cutoff". This is more honest and points at the audit JSON. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
-rw-r--r--paper/main.pdfbin500971 -> 501154 bytes
-rw-r--r--paper/main.tex2
2 files changed, 1 insertions, 1 deletions
diff --git a/paper/main.pdf b/paper/main.pdf
index 2ff8fd7..90882a6 100644
--- a/paper/main.pdf
+++ b/paper/main.pdf
Binary files differ
diff --git a/paper/main.tex b/paper/main.tex
index b40156e..6ba45ce 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -175,7 +175,7 @@ Diag. & Measurement & Default threshold & Role \\
The point of the protocol is not to add plots; it is to prevent a specific class of false conclusions. For this paper, the minimal protocol is four checks: per-layer activation scale via max-per-block growth, deepest hidden BP gradient floor, meaningful-regime per-layer credit quality, and an architecture-matched frozen-blocks baseline (Table~\ref{tab:protocol_def}). The first two ask whether the reference quantity is still valid; the third asks whether, once validity is restored, the deep blocks receive useful directions; and the fourth asks whether the trained depth is doing better than a model whose residual blocks were never trained at all. Figure~\ref{fig:decision_utility} (Appendix~\ref{app:all_validations}) makes the decision value explicit: accuracy alone walks back $0/5$ audited methods, accuracy plus headline $\Gamma$ still walks back $0/5$, and the full protocol walks back $3/5$ by flagging DFA, State Bridge, and Credit Bridge, with diagnostics (a), (b), and (d) each independently sufficient for binary detection on those failures. On our audit, these checks catch failures that accuracy plus aggregate alignment miss completely.
-The protocol is conservative in a specific sense: it preserves BP and EP as evidence-bearing controls and walks back only claims that fail measurement-validity or depth-utilization checks. Diagnostics (a) and (b) have sharp empirical calibration gaps in the audited regime (Appendix~\ref{app:threshold_sweep}), diagnostic (c) is a sub-mode discriminator computed as the mean pairwise cosine of the per-batch-averaged BP-grad direction at the chosen layer across $K{\geq}8$ disjoint $128$-sample minibatches (high values, $0.5$--$0.99$, indicate drift-dominated reference vectors; healthy per-sample credit gives $0.05$--$0.18$), and diagnostic (d) uses a deliberately weak $2$pp margin as a context check rather than a theorem about useful depth. The Section~\ref{sec:mode2} cross-method cosine-versus-accuracy dissociation reinforces the necessity of keeping all four diagnostics separate: Credit Bridge, State Bridge, and DFA differ by more than $4\times$ in deep-layer alignment under the same penalty rescue without tracking final accuracy in the same direction, so aligning an alternative credit rule with the BP gradient is not a substitute for checking depth utilization against a matched shallow baseline.
+The protocol is conservative in a specific sense: it preserves BP and EP as evidence-bearing controls and walks back only claims that fail measurement-validity or depth-utilization checks. Diagnostics (a) and (b) have sharp empirical calibration gaps in the audited regime (Appendix~\ref{app:threshold_sweep}), diagnostic (c) is a sub-mode discriminator computed as the mean pairwise cosine of the per-batch-averaged BP-grad direction at the chosen layer across $K{\geq}8$ disjoint $128$-sample minibatches (in our 5-method audit, healthy methods cluster near zero with all six BP/EP values in $[-0.04,+0.12]$, while drift-dominated cases reach high tails up to $+0.99$, and $5/9$ degenerate values exceed the $0.30$ default cutoff), and diagnostic (d) uses a deliberately weak $2$pp margin as a context check rather than a theorem about useful depth. The Section~\ref{sec:mode2} cross-method cosine-versus-accuracy dissociation reinforces the necessity of keeping all four diagnostics separate: Credit Bridge, State Bridge, and DFA differ by more than $4\times$ in deep-layer alignment under the same penalty rescue without tracking final accuracy in the same direction, so aligning an alternative credit rule with the BP gradient is not a substitute for checking depth utilization against a matched shallow baseline.
\section{Discussion, Limits, Conclusion}
\label{sec:discussion}