faeval.git/protocol, branch master

Update NOTE.md + EVIDENCE_SUMMARY.md with FA results (2026-04-23)

2026-04-23T16:18:59+00:00

NOTE.md: added comprehensive current-status section at the top with
the full 6-method audit table (BP/FA/EP/DFA/CB/SB), FA vs DFA key
comparison, depth sweep, penalty rescue comparison, cross-method
functional triangulation, and open items. Old Phase 10A content kept
below as historical reference.

EVIDENCE_SUMMARY.md: added "Vanilla FA vs DFA" section with the
paper-changing finding (FA 0.401 ± 0.009 vs DFA 0.306 ± 0.008,
FA has genuine deep cos +0.33, no Mode 1(b) collapse) and the
d=512 depth sweep table.

Co-Authored-By: Claude Opus 4.6 (1M context)

Sync EVIDENCE_SUMMARY.md and PAPER_OUTLINE.md with v2.32 values

2026-04-09T00:25:42+00:00

These two project scratch documents had stale BP=0.609 and DFA=0.308
references from the pre-v2.31 era. Updated them to the matched 30-ep
3-seed values established by the v2.31-v2.32 corrections:

  BP no-pen 30ep:  0.609 → 0.585 ± 0.001
  BP+pen 30ep:     0.530 → 0.532 ± 0.006
  DFA no-pen 30ep: 0.308 → 0.301 ± 0.005
  DFA+pen 30ep:    0.363 → 0.360 ± 0.001
  Gap math:        +5.5/-8 → +5.9/-5.3 pp; +18.1/+1.4 → +18.3/+1.1 pp
  Deep cos:        +0.155 → +0.151
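
The first pair of gap numbers can be re-derived from the four values in
the table. A minimal sketch, assuming "+5.9/-5.3 pp" means the penalty's
effect on DFA and on BP respectively (that reading is an interpretation,
and the names below are illustrative, not taken from the repo):

```python
# Matched 30-ep, 3-seed mean scores quoted in the table above.
scores = {
    "bp_no_pen": 0.585,
    "bp_pen": 0.532,
    "dfa_no_pen": 0.301,
    "dfa_pen": 0.360,
}


def penalty_effect_pp(with_pen, without_pen):
    """Change in score from adding the penalty, in percentage points."""
    return round((with_pen - without_pen) * 100, 1)


# Hypothetical variable names; only the four input numbers come from the log.
dfa_gain = penalty_effect_pp(scores["dfa_pen"], scores["dfa_no_pen"])
bp_cost = penalty_effect_pp(scores["bp_pen"], scores["bp_no_pen"])
print(f"DFA: {dfa_gain:+.1f} pp, BP: {bp_cost:+.1f} pp")
```

The same function applied to the stale values (0.363 vs 0.308, 0.530 vs
0.609) gives +5.5 and -7.9, consistent with the old "+5.5/-8" entry.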

Now the paper, the protocol library, the README, the helper scripts,
and the project scratch docs all agree on the v2.32 values.

Co-Authored-By: Claude Opus 4.6 (1M context)

Sync experiment+protocol scripts with v2.32 corrected control values

2026-04-09T00:24:06+00:00

The pre-v2.31 unsourced values BP=0.609 and DFA=0.308 (which v2.31 fixed
to 0.585 and 0.301 via matched 30-ep controls) were also hardcoded as
"compare to" comments in 5 helper scripts:

  experiments/bp_with_penalty_control.py
  experiments/dfa_residual_penalty_test.py
  experiments/resmlp_frozen_blocks_baseline.py
  protocol/examples/threshold_d_sensitivity.py
  protocol/examples/plot_penalty_rescue.py

These are non-paper-input scripts (their output goes to stdout, not to
the paper), so the stale values didn't cause numerical errors in the
paper itself. But the unsourced 0.609 value that v2.31 corrected came
from exactly this kind of hardcoded "for-comparison" comment that was
never measured. Updating the scripts now removes the same trap from
future runs.

Each script now references the matched 30-ep 3-seed values from
results/bp_no_penalty_30ep, results/dfa_no_penalty_30ep,
results/dfa_pen_short, and results/bp_with_penalty.
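
One way to keep "compare to" values from going stale is to read them
from the results directory at run time instead of typing them into a
comment. A rough sketch only: the layout of results/ is not shown in
this log, so the `summary.json` file name, the `seed_scores` key, and
the `load_reference` helper are all hypothetical.

```python
import json
import tempfile
from pathlib import Path
from statistics import mean, stdev


def load_reference(run_dir):
    """Return (mean, sample std) of the per-seed scores stored in run_dir.

    ASSUMPTION: run_dir contains a `summary.json` with a `seed_scores`
    list. This schema is made up for illustration.
    """
    payload = json.loads((Path(run_dir) / "summary.json").read_text())
    seed_scores = payload["seed_scores"]
    return mean(seed_scores), stdev(seed_scores)


# Demo against a throwaway directory with made-up scores (not real results).
with tempfile.TemporaryDirectory() as tmp:
    run = Path(tmp) / "demo_run"
    run.mkdir()
    (run / "summary.json").write_text(
        json.dumps({"seed_scores": [0.50, 0.52, 0.54]})
    )
    ref_mean, ref_std = load_reference(run)
    print(f"compare to: {ref_mean:.3f} +/- {ref_std:.3f}")
```

With this pattern the printed comparison value is always a measured
number, so a script cannot silently carry a figure that no run produced.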

protocol/EVIDENCE_SUMMARY.md and PAPER_OUTLINE.md still have stale
numbers; these are project scratch documents and not user-facing.
Deferred to a separate sweep if needed.

Co-Authored-By: Claude Opus 4.6 (1M context)

protocol/README.md: sync (c) range with v2.31.13 paper update

2026-04-09T00:22:38+00:00

Same fix as v2.31.13's paper §6 ¶3 and the protocol.py docstring sync:
the README's "0.05-0.18 / 0.43-0.99" calibration ranges were the
same loose values that v2.31.13 corrected. Updated to match the actual
audit data: BP/EP in [-0.04, +0.12], degenerate up to +0.99 with 5/9
above the 0.30 cutoff.

Now the paper §6 ¶3, protocol.py docstring, and protocol/README.md
all agree on the (c) calibration ranges.

Co-Authored-By: Claude Opus 4.6 (1M context)

protocol/protocol.py: sync (c) range docstring with v2.31.13 paper update

2026-04-09T00:21:56+00:00

The cross_batch_direction_stability docstring claimed healthy values
"~0.05-0.18" and drift-dominated "~0.5-0.99"; these were the same
loose ranges that v2.31.13 corrected in the paper §6 ¶3.

Re-aggregated from results/protocol_audit/audit_table_s42_s123_s456.json
(K=10 batches of 128 samples):
  Healthy 6 BP+EP values: range [-0.036, 0.120], median 0.093
  Degen 9 DFA/SB/CB values: range [-0.005, 0.992], median 0.352
                           5/9 above 0.30 cutoff
                           3/9 above 0.50
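
The summary statistics above (range, median, counts above a cutoff) can
be recomputed with a few lines of stdlib Python. This is a sketch under
assumptions: the schema of the audit JSON is not shown here, so the
hypothetical `summarize` helper takes a plain list of values, and the
demo numbers are placeholders, NOT the audit data.

```python
from statistics import median


def summarize(values, cutoffs=(0.30, 0.50)):
    """Return range, median, size, and count strictly above each cutoff."""
    vals = sorted(values)
    return {
        "range": (vals[0], vals[-1]),
        "median": median(vals),
        "n": len(vals),
        "above": {c: sum(v > c for v in vals) for c in cutoffs},
    }


# Hypothetical placeholder values for illustration only.
demo = [0.01, 0.10, 0.25, 0.35, 0.60]
summary = summarize(demo)
print(summary["range"], summary["median"], summary["above"])
```

Running the same helper once on the healthy group and once on the
degenerate group would give the two rows reported above.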

Updated docstring to match the actual audit data and point at the
JSON source. Now the paper §6 ¶3 prose and the protocol.py docstring
agree exactly on the (c) calibration ranges.

Co-Authored-By: Claude Opus 4.6 (1M context)

PAPER_OUTLINE: add 6th validation (perturbation correlation triangulation)

2026-04-08T07:12:08+00:00

EVIDENCE_SUMMARY: add 6th validation (perturbation correlation triangulation)

2026-04-08T07:09:15+00:00

EVIDENCE_SUMMARY: §4 fully rewritten under locked two-distinct-modes framing

2026-04-08T07:04:28+00:00

§4 now reflects all 5 independent validations of the converged framing:
  1. Direct deep cos on penalized DFA (3 seeds): +0.155 ± 0.025
  2. Null calibration with fresh Bs: +0.002 ± 0.022 (real signal)
  3. Hypothesis B disambiguation (vanilla early ep): -0.008 ± 0.013
  4. BP+penalty 2×2 control: 17 pp residual = credit quality
  5. Multi-seed lock-in: 24 measurements all near zero

Round 20 language tightening applied:
  - 'lower bound on non-capacity gap' instead of 'clean isolation'
  - Explicit caveats about end-to-end vs local-loss difference
  - Counter to 'different optimization regime' objection

The §4 framing is locked. Five independent validations done. Stop
iterating, start writing.

PAPER_OUTLINE: round 20 language tightening + 5 validation summary

2026-04-08T07:02:37+00:00

§4 updates per round 20:
  - Soften 'confirmed' to 'strongly supports'
  - Add §4.4 BP+penalty capacity-cost control with the round 20 phrasing:
    'lower bound on residual gap under matched architecture/data/optimizer/
    penalty, after accounting for the penalty's direct capacity cost in BP'
  - Add multi-seed lock-in to §4.3 (24 measurements all near zero)
  - List 5 independent validations supporting the converged framing

The §4 narrative is now complete and the framing is locked.

PAPER_OUTLINE: §4 rewrite under 'two distinct failure modes' framing

2026-04-08T06:33:00+00:00

After the round 19 disambiguation experiment confirmed hypothesis B
(penalty CREATES deep alignment, not just reveals it), the paper §4
needs to use the new framing:

  Mode 1: measurement degeneracy via terminal LN gradient cancellation
  Mode 2: low intrinsic credit-direction quality of random feedback

Both modes are directly measured (Mode 1 by diagnostic (b), Mode 2 by
per-layer cos in the meaningful regime). The penalty partially
alleviates BOTH modes. Neither is fully fixed.
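
This log does not spell out the mechanism behind Mode 1. One standard
property of layer normalization that can produce this kind of
cancellation is that its backward pass removes any upstream signal that
is constant across features or parallel to the normalized output. The
sketch below illustrates only that generic property for an affine-free
LN with eps = 0; it is an assumption that this is the effect meant here,
and the function names are illustrative.

```python
import math


def layernorm(x):
    """Affine-free layer norm (eps = 0): subtract mean, divide by std."""
    n = len(x)
    mu = sum(x) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / n)
    return [(v - mu) / sigma for v in x], sigma


def layernorm_backward(x, g):
    """Gradient w.r.t. x for upstream gradient g (standard LN backward)."""
    y, sigma = layernorm(x)
    n = len(x)
    g_mean = sum(g) / n
    gy_mean = sum(gi * yi for gi, yi in zip(g, y)) / n
    return [(gi - g_mean - yi * gy_mean) / sigma for gi, yi in zip(g, y)]


x = [0.5, -1.0, 2.0, 0.25]  # arbitrary illustrative activations
y, _ = layernorm(x)

const_grad = layernorm_backward(x, [3.0] * len(x))         # constant signal
parallel_grad = layernorm_backward(x, y)                   # signal parallel to y
generic_grad = layernorm_backward(x, [1.0, 0.0, 0.0, 0.0]) # generic signal

print(max(abs(v) for v in const_grad))
print(max(abs(v) for v in parallel_grad))
print(max(abs(v) for v in generic_grad))
```

The first two maxima are zero up to floating-point error, while the
generic signal passes through with a clearly non-zero gradient.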

§4 rewrite includes:
  - The two modes (4.1)
  - Penalty causal validation with 3-seed cos (4.2)
  - Disambiguation: vanilla early-epoch cos table proving hypothesis B (4.3)
  - Why the residual gap is partial alignment (4.4)
  - Why this framing is paper-cleaner than prior ones (4.5)

Walk-back chain extended to 7 entries, with entries 6 and 7 happening on
the same day and converging on the final two-distinct-modes framing.