faeval.git/paper/figures/render_fig5_cross_arch.py, branch master

paper v2.34.3: fix Figure 4 cross-arch verdict matrix (data + layout)

2026-04-09T01:10:14+00:00

User flagged Figure 4 issues. Found three problems:

1. **Row 4 (no-terminal-LN ResMLP) (d) frozen** was encoded as 0 (passes)
   but the actual data is no-outln DFA acc 0.327 ± 0.012 (3-seed) vs
   frozen baseline 0.349 ± 0.002 → margin -2.2 pp, beyond the 2 pp
   threshold → (d) FIRES. Updated to 1 (WB).

2. **Row 5 (CNN BN) cells (c) and (d)** were encoded as 0 (passes) but
   the CNN audit (results/protocol_audit/audit_cnn_3seed.json) only
   measured (a) and (b); there is no CNN frozen baseline and no CNN
   stability run. Showing them as ✓ was misleading. Added a third color
   (gray, "—") for "not measured" and marked CNN (c)+(d) accordingly.

3. **Layout** had massive empty vertical space below the panels with
   the key-finding text floating far below. Compressed:
   - figsize (11, 4.2) → (11, 3.2) [tighter aspect ratio]
   - Key-finding text moved from axes-coordinates y=-1.55 (way below
     plot) to figure-coordinates y=-0.05 (directly under panels)
   - BP panel title clarified: "BP-trained: protocol passes" →
     "BP-trained: protocol passes everywhere"

Also marked ViT-Mini (c) and no-LN ResMLP (c) as "not measured" since
neither has a saved cross_batch_stability value (the audit_cnn,
audit_d512, snapshot_vit_v1, and snapshot_no_outln_v1 files don't
include this diagnostic).
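
The (d) margin arithmetic from item 1 can be checked with a minimal sketch. The function names and the sign convention (the diagnostic fires when the trained model falls more than the threshold below the frozen baseline) are assumptions for illustration; only the two mean accuracies and the 2 pp threshold come from this message.

```python
def frozen_margin_pp(trained_acc: float, frozen_acc: float) -> float:
    """Margin of the trained model over the frozen baseline, in percentage points."""
    return (trained_acc - frozen_acc) * 100.0


def diagnostic_d_fires(trained_acc: float, frozen_acc: float,
                       threshold_pp: float = 2.0) -> bool:
    # Assumed convention: fire when the trained model sits more than
    # threshold_pp below the frozen baseline.
    return frozen_margin_pp(trained_acc, frozen_acc) < -threshold_pp


# no-outln DFA 3-seed mean vs frozen baseline mean (values quoted in item 1)
margin = frozen_margin_pp(0.327, 0.349)
print(f"margin = {margin:+.1f} pp, (d) fires = {diagnostic_d_fires(0.327, 0.349)}")
```

The check uses the means only; it does not account for the reported ± spreads.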

New verdict matrix:
                    (a) (b) (c) (d)
  ResMLP-d256 LN    WB  WB  ✓   WB
  ResMLP-d512 LN    WB  WB  ✓   WB
  ViT-Mini          WB  WB  —   WB
  ResMLP-d256 no-LN WB  ✓   —   WB   ← row 4 (d) was wrong
  CNN BN            WB  ✓   —   —    ← row 5 (c)+(d) were misleading

Key finding "(b) only fires on terminal-LN architectures" is unchanged
and now visually clearer (rows 1-3 have WB in (b), rows 4-5 have ✓).
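
As a rough sketch of the three-state encoding, the matrix above can be written down and the key finding checked mechanically. The codes 0 (passes) and 1 (WB) follow this message; the code -1 for "not measured", the helper names, and the per-row terminal-LN flags (inferred from the key-finding sentence, rows 1-3 vs rows 4-5) are assumptions and not taken from the render script.

```python
# Cell codes: 0 and 1 follow the commit message; -1 for "not measured"
# is a hypothetical choice for this sketch.
PASS, FIRES, NOT_MEASURED = 0, 1, -1
SYMBOL = {PASS: "\u2713", FIRES: "WB", NOT_MEASURED: "\u2014"}

# (row label, assumed terminal-LN flag, codes for diagnostics (a)..(d))
VERDICTS = [
    ("ResMLP-d256 LN",    True,  [1, 1, 0, 1]),
    ("ResMLP-d512 LN",    True,  [1, 1, 0, 1]),
    ("ViT-Mini",          True,  [1, 1, -1, 1]),
    ("ResMLP-d256 no-LN", False, [1, 0, -1, 1]),
    ("CNN BN",            False, [1, 0, -1, -1]),
]


def format_matrix(rows):
    """Render the verdict matrix as aligned plain text."""
    lines = [f"{'':<18}(a) (b) (c) (d)"]
    for label, _, codes in rows:
        cells = " ".join(f"{SYMBOL[c]:<3}" for c in codes)
        lines.append(f"{label:<18}{cells}".rstrip())
    return "\n".join(lines)


def b_tracks_terminal_ln(rows):
    """True if diagnostic (b) fires on exactly the rows flagged as terminal-LN."""
    return all((codes[1] == FIRES) == has_ln for _, has_ln, codes in rows)


print(format_matrix(VERDICTS))
print("(b) fires only on terminal-LN rows:", b_tracks_terminal_ln(VERDICTS))
```

Keeping "not measured" as its own code, rather than folding it into 0, is what prevents an unmeasured cell from rendering as a pass.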

Page impact: total page count 19 → 18 (the more compact figure
reclaimed an entire page). §1-§7 main content still fits on 9 pages.

Updated docstring with full data sources for each row.

Co-Authored-By: Claude Opus 4.6 (1M context)

paper v2.31.9: relabel "StudentNet" → "no-terminal-LN ResMLP"

2026-04-08T23:32:23+00:00

The §3 ¶3 / §5 ¶3 / Figure 5 / §7 mentions of "StudentNet" as a
cross-architecture validation case were a misleading rebrand of the
no-terminal-LN ResMLP-d256 ablation. Verified by tracing the data:

  results/protocol_audit/temporal_evolution_s{42,123,456}.json
    final_acc 0.332/0.313/0.336 (matches no-outln 3-seed 0.327±0.012)
    first_fire_a {18, 14, 25}
    first_fire_b None / None / None
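
The match between the per-seed values and the 3-seed summary can be reproduced with a short sketch. It assumes the ± figure is a sample standard deviation (n - 1 denominator); the variable names are illustrative.

```python
from statistics import mean, stdev

# seed -> final_acc, copied from the listing above
final_acc = {42: 0.332, 123: 0.313, 456: 0.336}

values = list(final_acc.values())
acc_mean = mean(values)
acc_std = stdev(values)  # sample std (n - 1 denominator), assumed convention

print(f"{acc_mean:.3f} ± {acc_std:.3f}")  # prints 0.327 ± 0.012
```

With the population formula (`statistics.pstdev`) the spread rounds to 0.010 instead, so the reported ±0.012 is consistent with the sample convention.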

The actual synth StudentNet (results/snapshot_synth_v1, d=128 alpha=1.0)
has max-per-block growth ~6.88 over 80 epochs and never reaches the
50× threshold, so diagnostic (a) does NOT fire on the real synth
StudentNet at all. Calling the no-outln data "StudentNet" double-counted
the same architecture under two names (the same-backbone
causal control AND the cross-arch generalization test).
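
A minimal sketch of how a first-fire epoch for diagnostic (a) could be derived from a growth trajectory. The function name, the `>=` comparison, the 0-based epoch index, and both toy trajectories are assumptions for illustration; they are not measured data and not the audit code.

```python
from typing import Optional, Sequence


def first_fire_epoch(growth: Sequence[float],
                     threshold: float = 50.0) -> Optional[int]:
    """Return the first epoch index whose growth factor reaches the threshold, else None."""
    for epoch, g in enumerate(growth):
        if g >= threshold:
            return epoch
    return None


# Toy trajectories, NOT measured data: one plateaus well below 50x
# (like the ~6.88 maximum quoted above), the other crosses the threshold.
plateau = [1.0, 2.5, 5.1, 6.88, 6.7]
runaway = [1.0, 8.0, 31.0, 64.0, 120.0]

print(first_fire_epoch(plateau), first_fire_epoch(runaway))
```

A trajectory that never reaches the threshold yields `None`, which is how the `first_fire_b None` entries in the listing read.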

Relabeled to "no-terminal-LN ResMLP" everywhere it appeared:
- §3 ¶3 paragraph 1 cross-arch list
- §3 ¶3 paragraph 2 (now with explicit per-seed first-fire epochs {18,14,25})
- §5 paragraph (the conclusion)
- §7 conclusion (cross-arch list)
- Figure 5 caption
- Figure 5 row label (with re-rendered PDF)

The remaining cross-arch generalization claim is now: ViT-Mini fires
both diagnostics, ResMLP at d=256/d=512 fires both, no-terminal-LN
ResMLP and BatchNorm CNN fire only (a) — three real architecture
classes, with the no-LN ablation being the same-backbone control rather
than a separate architecture. The cross-arch story is slightly weaker
("3 architecture classes" not "4") but truthful and self-consistent.

Co-Authored-By: Claude Opus 4.6 (1M context)

Figures 3 and 4: fix aspect ratio (fig3 was a squeezed strip) and key-finding label overlap (fig4)

2026-04-08T19:35:20+00:00

Per user feedback:
- fig4_penalty_rescue.pdf (Figure 3 in paper): was figsize=(13, 3.5), aspect 3.7:1,
  which rendered as a thin strip with squeezed subplot content. Increased height
  to figsize=(13, 6.0), aspect 2.2:1. The taller panels now show axis
  labels and legends readably.
- fig5_cross_arch_summary.pdf (Figure 4 in paper): the 'Key finding' italic text
  annotation at y=-1.0 in axes transform was overlapping with the multiline
  architecture y-tick labels at the bottom of the second subplot. Moved to
  y=-1.55 and increased figsize height from 3.5 to 4.2 so the lower annotation
  still fits in bbox_inches='tight' crop.
- Also bumped includegraphics width from 0.92\linewidth to \linewidth for both
  figures so they use the full text width.
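
The aspect ratios quoted for fig4_penalty_rescue.pdf follow directly from width / height. A small check, with `aspect` as an illustrative helper name:

```python
def aspect(figsize):
    """Width-to-height ratio of a matplotlib-style (width, height) figsize tuple."""
    width, height = figsize
    return width / height


before = aspect((13, 3.5))   # old fig4_penalty_rescue size
after = aspect((13, 6.0))    # new fig4_penalty_rescue size
print(f"{before:.1f}:1 -> {after:.1f}:1")
```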

Main content still exactly 9 pages within E&D budget.

Co-Authored-By: Claude Opus 4.6 (1M context)

Fill in tables 1-3 + generate figures 2/4/5 from existing data

2026-04-08T09:46:59+00:00

Tables filled with real values:
  Table 1: 5-method audit (3-seed mean ± std for acc, headline Γ, verdict)
  Table 2: 4-condition mode 2 validation (cos and ρ values from existing
           checkpoint measurements)
  Table 3: protocol thresholds (50×, 1e-7, 0.30, 2pp)

Figures generated from existing data:
  fig2_decision_utility.pdf: 5×7 verdict heatmap from
    results/protocol_audit/ablation_decision_utility.json
  fig4_penalty_rescue.pdf: 3-panel — trajectory + cos/ρ bars + 2×2 acc
    from snapshot_evolution_v2 + dfa_residual_penalty + bp_with_penalty
  fig5_cross_arch_summary.pdf: 5×4 BP/DFA verdict matrix across
    architectures

Compiles to 8 pages with all tables/figures rendered. §1-§7 main body
still has only paragraph topic sentences (TODO: per-section prose
filling via codex). Figure numbering is wrong (codex put figures in
section order, not numerical order; needs fixing).