Diffstat (limited to 'research/flossing/report_bundle_20260603/tables/meeting_figures_v2_report.md')
| -rw-r--r-- | research/flossing/report_bundle_20260603/tables/meeting_figures_v2_report.md | 26 |
1 files changed, 26 insertions, 0 deletions
diff --git a/research/flossing/report_bundle_20260603/tables/meeting_figures_v2_report.md b/research/flossing/report_bundle_20260603/tables/meeting_figures_v2_report.md
new file mode 100644
index 0000000..5158c99
--- /dev/null
+++ b/research/flossing/report_bundle_20260603/tables/meeting_figures_v2_report.md
@@ -0,0 +1,26 @@
+# Meeting Figures v2
+
+## Figure Strategy
+
+0. `fig0_motivation_lambda1_success_failure_hrm_trm.png`: first-exponent success/failure distribution in HRM and TRM. This motivates chaos as a detector before introducing the method.
+1. `fig1_hrm_trm_training_curves.png`: performance over training for HRM and TRM. This answers whether the method improves accuracy and where best/final are.
+2. `fig2_accuracy_vs_chaotic_volume_phase.png`: phase view, with accuracy versus mean top-8 Lyapunov exponent. This answers whether better checkpoints are dynamically more stable.
+3. `fig3_hrm_trm_success_failure_spectra.png`: full success/failure spectrum separation for HRM and TRM best checkpoints. This extends Fig0 beyond λ1.
+4. `fig4_ptrm_same_subset_comparison.png`: PTRM same-subset result. This is a secondary inference-time story.
+5. `fig5_qhead_vs_lambda1_ptrm.png`: PTRM Q-head halt logit versus finite-difference stability proxy `-lambda_1`. The bottom row isolates mixed problems where trajectory selection actually matters.
+
+## Key Numbers
+
+- HRM baseline best: 0.5265 exact. HRM multi4 best: 0.6443 exact. HRM multi4 final: 0.4624 exact.
+- TRM baseline best: 0.8686 exact. TRM multi4 best: 0.8965 exact. TRM multi4 final: 0.8351 exact.
+- HRM multi4 best dynamics sample: mean top-8 exponent -0.1122; final -0.0407.
+- TRM multi4 best dynamics sample: mean top-8 exponent +0.0066; final +0.0185.
+- PTRM same subset, K=100: Q-selected 0.984 -> 0.988; mean rollout 0.942 -> 0.954.
+- PTRM Q-vs-stability, K=25/N=512: mixed-problem Pearson is 0.786 for baseline and 0.791 for multi4. In both runs, Q-max selection and lambda-min selection reach the same oracle exact accuracy on this subset.
+
+## Caveats
+
+- Dynamics spectra use N=512 diagnostic samples, not the full test set.
+- PTRM numbers use a fixed N=1000 subset; do not mix its deterministic subset accuracy with full-test W&B exact accuracy.
+- Final checkpoints are collapse diagnostics, not the method's reported performance.
+- Q-head is not a pure lambda ranker: global Spearman is weak because most problems are all-success/all-failure across K rollouts. The strongest evidence is the mixed-problem class separation and selector equivalence.
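
The last caveat rests on two ideas: restricting attention to "mixed" problems (those whose K rollouts include both successes and failures) and comparing a Q-max selector against a lambda-min selector. The sketch below shows one way such a comparison could be computed. It is not the analysis code behind the report: the function names, the per-problem list layout, and the toy values are all assumptions made for illustration.

```python
# Hypothetical data layout (an assumption, not the report's format):
# one row per problem, one entry per rollout.
#   q[i][k]    : Q-head halt logit of rollout k on problem i
#   lam1[i][k] : first Lyapunov exponent estimate of rollout k on problem i
#   ok[i][k]   : True if rollout k solved problem i exactly

def is_mixed(ok_row):
    """A problem is 'mixed' when its rollouts contain both successes and failures."""
    return any(ok_row) and not all(ok_row)

def select_accuracy(scores, ok, pick):
    """Exact accuracy when one rollout per problem is chosen by `pick` (max or min)."""
    if not ok:
        return None  # no problems to score
    hits = 0
    for score_row, ok_row in zip(scores, ok):
        chosen = pick(range(len(score_row)), key=score_row.__getitem__)
        hits += ok_row[chosen]  # True counts as 1
    return hits / len(ok)

def compare_selectors(q, lam1, ok):
    """Score Q-max and lambda-min selection on all problems and on mixed ones only."""
    mixed = [i for i, row in enumerate(ok) if is_mixed(row)]
    q_m = [q[i] for i in mixed]
    lam_m = [lam1[i] for i in mixed]
    ok_m = [ok[i] for i in mixed]
    return {
        "n_mixed": len(mixed),
        "q_max_all": select_accuracy(q, ok, max),
        "lambda_min_all": select_accuracy(lam1, ok, min),
        "q_max_mixed": select_accuracy(q_m, ok_m, max),
        "lambda_min_mixed": select_accuracy(lam_m, ok_m, min),
    }

# Toy values, invented for illustration only (3 problems, K=3 rollouts each).
q = [[2.0, -1.0, 0.5], [1.0, 1.5, 0.2], [-0.5, -0.7, -0.9]]
lam1 = [[-0.3, 0.2, 0.1], [-0.1, -0.2, 0.05], [0.3, 0.4, 0.5]]
ok = [[True, False, False], [True, True, True], [False, False, False]]

result = compare_selectors(q, lam1, ok)
print(result)
```

In this toy input only the first problem is mixed. The other two contribute the same outcome whichever rollout is picked, which illustrates why a statistic computed over all problems can say little about selector quality.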
