Diffstat (limited to 'research/flossing/meeting_artifacts_v2')
-rw-r--r-- research/flossing/meeting_artifacts_v2/fig5_qhead_vs_lambda1_ptrm_summary.csv | 3
-rw-r--r-- research/flossing/meeting_artifacts_v2/hrm_trm_redesigned_summary.csv | 7
-rw-r--r-- research/flossing/meeting_artifacts_v2/meeting_figures_v2_report.md | 26
3 files changed, 36 insertions, 0 deletions
diff --git a/research/flossing/meeting_artifacts_v2/fig5_qhead_vs_lambda1_ptrm_summary.csv b/research/flossing/meeting_artifacts_v2/fig5_qhead_vs_lambda1_ptrm_summary.csv
new file mode 100644
index 0000000..bcbf401
--- /dev/null
+++ b/research/flossing/meeting_artifacts_v2/fig5_qhead_vs_lambda1_ptrm_summary.csv
@@ -0,0 +1,3 @@
+name,path,n_samples,rollouts,mean_rollout_exact,q_max_exact,lambda_min_exact,oracle_pass_exact,q_lambda_same_argmax_frac,global_pearson_q_vs_stability,global_spearman_q_vs_stability,within_problem_pearson_mean,within_problem_spearman_mean,q_success_mean,q_fail_mean,lambda_success_mean,lambda_fail_mean,mixed_problem_count,zero_success_problem_count,full_success_problem_count,mixed_global_pearson_q_vs_stability,mixed_q_max_exact,mixed_lambda_min_exact,mixed_oracle_exact
+TRM baseline + PTRM rollouts,research/flossing/q_lambda_scatter/base58590_k25_d64_sigma03_Lonly_fdlyap_n512_seed20260602.npz,512.0,25.0,0.93890625,0.97265625,0.97265625,0.97265625,0.04296875,0.7524171235289728,-0.025655130770112743,0.12497240516050691,0.09656841990607949,7.796669578964886,-10.659986413043478,5.632336168325687,6.921443493469901,58.0,14.0,440.0,0.7857503437605575,1.0,1.0,1.0
+TRM multi4 + PTRM rollouts,research/flossing/q_lambda_scatter/multi4_35805_k25_d64_sigma03_Lonly_fdlyap_n512_seed20260602.npz,512.0,25.0,0.944609375,0.974609375,0.974609375,0.974609375,0.0859375,0.7289674235313497,-0.13633997124282113,0.18866459504767938,0.11859038341758327,7.7404597169382185,-10.081937588152327,5.7459801223593345,7.271883726456269,43.0,13.0,456.0,0.7911647653323942,1.0,1.0,1.0
diff --git a/research/flossing/meeting_artifacts_v2/hrm_trm_redesigned_summary.csv b/research/flossing/meeting_artifacts_v2/hrm_trm_redesigned_summary.csv
new file mode 100644
index 0000000..c5a4ad0
--- /dev/null
+++ b/research/flossing/meeting_artifacts_v2/hrm_trm_redesigned_summary.csv
@@ -0,0 +1,7 @@
+model,label,step,full_exact,sample_exact,lambda1_all,mean8_all,pos_count_all,lambda1_success,lambda1_fail,mean8_success,mean8_fail,pos_count_success,pos_count_fail
+HRM,baseline best,26040,0.5265287756919861,0.5,-0.0569394779099639,-0.10732079181130239,0.98046875,-0.14642834789538028,0.03254939207545249,-0.18957092847483636,-0.025070655147768406,0.01953125,1.94140625
+HRM,multi4 best,23436,0.6443189,0.654296875,-0.04733450568929953,-0.112159715874796,1.166015625,-0.10103540500250659,0.054302789621007604,-0.16998952501653278,-0.0027078172167066595,0.05970149253731343,3.2598870056497176
+HRM,multi4 final,26040,0.46235448,0.4296875,0.02874533511322852,-0.040662085491063316,1.626953125,0.03573289099003887,0.02348073821974127,-0.053027883530268646,-0.031345388338237384,1.8181818181818181,1.4828767123287672
+TRM,baseline best,58590,0.8686309456825256,0.875,0.02823458132615997,0.013457294571722192,7.841796875,0.01761167685357837,0.10259491263423115,0.008003413600119422,0.05163446137294159,7.819196428571429,8.0
+TRM,multi4 best,35805,0.8964653611183167,0.900390625,0.020381716455975862,0.0065844104191477015,3.841796875,0.01118387160416574,0.10352301992037717,0.0019524994650864183,0.04845325257252518,3.3817787418655096,8.0
+TRM,multi4 final,65100,0.8350536823272705,0.82421875,0.03232463403946895,0.018508151198432188,8.0,0.01912124014472792,0.09423388096814354,0.012883653437293365,0.04488079625621645,8.0,8.0
diff --git a/research/flossing/meeting_artifacts_v2/meeting_figures_v2_report.md b/research/flossing/meeting_artifacts_v2/meeting_figures_v2_report.md
new file mode 100644
index 0000000..5158c99
--- /dev/null
+++ b/research/flossing/meeting_artifacts_v2/meeting_figures_v2_report.md
@@ -0,0 +1,26 @@
+# Meeting Figures v2
+
+## Figure Strategy
+
+0. `fig0_motivation_lambda1_success_failure_hrm_trm.png`: distribution of the first Lyapunov exponent (λ1) for successful versus failed samples in HRM and TRM. This motivates using chaos as a failure detector before introducing the method.
+1. `fig1_hrm_trm_training_curves.png`: performance over training for HRM and TRM. This answers whether the method improves accuracy and where best/final are.
+2. `fig2_accuracy_vs_chaotic_volume_phase.png`: phase view, with accuracy versus mean top-8 Lyapunov exponent. This answers whether better checkpoints are dynamically more stable.
+3. `fig3_hrm_trm_success_failure_spectra.png`: full success/failure spectrum separation for HRM and TRM best checkpoints. This extends Fig0 beyond λ1.
+4. `fig4_ptrm_same_subset_comparison.png`: PTRM same-subset result. This is a secondary inference-time story.
+5. `fig5_qhead_vs_lambda1_ptrm.png`: PTRM Q-head halt logit versus finite-difference stability proxy `-lambda_1`. The bottom row isolates mixed problems where trajectory selection actually matters.
+
+## Key Numbers
+
+- HRM baseline best: 0.5265 exact. HRM multi4 best: 0.6443 exact. HRM multi4 final: 0.4624 exact.
+- TRM baseline best: 0.8686 exact. TRM multi4 best: 0.8965 exact. TRM multi4 final: 0.8351 exact.
+- HRM multi4 best dynamics sample: mean top-8 exponent -0.1122; final -0.0407.
+- TRM multi4 best dynamics sample: mean top-8 exponent +0.0066; final +0.0185.
+- PTRM same subset, K=100: Q-selected 0.984 -> 0.988; mean rollout 0.942 -> 0.954.
+- PTRM Q-vs-stability, K=25/N=512: mixed-problem Pearson is 0.786 for baseline and 0.791 for multi4. In both runs, Q-max selection and lambda-min selection both match the oracle exact accuracy on this subset (0.9727 for baseline, 0.9746 for multi4).
+
+## Caveats
+
+- Dynamics spectra use N=512 diagnostic samples, not the full test set.
+- PTRM numbers use a fixed N=1000 subset; do not mix its deterministic subset accuracy with full-test W&B exact accuracy.
+- Final checkpoints are collapse diagnostics, not the method's reported performance.
+- Q-head is not a pure lambda ranker: global Spearman is weak because most problems are all-success/all-failure across K rollouts. The strongest evidence is the mixed-problem class separation and selector equivalence.
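
Reviewer note: the selector columns in `fig5_qhead_vs_lambda1_ptrm_summary.csv` can be sanity-checked with a short sketch like the one below. The column definitions here are inferred from the column names and are assumptions, not the script that produced the CSV. They are at least consistent with the reported counts: for the baseline row, (58 mixed + 440 full-success) / 512 = 0.97265625, which equals `oracle_pass_exact`. The function name `selector_summary` and the array arguments `success`, `q`, and `lam` are hypothetical; the layout of the `.npz` files is not shown in this diff.

```python
import numpy as np


def selector_summary(success, q, lam):
    """Summarise rollout selection for N problems with K rollouts each.

    success: (N, K) booleans, True where a rollout is exactly correct.
    q:       (N, K) Q-head halt logits (higher is assumed better).
    lam:     (N, K) lambda_1 estimates (lower is assumed more stable).
    All three argument layouts are assumptions made for illustration.
    """
    success = np.asarray(success, dtype=bool)
    q = np.asarray(q, dtype=float)
    lam = np.asarray(lam, dtype=float)
    n, k = success.shape
    rows = np.arange(n)

    # Partition problems by how many of their K rollouts succeed.
    n_success = success.sum(axis=1)
    zero = n_success == 0
    full = n_success == k
    mixed = ~zero & ~full

    # One rollout per problem: highest Q, or lowest lambda_1.
    q_pick = q.argmax(axis=1)
    lam_pick = lam.argmin(axis=1)

    return {
        "mean_rollout_exact": float(success.mean()),
        "q_max_exact": float(success[rows, q_pick].mean()),
        "lambda_min_exact": float(success[rows, lam_pick].mean()),
        # The oracle passes a problem if any rollout succeeds.
        "oracle_pass_exact": float((~zero).mean()),
        "q_lambda_same_argmax_frac": float((q_pick == lam_pick).mean()),
        "mixed_problem_count": int(mixed.sum()),
        "zero_success_problem_count": int(zero.sum()),
        "full_success_problem_count": int(full.sum()),
    }


# Toy data (not from the experiments): 3 problems, 3 rollouts each.
toy_success = [[1, 0, 1], [0, 0, 0], [1, 1, 1]]
toy_q = [[2.0, -1.0, 0.5], [0.1, 0.2, 0.3], [1.0, 2.0, 3.0]]
toy_lam = [[0.2, 0.9, 0.1], [0.5, 0.4, 0.6], [0.3, 0.2, 0.1]]
toy_summary = selector_summary(toy_success, toy_q, toy_lam)
```

Under these definitions a selector can only differ from the oracle on mixed problems, which is why the report treats the mixed-problem rows as the informative ones. Two selectors can also reach the same accuracy while choosing different rollouts, as the low `q_lambda_same_argmax_frac` values in the CSV suggest.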