Age        Commit message        Author
13 hours    Raise entropy floor to 0.02, increase eval games to 2000    haoyuren

    Prevents premature convergence via a higher entropy minimum and reduces eval variance by running 4x more evaluation games.

    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
14 hours    Change default eval_every from 10000 to 2500    haoyuren

    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
14 hours    Use auto-calibrated collect_batch in Colab notebook    haoyuren

    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
14 hours    Add training curve plots to Colab notebook    haoyuren

    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
14 hours    Add entropy annealing to escape greedy local minimum after warmup    haoyuren

    After behavioral cloning warmup, the policy is strongly peaked on greedy actions. Start with a higher entropy coefficient (default: 5x ent_coef) and linearly decay it to the target, encouraging exploration of non-greedy strategies early in training.

    New arg: --ent_start (default: 5x --ent_coef)

    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
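The linear decay this commit describes can be sketched as a small schedule helper. This is an illustration only: the function name, the step-based interface, and the default values are assumptions, not the training script's actual API.

```python
def entropy_coef(step: int, anneal_steps: int,
                 ent_start: float = 0.05, ent_coef: float = 0.01) -> float:
    """Linearly decay the entropy coefficient from ent_start down to
    ent_coef over anneal_steps, then hold at ent_coef.

    Hypothetical helper: defaults mirror the commit's "ent_start = 5x
    ent_coef" convention but are otherwise illustrative.
    """
    frac = min(step / max(anneal_steps, 1), 1.0)  # progress in [0, 1]
    return ent_start + frac * (ent_coef - ent_start)
```

At each PPO update the current coefficient would multiply the entropy bonus in the loss, so early updates reward exploration more heavily than late ones.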
14 hours    Auto-calibrate collect_batch when not specified    haoyuren

    Benchmarks batch sizes [64, 128, 256, 512] and picks the smallest within 10% of peak throughput. Smaller batches mean more frequent PPO updates and thus better training quality at similar speed.

    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
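The selection rule above (smallest batch size within 10% of peak throughput) might look like the sketch below. The benchmarking pass itself, which would time real game collection for each candidate size, is elided; the function name and dict-based interface are assumptions.

```python
def pick_collect_batch(throughput: dict, tolerance: float = 0.10) -> int:
    """Pick the smallest batch size whose measured throughput is within
    `tolerance` of the best batch size's throughput.

    `throughput` maps batch size -> games (or steps) per second, as
    measured by a benchmarking pass over e.g. [64, 128, 256, 512].
    Hypothetical helper, not the repo's actual calibration code.
    """
    peak = max(throughput.values())
    near_peak = [b for b, t in throughput.items()
                 if t >= (1.0 - tolerance) * peak]  # within 10% of peak
    return min(near_peak)
```

With throughput roughly flat across sizes, this biases toward small batches, which is exactly the "more frequent PPO updates at similar speed" trade the commit describes.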
14 hours    Fix total_mem → total_memory in Colab GPU check    haoyuren

    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
14 hours    Fix invalid notebook cell schema (markdown with execution_count)    haoyuren

    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
14 hours    Batched game collection for ~7x training speedup    haoyuren

    - collect_games_batch(): run N games in parallel with a single batched forward pass per step
    - evaluate_vs_greedy_batch(): batched evaluation replacing sequential eval
    - Add --collect_batch CLI arg for a configurable parallel game count
    - Use torch.inference_mode() for faster collection
    - Update Colab notebook: GPU info, --collect_batch, log download cell

    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
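The batched-forward pattern behind collect_games_batch() — stepping N games in lockstep so each environment step costs one policy call instead of N — can be sketched roughly as follows. The TinyPolicy network and the collect loop's interface are stand-ins, not the repo's actual classes; the real collector would also apply actions to each live game and record transitions where the comment indicates.

```python
import torch
import torch.nn as nn

class TinyPolicy(nn.Module):
    """Stand-in policy network; the real model's architecture differs."""
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(),
                                 nn.Linear(32, n_actions))
    def forward(self, obs):
        return self.net(obs)

@torch.inference_mode()  # skip autograd bookkeeping during collection
def collect_batched(policy, obs_batch, done, n_steps: int):
    """Advance N games in lockstep: one batched forward pass per step.

    obs_batch: (N, obs_dim) observations, one row per parallel game.
    done:      (N,) bool mask of finished games (marked with action -1).
    Returns the actions sampled at each step (illustrative only).
    """
    actions_per_step = []
    for _ in range(n_steps):
        logits = policy(obs_batch)                     # single batched forward
        dist = torch.distributions.Categorical(logits=logits)
        actions = dist.sample()                        # (N,) sampled actions
        actions[done] = -1                             # finished games sit out
        actions_per_step.append(actions)
        # ... real code would step each live game here, updating
        # obs_batch and done, and storing (obs, action, reward) ...
    return actions_per_step
```

The speedup comes from amortizing the per-call overhead of the network: one forward over an (N, obs_dim) batch is far cheaper than N forwards over (1, obs_dim).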
14 hours    Update README and Colab notebook for current rules and features    haoyuren

    - README: document current game rules (SWAP inheritance, free draw, Q removal)
    - README: add versus.py usage and training features (warmup, CSV log, CPU/GPU)
    - Colab: update training commands, add log display, fix eval device

    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
15 hours    Separate CPU collect / GPU train, add training CSV log    haoyuren

    - Game collection always runs on CPU, PPO updates run on GPU (avoids per-step transfer overhead)
    - Log avg_len, loss, and vs_greedy win rate to CSV every 10k episodes
    - Add --eval_every flag for periodic evaluation

    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
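The collect-on-CPU / train-on-GPU split keeps per-step tensors on the CPU and only ships a full rollout batch across the device boundary once per update. A minimal sketch, with a hypothetical helper name and a CPU fallback when no GPU is present:

```python
import torch
import torch.nn as nn

def ppo_update_device_split(model: nn.Module, rollouts_cpu: list):
    """Move the model and one stacked batch to the training device,
    run the update there, then return the model to CPU for collection.

    Hypothetical helper: the real script's loop differs, but the idea
    is the same — per-step tensors never leave the CPU; only the full
    rollout batch crosses devices, once per PPO update.
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    batch = torch.stack(rollouts_cpu).to(device)  # one bulk transfer
    model.to(device)
    # ... PPO loss + optimizer.step() would run here on `device` ...
    model.to("cpu")                               # back to CPU for collection
    return batch
```

This avoids the pathological pattern of a tiny host-to-device copy on every environment step, which tends to be dominated by transfer latency rather than compute.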
15 hours    Fix SWAP inheritance, stalemate logic, add greedy warmup    haoyuren

    - SWAP now inherits the previous card's suit/rank for matching
    - Observation encodes the effective top card when SWAP is on top
    - Fix stalemate: only hard passes (can't draw) count; a draw-then-pass resets the counter
    - Add behavioral cloning warmup: pre-train on the greedy policy before PPO
    - 2p win rate vs greedy random: 60.5%

    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
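The behavioral-cloning warmup mentioned above — supervised pre-training of the policy on a greedy teacher's action choices before PPO takes over — reduces to a cross-entropy loop. The sketch below uses a hypothetical one-step interface; the real warmup would loop over observations collected from greedy play.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def bc_warmup_step(policy: nn.Module, obs: torch.Tensor,
                   greedy_actions: torch.Tensor,
                   optimizer: torch.optim.Optimizer) -> float:
    """One behavioral-cloning step: cross-entropy between the policy's
    logits and the greedy teacher's chosen action per observation.

    Hypothetical interface, not the repo's actual warmup code — but the
    loss is the standard BC objective.
    """
    logits = policy(obs)                        # (B, n_actions)
    loss = F.cross_entropy(logits, greedy_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

After enough of these steps the policy imitates the greedy baseline, giving PPO a sensible starting point instead of a uniform-random one.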
23 hours    Improve versus UI: suit colors, AI highlighting, draw tell    haoyuren

    - Color-code suits: ♠ blue, ♥ magenta, ♦ yellow, ♣ cyan
    - Highlight AI actions in red
    - Show whether the AI has playable cards after drawing (an observable tell)
    - Fix pass prompt: show a context-specific reason (无法出牌 "unable to play" / 不出牌 "choose not to play" / 牌堆已空 "deck is empty")

    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
23 hours    Update rules: free draw/pass, remove Q in 2-player games    haoyuren

    - Players can freely choose to draw even with playable cards
    - After drawing, players may pass instead of playing
    - Remove Q cards from the deck in 2-player games (reverse has no effect)
    - Use a greedy random opponent in evaluation

    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
24 hours    Add tqdm progress bar, fix Colab username    haoyuren

    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
24 hours    Add Colab GPU training notebook    haoyuren

    Clone → train on GPU → download or push the model back to GitHub.

    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
24 hours    Initial commit: Blazing Eights RL agent    haoyuren

    - Game environment with draw-then-decide rule (no auto-play on draw)
    - PPO self-play training script
    - Interactive human vs AI game (versus.py)
    - Real-time play assistant (play.py)

    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>