path: root/train.py
Age | Commit message | Author
11 hours | Raise entropy floor to 0.02, increase eval games to 2000 | haoyuren
    Prevents premature convergence with a higher entropy minimum, and reduces eval variance with 4x more evaluation games.
    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
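A minimal sketch of the entropy-floor idea this commit describes (the helper name is hypothetical; the actual implementation inside train.py is not shown in this log):

```python
def clamp_entropy_coef(coef, floor=0.02):
    """Keep the entropy bonus from decaying below the floor, so the
    policy retains some exploration pressure late in training."""
    return max(coef, floor)
```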
12 hours | Change default eval_every from 10000 to 2500 | haoyuren
    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
12 hours | Add entropy annealing to escape greedy local minimum after warmup | haoyuren
    After behavioral cloning warmup, the policy is sharply peaked on greedy actions. Start with a higher entropy coefficient (default: 5x ent_coef) and linearly decay it to the target, encouraging exploration of non-greedy strategies early in training.
    New arg: --ent_start (default: 5x --ent_coef)
    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
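The linear decay described above can be sketched as follows (function and parameter names are assumptions; only the `--ent_start` flag and the 5x default are stated by the commit):

```python
def annealed_ent_coef(step, anneal_steps, ent_coef, ent_start=None):
    """Linearly decay the entropy coefficient from ent_start down to
    ent_coef over anneal_steps, then hold it at ent_coef."""
    if ent_start is None:
        ent_start = 5 * ent_coef  # commit's default: 5x --ent_coef
    frac = min(step / anneal_steps, 1.0)
    return ent_start + frac * (ent_coef - ent_start)
```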
12 hours | Auto-calibrate collect_batch when not specified | haoyuren
    Benchmarks batch sizes [64, 128, 256, 512] and picks the smallest one within 10% of peak throughput. Smaller batches mean more frequent PPO updates and better training quality at similar speed.
    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
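The selection rule ("smallest within 10% of peak") can be sketched like this, given benchmark results; the function name and the `{batch_size: games_per_second}` shape are assumptions, not the actual train.py API:

```python
def pick_collect_batch(throughputs, tolerance=0.10):
    """Given measured {batch_size: games_per_second}, return the smallest
    batch size whose throughput is within `tolerance` of the peak."""
    peak = max(throughputs.values())
    for bs in sorted(throughputs):
        if throughputs[bs] >= (1.0 - tolerance) * peak:
            return bs
```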
12 hours | Batched game collection for ~7x training speedup | haoyuren
    - collect_games_batch(): run N games in parallel with single batched forward pass per step
    - evaluate_vs_greedy_batch(): batched evaluation replacing sequential eval
    - Add --collect_batch CLI arg for configurable parallel game count
    - Use torch.inference_mode() for faster collection
    - Update Colab notebook: GPU info, --collect_batch, log download cell
    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
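A framework-agnostic sketch of the lockstep batching idea behind collect_games_batch(): one batched policy call serves every still-active game per step. In the real script that call would be a torch forward pass under torch.inference_mode(); here `policy_fn` stands in for it, and all names are hypothetical:

```python
def collect_games_batch(states, policy_fn, step_fn, max_steps=1000):
    """Advance all games in lockstep instead of one game at a time.
    policy_fn(list_of_states) -> list_of_actions (the batched forward pass);
    step_fn(state, action) -> (next_state, done)."""
    active = list(range(len(states)))
    for _ in range(max_steps):
        if not active:
            break
        # One batched policy call for every game that is still running.
        actions = policy_fn([states[i] for i in active])
        next_active = []
        for i, a in zip(active, actions):
            states[i], done = step_fn(states[i], a)
            if not done:
                next_active.append(i)
        active = next_active
    return states
```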
13 hours | Separate CPU collect / GPU train, add training CSV log | haoyuren
    - Game collection always on CPU, PPO update on GPU (avoids per-step transfer overhead)
    - Log avg_len, loss, vs_greedy win rate to CSV every 10k episodes
    - Add --eval_every flag for periodic evaluation
    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
13 hours | Fix SWAP inheritance, stalemate logic, add greedy warmup | haoyuren
    - SWAP now inherits previous card's suit/rank for matching
    - Observation encodes effective top card when SWAP is on top
    - Fix stalemate: only hard passes (can't draw) count, draw+pass resets
    - Add behavioral cloning warmup: pre-train on greedy policy before PPO
    - 2p win rate vs greedy random: 60.5%
    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
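The SWAP-inheritance fix amounts to resolving an "effective top card" for matching. A sketch under an assumed card representation ((suit, rank) tuples with a "SWAP" rank); the actual encoding in train.py may differ:

```python
def effective_top(pile):
    """Return the (suit, rank) the next play must match. SWAP cards
    inherit the previous card's identity, so skip any run of SWAPs on
    top of the pile and match against the card beneath them."""
    for suit, rank in reversed(pile):
        if rank != "SWAP":
            return suit, rank
    return None  # pile contains nothing but SWAPs
```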
21 hours | Update rules: free draw/pass, remove Q in 2-player games | haoyuren
    - Players can freely choose to draw even with playable cards
    - After drawing, players may pass instead of playing
    - Remove Q cards from deck in 2-player games (reverse has no effect)
    - Use greedy random opponent in evaluation
    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
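The draw/pass rules above imply a legal-action set like the following sketch. All names are hypothetical, and the matching rule (same suit, same rank, or a wild 8, in the style of Crazy Eights) is an assumption about Blazing Eights not stated in this log:

```python
def legal_actions(hand, top, drew_this_turn, deck_nonempty):
    """Actions under the updated rules: any matching card is playable,
    drawing is always an option while the deck allows it, and passing
    is legal only after having drawn this turn."""
    def matches(card):
        suit, rank = card
        # Assumption: Crazy-Eights-style matching with wild 8s.
        return rank == 8 or suit == top[0] or rank == top[1]
    acts = [i for i, card in enumerate(hand) if matches(card)]
    if deck_nonempty and not drew_this_turn:
        acts.append("draw")  # allowed even with playable cards in hand
    if drew_this_turn:
        acts.append("pass")  # may decline to play the drawn card
    return acts
```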
22 hours | Add tqdm progress bar, fix Colab username | haoyuren
    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
23 hours | Initial commit: Blazing Eights RL agent | haoyuren
    - Game environment with draw-then-decide rule (no auto-play on draw)
    - PPO self-play training script
    - Interactive human vs AI game (versus.py)
    - Real-time play assistant (play.py)
    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>