summaryrefslogtreecommitdiff
path: root/train.py
AgeCommit message (Collapse)Author
12 hoursSeparate CPU collect / GPU train, add training CSV loghaoyuren
- Game collection always on CPU, PPO update on GPU (avoids per-step transfer overhead) - Log avg_len, loss, vs_greedy win rate to CSV every 10k episodes - Add --eval_every flag for periodic evaluation Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
13 hoursFix SWAP inheritance, stalemate logic, add greedy warmuphaoyuren
- SWAP now inherits previous card's suit/rank for matching - Observation encodes effective top card when SWAP is on top - Fix stalemate: only hard passes (can't draw) count, draw+pass resets - Add behavioral cloning warmup: pre-train on greedy policy before PPO - 2p win rate vs greedy random: 60.5% Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
21 hoursUpdate rules: free draw/pass, remove Q in 2-player gameshaoyuren
- Players can freely choose to draw even with playable cards - After drawing, players may pass instead of playing - Remove Q cards from deck in 2-player games (reverse has no effect) - Use greedy random opponent in evaluation Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
22 hoursAdd tqdm progress bar, fix Colab usernamehaoyuren
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
22 hoursInitial commit: Blazing Eights RL agenthaoyuren
- Game environment with draw-then-decide rule (no auto-play on draw) - PPO self-play training script - Interactive human vs AI game (versus.py) - Real-time play assistant (play.py) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>