Age | Commit message | Author
14 hours | Add training curve plots to Colab notebook | haoyuren
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
14 hours | Add entropy annealing to escape greedy local minimum after warmup | haoyuren
After the behavioral cloning warmup, the policy is sharply peaked on greedy actions. Start with a higher entropy coefficient and linearly decay it to the target, encouraging exploration of non-greedy strategies early in training.

New arg: --ent_start (default: 5x --ent_coef)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
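The linear decay this commit describes can be sketched as a small schedule function; the name and signature here are illustrative, not the repo's actual API:

```python
def entropy_coef(step, total_steps, ent_start, ent_target):
    """Linearly anneal the entropy coefficient from ent_start down to ent_target.

    step is the current training step; past total_steps the coefficient
    stays clamped at ent_target.
    """
    frac = min(step / total_steps, 1.0)
    return ent_start + frac * (ent_target - ent_start)
```

With the commit's defaults, ent_start would be 5x the final --ent_coef, so early PPO updates weigh the entropy bonus heavily and the pressure relaxes as training proceeds.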
14 hours | Auto-calibrate collect_batch when not specified | haoyuren
Benchmarks batch sizes [64, 128, 256, 512] and picks the smallest one within 10% of peak throughput. Smaller batches mean more frequent PPO updates, which means better training quality at similar speed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
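The selection rule ("smallest within 10% of peak") can be sketched as below; bench_fn is an assumed caller-supplied callable that measures throughput for a given batch size, since the log does not show the actual benchmarking code:

```python
def calibrate_collect_batch(bench_fn, candidates=(64, 128, 256, 512), tolerance=0.10):
    """Pick the smallest batch size whose throughput is within `tolerance` of the peak.

    bench_fn(batch_size) must return a measured throughput (e.g. games/sec).
    """
    throughput = {b: bench_fn(b) for b in candidates}
    peak = max(throughput.values())
    for b in sorted(candidates):  # try smallest batch sizes first
        if throughput[b] >= (1.0 - tolerance) * peak:
            return b
    return max(candidates)  # unreachable: the peak itself always qualifies
```

Preferring the smallest qualifying batch trades a few percent of collection throughput for more frequent PPO updates, matching the commit's rationale.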
14 hours | Fix total_mem → total_memory in Colab GPU check | haoyuren
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
14 hours | Fix invalid notebook cell schema (markdown with execution_count) | haoyuren
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
14 hours | Batched game collection for ~7x training speedup | haoyuren
- collect_games_batch(): run N games in parallel with a single batched forward pass per step
- evaluate_vs_greedy_batch(): batched evaluation replacing sequential eval
- Add --collect_batch CLI arg for a configurable parallel game count
- Use torch.inference_mode() for faster collection
- Update Colab notebook: GPU info, --collect_batch, log download cell

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
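The heart of the speedup is one batched forward pass serving all N games at each decision step, wrapped in torch.inference_mode(). A rough sketch of such a step, with the policy architecture, the legal-action masking, and all names being assumptions rather than the repo's actual code:

```python
import torch

@torch.inference_mode()  # skip autograd bookkeeping during rollout collection
def batched_step(policy, obs_batch, mask_batch):
    """One decision step for N parallel games via a single forward pass.

    policy: module mapping (N, obs_dim) observations to (N, n_actions) logits.
    mask_batch: bool tensor (N, n_actions), True where the action is legal.
    """
    logits = policy(obs_batch)
    logits = logits.masked_fill(~mask_batch, float("-inf"))  # forbid illegal moves
    dist = torch.distributions.Categorical(logits=logits)
    actions = dist.sample()
    return actions, dist.log_prob(actions)

# Toy usage: 8 parallel games, 16-dim observations, 5 candidate actions
policy = torch.nn.Linear(16, 5)
obs = torch.randn(8, 16)
mask = torch.ones(8, 5, dtype=torch.bool)
actions, logps = batched_step(policy, obs, mask)
```

One forward pass per step amortizes the fixed kernel-launch and Python overhead across all games, which is where a ~7x collection speedup plausibly comes from.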
15 hours | Update README and Colab notebook for current rules and features | haoyuren
- README: document current game rules (SWAP inheritance, free draw, Q removal)
- README: add versus.py usage and training features (warmup, CSV log, CPU/GPU)
- Colab: update training commands, add log display, fix eval device

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
15 hours | Separate CPU collect / GPU train, add training CSV log | haoyuren
- Game collection always runs on CPU; the PPO update runs on GPU (avoids per-step transfer overhead)
- Log avg_len, loss, and vs_greedy win rate to CSV every 10k episodes
- Add --eval_every flag for periodic evaluation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
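The collect-on-CPU / update-on-GPU split amounts to gathering a full rollout batch on the host and doing a single bulk transfer per iteration instead of a copy per environment step. A minimal sketch, with all function names being illustrative (it falls back to CPU when no GPU is present):

```python
import torch

def train_iteration(policy, collect_fn, update_fn, train_device):
    """Collect rollouts on CPU, then move the whole batch to the training
    device in one shot before the PPO update."""
    policy.to("cpu")
    batch = collect_fn(policy)  # dict of CPU tensors, built step by step
    batch = {k: v.to(train_device) for k, v in batch.items()}  # one bulk copy
    policy.to(train_device)
    return update_fn(policy, batch)

# Toy usage with stand-in collect/update functions
device = "cuda" if torch.cuda.is_available() else "cpu"
policy = torch.nn.Linear(4, 2)

def collect(p):
    obs = torch.randn(32, 4)
    with torch.no_grad():
        acts = p(obs).argmax(dim=1)
    return {"obs": obs, "acts": acts}

def update(p, batch):
    loss = torch.nn.functional.cross_entropy(p(batch["obs"]), batch["acts"])
    return float(loss)

loss = train_iteration(policy, collect, update, device)
```

Per-step host-to-device copies are small and latency-bound, so batching them into one transfer per iteration is usually the cheaper pattern.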
15 hours | Fix SWAP inheritance, stalemate logic, add greedy warmup | haoyuren
- SWAP now inherits the previous card's suit/rank for matching
- Observation encodes the effective top card when SWAP is on top
- Fix stalemate: only hard passes (can't draw) count; a draw-then-pass resets the counter
- Add behavioral cloning warmup: pre-train on the greedy policy before PPO
- 2p win rate vs greedy random: 60.5%

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
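The behavioral cloning warmup boils down to supervised cross-entropy on (observation, greedy action) pairs before PPO starts. A toy sketch, in which the function name, shapes, and the synthetic "greedy" labels are all assumptions:

```python
import torch

def bc_warmup(policy, greedy_dataset, epochs=1, lr=1e-3):
    """Behavioral cloning warmup: fit the policy to the greedy policy's
    actions with cross-entropy before handing over to PPO."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for obs, greedy_actions in greedy_dataset:
            loss = torch.nn.functional.cross_entropy(policy(obs), greedy_actions)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return policy

# Toy usage: labels are a deterministic function of the observation,
# standing in for a greedy policy's choices
torch.manual_seed(0)
policy = torch.nn.Linear(8, 3)
obs = torch.randn(64, 8)
greedy_acts = obs[:, :3].argmax(dim=1)
bc_warmup(policy, [(obs, greedy_acts)], epochs=50, lr=0.05)
```

Warm-starting from the greedy policy gives PPO a sensible prior; the entropy annealing added in the later commit then counteracts how peaked this leaves the policy.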
23 hours | Improve versus UI: suit colors, AI highlighting, draw tell | haoyuren
- Color-code suits: ♠ blue, ♥ magenta, ♦ yellow, ♣ cyan
- AI actions highlighted in red
- Show whether the AI has playable cards after drawing (an observable tell)
- Fix pass prompt: show a context-specific reason (无法出牌 "cannot play" / 不出牌 "choose not to play" / 牌堆已空 "deck is empty")

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
23 hours | Update rules: free draw/pass, remove Q in 2-player games | haoyuren
- Players can freely choose to draw even with playable cards
- After drawing, players may pass instead of playing
- Remove Q cards from the deck in 2-player games (reverse has no effect)
- Use a greedy random opponent in evaluation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
24 hours | Add tqdm progress bar, fix Colab username | haoyuren
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
25 hours | Add Colab GPU training notebook | haoyuren
Clone → train on GPU → download or push the model back to GitHub.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
25 hours | Initial commit: Blazing Eights RL agent | haoyuren
- Game environment with draw-then-decide rule (no auto-play on draw)
- PPO self-play training script
- Interactive human vs AI game (versus.py)
- Real-time play assistant (play.py)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>