path: root/train.py
Age | Commit message | Author
11 hours | Raise entropy floor to 0.02, increase eval games to 2000 | haoyuren
    Prevents premature convergence with a higher entropy minimum, and reduces eval variance with 4x more evaluation games.
    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
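A minimal sketch of the entropy-floor idea this commit describes (the helper name is hypothetical; the actual implementation inside train.py is not shown in this log):

```python
def clamp_entropy_coef(coef, floor=0.02):
    """Keep the entropy bonus from decaying below the floor, so the
    policy retains some exploration pressure late in training."""
    return max(coef, floor)
```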
12 hours | Change default eval_every from 10000 to 2500 | haoyuren
    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
12 hours | Add entropy annealing to escape greedy local minimum after warmup | haoyuren
    After behavioral cloning warmup, the policy is sharply peaked on greedy actions. Start with a higher entropy coefficient (default: 5x ent_coef) and linearly decay it to the target, encouraging exploration of non-greedy strategies early in training.
    New arg: --ent_start (default: 5x --ent_coef)
    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
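The linear decay described above can be sketched as follows (function and parameter names are assumptions; only the `--ent_start` flag and the 5x default are stated by the commit):

```python
def annealed_ent_coef(step, anneal_steps, ent_coef, ent_start=None):
    """Linearly decay the entropy coefficient from ent_start down to
    ent_coef over anneal_steps, then hold it at ent_coef."""
    if ent_start is None:
        ent_start = 5 * ent_coef  # commit's default: 5x --ent_coef
    frac = min(step / anneal_steps, 1.0)
    return ent_start + frac * (ent_coef - ent_start)
```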
12 hours | Auto-calibrate collect_batch when not specified | haoyuren
    Benchmarks batch sizes [64, 128, 256, 512] and picks the smallest one within 10% of peak throughput. Smaller batches mean more frequent PPO updates and better training quality at similar speed.
    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
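The selection rule ("smallest within 10% of peak") can be sketched like this, given benchmark results; the function name and the `{batch_size: games_per_second}` shape are assumptions, not the actual train.py API:

```python
def pick_collect_batch(throughputs, tolerance=0.10):
    """Given measured {batch_size: games_per_second}, return the smallest
    batch size whose throughput is within `tolerance` of the peak."""
    peak = max(throughputs.values())
    for bs in sorted(throughputs):
        if throughputs[bs] >= (1.0 - tolerance) * peak:
            return bs
```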
12 hours | Batched game collection for ~7x training speedup | haoyuren
    - collect_games_batch(): run N games in parallel with single batched forward pass per step
    - evaluate_vs_greedy_batch(): batched evaluation replacing sequential eval
    - Add --collect_batch CLI arg for configurable parallel game count
    - Use torch.inference_mode() for faster collection
    - Update Colab notebook: GPU info, --collect_batch, log download cell
    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
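A framework-agnostic sketch of the lockstep batching idea behind collect_games_batch(): one batched policy call serves every still-active game per step. In the real script that call would be a torch forward pass under torch.inference_mode(); here `policy_fn` stands in for it, and all names are hypothetical:

```python
def collect_games_batch(states, policy_fn, step_fn, max_steps=1000):
    """Advance all games in lockstep instead of one game at a time.
    policy_fn(list_of_states) -> list_of_actions (the batched forward pass);
    step_fn(state, action) -> (next_state, done)."""
    active = list(range(len(states)))
    for _ in range(max_steps):
        if not active:
            break
        # One batched policy call for every game that is still running.
        actions = policy_fn([states[i] for i in active])
        next_active = []
        for i, a in zip(active, actions):
            states[i], done = step_fn(states[i], a)
            if not done:
                next_active.append(i)
        active = next_active
    return states
```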
13 hours | Separate CPU collect / GPU train, add training CSV log | haoyuren
    - Game collection always on CPU, PPO update on GPU (avoids per-step transfer overhead)
    - Log avg_len, loss, vs_greedy win rate to CSV every 10k episodes
    - Add --eval_every flag for periodic evaluation
    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
13 hours | Fix SWAP inheritance, stalemate logic, add greedy warmup | haoyuren
    - SWAP now inherits previous card's suit/rank for matching
    - Observation encodes effective top card when SWAP is on top
    - Fix stalemate: only hard passes (can't draw) count, draw+pass resets
    - Add behavioral cloning warmup: pre-train on greedy policy before PPO
    - 2p win rate vs greedy random: 60.5%
    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
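The SWAP-inheritance fix amounts to resolving an "effective top card" for matching. A sketch under an assumed card representation ((suit, rank) tuples with a "SWAP" rank); the actual encoding in train.py may differ:

```python
def effective_top(pile):
    """Return the (suit, rank) the next play must match. SWAP cards
    inherit the previous card's identity, so skip any run of SWAPs on
    top of the pile and match against the card beneath them."""
    for suit, rank in reversed(pile):
        if rank != "SWAP":
            return suit, rank
    return None  # pile contains nothing but SWAPs
```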
21 hours | Update rules: free draw/pass, remove Q in 2-player games | haoyuren
    - Players can freely choose to draw even with playable cards
    - After drawing, players may pass instead of playing
    - Remove Q cards from deck in 2-player games (reverse has no effect)
    - Use greedy random opponent in evaluation
    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
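The draw/pass rules above imply a legal-action set like the following sketch. All names are hypothetical, and the matching rule (same suit, same rank, or a wild 8, in the style of Crazy Eights) is an assumption about Blazing Eights not stated in this log:

```python
def legal_actions(hand, top, drew_this_turn, deck_nonempty):
    """Actions under the updated rules: any matching card is playable,
    drawing is always an option while the deck allows it, and passing
    is legal only after having drawn this turn."""
    def matches(card):
        suit, rank = card
        # Assumption: Crazy-Eights-style matching with wild 8s.
        return rank == 8 or suit == top[0] or rank == top[1]
    acts = [i for i, card in enumerate(hand) if matches(card)]
    if deck_nonempty and not drew_this_turn:
        acts.append("draw")  # allowed even with playable cards in hand
    if drew_this_turn:
        acts.append("pass")  # may decline to play the drawn card
    return acts
```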
22 hours | Add tqdm progress bar, fix Colab username | haoyuren
    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
23 hours | Initial commit: Blazing Eights RL agent | haoyuren
    - Game environment with draw-then-decide rule (no auto-play on draw)
    - PPO self-play training script
    - Interactive human vs AI game (versus.py)
    - Real-time play assistant (play.py)
    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>