From 72cf72d704ca1a3bf4e2a5e04dcbbad99dc0f98e Mon Sep 17 00:00:00 2001
From: haoyuren <13851610112@163.com>
Date: Sun, 22 Feb 2026 01:48:03 -0600
Subject: Initial commit: Blazing Eights RL agent

- Game environment with draw-then-decide rule (no auto-play on draw)
- PPO self-play training script
- Interactive human vs AI game (versus.py)
- Real-time play assistant (play.py)

Co-Authored-By: Claude Opus 4.6
---
 README.md | 67 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)
 create mode 100644 README.md
 (limited to 'README.md')

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..711f793
--- /dev/null
+++ b/README.md
@@ -0,0 +1,67 @@

# Blazing Eights — RL Agent

Self-play PPO agent for the Blazing Eights card game.

## Setup

```bash
pip install torch numpy
```

## Files

- `blazing_env.py` — Game environment (2-5 players)
- `train.py` — PPO self-play training
- `play.py` — Real-time play assistant (input the game state, get the best move)

## Training

```bash
# Train a 2-player agent (~10 min on CPU for 100k episodes)
python train.py --num_players 2 --episodes 100000

# Train for 3 players (may need more episodes)
python train.py --num_players 3 --episodes 200000

# Custom hyperparameters
python train.py --num_players 4 --episodes 300000 --lr 1e-4 --ent_coef 0.02
```

Training saves a checkpoint every 10k episodes and a final model.

## Real-time Play Assistant

After training, use the assistant during a real game:

```bash
python play.py --model blazing_ppo_final.pt --num_players 3
```

It will prompt you for:

1. Your hand (e.g., `8h,Ks,3d,SWAP`)
2. Top discard card (e.g., `6d`)
3. Active suit if an 8 was played
4. Direction (cw/ccw)
5. Other players' hand sizes
6. Approximate deck size

It then shows ranked action recommendations with probabilities.
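The card notation in the prompts above (rank followed by a suit letter, plus `SWAP` for Swap cards) can be parsed in a few lines. This is a hypothetical sketch, not code from `play.py`: the token format shown in the examples is the only assumption, and `parse_card`/`parse_hand` are illustrative names.

```python
# Sketch of parsing the assistant's card notation, e.g. "8h", "Ks", "SWAP".
# Hypothetical helper names; play.py's internal encoding may differ.

RANKS = {"2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A"}
SUITS = {"h": "hearts", "d": "diamonds", "s": "spades", "c": "clubs"}

def parse_card(token: str):
    """Parse a token like '8h' or 'SWAP' into a (rank, suit) tuple."""
    token = token.strip()
    if token.upper() == "SWAP":
        return ("SWAP", None)  # Swap cards have no rank or suit
    rank, suit = token[:-1].upper(), token[-1].lower()
    if rank not in RANKS or suit not in SUITS:
        raise ValueError(f"unrecognized card: {token!r}")
    return (rank, SUITS[suit])

def parse_hand(text: str):
    """Parse a comma-separated hand string like '8h,Ks,3d,SWAP'."""
    return [parse_card(t) for t in text.split(",")]
```

Splitting the rank as `token[:-1]` keeps two-character ranks like `10` working without a special case.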
## Game Rules

- **56 cards**: standard 52 + 4 Swap cards
- **Match**: play a card matching the suit or rank of the top card
- **8**: wild — choose the suit the next player must follow
- **K**: all other players draw 1
- **Q**: reverse direction (no effect in 2-player)
- **J**: skip the next player
- **Swap**: swap your entire hand with the next player's (playable anytime, no match needed)
- **Can't play**: draw 1 card; if it is legal, you may play it
- **Win**: first player to empty their hand

## Tips for Better Training

1. **Train per player count** — the optimal policy differs significantly between 2 and 5 players.
2. **Increase episodes for more players** — larger games have more variance and need more samples.
3. **Opponent modeling** — after self-play, fine-tune against specific opponent behaviors by replacing some players with heuristic bots that mimic your friends' tendencies.
4. **Curriculum** — start training with 2 players, then use the trained model to initialize training for 3+ players.
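The matching rule above can be sketched as a small legality check: a card is playable if it matches the top card's rank or the required suit, while 8s (wild) and Swap cards are playable anytime, and an active suit chosen via an 8 overrides the top card's own suit. This is a minimal sketch assuming a `(rank, suit)` tuple representation; `is_legal` and `legal_moves` are illustrative names, and `blazing_env.py`'s actual encoding may differ.

```python
# Sketch of the legality rule from the Game Rules section.
# Cards are (rank, suit) tuples, e.g. ("K", "h"); Swap cards are ("SWAP", None).

def is_legal(card, top_card, active_suit=None):
    """Return True if `card` may be played on `top_card`.

    `active_suit` is the suit chosen when an 8 was played; when set,
    it replaces the top card's suit as the suit to match.
    """
    rank, suit = card
    if rank in ("8", "SWAP"):  # wild 8s and Swap cards need no match
        return True
    top_rank, top_suit = top_card
    required_suit = active_suit if active_suit is not None else top_suit
    return rank == top_rank or suit == required_suit

def legal_moves(hand, top_card, active_suit=None):
    """Indices of playable cards; an empty list means you must draw 1."""
    return [i for i, c in enumerate(hand) if is_legal(c, top_card, active_suit)]
```

In the environment, a check like this would define the action mask over the hand; `legal_moves` returning an empty list corresponds to the draw-then-decide branch mentioned in the commit message.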