author    haoyuren <13851610112@163.com>  2026-02-22 11:37:49 -0600
committer haoyuren <13851610112@163.com>  2026-02-22 11:37:49 -0600
commit    6f7034fabbfbe27197765f335bdcc64ec8c8c85f (patch)
tree      4b97f81b42ff1f93ef754a17f501eb0fdf6335a2
parent    dc8421e251f059e2136d5535bca2182af67fff75 (diff)
Update README and Colab notebook for current rules and features
- README: document current game rules (SWAP inheritance, free draw, Q removal)
- README: add versus.py usage, training features (warmup, CSV log, CPU/GPU)
- Colab: update training commands, add log display, fix eval device

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
-rw-r--r--  README.md          82
-rw-r--r--  train_colab.ipynb  52
2 files changed, 50 insertions(+), 84 deletions(-)
diff --git a/README.md b/README.md
index 711f793..f239c39 100644
--- a/README.md
+++ b/README.md
@@ -1,67 +1,77 @@
# Blazing Eights — RL Agent
-Self-play PPO agent for the Blazing Eights card game.
+Self-play PPO agent for the Blazing Eights card game (UNO variant with custom special cards).
## Setup
```bash
-pip install torch numpy
+pip install torch numpy tqdm
```
## Files
- `blazing_env.py` — Game environment (2-5 players)
-- `train.py` — PPO self-play training
+- `train.py` — PPO self-play training with greedy warmup
+- `versus.py` — Human vs AI interactive game
- `play.py` — Real-time play assistant (input game state, get best move)
+- `train_colab.ipynb` — Google Colab GPU training notebook
+
+## Game Rules
+
+**52 cards** (standard deck, Q removed in 2-player) + **4 Swap cards**
+
+| Card | Effect |
+|------|--------|
+| 8 | Wild — choose a suit for next player |
+| K | All other players draw 1 card |
+| Q | Reverse direction (removed in 2-player games) |
+| J | Skip next player |
+| Swap | Swap entire hand with next player (always playable; next card must match the card before the Swap) |
+
+- **Match** top card by suit or rank (8 and Swap are exceptions)
+- **Free draw**: you may draw even if you have playable cards
+- **After drawing**: play any legal card OR pass (one draw per turn max)
+- **Stalemate**: if all players pass without drawing, game ends (fewest cards wins)
+- **Win**: first to empty hand
+- **Initial hand**: 5 cards each
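+
+The matching rules above can be sketched in Python. This is a minimal illustration only; the `is_playable` helper and the string card encoding (`"Ks"`, `"8h"`, `"SWAP"`) are hypothetical and not the actual `blazing_env.py` API:
+
+```python
+# Sketch of the matching rule. Card encoding and function name are
+# illustrative, not the real blazing_env.py interface.
+def is_playable(card: str, top: str, active_suit: str) -> bool:
+    if card == "SWAP" or card[0] == "8":   # Swap always playable; 8 is wild
+        return True
+    rank, suit = card[:-1], card[-1]
+    if top[0] == "8":                      # an 8 names the active suit
+        return suit == active_suit
+    return rank == top[:-1] or suit == top[-1]
+
+print(is_playable("8h", "6d", "d"))   # wild -> True
+print(is_playable("Ks", "Kd", "d"))   # rank match -> True
+print(is_playable("3c", "6d", "d"))   # no match -> False
+```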
## Training
```bash
-# Train a 2-player agent (~10min on CPU for 100k episodes)
+# 2-player (~20min on CPU, 100k episodes)
python train.py --num_players 2 --episodes 100000
-# Train for 3 players (may need more episodes)
-python train.py --num_players 3 --episodes 200000
+# Skip greedy warmup
+python train.py --num_players 2 --episodes 100000 --greedy_warmup 0
# Custom hyperparams
python train.py --num_players 4 --episodes 300000 --lr 1e-4 --ent_coef 0.02
```
-Training saves checkpoints every 10k episodes and a final model.
-
-## Real-time Play Assistant
+Training features:
+- **Greedy warmup**: behavioral cloning on greedy play before PPO (default 2000 games)
+- **CPU/GPU split**: game simulation on CPU, PPO updates on GPU (avoids transfer overhead)
+- **CSV log**: `{save_path}_log.csv` with avg_len, loss, vs_greedy win rate every 10k episodes
+- Checkpoints every 10k episodes
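+
+The CSV log described above can be inspected with a few lines of stdlib Python. A minimal sketch, assuming the header columns match the README's names (`vs_greedy` etc.); the exact header written by `train.py` may differ:
+
+```python
+# Sketch: read the training log and report the latest vs_greedy win rate.
+# Column names are assumed from the README, not verified against train.py.
+import csv
+
+def last_winrate(log_path: str) -> float:
+    with open(log_path, newline="") as f:
+        rows = list(csv.DictReader(f))
+    return float(rows[-1]["vs_greedy"])
+```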
-After training, use the assistant during a real game:
+## Play vs AI
```bash
-python play.py --model blazing_ppo_final.pt --num_players 3
+python versus.py --model blazing_ppo_2p_final.pt
+python versus.py --model blazing_ppo_2p_final.pt --num_players 3
+python versus.py --model blazing_ppo_2p_final.pt --show_ai # show AI hand (debug)
```
-It will prompt you for:
-1. Your hand (e.g., `8h,Ks,3d,SWAP`)
-2. Top discard card (e.g., `6d`)
-3. Active suit if an 8 was played
-4. Direction (cw/ccw)
-5. Other players' hand sizes
-6. Approximate deck size
+Controls: number to play card, `d` to draw, `p` to pass, `q` to quit.
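+
+The control scheme above could be parsed along these lines (an illustrative sketch only, not the actual `versus.py` code; whether card numbers are 0- or 1-based is an assumption here):
+
+```python
+# Sketch of the versus.py control convention: a number plays that hand
+# card (assumed 0-based here), 'd' draws, 'p' passes, 'q' quits.
+def parse_command(raw: str, hand_size: int):
+    raw = raw.strip().lower()
+    if raw in ("d", "p", "q"):
+        return raw
+    if raw.isdigit() and int(raw) < hand_size:
+        return ("play", int(raw))
+    return None                            # unrecognized input
+```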
-Then shows ranked action recommendations with probabilities.
+## Play Assistant
-## Game Rules
+Input your game state and get ranked action recommendations:
+
+```bash
+python play.py --model blazing_ppo_2p_final.pt --num_players 2
+```
+
+## Colab GPU Training
-- **56 cards**: standard 52 + 4 Swap cards
-- **Match**: suit or rank of top card
-- **8**: Wild — choose a suit for next player
-- **K**: All other players draw 1
-- **Q**: Reverse direction (no effect in 2-player)
-- **J**: Skip next player
-- **Swap**: Swap your entire hand with next player (playable anytime, no match needed)
-- **Can't play**: Draw 1, play it if legal
-- **Win**: First to empty hand
-
-## Tips for Better Training
-
-1. **Train per player count** — the optimal policy differs significantly for 2 vs 5 players.
-2. **Increase episodes for more players** — larger games have more variance, need more samples.
-3. **Opponent modeling** — after self-play, you can fine-tune against specific opponent behaviors by replacing some players with heuristic bots that mimic your friends' tendencies.
-4. **Curriculum** — start training with 2 players, then use the trained model to initialize training for 3+ players.
+Open `train_colab.ipynb` in Google Colab for GPU-accelerated training. See notebook for setup instructions.
diff --git a/train_colab.ipynb b/train_colab.ipynb
index 24ae0dc..afae94d 100644
--- a/train_colab.ipynb
+++ b/train_colab.ipynb
@@ -15,11 +15,7 @@
{
"cell_type": "markdown",
"metadata": {},
- "source": [
- "# Blazing Eights - Colab GPU Training\n",
- "\n",
- "Clone repo → Train PPO agent on GPU → Push trained model back to GitHub"
- ]
+ "source": "# Blazing Eights - Colab GPU Training\n\nClone repo → Train PPO agent (CPU collect, GPU update) → Push trained model back to GitHub\n\n**Game**: UNO variant with custom special cards (8=Wild, K=All draw, J=Skip, Swap=Swap hands)."
},
{
"cell_type": "markdown",
@@ -58,30 +54,14 @@
{
"cell_type": "code",
"metadata": {},
- "source": [
- "# 2-player training: GPU makes the PPO update faster\n",
- "!python train.py --num_players 2 --episodes 200000 --save_path blazing_ppo_2p"
- ],
+ "source": "# 2-player training with greedy warmup + CSV logging\n# Game simulation on CPU, PPO updates on GPU automatically\n!python train.py --num_players 2 --episodes 200000 --save_path blazing_ppo_2p\n\n# Show training log\nimport pandas as pd\ndf = pd.read_csv(\"blazing_ppo_2p_log.csv\")\nprint(df.to_string(index=False))",
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {},
- "source": [
- "# (Optional) 3-player training\n",
- "# !python train.py --num_players 3 --episodes 300000 --save_path blazing_ppo_3p"
- ],
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "# (Optional) 4-player training\n",
- "# !python train.py --num_players 4 --episodes 400000 --lr 1e-4 --ent_coef 0.02 --save_path blazing_ppo_4p"
- ],
+ "source": "# (Optional) 3-player training\n# !python train.py --num_players 3 --episodes 300000 --save_path blazing_ppo_3p\n\n# (Optional) Skip greedy warmup\n# !python train.py --num_players 2 --episodes 200000 --greedy_warmup 0 --save_path blazing_ppo_2p_no_warmup",
"execution_count": null,
"outputs": []
},
@@ -163,31 +143,7 @@
{
"cell_type": "code",
"metadata": {},
- "source": [
- "import sys\n",
- "sys.path.insert(0, \".\")\n",
- "from train import PolicyValueNet, evaluate_vs_random\n",
- "\n",
- "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
- "model = PolicyValueNet().to(device)\n",
- "\n",
- "# Load the trained model\n",
- "import glob\n",
- "final_models = glob.glob(\"*_final.pt\") + glob.glob(\"models/*_final.pt\")\n",
- "if final_models:\n",
- " ckpt = torch.load(final_models[0], map_location=device, weights_only=True)\n",
- " model.load_state_dict(ckpt[\"model\"])\n",
- " model.eval()\n",
- " print(f\"Loaded: {final_models[0]}\")\n",
- " print(f\"Trained for {ckpt.get('episode', '?')} episodes\")\n",
- " print()\n",
- "\n",
- " for n in [2, 3, 4]:\n",
- " wr = evaluate_vs_random(model, num_players=n, num_games=2000, device=device)\n",
- " print(f\" {n} players: win rate = {wr:.1%} (random baseline: {1/n:.1%})\")\n",
- "else:\n",
- " print(\"No model found. Train first!\")"
- ],
+ "source": "import sys\nsys.path.insert(0, \".\")\nfrom train import PolicyValueNet, evaluate_vs_random\n\ndevice = \"cpu\" # eval on CPU (single-sample inference)\nmodel = PolicyValueNet().to(device)\n\nimport glob\nfinal_models = glob.glob(\"*_final.pt\") + glob.glob(\"models/*_final.pt\")\nif final_models:\n ckpt = torch.load(final_models[0], map_location=device, weights_only=True)\n model.load_state_dict(ckpt[\"model\"])\n model.eval()\n print(f\"Loaded: {final_models[0]}\")\n print(f\"Trained for {ckpt.get('episode', '?')} episodes\")\n print()\n\n for n in [2, 3, 4]:\n wr = evaluate_vs_random(model, num_players=n, num_games=2000, device=device)\n print(f\" {n} players: win rate = {wr:.1%} (random baseline: {1/n:.1%})\")\nelse:\n print(\"No model found. Train first!\")",
"execution_count": null,
"outputs": []
}