diff options
| -rw-r--r-- | README.md | 82 | ||||
| -rw-r--r-- | train_colab.ipynb | 52 |
2 files changed, 50 insertions, 84 deletions
@@ -1,67 +1,77 @@ # Blazing Eights — RL Agent -Self-play PPO agent for the Blazing Eights card game. +Self-play PPO agent for the Blazing Eights card game (UNO variant with custom special cards). ## Setup ```bash -pip install torch numpy +pip install torch numpy tqdm ``` ## Files - `blazing_env.py` — Game environment (2-5 players) -- `train.py` — PPO self-play training +- `train.py` — PPO self-play training with greedy warmup +- `versus.py` — Human vs AI interactive game - `play.py` — Real-time play assistant (input game state, get best move) +- `train_colab.ipynb` — Google Colab GPU training notebook + +## Game Rules + +**52 cards** (standard deck, Q removed in 2-player) + **4 Swap cards** + +| Card | Effect | +|------|--------| +| 8 | Wild — choose a suit for next player | +| K | All other players draw 1 card | +| Q | Reverse direction (removed in 2-player games) | +| J | Skip next player | +| Swap | Swap entire hand with next player (always playable; next card must match the card before the Swap) | + +- **Match** top card by suit or rank (8 and Swap are exceptions) +- **Free draw**: you may draw even if you have playable cards +- **After drawing**: play any legal card OR pass (one draw per turn max) +- **Stalemate**: if all players pass without drawing, game ends (fewest cards wins) +- **Win**: first to empty hand +- **Initial hand**: 5 cards each ## Training ```bash -# Train a 2-player agent (~10min on CPU for 100k episodes) +# 2-player (~20min on CPU, 100k episodes) python train.py --num_players 2 --episodes 100000 -# Train for 3 players (may need more episodes) -python train.py --num_players 3 --episodes 200000 +# Skip greedy warmup +python train.py --num_players 2 --episodes 100000 --greedy_warmup 0 # Custom hyperparams python train.py --num_players 4 --episodes 300000 --lr 1e-4 --ent_coef 0.02 ``` -Training saves checkpoints every 10k episodes and a final model. - -## Real-time Play Assistant +Training features: +- **Greedy warmup**: behavioral cloning on greedy play before PPO (default 2000 games) +- **CPU/GPU split**: game simulation on CPU, PPO updates on GPU (avoids transfer overhead) +- **CSV log**: `{save_path}_log.csv` with avg_len, loss, vs_greedy win rate every 10k episodes +- Checkpoints every 10k episodes -After training, use the assistant during a real game: +## Play vs AI ```bash -python play.py --model blazing_ppo_final.pt --num_players 3 +python versus.py --model blazing_ppo_2p_final.pt +python versus.py --model blazing_ppo_2p_final.pt --num_players 3 +python versus.py --model blazing_ppo_2p_final.pt --show_ai # show AI hand (debug) ``` -It will prompt you for: -1. Your hand (e.g., `8h,Ks,3d,SWAP`) -2. Top discard card (e.g., `6d`) -3. Active suit if an 8 was played -4. Direction (cw/ccw) -5. Other players' hand sizes -6. Approximate deck size +Controls: number to play card, `d` to draw, `p` to pass, `q` to quit. -Then shows ranked action recommendations with probabilities. +## Play Assistant -## Game Rules +Input your game state and get ranked action recommendations: + +```bash +python play.py --model blazing_ppo_2p_final.pt --num_players 2 +``` + +## Colab GPU Training -- **56 cards**: standard 52 + 4 Swap cards -- **Match**: suit or rank of top card -- **8**: Wild — choose a suit for next player -- **K**: All other players draw 1 -- **Q**: Reverse direction (no effect in 2-player) -- **J**: Skip next player -- **Swap**: Swap your entire hand with next player (playable anytime, no match needed) -- **Can't play**: Draw 1, play it if legal -- **Win**: First to empty hand - -## Tips for Better Training - -1. **Train per player count** — the optimal policy differs significantly for 2 vs 5 players. -2. **Increase episodes for more players** — larger games have more variance, need more samples. -3. **Opponent modeling** — after self-play, you can fine-tune against specific opponent behaviors by replacing some players with heuristic bots that mimic your friends' tendencies. -4. **Curriculum** — start training with 2 players, then use the trained model to initialize training for 3+ players. +Open `train_colab.ipynb` in Google Colab for GPU-accelerated training. See notebook for setup instructions. diff --git a/train_colab.ipynb b/train_colab.ipynb index 24ae0dc..afae94d 100644 --- a/train_colab.ipynb +++ b/train_colab.ipynb @@ -15,11 +15,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "# Blazing Eights - Colab GPU Training\n", - "\n", - "Clone repo → Train PPO agent on GPU → Push trained model back to GitHub" - ] + "source": "# Blazing Eights - Colab GPU Training\n\nClone repo → Train PPO agent (CPU collect, GPU update) → Push trained model back to GitHub\n\n**Game**: UNO variant with custom special cards (8=Wild, K=All draw, J=Skip, Swap=Swap hands)." }, { "cell_type": "markdown", @@ -58,30 +54,14 @@ { "cell_type": "code", "metadata": {}, - "source": [ - "# 2-player training: GPU makes the PPO update faster\n", - "!python train.py --num_players 2 --episodes 200000 --save_path blazing_ppo_2p" - ], + "source": "# 2-player training with greedy warmup + CSV logging\n# Game simulation on CPU, PPO updates on GPU automatically\n!python train.py --num_players 2 --episodes 200000 --save_path blazing_ppo_2p\n\n# Show training log\nimport pandas as pd\ndf = pd.read_csv(\"blazing_ppo_2p_log.csv\")\nprint(df.to_string(index=False))", "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": {}, - "source": [ - "# (Optional) 3-player training\n", - "# !python train.py --num_players 3 --episodes 300000 --save_path blazing_ppo_3p" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": {}, - "source": [ - "# (Optional) 4-player training\n", - "# !python train.py --num_players 4 --episodes 400000 --lr 1e-4 --ent_coef 0.02 --save_path blazing_ppo_4p" - ], + "source": "# (Optional) 3-player training\n# !python train.py --num_players 3 --episodes 300000 --save_path blazing_ppo_3p\n\n# (Optional) Skip greedy warmup\n# !python train.py --num_players 2 --episodes 200000 --greedy_warmup 0 --save_path blazing_ppo_2p_no_warmup", "execution_count": null, "outputs": [] }, @@ -163,31 +143,7 @@ { "cell_type": "code", "metadata": {}, - "source": [ - "import sys\n", - "sys.path.insert(0, \".\")\n", - "from train import PolicyValueNet, evaluate_vs_random\n", - "\n", - "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n", - "model = PolicyValueNet().to(device)\n", - "\n", - "# Load the trained model\n", - "import glob\n", - "final_models = glob.glob(\"*_final.pt\") + glob.glob(\"models/*_final.pt\")\n", - "if final_models:\n", - " ckpt = torch.load(final_models[0], map_location=device, weights_only=True)\n", - " model.load_state_dict(ckpt[\"model\"])\n", - " model.eval()\n", - " print(f\"Loaded: {final_models[0]}\")\n", - " print(f\"Trained for {ckpt.get('episode', '?')} episodes\")\n", - " print()\n", - "\n", - " for n in [2, 3, 4]:\n", - " wr = evaluate_vs_random(model, num_players=n, num_games=2000, device=device)\n", - " print(f\" {n} players: win rate = {wr:.1%} (random baseline: {1/n:.1%})\")\n", - "else:\n", - " print(\"No model found. Train first!\")" - ], + "source": "import sys\nsys.path.insert(0, \".\")\nfrom train import PolicyValueNet, evaluate_vs_random\n\ndevice = \"cpu\" # eval on CPU (single-sample inference)\nmodel = PolicyValueNet().to(device)\n\nimport glob\nfinal_models = glob.glob(\"*_final.pt\") + glob.glob(\"models/*_final.pt\")\nif final_models:\n ckpt = torch.load(final_models[0], map_location=device, weights_only=True)\n model.load_state_dict(ckpt[\"model\"])\n model.eval()\n print(f\"Loaded: {final_models[0]}\")\n print(f\"Trained for {ckpt.get('episode', '?')} episodes\")\n print()\n\n for n in [2, 3, 4]:\n wr = evaluate_vs_random(model, num_players=n, num_games=2000, device=device)\n print(f\" {n} players: win rate = {wr:.1%} (random baseline: {1/n:.1%})\")\nelse:\n print(\"No model found. Train first!\")", "execution_count": null, "outputs": [] } |
