# Blazing Eights — RL Agent

Self-play PPO agent for the Blazing Eights card game.

## Setup

```bash
pip install torch numpy
```

## Files

- `blazing_env.py` — Game environment (2-5 players)
- `train.py` — PPO self-play training
- `play.py` — Real-time play assistant (input game state, get best move)
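
If you want to drive the environment directly (for evaluation or custom bots), a minimal sketch is below. It assumes a Gym-style `reset`/`step` loop and a legal-action mask; the class and method names are illustrative guesses, not taken from `blazing_env.py`:

```python
# Hypothetical usage sketch -- assumes blazing_env.py exposes a Gym-style
# interface with a legal-action mask; actual names may differ.
from blazing_env import BlazingEightsEnv  # assumed class name

env = BlazingEightsEnv(num_players=2)
obs = env.reset()                          # observation for the player to act
done = False
while not done:
    mask = env.legal_action_mask()         # assumed helper: True for legal actions
    action = next(i for i, legal in enumerate(mask) if legal)  # naive baseline: first legal action
    obs, reward, done, info = env.step(action)
```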

## Training

```bash
# Train a 2-player agent (~10min on CPU for 100k episodes)
python train.py --num_players 2 --episodes 100000

# Train for 3 players (may need more episodes)
python train.py --num_players 3 --episodes 200000

# Custom hyperparams
python train.py --num_players 4 --episodes 300000 --lr 1e-4 --ent_coef 0.02
```

Training saves a checkpoint every 10k episodes and a final model (`blazing_ppo_final.pt`).
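
To inspect or resume from a checkpoint, the sketch below assumes checkpoints are plain `torch.save` payloads; the filename is an example only:

```python
import torch

# Assumption: checkpoints are plain torch.save() payloads; the filename
# pattern "blazing_ppo_ep*.pt" is an example, not guaranteed by train.py.
ckpt = torch.load("blazing_ppo_ep10000.pt", map_location="cpu")
if isinstance(ckpt, dict):
    print(list(ckpt.keys())[:10])   # peek at what the checkpoint stores
```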

## Real-time Play Assistant

After training, use the assistant during a real game:

```bash
python play.py --model blazing_ppo_final.pt --num_players 3
```

It will prompt you for:
1. Your hand (e.g., `8h,Ks,3d,SWAP`)
2. Top discard card (e.g., `6d`)
3. Active suit if an 8 was played
4. Direction (cw/ccw)
5. Other players' hand sizes
6. Approximate deck size

It then prints ranked action recommendations with their policy probabilities.
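
For reference, the input format above is easy to parse yourself. The sketch below follows the documented token format (`8h,Ks,3d,SWAP`); `play.py`'s actual parser may differ:

```python
# Sketch of parsing the hand string, based on the documented format
# ("8h,Ks,3d,SWAP"); the real parser in play.py may differ.
def parse_card(tok: str):
    tok = tok.strip()
    if tok.upper() == "SWAP":
        return ("SWAP", None)
    rank, suit = tok[:-1].upper(), tok[-1].lower()  # e.g. "10h" -> ("10", "h")
    assert suit in "shdc", f"bad suit in {tok!r}"
    return (rank, suit)

hand = [parse_card(t) for t in "8h,Ks,3d,SWAP".split(",")]
# [('8', 'h'), ('K', 's'), ('3', 'd'), ('SWAP', None)]
```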

## Game Rules

- **56 cards**: standard 52 + 4 Swap cards
- **Match**: suit or rank of top card
- **8**: Wild — choose a suit for next player
- **K**: All other players draw 1
- **Q**: Reverse direction (no effect in 2-player)
- **J**: Skip next player
- **Swap**: Swap your entire hand with the next player (playable anytime, no match needed)
- **Can't play**: Draw 1, play it if legal
- **Win**: First to empty hand
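
Putting the matching rules together, a playability check looks roughly like the sketch below. The function is illustrative and not taken from `blazing_env.py`:

```python
# Sketch of the legality rule described above -- an 8 or a Swap is always
# playable; otherwise match the active suit (the chosen suit if an 8 is on
# top, else the top card's suit) or the top card's rank. Names are
# illustrative, not from blazing_env.py.
def is_playable(card, top_card, active_suit=None):
    rank, suit = card
    if rank in ("8", "SWAP"):
        return True
    suit_to_match = active_suit or top_card[1]
    return suit == suit_to_match or rank == top_card[0]

assert is_playable(("K", "s"), ("6", "d"), active_suit="s")  # matches chosen suit
assert not is_playable(("3", "d"), ("6", "h"))               # no suit or rank match
```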

## Tips for Better Training

1. **Train per player count** — the optimal policy differs significantly for 2 vs 5 players.
2. **Increase episodes for more players** — larger games have more variance and need more samples.
3. **Opponent modeling** — after self-play, you can fine-tune against specific opponent behaviors by replacing some players with heuristic bots that mimic your friends' tendencies.
4. **Curriculum** — start training with 2 players, then use the trained model to initialize training for 3+ players (see the sketch below).
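
A minimal sketch of the curriculum warm-start in tip 4, assuming checkpoints are state dicts and using a hypothetical `build_model` factory; if the observation size grows with the player count, only the layers whose shapes match can transfer:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the real policy network in train.py; the actual
# architecture and observation/action sizes are unknown.
def build_model(obs_dim: int, n_actions: int) -> nn.Module:
    return nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, n_actions))

src = torch.load("blazing_ppo_final.pt", map_location="cpu")  # 2-player weights (assumed state dict)
model_3p = build_model(obs_dim=90, n_actions=60)              # 3-player sizes: made-up numbers
dst = model_3p.state_dict()
dst.update({k: v for k, v in src.items()
            if k in dst and v.shape == dst[k].shape})         # copy only layers whose shapes match
model_3p.load_state_dict(dst)
```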