Rolling Dice Game and Value Functions

Question:

Suppose we play a game with a die where we roll and sum our rolls. We can stop any time, and the sum is our score. However, if our sum is ever a multiple of 10, our score is zero, and our game is over. What strategy will yield the greatest expected score? What about the same game played with values other than 10?

We set this up as an optimal stopping problem and solve it with the Bellman equation.

1. Set up the Bellman equation

Let $N$ be the “bad” multiple (10 in your original game).

State = current sum $S \in \{0,1,2,\dots\}$

From state $S$, you can:

  • Stop: take payoff $S$.

  • Roll: add a die roll $X \in \{1,\dots,6\}$,

  • If $S+X$ is a positive multiple of $N$, payoff = 0 and game ends.

  • Otherwise you move to $S' = S+X$ and decide again.

Define $V(S)$ = optimal expected payoff starting from total $S$.

Then the Bellman optimality equation is:

\[V(S) = \max\Bigl(\text{Stop},\ \text{Roll}\Bigr) = \max\Bigl(S,\ R(S)\Bigr),\]

where

\[R(S) = \mathbb{E}[\text{payoff if we roll at } S].\]

Explicitly,

\[R(S) = \frac{1}{6}\sum_{x=1}^6 G(S,x),\]

with

\[G(S,x) = \begin{cases} 0 & \text{if } S+x \text{ is a positive multiple of } N, \\ V(S+x) & \text{otherwise.} \end{cases}\]

Also, any positive multiple of $N$ is absorbing with payoff 0:

\[V(kN) = 0,\quad k\ge1.\]

At $S=0$, stopping gives 0 while rolling has positive expected value, so $V(0) = R(0)$.
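These definitions translate almost line by line into code. Here is a minimal sketch of the roll value $R(S)$ (the helper name `roll_value` is mine, not from the program later in the post), assuming `V` is a list of already-known values and any total beyond it is treated as a stop state:

```python
def roll_value(S, V, N=10):
    """Expected payoff of rolling once from total S, then following V."""
    total = 0.0
    for x in range(1, 7):            # fair six-sided die
        nxt = S + x
        if nxt % N == 0:             # positive multiple of N: bust, payoff 0
            continue
        total += V[nxt] if nxt < len(V) else nxt  # beyond V: assume we stop
    return total / 6.0
```

For example, with the naive guess $V(S)=S$, `roll_value(0, list(range(20)))` is just the mean of a single die roll, 3.5, while from $S=9$ the roll to 10 is a bust and contributes 0.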

We have infinitely many states $S=0,1,2,\dots$, but we can’t solve infinitely many equations. The key observation:

  • When there is a positive probability of busting on the next roll, the continuation value behaves like
\[R(S) \approx p_{\text{survive}} \cdot (\text{something of order } S),\]

which is strictly smaller than $S$ for large enough $S$. So past some point, you always prefer to stop.

That means there exists some big cutoff $S_{\max}$ such that for all $S\ge S_{\max}$, the optimal action is “stop”, so

\[V(S) = S\quad\text{for }S \ge S_{\max}.\]

So we can:

  • Pick a large $S_{\max}$ (e.g. 150–200 for $N=10$).

  • Solve the Bellman equations only for $S=0,1,\dots,S_{\max}$, using the boundary condition

\[V(S) = S \quad \text{for } S > S_{\max}.\]

This gives a very accurate approximation, and when you increase $S_{\max}$ further, the results stop changing (so you know you’re converged).
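This insensitivity to the cutoff is easy to check numerically. Below is a compact, self-contained sketch (the function name `solve_v0` is my own) that solves the truncated system with an in-place sweep for two different cutoffs and compares the resulting $V(0)$:

```python
def solve_v0(N=10, S_max=150, tol=1e-10, max_sweeps=10_000):
    """Solve the truncated Bellman equations and return V(0)."""
    # Absorbing bad states start (and stay) at 0; everything else at "stop now".
    V = [0.0 if (S > 0 and S % N == 0) else float(S) for S in range(S_max + 1)]
    for _ in range(max_sweeps):
        delta = 0.0
        for S in range(S_max + 1):
            if S > 0 and S % N == 0:
                continue                      # terminal: payoff fixed at 0
            roll = 0.0
            for x in range(1, 7):
                nxt = S + x
                if nxt % N == 0:
                    continue                  # bust: contributes 0
                roll += (V[nxt] if nxt <= S_max else nxt) / 6.0
            new = max(float(S), roll)
            delta = max(delta, abs(new - V[S]))
            V[S] = new
        if delta < tol:
            break
    return V[0]

# Prints a very small difference: the cutoff doesn't matter once it's large.
print(abs(solve_v0(S_max=150) - solve_v0(S_max=200)))
```

Because the optimal policy stops well before the cutoff, the boundary condition never propagates back to $V(0)$, and the two answers agree to within the solver tolerance.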

2. Value iteration: how to actually solve it

Here’s the concrete algorithm you can code:

Step 0: Initialize $V$

For $S=0,\dots,S_{\max}$, set an initial guess, e.g.

\[V^{(0)}(S) = S\]

except set $V^{(0)}(kN)=0$ for $k\ge 1$ (absorbing bad states).

Step 1: Iterative update

For iteration $t=0,1,2,\dots$:

For each state $S=0,\dots,S_{\max}$:

  • If $S>0$ and $S$ is a multiple of $N$, keep $V^{(t+1)}(S) = 0$.

  • Otherwise compute:

  • Stop value: $\text{Stop} = S$.

  • Roll value:

\[\text{Roll} = \frac{1}{6}\sum_{x=1}^6 \tilde{V}(S+x)\]

where

\[\tilde{V}(S+x) = \begin{cases} 0 & \text{if } S+x \text{ is a positive multiple of } N,\\ V^{(t)}(S+x) & \text{if } S+x \le S_{\max},\\ S+x & \text{if } S+x > S_{\max}\quad (\text{boundary stop}). \end{cases}\]
  • Then set
\[V^{(t+1)}(S) = \max(\text{Stop},\text{Roll}).\]

Repeat until the updates change very little:

\[\max_{S} \bigl|V^{(t+1)}(S) - V^{(t)}(S)\bigr| < \varepsilon.\]

At that point $V^{(t)}$ is essentially your true $V$.

3. Extract the optimal policy and thresholds

Given the converged $V$, the optimal action at each $S$ is:

  • Roll if $R(S) > S$,

  • Stop if $R(S) < S$,

  • (If the two are equal, break the tie arbitrarily.)

You compute $R(S)$ once more using the final $V$ (same formula as above).

Then for each remainder class $r \in \{0,1,\dots,N-1\}$, you can define the threshold

\[T_r = \min\{\, S \ge 0 : S \equiv r \pmod{N} \text{ and the optimal action at } S \text{ is stop} \,\}.\]

If that set is empty, you never stop in that remainder class (which does happen; for $N=10$, the output below shows it for $r=0,1,2,3$).

from typing import Dict, List, Optional

N = 10
S_max = 200
tol = 1e-10
max_iter = 10_000

def is_bad(total: int, N: int) -> bool:
    """Bad states are positive multiples of N (game ends with payoff 0)."""
    return total > 0 and total % N == 0

# ----------------------------
# Initialize V
# ----------------------------
V: List[float] = [float(S) for S in range(S_max + 1)]
for S in range(S_max + 1):
    if is_bad(S, N):
        V[S] = 0.0

# ----------------------------
# Value iteration
# ----------------------------
for it in range(max_iter):
    delta = 0.0
    V_new = V.copy()

    for S in range(S_max + 1):
        # Terminal bad states
        if is_bad(S, N):
            V_new[S] = 0.0
            continue

        stop_val = float(S)

        # Expected value if we roll once and then play optimally
        roll_val = 0.0
        for x in range(1, 7):
            Sprime = S + x

            if is_bad(Sprime, N):
                contrib = 0.0
            elif Sprime <= S_max:
                contrib = V[Sprime]
            else:
                # Boundary condition: beyond S_max we assume "stop immediately"
                contrib = float(Sprime)

            roll_val += contrib / 6.0

        V_new[S] = max(stop_val, roll_val)
        delta = max(delta, abs(V_new[S] - V[S]))

    V = V_new

    # Print occasionally (otherwise it's super spammy)
    if it % 50 == 0 or delta < tol:
        print(f"Iter {it:4d} | delta = {delta:.3e}")

    if delta < tol:
        print(f"Converged in {it+1} iterations, delta={delta:.2e}")
        break

# ----------------------------
# Extract policy from converged V
# ----------------------------
policy: List[str] = [""] * (S_max + 1)

for S in range(S_max + 1):
    if is_bad(S, N):
        policy[S] = "terminal"
        continue

    stop_val = float(S)
    roll_val = 0.0

    for x in range(1, 7):
        Sprime = S + x
        if is_bad(Sprime, N):
            contrib = 0.0
        elif Sprime <= S_max:
            contrib = V[Sprime]
        else:
            contrib = float(Sprime)
        roll_val += contrib / 6.0

    policy[S] = "roll" if roll_val > stop_val else "stop"

# Print first 50 policies
for i in range(50):
    print(f"S={i:2d} | V={V[i]:7.3f} | policy={policy[i]}")

# ----------------------------
# Thresholds per remainder class
# ----------------------------
thresholds: Dict[int, Optional[int]] = {}
for r in range(N):
    candidate: Optional[int] = None
    for S in range(S_max + 1):
        if S % N != r:
            continue
        if is_bad(S, N):
            continue
        if policy[S] == "stop":
            candidate = S
            break
    thresholds[r] = candidate

print("\nThresholds (per remainder mod N):")
for r in range(N):
    print(f"r={r}: {thresholds[r]}")

And here is what the output looks like:

Iter    0 | delta = 3.500e+00
Iter   22 | delta = 8.990e-12
Converged in 23 iterations, delta=8.99e-12
S= 0 | V= 13.217 | policy=roll
S= 1 | V= 13.405 | policy=roll
S= 2 | V= 13.608 | policy=roll
S= 3 | V= 13.846 | policy=roll
S= 4 | V= 11.868 | policy=roll
S= 5 | V= 12.847 | policy=roll
S= 6 | V= 13.729 | policy=roll
S= 7 | V= 14.533 | policy=roll
S= 8 | V= 14.827 | policy=roll
S= 9 | V= 15.272 | policy=roll
S=10 | V=  0.000 | policy=terminal
S=11 | V= 18.722 | policy=roll
S=12 | V= 19.018 | policy=roll
S=13 | V= 19.357 | policy=roll
S=14 | V= 16.592 | policy=roll
S=15 | V= 17.945 | policy=roll
S=16 | V= 19.157 | policy=roll
S=17 | V= 20.260 | policy=roll
S=18 | V= 20.795 | policy=roll
S=19 | V= 21.395 | policy=roll
S=20 | V=  0.000 | policy=terminal
S=21 | V= 26.062 | policy=roll
S=22 | V= 26.432 | policy=roll
S=23 | V= 26.878 | policy=roll
S=24 | V= 24.000 | policy=stop
S=25 | V= 25.000 | policy=stop
S=26 | V= 26.303 | policy=roll
S=27 | V= 27.759 | policy=roll
S=28 | V= 28.651 | policy=roll
S=29 | V= 29.558 | policy=roll
S=30 | V=  0.000 | policy=terminal
S=31 | V= 35.764 | policy=roll
S=32 | V= 36.083 | policy=roll
S=33 | V= 36.500 | policy=roll
S=34 | V= 34.000 | policy=stop
S=35 | V= 35.000 | policy=stop
S=36 | V= 36.000 | policy=stop
S=37 | V= 37.000 | policy=stop
S=38 | V= 38.000 | policy=stop
S=39 | V= 39.000 | policy=stop
S=40 | V=  0.000 | policy=terminal
S=41 | V= 45.764 | policy=roll
S=42 | V= 46.083 | policy=roll
S=43 | V= 46.500 | policy=roll
S=44 | V= 44.000 | policy=stop
S=45 | V= 45.000 | policy=stop
S=46 | V= 46.000 | policy=stop
S=47 | V= 47.000 | policy=stop
S=48 | V= 48.000 | policy=stop
S=49 | V= 49.000 | policy=stop

Thresholds (per remainder mod N):
r=0: None
r=1: None
r=2: None
r=3: None
r=4: 24
r=5: 25
r=6: 36
r=7: 37
r=8: 38
r=9: 39
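As a sanity check (my own addition, not part of the original program), a short Monte Carlo run that follows the threshold policy read off this output reproduces the optimal value at $S=0$:

```python
import random

# Thresholds taken from the output above; classes r=0..3 never stop voluntarily.
THRESH = {4: 24, 5: 25, 6: 36, 7: 37, 8: 38, 9: 39}

def play_once(rng, N=10):
    """Play one game with the 'stop once S >= T_r' policy; return the score."""
    S = 0
    while True:
        r = S % N
        if r in THRESH and S >= THRESH[r]:
            return S                 # stop and bank the current sum
        S += rng.randint(1, 6)       # roll
        if S % N == 0:
            return 0                 # bust: sum hit a multiple of N

rng = random.Random(0)
trials = 200_000
mean = sum(play_once(rng) for _ in range(trials)) / trials
print(f"simulated expected score from S=0: {mean:.3f}")  # should be close to V(0) = 13.217
```

With 200,000 trials the sample mean lands within a few hundredths of the value iteration answer $V(0) \approx 13.217$, which is a good sign that both the solver and the extracted policy are consistent.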
This post is licensed under CC BY 4.0 by the author.
