Rolling Dice Game and Value Functions
Question:
Suppose we play a game with a die where we roll and sum our rolls. We can stop any time, and the sum is our score. However, if our sum is ever a multiple of 10, our score is zero, and our game is over. What strategy will yield the greatest expected score? What about the same game played with values other than 10?
We set up and solve this problem using the Bellman equation.
1. Set up the Bellman equation
Let $N$ be the “bad” multiple (10 in your original game).
State = current sum $S \in \{0,1,2,\dots\}$
From state $S$, you can:
Stop: take payoff $S$.
Roll: add a die roll $X \in \{1,\dots,6\}$.
If $S+X$ is a positive multiple of $N$, payoff = 0 and the game ends.
Otherwise you move to $S' = S+X$ and decide again.
Define $V(S)$ = optimal expected payoff starting from total $S$.
Then the Bellman optimality equation is:
\[V(S) = \max\Bigl(\text{Stop},\ \text{Roll}\Bigr) = \max\bigl(S,\ R(S)\bigr),\]where
\[R(S) = \mathbb{E}[\text{payoff if we roll at } S].\]Explicitly,
\[R(S) = \frac{1}{6}\sum_{x=1}^6 G(S,x),\]with
\[G(S,x) = \begin{cases} 0 & \text{if } S+x \text{ is a positive multiple of } N, \\ V(S+x) & \text{otherwise.} \end{cases}\]Also, any positive multiple of $N$ is absorbing with payoff 0:
\[V(kN) = 0,\quad k\ge1.\]At $S=0$, stopping gives 0, rolling gives something positive, so $V(0) = R(0)$.
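To make the backup concrete, here is a minimal sketch of a single Bellman update at one state. The numbers in `V_est` are hypothetical value estimates for the successor states, made up purely for illustration:

```python
N = 10  # the "bad" multiple

def bellman_backup(S, V):
    """One Bellman backup: max over Stop (= S) and Roll (= average of G(S, x))."""
    roll = 0.0
    for x in range(1, 7):
        nxt = S + x
        # G(S, x): zero if we land on a positive multiple of N, else V(S + x)
        roll += 0.0 if nxt % N == 0 else V[nxt]
    roll /= 6.0
    return max(float(S), roll)

# Hypothetical value estimates for states 8..13 (10 is absorbing, payoff 0).
V_est = {8: 8.0, 9: 9.0, 11: 11.0, 12: 12.0, 13: 13.0}
print(bellman_backup(7, V_est))  # (8 + 9 + 0 + 11 + 12 + 13) / 6 ≈ 8.83 > 7, so roll
```

At $S=7$ the face showing 3 would land on 10 and score zero, yet rolling is still worth more than the sure 7.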
We have infinitely many states $S=0,1,2,\dots$, but we can’t solve infinitely many equations. The key observation:
- When the next roll carries a positive risk of busting, the continuation value behaves roughly like
\[R(S) \approx \frac{5}{6}\,(S + c)\quad\text{for some bounded } c,\]which is strictly smaller than $S$ for large enough $S$. So past some point, you always prefer to stop.
That means there exists some big cutoff $S_{\max}$ such that for all $S\ge S_{\max}$, the optimal action is “stop”, so
\[V(S) = S\quad\text{for }S \ge S_{\max}.\]So we can:
Pick a large $S_{\max}$ (e.g. 150–200 for $N=10$).
Solve the Bellman equations only for $S=0,1,\dots,S_{\max}$, using the boundary condition
\[V(S) = S\quad\text{for } S > S_{\max}.\]
This gives a very accurate approximation, and when you increase $S_{\max}$ further, the results stop changing (so you know you’re converged).
2. Value iteration: how to actually solve it
Here’s the concrete algorithm you can code:
Step 0: Initialize $V$
For $S=0,\dots,S_{\max}$, set an initial guess, e.g.
\[V^{(0)}(S) = S\]except set $V^{(0)}(kN)=0$ for $k\ge 1$ (absorbing bad states).
Step 1: Iterative update
For iteration $t=0,1,2,\dots$:
For each state $S=0,\dots,S_{\max}$:
If $S>0$ and $S$ is a multiple of $N$, keep $V^{(t+1)}(S) = 0$.
Otherwise compute:
Stop value: $\text{Stop} = S$.
Roll value:
\[\text{Roll} = \frac{1}{6}\sum_{x=1}^{6}\tilde{V}(S+x),\]where
\[\tilde{V}(S+x) = \begin{cases} 0 & \text{if } S+x \text{ is a positive multiple of } N,\\ V^{(t)}(S+x) & \text{if } S+x \le S_{\max},\\ S+x & \text{if } S+x > S_{\max}\quad (\text{boundary stop}). \end{cases}\]Then set
\[V^{(t+1)}(S) = \max\bigl(\text{Stop},\ \text{Roll}\bigr).\]
Repeat until the updates change very little:
\[\max_{S} \bigl|V^{(t+1)}(S) - V^{(t)}(S)\bigr| < \varepsilon.\]At that point $V^{(t)}$ is essentially your true $V$.
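As a quick empirical check that the truncation is harmless, the iteration above can be packed into one small function and run with two different cutoffs; with the boundary condition $V(S)=S$ beyond the cutoff, the value at $S=0$ should not change once $S_{\max}$ is large enough. This is a sketch, separate from the fuller implementation later in this post:

```python
def solve(N: int, S_max: int, tol: float = 1e-10) -> list:
    """Value iteration for the dice game, truncated at S_max."""
    # Initial guess: V(S) = S, except absorbing bad states (positive multiples of N).
    V = [0.0 if (S > 0 and S % N == 0) else float(S) for S in range(S_max + 1)]
    while True:
        delta = 0.0
        V_new = V[:]
        for S in range(S_max + 1):
            if S > 0 and S % N == 0:
                continue  # terminal, stays 0
            roll = 0.0
            for x in range(1, 7):
                nxt = S + x
                if nxt % N == 0:
                    contrib = 0.0          # busted
                elif nxt <= S_max:
                    contrib = V[nxt]
                else:
                    contrib = float(nxt)   # boundary: stop beyond the cutoff
                roll += contrib / 6.0
            V_new[S] = max(float(S), roll)
            delta = max(delta, abs(V_new[S] - V[S]))
        V = V_new
        if delta < tol:
            return V

v100 = solve(10, 100)[0]
v200 = solve(10, 200)[0]
print(v100, v200)  # the two estimates of V(0) should agree to high precision
```

If the two values disagreed, $S_{\max}$ would still be too small.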
3. Extract the optimal policy and thresholds
Given the converged $V$, the optimal action at each $S$ is:
Roll if $R(S) > S$,
Stop if $R(S) < S$,
(Exact ties essentially never occur; you can break them arbitrarily.)
You compute $R(S)$ once more using the final $V$ (same formula as above).
Then for each remainder class $r \in \{0,1,\dots,N-1\}$, you can define the threshold
\[T_r = \min\{\, S \ge 0 : S \equiv r \pmod{N} \text{ and optimal action is stop} \,\}.\]If that set is empty, it means you never stop in that remainder class (which does happen).
from typing import Dict, List, Optional

N = 10
S_max = 200
tol = 1e-10
max_iter = 10_000

def is_bad(total: int, N: int) -> bool:
    """Bad states are positive multiples of N (game ends with payoff 0)."""
    return total > 0 and total % N == 0

# ----------------------------
# Initialize V
# ----------------------------
V: List[float] = [float(S) for S in range(S_max + 1)]
for S in range(S_max + 1):
    if is_bad(S, N):
        V[S] = 0.0

# ----------------------------
# Value iteration
# ----------------------------
for it in range(max_iter):
    delta = 0.0
    V_new = V.copy()
    for S in range(S_max + 1):
        # Terminal bad states
        if is_bad(S, N):
            V_new[S] = 0.0
            continue
        stop_val = float(S)
        # Expected value if we roll once and then play optimally
        roll_val = 0.0
        for x in range(1, 7):
            Sprime = S + x
            if is_bad(Sprime, N):
                contrib = 0.0
            elif Sprime <= S_max:
                contrib = V[Sprime]
            else:
                # Boundary condition: beyond S_max we assume "stop immediately"
                contrib = float(Sprime)
            roll_val += contrib / 6.0
        V_new[S] = max(stop_val, roll_val)
        delta = max(delta, abs(V_new[S] - V[S]))
    V = V_new
    # Print occasionally (otherwise it's super spammy)
    if it % 50 == 0 or delta < tol:
        print(f"Iter {it:4d} | delta = {delta:.3e}")
    if delta < tol:
        print(f"Converged in {it+1} iterations, delta={delta:.2e}")
        break

# ----------------------------
# Extract policy from converged V
# ----------------------------
policy: List[str] = [""] * (S_max + 1)
for S in range(S_max + 1):
    if is_bad(S, N):
        policy[S] = "terminal"
        continue
    stop_val = float(S)
    roll_val = 0.0
    for x in range(1, 7):
        Sprime = S + x
        if is_bad(Sprime, N):
            contrib = 0.0
        elif Sprime <= S_max:
            contrib = V[Sprime]
        else:
            contrib = float(Sprime)
        roll_val += contrib / 6.0
    policy[S] = "roll" if roll_val > stop_val else "stop"

# Print first 50 policies
for i in range(50):
    print(f"S={i:2d} | V={V[i]:7.3f} | policy={policy[i]}")

# ----------------------------
# Thresholds per remainder class
# ----------------------------
thresholds: Dict[int, Optional[int]] = {}
for r in range(N):
    candidate: Optional[int] = None
    for S in range(S_max + 1):
        if S % N != r:
            continue
        if is_bad(S, N):
            continue
        if policy[S] == "stop":
            candidate = S
            break
    thresholds[r] = candidate

print("\nThresholds (per remainder mod N):")
for r in range(N):
    print(f"r={r}: {thresholds[r]}")
And here is what the output looks like:
Iter 0 | delta = 3.500e+00
Iter 22 | delta = 8.990e-12
Converged in 23 iterations, delta=8.99e-12
S= 0 | V= 13.217 | policy=roll
S= 1 | V= 13.405 | policy=roll
S= 2 | V= 13.608 | policy=roll
S= 3 | V= 13.846 | policy=roll
S= 4 | V= 11.868 | policy=roll
S= 5 | V= 12.847 | policy=roll
S= 6 | V= 13.729 | policy=roll
S= 7 | V= 14.533 | policy=roll
S= 8 | V= 14.827 | policy=roll
S= 9 | V= 15.272 | policy=roll
S=10 | V= 0.000 | policy=terminal
S=11 | V= 18.722 | policy=roll
S=12 | V= 19.018 | policy=roll
S=13 | V= 19.357 | policy=roll
S=14 | V= 16.592 | policy=roll
S=15 | V= 17.945 | policy=roll
S=16 | V= 19.157 | policy=roll
S=17 | V= 20.260 | policy=roll
S=18 | V= 20.795 | policy=roll
S=19 | V= 21.395 | policy=roll
S=20 | V= 0.000 | policy=terminal
S=21 | V= 26.062 | policy=roll
S=22 | V= 26.432 | policy=roll
S=23 | V= 26.878 | policy=roll
S=24 | V= 24.000 | policy=stop
S=25 | V= 25.000 | policy=stop
S=26 | V= 26.303 | policy=roll
S=27 | V= 27.759 | policy=roll
S=28 | V= 28.651 | policy=roll
S=29 | V= 29.558 | policy=roll
S=30 | V= 0.000 | policy=terminal
S=31 | V= 35.764 | policy=roll
S=32 | V= 36.083 | policy=roll
S=33 | V= 36.500 | policy=roll
S=34 | V= 34.000 | policy=stop
S=35 | V= 35.000 | policy=stop
S=36 | V= 36.000 | policy=stop
S=37 | V= 37.000 | policy=stop
S=38 | V= 38.000 | policy=stop
S=39 | V= 39.000 | policy=stop
S=40 | V= 0.000 | policy=terminal
S=41 | V= 45.764 | policy=roll
S=42 | V= 46.083 | policy=roll
S=43 | V= 46.500 | policy=roll
S=44 | V= 44.000 | policy=stop
S=45 | V= 45.000 | policy=stop
S=46 | V= 46.000 | policy=stop
S=47 | V= 47.000 | policy=stop
S=48 | V= 48.000 | policy=stop
S=49 | V= 49.000 | policy=stop
Thresholds (per remainder mod N):
r=0: None
r=1: None
r=2: None
r=3: None
r=4: 24
r=5: 25
r=6: 36
r=7: 37
r=8: 38
r=9: 39
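As a final sanity check (a sketch, separate from the solver), we can simulate the game while following the threshold policy printed above; the empirical mean score should come out close to $V(0) \approx 13.2$:

```python
import random

random.seed(0)
# Stop thresholds per remainder class, hard-coded from the solver output above.
# Classes r = 0..3 are absent: there the optimal policy never stops.
thresholds = {4: 24, 5: 25, 6: 36, 7: 37, 8: 38, 9: 39}

def play_once() -> int:
    """Play one game from S = 0 under the threshold policy; return the score."""
    S = 0
    while True:
        r = S % 10
        if r in thresholds and S >= thresholds[r]:
            return S                   # policy says stop: score the current sum
        S += random.randint(1, 6)      # otherwise roll
        if S % 10 == 0:
            return 0                   # hit a multiple of 10: busted

n = 200_000
mean = sum(play_once() for _ in range(n)) / n
print(f"empirical mean score over {n} games: {mean:.3f}")
```

With 200,000 games the standard error is on the order of 0.04, so the average should land within a few tenths of the computed $V(0)$.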