What is the Bellman Equation in Economics? | An Introduction to Bellman Methods, Reinforcement Learning, and Dynamic Programming
In this article, I introduce the core concepts driving the Bellman equation and its use in dynamic programming and reinforcement learning (don’t worry if you’re not familiar with those terms!). Here’s a quick rundown of what you’ll learn:
- Foundational concepts: states, actions, rewards, dynamic programming, and reinforcement learning.
- Bellman Equation structure: immediate rewards + discounted expected future values.
- Uncertainty in Bellman equations: probability distributions and expected values to model stochastic shocks.
- Policies versus actions: the difference between choices (actions) and strategies (decision rules/policies).
- Solution methods: value iteration as a computational approach for using Bellman equations to solve optimization problems.
- Two examples using the Bellman equation: a two-period machine maintenance example solved by hand, and an infinite-horizon consumption-savings model that we use numerical methods to solve.
- Implementation techniques: discretizing continuous variables, grid-based optimization, and convergence criteria.
Also: if you want to learn some of the same concepts with a focus on one specific application of the math, you can check out my article about Monte Carlo simulations here.
That all said, let’s get into it. This is the general form of a Bellman equation (the standard way of expressing it). You may have seen it with different variables, or slight variations to address a certain problem or method, but regardless, this is the structure it will nearly always maintain:
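In standard notation (every symbol here gets unpacked piece by piece below), it reads:

V(s) = \max_{a \in A(s)} \left\{ u(s,a) + \beta \, \mathbb{E}\left[ V(s') \mid s,a \right] \right\}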
The Bellman equation provides a structure to solve intertemporal optimization problems. That’s a fancy way of saying it helps us figure out the single best decision or set of decisions to make in a situation where the choices we make could affect the world, and affect us, in different ways.
The initial Bellman principle was developed by Richard Bellman in the 1950s under the banner of dynamic programming, which is an umbrella term for problem-solving methods that use recursive sub-problems to solve big problems. Recursion means iterating: results from one run of an equation are used as inputs to re-run the equation, and so on until a problem is solved.
Modern reinforcement learning methods then adopted and extended those earlier equations. Reinforcement learning is a subset of machine learning, generally involving approaches to getting agents (algorithms) to “discover” optimal decisions within an environment.
Let’s understand some of the basics of reinforcement learning to inform our understanding of the Bellman equation, and then we’ll return to the general form of the equation:
Introducing States, Actions, & Rewards
Generally, for any reinforcement learning or dynamic programming problem, we’ll construct a series of rules and equations to produce a “world” or “game” of sorts, and then build an algorithm to play that game (if you’re familiar with game theory, you’ll find a lot of this familiar—think backward induction!).
If we build an algorithm to play the game in the right way and improve over time, it will figure out the optimal way to play the game and answer our question for us.
A cornerstone of this process is what are called states. Intuitively, you can substitute states for states of the world. For example, if the weather can either be sunny or raining in the world we’re creating, then the two states or states of the world are sunny and raining. States could also be different earnings values reported by a firm, interest rate changes announced by the Federal Reserve, or any number of other environmental changes. Transitions between states can occur deterministically (if x event happens, y state is transitioned into) or stochastically (the next state is randomly drawn from a given distribution).
In turn, actions are things that an algorithm can do given a set of states. If it’s raining, an action may be to get an umbrella, or when it’s sunny, to purchase sunscreen.
Combining these, a state-action pair is an action an agent can take within a state.
Finally, rewards are the cost or benefit garnered from applying actions to states—usually measured in utility for consumers and profit for firms. For example, if an algorithm we create incorrectly purchases sunscreen when it’s raining (instead of an umbrella), then we could program getting wet as a negative reward, or something to avoid. This negative reward is a signal to the algorithm, telling it to potentially avoid that action in that state when other, better options are available.
So, reinforcement learning methods establish various states that stochastically occur and evolve over time. Algorithms perform actions in these states and compute a reward given the actions they took and the states in which they occurred.
By repeating this process (“learning”) to maximize rewards, algorithms can find optimal approaches to any number of problems: portfolio optimization, risk analysis, economic simulations, or how to not get wet!
This is where Richard Bellman comes in. Bellman methods are the backbone of many reinforcement learning methods, as reinforcement learning algorithms are ways to solve or approximate the Bellman equations. Like I said at the start of the article, the Bellman equation provides a structure used to solve complex problems.
Specifically, the equation says that the value of a state or state–action pair equals the immediate reward plus the discounted value of what comes next. In an economics context, you’ve likely heard of discounted profit streams, where you want to maximize not just current profits, but the value of all profits over time given some discount factor (“I value money 5% more today versus tomorrow”). As mentioned, it’s a recursive equation, meaning it breaks a large problem down into smaller sub-problems, and solves them incrementally using data from each solved sub-problem.
Breaking Down the Bellman Equation
Let’s return to the Bellman general form:
V(s) is general-form notation—like f(x)—telling us that the right-hand side of the equation defines a value function. Conceptually, the value function represents the maximum expected lifetime payoff (utility or profit, depending on the context) that a decision maker can achieve given rules and constraints (states, actions, rewards, etc.).
Now, I just said that the Bellman equation combines immediate reward with the discounted value of what comes next. Let’s first understand the immediate reward portion of the equation:
s is the current value of the state we’re considering in our “world.” a is the decision variable, the unknown value that we’re adjusting to achieve the optimal outcome. For example, our current state could be wealth and our decision variable investment, and in that case we’re trying to figure out how much to optimally invest given our wealth. Like so, u(s,a) is the utility we get from choosing action a in state s. Because s refers to the current state and a the current input, u(s,a) calculates the immediate reward.
This visualization represents many possible actions for a state s (some may produce the same state, others different states). Each state corresponds with a value once the immediate reward is added. In this case (note the bottom-right), the state corresponding to action 5 produces the highest value.
But, here’s the catch: actions do not lead to a predefined next state with a probability of 1. Instead, an action a taken in state s leads to a state s' according to a state transition probability distribution, P(s'|s,a) (meaning s' is conditional upon the state-action pair s,a). The action is an input to the function that defines the distribution of next states.
We’ll get to that in a second—let’s first touch on the discount factor.
β ∈ (0,1) is the discount factor (the “time preference factor”). It determines the rate at which value today is compared to value tomorrow—a person with a year to live versus one with fifty years to live will likely value money very differently, for example, and that spread is a difference in time preference. This time preference factor must be between 0 and 1, with 1 meaning you care about all future value equally to today, and 0 meaning you only care about current value.
s' is the state of the next period, which is conditional (note the notation “|”) upon the current state-action pair s,a. For example, if we make a certain smart investment a today given a wealth s of $100,000, we may have $200,000 in the next state s' (which could represent, say, the following year). V(s′) is the value of that next-period state.
Finally, the expectation E[⋅] integrates over uncertainty. Let’s go a bit deeper there. Here’s what it’s equivalent to:
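Assuming a discrete set of possible next states (which is what we’ll work with in the examples below), the expectation is a probability-weighted sum:

\mathbb{E}\left[ V(s') \right] = \sum_{s'} P(s' \mid s,a) \, V(s')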
With P(s'|s,a) being the probability of landing in state s′ after choosing action a in state s. This reflects the fundamental notion that the world is uncertain. For example (back to the weather!), there’s always some chance that our umbrella breaks. Let’s say it happens 10% of the time. This means that even when we choose to bring an umbrella (action) on our walk when there are dark clouds (state) which produce rain 50% of the time, we’re not guaranteed to stay dry (our reward). Instead, bringing the umbrella means we stay dry 95% of the time, but get soaked the other 5%. If we value being dry versus wet at (100, 0), then our expected value is 95. Formally:
P(dry) = P(no rain) + P(rain) × P(umbrella works) = 0.5 + 0.5 × 0.9 = 0.95
So, to reflect uncertainty, E sums the probability of landing in each state multiplied by the value derived from that state. By integrating, we find the average or expected value from choosing that state-action pair s,a. We call that the continuation value.
Here’s a visualization, representing that when we choose an action a in state s, we will probably end up in a certain state s', but due to uncertainty, we could also end up in states s'' through s'''' (the green bars representing the probability, or likelihood of occurring, with probabilities necessarily summing to 1). Each state brings about a different value:
When we combine the two parts of the equation we’ve covered thus far, we get the current reward and the average value (reward) we’ll get in the next period given the actions we’re taking in this period.
The last component of the equation we haven’t touched on is the max notation. This means that given our current state s, we’re considering all feasible actions a in the set A(s) through the two components we just described. Like so, each action will produce a different immediate payoff + discounted continuation value. Since we’re trying to maximize value (“max”), we select the highest calculated value. Conceptually, the maximum utility or profit is the best action we can take given our current state and the randomness of the world.
Practically, and in many economic applications of the Bellman equation, we’ll need to consider many periods, or even infinitely many. We don’t have to change the equation to do this — instead, we just iterate through the one-period equation as many times as we need to. Here’s the equation for a multi-period Bellman setup:
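In the same notation as before, just with time subscripts added to the value function:

V_t(s) = \max_{a \in A(s)} \left\{ u(s,a) + \beta \, \mathbb{E}\left[ V_{t+1}(s') \mid s,a \right] \right\}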
Here, the value of the current period Vt is calculated by maximizing the sum of the immediate reward from the state-action pair in this period and the continuation value involving Vt+1. This repeats T−1 times (T being the final period) to solve the multi-period (intertemporal) optimization problem.
Introducing Policies
Now, I mentioned dynamic programming. More specifically, in economics, dynamic programming involves coded scripts (usually) employing Bellman equations to recursively iterate through various actions, states, and values.
A core concept in dynamic programming, and a practical component attached to Bellman equations that we haven’t touched on yet, is the policy. Policies are the strategy component of optimization—functions that map states onto actions—and inform a program’s decision on which action to take given a certain state. Value functions let policies be improved based on the value they bring about.
If you’ve played the board game Clue, each guess you make is likely a value function output. You may not have guessed the right answer, but your overarching “policy” or strategy moving forward is better because of your guess (your value function output). With enough guesses, you will settle on the correct answer.
Policies differ from actions: if policies are the strategy of the game, then actions are the rules. Say we want an agent to move through a maze: possible actions may be to move left, right, up, or down, while the policy for that state may be to move left. A policy can also apply to all states, like we’ve discussed previously (say, always bring an umbrella if there are rainclouds).
Let’s map what we’ve learned about Bellman equations onto two examples of increasing complexity:
A One-Machine, Two-Period Maintenance Model
Say we want to optimally decide whether to let a machine in our factory produce goods or to fix it. There are two states, one in which the machine is in a good condition G, and the other for a bad condition, B, which signals that the machine needs fixing. There are two possible actions—keep (run the machine this period) and fix (pay to fix it in this period). For our immediate reward function u(s,a), we apply these rules for every combination of states and actions:
- s = G, a = keep → profit = 10
- s = B, a = keep → profit = 2
- s = G, a = fix → profit = -8
- s = B, a = fix → profit = -8
If we keep in the current period given a state of Good, there is a 60% probability the machine stays in the Good state, and a 40% probability it transitions to the bad state (these are called “transition paths”). If it’s already in a bad state, it stays bad with a probability of 1, and if we fix instead of keep, the machine moves to G with a probability of 1.
Our discount factor (beta) is 0.95, meaning we value profit in the next period 95% as much as we do profit today.
So, we have two states, two actions, and the optimal policy (strategy) question of what actions our agent should play in each state. Let’s calculate the Bellman equations for the states of good and bad:
For state G, we can either “play” the action keep or the action fix. Let’s calculate outcomes for both:
- keep: 10 + 0.95[0.6V(G)+0.4V(B)]
- fix: -8 + 0.95V(G)
Here’s the same for state B:
- keep: 2 + 0.95V(B)
- fix: −8 + 0.95V(G)
Now, to solve, we’ll use guess-and-check (something the computer will do for us once we write scripts to solve problems with Bellman equations): we guess a policy, solve for the implied values, and compare. Let’s start by guessing a very reasonable policy (ruleset) of keep in state G and fix in state B (logically, fix a machine when it’s broken, and let it run when it’s not):
- V(G) = 10 + 0.95[0.6V(G) + 0.4V(B)]
- V(B) = −8 + 0.95V(G).
- This produces: 0.43V(G) − 0.38V(B) = 10.
- Substituting in V(B) = −8 + 0.95V(G): 0.43V(G) + 3.04 − 0.361V(G) = 10 ⟹ 0.069V(G) = 6.96 ⟹ V(G) ≈ 100.9.
- Plugging V(G) into V(B): -8 + 0.95(100.9) = 87.8
To check whether this is an optimal policy, we want to make sure that the chosen actions yield the larger payoff in each state. We just calculated keep in state G and fix in state B, so we’ll compare that to fix in state G and keep in state B.
- keep(G) = 10 + 0.95[0.6(100.9) + 0.4(87.8)] ≈ 100.9
- fix(G) = −8 + 0.95V(G) = -8 + 0.95(100.9) = 87.9
So, playing keep in state G does produce a higher payoff. We’ll do the same for state B:
- keep(B) = 2 + 0.95V(B) = 2 + 0.95(87.8) = 85.4
- fix(B) = −8 + 0.95V(G) = -8 + 0.95(100.9) = 87.9
Like so, we’ve confirmed that our previous guess is the optimal policy, or the optimal “strategy” for this game.
This makes intuitive sense—as stated, we’re saying that when a machine is working (state G), we should run it (keep), and when it’s broken (B), we should fix it (fix). Paying -8 now is better than making only +2 in profit.
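By the way, you don’t have to trust my algebra: here’s a quick back-of-the-envelope Python sketch (just NumPy, nothing fancy) that runs value iteration on this same two-state problem and lands on the same numbers:

import numpy as np

beta = 0.95
# Immediate rewards u(s, a): rows = states (G, B), columns = actions (keep, fix)
u = np.array([[10.0, -8.0],
              [ 2.0, -8.0]])
# Transition probabilities P(s' | s, a), indexed as P[a][s, s']
P = np.array([
    [[0.6, 0.4],   # keep in G: 60% stay in G, 40% move to B
     [0.0, 1.0]],  # keep in B: stays in B
    [[1.0, 0.0],   # fix in G: back to G
     [1.0, 0.0]],  # fix in B: back to G
])

V = np.zeros(2)  # Start from a guess of zero for V(G), V(B)
for _ in range(1000):
    # Q[s, a] = u(s, a) + beta * sum over s' of P(s'|s,a) * V(s')
    Q = u + beta * np.array([[P[a][s] @ V for a in range(2)] for s in range(2)])
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new

print("V(G), V(B):", np.round(V, 1))                          # ≈ [100.9, 87.8]
print("Best actions (0 = keep, 1 = fix):", Q.argmax(axis=1))  # [0, 1] → keep in G, fix in B

The printed values match the V(G) ≈ 100.9 and V(B) ≈ 87.8 we found by hand, and the recovered policy is keep in G, fix in B.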
For our second example, we’ll work through something a little more relevant to an economics-related course or job. We’ll use a Python script to calculate the optimal policy with the Bellman equation instead of calculating by hand.
A quick note: depending on where you are in your studies of economics and/or finance and computer science, you may or may not be comfortable with the notation and concepts involved in optimization problems of this sort. If not, I think it’s best to keep in mind that the notation can make problems feel significantly more mathematically difficult than they are. Problems will (usually) be much easier if you take time to understand the question and are generally comfortable with the notation.
Infinite Horizon Consumption–Saving Problem
For this problem, we’ll set up an agent deciding whether to consume or save. The agent’s current wealth is some non-negative number Wt ≥ 0. This (wealth) is the state.
For the action, the agent chooses how much to consume c in the current period t. The choice must exist on an interval between 0 and the wealth, which can be thought of as the budget constraint: ct ∈ [0,Wt]. Wealth not consumed is saved (savings in current period = wealth in current period − consumption in current period):
- st = Wt− ct
The immediate reward function u(s,a) takes the general form u(Wt, ct) since wealth is the state and consumption is the action. The specification of this function is:
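the utility of consumption alone, in the same power-utility form the Python script later on implements:

u(W_t, c_t) = u(c_t) = \frac{c_t^{\,1-\gamma}}{1-\gamma}

(with u(c) = \ln c in the special case γ = 1).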
With gamma (γ) being a parameter (subject to γ > 0) that is the notational norm in Constant Relative Risk Aversion (CRRA) utility models and represents relative risk aversion. The higher the value of gamma, the more risk-averse the agent is (in turn affecting consumption versus savings decisions).
The state transition path—how wealth “updates” in the next period—is the wealth saved in this period plus a stochastic (meaning random) shock. Intuitively, the amount of money you have tomorrow is modeled pretty well as how much you save today plus some randomness, since you aren’t fully sure of your income tomorrow. Note that wealth saved in this period appreciates by a risk-free interest rate r:
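That gives the wealth transition (the same rule the script’s W_next line computes):

W_{t+1} = (1+r)\,(W_t - c_t) + Y_{t+1}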
The exogenous shock Yt+1 is drawn (randomly selected) from a known distribution Fy. Fy (F with a subscript y) can denote any distribution we choose (usually, you’d want to match the distribution to empirical data). We could say, for example, that Fy represents a normal (Gaussian) distribution.
We can say that P(Wt+1 | Wt,ct) is induced by Y ∼ Fy, or that the wealth in the next period is conditional upon wealth and consumption in this period, and is also dependent upon Fy because Y is determined randomly by Fy and this randomness carries over into the Wt+1 calculation.
The discount factor is 0.952 given β ∈ (0,1). β, more specifically, equals 1/(1+rate), so an annual discount rate of 5% produces a beta value of 0.952, and so on.
That all said, this problem differs from the previous example, which had only two periods. Instead, here, we’re dealing with an infinite horizon optimization. With such a problem, we’ll converge to an answer over time through the recursive iteration method described previously.
Convergence is the process of guessing a value, plugging it back into the equation to produce an updated value, and continuing - as we (well, our script) keep plugging in previous values, the change in the V(x) output gets smaller and smaller, until it eventually “converges” to a fixed point (again, I like the Clue board game analogy for this). Once we have that point, we can derive the action for each state which attains that maximum value. The resulting set of state-action pairs is the solution to the problem.
Ok, here’s the Bellman equation - in this case, value as a function of wealth (the state):
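Putting the CRRA reward, the budget constraint, and the wealth transition together (same notation as above):

V(W) = \max_{0 \le c \le W} \left\{ \frac{c^{\,1-\gamma}}{1-\gamma} + \beta \, \mathbb{E}\left[ V\big( (1+r)(W-c) + Y \big) \right] \right\}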
The equation is just like the previous ones we’ve looked at - the immediate reward function (the fraction to the left of the “+”) is equal to u(c), the state and action (W,c) are used in the continuation value, and we’re maximizing subject to consumption c being a positive fraction of wealth W.
We know that wealth in the next period is found through a random income shock in the next period added to appreciated saved wealth in the current period:
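That is, as before:

W_{t+1} = (1+r)\,(W_t - c_t) + Y_{t+1}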
Solving this fixed-point equation (meaning that one fixed set of values is the solution) gives the lifetime value V(⋅), and the consumption that attains this max in each wealth state is the optimal consumption policy c(W).
A quick note: we’re using value iteration here, since we’re guessing-and-checking a certain value and updating our Bellman equation with it to converge on a lifetime value. You will also hear about policy iteration, which is another dynamic programming approach in which a policy is calculated and recursively improved in each iteration, instead of values. While policy iteration tends to converge faster, the best algorithm to use depends upon the problem at hand.
Now, an analytical (closed-form) solution isn’t available here because income is random. This is why we have to use numerical methods, which are methods where we approximate an answer (convergence being one way of doing so).
With all this setup in mind, here are the steps we want our Python script to execute to successfully perform value iteration with the Bellman equation:
- Choose an initial guess V0(W) = 0
- Create a discrete “grid” for wealth.
- For each point on the wealth grid, compute the Bellman equation to update the value.
- Return the value as the new guess, and repeat.
- Stop this process once the changes in output values get extremely small.
- Return the input value that maximizes the function given the converged value.
A quick note on step two: wealth is fundamentally continuous and can take on an infinite number of values. To get around this problem, we “discretize” the state space by defining a maximum, minimum, and interval for the computer to consider. For example, we could set the minimum and maximum to 1 and 100, and the interval to 1. Like so, the computer will consider wealth states of 1, 2, 3 … 100. We’ll create more than one grid (one for consumption as well), and so we create nested grids of sorts, where each discrete wealth grid point contains a consumption grid within it, each point representing a different equation to calculate and “run up the chain.”
Here’s the script, which we’ll break down (you can run it yourself in terminal to follow along!):
"""
Infinite-Horizon Consumption–Saving Problem using Value Iteration
"""
import numpy as np
import matplotlib.pyplot as plt
# 1. PROBLEM PARAMETERS
# Economic parameters
beta = 0.95238 # Discount factor (patience): how much future utility matters
gamma = 1.5 # Risk aversion coefficient: higher = more risk averse
r = 0.03 # Risk-free interest rate on savings
# Income process
Y_vals = np.linspace(0, 2, 5) # Possible income values
Y_prob = np.full_like(Y_vals, 1/len(Y_vals)) # Equal probabilities
# 2. NUMERICAL SETUP
# Wealth grid (state space)
W_min, W_max = 0.0, 20.0 # Wealth bounds
N_W = 300 # Number of wealth grid points
W_grid = np.linspace(W_min, W_max, N_W)
# Consumption grid (for optimization at each state)
N_C = 40 # Number of consumption choices to evaluate per wealth level
# Convergence parameters
max_iter = 400 # Maximum iterations for value iteration
tol = 1e-6 # Convergence tolerance
print(f"\nNumerical Setup:")
print(f" Wealth grid: [{W_min}, {W_max}] with {N_W} points")
print(f" Consumption choices per state: {N_C}")
print(f" Max iterations: {max_iter}, tolerance: {tol}")
# 3. UTILITY FUNCTION
def utility(c):
if gamma == 1.0:
return np.log(c)
else:
return (c ** (1 - gamma)) / (1 - gamma)
# 4. VALUE ITERATION ALGORITHM
print(f"\nStarting Value Iteration...")
# Initialize arrays
V_old = np.zeros(N_W) # Current value function
V_new = np.zeros(N_W) # Updated value function
policy_c = np.zeros(N_W) # Optimal consumption policy
# Track convergence
error_history = []
# Main value iteration loop
for iteration in range(max_iter):
# For each wealth level, find optimal consumption
for i, W in enumerate(W_grid):
# Create consumption grid: can't consume more than current wealth
C_grid = np.linspace(1e-8, W, N_C) # Small lower bound avoids log(0)
# Evaluate value for each consumption choice
values = np.full(N_C, -np.inf) # Initialize with very low values
for j, c in enumerate(C_grid):
# Current period utility
current_utility = utility(c)
# Next period wealth for each possible income
W_next = (W - c) * (1 + r) + Y_vals
# Expected continuation value (interpolate since W_next may not be on grid)
expected_continuation = np.dot(Y_prob, np.interp(W_next, W_grid, V_old))
# Total value = current utility + discounted expected future value
values[j] = current_utility + beta * expected_continuation
# Choose consumption that maximizes value
best_choice = np.argmax(values)
V_new[i] = values[best_choice]
policy_c[i] = C_grid[best_choice]
# Check convergence
max_error = np.max(np.abs(V_new - V_old))
error_history.append(max_error)
if max_error < tol:
print(f"✓ Converged after {iteration + 1} iterations")
print(f" Final max error: {max_error:.2e}")
break
# Update for next iteration
V_old = V_new.copy()
# Progress indicator
if (iteration + 1) % 50 == 0:
print(f" Iteration {iteration + 1}: max error = {max_error:.2e}")
else:
print(f"⚠ Warning: Reached maximum iterations ({max_iter}) without full convergence")
print(f" Final max error: {max_error:.2e}")
# 5. RESULTS
print(f"\nResults Summary:")
print(f" Converged: {'Yes' if max_error < tol else 'No'}")
print(f" Final iterations: {len(error_history)}")
print(f" Wealth range: [{W_grid[0]:.1f}, {W_grid[-1]:.1f}]")
print(f" Consumption range: [{policy_c.min():.3f}, {policy_c.max():.3f}]")
Let’s break down that script by parts and then we’ll look at visualizations and convergence.
First (note the code below), we set the parameters for the model. These are exogenous (determined by us) and usually calibrated empirically when attempting to simulate a real-world scenario. The Y_vals line sets up an array of 5 equally spaced values between 0 and 2, and Y_prob assigns equal probabilities of occurring to those values. Cumulatively, they form the stochastic income shock and the corresponding “random” distribution (it is not that the distribution is random, but rather the draw is random according to the distribution):
# 1. PROBLEM PARAMETERS
# Economic parameters
beta = 0.95238 # Discount factor; how much future utility matters
gamma = 1.5 # Risk aversion coefficient; higher = more risk averse
r = 0.03 # Risk-free interest rate on savings
# Income process
Y_vals = np.linspace(0, 2, 5) # Possible income values
Y_prob = np.full_like(Y_vals, 1/len(Y_vals)) # Equal probabilities
Next, we set up the mentioned wealth grid. We establish the minimum, maximum, and number of points, and use np.linspace to create an array, evenly spacing those 300 grid points throughout the defined min/max interval. We then apply convergence bounds by setting the maximum number of possible iterations to 400 and the convergence threshold (the maximum change allowed between iterations before we declare convergence) to a very small number:
# 2. NUMERICAL SETUP
# Wealth grid (state space)
W_min, W_max = 0.0, 20.0 # Wealth bounds
N_W = 300 # Number of wealth grid points
W_grid = np.linspace(W_min, W_max, N_W)
# Consumption grid (for optimization at each state)
N_C = 40 # Number of consumption choices to evaluate per wealth level
# Convergence parameters
max_iter = 400 # Maximum iterations for value iteration
tol = 1e-6 # Convergence tolerance
print(f"\nNumerical Setup:")
print(f" Wealth grid: [{W_min}, {W_max}] with {N_W} points")
print(f" Consumption choices per state: {N_C}")
print(f" Max iterations: {max_iter}, tolerance: {tol}")
Third, we set up the utility function u(c) of the standard CRRA form. There’s a special case when gamma equals 1; if this happens, we take the log of consumption (this is standard to the CRRA setup and reflects diminishing marginal returns). If not, we adhere to the utility function described: c^(1−gamma)/(1−gamma).
# 3. UTILITY FUNCTION
def utility(c):
    if gamma == 1.0:
        return np.log(c)
    else:
        return (c ** (1 - gamma)) / (1 - gamma)
Fourth, we run through the core value iteration. We initialize V_old as the value function from the previous iteration, V_new the updated value function for the current iteration (computed based on V_old), and policy_c the optimal consumption policy that we’re trying to find.
Now comes the meat of the script: where we actually recursively solve the problem. Here’s the high-level setup:
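In plain terms:

- For each iteration (up to max_iter):
  - For each wealth level W on the wealth grid:
    - For each feasible consumption choice c between (nearly) 0 and W, compute the immediate utility u(c) plus beta times the expected value of V_old at next-period wealth.
    - Keep the consumption choice with the highest total: store that total in V_new and the choice in policy_c.
  - If the largest change between V_new and V_old is below tol, stop; otherwise set V_old = V_new and repeat.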
Translating this process into code, we use a nested for loop, create another array (grid) for consumption, and calculate the utility for each consumption point given the wealth point in another nested for loop. Note that argmax, short for “argument of the maximum”, returns the index of the entry that maximizes a given function (in this case, the index of the best consumption choice). That best_choice index is then used to update the value function and the consumption policy.
# 4. VALUE ITERATION ALGORITHM
print(f"\nStarting Value Iteration...")

# Initialize arrays
V_old = np.zeros(N_W)     # Current value function
V_new = np.zeros(N_W)     # Updated value function
policy_c = np.zeros(N_W)  # Optimal consumption policy

# Main value iteration loop
for iteration in range(max_iter):
    # For each wealth level, find optimal consumption
    for i, W in enumerate(W_grid):
        # Create consumption grid: can't consume more than current wealth
        C_grid = np.linspace(1e-8, W, N_C)  # Small lower bound avoids log(0)

        # Evaluate value for each consumption choice
        values = np.full(N_C, -np.inf)  # Initialize with very low values
        for j, c in enumerate(C_grid):
            # Current period utility
            current_utility = utility(c)

            # Next period wealth for each possible income
            W_next = (W - c) * (1 + r) + Y_vals

            # Expected continuation value (interpolate since W_next may not be on grid)
            expected_continuation = np.dot(Y_prob, np.interp(W_next, W_grid, V_old))

            # Total value = current utility + discounted expected future value
            values[j] = current_utility + beta * expected_continuation

        # Choose consumption that maximizes value
        best_choice = np.argmax(values)
        V_new[i] = values[best_choice]
        policy_c[i] = C_grid[best_choice]
We finally check for convergence by comparing the spread between the new value function and the old one to the tol value. If it’s smaller than the tol threshold, the value function is considered converged and the script exits the loop.
    # Check convergence
    max_error = np.max(np.abs(V_new - V_old))
    if max_error < tol:
        print(f"✓ Converged after {iteration + 1} iterations")
        print(f" Final max error: {max_error:.2e}")
        break

    # Update for next iteration
    V_old = V_new.copy()
else:
    # Note: this `else` belongs to the `for` loop (it runs only if the loop never hit `break`)
    print(f"⚠ Warning: Reached maximum iterations ({max_iter}) without full convergence")
    print(f" Final max error: {max_error:.2e}")
I won’t expand on section #5 — that’s just printing results.
So, when running the script, the terminal output shows how the algorithm “improves” its guesses over time through the Bellman equation, with a max error (spread) of 0.0274 at iteration 50, then 0.00355 at iteration 100, and so on, before converging with a spread of 0.0000000985 after 357 iterations.
You can also plot the script’s output to see the insights it delivers (using the matplotlib Python library we imported at the top of the script):
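The exact figures will depend on your matplotlib setup, but a minimal plotting sketch using the arrays the script produces (W_grid, V_new, policy_c, and error_history) looks like this:

# Plotting sketch: value function, consumption policy, and convergence path
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

axes[0].plot(W_grid, V_new)
axes[0].set_xlabel("Wealth W")
axes[0].set_title("Value function V(W)")

axes[1].plot(W_grid, policy_c)
axes[1].set_xlabel("Wealth W")
axes[1].set_title("Optimal consumption policy c(W)")

axes[2].semilogy(error_history)
axes[2].set_xlabel("Iteration")
axes[2].set_title("Max error per iteration (log scale)")

plt.tight_layout()
plt.show()

The third panel simply re-plots the shrinking max error per iteration we just walked through in the terminal output.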
That concludes our two examples!
Summary
Here’s a quick rundown of everything we’ve covered to this point:
- Foundational concepts: states, actions, rewards, dynamic programming, and reinforcement learning.
- Bellman Equation structure: immediate rewards + discounted expected future values.
- Uncertainty in Bellman equations: probability distributions and expected values to model stochastic shocks.
- Policies versus actions: the difference between choices (actions) and strategies (decision rules/policies).
- Solution methods: value iteration as a computational approach using Bellman equations to solve optimization problems.
- Two examples using the Bellman equation: a 2-period machine maintenance example solved by hand, and an infinite-horizon consumption-savings model requiring numerical methods to solve.
- Implementation techniques: discretizing continuous variables, grid-based optimization, and convergence criteria.
There’s a lot more to the world of optimization methods that Bellman methods open up, but we’ve covered the essentials. If I missed anything, or you’re curious about a related subject, make sure to leave it in the comments. Thanks for reading!
If you liked this article, you’ll probably like this article I just wrote about Monte Carlo simulations, which rely on many similar concepts we’ve employed when exploring the Bellman equations and are a very useful applied means of simulating the real world in economics, finance, and beyond.
Or, if you’re into (or studying!) economics, you’re in the right place, and make sure to check out my other articles here.