Module 6.1

Reinforcement Learning

Learn how AI agents learn from interaction with their environment. Master the fundamentals of rewards, policies, and value functions. Build intelligent agents that can play games and solve complex decision-making problems.

60 min
Intermediate
Hands-on
What You'll Learn
  • RL fundamentals & terminology
  • Markov Decision Processes
  • Q-Learning algorithm
  • Deep Q-Networks (DQN)
  • Building game-playing agents
Contents
01

What is Reinforcement Learning?

Reinforcement Learning (RL) is a type of machine learning where an agent learns to make decisions by interacting with an environment. Unlike supervised learning where we have labeled data, RL agents learn from trial and error, receiving rewards or penalties for their actions. Think of it like training a pet: good behavior gets treats, bad behavior doesn't.

Key Concept

Reinforcement Learning Defined

Reinforcement Learning is a computational approach to learning from interaction. An agent takes actions in an environment to maximize cumulative reward over time. The agent isn't told which actions to take, but must discover which actions yield the most reward through experience.

Real-World Analogy: Imagine learning to ride a bike. Nobody gives you step-by-step instructions for every muscle movement. Instead, you try things, fall down (negative reward), make adjustments, and eventually learn to balance (positive reward). That's reinforcement learning!

The Three Paradigms of Machine Learning

To understand where RL fits, let's compare it with the other main types of machine learning:

Supervised Learning

Learn from: Labeled examples

Goal: Predict labels for new data

Example: Classifying emails as spam/not spam using labeled training data

Unsupervised Learning

Learn from: Unlabeled data

Goal: Find patterns and structure

Example: Grouping customers by purchasing behavior without predefined categories

Reinforcement Learning

Learn from: Rewards and penalties

Goal: Maximize cumulative reward

Example: Training an AI to play chess by rewarding wins and penalizing losses

Famous RL Success Stories

Reinforcement learning has achieved remarkable milestones in artificial intelligence:

  • AlphaGo (2016): DeepMind's AI defeated world champion Lee Sedol in Go, a game with more possible positions than atoms in the universe. AlphaGo combined deep neural networks with Monte Carlo tree search, learning initially from human games and then improving through self-play. This achievement was considered a decade ahead of predictions.
  • Atari Games (2013–2015): DeepMind's Deep Q-Network (DQN) learned to play 49 different Atari games, many at a superhuman level, using only raw pixel inputs and game scores. The same algorithm mastered games as diverse as Breakout, Space Invaders, and Pong without any game-specific modifications.
  • OpenAI Five (2019): This team of five neural networks defeated the world champions in Dota 2, a complex multiplayer strategy game with thousands of possible actions per timestep. The agents trained for the equivalent of 45,000 years of gameplay through self-play.
  • Self-Driving Cars: Companies like Waymo and Tesla use RL to help autonomous vehicles make real-time driving decisions. RL handles complex scenarios like merging into traffic, navigating intersections, and responding to unpredictable pedestrian behavior.
  • Data Center Cooling (Google): Google applied RL to optimize cooling systems in their data centers, reducing the energy used for cooling by 40%. The agent learned to adjust hundreds of variables like fans, cooling systems, and windows to minimize power usage while maintaining safe temperatures.
Beginner Tip: The key difference between RL and other ML types is sequential decision making. In RL, today's action affects tomorrow's situation. It's not just about making one good prediction—it's about making a series of good decisions over time.
02

RL Components & Terminology

Before diving into algorithms, let's build a solid foundation by understanding the key components that power every reinforcement learning system. Think of these as the vocabulary you need to "speak" RL fluently—master these terms, and everything else will click into place.

Consider a simple analogy: teaching a dog to fetch. The dog (agent) observes where the ball landed (state), decides to run toward it (action), and receives a treat when it brings the ball back (reward). Through repetition, the dog learns the optimal strategy (policy) to maximize treats. Reinforcement learning operates on precisely the same principle—but with mathematical rigor. Whether it's mastering video games, orchestrating robotic movements, or navigating financial markets, every RL system relies on these same fundamental building blocks. Once you internalize them, you'll be equipped to decode research papers, implement cutting-edge algorithms, and architect your own intelligent solutions.

The Agent-Environment Interaction

Every RL system revolves around two key players: the agent (the learner and decision-maker) and the environment (the world it interacts with). They engage in a continuous dialogue through a simple yet powerful loop:

The agent is the AI system we're training—it could be a neural network controlling a game character, an algorithm managing a robot arm, or a model making trading decisions. The agent's job is to learn from experience and improve its behavior over time. It doesn't know the rules of the environment in advance; it must discover them through trial and error.

The environment encompasses everything outside the agent's direct control. In a video game, the environment includes the game physics, enemy behavior, and level layout. For a self-driving car, it's the roads, traffic, pedestrians, and weather conditions. The environment responds to the agent's actions by transitioning to new states and providing rewards—but it never explains why things happened or how to do better. The agent must figure that out on its own.

This separation is crucial: the agent can only influence the environment through its actions, and the environment only communicates back through states and rewards. There's no teacher providing instructions, no labeled dataset showing correct answers—just this ongoing interaction where the agent learns by doing.

The RL Loop

Agent-Environment Interaction Cycle

  1. Observe: The agent perceives the current state of the environment through sensors or data inputs. This could be raw pixels from a game screen, sensor readings from a robot, or numerical features describing a situation. The quality and completeness of this observation directly impacts learning.
  2. Act: Based on its current policy (strategy), the agent selects and executes an action from the available options. Early in training, actions may be random (exploration); later, they become more deliberate as the agent learns which actions lead to better outcomes.
  3. Reward: The environment provides immediate feedback as a numerical reward signal. Positive rewards reinforce good behavior, negative rewards discourage bad choices, and the magnitude indicates importance. This is the only teaching signal the agent receives.
  4. Transition: The environment evolves to a new state based on the action taken. This transition may be deterministic (same action always leads to same result) or stochastic (some randomness involved). The agent has no control over how the environment transitions.
  5. Repeat: This cycle continues at each timestep until reaching a terminal state (goal achieved, game over, or maximum steps reached). One complete sequence from start to terminal state is called an episode.
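The five-step loop above can be sketched directly in code. The toy corridor environment below is a hypothetical stand-in (its `reset`/`step` interface mirrors the style popularized by Gym/Gymnasium, but nothing here depends on a real library):

```python
import random

class SimpleEnv:
    """Hypothetical 1-D corridor: start at position 0, goal at position 5."""
    def reset(self):
        self.pos = 0
        return self.pos  # initial state

    def step(self, action):
        self.pos += 1 if action == "right" else -1
        self.pos = max(0, self.pos)      # can't walk past the left wall
        done = self.pos == 5             # terminal state reached
        reward = 100 if done else -1     # -1 per step, +100 at the goal
        return self.pos, reward, done

random.seed(0)                           # for reproducibility
env = SimpleEnv()
state = env.reset()                      # 1. Observe
done, total_reward = False, 0
while not done:                          # 5. Repeat until the episode ends
    action = random.choice(["left", "right"])  # 2. Act (random policy)
    state, reward, done = env.step(action)     # 3. Reward  4. Transition
    total_reward += reward

print(f"Episode finished with total reward {total_reward}")
```

A random policy still finishes the episode eventually; learning algorithms like Q-Learning replace the random `choice` with something smarter.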

Core RL Terminology

Now that you understand the interaction loop, let's formalize the vocabulary. These four concepts—State, Action, Reward, and Policy—form the foundation of every RL algorithm. Understanding them deeply is essential because they appear in every paper, tutorial, and implementation you'll encounter.

Think of it this way: the State tells the agent "where am I?", the Action answers "what can I do?", the Reward reveals "how did I do?", and the Policy encodes "what should I do next?". Together, these four elements completely describe any RL problem. Whether you're training a robot to walk, an AI to play chess, or an algorithm to optimize energy usage—you're always working with states, actions, rewards, and policies.

We'll use a simple maze navigation scenario throughout these examples. Imagine an AI agent placed in a grid-based maze who must find the exit while avoiding traps. This intuitive setup makes it easy to visualize each concept before we move to more complex applications.

State (S)

A complete description of the current situation that captures everything the agent needs to make an optimal decision. States can be discrete (like board positions in chess) or continuous (like the exact position and velocity of a robot arm). The collection of all possible states is called the state space.

Maze Example: The agent's current position (row 2, column 3), what walls surround it, and possibly the distance to the goal. In more complex environments, the state might include the agent's velocity, energy level, or memory of past observations.

Action (A)

A choice the agent can make that affects the environment. Actions can be discrete (choose from a fixed set like up/down/left/right) or continuous (like setting a steering angle between -45° and +45°). The set of all possible actions is called the action space, and it may vary depending on the current state.

Maze Example: Move up, down, left, or right (4 discrete actions). In a self-driving car, actions might include continuous values for steering, acceleration, and braking simultaneously.

Reward (R)

A scalar feedback signal indicating how good or bad an action was in a given state. Rewards can be positive (encouragement), negative (punishment), or zero. The agent's ultimate objective is to maximize the cumulative reward over time, not just immediate reward. Designing good reward functions is both an art and a science—poorly designed rewards can lead to unexpected agent behavior.

Maze Example: +100 for reaching the goal, -1 for each step taken (encourages finding the shortest path), -50 for hitting a wall. The small negative step reward prevents the agent from wandering aimlessly.

Policy (π)

The agent's strategy or behavior—a mapping from states to actions (or probability distributions over actions). A deterministic policy always chooses the same action in a given state, while a stochastic policy may choose different actions with certain probabilities. The goal of RL is to find the optimal policy that maximizes expected cumulative reward.

Maze Example: "If there's a wall to the right, go up. If the goal is visible, move toward it. If at a dead end, backtrack." A trained policy essentially encodes the solution to the maze.
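These four concepts map onto very simple data structures. A minimal sketch for the maze example, with illustrative positions, rewards, and a hand-written partial policy (not a full maze solver):

```python
# State: the agent's (row, col) position in a grid maze
state = (2, 3)

# Action space: four discrete moves
actions = ["up", "down", "left", "right"]

# Reward function: +100 at the goal, -50 for walls, -1 per step
def reward(next_state, hit_wall, reached_goal):
    if reached_goal:
        return 100
    if hit_wall:
        return -50
    return -1

# Policy: a deterministic mapping from states to actions
# (only a few states shown; a full policy covers every state)
policy = {
    (2, 3): "right",   # move toward the goal
    (2, 2): "right",
    (3, 3): "up",
}
print(policy[state])  # the action the policy chooses in state (2, 3)
```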

Value Functions: Measuring Future Success

Value functions are one of the most important concepts in RL. Instead of just asking "What reward do I get now?", they ask "What's my expected total reward from here if I follow my policy?" This forward-looking perspective enables the agent to make decisions that sacrifice short-term gains for long-term success—like taking a longer path to avoid a trap, or accepting a small penalty now to reach a bigger reward later.

State-Value Function V(s): How good is it to be in state s? This measures the expected cumulative reward starting from state s and following the current policy thereafter. States closer to high rewards have higher values. The value function essentially creates a "landscape" where the agent can follow the gradient toward better outcomes.

# Conceptual example: V(s) = expected total reward from state s
# In a maze, V(s) would be higher for states closer to the goal

V = {
    "goal": 100,           # At the goal - maximum value!
    "one_step_away": 99,   # Very close to goal
    "start": 50,           # Far from goal, but reachable
}

Action-Value Function Q(s, a): How good is it to take action a in state s? This is called the Q-value (where Q stands for "quality") and is central to algorithms like Q-learning and DQN. Unlike V(s) which only tells you how good a state is, Q(s, a) tells you how good each specific action is—making it directly useful for decision-making. The optimal Q-function satisfies the property that always choosing the action with the highest Q-value yields the best possible behavior.

# Q(s, a) = expected total reward after taking action a in state s
# This tells us which action is best in each state

# At a T-junction in the maze:
Q = {
    ("junction", "go_left"): 30,    # Left path is longer
    ("junction", "go_right"): 80,   # Right path leads to goal faster
}
# The agent should go right!

Exploration vs Exploitation

One of the fundamental challenges in RL is balancing exploration (trying new things to discover better strategies) with exploitation (using what you already know works). This dilemma appears everywhere in life: Should you try that new restaurant, or go to your proven favorite? Should you apply for a risky dream job, or stay in your comfortable current position? In RL, getting this balance wrong can be catastrophic—too little exploration means the agent might never discover the best strategy, while too much exploration wastes time on random, unproductive actions.

Exploration

Trying new, untested actions to gather information about the environment. Exploration is essential early in learning when the agent knows little about which actions are good. Common strategies include random action selection, adding noise to actions, or seeking out unfamiliar states.

Risk: Might waste time on clearly bad actions or dangerous situations
Benefit: Might discover shortcuts, better strategies, or avoid getting stuck in local optima

Exploitation

Choosing the best action according to current knowledge to maximize immediate expected reward. Exploitation is crucial once the agent has learned enough to make informed decisions. Pure exploitation uses the policy greedily without any randomness.

Risk: Might miss better options that were never tried, or converge to suboptimal behavior
Benefit: Reliably good results based on proven experience, efficient use of learned knowledge

The most common solution is the epsilon-greedy strategy: with probability ε, explore randomly; otherwise, exploit the best known action.

import numpy as np

def epsilon_greedy_action(Q, state, epsilon=0.1):
    """
    Choose an action using epsilon-greedy strategy.
    
    Args:
        Q: Q-table mapping (state, action) to value
        state: Current state
        epsilon: Probability of exploring (default 10%)
    
    Returns:
        Selected action
    """
    if np.random.random() < epsilon:
        # Explore: choose a random action
        return np.random.choice(list(Q[state].keys()))
    else:
        # Exploit: choose the best known action
        return max(Q[state], key=Q[state].get)
The Exploration-Exploitation Dilemma: Too much exploration wastes time on random actions. Too much exploitation may cause the agent to get stuck in suboptimal behavior. Finding the right balance is key to successful RL!
03

Markov Decision Processes

Markov Decision Processes (MDPs) provide the mathematical foundation for reinforcement learning. An MDP formally describes the environment and allows us to reason about optimal behavior. Don't worry—we'll break it down into simple pieces!

If you've ever wondered how RL algorithms "know" what they're optimizing, the answer is MDPs. An MDP is essentially a mathematical description of the decision-making problem the agent faces. It specifies the rules of the game: what states exist, what actions are available, how the environment responds to actions, and what rewards are given. Once we frame a problem as an MDP, we have access to a powerful toolkit of algorithms that can find optimal solutions.

The beauty of MDPs is their generality. Almost any sequential decision-making problem can be formulated as an MDP: playing games, controlling robots, managing inventory, routing network traffic, personalizing recommendations—the list is endless. By understanding MDPs, you gain a universal framework for tackling a vast range of real-world challenges.

Mathematical Framework

What is an MDP?

A Markov Decision Process is defined by a tuple (S, A, P, R, γ) where:

  • S (State Space): The complete set of all possible states the agent might encounter. Each state should capture all relevant information about the current situation. For a chess game, a state is the full board configuration. For a robot, it might include position, velocity, and sensor readings. The state space can be finite (like a grid world with 100 cells) or infinite (like continuous physical positions).
  • A (Action Space): The set of all possible actions available to the agent. Actions are how the agent influences the environment. In some MDPs, the same actions are available in every state; in others, available actions depend on the current state (you can't move a chess piece that's captured). Actions can be discrete (turn left, turn right, go straight) or continuous (apply 73.5% throttle).
  • P(s'|s, a) (Transition Function): The probability of transitioning to state s' when taking action a in state s. This defines the "physics" of the environment—how it responds to actions. In deterministic environments, P equals 1 for one specific next state and 0 for all others. In stochastic environments, the same action might lead to different outcomes with different probabilities (like rolling dice or uncertain robot movements).
  • R(s, a, s') (Reward Function): The immediate reward received after transitioning from state s to state s' via action a. This is how we communicate our goals to the agent. Positive rewards encourage behaviors (reaching a goal: +100), negative rewards discourage them (hitting an obstacle: -50), and small penalties for time steps encourage efficiency (-1 per step). Reward design is crucial—poorly designed rewards lead to unexpected or undesirable behavior.
  • γ (Discount Factor): A value between 0 and 1 that determines the present value of future rewards. When γ=0, the agent is completely myopic and only cares about immediate rewards. When γ=1, future rewards are valued equally to immediate ones (only valid for finite-horizon tasks). Typical values like 0.95 or 0.99 make the agent forward-looking while still preferring sooner rewards over later ones. Mathematically, a reward k steps in the future is multiplied by γᵏ.

Together, these five components completely specify the decision-making problem. Given an MDP, the goal is to find an optimal policy π* that maximizes expected cumulative discounted reward from any starting state.
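To make the tuple concrete, here is a tiny hand-specified MDP written out as plain Python dictionaries. The two states, actions, and numbers are purely illustrative:

```python
# A minimal two-state MDP, written out as the tuple (S, A, P, R, γ)
S = ["cool", "hot"]                  # state space
A = ["wait", "run"]                  # action space
gamma = 0.95                         # discount factor

# P[s][a] = list of (probability, next_state) pairs
P = {
    "cool": {"wait": [(1.0, "cool")],
             "run":  [(0.5, "cool"), (0.5, "hot")]},
    "hot":  {"wait": [(0.8, "cool"), (0.2, "hot")],
             "run":  [(1.0, "hot")]},
}

# R[s][a] = expected immediate reward for taking a in s
R = {
    "cool": {"wait": 1, "run": 2},
    "hot":  {"wait": 0, "run": -10},
}

# Sanity check: transition probabilities sum to 1 for every (s, a)
for s in S:
    for a in A:
        assert abs(sum(p for p, _ in P[s][a]) - 1.0) < 1e-9
print("Valid MDP:", len(S), "states,", len(A), "actions")
```

Every tabular RL algorithm you'll meet (value iteration, Q-Learning) consumes exactly this kind of specification, either explicitly or implicitly through interaction.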

The Markov Property: "The Future Depends Only on the Present"

The key assumption in MDPs is the Markov property: the future state depends only on the current state and action, not on the history of how we got here. This is sometimes called the "memoryless" property, but don't let that name mislead you—we can always encode relevant history into the state itself if needed.

Mathematically, this is expressed as:

The Markov Property

P(sₜ₊₁ | sₜ, aₜ) = P(sₜ₊₁ | sₜ, aₜ, sₜ₋₁, aₜ₋₁, …, s₀, a₀)

In plain English: knowing where you are right now tells you everything about where you might go next. Knowing how you got here adds no additional predictive power.

Why is this property so important? It dramatically simplifies the problem. Without the Markov property, the agent would need to remember and reason about every past state and action—an exponentially growing burden. Imagine trying to make decisions based on millions of past observations! With the Markov property, the agent only needs to consider the current situation to make an optimal decision. This makes RL computationally tractable and allows us to use elegant mathematical tools like dynamic programming.

In practice, many real-world problems aren't perfectly Markov, but we can often make them approximately Markov by designing richer state representations. For example, if velocity matters for predicting the next position, we include velocity in the state. If the last 4 video game frames are needed to understand motion, we stack them together as a single state. The art of RL problem design often involves finding the right state representation that captures enough information to satisfy (or approximately satisfy) the Markov property.

Markov (Memoryless)

Chess: The current board position tells you everything you need to decide your next move. It doesn't matter if you reached this position through a brilliant sacrifice or a series of blunders—the optimal next move is the same. Two players arriving at identical board positions through completely different games face exactly the same strategic situation.

Grid World Navigation: Knowing the agent is at cell (3, 4) is sufficient to determine all possible next states and their probabilities. Whether the agent wandered randomly for 100 steps or took a direct path to reach (3, 4), the available actions and their outcomes are identical.

Tetris: The current arrangement of blocks on the board plus the current falling piece completely determines the game state. How you stacked previous pieces doesn't affect what happens when you place the current one.

Key Insight: In Markov environments, history is irrelevant for prediction. The current state acts as a "sufficient statistic" that summarizes all past information needed for future decisions.

Non-Markov (Has Memory)

Stock Market: Past price trends and momentum significantly affect future prices. A stock at $50 that was at $40 yesterday behaves differently than one at $50 that was at $60 yesterday—the first is on an uptrend, the second on a downtrend. Technical analysis relies entirely on historical patterns.

Poker & Hidden Information: In poker, the current visible cards don't tell the whole story. Past betting patterns reveal crucial information about opponents' hidden cards—a player who raised aggressively likely has strong cards. Without this history, you'd play suboptimally.

Dialogue Systems: A chatbot's response depends on the entire conversation history, not just the last message. "Yes" means different things depending on what question was asked earlier. Context from many turns back can be essential.

Solution: Transform the problem into a Markov one by expanding the state definition. Include last N observations, use recurrent neural networks (LSTMs/GRUs) to encode history, or maintain belief states that summarize relevant past information.

State Design Tip: If your agent's performance plateaus or behaves inconsistently, the state might not be Markov enough. Try adding more information to the state: recent history, velocities, or derived features. Remember—the state should contain everything the agent needs to make optimal decisions.
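One common way to make a non-Markov observation "Markov enough" is to stack the last few observations into a single state, which is exactly what DQN does with 4 Atari frames. A minimal sketch, assuming simple scalar observations:

```python
from collections import deque

class FrameStacker:
    """Keep the last k observations together as one state."""
    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)  # oldest frame drops out automatically

    def reset(self, first_obs):
        # Fill the buffer with copies of the first observation
        for _ in range(self.k):
            self.frames.append(first_obs)
        return tuple(self.frames)

    def step(self, obs):
        self.frames.append(obs)
        return tuple(self.frames)

# Example: positions 0, 1, 2, 3 — the stacked state now encodes velocity
stacker = FrameStacker(k=4)
state = stacker.reset(0)
for obs in [1, 2, 3]:
    state = stacker.step(obs)

print(state)  # (0, 1, 2, 3)
```

A bouncing ball observed only by position is non-Markov; the same ball observed through a stack of recent positions is (approximately) Markov, because velocity can be read off from consecutive frames.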

The Discount Factor (γ): Valuing Future Rewards

The discount factor γ (gamma) determines how much the agent values future rewards compared to immediate ones. It's always between 0 and 1, and it profoundly shapes the agent's behavior.

Think of it like this: would you rather have $100 today or $100 a year from now? Most people prefer money now—there's uncertainty in the future, and immediate rewards can be reinvested. The discount factor captures this preference mathematically. A reward received k steps in the future is multiplied by γᵏ, making it worth less than the same reward received immediately.

The choice of γ has significant practical implications. A low γ (like 0.5) creates a "greedy" agent that prioritizes quick wins, potentially missing larger delayed rewards. A high γ (like 0.99) creates a "patient" agent willing to sacrifice now for bigger payoffs later. The right value depends on your problem: for games with clear endings, use γ close to 1; for continuing tasks or when you need faster learning, use smaller values.

Discounted Return from time t

Gₜ = Rₜ₊₁ + γRₜ₊₂ + γ²Rₜ₊₃ + ⋯ = Σₖ γᵏ Rₜ₊ₖ₊₁

# Example: How discount factor affects value of future rewards
# Imagine you'll receive $10 every day forever

def calculate_total_value(daily_reward, gamma, days=100):
    """Calculate total discounted value of rewards."""
    total = 0
    for day in range(days):
        total += (gamma ** day) * daily_reward
    return total

# Different discount factors (summing 100 days):
print(f"γ=0.0 (only today matters): ${calculate_total_value(10, 0.0):.2f}")
print(f"γ=0.5 (halves each day):    ${calculate_total_value(10, 0.5):.2f}")
print(f"γ=0.9 (patient agent):      ${calculate_total_value(10, 0.9):.2f}")
print(f"γ=0.99 (very patient):      ${calculate_total_value(10, 0.99):.2f}")

# Output:
# γ=0.0 (only today matters): $10.00
# γ=0.5 (halves each day):    $20.00
# γ=0.9 (patient agent):      $100.00
# γ=0.99 (very patient):      $633.97
Choosing γ: Use γ close to 1 (like 0.99) when future rewards are important and the task has a clear endpoint. Use smaller γ (like 0.9) for tasks where you want the agent to prefer quicker rewards.

The Bellman Equation: The Heart of RL

The Bellman equation is the foundation of most RL algorithms. It expresses a beautiful recursive relationship: the value of a state equals the immediate reward plus the discounted value of the next state. This simple insight is remarkably powerful—it lets us break down the complex problem of evaluating long-term outcomes into simpler one-step calculations.

Named after mathematician Richard Bellman, this equation is the cornerstone of dynamic programming and reinforcement learning. The key insight is that we don't need to trace out all possible futures to evaluate a state—we just need to look one step ahead and rely on our estimates of future state values. This recursive structure enables efficient algorithms that iteratively improve value estimates until they converge to the true values.

There are two versions of the Bellman equation corresponding to our two value functions. The state-value version tells us how good it is to be in a state, while the action-value version tells us how good it is to take a specific action in a state. Q-learning, which we'll explore next, works directly with the action-value form.

State-Value Function under policy π

Vπ(s) = Σₐ π(a|s) Σₛ′ P(s′|s, a) [R(s, a, s′) + γ Vπ(s′)]

Action-Value Function (Q-function)

Qπ(s, a) = Σₛ′ P(s′|s, a) [R(s, a, s′) + γ Σₐ′ π(a′|s′) Qπ(s′, a′)]
# Bellman equation in Python (simplified)
# V(s) = immediate_reward + gamma * V(next_state)

def bellman_update(V, state, next_state, reward, gamma=0.99):
    """
    One step of Bellman update for value function.
    
    The value of current state = reward + discounted future value
    """
    V[state] = reward + gamma * V[next_state]
    return V

# Example: Simple 3-state chain
# Start -> Middle -> Goal (+10 reward)
V = {'start': 0, 'middle': 0, 'goal': 10}
gamma = 0.9

# Working backwards from goal:
V = bellman_update(V, 'middle', 'goal', 0, gamma)
print(f"V(middle) = 0 + 0.9 * 10 = {V['middle']}")  # 9.0

V = bellman_update(V, 'start', 'middle', 0, gamma)
print(f"V(start) = 0 + 0.9 * 9 = {V['start']}")    # 8.1

Practice: MDPs

Task: An agent receives rewards [1, 2, 3, 10] over 4 timesteps. Calculate the total discounted return from the start using γ=0.9.

Show Solution
def discounted_return(rewards, gamma):
    """Calculate total discounted return."""
    total = 0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total

rewards = [1, 2, 3, 10]
gamma = 0.9

G = discounted_return(rewards, gamma)
print(f"G = 1 + 0.9*2 + 0.81*3 + 0.729*10")
print(f"G = 1 + 1.8 + 2.43 + 7.29 = {G:.2f}")
# G = 12.52

Task: Create a 3x3 gridworld where the bottom-right cell is the goal (+1 reward). Use Bellman updates to calculate the value of each cell with γ=0.9.

Show Solution
import numpy as np

# 3x3 grid, goal at (2,2)
grid_size = 3
gamma = 0.9
goal = (2, 2)

# Initialize values
V = np.zeros((grid_size, grid_size))
V[goal] = 1.0  # Goal has value 1

# Actions: up, down, left, right
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def get_next_state(state, action):
    next_state = (state[0] + action[0], state[1] + action[1])
    # Stay in bounds
    if 0 <= next_state[0] < grid_size and 0 <= next_state[1] < grid_size:
        return next_state
    return state

# Value iteration
for iteration in range(50):
    V_new = V.copy()
    for i in range(grid_size):
        for j in range(grid_size):
            if (i, j) == goal:
                continue
            # Max over all actions (greedy)
            values = []
            for action in actions:
                next_state = get_next_state((i, j), action)
                values.append(gamma * V[next_state])
            V_new[i, j] = max(values)
    V = V_new

print("Optimal Values:")
print(np.round(V, 2))

Task: For each scenario, determine if it's Markov or Non-Markov. If Non-Markov, suggest how to make it Markov.

  1. A robot's position (x, y) on a flat surface
  2. Predicting tomorrow's weather using only today's temperature
  3. Playing Tic-Tac-Toe given the current board state
  4. A bouncing ball's position without velocity information
Show Solution
# Analysis of each scenario:

# 1. Robot position (x, y) - NON-MARKOV
# Why: Position alone doesn't tell us velocity/direction
# Solution: Include velocity in state: (x, y, vx, vy)

# 2. Weather from today's temperature - NON-MARKOV  
# Why: Weather depends on pressure, humidity, season, trends
# Solution: Include multiple days of data, pressure, humidity

# 3. Tic-Tac-Toe board state - MARKOV
# Why: Current board completely determines legal moves
# and game outcome possibilities

# 4. Bouncing ball position - NON-MARKOV
# Why: Can't predict next position without velocity
# Solution: State = (x, y, vx, vy) or stack last 2 positions

print("Markov: Tic-Tac-Toe")
print("Non-Markov: Robot position, Weather, Bouncing ball")
print("\nKey insight: If you need history to predict the future,")
print("either add that history to the state or use RNNs!")

Task: An agent can either: (A) Get +1 reward immediately, or (B) Wait 5 steps then get +10 reward. For which values of γ should it choose option A vs B?

Show Solution
import numpy as np

def compare_options(gamma):
    """Compare immediate vs delayed reward."""
    option_a = 1.0  # Immediate +1
    option_b = (gamma ** 5) * 10  # +10 after 5 steps
    return option_a, option_b

print("γ\t\tOption A\tOption B\tBest Choice")
print("-" * 55)

for gamma in [0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]:
    a, b = compare_options(gamma)
    best = "A (immediate)" if a > b else "B (wait)"
    print(f"{gamma}\t\t{a:.2f}\t\t{b:.2f}\t\t{best}")

# Find the crossover point
# We need: 1 = γ^5 * 10
# γ^5 = 0.1
# γ = 0.1^(1/5) ≈ 0.631

crossover = 0.1 ** (1/5)
print(f"\nCrossover γ = {crossover:.3f}")
print("For γ < 0.631: Choose immediate reward")
print("For γ > 0.631: Choose delayed reward")

Task: Design a reward function for a maze agent that: (1) reaches the goal quickly, (2) avoids walls, and (3) prefers shorter paths. Consider edge cases and potential reward hacking.

Show Solution
def maze_reward(state, action, next_state, hit_wall, reached_goal, step_count):
    """
    Carefully designed reward function for maze navigation.
    
    Design considerations:
    - Goal reward >> step penalty to ensure reaching goal is priority
    - Wall penalty should discourage but not cause extreme avoidance
    - Small step penalty encourages efficiency without rushing
    """
    
    reward = 0
    
    # 1. Goal reward (large positive)
    if reached_goal:
        reward += 100.0
        # Bonus for reaching quickly (optional)
        # reward += max(0, 50 - step_count)  # Bonus decreases with steps
    
    # 2. Wall collision penalty (moderate negative)
    if hit_wall:
        reward -= 5.0
        # Note: Too high penalty might make agent afraid to explore
    
    # 3. Step penalty (small negative to encourage efficiency)
    reward -= 0.1
    # This encourages shorter paths but doesn't dominate
    
    # 4. Optional: Distance-based shaping (helps learning but can cause issues)
    # distance_to_goal = manhattan_distance(next_state, goal)
    # reward += 0.01 * (old_distance - distance_to_goal)  # Reward getting closer
    
    return reward

# Potential issues to consider:
print("Reward Hacking Scenarios:")
print("1. If step penalty too high → Agent may seek the fastest possible termination, even a bad one")
print("2. If wall penalty too high → Agent might stay still to avoid any risk")
print("3. If goal reward too low → Agent might find 'loops' that accumulate reward")
print("\nGood Practice: Goal reward should be ~100-1000x larger than step penalty")
04

Q-Learning Algorithm

Q-Learning is one of the most important RL algorithms. It's a model-free method that learns the optimal action-value function Q*(s,a) directly from experience, without needing to know the environment's dynamics. Let's build it step by step!

Invented by Christopher Watkins in 1989, Q-Learning revolutionized reinforcement learning by showing that an agent could learn optimal behavior without a model of the environment. Before Q-Learning, most RL methods required knowledge of transition probabilities—which states lead to which other states. Q-Learning threw out that requirement entirely: the agent simply tries actions, observes outcomes, and updates its estimates. This made RL practical for real-world problems where the environment dynamics are unknown or too complex to model.

The "Q" in Q-Learning stands for "quality"—Q(s, a) represents the quality of taking action a in state s. More precisely, it estimates the expected cumulative discounted reward if you take action a in state s and then follow the optimal policy thereafter. Once we know the Q-values for all state-action pairs, the optimal policy is trivial: just pick the action with the highest Q-value in each state!

What makes Q-Learning special is that it's off-policy: the agent can learn about the optimal policy while following a completely different (exploratory) policy. This separation allows the agent to explore freely—trying random actions, making mistakes, taking detours—while still converging to optimal behavior. In contrast, on-policy methods must balance learning with their current behavior, which can be more restrictive.

Algorithm

Q-Learning in a Nutshell

Q-Learning maintains a table Q(s, a) storing the expected return for each state-action pair. After each action, it updates this table using the rule:

Q-Learning Update Rule

Q(s, a) ← Q(s, a) + α [ r + γ · maxa′ Q(s′, a′) − Q(s, a) ]

Key Insight: Q-Learning is "off-policy"—it learns the optimal policy while following an exploratory policy. This means the agent can explore freely while still learning the best possible behavior!

Understanding the Q-Learning Update

Let's break down each part of the Q-Learning update equation:

# Q-Learning Update Breakdown
# Q(s,a) = Q(s,a) + alpha * (target - current)

# Where:
# target = r + gamma * max(Q(s', all_actions))
# current = Q(s, a)

def q_learning_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """
    Perform one Q-learning update.
    
    Args:
        Q: Q-table (dictionary of state -> action -> value)
        state: Current state
        action: Action taken
        reward: Reward received
        next_state: Resulting state
        alpha: Learning rate (how fast to learn)
        gamma: Discount factor (how much to value future)
    """
    # Current estimate
    current_q = Q[state][action]
    
    # Best possible future value (greedy lookahead); default 0 for unseen states
    max_future_q = max(Q.get(next_state, {}).values(), default=0.0)
    
    # Target: what we should move toward
    target = reward + gamma * max_future_q
    
    # TD Error: how wrong our current estimate is
    td_error = target - current_q
    
    # Update: move slightly toward the target
    Q[state][action] = current_q + alpha * td_error
    
    return Q

This function implements the core Q-Learning algorithm. The current_q retrieves our current estimate of the value of taking this action in this state. The max_future_q looks ahead to the next state and finds the best possible action we could take from there—this is the "greedy lookahead" that makes Q-Learning work. The target combines the immediate reward with the discounted future value, representing what our Q-value "should" be based on this experience. The td_error (Temporal Difference error) measures how wrong our current estimate was—positive means we underestimated, negative means we overestimated. Finally, we nudge our Q-value toward the target by a small amount controlled by alpha, gradually improving our estimates over many updates.

Q-Learning Parameters

Learning Rate (α)

Controls how much new information overrides old estimates. A learning rate of 1.0 would completely replace the old Q-value with the new target; a rate of 0.0 would never update. Too high causes oscillations and instability as Q-values swing wildly. Too low means painfully slow learning—the agent needs many more experiences to converge.

Typical value: 0.1 to 0.3

Tip: Start with 0.1 and adjust based on convergence speed and stability.
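Those extremes are easy to see in a tiny numeric sketch (toy numbers, no environment needed):

```python
def td_update(q, target, alpha):
    """One tabular update: move q toward target by fraction alpha."""
    return q + alpha * (target - q)

# alpha = 1.0 replaces the estimate outright; alpha = 0.0 never moves it
print(td_update(0.0, 10.0, alpha=1.0))  # 10.0
print(td_update(0.0, 10.0, alpha=0.0))  # 0.0

# alpha = 0.1 converges smoothly toward a fixed target of 10
q = 0.0
for _ in range(50):
    q = td_update(q, 10.0, alpha=0.1)
print(round(q, 2))  # ~9.95
```

With a moderate α, each update closes a fixed fraction of the remaining gap, which is why the estimate approaches the target smoothly instead of oscillating.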

Discount Factor (γ)

Determines how much the agent cares about future rewards versus immediate ones. With γ=0, the agent is completely short-sighted. With γ=0.99, a reward 100 steps away still has 37% of its original value. Higher γ values encourage strategic, long-term planning but require more training to propagate values through the state space.

Typical value: 0.9 to 0.99

Tip: Use higher γ for episodic tasks with clear goals, lower for continuing tasks.
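The numbers quoted above are easy to check directly (plain Python, no RL machinery):

```python
# How much of a reward survives 100 steps into the future?
print(f"{0.99 ** 100:.1%}")  # ~36.6% -- the "37%" quoted above
print(f"{0.9 ** 100:.2e}")   # effectively zero: gamma=0.9 is blind past ~10 steps

# Rule of thumb: effective planning horizon is roughly 1 / (1 - gamma)
print(round(1 / (1 - 0.9)))   # 10 steps
print(round(1 / (1 - 0.99)))  # 100 steps
```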

Exploration Rate (ε)

The probability of choosing a random action instead of the best-known action. Essential for discovering better strategies—without exploration, the agent might never find the optimal path. Typically starts at 1.0 (100% random) and decays toward a small minimum (like 0.01) as the agent becomes more confident in its learned values.

Typical decay: 0.995 per episode

Tip: Decay slowly enough to ensure thorough exploration before committing to exploitation.
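With the typical numbers above (start at 1.0, multiply by 0.995 each episode, floor at 0.01), it's worth knowing how long exploration actually lasts:

```python
import math

epsilon, decay, floor = 1.0, 0.995, 0.01
episodes = 0
while epsilon > floor:
    epsilon *= decay
    episodes += 1
print(episodes)  # 919 episodes until epsilon reaches the floor

# Closed form: decay^n = floor  =>  n = ln(floor) / ln(decay)
print(math.log(0.01) / math.log(0.995))  # ~918.7
```

So with these defaults the agent spends roughly the first 900 episodes exploring heavily; if your training budget is shorter, the decay should be faster.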

Complete Q-Learning Implementation

Let's implement Q-Learning for a simple environment. We'll use the FrozenLake environment from Gymnasium (the maintained successor to OpenAI Gym): a grid where an agent must navigate from a starting position to a goal while avoiding holes in the ice. This classic benchmark perfectly demonstrates how Q-Learning discovers optimal paths through trial and error.

First, we initialize the Q-table and set up our hyperparameters:

import numpy as np
import gymnasium as gym

# Create the environment
env = gym.make('FrozenLake-v1', is_slippery=False)

# Initialize Q-table with zeros
# Shape: (num_states, num_actions)
n_states = env.observation_space.n   # 16 states (4x4 grid)
n_actions = env.action_space.n       # 4 actions (left, down, right, up)
Q = np.zeros((n_states, n_actions))

# Hyperparameters
alpha = 0.1      # Learning rate
gamma = 0.99     # Discount factor
epsilon = 1.0    # Initial exploration rate
epsilon_decay = 0.995
epsilon_min = 0.01
num_episodes = 2000

Next, we define our action selection strategy using epsilon-greedy:

def choose_action(state, Q, epsilon):
    """
    Epsilon-greedy action selection.
    
    With probability epsilon: explore (random action)
    Otherwise: exploit (best known action)
    """
    if np.random.random() < epsilon:
        return env.action_space.sample()  # Random action
    else:
        return np.argmax(Q[state])         # Best action

Now the main training loop where the agent learns from experience:

# Training loop
rewards_per_episode = []

for episode in range(num_episodes):
    state, _ = env.reset()
    total_reward = 0
    done = False
    
    while not done:
        # Choose action using epsilon-greedy
        action = choose_action(state, Q, epsilon)
        
        # Take action, observe result
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        
        # Q-Learning update
        best_next_q = np.max(Q[next_state])
        td_target = reward + gamma * best_next_q * (not done)
        td_error = td_target - Q[state, action]
        Q[state, action] += alpha * td_error
        
        state = next_state
        total_reward += reward
    
    # Decay exploration rate
    epsilon = max(epsilon_min, epsilon * epsilon_decay)
    rewards_per_episode.append(total_reward)
    
    # Print progress every 500 episodes
    if (episode + 1) % 500 == 0:
        avg_reward = np.mean(rewards_per_episode[-100:])
        print(f"Episode {episode+1}, Avg Reward: {avg_reward:.2f}, Epsilon: {epsilon:.3f}")

Finally, we can visualize the learned Q-values and test our trained agent:

# Visualize learned policy
def print_policy(Q):
    """Print the learned policy as a grid."""
    actions = ['←', '↓', '→', '↑']
    for row in range(4):
        row_str = ""
        for col in range(4):
            state = row * 4 + col
            best_action = np.argmax(Q[state])
            row_str += f" {actions[best_action]} "
        print(row_str)

print("\nLearned Policy:")
print_policy(Q)

# Test the trained agent
def test_agent(Q, num_tests=100):
    """Test the learned policy."""
    successes = 0
    for _ in range(num_tests):
        state, _ = env.reset()
        done = False
        while not done:
            action = np.argmax(Q[state])  # Always exploit
            state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            if reward == 1:
                successes += 1
    return successes / num_tests

success_rate = test_agent(Q)
print(f"\nTest Success Rate: {success_rate*100:.1f}%")
Convergence Guarantee: Q-Learning is proven to converge to the optimal Q-values if: (1) every state-action pair is visited infinitely often, and (2) the learning rate decays appropriately. In practice, with enough training and exploration, it finds very good policies!
05

Deep Q-Networks (DQN)

Q-Learning works great for small environments, but what about games with millions of possible states (like Atari)? That's where Deep Q-Networks come in! DQN uses neural networks to approximate Q-values, enabling RL to scale to complex, high-dimensional problems.

Deep RL Breakthrough

What is DQN?

Deep Q-Network (DQN) replaces the Q-table with a neural network Q(s, a; θ) that takes a state as input and outputs Q-values for all actions. Instead of storing millions of Q-values in a table, the network learns to generalize across similar states.

2015 Breakthrough: DeepMind's DQN was the first algorithm to achieve human-level performance across many Atari games, learning directly from raw pixels!

Why Tables Don't Scale

Consider these state space sizes:

  • FrozenLake (4×4): 16 states → Q-table works perfectly
  • Atari Games: 210×160 pixels × 256 intensity levels ≈ 256^33,600 ≈ 10^80,000 states → Table impossible!
  • Self-Driving Car: Continuous sensor readings → Infinite states

Neural networks solve this by learning a function that generalizes: similar states should have similar Q-values.
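The comparison above can be made concrete with a few lines of arithmetic (counting raw 8-bit screens; the exact figure depends on the pixel encoding):

```python
import math

# FrozenLake: 16 states x 4 actions -- a trivially small Q-table
frozenlake_entries = 16 * 4
print(frozenlake_entries)  # 64 entries

# Atari: one 210x160 screen, 256 intensity levels per pixel
pixels = 210 * 160
digits = int(pixels * math.log10(256))  # decimal digits in 256^pixels
print(f"distinct screens: 256^{pixels}, a number with ~{digits} digits")
```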

DQN Key Innovations

Simply replacing the Q-table with a neural network doesn't work well—in fact, early attempts at combining neural networks with Q-Learning often failed spectacularly, with learning diverging or becoming unstable. The problem is that neural networks expect training data to be independent and identically distributed (i.i.d.), but RL experiences are highly correlated: consecutive frames in a game look almost identical, and the agent's actions directly influence what states it sees next.

DeepMind's breakthrough in 2013-2015 wasn't just using neural networks—it was introducing two crucial techniques that made deep RL actually work. These innovations transformed reinforcement learning from a technique that worked only on small problems to one that could master complex visual domains.

Experience Replay

Instead of learning from experiences as they happen (which creates correlated, non-i.i.d. data), we store experiences in a large replay buffer—typically holding hundreds of thousands of transitions. During training, we randomly sample batches from this buffer, breaking the temporal correlation and creating a more stable, diverse training set.

Why it helps:

  • Breaks correlation: Random sampling ensures consecutive experiences don't dominate training
  • Data efficiency: Each experience can be used for learning many times, not just once
  • Smoother gradients: Diverse batches prevent the network from overfitting to recent experiences
  • Retains rare events: Important but infrequent experiences (like reaching a goal) stay in the buffer

Target Network

In Q-Learning, we update Q(s,a) toward a target that depends on Q(s',a'). But if we're using the same network for both, every update changes the target we're trying to reach—it's like chasing a moving goalpost. The solution is to maintain two networks: a "policy network" that we update every step, and a "target network" that we freeze and only update periodically (e.g., every 1000 steps).

Why it helps:

  • Stable targets: The target values don't change with every gradient step
  • Prevents feedback loops: Updates to Q(s,a) don't immediately affect the target
  • Reduces oscillations: Learning converges more reliably instead of bouncing around
  • Avoids divergence: Without this, training often spirals out of control

DQN Architecture

Let's build a DQN step by step using PyTorch. First, we define the neural network that approximates Q-values:

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from collections import deque
import random

class DQN(nn.Module):
    """
    Deep Q-Network architecture.
    
    Takes a state as input, outputs Q-values for all actions.
    Uses fully-connected layers for simplicity.
    """
    def __init__(self, state_size, action_size, hidden_size=64):
        super(DQN, self).__init__()
        
        self.network = nn.Sequential(
            nn.Linear(state_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, action_size)
        )
    
    def forward(self, state):
        """Forward pass: state -> Q-values for all actions."""
        return self.network(state)

The network takes the current state (e.g., 4 values for CartPole: cart position, cart velocity, pole angle, pole angular velocity) and outputs Q-values for each possible action. The nn.Sequential creates a feedforward network with two hidden layers of 64 neurons each, using ReLU activation for non-linearity. The output layer has no activation—it directly outputs Q-value estimates for each action.
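As a sanity check on how compact this is, the parameter count for the CartPole sizes used later (4 inputs, two hidden layers of 64, 2 actions) can be tallied by hand:

```python
def mlp_params(layer_sizes):
    """Count weights + biases of a fully-connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

print(mlp_params([4, 64, 64, 2]))  # 4610 parameters
```

Fewer than five thousand numbers stand in for a Q-table that would need an entry for every point in a continuous four-dimensional state space.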

Next, we implement the experience replay buffer to store and sample past experiences:

class ReplayBuffer:
    """
    Experience Replay Buffer.
    
    Stores transitions and provides random batches for training.
    """
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)
    
    def push(self, state, action, reward, next_state, done):
        """Store a transition."""
        self.buffer.append((state, action, reward, next_state, done))
    
    def sample(self, batch_size):
        """Sample a random batch of transitions."""
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        
        return (
            torch.FloatTensor(np.array(states)),
            torch.LongTensor(actions),
            torch.FloatTensor(rewards),
            torch.FloatTensor(np.array(next_states)),
            torch.FloatTensor(dones)
        )
    
    def __len__(self):
        return len(self.buffer)

The buffer stores tuples of (state, action, reward, next_state, done) up to a maximum capacity. When full, old experiences are automatically removed (FIFO). The sample() method randomly selects a batch—this randomization is crucial because it breaks the correlation between consecutive experiences, which would otherwise cause the neural network to overfit to recent patterns.
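The two behaviors the buffer relies on, FIFO eviction and uniform random sampling, are both standard-library features; a minimal sketch:

```python
import random
from collections import deque

buffer = deque(maxlen=5)
for t in range(8):                      # push 8 transitions into a capacity-5 buffer
    buffer.append(("state", t))

print(list(buffer))                     # oldest 3 were evicted automatically (FIFO)
batch = random.sample(list(buffer), 3)  # uniform mini-batch, no replacement
print(len(batch))                       # 3
```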

Now the main DQN Agent class that handles action selection, learning, and target network updates:

class DQNAgent:
    """
    DQN Agent with experience replay and target network.
    """
    def __init__(self, state_size, action_size, 
                 lr=1e-3, gamma=0.99, epsilon=1.0,
                 epsilon_decay=0.995, epsilon_min=0.01):
        
        self.state_size = state_size
        self.action_size = action_size
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min
        
        # Q-Network (the one we train)
        self.q_network = DQN(state_size, action_size)
        
        # Target Network (frozen copy, updated periodically)
        self.target_network = DQN(state_size, action_size)
        self.target_network.load_state_dict(self.q_network.state_dict())
        
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=lr)
        self.memory = ReplayBuffer()
    
    def choose_action(self, state):
        """Epsilon-greedy action selection."""
        if np.random.random() < self.epsilon:
            return np.random.randint(self.action_size)
        
        with torch.no_grad():
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            q_values = self.q_network(state_tensor)
            return q_values.argmax().item()

Notice we create two identical networks: q_network (the one we actively train) and target_network (a frozen copy used to compute stable targets). The target network's weights are initially copied from the Q-network using load_state_dict(). The choose_action() method implements epsilon-greedy: with probability ε it explores randomly, otherwise it exploits by choosing the action with the highest Q-value. The torch.no_grad() context prevents unnecessary gradient computation during inference.

The learning method computes TD targets using the target network and updates the Q-network:

    def learn(self, batch_size=64):
        """Train on a batch of experiences."""
        if len(self.memory) < batch_size:
            return
        
        # Sample from replay buffer
        states, actions, rewards, next_states, dones = self.memory.sample(batch_size)
        
        # Current Q-values from main network
        current_q = self.q_network(states).gather(1, actions.unsqueeze(1))
        
        # Target Q-values from target network (no gradient!)
        with torch.no_grad():
            max_next_q = self.target_network(next_states).max(1)[0]
            target_q = rewards + self.gamma * max_next_q * (1 - dones)
        
        # Compute loss and update
        loss = nn.MSELoss()(current_q.squeeze(), target_q)
        
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        
        # Decay exploration rate
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)
    
    def update_target_network(self):
        """Copy weights from Q-network to target network."""
        self.target_network.load_state_dict(self.q_network.state_dict())

The learn() method is where the magic happens. First, it samples a random batch from replay memory. Then it computes current_q—the Q-values our network currently predicts for the actions that were actually taken (using gather() to select specific action indices). The target_q is computed using the Bellman equation: reward + γ × max(Q(next_state)) from the frozen target network. The (1 - dones) term ensures we don't add future value for terminal states. Finally, we minimize the MSE loss between predicted and target Q-values using backpropagation. The update_target_network() method periodically syncs the target network with the trained Q-network to slowly update our targets.
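The terminal-state masking described above is worth seeing in isolation; a small NumPy sketch with made-up numbers:

```python
import numpy as np

gamma = 0.99
rewards    = np.array([1.0, 1.0, 1.0])
max_next_q = np.array([50.0, 50.0, 50.0])
dones      = np.array([0.0, 0.0, 1.0])  # only the third transition ends the episode

targets = rewards + gamma * max_next_q * (1 - dones)
print(targets)
# First two bootstrap: 1 + 0.99 * 50 = 50.5
# The terminal one keeps only its reward: 1.0
```

Without the `(1 - dones)` mask, the network would learn to expect future reward beyond the end of the episode, which systematically inflates Q-values.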

Training Loop

Here's how to train a DQN agent on CartPole, a classic RL benchmark where you balance a pole on a moving cart:

import gymnasium as gym

# Create environment
env = gym.make('CartPole-v1')
state_size = env.observation_space.shape[0]  # 4
action_size = env.action_space.n              # 2

# Create agent
agent = DQNAgent(state_size, action_size)

# Training parameters
num_episodes = 500
target_update_freq = 10

# Training loop
scores = []
for episode in range(num_episodes):
    state, _ = env.reset()
    total_reward = 0
    done = False
    
    while not done:
        # Choose and execute action
        action = agent.choose_action(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        
        # Store experience and learn
        agent.memory.push(state, action, reward, next_state, done)
        agent.learn()
        
        state = next_state
        total_reward += reward
    
    scores.append(total_reward)
    
    # Update target network periodically
    if episode % target_update_freq == 0:
        agent.update_target_network()
    
    # Print progress
    if (episode + 1) % 50 == 0:
        avg_score = np.mean(scores[-50:])
        print(f"Episode {episode+1}, Avg Score: {avg_score:.1f}, Epsilon: {agent.epsilon:.3f}")

print(f"\nFinal Average Score (last 100): {np.mean(scores[-100:]):.1f}")

Each episode represents one complete game from start to finish. The loop follows the standard RL cycle: observe state, choose action (with exploration via epsilon), execute action, receive reward and next state, store experience, learn from batch, and repeat. The target network is updated every target_update_freq episodes (not every step) to provide stable learning targets. We track scores to monitor progress—when the rolling average approaches the maximum possible score (500 for CartPole), our agent has mastered the task. The epsilon decay ensures we explore early but exploit our learned knowledge later.

CartPole Goal: Keep the pole balanced for 500 steps. A score of 500 means perfect performance! DQN typically achieves this within a few hundred episodes of training.
06

Policy Gradients

While Q-Learning and DQN learn a value function and derive a policy from it, Policy Gradient methods take a fundamentally different approach: they directly learn the policy itself. Instead of asking "how valuable is this state-action pair?", policy gradients ask "how should I adjust my policy to get more reward?"

Policy gradient methods represent a major branch of reinforcement learning that has powered some of the most impressive recent achievements, from ChatGPT's training (via RLHF) to robotic manipulation and complex game-playing. Understanding both value-based (Q-Learning, DQN) and policy-based (Policy Gradients) methods gives you the complete picture of modern RL.

Value-Based vs Policy-Based Methods

In value-based methods (like Q-Learning and DQN), we learn Q(s, a)—the expected return for each state-action pair—and then derive our policy by always picking the action with the highest Q-value. The policy is implicit: it's just "argmax over Q-values."

In policy-based methods, we directly parameterize the policy πθ(a|s) and adjust the parameters θ to maximize expected reward. The policy outputs action probabilities directly—no Q-values needed!

Value-Based (Q-Learning, DQN)

Learn Q(s, a) → Derive policy from Q-values

  • Works best: Discrete action spaces (left, right, up, down)
  • Policy: Deterministic (always pick max Q)
  • Exploration: ε-greedy or similar
  • Examples: Atari games, board games, navigation

Policy-Based (Policy Gradients)

Learn πθ(a|s) directly → Output action probabilities

  • Works best: Continuous action spaces (motor control, angles)
  • Policy: Stochastic (sample from probability distribution)
  • Exploration: Built into the stochastic policy
  • Examples: Robotics, continuous control, fine-tuning LLMs

The Policy Gradient Intuition

The core idea of policy gradients is beautifully simple: increase the probability of actions that led to high rewards, and decrease the probability of actions that led to low rewards.

Imagine you're learning to throw darts. You throw 10 darts, some hit near the bullseye (high reward), others miss badly (low reward). Policy gradients say: "remember what you did for the good throws, and do more of that. Remember what you did for the bad throws, and do less of that." Over time, you'll naturally favor the throwing motions that work best.

Mathematically, we use the REINFORCE algorithm (also called vanilla policy gradient):

Policy Gradient Theorem

∇θ J(θ) = Eπ[ ∇θ log πθ(a|s) · Gt ]

Where:

  • J(θ) = Expected total reward under policy πθ
  • ∇θ = Gradient with respect to policy parameters θ
  • log πθ(a|s) = Log probability of taking action a in state s
  • Gt = Return (cumulative discounted reward) from time t

The intuition: if Gt is high (good outcome), we increase log πθ(a|s), making that action more likely. If Gt is low (bad outcome), we decrease it. The gradient naturally points toward better policies!

Simple REINFORCE Implementation

import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical

class PolicyNetwork(nn.Module):
    """Simple policy network that outputs action probabilities."""
    def __init__(self, state_size, action_size, hidden_size=64):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, action_size),
            nn.Softmax(dim=-1)  # Output probabilities
        )
    
    def forward(self, state):
        return self.network(state)
    
    def select_action(self, state):
        """Sample an action from the policy distribution."""
        probs = self.forward(state)
        dist = Categorical(probs)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return action.item(), log_prob

def reinforce_update(policy, optimizer, rewards, log_probs, gamma=0.99):
    """
    Update policy using REINFORCE algorithm.
    
    rewards: list of rewards from one episode
    log_probs: list of log probabilities for each action taken
    """
    # Calculate discounted returns
    returns = []
    G = 0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    
    # Normalize returns for stability
    returns = torch.tensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    
    # Calculate policy loss
    policy_loss = []
    for log_prob, G in zip(log_probs, returns):
        policy_loss.append(-log_prob * G)  # Negative because we want gradient ascent
    
    # Update policy
    optimizer.zero_grad()
    loss = torch.stack(policy_loss).sum()
    loss.backward()
    optimizer.step()
    
    return loss.item()

This implementation shows the core REINFORCE algorithm. The policy network outputs action probabilities via softmax, then we sample actions from this distribution (enabling stochastic exploration). After an episode, we compute discounted returns for each timestep, normalize them for training stability, and perform a gradient update that increases the probability of actions with high returns. The negative sign before log_prob * G converts our gradient descent optimizer into gradient ascent on expected reward.
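The backward accumulation of returns is the subtle part; here it is on a three-step episode with made-up rewards:

```python
gamma = 0.9
rewards = [1.0, 1.0, 1.0]     # rewards at t = 0, 1, 2

returns, G = [], 0.0
for r in reversed(rewards):   # walk the episode backward
    G = r + gamma * G
    returns.insert(0, G)

print([round(g, 2) for g in returns])  # [2.71, 1.9, 1.0]
```

Each entry is the reward at that step plus the discounted sum of everything after it, so earlier actions get credited with all the reward they eventually led to.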

Actor-Critic: Best of Both Worlds

Vanilla REINFORCE has high variance—returns can fluctuate wildly between episodes. Actor-Critic methods combine policy gradients with value functions to reduce this variance:

Actor

The policy network πθ(a|s) that decides what actions to take. This is the "actor" because it performs actions in the environment. It's updated using policy gradients to increase probability of good actions.

Critic

A value function V(s) that estimates how good each state is. This is the "critic" because it evaluates how well the actor is doing. It's trained using TD learning (like DQN) to accurately predict returns.

Instead of using raw returns Gt, actor-critic uses the advantage: A(s, a) = Q(s, a) - V(s). This tells us how much better (or worse) the chosen action was compared to the average action in that state. Using advantages dramatically reduces variance while keeping the updates unbiased.
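The variance reduction from a state-dependent baseline is easy to demonstrate with synthetic returns (made-up numbers, NumPy only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two regions of the state space: one is worth ~100 regardless of the
# action taken, the other ~10. Action quality only contributes +/- ~1.
returns_good = 100.0 + rng.normal(0, 1, 500)
returns_bad  = 10.0  + rng.normal(0, 1, 500)
all_returns  = np.concatenate([returns_good, returns_bad])

# Advantages: subtract each state's baseline V(s) (here, the per-region mean)
advantages = np.concatenate([
    returns_good - returns_good.mean(),
    returns_bad  - returns_bad.mean(),
])

print(round(all_returns.std(), 1))  # ~45: dominated by *which state* you were in
print(round(advantages.std(), 1))   # ~1: reflects only how good the action was
```

Raw returns mostly measure where the agent happened to be; subtracting V(s) leaves only the signal that matters for improving the policy, which action was better than average.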

Popular Policy Gradient Algorithms

PPO (Proximal Policy Optimization)

The most popular RL algorithm today. Uses a "clipped" objective to prevent too-large policy updates. Simple to implement, reliable, and works across many domains. Used by OpenAI for ChatGPT training (RLHF).

TRPO (Trust Region Policy Optimization)

Constrains policy updates using KL divergence to stay within a "trust region." More theoretically principled than PPO but harder to implement. Guarantees monotonic improvement under certain conditions.

A3C/A2C (Advantage Actor-Critic)

Runs multiple agents in parallel, each collecting experience independently. Asynchronous (A3C) or synchronous (A2C) updates. Great for utilizing multiple CPU cores efficiently.

SAC (Soft Actor-Critic)

Adds an entropy bonus to encourage exploration. Maximizes both reward and policy entropy. State-of-the-art for continuous control tasks like robotic manipulation.

When to use Policy Gradients: Continuous action spaces (robotics, motor control), when you need stochastic policies, when value-based methods struggle, or when fine-tuning language models (RLHF).
07

Real-World RL Applications

Reinforcement learning has moved far beyond academic research into real-world applications that impact millions of lives daily. From mastering complex games to controlling robots and optimizing industrial processes, RL is transforming industries. Let's explore the most exciting applications across game playing, robotics, and beyond.

Game Playing & Strategy

Games have been the proving ground for RL algorithms, offering complex challenges with clear reward signals. Each breakthrough in game AI has pushed the boundaries of what's possible:

Board Games

AlphaGo (2016): Defeated world champion Lee Sedol in Go, a game with 10^170 possible positions. Combined deep neural networks with Monte Carlo tree search.

AlphaZero (2017): Mastered Chess, Go, and Shogi from scratch through pure self-play—no human knowledge needed. Achieved superhuman performance in hours.

MuZero (2019): Learns the rules of games while learning to play them, without even knowing the game dynamics in advance.

Video Games

Atari DQN (2013–2015): DeepMind's breakthrough: a single algorithm learned to play 49 different Atari games from raw pixels, many at or above human level.

OpenAI Five (2019): Defeated world champions in Dota 2, a game with complex teamwork, long-term planning, and imperfect information.

AlphaStar (2019): Achieved Grandmaster level in StarCraft II, mastering real-time strategy with thousands of possible actions per second.

Poker & Imperfect Information

Libratus (2017): Beat top poker professionals in heads-up no-limit Texas Hold'em, mastering bluffing and opponent modeling.

Pluribus (2019): Extended to 6-player poker, defeating elite players in the most popular poker format.

These systems excel at deception, probabilistic reasoning, and adapting to opponent strategies—skills crucial for negotiation and strategic planning.

Robotics & Physical Control

Robotics is where RL truly shines—learning complex motor skills that are nearly impossible to program by hand. The physical world presents unique challenges: continuous action spaces, real-time constraints, and costly failures.

Robotic Manipulation

Dexterous Hands: OpenAI trained a robotic hand to solve a Rubik's cube using RL, learning human-like finger movements through simulation and transferring to the real world.

Grasping: Google's robotic arms learned to pick up diverse objects from bins, generalizing to items they'd never seen before.

Assembly: RL enables robots to insert pegs, assemble furniture, and perform precise manufacturing tasks.

Locomotion

Bipedal Walking: Boston Dynamics uses RL for agile robot movement. Agility Robotics' Cassie learned to walk, run, and recover from pushes through RL.

Quadrupeds: ANYmal and Spot learn to traverse difficult terrain, climb stairs, and maintain balance in challenging conditions.

Sim-to-Real: Robots train in simulation with randomized conditions, then transfer learned skills to physical hardware.

Autonomous Vehicles

Self-Driving Cars: Waymo and Tesla use RL for decision-making: when to change lanes, how to navigate intersections, handling edge cases.

Drones: Autonomous drone racing, warehouse navigation, and delivery systems use RL for agile flight control.

Racing: Gran Turismo Sophy achieved superhuman performance in realistic racing simulation, learning racing lines and opponent strategies.

Industrial & Commercial Applications

Data Center Cooling

DeepMind cut the energy used to cool Google's data centers by 40% using RL. The system learns to optimize thousands of cooling parameters in real time, saving millions of dollars and reducing environmental impact.

Language Model Fine-Tuning (RLHF)

ChatGPT and other modern LLMs use Reinforcement Learning from Human Feedback. Humans rate model outputs, training a reward model, which then guides policy gradient updates to make responses more helpful and safe.

Recommendation & Advertising

Netflix, YouTube, and TikTok use RL-based recommender systems that learn to maximize long-term engagement, not just immediate clicks.

Scientific Discovery

RL helps design new molecules and materials, optimize chemical reactions, and control plasma in nuclear fusion reactors—DeepMind's tokamak controller learned to shape plasma with magnetic coils through RL trained entirely in simulation.

Finance & Trading

Algorithmic trading systems use RL for portfolio optimization, market making, and execution strategies. The ability to learn from market dynamics without explicit rules is invaluable.

Healthcare

RL optimizes treatment plans for chronic diseases, radiation therapy dosing, and drug dosing in intensive care. It learns personalized policies that adapt to individual patient responses.

The Common Thread: RL excels wherever we have sequential decisions, delayed rewards, and environments too complex for handcrafted rules. If you can simulate it or learn from interaction, RL can optimize it.
08

Building an RL Agent

Let's put everything together and build a complete RL agent from scratch! We'll create an agent that learns to navigate a custom gridworld environment, demonstrating all the concepts we've learned.

Custom Gridworld Environment

First, let's create a simple but interesting environment where an agent must reach a goal while avoiding obstacles:

Step 1: Set Up the Environment Class

import numpy as np

class GridWorld:
    """
    A simple gridworld environment.
    
    The agent starts at top-left and must reach bottom-right (goal).
    Some cells have obstacles (negative reward) or bonuses.
    """
    def __init__(self, size=5):
        self.size = size
        self.state = None
        self.goal = (size - 1, size - 1)
        
        # Define rewards: default -0.1 (encourages efficiency)
        self.rewards = np.full((size, size), -0.1)
        self.rewards[self.goal] = 10.0    # Goal: big reward
        
        # Add some obstacles
        self.rewards[1, 1] = -5.0  # Obstacle
        self.rewards[2, 3] = -5.0  # Obstacle
        self.rewards[3, 1] = -5.0  # Obstacle
        
        # Actions: 0=up, 1=down, 2=left, 3=right
        self.actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]
        self.action_names = ['↑', '↓', '←', '→']

We define a 5x5 grid where the agent starts at the top-left corner and must navigate to the bottom-right goal. Each cell has a small negative reward (-0.1) to encourage the agent to find the shortest path. The goal gives a large positive reward (+10), while obstacles have severe penalties (-5). The four possible actions move the agent in cardinal directions.
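For intuition about what a good score looks like here: one shortest route hugs the top edge and then the right edge, taking 8 moves and touching no obstacle. Its undiscounted return is easy to compute by hand (a standalone check, not part of the class):

```python
# Return of an optimal 8-step path: 7 intermediate cells at -0.1, then the +10 goal
step_penalty, goal_reward, path_length = -0.1, 10.0, 8

undiscounted = step_penalty * (path_length - 1) + goal_reward
print(round(undiscounted, 1))  # 9.3
```

So a trained agent should converge toward episode rewards around 9.3, which is a useful benchmark when reading the training logs later.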

Step 2: Implement Reset and State Conversion

    def reset(self):
        """Reset agent to start position."""
        self.state = (0, 0)
        return self._get_state_index()
    
    def _get_state_index(self):
        """Convert 2D position to 1D index."""
        return self.state[0] * self.size + self.state[1]

The reset() method places the agent back at the starting position (0, 0) at the beginning of each episode. The _get_state_index() helper converts 2D grid coordinates into a 1D index for our Q-table. For a 5x5 grid, position (2, 3) becomes index 2×5+3=13.
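As a quick sanity check of that arithmetic (a standalone snippet, separate from the class):

```python
# 2D -> 1D index conversion for a 5x5 grid
size = 5
row, col = 2, 3

index = row * size + col
print(index)                 # 13

# divmod inverts the mapping, recovering the coordinates
print(divmod(index, size))   # (2, 3)
```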

Step 3: Implement the Step Function

    def step(self, action):
        """Execute action, return (next_state, reward, done)."""
        # Calculate next position
        row = self.state[0] + self.actions[action][0]
        col = self.state[1] + self.actions[action][1]
        
        # Stay in bounds
        row = max(0, min(self.size - 1, row))
        col = max(0, min(self.size - 1, col))
        
        self.state = (row, col)
        reward = self.rewards[row, col]
        done = (self.state == self.goal)
        
        return self._get_state_index(), reward, done

The step() function is the core of the environment. It takes an action, calculates the new position, and clamps it to stay within grid boundaries (if the agent tries to walk off the edge, it stays in place). It returns the new state index, the reward for that cell, and a boolean indicating whether the agent reached the goal.
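The clamping logic can be checked in isolation—for example, trying to move "up" from the top row (a standalone snippet using the same max/min pattern as step()):

```python
# Boundary clamping: moving "up" from the top row leaves the position unchanged
size = 5
row, col = 0, 2          # agent in the top row
d_row, d_col = -1, 0     # "up" action delta

row = max(0, min(size - 1, row + d_row))
col = max(0, min(size - 1, col + d_col))
print((row, col))  # (0, 2) -- the agent stays in place
```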

Step 4: Add Visualization

    def render(self):
        """Print a simple text view of the grid."""
        print("\nGrid World:")
        for i in range(self.size):
            row_str = ""
            for j in range(self.size):
                if (i, j) == self.state:
                    row_str += " A "  # Agent
                elif (i, j) == self.goal:
                    row_str += " G "  # Goal
                elif self.rewards[i, j] < -1:
                    row_str += " X "  # Obstacle
                else:
                    row_str += " . "
            print(row_str)

The render() method provides a simple text visualization of the grid. It shows the agent's current position (A), the goal (G), obstacles (X), and empty cells (.). This helps us understand the environment layout and debug the agent's behavior during training.

Training and Evaluating the Agent

Now let's train a Q-learning agent on our custom environment:

Step 1: Initialize Environment and Q-Table

# Initialize
env = GridWorld(size=5)
n_states = env.size * env.size   # 25 states
n_actions = 4                     # 4 actions

# Q-table
Q = np.zeros((n_states, n_actions))

# Hyperparameters
alpha = 0.1
gamma = 0.95
epsilon = 1.0
epsilon_decay = 0.99
epsilon_min = 0.01
num_episodes = 1000

We create the environment and initialize a Q-table with zeros. The table has 25 rows (one for each state) and 4 columns (one for each action). The hyperparameters control learning: alpha is the learning rate, gamma is the discount factor for future rewards, and epsilon controls exploration with exponential decay from 1.0 to 0.01.
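It helps to know how fast this schedule actually decays. Simulating it with the same constants (a standalone sketch, independent of the training loop):

```python
# Simulate the exponential epsilon schedule used in training
epsilon, epsilon_min, epsilon_decay = 1.0, 0.01, 0.99

history = []
for _ in range(1000):
    history.append(epsilon)
    epsilon = max(epsilon_min, epsilon * epsilon_decay)

print(round(history[69], 2))   # 0.5  -- halfway after ~69 episodes
print(round(history[230], 2))  # 0.1  -- after ~230 episodes
print(history[999])            # 0.01 -- pinned at the floor by ~episode 460
```

So the agent spends roughly the first quarter of training mostly exploring, then shifts steadily toward exploitation.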

Step 2: Training Loop with Q-Learning

# Training
episode_rewards = []

for episode in range(num_episodes):
    state = env.reset()
    total_reward = 0
    done = False
    steps = 0
    max_steps = 100  # Prevent infinite loops
    
    while not done and steps < max_steps:
        # Epsilon-greedy action selection
        if np.random.random() < epsilon:
            action = np.random.randint(n_actions)
        else:
            action = np.argmax(Q[state])
        
        # Take action
        next_state, reward, done = env.step(action)
        
        # Q-Learning update
        best_next = np.max(Q[next_state])
        Q[state, action] += alpha * (reward + gamma * best_next - Q[state, action])
        
        state = next_state
        total_reward += reward
        steps += 1
    
    episode_rewards.append(total_reward)
    epsilon = max(epsilon_min, epsilon * epsilon_decay)
    
    if (episode + 1) % 200 == 0:
        avg = np.mean(episode_rewards[-100:])
        print(f"Episode {episode+1}: Avg Reward = {avg:.2f}, ε = {epsilon:.3f}")

Each episode starts by resetting the environment. The agent uses epsilon-greedy selection: random actions with probability ε (exploration), otherwise the best known action (exploitation). After each step, we apply the Q-learning update rule to improve our estimates. The max_steps limit prevents the agent from getting stuck in loops during early training when it hasn't learned anything yet.
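To make the update rule concrete, here is one update worked by hand with illustrative numbers (the values are hypothetical, not taken from a real run):

```python
alpha, gamma = 0.1, 0.95

q_sa      = 0.0   # current estimate Q(s, a)
reward    = -0.1  # step penalty from the grid
best_next = 2.0   # max over a' of Q(s', a'), illustrative

td_target = reward + gamma * best_next   # -0.1 + 0.95 * 2.0 = 1.8
td_error  = td_target - q_sa             # 1.8 - 0.0 = 1.8
q_sa += alpha * td_error                 # 0.0 + 0.1 * 1.8 = 0.18

print(round(q_sa, 2))  # 0.18
```

The estimate moves only a tenth of the way (alpha = 0.1) toward the TD target, which keeps learning stable when rewards are noisy.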

Visualizing the Learned Policy

After training, let's see what strategy the agent learned:

Step 1: Display Policy as Arrows

def visualize_policy(Q, env):
    """Display the learned policy as arrows."""
    print("\nLearned Policy:")
    actions = ['↑', '↓', '←', '→']
    
    for i in range(env.size):
        row_str = ""
        for j in range(env.size):
            state_idx = i * env.size + j
            if (i, j) == env.goal:
                row_str += " G "
            elif env.rewards[i, j] < -1:
                row_str += " X "
            else:
                best_action = np.argmax(Q[state_idx])
                row_str += f" {actions[best_action]} "
        print(row_str)

This function displays the learned policy as a grid of arrows. For each cell, it looks up the Q-values and shows an arrow pointing in the direction of the best action. This gives us an intuitive view of how the agent has learned to navigate—you should see arrows pointing toward the goal while avoiding obstacles.

Step 2: Display State Values

def visualize_values(Q, env):
    """Display the max Q-value for each state."""
    print("\nState Values (max Q):")
    for i in range(env.size):
        row_str = ""
        for j in range(env.size):
            state_idx = i * env.size + j
            value = np.max(Q[state_idx])
            row_str += f"{value:6.2f}"
        print(row_str)

# Visualize results
visualize_policy(Q, env)
visualize_values(Q, env)

The value visualization shows the maximum Q-value for each state. Higher values indicate states closer to the goal (more expected future reward). You should see values increasing as they approach the goal and lower values near obstacles. This value gradient is what guides the agent's decisions.
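You can predict this gradient analytically by backing up from the goal along the optimal path (a hand calculation using the reward constants above, assuming no obstacles en route):

```python
# Optimal values along the last few steps toward the goal (gamma = 0.95)
gamma, step_penalty, goal_reward = 0.95, -0.1, 10.0

v = goal_reward        # entering the goal from one step away is worth +10
values = [v]
for _ in range(2):     # back up two more steps along the optimal path
    v = step_penalty + gamma * v
    values.append(v)

print([round(x, 2) for x in values])  # [10.0, 9.4, 8.83]
```

Each step away from the goal shrinks the value by the discount factor and the step penalty, which is exactly the gradient the table printout should show.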

Step 3: Test the Learned Policy

# Test the learned policy
def test_policy(Q, env, num_tests=10):
    """Test the learned policy."""
    successes = 0
    total_rewards = []
    
    for _ in range(num_tests):
        state = env.reset()
        total_reward = 0
        done = False
        steps = 0
        
        while not done and steps < 50:
            action = np.argmax(Q[state])
            state, reward, done = env.step(action)
            total_reward += reward
            steps += 1
        
        total_rewards.append(total_reward)
        if done:
            successes += 1
    
    print(f"\nTest Results ({num_tests} episodes):")
    print(f"Success Rate: {successes/num_tests*100:.1f}%")
    print(f"Average Reward: {np.mean(total_rewards):.2f}")

test_policy(Q, env)

Finally, we test the learned policy by running it without exploration (always choosing the best action). A well-trained agent should achieve near 100% success rate and consistently high rewards. If the success rate is low, try increasing the number of training episodes or adjusting hyperparameters.

Practice: Build Your Own Agent

Task: Modify the training code to use a linear epsilon decay (from 1.0 to 0.01 over the first 80% of episodes, then stay at 0.01). Compare learning curves with exponential decay.

Show Solution
def linear_epsilon(episode, total_episodes, 
                    start=1.0, end=0.01, decay_fraction=0.8):
    """Linear epsilon decay schedule."""
    decay_episodes = int(total_episodes * decay_fraction)
    
    if episode < decay_episodes:
        # Linear decay
        return start - (start - end) * (episode / decay_episodes)
    else:
        # Stay at minimum
        return end

# Usage in training loop:
for episode in range(num_episodes):
    epsilon = linear_epsilon(episode, num_episodes)
    # ... rest of training code

Task: SARSA is an "on-policy" variant of Q-Learning. Instead of using max(Q[next_state]), it uses Q[next_state, next_action] where next_action is chosen by the current policy. Implement SARSA and compare its learning with Q-Learning.

Show Solution
def epsilon_greedy(Q, state, epsilon):
    """Random action with probability epsilon, else the greedy one."""
    if np.random.random() < epsilon:
        return np.random.randint(Q.shape[1])
    return np.argmax(Q[state])

def train_sarsa(env, num_episodes=1000, alpha=0.1, gamma=0.95):
    """Train using SARSA (on-policy TD learning)."""
    n_states = env.size * env.size
    n_actions = len(env.actions)
    Q = np.zeros((n_states, n_actions))
    epsilon = 1.0
    
    for episode in range(num_episodes):
        state = env.reset()
        # Choose initial action
        action = epsilon_greedy(Q, state, epsilon)
        done = False
        steps = 0
        
        while not done and steps < 100:  # cap steps to avoid early infinite loops
            next_state, reward, done = env.step(action)
            
            # Choose next action (this is the key difference!)
            next_action = epsilon_greedy(Q, next_state, epsilon)
            
            # SARSA update: use actual next action, not max
            Q[state, action] += alpha * (
                reward + gamma * Q[next_state, next_action] - Q[state, action]
            )
            
            state = next_state
            action = next_action
            steps += 1
        
        epsilon = max(0.01, epsilon * 0.99)
    
    return Q

# SARSA is more conservative - it learns the value of the policy
# it's actually following, including exploration mistakes.

Key Takeaways

Learning from Interaction

RL agents learn through trial and error by interacting with an environment, receiving rewards for good actions and penalties for bad ones. Unlike supervised learning, no labeled dataset is needed—just define a reward signal that captures your goal

The Agent-Environment Loop

At each timestep, an agent observes the current state, selects an action based on its policy, receives a reward, and transitions to a new state. This observe-act-learn cycle repeats continuously, allowing the agent to improve its behavior over time

Exploration vs Exploitation

The fundamental RL dilemma: should the agent try new actions to discover better strategies (explore) or use actions it already knows work well (exploit)? Epsilon-greedy is a simple solution that explores randomly with probability ε and exploits otherwise

Q-Learning Algorithm

Q-Learning learns the value of each state-action pair Q(s,a) through temporal difference updates. It's model-free (no environment dynamics needed) and off-policy (learns optimal behavior while exploring), making it practical for real-world problems

Deep Q-Networks (DQN)

DQN replaces Q-tables with neural networks, enabling RL to scale to complex, high-dimensional problems like video games. Experience replay and target networks are crucial innovations that stabilize training and enable superhuman game performance

Policy Gradients

While Q-Learning learns value functions, policy gradient methods directly learn the policy itself. They excel at continuous action spaces (robotics) and are the foundation of modern algorithms like PPO, which powers ChatGPT's RLHF training

Markov Decision Processes

MDPs provide the mathematical framework for RL: states, actions, rewards, transitions, and discount factors. The Markov property—future depends only on the current state—enables tractable solutions through dynamic programming and value iteration

Real-World Applications

RL powers game-playing AI (AlphaGo, Atari), robotic manipulation (dexterous hands, locomotion), autonomous vehicles, data center optimization, and language model fine-tuning (RLHF). Anywhere sequential decisions matter, RL can help

Knowledge Check

Test your understanding of reinforcement learning fundamentals with this quiz.

Question 1 of 6

What makes Reinforcement Learning different from Supervised Learning?

Question 2 of 6

In the epsilon-greedy strategy, what does epsilon (ε) control?

Question 3 of 6

What is the Markov property in MDPs?

Question 4 of 6

What does Q(s, a) represent in Q-Learning?

Question 5 of 6

Why does DQN use experience replay?

Question 6 of 6

What happens when the discount factor γ is close to 0?

Answer all questions to check your score