Project Overview
Reinforcement Learning (RL) enables agents to learn optimal behaviors through trial and error. In this project, you will build agents that learn to play classic control games from the Gymnasium library (the maintained successor to OpenAI Gym). You will implement both tabular Q-learning for simple environments and Deep Q-Networks (DQN) for more complex games. Target: achieve an average reward above 195 on CartPole and solve FrozenLake with a success rate above 70%.
Q-Learning
Tabular RL with Q-table updates and epsilon-greedy policy
Deep Q-Network
Neural network function approximation for large state spaces
Experience Replay
Memory buffer for stable training with mini-batch sampling
Training Analysis
Learning curves, reward plots, and agent visualization
Learning Objectives
Technical Skills
- Implement Q-learning algorithm from scratch
- Build DQN with PyTorch or TensorFlow
- Create experience replay buffer
- Implement target network for stable training
- Record and visualize agent gameplay
RL Concepts
- Understand Markov Decision Processes (MDPs)
- Master the Bellman equation and temporal difference learning
- Balance exploration vs exploitation with epsilon-greedy
- Tune hyperparameters (learning rate, discount factor, epsilon decay)
- Evaluate agent performance and convergence
Problem Scenario
GameMind AI
You have been hired as an AI Research Engineer at GameMind AI, a startup developing intelligent agents for game testing and autonomous systems. The company needs a proof-of-concept showing that RL agents can learn to solve control tasks. Your task is to build agents that can master classic control environments and demonstrate learning progress.
"We need agents that can learn to balance poles, navigate frozen lakes, and control pendulums. Start with tabular methods for simple environments, then scale up to deep learning for complex ones. Document the learning process with visualizations. Can you build this?"
Technical Challenges to Solve
- When to use tabular Q-learning vs DQN?
- How to handle continuous state spaces?
- Trade-offs between sample efficiency and stability
- Choosing appropriate neural network architecture
- Why does vanilla DQN training diverge?
- How does experience replay help?
- Role of target network in stabilization
- Epsilon decay scheduling strategies
- Learning rate (alpha) selection
- Discount factor (gamma) for long-term rewards
- Epsilon schedule for exploration
- Replay buffer size and batch size
- Measuring learning progress
- Defining "solved" for each environment
- Averaging over multiple evaluation episodes
- Recording and visualizing agent behavior
Gymnasium Environments
You will work with Gymnasium (the Farama Foundation's maintained successor to OpenAI Gym). These environments provide a standardized interface for training and evaluating RL agents.
Required Environments
Install Gymnasium and work with these classic control environments:
FrozenLake-v1
Navigate a frozen lake without falling into holes. Discrete 4x4 grid with slippery ice.
- State Space: 16 discrete states (grid positions)
- Action Space: 4 discrete actions (left, down, right, up)
- Reward: +1 for reaching goal, 0 otherwise
- Solved: Average reward of at least 0.7 over 100 evaluation episodes
- Algorithm: Tabular Q-learning
CartPole-v1
Balance a pole on a cart by moving left or right. Classic control benchmark.
- State Space: 4 continuous values (position, velocity, angle, angular velocity)
- Action Space: 2 discrete actions (push left, push right)
- Reward: +1 for each timestep pole is balanced
- Solved: Average reward of at least 195 over 100 evaluation episodes
- Algorithm: Deep Q-Network (DQN)
Bonus: Additional Environments
For extra credit, implement agents for these environments:
MountainCar-v0
Drive up a steep hill with momentum. Sparse reward challenge.
LunarLander-v2
Land a spacecraft on the moon. Continuous state, discrete action.
Acrobot-v1
Swing a 2-link robot above a line. Challenging control problem.
Install the classic-control environments with `pip install "gymnasium[classic-control]"`.
For video recording, also install `pip install "gymnasium[other]"`, which adds MoviePy support.
Project Requirements
Your project must include all of the following components. This is a comprehensive reinforcement learning project covering both tabular and deep RL methods.
Environment Setup
Set up Gymnasium environments:
- Install Gymnasium and dependencies
- Create environment wrappers
- Understand state and action spaces
- Test random agent baseline
- Set up video recording for evaluation
Q-Learning Implementation
Build tabular Q-learning agent:
- Initialize Q-table with proper dimensions
- Implement epsilon-greedy action selection
- Apply Bellman equation for Q-value updates
- Implement epsilon decay schedule
- Train on FrozenLake-v1 environment
Target: Achieve over 70% success rate on FrozenLake
Deep Q-Network (DQN)
Build DQN with neural network:
- Design Q-network architecture (MLP)
- Implement experience replay buffer
- Create target network for stability
- Implement training loop with mini-batch sampling
- Add soft or hard target network updates
Target: Solve CartPole-v1 (average reward over 195)
Hyperparameter Tuning
Experiment with hyperparameters:
- Test different learning rates (alpha)
- Vary discount factor (gamma)
- Compare epsilon decay schedules
- Tune replay buffer and batch sizes
- Document impact on learning curves
Training Visualization
Create comprehensive visualizations:
- Plot episode rewards over training
- Show moving average reward curves
- Visualize epsilon decay
- Plot loss curves for DQN
- Compare Q-learning vs DQN performance
Agent Visualization
Record and analyze agent gameplay:
- Record videos of trained agents playing
- Create before/after training comparisons
- Visualize Q-values or policy for discrete envs
- Document agent behavior and strategies
Q-Learning Algorithm
Q-learning is a model-free, off-policy algorithm that learns the value of state-action pairs. It uses the Bellman equation to iteratively update Q-values toward optimal values.
Key Equations
| Concept | Equation | Description |
|---|---|---|
| Q-Value Update | Q(s,a) ← Q(s,a) + α[r + γ·max_a′ Q(s′,a′) − Q(s,a)] | Update Q-value toward the temporal-difference target |
| Epsilon-Greedy | a = argmax_a Q(s,a) with prob (1−ε), random with prob ε | Balance exploration and exploitation |
| Epsilon Decay | ε ← max(ε_min, ε × decay_rate) | Reduce exploration over time |
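It is worth tracing one Q-value update by hand. With assumed example values α = 0.1, γ = 0.9, Q(s,a) = 0.5, r = 1, and max Q(s′,·) = 0.8:

```python
alpha, gamma = 0.1, 0.9                      # assumed example values
q_sa, reward, max_q_next = 0.5, 1.0, 0.8

td_target = reward + gamma * max_q_next      # 1 + 0.9 * 0.8 = 1.72
td_error = td_target - q_sa                  # 1.72 - 0.5 = 1.22
q_sa_new = q_sa + alpha * td_error           # 0.5 + 0.1 * 1.22 = 0.622
print(round(q_sa_new, 3))                    # 0.622
```

The Q-value moves a fraction α of the way toward the TD target, which is why small learning rates converge slowly but smoothly.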
Recommended Hyperparameters
| Parameter | Symbol | FrozenLake | Description |
|---|---|---|---|
| Learning Rate | α | 0.1 - 0.8 | How much to update Q-values |
| Discount Factor | γ | 0.95 - 0.99 | Importance of future rewards |
| Initial Epsilon | ε₀ | 1.0 | Start with full exploration |
| Min Epsilon | ε_min | 0.01 - 0.1 | Minimum exploration rate |
| Epsilon Decay | decay | 0.995 - 0.999 | Per-episode decay rate |
| Episodes | N | 10,000 - 50,000 | Training episodes |
Deep Q-Network (DQN)
DQN extends Q-learning to continuous state spaces using neural networks as function approximators. Key innovations include experience replay and target networks for stable training.
DQN Architecture
| Layer | Type | Size | Activation |
|---|---|---|---|
| Input | State | 4 (CartPole) | - |
| Hidden 1 | Dense | 64 - 128 | ReLU |
| Hidden 2 | Dense | 64 - 128 | ReLU |
| Output | Dense | 2 (actions) | Linear |
Key DQN Components
Experience Replay
- Store transitions (s, a, r, s', done) in a buffer
- Sample random mini-batches for training
- Breaks correlation between consecutive samples
- Typical buffer size: 10,000 - 100,000
- Batch size: 32 - 128
Target Network
- Separate network for computing TD targets
- Updated less frequently than the online network
- Prevents the moving-target problem
- Hard update every N steps, or soft update with rate τ
- τ = 0.001 - 0.01 for soft updates
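Both components above are short in code. Here is a minimal sketch of a replay buffer and a soft (Polyak) target update; `soft_update` operates on plain parameter lists purely for illustration, whereas a real DQN would update tensors in place:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size FIFO buffer of (s, a, r, s_next, done) transitions."""
    def __init__(self, capacity: int = 10_000):
        self.buffer = deque(maxlen=capacity)  # old transitions drop off automatically

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size: int):
        # Uniform sampling breaks the correlation between consecutive steps.
        batch = random.sample(self.buffer, batch_size)
        return tuple(zip(*batch))  # (states, actions, rewards, next_states, dones)

    def __len__(self):
        return len(self.buffer)

def soft_update(target_params, online_params, tau: float = 0.005):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return [tau * o + (1 - tau) * t for o, t in zip(online_params, target_params)]
```

With a hard update you would instead copy the online weights wholesale every N steps; soft updates trade that abrupt change for a continuous drift controlled by τ.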
DQN Hyperparameters
| Parameter | CartPole Value | Description |
|---|---|---|
| Learning Rate | 0.001 | Adam optimizer learning rate |
| Discount Factor (γ) | 0.99 | Future reward discount |
| Replay Buffer Size | 10,000 | Maximum stored transitions |
| Batch Size | 64 | Mini-batch size for training |
| Target Update Freq | 100 steps | Steps between target network updates |
| Epsilon Start | 1.0 | Initial exploration rate |
| Epsilon End | 0.01 | Final exploration rate |
| Epsilon Decay | 0.995 | Per-episode decay |
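A useful sanity check on the schedule in the table: with per-episode multiplicative decay, you can solve for how many episodes it takes ε to fall from its start value to its floor:

```python
import math

eps_start, eps_end, decay = 1.0, 0.01, 0.995  # values from the table above

# Smallest n with eps_start * decay**n <= eps_end:
n = math.ceil(math.log(eps_end / eps_start) / math.log(decay))
print(n)  # 919 episodes until epsilon hits its floor
```

If your agent needs far more than ~900 episodes of exploration, slow the decay (e.g. 0.999) rather than raising the floor.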
Evaluation and Visualization
Proper evaluation and visualization are essential for understanding agent learning and debugging training issues.
Required Visualizations
Learning Curves
Episode rewards over training with moving average
Loss Curves
DQN training loss over time
Epsilon Decay
Exploration rate over episodes
Q-Table Heatmap
Visualize learned Q-values for FrozenLake
Agent Gameplay
Recorded videos of trained agents
Before/After
Compare random vs trained agent
- Use rolling averages (window=100) to smooth noisy reward curves
- Record videos using Gymnasium's RecordVideo wrapper
- For FrozenLake, create a grid showing optimal action per state
- Include training time and hardware specs in documentation
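The window=100 rolling average mentioned above can be computed with a single convolution; the synthetic reward array here stands in for your real training log:

```python
import numpy as np

def moving_average(rewards, window: int = 100):
    """Rolling mean of episode rewards ('valid' mode drops the ramp-up edges)."""
    kernel = np.ones(window) / window
    return np.convolve(rewards, kernel, mode="valid")

rewards = np.random.default_rng(0).integers(0, 200, size=500)  # stand-in data
smoothed = moving_average(rewards)
print(smoothed.shape)  # 500 - 100 + 1 = 401 points
```

Plot `smoothed` over the raw per-episode rewards; the "solved" criteria in this project are defined on exactly this 100-episode average.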
Submission Requirements
Create a public GitHub repository with the exact name shown below:
Required Repository Name
rl-game-agent
Required Project Structure
rl-game-agent/
├── notebooks/
│ ├── 01_environment_exploration.ipynb # Env setup and baseline
│ ├── 02_q_learning.ipynb # Q-learning implementation
│ ├── 03_dqn.ipynb # DQN implementation
│ └── 04_visualization.ipynb # Training analysis
├── src/
│ ├── q_learning.py # Q-learning agent class
│ ├── dqn.py # DQN agent class
│ ├── replay_buffer.py # Experience replay
│ └── utils.py # Helper functions
├── models/
│ ├── q_table_frozenlake.npy # Trained Q-table
│ └── dqn_cartpole.pt # Trained DQN weights
├── reports/
│ ├── learning_curves.png # Reward plots
│ ├── q_table_heatmap.png # Q-value visualization
│ └── hyperparameter_comparison.png # Tuning results
├── videos/
│ ├── frozenlake_trained.mp4 # FrozenLake gameplay
│ └── cartpole_trained.mp4 # CartPole gameplay
├── requirements.txt # Python dependencies
└── README.md # Project documentation
README.md Required Sections
1. Project Header
- Project title and description
- Your full name and submission date
- Final performance metrics
2. Environments
- Environments used
- State and action space descriptions
- Solved criteria for each
3. Q-Learning Results
- FrozenLake success rate
- Hyperparameters used
- Q-table visualization
4. DQN Results
- CartPole average reward
- Network architecture
- Training configuration
5. Visualizations
- Learning curves
- GIF or video demos
- Hyperparameter analysis
6. How to Run
- Installation instructions
- Training commands
- Evaluation commands
Submit your GitHub username; we will verify your repository automatically.
Grading Rubric
Your project will be graded on the following criteria. Total: 750 points.
| Criteria | Points | Description |
|---|---|---|
| Environment Setup | 50 | Proper setup, baseline evaluation |
| Q-Learning Implementation | 150 | Correct algorithm, over 70% success on FrozenLake |
| DQN Implementation | 200 | Replay buffer, target network, solves CartPole |
| Hyperparameter Analysis | 100 | Systematic tuning with documented results |
| Visualizations | 125 | Learning curves, Q-table, agent videos |
| Documentation | 100 | README quality, code comments, reproducibility |
| Bonus: Extra Environment | 25 | Solve LunarLander or MountainCar |
| Total | 750 | |
Grading Levels
Excellent
Solves all envs, excellent visualizations
Good
Meets all requirements, good docs
Satisfactory
Meets minimum requirements
Needs Work
Missing components or poor performance