Project Overview
Reinforcement Learning (RL) enables agents to learn optimal behaviors through trial and error. In this project, you will build agents that learn to play classic control games from the Gymnasium library (the maintained successor to OpenAI Gym). You will implement both tabular Q-learning for simple environments and Deep Q-Networks (DQN) for more complex games. Target: achieve an average reward above 195 on CartPole and solve FrozenLake with a success rate above 70%.
Q-Learning
Tabular RL with Q-table updates and epsilon-greedy policy
Deep Q-Network
Neural network function approximation for large state spaces
Experience Replay
Memory buffer for stable training with mini-batch sampling
Training Analysis
Learning curves, reward plots, and agent visualization
Learning Objectives
Technical Skills
- Implement Q-learning algorithm from scratch
- Build DQN with PyTorch or TensorFlow
- Create experience replay buffer
- Implement target network for stable training
- Record and visualize agent gameplay
RL Concepts
- Understand Markov Decision Processes (MDPs)
- Master the Bellman equation and temporal difference learning
- Balance exploration vs exploitation with epsilon-greedy
- Tune hyperparameters (learning rate, discount factor, epsilon decay)
- Evaluate agent performance and convergence
Problem Scenario
GameMind AI
You have been hired as an AI Research Engineer at GameMind AI, a startup developing intelligent agents for game testing and autonomous systems. The company needs a proof-of-concept showing that RL agents can learn to solve control tasks. Your task is to build agents that can master classic control environments and demonstrate learning progress.
"We need agents that can learn to balance poles, navigate frozen lakes, and control pendulums. Start with tabular methods for simple environments, then scale up to deep learning for complex ones. Document the learning process with visualizations. Can you build this?"
Technical Challenges to Solve
- When to use tabular Q-learning vs DQN?
- How to handle continuous state spaces?
- Trade-offs between sample efficiency and stability
- Choosing appropriate neural network architecture
- Why does vanilla DQN training diverge?
- How does experience replay help?
- Role of target network in stabilization
- Epsilon decay scheduling strategies
- Learning rate (alpha) selection
- Discount factor (gamma) for long-term rewards
- Epsilon schedule for exploration
- Replay buffer size and batch size
- Measuring learning progress
- Defining "solved" for each environment
- Averaging over multiple evaluation episodes
- Recording and visualizing agent behavior
Gymnasium Environments
You will work with Gymnasium (the Farama Foundation's maintained successor to OpenAI Gym). These environments provide a standardized interface for training and evaluating RL agents.
Required Environments
Install Gymnasium and work with these classic control environments:
FrozenLake-v1
Navigate a frozen lake without falling into holes. Discrete 4x4 grid with slippery ice.
- State Space: 16 discrete states (grid positions)
- Action Space: 4 discrete actions (left, down, right, up)
- Reward: +1 for reaching goal, 0 otherwise
- Solved: Average reward of at least 0.7 over 100 evaluation episodes
- Algorithm: Tabular Q-learning
CartPole-v1
Balance a pole on a cart by moving left or right. Classic control benchmark.
- State Space: 4 continuous values (position, velocity, angle, angular velocity)
- Action Space: 2 discrete actions (push left, push right)
- Reward: +1 for each timestep pole is balanced
- Solved: Average reward of at least 195 over 100 evaluation episodes
- Algorithm: Deep Q-Network (DQN)
Bonus: Additional Environments
For extra credit, implement agents for these environments:
MountainCar-v0
Drive up a steep hill with momentum. Sparse reward challenge.
LunarLander-v2
Land a spacecraft on the moon. Continuous state, discrete action.
Acrobot-v1
Swing a 2-link robot above a line. Challenging control problem.
Install the classic-control environments with `pip install "gymnasium[classic-control]"`.
For video recording, also install `pip install "gymnasium[other]"`, which adds MoviePy support.
Project Requirements
Your project must include all of the following components. This is a comprehensive reinforcement learning project covering both tabular and deep RL methods.
Environment Setup
Set up Gymnasium environments:
- Install Gymnasium and dependencies
- Create environment wrappers
- Understand state and action spaces
- Test random agent baseline
- Set up video recording for evaluation
Q-Learning Implementation
Build tabular Q-learning agent:
- Initialize Q-table with proper dimensions
- Implement epsilon-greedy action selection
- Apply Bellman equation for Q-value updates
- Implement epsilon decay schedule
- Train on FrozenLake-v1 environment
Target: Achieve over 70% success rate on FrozenLake
Deep Q-Network (DQN)
Build DQN with neural network:
- Design Q-network architecture (MLP)
- Implement experience replay buffer
- Create target network for stability
- Implement training loop with mini-batch sampling
- Add soft or hard target network updates
Target: Solve CartPole-v1 (average reward over 195)
Hyperparameter Tuning
Experiment with hyperparameters:
- Test different learning rates (alpha)
- Vary discount factor (gamma)
- Compare epsilon decay schedules
- Tune replay buffer and batch sizes
- Document impact on learning curves
Training Visualization
Create comprehensive visualizations:
- Plot episode rewards over training
- Show moving average reward curves
- Visualize epsilon decay
- Plot loss curves for DQN
- Compare Q-learning vs DQN performance
Agent Visualization
Record and analyze agent gameplay:
- Record videos of trained agents playing
- Create before/after training comparisons
- Visualize Q-values or policy for discrete envs
- Document agent behavior and strategies
Q-Learning Algorithm
Q-learning is a model-free, off-policy algorithm that learns the value of state-action pairs. It uses the Bellman equation to iteratively update Q-values toward optimal values.
Key Equations
| Concept | Equation | Description |
|---|---|---|
| Q-Value Update | Q(s,a) ← Q(s,a) + α[r + γ·max_a′ Q(s′,a′) − Q(s,a)] | Update Q-value toward the temporal-difference target |
| Epsilon-Greedy | a = argmax_a Q(s,a) with prob (1−ε), random with prob ε | Balance exploration and exploitation |
| Epsilon Decay | ε ← max(ε_min, ε × decay_rate) | Reduce exploration over time |
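It is worth tracing one Q-value update by hand. With assumed example values α = 0.1, γ = 0.9, Q(s,a) = 0.5, r = 1, and max Q(s′,·) = 0.8:

```python
alpha, gamma = 0.1, 0.9                      # assumed example values
q_sa, reward, max_q_next = 0.5, 1.0, 0.8

td_target = reward + gamma * max_q_next      # 1 + 0.9 * 0.8 = 1.72
td_error = td_target - q_sa                  # 1.72 - 0.5 = 1.22
q_sa_new = q_sa + alpha * td_error           # 0.5 + 0.1 * 1.22 = 0.622
print(round(q_sa_new, 3))                    # 0.622
```

The Q-value moves a fraction α of the way toward the TD target, which is why small learning rates converge slowly but smoothly.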
Recommended Hyperparameters
| Parameter | Symbol | FrozenLake | Description |
|---|---|---|---|
| Learning Rate | α | 0.1 - 0.8 | How much to update Q-values |
| Discount Factor | γ | 0.95 - 0.99 | Importance of future rewards |
| Initial Epsilon | ε₀ | 1.0 | Start with full exploration |
| Min Epsilon | ε_min | 0.01 - 0.1 | Minimum exploration rate |
| Epsilon Decay | decay | 0.995 - 0.999 | Per-episode decay rate |
| Episodes | N | 10,000 - 50,000 | Training episodes |
Deep Q-Network (DQN)
DQN extends Q-learning to continuous state spaces using neural networks as function approximators. Key innovations include experience replay and target networks for stable training.
DQN Architecture
| Layer | Type | Size | Activation |
|---|---|---|---|
| Input | State | 4 (CartPole) | - |
| Hidden 1 | Dense | 64 - 128 | ReLU |
| Hidden 2 | Dense | 64 - 128 | ReLU |
| Output | Dense | 2 (actions) | Linear |
Key DQN Components
Experience Replay
- Store transitions (s, a, r, s', done) in a buffer
- Sample random mini-batches for training
- Breaks correlation between consecutive samples
- Typical buffer size: 10,000 - 100,000
- Batch size: 32 - 128
Target Network
- Separate network for computing TD targets
- Updated less frequently than the online network
- Prevents the moving-target problem
- Hard update every N steps, or soft update with rate τ
- τ = 0.001 - 0.01 for soft updates
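Both components above are short in code. Here is a minimal sketch of a replay buffer and a soft (Polyak) target update; `soft_update` operates on plain parameter lists purely for illustration, whereas a real DQN would update tensors in place:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size FIFO buffer of (s, a, r, s_next, done) transitions."""
    def __init__(self, capacity: int = 10_000):
        self.buffer = deque(maxlen=capacity)  # old transitions drop off automatically

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size: int):
        # Uniform sampling breaks the correlation between consecutive steps.
        batch = random.sample(self.buffer, batch_size)
        return tuple(zip(*batch))  # (states, actions, rewards, next_states, dones)

    def __len__(self):
        return len(self.buffer)

def soft_update(target_params, online_params, tau: float = 0.005):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return [tau * o + (1 - tau) * t for o, t in zip(online_params, target_params)]
```

With a hard update you would instead copy the online weights wholesale every N steps; soft updates trade that abrupt change for a continuous drift controlled by τ.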
DQN Hyperparameters
| Parameter | CartPole Value | Description |
|---|---|---|
| Learning Rate | 0.001 | Adam optimizer learning rate |
| Discount Factor (γ) | 0.99 | Future reward discount |
| Replay Buffer Size | 10,000 | Maximum stored transitions |
| Batch Size | 64 | Mini-batch size for training |
| Target Update Freq | 100 steps | Steps between target network updates |
| Epsilon Start | 1.0 | Initial exploration rate |
| Epsilon End | 0.01 | Final exploration rate |
| Epsilon Decay | 0.995 | Per-episode decay |
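A useful sanity check on the schedule in the table: with per-episode multiplicative decay, you can solve for how many episodes it takes ε to fall from its start value to its floor:

```python
import math

eps_start, eps_end, decay = 1.0, 0.01, 0.995  # values from the table above

# Smallest n with eps_start * decay**n <= eps_end:
n = math.ceil(math.log(eps_end / eps_start) / math.log(decay))
print(n)  # 919 episodes until epsilon hits its floor
```

If your agent needs far more than ~900 episodes of exploration, slow the decay (e.g. 0.999) rather than raising the floor.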
Evaluation and Visualization
Proper evaluation and visualization are essential for understanding agent learning and debugging training issues.
Required Visualizations
Learning Curves
Episode rewards over training with moving average
Loss Curves
DQN training loss over time
Epsilon Decay
Exploration rate over episodes
Q-Table Heatmap
Visualize learned Q-values for FrozenLake
Agent Gameplay
Recorded videos of trained agents
Before/After
Compare random vs trained agent
- Use rolling averages (window=100) to smooth noisy reward curves
- Record videos using Gymnasium's RecordVideo wrapper
- For FrozenLake, create a grid showing optimal action per state
- Include training time and hardware specs in documentation
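The window=100 rolling average mentioned above can be computed with a single convolution; the synthetic reward array here stands in for your real training log:

```python
import numpy as np

def moving_average(rewards, window: int = 100):
    """Rolling mean of episode rewards ('valid' mode drops the ramp-up edges)."""
    kernel = np.ones(window) / window
    return np.convolve(rewards, kernel, mode="valid")

rewards = np.random.default_rng(0).integers(0, 200, size=500)  # stand-in data
smoothed = moving_average(rewards)
print(smoothed.shape)  # 500 - 100 + 1 = 401 points
```

Plot `smoothed` over the raw per-episode rewards; the "solved" criteria in this project are defined on exactly this 100-episode average.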
Submission Requirements
Create a public GitHub repository with the exact name shown below:
Required Repository Name
rl-game-agent
Required Project Structure
rl-game-agent/
├── notebooks/
│ ├── 01_environment_exploration.ipynb # Env setup and baseline
│ ├── 02_q_learning.ipynb # Q-learning implementation
│ ├── 03_dqn.ipynb # DQN implementation
│ └── 04_visualization.ipynb # Training analysis
├── src/
│ ├── q_learning.py # Q-learning agent class
│ ├── dqn.py # DQN agent class
│ ├── replay_buffer.py # Experience replay
│ └── utils.py # Helper functions
├── models/
│ ├── q_table_frozenlake.npy # Trained Q-table
│ └── dqn_cartpole.pt # Trained DQN weights
├── reports/
│ ├── learning_curves.png # Reward plots
│ ├── q_table_heatmap.png # Q-value visualization
│ └── hyperparameter_comparison.png # Tuning results
├── videos/
│ ├── frozenlake_trained.mp4 # FrozenLake gameplay
│ └── cartpole_trained.mp4 # CartPole gameplay
├── requirements.txt # Python dependencies
└── README.md # Project documentation
README.md Required Sections
1. Project Header
- Project title and description
- Your full name and submission date
- Final performance metrics
2. Environments
- Environments used
- State and action space descriptions
- Solved criteria for each
3. Q-Learning Results
- FrozenLake success rate
- Hyperparameters used
- Q-table visualization
4. DQN Results
- CartPole average reward
- Network architecture
- Training configuration
5. Visualizations
- Learning curves
- GIF or video demos
- Hyperparameter analysis
6. How to Run
- Installation instructions
- Training commands
- Evaluation commands
Submit your GitHub username; we will verify your repository automatically.
Grading Rubric
Your project will be graded on the following criteria. Total: 750 points.
| Criteria | Points | Description |
|---|---|---|
| Environment Setup | 50 | Proper setup, baseline evaluation |
| Q-Learning Implementation | 150 | Correct algorithm, over 70% success on FrozenLake |
| DQN Implementation | 200 | Replay buffer, target network, solves CartPole |
| Hyperparameter Analysis | 100 | Systematic tuning with documented results |
| Visualizations | 125 | Learning curves, Q-table, agent videos |
| Documentation | 100 | README quality, code comments, reproducibility |
| Bonus: Extra Environment | 25 | Solve LunarLander or MountainCar |
| Total | 750 | |
Grading Levels
Excellent
Solves all envs, excellent visualizations
Good
Meets all requirements, good docs
Satisfactory
Meets minimum requirements
Needs Work
Missing components or poor performance