Assignment 2-A

ML Basics: Build Your First Pipeline

Build a complete machine learning pipeline that combines all Module 2 concepts: data preparation, train/test splitting, classification with KNN and Decision Trees, regression with Linear Regression, and model evaluation with industry-standard metrics.

8-10 hours
Intermediate
250 Points
What You'll Practice
  • Load and preprocess datasets
  • Implement train/test splitting
  • Train KNN & Decision Tree classifiers
  • Build Linear Regression models
  • Evaluate with accuracy, MSE, R²
Contents
01

Assignment Overview

In this assignment, you will build a complete Machine Learning Pipeline from scratch. This comprehensive project requires you to apply ALL concepts from Module 2: data loading, preprocessing, train/test splitting, classification models (KNN, Decision Trees), regression models (Linear Regression), and model evaluation with industry-standard metrics.

Libraries Allowed: You may use numpy, pandas, scikit-learn, and matplotlib. These are the standard tools used in the industry for machine learning tasks.
Skills Applied: This assignment tests your understanding of ML Foundations (Topic 2.1), Classification (Topic 2.2), and Regression (Topic 2.3) from Module 2.
Data Preparation

Loading, cleaning, splitting, and scaling datasets

Classification

KNN, Decision Trees, accuracy, precision, recall, F1

Regression

Linear Regression, MSE, RMSE, MAE, R² score

02

The Scenario

TechPredict Analytics

You've just joined TechPredict Analytics, a consulting firm that helps businesses make data-driven decisions. Your manager has assigned you two key projects:

"We have two clients who need predictive models. First, a telecom company wants to identify customers likely to cancel their subscription. Second, a real estate agency needs an automated system to estimate house prices. Can you build ML models for both problems?"

Project 1: Classification
Customer Churn Prediction

Build a classifier to predict which customers are likely to leave the telecom company, so retention offers can be made proactively.

Project 2: Regression
House Price Estimation

Build a regression model to estimate house sale prices based on property features like square footage, bedrooms, and age.

Your Task

Create a Python file called ml_pipeline.py that implements a complete machine learning pipeline. Your code must load datasets, train multiple models, evaluate their performance, and generate visualizations comparing results.

03

The Datasets

You will work with TWO datasets. Create these CSV files or generate synthetic data:

customer_churn.csv

Telecom customer data with churn labels

  • tenure: Months as customer
  • monthly_charges: Monthly bill amount
  • total_charges: Total amount paid
  • contract_type: 0=Monthly, 1=Yearly
  • churn: 0=Stayed, 1=Left (target)

house_prices.csv

House features and sale prices

  • sqft: Square footage
  • bedrooms: Number of bedrooms
  • bathrooms: Number of bathrooms
  • age: House age in years
  • price: Sale price in dollars (target)
Download: Get sample datasets from the course data folder or generate synthetic data using scikit-learn's make_classification and make_regression.
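If you go the synthetic route, one possible sketch is below. The column names follow the schemas above, but note that the generated values are standardized numbers, not realistic tenures or square footages:

```python
import os
import pandas as pd
from sklearn.datasets import make_classification, make_regression

os.makedirs("data", exist_ok=True)

# Classification: 4 features renamed to the churn schema above.
Xc, yc = make_classification(n_samples=500, n_features=4, n_informative=3,
                             n_redundant=0, random_state=42)
churn = pd.DataFrame(Xc, columns=["tenure", "monthly_charges",
                                  "total_charges", "contract_type"])
churn["churn"] = yc
churn.to_csv("data/customer_churn.csv", index=False)

# Regression: 4 features renamed to the house-price schema above.
Xr, yr = make_regression(n_samples=500, n_features=4, noise=10.0,
                         random_state=42)
houses = pd.DataFrame(Xr, columns=["sqft", "bedrooms", "bathrooms", "age"])
houses["price"] = yr
houses.to_csv("data/house_prices.csv", index=False)

print(churn.shape, houses.shape)  # (500, 5) (500, 5)
```

The sample count (500) and `make_classification` settings are arbitrary choices here; any values that give both classes reasonable representation will work.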
04

Requirements

Implement the following 13 functions in your ml_pipeline.py file. Each function should follow the exact signature provided:

1
load_dataset

Load a CSV file into a pandas DataFrame. Handle missing values by dropping rows with any null values.

def load_dataset(filepath: str) -> pd.DataFrame:
    """
    Load and clean a CSV dataset.
    
    Args:
        filepath: Path to the CSV file
        
    Returns:
        Cleaned pandas DataFrame with no missing values
    """
    pass
2
split_features_target

Separate the dataset into features (X) and target variable (y).

def split_features_target(df: pd.DataFrame, target_col: str) -> tuple:
    """
    Split DataFrame into features and target.
    
    Args:
        df: Input DataFrame
        target_col: Name of the target column
        
    Returns:
        Tuple of (X, y) where X is features DataFrame and y is target Series
    """
    pass
3
create_train_test_split

Split data into training and testing sets. Use an 80/20 split ratio and set a random state for reproducibility.

def create_train_test_split(X, y, test_size=0.2, random_state=42) -> tuple:
    """
    Create train/test split of the data.
    
    Args:
        X: Features
        y: Target
        test_size: Proportion of data for testing (default 0.2)
        random_state: Random seed for reproducibility
        
    Returns:
        Tuple of (X_train, X_test, y_train, y_test)
    """
    pass
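For reference, this function can be a thin wrapper around scikit-learn's `train_test_split`. A sketch (stratification is omitted for simplicity):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def create_train_test_split(X, y, test_size=0.2, random_state=42) -> tuple:
    # Delegate to scikit-learn's splitter; the seed makes the split reproducible.
    return train_test_split(X, y, test_size=test_size, random_state=random_state)

# Quick sanity check on toy data: 10 samples, 80/20 split.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
X_train, X_test, y_train, y_test = create_train_test_split(X, y)
print(len(X_train), len(X_test))  # 8 2
```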
4
scale_features

Standardize features using StandardScaler. Fit on training data only, then transform both train and test.

def scale_features(X_train, X_test) -> tuple:
    """
    Standardize features using StandardScaler.
    
    Args:
        X_train: Training features
        X_test: Test features
        
    Returns:
        Tuple of (X_train_scaled, X_test_scaled, scaler)
    """
    pass
5
train_knn_classifier

Train a K-Nearest Neighbors classifier with configurable number of neighbors.

def train_knn_classifier(X_train, y_train, n_neighbors=5):
    """
    Train a KNN classifier.
    
    Args:
        X_train: Training features
        y_train: Training labels
        n_neighbors: Number of neighbors (default 5)
        
    Returns:
        Trained KNeighborsClassifier model
    """
    pass
6
train_decision_tree_classifier

Train a Decision Tree classifier with maximum depth to prevent overfitting.

def train_decision_tree_classifier(X_train, y_train, max_depth=5, random_state=42):
    """
    Train a Decision Tree classifier.
    
    Args:
        X_train: Training features
        y_train: Training labels
        max_depth: Maximum tree depth (default 5)
        random_state: Random seed for reproducibility
        
    Returns:
        Trained DecisionTreeClassifier model
    """
    pass
7
train_linear_regression

Train a Linear Regression model for the house price prediction task.

def train_linear_regression(X_train, y_train):
    """
    Train a Linear Regression model.
    
    Args:
        X_train: Training features
        y_train: Training target values
        
    Returns:
        Trained LinearRegression model
    """
    pass
8
evaluate_classifier

Evaluate a classification model. Calculate accuracy, precision, recall, and F1-score.

def evaluate_classifier(model, X_test, y_test) -> dict:
    """
    Evaluate a classification model.
    
    Args:
        model: Trained classifier
        X_test: Test features
        y_test: True labels
        
    Returns:
        Dictionary with 'accuracy', 'precision', 'recall', 'f1' scores
    """
    pass
9
evaluate_regressor

Evaluate a regression model. Calculate MSE, RMSE, MAE, and R² score.

def evaluate_regressor(model, X_test, y_test) -> dict:
    """
    Evaluate a regression model.
    
    Args:
        model: Trained regressor
        X_test: Test features
        y_test: True target values
        
    Returns:
        Dictionary with 'mse', 'rmse', 'mae', 'r2' scores
    """
    pass
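As a check on your implementation, the four regression metrics can be computed from `sklearn.metrics` like this (a sketch using made-up predictions in place of a trained model):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_test = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 310.0])  # hypothetical model output

mse = mean_squared_error(y_test, y_pred)
results = {
    "mse":  mse,
    "rmse": np.sqrt(mse),  # RMSE is in the same units as the target
    "mae":  mean_absolute_error(y_test, y_pred),
    "r2":   r2_score(y_test, y_pred),
}
print(results)  # mse=100.0, rmse=10.0, mae=10.0, r2=0.985
```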
10
compare_classifiers

Compare KNN and Decision Tree classifiers by training and evaluating both.

def compare_classifiers(X_train, X_test, y_train, y_test) -> dict:
    """
    Train and compare KNN and Decision Tree classifiers.
    
    Args:
        X_train, X_test: Train and test features
        y_train, y_test: Train and test labels
        
    Returns:
        Dictionary with model names as keys and evaluation dicts as values
        Example: {'KNN': {'accuracy': 0.85, ...}, 'DecisionTree': {...}}
    """
    pass
11
plot_confusion_matrix

Create and save a confusion matrix visualization for a classifier.

def plot_confusion_matrix(model, X_test, y_test, save_path: str) -> None:
    """
    Plot and save confusion matrix.
    
    Args:
        model: Trained classifier
        X_test: Test features
        y_test: True labels
        save_path: File path to save the plot
    """
    pass
12
plot_regression_results

Create a scatter plot comparing actual vs predicted values for regression.

def plot_regression_results(model, X_test, y_test, save_path: str) -> None:
    """
    Plot actual vs predicted values for regression.
    
    Args:
        model: Trained regressor
        X_test: Test features
        y_test: True target values
        save_path: File path to save the plot
    """
    pass
13
main (Pipeline Execution)

Create a main function that orchestrates the entire ML pipeline.

def main():
    """
    Execute the complete ML pipeline:
    1. Load and prepare both datasets
    2. Train classification models (KNN, Decision Tree)
    3. Train regression model (Linear Regression)
    4. Evaluate all models
    5. Generate visualizations
    6. Print summary report
    """
    pass

if __name__ == "__main__":
    main()
05

Submission

Create a public GitHub repository with the exact name shown below:

Required Repository Name
ai-ml-basics-assignment
github.com/<your-username>/ai-ml-basics-assignment
Required Files
ai-ml-basics-assignment/
├── ml_pipeline.py          # Main implementation file with ALL 13 functions
├── data/
│   ├── customer_churn.csv  # Classification dataset
│   └── house_prices.csv    # Regression dataset
├── outputs/
│   ├── confusion_matrix.png
│   └── regression_plot.png
├── requirements.txt        # Dependencies
└── README.md               # REQUIRED - see contents below
README.md Must Include:
  • Your full name and submission date
  • Results Summary: A table showing model performance metrics
  • Key Findings: Which model performed better and why
  • Challenges: Any difficulties you encountered and how you solved them
  • Instructions to run your code
Do Include
  • All 13 functions implemented and working
  • Docstrings for every function
  • Type hints in function signatures
  • Both output visualizations from running your code
  • PEP 8 compliant code style
  • README.md with all required sections
Do Not Include
  • Hardcoded file paths
  • Any .pyc or __pycache__ files (use .gitignore)
  • Virtual environment folders
  • Code that doesn't run without errors
  • Deprecated sklearn functions
  • Code copied without understanding
Important: Before submitting, run your script to make sure it executes without errors and generates both visualization files correctly!
Submit Your Assignment

To submit, enter your GitHub username on the assignment page; your repository will be verified automatically.

06

Grading Rubric

  • Data Loading & Prep (30 pts): Correctly loads CSVs, handles missing values, splits features/target
  • Train/Test Split (20 pts): Proper 80/20 split with random state, feature scaling implemented
  • Classification Models (50 pts): KNN and Decision Tree correctly implemented and trained
  • Regression Model (30 pts): Linear Regression correctly implemented and trained
  • Model Evaluation (40 pts): All metrics calculated correctly (accuracy, precision, recall, F1, MSE, R²)
  • Visualizations (30 pts): Confusion matrix and regression plot generated and saved
  • Code Quality (30 pts): PEP 8, docstrings, type hints, clean structure, error handling
  • Documentation (20 pts): Complete README with setup, results, and analysis
  • Total: 250 pts
Bonus Points (+25): Implement an additional model (e.g., Random Forest, SVM) and include it in your comparison analysis.
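A starting point for the bonus: a Random Forest slots into the same train/evaluate pattern as the other classifiers. A sketch on synthetic data (swap in your churn split and comparison dict):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Hyperparameters here are arbitrary starting values, not tuned choices.
rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf.fit(X_train, y_train)
acc = accuracy_score(y_test, rf.predict(X_test))
print(f"RandomForest accuracy: {acc:.3f}")
```

Because trees are not distance-based, the forest can be trained on unscaled features, which makes it easy to drop into your `compare_classifiers` results.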

Ready to Submit?

Make sure you have completed all requirements and reviewed the grading rubric above.

07

Pro Tips

Start Simple, Then Iterate
  • Start with loading data and printing its shape
  • Add train/test split, then add one model
  • Test each step before moving to the next
  • Build complexity gradually, not all at once
Feature Scaling
  • Always scale before KNN (distance-based)
  • Fit scaler on train data only, transform both
  • Decision Trees don't require scaling
  • Never fit on test data—that's data leakage!
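The scaling rule above in code form, a minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[2.5]])

scaler = StandardScaler().fit(X_train)    # statistics from training data ONLY
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse train statistics; never refit

print(X_train_scaled.mean())  # ~0: the train set is centered by construction
```

Calling `fit` (or `fit_transform`) on the test set, or on the full dataset before splitting, leaks test-set statistics into the model and is exactly the mistake flagged above.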
Evaluation Tips
  • Use classification_report() for all metrics
  • Check for class imbalance with y.value_counts()
  • Accuracy alone can be misleading
  • Also look at precision, recall, and F1
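The evaluation tips above in practice, on a tiny hand-made example:

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             f1_score, precision_score, recall_score)

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]  # hypothetical classifier output

metrics = {
    "accuracy":  accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall":    recall_score(y_true, y_pred),
    "f1":        f1_score(y_true, y_pred),
}
print(metrics)
# classification_report prints per-class precision/recall/F1 in one call.
print(classification_report(y_true, y_pred))
```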
Common Mistakes
  • Using deprecated sklearn functions
  • Hardcoding file paths instead of parameters
  • Fitting scaler on entire dataset (data leakage)
  • Not setting random_state for reproducibility
08

Pre-Submission Checklist

Code Requirements
  • All 13 functions implemented with docstrings and type hints
  • Script runs end to end without errors
  • Both visualization files are generated in outputs/
  • No hardcoded file paths; random_state set for reproducibility
Repository Requirements
  • Public repository named exactly ai-ml-basics-assignment
  • All required files present: ml_pipeline.py, data/, outputs/, requirements.txt, README.md
  • README.md contains all required sections
  • .gitignore excludes __pycache__ and virtual environment folders