Assignment 2-A

ML Basics: Build Your First Pipeline

Build a complete machine learning pipeline that combines all Module 2 concepts: data preparation, train/test splitting, classification with KNN and Decision Trees, regression with Linear Regression, and model evaluation with industry-standard metrics.

8-10 hours
Intermediate
250 Points
What You'll Practice
  • Load and preprocess datasets
  • Implement train/test splitting
  • Train KNN & Decision Tree classifiers
  • Build Linear Regression models
  • Evaluate with accuracy, MSE, R²
Contents
01

Assignment Overview

In this assignment, you will build a complete Machine Learning Pipeline from scratch. This comprehensive project requires you to apply ALL concepts from Module 2: data loading, preprocessing, train/test splitting, classification models (KNN, Decision Trees), regression models (Linear Regression), and model evaluation with industry-standard metrics.

Libraries Allowed: You may use numpy, pandas, scikit-learn, and matplotlib. These are the standard tools used in the industry for machine learning tasks.
Skills Applied: This assignment tests your understanding of ML Foundations (Topic 2.1), Classification (Topic 2.2), and Regression (Topic 2.3) from Module 2.
Data Preparation

Loading, cleaning, splitting, and scaling datasets

Classification

KNN, Decision Trees, accuracy, precision, recall, F1

Regression

Linear Regression, MSE, RMSE, MAE, R² score

02

The Scenario

TechPredict Analytics

You've just joined TechPredict Analytics, a consulting firm that helps businesses make data-driven decisions. Your manager has assigned you two key projects:

"We have two clients who need predictive models. First, a telecom company wants to identify customers likely to cancel their subscription. Second, a real estate agency needs an automated system to estimate house prices. Can you build ML models for both problems?"

Project 1: Classification
Customer Churn Prediction

Build a classifier to predict which customers are likely to leave the telecom company, so retention offers can be made proactively.

Project 2: Regression
House Price Estimation

Build a regression model to estimate house sale prices based on property features like square footage, bedrooms, and age.

Your Task

Create a Python file called ml_pipeline.py that implements a complete machine learning pipeline. Your code must load datasets, train multiple models, evaluate their performance, and generate visualizations comparing results.

03

The Datasets

You will work with TWO datasets. Create these CSV files or generate synthetic data:

customer_churn.csv

Telecom customer data with churn labels

  • tenure: Months as customer
  • monthly_charges: Monthly bill amount
  • total_charges: Total amount paid
  • contract_type: 0=Monthly, 1=Yearly
  • churn: 0=Stayed, 1=Left (target)

house_prices.csv

House features and sale prices

  • sqft: Square footage
  • bedrooms: Number of bedrooms
  • bathrooms: Number of bathrooms
  • age: House age in years
  • price: Sale price in dollars (target)
Download: Get sample datasets from the course data folder or generate synthetic data using scikit-learn's make_classification and make_regression.
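If you go the synthetic route, one possible sketch is below. The column names follow the schemas above, but note that the generated values are standardized numbers, not realistic tenures or square footages:

```python
import os
import pandas as pd
from sklearn.datasets import make_classification, make_regression

os.makedirs("data", exist_ok=True)

# Classification: 4 features renamed to the churn schema above.
Xc, yc = make_classification(n_samples=500, n_features=4, n_informative=3,
                             n_redundant=0, random_state=42)
churn = pd.DataFrame(Xc, columns=["tenure", "monthly_charges",
                                  "total_charges", "contract_type"])
churn["churn"] = yc
churn.to_csv("data/customer_churn.csv", index=False)

# Regression: 4 features renamed to the house-price schema above.
Xr, yr = make_regression(n_samples=500, n_features=4, noise=10.0,
                         random_state=42)
houses = pd.DataFrame(Xr, columns=["sqft", "bedrooms", "bathrooms", "age"])
houses["price"] = yr
houses.to_csv("data/house_prices.csv", index=False)

print(churn.shape, houses.shape)  # (500, 5) (500, 5)
```

The sample count (500) and `make_classification` settings are arbitrary choices here; any values that give both classes reasonable representation will work.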
04

Requirements

Implement the following 13 functions in your ml_pipeline.py file. Each function should follow the exact signature provided:

1
load_dataset

Load a CSV file into a pandas DataFrame. Handle missing values by dropping rows with any null values.

def load_dataset(filepath: str) -> pd.DataFrame:
    """
    Load and clean a CSV dataset.
    
    Args:
        filepath: Path to the CSV file
        
    Returns:
        Cleaned pandas DataFrame with no missing values
    """
    pass
2
split_features_target

Separate the dataset into features (X) and target variable (y).

def split_features_target(df: pd.DataFrame, target_col: str) -> tuple:
    """
    Split DataFrame into features and target.
    
    Args:
        df: Input DataFrame
        target_col: Name of the target column
        
    Returns:
        Tuple of (X, y) where X is features DataFrame and y is target Series
    """
    pass
3
create_train_test_split

Split data into training and testing sets. Use an 80/20 split ratio and set a random state for reproducibility.

def create_train_test_split(X, y, test_size=0.2, random_state=42) -> tuple:
    """
    Create train/test split of the data.
    
    Args:
        X: Features
        y: Target
        test_size: Proportion of data for testing (default 0.2)
        random_state: Random seed for reproducibility
        
    Returns:
        Tuple of (X_train, X_test, y_train, y_test)
    """
    pass
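For reference, this function can be a thin wrapper around scikit-learn's `train_test_split`. A sketch (stratification is omitted for simplicity):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def create_train_test_split(X, y, test_size=0.2, random_state=42) -> tuple:
    # Delegate to scikit-learn's splitter; the seed makes the split reproducible.
    return train_test_split(X, y, test_size=test_size, random_state=random_state)

# Quick sanity check on toy data: 10 samples, 80/20 split.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
X_train, X_test, y_train, y_test = create_train_test_split(X, y)
print(len(X_train), len(X_test))  # 8 2
```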
4
scale_features

Standardize features using StandardScaler. Fit on training data only, then transform both train and test.

def scale_features(X_train, X_test) -> tuple:
    """
    Standardize features using StandardScaler.
    
    Args:
        X_train: Training features
        X_test: Test features
        
    Returns:
        Tuple of (X_train_scaled, X_test_scaled, scaler)
    """
    pass
5
train_knn_classifier

Train a K-Nearest Neighbors classifier with configurable number of neighbors.

def train_knn_classifier(X_train, y_train, n_neighbors=5):
    """
    Train a KNN classifier.
    
    Args:
        X_train: Training features
        y_train: Training labels
        n_neighbors: Number of neighbors (default 5)
        
    Returns:
        Trained KNeighborsClassifier model
    """
    pass
6
train_decision_tree_classifier

Train a Decision Tree classifier with maximum depth to prevent overfitting.

def train_decision_tree_classifier(X_train, y_train, max_depth=5, random_state=42):
    """
    Train a Decision Tree classifier.
    
    Args:
        X_train: Training features
        y_train: Training labels
        max_depth: Maximum tree depth (default 5)
        random_state: Random seed for reproducibility
        
    Returns:
        Trained DecisionTreeClassifier model
    """
    pass
7
train_linear_regression

Train a Linear Regression model for the house price prediction task.

def train_linear_regression(X_train, y_train):
    """
    Train a Linear Regression model.
    
    Args:
        X_train: Training features
        y_train: Training target values
        
    Returns:
        Trained LinearRegression model
    """
    pass
8
evaluate_classifier

Evaluate a classification model. Calculate accuracy, precision, recall, and F1-score.

def evaluate_classifier(model, X_test, y_test) -> dict:
    """
    Evaluate a classification model.
    
    Args:
        model: Trained classifier
        X_test: Test features
        y_test: True labels
        
    Returns:
        Dictionary with 'accuracy', 'precision', 'recall', 'f1' scores
    """
    pass
9
evaluate_regressor

Evaluate a regression model. Calculate MSE, RMSE, MAE, and R² score.

def evaluate_regressor(model, X_test, y_test) -> dict:
    """
    Evaluate a regression model.
    
    Args:
        model: Trained regressor
        X_test: Test features
        y_test: True target values
        
    Returns:
        Dictionary with 'mse', 'rmse', 'mae', 'r2' scores
    """
    pass
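As a check on your implementation, the four regression metrics can be computed from `sklearn.metrics` like this (a sketch using made-up predictions in place of a trained model):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_test = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 310.0])  # hypothetical model output

mse = mean_squared_error(y_test, y_pred)
results = {
    "mse":  mse,
    "rmse": np.sqrt(mse),  # RMSE is in the same units as the target
    "mae":  mean_absolute_error(y_test, y_pred),
    "r2":   r2_score(y_test, y_pred),
}
print(results)  # mse=100.0, rmse=10.0, mae=10.0, r2=0.985
```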
10
compare_classifiers

Compare KNN and Decision Tree classifiers by training and evaluating both.

def compare_classifiers(X_train, X_test, y_train, y_test) -> dict:
    """
    Train and compare KNN and Decision Tree classifiers.
    
    Args:
        X_train, X_test: Train and test features
        y_train, y_test: Train and test labels
        
    Returns:
        Dictionary with model names as keys and evaluation dicts as values
        Example: {'KNN': {'accuracy': 0.85, ...}, 'DecisionTree': {...}}
    """
    pass
11
plot_confusion_matrix

Create and save a confusion matrix visualization for a classifier.

def plot_confusion_matrix(model, X_test, y_test, save_path: str) -> None:
    """
    Plot and save confusion matrix.
    
    Args:
        model: Trained classifier
        X_test: Test features
        y_test: True labels
        save_path: File path to save the plot
    """
    pass
12
plot_regression_results

Create a scatter plot comparing actual vs predicted values for regression.

def plot_regression_results(model, X_test, y_test, save_path: str) -> None:
    """
    Plot actual vs predicted values for regression.
    
    Args:
        model: Trained regressor
        X_test: Test features
        y_test: True target values
        save_path: File path to save the plot
    """
    pass
13
main (Pipeline Execution)

Create a main function that orchestrates the entire ML pipeline.

def main():
    """
    Execute the complete ML pipeline:
    1. Load and prepare both datasets
    2. Train classification models (KNN, Decision Tree)
    3. Train regression model (Linear Regression)
    4. Evaluate all models
    5. Generate visualizations
    6. Print summary report
    """
    pass

if __name__ == "__main__":
    main()
05

Submission

Create a public GitHub repository with the exact name shown below:

Required Repository Name
ai-ml-basics-assignment
github.com/<your-username>/ai-ml-basics-assignment
Required Files
ai-ml-basics-assignment/
├── ml_pipeline.py          # Main implementation file with ALL 13 functions
├── data/
│   ├── customer_churn.csv  # Classification dataset
│   └── house_prices.csv    # Regression dataset
├── outputs/
│   ├── confusion_matrix.png
│   └── regression_plot.png
├── requirements.txt        # Dependencies
└── README.md               # REQUIRED - see contents below
README.md Must Include:
  • Your full name and submission date
  • Results Summary: A table showing model performance metrics
  • Key Findings: Which model performed better and why
  • Challenges: Any difficulties you encountered and how you solved them
  • Instructions to run your code
Do Include
  • All 13 functions implemented and working
  • Docstrings for every function
  • Type hints in function signatures
  • Both output visualizations from running your code
  • PEP 8 compliant code style
  • README.md with all required sections
Do Not Include
  • Hardcoded file paths
  • Any .pyc or __pycache__ files (use .gitignore)
  • Virtual environment folders
  • Code that doesn't run without errors
  • Deprecated sklearn functions
  • Code copied without understanding
Important: Before submitting, run your script to make sure it executes without errors and generates both visualization files correctly!
Submit Your Assignment

To submit, enter your GitHub username on the assignment page; your repository will be verified automatically.

06

Grading Rubric

  • Data Loading & Prep (30 pts): Correctly loads CSVs, handles missing values, splits features/target
  • Train/Test Split (20 pts): Proper 80/20 split with random state, feature scaling implemented
  • Classification Models (50 pts): KNN and Decision Tree correctly implemented and trained
  • Regression Model (30 pts): Linear Regression correctly implemented and trained
  • Model Evaluation (40 pts): All metrics calculated correctly (accuracy, precision, recall, F1, MSE, R²)
  • Visualizations (30 pts): Confusion matrix and regression plot generated and saved
  • Code Quality (30 pts): PEP 8, docstrings, type hints, clean structure, error handling
  • Documentation (20 pts): Complete README with setup, results, and analysis
  • Total: 250 pts
Bonus Points (+25): Implement an additional model (e.g., Random Forest, SVM) and include it in your comparison analysis.
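A starting point for the bonus: a Random Forest slots into the same train/evaluate pattern as the other classifiers. A sketch on synthetic data (swap in your churn split and comparison dict):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Hyperparameters here are arbitrary starting values, not tuned choices.
rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf.fit(X_train, y_train)
acc = accuracy_score(y_test, rf.predict(X_test))
print(f"RandomForest accuracy: {acc:.3f}")
```

Because trees are not distance-based, the forest can be trained on unscaled features, which makes it easy to drop into your `compare_classifiers` results.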

Ready to Submit?

Make sure you have completed all requirements and reviewed the grading rubric above.

07

Pro Tips

Start Simple, Then Iterate
  • Start with loading data and printing its shape
  • Add train/test split, then add one model
  • Test each step before moving to the next
  • Build complexity gradually, not all at once
Feature Scaling
  • Always scale before KNN (distance-based)
  • Fit scaler on train data only, transform both
  • Decision Trees don't require scaling
  • Never fit on test data—that's data leakage!
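The scaling rule above in code form, a minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[2.5]])

scaler = StandardScaler().fit(X_train)    # statistics from training data ONLY
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse train statistics; never refit

print(X_train_scaled.mean())  # ~0: the train set is centered by construction
```

Calling `fit` (or `fit_transform`) on the test set, or on the full dataset before splitting, leaks test-set statistics into the model and is exactly the mistake flagged above.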
Evaluation Tips
  • Use classification_report() for all metrics
  • Check for class imbalance with y.value_counts()
  • Accuracy alone can be misleading
  • Also look at precision, recall, and F1
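The evaluation tips above in practice, on a tiny hand-made example:

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             f1_score, precision_score, recall_score)

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]  # hypothetical classifier output

metrics = {
    "accuracy":  accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall":    recall_score(y_true, y_pred),
    "f1":        f1_score(y_true, y_pred),
}
print(metrics)
# classification_report prints per-class precision/recall/F1 in one call.
print(classification_report(y_true, y_pred))
```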
Common Mistakes
  • Using deprecated sklearn functions
  • Hardcoding file paths instead of parameters
  • Fitting scaler on entire dataset (data leakage)
  • Not setting random_state for reproducibility
08

Pre-Submission Checklist

Code Requirements
  • All 13 functions implemented with docstrings and type hints
  • Script runs end to end without errors
  • Both visualization files are generated in outputs/
  • No hardcoded file paths; random_state set for reproducibility
Repository Requirements
  • Public repository named exactly ai-ml-basics-assignment
  • All required files present: ml_pipeline.py, data/, outputs/, requirements.txt, README.md
  • README.md contains all required sections
  • .gitignore excludes __pycache__ and virtual environment folders