Beginner Project 2

Iris Flower Classification

Build your first complete classification pipeline using the famous Iris dataset. You will explore the data, train multiple classifiers, evaluate model performance with metrics such as accuracy, precision, and recall, and visualize decision boundaries.

4-6 hours
Beginner
200 Points
What You Will Build
  • EDA with visualizations
  • Multiple classification models
  • Model evaluation & comparison
  • Decision boundary plots
  • Confusion matrix analysis
Contents
01

Project Overview

The Iris dataset is the "Hello World" of machine learning classification. This project will help you master the fundamentals of classification by working with 150 samples of iris flowers across 3 species (Setosa, Versicolor, Virginica) using 4 features (sepal length, sepal width, petal length, petal width). Though simple, this project teaches core concepts that scale to any classification problem.

Skills Applied: This project tests your proficiency in Python (pandas, numpy, matplotlib, seaborn), scikit-learn (preprocessing, classifiers, metrics), and data visualization for classification problems.
Explore: Visualize features, distributions, and class separability
Preprocess: Scale features and split data properly
Train: Build and compare multiple classifiers
Evaluate: Analyze with metrics and confusion matrices

Learning Objectives

Classification Fundamentals
  • Understand multi-class classification problems
  • Learn train/test splitting strategies
  • Apply feature scaling for different algorithms
  • Interpret classification metrics correctly
  • Visualize decision boundaries
Practical Skills
  • Create pair plots and correlation heatmaps
  • Implement Logistic Regression, KNN, SVM, Decision Trees
  • Use cross-validation for robust evaluation
  • Generate and interpret confusion matrices
  • Write clean, documented Jupyter notebooks
02

Business Scenario

FloraID Botanics

You have been hired as a Junior Data Scientist at FloraID Botanics, a botanical research company that develops automated plant identification systems. Your manager has assigned you to work on the iris flower classification module:

"We're building an app that helps amateur botanists identify flowers from measurements. Start with the classic iris dataset - it's small but perfect for learning. Build me a classifier that can accurately identify the three iris species from petal and sepal measurements. Show me which algorithm works best and why."

Dr. Maya Patel, Lead Data Scientist, FloraID Botanics

Questions to Answer

Classification
  • Which species does a flower belong to given its measurements?
  • How accurate is the classification model?
  • Which algorithm performs best on this dataset?
  • Are all species equally easy to classify?
Feature Analysis
  • Which features are most important for classification?
  • Can we separate species using just 2 features?
  • Are there overlapping regions between species?
  • How do the feature distributions differ by species?
Pro Tip: While this is a beginner project, treat it professionally! Write clean code, document your findings, and explain your reasoning - these habits will serve you well on larger projects.
03

The Dataset

The Iris dataset is one of the most famous datasets in machine learning, introduced by statistician Ronald Fisher in 1936. Download the CSV file or load it directly from scikit-learn:

Dataset Download

Download the Iris dataset from Kaggle (based on UCI ML Repository) or use sklearn's built-in loader.

Download from Kaggle
Original Data Source

This project uses the classic Iris Dataset from Kaggle/UCI. It contains measurements from 150 iris flowers across 3 species (Setosa, Versicolor, Virginica), with 50 samples per class.

Dataset Info: 150 samples × 5 columns | 4 features (sepal/petal length & width) | 3 classes | Balanced: 50 samples per class | No missing values | Perfect for classification beginners
Features Overview

Feature      | Type  | Unit | Range     | Description
sepal_length | float | cm   | 4.3 - 7.9 | Length of the sepal
sepal_width  | float | cm   | 2.0 - 4.4 | Width of the sepal
petal_length | float | cm   | 1.0 - 6.9 | Length of the petal
petal_width  | float | cm   | 0.1 - 2.5 | Width of the petal

Class | Label           | Count | Characteristics
0     | Iris-setosa     | 50    | Small petals, easily separable from others
1     | Iris-versicolor | 50    | Medium-sized, some overlap with virginica
2     | Iris-virginica  | 50    | Large petals, some overlap with versicolor
Dataset Stats: 150 samples, 4 features, 3 balanced classes, no missing values
Key Insight: Setosa is linearly separable; Versicolor & Virginica have some overlap
Sample Data Preview
sepal_length | sepal_width | petal_length | petal_width | species
5.1          | 3.5         | 1.4          | 0.2         | setosa
7.0          | 3.2         | 4.7          | 1.4         | versicolor
6.3          | 3.3         | 6.0          | 2.5         | virginica
04

Project Requirements

Create a single well-organized Jupyter notebook that covers all the following components with clear documentation and visualizations.

1
Data Loading & EDA
  • Load the iris dataset (CSV or sklearn)
  • Display dataset shape, dtypes, and basic statistics
  • Check for missing values (should be none)
  • Visualize class distribution with bar chart
  • Create pair plot colored by species
  • Generate correlation heatmap
  • Box plots of features by species
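The EDA steps above can be sketched as follows; the file name and plot choices are just suggestions, not part of the requirements:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Load the iris data into a DataFrame with a readable species column
iris = load_iris(as_frame=True)
df = iris.frame
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))

print(df.shape)                      # shape: 150 rows, 6 columns
print(df.isna().sum().sum())         # total missing values (should be 0)
print(df.groupby("species").mean())  # per-species feature averages

# Class distribution bar chart
ax = df["species"].value_counts().plot(kind="bar", title="Class distribution")
ax.set_ylabel("count")
plt.tight_layout()
plt.savefig("class_distribution.png")
```

The same `df` can feed `seaborn.pairplot(df, hue="species")` and `df.corr(numeric_only=True)` for the pair plot and correlation heatmap.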
2
Data Preprocessing
  • Separate features (X) and target (y)
  • Split data into train/test sets (80/20 or 70/30)
  • Apply StandardScaler to features
  • Document your preprocessing choices
3
Model Training

Train at least 4 different classifiers:

  • Logistic Regression
  • K-Nearest Neighbors (KNN)
  • Support Vector Machine (SVM)
  • Decision Tree Classifier
  • (Optional: Random Forest, Naive Bayes)
4
Model Evaluation
  • Calculate accuracy, precision, recall, F1-score for each model
  • Generate confusion matrix for each model
  • Perform 5-fold cross-validation
  • Create comparison table of all models
  • Visualize decision boundaries (using 2 best features)
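A minimal sketch of the evaluation steps above, using KNN as a stand-in for whichever model you are assessing (a `Pipeline` keeps the scaler inside each fold, avoiding leakage):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Pipeline applies scaling inside each CV fold, so no test-set leakage
clf = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# 5-fold cross-validation on the training split
scores = cross_val_score(clf, X_train, y_train, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# Fit on the full training set, then evaluate the held-out test data
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred,
                            target_names=load_iris().target_names))
```

Repeating this loop for each classifier gives you the rows of the comparison table.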
5
Conclusion
  • Summarize which model performed best and why
  • Discuss which features are most important
  • Note any challenges or interesting findings
  • Suggest potential improvements
05

Model Specifications

Train and compare the following classifiers. Use default parameters first, then optionally tune hyperparameters for your best performer.

Linear Models
  • Logistic Regression: Multi-class with 'ovr' or 'multinomial'
  • SVM (Linear): kernel='linear', C=1.0
Non-Linear Models
  • KNN: Try k=3, 5, 7
  • Decision Tree: max_depth=3 to 5
  • SVM (RBF): kernel='rbf', gamma='scale'
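For the optional tuning step, the candidate values listed above (e.g. k=3, 5, 7 for KNN) can be compared with a small grid search; the pipeline step names here are just illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Compare the suggested k values via 5-fold cross-validation
grid = GridSearchCV(pipe, {"knn__n_neighbors": [3, 5, 7]}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, f"best CV accuracy: {grid.best_score_:.3f}")
```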
Evaluation Metrics
Accuracy: fraction of all predictions that are correct
Precision: of the samples predicted as a class, the fraction that truly belong to it
Recall: of the samples that truly belong to a class, the fraction the model finds
F1-Score: harmonic mean of precision and recall
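To make these definitions concrete, here is a toy 3-class example with made-up labels, computing each metric per class by hand (in the notebook you would use `sklearn.metrics` instead):

```python
import numpy as np

# Hypothetical true labels and predictions for a 3-class problem
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 1])
y_pred = np.array([0, 0, 1, 2, 1, 2, 2, 1, 2, 1])

for c in range(3):
    tp = np.sum((y_pred == c) & (y_true == c))  # true positives for class c
    fp = np.sum((y_pred == c) & (y_true != c))  # false positives
    fn = np.sum((y_pred != c) & (y_true == c))  # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of P and R
    print(f"class {c}: P={precision:.2f} R={recall:.2f} F1={f1:.2f}")

accuracy = np.mean(y_true == y_pred)
print(f"accuracy: {accuracy:.2f}")
```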

Sample Code
# Load and prepare data
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

models = {
    'Logistic Regression': LogisticRegression(max_iter=200),
    'KNN (k=5)': KNeighborsClassifier(n_neighbors=5),
    'SVM (RBF)': SVC(kernel='rbf', gamma='scale'),
    'Decision Tree': DecisionTreeClassifier(max_depth=4, random_state=42)
}

for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    accuracy = model.score(X_test_scaled, y_test)
    print(f'{name}: {accuracy:.4f}')
Target Performance: Aim for accuracy > 95% on test set. With proper preprocessing, most algorithms can achieve 96-100% on this dataset.
06

Required Visualizations

Create at least 10 visualizations in your notebook. Each should have proper titles, labels, and brief interpretive commentary.

EDA
Exploratory Visualizations
  • Class distribution bar chart
  • Pair plot (seaborn pairplot)
  • Correlation heatmap
  • Box plots by species
  • Violin plots or histograms
Model
Model Evaluation Plots
  • Confusion matrices (heatmaps)
  • Model accuracy comparison bar chart
  • Cross-validation score boxplots
  • Decision boundary plot (2D)
  • Classification report visualization
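The 2D decision boundary plot can be sketched along these lines, here using petal length and width (typically the two most separating features) and an RBF SVM as one possible model:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
X = iris.data[:, 2:4]  # petal length and petal width only
y = iris.target

clf = SVC(kernel="rbf", gamma="scale").fit(X, y)

# Predict the class on a dense grid covering the 2D feature space
xx, yy = np.meshgrid(
    np.linspace(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 300),
    np.linspace(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, 300),
)
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)                  # shaded decision regions
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k")   # actual samples
plt.xlabel("petal length (cm)")
plt.ylabel("petal width (cm)")
plt.title("SVM (RBF) decision boundaries")
plt.savefig("decision_boundaries.png")
```

Swapping in a different fitted 2-feature classifier for `clf` produces the same plot for any model in your comparison.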
07

Submission Requirements

Create a public GitHub repository with the exact name shown below:

Required Repository Name
iris-classification-ml
github.com/<your-username>/iris-classification-ml
Required Project Structure
iris-classification-ml/
├── data/
│   └── iris.csv                  # Dataset (or note if using sklearn)
├── notebooks/
│   └── iris_classification.ipynb # Main analysis notebook
├── visualizations/
│   ├── pair_plot.png
│   ├── confusion_matrices.png
│   ├── model_comparison.png
│   └── decision_boundaries.png
├── requirements.txt              # Python dependencies
└── README.md                     # Project documentation
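A minimal requirements.txt might look like the following; pin the exact versions you actually used rather than copying this list verbatim:

```text
pandas
numpy
matplotlib
seaborn
scikit-learn
jupyter
```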
README.md Required Sections
  • Project title, your name, date
  • Project overview and objectives
  • Dataset description
  • Technologies used (Python, sklearn, etc.)
  • Key findings from EDA
  • Model comparison results
  • Best model and accuracy
  • How to run the notebook
08

Grading Rubric

Your project will be graded on the following criteria. Total: 200 points.

Criteria                  | Points | Description
Exploratory Data Analysis | 40     | Comprehensive EDA with 5+ visualizations and insights
Data Preprocessing        | 20     | Proper train/test split and feature scaling
Model Training            | 40     | At least 4 different classifiers trained correctly
Model Evaluation          | 40     | Metrics, confusion matrices, cross-validation
Visualizations            | 30     | 10+ clear, labeled visualizations
Documentation             | 30     | README, code comments, conclusions
Total                     | 200    |

Ready to Submit?

Make sure you have completed all requirements and reviewed the grading rubric above.

Submit Your Project