Capstone Project 2

Web Scraper

Build a powerful web scraper using Python, BeautifulSoup, and Requests. You will learn to extract data from websites, handle pagination, implement error handling, and export structured data to CSV files for analysis.

8-10 hours
Intermediate
450 Points
What You Will Build
  • HTTP request handling
  • HTML parsing with BeautifulSoup
  • Data extraction pipeline
  • Pagination handling
  • CSV export functionality
Contents
01

Project Overview

Build a professional web scraping application that demonstrates your Python skills with HTTP requests, HTML parsing, data extraction, and file handling. Your scraper will collect book data from a practice website, handle multiple pages, implement robust error handling, and export clean CSV files.

Skills Applied: This project tests your proficiency in Python libraries (requests, BeautifulSoup, csv), HTML/CSS selectors, error handling (try/except), data cleaning, and working with external APIs and websites.

What You Will Build

A fully functional web scraper that extracts book data:

$ python scraper.py

==============================================
     BOOKSCRAPER - Web Scraping Tool
==============================================

Target URL: https://books.toscrape.com
Categories: All Books

[INFO] Starting scraper...
[INFO] Fetching page 1...
[INFO] Found 20 books on page 1
[INFO] Fetching page 2...
[INFO] Found 20 books on page 2
...
[INFO] Fetching page 50...
[INFO] Found 20 books on page 50

==============================================
           SCRAPING COMPLETE!
==============================================
Total Books Scraped: 1000
Categories Found: 50
Export File: data/books_2025-01-15.csv
Time Elapsed: 45.2 seconds

[SUCCESS] Data exported to CSV!

Skills You Will Apply

HTTP Requests

GET requests, headers, sessions, and response handling

HTML Parsing

BeautifulSoup, CSS selectors, DOM navigation

Data Extraction

Pattern matching, data cleaning, validation

CSV Export

File writing, structured data, encoding

Learning Objectives

Technical Skills
  • Make HTTP requests using the requests library
  • Parse HTML content with BeautifulSoup
  • Navigate DOM trees and extract specific elements
  • Handle pagination and multiple pages
  • Export data to CSV with proper formatting
Professional Skills
  • Implement robust error handling
  • Add rate limiting and polite scraping
  • Write modular, reusable code
  • Document scraping workflow
  • Understand web scraping ethics and legality
02

Project Scenario

DataHarvest Analytics

You have been hired as a Python Developer at DataHarvest Analytics, a data collection startup. The company needs to build a web scraping tool that can extract book information from online bookstores for market research and price comparison analysis. The tool should be reliable, respectful of website resources, and produce clean, analysis-ready data.

"We need a scraper that can collect book data - titles, prices, ratings, and availability. It should handle multiple pages, deal with errors gracefully, and output everything to CSV so our analysts can work with the data. Make sure it's polite to the servers - we don't want to get blocked!"

Alex Rivera, Data Engineering Lead

Core Features Required

Web Requests
  • Send HTTP GET requests to target URLs
  • Handle response status codes
  • Set proper User-Agent headers
  • Implement request timeouts
HTML Parsing
  • Parse HTML with BeautifulSoup
  • Use CSS selectors to find elements
  • Extract text, attributes, and links
  • Handle missing or malformed data
Data Extraction
  • Extract book titles, prices, ratings
  • Get availability status and categories
  • Follow links to detail pages
  • Clean and normalize extracted data
Export & Storage
  • Export data to CSV format
  • Handle Unicode characters properly
  • Create timestamped output files
  • Include summary statistics
Practice Website: Use books.toscrape.com - a website specifically designed for practicing web scraping. It is free to scrape and won't block your requests!
03

The Dataset

Your scraper will extract data similar to real-world book datasets. We provide sample data to help you understand the expected output format and validate your scraping results.

Dataset Download

Download the sample dataset from Kaggle (Amazon Top 50 Bestselling Books 2009-2019) to understand the expected output format and compare your scraped results.

Original Data Source

This project is inspired by the Amazon Top 50 Bestselling Books 2009-2019 dataset from Kaggle - a popular dataset containing 550 books scraped from Amazon. The dataset demonstrates real-world book data with titles, authors, ratings, reviews, prices, and genres that you will learn to extract through web scraping.

Dataset Info: 550 books | Years: 2009-2019 | Fields: Name, Author, User Rating, Reviews, Price, Year, Genre (Fiction/Non-Fiction) | License: CC0 Public Domain | Usability: 10.0
Dataset Schema (Kaggle Format)

Column        Type     Description
Name          String   Book title
Author        String   Author name
User Rating   Decimal  Average user rating (0.0-5.0)
Reviews       Integer  Number of user reviews
Price         Integer  Price in USD
Year          Integer  Year of publication (2009-2019)
Genre         String   Fiction or Non Fiction
Sample Data Preview

Here is sample data from the Kaggle dataset (Amazon bestsellers):

Name                      Author           User Rating  Reviews  Price  Year  Genre
Becoming                  Michelle Obama   4.8          61133    $11    2019  Non Fiction
Where the Crawdads Sing   Delia Owens      4.8          87841    $15    2019  Fiction
Educated: A Memoir        Tara Westover    4.7          42865    $14    2018  Non Fiction
Practice Target: books.toscrape.com is recommended for practicing your scraping skills. Compare your scraped results with the Kaggle dataset format!
04

Project Requirements

Your project must include all of the following components. Structure your code with clear organization, proper documentation, and follow Python best practices.

1
Project Structure

Organize your code into the following structure:

web-scraper/
├── scraper.py           # Main entry point
├── fetcher.py           # HTTP request handling
├── parser.py            # BeautifulSoup parsing logic
├── exporter.py          # CSV export functionality
├── config.py            # Configuration settings
├── utils.py             # Helper functions
├── data/
│   ├── books_YYYY-MM-DD.csv    # Scraped output (auto-generated)
│   └── categories.csv          # Category summary
├── tests/
│   ├── test_fetcher.py
│   ├── test_parser.py
│   └── test_exporter.py
├── requirements.txt     # Dependencies
└── README.md            # Project documentation
Requirement: Each module should have a clear, single responsibility. The scraper should work as a standalone CLI tool.
2
Fetcher Module (HTTP Requests)

Handle all HTTP operations:

  • get_page(url): Fetch a single URL and return response
  • get_pages(urls): Fetch multiple URLs with rate limiting
  • Headers: Set User-Agent and Accept headers
  • Timeouts: Implement request timeouts (10 seconds)
  • Retries: Retry failed requests up to 3 times
  • Rate limiting: Wait 1 second between requests
import requests
import time
from typing import Optional

class Fetcher:
    BASE_URL = "https://books.toscrape.com"
    
    def __init__(self, delay: float = 1.0):
        self.delay = delay
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'BookScraper/1.0 (Student Project)',
            'Accept': 'text/html,application/xhtml+xml'
        })
    
    def get_page(self, url: str) -> Optional[str]:
        """Fetch a single page and return HTML content."""
        try:
            time.sleep(self.delay)  # Rate limiting
            response = self.session.get(url, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.RequestException as e:
            print(f"Error fetching {url}: {e}")
            return None
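The requirements above also call for retrying failed requests up to 3 times, which the sketch leaves out. One way to add this is a small standalone wrapper; `fetch_with_retries` and its `get_func` parameter are illustrative names, not part of the spec:

```python
import time

def fetch_with_retries(get_func, url: str, retries: int = 3, delay: float = 1.0):
    """Call get_func(url) up to `retries` times, sleeping `delay` seconds
    between failed attempts.  Returns the result, or None if all attempts fail."""
    for attempt in range(1, retries + 1):
        try:
            return get_func(url)
        except Exception as e:
            print(f"[WARN] Attempt {attempt}/{retries} failed for {url}: {e}")
            if attempt < retries:
                time.sleep(delay)
    return None
```

To use it with the Fetcher above, pass a callable that raises on failure (for example, one that calls `response.raise_for_status()`), so HTTP errors trigger a retry rather than returning silently.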
3
Parser Module (BeautifulSoup)

Parse HTML and extract data:

  • parse_book_list(html): Extract all books from a listing page
  • parse_book_detail(html): Extract full details from a book page
  • parse_categories(html): Extract all category links
  • parse_pagination(html): Find next page link if exists
  • clean_price(text): Convert "£51.77" to float 51.77
  • clean_rating(class_name): Convert "star-rating Three" to 3
from bs4 import BeautifulSoup
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Book:
    title: str
    price: float
    rating: int
    availability: str
    category: str = ""
    upc: str = ""
    description: str = ""
    url: str = ""

class Parser:
    RATING_MAP = {
        'One': 1, 'Two': 2, 'Three': 3, 
        'Four': 4, 'Five': 5
    }
    
    def parse_book_list(self, html: str) -> List[Book]:
        """Extract all books from a listing page."""
        soup = BeautifulSoup(html, 'html.parser')
        books = []
        for article in soup.select('article.product_pod'):
            title = article.h3.a['title']
            price = self._clean_price(article.select_one('.price_color').text)
            rating = self._get_rating(article.select_one('.star-rating'))
            availability = article.select_one('.availability').text.strip()
            url = article.h3.a['href']
            books.append(Book(title, price, rating, availability, url=url))
        return books
    
    def _clean_price(self, text: str) -> float:
        """Convert '£51.77' to 51.77."""
        return float(text.replace('£', '').strip())
    
    def _get_rating(self, tag) -> int:
        """Convert a 'star-rating Three' class list to 3."""
        for cls in tag['class']:
            if cls in self.RATING_MAP:
                return self.RATING_MAP[cls]
        return 0
4
Exporter Module (CSV)

Export data to CSV files:

  • export_books(books, filepath): Export book list to CSV
  • export_categories(categories, filepath): Export category summary
  • Encoding: Use UTF-8 encoding for Unicode support
  • Timestamps: Include date in filename
  • Headers: Include column headers in first row
import csv
import os
from datetime import datetime
from typing import List, Optional
from parser import Book

class Exporter:
    def export_books(self, books: List[Book], filepath: Optional[str] = None) -> str:
        """Export books to CSV file."""
        if filepath is None:
            date_str = datetime.now().strftime('%Y-%m-%d')
            filepath = f"data/books_{date_str}.csv"
        
        # Create the output directory if it does not exist yet
        os.makedirs(os.path.dirname(filepath) or '.', exist_ok=True)
        with open(filepath, 'w', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            writer.writerow(['title', 'price', 'rating', 'availability', 
                           'category', 'upc', 'description', 'url'])
            for book in books:
                writer.writerow([
                    book.title, book.price, book.rating, book.availability,
                    book.category, book.upc, book.description, book.url
                ])
        return filepath
5
Error Handling

Implement robust error handling:

  • Network errors: Handle connection timeouts and failures
  • HTTP errors: Handle 404, 500, and other status codes
  • Parsing errors: Handle missing elements gracefully
  • Logging: Log all errors with timestamps
  • Recovery: Continue scraping even if some pages fail
Requirement: Your scraper should never crash due to missing HTML elements or network issues. Use try/except blocks and provide fallback values.
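A common pattern for the "never crash on missing elements" rule is a small fallback helper. This is a sketch, not a required API; `safe_text` is our name for it, and it accepts anything with a `get_text` method, such as a BeautifulSoup tag or None:

```python
def safe_text(tag, default: str = "N/A") -> str:
    """Return stripped text from a parsed tag, or a fallback value when
    the selector matched nothing (i.e. tag is None)."""
    if tag is None:
        return default
    return tag.get_text(strip=True)
```

With this helper, a call like `safe_text(article.select_one('.availability'))` yields a fallback instead of raising `AttributeError` when the element is absent.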
6
CLI Interface

Command-line arguments:

  • python scraper.py - Scrape all books
  • python scraper.py --category Travel - Scrape specific category
  • python scraper.py --pages 5 - Limit to first N pages
  • python scraper.py --output data/my_books.csv - Custom output file
  • python scraper.py --verbose - Show detailed progress
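The flags listed above map naturally onto the standard library's argparse. A minimal sketch (the help strings and function name are ours, not from the spec):

```python
import argparse

def build_arg_parser() -> argparse.ArgumentParser:
    """Build the CLI parser for the flags this project requires."""
    p = argparse.ArgumentParser(
        description="BookScraper - scrape book data from books.toscrape.com")
    p.add_argument('--category', help='scrape a single category, e.g. Travel')
    p.add_argument('--pages', type=int, default=None,
                   help='limit scraping to the first N pages')
    p.add_argument('--output', default=None, help='custom output CSV path')
    p.add_argument('--verbose', action='store_true',
                   help='show detailed progress for each page')
    return p
```

In `scraper.py`, `build_arg_parser().parse_args()` then gives you `args.category`, `args.pages`, `args.output`, and `args.verbose` to drive the run.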
05

Feature Specifications

Implement the following features with proper error handling. Each feature should be testable independently.

Page Fetching
  • Fetch pages with proper headers
  • Implement 1-second delay between requests
  • Retry failed requests (max 3 times)
  • Handle timeout errors (10s limit)
  • Log response status codes
  • Support session persistence
HTML Parsing
  • Parse book listings with BeautifulSoup
  • Extract title, price, rating, availability
  • Navigate to detail pages for more info
  • Extract UPC, description, category
  • Handle missing elements gracefully
  • Clean and normalize extracted text
Pagination
  • Detect if next page exists
  • Build correct next page URL
  • Loop through all available pages
  • Track total pages processed
  • Option to limit number of pages
  • Handle last page gracefully
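On books.toscrape.com the "next" link is a relative href inside an `li.next` element, so next-page detection can be sketched as below (the function name is illustrative):

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def next_page_url(html: str, current_url: str):
    """Return the absolute URL of the next listing page, or None on the
    last page (books.toscrape.com marks it with <li class="next">)."""
    soup = BeautifulSoup(html, 'html.parser')
    link = soup.select_one('li.next > a')
    if link is None:
        return None  # no "next" element: this was the last page
    return urljoin(current_url, link['href'])
```

The main loop can then keep fetching until this returns None, incrementing a page counter to honor a `--pages` limit.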
CSV Export
  • Export to CSV with UTF-8 encoding
  • Include header row
  • Handle special characters in text
  • Create timestamped filenames
  • Create data directory if missing
  • Return filepath after export
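Timestamped filenames and directory creation from the list above combine naturally into one helper; `make_output_path` is our name for it, not part of the spec:

```python
import os
from datetime import datetime

def make_output_path(directory: str = "data", prefix: str = "books") -> str:
    """Build a timestamped CSV path like data/books_2025-01-15.csv,
    creating the output directory first if it does not exist."""
    os.makedirs(directory, exist_ok=True)  # no-op when the directory exists
    date_str = datetime.now().strftime('%Y-%m-%d')
    return os.path.join(directory, f"{prefix}_{date_str}.csv")
```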
Error Handling
  • Catch network connection errors
  • Handle HTTP 4xx/5xx responses
  • Deal with missing HTML elements
  • Log errors with context
  • Continue on individual failures
  • Report summary of errors at end
Statistics
  • Total books scraped count
  • Categories found count
  • Pages processed count
  • Errors encountered count
  • Time elapsed
  • Average price and rating
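The statistics above can be computed in a single pass over the scraped records. A sketch assuming each book is a dict with `price`, `rating`, and `category` keys (adapt to the Book dataclass as needed):

```python
def summarize(books: list) -> dict:
    """Compute summary statistics from a list of scraped book dicts."""
    stats = {
        'total_books': len(books),
        'categories': len({b['category'] for b in books}),
        'avg_price': 0.0,
        'avg_rating': 0.0,
    }
    if books:  # avoid division by zero on an empty run
        stats['avg_price'] = round(sum(b['price'] for b in books) / len(books), 2)
        stats['avg_rating'] = round(sum(b['rating'] for b in books) / len(books), 2)
    return stats
```

Track pages processed, error counts, and elapsed time (e.g. via `time.monotonic()`) in the main loop alongside this summary.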
Sample Output: Scraping Complete
$ python scraper.py --verbose

==============================================
     BOOKSCRAPER - Web Scraping Tool
==============================================

[INFO] Initializing scraper...
[INFO] Target: https://books.toscrape.com
[INFO] Rate limit: 1.0 seconds between requests

[PAGE 1/50] Fetching catalogue/page-1.html
  ✓ Found 20 books
  → A Light in the Attic (£51.77, 3 stars)
  → Tipping the Velvet (£53.74, 1 star)
  → Soumission (£50.10, 1 star)
  ...

[PAGE 2/50] Fetching catalogue/page-2.html
  ✓ Found 20 books
  ...

==============================================
           SCRAPING COMPLETE!
==============================================

Summary:
  Total Books: 1000
  Categories: 50
  Pages Scraped: 50
  Errors: 0
  Time: 52.3 seconds

Output Files:
  → data/books_2025-01-15.csv (1000 rows)
  → data/categories.csv (50 rows)

[SUCCESS] All data exported!
06

Web Scraping Ethics

Web scraping comes with ethical and legal responsibilities. Always follow these guidelines when building scrapers.

Do's - Best Practices
  • Check robots.txt - Respect website's crawling rules
  • Rate limit requests - Add delays between requests (1+ seconds)
  • Identify yourself - Set a descriptive User-Agent header
  • Cache responses - Don't re-scrape unchanged pages
  • Handle errors gracefully - Don't hammer failing servers
  • Use practice sites - Like books.toscrape.com for learning
Don'ts - Avoid These
  • Don't ignore robots.txt - it spells out which paths the site allows crawlers to access
  • Don't scrape too fast - Can overload servers, get blocked
  • Don't scrape personal data - Respect privacy laws (GDPR)
  • Don't bypass authentication - Only scrape public data
  • Don't violate ToS - Read website terms of service
  • Don't redistribute data - Check copyright restrictions
Legal Note: Web scraping laws vary by country. For this project, we use books.toscrape.com - a website specifically designed for scraping practice. Always check a website's Terms of Service and robots.txt before scraping.
Checking robots.txt
# Always check robots.txt before scraping
import urllib.robotparser
from urllib.parse import urljoin

def can_scrape(url: str, user_agent: str = '*') -> bool:
    """Check if scraping is allowed by robots.txt."""
    rp = urllib.robotparser.RobotFileParser()
    # robots.txt always lives at the site root, not under the current path
    rp.set_url(urljoin(url, '/robots.txt'))
    rp.read()
    return rp.can_fetch(user_agent, url)

# Example usage
if can_scrape('https://books.toscrape.com/catalogue/'):
    print("Scraping allowed!")
else:
    print("Scraping not allowed by robots.txt")
07

Submission Requirements

Create a public GitHub repository with the exact name shown below:

Required Repository Name
python-web-scraper
github.com/<your-username>/python-web-scraper
Required Project Structure
python-web-scraper/
├── scraper.py           # Main entry point (run this)
├── fetcher.py           # HTTP request handling
├── parser.py            # BeautifulSoup parsing
├── exporter.py          # CSV export functionality
├── config.py            # Configuration settings
├── utils.py             # Helper functions
├── data/
│   └── bestsellers with categories.csv # Kaggle dataset
├── tests/
│   ├── test_fetcher.py  # Unit tests for Fetcher
│   ├── test_parser.py   # Unit tests for Parser
│   └── test_exporter.py # Unit tests for Exporter
├── screenshots/
│   ├── scraping.png     # Screenshot of scraper running
│   ├── output.png       # Screenshot of CSV output
│   └── stats.png        # Screenshot of statistics
├── requirements.txt     # Dependencies
└── README.md            # Project documentation
README.md Required Sections
1. Project Header
  • Project title and badges
  • Brief description
  • Your name and submission date
2. Features
  • List all implemented features
  • Highlight bonus features
  • Libraries used
3. Installation
  • Clone command
  • Python version (3.8+)
  • pip install requirements
4. Usage
  • How to run the scraper
  • CLI arguments explained
  • Example commands
5. Output Format
  • CSV column descriptions
  • Sample output rows
  • Output file locations
6. Project Structure
  • Explain each module
  • Class diagrams (optional)
7. Testing
  • How to run tests
  • Test coverage info
8. Ethical Considerations
  • robots.txt compliance
  • Rate limiting explanation
Do Include
  • All Python modules with docstrings
  • Sample scraped data CSV files
  • Unit tests for core modules
  • Screenshots of scraper output
  • requirements.txt with dependencies
  • Clear README with examples
Do Not Include
  • __pycache__ folders
  • .pyc compiled files
  • Virtual environment folder
  • Large data files (>10MB)
  • Cached HTML pages
  • API keys or credentials
Important: Your requirements.txt must include: requests, beautifulsoup4, and lxml (optional parser).
Submit Your Project

Enter your GitHub username - we will verify your repository automatically

08

Grading Rubric

Your project will be graded on the following criteria. Total: 450 points.

Criteria         Points   Description
HTTP Requests      60     Proper request handling, headers, timeouts, retries
HTML Parsing       80     BeautifulSoup usage, CSS selectors, data extraction
Pagination         50     Handle multiple pages, next page detection
CSV Export         60     Proper CSV formatting, UTF-8 encoding, headers
Error Handling     70     Graceful failures, logging, recovery
Code Quality       50     Modular design, docstrings, type hints
Testing            40     Unit tests for core modules
Documentation      40     README, comments, usage examples
Total             450
Grading Levels
Excellent      405-450   90%+
Good           360-404   80-89%
Satisfactory   315-359   70-79%
Needs Work     <315      <70%
Bonus Points (up to 50 extra)
+15 Points

Add SQLite database storage option in addition to CSV

+20 Points

Implement async scraping with aiohttp for faster performance

+15 Points

Add data visualization of scraped results with matplotlib

Ready to Submit?

Make sure you have completed all requirements and reviewed the grading rubric above.
