Web Archiving API — Screenshot-Based Page Preservation 2026

Build a web archiving system that preserves timestamped visual snapshots of web pages for compliance, evidence, and change detection. REST API, no browser required.


What is Screenshot-Based Web Archiving?

Web archiving is the practice of capturing and preserving the state of web pages at a specific point in time. Traditional web archiving tools like Wget and HTTrack capture the raw HTML, CSS, and asset files — but this approach fails for modern JavaScript-rendered sites where the page content is assembled in the browser, not in the HTML file. A social media post, a dynamic product listing, or an interactive dashboard cannot be accurately preserved as raw HTML because the content only exists in its complete form after JavaScript executes. Screenshot-based web archiving captures the visual state of the page as a pixel-accurate PNG or JPEG image — preserving exactly what the page looked like at the moment of capture, regardless of how the content was rendered. For compliance, legal evidence, and visual change detection, the screenshot is the only reliable preservation format for modern web pages.

Building a Web Archiving System with SnapAPI

import hashlib
import os
from datetime import datetime, timezone

import boto3
import requests

class WebArchiver:
    def __init__(self, api_key: str, s3_bucket: str):
        self.api_key = api_key
        self.s3 = boto3.client('s3')
        self.bucket = s3_bucket

    def capture(self, url: str, metadata: dict = None) -> dict:
        # Explicit UTC timestamp; datetime.utcnow() is deprecated since Python 3.12
        timestamp = datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%S.%fZ')
        # Short, stable prefix per URL; MD5 is used for naming only, not integrity
        url_hash = hashlib.md5(url.encode()).hexdigest()[:8]
        key = f"archives/{url_hash}/{timestamp}.png"

        resp = requests.get(
            'https://snapapi.pics/screenshot',
            params={
                'access_key': self.api_key,
                'url': url,
                'full_page': '1',
                'format': 'png',
                'viewport_width': '1280',
                'viewport_height': '800',
            },
            timeout=30
        )
        resp.raise_for_status()

        self.s3.put_object(
            Bucket=self.bucket,
            Key=key,
            Body=resp.content,
            ContentType='image/png',
            Metadata={
                'url': url,
                'archived_at': timestamp,
                **(metadata or {})
            }
        )
        return {
            'url': url,
            'archived_at': timestamp,
            's3_key': key,
            'size_bytes': len(resp.content)
        }

archiver = WebArchiver(os.environ['SNAPAPI_KEY'], 'my-archive-bucket')
record = archiver.capture('https://example.com/product/123')
print(f"Archived: {record['s3_key']} ({record['size_bytes']} bytes)")

Scheduled Archiving with APScheduler

import logging
import os

from apscheduler.schedulers.blocking import BlockingScheduler
from apscheduler.triggers.interval import IntervalTrigger

# WebArchiver is the class defined in the previous section

logging.basicConfig(level=logging.INFO)
scheduler = BlockingScheduler()
archiver = WebArchiver(os.environ['SNAPAPI_KEY'], 'archive-bucket')

MONITORED_URLS = [
    'https://competitor.com/pricing',
    'https://example.com/dashboard',
    'https://news-site.com/breaking',
]

@scheduler.scheduled_job(IntervalTrigger(hours=6))
def archive_all():
    for url in MONITORED_URLS:
        try:
            record = archiver.capture(url)
            logging.info(f"Archived {url} -> {record['s3_key']}")
        except Exception as e:
            logging.error(f"Failed to archive {url}: {e}")

scheduler.start()

Visual Change Detection Between Archives

Storing multiple timestamped screenshots of the same URL enables visual change detection: compare the latest screenshot against the previous one to detect visual regressions, competitor price changes, or content modifications. Use the Pillow library for pixel-based comparison and generate a diff image that highlights changed regions for human review. Set a percentage threshold — typically 2 to 5 percent of changed pixels — to distinguish meaningful changes from normal anti-aliasing variation and dynamic content like timestamps. Send an alert notification when the change threshold is exceeded, including both the before and after screenshots and the diff image as email attachments or Slack message uploads.
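The thresholding step described above can be sketched without any imaging dependency. This is a minimal illustration, assuming both screenshots have already been decoded into equal-length sequences of RGB tuples (Pillow's `Image.getdata()` is one way to obtain them). The names `changed_percent` and `should_alert` and the per-channel `tolerance` are hypothetical, not part of SnapAPI or Pillow.

```python
from typing import Sequence, Tuple

Pixel = Tuple[int, int, int]


def changed_percent(previous: Sequence[Pixel], current: Sequence[Pixel],
                    tolerance: int = 16) -> float:
    """Percentage of pixels whose largest channel difference exceeds `tolerance`.

    `tolerance` is an assumed knob for ignoring anti-aliasing noise.
    """
    if len(previous) != len(current):
        raise ValueError("screenshots must have the same number of pixels")
    if not previous:
        return 0.0
    changed = sum(
        1
        for a, b in zip(previous, current)
        if max(abs(x - y) for x, y in zip(a, b)) > tolerance
    )
    return 100.0 * changed / len(previous)


def should_alert(percent: float, threshold_percent: float = 2.0) -> bool:
    """True when the changed share exceeds the configured threshold."""
    return percent > threshold_percent


# Synthetic 100-pixel "screenshots": 3 pixels change strongly, 2 only slightly.
white, black, near_white = (255, 255, 255), (0, 0, 0), (250, 250, 250)
before = [white] * 100
after = [black] * 3 + [near_white] * 2 + [white] * 95

pct = changed_percent(before, after)
print(pct, should_alert(pct))
```

The two slightly changed pixels fall under the tolerance and are not counted, which is how small rendering differences can be kept from triggering alerts.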

Compliance and Legal Evidence Archiving

Web archive screenshots can serve as evidence in legal proceedings involving trademark infringement, false advertising claims, contractual disputes, and regulatory compliance documentation. To support admissibility, the archive must include a reliable timestamp, be stored immutably (S3 with Object Lock or another write-once storage system), and have a verifiable chain of custody showing that the screenshot was not modified after capture. Store the capture timestamp in the image metadata (EXIF for JPEG, text chunks for PNG) and in a separate integrity log that records a cryptographic hash of the final stored screenshot bytes alongside the URL, timestamp, and capture parameters. S3 Object Lock in compliance mode prevents deletion or overwriting of the locked object version for the specified retention period, which helps satisfy legal hold requirements. For high-stakes archiving, use a third-party notary service that timestamps the capture and provides a notarized certificate alongside the screenshot file.
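The integrity log mentioned above could look something like the following sketch. It assumes SHA-256 as the cryptographic hash and a JSON-lines log format; both are illustrative choices, and the function names (`integrity_record`, `to_log_line`, `verify`) are hypothetical helpers rather than anything SnapAPI provides.

```python
import hashlib
import json


def integrity_record(image_bytes: bytes, url: str, archived_at: str,
                     params: dict) -> dict:
    """Build one log entry describing a capture and the hash of its bytes."""
    return {
        "url": url,
        "archived_at": archived_at,
        "params": params,
        "sha256": hashlib.sha256(image_bytes).hexdigest(),
        "size_bytes": len(image_bytes),
    }


def to_log_line(record: dict) -> str:
    """Serialize deterministically so the same record always yields the same line."""
    return json.dumps(record, sort_keys=True)


def verify(image_bytes: bytes, record: dict) -> bool:
    """Check that stored bytes still match the hash recorded at capture time."""
    return hashlib.sha256(image_bytes).hexdigest() == record["sha256"]


# Placeholder bytes stand in for a real screenshot in this example.
fake_png = b"\x89PNG\r\n\x1a\n-placeholder-bytes"
record = integrity_record(
    fake_png,
    "https://example.com/product/123",
    "2026-01-01T00:00:00Z",
    {"full_page": "1", "format": "png"},
)
line = to_log_line(record)

print(verify(fake_png, json.loads(line)))
print(verify(fake_png + b"x", json.loads(line)))
```

Keeping the log in a separate, append-only location from the images means a modified screenshot cannot be made to verify simply by editing the file next to it.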

Common Web Archiving Use Cases

Competitive intelligence teams archive competitor pricing pages, feature announcements, and product pages on a daily or weekly schedule to track changes over time and build a timeline of competitor strategy evolution. Legal and compliance teams archive web evidence for ongoing disputes — archiving at the time of the alleged infringement with a verifiable timestamp. Marketing teams archive their own campaign landing pages and email click-through destinations to document the appearance at campaign launch for post-campaign analysis. Journalists archive web sources cited in published articles as permanent records of the page state at the time of citation. GDPR and data retention compliance teams archive consent notices, privacy policies, and terms of service pages before and after each update to maintain a complete version history. All of these use cases require screenshot-based archiving rather than HTML-based archiving because the visual state of the page — not the raw HTML — is what matters for documentation and evidence purposes.

Web Archive Storage and Retrieval Architecture

A production web archiving system needs a storage architecture that supports efficient retrieval by URL, timestamp, and date range. S3 is the natural backend for screenshot archives at any scale — it provides durable object storage with no capacity limits, per-object metadata for URL and timestamp tagging, and lifecycle policies for automatic deletion of archives older than the retention period. Organize archive keys by URL hash and timestamp: archives/{url_hash}/{year}/{month}/{day}/{timestamp}.png. This structure enables listing all archives for a URL with an S3 ListObjectsV2 prefix query, retrieving archives within a date range by filtering on the timestamp portion of the key, and deleting all archives for a URL by deleting the prefix. Store a parallel metadata record in a relational database (PostgreSQL) or DynamoDB with columns for url, url_hash, archived_at, s3_key, size_bytes, and status — this enables fast queries by URL, date range, and status without listing S3 objects on every read request. The database records are the index; S3 objects are the actual archive files.
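The key layout and the parallel index can be sketched as follows. This is an illustration under stated assumptions: SQLite stands in for PostgreSQL or DynamoDB so the example runs without a server, the URL hash reuses the 8-character MD5 prefix from the capture code earlier, and `archive_key`, `open_index`, `record_capture`, and `captures_between` are hypothetical helper names.

```python
import hashlib
import sqlite3
from datetime import datetime, timezone


def _iso(ts: datetime) -> str:
    """Fixed-width UTC timestamp so string comparison matches time order."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def archive_key(url: str, captured_at: datetime) -> str:
    """archives/{url_hash}/{year}/{month}/{day}/{timestamp}.png"""
    url_hash = hashlib.md5(url.encode()).hexdigest()[:8]
    ts = captured_at.astimezone(timezone.utc)
    return f"archives/{url_hash}/{ts:%Y/%m/%d}/{_iso(ts)}.png"


def open_index(path: str = ":memory:") -> sqlite3.Connection:
    """Create the metadata table with the columns named in the text."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS archives (
               url TEXT NOT NULL, url_hash TEXT NOT NULL,
               archived_at TEXT NOT NULL, s3_key TEXT PRIMARY KEY,
               size_bytes INTEGER NOT NULL, status TEXT NOT NULL)"""
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_url_time ON archives (url, archived_at)"
    )
    return conn


def record_capture(conn, url, captured_at, size_bytes, status="ok") -> str:
    """Insert one index row and return the S3 key the image should use."""
    key = archive_key(url, captured_at)
    conn.execute(
        "INSERT INTO archives VALUES (?, ?, ?, ?, ?, ?)",
        (url, key.split("/")[1], _iso(captured_at), key, size_bytes, status),
    )
    return key


def captures_between(conn, url, start, end) -> list:
    """Keys for one URL with start <= archived_at < end, oldest first."""
    rows = conn.execute(
        "SELECT s3_key FROM archives "
        "WHERE url = ? AND archived_at >= ? AND archived_at < ? "
        "ORDER BY archived_at",
        (url, _iso(start), _iso(end)),
    )
    return [row[0] for row in rows]


conn = open_index()
url = "https://example.com/product/123"
for day in (1, 2, 3):
    record_capture(conn, url, datetime(2026, 1, day, 12, 0, tzinfo=timezone.utc), 1024)

keys = captures_between(
    conn, url,
    datetime(2026, 1, 2, tzinfo=timezone.utc),
    datetime(2026, 1, 4, tzinfo=timezone.utc),
)
print(keys)
```

Note that this key layout adds year/month/day segments that the simpler `archives/{url_hash}/{timestamp}.png` key in the earlier capture example does not have; whichever layout is chosen should be used consistently.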

Change Detection Pipeline

A change detection pipeline compares each new archive screenshot against the previous one for the same URL and stores the diff result. The pipeline runs after every capture: load the new and previous screenshots, compute a pixel difference using Pillow's ImageChops.difference function, calculate the percentage of changed pixels, compare it against the configured threshold, and, if the threshold is exceeded, store a diff record in the database and trigger notifications. The threshold should account for expected variation: a news homepage will have frequent small changes from banner ad rotation and article additions, requiring a 5 to 10 percent threshold to avoid alert fatigue, while a legal terms-of-service page should alert on any change above 0.1 percent. Store the diff image alongside the archive screenshot so reviewers can see exactly what changed. Color the changed pixels red against a faded gray background for easy visual identification of the modified region.
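The comparison and diff-image steps might be implemented roughly as below. This sketch assumes Pillow is installed and that both screenshots have identical dimensions (full-page captures of a page whose height changed would need padding or cropping first, which is not shown). `visual_diff` and its `tolerance` parameter are hypothetical names.

```python
from PIL import Image, ImageChops


def visual_diff(previous, current, tolerance=16):
    """Return (percent_changed, diff_image) for two same-sized screenshots."""
    prev_rgb = previous.convert("RGB")
    curr_rgb = current.convert("RGB")
    if prev_rgb.size != curr_rgb.size:
        raise ValueError("screenshots must have identical dimensions")

    # Per-channel absolute difference, reduced to the largest channel delta.
    r, g, b = ImageChops.difference(prev_rgb, curr_rgb).split()
    delta = ImageChops.lighter(ImageChops.lighter(r, g), b)

    # Binary mask: 255 where a pixel changed by more than `tolerance`.
    mask = delta.point(lambda v: 255 if v > tolerance else 0)
    changed = mask.histogram()[255]
    percent = 100.0 * changed / (mask.width * mask.height)

    # Faded grayscale copy of the new capture with changed pixels in red.
    faded = curr_rgb.convert("L").point(lambda v: 128 + v // 2).convert("RGB")
    red = Image.new("RGB", curr_rgb.size, (255, 0, 0))
    return percent, Image.composite(red, faded, mask)


# Synthetic example: blacken a 5x2 region of a 10x10 white image.
before = Image.new("RGB", (10, 10), (255, 255, 255))
after = before.copy()
after.paste((0, 0, 0), (0, 0, 5, 2))

percent, diff = visual_diff(before, after)
print(f"{percent:.1f}% changed")
```

In a pipeline, `percent` would be compared against the per-URL threshold and `diff.save(...)` would write the review image next to the archived screenshot.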

Regulatory and Industry Use Cases

Web archiving has specific regulatory requirements in several industries. Financial services firms archiving investment-related web content for SEC Rule 17a-4 compliance need immutable storage (S3 Object Lock) with a minimum three-year retention period and a verifiable audit trail showing that the archive was not modified after capture. Healthcare organizations archiving HIPAA-covered web content for patient communication compliance need encryption at rest and in transit with key management that satisfies HIPAA requirements. Legal firms archiving web evidence for litigation need a notarized timestamp or trusted third-party certificate alongside each screenshot. E-commerce companies archiving competitor pricing pages for competitive intelligence do not have regulatory requirements but benefit from a consistent archiving frequency and a queryable history for trend analysis. SnapAPI provides the screenshot capture layer; your storage and compliance infrastructure (S3 Object Lock, KMS encryption, audit logging) provides the regulatory compliance guarantees.
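On the storage side, the Object Lock and KMS settings could be passed to boto3's `put_object` along these lines. This is a sketch, not a compliance recipe: the helper `locked_put_kwargs` is hypothetical, the KMS key alias is a placeholder, three years is approximated as 1,095 days, and the target bucket is assumed to have been created with Object Lock enabled. The example only builds the arguments so it runs without AWS credentials.

```python
from datetime import datetime, timedelta, timezone


def locked_put_kwargs(bucket: str, key: str, body: bytes, kms_key_id: str,
                      retention_days: int, now: datetime = None) -> dict:
    """Build put_object arguments for an encrypted, retention-locked upload."""
    now = now or datetime.now(timezone.utc)
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "ContentType": "image/png",
        # Encrypt at rest with a customer-managed KMS key.
        "ServerSideEncryption": "aws:kms",
        "SSEKMSKeyId": kms_key_id,
        # Compliance mode: the object version cannot be deleted or
        # overwritten until the retain-until date has passed.
        "ObjectLockMode": "COMPLIANCE",
        "ObjectLockRetainUntilDate": now + timedelta(days=retention_days),
    }


kwargs = locked_put_kwargs(
    bucket="my-archive-bucket",
    key="archives/abcd1234/2026-01-01T00:00:00Z.png",
    body=b"placeholder-image-bytes",
    kms_key_id="alias/archive-key",  # placeholder alias
    retention_days=3 * 365,
    now=datetime(2026, 1, 1, tzinfo=timezone.utc),
)
print(kwargs["ObjectLockMode"], kwargs["ObjectLockRetainUntilDate"].date())

# With credentials configured, the upload itself would be:
#   boto3.client("s3").put_object(**kwargs)
```

Depending on the SDK version, uploads to Object Lock buckets may also need a `ContentMD5` value or a checksum algorithm to be supplied; check the boto3 documentation for the version in use.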

Archiving at Scale: Volume and Cost Considerations

Web archiving at scale requires deliberate volume management to keep API costs predictable. Calculate your monthly request volume before choosing a plan: a monitoring system checking 100 URLs every 6 hours generates 100 x 4 x 30 = 12,000 screenshots per month, which is more than double the 5,000 requests included in the $19/month Starter tier. To stay within that tier you would need to reduce capture frequency (for example, by capturing rarely changing pages less often), monitor fewer URLs, or move to a larger plan. For large-scale archiving (thousands of URLs at high frequency), implement a tiered capture strategy: capture critical pages hourly, important pages daily, and low-priority pages weekly. Store the last capture timestamp and change rate for each URL, and adjust capture frequency dynamically based on the observed change rate. URLs that rarely change (legal pages, about pages, static documentation) can be captured monthly without losing meaningful archiving value, freeing quota for high-change pages that need frequent capture. This adaptive frequency approach can reduce API usage by 60 to 80 percent compared to uniform capture intervals, allowing larger URL portfolios within a fixed monthly budget.
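The volume arithmetic and the adaptive-frequency idea can be sketched as follows. The halve/double rule and its 0.5 and 0.1 cut-offs are illustrative assumptions, not a SnapAPI feature, and `monthly_requests` and `next_interval_hours` are hypothetical helper names.

```python
import math


def monthly_requests(url_count: int, interval_hours: float, days: int = 30) -> int:
    """Captures per month for `url_count` URLs captured every `interval_hours`."""
    return math.ceil(url_count * (24 / interval_hours) * days)


def next_interval_hours(current_hours: float, change_rate: float,
                        min_hours: float = 1.0, max_hours: float = 720.0) -> float:
    """Adjust a URL's capture interval from its observed change rate.

    `change_rate` is the share of recent captures that differed from the
    previous one (0.0 to 1.0). Frequently changing pages are captured
    more often, stable pages less often, within the given bounds.
    """
    if change_rate >= 0.5:
        proposed = current_hours / 2
    elif change_rate <= 0.1:
        proposed = current_hours * 2
    else:
        proposed = current_hours
    return max(min_hours, min(max_hours, proposed))


# Uniform schedule from the text: 100 URLs every 6 hours.
uniform = monthly_requests(100, 6)
print(uniform)  # 12000

# Tiered schedule: (url_count, interval_hours) per tier; counts are made up.
tiers = {"critical": (10, 1), "important": (40, 24), "low_priority": (50, 168)}
tiered = sum(monthly_requests(n, hours) for n, hours in tiers.values())
print(tiered)

# A stable page drifts toward longer intervals, a busy one toward shorter.
print(next_interval_hours(6, change_rate=0.0))
print(next_interval_hours(6, change_rate=0.8))
```

The actual savings depend entirely on how the URLs are distributed across tiers and how often they change, so the tier sizes above should be replaced with measured values.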