What is Screenshot-Based Web Archiving?
Web archiving is the practice of capturing and preserving the state of web pages at a specific point in time. Traditional web archiving tools like Wget and HTTrack capture the raw HTML, CSS, and asset files, but this approach fails for modern JavaScript-rendered sites where the page content is assembled in the browser rather than delivered in the HTML file. A social media post, a dynamic product listing, or an interactive dashboard cannot be accurately preserved as raw HTML because the content only exists in its complete form after JavaScript executes. Screenshot-based web archiving captures the rendered page as an image, either a lossless, pixel-accurate PNG or a smaller, lossy JPEG, preserving what the page looked like at the moment of capture regardless of how the content was rendered. For compliance, legal evidence, and visual change detection, where the rendered appearance is what matters, a screenshot is often the most dependable preservation format for modern web pages.
Building a Web Archiving System with SnapAPI
import hashlib
import os
from datetime import datetime, timezone

import boto3
import requests


class WebArchiver:
    """Capture screenshots through SnapAPI and store them in S3."""

    def __init__(self, api_key: str, s3_bucket: str):
        self.api_key = api_key
        self.s3 = boto3.client('s3')
        self.bucket = s3_bucket

    def capture(self, url: str, metadata: dict | None = None) -> dict:
        # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since
        # Python 3.12). This fixed-width format sorts lexicographically.
        timestamp = datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%S.%fZ')
        # Short, stable prefix that groups all captures of the same URL.
        # MD5 is used only to derive a key here, not for integrity.
        url_hash = hashlib.md5(url.encode()).hexdigest()[:8]
        key = f"archives/{url_hash}/{timestamp}.png"

        resp = requests.get(
            'https://snapapi.pics/screenshot',
            params={
                'access_key': self.api_key,
                'url': url,
                'full_page': '1',
                'format': 'png',
                'viewport_width': '1280',
                'viewport_height': '800',
            },
            timeout=30
        )
        resp.raise_for_status()

        # S3 user metadata values must be strings, so coerce caller-supplied values.
        extra = {k: str(v) for k, v in (metadata or {}).items()}
        self.s3.put_object(
            Bucket=self.bucket,
            Key=key,
            Body=resp.content,
            ContentType='image/png',
            Metadata={
                'url': url,
                'archived_at': timestamp,
                **extra
            }
        )
        return {
            'url': url,
            'archived_at': timestamp,
            's3_key': key,
            'size_bytes': len(resp.content)
        }


archiver = WebArchiver(os.environ['SNAPAPI_KEY'], 'my-archive-bucket')
record = archiver.capture('https://example.com/product/123')
print(f"Archived: {record['s3_key']} ({record['size_bytes']} bytes)")
Scheduled Archiving with APScheduler
import logging
import os

from apscheduler.schedulers.blocking import BlockingScheduler
from apscheduler.triggers.interval import IntervalTrigger

# WebArchiver is the class defined in the previous section.
logging.basicConfig(level=logging.INFO)

scheduler = BlockingScheduler()
archiver = WebArchiver(os.environ['SNAPAPI_KEY'], 'archive-bucket')

MONITORED_URLS = [
    'https://competitor.com/pricing',
    'https://example.com/dashboard',
    'https://news-site.com/breaking',
]


@scheduler.scheduled_job(IntervalTrigger(hours=6))
def archive_all():
    for url in MONITORED_URLS:
        try:
            record = archiver.capture(url)
            logging.info(f"Archived {url} -> {record['s3_key']}")
        except Exception as e:
            # Log and continue so one failing URL does not stop the batch.
            logging.error(f"Failed to archive {url}: {e}")


# Blocks the current process and runs archive_all every 6 hours.
scheduler.start()
Visual Change Detection Between Archives
Storing multiple timestamped screenshots of the same URL enables visual change detection: compare the latest screenshot against the previous one to detect visual regressions, competitor price changes, or content modifications. Use the Pillow library for pixel-based comparison and generate a diff image that highlights changed regions for human review. Set a percentage threshold — typically 2 to 5 percent of changed pixels — to distinguish meaningful changes from normal anti-aliasing variation and dynamic content like timestamps. Send an alert notification when the change threshold is exceeded, including both the before and after screenshots and the diff image as email attachments or Slack message uploads.
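The comparison described above can be sketched with Pillow as follows. This is a minimal illustration, not a tuned implementation: the function names (changed_fraction, highlight, is_meaningful), the per-pixel tolerance of 16, and the 2 percent default threshold are assumptions chosen for the example. Captures of different heights are padded with white onto a common canvas, which is itself a simplification for full-page screenshots.

```python
from PIL import Image, ImageChops


def changed_fraction(before: Image.Image, after: Image.Image, tolerance: int = 16):
    """Return (fraction of changed pixels, grayscale mask of changed pixels)."""
    before = before.convert("RGB")
    after = after.convert("RGB")
    # Full-page captures can differ in size; pad both onto a common white canvas.
    width = max(before.width, after.width)
    height = max(before.height, after.height)
    canvas_a = Image.new("RGB", (width, height), "white")
    canvas_b = Image.new("RGB", (width, height), "white")
    canvas_a.paste(before, (0, 0))
    canvas_b.paste(after, (0, 0))
    # Absolute per-channel difference, reduced to a single grayscale band.
    diff = ImageChops.difference(canvas_a, canvas_b).convert("L")
    # Ignore small differences (anti-aliasing noise) below the tolerance.
    mask = diff.point(lambda v: 255 if v > tolerance else 0)
    changed = mask.histogram()[255]
    return changed / (width * height), mask


def highlight(after: Image.Image, mask: Image.Image) -> Image.Image:
    """Paint changed pixels red on top of the newer capture for human review."""
    base = Image.new("RGB", mask.size, "white")
    base.paste(after.convert("RGB"), (0, 0))
    red = Image.new("RGB", mask.size, (255, 0, 0))
    return Image.composite(red, base, mask)


def is_meaningful(fraction: float, threshold: float = 0.02) -> bool:
    """Apply the percentage threshold (assumed 2 percent by default)."""
    return fraction >= threshold


if __name__ == "__main__":
    # Synthetic stand-ins for two archived captures of the same URL.
    before = Image.new("RGB", (100, 100), "white")
    after = before.copy()
    after.paste((0, 0, 0), (0, 0, 20, 10))  # blacken a 20x10 region
    fraction, mask = changed_fraction(before, after)
    print(f"{fraction:.2%} of pixels changed, alert={is_meaningful(fraction)}")
    highlight(after, mask).save("diff.png")
```

In a real pipeline the two images would be loaded from the two most recent S3 objects for the URL, and the saved diff image would be attached to the alert alongside the before and after captures.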
Compliance and Legal Evidence Archiving
Web archive screenshots can serve as evidence in legal proceedings for trademark infringement, false advertising claims, contractual disputes, and regulatory compliance documentation, although admissibility ultimately depends on the jurisdiction and on how the capture was made and stored. To support admissibility, the archive should include a reliable timestamp, be stored in an immutable manner (S3 with Object Lock or another write-once storage system), and have a verifiable chain of custody showing that the screenshot was not modified after capture. Store the capture timestamp in the image metadata (EXIF for JPEG, a text or eXIf chunk for PNG) and in a separate integrity log that records a cryptographic hash of the stored screenshot bytes alongside the URL, timestamp, and capture parameters. If metadata is embedded in the image, compute the hash after embedding so that it matches the file actually stored. S3 Object Lock in compliance mode prevents a locked object version from being overwritten or deleted by any user until the retention period expires, and a separate legal hold can keep an object locked indefinitely while a dispute is open. For high-stakes archiving, use a third-party notary service that timestamps the capture and provides a notarized certificate alongside the screenshot file.
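An integrity log entry of the kind described above might look like the following sketch. The helper names (integrity_record, verify, object_lock_kwargs), the record field names, and the choice of SHA-256 are assumptions made for illustration. ObjectLockMode and ObjectLockRetainUntilDate are real parameters of boto3's put_object, but they only take effect on a bucket that was created with Object Lock enabled.

```python
import hashlib
import json
from datetime import datetime, timedelta, timezone


def integrity_record(image_bytes: bytes, url: str, archived_at: str, params: dict) -> dict:
    """Build one integrity log entry for a stored screenshot (hypothetical schema)."""
    return {
        "url": url,
        "archived_at": archived_at,
        "capture_params": params,
        # Hash of the exact bytes written to storage.
        "sha256": hashlib.sha256(image_bytes).hexdigest(),
    }


def verify(image_bytes: bytes, record: dict) -> bool:
    """Check that a retrieved screenshot still matches its logged hash."""
    return hashlib.sha256(image_bytes).hexdigest() == record["sha256"]


def object_lock_kwargs(retain_days: int) -> dict:
    """Extra keyword arguments for s3.put_object to apply compliance-mode retention."""
    return {
        "ObjectLockMode": "COMPLIANCE",
        "ObjectLockRetainUntilDate": datetime.now(timezone.utc) + timedelta(days=retain_days),
    }


if __name__ == "__main__":
    png_bytes = b"\x89PNG placeholder bytes"  # stand-in for resp.content
    record = integrity_record(
        png_bytes,
        "https://example.com/product/123",
        "2024-01-01T00:00:00.000000Z",
        {"full_page": "1", "format": "png"},
    )
    # One JSON object per line keeps the log append-only and easy to audit.
    print(json.dumps(record, sort_keys=True))
    print("intact:", verify(png_bytes, record))
```

The log itself should live somewhere separate from the screenshots, for example in a different locked bucket, so that a single compromised location cannot alter both the image and its recorded hash.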
Common Web Archiving Use Cases
Competitive intelligence teams archive competitor pricing pages, feature announcements, and product pages on a daily or weekly schedule to track changes over time and build a timeline of competitor strategy evolution. Legal and compliance teams archive web evidence for ongoing disputes — archiving at the time of the alleged infringement with a verifiable timestamp. Marketing teams archive their own campaign landing pages and email click-through destinations to document the appearance at campaign launch for post-campaign analysis. Journalists archive web sources cited in published articles as permanent records of the page state at the time of citation. GDPR and data retention compliance teams archive consent notices, privacy policies, and terms of service pages before and after each update to maintain a complete version history. All of these use cases require screenshot-based archiving rather than HTML-based archiving because the visual state of the page — not the raw HTML — is what matters for documentation and evidence purposes.