Building a Web Archiving System with Screenshot API
February 7, 2026 · 5 min read
Web pages change constantly. Prices update, content gets removed, terms of service are revised, and pages go offline entirely. For many organizations, capturing and preserving web content at specific points in time is not just useful -- it is a legal or regulatory requirement. In this guide, we will build a complete web archiving system using SnapAPI.
Why Archive Web Pages?
Legal and Compliance
- Evidence preservation: Capture competitor pricing claims, advertising content, or defamatory material before it is taken down
- Regulatory compliance: Financial services, healthcare, and government organizations often need to archive public-facing content
- Contract documentation: Screenshot terms of service and pricing pages at the time of agreement
Business Intelligence
- Competitor monitoring: Track how competitors update their pricing, features, and messaging over time
- Content change detection: Get alerted when monitored pages change significantly
- Brand monitoring: Archive pages that mention your brand for review
Research and Documentation
- Academic research: Preserve web sources cited in papers
- Journalism: Archive articles and social media posts that might be deleted
- Historical records: Document the evolution of websites over time
Architecture Overview
Our archiving system has three components:
- Scheduler: Triggers captures at configured intervals (cron or GitHub Actions)
- Capture service: Calls SnapAPI to take screenshots and store them
- Archive viewer: A simple web UI to browse captured snapshots
Step 1: Capture and Store to S3
SnapAPI has a built-in storage feature that sends screenshots directly to your S3 bucket. This means the image data never passes through your server -- it goes straight from SnapAPI to S3.
Setup required: Before using S3 storage, configure your S3 credentials in the SnapAPI dashboard under Settings > Storage. You will need your AWS access key, secret key, bucket name, and region.
// archive.js
const SNAPAPI_KEY = process.env.SNAPAPI_KEY;
const API_URL = "https://api.snapapi.pics/v1/screenshot";

async function archivePage(url, metadata = {}) {
  const timestamp = new Date().toISOString().replace(/[:.]/g, "-");
  const slug = url.replace(/https?:\/\//, "").replace(/[^a-zA-Z0-9]/g, "_");
  const filename = `archives/${slug}/${timestamp}.png`;

  const response = await fetch(API_URL, {
    method: "POST",
    headers: {
      "X-Api-Key": SNAPAPI_KEY,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      url,
      format: "png",
      width: 1440,
      height: 900,
      fullPage: true,
      blockCookieBanners: true,
      responseType: "json",
      storage: {
        enabled: true,
        destination: "user_s3",
        path: filename,
      },
    }),
  });

  if (!response.ok) {
    const error = await response.text();
    throw new Error(`Archive failed for ${url}: ${error}`);
  }

  const result = await response.json();

  // Log the archive entry
  const entry = {
    url,
    timestamp: new Date().toISOString(),
    storagePath: filename,
    storageUrl: result.storageUrl,
    ...metadata,
  };

  console.log(`Archived: ${url} -> ${result.storageUrl}`);
  return entry;
}

module.exports = { archivePage };
Step 2: Batch Archiving with Metadata
Define a list of URLs to archive and run them all with full metadata tracking:
// run-archive.js
const fs = require("fs");
const { archivePage } = require("./archive");

const URLS_TO_ARCHIVE = [
  { url: "https://competitor.com/pricing", category: "competitor", tags: ["pricing"] },
  { url: "https://competitor.com/features", category: "competitor", tags: ["features"] },
  { url: "https://yoursite.com/terms", category: "legal", tags: ["tos"] },
  { url: "https://yoursite.com/privacy", category: "legal", tags: ["privacy"] },
  { url: "https://news.ycombinator.com", category: "monitoring", tags: ["tech-news"] },
];

const LOG_FILE = "./archive-log.json";

async function runArchive() {
  // Load existing log
  let log = [];
  if (fs.existsSync(LOG_FILE)) {
    log = JSON.parse(fs.readFileSync(LOG_FILE, "utf-8"));
  }

  console.log(`Archiving ${URLS_TO_ARCHIVE.length} pages...\n`);

  for (const item of URLS_TO_ARCHIVE) {
    try {
      const entry = await archivePage(item.url, {
        category: item.category,
        tags: item.tags,
      });
      log.push(entry);
    } catch (err) {
      console.error(`  Failed: ${item.url} - ${err.message}`);
      log.push({
        url: item.url,
        timestamp: new Date().toISOString(),
        status: "error",
        error: err.message,
      });
    }
  }

  // Save updated log
  fs.writeFileSync(LOG_FILE, JSON.stringify(log, null, 2));
  console.log(`\nArchive log saved to ${LOG_FILE}`);
  console.log(`Total entries: ${log.length}`);
}

runArchive();
Step 3: Schedule with Cron
Run the archiver on a schedule using cron. Add this to your crontab (crontab -e):
# Archive every day at 6 AM UTC
0 6 * * * cd /home/user/web-archiver && SNAPAPI_KEY=your_key node run-archive.js >> /var/log/archiver.log 2>&1
# Archive competitor pricing pages every 6 hours
0 */6 * * * cd /home/user/web-archiver && SNAPAPI_KEY=your_key node run-archive.js --category competitor >> /var/log/archiver.log 2>&1
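Note that the second cron entry passes a `--category` flag that run-archive.js does not parse out of the box. A minimal sketch of that flag handling (the helper name `filterByCategory` is an assumption, not part of SnapAPI) could look like this:

```javascript
// Hypothetical --category support for run-archive.js.
// Filters the URL list by the category passed on the command line;
// with no flag present, every URL is archived as before.
function filterByCategory(items, argv) {
  const i = argv.indexOf("--category");
  if (i === -1 || !argv[i + 1]) return items; // no flag: archive everything
  return items.filter((item) => item.category === argv[i + 1]);
}

// In run-archive.js, replace the loop's source list with:
// const targets = filterByCategory(URLS_TO_ARCHIVE, process.argv);
```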
Step 3 (Alternative): Schedule with GitHub Actions
If you prefer a serverless approach, use GitHub Actions with a cron schedule:
# .github/workflows/archive.yml
name: Web Archive
on:
  schedule:
    # Run daily at 6 AM UTC
    - cron: "0 6 * * *"
  workflow_dispatch: # Allow manual triggers
jobs:
  archive:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - name: Install dependencies
        run: npm ci
      - name: Run archive
        run: node run-archive.js
        env:
          SNAPAPI_KEY: ${{ secrets.SNAPAPI_KEY }}
      - name: Commit archive log
        run: |
          git config user.name "Archive Bot"
          git config user.email "bot@yoursite.com"
          git add archive-log.json
          git diff --staged --quiet || git commit -m "Update archive log $(date -u +%Y-%m-%d)"
          git push
Step 4: Build an Archive Viewer
A simple Express.js application to browse your archived snapshots:
// viewer.js
const express = require("express");
const fs = require("fs");

const app = express();
const LOG_FILE = "./archive-log.json";

app.get("/", (req, res) => {
  const log = JSON.parse(fs.readFileSync(LOG_FILE, "utf-8"));
  const { category, url, date } = req.query;

  let entries = log.filter((e) => e.status !== "error");

  // Apply filters
  if (category) entries = entries.filter((e) => e.category === category);
  if (url) entries = entries.filter((e) => e.url.includes(url));
  if (date) entries = entries.filter((e) => e.timestamp.startsWith(date));

  // Group by URL
  const grouped = {};
  for (const entry of entries) {
    if (!grouped[entry.url]) grouped[entry.url] = [];
    grouped[entry.url].push(entry);
  }

  const html = `
    <!DOCTYPE html>
    <html>
    <head>
      <title>Web Archive Viewer</title>
      <style>
        body { font-family: system-ui; max-width: 1200px; margin: 0 auto; padding: 20px; background: #0a0a0f; color: #e2e8f0; }
        h1 { margin-bottom: 20px; }
        .filters { margin-bottom: 20px; display: flex; gap: 10px; }
        .filters select, .filters input { padding: 8px; background: #1a1a2e; color: #e2e8f0; border: 1px solid #2a2a3e; border-radius: 4px; }
        .url-group { margin-bottom: 30px; border: 1px solid #2a2a3e; border-radius: 8px; padding: 16px; }
        .url-group h3 { color: #00d4ff; margin-bottom: 12px; word-break: break-all; }
        .snapshots { display: flex; gap: 12px; overflow-x: auto; padding: 8px 0; }
        .snapshot { min-width: 200px; text-align: center; }
        .snapshot img { width: 200px; border-radius: 4px; border: 1px solid #2a2a3e; }
        .snapshot p { font-size: 0.8rem; color: #64748b; margin-top: 4px; }
      </style>
    </head>
    <body>
      <h1>Web Archive Viewer</h1>
      <p style="color: #94a3b8;">${entries.length} snapshots across ${Object.keys(grouped).length} URLs</p>
      ${Object.entries(grouped)
        .map(
          ([url, snapshots]) => `
            <div class="url-group">
              <h3>${url}</h3>
              <div class="snapshots">
                ${snapshots
                  .slice(-10)
                  .map(
                    (s) => `
                      <div class="snapshot">
                        <a href="${s.storageUrl}" target="_blank">
                          <img src="${s.storageUrl}" loading="lazy" alt="Archive snapshot">
                        </a>
                        <p>${new Date(s.timestamp).toLocaleString()}</p>
                      </div>
                    `
                  )
                  .join("")}
              </div>
            </div>
          `
        )
        .join("")}
    </body>
    </html>
  `;
  res.send(html);
});

app.listen(3001, () => console.log("Archive viewer: http://localhost:3001"));
Archiving Without S3 (Local Storage)
If you do not need S3 integration, you can store screenshots locally or on any file system:
const fs = require("fs");
const path = require("path");

async function archiveLocal(url, outputDir = "./archives") {
  const timestamp = new Date().toISOString().replace(/[:.]/g, "-");
  const slug = url.replace(/https?:\/\//, "").replace(/[^a-zA-Z0-9]/g, "_");
  const dir = path.join(outputDir, slug);
  fs.mkdirSync(dir, { recursive: true });

  const response = await fetch("https://api.snapapi.pics/v1/screenshot", {
    method: "POST",
    headers: {
      "X-Api-Key": process.env.SNAPAPI_KEY,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      url,
      format: "png",
      fullPage: true,
      blockCookieBanners: true,
      responseType: "image",
    }),
  });

  if (!response.ok) throw new Error(`Failed: ${response.status}`);

  const filepath = path.join(dir, `${timestamp}.png`);
  const buffer = Buffer.from(await response.arrayBuffer());
  fs.writeFileSync(filepath, buffer);

  return { url, timestamp, filepath, size: buffer.length };
}
Python Archiver
Here is the same archiving logic in Python, for teams that prefer it:
import requests
import os
import json
from datetime import datetime
SNAPAPI_KEY = os.environ["SNAPAPI_KEY"]
def archive_page(url: str, output_dir: str = "./archives") -> dict:
timestamp = datetime.utcnow().strftime("%Y-%m-%dT%H-%M-%S")
slug = url.replace("https://", "").replace("http://", "")
slug = "".join(c if c.isalnum() else "_" for c in slug)
page_dir = os.path.join(output_dir, slug)
os.makedirs(page_dir, exist_ok=True)
response = requests.post(
"https://api.snapapi.pics/v1/screenshot",
headers={
"X-Api-Key": SNAPAPI_KEY,
"Content-Type": "application/json",
},
json={
"url": url,
"format": "png",
"fullPage": True,
"blockCookieBanners": True,
"responseType": "image",
},
timeout=60,
)
response.raise_for_status()
filepath = os.path.join(page_dir, f"{timestamp}.png")
with open(filepath, "wb") as f:
f.write(response.content)
entry = {
"url": url,
"timestamp": datetime.utcnow().isoformat(),
"filepath": filepath,
"size_bytes": len(response.content),
}
print(f" Archived: {url} ({len(response.content)} bytes)")
return entry
# Archive multiple URLs
urls = [
"https://competitor.com/pricing",
"https://yoursite.com/terms",
"https://news.ycombinator.com",
]
results = []
for url in urls:
try:
entry = archive_page(url)
results.append(entry)
except Exception as e:
print(f" Failed: {url} - {e}")
# Save log
with open("archive-log.json", "w") as f:
json.dump(results, f, indent=2)
print(f"\nArchived {len(results)}/{len(urls)} pages")
Best Practices
- Use full-page capture. For archival purposes, capture the entire page with fullPage: true so nothing is missed.
- Block cookie banners. Consent popups obscure content. Always use blockCookieBanners: true.
- Store metadata alongside screenshots. Record the URL, timestamp, HTTP status, and any relevant context in a structured log file.
- Use timestamps in filenames. ISO 8601 format sorts correctly and is unambiguous across timezones.
- Set up retention policies. Configure S3 lifecycle rules to move old archives to Glacier or delete them after a retention period.
- Monitor for failures. Set up alerting when captures fail repeatedly. A URL that returns errors might mean the page was removed.
Next Steps
- API Documentation -- full reference for storage parameters
- Visual Regression Testing -- use archiving techniques for testing
- PDF Generation -- archive pages as PDFs instead of images
Try SnapAPI Free
Get 200 free screenshots per month. Start archiving web pages today.
Get Started Free →
Related Reading
- Web Archiving Use Case -- compliance and historical archiving at scale
- PDF Generation API -- archive pages as PDF documents
- Free Screenshot API -- start archiving with 200 free captures per month