Building a Web Archiving System with Screenshot API
February 7, 2026 · 5 min read
Web pages change constantly. Prices update, content gets removed, terms of service are revised, and pages go offline entirely. For many organizations, capturing and preserving web content at specific points in time is not just useful -- it is a legal or regulatory requirement. In this guide, we will build a complete web archiving system using SnapAPI.
Why Archive Web Pages?
Legal and Compliance
- Evidence preservation: Capture competitor pricing claims, advertising content, or defamatory material before it is taken down
- Regulatory compliance: Financial services, healthcare, and government organizations often need to archive public-facing content
- Contract documentation: Screenshot terms of service and pricing pages at the time of agreement
Business Intelligence
- Competitor monitoring: Track how competitors update their pricing, features, and messaging over time
- Content change detection: Get alerted when monitored pages change significantly
- Brand monitoring: Archive pages that mention your brand for review
Research and Documentation
- Academic research: Preserve web sources cited in papers
- Journalism: Archive articles and social media posts that might be deleted
- Historical records: Document the evolution of websites over time
Architecture Overview
Our archiving system has three components:
- Scheduler: Triggers captures at configured intervals (cron or GitHub Actions)
- Capture service: Calls SnapAPI to take screenshots and store them
- Archive viewer: A simple web UI to browse captured snapshots
Step 1: Capture and Store to S3
SnapAPI has a built-in storage feature that sends screenshots directly to your S3 bucket. This means the image data never passes through your server -- it goes straight from SnapAPI to S3.
Setup required: Before using S3 storage, configure your S3 credentials in the SnapAPI dashboard under Settings > Storage. You will need your AWS access key, secret key, bucket name, and region.
// archive.js
const SNAPAPI_KEY = process.env.SNAPAPI_KEY;
const API_URL = "https://api.snapapi.pics/v1/screenshot";

async function archivePage(url, metadata = {}) {
  const timestamp = new Date().toISOString().replace(/[:.]/g, "-");
  const slug = url.replace(/https?:\/\//, "").replace(/[^a-zA-Z0-9]/g, "_");
  const filename = `archives/${slug}/${timestamp}.png`;

  const response = await fetch(API_URL, {
    method: "POST",
    headers: {
      "X-Api-Key": SNAPAPI_KEY,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      url,
      format: "png",
      width: 1440,
      height: 900,
      fullPage: true,
      blockCookieBanners: true,
      responseType: "json",
      storage: {
        enabled: true,
        destination: "user_s3",
        path: filename,
      },
    }),
  });

  if (!response.ok) {
    const error = await response.text();
    throw new Error(`Archive failed for ${url}: ${error}`);
  }

  const result = await response.json();

  // Log the archive entry
  const entry = {
    url,
    timestamp: new Date().toISOString(),
    storagePath: filename,
    storageUrl: result.storageUrl,
    ...metadata,
  };

  console.log(`Archived: ${url} -> ${result.storageUrl}`);
  return entry;
}

module.exports = { archivePage };
Step 2: Batch Archiving with Metadata
Define a list of URLs to archive and run them all with full metadata tracking:
// run-archive.js
const fs = require("fs");
const { archivePage } = require("./archive");

const URLS_TO_ARCHIVE = [
  { url: "https://competitor.com/pricing", category: "competitor", tags: ["pricing"] },
  { url: "https://competitor.com/features", category: "competitor", tags: ["features"] },
  { url: "https://yoursite.com/terms", category: "legal", tags: ["tos"] },
  { url: "https://yoursite.com/privacy", category: "legal", tags: ["privacy"] },
  { url: "https://news.ycombinator.com", category: "monitoring", tags: ["tech-news"] },
];

const LOG_FILE = "./archive-log.json";

async function runArchive() {
  // Load existing log
  let log = [];
  if (fs.existsSync(LOG_FILE)) {
    log = JSON.parse(fs.readFileSync(LOG_FILE, "utf-8"));
  }

  console.log(`Archiving ${URLS_TO_ARCHIVE.length} pages...\n`);

  for (const item of URLS_TO_ARCHIVE) {
    try {
      const entry = await archivePage(item.url, {
        category: item.category,
        tags: item.tags,
      });
      log.push(entry);
    } catch (err) {
      console.error(`  Failed: ${item.url} - ${err.message}`);
      log.push({
        url: item.url,
        timestamp: new Date().toISOString(),
        status: "error",
        error: err.message,
      });
    }
  }

  // Save updated log
  fs.writeFileSync(LOG_FILE, JSON.stringify(log, null, 2));
  console.log(`\nArchive log saved to ${LOG_FILE}`);
  console.log(`Total entries: ${log.length}`);
}

runArchive();
Step 3: Schedule with Cron
Run the archiver on a schedule using cron. Add this to your crontab (crontab -e):
# Archive every day at 6 AM UTC
0 6 * * * cd /home/user/web-archiver && SNAPAPI_KEY=your_key node run-archive.js >> /var/log/archiver.log 2>&1
# Archive competitor pricing pages every 6 hours
0 */6 * * * cd /home/user/web-archiver && SNAPAPI_KEY=your_key node run-archive.js --category competitor >> /var/log/archiver.log 2>&1
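Note that the second cron entry passes a `--category` flag that run-archive.js does not parse out of the box. A minimal sketch of that flag handling (the helper name `filterByCategory` is an assumption, not part of SnapAPI) could look like this:

```javascript
// Hypothetical --category support for run-archive.js.
// Filters the URL list by the category passed on the command line;
// with no flag present, every URL is archived as before.
function filterByCategory(items, argv) {
  const i = argv.indexOf("--category");
  if (i === -1 || !argv[i + 1]) return items; // no flag: archive everything
  return items.filter((item) => item.category === argv[i + 1]);
}

// In run-archive.js, replace the loop's source list with:
// const targets = filterByCategory(URLS_TO_ARCHIVE, process.argv);
```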
Step 3 (Alternative): Schedule with GitHub Actions
If you prefer a serverless approach, use GitHub Actions with a cron schedule:
# .github/workflows/archive.yml
name: Web Archive
on:
  schedule:
    # Run daily at 6 AM UTC
    - cron: "0 6 * * *"
  workflow_dispatch: # Allow manual triggers
jobs:
  archive:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - name: Install dependencies
        run: npm ci
      - name: Run archive
        run: node run-archive.js
        env:
          SNAPAPI_KEY: ${{ secrets.SNAPAPI_KEY }}
      - name: Commit archive log
        run: |
          git config user.name "Archive Bot"
          git config user.email "bot@yoursite.com"
          git add archive-log.json
          git diff --staged --quiet || git commit -m "Update archive log $(date -u +%Y-%m-%d)"
          git push
Step 4: Build an Archive Viewer
A simple Express.js application to browse your archived snapshots:
// viewer.js
const express = require("express");
const fs = require("fs");

const app = express();
const LOG_FILE = "./archive-log.json";

app.get("/", (req, res) => {
  const log = JSON.parse(fs.readFileSync(LOG_FILE, "utf-8"));
  const { category, url, date } = req.query;

  let entries = log.filter((e) => e.status !== "error");

  // Apply filters
  if (category) entries = entries.filter((e) => e.category === category);
  if (url) entries = entries.filter((e) => e.url.includes(url));
  if (date) entries = entries.filter((e) => e.timestamp.startsWith(date));

  // Group by URL
  const grouped = {};
  for (const entry of entries) {
    if (!grouped[entry.url]) grouped[entry.url] = [];
    grouped[entry.url].push(entry);
  }

  const html = `
    <!DOCTYPE html>
    <html>
    <head>
      <title>Web Archive Viewer</title>
      <style>
        body { font-family: system-ui; max-width: 1200px; margin: 0 auto; padding: 20px; background: #0a0a0f; color: #e2e8f0; }
        h1 { margin-bottom: 20px; }
        .filters { margin-bottom: 20px; display: flex; gap: 10px; }
        .filters select, .filters input { padding: 8px; background: #1a1a2e; color: #e2e8f0; border: 1px solid #2a2a3e; border-radius: 4px; }
        .url-group { margin-bottom: 30px; border: 1px solid #2a2a3e; border-radius: 8px; padding: 16px; }
        .url-group h3 { color: #00d4ff; margin-bottom: 12px; word-break: break-all; }
        .snapshots { display: flex; gap: 12px; overflow-x: auto; padding: 8px 0; }
        .snapshot { min-width: 200px; text-align: center; }
        .snapshot img { width: 200px; border-radius: 4px; border: 1px solid #2a2a3e; }
        .snapshot p { font-size: 0.8rem; color: #64748b; margin-top: 4px; }
      </style>
    </head>
    <body>
      <h1>Web Archive Viewer</h1>
      <p style="color: #94a3b8;">${entries.length} snapshots across ${Object.keys(grouped).length} URLs</p>
      ${Object.entries(grouped)
        .map(
          ([url, snapshots]) => `
            <div class="url-group">
              <h3>${url}</h3>
              <div class="snapshots">
                ${snapshots
                  .slice(-10)
                  .map(
                    (s) => `
                      <div class="snapshot">
                        <a href="${s.storageUrl}" target="_blank">
                          <img src="${s.storageUrl}" loading="lazy" alt="Archive snapshot">
                        </a>
                        <p>${new Date(s.timestamp).toLocaleString()}</p>
                      </div>
                    `
                  )
                  .join("")}
              </div>
            </div>
          `
        )
        .join("")}
    </body>
    </html>
  `;
  res.send(html);
});

app.listen(3001, () => console.log("Archive viewer: http://localhost:3001"));
Archiving Without S3 (Local Storage)
If you do not need S3 integration, you can store screenshots locally or on any file system:
const fs = require("fs");
const path = require("path");

async function archiveLocal(url, outputDir = "./archives") {
  const timestamp = new Date().toISOString().replace(/[:.]/g, "-");
  const slug = url.replace(/https?:\/\//, "").replace(/[^a-zA-Z0-9]/g, "_");
  const dir = path.join(outputDir, slug);
  fs.mkdirSync(dir, { recursive: true });

  const response = await fetch("https://api.snapapi.pics/v1/screenshot", {
    method: "POST",
    headers: {
      "X-Api-Key": process.env.SNAPAPI_KEY,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      url,
      format: "png",
      fullPage: true,
      blockCookieBanners: true,
      responseType: "image",
    }),
  });

  if (!response.ok) throw new Error(`Failed: ${response.status}`);

  const filepath = path.join(dir, `${timestamp}.png`);
  const buffer = Buffer.from(await response.arrayBuffer());
  fs.writeFileSync(filepath, buffer);

  return { url, timestamp, filepath, size: buffer.length };
}
Python Archiver
Here is the same archiving logic in Python, for teams that prefer it:
import requests
import os
import json
from datetime import datetime
SNAPAPI_KEY = os.environ["SNAPAPI_KEY"]
def archive_page(url: str, output_dir: str = "./archives") -> dict:
timestamp = datetime.utcnow().strftime("%Y-%m-%dT%H-%M-%S")
slug = url.replace("https://", "").replace("http://", "")
slug = "".join(c if c.isalnum() else "_" for c in slug)
page_dir = os.path.join(output_dir, slug)
os.makedirs(page_dir, exist_ok=True)
response = requests.post(
"https://api.snapapi.pics/v1/screenshot",
headers={
"X-Api-Key": SNAPAPI_KEY,
"Content-Type": "application/json",
},
json={
"url": url,
"format": "png",
"fullPage": True,
"blockCookieBanners": True,
"responseType": "image",
},
timeout=60,
)
response.raise_for_status()
filepath = os.path.join(page_dir, f"{timestamp}.png")
with open(filepath, "wb") as f:
f.write(response.content)
entry = {
"url": url,
"timestamp": datetime.utcnow().isoformat(),
"filepath": filepath,
"size_bytes": len(response.content),
}
print(f" Archived: {url} ({len(response.content)} bytes)")
return entry
# Archive multiple URLs
urls = [
"https://competitor.com/pricing",
"https://yoursite.com/terms",
"https://news.ycombinator.com",
]
results = []
for url in urls:
try:
entry = archive_page(url)
results.append(entry)
except Exception as e:
print(f" Failed: {url} - {e}")
# Save log
with open("archive-log.json", "w") as f:
json.dump(results, f, indent=2)
print(f"\nArchived {len(results)}/{len(urls)} pages")
Best Practices
- Use full-page capture. For archival purposes, capture the entire page with fullPage: true so nothing is missed.
- Block cookie banners. Consent popups obscure content. Always use blockCookieBanners: true.
- Store metadata alongside screenshots. Record the URL, timestamp, HTTP status, and any relevant context in a structured log file.
- Use timestamps in filenames. ISO 8601 format sorts correctly and is unambiguous across timezones.
- Set up retention policies. Configure S3 lifecycle rules to move old archives to Glacier or delete them after a retention period.
- Monitor for failures. Set up alerting when captures fail repeatedly. A URL that returns errors might mean the page was removed.
Next Steps
- API Documentation -- full reference for storage parameters
- Visual Regression Testing -- use archiving techniques for testing
- PDF Generation -- archive pages as PDFs instead of images
Try SnapAPI Free
Get 200 free screenshots per month. Start archiving web pages today.
Get Started Free →
Related Reading
- Web Archiving Use Case -- compliance and historical archiving at scale
- PDF Generation API -- archive pages as PDF documents
- Free Screenshot API -- start archiving with 200 free captures per month