
Building a Web Archiving System with Screenshot API

February 7, 2026 · 5 min read


Web pages change constantly. Prices update, content gets removed, terms of service are revised, and pages go offline entirely. For many organizations, capturing and preserving web content at specific points in time is not just useful -- it is a legal or regulatory requirement. In this guide, we will build a complete web archiving system using SnapAPI.

Why Archive Web Pages?

  - Legal and Compliance: preserving terms of service, privacy policies, and disclosures exactly as they appeared at a given point in time.
  - Business Intelligence: tracking competitor pricing and feature pages so changes do not go unnoticed.
  - Research and Documentation: keeping dated snapshots of pages that may change or disappear entirely.

Architecture Overview

Our archiving system has three components:

  1. Scheduler: Triggers captures at configured intervals (cron or GitHub Actions)
  2. Capture service: Calls SnapAPI to take screenshots and store them
  3. Archive viewer: A simple web UI to browse captured snapshots

Step 1: Capture and Store to S3

SnapAPI has a built-in storage feature that sends screenshots directly to your S3 bucket. This means the image data never passes through your server -- it goes straight from SnapAPI to S3.

Setup required: Before using S3 storage, configure your S3 credentials in the SnapAPI dashboard under Settings > Storage. You will need your AWS access key, secret key, bucket name, and region.

// archive.js

const SNAPAPI_KEY = process.env.SNAPAPI_KEY;
const API_URL = "https://api.snapapi.pics/v1/screenshot";

async function archivePage(url, metadata = {}) {
  const timestamp = new Date().toISOString().replace(/[:.]/g, "-");
  const slug = url.replace(/https?:\/\//, "").replace(/[^a-zA-Z0-9]/g, "_");
  const filename = `archives/${slug}/${timestamp}.png`;

  const response = await fetch(API_URL, {
    method: "POST",
    headers: {
      "X-Api-Key": SNAPAPI_KEY,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      url,
      format: "png",
      width: 1440,
      height: 900,
      fullPage: true,
      blockCookieBanners: true,
      responseType: "json",
      storage: {
        enabled: true,
        destination: "user_s3",
        path: filename,
      },
    }),
  });

  if (!response.ok) {
    const error = await response.text();
    throw new Error(`Archive failed for ${url}: ${error}`);
  }

  const result = await response.json();

  // Log the archive entry
  const entry = {
    url,
    timestamp: new Date().toISOString(),
    storagePath: filename,
    storageUrl: result.storageUrl,
    ...metadata,
  };

  console.log(`Archived: ${url} -> ${result.storageUrl}`);
  return entry;
}

module.exports = { archivePage };
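The path scheme above is worth calling out: each URL becomes a stable folder slug, and each capture gets an ISO 8601 timestamp, so snapshots of the same page group together and sort chronologically. Here is the same logic extracted as a standalone helper, with an optional date parameter added for testability:

```javascript
// Same slug/timestamp scheme as archivePage, shown standalone.
function archivePath(url, date = new Date()) {
  const timestamp = date.toISOString().replace(/[:.]/g, "-");
  const slug = url.replace(/https?:\/\//, "").replace(/[^a-zA-Z0-9]/g, "_");
  return `archives/${slug}/${timestamp}.png`;
}
```

For example, `archivePath("https://example.com/pricing")` produces a key like `archives/example_com_pricing/2026-02-07T06-00-00-000Z.png`.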

Step 2: Batch Archiving with Metadata

Define a list of URLs to archive and run them all with full metadata tracking:

// run-archive.js
const fs = require("fs");
const { archivePage } = require("./archive");

const URLS_TO_ARCHIVE = [
  { url: "https://competitor.com/pricing", category: "competitor", tags: ["pricing"] },
  { url: "https://competitor.com/features", category: "competitor", tags: ["features"] },
  { url: "https://yoursite.com/terms", category: "legal", tags: ["tos"] },
  { url: "https://yoursite.com/privacy", category: "legal", tags: ["privacy"] },
  { url: "https://news.ycombinator.com", category: "monitoring", tags: ["tech-news"] },
];

const LOG_FILE = "./archive-log.json";

async function runArchive() {
  // Optional category filter, e.g. `node run-archive.js --category competitor`
  const flagIndex = process.argv.indexOf("--category");
  const categoryFilter = flagIndex !== -1 ? process.argv[flagIndex + 1] : null;
  const targets = categoryFilter
    ? URLS_TO_ARCHIVE.filter((item) => item.category === categoryFilter)
    : URLS_TO_ARCHIVE;

  // Load existing log
  let log = [];
  if (fs.existsSync(LOG_FILE)) {
    log = JSON.parse(fs.readFileSync(LOG_FILE, "utf-8"));
  }

  console.log(`Archiving ${targets.length} pages...\n`);

  for (const item of targets) {
    try {
      const entry = await archivePage(item.url, {
        category: item.category,
        tags: item.tags,
      });
      log.push(entry);
    } catch (err) {
      console.error(`  Failed: ${item.url} - ${err.message}`);
      log.push({
        url: item.url,
        timestamp: new Date().toISOString(),
        status: "error",
        error: err.message,
      });
    }
  }

  // Save updated log
  fs.writeFileSync(LOG_FILE, JSON.stringify(log, null, 2));
  console.log(`\nArchive log saved to ${LOG_FILE}`);
  console.log(`Total entries: ${log.length}`);
}

runArchive().catch((err) => {
  console.error(err);
  process.exit(1);
});

Step 3: Schedule with Cron

Run the archiver on a schedule using cron. Add this to your crontab (crontab -e):

# Archive every day at 6 AM UTC
0 6 * * * cd /home/user/web-archiver && SNAPAPI_KEY=your_key node run-archive.js >> /var/log/archiver.log 2>&1

# Archive competitor pricing pages every 6 hours
0 */6 * * * cd /home/user/web-archiver && SNAPAPI_KEY=your_key node run-archive.js --category competitor >> /var/log/archiver.log 2>&1

Step 3 (Alternative): Schedule with GitHub Actions

If you prefer a serverless approach, use GitHub Actions with a cron schedule:

# .github/workflows/archive.yml
name: Web Archive

on:
  schedule:
    # Run daily at 6 AM UTC
    - cron: "0 6 * * *"
  workflow_dispatch: # Allow manual triggers

jobs:
  archive:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: "20"

      - name: Install dependencies
        run: npm ci

      - name: Run archive
        run: node run-archive.js
        env:
          SNAPAPI_KEY: ${{ secrets.SNAPAPI_KEY }}

      - name: Commit archive log
        run: |
          git config user.name "Archive Bot"
          git config user.email "bot@yoursite.com"
          git add archive-log.json
          git diff --staged --quiet || git commit -m "Update archive log $(date -u +%Y-%m-%d)"
          git push

Step 4: Build an Archive Viewer

A simple Express.js application to browse your archived snapshots:

// viewer.js
const express = require("express");
const fs = require("fs");

const app = express();
const LOG_FILE = "./archive-log.json";

app.get("/", (req, res) => {
  const log = fs.existsSync(LOG_FILE)
    ? JSON.parse(fs.readFileSync(LOG_FILE, "utf-8"))
    : [];
  const { category, url, date } = req.query;

  let entries = log.filter((e) => e.status !== "error");

  // Apply filters
  if (category) entries = entries.filter((e) => e.category === category);
  if (url) entries = entries.filter((e) => e.url.includes(url));
  if (date) entries = entries.filter((e) => e.timestamp.startsWith(date));

  // Group by URL
  const grouped = {};
  for (const entry of entries) {
    if (!grouped[entry.url]) grouped[entry.url] = [];
    grouped[entry.url].push(entry);
  }

  const html = `
    <!DOCTYPE html>
    <html>
    <head>
      <title>Web Archive Viewer</title>
      <style>
        body { font-family: system-ui; max-width: 1200px; margin: 0 auto; padding: 20px; background: #0a0a0f; color: #e2e8f0; }
        h1 { margin-bottom: 20px; }
        .filters { margin-bottom: 20px; display: flex; gap: 10px; }
        .filters select, .filters input { padding: 8px; background: #1a1a2e; color: #e2e8f0; border: 1px solid #2a2a3e; border-radius: 4px; }
        .url-group { margin-bottom: 30px; border: 1px solid #2a2a3e; border-radius: 8px; padding: 16px; }
        .url-group h3 { color: #00d4ff; margin-bottom: 12px; word-break: break-all; }
        .snapshots { display: flex; gap: 12px; overflow-x: auto; padding: 8px 0; }
        .snapshot { min-width: 200px; text-align: center; }
        .snapshot img { width: 200px; border-radius: 4px; border: 1px solid #2a2a3e; }
        .snapshot p { font-size: 0.8rem; color: #64748b; margin-top: 4px; }
      </style>
    </head>
    <body>
      <h1>Web Archive Viewer</h1>
      <p style="color: #94a3b8;">${entries.length} snapshots across ${Object.keys(grouped).length} URLs</p>
      ${Object.entries(grouped)
        .map(
          ([url, snapshots]) => `
        <div class="url-group">
          <h3>${url}</h3>
          <div class="snapshots">
            ${snapshots
              .slice(-10)
              .map(
                (s) => `
              <div class="snapshot">
                <a href="${s.storageUrl}" target="_blank">
                  <img src="${s.storageUrl}" loading="lazy" alt="Archive snapshot">
                </a>
                <p>${new Date(s.timestamp).toLocaleString()}</p>
              </div>
            `
              )
              .join("")}
          </div>
        </div>
      `
        )
        .join("")}
    </body>
    </html>
  `;

  res.send(html);
});

app.listen(3001, () => console.log("Archive viewer: http://localhost:3001"));

Archiving Without S3 (Local Storage)

If you do not need S3 integration, you can store screenshots locally or on any file system:

const fs = require("fs");
const path = require("path");

async function archiveLocal(url, outputDir = "./archives") {
  const timestamp = new Date().toISOString().replace(/[:.]/g, "-");
  const slug = url.replace(/https?:\/\//, "").replace(/[^a-zA-Z0-9]/g, "_");
  const dir = path.join(outputDir, slug);
  fs.mkdirSync(dir, { recursive: true });

  const response = await fetch("https://api.snapapi.pics/v1/screenshot", {
    method: "POST",
    headers: {
      "X-Api-Key": process.env.SNAPAPI_KEY,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      url,
      format: "png",
      fullPage: true,
      blockCookieBanners: true,
      responseType: "image",
    }),
  });

  if (!response.ok) throw new Error(`Failed: ${response.status}`);

  const filepath = path.join(dir, `${timestamp}.png`);
  const buffer = Buffer.from(await response.arrayBuffer());
  fs.writeFileSync(filepath, buffer);

  return { url, timestamp, filepath, size: buffer.length };
}

Python Archiver

Here is the same archiving logic in Python, for teams that prefer it:

import requests
import os
import json
from datetime import datetime

SNAPAPI_KEY = os.environ["SNAPAPI_KEY"]

def archive_page(url: str, output_dir: str = "./archives") -> dict:
    timestamp = datetime.utcnow().strftime("%Y-%m-%dT%H-%M-%S")
    slug = url.replace("https://", "").replace("http://", "")
    slug = "".join(c if c.isalnum() else "_" for c in slug)

    page_dir = os.path.join(output_dir, slug)
    os.makedirs(page_dir, exist_ok=True)

    response = requests.post(
        "https://api.snapapi.pics/v1/screenshot",
        headers={
            "X-Api-Key": SNAPAPI_KEY,
            "Content-Type": "application/json",
        },
        json={
            "url": url,
            "format": "png",
            "fullPage": True,
            "blockCookieBanners": True,
            "responseType": "image",
        },
        timeout=60,
    )
    response.raise_for_status()

    filepath = os.path.join(page_dir, f"{timestamp}.png")
    with open(filepath, "wb") as f:
        f.write(response.content)

    entry = {
        "url": url,
        "timestamp": datetime.utcnow().isoformat(),
        "filepath": filepath,
        "size_bytes": len(response.content),
    }

    print(f"  Archived: {url} ({len(response.content)} bytes)")
    return entry


# Archive multiple URLs
urls = [
    "https://competitor.com/pricing",
    "https://yoursite.com/terms",
    "https://news.ycombinator.com",
]

results = []
for url in urls:
    try:
        entry = archive_page(url)
        results.append(entry)
    except Exception as e:
        print(f"  Failed: {url} - {e}")

# Save log
with open("archive-log.json", "w") as f:
    json.dump(results, f, indent=2)

print(f"\nArchived {len(results)}/{len(urls)} pages")

Best Practices

  1. Use full-page capture. For archival purposes, capture the entire page with fullPage: true so nothing is missed.
  2. Block cookie banners. Consent popups obscure content. Always use blockCookieBanners: true.
  3. Store metadata alongside screenshots. Record the URL, timestamp, HTTP status, and any relevant context in a structured log file.
  4. Use timestamps in filenames. ISO 8601 format sorts correctly and is unambiguous across timezones.
  5. Set up retention policies. Configure S3 lifecycle rules to move old archives to Glacier or delete them after a retention period.
  6. Monitor for failures. Set up alerting when captures fail repeatedly. A URL that returns errors might mean the page was removed.
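For practice 6, a pass over the archive log is enough to get started: flag any URL whose last few entries are all errors. A minimal sketch, assuming the log format from Step 2 (where failed entries carry status: "error"):

```javascript
// Flag URLs whose most recent `threshold` log entries all failed.
// Assumes entries as written by run-archive.js: { url, timestamp, status?, ... }
function findFailingUrls(log, threshold = 3) {
  // Group log entries by URL, preserving chronological order
  const byUrl = new Map();
  for (const entry of log) {
    if (!byUrl.has(entry.url)) byUrl.set(entry.url, []);
    byUrl.get(entry.url).push(entry);
  }

  const failing = [];
  for (const [url, entries] of byUrl) {
    const recent = entries.slice(-threshold);
    if (recent.length === threshold && recent.every((e) => e.status === "error")) {
      failing.push(url);
    }
  }
  return failing;
}
```

Run this at the end of runArchive and feed the result into whatever alerting you already have (Slack webhook, email, PagerDuty).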

Next Steps

Try SnapAPI Free

Get 200 free screenshots per month. Start archiving web pages today.

Get Started Free →

Last updated: February 19, 2026