API for Web Archiving: Tools, Patterns & Best Practices

Build a Web Archive with Screenshot + Extract APIs

Published February 20, 2026 · 9 min read

Cover image: library archive shelves (photo by Janko Ferlič on Unsplash)

The internet is ephemeral. Pages disappear, content changes, domains expire. If you need to preserve web content — for legal compliance, research, competitive intelligence, or historical records — you need a web archiving API that captures both the visual appearance and the underlying content.

Combining SnapAPI's screenshot and extract endpoints gives you a complete archiving solution: visual snapshots show exactly how a page looked, while extracted content preserves the searchable text.

🚀 TL;DR: Use SnapAPI to capture screenshots (visual proof), extract content (searchable text), and generate PDFs (printable records) from any URL. Store them together for a complete web archive.

Why Build a Web Archive?

  • Legal & compliance: Preserve evidence of published content, terms of service, regulatory disclosures
  • Competitive intelligence: Track how competitor pages evolve over time
  • Research: Academic and journalistic preservation of sources
  • Content backup: Archive your own published content in case of data loss
  • Link rot prevention: Keep copies of pages you reference or link to
  • Due diligence: Record vendor/partner claims and commitments

Architecture Overview

A complete web archive captures three things per URL:

  • Screenshot (PNG/JPEG): Visual record of how the page appeared
  • Extracted content (Markdown/JSON): Searchable text, metadata, and structured data
  • PDF: Printable, full-page document for legal archives

Full Implementation

cURL — Archive a Single Page

# 1. Capture screenshot
curl -X POST https://api.snapapi.pics/v1/screenshot \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/page",
    "fullPage": true,
    "format": "png",
    "blockCookieBanners": true
  }' --output archive/screenshot.png

# 2. Extract content
curl -X POST https://api.snapapi.pics/v1/extract \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/page",
    "type": "structured"
  }' --output archive/content.json

# 3. Generate PDF
curl -X POST https://api.snapapi.pics/v1/screenshot \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/page",
    "format": "pdf",
    "pdf": { "format": "A4", "printBackground": true }
  }' --output archive/document.pdf

Node.js — Complete Archiving System

// Node 18+ provides a global fetch, so no HTTP client dependency is needed.
const fs = require('fs');
const path = require('path');

const SNAPAPI_KEY = process.env.SNAPAPI_KEY;
const ARCHIVE_DIR = './archive';

async function archivePage(url) {
  const timestamp = new Date().toISOString().replace(/[:.]/g, '-');
  const slug = new URL(url).hostname + '_' + timestamp;
  const dir = path.join(ARCHIVE_DIR, slug);
  fs.mkdirSync(dir, { recursive: true });

  // Capture all three formats in parallel
  const [screenshot, content, pdf] = await Promise.all([
    // 1. Full-page screenshot
    fetch('https://api.snapapi.pics/v1/screenshot', {
      method: 'POST',
      headers: {
        'X-API-Key': SNAPAPI_KEY,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        url,
        fullPage: true,
        format: 'png',
        blockCookieBanners: true
      })
    }).then(r => r.arrayBuffer()).then(b => Buffer.from(b)),

    // 2. Structured content extraction
    fetch('https://api.snapapi.pics/v1/extract', {
      method: 'POST',
      headers: {
        'X-API-Key': SNAPAPI_KEY,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({ url, type: 'structured' })
    }).then(r => r.json()),

    // 3. PDF document
    fetch('https://api.snapapi.pics/v1/screenshot', {
      method: 'POST',
      headers: {
        'X-API-Key': SNAPAPI_KEY,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        url,
        format: 'pdf',
        pdf: { format: 'A4', printBackground: true }
      })
    }).then(r => r.arrayBuffer()).then(b => Buffer.from(b))
  ]);

  // Save all artifacts
  fs.writeFileSync(path.join(dir, 'screenshot.png'), screenshot);
  fs.writeFileSync(path.join(dir, 'content.json'), JSON.stringify(content, null, 2));
  fs.writeFileSync(path.join(dir, 'document.pdf'), pdf);

  // Save metadata
  const metadata = {
    url,
    archivedAt: new Date().toISOString(),
    title: content.data?.title || 'Unknown',
    author: content.data?.author || 'Unknown',
    wordCount: content.data?.wordCount || 0,
    files: ['screenshot.png', 'content.json', 'document.pdf']
  };
  fs.writeFileSync(path.join(dir, 'metadata.json'), JSON.stringify(metadata, null, 2));

  console.log(`Archived: ${url} → ${dir}`);
  return { dir, metadata };
}

// Archive multiple URLs
const urls = [
  'https://competitor.com/pricing',
  'https://partner.com/terms',
  'https://news-site.com/article/important-story'
];

(async () => {
  for (const url of urls) {
    try {
      await archivePage(url);
    } catch (err) {
      console.error(`Failed to archive ${url}:`, err.message);
    }
  }
})();

Python — Web Archiving Pipeline

import requests
import os
import json
from datetime import datetime
from urllib.parse import urlparse
from concurrent.futures import ThreadPoolExecutor

SNAPAPI_KEY = os.environ['SNAPAPI_KEY']
ARCHIVE_DIR = './archive'

def capture_screenshot(url: str) -> bytes:
    resp = requests.post(
        'https://api.snapapi.pics/v1/screenshot',
        headers={'X-API-Key': SNAPAPI_KEY, 'Content-Type': 'application/json'},
        json={'url': url, 'fullPage': True, 'format': 'png', 'blockCookieBanners': True}
    )
    resp.raise_for_status()  # fail loudly instead of silently archiving an error response
    return resp.content

def extract_content(url: str) -> dict:
    resp = requests.post(
        'https://api.snapapi.pics/v1/extract',
        headers={'X-API-Key': SNAPAPI_KEY, 'Content-Type': 'application/json'},
        json={'url': url, 'type': 'structured'}
    )
    resp.raise_for_status()
    return resp.json()

def capture_pdf(url: str) -> bytes:
    resp = requests.post(
        'https://api.snapapi.pics/v1/screenshot',
        headers={'X-API-Key': SNAPAPI_KEY, 'Content-Type': 'application/json'},
        json={'url': url, 'format': 'pdf', 'pdf': {'format': 'A4', 'printBackground': True}}
    )
    resp.raise_for_status()
    return resp.content

def archive_page(url: str) -> dict:
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    hostname = urlparse(url).hostname.replace('.', '_')
    dir_path = os.path.join(ARCHIVE_DIR, f'{hostname}_{timestamp}')
    os.makedirs(dir_path, exist_ok=True)
    
    # Capture all formats in parallel
    with ThreadPoolExecutor(max_workers=3) as executor:
        screenshot_future = executor.submit(capture_screenshot, url)
        content_future = executor.submit(extract_content, url)
        pdf_future = executor.submit(capture_pdf, url)
        
        screenshot = screenshot_future.result()
        content = content_future.result()
        pdf = pdf_future.result()
    
    # Save artifacts
    with open(os.path.join(dir_path, 'screenshot.png'), 'wb') as f:
        f.write(screenshot)
    with open(os.path.join(dir_path, 'content.json'), 'w') as f:
        json.dump(content, f, indent=2)
    with open(os.path.join(dir_path, 'document.pdf'), 'wb') as f:
        f.write(pdf)
    
    # Save metadata
    metadata = {
        'url': url,
        'archived_at': datetime.now().isoformat(),
        'title': content.get('data', {}).get('title', 'Unknown'),
        'word_count': content.get('data', {}).get('wordCount', 0),
    }
    with open(os.path.join(dir_path, 'metadata.json'), 'w') as f:
        json.dump(metadata, f, indent=2)
    
    print(f'Archived: {url} → {dir_path}')
    return metadata

# Archive a list of URLs
urls = [
    'https://competitor.com/pricing',
    'https://partner.com/terms-of-service',
    'https://news-site.com/article/breaking-news',
]

for url in urls:
    try:
        archive_page(url)
    except Exception as e:
        print(f'Failed: {url} — {e}')
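Transient network errors are common when archiving many URLs, so it is worth retrying a failed capture before giving up on it. A small retry helper (a sketch; `with_retries` is our name, not part of SnapAPI) keeps one flaky page from failing the batch:

```python
import time

def with_retries(fn, *args, attempts=3, base_delay=1.0):
    """Call fn(*args), retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return fn(*args)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * (2 ** attempt))
```

In the loop above you would then call `with_retries(archive_page, url)` instead of `archive_page(url)`.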

Storage Strategies

Where you store your archives matters:

  • Local filesystem: Simple for small archives. Organize by date and domain.
  • S3/R2/GCS: Scalable cloud storage. Use lifecycle policies for automatic tiering.
  • Database + blob storage: Store metadata in PostgreSQL, files in S3. Enables search and querying.

For example, pushing each capture to S3 with the AWS SDK v3 after it is saved locally:

// Upload to S3 after capture
const { S3Client, PutObjectCommand } = require('@aws-sdk/client-s3');
const s3 = new S3Client({ region: 'us-east-1' });

async function uploadToS3(key, body, contentType) {
  await s3.send(new PutObjectCommand({
    Bucket: 'my-web-archive',
    Key: key,
    Body: body,
    ContentType: contentType
  }));
}
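For the database-plus-blob approach, each capture gets a metadata row pointing at its blob location. A sketch using SQLite as a stand-in for PostgreSQL (the table name, columns, and `s3_prefix` convention are illustrative):

```python
import sqlite3

# ':memory:' keeps the sketch self-contained; use a file path
# (or a real PostgreSQL connection) in production.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS archives (
        id INTEGER PRIMARY KEY,
        url TEXT NOT NULL,
        archived_at TEXT NOT NULL,
        title TEXT,
        s3_prefix TEXT
    )
""")

def record_archive(url, archived_at, title, s3_prefix):
    """Insert one metadata row after the blobs are uploaded."""
    conn.execute(
        "INSERT INTO archives (url, archived_at, title, s3_prefix) VALUES (?, ?, ?, ?)",
        (url, archived_at, title, s3_prefix),
    )
    conn.commit()

def history(url):
    """Every capture of a URL, newest first."""
    return conn.execute(
        "SELECT archived_at, s3_prefix FROM archives WHERE url = ? ORDER BY archived_at DESC",
        (url,),
    ).fetchall()
```

With ISO-8601 timestamps in `archived_at`, string ordering doubles as chronological ordering, so "show me this page over time" is a single indexed query.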

Scheduling Archives

For ongoing monitoring, schedule periodic captures:

  • Daily: Important pages (pricing, terms, competitor landing pages)
  • Weekly: Blog posts, documentation, feature pages
  • On-demand: Triggered by events (new link discovered, content alert)
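One lightweight way to implement those tiers is to store each page's last-capture time and check whether it is due on every scheduler run. A sketch (the tier names mirror the list above; the interval table is an assumption):

```python
from datetime import datetime, timedelta

INTERVALS = {
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
}

def is_due(tier: str, last_captured: datetime, now: datetime) -> bool:
    """True once the tier's interval has elapsed since the last capture."""
    return now - last_captured >= INTERVALS[tier]

# A cron job (or any scheduler) can loop over tracked URLs and
# re-archive only the ones where is_due(...) returns True.
```

Checking "is it due?" rather than firing captures directly from cron means a missed run simply catches up on the next one.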

Search and Retrieval

An archive is only useful if you can find things in it. Since you're extracting content as structured text, you can build full-text search:

// Simple search across archived content
const searchQuery = process.argv[2] || '';
const needle = searchQuery.toLowerCase();
const archives = fs.readdirSync(ARCHIVE_DIR);
const results = [];

for (const dir of archives) {
  const contentPath = path.join(ARCHIVE_DIR, dir, 'content.json');
  if (!fs.existsSync(contentPath)) continue;

  const content = JSON.parse(fs.readFileSync(contentPath, 'utf8'));
  const text = content.data?.content || '';
  const idx = text.toLowerCase().indexOf(needle);

  if (needle && idx !== -1) {
    results.push({
      dir,
      title: content.data?.title,
      url: content.data?.url,
      snippet: text.substring(Math.max(0, idx - 100), idx + 200)
    });
  }
}
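A linear scan works for a handful of captures, but beyond that a real full-text index pays off. A sketch using SQLite's FTS5 extension (bundled with most standard SQLite builds) over the extracted text:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for a persistent index
conn.execute("CREATE VIRTUAL TABLE pages USING fts5(dir, title, body)")

def index_page(dir_name, title, body):
    """Add one archived page's extracted text to the index."""
    conn.execute("INSERT INTO pages VALUES (?, ?, ?)", (dir_name, title, body))
    conn.commit()

def search(query):
    """Full-text search, best matches first."""
    return conn.execute(
        "SELECT dir, title FROM pages WHERE pages MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()
```

Indexing each page once at archive time replaces the per-query scan of every `content.json` with a single `MATCH` query.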

Pricing for Archiving Workloads

Each page archive uses ~1.5 API credits (1 for the screenshot + 0.5 for extraction); adding the PDF brings a full three-format capture to ~2.5 credits.

  • Free: 200 credits/month — archive ~80 pages with full captures
  • Pro ($19/mo): 25,000 credits — archive ~10,000 pages/month
  • Business ($79/mo): 100,000 credits — enterprise-scale archiving
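The per-page math behind those plan estimates (screenshot + extraction + PDF = 2.5 credits per full capture) is easy to sanity-check:

```python
CREDITS_PER_PAGE = 1 + 0.5 + 1  # screenshot + extraction + PDF

def pages_per_month(credits: int) -> int:
    """Full three-format captures a monthly credit budget covers."""
    return int(credits / CREDITS_PER_PAGE)

print(pages_per_month(200))     # prints 80    (Free tier)
print(pages_per_month(25_000))  # prints 10000 (Pro tier)
```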

Get Started

Build your web archive today. Sign up free and start preserving the web pages that matter to your business.

💡 Pro tip: Combine archiving with SnapAPI's AI Analysis API to automatically summarize and tag archived content, making your archive searchable by topic and sentiment.

Start Capturing for Free

200 screenshots/month. Screenshots, PDF, scraping, and video recording. No credit card required.

Get Free API Key →