Build a Web Archive with Screenshot + Extract APIs
Published February 20, 2026 · 9 min read
The internet is ephemeral. Pages disappear, content changes, domains expire. If you need to preserve web content — for legal compliance, research, competitive intelligence, or historical records — you need a web archiving API that captures both the visual appearance and the underlying content.
Combining SnapAPI's screenshot and extract endpoints gives you a complete archiving solution: visual snapshots show exactly how a page looked, while extracted content preserves the searchable text.
🚀 TL;DR: Use SnapAPI to capture screenshots (visual proof), extract content (searchable text), and generate PDFs (printable records) from any URL. Store them together for a complete web archive.
Why Build a Web Archive?
- Legal & compliance: Preserve evidence of published content, terms of service, regulatory disclosures
- Competitive intelligence: Track how competitor pages evolve over time
- Research: Academic and journalistic preservation of sources
- Content backup: Archive your own published content in case of data loss
- Link rot prevention: Keep copies of pages you reference or link to
- Due diligence: Record vendor/partner claims and commitments
Architecture Overview
A complete web archive captures three things per URL:
- Screenshot (PNG/JPEG): Visual record of how the page appeared
- Extracted content (Markdown/JSON): Searchable text, metadata, and structured data
- PDF: Printable, full-page document for legal archives
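Before writing any capture code, it helps to pin down what "complete" means per URL. Here is a minimal sketch in Python of a completeness check; the `missing_artifacts` helper is hypothetical, with filenames taken from the examples that follow:

```python
import os

# The three artifacts captured per URL, plus the metadata record
# written alongside them (names match the examples below).
EXPECTED_FILES = ['screenshot.png', 'content.json', 'document.pdf', 'metadata.json']

def missing_artifacts(archive_dir: str) -> list:
    """Return the expected files absent from an archive directory."""
    present = set(os.listdir(archive_dir)) if os.path.isdir(archive_dir) else set()
    return [name for name in EXPECTED_FILES if name not in present]
```

Running a check like this after each capture (or as a periodic audit) catches partial archives caused by timeouts or failed requests before you need to rely on them.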
Full Implementation
cURL — Archive a Single Page
# 1. Capture screenshot
curl -X POST https://api.snapapi.pics/v1/screenshot \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/page",
"fullPage": true,
"format": "png",
"blockCookieBanners": true
}' --output archive/screenshot.png
# 2. Extract content
curl -X POST https://api.snapapi.pics/v1/extract \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/page",
"type": "structured"
}' --output archive/content.json
# 3. Generate PDF
curl -X POST https://api.snapapi.pics/v1/screenshot \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/page",
"format": "pdf",
"pdf": { "format": "A4", "printBackground": true }
}' --output archive/document.pdf
Node.js — Complete Archiving System
const fetch = require('node-fetch');
const fs = require('fs');
const path = require('path');
const SNAPAPI_KEY = process.env.SNAPAPI_KEY;
const ARCHIVE_DIR = './archive';
async function archivePage(url) {
const timestamp = new Date().toISOString().replace(/[:.]/g, '-');
const slug = new URL(url).hostname + '_' + timestamp;
const dir = path.join(ARCHIVE_DIR, slug);
fs.mkdirSync(dir, { recursive: true });
// Capture all three formats in parallel
const [screenshot, content, pdf] = await Promise.all([
// 1. Full-page screenshot
fetch('https://api.snapapi.pics/v1/screenshot', {
method: 'POST',
headers: {
'X-API-Key': SNAPAPI_KEY,
'Content-Type': 'application/json'
},
body: JSON.stringify({
url,
fullPage: true,
format: 'png',
blockCookieBanners: true
})
}).then(r => {
if (!r.ok) throw new Error(`Screenshot failed: ${r.status}`);
return r.buffer();
}),
// 2. Structured content extraction
fetch('https://api.snapapi.pics/v1/extract', {
method: 'POST',
headers: {
'X-API-Key': SNAPAPI_KEY,
'Content-Type': 'application/json'
},
body: JSON.stringify({ url, type: 'structured' })
}).then(r => {
if (!r.ok) throw new Error(`Extract failed: ${r.status}`);
return r.json();
}),
// 3. PDF document
fetch('https://api.snapapi.pics/v1/screenshot', {
method: 'POST',
headers: {
'X-API-Key': SNAPAPI_KEY,
'Content-Type': 'application/json'
},
body: JSON.stringify({
url,
format: 'pdf',
pdf: { format: 'A4', printBackground: true }
})
}).then(r => {
if (!r.ok) throw new Error(`PDF failed: ${r.status}`);
return r.buffer();
})
]);
// Save all artifacts
fs.writeFileSync(path.join(dir, 'screenshot.png'), screenshot);
fs.writeFileSync(path.join(dir, 'content.json'), JSON.stringify(content, null, 2));
fs.writeFileSync(path.join(dir, 'document.pdf'), pdf);
// Save metadata
const metadata = {
url,
archivedAt: new Date().toISOString(),
title: content.data?.title || 'Unknown',
author: content.data?.author || 'Unknown',
wordCount: content.data?.wordCount || 0,
files: ['screenshot.png', 'content.json', 'document.pdf']
};
fs.writeFileSync(path.join(dir, 'metadata.json'), JSON.stringify(metadata, null, 2));
console.log(`Archived: ${url} → ${dir}`);
return { dir, metadata };
}
// Archive multiple URLs
const urls = [
'https://competitor.com/pricing',
'https://partner.com/terms',
'https://news-site.com/article/important-story'
];
(async () => {
for (const url of urls) {
try {
await archivePage(url);
} catch (err) {
console.error(`Failed to archive ${url}:`, err.message);
}
}
})();
Python — Web Archiving Pipeline
import requests
import os
import json
from datetime import datetime
from urllib.parse import urlparse
from concurrent.futures import ThreadPoolExecutor
SNAPAPI_KEY = os.environ['SNAPAPI_KEY']
ARCHIVE_DIR = './archive'
def capture_screenshot(url: str) -> bytes:
    resp = requests.post(
        'https://api.snapapi.pics/v1/screenshot',
        headers={'X-API-Key': SNAPAPI_KEY, 'Content-Type': 'application/json'},
        json={'url': url, 'fullPage': True, 'format': 'png', 'blockCookieBanners': True}
    )
    resp.raise_for_status()  # fail loudly rather than archiving an error body
    return resp.content

def extract_content(url: str) -> dict:
    resp = requests.post(
        'https://api.snapapi.pics/v1/extract',
        headers={'X-API-Key': SNAPAPI_KEY, 'Content-Type': 'application/json'},
        json={'url': url, 'type': 'structured'}
    )
    resp.raise_for_status()
    return resp.json()

def capture_pdf(url: str) -> bytes:
    resp = requests.post(
        'https://api.snapapi.pics/v1/screenshot',
        headers={'X-API-Key': SNAPAPI_KEY, 'Content-Type': 'application/json'},
        json={'url': url, 'format': 'pdf', 'pdf': {'format': 'A4', 'printBackground': True}}
    )
    resp.raise_for_status()
    return resp.content

def archive_page(url: str) -> dict:
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    hostname = urlparse(url).hostname.replace('.', '_')
    dir_path = os.path.join(ARCHIVE_DIR, f'{hostname}_{timestamp}')
    os.makedirs(dir_path, exist_ok=True)
    # Capture all three formats in parallel
    with ThreadPoolExecutor(max_workers=3) as executor:
        screenshot_future = executor.submit(capture_screenshot, url)
        content_future = executor.submit(extract_content, url)
        pdf_future = executor.submit(capture_pdf, url)
        screenshot = screenshot_future.result()
        content = content_future.result()
        pdf = pdf_future.result()
    # Save artifacts
    with open(os.path.join(dir_path, 'screenshot.png'), 'wb') as f:
        f.write(screenshot)
    with open(os.path.join(dir_path, 'content.json'), 'w') as f:
        json.dump(content, f, indent=2)
    with open(os.path.join(dir_path, 'document.pdf'), 'wb') as f:
        f.write(pdf)
    # Save metadata
    metadata = {
        'url': url,
        'archived_at': datetime.now().isoformat(),
        'title': content.get('data', {}).get('title', 'Unknown'),
        'word_count': content.get('data', {}).get('wordCount', 0),
    }
    with open(os.path.join(dir_path, 'metadata.json'), 'w') as f:
        json.dump(metadata, f, indent=2)
    print(f'Archived: {url} → {dir_path}')
    return metadata

# Archive a list of URLs
urls = [
    'https://competitor.com/pricing',
    'https://partner.com/terms-of-service',
    'https://news-site.com/article/breaking-news',
]
for url in urls:
    try:
        archive_page(url)
    except Exception as e:
        print(f'Failed: {url} — {e}')
Storage Strategies
Where you store your archives matters:
- Local filesystem: Simple for small archives. Organize by date and domain.
- S3/R2/GCS: Scalable cloud storage. Use lifecycle policies for automatic tiering.
- Database + blob storage: Store metadata in PostgreSQL, files in S3. Enables search and querying.
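As a sketch of the database-plus-blob pattern (shown in Python for brevity), metadata and extracted text can go into SQLite's FTS5 full-text index, with SQLite standing in here for PostgreSQL, while the binary artifacts live under an object-storage key; `index_page` and `search` are illustrative names:

```python
import sqlite3

# Full-text index over archived pages; the binary artifacts
# (screenshot, PDF) live in blob storage under s3_prefix.
conn = sqlite3.connect(':memory:')
conn.execute(
    "CREATE VIRTUAL TABLE archive USING fts5(url, title, body, s3_prefix UNINDEXED)"
)

def index_page(url, title, body, s3_prefix):
    conn.execute('INSERT INTO archive VALUES (?, ?, ?, ?)',
                 (url, title, body, s3_prefix))

def search(query):
    return conn.execute(
        'SELECT url, title FROM archive WHERE archive MATCH ?', (query,)
    ).fetchall()
```

The same schema translates directly to PostgreSQL with `tsvector` columns if the archive outgrows a single file.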
// Upload to S3 after capture
const { S3Client, PutObjectCommand } = require('@aws-sdk/client-s3');
const s3 = new S3Client({ region: 'us-east-1' });
async function uploadToS3(key, body, contentType) {
await s3.send(new PutObjectCommand({
Bucket: 'my-web-archive',
Key: key,
Body: body,
ContentType: contentType
}));
}
Scheduling Archives
For ongoing monitoring, schedule periodic captures:
- Daily: Important pages (pricing, terms, competitor landing pages)
- Weekly: Blog posts, documentation, feature pages
- On-demand: Triggered by events (new link discovered, content alert)
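The tier logic above reduces to a simple due-date check that a cron job or worker can run before each capture; this is a sketch, and the `is_due` helper and interval values are assumptions, not part of any API:

```python
from datetime import datetime, timedelta

# Capture cadences matching the tiers above.
INTERVALS = {
    'daily': timedelta(days=1),
    'weekly': timedelta(weeks=1),
}

def is_due(last_captured: datetime, tier: str, now: datetime) -> bool:
    """True if a page on the given tier should be re-archived."""
    return now - last_captured >= INTERVALS[tier]
```

A scheduler then iterates over tracked URLs, calls `is_due` with each page's last capture time, and re-archives the ones that return true.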
Search and Retrieval
An archive is only useful if you can find things in it. Since you're extracting content as structured text, you can build full-text search:
// Simple search across archived content
function searchArchive(searchQuery) {
const query = searchQuery.toLowerCase();
const archives = fs.readdirSync(ARCHIVE_DIR);
const results = [];
for (const dir of archives) {
const contentPath = path.join(ARCHIVE_DIR, dir, 'content.json');
if (!fs.existsSync(contentPath)) continue;
const content = JSON.parse(fs.readFileSync(contentPath, 'utf8'));
const text = content.data?.content || '';
const index = text.toLowerCase().indexOf(query);
if (index !== -1) {
results.push({
dir,
title: content.data?.title,
url: content.data?.url,
snippet: text.substring(Math.max(0, index - 100), index + 200)
});
}
}
return results;
}
Pricing for Archiving Workloads
Each full page archive uses about 2.5 API credits: 1 for the screenshot, 0.5 for the extraction, and 1 for the PDF.
- Free: 200 credits/month — archive ~80 pages with full captures
- Pro ($19/mo): 25,000 credits — archive ~10,000 pages/month
- Business ($79/mo): 100,000 credits — enterprise-scale archiving
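To size a plan, the credit math above can be expressed as a quick back-of-envelope calculation; this helper is illustrative only, using the per-capture costs stated in this article:

```python
# Credit math from the pricing above: 1 credit per screenshot,
# 0.5 per extraction, 1 per PDF.
def credits_per_page(include_pdf: bool = True) -> float:
    return 1 + 0.5 + (1 if include_pdf else 0)

def pages_per_month(monthly_credits: int, include_pdf: bool = True) -> int:
    """How many pages a monthly credit allowance covers."""
    return int(monthly_credits / credits_per_page(include_pdf))
```

Skipping the PDF drops the cost to 1.5 credits per page, stretching the same allowance noticeably further.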
Get Started
Build your web archive today. Sign up free and start preserving the web pages that matter to your business.
💡 Pro tip: Combine archiving with SnapAPI's AI Analysis API to automatically summarize and tag archived content, making your archive searchable by topic and sentiment.