Extracting structured data from websites is one of the most common automation tasks. Whether you need product pricing, job listings, contact information, or news articles, the challenges are the same: HTML is messy, sites change without warning, and JavaScript rendering complicates everything. This guide covers the traditional selector-based approach, AI-powered extraction, and how SnapAPI's extract endpoint returns structured JSON from a page with a single API call.
Traditional Extraction with CSS Selectors
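The classic approach is to fetch the HTML, parse it, and pull out data with CSS selectors. This works well while the page structure stays consistent, but it breaks when the site changes the markup those selectors depend on. It also only sees the raw HTML, so content rendered client-side by JavaScript is not available: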
import * as cheerio from 'cheerio';
import axios from 'axios';

async function extractProducts(url) {
  const { data: html } = await axios.get(url);
  const $ = cheerio.load(html);
  return $('.product-card').map((i, el) => ({
    name: $(el).find('.product-name').text().trim(),
    price: parseFloat($(el).find('.price').text().replace(/[^0-9.]/g, '')),
    image: $(el).find('img').attr('src'),
    rating: parseFloat($(el).find('[data-rating]').attr('data-rating')),
    url: $(el).find('a').attr('href'),
  })).get();
}

// Problem: selectors break when the site changes
// '.product-card' might become '.item-card' overnight
// '.price' might move to a different element
// You won't know until your pipeline silently returns empty data
The fragility problem is real: you need to monitor extraction results, handle missing fields, and rewrite selectors whenever the target site changes its markup. For a single site this is manageable. For dozens, it becomes a maintenance nightmare.
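One way to catch the silent-failure case is to check every batch of results before it reaches the rest of the pipeline. The sketch below is a minimal example of that idea, not part of Cheerio or SnapAPI: `checkExtraction` and the `requiredFields` list are hypothetical names, and the rule for what counts as "missing" is an assumption you may want to adjust.

```javascript
// Hypothetical helper: summarizes how complete a batch of extracted records is.
// `requiredFields` is an assumption; list the fields your pipeline depends on.
function checkExtraction(records, requiredFields) {
  const missing = {};
  for (const field of requiredFields) missing[field] = 0;

  for (const record of records) {
    for (const field of requiredFields) {
      const value = record[field];
      // Treat undefined, null, empty strings, and NaN (e.g. from parseFloat) as missing.
      const isMissing =
        value === undefined ||
        value === null ||
        value === '' ||
        (typeof value === 'number' && Number.isNaN(value));
      if (isMissing) missing[field] += 1;
    }
  }

  return { total: records.length, empty: records.length === 0, missing };
}

// Hand-written records standing in for the output of extractProducts().
const report = checkExtraction(
  [
    { name: 'Widget', price: 9.99 },
    { name: '', price: NaN },
  ],
  ['name', 'price', 'url']
);
console.log(report);
// { total: 2, empty: false, missing: { name: 1, price: 1, url: 2 } }
```

An `empty: true` report, or a sudden jump in any `missing` count, is usually the first sign that a selector no longer matches.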
Schema-Based Extraction with SnapAPI
SnapAPI's /v1/extract endpoint takes a different approach. You define a schema (the shape of the data you want) and SnapAPI works out how to extract it from the page. No CSS selectors are needed, and the same schema can be reused across sites with different markup:
import SnapAPI from 'snapapi-js';

const snap = new SnapAPI('sk_live_your_key');

// Extract product data (works on any e-commerce site)
const result = await snap.extract({
  url: 'https://shop.example.com/category/electronics',
  schema: {
    products: [{
      name: 'string',
      price: 'number',
      currency: 'string',
      rating: 'number',
      review_count: 'number',
      in_stock: 'boolean',
      image_url: 'string',
    }],
    pagination: {
      current_page: 'number',
      total_pages: 'number',
      next_url: 'string',
    },
  },
});

console.log(result.data);
// {
//   products: [
//     { name: "MacBook Pro M4", price: 1999, currency: "USD", rating: 4.8, ... },
//     { name: "Dell XPS 15", price: 1299, currency: "USD", rating: 4.5, ... },
//   ],
//   pagination: { current_page: 1, total_pages: 12, next_url: "/category/electronics?page=2" }
// }
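The `pagination.next_url` field in the sample output is a relative path, so a multi-page crawl has to resolve it against the current page before requesting it. Here is a rough sketch of that loop. `collectAllPages` and `extractPage` are hypothetical names; the extraction call is passed in as a function so the loop does not assume anything about the client beyond the `products` and `pagination` shape shown above.

```javascript
// Follows pagination.next_url until it runs out, repeats, or maxPages is reached.
// `extractPage(url)` is an injected async function that returns the extracted data object.
async function collectAllPages(extractPage, startUrl, maxPages = 20) {
  const products = [];
  const seen = new Set();
  let url = startUrl;

  while (url && !seen.has(url) && seen.size < maxPages) {
    seen.add(url);
    const data = await extractPage(url);
    products.push(...(data.products ?? []));

    const next = data.pagination?.next_url;
    // next_url may be relative, so resolve it against the page it came from.
    url = next ? new URL(next, url).href : null;
  }

  return products;
}

// Possible wiring, assuming the `snap` client and a `schema` object like the ones shown earlier:
// const all = await collectAllPages(
//   async (url) => (await snap.extract({ url, schema })).data,
//   'https://shop.example.com/category/electronics'
// );
```

The `seen` set guards against a page whose `next_url` points back at itself, and `maxPages` caps how many extractions a single run can consume.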
Common Extraction Use Cases
Job Listings
const jobs = await snap.extract({
  url: 'https://careers.example.com/engineering',
  schema: {
    listings: [{
      title: 'string',
      company: 'string',
      location: 'string',
      salary_range: 'string',
      remote: 'boolean',
      posted_date: 'string',
      apply_url: 'string',
    }],
  },
});
News Articles
const news = await snap.extract({
  url: 'https://news.example.com/technology',
  schema: {
    articles: [{
      headline: 'string',
      author: 'string',
      published_date: 'string',
      summary: 'string',
      category: 'string',
      read_time_minutes: 'number',
    }],
  },
});
Pricing Pages
const pricing = await snap.extract({
  url: 'https://saas.example.com/pricing',
  schema: {
    plans: [{
      name: 'string',
      price_monthly: 'number',
      price_annual: 'number',
      currency: 'string',
      features: ['string'],
      highlighted: 'boolean',
      cta_text: 'string',
    }],
  },
});
Contact Information
const contact = await snap.extract({
  url: 'https://company.example.com/about',
  schema: {
    company_name: 'string',
    description: 'string',
    email: 'string',
    phone: 'string',
    address: 'string',
    social_links: [{
      platform: 'string',
      url: 'string',
    }],
  },
});
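All of the schemas above use the same small notation: 'string', 'number', and 'boolean' leaves, a one-element array for lists, and plain objects for nesting. Since the notation is that regular, it is easy to check a response against the schema you sent before trusting it downstream. The following is a sketch under that assumption. `findSchemaProblems` is a hypothetical helper, not a SnapAPI function, and treating null or absent fields as "missing" is a choice you may want to relax.

```javascript
// Compares data against a schema written in the notation used above.
// Returns a list of human-readable problems; an empty list means the data matches.
function findSchemaProblems(schema, data, path = 'data') {
  if (data === undefined || data === null) {
    return [`${path}: missing`];
  }
  if (typeof schema === 'string') {
    // Leaf: the schema value names the expected typeof result.
    return typeof data === schema ? [] : [`${path}: expected ${schema}, got ${typeof data}`];
  }
  if (Array.isArray(schema)) {
    // List: every item must match the single item schema.
    if (!Array.isArray(data)) return [`${path}: expected array, got ${typeof data}`];
    return data.flatMap((item, i) => findSchemaProblems(schema[0], item, `${path}[${i}]`));
  }
  // Nested object: check each key the schema declares.
  if (typeof data !== 'object' || Array.isArray(data)) {
    return [`${path}: expected object, got ${Array.isArray(data) ? 'array' : typeof data}`];
  }
  return Object.keys(schema).flatMap((key) =>
    findSchemaProblems(schema[key], data[key], `${path}.${key}`)
  );
}

// Hand-written data standing in for a pricing-page response.
const problems = findSchemaProblems(
  { plans: [{ name: 'string', price_monthly: 'number', features: ['string'] }] },
  {
    plans: [
      { name: 'Pro', price_monthly: '29', features: ['SSO', 5] },
      { name: 'Free', price_monthly: 0 },
    ],
  }
);
console.log(problems);
// [
//   'data.plans[0].price_monthly: expected number, got string',
//   'data.plans[0].features[1]: expected string, got number',
//   'data.plans[1].features: missing'
// ]
```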
AI-Powered Page Analysis
Beyond structured extraction, SnapAPI's /v1/analyze endpoint uses AI to answer questions about a page, which is useful when you don't know the exact structure of the data you need:
// Ask questions about any page
const analysis = await snap.analyze({
  url: 'https://competitor.example.com/pricing',
  prompt: 'What are the pricing tiers, and which plan offers the best value for a startup with 10 team members?',
});

console.log(analysis.result);
// "The pricing has three tiers: Starter ($29/mo, 5 users),
// Growth ($79/mo, 25 users), Enterprise (custom). For a
// startup with 10 team members, Growth is the best value..."

// Competitive analysis
const competitive = await snap.analyze({
  url: 'https://competitor.example.com',
  prompt: 'Summarize this company\'s main product, target audience, key features, and any weaknesses or gaps in their offering.',
});
Building an Extraction Pipeline
Combine extraction with scheduling for automated data pipelines:
import SnapAPI from 'snapapi-js';
import { CronJob } from 'cron';
import fs from 'fs/promises';

const snap = new SnapAPI('sk_live_your_key');

class ExtractionPipeline {
  constructor(config) {
    this.config = config;
    this.results = [];
  }

  // Never throws: failures are recorded per source so one bad page does not stop the run
  async extract(url, schema) {
    try {
      const result = await snap.extract({ url, schema });
      return { url, data: result.data, timestamp: new Date().toISOString(), error: null };
    } catch (error) {
      return { url, data: null, timestamp: new Date().toISOString(), error: error.message };
    }
  }

  async run() {
    console.log(`Running pipeline: ${this.config.name}`);
    const results = [];
    for (const source of this.config.sources) {
      const result = await this.extract(source.url, source.schema);
      results.push(result);
      // Rate limit between requests
      await new Promise(r => setTimeout(r, 500));
    }
    // Save results (create the output directory if it does not exist yet)
    await fs.mkdir('data', { recursive: true });
    const filename = `data/${this.config.name}-${Date.now()}.json`;
    await fs.writeFile(filename, JSON.stringify(results, null, 2));
    console.log(`Saved ${results.length} results to ${filename}`);
    return results;
  }
}

// Usage: monitor competitor pricing daily
const pricingPipeline = new ExtractionPipeline({
  name: 'competitor-pricing',
  sources: [
    {
      url: 'https://competitor-a.com/pricing',
      schema: { plans: [{ name: 'string', price: 'number', features: ['string'] }] },
    },
    {
      url: 'https://competitor-b.com/pricing',
      schema: { plans: [{ name: 'string', price: 'number', features: ['string'] }] },
    },
  ],
});

// Run daily at 9 AM; log failures (e.g. a failed file write) instead of leaving the rejection unhandled
new CronJob('0 9 * * *', () => pricingPipeline.run().catch(console.error), null, true);
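Saving a JSON file per run is only half of pricing monitoring; the useful signal is what changed between two runs. A minimal sketch of that comparison follows. `diffPlans` is a hypothetical helper, and it assumes plans can be matched by `name` across runs, which may not hold if a competitor renames a tier or the extracted name varies slightly.

```javascript
// Compares the `plans` arrays from two runs for the same source.
// Plans are matched by name; each change records the old and/or new price.
function diffPlans(previous, current) {
  const before = new Map(previous.map((p) => [p.name, p.price]));
  const after = new Map(current.map((p) => [p.name, p.price]));
  const changes = [];

  for (const [name, price] of after) {
    if (!before.has(name)) {
      changes.push({ name, type: 'added', to: price });
    } else if (before.get(name) !== price) {
      changes.push({ name, type: 'changed', from: before.get(name), to: price });
    }
  }
  for (const [name, price] of before) {
    if (!after.has(name)) changes.push({ name, type: 'removed', from: price });
  }
  return changes;
}

// Hand-written plans standing in for yesterday's and today's results.
const priceChanges = diffPlans(
  [{ name: 'Starter', price: 29 }, { name: 'Growth', price: 79 }],
  [{ name: 'Starter', price: 29 }, { name: 'Growth', price: 99 }, { name: 'Scale', price: 199 }]
);
console.log(priceChanges);
// [
//   { name: 'Growth', type: 'changed', from: 79, to: 99 },
//   { name: 'Scale', type: 'added', to: 199 }
// ]
```

In the pipeline above, this could run at the end of `run()` against the most recent saved file for each source, with an alert sent only when the returned list is non-empty.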
Extraction Approach Comparison
| Approach | JS Rendering | Maintenance | Accuracy | Speed |
|---|---|---|---|---|
| CSS Selectors (Cheerio) | No | High — breaks on HTML changes | Exact (when working) | Very fast |
| XPath | No | High — same fragility | Exact (when working) | Very fast |
| Regex | No | Very high — brittle | Low | Fast |
| Playwright + selectors | Yes | High — plus browser infra | Exact (when working) | Slow (2-5s) |
| SnapAPI /extract | Yes | Zero — schema-based | High | Fast (1-3s) |
| SnapAPI /analyze | Yes | Zero — prompt-based | High (AI-powered) | Moderate (2-5s) |
Extract Data from Any Website — No Selectors Needed
Define a schema, get structured JSON. SnapAPI handles rendering, anti-bot detection, and parsing. Free tier includes 200 extractions/month.