How to Extract Structured Data from Any Website with an API (2026)

Extracting structured data from websites is one of the most common automation tasks. Whether you need product pricing, job listings, contact information, or news articles, the challenge is the same: HTML is messy, sites change without warning, and JavaScript rendering complicates everything. This guide covers the traditional approach with selectors, modern AI-powered extraction, and how SnapAPI's extract endpoint returns structured JSON from any page with a single API call.

Traditional Extraction with CSS Selectors

The classic approach: fetch HTML, parse it, and extract data using CSS selectors. This works well for consistent page structures, but it only sees content present in the initial HTML (nothing rendered by JavaScript) and breaks whenever the site updates its markup:

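import * as cheerio from 'cheerio';
import axios from 'axios';

async function extractProducts(url) {
  // Fetch the raw HTML; no JavaScript runs, so client-rendered content is absent
  const { data: html } = await axios.get(url);
  const $ = cheerio.load(html);

  // Resolve relative links and image paths against the page URL
  const absolute = (path) => (path ? new URL(path, url).href : null);

  return $('.product-card').map((i, el) => ({
    name: $(el).find('.product-name').text().trim(),
    // Strip currency symbols and separators; yields NaN if '.price' matches nothing
    price: parseFloat($(el).find('.price').text().replace(/[^0-9.]/g, '')),
    image: absolute($(el).find('img').attr('src')),
    rating: parseFloat($(el).find('[data-rating]').attr('data-rating')),
    url: absolute($(el).find('a').attr('href')),
  })).get();
}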

// Problem: selectors break when the site changes
// '.product-card' might become '.item-card' overnight
// '.price' might move to a different element
// You won't know until your pipeline silently returns empty data

The fragility problem is real — you need to monitor extraction results, handle missing fields, and rewrite selectors whenever the target site updates its markup. For a single site, this is manageable. For dozens, it becomes a maintenance nightmare.
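One way to catch these silent failures is to validate each batch before it enters the rest of the pipeline. The following is a minimal sketch, not part of any library: the helper name `checkExtraction`, the list of required fields, and the 20% threshold are all assumptions chosen for illustration.

```javascript
// Hypothetical helper: flags batches that look like the result of a broken selector.
// The default threshold (20% missing) is an assumption; tune it per site.
function checkExtraction(records, requiredFields, maxMissingRatio = 0.2) {
  const problems = [];

  if (records.length === 0) {
    problems.push('no records extracted (container selector may have changed)');
    return { ok: false, problems };
  }

  for (const field of requiredFields) {
    // A field counts as missing when it is undefined, null, '', or NaN.
    const missing = records.filter((r) => {
      const v = r[field];
      return v === undefined || v === null || v === '' || Number.isNaN(v);
    }).length;

    if (missing / records.length > maxMissingRatio) {
      problems.push(`${field}: missing in ${missing} of ${records.length} records`);
    }
  }

  return { ok: problems.length === 0, problems };
}

const sample = [
  { name: 'Widget', price: 9.99 },
  { name: 'Gadget', price: NaN }, // parseFloat('') is NaN when a price selector matches nothing
];
const report = checkExtraction(sample, ['name', 'price']);
console.log(report);
// { ok: false, problems: [ 'price: missing in 1 of 2 records' ] }
```

Running a check like this after every scrape turns an empty or half-empty result into an explicit failure that can trigger an alert.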

Schema-Based Extraction with SnapAPI

SnapAPI's /v1/extract endpoint takes a different approach. You define a schema — the shape of data you want — and SnapAPI figures out how to extract it from the page. No CSS selectors needed, and it works across different site structures:

import SnapAPI from 'snapapi-js';

const snap = new SnapAPI('sk_live_your_key');

// Extract product data — works on any e-commerce site
const result = await snap.extract({
  url: 'https://shop.example.com/category/electronics',
  schema: {
    products: [{
      name: 'string',
      price: 'number',
      currency: 'string',
      rating: 'number',
      review_count: 'number',
      in_stock: 'boolean',
      image_url: 'string',
    }],
    pagination: {
      current_page: 'number',
      total_pages: 'number',
      next_url: 'string',
    },
  },
});

console.log(result.data);
// {
//   products: [
//     { name: "MacBook Pro M4", price: 1999, currency: "USD", rating: 4.8, ... },
//     { name: "Dell XPS 15", price: 1299, currency: "USD", rating: 4.5, ... },
//   ],
//   pagination: { current_page: 1, total_pages: 12, next_url: "/category/electronics?page=2" }
// }
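The sample output includes a relative `next_url`, which suggests a simple way to walk a paginated category. The sketch below assumes the response shape shown above; `extractPage` is a hypothetical caller-supplied function standing in for the real extraction call, and the `maxPages` cap is an assumption added as a safety limit.

```javascript
// Resolve a possibly relative next_url against the page it came from.
function resolveNextUrl(currentUrl, nextUrl) {
  if (!nextUrl) return null;
  return new URL(nextUrl, currentUrl).href;
}

// extractPage is a hypothetical async function: it takes a URL and resolves to
// an object shaped like { products: [...], pagination: { next_url } }.
async function collectAllProducts(startUrl, extractPage, maxPages = 50) {
  const products = [];
  const seen = new Set(); // guards against pagination loops
  let url = startUrl;

  while (url && !seen.has(url) && seen.size < maxPages) {
    seen.add(url);
    const page = await extractPage(url);
    products.push(...(page.products ?? []));
    url = resolveNextUrl(url, page.pagination?.next_url);
  }
  return products;
}

console.log(resolveNextUrl(
  'https://shop.example.com/category/electronics',
  '/category/electronics?page=2'
));
// https://shop.example.com/category/electronics?page=2
```

With the client from the example above, `extractPage` could be something like `async (url) => (await snap.extract({ url, schema })).data`, where `schema` is the product schema already defined.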

Common Extraction Use Cases

Job Listings

const jobs = await snap.extract({
  url: 'https://careers.example.com/engineering',
  schema: {
    listings: [{
      title: 'string',
      company: 'string',
      location: 'string',
      salary_range: 'string',
      remote: 'boolean',
      posted_date: 'string',
      apply_url: 'string',
    }],
  },
});

News Articles

const news = await snap.extract({
  url: 'https://news.example.com/technology',
  schema: {
    articles: [{
      headline: 'string',
      author: 'string',
      published_date: 'string',
      summary: 'string',
      category: 'string',
      read_time_minutes: 'number',
    }],
  },
});

Pricing Pages

const pricing = await snap.extract({
  url: 'https://saas.example.com/pricing',
  schema: {
    plans: [{
      name: 'string',
      price_monthly: 'number',
      price_annual: 'number',
      currency: 'string',
      features: ['string'],
      highlighted: 'boolean',
      cta_text: 'string',
    }],
  },
});

Contact Information

const contact = await snap.extract({
  url: 'https://company.example.com/about',
  schema: {
    company_name: 'string',
    description: 'string',
    email: 'string',
    phone: 'string',
    address: 'string',
    social_links: [{
      platform: 'string',
      url: 'string',
    }],
  },
});
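Because the schemas above are plain objects built from type names, arrays, and nested objects, the same schema can be reused to check what comes back. This is a rough sketch under assumptions: the helper name `findMismatches` is hypothetical, and this guide does not say how the API represents fields it cannot find, so `null` and `undefined` are both treated as missing here.

```javascript
// Hypothetical checker: walks a schema written in the style used above and
// returns the paths where the data is missing or has the wrong type.
function findMismatches(schema, data, path = '') {
  if (typeof schema === 'string') {
    // Leaf: the schema is a type name such as 'string', 'number', or 'boolean'.
    if (data === null || data === undefined) return [`${path}: missing`];
    return typeof data === schema
      ? []
      : [`${path}: expected ${schema}, got ${typeof data}`];
  }
  if (Array.isArray(schema)) {
    // Array: every element must match the single item schema.
    if (!Array.isArray(data)) return [`${path}: expected array`];
    return data.flatMap((item, i) => findMismatches(schema[0], item, `${path}[${i}]`));
  }
  // Object: check each key declared in the schema.
  if (data === null || typeof data !== 'object') return [`${path}: expected object`];
  return Object.entries(schema).flatMap(([key, sub]) =>
    findMismatches(sub, data[key], path ? `${path}.${key}` : key)
  );
}

const contactSchema = {
  company_name: 'string',
  social_links: [{ platform: 'string', url: 'string' }],
};
const contactData = {
  company_name: 'Example Co',
  social_links: [{ platform: 'x', url: 42 }],
};
console.log(findMismatches(contactSchema, contactData));
// [ 'social_links[0].url: expected string, got number' ]
```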

AI-Powered Page Analysis

Beyond structured extraction, SnapAPI's /v1/analyze endpoint uses AI to answer questions about any page — useful when you don't know the exact structure of the data you need:

// Ask questions about any page
const analysis = await snap.analyze({
  url: 'https://competitor.example.com/pricing',
  prompt: 'What are the pricing tiers, and which plan offers the best value for a startup with 10 team members?',
});

console.log(analysis.result);
// "The pricing has three tiers: Starter ($29/mo, 5 users),
//  Growth ($79/mo, 25 users), Enterprise (custom). For a
//  startup with 10 team members, Growth is the best value..."

// Competitive analysis
const competitive = await snap.analyze({
  url: 'https://competitor.example.com',
  prompt: 'Summarize this company\'s main product, target audience, key features, and any weaknesses or gaps in their offering.',
});

Building an Extraction Pipeline

Combine extraction with scheduling for automated data pipelines:

import SnapAPI from 'snapapi-js';
import { CronJob } from 'cron';
import fs from 'fs/promises';

const snap = new SnapAPI('sk_live_your_key');

class ExtractionPipeline {
  constructor(config) {
    this.config = config;
    this.results = [];
  }

  async extract(url, schema) {
    try {
      const result = await snap.extract({ url, schema });
      return { url, data: result.data, timestamp: new Date().toISOString(), error: null };
    } catch (error) {
      return { url, data: null, timestamp: new Date().toISOString(), error: error.message };
    }
  }

  async run() {
    console.log(`Running pipeline: ${this.config.name}`);
    const results = [];

    for (const source of this.config.sources) {
      const result = await this.extract(source.url, source.schema);
      results.push(result);

      // Rate limit between requests
      await new Promise(r => setTimeout(r, 500));
    }

    // Save results (create the output directory on first run)
    await fs.mkdir('data', { recursive: true });
    const filename = `data/${this.config.name}-${Date.now()}.json`;
    await fs.writeFile(filename, JSON.stringify(results, null, 2));
    console.log(`Saved ${results.length} results to ${filename}`);

    return results;
  }
}

// Usage: monitor competitor pricing daily
const pricingPipeline = new ExtractionPipeline({
  name: 'competitor-pricing',
  sources: [
    {
      url: 'https://competitor-a.com/pricing',
      schema: { plans: [{ name: 'string', price: 'number', features: ['string'] }] },
    },
    {
      url: 'https://competitor-b.com/pricing',
      schema: { plans: [{ name: 'string', price: 'number', features: ['string'] }] },
    },
  ],
});

// Run daily at 9 AM
new CronJob('0 9 * * *', () => pricingPipeline.run(), null, true);
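Saving a snapshot per run only becomes monitoring once consecutive snapshots are compared. The sketch below shows one way to do that for the `plans` shape used in the pipeline's schemas; the helper name `diffPlans` and the change-record format are assumptions for illustration.

```javascript
// Hypothetical helper: compares two arrays of plans ({ name, price }) from
// consecutive runs and reports additions, removals, and price changes.
function diffPlans(previous, current) {
  const before = new Map(previous.map((p) => [p.name, p.price]));
  const after = new Map(current.map((p) => [p.name, p.price]));
  const changes = [];

  for (const [name, price] of after) {
    if (!before.has(name)) {
      changes.push({ name, type: 'added', price });
    } else if (before.get(name) !== price) {
      changes.push({ name, type: 'price_changed', from: before.get(name), to: price });
    }
  }
  for (const [name, price] of before) {
    if (!after.has(name)) changes.push({ name, type: 'removed', price });
  }
  return changes;
}

const yesterday = [
  { name: 'Starter', price: 29 },
  { name: 'Growth', price: 79 },
];
const today = [
  { name: 'Starter', price: 29 },
  { name: 'Growth', price: 89 },
  { name: 'Scale', price: 199 },
];
console.log(diffPlans(yesterday, today));
// [
//   { name: 'Growth', type: 'price_changed', from: 79, to: 89 },
//   { name: 'Scale', type: 'added', price: 199 }
// ]
```

Matching plans by name keeps the comparison simple, but a renamed plan shows up as one removal plus one addition, so a real monitor may need a more forgiving key.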

Extraction Approach Comparison

Approach	JS Rendering	Maintenance	Accuracy	Speed
CSS Selectors (Cheerio)	No	High — breaks on HTML changes	Exact (when working)	Very fast
XPath	No	High — same fragility	Exact (when working)	Very fast
Regex	No	Very high — brittle	Low	Fast
Playwright + selectors	Yes	High — plus browser infra	Exact (when working)	Slow (2-5s)
SnapAPI /extract	Yes	Zero — schema-based	High	Fast (1-3s)
SnapAPI /analyze	Yes	Zero — prompt-based	High (AI-powered)	Moderate (2-5s)

Extract Data from Any Website — No Selectors Needed

Define a schema, get structured JSON. SnapAPI handles rendering, anti-bot detection, and parsing. Free tier includes 200 extractions/month.
