How to Extract Structured Data from Any Website with an API (2026)

Published April 5, 2026 · 14 min read

Extracting structured data from websites is one of the most common automation tasks. Whether you need product pricing, job listings, contact information, or news articles, the challenge is the same: HTML is messy, sites change without warning, and JavaScript rendering complicates everything. This guide covers the traditional approach with selectors, modern AI-powered extraction, and how SnapAPI's extract endpoint returns structured JSON from a page with a single API call.

Traditional Extraction with CSS Selectors

The classic approach: fetch HTML, parse it, and extract data using CSS selectors. This works well for consistent page structures but breaks whenever the site updates its HTML:

import * as cheerio from 'cheerio';
import axios from 'axios';

async function extractProducts(url) {
  const { data: html } = await axios.get(url);
  const $ = cheerio.load(html);

  return $('.product-card').map((i, el) => ({
    name: $(el).find('.product-name').text().trim(),
    price: parseFloat($(el).find('.price').text().replace(/[^0-9.]/g, '')),
    image: $(el).find('img').attr('src'),
    rating: parseFloat($(el).find('[data-rating]').attr('data-rating')),
    url: $(el).find('a').attr('href'),
  })).get();
}

// Problem: selectors break when the site changes
// '.product-card' might become '.item-card' overnight
// '.price' might move to a different element
// You won't know until your pipeline silently returns empty data

The fragility problem is real — you need to monitor extraction results, handle missing fields, and rewrite selectors whenever the target site updates its markup. For a single site, this is manageable. For dozens, it becomes a maintenance nightmare.
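One way to catch the silent-failure case is to check every extraction run before its output reaches anything downstream. The sketch below is a minimal illustration, not part of Cheerio or any other library: `checkExtraction`, `isMissing`, the `requiredFields` list, and the 50% `maxMissingRatio` threshold are all assumptions made for this example.

```javascript
// Hypothetical health check for selector-based extraction results.
// A value counts as missing if it is undefined, null, an empty string, or NaN.
function isMissing(value) {
  return value === undefined || value === null || value === '' || Number.isNaN(value);
}

// Flags an empty result set, and any required field that is missing
// in more than maxMissingRatio of the records (assumed threshold: 50%).
function checkExtraction(records, requiredFields, maxMissingRatio = 0.5) {
  const problems = [];
  if (records.length === 0) {
    problems.push('no records extracted: the container selector may have changed');
    return { ok: false, problems };
  }
  for (const field of requiredFields) {
    const missing = records.filter((r) => isMissing(r[field])).length;
    if (missing / records.length > maxMissingRatio) {
      problems.push(`field "${field}" missing in ${missing}/${records.length} records`);
    }
  }
  return { ok: problems.length === 0, problems };
}

// Example: the '.price' selector stopped matching, so parseFloat('') produced NaN.
const report = checkExtraction(
  [
    { name: 'Widget', price: NaN },
    { name: 'Gadget', price: NaN },
  ],
  ['name', 'price'],
);
console.log(report); // report.ok is false and report.problems names the price field
```

A check like this could run right after `extractProducts` and raise an alert when `ok` is false, so a markup change is noticed on the first bad run instead of weeks later.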

Schema-Based Extraction with SnapAPI

SnapAPI's /v1/extract endpoint takes a different approach. You define a schema — the shape of data you want — and SnapAPI figures out how to extract it from the page. No CSS selectors needed, and it works across different site structures:

import SnapAPI from 'snapapi-js';

const snap = new SnapAPI('sk_live_your_key');

// Extract product data — works on any e-commerce site
const result = await snap.extract({
  url: 'https://shop.example.com/category/electronics',
  schema: {
    products: [{
      name: 'string',
      price: 'number',
      currency: 'string',
      rating: 'number',
      review_count: 'number',
      in_stock: 'boolean',
      image_url: 'string',
    }],
    pagination: {
      current_page: 'number',
      total_pages: 'number',
      next_url: 'string',
    },
  },
});

console.log(result.data);
// {
//   products: [
//     { name: "MacBook Pro M4", price: 1999, currency: "USD", rating: 4.8, ... },
//     { name: "Dell XPS 15", price: 1299, currency: "USD", rating: 4.5, ... },
//   ],
//   pagination: { current_page: 1, total_pages: 12, next_url: "/category/electronics?page=2" }
// }
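Note that `next_url` in the sample output is a relative path, so it has to be resolved against the current page before it can be requested. The sketch below shows one possible way to walk all pages using the standard `URL` class. `resolveNextUrl`, `extractAllPages`, and the `maxPages` cap are hypothetical helpers, and `extractPage` is a caller-supplied function assumed to return a result shaped like the sample above.

```javascript
// Resolve a possibly relative next_url against the page it was found on.
function resolveNextUrl(nextUrl, currentUrl) {
  if (!nextUrl) return null;
  return new URL(nextUrl, currentUrl).href;
}

// Follow pagination until there is no next page, a page repeats, or maxPages is reached.
// extractPage is injected by the caller; it is assumed to resolve to
// { data: { products: [...], pagination: { next_url } } } as in the sample output.
async function extractAllPages(startUrl, extractPage, maxPages = 20) {
  const products = [];
  const seen = new Set();
  let url = startUrl;

  while (url && !seen.has(url) && seen.size < maxPages) {
    seen.add(url);
    const { data } = await extractPage(url);
    products.push(...(data.products ?? []));
    url = resolveNextUrl(data.pagination?.next_url, url);
  }
  return products;
}

console.log(
  resolveNextUrl('/category/electronics?page=2', 'https://shop.example.com/category/electronics'),
);
// prints "https://shop.example.com/category/electronics?page=2"
```

With the client from the example above, the call might look like `extractAllPages(startUrl, (url) => snap.extract({ url, schema }))`. The `seen` set and the page cap guard against sites whose last page links back to itself.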

Common Extraction Use Cases

Job Listings

const jobs = await snap.extract({
  url: 'https://careers.example.com/engineering',
  schema: {
    listings: [{
      title: 'string',
      company: 'string',
      location: 'string',
      salary_range: 'string',
      remote: 'boolean',
      posted_date: 'string',
      apply_url: 'string',
    }],
  },
});

News Articles

const news = await snap.extract({
  url: 'https://news.example.com/technology',
  schema: {
    articles: [{
      headline: 'string',
      author: 'string',
      published_date: 'string',
      summary: 'string',
      category: 'string',
      read_time_minutes: 'number',
    }],
  },
});

Pricing Pages

const pricing = await snap.extract({
  url: 'https://saas.example.com/pricing',
  schema: {
    plans: [{
      name: 'string',
      price_monthly: 'number',
      price_annual: 'number',
      currency: 'string',
      features: ['string'],
      highlighted: 'boolean',
      cta_text: 'string',
    }],
  },
});

Contact Information

const contact = await snap.extract({
  url: 'https://company.example.com/about',
  schema: {
    company_name: 'string',
    description: 'string',
    email: 'string',
    phone: 'string',
    address: 'string',
    social_links: [{
      platform: 'string',
      url: 'string',
    }],
  },
});
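All four schemas use the same notation: type names as leaves, a one-element array as a template for lists, and nested objects. Because any automated extraction can return a field with the wrong type or leave it out, it may be worth checking results against the schema before storing them. The validator below is a minimal sketch written for this notation only. `findSchemaMismatches` and `describeType` are hypothetical helpers, not part of the SnapAPI client.

```javascript
// Human-readable type label used in mismatch messages.
function describeType(value) {
  if (value === null) return 'null';
  if (Array.isArray(value)) return 'array';
  if (Number.isNaN(value)) return 'NaN';
  return typeof value;
}

// Hypothetical validator for the schema notation used in this article:
// 'string' | 'number' | 'boolean' leaves, [template] for arrays, plain objects for nesting.
// Returns a list of mismatch messages; an empty list means the data conforms.
function findSchemaMismatches(value, schema, path = '$') {
  if (typeof schema === 'string') {
    const ok = typeof value === schema && !Number.isNaN(value);
    return ok ? [] : [`${path}: expected ${schema}, got ${describeType(value)}`];
  }
  if (Array.isArray(schema)) {
    if (!Array.isArray(value)) return [`${path}: expected array, got ${describeType(value)}`];
    return value.flatMap((item, i) => findSchemaMismatches(item, schema[0], `${path}[${i}]`));
  }
  if (value === null || typeof value !== 'object' || Array.isArray(value)) {
    return [`${path}: expected object, got ${describeType(value)}`];
  }
  return Object.entries(schema).flatMap(([key, sub]) =>
    findSchemaMismatches(value[key], sub, `${path}.${key}`),
  );
}

// Example: a price that came back as a string instead of a number.
const planSchema = { plans: [{ name: 'string', price_monthly: 'number', features: ['string'] }] };
const planData = { plans: [{ name: 'Starter', price_monthly: '29', features: ['5 users'] }] };
console.log(findSchemaMismatches(planData, planSchema));
// prints [ '$.plans[0].price_monthly: expected number, got string' ]
```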

AI-Powered Page Analysis

Beyond structured extraction, SnapAPI's /v1/analyze endpoint uses AI to answer questions about any page — useful when you don't know the exact structure of the data you need:

// Ask questions about any page
const analysis = await snap.analyze({
  url: 'https://competitor.example.com/pricing',
  prompt: 'What are the pricing tiers, and which plan offers the best value for a startup with 10 team members?',
});

console.log(analysis.result);
// "The pricing has three tiers: Starter ($29/mo, 5 users),
//  Growth ($79/mo, 25 users), Enterprise (custom). For a
//  startup with 10 team members, Growth is the best value..."

// Competitive analysis
const competitive = await snap.analyze({
  url: 'https://competitor.example.com',
  prompt: 'Summarize this company\'s main product, target audience, key features, and any weaknesses or gaps in their offering.',
});

Building an Extraction Pipeline

Combine extraction with scheduling for automated data pipelines:

import SnapAPI from 'snapapi-js';
import { CronJob } from 'cron';
import fs from 'fs/promises';

const snap = new SnapAPI('sk_live_your_key');

class ExtractionPipeline {
  constructor(config) {
    this.config = config;
    this.results = [];
  }

  async extract(url, schema) {
    try {
      const result = await snap.extract({ url, schema });
      return { url, data: result.data, timestamp: new Date().toISOString(), error: null };
    } catch (error) {
      return { url, data: null, timestamp: new Date().toISOString(), error: error.message };
    }
  }

  async run() {
    console.log(`Running pipeline: ${this.config.name}`);
    const results = [];

    for (const source of this.config.sources) {
      const result = await this.extract(source.url, source.schema);
      results.push(result);

      // Rate limit between requests
      await new Promise(r => setTimeout(r, 500));
    }

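    // Save results (create the output directory first so the initial run does not fail)
    await fs.mkdir('data', { recursive: true });
    const filename = `data/${this.config.name}-${Date.now()}.json`;
    await fs.writeFile(filename, JSON.stringify(results, null, 2));
    const failed = results.filter((r) => r.error).length;
    console.log(`Saved ${results.length} results (${failed} failed) to ${filename}`);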

    return results;
  }
}

// Usage: monitor competitor pricing daily
const pricingPipeline = new ExtractionPipeline({
  name: 'competitor-pricing',
  sources: [
    {
      url: 'https://competitor-a.com/pricing',
      schema: { plans: [{ name: 'string', price: 'number', features: ['string'] }] },
    },
    {
      url: 'https://competitor-b.com/pricing',
      schema: { plans: [{ name: 'string', price: 'number', features: ['string'] }] },
    },
  ],
});

// Run daily at 9 AM
// Log failures instead of leaving a rejected promise unhandled
new CronJob('0 9 * * *', () => pricingPipeline.run().catch(console.error), null, true);
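Saving a snapshot every day is only half of price monitoring; the useful signal is what changed between two runs. The sketch below compares the `plans` arrays from two snapshots by plan name. `diffPlans` and the change record shape are assumptions for illustration, and matching by name assumes plan names stay stable between runs.

```javascript
// Hypothetical change detector: compares the plans from two pipeline runs by plan name.
function diffPlans(previousPlans, currentPlans) {
  const before = new Map(previousPlans.map((p) => [p.name, p.price]));
  const after = new Map(currentPlans.map((p) => [p.name, p.price]));
  const changes = [];

  // Plans that are new or whose price differs from the previous run.
  for (const [name, price] of after) {
    if (!before.has(name)) {
      changes.push({ name, type: 'added', price });
    } else if (before.get(name) !== price) {
      changes.push({ name, type: 'price_changed', from: before.get(name), to: price });
    }
  }
  // Plans that disappeared since the previous run.
  for (const [name, price] of before) {
    if (!after.has(name)) changes.push({ name, type: 'removed', price });
  }
  return changes;
}

const yesterday = [
  { name: 'Starter', price: 29 },
  { name: 'Growth', price: 79 },
];
const today = [
  { name: 'Starter', price: 29 },
  { name: 'Growth', price: 99 },
  { name: 'Scale', price: 199 },
];
console.log(diffPlans(yesterday, today));
// prints:
// [
//   { name: 'Growth', type: 'price_changed', from: 79, to: 99 },
//   { name: 'Scale', type: 'added', price: 199 }
// ]
```

In the pipeline above, this could run after each save by loading the previous JSON file for the same source and passing both `plans` arrays in.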

Extraction Approach Comparison

Approach | JS Rendering | Maintenance | Accuracy | Speed
--- | --- | --- | --- | ---
CSS Selectors (Cheerio) | No | High (breaks on HTML changes) | Exact (when working) | Very fast
XPath | No | High (same fragility) | Exact (when working) | Very fast
Regex | No | Very high (brittle) | Low | Fast
Playwright + selectors | Yes | High (plus browser infra) | Exact (when working) | Slow (2-5s)
SnapAPI /extract | Yes | Zero (schema-based) | High | Fast (1-3s)
SnapAPI /analyze | Yes | Zero (prompt-based) | High (AI-powered) | Moderate (2-5s)

Extract Data from Any Website — No Selectors Needed

Define a schema, get structured JSON. SnapAPI handles rendering, anti-bot detection, and parsing. Free tier includes 200 extractions/month.
