How to Extract Data from a Website (2026): CSS Selectors, AI Extraction & APIs

Extraction Methods Compared

Method	Best for	Maintenance	Handles JS?
CSS selectors (Cheerio)	Static HTML, predictable structure	High (breaks on redesign)	No
XPath	Complex DOM traversal	High	No
Playwright evaluate()	SPAs, dynamic content	Medium	Yes
JSON-LD / microdata	Schema.org-tagged pages	Low	No
AI schema extraction	Any page, no selectors needed	Very low	Yes

CSS Selectors with Cheerio

import * as cheerio from 'cheerio';
import fetch from 'node-fetch';

const res = await fetch('https://example.com/product');
const $ = cheerio.load(await res.text());

const product = {
  title:       $('h1.product-title').text().trim(),
  price:       $('.price-current').text().trim(),
  rating:      parseFloat($('[itemprop="ratingValue"]').attr('content') || '0'),
  description: $('.product-description').text().trim(),
  images:      $('img.product-image').map((_, el) => $(el).attr('src')).get(),
  variants:    $('.variant-option').map((_, el) => ({
    name:  $(el).find('.variant-name').text().trim(),
    price: $(el).find('.variant-price').text().trim(),
    sku:   $(el).data('sku'),
  })).get(),
};
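
The price field above comes back as display text such as "$1,299.00". A minimal sketch of normalising it to a number follows. parsePrice is a hypothetical helper, not part of Cheerio, and it assumes US-style formatting (comma thousands separators, dot decimal), so other locales would need different rules.

```javascript
// Hypothetical helper: converts display text such as "$1,299.00" to a number.
// Assumes US-style formatting; returns null when no number can be found.
function parsePrice(text) {
  if (typeof text !== 'string') return null;
  // Take the first run of digits, optionally with commas and a decimal part
  const match = text.match(/\d[\d,]*(?:\.\d+)?/);
  if (!match) return null;
  const value = Number(match[0].replace(/,/g, ''));
  return Number.isFinite(value) ? value : null;
}
```

Calling parsePrice(product.price) after the Cheerio step keeps the selector code and the number cleaning separate, so a redesign only affects one of them.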

Playwright evaluate() for SPAs

import { chromium } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com/product', { waitUntil: 'networkidle' });

// Extract structured data inside the browser context
const product = await page.evaluate(() => {
  const getText = (sel) => document.querySelector(sel)?.textContent?.trim() ?? null;
  const getAttr = (sel, attr) => document.querySelector(sel)?.getAttribute(attr) ?? null;

  return {
    title:       getText('h1'),
    price:       getText('[data-price]') || getAttr('[data-price]', 'data-price'),
    inStock:     !document.querySelector('.out-of-stock'),
    images:      [...document.querySelectorAll('img.gallery-img')].map(el => el.src),
    breadcrumbs: [...document.querySelectorAll('nav.breadcrumb a')].map(el => el.textContent.trim()),
  };
});

await browser.close();

JSON-LD: Low-Maintenance Extraction

Many e-commerce and news sites embed structured data in application/ld+json script tags. Because this data is separate from the visual layout, it usually survives redesigns that break CSS selectors:

import * as cheerio from 'cheerio';

async function extractJsonLd(url) {
  const res = await fetch(url);
  const $ = cheerio.load(await res.text());

  const schemas = [];
  $('script[type="application/ld+json"]').each((_, el) => {
    try {
      schemas.push(JSON.parse($(el).html()));
    } catch {}
  });

  // Find Product schema
  const product = schemas.find(s => s['@type'] === 'Product');
  if (!product) return null;

  return {
    name:        product.name,
    description: product.description,
    price:       product.offers?.price,
    currency:    product.offers?.priceCurrency,
    availability: product.offers?.availability,
    rating:      product.aggregateRating?.ratingValue,
    reviewCount: product.aggregateRating?.reviewCount,
  };
}
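
In practice, JSON-LD does not always arrive as a single top-level Product object. Sites commonly wrap nodes in an "@graph" array, emit a top-level array, give "@type" as a list, or list several offers. The sketch below shows one way to cope with those shapes. The helper names (flattenJsonLd, hasType, findByType, firstOffer) are hypothetical, and taking the first offer is an assumption that may not suit every catalogue.

```javascript
// Flattens parsed JSON-LD blocks into a single list of nodes.
// Handles three common shapes: a single object, a top-level array,
// and an object that wraps its nodes in "@graph".
function flattenJsonLd(blocks) {
  const nodes = [];
  for (const block of blocks) {
    const items = Array.isArray(block) ? block : [block];
    for (const item of items) {
      if (!item || typeof item !== 'object') continue;
      if (Array.isArray(item['@graph'])) nodes.push(...item['@graph']);
      else nodes.push(item);
    }
  }
  return nodes;
}

// "@type" may be a string or an array of strings.
function hasType(node, type) {
  const t = node && node['@type'];
  return Array.isArray(t) ? t.includes(type) : t === type;
}

// Returns the first node of the given type, or null if there is none.
function findByType(blocks, type) {
  return flattenJsonLd(blocks).find(node => hasType(node, type)) ?? null;
}

// "offers" may be a single Offer or an array of them; this takes the first.
function firstOffer(product) {
  const offers = product?.offers;
  return Array.isArray(offers) ? (offers[0] ?? null) : (offers ?? null);
}
```

Inside extractJsonLd, the lookup could then become findByType(schemas, 'Product'), with price and currency read from firstOffer(product).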

Schema-Based Extraction with SnapAPI

SnapAPI's /v1/extract endpoint accepts a URL and a JSON schema and returns structured, typed data, so you don't have to write any browser automation code. It suits production pipelines where reliability and simplicity matter.

// Node.js: schema-based extraction
const axios = require('axios');

async function extractProduct(url) {
  const res = await axios.post('https://api.snapapi.pics/v1/extract', {
    url,
    schema: {
      title:         { type: 'string',  description: 'Product title' },
      price:         { type: 'number',  description: 'Current price in USD' },
      originalPrice: { type: 'number',  description: 'Original list price if discounted' },
      inStock:       { type: 'boolean', description: 'Whether the item is available' },
      rating:        { type: 'number',  description: 'Average star rating 0–5' },
      reviewCount:   { type: 'number',  description: 'Total number of reviews' },
      images:        { type: 'array', items: { type: 'string' }, description: 'Product image URLs' },
      variants: {
        type: 'array',
        items: {
          type: 'object',
          properties: {
            name:      { type: 'string' },
            value:     { type: 'string' },
            available: { type: 'boolean' }
          }
        }
      }
    },
    stealth: true // bypass bot protection on retailer pages
  }, {
    headers: { 'X-Api-Key': process.env.SNAPAPI_KEY }
  });
  return res.data.data; // fully typed JSON matching your schema
}

// Batch extract: run multiple URLs concurrently.
// Failed URLs are reported per item instead of rejecting the whole batch.
async function batchExtract(urls) {
  const results = await Promise.allSettled(
    urls.map(url => extractProduct(url))
  );
  return results.map((r, i) => ({
    url: urls[i],
    data: r.status === 'fulfilled' ? r.value : null,
    error: r.status === 'rejected' ? r.reason.message : null
  }));
}

The response res.data.data is guaranteed to match the schema types — numbers are numbers, arrays are arrays. No more parseFloat(price.replace('$','')).
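
Even with that guarantee, a pipeline may want a cheap runtime check before writing results to a database. The following is a minimal sketch, assuming the response mirrors the schema shape used above ({ field: { type, items?, properties? } }). checkTypes and checkValue are hypothetical helpers, not part of SnapAPI or axios, and missing fields are deliberately treated as acceptable.

```javascript
// Returns a list of mismatches between `data` and a schema shaped like
// { field: { type, items?, properties? } }.
// An empty list means every field that is present has the declared type.
function checkTypes(data, schema, path = '') {
  const errors = [];
  for (const [key, spec] of Object.entries(schema)) {
    const value = data?.[key];
    const where = path ? `${path}.${key}` : key;
    if (value === null || value === undefined) continue; // missing is allowed here
    errors.push(...checkValue(value, spec, where));
  }
  return errors;
}

function checkValue(value, spec, where) {
  if (spec.type === 'array') {
    if (!Array.isArray(value)) return [`${where}: expected array`];
    if (!spec.items) return [];
    return value.flatMap((v, i) => checkValue(v, spec.items, `${where}[${i}]`));
  }
  if (spec.type === 'object') {
    if (value === null || typeof value !== 'object' || Array.isArray(value)) {
      return [`${where}: expected object`];
    }
    return spec.properties ? checkTypes(value, spec.properties, where) : [];
  }
  // Primitive types: 'string', 'number', 'boolean'
  if (typeof value !== spec.type) {
    return [`${where}: expected ${spec.type}, got ${typeof value}`];
  }
  if (spec.type === 'number' && Number.isNaN(value)) return [`${where}: NaN`];
  return [];
}
```

A caller could log the returned messages and skip the record when the list is non-empty, rather than letting a stray string price reach downstream code.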

Python example

import asyncio
import os

import httpx

SNAPAPI_KEY = os.environ['SNAPAPI_KEY']
HEADERS = {'X-Api-Key': SNAPAPI_KEY}

PRODUCT_SCHEMA = {
    'title':        {'type': 'string'},
    'price':        {'type': 'number'},
    'in_stock':     {'type': 'boolean'},
    'rating':       {'type': 'number'},
    'review_count': {'type': 'number'},
    'images':       {'type': 'array', 'items': {'type': 'string'}},
}

async def extract(session: httpx.AsyncClient, url: str) -> dict:
    r = await session.post(
        'https://api.snapapi.pics/v1/extract',
        json={'url': url, 'schema': PRODUCT_SCHEMA, 'stealth': True},
        headers=HEADERS,
        timeout=30,
    )
    r.raise_for_status()
    return r.json()['data']

async def batch_extract(urls: list[str]) -> list:
    # With return_exceptions=True, each item is either the extracted dict
    # or the exception raised for that URL.
    async with httpx.AsyncClient() as session:
        tasks = [extract(session, u) for u in urls]
        return await asyncio.gather(*tasks, return_exceptions=True)

if __name__ == '__main__':
    urls = [
        'https://example-shop.com/products/widget-a',
        'https://example-shop.com/products/widget-b',
    ]
    results = asyncio.run(batch_extract(urls))
    for url, data in zip(urls, results):
        print(url, data)
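
Both batch helpers above start every request at once, which can run into rate limits on long URL lists. The following Node.js sketch shows a bounded-concurrency alternative. mapWithLimit is a hypothetical helper, and the limit of 5 in the usage note is an arbitrary assumption, not a documented SnapAPI quota.

```javascript
// Runs `worker` over `items` with at most `limit` calls in flight at once.
// Results come back in input order, shaped like Promise.allSettled entries.
async function mapWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  async function run() {
    while (next < items.length) {
      const i = next++; // claim an index; no await happens between check and claim
      try {
        results[i] = { status: 'fulfilled', value: await worker(items[i], i) };
      } catch (reason) {
        results[i] = { status: 'rejected', reason };
      }
    }
  }

  // Start a fixed pool of runners that pull work until the list is exhausted
  const size = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: size }, run));
  return results;
}
```

With the extractProduct function shown earlier, a call might look like mapWithLimit(urls, 5, url => extractProduct(url)).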

Extracting Tables from Web Pages

HTML tables are common in finance sites, sports stats, and government data. Here's a robust utility using Cheerio that converts any <table> to an array of objects:
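
A minimal sketch of such a utility, under stated assumptions: the `$` argument is the object returned by cheerio.load(html), the first row containing <th> cells (or else the first row) is the header, and colspan, rowspan, and nested tables are not handled specially. The function names tableToObjects and rowsToObjects are illustrative.

```javascript
// Pure step: pairs each row's cells with the header names.
// Missing cells become null; blank or missing headers fall back to "column_N".
function rowsToObjects(headers, rows) {
  return rows.map(cells => {
    const record = {};
    const width = Math.max(headers.length, cells.length);
    for (let i = 0; i < width; i++) {
      const key = headers[i] || `column_${i + 1}`;
      record[key] = cells[i] ?? null;
    }
    return record;
  });
}

// Cheerio step: `$` is the object returned by cheerio.load(html).
// Reads the first table matching `selector` and returns one object per row.
function tableToObjects($, selector = 'table') {
  const rows = $(selector).first().find('tr').toArray();
  if (rows.length === 0) return [];

  // Trimmed text of the direct cell children of a row
  const cellText = (row) =>
    $(row).children('th, td').map((_, el) => $(el).text().trim()).get();

  // Prefer the first row that has <th> cells as the header row
  let headerIndex = rows.findIndex(row => $(row).children('th').length > 0);
  if (headerIndex === -1) headerIndex = 0;

  const headers = cellText(rows[headerIndex]);
  const body = rows
    .filter((_, i) => i !== headerIndex)
    .map(cellText)
    .filter(cells => cells.length > 0);
  return rowsToObjects(headers, body);
}
```

Typical use would be const $ = cheerio.load(html) followed by tableToObjects($, 'table.stats'), where 'table.stats' is a placeholder selector. Keeping the row-pairing logic in a pure function makes it easy to unit test without fetching a page.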

Choosing the Right Approach

No single extraction method works for every site. Use this decision guide:

Static HTML + Cheerio — News sites, blogs, product pages without login. Fast, cheap, scalable.

Playwright evaluate() — SPAs (React/Vue/Angular), infinite scroll, pages requiring interaction before content loads.

JSON-LD / Microdata — E-commerce and recipes. When structured data is present it's the most reliable source — it doesn't break when layout changes.

SnapAPI /extract schema — When you want typed, structured output without writing browser code. Best for pipelines across many domains.

Pro tip: Layer your approach — check for JSON-LD first, fall back to CSS selectors, and use SnapAPI /extract as the last resort for protected or JS-heavy pages. This maximises speed while keeping costs low.
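
The layering described in the tip can be sketched as a small fallback chain. This is an illustration under assumptions, not a prescribed design: extractLayered is a hypothetical helper, each extractor is expected to resolve to an object or null, and an empty object is treated as "nothing found".

```javascript
// Tries each extractor in order and returns the first non-empty result.
// Each extractor is { name, run }, where run(url) resolves to an object or null.
async function extractLayered(url, extractors) {
  const errors = [];
  for (const { name, run } of extractors) {
    try {
      const data = await run(url);
      if (data && Object.keys(data).length > 0) {
        return { source: name, data, errors };
      }
    } catch (err) {
      // Record the failure and fall through to the next, costlier method
      errors.push({ source: name, message: err.message });
    }
  }
  return { source: null, data: null, errors };
}
```

With the functions from this article, the list might be [{ name: 'json-ld', run: extractJsonLd }, { name: 'css', run: extractWithCheerio }, { name: 'snapapi', run: extractProduct }], where extractWithCheerio is a hypothetical wrapper around the Cheerio selector code. Returning the source name alongside the data makes it easy to see how often each layer is actually used.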
