How to Extract Data from a Website (2026)

From CSS selectors with Cheerio to Playwright evaluate, JSON-LD parsing, and AI-powered schema extraction — pick the right method for your use case.

Data Extraction · Cheerio · Playwright · AI Extraction · SnapAPI · April 2026

Extraction Methods Compared

| Method | Best for | Maintenance | Handles JS? |
| --- | --- | --- | --- |
| CSS selectors (Cheerio) | Static HTML, predictable structure | High (breaks on redesign) | No |
| XPath | Complex DOM traversal | High | No |
| Playwright evaluate() | SPAs, dynamic content | Medium | Yes |
| JSON-LD / microdata | Schema.org-tagged pages | Low | No |
| AI schema extraction | Any page, no selectors needed | Very low | Yes |

CSS Selectors with Cheerio

import * as cheerio from 'cheerio';
// fetch is global in Node 18+; on older versions, add: import fetch from 'node-fetch';

const res = await fetch('https://example.com/product');
const $ = cheerio.load(await res.text());

const product = {
  title:       $('h1.product-title').text().trim(),
  price:       $('.price-current').text().trim(),
  rating:      parseFloat($('[itemprop="ratingValue"]').attr('content') || '0'),
  description: $('.product-description').text().trim(),
  images:      $('img.product-image').map((_, el) => $(el).attr('src')).get(),
  variants:    $('.variant-option').map((_, el) => ({
    name:  $(el).find('.variant-name').text().trim(),
    price: $(el).find('.variant-price').text().trim(),
    sku:   $(el).data('sku'),
  })).get(),
};
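
The selectors above return price and variant price as raw text such as "$1,299.99". Below is a minimal sketch of turning that text into a number. The parsePrice helper is hypothetical (it is not part of Cheerio), and it assumes US-style formatting where commas group thousands and a dot marks decimals.

```javascript
// Hypothetical helper: convert scraped price text into a number.
// Assumption: US-style formatting ("," groups thousands, "." marks decimals).
function parsePrice(text) {
  if (typeof text !== 'string') return null;
  // Take the first run that looks like a number, e.g. "1,299.99" in "$1,299.99 USD"
  const match = text.match(/-?\d[\d,]*(?:\.\d+)?/);
  if (!match) return null;
  const value = Number(match[0].replace(/,/g, ''));
  return Number.isFinite(value) ? value : null;
}

console.log(parsePrice('$1,299.99')); // 1299.99
console.log(parsePrice('Sold out'));  // null
```

Returning null instead of 0 for unparseable text keeps "no price found" distinguishable from a genuinely free item.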

Playwright evaluate() for SPAs

import { chromium } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com/product', { waitUntil: 'networkidle' });

// Extract structured data inside the browser context
const product = await page.evaluate(() => {
  const getText = (sel) => document.querySelector(sel)?.textContent?.trim() ?? null;
  const getAttr = (sel, attr) => document.querySelector(sel)?.getAttribute(attr) ?? null;

  return {
    title:       getText('h1'),
    price:       getText('[data-price]') || getAttr('[data-price]', 'data-price'),
    inStock:     !document.querySelector('.out-of-stock'),
    images:      [...document.querySelectorAll('img.gallery-img')].map(el => el.src),
    breadcrumbs: [...document.querySelectorAll('nav.breadcrumb a')].map(el => el.textContent.trim()),
  };
});

await browser.close();

JSON-LD: Low-Maintenance Extraction

Many ecommerce and news sites embed application/ld+json structured data. Because it lives apart from the visual markup, it usually survives layout changes that would break CSS selectors:

import * as cheerio from 'cheerio';

async function extractJsonLd(url) {
  const res = await fetch(url);
  const $ = cheerio.load(await res.text());

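  const schemas = [];
  $('script[type="application/ld+json"]').each((_, el) => {
    try {
      const parsed = JSON.parse($(el).html());
      // A block may hold one object, an array of objects, or an @graph wrapper
      const items = Array.isArray(parsed)
        ? parsed
        : Array.isArray(parsed?.['@graph']) ? parsed['@graph'] : [parsed];
      schemas.push(...items);
    } catch {
      // Skip blocks that are not valid JSON
    }
  });

  // Find the Product schema (@type may be a string or an array of strings)
  const product = schemas.find(s => [].concat(s?.['@type'] ?? []).includes('Product'));
  if (!product) return null;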

  return {
    name:        product.name,
    description: product.description,
    price:       product.offers?.price,
    currency:    product.offers?.priceCurrency,
    availability: product.offers?.availability,
    rating:      product.aggregateRating?.ratingValue,
    reviewCount: product.aggregateRating?.reviewCount,
  };
}
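
The offer fields returned above are still raw schema.org values: price is often a string, availability is typically a URL such as https://schema.org/InStock, and offers may be a list rather than a single object. The following is a rough sketch of normalising them. normalizeOffer is a hypothetical helper, and taking the first offer from a list is an assumption that may not suit pages with several sellers.

```javascript
// Hypothetical helper, not part of Cheerio or any schema.org tooling.
function normalizeOffer(offers) {
  // schema.org allows `offers` to be a single Offer or a list of them;
  // assumption: the first entry is the one we care about.
  const offer = Array.isArray(offers) ? offers[0] : offers;
  if (!offer || typeof offer !== 'object') return null;

  // "https://schema.org/InStock" -> "InStock"
  const availability = typeof offer.availability === 'string'
    ? offer.availability.split('/').pop()
    : null;

  const hasPrice = offer.price != null && offer.price !== '';
  const price = Number(offer.price);

  return {
    price: hasPrice && Number.isFinite(price) ? price : null,
    currency: offer.priceCurrency ?? null,
    availability,
    inStock: availability === 'InStock',
  };
}
```

With this in place, extractJsonLd could return normalizeOffer(product.offers) alongside the name and description instead of reading product.offers?.price directly.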

Schema-Based Extraction with SnapAPI

SnapAPI's /v1/extract endpoint accepts a URL and a JSON schema and returns structured, typed data, so you don't have to write any browser automation code. It is aimed at production pipelines where reliability and simplicity matter.

// Node.js — schema-based extraction
const axios = require('axios');

async function extractProduct(url) {
  const res = await axios.post('https://api.snapapi.pics/v1/extract', {
    url,
    schema: {
      title:         { type: 'string',  description: 'Product title' },
      price:         { type: 'number',  description: 'Current price in USD' },
      originalPrice: { type: 'number',  description: 'Original list price if discounted' },
      inStock:       { type: 'boolean', description: 'Whether the item is available' },
      rating:        { type: 'number',  description: 'Average star rating 0–5' },
      reviewCount:   { type: 'number',  description: 'Total number of reviews' },
      images:        { type: 'array', items: { type: 'string' }, description: 'Product image URLs' },
      variants: {
        type: 'array',
        items: {
          type: 'object',
          properties: {
            name:      { type: 'string' },
            value:     { type: 'string' },
            available: { type: 'boolean' }
          }
        }
      }
    },
    stealth: true   // bypass bot protection on retailer pages
  }, { headers: { 'X-Api-Key': process.env.SNAPAPI_KEY } });

  return res.data.data; // fully typed JSON matching your schema
}

// Batch extract: run multiple URLs concurrently (all requests start at once)
async function batchExtract(urls) {
  const results = await Promise.allSettled(
    urls.map(url => extractProduct(url))
  );
  return results.map((r, i) => ({
    url: urls[i],
    data: r.status === 'fulfilled' ? r.value : null,
    error: r.status === 'rejected' ? r.reason.message : null
  }));
}

The response res.data.data is guaranteed to match the schema types — numbers are numbers, arrays are arrays. No more parseFloat(price.replace('$','')).
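
One caveat with the batchExtract helper above: Promise.allSettled starts every request at once, which can run into rate limits on long URL lists. Here is a minimal sketch of capping the number of requests in flight. mapWithConcurrency is a hypothetical helper and the limit value is something you would tune yourself; nothing here is a documented SnapAPI requirement.

```javascript
// Hypothetical helper: run `worker` over `items` with at most `limit` calls in flight.
// Results keep input order; failures are captured per item instead of rejecting the batch.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  async function runner() {
    while (next < items.length) {
      const i = next++; // claim an index; no await between check and claim, so no races
      try {
        results[i] = { item: items[i], data: await worker(items[i]), error: null };
      } catch (err) {
        results[i] = { item: items[i], data: null, error: err?.message ?? String(err) };
      }
    }
  }

  // Start up to `limit` runners; each pulls the next unclaimed item until none remain
  const count = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: count }, () => runner()));
  return results;
}
```

Usage would look like mapWithConcurrency(urls, 5, extractProduct), which yields the same url/data/error shape as batchExtract with at most five requests open at a time.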

Python example

import httpx, asyncio, os

SNAPAPI_KEY = os.environ['SNAPAPI_KEY']
HEADERS = {'X-Api-Key': SNAPAPI_KEY}

PRODUCT_SCHEMA = {
    'title':       {'type': 'string'},
    'price':       {'type': 'number'},
    'in_stock':    {'type': 'boolean'},
    'rating':      {'type': 'number'},
    'review_count':{'type': 'number'},
    'images':      {'type': 'array', 'items': {'type': 'string'}},
}

async def extract(session: httpx.AsyncClient, url: str) -> dict:
    r = await session.post('https://api.snapapi.pics/v1/extract',
        json={'url': url, 'schema': PRODUCT_SCHEMA, 'stealth': True},
        headers=HEADERS, timeout=30)
    r.raise_for_status()
    return r.json()['data']

async def batch_extract(urls: list[str]) -> list:
    async with httpx.AsyncClient() as session:
        tasks = [extract(session, u) for u in urls]
        # return_exceptions=True: a failed URL yields its exception object instead of a dict
        return await asyncio.gather(*tasks, return_exceptions=True)

if __name__ == '__main__':
    urls = [
        'https://example-shop.com/products/widget-a',
        'https://example-shop.com/products/widget-b',
    ]
    results = asyncio.run(batch_extract(urls))
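    for url, data in zip(urls, results):
        # A failed request shows up here as an exception object, not a dict
        if isinstance(data, Exception):
            print(url, 'failed:', data)
        else:
            print(url, data)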

Extracting Tables from Web Pages

HTML tables are common on finance sites, sports stats pages, and government data portals. Here's a Cheerio utility that converts every <table> on a page into an array of row objects. It assumes simple tables without colspan or rowspan:

const cheerio = require('cheerio');

/**
 * Parse all tables on a page into arrays of objects.
 * @param {string} html - Raw HTML string
 * @returns {Array<Array<Object>>} - One array of row-objects per table
 */
function extractTables(html) {
  const $ = cheerio.load(html);
  const tables = [];

  $('table').each((_, table) => {
    const headers = [];
    const rows = [];

    // Collect headers from <thead>, or fall back to the first <tr>
    let headerRows = $(table).find('thead tr');
    if (!headerRows.length) headerRows = $(table).find('tr').first();
    headerRows.find('th, td').each((_, th) => {
      headers.push($(th).text().trim().toLowerCase().replace(/\s+/g, '_'));
    });
    const headerEls = headerRows.toArray();

    // Parse body rows
    $(table).find('tr').each((_, tr) => {
      if (headerEls.includes(tr)) return; // skip header rows, even when they use <td>
      const cells = $(tr).find('td');
      if (!cells.length) return; // skip rows made only of <th> cells
      const row = {};
      cells.each((j, td) => {
        const key = headers[j] || `col_${j}`;
        row[key] = $(td).text().trim();
      });
      if (Object.keys(row).length) rows.push(row);
    });

    if (rows.length) tables.push(rows);
  });

  return tables;
}

// Usage with SnapAPI scrape
const axios = require('axios');
async function scrapeTable(url) {
  const { data } = await axios.post('https://api.snapapi.pics/v1/scrape',
    { url, stealth: true },
    { headers: { 'X-Api-Key': process.env.SNAPAPI_KEY } }
  );
  const tables = extractTables(data.html);
  console.log(`Found ${tables.length} table(s)`);
  console.log(tables[0]?.slice(0, 3)); // preview first 3 rows
  return tables;
}
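
Once a table has been parsed into row objects, a common next step is exporting it. The following sketch serialises one table from extractTables as CSV text. toCsv is a hypothetical helper with no dependencies, and it follows the usual quoting convention (fields containing commas, quotes, or line breaks are wrapped in double quotes, with embedded quotes doubled).

```javascript
// Hypothetical helper: serialise one table (array of row objects) as CSV text.
function toCsv(rows) {
  if (!rows.length) return '';
  // Union of keys across rows, in first-seen order, so ragged rows still line up
  const columns = [...new Set(rows.flatMap((row) => Object.keys(row)))];

  const escape = (value) => {
    const text = value == null ? '' : String(value);
    // Quote fields containing commas, quotes, or line breaks; double embedded quotes
    return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
  };

  const lines = [columns, ...rows.map((row) => columns.map((col) => row[col]))];
  return lines.map((line) => line.map(escape).join(',')).join('\n');
}

console.log(toCsv([{ name: 'Widget, large', price: '9.99' }]));
// name,price
// "Widget, large",9.99
```

For example, toCsv(tables[0]) inside scrapeTable would give a string that can be written straight to a .csv file.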

Choosing the Right Approach

No single extraction method works for every site. The comparison table at the top of this guide summarises the trade-offs.

Pro tip: Layer your approach — check for JSON-LD first, fall back to CSS selectors, and use SnapAPI /extract as the last resort for protected or JS-heavy pages. This maximises speed while keeping costs low.
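
The layering idea can be sketched as a small orchestrator that tries each extractor in order and returns the first non-empty result. extractLayered and the { name, run } shape are hypothetical; the run functions stand in for extractors such as extractJsonLd or extractProduct from the earlier sections.

```javascript
// Hypothetical orchestrator for the layered approach.
// `extractors` is an ordered list of { name, run }, where run(url) resolves to
// an object of extracted fields, or to null/{} when nothing was found.
async function extractLayered(url, extractors) {
  const errors = [];
  for (const { name, run } of extractors) {
    try {
      const data = await run(url);
      // Treat null, undefined, and empty objects as "nothing found" and fall through
      if (data && Object.keys(data).length > 0) {
        return { source: name, data, errors };
      }
    } catch (err) {
      // Record the failure and move on to the next, more expensive layer
      errors.push({ name, message: err?.message ?? String(err) });
    }
  }
  return { source: null, data: null, errors };
}
```

A call might pass [{ name: 'json-ld', run: extractJsonLd }, { name: 'snapapi', run: extractProduct }], with a CSS-selector extractor of your own in between. The source field records which layer answered, which helps when tracking how often the paid fallback is actually needed.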

Extract structured data in one API call

No browser automation code. Paste a URL and a schema — get back typed JSON. 200 free requests/month.
