Extraction Methods Compared
| Method | Best for | Maintenance | Handles JS? |
|---|---|---|---|
| CSS selectors (Cheerio) | Static HTML, predictable structure | High (breaks on redesign) | No |
| XPath | Complex DOM traversal | High | No |
| Playwright evaluate() | SPAs, dynamic content | Medium | Yes |
| JSON-LD / microdata | Schema.org-tagged pages | Low | No |
| AI schema extraction | Any page, no selectors needed | Very low | Yes |
CSS Selectors with Cheerio
import * as cheerio from 'cheerio';
import fetch from 'node-fetch';
const res = await fetch('https://example.com/product');
const $ = cheerio.load(await res.text());
const product = {
title: $('h1.product-title').text().trim(),
price: $('.price-current').text().trim(),
rating: parseFloat($('[itemprop="ratingValue"]').attr('content') || '0'),
description: $('.product-description').text().trim(),
images: $('img.product-image').map((_, el) => $(el).attr('src')).get(),
variants: $('.variant-option').map((_, el) => ({
name: $(el).find('.variant-name').text().trim(),
price: $(el).find('.variant-price').text().trim(),
sku: $(el).data('sku'),
})).get(),
};
Playwright evaluate() for SPAs
import { chromium } from 'playwright';
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com/product', { waitUntil: 'networkidle' });
// Extract structured data inside the browser context
const product = await page.evaluate(() => {
const getText = (sel) => document.querySelector(sel)?.textContent?.trim() ?? null;
const getAttr = (sel, attr) => document.querySelector(sel)?.getAttribute(attr) ?? null;
return {
title: getText('h1'),
price: getText('[data-price]') || getAttr('[data-price]', 'data-price'),
inStock: !document.querySelector('.out-of-stock'),
images: [...document.querySelectorAll('img.gallery-img')].map(el => el.src),
breadcrumbs: [...document.querySelectorAll('nav.breadcrumb a')].map(el => el.textContent.trim()),
};
});
await browser.close();
JSON-LD: Zero Maintenance Extraction
Many ecommerce and news sites embed application/ld+json structured data. It's reliable and doesn't break on layout changes:
import * as cheerio from 'cheerio';
async function extractJsonLd(url) {
const res = await fetch(url);
const $ = cheerio.load(await res.text());
const schemas = [];
$('script[type="application/ld+json"]').each((_, el) => {
try {
schemas.push(JSON.parse($(el).html()));
} catch {}
});
// Find Product schema
const product = schemas.find(s => s['@type'] === 'Product');
if (!product) return null;
return {
name: product.name,
description: product.description,
price: product.offers?.price,
currency: product.offers?.priceCurrency,
availability: product.offers?.availability,
rating: product.aggregateRating?.ratingValue,
reviewCount: product.aggregateRating?.reviewCount,
};
}