From simple HTML parsing to JavaScript-rendered SPAs. Learn three approaches to Node.js web scraping and when each one applies.
Node.js has a rich ecosystem for web scraping. The right approach depends on the target site: static HTML, server-rendered pages, or fully JavaScript-rendered SPAs. Here is a decision tree and practical code for each approach.
Quick decision guide:
Cheerio parses HTML and exposes a jQuery-like API for querying it. It is the fastest option for static pages but cannot
see any content that is rendered by client-side JavaScript. Install with npm install cheerio axios:
const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeStatic(url) {
  const { data } = await axios.get(url, {
    headers: { 'User-Agent': 'Mozilla/5.0 (compatible; MyBot/1.0)' }
  });
  const $ = cheerio.load(data);

  // Extract all article titles from a blog
  const titles = [];
  $('h2.post-title, article h2, .entry-title').each((i, el) => {
    titles.push($(el).text().trim());
  });

  return {
    pageTitle: $('title').text(),
    metaDescription: $('meta[name="description"]').attr('content'),
    titles
  };
}

scrapeStatic('https://news.ycombinator.com').then(console.log);
⚠ Cheerio does not execute JavaScript. If the page renders content via React, Vue, or Angular, the scraped HTML will be empty shells.
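One rough way to spot this case before parsing is to measure how much visible text the raw HTML actually carries. The sketch below is a heuristic, not something Cheerio provides: the function name looksLikeEmptyShell and the 200-character threshold are assumptions chosen for illustration, and a text-heavy page with a tiny body could still be misjudged.

```javascript
// Heuristic sketch: guess whether raw HTML is a JavaScript "shell" with
// little server-rendered text. The 200-character threshold is an assumption.
function looksLikeEmptyShell(html, minTextLength = 200) {
  // Use the <body> contents if present, otherwise the whole document.
  const bodyMatch = html.match(/<body[^>]*>([\s\S]*)<\/body>/i);
  const body = bodyMatch ? bodyMatch[1] : html;

  const visibleText = body
    .replace(/<script[\s\S]*?<\/script>/gi, ' ')     // drop inline scripts
    .replace(/<style[\s\S]*?<\/style>/gi, ' ')       // drop inline styles
    .replace(/<noscript[\s\S]*?<\/noscript>/gi, ' ') // drop "enable JavaScript" notices
    .replace(/<[^>]+>/g, ' ')                        // strip remaining tags
    .replace(/\s+/g, ' ')
    .trim();

  return visibleText.length < minTextLength;
}
```

If this returns true for the HTML that axios fetched, the page probably needs a real browser, as in the Playwright approach.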
Playwright launches a real browser and waits for JavaScript to complete before extracting content.
It handles SPAs, lazy-loaded content, and dynamic pagination. Install with
npm install playwright && npx playwright install chromium:
const { chromium } = require('playwright');

async function scrapeWithPlaywright(url) {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();
  try {
    await page.goto(url, { waitUntil: 'networkidle', timeout: 30000 });

    // Wait for a specific element that indicates content is loaded
    await page.waitForSelector('.product-grid', { timeout: 10000 });

    const products = await page.evaluate(() => {
      return Array.from(document.querySelectorAll('.product-card')).map(card => ({
        name: card.querySelector('.product-name')?.textContent?.trim(),
        price: card.querySelector('.product-price')?.textContent?.trim(),
        available: !card.classList.contains('out-of-stock')
      }));
    });
    return products;
  } finally {
    await browser.close();
  }
}
⚠ The hidden cost of self-hosted Playwright: A single Chromium instance uses 300-500MB RAM. Serving 10 concurrent scrape requests requires 3-5GB RAM, careful browser pool management, crash recovery, and Docker images over 1GB.
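Because each browser instance is this expensive, a self-hosted setup usually needs a cap on how many scrapes run at once. The following is a minimal sketch of such a cap. runWithLimit is a hypothetical helper written for this example, not part of Playwright, and it covers only concurrency, not crash recovery or browser reuse.

```javascript
// Run async task functions with at most `limit` in flight at once.
// Results are returned in the same order as `tasks`.
async function runWithLimit(tasks, limit) {
  const results = new Array(tasks.length);
  let next = 0;

  async function worker() {
    while (next < tasks.length) {
      const index = next++; // claim the next task; safe because JS runs this synchronously
      try {
        results[index] = { ok: true, value: await tasks[index]() };
      } catch (error) {
        results[index] = { ok: false, error: String((error && error.message) || error) };
      }
    }
  }

  const workerCount = Math.min(limit, tasks.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

For example, runWithLimit(urls.map(u => () => scrapeWithPlaywright(u)), 3) keeps at most three Chromium instances alive, which at the 300-500MB per instance quoted above is roughly 0.9-1.5GB of RAM.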
SnapAPI runs Chromium on managed infrastructure and returns clean Markdown. No browser to install, no memory to manage, no crash recovery to write. The same API key handles screenshots, scraping, extraction, and PDF generation:
const axios = require('axios');

async function scrapeWithSnapAPI(url) {
  const { data } = await axios.get('https://api.snapapi.pics/v1/scrape', {
    headers: { 'X-API-Key': process.env.SNAP_API_KEY },
    params: { url, format: 'markdown' }
  });
  // data.content is clean Markdown: no ads, nav, or boilerplate
  return data.content;
}

// Works on SPAs, React apps, login-gated pages with cookie headers
scrapeWithSnapAPI('https://news.ycombinator.com').then(md => {
  console.log(md.slice(0, 500)); // First 500 chars of clean markdown
});
When scraping multiple pages, use Promise.allSettled for concurrent requests with error isolation:
const axios = require('axios');

const SNAP_KEY = process.env.SNAP_API_KEY;
const CONCURRENCY = 5; // adjust based on your plan

async function batchScrape(urls) {
  // Split the URL list into chunks of CONCURRENCY items
  const chunks = [];
  for (let i = 0; i < urls.length; i += CONCURRENCY) {
    chunks.push(urls.slice(i, i + CONCURRENCY));
  }

  const allResults = [];
  for (const chunk of chunks) {
    const results = await Promise.allSettled(
      chunk.map(url =>
        axios.get('https://api.snapapi.pics/v1/scrape', {
          headers: { 'X-API-Key': SNAP_KEY },
          params: { url, format: 'markdown' }
        }).then(r => ({ url, content: r.data.content, ok: true }))
          .catch(e => ({ url, error: e.message, ok: false }))
      )
    );
    // Each promise handles its own error in .catch above, so every entry
    // is fulfilled and r.value is always set.
    allResults.push(...results.map(r => r.value));
  }
  return allResults;
}

const urls = [
  'https://example.com/product/1',
  'https://example.com/product/2',
  // ... up to hundreds of URLs
];

batchScrape(urls).then(results => {
  const ok = results.filter(r => r.ok);
  const failed = results.filter(r => !r.ok);
  console.log(`Scraped ${ok.length}/${urls.length} successfully`);
  failed.forEach(f => console.error(`Failed: ${f.url} (${f.error})`));
});
When you need typed, structured data instead of Markdown, use SnapAPI’s extract endpoint. Define a JSON schema and get back validated objects:
const axios = require('axios');

async function extractProductData(url) {
  const schema = {
    type: 'object',
    properties: {
      name: { type: 'string' },
      price: { type: 'number' },
      currency: { type: 'string' },
      in_stock: { type: 'boolean' },
      rating: { type: 'number' },
      review_count: { type: 'integer' },
      description: { type: 'string' }
    }
  };

  const { data } = await axios.post(
    'https://api.snapapi.pics/v1/extract',
    { url, schema },
    { headers: { 'X-API-Key': process.env.SNAP_API_KEY } }
  );
  return data; // { name: 'Product X', price: 49.99, in_stock: true, ... }
}
Anti-bot systems have become significantly more sophisticated. Cloudflare Bot Management, Akamai Bot Manager, and similar systems detect browser automation through dozens of signals, including WebGL and canvas fingerprints, timing patterns, missing browser APIs, and TLS fingerprints. A default Playwright instance without stealth plugins is readily detected by the major bot management systems. SnapAPI runs Chromium with a stealth configuration that passes most bot detection systems, making it suitable for extracting publicly accessible data from sites with basic anti-bot protection. For sites with aggressive bot management, always check the site’s terms of service and robots.txt before scraping.
Scraping publicly accessible data is generally legal in most jurisdictions, but you must respect robots.txt, rate limits, and the site’s terms of service. Never scrape private user data, login-required content without permission, or data explicitly prohibited in ToS. When in doubt, consult a lawyer.
Use Cheerio for static HTML pages where speed matters most. Use Playwright when you need fine-grained browser control for testing or automation workflows. Use SnapAPI for production data pipelines where you want no infrastructure overhead and consistent results across JavaScript-heavy sites.
Generate the paginated URLs (e.g., ?page=1, ?page=2) and pass each to batchScrape. For infinite scroll pages, use Playwright to scroll and capture the full DOM, or use SnapAPI with a long delay to allow the page to load all content.
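Generating the page URLs can be sketched with the built-in URL class. The helper name buildPageUrls and the default parameter name page are assumptions for this example; real sites may use offset, p, or path segments instead.

```javascript
// Build page URLs by setting a query parameter on a base URL.
// The parameter name ("page") is an assumption; sites vary.
function buildPageUrls(baseUrl, firstPage, lastPage, param = 'page') {
  const urls = [];
  for (let page = firstPage; page <= lastPage; page++) {
    const url = new URL(baseUrl);
    url.searchParams.set(param, String(page)); // keeps existing query parameters
    urls.push(url.toString());
  }
  return urls;
}
```

The resulting array can be handed directly to batchScrape, for example batchScrape(buildPageUrls('https://example.com/products', 1, 20)).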
Use SnapAPI’s managed scraping API. 200 free calls, no browser to install.
Node.js has several HTTP clients suited for API integrations. The native fetch API (available since Node 18 without a flag) works for simple use cases. Axios is the most popular choice because it handles JSON automatically, supports request/response interceptors for retry logic, and works identically in Node and browser environments. Got is a lighter alternative with a clean promise API. For production scraping pipelines that make hundreds of requests, Axios combined with the axios-retry package is a common pattern.
When working with SnapAPI, binary responses (screenshots, PDFs) require responseType: "arraybuffer" in Axios or a buffer-based approach with fetch. Text responses (scraping, extraction) return JSON by default. Here is the canonical pattern for handling both response types in a single client:
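A minimal sketch of such a client, using the built-in fetch rather than Axios, might look like the following. It rests on several assumptions that the text above does not confirm: that endpoints are called with GET and query parameters, that the API sets Content-Type to application/json for text responses, and that anything else can be treated as binary. createSnapClient and its fetchImpl option are names invented for this example.

```javascript
// Sketch of a single client that returns parsed JSON for text endpoints
// and a Buffer for binary ones, decided by the response Content-Type.
// `fetchImpl` is injectable so the client can be exercised without network access.
function createSnapClient({ apiKey, baseUrl = 'https://api.snapapi.pics', fetchImpl = fetch }) {
  return async function request(path, params = {}) {
    const url = new URL(path, baseUrl);
    for (const [key, value] of Object.entries(params)) {
      url.searchParams.set(key, String(value));
    }

    const res = await fetchImpl(url, { headers: { 'X-API-Key': apiKey } });
    if (!res.ok) {
      throw new Error(`Request to ${url.pathname} failed with HTTP ${res.status}`);
    }

    const type = res.headers.get('content-type') || '';
    if (type.includes('application/json')) {
      return { kind: 'json', data: await res.json() };
    }
    // Assumption: any non-JSON response (image/png, application/pdf, ...) is binary.
    return { kind: 'binary', contentType: type, data: Buffer.from(await res.arrayBuffer()) };
  };
}
```

With const request = createSnapClient({ apiKey: process.env.SNAP_API_KEY }), a call such as request('/v1/scrape', { url, format: 'markdown' }) yields a result whose kind is 'json', while a binary endpoint yields a Buffer that can be written to disk with fs.writeFileSync.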
Production scraping pipelines need robust error handling. The most common failure modes with any screenshot or scraping API are: network timeouts on slow target sites, 4xx errors when the target site returns an error page, and 5xx errors from the API itself. A well-designed retry strategy handles each case differently. Network timeouts and 5xx errors should be retried with exponential backoff. 4xx errors from the target site (like a 404) should be recorded as failures without retry. API-level 4xx errors (like 401 Unauthorized or 429 Rate Limited) should stop immediately and alert.
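The three-way decision described above can be sketched as a small classifier plus a backoff calculation. This assumes Axios-style errors, where error.response.status is set when a server replied and error.response is absent for timeouts and network failures. How the API distinguishes its own errors from target-site errors is not specified here, so the sketch simply treats 401 and 429 as API-level and every other 4xx as a recorded failure. classifyFailure and backoffMs are illustrative names.

```javascript
// Classify an Axios-style error into one of three actions:
// 'retry' (with backoff), 'fail' (record, no retry), or 'abort' (stop and alert).
function classifyFailure(error) {
  const status = error.response && error.response.status;
  if (!status) return 'retry';                          // timeout or network error
  if (status >= 500) return 'retry';                    // API-side failure
  if (status === 401 || status === 429) return 'abort'; // credentials or rate limit
  return 'fail';                                        // e.g. 404: record, do not retry
}

// Exponential backoff: 500ms, 1s, 2s, 4s, ... capped at `maxMs`.
function backoffMs(attempt, baseMs = 500, maxMs = 30000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}
```

A retry loop would call classifyFailure in its catch block and wait backoffMs(attempt) before the next try only when the result is 'retry'.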
The npm package axios-retry handles the backoff logic cleanly. Configure it with retries: 3, retryDelay: axiosRetry.exponentialDelay, and a retryCondition that returns true only for network errors and 5xx responses.
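As a sketch of that configuration, the retry condition can be written as a plain function and passed in the options object. The condition below is hand-written so that timeouts (which carry no response) are retried too. The wiring is shown in comments because it needs the axios and axios-retry packages installed, and the way the module is imported under CommonJS depends on the installed axios-retry version, so treat those lines as an assumption to check against your version.

```javascript
// Retry only when no response arrived (network error or timeout)
// or when the server answered with a 5xx status.
// Note: cancelled requests also have no response and would be retried here.
function retryCondition(error) {
  if (!error.response) return true;
  return error.response.status >= 500;
}

const retryOptions = { retries: 3, retryCondition };

// Wiring, assuming `npm install axios axios-retry`:
//   const axios = require('axios');
//   const mod = require('axios-retry');
//   const axiosRetry = mod.default || mod; // export shape varies by version
//   axiosRetry(axios, { ...retryOptions, retryDelay: axiosRetry.exponentialDelay });
```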
For small-scale pipelines, writing results to a JSON file or SQLite database is sufficient. For production systems at scale, a common stack is PostgreSQL for structured data (with Prisma as the ORM or Knex as a query builder), S3-compatible object storage for binary files like screenshots and PDFs, and Redis for tracking job state and implementing distributed locking to prevent duplicate scrapes. If you are using a job queue like BullMQ, you can use its built-in retry and concurrency controls instead of implementing them manually at the HTTP client level.
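For the small-scale case, one possible shape is a JSON Lines file (one JSON object per line) rather than a single JSON array, since it can be appended to without rewriting the file. This is a variant chosen for the sketch, not something the stack above prescribes, and appendResults and loadSeenUrls are hypothetical helper names.

```javascript
const fs = require('fs');

// Append one JSON object per line, so earlier records stay readable
// even if a later run is interrupted.
function appendResults(filePath, results) {
  const lines = results.map(r =>
    JSON.stringify({ ...r, scrapedAt: new Date().toISOString() })
  );
  if (lines.length > 0) fs.appendFileSync(filePath, lines.join('\n') + '\n');
}

// Read back the URLs already stored, to skip duplicates on the next run.
function loadSeenUrls(filePath) {
  if (!fs.existsSync(filePath)) return new Set();
  const lines = fs.readFileSync(filePath, 'utf8').split('\n').filter(Boolean);
  return new Set(lines.map(line => JSON.parse(line).url));
}
```

Before a run, filtering the URL list with loadSeenUrls gives a simple single-process substitute for the duplicate prevention that Redis locking provides at scale.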
For AI-powered applications that process scraped content, the typical flow is: scrape with SnapAPI, chunk the Markdown output into segments of 512-2000 tokens, embed each chunk with an embedding model, and store in a vector database like Pinecone, Weaviate, or pgvector. This pattern powers RAG (retrieval-augmented generation) applications that answer questions over scraped web content.
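The chunking step can be sketched without any dependencies. The version below splits on blank lines and estimates token counts at about 4 characters per token, which is a rough rule of thumb for English text and not a real tokenizer. For exact limits, swap in the tokenizer of your embedding model. chunkMarkdown is an illustrative name.

```javascript
// Split Markdown into chunks of roughly `maxTokens` tokens, breaking on
// blank lines. Tokens are estimated at ~4 characters each (an assumption).
function chunkMarkdown(markdown, maxTokens = 512) {
  const maxChars = maxTokens * 4;
  const paragraphs = markdown.split(/\n\s*\n/).map(p => p.trim()).filter(Boolean);
  const chunks = [];
  let current = '';

  for (const para of paragraphs) {
    const candidate = current ? `${current}\n\n${para}` : para;
    if (candidate.length <= maxChars) {
      current = candidate; // paragraph still fits in the current chunk
      continue;
    }
    if (current) chunks.push(current);
    // A single oversized paragraph is hard-split at the character limit.
    let rest = para;
    while (rest.length > maxChars) {
      chunks.push(rest.slice(0, maxChars));
      rest = rest.slice(maxChars);
    }
    current = rest;
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Each returned chunk would then be sent to the embedding model and stored alongside its source URL in the vector database.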