Scraping JavaScript SPAs: React, Vue, and Angular

April 2025 · 9 min read

Many modern web applications are JavaScript-rendered single-page applications (SPAs). React, Vue, Angular, and Svelte apps typically deliver a near-empty HTML shell on initial load and populate it with content after JavaScript executes. Traditional scrapers that fetch HTML with curl or requests see only that shell: an empty page or, at best, a loading spinner.

This guide covers why static scrapers fail on SPAs, how headless browsers solve it, and how to use a managed API to extract structured data from any JavaScript-rendered page without managing browser infrastructure.

Why Static Scrapers Fail on SPAs

When you fetch a React application with a standard HTTP client, the server returns the initial HTML — typically a bare <div id="root"></div> and a bundle of JavaScript files. The content you want is not in this HTML. It is rendered by React after the JavaScript bundle executes and any data fetching (REST or GraphQL API calls) completes.

Axios and requests only fetch the HTML returned by the HTTP GET, and Cheerio and BeautifulSoup only parse it. None of them has a JavaScript engine, so they never execute the bundle, never trigger data fetches, and never see the rendered DOM. For SPAs, they consistently return empty or incomplete data.
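To make the failure concrete, here is a minimal sketch using only the Python standard library. The two HTML strings and the product-price class are hypothetical stand-ins: one for what an HTTP client receives, one for what the browser holds after rendering. No network access is involved.

```python
from html.parser import HTMLParser

# Hypothetical shell, roughly what an HTTP client receives from a client-rendered app.
STATIC_SHELL = """<!doctype html>
<html>
  <head><title>Products</title></head>
  <body>
    <div id="root"></div>
    <script src="/static/js/main.js"></script>
  </body>
</html>"""

# Hypothetical DOM after the framework has rendered, serialized back to HTML.
RENDERED_DOM = '<div id="root"><span class="product-price">$19.99</span></div>'

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassTextCollector(HTMLParser):
    """Collect the text of every element carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # > 0 while inside a matching element
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements have no closing tag to balance
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.css_class in classes:
            self.depth += 1
            if self.depth == 1:
                self.matches.append("")

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.matches[-1] += data


def extract_prices(html):
    parser = ClassTextCollector("product-price")
    parser.feed(html)
    return [text.strip() for text in parser.matches]


print(extract_prices(STATIC_SHELL))   # the shell contains no product markup
print(extract_prices(RENDERED_DOM))   # the rendered DOM does
```

The parser is not at fault here: the same code finds the price once it is given the rendered markup.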

The Headless Browser Solution

A headless browser executes the full page lifecycle: loads the HTML, parses and executes the JavaScript, handles network requests, and renders the final DOM. When the page is ready, you can query the DOM for the data you need — seeing exactly what a real user would see in their browser.

Playwright and Puppeteer are the two main headless browser tools for Node.js. Both can launch a headless Chromium instance (Playwright also drives Firefox and WebKit), navigate to the URL, wait for rendering to complete, and provide an API to interact with the page and extract data.

// Puppeteer: by default, installing the package also downloads a compatible Chrome build
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto('https://spa-example.com/products', { waitUntil: 'networkidle2' });
    // Runs in the page context against the rendered DOM
    const prices = await page.$$eval('.product-price', els => els.map(el => el.textContent.trim()));
    console.log(prices);
  } finally {
    await browser.close(); // always release the browser, even if extraction throws
  }
})();

The waitUntil: 'networkidle2' option waits until there are no more than 2 active network connections for 500ms — a heuristic that works for most SPAs. For apps with streaming data or WebSocket connections, you may need to use page.waitForSelector() to wait for a specific element instead.
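Conceptually, waiting for a selector means polling for an explicit condition until a deadline, rather than inferring readiness from network silence. The sketch below illustrates that idea in plain Python. It is not how Puppeteer implements page.waitForSelector(), and wait_until and FakePage are hypothetical names invented for the illustration.

```python
import time


def wait_until(condition, timeout=30.0, interval=0.25):
    """Poll condition() until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


class FakePage:
    """Stand-in for a page whose product list only appears after a few polls."""

    def __init__(self, ready_after):
        self.polls = 0
        self.ready_after = ready_after

    def query(self, selector):
        self.polls += 1
        if self.polls >= self.ready_after:
            return ["<div class='product-card'>"]
        return []


page = FakePage(ready_after=3)
cards = wait_until(lambda: page.query(".product-card"), timeout=2.0, interval=0.01)
print(len(cards), page.polls)
```

The explicit condition keeps working when a page holds a WebSocket open indefinitely, which is exactly the case where a network-idle heuristic never settles.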

The Managed API Approach

A screenshot and extraction API provides the same rendering capability as a self-hosted headless browser, without the DevOps overhead. SnapAPI runs a managed Chromium fleet, handles browser crashes and restarts, manages anti-bot bypass, and exposes the results via a simple HTTPS API. Your scraper makes one POST request and receives structured data — no browser process to manage.

import requests

# Extract product data from a React SPA
resp = requests.post(
    'https://api.snapapi.pics/v1/extract',
    headers={'X-Api-Key': 'sk_live_YOUR_KEY'},
    json={
        'url': 'https://react-spa.example.com/products',
        'schema': {
            'products': {
                'selector': '.product-card',
                'multiple': True,
                'fields': {
                    'name': {'selector': '.product-name'},
                    'price': {'selector': '.price', 'transform': 'number'}
                }
            }
        }
    },
    timeout=60,  # rendering takes time, but never wait indefinitely
)
resp.raise_for_status()  # surface 4xx/5xx errors instead of parsing an error body
products = resp.json()['data']['products']
print(f'Found {len(products)} products')

SnapAPI renders the page in Chromium, executes the React JavaScript, waits for network requests to complete, and then evaluates your CSS selectors against the fully rendered DOM. The result is the same data you would get from manually inspecting the page in DevTools — but returned as structured JSON with zero browser setup.
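Rendered pages sometimes contain cards that are only half populated, so it can help to validate the returned items before using them. The following is a minimal sketch that assumes the response body has the data.products shape used in the example above and that each item carries the name and price fields from that schema. The helper parse_products and the sample payload are hypothetical.

```python
def parse_products(payload):
    """Return (name, price) tuples, skipping items with missing or malformed fields."""
    items = (payload.get("data") or {}).get("products") or []
    products = []
    for item in items:
        name = (item.get("name") or "").strip()
        price = item.get("price")
        is_number = isinstance(price, (int, float)) and not isinstance(price, bool)
        if not name or not is_number:
            continue  # card rendered without complete data
        products.append((name, float(price)))
    return products


# Hypothetical response body, shaped like the example above.
sample = {
    "data": {
        "products": [
            {"name": " Widget ", "price": 19.99},
            {"name": "Gadget", "price": None},   # price had not rendered yet
            {"name": "", "price": 5},            # name missing
        ]
    }
}

print(parse_products(sample))
```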

Anti-Bot Bypass for SPAs

Many JavaScript SPAs are protected by Cloudflare, DataDome, or similar bot detection services. These services fingerprint the browser environment, check for headless indicators, and serve CAPTCHAs or empty responses to detected bots. SnapAPI's stealth mode uses a combination of browser fingerprint spoofing, natural request timing, and residential proxy routing to bypass standard bot detection — without any configuration required from you.

Waiting for Dynamic Content

SPAs that load data asynchronously present a timing challenge. The DOM may be ready but the API call that populates the product list hasn't completed yet. Use the wait_for parameter to specify a CSS selector that appears only after the data has loaded. SnapAPI waits up to 30 seconds for the element to appear before capturing — giving the SPA time to fetch and render its data.

resp = requests.post(
    'https://api.snapapi.pics/v1/extract',
    headers={'X-Api-Key': 'sk_live_YOUR_KEY'},
    json={
        'url': 'https://spa.example.com/products',
        'wait_for': '.product-card',  # Wait for first product to appear
        'schema': {
            'products': { 'selector': '.product-card', 'multiple': True }
        }
    }
)

Authenticated SPAs

Many SPAs require authentication to access the content you need. Use the cookies parameter to pass session cookies to SnapAPI. Log in to the site normally in your browser, copy the session cookie from DevTools, and pass it in the cookies array. SnapAPI will send these cookies with every request, giving the browser an authenticated session to load the page.
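As a sketch of the copy-from-DevTools step, the helper below turns a Cookie header string into a list suitable for a cookies array. The text above does not specify the shape of each entry, so the name/value/domain objects are an assumption, as are the helper name and the sample values. Check the API reference before relying on this format.

```python
from http.cookies import SimpleCookie


def cookies_from_header(cookie_header, domain):
    """Convert 'a=1; b=2' (as copied from DevTools) into a list of cookie dicts.

    The {'name', 'value', 'domain'} shape is an assumed format, not a documented one.
    """
    jar = SimpleCookie()
    jar.load(cookie_header)
    return [
        {"name": name, "value": morsel.value, "domain": domain}
        for name, morsel in jar.items()
    ]


# Hypothetical session cookies for a hypothetical site.
payload = {
    "url": "https://spa.example.com/account/orders",
    "cookies": cookies_from_header("sessionid=abc123; csrftoken=xyz789", "spa.example.com"),
}

print(payload["cookies"])
```

Session cookies expire, so a scraper built this way needs a plan for refreshing them.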

For automated authentication, use the js_code parameter to inject JavaScript that sets localStorage values or dispatches login events before the page renders your target content. Some SPAs store auth tokens in localStorage — inject the token before page load and the app will behave as if you are logged in.
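One way to build such a snippet safely is to let json.dumps produce the JavaScript string literals, so quotes and backslashes in a token cannot break the injected code. This is a sketch: the helper name, the auth_token key, and the token value are hypothetical, and whether js_code runs early enough for a given app to pick up the token is an assumption to verify.

```python
import json


def local_storage_snippet(values):
    """Build JavaScript that writes each key/value pair into localStorage."""
    lines = [
        f"localStorage.setItem({json.dumps(key)}, {json.dumps(value)});"
        for key, value in values.items()
    ]
    return "\n".join(lines)


# Hypothetical key and token; real apps choose their own storage keys.
js_code = local_storage_snippet({"auth_token": 'eyJhbGciOi"quoted"'})
print(js_code)
```

The resulting string can then be passed as the js_code value in the request body.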

Get Started

Sign up at snapapi.pics for 200 free extractions per month, no credit card required. The extract endpoint supports CSS selector schemas, AI mode for unstructured extraction, wait conditions, cookies, JS injection, and anti-bot bypass — everything needed to scrape any JavaScript SPA without managing browser infrastructure.

Vue, Angular, and Svelte SPAs

The same techniques apply to Vue, Angular, and Svelte applications. All three frameworks render content client-side using JavaScript. What differs between frameworks is the generated markup they attach for style scoping. Vue scoped styles add generated data attributes (like data-v-abc123) to elements. Angular's default emulated ViewEncapsulation adds generated attributes such as _ngcontent-* to elements and rewrites the component CSS to match them. These identifiers can change between builds, so avoid them in schemas. Prefer stable hooks such as IDs, purpose-built data attributes (for example data-testid), or semantic HTML element selectors.
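A quick lint over a schema's selectors can catch build-generated identifiers before they break a scraper. The patterns below cover Vue's data-v-* attributes, Angular's _ngcontent-* and _nghost-* attributes, and Svelte's svelte-* scoped classes. The list is a heuristic, not an exhaustive one, and is_fragile is a hypothetical helper name.

```python
import re

# Heuristic patterns for identifiers that frameworks generate at build time.
GENERATED_PATTERNS = [
    re.compile(r"data-v-[0-9a-f]+"),             # Vue scoped-style attribute
    re.compile(r"_ng(?:content|host)-[\w-]+"),   # Angular emulated encapsulation
    re.compile(r"\.svelte-[0-9a-z]+"),           # Svelte scoped class
]


def is_fragile(selector):
    """Return True if the selector depends on a build-generated identifier."""
    return any(pattern.search(selector) for pattern in GENERATED_PATTERNS)


selectors = [
    "[data-v-abc123] .price",
    "span[_ngcontent-ng-c123]",
    ".price.svelte-1a2b3c",
    '[data-testid="price"]',
    "article h2",
]

for selector in selectors:
    print(f"{selector!r}: {'fragile' if is_fragile(selector) else 'ok'}")
```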

GraphQL-Powered SPAs

Many modern SPAs fetch data via GraphQL. The rendered DOM contains the same data regardless of the underlying data fetching mechanism — SnapAPI sees the final rendered output, not the network requests. If the GraphQL data takes time to load, use wait_for to wait for an element that indicates the data has rendered before your schema selectors run.

Infinite Scroll and Virtual Lists

Applications with infinite scroll render only the first batch of items and load more as the user scrolls. Use the js_code parameter to inject JavaScript that scrolls to the bottom of the page before your selectors run, so that additional list items are loaded into the DOM.

For long lists, a single scroll is rarely enough. Inject a scroll loop that repeatedly scrolls to the document bottom with a delay between scrolls, stopping when no new items appear. Combine it with wait_for to ensure the DOM has stabilized before extraction begins. Virtual (windowed) lists need more care: they keep only the visible rows in the DOM and unmount rows that scroll out of view, so scrolling to the bottom does not accumulate every item. For those, collect items incrementally at each scroll position.
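A scroll loop of this kind might look like the following sketch, which generates the JavaScript from Python so the round count and delay are easy to tune. It assumes that js_code accepts an arbitrary script and that the service waits for the asynchronous function to finish, neither of which is confirmed above. scroll_loop_js is a hypothetical helper name.

```python
SCROLL_LOOP_TEMPLATE = """
(async () => {
  const maxRounds = %(max_rounds)d;
  const delayMs = %(delay_ms)d;
  let lastHeight = 0;
  for (let i = 0; i < maxRounds; i++) {
    window.scrollTo(0, document.body.scrollHeight);
    await new Promise(resolve => setTimeout(resolve, delayMs));
    const height = document.body.scrollHeight;
    if (height === lastHeight) break;  // page stopped growing: nothing new loaded
    lastHeight = height;
  }
})();
""".strip()


def scroll_loop_js(max_rounds=10, delay_ms=500):
    """Return JavaScript that scrolls to the bottom until the page stops growing."""
    if max_rounds < 1 or delay_ms < 0:
        raise ValueError("max_rounds must be >= 1 and delay_ms must be >= 0")
    return SCROLL_LOOP_TEMPLATE % {"max_rounds": max_rounds, "delay_ms": delay_ms}


js_code = scroll_loop_js(max_rounds=5, delay_ms=250)
print(js_code)
```

Capping the number of rounds matters: a feed that never ends would otherwise scroll until the request times out.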

Server-Side Rendering vs Client-Side Rendering

Next.js, Nuxt, and SvelteKit offer server-side rendering (SSR), which renders the initial HTML on the server. For SSR pages, static HTTP scrapers work on the first load because the server sends fully populated HTML. After hydration, however, the app behaves like any other SPA: in-app navigation and later data fetches happen client-side. If you need content that appears only after client-side navigation or interaction, you still need a headless browser or a screenshot and extraction API to capture the rendered result.
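A cheap way to decide whether a page needs a browser at all is to fetch it statically first and measure how much visible text the body contains. The sketch below is a heuristic under stated assumptions: the 200-character threshold is arbitrary, and both HTML samples are hypothetical.

```python
from html.parser import HTMLParser


class VisibleTextCounter(HTMLParser):
    """Count non-whitespace-trimmed text inside <body>, ignoring scripts and styles."""

    SKIPPED = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and not self.skip_depth:
            self.chars += len(data.strip())


def looks_server_rendered(html, min_chars=200):
    """Heuristic: enough visible body text suggests the HTML arrived populated."""
    counter = VisibleTextCounter()
    counter.feed(html)
    return counter.chars >= min_chars


# Hypothetical responses: a client-rendered shell and a server-rendered page.
CSR_SHELL = (
    "<html><head><title>Shop</title></head><body>"
    "<noscript>You need to enable JavaScript to run this app.</noscript>"
    '<div id="root"></div><script>window.__DATA__ = {};</script>'
    "</body></html>"
)
SSR_PAGE = (
    '<html><body><div id="__next">'
    + "<p>Product description text.</p>" * 20
    + "</div></body></html>"
)

print(looks_server_rendered(CSR_SHELL), looks_server_rendered(SSR_PAGE))
```

When the check passes, a plain HTTP fetch may be enough for that URL. When it fails, fall back to a rendered fetch.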

SnapAPI handles both SSR and CSR pages identically — it loads the URL in a full Chromium browser, waits for the page to stabilize, and returns the rendered result. For SSR pages this is faster since the initial HTML is already populated. For CSR pages the API waits for JavaScript to execute and data fetching to complete. Either way, your extraction schema works without modification. Sign up at snapapi.pics for 200 free monthly extractions.