Extracting Structured Data from Websites: A Developer's Guide
The web is the largest database ever created, but most of it is locked behind HTML markup. Extracting useful, structured data from web pages is a fundamental task for AI pipelines, competitive intelligence, market research, content aggregation, and dozens of other applications. The challenge is doing it reliably, at scale, and without building a fragile scraping infrastructure.
This guide covers modern approaches to web data extraction, with practical examples using SnapAPI's /v1/scrape and /v1/extract endpoints.
Two Approaches to Web Data Extraction
There are two fundamentally different ways to get data from a webpage:
1. Structured Scraping (CSS/XPath Selectors)
You specify exactly which elements you want using CSS selectors or XPath expressions. The API returns structured JSON with the data from those elements. This approach is precise and fast, but requires you to know the page structure in advance.
2. Content Extraction (URL to Markdown/Text)
You point the API at a URL, and it extracts the main content as clean markdown or plain text. No selectors needed. This is ideal for AI/LLM pipelines where you want to feed web content into a language model without HTML noise.
SnapAPI supports both approaches. Let's look at each in detail.
Structured Scraping with /v1/scrape
The /v1/scrape endpoint renders a page in a real browser (handling JavaScript, SPAs, and dynamic content) and then extracts data using CSS selectors you define.
Basic Example: Extracting Page Metadata
curl "https://api.snapapi.pics/v1/scrape" -H "Authorization: Bearer YOUR_API_KEY" -H "Content-Type: application/json" -d '{
"url": "https://news.ycombinator.com",
"selectors": {
"title": "title",
"top_stories": ".titleline > a",
"scores": ".score"
}
}'
The response is clean JSON:
{
"success": true,
"data": {
"title": "Hacker News",
"top_stories": [
"Show HN: I built a screenshot API that does 5 things",
"PostgreSQL 17 is now available",
"The state of WebAssembly in 2026"
],
"scores": [
"342 points",
"287 points",
"201 points"
]
},
"metadata": {
"url": "https://news.ycombinator.com",
"statusCode": 200,
"loadTime": 1243
}
}
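Selector results come back as parallel, index-aligned arrays, so pairing them up is a one-liner on the client side. A quick sketch using the sample response above:

```python
# Sample data in the shape of the /v1/scrape response above
data = {
    "top_stories": [
        "Show HN: I built a screenshot API that does 5 things",
        "PostgreSQL 17 is now available",
        "The state of WebAssembly in 2026",
    ],
    "scores": ["342 points", "287 points", "201 points"],
}

# Zip the parallel arrays into one record per story
stories = [
    {"title": title, "points": int(score.split()[0])}
    for title, score in zip(data["top_stories"], data["scores"])
]
print(stories[0]["points"])  # 342
```

One caveat: the arrays stay aligned only if every item on the page has both elements. If some stories can lack a score, scope your selectors to a per-item container instead of relying on index alignment.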
Advanced Selectors: Extracting Attributes and Nested Data
You are not limited to text content. Extract attributes, HTML, and nested structures:
from snapapi import SnapAPI
client = SnapAPI("YOUR_API_KEY")
result = client.scrape(
url="https://example-ecommerce.com/products",
selectors={
# Extract text content (default)
"product_names": ".product-card h3",
# Extract specific attributes with @attr syntax
"product_images": ".product-card img@src",
"product_links": ".product-card a@href",
# Extract prices
"prices": ".product-card .price",
# Extract data attributes
"product_ids": ".product-card@data-product-id",
}
)
# Build structured product data by zipping the parallel result arrays
products = [
    {"name": name, "price": price, "image": image, "link": link, "id": pid}
    for name, price, image, link, pid in zip(
        result["data"]["product_names"],
        result["data"]["prices"],
        result["data"]["product_images"],
        result["data"]["product_links"],
        result["data"]["product_ids"],
    )
]
print(f"Found {len(products)} products")
Handling Dynamic Content
Many modern websites load content dynamically via JavaScript. SnapAPI renders pages in a real browser, so JavaScript-rendered content is available. You can also wait for specific elements to appear:
# Wait for dynamic content to load
result = client.scrape(
url="https://spa-app.com/dashboard",
selectors={
"metrics": ".metric-value",
"chart_labels": ".chart-label",
},
wait_for_selector=".metric-value", # Wait until this element exists
delay=2000, # Additional 2-second wait after element appears
)
Pagination: Scraping Multiple Pages
all_products = []
for page_num in range(1, 11): # Scrape 10 pages
url = f"https://example-shop.com/products?page={page_num}"
result = client.scrape(
url=url,
selectors={
"names": ".product-name",
"prices": ".product-price",
"ratings": ".product-rating@data-score",
}
)
page_products = result["data"]["names"]
if not page_products:
break # No more results
for i in range(len(page_products)):
all_products.append({
"name": result["data"]["names"][i],
"price": result["data"]["prices"][i],
"rating": result["data"]["ratings"][i] if i < len(result["data"]["ratings"]) else None,
})
print(f"Page {page_num}: found {len(page_products)} products")
print(f"\nTotal: {len(all_products)} products scraped")
Content Extraction with /v1/extract
The /v1/extract endpoint is designed for a different use case: converting an entire webpage into clean, readable text or markdown. This is the endpoint you want when building AI/LLM data pipelines.
Why Content Extraction Matters for AI
Large language models work with text, not HTML. If you feed raw HTML into a language model, most of the token budget is wasted on tags, attributes, scripts, and styling. Content extraction strips all of that away, giving you just the meaningful content in a format the model can work with efficiently.
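To see why this matters, compare a page's character budget before and after stripping markup. This is a rough illustration using a naive regex, not how the API itself works, and real pages are far worse once scripts, styles, and attributes are counted:

```python
import re

def markup_overhead(html: str) -> float:
    """Fraction of characters spent on tags rather than visible text."""
    text = re.sub(r"<[^>]+>", "", html)
    return 1 - len(text) / len(html)

page = (
    "<html><body><div class='post'>"
    "<p>Just nine words of actual content on this page.</p>"
    "</div></body></html>"
)
print(f"{markup_overhead(page):.0%} of the page is markup")
```

Even in this tiny example, over half the characters are markup; on a production page with framework boilerplate, the visible text is often under 10% of the HTML.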
Basic Content Extraction
curl "https://api.snapapi.pics/v1/extract?url=https://example.com/blog-post&format=markdown" -H "Authorization: Bearer YOUR_API_KEY"
Response:
{
  "success": true,
  "content": "# Blog Post Title\n\nThis is the main content of the blog post, extracted as clean markdown...\n\n## Section Heading\n\nMore content here with **bold** and *italic* formatting preserved.\n\n- List item 1\n- List item 2\n- List item 3\n",
  "metadata": {
    "title": "Blog Post Title",
    "description": "Meta description of the page",
    "author": "Author Name",
    "publishedDate": "2026-03-01",
    "wordCount": 1847,
    "url": "https://example.com/blog-post"
  }
}
Python Example: Feeding Web Content to an LLM
from snapapi import SnapAPI
from openai import OpenAI
snap = SnapAPI("YOUR_SNAPAPI_KEY")
openai = OpenAI(api_key="YOUR_OPENAI_KEY")
# Step 1: Extract content from a webpage
extracted = snap.extract(
url="https://techcrunch.com/2026/03/01/some-article",
format="markdown"
)
# Step 2: Send to LLM for analysis
response = openai.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "system",
"content": "You are a tech news analyst. Summarize the following article and identify key takeaways."
},
{
"role": "user",
"content": f"Analyze this article:\n\n{extracted['content']}"
}
]
)
print(response.choices[0].message.content)
Building a RAG Pipeline
Retrieval-Augmented Generation (RAG) systems need clean text to build their knowledge base. Here is how to use SnapAPI extract with a vector database:
from snapapi import SnapAPI
import chromadb
from sentence_transformers import SentenceTransformer
snap = SnapAPI("YOUR_API_KEY")
model = SentenceTransformer("all-MiniLM-L6-v2")
chroma = chromadb.Client()
collection = chroma.create_collection("web_knowledge")
urls = [
"https://docs.python.org/3/tutorial/index.html",
"https://fastapi.tiangolo.com/tutorial/",
"https://docs.docker.com/get-started/",
]
for url in urls:
# Extract clean content
result = snap.extract(url=url, format="markdown")
content = result["content"]
# Chunk the content (simple approach: split by paragraphs)
chunks = [c.strip() for c in content.split("\n\n") if len(c.strip()) > 50]
# Generate embeddings and store
for i, chunk in enumerate(chunks):
embedding = model.encode(chunk).tolist()
collection.add(
documents=[chunk],
embeddings=[embedding],
ids=[f"{url}_{i}"],
metadatas=[{"url": url, "chunk_index": i}]
)
print(f"Indexed {len(chunks)} chunks from {url}")
# Now query the knowledge base
query = "How do I create a Docker container?"
query_embedding = model.encode(query).tolist()
results = collection.query(query_embeddings=[query_embedding], n_results=3)
for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
print(f"\nFrom {meta['url']}:")
print(doc[:200] + "...")
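Splitting on blank lines is the simplest chunking strategy. For retrieval quality, a fixed-size window with overlap often works better because it preserves context that straddles paragraph boundaries. A minimal sketch:

```python
def chunk_text(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping character windows."""
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
    return chunks

# 1000 characters -> windows starting at 0, 400, 800
chunks = chunk_text("x" * 1000)
print(len(chunks))  # 3
```

In practice you would tune `size` and `overlap` to your embedding model's context window, and split on sentence boundaries rather than raw character offsets.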
Scrape vs. Extract: When to Use Which
| Use Case | Best Endpoint | Why |
|---|---|---|
| Product data from e-commerce | /v1/scrape | You know the exact selectors for price, name, image |
| Blog content for LLM analysis | /v1/extract | You want clean markdown, not specific elements |
| Competitor pricing tables | /v1/scrape | Structured data from known page layouts |
| News article summarization | /v1/extract | Main content extraction without HTML noise |
| Job listings from career pages | /v1/scrape | Repeating elements with consistent selectors |
| Documentation for RAG | /v1/extract | Full page content in embedding-ready format |
| Social media profiles | /v1/scrape | Specific data points (followers, bio, posts) |
| Research paper analysis | /v1/extract | Full text content for LLM processing |
Handling Common Challenges
Anti-Bot Protection
Many websites use bot detection (Cloudflare, DataDome, PerimeterX). SnapAPI uses stealth browser profiles that mimic real user behavior, making your scraping requests harder to detect and block. Features include:
- Realistic browser fingerprints (navigator properties, WebGL, Canvas)
- Human-like timing and interaction patterns
- Automatic proxy rotation for high-volume scraping
- Cookie and session management
Rate Limiting and Politeness
Responsible scraping means respecting the target website's resources. Here is a pattern for polite, rate-limited scraping:
import time
import random
def scrape_politely(client, urls, selectors, min_delay=1.0, max_delay=3.0):
"""Scrape multiple URLs with random delays to be polite."""
results = []
for url in urls:
try:
data = client.scrape(url=url, selectors=selectors)
results.append({"url": url, "data": data["data"], "status": "ok"})
except Exception as e:
results.append({"url": url, "error": str(e), "status": "error"})
# Random delay between requests
delay = random.uniform(min_delay, max_delay)
time.sleep(delay)
return results
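Politeness covers the happy path; transient failures (timeouts, temporary blocks) also deserve retries with exponential backoff rather than immediate re-hits. A sketch that wraps any scrape call:

```python
import random
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(), retrying with exponential backoff plus jitter on failure."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # 1s, 2s, 4s, ... plus random jitter to avoid synchronized retries
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```

Usage would look like `with_retries(lambda: client.scrape(url=url, selectors=selectors))`. In production you would catch a narrower exception type than bare `Exception`.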
Handling JavaScript-Heavy SPAs
Single-page applications often load data via API calls after the initial page render. SnapAPI handles this automatically since it uses a real browser, but you may need to add wait conditions:
# For React/Vue/Angular apps
result = client.scrape(
url="https://react-app.com/products",
selectors={"items": ".product-item"},
wait_for_selector=".product-item",
delay=2000, # Extra wait for lazy-loaded content
block_ads=True
)
Building a Complete Data Pipeline
Here is a production-ready pattern that combines scraping with data storage and monitoring:
import json
import time
from datetime import datetime
from pathlib import Path
from snapapi import SnapAPI
client = SnapAPI("YOUR_API_KEY")
def run_competitor_monitor():
"""Daily competitor pricing monitor."""
competitors = [
{
"name": "competitor_a",
"url": "https://competitor-a.com/pricing",
"selectors": {
"plan_names": ".plan-name",
"plan_prices": ".plan-price",
"plan_features": ".plan-features li",
}
},
{
"name": "competitor_b",
"url": "https://competitor-b.com/pricing",
"selectors": {
"plan_names": ".tier-title",
"plan_prices": ".tier-price",
"plan_features": ".tier-feature",
}
},
]
timestamp = datetime.now().isoformat()
results = {"timestamp": timestamp, "data": {}}
for comp in competitors:
try:
scraped = client.scrape(
url=comp["url"],
selectors=comp["selectors"],
block_ads=True,
block_cookie_banners=True
)
results["data"][comp["name"]] = scraped["data"]
print(f" Scraped {comp['name']}: OK")
except Exception as e:
results["data"][comp["name"]] = {"error": str(e)}
print(f" Scraped {comp['name']}: FAILED - {e}")
time.sleep(2)
# Save results
output_dir = Path("pricing_data")
output_dir.mkdir(exist_ok=True)
date_str = datetime.now().strftime("%Y-%m-%d")
with open(output_dir / f"{date_str}.json", "w") as f:
json.dump(results, f, indent=2)
return results
# Run the monitor
data = run_competitor_monitor()
print(json.dumps(data, indent=2))
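Once daily snapshots accumulate, the interesting signal is what changed between runs. Here is a small diff helper over two snapshots in the shape the monitor saves (plan names and prices as parallel lists per competitor); `price_changes` is an illustrative helper, not part of any SDK:

```python
def price_changes(old: dict, new: dict) -> dict:
    """Report plans whose price differs between two scraped snapshots."""
    changes = {}
    for comp, data in new.items():
        prev = old.get(comp, {})
        before = dict(zip(prev.get("plan_names", []), prev.get("plan_prices", [])))
        for plan, price in zip(data.get("plan_names", []), data.get("plan_prices", [])):
            if plan in before and before[plan] != price:
                changes.setdefault(comp, {})[plan] = (before[plan], price)
    return changes

yesterday = {"competitor_a": {"plan_names": ["Pro"], "plan_prices": ["$29"]}}
today = {"competitor_a": {"plan_names": ["Pro"], "plan_prices": ["$39"]}}
print(price_changes(yesterday, today))  # {'competitor_a': {'Pro': ('$29', '$39')}}
```

Feed the result into a Slack webhook or email alert and the monitor becomes actionable instead of just an archive.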
Conclusion
Web data extraction has evolved beyond simple HTML parsing. Modern APIs like SnapAPI handle the hard parts -- JavaScript rendering, bot detection, dynamic content, and browser management -- so you can focus on what to do with the data rather than how to get it.
Key takeaways:
- Use /v1/scrape when you need specific data points from known page structures
- Use /v1/extract when you want clean content for AI/LLM processing
- Always implement rate limiting and respect target websites
- Cache results to minimize redundant API calls
- Build monitoring pipelines for ongoing data collection
Start extracting web data today
200 free requests per month. Scrape, extract, and screenshot with a single API.
Get Your Free API Key