Master scraping JavaScript sites with proxies. Covers rendering approaches, API interception, infinite scroll handling, and performance optimization.
Why Modern Websites Break Traditional Scrapers
Make a standard HTTP request to a React-based site and the response is a minimal HTML document with a root div and script tags. The actual content (product listings, prices, reviews, search results) isn't in that HTML. It's generated by JavaScript executing in the browser, fetching data from APIs, and injecting rendered HTML into the DOM after page load. A traditional scraper parsing the raw HTTP response sees none of this.
Client-side rendering isn't limited to single-page applications. Even traditionally server-rendered sites lean on JavaScript for critical content delivery:
- Lazy loading: Images and content below the fold load only when the user scrolls, triggered by Intersection Observer APIs.
- Infinite scroll: Pagination replaced by continuous content loading as the user scrolls, new items fetched through API calls.
- Dynamic pricing: Some e-commerce sites load prices through separate JavaScript API calls after the page structure renders, specifically to make price scraping harder.
- Authentication walls: Content behind login screens requires session management through a browser context, not just cookies in HTTP requests.
Scraping JavaScript sites requires a fundamentally different approach than parsing static HTML. Understanding the available techniques, and when to apply each, determines whether your operation succeeds or fails.
Detecting Whether a Site Requires JavaScript Rendering
View-source comparison. Simplest test: open the target page in your browser and use View Page Source (not the DevTools Elements panel, which shows the rendered DOM). Search the raw source for the specific data you need: a product price, a listing title, a review text. If the data is in the source, you can extract it with standard HTTP requests and HTML parsing. If the source shows empty containers, loading spinners, or placeholder text where data should be, you need rendering.
Curl test. Fetch the page with a basic HTTP request using curl or Python's requests and inspect the body. Compare against what the browser renders. Pay attention to the body element. If it's just a single div with an ID like 'root' or 'app' plus script tags, it's a client-side rendered SPA. If the body contains structured HTML with actual content, the initial load is server-rendered.
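A minimal sketch of the same check in Python using requests; the URL and the markers are illustrative:

```python
import requests

# Hypothetical target URL; compare this raw body to what the browser renders.
resp = requests.get(
    "https://example.com/products",
    headers={"User-Agent": "Mozilla/5.0"},
    timeout=10,
)
body = resp.text

# A body that is little more than an empty mount point plus script tags
# usually means a client-side rendered SPA.
if 'id="root"' in body or 'id="app"' in body:
    print("Likely a client-side rendered SPA")

# If a value you saw in the rendered page appears in the raw body,
# plain HTTP plus HTML parsing is enough.
print("$19.99" in body)  # replace with data from your target page
```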
DevTools Network analysis. Open Chrome DevTools, go to the Network tab, reload the page with caching disabled. Watch for XHR or Fetch requests that fire after the initial load and return JSON. Those are the API calls populating the page. If critical data arrives through those secondary requests, you have two options: render the page to trigger those calls naturally, or intercept and replicate the API calls directly. The second is usually faster and cheaper.
JavaScript-disabled browsing. Temporarily disable JavaScript in your browser and reload. Whatever content remains visible is what a non-rendering scraper can access. This shows the gap between the static HTML and the fully rendered page in one step.
Headless Browser Rendering: The Reliable Approach
Playwright is the current best choice for scraping JS sites. It supports Chromium, Firefox, and WebKit, has a clean async API, and includes built-in support for proxy configuration, request interception, and waiting strategies. Unlike Puppeteer (Chromium only), Playwright's multi-browser support lets you match your browser engine to the site's expectations. Some anti-bot systems flag Chromium-specific behaviours that Firefox handles differently.
Proxy integration with headless browsers is straightforward but detail-sensitive. Configure the proxy at browser launch so all traffic (the HTML page, JavaScript bundles, API calls, images, fonts) routes through the same proxy IP. That consistency is critical because anti-bot systems verify that all requests in a session originate from the same IP. With Databay's residential proxies, configure the proxy endpoint with authentication credentials in the browser launch options.
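A minimal sketch using Playwright's Python API; the proxy endpoint and credentials are placeholders for your actual Databay details:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Every request (HTML, JS bundles, API calls, fonts) exits via this proxy.
    browser = p.chromium.launch(
        proxy={
            "server": "http://proxy.example.com:8000",  # placeholder endpoint
            "username": "YOUR_USERNAME",
            "password": "YOUR_PASSWORD",
        }
    )
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())
    browser.close()
```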
Browser contexts in Playwright enable parallel scraping with different proxy IPs. Each context is an isolated browser session with its own cookies, local storage, and proxy configuration. Launch one browser instance with 5 to 10 contexts, each on a different residential IP, to scrape multiple pages in parallel without session interference. This scales well because browser contexts share the underlying browser process, using far less memory than separate instances.
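A sketch of parallel contexts over different proxies, using the async API. The proxy list is a placeholder; note that Playwright requires a dummy proxy at launch when contexts supply their own:

```python
import asyncio
from playwright.async_api import async_playwright

# Placeholder entries; one residential proxy per concurrent session.
PROXIES = [
    {"server": "http://res1.example:8000", "username": "u", "password": "p"},
    {"server": "http://res2.example:8000", "username": "u", "password": "p"},
]

async def scrape_in_context(browser, proxy, url):
    # Isolated session: own cookies, storage, and proxy IP.
    context = await browser.new_context(proxy=proxy)
    page = await context.new_page()
    await page.goto(url, wait_until="domcontentloaded")
    title = await page.title()
    await context.close()
    return title

async def main():
    async with async_playwright() as p:
        # Per-context proxies need a placeholder proxy at launch.
        browser = await p.chromium.launch(proxy={"server": "http://per-context"})
        results = await asyncio.gather(
            *(scrape_in_context(browser, px, "https://example.com") for px in PROXIES)
        )
        await browser.close()
        print(results)

asyncio.run(main())
```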
Intercepting API Calls: The Fast and Efficient Approach
API discovery. Open Chrome DevTools, switch to Network, filter by XHR/Fetch, and interact with the page (load, scroll, click, paginate). Watch for requests that return JSON containing the data you need. Product listings, search results, pricing, and reviews commonly arrive through API endpoints returning structured JSON. Note the full request URL, headers, query parameters, and any authentication tokens.
Replicating API calls. Once you identify the endpoint, recreate the request outside the browser using Python's requests or similar. The critical challenge is reproducing the headers and auth tokens the browser sends (a sketch follows the list below). Most internal APIs require:
- A valid session cookie or auth token (obtained by loading the page once in a browser or through a login flow).
- Specific request headers including User-Agent, Accept, and often custom headers like X-Requested-With or platform-specific API keys embedded in the page's JavaScript.
- Correct Content-Type and request body format for POST endpoints.
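A hedged sketch of replicating a discovered endpoint with requests; every URL, header name, and token below is illustrative and must be captured from DevTools on your actual target:

```python
import requests

session = requests.Session()
# Route the replicated call through the same proxy infrastructure.
session.proxies = {
    "http": "http://user:pass@proxy.example.com:8000",   # placeholder
    "https": "http://user:pass@proxy.example.com:8000",  # placeholder
}
session.headers.update({
    "User-Agent": "Mozilla/5.0",           # match the browser's UA
    "Accept": "application/json",
    "X-Requested-With": "XMLHttpRequest",  # commonly expected by internal APIs
})
# Token obtained from an initial browser load or login flow.
session.cookies.set("session_id", "TOKEN_FROM_BROWSER")

resp = session.get(
    "https://example.com/api/v1/products",      # hypothetical endpoint
    params={"category": "laptops", "page": 1},  # hypothetical parameters
    timeout=15,
)
resp.raise_for_status()
data = resp.json()  # structured JSON, no rendering required
```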
The speed advantage is dramatic. An API call returning JSON in 200ms and 5KB replaces a full page render that takes 3 to 5 seconds and consumes 500KB to 2MB. That's 10 to 20x higher throughput and 90%+ bandwidth savings. When you scrape JS sites through proxy infrastructure, the bandwidth reduction directly cuts costs, especially on residential proxies priced per GB.
The risk: internal APIs change without warning. They aren't public contracts, so the endpoint URL, parameter format, or auth mechanism can change with any deployment. Build monitoring that catches API changes (unexpected response formats, auth failures) and alerts you immediately so you can update the integration.
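A minimal shape check you can run on every response so schema drift fails loudly instead of silently corrupting collected data; the expected keys are illustrative and should come from responses captured during discovery:

```python
def validate_response_shape(payload: dict) -> None:
    # Keys observed during discovery; adjust per endpoint.
    expected_keys = {"items", "total", "page"}
    missing = expected_keys - payload.keys()
    if missing:
        raise RuntimeError(f"API schema changed; missing keys: {missing}")
    if not isinstance(payload["items"], list):
        raise RuntimeError("API schema changed: 'items' is no longer a list")
```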
The Hybrid Approach: Render Once, Request Many
Phase 1: Browser-based discovery. Run a headless browser with network interception enabled to load your target pages. As the browser renders each page, capture every API request and response: URLs, headers, parameters, response bodies. Automated discovery identifies every data-fetching endpoint the site uses, including ones that are difficult to find through manual DevTools inspection (requests triggered by scroll events, delayed loads, interaction-dependent fetches).
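One way to sketch this capture with Playwright's Python API; the URL is a placeholder, and the scroll nudge surfaces scroll-triggered fetches:

```python
from playwright.sync_api import sync_playwright

captured = []

def on_response(response):
    # Record every JSON response the page triggers while rendering.
    if "application/json" in response.headers.get("content-type", ""):
        captured.append({
            "url": response.url,
            "status": response.status,
            "request_headers": response.request.headers,
        })

with sync_playwright() as p:
    page = p.chromium.launch().new_page()
    page.on("response", on_response)
    page.goto("https://example.com/products", wait_until="networkidle")
    page.mouse.wheel(0, 2000)    # trigger scroll-dependent fetches
    page.wait_for_timeout(2000)

for entry in captured:
    print(entry["status"], entry["url"])
```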
Phase 2: API endpoint validation. For each discovered endpoint, test whether you can replicate the call outside the browser with a simple HTTP request. Send the same URL, headers, and parameters through your proxy, then compare the response against what the browser received. Some endpoints will work identically; those are your targets for direct HTTP collection. Others may require browser-generated tokens that expire quickly, which forces you to periodically refresh tokens through a browser session.
Phase 3: Production collection. Build your production scraper to use direct HTTP for every endpoint that works without browser-generated tokens. This typically handles 60 to 80% of data collection needs. For the remaining endpoints requiring fresh session tokens, run periodic browser sessions (every 15 to 30 minutes) solely to generate tokens, then use those tokens in HTTP requests until they expire.
The economics are compelling. If full rendering costs $X per page in compute and proxy bandwidth, and a direct API call costs $X/15, then shifting 80% of your traffic to API calls brings the average cost to 0.2X + 0.8(X/15) ≈ 0.25X per page, roughly a 75% reduction. At scale (millions of pages a month), that's thousands of dollars saved while keeping data completeness the same.
Optimising Rendering Costs with Proxy Bandwidth Management
Block unnecessary resources. Playwright and Puppeteer both support request interception, which lets you abort requests before they're sent. Block the following resource types, which contribute zero value to extraction (a route-interception sketch follows this list):
- Images: Unless you're specifically scraping image URLs (which appear in the HTML without loading the actual file), block all image requests. This alone saves 40 to 60% of page bandwidth.
- Fonts: Custom web fonts are purely visual. Block them.
- Tracking scripts: Google Analytics, Facebook Pixel, Hotjar, and similar analytics scripts burn bandwidth and execution time while contributing nothing to your scraper. Block requests to known analytics domains.
- Video and media: Block video preloads and audio files unless they're your scraping target.
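A route-interception sketch in Playwright's Python API; the domain blocklist is illustrative and should be extended from your own network traces:

```python
from playwright.sync_api import sync_playwright

BLOCKED_TYPES = {"image", "font", "media"}
# Illustrative blocklist; extend with the analytics hosts your target loads.
BLOCKED_DOMAINS = ("google-analytics.com", "facebook.net", "hotjar.com")

def handle_route(route):
    request = route.request
    if request.resource_type in BLOCKED_TYPES:
        return route.abort()
    if any(domain in request.url for domain in BLOCKED_DOMAINS):
        return route.abort()
    return route.continue_()

with sync_playwright() as p:
    page = p.chromium.launch().new_page()
    page.route("**/*", handle_route)  # intercept every request
    page.goto("https://example.com")
```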
Limit CSS loading selectively. CSS files are needed if anti-bot systems check for CSS-dependent rendering behaviour, but most scrapers can block third-party stylesheets (marketing tools, chat widgets) while keeping the site's primary stylesheet.
Cache static resources. JavaScript bundles and CSS files rarely change between page loads. Configure a local cache that stores them after the first download and serves them from cache on subsequent loads. This is particularly effective when scraping multiple pages on the same site: the first page downloads all assets, and subsequent pages reuse cached bundles. It can reduce proxy bandwidth by 50 to 70% across multi-page sessions.
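A naive in-memory sketch using route interception; it assumes a recent Playwright release where route.fetch() is available, and it echoes back only the content type since the cached body is already decoded:

```python
from playwright.sync_api import sync_playwright

cache = {}  # url -> (body, content_type)

def caching_handler(route):
    if route.request.resource_type not in ("script", "stylesheet"):
        return route.continue_()
    url = route.request.url
    if url in cache:
        body, content_type = cache[url]
        return route.fulfill(body=body, content_type=content_type)
    response = route.fetch()  # goes through the proxy exactly once
    body = response.body()
    content_type = response.headers.get("content-type", "")
    cache[url] = (body, content_type)
    return route.fulfill(body=body, content_type=content_type)

with sync_playwright() as p:
    page = p.chromium.launch().new_page()
    page.route("**/*", caching_handler)
    for url in ("https://example.com/p/1", "https://example.com/p/2"):
        page.goto(url)  # second load serves bundles from the local cache
```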
Waiting Strategies for Complete Data Extraction
Wait for specific selectors. The most reliable approach: identify a CSS selector matching an element containing your target data, then wait for that element to appear in the DOM. In Playwright, use page.waitForSelector() with a timeout. This is better than waiting a fixed time because it adapts to actual rendering speed: fast pages go through quickly; slow pages get the time they need without arbitrary delays.
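In the Python API the method is wait_for_selector(); a minimal sketch with a hypothetical selector:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    page = p.chromium.launch().new_page()
    page.goto("https://example.com/products")  # placeholder URL
    # Block until the element carrying the target data exists in the DOM.
    page.wait_for_selector(".product-card .price", timeout=15_000)  # ms
    prices = page.locator(".product-card .price").all_inner_texts()
    print(prices)
```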
Wait for network idle. Playwright's waitForLoadState('networkidle') waits until no network requests have fired for 500ms. It works well for pages that load all data through a burst of API calls after the initial render. The risk: some pages have persistent network activity (analytics heartbeats, WebSocket connections, polling requests) that prevents networkidle from ever triggering. Set a reasonable timeout (10 to 15 seconds) as a backstop.
Wait for DOM stability. For complex pages where data loads progressively, monitor the DOM for changes using a MutationObserver. Wait until the DOM has been stable (no nodes added or modified) for a defined period; 1 to 2 seconds of stability usually indicates that dynamic content loading is done. This is more robust than networkidle for pages with ongoing background network activity.
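A sketch of a DOM-stability wait through an injected MutationObserver; the 1.5-second quiet window is illustrative:

```python
from playwright.sync_api import sync_playwright

# Resolves once no DOM mutations have occurred for quietMs milliseconds.
STABILITY_SCRIPT = """
(quietMs) => new Promise(resolve => {
    let timer;
    const done = () => { observer.disconnect(); resolve(); };
    const observer = new MutationObserver(() => {
        clearTimeout(timer);
        timer = setTimeout(done, quietMs);
    });
    observer.observe(document.body, { childList: true, subtree: true, attributes: true });
    timer = setTimeout(done, quietMs);
})
"""

with sync_playwright() as p:
    page = p.chromium.launch().new_page()
    page.goto("https://example.com/feed")  # placeholder URL
    page.evaluate(STABILITY_SCRIPT, 1500)  # blocks until 1.5s of DOM quiet
```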
Avoid fixed delays. Never use a hardcoded sleep(5) as your waiting strategy. Fixed delays are either too short (missing data on slow loads) or too long (wasting time and proxy bandwidth on fast loads). Every second a headless browser sits idle is compute cost and proxy session time burned with zero data collected. Use event-based waiting, and reserve fixed delays only as a minimum floor combined with an event-based primary wait.
Handling Infinite Scroll and Lazy Loading
Scroll simulation approach. The direct method: programmatically scroll the page in increments, wait for new content to load, and repeat until no more content appears. In Playwright, execute JavaScript to scroll the viewport by a set amount, wait for new elements (using waitForSelector on the content container), and check whether the content count increased. Stop when two consecutive scroll cycles produce no new content, meaning you've reached the bottom.
Use realistic scroll behaviour: vary scroll distances (300 to 800px per scroll), add random delays between scrolls (1 to 3 seconds), occasionally scroll up slightly before continuing down. Anti-bot systems on sites using infinite scroll often monitor scroll patterns, and perfectly uniform scrolling is detectable.
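A sketch of the scroll loop with randomised behaviour; the '.feed-item' selector and the timings are illustrative:

```python
import random
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    page = p.chromium.launch().new_page()
    page.goto("https://example.com/feed")  # placeholder URL

    previous_count, idle_cycles = 0, 0
    while idle_cycles < 2:  # stop after two scrolls that load nothing new
        page.mouse.wheel(0, random.randint(300, 800))      # vary distance
        page.wait_for_timeout(random.randint(1000, 3000))  # human-like pause
        if random.random() < 0.1:
            page.mouse.wheel(0, -150)  # occasional slight upward scroll
        count = page.locator(".feed-item").count()
        idle_cycles = idle_cycles + 1 if count == previous_count else 0
        previous_count = count
```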
Pagination parameter discovery. The better approach: most infinite scroll implementations fetch data from an API endpoint with a pagination parameter (a page number, offset, or cursor). Inspect the network requests triggered by scrolling in DevTools and identify the data-fetching API call. It typically uses a parameter like ?page=2, ?offset=20, or ?cursor=abc123. Once identified, call the API directly with incrementing pagination parameters; no scrolling or browser rendering needed. This is 10 to 50x faster and uses a fraction of the bandwidth.
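Once the endpoint is known, a direct pagination loop is a few lines; the endpoint, parameter names, and response shape below are hypothetical stand-ins for what you find in DevTools:

```python
import requests

session = requests.Session()
session.headers["User-Agent"] = "Mozilla/5.0"

page_num, items = 1, []
while True:
    resp = session.get(
        "https://example.com/api/feed",             # hypothetical endpoint
        params={"page": page_num, "per_page": 20},  # hypothetical parameters
        timeout=15,
    )
    resp.raise_for_status()
    batch = resp.json().get("items", [])
    if not batch:
        break  # an empty page signals the end of the feed
    items.extend(batch)
    page_num += 1
```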
Lazy-loaded content. Images and secondary content that load when scrolled into view use Intersection Observer or scroll event listeners. For images, the actual URL is typically stored in a data attribute (data-src) rather than the src attribute, with JavaScript swapping them on visibility. Extract URLs from data attributes directly; no scrolling required. For lazy-loaded text content, trigger loading by scrolling to the element's position or by calling the loading function directly through JavaScript injection.
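Pulling lazy-image URLs straight from data attributes, with no scrolling involved; the attribute name varies by site, data-src being the common convention:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    page = p.chromium.launch().new_page()
    page.goto("https://example.com/gallery")  # placeholder URL
    # Read the real image URLs before the lazy loader ever swaps them in.
    urls = page.eval_on_selector_all(
        "img[data-src]",
        "imgs => imgs.map(img => img.getAttribute('data-src'))",
    )
    print(urls)
```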
Extracting Data from Shadow DOM and Web Components
Understanding the problem. Shadow DOM is designed to encapsulate component internals. Styles and DOM structure inside a shadow root don't leak out, and external queries can't reach in. A product price rendered inside a custom Web Component with shadow DOM won't appear in document.querySelectorAll('.price') results if the .price element is inside the shadow root. By design, not a bug, but it breaks naive scraping approaches.
Piercing shadow DOM in Playwright. Playwright handles shadow DOM more gracefully than raw JavaScript selectors. Its default selector engine automatically pierces open shadow roots, so page.locator('.price') finds elements with class 'price' even inside shadow DOM boundaries. That makes Playwright the preferred tool for scraping sites that use Web Components extensively. This only works for open shadow roots; closed shadow roots are intentionally inaccessible and need alternative approaches.
JavaScript-based extraction. When Playwright's built-in piercing isn't enough, inject JavaScript that explicitly traverses shadow boundaries. Access the shadow root through element.shadowRoot, then query within it: document.querySelector('my-component').shadowRoot.querySelector('.price'). For deeply nested shadow DOMs (shadow root inside shadow root), chain the traversals. Build recursive traversal functions that walk the full DOM tree including all shadow boundaries.
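A sketch of that recursive traversal injected through page.evaluate(); the '.price' selector is illustrative, and only open shadow roots are reachable:

```python
from playwright.sync_api import sync_playwright

# Collect matches for a selector across the document and every open shadow root.
PIERCE_SCRIPT = """
(selector) => {
    const results = [];
    const walk = (root) => {
        results.push(...root.querySelectorAll(selector));
        for (const el of root.querySelectorAll('*')) {
            if (el.shadowRoot) walk(el.shadowRoot);  // recurse into nested roots
        }
    };
    walk(document);
    return results.map(el => el.textContent.trim());
}
"""

with sync_playwright() as p:
    page = p.chromium.launch().new_page()
    page.goto("https://example.com/product")  # placeholder URL
    prices = page.evaluate(PIERCE_SCRIPT, ".price")
    print(prices)
```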
Practical prevalence. Shadow DOM scraping issues are increasingly common but still affect a minority of targets. Major e-commerce platforms and most content sites use standard DOM without shadow encapsulation. You're most likely to hit shadow DOM on sites built with frameworks like Lit, Stencil, or native Web Components, common in enterprise SaaS applications and some modern e-commerce storefronts. Test for shadow DOM during your initial site analysis rather than debugging it in production.
When to Build a Custom Renderer vs Using a Scraping API
Build your own when:
- You scrape a small number of high-value targets (under 20 sites) where you need precise control over rendering behaviour, timing, and extraction logic.
- Your data extraction needs complex multi-step interactions: logins, navigating multi-page flows, filling forms, or triggering specific UI actions to expose hidden data.
- Volume justifies the infrastructure investment. At millions of pages per month, the per-page cost of running your own cluster drops below API pricing.
- You need to maintain session state across requests (authenticated sessions, shopping carts, user profiles) that third-party APIs can't replicate.
Use a scraping or rendering API when:
- You scrape many different sites with standard extraction needs (product data, article text, search results) where per-site customisation is minimal.
- Your team lacks the infrastructure expertise to manage headless browser clusters, handle memory leaks, manage browser crashes, and optimise concurrency.
- Time-to-value matters more than per-unit cost. APIs get you collecting data in hours rather than weeks of development.
- Your volume is moderate (under 100,000 pages per month) where the API per-request cost is acceptable and building infrastructure isn't economically justified.
The proxy layer stays constant regardless of approach. Whether you run your own headless browsers or use a rendering API, the requests still originate from IP addresses the target sites evaluate for legitimacy. Residential proxies from Databay work with both approaches: configure them as the proxy endpoint for your headless browser instances, or pass them as parameters to rendering APIs that support custom proxy configuration. Proxy quality determines your success rate against anti-bot systems regardless of your rendering strategy.
Performance Benchmarks: Rendering vs Direct HTTP
Direct HTTP requests (no rendering):
- Speed: 100 to 500ms per page including proxy latency.
- Bandwidth: 20 to 100KB per page (HTML only).
- Concurrency: 50 to 200 simultaneous requests per worker with async HTTP.
- Memory: Minimal, under 100MB for the entire process.
- Best for: Server-rendered sites, discovered API endpoints, static content.
Headless browser (full rendering):
- Speed: 3 to 15 seconds per page depending on site complexity.
- Bandwidth: 500KB to 5MB per page including all resources (reducible to 200KB to 1MB with resource blocking).
- Concurrency: 5 to 15 simultaneous pages per worker (limited by RAM and CPU).
- Memory: 200 to 500MB per browser context, 1 to 4GB total per worker.
- Best for: SPAs, sites requiring JS execution, complex interaction flows.
Hybrid (render to discover APIs, then direct HTTP):
- Speed: 3 to 15 seconds for initial discovery, then 100 to 500ms per subsequent request.
- Bandwidth: High for discovery phase, minimal for production collection.
- Concurrency: Matches direct HTTP after the discovery phase.
- Best for: High-volume scraping of JS sites where API endpoints are discoverable.
The numbers make the optimisation case clear. Scraping 100,000 pages per month through residential proxies, full rendering at 2MB average per page consumes roughly 200GB of proxy bandwidth. The hybrid approach at 200KB average drops that to 20GB, a 10x reduction. At residential proxy rates, that's hundreds of dollars per month. Multiply by scale and the rendering approach you choose becomes one of the largest cost drivers in your scraping operation.
