Headless Browsers with Proxies: Setup and Best Practices

Sophie Marchand · 15 min read

Set up headless browsers with proxies using Playwright and Puppeteer. Covers stealth configuration, resource optimization, and when headless is overkill.

When You Actually Need a Headless Browser

A headless browser is a full browser engine running without a graphical interface, controlled programmatically through an API. It renders HTML, executes JavaScript, processes CSS, handles cookies, and maintains sessions exactly like a browser you would use manually — but automated and invisible. The question is not whether headless browsers are powerful. They are. The question is whether you need that power for your specific task.

You need a headless browser when the target site renders content with JavaScript. Single-page applications built with React, Vue, Angular, or Svelte load an empty HTML shell and populate it through JavaScript execution. The initial HTML response contains no usable data — the content exists only after the framework mounts and fetches data from APIs. Raw HTTP clients receive the empty shell. A headless browser receives the same shell, executes the JavaScript, waits for the data to render, and then you extract the populated DOM.

You need a headless browser when the site requires interaction to access data. Infinite scroll pages that load content as you scroll, tabs that reveal hidden content when clicked, accordions that expand on interaction, modals that open after button clicks — all require a browser that can simulate user actions. Some sites gate data behind login flows that require JavaScript execution and cookie management that exceeds what simple HTTP clients handle cleanly.

You need a headless browser when anti-bot defenses require JavaScript challenge solving. Cloudflare Managed Challenge, Akamai Bot Manager sensor scripts, and PerimeterX HUMAN challenges all demand a browser environment that executes their verification JavaScript. Without a browser, you cannot pass these challenges and cannot access the protected content behind them.

Puppeteer vs Playwright vs Selenium

Three frameworks dominate headless browser automation. Each has distinct strengths, and the right choice depends on your scraping requirements and technical stack.

Playwright is the recommended choice for most scraping operations in 2026. Built by Microsoft, it supports Chromium, Firefox, and WebKit (Safari's engine) from a single API. Its browser context model allows running multiple isolated sessions within a single browser instance, each with its own cookies, storage, and proxy configuration. This is ideal for scraping: one browser process hosts dozens of contexts, each routed through different proxy IPs. Playwright's auto-wait mechanism handles dynamic content loading without manual sleep statements, and its network interception API enables request filtering and modification. Available in Python, Node.js, Java, and .NET.

Puppeteer is Google's automation library for Chrome/Chromium. It has a mature ecosystem and deep Chrome integration but only supports Chromium-based browsers. Puppeteer's plugin ecosystem (via puppeteer-extra) includes the widely used stealth plugin for anti-detection. It is slightly lighter weight than Playwright for Chrome-only tasks. Node.js is the primary language, with a Python port (pyppeteer) that lags behind in features.

Selenium is the oldest framework, supporting all major browsers through the WebDriver protocol. Its advantage is broad language support (Python, Java, C#, Ruby, JavaScript) and extensive documentation. However, Selenium is slower than Playwright and Puppeteer because WebDriver uses HTTP-based communication rather than direct browser protocol connections. Selenium is also the most easily detected by anti-bot systems due to its distinctive browser flags. For new scraping projects, prefer Playwright or Puppeteer over Selenium.

Configuring Proxies in Headless Browsers

Proxy configuration in headless browsers routes all browser traffic — page loads, API calls, asset fetches, WebSocket connections — through the proxy server. This is fundamentally different from HTTP-level proxying, where you proxy individual requests. Browser-level proxying ensures that every network operation uses the proxy, maintaining consistency that anti-bot systems verify.

In Playwright, proxy configuration happens at two levels. Browser-level proxy applies to all contexts created from that browser instance: pass the proxy parameter when launching the browser with server, username, and password fields. Context-level proxy overrides the browser default for specific contexts, enabling per-context proxy assignment. This is the most flexible approach: launch one browser, create 20 contexts, each with a different Databay proxy endpoint for geographic diversity.
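A minimal sketch of the two levels, assuming Playwright's Python API. The dict shape (server, username, password) is what Playwright's proxy parameter expects; the endpoint and credentials below are placeholders. To keep the sketch runnable without a browser installed, only the configuration dicts are built here — the commented lines show where they would be passed.

```python
# Proxy settings in the shape Playwright expects for its `proxy` parameter.
# Values are placeholders -- substitute your own proxy endpoint and credentials.

def proxy_config(server: str, username: str, password: str) -> dict:
    """Build a Playwright-compatible proxy configuration dict."""
    return {"server": server, "username": username, "password": password}

# Browser-level default: applies to every context unless overridden.
browser_proxy = proxy_config("http://proxy.example.com:8000", "user", "pass")

# Context-level overrides: one per geography/IP.
context_proxies = [
    proxy_config(f"http://proxy.example.com:{8000 + i}", "user", "pass")
    for i in range(3)
]

# With Playwright installed, these dicts would be used as:
#   browser = p.chromium.launch(proxy=browser_proxy)
#   context = browser.new_context(proxy=context_proxies[0])
```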

In Puppeteer, proxy configuration uses Chrome launch arguments. Pass --proxy-server=protocol://host:port as a launch argument. Authentication is handled separately — Puppeteer does not support proxy auth via launch arguments, so you intercept the authentication challenge using page.authenticate() with username and password. For multiple proxy configurations, you need separate browser instances or use a local proxy router that handles per-request proxy selection.
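Because Puppeteer splits the proxy address (a launch argument) from the credentials (a page.authenticate() call), it helps to derive both pieces from one authenticated proxy URL. A small helper, sketched in Python for illustration (the hostname is hypothetical):

```python
from urllib.parse import urlsplit

def split_proxy(proxy_url: str) -> tuple[list[str], dict]:
    """Split an authenticated proxy URL into the two pieces Puppeteer needs:
    a --proxy-server launch argument (address only, no credentials) and the
    username/password payload for page.authenticate()."""
    parts = urlsplit(proxy_url)
    launch_args = [f"--proxy-server={parts.scheme}://{parts.hostname}:{parts.port}"]
    credentials = {"username": parts.username, "password": parts.password}
    return launch_args, credentials

args, creds = split_proxy("http://user:pass@gate.example.com:8000")
# In Node, these map to:
#   const browser = await puppeteer.launch({ args });
#   await page.authenticate(creds);
```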

Authentication with proxy services typically uses username:password credentials passed with each connection. Databay's proxy endpoints support both basic authentication (username:password in the proxy URL) and IP whitelisting (pre-authorized IPs that connect without credentials). For headless browsers, basic authentication is more practical because browser instances may run on varying infrastructure with different outbound IPs. Include authentication credentials in the proxy configuration and verify connectivity before starting your scraping workflow.

Stealth Configuration: Avoiding Headless Detection

Default headless browser installations expose dozens of signals that anti-bot systems use to detect automation. Stealth configuration patches these signals to make automated browsers indistinguishable from manually operated ones.

The most obvious detection vector is the navigator.webdriver property, which is set to true in automated browsers. Anti-bot scripts check this property early in their evaluation. Stealth patches override it to return false or undefined. Beyond this, Chrome's headless mode modifies the window.chrome object, omits the Chrome plugins array, exposes specific WebGL renderer strings (SwiftShader instead of a real GPU), and sets permissions API responses differently than headed Chrome.

For Puppeteer, the puppeteer-extra-plugin-stealth package patches 10+ detection vectors: webdriver property, Chrome runtime, plugin array, language settings, WebGL vendor/renderer, broken image dimensions, media codec support, and notification permissions. Installation is straightforward — wrap the standard Puppeteer import with puppeteer-extra and add the stealth plugin.

For Playwright, stealth is less plug-and-play. Community-maintained patches apply similar fixes through browser context configuration and page.addInitScript() calls that modify browser APIs before page scripts execute. Key patches include overriding navigator.webdriver, injecting a realistic plugins array, fixing the window.chrome object, and ensuring consistent WebGL output. The Playwright team has also improved built-in stealth in recent versions, but dedicated patches remain necessary for the most aggressive anti-bot systems.
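The core of such a patch is a JavaScript snippet registered to run before any page script. The snippet below is a simplified illustration of the common overrides, not a complete stealth package — production setups should use a maintained stealth library, since real anti-bot systems check far more than these four properties. With Playwright's Python API it would be registered via context.add_init_script().

```python
# A simplified stealth init script -- illustrative only.
STEALTH_PATCH = """
// Hide the automation flag that anti-bot scripts check first.
Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
// Headless Chrome omits window.chrome; restore a minimal shape.
window.chrome = window.chrome || { runtime: {} };
// Headless Chrome reports an empty plugins array; inject a non-empty one.
Object.defineProperty(navigator, 'plugins', { get: () => [1, 2, 3] });
// Keep navigator.languages consistent with the Accept-Language header you send.
Object.defineProperty(navigator, 'languages', { get: () => ['en-US', 'en'] });
"""

# With Playwright installed:
#   context = browser.new_context()
#   context.add_init_script(STEALTH_PATCH)  # runs before every page's own JS
```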

Stealth is an arms race. Anti-bot systems continuously discover new detection vectors, and stealth patches must be updated in response. Test your stealth configuration regularly against detection test pages (bot.sannysoft.com, browserleaks.com) and update patches when new browser versions release or when success rates decline on protected targets.

Resource Optimization: Reducing Bandwidth Through Proxies

Headless browsers load entire web pages — HTML, CSS, JavaScript, images, fonts, videos, tracking scripts, and advertising frames. When all this traffic routes through proxy bandwidth, costs multiply. A single product page that transfers 2MB in a normal browser delivers perhaps 50-100KB of actual extractable data — a 20-40x overhead paid in proxy bandwidth. Resource optimization cuts proxy bandwidth dramatically without affecting data extraction.

Request interception is the primary optimization tool. Both Playwright and Puppeteer provide APIs to intercept network requests before they execute and decide whether to allow, block, or modify each one. Block requests for images (png, jpg, gif, webp, svg), fonts (woff, woff2, ttf), stylesheets (css), and media files (mp4, webm) when you only need text data. This typically eliminates 60-80% of transferred bytes.
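The filtering decision itself is a small predicate. A sketch, assuming Playwright's resource-type labels (image, font, stylesheet, media, document, and so on); the commented handler shows where it would attach.

```python
# Resource types to block for text-only extraction. The labels match the
# values Playwright reports via request.resource_type.
BLOCKED_TYPES = {"image", "font", "media"}                  # safe defaults
BLOCKED_TYPES_AGGRESSIVE = BLOCKED_TYPES | {"stylesheet"}   # only if layout is irrelevant

def should_block(resource_type: str, aggressive: bool = False) -> bool:
    """Decide whether to abort a request based on its resource type."""
    blocked = BLOCKED_TYPES_AGGRESSIVE if aggressive else BLOCKED_TYPES
    return resource_type in blocked

# With Playwright installed, wire it up as:
#   def handler(route):
#       if should_block(route.request.resource_type):
#           route.abort()
#       else:
#           route.continue_()
#   page.route("**/*", handler)
```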

Be selective about what you block. Blocking all CSS can break layout-dependent extraction (if you need to check visibility or computed styles). Blocking JavaScript files is risky — the wrong blocked script might be the one that renders your target data or the one that passes the anti-bot challenge. A safe default: block images, fonts, and media. Block CSS only if you are extracting from the DOM without layout calculations. Never block first-party JavaScript on sites with anti-bot protection.

Additional optimizations include disabling unnecessary browser features. Turn off geolocation prompts, notification requests, and background sync. Disable the cache if sessions are single-use (cache storage consumes memory). Set a viewport size that is common but not oversized — 1920x1080 is standard. These tweaks individually save small amounts of bandwidth and memory, but they compound significantly across thousands of browser sessions.

Managing Browser Contexts for Session Isolation

Browser contexts are Playwright's mechanism for running isolated sessions within a single browser process. Each context has its own cookies, local storage, session storage, and cache — completely independent from other contexts. This maps directly to the scraping requirement of maintaining separate identities across concurrent sessions.

The architecture is one browser process, many contexts. Launch a single Chromium instance, then create contexts as needed. Each context gets its own proxy assignment, viewport configuration, locale, and timezone. From the target website's perspective, each context appears to be a different user on a different device in a different location. From your infrastructure's perspective, it is a single process consuming shared memory for the browser engine while isolating session state per context.

Map each context to a specific proxy IP for the duration of its session. In Playwright, pass the proxy configuration when creating the context. All pages opened within that context route through the assigned proxy. Cookies set by one page in the context are available to other pages in the same context — enabling multi-page workflows like login, navigation, and extraction to share session state. Cookies from one context are invisible to other contexts, preventing cross-contamination between identities.

Context lifecycle management is important at scale. Create a context, perform your scraping session (typically 5-30 minutes of activity on a target domain), extract the data, then close the context to release its memory. Do not reuse contexts across domains or sessions that should be independent — the accumulated cookies and storage create linkability. A pool of 20-50 concurrent contexts per browser instance is typical. If you need more concurrency, launch additional browser instances rather than overloading a single one.
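The context-to-proxy bookkeeping above can be sketched as a simple round-robin pool. The endpoint names are hypothetical; the actual new_context/close calls are left as comments since they need a running browser.

```python
import itertools

# Hypothetical pool of per-context proxy endpoints.
PROXIES = [
    {"server": f"http://proxy.example.com:{8000 + i}",
     "username": "user", "password": "pass"}
    for i in range(5)
]
_next_proxy = itertools.cycle(PROXIES)

def new_session_config(locale: str = "en-US") -> dict:
    """Config for one isolated context: fresh proxy, consistent fingerprint."""
    return {
        "proxy": next(_next_proxy),
        "locale": locale,
        "viewport": {"width": 1920, "height": 1080},
    }

# Lifecycle with Playwright installed:
#   cfg = new_session_config()
#   context = browser.new_context(**cfg)   # isolated cookies/storage
#   ... perform the session, extract data ...
#   context.close()                        # release memory; never reuse across domains

configs = [new_session_config() for _ in range(7)]
```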

Headless vs Headed Mode: When Visibility Matters

Headless mode runs the browser without rendering to a screen, which is the default for scraping because it eliminates GPU rendering overhead. However, some anti-bot systems detect headless mode specifically, and running in headed mode — even on a server without a display — can improve success rates on heavily protected sites.

The detection difference exists because headless Chrome uses a different rendering pipeline than headed Chrome. Certain WebGL operations, canvas rendering paths, and pixel output differ between modes. Anti-bot scripts that perform rendering consistency checks can identify these differences. Chrome's headless mode has improved significantly in recent versions (the "new headless" mode introduced in Chrome 112 is much closer to headed behavior), but edge cases remain detectable by sophisticated systems.

To run headed mode on a Linux server without a physical display, use Xvfb (X Virtual Framebuffer). Xvfb creates a virtual display that the browser renders to, producing pixel-accurate output as if a monitor were connected. The browser does not know it is rendering to a virtual display — all rendering paths behave identically to a real display. Launch Xvfb with a resolution matching your target viewport (xvfb-run with --server-args for resolution), then launch the browser in headed mode within the Xvfb environment.

The tradeoff is resource consumption. Headed mode with Xvfb uses more CPU and memory because it actually performs pixel rendering. For most targets, headless mode with proper stealth patches is sufficient — invest in headed mode only when you have confirmed that a specific target detects headless browsers after your stealth patches are applied. The performance cost of headed mode is roughly 20-40% more CPU and 10-20% more memory per browser instance compared to headless.

Memory Management: Browsers Are RAM-Hungry

Each Chromium instance consumes 150-400MB of RAM depending on the pages loaded, and each context within an instance adds 30-100MB. At scale, memory management determines how many concurrent sessions your infrastructure can support and whether your processes crash unexpectedly.

Set hard limits on concurrent browser instances and contexts. A server with 16GB of usable RAM can reliably run 3-4 browser instances with 10-15 contexts each — roughly 30-60 concurrent scraping sessions. Exceeding these limits causes swap usage (dramatically slowing everything) or out-of-memory kills. Calculate your capacity conservatively: assume 300MB per browser instance plus 75MB per context, and keep total projected usage below 70% of available RAM to leave headroom for OS operations and data processing.

Implement aggressive cleanup. Close pages immediately after extracting data — an open page continues consuming memory for its DOM, JavaScript heap, and cached resources. Close contexts after completing each session rather than leaving them idle. Periodically restart browser instances (every 1-2 hours of continuous operation) to reclaim fragmented memory that accumulates from repeated context creation and destruction. Chromium's memory allocator does not always return freed memory to the OS, so process restart is the definitive cleanup mechanism.

Monitor memory usage programmatically. Track the resident set size of each browser process and set warnings at 80% of your per-process limit. If a single browser process exceeds 1.5GB, it is likely leaking memory through unclosed pages or contexts. Log context creation and destruction to audit for leaks. Browser-level crash recovery is essential — when a browser process dies (and at scale, it will), your orchestrator should detect the failure, clean up the dead process, launch a replacement, and re-queue the failed tasks.

Combining Headless Browsers with Proxy Rotation

The most effective scraping architecture combines headless browsers with rotating proxies in a structured pipeline where each tool handles what it does best. The browser provides authenticity. The proxy provides identity diversity. Together, they pass every detection layer.

The session initialization pattern works as follows. Create a Playwright browser context with a Databay residential proxy endpoint configured for the target geography. Navigate to the target site, allowing the browser to execute JavaScript challenges, load anti-bot sensor scripts, and accumulate cookies and tokens from the challenge-solving process. This initialization phase takes 5-15 seconds and establishes a verified session that the site trusts.

For data extraction, two approaches work depending on the site's requirements. If the site requires JavaScript rendering for every page (true SPAs), continue using the browser context for extraction — navigate to each target page within the authenticated context, wait for content to render, and extract data from the DOM. If the site only requires JavaScript for initial authentication but serves static content for data pages, export the cookies from the browser context and transfer them to a TLS-mimicking HTTP client (curl_cffi with the same Chrome impersonation). The HTTP client makes extraction requests using the browser's authenticated cookies through the same proxy IP.
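A sketch of the cookie handoff in the hybrid case. Playwright's context.cookies() returns a list of dicts with name and value keys (among other fields); converting that into the flat mapping an HTTP client accepts is a one-liner. The curl_cffi call in the comment assumes its requests-style API.

```python
def cookies_for_http_client(playwright_cookies: list[dict]) -> dict:
    """Convert Playwright's context.cookies() output (a list of dicts with
    'name'/'value' keys) into the flat mapping HTTP clients accept."""
    return {c["name"]: c["value"] for c in playwright_cookies}

# Example shape of what context.cookies() returns (fields abridged,
# values hypothetical):
exported = [
    {"name": "session_id", "value": "abc123", "domain": ".example.com"},
    {"name": "cf_clearance", "value": "token456", "domain": ".example.com"},
]
jar = cookies_for_http_client(exported)

# With curl_cffi installed, reuse the browser's session through the SAME
# proxy IP the context used:
#   from curl_cffi import requests
#   resp = requests.get(url, cookies=jar, impersonate="chrome", proxies=...)
```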

The hybrid approach (browser for auth, HTTP client for extraction) is dramatically more efficient. A single browser instance establishes sessions that 50-100 HTTP worker threads then exploit for rapid extraction. The browser handles 5% of the total requests (the authentication-critical ones), and the HTTP clients handle 95% (the volume extraction). This reduces your browser infrastructure needs by an order of magnitude while maintaining authenticated access.

When Headless Is Overkill: Use Raw HTTP Instead

The default assumption should be that you do not need a headless browser. Raw HTTP requests (using TLS-mimicking libraries) are 10x faster, use 100x less memory, and cost a fraction of browser-based scraping in both proxy bandwidth and infrastructure. Reach for a headless browser only when raw HTTP definitively cannot accomplish the task.

Raw HTTP works when the target site returns complete HTML in the initial response without requiring JavaScript execution. Check this by disabling JavaScript in your browser and loading the target page — if the data is visible, raw HTTP can extract it. Many sites that use JavaScript frameworks still implement server-side rendering (SSR) or static site generation (SSG), meaning the full HTML arrives in the first response even though the site appears JavaScript-heavy. Check the initial HTML source for your target data before assuming you need a browser.

Raw HTTP works when you can replicate the API calls that the JavaScript framework makes. Modern SPAs load data through JSON APIs — open browser developer tools, observe the XHR/fetch requests, and replicate them directly. One targeted API request returning structured JSON is infinitely more efficient than rendering an entire page to extract the same data from the DOM. Many scraping tasks that appear to need a browser actually need one API call with the right authentication headers.
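Replicating an observed XHR can be sketched with only the standard library. The endpoint and headers below are hypothetical stand-ins for what you would copy out of the network tab; the request object is built but not sent here.

```python
import urllib.request

# Hypothetical values copied from the browser's network tab.
API_URL = "https://api.example.com/v1/products?page=1"
HEADERS = {
    "Accept": "application/json",
    "Referer": "https://www.example.com/products",
    "User-Agent": "Mozilla/5.0 ...",  # match the browser you observed
}

req = urllib.request.Request(API_URL, headers=HEADERS)

# Sending it (and parsing the JSON) would be:
#   import json
#   with urllib.request.urlopen(req) as resp:
#       data = json.load(resp)
# One such call replaces an entire rendered page load.
```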

The decision framework: start with raw HTTP using curl_cffi or tls-client for TLS authenticity. If the target data is not in the HTML response, inspect network traffic for API endpoints. If APIs exist, call them directly. If the data is only available through JavaScript rendering or the site requires browser-level challenge solving, then deploy a headless browser — for that specific site only. A production scraping system that uses headless browsers for 100% of requests is almost certainly over-engineered and under-optimized. The most efficient operations use browsers for fewer than 20% of their targets.

Frequently Asked Questions

Which headless browser framework should I use for scraping?
Playwright is the best choice for most scraping use cases in 2026. It supports Chromium, Firefox, and WebKit from one API, offers per-context proxy configuration for session isolation, includes built-in auto-wait for dynamic content, and has strong network interception capabilities. Puppeteer is a solid alternative if you only need Chrome support and want access to the puppeteer-extra stealth plugin ecosystem. Avoid Selenium for new projects — it is slower, more detectable, and lacks modern features like context-level proxy configuration.
How many headless browser instances can I run on one server?
A general formula: divide available RAM by 400MB for a ceiling on browser instances — approximately 40 on a 16GB server — but treat that as a memory-only upper bound, not a working configuration. Each context within an instance adds 50-100MB. For practical scraping, 3-5 browser instances with 10-20 contexts each is a reliable configuration for a 16GB server. Monitor memory usage and CPU — browser rendering is CPU-intensive, so CPU saturation (typically at 60-80 concurrent rendering operations) usually becomes the bottleneck before memory does.
Do headless browsers automatically handle proxy authentication?
Playwright handles proxy authentication natively — pass username and password in the proxy configuration when creating a browser or context. Puppeteer requires using page.authenticate() to provide credentials when the proxy sends an authentication challenge. Selenium supports proxy auth through browser capabilities but the implementation varies by browser. For simplicity, many scraping setups use IP-whitelisted proxy access (where the proxy recognizes authorized IPs without credentials), eliminating authentication handling entirely.
Can anti-bot systems detect Playwright or Puppeteer?
Yes, default installations are detectable. Both frameworks modify browser properties (navigator.webdriver, window.chrome, plugin arrays, WebGL renderer) that anti-bot scripts check. Stealth patches address these modifications and significantly improve detection resistance. However, stealth is a continuous arms race — anti-bot providers discover new detection vectors regularly, and patches must be updated. No stealth configuration provides permanent undetectability. Regular testing against your target sites and prompt patch updates are essential for maintaining access.
Is it better to use one browser with many contexts or many browsers with one context each?
One browser with many contexts is more memory-efficient because contexts share the browser engine overhead (about 150-200MB). Each additional context adds only 50-100MB versus 300-400MB for a separate browser instance. Use multiple contexts within a browser for concurrent sessions on the same or similar targets. Use separate browser instances when you need different browser configurations (Chrome vs Firefox), when you want process isolation for stability (one crashing context does not affect others), or when a single browser process approaches its memory limit.

Start Collecting Data Today

35M+ IPs across 200+ countries. Pay as you go, starting at $0.50/GB.
