Master multi-threaded scraping with proxies. Covers threading vs async, proxy-per-thread models, connection pooling, backpressure, and concurrency debugging.
Why Concurrency Matters for Web Scraping
The numbers make this concrete. A single HTTP request to a well-performing website takes 200-500 milliseconds round trip. A sequential scraper maxes out at 2-5 pages per second. The same hardware running 50 concurrent requests can fetch 100-250 pages per second — a 50x throughput increase with almost no additional CPU cost, because the time is spent waiting on the network, not computing.
This is where multi-threaded scraping with proxies becomes essential. Concurrency lets you issue dozens or hundreds of requests simultaneously, each through a different proxy, each waiting for its own response independently. But concurrency also introduces complexity: shared state problems, resource exhaustion, connection management, and error propagation all become harder when requests run in parallel.
The payoff justifies the complexity. A price monitoring system scraping 50,000 product pages needs 14 hours sequentially. With 100 concurrent connections through a rotating proxy pool, the same job finishes in 8 minutes. For time-sensitive data — competitive pricing, news monitoring, inventory tracking — concurrency is not an optimization. It is a requirement.
Threading vs Async vs Multiprocessing
Threading is the most intuitive model. Each thread runs an independent scraping task, and the operating system handles scheduling. In Python, the Global Interpreter Lock (GIL) limits threads to one executing Python bytecode at a time — but this barely matters for scraping because threads release the GIL during I/O operations (network calls, file writes). Python threads work well for scraping workloads up to 50-100 concurrent connections. In Java, Go, or C#, threads (or goroutines) scale further without a GIL limitation.
Async/await is the best model for high-concurrency I/O-bound work. A single thread manages thousands of concurrent connections using an event loop. Python's asyncio with aiohttp or httpx, JavaScript's native async model, and Rust's tokio all support this pattern. Async scales to thousands of concurrent requests on a single process with minimal memory overhead — each connection costs kilobytes rather than the megabytes required per thread. The tradeoff is code complexity: async code requires careful handling of coroutines, and debugging is harder than sequential code.
Multiprocessing provides true parallelism by running separate OS processes. Use this when CPU-bound work (HTML parsing, data transformation, image processing) is the bottleneck, not network I/O. A common hybrid architecture runs an async event loop for fetching and distributes parsed HTML to a process pool for extraction. Each process handles its own portion of the parsing workload without GIL contention.
The Proxy-Per-Thread Model
Implementation approaches depend on your proxy provider's rotation model:
Provider-managed rotation. Services like Databay offer rotating proxy endpoints that assign a different IP to each new connection automatically. In this model, your threads simply point at the same proxy endpoint, and the provider handles IP assignment. This is the simplest approach and works well when you do not need to control which specific IP handles which request.
Application-managed rotation. When you need sticky sessions (same IP for a sequence of requests) or domain-specific proxy assignment, manage the mapping in your application. Maintain a pool of proxy addresses, assign one to each thread or task at creation, and return it to the pool when the task completes. Use a thread-safe data structure (a concurrent queue or a mutex-protected list) for the pool to prevent two threads from claiming the same proxy.
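A minimal sketch of application-managed rotation in Python, using the standard library's thread-safe `queue.Queue` as the pool (the proxy URLs are placeholders):

```python
import queue

class ProxyPool:
    """Thread-safe proxy pool: a worker checks out a proxy, uses it
    exclusively, and returns it when its task completes."""

    def __init__(self, proxies):
        self._pool = queue.Queue()
        for proxy in proxies:
            self._pool.put(proxy)

    def acquire(self, timeout=None):
        # Blocks until a proxy is free, so two threads can never
        # claim the same proxy at the same time.
        return self._pool.get(timeout=timeout)

    def release(self, proxy):
        self._pool.put(proxy)

pool = ProxyPool(["http://proxy-a:8080", "http://proxy-b:8080"])
proxy = pool.acquire()
# ... issue the request through `proxy` ...
pool.release(proxy)
```

Because `queue.Queue` handles its own locking, no explicit mutex is needed around acquire and release.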
Domain-partitioned assignment. For scraping multiple domains concurrently, partition proxies by domain. If you have 100 proxies and 10 target domains, assign 10 proxies to each domain. This prevents cross-domain IP overlap (the same IP hitting multiple targets looks like a bot scanning) and simplifies per-domain rate limiting. Each domain's proxy partition rotates independently.
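The partition itself is simple enough to sketch; assuming proxies and domains are plain lists, an even split looks like:

```python
def partition_proxies(proxies, domains):
    """Evenly split the proxy list into per-domain partitions so a
    given IP only ever hits one target domain."""
    per_domain = len(proxies) // len(domains)
    return {
        domain: proxies[i * per_domain:(i + 1) * per_domain]
        for i, domain in enumerate(domains)
    }

partitions = partition_proxies(
    [f"proxy-{n}" for n in range(100)],
    [f"site-{n}.example" for n in range(10)],
)
# 10 domains, each with 10 dedicated proxies
```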
Track proxy-to-thread assignments in a shared registry so your monitoring system can correlate failures with specific proxies. When a proxy starts failing, you need to know which threads are affected.
Connection Pooling with Proxies
Connection pooling reuses existing TCP connections for multiple requests, amortizing the handshake cost. With proxies, pooling works at two levels:
Scraper-to-proxy pooling. Maintain persistent connections between your scraper and the proxy server. HTTP/1.1 Keep-Alive and HTTP/2 multiplexing both enable this. A pool of 50 persistent connections to the proxy endpoint can support 50 concurrent requests without re-establishing connections. Python's httpx and aiohttp, Go's http.Client, and Java's HttpClient all support connection pooling natively.
Proxy-to-target pooling. This is managed by the proxy provider, not your application. High-quality proxy infrastructure maintains its own connection pools to frequently accessed targets. Databay's infrastructure handles this transparently — your scraper benefits without configuration.
Practical connection pool sizing follows a formula: set the pool size to your target concurrency level plus a 20% buffer. If you run 100 concurrent scrapers, set the connection pool to 120. Too small a pool creates contention — threads block waiting for available connections. Too large a pool wastes file descriptors and memory.
Monitor pool utilization. If your pool is consistently 90%+ utilized, increase its size. If utilization is below 50%, you are either over-provisioned or your concurrency is limited by something else (proxy rate limits, target server capacity, processing bottleneck).
Calculating Optimal Concurrency
The formula for a single target domain:
Max concurrent requests = (target site rate limit per IP per minute / 60) x number of available proxies x safety factor
Example: a target site tolerates 30 requests per minute per IP. You have 200 proxies available. With a 0.7 safety factor (to stay well under detection thresholds): (30/60) x 200 x 0.7 = 70 concurrent requests.
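The worked example translates directly into a small helper:

```python
def max_concurrency(rate_limit_per_minute: float,
                    proxy_count: int,
                    safety_factor: float = 0.7) -> int:
    """Max concurrent requests for a single target domain,
    per the formula above."""
    return int((rate_limit_per_minute / 60) * proxy_count * safety_factor)

max_concurrency(30, 200)   # the worked example above: 70
```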
For multiple target domains, calculate per-domain and sum, but respect your total proxy pool size and bandwidth limits. If you calculate 70 concurrent requests for Domain A and 50 for Domain B, you need 120 proxies assigned concurrently — feasible with a 200-proxy pool, but you are using 60% of your capacity.
The safety factor accounts for variance. Rate limits are not perfectly uniform — a site might tolerate 30 requests per minute on average but flag bursts of 10 requests in 5 seconds. The safety factor keeps you below burst thresholds.
Start at 50% of your calculated maximum and increase gradually while monitoring success rates. If success rates stay above 95%, increase concurrency by 20%. If success rates drop below 90%, reduce immediately. This iterative approach finds the real-world optimum faster than theoretical calculation alone, because actual site tolerance depends on variables (server load, time of day, detection system sensitivity) that you cannot measure externally.
Backpressure: Slowing Down Without Crashing
Signs that you need backpressure handling:
- Response times increasing steadily over the course of a scraping run
- Growing memory usage as pending requests queue in memory
- Increasing timeout rates despite stable proxy and target server health
- Success rates dropping gradually rather than suddenly (sudden drops indicate blocking; gradual drops indicate overload)
Implementation strategies:
Semaphore-based limiting. Use a counting semaphore initialized to your target concurrency level. Each scraping task acquires the semaphore before making a request and releases it after receiving the response. When all semaphore slots are occupied, new tasks block until a slot opens. This caps concurrent requests regardless of how many tasks are queued.
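A sketch of semaphore-based limiting with asyncio; `asyncio.sleep` stands in for the real proxied HTTP call, and the `state` dict exists only to demonstrate that in-flight requests stay capped:

```python
import asyncio

async def fetch(sem, state, url):
    async with sem:                      # blocks when all slots are taken
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)        # stand-in for the proxied HTTP call
        state["active"] -= 1
        return url

async def scrape(urls, limit=10):
    sem = asyncio.Semaphore(limit)
    state = {"active": 0, "peak": 0}
    pages = await asyncio.gather(*(fetch(sem, state, u) for u in urls))
    return pages, state["peak"]

pages, peak = asyncio.run(scrape([f"https://example.com/{i}" for i in range(100)]))
# all 100 URLs complete, but never more than 10 in flight at once
```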
Adaptive rate control. Monitor average response time over a sliding window (last 100 requests). When response time exceeds a threshold (typically 2x the baseline), reduce concurrency by 20%. When response time returns to normal, gradually increase. This creates a feedback loop that automatically adjusts to target server capacity fluctuations.
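One way to sketch that feedback loop; the thresholds mirror the ones above, and the class and its names are illustrative:

```python
from collections import deque

class AdaptiveLimiter:
    """Shrinks the concurrency target when the sliding-window average
    response time exceeds 2x baseline; grows it back gradually."""

    def __init__(self, baseline, start_limit, max_limit, window=100):
        self.baseline = baseline
        self.limit = start_limit
        self.max_limit = max_limit
        self._times = deque(maxlen=window)

    def record(self, response_time):
        self._times.append(response_time)
        avg = sum(self._times) / len(self._times)
        if avg > 2 * self.baseline:
            self.limit = max(1, int(self.limit * 0.8))        # cut 20%
        elif avg < 1.2 * self.baseline:
            self.limit = min(self.max_limit, self.limit + 1)  # recover slowly
```

Workers consult `limit` before starting a new request and call `record` after each response.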
Token bucket rate limiting. Control the rate of new requests using a token bucket. Tokens are added at a fixed rate (your target requests per second), and each request consumes one token. When the bucket is empty, requests wait. This smooths bursty request patterns that trigger anti-bot detection even when the average rate is acceptable.
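A minimal single-threaded token bucket sketch (wrap `acquire` in a lock if multiple threads share one bucket):

```python
import time

class TokenBucket:
    """Tokens refill at `rate` per second up to `capacity`;
    each request spends one token or waits for a refill."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def acquire(self):
        while True:
            now = time.monotonic()
            # Refill based on elapsed time, capped at capacity
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.rate)

bucket = TokenBucket(rate=5, capacity=5)   # 5 requests/second, burst of 5
```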
Proxy Assignment Strategies for Concurrent Scrapers
Round-robin rotation. Cycle through proxies sequentially: request 1 gets proxy A, request 2 gets proxy B, request 3 gets proxy C, then back to A. Simple and fair, but creates a predictable pattern. If an anti-bot system monitors multiple IPs and detects a rotating sequence, it can flag the entire pool. Add randomization to break the pattern.
Least-recently-used (LRU). Always assign the proxy that has not been used for the longest time. This maximizes the cooldown period between uses of any single IP and is the best general-purpose strategy. Maintain a sorted list or priority queue of proxies ordered by last-use timestamp. When a request needs a proxy, pop the oldest one; when it finishes, push the proxy back with the current timestamp.
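A heap-backed sketch of LRU assignment, ordered by last-use timestamp:

```python
import heapq
import time

class LRUProxyPool:
    """Always hands out the proxy that has been idle the longest,
    maximizing the cooldown between uses of any single IP."""

    def __init__(self, proxies):
        # (last_used_timestamp, proxy) — oldest timestamp pops first
        self._heap = [(0.0, p) for p in proxies]
        heapq.heapify(self._heap)

    def acquire(self):
        _, proxy = heapq.heappop(self._heap)
        return proxy

    def release(self, proxy):
        heapq.heappush(self._heap, (time.monotonic(), proxy))

pool = LRUProxyPool(["http://p1:8080", "http://p2:8080", "http://p3:8080"])
proxy = pool.acquire()   # the least-recently-used proxy
pool.release(proxy)      # re-enters the pool with a fresh timestamp
```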
Domain-dedicated assignment. Assign specific proxies to specific target domains. Proxy A only ever hits domain1.com, proxy B only hits domain2.com. This prevents cross-domain correlation (an anti-bot system noticing the same IP scraping multiple unrelated sites) but requires enough proxies to dedicate subsets to each domain.
Weighted assignment. Assign more traffic to proxies with higher success rates. Track each proxy's recent success rate and weight the assignment probability accordingly. A proxy with a 98% success rate gets more requests than one with 85%. This naturally shifts load toward healthier proxies while keeping underperforming ones in light rotation for recovery testing.
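A sketch of success-rate weighting using `random.choices`; the floor weight is an illustrative choice that keeps failing proxies in light rotation for recovery testing, as described:

```python
import random
from collections import deque

class WeightedProxyPicker:
    """Picks proxies with probability proportional to their recent
    success rate over a sliding window of outcomes."""

    def __init__(self, proxies, window=100):
        # Per-proxy window of recent outcomes (True = success)
        self._history = {p: deque([True], maxlen=window) for p in proxies}

    def record(self, proxy, ok):
        self._history[proxy].append(ok)

    def success_rate(self, proxy):
        outcomes = self._history[proxy]
        return sum(outcomes) / len(outcomes)

    def pick(self):
        proxies = list(self._history)
        # Floor of 0.05 keeps even a fully failing proxy lightly in rotation
        weights = [max(self.success_rate(p), 0.05) for p in proxies]
        return random.choices(proxies, weights=weights)[0]
```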
In practice, LRU with success-rate weighting provides the best balance for most multi-threaded scraping workloads.
Avoiding Thundering Herd Problems
The problem typically surfaces in three scenarios:
Job startup. When you launch a scraping job with 200 URLs and 100 concurrent workers, all 100 workers grab URLs and fire requests simultaneously. The proxy endpoint receives 100 connection requests at once. The target server sees 100 requests arrive within milliseconds. Even if the sustained rate is acceptable, the burst is not.
Retry storms. If 50 requests fail simultaneously (a common occurrence when a target server hiccups), and all 50 retries execute after the same backoff delay, the retry storm recreates the original burst. This is actually worse because the target server is likely still recovering.
Scheduled job overlap. Multiple scraping jobs scheduled at the same cron time (hourly on the hour, daily at midnight) compete for proxy pool capacity simultaneously.
Solutions:
- Staggered startup. Do not start all workers simultaneously. Launch one worker every 100-200 milliseconds. A 100-worker pool takes 10-20 seconds to reach full capacity — negligible in the context of a multi-hour scraping job, but it eliminates the startup burst
- Jittered backoff. When calculating retry delays, add random jitter of plus or minus 30-50%. A 10-second backoff becomes a random delay between 5 and 15 seconds. This desynchronizes retries and prevents storms
- Schedule offset. Stagger cron schedules. Instead of running three jobs at exactly 00:00, run them at 00:00, 00:05, and 00:10. This distributes proxy demand across the hour
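The jittered-backoff bullet can be sketched as a small helper; the exponential base and cap are illustrative defaults:

```python
import random

def jittered_backoff(attempt, base=1.0, cap=60.0, jitter=0.5):
    """Exponential backoff with +/-50% jitter so simultaneous
    failures do not retry in lockstep."""
    delay = min(cap, base * (2 ** attempt))
    return delay * random.uniform(1 - jitter, 1 + jitter)

# attempt 3 -> nominal 8s, actual delay anywhere between 4s and 12s
```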
Resource Management at Scale
File descriptors. Every open TCP connection consumes a file descriptor. A scraper with 1000 concurrent proxy connections, plus connections to the database and message queue, needs 1200+ file descriptors. The default limit on most Linux systems is 1024. Increase it with ulimit or systemd configuration before you hit the wall — the failure mode is cryptic connection errors, not a clear "file descriptor limit reached" message.
Memory. Each concurrent request holds a response buffer. If target pages average 500KB and you run 1000 concurrent requests, that is 500MB of response data in memory simultaneously — before parsing, before any application-level caching. Monitor memory usage during peak concurrency and set explicit limits. If memory pressure becomes an issue, stream responses to disk instead of buffering them entirely in memory.
DNS resolution. At high concurrency, DNS resolution can become a bottleneck. The default system resolver handles queries sequentially. Use a concurrent DNS resolver (c-ares, or async DNS libraries) and implement DNS caching to avoid redundant lookups. For large proxy pools, pre-resolve proxy endpoint addresses at startup rather than resolving per-connection.
CPU for parsing. While scraping is I/O-bound, HTML parsing is CPU-bound. If you parse responses synchronously in the scraper thread, parsing latency reduces effective concurrency. Decouple parsing by pushing raw responses to a processing queue and parsing in separate worker threads or processes. This keeps scraper threads focused on I/O and parser threads focused on CPU work.
Monitoring and Debugging Concurrent Scrapers
Per-proxy metrics. Track success rate, average response time, and error distribution for each proxy IP. A proxy with degrading performance affects all threads using it. When you see a cluster of slow or failed requests, correlate them by proxy IP to determine if the issue is proxy-specific or target-specific.
Concurrency waterfall visualization. Log the start time, end time, proxy IP, target URL, and status code for every request. Visualize these as a waterfall chart — horizontal bars on a timeline, one per request. This reveals patterns invisible in aggregate metrics: sequential bottlenecks where threads block on a shared resource, burst patterns that trigger rate limits, and stalls where all threads wait simultaneously.
Common concurrency bugs:
- Race conditions in shared state. Two threads updating a shared success counter or URL queue without synchronization. Use thread-safe data structures or explicit locks for all shared mutable state
- Connection leaks. A thread that hits an exception before closing its proxy connection. Over time, leaked connections exhaust the connection pool. Always close connections in a finally block or use context managers
- Deadlocks. Thread A holds lock 1 and waits for lock 2; thread B holds lock 2 and waits for lock 1. Minimize the number of locks in your design. If you need multiple locks, always acquire them in the same global order
- Stale proxy references. A thread caches a proxy reference that has been marked as failed by the proxy manager. Use proxy assignment that validates proxy health at request time, not at thread creation time
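The connection-leak bullet in code: a `try/finally` guarantees the proxy goes back to the pool even when the fetch raises. `SimplePool` and `fetch` here are illustrative stand-ins:

```python
class SimplePool:
    """Illustrative stand-in for a thread-safe proxy pool."""
    def __init__(self, proxies):
        self._free = list(proxies)
    def acquire(self):
        return self._free.pop()
    def release(self, proxy):
        self._free.append(proxy)

def scrape_one(pool, url, fetch):
    """Return the proxy to the pool on success *and* on exception."""
    proxy = pool.acquire()
    try:
        return fetch(url, proxy)
    finally:
        pool.release(proxy)   # runs whether fetch succeeds or raises
```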