Multi-Threaded Scraping with Proxies: Concurrency Done Right

Daniel Okonkwo · 15 min read

Master multi-threaded scraping with proxies. Covers threading vs async, proxy-per-thread models, connection pooling, backpressure, and concurrency debugging.

Why Concurrency Matters for Web Scraping

Web scraping is fundamentally I/O-bound. Your scraper spends 95% of its time waiting — waiting for DNS resolution, waiting for TCP handshakes, waiting for TLS negotiation, waiting for the target server to process the request and send a response. During all that waiting, your CPU sits idle. A sequential scraper that processes one URL at a time wastes almost all of its potential throughput.

The numbers make this concrete. A single HTTP request to a well-performing website takes 200-500 milliseconds round trip. A sequential scraper maxes out at 2-5 pages per second. The same hardware running 50 concurrent requests can fetch 100-250 pages per second — a 50x throughput increase with zero additional CPU cost.

This is where multi-threaded scraping with proxies becomes essential. Concurrency lets you issue dozens or hundreds of requests simultaneously, each through a different proxy, each waiting for its own response independently. But concurrency also introduces complexity: shared state problems, resource exhaustion, connection management, and error propagation all become harder when requests run in parallel.

The payoff justifies the complexity. A price monitoring system scraping 50,000 product pages needs 14 hours sequentially. With 100 concurrent connections through a rotating proxy pool, the same job finishes in 8 minutes. For time-sensitive data — competitive pricing, news monitoring, inventory tracking — concurrency is not an optimization. It is a requirement.

Threading vs Async vs Multiprocessing

Three concurrency models are available to scrapers, each with distinct tradeoffs. Choosing the right one depends on your language, your target scale, and whether your bottleneck is I/O or CPU.

Threading is the most intuitive model. Each thread runs an independent scraping task, and the operating system handles scheduling. In Python, the Global Interpreter Lock (GIL) limits threads to one executing Python bytecode at a time — but this barely matters for scraping because threads release the GIL during I/O operations (network calls, file writes). Python threads work well for scraping workloads up to 50-100 concurrent connections. In Java, Go, or C#, threads (or goroutines) scale further without a GIL limitation.
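A minimal sketch of the threading model using the standard library's thread pool. The `fetch` body here is a placeholder rather than a real HTTP call; the point is the structure — submit one task per URL and collect results as they complete:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(url: str) -> int:
    # Placeholder for a real HTTP request. A thread blocked on network
    # I/O releases the GIL, which is why this parallelizes despite it.
    return len(url)

urls = [f"https://example.com/page/{i}" for i in range(20)]

with ThreadPoolExecutor(max_workers=10) as pool:
    futures = {pool.submit(fetch, u): u for u in urls}
    results = {futures[f]: f.result() for f in as_completed(futures)}
```

With a real HTTP client in `fetch`, the same structure scales to the 50-100 connection range discussed above.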

Async/await is the best model for high-concurrency I/O-bound work. A single thread manages thousands of concurrent connections using an event loop. Python's asyncio with aiohttp or httpx, JavaScript's native async model, and Rust's tokio all support this pattern. Async scales to thousands of concurrent requests on a single process with minimal memory overhead — each connection costs kilobytes rather than the megabytes required per thread. The tradeoff is code complexity: async code requires careful handling of coroutines, and debugging is harder than sequential code.

Multiprocessing provides true parallelism by running separate OS processes. Use this when CPU-bound work (HTML parsing, data transformation, image processing) is the bottleneck, not network I/O. A common hybrid architecture runs an async event loop for fetching and distributes parsed HTML to a process pool for extraction. Each process handles its own portion of the parsing workload without GIL contention.

The Proxy-Per-Thread Model

The central rule for multi-threaded scraping with proxies: each concurrent request to a given domain should use a different proxy IP. If two threads hit the same domain through the same proxy, you double the request rate from that IP and halve the time before it gets rate-limited or blocked.

Implementation approaches depend on your proxy provider's rotation model:

Provider-managed rotation. Services like Databay offer rotating proxy endpoints that assign a different IP to each new connection automatically. In this model, your threads simply point at the same proxy endpoint, and the provider handles IP assignment. This is the simplest approach and works well when you do not need to control which specific IP handles which request.

Application-managed rotation. When you need sticky sessions (same IP for a sequence of requests) or domain-specific proxy assignment, manage the mapping in your application. Maintain a pool of proxy addresses, assign one to each thread or task at creation, and return it to the pool when the task completes. Use a thread-safe data structure (a concurrent queue or a mutex-protected list) for the pool to prevent two threads from claiming the same proxy.
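A sketch of an application-managed pool built on the standard library's thread-safe `queue.Queue`; the proxy URLs are placeholders. Because `get` blocks until a proxy is returned, two threads can never claim the same IP simultaneously:

```python
import queue

class ProxyPool:
    def __init__(self, proxies):
        self._q = queue.Queue()
        for p in proxies:
            self._q.put(p)

    def acquire(self, timeout=5.0):
        # Blocks until a proxy is free, so two tasks never share one IP.
        return self._q.get(timeout=timeout)

    def release(self, proxy):
        self._q.put(proxy)

pool = ProxyPool([f"http://proxy-{i}.example:8080" for i in range(3)])
proxy = pool.acquire()
# ... issue the request through `proxy`, then hand it back ...
pool.release(proxy)
```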

Domain-partitioned assignment. For scraping multiple domains concurrently, partition proxies by domain. If you have 100 proxies and 10 target domains, assign 10 proxies to each domain. This prevents cross-domain IP overlap (the same IP hitting multiple targets looks like a bot scanning) and simplifies per-domain rate limiting. Each domain's proxy partition rotates independently.
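The 100-proxies-across-10-domains split can be expressed as a one-line partition (this simplified version assumes the pool divides evenly):

```python
def partition_proxies(proxies, domains):
    # Even split; assumes len(proxies) divides cleanly by len(domains).
    per = len(proxies) // len(domains)
    return {d: proxies[i * per:(i + 1) * per] for i, d in enumerate(domains)}

parts = partition_proxies([f"proxy-{i}" for i in range(100)],
                          [f"domain-{i}.com" for i in range(10)])
```

Each domain's slice then rotates independently, so no IP ever appears on two targets.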

Track proxy-to-thread assignments in a shared registry so your monitoring system can correlate failures with specific proxies. When a proxy starts failing, you need to know which threads are affected.

Connection Pooling with Proxies

Every HTTP request through a proxy involves multiple TCP handshakes: your scraper to the proxy, and the proxy to the target server. Each handshake takes 20-50 milliseconds, and TLS negotiation adds another 50-100 milliseconds. Without connection pooling, these costs multiply by the number of requests and dominate your scraping latency.

Connection pooling reuses existing TCP connections for multiple requests, amortizing the handshake cost. With proxies, pooling works at two levels:

Scraper-to-proxy pooling. Maintain persistent connections between your scraper and the proxy server. HTTP/1.1 Keep-Alive and HTTP/2 multiplexing both enable this. A pool of 50 persistent connections to the proxy endpoint can support 50 concurrent requests without re-establishing connections. Python's httpx and aiohttp, Go's http.Client, and Java's HttpClient all support connection pooling natively.
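In httpx, scraper-to-proxy pooling is a client configuration rather than code. A sketch with placeholder credentials — note that the proxy parameter name varies by httpx version (`proxy=` in 0.26+, `proxies=` in older releases):

```python
import httpx

# Pool sized to target concurrency plus ~20% headroom (see the sizing
# guidance below). The proxy URL is a placeholder.
limits = httpx.Limits(max_connections=120, max_keepalive_connections=120)
client = httpx.Client(proxy="http://user:pass@proxy.example:8080",
                      limits=limits)
```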

Proxy-to-target pooling. This is managed by the proxy provider, not your application. High-quality proxy infrastructure maintains its own connection pools to frequently accessed targets. Databay's infrastructure handles this transparently — your scraper benefits without configuration.

Practical connection pool sizing follows a formula: set the pool size to your target concurrency level plus a 20% buffer. If you run 100 concurrent scrapers, set the connection pool to 120. Too small a pool creates contention — threads block waiting for available connections. Too large a pool wastes file descriptors and memory.

Monitor pool utilization. If your pool is consistently 90%+ utilized, increase its size. If utilization is below 50%, you are either over-provisioned or your concurrency is limited by something else (proxy rate limits, target server capacity, processing bottleneck).

Calculating Optimal Concurrency

Running too few concurrent requests wastes throughput. Running too many triggers rate limits, exhausts proxies, and overwhelms target servers. The optimal concurrency level balances these constraints, and it is calculable rather than guesswork.

The formula for a single target domain:

Max concurrent requests = (target site rate limit per IP per minute / 60) x number of available proxies x safety factor

Example: a target site tolerates 30 requests per minute per IP. You have 200 proxies available. With a 0.7 safety factor (to stay well under detection thresholds): (30/60) x 200 x 0.7 = 70 concurrent requests.
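The same calculation as a helper function, reproducing the worked example:

```python
def max_concurrency(rate_per_ip_per_min: float, proxy_count: int,
                    safety: float = 0.7) -> int:
    # (rate limit per IP per minute / 60) x proxies x safety factor
    return int((rate_per_ip_per_min / 60) * proxy_count * safety)

print(max_concurrency(30, 200))  # 70
```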

For multiple target domains, calculate per-domain and sum, but respect your total proxy pool size and bandwidth limits. If you calculate 70 concurrent requests for Domain A and 50 for Domain B, you need 120 proxies assigned concurrently — feasible with a 200-proxy pool, but you are using 60% of your capacity.

The safety factor accounts for variance. Rate limits are not perfectly uniform — a site might tolerate 30 requests per minute on average but flag bursts of 10 requests in 5 seconds. The safety factor keeps you below burst thresholds.

Start at 50% of your calculated maximum and increase gradually while monitoring success rates. If success rates stay above 95%, increase concurrency by 20%. If success rates drop below 90%, reduce immediately. This iterative approach finds the real-world optimum faster than theoretical calculation alone, because actual site tolerance depends on variables (server load, time of day, detection system sensitivity) that you cannot measure externally.

Backpressure: Slowing Down Without Crashing

Backpressure is what happens when a downstream component cannot keep up with the upstream producer. In scraping, this occurs when the proxy pool or target server is slower than your concurrency level assumes. Without backpressure handling, requests pile up in memory, connections time out, and the system either crashes or wastes resources on doomed requests.

Signs that you need backpressure handling:

  • Response times increasing steadily over the course of a scraping run
  • Growing memory usage as pending requests queue in memory
  • Increasing timeout rates despite stable proxy and target server health
  • Success rates dropping gradually rather than suddenly (sudden drops indicate blocking; gradual drops indicate overload)


Implementation strategies:

Semaphore-based limiting. Use a counting semaphore initialized to your target concurrency level. Each scraping task acquires the semaphore before making a request and releases it after receiving the response. When all semaphore slots are occupied, new tasks block until a slot opens. This caps concurrent requests regardless of how many tasks are queued.
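In asyncio this is a few lines. A sketch where `asyncio.sleep` stands in for the real proxied request:

```python
import asyncio

async def fetch(url: str, sem: asyncio.Semaphore) -> str:
    async with sem:                # blocks here when all 10 slots are busy
        await asyncio.sleep(0.01)  # stand-in for the real proxied request
        return url

async def main():
    sem = asyncio.Semaphore(10)    # cap: at most 10 requests in flight
    urls = [f"https://example.com/page/{i}" for i in range(50)]
    return await asyncio.gather(*(fetch(u, sem) for u in urls))

results = asyncio.run(main())
```

However many tasks are queued, no more than 10 are ever in flight at once.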

Adaptive rate control. Monitor average response time over a sliding window (last 100 requests). When response time exceeds a threshold (typically 2x the baseline), reduce concurrency by 20%. When response time returns to normal, gradually increase. This creates a feedback loop that automatically adjusts to target server capacity fluctuations.
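A sketch of that feedback loop — the class names and thresholds are illustrative, matching the 2x-baseline and 20%-reduction figures above:

```python
from collections import deque

class AdaptiveLimiter:
    # Tracks response times over the last `window` requests and scales
    # the concurrency target down when average latency doubles baseline.
    def __init__(self, baseline: float, target: int, window: int = 100):
        self.baseline = baseline
        self.target = target
        self.samples = deque(maxlen=window)

    def record(self, response_time: float) -> int:
        self.samples.append(response_time)
        avg = sum(self.samples) / len(self.samples)
        if avg > 2 * self.baseline:
            self.target = max(1, int(self.target * 0.8))  # back off 20%
        return self.target

lim = AdaptiveLimiter(baseline=0.3, target=100)
```

A production version would also ramp the target back up once latency recovers, as the text describes.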

Token bucket rate limiting. Control the rate of new requests using a token bucket. Tokens are added at a fixed rate (your target requests per second), and each request consumes one token. When the bucket is empty, requests wait. This smooths bursty request patterns that trigger anti-bot detection even when the average rate is acceptable.
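A minimal token bucket sketch. The refill happens lazily on each `acquire` call, so no background thread is needed:

```python
import time

class TokenBucket:
    # Tokens refill at `rate` per second, up to `capacity`.
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def acquire(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should wait and retry

bucket = TokenBucket(rate=5, capacity=10)
```

The `capacity` bounds the burst size while `rate` sets the sustained average — exactly the smoothing described above.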

Proxy Assignment Strategies for Concurrent Scrapers

How you assign proxies to concurrent requests affects both success rates and proxy longevity. The wrong strategy burns through proxies unnecessarily or creates patterns that anti-bot systems detect.

Round-robin rotation. Cycle through proxies sequentially: request 1 gets proxy A, request 2 gets proxy B, request 3 gets proxy C, then back to A. Simple and fair, but creates a predictable pattern. If an anti-bot system monitors multiple IPs and detects a rotating sequence, it can flag the entire pool. Add randomization to break the pattern.

Least-recently-used (LRU). Always assign the proxy that has not been used for the longest time. This maximizes the cooldown period between uses of any single IP and is the best general-purpose strategy. Maintain a sorted list or priority queue of proxies ordered by last-use timestamp. When a request needs a proxy, pop the oldest one; when it finishes, push the proxy back with the current timestamp.
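The priority-queue version can be sketched with `heapq`, keyed on last-use timestamp (proxy names are placeholders):

```python
import heapq
import time

class LRUProxyPool:
    # Min-heap on last-use timestamp: pop always yields the proxy
    # that has been idle the longest.
    def __init__(self, proxies):
        self._heap = [(0.0, p) for p in proxies]  # 0.0 = never used
        heapq.heapify(self._heap)

    def acquire(self) -> str:
        _, proxy = heapq.heappop(self._heap)
        return proxy

    def release(self, proxy: str):
        heapq.heappush(self._heap, (time.monotonic(), proxy))

pool = LRUProxyPool(["p1", "p2", "p3"])
```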

Domain-dedicated assignment. Assign specific proxies to specific target domains. Proxy A only ever hits domain1.com, proxy B only hits domain2.com. This prevents cross-domain correlation (an anti-bot system noticing the same IP scraping multiple unrelated sites) but requires enough proxies to dedicate subsets to each domain.

Weighted assignment. Assign more traffic to proxies with higher success rates. Track each proxy's recent success rate and weight the assignment probability accordingly. A proxy with a 98% success rate gets more requests than one with 85%. This naturally shifts load toward healthier proxies while keeping underperforming ones in light rotation for recovery testing.
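Weighted selection maps directly onto `random.choices`, with recent success rates as the weights (the rates below are illustrative):

```python
import random

def pick_proxy(success_rates: dict) -> str:
    # success_rates: {proxy: recent success fraction}. Healthier proxies
    # are proportionally more likely to be chosen; weaker ones still get
    # occasional traffic for recovery testing.
    proxies = list(success_rates)
    weights = [success_rates[p] for p in proxies]
    return random.choices(proxies, weights=weights, k=1)[0]

rates = {"p1": 0.98, "p2": 0.85, "p3": 0.40}
```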

In practice, LRU with success-rate weighting provides the best balance for most multi-threaded scraping workloads.

Avoiding Thundering Herd Problems

The thundering herd problem occurs when many concurrent tasks start simultaneously and all compete for the same resources at the same moment. In multi-threaded scraping with proxies, this manifests as hundreds of requests launching in the same millisecond, overwhelming the proxy endpoint, triggering burst detection on target sites, and causing mass timeouts.

The problem typically surfaces in three scenarios:

Job startup. When you launch a scraping job with 200 URLs and 100 concurrent workers, all 100 workers grab URLs and fire requests simultaneously. The proxy endpoint receives 100 connection requests at once. The target server sees 100 requests arrive within milliseconds. Even if the sustained rate is acceptable, the burst is not.

Retry storms. If 50 requests fail simultaneously (a common occurrence when a target server hiccups), and all 50 retries execute after the same backoff delay, the retry storm recreates the original burst. This is actually worse because the target server is likely still recovering.

Scheduled job overlap. Multiple scraping jobs scheduled at the same cron time (hourly on the hour, daily at midnight) compete for proxy pool capacity simultaneously.

Solutions:

  • Staggered startup. Do not start all workers simultaneously. Launch one worker every 100-200 milliseconds. A 100-worker pool takes 10-20 seconds to reach full capacity — negligible in the context of a multi-hour scraping job, but it eliminates the startup burst
  • Jittered backoff. When calculating retry delays, add random jitter of plus or minus 30-50%. A 10-second backoff becomes a random delay between 5 and 15 seconds. This desynchronizes retries and prevents storms
  • Schedule offset. Stagger cron schedules. Instead of running three jobs at exactly 00:00, run them at 00:00, 00:05, and 00:10. This distributes proxy demand across the hour
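The jittered backoff above is a one-liner worth getting right — at 50% jitter, a 10-second base delay spreads retries uniformly across 5-15 seconds:

```python
import random

def jittered_backoff(base_delay: float, jitter: float = 0.5) -> float:
    # base_delay 10s with 50% jitter -> uniform delay in [5, 15] seconds
    return base_delay * random.uniform(1 - jitter, 1 + jitter)

delay = jittered_backoff(10)
```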

Resource Management at Scale

Concurrent scrapers consume system resources proportionally to their concurrency level. At moderate scales (10-50 concurrent connections), resource usage is barely noticeable. At high scales (500-5000 concurrent connections), resource management becomes a primary engineering concern.

File descriptors. Every open TCP connection consumes a file descriptor. A scraper with 1000 concurrent proxy connections, plus connections to the database and message queue, needs 1200+ file descriptors. The default limit on most Linux systems is 1024. Increase it with ulimit or systemd configuration before you hit the wall — the failure mode is cryptic connection errors, not a clear "file descriptor limit reached" message.
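You can also check and raise the limit from inside the process. A Unix-only sketch using the standard library's `resource` module (the 4096 target is illustrative, not a universal recommendation):

```python
import resource  # Unix-only; not available on Windows

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
# Raise the soft limit toward the hard limit before a high-concurrency
# run. Unprivileged processes may raise soft up to hard, but not beyond.
if soft < 4096 <= hard:
    resource.setrlimit(resource.RLIMIT_NOFILE, (4096, hard))
```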

Memory. Each concurrent request holds a response buffer. If target pages average 500KB and you run 1000 concurrent requests, that is 500MB of response data in memory simultaneously — before parsing, before any application-level caching. Monitor memory usage during peak concurrency and set explicit limits. If memory pressure becomes an issue, stream responses to disk instead of buffering them entirely in memory.

DNS resolution. At high concurrency, DNS resolution can become a bottleneck. The default system resolver handles queries sequentially. Use a concurrent DNS resolver (c-ares, or async DNS libraries) and implement DNS caching to avoid redundant lookups. For large proxy pools, pre-resolve proxy endpoint addresses at startup rather than resolving per-connection.

CPU for parsing. While scraping is I/O-bound, HTML parsing is CPU-bound. If you parse responses synchronously in the scraper thread, parsing latency reduces effective concurrency. Decouple parsing by pushing raw responses to a processing queue and parsing in separate worker threads or processes. This keeps scraper threads focused on I/O and parser threads focused on CPU work.
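A sketch of that decoupling with a bounded queue and one parser thread. The bounded `maxsize` doubles as backpressure: fetchers block on `put` when parsers fall behind. The page bodies and the `len` "parser" are stand-ins:

```python
import queue
import threading

raw_pages = queue.Queue(maxsize=200)  # bounded: fetchers block if parsers lag

def parser_worker(results: list):
    # Runs in its own thread so CPU-bound parsing never stalls the fetchers.
    while True:
        body = raw_pages.get()
        if body is None:              # sentinel: shut the worker down
            break
        results.append(len(body))     # stand-in for real HTML parsing

parsed = []
worker = threading.Thread(target=parser_worker, args=(parsed,))
worker.start()
for page in ["<html>a</html>", "<html>bb</html>"]:  # stand-ins for responses
    raw_pages.put(page)
raw_pages.put(None)
worker.join()
```

For genuinely heavy parsing, swap the thread for a `multiprocessing` worker pool to escape the GIL entirely, as the paragraph above suggests.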

Monitoring and Debugging Concurrent Scrapers

Debugging concurrent systems is notoriously harder than debugging sequential ones. Problems are intermittent, timing-dependent, and often impossible to reproduce on demand. Build monitoring and debugging capabilities into your scraper from the start, not after the first production failure.

Per-proxy metrics. Track success rate, average response time, and error distribution for each proxy IP. A proxy with degrading performance affects all threads using it. When you see a cluster of slow or failed requests, correlate them by proxy IP to determine if the issue is proxy-specific or target-specific.

Concurrency waterfall visualization. Log the start time, end time, proxy IP, target URL, and status code for every request. Visualize these as a waterfall chart — horizontal bars on a timeline, one per request. This reveals patterns invisible in aggregate metrics: sequential bottlenecks where threads block on a shared resource, burst patterns that trigger rate limits, and stalls where all threads wait simultaneously.

Common concurrency bugs:

  • Race conditions in shared state. Two threads updating a shared success counter or URL queue without synchronization. Use thread-safe data structures or explicit locks for all shared mutable state
  • Connection leaks. A thread that hits an exception before closing its proxy connection. Over time, leaked connections exhaust the connection pool. Always close connections in a finally block or use context managers
  • Deadlocks. Thread A holds lock 1 and waits for lock 2; thread B holds lock 2 and waits for lock 1. Minimize the number of locks in your design. If you need multiple locks, always acquire them in the same global order
  • Stale proxy references. A thread caches a proxy reference that has been marked as failed by the proxy manager. Use proxy assignment that validates proxy health at request time, not at thread creation time

Frequently Asked Questions

Should I use threading or async for web scraping with proxies?
For most scraping workloads, async is the better choice. It handles thousands of concurrent connections with minimal memory overhead on a single thread. Python's asyncio with aiohttp or httpx supports async scraping natively. Use threading when you need simplicity for moderate concurrency (under 100 connections) or when your scraping library does not support async. Use multiprocessing when CPU-bound HTML parsing is the bottleneck, not network I/O.
How many concurrent requests can I run through a proxy pool?
Calculate it: (target site rate limit per IP per minute / 60) multiplied by the number of available proxies, multiplied by a 0.7 safety factor. For example, with 200 proxies and a site allowing 30 requests per IP per minute, you can safely run about 70 concurrent requests. Start at 50% of this calculated maximum and increase while monitoring success rates. If success rates stay above 95%, increase concurrency. Below 90%, reduce immediately.
Why do my concurrent scraping requests fail even with enough proxies?
Common causes include thundering herd effects (all requests launching simultaneously), connection pool exhaustion (more concurrent requests than pooled connections), file descriptor limits (default 1024 on Linux), and DNS resolution bottlenecks. Check for connection leaks where exceptions prevent proper cleanup. Also verify that your proxy assignment gives each concurrent request a different IP — two threads sharing one proxy IP doubles the apparent request rate from that address.
What is the best proxy assignment strategy for multi-threaded scraping?
Least-recently-used (LRU) with success-rate weighting works best for most workloads. LRU maximizes cooldown time between uses of each IP, and success-rate weighting shifts traffic toward healthier proxies. For sites that require session persistence, use sticky proxy assignment that binds an IP to a thread for the session duration. For multi-domain scraping, partition proxies by domain to prevent cross-site correlation.
How do I prevent my concurrent scraper from overwhelming the target server?
Implement three controls. First, use a semaphore to cap maximum concurrent requests to a level the target can handle. Second, add staggered startup so workers launch gradually over 10-20 seconds instead of simultaneously. Third, implement adaptive rate control that monitors response times and reduces concurrency when latency increases. Rate limit per domain, not per IP — distributing 100 requests per second across 100 proxies still sends 100 requests per second to the server.

Start Collecting Data Today

35M+ IPs across 200+ countries. Pay as you go, starting at $0.50/GB.
