Scrapy Proxy Integration
Route Scrapy through Databay residential and datacenter proxies using request meta, a project-wide downloader middleware, and a rotating session middleware, with fixes for 407 authentication errors, CONNECT tunnel failures, TLS issues and timeout tuning for high-volume crawls.

What Is Scrapy
Scrapy is the standard Python framework for crawling at scale: an asynchronous engine built on Twisted, a structured pipeline for items, and a middleware system that makes cross-cutting concerns like proxying, retries and throttling pluggable. Unlike browser-based tools, Scrapy fetches raw HTTP responses without executing JavaScript, which makes it dramatically faster and cheaper per page, and also makes its traffic pattern easy for targets to profile: high request rates from a single IP with no browser behaviour around them. That is why serious Scrapy deployments treat proxies as core infrastructure rather than an add-on, and Scrapy's middleware architecture makes the integration cleaner than in any browser tool.
Connecting Scrapy to Databay Proxies
Databay's gateway is gw.databay.co:8888, with the pool, country and session selected by username flags: USER-zone-residential, USER-zone-datacenter, and optional -countryCode-us or -sessionId-abc123 suffixes. Scrapy's built-in HttpProxyMiddleware (enabled by default) reads a proxy URL from request.meta['proxy'], extracts any credentials embedded in it, and sets the Proxy-Authorization header for you. That gives you two integration levels: per-request via meta, or project-wide via a small custom middleware.
Per-Request Proxy with Request Meta
The smallest possible integration is one line per request:
import scrapy


class IpSpider(scrapy.Spider):
    name = 'ip'

    def start_requests(self):
        yield scrapy.Request(
            'https://httpbin.org/ip',
            meta={'proxy': 'http://USER-zone-residential:PASS@gw.databay.co:8888'}
        )

    def parse(self, response):
        self.logger.info(response.text)

Credentials ride inside the proxy URL and HttpProxyMiddleware turns them into a Proxy-Authorization header automatically; you never construct that header yourself. One sharp edge: if your password contains characters that are special in URLs (@, :, /, #), percent-encode them with urllib.parse.quote before building the URL, passing safe='' because quote leaves / unencoded by default. Otherwise authentication will fail in confusing ways.
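The encoding step can be sketched as a small helper. This is an illustration only: build_proxy_url is a hypothetical name, not part of Scrapy or Databay, and the gateway address is the one quoted in this guide.

```python
from urllib.parse import quote

GATEWAY = 'gw.databay.co:8888'  # gateway host:port as given in this guide


def build_proxy_url(username, password, gateway=GATEWAY):
    """Return an http:// proxy URL with percent-encoded credentials.

    Hypothetical helper for illustration; not a Scrapy or Databay API.
    """
    # safe='' makes quote() encode '/' too; by default it leaves '/' as-is.
    user = quote(username, safe='')
    pwd = quote(password, safe='')
    return f'http://{user}:{pwd}@{gateway}'


# Example with a made-up password containing all four problem characters.
proxy_url = build_proxy_url('USER-zone-residential', 'p@ss:w/rd#1')
```

The resulting string can be used directly as the value of request.meta['proxy']. Hyphens in the username flags are not touched by the encoding, so the flag syntax survives unchanged.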
Project-Wide Proxy Middleware
Setting meta on every request does not scale past one spider, so the idiomatic setup is a downloader middleware that applies the proxy to all requests. In middlewares.py:
class DatabayProxyMiddleware:
    PROXY = 'http://USER-zone-residential:PASS@gw.databay.co:8888'

    def process_request(self, request, spider):
        request.meta.setdefault('proxy', self.PROXY)

and in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.DatabayProxyMiddleware': 350,
}

The priority matters: 350 places your middleware before the built-in HttpProxyMiddleware at 750, so the proxy URL is already on the request when the built-in middleware processes the credentials. Using setdefault rather than direct assignment lets individual requests override the project default, for example to send one request through the datacenter pool or a specific country.
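If you would rather keep credentials out of middlewares.py, one possible variant reads the proxy URL from project settings through Scrapy's from_crawler hook. The setting name DATABAY_PROXY_URL and the class name are assumptions made for this sketch, not established conventions.

```python
class SettingsProxyMiddleware:
    """Variant of the middleware above that takes its proxy URL from settings.

    DATABAY_PROXY_URL is a hypothetical setting name chosen for this sketch.
    """

    def __init__(self, proxy_url):
        self.proxy_url = proxy_url

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls from_crawler() when it instantiates the middleware.
        return cls(crawler.settings.get('DATABAY_PROXY_URL'))

    def process_request(self, request, spider):
        # Do nothing when the setting is absent; otherwise apply the default
        # proxy unless the request already carries its own in meta.
        if self.proxy_url:
            request.meta.setdefault('proxy', self.proxy_url)
        return None
```

It registers at priority 350 exactly like the hard-coded version. A single request can still opt out by passing its own meta={'proxy': ...}, because setdefault never overwrites an existing value.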
Datacenter Proxies and Zone Flags
Switching pools is a username change, nothing more. Datacenter proxies (USER-zone-datacenter) are the economical default for high-volume crawls of permissive targets, while residential proxies (USER-zone-residential) exit through real consumer connections and are the answer when a target filters datacenter IP ranges or serves different content per region. Geo-targeting works the same way in both pools: USER-zone-residential-countryCode-us exits from the United States, and the same pattern accepts other country codes. Because Scrapy crawls are typically long-running and high-volume, many projects run a cheap datacenter pass first and re-crawl only the blocked remainder through residential, which keeps bandwidth costs proportional to difficulty.
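Since pool and country are just username flags, they can be composed programmatically. The helper below is a hypothetical sketch: the function name is invented, the flag spellings are the ones quoted in this guide, and the order used when country and session flags are combined is an assumption you should verify against your account.

```python
def databay_username(user, zone, country=None, session_id=None):
    """Compose a gateway username from the flags described in this guide.

    Hypothetical helper. The order of combined country and session flags
    is an assumption, not something this guide specifies.
    """
    parts = [user, 'zone', zone]
    if country:
        # The guide's example uses a lowercase country code ("us").
        parts += ['countryCode', country.lower()]
    if session_id:
        parts += ['sessionId', session_id]
    return '-'.join(parts)


DATACENTER = databay_username('USER', 'datacenter')
RESIDENTIAL_US = databay_username('USER', 'residential', country='US')
```

A two-pass crawl can then run the first pass with DATACENTER and re-queue only blocked URLs with a residential username.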
Proxy Rotation Patterns
Through a rotating gateway, each new proxy connection can receive a different exit IP, so a Scrapy crawl naturally spreads across the pool. Two refinements give you control over exactly how. First, sticky sessions: appending -sessionId-XYZ pins all requests carrying that username to one exit IP, which you need for stateful flows (login, pagination tied to server-side session). Second, deliberate rotation: assigning a session id per request, per domain or per cookiejar gives you named identities you can retire on demand. A rotating-session middleware that pins one identity per domain and rotates it on retry looks like this:
import random
import string


class RotatingSessionMiddleware:
    USER = 'USER-zone-residential'
    PASSWORD = 'PASS'

    def __init__(self):
        self.sessions = {}

    def _proxy_for(self, domain):
        sid = self.sessions.get(domain)
        if sid is None:
            sid = ''.join(random.choices(string.ascii_lowercase + string.digits, k=8))
            self.sessions[domain] = sid
        return f'http://{self.USER}-sessionId-{sid}:{self.PASSWORD}@gw.databay.co:8888'

    def process_request(self, request, spider):
        domain = request.url.split('/')[2]
        if request.meta.get('retry_times', 0) > 0:
            # Retry: drop the pinned session so this domain gets a fresh IP
            self.sessions.pop(domain, None)
        request.meta['proxy'] = self._proxy_for(domain)

Register it at priority 350 like the simple middleware above. The effect: every domain gets a stable identity while things go well, and any retry (which Scrapy triggers on connection errors and retryable HTTP codes) silently swaps that domain to a fresh exit IP. One honest caveat about expecting per-request rotation without session ids: Scrapy reuses persistent connections to the proxy, and requests that ride an existing tunnel keep its exit IP, so true per-request rotation is best implemented with explicit per-request session ids rather than assumptions about connection behaviour. The wider trade-offs are covered in static vs rotating proxies.
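For the explicit per-request case, a minimal sketch might look like the following. It assumes the gateway treats every distinct sessionId value as a separate identity, as described above; the class name and the keep_proxy meta key are hypothetical and exist only in this example.

```python
import secrets


class PerRequestSessionMiddleware:
    """Give every outgoing request its own session id (hypothetical sketch)."""

    USER = 'USER-zone-residential'
    PASSWORD = 'PASS'
    GATEWAY = 'gw.databay.co:8888'

    def process_request(self, request, spider):
        # 'keep_proxy' is a made-up meta flag: requests that set it keep
        # whatever proxy they already carry (useful for stateful flows).
        if request.meta.get('keep_proxy'):
            return None
        sid = secrets.token_hex(4)  # 8 hex characters, new for each request
        request.meta['proxy'] = (
            f'http://{self.USER}-sessionId-{sid}:{self.PASSWORD}@{self.GATEWAY}'
        )
        return None
```

Because each request presents different credentials, it cannot silently ride a tunnel opened for another session. The trade-off is more connection setup per request, so this pattern suits targets where identity spread matters more than throughput.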
Common Errors and Fixes
Scrapy reports proxy failures differently from browsers; here is how the classic problems look in a crawl log and what fixes them.
HTTP 407 Proxy Authentication Required
A 407 in Scrapy usually appears as TunnelError: Could not open CONNECT tunnel ... 407 for HTTPS targets, or as a plain 407 response for HTTP ones. Causes, in order of frequency: a typo in the username flags (zone names and flag spelling must match exactly), special characters in the password that were not percent-encoded in the proxy URL, or a custom middleware ordered after the built-in HttpProxyMiddleware so the credentials never get processed (your middleware must have priority below 750). Confirm credentials independently of Scrapy first:
curl -x http://USER-zone-residential:PASS@gw.databay.co:8888 https://httpbin.org/ip

If curl succeeds and Scrapy 407s, the bug is in URL construction or middleware ordering, not the account.
CONNECT Tunnel Failures
Where a browser shows ERR_TUNNEL_CONNECTION_FAILED, Scrapy raises TunnelError: Could not open CONNECT tunnel (without the 407 code). This means the gateway refused or failed to establish the HTTPS tunnel: check the endpoint is exactly gw.databay.co:8888, then the username flags, since a malformed flag combination can cause a refusal at the tunnel stage. Intermittent tunnel errors under high concurrency are different: they are usually individual exit IPs failing mid-crawl, which is normal at scale and exactly what RetryMiddleware exists for; combined with the rotating-session middleware above, each retry lands on a fresh IP. Make sure connection-level errors stay retryable by leaving RETRY_ENABLED = True and keeping RETRY_TIMES at 2 or 3.
TLS and Certificate Errors
The gateway tunnels TLS end-to-end without re-signing, so certificate verify failed errors rarely implicate the proxy. The usual culprits in a Scrapy stack are outdated pyOpenSSL, cryptography or Twisted packages that cannot negotiate with modern target servers, or a target whose TLS configuration is genuinely strange. Upgrading those three packages fixes most cases. Scrapy also lets you swap the TLS context factory via DOWNLOADER_CLIENTCONTEXTFACTORY for unusual targets, which is a better-scoped tool than disabling verification globally. Note also that your HTTP client's TLS handshake itself is a fingerprint that survives IP rotation; TLS fingerprinting and proxy detection explains how targets use it.
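Before upgrading anything, it helps to see which versions are actually installed in the crawl environment. The snippet below is a small sketch using only the standard library; the function name and the package list are choices made for this example. Running scrapy version -v gives a similar report from the command line.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages named in the paragraph above, plus Scrapy itself.
TLS_STACK = ('Scrapy', 'Twisted', 'pyOpenSSL', 'cryptography')


def tls_stack_versions(packages=TLS_STACK):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == '__main__':
    for name, ver in tls_stack_versions().items():
        print(f'{name}: {ver or "not installed"}')
```

Recording this output alongside a failing crawl log makes it easier to tell a stale TLS stack from a genuinely unusual target.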
Timeouts and Throughput Tuning
Residential exits add latency per connection, and crawl settings tuned for direct connections will misbehave through a proxy. The settings that matter:
# settings.py
DOWNLOAD_TIMEOUT = 60  # default 180 is too patient for crawling
RETRY_TIMES = 3
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 4
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0

Two principles behind the numbers. Lower DOWNLOAD_TIMEOUT from its generous default so a dead exit IP costs you 60 seconds, not 180, and let retries (on a fresh session) do the recovery. And throttle per domain rather than globally: total concurrency can stay high across many targets while each individual target sees a polite request rate, which protects both the target and your exit IPs' reputation. AutoThrottle adapts the per-domain rate to observed latency, which is especially useful when residential latency varies between sessions.
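To make the timeout trade-off concrete, here is a rough worked calculation of how long one URL can occupy a slot when every attempt times out. It is a simplification that ignores queueing and throttle delays, and the helper name is invented for this sketch.

```python
def worst_case_seconds(download_timeout, retry_times):
    """Upper bound on time spent on one URL when every attempt times out."""
    attempts = 1 + retry_times  # the first try plus each retry
    return download_timeout * attempts


tuned = worst_case_seconds(60, 3)      # settings shown above
defaults = worst_case_seconds(180, 2)  # Scrapy's default timeout and retries
print(tuned, defaults)  # 240 540
```

Under these assumptions the tuned settings allow one more retry than the defaults while still giving up on a hopeless URL in less than half the time.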
Best Practices for Scrapy with Proxies
- Middleware, not meta. Centralize proxy logic in one downloader middleware; per-request meta is for exceptions, not the rule.
- Session per domain, rotate on retry. Stable identities while a crawl is healthy, automatic fresh IPs when it is not, with no manual blocklist management.
- Match the pool to the target. Datacenter for volume on permissive sites, residential for protected ones; crawl cheap first and escalate only the blocked remainder. A worked example against a hard target is in scraping Amazon with proxies.
- Spend bandwidth on HTML only. Scrapy does not fetch page assets unless told to, so keep it that way: avoid enabling image pipelines on metered residential plans unless images are the deliverable.
- Throttle like a guest. AutoThrottle, per-domain concurrency caps and honest retry budgets keep exit IPs unblocked longer than any rotation scheme can compensate for; the ethical web scraping guide covers where to draw the lines.
- Log the exit IP. A periodic request to https://httpbin.org/ip through the active session makes proxy issues diagnosable from crawl logs alone.
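The exit-IP logging practice can be sketched as a small parsing helper. It relies on httpbin.org/ip returning a JSON object with an "origin" field; the function name is hypothetical, and it accepts any object with a .text attribute, such as a Scrapy response.

```python
import json


def exit_ip_from_response(response):
    """Return the exit IP reported by https://httpbin.org/ip, or None.

    Hypothetical helper: expects a JSON body with an "origin" field and
    returns None for anything else, such as a proxy error page.
    """
    try:
        body = json.loads(response.text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
    if not isinstance(body, dict):
        return None
    return body.get('origin')


# Inside a spider callback (sketch):
#     self.logger.info('exit IP: %s', exit_ip_from_response(response))
```

When scheduling the check more than once per crawl, pass dont_filter=True to scrapy.Request so the duplicate filter does not drop the repeated URL.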
For end-to-end strategy, from pool selection to block recovery, see the web scraping with proxies guide.
Frequently Asked Questions
How do I use a different proxy for each request?
Why do many requests show the same exit IP even without a sessionId?
Does Scrapy support SOCKS5 proxies?
Do I need to URL-encode my proxy password?
Should I use residential or datacenter proxies for Scrapy crawls?