
Scrapy Proxy Integration

Jun 11, 2026 10 min read Databay Research Team

Route Scrapy through Databay residential and datacenter proxies using request meta, a project-wide downloader middleware, and a rotating session middleware, with fixes for 407 authentication errors, CONNECT tunnel failures, TLS issues and timeout tuning for high-volume crawls.


What Is Scrapy

Scrapy is the standard Python framework for crawling at scale: an asynchronous engine built on Twisted, a structured pipeline for items, and a middleware system that makes cross-cutting concerns like proxying, retries and throttling pluggable. Unlike browser-based tools, Scrapy fetches raw HTTP responses without executing JavaScript, which makes it dramatically faster and cheaper per page, and also makes its traffic pattern easy for targets to profile: high request rates from a single IP with no browser behaviour around them. That is why serious Scrapy deployments treat proxies as core infrastructure rather than an add-on, and Scrapy's middleware architecture makes the integration cleaner than in any browser tool.

Connecting Scrapy to Databay Proxies

Databay's gateway is gw.databay.co:8888, with the pool, country and session selected by username flags: USER-zone-residential, USER-zone-datacenter, and optional -countryCode-us or -sessionId-abc123 suffixes. Scrapy's built-in HttpProxyMiddleware (enabled by default) reads a proxy URL from request.meta['proxy'], extracts any credentials embedded in it, and sets the Proxy-Authorization header for you. That gives you two integration levels: per-request via meta, or project-wide via a small custom middleware.

Per-Request Proxy with Request Meta

The smallest possible integration is one line per request:

import scrapy

class IpSpider(scrapy.Spider):
    name = 'ip'

    def start_requests(self):
        yield scrapy.Request(
            'https://httpbin.org/ip',
            meta={'proxy': 'http://USER-zone-residential:PASS@gw.databay.co:8888'}
        )

    def parse(self, response):
        self.logger.info(response.text)

Credentials ride inside the proxy URL, and HttpProxyMiddleware turns them into a Proxy-Authorization header automatically; you never construct that header yourself. One sharp edge: if your password contains characters that are special in URLs (@, :, /, #), percent-encode it with urllib.parse.quote(password, safe='') before building the URL, or authentication will fail in confusing ways. The safe='' argument matters because quote leaves / unescaped by default.
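To make the encoding step concrete, here is a minimal sketch of a helper that assembles the proxy URL from the username flags described earlier. The function name build_proxy_url, its parameters, and the order in which the country and session flags are appended are assumptions for illustration; only the flag spellings and the gateway address come from this guide.

```python
from urllib.parse import quote

GATEWAY = "gw.databay.co:8888"  # gateway host:port as given in this guide


def build_proxy_url(user, password, zone="residential", country=None, session_id=None):
    """Compose a proxy URL with percent-encoded credentials (hypothetical helper)."""
    username = f"{user}-zone-{zone}"
    if country:
        # Assumed ordering: country flag before session flag
        username += f"-countryCode-{country}"
    if session_id:
        username += f"-sessionId-{session_id}"
    # safe="" is required: quote() leaves "/" unescaped by default
    return f"http://{quote(username, safe='')}:{quote(password, safe='')}@{GATEWAY}"
```

The result can be passed directly as meta={'proxy': build_proxy_url('USER', 'PASS')}, which keeps the encoding rule in one place instead of repeating it per spider.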

Project-Wide Proxy Middleware

Setting meta on every request does not scale past one spider, so the idiomatic setup is a downloader middleware that applies the proxy to all requests. In middlewares.py:

class DatabayProxyMiddleware:
    PROXY = 'http://USER-zone-residential:PASS@gw.databay.co:8888'

    def process_request(self, request, spider):
        request.meta.setdefault('proxy', self.PROXY)

and in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.DatabayProxyMiddleware': 350,
}

The priority matters: 350 places your middleware before the built-in HttpProxyMiddleware at 750, so the proxy URL is already on the request when the built-in middleware processes the credentials. Using setdefault rather than direct assignment lets individual requests override the project default, for example to send one request through the datacenter pool or a specific country.
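If you would rather not hard-code credentials in middlewares.py, one possible variation reads the proxy URL from project settings through Scrapy's from_crawler hook. This is a sketch, not the only way to do it: DATABAY_PROXY_URL is a hypothetical setting name chosen for this example, not a Scrapy built-in, and the class name is likewise illustrative.

```python
class SettingsProxyMiddleware:
    """Downloader middleware that takes its default proxy URL from settings."""

    def __init__(self, proxy_url):
        self.proxy_url = proxy_url

    @classmethod
    def from_crawler(cls, crawler):
        # DATABAY_PROXY_URL is a hypothetical setting name defined in settings.py
        return cls(crawler.settings.get("DATABAY_PROXY_URL"))

    def process_request(self, request, spider):
        # Apply the default only when a proxy is configured and the
        # request has not already chosen its own proxy via meta.
        if self.proxy_url:
            request.meta.setdefault("proxy", self.proxy_url)
```

Registered at the same priority of 350, it behaves like the hard-coded version, while the URL itself can come from settings.py or a command-line override such as -s DATABAY_PROXY_URL=... on scrapy crawl.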

Datacenter Proxies and Zone Flags

Switching pools is a username change, nothing more. Datacenter proxies (USER-zone-datacenter) are the economical default for high-volume crawls of permissive targets, while residential proxies (USER-zone-residential) exit through real consumer connections and are the answer when a target filters datacenter IP ranges or serves different content per region. Geo-targeting works the same way in both pools: USER-zone-residential-countryCode-us exits from the United States, and the same pattern accepts other country codes. Because Scrapy crawls are typically long-running and high-volume, many projects run a cheap datacenter pass first and re-crawl only the blocked remainder through residential, which keeps bandwidth costs proportional to difficulty.
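The two-pass idea can be sketched in a few lines. The example below assumes you have recorded (url, status) pairs from the datacenter pass and that 403 and 429 are the statuses you treat as blocked; both the data shape and the status set are assumptions to adapt to your targets, and the helper names are hypothetical.

```python
BLOCK_STATUSES = {403, 429}  # assumption: statuses treated as "blocked"


def urls_to_escalate(first_pass_results):
    """Return the URLs from a datacenter pass worth retrying via residential.

    first_pass_results: iterable of (url, http_status) pairs.
    """
    return [url for url, status in first_pass_results if status in BLOCK_STATUSES]


def residential_meta(user="USER", password="PASS"):
    """Request meta that routes a single request through the residential pool."""
    return {"proxy": f"http://{user}-zone-residential:{password}@gw.databay.co:8888"}
```

A second spider run can then yield scrapy.Request(url, meta=residential_meta()) for each escalated URL, so residential bandwidth is spent only on pages the datacenter pool could not fetch.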

Proxy Rotation Patterns

Through a rotating gateway, each new proxy connection can receive a different exit IP, so a Scrapy crawl naturally spreads across the pool. Two refinements give you control over exactly how. First, sticky sessions: appending -sessionId-XYZ pins all requests carrying that username to one exit IP, which you need for stateful flows (login, pagination tied to server-side session). Second, deliberate rotation: assigning a session id per request, per domain or per cookiejar gives you named identities you can retire on demand. A rotating-session middleware that pins one identity per domain and rotates it on retry looks like this:

import random
import string
from urllib.parse import urlsplit

class RotatingSessionMiddleware:
    """Pin one proxy session per domain and rotate it when a request is retried."""
    USER = 'USER-zone-residential'
    PASSWORD = 'PASS'

    def __init__(self):
        # Maps domain -> session id currently pinned to it
        self.sessions = {}

    def _proxy_for(self, domain):
        sid = self.sessions.get(domain)
        if sid is None:
            # No session pinned yet (or it was dropped): mint a new random id
            sid = ''.join(random.choices(string.ascii_lowercase + string.digits, k=8))
            self.sessions[domain] = sid
        return f'http://{self.USER}-sessionId-{sid}:{self.PASSWORD}@gw.databay.co:8888'

    def process_request(self, request, spider):
        # hostname excludes any port or userinfo, unlike splitting the URL on '/'
        domain = urlsplit(request.url).hostname
        if request.meta.get('retry_times', 0) > 0:
            # Retry: drop the pinned session so this domain gets a fresh IP
            self.sessions.pop(domain, None)
        request.meta['proxy'] = self._proxy_for(domain)

Register it at priority 350 like the simple middleware above. The effect: every domain gets a stable identity while things go well, and any retry (which Scrapy triggers on connection errors and retryable HTTP codes) silently swaps that domain to a fresh exit IP. One caveat if you expect per-request rotation without session ids: Scrapy reuses persistent connections to the proxy, and requests that ride an existing tunnel keep its exit IP, so true per-request rotation is best implemented with explicit per-request session ids rather than assumptions about connection behaviour. The wider trade-offs are covered in static vs rotating proxies.
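For the per-request case, a minimal sketch is a middleware that mints a new session id for every request it sees. The class name is hypothetical, and the assumption that the gateway accepts a 12-character hexadecimal session id should be checked against your account's documentation.

```python
import uuid


class PerRequestSessionMiddleware:
    """Give every request, including retries, its own session id."""

    USER = "USER-zone-residential"
    PASSWORD = "PASS"

    def process_request(self, request, spider):
        # Assumed id format: 12 hex characters from a random UUID
        sid = uuid.uuid4().hex[:12]
        # Unconditional assignment: each request carries unique credentials,
        # so rotation does not depend on connection reuse behaviour.
        request.meta["proxy"] = (
            f"http://{self.USER}-sessionId-{sid}:{self.PASSWORD}@gw.databay.co:8888"
        )
```

Because this version overwrites any proxy already set in meta, it is the wrong choice for stateful flows such as logins; use it for stateless page fetches and keep the per-domain middleware for anything that depends on a stable identity.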

Common Errors and Fixes

Scrapy reports proxy failures differently from browsers; here is how the classic problems look in a crawl log and what fixes them.

HTTP 407 Proxy Authentication Required

A 407 in Scrapy usually appears as TunnelError: Could not open CONNECT tunnel ... 407 for HTTPS targets, or as a plain 407 response for HTTP ones. Causes, in order of frequency: a typo in the username flags (zone names and flag spelling must match exactly), special characters in the password that were not percent-encoded in the proxy URL, or a custom middleware ordered after the built-in HttpProxyMiddleware so the credentials never get processed (your middleware must have priority below 750). Confirm credentials independently of Scrapy first:

curl -x http://USER-zone-residential:PASS@gw.databay.co:8888 https://httpbin.org/ip

If curl succeeds and Scrapy 407s, the bug is in URL construction or middleware ordering, not the account.
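When URL construction is the suspect, a quick sanity check is to parse the proxy URL back and compare the pieces with what you intended. The sketch below uses Python's urllib.parse, which is not necessarily the exact parser Scrapy applies internally, so treat a failure as a strong hint rather than proof; the function name is hypothetical.

```python
from urllib.parse import unquote, urlsplit


def proxy_url_parses_cleanly(proxy_url, user, password, host="gw.databay.co"):
    """Return True if the URL decodes back to the intended host and credentials."""
    parts = urlsplit(proxy_url)
    return (
        parts.hostname == host
        and unquote(parts.username or "") == user
        and unquote(parts.password or "") == password
    )
```

An unencoded / in the password, for example, ends the authority part of the URL early, so the parsed hostname is no longer the gateway and the check returns False.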

CONNECT Tunnel Failures

Where a browser shows ERR_TUNNEL_CONNECTION_FAILED, Scrapy raises TunnelError: Could not open CONNECT tunnel (without the 407 code). This means the gateway refused or failed to establish the HTTPS tunnel: check the endpoint is exactly gw.databay.co:8888, then the username flags, since a malformed flag combination can cause a refusal at the tunnel stage. Intermittent tunnel errors under high concurrency are different: they are usually individual exit IPs failing mid-crawl, which is normal at scale and exactly what RetryMiddleware exists for; combined with the rotating-session middleware above, each retry lands on a fresh IP. Make sure connection-level errors stay retryable by leaving RETRY_ENABLED = True and keeping RETRY_TIMES at 2 or 3.

TLS and Certificate Errors

The gateway tunnels TLS end-to-end without re-signing, so certificate verify failed errors rarely implicate the proxy. The usual culprits in a Scrapy stack are outdated pyOpenSSL, cryptography or Twisted packages that cannot negotiate with modern target servers, or a target whose TLS configuration is genuinely strange. Upgrading those three packages fixes most cases. Scrapy also lets you swap the TLS context factory via DOWNLOADER_CLIENTCONTEXTFACTORY for unusual targets, which is a better-scoped tool than disabling verification globally. Note also that your HTTP client's TLS handshake itself is a fingerprint that survives IP rotation; TLS fingerprinting and proxy detection explains how targets use it.

Timeouts and Throughput Tuning

Residential exits add latency per connection, and crawl settings tuned for direct connections will misbehave through a proxy. The settings that matter:

# settings.py
DOWNLOAD_TIMEOUT = 60          # default 180 is too patient for crawling
RETRY_TIMES = 3
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 4

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0

Two principles behind the numbers. Lower DOWNLOAD_TIMEOUT from its generous default so a dead exit IP costs you 60 seconds, not 180, and let retries (on a fresh session) do the recovery. And throttle per domain rather than globally: total concurrency can stay high across many targets while each individual target sees a polite request rate, which protects both the target and your exit IPs' reputation. AutoThrottle adapts the per-domain rate to observed latency, which is especially useful when residential latency varies between sessions.
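The timeout argument is simple arithmetic, sketched here under a simplifying assumption: every attempt against a dead exit runs to the full timeout, and download delays and scheduling overhead are ignored. The function name is illustrative.

```python
def worst_case_seconds(download_timeout, retry_times):
    """Upper bound on time spent on one URL whose every attempt times out."""
    # One initial attempt plus retry_times retries, each hitting the timeout
    return download_timeout * (1 + retry_times)


tuned = worst_case_seconds(60, 3)     # 240 seconds with the settings above
default = worst_case_seconds(180, 3)  # 720 seconds with the default timeout
```

With RETRY_TIMES = 3, a URL that never responds ties up a request slot for at most four minutes instead of twelve.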

Best Practices for Scrapy with Proxies

  • Middleware, not meta. Centralize proxy logic in one downloader middleware; per-request meta is for exceptions, not the rule.
  • Session per domain, rotate on retry. Stable identities while a crawl is healthy, automatic fresh IPs when it is not, with no manual blocklist management.
  • Match the pool to the target. Datacenter for volume on permissive sites, residential for protected ones; crawl cheap first and escalate only the blocked remainder. A worked example against a hard target is in scraping Amazon with proxies.
  • Spend bandwidth on HTML only. Scrapy does not fetch page assets unless told to, so keep it that way: avoid enabling image pipelines on metered residential plans unless images are the deliverable.
  • Throttle like a guest. AutoThrottle, per-domain concurrency caps and honest retry budgets keep exit IPs unblocked longer than any rotation scheme can compensate for; the ethical web scraping guide covers where to draw the lines.
  • Log the exit IP. A periodic request to https://httpbin.org/ip through the active session makes proxy issues diagnosable from crawl logs alone.
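For the exit-IP logging practice, a small parsing helper is enough. This sketch assumes the https://httpbin.org/ip response body is a JSON object with an origin field; the helper name is hypothetical, and it returns None rather than raising when the body is not what it expects.

```python
import json


def parse_exit_ip(body_text):
    """Extract the exit IP from an httpbin /ip response body, or None."""
    try:
        return json.loads(body_text).get("origin")
    except (ValueError, AttributeError):
        # ValueError: body was not JSON; AttributeError: JSON was not an object
        return None
```

In a spider callback this becomes a single line, for example self.logger.info('exit ip: %s', parse_exit_ip(response.text)).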

For end-to-end strategy, from pool selection to block recovery, see the web scraping with proxies guide.

Frequently Asked Questions

How do I use a different proxy for each request?
Set request.meta['proxy'] per request, or generate a fresh sessionId per request in a downloader middleware so each request carries different credentials through the same gateway. Remember that Scrapy reuses proxy connections, so explicit session ids are the reliable way to guarantee per-request IP changes.
Why do many requests show the same exit IP even without a sessionId?
Persistent connections. Scrapy keeps tunnels to the proxy alive and requests that reuse an open tunnel keep its exit IP. That is good for throughput, but if you need guaranteed rotation, assign session ids deliberately rather than relying on connection churn.
Does Scrapy support SOCKS5 proxies?
Not natively; Scrapy's downloader speaks HTTP proxies only. Databay's gateway supports HTTP, so the standard integration works without anything extra. If you specifically need SOCKS5, the usual options are a local HTTP-to-SOCKS bridge or handling those fetches outside Scrapy.
Do I need to URL-encode my proxy password?
Only if it contains characters with special meaning in URLs, such as @, :, / or #. Percent-encode the username and password with urllib.parse.quote(value, safe='') when building the proxy URL (the default safe='/' would leave slashes unescaped); otherwise HttpProxyMiddleware will parse the URL incorrectly and authentication fails.
Should I use residential or datacenter proxies for Scrapy crawls?
Datacenter proxies suit most high-volume crawling: they are fast, economical and fine for targets without reputation-based blocking. Switch the affected portion of a crawl to residential when you see datacenter ranges being filtered, CAPTCHAs appearing, or region-specific content you need to access from a particular country.
