Crawler middleware runbook

Scrapy Proxy Integration

Route Scrapy through Databay residential and datacenter proxies with request meta or a project-wide downloader middleware. Includes a rotating session middleware plus fixes for 407 errors, CONNECT tunnel failures, TLS issues and timeout tuning for high-volume crawls.

Maintained by Databay Research. Updated Jun 11, 2026. 10 min runbook.

Operating principle

Centralize proxy selection in middleware and reserve per-request metadata for exceptions.

What Is Scrapy

Scrapy is the standard Python framework for crawling at scale: an asynchronous engine built on Twisted, a structured pipeline for items, and a middleware system that makes cross-cutting concerns like proxying, retries and throttling pluggable. Unlike browser-based tools, Scrapy fetches raw HTTP responses without executing JavaScript, which makes it dramatically faster and cheaper per page, and also makes its traffic pattern easy for targets to profile: high request rates from a single IP with no browser behavior around them. That is why serious Scrapy deployments treat proxies as core infrastructure rather than an add-on, and Scrapy's middleware architecture makes the integration cleaner than in any browser tool.

Connecting Scrapy to Databay Proxies

Databay's gateway is gw.databay.co:8888, with the pool, country and session selected by username flags: USER-zone-residential, USER-zone-datacenter, and optional -countryCode-us or -sessionId-abc123 suffixes. Scrapy's built-in HttpProxyMiddleware (enabled by default) reads a proxy URL from request.meta['proxy'], extracts any credentials embedded in it, and sets the Proxy-Authorization header for you. That gives you two integration levels: per-request via meta, or project-wide via a small custom middleware.
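To make the credential handling concrete, here is a standalone sketch of roughly what happens to a proxy URL that carries credentials: the user and password are split off, percent-decoded, and turned into a Basic Proxy-Authorization value. It uses only the standard library and is an illustration, not Scrapy's actual implementation; split_proxy_credentials is a hypothetical helper name, and the latin-1 encoding is an assumption.

```python
import base64
from urllib.parse import unquote, urlsplit


def split_proxy_credentials(proxy_url):
    """Return (credential-free proxy URL, Proxy-Authorization value or None)."""
    parts = urlsplit(proxy_url)
    # Assumes the URL always carries an explicit port, as the gateway URL does.
    bare_url = f"{parts.scheme}://{parts.hostname}:{parts.port}"
    if parts.username is None:
        return bare_url, None
    # Credentials are percent-decoded before being Base64-encoded.
    user_pass = f"{unquote(parts.username)}:{unquote(parts.password or '')}"
    token = base64.b64encode(user_pass.encode("latin-1")).decode("ascii")
    return bare_url, f"Basic {token}"


url, header = split_proxy_credentials(
    "http://USER-zone-residential:PASS@gw.databay.co:8888"
)
print(url)     # proxy address with the credentials stripped
print(header)  # value that would be sent as Proxy-Authorization
```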

Per-Request Proxy with Request Meta

The smallest possible integration is one line per request:

import scrapy

class IpSpider(scrapy.Spider):
    name = 'ip'

    def start_requests(self):
        yield scrapy.Request(
            'https://httpbin.org/ip',
            meta={'proxy': 'http://USER-zone-residential:PASS@gw.databay.co:8888'}
        )

    def parse(self, response):
        self.logger.info(response.text)

Credentials ride inside the proxy URL and HttpProxyMiddleware turns them into a Proxy-Authorization header automatically; you never construct that header yourself. One sharp edge: if your password contains characters that are special in URLs (@, :, /, #), percent-encode them with urllib.parse.quote(password, safe='') before building the URL, or authentication will fail in confusing ways. The safe='' argument matters because quote leaves / unencoded by default.
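A minimal sketch of the encoding step, assuming a made-up password that contains all four problem characters. build_proxy_url is a hypothetical helper, not part of Scrapy or any Databay SDK. Note that quote treats / as safe unless told otherwise, so the sketch passes safe="" explicitly.

```python
from urllib.parse import quote


def build_proxy_url(username, password, host="gw.databay.co", port=8888):
    """Build an HTTP proxy URL with percent-encoded credentials."""
    # safe="" also encodes "/", which quote() leaves untouched by default.
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"http://{user}:{secret}@{host}:{port}"


# "p@ss:w/rd#1" is an invented password used only for illustration.
proxy = build_proxy_url("USER-zone-residential", "p@ss:w/rd#1")
print(proxy)
```

The resulting string can be used directly as the value of meta['proxy'].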

Project-Wide Proxy Middleware

Setting meta on every request does not scale past one spider, so the idiomatic setup is a downloader middleware that applies the proxy to all requests. In middlewares.py:

class DatabayProxyMiddleware:
    PROXY = 'http://USER-zone-residential:PASS@gw.databay.co:8888'

    def process_request(self, request, spider):
        request.meta.setdefault('proxy', self.PROXY)

and in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.DatabayProxyMiddleware': 350,
}

The priority matters: 350 places your middleware before the built-in HttpProxyMiddleware at 750, so the proxy URL is already on the request when the built-in middleware processes the credentials. Using setdefault rather than direct assignment lets individual requests override the project default, for example to send one request through the datacenter pool or a specific country.
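The override behavior is easy to verify without running a crawl. The sketch below repeats the middleware from this section and feeds it two stand-in request objects (plain namespaces with a meta dict, an assumption made so the example runs without Scrapy installed): one with no proxy set and one that already chose the datacenter pool.

```python
from types import SimpleNamespace


class DatabayProxyMiddleware:
    PROXY = "http://USER-zone-residential:PASS@gw.databay.co:8888"

    def process_request(self, request, spider):
        # Only fills in the proxy when the request has not chosen one.
        request.meta.setdefault("proxy", self.PROXY)


DATACENTER = "http://USER-zone-datacenter:PASS@gw.databay.co:8888"

# Stand-ins for scrapy.Request objects: only the .meta dict matters here.
default_request = SimpleNamespace(meta={})
override_request = SimpleNamespace(meta={"proxy": DATACENTER})

middleware = DatabayProxyMiddleware()
for request in (default_request, override_request):
    middleware.process_request(request, spider=None)

print(default_request.meta["proxy"])   # project default (residential)
print(override_request.meta["proxy"])  # per-request choice is preserved
```

In a real spider the override is simply meta={'proxy': ...} on the one request that needs it.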

Datacenter Proxies and Zone Flags

Switching pools is a username change: USER-zone-datacenter or USER-zone-residential. Choose the pool from an authorized network-origin or regional requirement, not as an escalation path around blocking. Geo-targeting uses flags such as USER-zone-residential-countryCode-us. A country-targeted exit changes one network-location signal and does not reproduce every local user's account, device, language, or personalization state.
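Because every option is a username flag, it can help to compose the username in one place instead of hand-editing strings. The helper below is a hypothetical sketch; databay_username is not an official API, and the order of flags when a country and a session are combined is an assumption, so confirm it against your account documentation.

```python
def databay_username(user, zone, country=None, session_id=None):
    """Compose a gateway username from the flags described in this guide."""
    if zone not in ("residential", "datacenter"):
        raise ValueError(f"unknown zone: {zone!r}")
    parts = [user, "zone", zone]
    if country:
        parts += ["countryCode", country]
    if session_id:
        # Assumed ordering: the session flag follows the country flag.
        parts += ["sessionId", session_id]
    return "-".join(parts)


print(databay_username("USER", "datacenter"))
print(databay_username("USER", "residential", country="us"))
```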

Proxy Rotation Patterns

A rotating gateway may select a different exit for a new proxy connection. Add -sessionId-XYZ only when an authorized public workflow needs short-term continuity. Apply one domain-level request budget across every exit, cache unchanged responses, and back off globally on errors. Do not rotate a session on 403, 429, CAPTCHA, or another access control; stop or move to an approved API, feed, or license. Scrapy may reuse proxy connections, so log the actual exit and do not assume that one request equals one IP.
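For the case where an authorized workflow does need continuity, one way to express the policy is a middleware that pins each domain to a single session label and counts requests against a per-domain budget. This is a sketch under stated assumptions: DomainSessionMiddleware and its budget_per_domain parameter are hypothetical, the hashed session label format is assumed to be acceptable to the gateway, and the example raises RuntimeError to stay dependency-free where a real Scrapy middleware would more likely raise scrapy.exceptions.IgnoreRequest.

```python
import hashlib
from types import SimpleNamespace
from urllib.parse import urlsplit


class DomainSessionMiddleware:
    """Pin each domain to one sticky session and enforce a request budget."""

    GATEWAY = "gw.databay.co:8888"

    def __init__(self, user, password, budget_per_domain=100):
        self.user = user
        self.password = password
        self.budget_per_domain = budget_per_domain
        self.used = {}  # domain -> requests sent so far

    def session_id(self, domain):
        # Stable per domain, so every request to it shares one session label.
        return hashlib.sha256(domain.encode("utf-8")).hexdigest()[:8]

    def process_request(self, request, spider):
        domain = urlsplit(request.url).hostname or ""
        if self.used.get(domain, 0) >= self.budget_per_domain:
            # The budget is never reset by switching sessions or exits.
            raise RuntimeError(f"request budget exhausted for {domain}")
        self.used[domain] = self.used.get(domain, 0) + 1
        username = (
            f"{self.user}-zone-residential-sessionId-{self.session_id(domain)}"
        )
        request.meta.setdefault(
            "proxy", f"http://{username}:{self.password}@{self.GATEWAY}"
        )


# Stand-in request object so the sketch runs without Scrapy installed.
demo = SimpleNamespace(url="https://example.com/catalog", meta={})
DomainSessionMiddleware("USER", "PASS").process_request(demo, spider=None)
print(demo.meta["proxy"])
```

The session label never changes in response to an error, which keeps the middleware consistent with the rule against rotating on 403, 429 or CAPTCHA responses.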

Common Errors and Fixes

Scrapy reports proxy failures differently from browsers; here is how the classic problems look in a crawl log and what fixes them.

HTTP 407 Proxy Authentication Required

A 407 in Scrapy usually appears as TunnelError: Could not open CONNECT tunnel ... 407 for HTTPS targets, or as a plain 407 response for HTTP ones. Causes, in order of frequency: a typo in the username flags (zone names and flag spelling must match exactly), special characters in the password that were not percent-encoded in the proxy URL, or a custom middleware ordered after the built-in HttpProxyMiddleware so the credentials never get processed (your middleware must have priority below 750). Confirm credentials independently of Scrapy first:

curl -x http://USER-zone-residential:PASS@gw.databay.co:8888 https://httpbin.org/ip

If curl succeeds and Scrapy 407s, the bug is in URL construction or middleware ordering, not the account.
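The ordering half of that diagnosis can be checked mechanically. The sketch below inspects a DOWNLOADER_MIDDLEWARES dict and reports whether a given middleware is enabled and runs before the built-in HttpProxyMiddleware at 750; runs_before_httpproxy is a hypothetical helper name.

```python
BUILTIN_HTTPPROXY_PRIORITY = 750  # position of the built-in HttpProxyMiddleware


def runs_before_httpproxy(downloader_middlewares, middleware_path):
    """True if the middleware is enabled and ordered before HttpProxyMiddleware."""
    priority = downloader_middlewares.get(middleware_path)
    # A missing entry or a value of None means the middleware is not active.
    return priority is not None and priority < BUILTIN_HTTPPROXY_PRIORITY


DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.DatabayProxyMiddleware": 350,
}
print(runs_before_httpproxy(
    DOWNLOADER_MIDDLEWARES, "myproject.middlewares.DatabayProxyMiddleware"
))
```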

CONNECT Tunnel Failures

Where a browser shows ERR_TUNNEL_CONNECTION_FAILED, Scrapy raises TunnelError: Could not open CONNECT tunnel. Check the endpoint is exactly gw.databay.co:8888 and validate the username flags. Under load, reduce concurrency and distinguish local resource exhaustion from gateway or destination errors. Retry only idempotent, authorized requests within a strict source-level budget; never use a fresh identity merely to route around a block or access control.

TLS and Certificate Errors

The gateway tunnels TLS end-to-end without re-signing, so "certificate verify failed" errors rarely implicate the proxy. The usual culprits in a Scrapy stack are outdated pyOpenSSL, cryptography or Twisted packages that cannot negotiate with modern target servers, or a target whose TLS configuration is genuinely strange. Upgrading those three packages fixes most cases. Scrapy also lets you swap the TLS context factory via DOWNLOADER_CLIENTCONTEXTFACTORY for unusual targets, which is a better-scoped tool than disabling verification globally. Note also that your HTTP client's TLS handshake is itself a fingerprint that survives IP rotation; the companion guide on TLS fingerprinting and proxy detection explains how targets use it.
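Before upgrading, it is worth recording which versions are actually installed in the crawl environment. A small sketch using only the standard library (Python 3.8 or later); installed_versions is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("pyOpenSSL", "cryptography", "Twisted")):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions().items():
    print(f"{name}: {ver or 'not installed'}")
```

Run it inside the same virtual environment that runs the spider, since a system-wide Python can report different versions.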

Timeouts and Throughput Tuning

Proxy routing adds variable latency, so tune timeouts and concurrency from measurements of the authorized workflow:

# settings.py
DOWNLOAD_TIMEOUT = 60
RETRY_TIMES = 2
CONCURRENT_REQUESTS = 8
CONCURRENT_REQUESTS_PER_DOMAIN = 2

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

Apply one domain-level request budget across every worker and exit. Retry only idempotent requests after transient transport or server errors, with a capped global attempt budget. Stop rather than changing sessions after a 401, 403, 429, CAPTCHA, or other access control.
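The retry rule above can be written down as a small decision function so every worker applies it the same way. This is a sketch, not Scrapy's RetryMiddleware: next_action is a hypothetical name, the set of transient status codes is an assumption, and CAPTCHA pages (which often arrive with a 200 status) need separate content-based detection that is not shown here. The default of three attempts corresponds to one initial request plus RETRY_TIMES = 2.

```python
STOP_STATUSES = {401, 403, 429}            # access controls: stop, do not retry
TRANSIENT_STATUSES = {500, 502, 503, 504}  # assumed transient server errors
IDEMPOTENT_METHODS = {"GET", "HEAD", "OPTIONS"}


def next_action(method, status, attempts_used, max_attempts=3):
    """Return 'deliver', 'stop', 'retry' or 'give_up' for one response."""
    if status in STOP_STATUSES:
        return "stop"
    if status not in TRANSIENT_STATUSES:
        return "deliver"  # hand the response to the spider unchanged
    if method.upper() in IDEMPOTENT_METHODS and attempts_used < max_attempts:
        return "retry"
    return "give_up"


print(next_action("GET", 503, attempts_used=1))
print(next_action("GET", 429, attempts_used=1))
```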

Best Practices for Scrapy with Proxies

Centralize proxy logic in one downloader middleware so authorization, request budgets, and logging are enforced consistently.
Keep a session per domain only when an authorized workflow needs continuity. Retries must obey a strict source-level budget and must not swap identities to bypass a block.
Choose a pool from the documented network or regional requirement; do not escalate blocked traffic to another address class.
Fetch only the approved fields and assets. Avoid image pipelines unless images are part of the authorized deliverable.
Use AutoThrottle, per-domain concurrency caps, caching, and global backoff. Stop on persistent 403, 429, or CAPTCHA responses.
Log the exit IP and source response so an authorized crawl can be audited and diagnosed.

Prefer official APIs, feeds, bulk downloads, or licenses when available.
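The caching and backoff practices map onto a few built-in Scrapy settings. The fragment below is a starting point with assumed values, not a recommendation tuned for any particular source; adjust the expiry and delay ceilings to the authorized workflow.

```python
# settings.py (illustrative values, chosen as assumptions)
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 3600  # re-fetch a page at most once per hour
# Do not cache access-control or error responses.
HTTPCACHE_IGNORE_HTTP_CODES = [401, 403, 429, 500, 502, 503, 504]

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_MAX_DELAY = 60.0  # upper bound on the delay AutoThrottle may apply
```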

Troubleshooting desk

Questions specific to Scrapy

How do I use a different proxy for each request?

Set request.meta['proxy'] only from centralized policy. A rotating gateway may select another exit for a new connection, but Scrapy can reuse tunnels, so do not promise one new IP per request. Use explicit sessions only for an independent, authorized requirement.

Why do many requests show the same exit IP even without a sessionId?

Persistent connections can reuse the same tunnel and exit. Log the observed address if the authorized test depends on it; no per-request rotation guarantee should be inferred from the gateway label.

Does Scrapy support SOCKS5 proxies?

Not natively; Scrapy's downloader speaks HTTP proxies only. Databay's gateway supports HTTP, so the standard integration works without anything extra. If you specifically need SOCKS5, the usual options are a local HTTP-to-SOCKS bridge or handling those fetches outside Scrapy.

Do I need to URL-encode my proxy password?

Only if it contains characters with special meaning in URLs, such as @, :, / or #. Percent-encode the username and password with urllib.parse.quote(value, safe='') when building the proxy URL (by default quote leaves / unencoded); otherwise the proxy URL can be parsed incorrectly and authentication fails.

Should I use residential or datacenter proxies for Scrapy crawls?

Choose a network type from the approved data-source and regional requirements. Do not switch blocked or challenged traffic to residential merely to bypass the control; stop, back off, or use an official API, feed, or license.
