Design and build a production data pipeline with proxies. Covers architecture, proxy manager design, scheduling, error handling, scaling, and cost management.
Anatomy of a Proxy-Powered Data Pipeline
The pipeline breaks into seven distinct components, each with a specific responsibility:
- Scheduler. Determines when each scraping job runs — cron-based for recurring tasks, event-triggered for on-demand collection, or priority-queued for mixed workloads
- Request queue. Buffers URLs between the scheduler and scraper workers, providing backpressure when the system is overloaded and persistence when workers crash
- Proxy manager. The most critical custom component. Handles proxy health checking, rotation logic, geographic routing, and retry assignment
- Scraper pool. Stateless workers that pull URLs from the queue, fetch pages through the proxy manager, and push raw responses downstream
- Parser. Transforms raw HTML into structured data using CSS selectors, XPath, or custom extraction logic
- Data store. The final destination — database, data warehouse, or API endpoint
- Alert system. Monitors pipeline health and notifies operators of failures, quality degradation, or cost anomalies
The key design principle is loose coupling. Each component communicates through queues or streams, never through direct function calls. This means you can restart, scale, or replace any component without affecting the others. When the proxy manager needs to cool down IPs, the request queue absorbs the backlog. When the target site changes its HTML structure and the parser breaks, raw responses remain in storage for re-parsing after you fix the extraction logic.
Choosing Pipeline Components
Queue technology. Redis works well up to tens of thousands of queued URLs and provides fast in-memory performance. For larger pipelines or when you need message persistence and replay capability, RabbitMQ gives you durable queues with acknowledgment semantics — no URL is lost if a worker crashes mid-scrape. Apache Kafka is overkill for most scraping pipelines but makes sense if you need to replay historical scrape requests or feed multiple downstream consumers from the same scrape stream.
Scraper framework. Scrapy is the standard for Python-based pipelines, with built-in support for concurrent requests, middleware chains, and item pipelines. For JavaScript-heavy targets, Playwright or Puppeteer behind a task queue provide browser-based scraping. Custom HTTP clients (Python's httpx, Go's standard library) give maximum control for teams that need it.
Storage. PostgreSQL handles structured, relational scrape data well — product catalogs, price histories, listing databases. MongoDB accommodates variable schemas when different sources produce different data shapes. S3 or equivalent object storage is the right choice for raw HTML archival and large-scale datasets.
Orchestration. Apache Airflow is the most established option for scheduling and monitoring complex data pipelines. Prefect offers a more Pythonic API with less configuration overhead. For simpler pipelines, a well-designed cron job with proper logging and alerting works and avoids the operational cost of managing an orchestration platform.
Proxy Manager Design
Health checking. Continuously test proxy endpoints for connectivity, response time, and success rates. Mark proxies that fail health checks as unavailable and retest them periodically. A simple implementation pings each proxy every 60 seconds; a sophisticated one tracks per-domain success rates and marks a proxy as failed for specific domains while keeping it active for others.
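The per-domain tracking described above can be sketched in a few lines. The class name, thresholds, and method names below are illustrative, not any specific library's API:

```python
class ProxyHealth:
    """Tracks per-domain success rates for one proxy endpoint,
    so a proxy can be marked failed for a specific domain while
    staying active for others. Illustrative sketch, not a library API."""

    def __init__(self, min_success_rate=0.5, min_samples=10):
        self.min_success_rate = min_success_rate
        self.min_samples = min_samples
        self.stats = {}  # domain -> [successes, attempts]

    def record(self, domain, succeeded):
        s = self.stats.setdefault(domain, [0, 0])
        s[0] += 1 if succeeded else 0
        s[1] += 1

    def is_healthy(self, domain):
        successes, attempts = self.stats.get(domain, (0, 0))
        if attempts < self.min_samples:
            return True  # not enough data yet; assume healthy
        return successes / attempts >= self.min_success_rate
```

A periodic retest loop would simply call `record` with the result of a lightweight probe request and consult `is_healthy` before routing traffic.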
Rotation logic. Different scraping targets need different rotation strategies. For sites that track sessions, use sticky proxies — the same IP for all requests in a session. For stateless page fetching, rotate on every request to maximize IP diversity. Implement both strategies and let the job configuration choose.
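Both strategies fit behind the same interface, which is what lets job configuration choose between them. A minimal sketch, with hypothetical class names:

```python
import itertools

class RotatingPool:
    """Per-request rotation: cycle through the pool round-robin
    to maximize IP diversity for stateless fetching."""
    def __init__(self, proxies):
        self._cycle = itertools.cycle(proxies)

    def get(self, session_id=None):
        return next(self._cycle)

class StickyPool:
    """Sticky sessions: pin each session_id to one proxy so all
    requests in a session share the same IP."""
    def __init__(self, proxies):
        self.proxies = proxies
        self._assignments = {}

    def get(self, session_id):
        if session_id not in self._assignments:
            idx = len(self._assignments) % len(self.proxies)
            self._assignments[session_id] = self.proxies[idx]
        return self._assignments[session_id]
```

Because both pools expose the same `get` signature, a job config can select a pool class without the scraper workers caring which strategy is active.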
Geographic routing. When scraping geo-restricted content, the proxy manager must route requests through proxies in the correct country. Maintain a mapping of proxy IPs to geographies and expose a geo-targeting API to scraper workers. Databay's geo-targeting capabilities handle this at the provider level, but your proxy manager should verify that responses contain the expected regional content.
Retry assignment. When a request fails, the proxy manager should assign a different proxy for the retry — never the same one. Track which proxies have been tried for each URL to avoid retrying with previously failed IPs. After a configurable number of retries (typically 3-5), escalate to a higher-quality proxy type (datacenter to residential, or residential to mobile) before giving up.
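The two rules above (never reuse a failed proxy, escalate tiers after a few attempts) reduce to two small helpers. Function names and the retries-per-tier default are assumptions for illustration:

```python
TIERS = ["datacenter", "residential", "mobile"]

def tier_for_attempt(attempt, retries_per_tier=2):
    """Escalate to the next proxy tier after every `retries_per_tier`
    failed attempts (attempt is zero-based). Returns None once all
    tiers are exhausted, signalling the URL should be given up."""
    index = attempt // retries_per_tier
    return TIERS[index] if index < len(TIERS) else None

def pick_untried(candidates, tried):
    """Never retry with a proxy that already failed for this URL."""
    for proxy in candidates:
        if proxy not in tried:
            return proxy
    return None
```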
Scheduling Strategies That Actually Work
Cron-based scheduling is the starting point. Run collection jobs at fixed intervals — hourly for prices, daily for listings, weekly for catalogs. The advantage is simplicity and predictability. The disadvantage is inefficiency: you scrape pages that have not changed and miss pages that changed between scheduled runs.
Adaptive frequency scheduling tracks how often each target actually changes and adjusts scrape frequency accordingly. If a product page updates its price once per week on average, scraping it hourly wastes 99% of those requests. By recording the last-changed timestamp for each URL and computing a change frequency, you can schedule scrapes just often enough to catch updates. This typically reduces total requests by 50-70% compared to fixed-interval scheduling.
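One simple way to turn recorded change timestamps into a schedule: scrape at a fraction of the observed mean change interval, never more often than some floor. The function and its `safety_factor` parameter are an illustrative sketch, not a prescribed algorithm:

```python
from datetime import datetime, timedelta

def next_scrape_time(change_times, safety_factor=0.5,
                     min_interval=timedelta(hours=1)):
    """Schedule the next scrape at a fraction of the mean observed
    change interval, so updates are caught soon after they happen
    without re-fetching unchanged pages. `change_times` is the
    sorted list of last-changed timestamps for one URL."""
    if not change_times:
        raise ValueError("need at least one observation")
    if len(change_times) < 2:
        return change_times[-1] + min_interval  # no baseline yet
    deltas = [b - a for a, b in zip(change_times, change_times[1:])]
    mean_interval = sum(deltas, timedelta()) / len(deltas)
    interval = max(mean_interval * safety_factor, min_interval)
    return change_times[-1] + interval
```

For a page that changes weekly, this schedules a scrape every 3.5 days instead of hourly, cutting request volume by two orders of magnitude for that URL.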
Event-triggered scheduling initiates scraping in response to external signals. A competitor launches a sale, a keyword appears in an RSS feed, or a monitoring system detects a price change on a sentinel page — any of these events can trigger a targeted scrape of related pages. This approach gives the freshest data for high-priority targets without the cost of continuous monitoring.
Priority queues combine all three approaches. High-priority URLs (actively changing, business-critical) get scraped first and most frequently. Low-priority URLs fill remaining proxy capacity during off-peak hours. This ensures your proxy budget goes to the most valuable data collection first.
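A priority queue of this kind is a standard min-heap; the sketch below uses Python's `heapq` with a sequence counter so equal-priority URLs keep FIFO order. Class and method names are illustrative:

```python
import heapq

class PriorityScheduler:
    """Min-heap of (priority, sequence, url): lower priority numbers
    are scraped first; the sequence counter breaks ties in FIFO order."""

    def __init__(self):
        self._heap = []
        self._seq = 0

    def push(self, url, priority):
        heapq.heappush(self._heap, (priority, self._seq, url))
        self._seq += 1

    def pop(self):
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)
```

Workers drain this queue top-down, so business-critical URLs go out first and low-priority URLs naturally fill whatever proxy capacity remains.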
Error Handling and Retry Logic
Exponential backoff with jitter. When a request fails with a retryable error (429, 503, timeout), wait before retrying. Double the wait time on each subsequent failure: 2 seconds, 4 seconds, 8 seconds, up to a maximum of 60 seconds. Add random jitter (plus or minus 30%) to prevent multiple workers from retrying simultaneously and overwhelming the target.
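The delay schedule above fits in one function; the parameter names are illustrative, the numbers mirror the text:

```python
import random

def backoff_delay(attempt, base=2.0, cap=60.0, jitter=0.3):
    """Exponential backoff with jitter: 2s, 4s, 8s, ... capped at
    60s, then randomized by +/-30% so multiple workers do not
    retry in lockstep. `attempt` is zero-based."""
    delay = min(base * (2 ** attempt), cap)
    return delay * random.uniform(1.0 - jitter, 1.0 + jitter)
```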
Circuit breakers. If a specific domain returns errors on more than 50% of requests in a 5-minute window, stop sending requests to that domain entirely. A circuit breaker prevents wasting proxy bandwidth on a target that is either down or actively blocking you. After a cooldown period (5-15 minutes), send a single test request. If it succeeds, resume normal operation. If it fails, extend the cooldown.
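A per-domain breaker with exactly these semantics (rolling failure window, cooldown, single probe, extended cooldown on probe failure) can be sketched as follows. The class is illustrative; the injectable `clock` exists only to make it testable:

```python
import time

class CircuitBreaker:
    """Per-domain breaker: opens when the failure rate inside a
    rolling window crosses `threshold`, then allows one probe
    request after `cooldown` seconds. Defaults mirror the text."""

    def __init__(self, threshold=0.5, window=300.0, cooldown=300.0,
                 min_samples=20, clock=time.monotonic):
        self.threshold = threshold
        self.window = window
        self.cooldown = cooldown
        self.min_samples = min_samples
        self.clock = clock
        self._events = []      # (timestamp, succeeded)
        self._opened_at = None

    def record(self, succeeded):
        now = self.clock()
        if self._opened_at is not None:
            if succeeded:
                self._opened_at = None   # probe succeeded: close breaker
                self._events = []
            else:
                self._opened_at = now    # probe failed: extend cooldown
            return
        self._events.append((now, succeeded))
        self._events = [(t, ok) for t, ok in self._events
                        if now - t <= self.window]
        failures = sum(1 for _, ok in self._events if not ok)
        if (len(self._events) >= self.min_samples
                and failures / len(self._events) > self.threshold):
            self._opened_at = now        # open the breaker

    def allow_request(self):
        if self._opened_at is None:
            return True
        # after the cooldown, let the single test request through
        return self.clock() - self._opened_at >= self.cooldown
```

The `min_samples` guard prevents the breaker from tripping on a handful of requests during low-traffic periods.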
Dead letter queues. After exhausting all retries (typically 3-5 attempts with escalating proxy quality), move the failed URL to a dead letter queue for manual inspection. This prevents infinite retry loops while preserving the information about what failed and why. Review dead letter queues daily to identify systemic issues — a sudden spike in dead letters often indicates a target site redesign or new anti-bot deployment.
Idempotent operations. Design every pipeline stage so that processing the same input twice produces the same output without side effects. This is essential because retries, queue redeliveries, and worker restarts will inevitably cause duplicate processing. Idempotency means duplicates are harmless.
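At the storage layer, idempotency usually means an upsert keyed on the record's natural identity. A minimal sketch using SQLite's `ON CONFLICT` upsert; the table schema and column names are assumptions for illustration:

```python
import sqlite3

def init_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE prices (
        url TEXT, scraped_at TEXT, price REAL,
        UNIQUE (url, scraped_at))""")
    return conn

def upsert_price(conn, url, scraped_at, price):
    """Idempotent write: re-processing the same (url, scraped_at)
    pair updates the existing row instead of inserting a duplicate,
    so queue redeliveries and worker restarts are harmless."""
    conn.execute(
        """INSERT INTO prices (url, scraped_at, price)
           VALUES (?, ?, ?)
           ON CONFLICT (url, scraped_at) DO UPDATE SET
               price = excluded.price""",
        (url, scraped_at, price),
    )
```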
Data Validation and Quality Monitoring
Schema validation. Define the expected shape of extracted data — required fields, data types, value ranges — and validate every record against the schema. A product record with a null price, a negative quantity, or a date in 1970 indicates a parsing failure, not a legitimate data point. Reject invalid records and route them to an error queue for investigation.
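A validator can be as simple as a function that returns the list of violations for a record; an empty list means the record passes and anything else routes to the error queue. Field names and ranges below are illustrative:

```python
def validate_product(record):
    """Validate one extracted record against the expected schema.
    Returns a list of violations; an empty list means valid.
    Field names and constraints are illustrative."""
    errors = []
    if not record.get("title"):
        errors.append("missing title")
    price = record.get("price")
    if not isinstance(price, (int, float)) or price <= 0:
        errors.append("price must be a positive number")
    qty = record.get("quantity", 0)
    if not isinstance(qty, int) or qty < 0:
        errors.append("quantity must be a non-negative integer")
    return errors
```

In a larger pipeline the same idea is usually expressed with a schema library, but the contract is identical: reject, record why, and investigate.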
Statistical anomaly detection. Track distributions of extracted values over time. If the average product price on a site suddenly drops 80%, either there is a massive sale or your parser is broken. Set thresholds based on historical baselines: alert when today's data deviates from the trailing 30-day average by more than two standard deviations.
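The two-standard-deviation rule above is a z-score check against the trailing baseline. A minimal sketch using the standard library:

```python
import statistics

def is_anomalous(today_value, history, max_sigma=2.0):
    """Flag today's aggregate (e.g. mean price) when it deviates from
    the trailing baseline by more than `max_sigma` standard deviations.
    `history` would typically be the trailing 30 daily values."""
    if len(history) < 2:
        return False  # not enough baseline to judge
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return today_value != mean
    return abs(today_value - mean) > max_sigma * stdev
```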
Freshness monitoring. Track the timestamp of the most recent successful scrape for each data source. Alert when any source exceeds its expected update interval — if you expect hourly data and the last successful scrape was 3 hours ago, something is wrong. Freshness gaps are the most common data quality issue in pipeline operations.
Cross-source validation. When collecting the same data from multiple sources (a common pattern in price monitoring and competitive intelligence), compare values across sources. Significant disagreements indicate an extraction error in at least one source. This check is invaluable for detecting silent parser failures that schema validation alone would miss.
Scaling Horizontally: Workers and Proxies
Scaling scraper workers. Because workers are stateless (they pull from a queue, fetch through proxies, push to storage), adding workers is as simple as launching more processes or containers. The queue absorbs the coordination: each URL is delivered to exactly one worker, and failed URLs return to the queue automatically. In practice, you scale workers until either the request queue stays empty (workers are consuming URLs as fast as the scheduler produces them) or the proxy pool becomes saturated.
Scaling proxy capacity. When workers are idle waiting for proxy connections, you need more proxy bandwidth. With Databay, this means upgrading your plan to access more concurrent connections and monthly bandwidth. The proxy pool itself (23M+ IPs) is large enough that IP exhaustion is rarely the constraint — bandwidth and concurrent connection limits are.
Scaling storage and processing. The parser and data store layers scale through standard data engineering approaches: partition data by source or date, use write-optimized databases for ingestion, and batch-process heavy transformations during off-peak hours.
The scaling sequence matters. Start by adding workers until the queue drains efficiently. Then increase proxy capacity until workers are fully utilized. Then optimize parsing throughput. Finally, tune storage. Working through these bottlenecks in order prevents over-provisioning expensive resources (proxies) while under-utilizing cheap ones (compute).
Monitoring and Alerting for Pipeline Operations
Track these metrics at minimum:
- Success rate by domain. The percentage of requests that return HTTP 200 with parseable content. A drop below 90% for any domain triggers investigation. Track this per-domain because a single domain's issues should not be masked by the aggregate
- Request latency by proxy type. Median and P95 response times reveal proxy performance degradation before it causes failures. Residential proxies are inherently slower than datacenter proxies — set type-specific latency thresholds
- Queue depth over time. A growing queue means workers or proxies are not keeping up with scheduled tasks. A consistently empty queue means you have excess capacity you could redirect to lower-priority targets or scale down to cut cost
- Data freshness by source. How long ago was each source last successfully scraped? This is the metric your data consumers care about most
- Proxy cost per successful request. Track unit economics to catch inefficiencies. If cost per request rises, it usually means more retries, lower success rates, or unnecessary escalation to expensive proxy types
- Records extracted per run. A sudden drop in record count — even with high success rates — indicates a parser that is silently extracting fewer items per page
Build dashboards for daily monitoring and configure alerts for anomalies. Use tools like Grafana, Datadog, or even a simple script that checks metrics against thresholds and sends Slack notifications. The goal is to detect problems within minutes, not days.
Handling Target Site Changes
Separate scraping from parsing. This is the single most important architectural decision for resilience. Store raw HTML responses before parsing them. When a site redesign breaks your parser, you have not lost any data — you have the raw responses waiting to be re-parsed once you update the extraction logic. Without this separation, a parser failure means the scrape is wasted and must be repeated.
Version your parsers. Maintain parser versions for each target domain. When a site changes its HTML structure, write a new parser version without deleting the old one. This lets you re-parse historical data with the correct parser version and ensures that parser updates do not break extraction of already-collected pages.
Implement change detection. Before parsing, run a structural fingerprint check on the HTML. Compare the CSS selector paths, key element counts, and page structure against the last known good structure. If the fingerprint diverges significantly, flag the response for manual review rather than parsing it with potentially broken logic.
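One crude but effective fingerprint is a count of tags per tag name, compared against the last known good snapshot. The sketch below uses only the standard library; the drift metric and threshold are assumptions, and a real implementation would also compare key CSS selector paths:

```python
from collections import Counter
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    """Counts opening tags as a crude structural fingerprint."""
    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1

def fingerprint(html):
    parser = TagCounter()
    parser.feed(html)
    return parser.counts

def structure_drift(html, baseline):
    """Fraction of the baseline tag-count mass that changed relative
    to the last known good fingerprint. Flag responses for manual
    review when drift exceeds some threshold (e.g. 0.5)."""
    current = fingerprint(html)
    tags = set(current) | set(baseline)
    total = sum(baseline.values()) or 1
    diff = sum(abs(current[t] - baseline[t]) for t in tags)
    return diff / total
```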
Build parser test suites. Maintain a set of sample pages for each target domain and write assertions that verify extraction logic produces expected results. Run these tests on every parser update and nightly against freshly scraped samples. When tests fail, you learn about site changes before they corrupt your data pipeline — not after.
Target site changes are inevitable. The goal is not to prevent breakage but to detect it immediately and recover without data loss.
Cost Management for Proxy-Dependent Pipelines
Track cost per data point. Not per request — per usable data point that reaches your final data store. This metric captures the full cost including retries, failed requests, and discarded low-quality results. If you are paying $0.003 per request and it takes an average of 1.5 requests to get a usable data point (accounting for retries and failures), your actual cost is $0.0045 per data point. Track this over time and investigate when it increases.
Use the cheapest proxy that works. Profile each target domain's detection capabilities and assign the minimum proxy tier that achieves a 95%+ success rate. Many pipelines waste money using residential proxies for targets that datacenter proxies handle fine. A quarterly review of proxy tier assignments catches targets that have loosened or tightened their anti-bot measures.
Eliminate redundant scrapes. Audit your scheduled jobs for overlap — multiple jobs scraping the same pages at different times, or jobs scraping data that no downstream consumer actually uses. In mature pipelines, 10-20% of scraping volume often serves no current business purpose.
Set budget alerts. Configure daily and monthly spend thresholds that trigger alerts before you hit budget limits. A runaway scraper with a parsing bug that generates infinite pagination can burn through a monthly proxy budget in hours. Budget alerts are your circuit breaker for cost overruns.
The most cost-effective pipelines scrape only what is needed, at the minimum frequency that maintains data freshness, through the cheapest proxy tier that achieves acceptable success rates.