How to Handle CAPTCHAs While Web Scraping

Maria Kovacs · 15 min read

Learn practical strategies for handling CAPTCHAs while scraping, from prevention-first approaches with residential proxies to solving services and cost analysis.

Why Scrapers Encounter CAPTCHAs

CAPTCHAs are a response, not a default. Websites do not serve CAPTCHAs to every visitor — they serve them when their anti-bot system determines that a request is likely automated. Understanding the triggers is the first step toward reducing your CAPTCHA encounter rate, which is always more efficient than solving them after the fact.

The triggers vary by anti-bot provider and site configuration, but common ones include: a suspicious IP reputation (datacenter ASN, known proxy, previously flagged address), elevated request rate from a single IP or fingerprint, TLS fingerprint mismatch (non-browser client claiming to be a browser), missing or invalid cookies from prior JavaScript challenges, behavioral anomalies (no mouse movement, instant page interactions, uniform timing), and accessing sensitive pages (login forms, checkout flows, pricing APIs) that have lower challenge thresholds.

CAPTCHAs exist on a spectrum of difficulty. Text CAPTCHAs are the simplest — distorted characters that must be typed. Image classification CAPTCHAs ("select all images with buses") require visual understanding. Interactive CAPTCHAs demand specific actions like dragging a puzzle piece. Invisible challenges like reCAPTCHA v3 and Cloudflare Turnstile run without user interaction, scoring the session based on behavioral signals. Each type requires a different handling strategy.

The key metric for any scraping operation that has to deal with CAPTCHAs is the encounter rate: the percentage of your requests that trigger a CAPTCHA. A well-configured scraper with residential proxies, proper headers, and realistic behavior typically sees CAPTCHA rates below 2-3% on most sites. If your rate exceeds 10%, the priority should be fixing your detection signals rather than scaling up solving capacity.
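To make the encounter rate something you can watch, a rolling counter is usually enough. The sketch below is illustrative only: the class name, the window size, and the "watch" band are assumptions, while the 3% and 10% cut-offs simply mirror the figures quoted above and should be tuned per target.

```python
from collections import deque


class EncounterRateTracker:
    """Rolling CAPTCHA encounter rate over the most recent `window` requests."""

    def __init__(self, window: int = 1000) -> None:
        # Each entry is True when the request hit a CAPTCHA, False otherwise.
        self._outcomes = deque(maxlen=window)

    def record(self, hit_captcha: bool) -> None:
        self._outcomes.append(bool(hit_captcha))

    @property
    def rate(self) -> float:
        if not self._outcomes:
            return 0.0
        return sum(self._outcomes) / len(self._outcomes)

    def status(self) -> str:
        # Thresholds taken from the paragraph above; adjust for your targets.
        if self.rate > 0.10:
            return "fix-detection-signals"
        if self.rate > 0.03:
            return "watch"
        return "healthy"
```

Call `record()` once per response and check `status()` periodically; a jump from "healthy" to "watch" is a prompt to inspect proxies, TLS, and behavior before buying more solving capacity.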

Types of CAPTCHAs You Will Encounter

reCAPTCHA v2 is Google's checkbox and image challenge system. The "I'm not a robot" checkbox triggers a risk analysis — low-risk sessions pass with a click, while suspicious sessions get image classification challenges. The challenges present a 4x4 or 3x3 grid of images and ask users to select those containing specific objects. New images may fade in after selections, extending the challenge. reCAPTCHA v2 is deployed on millions of sites and remains one of the most common CAPTCHAs encountered during scraping.

reCAPTCHA v3 runs entirely in the background without any user interaction. It assigns a score from 0.0 to 1.0 based on the session's behavioral signals — mouse movements, browsing patterns, interaction history, and whether the user has an active Google account with browsing history. Site operators set their own threshold: a score below 0.5 might trigger a v2 fallback, while below 0.3 might block the request. The invisible nature makes it harder to detect when you are being scored.

hCaptcha functions similarly to reCAPTCHA v2 with image classification challenges but includes more diverse task types: point selection (click on a specific object), multi-step classification, and drag-and-drop puzzles. It processes roughly 15% of CAPTCHA traffic globally and was Cloudflare's default challenge provider until Turnstile replaced it.

Cloudflare Turnstile is a non-interactive challenge that replaced hCaptcha on Cloudflare-protected sites. It verifies the browser environment through JavaScript execution, proof-of-work computations, and behavioral telemetry without presenting any visual puzzle. Turnstile is significantly harder to solve programmatically because there is no image or text to intercept: the challenge is environmental verification.

Custom CAPTCHAs are site-specific challenges — simple math problems, custom image puzzles, or text-based questions. These are less common but can appear on sites that implement their own protection.

Prevention First: Avoiding CAPTCHAs Entirely

The most cost-effective way to handle CAPTCHAs while scraping is not to solve them at all. Every signal that triggers a CAPTCHA can be addressed at the configuration level, reducing encounter rates to near zero on most targets.

Start with proxy quality. Residential proxies from providers like Databay carry clean IP reputations and usually pass reputation checks without triggering challenges. Datacenter proxies are flagged by default on many CAPTCHA-protected sites, so switching to residential IPs alone can reduce CAPTCHA rates by 60-80%. Ensure your proxy rotation distributes requests widely enough that no single IP accumulates a suspicious volume of activity. For sites with strict rate limits, city-level geo-targeting prevents concentration on a single subnet.

Fix your TLS fingerprint. If your scraper uses Python requests, Node.js axios, or Go net/http with a browser User-Agent, the TLS mismatch triggers CAPTCHAs on Cloudflare and Akamai before any behavioral analysis runs. Use TLS-mimicking libraries (curl_cffi, tls-client) to produce browser-matching TLS handshakes. This single fix often eliminates CAPTCHA encounters on sites where TLS fingerprinting is the primary trigger.
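As a rough illustration of the TLS fix, here is a minimal sketch using curl_cffi, one of the libraries named above. The helper names `build_request_kwargs` and `fetch` are made up for this example, and the `"chrome"` impersonation target is an assumption: which target names are available depends on the curl_cffi version you have installed.

```python
from typing import Any, Dict, Optional


def build_request_kwargs(impersonate: str = "chrome",
                         proxy: Optional[str] = None,
                         timeout: float = 30.0) -> Dict[str, Any]:
    """Collect the options passed to curl_cffi for a browser-like request."""
    kwargs: Dict[str, Any] = {"impersonate": impersonate, "timeout": timeout}
    if proxy:
        # Route both schemes through the same proxy endpoint.
        kwargs["proxies"] = {"http": proxy, "https": proxy}
    return kwargs


def fetch(url: str, **options: Any):
    """GET `url` with a TLS handshake that imitates the chosen browser."""
    # Imported lazily so this module still loads where curl_cffi is missing.
    from curl_cffi import requests  # third-party: pip install curl_cffi

    # No User-Agent override here: a header that disagrees with the
    # impersonated browser would recreate the mismatch described above.
    return requests.get(url, **build_request_kwargs(**options))
```

A call such as `fetch("https://example.com/", proxy="http://user:pass@proxy-host:8000")` (placeholder URL and credentials) would then go out with a browser-style handshake rather than the default Python one.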

Manage your request behavior. Add randomized delays between requests (1-5 seconds with jitter). Include realistic referrer chains — arrive at pages through navigation rather than direct access. Accept and persist cookies from every response. Load the homepage or a category page before accessing product pages. These patterns satisfy behavioral analysis models that determine whether to serve a CAPTCHA. The combined effect of proper proxies, correct TLS fingerprints, and realistic behavior keeps CAPTCHA encounter rates below 1-2% on the vast majority of protected websites.
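The delay and referrer advice above can be sketched in a few lines. Both function names are hypothetical, and the plan assumes the simplest possible path (homepage, then one category page, then products); real navigation is usually messier.

```python
import random
from typing import Iterable, Iterator, Optional, Tuple


def jittered_delay(min_s: float = 1.0, max_s: float = 5.0, rng=random) -> float:
    """Return a pause length in [min_s, max_s] with no fixed rhythm."""
    if not 0 <= min_s <= max_s:
        raise ValueError("expected 0 <= min_s <= max_s")
    return rng.uniform(min_s, max_s)


def navigation_plan(home_url: str, category_url: str,
                    product_urls: Iterable[str]) -> Iterator[Tuple[str, Optional[str]]]:
    """Yield (url, referer) pairs that reach product pages via navigation."""
    yield home_url, None              # direct entry, no Referer header
    yield category_url, home_url      # arrived from the homepage
    for url in product_urls:
        yield url, category_url       # arrived from the category listing
```

In a scraper loop you would send each URL with its `Referer` header, keep one cookie-persisting session throughout, and call `time.sleep(jittered_delay())` between requests.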

CAPTCHA Solving Services: How They Work

When CAPTCHAs cannot be avoided, solving services provide API-based resolution. Two distinct approaches exist: human-powered solving and AI-powered solving, each with different speed, accuracy, and cost profiles.

Human-powered services (such as 2Captcha and Anti-Captcha) maintain pools of workers: real people who solve CAPTCHA challenges submitted via API. Your scraper encounters a CAPTCHA, extracts the challenge parameters (site key, page URL, challenge image), submits them to the solving API, and receives a solution token within 10-60 seconds. The token is then submitted to the target site as if the user solved the challenge locally. For reCAPTCHA v2, the solution is a g-recaptcha-response token. For hCaptcha, it is an h-captcha-response token. Human solvers achieve 90-98% accuracy on image challenges.
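Extracting the challenge parameters is often the first integration step. The sketch below assumes the widget is embedded in the common declarative form, a container element with a `g-recaptcha`, `h-captcha`, or `cf-turnstile` class and a `data-sitekey` attribute. Pages that render the widget from JavaScript will not match, and the function and dictionary key names are illustrative.

```python
from html.parser import HTMLParser
from typing import Dict, Optional

# CSS class of the widget container mapped to the CAPTCHA family.
WIDGET_CLASSES = {
    "g-recaptcha": "recaptcha_v2",
    "h-captcha": "hcaptcha",
    "cf-turnstile": "turnstile",
}


class _WidgetFinder(HTMLParser):
    """Remember the first element that looks like a CAPTCHA widget."""

    def __init__(self) -> None:
        super().__init__()
        self.found: Optional[Dict[str, str]] = None

    def handle_starttag(self, tag, attrs):
        if self.found is not None:
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        sitekey = attr.get("data-sitekey")
        for css_class, kind in WIDGET_CLASSES.items():
            if css_class in classes and sitekey:
                self.found = {"type": kind, "sitekey": sitekey}
                return


def extract_challenge(html: str, page_url: str) -> Optional[Dict[str, str]]:
    """Return type, site key, and page URL, or None when no widget is found."""
    finder = _WidgetFinder()
    finder.feed(html)
    if finder.found is None:
        return None
    return {**finder.found, "page_url": page_url}
```

The returned dictionary holds the values a solving API typically asks for; a `None` result means the page either has no CAPTCHA or embeds it in a way this simple parser does not cover.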

AI-powered solvers use machine learning models trained on millions of CAPTCHA images. They are faster (2-10 seconds per solve) and cheaper per unit, but accuracy varies by challenge type. Simple text CAPTCHAs and standard image classifications achieve 85-95% accuracy with AI. Complex multi-step challenges or novel image types have lower accuracy. Some services blend AI and human solving — the AI handles easy challenges, and difficult ones fall back to human workers.

The API integration follows a standard pattern regardless of provider: submit the CAPTCHA parameters, poll for the solution (or use a callback URL), and inject the solution token into your scraping session. Most CAPTCHA solving services charge per solve: $1-3 per 1,000 reCAPTCHA v2 solves, $2-4 per 1,000 hCaptcha solves, and $2-5 per 1,000 reCAPTCHA v3 token generations. Turnstile tokens typically cost $1.50-3.50 per 1,000.
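The submit-and-poll pattern can be written once and reused across providers. This sketch deliberately avoids any real provider's API: `submit` and `fetch_result` are hypothetical callables you would implement against your service's documentation, and the default interval and timeout are assumptions.

```python
import time
from typing import Callable, Optional


class SolveTimeout(Exception):
    """Raised when no solution arrives before the deadline."""


def solve_with_polling(submit: Callable[[], str],
                       fetch_result: Callable[[str], Optional[str]],
                       poll_interval: float = 5.0,
                       timeout: float = 120.0,
                       sleep: Callable[[float], None] = time.sleep,
                       clock: Callable[[], float] = time.monotonic) -> str:
    """Submit a challenge, then poll until a token is returned or time runs out.

    submit()              -> provider task id (hypothetical wrapper)
    fetch_result(task_id) -> token string, or None while still pending
    """
    task_id = submit()
    deadline = clock() + timeout
    while clock() < deadline:
        token = fetch_result(task_id)
        if token is not None:
            return token
        sleep(poll_interval)
    raise SolveTimeout(f"no solution for task {task_id} within {timeout}s")
```

Injecting `sleep` and `clock` keeps the loop testable without waiting. The timeout matters because tokens expire: a solution that arrives too late is a paid solve you cannot use.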

Browser-Based Solving Approaches

Headless browsers provide an alternative to external solving services by interacting with CAPTCHAs directly within a browser environment. This approach works particularly well for challenges that evaluate browser authenticity rather than requiring human visual recognition.

For reCAPTCHA v3, the score is computed based on behavioral signals collected by the reCAPTCHA JavaScript library running in the browser. A headless browser that simulates realistic user behavior — mouse movements, scroll events, page navigation — can achieve scores of 0.7-0.9 without any external solving service. The key is allowing the reCAPTCHA script enough time and behavioral data to compute a favorable score. Loading several pages with natural interaction patterns before accessing the protected page builds a session history that elevates the score.

For Cloudflare Turnstile, headless browsers with stealth patches can pass the challenge natively because Turnstile evaluates browser environment authenticity rather than presenting visual puzzles. A Playwright instance with proper stealth configuration executes the Turnstile JavaScript, completes the proof-of-work computation, and receives the cf-turnstile-response token without external assistance. The success rate depends on the stealth quality — default Playwright configurations fail, but well-patched instances pass consistently.

For visual CAPTCHAs (reCAPTCHA v2 image selection, hCaptcha), browser-based solving requires either integrating with an external solving service through the browser (injecting solved tokens into the page) or using AI models running locally. Browser extensions like Buster for reCAPTCHA use audio challenge solving as an alternative to image challenges, converting the audio to text using speech recognition. However, reCAPTCHA has become more aggressive at detecting and blocking audio-based solving, reducing this approach's reliability.

Token-Based Solutions and Session Reuse

CAPTCHA tokens have a defined validity period, and understanding token lifecycle enables efficient reuse strategies that reduce the total number of CAPTCHAs you need to solve.

A reCAPTCHA v2 token is valid for approximately 120 seconds after generation. During this window, the token can be submitted as a solution to the challenge on the page where it was generated. For scraping workflows that process multiple pages on the same domain, a single solved CAPTCHA can unlock a session that persists through cookies rather than requiring a solve per page. After solving, the resulting cookies (often including anti-bot tokens like cf_clearance or _px cookies) maintain access for 15-30 minutes or longer.

The efficient pattern is solve-once-scrape-many. Use a headless browser to navigate to the target site, encounter and solve the CAPTCHA (either through a solving service or browser-based approach), collect all resulting cookies and session tokens, then transfer these to a lightweight HTTP client for high-speed data extraction. The session remains valid as long as the cookies are fresh and requests continue from the same IP. With sticky proxy sessions, a single CAPTCHA solve can authorize hundreds of subsequent requests.
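The hand-off from browser to HTTP client mostly comes down to moving cookies. The sketch below assumes cookies in the shape Playwright's `context.cookies()` returns (dictionaries with `name`, `value`, `domain`, and `expires`, where `expires` is `-1` for session cookies). The function names are illustrative, and cookie paths are ignored for brevity.

```python
import time
from typing import Dict, Iterable, Mapping, Optional


def cookies_for_host(browser_cookies: Iterable[Mapping],
                     host: str,
                     now: Optional[float] = None) -> Dict[str, str]:
    """Pick the unexpired browser cookies that apply to `host`."""
    now = time.time() if now is None else now
    jar: Dict[str, str] = {}
    for cookie in browser_cookies:
        expires = cookie.get("expires", -1)
        if expires is not None and expires != -1 and expires <= now:
            continue  # already expired, do not carry it over
        domain = cookie["domain"].lstrip(".")
        if host == domain or host.endswith("." + domain):
            jar[cookie["name"]] = cookie["value"]
    return jar


def cookie_header(jar: Mapping[str, str]) -> str:
    """Serialize the selected cookies into a Cookie request header value."""
    return "; ".join(f"{name}={value}" for name, value in jar.items())
```

Send the resulting header from your lightweight client through the same sticky proxy session the browser used. Clearance cookies are commonly tied to the IP, and often the User-Agent, that earned them, so reuse both.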

For reCAPTCHA v3, token generation can be batched. Since v3 tokens are generated silently by the reCAPTCHA script, a browser session with good behavioral history can generate multiple tokens in sequence by repeatedly triggering the grecaptcha.execute() function. Each token is valid for 2 minutes and can be used for API requests that require a reCAPTCHA token parameter. Pre-generating a pool of tokens before starting extraction ensures that your scraping workers never wait for token generation.
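A pre-generated pool only helps if stale tokens never reach your workers. Here is a minimal sketch of an expiry-aware pool: the class name and the 15-second safety margin are assumptions, and the 120-second lifetime follows the figure above.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional, Tuple


class TokenPool:
    """FIFO pool that hands out tokens only while they are still fresh."""

    def __init__(self, ttl_seconds: float = 120.0,
                 safety_margin: float = 15.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        # Stop using a token a little before it formally expires, leaving
        # time for the request that carries it to reach the server.
        self._usable_for = ttl_seconds - safety_margin
        self._clock = clock
        self._tokens: Deque[Tuple[float, str]] = deque()

    def add(self, token: str) -> None:
        """Store a freshly generated token with its creation time."""
        self._tokens.append((self._clock(), token))

    def take(self) -> Optional[str]:
        """Return the oldest usable token, discarding stale ones on the way."""
        now = self._clock()
        while self._tokens:
            created_at, token = self._tokens.popleft()
            if now - created_at < self._usable_for:
                return token
        return None
```

A producer (the browser session calling `grecaptcha.execute()`) feeds `add()`, while scraping workers call `take()` and request a fresh token when it returns `None`. With a two-minute lifetime, a small pool that is refilled continuously works better than a large one generated up front.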

reCAPTCHA v3 Scoring: What Determines Your Bot Score

reCAPTCHA v3 is the invisible scoring system that runs on more sites than most scrapers realize. It does not present a challenge. Instead, it quietly computes a score that determines whether subsequent requests succeed or trigger secondary verification. Understanding how the score is calculated is essential for any scraping strategy that has to get past it.

The scoring model evaluates multiple signal categories. Browser environment signals check for automation indicators: navigator.webdriver flag, missing Chrome plugin array, headless-specific window properties, and WebGL renderer anomalies. A browser that fails these checks starts with a low score floor that behavioral signals cannot easily overcome.

Behavioral signals carry the most weight. Mouse movement patterns are analyzed for naturalness — velocity variation, curvature, micro-corrections. Keyboard input timing is checked for human-like irregularity. Scroll behavior is evaluated for reading patterns rather than mechanical scrolling. Page interaction sequence (what was clicked, in what order, with what timing) feeds the scoring model. Sessions with no mouse movement or with movement that follows mathematical curves instead of natural trajectories score poorly.

Session history and cookies matter significantly. A browser with existing Google cookies (from normal browsing) and reCAPTCHA history (from passing previous challenges) starts with elevated trust. A fresh browser with no cookies and no history starts near the bottom. This is why reCAPTCHA v3 is harder to handle in scraping contexts — every session starts cold without the trust accumulated through normal browsing.

The practical implication: warm up sessions before accessing protected pages. Load several unprotected pages with natural interaction patterns to build behavioral history. Allow the reCAPTCHA script to collect data for 10-20 seconds of active browsing before triggering the scored action. This warmup can elevate scores from 0.1-0.3 (likely bot) to 0.7-0.9 (likely human) without any external solving service.

Cloudflare Turnstile: The Non-Interactive Challenge

Cloudflare Turnstile represents the next generation of CAPTCHA technology — a challenge that verifies humanity without requiring any user interaction. Deployed as a replacement for hCaptcha on Cloudflare-protected sites, Turnstile is becoming one of the most commonly encountered challenges and one of the hardest to solve programmatically.

Turnstile works by executing a JavaScript challenge in the browser that performs three categories of verification. First, it checks the browser environment for automation signals — similar to reCAPTCHA v3's environment checks but using Cloudflare's proprietary detection methods. Second, it runs proof-of-work computations that verify the browser has computational capabilities consistent with a real device (not a server farm running thousands of instances). Third, it collects behavioral telemetry including interaction patterns and timing data to feed a machine learning classifier.

The challenge completes in 1-3 seconds for legitimate browsers, producing a cf-turnstile-response token that authenticates subsequent requests. For automated tools, the difficulty lies in the environment verification — Turnstile's detection of headless browsers, automation frameworks, and modified browser configurations is continuously updated by Cloudflare's security team, which observes attack patterns across their network of millions of sites.

Handling Turnstile in scraping requires a well-configured headless browser with comprehensive stealth patches. Playwright with stealth modifications can pass Turnstile challenges, but the configuration must be current — patches that worked last month may fail today as Cloudflare updates its detection. Some CAPTCHA solving services now offer Turnstile solving, where they execute the challenge in their own browser infrastructure and return the token. This is more reliable than maintaining your own stealth configuration but adds per-challenge costs.

Cost Analysis of CAPTCHA Solving at Scale

At scale, CAPTCHA solving costs can dominate your scraping budget. A clear-eyed cost analysis reveals why prevention is always the superior strategy and helps you budget accurately when solving is unavoidable.

Consider a scraping operation making 100,000 requests per day to a protected e-commerce site. With a poor configuration (datacenter proxies, default Python TLS, no behavioral optimization), the CAPTCHA encounter rate might be 15-25%. At 20,000 CAPTCHAs per day and $2.50 per 1,000 solves, that is $50 per day or $1,500 per month just in CAPTCHA solving. With a well-configured stack (residential proxies, browser-matching TLS, realistic behavior), the encounter rate drops to 1-2%, meaning 1,000-2,000 CAPTCHAs per day — $2.50-5.00 per day or $75-150 per month. The configuration investment pays for itself within days.
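The arithmetic in this example is easy to rerun with your own numbers. The helper below is a plain restatement of it; the function name and the 30-day month are assumptions.

```python
def monthly_solving_cost(requests_per_day: int,
                         encounter_rate: float,
                         price_per_1000: float,
                         days: int = 30) -> float:
    """Estimated monthly solving spend for a given CAPTCHA encounter rate."""
    captchas_per_day = requests_per_day * encounter_rate
    daily_cost = captchas_per_day / 1000 * price_per_1000
    return daily_cost * days


# The two scenarios from the paragraph above.
poor_stack = monthly_solving_cost(100_000, 0.20, 2.50)
tuned_stack_low = monthly_solving_cost(100_000, 0.01, 2.50)
tuned_stack_high = monthly_solving_cost(100_000, 0.02, 2.50)
```

With these inputs the poor configuration comes to roughly $1,500 per month and the tuned one to roughly $75-150, matching the figures in the text.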

Cost per solve varies by CAPTCHA type and provider. Standard pricing in early 2026:

  • reCAPTCHA v2 (human-powered): $1.00-3.00 per 1,000 solves
  • reCAPTCHA v3 token generation: $2.00-5.00 per 1,000 tokens
  • hCaptcha: $2.00-4.00 per 1,000 solves
  • Cloudflare Turnstile: $1.50-3.50 per 1,000 tokens
  • Custom image CAPTCHAs: $1.00-2.00 per 1,000 solves


Solving speed affects throughput. Human-powered solving averages 15-45 seconds per challenge, creating a bottleneck when hundreds of CAPTCHAs need solving simultaneously. AI-powered services return results in 2-10 seconds but may have lower accuracy. Factor in retry costs for failed solves — at 90% accuracy, 10% of your CAPTCHA budget is wasted on solutions that do not work. The most cost-effective approach combines aggressive prevention (reducing encounter rate), token reuse (reducing solve count), and a reliable solving service for the remainder.
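Both effects in this paragraph, retries and latency, reduce to short formulas. The sketch assumes that failed solves are billed and retried until one succeeds, and that CAPTCHAs arrive evenly through the day. Neither assumption holds exactly in practice, and the function names are illustrative.

```python
def effective_price_per_1000(price_per_1000: float, accuracy: float) -> float:
    """Price per 1,000 *successful* solves when failed attempts are also billed."""
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    # On average 1 / accuracy attempts are needed per working solution.
    return price_per_1000 / accuracy


def wasted_share(accuracy: float) -> float:
    """Share of solving spend that goes to solutions that do not work."""
    return 1.0 - accuracy


def solves_in_flight(captchas_per_day: float, avg_solve_seconds: float) -> float:
    """Average number of solves pending at once (arrival rate times latency)."""
    return captchas_per_day / 86_400 * avg_solve_seconds
```

At 90% accuracy a $2.50 list price works out to about $2.78 per 1,000 working tokens. At 20,000 CAPTCHAs per day and 30 seconds per solve, about seven solves are pending at any moment on average, so bursts will need noticeably more concurrency than that.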

When CAPTCHAs Signal You Should Rethink Your Approach

A persistently high CAPTCHA rate is not a problem to solve through more solving capacity — it is diagnostic feedback that your scraping approach needs fundamental revision. Treating CAPTCHAs as a normal cost of doing business leads to escalating expenses and declining reliability.

If your CAPTCHA rate exceeds 5% on a target site, investigate systematically. Check your TLS fingerprint first — this is the most common root cause of immediate challenges. Verify using a fingerprint testing service through your proxy setup. If the JA3/JA4 hash does not match a known browser, fix this before investigating other layers. A single TLS fix often drops CAPTCHA rates from double digits to below 2%.

If TLS is clean, examine your proxy quality. Check whether your IP addresses appear on known proxy lists or blacklists. Test with a small sample of proxies and compare CAPTCHA rates — if certain IPs trigger CAPTCHAs consistently, the issue is IP reputation. Switching to a higher-quality residential proxy pool or different geographic regions can resolve this. Databay's pool of 35 million IPs ensures you are not cycling through overused addresses that have accumulated negative reputation.
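Comparing CAPTCHA rates across a proxy sample needs little more than a per-IP tally. The sketch below assumes you log one `(proxy_id, hit_captcha)` pair per request. The function names, the 5% threshold, and the 20-request minimum are illustrative choices.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Event = Tuple[str, bool]  # (proxy identifier, did this request hit a CAPTCHA?)


def _tally(events: Sequence[Event]) -> Dict[str, Tuple[int, int]]:
    """Count requests and CAPTCHA hits per proxy."""
    counts = defaultdict(lambda: [0, 0])
    for proxy_id, hit in events:
        counts[proxy_id][0] += 1
        counts[proxy_id][1] += int(hit)
    return {proxy: (n, hits) for proxy, (n, hits) in counts.items()}


def captcha_rate_by_proxy(events: Sequence[Event]) -> Dict[str, float]:
    """CAPTCHA encounter rate for every proxy seen in the log."""
    return {proxy: hits / n for proxy, (n, hits) in _tally(events).items()}


def flag_bad_proxies(events: Sequence[Event],
                     threshold: float = 0.05,
                     min_requests: int = 20) -> List[str]:
    """Proxies whose rate exceeds `threshold`, ignoring tiny samples."""
    return sorted(
        proxy for proxy, (n, hits) in _tally(events).items()
        if n >= min_requests and hits / n > threshold
    )
```

If only a few proxies are flagged while the rest stay clean, IP reputation is the likely cause. If every proxy shows a similar elevated rate, the trigger is more likely TLS or behavior.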

If proxies and TLS are both clean, the trigger is likely behavioral. Review your request patterns: timing uniformity, missing referrer chains, cookie handling gaps, or accessing pages in non-human sequences. Implement the warmup patterns discussed in earlier sections — load navigation pages before target pages, add interaction delays, maintain cookie state across requests. Each fix reduces the signals that trigger CAPTCHA challenges, systematically lowering your encounter rate to a level where occasional solving is a minor cost rather than a budget-defining expense.

Frequently Asked Questions

What is the cheapest way to handle CAPTCHAs at scale?

Prevention is the cheapest approach by far. Investing in residential proxies, TLS-mimicking HTTP clients, and realistic request behavior reduces CAPTCHA encounter rates to 1-2%, eliminating 90% or more of potential solving costs. For the remaining CAPTCHAs, token reuse (solving once and using the session for many requests) minimizes the number of individual solves needed. When solving is required, AI-powered services are cheaper per solve than human-powered ones, though slightly less accurate.

Can AI completely replace human CAPTCHA solvers?

Not yet for all types. AI achieves 85-95% accuracy on standard image classification CAPTCHAs (reCAPTCHA v2, hCaptcha) and handles text CAPTCHAs effectively. However, complex multi-step challenges, novel image types, and adversarial CAPTCHAs designed to resist AI still benefit from human solving. The trend is toward AI handling the majority of volume with human workers as a fallback for difficult cases. Invisible challenges like reCAPTCHA v3 and Turnstile are not image-based, so they require browser-based approaches rather than either AI or human solving.

How do I integrate a CAPTCHA solving service into my scraper?

Most services use a submit-and-poll API pattern. When your scraper encounters a CAPTCHA, extract the site key and page URL from the CAPTCHA HTML, submit them to the solving API, then poll the API every 3-5 seconds until a solution is returned. Inject the solution token into your request (as a form field or query parameter) and submit. Libraries exist for major scraping languages: python-anticaptcha, 2captcha-python, and equivalents for Node.js and Go. The integration typically adds 15-60 seconds of latency per CAPTCHA encounter.

Does solving CAPTCHAs violate any terms of service?

CAPTCHA solving exists in a legal gray area. CAPTCHA providers' terms of service generally prohibit automated solving, but enforcement is directed at the solving services rather than their users. The legality depends on jurisdiction and the purpose of your scraping. Scraping publicly available data is broadly legal in many jurisdictions, but circumventing access controls adds legal complexity. Consult legal counsel if your scraping operation is commercial or involves protected data. Using CAPTCHAs as a signal to improve your configuration (reducing encounters) is a more sustainable and lower-risk strategy than relying on mass solving.

Why does my CAPTCHA rate increase over time even with good proxies?

Anti-bot systems track cumulative behavior, not just individual requests. Over time, your proxy IPs accumulate activity history, your fingerprint may get flagged, and the site may adjust its detection thresholds. Additionally, anti-bot providers continuously update their detection models. Regular maintenance is required: rotate to fresh proxy pools, update your TLS impersonation profiles to current browser versions, vary your request patterns, and monitor your CAPTCHA encounter rate as a health metric. A sudden rate increase usually signals that one of your detection layers needs updating.
