Web Scraping API Tips: 7 Best Practices for 2026
Web Scraping 12 min read  ·  Published: 07/05/2026

These web scraping API tips are drawn from years of production scraping experience across thousands of targets. Whether you are building your first scraper or hardening an existing pipeline, following these seven best practices will help you collect cleaner data, avoid blocks, and ship more reliable automations — with Python and Node.js code examples throughout.

Tip 1 — Respect the website and its robots.txt

The first of our web scraping API tips is also the most fundamental: always read the robots.txt file before scraping a site. This file, maintained by the website owner, specifies which pages are allowed or disallowed for automated access — and sometimes even defines acceptable crawl rates.

Beyond legality, respecting these rules is also practical. Scraping aggressively without reading robots.txt increases your chances of being blocked, rate-limited, or served decoy data meant to mislead bots.

Parsing robots.txt automatically in Python

from urllib.robotparser import RobotFileParser
from urllib.parse import urlparse

def can_scrape(url, user_agent="*"):
    """Return a dict with the robots.txt verdict and crawl delay for the URL."""
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"

    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()

    allowed    = rp.can_fetch(user_agent, url)
    crawl_delay = rp.crawl_delay(user_agent)

    return {"allowed": allowed, "crawl_delay": crawl_delay}

result = can_scrape("https://example.com/products")
print(result)
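# Example output (depends on the site's robots.txt): {'allowed': True, 'crawl_delay': 2}
# crawl_delay is None when robots.txt declares no Crawl-delay for this user agent.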
💡 Tip: If crawl_delay is set, use it as your minimum delay between requests. Ignoring it is the fastest way to get your IP blacklisted.
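
One way to apply that tip is to treat the robots.txt value as a floor under your randomised delay. The sketch below assumes a hypothetical helper named polite_delay; the 1.5x upper bound is an arbitrary choice for illustration, not a recommendation from any standard.

```python
import random

def polite_delay(crawl_delay, min_s=1.0, max_s=4.0):
    """Pick a wait time in seconds that never undercuts robots.txt's Crawl-delay."""
    # crawl_delay is None when the site declares no Crawl-delay.
    floor = max(min_s, float(crawl_delay or 0))
    # Keep some jitter above the floor even when Crawl-delay exceeds max_s.
    ceiling = max(max_s, floor * 1.5)
    return random.uniform(floor, ceiling)
```

Pass the returned value to time.sleep() between requests, for example time.sleep(polite_delay(result["crawl_delay"])).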

Tip 2 — Simulate human behaviour with realistic delays

Browsing speed is one of the clearest signals websites use to distinguish humans from bots. A script that fires requests every 50ms looks nothing like a human — and will be flagged almost immediately. Furthermore, many modern anti-bot systems track inter-request timing across sessions, not just individual request rates.

The solution is to add randomised delays that mimic genuine user behaviour: a pause while "reading" the page, occasional longer gaps, and varied timing between requests.

Realistic delay pattern in Python

import time, random

def human_delay(min_s=1.0, max_s=4.0, long_pause_prob=0.1):
    """
    Wait between min_s and max_s seconds.
    Occasionally wait 8–15s to simulate a user reading a page.
    """
    if random.random() < long_pause_prob:
        delay = random.uniform(8, 15)
        print(f"Long pause: {delay:.1f}s")
    else:
        delay = random.uniform(min_s, max_s)
    time.sleep(delay)

urls = ["https://example.com/page/1",
        "https://example.com/page/2",
        "https://example.com/page/3"]

for url in urls:
    data = scrape(url)   # your scraping API call
    process(data)
    human_delay()        # pause before the next request

Realistic delay in Node.js

function humanDelay(minMs = 1000, maxMs = 4000, longPauseProb = 0.1) {
  const isLong = Math.random() < longPauseProb;
  const delay  = isLong
    ? 8000 + Math.random() * 7000   // 8–15s long pause
    : minMs + Math.random() * (maxMs - minMs);
  return new Promise(r => setTimeout(r, delay));
}

// urls, scrape and process are placeholders; run this inside an async function or an ES module
for (const url of urls) {
  const data = await scrape(url);
  process(data);
  await humanDelay();
}

Tip 3 — Detect when you have been blocked

Not all blocks are obvious. A 403 Forbidden is easy to catch, but some websites use subtler techniques: they return a 200 OK status with a CAPTCHA page, serve deliberately fake data, or silently redirect you to a honeypot. Therefore, validating the response content — not just the status code — is essential.

Multi-layer block detection in Python

from bs4 import BeautifulSoup

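# Substrings matched against visible page text. Generic words like "blocked"
# can also appear in ordinary content, so tune this list per target.
BLOCK_SIGNALS = [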
    "access denied",
    "captcha",
    "unusual traffic",
    "blocked",
    "403 forbidden",
    "rate limit exceeded",
    "you have been banned"
]

def is_blocked(response_data):
    """
    Returns (True, reason) if a block is detected, (False, None) otherwise.
    Checks: HTTP status, captchaFound flag, and HTML content.
    """
    # Check API-level flags
    if response_data.get("captchaFound"):
        return True, "captchaFound flag"

    status = response_data.get("statusCode", 200)
    if status in (403, 429, 503):
        return True, f"HTTP {status}"

    # Check the visible page text for block signals
    html  = response_data.get("html") or ""   # tolerate a missing or null field
    soup  = BeautifulSoup(html, "html.parser")
    text  = soup.get_text(" ", strip=True).lower()

    for signal in BLOCK_SIGNALS:
        if signal in text:
            return True, f"Block signal in content: '{signal}'"

    # Check for suspiciously short response (honeypot / fake data)
    if len(html) < 500:
        return True, f"Suspiciously short response: {len(html)} chars"

    return False, None

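# Usage (scrape() is defined in Tip 6; process() is your own handler)
data = scrape("https://example.com/listings")
blocked, reason = is_blocked(data)

if blocked:
    print(f"⚠️  Blocked: {reason}; retrying with premium proxy")
    data = scrape("https://example.com/listings", premium=True)
    blocked, reason = is_blocked(data)   # validate the retry as well

if not blocked:
    process(data)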
💡 Tip: Log every block detection with its URL, timestamp, and reason. Over time, this data reveals which targets require premium proxies by default, saving you wasted credits on standard proxy attempts.
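
The logging idea in that tip can be sketched with a JSON Lines file and a per-domain counter. Everything named here is an assumption for illustration: the helpers log_block and domains_needing_premium, the field names, and the threshold of three blocks.

```python
import json
import time
from collections import Counter
from urllib.parse import urlparse

def log_block(path, url, reason, now=None):
    """Append one block event (timestamp, URL, reason) to a JSON Lines file."""
    event = {"ts": now if now is not None else time.time(),
             "url": url,
             "reason": reason}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")

def domains_needing_premium(path, min_blocks=3):
    """Return domains blocked at least min_blocks times, most-blocked first."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                counts[urlparse(json.loads(line)["url"]).netloc] += 1
    return [domain for domain, n in counts.most_common() if n >= min_blocks]
```

Call log_block() wherever is_blocked() returns True, then review domains_needing_premium() periodically to decide which targets should start on premium proxies.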

Tip 4 — Avoid getting blocked with the right headers and rotation

When a browser visits a website, it sends a bundle of headers (User-Agent, Accept-Language, Referer, and others) that together form part of its browser fingerprint. Requests that omit these headers, or that carry outdated browser strings, are easy for anti-bot systems to flag.

Fortunately, using a managed web scraping API like Scraping-bot.io handles header rotation automatically. However, if you are making direct requests for simpler targets, here is how to do it correctly:

Rotating realistic headers in Python

import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",

    # Safari reports a frozen macOS version (10_15_7) regardless of the real OS
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.4 Safari/605.1.15",

    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0"
]

def get_headers(referer=None):
    return {
        "User-Agent":      random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate",   # add "br" only if a Brotli decoder (e.g. the brotli package) is installed
        "Accept":          "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Referer":         referer or "https://www.google.com/",
        "DNT":             "1",
        "Connection":      "keep-alive"
    }
💡 Keep your user agents fresh: Check useragentstring.com or MDN's User-Agent documentation periodically to update your pool with current browser versions. Outdated browser strings are one of the most common reasons scrapers get flagged.
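
A small check can make that periodic review easier. The sketch below assumes two hypothetical helpers, browser_major and stale_user_agents, and a caller-supplied dictionary of minimum acceptable major versions. The numbers in the usage note are placeholders, not current browser versions.

```python
import re

# "Version/NN" is the token Safari uses for its own version number.
_VERSION_RE = re.compile(r"(Chrome|Firefox|Version)/(\d+)")
_NAMES = {"Chrome": "Chrome", "Firefox": "Firefox", "Version": "Safari"}

def browser_major(user_agent):
    """Return (browser_name, major_version), or None if the string is unrecognised."""
    match = _VERSION_RE.search(user_agent)
    if match is None:
        return None
    return _NAMES[match.group(1)], int(match.group(2))

def stale_user_agents(pool, minimum):
    """Return the user agents that are unrecognised or older than minimum[browser]."""
    stale = []
    for ua in pool:
        parsed = browser_major(ua)
        if parsed is None or parsed[1] < minimum.get(parsed[0], 0):
            stale.append(ua)
    return stale
```

For example, stale_user_agents(USER_AGENTS, {"Chrome": 130, "Firefox": 120, "Safari": 17}) would list every pool entry below those placeholder thresholds. The parser is deliberately simple: Chromium-based browsers such as Edge are reported as Chrome.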

Tip 5 — Use a headless browser for JavaScript-heavy pages

A large proportion of modern websites load their content dynamically via JavaScript after the initial HTML response. Consequently, if you scrape the raw HTML directly, you will often receive an empty shell — no products, no prices, no listings — because the data has not yet been injected by JavaScript.

The solution is to use a headless browser that executes JavaScript and waits for the page to fully render before returning the HTML. With Scraping-bot.io, this is a single option in your API call:

Enabling JS rendering with Scraping-bot.io

import requests, base64

creds = base64.b64encode(b"your_username:your_api_key").decode()

def scrape_js(url, country=None):
    """
    Scrape a JavaScript-rendered page.
    waitForNetworkIdle: waits for all XHR/fetch calls to complete.
    """
    options = {"waitForNetworkIdle": True}
    if country:
        options["country"] = country

    r = requests.post(
        "https://api.scraping-bot.io/scrape/raw-html",
        headers={"Authorization": f"Basic {creds}",
                 "Content-Type": "application/json"},
        json={"url": url, "options": options},
        timeout=120   # requests has no default timeout; adjust to your targets
    )
    return r.json()

# Scrape a JS-rendered product page with US geo-location
data = scrape_js("https://example-shop.com/products", country="us")
print(data["html"][:500])
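
To decide when JS rendering is actually needed, you can test whether a plain response is an "empty shell". The sketch below is one heuristic under stated assumptions: the helper name looks_like_empty_shell and the 200-character threshold are illustrative, and it uses only the standard-library HTML parser.

```python
from html.parser import HTMLParser

class _TextCounter(HTMLParser):
    """Counts visible text characters, ignoring <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.visible_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.visible_chars += len(data.strip())

def looks_like_empty_shell(html, min_visible_chars=200):
    """True when the page carries almost no visible text outside scripts and styles."""
    counter = _TextCounter()
    counter.feed(html)
    counter.close()
    return counter.visible_chars < min_visible_chars
```

If a simple request returns a page for which this is True, retrying with scrape_js() is a reasonable next step. Tune the threshold per target, since some legitimate pages are short.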

When to use waitForNetworkIdle vs a simple request

| Page type | Recommended approach |
| --- | --- |
| Static HTML (blogs, news articles) | Simple request (faster and cheaper) |
| Prices / listings loaded via AJAX | waitForNetworkIdle: true |
| Single-page apps (React, Vue, Angular) | waitForNetworkIdle: true |
| Pages behind geo-restrictions | waitForNetworkIdle: true + country |
| Pages with CAPTCHAs | premiumProxy: true + waitForNetworkIdle: true |
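
The table can be encoded as a lookup so that callers pick options by page type rather than by hand. This is a minimal sketch: the page-type keys and the options_for helper are hypothetical, and the option names are the ones used in this article's examples.

```python
# Page-type keys are illustrative labels, not API values.
PAGE_TYPE_OPTIONS = {
    "static":  {"waitForNetworkIdle": False},
    "ajax":    {"waitForNetworkIdle": True},
    "spa":     {"waitForNetworkIdle": True},
    "geo":     {"waitForNetworkIdle": True},
    "captcha": {"premiumProxy": True, "waitForNetworkIdle": True},
}

def options_for(page_type, country=None):
    """Build the options dict for a request, following the table above."""
    if page_type not in PAGE_TYPE_OPTIONS:
        raise ValueError(f"Unknown page type: {page_type!r}")
    if page_type == "geo" and not country:
        raise ValueError("Geo-restricted pages need a country code")
    options = dict(PAGE_TYPE_OPTIONS[page_type])   # copy so the template stays untouched
    if country:
        options["country"] = country
    return options
```

The result can be passed as the "options" value in the JSON body, for example options_for("geo", country="de").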

Tip 6 — Use the right proxies for the right targets

Not all proxies are equal, and using the wrong type for a given target is one of the most common reasons scrapers fail. Specifically, datacenter IPs are fast and cheap but easily detected, while residential IPs are slower and more expensive but far harder to block.

| Proxy type | IP source | Detection risk | Best for |
| --- | --- | --- | --- |
| Datacenter | Cloud servers (AWS, GCP...) | High: easily flagged | Simple sites, internal tools, low-protection targets |
| Residential | Real ISP-assigned home IPs | Low: looks like a real user | E-commerce, social platforms, Google, Amazon |
| Geo-targeted | Residential IPs in a specific country | Very low | Price comparison across markets, geo-restricted content |

Upgrading to residential proxies on demand

import requests, base64

creds = base64.b64encode(b"your_username:your_api_key").decode()

def scrape(url, premium=False, country=None):
    options = {
        "premiumProxy":      premium,
        "waitForNetworkIdle": True
    }
    if country:
        options["country"] = country

    r = requests.post(
        "https://api.scraping-bot.io/scrape/raw-html",
        headers={"Authorization": f"Basic {creds}",
                 "Content-Type": "application/json"},
        json={"url": url, "options": options},
        timeout=120   # requests has no default timeout; adjust to your targets
    )
    return r.json()

def scrape_with_fallback(url, country=None):
    """Try standard proxy first, upgrade to residential on block."""
    result = scrape(url, premium=False, country=country)

    if result.get("captchaFound") or result.get("statusCode") in (403, 429):
        print("Standard proxy blocked — retrying with residential proxy")
        result = scrape(url, premium=True, country=country)

    return result

# Example: scrape a geo-restricted page from Germany
data = scrape_with_fallback("https://example-shop.de/products", country="de")

Tip 7 — Build a web crawler to feed your scraping API

A scraper collects data from a known URL. A crawler, by contrast, discovers new URLs automatically by following links across pages. Together, they form a complete data collection pipeline: the crawler feeds URLs to the scraping API, which returns structured data for each page.

Simple breadth-first crawler in Python

from collections import deque
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
import requests, base64

creds = base64.b64encode(b"your_username:your_api_key").decode()

def scrape(url):
    r = requests.post(
        "https://api.scraping-bot.io/scrape/raw-html",
        headers={"Authorization": f"Basic {creds}",
                 "Content-Type": "application/json"},
        json={"url": url, "options": {"waitForNetworkIdle": False}},
        timeout=120   # requests has no default timeout; adjust to your targets
    )
    return r.json()

def crawl(start_url, max_pages=50, allowed_domain=None):
    """
    Breadth-first crawler.
    Discovers internal links and feeds them to the scraping API.
    Returns a list of {"url": ..., "html": ...} dicts.
    """
    domain   = allowed_domain or urlparse(start_url).netloc
    visited  = set()
    queue    = deque([start_url])
    results  = []

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue

        print(f"Scraping ({len(visited)+1}/{max_pages}): {url}")
        data = scrape(url)
        visited.add(url)

        if data.get("statusCode") == 200 and data.get("html"):
            results.append({"url": url, "html": data["html"]})

            # Discover new links on the page
            soup  = BeautifulSoup(data["html"], "html.parser")
            links = soup.find_all("a", href=True)

            for link in links:
                # Resolve relative links; drop #fragments so a page is not queued twice
                abs_url = urljoin(url, link["href"]).split("#", 1)[0]
                parsed  = urlparse(abs_url)
                # Only follow links within the same domain
                if parsed.netloc == domain and abs_url not in visited:
                    queue.append(abs_url)

        human_delay()   # from Tip 2; runs after every request, including failed ones

    print(f"Crawl complete: {len(results)} pages collected")
    return results

pages = crawl("https://example.com", max_pages=100)
💡 Scale tip: For large crawls, replace the in-memory queue with a persistent queue (Redis, PostgreSQL) so the crawler can resume after interruptions without losing progress. Also check robots.txt before queuing URLs (see Tip 1), and repeat that check for each new domain if you extend the crawler beyond a single site.
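
As a rough illustration of the persistent-queue idea, the sketch below uses SQLite from the Python standard library as a stand-in for Redis or PostgreSQL. The class name PersistentQueue and its table layout are assumptions made for this example.

```python
import sqlite3

class PersistentQueue:
    """FIFO URL queue stored in SQLite so a crawl can resume after a restart."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS urls ("
            " id INTEGER PRIMARY KEY AUTOINCREMENT,"
            " url TEXT UNIQUE NOT NULL,"
            " done INTEGER NOT NULL DEFAULT 0)"
        )
        self.db.commit()

    def push(self, url):
        # UNIQUE plus INSERT OR IGNORE deduplicates URLs across restarts.
        self.db.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))
        self.db.commit()

    def pop(self):
        """Return the oldest pending URL and mark it done, or None if none remain."""
        row = self.db.execute(
            "SELECT id, url FROM urls WHERE done = 0 ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        self.db.execute("UPDATE urls SET done = 1 WHERE id = ?", (row[0],))
        self.db.commit()
        return row[1]

    def close(self):
        self.db.close()
```

In crawl(), push() would replace queue.append() and pop() would replace queue.popleft(). Note one trade-off of this simple design: a URL is marked done when it is popped, so a crash in the middle of a scrape skips that one page on resume.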

Putting it all together

These seven web scraping API tips work best as a system rather than individually. Here is how they fit into a production pipeline:

| Stage | Tips applied |
| --- | --- |
| Before scraping | Tip 1: check robots.txt and crawl delay |
| During requests | Tips 2, 4: human delays, rotated headers |
| Response validation | Tip 3: multi-layer block detection |
| Hard targets | Tips 5, 6: JS rendering + residential proxies |
| URL discovery | Tip 7: crawler feeds URLs to the scraping API |
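
The stages in the table can be wired together roughly as follows. This is a sketch, not a reference implementation: run_pipeline is a hypothetical name, and the stage functions are passed in as arguments so the example stays self-contained. In practice they would be can_scrape (Tip 1), scrape (Tip 6), is_blocked (Tip 3), your own process handler, and human_delay (Tip 2).

```python
def run_pipeline(urls, can_scrape, scrape, is_blocked, process, delay):
    """Apply the stages in order to each URL and return a per-URL outcome map."""
    outcomes = {}
    for url in urls:
        # Before scraping: honour robots.txt
        if not can_scrape(url)["allowed"]:
            outcomes[url] = "skipped (robots.txt)"
            continue

        # Standard proxy first, then validate the response content
        data = scrape(url, premium=False)
        blocked, reason = is_blocked(data)

        # Hard targets: a single retry on the premium proxy
        if blocked:
            data = scrape(url, premium=True)
            blocked, reason = is_blocked(data)

        if blocked:
            outcomes[url] = f"blocked ({reason})"
        else:
            process(data)
            outcomes[url] = "ok"

        delay()   # human-like pause before the next request
    return outcomes
```

Passing the stages as parameters also makes the flow easy to test with stub functions before pointing it at real targets.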

Most of these techniques are handled automatically by Scraping-bot.io — proxy rotation, JS rendering, header management, and CAPTCHA bypassing are all built into the API. As a result, you can focus on parsing and using your data rather than maintaining scraping infrastructure.
