Web Scraping API Tips: 7 Best Practices for 2026
These web scraping API tips are drawn from years of production scraping experience across thousands of targets. Whether you are building your first scraper or hardening an existing pipeline, following these seven best practices will help you collect cleaner data, avoid blocks, and ship more reliable automations — with Python and Node.js code examples throughout.
Tip 1 — Respect the website and its robots.txt
The first of our web scraping API tips is also the most fundamental: always read the robots.txt file before scraping a site. This file, maintained by the website owner, specifies which pages are allowed or disallowed for automated access — and sometimes even defines acceptable crawl rates.
Respecting these rules is also practical. Scraping aggressively without reading robots.txt increases your chances of being blocked, rate-limited, or served honeypot data designed to detect bots.
Parsing robots.txt automatically in Python
from urllib.robotparser import RobotFileParser
from urllib.parse import urlparse

def can_scrape(url, user_agent="*"):
    """Return whether robots.txt allows the URL, plus any crawl delay."""
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()  # fetches and parses robots.txt over the network
    allowed = rp.can_fetch(user_agent, url)
    crawl_delay = rp.crawl_delay(user_agent)  # None if no Crawl-delay is set
    return {"allowed": allowed, "crawl_delay": crawl_delay}

result = can_scrape("https://example.com/products")
print(result)
# Example output for a site whose robots.txt sets "Crawl-delay: 2":
# {'allowed': True, 'crawl_delay': 2}
If crawl_delay is set, use it as your minimum delay between requests. Ignoring it is one of the fastest ways to get your IP blacklisted.
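One way to apply this is to treat the crawl delay as a floor under your own randomised pacing. The sketch below parses a made-up robots.txt from a string so that it runs offline; `SAMPLE_ROBOTS`, `load_rules`, and `polite_delay` are illustrative names, not part of any site or API.

```python
import random
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from a string so no network is needed.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
"""

def load_rules(robots_text):
    """Build a parser from robots.txt text instead of fetching it."""
    rp = RobotFileParser()
    rp.parse(robots_text.splitlines())
    return rp

def polite_delay(rp, user_agent="*", min_s=1.0, max_s=4.0):
    """Random delay in [min_s, max_s], never below the site's Crawl-delay."""
    floor = rp.crawl_delay(user_agent) or 0  # None when no Crawl-delay is set
    return max(floor, random.uniform(min_s, max_s))

rules = load_rules(SAMPLE_ROBOTS)
print(rules.can_fetch("*", "https://example.com/products"))     # True
print(rules.can_fetch("*", "https://example.com/admin/users"))  # False
print(f"Next request in {polite_delay(rules):.1f}s")
```

With these sample rules the computed delay is never shorter than 2 seconds, even when the random draw is lower.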
Tip 2 — Simulate human behaviour with realistic delays
Browsing speed is one of the clearest signals websites use to distinguish humans from bots. A script that fires requests every 50ms looks nothing like a human — and will be flagged almost immediately. Furthermore, many modern anti-bot systems track inter-request timing across sessions, not just individual request rates.
The solution is to add randomised delays that mimic genuine user behaviour: a pause while "reading" the page, occasional longer gaps, and varied timing between requests.
Realistic delay pattern in Python
import random
import time

def human_delay(min_s=1.0, max_s=4.0, long_pause_prob=0.1):
    """
    Wait between min_s and max_s seconds.
    Occasionally wait 8–15s to simulate a user reading a page.
    """
    if random.random() < long_pause_prob:
        delay = random.uniform(8, 15)
        print(f"Long pause: {delay:.1f}s")
    else:
        delay = random.uniform(min_s, max_s)
    time.sleep(delay)

urls = ["https://example.com/page/1",
        "https://example.com/page/2",
        "https://example.com/page/3"]

for url in urls:
    data = scrape(url)   # your scraping API call
    process(data)        # your own parsing or storage step
    human_delay()        # pause before the next request
Realistic delay in Node.js
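function humanDelay(minMs = 1000, maxMs = 4000, longPauseProb = 0.1) {
  const isLong = Math.random() < longPauseProb;
  const delay = isLong
    ? 8000 + Math.random() * 7000 // 8–15s long pause
    : minMs + Math.random() * (maxMs - minMs);
  return new Promise(r => setTimeout(r, delay));
}

// Run inside an async function or an ES module (top-level await).
// processData is your own handler; it is not named `process` because
// that would clash with Node's global `process` object.
for (const url of urls) {
  const data = await scrape(url);
  processData(data);
  await humanDelay();
}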
Tip 3 — Detect when you have been blocked
Not all blocks are obvious. A 403 Forbidden is easy to catch, but some websites use subtler techniques: they return a 200 OK status with a CAPTCHA page, serve deliberately fake data, or silently redirect you to a honeypot. Therefore, validating the response content — not just the status code — is essential.
Multi-layer block detection in Python
from bs4 import BeautifulSoup

BLOCK_SIGNALS = [
    "access denied",
    "captcha",
    "unusual traffic",
    "blocked",
    "403 forbidden",
    "rate limit exceeded",
    "you have been banned"
]

def is_blocked(response_data):
    """
    Returns (True, reason) if a block is detected, (False, None) otherwise.
    Checks: HTTP status, captchaFound flag, and HTML content.
    """
    # Check API-level flags
    if response_data.get("captchaFound"):
        return True, "captchaFound flag"
    status = response_data.get("statusCode", 200)
    if status in (403, 429, 503):
        return True, f"HTTP {status}"
    # Check the visible page text for block signals
    html = response_data.get("html") or ""
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(" ", strip=True).lower()
    for signal in BLOCK_SIGNALS:
        if signal in text:
            return True, f"Block signal in content: '{signal}'"
    # Check for suspiciously short response (honeypot / fake data)
    if len(html) < 500:
        return True, f"Suspiciously short response: {len(html)} chars"
    return False, None

# Usage
data = scrape("https://example.com/listings")
blocked, reason = is_blocked(data)
if blocked:
    print(f"⚠️ Blocked: {reason}; retrying with premium proxy")
    data = scrape("https://example.com/listings", premium=True)
    blocked, reason = is_blocked(data)  # re-check the retry before using it
if not blocked:
    process(data)
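Block phrases cannot catch deliberately fake data, because a decoy page may contain none of them. A complementary check, sketched here, is positive validation: confirm that the page contains the content you expect. The function name `validate_page` and the marker strings are assumptions for illustration; pick markers from the real page you are targeting.

```python
def validate_page(html, required_markers, min_length=500):
    """Return (ok, reason); ok is True only if every expected marker is present."""
    if len(html) < min_length:
        return False, f"too short: {len(html)} chars"
    lowered = html.lower()
    missing = [m for m in required_markers if m.lower() not in lowered]
    if missing:
        return False, f"missing expected content: {missing}"
    return True, None

# Hypothetical markers for a product listing page
MARKERS = ['class="price"', 'class="product-title"']

# A page with the expected elements (padded to pass the length check)
good_page = (
    '<div class="product-title">Mug</div><span class="price">9.99</span>'
    + " " * 500
)
# A long, harmless-looking page that contains no product data at all
decoy_page = "<html><body>" + "<p>Welcome to our store.</p>" * 30 + "</body></html>"

print(validate_page(good_page, MARKERS))
print(validate_page(decoy_page, MARKERS))
```

The decoy passes the length and block-phrase checks from the function above, yet fails here because neither marker is present.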
Tip 4 — Avoid getting blocked with the right headers and rotation
When a browser visits a website, it sends a bundle of headers (User-Agent, Accept-Language, Referer, and others) that together contribute to a browser fingerprint. Requests that omit these headers, or that carry outdated browser strings, are easy for anti-bot systems to flag.
Fortunately, using a managed web scraping API like Scraping-bot.io handles header rotation automatically. However, if you are making direct requests for simpler targets, here is how to do it correctly:
Rotating realistic headers in Python
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_4) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0"
]

def get_headers(referer=None):
    """Return a realistic header set with a randomly chosen User-Agent."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
        # With the requests library, "br" is only decoded if a brotli package is installed
        "Accept-Encoding": "gzip, deflate, br",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Referer": referer or "https://www.google.com/",
        "DNT": "1",
        "Connection": "keep-alive"
    }
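Choosing a new User-Agent on every request can itself look unusual when those requests share cookies or an IP address. A possible refinement, sketched below under that assumption, is to pick one header profile per session and let the Referer follow the pages you visit. `HEADER_PROFILES` and `SessionHeaders` are illustrative names, not part of any library.

```python
import random

# Each profile pairs a User-Agent with headers that plausibly go with it.
HEADER_PROFILES = [
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9",
    },
    {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) "
                      "Gecko/20100101 Firefox/125.0",
        "Accept-Language": "en-GB,en;q=0.8",
    },
]

class SessionHeaders:
    """Keeps one header profile for the whole session."""

    def __init__(self, rng=None):
        rng = rng if rng is not None else random
        self._base = dict(rng.choice(HEADER_PROFILES))  # chosen once
        self._last_url = None

    def for_request(self, url):
        """Headers for the next request; Referer is the previously visited URL."""
        headers = dict(self._base)
        if self._last_url is not None:
            headers["Referer"] = self._last_url
        self._last_url = url
        return headers

session = SessionHeaders()
first = session.for_request("https://example.com/category")
second = session.for_request("https://example.com/category/item-1")
print(first["User-Agent"] == second["User-Agent"])  # True
print(second["Referer"])  # https://example.com/category
```

Create a new `SessionHeaders` whenever you switch proxy or start a fresh cookie jar, so the identity changes as a whole rather than one header at a time.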
Tip 5 — Use a headless browser for JavaScript-heavy pages
A large proportion of modern websites load their content dynamically via JavaScript after the initial HTML response. Consequently, if you scrape the raw HTML directly, you will often receive an empty shell — no products, no prices, no listings — because the data has not yet been injected by JavaScript.
The solution is to use a headless browser that executes JavaScript and waits for the page to fully render before returning the HTML. With Scraping-bot.io, this is a single option in your API call:
Enabling JS rendering with Scraping-bot.io
import requests, base64

creds = base64.b64encode(b"your_username:your_api_key").decode()

def scrape_js(url, country=None):
    """
    Scrape a JavaScript-rendered page.
    waitForNetworkIdle: waits for all XHR/fetch calls to complete.
    """
    options = {"waitForNetworkIdle": True}
    if country:
        options["country"] = country
    r = requests.post(
        "https://api.scraping-bot.io/scrape/raw-html",
        headers={"Authorization": f"Basic {creds}",
                 "Content-Type": "application/json"},
        json={"url": url, "options": options},
        timeout=60,  # rendering can be slow; avoid hanging forever
    )
    return r.json()

# Scrape a JS-rendered product page with US geo-location
data = scrape_js("https://example-shop.com/products", country="us")
print(data["html"][:500])
When to use waitForNetworkIdle vs a simple request
| Page type | Recommended approach |
|---|---|
| Static HTML (blogs, news articles) | Simple request — faster and cheaper |
| Prices / listings loaded via AJAX | waitForNetworkIdle: true |
| Single-page apps (React, Vue, Angular) | waitForNetworkIdle: true |
| Pages behind geo-restrictions | waitForNetworkIdle: true + country |
| Pages with CAPTCHAs | premiumProxy: true + waitForNetworkIdle: true |
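To decide between the two approaches automatically, one rough heuristic is to fetch the page with a simple request first and measure how much visible text it contains. The sketch below uses only the standard library; `needs_js_rendering` and the 200-character threshold are assumptions to tune per site, not a rule from any API.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def needs_js_rendering(html, min_text_chars=200):
    """True if the raw HTML looks like an empty shell awaiting JavaScript."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    visible = " ".join(parser.chunks)
    return len(visible) < min_text_chars

# A typical single-page-app shell: an empty mount point plus a script bundle
spa_shell = '<html><body><div id="root"></div><script>var app = 1;</script></body></html>'
print(needs_js_rendering(spa_shell))  # True
```

When the function returns True, retry the same URL with waitForNetworkIdle enabled; otherwise the cheaper simple request was enough.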
Tip 6 — Use the right proxies for the right targets
Not all proxies are equal, and using the wrong type for a given target is one of the most common reasons scrapers fail. Specifically, datacenter IPs are fast and cheap but easily detected, while residential IPs are slower and more expensive but far harder to block.
| Proxy type | IP source | Detection risk | Best for |
|---|---|---|---|
| Datacenter | Cloud servers (AWS, GCP...) | High — easily flagged | Simple sites, internal tools, low-protection targets |
| Residential | Real ISP-assigned home IPs | Low — looks like a real user | E-commerce, social platforms, Google, Amazon |
| Geo-targeted | Residential IPs in a specific country | Very low | Price comparison across markets, geo-restricted content |
Upgrading to residential proxies on demand
import requests, base64

creds = base64.b64encode(b"your_username:your_api_key").decode()

def scrape(url, premium=False, country=None):
    options = {
        "premiumProxy": premium,
        "waitForNetworkIdle": True
    }
    if country:
        options["country"] = country
    r = requests.post(
        "https://api.scraping-bot.io/scrape/raw-html",
        headers={"Authorization": f"Basic {creds}",
                 "Content-Type": "application/json"},
        json={"url": url, "options": options},
        timeout=60,  # avoid hanging forever on a slow target
    )
    return r.json()

def scrape_with_fallback(url, country=None):
    """Try standard proxy first, upgrade to residential on block."""
    result = scrape(url, premium=False, country=country)
    if result.get("captchaFound") or result.get("statusCode") in (403, 429):
        print("Standard proxy blocked; retrying with residential proxy")
        result = scrape(url, premium=True, country=country)
    return result

# Example: scrape a geo-restricted page from Germany
data = scrape_with_fallback("https://example-shop.de/products", country="de")
Tip 7 — Build a web crawler to feed your scraping API
A scraper collects data from a known URL. A crawler, by contrast, discovers new URLs automatically by following links across pages. Together, they form a complete data collection pipeline: the crawler feeds URLs to the scraping API, which returns structured data for each page.
Simple breadth-first crawler in Python
from collections import deque
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse, urldefrag
import requests, base64

creds = base64.b64encode(b"your_username:your_api_key").decode()

def scrape(url):
    r = requests.post(
        "https://api.scraping-bot.io/scrape/raw-html",
        headers={"Authorization": f"Basic {creds}",
                 "Content-Type": "application/json"},
        json={"url": url, "options": {"waitForNetworkIdle": False}},
        timeout=60,
    )
    return r.json()

def crawl(start_url, max_pages=50, allowed_domain=None):
    """
    Breadth-first crawler.
    Discovers internal links and feeds them to the scraping API.
    Returns a list of {"url": ..., "html": ...} dicts.
    """
    domain = allowed_domain or urlparse(start_url).netloc
    visited = set()
    queue = deque([start_url])
    results = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        print(f"Scraping ({len(visited)+1}/{max_pages}): {url}")
        data = scrape(url)
        visited.add(url)
        human_delay()  # polite delay between pages (human_delay is defined in Tip 2)
        if data.get("statusCode") != 200:
            continue
        results.append({"url": url, "html": data["html"]})
        # Discover new links on the page
        soup = BeautifulSoup(data["html"], "html.parser")
        for link in soup.find_all("a", href=True):
            # Resolve relative links and drop #fragments to avoid duplicates
            abs_url = urldefrag(urljoin(url, link["href"])).url
            # Only follow links within the same domain
            if urlparse(abs_url).netloc == domain and abs_url not in visited:
                queue.append(abs_url)
    print(f"Crawl complete: {len(results)} pages collected")
    return results

pages = crawl("https://example.com", max_pages=100)
Check robots.txt for each new domain you discover (see Tip 1).
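A crawler that spans several domains would otherwise re-download robots.txt for every URL. A minimal sketch of a per-host cache follows; `RobotsCache` and its `fetch_text` hook are hypothetical names, and the hook is injected so that you can plug in your own HTTP client (the demo uses an in-memory stand-in).

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

class RobotsCache:
    """Caches one parsed robots.txt per host so each is fetched only once."""

    def __init__(self, fetch_text, user_agent="*"):
        # fetch_text(robots_url) -> str is supplied by the caller (hypothetical hook)
        self._fetch_text = fetch_text
        self._user_agent = user_agent
        self._parsers = {}

    def allowed(self, url):
        """True if the host's robots.txt permits fetching this URL."""
        parsed = urlparse(url)
        host = parsed.netloc
        if host not in self._parsers:
            robots_url = f"{parsed.scheme}://{host}/robots.txt"
            rp = RobotFileParser()
            rp.parse(self._fetch_text(robots_url).splitlines())
            self._parsers[host] = rp
        return self._parsers[host].can_fetch(self._user_agent, url)

# Demo with an in-memory stand-in for the network
FAKE_FILES = {
    "https://shop.example/robots.txt": "User-agent: *\nDisallow: /checkout/\n",
}
cache = RobotsCache(lambda robots_url: FAKE_FILES.get(robots_url, ""))

print(cache.allowed("https://shop.example/products"))       # True
print(cache.allowed("https://shop.example/checkout/cart"))  # False
```

In the crawler above, you would call `cache.allowed(abs_url)` before appending a link to the queue.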
Putting it all together
These seven web scraping API tips work best as a system rather than individually. Here is how they fit into a production pipeline:
| Stage | Tips applied |
|---|---|
| Before scraping | Tip 1 — check robots.txt and crawl delay |
| During requests | Tips 2, 4 — human delays, rotated headers |
| Response validation | Tip 3 — multi-layer block detection |
| Hard targets | Tips 5, 6 — JS rendering + residential proxies |
| URL discovery | Tip 7 — crawler feeds URLs to the scraping API |
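The stages in the table can be wired together roughly as follows. This is a sketch rather than a finished pipeline: `run_pipeline` is a hypothetical name, and every collaborator (`is_allowed`, `fetch`, `is_blocked`, `wait`) is passed in as a function, so you can supply the helpers from the earlier tips or your own.

```python
def run_pipeline(urls, is_allowed, fetch, is_blocked, wait):
    """
    Process URLs one by one.
    Returns (results, skipped): results holds (url, data) pairs,
    skipped holds (url, reason) pairs.
    """
    results, skipped = [], []
    for url in urls:
        # Before scraping (Tip 1): honour robots.txt
        if not is_allowed(url):
            skipped.append((url, "disallowed by robots.txt"))
            continue

        # During requests (Tips 4, 5): fetch on the standard proxy first
        data = fetch(url, premium=False)

        # Response validation (Tip 3)
        blocked, reason = is_blocked(data)
        if blocked:
            # Hard targets (Tip 6): one retry on the premium proxy
            data = fetch(url, premium=True)
            blocked, reason = is_blocked(data)

        if blocked:
            skipped.append((url, reason))
        else:
            results.append((url, data))

        # Pacing (Tip 2): pause before the next request
        wait()
    return results, skipped
```

Passing the collaborators in keeps the loop easy to test with stubs and lets you swap the delay or proxy strategy without touching the control flow.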
Most of these techniques are handled automatically by Scraping-bot.io — proxy rotation, JS rendering, header management, and CAPTCHA bypassing are all built into the API. As a result, you can focus on parsing and using your data rather than maintaining scraping infrastructure.


