The second most popular question people ask when they learn that I work in the web data collection / acquisition space is “how do you go about banning?”. Can you guess what the first question is? And no, it’s not related to the first challenge we looked at last week: Setup. One hint: while the banning question is only asked by people who have some experience with web data acquisition, the first question is also asked by people who have only heard of the concept.
I see many articles going over the tactics to overcome banning or blocking, but it’s important to understand the fundamentals of how the different mechanisms are mapped out conceptually.
First, what do we mean by banning? Banning is when a web server applies limiting mechanisms so that it does not return the response in the format you’re expecting, or at the rate you’re requesting.
Each web server has a different capacity and implements its own rules to ensure its resources are available, not abused, legitimately used (minimising DDoS attacks and fraud), and well distributed. To do this, the server will try to recognise whether a real human or a bot / crawler is accessing the webpages. These mechanisms are also known as rate limiting, anti-bot systems, bot detection systems, or crawling countermeasures.
As much as we follow best practices so as not to abuse the service providers while accessing public web data, our data collection needs will conflict with this wide range of rules and varying capacities. Essentially, how do you ensure you don’t cross that thin and arguable line between running legitimate web scrapers and malicious bots throwing aggressive traffic?
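To make the definition of banning concrete, here is a minimal Python sketch of how a crawler might decide that a response is "not in the format you're expecting". The status codes, the marker strings, and the `Response` type are all assumptions made for illustration; every site signals a block differently, so treat them as starting points to tune rather than a reliable rule set.

```python
from dataclasses import dataclass

# Assumed signals: real sites differ, so tune these per target.
BAN_STATUS_CODES = {403, 429, 503}
CHALLENGE_MARKERS = ("captcha", "access denied", "unusual traffic")


@dataclass
class Response:
    """Minimal stand-in for an HTTP response (hypothetical type)."""
    status: int
    body: str


def looks_banned(response: Response, expected_marker: str) -> bool:
    """Return True if the response is probably not the page we asked for."""
    # Hard block: the server refuses or rate limits outright.
    if response.status in BAN_STATUS_CODES:
        return True
    text = response.body.lower()
    # Challenge page served in place of the content.
    if any(marker in text for marker in CHALLENGE_MARKERS):
        return True
    # Soft block: a 200 that lacks the content we expected to find.
    return expected_marker.lower() not in text
```

The `expected_marker` check matters because many bans arrive as a normal-looking 200 response with the data quietly missing. The marker list can also produce false positives on pages that legitimately mention captchas, which is one more reason to tune it per site.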
Let’s now look at the different bot detection mechanisms and the common methods for overcoming them, activities I like to label as anti-ban efforts.
This image illustrates the different methods commonly used (in combination) arranged by their level of sophistication.
If I do it in-house, what should I expect?
Here is a more tactical look at what you need to expect if you’re looking to overcome anti-bots, with some quick wins for each approach.
Network and back end level
- Manage your sessions (a combination of user agents, source IP, headers sent, cookies used). Quick win: think like a browser. Fake it until you make it.
- Mind your request pattern (be mindful of suspicious path-bypass, velocity of requests). Quick win: apply throttling, slow down.
- Use appropriate geolocation. Quick win: proxy management. Start with static proxies, then subscribe to a more advanced solution with a larger pool when needed.
- Ensure your TCP/IP fingerprint is consistent. Quick win: ensure the TTL and window size fields are consistent.
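The first two quick wins above, "think like a browser" and "slow down", can be sketched in a few lines of Python using only the standard library. The header values, the `Throttle` class, and `build_request` are assumptions for illustration, not a recommended profile; the point is that the same headers go out on every request in a session and that requests are spaced out with a little randomness.

```python
import random
import time
from urllib.request import Request

# Hypothetical browser-like header set, kept identical for the whole session.
SESSION_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}


class Throttle:
    """Enforce a minimum, optionally randomised gap between requests."""

    def __init__(self, min_delay, jitter=0.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.jitter = jitter
        self._clock = clock    # injectable so the class can be tested
        self._sleep = sleep
        self._last = None      # time of the previous request, if any

    def wait(self):
        """Sleep if the previous request was too recent; return the delay."""
        delay = 0.0
        if self._last is not None:
            target = self.min_delay + random.uniform(0, self.jitter)
            delay = max(0.0, target - (self._clock() - self._last))
            if delay > 0:
                self._sleep(delay)
        self._last = self._clock()
        return delay


def build_request(url):
    """Attach the same session headers to every request."""
    return Request(url, headers=SESSION_HEADERS)
```

In a crawl loop you would call `throttle.wait()` before each `urllib.request.urlopen(build_request(url))`. A jitter of a second or two keeps the request rhythm from looking perfectly mechanical.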
Front end level. Be prepared for:
- Browser fingerprinting. Browsers have a set of properties that anti-bot systems can use to check for inconsistencies, such as OS version, canvas API checks, WebGL tests, TLS fingerprinting, and WebRTC. (Not so) quick win: headless-browser-farm-as-a-service.
- Captchas (graphical / explicit, behavioral / implicit). Quick win: find ways to not trigger these. This is one good place to practice “You don’t need to solve your problems when you can just run away from them”. One thing to note is that this method is also the most intrusive to human visitors, so websites tend to use it sparingly.
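Fingerprinting is largely about spotting inconsistencies, so it helps to check your own profile before a detection system does. The Python sketch below only covers the HTTP header layer, under the assumption that your crawler sends a `User-Agent` and possibly the `Sec-CH-UA-Platform` client hint; the mapping table and the `find_inconsistencies` helper are hypothetical. Canvas, WebGL, TLS, and WebRTC checks happen below or inside a real browser and cannot be validated this way.

```python
# Hypothetical mapping from a User-Agent substring to the platform value
# a browser on that OS would be expected to send in Sec-CH-UA-Platform.
UA_TOKEN_TO_PLATFORM = {
    "Windows NT": '"Windows"',
    "Macintosh": '"macOS"',
    "X11; Linux": '"Linux"',
}


def find_inconsistencies(headers: dict) -> list:
    """Return human-readable problems found in a header profile."""
    problems = []
    ua = headers.get("User-Agent", "")
    hint = headers.get("Sec-CH-UA-Platform")

    if not ua:
        problems.append("missing User-Agent")

    # Which platform does the User-Agent string claim, if any?
    expected = next(
        (platform for token, platform in UA_TOKEN_TO_PLATFORM.items()
         if token in ua),
        None,
    )
    if hint is not None and expected is not None and hint != expected:
        problems.append(
            f"User-Agent implies {expected} but Sec-CH-UA-Platform is {hint}"
        )

    # Assumed rule of thumb: a Chrome-like client normally sends this header.
    if "Chrome/" in ua and "Accept-Language" not in headers:
        problems.append("Chrome User-Agent without Accept-Language")

    return problems
```

Running a check like this whenever you rotate user agents catches the most common self-inflicted mismatch: swapping the `User-Agent` while leaving the rest of the headers untouched.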
If I want to buy, what are my options?
All the solutions employ a combination of methods described above to either detect or evade detection.
Different antibot solutions:
- F5 Bot Detection & Security
- Alibaba Cloud
- Google ReCaptcha
Different antiban solutions:
- Zyte Smart Browser API
- Zyte Smart Proxy Manager
- Bright Data Web Unlocker
- Bright Data Proxy Manager
- Oxylabs.io offers different proxy types and services
- Limeproxies offers different proxy types and services
- Netnut.io (Proxy provider)
- https://smartproxy.com/ (Proxy provider)
- https://rayobyte.com/ (Proxy provider)
- https://github.com/claffin/cloudproxy (self hosted, if you manage your own pool of proxies)
- Death by Captcha
You can’t help but notice that a lot of these solutions (and many more) revolve around proxies. The proxy rotation and management space is indeed a whole other interesting and substantial topic that I think deserves its own post to go deeper. It is still one of the most effective ways to overcome most bot detection systems.
At the end of the day there is no silver bullet for anti-ban efforts. If you decide to run the show yourself, it all comes down to: be consistent, be thorough, and be respectful. If you go with SaaS anti-ban solutions, be prepared to experiment with your crawlers’ parameters, be patient, and be ready to make trade-offs.
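As a taste of what proxy rotation involves, here is a minimal Python sketch of a round-robin pool that benches a proxy for a while after it gets banned. The `ProxyPool` class, its method names, and the 300-second default cooldown are assumptions for illustration; commercial proxy managers layer much more on top, such as geotargeting, session stickiness, and health scoring.

```python
import itertools
import time


class ProxyPool:
    """Round-robin over proxies, skipping any that are cooling down."""

    def __init__(self, proxies, cooldown=300.0, clock=time.monotonic):
        self._proxies = list(proxies)
        if not self._proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(self._proxies)
        self._cooldown = cooldown
        self._clock = clock          # injectable so the pool can be tested
        self._benched_until = {}     # proxy -> time it becomes usable again

    def next_proxy(self):
        """Return the next usable proxy, or None if all are cooling down."""
        now = self._clock()
        # Look at each proxy at most once per call.
        for _ in range(len(self._proxies)):
            proxy = next(self._cycle)
            if self._benched_until.get(proxy, 0.0) <= now:
                return proxy
        return None

    def report_ban(self, proxy):
        """Bench a proxy for the cooldown period after a ban signal."""
        self._benched_until[proxy] = self._clock() + self._cooldown
```

When `next_proxy()` returns `None`, the respectful move is to pause the crawl rather than hammer the site through proxies that are already flagged.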