Banning: The Second Challenge in Web Data Acquisition Projects

The second most popular question people ask when they learn that I work in the web data collection / acquisition space is “how do you deal with banning?”. Could you guess what the first question is? And no, it’s not related to the first challenge we looked at last week: setup. One hint: if this banning question is only asked by people who have some experience with web data acquisition, then the first question is asked by people who have only heard of the concept.

I see many articles going over tactics to overcome banning or blocking, but it’s important to understand the fundamentals of how the different mechanisms map out conceptually.

First, what do we mean by banning? Banning is when web servers implement limiting mechanisms so that they do not return responses in the format you’re expecting, at the rate you’re requesting.

Each web server has different capacity and implements its own rules to ensure its resources are available, not abused, legitimately used (minimising DDoS attacks and fraud), and well distributed. To do this, servers try to recognise whether a real human is accessing the webpages or a bot / crawler is. These mechanisms are also known as rate limiting, anti-bot systems, bot detection systems, or crawling countermeasures.
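To make “rate limiting” concrete, here is a minimal sketch of the token-bucket idea many servers use to budget requests per client. The class name and parameters are my own illustration, not tied to any particular vendor’s implementation:

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter: allow roughly `rate` requests per
    second, with bursts up to `capacity`. Requests beyond the budget are
    rejected, which on a real server might mean an HTTP 429 or a ban page."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=1.0, capacity=5)
results = [bucket.allow() for _ in range(10)]  # 10 requests in a tight loop
# The first 5 pass (the burst budget); the rest are rejected.
```

From the scraper’s side, this is why sudden bursts get blocked while the same total volume, spread out over time, sails through.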

As much as we follow best practices so as not to abuse service providers while accessing public web data, our data collection needs will conflict with this varying set of rules and capacities. Essentially: how do you ensure you don’t cross that thin and arguable line between running legitimate web scrapers and malicious bots throwing aggressive traffic?

Let’s now look at the different bot detection mechanisms and common methods for overcoming them — activities I like to label as anti-ban efforts.

This image illustrates the different methods commonly used (in combination), arranged by their level of sophistication.

[image here]

We have seen a great deal of innovation in web technologies over the past 12 years. In the early days, sites were served as HTML, vanilla CSS, and basic JavaScript. These days websites are built to deliver more interactive user experiences, which requires more front end assets and allows for more advanced tracking. Couple this with more people realising the value of web data, and you get more sophisticated bots developed alongside the technologies that protect against them. Until about 5 years ago, it was sufficient to rely on back end methods to keep most bots away.

If I do it in-house, what should I expect?

Here is a more tactical look at what to expect if you’re looking to overcome anti-bot systems, with some quick wins for each approach.

Network and back end level

  • Manage your sessions (a combination of user agents, source IP, headers sent, cookies used). Quick win: think like a browser. Fake it until you make it.
  • Mind your request pattern (be mindful of suspicious path bypasses and the velocity of requests). Quick win: apply throttling, slow down.
  • Use appropriate geolocation. Quick win: proxy management. Start with static proxies; subscribe to a more advanced solution with a larger pool when needed.
  • Ensure your TCP/IP fingerprint is consistent. Quick win: ensure the TTL and window size fields are consistent.
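As a sketch of the first two quick wins (“think like a browser” and “slow down”), here is what browser-like headers and a jittered delay might look like in Python. The header values and timing numbers are illustrative assumptions, not a recommendation for any particular site:

```python
import random
import time

# Headers a mainstream browser would typically send. A bare request
# (a default library User-Agent, no Accept-Language) is an easy tell
# for back end filters. The exact values here are illustrative.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
}

def polite_delay(base: float = 2.0, jitter: float = 1.0) -> float:
    """Sleep for a randomised interval between requests and return it.
    Perfectly regular intervals are a bot signature; jitter makes the
    request pattern look less machine-like."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay
```

You would pass `BROWSER_HEADERS` into whatever HTTP client you use, and call `polite_delay()` between requests.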

Front end level. Be prepared for:

  • JavaScript capability / rendering checks. Quick win: Playwright or Puppeteer.
  • Browser fingerprinting. Browsers have a set of properties that anti-bot systems can check for inconsistencies: OS version, canvas API checks, WebGL tests, TLS fingerprinting, WebRTC. (Not so) quick win: headless-browser-farm-as-a-service.
  • Captchas (graphical / explicit, behavioural / implicit). Quick win: find ways to not trigger these. This is one good place to practise “you don’t need to solve a problem when you can just run away from it”. One thing to note: this method is also the most intrusive to human visitors, so websites tend to use it sensibly.
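To illustrate the kind of inconsistency a fingerprinting check hunts for, here is a toy example comparing the OS claimed in the User-Agent against the `navigator.platform` value a real browser on that OS would report. The mapping table is a simplified assumption for illustration, not an exhaustive rule set:

```python
# Anti-bot systems cross-check properties the client reports. A classic
# giveaway: the User-Agent claims one OS while navigator.platform (or the
# TCP/IP fingerprint) suggests another. This mapping is illustrative only.
UA_PLATFORM_HINTS = {
    "Windows NT": "Win32",
    "Macintosh": "MacIntel",
    "X11; Linux": "Linux x86_64",
}

def fingerprint_consistent(user_agent: str, navigator_platform: str) -> bool:
    """Return True if the OS claimed in the User-Agent matches the
    platform value a browser on that OS would expose."""
    for ua_token, platform in UA_PLATFORM_HINTS.items():
        if ua_token in user_agent:
            return navigator_platform == platform
    return False  # unknown User-Agent family: treat as suspicious

# A headless Linux browser spoofing a Windows User-Agent fails the check:
fingerprint_consistent(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...", "Linux x86_64")  # False
```

Real systems check dozens of such signals at once, which is why spoofing one property in isolation rarely works.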

If I want to buy, what are my options?

All the solutions employ a combination of methods described above to either detect or evade detection.

Different anti-bot solutions:

  • Datadome
  • PerimeterX
  • Incapsula
  • Akamai
  • F5 Bot Detection & Security
  • Cloudflare
  • Alibaba Cloud
  • Google ReCaptcha

Different anti-ban solutions:

You can’t help but notice that a lot of these solutions (and many more) revolve around proxies. Proxy rotation and management is a whole other interesting and substantial topic that deserves its own post to go deeper. It is still one of the most effective ways to overcome most bot detection systems.
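As a taste of what proxy management involves, here is a minimal round-robin pool sketch that drops proxies once they get banned. The addresses and class name are hypothetical placeholders:

```python
import itertools

class ProxyPool:
    """Minimal rotating proxy pool: cycle through proxies round-robin,
    and remove any that gets banned so it is never handed out again."""

    def __init__(self, proxies):
        self.proxies = list(proxies)
        self._cycle = itertools.cycle(list(proxies))

    def next(self) -> str:
        if not self.proxies:
            raise RuntimeError("all proxies banned; refill the pool")
        proxy = next(self._cycle)
        while proxy not in self.proxies:  # skip proxies banned mid-cycle
            proxy = next(self._cycle)
        return proxy

    def mark_banned(self, proxy: str) -> None:
        if proxy in self.proxies:
            self.proxies.remove(proxy)

pool = ProxyPool(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"])
```

A production rotator adds much more: health checks, cooldown periods before retrying a banned proxy, and weighting by geolocation or success rate.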

All of these are simple to grasp but not easy to execute. Most of the time it’s better to buy the SaaS solutions mentioned above and free up your team’s precious time from the hassle (hence, cost) of managing different proxy types, applying throttling, manual ban handling, writing middlewares, setting up a headless browser infrastructure, and reverse engineering complex JavaScript. Of course, I’m addressing this from the point of view of a data collection project, but the same goes if you have a website and are struggling with malicious bots.

At the end of the day there is no silver bullet for anti-ban efforts. If you decide to run the show yourself, it all comes down to: be consistent, be thorough, and be respectful. If you go with SaaS anti-ban solutions, be prepared to experiment with your crawlers’ parameters, be patient, and be ready to make trade-offs.
