Sometimes you find that your current setup doesn’t give you sufficient data to achieve your business goals. You might feel the need to increase the frequency, broaden the data collection parameters (for example keywords, zip codes, or categories), or add new data sources. You need to scale the project.
What do you need to pay attention to in order to scale your data acquisition project successfully? Three words: aim, build, and arm.
I see three building blocks to scalability:
- Feasibility (aim)
- Architecture (build)
- Infrastructure (arm)
Set your target. What scale can you aim for? This is usually expressed concretely in the unit of “HTTP(S) requests per month”.
Find out: what target volume are you looking to achieve?
How: check your target site on similarweb.com to get a sense of the monthly traffic volume the site usually handles, and make sure you’re not abusing it, raising flags, or flooding the site unnecessarily. A good rule of thumb is to add at most 10% to a site’s monthly traffic.
Remember: the higher the burden on the server, the higher the chance of getting banned. It’s all about finding the sweet spot.
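As a back-of-the-envelope check, the 10% rule translates into a concrete request budget. The traffic figure below is purely illustrative; plug in what you read off similarweb.com:

```python
# Back-of-the-envelope request budget based on the 10% rule of thumb.
# site_monthly_visits is a hypothetical figure read off similarweb.com.
site_monthly_visits = 5_000_000

# Cap the crawl at 10% of the site's own traffic.
monthly_request_budget = int(site_monthly_visits * 0.10)

# Spread evenly over a 30-day month to get a sustainable request rate.
seconds_per_month = 30 * 24 * 60 * 60
requests_per_second = monthly_request_budget / seconds_per_month

print(monthly_request_budget)         # 500000
print(round(requests_per_second, 2))  # ~0.19 req/s
```

Roughly one request every five seconds, sustained, already yields half a million pages a month.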
You want something efficient and foolproof. Your solution can only go as fast as its bottleneck.
Three components to design:
a. Discovery process (your crawlers)
b. Extraction process (your scrapers)
c. Maintenance process (your monitors)
Crawling / discovery
What is discovery? To collect the data, you need to first discover the page. This is built into the crawlers’ logic.
You’d want to identify the navigation flow that gets you to the page you need with the fewest requests, without raising anti-bot awareness. Scripting a headless browser often looks like the easiest way to arrive at your data, especially on more interactive, dynamic websites, but reverse engineering the network activity is almost always the less resource-intensive approach.
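To illustrate the reverse-engineering route: interactive pages typically fetch their data from a JSON endpoint you can spot in the browser’s network tab, and parsing that payload directly is far cheaper than rendering the page. The payload shape below is hypothetical:

```python
import json

# Hypothetical JSON payload of the kind a page's XHR endpoint returns
# (found by watching the browser's network tab). Calling that endpoint
# with a plain HTTP client skips the headless browser entirely.
raw = '{"results": [{"id": 1, "title": "Item A"}, {"id": 2, "title": "Item B"}], "next_page": 2}'
data = json.loads(raw)

# Extract only the fields we need; pagination comes for free via next_page.
items = [(r["id"], r["title"]) for r in data["results"]]
print(items)  # [(1, 'Item A'), (2, 'Item B')]
```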
Tip: implement a producer–consumer architecture if you expect more than 100k items per crawl, when there’s more at stake if the job fails. This architecture also works best when the detail pages are stateless, i.e. all information can be requested through the URL without any sessions. You can emulate the needed sessions in code, but it takes more effort.
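A minimal sketch of that producer–consumer split, using a thread-safe queue (the URLs and counts are stand-ins; in production the producer would be your crawler and the consumers your scrapers):

```python
import queue
import threading

# Producer–consumer sketch: one discovery thread (producer) pushes
# detail-page URLs onto a queue; worker threads (consumers) pull and
# process them independently, so one failed item doesn't sink the job.
url_queue = queue.Queue()
results = []
results_lock = threading.Lock()
NUM_CONSUMERS = 3

def producer():
    # Stand-in for the crawler: stateless detail-page URLs only.
    for i in range(10):
        url_queue.put(f"https://example.com/item/{i}")
    # One sentinel per consumer to signal shutdown.
    for _ in range(NUM_CONSUMERS):
        url_queue.put(None)

def consumer():
    while True:
        url = url_queue.get()
        if url is None:
            break
        # Stand-in for the scraper: fetch + parse would happen here.
        with results_lock:
            results.append(url)

threads = [threading.Thread(target=producer)]
threads += [threading.Thread(target=consumer) for _ in range(NUM_CONSUMERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(results))  # 10
```

Because discovery and extraction only share the queue, either side can be scaled, restarted, or retried without touching the other.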
Once you have a grasp of the monthly volume you want to start aiming for, work backwards to a sensible number of concurrent crawlers to spawn, and at what frequency. The key here is experimentation. Websites are living beings and you need to be willing to keep up with the tango.
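Working backwards might look like this; the per-crawler rate and crawl window are illustrative numbers you’d replace with what your own experiments measure:

```python
import math

# Working backwards from a monthly target to a crawler count.
# All inputs are illustrative; measure your own per-crawler throughput.
monthly_target_requests = 3_000_000
requests_per_crawler_per_second = 0.5   # polite, measured rate per crawler
hours_crawling_per_day = 8              # e.g. off-peak window only

# How many requests one crawler can do in a 30-day month.
capacity_per_crawler = (
    requests_per_crawler_per_second * hours_crawling_per_day * 3600 * 30
)
crawlers_needed = math.ceil(monthly_target_requests / capacity_per_crawler)
print(crawlers_needed)  # 7
```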
Scraping / extraction
Extraction is the part where you parse and extract the data from the pages already discovered by the crawlers. Implementation-wise, scrapers are often built and run in the same unit of code as the crawlers. I’ve separated the two here to better highlight how we can architect the solution.
Rules of thumb:
- Keep it atomic, generic, and modular. If you’re working with several layouts or sites that share the same pattern, it’s a good idea to implement base crawlers and then inherit as needed to increase maintainability.
- Mind your memory footprint. E.g. when you’re looking to store HTML or screenshots, it’s better to decouple this into a post-processing step/script rather than doing it on the fly, to optimise memory usage and crawl runtime.
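The base-crawler idea above can be sketched with plain classes (the same pattern maps onto Scrapy spiders; the extractor names and selectors are hypothetical):

```python
# Base-crawler pattern: shared parsing logic lives in one place,
# and each site layout only overrides the bits that differ.
class BaseExtractor:
    price_selector = None  # layout-specific, set by subclasses

    def parse_price(self, raw: str) -> float:
        # Generic cleanup shared by every site layout.
        return float(raw.replace("$", "").replace(",", "").strip())

class SiteAExtractor(BaseExtractor):
    price_selector = "span.price"

class SiteBExtractor(BaseExtractor):
    price_selector = "div#cost > b"

print(SiteAExtractor().parse_price("$1,299.00"))  # 1299.0
```

A fix to the shared cleanup logic now propagates to every layout at once, which is exactly the maintainability win the rule of thumb is after.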
Don’t step on the gas before putting your seatbelt on. How do you make sure your datafeed is healthy, what’s your plan if something breaks, and how do you know which part has broken?
Next week we will take a deeper look at data quality (and maintenance!), but for now, know that there are three areas where you want early detection systems, scenario handling, and a fallback plan:
- Data coverage
- Field coverage
- Field accuracy
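A toy early-detection check for those three dimensions might look like this, run against each scraped batch (the items, fields, and thresholds are all illustrative):

```python
# Toy health checks over a batch of scraped items.
# Items, fields, and thresholds are illustrative only.
items = [
    {"title": "A", "price": 9.99},
    {"title": "B", "price": None},
    {"title": "C", "price": 12.5},
]
expected_item_count = 3

# 1. Data coverage: did we get roughly the number of items we expected?
data_coverage = len(items) / expected_item_count

# 2. Field coverage: how often is each field actually populated?
field_coverage = sum(1 for i in items if i["price"] is not None) / len(items)

# 3. Field accuracy: do populated values pass sanity checks?
prices = [i["price"] for i in items if i["price"] is not None]
field_accuracy = sum(1 for p in prices if 0 < p < 10_000) / len(prices)

assert data_coverage >= 0.95, "data coverage dropped"
assert field_coverage >= 0.6, "field coverage dropped"
assert field_accuracy == 1.0, "field accuracy dropped"
print(data_coverage, round(field_coverage, 2), field_accuracy)
```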
In the meanwhile, have a look at these Scrapy-based solutions for monitoring: spidermon and spider contracts.
There are also monitoring SaaS products like scrapeops.io that can help (though it has yet to support NodeJS solutions at this moment).
Load up the ammo. You have the trip planned, the car bought and configured; now it’s time to fill up the gas tank.
Two types of infrastructure involved in data acquisition projects:
- Hosting: somewhere to run your code. If you use Scrapy, the most popular self-hosted option is ScrapydWeb. SpiderKeeper is another good alternative, but it hasn’t been maintained for some time. If you don’t want the hassle, Scrapy Cloud is the best Platform-as-a-Service for this, based on the open-source scrapyd tool.
- Fetching: proxy management and other anti-ban solutions.
Rule of thumb: understand the fundamentals of memory and runtime complexity, and be frugal. Remember, the more you try to collect, the more resources you need. So collect what’s necessary and decide what’s sufficient.
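Frugality in practice often just means streaming: process items one at a time through a generator instead of materialising the whole crawl in memory. A minimal sketch:

```python
# Stream items through a generator: memory stays flat no matter how
# many pages the crawl produces.
def parse_items(lines):
    # Lazily yield one parsed record at a time.
    for line in lines:
        yield line.strip().upper()

# Simulated crawl output; in a real job this would be a file or
# response stream. Nothing is computed until it's consumed.
raw = (f"item-{i}\n" for i in range(1_000_000))

first_three = []
for record in parse_items(raw):
    first_three.append(record)
    if len(first_three) == 3:
        break  # the remaining ~1M records were never held in memory

print(first_three)  # ['ITEM-0', 'ITEM-1', 'ITEM-2']
```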
I often see problems that can be solved without pushing the scale to an unsustainable rate. It’s also surprisingly common to run into cases where the datafeed recipient realises they don’t yet have the technical capacity or expertise to handle the volume of data they thought they needed and demanded.
As the saying goes: anything is possible with enough time and budget. I’d add creativity and common sense to that list. Understand exactly why you need that much data at that exact frequency, see what trade-offs you can make, and be flexible in reshaping the scope. Start small, scale up, iterate.