Let’s dive into the first common challenge in data acquisition projects: Setup. You can find the previous post in this series here.
How do you get value from web data? What does the end to end process look like? What questions should you be asking?
To reiterate the context, the use case I’m covering here is when the data you need exists on websites or other HTTP-accessible resources. These projects are commonly known as web scraping, web data, or web automation projects. The idea is that you need structured data from one or more unstructured sources.
Definition
I define setup as the process that covers
1. translating the initial business problem statement into a requirement
2. turning the requirement into a technically feasible, commercially viable, and legally compliant specification
3. mobilising engineering teams to build the mechanisms to a) collect the data, b) run any processing needed to turn it into a usable state, and c) integrate it into the appropriate location so it can be ingested and utilised by your end users, and finally
4. iterating with all the stakeholders to ensure the solution is fit for purpose.
Essentially: getting the project from inception to production. This is the part of a data project that demands the most collaboration and alignment from different parts of the organisation.
Process
Let’s break the four phases above further into concrete activities that ideally happen at this stage:
1. Flesh out the business problem and needs
2. Translate the need into requirement docs
3. Turn the requirement into a technically, legally, and financially feasible project specification
4. Identify the data sources
5. Design the data model
6. Develop the crawlers / data collectors (a minimal code sketch of steps 6 to 9 follows this list)
7. Build the data pre/post processor
8. Write the integration scripts
9. Test the data quality
10. Build or run whatever remaining pieces of work are needed to come full circle and answer the business question. Dashboards? A presentation? The launch of the data-driven service?
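To make the engineering activities more concrete, here is a minimal sketch of what steps 6 to 9 could look like in Python for a hypothetical product-listing page. The URL, CSS selectors, and field names are invented for illustration, and a real project would add scheduling, error handling, and a proper storage target.

```python
# Minimal sketch of steps 6-9 for a hypothetical product-listing page.
# The URL, CSS selectors, and field names are illustrative assumptions.
import csv

import requests
from bs4 import BeautifulSoup


def collect(url):
    """Step 6: fetch a listing page and extract raw records."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    return [
        {
            "name": card.select_one(".product-name").get_text(strip=True),
            "price": card.select_one(".price").get_text(strip=True),
        }
        for card in soup.select(".product-card")
    ]


def process(records):
    """Step 7: normalise raw values into a usable state."""
    for record in records:
        record["price"] = float(record["price"].lstrip("$").replace(",", ""))
    return records


def integrate(records, path):
    """Step 8: land the data where downstream users can ingest it."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "price"])
        writer.writeheader()
        writer.writerows(records)


def check_quality(records):
    """Step 9: a very basic data quality gate."""
    assert records, "no records collected"
    assert all(record["price"] > 0 for record in records), "invalid price found"


if __name__ == "__main__":
    records = process(collect("https://example.com/products"))
    check_quality(records)
    integrate(records, "products.csv")
```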
Having a conceptual breakdown like this helps you plan, apply good governance, and, most importantly, make sense of the process and support decision making. For example, you could populate a RACI matrix for each activity, which is especially helpful when you’re working with vendors.
Setup is the part of the project that needs to be designed to anticipate and handle iteration, because data needs are fluid and contextual. Business needs evolve and decision makers adapt as they receive more information. More specifically, in web data acquisition projects the data lives on external systems, and those systems can change at any time, adding to the fragility.
Issues to anticipate
Non-technical:
1. Expectation alignment
2. Overlooked requirements
3. Compliance
4. Ownership alignment
Technical:
1. Technology choices
2. Navigation complexity
3. Extraction complexity
4. Data source protection such as antibot measures, rate limiting, captchas, and JS checks (a basic rate-limit handling sketch follows this list)
5. Pre/post processing needed on the collected data (data engineering challenge)
6. Interoperability of systems to integrate with
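On the technical side, data source protection (issue 4) is often the first thing you hit in practice. As a simple illustration, here is a hedged sketch of a fetch helper that pauses between requests and backs off when the site responds with HTTP 429; the user agent string, delays, and retry count are assumptions, and real sites may require far more sophisticated handling.

```python
# Sketch: polite fetching with a fixed delay between requests and
# exponential backoff on HTTP 429 responses. The user agent, delays,
# and retry count are illustrative assumptions.
import time

import requests


def polite_get(url, max_retries=3, delay_seconds=2.0):
    """Fetch a URL while respecting basic rate limits."""
    for attempt in range(max_retries):
        response = requests.get(
            url,
            headers={"User-Agent": "example-data-project/0.1"},
            timeout=30,
        )
        if response.status_code == 429:
            # Rate limited: back off exponentially and try again.
            time.sleep(delay_seconds * (2 ** attempt))
            continue
        response.raise_for_status()
        # Pause between successful requests to stay polite.
        time.sleep(delay_seconds)
        return response
    raise RuntimeError(f"Gave up fetching {url} after {max_retries} attempts")
```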
What questions should you ask?
Here are some questions you will need to answer at some point to ensure a smooth and successful setup.
- What is the business impact? Cut costs? Increase revenue? Increase market share?
- What is the issue with our current process / setup to accomplish the goal, if any?
- What does success look like? What metrics do we need to track to confirm whether the data helps us achieve the business impact?
- What is the timeline for the project? Why do this now?
- If we need to start small, what would the first iteration need to look like?
- Who is the business owner of the project? Who will sign the invoice if we need to purchase any external product or service for the project?
- Who is the technical owner of the project? Who will sign off on the technical specifications?
- Who will be using the data? Human? Programmatic? At what stage?
- What mechanism do we need to put in place to handle the data in a compliant manner?
- If we need to make trade-offs between cost, speed, and quality, which one should we prioritise? Which one can we not compromise on?
- Which part do we have the capacity to handle in house? Which part do we need an external partner for?
- What is the budget for this project?
- Which websites and web data sources contain the data we need?
- Which fields do we need to obtain? From which sections and pages?
- How often do we need the data to be collected?
- Do we need an incremental crawl or a full snapshot each time? (A small incremental-crawl sketch follows this list.)
- Do we need a list of inputs for the crawlers to use?
- Which programming language will we use to build the crawlers?
- Where should we host the crawlers?
- Who will maintain the solution moving forward?
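To illustrate the incremental-versus-full-snapshot question above, here is a small sketch of one way an incremental crawl could remember which record IDs it has already collected and only keep new ones. The state file name and the id field are illustrative assumptions; a full snapshot run would simply skip the filtering and re-collect everything.

```python
# Sketch: incremental crawling by remembering which record IDs have
# already been collected, so each run only keeps new items.
# The state file name and the "id" field are illustrative assumptions.
import json
from pathlib import Path

STATE_FILE = Path("seen_ids.json")


def load_seen_ids():
    """Load the IDs collected by previous runs, if any."""
    if STATE_FILE.exists():
        return set(json.loads(STATE_FILE.read_text()))
    return set()


def keep_new(records, seen_ids):
    """Return only the records that have not been collected before."""
    return [record for record in records if record["id"] not in seen_ids]


def save_seen_ids(seen_ids, new_records):
    """Persist the updated set of IDs for the next run."""
    seen_ids.update(record["id"] for record in new_records)
    STATE_FILE.write_text(json.dumps(sorted(seen_ids)))


# Example run: only new_records would be passed on for processing.
records = [{"id": "a1", "name": "Widget"}, {"id": "b2", "name": "Gadget"}]
seen_ids = load_seen_ids()
new_records = keep_new(records, seen_ids)
save_seen_ids(seen_ids, new_records)
```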
Let me know if you can think of any others based on your experience.
Next week we’ll look at another challenge in web data acquisition projects: banning.