Full-Stack Data Science at Orbital Witness
Data science requires a diverse skill set. That's a huge challenge, but if you develop that breadth you'll be enormously effective, especially at early-stage startups with big ambitions
November 10, 2021
Data science roles
The role of a data scientist is famously broad, and in my experience there’s a distinct, and sometimes amusing, lack of consensus in the industry about the role’s responsibilities. It might include any of the following:
1. Ad hoc data analysis (using tools like SQL, Tableau, R)
2. Experiment design and product analytics (such as A/B testing)
3. Machine learning or statistical modelling (e.g. regression, SVM, XGBoost)
4. Deep learning (often computer vision or NLP using TensorFlow or PyTorch)
5. Conducting new fundamental research (or reproducing published papers)
6. Model lifecycle management (deploying production-grade APIs e.g. on Kubernetes)
7. Data collection, transformation, and warehousing (e.g. using Spark, Airflow)
8. Software engineering (building tools, integrating with existing products or APIs)
Also, depending on the selection above, the role might sit in very different parts of the business, each with its own expectations and working culture: you might be in the analytics team playing a supporting role to decision makers, or you might be in the engineering organisation building the foundational technology behind the company’s product.
Naturally this leads to problems in practice:
- Candidates might struggle to meet the hiring bar (or even know what it is and how to prepare) across such a diverse set of criteria, especially if it doesn’t match their training. It’s extremely challenging to be an expert across the board.
- New joiners might be quickly dissatisfied if they expect to be doing modelling (tasks 3 or 4 above), but find themselves doing any of the other tasks.
Specialisation vs generalisation
One solution is to split up the role and assign the tasks to new or existing specialisms:
- Data Analysts can take tasks 1 and 2.
- Data Scientists can focus on 3 and 4.
- Research Scientists can publish, while Research Engineers reproduce papers, for 5.
- Machine Learning Engineers specialise in 6.
- Data Engineers own 7.
- Software Engineers naturally take 8.
This idea makes a lot of intuitive sense — after all, isn’t the division of labour good for productivity? — and in practice I think the industry is moving in this direction. However, in 2019 Eric Colson wrote an outstanding article which argues against specialisation for the data science profession in particular and in favour of ‘full-stack’ generalists capable of taking a problem “from conception to modelling to implementation to measurement”.
He argues specialisation increases coordination costs (time spent discussing, justifying, and prioritising work), exacerbates wait-time (while the specialists get up-to-speed and finish their other work), and narrows context (e.g. the data scientist focuses on tuning their algorithm, missing low-hanging fruit from other domains like data collection).
Eric was previously a VP at Netflix, so you might expect his perspective to be tilted towards larger enterprises.¹ But in my view,² his argument is all the more convincing when applied to early-stage venture-backed startups, and especially those where machine learning is central to the mission, for several reasons.
First, if you are building a new product from scratch, speed (meaning time-to-market) is everything. You need to get feedback on your concept (and ideally on the quality of your model’s real-world predictions) as soon as possible, so decreasing coordination costs is extremely valuable.
Second, small companies are less likely to have dedicated Machine Learning Engineers, Data Engineers, or DevOps Engineers, so there may simply be no option but for the data scientists themselves to build the tooling required to get data science off the ground. The ‘wait-time’ above would be infinite, because there are no specialists to wait for.
Third, if the company’s mission is based around machine learning, the entire end-to-end process needs to be designed around the desired outcome. For instance, the lowest hanging fruit is very likely to be in data collection — particularly if you have humans in the loop performing labelling — and the best people to advise on the data collection process are those also responsible for its use in model training: the data scientists!
A case study
Orbital Witness is precisely the kind of company that benefits from generalist data scientists. We’re a seed-stage company looking to automate due diligence for the world’s largest asset class — real estate — using natural language processing. For the last year, we’ve been a two-person data science team. A selection of our projects will give you a sense for the breadth we’ve encountered:
- Mapping the property due diligence domain into ML tasks.
- Building a React-based labelling app for human labellers to perform the tasks.
- Training PyTorch-based models for those tasks (e.g. fine-tuned transformer-based language models; a sketch follows this list).
- Creating labelling guidelines to ensure the task is performed consistently.
- Building a job system to assign relabelling work in case it isn’t performed consistently.
- Automating pipelines (using Airflow) for data processing, model training and evaluation (a second sketch follows this list).
- Designing dashboards (using Metabase) to track inter-labeller agreement and to share model performance metrics.
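To make the model-training bullet a little more concrete, here is a minimal sketch of what fine-tuning a transformer-based classifier in PyTorch can look like. The model name, label count, and toy examples are illustrative assumptions, not our actual setup.

```python
# Illustrative sketch only: the model, labels, and data below are placeholders.
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "bert-base-uncased"  # a generic pretrained encoder, chosen for illustration

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=3)

# Toy labelled examples standing in for clauses pulled from property documents.
texts = ["The tenant shall not assign the lease.", "Rent is payable quarterly in advance."]
labels = torch.tensor([0, 1])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):  # a real run would iterate over a DataLoader of labelled batches
    optimizer.zero_grad()
    outputs = model(**batch, labels=labels)  # returns loss when labels are provided
    outputs.loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss={outputs.loss.item():.4f}")
```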
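And for the pipeline-automation bullet, a minimal Airflow DAG sketch of the general shape such automation might take; the DAG id, task names, schedule, and callables are assumptions for illustration rather than our real pipeline.

```python
# Illustrative Airflow DAG sketch: names and schedule are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_labels():
    # Placeholder: pull fresh labels from the labelling app's database.
    print("extracting labels")


def train_model():
    # Placeholder: retrain and evaluate the model on the updated dataset.
    print("training model")


with DAG(
    dag_id="labelling_to_training",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_labels", python_callable=extract_labels)
    train = PythonOperator(task_id="train_model", python_callable=train_model)
    extract >> train  # run training only after the latest labels have been extracted
```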
For part of the year, our entire product team comprised only the two of us in data science. I’m sure you’ll agree that’s a diverse set of projects, and we wouldn’t have been able to make nearly as much progress in such a short time if we hadn’t been willing to go end-to-end and learn new skills on the job.
Unicorn data scientists
Okay, we understand the value a ‘full-stack’ data scientist can bring to a startup, but isn’t it expecting too much for candidates to have expertise across the board? Aren’t they unicorns, impossible to find? In my view, expertise in every aspect of the job isn’t needed. Instead, think about being ‘T-shaped’. If you’re a data scientist adding a feature to a React-based labelling app used internally, for example, you don’t need to understand how React works under the hood; you just need the ability (and the motivation) to learn enough, quickly, to get the job done in a new domain. Eric Colson agrees:
Finally, the full-stack data science model relies on the assumption of great people. They are not unicorns; they can be found as well as made. But they are in high demand and it will require competitive compensation, strong company values, and interesting work to attract and retain them. Be sure your company culture can support this.
We agree, and if you find the idea of solving a very ambitious machine learning problem end-to-end exciting, please reach out: we’re hiring!
1. Indeed, he thinks this strategy requires a solid ‘data platform’ maintained by specialists, which probably doesn’t exist in your average startup. ↩︎
2. Admittedly, as something of a generalist myself (former software engineer, poker player, and philosopher, among other things…), perhaps I have a vested interest in this argument. ↩︎