A Day in the Life of a Data Scientist
Wearing many hats in a rapidly growing startup
February 22, 2023
Building the foundations
As the first data scientist at Orbital Witness, I spent my early days here doing things you probably wouldn’t expect of a traditional data science role.
Back in 2019, when I was fresh out of my Ph.D. programme, Orbital Witness sponsored me to work with them on an eight-week data science training programme called the Faculty Fellowship¹. During the placement, I used OCR and Natural Language Processing (NLP) models to extract obligations from scanned images of leases. Here I am presenting my project at the Faculty Fellowship Demo Day:
When Orbital Witness subsequently offered me a job helping to build out the automation of legal risk discovery in property transactions, I jumped at the chance. It’s the kind of domain that poses challenges any data scientist would relish: problems so hard that they’re only just possible with modern machine learning techniques, yet with immediate, tangible societal benefits. Right in the sweet spot of being difficult, interesting and worthwhile.
And that was how my journey as Orbital Witness’s first data scientist began, and what a journey it has been…
Joining a company of fewer than ten people, it’s not surprising that my first year was hugely varied: I was part-time data scientist, part-time project manager and part-time engineer. As well as tasks typical of an early data science hire - setting up infrastructure, building data pipelines and exploring use-cases for machine learning across the business - I spent considerable time building features for our data platform in React/TypeScript, a job that data scientists normally wouldn’t go anywhere near! All of this was a fantastic experience, but before my pandas dataframe withdrawal symptoms got too bad, we brought in some excellent engineers so I could transition back to data science full-time.
Automating the boring stuff
We’ve come a long way in the past couple of years, and all that foundational work has paid off. Today, we’ve built a data science platform that automates and streamlines the whole process from data labelling and quality control, to model building, training, validation, deployment, monitoring and refinement. This all allows us to get feedback from domain experts on model performance as quickly as possible.
What does that look like? Well, in addition to our custom-built data platform and labelling tools, we use PyTorch to build our machine learning models, Apache Airflow to orchestrate our training and inference pipelines, and MLflow for experiment tracking and model lifecycle management. In line with engineering best practices, we also use Jira for product requirements, GitHub for version control and code reviews, and GitHub Actions for continuous integration—so when a change to a model is approved and merged, it’s automatically trained, loaded into our model registry and deployed to production.
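The train-on-merge flow described above could be sketched as a GitHub Actions workflow. This is a hypothetical configuration - the job layout and script names (`train.py`, `register_model.py`, `deploy.py`) are illustrative inventions, not our actual setup:

```yaml
# Hypothetical sketch of a train-on-merge pipeline.
name: train-and-deploy
on:
  push:
    branches: [main]
jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: python train.py            # train the model, log run and metrics to MLflow
      - run: python register_model.py   # promote the new version in the model registry
      - run: python deploy.py           # roll the registered model out to production
```

The key design point from the paragraph above is that nothing after code review is manual: merging an approved change is the trigger for training, registration and deployment.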
All that automation means that today, when we bring a new data scientist into the team, their day-to-day looks very different to what I described above. We’re all still generalists who wear many hats² - what my colleague Matt, our Head of Data Science, calls “Full-Stack Data Scientists” - but more time is spent on pure data science, and less on building out infrastructure.
A typical day starts with our 20-minute daily stand-up, where our cross-functional automation team (two data scientists, three legal domain experts, a tech lead/full-stack engineer and a product manager) discusses our priorities on the Kanban board. As well as raising blockers and making sure we stay aligned while moving quickly, we frequently demo new things we’ve built and discuss potential solutions to problems - which sometimes spills over into healthy debate!
Creative problem solving
After stand-up, I’ll start work on whatever the top priority is for that day. The current priority for our team is automatically answering around 40 different legal questions from HM Land Registry documents - not trivial given how complex and varied these documents can be!
As a small team, we are responsible for collecting the labelled data that allows us to solve a problem, as well as figuring out which models, tools or techniques we should use to automate it. In practice, this starts with a review of the latest research, then using Jupyter Notebooks to pull data from the relevant company databases and do some exploratory data analysis. At this point, I might need to feed information back to the labelling team if I notice any quality or consistency issues that can be fixed at source. The next step is to sketch out how to build a model and train it on our data - for example, taking a suitable language model from Hugging Face Transformers and building it into a training pipeline with PyTorch Lightning. Then I use MLflow to kick off training in our Kubernetes cluster on Google Cloud Platform. Once the model is trained - which can take hours, or sometimes run overnight - I review the metrics to see how it performs against various baselines and assess whether we’ve moved the needle in the right direction.

The culture in the team is definitely one of collaboration and support. I often meet with the Head of Data Science to discuss what we’re each working on and to help each other when we get stuck. Everyone in the automation team reviews each other’s code, so we are constantly learning about the details of the problems people are working on across the product. We might even do some pair programming if there’s a particularly tough problem, or if we’re bringing someone new onto the team and want to help them get up to speed.
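The “review the metrics against various baselines” step can be illustrated with a toy example. This is a deliberately simplified sketch - the labels, predictions and accuracy helper are invented for illustration, and our real evaluation runs through MLflow - but it shows the core idea: a model only moves the needle if it beats a trivial baseline such as always predicting the most common answer.

```python
from collections import Counter

def accuracy(preds, labels):
    """Fraction of predictions that match the gold labels."""
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

# Toy gold answers for one yes/no legal question (illustrative only).
labels = ["yes", "no", "no", "yes", "no", "no"]
model_preds = ["yes", "no", "yes", "yes", "no", "no"]

# Majority-class baseline: always predict the most common gold label.
majority = Counter(labels).most_common(1)[0][0]
baseline_preds = [majority] * len(labels)

print(f"model:    {accuracy(model_preds, labels):.3f}")     # 0.833
print(f"baseline: {accuracy(baseline_preds, labels):.3f}")  # 0.667
```

With heavily imbalanced answers - common in legal questions where most documents are unremarkable - a majority baseline can score deceptively well, which is why raw accuracy alone is rarely enough to call an experiment a success.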
Autonomy and trust
We’re flexible about remote working, but I typically work in the office at least a couple of days a week because it’s a nice atmosphere. We have a pretty flat hierarchy and no silos between teams, so you’ll often find engineers having coffee or lunch with our legal experts or the sales team. We also have plenty of after-work socials, some very casual and some more organised—the latter especially after monthly all-hands meetings or our yearly offsite getaways.

The company is good about professional development too—everyone has their own annual development budget, and we’re trusted to spend it on whatever we think is most useful for upskilling. We also have 20% time, so if you think some longer-term R&D work would make everyone’s lives easier or have a big impact for our customers, you can advocate for it and start working on it immediately. In data science particularly, there’s never a shortage of ideas for fun things to work on!
Bringing people on board
I’m excited about this next phase of growth at Orbital Witness, and it will be great to grow the data science team. It’s a fascinating and varied job that combines cutting-edge NLP and Computer Vision with a complex and interesting legal real estate domain. Real estate is the largest asset class in the world, and it touches everyone’s lives every day - from the homes we live in, and the offices we work in, to the land our trains travel over and under - so the law that governs property is central to a functioning society. It is also not an area that has yet been overhauled by technology, so what we’re doing feels genuinely new and ground-breaking. If you want to be part of the data science nervous system of a company that will shape the future of a whole industry, I can’t think of a better place to work.
If the above challenges sound interesting and you’re curious to know more about how we currently solve them, please see our open roles and get in touch via our Careers Page. If you’re an ambitious data scientist and there isn’t currently a role posted there, feel free to connect with Andrew Thompson, our CTO, directly on LinkedIn - he’d be happy to have a casual chat via video call or over coffee ☕️, as we’re always looking for great people.
1. The Faculty Fellowship is an experience I’d highly recommend to anyone who wants to move from academia into a data science job—if you’re interested, you can apply here. ↩︎
2. Fun fact: the hero image for this blog post was made by typing the prompt “3D render of a friendly metal robot wearing many hats stacked on top of each other in a light blue room” into DALL-E 2. ↩︎