A blog series for solving concrete issues in the sprawling world of data-driven value creation.
If you’re reading this, odds are you have lots of data to deal with, and great ideas of how various dashboards, machine learning models and microservices will revolutionise your department / company / market — and that’s exactly the way it should be.
So, let’s assume you get the data piped to where it needs to go, nicely cleaned up and modelled, the dashboards and ML models developed, and it all goes live with minimal budget overruns and delays. Congratulations, you’ve succeeded, and probably things will run smoothly for a few weeks or even months. 🥳
Just kidding. Going into production merely means you’ve finished the first leg of a much longer journey (though the good news is that it was the hardest one: creating something from nothing). Now, you get a new set of challenges:
- The input data for your application keeps changing, both structurally and semantically.
- The components of your architecture keep becoming outdated, possibly in a security-relevant manner.
- Your users keep finding things that are not working, yet those same users will complain loudly if your application is not available practically all the time.
- You will have bugs that hit production which may cause wrong data to be emitted and/or persisted to your data layer.
- Depending on your regulatory context, someone might ask you to justify past output of your application in excruciating detail.
- Various stakeholders will resist necessary changes, but demand unnecessary ones.
- You will be hit by Hyrum’s law both as a consumer and as a producer: with enough users, every observable behaviour of an interface ends up being depended on by someone.
- And so on and so forth.
You see, we’re all dancing on a volcano of barely-contained complexity, and the b̶i̶g̶g̶e̶s̶t̶ ̶w̶r̶e̶n̶c̶h̶e̶s̶ ̶i̶n̶ ̶t̶h̶e̶ ̶w̶o̶r̶k̶s magma chambers are time and the (very human) need for change. That’s why the development and deployment of your project is only a snapshot, and will deteriorate quickly if it is not ready to keep moving with the times in a controlled manner.
The good news is that pretty much everyone else in this ecosystem is in the same boat as you are, and a lot of people have already gone through what you’re about to go through (maybe including your past self?), and are building a growing arsenal of tools and practices to make this dance more enjoyable, or at least less explosive.
Perhaps you won’t be surprised that this is a continuous process as well. The cycle goes roughly as follows:
- We build things that are so complex that it becomes impossible for humans to anticipate all possible consequences of a change.
- Inevitably, this leads to mistakes sneaking through all defences into production, where fixing them is very stressful.
- After somewhere between 1 and N similar failures, people start implementing automated checks to catch those mistakes before they can do damage.
- Due to the huge variety of ways to fail, there’s now a large toolbox of concepts, frameworks and methods to counteract those failures.
- In addition to catching already diagnosed failure modes, leveraging these tools makes previously impossible problems solvable.
- Your now-enhanced productivity allows you to create new complexity even faster. 🥳
So really, that cycle is actually a spiral, and while it’s a constant arms race between productivity and complexity, the former can keep rising as long as you can keep the latter manageable (=more or less constant).
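To make the "automated checks" step of the cycle a bit more tangible, here is a minimal sketch of a check that catches structural changes in input data before they reach production. The schema, the field names and the `find_schema_violations` helper are all made up for illustration; in a real project you would more likely reach for a dedicated validation library.

```python
from __future__ import annotations

# Hypothetical expected schema (an assumption for this sketch): field name -> type.
EXPECTED_SCHEMA = {"customer_id": int, "amount": float, "currency": str}


def find_schema_violations(record: dict, schema: dict = EXPECTED_SCHEMA) -> list[str]:
    """Return human-readable descriptions of how `record` deviates from `schema`."""
    violations = []
    # Check every expected field for presence and type.
    for field, expected_type in schema.items():
        if field not in record:
            violations.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            actual = type(record[field]).__name__
            violations.append(
                f"wrong type for {field}: expected {expected_type.__name__}, got {actual}"
            )
    # Report fields the schema does not know about (sorted for stable output).
    for field in sorted(record.keys() - schema.keys()):
        violations.append(f"unexpected field: {field}")
    return violations


if __name__ == "__main__":
    # A record whose structure drifted: id became a string, currency vanished.
    drifted = {"customer_id": "1", "amount": 9.5, "extra": 0}
    for problem in find_schema_violations(drifted):
        print(problem)
```

Run as part of a test suite or at the start of a pipeline, a check like this turns a silent structural change into a loud, early failure. It does nothing about semantic changes, which need checks of their own.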
Scope & Goals
Let’s be honest: your data products probably don’t have the availability requirements of Google, the security concerns of an ATM, or life-or-death consequences if things go wrong. That means you don’t need to over-engineer your solution to implement every best-in-class DevOps approach. The key, as always, is to find the right trade-off between effort and impact.
We’re also not trying to reinvent overviews like Martin Fowler’s CD4ML article or various MLOps frameworks. The goal for this series is first and foremost to discuss tools and approaches that solve concrete problems in our ecosystem and which are — hopefully — worth the read for you, to check out whether they apply to your situation too.
In the best-case scenario, some in-the-trenches experience will help you plan your own approach better. By considering, from the very beginning, the foreseeable problems that arise from operating and evolving a data solution, you will most likely save yourself a lot of hassle. This is especially true compared to going live and then adding the required infrastructure after the fact, or worse, continuously hotfixing things in an architecture too fragile to survive long-term in the sea of change.
Some examples of such questions that we encounter all the time in our line of work:
- How to make the history of the data auditable and reproducible
- How to make the data processing traceable and reproducible
- How to model your data so you don’t have to constantly redesign
- How to version your ML models and artefacts
- How to track KPIs over time, analyse them, and use them to make decisions
- How to design, version, and evolve interfaces
- How to encapsulate, isolate and scale your models & services (environments, containers, clusters, etc.)
- How to orchestrate everything in a way that is reliable, introspectable & flexible
We look forward to sharing more of our experiences and expertise with you on all these questions. As a first entry, go have a look at this article introducing a blueprint of a simple MLOps pipeline, complete with a public repo you can just clone and run, as well as an easy-to-digest slide deck.