Maciej Kępa | Beyond the Notebook

What MLOps actually is and why teams keep misdefining it

Wed, 20 May 2026 00:00:00 GMT

MLOps is one of those terms that became popular faster than it became precise. Depending on who is speaking, it can mean model deployment, ML platform work, pipeline automation, or simply "the stuff that has to happen after the notebook".

That is exactly why teams keep talking past each other. They use the same word, but describe different parts of the same problem.

![Hand-drawn lifecycle sketch showing data, model, deployment and monitoring](/blog-covers/what-mlops-actually-is.png)

## Why the term gets blurry so quickly

There is a simple reason for the confusion: machine learning projects do not end with training, but training is still the part most people focus on.

When the model starts producing promising results, a new set of questions appears:

- how do we reproduce the training process
- how do we version data, code and model artifacts together
- how do we deploy changes safely
- how do we know the model is still useful after release
- who owns the system when something starts drifting or failing

Those questions are operational, not experimental. And they are exactly where MLOps starts to matter.

## The most common definition is too narrow

The popular short version usually sounds like this:

MLOps is about deploying and maintaining machine learning models in production.

That definition is not entirely wrong. It is just incomplete.

If we stop there, MLOps becomes a deployment label. It suggests that the real work happens during modeling, and then some separate operational function takes care of "production". In practice, that framing is too small for what teams actually need to manage.

## A more useful definition

The better way to describe MLOps is this:

MLOps is the discipline of building and maintaining a machine learning project across its full operational lifecycle.

That wording matters. It shifts the focus away from one isolated step such as deployment.
It also makes room for the parts that repeatedly decide whether the project is maintainable:

- reproducibility
- continuity
- collaboration
- evaluation

These are not optional refinements. They are the conditions under which ML work stops being a promising experiment and starts becoming an operable system.

If a team can train a good model but cannot reliably reproduce, deploy, observe and update it, then the problem is not “missing tooling”. The problem is that the project is still incomplete.

## Why MLOps is not just DevOps with a new label

This is the second source of confusion.

If we treat a model as just another software artifact, it is fair to ask whether MLOps is only DevOps with extra branding. The question is valid, because a large part of MLOps really does rely on familiar engineering foundations:

- version control
- CI/CD
- environment consistency
- artifact management
- observability
- access control
- rollback paths

So yes, a lot of MLOps looks like software engineering, because it is software engineering.

But machine learning adds a layer that classic DevOps does not have to carry in the same way:

- data quality directly affects system behavior
- model performance can degrade without code changes
- experimentation is a first-class part of delivery
- evaluation is tied not only to code correctness, but also to statistical behavior
- retraining and release decisions may depend on new data rather than new code

That is why MLOps is not a fake discipline. It is an extension of good engineering practices into a system where data, models and operational feedback loops all matter at the same time.

## The lifecycle is the real center of gravity

The cleanest way to understand MLOps is to stop looking at a single deployment step and instead look at the whole lifecycle.

In practice, most ML projects move through four broad phases:

1. **Scoping**: understanding the use case, constraints, risks and data availability.
2. **Data preparation**: building the path from source data to training-ready inputs.
3. **Model creation**: running experiments, comparing approaches and selecting a candidate.
4. **Deployment and operation**: serving, monitoring, updating and maintaining the solution.

The important part is not just the existence of these phases. It is the fact that they form a cycle.

Data changes. Requirements shift. Model behavior drifts. Infrastructure evolves. Teams rotate. Good ML systems are not delivered once. They are revisited, corrected, retrained and upgraded over time.

That is why MLOps should be understood as lifecycle management, not as a final deployment activity.

## The four principles that make the term useful

If the word MLOps is going to stay, it should point to something concrete. For me, the most useful breakdown is still a small set of principles rather than a shopping list of tools.

### Reproducibility

Every important part of the system should be reproducible: code, data inputs, experiment context, model artifacts and release decisions.

Without that, teams lose the ability to debug, compare, recover and improve the system with confidence.

### Continuity

ML systems live in moving environments. Data sources change, libraries age, business expectations evolve, and models need refresh cycles.

MLOps should support that ongoing movement instead of pretending that the first deployment is the final state.

### Collaboration

Machine learning projects are usually multidisciplinary. Data scientists, ML engineers, data engineers, software engineers and domain experts do not work from the same assumptions by default.

MLOps should reduce friction between them by making workflows, ownership and artifacts easier to understand and share.

### Evaluation

Everything should be evaluated, not only the model.
That includes:

- data quality
- training outcomes
- release readiness
- live behavior after deployment
- operational health of the surrounding platform

This is where many teams fail. They evaluate the model once, but do not build a habit of evaluating the system continuously.

## Why automation is not the first question

Automation is often treated as the defining signal of MLOps maturity. I think that is backwards.

Automation matters, but only after the workflow is standardized enough to deserve it.

If a small team is still exploring the use case, rapidly changing data assumptions and iterating on problem framing, building full automation around every step can be premature. At that stage, the higher-value work is often:

- agreeing on naming and versioning conventions
- keeping experiments traceable
- defining what counts as a release candidate
- documenting the handoff between model work and production work

Only then does automation start compounding instead of amplifying chaos.

Automating a weak process does not create maturity. It only makes the weak process run faster and fail more consistently.

## What teams usually get wrong

The recurring mistake is not that teams ignore MLOps entirely. It is that they reduce it to one visible layer.

They might equate it with:

- an orchestrator
- a model registry
- one cloud service
- CI/CD for notebooks
- deployment scripts for models

All of those can be part of the picture. None of them are the whole picture.

Once the term is reduced to tooling, teams start asking the wrong questions:

- which platform should we buy
- which pipeline framework should we adopt
- which service makes us “do MLOps”

The better questions are much more operational:

- what exactly has to be reproducible in this project
- where do data and model ownership meet
- how do we evaluate change before and after release
- what part of the lifecycle is currently the least controlled

That is where the real MLOps work begins.
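To make the first of those questions a little more concrete ("what exactly has to be reproducible in this project"), here is a minimal sketch of a run manifest: a small record of what went into a training run, with an identifier derived only from those inputs. It is an illustration, not a recommended tool. The names `fingerprint`, `build_run_manifest` and `same_inputs`, the manifest fields and the example values are all hypothetical, and a real project would probably also record environment and dependency versions.

```python
import hashlib
import json


def fingerprint(payload: bytes) -> str:
    """Return a short, stable content hash for a blob of bytes."""
    return hashlib.sha256(payload).hexdigest()[:16]


def build_run_manifest(code_version: str, data: bytes, params: dict) -> dict:
    """Collect the inputs a training run depends on (hypothetical schema)."""
    # Serialise params with sorted keys so key order does not change the hash.
    params_blob = json.dumps(params, sort_keys=True).encode("utf-8")
    manifest = {
        "code_version": code_version,
        "data_fingerprint": fingerprint(data),
        "params_fingerprint": fingerprint(params_blob),
        "params": params,
    }
    # The run id is derived from the inputs only: same inputs, same id.
    manifest_blob = json.dumps(manifest, sort_keys=True).encode("utf-8")
    manifest["run_id"] = fingerprint(manifest_blob)
    return manifest


def same_inputs(a: dict, b: dict) -> bool:
    """Two runs are comparable only if they were built from the same inputs."""
    return a["run_id"] == b["run_id"]


if __name__ == "__main__":
    data = b"feature_a,label\n1.0,0\n2.5,1\n"
    first = build_run_manifest("abc1234", data, {"lr": 0.1, "depth": 3})
    # Same parameters in a different order: still the same run inputs.
    second = build_run_manifest("abc1234", data, {"depth": 3, "lr": 0.1})
    # One extra data row: a different run, even though the code is unchanged.
    changed = build_run_manifest("abc1234", data + b"3.1,1\n", {"lr": 0.1, "depth": 3})
    print(same_inputs(first, second))   # True
    print(same_inputs(first, changed))  # False
```

The point of the sketch is the last comparison: the code version did not change, but the data did, so the run is treated as a different one. That is the kind of difference a code-only versioning habit tends to miss.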
## FAQ

### Is MLOps only about model deployment?

No. Deployment is one part of it, but the term becomes misleading if we collapse everything into deployment. MLOps is broader because it deals with the lifecycle around data, model evolution, release operations and post-deployment evaluation.

### Is MLOps just DevOps for machine learning?

Partly, but not fully. It inherits a lot from DevOps, because ML systems still need engineering discipline. The difference is that data behavior, experimentation and model drift add extra operational problems that classic software systems do not face in the same form.

### When does a team actually need MLOps?

Earlier than most teams assume. The moment you expect an ML workflow to be repeatable, collaborative, versioned and maintainable beyond one person or one demo, you already need MLOps thinking, even if the implementation is still lightweight.

## Final point

MLOps is not useful because it sounds modern. It is useful because machine learning systems create operational problems that do not disappear after the first successful experiment.

If we use the term, we should use it precisely.

MLOps is not a synonym for deployment. It is not a cloud product category. It is not a magical layer that appears after data science work is finished.

It is the discipline that keeps the whole ML project coherent once the work needs to be reproducible, collaborative, maintainable and real.

## Further reading

- [Beyond the Notebook: Moving ML to Production](/blog/beyond-the-notebook-moving-ml-to-production/)
- [MLOps topic archive](/blog/topic/mlops/)
- [The confusion about MLOps](https://www.datumo.io/blog/the-confusion-about-mlops)

Beyond the Notebook: what has to exist before ML can run in Production

Fri, 15 May 2026 00:00:00 GMT

Most machine learning stories look impressive right until the notebook ends.

The notebook is where we prove that a model can work. Production is where we prove that a system can keep working. These are related goals, but they are not the same goal, and many teams get into trouble when they treat them as if they were.

This is also the point where MLOps stops sounding abstract and starts becoming a practical delivery problem.

![Notebook sketch transitioning into a production ML system](/blog-covers/beyond-the-notebook.png)

## Why the notebook is not the final milestone

There is a reason notebooks became the default environment for machine learning work. They are fast, flexible and very good at exploration. You can inspect the data, test ideas, compare experiments and keep momentum without much ceremony.

The problem is that a notebook is optimized for discovery, not for repeatable operation. It usually relies on hidden state, manual order of execution, local dependencies and assumptions that live only in the author's head.

That is acceptable during exploration. It becomes risky the moment we want another person, another environment or another release cycle to rely on the same work.

The model is usually not the fragile part. The fragile part is everything around it: data preparation, packaging, deployment logic, permissions, monitoring and ownership.

## What really starts after the first successful demo

Once the notebook gives a promising result, the question changes. We are no longer asking whether the model can learn. We are asking whether the whole solution can be operated in a predictable way.

That is where the real engineering work begins:

1. We need a repeatable path from source data to training data.
2. We need a reliable method for packaging code, dependencies and runtime assumptions.
3. We need a serving interface that can be versioned, observed and rolled back.
4. We need a clear ownership model for failures, retraining and release decisions.
5. We need a way to evaluate the system after deployment, not just before it.

Without these elements, a working notebook is still only a local success.

If you want a cleaner definition of that operational layer, see [What MLOps Actually Is and Why Teams Keep Misdefining It](/blog/what-mlops-actually-is-and-why-teams-keep-misdefining-it/).

## The confusion usually comes from the word "working"

When someone says "the model works", they often mean one of two different things.

The first meaning is experimental: the model achieved acceptable results in development conditions.

The second meaning is operational: the solution can run under real constraints with clear inputs, versioning, monitoring and support.

These two meanings are easy to mix up, especially when the team is moving quickly. That confusion is one of the main reasons why many ML initiatives look close to production for a long time, but never really become production systems.

## A more useful readiness check

Instead of asking whether the model is finished, it is better to ask whether the system is ready for repeated use.

```yaml
production_readiness:
  source_to_feature_path: reproducible
  training_process: documented
  inference_interface: versioned
  environment_build: repeatable
  monitoring:
    latency: enabled
    failures: enabled
    data_quality: enabled
  ownership:
    incident_response: clear
    rollback_path: clear
```

This kind of checklist is less exciting than a benchmark chart, but it is much closer to the real delivery problem.

The earlier you standardize the path from experimentation to deployment, the earlier you pay the engineering cost. The later you do it, the more notebook assumptions become part of the architecture by accident.

## The last mile is mostly systems work

This is the part that often gets underestimated.

The final gap between a notebook and a usable service is rarely just one deployment step.
It is a chain of decisions about interfaces, environments, automation, observability and collaboration between roles.

The questions are usually familiar:

- where does feature logic live
- how do we keep training and inference consistent
- what has to be versioned
- how do we detect silent degradation
- who is responsible when the system starts behaving differently next month

These are not glamorous questions, but they are the ones that determine whether the solution survives outside a demo.

## What teams most often get wrong

The most common failure modes are rarely exotic. They are usually very ordinary:

- feature logic differs between training and serving
- deployment steps depend on undocumented manual actions
- model quality is evaluated once and then assumed to stay stable
- data freshness degrades without anyone noticing
- one person becomes the only real operator of the pipeline

This is why the transition beyond the notebook should be treated as a software and platform engineering problem, not as a final polish step for data science work.

## FAQ

### Is a good notebook enough to call an ML project production-ready?

No. A good notebook proves that an approach can work under development conditions. Production readiness requires repeatable data flows, controlled deployment, monitoring, ownership and a clear rollback path.

### What usually breaks first after a successful demo?

Usually not the model itself. The first problems tend to appear in data preparation, environment consistency, deployment steps, permissions and the lack of operational ownership.

### When should teams start thinking about MLOps?

Earlier than they usually do. The right time is not after the first deployment crisis. It is when the team starts expecting the same workflow to run more than once, in more than one environment, with more than one person involved.

## Final point

The notebook should absolutely exist. It is one of the best tools we have for understanding data and iterating on models.
The mistake is not using notebooks. The mistake is expecting them to carry responsibilities they were never designed to carry.

If a team wants machine learning to be reliable, maintainable and useful in production, the real project begins exactly where the successful notebook ends.

## Further reading

- [What MLOps Actually Is and Why Teams Keep Misdefining It](/blog/what-mlops-actually-is-and-why-teams-keep-misdefining-it/)
- [MLOps topic archive](/blog/topic/mlops/)
- [The confusion about MLOps](https://www.datumo.io/blog/the-confusion-about-mlops)