A Digital Transformation pattern found in the architecture centers of cloud vendors and many ML-related products is the application of CMMI-style maturity levels to ML Ops. These ML Ops maturity levels promise to add the processes needed to reduce friction in the development of model-based applications. But do they work? In this blog, we look at several ML Ops maturity models to examine whether they provide true process improvement in the spirit of the CMMI model.
What is ML Ops?
ML Ops is the set of practices and tools used to develop AI/ML models, typically covering the coding, testing, training, validation, and retraining of models. Model development is performed by Data Scientists, who usually work within an individual line of business and focus on specific projects. ML Ops helps Data Scientists rapidly experiment with and deploy ML models for the purpose of Data Science.
A major paradigm in ML Ops is CI/CD/CT. CI is Continuous Integration: the automated building and testing of code as it is checked in to a shared repository. CD is Continuous Delivery: the automated process of promoting a model through gated environments such as development and production. Lastly, CT is Continuous Training: the idea that, in the instances where a model needs to be retrained, that process can be completely automated. Unfortunately, CT is often misconstrued to imply that models need to be constantly retrained, perhaps even daily. In fact, a well-designed model will generalize well enough that it does not require frequent retraining, or it will be designed with a known retraining period as part of its spec.
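The distinction matters in practice: a CT pipeline should gate retraining on measured degradation or on the retraining period written into the model's spec, not run on a fixed daily schedule. Below is a minimal sketch of such a gate; the function name, thresholds, and scores are all hypothetical, not drawn from any particular ML Ops product.

```python
from datetime import datetime, timedelta

# Hypothetical CT gate: retrain only when performance degrades
# beyond a tolerance, or when the spec'd retraining period has
# elapsed -- never simply because the pipeline ran today.
def should_retrain(current_score: float, baseline_score: float,
                   last_trained: datetime, now: datetime,
                   max_drop: float = 0.05,
                   retrain_period: timedelta = timedelta(days=90)) -> bool:
    degraded = (baseline_score - current_score) > max_drop
    stale = (now - last_trained) > retrain_period
    return degraded or stale

now = datetime(2023, 6, 1)
# Recently trained and still generalizing well: no retrain.
print(should_retrain(0.91, 0.93, datetime(2023, 5, 1), now))   # False
# Accuracy dropped sharply: trigger retraining.
print(should_retrain(0.82, 0.93, datetime(2023, 5, 1), now))   # True
# Score is fine, but the 90-day retraining period has elapsed.
print(should_retrain(0.93, 0.93, datetime(2023, 1, 1), now))   # True
```

The thresholds here stand in for whatever acceptance criteria the model's spec defines; the point is that the trigger is explicit and testable rather than a blind schedule.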
What is a Maturity Model?
The CMMI maturity model was originally developed from 1987 to 1997. Version 1.1 was released in 2002, version 1.2 in 2006, and version 1.3 in 2010.
The objective of the CMMI project was to develop a model for evaluating an organization's process maturity. Furthermore, the CMMI model was intended to act as a guide for developing and improving processes that meet business goals.
The CMMI model consists of either four capability levels, to be continuously improved, or five maturity levels, to be achieved one stage at a time.
The five maturity levels are listed below:
| Level 1: Initial | Level 2: Managed | Level 3: Defined | Level 4: Quantitatively Managed | Level 5: Optimizing |
| --- | --- | --- | --- | --- |
| Process unpredictable, poorly controlled, and reactive | Processes characterized for projects and often reactive | Processes characterized for the organization and proactive | Processes measured and controlled | Focus on process improvement |
Over the course of several decades, many CMMI projects have yielded performance data. The median time for an entire enterprise to move from Level 1 to Level 2 is 5 months, and an additional 21 months is typical for the move from Level 2 to Level 3.
DevOps Maturity Model
There is a wide variety of DevOps maturity models available. The model from Ineta Bienca's Master's Thesis on DevOps adoption for very small entities is more compelling than most because its methodology covers Technology, Processes, People, and Culture. This level of detail makes the model more practical for real organizations.
| Level 1 | Level 2 | Level 3 | Level 4 | Level 5 |
| --- | --- | --- | --- | --- |
| Initial | Repeatable | Defined | Managed | Optimized |
A deep dive into the Bienca model is outside the scope of this article, but an infographic is available on the Ineta Bienca website.
What is Model Ops?
Before discussing the current ML Ops maturity models, it is useful to highlight a newer trend in the extended concept of ML automation. Model Ops is the application of automated ML at an enterprise scale, rather than at a Data Science team scale. The concept was a major driver of the first ML Ops maturity model article.
Model Ops ensures reliable and optimal outcomes for all models in production. It deals with all elements of model production, including model inventory, model reliability, regulation, compliance, risk, and controls. CIOs, along with business stakeholders, are responsible for developing and maintaining a Model Ops platform. In short, Model Ops is the enterprise-level operations and governance for ALL AI/ML models in production. Capabilities such as independent validation and accountability enable strategic decision-making support regardless of the type of model, how it was developed, or where it runs.
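As a concrete illustration of the inventory and governance concerns above, the sketch below models a hypothetical enterprise model inventory with one simple compliance check. All record names and fields are illustrative assumptions, not drawn from any particular Model Ops product.

```python
from dataclasses import dataclass, field
from datetime import date

# Hypothetical inventory record capturing the Model Ops concerns
# listed above: inventory, risk, compliance, and accountability.
@dataclass
class ModelRecord:
    name: str
    owner: str                           # accountable business stakeholder
    deployed: date
    risk_tier: str                       # e.g. "low", "medium", "high"
    independently_validated: bool = False
    compliance_notes: list = field(default_factory=list)

inventory = [
    ModelRecord("churn-predictor", "retail-lob", date(2021, 3, 1),
                "medium", independently_validated=True),
    ModelRecord("fraud-scorer", "payments-lob", date(2021, 6, 15),
                "high", independently_validated=False),
]

# A governance control: every high-risk model in production must
# have passed independent validation.
unvalidated_high_risk = [m.name for m in inventory
                         if m.risk_tier == "high"
                         and not m.independently_validated]
print(unvalidated_high_risk)  # ['fraud-scorer']
```

The point of the sketch is that enterprise governance operates over *all* models as inventory records, independent of how any individual model was built or where it runs.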
William McKnight Article
An article written in 2020 by William McKnight presents perhaps the earliest maturity model applied to ML Ops. The article introduces many Model Ops concepts at a time before Model Ops was a known buzzword. To address these concepts, the author subdivides maturity goals into the areas of Strategy, Architecture, Modeling, Processes, and Governance.
The article appears to be inspired by the Microsoft TDSP methodology for coordinating data science and engineering teams, and it shows a variety of implementations using the Azure ML toolset.
| Level 0 | Level 1 | Level 2 | Level 3 | Level 4 |
| --- | --- | --- | --- | --- |
| Build data science strategy | Modeling and development work happen independently | Add reproducibility and iteration to development and model creation | DevOps and ML pipelines converge into a shared release pipeline | Scale up ML efforts to a consistent approach |
Google Architecture Center ML Ops
Google quickly released its own ML Ops maturity model in its architecture center. This model is quite minimal and largely details GCP platform services.
| Level 0 | Level 1 | Level 2 |
| --- | --- | --- |
| Manual process | ML pipeline automation | CI/CD pipeline automation |
Azure Architecture Center ML Ops Maturity Levels
The Azure Architecture Center ML Ops maturity model follows a restricted model-lifecycle concept for its levels. However, the Azure model does include more process criteria than the other cloud offerings, with dimensions of People, Model Creation, Model Release, and Application Integration.
The Azure ML Ops maturity model also has a GitHub site with documentation and some code examples centered on Azure ML services.
| Level 0 | Level 1 | Level 2 | Level 3 | Level 4 |
| --- | --- | --- | --- | --- |
| No ML Ops | DevOps, no ML Ops | Automated training | Automated model deployment | Full ML Ops: automated retraining |
AWS Workload Orchestrator
AWS offers little to no process improvement, but it does have a service with a workflow designer that lets a user build an AWS-centric ML Ops workflow from Amazon services. The Workload Orchestrator tool includes templates and reference architectures.
Evaluation of Maturity Models
ML Ops maturity patterns have morphed from a search for ways to deal with AI/ML-specific issues via Model Ops into a CI/CD/CT paradigm largely centered on tool implementation. From a process standpoint, are the cloud vendors' maturity models sufficient to lead a digital transformation that produces better model outcomes? The table below shows matches and mismatches between the proposed ML Ops maturity levels and CMMI targets:
| Model | Level of Process Improvement |
| --- | --- |
| DevOps (Bienca) | Good segmentation across domains of human interaction, but not technical enough. Enough to achieve Level 2 even without being ML Ops related. Potentially capable of reaching Level 3. |
| McKnight | Too closely tied to Azure ML and TDSP concepts, but has good segmentation across each domain. Potential to achieve Level 3. |
| Google | Minimal approach. Achieves Level 2 only. |
| Azure | Removed many of the useful details of the McKnight article. Mostly CI/CD buzzwords. |
| AWS Workload Orchestrator | Software based on lock-in to AWS services. Achieves Level 2 only. |
In general, there is a lack of robustness in the vendor-based ML Ops maturity levels with respect to CMMI targets.
How much process is correct?
In terms of CMMI progress, the cloud models offer minimal levels of maturity. Model development remains reactive and unorganized, with many issues and no clear vision for next steps.
CMMI Level 2: Processes characterized for projects and often reactive
The DevOps and McKnight models include additional process details relevant to Level 3 of the CMMI standard. A near miss is the Azure ML Ops set of guidelines, which did address people and model issues by borrowing some of the McKnight details, but ultimately failed to address the processes needed to scale the model organization-wide.
CMMI Level 3: Processes characterized for the organization and proactive
Breaking through from the Level 2 use case of simply publishing a model to the Level 3 use case of building a process to implement models across an organization/enterprise is difficult.
Conclusion
ML Ops is designed to solve one problem of many that a modern ML shop must deal with. A huge number of other problems need to be solved for an organization to truly be effective at utilizing model technology. These include:
- Data Issues
- Talent issues
- Lack of user adoption
- Missing transformation
- Missing processes
- Lack of due diligence
- Business problem
- Wrong target
- End user trust
- Production support
Do the ML Ops maturity levels address these issues? The answer is no. The current CI/CD/CT paradigm driving most of the cloud versions of ML Ops maturity only handles putting a model somewhere. Much more is needed to be successful at large scale, across an entire enterprise.
In the next article, we will go back to the original ML Ops maturity model to reframe the pattern at an enterprise scale as a Model Ops maturity model. Stay tuned!