MLOps Simplified: What is an Artifact?
An artifact in machine learning is any file or object that is produced as part of a machine learning process.
Artifacts describe the data used to train the model, the model's design and function, and the system it was created in. They help ML engineers and data scientists reproduce, trace, and understand the machine learning process and the models used in it.
Artifacts might be raw data, training and testing data, video or audio files, pickled machine learning model files, or even pictures from an exploratory data analysis. The term can also extend to metadata, such as diagrams of the ML system, documentation, and the use cases that guided the model's development.
Artifacts can be stored in a repository during experiment tracking so that both ML engineers and data scientists can retrieve them on demand.
Examples of Artifacts
Artifacts can include the following:
- Diagrams. These help developers map out the structure of the software.
- Images. These design or reference images help develop the software.
- Files. These are documentation, flowcharts, etc.
- Model and Use Case documentation. These documents describe the characteristics and attributes of the software.
- Baseline and Prototype Models. These are the original testing models that the final models are based on. Including them as artifacts helps explain the final models and gives them context.
Types of Artifacts
Artifacts can fall under the following four main categories:
- Model-Related Artifacts. These artifacts are the inputs used to train the ML model, along with the files needed to package and run it. Inputs can range from sound files to CSV files to tables, or anything else that was fed into the model itself. Model-related artifacts can also include requirements.txt, Docker containers for models, pickled ML models, and even screenshots from EDAs.
- Code-Related Artifacts. This code acts as the foundation for the software and enables the developer to test the software before launching it. Code artifacts can include compiled code, setup scripts, test suites, generated objects, and logs generated during testing and quality assurance.
- Project Management Artifacts. These artifacts set expectations for the project and are used, once code is developed, to test whether it functions as intended. Artifacts here include minimum required standards, benchmarks, project vision statements, roadmaps, change logs, scope management plans, and quality plans.
- Documentation Artifacts. These artifacts keep track of relevant documents, including diagrams, end-user agreements, internal documentation, and written guides.
This is a general list. The types of artifacts included with an ML model may vary by industry, regulation, and policy. Model-related and code-related artifacts are usually included when saving the models.
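To make the idea of saving artifacts alongside a model more concrete, here is a minimal sketch that uses only the Python standard library. The function name `save_run_artifacts`, the file names, and the directory layout are all my own assumptions for illustration, not a standard, and the "model" is just a placeholder dictionary standing in for a trained model object.

```python
import json
import pickle
import sys
import tempfile
from pathlib import Path


def save_run_artifacts(run_dir, model, params, requirements):
    """Write a model and its supporting files into one run directory.

    File names are illustrative choices, not a required layout.
    """
    run_path = Path(run_dir)
    run_path.mkdir(parents=True, exist_ok=True)

    # The pickled model object.
    with open(run_path / "model.pkl", "wb") as f:
        pickle.dump(model, f)

    # The packages needed to recreate the environment.
    (run_path / "requirements.txt").write_text("\n".join(requirements) + "\n")

    # Metadata: training parameters and the Python version used.
    metadata = {"params": params, "python_version": sys.version.split()[0]}
    (run_path / "metadata.json").write_text(json.dumps(metadata, indent=2))

    # Return the names of everything that was saved for this run.
    return sorted(p.name for p in run_path.iterdir())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        placeholder_model = {"weights": [0.1, 0.2]}  # stands in for a real model
        saved = save_run_artifacts(
            Path(tmp) / "run_1", placeholder_model, {"lr": 0.01}, ["numpy"]
        )
        print(saved)  # ['metadata.json', 'model.pkl', 'requirements.txt']
```

Keeping everything for one run in one directory is only one possible design. An experiment tracker would normally handle the storage for you.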
Why are Artifacts Important?
Model Management
Model management is a large part of any ML project, and artifacts make it easier. They provide a way to save and manage the trained model, along with any other files or data that are necessary to use it. This helps machine learning engineers and data scientists keep track of the different versions of the trained model that are created during the project. It also allows them to easily compare and evaluate the performance of different models.
Collaboration
ML artifacts can be used to share trained models with other members of the machine learning team or with business users. This allows the trained models to be used and evaluated in different contexts, including user acceptance testing and quality evaluation. Because artifacts make the work reproducible, they also help ensure that the models are deployed and used effectively in production.
Model Versioning
Artifacts make model versioning possible. They provide context for the different versions of a model, showing the changes and updates that were made over time. This can include different versions of the data, project requirements, and inputs. With a record of the model versions and their changes, ML engineers and data scientists can easily revert to a previous version. They can also use that record to track the performance of the model over time and identify potential issues.
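One simple way to see what changed between two versions is to hash every artifact and compare the results. The sketch below assumes each version's artifacts live in their own folder. The helper names `build_manifest` and `diff_manifests` are hypothetical, and real trackers and data versioning tools do this in more sophisticated ways.

```python
import hashlib
import tempfile
from pathlib import Path


def build_manifest(artifact_dir):
    """Map each file's relative path to the SHA-256 hash of its contents."""
    root = Path(artifact_dir)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            key = path.relative_to(root).as_posix()
            manifest[key] = hashlib.sha256(path.read_bytes()).hexdigest()
    return manifest


def diff_manifests(old, new):
    """Report which artifacts were added, removed, or changed."""
    shared = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": sorted(name for name in shared if old[name] != new[name]),
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        v1, v2 = Path(tmp) / "v1", Path(tmp) / "v2"
        v1.mkdir()
        v2.mkdir()
        (v1 / "train.csv").write_text("x,y\n1,2\n")
        (v1 / "requirements.txt").write_text("numpy\n")
        (v2 / "train.csv").write_text("x,y\n1,2\n3,4\n")    # data changed
        (v2 / "requirements.txt").write_text("numpy\n")     # unchanged
        (v2 / "notes.md").write_text("Added more rows.\n")  # new artifact
        print(diff_manifests(build_manifest(v1), build_manifest(v2)))
        # {'added': ['notes.md'], 'removed': [], 'changed': ['train.csv']}
```

Storing a manifest like this with each model version gives you a quick answer to "what is different about this version?" before you start comparing performance.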
What Larger Processes Do Artifacts Help?
Data Governance. Artifacts support explainable machine learning by documenting how a model was built. They also give data scientists and ML engineers what they need to let others consistently reproduce the models and their environments.
Data Quality. By keeping a record of the training data and evaluation metrics, artifacts help ensure that the data is consistent and of high quality. They also allow ML engineers and data scientists to easily compare the performance of different models that are trained on the same data.
Maintenance and Fixes. Knowing what worked in a successful model helps with maintenance. Comparing artifacts from successful models to those from unsuccessful ones helps identify issues.
Best Practices for Artifacts
Track, Record, and Save Artifacts
Best practices for ML artifacts include using an experiment tracker to store and save artifacts, along with using a Git repository to compare the latest version against them. If an experiment tracker is unavailable, an Excel spreadsheet is a useful fallback.
Useful trackers include Comet, Neptune, and MLflow.
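If you end up going the spreadsheet route, the log can be as simple as a CSV file that opens in Excel. Here is a rough sketch of that fallback using Python's built-in `csv` module. The column names and the `log_artifact_row` and `read_log` helpers are assumptions I made up for the example, so adapt them to whatever your team needs to record.

```python
import csv
import tempfile
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical columns for the artifact log.
FIELDS = ["logged_at", "run_id", "artifact_path", "description"]


def log_artifact_row(log_file, run_id, artifact_path, description):
    """Append one artifact record to a CSV log, adding a header if needed."""
    log_path = Path(log_file)
    write_header = not log_path.exists()
    with open(log_path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(
            {
                "logged_at": datetime.now(timezone.utc).isoformat(),
                "run_id": run_id,
                "artifact_path": str(artifact_path),
                "description": description,
            }
        )


def read_log(log_file):
    """Return every logged record as a list of dictionaries."""
    with open(log_file, newline="") as f:
        return list(csv.DictReader(f))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log = Path(tmp) / "artifact_log.csv"
        log_artifact_row(log, "run_1", "runs/run_1/model.pkl", "baseline model")
        log_artifact_row(log, "run_2", "runs/run_2/model.pkl", "tuned model")
        for row in read_log(log):
            print(row["run_id"], row["artifact_path"], row["description"])
```

A log like this only records where artifacts live. It does not store them, so the files themselves still need a safe home.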
Checklists
It’s also helpful to have a checklist of artifacts that should be added in each run of an experiment tracker. The checklist should be reviewed periodically to make sure that the right artifacts are being added. A better method is to establish a standard operating procedure (SOP) that team members can follow and reference. Checklists and SOPs keep runs consistent.
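A checklist can also be enforced in code. The sketch below checks a run folder against a list of required files and reports anything that is missing. The list in `REQUIRED_ARTIFACTS` and the `missing_artifacts` helper are hypothetical examples, so swap in whatever your own checklist or SOP calls for.

```python
import tempfile
from pathlib import Path

# Hypothetical checklist; replace with the files your team requires.
REQUIRED_ARTIFACTS = ["model.pkl", "requirements.txt", "metadata.json"]


def missing_artifacts(run_dir, required=REQUIRED_ARTIFACTS):
    """Return the checklist items that are not present in the run folder."""
    run_path = Path(run_dir)
    return [name for name in required if not (run_path / name).is_file()]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "model.pkl").write_bytes(b"placeholder")
        print(missing_artifacts(tmp))  # ['requirements.txt', 'metadata.json']
```

Running a check like this at the end of every experiment is a small step toward turning the checklist into an SOP that nobody has to remember.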
Designate an Artifact Repository Manager
Another useful rule is to put one ML engineer or data scientist in charge of moving, copying, and deleting artifacts so that the repository stays consistent. As with any data that may be deleted, a rule should define how long artifacts must be kept before deletion.
Please note these are some, but not all, of the best practices for working with machine learning artifacts. Best practices are highly dependent on the industry, team, and company.
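A retention rule is easy to express in code. As a minimal sketch, the function below takes a mapping of artifact names to creation dates and returns the ones that are older than the retention period. The `expired_artifacts` name, the artifact names, and the 90-day period are all assumptions for illustration. The function only reports candidates, leaving the actual deletion to the person in charge.

```python
from datetime import datetime, timedelta, timezone


def expired_artifacts(artifacts, retention_days, now=None):
    """Return the names of artifacts older than the retention period.

    `artifacts` maps an artifact name to its timezone-aware creation time.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    return sorted(name for name, created in artifacts.items() if created < cutoff)


if __name__ == "__main__":
    today = datetime(2024, 6, 1, tzinfo=timezone.utc)  # fixed date for the demo
    artifacts = {
        "baseline.pkl": datetime(2024, 1, 1, tzinfo=timezone.utc),
        "model_v2.pkl": datetime(2024, 5, 20, tzinfo=timezone.utc),
    }
    print(expired_artifacts(artifacts, retention_days=90, now=today))
    # ['baseline.pkl']
```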
Final Thoughts
Artifacts are an important thing to consider as you create ML models and scale the infrastructure to support them. They go a long way toward capturing the insights and knowledge gained during the process of creating an ML model.
Improving ML models is possible without artifacts. However, I feel that to get consistent, reliable, and high-quality models, artifacts must be kept with the model in some form. They’re also very important if you want to build on the models you currently have.
Overall, ML artifacts are an essential part of developing and deploying ML models at scale in the long run. They boost efficiency and collaboration, and they play a small but important role in building credibility and trust in a model.