Pachyderm Overview
Pachyderm provides the ability to modularize, orchestrate, and scale the steps of your ML pipeline within a language-agnostic platform — with the added ability to trace the lineage and versioning of both code and data.
What Problems Does Pachyderm Solve?
Typical software development entails managing only a codebase, whereas ML development demands that organizations manage both code and data. This considerably amplifies complexity because data assets change constantly, which can lead to drastic differences in model performance. Pachyderm handles this complexity by giving organizations version control over both code and data, from ingestion to model deployment. The platform is built on top of Kubernetes for scalability and leverages Docker containers for language-agnostic modularization.
Where Does Pachyderm Fit Within the ML Lifecycle?
Pachyderm offers features spanning the ML lifecycle, with a particular emphasis on data acquisition and model preparation, along with support for less complex model deployments. It has also been described as the “glue layer” that connects the various bespoke solutions within your organization’s ML pipeline.
Pachyderm Features
Attributes | Features
---|---
Data Acquisition and Preparation | Through the use of containers, Pachyderm provides clear documentation of ML workflow steps and version control of both data assets and code. The platform represents data preparation processes as DAGs and surfaces changes to them in real time (a minimal pipeline specification sketch follows this table).
Model Development and Training | Because Pachyderm is built on top of Kubernetes and leverages Docker containers, users can seamlessly scale each step of their ML workflows in parallel. The solution also supports access to instances with GPUs for training deep learning models, and notebook services allow users to quickly experiment and share results.
Model Deployment and Operations | Pachyderm supports either self-managed Kubernetes clusters or abstracting this management away through its Kubernetes-as-a-service offering.
System-wide Features | Pachyderm provides clear documentation of how an organization’s various data and ML processes relate to each other through DAGs on its platform. In conjunction with data lineage, users can quickly assess changes made by others within the ML workflow. In addition, Enterprise Edition customers have further control over collaboration and access on the platform through its various security features.
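To make the container-based modularization above concrete, the sketch below shows roughly what a Pachyderm pipeline specification can look like, expressed as a Python dictionary and serialized to JSON. The field names follow Pachyderm's pipeline spec conventions, but the repo name, Docker image, and script path are hypothetical placeholders; consult Pachyderm's documentation for the authoritative spec format.

```python
import json

# A minimal sketch of a Pachyderm pipeline specification, built as a plain
# Python dict and serialized to JSON. The repo, image, and script names are
# hypothetical placeholders; verify field names against Pachyderm's current
# documentation before relying on them.
pipeline_spec = {
    "pipeline": {"name": "edge-detection"},
    "description": "Runs an edge-detection step over every new image commit.",
    "input": {
        "pfs": {
            "repo": "raw-images",  # versioned input data repository
            "glob": "/*",          # one datum per top-level file, enabling parallel processing
        }
    },
    "transform": {
        "image": "my-org/edge-detect:1.0",          # Docker image holding the step's code
        "cmd": ["python3", "/app/edge_detect.py"],  # command run against each datum
    },
}

# The JSON form is what would be handed to the Pachyderm CLI or API
# (e.g. via `pachctl create pipeline`) to register this step in the DAG.
print(json.dumps(pipeline_spec, indent=2))
```

Each such specification defines one node in the DAG; chaining steps together, by pointing one pipeline's input at another pipeline's output repository, is how Pachyderm composes multi-step workflows while preserving lineage between them.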
Pachyderm Deployment Options
In addition to Pachyderm’s previously mentioned Kubernetes support, a variety of deployment options are offered:
- Amazon Web Services
- Google Cloud Platform
- Microsoft Azure
- Other Public Cloud
- Kubernetes
- Private Cloud or Datacenter
Pachyderm Partners
With only two boutique consulting partners and a handful of marketing partnerships, Pachyderm is early in the process of building out its partner network. Partnerships include:
- ixpantia
- pelotech
- Microsoft Partner
- Red Hat Technology Partner
- AI Infrastructure Alliance
Pachyderm Pricing
Pachyderm offers utility-based pricing that varies by product tier: Free, Pro, and Enterprise. Billing is determined by the number of credits, called PCUs for standard compute instances and PGUs for GPU-based compute instances, used by an organization within a month. One PCU equates to $0.14/hr, while one PGU ranges from $0.70/hr to $2.80/hr depending on the configuration. More information regarding Pachyderm’s pricing can be found on their website.
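As a rough illustration of how this utility-based billing accrues, the sketch below estimates a monthly bill from assumed PCU- and PGU-hours. The hourly rates come from the figures above, but the workload numbers, the `estimate_monthly_cost` helper, and the assumption that a bill is a simple rate-times-hours sum are hypothetical simplifications, not Pachyderm's actual metering logic.

```python
# Hypothetical estimate of a monthly Pachyderm bill using the published
# hourly rates quoted above. The workload figures and the flat
# rate-times-hours model are illustrative assumptions only.

PCU_RATE_PER_HR = 0.14        # standard compute credit, $/hr (from the text)
PGU_RATE_PER_HR_LOW = 0.70    # GPU credit, low-end configuration, $/hr
PGU_RATE_PER_HR_HIGH = 2.80   # GPU credit, high-end configuration, $/hr


def estimate_monthly_cost(pcu_hours: float, pgu_hours: float,
                          pgu_rate: float = PGU_RATE_PER_HR_LOW) -> float:
    """Return an estimated monthly cost given credit-hours consumed."""
    return pcu_hours * PCU_RATE_PER_HR + pgu_hours * pgu_rate


# Example: standard pipelines running around the clock (720 hrs/month)
# plus 100 hrs of training on a low-end GPU configuration.
if __name__ == "__main__":
    cost = estimate_monthly_cost(pcu_hours=720, pgu_hours=100)
    print(f"Estimated monthly cost: ${cost:,.2f}")  # ~$170.80
```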
The TWIML Take
The challenges of managing both a codebase and changing data assets become clear as an organization moves forward in its ML journey. For organizations in such scenarios, clear version control of these assets is paramount as the AI/ML space becomes further regulated with respect to data governance and explainability — such solutions are a requirement in healthcare, finance, and any other domain that heavily utilizes PII.
This solution has strong features in the data acquisition and model preparation phases of the ML lifecycle, though it lacks robustness for deploying more complex models. For steps beyond deployment, such as monitoring and retraining, Pachyderm users would need to complement it with additional solutions.
The ideal Pachyderm customers are organizations in the early stages of integrating machine learning into their products and operations that need to move more quickly. Specifically, as an organization grows in data maturity, it needs robust data pipelines that connect its various data sources and data transformation operations within a unifying platform. Pachyderm provides a foundation for bringing together an organization’s codebase, data assets, and various microservices, from data acquisition to ML model deployment, all organized through clear pipelines and data lineage. The versioning, lineage, and repeatability offered by Pachyderm will also be of particular interest to those building ML models in regulated environments.
To learn more about a real-world use case of Pachyderm, visit Pachyderm’s profile on the TWIML Solutions Guide to view our interview with user Daan Odijk on how RTL uses this solution to manage their MLOps infrastructure and scale a video AI application.
Summary
Pachyderm’s strengths are in its ability to modularize, orchestrate, and scale the steps of your ML pipeline within a language-agnostic platform — with the added ability to trace the lineage and versioning of both code and data.
Pachyderm may be viewed as a specialist tool focused on data pipeline versioning and reproducibility, or as the basis for an end-to-end ML platform. Relative to other end-to-end ML platforms, this solution is limited in its deployment capabilities, particularly for complex ML models, and in the steps beyond deployment, such as model monitoring and operations.
For the right customer, Pachyderm provides the opportunity to establish a “glue layer” that connects the various bespoke solutions within an organization’s ML pipeline. In addition, its strong data lineage functionality is a major asset for organizations in highly regulated industries.