Open-source Data Lakehouse And MLOps Platform — A Unified Approach To Data Management And Machine Learning

Wajeeh Ul Hassan
5 min read · Sep 20, 2024


There has been tremendous progress in data engineering and machine learning. Many new tools have been developed that give users flexibility and interoperability, and it is now possible to build a complete data and machine learning stack from open-source tools alone, one that scales to petabytes and handles demanding workloads.

One of the biggest problems with cloud computing is vendor lock-in, which occurs when organizations become dependent on a single cloud provider’s tools, services, and infrastructure, making it difficult to switch providers without incurring significant cost or effort. This is problematic for data engineering and MLOps because it limits flexibility, stifles innovation, and can inflate costs: organizations tied to one provider may end up paying for services they could get cheaper elsewhere. It also hinders integration with diverse technologies and interoperability, making it hard to adapt to changing business needs or emerging technologies. The best alternative is the open-source route, but how can we build a complete, open-source data and machine learning platform? That is what this article explores.

A successful MLOps implementation also relies on the availability of good, clean, versioned, and organized data. Recently, Apache Iceberg has risen in popularity, and organizations are already moving away from Snowflake to it.

For an open-source MLOps platform we have Kubeflow, an open-source alternative to the MLOps platforms provided by cloud vendors. There are other tools in this space, but none of them offer the flexibility to integrate external tools the way Kubeflow does. Kubeflow also supports advanced deployment techniques and multi-tenancy, and it is built with flexibility and scalability in mind.


COMPONENTS OF A DATA LAKEHOUSE

A data lakehouse has several components:

Data Storage Layer

This could be a data lake, a data warehouse, or a data lakehouse. In modern data architectures, the data lakehouse is preferred.

Data Ingestion Layer

We can ingest data in batches or as streams; tools like Apache NiFi and Apache Kafka can be used for this.
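For example, here is a minimal streaming-ingestion sketch with the kafka-python client; the broker address, topic name, and event shape are assumptions for illustration:

```python
# A minimal stream-ingestion sketch using kafka-python; the broker,
# topic, and event payload are illustrative assumptions.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed local broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Send one event to a hypothetical "raw-events" topic.
producer.send("raw-events", {"user_id": 42, "action": "click"})
producer.flush()
```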

Data Processing Layer

Here we do data transformation; Apache Spark and dbt are the most popular choices, and for data governance we have tools like Apache Atlas.
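For illustration, a minimal PySpark transformation sketch; the input path, column names, and output path are assumptions:

```python
# A minimal PySpark batch transformation; paths and columns are
# illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("transform").getOrCreate()

raw = spark.read.parquet("s3a://my-bucket/raw/orders/")  # assumed path
daily = (
    raw.withColumn("order_date", F.to_date("created_at"))
       .groupBy("order_date")
       .agg(F.sum("amount").alias("revenue"))
)
daily.write.mode("overwrite").parquet("s3a://my-bucket/curated/daily_revenue/")
```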

Query and Analysis Layer

For querying and analyzing the data, we can use SQL engines like Presto or Apache Drill. For data visualization and report generation, we can use Tableau, Apache Superset, or Power BI.
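For example, a minimal sketch of querying the lakehouse from Python, assuming the presto-python-client package and an assumed host, catalog, and table:

```python
# A minimal sketch of querying through Presto's Python client
# (presto-python-client); host, catalog, and table are assumptions.
import prestodb

conn = prestodb.dbapi.connect(
    host="localhost", port=8080, user="analyst",
    catalog="hive", schema="default",
)
cur = conn.cursor()
cur.execute("SELECT order_date, revenue FROM daily_revenue LIMIT 10")
for row in cur.fetchall():
    print(row)
```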

Metadata Management

We also need to manage metadata for data discovery and to understand our data assets; for this purpose we maintain a data catalog.

MLOps Platform Components

Experiment Tracking:

Tools for logging experiments, tracking metrics, and managing model versions (e.g., MLflow, Weights & Biases).
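For example, a minimal MLflow tracking sketch; the experiment name, parameter, and metric values are illustrative assumptions:

```python
# A minimal MLflow tracking sketch; experiment name, parameter, and
# metric values are illustrative assumptions.
import mlflow

mlflow.set_experiment("churn-model")  # assumed experiment name

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_metric("val_accuracy", 0.93)
    # mlflow.sklearn.log_model(model, "model")  # log a trained model
```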

Model Training:

Kubeflow comes with the Training Operator, which works with PyTorch, TensorFlow, and many other libraries.

Additionally, Katib can be used for AutoML and hyperparameter tuning; under the hood it can leverage algorithms from frameworks such as Optuna and Hyperopt.
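As a rough standalone illustration of the kind of search Katib orchestrates, here is a minimal Optuna sketch; the objective function is a toy assumption:

```python
# A minimal Optuna sketch of the kind of hyperparameter search Katib
# orchestrates; the objective is a toy assumption.
import optuna

def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    depth = trial.suggest_int("max_depth", 2, 10)
    # In practice: train a model with (lr, depth) and return a
    # validation metric. Here we return a toy score.
    return (lr - 0.01) ** 2 + depth * 0.001

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```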

Model Deployment:

Serving Infrastructure:

Kubeflow uses KServe for model deployment, but other model inference platforms can also be integrated into Kubeflow.
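Once a model is served, KServe exposes a standard predict endpoint. Here is a minimal sketch of calling it over the v1 inference protocol; the host, model name, and feature vector are assumptions:

```python
# A minimal sketch of calling a KServe v1 predict endpoint; the host,
# model name, and feature vector are assumptions.
import requests

url = "http://my-model.default.example.com/v1/models/my-model:predict"
payload = {"instances": [[5.1, 3.5, 1.4, 0.2]]}  # assumed features

resp = requests.post(url, json=payload, timeout=10)
print(resp.json())  # e.g. {"predictions": [...]}
```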

Monitoring and Logging:

Tools for monitoring model performance in production and logging inference requests (e.g., Prometheus, Grafana).
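For example, a minimal prometheus_client sketch for instrumenting an inference service; the metric names and port are assumptions:

```python
# A minimal prometheus_client sketch for an inference service; metric
# names and the port are illustrative assumptions.
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("inference_requests_total", "Total inference requests")
LATENCY = Histogram("inference_latency_seconds", "Inference latency")

def predict(features):
    REQUESTS.inc()
    with LATENCY.time():
        time.sleep(0.01)  # stand-in for real model inference
        return 0.5        # stand-in prediction

start_http_server(8000)   # Prometheus scrapes /metrics on this port
```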

Collaboration and CI/CD:

Tools for version control, collaboration, and continuous integration/continuous deployment (e.g., Git, Jenkins, GitLab CI/CD).

Data and Feature Stores:

Kubeflow also needs a system for managing and serving machine learning features. Feast is a good open-source option for a feature store.
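For illustration, a minimal Feast sketch for retrieving online features; the repo path, feature view, feature names, and entity are assumptions:

```python
# A minimal Feast online-serving sketch; repo path, feature view, and
# entity are illustrative assumptions.
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # assumed feature repo

features = store.get_online_features(
    features=["user_stats:purchase_count", "user_stats:avg_basket"],
    entity_rows=[{"user_id": 42}],
).to_dict()
print(features)
```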

Data Access Layer:

Tools for managing data access, along with pipelines for retrieving training and inference data.

Explainability:

For prediction explainability, SHAP or other such tools can be used within Kubeflow.
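For illustration, a minimal SHAP sketch for a tree-based model; the model and dataset are stand-in assumptions:

```python
# A minimal SHAP sketch for a tree model; the dataset and model are
# stand-in assumptions.
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = RandomForestRegressor(n_estimators=50).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X.iloc[:100])
shap.summary_plot(shap_values, X.iloc[:100])  # global feature importance
```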

Data Drift:

For drift detection, we can integrate tools like Evidently AI.
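For example, a minimal drift-report sketch using Evidently (API as of its 0.4.x releases); the reference and current datasets are assumptions:

```python
# A minimal Evidently data-drift sketch (0.4.x API); the datasets are
# illustrative assumptions.
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

reference = pd.read_parquet("reference.parquet")  # training-time data
current = pd.read_parquet("current.parquet")      # recent production data

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("drift_report.html")
```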

Why is Apache Iceberg amazing?

Apache Iceberg is amazing for large datasets: it lets users build an open-source data lakehouse that scales to petabytes and, with the help of a few companion tools, supports data versioning. It is also tremendously cheaper, because it gives users the choice of both storage and compute engine. Data can be stored in open file formats such as Parquet, Avro, and ORC. Iceberg achieves this by letting users create tables with the same transactional guarantees that databases offer, but at the storage layer. (Reference: https://www.cummulative.io/p/the-iceberg-that-sank-the-snowflake)
Iceberg also unlocks cross-platform access: by creating Iceberg tables directly in cloud storage such as S3, Azure Blob Storage, or Google Cloud Storage, we make the data accessible from any platform.
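As a concrete illustration, here is a minimal PySpark sketch that creates an Iceberg table directly on object storage; the catalog name, bucket, and schema are assumptions, and the iceberg-spark-runtime package is assumed to be on the Spark classpath:

```python
# A minimal sketch of an Iceberg table on S3 via PySpark; catalog name,
# bucket, and schema are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-demo")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://my-bucket/warehouse")
    .getOrCreate()
)

# An ACID table on plain object storage; any Iceberg-aware engine can read it.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.db.orders (
        order_id BIGINT, amount DOUBLE, created_at TIMESTAMP
    ) USING iceberg
""")
```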

Why is Kubeflow misunderstood?

Coming from both a computer science and a data science background, and having had the privilege of looking at Kubeflow from both perspectives, I understand why Kubeflow is misunderstood. Data science and infrastructure sit at two ends of a spectrum. Kubeflow is built on top of Kubernetes, which even many DevOps engineers don’t know well, and to work with Kubeflow you need to know not only Kubernetes but also GitOps. If software engineers and DevOps engineers find Kubernetes difficult, then the data team will certainly struggle with the infrastructure behind Kubernetes and Kubeflow. This is why Kubeflow ships with the Kubeflow Dashboard and various SDKs to on-board the data team.

However, to make MLOps and DataOps successful in an organization, the data team needs to be restructured: instead of a team made up entirely of data scientists, that percentage needs to be reduced, and the responsibility for managing machine learning model deployments, the machine learning lifecycle, and the infrastructure needs to be given to MLOps engineers. MLOps engineers can additionally enable AutoML and the ability to integrate other tools.
(Reference: https://www.contino.io/insights/data-science-mlops)

Integration of Data Lakehouse and MLOps

The integration of data lakehouse and MLOps components allows for seamless data flow from ingestion and storage to model training, deployment, and monitoring. This holistic approach facilitates efficient data management, collaboration, and reproducibility in machine learning workflows.

By combining these components, organizations can create a robust and scalable environment that supports advanced analytics and machine learning initiatives.
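As a sketch of how the two halves can meet in practice, here is a minimal Kubeflow Pipelines (kfp v2) pipeline skeleton; the component bodies and paths are placeholder assumptions:

```python
# A minimal kfp v2 pipeline skeleton tying data preparation to model
# training; component bodies and paths are placeholder assumptions.
from kfp import dsl

@dsl.component
def prepare_data() -> str:
    # e.g. read an Iceberg table and write out a training set
    return "s3://my-bucket/train/"

@dsl.component
def train_model(data_path: str) -> str:
    # e.g. train and register a model, return its URI
    return "s3://my-bucket/models/churn/1"

@dsl.pipeline(name="lakehouse-to-model")
def pipeline():
    data = prepare_data()
    train_model(data_path=data.output)
```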

To learn about the whole open-source, cross-platform data and machine learning platform, visit Hands On MLOps With Kubeflow.
