Data Lake to Data Mesh: Converting Data Sets into Data Products

The Data Mesh concept, first introduced by Zhamak Dehghani, is gaining significant popularity. Data Mesh proposes a new way of thinking about enterprise data management, based on distributed architecture, governance, and ownership. We have seen some of our customers adopt the Data Mesh concept and create their own customized versions. As implementation partners, we are often asked for the ‘double-click’ details:

What is a Data Product, and how is it different from what we are doing today?
How do I prepare my platform to support a Data Mesh architecture?
How do I cultivate a data-driven decision-making culture by leveraging Data Products and an Enterprise Data Marketplace?

In this three-part blog series, we will try to answer these questions and introduce Persistent Data Foundry, a platform that supports the Data Mesh architecture.

Let’s first look at the characteristics of Data Products, and then at how to convert the existing inventory of Data Pipelines (ETL/ELT) and Datasets into Data Products.

Characteristics of Data Products

According to Zhamak’s initial positioning, a Data Product should have the following characteristics:

It should be owned by the Domain Team.
It should be usable and valuable to other domains.
It should be feasible to implement.
It should be discoverable and self-describing, both semantically and syntactically.
It should be trustworthy, i.e., the owner should be able to measure its completeness, lag, timeliness, and statistical shape, and stand behind those measurements.
It should be interoperable, or in other words, governed by global standards.
It should be secure. The Data Product Owner should be able to define the access control policy.
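
To make these characteristics concrete, here is a minimal sketch of how a Data Product descriptor might be modeled. The class name, field names, and sample values are hypothetical and chosen only for illustration; they are not part of any Data Mesh standard or of Persistent Data Foundry.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """Hypothetical descriptor; every field name here is an illustrative assumption."""
    name: str
    owner_domain: str          # the domain team that owns the product
    description: str           # semantic self-description
    schema: dict[str, str]     # syntactic self-description: column -> type
    # Trustworthiness: metrics the owner publishes and stands behind.
    quality_metrics: dict[str, float] = field(default_factory=dict)
    # Security: access policy defined by the Data Product Owner.
    allowed_domains: set[str] = field(default_factory=set)

    def can_read(self, domain: str) -> bool:
        """The owning domain always has access; others need an explicit grant."""
        return domain == self.owner_domain or domain in self.allowed_domains


orders = DataProduct(
    name="orders_daily",
    owner_domain="sales",
    description="One row per confirmed order, refreshed daily.",
    schema={"order_id": "string", "amount": "decimal", "order_date": "date"},
    quality_metrics={"completeness": 0.998},
    allowed_domains={"finance"},
)

print(orders.can_read("finance"))    # True
print(orders.can_read("marketing"))  # False
```

Keeping the description, schema, quality metrics, and access policy on the product itself is what makes it self-describing, rather than dependent on an external tool.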

But we have Data Pipelines and Datasets in our Data Lake, and not Data Products

There are Data Pipelines and Datasets, but their ownership lies with the central Data and Analytics (D&A) Team.
The Datasets are usable and valuable to the domain owning them. They are probably useful to others as well.
The Datasets are definitely feasible to implement; in fact, they are already implemented.
Datasets may be discoverable if a Catalog is already implemented and maintained.
Datasets are not self-describing. They are at the mercy of the Cataloging Tool.
Datasets might be trustworthy, but only the pipeline team knows how quality is measured.
Datasets are intentionally NOT governed by global standards.
Datasets are protected by default, and access is given on a case-by-case basis.
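
The comparison above can be expressed as a simple gap check. The sketch below is only an illustration: the inventory record, its field names, and the "central-d&a" label are assumptions, and in practice the answers would come from your catalog and pipeline metadata.

```python
from dataclasses import dataclass

CENTRAL_TEAM = "central-d&a"  # hypothetical label for the central D&A team


@dataclass
class DatasetEntry:
    """Hypothetical inventory record for an existing Data Lake dataset."""
    name: str
    owner: str
    in_catalog: bool
    self_describing: bool
    dq_scores_published: bool
    follows_global_standards: bool
    owner_defined_access_policy: bool


def find_gaps(ds: DatasetEntry) -> list[str]:
    """Return the Data Product characteristics this dataset does not yet meet."""
    checks = [
        ("domain ownership", ds.owner != CENTRAL_TEAM),
        ("discoverable", ds.in_catalog),
        ("self-describing", ds.self_describing),
        ("trustworthy", ds.dq_scores_published),
        ("interoperable", ds.follows_global_standards),
        ("secure", ds.owner_defined_access_policy),
    ]
    return [name for name, satisfied in checks if not satisfied]


clickstream = DatasetEntry(
    name="web_clickstream",
    owner=CENTRAL_TEAM,
    in_catalog=True,
    self_describing=False,
    dq_scores_published=False,
    follows_global_standards=False,
    owner_defined_access_policy=False,
)

print(find_gaps(clickstream))
# ['domain ownership', 'self-describing', 'trustworthy', 'interoperable', 'secure']
```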

Datasets to Data Products – Our Migration Methodology

No enterprise wants to throw away the investment it has already made in its Data Lake. So, the question is: how do we convert the existing Data Pipelines and Datasets into Data Products? Here is how we approach it:

Phase	Task	Outcome
Global Discovery	Review the datasets across the Data Lake or Data Warehouse (DWH) and identify the domain owners. Mark first-party and third-party ownership and second-party data usage. The task is simpler if a data catalog is maintained; otherwise, we may need to work from a technical catalog such as the Hive Metastore.	Dataset domain ownership; dataset cross-domain usability
Local Discovery	The domain owners conduct the local discovery of the datasets identified for them. They identify which data entities are created by them (first party), which are purchased by them (third party), and which are only consumed by them. Once the tentative definition of the Data Products is created, define the Data Quality (DQ) rules that establish trustworthiness, and define the global access policy. Create a roadmap to convert each dataset into a Data Product.	Feasibility study; Data Product identification (which ones to drop, which ones to combine); review of the current technology and infrastructure and their usefulness in the Data Mesh world; definition of trustworthiness (in other words, the DQ rules); definition of the global access control policy; execution roadmap
Ownership Handover	The organization needs to decide whether it wants a big-bang transformation to domain-based ownership. The alternative is to create domain-based technical teams under D&A, each owning a specific set of domains.	Tenant creation in infrastructure; admin rights; code handover; monitoring and management handover
Migration Implementation	Start by identifying the Data Product gaps. For example, DQ rules may not have been implemented. Then create a project plan and fill in the gaps to convert the datasets into Data Products.	Discoverability, e.g., registration in the central cataloging tool; Data Quality: adoption of the platform-standard DQ implementation, with DQ scores available in the data catalog
Run	Monitor the Data Product against DQ, freshness, and other agreed-upon SLAs. Catch schema drift using Data Product versioning. Measure and improve adoption.	Seamless availability of the Data Products
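
As an illustration of the Run phase, the following sketch shows one possible way to check completeness, freshness, and schema drift. The function names, the sample data, and the major.minor versioning rule are all assumptions made for this example, not a description of any specific platform.

```python
from datetime import datetime, timedelta, timezone


def completeness(rows: list[dict], required: list[str]) -> float:
    """Share of rows in which every required column is present and not None."""
    if not rows:
        return 0.0
    ok = sum(all(r.get(c) is not None for c in required) for r in rows)
    return ok / len(rows)


def is_fresh(last_updated: datetime, max_lag: timedelta, now: datetime) -> bool:
    """True if the product was refreshed within the agreed freshness SLA."""
    return now - last_updated <= max_lag


def schema_drift(published: dict[str, str], observed: dict[str, str]) -> dict[str, list[str]]:
    """Compare the published schema (column -> type) with the observed one."""
    common = set(published) & set(observed)
    return {
        "added": sorted(set(observed) - set(published)),
        "removed": sorted(set(published) - set(observed)),
        "retyped": sorted(c for c in common if published[c] != observed[c]),
    }


def next_version(current: str, drift: dict[str, list[str]]) -> str:
    """Assumed rule: removed or retyped columns are breaking; added columns are not."""
    major, minor = (int(part) for part in current.split("."))
    if drift["removed"] or drift["retyped"]:
        return f"{major + 1}.0"
    if drift["added"]:
        return f"{major}.{minor + 1}"
    return current


rows = [{"order_id": "a1", "amount": 10.0}, {"order_id": "a2", "amount": None}]
print(completeness(rows, ["order_id", "amount"]))  # 0.5

now = datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc)
last = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(is_fresh(last, timedelta(hours=24), now))  # True

published = {"order_id": "string", "amount": "decimal"}
observed = {"order_id": "string", "amount": "float", "channel": "string"}
drift = schema_drift(published, observed)
print(drift)                       # {'added': ['channel'], 'removed': [], 'retyped': ['amount']}
print(next_version("1.2", drift))  # 2.0
```

Treating a type change as a new major version is one way to keep existing consumers on the old contract until they are ready to move.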

Summary

The Data Mesh pattern introduces two concepts: Distributed Domain-Driven Ownership and Data Products. Domain-driven ownership requires deep organizational change, which is not simple to adopt. The Data Product concept is easier to adopt and is very useful in making an organization data-driven. Migrating from Datasets to Data Products is a good first step toward the Data Mesh paradigm.

Author

Avadhoot Agasti

Chief Principal Consultant
