What is Data Mesh and why do I care? – Part II.
In the previous part of our series on Data Mesh, we introduced the concept and principles of Data Mesh. In this part of the series, we will look at the technology enablers of introducing the Data Mesh idea to your organization and typical objections to Data Mesh.
Certain technology capabilities must exist to deploy Data Mesh effectively. The following are mandatory for any data-mesh-oriented initiative:
Data Product Catalog
A centralized repository where users can publish their data and Data Products with the details other users need (dataset description, SLAs, etc.), technical details (schema, access ports, sample data sets, etc.), and the business meaning/usability of the data. It can optionally also enforce additional publishing QA and approval workflows. The Catalog can be as simple as a set of Confluence templates or as sophisticated as a dedicated data governance tool.
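A catalog entry can be as lightweight as a structured record. The sketch below is a minimal illustration in Python; the field names are assumptions for illustration, not the schema of any specific catalog tool:

```python
# Minimal sketch of a Data Product catalog entry.
# Field names are illustrative; a real catalog tool defines its own schema.
from dataclasses import dataclass, field

@dataclass
class DataProductEntry:
    name: str
    description: str          # dataset description for consumers
    owner: str                # Data Product Owner / owning team
    sla_freshness_hours: int  # SLA: how stale the data may get
    schema: dict              # column name -> type
    access_port: str          # how consumers connect (JDBC URL, REST endpoint, ...)
    tags: list = field(default_factory=list)

entry = DataProductEntry(
    name="retail-campaign-conversions",
    description="Campaign clicks joined with branch visits",
    owner="retail-lending-team",
    sla_freshness_hours=24,
    schema={"customer_id": "string", "campaign_id": "string", "visited_branch": "bool"},
    access_port="jdbc:postgresql://datahub.example.com/products",
)
```

Even a Confluence-template catalog captures essentially these same fields; a governance tool adds the QA and approval workflows on top.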
Data Pipelines
Typically an ELT product that:
- Enables users to create, test, debug, and monitor end-to-end flows.
- Provides a rich library of generic (JDBC, REST, file…) and application-specific (Salesforce, SAP, Google Analytics…) connectors.
- Lets users publish and share flows within the organization.
- Data Pipelines should also be integrated with the Data Product Catalog to automate product publishing and to monitor product status, refresh rates, etc.
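As a rough illustration of what such a flow does under the hood, here is a minimal extract-transform-load sketch in plain Python. Real ELT tools provide this visually; the in-memory source and SQLite target below are stand-ins for actual connectors and Data Product storage:

```python
import sqlite3

# Stand-in for an application connector (e.g., a REST or JDBC source).
def extract():
    return [
        {"customer_id": "c1", "amount": "120.50"},
        {"customer_id": "c2", "amount": "80.00"},
    ]

# Cleaning/enrichment step: cast types, drop malformed rows.
def transform(rows):
    out = []
    for r in rows:
        try:
            out.append((r["customer_id"], float(r["amount"])))
        except (KeyError, ValueError):
            continue  # a real flow would route rejects to a quarantine table
    return out

# Load into the Data Product's storage (SQLite stands in for e.g. PostgreSQL).
def load(rows, conn):
    conn.execute("CREATE TABLE IF NOT EXISTS payments (customer_id TEXT, amount REAL)")
    conn.executemany("INSERT INTO payments VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
total = conn.execute("SELECT SUM(amount) FROM payments").fetchone()[0]
```

The value of an ELT product is that business users assemble, test, and monitor such flows without writing this code, and the result can be published to the Data Product Catalog automatically.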
Storage
Data Mesh also needs different delivery and execution patterns based on the underlying technology and the location of the data sets that need to be combined. These patterns range from physical dataset copies to data virtualization, views, and many more. Most of them require an additional structured or semi-structured storage location.
Typical – but not exclusive – consumers of such storage are composite Data Products where the dataset from source systems or other Data Products undergo a transformation, enrichment, or data cleaning, and results need to be stored somewhere.
Storage is also needed when source systems cannot be accessed directly (for security reasons, or because data is provided as a batch file or stream that can't be queried directly) or should not be (mainly for performance reasons). In such cases, a data replica needs to be established and kept up to date as specified in the Data Product.
There can be one or multiple Storage variants provided to users and the Data Hub – e.g., PostgreSQL, Snowflake, Azure Synapse, MongoDB, etc. Critical for Data Mesh is that users can provision new Data Products within this storage themselves, and very quickly – while all applicable governance & security rules are applied automatically.
Reporting Engine
Data provided by Data Products can be consumed by other services – e.g., machine learning or data science platforms – but the most common use case is visualization in a set of reports and dashboards. To enable this, a flexible Reporting Engine must empower users to define and share data visualizations and dashboards. Such an engine should also let users add new data sources as needed.
Security Platform
The security platform should act as a centralized place to define, enforce, and verify/audit compliance rules, data access privileges, and the mapping of users onto roles/groups/claims. It must be integrated with the organization's identity provider (Active Directory, OAuth / JWT issuer, etc.), source systems, and the data hub to ensure end-to-end compliance.
Ideally, it should also be capable of automatically detecting (based on defined patterns) sensitive information (e.g., SSN, card number, IP address, etc.) and reporting/masking them accordingly.
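Pattern-based detection of this kind typically boils down to a library of rules evaluated against data values. A simplified sketch, with illustrative regular expressions only (production rules usually add checksums for card numbers, context keywords, and so on):

```python
import re

# Illustrative detection patterns; real platforms ship far richer rule sets.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "card_number": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def detect_sensitive(text):
    """Return the names of all patterns that match the given text."""
    return {name for name, rx in PATTERNS.items() if rx.search(text)}

found = detect_sensitive("Contact 123-45-6789 from 10.0.0.1")
```

Once a value is flagged, the platform can report it or hand it to the masking engine described next.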
The solution should support RBAC and/or ABAC security models and be able to provide clear reports on what kind of access each user/role/group has. The masking engine should also provide multiple options for transforming sensitive data – obfuscation, tokenization, randomization, pseudo-randomization, encryption, etc.
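Three of those masking options can be sketched as follows. This is a minimal illustration, assuming the tokenization key would come from a secrets manager rather than being hardcoded as here:

```python
import hashlib
import hmac
import random

SECRET = b"replace-with-managed-key"  # assumption: supplied by a secrets manager

def obfuscate(value, keep_last=4):
    """Obfuscation: hide all but the last few characters."""
    return "*" * (len(value) - keep_last) + value[-keep_last:]

def tokenize(value):
    """Tokenization via keyed hash: deterministic, so joins across Data
    Products still work, but the original value is not recoverable
    without the key."""
    return hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()[:16]

def randomize(value, seed=None):
    """Randomization: replace each digit with a random one, keeping the
    overall format (separators, length) intact."""
    rng = random.Random(seed)
    return "".join(str(rng.randint(0, 9)) if c.isdigit() else c for c in value)

ssn = "123-45-6789"
masked = obfuscate(ssn)   # '*******6789'
token = tokenize(ssn)
```

Which transform to choose depends on the downstream use: tokenization preserves joinability, obfuscation preserves human recognizability of the tail, and randomization preserves only the format.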
From the user’s perspective, the security platform should be fully transparent and not interfere with their work – unless they need to be assigned additional roles for new Data Products.
Data Mesh sounds great but…
…we already have DWH / ODS
A DWH or ODS is a great dataset source for Data Products and typically continues to be used and evolved, as replacing the many existing integrations and reports doesn’t make financial sense without an additional reason.
Nevertheless, a DWH / ODS is not a substitute for Data Mesh, as it is a highly centralized system with carefully curated data. Instead, Data Mesh builds Data Products “on top of and alongside” it.
That said, in some setups where the DWH is built on expensive technologies, it may make sense to gradually replace it to save on license costs.
…we already have Data Lake
Data Lake is an evolutionary step from the DWH concept in that it empowers users to quickly access datasets and process them as needed. This brings higher agility but still leaves these concerns:
- Adding new data sources into a data lake is typically a tedious process managed by a centralized team.
- The data lake concept doesn’t address publishing, sharing, and reusing produced datasets.
- Due to semi-structured or unstructured datasets, defining and enforcing a unified security governance model is practically impossible.
- The concept doesn’t scale for heterogeneous organizations with large datasets or data-volume-intensive analytics (e.g., machine learning).
Data Mesh is designed to address all these shortcomings.
…we already have a Data Catalog
Establishing a data catalog (glossary, data lineage, etc.) is a great accelerator for Data Product definitions, and existing tools can be reused to also act as the Data Product Catalog. However, deploying Data Mesh requires additional capabilities beyond a catalog alone.
What do real-world customers say?
Use case in a large universal bank
“We have created end-to-end data journeys, joining many data sources, with my team of 3 and with no IT support” – head of the retail lending department.
One of the largest universal banks in Central Europe embraced the Data Mesh concept together with the rollout of a state-of-the-art ELT tool. This enabled business users to create ad-hoc Data Products and reports on their own.
In a typical case, the Data Product Owner and her team connected data from Google Analytics, social media campaigns, and BigQuery with extracts from the Oracle DWH, and productized them together with Hadoop data lake data and XLS files. All of this was done directly by business users in the selected tool, with zero support from IT.
Reports were configured in Tableau and connected to the dataset storage. The business could then answer questions such as “how many people from a social network campaign come to our physical branches, how do they behave in our online banking, and what effect does promotion X have on their upsell to product Y”.
Data Mesh, supported by a modern data stack, enabled them to make business decisions and evaluate the impact of marketing campaigns and new product rollouts with unprecedented speed and data richness.
- “Finally, we have a great analysis of our retail outlet network. We know how our clients behave at these outlets, how the outlets operate in real time, and where we can optimize” – Retail banking division, major European bank, after the first year of Data Mesh deployment
- “HR data is very sensitive. We wanted to analyze trends and individual people, and how we can support them in their growth. Due to the high sensitivity of the data, we needed an environment that is fully functional, runs without any IT support, and that we can audit to make sure no data – or access to it – leaks outside the HR department. Data Mesh provides us with just that. We were able to create advanced people analytics within half a year, entirely within the boundaries of our tribe. This already creates huge benefits in how we can support our workers in their daily lives.” – Major European institution
- “Automating KPI reporting across diverse systems and branches was previously not possible for us. After one year, we have it across all of our companies, and it runs reliably day after day without the need for intervention” – Large bank
- “Everybody talks about Customer360, but pulling data from all internal and external sources and applications, preparing it in a format that automation tools can reliably work with, and enriching it with models developed in DBX was previously not possible for us. With ELT-tool-powered Data Products, it was in production within the first year and now serves as the basis for automation in Salesforce, marketing automation tools, and our internal applications” – Large retailer
- “We have used the Data Mesh concept for our self-service platform, where we prepare and productize our Risk Scoring product that is used automatically by all other departments in processes like Customer 360, lending applications, or sales predictions. The combination of AWS Lambda functions and the ELT tool’s productization allowed us to put this into production within a couple of weeks, and now over 30 other processes in other tribes reap the benefits” – Top global lending institution
- “Understanding the transactions people make with our bank, and their context, has been a tremendous boost to our effort to manage our customer experience more proactively and to predict revenues and risks” – Retail bank
Grow2FIT BigData Consultant
Miloš has more than ten years of experience designing and implementing BigData solutions in both cloud and on-premise environments. He focuses on distributed systems, data processing, and data science using the Hadoop tech stack and in the cloud (AWS, Azure). Together with the team, Miloš delivered many batch and streaming data processing applications.
He is experienced in providing solutions for both enterprise clients and start-ups. He follows principles of transparent architecture, cost-effectiveness, and sustainability within each client’s environment, aligned with the enterprise strategy and related business architecture.
The entire Grow2FIT consulting team: Our team