Are data mesh and data fabric the latest and greatest initiatives, or just new buzzwords aimed at selling solutions? It’s hard to say, but these emerging corporate initiatives share a common goal: dealing with disparate data. You can often get more value from your data if you can use disparate data in your analytics without copying it excessively and repeatedly. Data mesh and data fabric take different approaches to solving the disparate data problem.
Both data mesh and fabric focus on metadata and a semantic layer to leverage multiple data sources for analytics. However, the major difference seems to be about context.
In layman’s terms, data mesh is about the ability to offer various data sources to an analytical engine. Data mesh counts on the fact that you know the structure of your source data files and that the context of the data is solid. Using data mesh assumes you know who created the data, as well as when, where, why, and how. Data mesh might be the strategy you use, for example, if you want to analyze data from several data warehouses in your company. It’s a use case where the original metadata is fairly well-defined.
Data fabric focuses on orchestration, metadata management, and adding additional context to the data. In a data fabric, managing the semantic layer is the focus: use it to represent critical corporate data and to develop a common dialect for your data. A semantic layer in a data fabric project might map complex data into familiar business terms such as product, customer, or revenue to offer a unified, consolidated view of data across the organization. Pharmaceutical trials are a good example of where you might use data fabric, since the data from a trial comes from a combination of machines, reports, and other studies where the data has little accurate metadata to rely on. This data may be ‘sparse’ as well, meaning that a significant number of rows and columns are blank or null.
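To make that mapping concrete, here is a minimal sketch of a semantic layer in Python. The source tables, keys, and business terms are all hypothetical; real data fabric tools manage far richer mappings, but the principle of translating source fields into familiar business terms is the same.

```python
# Minimal sketch of a semantic layer: business terms mapped to the raw
# source fields behind them. All table and column names are hypothetical.
SEMANTIC_LAYER = {
    "customer": {"source": "crm.contacts_v2",      "field": "cust_uuid"},
    "product":  {"source": "erp.item_master",      "field": "sku"},
    "revenue":  {"source": "billing.invoice_lines", "field": "net_amt_usd"},
}

def resolve(business_term: str) -> dict:
    """Translate a business term into its underlying source location."""
    try:
        return SEMANTIC_LAYER[business_term]
    except KeyError:
        raise ValueError(f"No semantic mapping defined for {business_term!r}")

print(resolve("revenue"))
# -> {'source': 'billing.invoice_lines', 'field': 'net_amt_usd'}
```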
There are really no data-mesh-in-a-box or data-fabric-in-a-box solutions. As of this writing, there is no one-stop shop for data mesh or data fabric. In other words, they aren’t software products; they are strategic initiatives that typically require multiple solutions.
Today, companies might use several technologies to create a data mesh or a data fabric. Here are a few examples:
Traditional databases
Modern databases can leverage external tables in data mesh style. Vertica, for example, allows you to use PARQUET files and other file types seamlessly without loading them into the main repository. In addition, if you have semi-structured data in AVRO, JSON, or TEXT, there is an easy way to leverage schema-on-read features to use the data. This functionality is valuable for creating a data mesh if you have disparate sources and want to leverage them as you would data in a database.
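As a rough sketch, here is how querying Parquet in place might look through Vertica’s Python client, vertica_python. The connection details, bucket path, and table schema are all hypothetical.

```python
# Hedged sketch: defining a Vertica external table over Parquet files in
# a lake, then querying it like a native table. Host, credentials, and
# the s3 path are hypothetical.
import vertica_python

conn_info = {
    "host": "vertica.example.com",  # hypothetical host
    "port": 5433,
    "user": "dbadmin",
    "password": "secret",
    "database": "analytics",
}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    # Define the external table once; the data stays in the lake.
    cur.execute("""
        CREATE EXTERNAL TABLE web_clicks (
            click_time TIMESTAMP,
            user_id    INT,
            url        VARCHAR(2048)
        ) AS COPY FROM 's3://example-lake/clicks/*.parquet' PARQUET
    """)
    # Query it exactly like any other table.
    cur.execute("SELECT COUNT(*) FROM web_clicks")
    print(cur.fetchone()[0])
```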
Query engines
A whole generation of query engines (sometimes called query accelerators) makes data mesh possible, too. Solutions like Dremio, Starburst, and Druid primarily focus on analyzing external tables. They sometimes lack ACID compliance and the ability to do analytics with high concurrency, but they are often helpful in the data mesh mission. More and more traditional databases have added query engines to allow seamless querying across both a database and a data lake.
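For instance, here is a hedged sketch of a federated query through Trino (the open source engine behind Starburst) using its Python client. The coordinator host, catalog names, and tables are assumptions for illustration.

```python
# Hedged sketch: one SQL statement joining a data-lake table with an
# operational database table through Trino's catalog abstraction.
# Host, catalogs, and table names are hypothetical.
import trino

conn = trino.dbapi.connect(
    host="trino.example.com",  # hypothetical coordinator
    port=8080,
    user="analyst",
)
cur = conn.cursor()
cur.execute("""
    SELECT c.region, SUM(e.amount) AS revenue
    FROM hive.lake.order_events e          -- Parquet files in the lake
    JOIN postgresql.public.customers c     -- relational database
      ON e.customer_id = c.id
    GROUP BY c.region
""")
for row in cur.fetchall():
    print(row)
```

The point of the sketch is the join itself: neither source had to be copied into the other before the analysis could run.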
Visualization tools
Some advanced visualization tools have a semantic layer system. MicroStrategy, for example, offers a layer of abstraction that provides a consistent way of interpreting data from multiple sources and maps complex data into familiar business terms. This capability is a simplified data fabric in itself, and it can also leverage your database’s external-table capabilities; combined, the two can be mighty powerful.
Graph databases
Graph databases are good at orchestration and context, and they are the engines behind many data fabric solutions. Implementing a data fabric with a graph database is a significant project, but when it’s complete, you will have a true data fabric.
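As an illustration, here is a small sketch using the official Neo4j Python driver to record data fabric context: which dataset represents which business concept, and which system it came from. The URI, credentials, and graph model are hypothetical.

```python
# Hedged sketch: capturing context and lineage in a graph database.
# Connection details and the node/relationship model are hypothetical.
from neo4j import GraphDatabase

driver = GraphDatabase.driver(
    "bolt://graph.example.com:7687",
    auth=("neo4j", "secret"),  # hypothetical credentials
)

with driver.session() as session:
    # Link a raw dataset to the business concept it represents,
    # recording the source system as relationship context.
    session.run(
        """
        MERGE (d:Dataset {name: $dataset})
        MERGE (c:Concept {name: $concept})
        MERGE (d)-[:REPRESENTS {source_system: $source}]->(c)
        """,
        dataset="trial_results_2023.csv",
        concept="Clinical Outcome",
        source="lab-instrument-07",
    )

driver.close()
```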
Data virtualization
Data virtualization tools like those offered by AtScale and Denodo present a consistent view of data for BI and data science teams to consume. Modern databases also have data virtualization capabilities.
Data catalog
A data catalog is an organized inventory of data assets in the organization. Companies like Collibra provide data discovery and governance catalogs by collecting, organizing, accessing, and enriching metadata.
On-premises object store
It can be helpful to store all of your files in a central location. Object stores let you manage databases, data repositories, and data lakes in one place with superb performance, security, and disaster recovery. For that reason, object stores such as those from Pure, VAST, Dell ECS, and many others can help with data mesh.
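A brief sketch of how that central access typically works: these platforms generally expose an S3-compatible API, so a standard client such as boto3 can simply be pointed at the on-premises endpoint. The endpoint, bucket, and credentials below are hypothetical.

```python
# Hedged sketch: addressing an on-premises, S3-compatible object store
# with boto3 via a custom endpoint. Endpoint, bucket, and keys are
# hypothetical placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://objects.example.internal:9000",  # on-prem endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# List the Parquet files that external tables or query engines will read.
resp = s3.list_objects_v2(Bucket="example-lake", Prefix="clicks/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```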
Data mesh is a way of accessing data that may be disparate, and it works particularly well when all the data sources have a known structure and well-defined metadata.
If data mesh has a weakness, it is context. If your analytics asks the question “according to whom?”, a data fabric can be more powerful for answering it. Data engineers often run into conflicting information when integrating sources. For example, a new system might report a customer’s age as 32, while legacy data reports the same customer as 30 years old. Data lineage is an added feature of data fabric that lets you decide which data sources to trust more when there are conflicts.
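A toy sketch of that idea in Python, with hypothetical sources and trust scores standing in for the lineage metadata a data fabric would supply:

```python
# Hedged sketch: resolving conflicting values by ranking sources on
# trust, the way lineage metadata makes possible. Sources, scores, and
# the age conflict are illustrative.
TRUST = {"new_crm": 0.9, "legacy_erp": 0.4}  # hypothetical trust ranking

records = [
    {"source": "new_crm",    "customer_id": 1001, "age": 32},
    {"source": "legacy_erp", "customer_id": 1001, "age": 30},
]

def resolve_conflicts(rows):
    """Keep the value from the most-trusted source for each customer."""
    best = {}
    for row in rows:
        cid = row["customer_id"]
        if cid not in best or TRUST[row["source"]] > TRUST[best[cid]["source"]]:
            best[cid] = row
    return best

print(resolve_conflicts(records))  # age 32 wins: new_crm is trusted more
```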
Data fabric solutions tend to combine more tools to solve your disparate data problem. The tools are more elegant, but usually more complex, than those used in a data mesh. They might include greater transformation capabilities, enhanced fine-grained security, and graphical interfaces for governance and lineage. However, if data fabric has a weakness, it is that you’ll probably have to spend significant effort creating and managing a semantic layer.
Vendors touting a data fabric strategy often promote the capabilities of a knowledge graph. A knowledge graph replaces the data mesh data integration strategy with a semantic representation of both structured and unstructured data, one that often better supports multiple schemas and changing dimensions.
More than ever, data is diversely located in databases and data lakes. Cloud databases vary greatly in how they access external data. Some solutions require data to be stored in specific formats in data warehouses and offer no support for data lakes; still others support data lakes but require multiple tools to do so. Look for a solution that can handle common formats (like ORC, PARQUET, AVRO, and JSON) and leverage those sources in daily analysis with grace and speed. Look for solutions that can reach into other databases in your organization (data virtualization) so that no data is difficult to access.
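As a quick illustration of what handling those formats looks like in practice, here is a sketch that reads each one into a common in-memory representation using pyarrow, with fastavro as a separate package for AVRO. The file paths are hypothetical.

```python
# Hedged sketch: reading ORC, PARQUET, AVRO, and JSON into one common
# in-memory form. Paths are hypothetical; pyarrow handles three of the
# formats, fastavro handles Avro.
import pyarrow.parquet as pq
import pyarrow.orc as orc
import pyarrow.json as pj
from fastavro import reader as avro_reader

parquet_table = pq.read_table("events.parquet")    # Parquet
orc_table     = orc.ORCFile("events.orc").read()   # ORC
json_table    = pj.read_json("events.json")        # newline-delimited JSON

with open("events.avro", "rb") as f:               # Avro
    avro_rows = list(avro_reader(f))

print(parquet_table.num_rows, orc_table.num_rows,
      json_table.num_rows, len(avro_rows))
```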