Since its founding 13 years ago as an online shoe seller, Zalando SE has grown to become one of the largest e-commerce fashion retailers in Europe. Given its roots on the web, it’s not surprising that about 15% of the firm’s 14,000 employees work in technology disciplines. Zalando “has been a data-driven company pretty much from the start,” said Max Schultze, a lead data engineer at Zalando, in a presentation at the Spark + AI Summit last year.
The company had used a central data warehouse for data analysis since its early days, but scalability eventually became a problem. Moving to the cloud was a partial solution, but the bigger issue was how to satisfy the growing demand for new uses of that data. Caught in the middle were the data engineers, who were responsible both for cleansing and transforming an ever-increasing amount of data and for satisfying demand for access.
“They were mostly firefighting issues that were introduced upstream by changes from the data generating teams,” said Arif Wider, a professor of software engineering at HTW Berlin, Germany, and a lead technology consultant with ThoughtWorks Inc. “They needed to solve issues where they were not the domain experts.”
Two years ago Zalando shifted to a strategy of distributing data across the company and handing over ownership to the business groups that created it. Data scientists and engineers were assigned to work with business leaders to whip the data into shape so that it could be easily shared. People on the technology side were now expected to understand the context of the data they worked with.
The result has been the best of both worlds, Schultze said. “Even though we have decentralized ownership, we still have a central governance layer that allows us to tie all these things together.”
The architecture Zalando chose was a “data mesh,” a concept that is arguably the hottest topic in the data analytics world right now, despite being so new that it doesn’t even have a Wikipedia entry.
The term was coined by Zhamak Dehghani, a principal consultant at ThoughtWorks, in a post on a blog maintained by ThoughtWorks’ Chief Scientist Martin Fowler two years ago. Around the same time, Gartner Inc. began talking about a similar but not identical concept called a data fabric.
Both notions proceed from the same presumption: The way organizations manage data is woefully out of step with the way they go to market. Enterprises spent the last 20 years decentralizing their organizations and investing more authority in the people closest to their products and customers.
But at the same time, the data people need to make decisions is held in a centralized data warehouse, a data management construct that dates back to the 1980s. That store is tended by a team of data scientists and engineers who ensure quality and usability but know little about how data is used by the business. They field requests and manage data pipelines without context.
“The notion of storing all data together within a centralized platform creates bottlenecks where everyone is largely dependent on everyone else,” said Gil Feig, co-founder of Merge API Inc., an application integration startup. “Data mesh addresses this head-on.”
The technology foundation for data mesh is steadily being put in place, but the bigger challenges are cultural and organizational, advocates say. Companies have a lot invested in their warehouses and the teams that maintain them. The organizational upheaval of tearing apart centralized teams and distributing data ownership throughout the organization is a massive task. However, the penalties for maintaining the status quo may be even greater.
The data mesh concept “comes from a place of empathy for the pains of CEOs, CIOs and chief data officers who have been going through decades of spending a lot of money on infrastructure and not seeing the results they want,” Dehghani said in an interview with SiliconANGLE.
Transfer of ownership
Simply stated, a data mesh invests ownership of data in the people who create it. They’re responsible for ensuring quality and relevance and for exposing the data to others in the organization who might want to use it. A consistent, organization-wide set of definitions and governance standards ensures consistency, and an overarching metadata layer lets others find what they need. “Data mesh is the concept of domain-aligned data products,” Dehghani said in a video introduction. “Find the analytical data each part of the organization can share.”
Dehghani lists eight attributes of a data mesh. Elements must be discoverable, understandable, addressable, secure, interoperable, trustworthy, natively accessible and have value on their own.
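In practice, many of those attributes boil down to metadata that a domain team publishes alongside its data. A minimal sketch in Python of how a data product entry and a shared discovery registry might look — the field names, registry and example values here are illustrative assumptions, not any standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    # Hypothetical metadata record for a domain-owned data product.
    name: str          # addressable: a unique, stable identifier
    owner_team: str    # owned by the producing domain, not a central team
    description: str   # understandable to consumers outside the domain
    schema: dict       # interoperable: shared, documented field definitions
    endpoint: str      # natively accessible: where consumers read the data
    tags: list = field(default_factory=list)  # discoverable via search

# A lightweight central registry gives discoverability without central
# ownership: domain teams publish entries, anyone can search them.
registry: dict = {}

def publish(product: DataProduct) -> None:
    registry[product.name] = product

def discover(tag: str) -> list:
    return [p for p in registry.values() if tag in p.tags]

publish(DataProduct(
    name="sales.orders.v1",
    owner_team="checkout",
    description="Completed orders, refreshed hourly",
    schema={"order_id": "string", "total_eur": "decimal"},
    endpoint="s3://example-bucket/orders/",  # hypothetical location
    tags=["sales", "orders"],
))

assert discover("sales")[0].owner_team == "checkout"
```

The governance layer Schultze describes would sit on top of a registry like this, enforcing naming, security and schema standards centrally while leaving the data itself in the domains.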
The concept of decentralized data management is nothing new. Distributed databases rode the coattails of the client/server craze in the 1990s. Part of the appeal of the Hadoop software library of a decade ago was that processing was distributed to where data lived. More recently, data virtualization has gained traction with its concept of a logical data layer that integrates data siloes.
However, distributed databases were ahead of their time and Hadoop was torpedoed by complexity. Virtualization struggles with running queries across diverse data sources. The mesh approach, however, may be in the right place at the right time.
“The tools are eminently more capable than they have been in the past, specifically for using a combination of data domain discovery and use case analysis,” said Mark Beyer, Gartner distinguished research vice president.
Open-source query engines for processing data in place have proliferated in recent years. One is the Presto distributed SQL query engine, created at Facebook Inc., along with its fork Trino; both are highly regarded for their performance. There’s also the high-speed Apache Spark analytics engine that is sold commercially by Databricks Inc., as well as the open-source projects Apache Drill, Apache Impala and Apache Flink.
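The common thread in these engines is federated querying: one SQL statement spans data that stays where it lives. As a toy, standard-library-only analogy — this is not Presto or Trino, just an illustration of the idea — SQLite can attach two separate database files and join across them in a single query:

```python
import os
import sqlite3
import tempfile

# Two "domains" keep their own data stores (here, separate SQLite files).
tmp = tempfile.mkdtemp()
orders_db = os.path.join(tmp, "orders.db")
customers_db = os.path.join(tmp, "customers.db")

with sqlite3.connect(orders_db) as con:
    con.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, total REAL)")
    con.execute("INSERT INTO orders VALUES (1, 10, 99.5), (2, 11, 20.0)")

with sqlite3.connect(customers_db) as con:
    con.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
    con.execute("INSERT INTO customers VALUES (10, 'Ada'), (11, 'Grace')")

# A single query spans both stores without first copying everything into
# one warehouse -- loosely the pattern Presto/Trino apply at scale to
# S3, Hive, Kafka and other sources.
con = sqlite3.connect(orders_db)
con.execute(f"ATTACH DATABASE '{customers_db}' AS cust")
rows = con.execute(
    "SELECT c.name, o.total FROM orders o "
    "JOIN cust.customers c ON c.id = o.customer_id "
    "ORDER BY o.total DESC"
).fetchall()
con.close()
print(rows)  # [('Ada', 99.5), ('Grace', 20.0)]
```

The real engines add distributed execution and connectors for dozens of storage systems, but the consumer-facing contract is the same: query the data where it is.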
Another piece of the puzzle, data catalogs, has grown in sophistication, enabling organizations to create a master record of all their data, tagged for easy access. Data integration software has also improved, streamlining the messy task of cleaning up data into a consistent format.
An interesting new dynamic is Delta Sharing, an open protocol developed by Databricks and released to open source in May that provides for secure data-sharing across different data management platforms. Delta Sharing can be used to distribute not only SQL queries but machine learning models and entire data sets. “It’s what vendors have had for decades but this is open,” said Ali Ghodsi, CEO of Databricks. Delta Sharing has earned endorsements from numerous business intelligence and analytics vendors but so far no sellers of database management systems.
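Under the Delta Sharing protocol, a data recipient receives a small profile file containing an endpoint and a bearer token, which client libraries then use to read shared tables. A sketch of what such a profile looks like — the endpoint and token are placeholders, not real credentials:

```json
{
  "shareCredentialsVersion": 1,
  "endpoint": "https://sharing.example.com/delta-sharing/",
  "bearerToken": "<recipient-token>"
}
```

Because the contract is an open REST protocol rather than a vendor driver, any conforming client can consume the shared data, which is what distinguishes it from the proprietary sharing features Ghodsi alludes to.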
Another driver is the broad popularity of Amazon Web Services Inc.’s S3 object storage, which has enticed organizations to move large amounts of their data to a single cloud repository or data lake, making it easier for query engines to work against it. “Having a data lake in place is essential to having a performant architecture,” said Dipti Borkar, chief product officer at Ahana Cloud Inc., which sells a Presto managed service. “If most of the data is in S3, which is increasingly the case, a service mesh can sit on top and pull the data from wherever it lives.”
The growing popularity of microservices architectures, in which applications are composed of many loosely coupled and ephemeral services, has also created a foundation for distributed data access. “In a service mesh you have microservices across many different applications and business units that connect parts of the system together,” Borkar said. “That is now being applied to data.”
But cultural impediments and organizational inertia are likely to make the road to data meshes or fabrics a long one. “The data mesh paradigm is really much more of a mindset shift than of a technological shift. You have to go from central ownership to…