How to avoid getting stuck in a data swamp

In the run-up to the much anticipated Cloud Architect Alliance dinner and award show on January 30th, our speakers give a sneak peek into what they will be talking about that night. Data Solution Architect René Bremer and Senior Cloud Solution Architect Sarath Sasidharan, both currently at Microsoft, share their take on architecting secure data lakes that add business value.

The challenge of scattered data

Many enterprises are considering setting up an enterprise data lake. They want (and often need) to get more business value out of the data that is available within the organization. But even though the potential value of data is recognized, actually using data in this way can be a challenge.

According to René Bremer and Sarath Sasidharan, enterprises need to centralize data before they can use it to create business value. “Large enterprises face a problem when data needs to be accessible to different teams, departments and offices,” Bremer says. “The last thing you want is teams exchanging data bilaterally, as this data will be ‘marginalized’ and most likely will not be accessible to the rest of the organization. It is difficult to control data that does not have a single source used by the whole organization. Centralizing data is therefore the first step towards creating a data lake that adds business value.”

Controlling accessible and useful data in a data lake

Centralizing data is a first step, but it certainly should not be the only one. For data in a data lake to become useful, it should be clear who owns it and who is allowed to access it. A pub-sub service should be in place to enable asynchronous use of data. “Both metadata and a clearly defined pub-sub paradigm are key to creating a functional data lake,” Sasidharan explains. “Any data that is put into a data lake needs to be stored according to a proper methodology, with proper metadata and schema validation, for pub-sub to work as intended.”
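To make that concrete, here is a minimal sketch of schema validation at publish time: a record is only accepted into the lake if it matches the schema registered for its topic. The topic name, schema fields and publish helper are illustrative assumptions, not any specific product’s API.

```python
# Minimal sketch: validate a record against its topic's registered schema
# before publishing, so invalid data never enters the lake.
# Requires the jsonschema package; all names here are hypothetical.
import json
from jsonschema import validate, ValidationError

# Hypothetical schema registered for a "sales-orders" topic.
ORDER_SCHEMA = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "amount": {"type": "number"},
        "currency": {"type": "string"},
    },
    "required": ["order_id", "amount", "currency"],
}

def publish(topic: str, record: dict) -> None:
    """Accept a record only if it conforms to the topic's schema."""
    try:
        validate(instance=record, schema=ORDER_SCHEMA)
    except ValidationError as err:
        # Reject at the door: the lake never sees malformed data.
        raise ValueError(f"Record rejected for topic '{topic}': {err.message}")
    print(f"Published to {topic}: {json.dumps(record)}")

publish("sales-orders", {"order_id": "A-1001", "amount": 99.5, "currency": "EUR"})
```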

Adding metadata may seem a fairly straightforward affair, but when terabytes or petabytes of data are involved, a strict policy for assigning metadata fields and labels makes the difference between a functional data lake and a dysfunctional data swamp, Bremer says. “You need to think long and hard about the way metadata is added to data. Business metadata, technical metadata and operational metadata need to be compulsory additions to any data that is put in a data lake. Without them, it will be difficult to determine the source and value of the data. And even worse, it will be difficult to have tools use the data at all. Ideally, metadata can be interpreted by a variety of tools, for example Databricks or SQL.”
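As an illustration, the sketch below bundles the three categories Bremer mentions into a single record that every dataset carries on ingestion. The field names are assumptions made for the example; the point is that all three categories are compulsory.

```python
# Sketch of the three compulsory metadata categories: business, technical
# and operational. All field names are illustrative assumptions.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class DatasetMetadata:
    # Business metadata: what the data means and who owns it.
    owner: str
    description: str
    # Technical metadata: how the data is stored and structured.
    file_format: str        # e.g. "parquet"
    schema_version: str
    # Operational metadata: where the data came from and when.
    source_system: str
    ingested_at: datetime

meta = DatasetMetadata(
    owner="sales-analytics",
    description="Daily order extracts from the ERP system",
    file_format="parquet",
    schema_version="1.2",
    source_system="erp-export",
    ingested_at=datetime.now(timezone.utc),
)
```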

And that’s not all, according to Sasidharan. “You will also need to think about the data lake from the consumer’s perspective. You need to adhere to certain standards for APIs, streaming and other ways in which data in a data lake can be published. It should be easy for a data consumer to access the desired data, while at the same time controls should be in place to ensure that only authorized users can access certain data for a predefined period of time. A single pane of glass to manage data consumption is indispensable to a secure and efficient enterprise data lake.”
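A minimal sketch of what such time-bound access could look like, assuming a simple in-memory grant store: each consumer holds a grant with an expiry for a specific dataset, and every read is checked against it. In practice this role would be played by the platform’s access management layer; the grant store, user names and dataset names here are hypothetical.

```python
# Sketch of time-bound access control on the consumption side.
from datetime import datetime, timedelta, timezone

# Hypothetical grant store: (user, dataset) -> expiry timestamp.
GRANTS = {
    ("alice", "sales-orders"): datetime.now(timezone.utc) + timedelta(days=7),
}

def can_read(user: str, dataset: str) -> bool:
    """Allow a read only while the user's grant for this dataset is valid."""
    expiry = GRANTS.get((user, dataset))
    return expiry is not None and datetime.now(timezone.utc) < expiry

print(can_read("alice", "sales-orders"))  # True: grant valid for 7 days
print(can_read("bob", "sales-orders"))    # False: no grant exists
```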

Learnings from real customer cases

To eliminate data silos between application and business departments, Microsoft has created an open-source data model, the Common Data Model, in collaboration with SAP and Adobe. During the upcoming Cloud Architect Alliance event on January 30th, Bremer and Sasidharan will discuss two customer cases and will elaborate on pub-sub flows, metadata, the Common Data Model and general principles to consider when implementing a data lake. To get guests started, they will even share some generic patterns for metadata and pub-sub. Make sure you don’t miss out and claim your free ticket right now! Don’t wait too long, there are only a few spots left.