The do’s and don’ts when architecting a cloud-native data lake
The cloud community is still on the fence about data lakes. On the one hand, one could argue that ‘data lake’ is a fancy term for something that has existed since humans started using computers. On the other hand, with the advent of cloud computing, it is finally possible to store, distribute and process (or mine) unstructured data in a safe, governable way that can actually create business value. So when should a cloud-native data lake be considered, and how can you make it work?
On the fence about data lakes
From a business perspective, cloud-native data lakes can be tricky. In the recent past, large enterprises have attempted to connect and mine (old) datasets, expecting to gain new insights and ultimately create value. Too often, such projects are guided by a misconception of what ‘Big Data’ is or can do for an organization. To this day, the internet is littered with promises about AI, machine learning and the secrets these technologies can unlock in datasets. And while there are many examples of successful cloud-native data lake deployments that add value, there are also many that do not. As was pointed out during the CAA Cloud-native data lake event, senior leadership on occasion becomes enamored with the (real or imagined) possibilities of data and requires IT to make it happen. Without a sound case behind it, such a project risks wasting time, resources and goodwill.
The challenge for the Cloud Architect
The Cloud Architect thus faces an interesting challenge. He or she needs to be able to make or unmake the case for a cloud-native data lake, depending on the feasibility of such a project. If there are sufficient grounds for architecting a cloud-native data lake, he or she of course needs to know how to do it. Numerous variables determine whether a cloud-native data lake adds value for the organization while avoiding the operational and governance, risk and compliance (GRC) pitfalls that accompany it.
How then should the business case for a cloud-native data lake take shape? We could probably write a book about this (and maybe will someday), but these are the five questions a Cloud Architect needs to consider from a business and operational perspective when planning a cloud-native data lake:
- Is there a business-oriented problem that can be solved with acquirable data?
- Is there sufficient scale in the data sets? Is there enough data to justify the required resources?
- Does the eventual solution touch the core business?
- Are the right tools available? Do business owners and the team understand that there needs to be a code base instead of a web portal or GUI?
- Is there a team that knows what to do and can autonomously engineer the data lake, with limited oversight?
Getting these things right will make the difference between a misguided approach and getting the desired results. Those results in turn are highly dependent on the way the data lake is architected.
How a cloud-native data lake can add business value
A useful perspective on architecting a data lake is how the data in the lake should be made available to its consumers. Because the data lake holds unstructured data, consumers will have to know which data is which, and whether they have access to it. An example shows how this would work in practice.
Consider a data lake that holds data sets on mortgages and relevant variables, spanning about two decades. As one can imagine, an institution that is in the business of providing mortgages wants to know how many mortgage holders have historically defaulted and what the average returns on these mortgages are. It will also want to know how mortgage defaults relate to historical interest rates and to the historical health of the overall economy in a given country. The data involved originates from different data sets, and some of it is probably privacy-sensitive. The data can also be sensitive in a competitive sense, which justifies tight control over who has access.
Given the age of the data sets, it is highly probable that compatibility issues will arise when pushing this data into the data lake. These types of issues can take a lot of time to mitigate, unless a unified data model is adopted that assigns metadata in a consistent way and makes the data available through a publisher-subscriber model. This will also help to control access, and it enables teams to discover data without extracting it from the data lake. It is advisable to create an automated workflow that offers a potential subscriber a sample from the data lake, giving them the opportunity to decide whether the data in the lake is useful to them. This workflow will need to automatically grant or deny access based on predefined rules, such as clearances for specific types of data. In this way, the use of this data can be aligned with organizational goals, creating real business value.
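To make the idea more concrete, here is a minimal sketch of such a workflow in Python. It is an illustration under stated assumptions, not a reference implementation: the class names (DatasetEntry, Subscriber, Catalog), the metadata fields, the classification labels and the sample row are all hypothetical, and a real data lake would delegate storage and access enforcement to the cloud platform's own services.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DatasetEntry:
    """Catalog record following one shared metadata model (hypothetical fields)."""
    name: str
    owner: str
    classification: str       # assumed labels, e.g. "public" or "confidential"
    sample_rows: tuple = ()   # small preview offered to potential subscribers


@dataclass
class Subscriber:
    name: str
    clearances: frozenset     # classifications this subscriber may see


@dataclass
class Catalog:
    entries: dict = field(default_factory=dict)
    subscriptions: dict = field(default_factory=dict)  # dataset name -> subscriber names

    def publish(self, entry):
        """Publisher side: register a data set and its metadata."""
        self.entries[entry.name] = entry
        self.subscriptions.setdefault(entry.name, set())

    def discover(self):
        """Expose metadata only, so nothing is extracted from the lake."""
        return [(e.name, e.owner, e.classification) for e in self.entries.values()]

    def _is_cleared(self, subscriber, dataset_name):
        # Predefined rule: the data set's classification must be in the clearances.
        return self.entries[dataset_name].classification in subscriber.clearances

    def request_sample(self, subscriber, dataset_name):
        """Return a sample if the subscriber is cleared, otherwise None."""
        if not self._is_cleared(subscriber, dataset_name):
            return None
        return list(self.entries[dataset_name].sample_rows)

    def subscribe(self, subscriber, dataset_name):
        """Grant or deny a subscription automatically; returns True if granted."""
        if not self._is_cleared(subscriber, dataset_name):
            return False
        self.subscriptions[dataset_name].add(subscriber.name)
        return True


# Illustrative placeholder values only; not real mortgage data.
catalog = Catalog()
catalog.publish(DatasetEntry(
    name="mortgage_defaults",
    owner="risk-team",
    classification="confidential",
    sample_rows=(("example-year", "example-country", "example-default-rate"),),
))

analyst = Subscriber("analyst", frozenset({"public", "confidential"}))
intern = Subscriber("intern", frozenset({"public"}))

print(catalog.discover())
print(catalog.request_sample(analyst, "mortgage_defaults"))
print(catalog.request_sample(intern, "mortgage_defaults"))
```

In this sketch the analyst receives the sample and can subscribe, while the intern is denied on both counts without any manual review. Keeping the access rule in one place (`_is_cleared`) means the sample preview and the subscription decision cannot drift apart.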
Make a data lake work for you
All in all, a cloud-native data lake requires a lot of thought, from its conception to its actual use. And even though there is some (contextually justified) skepticism about data lakes, a thoroughly automated data lake can accelerate innovation, provided it rests on the right business case and the right architecture.