We live in a world of many different data architectures; no two organisations are likely to share the same pattern, though all attempt to solve similar problems. When it comes to data warehousing there are numerous architectural choices. Some opt for a traditional Enterprise Data Warehouse, which combines source system data into structured, governed data marts for querying by BI tools. Others build large Data Lakes and databases that cheaply store unstructured, semi-structured and structured data, often feeding data science and analytics tools. Each architecture is designed to solve a particular problem.
Traditional Data Warehouses are great for BI reporting and offer governance and rigour around the consumption of data. The Data Lake is great for analysing vast quantities of varied data that can be quickly ingested to support decision-making. However, this can come at the cost of data governance, traded away in favour of speed and flexibility of analysis.
For many years, it seemed like there was no middle ground or best of both worlds. However, this is about to change with the introduction of the Data Lakehouse. In this blog we will look at the key features of this architecture and discuss some of the challenges it seeks to resolve.
The Data Lakehouse
As Data Warehouses and Data Lakes have evolved, so have the architectures and the business desire to extract value and insight from data. Many organisations have opted for hybrid models that combine the controlled, rigorous reporting of a Data Warehouse with the flexibility and insight of a Data Lake. However, operating two different architectures forces compromises around transferring data between the platforms. This is where the Data Lakehouse comes into play. This architecture seeks to place the Data Warehouse and the Data Lake on the same platform, removing the problem of maintaining multiple data stores (a warehouse and a lake) for different business needs. Instead of processing data within the lake and then building ETL routines to load it into a warehouse, a single data store is created that can service all use cases. It is this shared storage platform that forms one of the main differences between a hybrid architecture and the Data Lakehouse.
There is, of course, still a need for security, governance, data lineage and data management, as these remain vital in any enterprise system. All of these can be applied within the Data Lakehouse architecture, and it has the potential to simplify them, as the tools that provide these critical capabilities are applied to one data store rather than several.
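To make the "single data store" idea concrete, here is a minimal PySpark sketch, assuming a shared table already sits in open-format Parquet at an illustrative path (the path and column names are invented for the example). The same files serve both a BI-style SQL query and a data-science DataFrame, with no ETL hop between a lake and a separate warehouse.

```python
# A minimal sketch of one store serving multiple use cases; the table path
# and columns are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shared-store-sketch").getOrCreate()

# Both consumers read the same files; nothing is copied into a second system.
sales = spark.read.parquet("/data/lakehouse/sales")  # hypothetical path
sales.createOrReplaceTempView("sales")

# BI-style consumption: a governed, SQL-shaped view of the data.
monthly_revenue = spark.sql("""
    SELECT date_trunc('month', order_date) AS month,
           SUM(amount)                     AS revenue
    FROM sales
    GROUP BY date_trunc('month', order_date)
    ORDER BY month
""")

# Data-science consumption: the same storage, read as a DataFrame for features.
features = sales.selectExpr(
    "customer_id",
    "amount",
    "datediff(current_date(), order_date) AS days_since_order",
)
```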
We have discussed the main characteristics; however, there are others which a Data Lakehouse should have (a short sketch of one of them, schema enforcement and evolution, follows the list). These are:
- The capability for multiple business users and processes to concurrently read or write data whilst maintaining data integrity.
- The ability to support schema enforcement and evolution, robust governance, auditing, and data management.
- A single data store that allows multiple data tools such as BI, Machine Learning and Data Science to directly access the source data.
- Separation of Storage and Compute, allowing systems to scale with more users and greater data volumes.
- Standardised and open storage formats, to enable access to a wide range of tools and engines.
- The ability to store varied data types such as structured, semi-structured and unstructured data, so it can be analysed by different tools across the organisation.
- End-to-end streaming of data, to eliminate the need for separate systems to serve real-time data.
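As promised above, here is a hedged sketch of schema enforcement and evolution, using PySpark with Delta Lake, one common open storage layer for Lakehouse tables. The table path and column names are illustrative, and the `delta-spark` package is assumed to be installed.

```python
# A hedged sketch of schema enforcement and evolution on an open table
# format (Delta Lake). Path and columns are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("schema-demo")
    # Delta Lake session extensions; assumes the delta-spark package is installed.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Create the table with an initial, enforced schema.
orders = spark.createDataFrame(
    [(1, "2024-01-01", 9.99)], ["order_id", "order_date", "amount"]
)
orders.write.format("delta").mode("overwrite").save("/data/lakehouse/orders")

# A frame with an extra column: appending it as-is would be rejected
# (schema enforcement), so evolution has to be requested explicitly.
new_orders = spark.createDataFrame(
    [(2, "2024-01-02", 4.50, "GBP")],
    ["order_id", "order_date", "amount", "currency"],
)
(new_orders.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")   # evolve the schema to add "currency"
    .save("/data/lakehouse/orders"))
```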
Technology
Technology providers are starting to develop offerings that allow organisations to introduce a Data Lakehouse architecture. Two of the pioneers here are Amazon and Microsoft. We reviewed their offerings to see how they stand up against the characteristics of a Data Lakehouse.
Amazon Redshift and S3
Amazon has announced that it is planning a series of updates to Redshift and S3 to allow for a lakehouse query engine. Redshift Spectrum allows standard SQL queries to be submitted against data stored in both S3 and Redshift. Amazon has also released design patterns for a lakehouse architecture using Redshift, utilising the wider AWS ecosystem to deliver on the characteristics listed above. This allows for a flexible tooling approach.
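As a hedged illustration of this pattern (a sketch, not Amazon's reference design), the example below submits standard SQL from Python that spans both stores: an external schema exposes open-format files in S3 via Spectrum, and a single query joins them to a native Redshift table. The cluster endpoint, IAM role, Glue catalog database, and all table and column names are placeholders.

```python
# A hedged sketch of the Redshift Spectrum pattern: one SQL dialect over
# warehouse tables and S3 files alike. All identifiers are placeholders.
import redshift_connector  # AWS's Python driver for Redshift

conn = redshift_connector.connect(
    host="example-cluster.abc123.eu-west-1.redshift.amazonaws.com",
    database="analytics",
    user="analyst",
    password="...",
)
conn.autocommit = True  # external-schema DDL is simplest outside a transaction
cur = conn.cursor()

# Expose an S3-backed Glue catalog database as an external schema (run once).
cur.execute("""
    CREATE EXTERNAL SCHEMA IF NOT EXISTS lake
    FROM DATA CATALOG DATABASE 'sales_lake'
    IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole';
""")

# One query now spans both stores: "lake.events" reads open-format files in
# S3 via Spectrum, while "dim_customer" is a regular Redshift table.
cur.execute("""
    SELECT c.region, COUNT(*) AS events
    FROM lake.events e
    JOIN dim_customer c ON c.customer_id = e.customer_id
    GROUP BY c.region;
""")
print(cur.fetchall())
```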
Microsoft Azure Synapse
Microsoft have launched Azure Synapse, which they describe as Azure SQL Data Warehouse evolved: it brings together Enterprise Data Warehousing and Big Data analytics under one service. They have done this by enabling T-SQL to run on both the embedded Apache Spark engine and the data warehouse, paired with built-in management and security. The platform supports SQL, Python, R and Machine Learning, so business users can use their preferred language to interrogate data. Since it's a Microsoft product, integration with Power BI is also available. This enables them to offer a service that provides the characteristics of a Data Lakehouse.
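As a hedged example of that T-SQL-over-the-lake capability, the sketch below uses Synapse serverless SQL's OPENROWSET to query Parquet files sitting in an Azure Data Lake account directly. The workspace endpoint, credentials, storage URL and column names are placeholders, and pyodbc with the Microsoft ODBC driver is assumed for connectivity.

```python
# A hedged sketch of querying lake files with T-SQL via Azure Synapse
# serverless SQL. Endpoint, credentials, storage URL and columns are
# illustrative; assumes pyodbc and the Microsoft ODBC driver are installed.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=example-workspace-ondemand.sql.azuresynapse.net;"
    "DATABASE=master;"
    "UID=analyst;PWD=...;Encrypt=yes;"
)

# OPENROWSET reads Parquet files in the data lake directly, with no load
# step; the same T-SQL dialect also runs against dedicated warehouse tables.
sql = """
    SELECT result.region, COUNT(*) AS orders
    FROM OPENROWSET(
        BULK 'https://exampleaccount.dfs.core.windows.net/sales/orders/*.parquet',
        FORMAT = 'PARQUET'
    ) AS result
    GROUP BY result.region;
"""

for region, orders in conn.execute(sql):
    print(region, orders)
```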