We live in a world where there are many different models of Data Architecture. No two organisations are likely to share the same pattern, although they all attempt to solve similar problems.

When it comes to Data Warehousing there’s a broad range of architectural choices. Some businesses opt for traditional Enterprise Data Warehouses, combining source system data to create structured data marts, with governance, for querying by BI tools.

Other organisations build large Data Lakes and databases to enable them to economically store unstructured, semi-structured and structured data, often feeding data science and analytics tools.

Each architecture is designed to solve a particular problem. Traditional Data Warehouses are great for BI reporting, offering governance and rigour around how the data is consumed. The Data Lake is great for analysing vast quantities of varied data, which can be quickly ingested and analysed to support decisions. However, this can come at the cost of Data Governance, traded away for speed and flexibility of analysis.

For many years, it seemed like there was no middle ground or best-of-both-worlds. However, this is changing with the introduction of the Data Lakehouse. In this blog, we review the key features of this architecture and discuss some of the challenges the Data Lakehouse seeks to resolve.

The Data Lakehouse
As Data Warehouses and Data Lakes have evolved, so has the architecture and business requirement to extract value and insight from the data. Many organisations have opted for hybrid models that contain both the controlled and rigorous reporting of a Data Warehouse, plus the flexibility and insight that you can achieve with a Data Lake.

However, operating two different architectures forces compromises, not least around transferring data between the two platforms. This is where the Data Lakehouse comes into play: it integrates the Data Warehouse and the Data Lake on the same platform.

The advantage of using a Data Lakehouse is that it removes the problem of having multiple data stores (a separate warehouse and a lake) for different business needs. This means that instead of processing data within the lake and then building ETL routines to load the data to a warehouse, a single data store is created that can service all use cases. It’s this shared storage platform that forms one of the main differences between a hybrid architecture and the Data Lakehouse.
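
The idea of a single store serving every consumer directly, with no ETL copy into a second system, can be sketched in a few lines. The following is purely illustrative (the dataset, column names and the two consumers are hypothetical, and a plain in-memory CSV stands in for a lakehouse table):

```python
import csv
import io
import statistics
from collections import defaultdict

# One shared store -- here just an in-memory CSV standing in for a
# lakehouse table. Hypothetical sales records, written exactly once.
shared_store = io.StringIO()
writer = csv.writer(shared_store)
writer.writerow(["region", "amount"])
for record in [("north", 120.0), ("north", 80.0), ("south", 200.0)]:
    writer.writerow(record)

def read_rows():
    """Every consumer reads the same store directly."""
    shared_store.seek(0)
    return list(csv.DictReader(shared_store))

# Consumer 1: a BI-style aggregate over the shared store.
def revenue_by_region():
    totals = defaultdict(float)
    for row in read_rows():
        totals[row["region"]] += float(row["amount"])
    return dict(totals)

# Consumer 2: a data-science-style summary over the *same* store --
# no ETL routine copies the data into a separate warehouse first.
def amount_stats():
    amounts = [float(row["amount"]) for row in read_rows()]
    return {"mean": statistics.mean(amounts), "max": max(amounts)}

print(revenue_by_region())  # {'north': 200.0, 'south': 200.0}
print(amount_stats())       # mean ≈ 133.33, max 200.0
```

Both consumers query the source data in place; in a hybrid architecture, the second would typically read from a copy maintained by an ETL routine.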

There is, of course, the need to carefully consider security, governance, data lineage, and data management, all of which remain vital in any enterprise system. These can be applied within the Data Lakehouse architecture, which has the potential to simplify them, as the tools providing these critical controls are applied to one data store rather than several.

Additional characteristics of a typical Data Lakehouse include:

  • The capability for multiple business users and processes to concurrently read or write data whilst maintaining data integrity.
  • The ability to support schema enforcement and evolution, robust governance, auditing, and data management.
  • A single data store that allows multiple data tools such as BI, Machine Learning and Data Science to directly access the source data.
  • Separation of Storage and Compute, allowing systems to scale with more users and greater data volumes.
  • Standardised and open storage formats, enabling access to a wide range of tools and engines.
  • The ability to store varied data types such as structured, semi-structured and unstructured data, so the data can be analysed by different tools across the organisation.
  • End-to-end streaming of data, eliminating the need for separate systems to serve real-time data.
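
The schema enforcement and evolution characteristic above can be illustrated with a small, purely conceptual sketch. Real lakehouse table formats (Delta Lake, for example) implement this at the storage layer far more robustly; everything below, including the `Table` class and its column names, is a simplified assumption for illustration:

```python
class Table:
    """Toy table with a declared schema -- a conceptual stand-in for
    a lakehouse table format, not a real storage implementation."""

    def __init__(self, schema):
        self.schema = dict(schema)  # column name -> expected Python type
        self.rows = []

    def write(self, row):
        """Schema enforcement: reject writes that do not match the schema."""
        if set(row) != set(self.schema):
            raise ValueError(f"columns {set(row)} != schema {set(self.schema)}")
        for column, value in row.items():
            if not isinstance(value, self.schema[column]):
                raise TypeError(f"{column} expects {self.schema[column].__name__}")
        self.rows.append(row)

    def evolve(self, column, col_type, default=None):
        """Schema evolution: add a column and back-fill existing rows."""
        self.schema[column] = col_type
        for row in self.rows:
            row.setdefault(column, default)

events = Table({"user": str, "amount": float})
events.write({"user": "alice", "amount": 9.5})       # conforms: accepted
try:
    events.write({"user": "bob", "amount": "oops"})  # wrong type: rejected
except TypeError as err:
    print("rejected:", err)

events.evolve("channel", str, default="unknown")     # old rows back-filled
events.write({"user": "carol", "amount": 3.0, "channel": "web"})
```

The point of the sketch is the contract: writes are validated against a declared schema, and the schema can evolve without breaking data already written.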

Technology
Technology providers are starting to develop solutions that will allow organisations to introduce a Data Lakehouse architecture. Two of the pioneers in this field are Amazon and Microsoft. The Dufrain team reviewed their offerings to see how they compare against the characteristics of a typical Data Lakehouse.

Amazon Redshift and S3
Amazon has announced a series of updates to Redshift and S3 that together provide a Lakehouse query engine. Redshift Spectrum allows standard SQL queries to be submitted against data stored in S3 as well as in Redshift itself. Amazon has also published design patterns for a Lakehouse architecture built around Redshift and complementary AWS services, delivering on the characteristics listed above. This allows for a flexible tooling approach.

Microsoft Azure Synapse
Microsoft have launched Azure Synapse, which they describe as the evolution of Azure SQL Data Warehouse, bringing together Enterprise Data Warehouse and Big Data analytics under one service. They have achieved this by enabling T-SQL to run against both the embedded Apache Spark engine and the Data Warehouse, paired with built-in management and security. The platform supports SQL, Python, R and Machine Learning, so business users can interrogate data in their preferred language. Since it’s a Microsoft product, integration with Power BI is also available. This enables Microsoft to offer a service that provides the characteristics of a Data Lakehouse.

Conclusion

The benefits of the Data Lakehouse architecture include:

  • Reduced cost, with one architectural platform keeping data stored in a single system.
  • Greater insight into data via a unified analytical platform that allows a multitude of different tools to access the data.
  • Governance and Data Management capabilities extended to unstructured and semi-structured data, within an easier-to-manage architecture.

However, there are currently few examples of implemented Data Lakehouses, and it has been reported that those which have been implemented favour certain elements of either the Lake or the Warehouse. In addition, organisations that have only just invested in new Data Warehouses and Data Lakes are unlikely to move to another architecture that could fail to provide them with greater, or even the same, levels of functionality.

Technology providers are still developing their Data Lakehouse solutions, with some features not yet fully available and promised in future releases. Although these solutions are not yet complete all-rounders, they aim to offer a single solution for all the data needs an organisation might have from its architecture, and the trend appears to be in that direction as the solutions mature.

A Data Lakehouse offers some great advantages to organisations with a varied data and analytical estate. But these developments are still in their early stages, and only time will tell whether they can deliver on all their promises.

Ultimately, this single-platform approach does offer many advantages over running a hybrid architecture. But as organisations have developed their data strategies over many years, with these technologies deeply embedded, it’s likely that hybrid architectures will continue for some time yet.