Data Warehouse
The data warehouse (DWH) architecture is relatively straightforward. Extract data from all of the enterprise’s operational systems, transform the source data into an analytical model, and load the resultant data into a data analysis–oriented database. This database is the data warehouse.
This data management architecture is based primarily on the extract-transform-load (ETL) scripts. The data can come from various sources: operational databases, streaming events, logs, and so on. In addition to translating the source data into a
facts/dimensions-based model, the transformation step may include additional oper‐ ations such as removing sensitive data, deduplicating records, reordering events, aggregating fine-grained events, and more. In some cases, the transformation may require temporary storage for the incoming data. This is known as the staging area.
The resultant data warehouse, shown in Figure 16-7, contains analytical data covering all of the enterprise’s business processes. The data is exposed using the SQL language (or one of its dialects) and is used by data analysts and BI engineers.
![]() |
Figure 16-7. A typical enterprise data warehouse architecture
The careful reader will notice that the data warehouse architecture shares some of the challenges discussed in Chapters 2 and 3.
First, at the heart of the data warehouse architecture is the goal of building an enterprise-wide model. The model should describe the data produced by all of the enterprise’s systems and address all of the different use cases for analytical data. The analytical model enables, for example, optimizing the business, reducing operational costs, making intelligent business decisions, reporting, and even training ML models. As you learned in Chapter 3, such an approach is impractical for anything by the smallest organizations. Designing a model for the task at hand, such as building reports or training ML models, is a much more effective and scalable approach.
The challenge of building an all-encompassing model can be partly addressed by the use of data marts. A data mart is a database that holds data relevant for well-defined analytical needs, such as analysis of a single business department. In the data mart model shown in Figure 16-8, one mart is populated directly by an ETL process from an operational system, while another mart extracts its data from the data warehouse.
Figure 16-8. The enterprise data warehouse architecture augmented with data marts
When the data is ingested into a data mart from the enterprise data warehouse, the enterprise-wide model still needs to be defined in the data warehouse. Alternatively, data marts can implement dedicated ETL processes to ingest data directly from the operational systems. In this case, the resultant model makes it challenging to query data across different marts—for example, across different departments—as it requires a cross-database query and significantly impacts performance.
Another challenging aspect of the data warehouse architecture is that the ETL pro‐ cesses create a strong coupling between the analytical (OLAP) and the operational (OLTP) systems. The data consumed by the ETL scripts is not necessarily exposed through the system’s public interfaces. Often, DWH systems simply fetch all the data residing in the operational systems’ databases. The schema used in the operational database is not a public interface, but rather an internal implementation detail. As a result, a slight change in the schema is destined to break the data warehouse’s ETL scripts. Since the operational and analytical systems are implemented and maintained by somewhat distant organizational units, the communication between the two is challenging and leads to lots of friction between the teams. This communication pat‐ tern is shown in Figure 16-9.
Figure 16-9. Data warehouse populated by fetching data directly from operational data‐ bases, ignoring the integration-oriented public interfaces
The data lake architecture addresses some of the shortcomings of the data warehouse architecture.