Components of Data Warehouse
Data Warehouse
As per Vincent Rainardi, a data warehouse is a system that retrieves and consolidates data periodically from the source systems into a dimensional or normalized data store. It usually keeps years of history and is queried for business intelligence or other analytical activities. It is typically updated in batches, not every time a transaction happens in the source system.
Components of a Typical Data Warehouse
A typical data warehouse diagram is given below:
Source Systems: The source systems are the OLTP systems that contain the data you want to load into the data warehouse. Online Transaction Processing (OLTP) is a system whose main purpose is to capture and store the business transactions. The source systems data is examined using a data profiler to understand the characteristics of the data. A data profiler is a tool that has the capability to analyze the data, such as finding out how many rows are in each table, how many rows contain NULL values and so on.
ETL: The extract, transform and load (ETL) system then brings data from various source systems into a staging area. ETL is a system that has the capability to connect to the source systems, read the data, transform the data and load it into a target system. The ETL system then integrates, transforms and loads the data into a dimensional data store (DDS).
DQ: When the ETL system loads the data into the DDS, the data quality rules do various data quality checks. Bad data is put into the data quality (DQ) database to be reported and then corrected in the source systems. Bad data can also be automatically corrected or tolerated if it is within a certain limit.
The ETL system is managed and orchestrated by the control system, based on the sequence, rules and logic stored in the metadata. The metadata is a database containing information about the data structure, the data meaning, the data usage, the data quality rules and other information about the data.
DDS: A DDS is a database that stores the data warehouse data in a different format than OLTP. The reason for getting the data from the source system into the the DDS and then querying the DDS instead of querying the source system directly.
In a DDS, the data is arranged in a dimensional format that is more suitable for analysis. The second reason is because a DDS contain integrated data from several source systems.
MDBs: Users use various front-end tools such as spreadsheets, pivot tables, reporting tools and SQL query tools to retrieve and analyze the data in a DDS. Some applications operate on a multi-dimensional database format. For these applications, the data in the DDS is loaded into multi-dimensional databases (MDBs), which are also known as cubes.
A multi-dimensional database is a form of database where the data is stored in cells and the position of each cell is defined by a number of variables called dimensions. Each cell represents a business events and the values of the dimensions show when and where this event happened.
Data Mining: Data mining is the process of exploring data to find the patterns and relationships that describe the data and to predict the unknown or future values of the data. The key value of data mining is the ability to understand why some things happened in the past and the ability to predict what will happen in the future.
To refer to predicting the future with regard to data mining, some people use the term forecasting, and some call it predictive analysis. On the other hand, when data mining is used to explain the current or past situation, it is called descriptive modeling, descriptive analytics or knowledge discovery.
Data Lakes: We are now moving into the era of data lakes. Every industry has a potential data lake use case. A data lake can be a way to gain more visibility or put an end to data silos. Many companies see data lakes as an opportunity to capture a 360-degree view of their customers or to analyze social media trends.