Data Warehouse and LakeHouse
tags:
- data-engineering
- software-engineering
- cloud
- data-warehouses
- lakehouses
- datalake
Introduction to Data Modeling Techniques for Data Warehousing and Lakehouses
- Understanding how Data Normalization and De-Normalization work
- Understanding how to create an ERD and a Relational Data Model
- Understanding how to model and create a Dimensional Data Model
- Understanding the concepts of data warehousing and lakehouses
- Knowledge about lakehouse features: metadata management, backfilling, schema evolution
- Differences between traditional data warehouses and modern lakehouses
- Advantages and disadvantages of each approach
- Real-world use cases for data warehousing and lakehouses
Set up a local environment with a data warehousing solution (e.g., Amazon Redshift, Google BigQuery, Snowflake)
Use Apache Iceberg and MinIO to set up a lakehouse
Explore the provided interface and understand basic operations such as creating tables, loading data, and running queries
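A minimal sketch of those basic operations, assuming Snowflake-style syntax and a pre-created stage named @yellow_tripdata_stage (the stage and file layout are assumptions; the table matches the one queried later in this section):
-- Create a table for the sample trip data.
CREATE TABLE nytaxi_internal_yellow_tripdata (
    "VendorID"     INT,
    "PULocationID" INT,
    "DOLocationID" INT,
    "total_amount" DECIMAL(10, 2)
);

-- Load data from the pre-created stage (hypothetical stage name).
COPY INTO nytaxi_internal_yellow_tripdata FROM @yellow_tripdata_stage;

-- Run a basic query to verify the load.
SELECT COUNT(*) FROM nytaxi_internal_yellow_tripdata;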
Schema layers in a database and data warehouse:
Designing the complete schema layout of a data warehouse (see the sketch after the list below).
- ELT Schema
  - Raw, Clean, Trans, Pub
  - Landing, Staging, Warehousing
  - Bronze, Silver, Gold
- Audit Schema
  - Logging
  - Security
- Security (optional)
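A minimal sketch of how these schema layers could be created, using the Raw/Clean/Trans/Pub naming from the list above (swap in Landing/Staging/Warehousing or Bronze/Silver/Gold as preferred):
-- ELT layers
CREATE SCHEMA IF NOT EXISTS raw;       -- landing zone: source data as-is
CREATE SCHEMA IF NOT EXISTS clean;     -- validated, de-duplicated data
CREATE SCHEMA IF NOT EXISTS trans;     -- business transformations
CREATE SCHEMA IF NOT EXISTS pub;       -- published, analytics-ready models
-- Audit and security layers
CREATE SCHEMA IF NOT EXISTS audit;     -- logging metadata
CREATE SCHEMA IF NOT EXISTS security;  -- security and masking metadata (optional)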
Get familiar with the documentation and resources for the chosen data warehousing solution
Data Modeling for Data Warehousing and Lakehouses
Snowflake
I prefer to always take the Quick Start 101, as it is helpful for all the foundations and triggers the thinking process; get started with this Practice Snowflake 101.
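-- Average "total_amount" per ("PULocationID", "DOLocationID") pair, returned alongside each of the first 100 rows: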
SELECT
TOP 100 "VendorID",
"DOLocationID",
"PULocationID",
"total_amount",
AVG("total_amount") OVER (PARTITION BY "PULocationID", "DOLocationID") AS avg_amount_location
FROM
NYTAXI_INTERNAL_YELLOW_TRIPDATA;
Basic usage of modeling: introduction to dimensional modeling and the star schema:
- Why dimensional models?
  - Understanding the importance of data modeling in data warehousing and lakehouses
  - How data changes from transactions to analyses (aka OLTP -> OLAP)
- Why a Data Lake? And why Iceberg?
  - Expressive SQL that flexibly supports reading, writing, updating, and deleting data, including large datasets.
  - Full schema evolution, meaning columns can be renamed or reordered easily, with no rewriting of the data in the table.
  - Automatically maintains efficient partitioning of tables based on queries and data.
  - Time travel and rollback act like an undo for data changes; data versioning lets you revert issues quickly.
  - Data compaction tracks data files in manifests to avoid unnecessary small files, which are less efficient and more costly. It is like repartitioning data at the metadata level by running rewriteDataFiles (see the sketch below).
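A hedged Spark SQL sketch of these Iceberg features, assuming a catalog named lake and an existing table lake.nyc.yellow_tripdata (both names, and the snapshot id, are assumptions):
-- Schema evolution: rename a column; only metadata changes, no data files are rewritten.
ALTER TABLE lake.nyc.yellow_tripdata RENAME COLUMN total_amount TO trip_total_amount;

-- Time travel: query an earlier snapshot of the table.
SELECT COUNT(*) FROM lake.nyc.yellow_tripdata VERSION AS OF 1234567890123456789;

-- Rollback: revert the table to a previous snapshot.
CALL lake.system.rollback_to_snapshot('nyc.yellow_tripdata', 1234567890123456789);

-- Compaction: merge small data files (the SQL counterpart of rewriteDataFiles).
CALL lake.system.rewrite_data_files(table => 'nyc.yellow_tripdata');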
Different types of data models (e.g., snowflake schema, fact constellation)
Best practices for designing efficient data models
Building a data lake with Iceberg, MinIO, and Spark compute
Design a simple dimensional model for a sample dataset (see the star-schema sketch below)
Implement the designed model in the chosen data warehousing solution
Load sample data into the tables and perform basic queries to validate the model
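A minimal star-schema sketch for a taxi-trip style dataset (all table and column names here are assumptions):
-- Dimension: pickup/drop-off locations.
CREATE TABLE dim_location (
    location_id INT PRIMARY KEY,
    borough     VARCHAR(50),
    zone        VARCHAR(100)
);

-- Fact: one row per trip, referencing the location dimension twice.
CREATE TABLE fact_trip (
    trip_id        BIGINT PRIMARY KEY,
    pu_location_id INT REFERENCES dim_location (location_id),
    do_location_id INT REFERENCES dim_location (location_id),
    trip_date      DATE,
    total_amount   DECIMAL(10, 2)
);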
ETL Processes and Data Integration
Exploring Extract, Transform, Load (ETL) processes and their significance in data warehousing and lakehouses
Different approaches to ETL: batch processing vs. real-time processing
Tools and technologies for data integration and ETL (e.g., Apache Airflow, Talend, Informatica)
Set up ETL workflows using a chosen tool or framework
Extract data from different sources, transform it according to predefined rules, and load it into the data warehouse/lakehouse (see the sketch below)
Monitor and troubleshoot the ETL process, handling errors and data quality issues
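A hedged sketch of one batch transform-and-load step between the Raw and Clean layers (the table names and the cleansing rule are assumptions):
INSERT INTO clean.yellow_tripdata
SELECT
    "VendorID",
    "PULocationID",
    "DOLocationID",
    "total_amount"
FROM raw.yellow_tripdata
WHERE "total_amount" >= 0;  -- predefined rule: drop records with negative amounts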
Data Extracting and Loading
As already mentioned in this book, answer the following questions:
- How to extract the data from raw data? (regex patterns, etc.)
- How to extract the delta data from sources (delta data is the subset of data that changed)?
- Do we need a buffer for loading?
- etc…
Depending on the data source and data availability requirements, the data will be loaded and refreshed as needed.
When dealing with source data ingestion, the type of data load is determined by:
- Full Load: loading small datasets (e.g., dimensions, master data, etc.)
- Delta Load: transactional data (delta extraction using incremental columns)
When loading data into the target, try to use these engineering techniques (see the MERGE sketch after this list):
- Incremental: split the data into smaller chunks
- Snapshot: snapshot the data and merge it into the destination
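A hedged sketch of a delta load that extracts only changed rows via an incremental column and merges them into the destination (all table and column names are assumptions):
MERGE INTO pub.orders AS tgt
USING (
    -- Delta extraction: only rows modified since the last load.
    -- Assumes the destination is not empty; handle the first (full) load separately.
    SELECT *
    FROM raw.orders
    WHERE modified_datetime > (SELECT MAX(modified_datetime) FROM pub.orders)
) AS src
ON tgt.order_id = src.order_id
WHEN MATCHED THEN
    UPDATE SET status = src.status,
               modified_datetime = src.modified_datetime
WHEN NOT MATCHED THEN
    INSERT (order_id, status, modified_datetime)
    VALUES (src.order_id, src.status, src.modified_datetime);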
Plus: engineering columns during data loading and data transformation (see the sketch below):
- From Source: the source the data came from
- Incremental Columns: seq, auto_increment(col), modified column, created column
- Historical Tracking Columns: version, modified_datetime, created_datetime
- Descriptive Columns: dimension columns, natural key(col)
- Identifier Column: mostly PK and FK
- Audit Column: Record_ID for backfilling data and data mismatch checks
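A hedged sketch of how these engineering columns might look on a target table (all names are assumptions):
CREATE TABLE pub.dim_customer (
    customer_sk          BIGINT,        -- identifier column: surrogate PK
    customer_natural_key VARCHAR(64),   -- descriptive column: natural key from the source
    customer_name        VARCHAR(255),  -- descriptive column
    version              INT,           -- historical tracking column
    created_datetime     TIMESTAMP,     -- historical tracking / incremental column
    modified_datetime    TIMESTAMP,     -- historical tracking / incremental column
    record_id            VARCHAR(64)    -- audit column: used for backfilling and data mismatch checks
);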
Data Analytics and Reporting
Understanding the role of data analytics and reporting in driving insights and decision-making
Exploring different types of analytics: descriptive, diagnostic, predictive, prescriptive
Tools and techniques for data visualization and reporting (e.g., Tableau, Power BI, Looker)
Create sample reports and dashboards using the data stored in the data warehouse/lakehouse (see the example query below)
Use visualization tools to analyze trends, patterns, and anomalies in the data
Share insights from the reports with peers and discuss potential business implications
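A hedged example of a descriptive analytics query over the star-schema sketch above (names remain assumptions), the kind of aggregate a dashboard would be built on:
SELECT
    d.zone,
    COUNT(*)            AS trip_count,
    SUM(f.total_amount) AS total_revenue
FROM fact_trip AS f
JOIN dim_location AS d
    ON d.location_id = f.pu_location_id
GROUP BY d.zone
ORDER BY total_revenue DESC;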
ClickHouse and S3 Minio
A single ClickHouse instance configured with one MinIO S3 instance.
This recipe defines an S3-type disk pointing to the MinIO S3 instance, a storage policy that makes use of the S3 disk, and a MergeTree table backed by it.
Optionally, install clickhouse-client to connect to the warehouse, or use any kind of database IDE client.
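A hedged ClickHouse SQL sketch of the MergeTree part of the recipe, assuming the S3 (MinIO) disk and a storage policy named s3_main have already been defined in the server configuration:
CREATE TABLE trips_s3
(
    trip_id      UInt64,
    pu_location  UInt32,
    total_amount Decimal(10, 2)
)
ENGINE = MergeTree
ORDER BY trip_id
SETTINGS storage_policy = 's3_main';  -- routes the table's parts to the S3-backed disk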
Data Governance and Security
Importance of data governance and security in data warehousing and lakehouses
Key components of a data governance framework: policies, processes, roles, and responsibilities
Security best practices for protecting sensitive data and ensuring compliance with regulations (e.g., GDPR, HIPAA)
Implement basic data governance policies in the data warehouse/lakehouse environment (e.g., data classification, access control)
Try to set up Data Governance on top of the data using Open-Metadata
Set up encryption and other security measures to protect data at rest and in transit
Data Masking Security Policy
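A hedged Snowflake-style sketch of a column masking policy kept in the Security schema (the policy, role, table, and column names are all assumptions):
CREATE MASKING POLICY security.mask_email AS (val STRING) RETURNS STRING ->
    CASE
        WHEN CURRENT_ROLE() IN ('DATA_ADMIN') THEN val  -- privileged roles see the raw value
        ELSE '*** MASKED ***'                           -- everyone else sees a masked value
    END;

ALTER TABLE pub.customers MODIFY COLUMN email
    SET MASKING POLICY security.mask_email;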
Masking data using encryption techniques and jumping tables (this information is kept in the Security schema)
Conduct a security audit and assess the compliance of the environment with relevant regulations and standards
By following this bootcamp plan, participants will gain a solid understanding of data warehousing and lakehouses, along with hands-on experience in setting up environments, designing data models, implementing ETL processes, performing analytics, and ensuring data governance and security.