Data Warehouse and LakeHouse
tags:
- data-engineering
- software-engineering
- cloud
- data-warehouses
- lakehouses
- datalake
Introduction to Data Modeling Techniques for Data Warehousing and Lakehouses
- Understanding how Data Normalization and De-Normalization work
- Understanding how to create an ERD and a Relational Data Model
- Understanding how to model and create a Dimensional Data Model
- Understanding the concepts of data warehousing and lakehouses
- Knowledge about lakehouse features: metadata management, backfilling, schema evolution
- Differences between traditional data warehouses and modern lakehouses
- Advantages and disadvantages of each approach
- Real-world use cases for data warehousing and lakehouses
Set up a local environment with a data warehousing solution (e.g., Amazon Redshift, Google BigQuery, Snowflake)
Use Apache Iceberg and MinIO to set up a lakehouse
Explore the provided interface and understand basic operations such as creating tables, loading data, and running queries
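A minimal sketch of those basic operations, assuming Snowflake-style syntax and a pre-created stage named @yellow_tripdata_stage (the stage and file layout are assumptions; the table matches the one queried later in this section):
-- Create a table for the sample trip data.
CREATE TABLE nytaxi_internal_yellow_tripdata (
    "VendorID"     INT,
    "PULocationID" INT,
    "DOLocationID" INT,
    "total_amount" DECIMAL(10, 2)
);

-- Load data from the pre-created stage (hypothetical stage name).
COPY INTO nytaxi_internal_yellow_tripdata FROM @yellow_tripdata_stage;

-- Run a basic query to verify the load.
SELECT COUNT(*) FROM nytaxi_internal_yellow_tripdata;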
Schema layers in a database and data warehouse:
Designing the complete schema layout of a data warehouse (see the sketch after the list below).
- ELT Schema
  - Raw, Clean, Trans, Pub
  - Landing, Staging, Warehousing
  - Bronze, Silver, Gold
- Audit Schema
  - Logging
  - Security
- Security (optional)
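A minimal sketch of how these schema layers could be created, using the Raw/Clean/Trans/Pub naming from the list above (swap in Landing/Staging/Warehousing or Bronze/Silver/Gold as preferred):
-- ELT layers
CREATE SCHEMA IF NOT EXISTS raw;       -- landing zone: source data as-is
CREATE SCHEMA IF NOT EXISTS clean;     -- validated, de-duplicated data
CREATE SCHEMA IF NOT EXISTS trans;     -- business transformations
CREATE SCHEMA IF NOT EXISTS pub;       -- published, analytics-ready models
-- Audit and security layers
CREATE SCHEMA IF NOT EXISTS audit;     -- logging metadata
CREATE SCHEMA IF NOT EXISTS security;  -- security and masking metadata (optional)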
Get familiar with the documentation and resources for the chosen data warehousing solution
Data Modeling for Data Warehousing and Lakehouses
Snowflake
I prefer to always take the Quick Start 101, as it is helpful for all the foundations and triggers the thinking process; get started with this Practice Snowflake 101.
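-- Average "total_amount" per ("PULocationID", "DOLocationID") pair, returned alongside each of the first 100 rows: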
SELECT
TOP 100 "VendorID",
"DOLocationID",
"PULocationID",
"total_amount",
AVG("total_amount") OVER (PARTITION BY "PULocationID", "DOLocationID") AS avg_amount_location
FROM
NYTAXI_INTERNAL_YELLOW_TRIPDATA;
Basic usage of modeling: introduction to dimensional modeling and the star schema:
- Why dimensional models?
  - Understanding the importance of data modeling in data warehousing and lakehouses
  - How data changes from transactions to analyses (aka OLTP -> OLAP)
- Why a Data Lake? And why Iceberg?
  - Expressive SQL that flexibly supports reading, writing, updating, and deleting data, including large datasets.
  - Full schema evolution, meaning columns can be renamed or reordered easily, with no rewriting of the data in the table.
  - Automatically maintains efficient partitioning of tables based on queries and data.
  - Time travel and rollback act like an undo for data changes; data versioning lets you revert issues quickly.
  - Data compaction tracks data files in manifests to avoid unnecessary small files, which are less efficient and more costly. It is like repartitioning data at the metadata level by running rewriteDataFiles (see the sketch below).
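A hedged Spark SQL sketch of these Iceberg features, assuming a catalog named lake and an existing table lake.nyc.yellow_tripdata (both names, and the snapshot id, are assumptions):
-- Schema evolution: rename a column; only metadata changes, no data files are rewritten.
ALTER TABLE lake.nyc.yellow_tripdata RENAME COLUMN total_amount TO trip_total_amount;

-- Time travel: query an earlier snapshot of the table.
SELECT COUNT(*) FROM lake.nyc.yellow_tripdata VERSION AS OF 1234567890123456789;

-- Rollback: revert the table to a previous snapshot.
CALL lake.system.rollback_to_snapshot('nyc.yellow_tripdata', 1234567890123456789);

-- Compaction: merge small data files (the SQL counterpart of rewriteDataFiles).
CALL lake.system.rewrite_data_files(table => 'nyc.yellow_tripdata');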
Different types of data models (e.g., snowflake schema, fact constellation)
Best practices for designing efficient data models
Building a data lake with Iceberg, MinIO, and Spark compute
Design a simple dimensional model for a sample dataset (see the star-schema sketch below)
Implement the designed model in the chosen data warehousing solution
Load sample data into the tables and perform basic queries to validate the model
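A minimal star-schema sketch for a taxi-trip style dataset (all table and column names here are assumptions):
-- Dimension: pickup/drop-off locations.
CREATE TABLE dim_location (
    location_id INT PRIMARY KEY,
    borough     VARCHAR(50),
    zone        VARCHAR(100)
);

-- Fact: one row per trip, referencing the location dimension twice.
CREATE TABLE fact_trip (
    trip_id        BIGINT PRIMARY KEY,
    pu_location_id INT REFERENCES dim_location (location_id),
    do_location_id INT REFERENCES dim_location (location_id),
    trip_date      DATE,
    total_amount   DECIMAL(10, 2)
);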
ETL Processes and Data Integration
Exploring Extract, Transform, Load (ETL) processes and their significance in data warehousing and lakehouses
Different approaches to ETL: batch processing vs. real-time processing
Tools and technologies for data integration and ETL (e.g., Apache Airflow, Talend, Informatica)
Set up ETL workflows using a chosen tool or framework
Extract data from different sources, transform it according to predefined rules, and load it into the data warehouse/lakehouse (see the sketch below)
Monitor and troubleshoot the ETL process, handling errors and data quality issues
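A hedged sketch of one batch transform-and-load step between the Raw and Clean layers (the table names and the cleansing rule are assumptions):
INSERT INTO clean.yellow_tripdata
SELECT
    "VendorID",
    "PULocationID",
    "DOLocationID",
    "total_amount"
FROM raw.yellow_tripdata
WHERE "total_amount" >= 0;  -- predefined rule: drop records with negative amounts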
Data Extracting and Loading
As already mentioned in this book, answer the following questions:
- How to extract the data from raw data? (regex patterns, etc.)
- How to extract the delta data from sources (delta data is the subset of data that changed)?
- Do we need a buffer for loading?
- etc…
Depending on the data source and data availability requirements, the data will be loaded and refreshed as needed.
When dealing with source data ingestion, the type of data load is determined by:
- Full Load: loading small datasets (e.g., dimensions, master data, etc.)
- Delta Load: transactional data (delta extraction using incremental columns)
When loading data into the target, try to use these engineering techniques (see the MERGE sketch after this list):
- Incremental: split the data into smaller chunks
- Snapshot: snapshot the data and merge it into the destination
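A hedged sketch of a delta load that extracts only changed rows via an incremental column and merges them into the destination (all table and column names are assumptions):
MERGE INTO pub.orders AS tgt
USING (
    -- Delta extraction: only rows modified since the last load.
    -- Assumes the destination is not empty; handle the first (full) load separately.
    SELECT *
    FROM raw.orders
    WHERE modified_datetime > (SELECT MAX(modified_datetime) FROM pub.orders)
) AS src
ON tgt.order_id = src.order_id
WHEN MATCHED THEN
    UPDATE SET status = src.status,
               modified_datetime = src.modified_datetime
WHEN NOT MATCHED THEN
    INSERT (order_id, status, modified_datetime)
    VALUES (src.order_id, src.status, src.modified_datetime);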
Plus: engineering columns during data loading and data transformation (see the sketch below):
- From Source: the source the data came from
- Incremental Columns: seq, auto_increment(col), modified column, created column
- Historical Tracking Columns: version, modified_datetime, created_datetime
- Descriptive Columns: dimension columns, natural key(col)
- Identifier Column: mostly PK and FK
- Audit Column: Record_ID for backfilling data and data mismatch checks
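A hedged sketch of how these engineering columns might look on a target table (all names are assumptions):
CREATE TABLE pub.dim_customer (
    customer_sk          BIGINT,        -- identifier column: surrogate PK
    customer_natural_key VARCHAR(64),   -- descriptive column: natural key from the source
    customer_name        VARCHAR(255),  -- descriptive column
    version              INT,           -- historical tracking column
    created_datetime     TIMESTAMP,     -- historical tracking / incremental column
    modified_datetime    TIMESTAMP,     -- historical tracking / incremental column
    record_id            VARCHAR(64)    -- audit column: used for backfilling and data mismatch checks
);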
Data Analytics and Reporting
Understanding the role of data analytics and reporting in driving insights and decision-making
Exploring different types of analytics: descriptive, diagnostic, predictive, prescriptive
Tools and techniques for data visualization and reporting (e.g., Tableau, Power BI, Looker)
Create sample reports and dashboards using the data stored in the data warehouse/lakehouse (see the example query below)
Use visualization tools to analyze trends, patterns, and anomalies in the data
Share insights from the reports with peers and discuss potential business implications
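A hedged example of a descriptive analytics query over the star-schema sketch above (names remain assumptions), the kind of aggregate a dashboard would be built on:
SELECT
    d.zone,
    COUNT(*)            AS trip_count,
    SUM(f.total_amount) AS total_revenue
FROM fact_trip AS f
JOIN dim_location AS d
    ON d.location_id = f.pu_location_id
GROUP BY d.zone
ORDER BY total_revenue DESC;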
ClickHouse and S3 Minio
A single ClickHouse instance configured with one MinIO S3 instance.
This recipe defines an S3-type disk pointing to the MinIO S3 instance, a storage policy that makes use of the S3 disk, and a MergeTree table backed by it.
Optionally, install clickhouse-client to connect to the warehouse, or use any kind of database IDE client.
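A hedged ClickHouse SQL sketch of the MergeTree part of the recipe, assuming the S3 (MinIO) disk and a storage policy named s3_main have already been defined in the server configuration:
CREATE TABLE trips_s3
(
    trip_id      UInt64,
    pu_location  UInt32,
    total_amount Decimal(10, 2)
)
ENGINE = MergeTree
ORDER BY trip_id
SETTINGS storage_policy = 's3_main';  -- routes the table's parts to the S3-backed disk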
Data Governance and Security
Importance of data governance and security in data warehousing and lakehouses
Key components of a data governance framework: policies, processes, roles, and responsibilities
Security best practices for protecting sensitive data and ensuring compliance with regulations (e.g., GDPR, HIPAA)
Implement basic data governance policies in the data warehouse/lakehouse environment (e.g., data classification, access control)
Try to set up Data Governance on top of the data using Open-Metadata
Set up encryption and other security measures to protect data at rest and in transit
Data Masking Security Policy
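A hedged Snowflake-style sketch of a column masking policy kept in the Security schema (the policy, role, table, and column names are all assumptions):
CREATE MASKING POLICY security.mask_email AS (val STRING) RETURNS STRING ->
    CASE
        WHEN CURRENT_ROLE() IN ('DATA_ADMIN') THEN val  -- privileged roles see the raw value
        ELSE '*** MASKED ***'                           -- everyone else sees a masked value
    END;

ALTER TABLE pub.customers MODIFY COLUMN email
    SET MASKING POLICY security.mask_email;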
Masking data using encryption techniques and jumping tables (this information is kept in the Security schema)
Conduct a security audit and assess the compliance of the environment with relevant regulations and standards
By following this bootcamp plan, participants will gain a solid understanding of data warehousing and lakehouses, along with hands-on experience in setting up environments, designing data models, implementing ETL processes, performing analytics, and ensuring data governance and security.