Schema

Schema is a logical description of the entire database. It includes the name and description of records of all record types including all associated data-items and aggregates. Much like a database, a data warehouse also requires to maintain a schema. A database uses relational model, while a data warehouse uses Star, Snowflake, and Fact Constellation schema. In this chapter, we will discuss the schemas used in a data warehouse.

Star schema

A star schema is a database organizational structure optimized for use in a data warehouse or business intelligence that uses a single large fact table and one or more

A smaller dimensional tables

It is called a star schema because the fact table sits at the center of the logical diagram, and the small dimensional tables branch off to form the points of the star. A fact table sits at the center of a star schema database, and each star schema database only has a single fact table. The fact table contains the specific quantifiable data to be analyzed, such as sales figures.

The fact table stores two types of information: numeric values and dimension attribute values. Using a sales database as an example:
  • Numeric value cells are unique to each row or data point and do not correlate or relate to data stored in other rows.
  • These might be facts about a transaction, such as an order ID, total amount, net profit, order quantity or exact time.
  • The dimension attribute values do not directly store data, but they store the foreign key value for a row in a related dimensional table. Many rows in the fact table will reference this type of information. So, for example, it might store the sales employee ID, a date value, a product ID or a branch office ID.

Dimension tables store supporting information to the fact table. Each star schema database has at least one dimension table. Each dimension table will relate to a column in the fact table with a dimension value, and will store additional information about that value.

  • The employee dimension table may use the employee ID as a key value and can contain information such as the employee’s name, gender, address or phone number.
  • A product dimension table may store information such as the product name, manufacture cost, color or first date on market.

Benefits of the Star Schema

  • It is extremely simple to understand and build.
  • No need for complex joins when querying data.
  • Accessing data is faster (because the engine doesn’t have to join various tables to generate results).
  • Simpler to derive business insights.
  • Works well with certain tools for analytics, in particular, with OLAP systems that can create OLAP cubes from data stored using star schema.

Disadvantages of the Star Schema

  • Denormalized data can cause integrity issues. This means some data can turn out to be inconsistent at times.
  • Maintenance may appear simple at the beginning, but the larger data warehouse you need to maintain, the harder it becomes (due to data redundancy).
  • It requires a lot more disk space than snowflake schema to store the same amount of data.
  • Many-to-many relationships are not supported.
  • Limited possibilities for complex queries development.

Snowflake schema

The snowflake schema is an extension of a star schema. The main difference is that in this architecture, each reference table can be linked to one or more reference tables as well. The aim is to normalize the data.

Star schema Snowflake schema

Benefits of the Snowflake Schema

  • Uses less disk space because data is normalized and there is minimal data redundancy.
  • Offers protection from data integrity issues.
  • Maintenance is simple due to a smaller risk of data integrity violations and low level of data redundancy.
  • It is possible to use complex queries that don’t work with a star schema. This means more space for powerful analytics.
  • Supports many-to-many relationships. Disadvantages of the Snowflake Schema
  • Harder to design compared to a star schema.
  • Maintenance can be more complex due to a large number of different tables in the data warehouse.
  • Queries can be very complex, including many levels of joins between many tables.
  • Queries can be slower in some cases because many joins should be done to produce final output.
  • More specific skills are needed for working with data stored using snowflake schema.

Schema Design

People have consufed about the name of schema and how to put the tables/views in to exact schema… Lets simplify those by functionality splition.

For the schema design of the data warehouse, we will use a star schema approach, which consists of one or more fact tables referencing multiple dimension tables. This design is efficient for querying and reporting purposes. Example schemas include:

ETL functional

  • Landing: External tables, ingested from sources
  • Staging: Cleansing data and re-formated data, data-transit
  • Transforming: Applying business, core data transfer
  • Housing: Fact and Dimension structure, agguration
  • Publishing: Sharing, optimizing, access control
  • Secret (Rarely existing): it is for customers/clerk data with HIGHEST restriction.

How to splited by functions?

TypeFunctional Works
LandingExternal Table, View, Mounting to storage
StagingWill be deleted soon, cleans data
TransformingDeleted soon, lookup, processing
HousingDimensional data, long term
PublishingDe-nomalized, one big data, sharing
SecretLeast privilege permission, HIIPA, PII

Audit functional

  • Archived/Backup: cold data for historical loads, backfilling
  • Failed: for re-processing, investigation
  • Validation: for pipeline monitoring, data validation
  • Logging: for recording all activities of pipelines

System and Metadata

  • Config: storing configuration tables, data of data
  • Info: information of data, pipeline, catalog for building lineage.

This is how end-to-end warehouse structure. Yes I know the life is not easy. Dig into it and you will see the beautiful picture.