Partitioning and Metadata management

Partition

Partitioning is done to enhance performance and facilitate easy management of data. Partitioning also helps in balancing the various requirements of the system. It optimizes the hardware performance and simplifies the management of data warehouse by partitioning each fact table into multiple separate partitions. In this chapter, we will discuss different partitioning strategies.

Why is it Necessary to Partition?

For Easy Management

  • The fact table in a data warehouse can grow up to hundreds of gigabytes in size. This huge size of fact table is very hard to manage as a single entity. Therefore it needs partitioning. To Assist Backup/Recovery
  • If we do not partition the fact table, then we have to load the complete fact table with all the data. Partitioning allows us to load only as much data as is required on a regular basis. It reduces the time to load and also enhances the performance of the system.

To Enhance Performance

  • By partitioning the fact table into sets of data, the query procedures can be enhanced.
  • Query performance is enhanced because now the query scans only those partitions that are relevant. It does not have to scan the whole data.

Partitioning Strategies

Partitioning by Time into Equal Segments

In this partitioning strategy, the fact table is partitioned on the basis of time period. Here each time period represents a significant retention period within the business. For example, if the user queries for month to date data then it is appropriate to partition the data into monthly segments. We can reuse the partitioned tables by removing the data in them.

Partition by Time into Different-sized Segments

This kind of partition is done where the aged data is accessed infrequently. It is implemented as a set of small partitions for relatively current data, larger partition for inactive data.

Partition on a Different Dimension

The fact table can also be partitioned on the basis of dimensions other than time such as product group, region, supplier, or any other dimension. Let’s have an example. Suppose a market function has been structured into distinct regional departments like on a state by state basis.

If each region wants to query on information captured within its region, it would prove to be more effective to partition the fact table into regional partitions. This will cause the queries to speed up because it does not require to scan information that is not relevant.

Partition by Size of Table

When there are no clear basis for partitioning the fact table on any dimension, then we should partition the fact table on the basis of their size. We can set the predetermined size as a critical point. When the table exceeds the predetermined size, a new table partition is created.

Partition by normalization

Normalization is the standard relational method of database organization. In this method, the rows are collapsed into a single row, hence it reduces space. Take a look at the following tables that show how normalization is performed.

Meatadata

Why we have to create a metadata tables

Metadata is simply defined as data about data; or the description of the structure, content, keys, indexes, etc., of data.

Metadata play a very important role than other data warehouse data and are important for many reasons. For example, metadata are used as a directory to help the decision support system analyst to locate the contents of the data warehouse and as a guide to the data mapping when data are transformed from the operational environment to the data warehouse environment.

Metadata also serve as a guide to the algorithms used for summarization between the current detailed data and the lightly summarized data, and between the lightly summarized data and highly summarized data.

Metadata should be stored and managed persistently (i.e., on disk). Here are several examples.

Metadata for a document may contain the document created date, last modified date, it’s size, author, description, etc.

Metadata for ETL includes the job name, source tables/files, target tables/files, and frequency.

Metadata associated with data management defines the data store in the Data Warehouse. Every object in the database needs to be described including the data in each table, index, and view, and any associated constraints.

Categories of Metadata

Business Metadata: It has the data ownership information, business definition and changing policies.

Technical Metadata: It includes database system names, table and column names and sizes, data types and allowed values. It also includes structural information such as primary and foreign key attributes and indices.

Operational Metadata: It includes currency of data and data lineage. Currency of data means whether the data is active, archived or purged. Lineage of data means the history of data migrated and transformation applied on it.

The Role of Metadata in the Data Warehouse

Consistency of definitions: One department refers to “revenues,” another to “sales.” Are they talking about the same activity? One unit talks about “customers,” another about “users” or “clients.” Are these different classifications or different terms for the same classification? Effective metadata management can ensure that the same data “language” applies throughout the organization.

Business users can easily understand the full meaning of data. With understanding comes eased communication, and an overall improved process. The business metadata primarily supports business end users who do not have a technical background, and cannot use the technical metadata to determine what information is stored inside the Data

Warehouse. Technical metadata primarily supports technical staff that must implement and deploy the Data Warehouse.

Clarity of relationships: Meta data management illuminates the associations among all components of the warehouse: business rules, tables, columns, transformations, and user views of the data. By clarifying relationships throughout the Data Warehouse environment, managed Meta data enables warehouse managers to see the bigger picture—to fully understand the meanings of the data assets, and to accurately predict and manage the impact of changes to the data warehouse.

Availability of information: Meta data exists “behind the scenes,” revealing the origin of data, who defined it, when it was modified, and much more.

  • Resource discovery: Metadata serves the same functions in resource discovery as good cataloging does by:
    • Allowing resources to be found by relevant criteria; o Identifying resources;
    • Bringing similar resources together;
    • Giving location information.

A data warehouse system at build time and use time

Metadata has been indented as a key success factor in data warehouse projects. It captures all kinds of information necessary to extract, transform and load data from source systems into the data warehouse, and afterwards to use and interpret the data warehouse contents. The generation and management of metadata serves two purposes: (1) to minimize the efforts for development and administration of a data warehouse and (2) to improve the extraction of information from it.

Partitioning Practices

To optimize query performance and manage large volumes of data efficiently, we will implement partitioning based on time intervals (e.g., monthly or quarterly partitions). This will allow for faster data retrieval and facilitate data pruning during archiving processes.

There are 4 kind of partion I almost seen:

  • Datetime: Analytics, preprocessing
  • Source: Transformation
  • Type: Transformation
  • Business: Querying objects for analysis