Data Warehouse and ETL Knowledgebase: Teradata 14.0 introduces Teradata Columnar

Teradata 14.0 introduces Teradata Columnar – a new option to organize the data of a user-defined table or join index on disk.

Teradata Columnar offers the ability to partition a table or join index by column. It introduces column-storage as an alternative choice to row-storage for a column partition and autocompression. Column partitioning can be used alone in a single-level partitioning definition or with row partitioning in a multilevel partitioning definition.

Teradata Columnar is a new paradigm for partitioning, storing data, and compression that changes the cost-benefit tradeoffs of the available physical database design choices and their combinations. Teradata Columnar provides a benefit to the user by reducing I/O for certain classes of queries while at the same time decreasing space usage.

A column-partitioned (CP) table or join index has several key characteristics:

It does not have a primary index (a future blog entry will discuss why).
Each of its column partitions can be composed of a single column or multiple columns.
Each column partition usually contains multiple physical rows. Physical rows are the disk-based structures that the Teradata file system uses to store data in order based on the rowid associated with each physical row. The first part of a rowid indicates the partition number so that physical rows are ordered by partition number first and within partition by the rest of the rowid.
A new physical row format COLUMN may be utilized for a column partition; such a physical row is called a container. This is used to implement column-storage, row header compression, and autocompression for a column partition. This provides a compact way to store a series of column partition values.
Alternatively, a column partition may have physical rows with ROW format that are used to implement row-storage; such a physical row is called a subrow. Each column partition value is in its own physical row. Usually a subrow is wide (multicolumn, large character strings, etc.) where the row header overhead for each column partition value is insignificant and having each column partition value in its own physical row provides more direct access to the value.
A CP table is just another type of table that can be accessed by a query. A single query can access multiple kinds of tables.

PPI stands for partitioned primary index which means the table has a primary index and the rows are partitioned on the AMPs (and within a partition, the rows are ordered by a hash of the primary index columns).

A CP table is not a PPI table since a CP table doesn't have a primary index. But a CP table can have RANGE_N and CASE_N row partitioning (the kind of partitioning of rows that is used in PPI) but since there is no primary index, the rows within a row partition are not ordered by a hash of some columns of each row -- they are just in insert order.

A CP table could have a join index on it where the join index does have primary index (but not column partitioning).

A PI or PPI table could have a join index on it where the join index has column partitioning (but not a primary index) plus optionally 1 or more levels of row partitioning.

A NoPI table can't have row partitioning unless it also has column partitioning.

For example:

CREATE TABLE SALES (

    TxnNo     INTEGER,

    TxnDate   DATE,

    ItemNo    INTEGER,

    Quantity  INTEGER )

  PARTITION BY COLUMN,

  UNIQUE INDEX (TxnNo);

This creates a column-partitioned (CP) table that partitions the data of the table vertically. Each column is in its own column partition that is stored using column-storage with row header compression and autocompression. All the data for TxnNo comes first, followed by the data for TxnDate, followed by the data for ItemNo, and then the data for Quantity. Note that a primary index is not specified so this is NoPI table. Moreover, a primary index must not be specified if the table is column partitioned.

The following adds a level of row partitioning (so the table has multilevel partitioning). All the data for TxnNo for the first day comes first, followed by the next day of data forTxnNo, etc. then all the data for TxnDate for the first day, the second day, etc, ending with the last day of data for Quantity.

CREATE TABLE SALES (

    TxnNo     INTEGER,

    TxnDate   DATE,

    ItemNo    INTEGER,

    Quantity  INTEGER )

  PARTITION BY (

      COLUMN,

      RANGE_N(TxnDate BETWEEN

          DATE '2011-01-01' AND DATE '2011-12-31' EACH INTERVAL '1' DAY) ),

  UNIQUE INDEX (TxnNo);

Data Warehouse and ETL Knowledgebase

Monday, December 15, 2014

Teradata 14.0 introduces Teradata Columnar

No comments:

Chat

Do you like it??