Certified Data Mining and Warehousing Professional Data partitioning and clustering for performance

Data partitioning and clustering for performance
 


Partitioning

Types of Partitioning

This section describes the partitioning features that significantly enhance data access and improve overall application performance. This is especially true for applications that access tables and indexes with millions of rows and many gigabytes of data.

Partitioned tables and indexes facilitate administrative operations by enabling these operations to work on subsets of data. For example, you can add a new partition, organize an existing partition, or drop a partition with minimal to zero interruption to a read-only application.

Using the partitioning methods described in this section can help you tune SQL statements to avoid unnecessary index and table scans (using partition pruning). You can also improve the performance of massive join operations when large amounts of data (for example, several million rows) are joined together by using partition-wise joins. Finally, partitioning data greatly improves manageability of very large databases and dramatically reduces the time required for administrative tasks such as backup and restore.

Granularity in a partitioning scheme can be easily changed by splitting or merging partitions. Thus, if a table's data is skewed to fill some partitions more than others, the ones that contain more data can be split to achieve a more even distribution. Partitioning also allows one to swap partitions with a table. By being able to easily add, remove, or swap a large amount of data quickly, swapping can be used to keep a large amount of data that is being loaded inaccessible until loading is completed, or can be used as a way to stage data between different phases of use. Some examples are current day's transactions or online archives.

Partitioning Methods

Oracle offers four partitioning methods:

Each partitioning method has different advantages and design considerations. Thus, each method is more appropriate for a particular situation.

Range Partitioning

Range partitioning maps data to partitions based on ranges of partition key values that you establish for each partition. It is the most common type of partitioning and is often used with dates. For example, you might want to partition sales data into monthly partitions.

Range partitioning maps rows to partitions based on ranges of column values. Range partitioning is defined by the partitioning specification for a table or index in PARTITIONBYRANGE(column_list)and by the partitioning specifications for each individual partition in VALUESLESSTHAN(value_list), where column_listis an ordered list of columns that determines the partition to which a row or an index entry belongs. These columns are called the partitioning columns. The values in the partitioning columns of a particular row constitute that row's partitioning key.

An ordered list of values for the columns in the column list is called a value_list. Each value must be either a literal or a TO_DATEor RPADfunction with constant arguments. Only the VALUESLESSTHANclause is allowed. This clause specifies a non-inclusive upper bound for the partitions. All partitions, except the first, have an implicit low value specified by the VALUESLESSTHANliteral on the previous partition. Any binary values of the partition key equal to or higher than this literal are added to the next higher partition. Highest partition being where MAXVALUEliteral is defined. Keyword, MAXVALUE, represents a virtual infinite value that sorts higher than any other value for the data type, including the null value.

The following statement creates a table sales_rangethat is range partitioned on the sales_datefield:

CREATE TABLE sales_range 
(salesman_id  NUMBER(5), 
salesman_name VARCHAR2(30), 
sales_amount  NUMBER(10), 
sales_date    DATE)
COMPRESS
PARTITION BY RANGE(sales_date)
(PARTITION sales_jan2000 VALUES LESS THAN(TO_DATE('02/01/2000','DD/MM/YYYY')),
 PARTITION sales_feb2000 VALUES LESS THAN(TO_DATE('03/01/2000','DD/MM/YYYY')),
 PARTITION sales_mar2000 VALUES LESS THAN(TO_DATE('04/01/2000','DD/MM/YYYY')),
 PARTITION sales_apr2000 VALUES LESS THAN(TO_DATE('05/01/2000','DD/MM/YYYY')));

Note:

This table was created with the COMPRESSkeyword, thus all partitions inherit this attribute.
 

Hash Partitioning

Hash partitioning maps data to partitions based on a hashing algorithm that Oracle applies to a partitioning key that you identify. The hashing algorithm evenly distributes rows among partitions, giving partitions approximately the same size. Hash partitioning is the ideal method for distributing data evenly across devices. Hash partitioning is also an easy-to-use alternative to range partitioning, especially when the data to be partitioned is not historical.

Oracle Database uses a linear hashing algorithm and to prevent data from clustering within specific partitions, you should define the number of partitions by a power of two (for example, 2, 4, 8).

The following statement creates a table sales_hash, which is hash partitioned on the salesman_idfield:

CREATE TABLE sales_hash
(salesman_id  NUMBER(5), 
salesman_name VARCHAR2(30), 
sales_amount  NUMBER(10), 
week_no       NUMBER(2)) 
PARTITION BY HASH(salesman_id) 
PARTITIONS 4;
You cannot define alternate hashing algorithms for partitions.

List Partitioning

List partitioning enables you to explicitly control how rows map to partitions. You do this by specifying a list of discrete values for the partitioning column in the description for each partition. This is different from range partitioning, where a range of values is associated with a partition and with hash partitioning, where you have no control of the row-to-partition mapping. The advantage of list partitioning is that you can group and organize unordered and unrelated sets of data in a natural way. The following example creates a list partitioned table grouping states according to their sales regions:

CREATE TABLE sales_list
(salesman_id  NUMBER(5), 
salesman_name VARCHAR2(30),
sales_state   VARCHAR2(20),
sales_amount  NUMBER(10), 
sales_date    DATE)
PARTITION BY LIST(sales_state)
(PARTITION sales_west VALUES('California', 'Hawaii') COMPRESS,
 PARTITION sales_east VALUES('New York', 'Virginia', 'Florida'),
 PARTITION sales_central VALUES('Texas', 'Illinois'));

Partition sales_westis furthermore created as a single compressed partition within sales_list. For details about partitioning and compression.

An additional capability with list partitioning is that you can use a default partition, so that all rows that do not map to any other partition do not generate an error. For example, modifying the previous example, you can create a default partition as follows:

CREATE TABLE sales_list
(salesman_id  NUMBER(5), 
salesman_name VARCHAR2(30),
sales_state   VARCHAR2(20),
sales_amount  NUMBER(10), 
sales_date    DATE)
PARTITION BY LIST(sales_state)
(PARTITION sales_west VALUES('California', 'Hawaii'),
 PARTITION sales_east VALUES ('New York', 'Virginia', 'Florida'),
 PARTITION sales_central VALUES('Texas', 'Illinois'),
 PARTITION sales_other VALUES(DEFAULT));

Composite Partitioning

Composite partitioning combines range and hash or list partitioning. Oracle Database first distributes data into partitions according to boundaries established by the partition ranges. Then, for range-hash partitioning, Oracle uses a hashing algorithm to further divide the data into subpartitions within each range partition. For range-list partitioning, Oracle divides the data into subpartitions within each range partition based on the explicit list you chose.

 

Index Partitioning

You can choose whether or not to inherit the partitioning strategy of the underlying tables. You can create both local and global indexes on a table partitioned by range, hash, or composite methods. Local indexes inherit the partitioning attributes of their related tables. For example, if you create a local index on a composite table, Oracle automatically partitions the local index using the composite method. 

Performance Issues for Range, List, Hash, and Composite Partitioning

This section describes performance issues for:

When to Use Range Partitioning

Range partitioning is a convenient method for partitioning historical data. The boundaries of range partitions define the ordering of the partitions in the tables or indexes.

Range partitioning is usually used to organize data by time intervals on a column of type DATE. Thus, most SQL statements accessing range partitions focus on timeframes. An example of this is a SQL statement similar to "select data from a particular period in time." In such a scenario, if each partition represents data for one month, the query "find data of month 98-DEC" needs to access only the December partition of year 98. This reduces the amount of data scanned to a fraction of the total data available, an optimization method called partition pruning.

Range partitioning is also ideal when you periodically load new data and purge old data. It is easy to add or drop partitions.

It is common to keep a rolling window of data, for example keeping the past 36 months' worth of data online. Range partitioning simplifies this process. To add data from a new month, you load it into a separate table, clean it, index it, and then add it to the range-partitioned table using the EXCHANGEPARTITIONstatement, all while the original table remains online. Once you add the new partition, you can drop the trailing month with the DROPPARTITIONstatement. The alternative to using the DROPPARTITIONstatement can be to archive the partition and make it read only, but this works only when your partitions are in separate tablespaces.

In conclusion, consider using range partitioning when:

  • Very large tables are frequently scanned by a range predicate on a good partitioning column, such as ORDER_DATEor PURCHASE_DATE. Partitioning the table on that column enables partition pruning.

  • You want to maintain a rolling window of data.

  • You cannot complete administrative operations, such as backup and restore, on large tables in an allotted time frame, but you can divide them into smaller logical pieces based on the partition range column.

The following example creates the table salestablefor a period of two years, 1999 and 2000, and partitions it by range according to the column s_salesdateto separate the data into eight quarters, each corresponding to a partition.

CREATE TABLE salestable
  (s_productid  NUMBER,
   s_saledate   DATE,
   s_custid     NUMBER,
   s_totalprice NUMBER)
PARTITION BY RANGE(s_saledate)
 (PARTITION sal99q1 VALUES LESS THAN (TO_DATE('01-APR-1999', 'DD-MON-YYYY')),
  PARTITION sal99q2 VALUES LESS THAN (TO_DATE('01-JUL-1999', 'DD-MON-YYYY')),
  PARTITION sal99q3 VALUES LESS THAN (TO_DATE('01-OCT-1999', 'DD-MON-YYYY')),
  PARTITION sal99q4 VALUES LESS THAN (TO_DATE('01-JAN-2000', 'DD-MON-YYYY')),
  PARTITION sal00q1 VALUES LESS THAN (TO_DATE('01-APR-2000', 'DD-MON-YYYY')),
  PARTITION sal00q2 VALUES LESS THAN (TO_DATE('01-JUL-2000', 'DD-MON-YYYY')),
  PARTITION sal00q3 VALUES LESS THAN (TO_DATE('01-OCT-2000', 'DD-MON-YYYY')),
  PARTITION sal00q4 VALUES LESS THAN (TO_DATE('01-JAN-2001', 'DD-MON-YYYY')));

When to Use Hash Partitioning

The way Oracle Database distributes data in hash partitions does not correspond to a business or a logical view of the data, as it does in range partitioning. Consequently, hash partitioning is not an effective way to manage historical data. However, hash partitions share some performance characteristics with range partitions. For example, partition pruning is limited to equality predicates. You can also use partition-wise joins, parallel index access, and parallel DML. 

As a general rule, use hash partitioning for the following purposes:

  • To improve the availability and manageability of large tables.

  • To avoid data skew among partitions. Hash partitioning is an effective means of distributing data because Oracle hashes the data into a number of partitions, each of which can reside on a separate device. Thus, data is evenly spread over a sufficient number of devices to maximize I/O throughput. Similarly, you can use hash partitioning to distribute evenly data among the nodes of an MPP platform that uses Oracle Real Application Clusters.

  • If it is important to use partition pruning and partition-wise joins according to a partitioning key that is mostly constrained by a distinct value or value list.

    Note:

    In hash partitioning, partition pruning uses only equality or IN-list predicates.

If you add or merge a hashed partition, Oracle automatically rearranges the rows to reflect the change in the number of partitions and subpartitions. The hash function that Oracle uses is especially designed to limit the cost of this reorganization. Instead of reshuffling all the rows in the table, Oracles uses an "add partition" logic that splits one and only one of the existing hashed partitions. Conversely, Oracle coalesces a partition by merging two existing hashed partitions.

Although the hash function's use of "add partition" logic dramatically improves the manageability of hash partitioned tables, it means that the hash function can cause a skew if the number of partitions of a hash partitioned table, or the number of subpartitions in each partition of a composite table, is not a power of two. In the worst case, the largest partition can be twice the size of the smallest. So for optimal performance, create a number of partitions and subpartitions for each partition that is a power of two. For example, 2, 4, 8, 16, 32, 64, 128, and so on.

The following example creates four hashed partitions for the table sales_hashusing the column s_productidas the partition key:

CREATE TABLE sales_hash
  (s_productid  NUMBER,
   s_saledate   DATE,
   s_custid     NUMBER,
   s_totalprice NUMBER)
PARTITION BY HASH(s_productid)
PARTITIONS 4;

Specify partition names if you want to choose the names of the partitions. Otherwise, Oracle automatically generates internal names for the partitions. Also, you can use the STOREINclause to assign hash partitions to tablespaces in a round-robin manner.

When to Use List Partitioning

You should use list partitioning when you want to specifically map rows to partitions based on discrete values.

Unlike range and hash partitioning, multi-column partition keys are not supported for list partitioning. If a table is partitioned by list, the partitioning key can only consist of a single column of the table.

When to Use Composite Range-Hash Partitioning

Composite range-hash partitioning offers the benefits of both range and hash partitioning. With composite range-hash partitioning, Oracle first partitions by range. Then, within each range, Oracle creates subpartitions and distributes data within them using the same hashing algorithm it uses for hash partitioned tables.

Data placed in composite partitions is logically ordered only by the boundaries that define the range level partitions. The partitioning of data within each partition has no logical organization beyond the identity of the partition to which the subpartitions belong.

Consequently, tables and local indexes partitioned using the composite range-hash method:

  • Support historical data at the partition level.

  • Support the use of subpartitions as units of parallelism for parallel operations such as PDML or space management and backup and recovery.

  • Are eligible for partition pruning and partition-wise joins on the range and hash partitions.

Using Composite Range-Hash Partitioning

Use the composite range-hash partitioning method for tables and local indexes if:

  • Partitions must have a logical meaning to efficiently support historical data

  • The contents of a partition can be spread across multiple tablespaces, devices, or nodes (of an MPP system)

  • You require both partition pruning and partition-wise joins even when the pruning and join predicates use different columns of the partitioned table

  • You require a degree of parallelism that is greater than the number of partitions for backup, recovery, and parallel operations

Most large tables in a data warehouse should use range partitioning. Composite partitioning should be used for very large tables or for data warehouses with a well-defined need for these conditions. When using the composite method, Oracle stores each subpartition on a different segment. Thus, the subpartitions may have properties that differ from the properties of the table or from the partition to which the subpartitions belong.

The following example partitions the table sales_range_hashby range on the column s_saledateto create four partitions that order data by time. Then, within each range partition, the data is further subdivided into 16 subpartitions by hash on the column s_productid:

CREATE TABLE sales_range_hash(
  s_productid  NUMBER,
  s_saledate   DATE,
  s_custid     NUMBER,
  s_totalprice NUMBER)
PARTITION BY RANGE (s_saledate)
SUBPARTITION BY HASH (s_productid) SUBPARTITIONS 8
 (PARTITION sal99q1 VALUES LESS THAN (TO_DATE('01-APR-1999', 'DD-MON-YYYY')),
  PARTITION sal99q2 VALUES LESS THAN (TO_DATE('01-JUL-1999', 'DD-MON-YYYY')),
  PARTITION sal99q3 VALUES LESS THAN (TO_DATE('01-OCT-1999', 'DD-MON-YYYY')),
  PARTITION sal99q4 VALUES LESS THAN (TO_DATE('01-JAN-2000', 'DD-MON-YYYY')));

Each hashed subpartition contains sales data for a single quarter ordered by product code. The total number of subpartitions is 4x8 or 32.

In addition to this syntax, you can create subpartitions by using a subpartition template. This offers better ease in naming and control of location for tablespaces and subpartitions. The following statement illustrates this:

CREATE TABLE sales_range_hash(
  s_productid  NUMBER,
  s_saledate   DATE,
  s_custid     NUMBER,
  s_totalprice NUMBER)
PARTITION BY RANGE (s_saledate)
SUBPARTITION BY HASH (s_productid)
SUBPARTITION TEMPLATE(
SUBPARTITION sp1 TABLESPACE tbs1,
SUBPARTITION sp2 TABLESPACE tbs2,
SUBPARTITION sp3 TABLESPACE tbs3,
SUBPARTITION sp4 TABLESPACE tbs4,
SUBPARTITION sp5 TABLESPACE tbs5,
SUBPARTITION sp6 TABLESPACE tbs6,
SUBPARTITION sp7 TABLESPACE tbs7,
SUBPARTITION sp8 TABLESPACE tbs8)
(PARTITION sal99q1 VALUES LESS THAN (TO_DATE('01-APR-1999', 'DD-MON-YYYY')),
   PARTITION sal99q2 VALUES LESS THAN (TO_DATE('01-JUL-1999', 'DD-MON-YYYY')),
   PARTITION sal99q3 VALUES LESS THAN (TO_DATE('01-OCT-1999', 'DD-MON-YYYY')),
   PARTITION sal99q4 VALUES LESS THAN (TO_DATE('01-JAN-2000', 'DD-MON-YYYY')));

In this example, every partition has the same number of subpartitions. A sample mapping for sal99q1is illustrated in Table 5-1. Similar mappings exist for sal99q2through sal99q4.