Wednesday 3 December 2014

Attribute clustering in Oracle Database 12c

Attribute clustering in Oracle database 12c.

This is a new feature in Oracle Database version 12.1.0.2 which allows the dba to lay down an order for the table’s records while writing them to the disk, keeping physically close together on disk (clustering together, co-locating) those records which have the similar values for specified attributes (columns).

By clustering together records which have similar values, it becomes likelier that data matching a specifc sql filter condition (like where dept_id = 10) will be stored on the same block or on adjacent blocks on the disk. With this kind of data placement, it is possible to retrieve the requested data with fewer disk IO operations, than when data is stored in the order that it is inserted. Because the nature of queries expected against the table can be anticipated, it is advantageous to tailor the physical ordering to match them, and that is the point of attribute clustering.

Below is a simple example which creates a table named employees for which we define a simple clustering rule, that records will be ordered first by employee id and then by department id (see bolded text showing the new syntax) when writing to the disk.


  CREATE TABLE EMPLOYEES
   (           EMP_ID NUMBER,    NAME VARCHAR2(20),
               JOIN_DATE DATE,  DEPT_ID NUMBER,
               MGR NUMBER
   )
 CLUSTERING
 BY LINEAR ORDER (EMP_ID, DEPT_ID)
   YES ON LOAD  YES ON DATA MOVEMENT;


In the sections below, we are exploring some attribute clustering basics.

Attribute clustering does not work for conventional dml. It only works for these  operations:
  1. Create table as select
  2. Bulk loads using direct path insert like :
insert /*+ append */ select … from <table.
  1. Data movement operations like
§       Alter table <table> move
§       Online table redefinition

This means, if you define a table with attribute clustering and do a conventional insert, then the record created will not be placed in the expect sort position.

See the plan below for a conventional insert which inserts data from a source table, in which records have been created in a random order for DEPT_ID=1. It’s a conventional insert, though the table is attribute-clustered. Due to use of a normal insert, there won’t be any attribute clustering done for this set of rows.


PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------
Plan hash value: 1772640232

--------------------------------------------------------------------------------------
| Id  | Operation                | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT         |           |     6 |   150 |     4   (0)| 00:00:01 |
|   1 |  LOAD TABLE CONVENTIONAL | EMPLOYEES |       |       |            |          |
|*  2 |   TABLE ACCESS FULL      | TRIAL     |     6 |   150 |     4   (0)| 00:00:01 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------

   2 - filter("DEPT_ID"=1)

14 rows selected.


Source table data (note that emp_id is in random order) :
SQL> select * from scott.trial where dept_id=1;

    EMP_ID NAME                 JOIN_DATE    DEPT_ID        MGR
---------- -------------------- --------- ---------- ----------
         1 emptest              01-JAN-14          1          1
         5 emptest              01-JAN-14          1          1
         2 emptest              01-JAN-14          1          1
         4 emptest              01-JAN-14          1          1
         3 emptest              01-JAN-14          1          1
         7 emptest              01-JAN-14          1          1

6 rows selected.


And below are the results of the insert…note that the data are still stored in the same random order in scott.employees as in the trial table.

SQL> insert into
  2   scott.employees
  3     select * from scott.trial where dept_id=1;

6 rows created.

SQL> commit;

Commit complete.

SQL> select * from scott.employees;

    EMP_ID NAME                 JOIN_DATE    DEPT_ID        MGR
---------- -------------------- --------- ---------- ----------
         1 emptest              01-JAN-14          1          1
         5 emptest              01-JAN-14          1          1
         2 emptest              01-JAN-14          1          1
         4 emptest              01-JAN-14          1          1
         3 emptest              01-JAN-14          1          1
         7 emptest              01-JAN-14          1          1

6 rows selected.

SQL>




Whereas if you do a direct path insert, a SORT operation will have been inserted into the plan...See below a direct path insert on the same table, for dept_id = 2

SQL>  explain plan for
  2    insert /*+ append */ into scott.employees
  3     select * from scott.trial
  4       where dept_id=2;

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------
Plan hash value: 1954480923

----------------------------------------------------------------------------------------------
| Id  | Operation                        | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT                 |           |     6 |   150 |     5  (20)| 00:00:01 |
|   1 |  LOAD AS SELECT                  | EMPLOYEES |       |       |            |          |
|   2 |   OPTIMIZER STATISTICS GATHERING |           |     6 |   150 |     5  (20)| 00:00:01 |
|   3 |    SORT ORDER BY                 |           |     6 |   150 |     5  (20)| 00:00:01 |
|*  4 |     TABLE ACCESS FULL            | TRIAL     |     6 |   150 |     4   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------

PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - filter("DEPT_ID"=2)

16 rows selected.

And the records in the employees table (dept_id=2) will be sorted… See below the result of a simple select (no order by specified, but the data is still sorted).


SQL>  insert /*+ append */ into scott.employees
  2    select * from scott.trial
  3     where dept_id = 2;

6 rows created.

SQL> commit;

Commit complete.

SQL> select * from scott.employees where dept_id=2;

    EMP_ID NAME                 JOIN_DATE    DEPT_ID        MGR
---------- -------------------- --------- ---------- ----------
         8 emptest              01-JAN-14          2          3
         9 emptest              01-JAN-14          2          3
        10 emptest              01-JAN-14          2          3
        11 emptest              01-JAN-14          2          3
        12 emptest              01-JAN-14          2          3
        14 emptest              01-JAN-14          2          3

6 rows selected.



Now, the table scott.employees has accumulated both unsorted data (dept_id = 1)
and sorted data (dept_id = 2).

Recalling the DDL of the table…

  CREATE TABLE EMPLOYEES
   (           EMP_ID NUMBER,    NAME VARCHAR2(20),
               JOIN_DATE DATE,  DEPT_ID NUMBER,
               MGR NUMBER
   )
 CLUSTERING
 BY LINEAR ORDER (EMP_ID, DEPT_ID)
   YES ON LOAD  YES ON DATA MOVEMENT;



Because it was defined with YES ON DATA MOVEMENT, an ALTER TABLE move will force the sort. (as will a table redefinition).

We can see below, that the table after an ALTER..MOVE operation gets sorted by employee id and department id


SQL> alter table scott.employees move;

Table altered.

SQL> select * from scott.employees;

    EMP_ID NAME                 JOIN_DATE    DEPT_ID        MGR
---------- -------------------- --------- ---------- ----------
         1 emptest              01-JAN-14          1          1
         2 emptest              01-JAN-14          1          1
         3 emptest              01-JAN-14          1          1
         4 emptest              01-JAN-14          1          1
         5 emptest              01-JAN-14          1          1
         7 emptest              01-JAN-14          1          1
         8 emptest              01-JAN-14          2          3
         9 emptest              01-JAN-14          2          3
        10 emptest              01-JAN-14          2          3
        11 emptest              01-JAN-14          2          3
        12 emptest              01-JAN-14          2          3

    EMP_ID NAME                 JOIN_DATE    DEPT_ID        MGR
---------- -------------------- --------- ---------- ----------
        14 emptest              01-JAN-14          2          3

12 rows selected.

SQL>


Further illustrating that attribute clustering works only on the specific set of being inserted, we will do a trial direct insert of a few records with dept_id = 3, generating 3 records (with a select … union all, to avoid sorting) and generate data in a random order.
insert  /*+ append */ into scott.employees
 select
17, 'emptest',sysdate, 3, 4 from dual
 union all
 select
20, 'emptest',sysdate, 3, 4 from dual
 union all
 select
13, 'emptest',sysdate, 3, 4 from dual

It’s a direct path insert, therefore, was an implicit sort was eventually done to enforce clustering.

SQL> explain plan for
  2  insert /*+ append */ into scott.employees
 select 17, 'emptest',sysdate, 3, 4 from dual
 union all
 select 20, 'emptest',sysdate, 3, 4 from dual
 union all
 select 13, 'emptest',sysdate, 3, 4 from dual  3    4    5    6    7  ;

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------
Plan hash value: 2013921891

--------------------------------------------------------------------------------------
| Id  | Operation                        | Name      | Rows  | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT                 |           |     3 |     7  (15)| 00:00:01 |
|   1 |  LOAD AS SELECT                  | EMPLOYEES |       |            |          |
|   2 |   OPTIMIZER STATISTICS GATHERING |           |     3 |     6   (0)| 00:00:01 |
|   3 |    SORT ORDER BY                 |           |     3 |     6   (0)| 00:00:01 |
|   4 |     UNION-ALL                    |           |       |            |          |
|   5 |      FAST DUAL                   |           |     1 |     2   (0)| 00:00:01 |
|   6 |      FAST DUAL                   |           |     1 |     2   (0)| 00:00:01 |
|   7 |      FAST DUAL                   |           |     1 |     2   (0)| 00:00:01 |
--------------------------------------------------------------------------------------

14 rows selected.

SQL>


The sort is not seen for the same insert, without the /*+ APPEND */ hint.

SQL> explain plan for
  2   insert  into scott.employees
 select 17, 'emptest',sysdate, 3, 4 from dual
 union all
 select 20, 'emptest',sysdate, 3, 4 from dual
 union all
 select 13, 'emptest',sysdate, 3, 4 from dual  3    4    5    6    7  ;

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------
Plan hash value: 2623540522

------------------------------------------------------------------------------
| Id  | Operation                | Name      | Rows  | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | INSERT STATEMENT         |           |     3 |     6   (0)| 00:00:01 |
|   1 |  LOAD TABLE CONVENTIONAL | EMPLOYEES |       |            |          |
|   2 |   UNION-ALL              |           |       |            |          |
|   3 |    FAST DUAL             |           |     1 |     2   (0)| 00:00:01 |
|   4 |    FAST DUAL             |           |     1 |     2   (0)| 00:00:01 |
|   5 |    FAST DUAL             |           |     1 |     2   (0)| 00:00:01 |
------------------------------------------------------------------------------

12 rows selected.

SQL>



What we saw above was the case of a heap-table attribute clustering using a simple LINEAR ORDERing BY 2 columns (EMP_ID, DEPT_ID).  This kind of ordering will benefit queries which  always prefix EMP_ID at least to the filter clause as in:

SELECT…Where EMP_ID = 10  and DEPT_ID = 123;

But not a query which does not  specify the EMP_ID column first, as in:

SELECT… where DEPT_ID=123;


A variation of attribute clustering is clustering by interleaved ordering.
In the case of interleaved clustering the data is ordered by a z-order function calculated on the user-specified columns.

  CREATE TABLE EMPLOYEES
   (           EMP_ID NUMBER,    NAME VARCHAR2(20),
               JOIN_DATE DATE,  DEPT_ID NUMBER,
               MGR NUMBER
   )
 CLUSTERING
 BY INTERLEAVED ORDER (EMP_ID, DEPT_ID)
   YES ON LOAD  YES ON DATA MOVEMENT;

During loads, a z-order function operates on the values of emp_id and dept_id of record and the physical data placement as per the calculated z-order.

Due to z-ordering the ordering is not simply by employee id and within that, by dept_id, each of the two column’s data values will have a weight and affect the record’s position in the sort.

Consider a source table having these records.



Going from 1st record to 5th the department id assigned is random.
We can see the result of an interleaved ordered clustering by loading this source data into our modified employee table which is now clustered by interleaved ordering of (emp_id, dept_id). Loading the data into the interleaved-ordered table…

SQL> insert /*+ append */ into scott.employees
 select * from scott.trial2;
   2
5 rows created.






Rank # 1 goes to record with emp_id = 2 and dept_id =1.
Rank # 2 goes to record with emp_id = 3, dept_id = 2.
Rank # 3 goes to record with emp_id=5, dept_id = 1.

The z-ordering effect is seen here; its considering both columns during the sort.

Adding, a record with emp_id=dept_id=1 and a second one with emp_id=dept_id=5 to the source table and re-loading the resulting set, we get this:



Again, it looks like the z-ordering is assign a weight to data points from each of the columns specified, and computing a sort position for the record by considering both weights. Therefore, where a record is located is depends on the relative weights of its data point columns.

It seems to achieve a wider spread of different data points across the storage space, as compared to linear ordered clustering as applied to the same set of input data.

As per Oracle 12c documentation, Data Warehousing And Business Intelligence Guide, Chapter 12, this kind of clustering is favorable to queries with predicate on emp_id and/or dept_id.

Conclusion:

In this document we explored the attribute clustering feature’s basics, and seen how it works. Due to its very nature, attribute clustering is suited to data warehousing type of applications where there is little conventional DML and the tables are subject to frequent bulk data loading.

2 comments:

  1. z-order, i don't understand.
    Request to more detail explanation

    ReplyDelete
  2. My understanding of z-ordering is that it is a kind of weighted sort order.
    It considers the row's column data and computes a kind of rank.

    This rank seems to determine the physical placement of data.

    The rows with similar values of specified columns would tend to be co-located (or clustered).

    Typically a row of data in a database has a couple of key columns which can be considered as its dimensions.
    In the example above, perhaps, emp_id, and dept_id can be considered as 2 dimensions of the data.
    A z-order is a sort rule that considers both the dimensions.

    If we consider another case - a computer screen's windows.

    Each window on a computer screen has a specific length attribute, and a breadth attribute.

    Each window also has a position for its top-left corner.

    And we can regard the front-to-back placement of a window relative to others, as yet another property, or dimension, its z-order rank

    Perhaps a dimension that we'd plot along a z-axis.

    The Windows taskbar can be considered as a window with a special kind of z-dimension -- it typically stays on top of other windows.

    The taskbar can be regarded as having a z-dimension of 1, its the top-most and no window can obscure it, you want instant access to it always.

    The current non-taskbar window on top could perhaps, be regarded as having a z-dimension of 2, the one behind it 3, etc.

    So a windows's could have these dimensions, much like a row of data has dimension columns.
    Length
    Breadh
    Screen Top left corner co-ordinates.
    Its z-order rank (Front-to-back position tank relative to other windows)

    You could sort your collection of windows (or records, analogously) on the above properties and rank them low to high.

    Perhaps you want the top-most window to rank higher, and within that, prioritize screen top left position.
    Then you'd give its z-order rank max weightage, and after that to the screen top left co-ordinate in your sort.

    And coming to a table's row, such a sort order could perhaps, be used to determine the on-disk location.

    Coming to Oracle, Hash partitioning perhaps, accomplishes something similar, where you compute a hash for a specific column-set.

    And the hash value (typically) determines the partition to which the row goes.




    ReplyDelete