Attribute clustering in Oracle database 12c.
This is a new feature in
Oracle Database version 12.1.0.2 which allows the dba to lay down an order for
the table’s records while writing them to the disk, keeping physically close
together on disk (clustering together, co-locating) those records which have
the similar values for specified attributes (columns).
By clustering together
records which have similar values, it becomes likelier that data matching a
specifc sql filter condition (like where dept_id = 10) will be stored on the same block or on adjacent blocks on the disk. With
this kind of data placement, it is possible to retrieve the requested data with
fewer disk IO operations, than when data is stored in the order that it is
inserted. Because the nature of queries expected against the table can be
anticipated, it is advantageous to tailor the physical ordering to match them,
and that is the point of attribute clustering.
Below is a simple example
which creates a table named employees for which we define a simple clustering
rule, that records will be ordered first by employee id and then by department
id (see bolded text showing the new syntax) when writing to the disk.
CREATE TABLE
EMPLOYEES
( EMP_ID NUMBER, NAME
VARCHAR2(20),
JOIN_DATE
DATE, DEPT_ID
NUMBER,
MGR
NUMBER
)
CLUSTERING
BY LINEAR ORDER (EMP_ID, DEPT_ID)
YES ON LOAD
YES ON DATA MOVEMENT;
In the sections below, we
are exploring some attribute clustering basics.
Attribute clustering does
not work for conventional dml. It only works for these operations:
- Create table as select
- Bulk loads using direct path insert like :
insert
/*+ append */ select … from <table.
- Data movement operations like
§
Alter table
<table> move
§
Online table
redefinition
This means, if you define a
table with attribute clustering and do a conventional insert, then the record
created will not be placed in the expect sort position.
See the plan below for a
conventional insert which inserts data from a source table, in which records
have been created in a random order for DEPT_ID=1. It’s a conventional insert, though
the table is attribute-clustered. Due to use of a normal insert, there won’t be
any attribute clustering done for this
set of rows.
PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------
Plan hash value: 1772640232
--------------------------------------------------------------------------------------
| Id | Operation | Name | Rows
| Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------
| 0 | INSERT STATEMENT | |
6 | 150 | 4
(0)| 00:00:01 |
| 1 |
LOAD TABLE CONVENTIONAL | EMPLOYEES | |
| |
|
|* 2 |
TABLE ACCESS FULL |
TRIAL | 6 |
150 | 4 (0)| 00:00:01 |
--------------------------------------------------------------------------------------
Predicate Information
(identified by operation id):
PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------
2 - filter("DEPT_ID"=1)
14 rows selected.
Source table data (note
that emp_id is in random order) :
SQL> select * from scott.trial where dept_id=1;
EMP_ID NAME JOIN_DATE DEPT_ID MGR
---------- -------------------- --------- ----------
----------
1
emptest 01-JAN-14 1 1
5
emptest 01-JAN-14 1 1
2
emptest 01-JAN-14 1 1
4
emptest 01-JAN-14 1 1
3
emptest 01-JAN-14 1 1
7
emptest 01-JAN-14 1 1
6 rows selected.
And below are the results
of the insert…note that the data are still stored in the same random order in
scott.employees as in the trial table.
SQL> insert into
2
scott.employees
3
select * from scott.trial where dept_id=1;
6 rows created.
SQL> commit;
Commit complete.
SQL> select * from
scott.employees;
EMP_ID NAME JOIN_DATE DEPT_ID MGR
----------
-------------------- --------- ---------- ----------
1 emptest 01-JAN-14 1 1
5 emptest 01-JAN-14 1 1
2 emptest 01-JAN-14 1 1
4 emptest 01-JAN-14 1 1
3 emptest 01-JAN-14 1 1
7 emptest 01-JAN-14 1 1
6 rows selected.
SQL>
Whereas if you do a direct
path insert, a SORT operation will have been inserted into the plan...See below
a direct path insert on the same table, for dept_id = 2
SQL> explain plan for
2
insert /*+ append
*/ into scott.employees
3
select * from scott.trial
4
where dept_id=2;
Explained.
SQL> select * from
table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------
Plan hash value: 1954480923
----------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------------
| 0 | INSERT STATEMENT | |
6 | 150 | 5
(20)| 00:00:01 |
| 1 |
LOAD AS SELECT | EMPLOYEES | |
| | |
| 2 |
OPTIMIZER STATISTICS GATHERING | |
6 | 150 | 5
(20)| 00:00:01 |
| 3 | SORT ORDER BY | |
6 | 150 | 5
(20)| 00:00:01 |
|* 4 |
TABLE ACCESS FULL |
TRIAL | 6 |
150 | 4 (0)| 00:00:01 |
----------------------------------------------------------------------------------------------
PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------
Predicate Information
(identified by operation id):
---------------------------------------------------
4 - filter("DEPT_ID"=2)
16 rows selected.
And the records in the
employees table (dept_id=2) will be sorted… See below the result of a simple
select (no order by specified, but the data is still sorted).
SQL> insert /*+ append */ into scott.employees
2
select * from scott.trial
3
where dept_id = 2;
6 rows created.
SQL> commit;
Commit complete.
SQL> select * from
scott.employees where dept_id=2;
EMP_ID NAME JOIN_DATE DEPT_ID MGR
----------
-------------------- --------- ---------- ----------
8 emptest 01-JAN-14 2 3
9 emptest
01-JAN-14 2 3
10 emptest 01-JAN-14 2 3
11 emptest 01-JAN-14 2 3
12 emptest 01-JAN-14 2 3
14 emptest
01-JAN-14 2 3
6 rows selected.
Now, the table scott.employees
has accumulated both unsorted data (dept_id = 1)
and sorted data (dept_id =
2).
Recalling the DDL of the
table…
CREATE
TABLE EMPLOYEES
( EMP_ID NUMBER, NAME
VARCHAR2(20),
JOIN_DATE
DATE, DEPT_ID
NUMBER,
MGR
NUMBER
)
CLUSTERING
BY LINEAR
ORDER (EMP_ID, DEPT_ID)
YES ON
LOAD YES ON DATA
MOVEMENT;
Because it was defined with
YES ON DATA MOVEMENT, an ALTER TABLE move will force the sort. (as will
a table redefinition).
We can see below, that the
table after an ALTER..MOVE operation gets sorted by employee id and department
id
SQL> alter table scott.employees move;
Table altered.
SQL> select * from
scott.employees;
EMP_ID NAME JOIN_DATE DEPT_ID MGR
----------
-------------------- --------- ---------- ----------
1 emptest 01-JAN-14 1 1
2 emptest 01-JAN-14 1 1
3 emptest 01-JAN-14 1 1
4 emptest 01-JAN-14 1 1
5 emptest 01-JAN-14 1 1
7 emptest 01-JAN-14 1 1
8 emptest 01-JAN-14 2 3
9 emptest 01-JAN-14 2 3
10 emptest 01-JAN-14 2 3
11 emptest 01-JAN-14 2 3
12 emptest 01-JAN-14 2 3
EMP_ID NAME JOIN_DATE DEPT_ID MGR
----------
-------------------- --------- ---------- ----------
14 emptest 01-JAN-14 2 3
12 rows selected.
SQL>
Further illustrating that
attribute clustering works only on the specific set of being inserted, we will
do a trial direct insert of a few records with dept_id = 3, generating 3
records (with a select … union all, to avoid sorting) and generate data in a random
order.
insert /*+ append */ into
scott.employees
select 17, 'emptest',sysdate, 3, 4 from dual
union all
select 20, 'emptest',sysdate, 3, 4 from dual
union all
select 13, 'emptest',sysdate, 3, 4 from dual
select 17, 'emptest',sysdate, 3, 4 from dual
union all
select 20, 'emptest',sysdate, 3, 4 from dual
union all
select 13, 'emptest',sysdate, 3, 4 from dual
It’s a direct path insert,
therefore, was an implicit sort was eventually done to enforce clustering.
SQL> explain plan for
2
insert /*+ append */ into scott.employees
select 17, 'emptest',sysdate, 3, 4 from dual
union all
select 20, 'emptest',sysdate, 3, 4 from dual
union all
select 13, 'emptest',sysdate, 3, 4 from
dual 3
4 5 6
7 ;
Explained.
SQL> select * from
table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------
Plan hash value: 2013921891
--------------------------------------------------------------------------------------
| Id | Operation | Name | Rows
| Cost (%CPU)| Time |
--------------------------------------------------------------------------------------
| 0 | INSERT STATEMENT | |
3 | 7 (15)| 00:00:01 |
| 1 |
LOAD AS SELECT |
EMPLOYEES | | | |
| 2 |
OPTIMIZER STATISTICS GATHERING |
| 3 |
6 (0)| 00:00:01 |
| 3 | SORT ORDER BY | |
3 | 6 (0)| 00:00:01 |
| 4 |
UNION-ALL
| | | | |
| 5 |
FAST DUAL | |
1 | 2 (0)| 00:00:01 |
| 6 |
FAST DUAL | |
1 | 2 (0)| 00:00:01 |
| 7 |
FAST DUAL | |
1 | 2 (0)| 00:00:01 |
--------------------------------------------------------------------------------------
14 rows selected.
SQL>
The sort is not seen for the
same insert, without the /*+ APPEND */ hint.
SQL> explain plan for
2
insert into scott.employees
select 17, 'emptest',sysdate, 3, 4 from dual
union all
select 20, 'emptest',sysdate, 3, 4 from dual
union all
select 13, 'emptest',sysdate, 3, 4 from
dual 3
4 5 6
7 ;
Explained.
SQL> select * from
table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------
Plan hash value: 2623540522
------------------------------------------------------------------------------
| Id | Operation | Name | Rows
| Cost (%CPU)| Time |
------------------------------------------------------------------------------
| 0 | INSERT STATEMENT | |
3 | 6 (0)| 00:00:01 |
| 1 |
LOAD TABLE CONVENTIONAL | EMPLOYEES | | | |
| 2 |
UNION-ALL | | | | |
| 3 |
FAST DUAL | |
1 | 2 (0)| 00:00:01 |
| 4 |
FAST DUAL | |
1 | 2 (0)| 00:00:01 |
| 5 |
FAST DUAL | |
1 | 2 (0)| 00:00:01 |
------------------------------------------------------------------------------
12 rows selected.
SQL>
What we saw above was the
case of a heap-table attribute clustering using a simple LINEAR ORDERing BY 2
columns (EMP_ID, DEPT_ID). This kind of
ordering will benefit queries which
always prefix EMP_ID at least to the filter clause as in:
SELECT…Where EMP_ID =
10 and DEPT_ID = 123;
But not a query which does not
specify the EMP_ID column first, as in:
SELECT… where DEPT_ID=123;
A variation of attribute
clustering is clustering by interleaved ordering.
In the case of interleaved
clustering the data is ordered by a z-order function calculated on the user-specified
columns.
CREATE
TABLE EMPLOYEES
( EMP_ID NUMBER, NAME
VARCHAR2(20),
JOIN_DATE
DATE, DEPT_ID
NUMBER,
MGR
NUMBER
)
CLUSTERING
BY INTERLEAVED ORDER (EMP_ID, DEPT_ID)
YES ON LOAD YES ON DATA MOVEMENT;
During loads, a z-order
function operates on the values of emp_id and dept_id of record and the physical
data placement as per the calculated z-order.
Due to z-ordering the
ordering is not simply by employee id and within that, by dept_id, each of the
two column’s data values will have a weight and affect the record’s position in
the sort.
Consider a source table
having these records.
Going from 1st
record to 5th the department id assigned is random.
We can see the result of an
interleaved ordered clustering by loading this source data into our modified
employee table which is now clustered by interleaved ordering of (emp_id,
dept_id). Loading the data into the interleaved-ordered table…
SQL>
insert /*+ append */ into scott.employees
select * from scott.trial2;
2
5
rows created.
Rank # 1 goes to record
with emp_id = 2 and dept_id =1.
Rank # 2 goes to record
with emp_id = 3, dept_id = 2.
Rank # 3 goes to record
with emp_id=5, dept_id = 1.
The z-ordering effect is seen
here; its considering both columns during the sort.
Adding, a record with
emp_id=dept_id=1 and a second one with emp_id=dept_id=5 to the source table and
re-loading the resulting set, we get this:
Again, it looks like the
z-ordering is assign a weight to data points from each of the columns
specified, and computing a sort position for the record by considering both weights. Therefore, where a record is located is depends on the relative weights
of its data point columns.
It seems to achieve a wider
spread of different data points across the storage space, as compared to linear
ordered clustering as applied to the same set of input data.
As per Oracle 12c
documentation, Data
Warehousing And Business Intelligence Guide, Chapter 12, this kind of
clustering is favorable to queries with predicate on emp_id and/or dept_id.
Conclusion:
In this document we
explored the attribute clustering feature’s basics, and seen how it works. Due
to its very nature, attribute clustering is suited to data warehousing type of
applications where there is little conventional DML and the tables are subject
to frequent bulk data loading.
z-order, i don't understand.
ReplyDeleteRequest to more detail explanation
My understanding of z-ordering is that it is a kind of weighted sort order.
ReplyDeleteIt considers the row's column data and computes a kind of rank.
This rank seems to determine the physical placement of data.
The rows with similar values of specified columns would tend to be co-located (or clustered).
Typically a row of data in a database has a couple of key columns which can be considered as its dimensions.
In the example above, perhaps, emp_id, and dept_id can be considered as 2 dimensions of the data.
A z-order is a sort rule that considers both the dimensions.
If we consider another case - a computer screen's windows.
Each window on a computer screen has a specific length attribute, and a breadth attribute.
Each window also has a position for its top-left corner.
And we can regard the front-to-back placement of a window relative to others, as yet another property, or dimension, its z-order rank
Perhaps a dimension that we'd plot along a z-axis.
The Windows taskbar can be considered as a window with a special kind of z-dimension -- it typically stays on top of other windows.
The taskbar can be regarded as having a z-dimension of 1, its the top-most and no window can obscure it, you want instant access to it always.
The current non-taskbar window on top could perhaps, be regarded as having a z-dimension of 2, the one behind it 3, etc.
So a windows's could have these dimensions, much like a row of data has dimension columns.
Length
Breadh
Screen Top left corner co-ordinates.
Its z-order rank (Front-to-back position tank relative to other windows)
You could sort your collection of windows (or records, analogously) on the above properties and rank them low to high.
Perhaps you want the top-most window to rank higher, and within that, prioritize screen top left position.
Then you'd give its z-order rank max weightage, and after that to the screen top left co-ordinate in your sort.
And coming to a table's row, such a sort order could perhaps, be used to determine the on-disk location.
Coming to Oracle, Hash partitioning perhaps, accomplishes something similar, where you compute a hash for a specific column-set.
And the hash value (typically) determines the partition to which the row goes.