Tuesday, May 22, 2012

Skewed Indexes

This example looks at skewed indexes and was tested on Oracle 10.2.0.3.0. First create a very basic salary table. Start off with 1 column to store the gender:

SQL> create table salaries
  2  (gender varchar2(1))
  3  pctfree 50
  4  /
 
Table created.
 
SQL>

Insert 10000 rows in the table. Approximately 1% will have a gender of F (for female) and the rest will be M (for male):

SQL> begin
  2  for a in 1..10000 loop
  3  insert into salaries values
  4  (decode(ceil(dbms_random.value*100),1,'F','M'));
  5  end loop;
  6  end;
  7  /
 
PL/SQL procedure successfully completed.
 
SQL>

Check the distribution of values is as expected:

SQL> select gender, count(*) from salaries
  2  group by gender
  3  /
 
GENDER   COUNT(*)
------ ----------
M            9902
F              98
 
SQL>
 
Add a salary column to the table:

SQL> alter table salaries add
  2  (salary number)
  3  /
 
Table altered.
 
SQL>

Now input random salaries. Males have a salary between 1 and 20000. Females have a salary between 1 and 40000:

SQL> update salaries
  2  set salary = ceil(dbms_random.value*20000)
  3  where gender = 'M'
  4  /
 
9902 rows updated.
 
SQL> update salaries
  2  set salary = ceil(dbms_random.value*40000)
  3  where gender = 'F'
  4  /
 
98 rows updated.
 
SQL>

Create an index on the gender column. There are far more rows for males than there are for females so the data and the associated index are said to be skewed:

SQL> create index gender_index
  2  on salaries(gender)
  3  /
 
Index created.
 
SQL>
 
Now, without analyzing the data, work out the average salary for each gender. Oracle does some dynamic sampling. This tells it to do a full table scan for the males and an index lookup for the females, which is what you would expect with the data we have created:

SQL> set autotrace on
SQL> select avg(salary) from salaries
  2  where gender = 'M'
  3  /
 
AVG(SALARY)
-----------
  10030.627
 
Execution Plan
----------------------------------------------------------
Plan hash value: 793938830
 
-------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |     1 |    15 |     9   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |          |     1 |    15 |            |          |
|*  2 |   TABLE ACCESS FULL| SALARIES |  9902 |   145K|     9   (0)| 00:00:01 |
-------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter("GENDER"='M')
 
Note
-----
   - dynamic sampling used for this statement
 
Statistics
----------------------------------------------------------
          9  recursive calls
          0  db block gets
         70  consistent gets
          5  physical reads
          0  redo size
        537  bytes sent via SQL*Net to client
        491  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed
 
SQL> select avg(salary) from salaries
  2  where gender = 'F'
  3  /
 
AVG(SALARY)
-----------
  18762.898
 
Execution Plan
----------------------------------------------------------
Plan hash value: 3244963775
 
---------------------------------------------------------------------------------------------
| Id  | Operation                    | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |              |     1 |    15 |     2   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE              |              |     1 |    15 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID| SALARIES     |    98 |  1470 |     2   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN          | GENDER_INDEX |    98 |       |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   3 - access("GENDER"='F')
 
Note
-----
   - dynamic sampling used for this statement

Statistics
----------------------------------------------------------
          7  recursive calls
          0  db block gets
         64  consistent gets
          0  physical reads
          0  redo size
        537  bytes sent via SQL*Net to client
        491  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed
 
SQL> set autotrace off
 
Now analyze the table. Note that this creates 4 entries in DBA_HISTOGRAMS. I assume Oracle uses these to record the spread of the data:
 
SQL> select count(*) from dba_histograms
  2  where owner = 'ORACLE'
  3  and table_name = 'SALARIES'
  4  /
 
  COUNT(*)
----------
         0
 
SQL> exec dbms_stats.gather_table_stats -
> (ownname=>'oracle', -
>  tabname=>'salaries', -
>  cascade=>true);
 
PL/SQL procedure successfully completed.
 
SQL> select count(*) from dba_histograms
  2  where owner = 'ORACLE'
  3  and table_name = 'SALARIES'
  4  /
 
  COUNT(*)
----------
         4
 
SQL>

Work out the average salary by gender again. Oracle does not have to do dynamic sampling this time, presumably because it now has the entries in DBA_HISTOGRAMS to help. (If you work out the average salary by gender twice, without running DBMS_STATS in between, Oracle does dynamic sampling both times – I checked.) I will return to this in future when I have a better understanding:

SQL> set autotrace on
SQL> select avg(salary) from salaries
  2  where gender = 'M'
  3  /
 
AVG(SALARY)
-----------
  10030.627
 
Execution Plan
----------------------------------------------------------
Plan hash value: 793938830
 
-------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |     1 |     6 |     9   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |          |     1 |     6 |            |          |
|*  2 |   TABLE ACCESS FULL| SALARIES |  9902 | 59412 |     9   (0)| 00:00:01 |
-------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter("GENDER"='M')
 
Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
         31  consistent gets
          0  physical reads
          0  redo size
        537  bytes sent via SQL*Net to client
        491  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed
 
SQL> select avg(salary) from salaries
  2  where gender = 'F'
  3  /
 
AVG(SALARY)
-----------
  18762.898

Execution Plan
----------------------------------------------------------
Plan hash value: 3244963775
 
---------------------------------------------------------------------------------------------
| Id  | Operation                    | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |              |     1 |     6 |     2   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE              |              |     1 |     6 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID| SALARIES     |    98 |   588 |     2   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN          | GENDER_INDEX |    98 |       |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   3 - access("GENDER"='F')
 
Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
         29  consistent gets
          0  physical reads
          0  redo size
        537  bytes sent via SQL*Net to client
        491  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed
 
SQL>

Summary: I also ran this example on Oracle 9.2.0.7.0. After creating the index, Oracle used it to work out the average salary for both genders. No dynamic sampling was attempted. After running DBMS_STATS, 4 new entries appeared in DBA_HISTOGRAMS but Oracle did a full table scan to work out the average salary for both genders. The behaviour under Oracle 10.2.0.3.0 is a great improvement.

No comments:

Post a Comment