One more reason why it is recommended to clone a database instead of importing data for testing

A few days ago, a customer encountered an application hang in their production database.

A minor application version upgrade was in progress. It was performed online, during the day, because the DDL changes were small.

There were four DDL statements to execute. The first three ran fine, but the problem occurred when the last one started.

It was just a column modification to allow NULL values.

Normally, this SQL runs in a few milliseconds, as it only modifies the data dictionary:

alter table TABLE_NAME modify COLUMN null;

But this time, the SQL took nearly 25 minutes to run, holding a library cache lock that prevented access to the table and caused the application to hang.
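For reference, sessions blocked behind such a DDL can usually be spotted by their wait event. A minimal diagnostic sketch using the standard v$session view (not part of the original investigation):

-- Sessions currently waiting on the library cache
select sid, event, seconds_in_wait
from v$session
where event like 'library cache%';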

The entire table was read and all its blocks were modified. The table contains more than 100 million rows.

So, why?


After some investigation, it turned out that the column had been added a few days earlier with this DDL:

alter table TABLE_NAME add COLUMN number default 0 not null;

Starting with 11g, this kind of statement is optimized: it only modifies the dictionary, storing the default value in a new dictionary table called ecol$.

When the table is queried, Oracle uses the information stored in ecol$ to populate the column value for rows that existed before the column was added.
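Conceptually, a read of the new column then behaves roughly like the following (a simplified sketch of the idea, not the exact internal rewrite):

-- Rows created before the DDL have no physical value for the column,
-- so the default stored in ecol$ is substituted on the fly:
select nvl(COLUMN, 0) from TABLE_NAME;  -- 0 = default fetched from ecol$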

There is a good explanation of this feature here: fasttrue-for-adding-columns-in-11g

This optimization takes effect only if the column is defined as NOT NULL.
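To make the distinction concrete (illustrative DDL, reusing the placeholder names above):

-- Optimized in 11g: dictionary-only change, default stored in ecol$
alter table TABLE_NAME add COLUMN number default 0 not null;

-- Not optimized in 11g: every existing row is physically updated
alter table TABLE_NAME add COLUMN number default 0;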

Changing this characteristic automatically disables the feature. And this is exactly what happened with our DDL, since it allows NULL values in the column.

So, to cancel the optimization, all the existing rows relying on it had to be updated with the stored default value.

And this is why the entire table was read and all its rows were updated.

Here is a quick example demonstrating this behaviour.
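The my_db_block_changes script used below is not reproduced in this post; a minimal sketch of what it might look like, assuming the standard v$mystat and v$statname views:

-- my_db_block_changes.sql (hypothetical reconstruction)
select s.value, n.name
from v$mystat s, v$statname n
where s.statistic# = n.statistic#
and n.name = 'db block changes';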

SQL>--
SQL>-- Create a simple table with 1000000 rows
SQL>--
SQL> create table TABLE_WITH_OPTIM as select rownum ID, sysdate DT from dual
     connect by level < 1000001;
Table created.
Elapsed: 00:00:02.89
SQL>@my_db_block_changes
 
     VALUE NAME
---------- ----------------------------------------------------------------
       526 db block changes
Elapsed: 00:00:00.01

SQL>--
SQL>-- Add a column (optimized ddl)
SQL>--
SQL> alter table  TABLE_WITH_OPTIM add NEW_COLUMN number default 0 not null;
Table altered.
Elapsed: 00:00:00.06
SQL> @my_db_block_changes
 
     VALUE NAME
---------- ----------------------------------------------------------------
       643 db block changes
Elapsed: 00:00:00.00
SQL>--
SQL>-- Duplicate the first table
SQL>--
SQL> create table TABLE_DUPLICATED as select * from TABLE_WITH_OPTIM;
Table created.
Elapsed: 00:00:06.79
SQL>--
SQL>-- The ecol$ table contains only one row for the first table
SQL>--
SQL> select object_name "Table Name", COLNUM "Col Number",
     utl_raw.cast_to_number(BINARYDEFVAL) "Default Value stored in ecol$"
     from dba_objects o, ecol$ c
     where object_type = 'TABLE' and object_name like 'TABLE_%'
     and object_id = tabobj#;
Table Name        Col Number Default Value stored in ecol$
----------------- ---------- --------------------------
TABLE_WITH_OPTIM           3                          0
Elapsed: 00:00:00.03
SQL>@my_db_block_changes
 
     VALUE NAME
---------- ----------------------------------------------------------------
      1216 db block changes
Elapsed: 00:00:00.00
SQL>--
SQL>-- Modify the duplicated table to allow null values
SQL>-- A few milliseconds to execute
SQL>--
SQL> alter table TABLE_DUPLICATED modify NEW_COLUMN null;
Table altered.
Elapsed: 00:00:00.05
SQL>@my_db_block_changes
 
     VALUE NAME
---------- ----------------------------------------------------------------
      1255 db block changes
Elapsed: 00:00:00.00
SQL>--
SQL>-- Modify the first table to allow null values
SQL>-- more than one minute to execute with more than 2000000 blocks changed
SQL>--
SQL> alter table TABLE_WITH_OPTIM  modify NEW_COLUMN null;
Table altered.
Elapsed: 00:01:11.89
SQL>@my_db_block_changes
 
     VALUE NAME
---------- ----------------------------------------------------------------
   2011885 db block changes                   


But shouldn't the problem have been detected on the pre-production database?

The pre-production database had been refreshed one day earlier precisely to validate the DDL changes.

The DDL statements ran on the pre-production database without any problem, in a few milliseconds.

But the database refresh was done with the Data Pump tool.

Production data was exported to a dump file and then imported into the pre-production database.
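Something along these lines (illustrative commands; the directory, schema, and file names are assumptions, not the customer's actual settings):

expdp system DIRECTORY=dp_dir SCHEMAS=app_schema DUMPFILE=prod.dmp LOGFILE=exp_prod.log
impdp system DIRECTORY=dp_dir SCHEMAS=app_schema DUMPFILE=prod.dmp LOGFILE=imp_preprod.log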

With this method, the column lost its optimized characteristic: the export runs a SELECT that materializes the default value from ecol$, so after the import the default is physically stored in every row.
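On the pre-production side, re-running the ecol$ query shown earlier would make this visible (illustrative check):

-- No row in ecol$ for the imported table: the stored default is gone,
-- so there is no optimization left to cancel when allowing NULLs
select count(*)
from dba_objects o, ecol$ c
where o.object_type = 'TABLE'
and o.object_name = 'TABLE_NAME'
and o.object_id = c.tabobj#;
-- returns 0 after a Data Pump import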

This is why the problem was not detected.


Conclusion

In the end, everything is quite logical, even if the result was disastrous.

It's one more reason why it's preferable to clone a database for testing (even for small databases) to avoid any unwanted side effects.

The Delphix tool, based on snapshot technology, seems to be an excellent solution for this.

Anyway, it’s always cool to see a nice chart showing us the hang thanks to Lighty 😉
