Saturday, September 5, 2009

ORA-00001: unique constraint (RMAN.TF_U2) violated

Last night, we added a new tempfile to our TEMP tablespace and started seeing the following error while running the RESYNC operation during rman backups --

RMAN> resync catalog;

starting full resync of recovery catalog
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of resync command on default channel at 10/05/2009 10:00:33
ORA-00001: unique constraint (RMAN.TF_U2) violated


The Oracle database and RMAN version is 10.2.0.4 on Sun Solaris 10. It appears to be caused by the fact that a tempfile with the same file# existed in the past but was subsequently dropped (after a switchover activity). The situation is documented in Note 5932181.8.
The catalog entry never got cleaned and still had the old entry for that file#.
The bug seems to be patched in version 11.1.

The workaround fix is rather straight.

1) Find out the DBID of the database (which you are trying to sync up w/ catalog). Also find out the tempfile # for the newly created tempfile.
PRD > select dbid from v$database;
DBID
----------
505738321

PRD > select ts#,file#,rfile#,CREATION_TIME from v$tempfile;
TS# FILE# RFILE# CREATION_TIME
---------- ---------- ---------- ---------------
3 1 1 01Mar2008 20:44
22 2 1 01Mar2008 20:44
3 3 2 01Mar2008 20:46
3 4 3 04Sep2009 19:51

4 rows selected.

From above, we see that file# 4 is the new file that was created and likely causing the unique constraint error in the catalog db.


2) Sqlplus into the catalog database as the rman user and find out the DBINC_KEY.
RCAT > select dbid,name,resetlogs_time,dbinc_key from rc_database_incarnation where dbid=505738321;

DBID NAME RESETLOGS DBINC_KEY
---------- ------------------------- --------- ----------
505738321 PWPRD 02-OCT-06 83583
505738321 PWPRD 16-FEB-06 83585
2 rows selected.


3) For the most recent DBINC_KEY, locate the record that is creating the unique constraint error.
RCAT > select dbinc_key, ts#,file#, rfile#, create_scn, create_time from tf where dbinc_key=83583;

DBINC_KEY TS# FILE# RFILE# CREATE_SCN CREATE_TI
---------- ---------- ---------- ---------- ---------- ---------
83583 3 1 1 1.0103E+10 01-MAR-08
83583 22 2 1 1.0103E+10 01-MAR-08
83583 3 3 2 1.0103E+10 01-MAR-08
83583 3 4 3 1.5606E+10 13-NOV-08

4 rows selected.



4) From above, file# 4 seems to be an old tempfile entry from the past and should be deleted.
RCAT > delete from tf where DBINC_KEY=83583 and ts#=3 and FILE#=4 and rfile#=3;
1 row deleted.

RCAT > select dbinc_key, ts#,file#, rfile#, create_scn, create_time from tf where dbinc_key=83583;
DBINC_KEY TS# FILE# RFILE# CREATE_SCN CREATE_TI
---------- ---------- ---------- ---------- ---------- ---------
83583 3 1 1 1.0103E+10 01-MAR-08
83583 22 2 1 1.0103E+10 01-MAR-08
83583 3 3 2 1.0103E+10 01-MAR-08
3 rows selected.

RCAT > commit;
Commit complete.



5) Execute the RESYNC command now.
RMAN> resync catalog;

starting full resync of recovery catalog
full resync complete



6) Query the TF table again in the catalog db and it should now create a new entry for file#4.
RCAT > select dbinc_key, ts#,file#, rfile#, create_scn, create_time from tf where dbinc_key=83583;

DBINC_KEY TS# FILE# RFILE# CREATE_SCN CREATE_TI
---------- ---------- ---------- ---------- ---------- ---------
83583 3 1 1 1.0103E+10 01-MAR-08
83583 22 2 1 1.0103E+10 01-MAR-08
83583 3 3 2 1.0103E+10 01-MAR-08
83583 3 4 3 1.7702E+10 04-SEP-09

4 rows selected.

Friday, July 10, 2009

ORA-15033 while trying to add a LUN to a diskgroup

While adding a new LUN to an existing diskgroup, getting this error --
+ASM1 > alter diskgroup PWPRD_AUDIT add disk '/dev/rdsk/oracle/RAC_PWPRD_Audit05';
alter diskgroup PWPRD_AUDIT add disk '/dev/rdsk/oracle/RAC_PWPRD_Audit05'
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15033: disk '/dev/rdsk/oracle/RAC_PWPRD_Audit05' belongs to diskgroup "PWRPPRD_REDO"



Apparently this belonged to a diskgroup called PWRPPRD_REDO at some point. The diskgroup was still there in ASM in "dismounted" state.
ASM1 > select name,state,round(total_mb/1024) "total_gb",round(free_mb/1024) "free_gb" from v$asm_diskgroup order by 1;

NAME STATE total_gb free_gb
------------------ ----------- ---------- ----------
ARCHIVES MOUNTED 250 206
ARCHIVES01 MOUNTED 250 247
..
..
PWRPPRD_REDO DISMOUNTED 0 0
REDO01 MOUNTED 25 10
..
..


An attempt to mount the diskgroup results into the following error. Further it cannot be dropped.
+ASM1 > alter diskgroup PWRPPRD_REDO mount;
alter diskgroup PWRPPRD_REDO mount
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "1" is missing

+ASM1 > drop diskgroup PWRPPRD_REDO;
drop diskgroup PWRPPRD_REDO
*
ERROR at line 1:
ORA-15039: diskgroup not dropped
ORA-15001: diskgroup "PWRPPRD_REDO" does not exist or is not mounted


Since the diskgroup PWRPPRD_REDO is not used, we can recreate with the "force" option (and then drop it, so that it releases the LUN and the metadata is wiped out).
+ASM1 > create diskgroup PWRPPRD_REDO external redundancy disk '/dev/rdsk/oracle/RAC_PWPRD_Audit05' force;

Diskgroup created.

+ASM1 > select name,state,round(total_mb/1024) "total_gb",round(free_mb/1024) "free_gb" from v$asm_diskgroup order by 1;


NAME STATE total_gb free_gb
------------------ ----------- ---------- ----------
ARCHIVES MOUNTED 250 200
..
..
PWRPPRD_REDO MOUNTED 50 50
REDO01 MOUNTED 25 10
..
..
13 rows selected.

+ASM1 > drop diskgroup PWRPPRD_REDO;

Diskgroup dropped.



This releases the LUN and it can now be reused for the original purpose (ie for the PWPRD_AUDIT diskgroup).
+ASM1 > alter diskgroup PWPRD_AUDIT add disk '/dev/rdsk/oracle/RAC_PWPRD_Audit05';

Diskgroup altered.

Friday, May 15, 2009

Re-IP RAC hosts

Recently we went thru the exercise of doing a re-IP of our RAC hosts. There are white papers and other documentation available on this but the steps (syntactically) weren't clear if you were using "IPMP" (multi-pathing) on the public and private networks.

Here is a step-by-step approach that we followed -

Configuration
4-node RAC
Sun Solaris 10
Oracle 10.2.0.4

Steps
1) Check current nodeapps configurations

pwqa@db01-lisqa /home/pwqa] $ srvctl config nodeapps -n db01-lisqa -a
VIP exists.: /vip-db01-lisqa.ctn/192.168.222.44/255.255.255.0/ce0:ce2
[pwqa@db01-lisqa /home/pwqa] $ srvctl config nodeapps -n db02-lisqa -a
VIP exists.: /vip-db02-lisqa.ctn/192.168.222.45/255.255.255.0/ce0:ce2
[pwqa@db01-lisqa /home/pwqa] $ srvctl config nodeapps -n db03-lisqa -a
VIP exists.: /vip-db03-lisqa.ctn/192.168.222.46/255.255.255.0/ce0:ce2
[pwqa@db01-lisqa /home/pwqa] $ srvctl config nodeapps -n db04-lisqa -a
VIP exists.: /vip-db04-lisqa.ctn/192.168.222.47/255.255.255.0/ce0:ce2


As we can see, the old VIP are 192.168.222.[44-47] for our 4 node QA servers. Interfaces "ce0" and "ce2" are used for the public.


2) Stop databases, ASM instance on all the nodes.

[pwqa@db01-lisqa /home/pwqa] $ srvctl stop database -d PWQA
[pwqa@db01-lisqa /home/pwqa] $ srvctl stop database -d PLQA
[pwqa@db01-lisqa /home/pwqa] $ srvctl stop database -d CGQA
[pwqa@db01-lisqa /home/pwqa] $ srvctl stop database -d HLPR
[pwqa@db01-lisqa /home/pwqa] $ srvctl stop asm -n db01-lisqa
[pwqa@db01-lisqa /home/pwqa] $ srvctl stop asm -n db02-lisqa
[pwqa@db01-lisqa /home/pwqa] $ srvctl stop asm -n db03-lisqa
[pwqa@db01-lisqa /home/pwqa] $ srvctl stop asm -n db04-lisqa



3) Stop Nodeapps on all 4 nodes

[pwqa@db01-lisqa /home/pwqa] $ srvctl status nodeapps -n db01-lisqa
VIP is running on node: db01-lisqa
GSD is running on node: db01-lisqa
Listener is running on node: db01-lisqa
ONS daemon is running on node: db01-lisqa

[pwqa@db01-lisqa /home/pwqa] $ srvctl stop nodeapps -n db01-lisqa
[pwqa@db01-lisqa /home/pwqa] $ srvctl stop nodeapps -n db02-lisqa
[pwqa@db01-lisqa /home/pwqa] $ srvctl stop nodeapps -n db03-lisqa
[pwqa@db01-lisqa /home/pwqa] $ srvctl stop nodeapps -n db04-lisqa

[pwqa@db01-lisqa /home/pwqa] $ srvctl status nodeapps -n db01-lisqa
VIP is not running on node: db01-lisqa
GSD is not running on node: db01-lisqa
Listener is not running on node: db01-lisqa
ONS daemon is not running on node: db01-lisqa



4) Disable databases and ASM instance (so they don't automatically restart on a CRS restart)
[pwqa@db01-lisqa /home/pwqa] $ srvctl disable database -d HLPR
[pwqa@db01-lisqa /home/pwqa] $ srvctl disable database -d CGQA
[pwqa@db01-lisqa /home/pwqa] $ srvctl disable database -d PLQA
[pwqa@db01-lisqa /home/pwqa] $ srvctl disable database -d PWQA
[pwqa@db01-lisqa /home/pwqa] $ srvctl disable asm -n db01-lisqa
[pwqa@db01-lisqa /home/pwqa] $ srvctl disable asm -n db02-lisqa
[pwqa@db01-lisqa /home/pwqa] $ srvctl disable asm -n db03-lisqa
[pwqa@db01-lisqa /home/pwqa] $ srvctl disable asm -n db04-lisqa


[Note: I believe only disabling ASM should be sufficient since ASM is a dependency for the database resource (could be verified with crs_stat-p command)]
Make sure that nothing else is running except the CRS.


5) Perform the IP and DNS changes at the OS level. No reboot of servers needed at this time.


6) From any one node (preferably node 1), check the new VIPs (/etc/hosts) and the nodeapps configuration for the 4 RAC nodes
[Note: It might show the new VIP address for the local node from where you're working - we still need to run the "srvctl modify" to change the VIPs]

[pwqa@db01-lisqa /home/pwqa] $ grep vip /etc/hosts|grep lisqa
156.30.179.98 vip-db01-lisqa.ctn vip-db01-lisqa
156.30.179.102 vip-db02-lisqa.ctn vip-db02-lisqa
156.30.179.106 vip-db03-lisqa.ctn vip-db03-lisqa
156.30.179.110 vip-db04-lisqa.ctn vip-db04-lisqa

[pwqa@db01-lisqa /home/pwqa] $ srvctl config nodeapps -n db01-lisqa -a
VIP exists.: /vip-db01-lisqa.ctn/156.30.179.98/255.255.255.0/ce0:ce2
[pwqa@db01-lisqa /home/pwqa] $ srvctl config nodeapps -n db02-lisqa -a
VIP exists.: /vip-db02-lisqa.ctn/192.168.222.45/255.255.255.0/ce0:ce2
[pwqa@db01-lisqa /home/pwqa] $ srvctl config nodeapps -n db03-lisqa -a
VIP exists.: /vip-db03-lisqa.ctn/192.168.222.46/255.255.255.0/ce0:ce2
[pwqa@db01-lisqa /home/pwqa] $ srvctl config nodeapps -n db04-lisqa -a
VIP exists.: /vip-db04-lisqa.ctn/192.168.222.47/255.255.255.0/ce0:ce2


7) Run SRVCTL MODIFY to change the VIP addresses. Make sure to use the correct netmask.
Netmask could be found out from running the "ifconfig -a" command --
[pwqa@db01-lisqa /home/pwqa] $ ifconfig -a
lo0: flags=2001000849 mtu 8232 index 1 inet 127.0.0.1 netmask ff000000
ce0: flags=1000843 mtu 1500 index 2 inet 156.30.179.97 netmask fffffe00 broadcast 156.30.179.255 groupname main
...

The netmask here is fffffe00 (255.255.254.0).

Execute the SRVCTL MODIFY command for all nodes (from any one node) -

[pwqa@db01-lisqa /home/pwqa] $ sudo srvctl modify nodeapps -n db01-lisqa -A 156.30.179.98/255.255.254.0/"ce0|ce2"
[pwqa@db01-lisqa /home/pwqa] $ sudo srvctl modify nodeapps -n db02-lisqa -A 156.30.179.102/255.255.254.0/"ce0|ce2"
[pwqa@db01-lisqa /home/pwqa] $ sudo srvctl modify nodeapps -n db03-lisqa -A 156.30.179.106/255.255.254.0/"ce0|ce2"
[pwqa@db01-lisqa /home/pwqa] $ sudo srvctl modify nodeapps -n db04-lisqa -A 156.30.179.110/255.255.254.0/"ce0|ce2"



8) Modify the public IP using "oifcfg"
From any node (preferably node 1),

[pwqa@db01-lisqa /home/pwqa] $ . oraenv
ORACLE_SID = [CRS] ?
[pwqa@db01-lisqa /home/pwqa] $
[pwqa@db01-lisqa /home/pwqa] $ oifcfg getif
ce0 192.168.222.0 global public
ce2 192.168.222.0 global public


[Note: The private interfaces are not shown here as they are configured using IPMP]
We need to change from the old network (192.168..) to the new network (156.30..)

[pwqa@db01-lisqa /home/pwqa] $ oifcfg delif -global ce0
[pwqa@db01-lisqa /home/pwqa] $ oifcfg delif -global ce2
[pwqa@db01-lisqa /home/pwqa] $ oifcfg setif -global ce0/156.30.179.0:public
[pwqa@db01-lisqa /home/pwqa] $ oifcfg setif -global ce2/156.30.179.0:public



9) Reboot the servers (all 4 nodes) and verify things using "srvctl config nodeapps -n -a" and "oifcfg getif". They should reflect the right IPs now.


10) Update listener.ora (if it has any hard coded IP addresses) and enable ASM and database.

[pwqa@db01-lisqa /home/pwqa] $ srvctl enable asm -n db01-lisqa
[pwqa@db01-lisqa /home/pwqa] $ srvctl enable asm -n db02-lisqa
[pwqa@db01-lisqa /home/pwqa] $ srvctl enable asm -n db03-lisqa
[pwqa@db01-lisqa /home/pwqa] $ srvctl enable asm -n db04-lisqa
[pwqa@db01-lisqa /home/pwqa] $ srvctl enable database -d PWQA
[pwqa@db01-lisqa /home/pwqa] $ srvctl enable database -d PLQA
[pwqa@db01-lisqa /home/pwqa] $ srvctl enable database -d CGQA
[pwqa@db01-lisqa /home/pwqa] $ srvctl enable database -d HLPR


Finally, make sure anyone tnsnames.ora entries with hard coded IP addresses are updated.

Tuesday, April 21, 2009

ITIL Training and Certification

I had this 1-week training that talks about ITIL processes and then let you to take the ITIL exam and be "certified". Well, initially I had an opinion that it was a complete waste of time spending a week in learning a non-technical subject that you do not apply in your day-to-day events. But the impression and the whole perception got changed after sitting in the classroom with 22 other folks (mix of administrators, managers and other infrastructure staff) and getting the essence of what ITIL is all about and how it applies to your business and the job.

ITIL is IT Infrastructure Library and talks about the entire gamut of IT processes that we need to follow in the IT world. The business customer is always on the top list of priorities and all the process work hand-in-hand to meet or exceed the customer's expections. Its all about one word "SERVICES". Everyone has a defined role and follow a set of rules to help run the show smoothly. We learned that the 5 stages of ITIL (Service Strategy, Service Design, Service Transition, Service Operation and Continual Service Improvement) worked thru a common goal towards customer satisfaction and meet the defined service level agreements.

The thing that came as a surprise to me were the different agreements that are out in the IT service industry. We only talk of Service Level Agreements and that kinda implies all - whether it is internal, with the customer or with outside vendors (suppliers). ITIL classifies an internal agreement (within the different units of the same business) as OLA (Operational Level Agreement) and the agreement with Suppliers (vendors) as UC (Underpinning Contracts). SLA in principal, and according to the general understanding, is between the business and its customer(s).

Further the roles and responsibilites of different groups are clearly defined. The SLM (Service Level Manager) is responsible for negotating and writing up all these different agreements/contracts. However, the Supplier Manager is responsible to ensure that vendors obey the terms and conditions in the UC (and nail them down, if need be). The Change Manager is a separate role than the Configuration Manager. We did a lot of interactive exercises and projects to reinforce the learning and it truly was fun!

There are tons of new terms and abbreviations that one will find out but they all make sense at the end of the course. So far when someone said "DML" word, Data Manipulation Language statement came to mind. ITIL uses DML for Definitive Media Library - an authoratitive and well protected source of Software (including the master copy of a software together with the License, documentation and any kind of agreements etc). Service value is defined in terms of Service Utility and Service Warranty which have different purposes. ITIL talks about KPIs (Key Performance Indicators) and CSFs (Critical Success Factors) to measure services. Then there is this Service Portfolio that covers the planned services (Pipeline), the existing services (the Service Catalog) and the retired services. There are teams with defined roles and responsibilites per ITIL but, in a practical scenario, I see a person wearing different hats if following the ITIL protocol. The Availability Management and Access Management are separate from the IT Service Continuity Management (ITSCM) whereas in most cases (where a business is small with small IT staff), it is pretty much the same that handles the services availibility, the access into the systems (security etc) and any Disaster Recovery (DR) solutions.
ITIL has roots somewhere in Europe and the way I understood is that ITIL is to Infrastructre team as CMMI is to Software Engg. in general.

Don't know how much weight that has on your resume but it certainly was a good experience from learning perspective!!

More info on ITIL is available here --
http://www.itsmf.co.uk/web/FILES/News/itSMF_ITILV3_Intro_Overview%5B1%5D.pdf




Monday, March 23, 2009

SRVCTL fails to start a RAC database

SRVCTL has problems while starting the database (SRVCTL STOP DATABASE works ok).
Also, the database can be started and stopped fine using SQL*Plus. No problems there.

$ srvctl start database -d HLPR
PRKP-1001 : Error starting instance HLPR1 on node db01-lisqa
CRS-0215: Could not start resource 'ora.HLPR.HLPR1.inst'.
PRKP-1001 : Error starting instance HLPR2 on node db02-lisqa
CRS-0215: Could not start resource 'ora.HLPR.HLPR2.inst'.
PRKP-1001 : Error starting instance HLPR3 on node db03-lisqa
CRS-0215: Could not start resource 'ora.HLPR.HLPR3.inst'.
PRKP-1001 : Error starting instance HLPR4 on node db04-lisqa
CRS-0215: Could not start resource 'ora.HLPR.HLPR4.inst'.
$

As seen above, it is a 4-node RAC database.

2 issues here

1) REQUIRED_RESOURCE missing for one of the nodes
Doing a CRS_STAT on the above resources, shows that the REQUIRED_RESOURCES for node 4 instance (HLPR4) is missing. For the other 3 nodes the ASM dependency is defined. It is already verified that ASM is up and running on the 4th node (also there are other RAC databases on the same cluster that do not have any issues).
$ crs_stat -p ora.HLPR.HLPR1.inst|grep REQUIRED
REQUIRED_RESOURCES=ora.db01-lisqa.ASM1.asm
$ crs_stat -p ora.HLPR.HLPR2.inst|grep REQUIRED
REQUIRED_RESOURCES=ora.db02-lisqa.ASM2.asm
$ crs_stat -p ora.HLPR.HLPR3.inst|grep REQUIRED
REQUIRED_RESOURCES=ora.db03-lisqa.ASM3.asm
$ crs_stat -p ora.HLPR.HLPR4.inst|grep REQUIRED
REQUIRED_RESOURCES=
$


Fix

Add the REQUIRED_RESOURCE using "srvctl modify". First lets look at the syntax (-h for help) -
$ srvctl modify instance -h
Usage: srvctl modify database -d [-n ] [-m ] [-p ] [-r {PRIMARY | PHYSICAL_STANDBY | LOGICAL_STANDBY}] [-s ] [-y {AUTOMATIC | MANUAL}]
-d Unique name for the database
-n Database name (DB_NAME), if different from the unique name given by the -d option
-o ORACLE_HOME path
-m Domain for cluster database
-p Server parameter file path
-r Role of the database (primary, physical_standby, logical_standby)
-s Startup options for the database
-y Management policy for the database (automatic, manual)
-h Print usage
$
$ srvctl modify instance -d HLPR -i HLPR4 -s +ASM4
$
$ crs_stat -p ora.HLPR.HLPR4.inst|grep REQUIRE
REQUIRED_RESOURCES=ora.db04-lisqa.ASM4.asm
$



2) SPFILE was incorrectly defined in the OCR
The imon log for the database under /log//racg shows this --
$ cd $ORACLE_HOME/log//racg
$ vi imon_HLPR.log
...
...
2009-03-19 20:36:02.729: [ RACG][1] [8400][1][ora.HLPR.HLPR1.inst]: racgimon exiting

2009-03-19 20:43:34.680: [ RACG][1] [12432][1][ora.HLPR.HLPR1.inst]: racgimon started

2009-03-19 20:43:37.137: [ RACG][6] [12432][6][ora.HLPR.HLPR1.inst]:
SQL*Plus: Release 10.2.0.4.0 - Production on Thu Mar 19 20:43:36 2009

Copyright (c) 1982, 2007, Oracle. All Rights Reserved.
Enter user-name: Connected to an idle instance.

SQL> ORA-01565: error in identifying file '+RAC_REDO02/HLPR/spfileHLPR.ora'

2009-03-19 20:43:37.137: [ RACG][6] [12432][6][ora.HLPR.HLPR1.inst]: ORA-17503: ksfdopn:2 Failed to open file
+RAC_REDO02/HLPR/spfileHLPR.ora
ORA-15056: additional error message
ORA-17503: ksfdopn:DGOpenFile05 Failed to open file +RAC_REDO02/hlpr/spfile
ORA-01078: failure in processing system parameters
SQL> Disconnected
This shows that the SPFILE was incorrectly defined in the OCR

Fix
Modified OCR for the correct location of the SPFILE -
$ cat $ORACLE_HOME/dbs/initHLPR1.ora
spfile='+HLPR_DATA/HLPR/PARAMETERFILE/spfile.286.640265465'
$
$ srvctl modify database -d HLPR -p +HLPR_DATA/HLPR/PARAMETERFILE/spfile.286.640265465
$


This fixed the OCR and database could then be started using SRVCTL.

There is also one more issue that I noticed today in the listener configuration w/ CRS. On node4, the listener was up under db_home (typically it is seen under ASM home). Doing a "ps -ef|grep tns" will tell what OH is being used for the listener home. Also, the crs_stat -p output shows that.
[pwqa@db04-lisqa /home/pwqa] $ ps -ef|grep tns
pwqa 1982 1 0 May 13 ? 2:06 /d01/app/pwqa/product/10.2/pwqadb/bin/tnslsnr LISTENER_DB04-LISQA -inherit
pwqa 1965 1 0 May 13 ? 5:49 /d01/app/pwqa/product/10.2/pwqadb/bin/tnslsnr LISTENER_CGQA_DB04-LISQA -inherit
pwqa 2007 1 0 May 13 ? 4:33 /d01/app/pwqa/product/10.2/pwqadb/bin/tnslsnr LISTENER_HLPR_DB04-LISQA -inherit
pwqa 2041 1 0 May 13 ? 5:43 /d01/app/pwqa/product/10.2/pwqadb/bin/tnslsnr LISTENER_PWQA_DB04-LISQA -inherit
pwqa 2024 1 0 May 13 ? 6:33 /d01/app/pwqa/product/10.2/pwqadb/bin/tnslsnr LISTENER_PLQA_DB04-LISQA -inherit
pwqa 22092 23479 0 11:18:42 pts/1 0:00 grep tns



As we can see from above, listener "
LISTENER_DB04-LISQA" is running under the database home (/d01/app/pwqa/product/10.2/pwqadb). The crs_stat -p output for this resource shows this --
[pwqa@db04-lisqa /home/pwqa] $ crs_stat -p ora.db04-lisqa.LISTENER_DB04-LISQA.lsnr
NAME=ora.db04-lisqa.LISTENER_DB04-LISQA.lsnr
TYPE=application
ACTION_SCRIPT=/d01/app/pwqa/product/10.2/pwqadb/bin/racgwrap
ACTIVE_PLACEMENT=0
AUTO_START=1
CHECK_INTERVAL=600
DESCRIPTION=CRS application for listener on node
FAILOVER_DELAY=0
FAILURE_INTERVAL=0
FAILURE_THRESHOLD=0
HOSTING_MEMBERS=db04-lisqa
OPTIONAL_RESOURCES=
PLACEMENT=restricted
REQUIRED_RESOURCES=ora.db04-lisqa.vip
RESTART_ATTEMPTS=5
SCRIPT_TIMEOUT=600
START_TIMEOUT=0
STOP_TIMEOUT=0
UPTIME_THRESHOLD=7d
USR_ORA_ALERT_NAME=
USR_ORA_CHECK_TIMEOUT=0
USR_ORA_CONNECT_STR=/ as sysdba
USR_ORA_DEBUG=0
USR_ORA_DISCONNECT=false
USR_ORA_FLAGS=
USR_ORA_IF=
USR_ORA_INST_NOT_SHUTDOWN=
USR_ORA_LANG=
USR_ORA_NETMASK=
USR_ORA_OPEN_MODE=
USR_ORA_OPI=false
USR_ORA_PFILE=
USR_ORA_PRECONNECT=none
USR_ORA_SRV=
USR_ORA_START_TIMEOUT=0
USR_ORA_STOP_MODE=immediate
USR_ORA_STOP_TIMEOUT=0
USR_ORA_VIP=

[pwqa@db04-lisqa /home/pwqa] $


The action script clearly points to the database home.
To change this, we again use the "srvctl modify" command.

[pwqa@db04-lisqa /home/pwqa] $ srvctl modify listener -h
Usage: srvctl modify listener -n [-l ] -o
-n Node name
-l "" Comma separated listener names
-o ORACLE_HOME path
-h Print usage
[pwqa@db04-lisqa /home/pwqa] $

[pwqa@db04-lisqa /home/pwqa] $ srvctl modify listener -n db04-lisqa -l LISTENER_DB04-LISQA -o /d01/app/pwqa/product/10.2/ASM

[pwqa@db04-lisqa /home/pwqa] $ crs_stat -p ora.db04-lisqa.LISTENER_DB04-LISQA.lsnr
NAME=ora.db04-lisqa.LISTENER_DB04-LISQA.lsnr
TYPE=application
ACTION_SCRIPT=/d01/app/pwqa/product/10.2/ASM/bin/racgwrap
ACTIVE_PLACEMENT=0
AUTO_START=1
CHECK_INTERVAL=600
DESCRIPTION=CRS application for listener on node
FAILOVER_DELAY=0
FAILURE_INTERVAL=0
FAILURE_THRESHOLD=0
HOSTING_MEMBERS=db04-lisqa
OPTIONAL_RESOURCES=
PLACEMENT=restricted
REQUIRED_RESOURCES=ora.db04-lisqa.vip
RESTART_ATTEMPTS=5
SCRIPT_TIMEOUT=600
START_TIMEOUT=0
STOP_TIMEOUT=0
UPTIME_THRESHOLD=7d
USR_ORA_ALERT_NAME=
USR_ORA_CHECK_TIMEOUT=0
USR_ORA_CONNECT_STR=/ as sysdba
USR_ORA_DEBUG=0
USR_ORA_DISCONNECT=false
USR_ORA_FLAGS=
USR_ORA_IF=
USR_ORA_INST_NOT_SHUTDOWN=
USR_ORA_LANG=
USR_ORA_NETMASK=
USR_ORA_OPEN_MODE=
USR_ORA_OPI=false
USR_ORA_PFILE=
USR_ORA_PRECONNECT=none
USR_ORA_SRV=
USR_ORA_START_TIMEOUT=0
USR_ORA_STOP_MODE=immediate
USR_ORA_STOP_TIMEOUT=0
USR_ORA_VIP=

[pwqa@db04-lisqa /home/pwqa] $


The CRS registration has been changed.
Now we only need to stop the listener and restart it under the ASM home.






Sunday, January 4, 2009

11g Replay using DBMS_WORKLOAD_REPLAY

The other day I was playing around with the 11g Database Replay features and came across Arup Nanda's 11g-top-features articles on oracle site. Very very Nice! It goes in great depth detailing every step neatly with nice screen shots using Database Control. I went thru the steps and everything went clean!
Next, I decided to repeat all the steps using the DBMS_WORKLOAD_REPLAY packages and had to follow the documentation and a couple ML notes (mostly Note 445116.1 and Note 560977.1 ).
There were a few places that I stumbled upon. I created a small test scenario with steps, so here it goes.

Preparing user account and database objects
Create a USER account with the right privs that will have the privs to run the CAPTURE and REPLAY procedures.
CREATE USER ORACLE IDENTIFIED BY oracle;
GRANT CREATE SESSION, RESOURCE to ORACLE;
GRANT EXECUTE ON DBMS_WORKLOAD_CAPTURE TO oracle;
GRANT EXECUTE ON DBMS_WORKLOAD_REPLAY TO oracle;

GRANT CREATE SESSION TO oracle;
GRANT CREATE ANY DIRECTORY TO oracle;
GRANT SELECT_CATALOG_ROLE TO oracle;
GRANT BECOME USER TO oracle;
GRANT UNLIMITED TABLESPACE TO oracle;



Create a test table. To verify if DMLs generated due to trigger actions are also part of captured replay, I've created a child table and a trigger also.
create table ORACLE.test (
column_1 number,
column_2 varchar2(600)
);
create table ORACLE.test_child (
column_1 number,
column_2 varchar2(600)
);


create trigger ORACLE.tr_test after insert on ORACLE.test
for each row
begin
insert into ORACLE.test_child values (:new.column_1, :new.column_2);
end;
/



Creating a Restore Point

Since this is a test database, it will be flashed back to run the REPLAY. Create a restore point now.
drop restore point BEFORE_CAPTURE;
create restore point BEFORE_CAPTURE;


Adding a FILTER
Add a filter so that only transactions executed under ORACLE id are captured
exec DBMS_WORKLOAD_CAPTURE.DELETE_FILTER('Test_Replay');
begin
DBMS_WORKLOAD_CAPTURE.ADD_FILTER (fname => 'Test_Replay',
fattribute =>'USER',
fvalue =>'ORACLE');
end;
/


Starting the CAPTURE process
The capture process could be started using DBMS_WORKLOAD_CAPTURE.START_CAPTURE. The capture statistics could be obtained by querying DBA_WORKLOAD_CAPTURES view. A value of "EXCLUDE" in the default_action indicates that only things that "pass" the filter should be captured (otherwords, only actions performed by the ORACLE id should be captured while it is running).

begin
DBMS_WORKLOAD_CAPTURE.START_CAPTURE (name => 'Test_Capture',
dir => 'DBCAPTURE',
default_action => 'EXCLUDE');
end;
/

SELECT id,name,start_time,end_time,start_scn, end_scn, duration_secs, filters_used, capture_size FROM dba_workload_captures order by start_time desc;


You might hit the following errors while starting the CAPTURE process -

RGTEST > begin
2 DBMS_WORKLOAD_CAPTURE.START_CAPTURE (name => 'Test_Capture',
3 dir => 'DBCAPTURE',
4 default_action => 'EXCLUDE');
5 end;
6 /
begin
*
ERROR at line 1:
ORA-15505: cannot start workload capture because instance 1 encountered errors while accessing directory
"/home/oracle/dbcapture"
ORA-06512: at "SYS.DBMS_WORKLOAD_CAPTURE", line 799
ORA-06512: at line 2

=> Make sure that oracle owns (or has the write access) to that os directory. Also the directory should be fully empty.

RGTEST > SELECT id,name, start_time,end_time,start_scn, end_scn, duration_secs, filters_used, capture_size FROM dba_workload_captures order by start_time desc;

ID NAME START_TIME END_TIME START_SCN END_SCN DURATION_SECS FILTERS_USED
---------- ------------------------- --------------- --------------- ---------- ---------- ------------- ------------
CAPTURE_SIZE
------------
51 Test_Capture 03Jan2009 10:26 24971410 24971673 182 1
313174


Execute the WORKLOAD
To execute the workload, I generated a bunch of INSERT statements.
set termout off
spool ./trans.sql
set head off feed off pagesize 0 linesize 550
select 'insert into oracle.test values ('||rownum||','||''''||rpad('AZ',500,'ZA')||''');' from dba_objects where rownum < style="color: rgb(102, 102, 204);" face="georgia">These were executed as user "ORACLE".


RGTEST > select count(*) from oracle.test;

COUNT(*)
----------
1000

1 row selected.

RGTEST > select count(*) from oracle.test_child;

COUNT(*)
----------
1000

1 row selected.



Stopping the CAPTURE process
The capture process could also be stopped using DBMS_WORKLOAD_CAPTURE.START_CAPTURE
RGTEST > exec DBMS_WORKLOAD_CAPTURE.FINISH_CAPTURE();

PL/SQL procedure successfully completed.



Export the AWR for capture statistics
Export the AWR if you need to compare any stats etc after the replay has been done. You'd need the capture_id (queried earlier) for this.
RGTEST > begin
2 DBMS_WORKLOAD_CAPTURE.EXPORT_AWR(51);
3 end;
4 /

PL/SQL procedure successfully completed.


Following files should be created in the directory referenced by the DBCAPTURE directory object -
-rw-r----- 1 oracle dba 14413824 Jan 3 10:35 wcr_ca.dmp
-rw-r--r-- 1 oracle dba 28458
Jan 3 10:35 wcr_ca.log
-rw-r--r-- 1 oracle dba 17246
Jan 3 10:30 wcr_cr.text
-rw-r--r-- 1 oracle dba 35502
Jan 3 10:30 wcr_cr.html
-rw-r--r-- 1 oracle dba 208
Jan 3 10:29 wcr_fcapture.wmd
-rw-r--r-- 1 oracle dba 2975
Jan 3 10:29 wcr_4m42t64002s47.rec
-rw-r--r-- 1 oracle dba 69
Jan 3 10:28 wcr_4m42tsw002sa4.rec
-rw-r--r-- 1 oracle dba 303553
Jan 3 10:28 wcr_4m42tnh002s92.rec
..
..

Looking at export log (wcr_ca.log) indicates that all the WRH tables get exported
. . exported "SYS"."WRH$_SYSMETRIC_HISTORY" 13.46 KB 100 rows
. . exported "SYS"."WRH$_SQL_PLAN" 6.355 MB 26787 rows
. . exported "SYS"."WRH$_SYSMETRIC_SUMMARY" 32.19 KB 296 rows
. . exported "SYS"."WRH$_ENQUEUE_STAT" 18.02 KB 174 rows
. . exported "SYS"."WRH$_WAITCLASSMETRIC_HISTORY" 9.429 KB 0 rows


We may also run a report on capture using DBMS_WORKLOAD_CAPTURE.REPORT.
A snip of it is shown below --
RGTEST > select dbms_workload_capture.report(51,'TEXT') from dual;

Database Capture Report For RGTEST

DB Name DB Id Release RAC Capture Name Status
------------ ----------- ----------- --- -------------------------- ----------
RGTEST 2815669595 11.1.0.6.0 NO Test_Capture COMPLETED


Start time: 03-Jan-09 10:26:25 (SCN = 24971410)
End time: 03-Jan-09 10:29:27 (SCN = 24971673)
Duration: 3 minutes 2 seconds
Capture size: 305.83 KB
Directory object: DBCAPTURE
Directory path: /home/oracle/dbcapture
Directory shared in RAC: TRUE
Filters used: 1 INCLUSION filter

Captured Workload Statistics DB: RGTEST Snaps: 4390-4391
-> 'Value' represents the corresponding statistic aggregated
across the entire captured database workload.
-> '% Total' is the percentage of 'Value' over the corresponding
system-wide aggregated total.

Statistic Name Value % Total
---------------------------------------- ------------- ---------
DB time (secs) 2.07 64.69
Average Active Sessions 0.01
User calls captured 2054 72.71
User calls captured with Errors 0
Session logins 5 35.71
Transactions 3 4.35
-------------------------------------------------------------

Top Events Captured DB: RGTEST Snaps: 4390-4391

Avg Active
Event Event Class % Event Sessions
----------------------------------- --------------- ---------- ----------
CPU + Wait for CPU CPU 50.00 0.01
-------------------------------------------------------------



Pre-processing the WORKLOAD
Before REPLAY can be done, the workload information needs to be "pre-processed". Essentially, this creates a .pp file (wcr_login.pp) in the capture directory.

RGTEST > BEGIN
2 DBMS_WORKLOAD_REPLAY.PROCESS_CAPTURE
3 (capture_dir => 'DBCAPTURE');
4 END;
5 /

PL/SQL procedure successfully completed.



Preparing for a REPLAY
In a practical scenario, the CAPTURE will run in production and the REPLAY will be tested in a QA/TST type environment (to check potential impact of any database/application changes) . However in our test scenario, we'll simply flashback the TEST database to the restore point that was created earlier.
RGTEST > connect / as sysdba
Connected.

RGTEST > shutdown immediate
Database closed.
Database dismounted.
ORACLE instance shut down.

RGTEST > startup mount
ORACLE instance started.

Total System Global Area 954155008 bytes
Fixed Size 1303916 bytes
Variable Size 692062868 bytes
Database Buffers 255852544 bytes
Redo Buffers 4935680 bytes
Database mounted.
RGTEST > flashback database to restore point BEFORE_CAPTURE;

Flashback complete.

RGTEST > alter database open resetlogs;

Database altered.

RGTEST > select count(*) from oracle.test;

COUNT(*)
----------
0

1 row selected.

RGTEST > select count(*) from oracle.test_child;

COUNT(*)
----------
0

1 row selected.


In my test, I also drop the TRIGGER to see if the DML changes generated from the trigger action were captured and will be replayed.
RGTEST > drop trigger ORACLE.TR_TEST;

Trigger dropped.

Executing a REPLAY of the workload captured earlier
The REPLAY needs to be "initialized" and "prepared" before replay clients can be started.
RGTEST > begin
2 DBMS_WORKLOAD_REPLAY.INITIALIZE_REPLAY(
3 replay_name=> 'TEST_REPLAY',
4 replay_dir=>'DBCAPTURE');
5 end;
6 /


PL/SQL procedure successfully completed.

RGTEST > exec DBMS_WORKLOAD_REPLAY.PREPARE_REPLAY();

PL/SQL procedure successfully completed.


From another windows session, we can start the replay clients (wrc).
[oracle@dbasandbox /home/oracle/ravi/DB_REPLAY] $ wrc system/ mode=replay replaydir=/home/oracle/dbcapture
Workload Replay Client: Release 11.1.0.6.0 - Production on Sat Jan 3 15:59:57 2009
Copyright (c) 1982, 2007, Oracle. All rights reserved.

Wait for the replay to start (15:59:57)

Start the REPLAY using START_REPLAY procedure of DBMS_WORKLOAD_REPLAY package.
RGTEST > BEGIN
2 DBMS_WORKLOAD_REPLAY.START_REPLAY ();
3 end;
4 /

PL/SQL procedure successfully completed.

The window where you started the replay client now shows that replay has been started.
[oracle@dbasandbox /home/oracle/ravi/DB_REPLAY] $ wrc system/ mode=replay replaydir=/home/oracle/dbcapture
Workload Replay Client: Release 11.1.0.6.0 - Production on Sat Jan 3 15:59:57 2009
Copyright (c) 1982, 2007, Oracle. All rights reserved.

Wait for the replay to start (15:59:57)

Replay started (16:01:11)
Replay finished (16:06:26)



Database view DBA_WORKLOAD_REPLAY may be queried for getting the replay_id (for exporting AWR stats etc) and other stats.
RGTEST > select id,name, start_time,end_time,duration_secs FROM DBA_WORKLOAD_REPLAYS;

ID NAME START_TIM END_TIME DURATION_SECS
---------- ------------------------- --------- --------- -------------
13 TEST_REPLAY 03-JAN-09 03-JAN-09 193

3 rows selected.


A running REPLAY may be stopped (if its running for quite a long time and you need to cancel it) by executing the CANCEL_REPLAY procedure. In our case, it returned an error since it already completed earlier.
RGTEST > exec DBMS_WORKLOAD_REPLAY.CANCEL_REPLAY ();
BEGIN DBMS_WORKLOAD_REPLAY.CANCEL_REPLAY (); END;

*
ERROR at line 1:
ORA-20223: No workload replay is in progress!
ORA-06512: at "SYS.DBMS_WORKLOAD_REPLAY", line 1578
ORA-06512: at "SYS.DBMS_WORKLOAD_REPLAY", line 1633
ORA-06512: at line 1


Querying the 2 tables indicate only the primary table gets populated with data. The child table that was earlier populated by the triggering action does not get populated in our replay (note: we dropped the trigger earlier to check if the DMLs generated by the triggering action get captured).
RGTEST > select count(*) from oracle.test;
COUNT(*)
----------
100

1 row selected.

RGTEST > select count(*) from oracle.test_child;
COUNT(*)
----------
0

1 row selected.



Export the AWR for replay statistics
Like earlier, AWR stats can be exported for replay statistics also. We need a replay_id to do this however (queried earlier).
RGTEST > exec DBMS_WORKLOAD_REPLAY.EXPORT_AWR (replay_id => 13);
PL/SQL procedure successfully completed.


This completes the REPLAY exercise. I need to play a bit more with the excellent 11g feature and will post anything new I learn here.