Monday, November 12, 2007

Resync RAC standby database using ASM

If a RAC standby database using ASM (and also standby logfiles in our case) stays down for a little while, it will likely go behind the primary and under ASM of the standby cluster you'll see these "gaps" in sequence numbers --
ASMCMD [+DATA/PROD/ARCHIVELOG/2007_11_12] > ls
thread_1_seq_652.618.638445679
thread_1_seq_654.619.638466159
thread_1_seq_656.620.638466307
thread_1_seq_658.623.638467267
thread_1_seq_659.624.638467413
thread_1_seq_660.625.638467431
thread_2_seq_262.616.638416837
thread_2_seq_264.617.638445669
thread_2_seq_265.621.638467091
thread_2_seq_266.622.638467229
ASMCMD [+DATA/PROD/ARCHIVELOG/2007_11_12]


Notice that thread 1 (node 1) is missing sequences 653, 655 and 657. Similarly thread 2 is missing sequence#263.
Question is how to bring these missing archivelogs from ASM@primary cluster (they are still sitting on disk @primary)?

Solution: I use a 3-step process there
step 1: Use RMAN "copy" command to copy these to a cooked filesystem on the primary cluster
step 2: Scp the restored (copied) archivelogs to the standby cluster
step 3: Recover (automatic) standby database and supply the scp'ed archivelog under cooked FS.

First, query the ASM@primary to get the exact name/location of these archives --
ASMCMD [+FRA/PROD/ARCHIVELOG/2007_11_12] > ls
thread_1_seq_650.768.638416833
thread_1_seq_651.771.638416841
thread_1_seq_652.781.638445663
thread_1_seq_653.796.638465819
thread_1_seq_654.797.638466159
thread_1_seq_655.798.638466189
thread_1_seq_656.800.638466307
thread_1_seq_657.801.638466455
thread_2_seq_262.769.638416835
thread_2_seq_263.770.638416841
thread_2_seq_264.782.638445665
thread_2_seq_265.799.638466191


step 1: Execute the RMAN copy command (primary cluster). Set the env to the database first.
RMAN> copy archivelog '+FRA/prod/archivelog/2007_11_12/thread_2_seq_263.770.638416841' to '/u02/backup/PROD/thread_2_seq_263.770.638416841';

Starting backup at 12-NOV-07
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: sid=1055 instance=PROD1 devtype=DISK
allocated channel: ORA_DISK_2
channel ORA_DISK_2: sid=1057 instance=PROD1 devtype=DISK
channel ORA_DISK_1: starting archive copy
input archive log thread=2 sequence=263 recid=1412 stamp=638416840
output filename=/u02/backup/PROD/thread_2_seq_263.770.638416841 recid=1426 stamp=638465672
channel ORA_DISK_1: archivelog copy complete, elapsed time: 00:00:02
Finished backup at 12-NOV-07

Starting Control File and SPFILE Autobackup at 12-NOV-07
piece handle=+FRA/prod/autobackup/2007_11_12/s_638465673.795.638465673 comment=NONE
Finished Control File and SPFILE Autobackup at 12-NOV-07

RMAN> copy archivelog '+FRA/PROD/ARCHIVELOG/2007_11_12/thread_1_seq_653.796.638465819' to '/u02/backup/PROD/thread_1_seq_653.796.638465819';

Starting backup at 12-NOV-07
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: sid=1055 instance=PROD1 devtype=DISK
allocated channel: ORA_DISK_2
channel ORA_DISK_2: sid=1081 instance=PROD1 devtype=DISK
channel ORA_DISK_1: starting archive copy
input archive log thread=1 sequence=653 recid=1427 stamp=638465819
output filename=/u02/backup/PROD/thread_1_seq_653.796.638465819 recid=1435 stamp=638466707
channel ORA_DISK_1: archivelog copy complete, elapsed time: 00:00:02
Finished backup at 12-NOV-07

Starting Control File and SPFILE Autobackup at 12-NOV-07
piece handle=+FRA/prod/autobackup/2007_11_12/s_638466708.802.638466709 comment=NONE
Finished Control File and SPFILE Autobackup at 12-NOV-07

RMAN> copy archivelog '+FRA/PROD/ARCHIVELOG/2007_11_12/thread_1_seq_655.798.638466189' to '/u02/backup/PROD/thread_1_seq_655.798.638466189';

Starting backup at 12-NOV-07
using channel ORA_DISK_1
using channel ORA_DISK_2
channel ORA_DISK_1: starting archive copy
input archive log thread=1 sequence=655 recid=1430 stamp=638466189
output filename=/u02/backup/PROD/thread_1_seq_655.798.638466189 recid=1436 stamp=638466745
channel ORA_DISK_1: archivelog copy complete, elapsed time: 00:00:02
Finished backup at 12-NOV-07

Starting Control File and SPFILE Autobackup at 12-NOV-07
piece handle=+FRA/prod/autobackup/2007_11_12/s_638466746.803.638466747 comment=NONE
Finished Control File and SPFILE Autobackup at 12-NOV-07

RMAN> copy archivelog '+FRA/PROD/ARCHIVELOG/2007_11_12/thread_1_seq_657.801.638466455' to '/u02/backup/PROD/thread_1_seq_657.801.638466455';

Starting backup at 12-NOV-07
using channel ORA_DISK_1
using channel ORA_DISK_2
channel ORA_DISK_1: starting archive copy
input archive log thread=1 sequence=657 recid=1434 stamp=638466454
output filename=/u02/backup/PROD/thread_1_seq_657.801.638466455 recid=1437 stamp=638466777
channel ORA_DISK_1: archivelog copy complete, elapsed time: 00:00:02
Finished backup at 12-NOV-07

Starting Control File and SPFILE Autobackup at 12-NOV-07
piece handle=+FRA/prod/autobackup/2007_11_12/s_638466778.804.638466779 comment=NONE
Finished Control File and SPFILE Autobackup at 12-NOV-07

RMAN> quit


Recovery Manager complete.

/u02/backup/PROD: ls -lt|more
total 47463696
-rw-r----- 1 oracle oinstall 32768 Nov 12 15:52 thread_1_seq_657.801.638466455
-rw-r----- 1 oracle oinstall 17920 Nov 12 15:52 thread_1_seq_655.798.638466189
-rw-r----- 1 oracle oinstall 15824384 Nov 12 15:51 thread_1_seq_653.796.638465819
-rw-r----- 1 oracle oinstall 1024 Nov 12 15:50 thread_2_seq_263.770.638416841
...
...


step2: Scp the restored archivelogs to the standby server (bcp1 is node 1 of our standby cluster)
/u02/backup/PROD: scp thread_2_seq_263.770.638416841 bcp1:/u02/backup/PROD/.
thread_2_seq_263.770.638416841
/u02/backup/PROD: scp thread_1_seq_653.796.638465819 bcp1:/u02/backup/PROD/.
thread_1_seq_653.796.638465819 100% 15MB 15.1MB/s 00:01
/u02/backup/PROD: scp thread_1_seq_655.798.638466189 bcp1:/u02/backup/PROD/.
thread_1_seq_655.798.638466189 100% 18KB 17.5KB/s 00:00
/u02/backup/PROD: scp thread_1_seq_657.801.638466455 bcp1:/u02/backup/PROD/.
thread_1_seq_657.801.638466455

step 3: Recover the standby database using the scp'ed archive logs.
PROD1_SQL > recover automatic standby database;
ORA-00279: change 98980021976 generated at 11/12/2007 02:00:34 needed for thread 2
ORA-00289: suggestion : +DATA/prod/2_263_631792716.dbf
ORA-00280: change 98980021976 for thread 2 is in sequence #263
ORA-00278: log file '+DATA/prod/2_262_631792716.dbf' no longer needed for this recovery

Specify log: {=suggested | filename | AUTO | CANCEL}
/u02/backup/PROD/thread_2_seq_263.770.638416841

RA-00279: change 98980621191 generated at 11/12/2007 10:01:02 needed for thread 1
ORA-00289: suggestion : +DATA/prod/1_653_631792716.dbf
ORA-00280: change 98980621191 for thread 1 is in sequence #653
ORA-00278: log file '+DATA/prod/1_653_631792716.dbf' no longer needed for this recovery
ORA-00308: cannot open archived log '+DATA/prod/1_653_631792716.dbf'
ORA-17503: ksfdopn:2 Failed to open file +DATA/prod/1_653_631792716.dbf
ORA-15173: entry '1_653_631792716.dbf' does not exist in directory 'prod'


Specify log: {=suggested | filename | AUTO | CANCEL}
/u02/backup/PROD/thread_1_seq_653.796.638465819
ORA-00279: change 98980621191 generated at 11/12/2007 02:00:40 needed for thread 2
ORA-00289: suggestion : +DATA/prod/2_264_631792716.dbf
ORA-00280: change 98980621191 for thread 2 is in sequence #264


Specify log: {=suggested | filename | AUTO | CANCEL}

ORA-00279: change 98980621195 generated at 11/12/2007 10:01:04 needed for thread 2
ORA-00289: suggestion : +DATA/prod/2_265_631792716.dbf
ORA-00280: change 98980621195 for thread 2 is in sequence #265
ORA-00278: log file '+DATA/prod/2_264_631792716.dbf' no longer needed for this recovery

Specify log: {=suggested | filename | AUTO | CANCEL}
/u02/backup/PROD/thread_2_seq_265.799.638466191
ORA-00279: change 98981521197 generated at 11/12/2007 15:36:59 needed for thread 1
ORA-00289: suggestion : +DATA/prod/1_654_631792716.dbf
ORA-00280: change 98981521197 for thread 1 is in sequence #654
ORA-00278: log file '/u02/backup/PROD/thread_1_seq_653.796.638465819' no longer needed for this recovery

Specify log: {=suggested | filename | AUTO | CANCEL}

ORA-00279: change 98981522036 generated at 11/12/2007 15:42:38 needed for thread 1
ORA-00289: suggestion : +DATA/prod/1_655_631792716.dbf
ORA-00280: change 98981522036 for thread 1 is in sequence #655
ORA-00278: log file '+DATA/prod/1_654_631792716.dbf' no longer needed for this recovery

Specify log: {=suggested | filename | AUTO | CANCEL}
/u02/backup/PROD/thread_1_seq_655.798.638466189
ORA-00279: change 98981522129 generated at 11/12/2007 15:43:09 needed for thread 1
ORA-00289: suggestion : +DATA/prod/1_656_631792716.dbf
ORA-00280: change 98981522129 for thread 1 is in sequence #656
ORA-00278: log file '/u02/backup/PROD/thread_1_seq_655.798.638466189' no longer needed for this recovery

Specify log: {=suggested | filename | AUTO | CANCEL}

ORA-00279: change 98981522138 generated at 11/12/2007 15:43:11 needed for thread 2
ORA-00289: suggestion : +DATA/prod/2_266_631792716.dbf
ORA-00280: change 98981522138 for thread 2 is in sequence #266
ORA-00278: log file '/u02/backup/PROD/thread_2_seq_265.799.638466191' no longer needed for this recovery


Specify log: {=suggested | filename | AUTO | CANCEL}
/u02/backup/PROD/thread_2_seq_266.806.638467227
ORA-00279: change 98981577931 generated at 11/12/2007 15:45:07 needed for thread 1
ORA-00289: suggestion : +DATA/prod/1_657_631792716.dbf
ORA-00280: change 98981577931 for thread 1 is in sequence #657
ORA-00278: log file '+DATA/prod/1_656_631792716.dbf' no longer needed for this recovery

Specify log: {=suggested | filename | AUTO | CANCEL}
/u02/backup/PROD/thread_1_seq_657.801.638466455
ORA-00279: change 98981578237 generated at 11/12/2007 15:47:34 needed for thread 1
ORA-00289: suggestion : +DATA/prod/1_658_631792716.dbf
ORA-00280: change 98981578237 for thread 1 is in sequence #658
ORA-00278: log file '/u02/backup/PROD/thread_1_seq_657.801.638466455' no longer needed for this recovery

Specify log: {=suggested | filename | AUTO | CANCEL}
cancel
Media recovery cancelled.
PROD1_SQL > recover managed standby database using current logfile disconnect from session;
Media recovery complete.


This completes the recovery process for the standby database and fills in the gaps.

Thursday, November 1, 2007

Switchover from Primary to Standby in RAC

Recently we were asked to perform a switchover in a RAC environment and although it seemed difficult initially, it turned out to be rather quick and simple (provided, of course you did things in the right order!).

Following the documentation at http://download.oracle.com/docs/cd/B19306_01/server.102/b14239/role_management.htm#i1030646
in preparation for a switchover, I shut instance 2 of standby down (the one that was not involved in the media recovery).

Per the documentation:
Verify that all but one RAC instance on the standby databases in a Real Application Clusters configuration are shut down.
For a Real Application Clusters database, only one RAC instance on the standby database can be online during the role transition. Shut down all other instances before starting the role transition. Then, after the role transition completes, bring these instances back online.
I ensured that the primary and standby databases were in sync and then went ahead with the ALTER DATABASE COMMIT TO SWITCHOVER.. command on one of the two primary instances --

GCPROD1_SQL > ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY with session shutdown;
ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY with session shutdown
*
ERROR at line 1:
ORA-01105: mount is incompatible with mounts by other instances


After doing some research, I realized that you need to shutdown ALL but one instance on the primary side also (other words, before running the switchover command in a RAC env, keep only one instance running at primary site and one instance at the DR site).

On primary node 1
After shutting down instance 2 at the primary site, I was able to run the SWITCHOVER successfully on instance 1 of the primary.
[Note: After the switchover, the instance is brought up in NOMOUNT stage. So I did a shutdown and mounted it again]

GCPROD1_SQL > ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY WITH SESSION SHUTDOWN;


Database altered.

GCPROD1_SQL > shutdown immediate
ORA-01507: database not mounted

ORACLE instance shut down.

GCPROD1_SQL > startup NOMOUNT
ORACLE instance started.

Total System Global Area 1543503872 bytes
Fixed Size 2072960 bytes
Variable Size 369102464 bytes
Database Buffers 1157627904 bytes
Redo Buffers 14700544 bytes
Database mounted.

GCPROD1_SQL > SELECT SWITCHOVER_STATUS FROM V$DATABASE;


SWITCHOVER_STATUS
--------------------
SESSIONS ACTIVE

Notice that I used the "..WITH SESSION SHUTDOWN" option as I was unable to bring a few active internal processes down.

Now that our old primary has become the new standby, we need to fix a few INIT parameters related to log shipping.

a) LOG_ARCHIVE_DEST_n setting

Here's our old settings --
GCPROD1_SQL > show parameter log_archive_dest

NAME TYPE VALUE
------------------------------------ -------- ------------------------------
log_archive_dest_1 string location="+FLASHDG", valid_for
=(ONLINE_LOGFILES, PRIMARY_ROLE)
log_archive_dest_10 string
log_archive_dest_2 string service="GCPROD_STBY", LGWR AS
YNC NOAFFIRM delay=0 OPTIONAL
max_failure=10 max_connection
s=4 reopen=15 db_unique_name="
GCPROD" register net_timeout
=180 valid_for=(online_logfile
,primary_role)
...
...
[removed for brevity]

We can remove log_archive_dest_2 as its not needed on a standby database. Also log_archive_dest_1 needs to be changed.

GCPROD1_SQL > ALTER SYSTEM SET log_archive_dest_1='LOCATION=+FLASHDG' SCOPE=BOTH;
GCPROD1_SQL > ALTER SYSTEM SET log_archive_dest_2='';

[Note: For, log_archive_dest_1 we may also set as LOCATION=+FLASHDG valid_for=(STANDBY_LOGFILES, STANDBY_ROLE)'. That should also work ok]


b) FAL settings
FAL_SERVER should point to the primary database (tns string) and FAL_CLIENT should point to the Standby database (self)

GCPROD1_SQL > ALTER SYSTEM SET fal_client='GCPROD_STBY' SCOPE=BOTH;
GCPROD1_SQL > ALTER SYSTEM SET fal_server='GCPROD_PRIM' SCOPE=BOTH;


Make sure we update the TNSNAMES.ora file for GCPROD_STBY and GCPROD_PRIM. GCPROD_STBY should resolve to the standby database and GCPROD_PRIM should resolve to the primary (ie. the erstwhile standby; that we haven't switched to the primary role yet -- we'll do that shortly)

After making these change, we put the new standby database in recovery mode --

GCPROD1_SQL > alter database recover managed standby database disconnect from session;

Database altered.




On standby node 1
Next, we perform a switchover of the old standby database to make it the new PRIMARY.
[Note: I had to perform media recovery as shown below as I had accidentally canceled the recovery earlier -- this might not be needed if the (old) standby was already in recovery mode]

GCPROD1_SQL > ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY;
ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY
*
ERROR at line 1:
ORA-16139: media recovery required

GCPROD1_SQL > alter database recover managed standby database disconnect from session;

Database altered.

GCPROD1_SQL > ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY;
Database altered.

GCPROD1_SQL > alter database open;
Database altered.


At this point our new primary database is open and available

Now we've to setup the log shipping related INIT parameters on our new primary.

a) LOG_ARCHIVE_DEST_n setting

Here's our old settings --
GCPROD1_SQL > show parameter log_archive_dest

NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
log_archive_dest string
log_archive_dest_1 string LOCATION=+FLASHDG
log_archive_dest_10 string
log_archive_dest_2 string
log_archive_dest_3 string

...
...
[removed for brevity]

GCPROD1_SQL > ALTER SYSTEM SET log_archive_dest_1='location="+FLASHDG", valid_for=(ALL_LOGFILES, ALL_ROLES)' SCOPE=BOTH;

GCPROD1_SQL > ALTER SYSTEM SET log_archive_dest_2='service="GCPROD_STBY", LGWR ASYNC NOAFFIRM delay=0 OPTIONAL max_failure=10 max_connections=4 reopen=15 db_unique_name="GCPROD" register net_timeout=180 valid_for=(online_logfile,pr
imary_role)' SCOPE=BOTH;

Also don't forget to enable the destination (else the logs don't ship automatically).

GCPROD1_SQL > ALTER SYSTEM SET log_archive_dest_state_2='ENABLE' SCOPE=BOTH;


b) FAL settings
Primary database doesn't need any FAL settings, so we may remove those.

GCPROD1_SQL > ALTER SYSTEM SET fal_client='' SCOPE=BOTH;
GCPROD1_SQL > ALTER SYSTEM SET fal_server='' SCOPE=BOTH;

This completes the SWITCHOVER activity pretty much.
If the DB_UNIQUE names are different for some reason on the primary and DR sides, we also need to update LOG_ARCHIVE_CONFIG parameter for dg_config setting otherwise we might see the following errors in the alert log of the primary database --

Errors in file /u01/app/oracle/admin/GCPROD/bdump/gcprod1_arc1_1316.trc:
ORA-16047: DGID mismatch between destination setting and standby
...
PING[ARC1]: Heartbeat failed to connect to standby 'GCPROD_STBY'. Error is 16047.


The solution is to set the LOG_ARCHIVE_CONFIG parameter to show the db_unique_name (if that is different than db_name, for any reason)

GCPROD1_SQL > ALTER SYSTEM SET log_archive_config='dg_config=(GCPROD, GCPRODDR)' SCOPE=BOTH;

[where GCPROD is db_unique_name on primary side and GCPRODDR is the db_name on the DR side]

All is set now. We can verify it by performing a few log switches on the primary side
The only remaining item would be to make sure the TNSNAMES file is updated on both sides.