Oracle数据库数据恢复、性能优化

找回密码
注册
搜索
热搜: 活动 交友 discuz
发新帖

121

积分

0

好友

2

主题
1#
发表于 2011-10-30 10:26:24 | 查看: 14127| 回复: 15
上次逻辑备库REDO GAP的问题,经过大家的指点,改了FAL参数之后,通过这2天的观察,仍然有问题。考虑到逻辑备库的情况比物理备库复杂,我另开一贴,关于物理备库GAP的,解决了物理备库GAP,再回头来解决逻辑备库GAP的问题。

环境: Windows 2003 + 10.2.0.4
现象: 经常性出现REDO GAP,而且无法自动重传REDO,只能手工修复。为简化问题,我们从昨天开始的两个REDO GAP来看。
SQL> select * from v$archive_gap;

   THREAD# LOW_SEQUENCE# HIGH_SEQUENCE#
---------- ------------- --------------
         1          6623           6623

SQL> select * from
  2  (
  3  SELECT SEQUENCE#, to_char(FIRST_TIME,'yyyymmdd hh24:mi:ss'),
  4  to_char(NEXT_TIME,'yyyymmdd hh24:mi:ss'),applied
  5  FROM V$ARCHIVED_LOG
  6  ORDER BY SEQUENCE# desc
  7  ) where rownum < 10;

SEQUENCE# TO_CHAR(FIRST_TIME,'YYYYMMDDHH TO_CHAR(NEXT_TIME,'YYYYMMDDHH2 APPLIED
---------- ------------------------------ ------------------------------ -------
      6628 20111030 08:03:55              20111030 09:55:54              NO
      6627 20111030 06:11:24              20111030 08:03:55              NO
      6625 20111029 22:01:49              20111029 23:20:08              NO
      6624 20111029 20:55:17              20111029 22:01:49              NO
      6622 20111029 18:52:00              20111029 19:51:23              YES
      6621 20111029 17:40:29              20111029 18:52:00              YES
      6620 20111029 15:53:47              20111029 17:40:29              YES
      6619 20111029 13:41:27              20111029 15:53:47              YES
      6618 20111029 12:16:40              20111029 13:41:27              YES

从10/29 19:51开始出现了两次REDO GAP,从后面的ALERT中可以看出我尝试了重启备库,FAL也没有重传这两个REDO。
2#
发表于 2011-10-30 10:27:57
  1. 主库中相关时间段的ALERT文件:

  2. Sat Oct 29 17:40:32 2011
  3. Thread 1 advanced to log sequence 6621 (LGWR switch)
  4. Current log# 3 seq# 6621 mem# 0: D:\ORACLE\PRODUCT\10.2.0\ORADATA\HKDATA\REDO03.LOG
  5. Sat Oct 29 17:40:32 2011
  6. LNS: Standby redo logfile selected for thread 1 sequence 6621 for destination LOG_ARCHIVE_DEST_2
  7. Sat Oct 29 17:40:34 2011
  8. ARC1: Standby redo logfile selected for thread 1 sequence 6620 for destination LOG_ARCHIVE_DEST_2
  9. Sat Oct 29 18:52:00 2011
  10. Thread 1 advanced to log sequence 6622 (LGWR switch)
  11. Current log# 1 seq# 6622 mem# 0: D:\ORACLE\PRODUCT\10.2.0\ORADATA\HKDATA\REDO01.LOG
  12. Sat Oct 29 18:52:02 2011
  13. LNS: Standby redo logfile selected for thread 1 sequence 6622 for destination LOG_ARCHIVE_DEST_2
  14. Sat Oct 29 19:51:23 2011
  15. Thread 1 advanced to log sequence 6623 (LGWR switch)
  16. Current log# 2 seq# 6623 mem# 0: D:\ORACLE\PRODUCT\10.2.0\ORADATA\HKDATA\REDO02.LOG
  17. Sat Oct 29 19:51:25 2011
  18. LNS: Standby redo logfile selected for thread 1 sequence 6623 for destination LOG_ARCHIVE_DEST_2
  19. Sat Oct 29 20:40:40 2011
  20. ALTER SYSTEM SET log_archive_dest_state_2='DEFER' SCOPE=BOTH;
  21. Sat Oct 29 20:41:37 2011
  22. Errors in file d:\oracle\product\10.2.0\admin\hkdata\bdump\hkdatazq_lns1_5080.trc:
  23. ORA-01089: 正在执行立即关闭 - 不允许进行任何操作

  24. Sat Oct 29 20:41:37 2011
  25. LGWR: I/O error 1089 archiving log 2 to 'hkdata'
  26. Sat Oct 29 20:43:00 2011
  27. Redo Shipping Client Connected as PUBLIC
  28. -- Connected User is Valid
  29. Sat Oct 29 20:43:00 2011
  30. FAL[server, ARC1]: FAL archive failed, see trace file.
  31. Sat Oct 29 20:43:00 2011
  32. FAL[server, ARC3]: FAL archive failed, see trace file.
  33. Sat Oct 29 20:43:00 2011
  34. FAL[server, ARC2]: FAL archive failed, see trace file.
  35. Sat Oct 29 20:43:00 2011
  36. Errors in file d:\oracle\product\10.2.0\admin\hkdata\bdump\hkdatazq_arc1_4144.trc:
  37. ORA-16055: 已拒绝 FAL 请求

  38. Sat Oct 29 20:43:00 2011
  39. ARCH: FAL archive failed. Archiver continuing
  40. Sat Oct 29 20:43:00 2011
  41. Errors in file d:\oracle\product\10.2.0\admin\hkdata\bdump\hkdatazq_arc2_4900.trc:
  42. ORA-16055: 已拒绝 FAL 请求

  43. Sat Oct 29 20:43:00 2011
  44. ARCH: FAL archive failed. Archiver continuing
  45. Sat Oct 29 20:43:01 2011
  46. Errors in file d:\oracle\product\10.2.0\admin\hkdata\bdump\hkdatazq_arc3_2624.trc:
  47. ORA-16055: 已拒绝 FAL 请求

  48. Sat Oct 29 20:43:01 2011
  49. ARCH: FAL archive failed. Archiver continuing
  50. Sat Oct 29 20:43:01 2011
  51. FAL[server, ARC1]: FAL archive failed, see trace file.
  52. Sat Oct 29 20:43:01 2011
  53. Errors in file d:\oracle\product\10.2.0\admin\hkdata\bdump\hkdatazq_arc1_4144.trc:
  54. ORA-16055: 已拒绝 FAL 请求

  55. ARCH: FAL archive failed. Archiver continuing
  56. Sat Oct 29 20:43:31 2011
  57. Redo Shipping Client Connected as PUBLIC
  58. -- Connected User is Valid
  59. Sat Oct 29 20:45:20 2011
  60. ALTER SYSTEM SET log_archive_dest_state_2='ENABLE' SCOPE=BOTH;
  61. Sat Oct 29 20:55:17 2011
  62. Thread 1 advanced to log sequence 6624 (LGWR switch)
  63. Current log# 3 seq# 6624 mem# 0: D:\ORACLE\PRODUCT\10.2.0\ORADATA\HKDATA\REDO03.LOG
  64. Sat Oct 29 20:55:17 2011
  65. LNS: Failed to archive log 2 thread 1 sequence 6623 (1089)
  66. ******************************************************************
  67. LGWR: Setting 'active' archival for destination LOG_ARCHIVE_DEST_2
  68. ******************************************************************
  69. LNS: Standby redo logfile selected for thread 1 sequence 6624 for destination LOG_ARCHIVE_DEST_2
  70. Sat Oct 29 22:01:49 2011
  71. Thread 1 advanced to log sequence 6625 (LGWR switch)
  72. Current log# 1 seq# 6625 mem# 0: D:\ORACLE\PRODUCT\10.2.0\ORADATA\HKDATA\REDO01.LOG
  73. Sat Oct 29 22:01:51 2011
  74. LNS: Standby redo logfile selected for thread 1 sequence 6625 for destination LOG_ARCHIVE_DEST_2
  75. Sat Oct 29 23:20:08 2011
  76. Thread 1 advanced to log sequence 6626 (LGWR switch)
  77. Current log# 2 seq# 6626 mem# 0: D:\ORACLE\PRODUCT\10.2.0\ORADATA\HKDATA\REDO02.LOG
  78. Sat Oct 29 23:20:10 2011
  79. LNS: Standby redo logfile selected for thread 1 sequence 6626 for destination LOG_ARCHIVE_DEST_2
  80. Sat Oct 29 23:53:09 2011
  81. Errors in file d:\oracle\product\10.2.0\admin\hkdata\bdump\hkdatazq_lns1_5080.trc:
  82. ORA-12571: TNS: 包写入程序失败

  83. Sat Oct 29 23:53:09 2011
  84. LGWR: I/O error 12571 archiving log 2 to 'hkdata'
  85. Sun Oct 30 06:11:24 2011
  86. Thread 1 advanced to log sequence 6627 (LGWR switch)
  87. Current log# 3 seq# 6627 mem# 0: D:\ORACLE\PRODUCT\10.2.0\ORADATA\HKDATA\REDO03.LOG
  88. Sun Oct 30 06:11:24 2011
  89. LNS: Failed to archive log 2 thread 1 sequence 6626 (12571)
  90. Sun Oct 30 06:11:24 2011
  91. Errors in file d:\oracle\product\10.2.0\admin\hkdata\bdump\hkdatazq_lns1_5080.trc:
  92. ORA-01041: 内部错误, hostdef 扩展名不存在

  93. LGWR: Error 1041 closing archivelog file 'hkdata'
  94. LNS: Standby redo logfile selected for thread 1 sequence 6627 for destination LOG_ARCHIVE_DEST_2
  95. Sun Oct 30 08:03:55 2011
  96. Thread 1 advanced to log sequence 6628 (LGWR switch)
  97. Current log# 1 seq# 6628 mem# 0: D:\ORACLE\PRODUCT\10.2.0\ORADATA\HKDATA\REDO01.LOG
  98. Sun Oct 30 08:03:56 2011
  99. LNS: Standby redo logfile selected for thread 1 sequence 6628 for destination LOG_ARCHIVE_DEST_2
复制代码

回复 只看该作者 道具 举报

3#
发表于 2011-10-30 10:34:47
  1. 备库中相关时间段的ALERT文件:

  2. Sat Oct 29 13:48:35 2011
  3. Media Recovery Waiting for thread 1 sequence 6614 (in transit)
  4. Sat Oct 29 15:48:45 2011
  5. RFS[1]: Possible network disconnect with primary database
  6. Sat Oct 29 15:48:53 2011
  7. RFS[4]: Possible network disconnect with primary database
  8. Aborting archivelog file creation: D:\ORACLE\PRODUCT\10.2.0\ORADATA\ARCHIVE\STDARC\1_6614_713168348.ARC
  9. If this a network disconnect, then this archivelog will be fetched again
  10. by GAP resolution mechanism.
  11. Sat Oct 29 15:48:53 2011
  12. RFS[2]: Possible network disconnect with primary database
  13. Aborting archivelog file creation: D:\ORACLE\PRODUCT\10.2.0\ORADATA\ARCHIVE\STDARC\1_6616_713168348.ARC
  14. If this a network disconnect, then this archivelog will be fetched again
  15. by GAP resolution mechanism.
  16. Sat Oct 29 15:48:54 2011
  17. RFS[3]: Possible network disconnect with primary database
  18. Aborting archivelog file creation: D:\ORACLE\PRODUCT\10.2.0\ORADATA\ARCHIVE\STDARC\1_6615_713168348.ARC
  19. If this a network disconnect, then this archivelog will be fetched again
  20. by GAP resolution mechanism.
  21. Sat Oct 29 15:48:55 2011
  22. Fetching gap sequence in thread 1, gap sequence 6614-6617
  23. Sat Oct 29 15:49:26 2011
  24. FAL[client]: Failed to request gap sequence
  25. GAP - thread 1 sequence 6614-6617
  26. DBID 4131117977 branch 713168348
  27. FAL[client]: All defined FAL servers have been attempted.
  28. -------------------------------------------------------------
  29. Check that the CONTROL_FILE_RECORD_KEEP_TIME initialization
  30. parameter is defined to a value that is sufficiently large
  31. enough to maintain adequate log switch information to resolve
  32. archivelog gaps.
  33. -------------------------------------------------------------
  34. Sat Oct 29 17:40:33 2011
  35. Redo Shipping Client Connected as PUBLIC
  36. -- Connected User is Valid
  37. RFS[5]: Assigned to RFS process 2408
  38. RFS[5]: Identified database type as 'physical standby'
  39. Primary database is in MAXIMUM PERFORMANCE mode
  40. Primary database is in MAXIMUM PERFORMANCE mode
  41. RFS[5]: Successfully opened standby log 5: 'D:\ORACLE\PRODUCT\10.2.0\ORADATA\HKDATA\STANDBY_REDO05.LOG'
  42. Sat Oct 29 17:40:34 2011
  43. Redo Shipping Client Connected as PUBLIC
  44. -- Connected User is Valid
  45. RFS[6]: Assigned to RFS process 2420
  46. RFS[6]: Identified database type as 'physical standby'
  47. RFS[6]: Successfully opened standby log 6: 'D:\ORACLE\PRODUCT\10.2.0\ORADATA\HKDATA\STANDBY_REDO06.LOG'
  48. Sat Oct 29 18:52:02 2011
  49. Primary database is in MAXIMUM PERFORMANCE mode
  50. RFS[5]: Successfully opened standby log 6: 'D:\ORACLE\PRODUCT\10.2.0\ORADATA\HKDATA\STANDBY_REDO06.LOG'
  51. Sat Oct 29 19:51:25 2011
  52. Primary database is in MAXIMUM PERFORMANCE mode
  53. RFS[5]: Successfully opened standby log 5: 'D:\ORACLE\PRODUCT\10.2.0\ORADATA\HKDATA\STANDBY_REDO05.LOG'
  54. Sat Oct 29 20:41:05 2011
  55. Shutting down instance: further logons disabled
  56. Sat Oct 29 20:41:05 2011

  57. (由于超长,此处省略掉了重启部分的日志)

  58. Sat Oct 29 20:42:54 2011
  59. ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT FROM SESSION
  60. MRP0 started with pid=21, OS id=252
  61. Managed Standby Recovery starting Real Time Apply
  62. parallel recovery started with 7 processes
  63. Sat Oct 29 20:43:00 2011
  64. Waiting for all non-current ORLs to be archived...
  65. Media Recovery Waiting for thread 1 sequence 6614
  66. Fetching gap sequence in thread 1, gap sequence 6614-6617
  67. Sat Oct 29 20:43:00 2011
  68. Completed: ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT FROM SESSION
  69. Sat Oct 29 20:44:02 2011
  70. FAL[client]: Failed to request gap sequence
  71. GAP - thread 1 sequence 6614-6617
  72. DBID 4131117977 branch 713168348
  73. FAL[client]: All defined FAL servers have been attempted.
  74. -------------------------------------------------------------
  75. Check that the CONTROL_FILE_RECORD_KEEP_TIME initialization
  76. parameter is defined to a value that is sufficiently large
  77. enough to maintain adequate log switch information to resolve
  78. archivelog gaps.
  79. -------------------------------------------------------------
  80. Sat Oct 29 20:55:18 2011
  81. Redo Shipping Client Connected as PUBLIC
  82. -- Connected User is Valid
  83. RFS[1]: Assigned to RFS process 2120
  84. RFS[1]: Identified database type as 'physical standby'
  85. Primary database is in MAXIMUM PERFORMANCE mode
  86. Sat Oct 29 20:55:18 2011
  87. RFS LogMiner: Client disabled from further notification
  88. Primary database is in MAXIMUM PERFORMANCE mode
  89. RFS[1]: Successfully opened standby log 6: 'D:\ORACLE\PRODUCT\10.2.0\ORADATA\HKDATA\STANDBY_REDO06.LOG'
  90. Sat Oct 29 20:57:32 2011
  91. db_recovery_file_dest_size of 10240 MB is 6.38% used. This is a
  92. user-specified limit on the amount of space that will be used by this
  93. database for recovery-related files, and does not reflect the amount of
  94. space available in the underlying filesystem or ASM diskgroup.

  95. Sat Oct 29 21:16:28 2011
  96. alter database register logfile 'D:\oracle\product\10.2.0\oradata\archive\std\1_6614_713168348.ARC'

  97. Sat Oct 29 21:16:28 2011
  98. There are 1 logfiles specified.
  99. ALTER DATABASE REGISTER [PHYSICAL] LOGFILE
  100. Completed: alter database register logfile 'D:\oracle\product\10.2.0\oradata\archive\std\1_6614_713168348.ARC'

  101. Sat Oct 29 21:16:35 2011
  102. alter database register logfile 'D:\oracle\product\10.2.0\oradata\archive\std\1_6615_713168348.ARC'

  103. There are 1 logfiles specified.
  104. ALTER DATABASE REGISTER [PHYSICAL] LOGFILE
  105. Completed: alter database register logfile 'D:\oracle\product\10.2.0\oradata\archive\std\1_6615_713168348.ARC'

  106. Sat Oct 29 21:16:40 2011
  107. Media Recovery Log D:\ORACLE\PRODUCT\10.2.0\ORADATA\ARCHIVE\STD\1_6614_713168348.ARC
  108. Sat Oct 29 21:17:03 2011
  109. Media Recovery Log D:\ORACLE\PRODUCT\10.2.0\ORADATA\ARCHIVE\STD\1_6615_713168348.ARC
  110. Sat Oct 29 21:17:07 2011
  111. alter database register logfile 'D:\oracle\product\10.2.0\oradata\archive\std\1_6616_713168348.ARC'
  112. Sat Oct 29 21:17:07 2011
  113. There are 1 logfiles specified.
  114. ALTER DATABASE REGISTER [PHYSICAL] LOGFILE
  115. Sat Oct 29 21:17:19 2011
  116. alter database register logfile 'D:\oracle\product\10.2.0\oradata\archive\std\1_6617_713168348.ARC'

  117. Sat Oct 29 21:17:19 2011
  118. There are 1 logfiles specified.
  119. Sat Oct 29 21:17:19 2011
  120. Completed: alter database register logfile 'D:\oracle\product\10.2.0\oradata\archive\std\1_6616_713168348.ARC'
  121. Sat Oct 29 21:17:19 2011
  122. ALTER DATABASE REGISTER [PHYSICAL] LOGFILE
  123. Completed: alter database register logfile 'D:\oracle\product\10.2.0\oradata\archive\std\1_6617_713168348.ARC'
  124. Sat Oct 29 21:17:20 2011
  125. Media Recovery Log D:\ORACLE\PRODUCT\10.2.0\ORADATA\ARCHIVE\STD\1_6616_713168348.ARC
  126. Sat Oct 29 21:17:38 2011
  127. Media Recovery Log D:\ORACLE\PRODUCT\10.2.0\ORADATA\ARCHIVE\STD\1_6617_713168348.ARC
  128. Sat Oct 29 21:17:57 2011
  129. Media Recovery Log D:\ORACLE\PRODUCT\10.2.0\ORADATA\ARCHIVE\STD\1_6618_713168348.ARC
  130. Sat Oct 29 21:18:14 2011
  131. Recovery of Online Redo Log: Thread 1 Group 4 Seq 6619 Reading mem 0
  132. Mem# 0: D:\ORACLE\PRODUCT\10.2.0\ORADATA\HKDATA\STANDBY_REDO04.LOG
  133. Sat Oct 29 21:41:06 2011
  134. alter database register logfile 'D:\oracle\product\10.2.0\oradata\archive\std\1_6619_713168348.ARC'

  135. Sat Oct 29 21:41:06 2011
  136. There are 1 logfiles specified.
  137. ALTER DATABASE REGISTER [PHYSICAL] LOGFILE
  138. Completed: alter database register logfile 'D:\oracle\product\10.2.0\oradata\archive\std\1_6619_713168348.ARC'
  139. Sat Oct 29 21:41:32 2011
  140. Media Recovery Log D:\ORACLE\PRODUCT\10.2.0\ORADATA\ARCHIVE\STD\1_6619_713168348.ARC
  141. Sat Oct 29 21:41:51 2011
  142. Media Recovery Log D:\ORACLE\PRODUCT\10.2.0\ORADATA\ARCHIVE\STD\1_6620_713168348.ARC
  143. Sat Oct 29 21:42:09 2011
  144. Media Recovery Log D:\ORACLE\PRODUCT\10.2.0\ORADATA\ARCHIVE\STD\1_6621_713168348.ARC
  145. Sat Oct 29 21:42:25 2011
  146. Media Recovery Log D:\ORACLE\PRODUCT\10.2.0\ORADATA\ARCHIVE\STD\1_6622_713168348.ARC
  147. Sat Oct 29 21:42:41 2011
  148. Recovery of Online Redo Log: Thread 1 Group 5 Seq 6623 Reading mem 0
  149. Mem# 0: D:\ORACLE\PRODUCT\10.2.0\ORADATA\HKDATA\STANDBY_REDO05.LOG
  150. Sat Oct 29 22:01:50 2011
  151. Primary database is in MAXIMUM PERFORMANCE mode
  152. RFS[1]: Successfully opened standby log 4: 'D:\ORACLE\PRODUCT\10.2.0\ORADATA\HKDATA\STANDBY_REDO04.LOG'
  153. Sat Oct 29 23:20:09 2011
  154. Primary database is in MAXIMUM PERFORMANCE mode
  155. RFS[1]: Successfully opened standby log 6: 'D:\ORACLE\PRODUCT\10.2.0\ORADATA\HKDATA\STANDBY_REDO06.LOG'
  156. Sun Oct 30 01:52:33 2011
  157. RFS[1]: Possible network disconnect with primary database
  158. Sun Oct 30 06:11:23 2011
  159. Redo Shipping Client Connected as PUBLIC
  160. -- Connected User is Valid
  161. RFS[2]: Assigned to RFS process 2364
  162. RFS[2]: Identified database type as 'physical standby'
  163. Primary database is in MAXIMUM PERFORMANCE mode
  164. Primary database is in MAXIMUM PERFORMANCE mode
  165. RFS[2]: Successfully opened standby log 4: 'D:\ORACLE\PRODUCT\10.2.0\ORADATA\HKDATA\STANDBY_REDO04.LOG'
  166. Sun Oct 30 08:03:54 2011
  167. Primary database is in MAXIMUM PERFORMANCE mode
  168. RFS[2]: Successfully opened standby log 7: 'D:\ORACLE\PRODUCT\10.2.0\ORADATA\HKDATA\STANDBY_REDO07.LOG'
复制代码

回复 只看该作者 道具 举报

4#
发表于 2011-10-30 10:46:10
附件中是主库和物理备库诊断脚本的执行结果:

备注:我没有设置REOPEN和MAX_FAILURE,理论上来说FAL会一直尝试重传才对。

DataGuard诊断脚本.rar

6.16 KB, 下载次数: 1060

回复 只看该作者 道具 举报

5#
发表于 2011-11-1 15:47:12
把 Primary 的整个Alert.log 打包上传

回复 只看该作者 道具 举报

6#
发表于 2011-11-1 15:51:48
把 Sat Oct 29 20:55:17 2011
LNS: Failed to archive log 2 thread 1 sequence 6623 (1089)
这个时间点的LNS trace上传
d:\oracle\product\10.2.0\admin\hkdata\bdump\hkdatazq_lns1_5080.trc

回复 只看该作者 道具 举报

7#
发表于 2011-11-2 10:28:46
参考下Automatic Gap Resolution和FAL Gap Resolution部分的描述,与楼主的现象不符。谁有环境可以试下在备库制造一些不连续的redo gap,观察gap的解决顺序。

Data Guard Gap Detection and Resolution [ID 232649.1]
          修改时间 20-OCT-2010     类型 REFERENCE     状态 PUBLISHED          


Overview:
===============

An archive gap is a range of missing archived redo logs created whenever the
standby system is unable to receive the next archived redo log generated by
the primary database.  

For example, an archive gap occurs when the network becomes unavailable and
automatic archiving from the primary database to the standby database stops.
When the network is available again, automatic transmission of the redo data
from the primary database to the failed standby database resumes.


Methods of Gap Resolution:
===========================

Data Guard provides two methods for gap resolution, automatic and FAL (Fetch
Archive Log).  The automatic method requires no configuration while FAL requires
configuration via init.ora parameters.  Both methods are discussed below.


Automatic Gap Resolution:
~~~~~~~~~~~~~~~~~~~~~~~~~~~

In both 9.0.1 and 9.2.0 Automatic Gap Resolution is implemented during log
transport processing.  As the LGWR or ARCH process begins to send redo over to
the standby, the sequence number of the log being archived is compared to the
last sequence received by the RFS process on the standby.  If the RFS process
detects that the archive log being received is greater than the last sequence
received plus one, then the RFS will piggyback a request to the primary to send
the missing archive logs.  Since the standby destination requesting the gap
resolution is already defined by the LOG_ARCHIVE_DEST_n parameter on the
primary, the ARCH process on the primary sends the logs to the standby and
notifies the LGWR that the gaps have been resolved.

Starting in 9.2.0, automatic gap resolution has been enhanced.  In addition
to the above, the ARCH process on the primary database polls all standby
databases every minute to see if there is a gap in the sequence of archived
redo logs.   If a gap is detected then the ARCH process sends the missing
archived redo log files to the standby databases that reported the gap.  Once
the gap is resolved, the LGWR process is notified that the site is up to date.


FAL Gap Resolution:
~~~~~~~~~~~~~~~~~~~~~~~~

As the RFS process on the standby receives an archived log, it updates the
standby controlfile with the name and location of the file.  Once the MRP
(Media Recovery Process) sees the update to the controlfile, it attempts to
recover that file.  If the MRP process finds that the archived log is missing
or is corrupt, FAL is called to resolve the gap or obtain a new copy.  Since
MRP has no direct communications link with the primary, it must use the
FAL_SERVER and FAL_CLIENT initialization parameters to resolve the gap.  
Both of these parameters must be set in the standby init.ora.  The two
parameters are defined as:

FAL_SERVER:  An OracleNet service name that exist in the standby tnsnames.ora
             file that points to the primary database listener.  The FAL_SERVER
             parameter can contain a comma delimited list of locations that
             should be attempted during gap resolution.

FAL_CLIENT:  An OracleNet service name that exist in the primary tnsnames.ora
             file that points to the standby database listener.  The value of
             FAL_CLIENT should also be listed as the service in a remote
             archive destination pointing to the standby.

Once MRP needs to resolve a gap it uses the value from FAL_SERVER to call the
primary database.  Once communication with the primary has been established,
MRP passes the FAL_CLIENT value to the primary ARCH process.  The primary ARCH
process locates the remote archive destination with the corresponding service
name and ships the missing archived redo logs.  If the first destination listed
in FAL_SERVER is unable to resolve the gap then the next destination is
attempted until either the gap is resolved or all FAL_SERVER destination have
been tried.

As of 9.2.0 FAL Gap Resolution only works with Physical Standby databases as
the process is tied to MRP.  Gap recovery on a logical standby database is
handled through the heartbeat mechanism.


Simulating Gap Recovery
==========================

The follow steps can be used to illustrate and verify both automatic and FAL
gap recovery.  As the steps involve shutting down the standby database, which
can impact disaster recovery, it is recommended to perform these procedures
in a test environment.


Automatic Gap Resolution:

1.  Shutdown the physical standby database.

2.  Determine the current sequence on the primary database.

3.  Perform at least three log switches on the primary database.

4.  Verify that the logs did not get transferred to the standby archive dest.

5.  Start the standby database.

6.  Perform a log switch on the primary and verify that the gap gets resolved
    on the standby.


FAL Gap Resolution:

1.  In the standby init.ora define the fal_server and fal_client parameters.

2.  Bounce the standby database so that the parameters are put into effect.

3.  Perform three log switches on the primary database.

4.  In the standby_archive_dest directory delete the middle archive log on
    the standby.

5.  Start managed recovery and verify that the gap is resolved by FAL_SERVER
    and FAL_CLIENT.


Manually Resolving a Gap:
=============================

In some rare cases it might be necessary to manually resolve gaps.  The following
section describes how to query the appropriate views to determine if a gap
exists.

On your physical standby database:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Query the V$ARCHIVE_GAP view:

SQL> SELECT * FROM V$ARCHIVE_GAP;

    THREAD#  LOW_SEQUENCE#  HIGH_SEQUENCE#
-----------  -------------  --------------
          1            443             446

The query results show that your physical standby database is currently missing
logs from sequence 443 to sequence 446 for thread 1.  After you identify the
gap, issue the following SQL statement on the primary database to locate the
archived redo logs on your primary database:

SQL> SELECT NAME FROM V$ARCHIVED_LOG WHERE THREAD#=1 AND DEST_ID=1 AND
  2> SEQUENCE# BETWEEN 443 AND 446;

NAME
--------------------------------------------------------------------------------
/u01/oradata/arch/arch_1_443.arc
/u01/oradata/arch/arch_1_444.arc
/u01/oradata/arch/arch_1_445.arc

Copy the logs returned by the query to your physical standby database and
register using the ALTER DATABASE REGISTER LOGFILE command.

SQL> ALTER DATABASE REGISTER LOGFILE
'/u01/oradata/stby/arch/arch_1_443.arc';
SQL> ALTER DATABASE REGISTER LOGFILE
'/u01/oradata/stby/arch/arch_1_444.arc';
SQL> ALTER DATABASE REGISTER LOGFILE
'/u01/oradata/stby/arch/arch_1_445.arc';

Once the log files have been registered in the standby controlfile, you can
restart the MRP process.


On a logical standby database:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Query the DBA_LOGSTDBY_LOG view.

SQL> SELECT THREAD#, SEQUENCE#, FILE_NAME FROM DBA_LOGSTDBY_LOG L
  2> WHERE NEXT_CHANGE# NOT IN
  3> (SELECT FIRST_CHANGE# FROM DBA_LOGSTDBY_LOG WHERE L.THREAD# = THREAD#)
  4> ORDER BY THREAD#,SEQUENCE#;

   THREAD#  SEQUENCE# FILE_NAME
---------- ---------- -----------------------------------------------
         1        451 /u01/oradata/logical_stby/arch/arch_1_451.arc
         1        453 /u01/oradata/logical_stby/arch/arch_1_453.arc


Copy the missing logs to the logical standby system and register them using the
ALTER DATABASE REGISTER LOGICAL LOGFILE statement on your logical standby
database.

SQL> ALTER DATABASE REGISTER LOGICAL LOGFILE /u01/oradata/logical_stby/arch/arch_1_452.arc;


After you register these logs on the logical standby database, you can restart
log apply services.

回复 只看该作者 道具 举报

8#
发表于 2011-11-2 10:30:09
如果有RFS和MRP的trace文件,最好也一并上传看看。

回复 只看该作者 道具 举报

9#
发表于 2011-11-2 10:57:56
仅供参考
Data Guard and Network Disconnects [ID 255959.1]
          修改时间 20-OCT-2010     类型 REFERENCE     状态 PUBLISHED          


PURPOSE
=========

Below is a discussion of how Data Guard reacts and recovers when a network
connection between the primary and the standby is disconnected or broken.  This
discussion applies to both physical and logical standby databases.


Data Guard and Network Disconnects:
=========================================

In a Data Guard configuration the primary establishes a network connection to the
standby by either the LGWR or ARCH processes.  When the standby recieves the network
connection request from the primary, the listener on the standby site spawns the RFS
process.  This RFS process on the standby reads from the networks and sends
acknowledgement messages back to the primary when done processing any request.

During normal Data Guard operations, when redo data is transmitted from the primary
to the standby, network messages are initiated from the primary database (the network
`client'), and always acknowledged by the standby database (the network `server').
In this case, the LGWR and ARCH processes are the network clients, and the RFS
process is the network server.

Consider the simple scenario where the network between the primary and standby
systems is disconnected. This results in what is known as a dead connection. A dead
connection indicates that there is no physical connection, but the connection appears
to still be there to the processes on each system.

When the LGWR process attempts to send a new message to the RFS process over the
dead connection it will notice that the connection appears to be broken.  At this
point LGWR will wait on the TCP layer to timeout on the network session between the
primary and standby before establishing that network connectivity has indeed been lost.
The TCP timeout, as defined by TCP kernel parameter settings, is key to how long either
LGWR or ARCH will remain in a wait state before abandoning the network connection.
On some platforms the default for TCP timeout can be as high as 2 hours.

As of 9.2 new functionality was added to Data Guard so as to better control LGWR
timeouts for broken network connections. The [NO]MAX_FAILURE, [NO]REOPEN and
[NO]NET_TIMEOUT attributes to the LOG_ARCHIVE_DEST_n parameter were added to give
LGWR the desired flexibility to control timeout intervals and the number of retries
when a network connection is not responding.

On the standby side, the RFS process is always synchronously waiting for new information
to arrive from the LGWR or ARCH process on the primary.  The RFS process that is
doing the network read operation is blocked until more data arrives or until the
operating system's network software determines that the connection is dead.

Once the RFS process receives notification of the dead network connection, it will
terminate itself. However, until such time as the RFS process terminates itself, it
will retain lock information on the archivelog on the standby site, or the standby
redo log, whose redo information was being received from the primary database.

Any attempt to perform a failover using the RECOVER MANAGED STANDBY DATABASE FINSIH
command will fail while the RFS process maintains a lock on the standby redo
log.  The recover command will fail with the following errors:

    ORA-00283: recovery session canceled due to errors
    ORA-00261: log 4 of thread 1 is being archived or modified
    ORA-00312: online log 4 thread 1: '/database/mtsmith/920DG/srl1.dbf'

One must wait for either the operating system network software to clean up the dead
connection or kill the RFS process before the failover attempt will succeed.  One
method to decrease the time for the operating system network software to clean up the
dead connection is the use of Oracle's Dead Connection Detection feature.

With Oracle's Dead Connection Detection feature Oracle Net periodically sends a
network probe to verify that a client/server connection is still active. This ensures
that connections are not left open indefinitely due to an abnormal client termination.
If the probe finds a dead connection or a connection that is no longer in use, it
returns an error that causes the RFS process to exit. However, as we are still
dependant on the operation system network software for timeouts and retries it can
take Dead Connection Detection up to 9 minutes to terminate the RFS network connection.
The time can be expressed as:

     time to kill dead connection=
       SQLNET.EXPIRE_TIME (Dead Connection Detection)
     + tcp_ip_abort_interval (default=480000, 8 mins. on Solaris)
     + tcp_rexmit_interval_max (def.=1 min)

Changing the tcp_keep_alive setting to a lower number will improve this timing
but is usually not feasible to apply to the whole system.

Once the network problem is resolved, and the primary database processes are again
able to establish network connections to the standby database, a new RFS process
will automatically be spawned on the standby database for each new network connection.
These new RFS processes will resume the reception of redo data from the primary
database.

回复 只看该作者 道具 举报

10#
发表于 2011-11-2 21:34:27
原帖由 maclean 于 2011-11-1 15:51 发表
把 Sat Oct 29 20:55:17 2011
LNS: Failed to archive log 2 thread 1 sequence 6623 (1089)
这个时间点的LNS trace上传
d:\oracle\product\10.2.0\admin\hkdata\bdump\hkdatazq_lns1_5080.trc


备库在Sat Oct 29 20:41:05 2011有重启过。
hkdatazq_lns1_5080.trc参见附件。

hkdatazq_lns1_5080.rar

1.53 KB, 下载次数: 963

回复 只看该作者 道具 举报

11#
发表于 2011-11-2 21:49:04
原帖由 maclean 于 2011-11-1 15:47 发表
把 Primary 的整个Alert.log 打包上传


附件中为主库的ALERT。

alert_hkdatazq.zip

58.76 KB, 下载次数: 914

回复 只看该作者 道具 举报

12#
发表于 2011-11-2 22:06:14
Action Plan
1. 手动在 STANDBY 上重设置参数 "fal_server"


alter system set fal_server='hkdatazq'     ;

    手动 重设 Primary 上的 fal_server 参数

alter system set fal_server='hkdata';

观察alert.log 和LNS 跟踪日志

2.  重建 STANDBY上的 控制文件 后 再次测试 步骤1

[ 本帖最后由 maclean 于 2011-11-2 22:10 编辑 ]

回复 只看该作者 道具 举报

13#
发表于 2011-11-3 22:15:16
有结果了么?:) :)

回复 只看该作者 道具 举报

14#
发表于 2011-11-5 18:10:53
重设之后GAP没有修复,看起来FAL并没有启动。

主库的ALERT如下:

Sat Nov 05 06:51:39 2011
ALTER SYSTEM SET fal_server='hkdata' SCOPE=BOTH;
Sat Nov 05 07:33:45 2011
Thread 1 advanced to log sequence 6707 (LGWR switch)
  Current log# 2 seq# 6707 mem# 0: D:\ORACLE\PRODUCT\10.2.0\ORADATA\HKDATA\REDO02.LOG
Sat Nov 05 07:33:46 2011
LNS: Standby redo logfile selected for thread 1 sequence 6707 for destination LOG_ARCHIVE_DEST_2
Sat Nov 05 09:30:35 2011
WARNING: inbound connection timed out (ORA-3136)
Sat Nov 05 09:55:34 2011
Thread 1 advanced to log sequence 6708 (LGWR switch)
  Current log# 3 seq# 6708 mem# 0: D:\ORACLE\PRODUCT\10.2.0\ORADATA\HKDATA\REDO03.LOG
Sat Nov 05 09:55:36 2011
LNS: Standby redo logfile selected for thread 1 sequence 6708 for destination LOG_ARCHIVE_DEST_2
Sat Nov 05 11:46:54 2011
Thread 1 advanced to log sequence 6709 (LGWR switch)
  Current log# 1 seq# 6709 mem# 0: D:\ORACLE\PRODUCT\10.2.0\ORADATA\HKDATA\REDO01.LOG
Sat Nov 05 11:46:56 2011
LNS: Standby redo logfile selected for thread 1 sequence 6709 for destination LOG_ARCHIVE_DEST_2
Sat Nov 05 13:14:26 2011
Thread 1 advanced to log sequence 6710 (LGWR switch)
  Current log# 2 seq# 6710 mem# 0: D:\ORACLE\PRODUCT\10.2.0\ORADATA\HKDATA\REDO02.LOG
Sat Nov 05 13:14:28 2011
LNS: Standby redo logfile selected for thread 1 sequence 6710 for destination LOG_ARCHIVE_DEST_2
Sat Nov 05 13:19:45 2011
WARNING: inbound connection timed out (ORA-3136)
Sat Nov 05 13:21:06 2011
WARNING: inbound connection timed out (ORA-3136)
Sat Nov 05 13:21:35 2011
WARNING: inbound connection timed out (ORA-3136)
Sat Nov 05 13:22:17 2011
WARNING: inbound connection timed out (ORA-3136)
Sat Nov 05 13:23:37 2011
WARNING: inbound connection timed out (ORA-3136)
Sat Nov 05 13:56:56 2011
WARNING: inbound connection timed out (ORA-3136)
Sat Nov 05 15:25:54 2011
Thread 1 advanced to log sequence 6711 (LGWR switch)
  Current log# 3 seq# 6711 mem# 0: D:\ORACLE\PRODUCT\10.2.0\ORADATA\HKDATA\REDO03.LOG
Sat Nov 05 15:25:55 2011
LNS: Standby redo logfile selected for thread 1 sequence 6711 for destination LOG_ARCHIVE_DEST_2
Sat Nov 05 17:28:37 2011
Thread 1 advanced to log sequence 6712 (LGWR switch)
  Current log# 1 seq# 6712 mem# 0: D:\ORACLE\PRODUCT\10.2.0\ORADATA\HKDATA\REDO01.LOG
Sat Nov 05 17:28:38 2011
LNS: Standby redo logfile selected for thread 1 sequence 6712 for destination LOG_ARCHIVE_DEST_2

备库的ALERT如下:

Sat Nov 05 06:50:38 2011
ALTER SYSTEM SET fal_server='hkdatazq' SCOPE=BOTH;
Sat Nov 05 07:33:48 2011
Primary database is in MAXIMUM PERFORMANCE mode
RFS[12]: Successfully opened standby log 4: 'D:\ORACLE\PRODUCT\10.2.0\ORADATA\HKDATA\STANDBY_REDO04.LOG'
Sat Nov 05 09:55:37 2011
Primary database is in MAXIMUM PERFORMANCE mode
RFS[12]: Successfully opened standby log 8: 'D:\ORACLE\PRODUCT\10.2.0\ORADATA\HKDATA\STANDBY_REDO08.LOG'
Sat Nov 05 11:46:57 2011
Primary database is in MAXIMUM PERFORMANCE mode
RFS[12]: Successfully opened standby log 4: 'D:\ORACLE\PRODUCT\10.2.0\ORADATA\HKDATA\STANDBY_REDO04.LOG'
Sat Nov 05 13:14:29 2011
Primary database is in MAXIMUM PERFORMANCE mode
RFS[12]: Successfully opened standby log 8: 'D:\ORACLE\PRODUCT\10.2.0\ORADATA\HKDATA\STANDBY_REDO08.LOG'
Sat Nov 05 15:25:56 2011
Primary database is in MAXIMUM PERFORMANCE mode
RFS[12]: Successfully opened standby log 4: 'D:\ORACLE\PRODUCT\10.2.0\ORADATA\HKDATA\STANDBY_REDO04.LOG'
Sat Nov 05 17:28:39 2011
Primary database is in MAXIMUM PERFORMANCE mode
RFS[12]: Successfully opened standby log 8: 'D:\ORACLE\PRODUCT\10.2.0\ORADATA\HKDATA\STANDBY_REDO08.LOG'

回复 只看该作者 道具 举报

15#
发表于 2011-11-5 22:45:32
阶段性小结一下:
按METALINK文档看gap的解决有两种方式:

1.Automatic Gap Resolution

当lgwr或arch发送redo日志到备库时,这个日志的sequence号(暂称为Sending_seq)会与备库上已被接收的最后一个sequence号(暂称为Last_dgseq)进行比较。
如果RFS进程发现Sending_seq>Last_dgseq+1,那么RFS进程会向主库请求发送相应的missing日志,由主库的ARCH进程来发送这个日志,并告诉LGWR进程gap被解决了。
从9.2.0开始,ARCH进程会每分钟对备库进行调查,查看是否有gap。如果备库返回有gap信息,则发送相关日志。gap解决后,lgwr进程会被告知备库已与主库同步。


2.FAL Gap Resolution:
当RFS接到到一个归档日志,它会根据这个归档日志的名字与位置,更新备库控制文件。一旦MRP进程看到这个更新,它会尝试去使用这个归档日志进行恢复。
如果MRP发现这个归档日志丢失或损坏了,FAL会被调用去解决这个GAP。
如果主库中设定的第一个destination不能解决这个gap,那么会尝试后续的destination,直到gap被解决或所有destination被尝试完毕。

在9.2.0版本中 FAL Gap Resolution only works with Physical Standby databases as the process is tied to MRP.  Gap recovery on a logical standby database is
handled through the heartbeat mechanism(这个心跳机制不了解).
----------------------
以上是根据我的理解,进行的翻译,可以对照看下理解是否正确。
那么为什么在这个案例中,以前已有的redo gap不能被解决,而只有新出的gap会进行尝试解决呢。

在Automatic Gap Resolution方式中,只有当Sending_seq>Last_dgseq+1时,才会尝试进行解决gap。按照这种逻辑,当序号为N的归档日志发现是gap,但gap解决失败,而如果这时第N+1号归档日志接收成功,当进行N+2号归档日志接收时,RFS就不会向主库请求发送第N号归档日志了,因为N+2≯(N+1)+1,暂且将第N号的redo gap称为“以前的gap”。至于9.2.0开始的ARCH进程会每分钟对备库进行gap调查,没有起作用,可能是“备库返回的gap信息”有问题,问题包括网络问题(lag,中断及长时间的tcp timeout等)和备库发现gap的机制问题。(关于Automatic Gap Resolution的实际机制,有兴趣的同学可以进行研究确定)

在FAL Gap Resolution方式中,也是最近一个归档日志如果有问题才进行gap解决,并不涉及到以前的gap。而且如果最近的这个gap因为网络问题而完全失败,导致其也变成了“以前的gap”,那么它再也不会被解决。

综上所述,我的结论是“以前的”redo gap或archive gap不被自动解决这一事实与“文档推理的结果”一定程度上相符。

楼主如果对我的说法认同或觉得有参考价值。可以根据信息进行分析与验证,毕竟这个环境是个相当好的研究场景。
信息应当包括,但不限于:
主备库的alert日志
主库的lgwr trace,arch trace,lns trace
物理备库的rfs trace,mrp trace
群主提供的dg诊断脚本
深入分析的话,可以对相关进程进行10046 trace等等(我只知道10046)
参数log_archive_trace可提供的信息

--------------------------------
关于log_archive_trace的能力(详见10gr2 database reference,我没用过,不知道用哪个合适):
LOG_ARCHIVE_TRACE controls output generated by the archivelog process.
This process can be initiated by any of the following:
■ An ARCn background process (designated as ARCn in the output logs)
■ An explicit session-invoked foreground process (designated as ARCH in the
output logs)
The valid values have the following meanings:
■ 0: Disable archivelog tracing (this is the default)
■ 1: Track archival of redo log file
■ 2: Track archival status of each archivelog destination
■ 4: Track archival operational phase
■ 8: Track archivelog destination activity
■ 16: Track detailed archivelog destination activity
■ 32: Track archivelog destination parameter modifications
■ 64: Track ARCn process state activity
■ 128: Track FAL (fetch archived log) server related activities
■ 256: Track RFS Logical Client
■ 512: Track LGWR redo shipping network activity
■ 1024: Track RFS Physical Client
■ 2048: Track RFS/ARCn Ping Heartbeat
■ 4096: Track Real Time Apply
■ 8192: Track Redo Apply (Media Recovery or Physical Standby)

回复 只看该作者 道具 举报

16#
发表于 2011-11-5 22:53:29

回复 15# 的帖子

SQL> show parameter log_archive_trace


NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
log_archive_trace                    integer     252


SQL> alter system switch logfile;
System altered.







设置log_archive_trace      =252 并观察 ARCH进程的 trace信息

[ 本帖最后由 maclean 于 2011-11-5 23:00 编辑 ]

回复 只看该作者 道具 举报

您需要登录后才可以回帖 登录 | 注册

QQ|手机版|Archiver|Oracle数据库数据恢复、性能优化

GMT+8, 2024-11-15 01:45 , Processed in 0.069959 second(s), 24 queries .

Powered by Discuz! X2.5

© 2001-2012 Comsenz Inc.

回顶部
TEL/電話+86 13764045638
Email service@parnassusdata.com
QQ 47079569