Oracle Database Data Recovery and Performance Tuning

1#
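Posted on 2012-6-4 14:14:17 | Views: 11125 | Replies: 16
Database version: 11.2.0.2.0, Linux x86-64
Clusterware version: 11.2.0.2.0, Linux x86-64
This is a 2-node RAC database.
Last Friday at 17:06 a shutdown immediate was issued against instance 1, but by 17:22 the instance still had not shut down.
At 17:16 the DBA on duty took a processstate dump of the process running the shutdown command (see the attachment).
At 17:22 the DBA issued shutdown abort and restarted the instance; everything returned to normal (quickly).
Could Liu please help diagnose this? The DBA only took a processstate dump at the time, so I am not sure whether the problem can be analyzed from it. Thanks in advance.
The alert log from the time of the hang follows.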
Fri Jun 01 17:06:37 2012
Shutting down instance (immediate)
Stopping background process SMCO
Shutting down instance: further logons disabled
Fri Jun 01 17:06:38 2012
Stopping background process CJQ0
Stopping background process QMNC
Stopping background process MMNL
Stopping background process MMON
License high water mark = 220
All dispatchers and shared servers shutdown
ALTER DATABASE CLOSE NORMAL
Fri Jun 01 17:06:44 2012
SMON: disabling tx recovery
Stopping background process RCBG
SMON: disabling cache recovery
Fri Jun 01 17:06:49 2012
NOTE: Deferred communication with ASM instance
Fri Jun 01 17:06:49 2012
NOTE: Deferred communication with ASM instance
NOTE: deferred map free for map id 25041
NOTE: deferred map free for map id 25043
Fri Jun 01 17:06:49 2012
NOTE: Deferred communication with ASM instance
Fri Jun 01 17:06:50 2012
Shutting down archive processes
Archiving is disabled
Fri Jun 01 17:06:50 2012
ARCH shutting down
Fri Jun 01 17:06:50 2012
ARCH shutting down
Fri Jun 01 17:06:50 2012
ARCH shutting down
Fri Jun 01 17:06:50 2012
ARCH shutting down
ARC3: Archival stopped
ARC1: Archival stopped
ARC0: Archival stopped
ARC2: Archival stopped
Thread 1 closed at log sequence 6680
Successful close of redo thread 1
Fri Jun 01 17:06:51 2012
NOTE: Deferred communication with ASM instance
NOTE: deferred map free for map id 4
Fri Jun 01 17:22:08 2012
License high water mark = 220
USER (ospid: 17823): terminating the instance
Instance terminated by USER, pid = 17823
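The silent period in the alert log can be measured mechanically. The sketch below is a minimal illustration, not an Oracle tool: it parses timestamp lines in the format seen above and reports the longest gap together with the last message written before it. The name `largest_gap` and the shortened `ALERT_LOG` excerpt are introduced here for illustration only.

```python
from datetime import datetime

# Shortened excerpt of the alert log quoted above (illustrative input).
ALERT_LOG = """\
Fri Jun 01 17:06:50 2012
ARCH shutting down
Thread 1 closed at log sequence 6680
Fri Jun 01 17:06:51 2012
NOTE: Deferred communication with ASM instance
NOTE: deferred map free for map id 4
Fri Jun 01 17:22:08 2012
License high water mark = 220
USER (ospid: 17823): terminating the instance
"""

def largest_gap(log_text):
    """Return (gap_seconds, last_message_before_gap) for the longest silence."""
    prev_ts = None
    last_msg = None
    best = (0.0, None)
    for raw in log_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        try:
            # Timestamp lines look like "Fri Jun 01 17:06:51 2012".
            ts = datetime.strptime(line, "%a %b %d %H:%M:%S %Y")
        except ValueError:
            last_msg = line  # an ordinary message line
            continue
        if prev_ts is not None:
            gap = (ts - prev_ts).total_seconds()
            if gap > best[0]:
                best = (gap, last_msg)
        prev_ts = ts
    return best

if __name__ == "__main__":
    seconds, message = largest_gap(ALERT_LOG)
    print(f"longest silence: {seconds:.0f} s after: {message}")
```

On this excerpt the longest silence is 917 seconds, starting after `NOTE: deferred map free for map id 4` and ending only when the instance was terminated.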


precessstate.rar (374.88 KB, downloads: 959)

[ Last edited by clevernby on 2012-6-4 14:19 ]
2#
Posted on 2012-6-4 22:35:46
ODM DATA:
Stopping background process SMCO

*** 2012-06-01 17:06:38.202
ksimdel: READY status 5

*** 2012-06-01 17:06:38.298
Stopping background process QMNC

*** 2012-06-01 17:06:39.317
Stopping background process MMNL

*** 2012-06-01 17:06:40.329
Stopping background process MMON

*** 2012-06-01 17:06:41.437
ksukia: Starting kill, flags = 1
ksukia: Attempt 1 to kill process oracle@tbdb01_zj, OS id=18387

ksukia: killed 143 out of 143 processes.

*** 2012-06-01 17:06:43.131
ksukia: Starting kill, flags = 1
ksukia: killed 0 out of 0 processes.

*** 2012-06-01 17:06:44.206
* Set mstr_rdy 0, lmon_pnpchk 0

*** 2012-06-01 17:06:44.290
Stopping background process RCBG

The hang occurred while stopping RCBG.


RCBG manages the result cache; I suggest uploading the RCBG trace for that period.

  SO: 0xc987576e8, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3
   proc=0xc987576e8, name=process, file=ksu.h LINE:12451, pg=0
  (process) Oracle pid:94, ser:20, calls cur/top: 0xc0ad8dc80/0xc0ad8dc80
            flags : (0x0) -
            flags2: (0x0),  flags3: (0x0)
            intr error: 0, call error: 0, sess error: 0, txn error 1089
            intr queue: empty
    ksudlp FALSE at location: 0
  (post info) last post received: 0 0 27
              last post received-location: ksa2.h LINE:286 ID:ksasnr
              last process to post me: c8074b178 1 6
              last post sent: 0 0 26
              last post sent-location: ksa2.h LINE:282 ID:ksasnd
              last process posted by me: c8074b178 1 6
    (latch info) wait_event=0 bits=80
      holding    (efd=9) 601097d8 Child shared pool level=7 child#=3
        Location from where latch is held: kgh.h LINE:6407 ID:kghfre: Chunk Header
        Context saved from call: 49454235520
        state=busy [holder orapid=94] wlstate=free [value=0]


The process with Oracle pid 94 holds the Child shared pool latch (level=7, child#=3), taken at kghfre: Chunk Header.

Its most recent waits were on 'Disk file operations I/O':


   SO: 0xc888362c8, type: 4, owner: 0xc987576e8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
     proc=0xc987576e8, name=session, file=ksu.h LINE:12459, pg=0
    (session) sid: 455 ser: 60839 trans: 0xc75af15f0, creator: 0xc987576e8
              flags: (0x100041) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
              flags2: (0x48008) -/DDLT2
              DID: , short-term DID:
              txn branch: (nil)
              oct: 35, prv: 0, sql: 0x9eb32f4c8, psql: 0xcd4417108, user: 0/SYS
    ksuxds FALSE at location: 0
    service name: SYS$USERS
    client details:
      O/S info: user: oracle, term: pts/3, ospid: 7396
      machine: tbdb01_zj program: sqlplus@tbdb01_zj (TNS V1-V3)
      application name: sqlplus@tbdb01_zj (TNS V1-V3), hash value=1997861317
    Current Wait Stack:
      Not in wait; last wait ended 9 min 19 sec ago
    Wait State:
      fixed_waits=0 flags=0x28 boundary=(nil)/-1
    Session Wait History:
        elapsed time of 9 min 19 sec since last wait
     0: waited for 'Disk file operations I/O'
        FileOperation=0x5, fileno=0x1, filetype=0x1
        wait_id=27177 seq_num=27184 snap_id=1
        wait times: snap=0.000001 sec, exc=0.000001 sec, total=0.000001 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 0.000001 sec of elapsed time
     1: waited for 'Disk file operations I/O'
        FileOperation=0x5, fileno=0x0, filetype=0x1
        wait_id=27176 seq_num=27183 snap_id=1
        wait times: snap=0.000001 sec, exc=0.000001 sec, total=0.000001 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 0.000010 sec of elapsed time


      SO: 0xa71e5e438, type: 77, owner: 0xc888362c8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
       proc=0xc987576e8, name=LIBRARY OBJECT LOCK, file=kgl.h LINE:8476, pg=0

      LibraryObjectLock:  Address=0xa71e5e438 Handle=0x9eb32f4c8 Mode=N CanBeBrokenCount=1 Incarnation=1 ExecutionCount=1         
        
        User=0xc888362c8 Session=0xc888362c8 ReferenceCount=1 Flags=CNB/[0001] SavepointNum=4fc88623
      LibraryHandle:  Address=0x9eb32f4c8 Hash=65cb1a12 LockMode=N PinMode=0 LoadLockMode=0 Status=VALD
        ObjectName:  Name=ALTER DATABASE CLOSE NORMAL


The statement being executed is ALTER DATABASE CLOSE NORMAL.


There is no other usable information in the dump.


3#
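Posted on 2012-6-5 10:13:03
Attached is the RCBG process trace for that period. The file was last modified at 17:06:46, and judging from its contents RCBG exited normally without hanging.
tbdb1_rcbg_13681.zip (695 Bytes, downloads: 885)

Some remaining questions
(1) The Disk file operations I/O wait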

SO: 0xc888362c8, type: 4, owner: 0xc987576e8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
     proc=0xc987576e8, name=session, file=ksu.h LINE:12459, pg=0
    (session) sid: 455 ser: 60839 trans: 0xc75af15f0, creator: 0xc987576e8
              flags: (0x100041) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
              flags2: (0x48008) -/DDLT2
              DID: , short-term DID:
              txn branch: (nil)
              oct: 35, prv: 0, sql: 0x9eb32f4c8, psql: 0xcd4417108, user: 0/SYS
    ksuxds FALSE at location: 0
    service name: SYS$USERS
    client details:
      O/S info: user: oracle, term: pts/3, ospid: 7396
      machine: tbdb01_zj program: sqlplus@tbdb01_zj (TNS V1-V3)
      application name: sqlplus@tbdb01_zj (TNS V1-V3), hash value=1997861317
    Current Wait Stack:
      Not in wait; last wait ended 9 min 19 sec ago
    Wait State:
      fixed_waits=0 flags=0x28 boundary=(nil)/-1
    Session Wait History:
        elapsed time of 9 min 19 sec since last wait
     0: waited for 'Disk file operations I/O'
        FileOperation=0x5, fileno=0x1, filetype=0x1
        wait_id=27177 seq_num=27184 snap_id=1
        wait times: snap=0.000001 sec, exc=0.000001 sec, total=0.000001 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 0.000001 sec of elapsed time

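In particular:

    Current Wait Stack:
      Not in wait; last wait ended 9 min 19 sec ago

The last wait ended 9 minutes 19 seconds before the dump was taken, i.e. at roughly 17:07:07, and no waits occurred after that. I think it is unlikely to be the cause of the hang.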
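The estimate can be bounded with a little date arithmetic. The thread only states that the processstate dump was taken at 17:16, without seconds, so the sketch below treats the dump time as anywhere within that minute (an assumption) and subtracts the 9 min 19 sec reported in the wait stack. The helper `last_wait_window` is a name made up for this illustration.

```python
from datetime import datetime, timedelta

def last_wait_window(dump_minute, since_last_wait):
    """Earliest and latest time the last wait could have ended,
    assuming the dump was taken somewhere inside dump_minute."""
    earliest = dump_minute - since_last_wait
    latest = dump_minute + timedelta(seconds=59) - since_last_wait
    return earliest, latest

# 17:16 comes from the first post; the seconds are unknown (assumption).
dump_minute = datetime(2012, 6, 1, 17, 16)
since_last_wait = timedelta(minutes=9, seconds=19)
window = last_wait_window(dump_minute, since_last_wait)

if __name__ == "__main__":
    print(window[0].strftime("%H:%M:%S"), "to", window[1].strftime("%H:%M:%S"))
```

Under that assumption the last wait ended between 17:06:41 and 17:07:40, which brackets the 17:07:07 estimate and sits close to the final alert log entry at 17:06:51.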

(2) ASM map operations

Fri Jun 01 17:06:51 2012
NOTE: Deferred communication with ASM instance
NOTE: deferred map free for map id 4

This is the last content written to the alert log before the instance was terminated.
Next is the ASM map operations SO:

SO: 0xbf8c8e070, type: 123, owner: 0xc987576e8, flag: INIT/-/-/0x00 if: 0x1 c: 0x1
     proc=0xc987576e8, name=ASM map operations, file=kffm2.h LINE:380, pg=0
    transistion:0x(nil)
    busylist:[bf8c8e0d8,bf8c8e0d8]
    freelist:[cbfaa0f00,cbfaa0f78]
      KFFMOP: hash link:[cbfaa0ef0,cbfaa0ef0] sobj link:[cbfaa0e88,bf8c8e0c8]
        map kggrp:[0x0xcbff3bad8, 0, valid]  map id:2
        group:[2,-2016913365] file:[265,753976451] extent:0
        flags:0x0000 disk:3 au:390 lock:0 proc:0x0xc987576e8
      KFFMOP: hash link:[cbfaa0e78,cbfaa0e78] sobj link:[cbfaa0e10,cbfaa0f00]
        map kggrp:[0x0xcbff3bad8, 0, valid]  map id:2
        group:[2,-2016913365] file:[265,753976451] extent:7
        flags:0x0000 disk:2 au:393 lock:0 proc:0x0xc987576e8
      KFFMOP: hash link:[cbfaa0e00,cbfaa0e00] sobj link:[cbfaa0c30,cbfaa0e88]
        map kggrp:[0x0xcbff3bad8, 0, valid]  map id:2
        group:[2,-2016913365] file:[265,753976451] extent:6
        flags:0x0000 disk:0 au:427 lock:0 proc:0x0xc987576e8
      KFFMOP: hash link:[cbfaa0c20,cbfaa0c20] sobj link:[cbfaa0bb8,cbfaa0e10]
        map kggrp:[0x0xcbff3bad8, 0, valid]  map id:2
        group:[2,-2016913365] file:[265,753976451] extent:5
        flags:0x0000 disk:1 au:392 lock:0 proc:0x0xc987576e8
      ......(omitted)
      KFFMOP: hash link:[cbfaa1238,cbfaa1238] sobj link:[cbfaa10e0,cbfaa12c0]
        map kggrp:[0x0xcbff3bad8, 0, valid]  map id:2
        group:[2,-2016913365] file:[265,753976451] extent:0
        flags:0x0000 disk:3 au:390 lock:0 proc:0x0xc987576e8
      KFFMOP: hash link:[cbfaa10d0,cbfaa10d0] sobj link:[cbfaa1068,cbfaa1248]
        map kggrp:[0x0xcbff3bad8, 0, valid]  map id:2
        group:[2,-2016913365] file:[265,753976451] extent:4
        flags:0x0000 disk:3 au:391 lock:0 proc:0x0xc987576e8
      KFFMOP: hash link:[cbfaa1058,cbfaa1058] sobj link:[cbfaa0f78,cbfaa10e0]
        map kggrp:[0x0xcbff3bad8, 0, valid]  map id:2
        group:[2,-2016913365] file:[265,753976451] extent:5
        flags:0x0000 disk:1 au:392 lock:0 proc:0x0xc987576e8
      KFFMOP: hash link:[cbfaa0f68,cbfaa0f68] sobj link:[bf8c8e0c8,cbfaa1068]
        map kggrp:[0x0xcbff3bad8, 0, valid]  map id:2
        group:[2,-2016913365] file:[265,753976451] extent:13
        flags:0x0000 disk:1 au:394 lock:0 proc:0x0xc987576e8

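Here,
file 265 is the control file Current.265.753976451
file 267 is SYSAUX.267.753976401
However, I do not know enough about ASM map operations to analyze this further, and I cannot tell whether they caused the hang.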

(3) call SO 0xc0ad8dc80

SO: 0xc0ad8dc80, type: 3, owner: 0xc987576e8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
     proc=0xc987576e8, name=call, file=ksu.h LINE:12455, pg=0
    (call) sess: cur c888362c8, rec c7085cfa0, usr c888362c8; flg:30 fl2:1; depth:0
    svpt(xcb:(nil) sptn:0x76 uba: 0x00000000.0000.00)
    ksudlc FALSE at location: 0
      ----------------------------------------
      SO: 0xc7092a440, type: 8, owner: 0xc0ad8dc80, flag: INIT/-/-/0x00 if: 0x1 c: 0x1
       proc=0xc987576e8, name=enqueue, file=ksq1.h LINE:365, pg=0
      (enqueue) IS-00000000-00000000 DID: 0001-005E-00084048
      lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  res_flag: 0x2
      mode: X, lock_flag: 0x10, lock: 0xc7092a498, res: 0xcd8992e60
      own: 0xc888362c8, sess: 0xc888362c8, proc: 0xc987576e8, prv: 0xcd8992e70
      ----------------------------------------
      SO: 0xc7085cfa0, type: 4, owner: 0xc0ad8dc80, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
       proc=0xc987576e8, name=session, file=ksu.h LINE:12459, pg=0
      (session) sid: 459 ser: 16449 trans: 0xc7cad5b40, creator: (nil)
                flags: (0x2) -/REC flags_idl: (0x0) -/-/-/-/-/-
                flags2: (0x0) -/-
                DID: , short-term DID:
                txn branch: (nil)
                oct: 0, prv: 0, sql: (nil), psql: 0xcbfa7baf8, user: 0/SYS
      ksuxds FALSE at location: 0
      temporary object counter: 0
        KTU Session Commit Cache Dump for IDLs:
        KTU Session Commit Cache Dump for Non-IDLs:

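Recursive session 0xc7085cfa0 started transaction 0xc7cad5b40 at 17:06:48, but the session currently has no SQL executing. Could this also be a clue?


It seems that a single processstate dump is not enough to reconstruct the full state of the system at the time, but thank you all the same, Liu.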


4#
Posted on 2012-6-8 16:54:28
I also checked the lmhb process trace file; Liu, please take a look. The following appears in it many times:
LCK0 (ospid: 13607) has not moved for 20 sec (xxxxxxx.xxxxxxx)
kjfmGCR_HBCheckAll: LCK0 (ospid: 13607) has status 2
  : waiting for event 'rdbms ipc message' for 0 secs with wait_id xxxxxxx.
  ===[ Wait Chain ]===
  Wait chain is empty.


tbdb1_lmhb_13465.rar (2.7 KB, downloads: 890)


5#
Posted on 2012-6-8 21:54:50
*** 2012-06-01 17:20:36.601
==============================
LCK0 (ospid: 13607) has not moved for 20 sec (1338542436.1338542416)
kjfmGCR_HBCheckAll: LCK0 (ospid: 13607) has status 2
  : waiting for event 'rdbms ipc message' for 0 secs with wait_id 560181102.
  ===[ Wait Chain ]===
  Wait chain is empty.

*** 2012-06-01 17:20:56.607
==============================
LCK0 (ospid: 13607) has not moved for 20 sec (1338542456.1338542436)
kjfmGCR_HBCheckAll: LCK0 (ospid: 13607) has status 2
  : waiting for event 'rdbms ipc message' for 0 secs with wait_id 560181301.
  ===[ Wait Chain ]===
  Wait chain is empty.


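The LCK0 process shown here sat in an idle wait for a long time and was never killed. This can only be regarded as another symptom of the problem, not its cause.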
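The two numbers in parentheses on the "has not moved" lines appear to be Unix epoch seconds: the time of the check and the time of the last recorded movement. That reading is an inference from the excerpt, not documented behaviour. The sketch below decodes one line under that assumption and converts the check time to GMT+8, the offset assumed for this server. The names `PATTERN` and `decode` are illustrative.

```python
import re
from datetime import datetime, timedelta, timezone

LINE = "LCK0 (ospid: 13607) has not moved for 20 sec (1338542436.1338542416)"

PATTERN = re.compile(
    r"(\w+) \(ospid: (\d+)\) has not moved for (\d+) sec \((\d+)\.(\d+)\)"
)

def decode(line, utc_offset_hours=8):
    """Split a heartbeat line into its parts; None if the line does not match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    name, ospid, secs, now, last = m.groups()
    tz = timezone(timedelta(hours=utc_offset_hours))  # assumed server offset
    return {
        "process": name,
        "ospid": int(ospid),
        "reported_gap": int(secs),
        # Assumption: both values are epoch seconds.
        "computed_gap": int(now) - int(last),
        "checked_at": datetime.fromtimestamp(int(now), tz).strftime("%H:%M:%S"),
    }

if __name__ == "__main__":
    print(decode(LINE))
```

For the first excerpt the computed gap is 20 seconds and the check time decodes to 17:20:36, matching the `*** 2012-06-01 17:20:36.601` header, which supports the epoch interpretation.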


6#
Posted on 2012-6-12 15:36:14
I have opened an SR. If there is any progress I will reply to this thread.


7#
Posted on 2012-6-12 16:55:01

Re: post 6#

Good reply!


8#
Posted on 2012-6-14 16:15:03
After reviewing the lmhb trace file we provided, Oracle Support did the following research:

Note: This is INTERNAL ONLY research.  No action should be taken by the customer on this information. This is research only, and may NOT be applicable to your specific situation.

KM SEARCH
-------------------

SR #: 3-5600337171
Summary: shutdown of RAC database hangs
Description: When we tried to perform a switchover from a primary RAC database to a physical standby the process stopped because shutdown of the primary RAC database hung.

The following messages was written in the lmhb trace file:
LMS0 (ospid: 17482) has not moved for 34 sec (1334571575.1334571541)
kjfmGCR_HBCheckAll: LMS0 (ospid: 17482) has status 2
: waiting for event 'gcs remote message' for 0 secs with wait_id 166465.
===[ Wait Chain ]===
Wait chain is empty.

After searching the support web I found document ID 1440112.1.
We implemented the workaround by setting "_gc_defer_time=3", and that fixed the problem.

I am not 100% sure if you have hit this bug.
When looking at the messages in the trace file, then yes, it surely has all the symptoms
But I see no recording, that may be lack of info, that this problem hangs the instance during a shutdown.

The parameter "_gc_defer_time=3" configures the time (in ms) on a cache block that a LMS process gives to a user process
to complete his transaction on a local instance, before changing the lock and
sending the block to a requestor on another instance.
It "may" improve performance on local instance transactions.

That setting the hidden parameter solved the hang is interesting and does
add to the suspicion that this bug was involved

===> 11.2.0.3

Bug 9886569: [11202-LNX-100628]LMHB DON'T KILL LCK0 EVEN THOUGH IT WAS HUNG FOR A LONG TIME


I will keep following up...


9#
Posted on 2012-6-14 20:33:19
backport BUG 9886569 to 11.2.0.2.0
KM Search
------------------
keyword = shutdown hang kjfmGCR_HBCheckAll: LCK0 "Wait chain is empty"

Bug 9886569 [http://bug.oraclecorp.com/pls/bug/webbug_edit.edit_info_top?rptno=9886569] : [11202-LNX-100628]LMHB DON'T KILL LCK0 EVEN THOUGH IT WAS HUNG FOR A LONG TIME

RELEASE NOTES:
]]LMHB does not terminate the instance even though it detects that a fatal
]]background has been stuck for a long time.
@
@INTERNAL PROBLEM DESCRIPTION:
@A flag is not reset in the LMHB codepath which prevents LMHB from taking any
@fatal action such as instance termination when required.
@
@INTERNAL FIX DESCRIPTION:
@The required flag is now reset at beginning of the action taking loop.
@
@BACKPORT FEASIBLE:
@Yes
@
@FORWARD MERGE REQUIRED:
@No (merged to main branch)
@
REDISCOVERY INFORMATION:
If LMHB detects that a fatal background is hung but it does not terminate the
instance then most probably this bug has been hit.


Bug 9894036 [http://bug.oraclecorp.com/pls/bug/webbug_edit.edit_info_top?rptno=9894036] : RFI BACKPORT OF BUG 9886569 [http://bug.oraclecorp.com/pls/bug/webbug_edit.edit_info_top?rptno=9886569] FOR INCLUSION IN 11.2.0.2.0 (RFI #350307)


10#
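Posted on 2012-6-14 22:42:22

Re: post 9#

Judging from the SR reply, GCS found a potentially matching bug in the internal bug database: LCK0 hangs, but the LMHB process does not terminate the instance as it should.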


11#
Posted on 2012-6-15 08:30:54

Dear customer,

From the trace file provided, we confirmed your issue hit Bug 9886569 : [11202-LNX-100628]LMHB DON'T KILL LCK0 EVEN THOUGH IT WAS HUNG FOR A LONG TIME

The bug will be fixed in 11.2.0.3. The backport patch request with
Bug 9894036 : RFI BACKPORT OF BUG 9886569 FOR INCLUSION IN 11.2.0.2.0 (RFI #350307)
is still under review.

The cause is the LMHB does not terminate the instance even though it detects that a fatal background has been stuck for a long time.

Please let me know whether you would like to have a backport patch on 11.2.0.2 which takes some time or any plan to upgrade to 11.2.0.3 recently?

Best Regards,
Oracle Global Software Support


12#
Posted on 2012-6-15 10:20:33

Re: post 11#

Very good, this is now a complete case.


13#
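Posted on 2012-6-21 09:36:36
The backport was rejected. The R&D team believes LCK0 showed no sign of hanging at the time, on the grounds that its heartbeat was normal. The fix for bug 9886569 therefore would not solve my problem, and a new bug needs to be raised.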

Dear customer,

Thanks for your waiting.

I got a rejection from the back port patch request.

Please refer to the following from our RnD team

------
I was looking at the SR trace files and I saw that LCK0 never stopped issuing HB more than 70 secs. Then LMHB should not attempt to kill the process.

Actually most of the times we had:

LCK0 (ospid: 13607) has not moved for 20 sec

which means that the process HB was not produced for 20 secs.

Then the patch is not useful here, you need to raise a new bug if this happens again and collect a couple of systemstate dumps.
------

In the suspected bug, it has the symptom as

@ ORA-29770: global enqueue process LCK0 (OSID 13469) is hung for more than 150
@ seconds

So, we have to raise another new bug once issue reoccurred.

Before raising the bug, need to have the following log / information:

ACTION PLAN
-----------------------
1. Please confirm whether we can reproduce the case?

2. Once issue reoccurred, please

See Note 121779.1 Taking Systemstate Dumps when You cannot Connect to Oracle.

Please run the following on one instance as sysdba:
SQL>oradebug setmypid
SQL>oradebug unlimit
SQL>oradebug -g all hanganalyze 3
Wait for 30 seconds
SQL>oradebug -g all hanganalyze 3
SQL>exit

Open another session as sysdba:
SQL>oradebug setmypid
SQL>oradebug unlimit
SQL>oradebug -g all dump systemstate 10
Wait for 30 seconds
SQL>oradebug -g all dump systemstate 10
SQL>oradebug tracefile_name
SQL>exit

The generated trace files will be in the diag trace file in BDUMP for each instance, please upload these files.

3. Provide alert log and related trace file during issue occurred period.

Best Regards,
Oracle Global Software Support

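Because the hang may not recur for a long time, it can help to keep the two command sequences from the action plan ready as script files. The sketch below only generates the script text from the steps quoted above; it does not connect to the database. The function name `build_oradebug_script` is hypothetical, and using the SQL*Plus `host` command with the shell's `sleep` for the 30-second pause is an assumption about a Linux host, not part of Support's instructions.

```python
# oradebug commands exactly as quoted in the action plan above.
DUMP_COMMANDS = {
    "hanganalyze": "oradebug -g all hanganalyze 3",
    "systemstate": "oradebug -g all dump systemstate 10",
}

def build_oradebug_script(kind, pause_seconds=30):
    """Return SQL*Plus script text for one of the two diagnostic sessions."""
    dump = DUMP_COMMANDS[kind]  # raises KeyError for an unknown kind
    lines = [
        "oradebug setmypid",
        "oradebug unlimit",
        dump,
        f"host sleep {pause_seconds}",  # assumption: pause via the OS shell
        dump,
    ]
    if kind == "systemstate":
        # Only the systemstate session prints the trace file name in the plan.
        lines.append("oradebug tracefile_name")
    lines.append("exit")
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    for kind in ("hanganalyze", "systemstate"):
        print(f"-- {kind}.sql")
        print(build_oradebug_script(kind))
```

The two scripts are meant for two separate sysdba sessions, as in the action plan. Any such script should be reviewed against the referenced Note before being used on a production system.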


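Now comes the long wait for the next shutdown. That may be this year, next year, or even the year after, and the problem would have to be triggered again. Heh, it looks like this will go unresolved...

Important: when abnormal behavior occurs, a systemstate dump is critical for the support team to diagnose the problem and build a patch. Be sure to take one.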


14#
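Posted on 2012-6-21 11:00:04
I just ran a test: the lmhb heartbeat-check threshold is 70 s, and once 110 s is reached it takes a process dump and tries to wake up the process concerned.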

*** 2012-06-21 10:49:07.009
==============================
LCK0 (ospid: 13261) has not moved for 110 sec (1340246946.1340246836)
kjfmGCR_HBCheckAll: LCK0 (ospid: 13261) has status 6
==================================================
=== LCK0 (ospid: 13261) Heartbeat Report
==================================================
LCK0 (ospid: 13261) has no heartbeats for 110 sec. (threshold 70 sec)
  : waiting for event 'rdbms ipc message' for 100 secs with wait_id 37443.
===[ Wait Chain ]===
Wait chain is empty.
==============================
Dumping PROCESS LCK0 (ospid: 13261) States
==============================
...
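Given the 70-second threshold printed in the heartbeat report, an lmhb trace can be scanned for the entries that actually exceed it. That is the distinction the R&D team drew when rejecting the backport. The sketch below is a minimal illustration: `gaps_over_threshold` and `SAMPLE` are names introduced here, and the default of 70 is taken from the trace line above rather than from any documented parameter.

```python
import re

HB_LINE = re.compile(r"(\w+) \(ospid: (\d+)\) has not moved for (\d+) sec")

# Lines copied from the lmhb excerpts quoted in this thread.
SAMPLE = """\
LCK0 (ospid: 13607) has not moved for 20 sec (1338542436.1338542416)
LCK0 (ospid: 13607) has not moved for 20 sec (1338542456.1338542436)
LCK0 (ospid: 13261) has not moved for 110 sec (1340246946.1340246836)
"""

def gaps_over_threshold(trace_text, threshold=70):
    """Return (process, ospid, seconds) for every entry above the threshold."""
    hits = []
    for m in HB_LINE.finditer(trace_text):
        seconds = int(m.group(3))
        if seconds > threshold:
            hits.append((m.group(1), int(m.group(2)), seconds))
    return hits

if __name__ == "__main__":
    for process, ospid, seconds in gaps_over_threshold(SAMPLE):
        print(f"{process} (ospid {ospid}) silent for {seconds} s")
```

On these lines only the 110-second entry from the test is reported; the 20-second entries recorded during the original hang stay below the threshold.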


15#
Posted on 2012-6-21 11:38:30
Heh.

GCS loves asking people to take hanganalyze and systemstate dumps for hang scenarios.


16#
Posted on 2012-6-21 15:24:33
Is there any way to view these internal notes?


17#
Posted on 2012-6-21 16:26:45
Good stuff.
