Oracle数据库数据恢复、性能优化

找回密码
注册
搜索
热搜: 活动 交友 discuz
发新帖

94

积分

0

好友

1

主题
1#
发表于 2012-4-22 10:52:49 | 查看: 21556| 回复: 18
客户这边是HP-UX 11i v3 的11.2.0.2 RAC环境

已经出现两次同样的情况在晚上00:39的时候节点1 ora.LISTENER.lsnr 莫名其妙abort掉了,同时该节点主机HANG死,无法登陆。

grid的alert日志里面记录如下:
[/grid/product/11.2/bin/oraagent.bin(3084)]CRS-5818:Aborted command 'check for resource: ora.LISTENER.lsnr jzdb1 1' for resource 'ora.LISTENER.lsnr'. Details at (:CRSA
GF00113:) {1:49618:2} in /grid/product/11.2/log/jzdb1/agent/crsd/oraagent_grid/oraagent_grid.log.
2012-04-22 00:39:43.525
[/grid/product/11.2/bin/oraagent.bin(3084)]CRS-5014:Agent "/grid/product/11.2/bin/oraagent.bin" timed out starting process "/grid/product/11.2/bin/lsnrctl" for action
"check": details at "(:CLSN00009:)" in "/grid/product/11.2/log/jzdb1/agent/crsd/oraagent_grid/oraagent_grid.log"
2012-04-22 00:40:46.457
[/grid/product/11.2/bin/oraagent.bin(3084)]CRS-5832:Agent '/grid/product/11.2/bin/oraagent_grid' was unable to process commands. Details at (:CRSAGF00128:) {1:49618:2}
in /grid/product/11.2/log/jzdb1/agent/crsd/oraagent_grid/oraagent_grid.log.
2#
发表于 2012-4-22 10:55:26

回复 1# 的帖子

oraagent_grid.log日志:

2012-04-22 00:39:08.189: [ora.LISTENER.lsnr][13] {1:49618:2} [check] Adding Environment Variables ORACLE_HOME=/grid/product/11.2
2012-04-22 00:39:08.189: [ora.LISTENER.lsnr][13] {1:49618:2} [check] Adding Environment variable from USR_ORA_ENV ORACLE_BASE=/grid/app
2012-04-22 00:39:08.189: [ora.LISTENER.lsnr][13] {1:49618:2} [check] Utils:execCmd action = 3 flags = 6 ohome = (null) cmdname = lsnrctl.
2012-04-22 00:39:08.189: [ora.LISTENER.lsnr][13] {1:49618:2} [check] getOracleHomeAttrib: oracle_home = /grid/product/11.2
2012-04-22 00:39:16.426: [    AGFW][10] {1:49618:2} Agent received the message: AGENT_HB[Engine] ID 12293:150938
2012-04-22 00:39:17.249: [ora.FRADG.dg][12] {1:49618:2} [check] ConnectionPool::getConnection 260 pConnxn 00c31690
2012-04-22 00:39:17.250: [ora.FRADG.dg][12] {1:49618:2} [check] DgpAgent::getConnxn connected
2012-04-22 00:39:17.300: [ora.FRADG.dg][12] {1:49618:2} [check] InstConnection:connectInt:is up
2012-04-22 00:39:17.302: [ora.FRADG.dg][12] {1:49618:2} [check] ConnectionPool::releaseConnection InstConnection 00c31690
2012-04-22 00:39:17.302: [ora.FRADG.dg][12] {1:49618:2} [check] DgpAgent::runCheck:clsagfw_res_status 0
2012-04-22 00:39:28.794: [ora.DG01.dg][18] {1:49618:2} [check] ConnectionPool::getConnection 260 pConnxn 00caac90
2012-04-22 00:39:28.794: [ora.DG01.dg][18] {1:49618:2} [check] DgpAgent::getConnxn connected
2012-04-22 00:39:28.795: [ora.CRSDG.dg][9] {1:49618:2} [check] ConnectionPool::getConnection 260 pConnxn 00c86cc0
2012-04-22 00:39:28.795: [ora.CRSDG.dg][9] {1:49618:2} [check] DgpAgent::getConnxn connected
2012-04-22 00:39:29.001: [ora.CRSDG.dg][9] {1:49618:2} [check] InstConnection:connectInt:is up
2012-04-22 00:39:29.003: [ora.CRSDG.dg][9] {1:49618:2} [check] ConnectionPool::releaseConnection InstConnection 00c86cc0
2012-04-22 00:39:29.003: [ora.CRSDG.dg][9] {1:49618:2} [check] DgpAgent::runCheck:clsagfw_res_status 0
2012-04-22 00:39:29.007: [ora.DG01.dg][18] {1:49618:2} [check] InstConnection:connectInt:is up
2012-04-22 00:39:29.009: [ora.DG01.dg][18] {1:49618:2} [check] ConnectionPool::releaseConnection InstConnection 00caac90
2012-04-22 00:39:29.009: [ora.DG01.dg][18] {1:49618:2} [check] DgpAgent::runCheck:clsagfw_res_status 0
2012-04-22 00:39:38.278: [    AGFW][18] {1:49618:2} Created alert : (:CRSAGF00113:) :  Aborting the command: check for resource: ora.LISTENER.lsnr jzdb1 1
2012-04-22 00:39:38.278: [ora.LISTENER.lsnr][18] {1:49618:2} [check] clsn_agent::abort {
2012-04-22 00:39:38.278: [ora.LISTENER.lsnr][18] {1:49618:2} [check] abort {
2012-04-22 00:39:38.278: [ora.LISTENER.lsnr][18] {1:49618:2} [check] abort command: check
2012-04-22 00:39:38.278: [ora.LISTENER.lsnr][18] {1:49618:2} [check] tryActionLock {
2012-04-22 00:39:38.352: [ora.LISTENER.lsnr][13] {1:49618:2} [check] (:CLSN00009:)Utils:execCmd abort request: killing /grid/product/11.2/bin/lsnrctl
2012-04-22 00:39:39.557: [ora.asm][9] {1:49618:2} [check] clsnUtils::setResAttrib attribute GEN_USR_ORA_INST_NAME value +ASM1
2012-04-22 00:39:39.970: [ora.asm][9] {1:49618:2} [check] setResAttrib: attr GEN_USR_ORA_INST_NAME@SERVERNAME(jzdb1) clsagfw attribVal +ASM1 clscrs tmpAttrValue +ASM1
2012-04-22 00:39:40.018: [ora.asm][9] {1:49618:2} [check] AsmProxyAgent::check clsagfw_res_status 0
2012-04-22 00:39:43.037: [   AGENT][13] {1:49618:2} UserErrorException: Locale is
2012-04-22 00:39:43.525: [ora.LISTENER.lsnr][13] {1:49618:2} [check] clsnUtils::error Exception type=2 string=
CRS-5014: Agent "/grid/product/11.2/bin/oraagent.bin" timed out starting process "/grid/product/11.2/bin/lsnrctl" for action "check": details at "(:CLSN00009:)" in "/g
rid/product/11.2/log/jzdb1/agent/crsd/oraagent_grid/oraagent_grid.log"

回复 只看该作者 道具 举报

3#
发表于 2012-4-22 11:00:22

回复 2# 的帖子

同时oracle的TRACE文件中出现如下报错信息:
        Free Mem  = 468.46Mb
skgpgcmdout: read() for cmd /bin/ps -elf | /bin/egrep 'PID | 25146' | /bin/grep -v grep timed out after 14.920 seconds
Skipping stack dump because max dump time exceeded.

-------------------------------------------------------------------------------
Process diagnostic dump actual duration=2 min 41 sec
  (max dump time=30.000000 sec)

*** 2012-04-22 00:44:14.129
Waited for process M000 to initialize for 248 seconds

*** 2012-04-22 00:44:14.129
Process diagnostic dump for M000, OS id=25146
-------------------------------------------------------------------------------
os thread scheduling delay history: (sampling every 1.000000 secs)
  0.000000 secs at [ 00:42:04 ]
    NOTE: scheduling delay has not been sampled for 129.394396 secs  0.000000 secs from [ 00:42:04 - 00:44:14 ], 5 sec avg
  0.000000 secs from [ 00:42:04 - 00:44:14 ], 1 min avg
  0.000000 secs from [ 00:39:14 - 00:44:14 ], 5 min avg

*** 2012-04-22 00:44:40.946
loadavg : 0.10 0.11 0.11

*** 2012-04-22 00:44:47.314
Swapinfo :
        Avail = 94938.05Mb Used = 47474.27Mb
        Swap free = 47463.78Mb Kernel rsvd = 5111.89Mb
        Free Mem  = 482.04Mb
skgpgcmdout: read() for cmd /bin/ps -elf | /bin/egrep 'PID | 25146' | /bin/grep -v grep timed out after 15.000 seconds
Skipping stack dump because max dump time exceeded.

-------------------------------------------------------------------------------
Process diagnostic dump actual duration=33.251000 sec
  (max dump time=30.000000 sec)

*** 2012-04-22 00:44:47.380
Killing process (ospid 25146):  (reason=KSOREQ_WAIT_CANCELLED error=0)
... and the process is still alive after kill!

*** 2012-04-22 00:44:57.788
Incident 241756 created, dump file: /oracle/diag/rdbms/jzsidb/jzsidb1/incident/incdir_241756/jzsidb1_mmon_3239_i241756.trc

*** 2012-04-22 00:45:02.025
ORA-00445: background process "m000" did not start after 120 seconds

回复 只看该作者 道具 举报

4#
发表于 2012-4-22 11:23:14

回复 3# 的帖子

系统syslog中大量如下报错:
Apr 22 00:21:34 jzdb1 vmunix: device 0x40020001, 1K block #218099648, 65536 bytes of 65536 bytes not paged, page addr =dead31.e008000c69f70000,  bflag =45008e, bp = e0
000004a2840c80
Apr 22 00:21:34 jzdb1 vmunix: device 0x40020001, 1K block #218099200, 65536 bytes of 65536 bytes not paged, page addr =dead31.e008000c3dec0000,  bflag =45008e, bp = e0
000004a2841100
Apr 22 00:21:34 jzdb1 vmunix: device 0x40020001, 1K block #218099264, 65536 bytes of 65536 bytes not paged, page addr =dead31.e008000c3ded0000,  bflag =45008e, bp = e0
000004a2841400
Apr 22 00:21:34 jzdb1 vmunix: device 0x40020001, 1K block #218099328, 65536 bytes of 65536 bytes not paged, page addr =dead31.e008000c3dee0000,  bflag =45008e, bp = e0
000004a2841700

[ 本帖最后由 lixz_2008 于 2012-4-22 12:00 编辑 ]

log.rar

229.38 KB, 下载次数: 2603

回复 只看该作者 道具 举报

5#
发表于 2012-4-22 11:24:24
上传  alert.log  listener.log  $GI_HOME/log/$HOSTNAME/racg 目录下的日志

问题发生时段  diag 进程、pmon、mmon 进程 、dbrm 进程的 TRACE

请压缩打包后上传

回复 只看该作者 道具 举报

6#
发表于 2012-4-22 12:01:02

回复 5# 的帖子

刘大,日志已经上传。请帮忙检查.谢谢! 漏了mmon的,重新上传一份

[ 本帖最后由 lixz_2008 于 2012-4-22 12:07 编辑 ]

log.rar

260.65 KB, 下载次数: 2685

回复 只看该作者 道具 举报

7#
发表于 2012-4-22 12:13:40

回复 6# 的帖子

另外我在2号节点上面发现如下错误:
012-04-16 23:04:49.578
[client(4639)]CRS-10051:CVU found following errors with Clusterware setup : 无法从节点 "jzdb2" 检索 exectask 的版本
PRVF-7543 : 操作系统内核参数 "tcp_smallest_anon_port" 在节点 "jzdb1" 上没有适当的值 [应为 = "9000"; 找到 = "49152"]。
PRVF-7543 : 操作系统内核参数 "tcp_largest_anon_port" 在节点 "jzdb1" 上没有适当的值 [应为 = "65500"; 找到 = "65535"]。
无法从节点 "jzdb2" 检索 exectask 的版本
无法从节点 "jzdb2" 检索 exectask 的版本
无法从节点 "jzdb2" 检索 exectask 的版本
节点 "jzdb2" 不可访问
无法从节点 "jzdb2" 检索 exectask 的版本
PRVF-4555 : 节点 "jzdb1" 上不存在节点应用程序 "ora.jzdb1.vip"
PRVF-4555 : 节点 "jzdb1" 上不存在节点应用程序 "ora.net1.network"

2012-04-17 05:04:46.876
[client(29104)]CRS-10051:CVU found following errors with Clusterware setup : 无法从节点 "jzdb2" 检索 exectask 的版本
PRVF-7543 : 操作系统内核参数 "tcp_smallest_anon_port" 在节点 "jzdb1" 上没有适当的值 [应为 = "9000"; 找到 = "49152"]。
PRVF-7543 : 操作系统内核参数 "tcp_largest_anon_port" 在节点 "jzdb1" 上没有适当的值 [应为 = "65500"; 找到 = "65535"]。
无法从节点 "jzdb2" 检索 exectask 的版本
无法从节点 "jzdb2" 检索 exectask 的版本
无法从节点 "jzdb2" 检索 exectask 的版本
无法从节点 "jzdb2" 检索 exectask 的版本
无法从节点 "jzdb2" 检索 exectask 的版本
PRVF-4555 : 节点 "jzdb1" 上不存在节点应用程序 "ora.jzdb1.vip"
PRVF-4555 : 节点 "jzdb1" 上不存在节点应用程序 "ora.net1.network"

这四个网络参数,系统 /etc/rc.config.d/nddconf,已经设置不知道为何重启后不生效。

回复 只看该作者 道具 举报

8#
发表于 2012-4-22 13:24:29
*** 2012-04-22 00:41:06.035
Waited for process M000 to initialize for 60 seconds

*** 2012-04-22 00:41:21.067
Process diagnostic dump for M000, OS id=25146

*** 2012-04-22 00:41:31.708
-------------------------------------------------------------------------------

*** 2012-04-22 00:41:32.845
os thread scheduling delay history: (sampling every 1.000000 secs)

*** 2012-04-22 00:41:33.004
  0.000000 secs at [ 00:41:32 ]
    NOTE: scheduling delay has not been sampled for 0.829234 secs  0.000000 secs from [ 00:41:29 - 00:41:33 ], 5 sec avg
  0.000000 secs from [ 00:40:33 - 00:41:33 ], 1 min avg
  0.000000 secs from [ 00:36:33 - 00:41:33 ], 5 min avg

*** 2012-04-22 00:42:18.595
loadavg : 0.13 0.12 0.11

*** 2012-04-22 00:42:20.642
Swapinfo :

*** 2012-04-22 00:42:23.745
        Avail = 94938.05Mb Used = 47458.56Mb

*** 2012-04-22 00:44:11.993
        Swap free = 47479.49Mb Kernel rsvd = 5114.76Mb

*** 2012-04-22 00:44:14.027
        Free Mem  = 468.46Mb
skgpgcmdout: read() for cmd /bin/ps -elf | /bin/egrep 'PID | 25146' | /bin/grep -v grep timed out after 14.920 seconds
Skipping stack dump because max dump time exceeded.

-------------------------------------------------------------------------------
Process diagnostic dump actual duration=2 min 41 sec
  (max dump time=30.000000 sec)





我们假设 问题是从 2012-04-22 00:41:06.035开始的 , 其最近的 负载并不高

loadavg : 0.13 0.12 0.11
且有大量空闲内存
        Free Mem  = 468.46Mb


症状是 M000  进程无法启动 , 可能发生了 实例hang 住导致该问题

因为M000 缺席,这可能导致该段时间 AWR报告不可用 , 但是 ASH信息可能可用

*** 2012-04-22 00:45:02.025
ORA-00445: background process "m000" did not start after 120 seconds




alert .log 中的信息 可以看到从LMON也出现了告警



Sun Apr 22 00:44:02 2012
LMON (ospid: 3209) has not called a wait for 13 secs.
Sun Apr 22 00:44:47 2012
Errors in file /oracle/diag/rdbms/jzsidb/jzsidb1/trace/jzsidb1_mmon_3239.trc  (incident=241756):
ORA-00445: background process "m000" did not start after 120 seconds
Sun Apr 22 00:45:30 2012
Incident details in: /oracle/diag/rdbms/jzsidb/jzsidb1/incident/incdir_241756/jzsidb1_mmon_3239_i241756.trc
Sun Apr 22 00:52:12 2012
ARC2: Detected ARCH process failure
Sun Apr 22 00:56:23 2012
Errors in file /oracle/diag/rdbms/jzsidb/jzsidb1/trace/jzsidb1_cjq0_3314.trc  (incident=241964):
ORA-00445: background process "J000" did not start after 120 seconds

回复 只看该作者 道具 举报

9#
发表于 2012-4-22 13:28:11
[crsd(3033)]CRS-5828:Could not start agent '/grid/product/11.2/bin/oraagent_grid'. Details at (:CRSAGF00130:) {1:49618:10247} in /grid/product/11.2/log/jzdb1/crsd/crsd.log.
2012-04-22 00:49:07.902
[ctssd(2926)]CRS-2408:The clock on host jzdb1 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.
2012-04-22 00:51:21.764
[crsd(3033)]CRS-5828:Could not start agent '/grid/product/11.2/bin/oraagent_grid'. Details at (:CRSAGF00130:) {1:49618:10247} in /grid/product/11.2/log/jzdb1/crsd/crsd.log.

也请上传压缩后的  /grid/product/11.2/log/jzdb1/crsd/crsd.log  和  LMON 、LMHB在问题时段的TRACE

回复 只看该作者 道具 举报

10#
发表于 2012-4-22 14:02:25

回复 9# 的帖子

crsd日志上传

crsd.rar

11.58 KB, 下载次数: 2739

回复 只看该作者 道具 举报

11#
发表于 2012-4-22 14:24:57
ODM DATA:

crsd.log
2012-04-22 00:40:49.338: [ CRSCOMM][38][FFAIL] Ipc: Couldnt clscreceive message, no message: 11
2012-04-22 00:40:49.339: [ CRSCOMM][38] Ipc: Client disconnected.
2012-04-22 00:40:49.339: [ CRSCOMM][38][FFAIL] IpcL: Listener got clsc error 11 for memNum. 1
2012-04-22 00:40:49.339: [ CRSCOMM][38] IpcL: connection to member 1 has been removed
2012-04-22 00:40:49.339: [CLSFRAME][38] Removing IPC Member:{Relative|Node:0|Process:1|Type:3}
2012-04-22 00:40:49.339: [CLSFRAME][38] Disconnected from AGENT process: {Relative|Node:0|Process:1|Type:3}
2012-04-22 00:40:49.340: [    AGFW][41] {1:49618:2} Agent /grid/product/11.2/bin/oraagent_grid[3084] killed!
2012-04-22 00:40:49.340: [ CRSCOMM][41] {1:49618:2} IpcL: removeConnection: Member 1 does not exist.
2012-04-22 00:40:49.346: [    AGFW][41] {1:49618:10247} Agfw Proxy Server received process disconnected notification, count=1
2012-04-22 00:40:49.346: [    AGFW][41] {1:49618:10247} /grid/product/11.2/bin/oraagent_grid disconnected.
2012-04-22 00:40:49.346: [    AGFW][41] {1:49618:10247} Agent /grid/product/11.2/bin/oraagent_grid[3084] stopped!
2012-04-22 00:40:49.346: [ CRSCOMM][41] {1:49618:10247} IpcL: removeConnection: Member 1 does not exist.
2012-04-22 00:40:49.346: [    AGFW][41] {1:49618:10247} Restarting the agent /grid/product/11.2/bin/oraagent_grid
2012-04-22 00:40:49.346: [    AGFW][41] {1:49618:10247} Starting the agent: /grid/product/11.2/bin/oraagent with user id: grid and incarnation:2
2012-04-22 00:40:49.880: [    AGFW][41] {1:49618:10247} Starting the HB [Interval =  30000, misscount = 6kill allowed=1] for agent: /grid/product/11.2/bin/oraagent_grid
2012-04-22 00:44:20.383: [    AGFW][41] {1:49618:10247} Created alert : (:CRSAGF00130:) :  Failed to start the agent /grid/product/11.2/bin/oraagent_grid
2012-04-22 00:44:20.383: [    AGFW][41] {1:49618:10247} Can not stop the agent: /grid/product/11.2/bin/oraagent_grid because pid is not initialized
2012-04-22 00:44:20.383: [    AGFW][41] {1:49618:10247} Restarting the agent /grid/product/11.2/bin/oraagent_grid
2012-04-22 00:44:20.383: [    AGFW][41] {1:49618:10247} Starting the agent: /grid/product/11.2/bin/oraagent with user id: grid and incarnation:4
2012-04-22 00:44:20.423: [    AGFW][41] {1:49618:10247} Starting the HB [Interval =  30000, misscount = 6kill allowed=1] for agent: /grid/product/11.2/bin/oraagent_grid
2012-04-22 00:47:51.223: [    AGFW][41] {1:49618:10247} Created alert : (:CRSAGF00130:) :  Failed to start the agent /grid/product/11.2/bin/oraagent_grid
2012-04-22 00:47:51.223: [    AGFW][41] {1:49618:10247} Can not stop the agent: /grid/product/11.2/bin/oraagent_grid because pid is not initialized
2012-04-22 00:47:51.223: [    AGFW][41] {1:49618:10247} Restarting the agent /grid/product/11.2/bin/oraagent_grid
2012-04-22 00:47:51.223: [    AGFW][41] {1:49618:10247} Starting the agent: /grid/product/11.2/bin/oraagent with user id: grid and incarnation:6
2012-04-22 00:47:51.685: [    AGFW][41] {1:49618:10247} Starting the HB [Interval =  30000, misscount = 6kill allowed=1] for agent: /grid/product/11.2/bin/oraagent_grid
2012-04-22 10:19:10.706: [ default][1] First attempt: init CSS context succeeded.
[  clsdmt][3]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=jzdb1DBG_CRSD))
2012-04-22 10:19:10.713: [  clsdmt][3]PID for the Process [3126], connkey 1



lmon:

*** 2012-04-22 00:41:38.849
==============================

*** 2012-04-22 00:41:44.140
LMS1 (ospid: 3215) has not moved for 21 sec (1335026478.1335026457)

*** 2012-04-22 00:42:48.440

*** 2012-04-22 00:42:55.533
=====================================================

*** 2012-04-22 00:43:07.645
kjxgmpoll: stalled for 90 seconds (threshold 42 sec)

*** 2012-04-22 00:43:12.493
kjfmHBupdateTime: took 117 seconds for 8 heartbeats

*** 2012-04-22 00:43:14.103
==============================
LMS1 (ospid: 3215) has not moved for 137 sec (1335026594.1335026457)

*** 2012-04-22 00:43:19.604

*** 2012-04-22 00:43:49.410
==============================
LMS0 (ospid: 3213) has not moved for 63 sec (1335026599.1335026536)

*** 2012-04-22 00:43:50.522
kjfmHBupdateTime: took 39 seconds for 8 heartbeats

*** 2012-04-22 00:44:33.031

*** 2012-04-22 00:44:59.805
==============================
LMD0 (ospid: 3211) has not moved for 25 sec (1335026672.1335026647)

*** 2012-04-22 00:45:00.295
==============================
LMS1 (ospid: 3215) has not moved for 162 sec (1335026699.1335026537)

*** 2012-04-22 00:45:00.296
=====================================================
kjxgmpoll: stalled for 70 seconds (threshold 42 sec)

*** 2012-04-22 00:45:07.892
kjfmHBupdateTime: took 77 seconds for 8 heartbeats

*** 2012-04-22 00:45:15.112

*** 2012-04-22 00:45:23.349
==============================

*** 2012-04-22 00:45:25.666
LMS0 (ospid: 3213) has not moved for 179 sec (1335026715.1335026536)

*** 2012-04-22 00:45:53.614
==============================

*** 2012-04-22 00:45:55.568

*** 2012-04-22 00:46:08.842
Received ORADEBUG command (#1) 'dump KSTDUMPCURPROC 1' from process 'Unix process pid: 0, image: <none>'
-------------------------------------------------------------------------------
Trace Bucket Dump Begin: default bucket for process 11 (osid: 3209, LMON)
TIME(*=approx):SEQ:COMPONENT:FILE@LINE:FUNCTION:SECT/DUMP: [EVENT#:PID:SID] DATA
-------------------------------------------------------------------------------
2012-04-22 00:41:17.267670 :97BB270D:db_trace:ksl2.c@13718:ksliwat(): [10005:11:1409] KSL POST RCVD poster=12 num=8304692 loc='kji.h LINE:3395 ID:kjxhap2ackh: wake up DIA0 waiting in kjiwtp2ack' id1=0 id2=0 name=   type=0 fac#=2 posted=0x1 may_be_posted=1
2012-04-22 00:41:17.267704 :97BB270E:db_trace:ksxp.c@4232:ksxpvsnd_with_bcb(): [10401:11:1409] KSXPVSND: client 5 tid(2,257,0x39dd81a1) buf 0x9ffffffffcdf8bd0 sz 92
2012-04-22 00:41:17.267866 :97BB270F:db_trace:ksxp.c@5699:ksxpwait(): [10401:11:1409] KSXP_SND_CALLBACK:  request 0x9ffffffffd0e66c0, status 30
2012-04-22 00:42:48.440355 :97BB76BE:db_trace:ksl2.c@13718:ksliwat(): [10005:11:1409] KSL POST RCVD poster=12 num=8304692 loc='kji.h LINE:3395 ID:kjxhap2ackh: wake up DIA0 waiting in kjiwtp2ack' id1=0 id2=0 name=   type=0 fac#=2 posted=0x1 may_be_posted=1
2012-04-22 00:43:09.192526 :97BB86FE:db_trace:ksxp.c@4232:ksxpvsnd_with_bcb(): [10401:11:1409] KSXPVSND: client 5 tid(2,257,0x39dd81a1) buf 0x9ffffffffcdfeea0 sz 92
2012-04-22 00:43:09.192602 :97BB86FF:db_trace:ksxp.c@4232:ksxpvsnd_with_bcb(): [10401:11:1409] KSXPVSND: client 5 tid(2,257,0x39dd81a1) buf 0x9ffffffffcdfd950 sz 92
2012-04-22 00:43:09.195347 :97BB8700:db_trace:ksxp.c@4232:ksxpvsnd_with_bcb(): [10401:11:1409] KSXPVSND: client 5 tid(2,257,0x39dd81a1) buf 0x9ffffffffcdf49f0 sz 480
2012-04-22 00:43:09.195384 :97BB8701:db_trace:ksxp.c@4232:ksxpvsnd_with_bcb(): [10401:11:1409] KSXPVSND: client 5 tid(2,257,0x39dd81a1) buf 0x9ffffffffcdf4420 sz 480
2012-04-22 00:43:09.195406 :97BB8702:db_trace:ksxp.c@4232:ksxpvsnd_with_bcb(): [10401:11:1409] KSXPVSND: client 5 tid(2,257,0x39dd81a1) buf 0x9ffffffffcdf4230 sz 480
2012-04-22 00:43:09.195458 :97BB8703:db_trace:ksl2.c@13718:ksliwat(): [10005:11:1409] KSL POST RCVD poster=12 num=8304692 loc='kji.h LINE:3395 ID:kjxhap2ackh: wake up DIA0 waiting in kjiwtp2ack' id1=0 id2=0 name=   type=0 fac#=2 posted=0x1 may_be_posted=1
2012-04-22 00:43:09.195507 :97BB8704:db_trace:ksxp.c@5699:ksxpwait(): [10401:11:1409] KSXP_SND_CALLBACK:  request 0x9ffffffffd2c13c0, status 30
2012-04-22 00:43:09.195511 :97BB8705:db_trace:ksxp.c@5699:ksxpwait(): [10401:11:1409] KSXP_SND_CALLBACK:  request 0x9ffffffffd3f0040, status 30
2012-04-22 00:43:09.197374 :97BB8706:db_trace:ksxp.c@4232:ksxpvsnd_with_bcb(): [10401:11:1409] KSXPVSND: client 5 tid(2,257,0x39dd81a1) buf 0x9ffffffffcdf8410 sz 240
2012-04-22 00:43:09.199024 :97BB8707:db_trace:ksxp.c@5699:ksxpwait(): [10401:11:1409] KSXP_SND_CALLBACK:  request 0x9ffffffffd2d4960, status 30
2012-04-22 00:43:09.199026 :97BB8708:db_trace:ksxp.c@5699:ksxpwait(): [10401:11:1409] KSXP_SND_CALLBACK:  request 0x9ffffffffd074e40, status 30
2012-04-22 00:43:09.199028 :97BB8709:db_trace:ksxp.c@5699:ksxpwait(): [10401:11:1409] KSXP_SND_CALLBACK:  request 0x9ffffffffcd94960, status 30
2012-04-22 00:43:09.199029 :97BB870A:db_trace:ksxp.c@5699:ksxpwait(): [10401:11:1409] KSXP_SND_CALLBACK:  request 0x9ffffffffd199c60, status 30
2012-04-22 00:43:09.199035 :97BB870B:db_trace:ksl2.c@13718:ksliwat(): [10005:11:1409] KSL POST RCVD poster=12 num=8304692 loc='kji.h LINE:3395 ID:kjxhap2ackh: wake up DIA0 waiting in kjiwtp2ack' id1=0 id2=0 name=   type=0 fac#=2 posted=0x1 may_be_posted=1
2012-04-22 00:43:09.199052 :97BB870C:db_trace:ksl2.c@15159:ksl_update_post_stats(): [10005:11:1409] KSL POST SENT postee=122 num=1352 loc='kjxgn1.h LINE:171 ID:kjxgnaef' id1=0 id2=0 name=   type=0
2012-04-22 00:43:09.199099 :97BB870D:db_trace:ksxp.c@4232:ksxpvsnd_with_bcb(): [10401:11:1409] KSXPVSND: client 5 tid(2,257,0x39dd81a1) buf 0x9ffffffffcdf7e40 sz 92
2012-04-22 00:43:09.199230 :97BB870F:db_trace:ksxp.c@5699:ksxpwait(): [10401:11:1409] KSXP_SND_CALLBACK:  request 0x9ffffffffcdeba20, status 30
2012-04-22 00:43:09.302854 :97BB8716:db_trace:ksxp.c@4232:ksxpvsnd_with_bcb(): [10401:11:1409] KSXPVSND: client 5 tid(2,257,0x39dd81a1) buf 0x9ffffffffcdffe20 sz 480
2012-04-22 00:43:09.303043 :97BB8717:db_trace:ksxp.c@5699:ksxpwait(): [10401:11:1409] KSXP_SND_CALLBACK:  request 0x9ffffffffcd34960, status 30
2012-04-22 00:43:09.303084 :97BB8718:db_trace:ksl2.c@13718:ksliwat(): [10005:11:1409] KSL POST RCVD poster=12 num=8304692 loc='kji.h LINE:3395 ID:kjxhap2ackh: wake up DIA0 waiting in kjiwtp2ack' id1=0 id2=0 name=   type=0 fac#=2 posted=0x1 may_be_posted=1
2012-04-22 00:43:09.303092 :97BB8719:db_trace:ksl2.c@15159:ksl_update_post_stats(): [10005:11:1409] KSL POST SENT postee=122 num=1353 loc='kjxgn1.h LINE:171 ID:kjxgnaef' id1=0 id2=0 name=   type=0
2012-04-22 00:43:10.734881 :97BB87CD:db_trace:ksl2.c@13718:ksliwat(): [10005:11:1409] KSL POST RCVD poster=12 num=8304692 loc='kji.h LINE:3395 ID:kjxhap2ackh: wake up DIA0 waiting in kjiwtp2ack' id1=0 id2=0 name=   type=0 fac#=2 posted=0x1 may_be_posted=1
2012-04-22 00:43:10.734901 :97BB87CE:db_trace:ksxp.c@4232:ksxpvsnd_with_bcb(): [10401:11:1409] KSXPVSND: client 5 tid(2,257,0x39dd81a1) buf 0x9ffffffffcdf7680 sz 240
2012-04-22 00:43:10.736404 :97BB87CF:db_trace:ksxp.c@5699:ksxpwait(): [10401:11:1409] KSXP_SND_CALLBACK:  request 0x9ffffffffcd3b9a0, status 30
2012-04-22 00:43:14.214695 :97BB89CC:db_trace:ksxp.c@4232:ksxpvsnd_with_bcb(): [10401:11:1409] KSXPVSND: client 5 tid(2,257,0x39dd81a1) buf 0x9ffffffffcdfe6e0 sz 480
2012-04-22 00:43:14.214765 :97BB89CD:db_trace:ksxp.c@4232:ksxpvsnd_with_bcb(): [10401:11:1409] KSXPVSND: client 5 tid(2,257,0x39dd81a1) buf 0x9ffffffffcdf5b60 sz 480
2012-04-22 00:43:14.215018 :97BB89CE:db_trace:ksxp.c@5699:ksxpwait(): [10401:11:1409] KSXP_SND_CALLBACK:  request 0x9ffffffffd0e66c0, status 30
2012-04-22 00:43:14.215022 :97BB89CF:db_trace:ksxp.c@5699:ksxpwait(): [10401:11:1409] KSXP_SND_CALLBACK:  request 0x9ffffffffd24a140, status 30
2012-04-22 00:43:14.215119 :97BB89D0:db_trace:ksl2.c@13718:ksliwat(): [10005:11:1409] KSL POST RCVD poster=12 num=8304692 loc='kji.h LINE:3395 ID:kjxhap2ackh: wake up DIA0 waiting in kjiwtp2ack' id1=0 id2=0 name=   type=0 fac#=2 posted=0x1 may_be_posted=1
2012-04-22 00:43:14.434379 :97BB89FB:db_trace:ksxp.c@4232:ksxpvsnd_with_bcb(): [10401:11:1409] KSXPVSND: client 5 tid(2,257,0x39dd81a1) buf 0x9ffffffffcdfaad0 sz 480
2012-04-22 00:43:14.434417 :97BB89FC:db_trace:ksxp.c@4232:ksxpvsnd_with_bcb(): [10401:11:1409] KSXPVSND: client 5 tid(2,257,0x39dd81a1) buf 0x9ffffffffcdf8600 sz 480
2012-04-22 00:43:14.434574 :97BB89FD:db_trace:ksxp.c@5699:ksxpwait(): [10401:11:1409] KSXP_SND_CALLBACK:  request 0x9ffffffffd2c13c0, status 30
2012-04-22 00:43:14.434576 :97BB89FE:db_trace:ksxp.c@5699:ksxpwait(): [10401:11:1409] KSXP_SND_CALLBACK:  request 0x9ffffffffd162260, status 30
2012-04-22 00:43:14.434614 :97BB89FF:db_trace:ksl2.c@13718:ksliwat(): [10005:11:1409] KSL POST RCVD poster=12 num=8304692 loc='kji.h LINE:3395 ID:kjxhap2ackh: wake up DIA0 waiting in kjiwtp2ack' id1=0 id2=0 name=   type=0 fac#=2 posted=0x1 may_be_posted=1
2012-04-22 00:43:16.635237 :97BB8B5F:db_trace:ksl2.c@13718:ksliwat(): [10005:11:1409] KSL POST RCVD poster=12 num=8304692 loc='kji.h LINE:3395 ID:kjxhap2ackh: wake up DIA0 waiting in kjiwtp2ack' id1=0 id2=0 name=   type=0 fac#=2 posted=0x1 may_be_posted=0
2012-04-22 00:43:16.742875 :97BB8B6D:db_trace:ksxp.c@4232:ksxpvsnd_with_bcb(): [10401:11:1409] KSXPVSND: client 5 tid(2,257,0x39dd81a1) buf 0x9ffffffffcdf4610 sz 240
2012-04-22 00:43:16.743078 :97BB8B6E:db_trace:ksxp.c@5699:ksxpwait(): [10401:11:1409] KSXP_SND_CALLBACK:  request 0x9ffffffffd2e3fa0, status 30
2012-04-22 00:43:18.565829 :97BB8D1E:db_trace:ksl2.c@13718:ksliwat(): [10005:11:1409] KSL POST RCVD poster=12 num=8304692 loc='kji.h LINE:3395 ID:kjxhap2ackh: wake up DIA0 waiting in kjiwtp2ack' id1=0 id2=0 name=   type=0 fac#=2 posted=0x1 may_be_posted=1
2012-04-22 00:43:18.566373 :97BB8D1F:db_trace:ksxp.c@4232:ksxpvsnd_with_bcb(): [10401:11:1409] KSXPVSND: client 5 tid(2,257,0x39dd81a1) buf 0x9ffffffffcdfe300 sz 92




LMHB:


*** 2012-04-22 00:41:58.970
LMON (ospid: 3209) has not moved for 23 sec (1335026497.1335026474)

*** 2012-04-22 00:42:18.943
kjfmGCR_HBCheckAll: LMON (ospid: 3209) has status 2

*** 2012-04-22 00:42:20.642
  : Not in wait; last wait ended 19 secs ago.

*** 2012-04-22 00:42:25.313
  : last wait_id 12988147 at 'CGS wait for IPC msg'.

*** 2012-04-22 00:42:46.302
  ==============================
  Dumping PROCESS LMON (ospid: 3209) States
  ==============================
  ===[ Callstack ]===

*** 2012-04-22 00:42:46.302
Process diagnostic dump for oracle@jzdb1 (LMON), OS id=3209,
pid: 11, proc_ser: 1, sid: 1409, sess_ser: 1
-------------------------------------------------------------------------------
os thread scheduling delay history: (sampling every 1.000000 secs)
  0.000000 secs at [ 00:42:04 ]
    NOTE: scheduling delay has not been sampled for 41.577089 secs  0.000000 secs from [ 00:42:04 - 00:42:46 ], 5 sec avg
  0.000000 secs from [ 00:41:46 - 00:42:46 ], 1 min avg
  0.000000 secs from [ 00:37:46 - 00:42:46 ], 5 min avg

*** 2012-04-22 00:42:50.255
loadavg : 0.12 0.12 0.11

*** 2012-04-22 00:42:55.538
Swapinfo :

*** 2012-04-22 00:42:57.019
        Avail = 94938.05Mb Used = 47463.07Mb
        Swap free = 47474.98Mb Kernel rsvd = 5114.71Mb
        Free Mem  = 457.23Mb
skgpgcmdout: read() for cmd /bin/ps -elf | /bin/egrep 'PID | 3209' | /bin/grep -v grep timed out after 2.500 seconds
Skipping stack dump because max dump time exceeded.

-------------------------------------------------------------------------------
Process diagnostic dump actual duration=11.058000 sec
  (max dump time=5.000000 sec)

*** 2012-04-22 00:42:57.360

*** 2012-04-22 00:42:57.360
==============================
LCK0 (ospid: 3261) has not moved for 99 sec (1335026576.1335026477)
kjfmGCR_HBCheckAll: LCK0 (ospid: 3261) has status 2
  : waiting for event 'rdbms ipc message' for 1 secs with wait_id 33556774.
  ===[ Wait Chain ]===
  Wait chain is empty.
kjgcr_Main: KJGCR_ACTION - id 5

*** 2012-04-22 00:43:02.364

*** 2012-04-22 00:43:03.226
==============================

*** 2012-04-22 00:43:05.485
LMON (ospid: 3209) has not moved for 107 sec (1335026581.1335026474)

*** 2012-04-22 00:43:17.023
kjfmGCR_HBCheckAll: LMON (ospid: 3209) has status 6

*** 2012-04-22 00:44:34.927
==================================================

*** 2012-04-22 00:44:44.528
=== LMON (ospid: 3209) Heartbeat Report

*** 2012-04-22 00:44:51.126
==================================================

*** 2012-04-22 00:45:02.011
LMON (ospid: 3209) has no heartbeats for 44 sec. (threshold 70 sec)

*** 2012-04-22 00:45:08.973
  : Not in wait; last wait ended 13 secs ago.

*** 2012-04-22 00:45:10.553
  : last wait_id 12988148 at 'rdbms ipc message'.
==============================
Dumping PROCESS LMON (ospid: 3209) States
==============================
===[ System Load State ]===
  CPU Total 16 Core 16 Socket 4
  Load normal: Cur 22 Highmark 20480 (0.08 80.00)

*** 2012-04-22 00:45:11.461
===[ Latch State ]===

*** 2012-04-22 00:45:16.908
  Not in Latch Get

*** 2012-04-22 00:45:18.279
===[ Session State Object ]===

*** 2012-04-22 00:45:22.217
  ----------------------------------------

*** 2012-04-22 00:45:22.929
  SO: 0xc0000007dcf54ba0, type: 4, owner: 0xc0000007f0a43d98, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
   proc=0xc0000007f0a43d98, name=session, file=ksu.h LINE:12459 ID:, pg=0
  (session) sid: 1409 ser: 1 trans: 0x0000000000000000, creator: 0xc0000007f0a43d98
            flags: (0x51) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
            flags2: (0x408) -/-
            DID: , short-term DID:
            txn branch: 0x0000000000000000
            oct: 0, prv: 0, sql: 0x0000000000000000, psql: 0x0000000000000000, user: 0/SYS
  ksuxds FALSE at location: 0
  service name: SYS$BACKGROUND
  Current Wait Stack:
    Not in wait; last wait ended 7.817365 sec ago
  Wait State:
    fixed_waits=0 flags=0x21 boundary=0x0000000000000000/-1
  Session Wait History:
      elapsed time of 7.817423 sec since last wait
   0: waited for 'CGS wait for IPC msg'
      =0x0, =0x0, =0x0
      wait_id=12988622 seq_num=12994 snap_id=1
      wait times: snap=0.000006 sec, exc=0.000006 sec, total=0.000006 sec
      wait times: max=0.000000 sec
      wait counts: calls=1 os=1
      occurred after 0.000023 sec of elapsed time
  1: waited for 'rdbms ipc message'
      timeout=0xa, =0x0, =0x0
      wait_id=12988621 seq_num=12993 snap_id=1
      wait times: snap=0.105385 sec, exc=0.105385 sec, total=0.105385 sec
      wait times: max=0.100000 sec
      wait counts: calls=1 os=1
      occurred after 0.002480 sec of elapsed time
   2: waited for 'CGS wait for IPC msg'
      =0x0, =0x0, =0x0
      wait_id=12988620 seq_num=12992 snap_id=1
      wait times: snap=0.000020 sec, exc=0.000020 sec, total=0.000020 sec
      wait times: max=0.000000 sec
      wait counts: calls=1 os=1
      occurred after 0.000105 sec of elapsed time
   3: waited for 'control file sequential read'
      file#=0x0, block#=0x23, blocks=0x1
      wait_id=12988619 seq_num=12991 snap_id=1
      wait times: snap=0.001546 sec, exc=0.001546 sec, total=0.001546 sec
      wait times: max=infinite
      wait counts: calls=0 os=0
      occurred after 0.000399 sec of elapsed time


   SO: 0xc0000007c3f84c50, type: 56, owner: 0xc0000007dcf54ba0, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
     proc=0xc0000007f0a43d98, name=dummy, file=ktccts.h LINE:412 ID:, pg=0
    (dummy) nxc=0, nlb=0   
===[ Callstack ]===

*** 2012-04-22 00:45:41.793
Process diagnostic dump for oracle@jzdb1 (LMON), OS id=3209,
pid: 11, proc_ser: 1, sid: 1409, sess_ser: 1
-------------------------------------------------------------------------------
os thread scheduling delay history: (sampling every 1.000000 secs)
  0.000000 secs at [ 00:45:40 ]
    NOTE: scheduling delay has not been sampled for 0.888046 secs  0.001928 secs from [ 00:45:39 - 00:45:41 ], 5 sec avg
  0.000068 secs from [ 00:44:42 - 00:45:41 ], 1 min avg
  0.000013 secs from [ 00:40:42 - 00:45:41 ], 5 min avg

*** 2012-04-22 00:46:04.974
loadavg : 0.08 0.11 0.11

*** 2012-04-22 00:46:05.901
Swapinfo :
        Avail = 94938.05Mb Used = 47448.73Mb
        Swap free = 47489.31Mb Kernel rsvd = 5112.50Mb
        Free Mem  = 468.61Mb
skgpgcmdout: read() for cmd /bin/ps -elf | /bin/egrep 'PID | 3209' | /bin/grep -v grep timed out after 2.500 seconds
Skipping stack dump because max dump time exceeded.

-------------------------------------------------------------------------------
Process diagnostic dump actual duration=24.108000 sec
  (max dump time=5.000000 sec)

回复 只看该作者 道具 举报

12#
发表于 2012-4-22 14:36:26
已上传的diag 的TRACE是否 完整的?

diag:

*** 2012-04-22 00:41:32.956
kjfmHBupdateTime: took 70 seconds for 2 heartbeats

就日志来看 到  00:41:32 结束, 理论上它可能生成systemstate dump 以便诊断


如以上回复,LMHB 对LMON进程做了heartbeat report
*** 2012-04-22 00:43:17.023
kjfmGCR_HBCheckAll: LMON (ospid: 3209) has status 6

*** 2012-04-22 00:44:34.927
==================================================

*** 2012-04-22 00:44:44.528
=== LMON (ospid: 3209) Heartbeat Report

*** 2012-04-22 00:44:51.126
==================================================

*** 2012-04-22 00:45:02.011
LMON (ospid: 3209) has no heartbeats for 44 sec. (threshold 70 sec)


LMON进程 长时间没有完成 heartbeats





ODM FINDING:

MOS上一个类似的案例:

Bug 13588833: INSTANCE CRASHED DUE TO EXHAUSTION OF MEMORY

Hdr: 13588833 11.2.0.2 RDBMS 11.2.0.2 RAC PRODID-5 PORTID-226
Abstract: INSTANCE CRASHED DUE TO EXHAUSTION OF MEMORY

PROBLEM:
--------
Exadata customer running 11.2.0.2 on 4-node RAC
Instance 1 was crashed as shown in alert log:

Fatal NI connect error 12170.

  VERSION INFORMATION:
        TNS for Linux: Version 11.2.0.2.0 - Production
        Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.2.0 -
Production
        TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.2.0 - Production
  Time: 06-JAN-2012 18:36:43
  Tracing not turned on.
  Tns error struct:
    ns main err code: 12535
   
TNS-12535: TNS:operation timed out
    ns secondary err code: 12606
    nt main err code: 0
    nt secondary err code: 0
    nt OS err code: 0
  Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=10.20.10.190)(PORT=46200))
Fri Jan 06 18:36:45 2012
LMD0 (ospid: 26135) has not called a wait for 101 secs.
Fri Jan 06 18:36:55 2012
LMS0 (ospid: 26137) has not called a wait for 109 secs.
Fri Jan 06 18:37:05 2012
LMS1 (ospid: 26141) has not called a wait for 119 secs.
Fri Jan 06 18:37:21 2012
Errors in file
/u01/app/oracle/admin/diag/rdbms/prd/prd1/trace/prd1_lmhb_26147.trc  
(incident=240467):
ORA-29770: global enqueue process LMON (OSID 26132) is hung for more than 70
seconds
Incident details in:
/u01/app/oracle/admin/diag/rdbms/prd/prd1/incident/incdir_240467/prd1_lmhb_261
47_i240467.trc
Fri Jan 06 18:37:31 2012
             ...
Errors in file
/u01/app/oracle/admin/diag/rdbms/prd/prd1/trace/prd1_lmhb_26147.trc  
(incident=256395):
ORA-29770: global enqueue process LMS1 (OSID 26141) is hung for more than 70
seconds
ERROR: Some process(s) is not making progress.
LMHB (ospid: 26147) is terminating the instance.
Please check LMHB trace file for more details.
Please also check the CPU load, I/O load and other system properties for
anomalous behavior
ERROR: Some process(s) is not making progress.
Fri Jan 06 18:39:16 2012
System state dump requested by (instance=1, osid=26147 (LMHB)),
summary=[abnormal instance termination].
LMHB (ospid: 26147): terminating the instance due to error 29770
System State dumped to trace file
/u01/app/oracle/admin/diag/rdbms/prd/prd1/trace/prd1_diag_26120.trc
Fri Jan 06 18:39:22 2012
Instance terminated by LMHB, pid = 26147
Fri Jan 06 18:39:34 2012
Starting ORACLE instance (normal)
*** Pages Information *****************
Huge Pages memory pool detected (total: 8194 free: 8194)
Huge Pages allocation failed (free: 8194 required: 25601)
Allocation will continue with default/smaller page size
**********************************************************

>> which appears due to memory exhaustion

DIAGNOSTIC ANALYSIS:
--------------------

>> prd1_lmhb_26147.trc showing increasingly higher and higher loadavg until
db crashed.  Note remaining memory of 475 Mb:

*** 18:37:08.193
loadavg : 26.91 13.77 9.47
Memory (Avail / Total) = 475.89M / 96532.20M
Swap (Avail / Total) = 22419.46M /  24575.99M
skgpgcmdout: read() for cmd /bin/ps -elf | /bin/egrep 'PID | 26141' |
/bin/grep -v grep timed out after 2.500 seconds


WORKAROUND:
-----------
N/A

RELATED BUGS:
-------------

REPRODUCIBILITY:
----------------
This is a one-time incident so far.  Hopefully none will recur.

TEST CASE:
----------

STACK TRACE:
------------
*** 18:39:15.101
==============================
LMS1 (ospid: 26141) has not moved for 250 sec (1325875154.1325874904)
DDE: Problem Key 'ORA-29770' was flood controlled (0x6) (incident: 256392)
ORA-29770: global enqueue process LMON (OSID 26132) is hung for more than 70
seconds
DDE: Problem Key 'ORA-29770' was flood controlled (0x6) (incident: 256393)
ORA-29770: global enqueue process LMD0 (OSID 26135) is hung for more than 70
seconds
DDE: Problem Key 'ORA-29770' was flood controlled (0x6) (incident: 256394)
ORA-29770: global enqueue process LMS0 (OSID 26137) is hung for more than 70
seconds
DDE: Problem Key 'ORA-29770' was flood controlled (0x6) (incident: 256395)
ORA-29770: global enqueue process LMS1 (OSID 26141) is hung for more than 70
seconds
kjfmGCR_HBdisambig: action=Inst-kill
kjgcr_Main: KJGCR_ACTION - id 1

*** 18:39:16.434
kjgcr_grouplock: did not get lock yet

*** 18:39:16.434
kjgcr_Main: Waited too long to acquire group lock (lock_retries=1). Taking
action(id=10)

*** 18:39:16.434
==============================
LMON (ospid: 26132) has not moved for 256 sec (1325875155.1325874899)

*** 18:39:16.434
==============================
LMD0 (ospid: 26135) has not moved for 254 sec (1325875155.1325874901)

*** 18:39:16.434
==============================
LMS0 (ospid: 26137) has not moved for 253 sec (1325875155.1325874902)

*** 18:39:16.434
==============================
LMS1 (ospid: 26141) has not moved for 251 sec (1325875155.1325874904)
kjzduptcctx: Notifying DIAG for crash event
----- Abridged Call Stack Trace -----
ksedsts()+461<-kjzdssdmp()+267<-kjzduptcctx()+232<-kjzdicrshnfy()+53<-ksuitm()
+1325<-kjgcr_KillInstance()+125<-kjgcr_Main()+3535<-kjfmlmhb_Main()+80<-ksbrdp
()+971<-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+266<-ssthrdmain
()+252<-main()+201<-__libc_start_main()+244
<-_start()+36
----- End of Abridged Call Stack Trace -----

回复 只看该作者 道具 举报

13#
发表于 2012-4-22 14:38:38
区别在于 MOS上的案例 生成了systemstate dump , 且最终由 LMHB进程要求  terminated 了instance。 而lz的案例则是长时间hang住。

日志显示当时的Load负载并不高,且有较多的空闲内存(400MB ),怀疑是 LMON进程hang住 导致了instance hang。

需要相关的stack call进一步 确认该问题。

回复 只看该作者 道具 举报

14#
发表于 2012-4-22 14:44:12

回复 13# 的帖子

diag是完整的。

回复 只看该作者 道具 举报

15#
发表于 2012-4-22 14:54:02

回复 14# 的帖子

感谢刘大的回复:
由于两次故障的发生的时间点是一样的,都是在00:30左右,而该主机在00:00分左右会用NBU发起一个备份到本地的/ora目录。于是通过NBU,看了下发现了如下报错信息:
Info bptm (pid=3064) waited for full buffer 40 times, delayed 81 times
Critical bptm (pid=3064) sts_close_handle failed: 2060017 system call failed
Error bptm (pid=3064) cannot write image to disk, media close failed with status 2060017
Info bptm (pid=3064) EXITING with status 84 <----------
Info bphdb (pid=29969) done. status: 84: media write error
/ora目录无法写入。于是重新检查了syslog,有如下报错信息:
r 15 00:43:25 jzdb1  above message repeats 2 times
Apr 15 00:43:25 jzdb1 vmunix: vxfs: WARNING: msgcnt 1 mesg 038: V-2-38: vx_dataioerr - /dev/vgora/lvora file system file data write error in dev/block 0/246071296
Apr 15 00:47:13 jzdb1 CLSD: The clock on host jzdb1 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.
Apr 15 00:51:08 jzdb1 SQLAnywhere(nb_jzdb1): Starting checkpoint of "NBAZDB" (NBAZDB.db) at Sun Apr 15 2012 00:51
Apr 15 00:51:08 jzdb1 SQLAnywhere(nb_jzdb1): Finished checkpoint of "NBAZDB" (NBAZDB.db) at Sun Apr 15 2012 00:51
Apr 15 00:52:21 jzdb1 vmunix: Synchronous Page I/O error occurred while paging to/from disk
Apr 15 00:52:21 jzdb1 vmunix: device 0x40020001, 1K block #268467968, 65536 bytes of 65536 bytes not paged, page addr =dead31.e008000b32340000,  bflag =45008e, bp = e0
000004cbca5380

Apr 22 00:22:19 jzdb1 vmunix: Synchronous Page I/O ...
Apr 22 00:22:20 jzdb1 vmunix: Page I/O error occurred while paging to/from disk
Apr 22 00:22:19 jzdb1 vmunix: Synchronous Page I/O error occurred while paging to/from disk

这里的device 0x40020001 就是文件系统/ora对应的lv /dev/vgora/lvora

回复 只看该作者 道具 举报

16#
发表于 2012-4-22 15:07:30

回复 15# 的帖子

device 0x40020001, 1K block #268467968, 65536 bytes of 65536 bytes not paged, page addr =dead31.e008000b32340000,  bflag =45008e, bp = e0
这个错误主机上不停的报。
用vgdisplay -v查看vg信息的时候发现如下问题:
vgchange: Warning: Couldn't attach to the volume group physical volume   /dev/disk/disk39

但是ioscan 发现 /dev/disk/disk39是正常的,rmsf后重新扫描也是正常。
将vgora,export后,重新import
jzdb2#[/]vgimport -N -s -m /tmp/hp/vgora.map /dev/vgora
vgimport: One or more physical volume(s) are missing from the configured volume group.
vgimport: Volume group "/dev/vgora" has been successfully created.
Warning: A backup of this volume group may not exist on this machine.
Please remember to take a backup using the vgcfgbackup command after activating the volume group
只导进了2块盘的信息,/dev/disk/disk39还是没有信息进来。
/dev/disk/disk39的vg信息好像全丢了。
把这块盘重建PV后,vgextend,又出现:
Kernel indicates 4 disks for "/dev/vgora"  /etc/lvmtab has 3 disks
vg认为新加的/dev/disk/disk39是另一块盘,估计这可能就是所谓的鬼盘。
重新踢掉/dev/disk/disk39,vgcfgrestore -n vgora  /dev/rdisk/disk39后,
重新vgexport、vgimport后,vg信息正常。

我觉得很可能是因为上面的故障导致主机HANG死,请刘大看看。

回复 只看该作者 道具 举报

17#
发表于 2012-4-22 15:31:42
the application process was killed with SIGKILL. At almost the same time, the following messages were left in the syslog.log file.
vmunix: Page I/O error occurred while paging to/from disk  vmunix: device 0x40000002, 1K block #271564, 0 bytes of vmunix: 4096 bytes not paged, page addr =7838400.bc6000, bflag =100000c, bp = 52b1d2c0 vmunix: Page I/O error occurred while paging to/from disk vmunix: device 0x40000002, 1K block #428904, 0 bytes of vmunix: 4096 bytes not paged, page addr =fd23800.df5000, bflag =100000c, bp = 4f225380 vmunix: Page I/O error occurred while paging to/from disk vmunix: device 0x40000002, 1K block #326132, 0 bytes of vmunix: 4096 bytes not paged, page addr =a32b400.767000, bflag =100000c, bp = 4fe56a80
SOLUTION:
The message logged in the syslog.log file indicate that the kernel detected some disk I/O error while saving/restoring the contents of memory pages to/from disk. The device 0x40000002 is a value of dev_t of the disk device. The upper eight bits make up a major number (0x40 or 64) that corresponds to LVM. The lower bits indicate VG# and LVOL# in the VG. So, this means that the target disk was VG00/LVOL2.
The size of the paging I/O requests were 4kbytes, but zero bytes were transferred.






对HP UX并不太了解, 就日志信息看 这是一个 因为换页I/O请求引起的 错误。

除了 该系统可能用了大量swap 空间之外,很难联想到由 一个I/O error 引发LMON hang住而长期无法 heartbeat

Swapinfo :
        Avail = 94938.05Mb Used = 47448.73Mb
        Swap free = 47489.31Mb Kernel rsvd = 5112.50Mb
        Free Mem  = 468.61Mb

建议你确认 该系统是否 使用了大量的 swap (used = 47448.73Mb) ,这些swap 是否又确实为oracle instance所用,如果是的 建议优化内存参数 将SGA lock在内存中。

回复 只看该作者 道具 举报

18#
发表于 2012-4-22 15:41:49

回复 17# 的帖子

日志中的swap 90G多,这是memory+swap的大小,swap实际是32G。
实际上主机不单单只是LMON hang住,而是整个OS被HANG住,早上我telnet主机无法登陆,但是PING的通,甚至通过主机MP地址登陆console界面都无法打开。
在节点二上面,通过crsctl stat res -t  看到的cluster进程两个节点都是online的。

回复 只看该作者 道具 举报

19#
发表于 2013-7-30 21:48:06
本帖最后由 北柏 于 2013-7-30 22:00 编辑
lixz_2008 发表于 2012-4-22 15:41
日志中的swap 90G多,这是memory+swap的大小,swap实际是32G。
实际上主机不单单只是LMON hang住,而是整个 ...


最后是怎么解决的?
我最经也处理了一个类似的问题,最终定位:aix 6 ;db 10205,os内存耗尽,重启节点后正常

回复 只看该作者 道具 举报

您需要登录后才可以回帖 登录 | 注册

QQ|手机版|Archiver|Oracle数据库数据恢复、性能优化

GMT+8, 2025-1-1 10:54 , Processed in 0.057496 second(s), 24 queries .

Powered by Discuz! X2.5

© 2001-2012 Comsenz Inc.

回顶部
TEL/電話+86 13764045638
Email service@parnassusdata.com
QQ 47079569