[RAC Performance Tuning] Node crash after "clsc_disc_orphans: protocol exchange timed out"

1^#

Posted 2013-10-12 00:02:36 | Views: 2673 | Replies: 1

The message "clsc_disc_orphans: protocol exchange timed out" appeared in ocssd.log, and the node crashed shortly afterward:

[ CSSD]2012-08-16 23:24:15.702 [3086] >TRACE: clsc_disc_orphans: protocol exchange timed out, time limit 120 sec, time since connect 121276 ms
[ CSSD]2012-09-09 23:24:25.620 [3086] >TRACE: clsc_disc_orphans: protocol exchange timed out, time limit 120 sec, time since connect 121399 ms
[ CSSD]2012-10-01 13:56:41.962 [3086] >TRACE: clsc_disc_orphans: protocol exchange timed out, time limit 120 sec, time since connect 121150 ms
[ CSSD]2012-10-11 11:05:40.298 [3086] >TRACE: clsc_disc_orphans: protocol exchange timed out, time limit 120 sec, time since connect 121321 ms
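
To check another ocssd.log for the same pattern, a rough Python sketch like the one below can pull out each occurrence and show by how much the exchange overran its limit. The regular expression is written against the trace lines quoted above only; the function and field names are made up for illustration, and other CRS versions may format the line differently.

```python
import re

# Pattern derived from the trace lines quoted above; group names are illustrative.
ORPHAN_RE = re.compile(
    r"\[\s*CSSD\](?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) .*?"
    r"clsc_disc_orphans: protocol exchange timed out, "
    r"time limit (?P<limit_sec>\d+) sec, "
    r"time since connect (?P<elapsed_ms>\d+) ms"
)


def parse_orphan_timeouts(lines):
    """Return one dict per matching trace line (hypothetical helper)."""
    events = []
    for line in lines:
        match = ORPHAN_RE.search(line)
        if match is None:
            continue
        limit_ms = int(match.group("limit_sec")) * 1000
        elapsed_ms = int(match.group("elapsed_ms"))
        events.append({
            "timestamp": match.group("ts"),
            "limit_ms": limit_ms,
            "elapsed_ms": elapsed_ms,
            "overrun_ms": elapsed_ms - limit_ms,
        })
    return events


# Two of the lines from the post, used as sample input.
sample = [
    "[ CSSD]2012-08-16 23:24:15.702 [3086] >TRACE: clsc_disc_orphans: protocol "
    "exchange timed out, time limit 120 sec, time since connect 121276 ms",
    "[ CSSD]2012-10-11 11:05:40.298 [3086] >TRACE: clsc_disc_orphans: protocol "
    "exchange timed out, time limit 120 sec, time since connect 121321 ms",
]

events = parse_orphan_timeouts(sample)
for event in events:
    print(event["timestamp"], "overran the limit by", event["overrun_ms"], "ms")
```

In all four lines quoted above, the elapsed time is only a little over one second past the 120-second limit, which is consistent with the timeout firing on the first check after the limit expires.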

Over the same period, node 2 also logged a large number of RACGVIP check timeouts:

2012-10-12 10:03:24.950: [  CRSEVT][11040]32CAAMonitorHandler :: 0:Action Script /u01/app/oracle/product/10.2.0/crs/bin/racgwrap(check) timed out for ora.pdcmdb04.vip! (timeout=60)
...............
2012-10-16 09:14:07.334: [  CRSEVT][11066]32CAAMonitorHandler :: 0:Action Script /u01/app/oracle/product/10.2.0/crs/bin/racgwrap(check) timed out for ora.pdcmdb04.vip! (timeout=60)
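
To see how often each resource hits this timeout, the CRSEVT lines can be tallied in the same way. This is only a sketch based on the two lines quoted above; the pattern and the helper name are assumptions, not an Oracle-provided format definition.

```python
import re
from collections import Counter

# Pattern derived from the CRSEVT lines quoted above; group names are illustrative.
ACTION_TIMEOUT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}): \[\s*CRSEVT\].*?"
    r"Action Script (?P<script>\S+)\((?P<action>\w+)\) timed out for "
    r"(?P<resource>\S+?)! \(timeout=(?P<timeout>\d+)\)"
)


def count_action_timeouts(lines):
    """Count timed-out action scripts per (resource, action) pair (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = ACTION_TIMEOUT_RE.search(line)
        if match:
            counts[(match.group("resource"), match.group("action"))] += 1
    return counts


# The two lines from the post, used as sample input.
sample_crsd = [
    "2012-10-12 10:03:24.950: [  CRSEVT][11040]32CAAMonitorHandler :: 0:Action Script "
    "/u01/app/oracle/product/10.2.0/crs/bin/racgwrap(check) timed out for "
    "ora.pdcmdb04.vip! (timeout=60)",
    "2012-10-16 09:14:07.334: [  CRSEVT][11066]32CAAMonitorHandler :: 0:Action Script "
    "/u01/app/oracle/product/10.2.0/crs/bin/racgwrap(check) timed out for "
    "ora.pdcmdb04.vip! (timeout=60)",
]

timeout_counts = count_action_timeouts(sample_crsd)
for (resource, action), count in timeout_counts.items():
    print(resource, action, count)
```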

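Analysis based on the logs and trace information currently available: the "clsc_disc_orphans" message is written to the CSSD log when the RAC GM client listener thread is processing clsc_disc_orphans. While clsc_disc attempts to disconnect a connection, this function is responsible for obtaining and holding thread information. A known bug (Bug 9132429: LNX64-10205-CRS:NODE CRASH AFTER 5 MINUTES OF HANG/RESUME OCSSD.BIN) can cause multiple sessions to deadlock, which eventually leaves the node hung or gets it evicted and crashed. The same bug may also cause the VIP to go OFFLINE unexpectedly.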

Because no core dump or call stack of the CSSD process from the time of the crash is available, it cannot be confirmed that Bug 9132429 is the root cause of the crash.

3. Recommendations

1.  Apply Patch 9132429

Patch 9132429 (LNX64-10205-CRS:NODE CRASH AFTER 5 MINUTES OF HANG/RESUME OCSSD.BIN) for 10.2.0.4 is currently waiting to be built by the development team. We recommend applying it once it becomes available.


Maclean Liu (刘相兵)

2^#

Posted 2013-10-12 00:08:13

The problem here is that while the GM client listener thread is processing clsc_disc_orphans(), the function obtains the ugblm mutex and holds it while descending into clsc_disc() to attempt to disconnect the connection.

clsc_disc() eventually triggers clscidisc() to disconnect the internal connection. Here clscidisc() tries to grab the ugblm mutex again, thereby causing the thread to deadlock on itself.

Since the ugblm mutex is never released, when the NM sending thread eventually makes an attempt at the same mutex, it is also blocked waiting for the mutex, causing the issue that this bug observes.

This is a CLSC issue. The fix will involve preventing clsc_disc_orphans() from deadlocking on itself with the ugblm mutex.
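
The self-deadlock described above can be reproduced in miniature with a non-reentrant lock. The Python sketch below is only an analogy: the function names mirror the roles of clsc_disc_orphans() and clscidisc(), a timeout is used so the demo returns instead of hanging, and the reentrant lock is just one generic way to avoid this class of problem. Nothing here is taken from the CLSC source or from the actual fix in the patch.

```python
import threading


def inner_disconnect(lock, timeout):
    """Plays the role of clscidisc(): tries to take the same mutex again."""
    acquired = lock.acquire(timeout=timeout)
    if acquired:
        lock.release()
    return acquired


def disc_orphans(lock, timeout=0.2):
    """Plays the role of clsc_disc_orphans(): holds the mutex while calling down."""
    with lock:
        return inner_disconnect(lock, timeout)


# threading.Lock is non-reentrant: the second acquire by the same thread can
# never succeed, so without the timeout this call would block forever.
plain_result = disc_orphans(threading.Lock())

# threading.RLock lets the owning thread acquire it again, so the nested
# call succeeds and the lock is released normally afterwards.
reentrant_result = disc_orphans(threading.RLock())

print("non-reentrant lock, nested acquire succeeded:", plain_result)
print("reentrant lock, nested acquire succeeded:", reentrant_result)
```

With the plain lock the nested acquire fails, and in the real case (no timeout) the mutex would stay held, so any other thread that later asks for it, such as the NM sending thread, blocks as well.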
