teapot

CF enqueue causing a RAC node crash

1^#

Posted 2012-3-16 11:59:46 | Views: 7002 | Replies: 3

Oracle has quite a few bugs around holding the CF enqueue, but can anything be learned from the CF lock dump itself?

*** 2012-03-15 10:26:56.128
DUMP LOCAL BLOCKER/HOLDER: block level 4 res [0x0][0x0],[CF]
----------resource 0x51d9c1890----------------------
resname    : [0x0][0x0],[CF]
Local node : 1
dir_node    : 1
master_node : 1
hv idx       : 95
hv last r.inc : 30
current inc : 32
hv status    : 0
hv master    : 1
open options  : dd
grant_bits : KJUSERNL KJUSERCR KJUSERPR
grant mode : KJUSERNL  KJUSERCR  KJUSERCW  KJUSERPR  KJUSERPW  KJUSEREX
count       : 1       1       0       2       0       0
val_state    : KJUSERVS_VALUE
valblk       : 0x00093f03000000000000000000000000 .?
access_node : 1
vbreq_state : 0
state       : x8
resp       : 51d9c1890
On Scan_q? : N
Total accesses: 591427
Imm.  accesses: 576115
Granted_locks : 3
Cvting_locks  : 1
value_block:  00 09 3f 03 00 00 00 00 00 00 00 00 00 00 00 00
GRANTED_Q :
lp 7413cff60 gl KJUSERPR rp 51d9c1890 [0x0][0x0],[CF]
  master 1 gl owner 741146e18 possible pid 21683 xid 2003-0037-00000018 bast 0 rseq 2 mseq 0 history 0x14951495
  open opt KJUSERDEADLOCK  KJUSERIDLK
lp 74130a848 gl KJUSERCR rp 51d9c1890 [0x0][0x0],[CF]
  master 1 gl owner 74114fee8 possible pid 20992 xid 0002-0023-0000000C bast 0 rseq 2 mseq 0 history 0x9a5
  open opt KJUSERDEADLOCK  KJUSERIDLK
lp 8001fb660 gl KJUSERPR rp 51d9c1890 [0x0][0x0],[CF]
  master 1 owner 0  bast 0 rseq 2 mseq 0x98dd98de history 0x7d7d7d8d
  open opt  KJUSERNO_XID
CONVERT_Q:
lp 74127ef18 gl KJUSERNL rl KJUSERPW rp 51d9c1890 [0x0][0x0],[CF]
  master 1 gl owner 741151398 possible pid 20990 xid 2001-001F-00000006 bast 0 rseq 2 mseq 0 history 0x1495149a
  convert opt KJUSERGETVALUE KJUSERNODEADLOCKWAIT
----------enqueue 0x7413cff60------------------------
lock version    : 793083
Owner node    : 1
grant_level    : KJUSERPR
req_level       : KJUSERPR
bast_level    : KJUSERNL
notify_func    : 0
resp          : 51d9c1890
procp          : 53643e4a0
pid             : 21683
proc version    : 0
oprocp          : 0
opid          : 0
group lock owner : 741146e18
possible pid    : 21683
xid             : 2003-0037-00000018
dd_time       : 0.0 secs
dd_count       : 0
timeout       : 0.0 secs
On_timer_q?    : N
On_dd_q?       : N
lock_state    : GRANTED
Open Options    : KJUSERDEADLOCK  KJUSERIDLK
Convert options  : KJUSERGETVALUE KJUSERNODEADLOCKWAIT
History       : 0x14951495
Msg_Seq       : 0x0
res_seq       : 2
valblk          : 0x00093f03000000000000000000000000 .?
Potential blocker (pid=21683) on resource CF-00000000-00000000
DUMP LOCAL BLOCKER: initiate state dump for TIMEOUT
  possible owner[55.21683]
Submitting asynchronized dump request [28]
----------enqueue 0x8001fb660------------------------
lock version    : 3
Owner node    : 0
grant_level    : KJUSERPR
req_level       : KJUSERPR
bast_level    : KJUSERPW
notify_func    : 0
resp          : 51d9c1890
procp          : 53645ffa0
pid             : 0
proc version    : 0
oprocp          : 0
opid          : 0
group lock owner : 0
xid             : 0000-0000-00000000
dd_time       : 0.0 secs
dd_count       : 0
timeout       : 0.0 secs
On_timer_q?    : N
On_dd_q?       : N
lock_state    : GRANTED
Open Options    :  KJUSERNO_XID
Convert options  : KJUSERGETVALUE
History       : 0x7d7d7d8d
Msg_Seq       : 0x98dd98de
res_seq       : 2
valblk          : 0x00000000000000000000000000000000 .
KCC: controlfile sequence number from SGA =  605955
*** 2012-03-15 10:41:57.248
lmd abort after exception 470
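
A quick check of the two timestamps in the trace above (the blocker dump and the lmd abort). This is a minimal Python sketch, assuming both timestamps use the format shown in the trace; `elapsed_seconds` is an illustrative helper name, not part of any Oracle tool.

```python
from datetime import datetime

# Timestamp layout used by the "*** 2012-03-15 10:26:56.128" trace lines.
FMT = "%Y-%m-%d %H:%M:%S.%f"


def elapsed_seconds(start, end):
    """Seconds between two trace timestamps given as strings."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


# Blocker dump time and "lmd abort" time, copied from the trace.
gap = elapsed_seconds("2012-03-15 10:26:56.128", "2012-03-15 10:41:57.248")
print(f"{gap:.2f} seconds between blocker dump and lmd abort")
```

The gap is roughly 901 seconds, close to the 900-second CF enqueue timeout quoted with ORA-00494 later in this thread. The trace alone does not prove the two are linked.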

teapot

4^#

Posted 2012-3-20 21:50:37

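No need. The main sequence was: the archiver process held the CF enqueue, timed out, was killed, and the node aborted.
The archiver process's last wait event was "KSV master wait".
ASM logged nothing at all; it may have been hung.

The document above only covers setting the parameter so that the process holding the CF enqueue is not killed. As for why the archiver timed out while holding the CF enqueue, the cause may be that ASM hung, but that is only a possibility, because ASM has no related logs.

In the diag logs, the archiver process's stack trace matches the stack output in
HP Itanium - ORA-240 or process on ASM & Database hang [ID 1105825.1].
The failing platform is Solaris, so that note can only serve as a reference.

The final conclusion: under heavy load, do not put archived logs in ASM. This problem is not fixed in 10.2.0.5 either.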
Bug 12421376: DATABASE HANG THEN CRASH WITH "KSV MASTER WAIT" MESSAGE IN ARCH TRACEFILES

Maclean Liu (刘相兵)

3^#

Posted 2012-3-19 20:48:10

Please upload the complete trace file.

ODM FINDING:

CF: Controlfile Transaction Control file schema global enqueue

ORA-00494: enqueue [CF] held for too long (more than 900 seconds)

This error can also be accompanied by an ORA-600 [2103] error, which is basically the same problem: a process was unable to obtain the CF enqueue within the specified timeout (default 900 seconds, i.e. 15 minutes).

In other words, the Oracle background process could not access the instance's control file within 15 minutes, hence the ORA-00494 error and the instance crash. This is by design.

This behavior can be correlated with server high load and high concurrency on resources, IO waits and contention, which keep the Oracle background processes from receiving the necessary resources.

Cause
=======

The problem has been investigated first in Bug 7692631 - 'DATABASE CRASHES WITH ORA-494 AFTER UPGRADE TO 10.2.0.4' and then in unpublished Bug 7914003 'KILL BLOCKER AFTER ORA-494 LEADS TO FATAL BG PROCESS BEING KILLED'

Solution
========

This kill blocker interface / ora-494 was introduced in 10.2.0.4. This new mechanism will kill *any* kind of blocking process, non-background or background.

* The difference will be that if the enqueue holder is a non-background process, even if it is killed, the instance can function without it.
* In case the holder is a background process, for example the LGWR, the kill of the holder leads to instance crash.

If you want to avoid the kill of the blocker (background or non-background process) you can set _kill_controlfile_enqueue_blocker=false.

This means that no type of blocker will be killed anymore, although the resolution to this problem should focus on why the process is holding the enqueue for so long. Also, you may prefer to only avoid killing background processes, since they are vital to the instance, and you may want to allow the killing of non-background blockers.

This has been addressed in a secondary bug - unpublished Bug 7914003 'KILL BLOCKER AFTER ORA-494 LEADS TO FATAL BG PROCESS BEING KILLED' which was closed as Not a bug.
In order to prevent a background blocker from being killed, you can set the following init.ora parameter to 1 (default is 3).

_kill_enqueue_blocker=1

With this parameter, if the enqueue holder is a background process, then it will not be killed, therefore the instance will not crash. If the enqueue holder is not a background process, the new 10.2.0.4 mechanism will still try to kill it.

The blocker interface with ORA-494 is kept because, in most cases, customers would prefer crashing the instance to having a cluster-wide hang.
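
The parameter behavior described above can be summarized as a small decision table. The following Python sketch only models the settings this note actually describes (`_kill_enqueue_blocker` values 1 and 3, and `_kill_controlfile_enqueue_blocker=false`). The function names are hypothetical, and other parameter values are deliberately left unmodeled rather than guessed.

```python
def kill_attempted(is_background, kill_enqueue_blocker=3,
                   kill_controlfile_enqueue_blocker=True):
    """Does the instance try to kill the CF enqueue blocker after ORA-494?

    Models only the settings described in the note above.
    """
    if not kill_controlfile_enqueue_blocker:
        return False                 # no blocker of any kind is killed
    if kill_enqueue_blocker == 3:
        return True                  # default: any blocker may be killed
    if kill_enqueue_blocker == 1:
        return not is_background     # background blockers are spared
    raise ValueError("value not described in the note; not modeled here")


def instance_crash_expected(is_background, **settings):
    """Per the note, killing a background holder (for example LGWR)
    leads to an instance crash; killing a non-background holder does not."""
    return is_background and kill_attempted(is_background, **settings)


# Background blocker with default settings: killed, instance crashes.
print(instance_crash_expected(True))
# Same blocker with _kill_enqueue_blocker=1: spared, no crash from the kill.
print(instance_crash_expected(True, kill_enqueue_blocker=1))
```

Note that sparing the blocker does not remove the underlying wait. As the note says, the real fix is finding out why the enqueue is held so long.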

teapot

2^#

Posted 2012-3-16 13:00:40

Worked it out myself; it's fine now.
*** 2012-03-15 10:26:56.128
DUMP LOCAL BLOCKER/HOLDER: block level 4 res [0x0][0x0],[CF]
----------resource 0x51d9c1890----------------------
resname    : [0x0][0x0],[CF]
Local node : 1
dir_node    : 1
master_node : 1
hv idx       : 95
hv last r.inc : 30
current inc : 32
hv status    : 0
hv master    : 1
open options  : dd
grant_bits : KJUSERNL KJUSERCR KJUSERPR
grant mode : KJUSERNL  KJUSERCR  KJUSERCW  KJUSERPR  KJUSERPW  KJUSEREX
count       : 1       1       0       2       0       0
val_state    : KJUSERVS_VALUE
valblk       : 0x00093f03000000000000000000000000 .?
access_node : 1
vbreq_state : 0
state       : x8
resp       : 51d9c1890
On Scan_q? : N
Total accesses: 591427
Imm.  accesses: 576115
Granted_locks : 3
Cvting_locks  : 1
value_block:  00 09 3f 03 00 00 00 00 00 00 00 00 00 00 00 00
GRANTED_Q : <- queue of locks currently holding the CF enqueue
lp 7413cff60 gl KJUSERPR rp 51d9c1890 [0x0][0x0],[CF]
  master 1 gl owner 741146e18 possible pid 21683 xid 2003-0037-00000018 bast 0 rseq 2 mseq 0 history 0x14951495
  open opt KJUSERDEADLOCK  KJUSERIDLK
lp 74130a848 gl KJUSERCR rp 51d9c1890 [0x0][0x0],[CF]
  master 1 gl owner 74114fee8 possible pid 20992 xid 0002-0023-0000000C bast 0 rseq 2 mseq 0 history 0x9a5 <-CKPT
  open opt KJUSERDEADLOCK  KJUSERIDLK
lp 8001fb660 gl KJUSERPR rp 51d9c1890 [0x0][0x0],[CF]
  master 1 owner 0  bast 0 rseq 2 mseq 0x98dd98de history 0x7d7d7d8d
  open opt  KJUSERNO_XID
CONVERT_Q: <- convert queue
lp 74127ef18 gl KJUSERNL rl KJUSERPW rp 51d9c1890 [0x0][0x0],[CF]
  master 1 gl owner 741151398 possible pid 20990 xid 2001-001F-00000006 bast 0 rseq 2 mseq 0 history 0x1495149a
  convert opt KJUSERGETVALUE KJUSERNODEADLOCKWAIT    <- this is LGWR
----------enqueue 0x7413cff60------------------------
lock version    : 793083
Owner node    : 1
grant_level    : KJUSERPR
req_level       : KJUSERPR
bast_level    : KJUSERNL
notify_func    : 0
resp          : 51d9c1890
procp          : 53643e4a0
pid             : 21683
proc version    : 0
oprocp          : 0
opid          : 0
group lock owner : 741146e18
possible pid    : 21683
xid             : 2003-0037-00000018
dd_time       : 0.0 secs
dd_count       : 0
timeout       : 0.0 secs
On_timer_q?    : N
On_dd_q?       : N
lock_state    : GRANTED
Open Options    : KJUSERDEADLOCK  KJUSERIDLK
Convert options  : KJUSERGETVALUE KJUSERNODEADLOCKWAIT
History       : 0x14951495
Msg_Seq       : 0x0
res_seq       : 2
valblk          : 0x00093f03000000000000000000000000 .?
Potential blocker (pid=21683) on resource CF-00000000-00000000
DUMP LOCAL BLOCKER: initiate state dump for TIMEOUT
  possible owner[55.21683]
Submitting asynchronized dump request [28]
----------enqueue 0x8001fb660------------------------
lock version    : 3
Owner node    : 0
grant_level    : KJUSERPR
req_level       : KJUSERPR
bast_level    : KJUSERPW
notify_func    : 0
resp          : 51d9c1890
procp          : 53645ffa0
pid             : 0
proc version    : 0
oprocp          : 0
opid          : 0
group lock owner : 0
xid             : 0000-0000-00000000
dd_time       : 0.0 secs
dd_count       : 0
timeout       : 0.0 secs
On_timer_q?    : N
On_dd_q?       : N
lock_state    : GRANTED
Open Options    :  KJUSERNO_XID
Convert options  : KJUSERGETVALUE
History       : 0x7d7d7d8d
Msg_Seq       : 0x98dd98de
res_seq       : 2
valblk          : 0x00000000000000000000000000000000 .
KCC: controlfile sequence number from SGA =  605955
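
For anyone reading a dump like this, the holders and waiters can be pulled out of the two queues with a short script. This is a minimal Python sketch, assuming the dump keeps the line layout shown above; `parse_queues` and the regular expressions are my own illustration, not Oracle tooling.

```python
import re

# Queue entry header, e.g.
#   lp 74127ef18 gl KJUSERNL rl KJUSERPW rp 51d9c1890 [0x0][0x0],[CF]
LOCK_RE = re.compile(r"^lp (\w+) gl (\w+)(?: rl (\w+))? rp (\w+)")
PID_RE = re.compile(r"possible pid (\d+)")

SAMPLE = """\
GRANTED_Q :
lp 7413cff60 gl KJUSERPR rp 51d9c1890 [0x0][0x0],[CF]
  master 1 gl owner 741146e18 possible pid 21683 xid 2003-0037-00000018 bast 0 rseq 2 mseq 0 history 0x14951495
lp 74130a848 gl KJUSERCR rp 51d9c1890 [0x0][0x0],[CF]
  master 1 gl owner 74114fee8 possible pid 20992 xid 0002-0023-0000000C bast 0 rseq 2 mseq 0 history 0x9a5
lp 8001fb660 gl KJUSERPR rp 51d9c1890 [0x0][0x0],[CF]
  master 1 owner 0  bast 0 rseq 2 mseq 0x98dd98de history 0x7d7d7d8d
CONVERT_Q:
lp 74127ef18 gl KJUSERNL rl KJUSERPW rp 51d9c1890 [0x0][0x0],[CF]
  master 1 gl owner 741151398 possible pid 20990 xid 2001-001F-00000006 bast 0 rseq 2 mseq 0 history 0x1495149a
----------enqueue 0x7413cff60------------------------
"""


def parse_queues(dump_text):
    """Return {'GRANTED_Q': [...], 'CONVERT_Q': [...]} from a resource dump."""
    queues = {"GRANTED_Q": [], "CONVERT_Q": []}
    current = None  # queue currently being read
    entry = None    # latest lock entry, so the next line's pid can attach
    for raw in dump_text.splitlines():
        line = raw.strip()
        if line.startswith("GRANTED_Q"):
            current, entry = "GRANTED_Q", None
        elif line.startswith("CONVERT_Q"):
            current, entry = "CONVERT_Q", None
        elif line.startswith("-----"):
            current, entry = None, None  # a resource/enqueue section starts
        elif current:
            m = LOCK_RE.match(line)
            if m:
                entry = {"lp": m.group(1), "granted": m.group(2),
                         "requested": m.group(3), "pid": None}
                queues[current].append(entry)
            elif entry is not None:
                p = PID_RE.search(line)
                if p:
                    entry["pid"] = int(p.group(1))
    return queues


queues = parse_queues(SAMPLE)
for name, entries in queues.items():
    for e in entries:
        print(name, e["pid"], e["granted"], "->", e["requested"])
```

On this dump it shows pid 21683 holding KJUSERPR (the "Potential blocker"), pid 20992 (CKPT) holding KJUSERCR, and pid 20990 (LGWR) waiting to convert from KJUSERNL to KJUSERPW. The third granted lock has no pid in the dump, so the sketch leaves it as `None`.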
