[SQL Tuning] WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK, AWR can't be generated automatically

2^#

Posted on 2013-1-11 10:37:51

awk report

Blockers
~~~~~~~~
Above is a list of all the processes. If they are waiting for a resource
then it will be given in square brackets. Below is a summary of the
waited upon resources, together with the holder of that resource.
Notes:
~~~~~
o A process id of '???' implies that the holder was not found in the
systemstate.
Resource Holder State
Latch 7000005ed7579f8 ??? Blocker
Enqueue US-00000017-00000000 ??? Blocker
Rcache object=7000003cd541d68, ??? Blocker
Enqueue TX-003D001E-002166BD 250: 250: is waiting for Latch 7000005ed7579f8
Enqueue TX-003B0015-001504A2 318: last wait for 'gc cr request'
Enqueue TX-0060002C-001BF791 148: 148: is waiting for Latch 7000005ed7579f8
Enqueue TX-003F0024-003879D9 177: 177: is waiting for Latch 7000005ed7579f8
Enqueue TX-005B0003-001F91AA 386: waiting for 'gc current request'
Object Names
~~~~~~~~~~~~
Latch 7000005ed7579f8 Child cache buffers chains
Enqueue US-00000017-00000000
Rcache object=7000003cd541d68,
Enqueue TX-003D001E-002166BD
Enqueue TX-003B0015-001504A2
Enqueue TX-0060002C-001BF791
Enqueue TX-003F0024-003879D9
Enqueue TX-005B0003-001F91AA
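
The enqueue names in the report above have the form TYPE-ID1-ID2, with both ids in hex. As a reading aid, here is a minimal Python sketch that splits them apart. It assumes the usual conventions: for TX, the high 16 bits of ID1 are the undo segment number and the low 16 bits are the transaction slot, with ID2 as the sequence (wrap) number; for US, ID1 is the undo segment number. The function name `decode_enqueue` is made up for illustration and is not part of any Oracle tool.

```python
def decode_enqueue(name):
    """Split an enqueue name such as 'TX-003B0015-001504A2' into its parts."""
    lock_type, id1_hex, id2_hex = name.split("-")
    id1, id2 = int(id1_hex, 16), int(id2_hex, 16)
    info = {"type": lock_type, "id1": id1, "id2": id2}
    if lock_type == "TX":
        # Assumed TX layout: id1 = (undo segment number << 16) | slot, id2 = sequence.
        info["usn"] = id1 >> 16
        info["slot"] = id1 & 0xFFFF
        info["sequence"] = id2
    elif lock_type == "US":
        # Assumed US layout: id1 = undo segment number.
        info["usn"] = id1
    return info


# Two of the resources listed in the report above.
for name in ("US-00000017-00000000", "TX-003B0015-001504A2"):
    print(name, decode_enqueue(name))
```

Under these assumptions, the undo segment number of each blocked TX enqueue can be compared with the undo segments that show up elsewhere in the dump.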


3^#

Posted on 2013-1-11 10:53:43

Blockers
~~~~~~~~

      Above is a list of all the processes. If they are waiting for a resource
      then it will be given in square brackets. Below is a summary of the
      waited upon resources, together with the holder of that resource.
      Notes:
      ~~~~~
      o A process id of '???' implies that the holder was not found in the
         systemstate. (The holder may have released the resource before we
         dumped the state object tree of the blocking process).
      o Lines with 'Enqueue conversion' below can be ignored *unless*
         other sessions are waiting on that resource too. For more, see
         http://dlsunuk11.uk.oracle.com/Public/TOOLS/Ass.html#enqcnv)

                  Resource Holder State
Enq HW-00000053-00800169 18: waiting for 'DFS lock handle'
Rcache object=70000060ee0d950, 149: waiting for 'gc current request'
         Buffer 0x0e03c37a 96: waiting for 'gc current request'
         Buffer 0x2a02f24d ??? Blocker
         Buffer 0x0f800002 18: waiting for 'DFS lock handle'
Enq TX-003B0015-001504A2 ??? Blocker
Enq TX-00370014-0009D402 237: waiting for 'gc current request'
         Buffer 0x13435b67 237: waiting for 'gc current request'

Blockers According to Tracefile Wait Info:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1. This may not work for 64bit platforms. See bug 2902997 for details.
2. If the blocking process is shown as 0 then that session may no longer be
present.
3. If resources are held across code layers then sometimes the tracefile wait
info will not recognise the problem.

No blockers seen.

Object Names
~~~~~~~~~~~~
Enq HW-00000053-00800169
Rcache object=70000060ee0d950,  cid=3(dc_rollback_segments)
Buffer 0x0e03c37a
Buffer 0x2a02f24d
Buffer 0x0f800002
Enq TX-003B0015-001504A2
Enq TX-00370014-0009D402
Buffer 0x13435b67

Summary of Wait Events Seen (count>10)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
14 : 'rdbms ipc message'
121 : 'SQL*Net message from client'
188 : 'row cache lock'
57 : 'gc buffer busy'
11 : 'gc current request'

Rcache object=70000060ee0d950,  cid=3(dc_rollback_segments)

      SO: 7000005a408b7a8, type: 50, owner: 7000005edaf6768, flag: INIT/-/-/0x00
      row cache enqueue: count=1 session=700000609773ed0 object=70000060ee0d950, request=S
      savepoint=0x59
      row cache parent object: address=70000060ee0d950 cid=3(dc_rollback_segments)
      hash=8e1019e8 typ=11 transaction=7000005e819b390 flags=0000012a
      own=70000060ee0da20[7000001160cf4f0,7000001160cf4f0] wat=70000060ee0da30[7000005a478b1a8,7000005a408ca98] mode=X
      status=VALID/UPDATE/-/-/IO/-/-/-/-
      request=N release=TRUE flags=0
      instance lock id=QD f8211c14 4289348c
      data=
      0000005d 00000053 0000003e 00000081 000a5f53 5953534d 55393324 00000000
      00000000 00000000 00000000 00000000 00020000 00000001 55b90000 000db814
      e6b004ae 0008ac50 00000002 00000001

dc_rollback_segments ; US


00000081 000a5f53 5953534d 55393324 00000000=>
_SYSSMU93$
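
The decoding above can be reproduced mechanically. The sketch below treats the dumped words as bytes in dump order (an assumption that matches how the name reads in this dump; on a little-endian host each word would probably need byte-swapping first) and pulls out runs of printable ASCII. `printable_runs` is a hypothetical helper name, not part of any Oracle tool.

```python
import re


def printable_runs(hex_words, min_len=4):
    """Return printable ASCII runs of at least min_len bytes found in a hex dump."""
    # Assumption: the words are big-endian and already in memory order.
    raw = bytes.fromhex("".join(hex_words.split()))
    pattern = re.compile(rb"[\x20-\x7e]{" + str(min_len).encode() + rb",}")
    return [m.group().decode("ascii") for m in pattern.finditer(raw)]


# The five words quoted above from the row cache parent object's data= section.
words = "00000081 000a5f53 5953534d 55393324 00000000"
print(printable_runs(words))  # ['_SYSSMU93$']
```

The 0x000a just before the name is 10, which matches the length of `_SYSSMU93$`.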

The owner of this row cache resource, 70000060ee0da20, no longer seems to exist.

Run this script: http://www.askmaclean.com/archiv ... nostic-scripts.html

4^#

Posted on 2013-1-11 11:03:37

Run this script as well:

REM tablespace report
set linesize 200
select a.tablespace_name,
round(a.bytes_alloc / 1024 / 1024) megs_alloc,
round(nvl(b.bytes_free, 0) / 1024 / 1024) megs_free,
round((a.bytes_alloc - nvl(b.bytes_free, 0)) / 1024 / 1024) megs_used,
round((nvl(b.bytes_free, 0) / a.bytes_alloc) * 100) Pct_Free,
100 - round((nvl(b.bytes_free, 0) / a.bytes_alloc) * 100) Pct_used,
round(maxbytes / 1048576) Max
from (select f.tablespace_name,
sum(f.bytes) bytes_alloc,
sum(decode(f.autoextensible, 'YES', f.maxbytes, 'NO', f.bytes)) maxbytes
from dba_data_files f
group by tablespace_name) a,
(select f.tablespace_name, sum(f.bytes) bytes_free
from dba_free_space f
group by tablespace_name) b
where a.tablespace_name = b.tablespace_name(+)
union all
select h.tablespace_name,
round(sum(h.bytes_free + h.bytes_used) / 1048576) megs_alloc,
round(sum((h.bytes_free + h.bytes_used) - nvl(p.bytes_used, 0)) /
1048576) megs_free,
round(sum(nvl(p.bytes_used, 0)) / 1048576) megs_used,
round((sum((h.bytes_free + h.bytes_used) - nvl(p.bytes_used, 0)) /
sum(h.bytes_used + h.bytes_free)) * 100) Pct_Free,
100 -
round((sum((h.bytes_free + h.bytes_used) - nvl(p.bytes_used, 0)) /
sum(h.bytes_used + h.bytes_free)) * 100) pct_used,
round(sum(f.maxbytes) / 1048576) max
from sys.v_$TEMP_SPACE_HEADER h,
sys.v_$Temp_extent_pool p,
dba_temp_files f
where p.file_id(+) = h.file_id
and p.tablespace_name(+) = h.tablespace_name
and f.file_id = h.file_id
and f.tablespace_name = h.tablespace_name
group by h.tablespace_name
ORDER BY 1
/


5^#

Posted on 2013-1-11 11:06:25

This script hangs as well:

ttitle center "Undo Extents" skip 2
col segment_name format a30 heading "Name"
col "ACT BYTES" format 999,999,999,999 head "Active|Extents"
col "UNEXP BYTES" format 999,999,999,999 head "Unexpired|Extents"
col "EXP BYTES" format 999,999,999,999 head "Expired|Extents"
select segment_name,
nvl(sum(act),0) "ACT BYTES",
nvl(sum(unexp),0) "UNEXP BYTES",
nvl(sum(exp),0) "EXP BYTES"
from (
select segment_name,
nvl(sum(bytes),0) act,00 unexp, 00 exp
from DBA_UNDO_EXTENTS
where status='ACTIVE' group by segment_name
union
select segment_name,
00 act, nvl(sum(bytes),0) unexp, 00 exp
from DBA_UNDO_EXTENTS
where status='UNEXPIRED' group by segment_name
union
select segment_name,
00 act, 00 unexp, nvl(sum(bytes),0) exp
from DBA_UNDO_EXTENTS
where status='EXPIRED' group by segment_name
) group by segment_name;


6^#

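Posted on 2013-1-11 12:43:00

Maclean Liu (刘相兵) posted on 2013-1-11 11:03
Run this script as well

All redo logs on node 1 are in ACTIVE status, and they stay ACTIVE even after adding a new logfile. It looks like LGWR is dead.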

7^#

Posted on 2013-1-11 13:53:22

fluttersnow posted on 2013-1-11 12:43
All redo logs on node 1 are in ACTIVE status, and they stay ACTIVE even after adding a new logfile. It looks like LGWR is dead.

alter system checkpoint;
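1. I suspect that some process died while holding dc_rollback_segments => _SYSSMU93$, and the row cache resource was never cleaned up.

2. A large number of transactions ran in a short period while undo space was insufficient, causing heavy rollback segment contention.

3. A bug.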

I suggest you first kill the server processes:

ps -ef|grep LOCAL=NO|grep -v grep | awk '{print $2}' | xargs kill -9

It would be best to have an AWR report for reference, along with the output of the scripts above.

8^#

Posted on 2013-1-12 21:33:06

Last edited by fluttersnow on 2013-1-12 21:36

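Maclean Liu (刘相兵) posted on 2013-1-11 13:53
alter system checkpoint;
1.
I suspect that some process died while holding dc_rollback_segment => _SYSSMU93$; row cach ...

After killing all connections on both nodes, node 1 still raised "ROW CACHE ENQUEUE LOCK" and users could not log in to node 1. Checking the other trc files again, I found that the holder appears to be a J001 process, and the same "holding" entry shows up in several trc files: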

PROCESS 318:
----------------------------------------
SO: 700000605389218, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=318, calls cur/top: 7000005dab89db8/70000031505d2c8, flag: (0) -
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 136
last post received-location: kclrcvt
last process to post me: 700000605373798 1 6
last post sent: 0 0 123
last post sent-location: kcrfw_redo_gen: wake LGWR after redo copy
last process posted by me: 7000006033a1c88 1 6
(latch info) wait_event=0 bits=2
holding (efd=55) 7000005ed7579f8 Child cache buffers chains level=1 child#=105349
Location from where latch is held: kcbgtcr: kslbegin excl:
Context saved from call: 1015236325
state=busy(exclusive) (val=0x200000000000013e) holder orapid = 318
waiters [orapid (seconds since: put on list, posted, alive check)]:
319 (107033, 1357866339, 1)
21 (99663, 1357866339, 1)
177 (87863, 1357866339, 1)
206 (87214, 1357866339, 1)
203 (86635, 1357866339, 1)
250 (85992, 1357866339, 1)
190 (83029, 1357866339, 1)
182 (79437, 1357866339, 1)
166 (75837, 1357866339, 1)
261 (74970, 1357866339, 1)
276 (74949, 1357866339, 1)
257 (74901, 1357866339, 1)
290 (74878, 1357866339, 1)
179 (73446, 1357866339, 1)
194 (72235, 1357866339, 1)
159 (68636, 1357866339, 1)
151 (65694, 1357866339, 1)
229 (65041, 1357866339, 1)
148 (64555, 1357866339, 1)
343 (61440, 1357866339, 1)
277 (58257, 1357866339, 1)
360 (57837, 1357866339, 1)
345 (54223, 1357866339, 1)
362 (50641, 1357866339, 1)
193 (47035, 1357866339, 1)
223 (43444, 1357866339, 1)
363 (40495, 1357866339, 1)
99 (39832, 1357866339, 1)
184 (39789, 1357866339, 1)
446 (36234, 1357866339, 1)
10 (35589, 1357866339, 1)
207 (21845, 1357866339, 1)
233 (18256, 1357866339, 1)
279 (14656, 1357866339, 0)
334 (11044, 1357866339, 1)
195 (7438, 1357866339, 1)
604 (178, 1357866339, 1)
waiter count=37
Process Group: DEFAULT, pseudo proc: 7000006073efb90
O/S info: user: oracle, term: UNKNOWN, ospid: 2130496
OSD pid info: Unix process pid: 2130496, image: oracle@oltp1 (J001)


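The SQL this process was running turned out to be a large number of UPDATE operations.
(One question here: J00x processes are job processes, but these operations were all fired by a trigger, so why are they running in a job process?)

After I manually killed this J001 process, the redo log files started changing to INACTIVE and log switches worked normally. Users could connect again and dba_data_files could be queried again. It took a whole morning yesterday, but it is finally resolved.

One more question:
For this kind of failure, in order to restore database service quickly, can I simply search the trc files for the keyword "holding" to find the blocking process? Is that reliable, or did I just get lucky this time?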

9^#

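Posted on 2013-1-12 23:04:17

If you can find the blocker, that is the best and simplest solution. Oracle may even request a KILL of the blocker automatically, provided the blocker is not a critical background process.

In most cases, however, a resource really is held but the holder cannot be found anywhere, or was killed long ago. In that situation, first collect a systemstate dump so the cause can be analyzed later together with other data and matched to a bug or another fix. After that, kill the server processes or bounce the instance to make the database available again.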
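
The "search for holding" approach discussed above can be sketched in Python roughly as follows. This is only an illustration under assumptions: it expects the `PROCESS n:`, `holding (efd=...)`, `waiter count=` and `OSD pid info:` lines to look like the excerpt posted earlier in this thread (other versions may format them differently), and names such as `find_latch_holders` and `LatchHolder` are hypothetical. If a process holds several latches, only the last one gets the OS pid filled in. As noted above, a holder that has already gone will not appear in the dump at all.

```python
import re
from dataclasses import dataclass

PROCESS_RE = re.compile(r"^\s*PROCESS (\d+):")
HOLDING_RE = re.compile(r"^\s*holding\s+\(efd=\d+\)\s+([0-9a-fA-F]+)\s+(.+?)\s+level=")
WAITERS_RE = re.compile(r"^\s*waiter count=(\d+)")
OSPID_RE = re.compile(r"Unix process pid: (\d+), image: (.+)$")


@dataclass
class LatchHolder:
    orapid: int
    latch_addr: str
    latch_name: str
    waiter_count: int = 0
    ospid: str = ""
    image: str = ""


def find_latch_holders(dump_text):
    """Return latch holders found in the dump text, most-waited-on first."""
    holders = []
    orapid = None   # Oracle pid of the PROCESS block being read
    current = None  # most recent holder seen inside that block
    for line in dump_text.splitlines():
        m = PROCESS_RE.match(line)
        if m:
            orapid, current = int(m.group(1)), None
            continue
        m = HOLDING_RE.match(line)
        if m and orapid is not None:
            current = LatchHolder(orapid, m.group(1), m.group(2))
            holders.append(current)
            continue
        if current is None:
            continue
        m = WAITERS_RE.match(line)
        if m:
            current.waiter_count = int(m.group(1))
            continue
        m = OSPID_RE.search(line)
        if m:
            current.ospid, current.image = m.group(1), m.group(2).strip()
    return sorted(holders, key=lambda h: h.waiter_count, reverse=True)


# Shortened copy of the PROCESS 318 block posted earlier in this thread.
SAMPLE = """PROCESS 318:
holding (efd=55) 7000005ed7579f8 Child cache buffers chains level=1 child#=105349
waiters [orapid (seconds since: put on list, posted, alive check)]:
319 (107033, 1357866339, 1)
waiter count=37
OSD pid info: Unix process pid: 2130496, image: oracle@oltp1 (J001)
"""

for h in find_latch_holders(SAMPLE):
    print(h.orapid, h.ospid, h.image, h.latch_name, h.waiter_count)
```

Sorting by waiter count puts the most contended holder first, but whether it is safe to kill still depends on what the process is, as described above.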

10^#

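Posted on 2013-1-13 20:42:43

Maclean Liu (刘相兵) posted on 2013-1-12 23:04
If you can find the blocker, that is the best and simplest solution; Oracle may even request a KILL of the blocker automatically, provided ...

Thanks for the reply. ML, one more question:
For the J001 process I found, looking it up by spid showed it was running only UPDATE operations, and those statements turned out to be fired by a trigger. Why would they execute in a job process?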

11^#

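Posted on 2013-1-21 15:20:02

fluttersnow posted on 2013-1-13 20:42
Thanks for the reply. ML, one more question:
For the J001 process I found, looking it up by spid showed it was running only UPDATE operations; those statements turned out to be ...

Answering my own question: I had the concept wrong. The J00x processes are slave processes of CJQ0, and any SQL related to a job is recorded in that session.
On investigation, the job executes a package that updates table A, and that update in turn fires the trigger, which updates table B, so both are recorded in the session's SQL.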