数据库hang latch free 等待事件一例

ALLSTARS_ORACLE 发表于 2014-1-5 21:35:32

数据库hang latch free 等待事件一例

数据库出现了不可用状态，即任何应用均无响应，SQL PLUS 也不能连接到数据库的情况。数据库已经是hang住了。在紧急情况下，用户工程师关闭了监听器。所有的应用连接到另外的节点上。
然后，用户工程师收集了SYSTEM DUMP和hang analyze跟踪文件。

ORACLE工程师来到客户现场后，仔细的分析了数据库的告警日志，监听器日志和后台跟踪文件，均无任何错误信息。紧接着，获取了这段时间的statspack信息：

            Snap Id    Snap Time    Sessions Curs/Sess Comment
         --------- ------------------ -------- --------- -------------------
Begin Snap:    57712 10-Jan-11 16:00:01 6,900    12.3
  End Snap:    57714 10-Jan-11 16:25:46 5,533    11.6
Elapsed:             25.75 (mins)

Top 5 Timed Events
~~~~~~~~~~~~~~~~~~                                                    % Total
Event                                              Waits Time (s) Ela Time
-------------------------------------------- ------------ ----------- --------
latch free                                     6,791,132 1,904,330 97.12
db file sequential read                      9,601,945    31,307    1.60
CPU time                                                    10,093    .51
enqueue                                        145,075    6,184    .32
global cache cr request                      6,738,227    2,165    .11


我们可以看到latch free事件达到了97%，而另外的节点，在16点到17点之间的等待事件为：

Top 5 Timed Events
~~~~~~~~~~~~~~~~~~                                                    % Total
Event                                              Waits Time (s) Ela Time
-------------------------------------------- ------------ ----------- --------
db file sequential read                      31,130,774    120,170 40.67
enqueue                                        262,096    113,199 38.31
CPU time                                                    37,105 12.56
global cache cr request                      20,352,441    5,839    1.98
db file scattered read                         1,919,755    3,860    1.31

从statspack报告我们基本上可以定位数据库hang住是由于大量的latch等待导致的。
也就是说latch free事件不应该出现在top 5 等待事件中。这个不是数据库故障，而是数据库出现严重的性能问题。
大量的会话中请求latch均不能获取。然而在用户关闭监听器之后，数据库恢复正常，下面为之后的STASPACK报告：

            Snap Id    Snap Time    Sessions Curs/Sess Comment
         --------- ------------------ -------- --------- -------------------
Begin Snap:    57714 10-Jan-11 16:25:46 5,533    11.6
  End Snap:    57725 10-Jan-11 17:00:01 3,046    12.9
Elapsed:             34.25 (mins)

Top 5 Timed Events
~~~~~~~~~~~~~~~~~~                                                    % Total
Event                                              Waits Time (s) Ela Time
-------------------------------------------- ------------ ----------- --------
db file sequential read                      9,651,975    34,125 69.32
CPU time                                                    10,545 21.42
global cache cr request                      5,969,261    1,724    3.50
db file scattered read                         379,574       866    1.76
global cache null to x                         116,936       537    1.09

问题分析

出现问题时，用户工程师无法使用SQLPLUS连接到数据库。因此只能等到关闭监听器之后，使用SQLPLUS 收集了SYSTEM DUMP和hang analyze，在用户收集的hang analyze 跟踪文件里面，很遗憾，我们没有发现latch free的会话。这是因为没有及时捕捉到事故发生时的系统状态。目前已经提交了SR，GCS工程师正在对问题进行进一步的处理，分析SYSTEM DUMP 确认bug的可能性(SR 3-2724367311: 9i rac database hang with latch free)

下面我们进一步的分析statspack关于latch的部分。
从statspack中，我们可以看到latch 大部分发生在shared pool的library cache中：

library cache          kqlmbpil: parent                0  5,189,058 2,093
library cache          kglic                         0  1,211,580    70
library cache          kglpndl: child: before pro    0    4,240  106,832
library cache          kglpnc: child                   0    3,005 ########
library cache          kglpin: child: heap proces    0    2,375  783,955
library cache          kgllkdl: child: cleanup       0    2,276 26,380
library cache          kglupc: child                   0    1,604 31,898
library cache          kgldte: child 0                0       881  624,139
library cache          kglhdgn: child:                0       528 ########
library cache          kglhdgc: child:                0       339    719
library cache          kglobpn: child:                0       190 49,889
library cache          kglpndl: child: after proc    0       101    23
library cache          kgldti: 2child                0       53    12
library cache          kglpin                         0       31    13
library cache          kgllkdl: child: multiinsta    0       10       1
library cache          kglhfr: child                   0       5       0
library cache          kglpnp: child                   0       4 7,965
library cache          kglget: child: KGLDSBRD       0       3 10,372
library cache          kglati                         0       1       9
library cache          kglobld                         0       1    43

而正常时间的latch 情况

我们可以看到kqlmbpil: parent 仅仅是33次休眠。故障时刻却达到了500万次(5,189,058)。

library cache          kglpnc: child                   0    10,292 12,282
library cache          kglpin: child: heap proces    0    10,194 6,233
library cache          kglpndl: child: before pro    0    7,341 2,670
library cache          kgllkdl: child: cleanup       0    2,287 3,107
library cache          kglhdgn: child:                0    1,946 3,109
library cache          kglhdgc: child:                0    1,427 3,512
library cache          kglupc: child                   0    1,025 3,027
library cache          kglobpn: child:                0       367    631
library cache          kglpndl: child: after proc    0       98    97
library cache          kgldte: child 0                0       66    563
library cache          kglic                         0       59    30
library cache          kqlmbpil: parent                0       33       0
library cache          kglhfr: child                   0       13       0
library cache          kgldti: 2child                0       4    20
library cache          kglidp: child                   0       3       0
library cache          kglget: child: KGLDSBRD       0       2    22
library cache          kgllkdl: child: multiinsta    0       2       2
library cache          kglhdiv: child                0       2       0
library cache          kglpin                         0       1    17
library cache          kglpnp: child                   0       1    39

Latch 是一种oracle用来保护内存中（SGA）对象的闩锁。它采用一种串行和抢占方式进行获取，详细的描述中文档What are Latches and What Causes Latch Contention (Doc ID 22908.1)。下面详细说明下latch free等待事件导致数据库hang的常见情况：

shared pool SQL语句在执行之前，必须要编译和解析，编译解析之前需要先查看共享池是否存在已经解析后的语句，如果存在就会使用以及编译好的SQL 语句的执行计划去执行这个语句。如果没有，就会重新编译生成执行计划，并且留到共享池当中。编译的时候，oracle会去查询相关的字典信息，例如这个SQL语句引用了哪些对象，某个表的哪些字段等。这些信息都必须在内存中（即share pool）中被保护起来。LATCH即这样的一种保护机制。当大量的SQL语句需要执行时，由于latch的串行机制，没有获取到latch的会话只能在等待或者是空转状态，对于应用来说，数据库出现了严重的性能问题，感觉就像HANG住一样。


另外，我们还可以看到这个节点的会话比较多：
         Snap Id    Snap Time    Sessions Curs/Sess Comment
         --------- ------------------ -------- --------- -------------------
Begin Snap:    57711 10-Jan-11 15:00:01 6,977    12.1
  End Snap:    57712 10-Jan-11 16:00:01 6,900    12.3
Elapsed:             60.00 (mins)

而另外的节点：
         Snap Id    Snap Time    Sessions Curs/Sess Comment
         --------- ------------------ -------- --------- -------------------
Begin Snap:    57722 10-Jan-11 16:00:01 4,959    13.0
  End Snap:    57723 10-Jan-11 17:00:01 5,464    11.9
Elapsed:             60.00 (mins)
问题总结与建议

根据上面的分析，数据库出现HANG的情况是由于kqlmbpil: parent 无法获取latch，从而导致数据库出现大量的latch free等待导致。然而，由于latch是oracle一种低级的保护机制。很难在生产数据库重现，因此给出下面的建议：

下次如果发现同类故障时，使用oradebug收集hanganalyze和system dump信息，供GCS工程师分析使用，确保找到问题的源头。（如果不能使用SQLPLUS,请使用DBX工具附录提供了dbx的使用方法）
1.

Sqlplus ‘/ as sysdba’
oradebug setmypid;
oradebug unlimited;
oradebug dump hanganalyze 3;
!sleep 60
oradebug dump hanganalyze 3;
!sleep 60
oradebug dump hanganalyze 3;
oradebug tracefile_name;

sqlplus ‘/ as sysdba’
oradebug setmypid;
oradebug unlimited;
oradebug dump systemstate 266；
!sleep 60;
oradebug dump systemstate 266;
!sleep 60;
oradebug dump systemstate 266;
oradebug tracefile_name;

2. 使用下面的查询找到具体是哪个latch

SELECT SID, EVENT, P1TEXT, P1RAW, P2TEXT, P2, SECONDS_IN_WAIT FROM V$SESSION_WAIT where event =’latch free’;

SELECT n.name, SUM(w.p3) Sleeps
  FROM V$SESSION_WAIT w, V$LATCHNAME n
WHERE w.event = `latch free'
AND w.p2 = n.latch#
GROUP BY n.name;

Select name, immediate_gets, immediate_misses,
   gets, misses, sleeps, waiters_woken
From v$latch
Where waiters_woken > 0;

页: [1]

Oracle数据库数据恢复、性能优化's Archiver

数据库hang latch free 等待事件一例