A Case of an Oracle Hang

1#

Posted 2012-4-1 23:53:14

Linux + Oracle 10.2.0.4
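Symptoms:
Starting around 12:00 noon on April 1, the Oracle service stopped accepting connections. Even a local "sqlplus / as sysdba" session on the server itself hung with no response and no error.
A second Oracle instance on the same server was running normally.

Relevant alert log entries: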
Sun Apr  1 00:00:11 2012
Errors in file /workspace/oracle/admin/ora10/bdump/ora10_j000_9846.trc:
ORA-12012: error on auto execute of job 3
ORA-00942: table or view does not exist
ORA-06512: at line 14
Sun Apr  1 00:32:26 2012
Thread 1 advanced to log sequence 224 (LGWR switch)
  Current log# 2 seq# 224 mem# 0: /workspace/oracle/oradata/ora10/redo02.log
Sun Apr  1 09:30:45 2012
Thread 1 advanced to log sequence 225 (LGWR switch)
  Current log# 3 seq# 225 mem# 0: /workspace/oracle/oradata/ora10/redo03.log
Sun Apr  1 10:10:59 2012
kkjcre1p: unable to spawn jobq slave process
Sun Apr  1 10:11:00 2012
Errors in file /workspace/oracle/admin/ora10/bdump/ora10_cjq0_16701.trc:
Sun Apr  1 12:54:59 2012
WARNING: inbound connection timed out (ORA-3136)

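sqlnet.log contains a large number of the following messages:
WARNING: Subscription for node down event still pending

The PMON trace contains many entries like this:
kgl child 5 level 0 latch clean-up information
A HANGANALYZE run showed no deadlocked processes on the server.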

Attempted fixes:
Added to listener.ora:
SUBSCRIBE_FOR_NODE_DOWN_EVENT_<listenername>=OFF
INBOUND_CONNECT_TIMEOUT_<listenername>=0
Added to sqlnet.ora:
SQLNET.INBOUND_CONNECT_TIMEOUT = 0
Restarting the listener after these changes had no effect.

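I ran hanganalyze over a -prelim connection; see the attachment. I could not find the cause, so in the end I had to kill all of the server's processes and reboot the server.
After the reboot I found that no AWR snapshots had been generated during the incident, and the last AWR snapshot before the incident shows the server under very light load.
Could you help analyze this? Thanks in advance!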

ora10_ora_25218.trc.txt


Maclean Liu(刘相兵

2#

Posted 2012-4-2 10:49:38

The HANGANALYZE trace does not contain much useful information.

Action Plan:

Please upload:
/workspace/oracle/admin/ora10/bdump/ora10_j000_9846.trc
/workspace/oracle/admin/ora10/bdump/ora10_cjq0_16701.trc

along with the PMON and SMON trace files from the time the problem was triggered.


深圳－刀

3#

Posted 2012-4-3 21:48:18

The four trace files have been uploaded.

4个Trace.rar


Maclean Liu(刘相兵

4#

Posted 2012-4-4 10:37:50

Sun Apr  1 10:11:00 2012
Errors in file /workspace/oracle/admin/ora10/bdump/ora10_cjq0_16701.trc:
Sun Apr  1 12:54:59 2012
WARNING: inbound connection timed out (ORA-3136)

Oracle Database 10g Enterprise Edition Release 10.2.0.4.0
System name: Linux
Release: 2.6.18-238.el5

10.2.0.4 on Linux

Starting at 2012-04-01 10:11:02.102, PMON kept trying to clean up a mutex whose pending operation was GET_EXCL:

=================================================
KGL Child 5 Level 0 latch clean-up information:
=================================================
Operation = 12
kgllcpt1 = 82655460  kgllcpt2 = 0
kgllcpt3 = 0  kgllcpt4 = 0
kgllcpt5 = 7b213e78  kgllcub4 = 0
*** 2012-04-01 10:14:04.278
KGX cleanup...
KGX Atomic Operation Log 0xa6b619b8
Mutex 0x8d686b30(260, 0) idn 22dfb81b oper GET_EXCL
Cursor Pin uid 280 efd 5 whr 11 slp 16383

*** 2012-04-01 10:17:08.988
KGX cleanup...
KGX Atomic Operation Log 0xa6b619b8
Mutex 0x8d686b30(260, 0) idn 22dfb81b oper GET_EXCL
Cursor Pin uid 280 efd 5 whr 11 slp 16383

KGX cleanup...
KGX Atomic Operation Log 0xa6b619b8
Mutex 0x8d686b30(260, 0) idn 22dfb81b oper GET_EXCL
Cursor Pin uid 280 efd 5 whr 11 slp 16383

However, the cleanup never completed; the trace shows the same attempts continuing until 2012-04-01 16:21:34.331.
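
The repetition in the PMON trace can be checked mechanically. Below is a minimal Python sketch, assuming the trace text uses the same "Mutex 0x... idn ... oper ..." layout as the excerpt in this post. The function name `count_mutex_cleanups` is hypothetical and not part of any Oracle tool.

```python
import re
from collections import Counter

# Matches lines such as:
#   Mutex 0x8d686b30(260, 0) idn 22dfb81b oper GET_EXCL
# The parenthesised fields are skipped because their meaning is not
# stated in the thread.
MUTEX_RE = re.compile(
    r"Mutex\s+(0x[0-9a-fA-F]+)\([^)]*\)\s+idn\s+([0-9a-fA-F]+)\s+oper\s+(\w+)"
)

def count_mutex_cleanups(trace_text):
    """Count cleanup entries per (mutex address, idn, operation)."""
    counts = Counter()
    for match in MUTEX_RE.finditer(trace_text):
        counts[match.groups()] += 1
    return counts

# Two cleanup entries copied from the PMON trace excerpt in this post.
sample = """\
*** 2012-04-01 10:14:04.278
KGX cleanup...
KGX Atomic Operation Log 0xa6b619b8
Mutex 0x8d686b30(260, 0) idn 22dfb81b oper GET_EXCL
Cursor Pin uid 280 efd 5 whr 11 slp 16383

*** 2012-04-01 10:17:08.988
KGX cleanup...
KGX Atomic Operation Log 0xa6b619b8
Mutex 0x8d686b30(260, 0) idn 22dfb81b oper GET_EXCL
Cursor Pin uid 280 efd 5 whr 11 slp 16383
"""

for key, n in count_mutex_cleanups(sample).items():
    print(key, n)
```

A single mutex address whose count keeps growing over hours, as it does here, suggests that PMON is retrying the same cleanup without making progress. The count alone does not prove that.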

Trace of the J000 process:

*** 2012-04-01 10:09:39.922
Waited for process J000 to initialize for 60 seconds
*** 2012-04-01 10:09:40.207
Dumping diagnostic information for J000:
OS pid = 10241
loadavg : 2.45 1.42 1.11
memory info: free memory = 0.00M
swap info: free = 0.00M alloc = 0.00M total = 0.00M
F S UID       PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY       TIME CMD
0 D oracle 10241    1  0  78 0 - 332128 sync_p 10:08 ?    00:00:00 ora_j000_ora10
[Thread debugging using libthread_db enabled]
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7fff0d7ae000
0x0000000001ff0237 in kksCursorFreeCallBack ()
#0  0x0000000001ff0237 in kksCursorFreeCallBack ()
#1  0x0000000003e9227f in kglobf0 ()
#2  0x0000000003e981e6 in kglhpd ()
#3  0x0000000003b5a5d2 in kghfrx ()
#4  0x0000000003b698e0 in kghfrunp ()
#5  0x0000000003b6880d in kghfnd ()
#6  0x0000000003b66fcb in kghalo ()
#7  0x0000000000783412 in ksp_param_handle_alloc ()

kksCursorFreeCallBack ()


Maclean Liu(刘相兵

5#

Posted 2012-4-4 10:51:53

ODM FINDING:

A similar case on MOS:

Ksvcreate: Process(M001) Creation Failed , Database hang

Applies to:
Oracle Server - Standard Edition - Version: 10.2.0.4 and later [Release: 10.2 and later ]
Information in this document applies to any platform.
Symptoms

Database goes in hang mode. No new connections are able to connect or able to perform any
operations on existing connections. It happens frequently.

The alert.log shows entries like the following:

ksvcreate: Process(m001) creation failed
Thu Oct 7 22:17:37 2010
kkjcre1p: unable to spawn jobq slave process
Thu Oct 7 22:17:37 2010

Cause
The alert.log messages are not the cause; they result from PMON needing to create new processes. Since PMON itself is hung, no new processes can be started.

This may be verified from a systemstate dump:
--

OSD pid info: Unix process pid: 1519850, image: oracle@wpcdbqa (PMON)

errorstack
kksLockWait+01fc<-kgxWait+0168<-kgxExclusive+00bc<-kksFreeHeapGetMutex+0228<-kksCursorFreeCallBack+0088<-kgllccl+13cc<-kgllcu+01b4

waiting for 'cursor: pin X' blocking sess=0x0 seq=8166 wait_time=0 seconds since wait started=0
   idn=6e0a6074, value=20600000000, where|sleeps=b003c1e3a
Dumping Session Wait History

The holder of the mutex does not exist (dead process).

Solution
This is due to a known bug:

Document:8426816.8  PMON may hang cleaning up a dead process (rare)

Details:

This problem is introduced in 10.2.0.4 by the fix for bug 5377099.
Under a very specific scenario involving process death at a precise
point in time, an instance hang may result due to PMON getting
blocked when attempting to clean up a failed process.

PMON gets stuck during cleanup waiting for a mutex in X mode
and that mutex is held in X mode by the dead process being
cleaned up . PMON is blocked under a stack including
kksFreeHeapGetMutex()
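
As a rough aid for checking a dump against this description, here is a minimal Python sketch that splits a short-stack string in the "func+offset<-func+offset" form shown in the systemstate excerpt and looks for kksFreeHeapGetMutex. The helper names `stack_functions` and `has_frame` are hypothetical. A match only means the stack resembles the one in the note and does not confirm the bug.

```python
import re

def stack_functions(errorstack):
    """Split 'f+01fc<-g+0168<-...' into a list of bare function names."""
    frames = [f.strip() for f in errorstack.split("<-") if f.strip()]
    # Drop the trailing '+offset' (hex digits) from each frame.
    return [re.sub(r"\+[0-9a-fA-F]+$", "", f) for f in frames]

def has_frame(errorstack, name="kksFreeHeapGetMutex"):
    """Return True if the named function appears anywhere in the stack."""
    return name in stack_functions(errorstack)

# Stack copied from the systemstate excerpt quoted in the MOS note.
stack = (
    "kksLockWait+01fc<-kgxWait+0168<-kgxExclusive+00bc<-"
    "kksFreeHeapGetMutex+0228<-kksCursorFreeCallBack+0088<-"
    "kgllccl+13cc<-kgllcu+01b4"
)

print(stack_functions(stack))
print(has_frame(stack))
```

The same helper can be pointed at other frame names, for example kksCursorFreeCallBack, which appears in both the J000 stack in this thread and the PMON stack from the note.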

Note:
One off patches for this bug also fix issue in bug 5377099 .

Solution:

1) Please apply this patch to fix the issue.

or

2) The patch is also included in the 10.2.0.4.2 PSU patch
(which fixes other known issues of 10.2.0.4.0).

More information of the 10.2.0.4.2 PSU patch is in the following note:

Document:8833280.8  Bug 8833280 - 10.2.0.4.2 Patch Set Update (PSU)

In short: PMON can hang the instance while trying to clean up a mutex held in X mode by a dead process.

to be continued .............
