ORA-00445: 后台进程 "J000" 在 120 秒之后仍没有启动
OS: Solaris 10 SPARC 64bitDB: 11.2.0.1 RAC
双节点RAC上一共有5个db,应用报错连接不上其中一个实例,发现节点1的db alert日志报错如下:
Fri Aug 01 11:17:36 2014
Errors in file /u01/app/oracle/diag/rdbms/rep/rep1/trace/rep1_cjq0_2817.trc (incident=264722):
ORA-00445: 后台进程 "J000" 在 120 秒之后仍没有启动
Incident details in: /u01/app/oracle/diag/rdbms/rep/rep1/incident/incdir_264722/rep1_cjq0_2817_i264722.trc
Fri Aug 01 11:17:41 2014
Trace dumping is performing id=
kkjcre1p: unable to spawn jobq slave process
Errors in file /u01/app/oracle/diag/rdbms/rep/rep1/trace/rep1_cjq0_2817.trc:
Fri Aug 01 11:18:33 2014
Sweep : completed
Sweep : completed
Fri Aug 01 11:25:42 2014
Errors in file /u01/app/oracle/diag/rdbms/rep/rep1/trace/rep1_arc2_2549.trc:
ORA-16146: 控制文件入队不可用
Fri Aug 01 11:25:46 2014
Errors in file /u01/app/oracle/diag/rdbms/rep/rep1/trace/rep1_lgwr_2118.trc (incident=264465):
ORA-00240: 控制文件入队被持有的时间超过了 120 秒
Incident details in: /u01/app/oracle/diag/rdbms/rep/rep1/incident/incdir_264465/rep1_lgwr_2118_i264465.trc
Fri Aug 01 11:25:59 2014
Trace dumping is performing id=
Fri Aug 01 11:26:09 2014
Errors in file /u01/app/oracle/diag/rdbms/rep/rep1/trace/rep1_lgwr_2118.trc (incident=264466):
ORA-00445: 后台进程 "O000" 在 120 秒之后仍没有启动
Incident details in: /u01/app/oracle/diag/rdbms/rep/rep1/incident/incdir_264466/rep1_lgwr_2118_i264466.trc
当时OS的CPU, I/O,内存都不高,但是通过mpstat看到其中一个CPU的使用率达到100%,使用者是sys,执行prstat命令没有响应。检查了另外一个节点都是正常的。
大神儿们有空帮忙看看吧!! waited for 'os thread startup', seq_num: 59532
p1: ''=0x0
p2: ''=0x0
p3: ''=0x0
time_waited: >= 0 sec (still in wait)
*** 2014-08-01 11:24:50.720
loadavg : 2.71 3.47 3.52
swap info: free_mem = 284102.17M rsv = 324356.43M
alloc = 199102.66M avail = 130586.28M swap_free = 255840.05M
skgpgcmdout: read() for cmd /bin/ps -elf | /bin/egrep 'PID | 4128' | /bin/grep -v grep timed out after 15.000 seconds
Stack:
pstack: cannot examine 4128: permission denied
给出solaris project 内核相关配置信息 Maclean Liu(刘相兵 发表于 2014-8-4 16:32 static/image/common/back.gif
给出solaris project 内核相关配置信息
/etc/system:
set noexec_user_stack=1
set shmsys:shminfo_shmmax=549755813888
set semsys:seminfo_semmns=1024
set semsys:seminfo_semvmx=32767
set rlim_fd_cur=1024
project :
# projadd group.oinstall
# projmod -sK "project.max-shm-memory=(privileged,512G,deny)" group.oinstall
# projmod -sK "project.max-shm-ids=(privileged,100,deny)" group.oinstall
# projmod -sK "project.max-sem-nsems=(privileged,256,deny)" group.oinstall
# projmod -sK "project.max-sem-ids=(privileged,100,deny)" group.oinstall
谢谢ML、、
Maclean Liu(刘相兵 发表于 2014-8-4 16:32 static/image/common/back.gif
给出solaris project 内核相关配置信息
呼叫ML...有空帮我看一眼,多谢!! 执行ps命令都超时,一般情况下应该很快,是不是可以认为系统当时hang了。
你OS是 SOLARIS。上次遇到在AIX下,主机hang了,也是ps 执行超时。 *** 2014-08-01 11:16:28.255
Warning: log write broadcast wait time 593ms
(set event 10468 level 8 to disable this warning)
Waited for process O000 to initialize for 60 seconds
The process: Onnn (ASM Connection Pool Process) : Maintains a connection to the ASM instance for metadata operations.
O000一直没起来~~ 这么看 就是整个OS都hang了,就TRACE来看内存似乎是够。
由于OS hang了之后,无法启动O的进程是很正常的,应当进一步分析OSW harryzhang 发表于 2014-8-7 11:11 static/image/common/back.gif
*** 2014-08-01 11:16:28.255
Warning: log write broadcast wait time 593ms
(set event 10468 level 8 t ...
恩,我也觉得是OS hang了,08.05又出了一次问题,也是这个节点,db alert先出现ORA-00445,接着crs alert 集群资源检查失败,然后这个节点被驱逐重启了 Maclean Liu(刘相兵 发表于 2014-8-7 16:53 static/image/common/back.gif
这么看 就是整个OS都hang了,就TRACE来看内存似乎是够。
由于OS hang了之后,无法启动O的进程是很正常的, ...
8.1出现问题后才装上OSW,然后2014-08-05 02:36:35又出现了相同问题并导致这个节点被驱逐,/var/adm/message只有panic报错和OS 重启信息,OSW只有mpstat有一些异常,我传上来给你们看一下 Maclean Liu(刘相兵 发表于 2014-8-7 16:53 static/image/common/back.gif
这么看 就是整个OS都hang了,就TRACE来看内存似乎是够。
由于OS hang了之后,无法启动O的进程是很正常的, ...
OSW已经上传,谢谢ML!! OSW的 ps里面 :
zzz ***二 8月 5 02:14:06 CST 2014
USER PID PPID PRI %CPU %MEM VSZ RSS WCHAN S STIME TIME COMMAND
oracle 26138 1 59 0.0 10.0 84094312 52627624 602fdeae5dc S 8月_03 00:05 ora_q003_ards1
oracle 25780 1 59 0.0 10.0 84092584 52628192 602fdeae52c S 16:12:10 00:01 ora_q005_ards1
oracle 17502 1 59 0.0 10.0 84091816 52627888 602fdeae4bc S 02:07:35 00:00 ora_w000_ards1
oracle 3351 1 59 0.0 10.0 84093808 52619072 602fdeae58c S 7月_31 00:23 ora_smco_ards1
zzz ***二 8月 5 02:15:13 CST 2014
USER PID PPID PRI %CPU %MEM VSZ RSS WCHAN S STIME TIME COMMAND
到这里后无数据,但是其他的命令例如 mpstat之类的在35分之后才没数据的
message里面也是在14分时候出现信息:
Aug 5 02:14:47 odsdb-node01 in.mpathd: Cannot meet requested failure detection time of 10000 ms on (inet igb0) new failure detection time for group "public" is 123858 ms
然后10多分钟后,数据库报错进程无法启动,可能实例只是受害者。。
然后系统挂了。
MOS有这样一个,但是因为对SOLARIS不精通,不敢确认,建议找主机人员介入。
Solaris 10 Server or Solaris Cluster Node/System/Server Rebooted Itself With Panic String "forced crash dump initiated at user request" (文档 ID 1388823.1)
The Solaris system crashes with "forced crash dump initiated at user request".
The panic message looks like:
node1 ^Mpanic/thread=2a10007fca0:
node1 unix: forced crash dump initiated at user request
node1 unix:
node1 genunix: 000002a10007f700 genunix:kadmin+544 (b4, 1, 0, 0, 5, 1285800)
node1 genunix: %l0-3: 0000000001815400 00000000011ee684 00000000011ee400 0000000000000004
node1 %l4-7: 0000000000000004 00000000000004e8 0000000000000004 0000000000000004
node1 genunix: 000002a10007f7c0 ntwdt:ntwdt_enforce_timeout+70 (6001153bb30, 60011694370, 7005d800, 18c0400, 7b7d1c00, 0)
node1 genunix: %l0-3: 0000000000000800 00000000018d2c00 0000000000000000 0000000001913c00
node1 %l4-7: 0000060011694370 0000000000000001 000000007b7d1c00 0000000000000001
node1 pcisch: WARNING: pcisch2: ino 0x22 blocked
node1 pcisch: rmc_comm-0#0
Cause
There is a Solaris driver called ntwdt (Netra-based application watchdog timer driver). It is a watchdog mechanism. It detects a system hang, or an application hang or crash. The watchdog is a timer that is continually reset by a user application as long as the operating system and user application are running. When the application stops and the timeout is reached ntwdt driver will generate a coredump and crashes the system. The man pages can be found there :ntwdt(7D)
Solution
If you see ntwdt e.g:
ntwdt:ntwdt_enforce_timeout+0x70(0x600368d67b0)
ntwdt:ntwdt_cyclic_pat+0x90(0x600368d67b0)
in panic stack then these are obvious hardware errors.
Check /var/adm/messages to get a clue what the real problem is.
Also the logs of the service processor of the system can be helpful.
If you unable to determine the cause contact Oracle Support provide crashdump and explorer output of the system for analysis.
查看 odsdb-node01_top_14.08.05.0100.dat
注意
2:11-14三次采样 主机的负载分别是 30 24 14 harryzhang 发表于 2014-8-8 23:36 static/image/common/back.gif
查看 odsdb-node01_top_14.08.05.0100.dat
注意
这个问题已经解决,是由于Solaris 10 DISM 的 unpublish bug造成,禁用DISM后跑了3周没有再出现过问题,谢谢各位大神儿!! which bug,反馈下嘛! harryzhang 发表于 2014-9-2 16:40 static/image/common/back.gif
which bug,反馈下嘛!
BUG 17927249 - System panics due to DISM issues
没有相关补丁,禁用DISM后OK
参照 Doc ID 1606318.1
页:
[1]