ORA-00445: 后台进程 "J000" 在 120 秒之后仍没有启动

iniestandroid 发表于 2014-8-4 12:21:33

OS: Solaris 10 SPARC 64bit
DB: 11.2.0.1 RAC

双节点RAC上一共有5个db，应用报错连接不上其中一个实例，发现节点1的db alert日志报错如下：
Fri Aug 01 11:17:36 2014
Errors in file /u01/app/oracle/diag/rdbms/rep/rep1/trace/rep1_cjq0_2817.trc  (incident=264722):
ORA-00445: 后台进程 "J000" 在 120 秒之后仍没有启动
Incident details in: /u01/app/oracle/diag/rdbms/rep/rep1/incident/incdir_264722/rep1_cjq0_2817_i264722.trc
Fri Aug 01 11:17:41 2014
Trace dumping is performing id=
kkjcre1p: unable to spawn jobq slave process
Errors in file /u01/app/oracle/diag/rdbms/rep/rep1/trace/rep1_cjq0_2817.trc:
Fri Aug 01 11:18:33 2014
Sweep : completed
Sweep : completed
Fri Aug 01 11:25:42 2014
Errors in file /u01/app/oracle/diag/rdbms/rep/rep1/trace/rep1_arc2_2549.trc:
ORA-16146: 控制文件入队不可用
Fri Aug 01 11:25:46 2014
Errors in file /u01/app/oracle/diag/rdbms/rep/rep1/trace/rep1_lgwr_2118.trc  (incident=264465):
ORA-00240: 控制文件入队被持有的时间超过了 120 秒
Incident details in: /u01/app/oracle/diag/rdbms/rep/rep1/incident/incdir_264465/rep1_lgwr_2118_i264465.trc
Fri Aug 01 11:25:59 2014
Trace dumping is performing id=
Fri Aug 01 11:26:09 2014
Errors in file /u01/app/oracle/diag/rdbms/rep/rep1/trace/rep1_lgwr_2118.trc  (incident=264466):
ORA-00445: 后台进程 "O000" 在 120 秒之后仍没有启动
Incident details in: /u01/app/oracle/diag/rdbms/rep/rep1/incident/incdir_264466/rep1_lgwr_2118_i264466.trc

当时OS的CPU， I/O，内存都不高，但是通过mpstat看到其中一个CPU的使用率达到100%，使用者是sys，执行prstat命令没有响应。检查了另外一个节点都是正常的。

大神儿们有空帮忙看看吧！！

Maclean Liu(刘相兵 发表于 2014-8-4 16:24:42

waited for 'os thread startup', seq_num: 59532
   p1: ''=0x0
   p2: ''=0x0
   p3: ''=0x0
   time_waited: >= 0 sec (still in wait)

*** 2014-08-01 11:24:50.720
loadavg : 2.71 3.47 3.52
swap info: free_mem = 284102.17M rsv = 324356.43M
         alloc = 199102.66M avail = 130586.28M swap_free = 255840.05M
skgpgcmdout: read() for cmd /bin/ps -elf | /bin/egrep 'PID | 4128' | /bin/grep -v grep timed out after 15.000 seconds
Stack:
pstack: cannot examine 4128: permission denied

Maclean Liu(刘相兵 发表于 2014-8-4 16:32:05

给出solaris project 内核相关配置信息

iniestandroid 发表于 2014-8-4 16:52:49

Maclean Liu(刘相兵发表于 2014-8-4 16:32 static/image/common/back.gif
给出solaris project 内核相关配置信息

/etc/system:
set noexec_user_stack=1
set shmsys:shminfo_shmmax=549755813888
set semsys:seminfo_semmns=1024
set semsys:seminfo_semvmx=32767
set rlim_fd_cur=1024

project :
# projadd group.oinstall
# projmod -sK "project.max-shm-memory=(privileged,512G,deny)" group.oinstall
# projmod -sK "project.max-shm-ids=(privileged,100,deny)" group.oinstall
# projmod -sK "project.max-sem-nsems=(privileged,256,deny)" group.oinstall
# projmod -sK "project.max-sem-ids=(privileged,100,deny)" group.oinstall

谢谢ML、、

iniestandroid 发表于 2014-8-6 17:12:42

Maclean Liu(刘相兵发表于 2014-8-4 16:32 static/image/common/back.gif
给出solaris project 内核相关配置信息

呼叫ML...有空帮我看一眼，多谢！！

harryzhang 发表于 2014-8-7 10:27:22

执行ps命令都超时，一般情况下应该很快，是不是可以认为系统当时hang了。

你OS是 SOLARIS。上次遇到在AIX下，主机hang了，也是ps 执行超时。

harryzhang 发表于 2014-8-7 11:11:28

*** 2014-08-01 11:16:28.255
Warning: log write broadcast wait time 593ms
(set event 10468 level 8 to disable this warning)
Waited for process O000 to initialize for 60 seconds

The process: Onnn (ASM Connection Pool Process) : Maintains a connection to the ASM instance for metadata operations.

O000一直没起来~~

Maclean Liu(刘相兵 发表于 2014-8-7 16:53:07

这么看就是整个OS都hang了，就TRACE来看内存似乎是够。

由于OS hang了之后，无法启动O的进程是很正常的，应当进一步分析OSW

iniestandroid 发表于 2014-8-8 09:41:02

harryzhang 发表于 2014-8-7 11:11 static/image/common/back.gif
*** 2014-08-01 11:16:28.255
Warning: log write broadcast wait time 593ms
(set event 10468 level 8 t ...

恩，我也觉得是OS hang了，08.05又出了一次问题，也是这个节点，db alert先出现ORA-00445，接着crs alert 集群资源检查失败，然后这个节点被驱逐重启了

iniestandroid 发表于 2014-8-8 09:45:31

Maclean Liu(刘相兵发表于 2014-8-7 16:53 static/image/common/back.gif
这么看就是整个OS都hang了，就TRACE来看内存似乎是够。

由于OS hang了之后，无法启动O的进程是很正常的， ...

8.1出现问题后才装上OSW，然后2014-08-05 02:36:35又出现了相同问题并导致这个节点被驱逐，/var/adm/message只有panic报错和OS 重启信息，OSW只有mpstat有一些异常，我传上来给你们看一下

iniestandroid 发表于 2014-8-8 09:57:43

Maclean Liu(刘相兵发表于 2014-8-7 16:53 static/image/common/back.gif
这么看就是整个OS都hang了，就TRACE来看内存似乎是够。

由于OS hang了之后，无法启动O的进程是很正常的， ...

OSW已经上传，谢谢ML！！

harryzhang 发表于 2014-8-8 23:24:35

OSW的 ps里面：
zzz ***二 8月 5 02:14:06 CST 2014

USER PID  PPID PRI %CPU %MEM  VSZ  RSS          WCHAN S STIME       TIME COMMAND
  oracle 26138    1  59  0.0 10.0 84094312 52627624    602fdeae5dc S 8月_03    00:05 ora_q003_ards1
  oracle 25780    1  59  0.0 10.0 84092584 52628192    602fdeae52c S 16:12:10    00:01 ora_q005_ards1
  oracle 17502    1  59  0.0 10.0 84091816 52627888    602fdeae4bc S 02:07:35    00:00 ora_w000_ards1
  oracle  3351    1  59  0.0 10.0 84093808 52619072    602fdeae58c S 7月_31    00:23 ora_smco_ards1

zzz ***二 8月 5 02:15:13 CST 2014
USER PID  PPID PRI %CPU %MEM  VSZ  RSS          WCHAN S STIME       TIME COMMAND
到这里后无数据，但是其他的命令例如 mpstat之类的在35分之后才没数据的

message里面也是在14分时候出现信息：
Aug  5 02:14:47 odsdb-node01 in.mpathd: Cannot meet requested failure detection time of 10000 ms on (inet igb0) new failure detection time for group "public" is 123858 ms

然后10多分钟后，数据库报错进程无法启动，可能实例只是受害者。。
然后系统挂了。

MOS有这样一个，但是因为对SOLARIS不精通，不敢确认，建议找主机人员介入。

Solaris 10 Server or Solaris Cluster Node/System/Server Rebooted Itself With Panic String "forced crash dump initiated at user request" (文档 ID 1388823.1)

The Solaris system crashes with "forced crash dump initiated at user request".

The panic message looks like:
node1 ^Mpanic/thread=2a10007fca0:
node1 unix: forced crash dump initiated at user request
node1 unix:
node1 genunix: 000002a10007f700 genunix:kadmin+544 (b4, 1, 0, 0, 5, 1285800)
node1 genunix: %l0-3: 0000000001815400 00000000011ee684 00000000011ee400 0000000000000004
node1 %l4-7: 0000000000000004 00000000000004e8 0000000000000004 0000000000000004
node1 genunix: 000002a10007f7c0 ntwdt:ntwdt_enforce_timeout+70 (6001153bb30, 60011694370, 7005d800, 18c0400, 7b7d1c00, 0)
node1 genunix: %l0-3: 0000000000000800 00000000018d2c00 0000000000000000 0000000001913c00
node1 %l4-7: 0000060011694370 0000000000000001 000000007b7d1c00 0000000000000001
node1 pcisch: WARNING: pcisch2: ino 0x22 blocked
node1 pcisch: rmc_comm-0#0

Cause

There is a Solaris driver called ntwdt (Netra-based application watchdog timer driver). It is a watchdog mechanism. It detects a system hang, or an application hang or crash. The watchdog is a timer that is continually reset by a user application as long as the operating system and user application are running. When the application stops and the timeout is reached ntwdt driver will generate a coredump and crashes the system. The man pages can be found there :ntwdt(7D)

Solution

If you see ntwdt e.g:
ntwdt:ntwdt_enforce_timeout+0x70(0x600368d67b0)
ntwdt:ntwdt_cyclic_pat+0x90(0x600368d67b0)
in panic stack then these are obvious hardware errors.

Check /var/adm/messages to get a clue what the real problem is.
Also the logs of the service processor of the system can be helpful.

If you unable to determine the cause contact Oracle Support provide crashdump and explorer output of the system for analysis.

harryzhang 发表于 2014-8-8 23:36:19

查看 odsdb-node01_top_14.08.05.0100.dat

注意
2:11-14三次采样主机的负载分别是 30 24 14

iniestandroid 发表于 2014-9-1 13:23:43

harryzhang 发表于 2014-8-8 23:36 static/image/common/back.gif
查看 odsdb-node01_top_14.08.05.0100.dat

注意

这个问题已经解决，是由于Solaris 10 DISM 的 unpublish bug造成，禁用DISM后跑了3周没有再出现过问题，谢谢各位大神儿！！

harryzhang 发表于 2014-9-2 16:40:50

which bug,反馈下嘛！

iniestandroid 发表于 2014-9-4 14:17:51

harryzhang 发表于 2014-9-2 16:40 static/image/common/back.gif
which bug,反馈下嘛！

BUG 17927249 - System panics due to DISM issues

没有相关补丁，禁用DISM后OK

参照 Doc ID 1606318.1

页: [1]

Oracle数据库数据恢复、性能优化's Archiver

ORA-00445: 后台进程 "J000" 在 120 秒之后仍没有启动