Oracle数据库数据恢复、性能优化»论坛 › Oracle › Oracle数据库管理 › Oracle ORA-00600 [OSDEP_INTERNAL] 数据库宕机！！！ ...

12 积分	0 好友	3 主题

发消息

[备份恢复] Oracle ORA-00600 [OSDEP_INTERNAL] 数据库宕机！！！

1^#

发表于 2011-10-13 13:32:32 | 查看: 12340| 回复: 13

昨晚7点多数据库宕机。同事说是由于内存不足导致数据库宕机，不知道这个结论是否正确，宕机的具体原因是什么？

告警日志文件发现错误信息如下：
Wed Oct 12 19:36:09 2011
Trace dumping is performing id=[11012193433]
Wed Oct 12 19:36:24 2011
Errors in file /oracle/nxcdr/admin/bdump/cdr2_lmon_3387420.trc:
ORA-29740: evicted by member 0, group incarnation 8
Wed Oct 12 19:36:24 2011
LMON: terminating instance due to error 29740
Instance terminated by LMON, pid = 3387420
Wed Oct 12 19:37:49 2011
Errors in file /oracle/nxcdr/admin/udump/cdr2_ora_1597478.trc:
ORA-00600: internal error code, arguments: [OSDEP_INTERNAL], [], [], [], [], [], [], []
ORA-27302: failure occurred at: skgpwinit1
ORA-27303: additional information: invalid shared context
Wed Oct 12 19:37:49 2011
Errors in file /oracle/nxcdr/admin/udump/cdr2_ora_2334720.trc:
ORA-00600: internal error code, arguments: [OSDEP_INTERNAL], [], [], [], [], [], [], []
ORA-27302: failure occurred at: skgpwinit1
ORA-27303: additional information: invalid shared context

数据库版本：
Oracle9i Enterprise Edition Release 9.2.0.6.0 - 64bit Production
With the Partitioning and Real Application Clusters options
JServer Release 9.2.0.6.0 - Production

跟踪日志文件部分信息如下：
/oracle/nxcdr/admin/udump/cdr2_ora_1597478.trc
Oracle9i Enterprise Edition Release 9.2.0.6.0 - 64bit Production
With the Partitioning and Real Application Clusters options
JServer Release 9.2.0.6.0 - Production
ORACLE_HOME = /oracle/app/product/9.2.0
System name: AIX
Node name:    ngrptcdr2
Release:       3
Version:       5
Machine:       00CF7CD34C00
Instance name: cdr2
Redo thread mounted by this instance: 2
Oracle process number: 2
Unix process pid: 1597478, image: [email=oracle@ngrptcdr2]oracle@ngrptcdr2[/email] (TNS V1-V3)
*** 2011-10-12 19:37:49.487
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [OSDEP_INTERNAL], [], [], [], [], [], [], []
ORA-27302: failure occurred at: skgpwinit1
ORA-27303: additional information: invalid shared context
Current SQL information unavailable - no session.
----- Call Stack Trace -----
calling             call    entry             argument values in hex
location          type    point             (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedmp+0148       bl    ksedst             1025AE8FC ?
ksfdmp+0018       bl    01FD8260
kgerinv+00e8       bl    _ptrgl
kgerin+003c       bl    kgerinv             70676574696E666F ?

分享0

收藏0 回复只看该作者道具举报

Honcho

2^#

发表于 2011-10-13 19:05:03

回复 1# 的帖子

Bug 8392904: ORA-29740, INSTANCE TERMINATED BY USER AND ORA 600 [OSDEP_INTERNAL]

回复只看该作者道具举报

Maclean Liu(刘相兵

3^#

发表于 2011-10-13 21:24:24

请检查内存参数是否有效 :

#collect AIX info

ls -al $ORACLE_HOME/bin/oracle >> /tmp/support.txt
oslevel -s
vmo -a >> /tmp/support.txt
whoami >> /tmp/support.txt
ulimit -a >> /tmp/support.txt
svmon -O unit=MB >> /tmp/support.txt
/usr/sbin/lsps -a >> /tmp/support.txt
/usr/sbin/lsattr -HE -l sys0 -a realmem >> /tmp/support.txt
ipcs -m >> /tmp/support.txt

并上传/tmp/support.txt的内容

回复只看该作者道具举报

alonedba

4^#

发表于 2011-10-13 21:33:47

不能VPN，明天收集好AIX系统信息再上传，多谢刘兄！

回复只看该作者道具举报

alonedba

5^#

发表于 2011-10-14 10:25:24

这个是今天收集的aix系统信息。。。

support.rar

1.63 KB, 下载次数: 1338

support.txt文件

回复只看该作者道具举报

Liu Maclean(刘相兵

6^#

发表于 2011-10-14 10:47:10

仔细看一下可以发现，实例crash是因为被member 0 驱逐

Wed Oct 12 19:36:09 2011
Trace dumping is performing id=[11012193433]
Wed Oct 12 19:36:24 2011
Errors in file /oracle/nxcdr/admin/bdump/cdr2_lmon_3387420.trc:
ORA-29740: evicted by member 0, group incarnation 8

建议你分析下HACMP和LMON的日志来分析evicted的原因，这里ORA-600 [OSDEP_INTERNAL]的出现因该也是受害者

回复只看该作者道具举报

alonedba

7^#

发表于 2011-10-14 11:58:46

上传了errpt的结果和LMON的日志文件。

AIX_ERRPT和LMON_log.rar

4.17 KB, 下载次数: 1253

回复只看该作者道具举报

Liu Maclean(刘相兵

8^#

发表于 2011-10-14 12:02:28

把节点 1 和 2的 LMON 和 DIAG进程的相关时段 trace都上传，如果不大的话直接贴内容不要用附件形式

回复只看该作者道具举报

alonedba

9^#

发表于 2011-10-14 12:57:15

节点1没有相关时段的LMON跟踪文件，一文件夹在相关时段创建，有大量的*.trw文件，我弄了两个*lmon.trw和*diag.trw文件。

节点1和2的diag_lmon.rar

166.97 KB, 下载次数: 1216

回复只看该作者道具举报

Maclean Liu(刘相兵

10^#

发表于 2011-10-14 13:01:57

节点1 的 cdr1_lmon_930014.trc 你没上传只传了 cdr1_lmon_930014.trw

回复只看该作者道具举报

alonedba

11^#

发表于 2011-10-14 17:04:16

我把12号之前和之后的内容删了，这样可以不用发附件了。
/oracle/nxcdr/admin/bdump/cdr1_lmon_930014.trc
Oracle9i Enterprise Edition Release 9.2.0.6.0 - 64bit Production
With the Partitioning and Real Application Clusters options
JServer Release 9.2.0.6.0 - Production
ORACLE_HOME = /oracle/app/product/9.2.0
System name: AIX
Node name: ngrptcdr1
Release: 3
Version: 5
Machine: 00CE5FA34C00
Instance name: cdr1
Redo thread mounted by this instance: 0 <none>
Oracle process number: 4
Unix process pid: 930014, image: [email=oracle@ngrptcdr1]oracle@ngrptcdr1[/email] (LMON)
*** SESSION ID:(3.1) 2011-10-07 23:11:45.919
...
*** 2011-10-07 23:12:00.591
kjxgrtmc2: Member 0 thread 1 mounted
*** 2011-10-12 19:34:33.046
kjxgrcomerr: Communications reconfig: instance 1 (7,7)
Submitting asynchronized dump request [2]
kjxgrrcfgchk: Initiating reconfig, reason 3
*** 2011-10-12 19:34:38.155
kjxgmrcfg: Reconfiguration started, reason 3
kjxgmcs: Setting state to 7 0.
*** 2011-10-12 19:34:38.157
   Name Service frozen
kjxgmcs: Setting state to 7 1.
*** 2011-10-12 19:34:38.261
Obtained RR update lock for sequence 7, RR seq 7
*** 2011-10-12 19:37:08.284
Voting results, upd 0, seq 8, bitmap: 0
Evicting mem 1, stat 0x0007 err 0x0002
kjxgmps: proposing substate 2
kjxgmcs: Setting state to 8 2.
   Performed the unique instance identification check
kjxgmps: proposing substate 3
kjxgmcs: Setting state to 8 3.
   Name Service recovery started
   Deleted all dead-instance name entries
kjxgmps: proposing substate 4
kjxgmcs: Setting state to 8 4.
   Multicasted all local name entries for publish
   Replayed all pending requests
kjxgmps: proposing substate 5
kjxgmcs: Setting state to 8 5.
   Name Service normal
   Name Service recovery done
*** 2011-10-12 19:37:08.285
kjxgmps: proposing substate 6
kjxgmcs: Setting state to 8 6.
*** 2011-10-12 19:37:19.857
kjxgmcs: Setting state to 0 0.
*** 2011-10-12 19:37:19.859
   Name Service frozen
kjfmact: call ksimdic on instance (1)
*** 2011-10-12 19:37:19.859
*** 2011-10-12 19:37:19.859
Reconfiguration started (old inc 7, new inc 8)
Synchronization timeout interval: 600 sec
List of nodes:
0
*** 2011-10-12 19:37:19.859
kjxgmpoll reconfig bitmap: 0
*** 2011-10-12 19:37:19.859
kjxgmrcfg: Reconfiguration started, reason 1
kjxgmcs: Setting state to 8 0.
*** 2011-10-12 19:37:19.861
   Name Service frozen
kjxgmcs: Setting state to 8 1.
*** 2011-10-12 19:37:19.887
Obtained RR update lock for sequence 8, RR seq 8
*** 2011-10-12 19:37:19.890
Voting results, upd 0, seq 9, bitmap: 0
kjxgmps: proposing substate 2
kjxgmcs: Setting state to 9 2.
   Performed the unique instance identification check
kjxgmps: proposing substate 3
kjxgmcs: Setting state to 9 3.
   Name Service recovery started
   Deleted all dead-instance name entries
kjxgmps: proposing substate 4
kjxgmcs: Setting state to 9 4.
   Multicasted all local name entries for publish
   Replayed all pending requests
kjxgmps: proposing substate 5
kjxgmcs: Setting state to 9 5.
   Name Service normal
   Name Service recovery done
*** 2011-10-12 19:37:19.890
kjxgmps: proposing substate 6
kjxgmcs: Setting state to 9 6.
*** 2011-10-12 19:37:19.890
Reconfiguration started (old inc 7, new inc 9)
Synchronization timeout interval: 600 sec
List of nodes:
0
Nested/batched reconfiguration detected.
Global Resource Directory frozen
node 0
* kjshashcfg: I'm the only node in the cluster (node 0)
[WARNING]: send queue inst 1, recvr 1 not empty 19
  sndq msg 19  maxlmt 271975  maxlen 565  wqlen 171563334
  sndq msg 19  start_tm 3707224  end_tm 3748245
  sndq msg : 34:16 38:3
[WARNING]: send queue inst 1, recvr 2 not empty 22
  sndq msg 22  maxlmt 271975  maxlen 575  wqlen 171646144
  sndq msg 22  start_tm 3657455  end_tm 3748245
  sndq msg : 34:18 38:4
[WARNING]: send queue inst 1, recvr 3 not empty 20
  sndq msg 20  maxlmt 271975  maxlen 584  wqlen 171435406
  sndq msg 20  start_tm 3726855  end_tm 3748245
  sndq msg : 34:19 38:1
Active Sendback Threshold = 50 %
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Resources and enqueues cleaned out
Resources remastered 125636
249650 GCS shadows traversed, 28 cancelled, 72127 closed
202060 GCS resources traversed, 1 cancelled
set master node info
Submitted all remote-enqueue requests
Update rdomain variables
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
*** 2011-10-12 19:37:21.920
249650 GCS shadows traversed, 0 replayed, 72153 unopened
Submitted all GCS cache requests
0 write requests issued in 177497 GCS resources
11 PIs marked suspect, 0 flush PI msgs
*** 2011-10-12 19:37:23.559
Reconfiguration complete
Post SMON to start 1st pass IR
*** 2011-10-12 20:09:38.085
kjxgmpoll reconfig bitmap: 0 1
*** 2011-10-12 20:09:38.085
kjxgmrcfg: Reconfiguration started, reason 1
kjxgmcs: Setting state to 9 0.
*** 2011-10-12 20:09:38.087
   Name Service frozen
kjxgmcs: Setting state to 9 1.
*** 2011-10-12 20:09:38.296
Obtained RR update lock for sequence 9, RR seq 9
*** 2011-10-12 20:09:38.311
Voting results, upd 0, seq 10, bitmap: 0 1
kjxgmps: proposing substate 2
kjxgmcs: Setting state to 10 2.
   Performed the unique instance identification check
kjxgmps: proposing substate 3
kjxgmcs: Setting state to 10 3.
   Name Service recovery started
   Deleted all dead-instance name entries
kjxgmps: proposing substate 4
kjxgmcs: Setting state to 10 4.
   Multicasted all local name entries for publish
   Replayed all pending requests
kjxgmps: proposing substate 5
kjxgmcs: Setting state to 10 5.
   Name Service normal
   Name Service recovery done
*** 2011-10-12 20:09:38.423
kjxgmps: proposing substate 6
kjxgmcs: Setting state to 10 6.
*** 2011-10-12 20:09:38.523
*** 2011-10-12 20:09:38.523
Reconfiguration started (old inc 9, new inc 10)
Synchronization timeout interval: 600 sec
List of nodes:
0 1
Global Resource Directory frozen
node 0
node 1
release 9 2 0 6
* kjdrqrnums: node 1 resnum could not be queried (ret 7).
res_master_weight for node 0 is 247250
res_master_weight for node 1 is 247250
Total master weight = 494500
Dead  inst
Join  inst 1
Exist inst 0
Active Sendback Threshold = 50 %
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Resources and enqueues cleaned out
Resources remastered 125704
249650 GCS shadows traversed, 0 cancelled, 93365 closed
156285 GCS resources traversed, 0 cancelled
193688 GCS resources on freelist, 271975 on array, 271975 allocated
set master node info
Submitted all remote-enqueue requests
kjfcrfg: Number of mesgs sent to node 1 = 61019
Update rdomain variables
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
*** 2011-10-12 20:09:40.934
249650 GCS shadows traversed, 77998 replayed, 93365 unopened
Submitted all GCS cache requests
0 write requests issued in 78287 GCS resources
0 PIs marked suspect, 0 flush PI msgs
* kjdrqrnums: node 1 resnum could not be queried (ret 7).
*** 2011-10-12 20:09:42.317
Reconfiguration complete
Post SMON to start 1st pass IR
*** 2011-10-13 11:59:29.602

回复只看该作者道具举报

Maclean Liu(刘相兵

12^#

发表于 2011-10-14 19:44:40

在10.12 19:20出现了NIM thread blocked

BFE4C025 1013110011 P H sysplanar0    UNDETERMINED ERROR
3D32B80D 1012192011 P S topsvcs       NIM thread blocked
3C81E43F 1012003011 P U topsvcs       Late in sending heartbeat
3C81E43F 1009091511 P U topsvcs       Late in sending heartbeat
3C81E43F 1008003011 P U topsvcs       Late in sending heartbeat

该NIM thread blocked警告说明网络接口线程阻塞，可能由操作系统资源耗尽引起。

1号节点在19:34开始初始化reconfig

*** 2011-10-12 19:34:33.046
kjxgrcomerr: Communications reconfig: instance 1 (7,7)
Submitting asynchronized dump request [2]
kjxgrrcfgchk: Initiating reconfig, reason 3
*** 2011-10-12 19:34:38.155
kjxgmrcfg: Reconfiguration started, reason 3
kjxgmcs: Setting state to 7 0.
*** 2011-10-12 19:34:38.157
   Name Service frozen
kjxgmcs: Setting state to 7 1.
*** 2011-10-12 19:34:38.261
Obtained RR update lock for sequence 7, RR seq 7
*** 2011-10-12 19:37:08.284
Voting results, upd 0, seq 8, bitmap: 0
Evicting mem 1, stat 0x0007 err 0x0002

建议可能的话追查当日的OS性能日志，如果没有部署监控脚本的话可以安装OSWater。
因为现在缺少系统性能日志，所以不能确定是否由资源耗尽引起，而之前让你收集的内存信息:

#collect AIX info

ls -al $ORACLE_HOME/bin/oracle >> /tmp/support.txt
oslevel -s
vmo -a >> /tmp/support.txt
whoami >> /tmp/support.txt
ulimit -a >> /tmp/support.txt
svmon -O unit=MB >> /tmp/support.txt
/usr/sbin/lsps -a >> /tmp/support.txt
/usr/sbin/lsattr -HE -l sys0 -a realmem >> /tmp/support.txt
ipcs -m >> /tmp/support.txt

因为不是在问题发生时段收集的所以没有意义。

如果确认是由于内存耗尽引起的，那么建议调整AIX Vmo参数，如调大minfree 或调小 maxperm等参数，具体可以参考：
http://www.oracledatabase12g.com ... nce-and-tuning.html

回复只看该作者道具举报

alonedba

13^#

发表于 2011-10-25 12:20:31

TKS Maclean Liu！！！

回复只看该作者道具举报

ALLSTARS_ORACLE

14^#

发表于 2017-4-17 17:02:22

ding ding ding

回复只看该作者道具举报

返回列表

		自动登录	找回密码
密码			注册