Oracle数据库数据恢复、性能优化

找回密码
注册
搜索
热搜: 活动 交友 discuz
发新帖

12

积分

0

好友

3

主题
1#
发表于 2011-10-13 13:32:32 | 查看: 12048| 回复: 13
昨晚7点多数据库宕机。同事说是由于内存不足导致数据库宕机,不知道这个结论是否正确,宕机的具体原因是什么?

告警日志文件发现错误信息如下:
Wed Oct 12 19:36:09 2011
Trace dumping is performing id=[11012193433]
Wed Oct 12 19:36:24 2011
Errors in file /oracle/nxcdr/admin/bdump/cdr2_lmon_3387420.trc:
ORA-29740: evicted by member 0, group incarnation 8
Wed Oct 12 19:36:24 2011
LMON: terminating instance due to error 29740
Instance terminated by LMON, pid = 3387420
Wed Oct 12 19:37:49 2011
Errors in file /oracle/nxcdr/admin/udump/cdr2_ora_1597478.trc:
ORA-00600: internal error code, arguments: [OSDEP_INTERNAL], [], [], [], [], [], [], []
ORA-27302: failure occurred at: skgpwinit1
ORA-27303: additional information: invalid shared context
Wed Oct 12 19:37:49 2011
Errors in file /oracle/nxcdr/admin/udump/cdr2_ora_2334720.trc:
ORA-00600: internal error code, arguments: [OSDEP_INTERNAL], [], [], [], [], [], [], []
ORA-27302: failure occurred at: skgpwinit1
ORA-27303: additional information: invalid shared context

数据库版本:
Oracle9i Enterprise Edition Release 9.2.0.6.0 - 64bit Production
With the Partitioning and Real Application Clusters options
JServer Release 9.2.0.6.0 - Production

跟踪日志文件部分信息如下:
/oracle/nxcdr/admin/udump/cdr2_ora_1597478.trc
Oracle9i Enterprise Edition Release 9.2.0.6.0 - 64bit Production
With the Partitioning and Real Application Clusters options
JServer Release 9.2.0.6.0 - Production
ORACLE_HOME = /oracle/app/product/9.2.0
System name:    AIX
Node name:      ngrptcdr2
Release:        3
Version:        5
Machine:        00CF7CD34C00
Instance name: cdr2
Redo thread mounted by this instance: 2
Oracle process number: 2
Unix process pid: 1597478, image: [email=oracle@ngrptcdr2]oracle@ngrptcdr2[/email] (TNS V1-V3)
*** 2011-10-12 19:37:49.487
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [OSDEP_INTERNAL], [], [], [], [], [], [], []
ORA-27302: failure occurred at: skgpwinit1
ORA-27303: additional information: invalid shared context
Current SQL information unavailable - no session.
----- Call Stack Trace -----
calling              call     entry                argument values in hex      
location             type     point                (? means dubious value)     
-------------------- -------- -------------------- ----------------------------
ksedmp+0148          bl       ksedst               1025AE8FC ?
ksfdmp+0018          bl       01FD8260            
kgerinv+00e8         bl       _ptrgl               
kgerin+003c          bl       kgerinv              70676574696E666F ?
2#
发表于 2011-10-13 19:05:03

回复 1# 的帖子

Bug 8392904: ORA-29740, INSTANCE TERMINATED BY USER AND ORA 600 [OSDEP_INTERNAL]

回复 只看该作者 道具 举报

3#
发表于 2011-10-13 21:24:24
请检查 内存参数是否有效 :



#collect AIX info

ls -al $ORACLE_HOME/bin/oracle >> /tmp/support.txt
oslevel -s
vmo -a >> /tmp/support.txt
whoami >> /tmp/support.txt
ulimit -a >> /tmp/support.txt
svmon -O unit=MB >> /tmp/support.txt
/usr/sbin/lsps -a >> /tmp/support.txt
/usr/sbin/lsattr -HE -l sys0 -a realmem >> /tmp/support.txt
ipcs -m >> /tmp/support.txt

并上传/tmp/support.txt的内容

回复 只看该作者 道具 举报

4#
发表于 2011-10-13 21:33:47
不能VPN,明天收集好AIX系统信息再上传,多谢刘兄!

回复 只看该作者 道具 举报

5#
发表于 2011-10-14 10:25:24
这个是今天收集的aix系统信息。。。

support.rar

1.63 KB, 下载次数: 1174

support.txt文件

回复 只看该作者 道具 举报

6#
发表于 2011-10-14 10:47:10
仔细看一下可以发现,实例crash是因为被member 0 驱逐

Wed Oct 12 19:36:09 2011
Trace dumping is performing id=[11012193433]
Wed Oct 12 19:36:24 2011
Errors in file /oracle/nxcdr/admin/bdump/cdr2_lmon_3387420.trc:
ORA-29740: evicted by member 0, group incarnation 8

建议你分析下HACMP和LMON的日志来分析evicted的原因,这里ORA-600 [OSDEP_INTERNAL]的出现因该也是受害者

回复 只看该作者 道具 举报

7#
发表于 2011-10-14 11:58:46
上传了errpt的结果和LMON的日志文件。

AIX_ERRPT和LMON_log.rar

4.17 KB, 下载次数: 1090

回复 只看该作者 道具 举报

8#
发表于 2011-10-14 12:02:28
把 节点 1 和 2的  LMON 和 DIAG进程的 相关时段 trace都上传, 如果不大的话 直接贴内容 不要用附件形式

回复 只看该作者 道具 举报

9#
发表于 2011-10-14 12:57:15
节点1没有相关时段的LMON跟踪文件,一文件夹在相关时段创建,有大量的*.trw文件,我弄了两个*lmon.trw和*diag.trw文件。

节点1和2的diag_lmon.rar

166.97 KB, 下载次数: 1054

回复 只看该作者 道具 举报

10#
发表于 2011-10-14 13:01:57
节点1 的 cdr1_lmon_930014.trc 你没上传 只传了 cdr1_lmon_930014.trw

回复 只看该作者 道具 举报

11#
发表于 2011-10-14 17:04:16
我把12号之前和之后的内容删了,这样可以不用发附件了。
/oracle/nxcdr/admin/bdump/cdr1_lmon_930014.trc
Oracle9i Enterprise Edition Release 9.2.0.6.0 - 64bit Production
With the Partitioning and Real Application Clusters options
JServer Release 9.2.0.6.0 - Production
ORACLE_HOME = /oracle/app/product/9.2.0
System name: AIX
Node name: ngrptcdr1
Release: 3
Version: 5
Machine: 00CE5FA34C00
Instance name: cdr1
Redo thread mounted by this instance: 0 <none>
Oracle process number: 4
Unix process pid: 930014, image: [email=oracle@ngrptcdr1]oracle@ngrptcdr1[/email] (LMON)
*** SESSION ID:(3.1) 2011-10-07 23:11:45.919
...
*** 2011-10-07 23:12:00.591
kjxgrtmc2: Member 0 thread 1 mounted
*** 2011-10-12 19:34:33.046
kjxgrcomerr: Communications reconfig: instance 1 (7,7)
Submitting asynchronized dump request [2]
kjxgrrcfgchk: Initiating reconfig, reason 3
*** 2011-10-12 19:34:38.155
kjxgmrcfg: Reconfiguration started, reason 3
kjxgmcs: Setting state to 7 0.
*** 2011-10-12 19:34:38.157
     Name Service frozen
kjxgmcs: Setting state to 7 1.
*** 2011-10-12 19:34:38.261
Obtained RR update lock for sequence 7, RR seq 7
*** 2011-10-12 19:37:08.284
Voting results, upd 0, seq 8, bitmap: 0
Evicting mem 1, stat 0x0007 err 0x0002
kjxgmps: proposing substate 2
kjxgmcs: Setting state to 8 2.
     Performed the unique instance identification check
kjxgmps: proposing substate 3
kjxgmcs: Setting state to 8 3.
     Name Service recovery started
     Deleted all dead-instance name entries
kjxgmps: proposing substate 4
kjxgmcs: Setting state to 8 4.
     Multicasted all local name entries for publish
     Replayed all pending requests
kjxgmps: proposing substate 5
kjxgmcs: Setting state to 8 5.
     Name Service normal
     Name Service recovery done
*** 2011-10-12 19:37:08.285
kjxgmps: proposing substate 6
kjxgmcs: Setting state to 8 6.
*** 2011-10-12 19:37:19.857
kjxgmcs: Setting state to 0 0.
*** 2011-10-12 19:37:19.859
     Name Service frozen
kjfmact: call ksimdic on instance (1)
*** 2011-10-12 19:37:19.859
*** 2011-10-12 19:37:19.859
Reconfiguration started (old inc 7, new inc 8)
Synchronization timeout interval: 600 sec
List of nodes:
0
*** 2011-10-12 19:37:19.859
kjxgmpoll reconfig bitmap: 0
*** 2011-10-12 19:37:19.859
kjxgmrcfg: Reconfiguration started, reason 1
kjxgmcs: Setting state to 8 0.
*** 2011-10-12 19:37:19.861
     Name Service frozen
kjxgmcs: Setting state to 8 1.
*** 2011-10-12 19:37:19.887
Obtained RR update lock for sequence 8, RR seq 8
*** 2011-10-12 19:37:19.890
Voting results, upd 0, seq 9, bitmap: 0
kjxgmps: proposing substate 2
kjxgmcs: Setting state to 9 2.
     Performed the unique instance identification check
kjxgmps: proposing substate 3
kjxgmcs: Setting state to 9 3.
     Name Service recovery started
     Deleted all dead-instance name entries
kjxgmps: proposing substate 4
kjxgmcs: Setting state to 9 4.
     Multicasted all local name entries for publish
     Replayed all pending requests
kjxgmps: proposing substate 5
kjxgmcs: Setting state to 9 5.
     Name Service normal
     Name Service recovery done
*** 2011-10-12 19:37:19.890
kjxgmps: proposing substate 6
kjxgmcs: Setting state to 9 6.
*** 2011-10-12 19:37:19.890
Reconfiguration started (old inc 7, new inc 9)
Synchronization timeout interval: 600 sec
List of nodes:
0
Nested/batched reconfiguration detected.
Global Resource Directory frozen
node 0
* kjshashcfg: I'm the only node in the cluster (node 0)
[WARNING]: send queue inst 1, recvr 1 not empty 19
  sndq msg 19  maxlmt 271975  maxlen 565  wqlen 171563334
  sndq msg 19  start_tm 3707224  end_tm 3748245
  sndq msg : 34:16 38:3
[WARNING]: send queue inst 1, recvr 2 not empty 22
  sndq msg 22  maxlmt 271975  maxlen 575  wqlen 171646144
  sndq msg 22  start_tm 3657455  end_tm 3748245
  sndq msg : 34:18 38:4
[WARNING]: send queue inst 1, recvr 3 not empty 20
  sndq msg 20  maxlmt 271975  maxlen 584  wqlen 171435406
  sndq msg 20  start_tm 3726855  end_tm 3748245
  sndq msg : 34:19 38:1
Active Sendback Threshold = 50 %
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Resources and enqueues cleaned out
Resources remastered 125636
249650 GCS shadows traversed, 28 cancelled, 72127 closed
202060 GCS resources traversed, 1 cancelled
set master node info
Submitted all remote-enqueue requests
Update rdomain variables
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
*** 2011-10-12 19:37:21.920
249650 GCS shadows traversed, 0 replayed, 72153 unopened
Submitted all GCS cache requests
0 write requests issued in 177497 GCS resources
11 PIs marked suspect, 0 flush PI msgs
*** 2011-10-12 19:37:23.559
Reconfiguration complete
Post SMON to start 1st pass IR
*** 2011-10-12 20:09:38.085
kjxgmpoll reconfig bitmap: 0 1
*** 2011-10-12 20:09:38.085
kjxgmrcfg: Reconfiguration started, reason 1
kjxgmcs: Setting state to 9 0.
*** 2011-10-12 20:09:38.087
     Name Service frozen
kjxgmcs: Setting state to 9 1.
*** 2011-10-12 20:09:38.296
Obtained RR update lock for sequence 9, RR seq 9
*** 2011-10-12 20:09:38.311
Voting results, upd 0, seq 10, bitmap: 0 1
kjxgmps: proposing substate 2
kjxgmcs: Setting state to 10 2.
     Performed the unique instance identification check
kjxgmps: proposing substate 3
kjxgmcs: Setting state to 10 3.
     Name Service recovery started
     Deleted all dead-instance name entries
kjxgmps: proposing substate 4
kjxgmcs: Setting state to 10 4.
     Multicasted all local name entries for publish
     Replayed all pending requests
kjxgmps: proposing substate 5
kjxgmcs: Setting state to 10 5.
     Name Service normal
     Name Service recovery done
*** 2011-10-12 20:09:38.423
kjxgmps: proposing substate 6
kjxgmcs: Setting state to 10 6.
*** 2011-10-12 20:09:38.523
*** 2011-10-12 20:09:38.523
Reconfiguration started (old inc 9, new inc 10)
Synchronization timeout interval: 600 sec
List of nodes:
0 1
Global Resource Directory frozen
node 0
node 1
release 9 2 0 6
* kjdrqrnums: node 1 resnum could not be queried (ret 7).
res_master_weight for node 0 is 247250
res_master_weight for node 1 is 247250
Total master weight = 494500
Dead  inst
Join  inst 1
Exist inst 0
Active Sendback Threshold = 50 %
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Resources and enqueues cleaned out
Resources remastered 125704
249650 GCS shadows traversed, 0 cancelled, 93365 closed
156285 GCS resources traversed, 0 cancelled
193688 GCS resources on freelist, 271975 on array, 271975 allocated
set master node info
Submitted all remote-enqueue requests
kjfcrfg: Number of mesgs sent to node 1 = 61019
Update rdomain variables
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
*** 2011-10-12 20:09:40.934
249650 GCS shadows traversed, 77998 replayed, 93365 unopened
Submitted all GCS cache requests
0 write requests issued in 78287 GCS resources
0 PIs marked suspect, 0 flush PI msgs
* kjdrqrnums: node 1 resnum could not be queried (ret 7).
*** 2011-10-12 20:09:42.317
Reconfiguration complete
Post SMON to start 1st pass IR
*** 2011-10-13 11:59:29.602

回复 只看该作者 道具 举报

12#
发表于 2011-10-14 19:44:40
在10.12 19:20出现了NIM thread blocked

BFE4C025   1013110011 P H sysplanar0     UNDETERMINED ERROR
3D32B80D   1012192011 P S topsvcs        NIM thread blocked
3C81E43F   1012003011 P U topsvcs        Late in sending heartbeat
3C81E43F   1009091511 P U topsvcs        Late in sending heartbeat
3C81E43F   1008003011 P U topsvcs        Late in sending heartbeat

该NIM thread blocked警告说明 网络接口线程阻塞,可能由操作系统资源耗尽引起。

1号节点在19:34开始初始化reconfig

*** 2011-10-12 19:34:33.046
kjxgrcomerr: Communications reconfig: instance 1 (7,7)
Submitting asynchronized dump request [2]
kjxgrrcfgchk: Initiating reconfig, reason 3
*** 2011-10-12 19:34:38.155
kjxgmrcfg: Reconfiguration started, reason 3
kjxgmcs: Setting state to 7 0.
*** 2011-10-12 19:34:38.157
     Name Service frozen
kjxgmcs: Setting state to 7 1.
*** 2011-10-12 19:34:38.261
Obtained RR update lock for sequence 7, RR seq 7
*** 2011-10-12 19:37:08.284
Voting results, upd 0, seq 8, bitmap: 0
Evicting mem 1, stat 0x0007 err 0x0002



建议可能的话 追查当日的OS性能日志, 如果没有部署监控脚本的话 可以安装OSWater。
因为现在缺少系统性能日志,所以不能确定是否由资源耗尽引起,而之前让你收集的内存信息:

#collect AIX info

ls -al $ORACLE_HOME/bin/oracle >> /tmp/support.txt
oslevel -s
vmo -a >> /tmp/support.txt
whoami >> /tmp/support.txt
ulimit -a >> /tmp/support.txt
svmon -O unit=MB >> /tmp/support.txt
/usr/sbin/lsps -a >> /tmp/support.txt
/usr/sbin/lsattr -HE -l sys0 -a realmem >> /tmp/support.txt
ipcs -m >> /tmp/support.txt

因为不是在问题发生时段收集的所以 没有意义。


如果确认是由于内存耗尽引起的,那么建议调整AIX Vmo参数, 如调大minfree 或 调小 maxperm等参数,具体可以参考:
http://www.oracledatabase12g.com ... nce-and-tuning.html

回复 只看该作者 道具 举报

13#
发表于 2011-10-25 12:20:31
TKS Maclean Liu!!!

回复 只看该作者 道具 举报

14#
发表于 2017-4-17 17:02:22

ding ding ding

回复 只看该作者 道具 举报

您需要登录后才可以回帖 登录 | 注册

QQ|手机版|Archiver|Oracle数据库数据恢复、性能优化

GMT+8, 2024-9-28 21:16 , Processed in 0.056348 second(s), 23 queries .

Powered by Discuz! X2.5

© 2001-2012 Comsenz Inc.

回顶部
TEL/電話+86 13764045638
Email service@parnassusdata.com
QQ 47079569