Posted on 2013-11-22 12:00:43
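Environment: AIX 5.3 / Oracle 10.2.0.4 64-bit
Problem: In a two-node RAC, the DB instance on node 1 aborted shortly after 2 a.m. on November 20 and stayed down until it was started manually after staff arrived at work at 9 a.m. The background logs show that the ASM instance could not communicate with the DB instance and restarted, which caused the DB instance to abort. The crsd process then tried to restart both the ASM instance and the DB instance, but only the ASM instance came back up; the DB instance could not be restarted.
Diagnostic log information:
1. ASM instance alert log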
Wed Nov 20 02:43:50 2013
Errors in file /home/oracle/admin/+ASM/bdump/+asm1_ckpt_6553660.trc:
ORA-15082: ASM failed to communicate with database instance
CKPT: terminating instance due to error 15082
Wed Nov 20 02:43:50 2013
System state dump is made for local instance
System State dumped to trace file /home/oracle/admin/+ASM/bdump/+asm1_diag_19988986.trc
Wed Nov 20 02:43:55 2013
Instance terminated by CKPT, pid = 6553660
2. DB instance alert log
Wed Nov 20 02:47:57 2013
Errors in file /home/oracle/admin/zljg/bdump/zljg1_asmb_27983948.trc:
ORA-15064: communication failure with ASM instance
ORA-03135: connection lost contact
Wed Nov 20 02:47:57 2013
ASMB: terminating instance due to error 15064
Wed Nov 20 02:47:58 2013
System state dump is made for local instance
System State dumped to trace file /home/oracle/admin/zljg/bdump/zljg1_diag_29425790.trc
Wed Nov 20 02:48:00 2013
Shutting down instance (abort)
Wed Nov 20 02:48:03 2013
Instance terminated by ASMB, pid = 27983948
Wed Nov 20 02:48:05 2013
Instance terminated by USER, pid = 28967452
Wed Nov 20 10:27:16 2013
Starting ORACLE instance (normal)
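To line up the two alert logs above on a single timeline, a small parser can help. The following is a minimal sketch, not an Oracle tool: the function names `parse_alert_log` and `merge_timelines` are hypothetical, and it assumes the 10.2-style timestamp lines shown above (for example `Wed Nov 20 02:43:50 2013`). The sample strings are excerpts copied from the logs in this post.

```python
import re
from datetime import datetime

# Alert-log timestamp lines look like "Wed Nov 20 02:43:50 2013".
TS_FORMAT = "%a %b %d %H:%M:%S %Y"
ORA_RE = re.compile(r"ORA-\d{5}")
TERMINATED_RE = re.compile(r"Instance terminated by (\w+), pid = (\d+)")


def parse_alert_log(text):
    """Return (timestamp, line) pairs for ORA- errors and instance terminations."""
    events = []
    current_ts = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        try:
            # A line that parses as a timestamp applies to the lines after it.
            current_ts = datetime.strptime(line, TS_FORMAT)
            continue
        except ValueError:
            pass
        if current_ts and (ORA_RE.search(line) or TERMINATED_RE.search(line)):
            events.append((current_ts, line))
    return events


def merge_timelines(**logs):
    """Merge events from several named logs into one list sorted by time."""
    merged = []
    for source, text in logs.items():
        for ts, line in parse_alert_log(text):
            merged.append((ts, source, line))
    return sorted(merged)


# Excerpts from the ASM and DB alert logs quoted in this post.
ASM_SAMPLE = """Wed Nov 20 02:43:50 2013
ORA-15082: ASM failed to communicate with database instance
CKPT: terminating instance due to error 15082
Wed Nov 20 02:43:55 2013
Instance terminated by CKPT, pid = 6553660
"""

DB_SAMPLE = """Wed Nov 20 02:47:57 2013
ORA-15064: communication failure with ASM instance
ORA-03135: connection lost contact
Wed Nov 20 02:48:03 2013
Instance terminated by ASMB, pid = 27983948
"""

asm_events = parse_alert_log(ASM_SAMPLE)
db_events = parse_alert_log(DB_SAMPLE)

# Seconds between the ASM termination and the DB termination by ASMB.
gap_seconds = (db_events[-1][0] - asm_events[-1][0]).total_seconds()

for ts, source, line in merge_timelines(asm=ASM_SAMPLE, db=DB_SAMPLE):
    print(ts.isoformat(), source, line)
print("gap between terminations (s):", gap_seconds)
```

Run against the full alert logs, the same merge makes it easier to see how long the DB instance kept running after the ASM instance went down.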
3. crsd process log
2013-11-20 02:08:10.533: [ CRSEVT][14032]32CAAMonitorHandler :: 0:Could not join /home/oracle/product/10.2.0/crs/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child
2013-11-20 02:08:10.534: [ CRSEVT][14032]32CAAMonitorHandler :: 0:Action Script /home/oracle/product/10.2.0/crs/bin/racgwrap(check) timed out for ora.yjbsdb5.vip! (timeout=60)
2013-11-20 02:08:10.534: [ CRSAPP][14032]32CheckResource error for ora.yjbsdb5.vip error code = -2
2013-11-20 02:47:59.033: [ CRSRES][15999]32In stateChanged, ora.zljg.zljg1.inst target is ONLINE
2013-11-20 02:47:59.033: [ CRSRES][15999]32ora.zljg.zljg1.inst on yjbsdb5 went OFFLINE unexpectedly
2013-11-20 02:47:59.033: [ CRSRES][15999]32StopResource: setting CLI values
2013-11-20 02:47:59.103: [ CRSRES][15999]32Attempting to stop `ora.zljg.zljg1.inst` on member `yjbsdb5`
2013-11-20 02:48:06.821: [ CRSRES][13936]32Start of `ora.yjbsdb5.ASM1.asm` on member `yjbsdb5` succeeded.
2013-11-20 02:48:06.821: [ CRSRES][13936]32Successfully restarted ora.yjbsdb5.ASM1.asm on yjbsdb5, RESTART_COUNT=1
2013-11-20 02:48:06.905: [ CRSRES][13936]32ora.yjbsdb5.ASM1.asm Updated LAST_RESTART time in ocr
2013-11-20 02:48:20.466: [ OCRSRV][11473]th_select_w_f_r: Error processing request [5]
2013-11-20 02:48:20.538: [ OCRSRV][6693]th_select_w_f_r: Error processing request [5]
2013-11-20 02:48:36.861: [ CRSRES][15999]32Stop of `ora.zljg.zljg1.inst` on member `yjbsdb5` succeeded.
2013-11-20 02:48:36.862: [ CRSRES][15999]32ora.zljg.zljg1.inst RESTART_COUNT=5 RESTART_ATTEMPTS=5
2013-11-20 02:48:36.862: [ CRSRES][15999]32ora.zljg.zljg1.inst Uptime does not exceed uptime_threshold
2013-11-20 02:48:36.862: [ CRSRES][15999]32ora.zljg.zljg1.inst ran out of restarts on yjbsdb5
2013-11-20 02:48:36.884: [ CRSRES][15999]32ora.zljg.zljg1.inst failed on yjbsdb5 relocating.
2013-11-20 02:48:37.399: [ CRSRES][15999]32Cannot relocate ora.zljg.zljg1.instStopping dependents
2013-11-20 02:48:37.424: [ CRSRES][15999]32StopResource: setting CLI values
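Read literally, the crsd log above reports `RESTART_COUNT=5 RESTART_ATTEMPTS=5` for `ora.zljg.zljg1.inst`, followed by "ran out of restarts". The sketch below scans a crsd log for resources whose restart count has reached the configured attempts. It is only an illustration: `find_exhausted_restarts` is a hypothetical helper, and the regular expression assumes the line layout seen in this post.

```python
import re

# Matches lines such as:
#   ...32ora.zljg.zljg1.inst RESTART_COUNT=5 RESTART_ATTEMPTS=5
RESTART_RE = re.compile(
    r"(?P<resource>ora\.\S+)\s+"
    r"RESTART_COUNT=(?P<count>\d+)\s+"
    r"RESTART_ATTEMPTS=(?P<attempts>\d+)"
)


def find_exhausted_restarts(log_text):
    """Return (resource, count, attempts) where count has reached attempts."""
    exhausted = []
    for line in log_text.splitlines():
        match = RESTART_RE.search(line)
        if match is None:
            continue
        count = int(match["count"])
        attempts = int(match["attempts"])
        if count >= attempts:
            exhausted.append((match["resource"], count, attempts))
    return exhausted


# Two lines copied from the crsd log in this post.
SAMPLE = """\
2013-11-20 02:48:06.821: [ CRSRES][13936]32Successfully restarted ora.yjbsdb5.ASM1.asm on yjbsdb5, RESTART_COUNT=1
2013-11-20 02:48:36.862: [ CRSRES][15999]32ora.zljg.zljg1.inst RESTART_COUNT=5 RESTART_ATTEMPTS=5
"""

for resource, count, attempts in find_exhausted_restarts(SAMPLE):
    print(f"{resource}: {count} of {attempts} restart attempts used")
```

The ASM line is skipped because it carries no `RESTART_ATTEMPTS` value; only the DB instance resource is reported.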
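My questions:
1. What caused the ASM instance to restart (excessive I/O load, a bug, or something else)?
2. Why was crsd unable to start the DB instance resource?
Additional notes:
When the ASM instance restarts, its alert log sometimes reports ORA-00600: internal error code, arguments: [kfgrpGetByNum02], [2], [318121198], [2], [317804538], [], [], [],
In addition, an RMAN backup starts on node 1 at 2 a.m. every day. Previously, shortly after 2 a.m., the DB instance was intermittently hit by ORA-29740: evicted by member 1, group incarnation 28, but it restarted automatically afterward. This time the problem seems more serious: crsd could no longer start the DB instance. The relevant logs for the failure window are all in the attachment.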