bj-jn

[RAC Performance Tuning] Help diagnosing a database instance crash on one node of a RAC cluster

Posted 2013-11-22 12:00:43

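Environment: AIX 5.3 / Oracle 10.2.0.4 64-bit
Problem: In a two-node RAC, the DB instance on node 1 aborted shortly after 2 a.m. on November 20 and stayed down until it was started manually after work began at 9 a.m. Analysis of the background logs shows that the ASM instance could not communicate with the DB instance and restarted, which caused the DB instance to abort. crsd then tried to restart both the ASM instance and the DB instance, but only the ASM instance came back up; the DB instance could not be restarted.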

Diagnostic logs:
1. ASM instance alert log
Wed Nov 20 02:43:50 2013
Errors in file /home/oracle/admin/+ASM/bdump/+asm1_ckpt_6553660.trc:
ORA-15082: ASM failed to communicate with database instance
CKPT: terminating instance due to error 15082
Wed Nov 20 02:43:50 2013
System state dump is made for local instance
System State dumped to trace file /home/oracle/admin/+ASM/bdump/+asm1_diag_19988986.trc
Wed Nov 20 02:43:55 2013
Instance terminated by CKPT, pid = 6553660

2. DB instance alert log
Wed Nov 20 02:47:57 2013
Errors in file /home/oracle/admin/zljg/bdump/zljg1_asmb_27983948.trc:
ORA-15064: communication failure with ASM instance
ORA-03135: connection lost contact
Wed Nov 20 02:47:57 2013
ASMB: terminating instance due to error 15064
Wed Nov 20 02:47:58 2013
System state dump is made for local instance
System State dumped to trace file /home/oracle/admin/zljg/bdump/zljg1_diag_29425790.trc
Wed Nov 20 02:48:00 2013
Shutting down instance (abort)
Wed Nov 20 02:48:03 2013
Instance terminated by ASMB, pid = 27983948
Wed Nov 20 02:48:05 2013
Instance terminated by USER, pid = 28967452
Wed Nov 20 10:27:16 2013
Starting ORACLE instance (normal)
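
To line up the two alert logs above, the following is a minimal Python sketch that merges them into one timeline. It assumes the 10g alert log layout seen in the excerpts (a timestamp line such as `Wed Nov 20 02:43:50 2013` followed by message lines) and an English locale for date parsing. The function name `parse_alert_log` and the source labels `+ASM1` and `zljg1` are only illustrative; the sample text is copied from the excerpts above.

```python
import re
from datetime import datetime

# Timestamp layout used by the 10g alert log excerpts above (assumed English locale).
TS_FORMAT = "%a %b %d %H:%M:%S %Y"
ORA_CODE = re.compile(r"\bORA-\d{5}\b")


def parse_alert_log(text, source):
    """Return (timestamp, source, message) tuples for ORA- errors and termination lines."""
    events = []
    current_ts = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        try:
            # A line that parses as a timestamp starts a new log entry.
            current_ts = datetime.strptime(line, TS_FORMAT)
            continue
        except ValueError:
            pass
        if current_ts is None:
            continue  # message text before the first timestamp is ignored
        if ORA_CODE.search(line) or "terminat" in line.lower():
            events.append((current_ts, source, line))
    return events


asm_log = """\
Wed Nov 20 02:43:50 2013
ORA-15082: ASM failed to communicate with database instance
CKPT: terminating instance due to error 15082
Wed Nov 20 02:43:55 2013
Instance terminated by CKPT, pid = 6553660
"""

db_log = """\
Wed Nov 20 02:47:57 2013
ORA-15064: communication failure with ASM instance
ORA-03135: connection lost contact
Wed Nov 20 02:47:57 2013
ASMB: terminating instance due to error 15064
Wed Nov 20 02:48:03 2013
Instance terminated by ASMB, pid = 27983948
"""

# Merge both logs and sort by timestamp only (stable, so in-file order is kept).
timeline = sorted(
    parse_alert_log(asm_log, "+ASM1") + parse_alert_log(db_log, "zljg1"),
    key=lambda event: event[0],
)

first_asm = min(ts for ts, src, _ in timeline if src == "+ASM1")
first_db = min(ts for ts, src, _ in timeline if src == "zljg1")
gap_seconds = (first_db - first_asm).total_seconds()

for ts, src, msg in timeline:
    print(ts.strftime("%H:%M:%S"), src, msg)
print("seconds between first ASM error and first DB error:", gap_seconds)
```

On these excerpts the ASM instance reports ORA-15082 about four minutes before the DB instance reports ORA-15064, which matches the order described in the problem statement. The sketch only orders the events; it does not establish the root cause.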

3. crsd process log
2013-11-20 02:08:10.533: [  CRSEVT][14032]32CAAMonitorHandler :: 0:Could not join /home/oracle/product/10.2.0/crs/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child

2013-11-20 02:08:10.534: [  CRSEVT][14032]32CAAMonitorHandler :: 0:Action Script /home/oracle/product/10.2.0/crs/bin/racgwrap(check) timed out for ora.yjbsdb5.vip! (timeout=60)
2013-11-20 02:08:10.534: [  CRSAPP][14032]32CheckResource error for ora.yjbsdb5.vip error code = -2

2013-11-20 02:47:59.033: [  CRSRES][15999]32In stateChanged, ora.zljg.zljg1.inst target is ONLINE
2013-11-20 02:47:59.033: [  CRSRES][15999]32ora.zljg.zljg1.inst on yjbsdb5 went OFFLINE unexpectedly
2013-11-20 02:47:59.033: [  CRSRES][15999]32StopResource: setting CLI values
2013-11-20 02:47:59.103: [  CRSRES][15999]32Attempting to stop `ora.zljg.zljg1.inst` on member `yjbsdb5`
2013-11-20 02:48:06.821: [  CRSRES][13936]32Start of `ora.yjbsdb5.ASM1.asm` on member `yjbsdb5` succeeded.
2013-11-20 02:48:06.821: [  CRSRES][13936]32Successfully restarted ora.yjbsdb5.ASM1.asm on yjbsdb5, RESTART_COUNT=1
2013-11-20 02:48:06.905: [  CRSRES][13936]32ora.yjbsdb5.ASM1.asm Updated LAST_RESTART time in ocr
2013-11-20 02:48:20.466: [  OCRSRV][11473]th_select_w_f_r: Error processing request [5]
2013-11-20 02:48:20.538: [  OCRSRV][6693]th_select_w_f_r: Error processing request [5]
2013-11-20 02:48:36.861: [  CRSRES][15999]32Stop of `ora.zljg.zljg1.inst` on member `yjbsdb5` succeeded.
2013-11-20 02:48:36.862: [  CRSRES][15999]32ora.zljg.zljg1.inst RESTART_COUNT=5 RESTART_ATTEMPTS=5
2013-11-20 02:48:36.862: [  CRSRES][15999]32ora.zljg.zljg1.inst Uptime does not exceed uptime_threshold
2013-11-20 02:48:36.862: [  CRSRES][15999]32ora.zljg.zljg1.inst ran out of restarts on yjbsdb5
2013-11-20 02:48:36.884: [  CRSRES][15999]32ora.zljg.zljg1.inst failed on yjbsdb5 relocating.
2013-11-20 02:48:37.399: [  CRSRES][15999]32Cannot relocate ora.zljg.zljg1.instStopping dependents
2013-11-20 02:48:37.424: [  CRSRES][15999]32StopResource: setting CLI values
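
The crsd excerpt above prints `RESTART_COUNT=5 RESTART_ATTEMPTS=5` for `ora.zljg.zljg1.inst` just before "ran out of restarts". As a rough aid for scanning a longer crsd log, here is a small Python sketch that pulls out those counters per resource. It assumes the line layout shown in the excerpt; the function name `restart_status` is hypothetical, and treating `count >= attempts` as "exhausted" is my reading of the log lines, not a confirmed description of crsd internals.

```python
import re

# Matches crsd lines of the form seen above:
#   ...ora.zljg.zljg1.inst RESTART_COUNT=5 RESTART_ATTEMPTS=5
RESTART_LINE = re.compile(
    r"(?P<resource>ora\.\S+)\s+"
    r"RESTART_COUNT=(?P<count>\d+)\s+"
    r"RESTART_ATTEMPTS=(?P<attempts>\d+)"
)


def restart_status(log_text):
    """Map resource name -> (restart_count, restart_attempts, exhausted)."""
    status = {}
    for line in log_text.splitlines():
        match = RESTART_LINE.search(line)
        if match:
            count = int(match["count"])
            attempts = int(match["attempts"])
            # Assumption: a count that has reached the limit means no more restarts.
            status[match["resource"]] = (count, attempts, count >= attempts)
    return status


crsd_log = """\
2013-11-20 02:48:06.821: [  CRSRES][13936]32Successfully restarted ora.yjbsdb5.ASM1.asm on yjbsdb5, RESTART_COUNT=1
2013-11-20 02:48:36.862: [  CRSRES][15999]32ora.zljg.zljg1.inst RESTART_COUNT=5 RESTART_ATTEMPTS=5
2013-11-20 02:48:36.862: [  CRSRES][15999]32ora.zljg.zljg1.inst Uptime does not exceed uptime_threshold
2013-11-20 02:48:36.862: [  CRSRES][15999]32ora.zljg.zljg1.inst ran out of restarts on yjbsdb5
"""

status = restart_status(crsd_log)
for resource, (count, attempts, exhausted) in status.items():
    print(resource, count, attempts, "exhausted" if exhausted else "ok")
```

If this reading is right, the earlier ORA-29740 evictions may already have used up the restart counter before this incident, but that would need to be checked against the full crsd log in the attachment.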

My questions:
1. What caused the ASM instance to restart (excessive I/O load, a bug, or something else)?
2. Why was crsd unable to start the DB instance resource?

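Additional notes:
When the ASM instance restarts, its alert log sometimes reports ORA-00600: internal error code, arguments: [kfgrpGetByNum02], [2], [318121198], [2], [317804538], [], [], [],
Also, an RMAN backup starts on node 1 at 2 a.m. every day. Previously, the DB instance was occasionally evicted shortly after 2 a.m. with ORA-29740: evicted by member 1, group incarnation 28, but it restarted automatically afterwards. This time the problem looks more serious: crsd can no longer start the DB instance. The logs covering the failure window are all in the attachment.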