Failure on one RAC node
System information for the affected node:
CPU info:
4 Intel(R) Itanium 2 9100 series processors (1.6 GHz, 24 MB)
533 MT/s bus, CPU version A1
8 logical processors (2 per socket)
Memory: 16316 MB (15.93 GB)
OS info:
Nodename: dqpicc07
Release: HP-UX B.11.31
Version: U (unlimited-user license)
Database version: Oracle 10.2.0.5
Number of RAC nodes: 4
$ crs_stat -t
No response...............
$ crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
$ ocrcheck
Status of Oracle Cluster Registry is as follows :
Version : 2
Total space (kbytes) : 1048300
Used space (kbytes) : 8340
Available space (kbytes) : 1039960
ID : 941242024
Device/File Name : /dev/rdisk/disk67
Device/File integrity check succeeded
Device/File not configured
Cluster registry integrity check succeeded
$ crsctl query css votedisk
0. 0 /dev/rdisk/disk68
located 1 votedisk(s).
crsd.log keeps reporting the following errors:
2013-12-27 14:27:37.665: [ CRSEVT] CAAMonitorHandler :: 0:Action Script /u01/oracle/product/10.2.0.5/crs/bin/racgwrap(check) timed out for ora.bxgrid.db! (timeout=600)
2013-12-27 14:27:37.665: [ CRSAPP] CheckResource error for ora.bxgrid.db error code = -2
2013-12-27 14:27:45.395: [ CRSEVT] CAAMonitorHandler :: 0:Could not join /u01/oracle/product/10.2.0.5/bxgrid/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child
2013-12-27 14:27:45.395: [ CRSEVT] CAAMonitorHandler :: 0:Action Script /u01/oracle/product/10.2.0.5/bxgrid/bin/racgwrap(check) timed out for ora.bxgrid.srv_bxgrid.cs! (timeout=600)
2013-12-27 14:27:45.395: [ CRSAPP] CheckResource error for ora.bxgrid.srv_bxgrid.cs error code = -2
2013-12-27 14:27:53.175: [ CRSEVT] CAAMonitorHandler :: 0:Could not join /u01/oracle/product/10.2.0.5/crs/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child
2013-12-27 14:27:53.175: [ CRSEVT] CAAMonitorHandler :: 0:Action Script /u01/oracle/product/10.2.0.5/crs/bin/racgwrap(check) timed out for ora.dqpicc07.gsd! (timeout=600)
2013-12-27 14:27:53.175: [ CRSAPP] CheckResource error for ora.dqpicc07.gsd error code = -2
2013-12-27 14:27:59.425: [ CRSEVT] CAAMonitorHandler :: 0:Could not join /u01/oracle/product/10.2.0.5/crs/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child
2013-12-27 14:27:59.425: [ CRSEVT] CAAMonitorHandler :: 0:Action Script /u01/oracle/product/10.2.0.5/crs/bin/racgwrap(check) timed out for ora.dqpicc07.ons! (timeout=600)
2013-12-27 14:27:59.425: [ CRSAPP] CheckResource error for ora.dqpicc07.ons error code = -2
2013-12-27 14:28:08.575: [ CRSEVT] CAAMonitorHandler :: 0:Could not join /u01/oracle/product/10.2.0.5/bxgrid/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child
2013-12-27 14:28:08.575: [ CRSEVT] CAAMonitorHandler :: 0:Action Script /u01/oracle/product/10.2.0.5/bxgrid/bin/racgwrap(check) timed out for ora.dqpicc07.ASM3.asm! (timeout=600)
2013-12-27 14:28:08.576: [ CRSAPP] CheckResource error for ora.dqpicc07.ASM3.asm error code = -2
2013-12-27 14:28:55.955: [ CRSEVT] CAAMonitorHandler :: 0:Could not join /u01/oracle/product/10.2.0.5/crs/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child
2013-12-27 14:28:55.955: [ CRSEVT] CAAMonitorHandler :: 0:Action Script /u01/oracle/product/10.2.0.5/crs/bin/racgwrap(check) timed out for ora.dqpicc07.vip! (timeout=60)
2013-12-27 14:28:55.955: [ CRSAPP] CheckResource error for ora.dqpicc07.vip error code = -2
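To see at a glance which resources are affected, the "timed out" lines can be tallied per resource. The following is only a minimal sketch: the function name `summarize_timeouts` and the regular expression are illustrative assumptions derived from the excerpt above, not an Oracle-provided tool, and the sample lines are copied from this thread.

```python
import re
from collections import Counter

# Assumed line shape, taken only from the crsd.log excerpt in this thread:
# "<timestamp>: [ CRSEVT] ... timed out for <resource>! (timeout=<seconds>)"
TIMEOUT_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}): \[\s*CRSEVT\s*\].*"
    r"timed out for (?P<resource>\S+)! \(timeout=(?P<timeout>\d+)\)"
)


def summarize_timeouts(lines):
    """Count check-action timeouts per (resource, timeout in seconds)."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            key = (match.group("resource"), int(match.group("timeout")))
            counts[key] += 1
    return counts


# Sample lines copied from the excerpt above.
SAMPLE = [
    "2013-12-27 14:27:37.665: [ CRSEVT] CAAMonitorHandler :: 0:Action Script "
    "/u01/oracle/product/10.2.0.5/crs/bin/racgwrap(check) timed out for "
    "ora.bxgrid.db! (timeout=600)",
    "2013-12-27 14:27:37.665: [ CRSAPP] CheckResource error for ora.bxgrid.db "
    "error code = -2",
    "2013-12-27 14:28:55.955: [ CRSEVT] CAAMonitorHandler :: 0:Action Script "
    "/u01/oracle/product/10.2.0.5/crs/bin/racgwrap(check) timed out for "
    "ora.dqpicc07.vip! (timeout=60)",
]

if __name__ == "__main__":
    for (resource, timeout), count in sorted(summarize_timeouts(SAMPLE).items()):
        print(f"{resource}: {count} timeout(s), check timeout {timeout}s")
```

On a real system the list of lines would come from reading crsd.log itself. In the excerpt, every resource on the node (database, service, GSD, ONS, ASM, VIP) appears in such a tally, which points at the node as a whole rather than at a single resource.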
The logs related to the RAC failure are attached.
2013-12-27 14:07:12.823: [ CRSEVT] CAAMonitorHandler :: 0:Action Script /u01/oracle/product/10.2.0.5/crs/bin/racgwrap(check) timed out for ora.dqpicc07.vip! (timeout=60)
2013-12-27 14:07:12.824: [ CRSAPP] CheckResource error for ora.dqpicc07.vip error code = -2
2013-12-27 14:07:15.314: [ CRSEVT] CAAMonitorHandler :: 0:Could not join /u01/oracle/product/10.2.0.5/bxgrid/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child
2013-12-27 14:07:15.314: [ CRSAPP] CheckResource error for ora.bxgrid.srv_bxgrid.cs error code = -1
2013-12-27 14:07:23.080: [ CRSEVT] CAAMonitorHandler :: 0:Could not join /u01/oracle/product/10.2.0.5/crs/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child
2013-12-27 14:07:23.081: [ CRSAPP] CheckResource error for ora.dqpicc07.gsd error code = -1
2013-12-27 14:07:29.345: [ CRSEVT] CAAMonitorHandler :: 0:Could not join /u01/oracle/product/10.2.0.5/crs/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child
2013-12-27 14:07:29.345: [ CRSAPP] CheckResource error for ora.dqpicc07.ons error code = -1
2013-12-27 14:07:38.486: [ CRSEVT] CAAMonitorHandler :: 0:Could not join /u01/oracle/product/10.2.0.5/bxgrid/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child
2013-12-27 14:07:38.486: [ CRSAPP] CheckResource error for ora.dqpicc07.ASM3.asm error code = -1
2013-12-27 14:08:45.904: [ CRSEVT] CAAMonitorHandler :: 0:Could not join /u01/oracle/product/10.2.0.5/crs/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child
It looks like there is a problem with the heartbeat/VIP. Possibly: "hp-ux: Node Crash Due To Large Amount Of Racgimon Threads or CRS_STAT/SRVCTL COMMAND HANG OS bug ( QX:QXCR1000940361 )" (Doc ID 883801.1)
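RAC/cluster problems are complex. Even a shallow or very basic analysis generally needs the following information:
System status (CPU, memory, I/O, process, and network resource usage)
OS logs
clusterware logs ($GRID_HOME/log)
ASM instance logs
Database logs
listener logs
dqpicc07[/]#ping dqpicc07-vip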
PING dqpicc07-vip: 64 byte packets
64 bytes from 10.65.99.12: icmp_seq=0. time=0. ms
64 bytes from 10.65.99.12: icmp_seq=1. time=0. ms
dqpicc07[/]#ping dqpicc06-vip
PING dqpicc06-vip: 64 byte packets
64 bytes from 10.65.99.11: icmp_seq=0. time=0. ms
64 bytes from 10.65.99.11: icmp_seq=1. time=0. ms
The VIP is fine!
1.
2013-03-17 17:29:08.018: [ CRSEVT] CAAMonitorHandler :: 0:Action Script /u01/oracle/product/10.2.0.5/crs/bin/racgwrap(check) timed out for ora.dqpicc07.gsd! (timeout=600)
2013-03-17 17:29:08.019: [ CRSAPP] CheckResource error for ora.dqpicc07.gsd error code = -2
2013-04-02 07:25:12.566: [ CRSEVT] CAAMonitorHandler :: 0:Could not join /u01/oracle/product/10.2.0.5/crs/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child
2013-04-02 07:25:12.573: [ CRSEVT] CAAMonitorHandler :: 0:Action Script /u01/oracle/product/10.2.0.5/crs/bin/racgwrap(check) timed out for ora.dqpicc07.vip! (timeout=60)
2013-04-02 07:25:12.573: [ CRSAPP] CheckResource error for ora.dqpicc07.vip error code = -2
2013-04-08 13:38:42.215: [ CRSEVT] CAAMonitorHandler :: 0:Could not join /u01/oracle/product/10.2.0.5/crs/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child
This error has been appearing since 2013-03-17.
2. According to the logs, the most recent restart was:
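Finding when an error first appeared can be done mechanically rather than by scrolling through the log. The sketch below is an assumption-laden illustration: `first_error_timestamp`, the timestamp pattern, and the list of error markers are hypothetical names and choices based only on the lines quoted in this thread.

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the crsd.log lines quoted in this thread.
TS_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}): \[\s*(?:CRSEVT|CRSAPP)\s*\]"
)
# Substrings treated as "the error" here; an assumption for illustration.
ERROR_MARKERS = ("timed out for", "CheckResource error", "Could not join")


def first_error_timestamp(lines):
    """Return the earliest timestamp of a matching error line, or None."""
    earliest = None
    for line in lines:
        match = TS_RE.match(line)
        if not match or not any(marker in line for marker in ERROR_MARKERS):
            continue
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
        if earliest is None or stamp < earliest:
            earliest = stamp
    return earliest


# Sample lines copied from the excerpt above (order deliberately shuffled).
SAMPLE_LINES = [
    "2013-04-02 07:25:12.566: [ CRSEVT] CAAMonitorHandler :: 0:Could not join "
    "/u01/oracle/product/10.2.0.5/crs/bin/racgwrap(check)",
    "2013-03-17 17:29:08.018: [ CRSEVT] CAAMonitorHandler :: 0:Action Script "
    "/u01/oracle/product/10.2.0.5/crs/bin/racgwrap(check) timed out for "
    "ora.dqpicc07.gsd! (timeout=600)",
    "2013-08-12 19:40:55.391: [ default] CRS Daemon Starting",
]

if __name__ == "__main__":
    print(first_error_timestamp(SAMPLE_LINES))
```

Comparing that first timestamp with the "CRS Daemon Starting" entries shows whether the errors predate the most recent CRS restart.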
Oracle Database 10g CRS Release 10.2.0.5.0 Production Copyright 1996, 2004, Oracle. All rights reserved
2013-08-12 19:40:55.391: [ default] CRS Daemon Starting
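This problem may be related to OS resources or to CRS software consistency. I suggest restarting CRS first and then observing.
The problem was resolved after the restart, thanks!
swgsw posted on 2014-1-2 11:02:
The problem was resolved after the restart, thanks!
Was it resolved by restarting CRS?
Yes, the fault disappeared after restarting the node.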