swgsw posted on 2013-12-27 15:03:14

Failure on one RAC node

System information for the problem node:

CPU info:
  4 Intel(R) Itanium 2 9100 series processors (1.6 GHz, 24 MB)
          533 MT/s bus, CPU version A1
          8 logical processors (2 per socket)

Memory: 16316 MB (15.93 GB)

OS info:
   Nodename:  dqpicc07
   Release:   HP-UX B.11.31
   Version:   U (unlimited-user license)

Database version: Oracle 10.2.0.5
Number of RAC nodes: 4


$ crs_stat -t
   (no response; the command hangs)


$ crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy

$ ocrcheck
Status of Oracle Cluster Registry is as follows :
         Version                  :          2
         Total space (kbytes)     :    1048300
         Used space (kbytes)      :       8340
         Available space (kbytes) :    1039960
         ID                       :  941242024
         Device/File Name         : /dev/rdisk/disk67
                                    Device/File integrity check succeeded

                                    Device/File not configured

         Cluster registry integrity check succeeded

$ crsctl query css votedisk
0.     0    /dev/rdisk/disk68

located 1 votedisk(s).


crsd.log keeps reporting the following errors:

2013-12-27 14:27:37.665: [  CRSEVT] CAAMonitorHandler :: 0:Action Script /u01/oracle/product/10.2.0.5/crs/bin/racgwrap(check) timed out for ora.bxgrid.db! (timeout=600)
2013-12-27 14:27:37.665: [  CRSAPP] CheckResource error for ora.bxgrid.db error code = -2
2013-12-27 14:27:45.395: [  CRSEVT] CAAMonitorHandler :: 0:Could not join /u01/oracle/product/10.2.0.5/bxgrid/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child

2013-12-27 14:27:45.395: [  CRSEVT] CAAMonitorHandler :: 0:Action Script /u01/oracle/product/10.2.0.5/bxgrid/bin/racgwrap(check) timed out for ora.bxgrid.srv_bxgrid.cs! (timeout=600)
2013-12-27 14:27:45.395: [  CRSAPP] CheckResource error for ora.bxgrid.srv_bxgrid.cs error code = -2
2013-12-27 14:27:53.175: [  CRSEVT] CAAMonitorHandler :: 0:Could not join /u01/oracle/product/10.2.0.5/crs/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child

2013-12-27 14:27:53.175: [  CRSEVT] CAAMonitorHandler :: 0:Action Script /u01/oracle/product/10.2.0.5/crs/bin/racgwrap(check) timed out for ora.dqpicc07.gsd! (timeout=600)
2013-12-27 14:27:53.175: [  CRSAPP] CheckResource error for ora.dqpicc07.gsd error code = -2
2013-12-27 14:27:59.425: [  CRSEVT] CAAMonitorHandler :: 0:Could not join /u01/oracle/product/10.2.0.5/crs/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child

2013-12-27 14:27:59.425: [  CRSEVT] CAAMonitorHandler :: 0:Action Script /u01/oracle/product/10.2.0.5/crs/bin/racgwrap(check) timed out for ora.dqpicc07.ons! (timeout=600)
2013-12-27 14:27:59.425: [  CRSAPP] CheckResource error for ora.dqpicc07.ons error code = -2
2013-12-27 14:28:08.575: [  CRSEVT] CAAMonitorHandler :: 0:Could not join /u01/oracle/product/10.2.0.5/bxgrid/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child

2013-12-27 14:28:08.575: [  CRSEVT] CAAMonitorHandler :: 0:Action Script /u01/oracle/product/10.2.0.5/bxgrid/bin/racgwrap(check) timed out for ora.dqpicc07.ASM3.asm! (timeout=600)
2013-12-27 14:28:08.576: [  CRSAPP] CheckResource error for ora.dqpicc07.ASM3.asm error code = -2
2013-12-27 14:28:55.955: [  CRSEVT] CAAMonitorHandler :: 0:Could not join /u01/oracle/product/10.2.0.5/crs/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child

2013-12-27 14:28:55.955: [  CRSEVT] CAAMonitorHandler :: 0:Action Script /u01/oracle/product/10.2.0.5/crs/bin/racgwrap(check) timed out for ora.dqpicc07.vip! (timeout=60)
2013-12-27 14:28:55.955: [  CRSAPP] CheckResource error for ora.dqpicc07.vip error code = -2
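
With a long crsd.log, a per-resource tally of the timed-out check actions can be easier to read than the raw lines. Below is a minimal sketch, assuming the line format shown in the excerpt above; the names `TIMEOUT_RE` and `count_timeouts` are made up for this illustration and are not part of any Oracle tool.

```python
import re
from collections import Counter

# Assumed format, copied from the crsd.log excerpt above:
# <timestamp>: [  CRSEVT] CAAMonitorHandler :: 0:Action Script <path>(check)
#   timed out for <resource>! (timeout=<seconds>)
TIMEOUT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}): "
    r"\[\s*CRSEVT\] .*Action Script (?P<script>\S+)\((?P<action>\w+)\) "
    r"timed out for (?P<resource>\S+)! \(timeout=(?P<timeout>\d+)\)"
)


def count_timeouts(lines):
    """Count timed-out action scripts per (resource, timeout) pair."""
    counts = Counter()
    for line in lines:
        m = TIMEOUT_RE.match(line)
        if m:  # lines in any other format are simply skipped
            counts[(m.group("resource"), int(m.group("timeout")))] += 1
    return counts


# Three lines taken from the excerpt above.
sample = [
    "2013-12-27 14:27:37.665: [  CRSEVT] CAAMonitorHandler :: 0:Action Script "
    "/u01/oracle/product/10.2.0.5/crs/bin/racgwrap(check) timed out for "
    "ora.bxgrid.db! (timeout=600)",
    "2013-12-27 14:27:37.665: [  CRSAPP] CheckResource error for "
    "ora.bxgrid.db error code = -2",
    "2013-12-27 14:28:55.955: [  CRSEVT] CAAMonitorHandler :: 0:Action Script "
    "/u01/oracle/product/10.2.0.5/crs/bin/racgwrap(check) timed out for "
    "ora.dqpicc07.vip! (timeout=60)",
]

for (resource, timeout), n in sorted(count_timeouts(sample).items()):
    print(resource, timeout, n)
```

To run it against a real file, pass an open file handle instead of `sample`, for example `count_timeouts(open("crsd.log"))`.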

swgsw posted on 2013-12-27 15:06:36

The logs related to the RAC failure are attached.

donyorcl001 posted on 2013-12-27 15:19:58

2013-12-27 14:07:12.823: [  CRSEVT] CAAMonitorHandler :: 0:Action Script /u01/oracle/product/10.2.0.5/crs/bin/racgwrap(check) timed out for ora.dqpicc07.vip! (timeout=60)
2013-12-27 14:07:12.824: [  CRSAPP] CheckResource error for ora.dqpicc07.vip error code = -2
2013-12-27 14:07:15.314: [  CRSEVT] CAAMonitorHandler :: 0:Could not join /u01/oracle/product/10.2.0.5/bxgrid/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child

2013-12-27 14:07:15.314: [  CRSAPP] CheckResource error for ora.bxgrid.srv_bxgrid.cs error code = -1
2013-12-27 14:07:23.080: [  CRSEVT] CAAMonitorHandler :: 0:Could not join /u01/oracle/product/10.2.0.5/crs/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child

2013-12-27 14:07:23.081: [  CRSAPP] CheckResource error for ora.dqpicc07.gsd error code = -1
2013-12-27 14:07:29.345: [  CRSEVT] CAAMonitorHandler :: 0:Could not join /u01/oracle/product/10.2.0.5/crs/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child

2013-12-27 14:07:29.345: [  CRSAPP] CheckResource error for ora.dqpicc07.ons error code = -1
2013-12-27 14:07:38.486: [  CRSEVT] CAAMonitorHandler :: 0:Could not join /u01/oracle/product/10.2.0.5/bxgrid/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child

2013-12-27 14:07:38.486: [  CRSAPP] CheckResource error for ora.dqpicc07.ASM3.asm error code = -1
2013-12-27 14:08:45.904: [  CRSEVT] CAAMonitorHandler :: 0:Could not join /u01/oracle/product/10.2.0.5/crs/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child


It looks like there is a problem with the heartbeat/VIP.

kevinlin.ora posted on 2013-12-27 15:28:43

Maybe: HP-UX: Node Crash Due To Large Amount Of Racgimon Threads or CRS_STAT/SRVCTL COMMAND HANG OS bug ( QX:QXCR1000940361 ) (Doc ID 883801.1)

RAC/cluster problems are complex. Even a shallow or very basic analysis usually needs the following information:
System state (CPU, memory, I/O, process, and network resource usage)
OS logs
clusterware logs ($GRID_HOME/log)
ASM instance logs
Database logs
listener logs

swgsw posted on 2013-12-27 15:29:28

dqpicc07[/]#ping dqpicc07-vip
PING dqpicc07-vip: 64 byte packets
64 bytes from 10.65.99.12: icmp_seq=0. time=0. ms
64 bytes from 10.65.99.12: icmp_seq=1. time=0. ms

dqpicc07[/]#ping dqpicc06-vip
PING dqpicc06-vip: 64 byte packets
64 bytes from 10.65.99.11: icmp_seq=0. time=0. ms
64 bytes from 10.65.99.11: icmp_seq=1. time=0. ms

The VIP is fine!

Liu Maclean(刘相兵) posted on 2013-12-27 17:45:50

1.
2013-03-17 17:29:08.018: [  CRSEVT] CAAMonitorHandler :: 0:Action Script /u01/oracle/product/10.2.0.5/crs/bin/racgwrap(check) timed out for ora.dqpicc07.gsd! (timeout=600)
2013-03-17 17:29:08.019: [  CRSAPP] CheckResource error for ora.dqpicc07.gsd error code = -2
2013-04-02 07:25:12.566: [  CRSEVT] CAAMonitorHandler :: 0:Could not join /u01/oracle/product/10.2.0.5/crs/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child

2013-04-02 07:25:12.573: [  CRSEVT] CAAMonitorHandler :: 0:Action Script /u01/oracle/product/10.2.0.5/crs/bin/racgwrap(check) timed out for ora.dqpicc07.vip! (timeout=60)
2013-04-02 07:25:12.573: [  CRSAPP] CheckResource error for ora.dqpicc07.vip error code = -2
2013-04-08 13:38:42.215: [  CRSEVT] CAAMonitorHandler :: 0:Could not join /u01/oracle/product/10.2.0.5/crs/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child

This error has been appearing since 2013-03-17.
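
One way to find when an error first appeared in crsd.log is to scan for the earliest timestamped line containing the message. Below is a minimal sketch, assuming the timestamp format seen in the excerpts in this thread; `TS_RE`, `first_seen`, and the default search string are hypothetical names chosen for this example.

```python
import re
from datetime import datetime

# Assumed line layout: "<YYYY-MM-DD HH:MM:SS.mmm>: [  MODULE] message"
TS_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}): \[\s*(\w+)\] (.*)$"
)


def first_seen(lines, needle="timed out for"):
    """Return the earliest timestamp of a line whose message contains needle."""
    earliest = None
    for line in lines:
        m = TS_RE.match(line)
        if m and needle in m.group(3):
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
            if earliest is None or ts < earliest:
                earliest = ts
    return earliest  # None if the message never occurs


# Lines taken from the excerpt above; continuation lines without a
# timestamp (such as "category: 1234, ...") are ignored.
sample = [
    "2013-04-02 07:25:12.573: [  CRSEVT] CAAMonitorHandler :: 0:Action Script "
    "/u01/oracle/product/10.2.0.5/crs/bin/racgwrap(check) timed out for "
    "ora.dqpicc07.vip! (timeout=60)",
    "category: 1234, operation: scls_process_join, loc: childcrash, "
    "OS error: 0, other: Abnormal termination of the child",
    "2013-03-17 17:29:08.018: [  CRSEVT] CAAMonitorHandler :: 0:Action Script "
    "/u01/oracle/product/10.2.0.5/crs/bin/racgwrap(check) timed out for "
    "ora.dqpicc07.gsd! (timeout=600)",
]

print(first_seen(sample))
```

The function keeps the minimum rather than returning the first match, so it still works if several log files are concatenated out of order.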


2. According to the log, the most recent restart was:

Oracle Database 10g CRS Release 10.2.0.5.0 Production Copyright 1996, 2004, Oracle.  All rights reserved
2013-08-12 19:40:55.391: [ default] CRS Daemon Starting

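This problem may be related to OS resources or to the consistency of the CRS software. I suggest restarting CRS first and then observing.

swgsw posted on 2014-1-2 11:02:47

The failure was resolved after the restart. Thanks!

licharles posted on 2014-1-3 09:21:19

swgsw posted on 2014-1-2 11:02
The failure was resolved after the restart. Thanks!

Was it resolved by restarting CRS?

swgsw posted on 2014-1-3 10:07:54

Yes, after restarting the node the failure went away.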