yehc@epsoft.com posted on 2014-9-30 11:21:13

CRS fails to start on a node

Environment: AIX 6100-04
Nodes: 2
Database: Oracle 10g RAC + ASM 10.2.0.4

Problem description:

1) The HBA card on RAC node 1 failed (single card), which stopped the CRS services on node 1.


From cssd.log:
At 12:48 on the 28th, node 1 lost its storage.
……
[    CSSD]2014-09-28 12:48:03.014 >ERROR:   Internal Error Information:
  Category: 1234
  Operation: scls_block_read
  Location: fread_failed
  Other: fread unable to read buffer
  Dep: 19

[    CSSD]2014-09-28 12:48:03.014 >ERROR:   clssnmvReadBlocks: read failed 1 at offset 658 of /dev/rhdisk3
[    CSSD]2014-09-28 12:48:03.014 >TRACE:   clssnmDiskStateChange: state from 4 to 3 disk (0//dev/rhdisk3)
[    CSSD]2014-09-28 12:48:03.014 >TRACE:   clssnmDiskPMT: disk offline (0//dev/rhdisk3)
[    CSSD]2014-09-28 12:48:03.123 >TRACE:   clssgmDeadProc: proc 111f53e70
[    CSSD]2014-09-28 12:48:03.123 >TRACE:   clssgmUnregisterClient(): removing proc 16 client 1, flags 0x08000000
[    CSSD]2014-09-28 12:48:03.123 >TRACE:   clssgmUnregisterClient: member ref count now 9
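The "disk offline" trace lines in an excerpt like this can be collected per device. A minimal sketch, assuming the log keeps the `clssnmDiskPMT: disk offline (N//dev/...)` layout shown above; `offline_disks` is a hypothetical helper name, not an Oracle tool:

```shell
# offline_disks LOGFILE
# Prints each device path that CSSD reported as "disk offline", once per device.
# Assumes the trace format seen in this thread: "(<index>/<device path>)".
offline_disks() {
    sed -n 's|.*clssnmDiskPMT: disk offline ([0-9]*/\(/[^)]*\)).*|\1|p' "$1" |
        sort -u
}

# Example (the log path is a placeholder):
# offline_disks /path/to/cssd.log
```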
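……

2) On the 29th, after the HBA card was replaced, the storage LUNs were re-mapped to node 1 and the hardware fault was resolved. With the HBA repaired, node 1 could see the storage again.
Starting CRS before fixing the disk permissions and ownership produced a permission error: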

/tmp/crsctl.xxxx
2014-09-29 16:50:21.100: [  OCROSD]utopen:7:failed to open OCR file/disk /dev/rhdisk2 /dev/rhdisk6, errno=13, os err string=Permission denied

3) Also corrected the permissions and group ownership of the disks on node 1:
jxsmdb1->id oracle
uid=203(oracle) gid=202(oinstall) groups=201(dba)
Set permissions:
#for i in 2 3 4 5 6 7 8 9 10 11 12 13 14 15
>do
>chmod 660 /dev/rhdisk$i
>done
OCR disks:
#chown root:oinstall /dev/rhdisk2 /dev/rhdisk6
Set owner and group:
#for i in 3 4 5 7 8 9 10 11 12 13 14 15
>do
>chown oracle:oinstall /dev/rhdisk$i
>done
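After changes like these, the result can be verified without modifying anything. A minimal sketch that parses `ls -ld` output; `check_dev` is a hypothetical helper, and the expected values (mode 660, `root:oinstall` for the OCR disks, `oracle:oinstall` for the rest) are simply the ones used in the commands above:

```shell
# check_dev PATH OWNER GROUP
# Prints "ok PATH" when PATH has mode rw-rw---- (660) and the given owner and
# group, otherwise "BAD PATH ...". Read-only: it only inspects ls output.
check_dev() {
    ls -ld "$1" 2>/dev/null | awk -v p="$1" -v o="$2" -v g="$3" '
        { found = 1
          perms = substr($1, 2, 9)      # skip the file-type character
          if (perms == "rw-rw----" && $3 == o && $4 == g)
              print "ok " p
          else
              print "BAD " p " " perms " " $3 ":" $4 }
        END { if (!found) print "BAD " p " missing" }'
}

# Example, mirroring the loops above:
# for i in 2 6; do check_dev /dev/rhdisk$i root oinstall; done
# for i in 3 4 5 7 8 9 10 11 12 13 14 15; do check_dev /dev/rhdisk$i oracle oinstall; done
```

Running the same check on node 2 gives a reference to compare against, since that node's CRS stack is healthy.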
Tried to start CRS on node 1:
jxsmdb1->crsctl start crs
Attempting to start CRS stack
The CRS stack will be started shortly

From crsd.log:
……
[ CSSCLNT]clsssInitNative: connect failed, rc 9

[  CRSRTI]32CSS is not ready. Received status 3 from CSS. Waiting for good status ..
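……

4) Checked the related material, MOS 726925.1.
Cleared the corresponding socket files under /var/tmp/.oracle and rebooted the host; the problem persisted.
Note: this step has no necessary connection to the CRS startup failure; it was a last-ditch attempt.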

Maclean Liu (刘相兵) posted on 2014-9-30 13:27:02

[    CSSD]2014-08-27 22:25:41.283 >USER:    Copyright 2014, Oracle version 10.2.0.4.0
[    CSSD]2014-08-27 22:25:41.283 >USER:    CSS daemon log for node jxsmdb1, number 1, in cluster crs


Judging from the log, the most recent CRS startup was on 2014-08-27; there are no entries for any later CSS restart.

  Category: 1234
  Operation: scls_block_write
  Location: fwrite_faile
  Other: fwrite unable to write buffer
  Dep: 19

[    CSSD]2014-09-28 12:48:04.244 >ERROR:   clssnmvWriteBlocks: write failed 1 at offset 146 of /dev/rhdisk5
[    CSSD]2014-09-28 12:48:04.244 >TRACE:   clssnmDiskStateChange: state from 4 to 3 disk (2//dev/rhdisk5)
[    CSSD]2014-09-28 12:48:04.244 >TRACE:   clssnmDiskPMT: disk offline (0//dev/rhdisk3)
[    CSSD]2014-09-28 12:48:04.244 >TRACE:   clssnmDiskPMT: disk offline (2//dev/rhdisk5)
[    CSSD]2014-09-28 12:48:04.244 >ERROR:   clssnmDiskPMT: Aborting, 2 of 3 voting disks unavailable
[    CSSD]2014-09-28 12:48:04.244 >ERROR:   ###################################
[    CSSD]2014-09-28 12:48:04.244 >ERROR:   clssscExit: CSSD aborting
[    CSSD]2014-09-28 12:48:04.244 >ERROR:   ###################################
[    CSSD]--- DUMP GROCK STATE DB ---
[    CSSD]----------
[    CSSD]  type 2, Id 4, Name = (crs_version)
[    CSSD]  flags: 0x0
[    CSSD]  grant: count=0, type 0, wait 0
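The abort message is consistent with the rule that CSS must be able to access a strict majority of the configured voting disks. A minimal sketch of that arithmetic; `has_votedisk_majority` is a hypothetical helper for illustration and does not query the cluster:

```shell
# has_votedisk_majority TOTAL UNAVAILABLE
# Returns 0 when the accessible voting disks form a strict majority of TOTAL.
has_votedisk_majority() {
    total=$1
    unavailable=$2
    available=$((total - unavailable))
    # Strict majority: available > total / 2, written without division.
    [ $((available * 2)) -gt "$total" ]
}

# The case in the log above: 3 voting disks, 2 of them offline.
if has_votedisk_majority 3 2; then
    echo "majority held"
else
    echo "majority lost"
fi
```

With three voting disks the node tolerates losing one; losing rhdisk3 and rhdisk5 together leaves one of three, which is why CSSD aborted.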



[    CSSD]2014-09-28 12:48:04.244 >ERROR:   clssnmvWriteBlocks: write failed 1 at offset 146 of /dev/rhdisk5


Are you sure /dev/rhdisk5 is healthy?

yehc@epsoft.com posted on 2014-9-30 14:15:20

Last edited by yehc@epsoft.com on 2014-9-30 14:20

Maclean Liu (刘相兵) posted on 2014-9-30 13:27
[    CSSD]2014-08-27 22:25:41.283 >USER:    Copyright 2014, Oracle version 10.2.0.4.0
[    CSSD]2014 ...

[    CSSD]2014-09-28 12:48:04.244 >ERROR:   clssnmvWriteBlocks: write failed 1 at offset 146 of /dev/rhdisk5

At 2014-09-28 12:48:04.244, node 1 indeed could not see the storage because of the HBA failure. Node 1 can now see all of the storage, and dd tests on all the disks pass.
jxsmdb1->id
uid=203(oracle) gid=202(oinstall) groups=201(dba)
jxsmdb1->date
Tue Sep 30 14:09:46 BEIST 2014
jxsmdb1->dd if=/dev/rhdisk5 of=/dev/null bs=1024 count=1000
1000+0 records in.
1000+0 records out.
jxsmdb1->dd if=/dev/rhdisk5 of=/soft/rhdisk5.ts bs=1024 count=1000
1000+0 records in.
1000+0 records out.
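The same read test can be repeated across several devices in one pass. A minimal sketch, with `read_test` as a hypothetical helper name; it uses the same `dd` parameters as the manual commands above and discards the data:

```shell
# read_test DEVICE...
# Reads up to the first 1000 KiB of each device with dd and reports success
# or failure per device. Read-only: the data is written to /dev/null.
read_test() {
    rc=0
    for dev in "$@"; do
        if dd if="$dev" of=/dev/null bs=1024 count=1000 >/dev/null 2>&1; then
            echo "read ok: $dev"
        else
            echo "read FAILED: $dev"
            rc=1
        fi
    done
    return $rc
}

# Example, using device names that appear in this thread:
# read_test /dev/rhdisk2 /dev/rhdisk3 /dev/rhdisk4 /dev/rhdisk5 /dev/rhdisk6
```

This only exercises reads. The error logged for /dev/rhdisk5 was a write failure, which a read test does not rule out.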
jxsmdb1->strings /soft/rhdisk5.ts |more

?z{|}
cLssTock
clSs0pEr
CLSf
Vote
jxsmdb1
孝倦
Sd
Vote
jxsmdb2
T*IN

      CLSf
jxsmdb1->crsctl query css votedisk
0.     0    /dev/rhdisk3
1.     0    /dev/rhdisk4
2.     0    /dev/rhdisk5

located 3 votedisk(s).
jxsmdb1->ocrcheck
Status of Oracle Cluster Registry is as follows :
         Version                  :          2
         Total space (kbytes)     :    2096812
         Used space (kbytes)      :       3868
         Available space (kbytes) :    2092944
         ID                       : 1005635852
         Device/File Name         : /dev/rhdisk2
                                    Device/File integrity check succeeded
         Device/File Name         : /dev/rhdisk6
                                    Device/File integrity check succeeded

         Cluster registry integrity check succeeded

CRS process comparison between the two nodes:
Node 1 (CRS failed):
jxsmdb1->ps -ef |grep crs
  oracle 151572 250478   0 14:12:14      -  0:00 /oracle/product/10.2.0/crs/bin/oclsomon.bin
  oracle 184320 180244   0 14:10:57      -  0:00 /oracle/product/10.2.0/crs/bin/evmd.bin
  oracle 123214 127028   0 14:15:16  pts/0  0:00 grep crs
  oracle 250478 115310   0 14:12:14      -  0:00 /bin/sh -c cd /oracle/product/10.2.0/crs/log/jxsmdb1/cssd/oclsomon; ulimit -c unlimited; /oracle/product/10.2.0/crs/bin/oclsomon  || exit $?
    root  62308  86972   0 14:10:34      -  0:00 /oracle/product/10.2.0/crs/bin/crsd.bin restart
    root  86972      1   0 14:10:34      -  0:00 /bin/sh /etc/init.crsd run

Node 2 (CRS healthy):
jxsmdb2->ps -ef |grep crs
  oracle 118884 147588   0   Mar 01      - 518:21 /oracle/product/10.2.0/crs/bin/ocssd.bin
  oracle 147588  33030   0   Mar 01      -  0:00 /bin/sh -c ulimit -c unlimited; cd /oracle/product/10.2.0/crs/log/jxsmdb2/cssd; /oracle/product/10.2.0/crs/bin/ocssd  || exit $?
  oracle 237584      1   0   Mar 02      -  0:00 /oracle/product/10.2.0/crs/opmn/bin/ons -d
    root 340096 242426   2 17:19:49      - 15:53 /oracle/product/10.2.0/crs/bin/crsd.bin restart
  oracle 106862 139442   0   Mar 01      - 36:00 /oracle/product/10.2.0/crs/bin/evmd.bin
  oracle 123380  95080   0   Mar 01      - 43:54 /oracle/product/10.2.0/crs/bin/oclsomon.bin
  oracle 143642 237584   0   Mar 02      -  6:04 /oracle/product/10.2.0/crs/opmn/bin/ons -d
    root 242426      1   0 17:18:47      -  0:00 /bin/sh /etc/init.crsd run
  oracle  95080  98742   0   Mar 01      -  0:00 /bin/sh -c cd /oracle/product/10.2.0/crs/log/jxsmdb2/cssd/oclsomon; ulimit -c unlimited; /oracle/product/10.2.0/crs/bin/oclsomon  || exit $?
  oracle 119712 106862   0   Mar 01      -  2:01 /oracle/product/10.2.0/crs/bin/evmlogger.bin -o /oracle/product/10.2.0/crs/evm/log/evmlogger.info -l /oracle/product/10.2.0/crs/evm/log/evmlogger.log
    root 123668 123578   0   Mar 01      -  9:17 /oracle/product/10.2.0/crs/bin/oprocd.bin run -t 1000 -m 500 -f
  oracle 340922 418496   0 14:16:20  pts/2  0:00 grep crs
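The difference between the two listings can be reduced to a list of missing daemons. A minimal sketch; `missing_crs_daemons` is a hypothetical helper, and the daemon list is an assumption taken from the node 2 output above, not an official inventory:

```shell
# missing_crs_daemons
# Reads `ps -ef` output on stdin and prints each daemon from the list below
# that does not appear in it, one per line.
missing_crs_daemons() {
    ps_out=$(cat)
    for d in ocssd.bin crsd.bin evmd.bin oclsomon.bin oprocd.bin; do
        case $ps_out in
            *"/bin/$d"*) ;;          # found, nothing to report
            *) echo "$d" ;;
        esac
    done
}

# Example:
# ps -ef | missing_crs_daemons
```

Applied to the node 1 listing above, this reports ocssd.bin and oprocd.bin as absent.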

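Maclean Liu (刘相兵) posted on 2014-10-1 21:25:30

yehc@epsoft.com posted on 2014-9-30 14:15
At 2014-09-28 12:48:04.244, node 1 indeed could not see the storage because of the HBA failure. Node 1 can now see all of the storage, and ...

Why are there no later entries in the cssd.log you provided?

Two possibilities:

1. The log you collected is not the right one
2. cssd.bin cannot start properly and cannot even write any log output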
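One way to tell the two cases apart is to look for the most recent CSSD startup banner in the log and compare it with the time of the last start attempt. A minimal sketch, assuming the banner line format quoted earlier in the thread ("CSS daemon log for node ..."); `last_cssd_start` is a hypothetical helper name:

```shell
# last_cssd_start LOGFILE
# Prints the date and time of the last CSSD startup banner found in LOGFILE,
# or nothing if the log contains no banner.
last_cssd_start() {
    grep 'CSS daemon log for node' "$1" | tail -1 |
        sed 's/^\[ *CSSD\]\([0-9-]* [0-9:.]*\).*/\1/'
}

# Example (the log path is a placeholder):
# last_cssd_start /path/to/cssd.log
```

If the timestamp printed is older than the latest `crsctl start crs` attempt, the daemon did not get far enough to write a new banner.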

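yehc@epsoft.com posted on 2014-10-8 09:07:55

Maclean Liu (刘相兵) posted on 2014-10-1 21:25
Why are there no later entries in the cssd.log you provided?

Two possibilities:

cssd.bin cannot start properly and cannot even write any log output

cssd failing to start properly is why the log had no further output. The problem is resolved for now.
Procedure:

On node 1, manually ran /oracle/product/10.2.0/crs/bin/oprocd.bin run -t 1000 -m 500 -f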

