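CRS fails to start on one node
Environment: AIX 6100-04
Number of nodes: 2
Database: Oracle 10g RAC + ASM 10.2.0.4
Problem description:
1) The HBA card on RAC node 1 failed (single card), which stopped the CRS service on node 1.
From cssd.log:
At 12:48 on the 28th, node 1 lost access to the storage.
……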
[ CSSD]2014-09-28 12:48:03.014 >ERROR: Internal Error Information:
Category: 1234
Operation: scls_block_read
Location: fread_failed
Other: fread unable to read buffer
Dep: 19
[ CSSD]2014-09-28 12:48:03.014 >ERROR: clssnmvReadBlocks: read failed 1 at offset 658 of /dev/rhdisk3
[ CSSD]2014-09-28 12:48:03.014 >TRACE: clssnmDiskStateChange: state from 4 to 3 disk (0//dev/rhdisk3)
[ CSSD]2014-09-28 12:48:03.014 >TRACE: clssnmDiskPMT: disk offline (0//dev/rhdisk3)
[ CSSD]2014-09-28 12:48:03.123 >TRACE: clssgmDeadProc: proc 111f53e70
[ CSSD]2014-09-28 12:48:03.123 >TRACE: clssgmUnregisterClient(): removing proc 16 client 1, flags 0x08000000
[ CSSD]2014-09-28 12:48:03.123 >TRACE: clssgmUnregisterClient: member ref count now 9
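……
2) On the 29th, after the HBA card was replaced, the storage LUNs were remapped to host 1 and the hardware fault was resolved. With the HBA fixed, node 1 could see the storage again.
Starting CRS before the disk permissions and ownership were corrected failed with a permission error:
/tmp/crsctl.xxxx
2014-09-29 16:50:21.100: [ OCROSD]utopen:7:failed to open OCR file/disk /dev/rhdisk2 /dev/rhdisk6, errno=13, os err string=Permission denied
3) The disk permissions and ownership on node 1 were then corrected:
jxsmdb1->id oracle
uid=203(oracle) gid=202(oinstall) groups=201(dba)
Set the permissions:
#for i in 2 3 4 5 6 7 8 9 10 11 12 13 14 15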
>do
>chmod 660 /dev/rhdisk$i
>done
OCR disks:
#chown root:oinstall /dev/rhdisk2 /dev/rhdisk6
Set the ownership of the remaining disks:
#for i in 3 4 5 7 8 9 10 11 12 13 14 15
>do
>chown oracle:oinstall /dev/rhdisk$i
>done
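The result of the chmod/chown loops above can be checked before CRS is started instead of being assumed. The following is only a minimal sketch and is not part of the original post: `check_dev` is a hypothetical helper that compares the permission string, owner and group reported by `ls -ld` with the expected values.

```shell
#!/bin/sh
# check_dev PATH PERMS OWNER GROUP   (hypothetical helper, not an Oracle or AIX tool)
# PERMS is the 9-character permission string, e.g. rw-rw---- for mode 660.
# Prints "ok PATH" and returns 0 on a match, otherwise prints a mismatch and returns 1.
check_dev() {
    path=$1; want="$2 $3 $4"
    # ls -ld fields: $1 = type + permissions, $3 = owner, $4 = group.
    # substr skips the leading file-type character and any trailing ACL marker.
    got=$(ls -ld "$path" 2>/dev/null | awk '{print substr($1, 2, 9), $3, $4}')
    if [ "$got" = "$want" ]; then
        echo "ok $path"
    else
        echo "MISMATCH $path: got '$got', want '$want'"
        return 1
    fi
}
```

With the values used in this thread, the OCR disks would be checked with `check_dev /dev/rhdisk2 rw-rw---- root oinstall` (and the same for /dev/rhdisk6), and the remaining disks with `oracle oinstall` as owner and group.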
Attempt to start CRS on node 1:
jxsmdb1->crsctl start crs
Attempting to start CRS stack
The CRS stack will be started shortly
From crsd.log:
……
[ CSSCLNT]clsssInitNative: connect failed, rc 9
[ CRSRTI]32CSS is not ready. Received status 3 from CSS. Waiting for good status ..
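……
4) Consulted the relevant material, MOS note 726925.1.
Cleared the corresponding socket files under /var/tmp/.oracle and rebooted the host; the problem persisted.
Note: this step has no necessary connection to the CRS startup failure. It was a last-resort attempt.

[ CSSD]2014-08-27 22:25:41.283 >USER: Copyright 2014, Oracle version 10.2.0.4.0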
[ CSSD]2014-08-27 22:25:41.283 >USER: CSS daemon log for node jxsmdb1, number 1, in cluster crs
Judging from the log, the most recent CRS start was on 2014-8-27. There are no log entries for any later CSS restart.
Category: 1234
Operation: scls_block_write
Location: fwrite_faile
Other: fwrite unable to write buffer
Dep: 19
[ CSSD]2014-09-28 12:48:04.244 >ERROR: clssnmvWriteBlocks: write failed 1 at offset 146 of /dev/rhdisk5
[ CSSD]2014-09-28 12:48:04.244 >TRACE: clssnmDiskStateChange: state from 4 to 3 disk (2//dev/rhdisk5)
[ CSSD]2014-09-28 12:48:04.244 >TRACE: clssnmDiskPMT: disk offline (0//dev/rhdisk3)
[ CSSD]2014-09-28 12:48:04.244 >TRACE: clssnmDiskPMT: disk offline (2//dev/rhdisk5)
[ CSSD]2014-09-28 12:48:04.244 >ERROR: clssnmDiskPMT: Aborting, 2 of 3 voting disks unavailable
[ CSSD]2014-09-28 12:48:04.244 >ERROR: ###################################
[ CSSD]2014-09-28 12:48:04.244 >ERROR: clssscExit: CSSD aborting
[ CSSD]2014-09-28 12:48:04.244 >ERROR: ###################################
[ CSSD]--- DUMP GROCK STATE DB ---
[ CSSD]----------
[ CSSD] type 2, Id 4, Name = (crs_version)
[ CSSD] flags: 0x0
[ CSSD] grant: count=0, type 0, wait 0
[ CSSD]2014-09-28 12:48:04.244 >ERROR: clssnmvWriteBlocks: write failed 1 at offset 146 of /dev/rhdisk5
Are you sure /dev/rhdisk5 is healthy?
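The abort in the log above ("Aborting, 2 of 3 voting disks unavailable") is consistent with the CSS rule that a node must be able to access more than half of the configured voting disks. A minimal sketch of that arithmetic follows. The helper names `quorum_ok` and `count_offline` are hypothetical, and counting "disk offline" lines is a simplification that ignores disks that come back online later in the log.

```shell
#!/bin/sh
# quorum_ok ONLINE TOTAL   (hypothetical helper)
# Returns 0 only when strictly more than half of the voting disks are online.
quorum_ok() {
    [ $(( $1 * 2 )) -gt "$2" ]
}

# count_offline LOGFILE   (hypothetical helper)
# Counts the distinct disks named in "clssnmDiskPMT: disk offline (...)" lines.
count_offline() {
    sed -n 's/.*clssnmDiskPMT: disk offline (\([^)]*\)).*/\1/p' "$1" |
        sort -u | wc -l | tr -d ' '
}

# Numbers taken from the log in this thread: 3 voting disks, 2 offline.
total=3
offline=2
if quorum_ok $((total - offline)) "$total"; then
    echo "quorum held"
else
    echo "quorum lost"   # this branch runs: 1 of 3 is not a majority
fi
```

On a saved copy of the log, `count_offline cssd.log` would supply the offline count instead of the hard-coded value.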
Last edited by yehc@epsoft.com on 2014-9-30 14:20
Maclean Liu (刘相兵) posted on 2014-9-30 13:27:
[ CSSD]2014-08-27 22:25:41.283 >USER: Copyright 2014, Oracle version 10.2.0.4.0
[ CSSD]2014 ...
[ CSSD]2014-09-28 12:48:04.244 >ERROR: clssnmvWriteBlocks: write failed 1 at offset 146 of /dev/rhdisk5
At 2014-09-28 12:48:04.244, node 1 indeed could not see the storage because of the HBA failure. Now, however, node 1 can see all of the storage, and dd tests show that all of the disks are normal.
jxsmdb1->id
uid=203(oracle) gid=202(oinstall) groups=201(dba)
jxsmdb1->date
Tue Sep 30 14:09:46 BEIST 2014
jxsmdb1->dd if=/dev/rhdisk5 of=/dev/null bs=1024 count=1000
1000+0 records in.
1000+0 records out.
jxsmdb1->dd if=/dev/rhdisk5 of=/soft/rhdisk5.ts bs=1024 count=1000
1000+0 records in.
1000+0 records out.
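The dd read test above covers a single disk. To repeat it across every shared disk, something like the following could be used. This is a sketch only: `read_test` is a hypothetical helper, and it exercises reads only. It says nothing about writes (the original failure also included a write error on /dev/rhdisk5), and it should not be extended to write to voting or OCR disks.

```shell
#!/bin/sh
# read_test DEVICE [BLOCKS]   (hypothetical helper)
# Reads BLOCKS blocks of 1024 bytes from DEVICE with dd and discards them.
# BLOCKS defaults to 1000, the count used in the thread.
read_test() {
    dev=$1
    blocks=${2:-1000}
    if dd if="$dev" of=/dev/null bs=1024 count="$blocks" 2>/dev/null; then
        echo "read ok: $dev"
    else
        echo "read FAILED: $dev"
        return 1
    fi
}
```

For the disks listed in this thread it could be called as `for i in 2 3 4 5 6 7 8 9 10 11 12 13 14 15; do read_test /dev/rhdisk$i; done`.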
jxsmdb1->strings /soft/rhdisk5.ts |more
?z{|}
cLssTock
clSs0pEr
CLSf
Vote
jxsmdb1
孝倦
Sd
Vote
jxsmdb2
T*IN
CLSf
jxsmdb1->crsctl query css votedisk
0. 0 /dev/rhdisk3
1. 0 /dev/rhdisk4
2. 0 /dev/rhdisk5
located 3 votedisk(s).
jxsmdb1->ocrcheck
Status of Oracle Cluster Registry is as follows :
Version : 2
Total space (kbytes) : 2096812
Used space (kbytes) : 3868
Available space (kbytes) : 2092944
ID : 1005635852
Device/File Name : /dev/rhdisk2
Device/File integrity check succeeded
Device/File Name : /dev/rhdisk6
Device/File integrity check succeeded
Cluster registry integrity check succeeded

Comparison of CRS processes on the two nodes:
Node 1 (CRS fails to start):
jxsmdb1->ps -ef |grep crs
oracle 151572 250478 0 14:12:14 - 0:00 /oracle/product/10.2.0/crs/bin/oclsomon.bin
oracle 184320 180244 0 14:10:57 - 0:00 /oracle/product/10.2.0/crs/bin/evmd.bin
oracle 123214 127028 0 14:15:16 pts/0 0:00 grep crs
oracle 250478 115310 0 14:12:14 - 0:00 /bin/sh -c cd /oracle/product/10.2.0/crs/log/jxsmdb1/cssd/oclsomon; ulimit -c unlimited; /oracle/product/10.2.0/crs/bin/oclsomon || exit $?
root 62308 86972 0 14:10:34 - 0:00 /oracle/product/10.2.0/crs/bin/crsd.bin restart
root 86972 1 0 14:10:34 - 0:00 /bin/sh /etc/init.crsd run

Node 2 (CRS running normally):
jxsmdb2->ps -ef |grep crs
oracle 118884 147588 0 Mar 01 - 518:21 /oracle/product/10.2.0/crs/bin/ocssd.bin
oracle 147588 33030 0 Mar 01 - 0:00 /bin/sh -c ulimit -c unlimited; cd /oracle/product/10.2.0/crs/log/jxsmdb2/cssd; /oracle/product/10.2.0/crs/bin/ocssd || exit $?
oracle 237584 1 0 Mar 02 - 0:00 /oracle/product/10.2.0/crs/opmn/bin/ons -d
root 340096 242426 2 17:19:49 - 15:53 /oracle/product/10.2.0/crs/bin/crsd.bin restart
oracle 106862 139442 0 Mar 01 - 36:00 /oracle/product/10.2.0/crs/bin/evmd.bin
oracle 123380 95080 0 Mar 01 - 43:54 /oracle/product/10.2.0/crs/bin/oclsomon.bin
oracle 143642 237584 0 Mar 02 - 6:04 /oracle/product/10.2.0/crs/opmn/bin/ons -d
root 242426 1 0 17:18:47 - 0:00 /bin/sh /etc/init.crsd run
oracle 95080 98742 0 Mar 01 - 0:00 /bin/sh -c cd /oracle/product/10.2.0/crs/log/jxsmdb2/cssd/oclsomon; ulimit -c unlimited; /oracle/product/10.2.0/crs/bin/oclsomon || exit $?
oracle 119712 106862 0 Mar 01 - 2:01 /oracle/product/10.2.0/crs/bin/evmlogger.bin -o /oracle/product/10.2.0/crs/evm/log/evmlogger.info -l /oracle/product/10.2.0/crs/evm/log/evmlogger.log
root 123668 123578 0 Mar 01 - 9:17 /oracle/product/10.2.0/crs/bin/oprocd.bin run -t 1000 -m 500 -f
oracle 340922 418496 0 14:16:20 pts/2 0:00 grep crs

yehc@epsoft.com posted on 2014-9-30 14:15:
At 2014-09-28 12:48:04.244, node 1 indeed could not see the storage because of the HBA failure, but now node 1 can see all of the storage, and ...
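Why is there no later output in the cssd.log you provided?
There are two possibilities:
1. There is a problem with the log you collected
2. cssd.bin cannot start properly and cannot even write any log output

Maclean Liu (刘相兵) posted on 2014-10-1 21:25:
Why is there no later output in the cssd.log you provided?
There are two possibilities:
cssd.bin cannot start properly and cannot even write any log output

The log stopped because cssd could not start. The problem is resolved for now.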
Procedure:
On node 1, manually ran /oracle/product/10.2.0/crs/bin/oprocd.bin run -t 1000 -m 500 -f
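The process listings earlier in the thread show that node 1 had no ocssd.bin or oprocd.bin running, while node 2 had both. A gap like that can be spotted mechanically from saved `ps -ef` output. The following is a minimal sketch: `missing_daemons` is a hypothetical helper, and the daemon names passed to it are simply the ones visible in the node 2 listing.

```shell
#!/bin/sh
# missing_daemons PSFILE NAME...   (hypothetical helper)
# Prints each NAME that does not appear as .../bin/NAME in the saved ps output.
missing_daemons() {
    psfile=$1
    shift
    for name in "$@"; do
        # -F matches the name literally, so "." is not treated as a wildcard.
        grep -qF "/bin/$name" "$psfile" || echo "$name"
    done
}
```

Usage on a node would look like `ps -ef > /tmp/ps.out; missing_daemons /tmp/ps.out ocssd.bin crsd.bin evmd.bin oprocd.bin`. Any name it prints is a daemon that is not running.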