Posted on 2014-4-30 00:23:20
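1. Environment
Operating system: AIX 6100-06-01-1043
Database: 10.2.0.5 RAC, 2 nodes, ASM
(The customer's environment was previously set up with the instances on the two nodes swapped.)
- ASM1/instance1: node2
- ASM2/instance2: node1
2. Symptoms:
At 10:04 on the 28th a network failure occurred and node 2 was evicted:
- [ CSSD]2014-04-28 10:04:24.716 [4371] >WARNING: clssnmPollingThread: node ahxnb02 (1) at 50% heartbeat fatal, eviction in 14.065 seconds seedhbimpd 0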
- [ CSSD]2014-04-28 10:04:24.716 [4371] >TRACE: clssnmPollingThread: node ahxnb02 (1) is impending reconfig, flag 1039, misstime 15935
- [ CSSD]2014-04-28 10:04:24.716 [4371] >TRACE: clssnmPollingThread: diskTimeout set to (27000)ms impending reconfig status(1)
- [ CSSD]2014-04-28 10:04:25.217 [4628] >TRACE: clssnmSendingThread: sending status msg to all nodes
- [ CSSD]2014-04-28 10:04:25.217 [4628] >TRACE: clssnmSendingThread: sent 4 status msgs to all nodes
- [ CSSD]2014-04-28 10:04:29.226 [4628] >TRACE: clssnmSendingThread: sending status msg to all nodes
- [ CSSD]2014-04-28 10:04:29.226 [4628] >TRACE: clssnmSendingThread: sent 4 status msgs to all nodes
- [ CSSD]2014-04-28 10:04:31.716 [4371] >WARNING: clssnmPollingThread: node ahxnb02 (1) at 75% heartbeat fatal, eviction in 7.064 seconds seedhbimpd 1
- [ CSSD]2014-04-28 10:04:32.716 [4371] >WARNING: clssnmPollingThread: node ahxnb02 (1) at 75% heartbeat fatal, eviction in 6.065 seconds seedhbimpd 1
- [ CSSD]2014-04-28 10:04:33.227 [4628] >TRACE: clssnmSendingThread: sending status msg to all nodes
- [ CSSD]2014-04-28 10:04:33.227 [4628] >TRACE: clssnmSendingThread: sent 4 status msgs to all nodes
- [ CSSD]2014-04-28 10:04:36.719 [4371] >WARNING: clssnmPollingThread: node ahxnb02 (1) at 90% heartbeat fatal, eviction in 2.062 seconds seedhbimpd 1
- [ CSSD]2014-04-28 10:04:37.230 [4628] >TRACE: clssnmSendingThread: sending status msg to all nodes
- [ CSSD]2014-04-28 10:04:37.230 [4628] >TRACE: clssnmSendingThread: sent 4 status msgs to all nodes
- [ CSSD]2014-04-28 10:04:37.716 [4371] >WARNING: clssnmPollingThread: node ahxnb02 (1) at 90% heartbeat fatal, eviction in 1.065 seconds seedhbimpd 1
- [ CSSD]2014-04-28 10:04:38.716 [4371] >WARNING: clssnmPollingThread: node ahxnb02 (1) at 90% heartbeat fatal, eviction in 0.065 seconds seedhbimpd 1
- [ CSSD]2014-04-28 10:04:38.781 [4371] >TRACE: clssnmPollingThread: Eviction started for node ahxnb02 (1), flags 0x040f, state 3, wt4c 0 seedhbimpd 1
- [ CSSD]2014-04-28 10:04:38.781 [4371] >TRACE: clssnmDiscHelper: ahxnb02, node(1) connection failed, con (1113ab490), probe(0)
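The countdown in the CSSD trace above can be sanity-checked with simple arithmetic: each WARNING timestamp plus its "eviction in N seconds" value should point at the same instant. Below is a minimal Python sketch of that check. It is not an Oracle tool; the regular expression and the function name `predicted_evictions` are assumptions of mine, written only against the lines shown in this post.

```python
import re
from datetime import datetime, timedelta

# Pattern written against the WARNING lines shown in this post only;
# other CSSD versions may format these lines differently.
WARNING_RE = re.compile(
    r"\](\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) .*"
    r"at (\d+)% heartbeat fatal, eviction in ([\d.]+) seconds"
)

def predicted_evictions(lines):
    """Return (percent, predicted eviction time) for each heartbeat WARNING."""
    results = []
    for line in lines:
        match = WARNING_RE.search(line)
        if not match:
            continue  # TRACE lines and anything else are skipped
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
        remaining = timedelta(seconds=float(match.group(3)))
        # Keep millisecond precision, like the log itself
        predicted = (stamp + remaining).strftime("%H:%M:%S.%f")[:-3]
        results.append((int(match.group(2)), predicted))
    return results

sample = [
    "[ CSSD]2014-04-28 10:04:24.716 [4371] >WARNING: clssnmPollingThread: node ahxnb02 (1) at 50% heartbeat fatal, eviction in 14.065 seconds seedhbimpd 0",
    "[ CSSD]2014-04-28 10:04:36.719 [4371] >WARNING: clssnmPollingThread: node ahxnb02 (1) at 90% heartbeat fatal, eviction in 2.062 seconds seedhbimpd 1",
]
for percent, when in predicted_evictions(sample):
    print(percent, when)
```

Both sample lines work out to 10:04:38.781, which is the timestamp of the "Eviction started" line in the trace. The first warning's misstime of 15935 ms plus the remaining 14.065 seconds also adds up to 30 seconds, which would be consistent with a 30-second heartbeat timeout, although the post does not state the configured value.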
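The host logs later showed an IP address conflict on the network; that problem has been resolved.
After the network was restored:
When CRS was started on both nodes, ASM failed to start on one of them:
- ahxnb02$$[/oracle]crs_stat -t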
- Name Type Target State Host
- ------------------------------------------------------------
- ora....B1.inst application ONLINE OFFLINE
- ora....B2.inst application ONLINE ONLINE ahxnb01
- ora.AHXNB.db application ONLINE ONLINE ahxnb01
- ora....SM2.asm application ONLINE ONLINE ahxnb01
- ora....01.lsnr application ONLINE ONLINE ahxnb01
- ora....b01.gsd application ONLINE ONLINE ahxnb01
- ora....b01.ons application ONLINE ONLINE ahxnb01
- ora....b01.vip application ONLINE ONLINE ahxnb01
- ora....SM1.asm application ONLINE OFFLINE
- ora....02.lsnr application ONLINE ONLINE ahxnb02
- ora....b02.gsd application ONLINE ONLINE ahxnb02
- ora....b02.ons application ONLINE ONLINE ahxnb02
- ora....b02.vip application ONLINE ONLINE ahxnb02
Manually attempting to start ASM on node 2:
- ahxnb01$$[/oracle]srvctl start asm -n ahxnb01
Checking again:
- ahxnb01$$[/oracle]crs_stat -t
- Name Type Target State Host
- ------------------------------------------------------------
- ora....B1.inst application ONLINE ONLINE ahxnb02
- ora....B2.inst application ONLINE OFFLINE
- ora.AHXNB.db application ONLINE ONLINE ahxnb02
- ora....SM2.asm application ONLINE OFFLINE
- ora....01.lsnr application ONLINE ONLINE ahxnb01
- ora....b01.gsd application ONLINE ONLINE ahxnb01
- ora....b01.ons application ONLINE ONLINE ahxnb01
- ora....b01.vip application ONLINE ONLINE ahxnb01
- ora....SM1.asm application ONLINE ONLINE ahxnb02
- ora....02.lsnr application ONLINE ONLINE ahxnb02
- ora....b02.gsd application ONLINE ONLINE ahxnb02
- ora....b02.ons application ONLINE ONLINE ahxnb02
- ora....b02.vip application ONLINE ONLINE ahxnb02
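When comparing two `crs_stat -t` listings like the ones above, it can help to pull out just the resources whose target is ONLINE but whose state is OFFLINE. Here is a rough Python sketch that does this for text pasted from the command output. The function name `offline_resources` is hypothetical, and the parsing assumes the whitespace-separated column layout seen in this post.

```python
def offline_resources(crs_stat_output):
    """Return names of resources whose Target is ONLINE but State is OFFLINE."""
    offline = []
    for line in crs_stat_output.splitlines():
        fields = line.split()
        # Data rows have at least Name, Type, Target, State; the Host
        # column is empty when the resource is not running anywhere.
        if len(fields) >= 4 and fields[2] == "ONLINE" and fields[3] == "OFFLINE":
            offline.append(fields[0])
    return offline

# A few rows copied from the second listing in this post
sample = """\
Name Type Target State Host
------------------------------------------------------------
ora....B1.inst application ONLINE ONLINE ahxnb02
ora....B2.inst application ONLINE OFFLINE
ora.AHXNB.db application ONLINE ONLINE ahxnb02
ora....SM2.asm application ONLINE OFFLINE
ora....SM1.asm application ONLINE ONLINE ahxnb02
"""
print(offline_resources(sample))
```

Run against the first listing, the same function would report `ora....B1.inst` and `ora....SM1.asm` instead, which makes it easy to see that the offline pair moved from one node to the other after the manual start.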
ASM alert log:
- Tue Apr 29 21:46:49 CDT 2014
- lmon registered with NM - instance id 1 (internal mem no 0)
- Tue Apr 29 21:51:44 CDT 2014
- Remote instance kill is issued with system inc 0 and reason 0x40000000
- Remote instance kill map (size 1) : 2
- Tue Apr 29 21:51:51 CDT 2014
- Trace dumping is performing id=[cdmp_20140429215152]
- Tue Apr 29 21:51:53 CDT 2014
- Error: KGXGN polling error (15)
- Tue Apr 29 21:51:53 CDT 2014
- Errors in file /oracle/admin/+ASM/bdump/+asm1_lmon_2819952.trc:
- ORA-29702: error occurred in Cluster Group Service operation
- LMON: terminating instance due to error 29702
- Tue Apr 29 21:51:55 CDT 2014
- System state dump is made for local instance
- Tue Apr 29 21:51:55 CDT 2014
- Errors in file /oracle/admin/+ASM/bdump/+asm1_diag_4260796.trc:
- ORA-29702: error occurred in Cluster Group Service operation
- Tue Apr 29 21:51:55 CDT 2014
- Trace dumping is performing id=[cdmp_20140429215153]
- Tue Apr 29 21:51:57 CDT 2014
- Shutting down instance (abort)
- License high water mark = 0
- Tue Apr 29 21:51:57 CDT 2014
- Instance terminated by LMON, pid = 2819952
- Tue Apr 29 21:52:02 CDT 2014
- Instance terminated by USER, pid = 3997966
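To see at a glance which ORA- errors the alert log reports and when, the log can be scanned for timestamp lines and error codes. The following Python sketch is only an illustration: the function name `ora_errors` and both regular expressions are my own assumptions, matched to the timestamp format visible in this excerpt.

```python
import re

# Timestamp lines as they appear in this alert log, for example
# "Tue Apr 29 21:51:53 CDT 2014". They are kept as plain text so that
# no time-zone parsing is needed.
STAMP_RE = re.compile(
    r"^[A-Z][a-z]{2} [A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2} \w+ \d{4}$"
)
ORA_RE = re.compile(r"ORA-\d{5}")

def ora_errors(alert_text):
    """Return (timestamp, ORA code) pairs in the order they appear."""
    events = []
    current_stamp = None
    for raw in alert_text.splitlines():
        line = raw.strip()
        if STAMP_RE.match(line):
            current_stamp = line  # remember the most recent timestamp
            continue
        for code in ORA_RE.findall(line):
            events.append((current_stamp, code))
    return events

sample = """\
Tue Apr 29 21:51:53 CDT 2014
Error: KGXGN polling error (15)
Tue Apr 29 21:51:53 CDT 2014
Errors in file /oracle/admin/+ASM/bdump/+asm1_lmon_2819952.trc:
ORA-29702: error occurred in Cluster Group Service operation
LMON: terminating instance due to error 29702
Tue Apr 29 21:51:55 CDT 2014
Errors in file /oracle/admin/+ASM/bdump/+asm1_diag_4260796.trc:
ORA-29702: error occurred in Cluster Group Service operation
"""
for stamp, code in ora_errors(sample):
    print(stamp, code)
```

In this excerpt the only code that shows up is ORA-29702, first from the LMON trace and then from the DIAG trace about two seconds later.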
3. Network configuration:
- ahxnb01$$[/oracle]oifcfg getif
- en0 10.88.32.0 global public
- en2 192.168.255.0 global cluster_interconnect
- en4 192.168.254.0 global cluster_interconnect
hosts file:
- 127.0.0.1 loopback localhost # loopback (lo0) name/address
- ##############Public Network###########################
- 10.88.32.51 ahxnb01
- 10.88.32.52 ahxnb02
- ############# Virtual IP address#######################
- 10.88.32.53 ahxnb01_vip
- 10.88.32.54 ahxnb02_vip
- ############ Interconnect RAC1#########################
- 192.168.255.51 ahxnb01_priv
- 192.168.255.52 ahxnb02_priv
- ############ Interconnect RAC2#########################
- 192.168.254.51 ahxnb01_priv2
- 192.168.254.52 ahxnb02_priv2
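One quick consistency check is whether every private address in the hosts file falls inside one of the subnets that `oifcfg getif` reports as `cluster_interconnect`. A small Python sketch of that check follows. Note that `oifcfg` prints no netmask, so the /24 prefix used here is an assumption, and the function name `interface_for` is hypothetical.

```python
import ipaddress

# Subnets reported by "oifcfg getif" in this post. The /24 prefix length
# is an assumption for this sketch; the real netmask is not shown.
INTERCONNECT_SUBNETS = {
    "en2": ipaddress.ip_network("192.168.255.0/24"),
    "en4": ipaddress.ip_network("192.168.254.0/24"),
}

# Private entries copied from the hosts file in this post
PRIVATE_HOSTS = {
    "ahxnb01_priv": "192.168.255.51",
    "ahxnb02_priv": "192.168.255.52",
    "ahxnb01_priv2": "192.168.254.51",
    "ahxnb02_priv2": "192.168.254.52",
}

def interface_for(address):
    """Return the interconnect interface whose subnet contains address, or None."""
    ip = ipaddress.ip_address(address)
    for name, network in INTERCONNECT_SUBNETS.items():
        if ip in network:
            return name
    return None

for host, address in PRIVATE_HOSTS.items():
    print(host, address, interface_for(address))
```

Under the /24 assumption, the `_priv` names map to en2 and the `_priv2` names map to en4, so the hosts file and the `oifcfg` output agree with each other.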
Network test between the two nodes:
Both public and cluster_interconnect are working normally:
- ahxnb02$$[/oracle]ping ahxnb01_priv
- PING ahxnb01_priv: (192.168.255.51): 56 data bytes
- 64 bytes from 192.168.255.51: icmp_seq=0 ttl=255 time=0 ms
- 64 bytes from 192.168.255.51: icmp_seq=1 ttl=255 time=0 ms
- 64 bytes from 192.168.255.51: icmp_seq=2 ttl=255 time=0 ms
- 64 bytes from 192.168.255.51: icmp_seq=3 ttl=255 time=0 ms
- ahxnb01$$[/oracle]ping ahxnb02_priv
- PING ahxnb02_priv: (192.168.255.52): 56 data bytes
- 64 bytes from 192.168.255.52: icmp_seq=0 ttl=255 time=0 ms
- 64 bytes from 192.168.255.52: icmp_seq=1 ttl=255 time=0 ms
- 64 bytes from 192.168.255.52: icmp_seq=2 ttl=255 time=0 ms
- 64 bytes from 192.168.255.52: icmp_seq=3 ttl=255 time=0 ms
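The ping output above only covers the `_priv` names on 192.168.255.0; results for the second interconnect (the `_priv2` names on 192.168.254.0) are not shown. As a sketch of how the test could be made complete, the Python snippet below just builds the list of ping commands a node would need to run against the other node's private names. The helper `ping_commands` is hypothetical, and the `-c` count option is assumed to be available in the local `ping`.

```python
# Private names taken from the hosts file in this post
PRIVATE_NAMES = ["ahxnb01_priv", "ahxnb02_priv", "ahxnb01_priv2", "ahxnb02_priv2"]

def ping_commands(local_node, names, count=4):
    """Build one ping command per private name that belongs to another node."""
    return [
        ["ping", "-c", str(count), name]
        for name in names
        if not name.startswith(local_node + "_")  # skip our own addresses
    ]

# Commands are only printed here, not executed
for command in ping_commands("ahxnb01", PRIVATE_NAMES):
    print(" ".join(command))
```

For ahxnb01 this lists pings to `ahxnb02_priv` and `ahxnb02_priv2`, so both interconnect networks get exercised.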
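I looked through the crsd and ocssd logs but could not determine the cause of the problem. I have uploaded them as attachments; could the experts here please help analyze this?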