1#
Posted on 2012-4-9 17:00:42 | Views: 11326 | Replies: 6
As the title asks: will the node be rebooted?

Below is the original text from an Oracle book:

Oracle’s cluster management software is referred to as Cluster Ready Services in 10g Release 1 and Clusterware in 10g Release 2. In 11g Release 2, it was finally rebranded as Grid Infrastructure, and it uses the cluster interconnect and a quorum device, called a voting disk, to determine cluster membership. A voting disk is shared by all nodes in the cluster, and its main use comes into play when the interconnect fails. A node is evicted from the cluster if it fails to send heartbeats through the interconnect and the voting disk. A voting disk can also help in cases where a node can’t communicate with the other nodes via the interconnect, but still has access to the voting disk. The subcluster elected to survive this scenario will send a node eviction message to the node. Clusterware performs node eviction using the STONITH algorithm. This is short for shoot the other node in the head—a software request is sent to the node to reboot itself. This can be tricky if the node to be rebooted is hung, and responding to a software reset might not be possible. But luckily, the hardware can assist in these cases, and Grid Infrastructure has support for IPMI (intelligent platform management interface), which makes it possible to issue a node termination signal. In the event of a node failure or eviction, the remaining nodes should be able to carry on processing user requests. Software APIs should make the node failure transparent to the application where possible.
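
For reference, the voting files the excerpt describes, and the optional IPMI configuration used for hardware-assisted eviction, can be inspected with crsctl. A minimal sketch, assuming an 11.2 Grid Infrastructure home on the PATH and a root shell (output omitted):

crsctl query css votedisk      # list the voting files registered with CSS
crsctl query css ipmidevice    # check for a usable IPMI device driver
crsctl query css ipmiconfig    # check whether CSS was configured to use IPMI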
7#
Posted on 2012-4-10 22:50:33
ODM TEST 1:


[root@vrh2 ~]# crsctl query crs activeversion
Oracle Clusterware active version on the cluster is [11.2.0.3.0]



[root@vrh1 ~]# crsctl check cluster
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online


[root@vrh2 ~]# crsctl check cluster
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online


[root@vrh2 ~]# crsctl stat res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS      
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.BACKUPDG.dg
               ONLINE  ONLINE       vrh1                                         
               ONLINE  ONLINE       vrh2                                         
ora.DATA.dg
               ONLINE  ONLINE       vrh1                                         
               ONLINE  ONLINE       vrh2                                         
ora.LISTENER.lsnr
               ONLINE  ONLINE       vrh1                                         
               ONLINE  ONLINE       vrh2                                         
ora.LSN_MACLEAN.lsnr
               ONLINE  ONLINE       vrh1                                         
               ONLINE  ONLINE       vrh2                                         
ora.MACLEAN2_LISTENER.lsnr
               ONLINE  ONLINE       vrh1                                         
               ONLINE  ONLINE       vrh2                                         
ora.MACLEAN3_LISTENER.lsnr
               ONLINE  INTERMEDIATE vrh1                     Not All Endpoints Registered
               ONLINE  ONLINE       vrh2                                         
ora.MACLEAN4_LISTENER.lsnr
               ONLINE  ONLINE       vrh1                                         
               ONLINE  ONLINE       vrh2                                         
ora.MACLEAN_LISTENER.lsnr
               ONLINE  ONLINE       vrh1                                         
               ONLINE  ONLINE       vrh2                                         
ora.NEW_MACLEAN_LISTENER.lsnr
               ONLINE  ONLINE       vrh1                                         
               ONLINE  ONLINE       vrh2                                         
ora.SYSTEMDG.dg
               ONLINE  ONLINE       vrh1                                         
               ONLINE  ONLINE       vrh2                                         
ora.asm
               ONLINE  ONLINE       vrh1                     Started            
               ONLINE  ONLINE       vrh2                     Started            
ora.gsd
               OFFLINE OFFLINE      vrh1                                         
               OFFLINE OFFLINE      vrh2                                         
ora.net1.network
               ONLINE  ONLINE       vrh1                                         
               ONLINE  ONLINE       vrh2                                         
ora.ons
               ONLINE  OFFLINE      vrh1                                         
               ONLINE  ONLINE       vrh2                                         
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       vrh2                                         
ora.cvu
      1        OFFLINE OFFLINE                                                   
ora.oc4j
      1        OFFLINE OFFLINE                                                   
ora.scan1.vip
      1        ONLINE  ONLINE       vrh2                                         
ora.vprod.db
      1        ONLINE  OFFLINE                               Instance Shutdown   
      2        ONLINE  OFFLINE                               Instance Shutdown,STARTING
ora.vprod.maclean_pre.svc
      1        ONLINE  OFFLINE                                                   
ora.vprod.maclean_pres.svc
      1        ONLINE  OFFLINE                                                   
ora.vprod.maclean_pres1.svc
      1        ONLINE  OFFLINE                                                   
ora.vprod.maclean_pres2.svc
      1        ONLINE  OFFLINE                                                   
ora.vrh1.vip
      1        ONLINE  ONLINE       vrh1                                         
ora.vrh2.vip
      1        ONLINE  ONLINE       vrh2                        
      
      
[root@vrh2 ~]# oifcfg getif
eth0  192.168.1.0  global  public
eth1  172.168.1.0  global  cluster_interconnect

[root@vrh2 ~]# ifconfig
eth0      Link encap:Ethernet  HWaddr 08:00:27:8C:B5:1D  
          inet addr:192.168.1.163  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:518065 errors:0 dropped:0 overruns:0 frame:0
          TX packets:297965 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:145654338 (138.9 MiB)  TX bytes:33025986 (31.4 MiB)

eth0:1    Link encap:Ethernet  HWaddr 08:00:27:8C:B5:1D  
          inet addr:192.168.1.200  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

eth0:2    Link encap:Ethernet  HWaddr 08:00:27:8C:B5:1D  
          inet addr:192.168.1.164  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

eth1      Link encap:Ethernet  HWaddr 08:00:27:B2:AA:A5  
          inet addr:172.168.1.19  Bcast:172.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:4613238 errors:0 dropped:0 overruns:0 frame:0
          TX packets:5551737 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1852273946 (1.7 GiB)  TX bytes:3457235177 (3.2 GiB)

eth1:1    Link encap:Ethernet  HWaddr 08:00:27:B2:AA:A5  
          inet addr:169.254.145.211  Bcast:169.254.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:981653 errors:0 dropped:0 overruns:0 frame:0
          TX packets:981653 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:379674709 (362.0 MiB)  TX bytes:379674709 (362.0 MiB)


[root@vrh2 ~]# ifdown  eth1

[root@vrh2 ~]# date
Tue Apr 10 10:37:15 EDT 2012

[root@vrh2 ~]# date
Tue Apr 10 10:37:52 EDT 2012

[root@vrh2 ~]# uptime
10:38:07 up 9 days, 50 min,  2 users,  load average: 3.70, 1.56, 0.72
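
For context, the CSS timeouts that drive the eviction decisions seen below can be read back with crsctl (the same values reappear later in ocssd.log as misscount 30, short I/O timeout 27, reboot latency 3). A quick sketch, assuming 11.2 defaults:

crsctl get css misscount      # network heartbeat timeout, 30s by default
crsctl get css disktimeout    # voting disk long I/O timeout, 200s by default
crsctl get css reboottime     # time allowed for a node reboot, 3s by default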

After manually pulling the network cable, or bringing down the private interconnect as above, the node was not rebooted (note the kgzf_fini1 message near the end of the trace below).

Observing ocssd.log:


2012-04-10 10:37:07.382: [ GIPCNET][1085045056] gipcmodNetworkProcessSend: [network]  failed send attempt endp 0x260f3f0 [0000000000000365] { gipcEndpoint : localAddr 'udp://172.168.1.19:18466', remoteAddr '', numPend 5, numReady 1, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x2, usrFlags 0x4000 }, req 0x7f198c85ba30 [000000000000ac7a] { gipcSendRequest : addr 'udp://172.168.1.18:61760', data 0x7f198c85c958, len 228, olen 0, parentEndp 0x260f3f0, ret gipcretEndpointNotAvailable (40), objFlags 0x0, reqFlags 0x2 }
2012-04-10 10:37:07.382: [ GIPCNET][1085045056] gipcmodNetworkProcessSend: slos op  :  sgipcnValidateSocket
2012-04-10 10:37:07.382: [ GIPCNET][1085045056] gipcmodNetworkProcessSend: slos dep :  Invalid argument (22)
2012-04-10 10:37:07.382: [ GIPCNET][1085045056] gipcmodNetworkProcessSend: slos loc :  address not
2012-04-10 10:37:07.382: [ GIPCNET][1085045056] gipcmodNetworkProcessSend: slos info:  addr '172.168.1.19:18466', len 228, buf 0x7f198c85c958, cookie 0x7f198c85ba30
2012-04-10 10:37:07.382: [GIPCXCPT][1085045056] gipcInternalSendSync: failed sync request, ret gipcretEndpointNotAvailable (40)
2012-04-10 10:37:07.382: [GIPCXCPT][1085045056] gipcSendSyncF [gipchaLowerInternalSend : gipchaLower.c : 781]: EXCEPTION[ ret gipcretEndpointNotAvailable (40) ]  failed to send on endp 0x260f3f0 [0000000000000365] { gipcEndpoint : localAddr 'udp://172.168.1.19:18466', remoteAddr '', numPend 5, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x2, usrFlags 0x4000 }, addr 0x2605d80 [0000000000000a5b] { gipcAddress : name 'udp://172.168.1.18:61760', objFlags 0x0, addrFlags 0x1 }, buf 0x7f198c85c958, len 228, flags 0x0
2012-04-10 10:37:07.382: [GIPCHGEN][1085045056] gipchaInterfaceFail: marking interface failing 0x260de10 { host 'vrh1', haName 'CSS_vrh-cluster', local 0x7f198c3f1d00, ip '172.168.1.18:61760', subnet '172.168.1.0', mask '255.255.255.0', mac '', ifname '', numRef 0, numFail 0, idxBoot 2, flags 0x6 }
2012-04-10 10:37:07.383: [GIPCHALO][1085045056] gipchaLowerInternalSend: failed to initiate send on interface 0x260de10 { host 'vrh1', haName 'CSS_vrh-cluster', local 0x7f198c3f1d00, ip '172.168.1.18:61760', subnet '172.168.1.0', mask '255.255.255.0', mac '', ifname '', numRef 0, numFail 0, idxBoot 2, flags 0x86 }, hctx 0x23f3a30 [0000000000000010] { gipchaContext : host 'vrh2', name 'CSS_vrh-cluster', luid 'cb9990a7-00000000', numNode 1, numInf 1, usrFlags 0x0, flags 0x67 }
2012-04-10 10:37:08.383: [GIPCHGEN][1085045056] gipchaInterfaceDisable: disabling interface 0x7f198c3f1d00 { host '', haName 'CSS_vrh-cluster', local (nil), ip '172.168.1.19', subnet '172.168.1.0', mask '255.255.255.0', mac '08-00-27-b2-aa-a5', ifname 'eth1', numRef 0, numFail 1, idxBoot 0, flags 0x194d }
2012-04-10 10:37:08.383: [GIPCHGEN][1085045056] gipchaInterfaceDisable: disabling interface 0x260de10 { host 'vrh1', haName 'CSS_vrh-cluster', local 0x7f198c3f1d00, ip '172.168.1.18:61760', subnet '172.168.1.0', mask '255.255.255.0', mac '', ifname '', numRef 0, numFail 0, idxBoot 2, flags 0x86 }
2012-04-10 10:37:08.383: [GIPCHALO][1085045056] gipchaLowerCleanInterfaces: performing cleanup of disabled interface 0x260de10 { host 'vrh1', haName 'CSS_vrh-cluster', local 0x7f198c3f1d00, ip '172.168.1.18:61760', subnet '172.168.1.0', mask '255.255.255.0', mac '', ifname '', numRef 0, numFail 0, idxBoot 2, flags 0xa6 }
2012-04-10 10:37:08.383: [GIPCHGEN][1085045056] gipchaInterfaceReset: resetting interface 0x260de10 { host 'vrh1', haName 'CSS_vrh-cluster', local 0x7f198c3f1d00, ip '172.168.1.18:61760', subnet '172.168.1.0', mask '255.255.255.0', mac '', ifname '', numRef 0, numFail 0, idxBoot 2, flags 0xa6 }
2012-04-10 10:37:08.772: [GIPCHDEM][1085045056] gipchaWorkerCleanInterface: performing cleanup of disabled interface 0x7f198c3f1d00 { host '', haName 'CSS_vrh-cluster', local (nil), ip '172.168.1.19', subnet '172.168.1.0', mask '255.255.255.0', mac '08-00-27-b2-aa-a5', ifname 'eth1', numRef 0, numFail 0, idxBoot 0, flags 0x196d }
2012-04-10 10:37:08.772: [GIPCHTHR][1085045056] gipchaWorkerCreateInterface: created remote interface for node 'vrh1', haName 'CSS_vrh-cluster', inf 'udp://172.168.1.18:61760'
2012-04-10 10:37:08.772: [GIPCHALO][1085045056] gipchaLowerCleanInterfaces: forcing interface purge due to loss of all comms node 0x2559bd0 { host 'vrh1', haName 'CSS_vrh-cluster', srcLuid cb9990a7-6c3b4b25, dstLuid f886de70-b59bfb3b numInf 1, contigSeq 956, lastAck 1034, lastValidAck 956, sendSeq [1035 : 1037], createTime 779040464, sentRegister 1, localMonitor 1, flags 0x4808 }
2012-04-10 10:37:08.772: [GIPCHGEN][1085045056] gipchaInterfaceDisable: disabling interface 0x260de10 { host 'vrh1', haName 'CSS_vrh-cluster', local (nil), ip '172.168.1.18:61760', subnet '172.168.1.0', mask '255.255.255.0', mac '', ifname '', numRef 0, numFail 0, idxBoot 2, flags 0x6 }
2012-04-10 10:37:09.383: [GIPCHALO][1085045056] gipchaLowerCleanInterfaces: performing cleanup of disabled interface 0x260de10 { host 'vrh1', haName 'CSS_vrh-cluster', local (nil), ip '172.168.1.18:61760', subnet '172.168.1.0', mask '255.255.255.0', mac '', ifname '', numRef 0, numFail 0, idxBoot 2, flags 0x226 }
2012-04-10 10:37:11.384: [    CSSD][1124194624]clssnmSendingThread: sending status msg to all nodes
2012-04-10 10:37:11.384: [    CSSD][1124194624]clssnmSendingThread: sent 5 status msgs to all nodes


2012-04-10 10:37:12.385: [GIPCHALO][1085045056] gipchaLowerProcessNode: no valid interfaces found to node for 5010 ms, node 0x2559bd0 { host 'vrh1', haName 'CSS_vrh-cluster', srcLuid cb9990a7-6c3b4b25, dstLuid f886de70-b59bfb3b numInf 0, contigSeq 956, lastAck 1034, lastValidAck 956, sendSeq [1035 : 1044], createTime 779040464, sentRegister 1, localMonitor 1, flags 0x2408 }
2012-04-10 10:37:16.387: [    CSSD][1124194624]clssnmSendingThread: sending status msg to all nodes
2012-04-10 10:37:16.387: [    CSSD][1124194624]clssnmSendingThread: sent 5 status msgs to all nodes
2012-04-10 10:37:18.388: [GIPCHALO][1085045056] gipchaLowerProcessNode: no valid interfaces found to node for 11010 ms, node 0x2559bd0 { host 'vrh1', haName 'CSS_vrh-cluster', srcLuid cb9990a7-6c3b4b25, dstLuid f886de70-b59bfb3b numInf 0, contigSeq 956, lastAck 1034, lastValidAck 956, sendSeq [1035 : 1056], createTime 779040464, sentRegister 1, localMonitor 1, flags 0x2408 }
2012-04-10 10:37:21.389: [    CSSD][1124194624]clssnmSendingThread: sending status msg to all nodes
2012-04-10 10:37:21.389: [    CSSD][1124194624]clssnmSendingThread: sent 5 status msgs to all nodes
2012-04-10 10:37:21.781: [    CSSD][1122617664]clssnmPollingThread: node vrh1 (1) at 50% heartbeat fatal, removal in 14.920 seconds
2012-04-10 10:37:21.781: [    CSSD][1122617664]clssnmPollingThread: node vrh1 (1) is impending reconfig, flag 2491404, misstime 15080
2012-04-10 10:37:21.781: [    CSSD][1122617664]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
2012-04-10 10:37:21.782: [    CSSD][1111578944]clssnmvDHBValidateNCopy: node 1, vrh1, has a disk HB, but no network HB, DHB has rcfg 216157779, wrtcnt, 34742238, LATS 779494584, lastSeqNo 34740153, uniqueness 1334068150, timestamp 1334068641/779496994
2012-04-10 10:37:21.782: [    CSSD][1090509120]clssnmvDHBValidateNCopy: node 1, vrh1, has a disk HB, but no network HB, DHB has rcfg 216157779, wrtcnt, 34742239, LATS 779494584, lastSeqNo 34739995, uniqueness 1334068150, timestamp 1334068641/779497524
2012-04-10 10:37:22.782: [    CSSD][1111578944]clssnmvDHBValidateNCopy: node 1, vrh1, has a disk HB, but no network HB, DHB has rcfg 216157779, wrtcnt, 34742245, LATS 779495584, lastSeqNo 34742238, uniqueness 1334068150, timestamp 1334068642/779498494
2012-04-10 10:37:22.783: [    CSSD][1090509120]clssnmvDHBValidateNCopy: node 1, vrh1, has a disk HB, but no network HB, DHB has rcfg 216157779, wrtcnt, 34742246, LATS 779495584, lastSeqNo 34742239, uniqueness 1334068150, timestamp 1334068642/779498524
2012-04-10 10:37:23.783: [    CSSD][1111578944]clssnmvDHBValidateNCopy: node 1, vrh1, has a disk HB, but no network HB, DHB has rcfg 216157779, wrtcnt, 34742255, LATS 779496584, lastSeqNo 34742245, uniqueness 1334068150, timestamp 1334068643/779499494


................................................................................


79507594, lastSeqNo 34742356, uniqueness 1334068150, timestamp 1334068654/779510544
2012-04-10 10:37:35.399: [    CSSD][1124194624]clssnmSendingThread: sending status msg to all nodes
2012-04-10 10:37:35.399: [GIPCHALO][1085045056] gipchaLowerProcessNode: no valid interfaces found to node for 28020 ms, node 0x2559bd0 { host 'vrh1', haName 'CSS_vrh-cluster', srcLuid cb9990a7-6c3b4b25, dstLuid f886de70-b59bfb3b numInf 0, contigSeq 956, lastAck 1034, lastValidAck 956, sendSeq [1035 : 1089], createTime 779040464, sentRegister 1, localMonitor 1, flags 0x2008 }
2012-04-10 10:37:35.399: [    CSSD][1124194624]clssnmSendingThread: sent 4 status msgs to all nodes
2012-04-10 10:37:35.790: [    CSSD][1111578944]clssnmvDHBValidateNCopy: node 1, vrh1, has a disk HB, but no network HB, DHB has rcfg 216157779, wrtcnt, 34742375, LATS 779508594, lastSeqNo 34742365, uniqueness 1334068150, timestamp 1334068655/779511534
2012-04-10 10:37:35.790: [    CSSD][1090509120]clssnmvDHBValidateNCopy: node 1, vrh1, has a disk HB, but no network HB, DHB has rcfg 216157779, wrtcnt, 34742376, LATS 779508594, lastSeqNo 34742366, uniqueness 1334068150, timestamp 1334068655/779511544
2012-04-10 10:37:36.700: [    CSSD][1122617664]clssnmPollingThread: Removal started for node vrh1 (1), flags 0x26040c, state 3, wt4c 0
2012-04-10 10:37:36.700: [    CSSD][1122617664]clssnmMarkNodeForRemoval: node 1, vrh1 marked for removal
2012-04-10 10:37:36.700: [    CSSD][1122617664]clssnmDiscHelper: vrh1, node(1) connection failed, endp (0xa65), probe((nil)), ninf->endp 0xa65
2012-04-10 10:37:36.700: [    CSSD][1122617664]clssnmDiscHelper: node 1 clean up, endp (0xa65), init state 5, cur state 5
2012-04-10 10:37:36.700: [GIPCXCPT][1122617664] gipcInternalDissociate: obj 0x266c750 [0000000000000a65] { gipcEndpoint : localAddr 'gipcha://vrh2:nm2_vrh-cluster/13fb-32d1-ad66-1a5', remoteAddr 'gipcha://vrh1:d736-9e03-d52b-ed5', numPend 1, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x138606, usrFlags 0x0 } not associated with any container, ret gipcretFail (1)
2012-04-10 10:37:36.700: [GIPCXCPT][1122617664] gipcDissociateF [clssnmDiscHelper : clssnm.c : 3436]: EXCEPTION[ ret gipcretFail (1) ]  failed to dissociate obj 0x266c750 [0000000000000a65] { gipcEndpoint : localAddr 'gipcha://vrh2:nm2_vrh-cluster/13fb-32d1-ad66-1a5', remoteAddr 'gipcha://vrh1:d736-9e03-d52b-ed5', numPend 1, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x138606, usrFlags 0x0 }, flags 0x0
2012-04-10 10:37:36.700: [    CSSD][1111578944]clssnmvDHBValidateNCopy: node 1, vrh1, has a disk HB, but no network HB, DHB has rcfg 216157779, wrtcnt, 34742380, LATS 779509504, lastSeqNo 34742375, uniqueness 1334068150, timestamp 1334068656/779512034
2012-04-10 10:37:36.700: [    CSSD][1088932160](:CSSNM00005:)clssnmvDiskKillCheck: Aborting, evicted by node vrh1, number 1, sync 216157779, stamp 779512164
2012-04-10 10:37:36.700: [    CSSD][1125771584]clssnmDoSyncUpdate: Initiating sync 216157779
2012-04-10 10:37:36.700: [    CSSD][1088932160]###################################
2012-04-10 10:37:36.700: [    CSSD][1125771584]clssscCompareSwapEventValue: changed NMReconfigInProgress  val 2, from -1, changes 4
2012-04-10 10:37:36.700: [    CSSD][1088932160]clssscExit: CSSD aborting from thread clssnmvKillBlockThread
2012-04-10 10:37:36.700: [    CSSD][1125771584]clssnmDoSyncUpdate: local disk timeout set to 27000 ms, remote disk timeout set to 27000
2012-04-10 10:37:36.700: [    CSSD][1088932160]###################################
2012-04-10 10:37:36.700: [    CSSD][1125771584]clssnmDoSyncUpdate: new values for local disk timeout and remote disk timeout will take effect when the sync is completed.
2012-04-10 10:37:36.700: [    CSSD][1088932160](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally
2012-04-10 10:37:36.700: [    CSSD][1125771584]clssnmDoSyncUpdate: Starting cluster reconfig with incarnation 216157779
2012-04-10 10:37:36.700: [    CSSD][1125771584]clssnmSetupAckWait: Ack message type (11)
2012-04-10 10:37:36.700: [    CSSD][1125771584]clssnmSetupAckWait: node(2) is ALIVE
2012-04-10 10:37:36.700: [    CSSD][1125771584]clssnmSendSync: syncSeqNo(216157779), indicating EXADATA fence initialization complete
2012-04-10 10:37:36.700: [    CSSD][1088932160]clssnmSendMeltdownStatus: node vrh2, number 2, has experienced a failure in thread number 10 and is shutting down
2012-04-10 10:37:36.700: [    CSSD][1125771584]List of nodes that have ACKed my sync: NULL


2012-04-10 10:37:36.700: [    CSSD][1125771584]clssnmSendSync: syncSeqNo(216157779)
2012-04-10 10:37:36.700: [    CSSD][1088932160]clssscExit: Starting CRSD cleanup
2012-04-10 10:37:36.700: [    CSSD][1125771584]clssnmWaitForAcks: Ack message type(11), ackCount(1)
2012-04-10 10:37:36.700: [    CSSD][1127348544]clssnmHandleSync: Node vrh2, number 2, is EXADATA fence capable
2012-04-10 10:37:36.700: [    CSSD][1127348544]clssscUpdateEventValue: NMReconfigInProgress  val 2, changes 5
2012-04-10 10:37:36.700: [    CSSD][1127348544]clssnmHandleSync: local disk timeout set to 27000 ms, remote disk timeout set to 27000
2012-04-10 10:37:36.700: [    CSSD][1127348544]clssnmHandleSync: initleader 2 newleader 2
2012-04-10 10:37:36.700: [    CSSD][1127348544]clssnmQueueClientEvent:  Sending Event(2), type 2, incarn 216157778
2012-04-10 10:37:36.700: [    CSSD][1127348544]clssnmQueueClientEvent: Node[1] state = 5, birth = 216157777, unique = 1334068150
2012-04-10 10:37:36.700: [    CSSD][1127348544]clssnmQueueClientEvent: Node[2] state = 3, birth = 216157778, unique = 1334068156
2012-04-10 10:37:36.700: [    CSSD][1127348544]clssnmHandleSync: Acknowledging sync: src[2] srcName[vrh2] seq[2] sync[216157779]
2012-04-10 10:37:36.701: [    CSSD][1097337152]clssgmProcClientReqs: Checking RPC Q
2012-04-10 10:37:36.701: [    CSSD][1097337152]clssgmProcClientReqs: Checking dead client Q
2012-04-10 10:37:36.701: [    CSSD][1097337152]clssgmProcClientReqs: Checking dead proc Q
2012-04-10 10:37:36.701: [    CSSD][2472650464]clssgmStartNMMon: node 1 active, birth 216157777
2012-04-10 10:37:36.701: [    CSSD][2472650464]clssgmStartNMMon: node 2 active, birth 216157778
2012-04-10 10:37:36.701: [    CSSD][2472650464]NMEVENT_SUSPEND [00][00][00][06]
2012-04-10 10:37:36.701: [    CSSD][2472650464]clssgmUpdateEventValue: CmInfo State  val 5, changes 10
2012-04-10 10:37:36.701: [    CSSD][2472650464]clssgmSuspendAllGrocks: Issue SUSPEND
2012-04-10 10:37:36.701: [    CSSD][2472650464]clssgmQueueGrockEvent: groupName(IG+ASMSYS$USERS) count(2) master(1) event(2), incarn 2, mbrc 2, to member 2, events 0x0, state 0x0
2012-04-10 10:37:36.702: [    CSSD][2472650464]clssgmQueueGrockEvent: groupName(crs_version) count(3) master(0) event(2), incarn 3, mbrc 3, to member 1, events 0x0, state 0x0
2012-04-10 10:37:36.702: [    CSSD][2472650464]clssgmQueueGrockEvent: groupName(IGVPRODSYS$USERS) count(1) master(2) event(2), incarn 1, mbrc 1, to member 2, events 0x0, state 0x0
2012-04-10 10:37:36.702: [    CSSD][2472650464]clssgmQueueGrockEvent: groupName(CRF-) count(3) master(0) event(2), incarn 5, mbrc 3, to member 1, events 0x38, state 0x0
2012-04-10 10:37:36.702: [    CSSD][2472650464]clssgmQueueGrockEvent: groupName(CLSN.ONSPROC.MASTER) count(1) master(2) event(2), incarn 1, mbrc 1, to member 2, events 0xa0, state 0x0
2012-04-10 10:37:36.702: [    CSSD][2472650464]clssgmQueueGrockEvent: groupName(DB+ASM) count(2) master(1) event(2), incarn 2, mbrc 2, to member 1, events 0x68, state 0x0
2012-04-10 10:37:36.702: [    CSSD][2472650464]clssgmQueueGrockEvent: groupName(DG+ASM) count(2) master(0) event(2), incarn 2, mbrc 2, to member 1, events 0x0, state 0x0
2012-04-10 10:37:36.702: [    CSSD][2472650464]clssgmQueueGrockEvent: groupName(IG+ASMSYS$BACKGROUND) count(2) master(1) event(2), incarn 2, mbrc 2, to member 2, events 0x0, state 0x0
2012-04-10 10:37:36.702: [    CSSD][2472650464]clssgmQueueGrockEvent: groupName(VT+ASM) count(2) master(1) event(2), incarn 6, mbrc 2, to member 2, events 0x60, state 0x0
2012-04-10 10:37:36.702: [    CSSD][2472650464]clssgmQueueGrockEvent: groupName(DG+ASM0) count(2) master(0) event(2), incarn 2, mbrc 2, to member 1, events 0x0, state 0x0
2012-04-10 10:37:36.702: [    CSSD][1110001984](:CSSNM00005:)clssnmvDiskKillCheck: Aborting, evicted by node vrh1, number 1, sync 216157779, stamp 779512164
2012-04-10 10:37:36.702: [    CSSD][1110001984]clssscExit: abort already set 1
2012-04-10 10:37:36.702: [    CSSD][2472650464]clssgmQueueGrockEvent: groupName(GR+GCR1) count(3) master(0) event(2), incarn 3, mbrc 3, to member 1, events 0x280, state 0x0
2012-04-10 10:37:36.702: [    CSSD][1127348544]clssnmDiscEndp: gipcDestroy 0xa65
2012-04-10 10:37:36.702: [    CSSD][1106848064](:CSSNM00005:)clssnmvDiskKillCheck: Aborting, evicted by node vrh1, number 1, sync 216157779, stamp 779512164
2012-04-10 10:37:36.702: [    CSSD][1106848064]clssscExit: abort already set 1
2012-04-10 10:37:36.703: [    CSSD][1097337152]clssgmProcClientReqs: Checking RPC Q
2012-04-10 10:37:36.703: [    CSSD][1097337152]clssgmProcClientReqs: Checking dead client Q
2012-04-10 10:37:36.703: [    CSSD][1097337152]clssgmProcClientReqs: Checking dead proc Q
2012-04-10 10:37:36.702: [    CSSD][2472650464]clssgmQueueGrockEvent: groupName(GR+GCR1) count(3) master(0) event(2), incarn 3, mbrc 3, to member 2, events 0x280, state 0x0
2012-04-10 10:37:36.703: [    CSSD][2472650464]clssgmQueueGrockEvent: groupName(DG_DATA) count(2) master(0) event(2), incarn 2, mbrc 2, to member 1, events 0x4, state 0x0
2012-04-10 10:37:36.703: [    CSSD][2472650464]clssgmQueueGrockEvent: groupName(DBVPROD) count(1) master(1) event(2), incarn 1, mbrc 1, to member 1, events 0x68, state 0x0
2012-04-10 10:37:36.703: [    CSSD][2472650464]clssgmQueueGrockEvent: groupName(IGVPRODSYS$BACKGROUND) count(1) master(2) event(2), incarn 1, mbrc 1, to member 2, events 0x0, state 0x0


2012-04-10 10:37:36.705: [    CSSD][2472650464]clssgmStartNMMon:  completed node cleanup
2012-04-10 10:37:36.705: [    CSSD][2472650464]clssgmStartNMMon: node 1 active, birth 216157777
2012-04-10 10:37:36.705: [    CSSD][2472650464]clssgmStartNMMon: node 2 active, birth 216157778
2012-04-10 10:37:36.705: [    CSSD][2472650464]NMEVENT_SUSPEND [00][00][00][06]
2012-04-10 10:37:36.705: [    CSSD][2472650464]clssgmUpdateEventValue: CmInfo State  val 5, changes 12
2012-04-10 10:37:36.705: [    CSSD][2472650464]clssgmSuspendAllGrocks: Issue SUSPEND
2012-04-10 10:37:36.705: [    CSSD][2472650464]clssgmQueueGrockEvent: groupName(IG+ASMSYS$USERS) count(2) master(1) event(2), incarn 2, mbrc 2, to member 2, events 0x0, state 0x0
2012-04-10 10:37:36.705: [    CSSD][2472650464]clssgmQueueGrockEvent: groupName(crs_version) count(3) master(0) event(2), incarn 3, mbrc 3, to member 1, events 0x0, state 0x0
2012-04-10 10:37:36.705: [    CSSD][2472650464]clssgmQueueGrockEvent: groupName(IGVPRODSYS$USERS) count(1) master(2) event(2), incarn 1, mbrc 1, to member 2, events 0x0, state 0x0
2012-04-10 10:37:36.705: [    CSSD][2472650464]clssgmQueueGrockEvent: groupName(CRF-) count(3) master(0) event(2), incarn 5, mbrc 3, to member 1, events 0x38, state 0x0
2012-04-10 10:37:36.705: [    CSSD][1097337152]clssgmProcClientReqs: Checking RPC Q
2012-04-10 10:37:36.705: [    CSSD][1097337152]clssgmProcClientReqs: Checking dead client Q
2012-04-10 10:37:36.705: [    CSSD][2472650464]clssgmQueueGrockEvent: groupName(CLSN.ONSPROC.MASTER) count(1) master(2) event(2), incarn 1, mbrc 1, to member 2, events 0xa0, state 0x0
2012-04-10 10:37:36.705: [    CSSD][1097337152]clssgmProcClientReqs: Checking dead proc Q
2012-04-10 10:37:36.705: [    CSSD][2472650464]clssgmQueueGrockEvent: groupName(DB+ASM) count(2) master(1) event(2), incarn 2, mbrc 2, to member 1, events 0x68, state 0x0
2012-04-10 10:37:36.705: [    CSSD][2472650464]clssgmQueueGrockEvent: groupName(DG+ASM) count(2) master(0) event(2), incarn 2, mbrc 2, to member 1, events 0x0, state 0x0
2012-04-10 10:37:36.705: [    CSSD][2472650464]clssgmQueueGrockEvent: groupName(IG+ASMSYS$BACKGROUND) count(2) master(1) event(2), incarn 2, mbrc 2, to member 2, events 0x0, state 0x0
2012-04-10 10:37:36.705: [    CSSD][2472650464]clssgmQueueGrockEvent: groupName(VT+ASM) count(2) master(1) event(2), incarn 6, mbrc 2, to member 2, events 0x60, state 0x0
2012-04-10 10:37:36.705: [    CSSD][2472650464]clssgmQueueGrockEvent: groupName(DG+ASM0) count(2) master(0) event(2), incarn 2, mbrc 2, to member 1, events 0x0, state 0x0
2012-04-10 10:37:36.705: [    CSSD][2472650464]clssgmQueueGrockEvent: groupName(GR+GCR1) count(3) master(0) event(2), incarn 3, mbrc 3, to member 1, events 0x280, state 0x0
2012-04-10 10:37:36.705: [    CSSD][2472650464]clssgmQueueGrockEvent: groupName(GR+GCR1) count(3) master(0) event(2), incarn 3, mbrc 3, to member 2, events 0x280, state 0x0
2012-04-10 10:37:36.704: [    CSSD][1127348544]clssnmHandleSync: local disk timeout set to 27000 ms, remote disk timeout set to 27000
2012-04-10 10:37:36.705: [    CSSD][2472650464]clssgmQueueGrockEvent: groupName(DG_DATA) count(2) master(0) event(2), incarn 2, mbrc 2, to member 1, events 0x4, state 0x0
2012-04-10 10:37:36.705: [    CSSD][1127348544]clssnmHandleSync: initleader 2 newleader 2
2012-04-10 10:37:36.705: [    CSSD][1127348544]clssnmQueueClientEvent:  Sending Event(2), type 2, incarn 216157778
2012-04-10 10:37:36.705: [    CSSD][1127348544]clssnmQueueClientEvent: Node[1] state = 5, birth = 216157777, unique = 1334068150
2012-04-10 10:37:36.705: [    CSSD][1127348544]clssnmQueueClientEvent: Node[2] state = 3, birth = 216157778, unique = 1334068156
2012-04-10 10:37:36.705: [    CSSD][2472650464]clssgmQueueGrockEvent: groupName(DBVPROD) count(1) master(1) event(2), incarn 1, mbrc 1, to member 1, events 0x68, state 0x0
2012-04-10 10:37:36.705: [    CSSD][1127348544]clssnmHandleSync: Acknowledging sync: src[2] srcName[vrh2] seq[5] sync[216157779]
2012-04-10 10:37:36.705: [    CSSD][2472650464]clssgmQueueGrockEvent: groupName(IGVPRODSYS$BACKGROUND) count(1) master(2) event(2), incarn 1, mbrc 1, to member 2, events 0x0, state 0x0
2012-04-10 10:37:36.706: [    CSSD][2472650464]clssgmQueueGrockEvent: groupName(ocr_vrh-cluster) count(2) master(2) event(2), incarn 2, mbrc 2, to member 2, events 0x78, state 0x0
2012-04-10 10:37:36.706: [    CSSD][2472650464]clssgmQueueGrockEvent: groupName(DGVPROD-) count(1) master(1) event(2), incarn 1, mbrc 1, to member 1, events 0x0, state 0x0
2012-04-10 10:37:36.706: [    CSSD][1097337152]clssgmProcClientReqs: Checking RPC Q
2012-04-10 10:37:36.706: [    CSSD][2472650464]clssgmQueueGrockEvent: groupName(DGVPROD0) count(1) master(1) event(2), incarn 1, mbrc 1, to member 1, events 0x0, state 0x0
2012-04-10 10:37:36.706: [    CSSD][1097337152]clssgmProcClientReqs: Checking dead client Q
2012-04-10 10:37:36.706: [    CSSD][1097337152]clssgmProcClientReqs: Checking dead proc Q
2012-04-10 10:37:36.706: [    CSSD][2472650464]clssgmQueueGrockEvent: groupName(CLSFRAME) count(2) master(2) event(2), incarn 2, mbrc 2, to member 2, events 0x8, state 0x0
2012-04-10 10:37:36.706: [    CSSD][2472650464]clssgmQueueGrockEvent: groupName(CRSDMAIN) count(2) master(2) event(2), incarn 2, mbrc 2, to member 2, events 0x8, state 0x0
2012-04-10 10:37:36.706: [    CSSD][2472650464]clssgmQueueGrockEvent: groupName(EVMDMAIN) count(2) master(2) event(2), incarn 2, mbrc 2, to member 2, events 0x8, state 0x0
2012-04-10 10:37:36.706: [    CSSD][2472650464]clssgmQueueGrockEvent: groupName(EVMDMAIN2) count(2) master(2) event(2), incarn 2, mbrc 2, to member 2, events 0x8, state 0x0
2012-04-10 10:37:36.706: [    CSSD][2472650464]clssgmQueueGrockEvent: groupName(CTSSGROUP) count(2) master(1) event(2), incarn 2, mbrc 2, to member 2, events 0x8, state 0x0
2012-04-10 10:37:36.706: [    CSSD][1097337152]clssgmProcClientReqs: Checking RPC Q
2012-04-10 10:37:36.706: [    CSSD][1097337152]clssgmProcClientReqs: Checking dead client Q
2012-04-10 10:37:36.706: [    CSSD][1097337152]clssgmProcClientReqs: Checking dead proc Q
2012-04-10 10:37:36.706: [    CSSD][2472650464]clssgmQueueGrockEvent: groupName(DG_BACKUPDG) count(2) master(0) event(2), incarn 2, mbrc 2, to member 1, events 0x4, state 0x0


The log above shows the various resources on the node being shut down.

2012-04-10 10:37:38.753: [    CSSD][1088932160]clssnmrFenceSage: Fenced node vrh2, number 2, with EXADATA, handle 0
2012-04-10 10:37:38.753: [    CSSD][1088932160]clssgmUpdateEventValue: CmInfo State  val 0, changes 28
2012-04-10 10:37:38.754: [    CSSD][1097337152]clssgmProcClientReqs: Checking RPC Q
2012-04-10 10:37:38.754: [    CSSD][1097337152]clssgmProcClientReqs: Checking dead client Q
2012-04-10 10:37:38.754: [    CSSD][1097337152]clssgmProcClientReqs: Checking dead proc Q
2012-04-10 10:37:38.754: [    CSSD][1097337152]clssgmSendShutdown: Aborting client (0x7f198c87a9d0) proc (0x7f198c859640), iocapables 1.
2012-04-10 10:37:38.754: [    CSSD][1097337152]clssgmSendShutdown: I/O capable proc (0x7f198c859640), pid (1106), iocapables 1, client (0x7f198c87a9d0)
2012-04-10 10:37:38.754: [    CSSD][1097337152]clssgmSendShutdown: Aborting client (0x7f198c294760) proc (0x7f198c859410), iocapables 2.
2012-04-10 10:37:38.754: [    CSSD][1121040704]clssgmPeerListener: terminating at incarn(216157778)
2012-04-10 10:37:38.754: [    CSSD][1121040704]clssgmPeerDeactivate: node 1 (vrh1), death 0, state 0x80000001 connstate 0xa
2012-04-10 10:37:38.754: [    CSSD][1121040704]clssgmCleanFuture: discarded 0 future msgs for 1
2012-04-10 10:37:38.754: [    CSSD][1097337152]clssgmSendShutdown: I/O capable proc (0x7f198c859410), pid (1114), iocapables 2, client (0x7f198c294760)
2012-04-10 10:37:38.754: [    CSSD][1097337152]clssgmSendShutdown: Aborting client (0x2562620) proc (0x265f090), iocapables 3.
2012-04-10 10:37:38.754: [    CSSD][1121040704]clssgmDiscEndppl: gipcDestroy 0xb1f
2012-04-10 10:37:38.754: [    CSSD][1097337152]clssgmSendShutdown: I/O capable proc (0x265f090), pid (1134), iocapables 3, client (0x2562620)
2012-04-10 10:37:38.754: [    CSSD][1097337152]clssgmSendShutdown: Aborting client (0x2608b70) proc (0x26521a0), iocapables 4.
2012-04-10 10:37:38.754: [    CSSD][1097337152]clssgmSendShutdown: I/O capable proc (0x26521a0), pid (1118), iocapables 4, client (0x2608b70)
2012-04-10 10:37:38.754: [    CSSD][1097337152]clssgmSendShutdown: Aborting client (0x2554d50) proc (0x26521a0), iocapables 5.
2012-04-10 10:37:38.754: [    CSSD][1097337152]clssgmSendShutdown: I/O capable proc (0x26521a0), pid (1118), iocapables 5, client (0x2554d50)
2012-04-10 10:37:38.754: [    CSSD][1097337152]clssgmSendShutdown: Aborting client (0x253c7f0) proc (0x26521a0), iocapables 6.
2012-04-10 10:37:38.754: [    CSSD][1097337152]clssgmSendShutdown: I/O capable proc (0x26521a0), pid (1118), iocapables 6, client (0x253c7f0)
2012-04-10 10:37:38.755: [    CSSD][1097337152]clssgmSendShutdown: Aborting client (0x2664620) proc (0x26521a0), iocapables 7.
2012-04-10 10:37:38.755: [    CSSD][1097337152]clssgmSendShutdown: I/O capable proc (0x26521a0), pid (1118), iocapables 7, client (0x2664620)
2012-04-10 10:37:38.755: [    CSSD][1097337152]clssgmSendShutdown: Aborting client (0x264d5a0) proc (0x2674aa0), iocapables 8.
2012-04-10 10:37:38.755: [    CSSD][1097337152]clssgmSendShutdown: I/O capable proc (0x2674aa0), pid (1185), iocapables 8, client (0x264d5a0)
2012-04-10 10:37:38.755: [    CSSD][1097337152]clssgmSendShutdown: Aborting client (0x26281e0) proc (0x2674aa0), iocapables 9.
2012-04-10 10:37:38.755: [    CSSD][1097337152]clssgmSendShutdown: I/O capable proc (0x2674aa0), pid (1185), iocapables 9, client (0x26281e0)
2012-04-10 10:37:38.755: [    CSSD][1097337152]clssgmSendShutdown: Aborting client (0x2a26480) proc (0x253bfa0), iocapables 10.
2012-04-10 10:37:38.755: [    CSSD][1097337152]clssgmSendShutdown: I/O capable proc (0x253bfa0), pid (1142), iocapables 10, client (0x2a26480)
2012-04-10 10:37:38.755: [    CSSD][1097337152]clssgmSendShutdown: Aborting client (0x2645470) proc (0x253bfa0), iocapables 11.
2012-04-10 10:37:38.755: [    CSSD][1097337152]clssgmSendShutdown: I/O capable proc (0x253bfa0), pid (1142), iocapables 11, client (0x2645470)
2012-04-10 10:37:38.755: [    CSSD][1097337152]clssgmSendShutdown: Aborting client (0x2a480c0) proc (0x253bfa0), iocapables 12.
2012-04-10 10:37:38.755: [    CSSD][1097337152]clssgmSendShutdown: I/O capable proc (0x253bfa0), pid (1142), iocapables 12, client (0x2a480c0)
2012-04-10 10:37:38.755: [    CSSD][1097337152]clssgmSendShutdown: Aborting client (0x2a47cb0) proc (0x2667110), iocapables 13.
2012-04-10 10:37:38.755: [    CSSD][1097337152]clssgmSendShutdown: I/O capable proc (0x2667110), pid (1162), iocapables 13, client (0x2a47cb0)
2012-04-10 10:37:38.755: [    CSSD][1097337152]clssgmSendShutdown: Aborting client (0x2a58290) proc (0x2667110), iocapables 14.
2012-04-10 10:37:38.755: [    CSSD][1097337152]clssgmSendShutdown: I/O capable proc (0x2667110), pid (1162), iocapables 14, client (0x2a58290)
2012-04-10 10:37:38.755: [    CSSD][1097337152]clssgmSendShutdown: Aborting client (0x2a479f0) proc (0x2667110), iocapables 15.
2012-04-10 10:37:38.755: [    CSSD][1097337152]clssgmSendShutdown: I/O capable proc (0x2667110), pid (1162), iocapables 15, client (0x2a479f0)
2012-04-10 10:37:38.755: [    CSSD][1097337152]clssgmSendShutdown: Aborting client (0x2a3c7e0) proc (0x2667110), iocapables 16.
2012-04-10 10:37:38.755: [    CSSD][1097337152]clssgmSendShutdown: I/O capable proc (0x2667110), pid (1162), iocapables 16, client (0x2a3c7e0)
2012-04-10 10:37:38.755: [    CSSD][1097337152]clssgmSendShutdown: Aborting client (0x2ac71b0) proc (0x2a6b5e0), iocapables 17.
2012-04-10 10:37:38.755: [    CSSD][1097337152]clssgmSendShutdown: I/O capable proc (0x2a6b5e0), pid (1205), iocapables 17, client (0x2ac71b0)
2012-04-10 10:37:38.755: [    CSSD][1097337152]clssgmSendShutdown: Aborting client (0x2a78da0) proc (0x2a6b5e0), iocapables 18.
2012-04-10 10:37:38.755: [    CSSD][1097337152]clssgmSendShutdown: I/O capable proc (0x2a6b5e0), pid (1205), iocapables 18, client (0x2a78da0)
2012-04-10 10:37:38.755: [    CSSD][1097337152]clssgmSendShutdown: Aborting client (0x264d410) proc (0x26586a0), iocapables 19.


Above, the I/O-capable client processes are shut down. CSS then begins its restart and rediscovers the voting files:
2012-04-10 10:38:30.404: [    CSSD][1101875520]clssnmvDiskVerify: Successful discovery of 5 disks
2012-04-10 10:38:30.404: [    CSSD][1101875520]clssnmCompleteInitVFDiscovery: Completing initial voting file discovery
2012-04-10 10:38:30.404: [    CSSD][1101875520]clssnmCompleteVFDiscovery: Completing voting file discovery
2012-04-10 10:38:30.404: [    CSSD][1101875520]clssnmvDiskStateChange: state from discovered to pending disk /dev/asm-diskg
2012-04-10 10:38:30.404: [    CSSD][1101875520]clssnmvDiskStateChange: state from discovered to pending disk /dev/asm-diskh
2012-04-10 10:38:30.404: [    CSSD][1101875520]clssnmvDiskStateChange: state from discovered to pending disk /dev/asm-diski
2012-04-10 10:38:30.404: [    CSSD][1101875520]clssnmvDiskStateChange: state from discovered to pending disk /dev/asm-diskj
2012-04-10 10:38:30.404: [    CSSD][1101875520]clssnmvDiskStateChange: state from discovered to pending disk /dev/asm-diskk
2012-04-10 10:38:30.404: [    CSSD][1101875520]clssnmvDiskStateChange: state from pending to configured disk /dev/asm-diskg
2012-04-10 10:38:30.404: [    CSSD][1101875520]clssnmvDiskStateChange: state from pending to configured disk /dev/asm-diskh
2012-04-10 10:38:30.404: [    CSSD][1101875520]clssnmvDiskStateChange: state from pending to configured disk /dev/asm-diski
2012-04-10 10:38:30.404: [    CSSD][1101875520]clssnmvDiskStateChange: state from pending to configured disk /dev/asm-diskj
2012-04-10 10:38:30.404: [    CSSD][1101875520]clssnmvDiskStateChange: state from pending to configured disk /dev/asm-diskk
2012-04-10 10:38:30.404: [    CSSD][1101875520]clssnmCompleteVFDiscovery: Committed configuration for CIN 0:1320713333:0
2012-04-10 10:38:30.404: [    CSSD][1101875520]  misscount          30    reboot latency      3
2012-04-10 10:38:30.404: [    CSSD][1101875520]  long I/O timeout  200    short I/O timeout  27
2012-04-10 10:38:30.404: [    CSSD][1101875520]  diagnostic wait     0  active version 11.2.0.3.0
2012-04-10 10:38:30.404: [    CSSD][1101875520]  Listing unique IDs for 5 voting files:
2012-04-10 10:38:30.404: [    CSSD][1101875520]    voting file 1: a853d620-4bbc4fea-bfd8c73d-4c3b3001
2012-04-10 10:38:30.404: [    CSSD][1101875520]    voting file 2: a5b37704-c3574f0f-bf21d1d9-f58c4a6b
2012-04-10 10:38:30.404: [    CSSD][1101875520]    voting file 3: 36e5c51f-f0294fc3-bf2a0422-66650331
2012-04-10 10:38:30.404: [    CSSD][1101875520]    voting file 4: af337d15-12824fe4-bf6ad452-83517aaa
2012-04-10 10:38:30.405: [    CSSD][1101875520]    voting file 5: 3c4a349e-2e304ff6-bf64b2b1-c9d9cf5d
2012-04-10 10:38:30.405: [    CSSD][48535264]clssnmOpenGIPCEndp: opening cluster listener on gipc://vrh2:nm_vrh-cluster
2012-04-10 10:38:30.405: [GIPCHGEN][48535264] gipchaInternalRegister: Initializing HA GIPC
2012-04-10 10:38:30.405: [GIPCHGEN][48535264] gipchaNodeCreate: adding new node 0x12b2260 { host '', haName 'CSS_vrh-cluster', srcLuid 77a3cf80-00000000, dstLuid 00000000-00000000 numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [0 : 0], createTime 779563194, sentRegister 0, localMonitor 0, flags 0x1 }
2012-04-10 10:38:30.405: [    GPNP][48535264]clsgpnp_Init: [at clsgpnp0.c:585] '/g01/11.2.0/grid' in effect as GPnP home base.
2012-04-10 10:38:30.405: [    GPNP][48535264]clsgpnp_Init: [at clsgpnp0.c:619] GPnP pid=9756, GPNP comp tracelevel=1, depcomp tracelevel=0, tlsrc:ORA_DAEMON_LOGGING_LEVELS, apitl:0, complog:1, tstenv:0, devenv:0, envopt:0, flags=0
2012-04-10 10:38:30.411: [    GIPC][48535264] gipcCheckInitialization: possible incompatible non-threaded init from [clsgpnp0.c : 769], original from [clsssc.c : 970]
2012-04-10 10:38:30.411: [    GPNP][48535264]clsgpnpkwf_initwfloc: [at clsgpnpkwf.c:399] Using FS Wallet Location : /g01/11.2.0/grid/gpnp/vrh2/wallets/peer/

[   CLWAL][48535264]clsw_Initialize: OLR initlevel [30000]
2012-04-10 10:38:30.424: [    GPNP][48535264]clsgpnp_getCachedProfileEx: [at clsgpnp.c:613] Result: (26) CLSGPNP_NO_PROFILE. Can't get offline GPnP service profile: local gpnpd is up and running. Use getProfile instead.
2012-04-10 10:38:30.424: [    GPNP][48535264]clsgpnp_getCachedProfileEx: [at clsgpnp.c:623] Result: (26) CLSGPNP_NO_PROFILE. Failed to get offline GPnP service profile.

2012-04-06 02:10:53.869: [GIPCHAUP][1100032320] gipchaUpperCallbackDisconnect: completed DISCONNECT ret gipcretSuccess (0), umsg 0x7f014407f2e0 { msg 0x7f01440a72a8, ret gipcretSuccess (0), flags 0x2 }, msg 0x7f01440a72a8 { type gipchaMsgTypeDisconnect (5), srcCid 00000000-00000bd8, dstCid 00000000-000020b8 }, hendp 0x7f01440206e0 [0000000000000bd8] { gipchaEndpoint : port 'gm2_vrh-cluster/5a33-b62f-a517-7187', peer 'vrh1:5a18-6ac0-8f41-c29b', srcCid 00000000-00000bd8,  dstCid 00000000-000020b8, numSend 0, maxSend 100, groupListType 2, hagroup 0x7f0144022c70, usrFlags 0x4000, flags 0x21c }
2012-04-06 02:10:53.869: [    CSSD][1089497408]clssgmSendShutdown: I/O capable proc (0x179e150), pid (5240), iocapables 3, client (0x178cf80)
2012-04-06 02:10:53.869: [    CSSD][1089497408]clssgmSendShutdown: Aborting client (0x17ca3e0) proc (0x178abc0), iocapables 4.
2012-04-06 02:10:53.869: [    CSSD][1089497408]clssgmSendShutdown: I/O capable proc (0x178abc0), pid (5255), iocapables 4, client (0x17ca3e0)
2012-04-06 02:10:53.869: [    CSSD][1089497408]clssgmSendShutdown: Aborting client (0x1788750) proc (0x178abc0), iocapables 5.
2012-04-06 02:10:53.869: [    CSSD][1089497408]clssgmSendShutdown: I/O capable proc (0x178abc0), pid (5255), iocapables 5, client (0x1788750)
2012-04-06 02:10:53.869: [    CSSD][1089497408]clssgmSendShutdown: Aborting client (0x1889e00) proc (0x178abc0), iocapables 6.
2012-04-06 02:10:53.869: [    CSSD][1089497408]clssgmSendShutdown: I/O capable proc (0x178abc0), pid (5255), iocapables 6, client (0x1889e00)
2012-04-06 02:10:53.869: [    CSSD][1089497408]clssgmSendShutdown: Aborting client (0x189e6c0) proc (0x178d020), iocapables 7.
2012-04-06 02:10:53.869: [    CSSD][1089497408]clssgmSendShutdown: I/O capable proc (0x178d020), pid (5187), iocapables 7, client (0x189e6c0)
2012-04-06 02:10:53.869: [    CSSD][1089497408]clssgmClientShutdown: sending shutdown, fence_done 1
2012-04-06 02:10:53.967: [    CSSD][1122732352]clssnmSendingThread: state(3) clusterState(0) exit
2012-04-06 02:10:53.969: [ default][1089497408]kgzf_fini: called

2012-04-06 02:10:53.969: [ default][1089497408]kgzf_fini1: completed. kgzf layer has quit.



After the cleanup completes, OHASD attempts to restart the cluster stack:


[root@vrh2 ~]# crsctl check has
CRS-4638: Oracle High Availability Services is online
[root@vrh2 ~]# crsctl check cluster
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
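
If the stack does not come back on its own, it can be restarted manually while OHASD is still online (as crsctl check has confirms above). A minimal sketch:

crsctl start cluster     # ask OHASD to restart CRS/CSS/EVM on the local node
crsctl check cluster     # verify the three daemons are back online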


The complete log has been uploaded as an attachment:

ocssd_rebootless_11203.zip (1.18 MB, downloads: 952)

6#
Posted on 2012-4-10 13:39:19

In reply to post 5#:

NIC bonding before 11.2, and HAIP from 11.2 onward, both support failover. Before 11.2, specifying multiple private networks (via the cluster_interconnects parameter) provided only load balancing, not failover.

In a failover-capable configuration, the failure of a single network heartbeat path does not trigger the split-brain resolution algorithm.
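
To check whether HAIP is actually in use on an 11.2 system, the resource and the interface roles can be inspected as below; a sketch, assuming a root shell in the Grid Infrastructure environment (the link-local 169.254.145.211 address on eth1:1 in the ifconfig output above is the HAIP address):

crsctl stat res ora.cluster_interconnect.haip -init   # HAIP resource state
oifcfg getif                                          # interfaces and their roles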


5#
Posted on 2012-4-10 12:31:27
When there are multiple heartbeat networks and one of them goes down, does that trigger STONITH, or must all of the heartbeat networks be down before the cluster takes its eviction measures?


4#
Posted on 2012-4-10 08:50:28
Great, very clear.


3#
Posted on 2012-4-9 18:47:25
Starting with GI/CRS 11.2.0.2, however, the development team improved this STONITH algorithm: when Oracle Clusterware encounters a split-brain scenario, it tries to avoid rebooting the node. Under the new algorithm, Clusterware attempts to shut down all resources on the node being evicted; in particular, every I/O-generating process is killed.

If, however, the resources or I/O-generating processes on the evicted node cannot be stopped cleanly, Oracle Clusterware may still perform a reboot, or use IPMI, to forcibly remove the node from the cluster.


For Oracle Clusterware this means that a local process enforces the removal of one or more nodes from the cluster (fencing). Until Oracle Clusterware 11g Release 2, Patch Set One (11.2.0.2) the fencing of a node was performed by a "fast reboot" of the respective server. A "fast reboot" in this context summarizes a shutdown and restart procedure that does not wait for any IO to finish or for file systems to synchronize on shutdown. With Oracle Clusterware 11g Release 2, Patch Set One (11.2.0.2) this mechanism has been changed in order to prevent such a reboot as much as possible.

Already with Oracle Clusterware 11g Release 2 this algorithm was improved so that failures of certain, Oracle RAC-required subcomponents in the cluster do not necessarily cause an immediate fencing (reboot) of a node. Instead, an attempt is made to clean up the failure within the cluster and to restart the failed subcomponent. Only if a cleanup of the failed component appears to be unsuccessful is a node reboot performed in order to force a cleanup.

With Oracle Clusterware 11g Release 2, Patch Set One (11.2.0.2) further improvements were made so that Oracle Clusterware will try to prevent a split-brain without rebooting the node. It thereby implements a standing requirement from those customers who were requesting to preserve the node and to prevent a reboot, since the node runs applications not managed by Oracle Clusterware, which would otherwise be forcibly shut down by the reboot of a node. With the new algorithm, when a decision is made to evict a node from the cluster, Oracle Clusterware will first attempt to shut down all resources on the machine that was chosen to be the subject of an eviction. Especially IO generating processes are killed and it is ensured that those processes are completely stopped before continuing. If, for some reason, not all resources can be stopped or IO generating processes cannot be stopped completely, Oracle Clusterware will still perform a reboot or use IPMI to forcibly evict the node from the cluster.

If all resources can be stopped and all IO generating processes can be killed, Oracle Clusterware will shut itself down on the respective node, but will attempt to restart after the stack has been stopped. The restart is initiated by the Oracle High Availability Services Daemon, which has been introduced with Oracle Clusterware 11g Release 2.



From: "Oracle Clusterware 11g Release 2", An Oracle White Paper, September 2010
http://www.oracle.com/technetwor ... l2-owp-1-129843.pdf


2#
Posted on 2012-4-9 18:38:33
Before 11.2.0.2, Oracle Clusterware used the traditional STONITH (Shoot The Other Node In The Head) algorithm:

Traditionally, Oracle Clusterware uses a STONITH (Shoot The Other Node In The Head) comparable fencing algorithm to ensure data integrity in cases, in which cluster integrity is endangered and split-brain scenarios need to be prevented. For Oracle Clusterware this means that a local process enforces the removal of one or more nodes from the cluster (fencing).

During the split-brain check, the Reconfig Manager identifies the nodes that have a disk heartbeat but no network heartbeat, and uses the network heartbeat (where still possible) together with the disk heartbeat information to count the nodes in each competing sub-cluster. Two factors then decide which sub-cluster survives (see the sketch after this list):

    The sub-cluster with the largest number of nodes wins.
    If the sub-clusters are of equal size, the sub-cluster containing the lowest node number wins; in a two-node RAC, for example, node 1 always survives.

The losing side is then I/O-fenced (remote power reset) using the STONITH algorithm.
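
As a toy illustration only (not Oracle code), the two election rules above can be sketched in shell, with each sub-cluster given as a comma-separated list of node numbers:

#!/bin/bash
# pick_survivor "1,3" "2,4,5" prints the surviving sub-cluster:
# the larger sub-cluster wins; on a tie, the one holding the
# lowest node number wins.
pick_survivor() {
    local a="$1" b="$2" na nb mina minb
    na=$(echo "$a" | tr ',' '\n' | wc -l)
    nb=$(echo "$b" | tr ',' '\n' | wc -l)
    if [ "$na" -gt "$nb" ]; then echo "$a"; return; fi
    if [ "$nb" -gt "$na" ]; then echo "$b"; return; fi
    mina=$(echo "$a" | tr ',' '\n' | sort -n | head -1)
    minb=$(echo "$b" | tr ',' '\n' | sort -n | head -1)
    if [ "$mina" -lt "$minb" ]; then echo "$a"; else echo "$b"; fi
}

pick_survivor "1" "2"        # two-node RAC: prints 1, node 1 always survives
pick_survivor "1,3" "2,4,5"  # prints 2,4,5, the larger sub-cluster wins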



An additional note:

STONITH is a commonly used I/O fencing algorithm and serves as the remote node-shutdown interface required by RAC. The idea is very simple: when software running on one node needs to guarantee that other nodes in the cluster can no longer use some resource, it pulls the other node's power plug. It is a simple, reliable, and somewhat brutal approach. STONITH's advantage is that it has no special hardware requirements and does not limit cluster scalability.
