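11g RAC instance restart
Hi ML: this database on our production system has had an instance restart twice in a row.
The relevant error logs:
From 9:02:31 onward ncxdb11 has no alert log entries until the database starts up at 09:08. Before that it kept reporting "has a disk HB, but no network HB". OSWbb shows some latency on the heartbeat network, but it has been at that level before as well: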
ncxdb11 oswprvtnet:
zzz ***Mon Jan 20 09:01:29 GMT+08:00 2014
trying to get source for ncxdb11-pri
source should be 172.32.204.29
traceroute to ncxdb11-pri (172.32.204.29) from 172.32.204.29 (172.32.204.29), 30 hops max
outgoing MTU = 1500
1 ncxdb11-pri (172.32.204.29) 58 ms 0 ms 0 ms
trying to get source for ncxdb12-pri
source should be 172.32.204.29
traceroute to ncxdb12-pri (172.32.204.30) from 172.32.204.29 (172.32.204.29), 30 hops max
outgoing MTU = 1500
1 ncxdb12-pri (172.32.204.30) 58 ms 0 ms 1 ms
zzz ***Mon Jan 20 09:02:01 GMT+08:00 2014
trying to get source for ncxdb11-pri
source should be 172.32.204.29
traceroute to ncxdb11-pri (172.32.204.29) from 172.32.204.29 (172.32.204.29), 30 hops max
outgoing MTU = 1500
1 ncxdb11-pri (172.32.204.29) 45 ms 0 ms 0 ms
trying to get source for ncxdb12-pri
source should be 172.32.204.29
traceroute to ncxdb12-pri (172.32.204.30) from 172.32.204.29 (172.32.204.29), 30 hops max
outgoing MTU = 1500
1 ncxdb12-pri (172.32.204.30) 46 ms 0 ms 0 ms
zzz ***Mon Jan 20 09:10:29 GMT+08:00 2014
trying to get source for ncxdb11-pri
source should be 172.32.204.29
traceroute to ncxdb11-pri (172.32.204.29) from 172.32.204.29 (172.32.204.29), 30 hops max
outgoing MTU = 1500
1 ncxdb11-pri (172.32.204.29) 33 ms 0 ms 0 ms
trying to get source for ncxdb12-pri
source should be 172.32.204.29
traceroute to ncxdb12-pri (172.32.204.30) from 172.32.204.29 (172.32.204.29), 30 hops max
outgoing MTU = 1500
1 ncxdb12-pri (172.32.204.30) 32 ms 0 ms 0 ms
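To compare snapshots like the ones above more easily, here is a minimal Python sketch that pulls the per-probe latencies out of oswprvtnet text. The function name `parse_oswprvtnet` and the regular expressions are my own and only assume the layout shown in this excerpt (a `zzz ***` timestamp line followed by traceroute hops); other OSWbb versions may format the file differently.

```python
import re

# A few lines copied from the excerpt above.
SAMPLE = """\
zzz ***Mon Jan 20 09:01:29 GMT+08:00 2014
traceroute to ncxdb11-pri (172.32.204.29) from 172.32.204.29 (172.32.204.29), 30 hops max
1 ncxdb11-pri (172.32.204.29) 58 ms 0 ms 0 ms
1 ncxdb12-pri (172.32.204.30) 58 ms 0 ms 1 ms
zzz ***Mon Jan 20 09:02:01 GMT+08:00 2014
1 ncxdb11-pri (172.32.204.29) 45 ms 0 ms 0 ms
1 ncxdb12-pri (172.32.204.30) 46 ms 0 ms 0 ms
"""

STAMP = re.compile(r"^zzz \*\*\*(.+)$")
HOP = re.compile(r"^\s*\d+\s+(\S+) \(([\d.]+)\)((?:\s+\d+ ms)+)")


def parse_oswprvtnet(text):
    """Return (snapshot time, host, [probe latencies in ms]) for every hop line."""
    rows, stamp = [], None
    for line in text.splitlines():
        m = STAMP.match(line)
        if m:
            stamp = m.group(1)
            continue
        m = HOP.match(line)
        if m:
            probes = [int(x) for x in re.findall(r"(\d+) ms", m.group(3))]
            rows.append((stamp, m.group(1), probes))
    return rows


if __name__ == "__main__":
    for stamp, host, probes in parse_oswprvtnet(SAMPLE):
        print(stamp, host, probes)
```

In the excerpt only the first probe of each traceroute is slow (58, 45/46 and 33/32 ms); the second and third probes are 0 to 1 ms. Note also that the snapshots shown jump from 09:02:01 to 09:10:29.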
ncxdb11 ocssd.log:
2014-01-20 09:02:20.226: [ CSSD]clssnmPollingThread: node ncxdb12 (2) at 50% heartbeat fatal, removal in 14.903 seconds
2014-01-20 09:02:20.226: [ CSSD]clssnmPollingThread: node ncxdb12 (2) is impending reconfig, flag 2229260, misstime 15097
2014-01-20 09:02:20.227: [ CSSD]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
2014-01-20 09:02:20.227: [ CSSD]clssnmvDHBValidateNcopy: node 2, ncxdb12, has a disk HB, but no network HB, DHB has rcfg 254422787, wrtcnt, 93434934, LATS 670757723, lastSeqNo 93427805, uniqueness 1382516743, timestamp 1390179739/670661093
2014-01-20 09:02:20.227: [ CSSD]clssnmvDHBValidateNcopy: node 2, ncxdb12, has a disk HB, but no network HB, DHB has rcfg 254422787, wrtcnt, 93434936, LATS 670757723, lastSeqNo 93427803, uniqueness 1382516743, timestamp 1390179740/670661392
2014-01-20 09:02:20.366: [ CSSD]clssnmvDiskPing: Writing with status 0x3, timestamp 670757862/1390179740
2014-01-20 09:02:20.736: [ CSSD]clssnmvDiskPing: Writing with status 0x3, timestamp 670758232/1390179740
2014-01-20 09:02:20.776: [ CSSD]clssnmvDiskPing: Writing with status 0x3, timestamp 670758272/1390179740
2014-01-20 09:02:20.876: [ CSSD]clssnmvDiskPing: Writing with status 0x3, timestamp 670758372/1390179740
2014-01-20 09:02:21.227: [ CSSD]clssnmvDHBValidateNcopy: node 2, ncxdb12, has a disk HB, but no network HB, DHB has rcfg 254422787, wrtcnt, 93434939, LATS 670758723, lastSeqNo 93434936, uniqueness 1382516743, timestamp 1390179741/670662392
2014-01-20 09:02:21.236: [ CSSD]clssnmvDiskPing: Writing with status 0x3, timestamp 670758732/1390179741
2014-01-20 09:02:21.277: [ CSSD]clssnmvDiskPing: Writing with status 0x3, timestamp 670758772/1390179741
2014-01-20 09:02:21.376: [ CSSD]clssnmvDiskPing: Writing with status 0x3, timestamp 670758872/1390179741
2014-01-20 09:02:21.729: [ CSSD]clssnmvDHBValidateNcopy: node 2, ncxdb12, has a disk HB, but no network HB, DHB has rcfg 254422787, wrtcnt, 93434941, LATS 670759224, lastSeqNo 93405380, uniqueness 1382516743, timestamp 1390179741/670662648
2014-01-20 09:02:21.737: [ CSSD]clssnmvDHBValidateNcopy: node 2, ncxdb12, has a disk HB, but no network HB, DHB has rcfg 254422787, wrtcnt, 93434942, LATS 670759233, lastSeqNo 93434939, uniqueness 1382516743, timestamp 1390179741/670662893
2014-01-20 09:02:21.738: [ CSSD]clssnmvDiskPing: Writing with status 0x3, timestamp 670759233/1390179741
2014-01-20 09:02:21.778: [ CSSD]clssnmvDiskPing: Writing with status 0x3, timestamp 670759273/1390179741
2014-01-20 09:02:21.879: [ CSSD]clssnmvDiskPing: Writing with status 0x3, timestamp 670759375/1390179741
2014-01-20 09:02:22.231: [ CSSD]clssnmvDHBValidateNcopy: node 2, ncxdb12, has a disk HB, but no network HB, DHB has rcfg 254422787, wrtcnt, 93434944, LATS 670759727, lastSeqNo 93434941, uniqueness 1382516743, timestamp 1390179741/670663148
2014-01-20 09:02:22.231: [ CSSD]clssnmvDHBValidateNcopy: node 2, ncxdb12, has a disk HB, but no network HB, DHB has rcfg 254422787, wrtcnt, 93434945, LATS 670759727, lastSeqNo 93434942, uniqueness 1382516743, timestamp 1390179742/670663399
2014-01-20 09:02:22.240: [ CSSD]clssnmvDiskPing: Writing with status 0x3, timestamp 670759736/1390179742
2014-01-20 09:02:22.280: [ CSSD]clssnmvDiskPing: Writing with status 0x3, timestamp 670759775/1390179742
2014-01-20 09:02:22.380: [ CSSD]clssnmvDiskPing: Writing with status 0x3, timestamp 670759876/1390179742
2014-01-20 09:02:22.480: [ CSSD]clssnmSendingThread: sending status msg to all nodes
2014-01-20 09:02:22.480: [ CSSD]clssnmSendingThread: sent 4 status msgs to all nodes
2014-01-20 09:02:22.734: [ CSSD]clssnmvDHBValidateNcopy: node 2, ncxdb12, has a disk HB, but no network HB, DHB has rcfg 254422787, wrtcnt, 93434947, LATS 670760230, lastSeqNo 93434944, uniqueness 1382516743, timestamp 1390179742/670663651
2014-01-20 09:02:22.734: [ CSSD]clssnmvDHBValidateNcopy: node 2, ncxdb12, has a disk HB, but no network HB, DHB has rcfg 254422787, wrtcnt, 93434948, LATS 670760230, lastSeqNo 93434945, uniqueness 1382516743, timestamp 1390179742/670663900
2014-01-20 09:02:22.744: [ CSSD]clssnmvDiskPing: Writing with status 0x3, timestamp 670760240/1390179742
2014-01-20 09:02:22.781: [ CSSD]clssnmvDiskPing: Writing with status 0x3, timestamp 670760277/1390179742
2014-01-20 09:02:22.881: [ CSSD]clssnmvDiskPing: Writing with status 0x3, timestamp 670760377/1390179742
2014-01-20 09:02:23.232: [ CSSD]clssnmvDHBValidateNcopy: node 2, ncxdb12, has a disk HB, but no network HB, DHB has rcfg 254422787, wrtcnt, 93434951, LATS 670760727, lastSeqNo 93434948, uniqueness 1382516743, timestamp 1390179743/670664402
2014-01-20 09:02:23.245: [ CSSD]clssnmvDiskPing: Writing with status 0x3, timestamp 670760741/1390179743
2014-01-20 09:02:23.282: [ CSSD]clssnmvDiskPing: Writing with status 0x3, timestamp 670760777/1390179743
2014-01-20 09:02:23.386: [ CSSD]clssnmvDiskPing: Writing with status 0x3, timestamp 670760881/1390179743
2014-01-20 09:02:23.733: [ CSSD]clssnmvDHBValidateNcopy: node 2, ncxdb12, has a disk HB, but no network HB, DHB has rcfg 254422787, wrtcnt, 93434953, LATS 670761229, lastSeqNo 93434947, uniqueness 1382516743, timestamp 1390179743/670664657
2014-01-20 09:02:23.735: [ CSSD]clssnmvDHBValidateNcopy: node 2, ncxdb12, has a disk HB, but no network HB, DHB has rcfg 254422787, wrtcnt, 93434954, LATS 670761231, lastSeqNo 93434951, uniqueness 1382516743, timestamp 1390179743/670664907
2014-01-20 09:02:23.746: [ CSSD]clssnmvDiskPing: Writing with status 0x3, timestamp 670761242/1390179743
2014-01-20 09:02:23.784: [ CSSD]clssnmvDiskPing: Writing with status 0x3, timestamp 670761279/1390179743
2014-01-20 09:02:23.887: [ CSSD]clssnmvDiskPing: Writing with status 0x3, timestamp 670761382/1390179743
2014-01-20 09:02:24.235: [ CSSD]clssnmvDHBValidateNcopy: node 2, ncxdb12, has a disk HB, but no network HB, DHB has rcfg 254422787, wrtcnt, 93434956, LATS 670761730, lastSeqNo 93434953, uniqueness 1382516743, timestamp 1390179743/670665157
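As a cross-check on the two clssnmPollingThread lines at the top of this excerpt, the following Python sketch projects when node 2 would be removed and adds up the two intervals. The helper names are made up for illustration, and reading `misstime` as "milliseconds of network heartbeat already missed" is my assumption; the log itself does not say so.

```python
import re
from datetime import datetime, timedelta

# Two lines copied from the ncxdb11 ocssd.log excerpt.
FATAL = ("2014-01-20 09:02:20.226: [ CSSD]clssnmPollingThread: node ncxdb12 (2) "
         "at 50% heartbeat fatal, removal in 14.903 seconds")
RECONFIG = ("2014-01-20 09:02:20.226: [ CSSD]clssnmPollingThread: node ncxdb12 (2) "
            "is impending reconfig, flag 2229260, misstime 15097")


def projected_removal(fatal_line):
    """Log timestamp plus the 'removal in N seconds' value."""
    m = re.match(r"(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d\.\d+): .*removal in ([\d.]+) seconds",
                 fatal_line)
    logged_at = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
    return logged_at + timedelta(seconds=float(m.group(2)))


def implied_timeout_ms(fatal_line, reconfig_line):
    """Already-missed time plus remaining time, in milliseconds (assumed meaning)."""
    remaining = float(re.search(r"removal in ([\d.]+) seconds", fatal_line).group(1))
    missed_ms = int(re.search(r"misstime (\d+)", reconfig_line).group(1))
    return missed_ms + round(remaining * 1000)


if __name__ == "__main__":
    print(projected_removal(FATAL))
    print(implied_timeout_ms(FATAL, RECONFIG))
```

The projected removal time works out to 09:02:35.129, and 15097 ms + 14903 ms is exactly 30000 ms, which would fit a 30-second CSS misscount if the assumption about `misstime` holds.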
In the ocssd log of ncxdb11, the "no network HB" errors start at 09:02:20.
2014-01-20 09:02:35.463: [ CSSD]clssnmCheckDskInfo: My cohort: 2
2014-01-20 09:02:35.463: [ CSSD]clssnmCheckDskInfo: Surviving cohort: 1
2014-01-20 09:02:35.463: [ CSSD](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 2, ncxdb12, is smaller than cohort of 1 nodes led by node 1, ncxdb11, based on map type 2
2014-01-20 09:02:35.463: [ CSSD]clssgmQueueGrockEvent: groupName(IGCXDB1SYS$USERS) count(2) master(1) event(2), incarn 8, mbrc 2, to member 2, events 0x0, state 0x0
2014-01-20 09:02:35.463: [ CSSD]###################################
2014-01-20 09:02:35.463: [ CSSD]clssgmQueueGrockEvent: groupName(crs_version) count(3) master(1) event(2), incarn 15, mbrc 3, to member 0, events 0x0, state 0x0
2014-01-20 09:02:35.463: [ CSSD]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread
ncxdb12 ocssd.log:
2014-01-20 09:02:35.463: [ CSSD]clssnmCheckDskInfo: My cohort: 2
2014-01-20 09:02:35.463: [ CSSD]clssnmCheckDskInfo: Surviving cohort: 1
2014-01-20 09:02:35.463: [ CSSD](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 2, ncxdb12, is smaller than cohort of 1 nodes led by node 1, ncxdb11, based on map type 2
2014-01-20 09:02:35.463: [ CSSD]clssgmQueueGrockEvent: groupName(IGCXDB1SYS$USERS) count(2) master(1) event(2), incarn 8, mbrc 2, to member 2, events 0x0, state 0x0
2014-01-20 09:02:35.463: [ CSSD]###################################
2014-01-20 09:02:35.463: [ CSSD]clssgmQueueGrockEvent: groupName(crs_version) count(3) master(1) event(2), incarn 15, mbrc 3, to member 0, events 0x0, state 0x0
2014-01-20 09:02:35.463: [ CSSD]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread
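Why is ncxdb12's node number < ncxdb11's node number? And why was ncxdb11 the one evicted?
Also, the LMON trace on ncxdb12 contains: kjxggpoll: change db group poll time to 50 ms
How should this message be interpreted?
OS version: AIX 6.1
DB: 11.2.0.3
The ocssd.log is needed; please package it and upload it.
I have uploaded the ocssd.log, the alert log, the LMON trace and so on. We hope to find the root cause from this case; IBM is now analyzing it together with us. Node 2 went down once before, and the SR concluded that there was a problem with the storage access path, but the errors this time are not the same as last time.
(Last edited by baolei on 2014-1-20 17:10.)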
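Below is the advice from our own engineer. Personally I think it is a bit casual. I will not post the analysis itself because it is very long. The conclusions are:
1. A split-brain occurred at 09:02:20, and node 1 was evicted from the cluster at 09:02:35, which caused the node 1 host to crash.
2. After the split-brain at 09:02:35, because node 1 had been evicted, node 2 took over for node 1 at 09:02:36. However, the ASM LMON process took more than 50 ms in the takeover vote, so the ASM PMON process wrongly concluded that LMON was hung. PMON then terminated abnormally, which finally crashed the database on node 2.
Recommendation:
This failure was mainly a chain of problems caused by an abnormal heartbeat network. We recommend that the host team check why the network heartbeat latency was far above its normal value.
The earliest record in the logs you provided is from January 14.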
timeline:
node 1 2014-01-20 09:02:20.227: clssnmvDHBValidateNcopy: node 2, ncxdb12, has a disk HB, but no network HB,
node 2 2014-01-20 09:02:20.769 clssnmvDHBValidateNcopy: node 1, ncxdb11, has a disk HB, but no network HB
node 1 2014-01-20 09:02:31.423: [ CSSD]clssnmvDiskPing: Writing with status 0x3, timestamp 670768918/1390179751 (last log entry before the reboot)
node 2 2014-01-20 09:02:46.401: [ CSSD]clssgmClientShutdown: sending shutdown, fence_done 1 (CSSD shuts down after I/O fencing)
node 2 2014-01-20 09:02:54.091: [ CSSD]clssscmain: Starting CSS daemon, version 11.2.0.3.0, in (clustered) mode with uniqueness value 1390179774 (CSS starts after the CRS shutdown)
node 1 2014-01-20 09:08:22.972: [ CSSD]clssscmain: Starting CSS daemon, version 11.2.0.3.0, in (clustered) mode with uniqueness value 1390180102 (CSS starts after the reboot)
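To make the gaps in this timeline explicit, here is a short Python sketch that sorts the quoted events and measures the intervals between them. The `EVENTS` list and the helper `gap` are illustrative only; the timestamps are copied from the lines above, and the labels are shorthand of my own.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# (node, ocssd.log timestamp, shorthand label) taken from the timeline above.
EVENTS = [
    ("node 1", "2014-01-20 09:02:20.227", "first 'no network HB' about node 2"),
    ("node 2", "2014-01-20 09:02:20.769", "first 'no network HB' about node 1"),
    ("node 1", "2014-01-20 09:02:31.423", "last entry before reboot"),
    ("node 2", "2014-01-20 09:02:46.401", "CSSD shutdown"),
    ("node 2", "2014-01-20 09:02:54.091", "CSS daemon start"),
    ("node 1", "2014-01-20 09:08:22.972", "CSS daemon start"),
]


def gap(events, node, first_label, second_label):
    """Time between two labelled events of the same node."""
    stamps = {label: datetime.strptime(ts, FMT)
              for n, ts, label in events if n == node}
    return stamps[second_label] - stamps[first_label]


if __name__ == "__main__":
    for node, ts, label in sorted(EVENTS, key=lambda e: e[1]):
        print(ts, node, label)
    print("node 1 silent for", gap(EVENTS, "node 1", "last entry before reboot", "CSS daemon start"))
    print("node 2 CSSD down for", gap(EVENTS, "node 2", "CSSD shutdown", "CSS daemon start"))
```

On these numbers node 1's ocssd.log is silent for about 5 minutes 51.5 seconds, while node 2's CSS daemon is back about 7.7 seconds after its shutdown. This only restates the log timestamps; it does not say why node 1 stopped logging.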
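node 1: there is not much usable information.
node 2: here we can see that it decided on "Aborting local node to avoid splitbrain", because the weight of its sub-cluster is lower than that of node 1.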
2014-01-20 09:02:35.463: [ CSSD]clssnmCheckDskInfo: Checking disk info...
2014-01-20 09:02:35.463: [ CSSD]clssnmCheckSplit: Node 1, ncxdb11, is alive, DHB (1390179755, 670772496) more than disk timeout of 27000 after the last NHB (1390179725, 670742947)
2014-01-20 09:02:35.463: [ CSSD]clssnmCheckDskInfo: My cohort: 2
2014-01-20 09:02:35.463: [ CSSD]clssnmCheckDskInfo: Surviving cohort: 1
2014-01-20 09:02:35.463: [ CSSD](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 2, ncxdb12, is smaller than cohort of 1 nodes led by node 1, ncxdb11, based on map type 2
2014-01-20 09:02:35.463: [ CSSD]clssgmQueueGrockEvent: groupName(IGCXDB1SYS$USERS) count(2) master(1) event(2), incarn 8, mbrc 2, to member 2, events 0x0, state 0x0
2014-01-20 09:02:35.463: [ CSSD]###################################
2014-01-20 09:02:35.463: [ CSSD]clssgmQueueGrockEvent: groupName(crs_version) count(3) master(1) event(2), incarn 15, mbrc 3, to member 0, events 0x0, state 0x0
2014-01-20 09:02:35.463: [ CSSD]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread
2014-01-20 09:02:35.464: [ CSSD]###################################
2014-01-20 09:02:35.464: [ CSSD](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally
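The two checks visible in the node 2 log above can be sketched in Python. Both functions are illustrative. `dhb_nhb_gap_ms` assumes that the second number in each `(epoch, ticks)` pair is a millisecond counter, which is consistent with the clssnmvDiskPing timestamps elsewhere in the thread but is not stated by the log. `surviving_cohort` encodes the commonly described 11.2 rule (the larger cohort survives; on a tie, the cohort holding the lowest node number survives); that rule is an assumption here, not Oracle's actual implementation.

```python
import re

# clssnmCheckSplit line copied from the node 2 ocssd.log above.
CHECK_SPLIT = ("clssnmCheckSplit: Node 1, ncxdb11, is alive, DHB (1390179755, 670772496) "
               "more than disk timeout of 27000 after the last NHB (1390179725, 670742947)")


def dhb_nhb_gap_ms(line):
    """Return (ticks between last NHB and the DHB, disk timeout), both in ms (assumed unit)."""
    m = re.search(r"DHB \((\d+), (\d+)\) more than disk timeout of (\d+) "
                  r"after the last NHB \((\d+), (\d+)\)", line)
    dhb_ticks, timeout, nhb_ticks = int(m.group(2)), int(m.group(3)), int(m.group(5))
    return dhb_ticks - nhb_ticks, timeout


def surviving_cohort(cohorts):
    """Assumed tie-break: biggest cohort wins; equal sizes fall back to lowest node number."""
    return min(cohorts, key=lambda nodes: (-len(nodes), min(nodes)))


if __name__ == "__main__":
    gap, timeout = dhb_nhb_gap_ms(CHECK_SPLIT)
    print(gap, timeout, gap > timeout)
    print(surviving_cohort([[2], [1]]))
```

With the values from the log, node 1's disk heartbeat is 29549 ms newer than its last network heartbeat, which exceeds the 27000 ms disk timeout. Under the assumed tie-break, two one-node cohorts resolve in favour of node 1, which matches the message printed by node 2.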
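The odd thing here is that node 1 rebooted without its ocssd.log ever showing a clssnmCheckDskInfo section. In theory, if both nodes had run clssnmCheckDskInfo, node 1 would have won and survived.
But node 1 simply rebooted at around 09:02:31, and at that point the two nodes had not yet negotiated through the voting disk which one would survive.
Questions: are the clocks on the two nodes in sync?
What does crsctl query css votedisk return?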
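PS: In the logs you provided, the "no network HB" condition appears only on 2014-01-20; I did not see it at any other time.
Liu Maclean (刘相兵) posted on 2014-1-20 20:30:
The earliest record in the logs you provided is from January 14.
timeline:
I find this very strange as well. In theory, with both nodes still having a disk HB, ncxdb12 should have been the one to go down. How can ncxdb11's node number be larger than ncxdb12's? Also, node 1 has no log entries after 9:02:31. I do not know whether this reboot was triggered by a person or by the system; I asked around at the time and nobody would have done it.
I will post the crsctl query css votedisk output a bit later. I think I built this database myself.
Liu Maclean (刘相兵) posted on 2014-1-20 20:40:
PS: In the logs you provided, the "no network HB" condition appears only on 2014-01-20; I did not see it at any other time. ...
Yes, that is all there is in the logs; I did not see anything else.
Liu Maclean (刘相兵) posted on 2014-1-20 20:30:
The earliest record in the logs you provided is from January 14.
timeline: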
grid@ncxdb11:/home/grid$ crsctl query css votedisk
## STATE File Universal Id File Name Disk group
-- ----- ----------------- --------- ---------
1. ONLINE 5e380d62b6cb4f6ebf13846fa0e0f0c8 (/dev/rhdiskpower6001)
2. ONLINE 4397d72c00174fe7bf13acd712071024 (/dev/rhdiskpower6002)
3. ONLINE 9b1cb71a838c4ffdbfb93565d5c1cb4c (/dev/rhdiskpower6003)
Located 3 voting disk(s).
ncxdb12:[/]#crsctl query css votedisk
## STATE File Universal Id File Name Disk group
-- ----- ----------------- --------- ---------
1. ONLINE 5e380d62b6cb4f6ebf13846fa0e0f0c8 (/dev/rhdiskpower6001)
2. ONLINE 4397d72c00174fe7bf13acd712071024 (/dev/rhdiskpower6002)
3. ONLINE 9b1cb71a838c4ffdbfb93565d5c1cb4c (/dev/rhdiskpower6003)
Question: are the clocks on the two nodes in sync, as asked in post #9?
Liu Maclean (刘相兵) posted on 2014-1-21 19:47:
Question: are the clocks on the two nodes in sync, as asked in post #9?
oracle@ncxdb11:/home/oracle$ ssh ncxdb12 date;date;
Tue Jan 21 20:09:44 GMT+08:00 2014
Tue Jan 21 20:09:44 GMT+08:00 2014
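They are in sync.
Liu Maclean (刘相兵) posted on 2014-1-21 19:47:
Question: are the clocks on the two nodes in sync, as asked in post #9?
ML, have you found anything new?
1. Clearly this kind of network fault has to be brought under control, including by using techniques such as bind.
2. Set crsctl set log css CSSD:2 so that the next time this happens, node 1 produces logs when it reboots.
3. The errpt data from both nodes is needed.
Node 1: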
IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
ECCE4018 0122181714 T S fcs4 SOFTWARE PROGRAM ERROR
ECCE4018 0122181714 T S fcs6 SOFTWARE PROGRAM ERROR
ECCE4018 0122181414 T S fcs4 SOFTWARE PROGRAM ERROR
ECCE4018 0122181314 T S fcs6 SOFTWARE PROGRAM ERROR
ECCE4018 0122181314 T S fcs6 SOFTWARE PROGRAM ERROR
ECCE4018 0122181314 T S fcs6 SOFTWARE PROGRAM ERROR
ECCE4018 0122181314 T S fcs6 SOFTWARE PROGRAM ERROR
ECCE4018 0122181314 T S fcs8 SOFTWARE PROGRAM ERROR
A6DF45AA 0120090714 I O RMCdaemon The daemon is started.
2BFA76F6 0120090314 T S SYSPROC SYSTEM SHUTDOWN BY USER
9DBCFDEE 0120090614 T O errdemon ERROR LOGGING TURNED ON
E87EF1BE 0119150014 P O dumpcheck The largest dump device is too small.
E87EF1BE 0118150014 P O dumpcheck The largest dump device is too small.
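The errpt TIMESTAMP column uses the AIX MMDDhhmmYY layout, which is easy to misread next to the ocssd.log times. A small Python sketch for decoding it follows; `errpt_time` and `decoded` are names I made up, and the three rows are copied from the node 1 listing above.

```python
from datetime import datetime

# (IDENTIFIER, TIMESTAMP, RESOURCE_NAME, DESCRIPTION) rows from the node 1 errpt output.
ROWS = [
    ("A6DF45AA", "0120090714", "RMCdaemon", "The daemon is started."),
    ("2BFA76F6", "0120090314", "SYSPROC", "SYSTEM SHUTDOWN BY USER"),
    ("9DBCFDEE", "0120090614", "errdemon", "ERROR LOGGING TURNED ON"),
]


def errpt_time(stamp):
    """Decode an errpt summary timestamp (MMDDhhmmYY) into a datetime."""
    return datetime.strptime(stamp, "%m%d%H%M%y")


def decoded(rows):
    """Rows as (time, identifier, resource, description), oldest first."""
    return sorted((errpt_time(ts), ident, res, desc) for ident, ts, res, desc in rows)


if __name__ == "__main__":
    for when, ident, res, desc in decoded(ROWS):
        print(when.strftime("%Y-%m-%d %H:%M"), ident, res, desc)
```

Decoded this way, the 2BFA76F6 "SYSTEM SHUTDOWN BY USER" entry on node 1 is stamped 2014-01-20 09:03, and error logging is turned on again at 09:06. The summary format only has minute resolution, so it cannot be lined up with the ocssd.log to the second.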
ncxdb11:[/]#lsattr -El ent17
adapter_names ent4 EtherChannel Adapters True
alt_addr 0x000000000000 Alternate EtherChannel Address True
auto_recovery yes Enable automatic recovery after failover True
backup_adapter ent10 Adapter used when whole channel fails True
hash_mode default Determines how outgoing adapter is chosen True
interval long Determines interval value for IEEE 802.3ad mode True
mode standard EtherChannel mode of operation True
netaddr 0 Address to ping True
noloss_failover yes Enable lossless failover after ping failure True
num_retries 3 Times to retry ping before failing True
retry_time 1 Wait time (in seconds) between pings True
use_alt_addr no Enable Alternate EtherChannel Address True
use_jumbo_frame no Enable Gigabit Ethernet Jumbo Frames True
Node 2:
IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
E87EF1BE 0120150014 P O dumpcheck The largest dump device is too small.
A924A5FC 0120090214 P S SYSPROC SOFTWARE PROGRAM ABNORMALLY TERMINATED
E87EF1BE 0119150014 P O dumpcheck The largest dump device is too small.
E87EF1BE 0118150014 P O dumpcheck The largest dump device is too small.
ncxdb12:[/]#lsattr -El ent17
adapter_names ent4 EtherChannel Adapters True
alt_addr 0x000000000000 Alternate EtherChannel Address True
auto_recovery yes Enable automatic recovery after failover True
backup_adapter ent10 Adapter used when whole channel fails True
hash_mode default Determines how outgoing adapter is chosen True
interval long Determines interval value for IEEE 802.3ad mode True
mode standard EtherChannel mode of operation True
netaddr 0 Address to ping True
noloss_failover yes Enable lossless failover after ping failure True
num_retries 3 Times to retry ping before failing True
retry_time 1 Wait time (in seconds) between pings True
use_alt_addr no Enable Alternate EtherChannel Address True
use_jumbo_frame no Enable Gigabit Ethernet Jumbo Frames True
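Nothing abnormal there.
Liu Maclean (刘相兵) posted on 2014-1-23 15:40:
1. Clearly this kind of network fault has to be brought under control, including by using techniques such as bind.
2. Set crsctl set log css CSSD:2 so that the next time this happens ...
My colleague's conclusion does not look right either, does it?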