#1 | Posted 2013-6-28 10:21:29 | Views: 4900 | Replies: 3
Environment: RAC 11.2.0.1 on AIX 6.1
Issue: one of the nodes reboots at irregular intervals; it is not always the same node, and it happens roughly once a month.
The logs from the most recent occurrence are below.
dbs02
# errpt
IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
A924A5FC 0626225013 P S SYSPROC SOFTWARE PROGRAM ABNORMALLY TERMINATED
A924A5FC 0626225013 P S SYSPROC SOFTWARE PROGRAM ABNORMALLY TERMINATED
A6DF45AA 0626004413 I O RMCdaemon The daemon is started.
2BFA76F6 0626004113 T S SYSPROC SYSTEM SHUTDOWN BY USER
9DBCFDEE 0626004313 T O errdemon ERROR LOGGING TURNED ON
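The errpt summary alone does not show what terminated or who requested the shutdown. As a reference, the full detail records can be pulled by error identifier with the commands below (a minimal sketch; run as root on dbs02, identifiers taken from the summary above):
# errpt -a -j A924A5FC
# errpt -a -j 2BFA76F6
The detailed 2BFA76F6 (SYSTEM SHUTDOWN BY USER) record should show whether a halt or a reboot was requested, and the A924A5FC records should name the program that terminated abnormally.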
alertdbs02.log
2013-06-26 00:44:18.653
[ohasd(1966340)]CRS-2112:The OLR service started on node dbs02.
2013-06-26 00:44:19.158
[ohasd(1966340)]CRS-8017:location: /etc/oracle/lastgasp has 28 reboot advisory log files, 0 were announced and 0 errors occurred
2013-06-26 00:44:41.675
[ohasd(1966340)]CRS-2772:Server 'dbs02' has been assigned to pool 'Free'.
2013-06-26 00:45:16.970
[ctssd(4260068)]CRS-2403:The Cluster Time Synchronization Service on host dbs02 is in observer mode.
2013-06-26 00:45:17.021
[ctssd(4260068)]CRS-2407:The new Cluster Time Synchronization Service reference node is host dbs01.
2013-06-26 00:45:17.724
[ctssd(4260068)]CRS-2401:The Cluster Time Synchronization Service started on host dbs02.
2013-06-26 00:45:49.488
[crsd(3997986)]CRS-1012:The OCR service started on node dbs02.
2013-06-26 00:45:53.067
[crsd(3997986)]CRS-1201:CRSD started on node dbs02.
2013-06-26 00:50:06.051
[ctssd(4260068)]CRS-2409:The clock on host dbs02 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
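CRS-2409 only reports clock drift; because CTSS runs in observer mode (it detected an existing NTP or equivalent time service on the nodes), no adjustment is made and this message by itself does not cause an eviction. If needed, clock synchronization across the cluster can be verified with the following (a sketch, run as the Grid Infrastructure owner with the GI home in PATH):
$ cluvfy comp clocksync -n all -verbose
$ crsctl check ctss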
alert_orcl2.log
Wed Jun 26 00:46:26 2013
Starting ORACLE instance (normal)
dbs01
alertdbs01.log
-------------------------------------
2013-06-26 00:40:42.052
[ctssd(2097634)]CRS-2407:The new Cluster Time Synchronization Service reference node is host dbs01.
2013-06-26 00:41:23.794
[crsd(5046304)]CRS-5504:Node down event reported for node 'dbs02'.
2013-06-26 00:41:53.052
[crsd(5046304)]CRS-2773:Server 'dbs02' has been removed from pool 'Generic'.
2013-06-26 00:41:53.079
[crsd(5046304)]CRS-2773:Server 'dbs02' has been removed from pool 'ora.orcl'.
----------------------------
crsd.log
2013-06-26 00:40:41.222: [ CRSCCL][9773]Reconfig event received by clssgsgrpstat
2013-06-26 00:40:41.308: [ CRSCCL][9773]cclGetMemberData called
2013-06-26 00:40:41.400: [ CRSCCL][9773]Disconnecting connection to member 2 node 192.168.60.222.
2013-06-26 00:40:41.423: [ CRSCCL][9773]clsdisc con = 115939a10.
2013-06-26 00:40:41.628: [CLSFRAME][9773] CCL MEMBER LEFT:2:1:CRSD:dbs02
2013-06-26 00:40:41.664: [ CRSSE][8744] Forwarding Node Leave to PE for: dbs02
2013-06-26 00:40:41.693: [CLSFRAME][9773] Disconnected from CRSD:dbs02 process: {Absolute|Node:2|Process:1727121070|Type:1}
2013-06-26 00:40:41.756: [ CRSCCL][9773]Reconfig handled
2013-06-26 00:40:41.806: [ OCRSRV][4371]th_not_master_change: Invoking master change callback. Master [1] Inc [427]
2013-06-26 00:40:41.949: [ AGFW][11829] Agfw Proxy Server received process disconnected notification, count=1
2013-06-26 00:40:41.995: [ CRSSE][13371] Master Change Event; New Master Node ID:1 This Node's ID:1
2013-06-26 00:40:42.030: [ OCRMAS][4371]th_master:13: I AM THE NEW OCR MASTER at incar 426. Node Number 1
2013-06-26 00:40:42.225: [ CRSPE][13114] PE Role|State Update: old role [SLAVE] new [MASTER]; old state [Running] new [Configuring]
2013-06-26 00:40:42.231: [ CRSPE][13114] PE MASTER NAME: dbs01
2013-06-26 00:40:42.232: [ CRSPE][13114] Starting to read configuration
2013-06-26 00:40:42.587: [ OCRSRV][12600]proas_amiwriter: ctx is MASTER CHANGING/CONNECTING
2013-06-26 00:40:42.591: [ OCRSRV][12857]proas_amiwriter: ctx is MASTER CHANGING/CONNECTING
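The crsd.log on dbs01 shows the expected reconfiguration after dbs02 left: dbs01 becomes the new PE master and the new OCR master. If there is any doubt about OCR health after such a failover, it can be checked on the surviving node with (a sketch, run as root):
# ocrcheck
which reports the OCR version, space usage, the configured OCR locations, and the result of an integrity check.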
ocssd.log
2013-06-26 00:39:59.072: [ CSSD][5157]clssnmSendingThread: sending status msg to all nodes
2013-06-26 00:39:59.072: [ CSSD][5157]clssnmSendingThread: sent 4 status msgs to all nodes
2013-06-26 00:40:03.076: [ CSSD][5157]clssnmSendingThread: sending status msg to all nodes
2013-06-26 00:40:03.076: [ CSSD][5157]clssnmSendingThread: sent 4 status msgs to all nodes
2013-06-26 00:40:07.083: [ CSSD][5157]clssnmSendingThread: sending status msg to all nodes
2013-06-26 00:40:07.083: [ CSSD][5157]clssnmSendingThread: sent 4 status msgs to all nodes
2013-06-26 00:40:10.628: [ CSSD][4900]clssnmPollingThread: node dbs02 (2) at 50% heartbeat fatal, removal in 14.310 seconds
2013-06-26 00:40:10.628: [ CSSD][4900]clssnmPollingThread: node dbs02 (2) is impending reconfig, flag 394254, misstime 15690
2013-06-26 00:40:10.628: [ CSSD][4900]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
2013-06-26 00:40:11.088: [ CSSD][5157]clssnmSendingThread: sending status msg to all nodes
2013-06-26 00:40:11.088: [ CSSD][5157]clssnmSendingThread: sent 4 status msgs to all nodes
2013-06-26 00:40:15.090: [ CSSD][5157]clssnmSendingThread: sending status msg to all nodes
2013-06-26 00:40:15.090: [ CSSD][5157]clssnmSendingThread: sent 4 status msgs to all nodes
2013-06-26 00:40:16.477: [ CSSD][1286]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 926 > margin 750 cur_ms 4215208529 lastalive 4215207603
2013-06-26 00:40:16.477: [ CSSD][1286]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 926 > margin 750 cur_ms 4215208529 lastalive 4215207603
2013-06-26 00:40:16.477: [ CSSD][1286]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 926 > margin 750 cur_ms 4215208529 lastalive 4215207603
2013-06-26 00:40:17.636: [ CSSD][4900]clssnmPollingThread: node dbs02 (2) at 75% heartbeat fatal, removal in 7.298 seconds
2013-06-26 00:40:19.096: [ CSSD][5157]clssnmSendingThread: sending status msg to all nodes
2013-06-26 00:40:19.096: [ CSSD][5157]clssnmSendingThread: sent 4 status msgs to all nodes
2013-06-26 00:40:22.641: [ CSSD][4900]clssnmPollingThread: node dbs02 (2) at 90% heartbeat fatal, removal in 2.293 seconds, seedhbimpd 1
2013-06-26 00:40:23.100: [ CSSD][5157]clssnmSendingThread: sending status msg to all nodes
2013-06-26 00:40:23.100: [ CSSD][5157]clssnmSendingThread: sent 4 status msgs to all nodes
2013-06-26 00:40:24.934: [ CSSD][4900]clssnmPollingThread: Removal started for node dbs02 (2), flags 0x6040e, state 3, wt4c 0
2013-06-26 00:40:24.934: [ CSSD][4900]clssnmDiscHelper: dbs02, node(2) connection failed, endp (32e), probe(0), ninf->endp 32e
2013-06-26 00:40:24.934: [ CSSD][4900]clssnmDiscHelper: node 2 clean up, endp (32e), init state 5, cur state 5
2013-06-26 00:40:24.934: [GIPCXCPT][4900]gipcInternalDissociate: obj 112dea4b0 [000000000000032e] { gipcEndpoint : localAddr 'gipc://dbs01:3182-54d3-ae85-1f02#192.168.60.221#40199', remoteAddr 'gipc://dbs02:nm_dbs-cluster#192.168.60.222#52530', numPend 5, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x3801fc, pidPeer 0, flags 0x2616, usrFlags 0x0 } not associated with any container, ret gipcretFail (1)
2013-06-26 00:40:24.934: [GIPCXCPT][4900]gipcDissociateF [clssnmDiscHelper : clssnm.c : 3260]: EXCEPTION[ ret gipcretFail (1) ] failed to dissociate obj 112dea4b0 [000000000000032e] { gipcEndpoint : localAddr 'gipc://dbs01:3182-54d3-ae85-1f02#192.168.60.221#40199', remoteAddr 'gipc://dbs02:nm_dbs-cluster#192.168.60.222#52530', numPend 5, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x3801fc, pidPeer 0, flags 0x2616, usrFlags 0x0 }, flags 0x0
2013-06-26 00:40:24.934: [ CSSD][5414]clssnmDoSyncUpdate: Initiating sync 239553874
2013-06-26 00:40:24.934: [ CSSD][5414]clssscUpdateEventValue: NMReconfigInProgress val 1, changes 3
2013-06-26 00:40:24.934: [ CSSD][5414]clssnmDoSyncUpdate: local disk timeout set to 27000 ms, remote disk timeout set to 27000
2013-06-26 00:40:24.934: [ CSSD][5414]clssnmDoSyncUpdate: new values for local disk timeout and remote disk timeout will take effect when the sync is completed.
2013-06-26 00:40:24.934: [ CSSD][5414]clssnmDoSyncUpdate: Starting cluster reconfig with incarnation 239553874
2013-06-26 00:40:24.934: [ CSSD][5414]clssnmSetupAckWait: Ack message type (11)
2013-06-26 00:40:24.934: [ CSSD][5414]clssnmSetupAckWait: node(1) is ALIVE
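The 50%/75%/90% "heartbeat fatal" countdown above tracks the CSS misscount window (30 seconds by default in 11.2 without vendor clusterware): at 00:40:10 dbs01 had already missed 15690 ms of network heartbeats from dbs02 and scheduled removal in another 14.310 seconds, which adds up to the 30-second limit reached at 00:40:24. The thresholds actually in effect can be confirmed with (a sketch, run as root with crsctl from the Grid Infrastructure home in PATH):
# crsctl get css misscount
# crsctl get css disktimeout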
Judging from the logs, node 2's network heartbeat timed out and it was evicted from the cluster (split-brain resolution).
The nmon data is attached.
Please help analyze the cause of the node reboots, and whether it could be a problem with the private interconnect.
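For reference, the interface registered as the cluster interconnect and basic connectivity over the 192.168.60.x private network can be checked with the commands below (a sketch; en1 is a placeholder for whatever interface oifcfg actually reports, and 192.168.60.222 is dbs02's private address seen in the logs):
$ oifcfg getif
# ping -c 5 192.168.60.222
# entstat -d en1
oifcfg getif lists each registered interface with its subnet and role (public / cluster_interconnect); entstat -d shows the AIX adapter statistics, where growing error or "No Resource" counters would point at the private network. Correlating the attached nmon data around 00:39-00:40 with the DiskPingMonitorThread scheduling-delay warnings above may also help distinguish OS-level scheduling/CPU pressure from a genuine network problem.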