Oracle数据库数据恢复、性能优化

找回密码
注册
搜索
热搜: 活动 交友 discuz
发新帖

75

积分

1

好友

8

主题
1#
发表于 2014-4-30 00:23:20 | 查看: 6668| 回复: 7
1、环境介绍
操作系统:AIX 6100-06-01-1043
数据库:10.2.0.5 RAC 2节点 ASM

(用户之前的环境两节点实例装反)
  1. ASM1/instance1:node2   
  2. ASM2/instance2:node1
复制代码
2、问题现象:
28号10:04分出现网络故障,节点2被踢出
  1. [    CSSD]2014-04-28 10:04:24.716 [4371] >WARNING: clssnmPollingThread: node ahxnb02 (1) at 50% heartbeat fatal, eviction in 14.065 seconds seedhbimpd 0
  2. [    CSSD]2014-04-28 10:04:24.716 [4371] >TRACE:   clssnmPollingThread: node ahxnb02 (1) is impending reconfig, flag 1039, misstime 15935
  3. [    CSSD]2014-04-28 10:04:24.716 [4371] >TRACE:   clssnmPollingThread: diskTimeout set to (27000)ms impending reconfig status(1)
  4. [    CSSD]2014-04-28 10:04:25.217 [4628] >TRACE:   clssnmSendingThread: sending status msg to all nodes
  5. [    CSSD]2014-04-28 10:04:25.217 [4628] >TRACE:   clssnmSendingThread: sent 4 status msgs to all nodes
  6. [    CSSD]2014-04-28 10:04:29.226 [4628] >TRACE:   clssnmSendingThread: sending status msg to all nodes
  7. [    CSSD]2014-04-28 10:04:29.226 [4628] >TRACE:   clssnmSendingThread: sent 4 status msgs to all nodes
  8. [    CSSD]2014-04-28 10:04:31.716 [4371] >WARNING: clssnmPollingThread: node ahxnb02 (1) at 75% heartbeat fatal, eviction in 7.064 seconds seedhbimpd 1
  9. [    CSSD]2014-04-28 10:04:32.716 [4371] >WARNING: clssnmPollingThread: node ahxnb02 (1) at 75% heartbeat fatal, eviction in 6.065 seconds seedhbimpd 1
  10. [    CSSD]2014-04-28 10:04:33.227 [4628] >TRACE:   clssnmSendingThread: sending status msg to all nodes
  11. [    CSSD]2014-04-28 10:04:33.227 [4628] >TRACE:   clssnmSendingThread: sent 4 status msgs to all nodes
  12. [    CSSD]2014-04-28 10:04:36.719 [4371] >WARNING: clssnmPollingThread: node ahxnb02 (1) at 90% heartbeat fatal, eviction in 2.062 seconds seedhbimpd 1
  13. [    CSSD]2014-04-28 10:04:37.230 [4628] >TRACE:   clssnmSendingThread: sending status msg to all nodes
  14. [    CSSD]2014-04-28 10:04:37.230 [4628] >TRACE:   clssnmSendingThread: sent 4 status msgs to all nodes
  15. [    CSSD]2014-04-28 10:04:37.716 [4371] >WARNING: clssnmPollingThread: node ahxnb02 (1) at 90% heartbeat fatal, eviction in 1.065 seconds seedhbimpd 1
  16. [    CSSD]2014-04-28 10:04:38.716 [4371] >WARNING: clssnmPollingThread: node ahxnb02 (1) at 90% heartbeat fatal, eviction in 0.065 seconds seedhbimpd 1
  17. [    CSSD]2014-04-28 10:04:38.781 [4371] >TRACE:   clssnmPollingThread: Eviction started for node ahxnb02 (1), flags 0x040f, state 3, wt4c 0 seedhbimpd 1
  18. [    CSSD]2014-04-28 10:04:38.781 [4371] >TRACE:   clssnmDiscHelper: ahxnb02, node(1) connection failed, con (1113ab490), probe(0)
复制代码
后根据主机日志,显示出现网络IP地址冲突,该问题已解决。


网络恢复后:
2节点启动crs时,其中1个节点ASM无法启动
  1. ahxnb02$$[/oracle]crs_stat -t
  2. Name           Type           Target    State     Host        
  3. ------------------------------------------------------------
  4. ora....B1.inst application    ONLINE    OFFLINE               
  5. ora....B2.inst application    ONLINE    ONLINE    ahxnb01     
  6. ora.AHXNB.db   application    ONLINE    ONLINE    ahxnb01     
  7. ora....SM2.asm application    ONLINE    ONLINE    ahxnb01     
  8. ora....01.lsnr application    ONLINE    ONLINE    ahxnb01     
  9. ora....b01.gsd application    ONLINE    ONLINE    ahxnb01     
  10. ora....b01.ons application    ONLINE    ONLINE    ahxnb01     
  11. ora....b01.vip application    ONLINE    ONLINE    ahxnb01     
  12. ora....SM1.asm application    ONLINE    OFFLINE               
  13. ora....02.lsnr application    ONLINE    ONLINE    ahxnb02     
  14. ora....b02.gsd application    ONLINE    ONLINE    ahxnb02     
  15. ora....b02.ons application    ONLINE    ONLINE    ahxnb02     
  16. ora....b02.vip application    ONLINE    ONLINE    ahxnb02  
复制代码
手工尝试在2号节点上启动ASM
  1. ahxnb01$$[/oracle]srvctl start asm -n ahxnb01
复制代码
再次检查:
  1. ahxnb01$$[/oracle]crs_stat -t
  2. Name           Type           Target    State     Host        
  3. ------------------------------------------------------------
  4. ora....B1.inst application    ONLINE    ONLINE   ahxnb02            
  5. ora....B2.inst application    ONLINE    OFFLINE         
  6. ora.AHXNB.db   application    ONLINE    ONLINE    ahxnb02     
  7. ora....SM2.asm application    ONLINE    OFFLINE         
  8. ora....01.lsnr application    ONLINE    ONLINE    ahxnb01     
  9. ora....b01.gsd application    ONLINE    ONLINE    ahxnb01     
  10. ora....b01.ons application    ONLINE    ONLINE    ahxnb01     
  11. ora....b01.vip application    ONLINE    ONLINE    ahxnb01     
  12. ora....SM1.asm application    ONLINE    ONLINE  ahxnb02            
  13. ora....02.lsnr application    ONLINE    ONLINE    ahxnb02     
  14. ora....b02.gsd application    ONLINE    ONLINE    ahxnb02     
  15. ora....b02.ons application    ONLINE    ONLINE    ahxnb02     
  16. ora....b02.vip application    ONLINE    ONLINE    ahxnb02
复制代码
ASM alert日志:
  1. Tue Apr 29 21:46:49 CDT 2014
  2. lmon registered with NM - instance id 1 (internal mem no 0)
  3. Tue Apr 29 21:51:44 CDT 2014
  4. Remote instance kill is issued with system inc 0 and reason 0x40000000
  5. Remote instance kill map (size 1) : 2
  6. Tue Apr 29 21:51:51 CDT 2014
  7. Trace dumping is performing id=[cdmp_20140429215152]
  8. Tue Apr 29 21:51:53 CDT 2014
  9. Error: KGXGN polling error (15)
  10. Tue Apr 29 21:51:53 CDT 2014
  11. Errors in file /oracle/admin/+ASM/bdump/+asm1_lmon_2819952.trc:
  12. ORA-29702: error occurred in Cluster Group Service operation
  13. LMON: terminating instance due to error 29702
  14. Tue Apr 29 21:51:55 CDT 2014
  15. System state dump is made for local instance
  16. Tue Apr 29 21:51:55 CDT 2014
  17. Errors in file /oracle/admin/+ASM/bdump/+asm1_diag_4260796.trc:
  18. ORA-29702: error occurred in Cluster Group Service operation
  19. Tue Apr 29 21:51:55 CDT 2014
  20. Trace dumping is performing id=[cdmp_20140429215153]
  21. Tue Apr 29 21:51:57 CDT 2014
  22. Shutting down instance (abort)
  23. License high water mark = 0
  24. Tue Apr 29 21:51:57 CDT 2014
  25. Instance terminated by LMON, pid = 2819952
  26. Tue Apr 29 21:52:02 CDT 2014
  27. Instance terminated by USER, pid = 3997966
复制代码
3、查看网络配置:
  1. ahxnb01$$[/oracle]oifcfg getif
  2. en0  10.88.32.0  global  public
  3. en2  192.168.255.0  global  cluster_interconnect
  4. en4  192.168.254.0  global  cluster_interconnect
复制代码
hosts文件:
  1. 127.0.0.1               loopback localhost      # loopback (lo0) name/address
  2. ##############Public Network###########################
  3. 10.88.32.51     ahxnb01
  4. 10.88.32.52     ahxnb02
  5. ############# Virtual IP address#######################
  6. 10.88.32.53     ahxnb01_vip
  7. 10.88.32.54     ahxnb02_vip
  8. ############ Interconnect RAC1#########################
  9. 192.168.255.51  ahxnb01_priv
  10. 192.168.255.52  ahxnb02_priv
  11. ############ Interconnect RAC2#########################
  12. 192.168.254.51  ahxnb01_priv2
  13. 192.168.254.52  ahxnb02_priv2
复制代码
两节点网络测试:

public 和cluster_interconnect都正常
  1. ahxnb02$$[/oracle]ping ahxnb01_priv
  2. PING ahxnb01_priv: (192.168.255.51): 56 data bytes
  3. 64 bytes from 192.168.255.51: icmp_seq=0 ttl=255 time=0 ms
  4. 64 bytes from 192.168.255.51: icmp_seq=1 ttl=255 time=0 ms
  5. 64 bytes from 192.168.255.51: icmp_seq=2 ttl=255 time=0 ms
  6. 64 bytes from 192.168.255.51: icmp_seq=3 ttl=255 time=0 ms
复制代码
  1. ahxnb01$$[/oracle]ping ahxnb02_priv
  2. PING ahxnb02_priv: (192.168.255.52): 56 data bytes
  3. 64 bytes from 192.168.255.52: icmp_seq=0 ttl=255 time=0 ms
  4. 64 bytes from 192.168.255.52: icmp_seq=1 ttl=255 time=0 ms
  5. 64 bytes from 192.168.255.52: icmp_seq=2 ttl=255 time=0 ms
  6. 64 bytes from 192.168.255.52: icmp_seq=3 ttl=255 time=0 ms
复制代码
查看了crsd和ocssd日志,没能判断出问题的原因,上传附件,请各位大神帮分析分析。

log.zip

2.53 MB, 下载次数: 908

2#
发表于 2014-4-30 08:38:22
The bug fix is present in 10.2.0.5 and 11.1.0.7 patchsets.

If you cannot immediately apply one of these patchsets, please apply Patch 5259835 to include a mechanism to monitor the LMON, that helps to ensure the problematic instance will crash in time and the healthy instances do not wait until reconfiguration timeout.

回复 只看该作者 道具 举报

3#
发表于 2014-4-30 13:57:05
尝试重启2节点了吗?

回复 只看该作者 道具 举报

4#
发表于 2014-4-30 14:22:57
把业务停掉,把两个节点全部重启一下。

回复 只看该作者 道具 举报

5#
发表于 2014-4-30 15:01:58
两节点都重启;
node1/node2分别重启;
主机重启;
上述三种操作都执行过,但是情况一样,只要有一个节点的ASM和instance启动着,另外一个节点一启动ASM,前面正常的节点就会被abort掉。

最新日志:
  1. ed Apr 30 12:26:56 CDT 2014
  2. Errors in file /oracle/admin/AHXNB/bdump/ahxnb1_asmb_1180310.trc:
  3. ORA-15064: communication failure with ASM instance
  4. ORA-03113: end-of-file on communication channel
  5. Wed Apr 30 12:26:56 CDT 2014
  6. ASMB: terminating instance due to error 15064
  7. Wed Apr 30 12:26:56 CDT 2014
  8. System state dump is made for local instance
  9. System State dumped to trace file /oracle/admin/AHXNB/bdump/ahxnb1_diag_5767354.trc
  10. Wed Apr 30 12:26:57 CDT 2014
  11. Shutting down instance (abort)
  12. License high water mark = 70
  13. Wed Apr 30 12:27:01 CDT 2014
  14. Instance terminated by ASMB, pid = 1180310
  15. Wed Apr 30 12:27:02 CDT 2014
  16. Instance terminated by USER, pid = 5244898
  17. Wed Apr 30 13:20:21 CDT 2014
复制代码
附件中为最新的alert日志。

alert.rar

1.1 MB, 下载次数: 1398

回复 只看该作者 道具 举报

6#
发表于 2014-4-30 18:00:49
目前问题已解决,但自己也不明白到底什么原因。。。
操作过程:

回复 只看该作者 道具 举报

7#
发表于 2014-4-30 18:07:30
解决步骤:
删除其中一块心跳网卡(之前为什么这么部署,现在也不清楚,原先的集成商已脱离用户)
  1. ahxnb02$$[/oracle/product/10.2.0/crs/log/ahxnb02/crsd]oifcfg getif
  2. en0  10.88.32.0  global  public
  3. en2  192.168.255.0  global  cluster_interconnect
  4. [color=Red]en4  192.168.254.0  global  cluster_interconnect[/color]
复制代码
从集群注册信息中删除其中一块cluster_interconnect(en4)
  1. ahxnb02$$[/oracle/product/10.2.0/crs/log/ahxnb02/crsd]oifcfg delif -global en4/192.168.254.0
  2. ahxnb02$$[/oracle/product/10.2.0/crs/log/ahxnb02/crsd]oifcfg getif
  3. en0  10.88.32.0  global  public
  4. en2  192.168.255.0  global  cluster_interconnect
复制代码
再次启动问题节点的CRS服务:
  1. ahxnb01$$[/oracle]crsctl start crs
  2. Attempting to start CRS stack
  3. The CRS stack will be started shortly
复制代码
检查进程:
  1. ahxnb01$$[/oracle]crs_stat -t -v
  2. Name           Type           R/RA   F/FT   Target    State     Host        
  3. ----------------------------------------------------------------------
  4. ora....B1.inst application    0/5    0/0    ONLINE    ONLINE    ahxnb02     
  5. ora....B2.inst application    0/5    0/0    ONLINE    ONLINE    ahxnb01     
  6. ora.AHXNB.db   application    0/0    0/1    ONLINE    ONLINE    ahxnb02     
  7. ora....SM2.asm application    0/5    0/0    ONLINE    ONLINE    ahxnb01     
  8. ora....01.lsnr application    0/5    0/0    ONLINE    ONLINE    ahxnb01     
  9. ora....b01.gsd application    0/5    0/0    ONLINE    ONLINE    ahxnb01     
  10. ora....b01.ons application    0/3    0/0    ONLINE    ONLINE    ahxnb01     
  11. ora....b01.vip application    0/0    0/0    ONLINE    ONLINE    ahxnb01     
  12. ora....SM1.asm application    0/5    0/0    ONLINE    ONLINE    ahxnb02     
  13. ora....02.lsnr application    0/5    0/0    ONLINE    ONLINE    ahxnb02     
  14. ora....b02.gsd application    0/5    0/0    ONLINE    ONLINE    ahxnb02     
  15. ora....b02.ons application    0/3    0/0    ONLINE    ONLINE    ahxnb02     
  16. ora....b02.vip application    0/0    0/0    ONLINE    ONLINE    ahxnb02     
复制代码
本节点的ASM和instance online后没有驱逐2号节点的ASM和instance 服务正常。

虽然问题表面是解决了,但不知道是否就是因为多一块心跳网卡导致的,因为28号10:04分之前一直运行正常。

回复 只看该作者 道具 举报

8#
发表于 2014-4-30 23:41:43
你有两个心跳地址吗

回复 只看该作者 道具 举报

您需要登录后才可以回帖 登录 | 注册

QQ|手机版|Archiver|Oracle数据库数据恢复、性能优化

GMT+8, 2024-12-21 03:26 , Processed in 0.050765 second(s), 23 queries .

Powered by Discuz! X2.5

© 2001-2012 Comsenz Inc.

回顶部
TEL/電話+86 13764045638
Email service@parnassusdata.com
QQ 47079569