
[Exadata Hardware] One node in a RAC cluster is occasionally evicted and then recovers

1#

Posted 2013-6-8 21:08:17 | Views: 5442 | Replies: 4

Last edited by godspeed on 2013-6-9 13:30

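This has happened about 9 times between April 22 and June 4.
When the fault occurs, the applications connected to the database behave strangely: they can connect, but their operations are blocked, and for a long time they appear unable to do any work.
On June 4 at 16:22 the fault occurred again. The node appeared to rejoin the cluster after a few minutes, but the application was still not working normally at 18:00. This time the DBA wanted to see whether the database would recover on its own, so nothing was touched. By the next morning it was working again.
Attached are the alert log and the corresponding trace files, plus the log files under crs.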

Attachments:
alert_orcl2.zip (177.23 KB)
trace.zip (103.59 KB)
crs_log.zip (393.35 KB)

xifenfei

2#

Posted 2013-6-9 14:09:06

CRS log:

2013-06-04 16:23:55.015: [ CRSRES][344528]32In stateChanged, ora.orcl.orcl2.inst target is ONLINE
2013-06-04 16:23:55.015: [ CRSRES][344528]32ora.orcl.orcl2.inst on dbsvr2 went OFFLINE unexpectedly
2013-06-04 16:23:55.015: [ CRSRES][344528]32StopResource: setting CLI values
2013-06-04 16:23:55.031: [ CRSRES][344528]32Attempting to stop `ora.orcl.orcl2.inst` on member `dbsvr2`
2013-06-04 16:26:57.703: [ CRSRES][344528]32Stop of `ora.orcl.orcl2.inst` on member `dbsvr2` succeeded.
2013-06-04 16:26:57.703: [ CRSRES][344528]32ora.orcl.orcl2.inst RESTART_COUNT=2 RESTART_ATTEMPTS=5
2013-06-04 16:26:57.703: [ CRSRES][344528]32ora.orcl.orcl2.inst Uptime does not exceed uptime_threshold
2013-06-04 16:26:57.718: [ CRSRES][344528]32Restarting ora.orcl.orcl2.inst on dbsvr2
2013-06-04 16:26:57.718: [ CRSRES][344528]32startRunnable: setting CLI values
2013-06-04 16:26:57.718: [ CRSRES][344528]32Attempting to start `ora.orcl.orcl2.inst` on member `dbsvr2`
2013-06-04 16:26:57.843: [ OCRUTL][3048]u_freem: mem passed is null
2013-06-04 16:27:37.953: [ CRSRES][344528]32Start of `ora.orcl.orcl2.inst` on member `dbsvr2` succeeded.
2013-06-04 16:27:37.953: [ CRSRES][344528]32Successfully restarted ora.orcl.orcl2.inst on dbsvr2, RESTART_COUNT=3
2013-06-04 16:27:37.968: [ CRSRES][344528]32ora.orcl.orcl2.inst Updated LAST_RESTART time in ocr
2013-06-04 16:27:42.843: [ OCRUTL][3048]u_freem: mem passed is null
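
To make the timing in the excerpt above easier to read, here is a minimal Python sketch that pulls the three key moments out of CRS log lines: the unexpected OFFLINE, the completed stop, and the completed start. The function names (`parse_crs_line`, `restart_marks`) are made up for illustration, and the regular expression only assumes the line shape visible in this excerpt; it is not a general parser for CRS logs.

```python
import re
from datetime import datetime

# Matches lines shaped like the excerpt above, for example:
# 2013-06-04 16:23:55.015: [ CRSRES][344528]32ora.orcl.orcl2.inst on dbsvr2 went OFFLINE unexpectedly
# Assumption: any digits right after the second bracket pair are a prefix, not message text.
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}): \[\s*(\w+)\]\[\d+\]\d*(.*)$"
)


def parse_crs_line(line):
    """Return (timestamp, component, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
    return ts, m.group(2), m.group(3)


def restart_marks(lines):
    """Collect the first timestamp of each phase of an instance restart."""
    marks = {}
    for line in lines:
        parsed = parse_crs_line(line)
        if parsed is None:
            continue
        ts, _component, msg = parsed
        if "went OFFLINE unexpectedly" in msg:
            marks.setdefault("offline", ts)
        elif msg.startswith("Stop of") and "succeeded" in msg:
            marks.setdefault("stopped", ts)
        elif msg.startswith("Start of") and "succeeded" in msg:
            marks.setdefault("started", ts)
    return marks
```

Run against the lines quoted above, this gives about 182.7 seconds from the OFFLINE event to the completed stop, and about 222.9 seconds from the OFFLINE event to the completed start. It only measures what CRS logged; it says nothing about why the stop took that long.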


This shows that database instance 2 in the cluster went down abnormally. The alert log (this excerpt is from May 23) shows:

Thu May 23 12:08:58 2013
Errors in file e:\oracle\product\10.2.0\admin\orcl\bdump\orcl2_asmb_341964.trc:
ORA-15064: communication failure with ASM instance
ORA-01092: ORACLE instance terminated. Disconnection forced
Thu May 23 12:08:58 2013
ASMB: terminating instance due to error 15064
Thu May 23 12:08:58 2013
Errors in file e:\oracle\product\10.2.0\admin\orcl\bdump\orcl2_lmon_343704.trc:
ORA-15064: communication failure with ASM instance


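No ASM log was provided, but judging from these errors, the database instance was most likely terminated because of a problem with the ASM instance.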
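
Since the fault reportedly occurred about 9 times, it may help to list every forced termination in the alert log and compare the dates against the reported incidents. Below is a rough Python sketch of that idea. The names `parse_alert_ts` and `find_terminations` are hypothetical, and the pattern assumes the "PROCESS: terminating instance due to error N" wording and the English timestamp lines seen in the excerpt above.

```python
import re
from datetime import datetime

MONTHS = {name: i for i, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

# Timestamp lines as they appear in this alert log, e.g. "Thu May 23 12:08:58 2013".
TS_RE = re.compile(
    r"^[A-Z][a-z]{2} ([A-Z][a-z]{2}) +(\d{1,2}) (\d{2}):(\d{2}):(\d{2}) (\d{4})$"
)
# Wording taken from the excerpt: "ASMB: terminating instance due to error 15064".
TERM_RE = re.compile(r"^(\w+): terminating instance due to error (\d+)")


def parse_alert_ts(line):
    """Return a datetime for a timestamp line, or None for any other line."""
    m = TS_RE.match(line.strip())
    if m is None or m.group(1) not in MONTHS:
        return None
    mon, day, hh, mm, ss, year = m.groups()
    return datetime(int(year), MONTHS[mon], int(day), int(hh), int(mm), int(ss))


def find_terminations(lines):
    """Return (timestamp, process, error_number) for each forced termination."""
    found = []
    current = None
    for raw in lines:
        ts = parse_alert_ts(raw)
        if ts is not None:
            current = ts  # later message lines belong to this timestamp
            continue
        m = TERM_RE.match(raw.strip())
        if m is not None and current is not None:
            found.append((current, m.group(1), int(m.group(2))))
    return found
```

Feeding the whole alert_orcl2 log through `find_terminations` should give one row per forced termination, which can then be lined up with the CRS log and with the ASM alert log once that is available.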

Tue Jun 04 16:27:33 2013
Incremental checkpoint up to RBA [0x16c2.2.0], current log tail at RBA [0x16c2.36.0]
Tue Jun 04 16:27:33 2013
Completed: ALTER DATABASE OPEN
Tue Jun 04 16:42:34 2013
Incremental checkpoint up to RBA [0x16c2.4187.0], current log tail at RBA [0x16c2.46e6.0]
Tue Jun 04 16:57:34 2013
Incremental checkpoint up to RBA [0x16c2.5134.0], current log tail at RBA [0x16c2.56dd.0]
Tue Jun 04 17:12:34 2013
Incremental checkpoint up to RBA [0x16c2.6a57.0], current log tail at RBA [0x16c2.6fdc.0]
Tue Jun 04 17:27:34 2013
Incremental checkpoint up to RBA [0x16c2.794a.0], current log tail at RBA [0x16c2.7e91.0]
Tue Jun 04 17:42:34 2013
Incremental checkpoint up to RBA [0x16c2.8742.0], current log tail at RBA [0x16c2.8c69.0]
Tue Jun 04 17:55:48 2013
Beginning log switch checkpoint up to RBA [0x16c3.2.10], SCN: 282230756
Tue Jun 04 17:55:48 2013
Thread 2 advanced to log sequence 5827 (LGWR switch)
Current log# 10 seq# 5827 mem# 0: +DATA/orcl/onlinelog/group_10.597.816688219
Current log# 10 seq# 5827 mem# 1: +DATA/orcl/onlinelog/group_10.732.816688229
Tue Jun 04 17:57:34 2013
Incremental checkpoint up to RBA [0x16c2.94b6.0], current log tail at RBA [0x16c3.1c0.0]
Tue Jun 04 18:00:53 2013
Completed checkpoint up to RBA [0x16c3.2.10], SCN: 282230756
Tue Jun 04 18:12:35 2013
Incremental checkpoint up to RBA [0x16c3.c2b.0], current log tail at RBA [0x16c3.cf9.0]
Tue Jun 04 18:27:35 2013
Incremental checkpoint up to RBA [0x16c3.e7b.0], current log tail at RBA [0x16c3.eee.0]
Tue Jun 04 18:42:35 2013
Incremental checkpoint up to RBA [0x16c3.f9d.0], current log tail at RBA [0x16c3.ff1.0]
Tue Jun 04 18:57:35 2013
Incremental checkpoint up to RBA [0x16c3.1109.0], current log tail at RBA [0x16c3.11ba.0]
Tue Jun 04 19:12:35 2013
Incremental checkpoint up to RBA [0x16c3.1977.0], current log tail at RBA [0x16c3.1a5e.0]
Tue Jun 04 19:27:36 2013
Incremental checkpoint up to RBA [0x16c3.1b6a.0], current log tail at RBA [0x16c3.1bea.0]
Tue Jun 04 19:42:36 2013
Incremental checkpoint up to RBA [0x16c3.1cd7.0], current log tail at RBA [0x16c3.1d21.0]


This shows that the database is checkpointing normally after the restart, which suggests the database itself is basically healthy.
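
For anyone who wants to check the checkpoint cadence without reading the log by eye, here is a small Python sketch that computes the gaps between incremental checkpoints and the duration of each log switch checkpoint. `checkpoint_stats` and `parse_alert_ts` are illustrative names, and the patterns assume only the message wording shown in the excerpt above.

```python
import re
from datetime import datetime

MONTHS = {name: i for i, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}
TS_RE = re.compile(
    r"^[A-Z][a-z]{2} ([A-Z][a-z]{2}) +(\d{1,2}) (\d{2}):(\d{2}):(\d{2}) (\d{4})$"
)
# First RBA on the line, e.g. "checkpoint up to RBA [0x16c3.2.10]".
RBA_RE = re.compile(r"checkpoint up to RBA \[([0-9a-fx.]+)\]")


def parse_alert_ts(line):
    """Return a datetime for a timestamp line, or None for any other line."""
    m = TS_RE.match(line.strip())
    if m is None or m.group(1) not in MONTHS:
        return None
    mon, day, hh, mm, ss, year = m.groups()
    return datetime(int(year), MONTHS[mon], int(day), int(hh), int(mm), int(ss))


def checkpoint_stats(lines):
    """Return (gaps between incremental checkpoints, log switch checkpoint durations).

    Both are in seconds; durations are keyed by the RBA named in the log line.
    """
    incremental, begun, durations = [], {}, {}
    current = None
    for raw in lines:
        line = raw.strip()
        ts = parse_alert_ts(line)
        if ts is not None:
            current = ts
            continue
        m = RBA_RE.search(line)
        if m is None or current is None:
            continue
        rba = m.group(1)
        if line.startswith("Incremental checkpoint"):
            incremental.append(current)
        elif line.startswith("Beginning log switch checkpoint"):
            begun[rba] = current
        elif line.startswith("Completed checkpoint") and rba in begun:
            durations[rba] = (current - begun[rba]).total_seconds()
    gaps = [(b - a).total_seconds() for a, b in zip(incremental, incremental[1:])]
    return gaps, durations
```

On the excerpt above, the incremental checkpoints come out roughly 15 minutes apart, and the log switch checkpoint for RBA 0x16c3.2.10 takes 305 seconds (17:55:48 to 18:00:53). Whether that duration is unusual for this system cannot be judged from the excerpt alone; it is only a number to compare against a known-good period.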

A deeper analysis would also need the system load figures, the ASM logs, the OS system logs, and similar information.


godspeed

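3#

Posted 2013-6-9 22:57:05

Thanks. I'll go and get the ASM logs and the system logs.
Also, we have a Java application that uses the JDBC Thin driver to connect to the real IP of the primary server (192.168.0.2). When the problem occurred, this service stopped working. When we restarted it, it seemed able to establish a connection, and reading some data during startup seemed fine, but it appeared to hang when writing data. This was still the case at 18:00. The on-site staff then decided to wait and see and left; the next morning it connected and worked again. According to the DBA, the database had not been touched (not restarted).
The OS is Windows Server 2003 x64, with the page file left at the default 2 GB (not sure whether that could be a problem).
I have now set up performance counter logging on the system at 5-second intervals, covering processor, disk I/O, and network I/O. I also wrote a script that continuously pings addresses on the external and internal networks.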
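
As a complement to the ping script, a TCP-level probe of the listener port can show whether connections were actually being accepted during an incident. The sketch below is not the script mentioned above: it swaps ICMP ping for a plain TCP connect, and the port (1521, the usual default listener port), the 5-second interval, and the log file name are all assumptions to adjust.

```python
import socket
import time
from datetime import datetime


def probe(host, port, timeout=3.0):
    """Attempt one TCP connect; return (ok, seconds_elapsed)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            ok = True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        ok = False
    return ok, time.monotonic() - start


def format_sample(when, host, port, ok, elapsed):
    """Render one probe result as a single log line."""
    status = "ok" if ok else "FAIL"
    return f"{when:%Y-%m-%d %H:%M:%S} {host}:{port} {status} {elapsed * 1000:.1f}ms"


def run(targets, interval=5.0, logfile="probe.log"):
    """Probe every (host, port) target forever, appending results to logfile."""
    while True:
        with open(logfile, "a", encoding="utf-8") as out:
            for host, port in targets:
                ok, elapsed = probe(host, port)
                out.write(format_sample(datetime.now(), host, port, ok, elapsed) + "\n")
        time.sleep(interval)
```

For example, `run([("192.168.0.2", 1521)])` would log one line every 5 seconds for the address mentioned above. A successful connect only shows that the listener accepted the socket; it does not prove that a session can do work, so the hang-on-write symptom would still need to be checked inside the database.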


harryzhang

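4#

Posted 2013-6-12 16:11:53

1) "According to the DBA, the database had not been touched (restarted)" > Check alert.log to see whether the instance actually restarted.

2) Since this is RAC, clients should connect through the VIPs. Is the real IP of the primary server you mentioned (192.168.0.2) a VIP?

3) Don't rely entirely on what the on-site staff tell you.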
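
On point 2, here is a sketch of what a VIP-based connect string could look like, written as a small Python helper that only builds the text. The VIP host names (`dbsvr1-vip`, `dbsvr2-vip`) and the service name `orcl` are placeholders, not values confirmed in this thread; the descriptor keywords are standard Oracle Net ones, but the exact settings should be checked against your own environment.

```python
def build_descriptor(vip_hosts, service_name, port=1521):
    """Build an Oracle Net connect descriptor that lists every node VIP."""
    addresses = "".join(
        f"(ADDRESS=(PROTOCOL=TCP)(HOST={host})(PORT={port}))" for host in vip_hosts
    )
    return (
        "(DESCRIPTION="
        f"(ADDRESS_LIST=(LOAD_BALANCE=on)(FAILOVER=on){addresses})"
        f"(CONNECT_DATA=(SERVICE_NAME={service_name})))"
    )


def jdbc_thin_url(descriptor):
    """Wrap a connect descriptor in the JDBC Thin URL form."""
    return "jdbc:oracle:thin:@" + descriptor


if __name__ == "__main__":
    # Hypothetical VIP names and service name, for illustration only.
    print(jdbc_thin_url(build_descriptor(["dbsvr1-vip", "dbsvr2-vip"], "orcl")))
```

The reason for asking: when a node goes down, its VIP moves to a surviving node, so a client trying that address is refused quickly and can move on to the next address in the list, instead of waiting on a TCP timeout against a physical IP that has stopped answering.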


553722769

5#

Posted 2013-6-19 21:38:41

Following along to learn.
