Oracle Database Data Recovery and Performance Tuning

1#
Posted 2013-11-19 15:18:18 | Views: 7404 | Replies: 9
Last edited by whutabs on 2013-11-20 13:21

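Our organization bought an ODA (Oracle Database Appliance) in 2012. It runs a two-node RAC and hosts a production database.

Operating system: Linux 2.6.32-300.11.1.el5uek #1 SMP x86_64 x86_64 x86_64 GNU/Linux

Database: Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production

Around midnight last night the instance restarted and interrupted critical business services. Liu, could you please help analyze this? Thanks!

Question 1: Oracle installed this ODA with 11.2.0.2 and I upgraded it to 11.2.0.3. The upgrade went fine and no errors appeared.
But after a few months of normal operation, CHECK TIMED OUT shows up for no apparent reason. After a restart everything is normal, and a few months later it appears again.
It is not only the LISTENER resource; other resources sometimes show it too. The business workload seems unaffected, which is strange.

For example: crsctl status res -t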

ora.LISTENER.lsnr
               ONLINE  INTERMEDIATE oda1                     CHECK TIMED OUT     
               ONLINE  ONLINE           oda2                                         
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE          oda2                                         
ora.LISTENER_SCAN2.lsnr
      1        ONLINE  INTERMEDIATE oda1                     CHECK TIMED OUT      
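When this recurs every few months, it can help to pull the affected rows out of the `crsctl status res -t` output automatically instead of eyeballing it. The following is a minimal Python sketch, not an Oracle tool: the function name `find_intermediate` and the regular expression are illustrative assumptions, and the sample text is the output quoted above. It assumes resource names start in column 0 and state rows are indented, as in that output.

```python
import re

# Sample copied from the crsctl output quoted in this post.
SAMPLE = """\
ora.LISTENER.lsnr
               ONLINE  INTERMEDIATE oda1                     CHECK TIMED OUT
               ONLINE  ONLINE           oda2
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE          oda2
ora.LISTENER_SCAN2.lsnr
      1        ONLINE  INTERMEDIATE oda1                     CHECK TIMED OUT
"""

# An indented state row: optional instance number, TARGET, STATE, SERVER,
# then optional free-text state details.
STATE_ROW = re.compile(r"^\s+(?:\d+\s+)?(\S+)\s+(\S+)\s+(\S+)\s*(.*)$")


def find_intermediate(text):
    """Return (resource, server, details) for each row whose STATE is INTERMEDIATE."""
    hits = []
    resource = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # Unindented line: treated as the name of the current resource.
            resource = line.strip()
            continue
        m = STATE_ROW.match(line)
        if m and m.group(2) == "INTERMEDIATE":
            hits.append((resource, m.group(3), m.group(4).strip()))
    return hits


for row in find_intermediate(SAMPLE):
    print(row)
```

On the sample this reports ora.LISTENER.lsnr and ora.LISTENER_SCAN2.lsnr, both on oda1. All the affected rows quoted here are on node 1, the node whose instance later restarted.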

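Question 2: In the database alert log for this restart, the trc file named in the error was never generated, yet other trc files exist with that same timestamp. Very strange.

For example: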
Tue Nov 19 00:02:17 2013
NOTE: ASMB terminating
Errors in file /u01/app/oracle/diag/rdbms/whut/whut1/trace/whut1_asmb_17970.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel

ls -al whut1_asmb*
-rw-r----- 1 oracle asmadmin 865 Nov 19 00:02 whut1_asmb_6787.trc
-rw-r----- 1 oracle asmadmin  60 Nov 19 00:02 whut1_asmb_6787.trm

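Question 3: For this restart, the database alert log says ASMB terminated the instance and caused the restart, while the ASM alert log says the ASMB process was hung and could not be reached.
So where does the problem actually lie?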

alert_whut1.log:

Tue Nov 19 00:02:17 2013
NOTE: ASMB terminating
Errors in file /u01/app/oracle/diag/rdbms/whut/whut1/trace/whut1_asmb_17970.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 587 Serial number: 7
Errors in file /u01/app/oracle/diag/rdbms/whut/whut1/trace/whut1_asmb_17970.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 587 Serial number: 7
ASMB (ospid: 17970): terminating the instance due to error 15064
Tue Nov 19 00:02:17 2013
System state dump requested by (instance=1, osid=17970 (ASMB)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/whut/whut1/trace/whut1_diag_17926.trc

alert_+ASM1.log:

Mon Nov 18 23:59:29 2013
WARNING: client [whut1:whut] not responsive for 200s; state=0x1. killing pid 17978
Tue Nov 19 00:02:01 2013
WARNING: ASM waited 355 secs for ASMB process in whut1:whut
Tue Nov 19 00:02:50 2013
NOTE: client whut1:whut registered, osid 6791, mbr 0x2
[grid@oda1 trace]$   
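The two ASM warnings can be turned into an approximate start time for the stall. This is a small worked calculation in Python, assuming each counter (200s, 355 secs) runs up to the timestamp of its own log entry; that is an assumption about how the warnings are measured, not something the logs state. The helper name `stall_start` is illustrative.

```python
from datetime import datetime, timedelta

# Alert-log timestamp format, e.g. "Tue Nov 19 00:02:17 2013".
# %a and %b assume an English/C locale.
FMT = "%a %b %d %H:%M:%S %Y"

# (timestamp in alert_+ASM1.log, seconds reported in the warning)
WARNINGS = [
    ("Mon Nov 18 23:59:29 2013", 200),  # client not responsive for 200s
    ("Tue Nov 19 00:02:01 2013", 355),  # ASM waited 355 secs for ASMB
]


def stall_start(stamp, seconds):
    """Subtract the reported wait from the log timestamp."""
    return datetime.strptime(stamp, FMT) - timedelta(seconds=seconds)


for stamp, secs in WARNINGS:
    print(stall_start(stamp, secs).strftime("%H:%M:%S"))
# prints 23:56:09 and 23:56:06
```

Under that assumption both warnings point back to roughly 23:56:06 to 23:56:09 on Nov 18, about six minutes before ASMB terminated the instance at 00:02:17.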

Any pointers would be much appreciated, thanks!
I will upload the relevant logs.
2#
Posted 2013-11-19 15:27:54
Tue Nov 19 00:02:17 2013
NOTE: ASMB terminating
Errors in file /u01/app/oracle/diag/rdbms/whut/whut1/trace/whut1_asmb_17970.trc:

ls -al whut1_asmb*
-rw-r----- 1 oracle asmadmin 865 Nov 19 00:02 whut1_asmb_6787.trc

Tue Nov 19 00:02:17 2013
System state dump requested by (instance=1, osid=17970 (ASMB)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/whut/whut1/trace/whut1_diag_17926.trc

ls -al whut1_diag*
-rw-r----- 1 oracle asmadmin 4011 Nov 19 00:03 whut1_diag_6743.trc

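Same timestamp, yet different trc files. Very strange: a search of the whole machine finds neither of the two files named in the alert log.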
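The mismatch is easier to see once the file names are split apart. RDBMS trace files are named `<sid>_<process>_<ospid>.trc`, so the number is the OS process ID of the process that wrote the file. The sketch below (Python; `parse_trc` and `missing_traces` are illustrative names, and the two lists are copied from the excerpts above) compares what the alert log references with what `ls` shows.

```python
import os
import re

# <sid>_<process>_<ospid>.trc
TRC_NAME = re.compile(r"(?P<sid>.+)_(?P<proc>[a-z0-9]+)_(?P<ospid>\d+)\.trc")

# Trace files named in alert_whut1.log.
REFERENCED = [
    "/u01/app/oracle/diag/rdbms/whut/whut1/trace/whut1_asmb_17970.trc",
    "/u01/app/oracle/diag/rdbms/whut/whut1/trace/whut1_diag_17926.trc",
]

# Trace files actually present according to ls.
ON_DISK = ["whut1_asmb_6787.trc", "whut1_diag_6743.trc"]


def parse_trc(path):
    """Split a trace file name into (sid, process, ospid)."""
    m = TRC_NAME.fullmatch(os.path.basename(path))
    if m is None:
        raise ValueError(f"not a trace file name: {path}")
    return m.group("sid"), m.group("proc"), int(m.group("ospid"))


def missing_traces(referenced, on_disk):
    """Return referenced trace files with no (sid, process, ospid) match on disk."""
    have = {parse_trc(p) for p in on_disk}
    return [p for p in referenced if parse_trc(p) not in have]


for path in missing_traces(REFERENCED, ON_DISK):
    print("missing:", parse_trc(path))
```

Both referenced files come back as missing: the process names (asmb, diag) match, but the alert log names ospids 17970 and 17926 while the files on disk carry 6787 and 6743. The ASM alert log shows the client re-registering with osid 6791 at 00:02:50, which suggests (but does not prove) that the files on disk were written by the processes of the restarted instance rather than by the one that crashed.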

Attachment: oda1.rar

7.71 KB, Downloads: 1368


3#
Posted 2013-11-19 16:27:52
Last edited by whutabs on 2013-11-20 13:21

More information: the alert log on node 2 shows error 12751 and then detects that node 1 restarted.

Tue Nov 19 00:01:01 2013
minact-scn master exiting with err:12751
Tue Nov 19 00:01:28 2013
IPC Send timeout detected. Sender: ospid 18283 [oracle@oda2 (LMON)]
Receiver: inst 1 binc 429487726 ospid 17936
Communications reconfiguration: instance_number 1
Tue Nov 19 00:02:18 2013
Dumping diagnostic data in directory=[cdmp_20131119000217], requested by (instance=1, osid=17970 (ASMB)), summary=[abnormal instance termination].
Tue Nov 19 00:02:19 2013
Reconfiguration started (old inc 4, new inc 7)

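I also went through the OCSSD and CRSD logs. There is a lot of output and my head is spinning. OCSSD looks normal, but the CRSD log seems to show oraagent_grid being restarted?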

2013-11-19 00:00:58.187: [ CRSCOMM][1162398016][FFAIL] Ipc: Couldnt clscreceive message, no message: 11
2013-11-19 00:00:58.187: [ CRSCOMM][1162398016] Ipc: Client disconnected.
2013-11-19 00:00:58.187: [ CRSCOMM][1162398016][FFAIL] IpcL: Listener got clsc error 11 for memNum. 25
2013-11-19 00:00:58.187: [ CRSCOMM][1162398016] IpcL: connection to member 25 has been removed
2013-11-19 00:00:58.187: [CLSFRAME][1162398016] Removing IPC Member:{Relative|Node:0|Process:25|Type:3}
2013-11-19 00:00:58.187: [CLSFRAME][1162398016] Disconnected from AGENT process: {Relative|Node:0|Process:25|Type:3}
2013-11-19 00:00:58.187: [   CRSPE][1179208000] {1:44287:23000} Disconnected from server:
2013-11-19 00:00:58.350: [    AGFW][1168701760] {1:44287:23001} Agfw Proxy Server received process disconnected notification, count=1
2013-11-19 00:00:58.350: [    AGFW][1168701760] {1:44287:23001} /u01/app/11.2.0.3/grid/bin/oraagent_grid disconnected.
2013-11-19 00:00:58.350: [    AGFW][1168701760] {1:44287:23001} Agent /u01/app/11.2.0.3/grid/bin/oraagent_grid[6165] stopped!
2013-11-19 00:00:58.350: [ CRSCOMM][1168701760] {1:44287:23001} IpcL: removeConnection: Member 25 does not exist.
2013-11-19 00:00:58.366: [    AGFW][1168701760] {1:44287:23001} Restarting the agent /u01/app/11.2.0.3/grid/bin/oraagent_grid
2013-11-19 00:00:58.366: [    AGFW][1168701760] {1:44287:23001} Starting the agent: /u01/app/11.2.0.3/grid/bin/oraagent with user id: grid and incarnation:17
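With this much CRSD output, filtering by component can make the agent stop and restart easier to spot. Below is a minimal Python sketch that assumes the line layout seen in the excerpt above (`timestamp: [component][thread] message`); the regular expression and the name `parse_line` are illustrative, not an Oracle-provided format definition.

```python
import re
from datetime import datetime

# timestamp: [component][thread id] rest-of-line
LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}): "
    r"\[\s*(?P<component>\w+)\]\[(?P<thread>\d+)\](?P<rest>.*)$"
)

# Two lines copied from the CRSD excerpt above.
SAMPLE = [
    "2013-11-19 00:00:58.350: [    AGFW][1168701760] {1:44287:23001} "
    "Agent /u01/app/11.2.0.3/grid/bin/oraagent_grid[6165] stopped!",
    "2013-11-19 00:00:58.366: [    AGFW][1168701760] {1:44287:23001} "
    "Restarting the agent /u01/app/11.2.0.3/grid/bin/oraagent_grid",
]


def parse_line(line):
    """Return (timestamp, component, message), or None if the line does not match."""
    m = LINE.match(line)
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
    return ts, m.group("component"), m.group("rest").strip()


# Keep only agent-framework (AGFW) messages and measure stop-to-restart time.
agfw = [p for p in map(parse_line, SAMPLE) if p and p[1] == "AGFW"]
stopped = next(ts for ts, _, msg in agfw if msg.endswith("stopped!"))
restarted = next(ts for ts, _, msg in agfw if "Restarting the agent" in msg)
print(restarted - stopped)
```

In this excerpt CRSD restarts oraagent_grid within the same second it sees the agent stop, so the agent exit appears to be handled automatically rather than leaving the agent down.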

Attachment: ocss_crsd.rar

26.3 KB, Downloads: 1419


4#
Posted 2013-11-19 16:45:30
Regarding "minact-scn master exiting with err:12751" on node 2, I found the following on MOS:

Exadata: Instance eviction with IPC sendtimeout messages. (Doc ID 1404576.1)

The operating system, database version, and errors are all similar to ours.

Cause

Bug:13586346 -- INSTANCE TERMINATED WITH IPC TIMEOUT ERRORS


Solution
This bug is fixed in OFA version 1.5.1-4.0.58, and this is included in Exadata 11.2.3.1.0 version.

If this issue occurs in your environment, please go ahead and open an SR with the support to confirm the issue, and get the fix.

As workaround, we can reboot the affected nodes, which might resolve the issue, temporarily.



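This brings me back to Question 1: after a restart everything is normal, then a few months later individual resources show CHECK TIMED OUT again. My guess is that this is the same IPC timeout.

Could it really be a bug? There does not seem to be a patch for it, either.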


5#
Posted 2013-11-19 17:08:48
ASMB (ospid: 29849): Terminating The Instance Due To Error 15064 [ID 1487361.1]

BUG# 13914613 : DATABASE CRASHED DUE TO ORA-240 AND ORA-15064


6#
Posted 2013-11-20 16:17:26
Today I kept going through the various logs to narrow the problem down.

Based on the errors in the DB alert log, the suspect is:

ASMB (ospid: 29849): Terminating The Instance Due To Error 15064 [ID 1487361.1]
BUG# 13914613 : DATABASE CRASHED DUE TO ORA-240 AND ORA-15064

Based on the errors in the CRSD log, the suspect is:

Exadata: Instance eviction with IPC sendtimeout messages. (Doc ID 1404576.1)
Bug:13586346 -- INSTANCE TERMINATED WITH IPC TIMEOUT ERRORS

Then, checking the agent log and the node alert log, I found:

2013-11-18 23:58:57.661: [ora.REDO.dg][1368238400] {0:24:7} [check] clsn_agent::abort: Exception LockAcquireTimeoutException
2013-11-18 23:58:57.697: [ora.REDO.dg][1368238400] {0:24:7} [check] clsn_agent::abort, agent exiting }
2013-11-18 23:58:57.699: [    AGFW][1368238400] {0:24:7} Agent is exiting with exit code: -1

2013-11-18 23:56:18.603
[/u01/app/11.2.0.3/grid/bin/oraagent.bin(23803)]CRS-5818:Aborted command 'check' for resource 'ora.asm'. Details at (:CRSAGF00113:)
{0:24:2} in /u01/app/11.2.0.3/grid/log/oda1/agent/crsd/oraagent_grid/oraagent_grid.log.
2013-11-18 23:57:54.872
[/u01/app/11.2.0.3/grid/bin/oraagent.bin(23803)]CRS-5818:Aborted command 'check' for resource 'ora.REDO.dg'. Details at
(:CRSAGF00113:) {0:24:7} in
/u01/app/11.2.0.3/grid/log/oda1/agent/crsd/oraagent_grid/oraagent_grid.log.

This resembles the situation described in the following document:

oraagent.bin Exits After Check Timed Out (Doc ID 1323679.1)
Cause
This is caused by bug 11807012


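From this I infer that, of these three suspected bugs, the latter two are the root cause and the first one is only the symptom.
Could the experts here please help confirm? Thanks in advance.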
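Since the evidence is spread across several logs, it may help to interleave the quoted events into one timeline before deciding which error is cause and which is symptom. The Python sketch below only sorts events already quoted in this thread; it assumes the clocks of both nodes and all logs are synchronized and in the same time zone, which the thread does not confirm. The source labels are shorthand and not exact file names.

```python
from datetime import datetime

# (timestamp, log source, event), copied from the excerpts in this thread.
# Grouped by log; the sort below does the interleaving.
EVENTS = [
    ("2013-11-19 00:02:17", "alert_whut1.log", "ASMB terminating the instance due to error 15064"),
    ("2013-11-18 23:59:29", "alert_+ASM1.log", "client whut1:whut not responsive for 200s"),
    ("2013-11-19 00:02:01", "alert_+ASM1.log", "ASM waited 355 secs for ASMB"),
    ("2013-11-19 00:02:50", "alert_+ASM1.log", "client whut1:whut registered, osid 6791"),
    ("2013-11-19 00:01:01", "node 2 alert log", "minact-scn master exiting with err:12751"),
    ("2013-11-19 00:01:28", "node 2 alert log", "IPC Send timeout detected (LMON)"),
    ("2013-11-19 00:00:58", "crsd log", "oraagent_grid[6165] stopped, agent restarted"),
    ("2013-11-18 23:56:18", "node alert log", "CRS-5818 aborted 'check' for ora.asm"),
    ("2013-11-18 23:57:54", "node alert log", "CRS-5818 aborted 'check' for ora.REDO.dg"),
    ("2013-11-18 23:58:57", "oraagent_grid.log", "agent exiting with exit code -1"),
]


def timeline(events):
    """Return the events ordered by timestamp."""
    return sorted(events, key=lambda e: datetime.strptime(e[0], "%Y-%m-%d %H:%M:%S"))


for ts, source, event in timeline(EVENTS):
    print(ts, f"[{source}]", event)
```

Sorted this way, the earliest quoted event is the aborted 'check' for ora.asm at 23:56:18, and the ORA-15064 termination at 00:02:17 comes near the end. That ordering is consistent with the inference above that the ASMB error is a symptom, though ordering alone does not establish which bug, if any, is responsible.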


7#
Posted 2013-11-20 21:00:52
1. The logs all appear to be truncated.
2. Please provide the complete logs.


8#
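Posted 2013-11-21 09:45:37
Thanks for looking into this, Liu! The logs are all fairly large and I was worried they would be a pain to read, so I only excerpted the parts around the key moments.

The attachment contains more complete logs.

Attachment: oda_log.rar

1.73 MB, Downloads: 1115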


9#
Posted 2013-11-25 09:13:03
Bump. Experts, please take a look, even if you are just passing by.


10#
Posted 2013-12-2 15:26:04
Nobody has answered or confirmed this, so it looks like I will just have to upgrade the ODA when I get the chance...

