The first node of the RAC keeps restarting

1^#

Posted 2012-3-13 21:16:30 | Views: 8195 | Replies: 7

Redhat 5.4 + ASM + RAW + Oracle 10g RAC

[root@rac1 ~]# uname -a
Linux rac1 2.6.18-164.el5 #1 SMP Tue Aug 18 15:51:54 EDT 2009 i686 i686 i386 GNU/Linux

[root@rac1 bin]# ./crsctl query crs activeversion
CRS active version on the cluster is [10.2.0.1.0]

[root@rac1 bin]# ./crsctl query crs softwareversion
CRS software version on node [rac1] is [10.2.0.1.0]

SQL> select * from v$version where rownum=1;
BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Prod

The instance alert.log, plus $ORA_CRS_HOME/log/$HOSTNAME/css/ocssd.log and crsd.log, have been uploaded. Liu, could you please take a look? Thanks.

log.rar

118.46 KB, downloads: 1645

Maclean Liu (刘相兵)

2^#

Posted 2012-3-13 21:22:08

"The first node of the RAC keeps restarting"

Is it the Oracle instance that restarts, or the node itself?


spring_li

3^#

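Posted 2012-3-13 21:26:34

It is the instance that restarts. ora.rac.rac1.inst starts and then dies again; after repeating this several times it is now in the down state.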
[root@rac1 bin]# ./crs_stat -t
Name           Type        Target State   Host
------------------------------------------------------------
ora.rac.db     application ONLINE ONLINE  rac1
ora....c1.inst application ONLINE OFFLINE
ora....c2.inst application ONLINE ONLINE  rac2
ora....SM1.asm application ONLINE ONLINE  rac1
ora....C1.lsnr application ONLINE ONLINE  rac1
ora.rac1.gsd   application ONLINE ONLINE  rac1
ora.rac1.ons   application ONLINE ONLINE  rac1
ora.rac1.vip   application ONLINE ONLINE  rac1
ora....SM2.asm application ONLINE ONLINE  rac2
ora....C2.lsnr application ONLINE ONLINE  rac2
ora.rac2.gsd   application ONLINE ONLINE  rac2
ora.rac2.ons   application ONLINE ONLINE  rac2
ora.rac2.vip   application ONLINE ONLINE  rac2
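
With more resources than this, it can help to pick out the rows whose Target and State disagree instead of reading the listing by eye. The following is only a rough Python sketch, not an Oracle tool: the function names parse_crs_stat and mismatched are made up for illustration, and it assumes the whitespace-separated layout of the crs_stat -t output shown above, with the Host column empty for a resource that is not running.

```python
def parse_crs_stat(text):
    """Parse `crs_stat -t` style output into a list of dicts.

    Assumes data rows start with "ora." and carry at least
    Name, Type, Target and State; Host may be missing.
    """
    resources = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the header, the dashed separator and blank lines.
        if len(fields) < 4 or not fields[0].startswith("ora."):
            continue
        name, rtype, target, state = fields[:4]
        host = fields[4] if len(fields) > 4 else ""
        resources.append(
            {"name": name, "type": rtype, "target": target,
             "state": state, "host": host}
        )
    return resources


def mismatched(resources):
    """Return the resources whose current State differs from their Target."""
    return [r for r in resources if r["target"] != r["state"]]


# A few rows copied from the listing above.
SAMPLE = """\
Name           Type        Target State   Host
------------------------------------------------------------
ora.rac.db     application ONLINE ONLINE  rac1
ora....c1.inst application ONLINE OFFLINE
ora....c2.inst application ONLINE ONLINE  rac2
"""

if __name__ == "__main__":
    for r in mismatched(parse_crs_stat(SAMPLE)):
        print(f"{r['name']}: target={r['target']} state={r['state']}")
```

On the sample rows this flags only ora....c1.inst, the rac1 instance that refuses to stay up.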


Maclean Liu (刘相兵)

4^#

Posted 2012-3-13 21:31:28

Please upload the alert log (ALERT.LOG) of the ora.rac.rac1.inst instance, not the CRS ALERT.LOG.

You can find the directory containing alert.log from the background_dump_dest parameter.


spring_li

5^#

Posted 2012-3-14 16:08:42

SQL> show parameter background
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
background_core_dump                 string      partial
background_dump_dest                 string      /home/opt/app/admin/rac/bdump

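alert_rac1.log should be in /home/opt/app/admin/rac/bdump, right? If so, it is already in the attachment I uploaded.

Another question:
This is a test RAC running in Workstation. When I started it today, some services on rac1 had not come up, and rac2 rebooted by itself ten-odd seconds after the OS came up.
I searched online for the cause and then checked /var/log/messages on rac2, which contains: Mar 14 13:52:32 rac2 logger: Oracle CSSD failure.  Rebooting for cluster integrity
I then found many entries like this in ocssd.log on rac2:
[ CSSD]2012-03-13 21:40:25.102 [161037200] >TRACE: Authentication OSD error, op: scls_auth_response_prepare
loc: mkdir info: failed to make dir /home/opt/app/product/10.2.0/crs_1/css/auth/A2172926, No space left on device dep: 28
2012-03-13 21:40:25 is exactly when I shut the machines down last night.
I checked disk space on rac2, and the filesystem holding the installation directory was indeed full. After I deleted trc files and restarted the services, everything came back up.
Now I check the free space on rac2 every few minutes and it is being consumed very quickly, roughly 100 MB every few minutes. Is this normal?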
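
To see what is actually eating the space, one option is to total file sizes under the dump directory by rough category and compare two runs a few minutes apart. This is a minimal Python sketch under assumptions: the names classify, usage_by_kind and free_mb are hypothetical, the buckets are illustrative, and it assumes the dumps land in directories whose names start with cdmp_ (as the ids in the alert log suggest) alongside ordinary .trc files.

```python
import os
import shutil
from collections import Counter


def classify(rel_path):
    """Put a file (path relative to the scanned root) into a rough bucket."""
    parts = rel_path.split(os.sep)
    # Anything inside a cdmp_<timestamp> directory counts as dump output.
    if any(p.startswith("cdmp_") for p in parts[:-1]):
        return "cdmp"
    name = parts[-1]
    if name.endswith(".trc"):
        return "trc"
    if name.startswith("core"):
        return "core"
    return "other"


def usage_by_kind(root):
    """Return a Counter mapping bucket name to total bytes under root."""
    totals = Counter()
    for dirpath, _dirs, files in os.walk(root):
        for fname in files:
            full = os.path.join(dirpath, fname)
            try:
                size = os.path.getsize(full)
            except OSError:
                continue  # file disappeared or is unreadable; skip it
            totals[classify(os.path.relpath(full, root))] += size
    return totals


def free_mb(path):
    """Free space, in MB, on the filesystem that holds path."""
    return shutil.disk_usage(path).free / (1024 * 1024)
```

Calling usage_by_kind("/home/opt/app/admin/rac/bdump") twice, a few minutes apart, and subtracting the two results would show which bucket is growing; free_mb on the same path gives the remaining headroom.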

alert_rac1.rar

53.6 KB, downloads: 1578


teapot

6^#

Posted 2012-3-14 16:17:43

Wed Mar 14 15:47:23 2012
Errors in file /home/opt/app/admin/rac/bdump/rac1_j001_24236.trc:
ORA-00600: internal error code, arguments: [4553], [2], [0], [], [], [], [], []
Wed Mar 14 15:47:24 2012
Trace dumping is performing id=[cdmp_20120314154724]
Wed Mar 14 15:47:29 2012
Doing block recovery for file 2 block 18036
Block recovery from logseq 53, block 1313 to scn 1099551

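A lot of cdump dumps were taken: Oracle detected a problem with the database and dumped every process, which consumes a lot of space.
This is a problem you run into often on RAC.
Look at the trc files to find out why these data blocks are being recovered.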
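
To get a feel for how often this happens, the alert log can be tallied by ORA-00600 first argument and by the process that raised it. What follows is just a sketch in Python, not an Oracle utility: summarize and process_name are hypothetical names, and it assumes the alert log layout seen in the excerpt above (an "Errors in file" line followed by the ORA-00600 line, with trace files named instance_process_pid.trc).

```python
import re
from collections import Counter

TRACE = re.compile(r"Errors in file (\S+\.trc)")
ORA600 = re.compile(r"ORA-00600: internal error code, arguments: \[([^\]]*)\]")
CDMP = re.compile(r"Trace dumping is performing id=\[(cdmp_\d+)\]")


def process_name(trace_path):
    """Extract the process part, e.g. 'j001' from '.../rac1_j001_24236.trc'."""
    parts = trace_path.rsplit("/", 1)[-1].split("_")
    return parts[-2] if len(parts) >= 3 else "?"


def summarize(alert_text):
    """Count ORA-00600 errors per (first argument, process) and list cdmp ids."""
    counts = Counter()
    dumps = []
    last_process = "?"
    for line in alert_text.splitlines():
        m = TRACE.search(line)
        if m:
            last_process = process_name(m.group(1))
            continue
        m = ORA600.search(line)
        if m:
            counts[(m.group(1), last_process)] += 1
            continue
        m = CDMP.search(line)
        if m:
            dumps.append(m.group(1))
    return counts, dumps


# Lines copied from the alert log excerpt above.
SAMPLE = """\
Wed Mar 14 15:47:23 2012
Errors in file /home/opt/app/admin/rac/bdump/rac1_j001_24236.trc:
ORA-00600: internal error code, arguments: [4553], [2], [0], [], [], [], [], []
Wed Mar 14 15:47:24 2012
Trace dumping is performing id=[cdmp_20120314154724]
"""

if __name__ == "__main__":
    counts, dumps = summarize(SAMPLE)
    for (code, proc), n in counts.most_common():
        print(f"ORA-600 [{code}] from {proc}: {n}")
    print(f"{len(dumps)} cdmp dump(s)")
```

Run against the full alert_rac1.log, a table like this would show at a glance whether the errors come mainly from job processes (j000, j001) or from background processes.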


Maclean Liu (刘相兵)

7^#

Posted 2012-3-14 19:34:46

There are a large number of ORA-600 [4553] errors:

Errors in file /home/opt/app/admin/rac/bdump/rac1_j000_27869.trc:
ORA-00600: internal error code, arguments: [4553], [2], [0], [], [], [], [], []
Wed Mar 14 13:57:32 2012
Trace dumping is performing id=[cdmp_20120314135732]

ORA-600 [4553] Usually this is caused by OS or hardware errors and all previously reported issues to development have been closed as not a bug. e.g.:

[45XX] ktb.c Kernel Transaction Block code.

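This is generally block corruption caused by a lost write.

This ORA-600 error causes a large number of core dumps to be written, which takes up disk space; the underlying cause may be an abnormal restart of VMware.

What actually causes the instance to restart is the LMS (Global Cache Service) process hitting ORA-600 [4519]: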

Errors in file /home/opt/app/admin/rac/bdump/rac1_lms0_22118.trc:
ORA-00600: internal error code, arguments: [4519], [2], [12613912], [0], [], [], [], []
Tue Mar 13 19:51:57 2012
Trace dumping is performing id=[cdmp_20120313195157]
Tue Mar 13 19:52:06 2012
Errors in file /home/opt/app/admin/rac/bdump/rac1_lms0_22118.trc:
ORA-00600: internal error code, arguments: [4519], [2], [12613912], [0], [], [], [], []
Tue Mar 13 19:52:06 2012
LMS0: terminating instance due to error 484
Tue Mar 13 19:52:06 2012
System state dump is made for local instance
System State dumped to trace file /home/opt/app/admin/rac/bdump/rac1_diag_22110.trc

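LMS hitting ORA-600 [4519] and then crashing the instance may be caused by Hdr: 6329991 10.2.0.3 Abstract: LMS REPORTS ORA-600 [4519] THEN INSTANCE TERMINATED.

This ORA-600 [4519] also makes the LMS process have diag write out a systemstate dump file. Such a dump is generally large as well, at least tens of MB each time.

I recommend upgrading to at least 10.2.0.5; RAC on 10.2.0.1 is very unstable.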
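
For anyone wanting to find every instance termination in a long alert log, together with the last ORA-600 seen before it, something along these lines may help. It is a hedged Python sketch only: find_terminations and parse_ts are hypothetical names, and it assumes the timestamp format and the "terminating instance due to error" wording that appear in the excerpt above.

```python
import re
from datetime import datetime

TERMINATE = re.compile(r"^(\w+): terminating instance due to error (\d+)")
ORA600 = re.compile(r"ORA-00600: internal error code, arguments: \[(\d+)\]")


def parse_ts(line):
    """Parse an alert log timestamp line such as 'Tue Mar 13 19:52:06 2012'."""
    try:
        return datetime.strptime(line.strip(), "%a %b %d %H:%M:%S %Y")
    except ValueError:
        return None  # not a timestamp line


def find_terminations(alert_text):
    """Return one dict per termination: time, process, error, last ORA-600 code."""
    events = []
    current_ts = None
    last_ora600 = None
    for line in alert_text.splitlines():
        ts = parse_ts(line)
        if ts is not None:
            current_ts = ts
            continue
        m = ORA600.search(line)
        if m:
            last_ora600 = m.group(1)
            continue
        m = TERMINATE.match(line.strip())
        if m:
            events.append({
                "time": current_ts,
                "process": m.group(1),
                "error": int(m.group(2)),
                "last_ora600": last_ora600,
            })
    return events


# Lines copied from the alert log excerpt above.
SAMPLE = """\
Tue Mar 13 19:52:06 2012
Errors in file /home/opt/app/admin/rac/bdump/rac1_lms0_22118.trc:
ORA-00600: internal error code, arguments: [4519], [2], [12613912], [0], [], [], [], []
Tue Mar 13 19:52:06 2012
LMS0: terminating instance due to error 484
"""

if __name__ == "__main__":
    for e in find_terminations(SAMPLE):
        print(e["time"], e["process"], e["error"], e["last_ora600"])
```

The "last ORA-600" association is only a heuristic based on line order, so it is worth confirming each hit against the corresponding trace file.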


Maclean Liu (刘相兵)

8^#

Posted 2012-3-14 19:36:08

ODM finding:

ORA-600 [4519] "Block Corruption Detected - Cache type wrong"
PURPOSE:
This article discusses the internal error "ORA-600 [4519]", what
it means and possible actions. The information here is only applicable
to the versions listed and is provided only for guidance.
ERROR:
ORA-600 [4519] [a] [b] [c] [d] [e]
VERSIONS:
versions 6.0 to 9.2
DESCRIPTION:
We found a corrupted block when trying to read a block using
consistent read. An invalid block type was found.
FUNCTIONALITY:
KERNEL TRANSACTION
IMPACT:
PROCESS FAILURE
Possible Block Corruption in Memory.
SUGGESTIONS:
This may be an in memory corruption.
Try shutting down and starting up the database instance again. Repeat the
operation shown in the ORA-0600[4519] trace file to see if this reproduces.
If the problem persists after restarting the instance, then the block is
corrupt on disk, and the appropriate diagnostics for block corruptions
need to be obtained (block dump, redo log dumps, etc.) Please submit the
trace files and alert.log to Oracle Support Services for further analysis.
Known Issues:
Known Bugs
NB  Bug      Fixed             Description
P*  6047085                    Linux x64-64: SGA corruption / crash following any ORA-7445 See NOTE:434889.1
    6401576  9.2.0.8.P22       OERI[ktbair1] / ORA-600 [6101] index corruption possible
    364152   7.3.3.6, 7.3.4.0  ORA-600 [4519] using FREELIST GROUPS on a CLUSTER INDEX
