2#
Posted on 2014-5-27 20:36:17
Last edited by 与晶之恋 on 2014-5-27 20:37
TIMEOUT ON CONTROL FILE ENQUEUE
If a process waits more than 900 seconds for the CF (control file) enqueue, the database kills it; if that process is a background process, the instance crashes.
Before the crash, the redo logs were switching rather quickly, and a "Checkpoint not complete" message also appeared.
As for why the CF enqueue was being held, you have to look at what the transactions/sessions were doing at the time.
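To confirm how quickly redo was switching before the incident, a query along these lines can be run against V$LOG_HISTORY; this is only an illustrative sketch, not something taken from the original trace:
Code:
-- Count log switches per hour over the last day to confirm rapid switching
SELECT TO_CHAR(first_time, 'YYYY-MM-DD HH24') AS hour,
       COUNT(*)                               AS switches
FROM   v$log_history
WHERE  first_time > SYSDATE - 1
GROUP  BY TO_CHAR(first_time, 'YYYY-MM-DD HH24')
ORDER  BY 1;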
#######################################################
*** 2014-05-13 10:36:52.588
TIMEOUT ON CONTROL FILE ENQUEUE
mode=X, type=0, wait=1, eqt=900
Dump the blocker's call stack (ospid 2425488)
*** 2014-05-13 10:37:22.614
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [2103], [0], [0], [1], [900], [], [], []
######################################################
Thread 1 advanced to log sequence 681372
Current log# 1 seq# 681372 mem# 0: /oracle/oradata/hjfxdb/redo01_01.log
Current log# 1 seq# 681372 mem# 1: /oracle/oradata/hjfxdb/redo01_02.log
Tue May 13 10:14:31 2014
Thread 1 advanced to log sequence 681373
Current log# 2 seq# 681373 mem# 0: /oracle/oradata/hjfxdb/redo02_01.log
Current log# 2 seq# 681373 mem# 1: /oracle/oradata/hjfxdb/redo02_02.log
Tue May 13 10:16:56 2014
Thread 1 advanced to log sequence 681374
Current log# 3 seq# 681374 mem# 0: /oracle/oradata/hjfxdb/redo03_01.log
Current log# 3 seq# 681374 mem# 1: /oracle/oradata/hjfxdb/redo03_02.log
Tue May 13 10:21:40 2014
Thread 1 cannot allocate new log, sequence 681375
Checkpoint not complete
Current log# 3 seq# 681374 mem# 0: /oracle/oradata/hjfxdb/redo03_01.log
Current log# 3 seq# 681374 mem# 1: /oracle/oradata/hjfxdb/redo03_02.log
Tue May 13 10:37:22 2014
Errors in file /oracle/admin/hjfxdb/bdump/hjfxdb_lgwr_3276926.trc:
ORA-00600: internal error code, arguments: [2103], [0], [0], [1], [900], [], [], []
Tue May 13 10:37:23 2014
Errors in file /oracle/admin/hjfxdb/bdump/hjfxdb_lgwr_3276926.trc:
ORA-00600: internal error code, arguments: [2103], [0], [0], [1], [900], [], [], []
#################################################################
See also:
http://blog.chinaunix.net/uid-22948773-id-3016011.html
Subject: Diagnosing ORA-600 [2103] issues in RAC environment
Doc ID: Note:297845.1 Type: HOWTO
Last Revision Date: 23-FEB-2005 Status: PUBLISHED
The information in this document applies to:
Oracle Server - Enterprise Edition - Version: 10.1 to 10.2
Information in this document applies to any platform.
Errors
ORA-600 "internal error code, arguments: [%s],[%s],[%s], [%s], [%s],
[2103]
Goal
This document tries to explain how to troubleshoot and find the cause of an ORA-600 [2103]. An ORA-600 [2103] is signalled when an Oracle process cannot get a CF enqueue for 900 seconds. In 10g, the process signalling the ORA-600 [2103] dumps enough information for us to understand what caused it.
Before proceeding with the troubleshooting, please be advised that an ORA-600 [2103] can also be caused by one or a combination of the reasons below:
Very slow I/O subsystem where the Controlfiles are stored.
Frequent log switching; for example, if the alert.log shows that the instances are switching logs very often, then an ORA-600 [2103] can occur (a quick check is sketched below).
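As an illustrative check (not part of the quoted note), the controlfile locations and redo log sizes can be listed from V$CONTROLFILE and V$LOG to judge whether undersized logs or a slow filesystem holding the controlfiles could be contributing:
Code:
-- Where the controlfiles live (check the I/O characteristics of these paths)
SELECT name FROM v$controlfile;
-- Redo log group sizes; very small groups lead to frequent switching
SELECT group#, thread#, bytes/1024/1024 AS size_mb, members, status
FROM   v$log
ORDER  BY group#;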
The steps below will help you understand why the process requesting the CF enqueue had to wait 900 seconds before giving up and crashing the instance.
Fix
Check the alert.log of the instance which reported the ORA-600 [2103]
Code:
Mon Feb 7 01:45:06 2005
Errors in file /app/oracle/admin/atmvs/bdump/atmvs1_lgwr_30822.trc:
ORA-00600: internal error code, arguments: [2103], [0], [0], [1], [900], [], [], []
Mon Feb 7 01:45:07 2005
Errors in file /app/oracle/admin/atmvs/bdump/atmvs1_lgwr_30822.trc:
ORA-00600: internal error code, arguments: [2103], [0], [0], [1], [900], [], [], []
Mon Feb 7 01:45:07 2005
LGWR: terminating instance due to error 470
So we know that the log writer (LGWR) was trying to get the CF enqueue.
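If the contention is caught while it is still in progress (rather than post-mortem from trace files), the holder and waiters of the CF enqueue can also be seen in GV$LOCK; the query below is an illustrative sketch and not part of the quoted note:
Code:
-- Sessions holding (block = 1) or requesting the CF enqueue, across all instances
SELECT l.inst_id, l.sid, s.program, l.lmode, l.request, l.ctime, l.block
FROM   gv$lock l
JOIN   gv$session s ON s.inst_id = l.inst_id AND s.sid = l.sid
WHERE  l.type = 'CF'
ORDER  BY l.block DESC, l.ctime DESC;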
Now we should check the LMD trace file in the background dump destination and try to find the same point in the timeline.
Code:
*** 2005-02-07 01:45:05.916
ENQUEUE DUMP REQUEST: from 1 on [0x0][0x0],[CF] for reason 2 mtype 0
DUMP LOCAL BLOCKER/HOLDER: block level 4 res [0x0][0x0],[CF]
Further below in the same trace file, the dump should show us the holder.
Code:
----------enqueue------------------------
lock version : 753415
Owner node : 0
grant_level : KJUSERPW
req_level : KJUSERPW
bast_level : KJUSERNL
notify_func : (nil)
resp : 0x9ad06258
procp : 0x9acc5970
pid : 6009
proc version : 650
oprocp : (nil)
opid : 0
group lock : 1
xid : 0000-0000-00000000
dd_time : 0.0 secs
dd_count : 0
timeout : 0.0 secs
On_timer_q? : N
On_dd_q? : N
lock_state : GRANTED
Open Options : KJUSERDEADLOCK
Convert options : KJUSERGETVALUE KJUSERNODEADLOCKWAIT KJUSERNODEADLOCKBLOCK KJUSERTRCCANCEL
History : 0x514514a5
Msg_Seq : 0x0
res_seq : 0
valblk : 0x50330100f4a0ffbfc82f3908dca0ffbf P3/9
DUMP LOCAL BLOCKER: initiate state dump for
possible owner[30.6009]
The interesting piece is the PID of the holder, which is 6009.
Now we should find the matching trace file in the bdump or udump destination, depending on whether the process holding the enqueue is a user process or a background process. In the example below it is a user process, and hence we search for the file in the user dump destination.
ls -l <udump_location>/*6009*.trc
In the file atmvs1_ora_6009.trc we can find KST tracing and a process state dump showing what the process is doing such that it needs to hold the CF enqueue in exclusive mode for more than 900 seconds.
In the case above, the current SQL itself shows what process 6009 is doing.
Code:
Current SQL statement for this session:
begin sys.dbms_backup_restore.createRmanStatusRow( level => :level, parent_id => :id, parent_stamp => :ts, status => :status, command_id => :command_id, operation => :operation, row_id => :row_id, row_stamp => :row_stamp ); end;
In this case, the culprit was an RMAN job that was running, which was itself a customer-specific problem. In most cases, further investigation has to be done to find out why the holder is not yielding the CF enqueue within the 900 seconds.
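If the holding process is still alive, the OS pid from the dump (6009 in this example) can also be mapped back to its session and current SQL; again, this is an illustrative sketch rather than part of the quoted note:
Code:
-- Map the OS pid from the enqueue dump back to the session and its current SQL
SELECT s.sid, s.serial#, s.username, s.program, s.sql_id
FROM   v$process p
JOIN   v$session s ON s.paddr = p.addr
WHERE  p.spid = '6009';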