PMON (ospid: 22867): terminating the instance due to error 471
本帖最后由 Tornado 于 2014-7-3 21:40 编辑故障现象:11点42分数据库宕掉,12点07重启后解决。请高手帮忙诊断下是不是Large Pages的问题,还是其它原因导致数据库宕机。
环境:os:Red Hat Enterprise Linux Server release 6.2 oracle:11.2.0.3 单实例+文件系统
alert日志:
Incremental checkpoint up to RBA , current log tail at RBA
Thu Jul 03 09:55:36 2014
Incremental checkpoint up to RBA , current log tail at RBA
Thu Jul 03 10:25:38 2014
Incremental checkpoint up to RBA , current log tail at RBA
Thu Jul 03 10:55:42 2014
Incremental checkpoint up to RBA , current log tail at RBA
Thu Jul 03 11:25:47 2014
Incremental checkpoint up to RBA , current log tail at RBA
Thu Jul 03 11:42:28 2014
PMON (ospid: 22867): terminating the instance due to error 471
Thu Jul 03 11:42:28 2014
System state dump requested by (instance=1, osid=22867 (PMON)), summary=.
System State dumped to trace file /oracle/EPT/saptrace/diag/rdbms/ept/EPT/trace/EPT_diag_22877.trc
Instance terminated by PMON, pid = 22867
Thu Jul 03 12:07:45 2014
Starting ORACLE instance (normal)
****************** Large Pages Information *****************
Total Shared Global Region in Large Pages = 0 KB (0%)
Large Pages used by this instance: 0 (0 KB)
Large Pages unused system wide = 0 (0 KB) (alloc incr 16 MB)
Large Pages configured system wide = 0 (0 KB)
Large Page size = 2048 KB
RECOMMENDATION:
Total Shared Global Region size is 4290 MB. For optimal performance,
prior to the next instance restart increase the number
of unused Large Pages by atleast 2145 2048 KB Large Pages (4290 MB)
system wide to get 100% of the Shared
Global Region allocated with Large pages
***********************************************************
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Picked latch-free SCN scheme 3
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options.
ORACLE_HOME = /oracle/EPT/112_64
System name: Linux
Node name: erptest
Release: 2.6.32-220.el6.x86_64
Version: #1 SMP Wed Nov 9 08:03:13 EST 2011
Machine: x86_64
VM name: VMWare Version: 6
Using parameter settings in server-side spfile /oracle/EPT/112_64/dbs/spfileEPT.ora
System parameters with non-default values:
processes = 190
sessions = 384
shared_pool_size = 2112M
filesystemio_options = "setall"
control_files = "/oracle/EP0/control/cntrlEP0_a.dbf"
control_files = "/oracle/EP0/control/cntrlEP0_b.dbf"
control_file_record_keep_time= 30
db_block_size = 8192
db_cache_size = 2112M
compatible = "11.2.0.2.0"
log_archive_dest_1 = "LOCATION=/oracle/EPT/oraarch/EPTarch"
log_archive_format = "%t_%s_%r.dbf"
db_recovery_file_dest = "/oracle/EPT/oraflash"
db_recovery_file_dest_size= 30000M
log_checkpoints_to_alert = TRUE
replication_dependency_tracking= FALSE
undo_tablespace = "PSAPUNDO001"
_in_memory_undo = FALSE
recyclebin = "off"
sec_case_sensitive_logon = FALSE
remote_login_passwordfile= "EXCLUSIVE"
parallel_execution_message_size= 16384
_table_lookup_prefetch_size= 0
audit_file_dest = "/oracle/EPT/saptrace/audit"
db_name = "EPT"
open_cursors = 2000
_sort_elimination_cost_ratio= 10
_b_tree_bitmap_plans = FALSE
star_transformation_enabled= "true"
parallel_threads_per_cpu = 1
query_rewrite_enabled = "false"
_index_join_enabled = FALSE
_optim_peek_user_binds = FALSE
pga_aggregate_target = 2937984122
_optimizer_mjc_enabled = FALSE
diagnostic_dest = "/oracle/EPT/saptrace"
max_dump_file_size = "20000"
Thu Jul 03 12:07:45 2014
PMON started with pid=2, OS id=26678
Thu Jul 03 12:07:45 2014
PSP0 started with pid=3, OS id=26680
Thu Jul 03 12:07:46 2014
VKTM started with pid=4, OS id=26682 at elevated priority
VKTM running at (1)millisec precision with DBRM quantum (100)ms
Thu Jul 03 12:07:46 2014
GEN0 started with pid=5, OS id=26686
Thu Jul 03 12:07:46 2014
DIAG started with pid=6, OS id=26688
Thu Jul 03 12:07:46 2014
DBRM started with pid=7, OS id=26690
Thu Jul 03 12:07:47 2014
DIA0 started with pid=8, OS id=26692
Thu Jul 03 12:07:47 2014
MMAN started with pid=9, OS id=26694
Thu Jul 03 12:07:47 2014
DBW0 started with pid=10, OS id=26696
Thu Jul 03 12:07:47 2014
LGWR started with pid=11, OS id=26698
Thu Jul 03 12:07:47 2014
CKPT started with pid=12, OS id=26700
Thu Jul 03 12:07:47 2014
SMON started with pid=13, OS id=26702
Thu Jul 03 12:07:47 2014
RECO started with pid=14, OS id=26704
Thu Jul 03 12:07:47 2014
MMON started with pid=15, OS id=26706
Thu Jul 03 12:07:47 2014
MMNL started with pid=16, OS id=26708
ORACLE_BASE from environment = /oracle
Thu Jul 03 12:07:47 2014
ALTER DATABASE MOUNT
Successful mount of redo thread 1, with mount id 454533267
Database mounted in Exclusive Mode
Lost write protection disabled
Completed: ALTER DATABASE MOUNT
Thu Jul 03 12:07:51 2014
ALTER DATABASE OPEN
内存相关配置:
SQL> show parameter sga
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
lock_sga boolean FALSE
pre_page_sga boolean FALSE
sga_max_size big integer 4288M
sga_target big integer 0
SQL> show parameter memory
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
_in_memory_undo boolean FALSE
hi_shared_memory_address integer 0
memory_max_target big integer 0
memory_target big integer 0
shared_memory_address integer 0
trc文件见附件。
附件呢? dba_sky 发表于 2014-7-3 16:19 static/image/common/back.gif
附件呢?
@dba_sky,附件已上传 Linux + Vmware + 11.2.0.3.0
*** 2014-07-03 11:42:28.296
REQUEST:system state dump at level 10, requested by (instance=1, osid=22867 (PMON)), summary=.
===================================================
SYSTEM STATE (level=10)
------------
System global information:
PROCESS 2: PMON
----------------------------------------
SO: 0x16b3888f8, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3
proc=0x16b3888f8, name=process, file=ksu.h LINE:12616, pg=0
(process) Oracle pid:2, ser:1, calls cur/top: 0x16b633ca0/0x16b633ca0
flags : (0xe) SYSTEM
flags2: (0x0), flags3: (0x10)
intr error: 0, call error: 0, sess error: 0, txn error 0
intr queue: empty
ksudlp FALSE at location: 0
(post info) last post received: 0 0 16
last post received-location: ksu.h LINE:13945 ID:ksupsc
last process to post me: 16a3eeda8 3 0
last post sent: 0 0 26
last post sent-location: ksa2.h LINE:285 ID:ksasnd
last process posted by me: 16a3edcf8 60 2
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: 0x16a449268
O/S info: user: oraept, term: UNKNOWN, ospid: 22867
OSD pid info: Unix process pid: 22867, image: oracle@erptest (PMON)
----------------------------------------
SO: 0x16aafcc88, type: 14, owner: 0x16b3888f8, flag: INIT/-/-/0x00 if: 0x1 c: 0x1
proc=0x16b3888f8, name=channel handle, file=ksr2.h LINE:367, pg=0
(broadcast handle) 0x16aafcc88 flag: (2) ACTIVE SUBSCRIBER,
owner: 0x16b3888f8 - ospid: 22867
event: 1, last message event: 1,
last message waited event: 1,
next message: (nil)(0), messages read: 0
channel: (0x16ab0f8b0) PMON actions channel
scope: 7, event: 1, last mesage event: 0,
publishers/subscribers: 0/1,
messages published: 0
heuristic msg queue length: 0
----------------------------------------
SO: 0x16aafcaf8, type: 14, owner: 0x16b3888f8, flag: INIT/-/-/0x00 if: 0x1 c: 0x1
proc=0x16b3888f8, name=channel handle, file=ksr2.h LINE:367, pg=0
(broadcast handle) 0x16aafcaf8 flag: (2) ACTIVE SUBSCRIBER,
owner: 0x16b3888f8 - ospid: 22867
event: 1, last message event: 14,
last message waited event: 14,
next message: (nil)(0), messages read: 1
channel: (0x16ab1c8f8) scumnt mount lock
scope: 1, event: 48, last mesage event: 14,
publishers/subscribers: 0/13,
messages published: 1
heuristic msg queue length: 0
----------------------------------------
SO: 0x16a52e688, type: 4, owner: 0x16b3888f8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
proc=0x16b3888f8, name=session, file=ksu.h LINE:12624, pg=0
(session) sid: 97 ser: 1 trans: (nil), creator: 0x16b3888f8
flags: (0x1000051) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
flags2: (0x409) -/-/INC
DID: 0002-, short-term DID:
txn branch: (nil)
oct: 0, prv: 0, sql: (nil), psql: (nil), user: 0/SYS
ksuxds FALSE at location: 0
service name: SYS$BACKGROUND
Current Wait Stack:
Not in wait; last wait ended 0.488532 sec ago
Wait State:
fixed_waits=0 flags=0x21 boundary=(nil)/-1
Session Wait History:
elapsed time of 0.507551 sec since last wait
0: waited for 'pmon timer'
duration=0x12c, =0x0, =0x0
PROCESS 19: W001
----------------------------------------
SO: 0x16a3edcf8, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3
proc=0x16a3edcf8, name=process, file=ksu.h LINE:12616, pg=0
(process) Oracle pid:19, ser:60, calls cur/top: 0x164db8d60/0x16b636b00
flags : (0x2) SYSTEM
flags2: (0x30), flags3: (0x10)
intr error: 0, call error: 0, sess error: 0, txn error 0
intr queue: empty
ksudlp FALSE at location: 0
(post info) last post received: 258 0 2
last post received-location: ksl2.h LINE:2362 ID:kslpsr
last process to post me: 16a3f7328 58 2
last post sent: 0 0 233
last post sent-location: ktm.h LINE:614 ID:ktmpsm
last process posted by me: 16a3eaae8 1 22
(latch info) wait_event=258 bits=0
Process Group: DEFAULT, pseudo proc: 0x16a449268
O/S info: user: oraept, term: UNKNOWN, ospid: 12111 (DEAD)
OSD pid info: Unix process pid: 12111, image: oracle@erptest (W001)
pmon 最后post 16a3edcf8 last process posted by me: 16a3edcf8 60 2
16a3edcf8指向 W001 一个DEAD PROCESS
上传一下 /var/log/messages 和 完整的alert.log Maclean Liu(刘相兵 发表于 2014-7-3 20:47 static/image/common/back.gif
上传一下 /var/log/messages 和 完整的alert.log
@maclean大师,已上传,不胜感激 message显示
Jul 3 11:40:09 erptest kernel: Free swap = 0kB
Jul 3 11:40:09 erptest kernel: Total swap = 4128760kB
Jul 3 11:40:09 erptest kernel: 4094960 pages RAM
Jul 3 11:40:09 erptest kernel: 109792 pages reserved
Jul 3 11:40:09 erptest kernel: 9916734 pages shared
Jul 3 11:40:09 erptest kernel: 1841302 pages non-shared
OS内存用光了哦??? swap 也没得空闲的了?
messages里可以看到大量 oom killer的信息, 我相问题比较明确
ul 3 11:40:49 erptest kernel: 503 12455 1157967 119466 3 0 0 oracle
Jul 3 11:40:49 erptest kernel: 502 12456 2554032 112022 4 0 0 EPT_00_DIA_W1
Jul 3 11:40:49 erptest kernel: 503 12462 1157965 47038 6 0 0 oracle
Jul 3 11:40:49 erptest kernel: Out of memory: Kill process 11763 (EPT_00_DIA_W7) score 104 or sacrifice child
Jul 3 11:40:49 erptest kernel: Killed process 11763, UID 502, (EPT_00_DIA_W7) total-vm:10246688kB, anon-rss:113208kB, file-rss:1963024kB
Jul 3 11:41:08 erptest kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
Jul 3 11:41:08 erptest kernel: oracle cpuset=/ mems_allowed=0
Jul 3 11:41:08 erptest kernel: Pid: 12462, comm: oracle Not tainted 2.6.32-220.el6.x86_64 #1
Jul 3 11:41:08 erptest kernel: Call Trace:
Jul 3 11:41:08 erptest kernel: [<ffffffff810c2cb1>] ? cpuset_print_task_mems_allowed+0x91/0xb0
Jul 3 11:41:08 erptest kernel: [<ffffffff81113a30>] ? dump_header+0x90/0x1b0
Jul 3 11:41:08 erptest kernel: [<ffffffff8120d98c>] ? security_real_capable_noaudit+0x3c/0x70
Jul 3 11:41:08 erptest kernel: [<ffffffff81113eba>] ? oom_kill_process+0x8a/0x2c0
Jul 3 11:41:08 erptest kernel: [<ffffffff81113dae>] ? select_bad_process+0x9e/0x120
Jul 3 11:41:08 erptest kernel: [<ffffffff81114310>] ? out_of_memory+0x220/0x3c0
Jul 3 11:41:08 erptest kernel: [<ffffffff8112402e>] ? __alloc_pages_nodemask+0x89e/0x940
Jul 3 11:41:08 erptest kernel: [<ffffffff81158b7a>] ? alloc_pages_current+0xaa/0x110
Jul 3 11:41:08 erptest kernel: [<ffffffff81110e57>] ? __page_cache_alloc+0x87/0x90
Jul 3 11:41:08 erptest kernel: [<ffffffff81126a7b>] ? __do_page_cache_readahead+0xdb/0x210
Jul 3 11:41:08 erptest kernel: [<ffffffff81126bd1>] ? ra_submit+0x21/0x30
Jul 3 11:41:08 erptest kernel: [<ffffffff81112123>] ? filemap_fault+0x4c3/0x500
Jul 3 11:41:08 erptest kernel: [<ffffffff8113b2c4>] ? __do_fault+0x54/0x510
Jul 3 11:41:08 erptest kernel: [<ffffffff8113b877>] ? handle_pte_fault+0xf7/0xb50
页:
[1]