OLTP数据库hang住一例
OLTP数据库hang住一例21:19左右,新建连接无法连接到数据库实例orcl2
数据库后台日志文件alert.log报以下异常:
Mon Sep 10 21:19:26 EAT 2012
Process J001 died, see its trace file
Mon Sep 10 21:19:26 EAT 2012
kkjcre1p: unable to spawn jobq slave process
Mon Sep 10 21:19:26 EAT 2012
Errors in file /app/oracle/admin/orcl/bdump/orcl2_cjq0_10930.trc:
Mon Sep 10 21:22:16 EAT 2012
Process m000 died, see its trace file
Mon Sep 10 21:22:16 EAT 2012
ksvcreate: Process(m000) creation failed
Mon Sep 10 21:24:47 EAT 2012
Process J000 died, see its trace file
Mon Sep 10 21:24:47 EAT 2012
kkjcre1p: unable to spawn jobq slave process
Mon Sep 10 21:24:47 EAT 2012
Errors in file /app/oracle/admin/orcl/bdump/orcl2_cjq0_10930.trc:
到23:05:25时,一号节点CMRP1也出现连接挂起现象:
Thread 1 advanced to log sequence 127848 (LGWR switch)
Current log# 23 seq# 127848 mem# 0: /dev/vg93/rorcl1A_redolog_21a.log
Current log# 23 seq# 127848 mem# 1: /dev/vg94/rorcl1A_redolog_22a.log
Mon Sep 10 23:05:25 EAT 2012
Process m000 died, see its trace file
Mon Sep 10 23:05:25 EAT 2012
ksvcreate: Process(m000) creation failed
Mon Sep 10 23:08:27 EAT 2012
Process J001 died, see its trace file
Mon Sep 10 23:08:27 EAT 2012
kkjcre1p: unable to spawn jobq slave process
Mon Sep 10 23:08:27 EAT 2012
Errors in file /app/oracle/admin/orcl/bdump/orcl1_cjq0_15552.trc:
Mon Sep 10 23:11:33 EAT 2012
Process m000 died, see its trace file
Mon Sep 10 23:11:33 EAT 2012
ksvcreate: Process(m000) creation failed
Mon Sep 10 23:13:27 EAT 2012
Process J001 died, see its trace file
Mon Sep 10 23:13:27 EAT 2012
kkjcre1p: unable to spawn jobq slave process
Mon Sep 10 23:13:27 EAT 2012
Errors in file /app/oracle/admin/orcl/bdump/orcl1_cjq0_15552.trc:
Mon Sep 10 23:16:06 EAT 2012
Process m000 died, see its trace file
Mon Sep 10 23:16:06 EAT 2012
ksvcreate: Process(m000) creation failed
Mon Sep 10 23:20:29 EAT 2012
故障/问题 #1
2012-09-10 21:19左右,新建连接无法连接到数据库实例orcl2
orcl2数据库后台日志文件alert.log报以下异常:
Mon Sep 10 21:19:26 EAT 2012
Process J001 died, see its trace file
Mon Sep 10 21:19:26 EAT 2012
kkjcre1p: unable to spawn jobq slave process
Mon Sep 10 21:19:26 EAT 2012
Errors in file /app/oracle/admin/orcl/bdump/orcl2_cjq0_10930.trc:
Mon Sep 10 21:22:16 EAT 2012
Process m000 died, see its trace file
Mon Sep 10 21:22:16 EAT 2012
ksvcreate: Process(m000) creation failed
Mon Sep 10 21:24:47 EAT 2012
Process J000 died, see its trace file
至23:05:25时,1号节点orcl1也出现类似问题:
Mon Sep 10 23:05:25 EAT 2012
Process m000 died, see its trace file
Mon Sep 10 23:05:25 EAT 2012
ksvcreate: Process(m000) creation failed
Mon Sep 10 23:08:27 EAT 2012
Process J001 died, see its trace file
Mon Sep 10 23:08:27 EAT 2012
kkjcre1p: unable to spawn jobq slave process
Mon Sep 10 23:08:27 EAT 2012
Errors in file /app/oracle/admin/orcl/bdump/orcl1_cjq0_15552.trc:
Mon Sep 10 23:11:33 EAT 2012
Process m000 died, see its trace file
Mon Sep 10 23:11:33 EAT 2012
ksvcreate: Process(m000) creation failed
Mon Sep 10 23:13:27 EAT 2012
Process J001 died, see its trace file
Mon Sep 10 23:13:27 EAT 2012
kkjcre1p: unable to spawn jobq slave process
Mon Sep 10 23:13:27 EAT 2012
Errors in file /app/oracle/admin/orcl/bdump/orcl1_cjq0_15552.trc:
到次日01:04:28以后实例恢复正常
故障处理:
进行了以下处理,事情发生的先后顺序如下:
1 用户在大约20120910 21:20分左右发现无法连接到 orcl2.
2. orcl2有少量应用程序负载均衡到 orcl1上
3. orcl1在22:50左右出现无法连接的情况
4. orcl2在22:05左右orcl2节点恢复正常
4.在00:47时在 orcl1/orcl2上先后执行了suspend和resume命令
5. 大约在01:10左右 orcl1节点恢复正常
故障分析:
2012-09-10 21:20左右,新建连接无法连接到数据库
数据库后台日志文件alert.log记录情况:
Mon Sep 10 21:19:26 EAT 2012
Process J001 died, see its trace file
Mon Sep 10 21:19:26 EAT 2012
kkjcre1p: unable to spawn jobq slave process
Mon Sep 10 21:19:26 EAT 2012
Errors in file /app/oracle/admin/orcl/bdump/orcl2_cjq0_10930.trc:
Mon Sep 10 21:22:16 EAT 2012
Process m000 died, see its trace file
Mon Sep 10 21:22:16 EAT 2012
ksvcreate: Process(m000) creation failed
Mon Sep 10 21:24:47 EAT 2012
Process J000 died, see its trace file
Mon Sep 10 21:24:47 EAT 2012
kkjcre1p: unable to spawn jobq slave process
Mon Sep 10 21:24:47 EAT 2012
Errors in file /app/oracle/admin/orcl/bdump/orcl2_cjq0_10930.trc:
Mon Sep 10 21:26:20 EAT 2012
Process m000 died, see its trace file
Mon Sep 10 21:26:20 EAT 2012
ksvcreate: Process(m000) creation failed
Mon Sep 10 21:27:55 EAT 2012
Process J000 died, see its trace file
Mon Sep 10 21:27:55 EAT 2012
kkjcre1p: unable to spawn jobq slave process
Mon Sep 10 21:27:55 EAT 2012
Errors in file /app/oracle/admin/orcl/bdump/orcl2_cjq0_10930.trc:
CJQ0进程的后台TRACE内容如下:
1401 R oracle 10048 1 0 178 20 e000000e20fbd700 33641 - 21:22:17 ? 0:00 ora_j000_orcl2
*** 2012-09-10 21:24:04.715
Stack:
(gdb) (gdb) #0 0xc000000000437050:0 in shmget+0x30 () from /usr/lib/hpux64/libc.so.1
#1 0x40000000028e36f0:0 in sskgmget () at sskgm.c:1353
#2 0x40000000028e39a0:0 in skgmlocate () at skgm.c:4047
#3 0x4000000002938f80:0 in skgmattach () at skgm.c:1261
#4 0x4000000002939800:0 in ksmlsge () at ksm.c:7634
#5 0x40000000028ead80:0 in ksmlsg () at ksm.c:7602
#6 0x4000000002937ed0:0 in opirip () at opirip.c:214
#7 0x40000000028bcdc0:0 in opidrv () at opidrv.c:787
#8 0x40000000028bc3b0:0 in sou2o () at sou2o.c:110
#9 0x40000000028a5dc0:0 in opimai_real () at opimai.c:293
#10 0x40000000027e1c60:0 in main () at opimai.c:175
*** 2012-09-10 23:01:47.038
Waited for process J001 to initialize for 60 seconds
*** 2012-09-10 23:01:47.039
Process diagnostic dump for J001, OS id=4711
-------------------------------------------------------------------------------
loadavg : 0.11 0.14 0.16
Swapinfo :
Avail = 311188.19Mb Used = 102252.26Mb
Swap free = 208935.92Mb Kernel rsvd = 34222.16Mb
Free Mem = 133461.94Mb
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME COMD
1401 S oracle 4711 1 0 178 20 e0000012d4733a00 53706 - 23:00:47 ? 0:00 ora_j001_orcl1
*** 2012-09-10 23:02:07.645
Stack:
(gdb) (gdb) #0 skgmattach () at skgm.c:1286
#1 0x4000000002939800:0 in ksmlsge () at ksm.c:7634
#2 0x40000000028ead80:0 in ksmlsg () at ksm.c:7602
#3 0x4000000002937ed0:0 in opirip () at opirip.c:214
#4 0x40000000028bcdc0:0 in opidrv () at opidrv.c:787
#5 0x40000000028bc3b0:0 in sou2o () at sou2o.c:110
#6 0x40000000028a5dc0:0 in opimai_real () at opimai.c:293
#7 0x40000000027e1c60:0 in main () at opimai.c:175
-------------------------------------------------------------------------------
Process diagnostic dump actual duration=20.614000 sec
(max dump time=30.000000 sec)
*** 2012-09-10 23:02:07.652
*** 2012-09-10 23:17:06.628
Process diagnostic dump for J001, OS id=7371
-------------------------------------------------------------------------------
loadavg : 0.11 0.10 0.12
Swapinfo :
Avail = 311188.19Mb Used = 102262.52Mb
Swap free = 208925.67Mb Kernel rsvd = 34232.28Mb
Free Mem = 133695.84Mb
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME COMD
1401 S oracle 7371 1 0 178 20 e0000012ea21ea00 53914 - 23:16:07 ? 0:00 ora_j001_orcl1
*** 2012-09-10 23:17:19.973
Stack:
(gdb) (gdb) #0 0xc0000000004352d0:0 in _ioctl_sys+0x30 () from /usr/lib/hpux64/libc.so.1
#1 0xc0000000004485e0:0 in ioctl+0x140 () from /usr/lib/hpux64/libc.so.1
#2 0x4000000009960350:0 in skgfr_aio_configseg () at skgfr.c:8840
#3 0x4000000009958680:0 in skgfr_enable_aio () at skgfr.c:8718
#4 0x4000000003d180c0:0 in skgfrsaiolmt () at skgfr.c:1320
#5 0x40000000041b3e20:0 in ksfdsaiolmt () at ksfd.c:7893
#6 0x40000000029d8520:0 in ksfdnfy () at ksfd.c:2393
#7 0x4000000002ac21e0:0 in kscnfy () at ksc.c:122
#8 0x4000000003b2d360:0 in ksucrp () at ksu.c:2401
#9 0x400000000262b9e0:0 in kkjrdp () at kkj.c:1277
#10 0x40000000029383b0:0 in opirip () at opirip.c:320
#11 0x40000000028bcdc0:0 in opidrv () at opidrv.c:787
#12 0x40000000028bc3b0:0 in sou2o () at sou2o.c:110
#13 0x40000000028a5dc0:0 in opimai_real () at opimai.c:293
#14 0x40000000027e1c60:0 in main () at opimai.c:175
Process diagnostic dump for J001, OS id=23180
-------------------------------------------------------------------------------
loadavg : 0.13 0.14 0.13
Swapinfo :
Avail = 311188.19Mb Used = 102504.09Mb
Swap free = 208684.09Mb Kernel rsvd = 34232.79Mb
Free Mem = 133893.34Mb
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME COMD
1401 R oracle 23180 1 0 178 20 e00000111e554400 53698 - 23:37:20 ? 0:00 ora_j001_orcl1
*** 2012-09-10 23:39:27.991
Stack:
(gdb) (gdb) #0 0xc000000000437050:0 in shmget+0x30 () from /usr/lib/hpux64/libc.so.1
#1 0x40000000028e36f0:0 in sskgmget () at sskgm.c:1353
#2 0x40000000028e39a0:0 in skgmlocate () at skgm.c:4047
#3 0x4000000002938f80:0 in skgmattach () at skgm.c:1261
#4 0x4000000002939800:0 in ksmlsge () at ksm.c:7634
#5 0x40000000028ead80:0 in ksmlsg () at ksm.c:7602
#6 0x4000000002937ed0:0 in opirip () at opirip.c:214
#7 0x40000000028bcdc0:0 in opidrv () at opidrv.c:787
#8 0x40000000028bc3b0:0 in sou2o () at sou2o.c:110
#9 0x40000000028a5dc0:0 in opimai_real () at opimai.c:293
#10 0x40000000027e1c60:0 in main () at opimai.c:175
-------------------------------------------------------------------------------
Process diagnostic dump actual duration=30.353000 sec
(max dump time=30.000000 sec)
以上是CJQ0对新创建Oracle进程的stack dump信息,相关函数表示与系统资源存在关系。
在1节点的AWR快照中出现了大量log file sync等待:
Event Waits Time(s) Avg Wait(ms) % Total Call Time Wait Class
log file sync 3,054,284 235,901 77 38.0 Commit
gc buffer busy 806,573 173,696 215 28.0 Cluster
db file sequential read 20,311,467 127,413 6 20.6 User I/O
gcs log flush sync 2,653,583 54,253 20 8.8 Other
CPU time 26,985 4.4
log file sync是服务进程的日志同步写等待时间,Oracle服务进程在发出COMMIT命令后,需要等待LGWR后台进程完成redo日志写出工作并返回后才能继续工作。
我们观察LGWR进程在问题时段的TRACE可以发现,存在大量日志写缓慢的告警信息:
1节点orcl1
*** 2012-09-10 22:59:00.667
Warning: log write time 1130ms, size 410KB
*** 2012-09-10 22:59:01.291
Warning: log write time 510ms, size 107KB
*** 2012-09-10 22:59:05.737
Warning: log write time 1750ms, size 145KB
*** 2012-09-10 22:59:09.282
Warning: log write time 860ms, size 83KB
*** 2012-09-10 22:59:11.783
Warning: log write time 890ms, size 141KB
*** 2012-09-10 22:59:12.621
Warning: log write time 580ms, size 131KB
*** 2012-09-10 22:59:14.152.
省略大量日志.....................................
*** 2012-09-11 00:55:51.387
Warning: log write time 1320ms, size 780KB
*** 2012-09-11 00:55:52.024
Warning: log write time 630ms, size 495KB
1节点最后一次出现" Warning: log write time"是在00:55,之后未再发生,这可能与ALTER SYSTEM SUSPEND/RESUME介入有关。
2节点orcl2
*** 2012-09-10 21:16:58.629
Warning: log write time 2450ms, size 3KB
*** 2012-09-10 21:17:00.748
Warning: log write time 1680ms, size 20KB
*** 2012-09-10 21:17:04.009
Warning: log write time 2280ms, size 114KB
*** 2012-09-10 21:17:04.794
Warning: log write time 550ms, size 166KB
*** 2012-09-10 21:17:05.863
Warning: log write time 540ms, size 79KB
*** 2012-09-10 21:17:06.427
Warning: log write time 560ms, size 117KB
*** 2012-09-10 21:17:08.518
Warning: log write time 2080ms, size 4KB
*** 2012-09-10 21:17:08.518
Warning: log write broadcast wait time 2090ms
*** 2012-09-10 21:17:10.633
Warning: log write time 1830ms, size 11KB
*** 2012-09-10 21:17:13.436
Warning: log write time 2660ms, size 6KB
*** 2012-09-10 21:17:16.841
Warning: log write time 2800ms, size 224KB
*** 2012-09-10 21:17:20.467
Warning: log write time 3620ms, size 61KB
*** 2012-09-10 21:17:22.693
Warning: log write time 2220ms, size 319KB
*** 2012-09-10 21:17:23.518
省略大量日志...........................
*** 2012-09-10 22:01:36.615
Warning: log write time 510ms, size 122KB
*** 2012-09-10 22:01:38.936
Warning: log write time 510ms, size 271KB
*** 2012-09-10 22:01:39.913
Warning: log write time 590ms, size 211KB
*** 2012-09-10 22:01:41.696
Warning: log write time 1280ms, size 201KB
*** 2012-09-11 00:40:39.037
Warning: log write time 11470ms, size 17KB
2节点在22:01之后不再频繁出现log write warning,几乎同时2节点上的症状解除,且此时2节点上尚未执行ALTER SYSTEM SUSPEND/RESUME命令。
同时对问题时段的ORACLE ASH信息分析可以发现,其中阻塞其他进程最多的是13147,13148 2个进程,而这两个进程则分别是两个节点的lgwr进程:
COUNT(*) BLOCKING_SESSION
82439 13147
11857 13148
4980 5940
4332 13145
4283 10038
2828 7692
1127 9435
1058 13146
1029 7626
1005 8953
747 8748
657 6146
636 11975
SQL> select count(*),program,event from sys.bak_ash where session_id=13147 group by program,event order by 1 desc;
COUNT(*) PROGRAM EVENT
---------- ------------------------ --------------------------
2892 oracle@crmdb-01 (LGWR)
1837 oracle@crmdb-01 (LGWR) log file parallel write
10 oracle@crmdb-01 (LGWR) LGWR wait for redo copy
1 oracle@crmdb-01 (LGWR) wait for scn ack
SQL> select count(*),program,event from sys.bak_ash where session_id=13148 group by program,event order by 1 desc;
COUNT(*) PROGRAM EVENT
---------- --------------------------------------------------
1509 oracle@crmdb-02 (LGWR) log file parallel write
1369 oracle@crmdb-02 (LGWR)
1 oracle@crmdb-02 (LGWR) LGWR wait for redo copy
1 oracle@crmdb-02 (LGWR) wait for scn ack
另外osw中看到以下几个操作系统守护进程(evacd、kcmond)和vxfs文件系统的守护进程(vxfsd)消耗了大量的CPU:
kcmond是HP-UX上的后台守护进程;:
The kcmond daemon monitors the consumption of kernel resources. It is
an Event Monitoring Services (EMS - see ems(5)) monitor. The data
collected by kcusage can be displayed by kcusage(1M).
kcmond includes a resource dictionary file
(/etc/opt/resmon/dictionary/kcmond_01.dict) which is used by EMS to
identify resources for monitoring.
Users can control the monitoring of each kernel tunable by setting a
percentage value for the consumption of that resource. For example, if
the user sets a value of 75 for nproc, the Kernel Tunable Monitor will
report an event when the number of processes reaches 75% of the nproc
setting.
The kcalarm(1M) command is used to add, modify and delete kernel
tunable alarms. The kcusage(1M) command is used to view data collected
by kcmond. The kcweb(1M) command provides a Web-based, graphical user
interface to control alarms and view usage data, including graphs of
the usage data.
kcmond is a light weight daemon that will exit after being idle for
ten minutes.
evacd
The following operating systems patches are requierd for HP-UX 11i v3 (11.31):
*
PHKL_35900: 11.31 evacd performance, kvaddr leak panic
以上分析得出如下结论:
RAC 2节点的连接挂起问题与LGWR写日志性能差有直接的关系,LGWR进程是ORACLE后台进程,且其一直在等待"LOG FILE PARALLEL WRITE",该进程并不受到前台服务进程挂起的影响,而是LGWR因为物理I/O写出性能差导致前台进程的挂起。
I/O写出性能差可能受到存储阵列、主机操作系统、卷管理软件等因素影响。
后续建议:
建议HP方面检测OS和存储性能,Veritas方面检测卷管理软件是否存在性能问题。
需要操作系统分析内容:
根据OS Watcher的相关信息,我们可以看到orcl1 从2012-09-10 22:59:08开始突然出现大量阻塞队列 ,阻塞队列(Blocked for resources (I/O, paging, etc.), 阻塞队列突然升高说明IO出现了明显的瓶颈,直到:
zzz ***Mon Sep 10 22:58:58 EAT 2012
procs memory page faults cpu
r b w avm free re at pi po fr de sr in sy cs us sy id
9 1 0 5967544 34128481 0 0 0 0 13 0 102 71346 777118 29900 14 5 81
9 1 0 5967544 34128481 0 0 0 0 0 0 0 57926 632772 20912 8 8 85
9 1 0 5967544 34116321 0 0 0 0 0 0 0 56152 613463 20035 6 8 86
zzz ***Mon Sep 10 22:59:03 EAT 2012
procs memory page faults cpu
r b w avm free re at pi po fr de sr in sy cs us sy id
9 1 0 5967544 34073371 0 0 0 0 13 0 102 71346 777115 29900 14 5 81
9 1 0 5967544 34073352 0 0 0 0 0 0 0 48951 520025 16365 5 8 87
9 1 0 5967544 34057576 0 0 0 0 0 0 0 45880 482248 14973 5 7 88
zzz ***Mon Sep 10 22:59:08 EAT 2012
procs memory page faults cpu
r b w avm free re at pi po fr de sr in sy cs us sy id
6 78 0 5559615 34049377 0 0 0 0 13 0 102 71346 777113 29900 14 5 81
6 78 0 5559615 34032365 0 0 0 0 0 0 0 43406 437409 14121 6 9 85
6 78 0 5559615 34028440 0 0 0 0 0 0 0 43029 425227 13924 7 5 88
zzz ***Mon Sep 10 22:59:13 EAT 2012
procs memory page faults cpu
r b w avm free re at pi po fr de sr in sy cs us sy id
6 78 0 5559615 34045183 0 0 0 0 13 0 102 71346 777110 29900 14 5 81
6 78 0 5559615 34044027 0 0 0 0 0 0 0 40697 368498 14308 5 9 86
6 78 0 5559615 34038583 0 0 0 0 0 0 0 41070 374266 15376 6 5 89
zzz ***Mon Sep 10 22:59:18 EAT 2012
procs memory page faults cpu
r b w avm free re at pi po fr de sr in sy cs us sy id
6 116 0 5589927 34028115 0 0 0 0 13 0 102 71345 777106 29900 14 5 81
6 116 0 5589927 34017415 0 0 0 0 0 0 0 39026 311276 12685 3 6 91
6 116 0 5589927 34019185 0 0 0 0 0 0 0 38711 308142 12373 5 4 91
zzz ***Mon Sep 10 22:59:23 EAT 2012
procs memory page faults cpu
r b w avm free re at pi po fr de sr in sy cs us sy id
6 116 0 5589927 34020333 0 0 0 0 13 0 102 71345 777102 29899 14 5 81
6 116 0 5589927 34006949 0 0 0 0 0 0 0 37228 276208 11115 4 7 89
6 116 0 5589927 34006949 0 0 0 0 0 0 0 37986 286684 11507 5 3 92
zzz ***Mon Sep 10 22:59:28 EAT 2012
procs memory page faults cpu
r b w avm free re at pi po fr de sr in sy cs us sy id
6 135 0 5288401 34010163 0 0 0 0 13 0 102 71345 777099 29899 14 5 81
6 135 0 5288401 34012160 0 0 0 0 0 0 0 36599 305661 10707 5 6 89
6 135 0 5288401 34010109 0 0 0 0 0 0 0 36314 294809 10646 2 3 95
zzz ***Mon Sep 10 22:59:33 EAT 2012
procs memory page faults cpu
r b w avm free re at pi po fr de sr in sy cs us sy id
6 135 0 5288401 34019972 0 0 0 0 13 0 102 71345 777095 29899 14 5 81
6 135 0 5288401 34006945 0 0 0 0 0 0 0 37621 290051 11251 6 7 87
6 135 0 5288401 34006657 0 0 0 0 0 0 0 38604 300217 11706 3 6 91
................................
zzz ***Tue Sep 11 00:05:08 EAT 2012
procs memory page faults cpu
r b w avm free re at pi po fr de sr in sy cs us sy id
6 269 0 4369577 34673020 0 0 0 0 13 0 101 71150 774995 29807 14 5 81
6 269 0 4369577 34657758 0 0 0 0 0 0 0 83569 1378484 36428 23 19 58
6 269 0 4369577 34648721 0 0 0 0 0 0 0 88743 1411625 38035 21 13 67
zzz ***Tue Sep 11 00:05:13 EAT 2012
procs memory page faults cpu
r b w avm free re at pi po fr de sr in sy cs us sy id
26 2 0 6734627 34658154 0 0 0 0 13 0 101 71151 775001 29807 14 5 81
26 2 0 6734627 34662974 0 0 0 0 0 0 0 97989 1342867 40352 19 13 69
26 2 0 6734627 34654645 0 0 0 0 0 0 0 97277 1291069 40132 14 7 79
页:
[1]