数据库运行缓慢hang解析一例
数据库运行缓慢hang解析一例数据库运行缓慢,无响应,应用无法运行。
此时在主机层面执行简单的操作系统命令 ps –ef等命令已经无法执行。
(1).通过topas 命令查看操作系统的资源消耗情况.
MEMORY
Real,MB 190464
% Comp 96 计算内存已经使用率达到96%
% Noncomp 2 ->非计算内存4%
% Client 2
PAGING SPACE
Size,MB 32768
% Used 23 当前内存已经使用殆尽,交换分区已经使用了23%.
% Free 77
说明:当单台连接数到5300时, 2台主机的内存耗尽,导致操作系统所有进程(包括)运行得都极为缓慢。
(2).查看数据库的连接数
ps -ef|grep LOCAL=NO|wc –l 长时间没有结果返回
ps –ef 能够在20几秒返回,
其结果中有5291 个“oraclesiebeldb1 (LOCAL=NO)”的进程,
说明:当前实例1的连接数在5290左右。 而实例2上的连接数量也相当。
而在08:00时刻连接数在3995个。连接数增加1300个左右。
(3).查看共享内存的信号量
oracle@crm-db-001:/home/oracle>ipcs -m
IPC status from /dev/mem as of Sat Oct 8 09:30:40 GMT+08:00 2011
T ID KEY MODE OWNER GROUP
Shared Memory:
m 0 0x55005095 --rw-r--r-- root system
m 1 0x0d000d0e --rw-rw---- root system
m 2 0x78000010 --rw-rw-rw- root system
m 3 0x7800003f --rw-rw-rw- root system
m 4 0x01002003 --rw------- pconsole system
m 159383559 0x020169d8 --rw-rw-rw- patrol bmc
m 8 0x5e00819a --rw------- root system
m 234881035 0x010169d7 --rw-rw-rw- patrol bmc
m 35651599 0x410169e2 --rw-rw-rw- patrol bmc
m 22020112 0x410169e3 --rw-rw-rw- patrol bmc
m 39845905 0x410169e4 --rw-rw-rw- patrol bmc
m 56623122 0x410169e5 --rw-rw-rw- patrol bmc
m 39845907 0x410169e6 --rw-rw-rw- patrol bmc
m 298844180 0x410169e7 --rw-rw-rw- patrol bmc
m 20971541 0x410169e8 --rw-rw-rw- patrol bmc
m 68157462 0x410169da --rw-rw-rw- patrol bmc
m 41943063 0x410169e9 --rw-rw-rw- patrol bmc
m 119537688 0x410169d9 --rw-rw-rw- patrol bmc
m 321912857 0x410169df --rw-rw-rw- patrol bmc
m 48234522 0x410169dd --rw-rw-rw- patrol bmc
m 51380251 0x7c16eade --rw-r--r-- grid oinstall
m 744488988 0x61af482c --rw-r--r-- grid oinstall
m 779091999 0xb4b64668 --rw-rw---- oracle oinstall
m 58720827 0x4b29a7e2 --rw-r--r-- grid oinstall
(4). 在停止完数据库(kill pmon进程),已经grid 后,通过topas查看内存使用情况:
MEMORY
Real,MB 190464
% Comp 18 计算内存已经使用率达仍然有18%
% Noncomp 2
% Client 2
此时,共享内存的情况:
oracle@crm-db-001:/home/oracle>ipcs -m
IPC status from /dev/mem as of Sat Oct 8 09:30:40 GMT+08:00 2011
T ID KEY MODE OWNER GROUP
Shared Memory:
m 0 0x55005095 --rw-r--r-- root system
m 1 0x0d000d0e --rw-rw---- root system
m 2 0x78000010 --rw-rw-rw- root system
m 3 0x7800003f --rw-rw-rw- root system
m 4 0x01002003 --rw------- pconsole system
m 159383559 0x020169d8 --rw-rw-rw- patrol bmc
m 8 0x5e00819a --rw------- root system
m 234881035 0x010169d7 --rw-rw-rw- patrol bmc
m 35651599 0x410169e2 --rw-rw-rw- patrol bmc
m 22020112 0x410169e3 --rw-rw-rw- patrol bmc
m 39845905 0x410169e4 --rw-rw-rw- patrol bmc
m 56623122 0x410169e5 --rw-rw-rw- patrol bmc
m 39845907 0x410169e6 --rw-rw-rw- patrol bmc
m 298844180 0x410169e7 --rw-rw-rw- patrol bmc
m 20971541 0x410169e8 --rw-rw-rw- patrol bmc
m 68157462 0x410169da --rw-rw-rw- patrol bmc
m 41943063 0x410169e9 --rw-rw-rw- patrol bmc
m 119537688 0x410169d9 --rw-rw-rw- patrol bmc
m 321912857 0x410169df --rw-rw-rw- patrol bmc
m 48234522 0x410169dd --rw-rw-rw- patrol bmc
说明:当前数据库软件已经完全停止,所以我们当时初步猜测剩下的patrol 程序可能占用了较多的内存,不过在2011-10-09凌晨安装补丁时,我们再将数据库完全停止后,在potrol程序停止的情况下,发现计算内存依然占用19%。
# svmon –G 结果
root@crm-db-001:/tmp/oracle_work#svmon -G
size inuse free pin virtual mmode
memory 48758784 10713741 38045044 8422149 9290125 Ded --> inuse=10713741*4k= 40GB
pg space 8388608 474500
work pers clnt other
pin 6975681 0 36692 1409776
in use 9034009 0 1679732
PageSize PoolSize inuse pgsp pin virtual
s 4 KB - 4338973 267172 2416101 2876013 --> 16GB
m 64 KB - 398423 12958 375378 400882 --> 24GB
# topas 结果
Topas Monitor for host: crm-db-001 EVENTS/QUEUES FILE/TTY
Sun Oct 9 00:42:19 2011 Interval: 2 Cswitch 5646 Readch 327.1K
Syscall 19607 Writech 13994
Network KBPS I-Pack O-Pack KB-In KB-Out Reads 147 Rawin 0
Total 6.8 25.9 24.4 2.1 4.7 Writes 55 Ttyout 2464
Forks 3 Igets 0
Disk Busy% KBPS TPS KB-Read KB-Writ Execs 4 Namei 258
Total 2.0 48.0 7.0 12.0 36.0 Runqueue 0.0 Dirblk 0
Waitqueue 0.0
FileSystem KBPS TPS KB-Read KB-Writ MEMORY
Total 334.5 160.6 326.2 8.2 PAGING Real,MB 190464
Faults 1242 % Comp 19 比例为19%
Name PID CPU% PgSp Owner Steals 0 % Noncomp 2
CFSMount 516350 0.8 5.9 root PgspIn 0 % Client 2
ocssd.bi 7417872 0.2 102.6 grid PgspOut 0
ohasd.bi 14106656 0.1 59.9 root PageIn 2 PAGING SPACE
oraroota 7987338 0.1 33.9 root PageOut 2 Size,MB 32768
gipcd.bi 14102714 0.1 28.3 grid Sios 5 % Used 6
evmd.bin 585756 0.1 46.4 grid % Free 94
crsd.bin 10006594 0.1 71.6 root NFS (calls/sec)
octssd.b 14094528 0.1 20.3 root SerV2 0 WPAR Activ 0
dmpdaemo 118814 0.1 0.4 root CliV2 0 WPAR Total 0
sshd 14135424 0.1 4.1 root SerV3 0 Press: "h"-help
dmpdaemo 155730 0.1 0.4 root CliV3 0 "q"-quit
gil 78172 0.1 0.9 root
oraagent 864434 0.1 75.6 grid
oraroota 12837136 0.0 58.0 root
vx_sched 98814 0.0 156.7 root
cssdmoni 13205918 0.0 82.3 root
topas 12222744 0.0 56.4 root
topas 13590954 0.0 6.5 root
ScriptAg 364978 0.0 5.0 root
Applicat 340386 0.0 5.1 root
通过svmon –P 命令分析节点1上各个进程占用内存的情况:
Inuse 其中4K页面 其中64 K页面
total内存(MB): 18970 1940 17046
oracle所占内存(MB): 4127 1583 2545
所以下阶段还需要主机厂商从操作系统层面分析当前是否还有其它未知的程序或组件占用了多余的20GB内存 (具体列表见后面的excel表格)。
(5). 进一步观察发现,在我们IBM 平台系统中的11G版本数据库存在进程内存消耗过高的问题。
oracle@crm-db-001:/home/oracle>svmon -P |more
-------------------------------------------------------------------------------
Pid Command Inuse Pin Pgsp Virtual 64-bit Mthrd 16MB
13398182 oracle 9496722 28368 8896 9469019 Y N N
PageSize Inuse Pin Pgsp Virtual
s 4 KB 50386 0 7024 18891
m 64 KB 590396 1773 117 590633
Vsid Esid Type Description PSize Inuse Pin Pgsp Virtual
a933ae 7000006a work default shmat/mmap m 4096 0 0 4096
9acb98 7000006d work default shmat/mmap m 4096 0 0 4096
139aa3e 70000054 work default shmat/mmap m 4096 0 0 4096
74fb72 70000013 work default shmat/mmap m 4096 0 0 4096
64100 7000002d work default shmat/mmap m 4096 0 0 4096
9cdf9b 70000033 work default shmat/mmap m 4096 0 0 4096
770d71 7000003d work default shmat/mmap m 4096 0 0 4096
1758473 70000022 work default shmat/mmap m 4096 0 0 4096
……
fdd5fb 70000081 work default shmat/mmap m 2255 0 0 2255
de25d9 70000080 work default shmat/mmap m 2253 0 0 2253
11d4c1b 70000077 work default shmat/mmap m 2046 0 0 2046
191311 7000009e work default shmat/mmap m 1910 0 0 1910
667160 7000008f work default shmat/mmap m 1908 0 0 1908
1910495 7000009c work default shmat/mmap m 1880 0 0 1880
de9fdd 7000007f work default shmat/mmap m 1806 0 0 1806
1213621 7000008e work default shmat/mmap m 1694 0 0 1694
1150015 90000000 work shared library text m 1578 969 55 1755
e305e5 70000089 work default shmat/mmap m 1370 0 0 1370
4fcd49 7000008d work default shmat/mmap m 1010 0 0 1010
20002 0 work kernel segment m 716 678 50 766
1bdaaba 11 work text data BSS heap sm 9420 0 0 9420
1170017 90020014 work shared library s 2115 0 3623 3882
89b38e 80020014 work USLA heap sm 1904 0 0 1904 1904个 4K页面
进程在 “work USLA”堆栈heap 处占用一片大约7M(1904 page (~7MB)的内存。
可以考虑:
应用 patch:10190759(注意阅读readme).可以降低服务进程所占用USLA的大小。
建议 :
1.减少前台客户端的连接数,降低内存的使用率。
目前SIEBEL已初步实施: 10月8号15:00左右将6台服务器(晚上将会调整其它的服务器)启用 每2个页面共用一个连接的方式,降低了数据库的连接数。
10月8号13:00时的2台共10200个(当前服务器端连接负载打开,每个实例的连接大约各5100个,内存占用为96%),在10月8号15:30 时降到8000(每个实例的连接大约各4000个连接,内存占用为85%). 在10月9号9:40时降到3500(每个实例的连接大约各1750个连接,内存占用为54%)
2. 当前进一步调整了数据库的resource manager,使得新出现的消耗较高资源的sql语句能够自动退出运行,减少CPU使用率。
3.首先初步安装11.2.0.2 RDBMS补丁 12419331(解决psu2版本上存在的重要问题)以及one-off补丁10190759(解决进程多占用7M内存的问题) (补丁冲突已经检测通过),patch 9829397(解决可能出现 ‘ASYNC DESCRIPTOR RESIZE’异常等待的BUG)。
时间安排: 2011年10月9日 0:30 – 02:30 。
补丁安装步骤:
1. 停止数据库,nodeapps
ps -ef|grep grid|grep LOCAL=NO|wc -l
su - grid
## 杀前端连接进程
ps -ef|grep grid|grep LOCAL=NO|grep -v grep|awk '{print $2}'|xargs kill -9
## 杀tns进程
ps -ef|grep grid|grep LISTENER|grep -v grep|awk '{print $2}'|xargs kill -9
srvctl stop database -d siebeldb
srvctl stop nodeapps -n crm-db-001
srvctl stop nodeapps -n crm-db-002
slibclean
slibclean
slibclean
slibclean
slibclean
slibclean
slibclean
slibclean
slibclean
ps -ef|grep tns
crs_stat -t
2.安装数据库补丁
cd /tmp/oracle_work/one-off/12419331
opatch apply -local
cd /tmp/oracle_work/one-off/10190759
opatch apply -local
cd /tmp/oracle_work/one-off/9829397
opatch apply -local
##查看补丁列表
opatch lsinventory
Interim patches (4) :
Patch 9829397 : applied on Sun Oct 09 00:36:34 CST 2011
Unique Patch ID: 13975479
Created on 27 Jul 2011, 23:04:08 hrs PST8PDT
Bugs fixed:
9829397
This patch overlays patches:
12419331
This patch needs patches:
12419331
as prerequisites
Patch 10190759 : applied on Sun Oct 09 00:35:11 CST 2011
Unique Patch ID: 14143708
Created on 29 Sep 2011, 20:11:01 hrs UTC
Bugs fixed:
10190759
This patch overlays patches:
12419331
This patch needs patches:
12419331
as prerequisites
Patch 12419331 : applied on Sun Oct 09 00:32:55 CST 2011
Unique Patch ID: 13821492
Created on 13 Jul 2011, 02:03:10 hrs PST8PDT
Bugs fixed:
10151017, 10158965, 11724916, 10190642, 12586486, 12586487, 10129643
12586488, 12586489, 10018789, 9744252, 10248523, 9956713, 10356513
9715581, 9770451, 10378005, 10170431, 10425676, 10222719, 10126094
9591812, 10127360, 10132870, 10094201, 9443361, 10193846, 11664046
11069199, 10324294, 10245086, 12586490, 10205230, 12586491, 10052141
12586492, 12586493, 12586494, 10142788, 11818335, 11830776, 12586495
9905049, 11830777, 12586496, 11830778, 6892311, 10040921, 10077191
10358019, 12431716, 10219576, 10258337, 11707699, 10264680, 10209232
11651810, 10102506, 11067567, 9881076, 10278372, 10040531, 10621169
10155605, 10082277, 10356782, 10218814, 9078442, 9788588, 10157249
9735237, 10317487, 12326246, 11707302, 10310299, 10636231, 10230571
11065646, 12419321, 10368698, 10079168, 10013431, 10228151, 10233732
10324526, 8223165, 10238786, 10217802, 10061015, 9953542, 9572787
10052956, 10080579, 11699057, 12620422, 10332111, 10227288, 10329146
10332589, 10110863, 10073683, 9869401, 10019218, 10229719, 11664719
9539440, 10373381, 9735282, 9748749, 11724984, 10022980, 10411618
11800854, 12419331, 11674485, 10187168, 6523037, 10648873, 9724970
10053725, 10084145, 10367188, 11800170, 11695285, 10157402, 9651350, 10299224
Patch 12311357 : applied on Sat May 07 00:34:11 CST 2011
Unique Patch ID: 13629097
Created on 6 Apr 2011, 22:09:52 hrs PST8PDT
Bugs fixed:
12311357, 10425672, 10157506, 10178670, 9959110, 11655840, 10634513
9891341, 11782423, 11077756, 10376847, 10314123, 10014392, 10157622
10089120, 10057296, 10053985, 9864003, 10044622, 9812970, 10083789
10073372, 9876201, 9963327, 10375649, 9336825, 10062301, 10018215
10105195, 10007185, 10071992, 10038791, 10048487, 9926027, 10260251
10052721, 10028235, 10027079, 10028343, 10231906, 10065216, 10045436
9907089, 10175855, 10284828, 10072474, 10036834, 10028637, 10029900
9974223, 9979706, 10016083, 10015460, 9918485, 9971646, 10040647
9978765, 10069541, 9915329, 10107380, 10110969, 10305361, 10029119
10233159, 10083009, 9812956, 10008467, 10036193, 10048027, 10040109
10015210, 9944978, 10033106, 9978195, 10042143, 10284693, 9679401
10111010, 10075643, 10057680, 10205290, 10124517, 10078086, 9944948
10146768, 10052529, 10011084, 10073075, 10248739, 10236074, 10128191
9975837, 10168006, 9949676
Rac system comprising of multiple nodes
Local node = crm-db-001
Remote node = crm-db-002
--------------------------------------------------------------------------------
3.启动数据库
srvctl start nodeapps -n crm-db-001
srvctl start nodeapps -n crm-db-002
srvctl start database -d siebeldb
4.安装sql语句
cd $ORACLE_HOME/rdbms/admin
sqlplus "/as sysdba"
@catbundle.sql psu apply
@?/rdbms/admin/utlrp.sql
5.升级 RMAN Catalog
catalog10g =
(DESCRIPTION =
(ADDRESS_LIST =
(ADDRESS = (PROTOCOL = TCP)(HOST = 10.7.2.59)(PORT = 1521))
)
(CONNECT_DATA =
(SERVER = DEDICATED)
(SERVICE_NAME = catalog10g)
)
)
节点2: /usr/openv/rman/
$ rman catalog rmanbak/rmanbak@catalog10g
RMAN>
upgrade catalog;
recovery catalog upgraded to version 11.02.00.02
DBMS_RCVMAN package upgraded to version 11.02.00.02
DBMS_RCVCAT package upgraded to version 11.02.00.02
6. 打完补丁后,单个进程的“work USLA”堆栈占用由7MB下降到152KB,属于正常范围。
grid@crm-db-001:/oracle/app/admin>svmon -P 13689124
-------------------------------------------------------------------------------
Pid Command Inuse Pin Pgsp Virtual 64-bit Mthrd 16MB
13689124 oracle 3713589 28256 8057 3652055 Y N N
PageSize Inuse Pin Pgsp Virtual
s 4 KB 74021 0 6569 9943
m 64 KB 227473 1766 93 227632
Vsid Esid Type Description PSize Inuse Pin Pgsp Virtual
1f670f0 10 clnt text data BSS heap, s 68150 0 - -
/dev/lv_oracle:93573
181e087 7000006f work default shmat/mmap m 4096 0 0 4096
f31df4 70000001 work default shmat/mmap m 4096 0 0 4096
37233f 70000069 work default shmat/mmap m 4096 0 0 4096
42b03 7000005f work default shmat/mmap m 4096 0 0 4096
68e76f 70000060 work default shmat/mmap m 4096 0 0 4096
118f16 7000006d work default shmat/mmap m 4096 0 0 4096
1f9dcfe 70000061 work default shmat/mmap m 4096 0 0 4096
da2bdc 7000005d work default shmat/mmap m 4096 0 0 4096
c117c7 70000002 work default shmat/mmap m 4096 0 0 4096
1fd48fb 7000006b work default shmat/mmap m 4096 0 0 4096
……
50005 9ffffffd work shared library sm 148 0 3397 3410
905997 f00000002 work process private m 5 3 0 5
714579 ffffffff work application stack sm 54 0 0 54
d000d 9ffffffe work shared library sm 51 0 4 52
14db44b 80020014 work USLA heap sm 38 0 0 38 —>152K内存大小
1628c64 7000007d work default shmat/mmap m 2 0 0 2
54f553 70000085 work default shmat/mmap m 2 0 0 2
……
1627c65 fffffff1 work application stack sm 0 0 0 0
1f392f4 fffffff0 work application stack sm 0 0 0 0
5c315a fffffff5 work application stack sm 0 0 0 0
33ad35 fffffff4 work application stack sm 0 0 0 0
13d0035 fffffffb work application stack sm 0 0 0 0
13dfc3a fffffff2 work application stack sm 0 0 0 0
17bae7e fffffffd work application stack sm 0 0 0 0
ec5feb fffffff7 work application stack sm 0 0 0 0
9fed98 fffffff6 work application stack sm 0 0 0 0
165f13 fffffff9 work application stack sm 0 0 0 0
304b38 fffffffc work application stack sm 0 0 0 0
1043202 fffffff8 work application stack sm 0 0 0 0
4. 下阶段还需要主机厂商从操作系统层面查找出是否还有其它未知的程序或组件占用了多余的20GB内存(操作系统+oracle集群软件Grid Infrastructure大约占用 20GB)。
5.在上述工作调整后,加强对主机的CPU,内存等资源进行监控,并适当扩容,尽量将使用率控制在60%以内。
2.2011-10-7日下午13:00 -14:00 点左右CRM数据库出现较高的
“cursor: pin S wait on X”等待事件:
Awr 报告的 top 5 等待事件:
Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Avg
wait % DB
Event Waits Time(s) (ms) time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
cursor: pin S wait on X 5,596 250,178 44707 54.5 Concurrenc
library cache lock 1,942 88,419 45530 19.3 Concurrenc
latch: row cache objects 1,929,683 48,846 25 10.6 Concurrenc
DB CPU 32,205 7.0
db file sequential read 2,339,886 12,267 5 2.7 User I/O
Foreground Wait Events DB/Inst: SIEBELDB/siebeldb1 Snaps: 6437-6438
-> s - second, ms - millisecond - 1000th of a second
-> Only events with Total Wait Time (s) >= .001 are shown
-> ordered by wait time desc, waits desc (idle events last)
-> %Timeouts: value of 0 indicates value was < .5%. Value of null is truly 0
Avg
%Time Total Wait wait Waits % DB
Event Waits -outs Time (s) (ms) /txn time
-------------------------- ------------ ----- ---------- ------- -------- ------
cursor: pin S wait on X 5,596 0 250,178 44707 0.0 54.5
library cache lock 1,942 0 88,419 45530 0.0 19.3
latch: row cache objects 1,929,683 0 48,846 25 1.7 10.6
db file sequential read 2,339,886 0 12,267 5 2.1 2.7
cursor: pin S 241 0 3,183 13209 0.0 .7
resmgr:cpu quantum 203,325 0 2,754 14 0.2 .6
log file sync 1,126,464 0 1,679 1 1.0 .4
gc current block 2-way 1,699,483 0 968 1 1.5 .2
分析:
可能发生library cache lock的情况的原因主要有以下3种:
1、在存储过程或者函数正在运行时被编译。
2、在存储过程或者函数正在运行时被对它们进行授权、或者移除权限等操作。
3、对某个表执行DDL期间,有另外的会话对该表执行DML或者DDL。
据客户dba当时了解到的情况是:当时可能存在grant的ddl语句。
建议:
1.在业务高峰期间,尽量避免对访问频繁的应用表进行DDL语句。
2.如果再次发生该问题,建议根据Note417122.1 来找出holder的会话,并kill掉holding的会话。
2.1 找出 Waiting a和 Holding会话.
select /*+ ordered */ h1.sid holding_session,
w1.sid waiting_session,
w.kgllktype lock_or_pin,
w.kgllkhdl address,
decode(h.kgllkmod, 0, 'None', 1, 'Null', 2, 'Share', 3, 'Exclusive',
'Unknown') mode_held,
decode(w.kgllkreq, 0, 'None', 1, 'Null', 2, 'Share', 3, 'Exclusive',
'Unknown') mode_requested
from dba_kgllock w, dba_kgllock h, v$session w1, v$session h1
where
(((h.kgllkmod != 0) and (h.kgllkmod != 1)
and ((h.kgllkreq = 0) or (h.kgllkreq = 1)))
and
(((w.kgllkmod = 0) or (w.kgllkmod= 1))
and ((w.kgllkreq != 0) and (w.kgllkreq != 1))))
and w.kgllktype = h.kgllktype
and w.kgllkhdl = h.kgllkhdl
and w.kgllkuse = w1.saddr
and h.kgllkuse = h1.saddr
/
HOLDING_SESSION WAITING_SESSION LOCK_OR_PIN ADDRESS MODE_HELD MODE_REQUESTED
9
(holding会话) 11 Pin 67EAFE88 Share Exclusive
11 17 Lock 67EAFE88 Exclusive Exclusive
11 13 Lock 67EAFE88 Exclusive Exclusive
2.2 找出Holding会话的详细信息(主要取出 serial#值).
select ses.sid, ses.serial#,ses.program,sq.sql_text from x$kgllk lck , v$session ses,v$sql sq
where kgllkhdl in
(select kgllkhdl from x$kgllk where kgllkreq >0)
and lck.KGLLKUSE = ses.saddr
and ses.sql_hash_value=sq.hash_value and kgllkpNC<>'00'
Order by lck.KGLNAOBJ
/
SID SERIAL# PROGRAM SQL_TEXT
9 14 sqlplus.exe Begin Dummy; Dbms_lock.sleep(1000); End;
2.3 杀掉该holding会话
sql> alter system kill session '9,14';
页:
[1]