311ybb

[RAC Clusterware] Users cannot connect, system HANG

Posted on 2015-5-26 10:31:22

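Version: 11.2.0.3.0, single instance
Platform: Red Hat Linux 6.5 x86-64
Hardware: 128 GB RAM; CPU: 4 x 8-core E7-4820 @ 2.00GHz; storage: EMC VNX
Problem description: on May 13 at 21:11, users and applications could no longer connect to the database; after 21:22 the system returned to normal.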

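The attached file 主要错误日志.txt (main error log) contains the main errors found so far:
21:08:02 the system log shows hald failing to allocate memory; at 21:09 Oracle processes fail to allocate memory.
21:08:54 the alert log shows "Fatal NI connect error 12170" and similar errors; from 21:11:17 "opiodr aborting process unknown ospid (8931) as a result of ORA-609" errors start to appear; at 21:23:01 the alert log returns to normal. In between, the following errors appear: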
Errors in file /opt/ora11/diag/rdbms/gnntpri/gnnt/trace/gnnt_ora_19004.trc  (incident=382849):
ORA-03137: TTC protocol internal error : [12333] [19] [3] [15] [] [] [] []

Errors in file /opt/ora11/diag/rdbms/gnntpri/gnnt/trace/gnnt_cjq0_18684.trc  (incident=377041):
ORA-00445: background process "J000" did not start after 120 seconds
kkjcre1p: unable to spawn jobq slave process
Errors in file /opt/ora11/diag/rdbms/gnntpri/gnnt/trace/gnnt_cjq0_18684.trc:
Wed May 13 21:20:35 2015
gnnt_cjq0_18684.trc shows the host running out of memory:
*** 2015-05-13 21:12:37.425
loadavg : 328.48 154.27 69.98
Memory (Avail / Total) = 265.17M / 129034.95M
Swap (Avail / Total) = 27197.34M /  32767.99M
skgpgcmdout: read() for cmd /bin/ps -elf | /bin/egrep 'PID | 9104' | /bin/grep -v grep timed out after 14.975 seconds

*** 2015-05-13 21:13:07.535
loadavg : 640.50 255.66 107.24
Memory (Avail / Total) = 267.29M / 129034.95M
Swap (Avail / Total) = 26915.58M /  32767.99M
*** 2015-05-13 21:13:37.173
loadavg : 633.26 291.48 123.93
Memory (Avail / Total) = 271.08M / 129034.95M
Swap (Avail / Total) = 26561.19M /  32767.99M
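(Question: is the used memory shown here the memory really in use, excluding the file system cache, or had the file system cache already shrunk to its minimum?)
21:18:02 the listener log shows: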
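To make the trend in those snapshots easier to read, here is a minimal Python sketch that pulls the Memory and Swap lines out of the trace text and turns them into "percent of memory available" and "swap in use". The function name `summarize` and the dict keys are my own naming, not anything Oracle provides, and the sketch does not settle whether "Avail" in the trace counts the page cache.

```python
import re

# Patterns for the two resource lines printed under each "***" timestamp
# in the cjq0 trace excerpt quoted in this post.
MEM_RE = re.compile(r"Memory \(Avail / Total\) = ([\d.]+)M / ([\d.]+)M")
SWAP_RE = re.compile(r"Swap \(Avail / Total\) = ([\d.]+)M /\s+([\d.]+)M")


def summarize(trace_text):
    """Return one dict per snapshot: % of memory available and swap used (MB)."""
    mems = MEM_RE.findall(trace_text)
    swaps = SWAP_RE.findall(trace_text)
    rows = []
    # Assumes every Memory line is paired with the Swap line that follows it.
    for (m_avail, m_total), (s_avail, s_total) in zip(mems, swaps):
        rows.append({
            "mem_avail_pct": round(100 * float(m_avail) / float(m_total), 2),
            "swap_used_mb": round(float(s_total) - float(s_avail), 2),
        })
    return rows


# The three snapshots from gnnt_cjq0_18684.trc, copied from the post.
TRACE = """\
Memory (Avail / Total) = 265.17M / 129034.95M
Swap (Avail / Total) = 27197.34M /  32767.99M
Memory (Avail / Total) = 267.29M / 129034.95M
Swap (Avail / Total) = 26915.58M /  32767.99M
Memory (Avail / Total) = 271.08M / 129034.95M
Swap (Avail / Total) = 26561.19M /  32767.99M
"""

for row in summarize(TRACE):
    print(row)
```

On these three snapshots, available memory stays at roughly 0.2% of total while swap in use grows from about 5571 MB to about 6207 MB in one minute, so the host was actively paging during that window.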
TNS-12518: TNS:listener could not hand off client connection
TNS-12540: TNS:internal limit restriction exceeded
21:19:40 the listener restarted:
13-MAY-2015 21:19:40 * establish * 1159
TNS-01159: Internal connection limit has been reached; listener has shut down
TNS-12540: TNS:internal limit restriction exceeded
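For lining the listener entries up against the alert log, a small Python sketch like the following could split an entry of the three-field shape shown above into timestamp, event and return code. The helper name `parse_listener_entry` is hypothetical, and it is my assumption (based only on 1159 being followed by TNS-01159 in this excerpt) that a non-zero code maps to the TNS message with the same number; longer entries carrying connect data have more fields and are deliberately rejected here.

```python
def parse_listener_entry(line):
    """Parse a 'timestamp * event * code' listener log entry into a dict."""
    fields = [f.strip() for f in line.split(" * ")]
    if len(fields) != 3:
        # Only the short three-field form quoted in the post is handled.
        raise ValueError("unexpected listener entry shape: %r" % line)
    timestamp, event, code = fields[0], fields[1], int(fields[2])
    # Assumption: a non-zero return code corresponds to TNS-<code, zero-padded to 5 digits>.
    error = "TNS-%05d" % code if code else None
    return {"timestamp": timestamp, "event": event, "code": code, "error": error}


entry = parse_listener_entry("13-MAY-2015 21:19:40 * establish * 1159")
print(entry["timestamp"], entry["event"], entry["error"])
```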

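In the AWR report, the main wait for this period is library cache: mutex X, concentrated on auditing, specifically the insert into aud$ statement; the database has logon/logoff auditing enabled.
I sampled the ASH data for 21:00-21:30 window by window: before 21:07 the system was fairly normal, with log file sync as the main wait event; from 21:07 to 21:11 the load rose and the main wait event was latch: row cache objects; between 21:11 and 21:20:20 there is no ASH data; from 21:20 to 21:27 the main wait event was library cache: mutex X.
In AWR, DB time is spent mainly on parsing, yet the parse counts, the SQL ordered by parse calls, and the SQL ordered by version count all look fairly normal.
In ASH, the blocker for the library cache: mutex X waits is the DIAG process.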
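The window-by-window ASH breakdown can be sketched in Python roughly as follows: bucket (sample time, event) pairs into the four windows described here and report the most frequent event in each, with None marking a window that has no samples at all. The function name, the window labels and the assumption that the samples have already been exported as (time, event) pairs are mine; the rows in the demo are placeholders to show the call, not data from this incident.

```python
from collections import Counter
from datetime import time

# Windows taken from the description in this post (evening of May 13).
WINDOWS = [
    ("before 21:07", time(21, 0), time(21, 7)),
    ("21:07-21:11", time(21, 7), time(21, 11)),
    ("21:11-21:20", time(21, 11), time(21, 20)),
    ("21:20-21:27", time(21, 20), time(21, 27)),
]


def top_event_per_window(samples):
    """samples: iterable of (datetime.time, event_name) pairs.

    Returns {window_label: (top_event, sample_count)}, or None for a
    window in which no sample falls (for example, the ASH gap).
    """
    counts = {label: Counter() for label, _, _ in WINDOWS}
    for t, event in samples:
        for label, start, end in WINDOWS:
            if start <= t < end:
                counts[label][event] += 1
                break
    return {
        label: (c.most_common(1)[0] if c else None)
        for label, c in counts.items()
    }


# Placeholder rows only, to illustrate the call; not real ASH output.
demo = [
    (time(21, 1), "log file sync"),
    (time(21, 8), "latch: row cache objects"),
    (time(21, 8, 30), "latch: row cache objects"),
    (time(21, 9), "log file sync"),
    (time(21, 22), "library cache: mutex X"),
]
for label, top in top_event_per_window(demo).items():
    print(label, top)
```

A window that comes back as None is worth reading together with the OS-level evidence for the same minutes, since missing samples usually mean the sampler itself could not run.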

Could the experts here please help analyze this?