ORA-01114 ORA-29701一例
ORA-01114 ORA-29701一例库一个前台服务进程报错 ORA-01114 ORA-29701,错误信息如下:
2013-09-06 09:40:37.257: [ CSSCLNT]clssscConnect: gipcWait failed with 16 (12)
2013-09-06 09:40:37.263: [ CSSCLNT]clsssInitNative: connect to (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_upcmdb03_)) failed, rc 16
kgxgncin: CLSS init failed with status 3
kgxgncin: return status 3 (1311719766 SKGXN not av) from CLSS
kjfmsgr: unable to connect to NM for reg in shared group
*** 2013-09-06 09:40:37.278
dbkedDefDump(): Starting a non-incident diagnostic dump (flags=0x0, level=0, mask=0x0)
----- Error Stack Dump -----
ORA-01114: 将块写入文件 时出现 IO 错误 (块 # )
ORA-01114: 将块写入文件 162 时出现 IO 错误 (块 # 249666)
ORA-29701: 无法连接到集群同步服务
ORA-29701: 无法连接到集群同步服务
Ocssd.log:
2013-09-06 09:38:19.327: [ CSSD]clssgmClientConnectMsg: msg flags 0x0000
2013-09-06 09:38:19.327: gipcInternalSend: connection not valid for send operation endp 1139caff0 { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_upcmdb03_)(GIPCID=0623a806-0b64b484-1901192))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_upcmdb03_)(GIPCID=0b64b484-0623a806-1508132))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 1508132, flags 0x34271e, usrFlags 0x14000 }, ret gipcretConnectionLost (12)
2013-09-06 09:38:19.327: gipcSendF : EXCEPTION[ ret gipcretConnectionLost (12) ] failed to send on endp 1139caff0 { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_upcmdb03_)(GIPCID=0623a806-0b64b484-1901192))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_upcmdb03_)(GIPCID=0b64b484-0623a806-1508132))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 1508132, flags 0x34271e, usrFlags 0x14000 }, addr 0000000000000000, buf 113745ed8, len 103, cookie 1138ab750, flags 0x4000
2013-09-06 09:38:19.327: gipcSendF : EXCEPTION[ ret gipcretConnectionLost (12) ] failed to send on endp 11388c310 { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_upcmdb03_)(GIPCID=0623a806-0b64b484-1901192))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_upcmdb03_)(GIPCID=0b64b484-0623a806-1508132))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 1508132, flags 0x13061f, usrFlags 0x10010 }, addr 0000000000000000, buf 113745ed8, len 103, cookie 113745f50, flags 0x0
2013-09-06 09:38:19.327: [ CSSD]clssgmclSendGIPC: gipcSend failure - 12
2013-09-06 09:38:19.327: [ CSSD]clssgmSendClient: Send failed rc 12, con (c5e31ee), client (0), proc (113bf6730)
2013-09-06 09:38:19.331: [ CSSD]clssgmDeadProc: proc 113bf6730
2013-09-06 09:40:59.364: gipcmodMuxTransferAccept: internal accept request failed endp 11125bfd0, child 1139ea710, ret gipcretAuthFail (22)
2013-09-06 09:40:59.364: [ GIPCMUX] gipcmodMuxTransferAccept: EXCEPTION[ ret gipcretAuthFail (22) ] error during accept on endp 11125bfd0
2013-09-06 09:40:59.364: gipcmodClscCallback: async request failed req 1135c8150 { gipcSendRequest : addr '', data 111464750, len 48, olen 0, parentEndp 1137538b0, ret gipcretConnectionLost (12), objFlags 0x0, reqFlags 0x224 }, ret gipcretConnectionLost (12)
2013-09-06 09:40:59.365: gipcmodMuxTransferAccept: internal accept request failed endp 11125bfd0, child 1137538b0, ret gipcretConnectionInvalid (13)
2013-09-06 09:40:59.365: [ GIPCMUX] gipcmodMuxTransferAccept: EXCEPTION[ ret gipcretConnectionInvalid (13) ] error during accept on endp 11125bfd0
2013-09-06 09:40:59.381: [ CSSD]clssscSelect: cookie accept request 11098d2c0
2013-09-06 09:40:59.381: [ CSSD]clssgmAllocProc: (1139ea710) allocated
2013-09-06 09:40:59.385: [ CSSD]clssgmClientConnectMsg: properties of cmProc 1139ea710 - 0,1,2,3,4
2013-09-06 09:40:59.385: [ CSSD]clssgmClientConnectMsg: Connect from con(c5e56d8) proc(1139ea710) pid(8454896/8454896) version 11:2:1:4, properties: 0,1,2,3,4
2013-09-06 09:40:59.385: [ CSSD]clssgmClientConnectMsg: msg flags 0x0000
2013-09-06 09:40:59.398: [ CSSD]clssscSelect: cookie accept request 11098d2c0
2013-09-06 09:40:59.398: [ CSSD]clssgmAllocProc: (113a59df0) allocated
2013-09-06 09:40:59.402: [ CSSD]clssgmClientConnectMsg: properties of cmProc 113a59df0 - 0,1,2,3,4
2013-09-06 09:40:59.402: [ CSSD]clssgmClientConnectMsg: Connect from con(c5e572f) proc(113a59df0) pid(7930566/7930566) version 11:2:1:4, properties: 0,1,2,3,4
2013-09-06 09:40:59.402: [ CSSD]clssgmClientConnectMsg: msg flags 0x0000
2013-09-06 09:40:59.417: [ CSSD]clssscSelect: cookie accept request 1139ea710
该ORA-29701: 无法连接到集群同步服务的原因是进程无法成功注册到CSS 集群同步服务中, 对应ocssd.log中存在 2013-09-06 09:38:19.327 gipcInternalSend: connection not valid for send operation endp 1139caff0 { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_upcmdb03_) 的信息,这说明当时cssd.bin进程无法有效处理进程连接行为。
该症状与文档< I/O Errors in Alert log with ORA-29701, with "gipcWait failed with 16" in trace (Doc ID 1496329.1)>描述的较为一致。
当ocssd.bin进程无法有效处理进程连接行为时,可能出现上述ORA-29701错误,该错误导致进程无法成功提交IO。 Ocssd.bin进程无法有效处理该行为的可能包括 :进程HANG、CPU或内存资源不足等。
通过orac性能监控可以看到在问题时段有较高的CPU使用率:
AWR中显示存在少量OS CPU WAIT TIME:
Statistic Value End Value
AVG_BUSY_TIME 59,927
AVG_IDLE_TIME 299,582
AVG_IOWAIT_TIME 20,336
AVG_SYS_TIME 5,995
AVG_USER_TIME 53,764
BUSY_TIME 3,847,221
IDLE_TIME 19,184,896
IOWAIT_TIME 1,312,936
SYS_TIME 395,156
USER_TIME 3,452,065
LOAD 9 10
OS_CPU_WAIT_TIME 3,736,500
RSRC_MGR_CPU_WAIT_TIME 0
VM_IN_BYTES 770,048
VM_OUT_BYTES 339,587,072
PHYSICAL_MEMORY_BYTES 206,158,430,208
NUM_CPUS 64
NUM_CPU_CORES 32
NUM_LCPUS 64
NUM_VCPUS 32
GLOBAL_RECEIVE_SIZE_MAX 41,943,040
GLOBAL_SEND_SIZE_MAX 41,943,040
TCP_RECEIVE_SIZE_DEFAULT 16,384
TCP_RECEIVE_SIZE_MAX 9,223,372,036,854,775,807
TCP_RECEIVE_SIZE_MIN 4,096
TCP_SEND_SIZE_DEFAULT 16,384
TCP_SEND_SIZE_MAX 9,223,372,036,854,775,807
TCP_SEND_SIZE_MIN 4,096
消耗CPU 最高的几个语句如下:
CPU Time (s) Executions CPU per Exec (s) %Total Elapsed Time (s) %CPU %IO SQL Id SQL Module SQL Text
8,624.29 7,673 1.12 26.81 9,227.52 93.46 0.00 bwzctshpb44z9
JDBC Thin Client SELECT A.BUSIID, A.ICONTENTID,...
6,390.37 21,411,575 0.00 19.86 6,828.10 93.59 0.00 5ntdmkpwybhax
JDBC Thin Client SELECT ID FROM CTP_BRANCH WHER...
5,150.51 16,480 0.31 16.01 5,559.22 92.65 0.84 4d060subkztnf
JDBC Thin Client BEGIN p_pcm_wn_united_approve....
3,577.02 1,211 2.95 11.12 5,032.32 71.08 21.05 2yz5fpu6y3gpj
JDBC Thin Client BEGIN p_pcm_ad_handlist_manage...
2,445.13 2,323 1.05 7.60 2,918.86 83.77 10.13 djt2w8tuxkyuz
JDBC Thin Client BEGIN p_pcm_wn_united_approve....
1,189.84 1,153 1.03 3.70 1,316.04 90.41 2.70 3u1m3wfdzfas6
JDBC Thin Client BEGIN p_pcm_bg_bill_apply.r_ap...
1,129.66 288 3.92 3.51 1,638.17 68.96 23.29 2cd6vjbqz0x3k
JDBC Thin Client SELECT COUNT (1) FROM ( SELECT...
1,095.32 288 3.80 3.40 1,180.85 92.76 0.96 39df691a6bb9j
JDBC Thin Client SELECT ta500251001, ta50025100...
1,039.03 274 3.79 3.23 1,525.40 68.12 23.08 4nbf95hhdqqjk
JDBC Thin Client SELECT COUNT (1) FROM ( SELECT...
1,022.15 274 3.73 3.18 1,095.55 93.30 0.71 0hazgftgt2qu4
JDBC Thin Client SELECT ta500251001, ta50025100...
同时我们观察oraagent.bin进程的priority 优先级发现该进程的优先级不是0,而是60:
grid S 1901192 11272321 60 /oragrid/product/11.2.0/bin/ocssd.bin
grid S 1901192 11665611 60 /oragrid/product/11.2.0/bin/ocssd.bin
grid S 1901192 11731153 60 /oragrid/product/11.2.0/bin/ocssd.bin
grid S 1901192 16122139 0 /oragrid/product/11.2.0/bin/ocssd.bin
grid S 1901192 16253213 60 /oragrid/product/11.2.0/bin/ocssd.bin
grid S 1901192 16449833 60 /oragrid/product/11.2.0/bin/ocssd.bin
grid S 1901192 16580917 60 /oragrid/product/11.2.0/bin/ocssd.bin
grid S 1901192 16712107 60 /oragrid/product/11.2.0/bin/ocssd.bin
grid Z 1901192 16777645 60 /oragrid/product/11.2.0/bin/ocssd.bin
grid S 1901192 16908755 60 /oragrid/product/11.2.0/bin/ocssd.bin
grid Z 1901192 17039863 60 /oragrid/product/11.2.0/bin/ocssd.bin
grid Z 1901192 20054343 60 /oragrid/product/11.2.0/bin/ocssd.bin
grid S 1901192 13959817 60 /oragrid/product/11.2.0/bin/ocssd.bin
grid S 1901192 14877215 0 /oragrid/product/11.2.0/bin/ocssd.bin
grid S 1901192 14942755 60 /oragrid/product/11.2.0/bin/ocssd.bin
grid S 1901192 16188071 60 /oragrid/product/11.2.0/bin/ocssd.bin
grid S 1901192 16319141 60 /oragrid/product/11.2.0/bin/ocssd.bin
grid Z 1901192 16384695 60 /oragrid/product/11.2.0/bin/ocssd.bin
grid S 1901192 16515781 60 /oragrid/product/11.2.0/bin/ocssd.bin
grid S 1901192 16581325 60 /oragrid/product/11.2.0/bin/ocssd.bin
grid S 1901192 13108129 61 /oragrid/product/11.2.0/bin/ocssd.bin
grid Z 1901192 13370303 60 /oragrid/product/11.2.0/bin/ocssd.bin
grid S 1901192 13435839 60 /oragrid/product/11.2.0/bin/ocssd.bin
grid S 1901192 13501385 60 /oragrid/product/11.2.0/bin/ocssd.bin
grid S 1901192 13566927 60 /oragrid/product/11.2.0/bin/ocssd.bin
grid Z 1901192 13829109 60 /oragrid/product/11.2.0/bin/ocssd.bin
grid Z 1901192 15729599 60 /oragrid/product/11.2.0/bin/ocssd.bin
grid S 1901192 14025881 60 /oragrid/product/11.2.0/bin/ocssd.bin
grid Z 1901192 14156957 60 /oragrid/product/11.2.0/bin/ocssd.bin
grid Z 1901192 14288041 60 /oragrid/product/11.2.0/bin/ocssd.bin
grid S 1901192 14419137 60 /oragrid/product/11.2.0/bin/ocssd.bin
grid S 1901192 14484675 60 /oragrid/product/11.2.0/bin/ocssd.bin
grid S 1901192 14550215 60 /oragrid/product/11.2.0/bin/ocssd.bin
该现象与BUG:13940331 - VALUE FOR SETTING THREAD SCHEDULING IS INCORRECT IN SLTSTSPAWN中描述的一致:
Bug 13940331 AIX: OCSSD threads are not set to the correct priority
being affected
11.2.0.2
10.2.0.5
10.2.0.4
Platforms affected IBM AIX Power Based Systems (64bit)
Fixed:
This issue is fixed in 12.1.0.1 (Base Release)
Symptoms: Related To:
Hang (Process Hang)
Cluster Ready Services / Parallel Server Management
Description
On AIX has found that the threads created for OCSSD were not inheriting the correct priority
thie may cause evictions when the machine is running under high workload.
The processes show the functions "ntevque->netevpque->poll" in their call stack and may hang.
Without the patch the priority for OCSSD threads is 60, after the patch we can see it's zero as:
$ ps -mp <OCSSD_PID> -o THREAD
USER PID PPID TID S CP PRI SC WCHAN F TT BND COMMAND
oraclets 14090370 11206740 - A 13 0 30 * 10240103 - - /local/oracle/grid/produ
- - - 27459713 Z 0 0 1 - c00001 - - -
- - - 38141977 S 0 0 1 f1000f0a10024640 8410400 - - -
- - - 38600869 S 0 0 1 - 418400 - - -
- - - 46268633 S 0 0 1 f1000f0a1002c240 8410400 - - -
- - - 46465061 S 0 0 1 f1000f0a1002c540 8410400 - - -
- - - 46792861 S 0 0 1 f1000f0a1002ca40 8410400 - - -
- - - 47710399 S 0 0 1 f1000f0a1002d840 8410400 - - -
- - - 48562263 S 0 0 1 f1000f0a1002e540 8410400 - - -
- - - 49545223 Z 0 0 1 - c00001 - - -
...
Workaround
None
该线程priority的问题仅发生在AIX平台上,目前存在11.2.0.3上的补丁 Patch 13940331: VALUE FOR SETTING THREAD SCHEDULING IS INCORRECT IN SLTSTSPAWN。
1.3 建议
1、 考虑优化相关SQL 语句等,减少其对CPU的使用率
2、 考虑应用11.2.0.3上的补丁Patch 13940331: VALUE FOR SETTING THREAD SCHEDULING IS INCORRECT IN SLTSTSPAWN
页:
[1]