Liu Maclean(刘相兵 发表于 2013-12-26 15:08:23

ORA-01114 ORA-29701一例

ORA-01114 ORA-29701一例


库一个前台服务进程报错 ORA-01114 ORA-29701,错误信息如下:


2013-09-06 09:40:37.257: [ CSSCLNT]clssscConnect: gipcWait failed with 16 (12)
2013-09-06 09:40:37.263: [ CSSCLNT]clsssInitNative: connect to (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_upcmdb03_)) failed, rc 16
kgxgncin: CLSS init failed with status 3
kgxgncin: return status 3 (1311719766 SKGXN not av) from CLSS
kjfmsgr: unable to connect to NM for reg in shared group

*** 2013-09-06 09:40:37.278
dbkedDefDump(): Starting a non-incident diagnostic dump (flags=0x0, level=0, mask=0x0)
----- Error Stack Dump -----
ORA-01114: 将块写入文件  时出现 IO 错误 (块 # )
ORA-01114: 将块写入文件 162 时出现 IO 错误 (块 # 249666)
ORA-29701: 无法连接到集群同步服务
ORA-29701: 无法连接到集群同步服务


Ocssd.log:

2013-09-06 09:38:19.327: [    CSSD]clssgmClientConnectMsg: msg flags 0x0000
2013-09-06 09:38:19.327: gipcInternalSend: connection not valid for send operation endp 1139caff0 { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_upcmdb03_)(GIPCID=0623a806-0b64b484-1901192))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_upcmdb03_)(GIPCID=0b64b484-0623a806-1508132))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 1508132, flags 0x34271e, usrFlags 0x14000 }, ret gipcretConnectionLost (12)
2013-09-06 09:38:19.327: gipcSendF : EXCEPTION[ ret gipcretConnectionLost (12) ]  failed to send on endp 1139caff0 { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_upcmdb03_)(GIPCID=0623a806-0b64b484-1901192))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_upcmdb03_)(GIPCID=0b64b484-0623a806-1508132))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 1508132, flags 0x34271e, usrFlags 0x14000 }, addr 0000000000000000, buf 113745ed8, len 103, cookie 1138ab750, flags 0x4000
2013-09-06 09:38:19.327: gipcSendF : EXCEPTION[ ret gipcretConnectionLost (12) ]  failed to send on endp 11388c310 { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_upcmdb03_)(GIPCID=0623a806-0b64b484-1901192))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_upcmdb03_)(GIPCID=0b64b484-0623a806-1508132))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 1508132, flags 0x13061f, usrFlags 0x10010 }, addr 0000000000000000, buf 113745ed8, len 103, cookie 113745f50, flags 0x0
2013-09-06 09:38:19.327: [    CSSD]clssgmclSendGIPC: gipcSend failure - 12
2013-09-06 09:38:19.327: [    CSSD]clssgmSendClient: Send failed rc 12, con (c5e31ee), client (0), proc (113bf6730)
2013-09-06 09:38:19.331: [    CSSD]clssgmDeadProc: proc 113bf6730


2013-09-06 09:40:59.364: gipcmodMuxTransferAccept: internal accept request failed endp 11125bfd0, child 1139ea710, ret gipcretAuthFail (22)
2013-09-06 09:40:59.364: [ GIPCMUX] gipcmodMuxTransferAccept: EXCEPTION[ ret gipcretAuthFail (22) ]  error during accept on endp 11125bfd0
2013-09-06 09:40:59.364: gipcmodClscCallback: async request failed req 1135c8150 { gipcSendRequest : addr '', data 111464750, len 48, olen 0, parentEndp 1137538b0, ret gipcretConnectionLost (12), objFlags 0x0, reqFlags 0x224 }, ret gipcretConnectionLost (12)
2013-09-06 09:40:59.365: gipcmodMuxTransferAccept: internal accept request failed endp 11125bfd0, child 1137538b0, ret gipcretConnectionInvalid (13)
2013-09-06 09:40:59.365: [ GIPCMUX] gipcmodMuxTransferAccept: EXCEPTION[ ret gipcretConnectionInvalid (13) ]  error during accept on endp 11125bfd0
2013-09-06 09:40:59.381: [    CSSD]clssscSelect: cookie accept request 11098d2c0
2013-09-06 09:40:59.381: [    CSSD]clssgmAllocProc: (1139ea710) allocated
2013-09-06 09:40:59.385: [    CSSD]clssgmClientConnectMsg: properties of cmProc 1139ea710 - 0,1,2,3,4
2013-09-06 09:40:59.385: [    CSSD]clssgmClientConnectMsg: Connect from con(c5e56d8) proc(1139ea710) pid(8454896/8454896) version 11:2:1:4, properties: 0,1,2,3,4
2013-09-06 09:40:59.385: [    CSSD]clssgmClientConnectMsg: msg flags 0x0000
2013-09-06 09:40:59.398: [    CSSD]clssscSelect: cookie accept request 11098d2c0
2013-09-06 09:40:59.398: [    CSSD]clssgmAllocProc: (113a59df0) allocated
2013-09-06 09:40:59.402: [    CSSD]clssgmClientConnectMsg: properties of cmProc 113a59df0 - 0,1,2,3,4
2013-09-06 09:40:59.402: [    CSSD]clssgmClientConnectMsg: Connect from con(c5e572f) proc(113a59df0) pid(7930566/7930566) version 11:2:1:4, properties: 0,1,2,3,4
2013-09-06 09:40:59.402: [    CSSD]clssgmClientConnectMsg: msg flags 0x0000
2013-09-06 09:40:59.417: [    CSSD]clssscSelect: cookie accept request 1139ea710

该ORA-29701: 无法连接到集群同步服务的原因是进程无法成功注册到CSS 集群同步服务中, 对应ocssd.log中存在 2013-09-06 09:38:19.327  gipcInternalSend: connection not valid for send operation endp 1139caff0 { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_upcmdb03_) 的信息,这说明当时cssd.bin进程无法有效处理进程连接行为。

该症状与文档< I/O Errors in Alert log with ORA-29701, with "gipcWait failed with 16" in trace (Doc ID 1496329.1)>描述的较为一致。
当ocssd.bin进程无法有效处理进程连接行为时,可能出现上述ORA-29701错误,该错误导致进程无法成功提交IO。 Ocssd.bin进程无法有效处理该行为的可能包括 :进程HANG、CPU或内存资源不足等。


通过orac性能监控可以看到在问题时段有较高的CPU使用率:



AWR中显示存在少量OS CPU WAIT TIME:



Statistic        Value        End Value
AVG_BUSY_TIME        59,927         
AVG_IDLE_TIME        299,582         
AVG_IOWAIT_TIME        20,336         
AVG_SYS_TIME        5,995         
AVG_USER_TIME        53,764         
BUSY_TIME        3,847,221         
IDLE_TIME        19,184,896         
IOWAIT_TIME        1,312,936         
SYS_TIME        395,156         
USER_TIME        3,452,065         
LOAD        9        10
OS_CPU_WAIT_TIME        3,736,500         
RSRC_MGR_CPU_WAIT_TIME        0         
VM_IN_BYTES        770,048         
VM_OUT_BYTES        339,587,072         
PHYSICAL_MEMORY_BYTES        206,158,430,208         
NUM_CPUS        64         
NUM_CPU_CORES        32         
NUM_LCPUS        64         
NUM_VCPUS        32         
GLOBAL_RECEIVE_SIZE_MAX        41,943,040         
GLOBAL_SEND_SIZE_MAX        41,943,040         
TCP_RECEIVE_SIZE_DEFAULT        16,384         
TCP_RECEIVE_SIZE_MAX        9,223,372,036,854,775,807         
TCP_RECEIVE_SIZE_MIN        4,096         
TCP_SEND_SIZE_DEFAULT        16,384         
TCP_SEND_SIZE_MAX        9,223,372,036,854,775,807         
TCP_SEND_SIZE_MIN        4,096       

消耗CPU 最高的几个语句如下:

CPU Time (s)        Executions        CPU per Exec (s)        %Total        Elapsed Time (s)        %CPU        %IO        SQL Id        SQL Module        SQL Text
8,624.29        7,673        1.12        26.81        9,227.52        93.46        0.00        bwzctshpb44z9
JDBC Thin Client        SELECT A.BUSIID, A.ICONTENTID,...
6,390.37        21,411,575        0.00        19.86        6,828.10        93.59        0.00        5ntdmkpwybhax
JDBC Thin Client        SELECT ID FROM CTP_BRANCH WHER...
5,150.51        16,480        0.31        16.01        5,559.22        92.65        0.84        4d060subkztnf
JDBC Thin Client        BEGIN p_pcm_wn_united_approve....
3,577.02        1,211        2.95        11.12        5,032.32        71.08        21.05        2yz5fpu6y3gpj
JDBC Thin Client        BEGIN p_pcm_ad_handlist_manage...
2,445.13        2,323        1.05        7.60        2,918.86        83.77        10.13        djt2w8tuxkyuz
JDBC Thin Client        BEGIN p_pcm_wn_united_approve....
1,189.84        1,153        1.03        3.70        1,316.04        90.41        2.70        3u1m3wfdzfas6
JDBC Thin Client        BEGIN p_pcm_bg_bill_apply.r_ap...
1,129.66        288        3.92        3.51        1,638.17        68.96        23.29        2cd6vjbqz0x3k
JDBC Thin Client        SELECT COUNT (1) FROM ( SELECT...
1,095.32        288        3.80        3.40        1,180.85        92.76        0.96        39df691a6bb9j
JDBC Thin Client        SELECT ta500251001, ta50025100...
1,039.03        274        3.79        3.23        1,525.40        68.12        23.08        4nbf95hhdqqjk
JDBC Thin Client        SELECT COUNT (1) FROM ( SELECT...
1,022.15        274        3.73        3.18        1,095.55        93.30        0.71        0hazgftgt2qu4
JDBC Thin Client        SELECT ta500251001, ta50025100...


同时我们观察oraagent.bin进程的priority 优先级发现该进程的优先级不是0,而是60:

    grid S   1901192  11272321  60 /oragrid/product/11.2.0/bin/ocssd.bin
    grid S   1901192  11665611  60 /oragrid/product/11.2.0/bin/ocssd.bin
    grid S   1901192  11731153  60 /oragrid/product/11.2.0/bin/ocssd.bin
    grid S   1901192  16122139   0 /oragrid/product/11.2.0/bin/ocssd.bin
    grid S   1901192  16253213  60 /oragrid/product/11.2.0/bin/ocssd.bin
    grid S   1901192  16449833  60 /oragrid/product/11.2.0/bin/ocssd.bin
    grid S   1901192  16580917  60 /oragrid/product/11.2.0/bin/ocssd.bin
    grid S   1901192  16712107  60 /oragrid/product/11.2.0/bin/ocssd.bin
    grid Z   1901192  16777645  60 /oragrid/product/11.2.0/bin/ocssd.bin
    grid S   1901192  16908755  60 /oragrid/product/11.2.0/bin/ocssd.bin
    grid Z   1901192  17039863  60 /oragrid/product/11.2.0/bin/ocssd.bin
    grid Z   1901192  20054343  60 /oragrid/product/11.2.0/bin/ocssd.bin
    grid S   1901192  13959817  60 /oragrid/product/11.2.0/bin/ocssd.bin
    grid S   1901192  14877215   0 /oragrid/product/11.2.0/bin/ocssd.bin
    grid S   1901192  14942755  60 /oragrid/product/11.2.0/bin/ocssd.bin
    grid S   1901192  16188071  60 /oragrid/product/11.2.0/bin/ocssd.bin
    grid S   1901192  16319141  60 /oragrid/product/11.2.0/bin/ocssd.bin
    grid Z   1901192  16384695  60 /oragrid/product/11.2.0/bin/ocssd.bin
    grid S   1901192  16515781  60 /oragrid/product/11.2.0/bin/ocssd.bin
    grid S   1901192  16581325  60 /oragrid/product/11.2.0/bin/ocssd.bin
    grid S   1901192  13108129  61 /oragrid/product/11.2.0/bin/ocssd.bin
    grid Z   1901192  13370303  60 /oragrid/product/11.2.0/bin/ocssd.bin
    grid S   1901192  13435839  60 /oragrid/product/11.2.0/bin/ocssd.bin
    grid S   1901192  13501385  60 /oragrid/product/11.2.0/bin/ocssd.bin
    grid S   1901192  13566927  60 /oragrid/product/11.2.0/bin/ocssd.bin
    grid Z   1901192  13829109  60 /oragrid/product/11.2.0/bin/ocssd.bin
    grid Z   1901192  15729599  60 /oragrid/product/11.2.0/bin/ocssd.bin
    grid S   1901192  14025881  60 /oragrid/product/11.2.0/bin/ocssd.bin
    grid Z   1901192  14156957  60 /oragrid/product/11.2.0/bin/ocssd.bin
    grid Z   1901192  14288041  60 /oragrid/product/11.2.0/bin/ocssd.bin
    grid S   1901192  14419137  60 /oragrid/product/11.2.0/bin/ocssd.bin
    grid S   1901192  14484675  60 /oragrid/product/11.2.0/bin/ocssd.bin
    grid S   1901192  14550215  60 /oragrid/product/11.2.0/bin/ocssd.bin

该现象与BUG:13940331 - VALUE FOR SETTING THREAD SCHEDULING IS INCORRECT IN SLTSTSPAWN中描述的一致:

Bug 13940331 AIX: OCSSD threads are not set to the correct priority

being affected
11.2.0.2
10.2.0.5
10.2.0.4
Platforms affected IBM AIX Power Based Systems (64bit)
Fixed:
This issue is fixed in 12.1.0.1 (Base Release)
Symptoms: Related To:
Hang (Process Hang)

Cluster Ready Services / Parallel Server Management
Description
On AIX has found that the threads created for OCSSD were not inheriting the correct priority
thie may cause evictions when the machine is running under high workload.
The processes show the functions "ntevque->netevpque->poll" in their call stack and may hang.
Without the patch the priority for OCSSD threads is 60, after the patch we can see it's zero as:
$ ps -mp <OCSSD_PID> -o THREAD
USER PID PPID TID S CP PRI SC WCHAN F TT BND COMMAND
oraclets 14090370 11206740 - A 13 0 30 * 10240103 - - /local/oracle/grid/produ
- - - 27459713 Z 0 0 1 - c00001 - - -
- - - 38141977 S 0 0 1 f1000f0a10024640 8410400 - - -
- - - 38600869 S 0 0 1 - 418400 - - -
- - - 46268633 S 0 0 1 f1000f0a1002c240 8410400 - - -
- - - 46465061 S 0 0 1 f1000f0a1002c540 8410400 - - -
- - - 46792861 S 0 0 1 f1000f0a1002ca40 8410400 - - -
- - - 47710399 S 0 0 1 f1000f0a1002d840 8410400 - - -
- - - 48562263 S 0 0 1 f1000f0a1002e540 8410400 - - -
- - - 49545223 Z 0 0 1 - c00001 - - -
...
Workaround
None

该线程priority的问题仅发生在AIX平台上,目前存在11.2.0.3上的补丁 Patch 13940331: VALUE FOR SETTING THREAD SCHEDULING IS INCORRECT IN SLTSTSPAWN。

1.3        建议

1、        考虑优化相关SQL 语句等,减少其对CPU的使用率
2、        考虑应用11.2.0.3上的补丁Patch 13940331: VALUE FOR SETTING THREAD SCHEDULING IS INCORRECT IN SLTSTSPAWN
页: [1]
查看完整版本: ORA-01114 ORA-29701一例