Oracle数据库数据恢复、性能优化»论坛 › Oracle › Oracle数据库管理 › alert日志：ERROR: direct connection failure with ASM ...

0 积分	1 好友	5 主题

发消息

[RAC性能调优] alert日志：ERROR: direct connection failure with ASM

1^#

发表于 2013-10-10 11:19:58 | 查看: 15568| 回复: 10

最近几天很不顺啊，8号中午，2个节点都宕机啦。服务器黑屏，没有自动重启。9号凌晨00：37，节点1又宕机啦。10：32左右重启的系统。现在发现alert里面大量的报连接ASM的错误。

请教刘老大，这个异常有没有什么影响？是什么原因引起的？
目前该节点运行正常，但是RMAN备份已经10个小时了一直没完成，平时3-5小时。9号凌晨由于宕机了没备份成功，这两天rebuild过索引，归档日志比平时大很多。

环境：oracle 11.2.0.3 RAC，redhat 5.5 64bit

节点1 alert：节点1从昨天11：30左右重启后，报了很多连接ASM失败的错误

Wed Oct 09 10:37:21 2013
LMS 2: 1 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
Wed Oct 09 10:38:12 2013
WARNING: ASM communication error: op 0 state 0x0 (15055)
ERROR: direct connection failure with ASM
Wed Oct 09 10:38:16 2013
WARNING: ASM communication error: op 0 state 0x0 (15055)
ERROR: direct connection failure with ASM
。。。。。
Thu Oct 10 04:43:46 2013
WARNING: ASM communication error: op 0 state 0x0 (15055)
ERROR: direct connection failure with ASM
Thu Oct 10 04:43:46 2013
WARNING: ASM communication error: op 0 state 0x0 (15055)
ERROR: direct connection failure with ASM

复制代码

ASM alert没有相关错误信息.

SQL> select group_number,name,state from v$asm_diskgroup;
GROUP_NUMBER NAME STATE
------------ ------------------------------ -----------
1 CRS MOUNTED
2 DATA CONNECTED
3 FRA CONNECTED

复制代码

分享0

收藏0 回复只看该作者道具举报

huqianhao

2^#

发表于 2013-10-10 11:23:35

看了MOS上的一篇文章：
MOS：
APPLIES TO:
Oracle Database - Enterprise Edition - Version 11.2.0.2 and later
Information in this document applies to any platform.
SYMPTOMS
Reported on RDBMS alert.log:
WARNING: ASM communication error: op 0 state 0x0 (15055)
ERROR: direct connection failure with ASM
ASM side (and RDBMS may too) reports ORA-1031
CAUSE
Different cases for this 15055 issue reported on bug 13495419.
The diagnostic information (Could not find username oracle in group oinstall) indicate that ASM returned ORA-1031 back to RDBMS and eventually the query resulted in ora-1031/ORA-15055.
RDBMS will always make a connection to the ASM instance as the Oracle binary owner (or the effective user id due to setuid bit) of the RDBMS home binary.

In this case, at RDBMS home side, OS user is "patrol" and connected to database as DB user "patrol_dba1". When v$asm_diskgroup is queried, RDBMS would connect as Oracle binary owner (which is "oracle"). Now, at ASM side the authentication process would have the group id of OS user "patrol" (i.e 1284) and user id as "oracle". As the connection is made as SYSDBA, it would use the group name which is assigned for SYSDBA. In this case it is "oinstall". Group id of "onistall" group is 2001. As 2001 != 1284, it checks if "oracle" is also a part of "oinstall" group in "/etc/group" (secondary group). It finds that "oracle" does not belong to "onistall" group, it returns an error.
If oinstall group is not set to patrol user, on database instance, queries to the v$asm* views will return incorrect data (i.e. no rows), when the same query on the ASM instance will return the correct information.

Even in case of a "direct" connection (sqlplus / as sysdba) when OS user is "patrol", you can see that the "real" group id of "patrol" (1284) does not match with the group id of the group "oinstall" (2001) designated for SYSDBA. It will search for presence of "patrol" user in "/etc/group". For "oracle" its not a problem as "real" group id matches the group id of "oinstall" (2001) [primary group of oracle].
SOLUTION
To make this work there are 2 options:
1. make primary group of OS user "patrol" (or the non-install user) as "oinstall"
2. add "oracle" to "oinstall" group in "/etc/group" of ASM home.

This is not a bug but expected behavior.
REFERENCES
BUG:13495419 - WARNING: ASM COMMUNICATION ERROR: (15055) ERROR: DIRECT CONNECTION

感觉跟他说的不一样。我的oracle用户是在oinstall组里

回复只看该作者道具举报

Maclean Liu(刘相兵

3^#

发表于 2013-10-10 11:29:41

ASM 的diag目录打包上传下

回复只看该作者道具举报

huqianhao

4^#

发表于 2013-10-10 11:39:07

Maclean Liu(刘相兵发表于 2013-10-10 11:29
ASM 的diag目录打包上传下

老大，要/u01/app/grid_base/diag/asm/+asm/+ASM1 这个目录下所有的日志吗？有250M哦。压缩后有24M。

回复只看该作者道具举报

huqianhao

5^#

发表于 2013-10-10 11:50:15

第一天宕机的时候产生了一个170M的+ASM1_rbal_4742.trc 文件。太大了，上传不了。我把它排除了。有需要我再给你。

lmsdb1_grid_diag.zip

7.46 MB, 下载次数: 706

回复只看该作者道具举报

Maclean Liu(刘相兵

6^#

发表于 2013-10-10 11:50:18

7z 或者删掉一些非后台进程的trace

回复只看该作者道具举报

huqianhao

7^#

发表于 2013-10-10 11:53:17

huqianhao 发表于 2013-10-10 11:50
第一天宕机的时候产生了一个170M的+ASM1_rbal_4742.trc 文件。太大了，上传不了。我把它排除了。有需要我再 ...

+ASM1_rbal_4742.trc 文件里面多数是下面的这个错误。

*** 2013-10-08 12:07:15.344
2013-10-08 12:07:15.344: [ default]failed to initialize skgp context
2013-10-08 12:07:15.344: [ default]slos op : sslssreghdlr
2013-10-08 12:07:15.344: [ default]slos dep : Error 0 (0)
2013-10-08 12:07:15.344: [ default]slos loc : sskgpinit1
2013-10-08 12:07:15.344: [ default]slos info:
[ CLWAL]clsw_Initialize: OLR initlevel [30000]
2013-10-08 12:07:15.345: [ default]a_init: Unable to get log name. Retval:[-4]
2013-10-08 12:07:15.368: [ GPNP]clsgpnp_dbmsGetItem_profile: [at clsgpnp_dbms.c:313] Result: (0) CLSGPNP_OK. got ASM-Profile.DiscoveryString=''

复制代码

回复只看该作者道具举报

Maclean Liu(刘相兵

8^#

发表于 2013-10-10 13:14:34

10-8:

Tue Oct 08 04:04:46 2013
Time drift detected. Please check VKTM trace file for more details.
Tue Oct 08 13:04:17 2013
* instance_number obtained from CSS = 1, checking for the existence of node 0...
* node 0 does not exist. instance_number = 1
Starting ORACLE instance (normal)

没有明显的shutdown 过程

trace:

2013-10-08 12:07:15.368: [ GPNP]clsgpnp_dbmsGetItem_profile: [at clsgpnp_dbms.c:313] Result: (0) CLSGPNP_OK. got ASM-Profile.DiscoveryString=''

10-9

Tue Oct 08 22:56:00 2013
Errors in file /u01/app/grid_base/diag/asm/+asm/+ASM1/trace/+ASM1_ora_23277.trc  (incident=259474):
ORA-03137: TTC protocol internal error : [12333] [0] [0] [125] [] [] [] []
Incident details in: /u01/app/grid_base/diag/asm/+asm/+ASM1/incident/incdir_259474/+ASM1_ora_23277_i259474.trc
Tue Oct 08 22:56:01 2013
Dumping diagnostic data in directory=[cdmp_20131008225601], requested by (instance=1, osid=23277), summary=[incident=259474].
Tue Oct 08 22:56:03 2013
Sweep [inc][259474]: completed
Sweep [inc2][259474]: completed
Tue Oct 08 23:28:36 2013
Errors in file /u01/app/grid_base/diag/asm/+asm/+ASM1/trace/+ASM1_ora_24779.trc  (incident=259482):
ORA-03137: TTC protocol internal error : [12333] [0] [28] [47] [] [] [] []
Incident details in: /u01/app/grid_base/diag/asm/+asm/+ASM1/incident/incdir_259482/+ASM1_ora_24779_i259482.trc
Tue Oct 08 23:28:37 2013
Dumping diagnostic data in directory=[cdmp_20131008232837], requested by (instance=1, osid=24779), summary=[incident=259482].
Tue Oct 08 23:28:40 2013
Sweep [inc][259482]: completed
Sweep [inc2][259482]: completed
Wed Oct 09 10:31:33 2013
* instance_number obtained from CSS = 1, checking for the existence of node 0...
* node 0 does not exist. instance_number = 1
Starting ORACLE instance (normal)

trace:

*** 2013-10-09 00:08:20.701
2013-10-09 00:08:20.701: [ default]slos op  :  sslssreghdlr
2013-10-09 00:08:20.701: [ default]slos dep :  Error 0 (0)
2013-10-09 00:08:20.701: [ default]slos loc :  sskgpinit1
2013-10-09 00:08:20.701: [ default]slos info:
[ CLWAL]clsw_Initialize: OLR initlevel [30000]
2013-10-09 00:08:20.703: [ default]a_init: Unable to get log name. Retval:[-4]
2013-10-09 00:08:20.756: [ GPNP]clsgpnp_dbmsGetItem_profile: [at clsgpnp_dbms.c:313] Result: (0) CLSGPNP_OK. got ASM-Profile.DiscoveryString=''
2013-10-09 00:23:21.107: [ default]failed to initialize skgp context

*** 2013-10-09 00:23:21.107
2013-10-09 00:23:21.107: [ default]slos op  :  sslssreghdlr
2013-10-09 00:23:21.107: [ default]slos dep :  Error 0 (0)
2013-10-09 00:23:21.107: [ default]slos loc :  sskgpinit1
2013-10-09 00:23:21.107: [ default]slos info:
[ CLWAL]clsw_Initialize: OLR initlevel [30000]
2013-10-09 00:23:21.109: [ default]a_init: Unable to get log name. Retval:[-4]
2013-10-09 00:23:21.160: [ GPNP]clsgpnp_dbmsGetItem_profile: [at clsgpnp_dbms.c:313] Result: (0) CLSGPNP_OK. got ASM-Profile.DiscoveryString=''
2013-10-09 00:38:20.986: [ default]failed to initialize skgp context

*** 2013-10-09 00:38:20.986
2013-10-09 00:38:20.986: [ default]slos op  :  sslssreghdlr
2013-10-09 00:38:20.986: [ default]slos dep :  Error 0 (0)
2013-10-09 00:38:20.986: [ default]slos loc :  sskgpinit1
2013-10-09 00:38:20.986: [ default]slos info:
[ CLWAL]clsw_Initialize: OLR initlevel [30000]
2013-10-09 00:38:20.988: [ default]a_init: Unable to get log name. Retval:[-4]
2013-10-09 00:38:21.039: [ GPNP]clsgpnp_dbmsGetItem_profile: [at clsgpnp_dbms.c:313] Result: (0) CLSGPNP_OK. got ASM-Profile.DiscoveryString=''

回复只看该作者道具举报

Maclean Liu(刘相兵

9^#

发表于 2013-10-10 13:22:20

Trace file /u01/app/grid_base/diag/asm/+asm/+ASM1/trace/+ASM1_ora_24779.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options
ORACLE_HOME = /u01/app/11.2.0/grid
System name: Linux
Node name: lmsdb1
Release: 2.6.18-194.el5
Version: #1 SMP Tue Mar 16 21:52:39 EDT 2010
Machine: x86_64
Instance name: +ASM1
Redo thread mounted by this instance: 0 <none>
Oracle process number: 28
Unix process pid: 24779, image: oracle@lmsdb1

*** 2013-10-08 23:28:36.869
*** SESSION ID:(1737.61) 2013-10-08 23:28:36.869
*** CLIENT ID:() 2013-10-08 23:28:36.869
*** SERVICE NAME:() 2013-10-08 23:28:36.869
*** MODULE NAME:(extract@fsharedb1 (TNS V1-V3)) 2013-10-08 23:28:36.869
*** ACTION NAME:() 2013-10-08 23:28:36.869

--- PROTOCOL VIOLATION DETECTED ---
----- Dump Cursor sql_id=1j1qymw57vjd0 xsc=0x2b720ded89e8 cur=0x2b720dec1b08 -----

LibraryHandle: Address=0x11ea1ef30 Hash=a7dc5a0 LockMode=N PinMode=0 LoadLockMode=0 Status=VALD
ObjectName: Name=BEGIN dbms_diskgroup.read(:handle, :offset, :length, :buffer); END;

extract@fsharedb1 (TNS V1-V3)

这个 dbms_diskgroup.read 是谁在调用？

回复只看该作者道具举报

huqianhao

10^#

发表于 2013-10-10 13:59:57

Maclean Liu(刘相兵发表于 2013-10-10 13:22
Trace file /u01/app/grid_base/diag/asm/+asm/+ASM1/trace/+ASM1_ora_24779.trc
Oracle Database 11g Ente ...

fsharedb1 是另外一个数据库，配了ogg。从这边抽取数据。

回复只看该作者道具举报

huqianhao

11^#

发表于 2013-10-10 14:22:11

知道原因啦。
这个报错的原因是使用了非oracle用户查询v$asm*相关视图，我们使用了zabbix监控数据库，那天服务器重启后zabbix agent是用zabbix用户自动启动的。
用oracle用户启动就没事啦。

感谢刘大！

回复只看该作者道具举报

返回列表

		自动登录	找回密码
密码			注册