LGWR进程的trace里总是报20多秒的警告信息

wushi8 发表于 2013-11-15 14:26:52

-bash-3.2$ cat /etc/release
               Oracle Solaris 10 8/11 s10s_u10wos_17b SPARC
  Copyright (c) 1983, 2011, Oracle and/or its affiliates. All rights reserved.
                        Assembled 23 August 2011

SQL> select * from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
PL/SQL Release 11.2.0.2.0 - Production
CORE 11.2.0.2.0    Production
TNS for Solaris: Version 11.2.0.2.0 - Production
NLSRTL Version 11.2.0.2.0 - Production

RAC环境，两台节点，系统已打补丁到11.2.0.2.10
SQL> select comments from dba_registry_history;

COMMENTS
------------------------------
Upgraded from 9.2.0.6.0
view invalidation
Upgraded from 10.2.0.4.0
Patchset 11.2.0.2.0
PSU 11.2.0.2.10

现在的问题是其中一台节点在commit的时候有时特别慢，另一台节点正常。
慢的节点AWR中log file sync事件在top5中，后来发现LGWR进程的trace文件中总是报20多秒的警告信息，跟业务人员一起观察发现，只要commit慢的时候，trace文件里就会出现这么长时间的警告信息。

初步判断为存储的问题，存储型号为sun 7320，存储中未发现异常日志，同时另一台节点是正常的，所以这个问题一直很困惑。
后来把redo都放在另一个新存储的卷里，还是有这个问题，新存储也是在7320这个机头下，只是不同的柜子。
我现在猜测的原因：
1.BUG，查看metalink里还未打的更新的PSU补丁里未提及相关BUG。
2.存储问题，虽然redo换存储了，但是还是在同一个机头下，机头BUG，或者光纤卡BUG。

希望各位高手能帮助解决此问题！还需要什么信息我再发上来。

wushi8 发表于 2013-11-15 14:32:31

本帖最后由 wushi8 于 2013-11-15 14:36 编辑

asm磁盘信息：
SQL> select group_number,os_mb,total_mb,free_mb from v$asm_disk;

GROUP_NUMBER    OS_MB TOTAL_MB FREE_MB
------------ ---------- ---------- ----------
         2 1484467 1484467    667684
         3 1638002 1638002    620978
         5    573052    573052    251989
         6    716417    716417    314081
         1    573052    573052    523928
         4    1855    1855    1546
         4    1855    1855    1546
         4    1855    1855    1547
         7    614004    614004    613403

asm磁盘组信息
SQL> select group_number,name,state,type,total_mb,free_mb,cold_used_mb from v$asm_diskgroup;

GROUP_NUMBER NAME                         STATE    TYPE    TOTAL_MB FREE_MB COLD_USED_MB
------------ ------------------------------ ----------- ------ ---------- ---------- ------------
         1 ARCH                         CONNECTED EXTERN    573052    523492       49560
         2 DATA                         CONNECTED EXTERN 1484467    667684    816783
         3 IDX                         CONNECTED EXTERN 1638002    620978    1017024
         4 OCR                         MOUNTED    NORMAL    5565    4639       926
         5 SYS                         CONNECTED EXTERN    573052    251989    321063
         6 UNDO                         CONNECTED EXTERN    716417    314081    402336
         7 REDO                         CONNECTED EXTERN    614004    613403       601

warmbreeze 发表于 2013-11-15 14:43:45

1.log file parallel write的时间如何
2.lgwr进程的优先级是否正常
3._use_adaptive_log_file_sync是否TRUE(如果true, lgwr的trace里是否有类似“Log file sync switching to polling”
信息)

wushi8 发表于 2013-11-15 15:39:50

补充AWR

Maclean Liu(刘相兵 发表于 2013-11-16 19:29:44

SGA+PGA 用到了物理内存的 80%
Begin End
Host Mem (MB):    262,144.0    262,144.0
SGA use (MB):    204,800.0    204,800.0
PGA use (MB):    3,377.0    5,685.4
% Host Mem used for SGA+PGA:    79.41    80.29

Event Waits %Time -outs Total Wait Time (s) Avg wait (ms) Waits /txn % bg time
db file parallel write 538,429 0 5,612 10 0.25 11.90
gcs log flush sync 583,193 63 5,158 9 0.27 10.94

gcs log flush sync 超时比例达到63%

Global Cache and Enqueue Services - Workload Characteristics

Avg global enqueue get time (ms):    0.0
Avg global cache cr block receive time (ms):    0.8
Avg global cache current block receive time (ms):    0.7
Avg global cache cr block build time (ms):    0.0
Avg global cache cr block send time (ms):    0.0
Global cache log flushes for cr blocks served %:    0.0
Avg global cache cr block flush time (ms):
Avg global cache current block pin time (ms):    545.7
Avg global cache current block send time (ms):    0.0
Global cache log flushes for current blocks served %:    0.3
Avg global cache current block flush time (ms):    4,922.9

Global Cache and Enqueue Services - Messaging Statistics

Avg message sent queue time (ms):    0.1
Avg message sent queue time on ksxp (ms):    3.3
Avg message received queue time (ms):    0.1
Avg GCS message process time (ms):    0.1
Avg GES message process time (ms):    0.0
% of direct sent messages:    55.58
% of indirect sent messages:    43.44
% of flow controlled messages:    0.98

Avg global cache current block flush time (ms):    4,922.9
Avg message sent queue time on ksxp (ms):    3.3

心跳网络的状况不好，

IOStat by Function summary
'Data' columns suffixed with M,G,T,P are in multiples of 1024 other columns suffixed with K,M,G,T,P are in multiples of 1000
ordered by (Data Read + Write) desc
Function Name Reads: Data Reqs per sec Data per sec Writes: Data Reqs per sec Data per sec Waits: Count Avg Tm(ms)
Others 34.7G 30.87 2.46551 30.7G 6.59 2.18212 434.3K 1.79
DBWR 0M 0.00 0M 50.9G 179.60 3.61591 0
Buffer Cache Reads 43.6G 348.35 3.09509 0M 0.00 0M 4998.8K 14.86
LGWR 8M 0.09 .000554 30.4G 76.55 2.15550 947.7K 6.88
Direct Writes 3M 0.00 .000207 1.6G 0.67 .116461 0
Direct Reads 410M 0.71 .028422 109M 0.09 .007556 0
TOTAL: 78.7G 380.02 5.58979 113.8G 263.51 8.07755 6380.8K 12.79

考虑：

1、关闭一个节点并观察现象
2、使用IO calibrate测一下IO

页: [1]

Oracle数据库数据恢复、性能优化's Archiver

LGWR进程的trace里总是报20多秒的警告信息