Oracle数据库数据恢复、性能优化»论坛 › Oracle › Oracle数据库管理 › 奇怪的Log file sync等待事件

ccshu

73 积分	0 好友	0 主题

发消息

奇怪的Log file sync等待事件

1^#

发表于 2012-2-20 13:34:21 | 查看: 18103| 回复: 15

我们近期的一个重要业务数据库经常会发生Log file sync等待事件，每次发生事件大概2-3分钟左右。事件发生时，前端用户会感觉界面卡死，时间过后自动恢复。在系统日志中，发现有lgwr的trace 文件，其中的等待超时事件与log file sync事件时间一致，但原因一直没有找到。系统曾经依照SR工程师的建议，安装了如下3个补丁：11.2.0.2 PSU5, Patch:12378147, Patch: 12412983. 补丁安装后大概系统太平了4天，第五天开始再次出现。大家能不能共同分析一下原因呢？

注：故障是2月9日首次出现，但业务层面确认，当天乃至前一天未有业务变更行为。

环境介绍：

IBM P570
AIX 6.1.0.4
Oracle 11.2.0.2 RAC （问题每次都发生在2号节点）
Veritas CFS

LGWR的报错日志：
*** 2012-02-17 05:24:14.196
*** SESSION ID:(6581.1) 2012-02-17 05:24:14.196
*** CLIENT ID:() 2012-02-17 05:24:14.196
*** SERVICE NAME:(SYS$BACKGROUND) 2012-02-17 05:24:14.196
*** MODULE NAME:() 2012-02-17 05:24:14.196
*** ACTION NAME:() 2012-02-17 05:24:14.196

Warning: log write elapsed time 3183ms, size 0KB

*** 2012-02-17 05:24:17.457
Warning: log write elapsed time 3198ms, size 16KB

*** 2012-02-17 05:24:19.887
Warning: log write elapsed time 2429ms, size 77KB

*** 2012-02-17 20:49:09.302
Warning: log write elapsed time 2838ms, size 42KB

*** 2012-02-18 01:33:12.584
Warning: log write broadcast wait time 116640ms (SCN 0xb86.1c436213)

*** 2012-02-18 23:21:55.892
Warning: log write elapsed time 1787ms, size 2KB

*** 2012-02-19 07:33:39.356
Warning: log write elapsed time 602ms, size 2KB

*** 2012-02-20 12:21:09.456
Warning: log write elapsed time 108690ms, size 8KB

[ 本帖最后由 ccshu 于 2012-2-20 13:40 编辑 ]

分享0

收藏0 回复只看该作者道具举报

Maclean Liu(刘相兵

2^#

发表于 2012-2-20 13:57:57

6.1 上的这个log file sync 似乎是一直有的

AIX: Things To Check When Seeing Long Log File Sync Time in 11.2. [ID 1318709.1]
Applies to:
Oracle Server - Enterprise Edition - Version: 11.2.0.1 to 11.2.0.3 - Release: 11.2 to 11.2
IBM AIX on POWER Systems (64-bit)
Symptoms
Long log file sync times on AIX 6.1 (+ later) and Oracle 11.2
Top 5 Timed Foreground Events
Event Waits Time(s) Avg wait (ms) % DB time Wait Class
db file sequential read 6,456,003 43,555 7 42.00 User I/O
log file sync 5,192 21,585 4157 20.81 Commit
read by other session 1,436,853 13,718 10 13.23 User I/O
log buffer space 4,749 8,790 1851 8.48 Configuration
DB CPU 5,518 5.32
Performing a TRUSS of LGWR e.g. truss -dD -o lgwrtruss.log -p <ospid>
you can see that the time reported next to lseek may be very high, and warning messages are being written to the LGWR trace. This is the elapsed time since the previous call aio_nwait64.
0.0009: aio_nwait64(0x0000000000001000, 0x0000000000000012, 0x0FFFFFFFFFFEC0A0, 0x800000000000D032, 0x3FFC000000000003, 0x000000000000011C, 0x0000000000000000, 0x0000000000000000) = 0x0000000000000012
2.4272: lseek(20, 0, 1) = 1689988 <====== Time between lseek and previous call is high
kwrite(20, "\n * * * 2 0 1 1 - 0 3".., 29) = 29
kwrite(21, " 3 ? + O M 9 ~ r T\n", 10) = 10
0.0002: lseek(20, 0, 1) = 1690017
kwrite(20, " W a r n i n g : l o g".., 52) = 52 <====== Warning message written to LGWR trace
kwrite(21, " 3 ? V S 2 ~ 0 q\n", 9) = 9
kwrite(20, "\n", 1) = 1
0.0002: thread_post(12321045) = 0
The above shows LGWR taking a long time and then writing the below message to the LGWR trace file.
Warning: log write elapsed time XXXXms, size XKB
If we look just at the times since the previous call, by looking at lseek times , they jump when load becomes slightly higher. Write times were good though.
0.0003: lseek(20, 0, 1) = 1686342
0.0002: lseek(20, 0, 1) = 1686564
0.0002: lseek(20, 0, 1) = 1686757
0.0002: lseek(22, 0, 0) = 0
0.0002: lseek(482, 0, 0) = 0
0.0002: lseek(483, 0, 0) = 0
0.0026: lseek(20, 0, 1) = 1687283
2.1588: lseek(20, 0, 1) = 1687285 <== elapsed time since previous call (aio_nwait64) jumps, which we see
2.1983: lseek(20, 0, 1) = 1687367 represented next to the lseek call.
2.1029: lseek(20, 0, 1) = 1687449
2.1846: lseek(20, 0, 1) = 1687531
Changes
Database Upgraded to 11.2 or a AIX version, TL or SP is applied
Cause
The operating system, or file system is not correctly configured.
Solution
It is important to ensure that all AIX system parameters are correctly configured and that the file system block size aligns with the block size of the online redo logs.
There have been multiple occasions where an upgrade of Oracle to 11g, or the application of a later Technology Level (TL) and Service Pack (SP) has seem a sudden increase in log file sync wait times with the symptom being high delta between aio_nwait64 and lseek calls seen during truss and high log file sync times.
Even though the only change may be an Oracle upgrade or a OS patch, multiple issues with the above symptoms have been addressed by either correctly configuring the OS or filesystem containing online redo.
Some previous changes that have resolved long LFS issue in Oracle 11.2 on AIX include :-
- Disabling Disk I/O Pacing for the filesystem with Redo Logs. (Enabled by default in AIX 6.1)
- Correctly aligning file system and redo log block size to avoid demoted I/O's
- Correctly setting vmo and network parameters
Please review the following IBM whitepaper "Oracle Architecture and Tuning on AIX" and ensure that minimum recommended settings are met.
Disabling or delaying the warning message from being written to the LGWR trace file also may help improve LFS times. In 10.2 set event 10468 level 4. In 11.2, you can vary the message warning threshold using
"_long_log_write_warning_threshold" which sets the threshold for long log write warning messages in ms.
If the everything is correctly configured then applying the fix for Bug 12412983 - AIX: "asynch descriptor resize" may also help. This contains both the fix of bug 12412983 and bug 9829397.
There is also a merge patch 12986882: MERGE ON TOP OF 11.2.0.2.0 FOR BUGS 9829397 12412983.
Note:
The following recommendation may help in alleviating long 'log file sync' waits
However we highly advise you to contact IBM to verify any changes to configuration on the OS level
Please also ensure all recommended OS patches have been installed

复制代码

		自动登录	找回密码
密码			注册