Oracle数据库数据恢复、性能优化

找回密码
注册
搜索
热搜: 活动 交友 discuz
发新帖

73

积分

0

好友

0

主题
1#
发表于 2012-2-20 13:34:21 | 查看: 18203| 回复: 15
我们近期的一个重要业务数据库经常会发生Log file sync等待事件,每次发生事件大概2-3分钟左右。事件发生时,前端用户会感觉界面卡死,时间过后自动恢复。在系统日志中,发现有lgwr的trace 文件,其中的等待超时事件与log file sync事件时间一致,但原因一直没有找到。系统曾经依照SR工程师的建议,安装了如下3个补丁:11.2.0.2 PSU5, Patch:12378147, Patch: 12412983.  补丁安装后大概系统太平了4天,第五天开始再次出现。大家能不能共同分析一下原因呢?

注:故障是2月9日首次出现,但业务层面确认,当天乃至前一天未有业务变更行为。

环境介绍:

IBM P570
AIX 6.1.0.4
Oracle 11.2.0.2 RAC (问题每次都发生在2号节点)
Veritas CFS


LGWR的报错日志:
*** 2012-02-17 05:24:14.196
*** SESSION ID:(6581.1) 2012-02-17 05:24:14.196
*** CLIENT ID:() 2012-02-17 05:24:14.196
*** SERVICE NAME:(SYS$BACKGROUND) 2012-02-17 05:24:14.196
*** MODULE NAME:() 2012-02-17 05:24:14.196
*** ACTION NAME:() 2012-02-17 05:24:14.196

Warning: log write elapsed time 3183ms, size 0KB

*** 2012-02-17 05:24:17.457
Warning: log write elapsed time 3198ms, size 16KB

*** 2012-02-17 05:24:19.887
Warning: log write elapsed time 2429ms, size 77KB

*** 2012-02-17 20:49:09.302
Warning: log write elapsed time 2838ms, size 42KB

*** 2012-02-18 01:33:12.584
Warning: log write broadcast wait time 116640ms (SCN 0xb86.1c436213)

*** 2012-02-18 23:21:55.892
Warning: log write elapsed time 1787ms, size 2KB

*** 2012-02-19 07:33:39.356
Warning: log write elapsed time 602ms, size 2KB

*** 2012-02-20 12:21:09.456
Warning: log write elapsed time 108690ms, size 8KB

[ 本帖最后由 ccshu 于 2012-2-20 13:40 编辑 ]
2#
发表于 2012-2-20 13:57:57
6.1 上的这个log file sync 似乎是一直有的
  1. AIX: Things To Check When Seeing Long Log File Sync Time in 11.2. [ID 1318709.1]

  2. Applies to:
  3. Oracle Server - Enterprise Edition - Version: 11.2.0.1 to 11.2.0.3 - Release: 11.2 to 11.2
  4. IBM AIX on POWER Systems (64-bit)
  5. Symptoms
  6.   Long log file sync times on AIX 6.1 (+ later) and Oracle 11.2


  7. Top 5 Timed Foreground Events

  8. Event                                       Waits       Time(s)   Avg wait (ms)   % DB time   Wait Class
  9. db file sequential read      6,456,003        43,555                       7               42.00   User I/O
  10. log file sync                             5,192        21,585                4157                20.81  Commit
  11. read by other session        1,436,853        13,718                    10                13.23  User I/O
  12. log buffer space                       4,749          8,790                1851                  8.48  Configuration
  13. DB CPU                                                     5,518                                          5.32


  14. Performing  a TRUSS of LGWR  e.g. truss -dD -o lgwrtruss.log -p <ospid>
  15. you can see that the time reported next to lseek may be very high, and warning messages are being written to the LGWR trace.  This is the elapsed  time since the previous call aio_nwait64.


  16. 0.0009: aio_nwait64(0x0000000000001000, 0x0000000000000012, 0x0FFFFFFFFFFEC0A0, 0x800000000000D032, 0x3FFC000000000003, 0x000000000000011C, 0x0000000000000000, 0x0000000000000000) = 0x0000000000000012
  17. 2.4272: lseek(20, 0, 1) = 1689988                    <====== Time between lseek and previous call is high
  18. kwrite(20, "\n * * * 2 0 1 1 - 0 3".., 29) = 29
  19. kwrite(21, " 3 ? + O M 9 ~ r T\n", 10) = 10
  20. 0.0002: lseek(20, 0, 1) = 1690017                    
  21. kwrite(20, " W a r n i n g : l o g".., 52) = 52    <====== Warning message written to LGWR trace
  22. kwrite(21, " 3 ? V S 2 ~ 0 q\n", 9) = 9
  23. kwrite(20, "\n", 1) = 1
  24. 0.0002: thread_post(12321045) = 0

  25. The above shows LGWR taking a long time and then writing the below message to the LGWR trace file.
  26. Warning: log write elapsed time XXXXms, size XKB

  27. If we look just at the times since the previous call, by looking at lseek times , they jump when load becomes slightly higher.  Write times were good though.


  28. 0.0003: lseek(20, 0, 1) = 1686342
  29. 0.0002: lseek(20, 0, 1) = 1686564
  30. 0.0002: lseek(20, 0, 1) = 1686757
  31. 0.0002: lseek(22, 0, 0) = 0
  32. 0.0002: lseek(482, 0, 0) = 0
  33. 0.0002: lseek(483, 0, 0) = 0
  34. 0.0026: lseek(20, 0, 1) = 1687283
  35. 2.1588: lseek(20, 0, 1) = 1687285  <== elapsed time since previous call (aio_nwait64) jumps, which we see
  36. 2.1983: lseek(20, 0, 1) = 1687367          represented next to the lseek call.
  37. 2.1029: lseek(20, 0, 1) = 1687449
  38. 2.1846: lseek(20, 0, 1) = 1687531

  39. Changes
  40. Database Upgraded to 11.2 or a AIX version, TL or SP is applied
  41. Cause
  42. The operating system, or file system is not correctly configured.
  43. Solution
  44. It is important to ensure that all AIX system parameters are correctly configured and that the file system block size aligns with the block size of the online redo logs.

  45. There have been multiple occasions where an upgrade of Oracle to 11g, or the application of a later Technology Level (TL) and Service Pack (SP) has seem a sudden increase in log file sync wait times with the symptom being high delta between aio_nwait64 and lseek calls seen during truss and high log file sync times.

  46. Even though the only change may be an Oracle upgrade or a OS patch, multiple issues with the above symptoms have been addressed by either correctly configuring the OS or filesystem containing online redo.

  47. Some previous changes that have resolved long LFS issue in Oracle 11.2  on AIX include :-
  48. - Disabling Disk I/O Pacing for the filesystem with Redo Logs. (Enabled by default in AIX 6.1)
  49. - Correctly aligning file system and redo log block size to avoid demoted I/O's
  50. - Correctly setting vmo and network parameters

  51. Please review the following IBM whitepaper "Oracle Architecture and Tuning on AIX" and ensure that minimum recommended settings are met.

  52. Disabling or delaying the warning message from being written to the LGWR trace file also may help improve LFS times.  In 10.2 set event 10468 level 4.  In 11.2, you can vary the message warning threshold using
  53. "_long_log_write_warning_threshold"  which sets the threshold for long log write warning messages in ms.

  54. If the everything is correctly configured then applying the fix for Bug 12412983 - AIX: "asynch descriptor resize" may also help.  This contains both the fix of bug 12412983 and bug 9829397.

  55. There is also a merge patch 12986882: MERGE  ON TOP OF 11.2.0.2.0 FOR BUGS 9829397 12412983.

  56. Note:
  57. The following recommendation may help in alleviating long 'log file sync' waits
  58. However we highly advise you to contact IBM to verify any changes to configuration on the OS level
  59. Please also ensure all recommended OS patches have been installed
复制代码

回复 只看该作者 道具 举报

3#
发表于 2012-2-20 14:00:16
action plan:

观察log file parallel write 等待事件 , 确认存储性能正常。

必要时 strace lgwr 进程

回复 只看该作者 道具 举报

4#
发表于 2012-2-20 14:01:25
是的,最早SR推荐的也是这篇文档,我们已经依照建议关闭了Disabling Disk I/O Pacing,也打了补丁Patch 12412983,但好像无效。

回复 只看该作者 道具 举报

5#
发表于 2012-2-20 14:27:12
AWR 2012-2-20 12:00 - 13:00

awr_report_9703_9704.html

1.5 MB, 下载次数: 1769

回复 只看该作者 道具 举报

6#
发表于 2012-2-20 16:08:22
Avg global cache cr block flush time (ms): 245 ms  查一下你的interconnect是否有问题

回复 只看该作者 道具 举报

7#
发表于 2012-2-20 16:58:17
Top 5 event
Event        Waits        Time(s)        Avg wait (ms)        % DB time        Wait Class
log file sync        1,683,281        35,350        21        38.07        Commit
db file sequential read        3,560,364        17,886        5        19.26        User I/O
DB CPU                 16,717                 18.00         
gc buffer busy release        34,071        8,883        261        9.57        Cluster
gc buffer busy acquire        56,461        7,458        132        8.03        Cluster

log file sync   avg  21ms   log file parallel write   avg 1ms


user commits        1,681,464        469.25        1.00       每秒 469次 commit 造成了 log file sync


log file sync  会加剧 gc buffer busy release 和 gc buffer busy acquire等待事件 ,  current block 在remote instance 要求 log flush 后才能传送给 要求者。

回复 只看该作者 道具 举报

8#
发表于 2012-2-20 18:10:05
原帖由 maclean 于 2012-2-20 16:58 发表
Top 5 event
Event        Waits        Time(s)        Avg wait (ms)        % DB time        Wait Class
log file sync        1,683,281        35,350        21        38.07        Commit
db file sequential read        3,560,364        17,886        5        19.26        User I/O
DB CPU                 16,717                 18.00         
g ...


谢谢的刘大的分析,但是我所知道。在RAC模式下,所有的commit操作都必须就当前的SCN号码向全部节点发出广播包,等待对方节点确认后才写入磁盘。

那有没有可能反过来影响,由于 gc buffer busy acquire,进而引发log file sync呢?当然,这个只是我个人的一个猜想,请刘大指正。

附件是所有事件期间的ASH和AWR报告。

AWR-ASH.zip

357.37 KB, 下载次数: 1348

回复 只看该作者 道具 举报

9#
发表于 2012-2-20 18:23:37
Oracle的SR工程师真会顺杆爬,我刚刚和他提出这个可能性,马上丢给我一个Bug号码。

please check
Bug 11930616 - sporadic buffer busy waits (Doc ID 11930616.8)

Workaround
_buffer_busy_wait_timeout=2 (= 20 ms)

do you get the "gc buffer busy acquire" event at the same time you have "log write elapsed time" ?

回复 只看该作者 道具 举报

10#
发表于 2012-2-20 21:29:56
每秒 469次的COMMIT是比较大,但是在已经优化的高端存储的情况下,应该还不至于引起大量LOG FILE SYNC事件。我的系统每秒738次commit,没有明显的log file sync。
GC CURRENT|CR BLOCK 2-WAY会加剧LOG FILE SYNC

回复 只看该作者 道具 举报

11#
发表于 2012-2-21 21:22:23
[local]1[/local]
Troubleshooting: 'Log File Sync' Waits [ID 1376916.1]

回复 只看该作者 道具 举报

12#
发表于 2012-2-21 21:23:41
上面的没有附件,不好意思

Troubleshooting- 'Log File Sync' Waits [ID 1376916.1].zip

59.78 KB, 下载次数: 1358

回复 只看该作者 道具 举报

13#
发表于 2012-2-21 21:48:11
这篇文章是最近看的主要文档之一,已经基本上排除了IO问题,问题似乎是发生在commit太多上,但是很奇怪的是,每次log file sync故障发生时,系统都不是特别的忙,所以也讲不通。

回复 只看该作者 道具 举报

14#
发表于 2012-2-22 09:57:21
日志切换比较快,平均3分钟就切换了一次

3.jpg (22.67 KB, 下载次数: 526)

3.jpg

回复 只看该作者 道具 举报

15#
发表于 2012-2-22 12:26:54
原帖由 teapot 于 2012-2-22 09:57 发表
日志切换比较快,平均3分钟就切换了一次


是的,曾经一度怀疑是这个原因,但检查了switch的时间,发现和log file sync发生的时间对不上号。

回复 只看该作者 道具 举报

16#
发表于 2012-2-23 11:24:54
select thread#, to_char(first_time,'yyyymmdd'),count(*) from v$log_history where first_time>sysdate-30 group by thread#, to_char(first_time,'yyyymmdd')
order by thread#;

回复 只看该作者 道具 举报

您需要登录后才可以回帖 登录 | 注册

QQ|手机版|Archiver|Oracle数据库数据恢复、性能优化

GMT+8, 2025-1-23 06:01 , Processed in 0.070000 second(s), 24 queries .

Powered by Discuz! X2.5

© 2001-2012 Comsenz Inc.

回顶部
TEL/電話+86 13764045638
Email service@parnassusdata.com
QQ 47079569