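After a storage expansion added ASM disks and the ASM instances were restarted, the ASM instances on the hosts could no longer mount the original DG_DATA disk group. The errors reported were ORA-15032 and ORA-15063.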
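1.2 Problem Analysis
The following timeline of the problem was compiled from the ASM alert logs, background traces, and host logs:
Time     Node  What happened
2:33:35  ASM1  Node 1 adds ASM171; at this point node 2 has not actually mounted that disk
2:33:35  ASM2  Node 2 fails to validate ASM171
2:33:48  ASM2  Node 2 halts all I/O to DG_DATA and dismounts DG_DATA
2:33:41  ASM1  Node 1 offlines ASM171
2:33:48  ASM1  Node 1 adds ASM172
2:33:48  ASM1  Node 1 performs a dirty detach reconfiguration
2:33:49  ASM1  Node 1 successfully adds DG_DATA_0163 (path: /dev/orcl/ASM172) to DG_DATA and starts the rebalance
2:34:00  ASM1  ARB0 on node 1 formally begins the rebalance; at this point the only newly added disk in the disk group is ASM172
2:54:00  ASM1  20 minutes later, ARB0 hits a File I/O error and the rebalance is interrupted
2:54:00  ASM1  At the same time, hdisk181 (the device behind ASM172) reports DISK OPERATION ERROR
3:00:55  ASM2  Node 2 attempts to re-add the ASM172 disk: "Alter diskgroup DG_DATA add disk '/dev/orcl/ASM172'"
3:06:13  ASM2  The ASM2 instance is restarted
3:06:16  ASM2  After the restart, the ASM instance on node 2 cannot mount DG_DATA
3:21:01  ASM1  The ASM1 instance is restarted
3:21:01  ASM1  After the restart, the ASM instance on node 1 cannot mount DG_DATA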
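Using the internal kfed utility together with the logs above, the following has been established so far:
1. Because the rebalance ran for 20 minutes before an I/O problem interrupted it, /dev/orcl/ASM172 does hold some data. The trace +asm1_arb0_897522.trc shows that files numbered 258, 271, 272 and others were relocated to that disk.
2. ASM171 was offlined and dropped from the disk group at 2:33:41, so no data was rebalanced onto it. Although ASM171 was dropped from the disk group, the metadata in its ASM header is not cleared immediately.
3. The disk group cannot be mounted because of ASM172 (DG_DATA_0163): the metadata in the header of this ASM disk is incorrect.
4. When the first disk, ASM171, was added, the corresponding LUN was not mounted on node 2, so the ASM2 instance failed to validate ASM171 and ASM171 was offlined and dropped from the disk group. When the second disk, ASM172, was added, the add did not fail because the ASM2 instance had already dismounted the DG_DATA disk group.
Reading the ASM disks with kfed shows:
The headers of both ASM171 and ASM172 are corrupted.
A healthy ASM disk header: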
/dev/orcl/ASM17
kfbh.endian: 0 ; 0x000: 0x00
kfbh.hard: 130 ; 0x001: 0x82
kfbh.type: 1 ; 0x002: KFBTYP_DISKHEAD
kfbh.datfmt: 1 ; 0x003: 0x01
kfbh.block.blk: 0 ; 0x004: T=0 NUMB=0x0
kfbh.block.obj: 2147483709 ; 0x008: TYPE=0x8 NUMB=0x3d
kfbh.check: 1946003667 ; 0x00c: 0x73fda8d3
kfbh.fcn.base: 0 ; 0x010: 0x00000000
kfbh.fcn.wrap: 0 ; 0x014: 0x00000000
kfbh.spare1: 0 ; 0x018: 0x00000000
kfbh.spare2: 0 ; 0x01c: 0x00000000
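The offsets in the kfed dump above suggest a 32-byte block header layout: four single-byte fields followed by seven 4-byte integers. The sketch below decodes those bytes with Python's struct module. It is only an illustration: the function name decode_kfbh is hypothetical, and the rule that an endian byte of 1 means little-endian integers (anything else is read as big-endian) is an assumption, not something stated in the dump.

```python
import struct

KFBTYP_DISKHEAD = 1  # kfbh.type value shown for the healthy header above


def decode_kfbh(block: bytes) -> dict:
    """Decode the first 32 bytes of a metadata block using the dump's offsets."""
    if len(block) < 32:
        raise ValueError("need at least 32 bytes of the block")
    # Offsets 0x000-0x003: endian, hard, type, datfmt (one byte each).
    endian, hard, btype, datfmt = struct.unpack_from("4B", block, 0)
    # Assumption: endian == 1 means little-endian integers, otherwise big-endian.
    order = "<" if endian == 1 else ">"
    # Offsets 0x004-0x01f: seven unsigned 32-bit integers.
    blk, obj, check, fcn_base, fcn_wrap, spare1, spare2 = struct.unpack_from(
        order + "7I", block, 4
    )
    return {
        "endian": endian,
        "hard": hard,
        "type": btype,
        "datfmt": datfmt,
        "block.blk": blk,
        "block.obj": obj,
        "check": check,
        "fcn.base": fcn_base,
        "fcn.wrap": fcn_wrap,
        "spare1": spare1,
        "spare2": spare2,
    }


def looks_like_disk_header(block: bytes) -> bool:
    """True when the type byte matches KFBTYP_DISKHEAD."""
    return decode_kfbh(block)["type"] == KFBTYP_DISKHEAD
```

Feeding the function bytes rebuilt from the healthy dump yields type 1, while the leading bytes 0xc9 0xc2 0xd4 0xc1 seen on the damaged disks decode to type 212, which is not a known block type.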
ASM171 disk header:
/dev/orcl/ASM171
kfbh.endian: 201 ; 0x000: 0xc9
kfbh.hard: 194 ; 0x001: 0xc2
kfbh.type: 212 ; 0x002: *** Unknown Enum *** ==> the corruption prevents the structure from being read in its original format
kfbh.datfmt: 193 ; 0x003: 0xc1
kfbh.block.blk: 0 ; 0x004: T=0 NUMB=0x0
kfbh.block.obj: 0 ; 0x008: TYPE=0x0 NUMB=0x0
kfbh.check: 0 ; 0x00c: 0x00000000 ==> everything after this point in the block is zero
kfbh.fcn.base: 0 ; 0x010: 0x00000000
kfbh.fcn.wrap: 0 ; 0x014: 0x00000000
kfbh.spare1: 0 ; 0x018: 0x00000000
kfbh.spare2: 0 ; 0x01c: 0x00000000
ASM172 disk header:
/dev/orcl/ASM172
kfbh.endian: 201 ; 0x000: 0xc9
kfbh.hard: 194 ; 0x001: 0xc2
kfbh.type: 212 ; 0x002: *** Unknown Enum *** ==> the corruption prevents the structure from being read in its original format
kfbh.datfmt: 193 ; 0x003: 0xc1
kfbh.block.blk: 0 ; 0x004: T=0 NUMB=0x0
kfbh.block.obj: 0 ; 0x008: TYPE=0x0 NUMB=0x0
kfbh.check: 0 ; 0x00c: 0x00000000 ==> everything after this point in the block is zero
kfbh.fcn.base: 0 ; 0x010: 0x00000000
kfbh.fcn.wrap: 0 ; 0x014: 0x00000000
kfbh.spare1: 0 ; 0x018: 0x00000000
kfbh.spare2: 0 ; 0x01c: 0x00000000
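The comparison between the healthy and damaged headers can be automated when many disks have to be checked. The following is a minimal sketch, assuming the kfed text output has the "name: value ; offset: detail" shape shown in the dumps above. The function names and the specific checks (type must be 1, endian must be 0 or 1, a zero checksum is treated as suspicious) are illustrative heuristics, not an official validation procedure.

```python
import re

# Matches lines such as "kfbh.type: 1 ; 0x002: KFBTYP_DISKHEAD".
FIELD_RE = re.compile(r"^(kfbh\.[\w.]+):\s+(\d+)\s*;")


def parse_kfed_header(text: str) -> dict:
    """Collect the numeric kfbh.* fields from kfed text output."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line.strip())
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def header_problems(fields: dict) -> list:
    """Return human-readable findings; an empty list means nothing obvious."""
    problems = []
    if fields.get("kfbh.type") != 1:
        problems.append("kfbh.type is not 1 (KFBTYP_DISKHEAD)")
    if fields.get("kfbh.endian") not in (0, 1):
        problems.append("kfbh.endian is neither 0 nor 1")
    if fields.get("kfbh.check") == 0:
        # Heuristic only: the healthy header above carries a non-zero checksum.
        problems.append("kfbh.check is zero")
    return problems
```

Run against the three dumps above, this reports no findings for /dev/orcl/ASM17 and three findings each for ASM171 and ASM172.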
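1.3 Recommendations
ASM disk header corruption can have many causes, for example an OS utility, a PVID occupying the disk header (ruled out in this case by checking with lspv), I/O problems, or a storage failure.
1. Because version 10.2.0.4 has no automatic ASM header backup feature, kfed repair cannot be used to fix the damaged disk header.
Manually constructing an ASM header can be attempted to repair the disk header of ASM172 so that the disk group can be mounted normally.
2. IBM and HDS should be asked to help investigate this ASM disk header corruption and to explain the cause and impact of the DISK OPERATION ERROR.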
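Before any manual reconstruction of a header is attempted, it is prudent to keep a copy of the current on-disk bytes so the change can be reverted. The sketch below is one possible way to do that, not part of the original procedure. The function name backup_header, the default of 4096 bytes, and the example paths are assumptions, and reading a raw device may need privileges and alignment that depend on the platform.

```python
import hashlib


def backup_header(device_path: str, backup_path: str, nbytes: int = 4096) -> str:
    """Copy the first nbytes of a device to a new file and return its SHA-256.

    Assumption: the region of interest sits at the start of the device and
    nbytes covers it; adjust nbytes for the actual environment.
    """
    with open(device_path, "rb") as src:
        data = src.read(nbytes)
    if len(data) != nbytes:
        raise OSError(f"short read: got {len(data)} of {nbytes} bytes")
    # Mode "xb" refuses to overwrite an existing backup file.
    with open(backup_path, "xb") as dst:
        dst.write(data)
    return hashlib.sha256(data).hexdigest()
```

A call such as backup_header("/dev/orcl/ASM172", "/tmp/ASM172_header.bak") would be made once before the repair. The backup path here is hypothetical, and the returned digest can be recorded to verify the backup later.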
1.4 Log Timeline
Time  What happened  Node  Notes
2:33:35 Alter diskgroup DG_DATA add disk '/dev/orcl/ASM171' ASM1 Node 1 adds ASM171; at this point node 2 has not actually mounted that disk
2:33:35 ERROR: group 1/0x4963bbf4 (DG_DATA): could not validate disk 163 ERROR: ERROR: too many offline disks in PST (grp 1)
Thu Feb 21 02:33:40 2013
SUCCESS: refreshed membership for 1/0x4963bbf4 (DG_DATA)
ERROR: ORA-15040 thrown in RBAL for group number 1
Thu Feb 21 02:33:40 2013
Errors in file /db160/app/oracle/admin/+ASM/bdump/+asm2_rbal_671888.trc:
ORA-15040: diskgroup is incomplete
ORA-15066: offlining disk "" may result in a data loss
ORA-15042: ASM disk "163" is missing
NOTE: cache closing disk 163 of grp 1:
NOTE: membership refresh pending for group 1/0x4963bbf4 (DG_DATA)
NOTE: cache closing disk 163 of grp 1: ASM2 Node 2 fails to validate ASM171
2:33:48 ERROR: PST-initiated MANDATORY DISMOUNT of group DG_DATA Dirty Detach Reconfiguration complete
Thu Feb 21 02:33:48 2013
WARNING: dirty detached from domain 1
Thu Feb 21 02:33:48 2013
NOTE: PST enabling heartbeating (grp 1)
Thu Feb 21 02:33:48 2013
SUCCESS: diskgroup DG_DATA was dismounted
NOTE: cache dismounting group 1/0x4963BBF4 (DG_DATA)
Thu Feb 21 02:33:48 2013
NOTE: halting all I/Os to diskgroup DG_DATA ASM2 Node 2 halts all I/O to DG_DATA and dismounts DG_DATA
2:33:41 WARNING: offlining disk 163.2088814063 (DG_DATA_0163) with mask 0x3
NOTE: PST update: grp = 1, dsk = 163, mode = 0x6
NOTE: PST update: grp = 1, dsk = 163, mode = 0x4
NOTE: cache closing disk 163 of grp 1: DG_DATA_0163
NOTE: PST update: grp = 1
NOTE: requesting all-instance membership refresh for group=1
Thu Feb 21 02:33:45 2013
NOTE: membership refresh pending for group 1/0x496035b9 (DG_DATA)
NOTE: cache closing disk 163 of grp 1: DG_DATA_0163
SUCCESS: refreshed membership for 1/0x496035b9 (DG_DATA) ASM1 Node 1 offlines ASM171
2:33:48 Alter diskgroup DG_DATA add disk '/dev/orcl/ASM172' ASM1 Node 1 adds ASM172
2:33:48 Dirty detach reconfiguration started Dirty Detach Reconfiguration complete
Thu Feb 21 02:33:48 2013
NOTE: SMON starting instance recovery for group 1 (mounted)
NOTE: F1X0 found on disk 0 fcn 0.126694451
NOTE: starting recovery of thread=2 ckpt=108.9808 group=1
NOTE: advancing ckpt for thread=2 ckpt=108.9808
NOTE: smon did instance recovery for domain 1
Thu Feb 21 02:33:49 2013
NOTE: reconfiguration of group 1/0x496035b9 (DG_DATA), full=1
Thu Feb 21 02:33:49 2013
WARNING: ignoring disk /dev/orcl/ASM171 in deep discovery ASM1 Node 1 performs a dirty detach reconfiguration
2:33:49 NOTE: cache opening disk 163 of grp 1: DG_DATA_0163 path:/dev/orcl/ASM172
NOTE: PST update: grp = 1
NOTE: requesting all-instance disk validation for group=1
Thu Feb 21 02:33:49 2013
NOTE: disk validation pending for group 1/0x496035b9 (DG_DATA)
SUCCESS: validated disks for 1/0x496035b9 (DG_DATA)
Thu Feb 21 02:33:51 2013
NOTE: PST update: grp = 1
NOTE: requesting all-instance membership refresh for group=1
Thu Feb 21 02:33:51 2013
NOTE: membership refresh pending for group 1/0x496035b9 (DG_DATA)
SUCCESS: refreshed membership for 1/0x496035b9 (DG_DATA)
Thu Feb 21 02:33:54 2013
NOTE: requesting all-instance membership refresh for group=1
Thu Feb 21 02:33:54 2013
NOTE: membership refresh pending for group 1/0x496035b9 (DG_DATA)
SUCCESS: refreshed membership for 1/0x496035b9 (DG_DATA)
NOTE: recovering COD for group 1/0x496035b9 (DG_DATA)
SUCCESS: completed COD recovery for group 1/0x496035b9 (DG_DATA)
Thu Feb 21 02:34:00 2013
NOTE: starting rebalance of group 1/0x496035b9 (DG_DATA) at power 1
Starting background process ARB0
ARB0 started with pid=20, OS id=897522
Thu Feb 21 02:34:00 2013
NOTE: assigning ARB0 to group 1/0x496035b9 (DG_DATA)
Thu Feb 21 02:54:29 2013
Errors in file /db160/app/oracle/admin/+ASM/bdump/+asm1_arb0_897522.trc: ASM1 Node 1 successfully adds DG_DATA_0163 (path: /dev/orcl/ASM172) to DG_DATA and starts the rebalance
2:34:00 *** SERVICE NAME:() 2013-02-21 02:34:00.555
*** SESSION ID:(31.22657) 2013-02-21 02:34:00.555
ARB0 relocating file +DG_DATA.258.754048969 (1 entries)
ARB0 relocating file +DG_DATA.271.754048983 (120 entries) ASM1 ARB0 on node 1 formally begins the rebalance; at this point the only newly added disk in the disk group is ASM172
2:54:00 ARB0 relocating file +DG_DATA.406.754065421 (120 entries)
ORA-27091: unable to queue I/O
ORA-27072: File I/O error
IBM AIX RISC System/6000 Error: 16: Device busy
Additional information: 7
Additional information: 8814592
Additional information: -1
ORA-15080: synchronous I/O operation to a disk failed ASM1 20 minutes later, ARB0 hits a File I/O error and the rebalance is interrupted
2:54:00 B6267342 0221025413 P H hdisk181 DISK OPERATION ERROR LABEL: SC_DISK_ERR2
IDENTIFIER: B6267342
Date/Time: Thu Feb 21 02:54:29 BEIST 2013
Sequence Number: 1921
Machine Id: 00CE62034C00
Node Id: umasadb1
Class: H
Type: PERM
Resource Name: hdisk181
Resource Class: disk
Resource Type: Hitachi
Location: U5791.001.99B19VN-P2-C08-T1-W50060E8005706236-LB3000000000000
VPD:
Manufacturer................HITACHI
Machine Type and Model......OPEN-V
Part Number.................
ROS Level and ID............36303038
Serial Number...............50 07062
EC Level....................
FRU Number..................
Device Specific.(Z0)........00000332CF000002
Device Specific.(Z1)........004D 1G ....
Device Specific.(Z2).........`
Device Specific.(Z3).........
Device Specific.(Z4).........|^B
Device Specific.(Z5)........
Device Specific.(Z6)........
Description
DISK OPERATION ERROR
Probable Causes
DASD DEVICE
Failure Causes
DISK DRIVE
DISK DRIVE ELECTRONICS
Recommended Actions
PERFORM PROBLEM DETERMINATION PROCEDURES
Detail Data
PATH ID ASM1 At the same time, hdisk181 (the device behind ASM172) reports DISK OPERATION ERROR
3:00:55 Alter diskgroup DG_DATA add disk '/dev/orcl/ASM172' ASM2 Node 2 re-adds the ASM172 disk
3:06:13 Starting ORACLE instance (normal) ASM2 The ASM2 instance is restarted
3:06:16 NOTE: cache dismounting group 1/0x496001AB (DG_DATA)
NOTE: dbwr not being msg'd to dismount
Thu Feb 21 03:06:51 2013
NOTE: PST enabling heartbeating (grp 1)
Thu Feb 21 03:06:51 2013
ERROR: diskgroup DG_DATA was not mounted ASM2 After the restart, the ASM instance on node 2 cannot mount DG_DATA
3:21:01 Starting ORACLE instance (normal) ASM1 The ASM1 instance is restarted
3:21:01 Thu Feb 21 03:21:15 2013
NOTE: Hbeat: instance first (grp 1)
Thu Feb 21 03:21:15 2013
NOTE: Hbeat: instance not first (grp 2)
Thu Feb 21 03:21:20 2013
NOTE: start heartbeating (grp 1)
Thu Feb 21 03:21:20 2013
NOTE: cache dismounting group 1/0x46315372 (DG_DATA)
NOTE: dbwr not being msg'd to dismount ASM1 After the restart, the ASM instance on node 1 cannot mount DG_DATA
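A timeline like the one above has to be assembled from alert logs of two instances whose entries interleave. The sketch below shows one way to do that, assuming each alert log stamps messages with a standalone line in the "Thu Feb 21 02:33:48 2013" form seen in the excerpts. The function names are hypothetical, and messages that appear before the first timestamp line are simply skipped.

```python
from datetime import datetime

# Timestamp lines in the excerpts look like "Thu Feb 21 02:33:48 2013".
TS_FORMAT = "%a %b %d %H:%M:%S %Y"


def parse_alert_log(text: str, node: str) -> list:
    """Return (timestamp, node, message) tuples for one instance's alert log."""
    entries = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        try:
            # A line that parses as a timestamp applies to the messages after it.
            current = datetime.strptime(line, TS_FORMAT)
        except ValueError:
            if current is not None:
                entries.append((current, node, line))
    return entries


def merge_timelines(*logs: list) -> list:
    """Merge per-node entry lists into one list ordered by timestamp."""
    combined = [entry for log in logs for entry in log]
    # sorted() is stable, so entries with equal timestamps keep their input order.
    return sorted(combined, key=lambda entry: entry[0])
```

Merging the ASM1 and ASM2 excerpts this way places the node 1 offline of disk 163 (02:33:45) ahead of the node 2 dismount (02:33:48), which is the ordering the analysis relies on.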