0 积分	1 好友	10 主题

发消息

[SQL调优] RAC数据库突然掉电后，遭遇ORA-01157、ORA-01110的一些疑惑

1^#

发表于 2014-4-11 16:36:24 | 查看: 2890| 回复: 1

环境描述：
aix 5.3
oracle 10.2.0.5 双节点RAC

客户反应，在10号早上重启了主机后，库起来了，但是部分业务不能办理，部分业务正常（客户描述）
他们的维护策略是：10号重启1节点主机，20号重启节点主机。

具体表现;在一个select查询时，两个节点均遭遇
SQL> select * from scxnb.ac43 where aaa027='510904';
select * from scxnb.ac43 where aaa027='510904'
*
ERROR at line 1:
ORA-01157: cannot identify/lock data file 102 - see DBWR trace file
ORA-01110: data file 102: '+DG_DATA1/cp3_94'

检查：
起初怀疑数据文件丢失
然而经过检查，数据文件存在。
检查alert日志后，之前客户比较粗暴的关闭主机，对主机进行重启的（他们目的是重启数据库相关服务，不知道为啥直接要重启主机）

相关alert日志如下：
SUCCESS: diskgroup ARCHDG was mounted
Wed Apr 09 23:46:56 BEIST 2014
SUCCESS: diskgroup ARCHDG was dismounted
Thu Apr 10 07:00:05 BEIST 2014
Shutting down instance (abort)
License high water mark = 344
Instance terminated by USER, pid = 733204
Thu Apr 10 07:13:33 BEIST 2014
Starting ORACLE instance (normal)
sskgpgetexecname failed to get name
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Interface type 1 en1 172.16.1.0 configured from OCR for use as a cluster interconnect
Interface type 1 en7 10.160.5.0 configured from OCR for use as  a public interface
Picked latch-free SCN scheme 3
LICENSE_MAX_USERS = 0
SYS auditing is disabled
ksdpec: called for event 13740 prior to event group initialization
Starting up ORACLE RDBMS Version: 10.2.0.5.0.
System parameters with non-default values:
  processes             = 1500
  sessions                = 2000
  resource_limit          = TRUE
  sga_max_size          = 21474836480
  __shared_pool_size    = 1442840576
  __large_pool_size       = 16777216
  __java_pool_size       = 50331648
  __streams_pool_size    = 0
  spfile                = +DATADG/scnbdb/spfilescnbdb.ora
  sga_target             = 12884901888
  control_files          = +DATADG/scnbdb/controlfile/current.260.748375667
  db_block_size          = 8192
  __db_cache_size       = 11358175232
  compatible             = 10.2.0.5.0
  log_archive_dest_1    = location=+ARCHDG
  log_archive_format    = %t_%s_%r.dbf
  db_file_multiblock_read_count= 16
  cluster_database       = TRUE
  cluster_database_instances= 2
  db_create_file_dest    = +DATADG
  thread                = 1
  instance_number       = 1
  undo_management       = AUTO
  undo_tablespace       = UNDOTBS1
  undo_retention          = 86400
  remote_login_passwordfile= EXCLUSIVE
  db_domain             =
  dispatchers             = (PROTOCOL=TCP) (SERVICE=scnbdbXDB)
  local_listener          = (address=(protocol=tcp)(host=10.160.5.1)(port=1521))
  remote_listener       = LISTENERS_SCNBDB
  utl_file_dir          = /u01/app/oracle/logs_utl_file
  job_queue_processes    = 10
  background_dump_dest    = /u01/app/oracle/admin/scnbdb/bdump
  user_dump_dest          = /u01/app/oracle/admin/scnbdb/udump
  core_dump_dest          = /u01/app/oracle/admin/scnbdb/cdump
  audit_file_dest       = /u01/app/oracle/admin/scnbdb/adump
  db_name                = scnbdb
  open_cursors          = 2000
  pga_aggregate_target    = 8589934592
  _awr_flush_threshold_metrics= TRUE
Cluster communication is configured to use the following interface(s) for this instance
  172.16.1.1
Thu Apr 10 07:13:34 BEIST 2014
cluster interconnect IPC version:Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
PMON started with pid=2, OS id=229432
DIAG started with pid=3, OS id=233530
PSP0 started with pid=4, OS id=229802
LMON started with pid=5, OS id=237628
LMD0 started with pid=6, OS id=241726
LMS0 started with pid=7, OS id=131618
LMS1 started with pid=8, OS id=168506
LMS2 started with pid=9, OS id=254020
MMAN started with pid=10, OS id=201020
DBW0 started with pid=11, OS id=233808
DBW1 started with pid=12, OS id=221268
LGWR started with pid=13, OS id=262218
CKPT started with pid=14, OS id=209300
SMON started with pid=15, OS id=241996
RECO started with pid=16, OS id=250192
CJQ0 started with pid=17, OS id=266322
MMON started with pid=18, OS id=270420
Thu Apr 10 07:13:34 BEIST 2014
starting up 1 dispatcher(s) for network address '(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))'...
MMNL started with pid=19, OS id=274518
Thu Apr 10 07:13:35 BEIST 2014
starting up 1 shared server(s) ...
Thu Apr 10 07:13:38 BEIST 2014
lmon registered with NM - instance id 1 (internal mem no 0)
Thu Apr 10 07:13:38 BEIST 2014
Reconfiguration started (old inc 0, new inc 8)
List of nodes:
0 1
Global Resource Directory frozen
* allocate domain 0, invalid = TRUE
Communication channels reestablished
* domain 0 valid according to instance 1
* domain 0 valid = 1 according to instance 1
Thu Apr 10 07:13:38 BEIST 2014
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Thu Apr 10 07:13:38 BEIST 2014
LMS 0: 0 GCS shadows cancelled, 0 closed
Thu Apr 10 07:13:38 BEIST 2014
LMS 2: 0 GCS shadows cancelled, 0 closed
Thu Apr 10 07:13:38 BEIST 2014
LMS 1: 0 GCS shadows cancelled, 0 closed
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Thu Apr 10 07:13:39 BEIST 2014
LMS 2: 0 GCS shadows traversed, 0 replayed
Thu Apr 10 07:13:39 BEIST 2014
LMS 0: 0 GCS shadows traversed, 0 replayed
Thu Apr 10 07:13:39 BEIST 2014
LMS 1: 0 GCS shadows traversed, 0 replayed
Thu Apr 10 07:13:39 BEIST 2014
Submitted all GCS remote-cache requests
Post SMON to start 1st pass IR
Fix write in gcs resources
Reconfiguration complete
LCK0 started with pid=22, OS id=258128
Thu Apr 10 07:13:41 BEIST 2014
ALTER DATABASE MOUNT
Thu Apr 10 07:13:42 BEIST 2014
Starting background process ASMB
ASMB started with pid=24, OS id=282746
Starting background process RBAL
RBAL started with pid=25, OS id=245828
Thu Apr 10 07:13:46 BEIST 2014
SUCCESS: diskgroup DATADG was mounted
Thu Apr 10 07:13:51 BEIST 2014
Setting recovery target incarnation to 2
Thu Apr 10 07:13:51 BEIST 2014
Successful mount of redo thread 1, with mount id 3422982371
Thu Apr 10 07:13:51 BEIST 2014
Database mounted in Shared Mode (CLUSTER_DATABASE=TRUE)
Completed: ALTER DATABASE MOUNT
Thu Apr 10 07:13:52 BEIST 2014
ALTER DATABASE OPEN
Picked broadcast on commit scheme to generate SCNs
Thu Apr 10 07:13:58 BEIST 2014
Errors in file /u01/app/oracle/admin/scnbdb/bdump/scnbdb1_dbw0_233808.trc:
ORA-01157: cannot identify/lock data file 99 - see DBWR trace file
ORA-01110: data file 99: '+DG_DATA1/cp3_91'
ORA-17503: ksfdopn:2 Failed to open file +DG_DATA1/cp3_91
ORA-15001: diskgroup "DG_DATA1" does not exist or is not mounted
ORA-15001: diskgroup "DG_DATA1" does not exist or is not mounted
Thu Apr 10 07:13:58 BEIST 2014
Errors in file /u01/app/oracle/admin/scnbdb/bdump/scnbdb1_dbw0_233808.trc:
ORA-01157: cannot identify/lock data file 100 - see DBWR trace file
ORA-01110: data file 100: '+DG_DATA1/cp3_92'
ORA-17503: ksfdopn:2 Failed to open file +DG_DATA1/cp3_92
ORA-15001: diskgroup "DG_DATA1" does not exist or is not mounted
ORA-15001: diskgroup "DG_DATA1" does not exist or is not mounted

在两个节点
NODE1:

export ORACLE_SID=+ASM1

SQL> select name,state from v$asm_diskgroup;

NAME                         STATE
------------------------------ ----------------------
ARCHDG                      MOUNTED
DATADG                      MOUNTED
DG_DATA1                   DISMOUNTED

NODE2
export ORACLE_SID=+ASM2

SQL> select name,state from v$asm_diskgroup;

NAME                         STATE
------------------------------ ----------------------
ARCHDG                      MOUNTED
DATADG                      MOUNTED
DG_DATA1                   DISMOUNTED

export ORACLE_SID=+ASM1

SQL> select name,state from v$asm_diskgroup;

NAME                         STATE
------------------------------ ----------------------
ARCHDG                      MOUNTED
DATADG                      MOUNTED
DG_DATA1                   DISMOUNTED

--asm asm_diskgroups参数指定了 DATADG 、ARCHDG 、DG_DATA1  但是，DATADG 、ARCHDG mount了，
DG_DATA1 DISMOUNTED；

--比较奇怪，如果是因为shutdown 导致的diskgroup dismount，数据库还能open？有点疑惑。。
且此时数据库正常运行，日志切换均正常。
和客户沟通，9号一切业务正常，推断是因为对主机的shutdown导致数据库的此错误。

分别在两个节点mount DG_DATA1磁盘组
alter diskgroup DG_DATA1 mount;
然后在数据库执行
SQL> alter system check datafiles;

System altered.

检查之前报错的数据文件状态
SQL> select file_id,file_name,status from dba_data_files where file_id=102;

FILE_ID FILE_NAME                   STATUS
------- ------------------------------ --------------------
102 +DG_DATA1/cp3_94             AVAILABLE

状态正常，
select * from scxnb.ac43 where aaa027='510904';
--返回结果，不再报错

分享0