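The Exadata Database Machine combines the best of Oracle's hardware and software technology, so its regular health checks should not be taken lightly.
Exadata health checks are based mainly on Exachk, the standardized tool from Oracle Support. For a detailed introduction to Exachk, see:
An overview of Exachk and its best practices. Running Exachk regularly collects system information from the Exadata machine and compares the customer's current configuration against Oracle best practices to give recommended values. This helps find potential problems early, remove hidden risks, and keep the Exadata system running stably.
Here we only give the concrete steps for using Exachk: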
$ ./exachk
CRS stack is running and CRS_HOME is not set. Do you want to set CRS_HOME to /u01/app/11.2.0.3/grid?[y/n][y]y
Checking ssh user equivalency settings on all nodes in cluster
./exachk: line 8674: [: 5120: unary operator expected
Space available on ware at /tmp is KB and required space is 5120 KB
Please make at least 10MB space available at above location and retry to continue.[y/n][y]?
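The "unary operator expected" message in the transcript above suggests that one operand of the numeric comparison was empty, which matches the blank value printed before "KB". The following is only a minimal sketch of a defensive free-space check, not exachk's own logic: the function names `tmp_free_kb` and `has_enough_space` are hypothetical, and the 5120 KB threshold is taken from the message above.

```shell
#!/bin/sh
# Hypothetical pre-flight helper; it is not part of exachk.

# tmp_free_kb DIR: print the available space, in KB, of the filesystem holding DIR.
tmp_free_kb() {
    # -P forces one output line per filesystem; -k reports 1024-byte blocks.
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# has_enough_space AVAILABLE_KB REQUIRED_KB: succeed only when AVAILABLE_KB is
# non-empty and at least REQUIRED_KB. The -n guard avoids the
# "unary operator expected" error that an empty operand would trigger.
has_enough_space() {
    [ -n "$1" ] && [ "$1" -ge "$2" ]
}

REQUIRED_KB=5120   # value quoted in the exachk message above
avail=$(tmp_free_kb /tmp)
if has_enough_space "$avail" "$REQUIRED_KB"; then
    echo "OK: /tmp has ${avail} KB free"
else
    echo "Not enough space in /tmp: '${avail}' KB free, ${REQUIRED_KB} KB required" >&2
fi
```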
RAT_CLUSTERNODES needs to be set to specify the names of the nodes to check:
$ su - oracle
$ export RAT_CLUSTERNODES="dm01db01-priv dm01db02-priv"
$ export RAT_DBNAMES="orcl,dbm"
$ ./exachk
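Since exachk starts by checking ssh user equivalency on every node listed in RAT_CLUSTERNODES, the same condition can be tested by hand beforehand. This is a rough sketch under stated assumptions: `check_nodes` and the `SSH_BIN` override are hypothetical names introduced here, and exachk's own test may differ.

```shell
#!/bin/sh
# Hypothetical pre-check; exachk performs its own equivalency test.
# SSH_BIN can be overridden for dry runs; it defaults to the real ssh client.
SSH_BIN=${SSH_BIN:-ssh}

# check_nodes "node1 node2 ...": print one status line per node and
# return non-zero if any node refuses a password-less login.
check_nodes() {
    rc=0
    for node in $1; do
        # BatchMode=yes makes ssh fail instead of prompting for a password.
        if "$SSH_BIN" -o BatchMode=yes -o ConnectTimeout=5 "$node" true \
                </dev/null >/dev/null 2>&1; then
            echo "$node: ok"
        else
            echo "$node: NOT configured"
            rc=1
        fi
    done
    return $rc
}
```

With the variables exported as above, it could be called as `check_nodes "$RAT_CLUSTERNODES"` from the oracle account.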
[oracle@192 tmp]$ ./exachk
CRS stack is running and CRS_HOME is not set. Do you want to set CRS_HOME to /u01/app/11.2.0.3/grid?[y/n][y]y
Checking ssh user equivalency settings on all nodes in cluster
Node dm01db01-priv is configured for ssh user equivalency for oracle user
Node dm01db02-priv is configured for ssh user equivalency for oracle user
Searching out ORACLE_HOME for selected databases.
. . . .
Checking Status of Oracle Software Stack - Clusterware, ASM, RDBMS
. . . . . . . . . . . . . . . . . . . /u01/app/11.2.0.3/grid/bin/cemutlo.bin: Failed to initialize communication with CSS daemon, error code 3
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
-------------------------------------------------------------------------------------------------------
Oracle Stack Status
-------------------------------------------------------------------------------------------------------
Host Name      CRS Installed  ASM HOME  RDBMS Installed  CRS UP  ASM UP  RDBMS UP  DB Instance Name
-------------------------------------------------------------------------------------------------------
192            Yes            Yes       Yes              Yes     No      Yes
dm01db01-priv  Yes            Yes       Yes              Yes     No      Yes
dm01db02-priv  Yes            Yes       Yes              Yes     No      Yes
-------------------------------------------------------------------------------------------------------
root user equivalence is not setup between 192 and STORAGE SERVER dm01cel01.
1. Enter 1 if you will enter root password for each STORAGE SERVER when prompted.
2. Enter 2 to exit and configure root user equivalence manually and re-run exachk.
3. Enter 3 to skip checking best practices on STORAGE SERVER.
Please indicate your selection from one of the above options[1-3][1]:- 1
Is root password same on all STORAGE SERVER?[y/n][y]y
Enter root password for STORAGE SERVER :-
97 of the included audit checks require root privileged data collection on DATABASE SERVER. If sudo is not configured or the root password is not available, audit checks which require root privileged data collection can be skipped.
1. Enter 1 if you will enter root password for each on DATABASE SERVER host when prompted
2. Enter 2 if you have sudo configured for oracle user to execute root_exachk.sh script on DATABASE SERVER
3. Enter 3 to skip the root privileged collections on DATABASE SERVER
4. Enter 4 to exit and work with the SA to configure sudo on DATABASE SERVER or to arrange for root access and run the tool later.
Please indicate your selection from one of the above options[1-4][1]:- 1
Is root password same on all compute nodes?[y/n][y]y
Enter root password on DATABASE SERVER:-
9 of the included audit checks require nm2user privileged data collection on INFINIBAND SWITCH .
1. Enter 1 if you will enter nm2user password for each INFINIBAND SWITCH when prompted
2. Enter 2 to exit and to arrange for nm2user access and run the exachk later.
3. Enter 3 to skip checking best practices on INFINIBAND SWITCH
Please indicate your selection from one of the above options[1-3][1]:- 3
*** Checking Best Practice Recommendations (PASS/WARNING/FAIL) ***
Log file for collections and audit checks are at
/tmp/exachk_040613_105703/exachk.log
Starting to run exachk in background on dm01db01-priv
Starting to run exachk in background on dm01db02-priv
=============================================================
Node name - 192
=============================================================
Collecting - Compute node PCI bus slot speed for infiniband HCAs
Collecting - Kernel parameters
Collecting - Maximum number of semaphore sets on system
Collecting - Maximum number of semaphores on system
Collecting - Maximum number of semaphores per semaphore set
Collecting - Patches for Grid Infrastructure
Collecting - Patches for RDBMS Home
Collecting - RDBMS patch inventory
Collecting - number of semaphore operations per semop system call
Preparing to run root privileged commands on DATABASE SERVER 192.
Starting to run root privileged commands in background on STORAGE SERVER dm01cel01
root@192.168.64.131's password:
Starting to run root privileged commands in background on STORAGE SERVER dm01cel02
root@192.168.64.132's password:
Starting to run root privileged commands in background on STORAGE SERVER dm01cel03
root@192.168.64.133's password:
Collecting - Ambient Temperature on storage server
Collecting - Exadata Critical Issue EX10
Collecting - Exadata Critical Issue EX11
Collecting - Exadata software version on storage server
Collecting - Exadata software version on storage servers
Collecting - Exadata storage server system model number
Collecting - RAID controller version on storage servers
Collecting - Verify Disk Cache Policy on storage servers
Collecting - Verify Electronic Storage Module (ESM) Lifetime is within Specification
Collecting - Verify Hardware and Firmware on Database and Storage Servers (CheckHWnFWProfile) [Storage Server]
Collecting - Verify Master (Rack) Serial Number is Set [Storage Server]
Collecting - Verify PCI bridge is configured for generation II on storage servers
Collecting - Verify RAID Controller Battery Condition [Storage Server]
Collecting - Verify RAID Controller Battery Temperature [Storage Server]
Collecting - Verify There Are No Storage Server Memory (ECC) Errors
Collecting - Verify service exachkcfg autostart status on storage server
Collecting - Verify storage server disk controllers use writeback cache
Collecting - verify asr exadata configuration check via ASREXACHECK on storage servers
Collecting - Configure Storage Server alerts to be sent via email
Collecting - Exadata Celldisk predictive failures
Collecting - Exadata Critical Issue EX9
Collecting - Exadata storage server root filesystem free space
Collecting - HCA firmware version on storage server
Collecting - OFED Software version on storage server
Collecting - OSWatcher status on storage servers
Collecting - Operating system and Kernel version on storage server
Collecting - Scan storage server alerthistory for open alerts
Collecting - Storage server flash cache mode
Collecting - Verify Data Network is Separate from Management Network on storage server
Collecting - Verify Ethernet Cable Connection Quality on storage servers
Collecting - Verify Exadata Smart Flash Cache is created
Collecting - Verify Exadata Smart Flash Log is Created
Collecting - Verify InfiniBand Cable Connection Quality on storage servers
Collecting - Verify Software on Storage Servers (CheckSWProfile.sh)
Collecting - Verify average ping times to DNS nameserver
Collecting - Verify celldisk configuration on disk drives
Collecting - Verify celldisk configuration on flash memory devices
Collecting - Verify griddisk ASM status
Collecting - Verify griddisk count matches across all storage servers where a given prefix name exists
Collecting - Verify storage server metric CD_IO_ST_RQ
Collecting - Verify there are no griddisks configured on flash memory devices
Collecting - Verify total number of griddisks with a given prefix name is evenly divisible of celldisks
Collecting - Verify total size of all griddisks fully utilizes celldisk capacity
Collecting - mpt_cmd_retry_count from /etc/modprobe.conf on Storage Servers
Data collections completed. Checking best practices on 192.
--------------------------------------------------------------------------------------
FAIL => CSS misscount should be set to the recommended value of 60
FAIL => Database Server InfiniBand network MTU size is NOT 65520
WARNING => Database has one or more dictionary managed tablespace
WARNING => RDBMS Version is NOT 11.2.0.2 as expected
FAIL => Storage Server alerts are not configured to be sent via email
FAIL => Management network is not separate from data network
WARNING => NIC bonding is NOT configured for public network (VIP)
WARNING => NIC bonding is not configured for interconnect
WARNING => SYS.AUDSES$ sequence cache size < 10,000
WARNING => GC blocks lost is occurring
WARNING => Some tablespaces are not using Automatic segment storage management.
WARNING => SYS.IDGEN1$ sequence cache size < 1,000
WARNING => Interconnect is configured on routable network addresses
FAIL => Some data or temp files are not autoextensible
FAIL => One or more Ethernet network cables are not connected.
WARNING => Multiple RDBMS instances discovered, observe database consolidation best practices
INFO => ASM griddisk,diskgroup and Failure group mapping not checked.
FAIL => One or more storage server has stateless alerts with null "examinedby" fields.
WARNING => Standby is not opened read only with managed recovery in real time apply mode
FAIL => Managed recovery process is not running
FAIL => Flashback on PRIMARY is not configured
WARNING => Standby is not in READ ONLY WITH APPLY mode
FAIL => Flashback on STANDBY is not configured
FAIL => No one high redundancy diskgroup configured
INFO => Operational Best Practices
INFO => Consolidation Database Practices
INFO => Network failure prevention best practices
INFO => Computer failure prevention best practices
INFO => Data corruption prevention best practices
INFO => Logical corruption prevention best practices
INFO => Storage failures prevention best practices
INFO => Database/Cluster/Site failure prevention best practices
INFO => Client failover operational best practices
FAIL => Some bigfile tablespaces do not have non-default maxbytes values set
FAIL => Standby database is not in sync with primary database
FAIL => Redo transport from primary to standby has more than 5 minutes or more lag
FAIL => Standby database is not in sync with primary database
FAIL => System may be exposed to Exadata Critical Issue DB11 /u01/app/oracle/product/11.2.0.3/dbhome_1
FAIL => System may be exposed to Exadata Critical Issue DB11 /u01/app/oracle/product/11.2.0.3/orcl
INFO => Software maintenance best practices
FAIL => Operating system hugepages count does not satisfy total SGA requirements
FAIL => Table AUD$[FGA_LOG$] should use Automatic Segment Space Management
INFO => Database failure prevention best practices
WARNING => Database Archivelog Mode should be set to ARCHIVELOG
WARNING => Some tablespaces are not using Automatic segment storage management.
WARNING => Database has one or more dictionary managed tablespace
WARNING => Unsupported data types preventing Data Guard (transient logical standby or logical standby) rolling upgrade
Collecting patch inventory on CRS HOME /u01/app/11.2.0.3/grid
Collecting patch inventory on ORACLE_HOME /u01/app/oracle/product/11.2.0.3/dbhome_1
Collecting patch inventory on ORACLE_HOME /u01/app/oracle/product/11.2.0.3/orcl
Copying results from dm01db01-priv and generating report. This might take a while. Be patient.
---------------------------------------------------------------------------------
CLUSTERWIDE CHECKS
---------------------------------------------------------------------------------
---------------------------------------------------------------------------------
Detailed report (html) - /tmp/exachk_192_dbm_040613_105703/exachk_192_dbm_040613_105703.html
UPLOAD(if required) - /tmp/exachk_192_dbm_040613_105703.zip
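The console output above mixes FAIL, WARNING and INFO findings, so a quick tally can help when comparing one run with the next. The sketch below assumes the console output was saved to a text file and that findings keep the "SEVERITY => message" shape shown above; `summarize_findings` is a hypothetical helper, not an exachk feature.

```shell
#!/bin/sh
# Hypothetical helper: count exachk console findings by severity.
# Reads the named files, or standard input when no file is given.
summarize_findings() {
    awk '
        {
            # Scan every field so that two findings printed on one line
            # are both counted.
            for (i = 1; i < NF; i++)
                if ($(i + 1) == "=>" && ($i == "FAIL" || $i == "WARNING" || $i == "INFO"))
                    count[$i]++
        }
        END {
            printf "FAIL=%d WARNING=%d INFO=%d\n", count["FAIL"], count["WARNING"], count["INFO"]
        }
    ' "$@"
}
```

For example, if the session had been captured in a file named `exachk_console.log` (an assumed name), `summarize_findings exachk_console.log` would print a single line of counts. The HTML report remains the authoritative result.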
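Finally, the report is packaged as a zip file, which can be uploaded to GCS on a regular basis.
The HTML version of the generated report is shown in the figure below: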