
ricky

Large numbers of tnslsnr processes spawned, leading to ksvcreate: Process(m000) creation failed

1^#

Posted on 2012-7-17 15:00:25 | Views: 11493 | Replies: 9

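Environment:

System environment:
Oracle 10.2.0.4 RAC 64bit + IBM AIX 6.1 + OLTP system

Node 1: this is the primary node. No load balancing is configured at the moment, so node 1 essentially carries the workload.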

Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options

ORACLE_HOME = /oracle/app/oracle/product/10.2/db

System name: AIX

Node name: RESDB1

Release: 1

Version: 6

Machine: 00CD21164C00

Instance name: nrms1

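Starting at 18:40 last night the database had problems and hung. By 20:00 it was still hung and unusable, so we had no choice but to reboot the host.

For the detailed handling steps, see:

处理过程16号.txt (handling log for the 16th; 296.89 KB, downloads: 997)

-----The following is this morning's analysis------------------

The alert log shows the following errors:

alert.txt (2.14 KB, downloads: 1182)

Attached diagnostic log:

nrms1_diag_5046632.rar (999.86 KB, downloads: 1310)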

SQL> show parameter process

NAME                          TYPE     VALUE
----------------------------- -------- ------
aq_tm_processes               integer  0
db_writer_processes           integer  6
gcs_server_processes          integer  12
job_queue_processes           integer  10
log_archive_max_processes     integer  2
processes                     integer  2500

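Analysis:

From Google: the m000 process failed to be created, and PMON could not start it. In general, there are three reasons why PMON cannot start a process: 1. The number of Oracle connections exceeds the processes limit. 2. A process deadlock. 3. A bug.

My initial suspicion is reason 1, because listener.log contains a large number of errors like these: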

16-JUL-2012 21:18:39 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=nrms)(CID=(PROGRAM=oracle)(HOST=NGCRM595A)(USER=oracle))) * (ADDRESS=(PROTOCOL=tcp)(HOST=135.32.9.139)(PORT=59094)) * establish * nrms * 12528
TNS-12528: TNS:listener: all appropriate instances are blocking new connections

16-JUL-2012 20:00:55 * (CONNECT_DATA=(SID=nrms1)(CID=(PROGRAM=)(HOST=__jdbc__)(USER=weblogic))) * (ADDRESS=(PROTOCOL=tcp)(HOST=135.32.21.24)(PORT=57990)) * establish * nrms1 * 12518
TNS-12518: TNS:listener could not hand off client connection
 TNS-12540: TNS:internal limit restriction exceeded
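To see whether the connection failures cluster in time, listener log entries like the ones above can be tallied by minute and return code. The following is only a minimal Python sketch. It assumes the usual listener log layout, in which each entry starts with a `DD-MON-YYYY HH:MI:SS` timestamp and ends with `* <return code>` (with `0` meaning success). The function name `failures_per_minute` and the sample list are illustrative and not part of any Oracle tool.

```python
import re
from collections import Counter

# One log entry per line: timestamp, '*'-separated fields, return code last.
ENTRY_RE = re.compile(
    r"^(\d{2}-[A-Z]{3}-\d{4} \d{2}:\d{2}):\d{2} \*.*\* (\d+)$"
)


def failures_per_minute(lines):
    """Count entries with a non-zero return code, keyed by (minute, code)."""
    counts = Counter()
    for line in lines:
        match = ENTRY_RE.match(line.strip())
        # The "TNS-nnnnn: ..." message lines do not match and are skipped.
        if match and match.group(2) != "0":
            counts[(match.group(1), match.group(2))] += 1
    return counts


# Sample entries taken from the post (split IP address rejoined).
SAMPLE_LOG = [
    "16-JUL-2012 21:18:39 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=nrms)"
    "(CID=(PROGRAM=oracle)(HOST=NGCRM595A)(USER=oracle))) * "
    "(ADDRESS=(PROTOCOL=tcp)(HOST=135.32.9.139)(PORT=59094)) * establish * nrms * 12528",
    "TNS-12528: TNS:listener: all appropriate instances are blocking new connections",
    "16-JUL-2012 20:00:55 * (CONNECT_DATA=(SID=nrms1)(CID=(PROGRAM=)"
    "(HOST=__jdbc__)(USER=weblogic))) * "
    "(ADDRESS=(PROTOCOL=tcp)(HOST=135.32.21.24)(PORT=57990)) * establish * nrms1 * 12518",
    "TNS-12518: TNS:listener could not hand off client connection",
    " TNS-12540: TNS:internal limit restriction exceeded",
]

for (minute, code), n in sorted(failures_per_minute(SAMPLE_LOG).items()):
    print(minute, "TNS-" + code, n)
```

Run against the full listener log, the first minute with a non-zero count marks where the failures begin, which can then be compared with the alert log.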

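Questions:

Liu, you have said that finding material and analysing it is a skill, and I would like to learn from you. I have dug up all of this, but how do I filter it down to what is actually useful? I am lost.

At the moment I suspect that the maximum process limit was reached, which is why the system hung and the M000 process could not be created. The evidence is the content of the listener log.

Why would the limit be reached? Because a large number of abnormal tnslsnr processes were spawned.

Beyond that I cannot find any useful material. I would appreciate a few pointers, thank you.

I still have not found the root cause, that is, why so many tnslsnr processes were spawned that the operating system's maximum was reached.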

oracle@RESDB1[/oracle/app/oracle/admin/nrms/bdump]$ls *pmon*

nrms1_pmon_2490858.trc nrms1_pmon_3670294.trc nrms1_pmon_37224998.trc nrms1_pmon_38076824.trc nrms1_pmon_5767290.trc

pmon.rar (213.99 KB, downloads: 1555)

listener_nrms1.rar (3.83 KB, downloads: 1514)


ricky

10^#

Posted on 2012-7-17 17:30:33

oracle@RESDB1[/oracle]$lsattr -El sys0
SW_dist_intr false             Enable SW distribution of interrupts             True
autorestart    true             Automatically REBOOT OS after a crash          True
boottype       disk             N/A                                              False
capacity_inc 1.00             Processor capacity increment                   False
capped       true             Partition is capped                            False
conslogin    enable          System Console Login                            False
cpuguard       enable          CPU Guard                                        True
dedicated    true             Partition is dedicated                         False
enhanced_RBAC true             Enhanced RBAC Mode                               True
ent_capacity 24.00             Entitled processor capacity                      False
frequency    2660000000       System Bus Frequency                            False
fullcore       false             Enable full CORE dump                            True
fwversion    IBM,EM350_107    Firmware version and revision levels             False
ghostdev       0                Recreate devices in ODM on system change       True
id_to_partition 0X800014E9EBA00002 Partition ID                                     False
id_to_system 0X800014E9EBA00000 System ID                                        False
iostat       true             Continuously maintain DISK I/O history          True
keylock       normal          State of system keylock at boot time             False
log_pg_dealloc  true             Log predictive memory page deallocation events True
max_capacity 32.00             Maximum potential processor capacity             False
max_logname    9                Maximum login name length at boot time          True
maxbuf       20                Maximum number of pages in block I/O BUFFER CACHE True
maxmbuf       0                Maximum Kbytes of real memory allowed for MBUFS True
maxpout       33                HIGH water mark for pending write I/Os per file True
maxuproc       8192             Maximum number of PROCESSES allowed per user    True
min_capacity 8.00             Minimum potential processor capacity             False
minpout       24                LOW water mark for pending write I/Os per file True
modelname    IBM,9117-MMA    Machine name                                     False
ncargs       256             ARG/ENV list size in 4K byte blocks             True
nfs4_acl_compat secure          NFS4 ACL Compatibility Mode                      True
pre430core    false             Use pre-430 style CORE dump                      True
pre520tune    disable          Pre-520 tuning compatibility mode                True
realmem       136314880       Amount of usable physical memory in Kbytes       False
rtasversion    1                Open Firmware RTAS version                      False
sed_config    select          Stack Execution Disable (SED) Mode             True
systemid       IBM,0206D2116    Hardware system identifier                      False
variable_weight 0                Variable processor capacity weight             False
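The attribute of interest in this output is `maxuproc`, the per-user process limit, since every process owned by the `oracle` OS user (including listener children) counts against it, not just the sessions governed by the database `processes` parameter. As a rough illustration only, the sketch below parses `lsattr -El sys0` output and computes the headroom left under `maxuproc`. The helper names are hypothetical, and the process count passed in would in practice come from something like `ps -u oracle`; here the `processes` value of 2500 from the first post is used as a stand-in.

```python
def parse_lsattr(lines):
    """Map attribute name -> value, using the first two columns of each line."""
    attrs = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            attrs[fields[0]] = fields[1]
    return attrs


def per_user_headroom(attrs, current_user_processes):
    """Processes the OS user may still create before hitting maxuproc."""
    return int(attrs["maxuproc"]) - current_user_processes


# Lines copied from the lsattr output above.
SAMPLE_LSATTR = [
    "maxuproc       8192             Maximum number of PROCESSES allowed per user    True",
    "realmem       136314880       Amount of usable physical memory in Kbytes       False",
    "ent_capacity 24.00             Entitled processor capacity                      False",
]

attrs = parse_lsattr(SAMPLE_LSATTR)
print("maxuproc:", attrs["maxuproc"])
print("headroom at 2500 processes:", per_user_headroom(attrs, 2500))
```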


Maclean Liu (刘相兵)

9^#

Posted on 2012-7-17 17:06:35

action plan:

Run the following OS command and post the output:

lsattr -El sys0


ricky

8^#

Posted on 2012-7-17 16:59:37

The snapshots have a gap:
                           3462 16 Jul 2012 14:00    1
                           3463 16 Jul 2012 15:00    1
                           3464 16 Jul 2012 16:00    1
                           3465 16 Jul 2012 17:00    1
                           3466 16 Jul 2012 18:00    1

                           3470 16 Jul 2012 22:00    1
                           3471 16 Jul 2012 23:00    1
                           3472 17 Jul 2012 00:00    1
                           3473 17 Jul 2012 01:00    1

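There was a restart in between, so the snapshots probably cannot be used. Thanks for your analysis; I have a rough idea now. I talked to the host engineer, who described the situation at the time: host resources were not exhausted and were sufficient, but the process count had reached the maximum number of connections, so nobody could connect to the database remotely and someone had to go on site and reboot the host manually. We both suspect that a large number of zombie tnslsnr processes pushed the database to its maximum processes, which caused ksvcreate: Process(m000) creation failed.
What I still do not understand is why so many zombie processes were created. Liu, do you have any ideas?
I found this note on Metalink, which looks similar to me. Could you check whether it applies? Thanks.
  Intermittent TNS Listener Hang, New Child Listener Process Forked [ID 340091.1]
https://support.oracle.com/CSP/m ... ERT&id=340091.1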
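The missing interval in the snapshot list in this post can also be located mechanically. The sketch below is a hedged illustration in Python: it assumes each line has the form `<snap id> <DD Mon YYYY HH:MM> <level>` as pasted above, and the function name `find_snapshot_gaps` and the one-hour default interval are assumptions made for the example.

```python
from datetime import datetime, timedelta


def find_snapshot_gaps(lines, interval=timedelta(hours=1)):
    """Return pairs of consecutive snapshots that are further apart than `interval`."""
    snaps = []
    for line in lines:
        fields = line.split()
        # Skip blank lines and anything that does not start with a snap id.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        when = datetime.strptime(" ".join(fields[1:5]), "%d %b %Y %H:%M")
        snaps.append((int(fields[0]), when))
    snaps.sort()
    return [
        (prev, cur)
        for prev, cur in zip(snaps, snaps[1:])
        if cur[1] - prev[1] > interval
    ]


# Snapshot lines copied from the post.
SAMPLE_SNAPS = [
    "3462 16 Jul 2012 14:00    1",
    "3463 16 Jul 2012 15:00    1",
    "3464 16 Jul 2012 16:00    1",
    "3465 16 Jul 2012 17:00    1",
    "3466 16 Jul 2012 18:00    1",
    "",
    "3470 16 Jul 2012 22:00    1",
    "3471 16 Jul 2012 23:00    1",
    "3472 17 Jul 2012 00:00    1",
    "3473 17 Jul 2012 01:00    1",
]

for (prev_id, prev_time), (cur_id, cur_time) in find_snapshot_gaps(SAMPLE_SNAPS):
    print(f"gap between snap {prev_id} ({prev_time}) and snap {cur_id} ({cur_time})")
```

For the list in this post the only gap is between snapshots 3466 and 3470, which brackets the hang and the reboot.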


Maclean Liu (刘相兵)

7^#

Posted on 2012-7-17 16:24:25

mmon trace:

*** 2012-07-16 18:26:09.378
Waited for process m000 to initialize for 60 seconds
*** 2012-07-16 18:26:09.386
Dumping diagnostic information for m000:
OS pid = 32440464
loadavg : 1.81 2.11 3.12
swap info: free_mem = 3836.72M rsv = 192.00M
         alloc = 323.43M avail = 49152.00M swap_free = 48828.57M
   F S    UID    PID    PPID C PRI NI ADDR SZ WCHAN STIME TTY  TIME CMD
1040005 A oracle 32440464       1 0  60 20 14651c6590 89192       18:25:07    -  0:00 [oracle]
skgpgpstack: read() for cmd /bin/sh -c '/usr/bin/procstack 32440464 2>&1' timed out after 60 seconds
*** 2012-07-16 18:28:19.990
*** 2012-07-16 18:28:30.420
Waited for process m000 to initialize for 70 seconds
*** 2012-07-16 18:28:30.420
Dumping diagnostic information for m000:
OS pid = 32440464
loadavg : 2.09 1.97 2.91
swap info: free_mem = 3788.56M rsv = 192.00M
         alloc = 323.38M avail = 49152.00M swap_free = 48828.62M
   F S    UID    PID    PPID C PRI NI ADDR SZ WCHAN STIME TTY  TIME CMD
1040005 A oracle 32440464       1 0  60 20 14651c6590 89192       18:25:07    -  0:00 [oracle]
skgpgpstack: read() for cmd /bin/sh -c '/usr/bin/procstack 32440464 2>&1' timed out after 60 seconds
*** 2012-07-16 18:30:41.092
*** 2012-07-16 18:30:51.453
Waited for process m000 to initialize for 81 seconds
*** 2012-07-16 18:30:51.453
Dumping diagnostic information for m000:
OS pid = 32440464
loadavg : 1.74 1.82 2.71
swap info: free_mem = 3841.77M rsv = 192.00M
         alloc = 323.31M avail = 49152.00M swap_free = 48828.69M
   F S    UID    PID    PPID C PRI NI ADDR SZ WCHAN STIME TTY  TIME CMD
1040005 A oracle 32440464       1 0  60 20 14651c6590 89192       18:25:07    -  0:00 [oracle]
skgpgpstack: read() for cmd /bin/sh -c '/usr/bin/procstack 32440464 2>&1' timed out after 60 seconds
*** 2012-07-16 18:33:02.164
*** 2012-07-16 18:33:13.394
Waited for process m000 to initialize for 92 seconds

loadavg : 1.74 1.82 2.71 at 2012-07-16 18:33:02.164: the load was slightly elevated at this point.
Diagnosing the process with procstack timed out after 60 s, so no useful call stack was obtained.
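To line up the m000 start-up waits with other logs, the "Waited for process ..." messages can be pulled out of the MMON trace together with the timestamp that precedes each one. This is a small Python sketch under the assumption that timestamps appear on their own `*** YYYY-MM-DD HH:MM:SS.mmm` lines, as in the excerpt above; `startup_waits` is a hypothetical helper name.

```python
import re
from datetime import datetime

STAMP_RE = re.compile(r"^\*\*\* (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})$")
WAIT_RE = re.compile(r"^Waited for process (\w+) to initialize for (\d+) seconds$")


def startup_waits(trace_lines):
    """Return (timestamp, process name, seconds waited) for each wait message."""
    waits = []
    last_stamp = None
    for raw in trace_lines:
        line = raw.strip()
        stamp = STAMP_RE.match(line)
        if stamp:
            # Remember the most recent timestamp line seen so far.
            last_stamp = datetime.strptime(stamp.group(1), "%Y-%m-%d %H:%M:%S.%f")
            continue
        wait = WAIT_RE.match(line)
        if wait and last_stamp is not None:
            waits.append((last_stamp, wait.group(1), int(wait.group(2))))
    return waits


# Lines copied from the MMON trace above.
SAMPLE_TRACE = [
    "*** 2012-07-16 18:26:09.378",
    "Waited for process m000 to initialize for 60 seconds",
    "*** 2012-07-16 18:26:09.386",
    "Dumping diagnostic information for m000:",
    "*** 2012-07-16 18:28:19.990",
    "*** 2012-07-16 18:28:30.420",
    "Waited for process m000 to initialize for 70 seconds",
]

for when, name, seconds in startup_waits(SAMPLE_TRACE):
    print(when, name, seconds)
```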

The LMON trace shows that the reason for the instance shutdown was kjxgmrcfg: Reconfiguration started, reason 1

*** SERVICE NAME:() 2012-07-16 21:18:22.450
*** SESSION ID:(2752.1) 2012-07-16 21:18:22.450
GES resources 67766 pool 48
GES enqueues 100204
GES IPC: Receivers 13  Senders 13
GES IPC: Buffers  Receive 1000  Send (i:5750 b:1934) Reserve 1000
GES IPC: Msg Size  Regular 408  Batch 8192
Batching factor: enqueue replay 201, ack 224
Batching factor: cache replay 126 size per lock 64
kjxggin: receive buffer size = 32768
high load threshold = 172032
*** 2012-07-16 21:18:27.267
kjxgmrcfg: Reconfiguration started, reason 1
kjxgmcs: Setting state to 0 0.
*** 2012-07-16 21:18:27.308
   Name Service frozen

ODM FINDING:

Here, you can see the reason for the reconfiguration event. The most common reasons would be 1, 2, or 3. Reason 1 means that the NM initiated the reconfiguration event, as typically seen when a node joins or leaves a cluster.

Reason 1 = The Node Monitor generated the reconfiguration.
Reason 2 = An instance death was detected.
Reason 3 = Communications Failure
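When several LMON traces have to be checked, the reason code can be extracted and decoded with a few lines of Python. This is only an illustrative sketch: the lookup table contains just the three reasons quoted above, and the function name is made up for the example.

```python
import re

# Only the three reasons listed in this thread; other codes are not covered here.
RECONFIG_REASONS = {
    1: "The Node Monitor generated the reconfiguration.",
    2: "An instance death was detected.",
    3: "Communications Failure",
}

REASON_RE = re.compile(r"kjxgmrcfg: Reconfiguration started, reason (\d+)")


def reconfiguration_reasons(trace_lines):
    """Return (code, description) for every reconfiguration line in the trace."""
    found = []
    for line in trace_lines:
        match = REASON_RE.search(line)
        if match:
            code = int(match.group(1))
            found.append((code, RECONFIG_REASONS.get(code, "not listed in this thread")))
    return found


# Lines copied from the LMON trace above.
SAMPLE_LMON = [
    "*** 2012-07-16 21:18:27.267",
    "kjxgmrcfg: Reconfiguration started, reason 1",
    "kjxgmcs: Setting state to 0 0.",
]

print(reconfiguration_reasons(SAMPLE_LMON))
```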

FROM MOS :
Troubleshooting ORA-29740 in a RAC Environment [ID 219361.1]

Reason 1: The Node Monitor generated the reconfiguration.  This can happen if:

a) An instance joins the cluster
b) An instance leaves the cluster
c) A node is halted

It should be easy to determine the cause of the error by reviewing the alert
logs and LMON trace files from all instances.  If an instance joins or leaves
the cluster or a node is halted then the ORA-29740 error is not a problem.

ORA-29740 evictions with reason 1 are usually expected when the cluster
membership changes.  Very rarely are these types of evictions a real problem.

If you feel that this eviction was not correct, do a search in Metalink or
the bug database for:

ORA-29740 'reason 1'

Important files to review are:

a) Each instance's alert log
b) Each instance's LMON trace file
c) Statspack reports from all nodes leading up to the eviction
d) Each node's syslog or messages file
e) iostat output before, after, and during evictions
f) vmstat output before, after, and during evictions
g) netstat output before, after, and during evictions

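Host and Oracle performance snapshots from the time of the incident are needed for further diagnosis. My feeling leans towards OS resource exhaustion as the cause of both the zombie processes and ksvcreate: Process(m000) creation failed.

Recommendations:

1. Deploy OSWatcher if possible.

2. Tune the network parameters following the AIX + RAC best practices.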


ricky

6^#

Posted on 2012-7-17 16:09:18

Addendum: I found a note on Metalink. Is it of any use?

Intermittent TNS Listener Hang, New Child Listener Process Forked [ID 340091.1]


ricky

5^#

Posted on 2012-7-17 16:08:32

There is no smon trace from the time of the incident:
oracle@RESDB1[/oracle/app/oracle/admin/nrms/bdump]$ls -l *smon*
-rw-r----- 1 oracle oinstall    3444 Jan 13 2012  nrms1_smon_10879352.trc
-rw-r----- 1 oracle oinstall    38198 Mar 07 22:04 nrms1_smon_36110516.trc
-rw-r----- 1 oracle oinstall       934 Mar 15 23:56 nrms1_smon_3867316.trc
-rw-r----- 1 oracle oinstall    1499 Jun 12 10:00 nrms1_smon_4456626.trc
-rw-r----- 1 oracle oinstall       734 Mar 16 02:42 nrms1_smon_4850270.trc
-rw-r----- 1 oracle oinstall    6670 May 09 11:33 nrms1_smon_5243084.trc
-rw-r----- 1 oracle oinstall       968 May 11 02:12 nrms1_smon_61145626.trc

oracle@RESDB1[/oracle/app/oracle/admin/nrms/bdump]$ls -l *mmon*
-rw-r----- 1 oracle oinstall    75626 Jan 12 2012  nrms1_mmon_10092900.trc
-rw-r----- 1 oracle oinstall       990 Mar 15 23:56 nrms1_mmon_27983952.trc
-rw-r----- 1 oracle oinstall    881285 Mar 07 21:15 nrms1_mmon_33620210.trc
-rw-r----- 1 oracle oinstall       861 May 09 15:00 nrms1_mmon_3670692.trc
-rw-r----- 1 oracle oinstall    13032 Jul 16 20:15 nrms1_mmon_3866810.trc   ---- one file from the incident

nrms1_mmon_3866810.txt (12.98 KB, downloads: 1226)
oracle@RESDB1[/oracle/app/oracle/admin/nrms/bdump]$ls -l *mmnl*   -- none from the incident
-rw-r----- 1 oracle oinstall 2574 Mar 07 21:15 nrms1_mmnl_37945678.trc
oracle@RESDB1[/oracle/app/oracle/admin/nrms/bdump]$ls -l *lmd*
-rw-r----- 1 oracle oinstall 3710562 Jul 16 13:30 nrms1_lmd0_4653806.trc   --- one file

nrms1_lmd0_4653806.txt (3.63 MB, downloads: 934)
-rw-r----- 1 oracle oinstall 2110 Jul 16 21:37 nrms1_lmd0_6095160.trc   ----- one file

nrms1_lmd0_6095160.txt (2.11 KB, downloads: 919)

*lmon*
4246 Jul 16 21:18 nrms1_lmon_3276970.trc   --- one file

nrms1_lmon_3276970.txt (4.27 KB, downloads: 857)

oracle@RESDB1[/oracle/app/oracle/admin/nrms/bdump]$ls -l *dbw1*
-rw-r----- 1 oracle oinstall 13903 Apr 08 10:12 nrms1_dbw1_3080778.trc
-rw-r----- 1 oracle oinstall 700 Jul 16 20:19 nrms1_dbw1_5963798.trc   --- one file

nrms1_dbw1_5963798.txt (719 Bytes, downloads: 1133)


Maclean Liu (刘相兵)

4^#

Posted on 2012-7-17 15:36:33

ODM FINDING:

Timeline of the problem:

16-JUL-2012 18:22:08 * (CONNECT_DATA=(SID=nrms1)(CID=(PROGRAM=)(HOST=__jdbc__)(USER=weblogic))) * (ADDRESS=(PROTOCOL=tcp)(HOST=135.32.21.24)(PORT=36700)) * establish *
nrms1 * 12518
TNS-12518: TNS:listener could not hand off client connection
TNS-12540: TNS:internal limit restriction exceeded

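From 18:22 on the 16th, large numbers of TNS-12540 and TNS-12518 errors start to appear.

Cross-checking against your handling log shows that the many zombie tnslsnr processes were started shortly before that, mostly at 18:16-18:17 (the earliest entry in the listing is 18:13:59):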

oracle@RESDB1[/oracle/app/oracle/admin/nrms/bdump]$ps -ef | grep tnslsnr
  oracle  3276882       1 0 18:17:46    -  0:00 [tnslsnr]
  oracle  3342414       1 0 18:17:43    -  0:00 [tnslsnr]
  oracle  3407978       1 0 18:17:35    -  0:00 [tnslsnr]
  oracle  3473510       1 0 18:17:44    -  0:00 [tnslsnr]
  oracle  3604686       1 0 18:16:56    -  0:00 [tnslsnr]
  oracle  3735706       1 0 18:17:18    -  0:00 [tnslsnr]
  oracle  3932340       1 0 18:16:51    -  0:00 [tnslsnr]
  oracle  4587610       1 0 18:16:21    -  0:00 [tnslsnr]
  oracle  4915280       1 0 18:17:21    -  0:00 [tnslsnr]
  oracle  4980836       1 0 18:16:18    -  0:00 [tnslsnr]
  oracle  5177568       1 0 18:17:20    -  0:00 [tnslsnr]
  oracle  5308544       1 0 18:16:49    -  0:00 [tnslsnr]
  oracle  5439590       1 0 18:16:27    -  0:00 [tnslsnr]
  oracle  5505144       1 0 18:14:47    -  0:00 [tnslsnr]
  oracle  5701638       1 0 18:17:21    -  0:00 [tnslsnr]
  oracle  5832942       1 0 18:16:17    -  0:00 [tnslsnr]
  oracle  6029432       1 0 18:17:35    -  0:00 [tnslsnr]
  oracle  6094980       1 0 18:17:23    -  0:00 [tnslsnr]
  oracle  6160430       1 0 18:16:11    -  0:00 [tnslsnr]
  oracle  6226116       1 0 18:16:27    -  0:00 [tnslsnr]
  oracle  6422670       1 0 18:14:59    -  0:00 [tnslsnr]
  oracle  6488136       1 0 18:17:57    -  0:00 [tnslsnr]
  oracle  6553812       1 0 18:15:10    -  0:00 [tnslsnr]
  oracle  6619284       1 0 18:13:59    -  0:00 [tnslsnr]
  oracle  6684774       1 0 18:17:17    -  0:00 [tnslsnr]
  oracle  6750344       1 0 18:16:01    -  0:00 [tnslsnr]
  oracle  6815990       1 0 18:17:07    -  0:00 [tnslsnr]
  oracle  6881364       1 0 18:17:15    -  0:00 [tnslsnr]
  oracle  6946840       1 0 18:14:39    -  0:00 [tnslsnr]
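The clustering of start times in this listing can be made explicit by counting entries per minute. The sketch below is a minimal Python illustration, assuming `ps -ef` output with the eight columns UID, PID, PPID, C, STIME, TTY, TIME and CMD, and processes started on the same day (so that STIME is a single `HH:MM:SS` token). The function name is hypothetical.

```python
from collections import Counter


def starts_per_minute(ps_lines, command="[tnslsnr]"):
    """Count processes whose CMD column equals `command`, grouped by HH:MM of STIME."""
    counts = Counter()
    for line in ps_lines:
        fields = line.split()
        # Expected columns: UID PID PPID C STIME TTY TIME CMD
        if len(fields) == 8 and fields[7] == command:
            counts[fields[4][:5]] += 1
    return counts


# A few lines copied from the listing above.
SAMPLE_PS = [
    "  oracle  3276882       1 0 18:17:46    -  0:00 [tnslsnr]",
    "  oracle  3342414       1 0 18:17:43    -  0:00 [tnslsnr]",
    "  oracle  3604686       1 0 18:16:56    -  0:00 [tnslsnr]",
    "  oracle  5505144       1 0 18:14:47    -  0:00 [tnslsnr]",
    "  oracle  6619284       1 0 18:13:59    -  0:00 [tnslsnr]",
]

for minute, n in sorted(starts_per_minute(SAMPLE_PS).items()):
    print(minute, n)
```

Fed the full `ps -ef` output, the per-minute counts show when the burst of listener children began relative to the first TNS-12518 at 18:22.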

Mon Jul 16 18:40:14 2012
ksvcreate: Process(m000) creation failed   --- the anomaly starts here

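The anomaly starts at 18:40: the m000 child process cannot be started, but the diag process produced no trace for this problem, only the final systemstate dump.

Please upload the TRACE file of the mmon process.

A possible reason m000 could not start is memory exhaustion caused by too many zombie tnslsnr processes.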

Mon Jul 16 20:15:24 2012
ksvcreate: Process(m000) creation failed
Mon Jul 16 20:19:12 2012
DBW1: terminating instance due to error 472
Mon Jul 16 20:19:12 2012
System state dump is made for local instance
System State dumped to trace file /oracle/app/oracle/admin/nrms/bdump/nrms1_diag_5046632.trc   --- attached

Then, at 20:19, DBW1 terminated the instance.
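The interval between the first m000 failure and the instance termination can be computed directly from the alert log excerpts quoted in this post. This is a hedged Python sketch that assumes the 10g alert log layout shown here, where a `Dow Mon DD HH:MM:SS YYYY` timestamp line is followed by one or more message lines; both helper names are illustrative.

```python
from datetime import datetime


def parse_alert_events(lines):
    """Pair every message line with the most recent timestamp line before it."""
    events = []
    current = None
    for raw in lines:
        line = raw.strip()
        try:
            current = datetime.strptime(line, "%a %b %d %H:%M:%S %Y")
            continue
        except ValueError:
            pass  # not a timestamp line, so treat it as a message
        if line and current is not None:
            events.append((current, line))
    return events


def elapsed_between(events, first_text, last_text):
    """Time from the first message containing first_text to the first containing last_text."""
    start = next(t for t, msg in events if first_text in msg)
    end = next(t for t, msg in events if last_text in msg)
    return end - start


# Lines copied from the alert log excerpts in this post (annotations removed).
SAMPLE_ALERT = [
    "Mon Jul 16 18:40:14 2012",
    "ksvcreate: Process(m000) creation failed",
    "Mon Jul 16 20:15:24 2012",
    "ksvcreate: Process(m000) creation failed",
    "Mon Jul 16 20:19:12 2012",
    "DBW1: terminating instance due to error 472",
]

events = parse_alert_events(SAMPLE_ALERT)
print(elapsed_between(events, "ksvcreate", "terminating instance"))
```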

action plan:

Upload the TRACE files of the smon, mmon, mmnl, lmd, lmon and dbw1 processes from the time of the incident.


ricky

3^#

Posted on 2012-7-17 15:13:08

The pmon trace files do not seem to be useful, because none of them was written at the time of the incident. See:
oracle@RESDB1[/oracle/app/oracle/admin/nrms/bdump]$ls -l *pmon*
-rw-r----- 1 oracle oinstall    850317 May 11 01:33 nrms1_pmon_2490858.trc
-rw-r----- 1 oracle oinstall    1070 Mar 16 02:38 nrms1_pmon_3670294.trc
-rw-r----- 1 oracle oinstall    8095 Mar 15 21:22 nrms1_pmon_37224998.trc
-rw-r----- 1 oracle oinstall 3965284 Mar 07 22:01 nrms1_pmon_38076824.trc
-rw-r----- 1 oracle oinstall    15924 Jul 11 21:13 nrms1_pmon_5767290.trc
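Checking which trace files were touched on the day of the incident can be scripted instead of read off by eye. A minimal Python sketch, assuming the nine-column `ls -l` layout shown above, in which recent files show a time (`HH:MM`) in the eighth column and older files show a year; `traces_modified_on` is a hypothetical helper.

```python
def traces_modified_on(ls_lines, month, day):
    """Return file names whose ls -l date matches month/day and shows a time, not a year."""
    names = []
    for line in ls_lines:
        fields = line.split()
        # Expected columns: mode links owner group size month day time-or-year name
        if (
            len(fields) == 9
            and fields[5] == month
            and fields[6] == day
            and ":" in fields[7]
        ):
            names.append(fields[8])
    return names


# Lines copied from the pmon listing above.
SAMPLE_LS = [
    "-rw-r----- 1 oracle oinstall    850317 May 11 01:33 nrms1_pmon_2490858.trc",
    "-rw-r----- 1 oracle oinstall    1070 Mar 16 02:38 nrms1_pmon_3670294.trc",
    "-rw-r----- 1 oracle oinstall    8095 Mar 15 21:22 nrms1_pmon_37224998.trc",
    "-rw-r----- 1 oracle oinstall 3965284 Mar 07 22:01 nrms1_pmon_38076824.trc",
    "-rw-r----- 1 oracle oinstall    15924 Jul 11 21:13 nrms1_pmon_5767290.trc",
]

print(traces_modified_on(SAMPLE_LS, "Jul", "16"))
```

For the pmon listing the result for Jul 16 is empty, which matches the observation that no pmon trace was written during the incident.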


ricky

2^#

Posted on 2012-7-17 15:01:47

Addendum:

The Metalink search result is:
'Ksvcreate: Process(m000) creation failed' after Standby Database Open Read Only Multiple Times [ID 418553.1]
But mine is not a Standby Database, so I do not think this note applies.
