Posted on 2012-12-12 15:03:57
Here is the operating system log from the node that rebooted:
Dec 12 11:00:48 GIO138 MR_MONITOR[10507]: <MRMON044> Controller ID: 0 Time established since power on: Time 2012-12-12,11:00:48 72738 Seconds
Dec 12 12:00:48 GIO138 MR_MONITOR[10507]: <MRMON044> Controller ID: 0 Time established since power on: Time 2012-12-12,12:00:48 76338 Seconds
Dec 12 13:00:48 GIO138 MR_MONITOR[10507]: <MRMON044> Controller ID: 0 Time established since power on: Time 2012-12-12,13:00:48 79937 Seconds
Dec 12 14:00:48 GIO138 MR_MONITOR[10507]: <MRMON044> Controller ID: 0 Time established since power on: Time 2012-12-12,14:00:48 83537 Seconds
Dec 12 14:28:10 GIO138 syslogd 1.4.1: restart.
Dec 12 14:28:10 GIO138 syslog: syslogd startup succeeded
Dec 12 14:28:10 GIO138 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Dec 12 14:28:10 GIO138 kernel: Bootdata ok (command line is ro root=LABEL=/ rhgb quiet)
Dec 12 14:28:10 GIO138 kernel: Linux version 2.6.9-78.ELlargesmp (brewbuilder@ls20-bc2-14.build.redhat.com) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-10)) #1 SMP Wed Jul 9 16:03:59 EDT 2008
Dec 12 14:28:10 GIO138 kernel: BIOS-provided physical RAM map:
Dec 12 14:28:10 GIO138 kernel: BIOS-e820: 0000000000000000 - 000000000009c000 (usable)
Dec 12 14:28:10 GIO138 kernel: BIOS-e820: 000000000009c000 - 00000000000a0000 (reserved)
Dec 12 14:28:10 GIO138 kernel: BIOS-e820: 00000000000ca000 - 0000000000100000 (reserved)
Dec 12 14:28:10 GIO138 kernel: BIOS-e820: 0000000000100000 - 000000007ffe0000 (usable)
Dec 12 14:28:10 GIO138 kernel: BIOS-e820: 000000007ffe0000 - 000000007ffea000 (ACPI data)
Dec 12 14:28:10 GIO138 kernel: BIOS-e820: 000000007ffea000 - 0000000080000000 (ACPI NVS)
Dec 12 14:28:10 GIO138 kernel: BIOS-e820: 0000000080000000 - 00000000cfc00000 (usable)
Here is cssd.log from the other node (the cluster master node):
[ CSSD]2012-12-12 14:25:53.608 [1241577824] >WARNING: clssnmPollingThread: node gio138 (2) at 50% heartbeat fatal, eviction in 29.770 seconds
[ CSSD]2012-12-12 14:25:53.608 [1241577824] >TRACE: clssnmPollingThread: node gio138 (2) is impending reconfig, flag 1037, misstime 30230
[ CSSD]2012-12-12 14:25:53.608 [1241577824] >TRACE: clssnmPollingThread: diskTimeout set to (57000)ms impending reconfig status(1)
[ CSSD]2012-12-12 14:25:54.600 [1241577824] >WARNING: clssnmPollingThread: node gio138 (2) at 50% heartbeat fatal, eviction in 28.780 seconds
[ CSSD]2012-12-12 14:26:08.606 [1241577824] >WARNING: clssnmPollingThread: node gio138 (2) at 75% heartbeat fatal, eviction in 14.780 seconds
[ CSSD]2012-12-12 14:26:09.608 [1241577824] >WARNING: clssnmPollingThread: node gio138 (2) at 75% heartbeat fatal, eviction in 13.770 seconds
[ CSSD]2012-12-12 14:26:17.603 [1241577824] >WARNING: clssnmPollingThread: node gio138 (2) at 90% heartbeat fatal, eviction in 5.780 seconds
[ CSSD]2012-12-12 14:26:18.604 [1241577824] >WARNING: clssnmPollingThread: node gio138 (2) at 90% heartbeat fatal, eviction in 4.780 seconds
[ CSSD]2012-12-12 14:26:19.606 [1241577824] >WARNING: clssnmPollingThread: node gio138 (2) at 90% heartbeat fatal, eviction in 3.770 seconds
[ CSSD]2012-12-12 14:26:20.598 [1241577824] >WARNING: clssnmPollingThread: node gio138 (2) at 90% heartbeat fatal, eviction in 2.780 seconds
[ CSSD]2012-12-12 14:26:21.600 [1241577824] >WARNING: clssnmPollingThread: node gio138 (2) at 90% heartbeat fatal, eviction in 1.780 seconds
[ CSSD]2012-12-12 14:26:22.602 [1241577824] >WARNING: clssnmPollingThread: node gio138 (2) at 90% heartbeat fatal, eviction in 0.780 seconds
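Because the master node's clock runs about two minutes behind the other node's, it is hard to tell from the logs whether a system fault caused the heartbeat errors or a heartbeat failure caused the reboot.
What the logs do show is that the system rebooted abnormally: nothing unusual was logged before the reboot.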
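One way to work around the clock difference is to shift the master node's timestamps onto gio138's clock before comparing them with the syslog. The sketch below does that arithmetic for the values in the two logs above. It rests on two assumptions that the logs do not confirm: that the offset is exactly 120 seconds (the post only says "about two minutes"), and that `misstime 30230` means milliseconds since the last heartbeat was received. The names `SKEW` and `to_gio138_clock` are made up for this illustration.

```python
from datetime import datetime, timedelta

# Assumption: the master's clock is exactly 2 minutes behind gio138's.
SKEW = timedelta(minutes=2)


def to_gio138_clock(master_ts: str) -> datetime:
    """Convert a cssd.log timestamp (master clock) to gio138's clock."""
    return datetime.strptime(master_ts, "%Y-%m-%d %H:%M:%S.%f") + SKEW


# First WARNING line on the master, with misstime 30230 (assumed to be ms).
first_warning = to_gio138_clock("2012-12-12 14:25:53.608")
last_heartbeat = first_warning - timedelta(milliseconds=30230)

# Last WARNING line: "eviction in 0.780 seconds".
eviction_deadline = (
    to_gio138_clock("2012-12-12 14:26:22.602") + timedelta(seconds=0.780)
)

# First syslog line after the reboot on gio138 (already on gio138's clock).
syslog_restart = datetime(2012, 12, 12, 14, 28, 10)

print("last heartbeat seen  :", last_heartbeat)
print("syslogd restarted    :", syslog_restart)
print("eviction deadline    :", eviction_deadline)
print("heartbeat -> syslogd :", syslog_restart - last_heartbeat)
```

Under those assumptions the last heartbeat falls at about 14:27:23 on gio138's clock, roughly 47 seconds before syslogd restarts at 14:28:10, and the eviction deadline falls at about 14:28:23, after syslogd is already back. That ordering would hint that the node went down first and the heartbeat warnings followed, but the margin is only seconds, so the real clock offset needs to be measured before drawing any conclusion.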
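Questions:
1. If a heartbeat failure caused the node to reboot, would the system shut down cleanly or abnormally? In that case the node's system log should record the cluster stack shutting down, shouldn't it?
2. How can I tell whether an abnormal system reboot caused the heartbeat errors, or a heartbeat error caused the abnormal reboot?
3. If it is a heartbeat problem, how can I tell whether the network is to blame? The reboots happen frequently, and the network cables and the switch have already been replaced, so it seems unlikely that the new network equipment is faulty as well. In addition, sar shows that both nodes were lightly loaded before the reboot, with CPU idle above 85%.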