Install 11.2.0.2 RAC on OEL5.5 x86-64 (root.sh issue on second node)

When installing 11.2.0.2 RAC, the first step is installing Grid. Running root.sh on the second node failed with the following error:

Start of resource "ora.ctssd" failed
CRS-2672: Attempting to start 'ora.ctssd' on 'xsh-server2'
CRS-2674: Start of 'ora.ctssd' on 'xsh-server2' failed
CRS-4000: Command Start failed, or completed with errors.
Cluster Time Synchronisation Service  start in exclusive mode failed at /u01/app/11.2.0/grid/crs/install/crsconfig_lib.pm line 6455.
/u01/app/11.2.0/grid/perl/bin/perl -I/u01/app/11.2.0/grid/perl/lib -I/u01/app/11.2.0/grid/crs/install /u01/app/11.2.0/grid/crs/install/rootcrs.pl execution failed

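The error shows that the ctssd process failed to start. (Before this point the output shows cssd starting successfully, which differs from other second-node root.sh failures described on MOS, where the failure already occurs when cssd starts.) The ctssd startup log, located under $GRID_HOME/log/ctssd (on 11.2 the path normally includes the host name, i.e. $GRID_HOME/log/<hostname>/ctssd), contains the following errors.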

2010-11-12 18:55:46.132: [    GIPC][2424495392] gipcCheckInitialization: possible incompatible non-threaded init from [prom.c : 687], original from [clsss.c : 5325]
[ default][2424495392]Failure 4 in trying to open SV key SYSTEM.version.localhost

[ default][2424495392]procr_open_key error 4 errorbuf : PROCL-4: The local registry key to be operated on does not exist.

2010-11-12 18:55:46.135: [    CTSS][2424495392]clsctss_r_av2: Error [3] retrieving Active Version from OLR. Returns [19].
2010-11-12 18:55:46.138: [    CTSS][2424495392](:ctss_init16:): Error [19] retrieving active version. Returns [19].
2010-11-12 18:55:46.138: [    CTSS][2424495392]ctss_main: CTSS init failed [19]
2010-11-12 18:55:46.138: [    CTSS][2424495392]ctss_main: CTSS daemon aborting [19].
2010-11-12 18:55:46.138: [    CTSS][2424495392]CTSS daemon aborting

The crsctl output also shows that ora.cssd started successfully, while ora.ctssd is OFFLINE.

 $ crsctl stat res -t -init
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS       
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        OFFLINE OFFLINE                                                   
ora.cluster_interconnect.haip
      1        OFFLINE OFFLINE                                                   
ora.crf
      1        OFFLINE OFFLINE                                                   
ora.crsd
      1        OFFLINE OFFLINE                                                   
ora.cssd
      1        ONLINE  ONLINE       xsh-server2                                  
ora.cssdmonitor
      1        ONLINE  ONLINE       xsh-server2                                  
ora.ctssd
      1        ONLINE  OFFLINE                                                   
ora.diskmon
      1        ONLINE  ONLINE       xsh-server2                                  
ora.drivers.acfs
      1        OFFLINE OFFLINE                                                   
ora.evmd
      1        OFFLINE OFFLINE                                                   
ora.gipcd
      1        ONLINE  ONLINE       xsh-server2                                  
ora.gpnpd
      1        ONLINE  ONLINE       xsh-server2                                  
ora.mdnsd
      1        ONLINE  ONLINE       xsh-server2   
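
With this many resources it is easy to overlook the one line where TARGET and STATE disagree. The following is a minimal sketch, not part of the original troubleshooting, that filters the output for resources that should be ONLINE but are not. The function name find_failed_resources is made up for illustration, and it assumes the two-line-per-resource layout shown above.

```shell
# List init resources whose TARGET is ONLINE but whose STATE is not.
# Reads `crsctl stat res -t -init` output on stdin.
# (find_failed_resources is a hypothetical helper name.)
find_failed_resources() {
    awk '
        # A line starting with "ora." carries the resource name.
        /^ora\./ { name = $1; next }
        # The following line holds: instance number, TARGET, STATE, ...
        $1 ~ /^[0-9]+$/ && $2 == "ONLINE" && $3 != "ONLINE" { print name }
    '
}

# Usage on a cluster node:
#   crsctl stat res -t -init | find_failed_resources
```

Against the output above this would report only ora.ctssd, since the other OFFLINE resources also have an OFFLINE target.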

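Running the same command on the first node at this point shows every resource ONLINE as expected. Next I checked cssd.log (under $GRID_HOME/log/cssd), which shows errors during ASM disk discovery.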

2010-11-12 13:44:30.505: [   SKGFD][1087203648]UFS discovery with :ORCL:VOL*:

2010-11-12 13:44:30.505: [   SKGFD][1087203648]OSS discovery with :ORCL:VOL*:

2010-11-12 13:44:30.505: [   SKGFD][1087203648]Discovery with asmlib :ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so: str :ORCL:VOL*:

2010-11-12 13:44:30.505: [   SKGFD][1087203648]Fetching asmlib disk :ORCL:VOL1:

2010-11-12 13:44:30.505: [   SKGFD][1087203648]Fetching asmlib disk :ORCL:VOL2:

2010-11-12 13:44:30.505: [   SKGFD][1087203648]Fetching asmlib disk :ORCL:VOL3:

2010-11-12 13:44:30.505: [   SKGFD][1087203648]ERROR: -15(asmlib ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so op asm_open error Operation not permitted
)
2010-11-12 13:44:30.505: [   SKGFD][1087203648]ERROR: -15(asmlib ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so op asm_open error Operation not permitted
)
2010-11-12 13:44:30.505: [   SKGFD][1087203648]ERROR: -15(asmlib ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so op asm_open error Operation not permitted

Note that the same errors also appear on the first node, yet all resources there, including the ASM disk group, run normally.

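To deal with the cssd.log errors above, follow MOS Note [1050164.1]: edit /etc/sysconfig/oracleasm-_dev_oracleasm to specify which disks ASMLib should scan and which it should ignore during disk discovery. Our environment uses Multipath to provide multiple paths to several disks, so the scan must include devices whose names start with dm and exclude those whose names start with sd. This problem should only occur with disks managed by Multipath.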

# ORACLEASM_SCANORDER: Matching patterns to order disk scanning
ORACLEASM_SCANORDER="dm"

# ORACLEASM_SCANEXCLUDE: Matching patterns to exclude disks from scan
ORACLEASM_SCANEXCLUDE="sd"

You can confirm whether you have hit this problem as follows.

# ls -l /dev/oracleasm/disks
brw-rw---- 1 oracle dba 3, 65 May 14 12:08 CRSVOL
# cat /proc/partitions
  3 65 4974448 sda
253  1 4974448 dm-1

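Here CRSVOL, the ASM disk created with oracleasm, has major and minor numbers 3, 65. Those are the numbers of /dev/sda, not of /dev/dm-1, which means the Multipath device was not used when the ASM disk group was created. Typically the mapping is correct on node 1 but wrong on node 2, and that is what causes the problem.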
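
The manual comparison above can be automated. Below is a rough sketch, under the assumption that the `ls -l` and /proc/partitions layouts match the samples shown; check_asm_disks is a made-up helper name, not an oracleasm command. It prints every ASMLib disk whose major:minor pair does not resolve to a dm-* device.

```shell
# Print each ASMLib disk whose major:minor does not belong to a dm-* device.
# $1: file holding the output of `ls -l /dev/oracleasm/disks`
# $2: partitions file (normally /proc/partitions)
# (check_asm_disks is a hypothetical helper name.)
check_asm_disks() {
    awk '
        NR == FNR {
            # First file: remember the device name for each major:minor pair.
            if ($1 ~ /^[0-9]+$/) dev[$1 ":" $2] = $4
            next
        }
        /^b/ {
            # Block-device lines from ls -l: field 5 is "major,", field 6 is minor.
            major = $5; sub(/,/, "", major)
            minor = $6
            name = dev[major ":" minor]
            if (name !~ /^dm-/) print $NF " -> " (name == "" ? "unknown" : name)
        }
    ' "$2" "$1"
}

# Usage on a cluster node:
#   ls -l /dev/oracleasm/disks > /tmp/asm_disks.txt
#   check_asm_disks /tmp/asm_disks.txt /proc/partitions
```

Empty output means every disk maps to a Multipath device. Running it on both nodes makes the node 1 versus node 2 difference visible at a glance.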

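After fixing the problem above, you must deconfig and then reconfig the Grid environment; simply rerunning root.sh on the failed node is not enough (I lost a lot of time here). For the reconfiguration steps see MOS Note [942166.1]: How to Proceed from Failed 11gR2 Grid Infrastructure (CRS) Installation. After that, root.sh ran successfully on the second node.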
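
For orientation only, here is a sketch of what the deconfig step usually looks like on 11.2. The authoritative procedure is the one in the MOS note; this post does not record the exact commands that were run, so treat the flags below as my assumption of the common form. The helper only prints the command and never executes it, and deconfig_command is a made-up name.

```shell
# Print (do not run) the usual 11.2 deconfig command for a node.
# $1: Grid home, e.g. /u01/app/11.2.0/grid
# $2: "last" for the final node being deconfigured, anything else otherwise
# (deconfig_command is a hypothetical helper; verify the flags against
#  MOS Note 942166.1 before running anything as root.)
deconfig_command() {
    cmd="$1/crs/install/rootcrs.pl -verbose -deconfig -force"
    if [ "$2" = "last" ]; then
        # Assumption: the last node additionally takes -lastnode.
        cmd="$cmd -lastnode"
    fi
    echo "$cmd"
}

# Usage:
#   deconfig_command /u01/app/11.2.0/grid other
#   deconfig_command /u01/app/11.2.0/grid last
```

Once every node has been deconfigured, root.sh is rerun on the first node and then on the remaining nodes.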

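With the error resolved, looking back at the earlier installation output shows that, although the first node reported all resources as healthy, its root.sh output was missing a few lines compared with a normal run.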

The normal output looks like this:

# $GRID_HOME/root.sh
Running Oracle 11g root script...

The following environment variables are set as:
    ORACLE_OWNER= grid
    ORACLE_HOME=  /u01/app/11.2.0/grid

Enter the full pathname of the local bin directory: [/usr/local/bin]: 
The contents of "dbhome" have not changed. No need to overwrite.
The contents of "oraenv" have not changed. No need to overwrite.
The contents of "coraenv" have not changed. No need to overwrite.

Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root script.
Now product-specific root actions will be performed.
Using configuration parameter file: /u01/app/11.2.0/grid/crs/install/crsconfig_params
LOCAL ADD MODE 
Creating OCR keys for user 'root', privgrp 'root'..
Operation successful.
OLR initialization - successful
Adding daemon to inittab
ACFS-9200: Supported
ACFS-9300: ADVM/ACFS distribution files found.
ACFS-9307: Installing requested ADVM/ACFS software.
ACFS-9308: Loading installed ADVM/ACFS drivers.
ACFS-9321: Creating udev for ADVM/ACFS.
ACFS-9323: Creating module dependencies - this may take some time.
ACFS-9327: Verifying ADVM/ACFS devices.
ACFS-9309: ADVM/ACFS installation correctness verified.
CRS-2672: Attempting to start 'ora.mdnsd' on 'xsh-server1'
CRS-2676: Start of 'ora.mdnsd' on 'xsh-server1' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'xsh-server1'
CRS-2676: Start of 'ora.gpnpd' on 'xsh-server1' succeeded
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'xsh-server1'
CRS-2672: Attempting to start 'ora.gipcd' on 'xsh-server1'
CRS-2676: Start of 'ora.cssdmonitor' on 'xsh-server1' succeeded
CRS-2676: Start of 'ora.gipcd' on 'xsh-server1' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'xsh-server1'
CRS-2672: Attempting to start 'ora.diskmon' on 'xsh-server1'
CRS-2676: Start of 'ora.diskmon' on 'xsh-server1' succeeded
CRS-2676: Start of 'ora.cssd' on 'xsh-server1' succeeded

ASM created and started successfully.

Disk Group CRSDG created successfully.

clscfg: -install mode specified
Successfully accumulated necessary OCR keys.
Creating OCR keys for user 'root', privgrp 'root'..
Operation successful.
Successful addition of voting disk 67463e71af084f76bf98b3ee55081e40.
Successfully replaced voting disk group with +CRSDG.
CRS-4266: Voting file(s) successfully replaced
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   67463e71af084f76bf98b3ee55081e40 (ORCL:VOL1) [CRSDG]
Located 1 voting disk(s).

CRS-2672: Attempting to start 'ora.asm' on 'xsh-server1'
CRS-2676: Start of 'ora.asm' on 'xsh-server1' succeeded
CRS-2672: Attempting to start 'ora.CRSDG.dg' on 'xsh-server1'
CRS-2676: Start of 'ora.CRSDG.dg' on 'xsh-server1' succeeded
ACFS-9200: Supported
ACFS-9200: Supported
CRS-2672: Attempting to start 'ora.registry.acfs' on 'xsh-server1'
CRS-2676: Start of 'ora.registry.acfs' on 'xsh-server1' succeeded
Preparing packages for installation...
cvuqdisk-1.0.9-1
Configure Oracle Grid Infrastructure for a Cluster ... succeeded

The earlier output was missing the following 4 lines.

LOCAL ADD MODE 
Creating OCR keys for user 'root', privgrp 'root'..
Operation successful.
OLR initialization - successful
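
If you save the root.sh output to a file, this symptom can be checked before moving on to the second node. The sketch below is only an illustration; check_rootsh_output is a made-up name and the file path is whatever you captured the output to. It tests only for "LOCAL ADD MODE" and "OLR initialization - successful", because the other two lines also appear later in a normal run (after clscfg) and so would match even in the bad case.

```shell
# Check a saved root.sh output for the OLR initialization lines.
# $1: file containing the captured root.sh output
# Returns 0 if both marker lines are present, 1 otherwise.
# (check_rootsh_output is a hypothetical helper name.)
check_rootsh_output() {
    missing=0
    for line in \
        "LOCAL ADD MODE" \
        "OLR initialization - successful"
    do
        # -F: match the text literally rather than as a regular expression.
        if ! grep -qF -- "$line" "$1"; then
            echo "missing: $line"
            missing=1
        fi
    done
    return "$missing"
}

# Usage (the log path is an example):
#   $GRID_HOME/root.sh 2>&1 | tee /tmp/root_sh_node1.log
#   check_rootsh_output /tmp/root_sh_node1.log
```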

Oracle obviously won't admit that this is a bug. Oh well, as long as the problem is solved.
