Readme for Platform Open Cluster Stack (OCS) Rolls


Version 4.1.1-2.0

October 25, 2006

Platform Computing

Platform Roll

The Platform roll includes some useful tools that help make cluster management easier once you have set up your initial cluster.

Add-hosts tool

The add-hosts tool lets you quickly and easily add nodes to your cluster without using the insert-ethers command. By pre-populating the Platform OCS database from an XML configuration file, you can pre-load all the host information and simply start up your machines to begin the compute node installation process.

The add-hosts tool is bundled with the Platform OCS software.

How it works

The add-hosts tool allows you to define an XML configuration file that describes the information for each host you want to add to your cluster. For example, when you list your hosts in the configuration file, your host attributes (like IP address, MAC address, and host name) are inserted directly into the Platform OCS database. By populating the database using the add-hosts tool, you no longer need to start your machines in a pre-determined order.

You need to do two things before you use the add-hosts tool:

  1. Enter the MAC addresses of your nodes into a text file
  2. Enter your host information into the XML configuration file

The XML configuration file allows you to define information on a host or a subnet (group of hosts) level.

The host section of the configuration file lets you list your hosts and host attributes individually.

The subnet section lets you collect your nodes into a virtual grouping with the same appliance type so that you do not have to list individual hosts. When you specify a beginning IP address and the number of nodes in the subnet section, the add-hosts tool can extrapolate a group of nodes, name them, and deduce their IP addresses. If your IP addresses are not incremental or if you have gaps in your IP addresses, you can also specify any IP addresses to be excluded.
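The extrapolation described above can be pictured with a small shell sketch. This is not the add-hosts implementation: the function names are hypothetical, and the sketch assumes that the node count means the number of addresses produced, with excluded addresses skipped rather than counted.

```shell
# Convert a dotted IPv4 address to an integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1            # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Convert an integer back to a dotted IPv4 address.
int_to_ip() {
    echo "$(( ($1 >> 24) & 255 )).$(( ($1 >> 16) & 255 )).$(( ($1 >> 8) & 255 )).$(( $1 & 255 ))"
}

# expand_subnet START_IP COUNT [EXCLUDED_IP ...]
# Print COUNT consecutive addresses starting at START_IP,
# skipping any address listed as excluded (assumed semantics).
expand_subnet() {
    start=$1
    count=$2
    shift 2
    n=$(ip_to_int "$start")
    made=0
    while [ "$made" -lt "$count" ]; do
        ip=$(int_to_ip "$n")
        skip=0
        for ex in "$@"; do
            if [ "$ex" = "$ip" ]; then
                skip=1
            fi
        done
        if [ "$skip" -eq 0 ]; then
            echo "$ip"
            made=$((made + 1))
        fi
        n=$((n + 1))
    done
}
```

For example, expand_subnet 10.1.1.10 3 10.1.1.11 prints 10.1.1.10, 10.1.1.12, and 10.1.1.13, one per line.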

The add-hosts tool processes the <host> and <subnet> sections in the order that they occur in the XML configuration file. The MAC addresses in the MAC address file must be listed in the same order that the <host> and <subnet> sections occur. Also, these MAC addresses must correspond to the physical order of your hosts. The XML configuration file is processed in tandem with the MAC address file. Each host is paired up with the next occurring MAC address in the MAC address file.
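The in-order pairing can be previewed before you run the tool. The following sketch assumes a MAC address file with one address per line and a hypothetical plain-text file that lists the expected host names one per line, in the same order as the configuration file. It does not read the XML configuration file itself, and the function name is illustrative.

```shell
# pair_macs_with_hosts MAC_FILE HOST_FILE
# Print "MAC host" pairs by position, or fail if the two files
# do not contain the same number of lines.
pair_macs_with_hosts() {
    mac_file=$1
    host_file=$2
    macs=$(grep -c '' "$mac_file" || true)
    hosts=$(grep -c '' "$host_file" || true)
    if [ "$macs" -ne "$hosts" ]; then
        echo "mismatch: $macs MAC addresses for $hosts hosts" >&2
        return 1
    fi
    # paste joins line N of the first file with line N of the second
    paste -d ' ' "$mac_file" "$host_file"
}
```

A count mismatch is worth catching early, because every host after a missing or extra MAC address would otherwise be paired with the wrong machine.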

Note: The add-hosts tool only loads host information into the database. You may need to perform additional configuration to set up a network (for example, you may need to set up switches to route from one subnet to another).

Installing a new Platform OCS cluster
Adding a new host to your existing cluster
Replacing a host

Once you have run the add-hosts tool, you can easily replace one host with another in the same physical location with the same IP address:

  1. Replace the old MAC address with the new one in /opt/rocks/etc/mac.txt.
  2. Remove the old MAC address in the Platform OCS database. For example:
    # dbreport ethers | grep 00:c0:9f:45:02:16
    00:c0:9f:45:02:16 compute-0-3.local
    # insert-ethers --remove "compute-0-3"
    
  3. From the command line, run add-hosts, or run add-hosts --testmode to test the outcome before any configuration file changes are made.

    An error will occur indicating trouble adding the first host. This error is expected since you have already added this host to your database.

  4. When prompted, select all to skip all subsequent errors.
  5. Check ./add-hosts.log to see if you were successful. Each host added successfully has its own line and indicates "SUCCESS" at the end. If you failed to add a host, the line indicates "FAILED".

Upgrading your Platform OCS installation

If you are upgrading your Platform OCS installation, you can transfer your existing host configuration to the new installation.

  1. Before upgrading the front end, copy /opt/rocks/etc/add-hostsrc and /opt/rocks/etc/mac.txt onto a disk or shared directory on another file server.
  2. After you have upgraded your front end, copy the saved add-hostsrc and mac.txt files back to /opt/rocks/etc.
  3. Run the add-hosts tool.
  4. Check ./add-hosts.log to see if you were successful. Each host added successfully has its own line and indicates "SUCCESS" at the end. If you failed to add a host, the line indicates "FAILED".
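The log check in the last step can be scripted. The sketch below relies only on what is stated above, namely that each host has its own line ending in SUCCESS or FAILED. The function name is hypothetical, and the rest of the line layout is not assumed.

```shell
# report_failed_hosts [LOG_FILE]
# Print the log lines that end in FAILED and return non-zero if any exist.
# Defaults to ./add-hosts.log.
report_failed_hosts() {
    log=${1:-./add-hosts.log}
    if grep 'FAILED[[:space:]]*$' "$log"; then
        return 1
    fi
    return 0
}
```

Running report_failed_hosts after add-hosts prints nothing and returns zero when every host was added successfully.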

Add/remove roll tool

This command-line tool allows you to dynamically add or remove a roll on the front end. You can also use this tool to specify if you want a roll to install only on the front end and not on all compute nodes. Once you have added or removed a roll on the front end, you must reinstall the compute nodes.

Requirements
Limitations


Platform Lava and Platform LSF HPC cannot run on the same node. The Platform LSF HPC roll will disable Platform Lava.


Different versions of the Intel compilers roll cannot be used together. They will overwrite each other.

Rolls you can add and remove

The following is a list of rolls that you can add or remove after you have installed the front end.


WARNING: If you try to add or remove a roll that is not supported, you could seriously damage or impair your system. Check the list of rolls that can be safely added and removed before using this tool.

Rolls that can be added and removed:

Rolls that can be added but not removed:


Do not attempt to remove the following rolls. They will not uninstall correctly.

To see a list of installed rolls

In the command line, run the following command:

# rollops -l

To add a new roll

You can add a new roll by either using the DVD or having a copy of the ISO file for the roll you want.

  1. Make sure you have not already installed the roll. To see a list of installed rolls, run the following command:
    # rollops -l
    
  2. Add the roll.
    • If you have the DVD with the roll you want to add, insert the DVD and run the following command:
      # rollops -a
      
    • If you have a meta roll, type rollops -a roll_name. For example:
      # rollops -a lava
      
    • If you have an ISO file, type rollops -a -i iso_file_name. For example:
      # rollops -a -i lava-4.0.0-i386-disk1.iso
      

    If you do not specify a roll name and the DVD or ISO contains a meta roll, you will be prompted to select from a list of available rolls.

    If you are adding the CluMon roll, you are prompted for your root password.

  3. The tool gives you a message indicating your success or failure to add the new roll. For example:

    The lava roll was added successfully.

  4. If you have added any of the following rolls, you must restart your front end:
    • CluMon
    • Cisco® Topspin®
    • Ganglia
    • Lava
    • LSF HPC
    • Myrinet®
    • PVFS2
    • SGE6
  5. Reinstall your compute nodes.

To remove a roll

To remove a roll from the front end, type rollops -e <roll_name>. For example:

# rollops -e lava

If you are removing the CluMon roll, you are prompted for your root password.

The tool gives you a message indicating your success or failure to remove the roll. For example:

The lava roll has been successfully removed.


After you remove a roll from your front end node, you must reinstall your compute nodes for the removal to take effect.

To upgrade a roll

To upgrade a roll in the front end, type rollops -u <roll_name>. For example:

# rollops -u dell

If you have the media with the roll you want to upgrade, insert the media and run the following command without specifying the roll name:

# rollops -u

Note: If you have a meta roll, you will be prompted with a list of the rolls found in the meta roll, if available. Specify the name of the roll you want to upgrade, as listed in the meta roll.

If you have an ISO image, type rollops -u <optional_roll_name> -i <iso_file_name>. For example:

# rollops -u -i dell-4.1.1-0.x86_64.disk1.iso

Note: If you specified a meta roll ISO image and you do not specify a roll name, you will be prompted with a list of the rolls found in the meta roll, if available.

Only the Dell roll supports upgrading at this time.

The tool gives you a message indicating success or failure. For example:

The dell roll was added successfully.


If you upgraded the Dell roll, you must restart your front end node.


If you upgraded the Dell roll on your front end node, you must reinstall your compute nodes for them to have the latest version installed.

To prevent a roll from installing on compute nodes

You can choose to install a roll on the front end only and not on the compute nodes.

Type rollops -r <roll_name> -p no. For example:

# rollops -r lava -p no

When you install on your compute nodes, the roll for which you specified -p no is not installed.

Patch management tool

The patch management tool lets you update your cluster with new packages from your operating system's update network. With this tool, you no longer have to reinstall your whole cluster to get the latest patches and enhancements for your OS.

You may use the patch management tool regardless of whether you have a central install server or not.

How it works

The patch management tool checks your operating system's update network for updates and enhancements, directs the download of the updates, and tracks the version that you are working with. An appliance is patched as long as the packages in question exist on both the front end and the appliance.

Requirements
Limitations
Downloading an update

Run the following command:

# rocks-update -d <packagename>

For Platform OCS Enterprise Edition, if you have not registered with Red Hat Network, you will be prompted to register. Enter your Red Hat Network account information and follow the prompts. Do not de-select any of the packages listed for download. When the registration is complete, rocks-update will download the update required.

rocks-update downloads the update to your front end. If there are new packages available, the patch management tool returns the following message indicating the new repository version and that you can proceed to update your appliances.

rocks-update: repository for updates is now version 
4.1.1-2.0. You can now install/update your compute node or 
front end appliances.

The repository version database is also updated. Downloading an update does not install the packages on the front end.

Downloading or patching updates from the central server front end

Before downloading or patching updates from the central server front end, change the following settings:

  1. On the central install server, enable the Apache user access to the app_globals table for your new front end using the following commands:
    # mysql -u root -p cluster
    mysql> GRANT SELECT ON cluster.app_globals TO 'apache'@'<new_front_end_external_ip>';
    

    where <new_front_end_external_ip> is the external IP address of the new front end.

  2. On the new front end, edit the /etc/frontend.repo file with vi and replace the baseurl line with the following line:
    baseurl=http://<central_install_server_ip>/updates/
    

    where <central_install_server_ip> is the IP address of the central server.

Follow these steps to download updates from a central server front end:

  1. Run the following command:
    # rocks-update -g
    

    The tool displays the download details and the following message:

    rocks-update: repository for updates is now version 
    4.0.0.1. You can now install/update your compute node or 
    front end appliances.
    
  2. If you want to reinstall your compute nodes, you may do so now; they will have the new updates applied to them.

Follow these steps to patch updates from a central server front end:

  1. Run the following command:
    # rocks-update -f
    

    The tool displays the installed packages and the following message:

    rocks-update: 4 Update(s) installed successfully on front 
    end!
    
  2. If you want to reinstall your compute nodes, you may do so now; they will have the new updates applied to them.

Patching a compute node

As root, run rocks-update -c.

The tool checks for new updates and returns a message indicating whether there was a patch available or not. The default is to perform updates on 64 nodes at a time. You can change this default by specifying the number of nodes to update concurrently, up to a maximum of 250 nodes.

For example, rocks-update -c 128 patches 128 compute nodes concurrently.
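If you script compute node patching, it can help to validate the concurrency value first. The following sketch only builds and prints the command line; it does not run rocks-update. The function name is hypothetical, and the lower bound of 1 is an assumption, since only the default of 64 and the maximum of 250 are stated above.

```shell
# rocks_update_cmd [NODES]
# Print the rocks-update command for patching NODES compute nodes at a time.
# Defaults to 64; rejects values outside 1 to 250 (lower bound assumed).
rocks_update_cmd() {
    nodes=${1:-64}
    case $nodes in
        *[!0-9]*)
            echo "not a number: $nodes" >&2
            return 1
            ;;
    esac
    if [ "$nodes" -lt 1 ] || [ "$nodes" -gt 250 ]; then
        echo "node count must be between 1 and 250" >&2
        return 1
    fi
    echo "rocks-update -c $nodes"
}
```

For example, rocks_update_cmd 128 prints rocks-update -c 128, while rocks_update_cmd 300 prints an error and returns non-zero.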

Installing or reinstalling a compute node

Run insert-ethers --replace=<host_designation>

OR

Use the add-hosts tool to add a host. See Adding a new host to your existing cluster and Replacing a host.

Listing versions of installed updates

From your front end, run rocks-update -g.

Extend-compute CLI tools

Two new CLI tools have been added to ease the process of customizing your compute nodes. These tools are as follows:

Requirements
Using the rocks-compute tool

This tool customizes compute nodes by modifying the extend-compute.xml file. This XML file is located in /export/home/install/site-profiles/4.1.1/nodes. The tool allows you to add, list, or remove RPM packages and post-install scripts.


The rocks-compute tool does not handle package dependencies. If you install a new package on your compute nodes, you must also add every package it depends on using the rocks-compute tool.

You may occasionally need to update extend-compute.xml manually to add, remove, or update the contents. User-added changes are preserved as long as they are not affected by operations carried out by the rocks-compute tool.

Using the custom-partition tool

This tool customizes the partition sizes for compute nodes by modifying the extend-auto-partition.xml file. This file is located in /export/home/install/site-profiles/4.0.0/nodes.

You may occasionally need to update extend-auto-partition.xml manually to add, remove, or update the contents. User-added changes are preserved as long as they are not affected by operations carried out by the custom-partition tool.


Platform OCS LSF HPC

Platform OCS LSF HPC® is an optional roll for managing and accelerating mission-critical High Performance Computing (HPC) workloads.

With Platform LSF HPC, you can intelligently schedule parallel and serial workloads, which lets you solve large, challenging problems while using the available computing resources at maximum capacity.

For more information about Platform LSF HPC, see the Platform Web site: http://www.platform.com/products/HPC.

Product support

For Platform OCS hardware, operating system support, and CD distributions, see Readme for Platform OCS.


Platform LSF HPC will disable Platform Lava. Do not install Platform LSF HPC unless you intend to use it.

If you have already installed Platform LSF HPC, but want to use Platform Lava, you must use the rollops tool to remove Platform LSF HPC, then re-enable Platform Lava by running chkconfig --add lava on all affected nodes. On the front end node, you must also run chkconfig --add lavagui.

Configuring and managing Platform OCS LSF HPC

Once you have installed Platform OCS and the LSF HPC roll, get a Platform LSF HPC license and start license manager daemons. Then set up and start your Platform LSF HPC cluster.

Setting up a Platform LSF HPC license

  1. Decide which machine will be the license server.

    The following steps use frontend-0 as the license server.

  2. Get the FLEXlm host ID.

    Use the lmhostid command on the FLEXlm server host to get the hardware identifier of your FLEXlm license server host. For example:

    # lmhostid
       lmhostid - Copyright (c) 1989-2003 by Macrovision Corporation. All rights reserved.
       The FLEXlm host ID of this machine is "0006296d1f2c"
    

    In this example, send the code "0006296d1f2c" to Platform.

  3. Contact license@platform.com to get a permanent Platform LSF HPC license.

    Send the following information to Platform at license@platform.com:

    • Host name of the license server host
    • Host identifier of the license server host (lmhostid output)
    • Products required
    • Number of licenses required for your cluster
  4. When you receive your license file, save it as /opt/lsfhpc/conf/license.dat.

    The following is an example of a permanent license:

    SERVER frontend-0 0006296d1f2c 1700
    DAEMON lsf_ld /opt/lsfhpc/6.1/linux2.4-glibc2.3-x86/etc/lsf_ld
    FEATURE lsf_base lsf_ld 6.000 1-sep-0000 10 CCF7C3C92A5471A12345 "Platform"
    FEATURE lsf_manager lsf_ld 6.000 1-sep-0000 10 4CF7C37944B023A12345 "Platform"
    FEATURE lsf_sched_fairshare lsf_ld 6.000 1-sep-0000 10 8CA763A93AC825C12345 "Platform"
    FEATURE lsf_sched_parallel lsf_ld 6.000 1-sep-0000 10 3C77F30945F7FBC12345 "Platform"
    FEATURE lsf_sched_preemption lsf_ld 6.000 1-sep-0000 10 3C0733892C1683812345 "Platform"
    FEATURE lsf_sched_resource_reservation lsf_ld 6.000 1-sep-0000 10 ECD7C369072CA3812345 "Platform"
    FEATURE platform_hpc lsf_ld 6.000 1-sep-0000 10 CA6CBE08B635EAC765EC "Platform"
    
  5. Copy the license.dat file to /opt/lsfhpc/conf/license.dat.

    LSF_LICENSE_FILE in lsf.conf is set automatically during installation to LSF_LICENSE_FILE=/opt/lsfhpc/conf/license.dat.

  6. Start the license daemons (lmgrd):
    1. Log on to the license server host.
    2. Use the lmgrd command to start the license server daemon. For example:
      % lmgrd -c /opt/lsfhpc/conf/license.dat -l /tmp/license.log
      


      DO NOT run lmgrd as root.

    LSF installation puts the lmgrd command in LSF_SERVERDIR. For example: /opt/lsfhpc/6.1/linux2.4-glibc2.3-x86/etc/lmgrd.

    You should include LSF_SERVERDIR in your PATH environment variable. You should also include the full lmgrd command line in your system startup files on the license server host, so that lmgrd starts automatically during system restart.

  7. Check the license daemons (lmstat):
    License server status: 1700@frontend-0
    License file(s) on frontend-0: /opt/lsfhpc/conf/license.dat:
    frontend-0: license server UP (MASTER) v7.0
    Vendor daemon status (on frontend-0):
    lsf_ld: UP v7.0
    Feature usage info:
    Users of lsf_base:  (Total of 4 licenses available)
    Users of lsf_manager:  (Total of 4 licenses available)
    Users of Platform_HPC:  (Total of 4 licenses available)
    

    See Licensing Platform LSF for detailed information about configuring Platform LSF licenses.

Configure compute nodes in your Platform LSF HPC cluster:

Compute nodes are automatically added to the cluster either when the lsf service is restarted or when insert-ethers exits. New compute nodes are added to the lsf.cluster.lsfhpc file. Compute nodes are assigned to the lammpi and mpichp4 resources by default. You can change the default by changing the value of DefaultLSFHostResource in the Platform OCS database, as follows:

# mysql -u apache cluster
mysql> insert into app_globals (service,component,value) values ('Info','DefaultLSFHostResource','<resource_list>');

where <resource_list> is a list of all the Resources for a node. LSF HPC includes support for the following MPI implementations:

See the lsf.shared file for a full list of all supported MPI implementations.

The following is a sample host entry:

Begin Host
HOSTNAME   model  type   server r1m  mem  swp  RESOURCES
...
compute-0-1  !      !        1   3.5  ()  ()    (lammpi mpichp4)
...
End Host

Setting up and starting your Platform LSF HPC cluster

  1. Log on to the front end node as root.
  2. Install the Platform LSF HPC license as described above.
  3. Restart the LSF HPC services:
    # service lsf restart
    

The compute nodes start LSF HPC at boot time.

After starting your cluster, run a few basic LSF commands (lsid, lshosts, bhosts). For example:

% lsid
   Platform LSF HPC 6.1 for Linux, Sep 1 2005
   Copyright 1992-2005 Platform Computing Corporation
   My cluster name is lsfhpc
   My master name is frontend-0.public

Configuring your Platform LSF HPC cluster for master failover

The LSF HPC Master appliance type is a node type for LSF HPC master candidates. These nodes are used to offload the LSF Master to a less busy node. You should use the LSF HPC Master appliance type in larger clusters to install one or more nodes as LSF master candidates.

Configure your cluster for master failover using the following procedure:

  1. Run insert-ethers and select the LSF HPC Master appliance type.
  2. Install one or more of the LSF HPC Master nodes.
  3. Exit insert-ethers.

    This is needed to update the lsf.cluster.lsfhpc file.

  4. Verify that the NFS filesystem can be mounted on the lsfhpc-0-0 node by running the following commands:
    1. # ssh lsfhpc-0-0
    2. # mount NFS_server_name:export_name /mnt
    3. # umount /mnt
  5. Run the config-lsf-master script in /home/install/upgrades/lsfhpc.
  6. Answer the dialog questions when prompted by the script.

After you answer the questions, the LSF HPC cluster should use the lsfhpc-0-0 node as the primary LSF master, and the front end node should be the last node to which it fails over.

You should run the config-lsf-master script each time you add or remove an LSF HPC Master appliance node.

Removing a compute node

To remove a compute node, you should shut down the entire compute node, or at least shut down the Platform LSF HPC daemons, then remove the compute node from LSF and the Platform OCS cluster.

  1. # ssh compute_node_name
  2. # shutdown

    OR

    # /etc/init.d/lsf stop
    
  3. On the front end node, remove the related host line from the lsf.cluster.lsfhpc file.
  4. # lsadmin reconfig
  5. # badmin reconfig

Removing a host from the Platform OCS cluster:
# insert-ethers --remove="<compute_node_name>"
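Removing the host line from the lsf.cluster.lsfhpc file can be done in an editor, or previewed with a filter such as the one sketched below. The function name is hypothetical, and the sketch assumes the host name is the first whitespace-separated field of its line, as in the sample host entry shown earlier. It prints the result instead of editing the file, so you can review it before replacing the original.

```shell
# drop_host_line HOST CLUSTER_FILE
# Print CLUSTER_FILE without any line whose first field is exactly HOST.
drop_host_line() {
    host=$1
    cluster_file=$2
    awk -v h="$host" '$1 != h' "$cluster_file"
}
```

Because the comparison is on the whole first field, removing compute-0-1 leaves compute-0-10 and similar names untouched.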

Troubleshooting

  1. Symptom: Unable to launch parallel jobs using LAM MPI when the number of processes increases (approximately 32).

    Explanation: lamboot is failing to ssh to the nodes in a timely manner. The issue can be further traced to name resolution in ssh. The cluster may be trying to use a non-existent domain or DNS server.

    To test if the issue exists:

    1. Connect to a compute node and try to ssh to another compute node.

      Take note of the time required.

    2. Log out of the other compute node and try to ssh to the IP address of another compute node.
    3. If the result is dramatically faster, the issue exists.
    4. To verify the problem is with name resolution, edit the /etc/resolv.conf file on a compute node. Change the search line to only include the private domain.
    5. ssh to another compute node. It should respond faster.

    Solution: Fix this problem using any one of the following:

    • Update the front end's /etc/resolv.conf file to use a real DNS server.
    • Update the database and set the PublicDNSDomain in the app_globals table to be blank, and then reinstall the compute nodes.
    • As a temporary fix, change the search parameter in the /etc/resolv.conf of all the compute nodes.
  2. Symptom: A suspended lammpi job cannot be terminated by the owner with bkill.

    Solution: Resume the job to be killed; the job will go to EXIT.

  3. Symptom: If an MPI job is launched on a compute node to run across other compute nodes, and those compute nodes are not accessible from the launching node, the job fails, but bjobs shows the job as DONE.

    Solution: Make sure all compute nodes are accessible to each other.

  4. Symptom: An mpich_gm job can be dispatched to a host without a Myrinet card installed.

    Solution: Configure a host group gm_hosts for those hosts with a Myrinet card properly installed, and specify the option -m gm_hosts for bsub when submitting the job.

  5. Symptom: When a job fails to launch or is terminated during execution, the related temporary files are left behind under HOME.

    Solution: Remove the files manually.

  6. Symptom: Ctrl-C cannot terminate a started interactive lammpi job.

    Solution: Use bkill instead.

  7. Symptom: A lammpi application fails if it runs across a mix of nodes with and without a Myrinet card installed.

    Solution: Set the environment variable LAM_MPI_SSI_rpi=tcp before submitting the job.
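For the first troubleshooting item above, the change to the search line in /etc/resolv.conf can be previewed with a small filter. This is a sketch only: the function name is hypothetical, and the private domain name local used in the example is an assumption based on host names such as compute-0-3.local. The function prints the modified content and does not alter the file.

```shell
# restrict_search_domain DOMAIN RESOLV_FILE
# Print RESOLV_FILE with its search line reduced to DOMAIN only.
# DOMAIN must not contain a slash, because it is used in a sed expression.
restrict_search_domain() {
    domain=$1
    resolv=$2
    sed "s/^search[[:space:]].*/search $domain/" "$resolv"
}
```

For example, restrict_search_domain local /etc/resolv.conf shows the file as it would look with the private domain as the only search entry.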


Platform Lava

Platform Lava is a distributed batch system for submitting jobs and managing the workload on a Platform OCS™ cluster. Platform Lava is free and is based on Platform LSF.

Platform Lava lets you easily manage the day-to-day workload of a whole cluster, providing simplified job execution, management, and accounting.


Platform LSF HPC will disable Platform Lava. Do not install Platform LSF HPC unless you intend to use it.

If you have already installed Platform LSF HPC, but want to use Platform Lava, you must use the rollops tool to remove Platform LSF HPC, then re-enable Platform Lava by running chkconfig --add lava on all affected nodes. On the front end node, you must also run chkconfig --add lavagui.

About the Platform Lava GUI

The Platform Lava GUI gives you the ability to monitor and control your Platform Lava jobs, queues, and hosts. This browser-based interface is installed on your front end only; no components are installed on your compute hosts.

The Platform Lava GUI is not intended to be production quality. We hope to get feedback on its usefulness as a tool for future releases.

Supported browsers
The Platform Lava GUI Web address

Use your browser to navigate to your Platform Lava GUI Web address, which is http://<host_name>:<port_number>/Platform.

The word "Platform" in the URL is case sensitive. Ensure that it appears exactly as it appears here. The default port number is 8080.

Logging in to the Platform Lava GUI

To log in to the Platform Lava GUI:

  1. Navigate to http://<host_name>:<port_number>/Platform with your browser. The default port is 8080.
  2. Log in as the Platform Lava Administrator or as a listed user with your OS user name and password.
    1. To log in as the Platform Lava Administrator, first run passwd lavaadmin to set the password. The user name is lavaadmin, and no password is set by default. For more information, see Administering Platform Lava.

    You cannot log in as root.

Using the MPI job submission scripts

Platform Lava includes two scripts to aid in submitting MPI jobs that use LAM over Ethernet or MPICH over Ethernet. The scripts are wrappers around mpirun that handle the setup of the machinefile for MPICH, or the bhosts for LAM. The scripts are included with the Lava roll and are located in the $LSF_BINDIR directory. The scripts are for use with Lava only, and are named according to the type of MPI:

lam-mpirun

Run the following command to run the lam-mpirun script:

% bsub -n <num_processors> lam-mpirun -np <num_processors> <MPI_JOB> <ARGS>

Note: A properly configured LAM environment is required before using this wrapper. The lam-mpirun script also calls lamboot and lamhalt.

mpich-mpirun

Run the following command to run the mpich-mpirun script:

% bsub -n <num_processors> mpich-mpirun -np <num_processors> <MPI_JOB> <ARGS>

Known issues

Lava GUI

Troubleshooting

  1. Symptom: The Lava GUI (lavagui) link is broken in the main page after you reinstall the Lava roll using the rollops tool.

    Solution: Restart the lavagui daemon as follows:

    1. # /etc/init.d/lavagui stop
    2. # /etc/init.d/lavagui start

    The Lava GUI should function normally after you restart the lavagui daemon.

  2. Symptom: After removing a compute node from your Platform OCS cluster, the bhosts or lsload command shows that the Platform Lava daemons are still running on the host. This occurs after you run the following command:
    # insert-ethers --remove host_name
    

    Solution: Restart the daemons on the master Platform Lava host by running the following commands:

    # /etc/init.d/lava stop
    # /etc/init.d/lava start
    
  3. Symptom: After physically disconnecting a compute node from your Platform OCS cluster, the bhosts or lsload command shows that the host is UNKNOWN.

    Solution: Restart the daemons on the master Platform Lava host by running the following commands:

    # /etc/init.d/lava stop
    # /etc/init.d/lava start
    


Intel® Software Tools Roll

The Intel® Software Tools roll contains the Intel C++ compiler and the Intel MPI Library 2.0 packages, including library and integrated performance primitives, LAM MPI, and MPICH.

Getting a license for Intel compiler and tools

To get a license (either evaluation or commercial) for the Intel compiler and tools, follow these steps:

  1. Go to the following URL:

    http://www.intel.com/software/products/distributors/rock_cluster.htm

  2. Select the tool for which you want to obtain a license by clicking on the appropriate link under the Intel Software Development Products section.

    You need a separate license for each Intel tool.

  3. Get the license:
    • To get a demo/evaluation license, click the Free Evaluation Software link in the Evaluate/Purchase table, and follow the instructions.
    • To get a commercial license, click the Buy/Renew link in the Evaluate/Purchase table, and follow the instructions.

Troubleshooting

  1. Symptom: When trying to run or compile applications using the x86 or EM64T Intel MPI package, you get link-time and runtime errors.

    Explanation: This occurs because the MPI libraries are not set in the system library path.

    Solution: Run the following commands on the front end and the compute nodes.

    • On the front end node, run:
        • For x86:
          # echo "/opt/intel_mpi_10/lib" >> /etc/ld.so.conf
          # ldconfig
          
        • For Intel EM64T:
          # echo "/opt/intel_mpi_10/lib64" >> /etc/ld.so.conf
          # ldconfig
          
    • For your compute nodes, log into the front end as root and run:
        • For x86:
          # cluster-fork 'echo "/opt/intel_mpi_10/lib" >> /etc/ld.so.conf; ldconfig'
          
        • For Intel EM64T:
          # cluster-fork 'echo "/opt/intel_mpi_10/lib64" >> /etc/ld.so.conf; ldconfig'
          


    You will see warnings about shared libraries being too small. They can be safely ignored.

  2. Symptom: Applications compiled with the Intel MKL library cannot run. The following error is encountered when running applications compiled with the Intel MKL:
    xhpl: error while loading shared libraries: libmkl_lapack64.so: cannot open 
    shared object file: No such file or directory
    

    Solution: Add the MKL library path to /etc/ld.so.conf as follows and run ldconfig:

    • For x86, add /opt/intel/mkl701cluster/lib/32 to /etc/ld.so.conf
    • For Intel EM64T, add /opt/intel/mkl701cluster/lib/em64t to /etc/ld.so.conf

    To avoid the following error:

    xhpl: relocation error: /usr/lib64/libguide.so: undefined symbol: 
    _intel_fast_memset
    

    make sure the line /opt/intel/mkl701cluster/lib/em64t comes before the /lib64 and /usr/lib64 lines in /etc/ld.so.conf so that the library links correctly. For example:

    /opt/gm/lib
    /usr/X11R6/lib64
    /usr/kerberos/lib
    /usr/X11R6/lib
    /usr/kerberos/lib64
    /usr/lib64/mysql
    /usr/local/topspin/lib64
    /usr/local/topspin/mpi/mpich/lib64
    /opt/gridengine/lib/lx24-amd64
    /opt/intel/mkl701cluster/lib/em64t
    /lib64
    /usr/lib64
    /usr/kerberos/lib64
    /opt/sge/lib/glinux
    /opt/nmi/lib
    /usr/lib64/qt-3.1/lib
    /usr/lib64/mysql
    /usr/X11R6/lib64
    /opt/intel_fce_80/lib
    /opt/intel_cce_80/lib
    


CluMon Roll

CluMon (Beta) is an open source cluster monitoring system developed at the National Center for Supercomputing Applications (NCSA) to keep track of its Linux® clusters. CluMon is a tunable system that can be made to work for almost any set of Linux machines.

For more information on CluMon, see http://clumon.ncsa.uiuc.edu.

The optional CluMon roll fully integrates the CluMon monitoring application with Platform Lava or Platform LSF HPC.

The CluMon roll is for both Standard and Enterprise Editions of Platform OCS version 4.1.1-2.0.

CluMon roll architecture

The CluMon roll supports x86 Red Hat Enterprise Linux® 4.0 and Intel EM64T Red Hat Enterprise Linux® 4.0.

Known issues

The CluMon roll is beta quality and has the following known issues:

Installation
CluMon Interface


PVFS2 Roll

The PVFS2 (Parallel Virtual Filesystem 2) roll is a bundle of all the components you need to run a high-performance distributed file system.

The following groups have collaborated on or supported the development of PVFS:

There are three components in PVFS2 1.3.2:

  1. Meta server
  2. Data server
  3. Client

The roll creates a sample distributed file system that you should then fine-tune for your own configuration and hardware. PVFS2 makes the disk space in each node accessible to all nodes as a single file system, creating a high-speed file system ideal for datasets and job information. PVFS2 has some limitations that an administrator should understand before configuring it. The latest documentation is available from the PVFS2 Web site at: http://www.pvfs.org/pvfs2/documentation.html.

Requirements

One host must be dedicated as the meta server and named pvfs2-meta-server-0-0. Once this host is installed, all the others will be able to access a sample PVFS2 file system under /mnt/pvfs2.

Installation

During installation, an autofs configuration file is installed along with the binaries and source code on all clients. On the first startup after installation, the kernel module will be built. The source