Proposals:Condor

From KitwarePublic

Revision as of 18:56, 16 April 2011

Introduction

Condor is an open source distributed computing software framework. It can be used to manage workload on a dedicated cluster of computers, and/or to farm out work to idle desktop computers. Condor is a cross-platform system that runs on Unix and Windows operating systems. It is a complex and flexible system that can run jobs in serial and parallel mode. For parallel jobs, it supports the MPI standard. This Wiki page documents our working experience with Condor.

Downloading Condor

Different versions of Condor can be downloaded from here. This documentation focuses on our experience installing and configuring Condor version 7.2.0. Detailed documentation for this version can be found here.

Preparation

As Condor is a flexible system, there are different ways to configure it in your computing infrastructure. Hence, before starting the installation, make the following important decisions.

  1. What machine will be the central manager?
  2. What machines should be allowed to submit jobs?
  3. Will Condor run as root or not?
  4. Do I have enough disk space for Condor?
  5. Do I need MPI configured?

Condor can be installed as a manager node, an execute node, a submit node, or any combination of these. See The Different Roles a Machine Can Play.

  • Manager: There can be only one central manager for your pool. This machine is the collector of information, and the negotiator between resources and resource requests.
  • Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.
  • Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.

Installation

Unix

The official instructions on how to install Condor on Unix can be found here. Below we present some of the tweaks we had to make to get it to work on our Unix machines.

Prerequisites

  • Be sure the server has a hostname and a domain name
 hostname

should return mymachine.mydomain.com (or .org, .edu, etc.). If it only returns mymachine, then your server does not have a fully qualified domain name.

To set the domain name, edit /etc/hosts and add your domain name to the first line. You might see something like

10.171.1.124 mymachine

change this to

10.171.1.124 mymachine.mydomain.com

Also edit /etc/hostname to be

mymachine.mydomain.com

Then reboot so that the hostname changes take effect.
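
The hostname check above can be scripted. The following is a minimal sketch, not part of Condor: is_fqdn is a hypothetical helper that only tests whether a name contains a dot with non-empty labels around it, which is an assumption about what "fully qualified" means here.

```shell
#!/bin/sh
# is_fqdn NAME: succeed if NAME looks fully qualified, i.e. it contains at
# least one dot and has no empty labels. Hypothetical helper for illustration.
is_fqdn() {
    case "$1" in
        .*|*.|*..*) return 1 ;;  # leading dot, trailing dot, or empty label
        *.*) return 0 ;;         # at least one dot between labels
        *) return 1 ;;           # bare host name such as "mymachine"
    esac
}

# Fall back to a placeholder if the hostname command is unavailable.
name=$(hostname 2>/dev/null || echo unknown)
if is_fqdn "$name"; then
    echo "$name looks fully qualified"
else
    echo "$name is not fully qualified; check /etc/hosts and /etc/hostname"
fi
```

This only inspects the string returned by hostname; it does not verify that the name resolves on the network.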

  • Make sure the following packages are installed:
apt-get install mailutils
  • Download the Condor package for your platform. For example, you could run a command similar to the following:


wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz
  • You should install Condor as root or with a user having equivalent privileges

Configuring a Condor Manager in Unix

  • Make sure the condor archive is in your home directory (/home/kitware), then untar it.
cd ~
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz
cd ./condor-7.2.X
  • If not yet done, create a condor user
adduser condor
  • Run the installation script condor_install
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor

After running the installation script, you should get the following output:

Installing Condor from /root/condor-7.2.X to /root/condor

Condor has been installed into:
    /root/condor

Configured condor using these configuration files:
  global: /root/condor/etc/condor_config
  local:  /home/condor/localcondor/condor_config.local
Created scripts which can be sourced by users to setup their
Condor environment variables.  These are:
   sh: /root/condor/condor.sh
  csh: /root/condor/condor.csh
  • Switch to the directory where condor is now installed
cd /root/condor
  • Edit /etc/environment and update the PATH variable to include the directories /root/condor/bin and /root/condor/sbin
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"
  • Add the following line
CONDOR_CONFIG="/root/condor/etc/condor_config"
  • Save file and apply the change by running
source /etc/environment
  • Make sure CONDOR_CONFIG and PATH are set correctly
root@rigel:~$ echo $PATH
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games

root@rigel:~$ echo $CONDOR_CONFIG
/root/condor/etc/condor_config
  • You can now log out and log back in, or even restart the machine, and you should be able to check that the CONDOR_CONFIG and PATH environment variables are still set.
  • Edit the Condor manager configuration file and update the lines as shown below:
cd ~/condor
vi ./etc/condor_config
RELEASE_DIR              = /root/condor
LOCAL_DIR                = /home/condor/localcondor
CONDOR_ADMIN             = email@website.com
UID_DOMAIN               = website.com
FILESYSTEM_DOMAIN        = website.com
HOSTALLOW_READ           = *.website.com
HOSTALLOW_WRITE          = *.website.com
HOSTALLOW_CONFIG         = $(CONDOR_HOST)
  • To allow Midas to run Condor commands, create a link to /root/condor/etc/condor_config in /home/condor
cd /home/condor
ln -s /root/condor/etc/condor_config condor_config
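
The environment and configuration steps above can be sanity-checked with a short script. This is a sketch under the assumptions of this page (install prefix /root/condor); path_contains is a hypothetical helper, and condor_config_val is only called if it is found on the PATH.

```shell
#!/bin/sh
# path_contains DIR PATHSTRING: succeed if DIR is one of the colon-separated
# entries of PATHSTRING. Hypothetical helper, not part of Condor.
path_contains() {
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# The two directories this page adds to PATH.
for dir in /root/condor/bin /root/condor/sbin; do
    if path_contains "$dir" "$PATH"; then
        echo "PATH contains $dir"
    else
        echo "PATH is missing $dir"
    fi
done

echo "CONDOR_CONFIG=${CONDOR_CONFIG:-<unset>}"

# Read a few settings back from Condor itself, if the tools are installed.
if command -v condor_config_val >/dev/null 2>&1; then
    for var in RELEASE_DIR LOCAL_DIR UID_DOMAIN FILESYSTEM_DOMAIN; do
        echo "$var = $(condor_config_val "$var")"
    done
fi
```

Matching on ":$PATH:" avoids false positives such as /root/condor matching /root/condor/bin.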


Configuring an Executer/Submitter in Unix

The files that allow the server to also act as a Condor submitter/executer were updated automatically when the installation script condor_install was run. Nevertheless, you still need to update the local configuration file.

  • Edit the Condor node's condor_config.local file and update the line as shown below:
vi /home/condor/localcondor/condor_config.local
CONDOR_ADMIN        = email@website.com

If the installation went well, UID_DOMAIN and FILESYSTEM_DOMAIN should already be set to website.com.

Windows

The official documentation on how to install Condor in Windows can be found here

Running Condor

The official user's manual on how to perform distributed computing can be found here.


  • Run the Condor manager
condor_master
  • Assuming that you set the type to manager,execute,submit (the default) during installation, run the following command
ps -e | egrep condor_
  • You should get something similar to:
1063 ?        00:00:00 condor_master
1064 ?        00:00:00 condor_collecto
1065 ?        00:00:00 condor_negotiat
1066 ?        00:00:00 condor_schedd
1067 ?        00:00:00 condor_startd
1068 ?        00:00:00 condor_procd
  • If you run the command ps -e | egrep condor_ just after you started condor, you may also see the following line
1077 ?        00:00:00 condor_starter
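
The process check above can be automated. The following sketch reads a ps -e listing on standard input and prints the daemons that are absent. The list of expected daemons is taken from the sample output on this page and assumes the manager,execute,submit setup; missing_daemons is a hypothetical helper.

```shell
#!/bin/sh
# Daemons expected on a combined manager/execute/submit machine
# (assumption based on the sample listing on this page).
expected="condor_master condor_collector condor_negotiator condor_schedd condor_startd"

# missing_daemons: read a process listing on stdin and print each expected
# daemon that does not appear in it.
missing_daemons() {
    listing=$(cat)
    for d in $expected; do
        # ps -e shows at most 15 characters of the command name,
        # e.g. condor_collecto, so compare against the truncated name.
        short=$(printf '%.15s' "$d")
        case "$listing" in
            *"$short"*) ;;
            *) echo "$d" ;;
        esac
    done
}

ps -e | missing_daemons
```

An empty result means every expected daemon was found in the listing.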


  • Check the status
kitware@rigel:~$ condor_status

Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
 
slot1@rigel        LINUX      X86_64 Unclaimed Idle     0.010  1006  0+00:10:04
slot2@rigel        LINUX      X86_64 Unclaimed Idle     0.000  1006  0+00:10:05

                    Total Owner Claimed Unclaimed Matched Preempting Backfill

       X86_64/LINUX     2     0       0         2       0          0        0

              Total     2     0       0         2       0          0        0
  • Set up Condor to start automatically at boot
cp /root/condor/etc/example/condor.boot /etc/init.d/
  • Update MASTER parameter in condor.boot to match your current setup
vi /etc/init.d/condor.boot

MASTER=/root/condor/sbin/condor_master
  • Add the condor.boot service to all runlevels
kitware@rigel:~$ update-rc.d condor.boot defaults

/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot

Additional Information

The right processor architecture

The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, which targets IA64 (http://en.wikipedia.org/wiki/IA-64), the 64-bit Intel Itanium architecture. IA64 does not cover all 64-bit Intel processors.

While trying to run condor_master, the shell returned the following error message: cannot execute binary file

Using the program readelf, it's possible to extract the header of an executable and understand if a given executable could run on a given platform.

kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           Intel IA-64
  Version:                           0x1
  Entry point address:               0x40000000000bf3e0
  Start of program headers:          64 (bytes into file)
  Start of section headers:          9382744 (bytes into file)
  Flags:                             0x10, 64-bit
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         7
  Size of section headers:           64 (bytes)
  Number of section headers:         32
  Section header string table index: 31
kitware@rigel:~$ readelf -h /bin/ls
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x4023c0
  Start of program headers:          64 (bytes into file)
  Start of section headers:          104384 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         8
  Size of section headers:           64 (bytes)
  Number of section headers:         28
  Section header string table index: 27

kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x4b9450
  Start of program headers:          64 (bytes into file)
  Start of section headers:          4553256 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         8
  Size of section headers:           64 (bytes)
  Number of section headers:         28
  Section header string table index: 30

Comparing the Machine field of the three outputs shows that the Intel IA-64 binary does not match this machine: both the system's own /bin/ls and the working condor_master are built for X86-64.
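
The comparison can be scripted as follows. This is a sketch only: elf_machine and machine_matches are hypothetical helpers, and the mapping between uname -m values and readelf Machine strings covers just the two architectures discussed on this page.

```shell
#!/bin/sh
# elf_machine: read `readelf -h` output on stdin and print the Machine field.
elf_machine() {
    sed -n 's/^ *Machine: *//p'
}

# machine_matches MACHINE ARCH: succeed if the readelf Machine string fits the
# architecture name reported by `uname -m`. Only x86_64 and ia64 are mapped.
machine_matches() {
    case "$2:$1" in
        x86_64:*X86-64*) return 0 ;;
        ia64:*IA-64*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example: check a binary given as the first argument (defaults to /bin/ls).
binary=${1:-/bin/ls}
if command -v readelf >/dev/null 2>&1 && [ -f "$binary" ]; then
    machine=$(readelf -h "$binary" | elf_machine)
    if machine_matches "$machine" "$(uname -m)"; then
        echo "$binary ($machine) matches this machine"
    else
        echo "$binary ($machine) does not match $(uname -m)"
    fi
fi
```

Running it against sbin/condor_master from a freshly unpacked archive, before installing, would reveal a wrong-architecture download early.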


Be sure that your executable is statically linked.

For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines on the Condor grid, because we do not have shared Unix users on our network. In our case this user was condor.

When building BatchMake, you need to build with grid support on.

Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems. Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.

Useful commands

condor_status

Allows you to check the status of the Condor pool. This is the first command-line program you should get working, as it will help you debug other problems.

condor_q

Shows you which of the processes that you have submitted to Condor are active. It gives you a cluster ID and a process ID for each process.

condor_q -analyze CID.PID
condor_q -better-analyze CID.PID

When given the cluster ID and process ID, these tell you how many execution machines matched each of your job's requirements.

condor_config_val <CONDOR_VARIABLE>

Tells you the value of CONDOR_VARIABLE in your Condor setup. This can help maintain your sanity.

condor_rm CID.PID

Removes the job with cluster ID CID and process ID PID from your Condor pool. Useful for killing Held or Idle jobs.

For more details on the daemons described below, see section "3.1.2 The Condor Daemons" in the Condor Manual.

condor_master

On any machine, no matter its configuration, there will be a condor_master daemon. This process starts and stops all other Condor daemons. condor_master writes a MasterLog and has a .master_address file. Be sure that .master_address contains the correct IP address. If it does not, set the correct value in the parameter

NETWORK_INTERFACE = <desired IP>

in your condor_config.local file. This must be an interface the machine actually has. Condor may have picked a different IP than you wanted, perhaps the local loopback address 127.0.0.1 or a wireless adapter instead of the Ethernet adapter.

condor_startd

This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.

condor_starter

This daemon runs on execution nodes, and is responsible for starting up the actual execution processes and logging their details.

condor_schedd

This daemon runs on submit nodes, and represents data about submitted jobs. It tracks the queueing of jobs and tries to get resources for all of its jobs to be run. When a submitted job is run, this daemon spawns a condor_shadow daemon.

condor_shadow

This daemon runs on the submission machine while a job is actually executing. It takes care of system calls that need to be executed on the submitting machine for a process. There is one condor_shadow process for each executing process of a submission machine, so on a machine with a large number of submitted processes, the number of shadow daemons that memory or other resources can support could become a limitation.

condor_collector

This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the pool. All nodes in the pool let this daemon know that they exist, what services they support, and what requirements they have.

condor_negotiator

This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match each job with an execution node. Log files of interest include NegotiatorLog and MatchLog.

condor_kbdd

This daemon detects user activity on an execute node, so that Condor knows whether to allow a job to run or to disallow it because a human user is currently engaged in some task.

Links

  • Detailed Condor documentation is also available on the website: http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html