Proposals:Condor (KitwarePublic Wiki)<p>Michael.grauer, 2011-06-24: /* Adding an Additional Compute Node, for MPICH2 and Condor on Windows */</p>
<hr />
<div>= Introduction =<br />
<br />
Condor is an open source distributed computing software framework. It can be used to manage workload on a dedicated cluster of computers and/or to farm out work to idle desktop computers. Condor is a cross-platform system that runs on Unix and Windows operating systems. It is a complex and flexible system that can execute jobs in serial and parallel mode; for parallel jobs, it supports the MPI standard. This wiki page documents our working experience using Condor.<br />
<br />
= Downloading Condor =<br />
<br />
Different versions of Condor can be downloaded from [http://www.cs.wisc.edu/condor/downloads-v2/download.pl here]. This documentation focuses on our experience installing and configuring Condor version 7.2.0. The official detailed documentation for this version can be found [http://www.cs.wisc.edu/condor/manual/v7.2/ here].<br />
<br />
= Preparation =<br />
<br />
As Condor is a flexible system, there are different ways of configuring it in your computing infrastructure. Hence, before starting the installation, make the following important decisions.<br />
<br />
#What machine will be the central manager?<br />
#What machines should be allowed to submit jobs?<br />
#Will Condor run as root or not?<br />
#Do I have enough disk space for Condor?<br />
#Do I need MPI configured?<br />
<br />
Condor can be installed as a ''manager'' node, an ''execute'' node, a ''submit'' node, or any combination of these. See [http://www.cs.wisc.edu/condor/manual/v7.2/3_1Introduction.html#SECTION00411000000000000000 The Different Roles a Machine Can Play]<br />
<br />
* Manager: There can be only one central manager for your pool. This machine is the collector of information and the negotiator between resources and resource requests.<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
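<br />
Which role(s) a machine plays comes down to which daemons run on it, controlled by the DAEMON_LIST configuration macro. As a sketch (the same values appear in the Windows configuration later on this page):<br />
<br />
# submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# central manager that also submits and executes:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD<br />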
<br />
For more information regarding other required preparatory work, refer to the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation].<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
''' The official instructions on how to install Condor on Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here] '''. Below we present some of the tweaks we had to do to get it to work on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.). If it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
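<br />
The hostname setup above can be sanity-checked with a small shell sketch (''mymachine.mydomain.com'' is the placeholder name from the text, not a real host):<br />
<br />
```shell
# Check whether this machine reports a fully qualified domain name.
fqdn=$(hostname -f 2>/dev/null || hostname)
case "$fqdn" in
  *.*) echo "OK: fully qualified: $fqdn" ;;
  *)   echo "WARNING: '$fqdn' has no domain part; fix /etc/hosts and /etc/hostname" ;;
esac
```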
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a command similar to the following to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or with a user having equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (For example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation script ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update the PATH variable to include the directories ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save file and apply the change by running <br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out / log in or even restart the machine, and you should be able to check that the CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit the Condor manager configuration file and update the lines as shown below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, in order to allow MIDAS to run Condor commands, create a link to /root/condor/etc/condor_config in /home/condor<br />
cd /home/condor<br />
ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring an Executer/Submitter in Unix ===<br />
The files that allow the server to also be used as a Condor submitter/executer were automatically updated when the installation script ''condor_install'' ran. Nevertheless, you still need to update the local configuration file.<br />
<br />
* Edit the node's condor_config.local and update the line as shown below:<br />
vi /home/condor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the lines containing ''UID_DOMAIN'' and ''FILESYSTEM_DOMAIN'' should already be set to ''website.com''.<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor on Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor on Windows 7.<br />
<br />
# Download the Windows install MSI, run it, installing to "C:/condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a central controller, select "Create a new central pool" and set the name of the pool.<br />
##Otherwise select "Join an existing pool" and enter the hostname (full address) of the central manager.<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For the accounting domain, enter your domain (e.g. yourdomaininternal.com).<br />
# For email settings, I skipped this by clicking Next.<br />
# For Java settings, I skipped this by clicking Next, as we weren't using Java.<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access: $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked whether you want a custom install or a default install, choose the default install. This will install Condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When running Condor commands, start a Cygwin or cmd prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix up some problems. At first, '''condor_status''' gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect: since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or simply an IP that you do not want.<br />
<br />
<br />
I shut down Condor, right-clicked on C:/condor in Windows Explorer, turned off "read only", and set permissions to allow writing. Then I edited the file "c:/condor/condor_config.local" (which started out empty) so that it could pick up some replacement values, since some of them didn't seem to be set properly during the install. These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following:<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourinternaldomain.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
Then restart Condor. If you run "condor_status" and see that the Windows machine has Activity=OWNER rather than UNCLAIMED, be sure that you have added START = True. But this may not be the best configuration for a Windows workstation that is in use; some additional configuration is needed to make sure a Condor job doesn't use the machine while a physical human user is there using it.<br />
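<br />
One common approach to keep jobs off a workstation while someone is using it is to gate the START expression on keyboard idle time. The following is only a sketch built on the standard ''KeyboardIdle'' ClassAd attribute (seconds since the last keyboard/mouse activity, which condor_kbdd helps report); tune the threshold in condor_config.local:<br />
<br />
# Only start jobs after 15 minutes without keyboard/mouse activity;<br />
# KeyboardIdle is measured in seconds.<br />
START = KeyboardIdle > (15 * 60)<br />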
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here is the batch file contents for the actual job (printname.bat)<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, open a command prompt or Cygwin terminal with elevated privileges: right-click the icon used to start it, and click "Run as administrator".<br />
<br />
condor_master runs as a service on Windows, which controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows for the first time, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* run the condor manager <br />
condor_master<br />
<br />
* Assuming that during installation you set the type to ''manager,execute,submit'' (the default), run the following command:<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after starting Condor, you may also see the following line:<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Set up Condor to start automatically at boot<br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update the MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add the ''condor.boot'' service to all runlevels<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
#include <unistd.h><br />
#include <stdio.h><br />
<br />
int main( int argc, char** argv )<br />
{<br />
    printf( "%s\n", argv[1] );<br />
    fflush( stdout );<br />
    sleep( 30 );<br />
    return 0;<br />
}<br />
<br />
This executable will echo the command-line argument it is given, wait 30 seconds, then exit.<br />
<br />
Save this file as foo.c, then compile it with the following command (note the static linking):<br />
<br />
gcc foo.c -o foo --static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once, so you can see how Condor executes the job on multiple execute resources), you can change the condorjob file to be like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
<br />
We had a case where we had 6 slots: 2 were 32-bit with Arch=INTEL, and 4 were 64-bit with Arch=X86_64 (but we were unaware of the difference at first). We ran 6 jobs and then wondered why they would only execute on the submitting machine. So we changed the condorjob file to specify a certain architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told Condor to execute only on machines with an INTEL architecture, so execution was not attempted on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. We've included this story in case it helps with debugging.<br />
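<br />
Going the other way (a sketch we did not run): if the pool mixes architectures and you ship a binary per architecture, the Requirements expression can combine attributes so the job only matches slots it can actually run on, e.g.<br />
<br />
Requirements = (Arch == "X86_64") && (OpSys == "LINUX")<br />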
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was ''condor''.<br />
*When building BatchMake, you need to build with grid support enabled.<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows you which of the processes you have submitted to Condor are active. It will give you a cluster ID and process ID for each process.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process will start and stop all other condor daemons. condor_master will write a MasterLog and have a .master_address file. Be sure that the .master_address contains the correct IP address. If the correct IP address doesn't appear, set the correct value in the parameter <br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has; note that Condor may have picked a different IP than you wanted, perhaps the local loopback address 127.0.0.1 or a wireless adapter instead of the ethernet adapter.<br />
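<br />
To see which IPv4 addresses the machine actually has before setting NETWORK_INTERFACE, the output of ''ip -4 -o addr show'' can be filtered as below (a sketch; the sample line reuses the hypothetical 10.171.1.124 address from earlier, and on Windows ''ipconfig'' gives the same information):<br />
<br />
```shell
# parse_ips turns `ip -4 -o addr show` output into "interface address" pairs.
parse_ips() { awk '{ sub(/\/.*/, "", $4); print $2, $4 }'; }

# On a live system:  ip -4 -o addr show | parse_ips
# Demonstrated here on a captured sample line:
printf '2: eth0    inet 10.171.1.124/24 brd 10.171.1.255 scope global eth0\n' | parse_ips
```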
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes, and represents data about submitted jobs. It tracks the queueing of jobs and tries to get resources for all of its jobs to be run. When a job submitted is run, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine when an actual execution of the job is run. It will take care of systems calls that need to be executed on the submitting machine for a process. There will be a condor_shadow process for each executing process of a submission machine, meaning that on a machine with a large number of submitted processes, the number of shadow daemons supported by memory or other resources could be a limitation.<br />
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution. Log files of interest include NegotiatorLog and MatchLog<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon is used to detect user activity on an execute node, so Condor knows whether to allow execution of a job or to disallow it because a human user is currently engaged in some task.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, corresponding to [http://en.wikipedia.org/wiki/IA-64 IA64], the 64-bit Intel Itanium architecture. This does not cover all 64-bit Intel processors.<br />
<br />
While trying to run ''condor_master'', the shell returned the following error message: ''cannot execute binary file''.<br />
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], it's possible to extract the header of an executable and determine whether it can run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-6'''4<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the different outputs, it's possible to observe that the '''Intel IA-64''' architecture isn't the right one.<br />
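<br />
The comparison can be scripted; this sketch just pulls the Machine field out of a ''readelf -h'' header dump (demonstrated on the sample output captured above, so it runs without the Condor binaries):<br />
<br />
```shell
# parse_machine extracts the "Machine:" field from `readelf -h` output.
parse_machine() { awk -F: '/^ *Machine:/ { sub(/^[ \t]+/, "", $2); print $2 }'; }

# On a live system you would compare, for example:
#   readelf -h ./condor-7.2.0/sbin/condor_master | parse_machine
#   readelf -h /bin/ls | parse_machine
printf 'Machine:                           Intel IA-64\n' | parse_machine
```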
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the location of the MPICH2\bin directory to your system PATH. Add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows Firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
The following was performed in a Windows CMD prompt run with Administrative Privileges (right click on the CMD executable and run it as an Administrator).<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's Windows account password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication password for smpd; this is the '''mypassphrase''' you entered above. If all is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processes since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines and how many processes each should run, with the executable '''hostname''', which should print the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, you must right click on the project file (not the solution file), then click properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then, to run your exe on two different machines (I had to build a RELEASE build for this; at first I had built a DEBUG build, but that caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
<br />
== MPICH2 and Condor on Windows ==<br />
<br />
To configure MPI over Condor, I set my Windows 7 laptop up as a dedicated Condor scheduler and executer, following the instructions at this link:<br />
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
I turned off smpd.exe as a service, since Condor will manage it:<br />
<br />
smpd.exe -remove<br />
<br />
<br />
I then created the following MPI example program and ran it with a MachineCount of 2 and of 4 at different times (I ran this on my laptop, which has 4 cores and had Condor running locally). This program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.<br />
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char *argv[])<br />
{<br />
MPI::Init(argc, argv);<br />
//<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
int size = MPI::COMM_WORLD.Get_size();<br />
char processor_name[MPI_MAX_PROCESSOR_NAME];<br />
int namelen;<br />
MPI::Get_processor_name(processor_name,namelen);<br />
// <br />
int buf[1];<br />
int numElements = 1;<br />
// initialize buffer with rank<br />
buf[0] = rank;<br />
// example of synchronous communication<br />
// have even ranks send to odd ranks first<br />
// assumes an even number of processes<br />
int syncTag = 123;<br />
if (rank % 2 == 0)<br />
{<br />
// send to next higher rank<br />
int dest = rank + 1;<br />
std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Send(buf,numElements,MPI::INT,dest,syncTag);<br />
// now wait for dest to send back<br />
// even though dest is the param, in the Recv call it is used as the source<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,dest,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
}<br />
else<br />
{<br />
// source is next lower rank<br />
int source = rank - 1;<br />
std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,source,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
// send rank rather than buffer, just to be sure<br />
// even though param is source, in Send call it is for dest<br />
MPI::COMM_WORLD.Send(&rank,numElements,MPI::INT,source,syncTag);<br />
}<br />
// now for asynchronous communication<br />
std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
int asyncTag = 321;<br />
int leftRank;<br />
if (rank == 0)<br />
{<br />
leftRank = size-1;<br />
}<br />
else <br />
{<br />
leftRank = (rank-1);<br />
}<br />
int rightRank;<br />
if (rank == size-1)<br />
{<br />
rightRank = 0;<br />
}<br />
else <br />
{<br />
rightRank = (rank+1);<br />
}<br />
// everyone sends to the leftRank, and receives from the rightRank<br />
// if this were synchronous, would be in danger of deadlocking because of circular waits<br />
// reset buffers to rank<br />
buf[0] = rank;<br />
std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::Request sendReq = MPI::COMM_WORLD.Isend(buf,numElements,MPI::INT,leftRank,asyncTag);<br />
MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf,numElements,MPI::INT,rightRank,asyncTag);<br />
// wait for the send to complete and on the receipt<br />
MPI::Status status;<br />
sendReq.Wait();<br />
recvReq.Wait(status);<br />
std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
//<br />
MPI::Finalize();<br />
return(0);<br />
}<br />
<br />
<br />
I then created a two machine Condor parallel universe grid, using the same user/password combination on both machines, and this user had Administrative rights. My 4 core Windows 7 Professional laptop (X86_64,WINNT61) was the '''CONDOR_HOST''' and the '''DedicatedScheduler'''. Assuming this laptop is called '''clavicle''' in the '''shoulder.com''' domain, here are the important configuration parameters in its '''condor_config.local''' file: <br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files (x86)\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD, COLLECTOR, NEGOTIATOR, SMPD_SERVER<br />
<br />
<br />
The second machine, a 2 core Windows XP desktop called '''scapula''' (INTEL,WINNT51) was a dedicated machine with the following important parameters in its '''condor_config.local''' file:<br />
<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, STARTD, SMPD_SERVER<br />
<br />
Note how it points to '''clavicle''' both as '''CONDOR_HOST''' and as the '''DedicatedScheduler''', and that it does not have '''(x86)''' in its path to the '''smpd.exe'''.<br />
<br />
MPICH2 gets installed to different locations, to '''C:\Program Files\MPICH2''' for an (INTEL,WINNT51) machine, and to '''C:\Program Files (x86)\MPICH2''' for a (X86_64,WINNT61) machine. Because I needed to refer to the '''mpiexec.exe''' path in the batch file that my Condor submit file referred to, and this batch file would be the same for both machines, I opted to copy '''mpiexec.exe''' to the work directory I was using on all machines, '''C:\mpich2work'''.<br />
<br />
After setting up this configuration, I rebooted both machines. At this point, the command<br />
net stop condor<br />
no longer returns correctly, since Condor is managing the '''smpd.exe''' service now and does not know how to stop that service.<br />
<br />
<br />
<br />
Based on the Condor submit file described in the Condor wiki page, I adapted my own submit file for my above MPI executable, called '''mpiwork.exe''', to be the following:<br />
<br />
<br />
universe = parallel<br />
executable = mp2script.bat<br />
arguments = mpiwork.exe<br />
machine_count = 6<br />
output = out.$(NODE).log<br />
error = error.$(NODE).log<br />
log = work.$(NODE).log<br />
should_transfer_files = yes<br />
when_to_transfer_output = on_exit<br />
transfer_input_files = mpiwork.exe<br />
Requirements = ((Arch == "INTEL") && (OpSys == "WINNT51")) || ((Arch == "X86_64") && (OpSys == "WINNT61"))<br />
queue<br />
<br />
and my '''mp2script.bat''' file was adapted to be: <br />
<br />
set _CONDOR_PROCNO=%_CONDOR_PROCNO%<br />
set _CONDOR_NPROCS=%_CONDOR_NPROCS%<br />
REM If not the head node, just sleep forever<br />
if not [%_CONDOR_PROCNO%] == [0] copy con nothing<br />
REM Set this to the bin directory of MPICH installation<br />
set MPDIR="C:\mpich2work"<br />
REM run the actual mpijob<br />
%MPDIR%\mpiexec.exe -n %_CONDOR_NPROCS% -p 6666 %*<br />
exit 0<br />
<br />
with the change being that '''MPDIR''' is a location that is known to have '''mpiexec.exe''' on both machines, since I copied that file there on both machines as described above.<br />
<br />
A point of confusion I have is that I would have expected the '''mpiwork.exe''' code above to have the names of '''clavicle''' and '''scapula''' in the output when I ran it on 6 nodes. Instead only the name '''clavicle''' appeared. I do believe that this was running on both machines and all 6 cores, as both machines' IP addresses were noted in the work.parallel.log, and both machines had CPU activity at the time the executable ran.<br />
<br />
<br />
== Adding an Additional Compute Node, for MPICH2 and Condor on Windows ==<br />
<br />
Once you have an existing pool, adding a new node is relatively straightforward if Condor is already installed on the node. We have found that the new node does not need to use the same user account to run Condor/MPI as the existing pool machines.<br />
<br />
After Condor is initially set up on the new node, stop Condor from a command prompt started with administrative privileges:<br />
<br />
net stop condor<br />
<br />
Edit your condor_config.local as shown above for the execute node example, so that this new node is a dedicated execute node.<br />
<br />
Restart Condor:<br />
<br />
net start condor<br />
<br />
<br />
Now create a work directory on the new machine; it is important that this has the same name as the work directory on the other machines, in this example '''C:\mpich2work'''.<br />
<br />
<br />
Copy any needed input files to the work directory (another option is to have Condor transfer them).<br />
<br />
Copy the '''mpiexec.exe''' file from the '''MPICH2_INSTALL\bin''' directory to the work directory.<br />
<br />
That is all that is necessary. You can have the actual executable and the driver batch script both sent to the new execute node via commands in your Condor submission file, for example, here the executable is '''MetaOptimizer.exe''' and the driver script is '''driver.bat''':<br />
<br />
<br />
universe = parallel<br />
executable = driver.bat<br />
transfer_executable = true<br />
arguments = MetaOptimizer.exe C:\mpich2work\Input1.mhd C:\mpich2work\Input2.mhd C:\mpich2work\output<br />
machine_count = 6<br />
output = metaopt.$(NODE).log<br />
error = metaopt.$(NODE).log<br />
log = metaopt.$(NODE).log<br />
should_transfer_files = yes<br />
when_to_transfer_output = on_exit<br />
transfer_input_files = MetaOptimizer.exe<br />
queue<br />
<br />
== MPI Based Meta-Optimization Framework ==<br />
<br />
=== High Level Architecture ===<br />
<br />
The system was designed for infrastructure reuse. The infrastructure classes take care of organizing the computation nodes and of the communication between them. When a new MetaOptimization is needed, the client programmer should create subclasses of '''MetaOptimizer''', '''JobDefinition''', '''Results''', and '''RegistrationWorker''' specific to the computational problem.<br />
<br />
=== MPI Aware Infrastructure C++ classes ===<br />
<br />
[[File:Mpi metaopt class diagram.png | UML Diagram of C++ classes]]<br />
<br />
'''MasterSlave''': Entry point of the application, initializes the MPI infrastructure, starts up a '''Master''' (sending it a particular '''MetaOptimizer''' subclass), starts up as many '''Slave''' nodes as are necessary (sending each of them a particular '''RegistrationWorker''' subclass), and finalizes the MPI infrastructure when the computation is complete.<br />
<br />
<br />
[[File:Mpimeta start.png | MetaOptimization Framework Startup]]<br />
[[File:Mpimeta computation.png | MetaOptimization Framework Computation]]<br />
<br />
<br />
<br />
<br />
'''Master''': The central controlling part of the infrastructure. It manages the available '''Slave'''s, sends them '''JobDefinition'''s as needed, collects '''Results''' from the '''Slave'''s as they come in, re-runs a '''JobDefinition''' if it has failed or timed out, and shuts down all '''Slave'''s when the computation is finished. The '''Master''' owns a particular '''MetaOptimizer''' subclass. The '''MetaOptimizer''' subclass is aware of the particular computation; the '''Master''' only knows how to communicate with '''Slave'''s and has no responsibility for determining the breakdown of a computation into parts or its completion.<br />
<br />
'''Slave''': One of the processing nodes, part of the infrastructure. Each '''Slave''' manages a '''RegistrationWorker''' subclass. '''Slave'''s announce their presence to the '''Master''', receive a '''JobDefinition''' describing a particular job, pass the '''JobDefinition''' on to the '''RegistrationWorker''' to run the job, receive back a '''Results''' upon completion of the job, and send the '''Results''' back to the '''Master'''. '''Slave'''s are intended to run multiple rounds of registration during a particular computational run.<br />
<br />
=== Infrastructure Event C++ classes ===<br />
<br />
'''MasterEvent''' & '''SlaveEvent''' (named for the origin of the event) are Infrastructure Event classes, and each holds a Message Wrapper class as a payload.<br />
<br />
The idea behind the job message wrapper classes was to make message wrappers that know how to bi-directionally convert their specific named parameters to array buffers, giving them a pseudo-serialization ability without requiring them to be aware of MPI specifics. Each message wrapper class is associated with a specific Infrastructure Event class ('''SlaveEvent'''s own a '''Results''', and '''MasterEvent'''s own a '''JobDefinition'''). The Infrastructure Event classes take care of sending from '''Master''' classes to '''Slave''' classes, and from '''Slave''' classes to '''Master''' classes.<br />
<br />
The '''MetaOptimizer''' subclasses will return '''JobDefinition''' subclasses to the '''Master''', which then wraps the '''JobDefinition''' subclass in a '''MasterEvent''', and sends the '''MasterEvent''' to a '''Slave'''. A '''Slave''' reads a '''MasterEvent''', extracts the '''JobDefinition''' subclass, and passes the '''JobDefinition''' subclass to a '''RegistrationWorker''' subclass. Upon completion of the calculation for the '''JobDefinition''', the '''RegistrationWorker''' returns a '''Results''' subclass to the '''Slave''', which then wraps the '''Results''' in a '''SlaveEvent''' and sends the '''SlaveEvent''' to the '''Master'''. The '''Master''' reads the '''SlaveEvent''', extracts the '''Results''' subclass, and sends this '''Results''' subclass to the '''MetaOptimizer'''.<br />
<br />
=== Computation Specific C++ classes ===<br />
<br />
Note: currently these classes are specific to a particular example computation; they need to have the interface abstracted out and the specific example computation extracted and moved to subclasses.<br />
<br />
<br />
<br />
'''JobDefinition''': A Message Wrapper class that defines the parameters of a particular job; a specific subclass of '''JobDefinition''' should be used for a particular computation.<br />
<br />
'''Results''': A Message Wrapper class that defines the return values of a particular job; a specific subclass of '''Results''' should be used for a particular computation.<br />
<br />
'''MetaOptimizer''': The '''MetaOptimizer''' subclass for a particular computation takes care of determining the parameters for each job (in the form of a '''JobDefinition''') and how many '''JobDefinition'''s will be in a round of computation. The '''MetaOptimizer''' then takes in the '''Results''' from each '''JobDefinition''' and, upon receiving '''Results''' for every '''JobDefinition''' in a round, can either determine a new round of computation by recalculating the set of parameters for the new round, or else terminate the computation.<br />
<br />
'''RegistrationWorker''': The '''RegistrationWorker''' subclass for a particular computation will take in a particular '''JobDefinition''' subclass germane to the computation, perform the actual registration computation, then return the results in the form of a specific '''Results''' subclass that is relevant to the computation.</div>
<div>
Condor can be installed as a ''manager'' node, an ''execute'' node, a ''submit'' node, or any combination of these. See [http://www.cs.wisc.edu/condor/manual/v7.2/3_1Introduction.html#SECTION00411000000000000000 The Different Roles a Machine Can Play]<br />
<br />
* Manager: There can be only one central manager for your pool. The machine is the collector of information, and the negotiator between resources and resource requests.<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer to the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
''' The official instructions on how to install Condor in Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here] '''. Below we present some of the tweaks we had to do to get it to work on our Unix machines.<br />
<br />
=== Prerequisites ===<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.). If it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' (Platform RHEL 5 Intel x86/64). See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a command similar to the following to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or with a user having equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (for example ''/root'', since you are installing as '''root'''), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation script ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update the PATH variable to include the directories ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save the file and apply the change by running <br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out and log back in, or even restart the machine, and you should be able to check that the CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit the Condor manager configuration file and update the lines as shown below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, in order to allow MIDAS to run Condor commands, create a link to ''/root/condor/etc/condor_config'' in ''/home/condor''<br />
cd /home/condor<br />
ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring an Executer/Submitter in Unix ===<br />
The files that allow the server to also be used as a Condor submitter/executer were automatically updated when the installation script ''condor_install'' ran. Nevertheless, you still need to update its configuration file.<br />
<br />
* Edit the Condor node's local configuration file and update the line as shown below:<br />
vi /home/condor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the lines containing ''UID_DOMAIN'' and ''FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor in Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor in Windows 7.<br />
<br />
# Download the Windows install MSI, run it, installing to "C:/condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager (full address).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For accounting domain enter your domain (e.g. yourdomaininternal.com)<br />
# For Email settings (I ignored this by clicking next)<br />
# For Java settings (I ignored this, as we weren't using Java, by clicking next)<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked if you want a custom install or install, choose install. This will install condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When running Condor commands, start a Cygwin or cmd prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix up some problems. My '''condor_status''' at first gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect (since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or perhaps just an IP that you would not want).<br />
<br />
<br />
I shut down Condor, right-clicked on C:\condor in Windows Explorer, turned off "read only", and set permissions to allow writing. Then I edited the file "C:\condor\condor_config.local", which started out empty, so that it could pick up some replacement values (some of them didn't seem to be set properly during the install). These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following:<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourinternaldomain.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
and then restarted Condor. If you run "condor_status" and see that the Windows machine has Activity=OWNER rather than UNCLAIMED, be sure that you have added START=True. This may not be the best configuration for a Windows workstation that is in use, however; some additional configuration is probably needed to make sure a Condor job doesn't use the machine while a human user is there using it.<br />
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here are the contents of the batch file for the actual job (printname.bat):<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, get a command prompt or Cygwin terminal, right click the icon to start it up, and click "run with elevated privileges" or "run as administrator".<br />
<br />
On Windows, condor_master runs as a service, which controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows for the first time, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* run the condor manager <br />
condor_master<br />
<br />
* Assuming that during installation you set the type to ''manager,execute,submit'' (the default), run the following command:<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after you started condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Set up Condor to start automatically at boot <br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update the MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add the ''condor.boot'' service to all runlevels<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
#include <unistd.h><br />
#include <stdio.h><br />
int main( int argc, char** argv )<br />
{<br />
printf( "%s\n", argv[1] );<br />
fflush( stdout );<br />
sleep( 30 );<br />
return 0; <br />
}<br />
<br />
This program will print the command-line argument it is given, wait 30 seconds, then exit.<br />
<br />
Save this file as foo.c, then compile it (note the static linking):<br />
<br />
gcc foo.c -o foo --static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once so you can see how Condor will execute the job on multiple execute resources), you can <br />
change the condorjob file to be like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
<br />
We had a case where we had 6 slots, 2 were 32 bit with Arch=INTEL, 4 were 64 bit with Arch=X86_64 (but we were unaware of the difference at first). We ran 6 jobs and then wondered why they would only execute on the submitting machine. So we changed the condorjob file to specify a certain architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told Condor to execute only on machines with the INTEL architecture, so the job was not attempted on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. I've included this story in case it helps with debugging.<br />
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was Condor.<br />
*When building BatchMake, you need to build it with grid support enabled.<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows which of the jobs you have submitted to Condor are active. It gives a cluster ID and process ID for each job.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process starts and stops all other Condor daemons. condor_master writes a MasterLog and has a .master_address file. Be sure that .master_address contains the correct IP address. If the correct IP address doesn't appear, set the correct value in the parameter<br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has; Condor may have picked a different IP than you wanted, perhaps the local loopback address 127.0.0.1, or a wireless adapter instead of the Ethernet one.<br />
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes, and represents data about submitted jobs. It tracks the queueing of jobs and tries to get resources for all of its jobs to be run. When a job submitted is run, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine when an actual execution of the job is run. It takes care of system calls that need to be executed on the submitting machine for a process. There will be one condor_shadow process for each executing process of a submission machine, so on a machine with a large number of submitted processes, the memory and other resources needed by the shadow daemons can become a limitation.<br />
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution. Log files of interest include NegotiatorLog and MatchLog.<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon is used to detect user activity on an execute node, so Condor knows whether to allow execution of a job or to disallow it because a human user is currently engaged in some task.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, corresponding to [http://en.wikipedia.org/wiki/IA-64 IA64], i.e. the 64-bit Intel Itanium processor. This architecture does not cover all 64-bit Intel processors. <br />
<br />
While trying to run ''condor_master'', the shell returned the error message ''cannot execute binary file''.<br />
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], you can extract the header of an executable and determine whether it can run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-6'''4<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the outputs, we can see that '''Intel IA-64''' is not the right architecture for this machine.<br />
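The same check can be scripted. The following C++ sketch (ours, not part of Condor or binutils) reads the '''e_machine''' field that readelf reports as ''Machine''; it assumes a little-endian ELF header, as in the outputs above:<br />

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Return the target machine of an ELF image, given its first 20 bytes,
// mirroring the "Machine" line printed by readelf -h.  Assumes a
// little-endian header; only the codes relevant here are mapped.
std::string ElfMachine(const unsigned char* header) {
  if (std::memcmp(header, "\x7f" "ELF", 4) != 0) return "not an ELF file";
  // e_machine is a 16-bit field at byte offset 18 of the ELF header.
  uint16_t machine = static_cast<uint16_t>(header[18] | (header[19] << 8));
  switch (machine) {
    case 50: return "Intel IA-64";
    case 62: return "Advanced Micro Devices X86-64";
    default: return "unknown";
  }
}
```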
<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the location of the MPICH2\bin directory to your system path, and add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
The following was performed in a Windows CMD prompt run with Administrative Privileges (right click on the CMD executable and run it as an Administrator).<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's Windows account password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication password for smpd; this is the '''mypassphrase''' you entered above. If all is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processors since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines and how many processes each machine should run, with the executable '''hostname''', which should return the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, you must right click on the project file (not the solution file), then click properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then, to run your exe on two different machines (I had to build a RELEASE build for this; at first I had built a DEBUG build, but this caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
<br />
== MPICH2 and Condor on Windows ==<br />
<br />
To configure MPI over Condor, I set my Windows 7 laptop up as a dedicated Condor scheduler and executor, following the instructions at this link:<br />
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
I turned off smpd.exe as a service, since Condor will manage it:<br />
<br />
smpd.exe -remove<br />
<br />
<br />
I then created the following MPI example program, and ran it with a MachineCount of 2 and 4 at different times (on my laptop, which has 4 cores, with Condor running locally). This program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.<br />
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char *argv[])<br />
{<br />
MPI::Init(argc, argv);<br />
//<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
int size = MPI::COMM_WORLD.Get_size();<br />
char processor_name[MPI_MAX_PROCESSOR_NAME];<br />
int namelen; // length of the processor name, set by Get_processor_name<br />
MPI::Get_processor_name(processor_name,namelen);<br />
// <br />
int buf[1];<br />
int numElements = 1;<br />
// initialize buffer with rank<br />
buf[0] = rank;<br />
// example of synchronous communication<br />
// have even ranks send to odd ranks first<br />
// assumes an even number of processes<br />
int syncTag = 123;<br />
if (rank % 2 == 0)<br />
{<br />
// send to next higher rank<br />
int dest = rank + 1;<br />
std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Send(buf,numElements,MPI::INT,dest,syncTag);<br />
// now wait for dest to send back<br />
// even though dest is the param, in the Recv call it is used as the source<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,dest,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
}<br />
else<br />
{<br />
// source is next lower rank<br />
int source = rank - 1;<br />
std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,source,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
// send rank rather than buffer, just to be sure<br />
// even though param is source, in Send call it is for dest<br />
MPI::COMM_WORLD.Send(&rank,numElements,MPI::INT,source,syncTag);<br />
}<br />
// now for asynchronous communication<br />
std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
int asyncTag = 321;<br />
int leftRank;<br />
if (rank == 0)<br />
{<br />
leftRank = size-1;<br />
}<br />
else <br />
{<br />
leftRank = (rank-1);<br />
}<br />
int rightRank;<br />
if (rank == size-1)<br />
{<br />
rightRank = 0;<br />
}<br />
else <br />
{<br />
rightRank = (rank+1);<br />
}<br />
// everyone sends to the leftRank, and receives from the rightRank<br />
// if this were synchronous, would be in danger of deadlocking because of circular waits<br />
// reset buffers to rank<br />
buf[0] = rank;<br />
std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::Request sendReq = MPI::COMM_WORLD.Isend(buf,numElements,MPI::INT,leftRank,asyncTag);<br />
MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf,numElements,MPI::INT,rightRank,asyncTag);<br />
// wait on the receipt<br />
MPI::Status status;<br />
recvReq.Wait(status);<br />
std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
//<br />
MPI::Finalize();<br />
return(0);<br />
}<br />
<br />
<br />
I then created a two-machine Condor parallel universe grid, using the same user/password combination on both machines; this user had Administrative rights. My 4-core Windows 7 Professional laptop (X86_64,WINNT61) was the '''CONDOR_HOST''' and the '''DedicatedScheduler'''. Assuming this laptop is called '''clavicle''' in the '''shoulder.com''' domain, here are the important configuration parameters in its '''condor_config.local''' file: <br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files (x86)\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD, COLLECTOR, NEGOTIATOR, SMPD_SERVER<br />
<br />
<br />
The second machine, a 2-core Windows XP desktop called '''scapula''' (INTEL,WINNT51), was a dedicated machine with the following important parameters in its '''condor_config.local''' file:<br />
<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, STARTD, SMPD_SERVER<br />
<br />
Note how it points to '''clavicle''' both as '''CONDOR_HOST''' and as the '''DedicatedScheduler''', and that it does not have '''(x86)''' in its path to '''smpd.exe'''.<br />
<br />
MPICH2 gets installed to different locations: '''C:\Program Files\MPICH2''' on an (INTEL,WINNT51) machine, and '''C:\Program Files (x86)\MPICH2''' on a (X86_64,WINNT61) machine. Because I needed to refer to the '''mpiexec.exe''' path in the batch file that my Condor submit file referred to, and this batch file would be the same for both machines, I opted to copy '''mpiexec.exe''' to the work directory I was using on all machines, '''C:\mpich2work'''.<br />
<br />
After setting up this configuration, I rebooted both machines to start in this configuration. At this point, the command<br />
net stop condor<br />
no longer returns correctly, since Condor now manages the '''smpd.exe''' service and does not know how to stop it.<br />
<br />
<br />
<br />
Based on the Condor submit file described in the Condor wiki page, I had to adapt my own submit file for my above MPI executable called '''mpiwork.exe''' to be the following:<br />
<br />
<br />
universe = parallel<br />
executable = mp2script.bat<br />
arguments = mpiwork.exe<br />
machine_count = 6<br />
output = out.$(NODE).log<br />
error = error.$(NODE).log<br />
log = work.$(NODE).log<br />
should_transfer_files = yes<br />
when_to_transfer_output = on_exit<br />
transfer_input_files = mpiwork.exe<br />
Requirements = ((Arch == "INTEL") && (OpSys == "WINNT51")) || ((Arch == "X86_64") && (OpSys == "WINNT61"))<br />
queue<br />
<br />
and my '''mp2script.bat''' file was adapted to be: <br />
<br />
set _CONDOR_PROCNO=%_CONDOR_PROCNO%<br />
set _CONDOR_NPROCS=%_CONDOR_NPROCS%<br />
REM If not the head node, just sleep forever<br />
if not [%_CONDOR_PROCNO%] == [0] copy con nothing<br />
REM Set this to a directory known to contain mpiexec.exe on all machines<br />
set MPDIR="C:\mpich2work"<br />
REM run the actual mpijob<br />
%MPDIR%\mpiexec.exe -n %_CONDOR_NPROCS% -p 6666 %*<br />
exit 0<br />
<br />
with the change being that '''MPDIR''' is a location that is known to have '''mpiexec.exe''' on both machines, since I copied that file there on both machines as described above.<br />
<br />
A point of confusion: I would have expected the '''mpiwork.exe''' code above to print the names of both '''clavicle''' and '''scapula''' when run on 6 nodes; instead only '''clavicle''' appeared. I do believe the job was running on both machines and all 6 cores, as both machines' IP addresses were noted in work.parallel.log, and both machines had CPU activity while the executable ran.<br />
<br />
<br />
== Adding an Additional Compute Node, for MPICH2 and Condor on Windows ==<br />
<br />
Once you have an existing pool and want to add a new node to it, the process is relatively straightforward, provided Condor is already installed on the node. We have found that it isn't necessary for the new node to have the same user account as the existing pool machines in order to run Condor/MPI.<br />
<br />
After you have Condor set up on the new node, stop Condor from a command line started with Administrative privileges:<br />
<br />
net stop condor<br />
<br />
Edit your condor_config.local as shown above for the execute node example, so that this new node is a dedicated execute node.<br />
<br />
Restart Condor:<br />
<br />
net start condor<br />
<br />
<br />
Now create a work directory on the new machine; it is important that this has the same name as the work directory on the other machines, in this example '''C:\mpich2work'''.<br />
<br />
<br />
Copy any needed input files to the work directory.<br />
<br />
Copy the '''mpiexec.exe''' file from the '''MPICH2_INSTALL\bin''' directory to the work directory.<br />
<br />
That is all that is necessary. You can have the actual executable and the driver batch script sent to the new execute node via commands in your Condor submission file. For example, here the executable is '''MetaOptimizer.exe''' and the driver script is '''driver.bat''':<br />
<br />
<br />
universe = parallel<br />
executable = driver.bat<br />
transfer_executable = true<br />
arguments = MetaOptimizer.exe C:\mpich2work\Input1.mhd C:\mpich2work\Input2.mhd C:\mpich2work\output<br />
machine_count = 6<br />
output = metaopt.$(NODE).log<br />
error = metaopt.$(NODE).log<br />
log = metaopt.$(NODE).log<br />
should_transfer_files = yes<br />
when_to_transfer_output = on_exit<br />
transfer_input_files = MetaOptimizer.exe<br />
queue<br />
<br />
== MPI Based Meta-Optimization Framework ==<br />
<br />
=== High Level Architecture ===<br />
<br />
The system was designed for infrastructure reuse. The infrastructure classes take care of organization of the computation nodes and communication between the computation nodes. When a new MetaOptimization is needed, the client programmer should create subclasses of '''MetaOptimizer''', '''JobDefinition''', '''Results''', and '''RegistrationWorker''' specific to the computational problem.<br />
<br />
=== MPI Aware Infrastructure C++ classes ===<br />
<br />
[[File:Mpi metaopt class diagram.png | UML Diagram of C++ classes]]<br />
<br />
'''MasterSlave''': Entry point of the application, initializes the MPI infrastructure, starts up a '''Master''' (sending it a particular '''MetaOptimizer''' subclass), starts up as many '''Slave''' nodes as are necessary (sending each of them a particular '''RegistrationWorker''' subclass), and finalizes the MPI infrastructure when the computation is complete.<br />
<br />
<br />
[[File:Mpimeta start.png | MetaOptmization Framework Startup]]<br />
[[File:Mpimeta computation.png | MetaOptmization Framework Computation]]<br />
<br />
<br />
<br />
<br />
'''Master''': The central controlling part of the infrastructure. It manages the available '''Slave'''s, sending them '''JobDefinition'''s as needed; collects '''Results''' from the '''Slave'''s as they come in; re-runs a '''JobDefinition''' if it has failed or timed out; and shuts down all '''Slave'''s when the computation is finished. The '''Master''' owns a particular '''MetaOptimizer''' subclass. The '''MetaOptimizer''' subclass is aware of the particular computation; the '''Master''' only knows how to communicate with '''Slave'''s and has no responsibility for determining the breakdown of a computation into parts or its completion.<br />
<br />
'''Slave''': One of the processing nodes, part of the infrastructure. Each '''Slave''' manages a '''RegistrationWorker''' subclass. '''Slave'''s announce their presence to the '''Master''', receive a '''JobDefinition''' describing a particular job, pass the '''JobDefinition''' on to the '''RegistrationWorker''' to run the job, receive back a '''Results''' upon completion of the job, and send the '''Results''' back to the '''Master'''. '''Slave'''s are intended to run multiple rounds of registration during a particular computational run.<br />
<br />
=== Infrastructure Event C++ classes ===<br />
<br />
'''MasterEvent''' & '''SlaveEvent''' (named for the origin of the Event) are Infrastructure Event classes, and each holds a Message Wrapper class as a payload.<br />
<br />
The idea behind the job message wrapper classes was to make message wrappers that know how to bi-directionally convert their specific named parameters to array buffers, giving them a pseudo-serialization ability without requiring them to be aware of MPI specifics. Each message wrapper class is associated with a specific Infrastructure Event class ('''SlaveEvent'''s own a '''Results''', and '''MasterEvent'''s own a '''JobDefinition'''). The Infrastructure Event classes take care of sending from '''Master''' classes to '''Slave''' classes, and from '''Slave''' classes to '''Master''' classes.<br />
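As an illustration of this pseudo-serialization idea — with hypothetical field names, since the real wrapper classes are not reproduced here — a message wrapper might flatten its named parameters to a buffer, and restore them, like this:<br />

```cpp
#include <cassert>
#include <vector>

// Hypothetical job-definition wrapper: it knows how to flatten its named
// parameters into a plain array (suitable for an MPI send) and how to
// restore them from such an array, without touching MPI itself.
class ExampleJobDefinition {
public:
  int jobId;
  double stepSize;
  double tolerance;

  // Convert named parameters to a flat buffer of doubles.
  std::vector<double> ToBuffer() const {
    return { static_cast<double>(jobId), stepSize, tolerance };
  }

  // Restore named parameters from the flat buffer.
  void FromBuffer(const std::vector<double>& buf) {
    jobId = static_cast<int>(buf[0]);
    stepSize = buf[1];
    tolerance = buf[2];
  }
};
```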
<br />
The '''MetaOptimizer''' subclasses return '''JobDefinition''' subclasses to the '''Master''', which wraps the '''JobDefinition''' subclass in a '''MasterEvent''' and sends the '''MasterEvent''' to a '''Slave'''. A '''Slave''' reads a '''MasterEvent''', extracts the '''JobDefinition''' subclass, and passes it to a '''RegistrationWorker''' subclass. Upon completion of the calculation for the '''JobDefinition''', the '''RegistrationWorker''' returns a '''Results''' subclass to the '''Slave''', which wraps the '''Results''' in a '''SlaveEvent''' and sends the '''SlaveEvent''' to the '''Master'''. The '''Master''' reads the '''SlaveEvent''', extracts the '''Results''' subclass, and sends it to the '''MetaOptimizer'''.<br />
<br />
=== Computation Specific C++ classes ===<br />
<br />
Note: currently these classes are specific to a particular example computation; the interface needs to be abstracted out, with the example computation extracted and moved to subclasses.<br />
<br />
<br />
<br />
'''JobDefinition''': A Message Wrapper class that defines the parameters of a particular job, a specific subclass of '''JobDefinition''' should be used for a particular computation.<br />
<br />
'''Results''': A Message Wrapper class that defines the return values of a particular job, a specific subclass of '''Results''' should be used for a particular computation.<br />
<br />
'''MetaOptimizer''': The '''MetaOptimizer''' subclass for a particular computation takes care of determining the parameters for a particular job (in the form of a '''JobDefinition'''), and determining how many '''JobDefinition'''s will be in a round of computation. The '''MetaOptimizer''' then takes in the '''Results''' from each '''JobDefinition''', and upon receiving '''Results''' for each '''JobDefinition''' in a round, can either determine a new round of computation by recalculating the new set of parameters for the new round, or else terminate the computation.<br />
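The round-based control flow described above can be sketched as follows. This is a toy stand-in, not the actual interface: names are hypothetical, the MPI messaging is replaced by a direct function call, and the "optimization" is a trivial interval-shrinking search over f(x) = (x - 3)²:<br />

```cpp
#include <cassert>
#include <cmath>

// Stand-in for a RegistrationWorker: evaluate one JobDefinition (here,
// just a parameter value) and return its Results (a score to minimize).
double RunWorker(double x) { return (x - 3.0) * (x - 3.0); }

// Stand-in for the MetaOptimizer's round loop: each round generates a set
// of JobDefinitions, collects all Results, then either computes a new
// round of parameters or (here, after a fixed number of rounds) stops.
double MetaOptimize(double lo, double hi, int rounds, int jobsPerRound) {
  double best = lo;
  for (int r = 0; r < rounds; ++r) {
    double bestScore = RunWorker(best);
    // One round: a JobDefinition per sample point; Results collected
    // before deciding the next round's parameters.
    for (int j = 0; j <= jobsPerRound; ++j) {
      double x = lo + (hi - lo) * j / jobsPerRound;
      double score = RunWorker(x);
      if (score < bestScore) { bestScore = score; best = x; }
    }
    // Next round searches a narrower interval around the best parameter.
    double width = (hi - lo) / 4.0;
    lo = best - width;
    hi = best + width;
  }
  return best;
}
```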
<br />
'''RegistrationWorker''': The '''RegistrationWorker''' subclass for a particular computation will take in a particular '''JobDefinition''' subclass germane to the computation, perform the actual registration computation, then return the results in the form of a specific '''Results''' subclass that is relevant to the computation.</div>
<hr />
<div>= Introduction =<br />
<br />
Condor is an open source distributed computing software framework. It can be used to manage workload on a dedicated cluster of computers, and/or to farm out work to idle desktop computers. Condor is a cross-platform system that can be run on Unix and Windows operating system. Condor is a complex and flexible system that can execute jobs in serial and parallel mode. For parallel jobs, it supports the MPI standard. This Wikipage is dedicated to document our working experience using Condor.<br />
<br />
= Downloading Condor =<br />
<br />
Different versions of condor can be downloaded from [http://www.cs.wisc.edu/condor/downloads-v2/download.pl here]. This documentation focuses on our experience installing/configuring Condor Version 7.2.0. The official detail documentation for this version can be found [http://www.cs.wisc.edu/condor/manual/v7.2/ here]<br />
<br />
= Preparation =<br />
<br />
As Condor is a flexible system, there are different ways of configuring condor in your computing infrastructure. Hence, before starting installation, make the following important decisions.<br />
<br />
#What machine will be the central manager?<br />
#What machines should be allowed to submit jobs?<br />
#Will Condor run as root or not?<br />
#Do I have enough disk space for Condor?<br />
#Do I need MPI configured?<br />
<br />
Condor can be installed as either a ''manager'' node, a ''execute'' or a ''submit'' node. Or any combination of these ones. See [http://www.cs.wisc.edu/condor/manual/v7.2/3_1Introduction.html#SECTION00411000000000000000 The Different Roles a Machine Can Play]<br />
<br />
* Manager: There can be only one central manager for your pool. The machine is the collector of information, and the negotiator between resources and resource request<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
''' The official instructions on how to install Condor in Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here] '''. Below we present some of tweaks we had to do to get it to work on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.). If it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
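This prerequisite can be checked with a short script; here is a minimal sketch (it only inspects the output of the standard ''hostname'' command, as described above):<br />
<br />
```shell
# Warn if the hostname has no domain part (i.e. is not fully qualified).
host=$(hostname)
case "$host" in
  *.*) echo "OK: $host is fully qualified" ;;
  *)   echo "WARNING: $host has no domain part; edit /etc/hosts and /etc/hostname" ;;
esac
```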
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' (platform: RHEL 5, Intel x86/64). See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a command similar to the following to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or as a user with equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (For example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation script ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update the PATH variable to include the directories ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save file and apply the change by running <br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out / log in or even restart the machine, and you should be able to check that the CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit the condor manager config file and update the lines as referenced below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, in order to allow MIDAS to run condor commands, create a link to /root/condor/etc/condor_config in /home/condor<br />
 cd /home/condor<br />
 ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring an Executer/Submitter in Unix ===<br />
The files that allow the server to also be used as a condor submitter/executer were automatically updated while running the installation script ''condor_install''. Nevertheless, you still need to update its configuration file.<br />
<br />
* Edit the node's condor_config.local and update the lines as referenced below:<br />
vi /home/condor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the lines containing ''UID_DOMAIN'' and ''FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor on Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor on Windows 7.<br />
<br />
# Download the Windows install MSI and run it, installing to "C:\condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager ( full address ).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For accounting domain enter your domain (e.g. yourdomaininternal.com)<br />
# For Email settings, I skipped this by clicking Next<br />
# For Java settings, I skipped this by clicking Next, as we weren't using Java<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked whether you want a custom install or the default install, choose the default. This will install condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When running condor commands, start a Cygwin or cmd prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix up some problems. My '''condor_status''' at first gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect (since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or perhaps just an IP that you would not want).<br />
<br />
<br />
I shut down condor, right-clicked on C:\condor in Windows Explorer, turned off "read only", and set permissions to allow writing. Then I edited the file "C:\condor\condor_config.local" (which started out empty) so that it could pick up some replacement values, since some of them didn't seem to be set properly during the install. These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following:<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourdomaininternal.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
Then I restarted condor. If you run "condor_status" and see that the Windows machine has Activity=OWNER rather than UNCLAIMED, be sure that you have added START=True. This may not be the best configuration for a Windows workstation that is in use, though; some additional configuration is probably needed to make sure a Condor job doesn't use the machine while a human user is there using it.<br />
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here are the contents of the batch file for the actual job (printname.bat)<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, get a command prompt or Cygwin terminal, right click the icon to start it up, and click "run with elevated privileges" or "run as administrator".<br />
<br />
condor_master runs as a service on Windows, which controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows for the first time, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* Run the condor manager <br />
condor_master<br />
<br />
* Assuming that during installation you set up the type as ''manager,execute,submit'' (the default), run the following command<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after starting condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Set up condor to start automatically at boot <br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update the MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add the ''condor.boot'' service to all runlevels<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
#include <unistd.h><br />
#include <stdio.h><br />
int main( int argc, char** argv )<br />
{<br />
printf( "%s\n", argv[1] );<br />
fflush( stdout );<br />
sleep( 30 );<br />
return 0; <br />
}<br />
<br />
This exe will print the command-line argument it is given, wait 30 seconds, then exit.<br />
<br />
Save this file as foo.c, then compile it as follows (note the static linking):<br />
<br />
gcc foo.c -o foo --static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
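The job log can also be mined with standard tools. As a sketch, the following creates a sample log line in the style of the 7.2 user log (the cluster ID and address here are made up for illustration) and extracts the execute host; in practice you would grep the real condorjob.log:<br />
<br />
```shell
# Write a sample user-log line (illustrative values), then pull out the
# address of the machine that executed the job.
echo '001 (025.000.000) 06/24 18:51:09 Job executing on host: <10.171.1.124:9618>' > condorjob.log
grep 'executing on host' condorjob.log | sed 's/.*host: //'
```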
<br />
<br />
If you want to test this job on multiple slots (say 2 at once, so you can see how Condor will execute the job on multiple execute resources), you can change the condorjob file to be like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
<br />
We had a case where we had 6 slots: 2 were 32-bit with Arch=INTEL, and 4 were 64-bit with Arch=X86_64 (but we were unaware of the difference at first). We ran 6 jobs and then wondered why they would only execute on the submitting machine. So we changed the condorjob file to specify a certain architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told condor to execute only on machines with the INTEL architecture, so it did not attempt to execute on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. I've included this story in case it helps with debugging.<br />
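One way to avoid this kind of mismatch is to constrain both architecture and operating system in the submit file. This is only a sketch; adjust the values to match what ''condor_status'' reports for your pool:<br />
<br />
```
Requirements = (Arch == "X86_64") && (OpSys == "LINUX")
```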
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some of the lessons we learned.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was ''condor''.<br />
*When building BatchMake, you need to build with grid support turned on<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows you which of the processes you have submitted to condor are active. It will give you a cluster ID and process ID for each process.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process will start and stop all other condor daemons. condor_master will write a MasterLog and have a .master_address file. Be sure that the .master_address contains the correct IP address. If the correct IP address doesn't appear, set the correct value in the parameter <br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has, but condor may have picked a different IP than you would have wanted, perhaps the local loopback address 127.0.0.1 or a wireless instead of ethernet adapter.<br />
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes, and represents data about submitted jobs. It tracks the queueing of jobs and tries to acquire resources for all of its jobs. When a submitted job runs, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine when an actual execution of the job is run. It takes care of system calls that need to be executed on the submitting machine for a process. There will be a condor_shadow process for each executing process of a submission machine, meaning that on a machine with a large number of submitted processes, the number of shadow daemons supportable by memory or other resources could be a limitation.<br />
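If the number of shadow daemons becomes a concern on a busy submit machine, the schedd can be told to cap the number of simultaneously running jobs (and hence shadows) via its MAX_JOBS_RUNNING setting. A sketch for condor_config.local; the value here is illustrative, not a recommendation:<br />
<br />
```
MAX_JOBS_RUNNING = 200
```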
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution. Log files of interest include NegotiatorLog and MatchLog<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon is used to detect user activity on an execute node, so Condor knows whether to allow execution of a job or to disallow it because a human user is currently engaged in some task.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, which corresponds to [http://en.wikipedia.org/wiki/IA-64 IA-64], the 64-bit Intel Itanium processor. This does not cover all 64-bit Intel processors.<br />
<br />
While trying to run ''condor_master'', the shell returned the following error message: ''cannot execute binary file''<br />
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], it's possible to extract the header of an executable and determine whether it can run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the different outputs, it's possible to observe that the architecture '''Intel IA-64''' isn't the right one.<br />
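If readelf is not available, the same information can be read directly, since e_machine is a two-byte value at offset 18 of the ELF header. A sketch (assuming a little-endian host, so od's native byte order matches the file's encoding; /bin/ls is used here just as a known-native binary):<br />
<br />
```shell
# Read the two-byte e_machine field at offset 18 of the ELF header.
# Per the ELF specification: 62 = x86-64 (AMD64), 50 = Intel IA-64.
machine=$(od -An -tu2 -j18 -N2 /bin/ls | tr -d ' ')
echo "e_machine of /bin/ls is $machine"
```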
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the location of the MPICH2\bin directory to your system PATH. Add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
The following was performed in a Windows CMD prompt run with Administrative Privileges (right click on the CMD executable and run it as an Administrator).<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's Windows account password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication password for smpd; this is the '''mypassphrase''' you entered above. If everything is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processors since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines, and how many processes each machine should run, and the executable '''hostname''', which should return the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, you must right click on the project file (not the solution file), then click Properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then, to run your exe on two different machines (I had to build a RELEASE build for this; at first I had built a DEBUG build, but this caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
<br />
== MPICH2 and Condor on Windows ==<br />
<br />
To configure MPI over Condor, I set up my Windows 7 laptop as a dedicated Condor scheduler and executer, following the instructions at this link:<br />
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
I turned off smpd.exe as a service, since Condor will manage it:<br />
<br />
smpd.exe -remove<br />
<br />
<br />
I then created the following MPI example program, and ran it with a MachineCount of 2 and of 4 at different times (I ran this on my laptop, which has 4 cores and had Condor running locally). This program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.<br />
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char *argv[])<br />
{<br />
MPI::Init(argc, argv);<br />
//<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
int size = MPI::COMM_WORLD.Get_size();<br />
char processor_name[MPI_MAX_PROCESSOR_NAME];<br />
int namelen;<br />
MPI::Get_processor_name(processor_name,namelen);<br />
// <br />
int buf[1];<br />
int numElements = 1;<br />
// initialize buffer with rank<br />
buf[0] = rank;<br />
// example of synchronous communication<br />
// have even ranks send to odd ranks first<br />
// assumes an even number of processes<br />
int syncTag = 123;<br />
if (rank % 2 == 0)<br />
{<br />
// send to next higher rank<br />
int dest = rank + 1;<br />
std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Send(buf,numElements,MPI::INT,dest,syncTag);<br />
// now wait for dest to send back<br />
// even though dest is the param, in the Recv call it is used as the source<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,dest,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
}<br />
else<br />
{<br />
// source is next lower rank<br />
int source = rank - 1;<br />
std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,source,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
// send rank rather than buffer, just to be sure<br />
// even though param is source, in Send call it is for dest<br />
MPI::COMM_WORLD.Send(&rank,numElements,MPI::INT,source,syncTag);<br />
}<br />
// now for asynchronous communication<br />
std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
int asyncTag = 321;<br />
int leftRank;<br />
if (rank == 0)<br />
{<br />
leftRank = size-1;<br />
}<br />
else <br />
{<br />
leftRank = (rank-1);<br />
}<br />
int rightRank;<br />
if (rank == size-1)<br />
{<br />
rightRank = 0;<br />
}<br />
else <br />
{<br />
rightRank = (rank+1);<br />
}<br />
// everyone sends to the leftRank, and receives from the rightRank<br />
// if this were synchronous, would be in danger of deadlocking because of circular waits<br />
// reset buffers to rank<br />
buf[0] = rank;<br />
std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::Request sendReq = MPI::COMM_WORLD.Isend(buf,numElements,MPI::INT,leftRank,asyncTag);<br />
MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf,numElements,MPI::INT,rightRank,asyncTag);<br />
// wait on the receipt<br />
MPI::Status status;<br />
recvReq.Wait(status);<br />
std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
//<br />
MPI::Finalize();<br />
return(0);<br />
}<br />
<br />
<br />
I then created a two-machine Condor parallel universe grid, using the same user/password combination on both machines; this user had Administrative rights. My 4-core Windows 7 Professional laptop (X86_64,WINNT61) was the '''CONDOR_HOST''' and the '''DedicatedScheduler'''. Assuming this laptop is called '''clavicle''' in the '''shoulder.com''' domain, here are the important configuration parameters in its '''condor_config.local''' file: <br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files (x86)\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD, COLLECTOR, NEGOTIATOR, SMPD_SERVER<br />
<br />
<br />
The second machine, a 2 core Windows XP desktop called '''scapula''' (INTEL,WINNT51) was a dedicated machine with the following important parameters in its '''condor_config.local''' file:<br />
<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, STARTD, SMPD_SERVER<br />
<br />
Note how it points to '''clavicle''' both as '''CONDOR_HOST''' and as the '''DedicatedScheduler''', and that it does not have '''(x86)''' in its path to the '''smpd.exe'''.<br />
<br />
MPICH2 gets installed to different locations, to '''C:\Program Files\MPICH2''' for an (INTEL,WINNT51) machine, and to '''C:\Program Files (x86)\MPICH2''' for a (X86_64,WINNT61) machine. Because I needed to refer to the '''mpiexec.exe''' path in the batch file that my Condor submit file referred to, and this batch file would be the same for both machines, I opted to copy '''mpiexec.exe''' to the work directory I was using on all machines, '''C:\mpich2work'''.<br />
<br />
After setting up this configuration, I rebooted both machines to start in this configuration. At this point, the command<br />
net stop condor<br />
no longer returns correctly, since Condor is managing the '''smpd.exe''' service now and does not know how to stop that service.<br />
<br />
<br />
<br />
Based on the Condor submit file described in the Condor wiki page, I had to adapt my own submit file for my above MPI executable called '''mpiwork.exe''' to be the following:<br />
<br />
<br />
universe = parallel<br />
executable = mp2script.bat<br />
arguments = mpiwork.exe<br />
machine_count = 6<br />
output = out.$(NODE).log<br />
error = error.$(NODE).log<br />
log = work.$(NODE).log<br />
should_transfer_files = yes<br />
when_to_transfer_output = on_exit<br />
transfer_input_files = mpiwork.exe<br />
Requirements = ((Arch == "INTEL") && (OpSys == "WINNT51")) || ((Arch == "X86_64") && (OpSys == "WINNT61"))<br />
queue<br />
<br />
and my '''mp2script.bat''' file was adapted to be: <br />
<br />
set _CONDOR_PROCNO=%_CONDOR_PROCNO%<br />
set _CONDOR_NPROCS=%_CONDOR_NPROCS%<br />
REM If not the head node, just sleep forever<br />
if not [%_CONDOR_PROCNO%] == [0] copy con nothing<br />
REM Set this to the bin directory of MPICH installation<br />
set MPDIR="C:\mpich2work"<br />
REM run the actual mpijob<br />
%MPDIR%\mpiexec.exe -n %_CONDOR_NPROCS% -p 6666 %*<br />
exit 0<br />
<br />
with the change being that '''MPDIR''' is a location that is known to have '''mpiexec.exe''' on both machines, since I copied that file there on both machines as described above.<br />
<br />
A point of confusion: I would have expected the '''mpiwork.exe''' code above to print the names of both '''clavicle''' and '''scapula''' in the output when I ran it on 6 nodes. Instead only the name '''clavicle''' appeared. I do believe the job was running on both machines and all 6 cores, as both machines' IP addresses were noted in the work.parallel.log, and both machines showed CPU activity at the time the executable ran.<br />
<br />
<br />
== Adding an Additional Compute Node, for MPICH2 and Condor on Windows ==<br />
<br />
Once you have an existing pool and want to add a new node to it, the process is relatively straightforward after Condor is initially installed on the node. We have found that the new node does not need the same user account as the existing pool machines in order to run Condor/MPI.<br />
<br />
After you have Condor set up initially on the new node, stop condor via a command line started with Administrative privileges:<br />
<br />
net stop condor<br />
<br />
Edit your condor_config.local as shown above for the execute node example, so that this new node is a dedicated execute node.<br />
<br />
Restart Condor:<br />
<br />
net start condor<br />
<br />
<br />
Now create a work directory on the new machine; it is important that this has the same name as the work directory on the other machines, in this example '''C:\mpich2work'''.<br />
<br />
<br />
Copy any needed input files to the work directory.<br />
<br />
Copy the '''mpiexec.exe''' file from the '''MPICH2_INSTALL\bin''' directory to the work directory.<br />
<br />
That is all that is necessary. You can have the actual executable and the driver batch script both sent to the new execute node via commands in your Condor submission file. For example, here the executable is '''MetaOptimizer.exe''' and the driver script is '''driver.bat''':<br />
<br />
<br />
universe = parallel<br />
executable = driver.bat<br />
transfer_executable = true<br />
arguments = MetaOptimizer.exe C:\mpich2work\Input1.mhd C:\mpich2work\Input2.mhd C:\mpich2work\output<br />
machine_count = 6<br />
output = metaopt.$(NODE).log<br />
error = metaopt.$(NODE).log<br />
log = metaopt.$(NODE).log<br />
should_transfer_files = yes<br />
when_to_transfer_output = on_exit<br />
transfer_input_files = MetaOptimizer.exe<br />
queue<br />
<br />
== MPI Based Meta-Optimization Framework ==<br />
<br />
=== High Level Architecture ===<br />
<br />
The system was designed for infrastructure reuse. The infrastructure classes take care of organization of the computation nodes and communication between the computation nodes. When a new MetaOptimization is needed, the client programmer should create subclasses of '''MetaOptimizer''', '''JobDefinition''', '''Results''', and '''RegistrationWorker''' specific to the computational problem.<br />
<br />
=== MPI Aware Infrastructure C++ classes ===<br />
<br />
[[File:Mpi metaopt class diagram.png | UML Diagram of C++ classes]]<br />
<br />
'''MasterSlave''': Entry point of the application, initializes the MPI infrastructure, starts up a '''Master''' (sending it a particular '''MetaOptimizer''' subclass), starts up as many '''Slave''' nodes as are necessary (sending each of them a particular '''RegistrationWorker''' subclass), and finalizes the MPI infrastructure when the computation is complete.<br />
<br />
<br />
[[File:Mpimeta start.png | MetaOptimization Framework Startup]]<br />
[[File:Mpimeta computation.png | MetaOptimization Framework Computation]]<br />
<br />
<br />
<br />
<br />
'''Master''': The central controlling part of the infrastructure. It manages the available '''Slave'''s, sends them '''JobDefinition'''s as needed, collects '''Results''' from the '''Slave'''s as they come in, re-runs a '''JobDefinition''' if it has failed or timed out, and shuts down all '''Slave'''s when the computation is finished. The '''Master''' owns a particular '''MetaOptimizer''' subclass. The '''MetaOptimizer''' subclass is aware of the particular computation; the '''Master''' only knows how to communicate with '''Slave'''s and has no responsibility for determining the breakdown of a computation into parts or its completion.<br />
<br />
'''Slave''': One of the processing nodes, part of the infrastructure. Each '''Slave''' manages a '''RegistrationWorker''' subclass. '''Slave'''s announce their presence to the '''Master''', receive a '''JobDefinition''' describing a particular job, pass the '''JobDefinition''' on to the '''RegistrationWorker''' to run the job, receive back a '''Results''' upon completion of the job, and send the '''Results''' back to the '''Master'''. '''Slave'''s are intended to run multiple rounds of registration during a particular computational run.<br />
<br />
=== Infrastructure Event C++ classes ===<br />
<br />
'''MasterEvent''' & '''SlaveEvent''' (named for the origin of the Event) are Infrastructure Event classes, and each holds a Message Wrapper class as a payload.<br />
<br />
The idea behind the job message wrapper classes was to make message wrappers that knew how to bi-directionally convert their specific named parameters to array buffers, thus giving them a pseudo-serialization ability without having them to be aware of MPI specifics. Each message wrapper class is associated with a specific Infrastructure Event class ('''SlaveEvent'''s own a '''Results''', and '''MasterEvent'''s own a '''JobDefinition'''). The Infrastructure Event classes take care of sending from '''Master''' classes to '''Slave''' classes, and from '''Slave''' classes to '''Master''' classes.<br />
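<br />
To make the pseudo-serialization idea concrete, a message wrapper might convert its named parameters to and from a flat buffer along the lines of the following sketch (the class and member names here are hypothetical illustrations, not the framework's actual code):<br />

```cpp
#include <cassert>
#include <vector>

// Hypothetical Results-style message wrapper: named parameters on one side,
// a flat buffer suitable for an MPI send on the other. The wrapper itself
// knows nothing about MPI; the Event classes own the actual send/receive.
class SampleResults
{
public:
  SampleResults() : JobId(0), Metric(0.0), Iterations(0) {}

  int    JobId;      // which JobDefinition these results answer
  double Metric;     // e.g. final registration metric value
  int    Iterations; // e.g. optimizer iterations used

  // Pack the named parameters into a flat double buffer.
  std::vector<double> ToBuffer() const
  {
    std::vector<double> buf(3);
    buf[0] = static_cast<double>(this->JobId);
    buf[1] = this->Metric;
    buf[2] = static_cast<double>(this->Iterations);
    return buf;
  }

  // Restore the named parameters from a flat double buffer.
  void FromBuffer(const std::vector<double>& buf)
  {
    assert(buf.size() == 3);
    this->JobId      = static_cast<int>(buf[0]);
    this->Metric     = buf[1];
    this->Iterations = static_cast<int>(buf[2]);
  }
};
```

An Infrastructure Event class could then hand the packed buffer to an MPI send with type MPI::DOUBLE and call FromBuffer() on the receiving side, so the wrapper itself never touches MPI.<br />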
<br />
The '''MetaOptimizer''' subclasses return '''JobDefinition''' subclasses to the '''Master''', which wraps each '''JobDefinition''' subclass in a '''MasterEvent''' and sends the '''MasterEvent''' to a '''Slave'''. A '''Slave''' reads a '''MasterEvent''', extracts the '''JobDefinition''' subclass, and passes the '''JobDefinition''' subclass to its '''RegistrationWorker''' subclass. Upon completion of the calculation for the '''JobDefinition''', the '''RegistrationWorker''' returns a '''Results''' subclass to the '''Slave''', which then wraps the '''Results''' in a '''SlaveEvent''' and sends the '''SlaveEvent''' to the '''Master'''. The '''Master''' reads the '''SlaveEvent''', extracts the '''Results''' subclass, and sends this '''Results''' subclass to the '''MetaOptimizer'''.<br />
<br />
=== Computation Specific C++ classes ===<br />
<br />
Note: currently these classes are specific to a particular example computation; they need to have the interface abstracted out, and the specific example computation extracted and moved to subclasses.<br />
<br />
<br />
<br />
'''JobDefinition''': A Message Wrapper class that defines the parameters of a particular job; a specific subclass of '''JobDefinition''' should be used for each computation.<br />
<br />
'''Results''': A Message Wrapper class that defines the return values of a particular job; a specific subclass of '''Results''' should be used for each computation.<br />
<br />
'''MetaOptimizer''': The '''MetaOptimizer''' subclass for a particular computation takes care of determining the parameters for a particular job (in the form of a '''JobDefinition'''), and determining how many '''JobDefinition'''s will be in a round of computation. The '''MetaOptimizer''' then takes in the '''Results''' from each '''JobDefinition''', and upon receiving '''Results''' for each '''JobDefinition''' in a round, can either determine a new round of computation by recalculating the new set of parameters for the new round, or else terminate the computation.<br />
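<br />
Stripped of the MPI and Master/Slave plumbing, the round-based control flow described above might look like the following sketch (a hypothetical 1-D parameter sweep; the class and method names are illustrative, not the framework's actual interface):<br />

```cpp
#include <cstddef>
#include <vector>

// Hypothetical JobDefinition/Results pair for a 1-D parameter sweep.
struct SweepJob    { int jobId; double parameter; };
struct SweepResult { int jobId; double metric; };

// Hypothetical MetaOptimizer subclass: proposes a round of jobs, collects
// their results, then either narrows the search or declares convergence.
class SweepMetaOptimizer
{
public:
  SweepMetaOptimizer(double lo, double hi, double tol)
    : Low(lo), High(hi), Tolerance(tol), NextJobId(0) {}

  // Called by the Master at the start of a round: one job per sample point.
  std::vector<SweepJob> StartRound(std::size_t samples)
  {
    this->Pending.clear();
    this->Collected.clear();
    for (std::size_t i = 0; i < samples; ++i)
    {
      double p = this->Low + (this->High - this->Low) * i / (samples - 1);
      SweepJob job = { this->NextJobId++, p };
      this->Pending.push_back(job);
    }
    return this->Pending;
  }

  // Called by the Master as each Slave's Results come back.
  void AcceptResult(const SweepResult& r) { this->Collected.push_back(r); }

  bool RoundComplete() const
  { return this->Collected.size() == this->Pending.size(); }

  // After a full round: center a half-width interval on the best sample,
  // and report whether the search interval is now small enough to stop.
  bool Converged()
  {
    // Results may arrive in any order, so match the best result
    // back to its JobDefinition by jobId.
    const SweepResult* best = &this->Collected[0];
    for (std::size_t i = 1; i < this->Collected.size(); ++i)
      if (this->Collected[i].metric < best->metric)
        best = &this->Collected[i];

    double bestParam = this->Pending[0].parameter;
    for (std::size_t i = 0; i < this->Pending.size(); ++i)
      if (this->Pending[i].jobId == best->jobId)
        bestParam = this->Pending[i].parameter;

    double width = (this->High - this->Low) / 2.0;
    this->Low  = bestParam - width / 2.0;
    this->High = bestParam + width / 2.0;
    return width < this->Tolerance;
  }

private:
  double Low, High, Tolerance;
  int NextJobId;
  std::vector<SweepJob> Pending;
  std::vector<SweepResult> Collected;
};
```

Here the Master would call StartRound(), ship each SweepJob to a Slave, feed each returned SweepResult to AcceptResult(), and once RoundComplete() is true ask Converged() whether to start another, narrower round or terminate.<br />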
<br />
'''RegistrationWorker''': The '''RegistrationWorker''' subclass for a particular computation will take in a particular '''JobDefinition''' subclass germane to the computation, perform the actual registration computation, then return the results in the form of a specific '''Results''' subclass that is relevant to the computation.</div>
<hr />
<div>
Condor can be installed as a ''manager'' node, an ''execute'' node, or a ''submit'' node, or any combination of these roles. See [http://www.cs.wisc.edu/condor/manual/v7.2/3_1Introduction.html#SECTION00411000000000000000 The Different Roles a Machine Can Play]<br />
<br />
* Manager: There can be only one central manager for your pool. This machine is the collector of information and the negotiator between resources and resource requests.<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer to the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
''' The official instructions on how to install Condor on Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here] '''. Below we present some of the tweaks we had to make to get it working on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.) , if it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a similar command to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or as a user with equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (For example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation scripts ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update PATH variable to include the directory ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save the file and apply the change by running <br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out and log back in, or even restart the machine, and check that the CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit the Condor manager configuration file and update the lines shown below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, then in order to allow MIDAS to run Condor commands, create a link in /home/condor pointing to /root/condor/etc/condor_config<br />
cd /home/condor<br />
ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring an Executer/Submitter in Unix ===<br />
The files allowing the server to also be used as a Condor submitter/executer were automatically updated while running the installation script ''condor_install''. Nevertheless, you still need to update its configuration file.<br />
<br />
* Edit the node's condor_config.local file and update the line shown below:<br />
vi /home/condor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the lines containing ''UID_DOMAIN'' and ''FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor in Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor in Windows 7.<br />
<br />
# Download the Windows install MSI, run it, installing to "C:/condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager ( full address ).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For accounting domain enter your domain (e.g. yourdomaininternal.com)<br />
# For Email settings (I ignored this by clicking next)<br />
# For Java settings (I ignored this as we weren't using Java, by clicking next)<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked if you want a custom install or install, choose install. This will install condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When running Condor commands, start a Cygwin or cmd prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix up some problems. My '''condor_status''' at first gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect (since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or perhaps just an IP that you would not want).<br />
<br />
<br />
I shut down condor, right clicked on C:/condor in a Windows explorer, turned off "read only" and set permissions to allow for writing. Then I edited the file "c:/condor/condor_config.local" which started out empty, so that it could pick up some replacement values (some of them didn't seem to be set properly during the install). These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following:<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourinternaldomain.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
and then restarted condor. If you run "condor_status" and see that the Windows machine has Activity=OWNER rather than UNCLAIMED, be sure that you have added in START=True. But this may not be the best configuration for a Windows workstation that is in use; some additional configuration is needed to keep Condor jobs off the machine while a physical human user is there using it.<br />
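<br />
As a starting point for such a policy, something along the following lines is a reasonable sketch (the thresholds here are arbitrary examples; '''KeyboardIdle''' is measured in seconds, and on Windows it depends on the condor_kbdd daemon described later on this page):<br />

```
# Only start a job after 15 minutes with no local keyboard/mouse
# activity and a low load average; suspend it if the user returns.
START    = KeyboardIdle > 900 && LoadAvg < 0.3
SUSPEND  = KeyboardIdle < 60
CONTINUE = KeyboardIdle > 300
PREEMPT  = False
```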
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here is the batch file contents for the actual job (printname.bat)<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, get a command prompt or Cygwin terminal, right click the icon to start it up, and click "run with elevated privileges" or "run as administrator".<br />
<br />
condor_master runs as a service on Windows, which controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows, you'll first need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* run the condor manager <br />
condor_master<br />
<br />
* Assuming that during installation you set the type to ''manager,execute,submit'' (the default), run the following command<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after you started condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Set up Condor to start automatically at boot <br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add the ''condor.boot'' service to all runlevels<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
#include <unistd.h><br />
#include <stdio.h><br />
int main( int argc, char** argv )<br />
{<br />
printf( "%s\n", argv[1] );<br />
fflush( stdout );<br />
sleep( 30 );<br />
return 0; <br />
}<br />
<br />
This program echoes the command-line argument it is given, waits 30 seconds, then exits.<br />
<br />
Save this file as foo.c, then compile it as follows (note the static linking):<br />
<br />
gcc foo.c -o foo --static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once so you can see how Condor will execute the job on multiple execute resources), you can <br />
change the condorjob file to be like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
<br />
We had a case with 6 slots: 2 were 32-bit with Arch=INTEL, and 4 were 64-bit with Arch=X86_64 (but we were unaware of the difference at first). We ran 6 jobs and then wondered why they would only execute on the submitting machine. So we changed the condorjob file to specify a certain architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told Condor to execute only on machines with an Architecture of Intel, so it did not attempt to execute on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. I've included this story in case it helps with debugging.<br />
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was ''condor''.<br />
*When building BatchMake, you need to build with grid support enabled.<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows the active jobs that you have submitted to Condor. It will give you a cluster ID and process ID for each job.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process will start and stop all other condor daemons. condor_master will write a MasterLog and have a .master_address file. Be sure that the .master_address contains the correct IP address. If the correct IP address doesn't appear, set the correct value in the parameter <br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has, but Condor may have picked a different IP than you wanted, perhaps the local loopback address 127.0.0.1 or a wireless adapter instead of the Ethernet one.<br />
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes, and holds the data about submitted jobs. It tracks the queueing of jobs and tries to obtain resources so that all of its jobs can run. When a submitted job runs, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine while an actual execution of the job is running. It takes care of system calls that need to be executed on the submitting machine for a process. There will be one condor_shadow process for each executing process of a submission machine, so on a machine with a large number of submitted processes, the number of shadow daemons that memory and other resources can support may become a limitation.<br />
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution node. Log files of interest include NegotiatorLog and MatchLog.<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon is used to detect user activity on an execute node, so Condor knows whether to allow execution of a job or to disallow it because a user is currently engaged in some task.<br />
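As an illustration of how these daemons map onto machine roles, the DAEMON_LIST parameter in condor_config.local selects which daemons condor_master starts. The lists below are only a sketch, consistent with the example configuration files shown later on this page; adapt them (for example, adding SMPD_SERVER for MPI) to your own pool.<br />
<br />

```
# Central manager that also submits and executes jobs
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD

# Dedicated execute node
DAEMON_LIST = MASTER, STARTD

# Submit-only node
DAEMON_LIST = MASTER, SCHEDD
```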
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, corresponding to [http://en.wikipedia.org/wiki/IA-64 IA-64], the 64-bit Intel Itanium architecture. This is not the architecture of common 64-bit Intel processors, which use x86-64. <br />
<br />
While trying to run ''condor_master'', the shell returned the error message ''cannot execute binary file''.<br />
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], it's possible to read the header of an executable and determine whether it could run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
 Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the different outputs shows that '''Intel IA-64''' is not the right architecture for this machine; the x86-64 ('''Advanced Micro Devices X86-64''') package, which matches /bin/ls, is the correct one.<br />
<br />
<br />
* Be sure that your executable is statically linked.<br />
* For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was ''condor''.<br />
* When building BatchMake, you need to build with grid support enabled.<br />
* Condor supplies a number of utility programs and log files; these are extremely helpful in understanding and correcting problems.<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the location of the MPICH2\bin directory to your system path. Add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
The following was performed in a Windows CMD prompt run with Administrative Privileges (right click on the CMD executable and run it as an Administrator).<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's Windows account password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication password for smpd; this is the '''mypassphrase''' you entered above. If all is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processors since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines, and how many processes each machine should run, and the executable '''hostname''', which should return the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, you must right click on the project file (not the solution file), then click properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then to run your exe on two different machines (I had to build a RELEASE for this, at first I had built a DEBUG build, but this caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
<br />
== MPICH2 and Condor on Windows ==<br />
<br />
To configure MPI over Condor, I set up my Windows 7 laptop as a dedicated Condor scheduler and executor, following the instructions at this link:<br />
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
I turned off smpd.exe as a service, since Condor will manage it:<br />
<br />
smpd.exe -remove<br />
<br />
<br />
I then created the following MPI example program, and ran it with a MachineCount of 2 and of 4 at different times (I ran this on my laptop, which has 4 cores and had Condor running locally). This program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.<br />
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char *argv[])<br />
{<br />
MPI::Init(argc, argv);<br />
//<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
int size = MPI::COMM_WORLD.Get_size();<br />
 char processor_name[MPI_MAX_PROCESSOR_NAME];<br />
 int namelen; // receives the length written by Get_processor_name<br />
 MPI::Get_processor_name(processor_name,namelen);<br />
// <br />
int buf[1];<br />
int numElements = 1;<br />
// initialize buffer with rank<br />
buf[0] = rank;<br />
// example of synchronous communication<br />
// have even ranks send to odd ranks first<br />
// assumes an even number of processes<br />
int syncTag = 123;<br />
if (rank % 2 == 0)<br />
{<br />
// send to next higher rank<br />
int dest = rank + 1;<br />
std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Send(buf,numElements,MPI::INT,dest,syncTag);<br />
// now wait for dest to send back<br />
// even though dest is the param, in the Recv call it is used as the source<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,dest,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
}<br />
else<br />
{<br />
// source is next lower rank<br />
int source = rank - 1;<br />
std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,source,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
// send rank rather than buffer, just to be sure<br />
// even though param is source, in Send call it is for dest<br />
MPI::COMM_WORLD.Send(&rank,numElements,MPI::INT,source,syncTag);<br />
}<br />
// now for asynchronous communication<br />
std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
int asyncTag = 321;<br />
int leftRank;<br />
if (rank == 0)<br />
{<br />
leftRank = size-1;<br />
}<br />
else <br />
{<br />
leftRank = (rank-1);<br />
}<br />
int rightRank;<br />
if (rank == size-1)<br />
{<br />
rightRank = 0;<br />
}<br />
else <br />
{<br />
rightRank = (rank+1);<br />
}<br />
// everyone sends to the leftRank, and receives from the rightRank<br />
// if this were synchronous, would be in danger of deadlocking because of circular waits<br />
// reset buffers to rank<br />
buf[0] = rank;<br />
std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::Request sendReq = MPI::COMM_WORLD.Isend(buf,numElements,MPI::INT,leftRank,asyncTag);<br />
MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf,numElements,MPI::INT,rightRank,asyncTag);<br />
// wait on the receipt<br />
MPI::Status status;<br />
recvReq.Wait(status);<br />
std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
//<br />
MPI::Finalize();<br />
return(0);<br />
}<br />
<br />
<br />
I then created a two-machine Condor parallel universe grid, using the same user/password combination (with Administrative rights) on both machines. My 4-core Windows 7 Professional laptop (X86_64,WINNT61) was the '''CONDOR_HOST''' and the '''DedicatedScheduler'''. Assuming this laptop is called '''clavicle''' in the '''shoulder.com''' domain, here are the important configuration parameters in its '''condor_config.local''' file:<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files (x86)\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD, COLLECTOR, NEGOTIATOR, SMPD_SERVER<br />
<br />
<br />
The second machine, a 2 core Windows XP desktop called '''scapula''' (INTEL,WINNT51) was a dedicated machine with the following important parameters in its '''condor_config.local''' file:<br />
<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, STARTD, SMPD_SERVER<br />
<br />
Note how it points to '''clavicle''' both as '''CONDOR_HOST''' and as the '''DedicatedScheduler''', and that it does not have '''(x86)''' in its path to '''smpd.exe'''.<br />
<br />
MPICH2 gets installed to different locations, to '''C:\Program Files\MPICH2''' for an (INTEL,WINNT51) machine, and to '''C:\Program Files (x86)\MPICH2''' for a (X86_64,WINNT61) machine. Because I needed to refer to the '''mpiexec.exe''' path in the batch file that my Condor submit file referred to, and this batch file would be the same for both machines, I opted to copy '''mpiexec.exe''' to the work directory I was using on all machines, '''C:\mpich2work'''.<br />
<br />
After setting up this configuration, I rebooted both machines to start in this configuration. At this point, the command<br />
net stop condor<br />
no longer returns correctly, since Condor is managing the '''smpd.exe''' service now and does not know how to stop that service.<br />
<br />
<br />
<br />
Based on the Condor submit file described in the Condor wiki page, I had to adapt my own submit file for my above MPI executable called '''mpiwork.exe''' to be the following:<br />
<br />
<br />
universe = parallel<br />
executable = mp2script.bat<br />
arguments = mpiwork.exe<br />
machine_count = 6<br />
output = out.$(NODE).log<br />
error = error.$(NODE).log<br />
log = work.$(NODE).log<br />
should_transfer_files = yes<br />
when_to_transfer_output = on_exit<br />
transfer_input_files = mpiwork.exe<br />
Requirements = ((Arch == "INTEL") && (OpSys == "WINNT51")) || ((Arch == "X86_64") && (OpSys == "WINNT61"))<br />
queue<br />
<br />
and my '''mp2script.bat''' file was adapted to be: <br />
<br />
set _CONDOR_PROCNO=%_CONDOR_PROCNO%<br />
set _CONDOR_NPROCS=%_CONDOR_NPROCS%<br />
REM If not the head node, just sleep forever<br />
if not [%_CONDOR_PROCNO%] == [0] copy con nothing<br />
REM Set this to the bin directory of MPICH installation<br />
set MPDIR="C:\mpich2work"<br />
REM run the actual mpijob<br />
%MPDIR%\mpiexec.exe -n %_CONDOR_NPROCS% -p 6666 %*<br />
exit 0<br />
<br />
with the change being that '''MPDIR''' is a location that is known to have '''mpiexec.exe''' on both machines, since I copied that file there on both machines as described above.<br />
<br />
A point of confusion: I would have expected the '''mpiwork.exe''' output to contain the names of both '''clavicle''' and '''scapula''' when run on 6 nodes; instead only '''clavicle''' appeared. I do believe the job ran on both machines and all 6 cores, as both machines' IP addresses were noted in work.parallel.log, and both machines showed CPU activity while the executable ran.<br />
<br />
<br />
== Adding an Additional Compute Node, for MPICH2 and Condor on Windows ==<br />
<br />
Once you have an existing pool and want to add a new node to it, the process is relatively straightforward once Condor is installed on the node. We have found that the new node does not need the same user account for Condor/MPI as the existing pool machines.<br />
<br />
After you have Condor set up initially on the new node, stop condor via a command line started with Administrative privileges:<br />
<br />
net stop condor<br />
<br />
Edit your condor_config.local as shown above for the execute node example, so that this new node is a dedicated execute node.<br />
<br />
Restart Condor:<br />
<br />
net start condor<br />
<br />
<br />
Now create a work directory on the new machine; it is important that this has the same name as the work directory on the other machines, in this example '''C:\mpich2work'''.<br />
<br />
<br />
Copy any needed input files to the work directory.<br />
<br />
Copy the '''mpiexec.exe''' file from the '''MPICH2_INSTALL\bin''' directory to the work directory.<br />
<br />
That is all that is necessary. You can have the actual executable and the driver batch script both sent to the new execute node via commands in your Condor submission file.<br />
<br />
== MPI Based Meta-Optimization Framework ==<br />
<br />
=== High Level Architecture ===<br />
<br />
The system was designed for infrastructure reuse. The infrastructure classes take care of organization of the computation nodes and communication between the computation nodes. When a new MetaOptimization is needed, the client programmer should create subclasses of '''MetaOptimizer''', '''JobDefinition''', '''Results''', and '''RegistrationWorker''' specific to the computational problem.<br />
<br />
=== MPI Aware Infrastructure C++ classes ===<br />
<br />
[[File:Mpi metaopt class diagram.png | UML Diagram of C++ classes]]<br />
<br />
'''MasterSlave''': Entry point of the application, initializes the MPI infrastructure, starts up a '''Master''' (sending it a particular '''MetaOptimizer''' subclass), starts up as many '''Slave''' nodes as are necessary (sending each of them a particular '''RegistrationWorker''' subclass), and finalizes the MPI infrastructure when the computation is complete.<br />
<br />
<br />
[[File:Mpimeta start.png | MetaOptmization Framework Startup]]<br />
[[File:Mpimeta computation.png | MetaOptmization Framework Computation]]<br />
<br />
<br />
<br />
<br />
'''Master''': The central controlling part of the infrastructure. It manages the available '''Slave'''s, sending them '''JobDefinition'''s as needed, collects '''Results''' from the '''Slave'''s as they come in, re-runs a '''JobDefinition''' if it has failed or timed out, and shuts down all '''Slave'''s when the computation is finished. The '''Master''' owns a particular '''MetaOptimizer''' subclass. The '''MetaOptimizer''' subclass is aware of the particular computation; the '''Master''' only knows how to communicate with '''Slave'''s and has no responsibility for determining the breakdown of a computation into parts or its completion.<br />
<br />
'''Slave''': One of the processing nodes, part of the infrastructure. Each '''Slave''' manages a '''RegistrationWorker''' subclass. '''Slave'''s announce their presence to the '''Master''', receive a '''JobDefinition''' describing a particular job, pass the '''JobDefinition''' on to the '''RegistrationWorker''' to run the job, receive back a '''Results''' upon completion of the job, and send the '''Results''' back to the '''Master'''. '''Slave'''s are intended to run multiple rounds of registration during a particular computational run.<br />
<br />
=== Infrastructure Event C++ classes ===<br />
<br />
'''MasterEvent''' & '''SlaveEvent''' (named for the origin of the Event) are Infrastructure Event classes, and each hold a Message Wrapper class as a payload.<br />
<br />
The idea behind the job message wrapper classes was to make message wrappers that know how to bi-directionally convert their specific named parameters to array buffers, giving them a pseudo-serialization ability without them having to be aware of MPI specifics. Each message wrapper class is associated with a specific Infrastructure Event class ('''SlaveEvent'''s own a '''Results''', and '''MasterEvent'''s own a '''JobDefinition'''). The Infrastructure Event classes take care of sending from '''Master''' classes to '''Slave''' classes, and from '''Slave''' classes to '''Master''' classes.<br />
<br />
The '''MetaOptimizer''' subclasses will return '''JobDefinition''' subclasses to the '''Master''', which then wraps the '''JobDefinition''' subclass in a '''MasterEvent''', and sends the '''MasterEvent''' to a '''Slave'''. A '''Slave''' reads a '''MasterEvent''', extracts the '''JobDefinition''' subclass, and sends the '''JobDefinition''' subclass to a '''RegistrationWorker''' subclass. Upon completion of the calculation for the '''JobDefinition''', the '''RegistrationWorker''' returns a '''Results''' subclass to the '''Slave''', which then wraps the '''Results''' in a '''SlaveEvent''' and sends the '''SlaveEvent''' to the '''Master'''. The '''Master''' reads the '''SlaveEvent''', extracts the '''Results''' subclass, and sends this '''Results''' subclass to the '''MetaOptimizer'''.<br />
<br />
=== Computation Specific C++ classes ===<br />
<br />
Note: currently these classes are specific to a particular example computation; the interface needs to be abstracted out, and the specific example computation extracted and moved to subclasses.<br />
<br />
<br />
<br />
'''JobDefinition''': A Message Wrapper class that defines the parameters of a particular job, a specific subclass of '''JobDefinition''' should be used for a particular computation.<br />
<br />
'''Results''': A Message Wrapper class that defines the return values of a particular job, a specific subclass of '''Results''' should be used for a particular computation.<br />
<br />
'''MetaOptimizer''': The '''MetaOptimizer''' subclass for a particular computation takes care of determining the parameters for a particular job (in the form of a '''JobDefinition'''), and determining how many '''JobDefinition'''s will be in a round of computation. The '''MetaOptimizer''' then takes in the '''Results''' from each '''JobDefinition''', and upon receiving '''Results''' for each '''JobDefinition''' in a round, can either determine a new round of computation by recalculating the new set of parameters for the new round, or else terminate the computation.<br />
<br />
'''RegistrationWorker''': The '''RegistrationWorker''' subclass for a particular computation will take in a particular '''JobDefinition''' subclass germane to the computation, perform the actual registration computation, then return the results in the form of a specific '''Results''' subclass that is relevant to the computation.</div>
<hr />
<div><br />
Condor can be installed as a ''manager'' node, an ''execute'' node, a ''submit'' node, or any combination of these. See [http://www.cs.wisc.edu/condor/manual/v7.2/3_1Introduction.html#SECTION00411000000000000000 The Different Roles a Machine Can Play]<br />
<br />
* Manager: There can be only one central manager for your pool. This machine is the collector of information, and the negotiator between resources and resource requests.<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer to the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
'''The official instructions on how to install Condor on Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here].''' Below we present some of the tweaks we had to do to get it to work on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.) , if it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a similar command to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or with a user having equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (For example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation scripts ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update PATH variable to include the directory ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save file and apply the change by running <br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out and log back in, or even restart the machine; the CONDOR_CONFIG and PATH environment variables should still be set.<br />
<br />
* Edit the Condor manager configuration file and update the lines as shown below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, create a link to /root/condor/etc/condor_config in /home/condor so that MIDAS can run condor commands<br />
 cd /home/condor<br />
 ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring an Executer/Submitter in Unix ===<br />
The files that allow the server to also be used as a Condor submitter/executer were automatically updated when the installation script ''condor_install'' was run. Nevertheless, you still need to update its local configuration file.<br />
<br />
* Edit the node's condor_config.local and update the line as shown below:<br />
vi /home/condor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the lines setting ''UID_DOMAIN'' and ''FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor in Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor in Windows 7.<br />
<br />
# Download the Windows install MSI, run it, installing to "C:/condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager ( full address ).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For accounting domain enter your domain (e.g. yourdomaininternal.com)<br />
# For Email settings (I ignored this by clicking next)<br />
# for Java settings (I ignored this as we weren't using Java, by clicking next)<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked whether you want a custom install or the default install, choose the default install. This will install condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When running condor commands, start a cygwin or cmd prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix up some problems. My '''condor_status''' at first gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect: since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or simply an IP that you do not want.<br />
<br />
<br />
I shut down condor, right clicked on C:/condor in a Windows explorer, turned off "read only" and set permissions to allow writing. Then I edited the file "c:/condor/condor_config.local", which started out empty, so that it would pick up some replacement values (some of them didn't seem to be set properly during the install). These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following:<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourinternaldomain.com if your machine comes up in condor without a domain name.<br />
<br />
<br />
and then restarted condor. If you run "condor_status" and see that the Windows machine has Activity=OWNER rather than UNCLAIMED, be sure that you have added START=True. But this may not be the best configuration for a Windows workstation that is in use: some additional configuration is needed to make sure a Condor job doesn't use the machine while a physical human user is there using it.<br />
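Condor's startd policy expressions can handle this. The fragment below is only a hedged sketch, adapted from the usual desktop-owner style of policy; the exact attribute names and thresholds should be checked against the policy configuration section of the Condor manual before use. The idea is to start jobs only after 15 minutes of keyboard idleness and to suspend them as soon as the user returns (the condor_kbdd daemon in the DAEMON_LIST above exists to help detect keyboard activity):<br />

```
# Sketch of an owner-friendly policy (verify against the Condor manual):
# start jobs only when the keyboard has been idle for 15 minutes and the
# load is low; suspend a running job as soon as the user is back.
START    = ( KeyboardIdle > 15 * 60 ) && ( LoadAvg <= 0.3 )
SUSPEND  = ( KeyboardIdle < 60 )
CONTINUE = ( KeyboardIdle > 15 * 60 )
```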
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here is the batch file contents for the actual job (printname.bat)<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, get a command prompt or Cygwin terminal, right click the icon to start it up, and click "run with elevated privileges" or "run as administrator".<br />
<br />
condor_master runs as a service on Windows, which controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* run the condor manager <br />
condor_master<br />
<br />
* Assuming that during the installation process you set up the type as ''manager,execute,submit'' (the default), run the following command<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after starting condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Set up condor to start automatically at boot<br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add the ''condor.boot'' service to all runlevels<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
 #include <unistd.h><br />
 #include <stdio.h><br />
 int main( int argc, char** argv )<br />
 {<br />
   /* guard against running with no argument */<br />
   printf( "%s\n", argc > 1 ? argv[1] : "(no argument)" );<br />
   fflush( stdout );<br />
   sleep( 30 );<br />
   return 0;<br />
 }<br />
<br />
This program will echo the command line argument it is given, wait 30 seconds, then exit.<br />
<br />
Save this file as foo.c, then compile it statically:<br />
<br />
gcc foo.c -o foo --static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once so you can see how Condor will execute the job on multiple execute resources), you can <br />
change the condorjob file to be like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
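Writing one stanza per job gets repetitive as the job count grows. A small script can generate the submit description instead; this is only a sketch (''gen_submit'' is a hypothetical helper, not a Condor tool), assuming the same statically linked ''foo'' executable as above:<br />

```shell
# Generate a vanilla-universe submit description with one Queue
# statement (and its own log/err/out files) per argument given.
gen_submit() {
  printf 'universe = vanilla\n'
  printf 'executable = foo\n'
  printf 'should_transfer_files = YES\n'
  printf 'when_to_transfer_output = ON_EXIT\n'
  i=1
  for arg in "$@"; do
    printf 'log = condorjob%d.log\n' "$i"
    printf 'error = condorjob%d.err\n' "$i"
    printf 'output = condorjob%d.out\n' "$i"
    printf 'arguments = "%s"\n' "$arg"
    printf 'Queue\n'
    i=$((i + 1))
  done
}
```

For example, ''gen_submit helloworld1 helloworld2 > condorjob'' reproduces the two-job file above.<br />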
<br />
We had a case with 6 slots: 2 were 32 bit with Arch=INTEL and 4 were 64 bit with Arch=X86_64, but we were unaware of the difference at first. We ran 6 jobs and wondered why they would only execute on the submitting machine. So we changed the condorjob file to require a specific architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told condor to execute only on machines with the INTEL architecture, so the job was never attempted on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. I've included this story in case it helps with debugging.<br />
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was ''condor''.<br />
*When building BatchMake, you need to build with grid support turned on<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows you which of the processes you have submitted to condor are active. It will give you a cluster ID and process ID for each process.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process will start and stop all other condor daemons. condor_master will write a MasterLog and have a .master_address file. Be sure that the .master_address contains the correct IP address. If the correct IP address doesn't appear, set the correct value in the parameter <br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has, but condor may have picked a different IP than you would have wanted, perhaps the local loopback address 127.0.0.1, or a wireless adapter instead of the ethernet adapter.<br />
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes and holds the data about submitted jobs. It tracks the queue of jobs and tries to obtain resources so that all of its jobs can run. When a submitted job runs, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine whenever one of its jobs is actually executing. It takes care of system calls that need to be executed on the submitting machine for a process. There will be a condor_shadow process for each executing process of a submission machine, meaning that on a machine with a large number of submitted processes, the memory or other resources needed to support the shadow daemons could become a limitation.<br />
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution. Log files of interest include NegotiatorLog and MatchLog<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon is used to detect user activity on an execute node, so Condor knows whether to allow execution of a job or to disallow it because a human user is currently engaged in some task.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, built for [http://en.wikipedia.org/wiki/IA-64 IA64], the 64-bit Intel Itanium architecture. This does not cover all 64-bit Intel processors.<br />
<br />
While trying to run ''condor_master'', the shell returned the error message ''cannot execute binary file''.<br />
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], it is possible to inspect the header of an executable and determine whether it can run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the different outputs, it is clear that the '''Intel IA-64''' architecture isn't the right one for this machine.<br />
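Rather than eyeballing the full readelf output, the check can be scripted. This is a sketch (the ''elf_machine'' helper is my own): the ELF e_machine field is the little-endian 16-bit value at byte offset 18 of the file, where 62 means x86-64, 50 means IA-64, and 3 means 32-bit x86:<br />

```shell
# Print the decimal e_machine value of an ELF binary.
# e_machine sits at offset 18, 2 bytes, little-endian:
# 62 = x86-64, 50 = IA-64, 3 = 32-bit x86.
elf_machine() {
  od -An -tu1 -j18 -N2 "$1" | awk '{ print $1 + $2 * 256 }'
}
```

For example, ''[ "$(elf_machine sbin/condor_master)" = "62" ]'' quickly confirms whether a downloaded condor_master matches an x86-64 host.<br />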
<br />
<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32 bit binary. Then add the location of the MPICH2\bin directory to your system path, and add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
The following was performed in a Windows CMD prompt run with Administrative Privileges (right click on the CMD executable and run it as an Administrator).<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's Windows account password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication password for smpd; this is the '''mypassphrase''' you entered above. If all is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processors since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines, and how many processes each machine should run, and the executable '''hostname''', which should return the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, you must right click on the project file (not the solution file), then click Properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then, to run your exe on two different machines (I had to build a RELEASE build for this; at first I had built a DEBUG build, which caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
<br />
== MPICH2 and Condor on Windows ==<br />
<br />
To configure MPI over Condor, I set my Windows 7 laptop up as a dedicated Condor scheduler and executer, following the instructions at this link:<br />
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
I turned off smpd.exe as a service, since Condor will manage it:<br />
<br />
smpd.exe -remove<br />
<br />
<br />
I then created the following MPI example program, and ran it with a MachineCount of 2 and of 4 at different times (I ran this on my laptop, which has 4 cores and had Condor running locally). This program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.<br />
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char *argv[])<br />
{<br />
MPI::Init(argc, argv);<br />
//<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
int size = MPI::COMM_WORLD.Get_size();<br />
 char processor_name[MPI_MAX_PROCESSOR_NAME];<br />
 int namelen;<br />
 MPI::Get_processor_name(processor_name,namelen);<br />
// <br />
int buf[1];<br />
int numElements = 1;<br />
// initialize buffer with rank<br />
buf[0] = rank;<br />
// example of synchronous communication<br />
// have even ranks send to odd ranks first<br />
// assumes an even number of processes<br />
int syncTag = 123;<br />
if (rank % 2 == 0)<br />
{<br />
// send to next higher rank<br />
int dest = rank + 1;<br />
std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Send(buf,numElements,MPI::INT,dest,syncTag);<br />
// now wait for dest to send back<br />
// even though dest is the param, in the Recv call it is used as the source<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,dest,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
}<br />
else<br />
{<br />
// source is next lower rank<br />
int source = rank - 1;<br />
std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,source,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
// send rank rather than buffer, just to be sure<br />
// even though param is source, in Send call it is for dest<br />
MPI::COMM_WORLD.Send(&rank,numElements,MPI::INT,source,syncTag);<br />
}<br />
// now for asynchronous communication<br />
std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
int asyncTag = 321;<br />
int leftRank;<br />
if (rank == 0)<br />
{<br />
leftRank = size-1;<br />
}<br />
else <br />
{<br />
leftRank = (rank-1);<br />
}<br />
int rightRank;<br />
if (rank == size-1)<br />
{<br />
rightRank = 0;<br />
}<br />
else <br />
{<br />
rightRank = (rank+1);<br />
}<br />
// everyone sends to the leftRank, and receives from the rightRank<br />
// if this were synchronous, would be in danger of deadlocking because of circular waits<br />
// reset buffers to rank<br />
buf[0] = rank;<br />
std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::Request sendReq = MPI::COMM_WORLD.Isend(buf,numElements,MPI::INT,leftRank,asyncTag);<br />
MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf,numElements,MPI::INT,rightRank,asyncTag);<br />
// wait on the receipt<br />
MPI::Status status;<br />
recvReq.Wait(status);<br />
std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
//<br />
MPI::Finalize();<br />
return(0);<br />
}<br />
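As an aside, the if/else blocks that compute leftRank and rightRank above are just modular arithmetic on a ring of ranks; the sketch below (function names are my own) shows the equivalent one-line form:<br />

```shell
# Neighbor ranks on a ring of `size` processes, written with modular
# arithmetic instead of the explicit wrap-around branches above.
# Usage: left_rank <rank> <size>, right_rank <rank> <size>
left_rank()  { echo $(( ($1 + $2 - 1) % $2 )); }
right_rank() { echo $(( ($1 + 1) % $2 )); }
```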
<br />
<br />
I then created a two machine Condor parallel universe grid, using the same user/password combination on both machines; this user had Administrative rights. My 4 core Windows 7 Professional laptop (X86_64,WINNT61) was the '''CONDOR_HOST''' and the '''DedicatedScheduler'''. Assuming this laptop is called '''clavicle''' in the '''shoulder.com''' domain, here are the important configuration parameters in its '''condor_config.local''' file:<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files (x86)\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD, COLLECTOR, NEGOTIATOR, SMPD_SERVER<br />
<br />
<br />
The second machine, a 2 core Windows XP desktop called '''scapula''' (INTEL,WINNT51) was a dedicated machine with the following important parameters in its '''condor_config.local''' file:<br />
<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, STARTD, SMPD_SERVER<br />
<br />
Note how it points to '''clavicle''' both as '''CONDOR_HOST''' and as the '''DedicatedScheduler''', and that it does not have '''(x86)''' in its path to '''smpd.exe'''.<br />
<br />
MPICH2 gets installed to different locations, to '''C:\Program Files\MPICH2''' for an (INTEL,WINNT51) machine, and to '''C:\Program Files (x86)\MPICH2''' for a (X86_64,WINNT61) machine. Because I needed to refer to the '''mpiexec.exe''' path in the batch file that my Condor submit file referred to, and this batch file would be the same for both machines, I opted to copy '''mpiexec.exe''' to the work directory I was using on all machines, '''C:\mpich2work'''.<br />
<br />
After setting up this configuration, I rebooted both machines to start in this configuration. At this point, the command<br />
net stop condor<br />
no longer returns correctly, since Condor is now managing the '''smpd.exe''' service and does not know how to stop that service.<br />
<br />
<br />
<br />
Based on the Condor submit file described in the Condor wiki page, I had to adapt my own submit file for my above MPI executable called '''mpiwork.exe''' to be the following:<br />
<br />
<br />
universe = parallel<br />
executable = mp2script.bat<br />
arguments = mpiwork.exe<br />
machine_count = 6<br />
output = out.$(NODE).log<br />
error = error.$(NODE).log<br />
log = work.$(NODE).log<br />
should_transfer_files = yes<br />
when_to_transfer_output = on_exit<br />
transfer_input_files = mpiwork.exe<br />
Requirements = ((Arch == "INTEL") && (OpSys == "WINNT51")) || ((Arch == "X86_64") && (OpSys == "WINNT61"))<br />
queue<br />
<br />
and my '''mp2script.bat''' file was adapted to be: <br />
<br />
set _CONDOR_PROCNO=%_CONDOR_PROCNO%<br />
set _CONDOR_NPROCS=%_CONDOR_NPROCS%<br />
REM If not the head node, just sleep forever<br />
if not [%_CONDOR_PROCNO%] == [0] copy con nothing<br />
REM Set this to the bin directory of MPICH installation<br />
set MPDIR="C:\mpich2work"<br />
REM run the actual mpijob<br />
%MPDIR%\mpiexec.exe -n %_CONDOR_NPROCS% -p 6666 %*<br />
exit 0<br />
<br />
with the change being that '''MPDIR''' is a location that is known to have '''mpiexec.exe''' on both machines, since I copied that file there on both machines as described above.<br />
<br />
A point of confusion I have is that I would have expected the '''mpiwork.exe''' code above to include the names of both '''clavicle''' and '''scapula''' in the output when I ran it on 6 nodes. Instead only the name '''clavicle''' appeared. I do believe that this was running on both machines and all 6 cores, as both machines' IP addresses were noted in the work.parallel.log, and both machines had CPU activity at the time the executable ran.<br />
<br />
<br />
== Adding an Additional Compute Node, for MPICH2 and Condor on Windows ==<br />
<br />
Once you have an existing pool and you want to add a new node to the pool, it is relatively straightforward once you have Condor initially installed on the node. We have found that it isn't necessary to have the same user account on the new node to run Condor/MPI as it is on the existing pool machines.<br />
<br />
After you have Condor set up initially on the new node, stop condor via a command line started with Administrative privileges:<br />
<br />
net stop condor<br />
<br />
Edit your condor_config.local as shown above for the execute node example, so that this new node is a dedicated execute node.<br />
<br />
Restart Condor:<br />
<br />
net start condor<br />
<br />
<br />
Create the MPICH work directory on the new machine:<br />
<br />
C:\mpich2work<br />
<br />
Copy the input files to the work directory, then create a driver script with contents similar to this:<br />
<br />
--------------<br />
<br />
set _CONDOR_PROCNO=%_CONDOR_PROCNO%<br />
set _CONDOR_NPROCS=%_CONDOR_NPROCS%<br />
<br />
REM If not the head node, just sleep forever<br />
if not [%_CONDOR_PROCNO%] == [0] copy con nothing<br />
<br />
REM Set this to the bin directory of MPICH installation<br />
set MPDIR="C:\mpich2work"<br />
<br />
REM run the actual mpijob<br />
%MPDIR%\mpiexec.exe -n %_CONDOR_NPROCS% -p 6666 %*<br />
<br />
exit 0<br />
--------------------------<br />
<br />
Finally, copy the '''mpiexec.exe''' executable to the MPI work directory, as described above.<br />
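Before moving on to MPI jobs, it can be worth confirming that the new node accepts ordinary work at all. The sketch below is a minimal vanilla-universe test job pinned to the new node with a '''Requirements''' expression; here ''newnode.yourdomaininternal.com'' is a placeholder for the new node's hostname, and '''printname.bat''' is the trivial batch file from the Windows install section above.<br />

```
universe = vanilla
executable = printname.bat
output = newnode.out
error = newnode.err
log = newnode.log
should_transfer_files = yes
when_to_transfer_output = on_exit
Requirements = (Machine == "newnode.yourdomaininternal.com")
queue
```

If this job completes and ''newnode.out'' is produced, the node is healthy and the MPI configuration can be layered on top.<br />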
<br />
== MPI Based Meta-Optimization Framework ==<br />
<br />
=== High Level Architecture ===<br />
<br />
The system was designed for infrastructure reuse. The infrastructure classes take care of organization of the computation nodes and communication between the computation nodes. When a new MetaOptimization is needed, the client programmer should create subclasses of '''MetaOptimizer''', '''JobDefinition''', '''Results''', and '''RegistrationWorker''' specific to the computational problem.<br />
<br />
=== MPI Aware Infrastructure C++ classes ===<br />
<br />
[[File:Mpi metaopt class diagram.png | UML Diagram of C++ classes]]<br />
<br />
'''MasterSlave''': Entry point of the application, initializes the MPI infrastructure, starts up a '''Master''' (sending it a particular '''MetaOptimizer''' subclass), starts up as many '''Slave''' nodes as are necessary (sending each of them a particular '''RegistrationWorker''' subclass), and finalizes the MPI infrastructure when the computation is complete.<br />
<br />
<br />
[[File:Mpimeta start.png | MetaOptimization Framework Startup]]<br />
[[File:Mpimeta computation.png | MetaOptimization Framework Computation]]<br />
<br />
<br />
<br />
<br />
'''Master''': The central controlling part of the infrastructure. It manages the available '''Slave'''s, sending them '''JobDefinition'''s as needed, collects '''Results''' from the '''Slave'''s as they come in, re-runs a '''JobDefinition''' if it has failed or timed out, and shuts down all '''Slave'''s when the computation is finished. The '''Master''' owns a particular '''MetaOptimizer''' subclass. The '''MetaOptimizer''' subclass is aware of the particular computation; the '''Master''' only knows how to communicate with '''Slave'''s and has no responsibility for determining the breakdown of a computation into parts or its completion.<br />
<br />
'''Slave''': One of the processing nodes, part of the infrastructure. Each '''Slave''' manages a '''RegistrationWorker''' subclass. '''Slave'''s announce their presence to the '''Master''', receive a '''JobDefinition''' describing a particular job, pass the '''JobDefinition''' on to the '''RegistrationWorker''' to run the job, receive back a '''Results''' upon completion of the job, and send the '''Results''' back to the '''Master'''. '''Slave'''s are intended to run multiple rounds of registration during a particular computational run.<br />
<br />
=== Infrastructure Event C++ classes ===<br />
<br />
'''MasterEvent''' & '''SlaveEvent''' (named for the origin of the Event) are Infrastructure Event classes, and each hold a Message Wrapper class as a payload.<br />
<br />
The idea behind the job message wrapper classes was to make message wrappers that know how to bi-directionally convert their specific named parameters to array buffers, giving them a pseudo-serialization ability without having to be aware of MPI specifics. Each message wrapper class is associated with a specific Infrastructure Event class ('''SlaveEvent'''s own a '''Results''', and '''MasterEvent'''s own a '''JobDefinition'''). The Infrastructure Event classes take care of sending from '''Master''' classes to '''Slave''' classes, and from '''Slave''' classes to '''Master''' classes.<br />
<br />
The '''MetaOptimizer''' subclasses return '''JobDefinition''' subclasses to the '''Master''', which then wraps the '''JobDefinition''' subclass in a '''MasterEvent''' and sends the '''MasterEvent''' to a '''Slave'''. A '''Slave''' reads a '''MasterEvent''', extracts the '''JobDefinition''' subclass, and passes the '''JobDefinition''' subclass to a '''RegistrationWorker''' subclass. Upon completion of the calculation for the '''JobDefinition''', the '''RegistrationWorker''' returns a '''Results''' subclass to the '''Slave''', which then wraps the '''Results''' in a '''SlaveEvent''' and sends the '''SlaveEvent''' to the '''Master'''. The '''Master''' reads the '''SlaveEvent''', extracts the '''Results''' subclass, and sends this '''Results''' subclass to the '''MetaOptimizer'''.<br />
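To make the flow concrete, here is a minimal sketch of one Master-to-Slave round trip in plain C++. The class names follow the text, but the bodies are illustrative stand-ins: the real classes serialize their named parameters into MPI buffers, and the direct function call below replaces the actual MPI send/receive pair.<br />

```cpp
#include <cassert>

// Hypothetical minimal stand-ins for the framework classes named above.
// The MPI buffer serialization machinery is omitted here.
struct JobDefinition { double parameter; };
struct Results       { double score; };
struct MasterEvent   { JobDefinition job; };  // payload sent Master -> Slave
struct SlaveEvent    { Results results; };    // payload sent Slave -> Master

// What a RegistrationWorker subclass would do with one JobDefinition
// (a toy computation standing in for an actual registration).
Results RunRegistration(const JobDefinition& job) {
  return Results{job.parameter * job.parameter};
}

// One Master -> Slave -> Master round trip, with the MPI send/receive
// pair replaced by direct construction of the event objects.
SlaveEvent RoundTrip(const JobDefinition& job) {
  MasterEvent toSlave{job};                  // Master wraps the JobDefinition
  Results r = RunRegistration(toSlave.job);  // Slave hands the job to its worker
  return SlaveEvent{r};                      // Slave wraps the Results
}
```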
<br />
=== Computation Specific C++ classes ===<br />
<br />
Note: currently these classes are specific to a particular example computation; the interface needs to be abstracted out, and the specific example computation extracted and moved to subclasses.<br />
<br />
<br />
<br />
'''JobDefinition''': A Message Wrapper class that defines the parameters of a particular job, a specific subclass of '''JobDefinition''' should be used for a particular computation.<br />
<br />
'''Results''': A Message Wrapper class that defines the return values of a particular job, a specific subclass of '''Results''' should be used for a particular computation.<br />
<br />
'''MetaOptimizer''': The '''MetaOptimizer''' subclass for a particular computation takes care of determining the parameters for a particular job (in the form of a '''JobDefinition'''), and determining how many '''JobDefinition'''s will be in a round of computation. The '''MetaOptimizer''' then takes in the '''Results''' from each '''JobDefinition''', and upon receiving '''Results''' for each '''JobDefinition''' in a round, can either determine a new round of computation by recalculating the new set of parameters for the new round, or else terminate the computation.<br />
<br />
'''RegistrationWorker''': The '''RegistrationWorker''' subclass for a particular computation will take in a particular '''JobDefinition''' subclass germane to the computation, perform the actual registration computation, then return the results in the form of a specific '''Results''' subclass that is relevant to the computation.</div>
<div>Condor can be installed as a ''manager'' node, an ''execute'' node, or a ''submit'' node, or any combination of these. See [http://www.cs.wisc.edu/condor/manual/v7.2/3_1Introduction.html#SECTION00411000000000000000 The Different Roles a Machine Can Play]<br />
<br />
* Manager: There can be only one central manager for your pool. This machine is the collector of information, and the negotiator between resources and resource requests.<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
''' The official instructions on how to install Condor in Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here] '''. Below we present some of the tweaks we had to make to get it to work on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.) , if it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a similar command to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or with a user having equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (For example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation scripts ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update PATH variable to include the directory ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save file and apply the change by running <br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out / log in, or even restart the machine, and you should be able to check that the CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit the Condor manager configuration file and update the lines as referenced below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, in order to allow MIDAS to run Condor commands, create a link to /root/condor/etc/condor_config in /home/condor<br />
cd /home/condor<br />
ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring an Executer/Submitter in Unix ===<br />
The files allowing the server to also be used as a Condor submitter/executer were automatically updated while running the installation script ''condor_install''. Nevertheless, you still need to update its configuration file.<br />
<br />
* Edit the Condor node's ''condor_config.local'' file and update the line as referenced below:<br />
vi /home/condor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the lines with ''UID_DOMAIN'' and ''FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor in Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor in Windows 7.<br />
<br />
# Download the Windows install MSI, run it, installing to "C:/condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager ( full address ).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For accounting domain enter your domain (e.g. yourdomaininternal.com)<br />
# For Email settings, I ignored this by clicking next<br />
# For Java settings, I ignored this by clicking next, as we weren't using Java<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked whether you want a custom install or a regular install, choose the regular install. This will install Condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When running Condor commands, start a Cygwin or cmd prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix up some problems. My '''condor_status''' at first gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect (since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or perhaps just an IP that you would not want).<br />
<br />
<br />
I shut down condor, right clicked on C:/condor in a Windows explorer, turned off "read only" and set permissions to allow for writing. Then I edited the file "c:/condor/condor_config.local" which started out empty, so that it could pick up some replacement values (some of them didn't seem to be set properly during the install). These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following (a later definition overrides an earlier one,<br />
# so leave only the DAEMON_LIST line you want active):<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
#DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourinternaldomain.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
and then restarted Condor. If you run "condor_status" and see that the Windows machine has Activity=OWNER rather than UNCLAIMED, be sure that you have added in START=True. But this may not be the best configuration for a Windows workstation that is in use. There is probably some additional configuration needed to make sure a Condor job doesn't use the machine when a physical human user is there using it.<br />
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here is the batch file contents for the actual job (printname.bat)<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, get a command prompt or Cygwin terminal, right click the icon to start it up, and click "run with elevated privileges" or "run as administrator".<br />
<br />
condor_master runs as a service on Windows, which controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows for the first time, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* Run the condor manager<br />
condor_master<br />
<br />
* Assuming that during installation you set the type to ''manager,execute,submit'' (the default), run the following command<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after you started condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Set up condor to start automatically at boot<br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add the ''condor.boot'' service to all runlevels<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
#include <unistd.h><br />
#include <stdio.h><br />
int main( int argc, char** argv )<br />
{<br />
printf( "%s\n", argv[1] );<br />
fflush( stdout );<br />
sleep( 30 );<br />
return 0; <br />
}<br />
<br />
This executable will echo the command line argument it is given, wait 30 seconds, then exit.<br />
<br />
Save this file as foo.c, then compile it with (note the static linking):<br />
<br />
gcc foo.c -o foo --static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once so you can see how Condor will execute the job on multiple execute resources), you can <br />
change the condorjob file to be like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
<br />
We had a case where we had 6 slots, 2 were 32 bit with Arch=INTEL, 4 were 64 bit with Arch=X86_64 (but we were unaware of the bit difference at first). We ran 6 jobs and then were wondering why they would only execute on the submitting machine. So we changed the condorjob file to specify a certain architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told Condor to execute only on machines with an architecture of INTEL, so it did not attempt to execute on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. I've included this story in case it helps with debugging.<br />
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines on the Condor grid, because we do not have shared Unix users on our network. In our case this user was ''condor''.<br />
*When building BatchMake, you need to build with grid support on<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows you which of the jobs you have submitted to Condor are active. It will give you a cluster and process ID for each job.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process will start and stop all other condor daemons. condor_master will write a MasterLog and have a .master_address file. Be sure that the .master_address contains the correct IP address. If the correct IP address doesn't appear, set the correct value in the parameter <br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has, but condor may have picked a different IP than you would have wanted, perhaps the local loopback address 127.0.0.1, or a wireless adapter instead of the ethernet one.<br />
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes, and holds data about submitted jobs. It tracks the queueing of jobs and tries to get resources for all of its jobs to be run. When a submitted job is run, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine when an actual execution of the job is run. It will take care of systems calls that need to be executed on the submitting machine for a process. There will be a condor_shadow process for each executing process of a submission machine, meaning that on a machine with a large number of submitted processes, the number of shadow daemons supported by memory or other resources could be a limitation.<br />
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution. Log files of interest include NegotiatorLog and MatchLog<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon is used to detect user activity on an execute node, so Condor knows whether to allow execution of a job or to disallow it because a human user is currently engaged in some task.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, corresponding to [http://en.wikipedia.org/wiki/IA-64 IA-64], the 64-bit Intel Itanium architecture. This does not cover all 64-bit Intel processors.<br />
<br />
While trying to run the ''condor_master'', the shell returned the following error message ''cannot execute binary file''<br />
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], it's possible to extract the header of an executable and determine whether a given executable could run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the outputs, it is clear that '''Intel IA-64''' is the wrong architecture for this machine: both /bin/ls and the working condor_master build are '''Advanced Micro Devices X86-64''' binaries.<br />
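<br />
This check can be scripted before unpacking a tarball, by mapping the host's ''uname -m'' output to the architecture tag used in the Condor 7.2 package names. A minimal sketch (the name patterns below are assumptions based on the downloads we used, so adjust them to the actual file names):<br />
<br />
```shell
#!/bin/sh
# Map the output of `uname -m` to the architecture tag used in the
# Condor 7.2 tarball names (assumed patterns -- adjust as needed).
condor_arch() {
    case "$1" in
        x86_64) echo "linux-x86_64" ;;   # e.g. condor-7.2.0-linux-x86_64-rhel5-dynamic.tar.gz
        i?86)   echo "linux-x86" ;;
        ia64)   echo "linux-ia64" ;;     # Itanium only -- not ordinary 64-bit Intel/AMD
        *)      echo "unknown" ;;
    esac
}

# Warn if a downloaded package name does not contain this machine's tag.
check_package() {
    tag="$(condor_arch "$(uname -m)")"
    case "$1" in
        *"$tag"*) echo "OK: $1 matches $tag" ;;
        *)        echo "MISMATCH: $1 is not for $tag" ;;
    esac
}

# On an x86-64 host this reports a mismatch for the IA-64 package:
check_package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz
```
<br />
Running this before untarring would have flagged the IA-64 package immediately, without needing to inspect ELF headers by hand.<br />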
<br />
<br />
Some additional notes:<br />
* Be sure that your executable is statically linked.<br />
* For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was ''condor''.<br />
* When building BatchMake, you need to build with grid support turned on.<br />
* Condor supplies a number of utility programs and log files that are extremely helpful in understanding and correcting problems. Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the location of the MPICH2\bin directory to your system path, and add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
The following was performed in a Windows CMD prompt run with Administrative Privileges (right click on the CMD executable and run it as an Administrator).<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's Windows account password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication password for smpd; this is the '''mypassphrase''' you entered above. If all is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processors since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines, how many processes each machine should run, and the executable '''hostname''', which should print the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
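<br />
The ''-hosts'' form is easy to get wrong: it takes the number of hosts first, then an IP and process count for each host. As a sketch (with placeholder IPs, not real machines), this small helper assembles that argument list from host:cores pairs:<br />
<br />
```shell
#!/bin/sh
# Build the argument list for `mpiexec -hosts` from host:cores pairs.
build_hosts_args() {
    args="-hosts $#"                          # first value is the number of hosts
    for pair in "$@"; do
        args="$args ${pair%%:*} ${pair##*:}"  # IP, then process count
    done
    echo "$args"
}

# Placeholder IPs; append the executable name when actually running mpiexec:
build_hosts_args 192.168.1.10:4 192.168.1.11:2
# prints: -hosts 2 192.168.1.10 4 192.168.1.11 2
```
<br />
The printed string is exactly what goes between ''mpiexec'' and the executable name in the command above.<br />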
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, you must right click on the project file (not the solution file), then click properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then, to run your exe on two different machines (I had to build a RELEASE build for this; at first I had built a DEBUG build, which caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
<br />
== MPICH2 and Condor on Windows ==<br />
<br />
To configure MPI over Condor, I set up my Windows 7 laptop as a dedicated Condor scheduler and executor, following the instructions at this link:<br />
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
I turned off smpd.exe as a service, since Condor will manage it:<br />
<br />
smpd.exe -remove<br />
<br />
<br />
I then created the following MPI example program and ran it with a MachineCount of 2 and of 4 at different times (I ran this on my laptop, which has 4 cores, with Condor running locally). This program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.<br />
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char *argv[])<br />
{<br />
MPI::Init(argc, argv);<br />
//<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
int size = MPI::COMM_WORLD.Get_size();<br />
char processor_name[MPI_MAX_PROCESSOR_NAME];<br />
int namelen;<br />
MPI::Get_processor_name(processor_name,namelen);<br />
// <br />
int buf[1];<br />
int numElements = 1;<br />
// initialize buffer with rank<br />
buf[0] = rank;<br />
// example of synchronous communication<br />
// have even ranks send to odd ranks first<br />
// assumes an even number of processes<br />
int syncTag = 123;<br />
if (rank % 2 == 0)<br />
{<br />
// send to next higher rank<br />
int dest = rank + 1;<br />
std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Send(buf,numElements,MPI::INT,dest,syncTag);<br />
// now wait for dest to send back<br />
// even though dest is the param, in the Recv call it is used as the source<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,dest,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
}<br />
else<br />
{<br />
// source is next lower rank<br />
int source = rank - 1;<br />
std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,source,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
// send rank rather than buffer, just to be sure<br />
// even though param is source, in Send call it is for dest<br />
MPI::COMM_WORLD.Send(&rank,numElements,MPI::INT,source,syncTag);<br />
}<br />
// now for asynchronous communication<br />
std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
int asyncTag = 321;<br />
int leftRank;<br />
if (rank == 0)<br />
{<br />
leftRank = size-1;<br />
}<br />
else <br />
{<br />
leftRank = (rank-1);<br />
}<br />
int rightRank;<br />
if (rank == size-1)<br />
{<br />
rightRank = 0;<br />
}<br />
else <br />
{<br />
rightRank = (rank+1);<br />
}<br />
// everyone sends to the leftRank, and receives from the rightRank<br />
// if this were synchronous, would be in danger of deadlocking because of circular waits<br />
// reset buffers to rank<br />
buf[0] = rank;<br />
std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::Request sendReq = MPI::COMM_WORLD.Isend(buf,numElements,MPI::INT,leftRank,asyncTag);<br />
MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf,numElements,MPI::INT,rightRank,asyncTag);<br />
// wait on the receipt<br />
MPI::Status status;<br />
recvReq.Wait(status);<br />
std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
//<br />
MPI::Finalize();<br />
return(0);<br />
}<br />
<br />
<br />
I then created a two-machine Condor parallel universe grid, using the same user/password combination on both machines; this user had Administrative rights. My 4-core Windows 7 Professional laptop (X86_64,WINNT61) was the '''CONDOR_HOST''' and the '''DedicatedScheduler'''. Assuming this laptop is called '''clavicle''' in the '''shoulder.com''' domain, here are the important configuration parameters in its '''condor_config.local''' file: <br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files (x86)\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD, COLLECTOR, NEGOTIATOR, SMPD_SERVER<br />
<br />
<br />
The second machine, a 2-core Windows XP desktop called '''scapula''' (INTEL,WINNT51), was a dedicated machine with the following important parameters in its '''condor_config.local''' file:<br />
<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, STARTD, SMPD_SERVER<br />
<br />
Note how it points to '''clavicle''' both as '''CONDOR_HOST''' and as the '''DedicatedScheduler''', and that it does not have '''(x86)''' in its path to '''smpd.exe'''.<br />
<br />
MPICH2 gets installed to different locations, to '''C:\Program Files\MPICH2''' for an (INTEL,WINNT51) machine, and to '''C:\Program Files (x86)\MPICH2''' for a (X86_64,WINNT61) machine. Because I needed to refer to the '''mpiexec.exe''' path in the batch file that my Condor submit file referred to, and this batch file would be the same for both machines, I opted to copy '''mpiexec.exe''' to the work directory I was using on all machines, '''C:\mpich2work'''.<br />
<br />
After setting up this configuration, I rebooted both machines to start in this configuration. At this point, the command<br />
net stop condor<br />
no longer returns correctly, since Condor is managing the '''smpd.exe''' service now and does not know how to stop that service.<br />
<br />
<br />
<br />
Based on the Condor submit file described in the Condor wiki page, I adapted my own submit file for the MPI executable above, called '''mpiwork.exe''', to be the following:<br />
<br />
<br />
universe = parallel<br />
executable = mp2script.bat<br />
arguments = mpiwork.exe<br />
machine_count = 6<br />
output = out.$(NODE).log<br />
error = error.$(NODE).log<br />
log = work.$(NODE).log<br />
should_transfer_files = yes<br />
when_to_transfer_output = on_exit<br />
transfer_input_files = mpiwork.exe<br />
Requirements = ((Arch == "INTEL") && (OpSys == "WINNT51")) || ((Arch == "X86_64") && (OpSys == "WINNT61"))<br />
queue<br />
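<br />
Since only the executable name and machine count tend to change between runs, a submit file like the one above can be generated. A minimal sketch (it omits the ''Requirements'' line, which you would append by hand for a mixed pool like ours):<br />
<br />
```shell
#!/bin/sh
# Generate a parallel-universe submit file; args: executable, machine count.
make_submit() {
    cat <<EOF
universe = parallel
executable = mp2script.bat
arguments = $1
machine_count = $2
output = out.\$(NODE).log
error = error.\$(NODE).log
log = work.\$(NODE).log
should_transfer_files = yes
when_to_transfer_output = on_exit
transfer_input_files = $1
queue
EOF
}

make_submit mpiwork.exe 6 > mpiwork.sub   # then: condor_submit mpiwork.sub
```
<br />
Note the escaped ''\$(NODE)'' in the here-document, which keeps the literal ''$(NODE)'' macro for Condor rather than letting the shell expand it.<br />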
<br />
and my '''mp2script.bat''' file was adapted to be: <br />
<br />
set _CONDOR_PROCNO=%_CONDOR_PROCNO%<br />
set _CONDOR_NPROCS=%_CONDOR_NPROCS%<br />
REM If not the head node, just sleep forever<br />
if not [%_CONDOR_PROCNO%] == [0] copy con nothing<br />
REM Set this to the bin directory of MPICH installation<br />
set MPDIR="C:\mpich2work"<br />
REM run the actual mpijob<br />
%MPDIR%\mpiexec.exe -n %_CONDOR_NPROCS% -p 6666 %*<br />
exit 0<br />
<br />
with the change being that '''MPDIR''' is a location that is known to have '''mpiexec.exe''' on both machines, since I copied that file there on both machines as described above.<br />
<br />
A point of confusion I have is that I would have expected the '''mpiwork.exe''' code above to print the names of both '''clavicle''' and '''scapula''' in the output when I ran it on 6 nodes; instead, only the name '''clavicle''' appeared. I do believe that this was running on both machines and on all 6 cores, as both machines' IP addresses were noted in the work.parallel.log, and both machines had CPU activity at the time the executable ran.<br />
<br />
<br />
== Adding an Additional Compute Node, for MPICH2 and Condor on Windows ==<br />
<br />
== MPI Based Meta-Optimization Framework ==<br />
<br />
=== High Level Architecture ===<br />
<br />
The system was designed for infrastructure reuse. The infrastructure classes take care of organization of the computation nodes and communication between the computation nodes. When a new MetaOptimization is needed, the client programmer should create subclasses of '''MetaOptimizer''', '''JobDefinition''', '''Results''', and '''RegistrationWorker''' specific to the computational problem.<br />
<br />
=== MPI Aware Infrastructure C++ classes ===<br />
<br />
[[File:Mpi metaopt class diagram.png | UML Diagram of C++ classes]]<br />
<br />
'''MasterSlave''': Entry point of the application, initializes the MPI infrastructure, starts up a '''Master''' (sending it a particular '''MetaOptimizer''' subclass), starts up as many '''Slave''' nodes as are necessary (sending each of them a particular '''RegistrationWorker''' subclass), and finalizes the MPI infrastructure when the computation is complete.<br />
<br />
<br />
[[File:Mpimeta start.png | MetaOptmization Framework Startup]]<br />
[[File:Mpimeta computation.png | MetaOptmization Framework Computation]]<br />
<br />
<br />
<br />
<br />
'''Master''': The central controlling part of the infrastructure. It manages the available '''Slave'''s, sending them '''JobDefinition'''s as needed, collects '''Results''' from the '''Slave'''s as they come in, will re-run a '''JobDefinition''' if it has failed or timed out, and shuts down all '''Slave'''s when the computation is finished. The '''Master''' owns a particular '''MetaOptimizer''' subclass. The '''MetaOptimizer''' subclass is aware of the particular computation; the '''Master''' only knows how to communicate with '''Slave'''s and has no responsibility for determining the breakdown of a computation into parts or its completion.<br />
<br />
'''Slave''': One of the processing nodes, part of the infrastructure. Each '''Slave''' manages a '''RegistrationWorker''' subclass. '''Slave'''s announce their presence to the '''Master''', receive a '''JobDefinition''' describing a particular job, pass the '''JobDescription''' on to the '''RegistrationWorker''' to run the job, receive back a '''Results''' upon completion of the job, and send the '''Results''' back to the '''Master'''. '''Slave'''s are intended to run multiple rounds of registration during a particular computational run.<br />
<br />
=== Infrastructure Event C++ classes ===<br />
<br />
'''MasterEvent''' & '''SlaveEvent''' (named for the origin of the Event) are Infrastructure Event classes, and each hold a Message Wrapper class as a payload.<br />
<br />
The idea behind the job message wrapper classes was to make message wrappers that know how to bi-directionally convert their specific named parameters to array buffers, thus giving them a pseudo-serialization ability without requiring them to be aware of MPI specifics. Each message wrapper class is associated with a specific Infrastructure Event class ('''SlaveEvent'''s own a '''Results''', and '''MasterEvent'''s own a '''JobDefinition'''). The Infrastructure Event classes take care of sending from '''Master''' classes to '''Slave''' classes, and from '''Slave''' classes to '''Master''' classes.<br />
<br />
The '''MetaOptimizer''' subclasses return '''JobDefinition''' subclasses to the '''Master''', which then wraps the '''JobDefinition''' subclass in a '''MasterEvent''' and sends the '''MasterEvent''' to a '''Slave'''. A '''Slave''' reads a '''MasterEvent''', extracts the '''JobDefinition''' subclass, and passes the '''JobDefinition''' subclass to a '''RegistrationWorker''' subclass. Upon completion of the calculation for the '''JobDefinition''', the '''RegistrationWorker''' returns a '''Results''' subclass to the '''Slave''', which then wraps the '''Results''' in a '''SlaveEvent''' and sends the '''SlaveEvent''' to the '''Master'''. The '''Master''' reads the '''SlaveEvent''', extracts the '''Results''' subclass, and sends this '''Results''' subclass to the '''MetaOptimizer'''.<br />
<br />
=== Computation Specific C++ classes ===<br />
<br />
Note: currently these classes are specific to a particular example computation; the interface needs to be abstracted out, and the specific example computation extracted and moved to subclasses.<br />
<br />
<br />
<br />
'''JobDefinition''': A Message Wrapper class that defines the parameters of a particular job, a specific subclass of '''JobDefinition''' should be used for a particular computation.<br />
<br />
'''Results''': A Message Wrapper class that defines the return values of a particular job, a specific subclass of '''Results''' should be used for a particular computation.<br />
<br />
'''MetaOptimizer''': The '''MetaOptimizer''' subclass for a particular computation takes care of determining the parameters for a particular job (in the form of a '''JobDefinition'''), and determining how many '''JobDefinition'''s will be in a round of computation. The '''MetaOptimizer''' then takes in the '''Results''' from each '''JobDefinition''', and upon receiving '''Results''' for each '''JobDefinition''' in a round, can either determine a new round of computation by recalculating the new set of parameters for the new round, or else terminate the computation.<br />
<br />
'''RegistrationWorker''': The '''RegistrationWorker''' subclass for a particular computation will take in a particular '''JobDefinition''' subclass germane to the computation, perform the actual registration computation, then return the results in the form of a specific '''Results''' subclass that is relevant to the computation.</div>
<hr />
<div><br />
Condor can be installed as a ''manager'' node, an ''execute'' node, a ''submit'' node, or any combination of these roles. See [http://www.cs.wisc.edu/condor/manual/v7.2/3_1Introduction.html#SECTION00411000000000000000 The Different Roles a Machine Can Play]<br />
<br />
* Manager: There can be only one central manager for your pool. This machine is the collector of information and the negotiator between resources and resource requests.<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer to the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
''' The official instructions on how to install Condor on Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here] '''. Below we present some of the tweaks we had to do to get it to work on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.). If it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
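<br />
A quick scripted check that the fix took effect: a fully qualified name contains at least one dot. A minimal sketch:<br />
<br />
```shell
#!/bin/sh
# Report whether a hostname is fully qualified (has a domain part).
is_fqdn() {
    case "$1" in
        *.*) echo "yes" ;;
        *)   echo "no" ;;
    esac
}

is_fqdn mymachine                 # prints "no"  -> fix /etc/hosts and /etc/hostname
is_fqdn mymachine.mydomain.com    # prints "yes" -> Condor gets a usable FQDN
```
<br />
On the real machine you would run ''is_fqdn "$(hostname)"'' after the reboot.<br />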
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a similar command to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or with a user having equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the Condor archive is in your home directory (for example, /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation scripts ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update PATH variable to include the directory ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save the file and apply the change by running<br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
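<br />
Before starting ''condor_master'', it is also worth verifying that CONDOR_CONFIG actually points at an existing file; a missing or wrong value is a common cause of startup failures. A small sketch:<br />
<br />
```shell
#!/bin/sh
# Sanity-check the CONDOR_CONFIG variable before starting condor_master.
check_condor_config() {
    if [ -z "$1" ]; then
        echo "CONDOR_CONFIG is not set"
    elif [ ! -f "$1" ]; then
        echo "CONDOR_CONFIG points at a missing file: $1"
    else
        echo "CONDOR_CONFIG OK: $1"
    fi
}

check_condor_config "$CONDOR_CONFIG"
```
<br />
Run this after logging back in (or rebooting) to confirm the /etc/environment change survived.<br />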
<br />
* You can now log out and log in again, or even restart the machine, and check that the CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit the Condor manager's config file and update the lines as shown below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, in order to allow Midas to run Condor commands, create a link to /root/condor/etc/condor_config in /home/condor<br />
cd /home/condor<br />
ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring a Executer/Submitter in Unix ===<br />
The files allowing the server to also be used as a Condor submitter/executor were automatically updated when the installation script ''condor_install'' ran. Nevertheless, you still need to update its configuration file.<br />
<br />
* Edit the Condor node's local config file and update the line as shown below:<br />
vi /home/condor/localcondor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the lines setting ''UID_DOMAIN'' and ''FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor on Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor on Windows 7.<br />
<br />
# Download the Windows install MSI, run it, installing to "C:/condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager ( full address ).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For the accounting domain, enter your domain (e.g. yourdomaininternal.com)<br />
# For the email settings, I skipped this by clicking next<br />
# For the Java settings, I skipped this by clicking next, as we weren't using Java<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked if you want a custom install or install, choose install. This will install condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When running Condor commands, start a Cygwin or cmd prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix up some problems. My '''condor_status''' at first gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect (since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or perhaps just an IP that you would not want).<br />
<br />
<br />
I shut down condor, right clicked on C:/condor in a Windows explorer, turned off "read only" and set permissions to allow for writing. Then I edited the file "c:/condor/condor_config.local" which started out empty, so that it could pick up some replacement values (some of them didn't seem to be set properly during the install). These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following:<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourinternaldomain.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
I then restarted Condor. If you run '''condor_status''' and see that the Windows machine has Activity=OWNER rather than UNCLAIMED, be sure that you have added START=True. But this may not be the best configuration for a Windows workstation that is in use; some additional configuration is probably needed to make sure a Condor job doesn't use the machine when a physical human user is there using it.<br />
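<br />
The Activity column can also be checked mechanically. The sketch below filters out slots stuck in the Owner state from condor_status-style output (the sample lines are hypothetical; on a real pool you would pipe ''condor_status'' itself into the function):<br />
<br />
```shell
#!/bin/sh
# Print the names of slots whose Activity column (field 5) reads "Owner".
owner_slots() {
    awk 'NR > 1 && $5 == "Owner" { print $1 }'
}

# Hypothetical condor_status output; real usage: condor_status | owner_slots
sample='Name          OpSys    Arch   State     Activity
slot1@win7box WINDOWS  X86_64 Owner     Owner
slot2@win7box WINDOWS  X86_64 Unclaimed Idle'

printf '%s\n' "$sample" | owner_slots    # prints: slot1@win7box
```
<br />
Any slot name this prints is one where jobs will not start until the START expression (or the machine's idleness) changes.<br />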
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here are the contents of the batch file for the actual job (printname.bat):<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, open a command prompt or Cygwin terminal with elevated privileges: right-click the icon used to start it and choose "run with elevated privileges" or "run as administrator".<br />
<br />
On Windows, condor_master runs as a service and controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows for the first time, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* run the condor manager <br />
condor_master<br />
<br />
* Assuming that during the installation process you set the type to ''manager,execute,submit'' (the default), run the following command:<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after you started condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Set up Condor to start automatically at boot <br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add the ''condor.boot'' service to all runlevels<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
#include <unistd.h><br />
#include <stdio.h><br />
int main( int argc, char** argv )<br />
{<br />
printf( "%s\n", argv[1] );<br />
fflush( stdout );<br />
sleep( 30 );<br />
return 0; <br />
}<br />
<br />
This program will print the command line argument it is given, wait 30 seconds, then exit.<br />
<br />
Save this file as foo.c, then compile it, statically linked:<br />
<br />
gcc foo.c -o foo --static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once so you can see how Condor will execute the job on multiple execute resources), you can <br />
change the condorjob file to be like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
<br />
We had a case with 6 slots: 2 were 32-bit with Arch=INTEL, and 4 were 64-bit with Arch=X86_64 (but we were unaware of the difference at first). We ran 6 jobs and wondered why they would only execute on the submitting machine. So we changed the condorjob file to specify a certain architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told Condor to execute only on machines with the INTEL architecture, so the job was not attempted on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. I've included this story in case it helps with debugging.<br />
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines on the Condor grid, because we do not have shared Unix users on our network. In our case this user was Condor.<br />
*When building BatchMake, you need to build with grid support enabled.<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows the jobs you have submitted to Condor that are still active. It will give you a cluster ID and process ID for each job.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process will start and stop all other condor daemons. condor_master will write a MasterLog and have a .master_address file. Be sure that the .master_address contains the correct IP address. If the correct IP address doesn't appear, set the correct value in the parameter <br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has, but Condor may have picked a different IP than you wanted, perhaps the local loopback address 127.0.0.1 or a wireless adapter instead of the wired Ethernet one.<br />
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes, and represents data about submitted jobs. It tracks the queueing of jobs and tries to get resources for all of its jobs to be run. When a job submitted is run, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine when an actual execution of the job is run. It will take care of systems calls that need to be executed on the submitting machine for a process. There will be a condor_shadow process for each executing process of a submission machine, meaning that on a machine with a large number of submitted processes, the number of shadow daemons supported by memory or other resources could be a limitation.<br />
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution. Log files of interest include NegotiatorLog and MatchLog<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon is used to detect user activity on an execute node, so Condor knows whether to allow execution of a job or to disallow it because a human user is currently engaged in some task.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, corresponding to [http://en.wikipedia.org/wiki/IA-64 IA64], the 64-bit Intel Itanium processor architecture. It does not cover the common x86-64 Intel/AMD 64-bit processors. <br />
<br />
While trying to run ''condor_master'', the shell returned the error message ''cannot execute binary file''.<br />
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], it's possible to read the header of an executable and determine whether it can run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the outputs, you can see that the '''Intel IA-64''' architecture isn't the right one for this machine; the X86-64 build is required.<br />
<br />
<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the location of the MPICH2\bin directory to your system path, and add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
The following was performed in a Windows CMD prompt run with Administrative Privileges (right click on the CMD executable and run it as an Administrator).<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's Windows account password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication password for smpd; this is the '''mypassphrase''' you entered above. If all is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processors since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines, and how many processes each machine should run, and the executable '''hostname''', which should return the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, you must right click on the project file (not the solution file), then click properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
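For reference, a roughly equivalent command-line build with the Visual Studio compiler would look like this ('''mpihello.cpp''' is a placeholder name for the source above, and the MPICH2 paths are illustrative; adjust them to your install location):<br />

```
cl /EHsc /I"C:\Program Files\MPICH2\include" mpihello.cpp ^
   /link /LIBPATH:"C:\Program Files\MPICH2\lib" mpi.lib cxx.lib
```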
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then to run your exe on two different machines (I had to build a RELEASE for this, at first I had built a DEBUG build, but this caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
<br />
== MPICH2 and Condor on Windows ==<br />
<br />
To configure MPI over Condor, I set my Windows 7 laptop up as a dedicated Condor scheduler and execute node, following the instructions at this link:<br />
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
I turned off smpd.exe as a service, since Condor will manage it:<br />
<br />
smpd.exe -remove<br />
<br />
<br />
I then created the following MPI example program, and ran it with a MachineCount of 2 and 4 at different times (I ran this on my laptop, which has 4 cores and Condor running locally). This program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.<br />
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char *argv[])<br />
{<br />
MPI::Init(argc, argv);<br />
//<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
int size = MPI::COMM_WORLD.Get_size();<br />
char processor_name[MPI_MAX_PROCESSOR_NAME];<br />
int namelen;<br />
MPI::Get_processor_name(processor_name,namelen);<br />
// <br />
int buf[1];<br />
int numElements = 1;<br />
// initialize buffer with rank<br />
buf[0] = rank;<br />
// example of synchronous communication<br />
// have even ranks send to odd ranks first<br />
// assumes an even number of processes<br />
int syncTag = 123;<br />
if (rank % 2 == 0)<br />
{<br />
// send to next higher rank<br />
int dest = rank + 1;<br />
std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Send(buf,numElements,MPI::INT,dest,syncTag);<br />
// now wait for dest to send back<br />
// even though dest is the param, in the Recv call it is used as the source<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,dest,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
}<br />
else<br />
{<br />
// source is next lower rank<br />
int source = rank - 1;<br />
std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,source,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
// send rank rather than buffer, just to be sure<br />
// even though param is source, in Send call it is for dest<br />
MPI::COMM_WORLD.Send(&rank,numElements,MPI::INT,source,syncTag);<br />
}<br />
// now for asynchronous communication<br />
std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
int asyncTag = 321;<br />
int leftRank;<br />
if (rank == 0)<br />
{<br />
leftRank = size-1;<br />
}<br />
else <br />
{<br />
leftRank = (rank-1);<br />
}<br />
int rightRank;<br />
if (rank == size-1)<br />
{<br />
rightRank = 0;<br />
}<br />
else <br />
{<br />
rightRank = (rank+1);<br />
}<br />
// everyone sends to the leftRank, and receives from the rightRank<br />
// if this were synchronous, would be in danger of deadlocking because of circular waits<br />
// reset buffers to rank<br />
buf[0] = rank;<br />
std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::Request sendReq = MPI::COMM_WORLD.Isend(buf,numElements,MPI::INT,leftRank,asyncTag);<br />
MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf,numElements,MPI::INT,rightRank,asyncTag);<br />
// wait on the receipt<br />
MPI::Status status;<br />
recvReq.Wait(status);<br />
std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
//<br />
MPI::Finalize();<br />
return(0);<br />
}<br />
<br />
<br />
I then created a two-machine Condor parallel universe grid, using the same user/password combination on both machines; this user had Administrative rights. My 4-core Windows 7 Professional laptop (X86_64,WINNT61) was the '''CONDOR_HOST''' and the '''DedicatedScheduler'''. Assuming this laptop is called '''clavicle''' in the '''shoulder.com''' domain, here are the important configuration parameters in its '''condor_config.local''' file: <br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files (x86)\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD, COLLECTOR, NEGOTIATOR, SMPD_SERVER<br />
<br />
<br />
The second machine, a 2 core Windows XP desktop called '''scapula''' (INTEL,WINNT51) was a dedicated machine with the following important parameters in its '''condor_config.local''' file:<br />
<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, STARTD, SMPD_SERVER<br />
<br />
Note how it points to '''clavicle''' both as '''CONDOR_HOST''' and as the '''DedicatedScheduler''', and that it does not have '''(x86)''' in its path to '''smpd.exe'''.<br />
<br />
MPICH2 gets installed to different locations, to '''C:\Program Files\MPICH2''' for an (INTEL,WINNT51) machine, and to '''C:\Program Files (x86)\MPICH2''' for a (X86_64,WINNT61) machine. Because I needed to refer to the '''mpiexec.exe''' path in the batch file that my Condor submit file referred to, and this batch file would be the same for both machines, I opted to copy '''mpiexec.exe''' to the work directory I was using on all machines, '''C:\mpich2work'''.<br />
<br />
After setting up this configuration, I rebooted both machines to start in this configuration. At this point, the command<br />
net stop condor<br />
no longer completes correctly, since Condor is now managing the '''smpd.exe''' service and does not know how to stop that service.<br />
<br />
<br />
<br />
Based on the Condor submit file described in the Condor wiki page, I had to adapt my own submit file for my above MPI executable called '''mpiwork.exe''' to be the following:<br />
<br />
<br />
universe = parallel<br />
executable = mp2script.bat<br />
arguments = mpiwork.exe<br />
machine_count = 6<br />
output = out.$(NODE).log<br />
error = error.$(NODE).log<br />
log = work.$(NODE).log<br />
should_transfer_files = yes<br />
when_to_transfer_output = on_exit<br />
transfer_input_files = mpiwork.exe<br />
Requirements = ((Arch == "INTEL") && (OpSys == "WINNT51")) || ((Arch == "X86_64") && (OpSys == "WINNT61"))<br />
queue<br />
<br />
and my '''mp2script.bat''' file was adapted to be: <br />
<br />
set _CONDOR_PROCNO=%_CONDOR_PROCNO%<br />
set _CONDOR_NPROCS=%_CONDOR_NPROCS%<br />
REM If not the head node, block forever (reading console input never returns)<br />
if not [%_CONDOR_PROCNO%] == [0] copy con nothing<br />
REM Set this to the bin directory of MPICH installation<br />
set MPDIR="C:\mpich2work"<br />
REM run the actual mpijob<br />
%MPDIR%\mpiexec.exe -n %_CONDOR_NPROCS% -p 6666 %*<br />
exit 0<br />
<br />
with the change being that '''MPDIR''' is a location that is known to have '''mpiexec.exe''' on both machines, since I copied that file there on both machines as described above.<br />
<br />
A point of confusion I have is that I would have expected the '''mpiwork.exe''' code above to print the names of both '''clavicle''' and '''scapula''' in the output when I ran it on 6 nodes. Instead only the name '''clavicle''' appeared. I do believe that this was running on both machines and all 6 cores, as both machines' IP addresses were noted in the work.parallel.log, and both machines had CPU activity at the time the executable ran.<br />
<br />
<br />
== MPI Based Meta-Optimization Framework ==<br />
<br />
=== High Level Architecture ===<br />
<br />
The system was designed for infrastructure reuse. The infrastructure classes take care of organization of the computation nodes and communication between the computation nodes. When a new MetaOptimization is needed, the client programmer should create subclasses of '''MetaOptimizer''', '''JobDefinition''', '''Results''', and '''RegistrationWorker''' specific to the computational problem.<br />
<br />
=== MPI Aware Infrastructure C++ classes ===<br />
<br />
[[File:Mpi metaopt class diagram.png | UML Diagram of C++ classes]]<br />
<br />
'''MasterSlave''': Entry point of the application, initializes the MPI infrastructure, starts up a '''Master''' (sending it a particular '''MetaOptimizer''' subclass), starts up as many '''Slave''' nodes as are necessary (sending each of them a particular '''RegistrationWorker''' subclass), and finalizes the MPI infrastructure when the computation is complete.<br />
<br />
<br />
[[File:Mpimeta start.png | MetaOptmization Framework Startup]]<br />
[[File:Mpimeta computation.png | MetaOptmization Framework Computation]]<br />
<br />
<br />
<br />
<br />
'''Master''': The central controlling part of the infrastructure, it manages the available '''Slave'''s, sending them '''JobDefinition'''s as needed, collects '''Results''' from the '''Slave'''s as '''Results''' come in, will re-run a '''JobDefinition''' if it has failed or timed out, and shuts down all '''Slave'''s when the computation is finished. The '''Master''' owns a particular '''MetaOptimizer''' subclass. The '''MetaOptimizer''' subclass is aware of the particular computation; the '''Master''' only knows how to communicate with '''Slave'''s and does not have any responsibility for determining the breakdown of a computation into parts or its completion.<br />
<br />
'''Slave''': One of the processing nodes, part of the infrastructure. Each '''Slave''' manages a '''RegistrationWorker''' subclass. '''Slave'''s announce their presence to the '''Master''', receive a '''JobDefinition''' describing a particular job, pass the '''JobDefinition''' on to the '''RegistrationWorker''' to run the job, receive back a '''Results''' upon completion of the job, and send the '''Results''' back to the '''Master'''. '''Slave'''s are intended to run multiple rounds of registration during a particular computational run.<br />
<br />
=== Infrastructure Event C++ classes ===<br />
<br />
'''MasterEvent''' & '''SlaveEvent''' (named for the origin of the Event) are Infrastructure Event classes, and each hold a Message Wrapper class as a payload.<br />
<br />
The idea behind the job message wrapper classes was to make message wrappers that knew how to bi-directionally convert their specific named parameters to array buffers, thus giving them a pseudo-serialization ability without having them to be aware of MPI specifics. Each message wrapper class is associated with a specific Infrastructure Event class ('''SlaveEvent'''s own a '''Results''', and '''MasterEvent'''s own a '''JobDefinition'''). The Infrastructure Event classes take care of sending from '''Master''' classes to '''Slave''' classes, and from '''Slave''' classes to '''Master''' classes.<br />
<br />
The '''MetaOptimizer''' subclasses will return '''JobDefinition''' subclasses to the '''Master''', which then wraps the '''JobDefinition''' subclass in a '''MasterEvent''', and sends the '''MasterEvent''' to a '''Slave'''. A '''Slave''' reads a '''MasterEvent''', extracts the '''JobDefinition''' subclass, and sends the '''JobDefinition''' to its '''RegistrationWorker''' subclass. Upon completion of the calculation for the '''JobDefinition''', the '''RegistrationWorker''' returns a '''Results''' subclass to the '''Slave''', which then wraps the '''Results''' in a '''SlaveEvent''' and sends the '''SlaveEvent''' to the '''Master'''. The '''Master''' reads the '''SlaveEvent''', extracts the '''Results''' subclass, and sends this '''Results''' subclass to the '''MetaOptimizer'''.<br />
<br />
=== Computation Specific C++ classes ===<br />
<br />
Note: currently these classes are specific to a particular example computation, they need to have the interface abstracted out and the specific example computation extracted and moved to subclasses.<br />
<br />
<br />
<br />
'''JobDefinition''': A Message Wrapper class that defines the parameters of a particular job, a specific subclass of '''JobDefinition''' should be used for a particular computation.<br />
<br />
'''Results''': A Message Wrapper class that defines the return values of a particular job; a specific subclass of '''Results''' should be used for a particular computation.<br />
<br />
'''MetaOptimizer''': The '''MetaOptimizer''' subclass for a particular computation takes care of determining the parameters for a particular job (in the form of a '''JobDefinition'''), and determining how many '''JobDefinition'''s will be in a round of computation. The '''MetaOptimizer''' then takes in the '''Results''' from each '''JobDefinition''', and upon receiving '''Results''' for each '''JobDefinition''' in a round, can either determine a new round of computation by recalculating the new set of parameters for the new round, or else terminate the computation.<br />
<br />
'''RegistrationWorker''': The '''RegistrationWorker''' subclass for a particular computation will take in a particular '''JobDefinition''' subclass germane to the computation, perform the actual registration computation, then return the results in the form of a specific '''Results''' subclass that is relevant to the computation.</div>Michael.grauerhttps://public.kitware.com/Wiki/index.php?title=Proposals:Condor&diff=41135Proposals:Condor2011-06-24T18:38:49Z<p>Michael.grauer: </p>
<hr />
<div>= Introduction =<br />
<br />
Condor is an open source distributed computing software framework. It can be used to manage workload on a dedicated cluster of computers, and/or to farm out work to idle desktop computers. Condor is a cross-platform system that runs on Unix and Windows operating systems. It is a complex and flexible system that can execute jobs in serial and parallel modes; for parallel jobs, it supports the MPI standard. This wiki page documents our working experience using Condor.<br />
<br />
= Downloading Condor =<br />
<br />
Different versions of Condor can be downloaded from [http://www.cs.wisc.edu/condor/downloads-v2/download.pl here]. This documentation focuses on our experience installing and configuring Condor version 7.2.0. The official detailed documentation for this version can be found [http://www.cs.wisc.edu/condor/manual/v7.2/ here].<br />
<br />
= Preparation =<br />
<br />
As Condor is a flexible system, there are different ways of configuring it in your computing infrastructure. Hence, before starting installation, make the following important decisions.<br />
<br />
#What machine will be the central manager?<br />
#What machines should be allowed to submit jobs?<br />
#Will Condor run as root or not?<br />
#Do I have enough disk space for Condor?<br />
#Do I need MPI configured?<br />
<br />
Condor can be installed as a ''manager'' node, an ''execute'' node, a ''submit'' node, or any combination of these roles. See [http://www.cs.wisc.edu/condor/manual/v7.2/3_1Introduction.html#SECTION00411000000000000000 The Different Roles a Machine Can Play]<br />
<br />
* Manager: There can be only one central manager for your pool. This machine is the collector of information, and the negotiator between resources and resource requests.<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer to the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
'''The official instructions on how to install Condor in Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here].''' Below we present some of the tweaks we had to make to get it working on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.). If it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a similar command to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or as a user with equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (For example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation scripts ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update the PATH variable to include the directories ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save the file and apply the change by running <br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out and log back in, or even restart the machine, and check that the CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit the Condor manager configuration file and update the lines as shown below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, then in order to allow MIDAS to run Condor commands, create a link to /root/condor/etc/condor_config in /home/condor<br />
 cd /home/condor<br />
 ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring an Executer/Submitter in Unix ===<br />
The files that allow the server to also be used as a Condor submitter/executer were automatically updated when running the installation script ''condor_install''. Nevertheless, you still need to update its configuration file.<br />
<br />
* Edit the node's condor_config.local and update the line as shown below:<br />
vi /home/condor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the lines containing ''UID_DOMAIN'' and ''FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor in Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor in Windows 7.<br />
<br />
# Download the Windows install MSI, run it, installing to "C:/condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager ( full address ).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For accounting domain enter your domain (e.g. yourdomaininternal.com)<br />
# For email settings, I skipped this by clicking Next.<br />
# For Java settings, I skipped this by clicking Next, since we weren't using Java.<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked whether you want a custom install or a standard install, choose the standard install. This will install Condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When running Condor commands, start a Cygwin or cmd prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix up some problems. My '''condor_status''' at first gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect (since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or perhaps just an IP that you would not want).<br />
<br />
<br />
I shut down condor, right clicked on C:/condor in a Windows explorer, turned off "read only" and set permissions to allow for writing. Then I edited the file "c:/condor/condor_config.local" which started out empty, so that it could pick up some replacement values (some of them didn't seem to be set properly during the install). These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following:<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourinternaldomain.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
Then restart Condor. If you run "condor_status" and see that the Windows machine has Activity=OWNER rather than UNCLAIMED, be sure that you have added START=True. But this may not be the best configuration for a Windows workstation that is in use; some additional configuration is probably needed to make sure a Condor job doesn't use the machine while a human user is there using it.<br />
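For that additional configuration, one common approach is to gate execution on keyboard idle time and machine load in condor_config.local. This is a sketch only: the thresholds are arbitrary, and the policy expressions should be checked against your Condor version's manual (KeyboardIdle and LoadAvg are standard Condor machine ClassAd attributes).<br />

```
# Only start jobs if the console has been idle 15 minutes and load is low
START = KeyboardIdle > (15 * 60) && LoadAvg < 0.3
# Suspend a running job as soon as the user comes back
SUSPEND = KeyboardIdle < 60
# Resume once the machine has been idle again for 15 minutes
CONTINUE = KeyboardIdle > (15 * 60)
```

On Windows, the condor_kbdd daemon (already present in the DAEMON_LIST above) is what reports keyboard activity for the KeyboardIdle attribute.<br />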
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here is the batch file contents for the actual job (printname.bat)<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, get a command prompt or Cygwin terminal, right click the icon to start it up, and click "run with elevated privileges" or "run as administrator".<br />
<br />
condor_master runs as a service on Windows and controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* Run the Condor master daemon<br />
condor_master<br />
<br />
* Assuming that during installation you set up the type as ''manager,execute,submit'' (the default), run the following command<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after you started condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Set up Condor to start automatically on boot <br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update the MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add the ''condor.boot'' service to all runlevels<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
#include <unistd.h><br />
#include <stdio.h><br />
int main( int argc, char** argv )<br />
{<br />
printf( "%s\n", argv[1] );<br />
fflush( stdout );<br />
sleep( 30 );<br />
return 0; <br />
}<br />
<br />
This program prints the command-line argument it is given, waits 30 seconds, then exits.<br />
<br />
Save this file as foo.c, then compile it (note the static linking):<br />
<br />
gcc foo.c -o foo --static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once so you can see how Condor will execute the job on multiple execute resources), you can <br />
change the condorjob file to be like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
<br />
We had a case with 6 slots: 2 were 32-bit with Arch=INTEL, and 4 were 64-bit with Arch=X86_64 (but we were unaware of the difference at first). We ran 6 jobs and then wondered why they would only execute on the submitting machine. So we changed the condorjob file to specify a certain architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told Condor to execute only on machines with an Intel architecture, so the job was not attempted on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. I've included this story in case it helps with debugging.<br />
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was ''condor''.<br />
*When building BatchMake, you need to build with grid support enabled.<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows you the jobs you have submitted to Condor that are still active. It will give you a cluster ID and process ID for each job.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process starts and stops all other Condor daemons. condor_master writes a MasterLog and has a .master_address file. Be sure that .master_address contains the correct IP address; if it doesn't, set the desired value in the parameter <br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has, but Condor may have picked a different IP than you wanted, perhaps the local loopback address 127.0.0.1 or a wireless adapter instead of the ethernet one.<br />
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes and holds data about submitted jobs. It tracks the queueing of jobs and tries to obtain resources so that all of its jobs can run. When a submitted job runs, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine when an actual execution of the job is run. It takes care of system calls that need to be executed on the submitting machine for a process. There will be one condor_shadow process for each executing process of a submission machine, so on a machine with a large number of submitted processes, the memory and other resources needed to support the shadow daemons can become a limitation.<br />
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution. Log files of interest include NegotiatorLog and MatchLog.<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon is used to detect user activity on an execute node, so Condor knows whether to allow a job to execute or to disallow it because a human user is currently engaged in some task.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, corresponding to [http://en.wikipedia.org/wiki/IA-64 IA64], the 64-bit Intel Itanium processor. This does not cover all 64-bit Intel processors.<br />
<br />
While trying to run the ''condor_master'', the shell returned the following error message ''cannot execute binary file''<br />
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], it's possible to extract the header of an executable and determine whether it can run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-6'''4<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the different outputs, it's possible to observe that the '''Intel IA-64''' architecture isn't the right one for this machine, which is '''X86-64'''.<br />
<br />
<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the location of the MPICH2\bin directory to your system path. Add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
The following was performed in a Windows CMD prompt run with Administrative Privileges (right click on the CMD executable and run it as an Administrator).<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's Windows account password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication password for smpd; this is the '''mypassphrase''' you entered above. If all is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processors since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines, and how many processes each machine should run, and the executable '''hostname''', which should return the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, you must right click on the project file (not the solution file), then click properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then to run your exe on two different machines (I had to build a Release build for this; at first I had built a Debug build, which caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
<br />
== MPICH2 and Condor on Windows ==<br />
<br />
To configure MPI over Condor, I set my Windows 7 laptop up as a dedicated Condor scheduler and executer, following the instructions at this link:<br />
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
I turned off smpd.exe as a service, since Condor will manage it:<br />
<br />
smpd.exe -remove<br />
<br />
<br />
I then created the following MPI example program, and ran it with a MachineCount of 2 and of 4 at different times (I ran this on my laptop, which has 4 cores, and had Condor running locally). This program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.<br />
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char *argv[])<br />
{<br />
MPI::Init(argc, argv);<br />
//<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
int size = MPI::COMM_WORLD.Get_size();<br />
 char processor_name[MPI_MAX_PROCESSOR_NAME];<br />
 int namelen;<br />
 MPI::Get_processor_name(processor_name,namelen);<br />
// <br />
int buf[1];<br />
int numElements = 1;<br />
// initialize buffer with rank<br />
buf[0] = rank;<br />
// example of synchronous communication<br />
// have even ranks send to odd ranks first<br />
// assumes an even number of processes<br />
int syncTag = 123;<br />
if (rank % 2 == 0)<br />
{<br />
// send to next higher rank<br />
int dest = rank + 1;<br />
std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Send(buf,numElements,MPI::INT,dest,syncTag);<br />
// now wait for dest to send back<br />
// even though dest is the param, in the Recv call it is used as the source<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,dest,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
}<br />
else<br />
{<br />
// source is next lower rank<br />
int source = rank - 1;<br />
std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,source,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
// send rank rather than buffer, just to be sure<br />
// even though param is source, in Send call it is for dest<br />
MPI::COMM_WORLD.Send(&rank,numElements,MPI::INT,source,syncTag);<br />
}<br />
// now for asynchronous communication<br />
std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
int asyncTag = 321;<br />
int leftRank;<br />
if (rank == 0)<br />
{<br />
leftRank = size-1;<br />
}<br />
else <br />
{<br />
leftRank = (rank-1);<br />
}<br />
int rightRank;<br />
if (rank == size-1)<br />
{<br />
rightRank = 0;<br />
}<br />
else <br />
{<br />
rightRank = (rank+1);<br />
}<br />
// everyone sends to the leftRank, and receives from the rightRank<br />
// if this were synchronous, would be in danger of deadlocking because of circular waits<br />
// reset buffers to rank<br />
buf[0] = rank;<br />
std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::Request sendReq = MPI::COMM_WORLD.Isend(buf,numElements,MPI::INT,leftRank,asyncTag);<br />
MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf,numElements,MPI::INT,rightRank,asyncTag);<br />
// wait on the receipt<br />
MPI::Status status;<br />
recvReq.Wait(status);<br />
std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
//<br />
MPI::Finalize();<br />
return(0);<br />
}<br />
<br />
<br />
I then created a two-machine Condor parallel universe grid, using the same user/password combination on both machines, and this user had Administrative rights. My 4-core Windows 7 Professional laptop (X86_64,WINNT61) was the '''CONDOR_HOST''' and the '''DedicatedScheduler'''. Assuming this laptop is called '''clavicle''' in the '''shoulder.com''' domain, here are the important configuration parameters in its '''condor_config.local''' file: <br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files (x86)\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD, COLLECTOR, NEGOTIATOR, SMPD_SERVER<br />
<br />
<br />
The second machine, a 2-core Windows XP desktop called '''scapula''' (INTEL,WINNT51), was a dedicated machine with the following important parameters in its '''condor_config.local''' file:<br />
<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, STARTD, SMPD_SERVER<br />
<br />
Note how it points to '''clavicle''' both as '''CONDOR_HOST''' and as the '''DedicatedScheduler''', and that it does not have '''(x86)''' in its path to '''smpd.exe'''.<br />
<br />
MPICH2 gets installed to different locations, to '''C:\Program Files\MPICH2''' for an (INTEL,WINNT51) machine, and to '''C:\Program Files (x86)\MPICH2''' for a (X86_64,WINNT61) machine. Because I needed to refer to the '''mpiexec.exe''' path in the batch file that my Condor submit file referred to, and this batch file would be the same for both machines, I opted to copy '''mpiexec.exe''' to the work directory I was using on all machines, '''C:\mpich2work'''.<br />
<br />
After setting up this configuration, I rebooted both machines to start in this configuration. At this point, the command<br />
net stop condor<br />
no longer returns correctly, since Condor is managing the '''smpd.exe''' service now and does not know how to stop that service.<br />
<br />
<br />
<br />
Based on the Condor submit file described in the Condor wiki page, I had to adapt my own submit file for my above MPI executable called '''mpiwork.exe''' to be the following:<br />
<br />
<br />
universe = parallel<br />
executable = mp2script.bat<br />
arguments = mpiwork.exe<br />
machine_count = 6<br />
output = out.$(NODE).log<br />
error = error.$(NODE).log<br />
log = work.$(NODE).log<br />
should_transfer_files = yes<br />
when_to_transfer_output = on_exit<br />
transfer_input_files = mpiwork.exe<br />
Requirements = ((Arch == "INTEL") && (OpSys == "WINNT51")) || ((Arch == "X86_64") && (OpSys == "WINNT61"))<br />
queue<br />
<br />
and my '''mp2script.bat''' file was adapted to be: <br />
<br />
set _CONDOR_PROCNO=%_CONDOR_PROCNO%<br />
set _CONDOR_NPROCS=%_CONDOR_NPROCS%<br />
REM If not the head node, just sleep forever<br />
if not [%_CONDOR_PROCNO%] == [0] copy con nothing<br />
REM Set this to the bin directory of MPICH installation<br />
set MPDIR="C:\mpich2work"<br />
REM run the actual mpijob<br />
%MPDIR%\mpiexec.exe -n %_CONDOR_NPROCS% -p 6666 %*<br />
exit 0<br />
<br />
with the change being that '''MPDIR''' is a location that is known to have '''mpiexec.exe''' on both machines, since I copied that file there on both machines as described above.<br />
<br />
A point of confusion I have is that I would have expected the '''mpiwork.exe''' code above to have the names of both '''clavicle''' and '''scapula''' in the output when I ran it on 6 nodes. Instead only the name '''clavicle''' appeared. I do believe that this was running on both machines and all 6 cores, as both machines' IP addresses were noted in the work.parallel.log, and both machines had CPU activity at the time the executable ran.<br />
<br />
<br />
== MPI Based Meta-Optimization Framework ==<br />
<br />
=== High Level Architecture ===<br />
<br />
The system was designed for infrastructure reuse. The infrastructure classes take care of organization of the computation nodes and communication between the computation nodes. When a new MetaOptimization is needed, the client programmer should create subclasses of '''MetaOptimizer''', '''JobDefinition''', '''Results''', and '''RegistrationWorker''' specific to the computational problem.<br />
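As a sketch of this reuse pattern, a client might subclass the infrastructure types roughly as follows. This is a minimal, MPI-free illustration with made-up names ('''SquareJob''', '''SquareWorker''', etc.); the framework's actual interfaces may differ.<br />
<br />
```cpp
#include <cassert>

// Hypothetical sketch of the subclassing pattern described above;
// the real framework interfaces may differ.
struct JobDefinition { virtual ~JobDefinition() {} };
struct Results { virtual ~Results() {} };

struct RegistrationWorker {
  virtual ~RegistrationWorker() {}
  // Run one job and return its results (caller owns the pointer).
  virtual Results* Run(const JobDefinition& job) = 0;
};

// Client-side subclasses for a toy computation: square a number.
struct SquareJob : JobDefinition {
  explicit SquareJob(double v) : x(v) {}
  double x;
};
struct SquareResults : Results {
  explicit SquareResults(double c) : value(c) {}
  double value;
};
struct SquareWorker : RegistrationWorker {
  Results* Run(const JobDefinition& job) {
    const SquareJob& j = static_cast<const SquareJob&>(job);
    return new SquareResults(j.x * j.x);
  }
};
```
The infrastructure never needs to know what a '''SquareJob''' means; it only moves '''JobDefinition'''s and '''Results''' between nodes.<br />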
<br />
=== MPI Aware Infrastructure C++ classes ===<br />
<br />
[[File:Mpi metaopt class diagram.png | UML Diagram of C++ classes]]<br />
<br />
'''MasterSlave''': Entry point of the application, initializes the MPI infrastructure, starts up a '''Master''' (sending it a particular '''MetaOptimizer''' subclass), starts up as many '''Slave''' nodes as are necessary (sending each of them a particular '''RegistrationWorker''' subclass), and finalizes the MPI infrastructure when the computation is complete.<br />
<br />
<br />
[[File:Mpimeta start.png | MetaOptmization Framework Startup]]<br />
[[File:Mpimeta computation.png | MetaOptmization Framework Computation]]<br />
<br />
<br />
<br />
<br />
'''Master''': The central controlling part of the infrastructure, it manages the available '''Slave'''s, sending them '''JobDefinition'''s as needed, collects '''Results''' from the '''Slave'''s as they come in, will re-run (NOT YET IMPLEMENTED) a '''JobDefinition''' if it has failed or timed out, and shuts down all '''Slave'''s when the computation is finished. The '''Master''' owns a particular '''MetaOptimizer''' subclass. The '''MetaOptimizer''' subclass is aware of the particular computation; the '''Master''' only knows how to communicate with '''Slave'''s and does not have any responsibility for determining the breakdown of a computation into parts or its completion.<br />
<br />
'''Slave''': One of the processing nodes, part of the infrastructure. Each '''Slave''' manages a '''RegistrationWorker''' (SHOULD THIS BE MORE GENERAL THAN REGISTRATION?) subclass. '''Slave'''s announce their presence to the '''Master''', receive a '''JobDefinition''' describing a particular job, pass the '''JobDefinition''' on to the '''RegistrationWorker''' to run the job, receive back a '''Results''' upon completion of the job, and send the '''Results''' back to the '''Master'''. '''Slave'''s are intended to run multiple rounds of registration during a particular computational run.<br />
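The '''Slave''' lifecycle can be sketched without any MPI, using queues to stand in for the send/receive calls. All names here are hypothetical, and a plain multiplication stands in for the registration work:<br />
<br />
```cpp
#include <cassert>
#include <queue>

// Non-MPI sketch of the Slave loop described above: pull a job, run it,
// push the result back, and stop when the Master signals shutdown.
// Queues stand in for MPI sends/receives; names are hypothetical.
struct SlaveJob {
  double x;        // job parameter
  bool shutdown;   // true when the Master is shutting the Slave down
};

int RunSlave(std::queue<SlaveJob>& fromMaster, std::queue<double>& toMaster) {
  int jobsDone = 0;
  while (!fromMaster.empty()) {
    SlaveJob job = fromMaster.front();
    fromMaster.pop();
    if (job.shutdown) break;         // Master finished the computation
    toMaster.push(job.x * job.x);    // stand-in for the RegistrationWorker
    ++jobsDone;
  }
  return jobsDone;
}
```
In the real infrastructure the two queues would be replaced by the '''MasterEvent'''/'''SlaveEvent''' exchanges described below, but the control flow of a '''Slave''' is essentially this loop.<br />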
<br />
=== Infrastructure Event C++ classes ===<br />
<br />
'''MasterEvent''' & '''SlaveEvent''' (named for the origin of the Event) are Infrastructure Event classes, and each hold a Message Wrapper class as a payload.<br />
<br />
The idea behind the job message wrapper classes was to make message wrappers that knew how to bi-directionally convert their specific named parameters to array buffers, thus giving them a pseudo-serialization ability without having them to be aware of MPI specifics. Each message wrapper class is associated with a specific Infrastructure Event class ('''SlaveEvent'''s own a '''Results''', and '''MasterEvent'''s own a '''JobDefinition'''). The Infrastructure Event classes take care of sending from '''Master''' classes to '''Slave''' classes, and from '''Slave''' classes to '''Master''' classes.<br />
<br />
The '''MetaOptimizer''' subclasses will return '''JobDefinition''' subclasses to the '''Master''', which then wraps the '''JobDefinition''' subclass in a '''MasterEvent''', and sends the '''MasterEvent''' to a '''Slave'''. A '''Slave''' reads a '''MasterEvent''', extracts the '''JobDefinition''' subclass, and sends the '''JobDefinition''' subclass to a '''RegistrationWorker''' subclass. Upon completion of the calculation for the '''JobDefinition''', the '''RegistrationWorker''' returns a '''Results''' subclass to the '''Slave''', which then wraps the '''Results''' in a '''SlaveEvent''' and sends the '''SlaveEvent''' to the '''Master'''. The '''Master''' reads the '''SlaveEvent''', extracts the '''Results''' subclass, and sends this '''Results''' subclass to the '''MetaOptimizer'''.<br />
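The pseudo-serialization idea can be sketched as follows. This is a minimal illustration with made-up field names, not the framework's actual message classes: the wrapper converts its named parameters to a flat buffer (something an MPI Isend could transmit) and reconstitutes itself on the receiving side, without knowing anything about MPI itself.<br />
<br />
```cpp
#include <cassert>
#include <vector>

// Hypothetical message wrapper with bi-directional conversion between
// named parameters and a flat array buffer.
class JobDefinitionMessage {
public:
  JobDefinitionMessage(int id, double step, double start)
    : jobId(id), stepSize(step), startValue(start) {}

  // Named parameters -> flat buffer, for sending.
  std::vector<double> ToBuffer() const {
    std::vector<double> buf;
    buf.push_back(static_cast<double>(jobId));
    buf.push_back(stepSize);
    buf.push_back(startValue);
    return buf;
  }

  // Flat buffer -> named parameters, for the receiving side.
  static JobDefinitionMessage FromBuffer(const std::vector<double>& buf) {
    return JobDefinitionMessage(static_cast<int>(buf[0]), buf[1], buf[2]);
  }

  int jobId;
  double stepSize;
  double startValue;
};
```
Only the wrapper knows the order and meaning of the buffer slots; the event classes just move the buffer.<br />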
<br />
=== Computation Specific C++ classes ===<br />
<br />
Note: currently these classes are specific to a particular example computation; they need to have the interface abstracted out and the specific example computation extracted and moved to subclasses.<br />
<br />
<br />
<br />
'''JobDefinition''': A Message Wrapper class that defines the parameters of a particular job, a specific subclass of '''JobDefinition''' should be used for a particular computation.<br />
<br />
'''Results''': A Message Wrapper class that defines the return values of a particular job, a specific subclass of '''Results''' should be used for a particular computation.<br />
<br />
'''MetaOptimizer''': The '''MetaOptimizer''' subclass for a particular computation takes care of determining the parameters for a particular job (in the form of a '''JobDefinition'''), and determining how many '''JobDefinition'''s will be in a round of computation. The '''MetaOptimizer''' then takes in the '''Results''' from each '''JobDefinition''', and upon receiving '''Results''' for each '''JobDefinition''' in a round, can either determine a new round of computation by recalculating the new set of parameters for the new round, or else terminate the computation.<br />
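The round structure described above can be illustrated with a sequential stand-in. Everything here is hypothetical and runs in one process: a plain function stands in for the '''Slave'''s, and the toy "optimization" minimizes (x-3)^2 by refining the candidate set each round.<br />
<br />
```cpp
#include <cassert>
#include <cmath>

// Stand-in for a Slave running one JobDefinition and returning a Results.
double EvaluateJob(double x) { return (x - 3.0) * (x - 3.0); }

// Toy MetaOptimizer round loop: each round proposes a set of candidate
// parameters, collects all the results, then either refines the search
// (a new round) or terminates.
double RunMetaOptimization() {
  double center = 0.0;   // current best guess
  double radius = 8.0;   // spread of the candidates in one round
  while (radius > 1e-6) {
    double bestX = center;
    double bestCost = EvaluateJob(center);
    for (int i = -2; i <= 2; ++i) {           // one round of JobDefinitions
      double x = center + radius * i / 2.0;
      double cost = EvaluateJob(x);           // the Results coming back
      if (cost < bestCost) { bestCost = cost; bestX = x; }
    }
    center = bestX;      // recalculate parameters for the next round
    radius /= 2.0;       // or terminate once refined enough
  }
  return center;
}
```
In the real framework the inner loop's evaluations would be farmed out to '''Slave'''s in parallel; the decision logic at the end of each round stays with the '''MetaOptimizer'''.<br />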
<br />
'''RegistrationWorker''': The '''RegistrationWorker''' subclass for a particular computation will take in a particular '''JobDefinition''' subclass germane to the computation, perform the actual registration computation, then return the results in the form of a specific '''Results''' subclass that is relevant to the computation.</div>
<hr />
<div>
<br />
* Manager: There can be only one central manager for your pool. The machine is the collector of information, and the negotiator between resources and resource request<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
''' The official instructions on how to install Condor in Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here] '''. Below we present some of tweaks we had to do to get it to work on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.). If it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a similar command to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or with a user having equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (For example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation scripts ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update PATH variable to include the directory ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save the file and apply the change by running <br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out / log in or even restart the machine, and you should be able to check that the CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit the condor manager configuration file and update the lines as shown below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, in order to allow Midas to run condor commands, create a link to /root/condor/etc/condor_config in /home/condor<br />
 cd /home/condor<br />
 ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring a Executer/Submitter in Unix ===<br />
The files that allow the server to also be used as a condor submitter/executer were automatically updated when the installation script ''condor_install'' ran. Nevertheless, you still need to update its configuration file.<br />
<br />
* Edit the condor node's condor_config.local file and update the line as shown below:<br />
vi /home/condor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the lines containing ''UID_DOMAIN'' and ''FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor in Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor in Windows 7.<br />
<br />
# Download the Windows install MSI, run it, installing to "C:/condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager ( full address ).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For accounting domain enter your domain (e.g. yourdomaininternal.com)<br />
# For Email settings (I ignored this by clicking next)<br />
# for Java settings (I ignored this as we weren't using Java, by clicking next)<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked if you want a custom install or the default install, choose the default install. This will install condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When you run condor commands, start a Cygwin or cmd prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix up some problems. My '''condor_status''' at first gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect: since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or perhaps just an IP that you would not want.<br />
<br />
<br />
I shut down condor, right clicked on C:/condor in a Windows explorer, turned off "read only" and set permissions to allow for writing. Then I edited the file "c:/condor/condor_config.local" which started out empty, so that it could pick up some replacement values (some of them didn't seem to be set properly during the install). These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following:<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourinternaldomain.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
and then restarted condor. If you run '''condor_status''' and see that the Windows machine shows Owner rather than Unclaimed, be sure that you have added in START = True. But this may not be the best configuration for a Windows workstation that is in use; there is probably some additional configuration needed to make sure a Condor job doesn't use the machine when a physical human user is there using it.<br />
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here is the batch file contents for the actual job (printname.bat)<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, get a command prompt or Cygwin terminal, right click the icon to start it up, and click "run with elevated privileges" or "run as administrator".<br />
<br />
condor_master runs as a service on Windows and controls the other Condor daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows for the first time, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* run the condor manager <br />
condor_master<br />
<br />
* Assuming you set the machine's type to ''manager,execute,submit'' (the default) during installation, run the following command<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after you started condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Set up condor to start automatically at boot <br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add the ''condor.boot'' service to all runlevels<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
 #include <unistd.h><br />
 #include <stdio.h><br />
 int main( int argc, char** argv )<br />
 {<br />
   if ( argc < 2 )<br />
   {<br />
     printf( "usage: %s <message>\n", argv[0] );<br />
     return 1;<br />
   }<br />
   printf( "%s\n", argv[1] );<br />
   fflush( stdout );<br />
   sleep( 30 );<br />
   return 0; <br />
 }<br />
<br />
This exe will print the command line argument it is given, wait 30 seconds, then exit.<br />
<br />
Save this file as foo.c, then compile it with (note static linking).<br />
<br />
gcc foo.c -o foo --static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once so you can see how Condor will execute the job on multiple execute resources), you can <br />
change the condorjob file to be like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
<br />
We had a case where we had 6 slots: 2 were 32-bit with Arch=INTEL, and 4 were 64-bit with Arch=X86_64 (but we were unaware of the bit difference at first). We ran 6 jobs and wondered why they would only execute on the submitting machine. So we changed the condorjob file to specify a certain architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told condor to execute only on machines with the INTEL architecture, so the job was not attempted on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. I've included this story in case it helps with debugging.<br />
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines on the Condor grid, because we do not have shared Unix users on our network. In our case this user was Condor.<br />
*When building BatchMake, you need to build with grid support turned on.<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows the jobs that you have submitted to Condor and that are still active. It will give you a cluster ID and process ID for each job.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process will start and stop all other condor daemons. condor_master will write a MasterLog and have a .master_address file. Be sure that the .master_address contains the correct IP address. If the correct IP address doesn't appear, set the correct value in the parameter <br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has, but note that Condor may have picked a different IP than you wanted, such as the local loopback address 127.0.0.1 or a wireless adapter instead of the ethernet adapter.<br />
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes and manages data about submitted jobs. It tracks the queueing of jobs and tries to obtain resources for all of its jobs to be run. When a submitted job runs, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine whenever one of its submitted jobs is actually executing. It takes care of system calls that need to be executed on the submitting machine for a process. There will be one condor_shadow process for each executing process of a submission machine, meaning that on a machine with a large number of submitted processes, the memory or other resources needed to support the shadow daemons could become a limitation.<br />
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution node. Log files of interest include NegotiatorLog and MatchLog.<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon is used to detect user activity on an execute node, so Condor knows whether to allow execution of a job or to disallow it because a human user is currently engaged in some task on that machine.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, corresponding to [http://en.wikipedia.org/wiki/IA-64 IA64], i.e. the 64-bit Intel Itanium processor. This architecture does not include all 64-bit Intel processors.<br />
<br />
While trying to run ''condor_master'', the shell returned the following error message: ''cannot execute binary file''<br />
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], it is possible to extract the header of an executable and determine whether it can run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-6'''4<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the different outputs, it is possible to observe that the architecture '''Intel IA-64''' isn't the right one.<br />
<br />
<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the location of the MPICH2\bin directory to your system path, and add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
The following was performed in a Windows CMD prompt run with Administrative Privileges (right click on the CMD executable and run it as an Administrator).<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's Windows account password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication password for smpd; this is the '''mypassphrase''' you entered above. If all is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processes since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines, how many processes each machine should run, and the executable '''hostname''', which should print the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, you must right click on the project file (not the solution file), then click properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then to run your exe on two different machines (I had to build a RELEASE build for this; at first I had built a DEBUG build, which caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
<br />
== MPICH2 and Condor on Windows ==<br />
<br />
To configure MPI over Condor, I set my Windows 7 laptop up as a dedicated Condor scheduler and executer, following the instructions at this link:<br />
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
I turned off smpd.exe as a service, since Condor will manage it:<br />
<br />
smpd.exe -remove<br />
<br />
<br />
I then created the following MPI example program, and ran it with a MachineCount of 2 and 4 at different times (I ran this on my laptop, which has 4 cores, and had Condor running locally). This program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.<br />
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char *argv[])<br />
{<br />
MPI::Init(argc, argv);<br />
//<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
int size = MPI::COMM_WORLD.Get_size();<br />
char processor_name[MPI_MAX_PROCESSOR_NAME];<br />
int namelen;<br />
MPI::Get_processor_name(processor_name,namelen);<br />
// <br />
int buf[1];<br />
int numElements = 1;<br />
// initialize buffer with rank<br />
buf[0] = rank;<br />
// example of synchronous communication<br />
// have even ranks send to odd ranks first<br />
// assumes an even number of processes<br />
int syncTag = 123;<br />
if (rank % 2 == 0)<br />
{<br />
// send to next higher rank<br />
int dest = rank + 1;<br />
std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Send(buf,numElements,MPI::INT,dest,syncTag);<br />
// now wait for dest to send back<br />
// even though dest is the param, in the Recv call it is used as the source<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,dest,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
}<br />
else<br />
{<br />
// source is next lower rank<br />
int source = rank - 1;<br />
std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,source,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
// send rank rather than buffer, just to be sure<br />
// even though param is source, in Send call it is for dest<br />
MPI::COMM_WORLD.Send(&rank,numElements,MPI::INT,source,syncTag);<br />
}<br />
// now for asynchronous communication<br />
std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
int asyncTag = 321;<br />
int leftRank;<br />
if (rank == 0)<br />
{<br />
leftRank = size-1;<br />
}<br />
else <br />
{<br />
leftRank = (rank-1);<br />
}<br />
int rightRank;<br />
if (rank == size-1)<br />
{<br />
rightRank = 0;<br />
}<br />
else <br />
{<br />
rightRank = (rank+1);<br />
}<br />
// everyone sends to the leftRank, and receives from the rightRank<br />
// if this were synchronous, would be in danger of deadlocking because of circular waits<br />
// reset buffers to rank<br />
buf[0] = rank;<br />
std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::Request sendReq = MPI::COMM_WORLD.Isend(buf,numElements,MPI::INT,leftRank,asyncTag);<br />
MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf,numElements,MPI::INT,rightRank,asyncTag);<br />
// wait on the receipt<br />
MPI::Status status;<br />
recvReq.Wait(status);<br />
std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
//<br />
MPI::Finalize();<br />
return(0);<br />
}<br />
<br />
<br />
I then created a two-machine Condor parallel universe grid, using the same user/password combination on both machines; this user had Administrative rights. My 4-core Windows 7 Professional laptop (X86_64,WINNT61) was the '''CONDOR_HOST''' and the '''DedicatedScheduler'''. Assuming this laptop is called '''clavicle''' in the '''shoulder.com''' domain, here are the important configuration parameters in its '''condor_config.local''' file: <br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files (x86)\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD, COLLECTOR, NEGOTIATOR, SMPD_SERVER<br />
<br />
<br />
The second machine, a 2 core Windows XP desktop called '''scapula''' (INTEL,WINNT51) was a dedicated machine with the following important parameters in its '''condor_config.local''' file:<br />
<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, STARTD, SMPD_SERVER<br />
<br />
Note how it points to '''clavicle''' both as '''CONDOR_HOST''' and as the '''DedicatedScheduler''', and that it does not have '''(x86)''' in its path to '''smpd.exe'''.<br />
<br />
MPICH2 gets installed to different locations, to '''C:\Program Files\MPICH2''' for an (INTEL,WINNT51) machine, and to '''C:\Program Files (x86)\MPICH2''' for a (X86_64,WINNT61) machine. Because I needed to refer to the '''mpiexec.exe''' path in the batch file that my Condor submit file referred to, and this batch file would be the same for both machines, I opted to copy '''mpiexec.exe''' to the work directory I was using on all machines, '''C:\mpich2work'''.<br />
<br />
After setting up this configuration, I rebooted both machines to start in this configuration. At this point, the command<br />
net stop condor<br />
no longer returns correctly, since Condor now manages the '''smpd.exe''' service and does not know how to stop it.<br />
<br />
<br />
<br />
Based on the Condor submit file described in the Condor wiki page, I had to adapt my own submit file for my above MPI executable called '''mpiwork.exe''' to be the following:<br />
<br />
<br />
universe = parallel<br />
executable = mp2script.bat<br />
arguments = mpiwork.exe<br />
machine_count = 6<br />
output = out.$(NODE).log<br />
error = error.$(NODE).log<br />
log = work.$(NODE).log<br />
should_transfer_files = yes<br />
when_to_transfer_output = on_exit<br />
transfer_input_files = mpiwork.exe<br />
Requirements = ((Arch == "INTEL") && (OpSys == "WINNT51")) || ((Arch == "X86_64") && (OpSys == "WINNT61"))<br />
queue<br />
<br />
and my '''mp2script.bat''' file was adapted to be: <br />
<br />
set _CONDOR_PROCNO=%_CONDOR_PROCNO%<br />
set _CONDOR_NPROCS=%_CONDOR_NPROCS%<br />
REM If not the head node, just sleep forever<br />
if not [%_CONDOR_PROCNO%] == [0] copy con nothing<br />
REM Set this to the bin directory of MPICH installation<br />
set MPDIR="C:\mpich2work"<br />
REM run the actual mpijob<br />
%MPDIR%\mpiexec.exe -n %_CONDOR_NPROCS% -p 6666 %*<br />
exit 0<br />
<br />
with the change being that '''MPDIR''' is a location that is known to have '''mpiexec.exe''' on both machines, since I copied that file there on both machines as described above.<br />
<br />
A point of confusion I have is that I would have expected the output of the '''mpiwork.exe''' code above to contain the names of both '''clavicle''' and '''scapula''' when I ran it on 6 nodes. Instead only the name '''clavicle''' appeared. I do believe that this was running on both machines and all 6 cores, as both machines' IP addresses were noted in the work.parallel.log, and both machines had CPU activity at the time the executable ran.<br />
<br />
<br />
== MPI Based Meta-Optimization Framework ==<br />
<br />
=== High Level Architecture ===<br />
<br />
The system was designed for infrastructure reuse. The infrastructure classes take care of organization of the computation nodes and communication between the computation nodes. When a new MetaOptimization is needed, the client programmer should create subclasses of '''MetaOptimizer''', '''JobDefinition''', '''Results''', and '''RegistrationWorker''' specific to the computational problem.<br />
<br />
=== MPI Aware Infrastructure C++ classes ===<br />
<br />
[[File:Mpi metaopt class diagram.png | UML Diagram of C++ classes]]<br />
<br />
'''MasterSlave''': Entry point of the application, initializes the MPI infrastructure, starts up a '''Master''' (sending it a particular '''MetaOptimizer''' subclass), starts up as many '''Slave''' nodes as are necessary (sending each of them a particular '''RegistrationWorker''' subclass), and finalizes the MPI infrastructure when the computation is complete.<br />
<br />
<br />
[[File:Mpimeta start.png | MetaOptimization Framework Startup]]<br />
<br />
<br />
<br />
<br />
'''Master''': The central controlling part of the infrastructure. It manages the available '''Slave'''s, sending them '''JobDefinition'''s as needed, collects '''Results''' from the '''Slave'''s as they come in, will re-run (NOT YET IMPLEMENTED) a '''JobDefinition''' if it has failed or timed out, and shuts down all '''Slave'''s when the computation is finished. The '''Master''' owns a particular '''MetaOptimizer''' subclass. The '''MetaOptimizer''' subclass is aware of the particular computation; the '''Master''' only knows how to communicate with '''Slave'''s and has no responsibility for determining the breakdown of a computation into parts or its completion.<br />
<br />
'''Slave''': One of the processing nodes, part of the infrastructure. Each '''Slave''' manages a '''RegistrationWorker''' (SHOULD THIS BE MORE GENERAL THAN REGISTRATION?) subclass. '''Slave'''s announce their presence to the '''Master''', receive a '''JobDefinition''' describing a particular job, pass the '''JobDefinition''' on to the '''RegistrationWorker''' to run the job, receive back a '''Results''' upon completion of the job, and send the '''Results''' back to the '''Master'''. '''Slave'''s are intended to run multiple rounds of registration during a particular computational run.<br />
<br />
=== Infrastructure Event C++ classes ===<br />
<br />
'''MasterEvent''' & '''SlaveEvent''' (named for the origin of the Event) are Infrastructure Event classes, and each hold a Message Wrapper class as a payload.<br />
<br />
The idea behind the job message wrapper classes was to make message wrappers that knew how to bi-directionally convert their specific named parameters to array buffers, thus giving them a pseudo-serialization ability without having them to be aware of MPI specifics. Each message wrapper class is associated with a specific Infrastructure Event class ('''SlaveEvent'''s own a '''Results''', and '''MasterEvent'''s own a '''JobDefinition'''). The Infrastructure Event classes take care of sending from '''Master''' classes to '''Slave''' classes, and from '''Slave''' classes to '''Master''' classes.<br />
<br />
The '''MetaOptimizer''' subclasses return '''JobDefinition''' subclasses to the '''Master''', which then wraps the '''JobDefinition''' subclass in a '''MasterEvent''' and sends the '''MasterEvent''' to a '''Slave'''. A '''Slave''' reads a '''MasterEvent''', extracts the '''JobDefinition''' subclass, and passes it to a '''RegistrationWorker''' subclass. Upon completion of the calculation for the '''JobDefinition''', the '''RegistrationWorker''' returns a '''Results''' subclass to the '''Slave''', which then wraps the '''Results''' in a '''SlaveEvent''' and sends the '''SlaveEvent''' to the '''Master'''. The '''Master''' reads the '''SlaveEvent''', extracts the '''Results''' subclass, and sends this '''Results''' subclass to the '''MetaOptimizer'''.<br />
<br />
=== Computation Specific C++ classes ===<br />
<br />
Note: currently these classes are specific to a particular example computation, they need to have the interface abstracted out and the specific example computation extracted and moved to subclasses.<br />
<br />
<br />
<br />
'''JobDefinition''': A Message Wrapper class that defines the parameters of a particular job, a specific subclass of '''JobDefinition''' should be used for a particular computation.<br />
<br />
'''Results''': A Message Wrapper class that defines the return values of a particular job, a specific subclass of '''Results''' should be used for a particular computation.<br />
<br />
'''MetaOptimizer''': The '''MetaOptimizer''' subclass for a particular computation takes care of determining the parameters for a particular job (in the form of a '''JobDefinition'''), and determining how many '''JobDefinition'''s will be in a round of computation. The '''MetaOptimizer''' then takes in the '''Results''' from each '''JobDefinition''', and upon receiving '''Results''' for each '''JobDefinition''' in a round, can either determine a new round of computation by recalculating the new set of parameters for the new round, or else terminate the computation.<br />
<br />
'''RegistrationWorker''': The '''RegistrationWorker''' subclass for a particular computation will take in a particular '''JobDefinition''' subclass germane to the computation, perform the actual registration computation, then return the results in the form of a specific '''Results''' subclass that is relevant to the computation.</div>Michael.grauerhttps://public.kitware.com/Wiki/index.php?title=File:Mpimeta_start.png&diff=41131File:Mpimeta start.png2011-06-24T18:35:13Z<p>Michael.grauer: </p>
<hr />
<div></div>Michael.grauerhttps://public.kitware.com/Wiki/index.php?title=Proposals:Condor&diff=41130Proposals:Condor2011-06-24T18:34:15Z<p>Michael.grauer: /* MPI Aware Infrastructure C++ classes */</p>
<hr />
<div>= Introduction =<br />
<br />
Condor is an open source distributed computing software framework. It can be used to manage workload on a dedicated cluster of computers, and/or to farm out work to idle desktop computers. Condor is a cross-platform system that can be run on Unix and Windows operating system. Condor is a complex and flexible system that can execute jobs in serial and parallel mode. For parallel jobs, it supports the MPI standard. This Wikipage is dedicated to document our working experience using Condor.<br />
<br />
= Downloading Condor =<br />
<br />
Different versions of condor can be downloaded from [http://www.cs.wisc.edu/condor/downloads-v2/download.pl here]. This documentation focuses on our experience installing/configuring Condor Version 7.2.0. The official detail documentation for this version can be found [http://www.cs.wisc.edu/condor/manual/v7.2/ here]<br />
<br />
= Preparation =<br />
<br />
As Condor is a flexible system, there are different ways of configuring condor in your computing infrastructure. Hence, before starting installation, make the following important decisions.<br />
<br />
#What machine will be the central manager?<br />
#What machines should be allowed to submit jobs?<br />
#Will Condor run as root or not?<br />
#Do I have enough disk space for Condor?<br />
#Do I need MPI configured?<br />
<br />
Condor can be installed as a ''manager'' node, an ''execute'' node, or a ''submit'' node, or any combination of these. See [http://www.cs.wisc.edu/condor/manual/v7.2/3_1Introduction.html#SECTION00411000000000000000 The Different Roles a Machine Can Play]<br />
<br />
* Manager: There can be only one central manager for your pool. This machine is the collector of information and the negotiator between resources and resource requests.<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer to the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
''' The official instructions on how to install Condor on Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here] '''. Below we present some of the tweaks we had to do to get it to work on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.); if it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
* Make sure the server has a hostname and a domainname.<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a similar command to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or with a user having equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (For example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation scripts ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update the PATH variable to include the directories ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save file and apply the change by running <br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out / log in, or even restart the machine, and check that the CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit the condor manager config file and update the lines as referenced below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, create a link to /root/condor/etc/condor_config in /home/condor in order to allow Midas to run condor commands<br />
 cd /home/condor<br />
 ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring an Executer/Submitter in Unix ===<br />
The files that allow the server to also be used as a condor submitter/executer were updated automatically when the installation script ''condor_install'' ran. Nevertheless, you still need to update its local configuration file.<br />
<br />
* Edit condor node config_file.local and update the line as referenced below:<br />
vi /home/condor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the lines containing ''UID_DOMAIN'' and ''FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor in Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor in Windows 7.<br />
<br />
# Download the Windows install MSI, run it, installing to "C:/condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager ( full address ).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For accounting domain enter your domain (e.g. yourdomaininternal.com)<br />
# For Email settings (I skipped this by clicking Next)<br />
# For Java settings (I skipped this by clicking Next, as we weren't using Java)<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
 hosts with administrator access: $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked if you want a custom install or a standard install, choose the standard install. This will install condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When you will be running condor commands, start a cygwin or cmd prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix some problems. At first, '''condor_status''' gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect: since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or simply an IP that you would not want Condor to use.<br />
<br />
<br />
I shut down condor, right clicked on C:/condor in a Windows explorer, turned off "read only" and set permissions to allow for writing. Then I edited the file "c:/condor/condor_config.local" which started out empty, so that it could pick up some replacement values (some of them didn't seem to be set properly during the install). These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following:<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourinternaldomain.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
and then restarted condor. If you run "condor_status" and see that the Windows machine has Activity=OWNER rather than UNCLAIMED, be sure that you have added START=True. This may not be the best configuration for a Windows workstation that is in active use, however: additional configuration is needed to make sure a Condor job doesn't use the machine while a physical human user is there using it.<br />
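For a workstation that people actively use, the usual approach is to gate job execution on keyboard idle time. The snippet below is only a sketch of such a policy, which we have not tested; the relevant macros are described in the policy configuration section of the Condor Manual, and the thresholds here are arbitrary examples:<br />

```
# Hypothetical desktop policy sketch (untested): start jobs only after
# 15 minutes of keyboard idle, suspend when the user returns.
START    = KeyboardIdle > (15 * 60)
SUSPEND  = KeyboardIdle < 60
CONTINUE = KeyboardIdle > (10 * 60)
```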
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here is the batch file contents for the actual job (printname.bat)<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, get a command prompt or Cygwin terminal, right click the icon to start it up, and click "run with elevated privileges" or "run as administrator".<br />
<br />
condor_master runs as a service on Windows, which controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows for the first time, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* run the condor manager <br />
condor_master<br />
<br />
* Assuming you set up the machine type as ''manager,execute,submit'' (the default) during installation, run the following command<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after you started condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Set up condor to start automatically at boot<br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add the ''condor.boot'' service to all runlevels<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
#include <unistd.h><br />
#include <stdio.h><br />
int main( int argc, char** argv )<br />
{<br />
printf( "%s\n", argv[1] );<br />
fflush( stdout );<br />
sleep( 30 );<br />
return 0; <br />
}<br />
<br />
This executable echoes the command line argument it is given, waits 30 seconds, then exits.<br />
<br />
Save this file as foo.c, then compile it with (note static linking).<br />
<br />
gcc foo.c -o foo --static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once so you can see how Condor will execute the job on multiple execute resources), you can <br />
change the condorjob file to be like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
<br />
We had a case where we had 6 slots, 2 were 32 bit with Arch=INTEL, 4 were 64 bit with Arch=X86_64 (but we were unaware of the bit difference at first). We ran 6 jobs and then were wondering why they would only execute on the submitting machine. So we changed the condorjob file to specify a certain architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told condor to execute the job only on machines with an Architecture of Intel, so it would not run on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. I've included this story in case it helps with debugging.<br />
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was ''condor''.<br />
*When building BatchMake, you need to build with grid support turned on<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows the processes you have submitted to condor that are still active. It will give you a cluster ID and process ID for each one.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process will start and stop all other condor daemons. condor_master will write a MasterLog and have a .master_address file. Be sure that the .master_address contains the correct IP address. If the correct IP address doesn't appear, set the correct value in the parameter <br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has, but condor may have picked a different IP than you would have wanted, perhaps the local loopback address 127.0.0.1 or a wireless instead of ethernet adapter.<br />
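To see which addresses are candidates for NETWORK_INTERFACE, list the machine's IPv4 addresses and ignore loopback. A sketch with a hardcoded example list (in practice the list would come from ''ip -4 addr'' or ''ifconfig'' on Linux, or ''ipconfig'' on Windows):<br />

```shell
# Sketch: pick non-loopback candidates for NETWORK_INTERFACE.
# The address list is hardcoded for illustration; in practice it would
# come from `ip -4 addr`, `ifconfig`, or `ipconfig` on Windows.
addrs="127.0.0.1
10.171.1.124
192.168.56.1"
candidates=$(echo "$addrs" | grep -v '^127\.')
echo "$candidates"
```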
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes, and represents data about submitted jobs. It tracks the queueing of jobs and tries to get resources for all of its jobs to be run. When a job submitted is run, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine when an actual execution of the job is run. It will take care of systems calls that need to be executed on the submitting machine for a process. There will be a condor_shadow process for each executing process of a submission machine, meaning that on a machine with a large number of submitted processes, the number of shadow daemons supported by memory or other resources could be a limitation.<br />
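As a back-of-the-envelope check of that limit, you can divide the submit machine's available memory by an assumed per-shadow footprint. Both figures below are made-up examples, not measured values:<br />

```shell
# Rough sketch: how many condor_shadow daemons might fit in memory.
# Both numbers are illustrative assumptions, not measurements.
mem_mb=4096       # free memory on the submit machine, in MB
shadow_mb=10      # assumed footprint of one condor_shadow, in MB
max_jobs=$((mem_mb / shadow_mb))
echo "approx max concurrent running jobs: $max_jobs"
```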
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution. Log files of interest include NegotiatorLog and MatchLog<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon is used to detect user activity on an execute node, so Condor knows whether to allow a job to run or to disallow it because a human user is currently engaged in some task.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, corresponding to [http://en.wikipedia.org/wiki/IA-64 IA64], the 64-bit Intel Itanium architecture. IA-64 does not cover all 64-bit Intel processors.<br />
<br />
While trying to run ''condor_master'', the shell returned the error message ''cannot execute binary file''.<br />
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], it's possible to extract the header of an executable and understand if a given executable could run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
 Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the different outputs, it's possible to observe that '''Intel IA-64''' isn't the right architecture for this machine.<br />
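When ''readelf'' is not available, the same information can be read directly from the ELF header: bytes 18-19 hold the little-endian ''e_machine'' field. A sketch using only ''od'' ('''/bin/ls''' is just a convenient example binary; 0x3e is x86-64, 0x03 is i386, 0x32 is IA-64):<br />

```shell
# Sketch: read the e_machine field of an ELF header with od, as a
# lightweight alternative to readelf. /bin/ls is just an example binary.
exe=/bin/ls
machine=$(od -An -tx1 -j18 -N2 "$exe" | tr -d ' \n')
case "$machine" in
  3e00) echo "X86-64 (Condor Arch X86_64)" ;;
  0300) echo "i386 (Condor Arch INTEL)" ;;
  3200) echo "Intel IA-64" ;;
  *)    echo "other e_machine: $machine" ;;
esac
```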
<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the location of the MPICH2\bin directory to your system path. Add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
The following was performed in a Windows CMD prompt run with Administrative Privileges (right click on the CMD executable and run it as an Administrator).<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's Windows account password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication password for smpd; this is the '''mypassphrase''' you entered above. If everything is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processors since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines, and how many processes each machine should run, and the executable '''hostname''', which should return the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, you must right click on the project file (not the solution file), then click properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then, to run your exe on two different machines (I had to build a RELEASE build for this; at first I built a DEBUG build, which caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
<br />
== MPICH2 and Condor on Windows ==<br />
<br />
To configure MPI over Condor, I set up my Windows 7 laptop as a dedicated Condor scheduler and executer, following the instructions at this link:<br />
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
I turned off smpd.exe as a service, since Condor will manage it:<br />
<br />
smpd.exe -remove<br />
<br />
<br />
I then created the following MPI example program, and ran it with a MachineCount of 2 and of 4 at different times (on my laptop, which has 4 cores and had Condor running locally). The program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.<br />
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char *argv[])<br />
{<br />
MPI::Init(argc, argv);<br />
//<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
int size = MPI::COMM_WORLD.Get_size();<br />
 char processor_name[MPI_MAX_PROCESSOR_NAME];<br />
 int namelen;<br />
 MPI::Get_processor_name(processor_name,namelen);<br />
// <br />
int buf[1];<br />
int numElements = 1;<br />
// initialize buffer with rank<br />
buf[0] = rank;<br />
// example of synchronous communication<br />
// have even ranks send to odd ranks first<br />
// assumes an even number of processes<br />
int syncTag = 123;<br />
if (rank % 2 == 0)<br />
{<br />
// send to next higher rank<br />
int dest = rank + 1;<br />
std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Send(buf,numElements,MPI::INT,dest,syncTag);<br />
// now wait for dest to send back<br />
// even though dest is the param, in the Recv call it is used as the source<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,dest,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
}<br />
else<br />
{<br />
// source is next lower rank<br />
int source = rank - 1;<br />
std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,source,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
// send rank rather than buffer, just to be sure<br />
// even though param is source, in Send call it is for dest<br />
MPI::COMM_WORLD.Send(&rank,numElements,MPI::INT,source,syncTag);<br />
}<br />
// now for asynchronous communication<br />
std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
int asyncTag = 321;<br />
int leftRank;<br />
if (rank == 0)<br />
{<br />
leftRank = size-1;<br />
}<br />
else <br />
{<br />
leftRank = (rank-1);<br />
}<br />
int rightRank;<br />
if (rank == size-1)<br />
{<br />
rightRank = 0;<br />
}<br />
else <br />
{<br />
rightRank = (rank+1);<br />
}<br />
// everyone sends to the leftRank, and receives from the rightRank<br />
// if this were synchronous, would be in danger of deadlocking because of circular waits<br />
// reset buffers to rank<br />
buf[0] = rank;<br />
std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::Request sendReq = MPI::COMM_WORLD.Isend(buf,numElements,MPI::INT,leftRank,asyncTag);<br />
MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf,numElements,MPI::INT,rightRank,asyncTag);<br />
// wait on the receipt<br />
MPI::Status status;<br />
recvReq.Wait(status);<br />
std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
//<br />
MPI::Finalize();<br />
return(0);<br />
}<br />
<br />
<br />
I then created a two machine Condor parallel universe grid, using the same user/password combination on both machines; this user had Administrative rights. My 4 core Windows 7 Professional laptop (X86_64,WINNT61) was the '''CONDOR_HOST''' and the '''DedicatedScheduler'''. Assuming this laptop is called '''clavicle''' in the '''shoulder.com''' domain, here are the important configuration parameters in its '''condor_config.local''' file: <br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files (x86)\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD, COLLECTOR, NEGOTIATOR, SMPD_SERVER<br />
<br />
<br />
The second machine, a 2 core Windows XP desktop called '''scapula''' (INTEL,WINNT51), was a dedicated machine with the following important parameters in its '''condor_config.local''' file:<br />
<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, STARTD, SMPD_SERVER<br />
<br />
Note how it points to '''clavicle''' both as '''CONDOR_HOST''' and as the '''DedicatedScheduler''', and that it does not have '''(x86)''' in its path to '''smpd.exe'''.<br />
<br />
MPICH2 gets installed to different locations: '''C:\Program Files\MPICH2''' on an (INTEL,WINNT51) machine, and '''C:\Program Files (x86)\MPICH2''' on a (X86_64,WINNT61) machine. Because the batch file referenced by my Condor submit file needs to use the same '''mpiexec.exe''' path on both machines, I opted to copy '''mpiexec.exe''' to the work directory I was using on all machines, '''C:\mpich2work'''.<br />
<br />
After setting up this configuration, I rebooted both machines so that it would take effect. At this point, the command<br />
net stop condor<br />
no longer completes cleanly, since Condor now manages the '''smpd.exe''' service and does not know how to stop it.<br />
<br />
<br />
<br />
Based on the Condor submit file described in the Condor wiki page, I had to adapt my own submit file for my above MPI executable called '''mpiwork.exe''' to be the following:<br />
<br />
<br />
universe = parallel<br />
executable = mp2script.bat<br />
arguments = mpiwork.exe<br />
machine_count = 6<br />
output = out.$(NODE).log<br />
error = error.$(NODE).log<br />
log = work.$(NODE).log<br />
should_transfer_files = yes<br />
when_to_transfer_output = on_exit<br />
transfer_input_files = mpiwork.exe<br />
Requirements = ((Arch == "INTEL") && (OpSys == "WINNT51")) || ((Arch == "X86_64") && (OpSys == "WINNT61"))<br />
queue<br />
<br />
and my '''mp2script.bat''' file was adapted to be: <br />
<br />
set _CONDOR_PROCNO=%_CONDOR_PROCNO%<br />
set _CONDOR_NPROCS=%_CONDOR_NPROCS%<br />
REM If not the head node, just sleep forever<br />
if not [%_CONDOR_PROCNO%] == [0] copy con nothing<br />
REM Set this to the bin directory of MPICH installation<br />
set MPDIR="C:\mpich2work"<br />
REM run the actual mpijob<br />
%MPDIR%\mpiexec.exe -n %_CONDOR_NPROCS% -p 6666 %*<br />
exit 0<br />
<br />
with the change being that '''MPDIR''' is a location that is known to have '''mpiexec.exe''' on both machines, since I copied that file there on both machines as described above.<br />
<br />
A point of confusion: I would have expected the output of the '''mpiwork.exe''' run above to contain the names of both '''clavicle''' and '''scapula''' when run on 6 nodes, but only the name '''clavicle''' appeared. I do believe the job ran on both machines and all 6 cores, as both machines' IP addresses were noted in work.parallel.log, and both machines showed CPU activity while the executable ran.<br />
<br />
<br />
== MPI Based Meta-Optimization Framework ==<br />
<br />
=== High Level Architecture ===<br />
<br />
The system was designed for infrastructure reuse. The infrastructure classes take care of organizing the computation nodes and of communication between them. For a new meta-optimization, the client programmer creates subclasses of '''MetaOptimizer''', '''JobDefinition''', '''Results''', and '''RegistrationWorker''' specific to the computational problem.<br />
<br />
=== MPI Aware Infrastructure C++ classes ===<br />
<br />
[[File:Mpi metaopt class diagram.png | UML Diagram of C++ classes]]<br />
<br />
<br />
<br />
'''MasterSlave''': Entry point of the application, initializes the MPI infrastructure, starts up a '''Master''' (sending it a particular '''MetaOptimizer''' subclass), starts up as many '''Slave''' nodes as are necessary (sending each of them a particular '''RegistrationWorker''' subclass), and finalizes the MPI infrastructure when the computation is complete.<br />
<br />
'''Master''': The central controlling part of the infrastructure. It manages the available '''Slave'''s, sending them '''JobDefinition'''s as needed, collects '''Results''' from the '''Slave'''s as they come in, will re-run (NOT YET IMPLEMENTED) a '''JobDefinition''' if it has failed or timed out, and shuts down all '''Slave'''s when the computation is finished. The '''Master''' owns a particular '''MetaOptimizer''' subclass. The '''MetaOptimizer''' subclass is aware of the particular computation; the '''Master''' only knows how to communicate with '''Slave'''s and has no responsibility for determining the breakdown of a computation into parts or its completion.<br />
<br />
'''Slave''': One of the processing nodes, part of the infrastructure. Each '''Slave''' manages a '''RegistrationWorker''' (SHOULD THIS BE MORE GENERAL THAN REGISTRATION?) subclass. '''Slave'''s announce their presence to the '''Master''', receive a '''JobDefinition''' describing a particular job, pass the '''JobDefinition''' on to the '''RegistrationWorker''' to run the job, receive back a '''Results''' upon completion of the job, and send the '''Results''' back to the '''Master'''. '''Slave'''s are intended to run multiple rounds of registration during a particular computational run.<br />
<br />
=== Infrastructure Event C++ classes ===<br />
<br />
'''MasterEvent''' & '''SlaveEvent''' (named for the origin of the Event) are Infrastructure Event classes, and each holds a Message Wrapper class as a payload.<br />
<br />
The idea behind the job message wrapper classes is to make message wrappers that know how to bi-directionally convert their specific named parameters to array buffers, giving them a pseudo-serialization ability without requiring them to be aware of MPI specifics. Each message wrapper class is associated with a specific Infrastructure Event class ('''SlaveEvent'''s own a '''Results''', and '''MasterEvent'''s own a '''JobDefinition'''). The Infrastructure Event classes take care of sending from '''Master''' classes to '''Slave''' classes, and from '''Slave''' classes to '''Master''' classes.<br />
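As an illustration of this pseudo-serialization idea, a minimal C++ sketch might look like the following. The class and field names (ResultsSketch, jobId, metricValue) are hypothetical, not the project's actual code; the point is only that the wrapper converts its named parameters to and from a flat buffer without touching MPI:

```cpp
#include <vector>

// Hypothetical message wrapper: knows how to flatten its named
// parameters into a buffer (suitable for an MPI_Send elsewhere)
// and rebuild itself from such a buffer, with no MPI awareness.
struct ResultsSketch {
  int jobId;          // which JobDefinition these results answer
  double metricValue; // e.g. the registration metric for that job

  // Pack the named parameters into a flat buffer.
  std::vector<double> toBuffer() const {
    return { static_cast<double>(jobId), metricValue };
  }

  // Rebuild a ResultsSketch from a buffer produced by toBuffer().
  static ResultsSketch fromBuffer(const std::vector<double>& buf) {
    ResultsSketch r;
    r.jobId = static_cast<int>(buf[0]);
    r.metricValue = buf[1];
    return r;
  }
};
```

In the framework described here, the Infrastructure Event classes would own such a wrapper and hand the buffer to the actual MPI calls.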
<br />
The '''MetaOptimizer''' subclasses return '''JobDefinition''' subclasses to the '''Master''', which then wraps the '''JobDefinition''' subclass in a '''MasterEvent''' and sends the '''MasterEvent''' to a '''Slave'''. A '''Slave''' reads a '''MasterEvent''', extracts the '''JobDefinition''' subclass, and passes the '''JobDefinition''' subclass to a '''RegistrationWorker''' subclass. Upon completion of the calculation for the '''JobDefinition''', the '''RegistrationWorker''' returns a '''Results''' subclass to the '''Slave''', which then wraps the '''Results''' in a '''SlaveEvent''' and sends the '''SlaveEvent''' to the '''Master'''. The '''Master''' reads the '''SlaveEvent''', extracts the '''Results''' subclass, and sends this '''Results''' subclass to the '''MetaOptimizer'''.<br />
<br />
=== Computation Specific C++ classes ===<br />
<br />
Note: currently these classes are specific to a particular example computation; the interface needs to be abstracted out, and the example-specific computation extracted and moved into subclasses.<br />
<br />
<br />
<br />
'''JobDefinition''': A Message Wrapper class that defines the parameters of a particular job; a specific subclass of '''JobDefinition''' should be used for a particular computation.<br />
<br />
'''Results''': A Message Wrapper class that defines the return values of a particular job; a specific subclass of '''Results''' should be used for a particular computation.<br />
<br />
'''MetaOptimizer''': The '''MetaOptimizer''' subclass for a particular computation takes care of determining the parameters for each job (in the form of a '''JobDefinition''') and how many '''JobDefinition'''s make up a round of computation. The '''MetaOptimizer''' then takes in the '''Results''' from each '''JobDefinition''', and upon receiving '''Results''' for every '''JobDefinition''' in a round, can either start a new round of computation by recalculating the set of parameters, or terminate the computation.<br />
<br />
'''RegistrationWorker''': The '''RegistrationWorker''' subclass for a particular computation will take in the '''JobDefinition''' subclass germane to the computation, perform the actual registration computation, and then return the results in the form of the '''Results''' subclass relevant to the computation.<br />
<br />
<br />
* Manager: There can be only one central manager for your pool. This machine is the collector of information and the negotiator between resources and resource requests.<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer to the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
''' The official instructions on how to install Condor in Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here] '''. Below we present some of the tweaks we had to do to get it to work on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.) , if it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a similar command to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or with a user having equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (For example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation scripts ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update PATH variable to include the directory ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save file and apply the change by running <br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out/log in or even restart the machine, and you should be able to check that the CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit the Condor manager configuration file and update the lines referenced below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, then in order to allow MIDAS to run Condor commands, create a link to /root/condor/etc/condor_config in /home/condor<br />
cd /home/condor<br />
ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring an Executer/Submitter in Unix ===<br />
The files allowing the server to also be used as a Condor submitter/executer were automatically updated while running the installation script ''condor_install''. Nevertheless, you still need to update its configuration file.<br />
<br />
* Edit the Condor node configuration file condor_config.local and update the line referenced below:<br />
vi /home/condor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the lines with ''UID_DOMAIN'' and ''FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor in Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor in Windows 7.<br />
<br />
# Download the Windows install MSI, run it, installing to "C:/condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager ( full address ).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For the accounting domain, enter your domain (e.g. yourdomaininternal.com)<br />
# For the email settings, I skipped this by clicking next<br />
# For the Java settings, I skipped this by clicking next, as we weren't using Java<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked whether you want a custom install or the default install, choose the default install. This will install Condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When running Condor commands, start a Cygwin or cmd prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix up some problems. My '''condor_status''' at first gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect: since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or simply an IP that you do not want.<br />
<br />
<br />
I shut down condor, right clicked on C:/condor in a Windows explorer, turned off "read only" and set permissions to allow for writing. Then I edited the file "c:/condor/condor_config.local" which started out empty, so that it could pick up some replacement values (some of them didn't seem to be set properly during the install). These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following:<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourinternaldomain.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
and then restarted condor. If you run "condor_status" and see that the Windows machine has Activity=OWNER rather than UNCLAIMED, be sure that you have added START=True. This may not be the best configuration for a Windows workstation that is in use, though; some additional configuration is probably needed to make sure a Condor job doesn't use the machine while a physical human user is there using it.<br />
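One common way to express such a policy (a sketch only; the thresholds are illustrative and untested on our pool) is with the startd policy expressions from the Condor manual, using the keyboard idle time that condor_kbdd reports:

```
# Illustrative policy: only start jobs after 5 minutes without console
# activity on a lightly loaded machine, and suspend them as soon as
# the user returns.
START    = KeyboardIdle > 300 && LoadAvg < 0.3
SUSPEND  = KeyboardIdle < 60
CONTINUE = KeyboardIdle > 300
```

See the startd policy configuration section of the Condor manual for the full set of expressions (PREEMPT, KILL, etc.) and their semantics.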
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here are the contents of the batch file for the actual job (printname.bat):<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, get a command prompt or Cygwin terminal, right click the icon to start it up, and click "run with elevated privileges" or "run as administrator".<br />
<br />
condor_master runs as a service on Windows, which controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows for the first time, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* run the condor manager <br />
condor_master<br />
<br />
* Assuming that during installation you set up the type as ''manager,execute,submit'' (the default), run the following command<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after you started condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Set up Condor to start automatically at boot<br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update the MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add the ''condor.boot'' service to all runlevels<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
#include <unistd.h><br />
#include <stdio.h><br />
int main( int argc, char** argv )<br />
{<br />
  if( argc < 2 ) return 1; /* expect the string to print as argv[1] */<br />
  printf( "%s\n", argv[1] );<br />
  fflush( stdout );<br />
  sleep( 30 );<br />
  return 0;<br />
}<br />
<br />
This program will print the command-line argument it is given, wait 30 seconds, then exit.<br />
<br />
Save this file as foo.c, then compile it as follows (note the static linking):<br />
<br />
gcc foo.c -o foo --static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once so you can see how Condor will execute the job on multiple execute resources), you can <br />
change the condorjob file to be like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
<br />
We had a case where we had 6 slots, 2 were 32 bit with Arch=INTEL, 4 were 64 bit with Arch=X86_64 (but we were unaware of the bit difference at first). We ran 6 jobs and then were wondering why they would only execute on the submitting machine. So we changed the condorjob file to specify a certain architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told Condor to execute only on machines with an Intel architecture, so execution was not attempted on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. I've included this story in case it helps with debugging.<br />
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was ''condor''.<br />
*When building BatchMake, you need to build with grid support turned on<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows which of the jobs you have submitted to Condor are active. It will give you a cluster ID and process ID for each job.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process starts and stops all other Condor daemons. condor_master writes a MasterLog and has a '''.master_address''' file. Be sure that '''.master_address''' contains the correct IP address; if it does not, set the parameter<br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has, but Condor may have picked a different IP than you wanted, perhaps the local loopback address 127.0.0.1 or a wireless adapter instead of the ethernet one.<br />
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes, and represents data about submitted jobs. It tracks the queueing of jobs and tries to get resources for all of its jobs to be run. When a submitted job runs, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine when an actual execution of the job is run. It takes care of system calls that need to be executed on the submitting machine for a process. There will be a condor_shadow process for each executing process of a submission machine, meaning that on a machine with a large number of submitted processes, the number of shadow daemons that memory or other resources can support could become a limitation.<br />
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution. Log files of interest include NegotiatorLog and MatchLog<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon is used to detect user activity on an execute node, so Condor knows whether to allow execution of a job or to disallow it because a human user is currently engaged in some task.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, corresponding to [http://en.wikipedia.org/wiki/IA-64 IA64], i.e. the Intel Itanium 64-bit processor. IA64 does not cover all 64-bit Intel processors.<br />
<br />
While trying to run ''condor_master'', the shell returned the error message ''cannot execute binary file''.<br />
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], it is possible to inspect the header of an executable and determine whether it can run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the outputs shows that the '''Intel IA-64''' binary does not match this machine's architecture; like ''/bin/ls'', the host expects '''Advanced Micro Devices X86-64''' executables, so the x86-64 Condor package is the correct one for this host.<br />
<br />
<br />
Additional notes from our experience:<br />
* Be sure that your executable is statically linked.<br />
* For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was ''condor''.<br />
* When building BatchMake, you need to build with grid support enabled.<br />
* Condor supplies a number of utility programs and log files that are extremely helpful in understanding and correcting problems. Our setups keep the Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the location of the MPICH2\bin directory to your system path, and add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
The following was performed in a Windows CMD prompt run with Administrative Privileges (right click on the CMD executable and run it as an Administrator).<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's Windows account password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication password for smpd; this is the '''mypassphrase''' you entered above. If all is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processors since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines, and how many processes each machine should run, and the executable '''hostname''', which should return the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, you must right click on the project file (not the solution file), then click properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then to run your exe on two different machines (I had to build a RELEASE for this, at first I had built a DEBUG build, but this caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
<br />
== MPICH2 and Condor on Windows ==<br />
<br />
To configure MPI over Condor, I set my Windows 7 laptop up as a dedicated Condor scheduler and executor, following the instructions at this link:<br />
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
I turned off smpd.exe as a service, since Condor will manage it:<br />
<br />
smpd.exe -remove<br />
<br />
<br />
I then created the following MPI example program, and ran it with a MachineCount of 2 and 4 at different times (I ran this on my laptop, which has 4 cores and Condor running locally). The program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.<br />
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char *argv[])<br />
{<br />
MPI::Init(argc, argv);<br />
//<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
int size = MPI::COMM_WORLD.Get_size();<br />
char processor_name[MPI_MAX_PROCESSOR_NAME];<br />
int namelen; // filled in by Get_processor_name<br />
MPI::Get_processor_name(processor_name,namelen);<br />
// <br />
int buf[1];<br />
int numElements = 1;<br />
// initialize buffer with rank<br />
buf[0] = rank;<br />
// example of synchronous communication<br />
// have even ranks send to odd ranks first<br />
// assumes an even number of processes<br />
int syncTag = 123;<br />
if (rank % 2 == 0)<br />
{<br />
// send to next higher rank<br />
int dest = rank + 1;<br />
std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Send(buf,numElements,MPI::INT,dest,syncTag);<br />
// now wait for dest to send back<br />
// even though dest is the param, in the Recv call it is used as the source<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,dest,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
}<br />
else<br />
{<br />
// source is next lower rank<br />
int source = rank - 1;<br />
std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,source,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
// send rank rather than buffer, just to be sure<br />
// even though param is source, in Send call it is for dest<br />
MPI::COMM_WORLD.Send(&rank,numElements,MPI::INT,source,syncTag);<br />
}<br />
// now for asynchronous communication<br />
std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
int asyncTag = 321;<br />
int leftRank;<br />
if (rank == 0)<br />
{<br />
leftRank = size-1;<br />
}<br />
else <br />
{<br />
leftRank = (rank-1);<br />
}<br />
int rightRank;<br />
if (rank == size-1)<br />
{<br />
rightRank = 0;<br />
}<br />
else <br />
{<br />
rightRank = (rank+1);<br />
}<br />
// everyone sends to the leftRank, and receives from the rightRank<br />
// if this were synchronous, would be in danger of deadlocking because of circular waits<br />
// reset buffers to rank<br />
buf[0] = rank;<br />
std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::Request sendReq = MPI::COMM_WORLD.Isend(buf,numElements,MPI::INT,leftRank,asyncTag);<br />
MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf,numElements,MPI::INT,rightRank,asyncTag);<br />
// wait on the receipt<br />
MPI::Status status;<br />
recvReq.Wait(status);<br />
std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
//<br />
MPI::Finalize();<br />
return(0);<br />
}<br />
<br />
<br />
I then created a two-machine Condor parallel universe grid, using the same user/password combination on both machines, and this user had Administrative rights. My 4-core Windows 7 Professional laptop (X86_64,WINNT61) was the '''CONDOR_HOST''' and the '''DedicatedScheduler'''. Assuming this laptop is called '''clavicle''' in the '''shoulder.com''' domain, here are the important configuration parameters in its '''condor_config.local''' file: <br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files (x86)\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD, COLLECTOR, NEGOTIATOR, SMPD_SERVER<br />
<br />
<br />
The second machine, a 2 core Windows XP desktop called '''scapula''' (INTEL,WINNT51) was a dedicated machine with the following important parameters in its '''condor_config.local''' file:<br />
<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, STARTD, SMPD_SERVER<br />
<br />
Note how it points to '''clavicle''' both as '''CONDOR_HOST''' and as the '''DedicatedScheduler''', and that it does not have '''(x86)''' in its path to '''smpd.exe'''.<br />
<br />
MPICH2 gets installed to different locations, to '''C:\Program Files\MPICH2''' for an (INTEL,WINNT51) machine, and to '''C:\Program Files (x86)\MPICH2''' for a (X86_64,WINNT61) machine. Because I needed to refer to the '''mpiexec.exe''' path in the batch file that my Condor submit file referred to, and this batch file would be the same for both machines, I opted to copy '''mpiexec.exe''' to the work directory I was using on all machines, '''C:\mpich2work'''.<br />
<br />
After setting up this configuration, I rebooted both machines to start in this configuration. At this point, the command<br />
net stop condor<br />
no longer returns correctly, since Condor is now managing the '''smpd.exe''' service and does not know how to stop that service.<br />
<br />
<br />
<br />
Based on the Condor submit file described in the Condor wiki page, I had to adapt my own submit file for my above MPI executable called '''mpiwork.exe''' to be the following:<br />
<br />
<br />
universe = parallel<br />
executable = mp2script.bat<br />
arguments = mpiwork.exe<br />
machine_count = 6<br />
output = out.$(NODE).log<br />
error = error.$(NODE).log<br />
log = work.$(NODE).log<br />
should_transfer_files = yes<br />
when_to_transfer_output = on_exit<br />
transfer_input_files = mpiwork.exe<br />
Requirements = ((Arch == "INTEL") && (OpSys == "WINNT51")) || ((Arch == "X86_64") && (OpSys == "WINNT61"))<br />
queue<br />
<br />
and my '''mp2script.bat''' file was adapted to be: <br />
<br />
set _CONDOR_PROCNO=%_CONDOR_PROCNO%<br />
set _CONDOR_NPROCS=%_CONDOR_NPROCS%<br />
REM If not the head node, just sleep forever<br />
if not [%_CONDOR_PROCNO%] == [0] copy con nothing<br />
REM Set this to the bin directory of MPICH installation<br />
set MPDIR="C:\mpich2work"<br />
REM run the actual mpijob<br />
%MPDIR%\mpiexec.exe -n %_CONDOR_NPROCS% -p 6666 %*<br />
exit 0<br />
<br />
with the change being that '''MPDIR''' is a location that is known to have '''mpiexec.exe''' on both machines, since I copied that file there on both machines as described above.<br />
<br />
A point of confusion I have is that I would have expected the '''mpiwork.exe''' code above to have the name of '''clavicle''' and '''scapula''' in the output when I ran it on 6 nodes. Instead only the name '''clavicle''' appeared. I do believe that this was running on both machines and all 6 cores, as both machine's IP addresses were noted in the work.parallel.log, and both machines had CPU activity at the time the executable ran.<br />
<br />
<br />
== MPI Based Meta-Optimization Framework ==<br />
<br />
=== High Level Architecture ===<br />
<br />
The system was designed for infrastructure reuse. The infrastructure classes take care of organization of the computation nodes and communication between the computation nodes. When a new MetaOptimization is needed, the client programmer should create subclasses of '''MetaOptimizer''', '''JobDefinition''', '''Results''', and '''RegistrationWorker''' specific to the computational problem.<br />
<br />
=== MPI Aware Infrastructure C++ classes ===<br />
<br />
[[File:Mpi metaopt class diagram.png|100px]]<br />
<br />
'''MasterSlave''': Entry point of the application, initializes the MPI infrastructure, starts up a '''Master''' (sending it a particular '''MetaOptimizer''' subclass), starts up as many '''Slave''' nodes as are necessary (sending each of them a particular '''RegistrationWorker''' subclass), and finalizes the MPI infrastructure when the computation is complete.<br />
<br />
'''Master''': The central controlling part of the infrastructure. It manages the available '''Slave'''s, sending them '''JobDefinition'''s as needed, collects '''Results''' from the '''Slave'''s as they come in, will re-run (NOT YET IMPLEMENTED) a '''JobDefinition''' if it has failed or timed out, and shuts down all '''Slave'''s when the computation is finished. The '''Master''' owns a particular '''MetaOptimizer''' subclass. The '''MetaOptimizer''' subclass is aware of the particular computation; the '''Master''' only knows how to communicate with '''Slave'''s and has no responsibility for determining the breakdown of a computation into parts or its completion.<br />
<br />
'''Slave''': One of the processing nodes, part of the infrastructure. Each '''Slave''' manages a '''RegistrationWorker''' (SHOULD THIS BE MORE GENERAL THAN REGISTRATION?) subclass. '''Slave'''s announce their presence to the '''Master''', receive a '''JobDefinition''' describing a particular job, pass the '''JobDescription''' on to the '''RegistrationWorker''' to run the job, receive back a '''Results''' upon completion of the job, and send the '''Results''' back to the '''Master'''. '''Slave'''s are intended to run multiple rounds of registration during a particular computational run.<br />
<br />
=== Infrastructure Event C++ classes ===<br />
<br />
'''MasterEvent''' & '''SlaveEvent''' (named for the origin of the Event) are Infrastructure Event classes, and each hold a Message Wrapper class as a payload.<br />
<br />
The idea behind the job message wrapper classes was to make message wrappers that knew how to bi-directionally convert their specific named parameters to array buffers, thus giving them a pseudo-serialization ability without having them to be aware of MPI specifics. Each message wrapper class is associated with a specific Infrastructure Event class ('''SlaveEvent'''s own a '''Results''', and '''MasterEvent'''s own a '''JobDefinition'''). The Infrastructure Event classes take care of sending from '''Master''' classes to '''Slave''' classes, and from '''Slave''' classes to '''Master''' classes.<br />
<br />
The '''MetaOptimizer''' subclasses will return '''JobDefinition''' subclasses to the '''Master''', which then wraps the '''JobDefinition''' subclass in a '''MasterEvent''', and sends the '''MasterEvent''' to a '''Slave'''. A '''Slave''' reads a '''MasterEvent''', extracts the '''JobDefinition''' subclass, and sends the '''JobDefinition''' subclass to a '''RegistrationWorker''' subclass. Upon completion of the calculation for the '''JobDefinition''', the '''RegistrationWorker''' returns a '''Results''' subclass to the '''Slave''', which then wraps the '''Results''' in a '''SlaveEvent''' and sends the '''SlaveEvent''' to the '''Master'''. The '''Master''' reads the '''SlaveEvent''', extracts the '''Results''' subclass, and sends this '''Results''' subclass to the '''MetaOptimizer'''.<br />
<br />
=== Computation Specific C++ classes ===<br />
<br />
Note: currently these classes are specific to a particular example computation, they need to have the interface abstracted out and the specific example computation extracted and moved to subclasses.<br />
<br />
<br />
<br />
'''JobDefinition''': A Message Wrapper class that defines the parameters of a particular job, a specific subclass of '''JobDefinition''' should be used for a particular computation.<br />
<br />
'''Results''': A Message Wrapper class that defines the return values of a particular job, a specific subclass of '''Results''' should be used for a particular computation.<br />
<br />
'''MetaOptimizer''': The '''MetaOptimizer''' subclass for a particular computation takes care of determining the parameters for a particular job (in the form of a '''JobDefinition'''), and determining how many '''JobDefinition'''s will be in a round of computation. The '''MetaOptimizer''' then takes in the '''Results''' from each '''JobDefinition''', and upon receiving '''Results''' for each '''JobDefinition''' in a round, can either determine a new round of computation by recalculating the new set of parameters for the new round, or else terminate the computation.<br />
<br />
'''RegistrationWorker''': The '''RegistrationWorker''' subclass for a particular computation will take in a particular '''JobDefinition''' subclass germane to the computation, perform the actual registration computation, then return the results in the form of a specific '''Results''' subclass that is relevant to the computation.</div>
<div>Condor can be installed as a ''manager'' node, an ''execute'' node, a ''submit'' node, or any combination of these. See [http://www.cs.wisc.edu/condor/manual/v7.2/3_1Introduction.html#SECTION00411000000000000000 The Different Roles a Machine Can Play]<br />
<br />
* Manager: There can be only one central manager for your pool. This machine is the collector of information and the negotiator between resources and resource requests.<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer to the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
''' The official instructions on how to install Condor in Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here] '''. Below we present some of the tweaks we had to make to get it working on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.); if it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a similar command to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or as a user with equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (For example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation scripts ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update PATH variable to include the directory ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save file and apply the change by running <br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out and log back in, or even restart the machine, and check that the CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit the Condor manager configuration file and update the lines as shown below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, in order to allow MIDAS to run Condor commands, create a link to /root/condor/etc/condor_config in /home/condor<br />
cd /home/condor<br />
ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring an Executor/Submitter in Unix ===<br />
The files that allow the server to also be used as a Condor submitter/executor were automatically updated when the installation script ''condor_install'' ran. Nevertheless, you still need to update the node's configuration file.<br />
<br />
* Edit the node's condor_config.local file and update the lines as shown below:<br />
vi /home/condor/localcondor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the lines containing ''UID_DOMAIN'' and ''FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor in Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor in Windows 7.<br />
<br />
# Download the Windows install MSI, run it, installing to "C:/condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager ( full address ).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For accounting domain enter your domain (e.g. yourdomaininternal.com)<br />
# For Email settings (I ignored this by clicking next)<br />
# for Java settings (I ignored this as we weren't using Java, by clicking next)<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked if you want a custom install or install, choose install. This will install condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When running Condor commands, start a Cygwin or CMD prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix up some problems. My '''condor_status''' at first gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect (since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or perhaps just an IP that you would not want).<br />
<br />
<br />
I shut down condor, right clicked on C:/condor in a Windows explorer, turned off "read only" and set permissions to allow for writing. Then I edited the file "c:/condor/condor_config.local" which started out empty, so that it could pick up some replacement values (some of them didn't seem to be set properly during the install). These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following:<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourinternaldomain.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
and then restarted condor. If you run "condor_status" and see that the Windows machine has Activity=OWNER rather than UNCLAIMED, be sure that you have added START=True. This may not be the best configuration for a Windows workstation that is in use, however; some additional configuration is probably needed to make sure a Condor job doesn't use the machine while a physical human user is there using it.<br />
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here are the contents of the batch file for the actual job (printname.bat)<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, open a command prompt or Cygwin terminal with elevated privileges: right-click its icon and choose "run with elevated privileges" or "run as administrator".<br />
<br />
On Windows, condor_master runs as a service, and it controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows for the first time, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* run the condor manager <br />
condor_master<br />
<br />
* Assuming that during installation you set up the type as ''manager,execute,submit'' (the default), run the following command<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run ''ps -e | egrep condor_'' just after starting condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Set up condor to start automatically<br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update the MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add ''condor.boot'' service to all runlevel<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
#include <unistd.h><br />
#include <stdio.h><br />
int main( int argc, char** argv )<br />
{<br />
printf( "%s\n", argv[1] );<br />
fflush( stdout );<br />
sleep( 30 );<br />
return 0; <br />
}<br />
<br />
This program will print the command-line argument it is given, wait 30 seconds, then exit.<br />
<br />
Save this file as foo.c, then compile it with (note the static linking):<br />
<br />
gcc foo.c -o foo --static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once so you can see how Condor will execute the job on multiple execute resources), you can <br />
change the condorjob file to be like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
<br />
We had a case with 6 slots: 2 were 32-bit with Arch=INTEL, and 4 were 64-bit with Arch=X86_64 (though we were unaware of the difference at first). We ran 6 jobs and wondered why they would only execute on the submitting machine. So we changed the condorjob file to specify a certain architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told Condor to execute only on machines with an Arch of INTEL, so the job was not attempted on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. I've included this story in case it helps with debugging.<br />
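If the goal is instead to let one job match several architectures, the Requirements expression can list the alternatives. This fragment is a sketch under the assumption that the submitted executable (or a wrapper script) really runs on every slot it can match:<br />

```
# Match both 32-bit and 64-bit slots -- only safe if the submitted
# executable actually runs on both architectures.
Requirements = (Arch == "INTEL") || (Arch == "X86_64")
```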
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was Condor.<br />
*When building BatchMake, you need to build with grid support enabled<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows the jobs you have submitted to Condor that are currently active. It will give you a cluster ID and process ID for each job.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process will start and stop all other condor daemons. condor_master will write a MasterLog and have a .master_address file. Be sure that the .master_address contains the correct IP address. If the correct IP address doesn't appear, set the correct value in the parameter <br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has, but Condor may have picked a different IP than you wanted, perhaps the local loopback address 127.0.0.1, or a wireless adapter instead of the ethernet one.<br />
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes, and represents data about submitted jobs. It tracks the queueing of jobs and tries to get resources for all of its jobs to be run. When a submitted job runs, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine when an actual execution of the job is run. It takes care of system calls that need to be executed on the submitting machine for a process. There will be one condor_shadow process for each executing process from a submission machine, meaning that on a machine with a large number of submitted processes, the memory and other resources available for shadow daemons could become a limitation.<br />
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution. Log files of interest include NegotiatorLog and MatchLog.<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon is used to detect user activity on an execute node, so Condor knows whether to allow execution of a job or to disallow it because a human user is currently engaged in some task.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, corresponding to [http://en.wikipedia.org/wiki/IA-64 IA64], the 64-bit Intel Itanium processor. This does not cover all 64-bit Intel processors.<br />
<br />
While trying to run ''condor_master'', the shell returned the error message ''cannot execute binary file''.<br />
<br />
Using the [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf] program, it's possible to inspect the header of an executable and determine whether it could run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
  Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the different outputs, it's possible to observe that '''Intel IA-64''' is not the right architecture.<br />
<br />
<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the location of the MPICH2\bin directory to your system path, and add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
The following was performed in a Windows CMD prompt run with Administrative Privileges (right click on the CMD executable and run it as an Administrator).<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's Windows account password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication password for smpd; this is the '''mypassphrase''' you entered above. If everything is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processors since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines, how many processes each machine should run, and the executable '''hostname''', which should print the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, right-click on the project file (not the solution file), then click Properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then to run your exe on two different machines (I had to build a RELEASE build for this; at first I had built a DEBUG build, but this caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
<br />
== MPICH2 and Condor on Windows ==<br />
<br />
To configure MPI over Condor, I set my Windows 7 laptop up as a dedicated Condor scheduler and executor, and followed the instructions at this link:<br />
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
I turned off smpd.exe as a service, since Condor will manage it:<br />
<br />
smpd.exe -remove<br />
<br />
<br />
I then created the following MPI example program, and ran it with a MachineCount of 2 and 4 at different times (I ran this on my laptop, which has 4 cores and had Condor running locally). This program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.<br />
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char *argv[])<br />
{<br />
MPI::Init(argc, argv);<br />
//<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
int size = MPI::COMM_WORLD.Get_size();<br />
char processor_name[MPI_MAX_PROCESSOR_NAME];<br />
int namelen;<br />
MPI::Get_processor_name(processor_name,namelen);<br />
// <br />
int buf[1];<br />
int numElements = 1;<br />
// initialize buffer with rank<br />
buf[0] = rank;<br />
// example of synchronous communication<br />
// have even ranks send to odd ranks first<br />
// assumes an even number of processes<br />
int syncTag = 123;<br />
if (rank % 2 == 0)<br />
{<br />
// send to next higher rank<br />
int dest = rank + 1;<br />
std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Send(buf,numElements,MPI::INT,dest,syncTag);<br />
// now wait for dest to send back<br />
// even though dest is the param, in the Recv call it is used as the source<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,dest,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
}<br />
else<br />
{<br />
// source is next lower rank<br />
int source = rank - 1;<br />
std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,source,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
// send rank rather than buffer, just to be sure<br />
// even though param is source, in Send call it is for dest<br />
MPI::COMM_WORLD.Send(&rank,numElements,MPI::INT,source,syncTag);<br />
}<br />
// now for asynchronous communication<br />
std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
int asyncTag = 321;<br />
int leftRank;<br />
if (rank == 0)<br />
{<br />
leftRank = size-1;<br />
}<br />
else <br />
{<br />
leftRank = (rank-1);<br />
}<br />
int rightRank;<br />
if (rank == size-1)<br />
{<br />
rightRank = 0;<br />
}<br />
else <br />
{<br />
rightRank = (rank+1);<br />
}<br />
// everyone sends to the leftRank, and receives from the rightRank<br />
// if this were synchronous, would be in danger of deadlocking because of circular waits<br />
// reset buffers to rank<br />
buf[0] = rank;<br />
std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::Request sendReq = MPI::COMM_WORLD.Isend(buf,numElements,MPI::INT,leftRank,asyncTag);<br />
MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf,numElements,MPI::INT,rightRank,asyncTag);<br />
// wait on the receipt<br />
MPI::Status status;<br />
recvReq.Wait(status);<br />
std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
//<br />
MPI::Finalize();<br />
return(0);<br />
}<br />
<br />
<br />
I then created a two-machine Condor parallel-universe grid, using the same user/password combination on both machines, and this user had Administrative rights. My 4-core Windows 7 Professional laptop (X86_64,WINNT61) was the '''CONDOR_HOST''' and the '''DedicatedScheduler'''. Assuming this laptop is called '''clavicle''' in the '''shoulder.com''' domain, here are the important configuration parameters in its '''condor_config.local''' file: <br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files (x86)\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD, COLLECTOR, NEGOTIATOR, SMPD_SERVER<br />
<br />
<br />
The second machine, a 2 core Windows XP desktop called '''scapula''' (INTEL,WINNT51) was a dedicated machine with the following important parameters in its '''condor_config.local''' file:<br />
<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, STARTD, SMPD_SERVER<br />
<br />
Note how it points to '''clavicle''' both as '''CONDOR_HOST''' and as the '''DedicatedScheduler''', and that it does not have '''(x86)''' in its path to '''smpd.exe'''.<br />
<br />
MPICH2 gets installed to different locations, to '''C:\Program Files\MPICH2''' for an (INTEL,WINNT51) machine, and to '''C:\Program Files (x86)\MPICH2''' for a (X86_64,WINNT61) machine. Because I needed to refer to the '''mpiexec.exe''' path in the batch file that my Condor submit file referred to, and this batch file would be the same for both machines, I opted to copy '''mpiexec.exe''' to the work directory I was using on all machines, '''C:\mpich2work'''.<br />
<br />
After setting up this configuration, I rebooted both machines so they would start with it. At this point, the command<br />
net stop condor<br />
no longer returns correctly, since Condor is managing the '''smpd.exe''' service now and does not know how to stop that service.<br />
<br />
<br />
<br />
Based on the Condor submit file described in the Condor wiki page, I adapted my own submit file for the MPI executable above (called '''mpiwork.exe''') as follows:<br />
<br />
<br />
universe = parallel<br />
executable = mp2script.bat<br />
arguments = mpiwork.exe<br />
machine_count = 6<br />
output = out.$(NODE).log<br />
error = error.$(NODE).log<br />
log = work.$(NODE).log<br />
should_transfer_files = yes<br />
when_to_transfer_output = on_exit<br />
transfer_input_files = mpiwork.exe<br />
Requirements = ((Arch == "INTEL") && (OpSys == "WINNT51")) || ((Arch == "X86_64") && (OpSys == "WINNT61"))<br />
queue<br />
<br />
and my '''mp2script.bat''' file was adapted to be: <br />
<br />
set _CONDOR_PROCNO=%_CONDOR_PROCNO%<br />
set _CONDOR_NPROCS=%_CONDOR_NPROCS%<br />
REM If not the head node, just sleep forever<br />
if not [%_CONDOR_PROCNO%] == [0] copy con nothing<br />
REM Set this to the bin directory of MPICH installation<br />
set MPDIR="C:\mpich2work"<br />
REM run the actual mpijob<br />
%MPDIR%\mpiexec.exe -n %_CONDOR_NPROCS% -p 6666 %*<br />
exit 0<br />
<br />
with the change being that '''MPDIR''' is a location that is known to have '''mpiexec.exe''' on both machines, since I copied that file there on both machines as described above.<br />
<br />
A point of confusion I have is that I would have expected the '''mpiwork.exe''' code above to print the names of both '''clavicle''' and '''scapula''' when I ran it on 6 nodes; instead only the name '''clavicle''' appeared. I do believe the job was running on both machines and all 6 cores, as both machines' IP addresses were noted in the work.parallel.log, and both machines had CPU activity while the executable ran.<br />
<br />
<br />
== MPI Based Meta-Optimization Framework ==<br />
<br />
=== High Level Architecture ===<br />
<br />
The system was designed for infrastructure reuse. The infrastructure classes take care of organization of the computation nodes and communication between the computation nodes. When a new MetaOptimization is needed, the client programmer should create subclasses of '''MetaOptimizer''', '''JobDefinition''', '''Results''', and '''RegistrationWorker''' specific to the computational problem.<br />
<br />
=== MPI Aware Infrastructure C++ classes ===<br />
<br />
'''MasterSlave''': Entry point of the application, initializes the MPI infrastructure, starts up a '''Master''' (sending it a particular '''MetaOptimizer''' subclass), starts up as many '''Slave''' nodes as are necessary (sending each of them a particular '''RegistrationWorker''' subclass), and finalizes the MPI infrastructure when the computation is complete.<br />
<br />
'''Master''': The central controlling part of the infrastructure. It manages the available '''Slave'''s, sending them '''JobDefinition'''s as needed, collects '''Results''' from the '''Slave'''s as they come in, will re-run (NOT YET IMPLEMENTED) a '''JobDefinition''' if it has failed or timed out, and shuts down all '''Slave'''s when the computation is finished. The '''Master''' owns a particular '''MetaOptimizer''' subclass. The '''MetaOptimizer''' subclass is aware of the particular computation; the '''Master''' only knows how to communicate with '''Slave'''s and has no responsibility for determining the breakdown of a computation into parts or its completion.<br />
<br />
'''Slave''': One of the processing nodes, part of the infrastructure. Each '''Slave''' manages a '''RegistrationWorker''' (SHOULD THIS BE MORE GENERAL THAN REGISTRATION?) subclass. '''Slave'''s announce their presence to the '''Master''', receive a '''JobDefinition''' describing a particular job, pass the '''JobDescription''' on to the '''RegistrationWorker''' to run the job, receive back a '''Results''' upon completion of the job, and send the '''Results''' back to the '''Master'''. '''Slave'''s are intended to run multiple rounds of registration during a particular computational run.<br />
<br />
=== Infrastructure Event C++ classes ===<br />
<br />
'''MasterEvent''' & '''SlaveEvent''' (named for the origin of the Event) are Infrastructure Event classes, and each hold a Message Wrapper class as a payload.<br />
<br />
The idea behind the job message wrapper classes was to make message wrappers that knew how to bi-directionally convert their specific named parameters to array buffers, thus giving them a pseudo-serialization ability without having them to be aware of MPI specifics. Each message wrapper class is associated with a specific Infrastructure Event class ('''SlaveEvent'''s own a '''Results''', and '''MasterEvent'''s own a '''JobDefinition'''). The Infrastructure Event classes take care of sending from '''Master''' classes to '''Slave''' classes, and from '''Slave''' classes to '''Master''' classes.<br />
<br />
The '''MetaOptimizer''' subclasses will return '''JobDefinition''' subclasses to the '''Master''', which then wraps the '''JobDefinition''' subclass in a '''MasterEvent''', and sends the '''MasterEvent''' to a '''Slave'''. A '''Slave''' reads a '''MasterEvent''', extracts the '''JobDefinition''' subclass, and sends the '''JobDefinition''' subclass to a '''RegistrationWorker''' subclass. Upon completion of the calculation for the '''JobDefinition''', the '''RegistrationWorker''' returns a '''Results''' subclass to the '''Slave''', which then wraps the '''Results''' in a '''SlaveEvent''' and sends the '''SlaveEvent''' to the '''Master'''. The '''Master''' reads the '''SlaveEvent''', extracts the '''Results''' subclass, and sends this '''Results''' subclass to the '''MetaOptimizer'''.<br />
<br />
=== Computation Specific C++ classes ===<br />
<br />
Note: currently these classes are specific to a particular example computation, they need to have the interface abstracted out and the specific example computation extracted and moved to subclasses.<br />
<br />
<br />
<br />
'''JobDefinition''': A Message Wrapper class that defines the parameters of a particular job, a specific subclass of '''JobDefinition''' should be used for a particular computation.<br />
<br />
'''Results''': A Message Wrapper class that defines the return values of a particular job, a specific subclass of '''Results''' should be used for a particular computation.<br />
<br />
'''MetaOptimizer''': The '''MetaOptimizer''' subclass for a particular computation takes care of determining the parameters for a particular job (in the form of a '''JobDefinition'''), and determining how many '''JobDefinition'''s will be in a round of computation. The '''MetaOptimizer''' then takes in the '''Results''' from each '''JobDefinition''', and upon receiving '''Results''' for each '''JobDefinition''' in a round, can either determine a new round of computation by recalculating the new set of parameters for the new round, or else terminate the computation.<br />
<br />
'''RegistrationWorker''': The '''RegistrationWorker''' subclass for a particular computation will take in a particular '''JobDefinition''' subclass germane to the computation, perform the actual registration computation, then return the results in the form of a specific '''Results''' subclass that is relevant to the computation.</div>Michael.grauerhttps://public.kitware.com/Wiki/index.php?title=Proposals:Condor&diff=41113Proposals:Condor2011-06-23T18:23:06Z<p>Michael.grauer: /* Computation Specific C++ classes */</p>
<hr />
<div>= Introduction =<br />
<br />
Condor is an open source distributed computing software framework. It can be used to manage workload on a dedicated cluster of computers, and/or to farm out work to idle desktop computers. Condor is a cross-platform system that runs on Unix and Windows operating systems. It is a complex and flexible system that can execute jobs in serial and parallel modes. For parallel jobs, it supports the MPI standard. This wiki page documents our working experience using Condor.<br />
<br />
= Downloading Condor =<br />
<br />
Different versions of Condor can be downloaded from [http://www.cs.wisc.edu/condor/downloads-v2/download.pl here]. This documentation focuses on our experience installing/configuring Condor version 7.2.0. The official detailed documentation for this version can be found [http://www.cs.wisc.edu/condor/manual/v7.2/ here].<br />
<br />
= Preparation =<br />
<br />
As Condor is a flexible system, there are different ways of configuring Condor in your computing infrastructure. Hence, before starting installation, make the following important decisions.<br />
<br />
#What machine will be the central manager?<br />
#What machines should be allowed to submit jobs?<br />
#Will Condor run as root or not?<br />
#Do I have enough disk space for Condor?<br />
#Do I need MPI configured?<br />
<br />
Condor can be installed as a ''manager'' node, an ''execute'' node, a ''submit'' node, or any combination of these. See [http://www.cs.wisc.edu/condor/manual/v7.2/3_1Introduction.html#SECTION00411000000000000000 The Different Roles a Machine Can Play]<br />
<br />
* Manager: There can be only one central manager for your pool. This machine is the collector of information and the negotiator between resources and resource requests.<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer to the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation].<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
''' The official instructions on how to install Condor on Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here] '''. Below we present some of the tweaks we had to do to get it to work on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.) , if it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
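To verify the change took effect, the short and fully qualified names can be compared (this assumes the GNU ''hostname'' utility, which supports the <code>-f</code> flag):<br />

```shell
# Print the short host name, then the fully qualified name.
# If both print the same short name, the FQDN is still not configured.
hostname
hostname -f
```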
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a similar command to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or as a user with equivalent privileges.<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (For example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation scripts ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update PATH variable to include the directory ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save the file and apply the change by running<br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out / log in, or even restart the machine, and you should be able to check that the CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit the Condor manager configuration file and update the lines referenced below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, then to allow MIDAS to run Condor commands, create a link to /root/condor/etc/condor_config in /home/condor<br />
cd /home/condor<br />
 ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring an Executer/Submitter in Unix ===<br />
The files that allow the server to also be used as a Condor submitter/executer were updated automatically when the installation script ''condor_install'' ran. Nevertheless, you still need to update the node's local configuration file.<br />
<br />
* Edit the Condor node's condor_config.local and update the line referenced below:<br />
vi /home/condor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the lines containing ''UID_DOMAIN'' and ''FILESYSTEM_DOMAIN'' should already be set to ''website.com''.<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor in Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor in Windows 7.<br />
<br />
# Download the Windows install MSI, run it, installing to "C:/condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager ( full address ).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For accounting domain enter your domain (e.g. yourdomaininternal.com)<br />
# Email settings: I ignored this by clicking Next.<br />
# Java settings: I ignored this (we weren't using Java) by clicking Next.<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked whether you want a custom install or the default install, choose the default install. This will install Condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When running Condor commands, start a Cygwin or CMD prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix up some problems. My '''condor_status''' at first gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect (since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or perhaps just an IP that you would not want).<br />
<br />
<br />
I shut down condor, right clicked on C:/condor in a Windows explorer, turned off "read only" and set permissions to allow for writing. Then I edited the file "c:/condor/condor_config.local" which started out empty, so that it could pick up some replacement values (some of them didn't seem to be set properly during the install). These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following:<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourinternaldomain.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
and then restarted Condor. If you run "condor_status" and see that the Windows machine has Activity=OWNER rather than UNCLAIMED, be sure that you have added in START=True. But this may not be the best configuration for a Windows workstation that is in use; some additional configuration is needed to make sure a Condor job doesn't use the machine when a physical human user is there using it.<br />
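One common starting point for such a policy (a sketch only — ''KeyboardIdle'' and ''LoadAvg'' are standard Condor machine ClassAd attributes, but the thresholds here are arbitrary and should be tuned for your pool) is to gate the START expression on user activity:<br />

```
# Only start jobs when no one has touched the keyboard for 15 minutes
# and the machine is otherwise lightly loaded.
START = (KeyboardIdle > 15 * 60) && (LoadAvg < 0.3)
```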
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here are the contents of the batch file for the actual job (printname.bat)<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, get a command prompt or Cygwin terminal, right click the icon to start it up, and click "run with elevated privileges" or "run as administrator".<br />
<br />
condor_master runs as a service on Windows and controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows, you'll first need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* run the condor manager <br />
condor_master<br />
<br />
* Assuming that during the installation process you set the type to ''manager,execute,submit'' (the default), run the following command<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after you started condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Set up Condor to start automatically<br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add the ''condor.boot'' service to all runlevels<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
#include <unistd.h><br />
#include <stdio.h><br />
int main( int argc, char** argv )<br />
{<br />
printf( "%s\n", argv[1] );<br />
fflush( stdout );<br />
sleep( 30 );<br />
return 0; <br />
}<br />
<br />
This executable echoes the command-line argument it is given, waits 30 seconds, then exits.<br />
<br />
Save this file as foo.c, then compile it with (note static linking).<br />
<br />
gcc foo.c -o foo --static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once so you can see how Condor will execute the job on multiple execute resources), you can <br />
change the condorjob file to be like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
<br />
We had a case with 6 slots: 2 were 32-bit with Arch=INTEL and 4 were 64-bit with Arch=X86_64 (but we were unaware of the difference at first). We ran 6 jobs and then wondered why they would only execute on the submitting machine. So we changed the condorjob file to specify a certain architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told Condor to execute only on machines with an Intel architecture, so it did not attempt to execute the job on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. I've included this story in case it helps with debugging.<br />
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines on the Condor grid, because we do not have shared Unix users on our network. In our case this user was ''condor''.<br />
*When building BatchMake, you need to build with grid support turned on.<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows which of the jobs you have submitted to Condor are active. It will give you a cluster and process ID for each process.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process will start and stop all other condor daemons. condor_master will write a MasterLog and have a .master_address file. Be sure that the .master_address contains the correct IP address. If the correct IP address doesn't appear, set the correct value in the parameter <br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has, but Condor may have picked a different IP than you wanted, perhaps the local loopback address 127.0.0.1 or a wireless adapter instead of the ethernet one.<br />
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes and represents data about submitted jobs. It tracks the queueing of jobs and tries to get resources for all of its jobs to be run. When a submitted job is run, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine when an actual execution of the job is run. It takes care of system calls that need to be executed on the submitting machine for a process. There will be a condor_shadow process for each executing process of a submission machine, meaning that on a machine with a large number of submitted processes, the number of shadow daemons that memory or other resources can support could be a limitation.<br />
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution. Log files of interest include NegotiatorLog and MatchLog<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon is used to detect user activity on an execute node, so Condor knows whether to allow a job to execute or to disallow it because a human user is currently engaged in some task.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, corresponding to [http://en.wikipedia.org/wiki/IA-64 IA64], the 64-bit Intel Itanium processor. This does not cover all 64-bit Intel processors.<br />
<br />
While trying to run the ''condor_master'', the shell returned the following error message ''cannot execute binary file''<br />
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], it's possible to extract the header of an executable and understand if a given executable could run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the different outputs, we can observe that the architecture '''Intel IA-64''' isn't the right one.<br />
<br />
<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the location of the MPICH2\bin directory to your system PATH. Add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows Firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
The following was performed in a Windows CMD prompt run with Administrative Privileges (right click on the CMD executable and run it as an Administrator).<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by running<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's Windows account password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication password for smpd; this is the '''mypassphrase''' you entered above. If all is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processors since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines, and how many processes each machine should run, and the executable '''hostname''', which should return the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, you must right click on the project file (not the solution file), then click properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then to run your exe on two different machines (I had to build a RELEASE for this; at first I had built a DEBUG build, but this caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
<br />
== MPICH2 and Condor on Windows ==<br />
<br />
To configure MPI over Condor, I set my Windows 7 laptop up as a dedicated Condor scheduler and executer and followed the instructions at this link:<br />
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
I turned off smpd.exe as a service, since Condor will manage it:<br />
<br />
smpd.exe -remove<br />
<br />
<br />
I then created the following MPI example program, and ran it with a MachineCount of 2 and of 4 at different times (I ran this on my laptop, which has 4 cores, and had Condor running locally). This program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.<br />
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char *argv[])<br />
{<br />
MPI::Init(argc, argv);<br />
//<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
int size = MPI::COMM_WORLD.Get_size();<br />
char processor_name[MPI_MAX_PROCESSOR_NAME];<br />
int namelen;<br />
MPI::Get_processor_name(processor_name,namelen);<br />
// <br />
int buf[1];<br />
int numElements = 1;<br />
// initialize buffer with rank<br />
buf[0] = rank;<br />
// example of synchronous communication<br />
// have even ranks send to odd ranks first<br />
// assumes an even number of processes<br />
int syncTag = 123;<br />
if (rank % 2 == 0)<br />
{<br />
// send to next higher rank<br />
int dest = rank + 1;<br />
std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Send(buf,numElements,MPI::INT,dest,syncTag);<br />
// now wait for dest to send back<br />
// even though dest is the param, in the Recv call it is used as the source<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,dest,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
}<br />
else<br />
{<br />
// source is next lower rank<br />
int source = rank - 1;<br />
std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,source,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
// send rank rather than buffer, just to be sure<br />
// even though param is source, in Send call it is for dest<br />
MPI::COMM_WORLD.Send(&rank,numElements,MPI::INT,source,syncTag);<br />
}<br />
// now for asynchronous communication<br />
std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
int asyncTag = 321;<br />
int leftRank;<br />
if (rank == 0)<br />
{<br />
leftRank = size-1;<br />
}<br />
else <br />
{<br />
leftRank = (rank-1);<br />
}<br />
int rightRank;<br />
if (rank == size-1)<br />
{<br />
rightRank = 0;<br />
}<br />
else <br />
{<br />
rightRank = (rank+1);<br />
}<br />
// everyone sends to the leftRank, and receives from the rightRank<br />
// if this were synchronous, would be in danger of deadlocking because of circular waits<br />
// reset buffers to rank<br />
buf[0] = rank;<br />
std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::Request sendReq = MPI::COMM_WORLD.Isend(buf,numElements,MPI::INT,leftRank,asyncTag);<br />
MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf,numElements,MPI::INT,rightRank,asyncTag);<br />
// wait on the receipt<br />
MPI::Status status;<br />
recvReq.Wait(status);<br />
std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
// make sure the asynchronous send has also completed before finalizing<br />
sendReq.Wait();<br />
MPI::Finalize();<br />
return(0);<br />
}<br />
<br />
<br />
I then created a two-machine Condor parallel universe grid, using the same user/password combination on both machines; this user had Administrative rights. My 4-core Windows 7 Professional laptop (X86_64,WINNT61) was the '''CONDOR_HOST''' and the '''DedicatedScheduler'''. Assuming this laptop is called '''clavicle''' in the '''shoulder.com''' domain, here are the important configuration parameters in its '''condor_config.local''' file: <br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files (x86)\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD, COLLECTOR, NEGOTIATOR, SMPD_SERVER<br />
<br />
<br />
The second machine, a 2-core Windows XP desktop called '''scapula''' (INTEL,WINNT51), was a dedicated machine with the following important parameters in its '''condor_config.local''' file:<br />
<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, STARTD, SMPD_SERVER<br />
<br />
Note how it points to '''clavicle''' both as '''CONDOR_HOST''' and as the '''DedicatedScheduler''', and that it does not have '''(x86)''' in its path to '''smpd.exe'''.<br />
<br />
MPICH2 gets installed to different locations, to '''C:\Program Files\MPICH2''' for an (INTEL,WINNT51) machine, and to '''C:\Program Files (x86)\MPICH2''' for a (X86_64,WINNT61) machine. Because I needed to refer to the '''mpiexec.exe''' path in the batch file that my Condor submit file referred to, and this batch file would be the same for both machines, I opted to copy '''mpiexec.exe''' to the work directory I was using on all machines, '''C:\mpich2work'''.<br />
<br />
After setting up this configuration, I rebooted both machines to start in this configuration. At this point, the command<br />
net stop condor<br />
no longer returns correctly, since Condor is managing the '''smpd.exe''' service now and does not know how to stop that service.<br />
<br />
<br />
<br />
Based on the Condor submit file described in the Condor wiki page, I had to adapt my own submit file for my above MPI executable called '''mpiwork.exe''' to be the following:<br />
<br />
<br />
universe = parallel<br />
executable = mp2script.bat<br />
arguments = mpiwork.exe<br />
machine_count = 6<br />
output = out.$(NODE).log<br />
error = error.$(NODE).log<br />
log = work.$(NODE).log<br />
should_transfer_files = yes<br />
when_to_transfer_output = on_exit<br />
transfer_input_files = mpiwork.exe<br />
Requirements = ((Arch == "INTEL") && (OpSys == "WINNT51")) || ((Arch == "X86_64") && (OpSys == "WINNT61"))<br />
queue<br />
<br />
and my '''mp2script.bat''' file was adapted to be: <br />
<br />
set _CONDOR_PROCNO=%_CONDOR_PROCNO%<br />
set _CONDOR_NPROCS=%_CONDOR_NPROCS%<br />
REM If not the head node, just sleep forever<br />
if not [%_CONDOR_PROCNO%] == [0] copy con nothing<br />
REM Set this to the bin directory of MPICH installation<br />
set MPDIR="C:\mpich2work"<br />
REM run the actual mpijob<br />
%MPDIR%\mpiexec.exe -n %_CONDOR_NPROCS% -p 6666 %*<br />
exit 0<br />
<br />
with the change being that '''MPDIR''' is a location that is known to have '''mpiexec.exe''' on both machines, since I copied that file there on both machines as described above.<br />
<br />
A point of confusion I have is that I would have expected the '''mpiwork.exe''' code above to print the names of both '''clavicle''' and '''scapula''' in the output when I ran it on 6 nodes. Instead only the name '''clavicle''' appeared. I do believe that this was running on both machines and all 6 cores, as both machines' IP addresses were noted in the work.parallel.log, and both machines had CPU activity at the time the executable ran.<br />
<br />
<br />
== MPI Based Meta-Optimization Framework ==<br />
<br />
=== High Level Architecture ===<br />
<br />
The system was designed for infrastructure reuse. The infrastructure classes take care of organization of the computation nodes and communication between the computation nodes. When a new MetaOptimization is needed, the client programmer should create subclasses of '''MetaOptimizer''', '''JobDefinition''', '''Results''', and '''RegistrationWorker''' specific to the computational problem.<br />
<br />
=== MPI Aware Infrastructure C++ classes ===<br />
<br />
'''MasterSlave''': Entry point of the application, initializes the MPI infrastructure, starts up a '''Master''' (sending it a particular '''MetaOptimizer''' subclass), starts up as many '''Slave''' nodes as are necessary (sending each of them a particular '''RegistrationWorker''' subclass), and finalizes the MPI infrastructure when the computation is complete.<br />
<br />
'''Master''': The central controlling part of the infrastructure. It manages the available '''Slave'''s, sending them '''JobDefinition'''s as needed, collects '''Results''' from the '''Slave'''s as they come in, will re-run (NOT YET IMPLEMENTED) a '''JobDefinition''' if it has failed or timed out, and shuts down all '''Slave'''s when the computation is finished. The '''Master''' owns a particular '''MetaOptimizer''' subclass. The '''MetaOptimizer''' subclass is aware of the particular computation; the '''Master''' only knows how to communicate with '''Slave'''s and has no responsibility for determining the breakdown of a computation into parts or its completion.<br />
<br />
'''Slave''': One of the processing nodes, part of the infrastructure. Each '''Slave''' manages a '''RegistrationWorker''' (SHOULD THIS BE MORE GENERAL THAN REGISTRATION?) subclass. '''Slave'''s announce their presence to the '''Master''', receive a '''JobDefinition''' describing a particular job, pass the '''JobDefinition''' on to the '''RegistrationWorker''' to run the job, receive back a '''Results''' upon completion of the job, and send the '''Results''' back to the '''Master'''. '''Slave'''s are intended to run multiple rounds of registration during a particular computational run.<br />
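The interplay between these infrastructure classes can be sketched, with all MPI stripped out, as a plain dispatch loop. This is only an illustration of the flow described above; the class and function names are assumptions, not the framework's actual interface:<br />

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Illustrative stand-ins (assumed names, no MPI) for the messages exchanged
// between the Master and the Slaves.
struct Job { int id; };
struct Result { int jobId; double value; };

// Stand-in for a Slave/RegistrationWorker pair: runs one job synchronously.
Result RunOnSlave(const Job& job)
{
  return Result{ job.id, job.id * 2.0 };  // placeholder computation
}

// Stand-in for the Master: drains the job queue and gathers one Result per Job.
std::vector<Result> RunRound(std::queue<Job> jobs)
{
  std::vector<Result> results;
  while (!jobs.empty())
  {
    Job job = jobs.front();
    jobs.pop();
    // In the real system this is an MPI send of a job to a Slave and a
    // receive of its result, not a direct function call.
    results.push_back(RunOnSlave(job));
  }
  return results;
}
```

In the real system the call into RunOnSlave would be a MasterEvent sent over MPI and a SlaveEvent received back, with the Master dispatching to whichever Slaves are idle rather than looping serially.<br />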
<br />
=== Infrastructure Event C++ classes ===<br />
<br />
'''MasterEvent''' & '''SlaveEvent''' (named for the origin of the Event) are Infrastructure Event classes, and each hold a Message Wrapper class as a payload.<br />
<br />
The idea behind the job message wrapper classes was to make message wrappers that know how to bi-directionally convert their specific named parameters to array buffers, thus giving them a pseudo-serialization ability without them having to be aware of MPI specifics. Each message wrapper class is associated with a specific Infrastructure Event class ('''SlaveEvent'''s own a '''Results''', and '''MasterEvent'''s own a '''JobDefinition'''). The Infrastructure Event classes take care of sending from '''Master''' classes to '''Slave''' classes, and from '''Slave''' classes to '''Master''' classes.<br />
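The pseudo-serialization idea can be sketched as follows; the field names and buffer layout here are assumptions for illustration, not the actual wrapper classes:<br />

```cpp
#include <cassert>
#include <vector>

// Hypothetical message wrapper: converts its named parameters to a flat
// buffer and back, so the Event classes can ship it over MPI without the
// wrapper knowing anything about MPI itself.
struct JobDefinition
{
  int jobId;
  double stepSize;
  double tolerance;

  // Flatten the named parameters into a buffer the Event classes can send.
  std::vector<double> ToBuffer() const
  {
    return { static_cast<double>(jobId), stepSize, tolerance };
  }

  // Rebuild the named parameters from a received buffer.
  static JobDefinition FromBuffer(const std::vector<double>& buf)
  {
    JobDefinition job;
    job.jobId = static_cast<int>(buf[0]);
    job.stepSize = buf[1];
    job.tolerance = buf[2];
    return job;
  }
};
```

An Event class can then send the flat buffer as a single typed MPI message, and the receiving side rebuilds the wrapper with FromBuffer.<br />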
<br />
The '''MetaOptimizer''' subclasses will return '''JobDefinition''' subclasses to the '''Master''', which then wraps the '''JobDefinition''' subclass in a '''MasterEvent''', and sends the '''MasterEvent''' to a '''Slave'''. A '''Slave''' reads a '''MasterEvent''', extracts the '''JobDefinition''' subclass, and sends the '''JobDefinition''' subclass to a '''RegistrationWorker''' subclass. Upon completion of the calculation for the '''JobDefinition''', the '''RegistrationWorker''' returns a '''Results''' subclass to the '''Slave''', which then wraps the '''Results''' in a '''SlaveEvent''' and sends the '''SlaveEvent''' to the '''Master'''. The '''Master''' reads the '''SlaveEvent''', extracts the '''Results''' subclass, and sends this '''Results''' subclass to the '''MetaOptimizer'''.<br />
<br />
=== Computation Specific C++ classes ===<br />
<br />
Note: these classes are currently specific to a particular example computation; the interface needs to be abstracted out, and the specific example computation extracted and moved to subclasses.<br />
<br />
<br />
<br />
'''JobDefinition''': A Message Wrapper class that defines the parameters of a particular job, a specific subclass of '''JobDefinition''' should be used for a particular computation.<br />
<br />
'''Results''': A Message Wrapper class that defines the return values of a particular job, a specific subclass of Results should be used for a particular computation.<br />
<br />
'''MetaOptimizer''': The '''MetaOptimizer''' subclass for a particular computation takes care of determining the parameters for a particular job (in the form of a '''JobDefinition'''), and determining how many '''JobDefinition'''s will be in a round of computation. The '''MetaOptimizer''' then takes in the '''Results''' from each '''JobDefinition''', and upon receiving '''Results''' for each '''JobDefinition''' in a round, can either determine a new round of computation by recalculating the new set of parameters for the new round, or else terminate the computation.<br />
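A minimal sketch of the round bookkeeping described above (names are illustrative assumptions; the real '''MetaOptimizer''' interface may differ):<br />

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical round tracker: a MetaOptimizer hands out one JobDefinition per
// job in the round, counts the Results as they return, and only when every
// job has a matching Result does it decide to start a new round or terminate.
class RoundTracker
{
public:
  explicit RoundTracker(std::size_t jobsInRound)
    : jobsInRound(jobsInRound), resultsReceived(0) {}

  // Called each time a Results comes back from a Slave.
  void RecordResult() { ++resultsReceived; }

  // True once every JobDefinition in the round has produced a Results.
  bool RoundComplete() const { return resultsReceived >= jobsInRound; }

  // Reset the counter when the optimizer computes a new set of parameters.
  void StartNewRound(std::size_t newJobsInRound)
  {
    jobsInRound = newJobsInRound;
    resultsReceived = 0;
  }

private:
  std::size_t jobsInRound;
  std::size_t resultsReceived;
};
```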
<br />
'''RegistrationWorker''': The '''RegistrationWorker''' subclass for a particular computation will take in a particular '''JobDefinition''' subclass germane to the computation, perform the actual registration computation, then return the results in the form of a specific '''Results''' subclass that is relevant to the computation.</div>Michael.grauerhttps://public.kitware.com/Wiki/index.php?title=Proposals:Condor&diff=41112Proposals:Condor2011-06-23T18:21:48Z<p>Michael.grauer: /* Infrastructure Event C++ classes */</p>
<hr />
<div><br />
Condor can be installed as a ''manager'' node, an ''execute'' node, a ''submit'' node, or any combination of these. See [http://www.cs.wisc.edu/condor/manual/v7.2/3_1Introduction.html#SECTION00411000000000000000 The Different Roles a Machine Can Play]<br />
<br />
* Manager: There can be only one central manager for your pool. This machine is the collector of information, and the negotiator between resources and resource requests.<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer to the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
''' The official instructions on how to install Condor in Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here] '''. Below we present some of the tweaks we had to make to get it to work on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.) , if it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a similar command to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or with a user having equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (For example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation scripts ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update PATH variable to include the directory ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save file and apply the change by running <br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out and log back in, or even restart the machine, and you should be able to check that the CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit the Condor manager's config file and update the lines as shown below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, in order to allow MIDAS to run Condor commands, create a link to /root/condor/etc/condor_config in /home/condor<br />
cd /home/condor<br />
ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring an Executer/Submitter in Unix ===<br />
The various files allowing the server to also be used as a Condor submitter/executer were automatically updated when the installation script ''condor_install'' was run. Nevertheless, you still need to update its configuration file.<br />
<br />
* Edit the Condor node's condor_config.local and update the line as shown below:<br />
vi /home/condor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the lines containing ''UID_DOMAIN'' and ''FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor in Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor in Windows 7.<br />
<br />
# Download the Windows install MSI, run it, installing to "C:/condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager ( full address ).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For accounting domain enter your domain (e.g. yourdomaininternal.com)<br />
# For Email settings (I ignored this by clicking next)<br />
# for Java settings (I ignored this as we weren't using Java, by clicking next)<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked whether you want a custom install or a regular install, choose install. This will install Condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When running Condor commands, start a Cygwin or cmd prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix up some problems. My '''condor_status''' at first gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect (since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or perhaps just an IP that you would not want).<br />
<br />
<br />
I shut down condor, right clicked on C:/condor in a Windows explorer, turned off "read only" and set permissions to allow for writing. Then I edited the file "c:/condor/condor_config.local" which started out empty, so that it could pick up some replacement values (some of them didn't seem to be set properly during the install). These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following:<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourinternaldomain.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
and then restarted Condor. If you run "condor_status" and see that the Windows machine has Activity=OWNER rather than UNCLAIMED, be sure that you have added START=True. But this may not be the best configuration for a Windows workstation that is in use; some additional configuration is probably needed to make sure a Condor job doesn't use the machine while a human user is there using it.<br />
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here is the batch file contents for the actual job (printname.bat)<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, get a command prompt or Cygwin terminal, right click the icon to start it up, and click "run with elevated privileges" or "run as administrator".<br />
<br />
condor_master runs as a service on Windows, which controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows for the first time, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* run the condor manager <br />
condor_master<br />
<br />
* Assuming that during installation you set up the type as ''manager,execute,submit'' (the default), run the following command<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after you started condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Set up Condor to start automatically <br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update the MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add the ''condor.boot'' service to all runlevels<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
#include <unistd.h><br />
#include <stdio.h><br />
int main( int argc, char** argv )<br />
{<br />
printf( "%s\n", argv[1] );<br />
fflush( stdout );<br />
sleep( 30 );<br />
return 0; <br />
}<br />
<br />
This program prints the command-line argument it is given, waits 30 seconds, then exits.<br />
<br />
Save this file as foo.c, then compile it with (note static linking).<br />
<br />
gcc foo.c -o foo -static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once so you can see how Condor will execute the job on multiple execute resources), you can <br />
change the condorjob file to be like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
<br />
We had a case where we had 6 slots, 2 were 32 bit with Arch=INTEL, 4 were 64 bit with Arch=X86_64 (but we were unaware of the bit difference at first). We ran 6 jobs and then were wondering why they would only execute on the submitting machine. So we changed the condorjob file to specify a certain architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told Condor to execute only on machines with an Architecture of INTEL, so the job was not executed on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. I've included this story in case it helps with debugging.<br />
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was ''condor''.<br />
*When building BatchMake, you need to build with grid support enabled<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows you which of the jobs you have submitted to Condor are active. It will give you a cluster and process ID for each job.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
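These utilities combine well. The sketch below pulls the CID.PID of every Held job out of a condor_q listing with awk; since we cannot assume a live pool here, it runs against a canned sample listing (the job data is illustrative, in the usual 7.2-era column layout), with the live-pool variant shown in a comment. The status (ST) column is the sixth whitespace-separated field; adjust the awk field if your output differs.<br />

```shell
# A canned condor_q listing (illustrative jobs, 7.2-era column layout).
sample_q='
-- Submitter: clavicle.shoulder.com : <10.0.0.1:1234> : clavicle.shoulder.com
 ID      OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
  12.0   condor  6/24 10:01   0+00:00:00 H  0   9.8  mpiwork.exe
  12.1   condor  6/24 10:01   0+00:02:11 R  0   9.8  mpiwork.exe
  13.0   condor  6/24 10:05   0+00:00:00 H  0   9.8  mpiwork.exe
'

# Keep rows whose 6th field (the ST column) is H, print the CID.PID column.
held_ids=$(printf '%s\n' "$sample_q" | awk '$6 == "H" {print $1}')
echo "$held_ids"   # prints 12.0 and 13.0, one per line

# Against a live pool, the same idea:
#   condor_q | awk '$6 == "H" {print $1}' | xargs -r -n1 condor_rm
```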
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process starts and stops all other Condor daemons. condor_master writes a MasterLog and maintains a .master_address file. Be sure that the .master_address file contains the correct IP address. If the correct IP address doesn't appear, set the correct value in the parameter <br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has, but Condor may have picked a different IP than you wanted, perhaps the local loopback address 127.0.0.1 or a wireless adapter instead of the ethernet one.<br />
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes and maintains information about submitted jobs. It tracks the job queue and tries to obtain resources so that all of its jobs can run. When a submitted job runs, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine whenever one of its jobs actually executes. It takes care of system calls that must be executed on the submitting machine on behalf of the process. There is one condor_shadow process for each executing process of a submission machine, so on a machine with a large number of submitted processes, the number of shadow daemons that memory and other resources can support may become a limitation.<br />
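If shadow load on a busy submit machine becomes a concern, Condor lets you cap how many jobs the schedd will run simultaneously, which also caps the number of shadows. A sketch for condor_config.local on the submit machine (the value is only an example; check your Condor version's manual for this setting):<br />

```
# Cap simultaneous running jobs from this schedd, and thus the condor_shadow count
MAX_JOBS_RUNNING = 200
```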
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution slot. Log files of interest include NegotiatorLog and MatchLog.<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon detects user activity on an execute node, so Condor knows whether to allow a job to execute or to disallow it because a human user is currently engaged in some task on the machine.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, corresponding to [http://en.wikipedia.org/wiki/IA-64 IA64], the 64-bit Intel Itanium processor architecture. This does not cover all 64-bit Intel processors. <br />
<br />
While trying to run ''condor_master'', the shell returned the error message ''cannot execute binary file''.<br />
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], it is possible to inspect the header of an executable and determine whether it could run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
 Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the different outputs, we can see that the architecture '''Intel IA-64''' is not the right one for this machine.<br />
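To make this comparison quickly scriptable, a small helper (ours, not part of binutils) can pull just the Machine field out of readelf -h output. Below it is demonstrated on the relevant lines from the two header transcripts shown above, with the live invocation in a comment.<br />

```shell
# Print the Machine field from `readelf -h`-style text on stdin.
elf_machine() {
  awk -F': *' '/^ *Machine:/ {print $2; exit}'
}

# The relevant lines from the two transcripts above:
ia64_hdr='  Machine:   Intel IA-64'
amd64_hdr='  Machine:   Advanced Micro Devices X86-64'

printf '%s\n' "$ia64_hdr"  | elf_machine    # -> Intel IA-64
printf '%s\n' "$amd64_hdr" | elf_machine    # -> Advanced Micro Devices X86-64

# On a real binary:
#   readelf -h ./condor-7.2.0/sbin/condor_master | elf_machine
```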
<br />
<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the location of the MPICH2\bin directory to your system path, and add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
The following was performed in a Windows CMD prompt run with Administrative Privileges (right click on the CMD executable and run it as an Administrator).<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's Windows account password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication passphrase for smpd; this is the '''mypassphrase''' you entered above. If all is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processors since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines and how many processes each machine should run, with the executable '''hostname''', which should print the two machines' different hostnames.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, right-click on the project file (not the solution file), click Properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then, to run your exe on two different machines (I had to use a RELEASE build for this; at first I had built a DEBUG build, which caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
<br />
== MPICH2 and Condor on Windows ==<br />
<br />
To configure MPI over Condor, I set my Windows 7 laptop up as a dedicated Condor scheduler and executor, following the instructions at this link:<br />
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
I turned off smpd.exe as a service, since Condor will manage it:<br />
<br />
smpd.exe -remove<br />
<br />
<br />
I then created the following MPI example program and ran it with a MachineCount of 2 and of 4 at different times (on my laptop, which has 4 cores, with Condor running locally). This program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.<br />
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char *argv[])<br />
{<br />
MPI::Init(argc, argv);<br />
//<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
int size = MPI::COMM_WORLD.Get_size();<br />
 char processor_name[MPI_MAX_PROCESSOR_NAME];<br />
 int namelen; // length of the name, filled in by Get_processor_name<br />
 MPI::Get_processor_name(processor_name,namelen);<br />
// <br />
int buf[1];<br />
int numElements = 1;<br />
// initialize buffer with rank<br />
buf[0] = rank;<br />
// example of synchronous communication<br />
// have even ranks send to odd ranks first<br />
// assumes an even number of processes<br />
int syncTag = 123;<br />
if (rank % 2 == 0)<br />
{<br />
// send to next higher rank<br />
int dest = rank + 1;<br />
std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Send(buf,numElements,MPI::INT,dest,syncTag);<br />
// now wait for dest to send back<br />
// even though dest is the param, in the Recv call it is used as the source<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,dest,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
}<br />
else<br />
{<br />
// source is next lower rank<br />
int source = rank - 1;<br />
std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,source,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
// send rank rather than buffer, just to be sure<br />
// even though param is source, in Send call it is for dest<br />
MPI::COMM_WORLD.Send(&rank,numElements,MPI::INT,source,syncTag);<br />
}<br />
// now for asynchronous communication<br />
std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
int asyncTag = 321;<br />
int leftRank;<br />
if (rank == 0)<br />
{<br />
leftRank = size-1;<br />
}<br />
else <br />
{<br />
leftRank = (rank-1);<br />
}<br />
int rightRank;<br />
if (rank == size-1)<br />
{<br />
rightRank = 0;<br />
}<br />
else <br />
{<br />
rightRank = (rank+1);<br />
}<br />
// everyone sends to the leftRank, and receives from the rightRank<br />
// if this were synchronous, would be in danger of deadlocking because of circular waits<br />
// reset buffers to rank<br />
buf[0] = rank;<br />
std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::Request sendReq = MPI::COMM_WORLD.Isend(buf,numElements,MPI::INT,leftRank,asyncTag);<br />
MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf,numElements,MPI::INT,rightRank,asyncTag);<br />
// wait on the receipt<br />
MPI::Status status;<br />
recvReq.Wait(status);<br />
std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
//<br />
MPI::Finalize();<br />
return(0);<br />
}<br />
<br />
<br />
I then created a two-machine Condor parallel universe grid, using the same user/password combination on both machines; this user had Administrative rights. My 4-core Windows 7 Professional laptop (X86_64,WINNT61) was the '''CONDOR_HOST''' and the '''DedicatedScheduler'''. Assuming this laptop is called '''clavicle''' in the '''shoulder.com''' domain, here are the important configuration parameters in its '''condor_config.local''' file: <br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files (x86)\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD, COLLECTOR, NEGOTIATOR, SMPD_SERVER<br />
<br />
<br />
The second machine, a 2 core Windows XP desktop called '''scapula''' (INTEL,WINNT51) was a dedicated machine with the following important parameters in its '''condor_config.local''' file:<br />
<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, STARTD, SMPD_SERVER<br />
<br />
Note how it points to '''clavicle''' both as '''CONDOR_HOST''' and as the '''DedicatedScheduler''', and that it does not have '''(x86)''' in its path to '''smpd.exe'''.<br />
<br />
MPICH2 installs to different locations: '''C:\Program Files\MPICH2''' on an (INTEL,WINNT51) machine, and '''C:\Program Files (x86)\MPICH2''' on a (X86_64,WINNT61) machine. Because the batch file referenced by my Condor submit file needs the path to '''mpiexec.exe''', and that batch file is the same on both machines, I opted to copy '''mpiexec.exe''' into the work directory I was using on all machines, '''C:\mpich2work'''.<br />
<br />
After setting up this configuration, I rebooted both machines so that it would take effect. At this point, the command<br />
net stop condor<br />
no longer returns correctly, since Condor now manages the '''smpd.exe''' service and does not know how to stop it.<br />
<br />
<br />
<br />
Based on the Condor submit file described in the Condor wiki page, I had to adapt my own submit file for my above MPI executable called '''mpiwork.exe''' to be the following:<br />
<br />
<br />
universe = parallel<br />
executable = mp2script.bat<br />
arguments = mpiwork.exe<br />
machine_count = 6<br />
output = out.$(NODE).log<br />
error = error.$(NODE).log<br />
log = work.$(NODE).log<br />
should_transfer_files = yes<br />
when_to_transfer_output = on_exit<br />
transfer_input_files = mpiwork.exe<br />
Requirements = ((Arch == "INTEL") && (OpSys == "WINNT51")) || ((Arch == "X86_64") && (OpSys == "WINNT61"))<br />
queue<br />
<br />
and my '''mp2script.bat''' file was adapted to be: <br />
<br />
set _CONDOR_PROCNO=%_CONDOR_PROCNO%<br />
set _CONDOR_NPROCS=%_CONDOR_NPROCS%<br />
REM If not the head node, just sleep forever<br />
if not [%_CONDOR_PROCNO%] == [0] copy con nothing<br />
REM Set this to the bin directory of MPICH installation<br />
set MPDIR="C:\mpich2work"<br />
REM run the actual mpijob<br />
%MPDIR%\mpiexec.exe -n %_CONDOR_NPROCS% -p 6666 %*<br />
exit 0<br />
<br />
with the change being that '''MPDIR''' is a location that is known to have '''mpiexec.exe''' on both machines, since I copied that file there on both machines as described above.<br />
<br />
A point of confusion is that I would have expected the output of the '''mpiwork.exe''' code above to contain the names of both '''clavicle''' and '''scapula''' when run on 6 nodes; instead only the name '''clavicle''' appeared. I do believe the job ran on both machines and all 6 cores, as both machines' IP addresses were noted in work.parallel.log, and both machines showed CPU activity while the executable ran.<br />
<br />
<br />
== MPI Based Meta-Optimization Framework ==<br />
<br />
=== High Level Architecture ===<br />
<br />
The system was designed for infrastructure reuse. The infrastructure classes take care of organizing the computation nodes and of the communication between them. When a new meta-optimization is needed, the client programmer should create subclasses of '''MetaOptimizer''', '''JobDefinition''', '''Results''', and '''RegistrationWorker''' specific to the computational problem.<br />
<br />
=== MPI Aware Infrastructure C++ classes ===<br />
<br />
'''MasterSlave''': Entry point of the application, initializes the MPI infrastructure, starts up a '''Master''' (sending it a particular '''MetaOptimizer''' subclass), starts up as many '''Slave''' nodes as are necessary (sending each of them a particular '''RegistrationWorker''' subclass), and finalizes the MPI infrastructure when the computation is complete.<br />
<br />
'''Master''': The central controlling part of the infrastructure. It manages the available '''Slave'''s, sending them '''JobDefinition'''s as needed, collects '''Results''' from the '''Slave'''s as they come in, will re-run (NOT YET IMPLEMENTED) a '''JobDefinition''' if it has failed or timed out, and shuts down all '''Slave'''s when the computation is finished. The '''Master''' owns a particular '''MetaOptimizer''' subclass. The '''MetaOptimizer''' subclass is aware of the particular computation; the '''Master''' only knows how to communicate with '''Slave'''s and has no responsibility for breaking a computation into parts or determining its completion.<br />
<br />
'''Slave''': One of the processing nodes, part of the infrastructure. Each '''Slave''' manages a '''RegistrationWorker''' (SHOULD THIS BE MORE GENERAL THAN REGISTRATION?) subclass. '''Slave'''s announce their presence to the '''Master''', receive a '''JobDefinition''' describing a particular job, pass the '''JobDefinition''' on to the '''RegistrationWorker''' to run the job, receive back a '''Results''' upon completion of the job, and send the '''Results''' back to the '''Master'''. '''Slave'''s are intended to run multiple rounds of registration during a particular computational run.<br />
<br />
=== Infrastructure Event C++ classes ===<br />
<br />
'''MasterEvent''' & '''SlaveEvent''' (named for the origin of the Event) are Infrastructure Event classes, and each hold a Message Wrapper class as a payload.<br />
<br />
The idea behind the job message wrapper classes is to give each wrapper the ability to convert its specific named parameters to and from array buffers, a pseudo-serialization that keeps the wrappers unaware of MPI specifics. Each message wrapper class is associated with a specific Infrastructure Event class ('''SlaveEvent'''s own a '''Results''', and '''MasterEvent'''s own a '''JobDefinition'''). The Infrastructure Event classes take care of sending from '''Master''' classes to '''Slave''' classes, and from '''Slave''' classes to '''Master''' classes.<br />
<br />
The '''MetaOptimizer''' subclasses will return '''JobDefinition''' subclasses to the '''Master''', which then wraps the '''JobDefinition''' subclass in a '''MasterEvent''' and sends the '''MasterEvent''' to a '''Slave'''. A '''Slave''' reads a '''MasterEvent''', extracts the '''JobDefinition''' subclass, and passes the '''JobDefinition''' subclass to a '''RegistrationWorker''' subclass. Upon completion of the calculation for the '''JobDefinition''', the '''RegistrationWorker''' returns a '''Results''' subclass to the '''Slave''', which then wraps the '''Results''' in a '''SlaveEvent''' and sends the '''SlaveEvent''' to the '''Master'''. The '''Master''' reads the '''SlaveEvent''', extracts the '''Results''' subclass, and sends this '''Results''' subclass to the '''MetaOptimizer'''.<br />
<br />
=== Computation Specific C++ classes ===<br />
<br />
Note: currently these classes are specific to a particular example computation, they need to have the interface abstracted out and the specific example computation extracted and moved to subclasses.<br />
<br />
<br />
<br />
JobDefinition: A Message Wrapper class that defines the parameters of a particular job, a specific subclass of JobDefinition should be used for a particular computation.<br />
<br />
Results: A Message Wrapper class that defines the return values of a particular job, a specific subclass of Results should be used for a particular computation.<br />
<br />
MetaOptimizer: The MetaOptimizer subclass for a particular computation takes care of determining the parameters for each job (in the form of a JobDefinition) and how many JobDefinitions make up a round of computation. The MetaOptimizer then takes in the Results from each JobDefinition and, upon receiving Results for every JobDefinition in a round, either determines a new round of computation by recalculating the parameter set for the new round, or terminates the computation.<br />
<br />
RegistrationWorker: The RegistrationWorker subclass for a particular computation will take in a particular JobDefinition subclass germane to the computation, perform the actual Registration computation, then return the results in the form of a specific Results subclass that is relevant to the computation.</div>Michael.grauerhttps://public.kitware.com/Wiki/index.php?title=Proposals:Condor&diff=41111Proposals:Condor2011-06-23T18:19:30Z<p>Michael.grauer: /* MPI Aware Infrastructure C++ classes */</p>
<hr />
<div>= Introduction =<br />
<br />
Condor is an open source distributed computing software framework. It can be used to manage workload on a dedicated cluster of computers, and/or to farm out work to idle desktop computers. Condor is a cross-platform system that runs on Unix and Windows operating systems. Condor is a complex and flexible system that can execute jobs in serial and parallel mode. For parallel jobs, it supports the MPI standard. This wiki page documents our working experience using Condor.<br />
<br />
= Downloading Condor =<br />
<br />
Different versions of Condor can be downloaded from [http://www.cs.wisc.edu/condor/downloads-v2/download.pl here]. This documentation focuses on our experience installing and configuring Condor version 7.2.0. The official detailed documentation for this version can be found [http://www.cs.wisc.edu/condor/manual/v7.2/ here].<br />
<br />
= Preparation =<br />
<br />
As Condor is a flexible system, there are different ways of configuring condor in your computing infrastructure. Hence, before starting installation, make the following important decisions.<br />
<br />
#What machine will be the central manager?<br />
#What machines should be allowed to submit jobs?<br />
#Will Condor run as root or not?<br />
#Do I have enough disk space for Condor?<br />
#Do I need MPI configured?<br />
<br />
Condor can be installed as a ''manager'' node, an ''execute'' node, or a ''submit'' node, or any combination of these. See [http://www.cs.wisc.edu/condor/manual/v7.2/3_1Introduction.html#SECTION00411000000000000000 The Different Roles a Machine Can Play]<br />
<br />
* Manager: There can be only one central manager for your pool. This machine is the collector of information and the negotiator between resources and resource requests.<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
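In practice, the role a machine plays is determined by which daemons its condor_master starts, i.e. by the DAEMON_LIST setting in its configuration. The sketch below mirrors the lists used in the Windows examples later on this page (pick one line per machine; the MPI-specific SMPD_SERVER entry from those examples is omitted):<br />

```
# Central manager that also submits and executes:
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD
# Dedicated execute-only node:
DAEMON_LIST = MASTER, STARTD
# Submit-only node:
DAEMON_LIST = MASTER, SCHEDD
```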
<br />
For more information regarding other required preparatory work, refer to the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
''' The official instructions on how to install Condor in Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here] '''. Below we present some of the tweaks we had to make to get it working on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.); if it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a similar command to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or with a user having equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (For example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation scripts ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update PATH variable to include the directory ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save file and apply the change by running <br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out and log in, or even restart the machine, and you should be able to check that the CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit condor manager config_file and update the line as referenced below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, then to allow MIDAS to run Condor commands, create a link to /root/condor/etc/condor_config in /home/condor<br />
 cd /home/condor<br />
 ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring an Executer/Submitter in Unix ===<br />
The files that allow the server to also be used as a condor submitter/executer were automatically updated when the installation script ''condor_install'' ran. Nevertheless, you still need to update its configuration file.<br />
<br />
* Edit the condor node's ''condor_config.local'' and update the line as shown below:<br />
vi /home/condor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the lines containing ''UID_DOMAIN'' and ''FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor in Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor in Windows 7.<br />
<br />
# Download the Windows install MSI, run it, installing to "C:/condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager ( full address ).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For the accounting domain, enter your domain (e.g. yourdomaininternal.com)<br />
# For Email settings, I skipped this step by clicking Next<br />
# For Java settings, I skipped this step by clicking Next, as we weren't using Java<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked whether you want a custom install or the default install, choose the default install. This will install condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When running condor commands, start a cygwin or cmd prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix up some problems. At first, '''condor_status''' gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect: since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or simply an IP that you did not want.<br />
<br />
<br />
I shut down condor, right-clicked on C:/condor in a Windows explorer, turned off "read only" and set permissions to allow writing. Then I edited the file "c:/condor/condor_config.local" (which started out empty) so that it could pick up some replacement values, since some of them did not seem to be set properly during the install. These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following:<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourinternaldomain.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
and then restarted condor. If you run "condor_status" and see that the Windows machine has Activity=OWNER rather than UNCLAIMED, be sure that you have added START=True. But this may not be the best configuration for a Windows workstation that is in use. There is probably some additional configuration needed to make sure a Condor job doesn't use the machine when a physical human user is there using it.<br />
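One common way to handle this (a sketch based on the startd policy expressions in the Condor manual; not something we tested in this setup) is to make START, SUSPEND, and CONTINUE depend on keyboard idle time and load average in '''condor_config.local''':<br />

```
# Hypothetical policy sketch (untested here): only start jobs after the
# console has been idle for 15 minutes and the machine is not busy,
# suspend them as soon as the user returns, and resume after the
# machine has been idle again for 15 minutes.
START    = (KeyboardIdle > 15 * 60) && (LoadAvg < 0.3)
SUSPEND  = (KeyboardIdle < 60)
CONTINUE = (KeyboardIdle > 15 * 60)
```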
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here is the batch file contents for the actual job (printname.bat)<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, get a command prompt or Cygwin terminal, right click the icon to start it up, and click "run with elevated privileges" or "run as administrator".<br />
<br />
On Windows, condor_master runs as a service and controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows for the first time, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* run the condor manager <br />
condor_master<br />
<br />
* Assuming that during the installation process you set the type to ''manager,execute,submit'' (the default), run the following command<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after you started condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Set up condor to start automatically <br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update the MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add the ''condor.boot'' service to all runlevels<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
#include <unistd.h><br />
#include <stdio.h><br />
int main( int argc, char** argv )<br />
{<br />
printf( "%s\n", argv[1] );<br />
fflush( stdout );<br />
sleep( 30 );<br />
return 0; <br />
}<br />
<br />
This exe will print the command line argument it is given, wait 30 seconds, then exit.<br />
<br />
Save this file as foo.c, then compile it with (note the static linking):<br />
<br />
gcc foo.c -o foo --static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once so you can see how Condor will execute the job on multiple execute resources), you can <br />
change the condorjob file to be like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
<br />
We had a case where we had 6 slots: 2 were 32-bit with Arch=INTEL, and 4 were 64-bit with Arch=X86_64 (but we were unaware of the difference at first). We ran 6 jobs and wondered why they would only execute on the submitting machine. So we changed the condorjob file to specify a certain architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told condor to execute only on machines with an Intel architecture, so the job was not attempted on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. I've included this story in case it helps with debugging.<br />
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was Condor.<br />
*When building BatchMake, you need to build with grid support enabled.<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows you which of the processes you have submitted to condor are active. It will give you a cluster and process ID for each process.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process will start and stop all other condor daemons. condor_master will write a MasterLog and have a .master_address file. Be sure that the .master_address contains the correct IP address. If the correct IP address doesn't appear, set the correct value in the parameter <br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has; condor may have picked a different IP than you wanted, perhaps the local loopback address 127.0.0.1, or a wireless adapter instead of the ethernet adapter.<br />
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes and maintains data about submitted jobs. It tracks the queueing of jobs and tries to obtain resources for all of its jobs to run. When a submitted job runs, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine when an actual execution of the job is run. It will take care of system calls that need to be executed on the submitting machine for a process. There will be a condor_shadow process for each executing process of a submission machine, meaning that on a machine with a large number of submitted processes, the memory or other resources available for shadow daemons could become a limitation.<br />
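If shadow resources do become a limitation, the number of simultaneously running jobs (and therefore shadows) can be capped; a config sketch, assuming the '''MAX_JOBS_RUNNING''' knob described in the Condor manual (the value here is arbitrary):<br />

```
# Sketch: cap the schedd at 200 simultaneously running jobs,
# which bounds the number of condor_shadow processes it spawns.
MAX_JOBS_RUNNING = 200
```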
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution. Log files of interest include NegotiatorLog and MatchLog.<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon is used to detect user activity on an execute node, so Condor knows whether to allow execution of a job or to disallow it because a human user is currently engaged in some task.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, corresponding to [http://en.wikipedia.org/wiki/IA-64 IA64], the 64-bit Intel Itanium processor architecture. It does not cover all 64-bit Intel processors. <br />
<br />
While trying to run ''condor_master'', the shell returned the following error message: ''cannot execute binary file''<br />
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], it's possible to extract the header of an executable and determine whether a given executable can run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the different outputs, it's possible to observe that the '''Intel IA-64''' architecture isn't the right one.<br />
<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the location of the MPICH2\bin directory to your system path. Add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
The following was performed in a Windows CMD prompt run with Administrative Privileges (right click on the CMD executable and run it as an Administrator).<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's Windows account password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication password for smpd; this is the '''mypassphrase''' you entered above. If all is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processors since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines, and how many processes each machine should run, and the executable '''hostname''', which should return the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, you must right click on the project file (not the solution file), then click properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then to run your exe on two different machines (I had to build a RELEASE for this, at first I had built a DEBUG build, but this caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
<br />
== MPICH2 and Condor on Windows ==<br />
<br />
To configure MPI over Condor, I set my Windows 7 laptop up as a dedicated Condor scheduler and executer, following the instructions at this link:<br />
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
I turned off smpd.exe as a service, since Condor will manage it:<br />
<br />
smpd.exe -remove<br />
<br />
<br />
I then created the following MPI example program, and ran it with a MachineCount of 2 and 4 at different times (I ran this on my laptop, which has 4 cores, and had Condor running locally). This program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.<br />
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char *argv[])<br />
{<br />
MPI::Init(argc, argv);<br />
//<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
int size = MPI::COMM_WORLD.Get_size();<br />
 char processor_name[MPI_MAX_PROCESSOR_NAME];<br />
 int namelen; // filled in by Get_processor_name<br />
 MPI::Get_processor_name(processor_name,namelen);<br />
// <br />
int buf[1];<br />
int numElements = 1;<br />
// initialize buffer with rank<br />
buf[0] = rank;<br />
// example of synchronous communication<br />
// have even ranks send to odd ranks first<br />
// assumes an even number of processes<br />
int syncTag = 123;<br />
if (rank % 2 == 0)<br />
{<br />
// send to next higher rank<br />
int dest = rank + 1;<br />
std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Send(buf,numElements,MPI::INT,dest,syncTag);<br />
// now wait for dest to send back<br />
// even though dest is the param, in the Recv call it is used as the source<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,dest,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
}<br />
else<br />
{<br />
// source is next lower rank<br />
int source = rank - 1;<br />
std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,source,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
// send rank rather than buffer, just to be sure<br />
// even though param is source, in Send call it is for dest<br />
MPI::COMM_WORLD.Send(&rank,numElements,MPI::INT,source,syncTag);<br />
}<br />
// now for asynchronous communication<br />
std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
int asyncTag = 321;<br />
int leftRank;<br />
if (rank == 0)<br />
{<br />
leftRank = size-1;<br />
}<br />
else <br />
{<br />
leftRank = (rank-1);<br />
}<br />
int rightRank;<br />
if (rank == size-1)<br />
{<br />
rightRank = 0;<br />
}<br />
else <br />
{<br />
rightRank = (rank+1);<br />
}<br />
// everyone sends to the leftRank, and receives from the rightRank<br />
// if this were synchronous, would be in danger of deadlocking because of circular waits<br />
// reset buffers to rank<br />
buf[0] = rank;<br />
std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::Request sendReq = MPI::COMM_WORLD.Isend(buf,numElements,MPI::INT,leftRank,asyncTag);<br />
MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf,numElements,MPI::INT,rightRank,asyncTag);<br />
// wait on the receipt<br />
MPI::Status status;<br />
recvReq.Wait(status);<br />
std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
//<br />
MPI::Finalize();<br />
return(0);<br />
}<br />
<br />
<br />
I then created a two machine Condor parallel universe grid, using the same user/password combination on both machines, and this user had Administrative rights. My 4 core Windows 7 Professional laptop (X86_64,WINNT61) was the '''CONDOR_HOST''' and the '''DedicatedScheduler'''. Assuming this laptop is called '''clavicle''' in the '''shoulder.com''' domain, here are the important configuration parameters in its '''condor_config.local''' file: <br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files (x86)\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD, COLLECTOR, NEGOTIATOR, SMPD_SERVER<br />
<br />
<br />
The second machine, a 2 core Windows XP desktop called '''scapula''' (INTEL,WINNT51) was a dedicated machine with the following important parameters in its '''condor_config.local''' file:<br />
<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, STARTD, SMPD_SERVER<br />
<br />
Note how it points to '''clavicle''' both as '''CONDOR_HOST''' and as the '''DedicatedScheduler''', and that it does not have '''(x86)''' in its path to '''smpd.exe'''.<br />
<br />
MPICH2 gets installed to different locations, to '''C:\Program Files\MPICH2''' for an (INTEL,WINNT51) machine, and to '''C:\Program Files (x86)\MPICH2''' for a (X86_64,WINNT61) machine. Because I needed to refer to the '''mpiexec.exe''' path in the batch file that my Condor submit file referred to, and this batch file would be the same for both machines, I opted to copy '''mpiexec.exe''' to the work directory I was using on all machines, '''C:\mpich2work'''.<br />
<br />
After setting up this configuration, I rebooted both machines to start in this configuration. At this point, the command<br />
net stop condor<br />
no longer returns correctly, since Condor is managing the '''smpd.exe''' service now and does not know how to stop that service.<br />
<br />
<br />
<br />
Based on the Condor submit file described in the Condor wiki page, I had to adapt my own submit file for my above MPI executable called '''mpiwork.exe''' to be the following:<br />
<br />
<br />
universe = parallel<br />
executable = mp2script.bat<br />
arguments = mpiwork.exe<br />
machine_count = 6<br />
output = out.$(NODE).log<br />
error = error.$(NODE).log<br />
log = work.$(NODE).log<br />
should_transfer_files = yes<br />
when_to_transfer_output = on_exit<br />
transfer_input_files = mpiwork.exe<br />
Requirements = ((Arch == "INTEL") && (OpSys == "WINNT51")) || ((Arch == "X86_64") && (OpSys == "WINNT61"))<br />
queue<br />
<br />
and my '''mp2script.bat''' file was adapted to be: <br />
<br />
set _CONDOR_PROCNO=%_CONDOR_PROCNO%<br />
set _CONDOR_NPROCS=%_CONDOR_NPROCS%<br />
REM If not the head node, just sleep forever<br />
if not [%_CONDOR_PROCNO%] == [0] copy con nothing<br />
REM Set this to the bin directory of MPICH installation<br />
set MPDIR="C:\mpich2work"<br />
REM run the actual mpijob<br />
%MPDIR%\mpiexec.exe -n %_CONDOR_NPROCS% -p 6666 %*<br />
exit 0<br />
<br />
with the change being that '''MPDIR''' is a location that is known to have '''mpiexec.exe''' on both machines, since I copied that file there on both machines as described above.<br />
<br />
A point of confusion I have is that I would have expected the '''mpiwork.exe''' code above to print the names of both '''clavicle''' and '''scapula''' in the output when I ran it on 6 nodes. Instead only the name '''clavicle''' appeared. I do believe that this was running on both machines and all 6 cores, as both machines' IP addresses were noted in the work.parallel.log, and both machines had CPU activity at the time the executable ran.<br />
<br />
<br />
== MPI Based Meta-Optimization Framework ==<br />
<br />
=== High Level Architecture ===<br />
<br />
The system was designed for infrastructure reuse. The infrastructure classes take care of organization of the computation nodes and communication between the computation nodes. When a new MetaOptimization is needed, the client programmer should create subclasses of '''MetaOptimizer''', '''JobDefinition''', '''Results''', and '''RegistrationWorker''' specific to the computational problem.<br />
<br />
=== MPI Aware Infrastructure C++ classes ===<br />
<br />
'''MasterSlave''': Entry point of the application, initializes the MPI infrastructure, starts up a '''Master''' (sending it a particular '''MetaOptimizer''' subclass), starts up as many '''Slave''' nodes as are necessary (sending each of them a particular '''RegistrationWorker''' subclass), and finalizes the MPI infrastructure when the computation is complete.<br />
<br />
'''Master''': The central controlling part of the infrastructure. It manages the available '''Slave'''s, sending them '''JobDefinition'''s as needed, collects '''Results''' from the '''Slave'''s as they come in, will re-run (NOT YET IMPLEMENTED) a '''JobDefinition''' if it has failed or timed out, and shuts down all '''Slave'''s when the computation is finished. The '''Master''' owns a particular '''MetaOptimizer''' subclass. The '''MetaOptimizer''' subclass is aware of the particular computation; the '''Master''' only knows how to communicate with '''Slave'''s and has no responsibility for determining the breakdown of a computation into parts or its completion.<br />
<br />
'''Slave''': One of the processing nodes, part of the infrastructure. Each '''Slave''' manages a '''RegistrationWorker''' (SHOULD THIS BE MORE GENERAL THAN REGISTRATION?) subclass. '''Slave'''s announce their presence to the '''Master''', receive a '''JobDefinition''' describing a particular job, pass the '''JobDefinition''' on to the '''RegistrationWorker''' to run the job, receive back a '''Results''' upon completion of the job, and send the '''Results''' back to the '''Master'''. '''Slave'''s are intended to run multiple rounds of registration during a particular computational run.<br />
<br />
=== Infrastructure Event C++ classes ===<br />
<br />
MasterEvent & SlaveEvent (named for the origin of the Event) are Infrastructure Event classes; each holds a Message Wrapper class as a payload.<br />
<br />
The idea behind the job message wrapper classes was to make message wrappers that know how to bi-directionally convert their specific named parameters to array buffers, giving them a pseudo-serialization ability without requiring them to be aware of MPI specifics. Each message wrapper class is associated with a specific Infrastructure Event class (SlaveEvents own a Results, and MasterEvents own a JobDefinition). The Infrastructure Event classes take care of sending from Master classes to Slave classes, and from Slave classes to Master classes.<br />
<br />
The MetaOptimizer subclasses return JobDefinition subclasses to the Master, which wraps each JobDefinition subclass in a MasterEvent and sends the MasterEvent to a Slave. A Slave reads a MasterEvent, extracts the JobDefinition subclass, and passes the JobDefinition subclass to a RegistrationWorker subclass. Upon completion of the calculation for the JobDefinition, the RegistrationWorker returns a Results subclass to the Slave, which wraps the Results in a SlaveEvent and sends the SlaveEvent to the Master. The Master reads the SlaveEvent, extracts the Results subclass, and passes this Results subclass to the MetaOptimizer.<br />
<br />
=== Computation Specific C++ classes ===<br />
<br />
Note: currently these classes are specific to a particular example computation; they need to have the interface abstracted out and the specific example computation extracted and moved to subclasses.<br />
<br />
<br />
<br />
JobDefinition: A Message Wrapper class that defines the parameters of a particular job, a specific subclass of JobDefinition should be used for a particular computation.<br />
<br />
Results: A Message Wrapper class that defines the return values of a particular job, a specific subclass of Results should be used for a particular computation.<br />
<br />
MetaOptimizer: The MetaOptimizer subclass for a particular computation takes care of determining the parameters for a particular job (in the form of a JobDefinition), and determining how many JobDefinitions will be in a round of computation. The MetaOptimizer then takes in the Results from each JobDefinition, and upon receiving Results for each JobDefinition in a round, can either determine a new round of computation by recalculating the new set of parameters for the new round, or else terminate the computation.<br />
<br />
RegistrationWorker: The RegistrationWorker subclass for a particular computation will take in a particular JobDefinition subclass germane to the computation, perform the actual Registration computation, then return the results in the form of a specific Results subclass that is relevant to the computation.</div>
<hr />
<div>Condor can be installed as a ''manager'' node, an ''execute'' node, and/or a ''submit'' node. See [http://www.cs.wisc.edu/condor/manual/v7.2/3_1Introduction.html#SECTION00411000000000000000 The Different Roles a Machine Can Play]<br />
<br />
* Manager: There can be only one central manager for your pool. This machine is the collector of information, and the negotiator between resources and resource requests.<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer to the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
''' The official instructions on how to install Condor on Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here] '''. Below we present some of the tweaks we had to make to get it working on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.). If it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a similar command to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or with a user having equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (For example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation script ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update PATH variable to include the directory ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save the file and apply the change by running <br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out and log back in, or even restart the machine, and check that the CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit the Condor manager configuration file and update the lines shown below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, create a link to /root/condor/etc/condor_config in /home/condor so that MIDAS can run Condor commands:<br />
 cd /home/condor<br />
 ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring an Executer/Submitter in Unix ===<br />
The files that allow the server to also be used as a Condor submitter/executer were automatically updated by the installation script ''condor_install''. Nevertheless, you still need to update the local configuration file.<br />
<br />
* Edit the Condor node configuration file condor_config.local and update the line shown below:<br />
vi /home/condor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the line having ''UID_DOMAIN'' and '' FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor in Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor in Windows 7.<br />
<br />
# Download the Windows install MSI, run it, installing to "C:/condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager ( full address ).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For accounting domain enter your domain (e.g. yourdomaininternal.com)<br />
# For Email settings (I ignored this by clicking next)<br />
# for Java settings (I ignored this as we weren't using Java, by clicking next)<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked whether you want a custom or a default install, choose the default install. This will install Condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When running Condor commands, start a Cygwin or cmd prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix up some problems. My '''condor_status''' at first gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect: since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or otherwise an IP that you do not want.<br />
<br />
<br />
I shut down condor, right clicked on C:/condor in a Windows explorer, turned off "read only" and set permissions to allow for writing. Then I edited the file "c:/condor/condor_config.local" which started out empty, so that it could pick up some replacement values (some of them didn't seem to be set properly during the install). These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following:<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourinternaldomain.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
and then restarted condor. If you run "condor_status" and see that the Windows machine has Activity=OWNER rather than UNCLAIMED, be sure that you have added START = True. But this may not be the best configuration for a Windows workstation that is in use; some additional configuration is needed to make sure a Condor job doesn't use the machine while a physical human user is there using it.<br />
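One way to do that is a START policy based on keyboard idle time. As a sketch for condor_config.local (the exact expressions should be checked against the policy-configuration chapter of the Condor manual before relying on them):<br />
<br />
```text
# Only start jobs after 15 minutes with no keyboard/mouse activity
START = KeyboardIdle > (15 * 60)
# Suspend a running job as soon as the user comes back
SUSPEND = KeyboardIdle < 60
# Resume once the machine has been idle again for 5 minutes
CONTINUE = KeyboardIdle > (5 * 60)
```
<br />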
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here is the content of the batch file for the actual job (printname.bat)<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, get a command prompt or Cygwin terminal, right click the icon to start it up, and click "run with elevated privileges" or "run as administrator".<br />
<br />
On Windows, condor_master runs as a service that controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* run the condor manager <br />
condor_master<br />
<br />
* Assuming that during installation you set up the machine type as ''manager,execute,submit'' (the default), run the following command:<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after you started condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Set up Condor to start automatically at boot <br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add the ''condor.boot'' service to all runlevels<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
#include <unistd.h><br />
#include <stdio.h><br />
int main( int argc, char** argv )<br />
{<br />
printf( "%s\n", argv[1] );<br />
fflush( stdout );<br />
sleep( 30 );<br />
return 0; <br />
}<br />
<br />
This executable will echo the command-line argument it is given, wait 30 seconds, then exit.<br />
<br />
Save this file as foo.c, then compile it (note the static linking):<br />
<br />
gcc foo.c -o foo --static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once so you can see how Condor will execute the job on multiple execute resources), you can <br />
change the condorjob file to be like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
<br />
We had a case where we had 6 slots, 2 were 32 bit with Arch=INTEL, 4 were 64 bit with Arch=X86_64 (but we were unaware of the bit difference at first). We ran 6 jobs and then were wondering why they would only execute on the submitting machine. So we changed the condorjob file to specify a certain architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told Condor to execute only on machines with an Arch of INTEL, so no attempt was made to execute on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. I've included this story in case it helps with debugging.<br />
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was ''condor''.<br />
*When building BatchMake, you need to build with grid support on<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows the active jobs that you have submitted to Condor. It will give you a cluster and process ID for each job.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process will start and stop all other condor daemons. condor_master will write a MasterLog and have a .master_address file. Be sure that the .master_address contains the correct IP address. If the correct IP address doesn't appear, set the correct value in the parameter <br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has, but condor may have picked a different IP than you would have wanted, perhaps the local loopback address 127.0.0.1 or a wireless instead of ethernet adapter.<br />
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes, and represents data about submitted jobs. It tracks the queueing of jobs and tries to get resources for all of its jobs to be run. When a job submitted is run, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine when an actual execution of the job is run. It will take care of systems calls that need to be executed on the submitting machine for a process. There will be a condor_shadow process for each executing process of a submission machine, meaning that on a machine with a large number of submitted processes, the number of shadow daemons supported by memory or other resources could be a limitation.<br />
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution. Log files of interest include NegotiatorLog and MatchLog<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon is used to detect user activity on an execute node, so Condor knows whether to allow execution of a job or to disallow it because a human user is currently engaged in some task.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, corresponding to [http://en.wikipedia.org/wiki/IA-64 IA64], the 64-bit Intel Itanium processor. This does not cover all 64-bit Intel processors.<br />
<br />
While trying to run the ''condor_master'', the shell returned the following error message ''cannot execute binary file''<br />
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], it's possible to extract the header of an executable and understand if a given executable could run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-6'''4<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the different outputs, it's possible to observe that the '''Intel IA-64''' architecture isn't the right one.<br />
<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the location of the MPICH2\bin directory to your system path. Add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
The following was performed in a Windows CMD prompt run with Administrative Privileges (right click on the CMD executable and run it as an Administrator).<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's Windows account password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication password for smpd; this is the '''mypassphrase''' you entered above. If all is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processes since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines, and how many processes each machine should run, and the executable '''hostname''', which should return the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, you must right click on the project file (not the solution file), then click properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then to run your exe on two different machines (I had to use a RELEASE build for this; at first I had built a DEBUG build, but that caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
<br />
== MPICH2 and Condor on Windows ==<br />
<br />
To configure MPI over Condor, I set my Windows 7 laptop up as a dedicated Condor scheduler and executor, following the instructions at this link:<br />
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
I turned off smpd.exe as a service, since Condor will manage it:<br />
<br />
smpd.exe -remove<br />
<br />
<br />
I then created the following MPI example program, and ran it with a MachineCount of 2 and of 4 at different times (I ran this on my laptop, which has 4 cores, and had Condor running locally). This program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.<br />
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char *argv[])<br />
{<br />
MPI::Init(argc, argv);<br />
//<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
int size = MPI::COMM_WORLD.Get_size();<br />
 char processor_name[MPI_MAX_PROCESSOR_NAME];<br />
 int namelen;<br />
 MPI::Get_processor_name(processor_name,namelen);<br />
// <br />
int buf[1];<br />
int numElements = 1;<br />
// initialize buffer with rank<br />
buf[0] = rank;<br />
// example of synchronous communication<br />
// have even ranks send to odd ranks first<br />
// assumes an even number of processes<br />
int syncTag = 123;<br />
if (rank % 2 == 0)<br />
{<br />
// send to next higher rank<br />
int dest = rank + 1;<br />
std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Send(buf,numElements,MPI::INT,dest,syncTag);<br />
// now wait for dest to send back<br />
// even though dest is the param, in the Recv call it is used as the source<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,dest,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
}<br />
else<br />
{<br />
// source is next lower rank<br />
int source = rank - 1;<br />
std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,source,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
// send rank rather than buffer, just to be sure<br />
// even though param is source, in Send call it is for dest<br />
MPI::COMM_WORLD.Send(&rank,numElements,MPI::INT,source,syncTag);<br />
}<br />
// now for asynchronous communication<br />
std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
int asyncTag = 321;<br />
int leftRank;<br />
if (rank == 0)<br />
{<br />
leftRank = size-1;<br />
}<br />
else <br />
{<br />
leftRank = (rank-1);<br />
}<br />
int rightRank;<br />
if (rank == size-1)<br />
{<br />
rightRank = 0;<br />
}<br />
else <br />
{<br />
rightRank = (rank+1);<br />
}<br />
// everyone sends to the leftRank, and receives from the rightRank<br />
// if this were synchronous, would be in danger of deadlocking because of circular waits<br />
// reset buffers to rank<br />
buf[0] = rank;<br />
std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::Request sendReq = MPI::COMM_WORLD.Isend(buf,numElements,MPI::INT,leftRank,asyncTag);<br />
MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf,numElements,MPI::INT,rightRank,asyncTag);<br />
// wait on the receipt<br />
MPI::Status status;<br />
recvReq.Wait(status);<br />
std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
//<br />
MPI::Finalize();<br />
return(0);<br />
}<br />
<br />
<br />
I then created a two-machine Condor parallel universe grid, using the same user/password combination on both machines; this user had Administrative rights. My 4-core Windows 7 Professional laptop (X86_64,WINNT61) was the '''CONDOR_HOST''' and the '''DedicatedScheduler'''. Assuming this laptop is called '''clavicle''' in the '''shoulder.com''' domain, here are the important configuration parameters in its '''condor_config.local''' file: <br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files (x86)\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD, COLLECTOR, NEGOTIATOR, SMPD_SERVER<br />
<br />
<br />
The second machine, a 2 core Windows XP desktop called '''scapula''' (INTEL,WINNT51) was a dedicated machine with the following important parameters in its '''condor_config.local''' file:<br />
<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, STARTD, SMPD_SERVER<br />
<br />
Note how it points to '''clavicle''' both as '''CONDOR_HOST''' and as the '''DedicatedScheduler''', and that it does not have '''(x86)''' in its path to '''smpd.exe'''.<br />
<br />
MPICH2 gets installed to different locations: '''C:\Program Files\MPICH2''' on an (INTEL,WINNT51) machine, and '''C:\Program Files (x86)\MPICH2''' on a (X86_64,WINNT61) machine. Because I needed to refer to the '''mpiexec.exe''' path in the batch file that my Condor submit file referred to, and this batch file would be the same for both machines, I opted to copy '''mpiexec.exe''' to the work directory I was using on all machines, '''C:\mpich2work'''.<br />
<br />
After setting up this configuration, I rebooted both machines so that it would take effect. At this point, the command<br />
 net stop condor<br />
no longer returns correctly, since Condor now manages the '''smpd.exe''' service and does not know how to stop that service.<br />
<br />
<br />
<br />
Based on the Condor submit file described in the Condor wiki page, I had to adapt my own submit file for my above MPI executable called '''mpiwork.exe''' to be the following:<br />
<br />
<br />
universe = parallel<br />
executable = mp2script.bat<br />
arguments = mpiwork.exe<br />
machine_count = 6<br />
output = out.$(NODE).log<br />
error = error.$(NODE).log<br />
log = work.$(NODE).log<br />
should_transfer_files = yes<br />
when_to_transfer_output = on_exit<br />
transfer_input_files = mpiwork.exe<br />
Requirements = ((Arch == "INTEL") && (OpSys == "WINNT51")) || ((Arch == "X86_64") && (OpSys == "WINNT61"))<br />
queue<br />
<br />
and my '''mp2script.bat''' file was adapted to be: <br />
<br />
set _CONDOR_PROCNO=%_CONDOR_PROCNO%<br />
set _CONDOR_NPROCS=%_CONDOR_NPROCS%<br />
REM If not the head node, just sleep forever<br />
if not [%_CONDOR_PROCNO%] == [0] copy con nothing<br />
REM Set this to the bin directory of MPICH installation<br />
set MPDIR="C:\mpich2work"<br />
REM run the actual mpijob<br />
%MPDIR%\mpiexec.exe -n %_CONDOR_NPROCS% -p 6666 %*<br />
exit 0<br />
<br />
with the change being that '''MPDIR''' is a location that is known to have '''mpiexec.exe''' on both machines, since I copied that file there on both machines as described above.<br />
<br />
A point of confusion I have is that I would have expected the '''mpiwork.exe''' code above to show the names of both '''clavicle''' and '''scapula''' in the output when I ran it on 6 nodes. Instead only the name '''clavicle''' appeared. I do believe that this was running on both machines and all 6 cores, as both machines' IP addresses were noted in the work.parallel.log, and both machines had CPU activity at the time the executable ran.<br />
<br />
<br />
== MPI Based Meta-Optimization Framework ==<br />
<br />
=== High Level Architecture ===<br />
<br />
The system was designed for infrastructure reuse. The infrastructure classes take care of organization of the computation nodes and communication between the computation nodes. When a new MetaOptimization is needed, the client programmer should create subclasses of '''MetaOptimizer''', '''JobDefinition''', '''Results''', and '''RegistrationWorker''' specific to the computational problem.<br />
<br />
=== MPI Aware Infrastructure C++ classes ===<br />
<br />
MasterSlave: Entry point of the application, initializes the MPI infrastructure, starts up a Master (sending it a particular MetaOptimizer subclass), starts up as many slave nodes as are necessary (sending each of them a particular RegistrationWorker subclass), and finalizes the MPI infrastructure when the computation is complete.<br />
<br />
Master: The central controlling part of the infrastructure. It manages the available Slaves, sending them JobDefinitions as needed, collects Results from the Slaves as the Results come in, will re-run (NOT YET IMPLEMENTED) a JobDefinition if it has failed or timed out, and shuts down all Slaves when the computation is finished. The Master owns a particular MetaOptimizer subclass. The MetaOptimizer subclass is aware of the particular computation; the Master only knows how to communicate with Slaves and has no responsibility for determining the breakdown of a computation into parts or its completion.<br />
<br />
Slave: One of the processing nodes, part of the infrastructure. Each Slave manages a RegistrationWorker (SHOULD THIS BE MORE GENERAL THAN REGISTRATION?) subclass. Slaves announce their presence to the Master, receive a JobDefinition describing a particular job, pass the JobDefinition on to the RegistrationWorker to run the job, receive back a Results upon completion of the job, and send the Results back to the Master. Slaves are intended to run multiple rounds of Registration during a particular computational run.<br />
<br />
=== Infrastructure Event C++ classes ===<br />
<br />
MasterEvent & SlaveEvent (named for the origin of the Event) are Infrastructure Event classes, and each holds a Message Wrapper class as a payload.<br />
<br />
The idea behind the job message wrapper classes was to make message wrappers that know how to bi-directionally convert their specific named parameters to array buffers, thus giving them a pseudo-serialization ability without their having to be aware of MPI specifics. Each message wrapper class is associated with a specific Infrastructure Event class (SlaveEvents own a Results, and MasterEvents own a JobDefinition). The Infrastructure Event classes take care of sending from Master classes to Slave classes, and from Slave classes to Master classes.<br />
<br />
The MetaOptimizer subclasses will return JobDefinition subclasses to the Master, which then wraps each JobDefinition subclass in a MasterEvent, and sends the MasterEvent to a Slave. A Slave reads a MasterEvent, extracts the JobDefinition subclass, and passes the JobDefinition subclass to a RegistrationWorker subclass. Upon completion of the calculation for the JobDefinition, the RegistrationWorker returns a Results subclass to the Slave, which then wraps the Results in a SlaveEvent and sends the SlaveEvent to the Master. The Master reads the SlaveEvent, extracts the Results subclass, and sends this Results subclass to the MetaOptimizer.<br />
<br />
=== Computation Specific C++ classes ===<br />
<br />
Note: currently these classes are specific to a particular example computation; they need to have the interface abstracted out and the specific example computation extracted and moved to subclasses.<br />
<br />
<br />
<br />
JobDefinition: A Message Wrapper class that defines the parameters of a particular job, a specific subclass of JobDefinition should be used for a particular computation.<br />
<br />
Results: A Message Wrapper class that defines the return values of a particular job, a specific subclass of Results should be used for a particular computation.<br />
<br />
MetaOptimizer: The MetaOptimizer subclass for a particular computation takes care of determining the parameters for a particular job (in the form of a JobDefinition), and determining how many JobDefinitions will be in a round of computation. The MetaOptimizer then takes in the Results from each JobDefinition, and upon receiving Results for each JobDefinition in a round, can either determine a new round of computation by recalculating the new set of parameters for the new round, or else terminate the computation.<br />
<br />
RegistrationWorker: The RegistrationWorker subclass for a particular computation will take in a particular JobDefinition subclass germane to the computation, perform the actual Registration computation, then return the results in the form of a specific Results subclass that is relevant to the computation.</div>Michael.grauerhttps://public.kitware.com/Wiki/index.php?title=Proposals:Condor&diff=41108Proposals:Condor2011-06-23T18:03:01Z<p>Michael.grauer: /* Computation Specific C++ classes */</p>
<hr />
<div>
Condor can be installed as a ''manager'' node, an ''execute'' node, or a ''submit'' node, or any combination of these. See [http://www.cs.wisc.edu/condor/manual/v7.2/3_1Introduction.html#SECTION00411000000000000000 The Different Roles a Machine Can Play]<br />
<br />
* Manager: There can be only one central manager for your pool. This machine is the collector of information, and the negotiator between resources and resource requests.<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
''' The official instructions on how to install Condor in Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here] '''. Below we present some of the tweaks we had to do to get it to work on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.). If it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a similar command to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or with a user having equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (For example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation scripts ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update PATH variable to include the directory ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save file and apply the change by running <br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out / log in or even restart the machine, and you should be able to check that the CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit the Condor manager's config file and update the lines as shown below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, in order to allow Midas to run Condor commands, create a link to /root/condor/etc/condor_config in /home/condor<br />
 cd /home/condor<br />
 ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring an Executer/Submitter in Unix ===<br />
The files allowing the server to also be used as a Condor submitter/executer were automatically updated when the installation script ''condor_install'' ran. Nevertheless, you still need to update the configuration file.<br />
<br />
* Edit the Condor node's config_file.local and update the line as shown below:<br />
vi /home/condor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the lines containing ''UID_DOMAIN'' and ''FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor in Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor in Windows 7.<br />
<br />
# Download the Windows install MSI, run it, installing to "C:/condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager ( full address ).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For accounting domain enter your domain (e.g. yourdomaininternal.com)<br />
# For Email settings (I ignored this by clicking next)<br />
# for Java settings (I ignored this as we weren't using Java, by clicking next)<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked if you want a custom install or install, choose install. This will install condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When running Condor commands, start a Cygwin or cmd prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix up some problems. My '''condor_status''' at first gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect (since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or perhaps just an IP that you would not want).<br />
<br />
<br />
I shut down condor, right clicked on C:/condor in a Windows explorer, turned off "read only" and set permissions to allow for writing. Then I edited the file "c:/condor/condor_config.local" which started out empty, so that it could pick up some replacement values (some of them didn't seem to be set properly during the install). These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following:<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourinternaldomain.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
and then restarted condor. If you run '''condor_status''' and see that the Windows machine has Activity=OWNER rather than UNCLAIMED, be sure that you have added START=True. But this may not be the best configuration for a Windows workstation that is in use; some additional configuration is probably needed to make sure a Condor job doesn't use the machine while a physical human user is there using it.<br />
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here is the batch file contents for the actual job (printname.bat)<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, get a command prompt or Cygwin terminal, right click the icon to start it up, and click "run with elevated privileges" or "run as administrator".<br />
<br />
condor_master runs as a service on Windows, which controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows for the first time, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* run the condor manager <br />
condor_master<br />
<br />
* Assuming that during installation you set the type to ''manager,execute,submit'' (the default), run the following command<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after you started condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Set up Condor to start automatically at boot<br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update the MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add the ''condor.boot'' service to all runlevels<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
#include <unistd.h><br />
#include <stdio.h><br />
<br />
int main( int argc, char** argv )<br />
{<br />
  /* Echo the first command line argument, if one was given. */<br />
  if ( argc > 1 )<br />
    {<br />
    printf( "%s\n", argv[1] );<br />
    fflush( stdout );<br />
    }<br />
  sleep( 30 );<br />
  return 0;<br />
}<br />
<br />
This program echoes the command line argument it is given, waits 30 seconds, then exits.<br />
<br />
Save this file as foo.c, then compile it (note the static linking):<br />
<br />
gcc foo.c -o foo --static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once so you can see how Condor will execute the job on multiple execute resources), you can <br />
change the condorjob file to be like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
<br />
We had a case with 6 slots: 2 were 32-bit with Arch=INTEL, and 4 were 64-bit with Arch=X86_64 (but we were unaware of the difference at first). We ran 6 jobs and wondered why they would only execute on the submitting machine. So we changed the condorjob file to specify a certain architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told Condor to execute only on machines with an Architecture of INTEL, so it did not attempt to execute on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. We've included this story in case it helps with debugging.<br />
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was ''condor''.<br />
*When building BatchMake, you need to build with grid support enabled.<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows the active jobs that you have submitted to Condor. It will give you a cluster ID and process ID for each job.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process starts and stops all other Condor daemons. condor_master writes a MasterLog and has a .master_address file. Be sure that .master_address contains the correct IP address. If it does not, set the correct value with the parameter <br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has; note that Condor may have picked a different IP than you wanted, perhaps the local loopback address 127.0.0.1, or a wireless adapter instead of the Ethernet adapter.<br />
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes, and holds the data about submitted jobs. It tracks the queueing of jobs and tries to get resources for all of its jobs to be run. When a submitted job runs, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine when an actual execution of the job is run. It takes care of system calls that need to be executed on the submitting machine for a process. There is one condor_shadow process for each executing process of a submission machine, so on a machine with a large number of submitted processes, the memory or other resources needed to support the shadow daemons can become a limitation.<br />
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution. Log files of interest include NegotiatorLog and MatchLog.<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon detects user activity on an execute node, so Condor knows whether to allow execution of a job or to disallow it because a human user is currently engaged in some task.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, corresponding to [http://en.wikipedia.org/wiki/IA-64 IA64], the 64-bit Intel Itanium processor. This does not cover all 64-bit Intel processors. <br />
<br />
While trying to run ''condor_master'', the shell returned the error message ''cannot execute binary file''.<br />
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], it is possible to extract the header of an executable and determine whether it can run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-6'''4<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the different outputs shows that the architecture '''Intel IA-64''' is not the right one.<br />
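The same check can be done programmatically. The sketch below is ours, not part of the original setup: it decodes the e_machine field, the 16-bit value at byte offset 18 of a little-endian ELF header, which is the field the loader compares before refusing a binary built for another architecture.<br />

```cpp
#include <cstdint>
#include <vector>

// e_machine values from the ELF specification.
const std::uint16_t EM_IA_64  = 50;  // Intel Itanium (IA-64)
const std::uint16_t EM_X86_64 = 62;  // AMD x86-64

// Decode e_machine from the leading bytes of a little-endian ELF file.
// The field is the 16-bit little-endian value at byte offset 18.
std::uint16_t ElfMachine(const std::vector<unsigned char>& header)
{
  return static_cast<std::uint16_t>(header[18] | (header[19] << 8));
}
```

Feeding it the first 20 bytes of ''condor_master'' would distinguish the IA-64 package from the x86-64 one.<br />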
<br />
<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the location of the MPICH2\bin directory to your system path, and add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
The following was performed in a Windows CMD prompt run with Administrative Privileges (right click on the CMD executable and run it as an Administrator).<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's Windows account password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
It will ask you for the smpd authentication passphrase; this is the '''mypassphrase''' you entered above. If all is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processes since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines, and how many processes each machine should run, and the executable '''hostname''', which should return the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
<br />
int main(int argc, char* argv[])<br />
{<br />
  // initialize the MPI world<br />
  MPI::Init(argc, argv);<br />
  // get this process's rank<br />
  int rank = MPI::COMM_WORLD.Get_rank();<br />
  // get the total number of processes in the computation<br />
  int size = MPI::COMM_WORLD.Get_size();<br />
  // print out where this process ranks in the total<br />
  std::cout << "I am " << rank << " out of " << size << std::endl;<br />
  // Finalize the MPI world<br />
  MPI::Finalize();<br />
  return 0;<br />
}<br />
<br />
To compile this in MSVSE08, you must right click on the project file (not the solution file), then click properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then to run your exe on two different machines (I had to build a RELEASE build for this; at first I had built a DEBUG build, but this caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
<br />
== MPICH2 and Condor on Windows ==<br />
<br />
To configure MPI over Condor, I set up my Windows 7 laptop as a dedicated Condor scheduler and executer, following the instructions at this link:<br />
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
I turned off smpd.exe as a service, since Condor will manage it:<br />
<br />
smpd.exe -remove<br />
<br />
<br />
I then created the following MPI example program, and ran it with a MachineCount of 2 and of 4 at different times (I ran this on my laptop, which has 4 cores and Condor running locally). The program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.<br />
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
<br />
int main(int argc, char *argv[])<br />
{<br />
  MPI::Init(argc, argv);<br />
<br />
  int rank = MPI::COMM_WORLD.Get_rank();<br />
  int size = MPI::COMM_WORLD.Get_size();<br />
  char processor_name[MPI_MAX_PROCESSOR_NAME];<br />
  int namelen;<br />
  MPI::Get_processor_name(processor_name, namelen);<br />
<br />
  int buf[1];<br />
  int numElements = 1;<br />
  // initialize buffer with rank<br />
  buf[0] = rank;<br />
  // example of synchronous communication:<br />
  // have even ranks send to odd ranks first<br />
  // (assumes an even number of processes)<br />
  int syncTag = 123;<br />
  if (rank % 2 == 0)<br />
  {<br />
    // send to next higher rank<br />
    int dest = rank + 1;<br />
    std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
    MPI::COMM_WORLD.Send(buf, numElements, MPI::INT, dest, syncTag);<br />
    // now wait for dest to send back;<br />
    // dest is used as the source in the Recv call<br />
    MPI::COMM_WORLD.Recv(buf, numElements, MPI::INT, dest, syncTag);<br />
    std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
  }<br />
  else<br />
  {<br />
    // source is next lower rank<br />
    int source = rank - 1;<br />
    std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
    MPI::COMM_WORLD.Recv(buf, numElements, MPI::INT, source, syncTag);<br />
    std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
    // send rank rather than the buffer, just to be sure;<br />
    // source is used as the destination in the Send call<br />
    MPI::COMM_WORLD.Send(&rank, numElements, MPI::INT, source, syncTag);<br />
  }<br />
  // now for asynchronous communication<br />
  std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
  int asyncTag = 321;<br />
  int leftRank;<br />
  if (rank == 0)<br />
  {<br />
    leftRank = size - 1;<br />
  }<br />
  else<br />
  {<br />
    leftRank = rank - 1;<br />
  }<br />
  int rightRank;<br />
  if (rank == size - 1)<br />
  {<br />
    rightRank = 0;<br />
  }<br />
  else<br />
  {<br />
    rightRank = rank + 1;<br />
  }<br />
  // everyone sends to leftRank and receives from rightRank;<br />
  // if this were synchronous, we would be in danger of deadlocking<br />
  // because of circular waits.<br />
  // use a separate send buffer so the pending Isend and the Irecv<br />
  // do not share storage<br />
  int sendBuf[1];<br />
  sendBuf[0] = rank;<br />
  buf[0] = rank;<br />
  std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
  MPI::Request sendReq = MPI::COMM_WORLD.Isend(sendBuf, numElements, MPI::INT, leftRank, asyncTag);<br />
  MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf, numElements, MPI::INT, rightRank, asyncTag);<br />
  // wait on the receipt<br />
  MPI::Status status;<br />
  recvReq.Wait(status);<br />
  std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
  // complete the send before finalizing<br />
  sendReq.Wait();<br />
<br />
  MPI::Finalize();<br />
  return 0;<br />
}<br />
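As an aside, the left/right neighbor bookkeeping above can be written more compactly with modular arithmetic; the helper names below are ours, not part of the program:<br />

```cpp
// Ring neighbors for ranks 0..size-1, equivalent to the if/else blocks
// above: the left neighbor of rank 0 wraps to size-1, and the right
// neighbor of rank size-1 wraps to 0.
int LeftOf(int rank, int size)  { return (rank + size - 1) % size; }
int RightOf(int rank, int size) { return (rank + 1) % size; }
```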
<br />
<br />
I then created a two machine Condor parallel universe grid, using the same user/password combination on both machines; this user had Administrative rights. My 4-core Windows 7 Professional laptop (X86_64,WINNT61) was the '''CONDOR_HOST''' and the '''DedicatedScheduler'''. Assuming this laptop is called '''clavicle''' in the '''shoulder.com''' domain, here are the important configuration parameters in its '''condor_config.local''' file: <br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files (x86)\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD, COLLECTOR, NEGOTIATOR, SMPD_SERVER<br />
<br />
<br />
The second machine, a 2 core Windows XP desktop called '''scapula''' (INTEL,WINNT51) was a dedicated machine with the following important parameters in its '''condor_config.local''' file:<br />
<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, STARTD, SMPD_SERVER<br />
<br />
Note how it points to '''clavicle''' both as '''CONDOR_HOST''' and as the '''DedicatedScheduler''', and that it does not have '''(x86)''' in its path to '''smpd.exe'''.<br />
<br />
MPICH2 is installed to different locations: '''C:\Program Files\MPICH2''' on an (INTEL,WINNT51) machine, and '''C:\Program Files (x86)\MPICH2''' on a (X86_64,WINNT61) machine. Because I needed to refer to the '''mpiexec.exe''' path in the batch file that my Condor submit file referred to, and this batch file would be the same for both machines, I opted to copy '''mpiexec.exe''' to the work directory I was using on all machines, '''C:\mpich2work'''.<br />
<br />
After setting up this configuration, I rebooted both machines to start in this configuration. At this point, the command<br />
net stop condor<br />
no longer returns correctly, since Condor is managing the '''smpd.exe''' service now and does not know how to stop that service.<br />
<br />
<br />
<br />
Based on the Condor submit file described in the Condor wiki page, I had to adapt my own submit file for my above MPI executable called '''mpiwork.exe''' to be the following:<br />
<br />
<br />
universe = parallel<br />
executable = mp2script.bat<br />
arguments = mpiwork.exe<br />
machine_count = 6<br />
output = out.$(NODE).log<br />
error = error.$(NODE).log<br />
log = work.$(NODE).log<br />
should_transfer_files = yes<br />
when_to_transfer_output = on_exit<br />
transfer_input_files = mpiwork.exe<br />
Requirements = ((Arch == "INTEL") && (OpSys == "WINNT51")) || ((Arch == "X86_64") && (OpSys == "WINNT61"))<br />
queue<br />
<br />
and my '''mp2script.bat''' file was adapted to be: <br />
<br />
set _CONDOR_PROCNO=%_CONDOR_PROCNO%<br />
set _CONDOR_NPROCS=%_CONDOR_NPROCS%<br />
REM If not the head node, just sleep forever<br />
if not [%_CONDOR_PROCNO%] == [0] copy con nothing<br />
REM Set this to the bin directory of MPICH installation<br />
set MPDIR="C:\mpich2work"<br />
REM run the actual mpijob<br />
%MPDIR%\mpiexec.exe -n %_CONDOR_NPROCS% -p 6666 %*<br />
exit 0<br />
<br />
with the change being that '''MPDIR''' is a location that is known to have '''mpiexec.exe''' on both machines, since I copied that file there on both machines as described above.<br />
<br />
A point of confusion: I would have expected the output of the '''mpiwork.exe''' code above to contain the names of both '''clavicle''' and '''scapula''' when I ran it on 6 nodes; instead only '''clavicle''' appeared. I do believe the job was running on both machines and all 6 cores, as both machines' IP addresses were noted in work.parallel.log, and both machines had CPU activity while the executable ran.<br />
<br />
<br />
== MPI Based Meta-Optimization Framework ==<br />
<br />
=== High Level Architecture ===<br />
<br />
The system was designed for Infrastructure reuse. The infrastructure classes take care of organization of the computation nodes and communication between the computation nodes. When a new MetaOptimization is needed, the client programmer should create subclasses of MetaOptimizer, JobDefinition, Results, and RegistrationWorker specific to the computational problem.<br />
<br />
=== MPI Aware Infrastructure C++ classes ===<br />
<br />
MasterSlave: Entry point of the application, initializes the MPI infrastructure, starts up a Master (sending it a particular MetaOptimizer subclass), starts up as many slave nodes as are necessary (sending each of them a particular RegistrationWorker subclass), and finalizes the MPI infrastructure when the computation is complete.<br />
<br />
Master: The central controlling part of the infrastructure. It manages the available Slaves, sending them JobDefinitions as needed, collects Results from the Slaves as they come in, will re-run (NOT YET IMPLEMENTED) a JobDefinition if it has failed or timed out, and shuts down all Slaves when the computation is finished. The Master owns a particular MetaOptimizer subclass. The MetaOptimizer subclass is aware of the particular computation; the Master only knows how to communicate with Slaves and has no responsibility for determining the breakdown of a computation into parts or its completion.<br />
<br />
Slave: One of the processing nodes, part of the infrastructure. Each slave manages a RegistrationWorker (SHOULD THIS BE MORE GENERAL THAN REGISTRATION?) subclass. Slaves announce their presence to the Master, receive a JobDefinition describing a particular job, pass the JobDefinition on to the RegistrationWorker to run the job, receive back a Results object upon completion of the job, and send the Results back to the Master. Slaves are intended to run multiple rounds of Registration during a particular computational run.<br />
<br />
=== Infrastructure Event C++ classes ===<br />
<br />
MasterEvent & SlaveEvent (named for the origin of the Event) are Infrastructure Event classes, and each holds a Message Wrapper class as a payload.<br />
<br />
The idea behind the message wrapper classes is that each knows how to bi-directionally convert its specific named parameters to array buffers, giving it a pseudo-serialization ability without being aware of MPI specifics. Each message wrapper class is associated with a specific Infrastructure Event class (SlaveEvents own a Results, and MasterEvents own a JobDefinition). The Infrastructure Event classes take care of sending from Master classes to Slave classes, and from Slave classes to Master classes.<br />
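A minimal sketch of that pseudo-serialization idea, with invented field names (the real classes and parameters differ):<br />

```cpp
#include <vector>

// Toy message wrapper: named parameters <-> flat buffer, with no
// knowledge of MPI. The Infrastructure Event classes would ship the
// buffer; the wrapper only packs and unpacks it.
struct ToyJobDefinition
{
  int jobId = 0;
  double stepSize = 0.0;
  double tolerance = 0.0;

  // Pack the named parameters into a flat array.
  std::vector<double> ToBuffer() const
  {
    return { static_cast<double>(jobId), stepSize, tolerance };
  }

  // Rebuild the named parameters from a received buffer.
  static ToyJobDefinition FromBuffer(const std::vector<double>& buf)
  {
    ToyJobDefinition jd;
    jd.jobId = static_cast<int>(buf[0]);
    jd.stepSize = buf[1];
    jd.tolerance = buf[2];
    return jd;
  }
};
```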
<br />
The MetaOptimizer subclasses return JobDefinition subclasses to the Master, which wraps each JobDefinition subclass in a MasterEvent and sends the MasterEvent to a Slave. A Slave reads a MasterEvent, extracts the JobDefinition subclass, and passes it to a RegistrationWorker subclass. Upon completion of the calculation for the JobDefinition, the RegistrationWorker returns a Results subclass to the Slave, which wraps the Results in a SlaveEvent and sends the SlaveEvent to the Master. The Master reads the SlaveEvent, extracts the Results subclass, and passes the Results subclass to the MetaOptimizer.<br />
<br />
=== Computation Specific C++ classes ===<br />
<br />
Note: currently these classes are specific to a particular example computation; the interface needs to be abstracted out and the example-specific code moved to subclasses.<br />
<br />
<br />
<br />
JobDefinition: A Message Wrapper class that defines the parameters of a particular job, a specific subclass of JobDefinition should be used for a particular computation.<br />
<br />
Results: A Message Wrapper class that defines the return values of a particular job, a specific subclass of Results should be used for a particular computation.<br />
<br />
MetaOptimizer: The MetaOptimizer subclass for a particular computation takes care of determining the parameters for a particular job (in the form of a JobDefinition), and determining how many JobDefinitions will be in a round of computation. The MetaOptimizer then takes in the Results from each JobDefinition, and upon receiving Results for each JobDefinition in a round, can either determine a new round of computation by recalculating the new set of parameters for the new round, or else terminate the computation.<br />
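The round-based control flow described above can be sketched without MPI; everything below (types, names, the toy objective) is invented for illustration:<br />

```cpp
#include <cstddef>
#include <vector>

// Hypothetical toy stand-ins for JobDefinition/Results: one double each.
struct JobDefinition { double param; };
struct Results { double value; };

// Toy MetaOptimizer: one round of jobs; keeps the best (smallest)
// result seen. A sketch of the control flow only, not the real interface.
class ToyMetaOptimizer
{
public:
  explicit ToyMetaOptimizer(std::vector<double> params)
  {
    for (double p : params) jobs_.push_back(JobDefinition{p});
  }
  const std::vector<JobDefinition>& Round() const { return jobs_; }
  void Accept(const Results& r)
  {
    ++received_;
    if (r.value < best_) best_ = r.value;
  }
  bool RoundComplete() const { return received_ == jobs_.size(); }
  double Best() const { return best_; }
private:
  std::vector<JobDefinition> jobs_;
  std::size_t received_ = 0;
  double best_ = 1e30;
};

// A toy "worker": evaluate (param - 3)^2, the quantity being minimized.
Results RunJob(const JobDefinition& jd)
{
  double d = jd.param - 3.0;
  return Results{d * d};
}

// One synchronous round, playing the Master's role without MPI.
double RunOneRound(ToyMetaOptimizer& opt)
{
  for (const JobDefinition& jd : opt.Round()) opt.Accept(RunJob(jd));
  return opt.Best();
}
```

In the real system the inner loop is replaced by MasterEvents sent to Slaves and SlaveEvents coming back, and the MetaOptimizer may emit further rounds instead of stopping after one.<br />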
<br />
RegistrationWorker: The RegistrationWorker subclass for a particular computation will take in a particular JobDefinition subclass germane to the computation, perform the actual Registration computation, then return the results in the form of a specific Results subclass that is relevant to the computation.</div>
<hr />
<div>
Condor can be installed as a ''manager'' node, an ''execute'' node, or a ''submit'' node, or any combination of these. See [http://www.cs.wisc.edu/condor/manual/v7.2/3_1Introduction.html#SECTION00411000000000000000 The Different Roles a Machine Can Play]<br />
<br />
* Manager: There can be only one central manager for your pool. This machine is the collector of information and the negotiator between resources and resource requests.<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer to the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
''' The official instructions on how to install Condor on Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here] '''. Below we present some of the tweaks we had to do to get it to work on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.); if it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
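The same check can be scripted; the following is a small sketch (not part of Condor itself) using only the Python standard library:<br />
<br />
```python
import socket

def has_fqdn():
    """Return True if this host reports a fully qualified domain name."""
    name = socket.getfqdn()
    # A fully qualified name contains at least one dot, e.g. mymachine.mydomain.com
    return "." in name

if __name__ == "__main__":
    suffix = "looks fully qualified" if has_fqdn() else "has no domain part"
    print(socket.getfqdn(), suffix)
```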
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a command similar to the following to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or as a user with equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (For example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation script ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update the PATH variable to include the directories ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save the file and apply the change by running<br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out and back in, or even restart the machine, and check that the CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit the condor manager config file and update the lines shown below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, in order to allow Midas to run condor commands, create a link to /root/condor/etc/condor_config in /home/condor<br />
cd /home/condor<br />
 ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring an Executer/Submitter in Unix ===<br />
The files allowing the server to also be used as a condor submitter/executer were automatically updated when the installation script ''condor_install'' ran. Nevertheless, you still need to update its configuration file.<br />
<br />
* Edit the condor node config_file.local and update the line shown below:<br />
vi /home/condor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the lines setting ''UID_DOMAIN'' and ''FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor in Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor in Windows 7.<br />
<br />
# Download the Windows install MSI, run it, installing to "C:/condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager ( full address ).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For the accounting domain, enter your domain (e.g. yourdomaininternal.com)<br />
# For the Email settings, I skipped this step by clicking next<br />
# For the Java settings, I skipped this step by clicking next, as we weren't using Java<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked whether you want a custom install or a default install, choose the default install. This will install condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When running condor commands, start a Cygwin or cmd prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix some problems. My '''condor_status''' at first returned "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect: since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. Your IP address might incorrectly be a localhost loopback address such as 127.0.0.1, or simply an IP that you do not want.<br />
<br />
<br />
I shut down condor, right clicked on C:/condor in a Windows explorer, turned off "read only" and set permissions to allow for writing. Then I edited the file "c:/condor/condor_config.local" which started out empty, so that it could pick up some replacement values (some of them didn't seem to be set properly during the install). These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following:<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourinternaldomain.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
Then restart condor. If you run '''condor_status''' and see that the Windows machine has Activity=OWNER rather than UNCLAIMED, be sure that you have added START=True. This may not be the best configuration for a Windows workstation that is in active use, however: additional configuration is needed to make sure a Condor job does not use the machine while a physical human user is there using it.<br />
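One possible starting point for such a policy is to key the job-control expressions off the startd's KeyboardIdle attribute. The thresholds below are illustrative only, and we have not tested this sketch on our own pool:<br />
<br />
 START    = KeyboardIdle > (15 * 60)<br />
 SUSPEND  = KeyboardIdle < 60<br />
 CONTINUE = KeyboardIdle > (5 * 60)<br />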
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here are the contents of the batch file for the actual job (printname.bat)<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the '''printname.sub''' condor submission file, which I submitted with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, open a command prompt or Cygwin terminal with elevated privileges: right-click the icon used to start it and choose "run as administrator".<br />
<br />
condor_master runs as a service on Windows, which controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit your first job to Condor on Windows, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* run the condor manager <br />
condor_master<br />
<br />
* Assuming that during installation you set the machine type to ''manager,execute,submit'' (the default), run the following command<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after you started condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Set up condor to start automatically<br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update the MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add the ''condor.boot'' service to all runlevels<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
#include <unistd.h><br />
#include <stdio.h><br />
int main( int argc, char** argv )<br />
{<br />
printf( "%s\n", argv[1] );<br />
fflush( stdout );<br />
sleep( 30 );<br />
return 0; <br />
}<br />
<br />
This executable prints the command line argument it is given, waits 30 seconds, then exits.<br />
<br />
Save this file as foo.c, then compile it (note the static linking):<br />
<br />
gcc foo.c -o foo --static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once so you can see how Condor will execute the job on multiple execute resources), you can <br />
change the condorjob file to be like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
<br />
We had a case where we had 6 slots, 2 were 32 bit with Arch=INTEL, 4 were 64 bit with Arch=X86_64 (but we were unaware of the bit difference at first). We ran 6 jobs and then were wondering why they would only execute on the submitting machine. So we changed the condorjob file to specify a certain architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told condor to execute only on machines with the INTEL architecture, so the job would not run on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. I've included this story in case it helps with debugging.<br />
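To avoid this class of problem, you can pin both the architecture and the operating system in the submit file, using the same attribute names that condor_status reports; for example:<br />
<br />
 Requirements = (Arch == "X86_64") && (OpSys == "LINUX")<br />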
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was ''condor''.<br />
*When building BatchMake, you need to build with grid support enabled<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows the jobs you have submitted to condor that are still active. It will give you a cluster ID and process ID for each job.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process will start and stop all other condor daemons. condor_master will write a MasterLog and have a .master_address file. Be sure that the .master_address contains the correct IP address. If the correct IP address doesn't appear, set the correct value in the parameter <br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has, but condor may have picked a different IP than you would want, perhaps the local loopback address 127.0.0.1 or a wireless adapter instead of the ethernet one.<br />
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes, and holds data about submitted jobs. It tracks the queueing of jobs and tries to obtain resources for all of its jobs to be run. When a submitted job runs, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine when an actual execution of the job is run. It takes care of system calls that need to be executed on the submitting machine for a process. There will be one condor_shadow process for each executing job of a submission machine, meaning that on a machine with a large number of submitted jobs, the number of shadow daemons that memory or other resources can support could become a limitation.<br />
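As a rough illustration of that limit, here is a back-of-envelope sketch; the per-shadow memory figure is a made-up placeholder, not a measured value, so measure condor_shadow on your own submit machine before relying on a number like this:<br />
<br />
```python
def max_concurrent_jobs(free_mem_mb, shadow_mem_mb=10.0):
    """Rough ceiling on simultaneously running jobs from one submit
    machine, given one condor_shadow process per running job.
    shadow_mem_mb (10 MB) is an illustrative placeholder."""
    return int(free_mem_mb // shadow_mem_mb)

# With 2 GB free and an assumed 10 MB per shadow:
print(max_concurrent_jobs(2048))  # -> 204
```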
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution. Log files of interest include NegotiatorLog and MatchLog<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon is used to detect user activity on an execute node, so Condor can decide whether to allow execution of a job or to disallow it because a human user is currently engaged in some task.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, corresponding to [http://en.wikipedia.org/wiki/IA-64 IA64], the 64-bit Intel Itanium processor. This architecture does not cover all 64-bit Intel processors.<br />
<br />
While trying to run ''condor_master'', the shell returned the error message ''cannot execute binary file''.<br />
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], it's possible to extract the header of an executable and determine whether it can run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
 Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the different outputs, it's possible to observe that the architecture '''Intel IA-64''' isn't the right one for this machine.<br />
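This check can also be automated without readelf; the following sketch reads the e_machine field directly from the ELF header (it assumes a little-endian ELF file, as in the outputs above; the two machine codes are the standard ELF values for these architectures):<br />
<br />
```python
import struct

# Standard ELF e_machine codes for the architectures discussed above
EM_IA_64 = 50    # Intel Itanium (IA-64)
EM_X86_64 = 62   # AMD/Intel x86-64

def elf_machine(path):
    """Return the e_machine value of an ELF file, or None if not ELF."""
    with open(path, "rb") as f:
        header = f.read(20)
    if header[:4] != b"\x7fELF":
        return None
    # e_machine is a 16-bit little-endian field at offset 18 of the header
    return struct.unpack_from("<H", header, 18)[0]

machine = elf_machine("/bin/ls")
if machine == EM_IA_64:
    print("Intel IA-64")
elif machine == EM_X86_64:
    print("Advanced Micro Devices X86-64")
else:
    print("other machine code:", machine)
```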
<br />
<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the location of the MPICH2\bin directory to your system path. Add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
The following was performed in a Windows CMD prompt run with Administrative Privileges (right click on the CMD executable and run it as an Administrator).<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's Windows account password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication password for smpd; this is the '''mypassphrase''' you entered above. If all is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processors since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines, and how many processes each machine should run, and the executable '''hostname''', which should return the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, right-click on the project file (not the solution file), click Properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then, to run your exe on two different machines (I had to build a RELEASE build for this; at first I had built a DEBUG build, which caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
<br />
== MPICH2 and Condor on Windows ==<br />
<br />
To configure MPI over Condor, I set my Windows 7 laptop up as a dedicated Condor scheduler and executer, following the instructions at this link:<br />
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
I turned off smpd.exe as a service, since Condor will manage it:<br />
<br />
smpd.exe -remove<br />
<br />
<br />
I then created the following MPI example program, and ran this with a MachineCount of 2 and 4 at different times (I ran this on my laptop, which has 4 cores, and had Condor running locally) . This program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.<br />
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char *argv[])<br />
{<br />
MPI::Init(argc, argv);<br />
//<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
int size = MPI::COMM_WORLD.Get_size();<br />
char processor_name[MPI_MAX_PROCESSOR_NAME];<br />
 int namelen;<br />
 MPI::Get_processor_name(processor_name,namelen);<br />
// <br />
int buf[1];<br />
int numElements = 1;<br />
// initialize buffer with rank<br />
buf[0] = rank;<br />
// example of synchronous communication<br />
// have even ranks send to odd ranks first<br />
// assumes an even number of processes<br />
int syncTag = 123;<br />
if (rank % 2 == 0)<br />
{<br />
// send to next higher rank<br />
int dest = rank + 1;<br />
std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Send(buf,numElements,MPI::INT,dest,syncTag);<br />
// now wait for dest to send back<br />
// even though dest is the param, in the Recv call it is used as the source<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,dest,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
}<br />
else<br />
{<br />
// source is next lower rank<br />
int source = rank - 1;<br />
std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,source,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
// send rank rather than buffer, just to be sure<br />
// even though param is source, in Send call it is for dest<br />
MPI::COMM_WORLD.Send(&rank,numElements,MPI::INT,source,syncTag);<br />
}<br />
// now for asynchronous communication<br />
std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
int asyncTag = 321;<br />
int leftRank;<br />
if (rank == 0)<br />
{<br />
leftRank = size-1;<br />
}<br />
else <br />
{<br />
leftRank = (rank-1);<br />
}<br />
int rightRank;<br />
if (rank == size-1)<br />
{<br />
rightRank = 0;<br />
}<br />
else <br />
{<br />
rightRank = (rank+1);<br />
}<br />
// everyone sends to the leftRank, and receives from the rightRank<br />
// if this were synchronous, would be in danger of deadlocking because of circular waits<br />
// reset buffers to rank<br />
buf[0] = rank;<br />
std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::Request sendReq = MPI::COMM_WORLD.Isend(buf,numElements,MPI::INT,leftRank,asyncTag);<br />
MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf,numElements,MPI::INT,rightRank,asyncTag);<br />
// wait on the receipt<br />
MPI::Status status;<br />
recvReq.Wait(status);<br />
std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
//<br />
MPI::Finalize();<br />
return(0);<br />
}<br />
<br />
<br />
I then created a two-machine Condor parallel universe grid, using the same user/password combination on both machines; this user had Administrative rights. My 4-core Windows 7 Professional laptop (X86_64,WINNT61) was the '''CONDOR_HOST''' and the '''DedicatedScheduler'''. Assuming this laptop is called '''clavicle''' in the '''shoulder.com''' domain, here are the important configuration parameters in its '''condor_config.local''' file: <br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files (x86)\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD, COLLECTOR, NEGOTIATOR, SMPD_SERVER<br />
<br />
<br />
The second machine, a 2-core Windows XP desktop called '''scapula''' (INTEL,WINNT51), was a dedicated machine with the following important parameters in its '''condor_config.local''' file:<br />
<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, STARTD, SMPD_SERVER<br />
<br />
Note how it points to '''clavicle''' both as '''CONDOR_HOST''' and as the '''DedicatedScheduler''', and that it does not have '''(x86)''' in its path to '''smpd.exe'''.<br />
<br />
MPICH2 gets installed to different locations, to '''C:\Program Files\MPICH2''' for an (INTEL,WINNT51) machine, and to '''C:\Program Files (x86)\MPICH2''' for a (X86_64,WINNT61) machine. Because I needed to refer to the '''mpiexec.exe''' path in the batch file that my Condor submit file referred to, and this batch file would be the same for both machines, I opted to copy '''mpiexec.exe''' to the work directory I was using on all machines, '''C:\mpich2work'''.<br />
<br />
After setting up this configuration, I rebooted both machines to start in this configuration. At this point, the command<br />
net stop condor<br />
no longer returns correctly, since Condor is managing the '''smpd.exe''' service now and does not know how to stop that service.<br />
<br />
<br />
<br />
Based on the Condor submit file described in the Condor wiki page, I had to adapt my own submit file for my above MPI executable called '''mpiwork.exe''' to be the following:<br />
<br />
<br />
universe = parallel<br />
executable = mp2script.bat<br />
arguments = mpiwork.exe<br />
machine_count = 6<br />
output = out.$(NODE).log<br />
error = error.$(NODE).log<br />
log = work.$(NODE).log<br />
should_transfer_files = yes<br />
when_to_transfer_output = on_exit<br />
transfer_input_files = mpiwork.exe<br />
Requirements = ((Arch == "INTEL") && (OpSys == "WINNT51")) || ((Arch == "X86_64") && (OpSys == "WINNT61"))<br />
queue<br />
<br />
and my '''mp2script.bat''' file was adapted to be: <br />
<br />
set _CONDOR_PROCNO=%_CONDOR_PROCNO%<br />
set _CONDOR_NPROCS=%_CONDOR_NPROCS%<br />
REM If not the head node, just sleep forever<br />
if not [%_CONDOR_PROCNO%] == [0] copy con nothing<br />
REM Set this to the bin directory of MPICH installation<br />
set MPDIR="C:\mpich2work"<br />
REM run the actual mpijob<br />
%MPDIR%\mpiexec.exe -n %_CONDOR_NPROCS% -p 6666 %*<br />
exit 0<br />
<br />
with the change being that '''MPDIR''' is a location that is known to have '''mpiexec.exe''' on both machines, since I copied that file there on both machines as described above.<br />
<br />
A point of confusion: I would have expected the '''mpiwork.exe''' code above to print the names of both '''clavicle''' and '''scapula''' in the output when I ran it on 6 nodes. Instead only the name '''clavicle''' appeared. I do believe the job was running on both machines and all 6 cores, as both machines' IP addresses were noted in the work.parallel.log, and both machines had CPU activity at the time the executable ran.<br />
<br />
<br />
== MPI Based Meta-Optimization Framework ==<br />
<br />
=== High Level Architecture ===<br />
<br />
The system was designed for infrastructure reuse. The infrastructure classes take care of the organization of the computation nodes and the communication between them. When a new MetaOptimization is needed, the client programmer creates subclasses of MetaOptimizer, JobDefinition, Results, and RegistrationWorker specific to the computational problem.<br />
<br />
=== MPI Aware Infrastructure C++ classes ===<br />
<br />
MasterSlave: The entry point of the application; it initializes the MPI infrastructure, starts up a Master (sending it a particular MetaOptimizer subclass), starts up as many Slave nodes as are necessary (sending each of them a particular RegistrationWorker subclass), and finalizes the MPI infrastructure when the computation is complete.<br />
<br />
Master: The central controlling part of the infrastructure. It manages the available Slaves, sending them JobDefinitions as needed, collects Results from the Slaves as they come in, will re-run (NOT YET IMPLEMENTED) a JobDefinition if it has failed or timed out, and shuts down all Slaves when the computation is finished. The Master owns a particular MetaOptimizer subclass. The MetaOptimizer subclass is aware of the particular computation; the Master only knows how to communicate with Slaves and has no responsibility for determining the breakdown of a computation into parts or its completion.<br />
<br />
Slave: One of the processing nodes, part of the infrastructure. Each Slave manages a RegistrationWorker (SHOULD THIS BE MORE GENERAL THAN REGISTRATION?) subclass. Slaves announce their presence to the Master, receive a JobDefinition describing a particular job, pass the JobDefinition on to the RegistrationWorker to run the job, receive back a Results upon completion of the job, and send the Results back to the Master. Slaves are intended to run multiple rounds of Registration during a particular computational run.<br />
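The Master/Slave round described above can be sketched in plain C++, with the MPI send/receive calls replaced by a callback so that only the control flow remains (all names here are illustrative, not the framework's actual API):<br />

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Hypothetical stand-in for a Results message coming back from a Slave.
struct ResultSketch { int jobId; double value; };

// Sketch of the Master's dispatch loop: hand out each job, collect one
// Result per job; the round is complete once every job has a Result.
template <typename RunJob>
std::vector<ResultSketch> runRound(std::queue<int> jobs, RunJob runJob) {
    std::vector<ResultSketch> results;
    // The real Master hands each JobDefinition to an idle Slave and blocks
    // on incoming SlaveEvents; here we just run jobs in turn.
    while (!jobs.empty()) {
        int jobId = jobs.front();
        jobs.pop();
        results.push_back(runJob(jobId));  // stands in for send + receive
    }
    return results;
}
```

The real Master additionally tracks which Slave is busy and would re-queue failed JobDefinitions; this sketch only shows the "dispatch until every job has a Result" shape of a round.<br />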
<br />
=== Infrastructure Event C++ classes ===<br />
<br />
MasterEvent & SlaveEvent (named for the origin of the Event) are Infrastructure Event classes, and each holds a Message Wrapper class as a payload.<br />
<br />
The idea behind the job message wrapper classes was to make message wrappers that know how to bi-directionally convert their specific named parameters to array buffers, giving them a pseudo-serialization ability without requiring them to be aware of MPI specifics. Each message wrapper class is associated with a specific Infrastructure Event class (SlaveEvents own a Results, and MasterEvents own a JobDefinition). The Infrastructure Event classes take care of sending from Master classes to Slave classes, and from Slave classes to Master classes.<br />
<br />
The MetaOptimizer subclasses return JobDefinition subclasses to the Master, which wraps each JobDefinition subclass in a MasterEvent and sends the MasterEvent to a Slave. A Slave reads a MasterEvent, extracts the JobDefinition subclass, and passes the JobDefinition subclass to a RegistrationWorker subclass. Upon completion of the calculation for the JobDefinition, the RegistrationWorker returns a Results subclass to the Slave, which wraps the Results in a SlaveEvent and sends the SlaveEvent to the Master. The Master reads the SlaveEvent, extracts the Results subclass, and sends this Results subclass to the MetaOptimizer.<br />
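The pseudo-serialization idea can be sketched as follows (a hypothetical JobDefinition with two named parameters; the class, its fields, and the flat-double buffer layout are invented for illustration and are not the framework's actual API):<br />

```cpp
#include <cassert>
#include <vector>

// Hypothetical message wrapper: converts its named parameters to/from a
// flat buffer so the MPI layer can send it without knowing its contents.
class JobDefinitionSketch {
public:
    int jobId = 0;
    double stepSize = 0.0;

    // Pack the named parameters into a raw buffer (what Isend would transmit).
    std::vector<double> toBuffer() const {
        return { static_cast<double>(jobId), stepSize };
    }

    // Rebuild the named parameters from a received buffer.
    static JobDefinitionSketch fromBuffer(const std::vector<double>& buf) {
        JobDefinitionSketch j;
        j.jobId = static_cast<int>(buf[0]);
        j.stepSize = buf[1];
        return j;
    }
};
```

Because both directions live in the wrapper itself, the Event classes only ever move opaque buffers, which is what keeps them free of per-computation knowledge.<br />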
<br />
=== Computation Specific C++ classes ===<br />
<br />
JobDefinition: A Message Wrapper class that defines the parameters of a particular job, a specific subclass of JobDefinition should be used for a particular computation.<br />
<br />
Results: A Message Wrapper class that defines the return values of a particular job, a specific subclass of Results should be used for a particular computation.<br />
<br />
MetaOptimizer: The MetaOptimizer subclass for a particular computation takes care of determining the parameters for a particular job (in the form of a JobDefinition), and determining how many JobDefinitions will be in a round of computation. The MetaOptimizer then takes in the Results from each JobDefinition, and upon receiving Results for each JobDefinition in a round, can either determine a new round of computation by recalculating the new set of parameters for the new round, or else terminate the computation.<br />
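A minimal sketch of that round logic (the minimization criterion and all names are invented; a real MetaOptimizer subclass would recalculate the job parameters for the next round rather than just track a best value):<br />

```cpp
#include <cassert>
#include <vector>

// Hypothetical MetaOptimizer skeleton: collect one Result per job in the
// round; once the round is complete, either refine and start a new round
// or declare the computation finished.
class MetaOptimizerSketch {
public:
    explicit MetaOptimizerSketch(int jobsPerRound) : jobsPerRound_(jobsPerRound) {}

    // Called by the Master for each incoming Result; returns true while
    // more computation is needed.
    bool addResult(double value) {
        received_.push_back(value);
        if (static_cast<int>(received_.size()) < jobsPerRound_) {
            return true;                  // round still in progress
        }
        double best = received_.front();
        for (double v : received_) if (v < best) best = v;
        received_.clear();
        bool improved = best < bestSoFar_;
        if (improved) bestSoFar_ = best;  // refine: start another round
        return improved;                  // no improvement => terminate
    }

    double bestSoFar() const { return bestSoFar_; }

private:
    int jobsPerRound_;
    std::vector<double> received_;
    double bestSoFar_ = 1e30;
};
```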
<br />
RegistrationWorker: The RegistrationWorker subclass for a particular computation will take in a particular JobDefinition subclass germane to the computation, perform the actual Registration computation, then return the results in the form of a specific Results subclass that is relevant to the computation.</div>
<div>Condor can be installed as a ''manager'' node, an ''execute'' node, a ''submit'' node, or any combination of these. See [http://www.cs.wisc.edu/condor/manual/v7.2/3_1Introduction.html#SECTION00411000000000000000 The Different Roles a Machine Can Play]<br />
<br />
* Manager: There can be only one central manager for your pool. This machine is the collector of information, and the negotiator between resources and resource requests.<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer to the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
'''The official instructions on how to install Condor in Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here].''' Below we present some of the tweaks we had to do to get it to work on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.) , if it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a similar command to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or with a user having equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (For example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation scripts ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update PATH variable to include the directory ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save file and apply the change by running <br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out and log back in, or even restart the machine, and check that the CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit the condor manager configuration file and update the lines referenced below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, in order to allow Midas to run condor commands, create a link to /root/condor/etc/condor_config in /home/condor<br />
 cd /home/condor<br />
 ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring an Executer/Submitter in Unix ===<br />
The files allowing the server to also be used as a condor submitter/executer were automatically updated when running the installation script ''condor_install''. Nevertheless, you still need to update its configuration file.<br />
<br />
* Edit the condor node's config_file.local and update the line referenced below:<br />
vi /home/condor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the line having ''UID_DOMAIN'' and '' FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor in Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor in Windows 7.<br />
<br />
# Download the Windows install MSI, run it, installing to "C:/condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager ( full address ).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For accounting domain enter your domain (e.g. yourdomaininternal.com)<br />
# For Email settings (I ignored this by clicking next)<br />
# for Java settings (I ignored this as we weren't using Java, by clicking next)<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked if you want a custom install or install, choose install. This will install condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When you will be running condor commands, start a cygwin or cmd prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix up some problems. My '''condor_status''' at first gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect (since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or perhaps just an IP that you would not want).<br />
<br />
<br />
I shut down condor, right clicked on C:/condor in a Windows explorer, turned off "read only" and set permissions to allow for writing. Then I edited the file "c:/condor/condor_config.local" which started out empty, so that it could pick up some replacement values (some of them didn't seem to be set properly during the install). These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following:<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourinternaldomain.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
and then restarted condor. If you run '''condor_status''' and see that the Windows machine has Activity=OWNER rather than UNCLAIMED, be sure that you have added START=True. But this may not be the best configuration for a Windows workstation that is in use; there is probably some additional configuration needed to make sure a Condor job doesn't use the machine while a physical human user is there using it.<br />
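One common approach for that (a sketch based on the standard desktop policy expressions in the Condor manual, untested in this particular setup) is to gate the START expression on keyboard idle time in condor_config.local:<br />

```
# Only start jobs after 15 minutes without keyboard/mouse activity
START = KeyboardIdle > 15 * 60
```

On Windows the condor_kbdd daemon (included in the collector's DAEMON_LIST shown above) is what reports keyboard activity to the startd.<br />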
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here is the batch file contents for the actual job (printname.bat)<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, start a command prompt or Cygwin terminal with elevated privileges: right-click the icon and choose "run with elevated privileges" or "run as administrator".<br />
<br />
condor_master runs as a service on Windows and controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows for the first time, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* run the condor manager <br />
condor_master<br />
<br />
* Assuming that during the installation process you set up the type as ''manager,execute,submit'' (the default), run the following command<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after you started condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Set up condor to start automatically at boot <br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add the ''condor.boot'' service to all runlevels<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
#include <unistd.h><br />
#include <stdio.h><br />
int main( int argc, char** argv )<br />
{<br />
printf( "%s\n", argv[1] );<br />
fflush( stdout );<br />
sleep( 30 );<br />
return 0; <br />
}<br />
<br />
This executable will echo the command-line argument it is given, wait 30 seconds, then exit.<br />
<br />
Save this file as foo.c, then compile it with (note static linking).<br />
<br />
gcc foo.c -o foo --static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once so you can see how Condor will execute the job on multiple execute resources), you can <br />
change the condorjob file to be like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
<br />
We had a case where we had 6 slots, 2 were 32 bit with Arch=INTEL, 4 were 64 bit with Arch=X86_64 (but we were unaware of the bit difference at first). We ran 6 jobs and then were wondering why they would only execute on the submitting machine. So we changed the condorjob file to specify a certain architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told condor to execute only on machines with an Architecture of INTEL, so no attempt was made to execute the job on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. I've included this story in case it helps with debugging.<br />
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was ''condor''.<br />
*When building BatchMake, you need to build with grid support on<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows you which of the jobs you have submitted to condor are active. It will give you a cluster and process ID for each process.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process will start and stop all other condor daemons. condor_master will write a MasterLog and have a .master_address file. Be sure that the .master_address contains the correct IP address. If the correct IP address doesn't appear, set the correct value in the parameter <br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has, but condor may have picked a different IP than you would have wanted, perhaps the local loopback address 127.0.0.1 or a wireless instead of ethernet adapter.<br />
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes, and represents data about submitted jobs. It tracks the queueing of jobs and tries to get resources for all of its jobs to be run. When a job submitted is run, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine when an actual execution of the job is run. It will take care of systems calls that need to be executed on the submitting machine for a process. There will be a condor_shadow process for each executing process of a submission machine, meaning that on a machine with a large number of submitted processes, the number of shadow daemons supported by memory or other resources could be a limitation.<br />
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution. Log files of interest include NegotiatorLog and MatchLog<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon is used to detect user activity on an execute node, so Condor knows whether to allow execution of a job or to disallow it because a human user is currently engaged in some task.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, corresponding to [http://en.wikipedia.org/wiki/IA-64 IA64], the 64-bit Intel Itanium architecture. That architecture does not cover all 64-bit Intel processors. <br />
<br />
While trying to run ''condor_master'', the shell returned the error message ''cannot execute binary file''<br />
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], it is possible to inspect the header of an executable and determine whether it can run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
 Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the different outputs, it's clear that the '''Intel IA-64''' architecture isn't the right one.<br />
<br />
<br />
Some practical notes:<br />
* Be sure that your executable is statically linked.<br />
* For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was ''condor''.<br />
* When building BatchMake, you need to build with grid support enabled.<br />
* Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems. Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First, install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the location of the MPICH2\bin directory to your system path, and add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
The following was performed in a Windows CMD prompt run with Administrative Privileges (right click on the CMD executable and run it as an Administrator).<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's Windows account password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication password for smpd; this is the '''mypassphrase''' you entered above. If everything is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processors since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines, and how many processes each machine should run, and the executable '''hostname''', which should return the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, you must right click on the project file (not the solution file), then click properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then, to run your exe on two different machines (I had to build a RELEASE build for this; at first I had built a DEBUG build, but this caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
<br />
== MPICH2 and Condor on Windows ==<br />
<br />
To configure MPI over Condor, I set my Windows 7 laptop up as a dedicated Condor scheduler and executor, following the instructions at this link:<br />
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
I turned off smpd.exe as a service, since Condor will manage it:<br />
<br />
smpd.exe -remove<br />
<br />
<br />
I then created the following MPI example program and ran it with a MachineCount of 2 and 4 at different times (I ran this on my laptop, which has 4 cores and Condor running locally). This program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.<br />
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char *argv[])<br />
{<br />
MPI::Init(argc, argv);<br />
//<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
int size = MPI::COMM_WORLD.Get_size();<br />
 char processor_name[MPI_MAX_PROCESSOR_NAME];<br />
 int namelen;<br />
 MPI::Get_processor_name(processor_name,namelen);<br />
// <br />
int buf[1];<br />
int numElements = 1;<br />
// initialize buffer with rank<br />
buf[0] = rank;<br />
// example of synchronous communication<br />
// have even ranks send to odd ranks first<br />
// assumes an even number of processes<br />
int syncTag = 123;<br />
if (rank % 2 == 0)<br />
{<br />
// send to next higher rank<br />
int dest = rank + 1;<br />
std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Send(buf,numElements,MPI::INT,dest,syncTag);<br />
// now wait for dest to send back<br />
// even though dest is the param, in the Recv call it is used as the source<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,dest,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
}<br />
else<br />
{<br />
// source is next lower rank<br />
int source = rank - 1;<br />
std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,source,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
// send rank rather than buffer, just to be sure<br />
// even though param is source, in Send call it is for dest<br />
MPI::COMM_WORLD.Send(&rank,numElements,MPI::INT,source,syncTag);<br />
}<br />
// now for asynchronous communication<br />
std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
int asyncTag = 321;<br />
int leftRank;<br />
if (rank == 0)<br />
{<br />
leftRank = size-1;<br />
}<br />
else <br />
{<br />
leftRank = (rank-1);<br />
}<br />
int rightRank;<br />
if (rank == size-1)<br />
{<br />
rightRank = 0;<br />
}<br />
else <br />
{<br />
rightRank = (rank+1);<br />
}<br />
// everyone sends to the leftRank, and receives from the rightRank<br />
// if this were synchronous, would be in danger of deadlocking because of circular waits<br />
// reset buffers to rank<br />
buf[0] = rank;<br />
std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::Request sendReq = MPI::COMM_WORLD.Isend(buf,numElements,MPI::INT,leftRank,asyncTag);<br />
MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf,numElements,MPI::INT,rightRank,asyncTag);<br />
// wait on the receipt<br />
MPI::Status status;<br />
recvReq.Wait(status);<br />
std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
//<br />
MPI::Finalize();<br />
return(0);<br />
}<br />
<br />
<br />
I then created a two-machine Condor parallel universe grid, using the same user/password combination on both machines; this user had Administrative rights. My 4-core Windows 7 Professional laptop (X86_64,WINNT61) was the '''CONDOR_HOST''' and the '''DedicatedScheduler'''. Assuming this laptop is called '''clavicle''' in the '''shoulder.com''' domain, here are the important configuration parameters in its '''condor_config.local''' file: <br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files (x86)\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD, COLLECTOR, NEGOTIATOR, SMPD_SERVER<br />
<br />
<br />
The second machine, a 2-core Windows XP desktop called '''scapula''' (INTEL,WINNT51), was a dedicated machine with the following important parameters in its '''condor_config.local''' file:<br />
<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, STARTD, SMPD_SERVER<br />
<br />
Note how it points to '''clavicle''' both as '''CONDOR_HOST''' and as the '''DedicatedScheduler''', and that it does not have '''(x86)''' in its path to '''smpd.exe'''.<br />
<br />
MPICH2 gets installed to different locations: '''C:\Program Files\MPICH2''' on an (INTEL,WINNT51) machine, and '''C:\Program Files (x86)\MPICH2''' on a (X86_64,WINNT61) machine. Because I needed to refer to the '''mpiexec.exe''' path in the batch file that my Condor submit file referred to, and this batch file would be the same for both machines, I opted to copy '''mpiexec.exe''' to the work directory I was using on all machines, '''C:\mpich2work'''.<br />
<br />
After setting up this configuration, I rebooted both machines to start in this configuration. At this point, the command<br />
net stop condor<br />
no longer returns correctly, since Condor is managing the '''smpd.exe''' service now and does not know how to stop that service.<br />
<br />
<br />
<br />
Based on the Condor submit file described in the Condor wiki page, I had to adapt my own submit file for my above MPI executable called '''mpiwork.exe''' to be the following:<br />
<br />
<br />
universe = parallel<br />
executable = mp2script.bat<br />
arguments = mpiwork.exe<br />
machine_count = 6<br />
output = out.$(NODE).log<br />
error = error.$(NODE).log<br />
log = work.$(NODE).log<br />
should_transfer_files = yes<br />
when_to_transfer_output = on_exit<br />
transfer_input_files = mpiwork.exe<br />
Requirements = ((Arch == "INTEL") && (OpSys == "WINNT51")) || ((Arch == "X86_64") && (OpSys == "WINNT61"))<br />
queue<br />
<br />
and my '''mp2script.bat''' file was adapted to be: <br />
<br />
set _CONDOR_PROCNO=%_CONDOR_PROCNO%<br />
set _CONDOR_NPROCS=%_CONDOR_NPROCS%<br />
REM If not the head node, just sleep forever<br />
if not [%_CONDOR_PROCNO%] == [0] copy con nothing<br />
REM Set this to the bin directory of MPICH installation<br />
set MPDIR="C:\mpich2work"<br />
REM run the actual mpijob<br />
%MPDIR%\mpiexec.exe -n %_CONDOR_NPROCS% -p 6666 %*<br />
exit 0<br />
<br />
with the change being that '''MPDIR''' is a location that is known to have '''mpiexec.exe''' on both machines, since I copied that file there on both machines as described above.<br />
<br />
A point of confusion: I would have expected the '''mpiwork.exe''' output above to contain the names of both '''clavicle''' and '''scapula''' when I ran it on 6 nodes; instead, only the name '''clavicle''' appeared. I do believe that this was running on both machines and all 6 cores, as both machines' IP addresses were noted in work.parallel.log, and both machines had CPU activity while the executable ran.<br />
<br />
<br />
== MPI Based Meta-Optimization Framework ==<br />
<br />
=== High Level Architecture ===<br />
<br />
The system was designed for infrastructure reuse. The infrastructure classes take care of organizing the computation nodes and of the communication between them. When a new meta-optimization is needed, the client programmer should create subclasses of MetaOptimizer, JobDefinition, Results, and RegistrationWorker specific to the computational problem.<br />
<br />
=== MPI Aware Infrastructure C++ classes ===<br />
<br />
MasterSlave: Entry point of the application, initializes the MPI infrastructure, starts up a Master (sending it a particular MetaOptimizer subclass), starts up as many slave nodes as are necessary (sending each of them a particular RegistrationWorker subclass), and finalizes the MPI infrastructure when the computation is complete.<br />
<br />
Master: The central controlling part of the infrastructure. It manages the available Slaves, sending them JobDefinitions as needed, collects Results from the Slaves as they come in, will re-run (NOT YET IMPLEMENTED) a JobDefinition if it has failed or timed out, and shuts down all Slaves when the computation is finished. The Master owns a particular MetaOptimizer subclass. The MetaOptimizer subclass is aware of the particular computation; the Master only knows how to communicate with Slaves and has no responsibility for breaking a computation into parts or determining its completion.<br />
<br />
Slave: One of the processing nodes, part of the infrastructure. Each Slave manages a RegistrationWorker (SHOULD THIS BE MORE GENERAL THAN REGISTRATION?) subclass. Slaves announce their presence to the Master, receive a JobDefinition describing a particular job, pass the JobDefinition on to the RegistrationWorker to run the job, receive back a Results object upon completion of the job, and send the Results back to the Master. Slaves are intended to run multiple rounds of registration during a particular computational run.<br />
<br />
=== Infrastructure Event C++ classes ===<br />
<br />
MasterEvent and SlaveEvent (named for the origin of the event) are the Infrastructure Event classes; each holds a message wrapper class as a payload.<br />
<br />
The idea behind the job message wrapper classes was to make message wrappers that know how to bi-directionally convert their specific named parameters to array buffers, giving them a pseudo-serialization ability without them having to be aware of MPI specifics. Each message wrapper class is associated with a specific Infrastructure Event class (SlaveEvents own a Results, and MasterEvents own a JobDefinition). The Infrastructure Event classes take care of sending from Master classes to Slave classes, and from Slave classes to Master classes.<br />
<br />
The MetaOptimizer subclasses return JobDefinition subclasses to the Master, which wraps each JobDefinition subclass in a MasterEvent and sends the MasterEvent to a Slave. A Slave reads a MasterEvent, extracts the JobDefinition subclass, and passes the JobDefinition subclass to a RegistrationWorker subclass. Upon completion of the calculation for the JobDefinition, the RegistrationWorker returns a Results subclass to the Slave, which then wraps the Results in a SlaveEvent and sends the SlaveEvent to the Master. The Master reads the SlaveEvent, extracts the Results subclass, and sends this Results subclass to the MetaOptimizer.<br />
<br />
=== Computation Specific C++ classes ===<br />
<br />
JobDefinition: A Message Wrapper class that defines the parameters of a particular job, a specific subclass of JobDefinition should be used for a particular computation.<br />
<br />
Results: A Message Wrapper class that defines the return values of a particular job, a specific subclass of Results should be used for a particular computation.<br />
<br />
MetaOptimizer: The MetaOptimizer subclass for a particular computation takes care of determining the parameters for a particular job (in the form of a JobDefinition), and determining how many JobDefinitions will be in a round of computation. The MetaOptimizer then takes in the Results from each JobDefinition, and upon receiving Results for each JobDefinition in a round, can either determine a new round of computation by recalculating the new set of parameters for the new round, or else terminate the computation.<br />
<br />
RegistrationWorker: The RegistrationWorker subclass for a particular computation will take in a particular JobDefinition subclass germane to the computation, perform the actual Registration computation, then return the results in the form of a specific Results subclass that is relevant to the computation.</div>
<hr />
<div>
Condor can be installed as a ''manager'' node, an ''execute'' node, or a ''submit'' node, or any combination of these. See [http://www.cs.wisc.edu/condor/manual/v7.2/3_1Introduction.html#SECTION00411000000000000000 The Different Roles a Machine Can Play]<br />
<br />
* Manager: There can be only one central manager for your pool. This machine is the collector of information and the negotiator between resources and resource requests.<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer to the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
''' The official instructions on how to install Condor on Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here] '''. Below we present some of the tweaks we had to do to get it to work on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.); if it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a command similar to the following to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or with a user having equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (For example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation scripts ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update PATH variable to include the directory ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save file and apply the change by running <br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out and log in again, or even restart the machine, and check that the CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit the condor manager config file and update the lines as shown below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, to allow MIDAS to run condor commands, create a link to /root/condor/etc/condor_config in /home/condor<br />
 cd /home/condor<br />
 ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring a Executer/Submitter in Unix ===<br />
The files that allow the server to also be used as a condor submitter/executer were automatically updated by the installation script ''condor_install''. Nevertheless, you still need to update its configuration file.<br />
<br />
* Edit the node's condor_config.local and update the line as shown below:<br />
vi /home/condor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the lines containing ''UID_DOMAIN'' and ''FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor on Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor on Windows 7.<br />
<br />
# Download the Windows install MSI, run it, installing to "C:/condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager ( full address ).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For accounting domain enter your domain (e.g. yourdomaininternal.com)<br />
# For Email settings, I ignored this by clicking next.<br />
# For Java settings, I ignored this (as we weren't using Java) by clicking next.<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked whether you want a custom or a default install, choose the default install. This will install Condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When you run condor commands, start a Cygwin or cmd prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix up some problems. My '''condor_status''' at first gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log; the first of these I looked at was '''.master_address'''. The IP address listed there was incorrect: since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or perhaps just an IP that you would not want.<br />
<br />
<br />
I shut down condor, right clicked on C:/condor in a Windows explorer, turned off "read only" and set permissions to allow for writing. Then I edited the file "c:/condor/condor_config.local" which started out empty, so that it could pick up some replacement values (some of them didn't seem to be set properly during the install). These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following:<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourinternaldomain.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
Then I restarted condor. If you run "condor_status" and see that the Windows machine has Activity=OWNER rather than UNCLAIMED, be sure that you have added START=True. This may not be the best configuration for a Windows workstation that is in use, however; some additional configuration is probably needed to make sure a Condor job doesn't use the machine while a physical human user is there using it.<br />
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here is the batch file contents for the actual job (printname.bat)<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, open a command prompt or Cygwin terminal with elevated privileges: right-click the icon used to start it and choose "Run as administrator".<br />
<br />
On Windows, condor_master runs as a service, and it controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows for the first time, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* run the condor manager <br />
condor_master<br />
<br />
* Assuming you set the machine type to ''manager,execute,submit'' (the default) during installation, run the following command<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after you started condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Set up condor to start automatically at boot <br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add the ''condor.boot'' service to all runlevels<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
#include <unistd.h><br />
#include <stdio.h><br />
int main( int argc, char** argv )<br />
{<br />
printf( "%s\n", argv[1] );<br />
fflush( stdout );<br />
sleep( 30 );<br />
return 0; <br />
}<br />
<br />
This program echoes the command line argument it is given, waits 30 seconds, then exits.<br />
<br />
Save this file as foo.c, then compile it statically (note the --static flag):<br />
<br />
gcc foo.c -o foo --static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once so you can see how Condor will execute the job on multiple execute resources), you can <br />
change the condorjob file to be like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
<br />
We had a case where we had 6 slots: 2 were 32-bit with Arch=INTEL, and 4 were 64-bit with Arch=X86_64 (but we were unaware of the difference at first). We ran 6 jobs and then wondered why they would only execute on the submitting machine. So we changed the condorjob file to specify a certain architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told condor to execute only on machines with an Intel architecture, so it did not attempt to execute the job on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. I've included this story in case it helps with debugging.<br />
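In a mixed pool, the submit file's Requirements expression can pin a job to the architecture (and operating system) its executable was built for, so it is never matched to an incompatible slot. A sketch (substitute the Arch and OpSys values that apply to your executable):<br />

```
Requirements = (Arch == "X86_64") && (OpSys == "LINUX")
```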
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines on the Condor grid, because we do not have shared Unix users on our network. In our case this user was Condor.<br />
*When building BatchMake, you need to build it with grid support enabled.<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows which of the jobs you have submitted to condor are active. It will give you a cluster ID and process ID for each job.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process will start and stop all other condor daemons. condor_master will write a MasterLog and have a .master_address file. Be sure that the .master_address contains the correct IP address. If the correct IP address doesn't appear, set the correct value in the parameter <br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has, but condor may have picked a different IP than you wanted, perhaps the local loopback address 127.0.0.1 or a wireless adapter instead of the Ethernet one.<br />
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes, and holds data about submitted jobs. It tracks the queueing of jobs and tries to get resources for all of its jobs to be run. When a submitted job is run, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine when an actual execution of the job is run. It takes care of system calls that need to be executed on the submitting machine for a process. There is a condor_shadow process for each executing process of a submission machine, so on a machine with a large number of submitted processes, the memory and other resources needed to support the shadow daemons could become a limitation.<br />
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution node. Log files of interest include NegotiatorLog and MatchLog.<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon is used to detect user activity on an execute node, so Condor knows whether to allow execution of a job or to disallow it because a human user is currently engaged in some task.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, corresponding to [http://en.wikipedia.org/wiki/IA-64 IA64], the 64-bit Intel Itanium architecture. This is not the architecture of most 64-bit Intel processors, which use x86-64. <br />
<br />
While trying to run the ''condor_master'', the shell returned the following error message ''cannot execute binary file''<br />
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], it's possible to inspect the header of an executable and determine whether it can run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the different outputs, it's possible to observe that the '''Intel IA-64''' architecture isn't the right one for this machine, which runs X86-64 binaries.<br />
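As a cross-check when readelf is not available, the same Machine field can be read directly from the ELF header; a minimal Python sketch (the byte offsets assume a little-endian ELF file, matching the "2's complement, little endian" lines in the readelf output above):<br />

```python
import struct

# Subset of ELF e_machine codes (see the ELF specification for the full list).
ELF_MACHINES = {50: "Intel IA-64", 62: "AMD x86-64"}

def elf_machine(header):
    """Return the architecture name recorded in an ELF header.

    Assumes a little-endian ELF file; the e_machine field is a
    16-bit value at byte offset 18 of the header.
    """
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    (machine,) = struct.unpack_from("<H", header, 18)
    return ELF_MACHINES.get(machine, "unknown (%d)" % machine)
```

For a real binary, e.g. elf_machine(open("/bin/ls", "rb").read(20)), the result should agree with readelf's Machine line.<br />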
<br />
<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the location of the MPICH2\bin directory to your system path. Add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
The following was performed in a Windows CMD prompt run with Administrative Privileges (right click on the CMD executable and run it as an Administrator).<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by running<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's Windows account password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication password for smpd; this is the '''mypassphrase''' you entered above. If all is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processors since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines, and how many processes each machine should run, and the executable '''hostname''', which should return the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
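For example, with two machines at hypothetical addresses 192.168.1.10 (4 cores) and 192.168.1.11 (2 cores), the command would be:<br />

```
mpiexec -hosts 2 192.168.1.10 4 192.168.1.11 2 hostname
```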
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, you must right click on the project file (not the solution file), then click properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then to run your exe on two different machines (I had to build a RELEASE for this, at first I had built a DEBUG build, but this caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
<br />
== MPICH2 and Condor on Windows ==<br />
<br />
To configure MPI over Condor, I set my Windows 7 laptop up as a dedicated Condor scheduler and execute node, following the instructions at this link:<br />
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
I turned off smpd.exe as a service, since Condor will manage it:<br />
<br />
smpd.exe -remove<br />
<br />
<br />
I then created the following MPI example program, and ran it with a MachineCount of 2 and of 4 at different times (I ran this on my laptop, which has 4 cores and had Condor running locally). This program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.<br />
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char *argv[])<br />
{<br />
MPI::Init(argc, argv);<br />
//<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
int size = MPI::COMM_WORLD.Get_size();<br />
char processor_name[MPI_MAX_PROCESSOR_NAME];<br />
int namelen;<br />
MPI::Get_processor_name(processor_name,namelen);<br />
// <br />
int buf[1];<br />
int numElements = 1;<br />
// initialize buffer with rank<br />
buf[0] = rank;<br />
// example of synchronous communication<br />
// have even ranks send to odd ranks first<br />
// assumes an even number of processes<br />
int syncTag = 123;<br />
if (rank % 2 == 0)<br />
{<br />
// send to next higher rank<br />
int dest = rank + 1;<br />
std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Send(buf,numElements,MPI::INT,dest,syncTag);<br />
// now wait for dest to send back<br />
// even though dest is the param, in the Recv call it is used as the source<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,dest,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
}<br />
else<br />
{<br />
// source is next lower rank<br />
int source = rank - 1;<br />
std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,source,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
// send rank rather than buffer, just to be sure<br />
// even though param is source, in Send call it is for dest<br />
MPI::COMM_WORLD.Send(&rank,numElements,MPI::INT,source,syncTag);<br />
}<br />
// now for asynchronous communication<br />
std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
int asyncTag = 321;<br />
int leftRank;<br />
if (rank == 0)<br />
{<br />
leftRank = size-1;<br />
}<br />
else <br />
{<br />
leftRank = (rank-1);<br />
}<br />
int rightRank;<br />
if (rank == size-1)<br />
{<br />
rightRank = 0;<br />
}<br />
else <br />
{<br />
rightRank = (rank+1);<br />
}<br />
// everyone sends to the leftRank, and receives from the rightRank<br />
// if this were synchronous, would be in danger of deadlocking because of circular waits<br />
// reset buffers to rank<br />
buf[0] = rank;<br />
std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::Request sendReq = MPI::COMM_WORLD.Isend(buf,numElements,MPI::INT,leftRank,asyncTag);<br />
MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf,numElements,MPI::INT,rightRank,asyncTag);<br />
// wait on the receipt<br />
MPI::Status status;<br />
recvReq.Wait(status);<br />
std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
//<br />
MPI::Finalize();<br />
return(0);<br />
}<br />
<br />
<br />
I then created a two machine Condor parallel universe grid, using the same user/password combination on both machines; this user had Administrative rights. My 4 core Windows 7 Professional laptop (X86_64,WINNT61) was the '''CONDOR_HOST''' and the '''DedicatedScheduler'''. Assuming this laptop is called '''clavicle''' in the '''shoulder.com''' domain, here are the important configuration parameters in its '''condor_config.local''' file: <br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files (x86)\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD, COLLECTOR, NEGOTIATOR, SMPD_SERVER<br />
<br />
<br />
The second machine, a 2 core Windows XP desktop called '''scapula''' (INTEL,WINNT51) was a dedicated machine with the following important parameters in its '''condor_config.local''' file:<br />
<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, STARTD, SMPD_SERVER<br />
<br />
Note how it points to '''clavicle''' both as '''CONDOR_HOST''' and as the '''DedicatedScheduler''', and that it does not have '''(x86)''' in its path to '''smpd.exe'''.<br />
<br />
MPICH2 gets installed to different locations, to '''C:\Program Files\MPICH2''' for an (INTEL,WINNT51) machine, and to '''C:\Program Files (x86)\MPICH2''' for a (X86_64,WINNT61) machine. Because I needed to refer to the '''mpiexec.exe''' path in the batch file that my Condor submit file referred to, and this batch file would be the same for both machines, I opted to copy '''mpiexec.exe''' to the work directory I was using on all machines, '''C:\mpich2work'''.<br />
<br />
After setting up this configuration, I rebooted both machines to start in this configuration. At this point, the command<br />
net stop condor<br />
no longer returns correctly, since Condor is managing the '''smpd.exe''' service now and does not know how to stop that service.<br />
<br />
<br />
<br />
Based on the Condor submit file described in the Condor wiki page, I had to adapt my own submit file for my above MPI executable called '''mpiwork.exe''' to be the following:<br />
<br />
<br />
universe = parallel<br />
executable = mp2script.bat<br />
arguments = mpiwork.exe<br />
machine_count = 6<br />
output = out.$(NODE).log<br />
error = error.$(NODE).log<br />
log = work.$(NODE).log<br />
should_transfer_files = yes<br />
when_to_transfer_output = on_exit<br />
transfer_input_files = mpiwork.exe<br />
Requirements = ((Arch == "INTEL") && (OpSys == "WINNT51")) || ((Arch == "X86_64") && (OpSys == "WINNT61"))<br />
queue<br />
<br />
and my '''mp2script.bat''' file was adapted to be: <br />
<br />
set _CONDOR_PROCNO=%_CONDOR_PROCNO%<br />
set _CONDOR_NPROCS=%_CONDOR_NPROCS%<br />
REM If not the head node, just sleep forever<br />
if not [%_CONDOR_PROCNO%] == [0] copy con nothing<br />
REM Set this to the bin directory of MPICH installation<br />
set MPDIR="C:\mpich2work"<br />
REM run the actual mpijob<br />
%MPDIR%\mpiexec.exe -n %_CONDOR_NPROCS% -p 6666 %*<br />
exit 0<br />
<br />
with the change being that '''MPDIR''' is a location that is known to have '''mpiexec.exe''' on both machines, since I copied that file there on both machines as described above.<br />
<br />
A point of confusion: I would have expected the '''mpiwork.exe''' code above to print the names of both '''clavicle''' and '''scapula''' in the output when I ran it on 6 nodes. Instead only the name '''clavicle''' appeared. I do believe that this was running on both machines and all 6 cores, as both machines' IP addresses were noted in work.parallel.log, and both machines had CPU activity at the time the executable ran.<br />
<br />
<br />
== MPI Based Meta-Optimization Framework ==<br />
<br />
=== High Level Architecture ===<br />
<br />
The system was designed for Infrastructure reuse. The infrastructure classes take care of organization of the computation nodes and communication between the computation nodes. When a new MetaOptimization is needed, the client programmer should create subclasses of MetaOptimizer, JobDefinition, Results, and RegistrationWorker specific to the computational problem.<br />
<br />
=== MPI Aware Infrastructure C++ classes ===<br />
<br />
MasterSlave: Entry point of the application, initializes the MPI infrastructure, starts up a Master (sending it a particular MetaOptimizer subclass), starts up as many slave nodes as are necessary (sending each of them a particular RegistrationWorker subclass), and finalizes the MPI infrastructure when the computation is complete.<br />
<br />
Master: The central controlling part of the infrastructure. It manages the available Slaves, sending them JobDefinitions as needed, collects Results from the Slaves as they come in, will re-run (NOT YET IMPLEMENTED) a JobDefinition if it has failed or timed out, and shuts down all Slaves when the computation is finished. The Master owns a particular MetaOptimizer subclass. The MetaOptimizer subclass is aware of the particular computation; the Master only knows how to communicate with Slaves and has no responsibility for determining the breakdown of a computation into parts or its completion.<br />
<br />
Slave: One of the processing nodes, part of the infrastructure. Each Slave manages a RegistrationWorker subclass (SHOULD THIS BE MORE GENERAL THAN REGISTRATION?). Slaves announce their presence to the Master, receive a JobDefinition describing a particular job, pass the JobDefinition on to the RegistrationWorker to run the job, receive back a Results upon completion of the job, and send the Results back to the Master. Slaves are intended to run multiple rounds of Registration during a particular computational run.<br />
<br />
=== Infrastructure Event C++ classes ===<br />
<br />
MasterEvent & SlaveEvent (named for the origin of the Event) are Infrastructure Event classes, and each hold a Message Wrapper class as a payload.<br />
<br />
The idea behind the job message wrapper classes was to make message wrappers that know how to bi-directionally convert their specific named parameters to array buffers, giving them a pseudo-serialization ability without their having to be aware of MPI specifics. Each message wrapper class is associated with a specific Infrastructure Event class (SlaveEvents own a Results, and MasterEvents own a JobDefinition). The Infrastructure Event classes take care of sending from Master classes to Slave classes, and from Slave classes to Master classes.<br />
<br />
The MetaOptimizer subclasses return JobDefinition subclasses to the Master, which wraps the JobDefinition subclass in a MasterEvent and sends the MasterEvent to a Slave. A Slave reads a MasterEvent, extracts the JobDefinition subclass, and passes it to a RegistrationWorker subclass. Upon completion of the calculation for the JobDefinition, the RegistrationWorker returns a Results subclass to the Slave, which then wraps the Results in a SlaveEvent and sends the SlaveEvent to the Master. The Master reads the SlaveEvent, extracts the Results subclass, and sends this Results subclass to the MetaOptimizer.<br />
<br />
=== Computation Specific C++ classes ===<br />
<br />
JobDefinition: A Message Wrapper class that defines the parameters of a particular job, a specific subclass of JobDefinition should be used for a particular computation.<br />
<br />
Results: A Message Wrapper class that defines the return values of a particular job, a specific subclass of Results should be used for a particular computation.<br />
<br />
MetaOptimizer: The MetaOptimizer subclass for a particular computation takes care of determining the parameters for a particular job (in the form of a JobDefinition), and determining how many JobDefinitions will be in a round of computation. The MetaOptimizer then takes in the Results from each JobDefinition, and upon receiving Results for each JobDefinition in a round, can either determine a new round of computation by recalculating the new set of parameters for the new round, or else terminate the computation.<br />
<br />
RegistrationWorker: The RegistrationWorker subclass for a particular computation will take in a particular JobDefinition subclass germane to the computation, perform the actual Registration computation, then return the results in the form of a specific Results subclass that is relevant to the computation.</div>
<hr />
<div>= Introduction =<br />
<br />
Condor is an open source distributed computing software framework. It can be used to manage workload on a dedicated cluster of computers, and/or to farm out work to idle desktop computers. Condor is a cross-platform system that can be run on Unix and Windows operating system. Condor is a complex and flexible system that can execute jobs in serial and parallel mode. For parallel jobs, it supports the MPI standard. This Wikipage is dedicated to document our working experience using Condor.<br />
<br />
= Downloading Condor =<br />
<br />
Different versions of condor can be downloaded from [http://www.cs.wisc.edu/condor/downloads-v2/download.pl here]. This documentation focuses on our experience installing/configuring Condor Version 7.2.0. The official detail documentation for this version can be found [http://www.cs.wisc.edu/condor/manual/v7.2/ here]<br />
<br />
= Preparation =<br />
<br />
As Condor is a flexible system, there are different ways of configuring condor in your computing infrastructure. Hence, before starting installation, make the following important decisions.<br />
<br />
#What machine will be the central manager?<br />
#What machines should be allowed to submit jobs?<br />
#Will Condor run as root or not?<br />
#Do I have enough disk space for Condor?<br />
#Do I need MPI configured?<br />
<br />
Condor can be installed as either a ''manager'' node, a ''execute'' or a ''submit'' node. Or any combination of these ones. See [http://www.cs.wisc.edu/condor/manual/v7.2/3_1Introduction.html#SECTION00411000000000000000 The Different Roles a Machine Can Play]<br />
<br />
* Manager: There can be only one central manager for your pool. This machine is the collector of information and the negotiator between resources and resource requests.<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer to the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
''' The official instructions on how to install Condor on Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here] '''. Below we present some of the tweaks we had to do to get it to work on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.); if it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
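The hostname check above can be scripted. Here is a minimal sketch in POSIX sh (the ''is_fqdn'' helper is ours, not part of Condor; it simply treats any name containing a dot as fully qualified):<br />

```shell
# Check whether this host reports a fully qualified domain name.
# A FQDN contains at least one dot, e.g. mymachine.mydomain.com.
is_fqdn() {
  case "$1" in
    *.*) return 0 ;;
    *)   return 1 ;;
  esac
}

name=$(hostname)
if is_fqdn "$name"; then
  echo "OK: $name looks fully qualified"
else
  echo "WARNING: $name has no domain part; fix /etc/hosts and /etc/hostname"
fi
```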
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a similar command to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or as a user having equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (For example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation scripts ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update PATH variable to include the directory ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save file and apply the change by running <br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out / log in, or even restart the machine, and you should be able to check that the CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit the condor manager config file and update the lines as shown below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, in order to allow MIDAS to run condor commands, create a link to /root/condor/etc/condor_config in /home/condor<br />
cd /home/condor<br />
ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring an Executer/Submitter in Unix ===<br />
The files allowing the server to also be used as a condor submitter/executer were automatically updated when the installation script ''condor_install'' ran. Nevertheless, you still need to update its configuration file.<br />
<br />
* Edit condor node config_file.local and update the line as referenced below:<br />
vi /home/condor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the lines containing ''UID_DOMAIN'' and ''FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor in Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor in Windows 7.<br />
<br />
# Download the Windows install MSI, run it, installing to "C:/condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager ( full address ).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For accounting domain enter your domain (e.g. yourdomaininternal.com)<br />
# For email settings, I skipped this by clicking next.<br />
# For Java settings, I skipped this by clicking next (we weren't using Java).<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked whether you want a custom install or a standard install, choose install. This will install condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When you run condor commands, start a cygwin or cmd prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix up some problems. My '''condor_status''' at first gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect: since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or just an IP that you do not want.<br />
<br />
<br />
I shut down condor, right-clicked on C:/condor in Windows Explorer, turned off "read only", and set permissions to allow writing. Then I edited the file "c:/condor/condor_config.local" (which started out empty) so that it could pick up some replacement values, since some of them didn't seem to be set properly during the install. These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following:<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourinternaldomain.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
Then restart condor. If you run "condor_status" and see that the Windows machine has Activity=OWNER rather than UNCLAIMED, be sure that you have added START=True. But this may not be the best configuration for a Windows workstation that is in use; some additional configuration is probably needed to make sure a Condor job doesn't use the machine while a human user is there using it.<br />
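One sketch of such a desktop policy, loosely based on the stock policy macros in the Condor manual (the thresholds here are made up for illustration; KeyboardIdle on Windows relies on the KBDD daemon from the DAEMON_LIST above, and you should verify these expressions against your version's documentation before using them):<br />

```
# Run jobs only when the console has been idle a while and the load is low
START = KeyboardIdle > 15 * $(MINUTE) && LoadAvg < 0.3
# Suspend a running job as soon as the user returns
SUSPEND = KeyboardIdle < $(MINUTE)
# Resume once the user has been away again for a few minutes
CONTINUE = KeyboardIdle > 5 * $(MINUTE)
```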
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here is the batch file contents for the actual job (printname.bat)<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, get a command prompt or Cygwin terminal, right click the icon to start it up, and click "run with elevated privileges" or "run as administrator".<br />
<br />
condor_master runs as a service on Windows, which controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows for the first time, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* Run the condor manager<br />
condor_master<br />
<br />
* Assuming that during installation you set up the type as ''manager,execute,submit'' (the default), run the following command<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after you started condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Set up condor to start automatically<br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add the ''condor.boot'' service to all runlevels<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
#include <unistd.h><br />
#include <stdio.h><br />
int main( int argc, char** argv )<br />
{<br />
printf( "%s\n", argv[1] );<br />
fflush( stdout );<br />
sleep( 30 );<br />
return 0; <br />
}<br />
<br />
This executable echoes the command line argument it is given, waits 30 seconds, then exits.<br />
<br />
Save this file as foo.c, then compile it with (note the static linking):<br />
<br />
gcc foo.c -o foo --static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once, so you can see how Condor will execute the job on multiple execute resources), you can change the condorjob file to look like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
<br />
We had a case with 6 slots: 2 were 32-bit with Arch=INTEL, and 4 were 64-bit with Arch=X86_64 (but we were unaware of the difference at first). We ran 6 jobs and then wondered why they would only execute on the submitting machine. So we changed the condorjob file to specify a certain architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told condor to execute only on machines with the INTEL architecture, so the job did not attempt to execute on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. We've included this story in case it helps with debugging.<br />
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was ''condor''.<br />
*When building BatchMake, you need to build with grid support on<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows you which of the processes you have submitted to condor are active. It will give you a cluster ID and process ID for each process.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process will start and stop all other condor daemons. condor_master will write a MasterLog and have a .master_address file. Be sure that the .master_address contains the correct IP address. If the correct IP address doesn't appear, set the correct value in the parameter <br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has, but condor may have picked a different IP than you would have wanted, perhaps the local loopback address 127.0.0.1 or a wireless instead of ethernet adapter.<br />
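A quick way to see which address was recorded, assuming the .master_address file contains a single address string of the form &lt;IP:port&gt; (as it did in our 7.2 setups; the ''master_ip'' helper below is our own):<br />

```shell
# Print just the IP from a Condor address file such as .master_address.
# Assumes the file holds something like "<10.171.1.124:9618>".
master_ip() {
  sed -e 's/^<//' -e 's/:.*$//' "$1"
}

# Example (path taken from the Windows setup described above):
# master_ip /c/condor/log/.master_address
```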
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes, and represents data about submitted jobs. It tracks the queueing of jobs and tries to get resources for all of its jobs to be run. When a job submitted is run, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine when an actual execution of the job is run. It takes care of system calls that need to be executed on the submitting machine for a process. There will be a condor_shadow process for each executing process of a submission machine, meaning that on a machine with a large number of submitted processes, the number of shadow daemons that memory or other resources can support could be a limitation.<br />
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution. Log files of interest include NegotiatorLog and MatchLog<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon is used to detect user activity on an execute node, so Condor knows whether to allow execution of a job or to disallow it because a human user is currently engaged in some task.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, corresponding to [http://en.wikipedia.org/wiki/IA-64 IA64], the 64-bit Intel Itanium processor architecture. This does not cover all 64-bit Intel processors.<br />
<br />
While trying to run ''condor_master'', the shell returned the error message ''cannot execute binary file''<br />
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], it is possible to inspect the header of an executable and determine whether it can run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the different outputs, it is possible to observe that the architecture '''Intel IA-64''' isn't the right one.<br />
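To compare architectures without reading the whole header, the Machine field can be extracted directly. The ''parse_machine'' and ''elf_machine'' helpers below are our own sketch; they only assume the `readelf -h` output format shown above:<br />

```shell
# Pull the "Machine:" field out of `readelf -h` output.
parse_machine() {
  awk -F: '/Machine/ { gsub(/^[ \t]+/, "", $2); print $2 }'
}

# Report an ELF binary's target architecture (requires binutils readelf).
elf_machine() {
  readelf -h "$1" | parse_machine
}

# Example: compare a candidate binary against one known to run here.
# elf_machine ./condor-7.2.0/sbin/condor_master
# elf_machine /bin/ls
```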
<br />
<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the location of the MPICH2\bin directory to your system PATH. Add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
The following was performed in a Windows CMD prompt run with Administrative Privileges (right click on the CMD executable and run it as an Administrator).<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's Windows account password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication passphrase for smpd; this is the '''mypassphrase''' you entered above. If all is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processors since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines, and how many processes each machine should run, and the executable '''hostname''', which should return the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, you must right click on the project file (not the solution file), then click properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then to run your exe on two different machines (I had to build a RELEASE build for this; at first I had built a DEBUG build, which caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
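The -hosts argument list is easy to get wrong when typed by hand. A small helper can assemble it from ip:cores pairs (''build_mpiexec_hosts'' is a hypothetical convenience of ours, not part of MPICH2):<br />

```shell
# Assemble an `mpiexec -hosts` command line from "ip:cores" pairs.
# Usage: build_mpiexec_hosts <executable> <ip1:cores1> [<ip2:cores2> ...]
build_mpiexec_hosts() {
  exe=$1; shift
  args="-hosts $#"
  for pair in "$@"; do
    ip=${pair%%:*}
    cores=${pair##*:}
    args="$args $ip $cores"
  done
  echo "mpiexec $args $exe"
}

# Example with two machines (IPs and core counts are placeholders):
build_mpiexec_hosts ./your.exe 10.171.1.124:4 10.171.1.125:4
```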
<br />
== MPICH2 and Condor on Windows ==<br />
<br />
To configure MPI over Condor, I set up my Windows 7 laptop as a dedicated Condor scheduler and executer, following the instructions at this link:<br />
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
I turned off smpd.exe as a service, since Condor will manage it:<br />
<br />
smpd.exe -remove<br />
<br />
<br />
I then created the following MPI example program, and ran it with a MachineCount of 2 and 4 at different times (I ran this on my laptop, which has 4 cores and Condor running locally). This program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.<br />
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char *argv[])<br />
{<br />
MPI::Init(argc, argv);<br />
//<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
int size = MPI::COMM_WORLD.Get_size();<br />
char processor_name[MPI_MAX_PROCESSOR_NAME];<br />
int namelen;<br />
MPI::Get_processor_name(processor_name,namelen);<br />
// <br />
int buf[1];<br />
int numElements = 1;<br />
// initialize buffer with rank<br />
buf[0] = rank;<br />
// example of synchronous communication<br />
// have even ranks send to odd ranks first<br />
// assumes an even number of processes<br />
int syncTag = 123;<br />
if (rank % 2 == 0)<br />
{<br />
// send to next higher rank<br />
int dest = rank + 1;<br />
std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Send(buf,numElements,MPI::INT,dest,syncTag);<br />
// now wait for dest to send back<br />
// even though dest is the param, in the Recv call it is used as the source<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,dest,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
}<br />
else<br />
{<br />
// source is next lower rank<br />
int source = rank - 1;<br />
std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,source,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
// send rank rather than buffer, just to be sure<br />
// even though param is source, in Send call it is for dest<br />
MPI::COMM_WORLD.Send(&rank,numElements,MPI::INT,source,syncTag);<br />
}<br />
// now for asynchronous communication<br />
std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
int asyncTag = 321;<br />
int leftRank;<br />
if (rank == 0)<br />
{<br />
leftRank = size-1;<br />
}<br />
else <br />
{<br />
leftRank = (rank-1);<br />
}<br />
int rightRank;<br />
if (rank == size-1)<br />
{<br />
rightRank = 0;<br />
}<br />
else <br />
{<br />
rightRank = (rank+1);<br />
}<br />
// everyone sends to the leftRank, and receives from the rightRank<br />
// if this were synchronous, would be in danger of deadlocking because of circular waits<br />
// reset buffers to rank<br />
buf[0] = rank;<br />
std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::Request sendReq = MPI::COMM_WORLD.Isend(buf,numElements,MPI::INT,leftRank,asyncTag);<br />
MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf,numElements,MPI::INT,rightRank,asyncTag);<br />
// wait on the receipt<br />
MPI::Status status;<br />
recvReq.Wait(status);<br />
std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
//<br />
MPI::Finalize();<br />
return(0);<br />
}<br />
<br />
<br />
I then created a two-machine Condor parallel universe grid, using the same user/password combination on both machines; this user had Administrative rights. My 4-core Windows 7 Professional laptop (X86_64,WINNT61) was the '''CONDOR_HOST''' and the '''DedicatedScheduler'''. Assuming this laptop is called '''clavicle''' in the '''shoulder.com''' domain, here are the important configuration parameters in its '''condor_config.local''' file:<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files (x86)\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD, COLLECTOR, NEGOTIATOR, SMPD_SERVER<br />
<br />
<br />
The second machine, a 2 core Windows XP desktop called '''scapula''' (INTEL,WINNT51) was a dedicated machine with the following important parameters in its '''condor_config.local''' file:<br />
<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, STARTD, SMPD_SERVER<br />
<br />
Note how it points to '''clavicle''' both as '''CONDOR_HOST''' and as the '''DedicatedScheduler''', and that it does not have '''(x86)''' in its path to '''smpd.exe'''.<br />
<br />
MPICH2 gets installed to different locations: '''C:\Program Files\MPICH2''' on an (INTEL,WINNT51) machine, and '''C:\Program Files (x86)\MPICH2''' on a (X86_64,WINNT61) machine. Because I needed to refer to the '''mpiexec.exe''' path in the batch file that my Condor submit file referred to, and this batch file would be the same for both machines, I opted to copy '''mpiexec.exe''' to the work directory I was using on all machines, '''C:\mpich2work'''.<br />
<br />
After setting up this configuration, I rebooted both machines to start in this configuration. At this point, the command<br />
net stop condor<br />
no longer returns correctly, since Condor is now managing the '''smpd.exe''' service and does not know how to stop it.<br />
<br />
<br />
<br />
Based on the Condor submit file described in the Condor wiki page, I adapted my own submit file for the MPI executable above, called '''mpiwork.exe''', to be the following:<br />
<br />
<br />
universe = parallel<br />
executable = mp2script.bat<br />
arguments = mpiwork.exe<br />
machine_count = 6<br />
output = out.$(NODE).log<br />
error = error.$(NODE).log<br />
log = work.$(NODE).log<br />
should_transfer_files = yes<br />
when_to_transfer_output = on_exit<br />
transfer_input_files = mpiwork.exe<br />
Requirements = ((Arch == "INTEL") && (OpSys == "WINNT51")) || ((Arch == "X86_64") && (OpSys == "WINNT61"))<br />
queue<br />
<br />
and my '''mp2script.bat''' file was adapted to be: <br />
<br />
set _CONDOR_PROCNO=%_CONDOR_PROCNO%<br />
set _CONDOR_NPROCS=%_CONDOR_NPROCS%<br />
REM If not the head node, just sleep forever<br />
if not [%_CONDOR_PROCNO%] == [0] copy con nothing<br />
REM Set this to the bin directory of MPICH installation<br />
set MPDIR="C:\mpich2work"<br />
REM run the actual mpijob<br />
%MPDIR%\mpiexec.exe -n %_CONDOR_NPROCS% -p 6666 %*<br />
exit 0<br />
<br />
with the change being that '''MPDIR''' is a location known to contain '''mpiexec.exe''' on both machines, since I copied that file there on both machines as described above.<br />
<br />
A point of confusion I have is that I would have expected the output of the '''mpiwork.exe''' code above to contain the names of both '''clavicle''' and '''scapula''' when I ran it on 6 nodes; instead only the name '''clavicle''' appeared. I do believe that this was running on both machines and all 6 cores, as both machines' IP addresses were noted in the work.parallel.log, and both machines showed CPU activity while the executable ran.<br />
<br />
<br />
== MPI Based Meta-Optimization Framework ==<br />
<br />
=== High Level Architecture ===<br />
<br />
The system was designed for infrastructure reuse. The infrastructure classes take care of the organization of, and communication between, the computation nodes. When a new MetaOptimization is needed, the client programmer should create subclasses of MetaOptimizer, JobDefinition, Results, and RegistrationWorker specific to the computational problem.<br />
<br />
=== MPI Aware Infrastructure C++ classes ===<br />
<br />
MasterSlave: Entry point of the application, initializes the MPI infrastructure, starts up a Master (sending it a particular MetaOptimizer subclass), starts up as many slave nodes as are necessary (sending each of them a particular RegistrationWorker subclass), and finalizes the MPI infrastructure when the computation is complete.<br />
<br />
Master: The central controlling part of the infrastructure. It manages the available Slaves, sending them JobDefinitions as needed, collects Results from the Slaves as they come in, will re-run (NOT YET IMPLEMENTED) a JobDefinition if it has failed or timed out, and shuts down all Slaves when the computation is finished. The Master owns a particular MetaOptimizer subclass. The MetaOptimizer subclass is aware of the particular computation; the Master only knows how to communicate with Slaves and has no responsibility for determining the breakdown or completion of the computation.<br />
<br />
Slave: One of the processing nodes, part of the infrastructure. Each Slave manages a RegistrationWorker (SHOULD THIS BE MORE GENERAL THAN REGISTRATION?) subclass. Slaves announce their presence to the Master, receive a JobDefinition describing a particular job, pass the JobDefinition on to the RegistrationWorker to run the job, receive back a Results upon completion of the job, and send the Results back to the Master.<br />
<br />
<br />
=== Infrastructure Event C++ classes ===<br />
<br />
MasterEvent & SlaveEvent (named for the origin of the Event) are Infrastructure Event classes, and each hold a Message Wrapper class as a payload.<br />
<br />
The idea behind the job message wrapper classes is that each wrapper knows how to convert its specific named parameters to and from array buffers, giving it a pseudo-serialization ability without having to be aware of MPI specifics. Each message wrapper class is associated with a specific Infrastructure Event class (SlaveEvents own a Results, and MasterEvents own a JobDefinition). The Infrastructure Event classes take care of sending from Master classes to Slave classes, and from Slave classes to Master classes.<br />
<br />
The MetaOptimizer subclasses return JobDefinition subclasses to the Master, which wraps each JobDefinition subclass in a MasterEvent and sends the MasterEvent to a Slave. A Slave reads a MasterEvent, extracts the JobDefinition subclass, and sends the JobDefinition subclass to a RegistrationWorker subclass. Upon completion of the calculation for the JobDefinition, the RegistrationWorker returns a Results subclass to the Slave, which then wraps the Results in a SlaveEvent and sends the SlaveEvent to the Master. The Master reads the SlaveEvent, extracts the Results subclass, and sends this Results subclass to the MetaOptimizer.<br />
<br />
=== Computation Specific C++ classes ===<br />
<br />
JobDefinition: A Message Wrapper class that defines the parameters of a particular job, a specific subclass of JobDefinition should be used for a particular computation.<br />
<br />
Results: A Message Wrapper class that defines the return values of a particular job, a specific subclass of Results should be used for a particular computation.<br />
<br />
MetaOptimizer: The MetaOptimizer subclass for a particular computation takes care of determining the parameters for a particular job (in the form of a JobDefinition), and determining how many JobDefinitions will be in a round of computation. The MetaOptimizer then takes in the Results from each JobDefinition, and upon receiving Results for each JobDefinition in a round, can either determine a new round of computation by recalculating the new set of parameters for the new round, or else terminate the computation.<br />
<br />
RegistrationWorker: The RegistrationWorker subclass for a particular computation will take in a particular JobDefinition subclass germane to the computation, perform the actual Registration computation, then return the results in the form of a specific Results subclass that is relevant to the computation.</div>
<hr />
<div>
<br />
* Manager: There can be only one central manager for your pool. This machine is the collector of information, and the negotiator between resources and resource requests.<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer to the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
''' The official instructions on how to install Condor in Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here] '''. Below we present some of the tweaks we had to do to get it to work on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.); if it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
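After the reboot, you can confirm the change with the hostname utility's -f (--fqdn) option:<br />
<br />
 hostname -f<br />
<br />
which should now print ''mymachine.mydomain.com''.<br />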
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a similar command to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or with a user having equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (For example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation scripts ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update the PATH variable to include the directories ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save file and apply the change by running <br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out / log in, or even restart the machine, and check that the CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit the condor manager config file and update the lines as shown below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, in order to allow MIDAS to run condor commands, create a link to /root/condor/etc/condor_config in /home/condor<br />
cd /home/condor<br />
ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring an Executer/Submitter in Unix ===<br />
The different files allowing the server to also be used as a condor submitter/executer were automatically updated while running the installation script ''condor_install''. Nevertheless, you still need to update its configuration file.<br />
<br />
* Edit the condor node condor_config.local and update the line as shown below:<br />
vi /home/condor/localcondor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the lines setting ''UID_DOMAIN'' and ''FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor in Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor in Windows 7.<br />
<br />
# Download the Windows install MSI, run it, installing to "C:/condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager ( full address ).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For accounting domain enter your domain (e.g. yourdomaininternal.com)<br />
# For Email settings (I ignored this by clicking next)<br />
# for Java settings (I ignored this as we weren't using Java, by clicking next)<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked if you want a custom install or install, choose install. This will install condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When you will be running condor commands, start a cygwin or cmd prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix up some problems. My '''condor_status''' at first gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect (since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or perhaps just an IP that you would not want).<br />
<br />
<br />
I shut down condor, right-clicked on C:/condor in a Windows Explorer window, turned off "read only", and set permissions to allow writing. Then I edited the file "c:/condor/condor_config.local" (which started out empty) so that it could pick up some replacement values, since some of them didn't seem to be set properly during the install. These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following:<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourinternaldomain.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
I then restarted condor. If you run "condor_status" and see that the Windows machine has Activity=OWNER rather than UNCLAIMED, be sure that you have added in START=True. But this may not be the best configuration for a Windows workstation that is in use; some additional configuration is probably needed to make sure a Condor job doesn't use the machine when a physical human user is there using it.<br />
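One common way to express such a policy (a sketch modeled on the example desktop-owner policy in the Condor manual — tune the thresholds for your site, and note that on Windows the KeyboardIdle value is supplied by the condor_kbdd daemon) is:<br />
<br />
 START = KeyboardIdle > 15 * $(MINUTE)<br />
 SUSPEND = KeyboardIdle < $(MINUTE)<br />
 CONTINUE = KeyboardIdle > 5 * $(MINUTE)<br />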
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here is the batch file contents for the actual job (printname.bat)<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, start a command prompt or Cygwin terminal by right-clicking its icon and choosing "run as administrator" (run with elevated privileges).<br />
<br />
condor_master runs as a service on Windows, which controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows for the first time, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
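If you want to confirm that the password was stored, ''condor_store_cred'' also supports a query action:<br />
<br />
 condor_store_cred query<br />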
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* Run the condor master daemon, which starts the other daemons<br />
condor_master<br />
<br />
* Assuming that during installation you set the type to ''manager,execute,submit'' (the default), run the following command<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after you started condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Set up condor to start automatically at boot<br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update the MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add the ''condor.boot'' service to all runlevels<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
#include <unistd.h><br />
#include <stdio.h><br />
int main( int argc, char** argv )<br />
{<br />
printf( "%s\n", argv[1] );<br />
fflush( stdout );<br />
sleep( 30 );<br />
return 0; <br />
}<br />
<br />
This exe will print the command line argument it is given, wait 30 seconds, then exit.<br />
<br />
Save this file as foo.c, then compile it (note the static linking):<br />
<br />
gcc foo.c -o foo --static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once so you can see how Condor will execute the job on multiple execute resources), you can <br />
change the condorjob file to be like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
<br />
We had a case where we had 6 slots: 2 were 32-bit with Arch=INTEL, and 4 were 64-bit with Arch=X86_64 (but we were unaware of the difference at first). We ran 6 jobs and then wondered why they would only execute on the submitting machine. So we changed the condorjob file to request a certain architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told condor to execute only on machines with an Arch of INTEL, so no attempt was made to execute on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. I've included this story in case it helps with debugging.<br />
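A quick sanity check before submitting to a mixed-architecture pool (this uses the standard Unix ''file'' utility, not a Condor tool) is to inspect the binary:<br />
<br />
 file foo<br />
<br />
The output reports the ELF class and target architecture (for example 32-bit Intel 80386 versus 64-bit x86-64), which you can compare against the Arch of the slots listed by condor_status.<br />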
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was ''condor''.<br />
*When building BatchMake, you need to build with grid support on<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows you which of the processes you have submitted to condor are active. It will give you a cluster ID and process ID for each process.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process will start and stop all other condor daemons. condor_master will write a MasterLog and have a .master_address file. Be sure that the .master_address contains the correct IP address. If the correct IP address doesn't appear, set the correct value in the parameter <br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has, but condor may have picked a different IP than you would have wanted, perhaps the local loopback address 127.0.0.1 or a wireless instead of ethernet adapter.<br />
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes, and represents data about submitted jobs. It tracks the queueing of jobs and tries to get resources for all of its jobs to be run. When a submitted job is run, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine when an actual execution of the job is run. It takes care of system calls that need to be executed on the submitting machine for a process. There is a condor_shadow process for each executing process of a submission machine, so on a machine with a large number of submitted processes, the memory or other resources consumed by shadow daemons can become a limitation.<br />
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution. Log files of interest include NegotiatorLog and MatchLog.<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon is used to detect user activity on an execute node, so Condor knows whether to allow execution of a job or to disallow it because a human user is currently engaged in some task.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, corresponding to [http://en.wikipedia.org/wiki/IA-64 IA64], the 64-bit Intel Itanium architecture. This does not cover all 64-bit Intel processors. <br />
<br />
While trying to run ''condor_master'', the shell returned the error message ''cannot execute binary file''.<br />
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], you can read the header of an executable and determine whether it can run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the outputs, it is clear that the '''Intel IA-64''' architecture is not the right one; the '''X86-64''' build, like /bin/ls, is what this machine can run.<br />
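The check '''readelf''' performs here boils down to a few fixed-offset fields of the ELF header. A minimal C++ sketch (the ''elfMachine'' helper name is ours, not a binutils API; it assumes little-endian headers, as in the dumps above):<br />

```cpp
#include <cstdint>
#include <string>

// Map the e_machine field of an ELF header (a 16-bit little-endian
// value at byte offset 18) to the names readelf -h prints as "Machine".
std::string elfMachine(const unsigned char* h) {
    // bytes 0-3 are the ELF magic: 7f 45 4c 46
    if (h[0] != 0x7f || h[1] != 'E' || h[2] != 'L' || h[3] != 'F')
        return "not an ELF file";
    uint16_t machine = static_cast<uint16_t>(h[18] | (h[19] << 8));
    switch (machine) {
        case 50: return "Intel IA-64";                    // EM_IA_64
        case 62: return "Advanced Micro Devices X86-64";  // EM_X86_64
        default: return "other";
    }
}
```

Running this over the two condor_master binaries would reproduce the distinction readelf showed above.<br />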
<br />
<br />
* Be sure that your executable is statically linked.<br />
* For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines on the Condor grid, because we do not have shared Unix users on our network. In our case this user was ''condor''.<br />
* When building BatchMake, you need to build with grid support enabled.<br />
* Condor supplies a number of utility programs and log files, which are extremely helpful in understanding and correcting problems.<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the location of the MPICH2\bin directory to your system path, and add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
The following was performed in a Windows CMD prompt run with Administrative Privileges (right click on the CMD executable and run it as an Administrator).<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's Windows account password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication password for smpd; this is the '''mypassphrase''' you entered above. If all is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processors since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines, and how many processes each machine should run, and the executable '''hostname''', which should return the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, you must right click on the project file (not the solution file), then click properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then to run your exe on two different machines (I had to build a RELEASE for this, at first I had built a DEBUG build, but this caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
<br />
== MPICH2 and Condor on Windows ==<br />
<br />
To configure MPI over Condor, I set my Windows 7 laptop up as a dedicated Condor scheduler and executor, following the instructions at this link:<br />
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
I turned off smpd.exe as a service, since Condor will manage it:<br />
<br />
smpd.exe -remove<br />
<br />
<br />
I then created the following MPI example program, and ran this with a MachineCount of 2 and 4 at different times (I ran this on my laptop, which has 4 cores, and had Condor running locally). This program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.<br />
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char *argv[])<br />
{<br />
MPI::Init(argc, argv);<br />
//<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
int size = MPI::COMM_WORLD.Get_size();<br />
 char processor_name[MPI_MAX_PROCESSOR_NAME];<br />
 int namelen; // must be declared before Get_processor_name fills it in<br />
 MPI::Get_processor_name(processor_name,namelen);<br />
// <br />
int buf[1];<br />
int numElements = 1;<br />
// initialize buffer with rank<br />
buf[0] = rank;<br />
// example of synchronous communication<br />
// have even ranks send to odd ranks first<br />
// assumes an even number of processes<br />
int syncTag = 123;<br />
if (rank % 2 == 0)<br />
{<br />
// send to next higher rank<br />
int dest = rank + 1;<br />
std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Send(buf,numElements,MPI::INT,dest,syncTag);<br />
// now wait for dest to send back<br />
// even though dest is the param, in the Recv call it is used as the source<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,dest,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
}<br />
else<br />
{<br />
// source is next lower rank<br />
int source = rank - 1;<br />
std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,source,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
// send rank rather than buffer, just to be sure<br />
// even though param is source, in Send call it is for dest<br />
MPI::COMM_WORLD.Send(&rank,numElements,MPI::INT,source,syncTag);<br />
}<br />
// now for asynchronous communication<br />
std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
int asyncTag = 321;<br />
int leftRank;<br />
if (rank == 0)<br />
{<br />
leftRank = size-1;<br />
}<br />
else <br />
{<br />
leftRank = (rank-1);<br />
}<br />
int rightRank;<br />
if (rank == size-1)<br />
{<br />
rightRank = 0;<br />
}<br />
else <br />
{<br />
rightRank = (rank+1);<br />
}<br />
// everyone sends to the leftRank, and receives from the rightRank<br />
// if this were synchronous, would be in danger of deadlocking because of circular waits<br />
// reset buffers to rank<br />
buf[0] = rank;<br />
std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::Request sendReq = MPI::COMM_WORLD.Isend(buf,numElements,MPI::INT,leftRank,asyncTag);<br />
MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf,numElements,MPI::INT,rightRank,asyncTag);<br />
 // wait on the receipt<br />
 MPI::Status status;<br />
 recvReq.Wait(status);<br />
 // also complete the send request<br />
 sendReq.Wait();<br />
std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
//<br />
MPI::Finalize();<br />
return(0);<br />
}<br />
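The leftRank/rightRank bookkeeping above can also be written with modular arithmetic; a small standalone C++ sketch, no MPI required ('''ringNeighbors''' is our own illustrative helper, not an MPI call):<br />

```cpp
#include <utility>

// Neighbors of `rank` in a ring of `size` processes, as the pair
// (leftRank, rightRank); the ends of the ring wrap around.
std::pair<int, int> ringNeighbors(int rank, int size) {
    int leftRank  = (rank + size - 1) % size;  // wraps 0 -> size-1
    int rightRank = (rank + 1) % size;         // wraps size-1 -> 0
    return std::make_pair(leftRank, rightRank);
}
```

For rank 0 of 4 processes this yields neighbors (3, 1), matching the if/else chains in the program.<br />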
<br />
<br />
I then created a two machine Condor parallel universe grid, using the same user/password combination on both machines, and this user had Administrative rights. My 4 core Windows 7 Professional laptop (X86_64,WINNT61) was the '''CONDOR_HOST''' and the '''DedicatedScheduler'''. Assuming this laptop is called '''clavicle''' in the '''shoulder.com''' domain, here are the important configuration parameters in its '''condor_config.local''' file: <br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files (x86)\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD, COLLECTOR, NEGOTIATOR, SMPD_SERVER<br />
<br />
<br />
The second machine, a 2 core Windows XP desktop called '''scapula''' (INTEL,WINNT51) was a dedicated machine with the following important parameters in its '''condor_config.local''' file:<br />
<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, STARTD, SMPD_SERVER<br />
<br />
Note how it points to '''clavicle''' both as '''CONDOR_HOST''' and as the '''DedicatedScheduler''', and that it does not have '''(x86)''' in its path to the '''smpd.exe'''.<br />
<br />
MPICH2 gets installed to different locations, to '''C:\Program Files\MPICH2''' for an (INTEL,WINNT51) machine, and to '''C:\Program Files (x86)\MPICH2''' for a (X86_64,WINNT61) machine. Because I needed to refer to the '''mpiexec.exe''' path in the batch file that my Condor submit file referred to, and this batch file would be the same for both machines, I opted to copy '''mpiexec.exe''' to the work directory I was using on all machines, '''C:\mpich2work'''.<br />
<br />
After setting up this configuration, I rebooted both machines to start in this configuration. At this point, the command<br />
net stop condor<br />
no longer returns correctly, since Condor is managing the '''smpd.exe''' service now and does not know how to stop that service.<br />
<br />
<br />
<br />
Based on the Condor submit file described in the Condor wiki page, I had to adapt my own submit file for my above MPI executable called '''mpiwork.exe''' to be the following:<br />
<br />
<br />
universe = parallel<br />
executable = mp2script.bat<br />
arguments = mpiwork.exe<br />
machine_count = 6<br />
output = out.$(NODE).log<br />
error = error.$(NODE).log<br />
log = work.$(NODE).log<br />
should_transfer_files = yes<br />
when_to_transfer_output = on_exit<br />
transfer_input_files = mpiwork.exe<br />
Requirements = ((Arch == "INTEL") && (OpSys == "WINNT51")) || ((Arch == "X86_64") && (OpSys == "WINNT61"))<br />
queue<br />
<br />
and my '''mp2script.bat''' file was adapted to be: <br />
<br />
set _CONDOR_PROCNO=%_CONDOR_PROCNO%<br />
set _CONDOR_NPROCS=%_CONDOR_NPROCS%<br />
REM If not the head node, just sleep forever<br />
if not [%_CONDOR_PROCNO%] == [0] copy con nothing<br />
REM Set this to the bin directory of MPICH installation<br />
set MPDIR="C:\mpich2work"<br />
REM run the actual mpijob<br />
%MPDIR%\mpiexec.exe -n %_CONDOR_NPROCS% -p 6666 %*<br />
exit 0<br />
<br />
with the change being that '''MPDIR''' is a location that is known to have '''mpiexec.exe''' on both machines, since I copied that file there on both machines as described above.<br />
<br />
A point of confusion: I would have expected the '''mpiwork.exe''' output to contain the names of both '''clavicle''' and '''scapula''' when I ran it on 6 nodes; instead only the name '''clavicle''' appeared. I do believe that it was running on both machines and all 6 cores, as both machines' IP addresses were noted in the work.parallel.log, and both machines had CPU activity at the time the executable ran.<br />
<br />
<br />
== MPI Based Meta-Optimization Framework ==<br />
<br />
=== High Level Architecture ===<br />
<br />
The system was designed for infrastructure reuse. The infrastructure classes take care of organizing the computation nodes and the communication between them. When a new MetaOptimization is needed, the client programmer should create subclasses of MetaOptimizer, JobDefinition, Results, and RegistrationWorker specific to their computational problem.<br />
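As a sketch of that contract, the four classes a client programmer subclasses might look like the following. The names come from the text, but every signature here is our assumption, not the framework's actual API:<br />

```cpp
#include <vector>

// Plain data carried between Master and Slaves; real subclasses would
// add the named parameters for a specific computation.
struct JobDefinition { std::vector<double> params; };  // one job's inputs
struct Results       { double value; };                // one job's outputs

// Runs one JobDefinition on a Slave node.
class RegistrationWorker {
public:
    virtual ~RegistrationWorker() {}
    virtual Results run(const JobDefinition& job) = 0;
};

// Owned by the Master: produces rounds of jobs and consumes their results.
class MetaOptimizer {
public:
    virtual ~MetaOptimizer() {}
    virtual std::vector<JobDefinition> nextRound() = 0;  // empty => finished
    virtual void accept(const Results& r) = 0;
};
```

Under this sketch the Master loops on nextRound(), farms each JobDefinition to a Slave's RegistrationWorker, and feeds each Results back via accept() until nextRound() returns an empty round.<br />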
<br />
=== MPI Aware Infrastructure C++ classes ===<br />
<br />
MasterSlave: Entry point of the application; it initializes the MPI infrastructure, starts up a Master (sending it a particular MetaOptimizer subclass), starts up as many Slave nodes as are necessary (sending each of them a particular RegistrationWorker subclass), and finalizes the MPI infrastructure when the computation is complete.<br />
<br />
Master: The central controlling part of the infrastructure. It manages the available Slaves, sending them JobDefinitions as needed, collects Results from the Slaves as they come in, will re-run (NOT YET IMPLEMENTED) a JobDefinition if it has failed or timed out, and shuts down all Slaves when the computation is finished. The Master owns a particular MetaOptimizer subclass. The MetaOptimizer subclass is aware of the particular computation; the Master only knows how to communicate with Slaves and has no responsibility for determining the breakdown or completion of the computation.<br />
<br />
Slave: One of the processing nodes, part of the infrastructure. Each Slave manages a RegistrationWorker (SHOULD THIS BE MORE GENERAL THAN REGISTRATION?) subclass. Slaves announce their presence to the Master, receive a JobDefinition describing a particular job, pass the JobDefinition on to the RegistrationWorker to run the job, receive back a Results upon completion of the job, and send the Results back to the Master.<br />
<br />
<br />
=== Infrastructure Event C++ classes ===<br />
<br />
MasterEvent & SlaveEvent (named for the origin of the Event) are Infrastructure Event classes, and each hold a Message Wrapper class as a payload.<br />
<br />
The idea behind the job message wrapper classes was to make message wrappers that know how to bi-directionally convert their specific named parameters to array buffers, giving them a pseudo-serialization ability without them having to be aware of MPI specifics. Each message wrapper class is associated with a specific Infrastructure Event class (SlaveEvents own a Results, and MasterEvents own a JobDefinition). The Infrastructure Event classes take care of sending from Master classes to Slave classes, and from Slave classes to Master classes.<br />
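The pseudo-serialization idea can be sketched as follows; the field names and buffer layout here are illustrative assumptions, not the framework's real members:<br />

```cpp
#include <vector>

// A message wrapper converts its named parameters to and from a flat
// buffer of doubles -- the kind of contiguous array an MPI send or
// receive can ship -- without knowing anything about MPI itself.
struct JobDefinition {
    int jobId;
    double stepSize;

    // pack named parameters into a buffer
    std::vector<double> toBuffer() const {
        return { static_cast<double>(jobId), stepSize };
    }
    // unpack a buffer back into named parameters
    static JobDefinition fromBuffer(const std::vector<double>& b) {
        return JobDefinition{ static_cast<int>(b[0]), b[1] };
    }
};
```

A round trip through toBuffer()/fromBuffer() reproduces the original parameters, which is all the Infrastructure Event classes need in order to move a wrapper across an MPI link.<br />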
<br />
The MetaOptimizer subclasses return JobDefinition subclasses to the Master, which wraps each JobDefinition subclass in a MasterEvent and sends the MasterEvent to a Slave. A Slave reads a MasterEvent, extracts the JobDefinition subclass, and sends the JobDefinition subclass to a RegistrationWorker subclass. Upon completion of the calculation for the JobDefinition, the RegistrationWorker returns a Results subclass to the Slave, which then wraps the Results in a SlaveEvent and sends the SlaveEvent to the Master. The Master reads the SlaveEvent, extracts the Results subclass, and sends this Results subclass to the MetaOptimizer.<br />
<br />
=== Computation Specific C++ classes ===<br />
<br />
JobDefinition: A Message Wrapper class that defines the parameters of a particular job; a specific subclass of JobDefinition should be used for each computation.<br />
<br />
Results: A Message Wrapper class that defines the return values of a particular job; a specific subclass of Results should be used for each computation.<br />
<br />
MetaOptimizer: The MetaOptimizer subclass for a particular computation determines the parameters for each job (in the form of a JobDefinition) and how many JobDefinitions make up a round of computation. The MetaOptimizer then takes in the Results from each JobDefinition; upon receiving Results for every JobDefinition in a round, it can either start a new round by recalculating the set of parameters for that round, or terminate the computation.<br />
<br />
RegistrationWorker: The RegistrationWorker subclass for a particular computation will take in a particular JobDefinition subclass germane to the computation, perform the actual Registration computation, then return the results in the form of a specific Results subclass that is relevant to the computation.</div>
<div>Condor can be installed as a ''manager'' node, an ''execute'' node, or a ''submit'' node, or any combination of these. See [http://www.cs.wisc.edu/condor/manual/v7.2/3_1Introduction.html#SECTION00411000000000000000 The Different Roles a Machine Can Play]<br />
<br />
* Manager: There can be only one central manager for your pool. This machine is the collector of information, and the negotiator between resources and resource requests.<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer to the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
''' The official instructions on how to install Condor in Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here] '''. Below we present some of the tweaks we had to do to get it to work on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.) , if it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a similar command to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or with a user having equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (For example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation scripts ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update PATH variable to include the directory ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save file and apply the change by running <br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out and log back in, or even restart the machine, and you should be able to check that the CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit condor manager config_file and update the line as referenced below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, in order to allow Midas to run condor commands, create a link to /root/condor/etc/condor_config in /home/condor<br />
 cd /home/condor<br />
 ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring a Executer/Submitter in Unix ===<br />
The different files allowing the server also to be used as a condor submitter/executer were automatically updated by the installation script ''condor_install''. Nevertheless, you still need to update its configuration file.<br />
<br />
* Edit condor node config_file.local and update the line as referenced below:<br />
vi /home/condor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the lines containing ''UID_DOMAIN'' and ''FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor in Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor in Windows 7.<br />
<br />
# Download the Windows install MSI, run it, installing to "C:/condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager ( full address ).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For accounting domain enter your domain (e.g. yourdomaininternal.com)<br />
# For Email settings (I ignored this by clicking next)<br />
# for Java settings (I ignored this as we weren't using Java, by clicking next)<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked whether you want a custom install or a standard install, choose install. This will install condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When running condor commands, start a cygwin or cmd prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix up some problems. My '''condor_status''' at first gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log; the first of these I looked at was '''.master_address'''. The IP address listed there was incorrect: since my machine has multiple IP addresses, I needed to specify the address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. Your listed address might incorrectly be a localhost loopback address like 127.0.0.1, or simply an IP you did not intend to use.<br />
<br />
<br />
I shut down condor, right-clicked on C:\condor in Windows Explorer, turned off "read only", and set permissions to allow writing. Then I edited the file "C:\condor\condor_config.local" (which started out empty) so that it could supply some replacement values, since some of them didn't seem to be set properly during the install. These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following:<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourdomaininternal.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
Then restart condor. If you run "condor_status" and see that the Windows machine has Activity=OWNER rather than UNCLAIMED, be sure that you have added START=True. But this may not be the best configuration for a Windows workstation that is in use. There is probably some additional configuration needed to make sure a Condor job doesn't use the machine when a physical human user is there using it.<br />
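As a hedged sketch of that additional configuration: ''KeyboardIdle'' and ''LoadAvg'' are standard Condor machine ClassAd attributes, but the thresholds below are illustrative assumptions, not values we tested. The START expression can be made to depend on user activity:<br />

```
# Only start jobs after 15 minutes without keyboard/mouse activity
# and when the machine is not otherwise busy; thresholds are examples only.
START = (KeyboardIdle > 15 * 60) && (LoadAvg <= 0.3)
```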
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here is the batch file contents for the actual job (printname.bat)<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
=== Useful Condor Commands on Windows ===<br />
<br />
To run these commands, get a command prompt or Cygwin terminal, right click the icon to start it up, and click "run with elevated privileges" or "run as administrator".<br />
<br />
On Windows, condor_master runs as a service; it controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* run the condor manager <br />
condor_master<br />
<br />
* Assuming that during installation you set the type to ''manager,execute,submit'' (the default), run the following command:<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after you started condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Setup condor to automatically startup <br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add the ''condor.boot'' service to all runlevels<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
#include <unistd.h><br />
#include <stdio.h><br />
int main( int argc, char** argv )<br />
{<br />
printf( "%s\n", argv[1] );<br />
fflush( stdout );<br />
sleep( 30 );<br />
return 0; <br />
}<br />
<br />
This program echoes the command line argument it is given, waits 30 seconds, then exits.<br />
<br />
Save this file as foo.c, then compile it with (note the static linking):<br />
<br />
gcc foo.c -o foo --static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once so you can see how Condor will execute the job on multiple execute resources), you can <br />
change the condorjob file to be like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
<br />
We had a case with 6 slots: 2 were 32-bit with Arch=INTEL, 4 were 64-bit with Arch=X86_64 (but we were unaware of the difference at first). We ran 6 jobs and then wondered why they would only execute on the submitting machine. So we changed the condorjob file to request a specific architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told condor to execute only on machines with the INTEL architecture, so the job was not attempted on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. I've included this story in case it helps with debugging.<br />
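To avoid this class of mismatch, you can pin the job to the architecture and operating system the binary was built for. A hedged sketch of the extra submit-file lines (the attribute names are standard, but the values must match what ''condor_status'' reports for your pool):<br />

```
# Only match 64-bit Linux execute slots; adjust to your build target.
Requirements = (Arch == "X86_64") && (OpSys == "LINUX")
```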
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was Condor.<br />
*When building BatchMake, you need to build with grid support turned on.<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows the processes you have submitted to condor that are still active. It will give you a cluster ID and process ID for each process.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process will start and stop all other condor daemons. condor_master will write a MasterLog and have a .master_address file. Be sure that the .master_address contains the correct IP address. If the correct IP address doesn't appear, set the correct value in the parameter <br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has, but condor may have picked a different IP than you would have wanted, perhaps the local loopback address 127.0.0.1 or a wireless instead of ethernet adapter.<br />
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes, and maintains data about submitted jobs. It tracks the queueing of jobs and tries to get resources for all of its jobs to be run. When a submitted job runs, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine when an actual execution of the job is run. It takes care of system calls that need to be executed on the submitting machine for a process. There will be a condor_shadow process for each executing process of a submission machine, meaning that on a machine with a large number of submitted processes, the number of shadow daemons supported by memory or other resources could be a limitation.<br />
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution. Log files of interest include NegotiatorLog and MatchLog.<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon is used to detect user activity on an execute node, so Condor knows whether to allow execution of a job or to disallow it because a human user is currently engaged in some task.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, corresponding to [http://en.wikipedia.org/wiki/IA-64 IA64], the Intel Itanium 64-bit architecture. This does not cover all 64-bit Intel processors.<br />
<br />
While trying to run ''condor_master'', the shell returned the error message ''cannot execute binary file''.<br />
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], it is possible to extract the header of an executable and determine whether it can run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the different outputs, we can see that the '''Intel IA-64''' architecture is not the right one.<br />
<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the location of the MPICH2\bin directory to your system path, and add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
The following was performed in a Windows CMD prompt run with Administrative Privileges (right click on the CMD executable and run it as an Administrator).<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's Windows account password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication password for smpd; this is the '''mypassphrase''' you entered above. If all is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processors since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines, and how many processes each machine should run, and the executable '''hostname''', which should return the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, you must right click on the project file (not the solution file), then click properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then to run your exe on two different machines (I had to build a RELEASE for this; at first I had built a DEBUG build, but this caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
<br />
== MPICH2 and Condor on Windows ==<br />
<br />
To configure MPI over Condor, I set up my Windows 7 laptop as a dedicated Condor scheduler and executer, following the instructions at this link:<br />
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
I turned off smpd.exe as a service, since Condor will manage it:<br />
<br />
smpd.exe -remove<br />
<br />
<br />
I then created the following MPI example program, and ran it with a MachineCount of 2 and of 4 at different times (I ran this on my laptop, which has 4 cores, with Condor running locally). This program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.<br />
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char *argv[])<br />
{<br />
MPI::Init(argc, argv);<br />
//<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
int size = MPI::COMM_WORLD.Get_size();<br />
char processor_name[MPI_MAX_PROCESSOR_NAME];<br />
int namelen; // filled in by MPI with the length of the name<br />
MPI::Get_processor_name(processor_name,namelen);<br />
// <br />
int buf[1];<br />
int numElements = 1;<br />
// initialize buffer with rank<br />
buf[0] = rank;<br />
// example of synchronous communication<br />
// have even ranks send to odd ranks first<br />
// assumes an even number of processes<br />
int syncTag = 123;<br />
if (rank % 2 == 0)<br />
{<br />
// send to next higher rank<br />
int dest = rank + 1;<br />
std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Send(buf,numElements,MPI::INT,dest,syncTag);<br />
// now wait for dest to send back<br />
// even though dest is the param, in the Recv call it is used as the source<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,dest,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
}<br />
else<br />
{<br />
// source is next lower rank<br />
int source = rank - 1;<br />
std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,source,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
// send rank rather than buffer, just to be sure<br />
// even though param is source, in Send call it is for dest<br />
MPI::COMM_WORLD.Send(&rank,numElements,MPI::INT,source,syncTag);<br />
}<br />
// now for asynchronous communication<br />
std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
int asyncTag = 321;<br />
int leftRank;<br />
if (rank == 0)<br />
{<br />
leftRank = size-1;<br />
}<br />
else <br />
{<br />
leftRank = (rank-1);<br />
}<br />
int rightRank;<br />
if (rank == size-1)<br />
{<br />
rightRank = 0;<br />
}<br />
else <br />
{<br />
rightRank = (rank+1);<br />
}<br />
// everyone sends to the leftRank, and receives from the rightRank<br />
// if this were synchronous, would be in danger of deadlocking because of circular waits<br />
// reset buffers to rank<br />
buf[0] = rank;<br />
std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::Request sendReq = MPI::COMM_WORLD.Isend(buf,numElements,MPI::INT,leftRank,asyncTag);<br />
MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf,numElements,MPI::INT,rightRank,asyncTag);<br />
// wait on the receipt<br />
MPI::Status status;<br />
recvReq.Wait(status);<br />
std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
//<br />
MPI::Finalize();<br />
return(0);<br />
}<br />
<br />
<br />
I then created a two-machine Condor parallel universe grid, using the same user/password combination on both machines; this user had Administrative rights. My 4-core Windows 7 Professional laptop (X86_64,WINNT61) was the '''CONDOR_HOST''' and the '''DedicatedScheduler'''. Assuming this laptop is called '''clavicle''' in the '''shoulder.com''' domain, here are the important configuration parameters in its '''condor_config.local''' file: <br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files (x86)\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD, COLLECTOR, NEGOTIATOR, SMPD_SERVER<br />
<br />
<br />
The second machine, a 2 core Windows XP desktop called '''scapula''' (INTEL,WINNT51) was a dedicated machine with the following important parameters in its '''condor_config.local''' file:<br />
<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, STARTD, SMPD_SERVER<br />
<br />
Note how it points to '''clavicle''' both as '''CONDOR_HOST''' and as the '''DedicatedScheduler''', and that it does not have '''(x86)''' in its path to '''smpd.exe'''.<br />
<br />
MPICH2 gets installed to different locations, to '''C:\Program Files\MPICH2''' for an (INTEL,WINNT51) machine, and to '''C:\Program Files (x86)\MPICH2''' for a (X86_64,WINNT61) machine. Because I needed to refer to the '''mpiexec.exe''' path in the batch file that my Condor submit file referred to, and this batch file would be the same for both machines, I opted to copy '''mpiexec.exe''' to the work directory I was using on all machines, '''C:\mpich2work'''.<br />
<br />
After setting up this configuration, I rebooted both machines to start in this configuration. At this point, the command<br />
net stop condor<br />
no longer returns correctly, since Condor now manages the '''smpd.exe''' service and does not know how to stop it.<br />
<br />
<br />
<br />
Based on the Condor submit file described in the Condor wiki page, I had to adapt my own submit file for my above MPI executable called '''mpiwork.exe''' to be the following:<br />
<br />
<br />
universe = parallel<br />
executable = mp2script.bat<br />
arguments = mpiwork.exe<br />
machine_count = 6<br />
output = out.$(NODE).log<br />
error = error.$(NODE).log<br />
log = work.$(NODE).log<br />
should_transfer_files = yes<br />
when_to_transfer_output = on_exit<br />
transfer_input_files = mpiwork.exe<br />
Requirements = ((Arch == "INTEL") && (OpSys == "WINNT51")) || ((Arch == "X86_64") && (OpSys == "WINNT61"))<br />
queue<br />
<br />
and my '''mp2script.bat''' file was adapted to be: <br />
<br />
set _CONDOR_PROCNO=%_CONDOR_PROCNO%<br />
set _CONDOR_NPROCS=%_CONDOR_NPROCS%<br />
REM If not the head node, just sleep forever<br />
if not [%_CONDOR_PROCNO%] == [0] copy con nothing<br />
REM Set this to the bin directory of MPICH installation<br />
set MPDIR="C:\mpich2work"<br />
REM run the actual mpijob<br />
%MPDIR%\mpiexec.exe -n %_CONDOR_NPROCS% -p 6666 %*<br />
exit 0<br />
<br />
with the change being that '''MPDIR''' is a location that is known to have '''mpiexec.exe''' on both machines, since I copied that file there on both machines as described above.<br />
<br />
A point of confusion: I would have expected the output of the '''mpiwork.exe''' code above to contain the names of both '''clavicle''' and '''scapula''' when I ran it on 6 nodes; instead only the name '''clavicle''' appeared. I do believe it ran on both machines and all 6 cores, as both machines' IP addresses were noted in work.parallel.log, and both machines had CPU activity while the executable ran.<br />
<br />
<br />
== MPI Based Meta-Optimization Framework ==<br />
<br />
=== High Level Architecture ===<br />
<br />
The system was designed for infrastructure reuse. The infrastructure classes take care of organizing the computation nodes and of communication between them. When a new MetaOptimization is needed, the client programmer should create subclasses of MetaOptimizer, JobDefinition, Results, and RegistrationWorker specific to their computational problem.<br />
<br />
=== MPI Aware Infrastructure C++ classes ===<br />
<br />
MasterSlave: Entry point of the application, initializes the MPI infrastructure, starts up a Master (sending it a particular MetaOptimizer subclass), starts up as many slave nodes as are necessary (sending each of them a particular RegistrationWorker subclass), and finalizes the MPI infrastructure when the computation is complete.<br />
<br />
Master: The central controlling part of the infrastructure. It manages the available Slaves, sending them JobDefinitions as needed, collects Results from the Slaves as they come in, will re-run (NOT YET IMPLEMENTED) a JobDefinition if it has failed or timed out, and shuts down all Slaves when the computation is finished. The Master owns a particular MetaOptimizer subclass. The MetaOptimizer subclass is aware of the particular computation; the Master only knows how to communicate with Slaves and has no responsibility for determining the breakdown or completion of the computation.<br />
<br />
Slave: One of the processing nodes, part of the infrastructure. Each Slave manages a RegistrationWorker (SHOULD THIS BE MORE GENERAL THAN REGISTRATION?) subclass. Slaves announce their presence to the Master, receive a JobDefinition describing a particular job, pass the JobDefinition on to the RegistrationWorker to run the job, receive back a Results upon completion of the job, and send the Results back to the Master.<br />
<br />
<br />
=== Infrastructure Event C++ classes ===<br />
<br />
MasterEvent & SlaveEvent (named for the origin of the Event) are Infrastructure Event classes; each holds a Message Wrapper class as a payload.<br />
<br />
The idea behind the job message wrapper classes was to make message wrappers that know how to bi-directionally convert their specific named parameters to array buffers, giving them a pseudo-serialization ability without requiring them to be aware of MPI specifics. Each message wrapper class is associated with a specific Infrastructure Event class (SlaveEvents own a Results, and MasterEvents own a JobDefinition). The Infrastructure Event classes take care of sending from Master classes to Slave classes, and from Slave classes to Master classes.<br />
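As a minimal sketch of this pseudo-serialization idea (illustrative only; the class name '''DemoJobDefinition''' and its fields are hypothetical, not part of the actual framework), a message wrapper converts its named parameters to and from a flat array buffer, which the infrastructure can then hand to an MPI send or receive call:<br />
<br />
```cpp
#include <vector>

// Hypothetical message wrapper. It knows how to convert its named
// parameters to and from a flat buffer of doubles, giving it a
// pseudo-serialization ability with no knowledge of MPI specifics.
class DemoJobDefinition
{
public:
  DemoJobDefinition(double x = 0, double y = 0, double z = 0)
    : m_X(x), m_Y(y), m_Z(z) {}

  // Pack the named parameters into an array buffer.
  std::vector<double> ToBuffer() const
  {
    return std::vector<double>{ m_X, m_Y, m_Z };
  }

  // Restore the named parameters from an array buffer.
  void FromBuffer(const std::vector<double>& buffer)
  {
    m_X = buffer[0];
    m_Y = buffer[1];
    m_Z = buffer[2];
  }

  double GetX() const { return m_X; }
  double GetY() const { return m_Y; }
  double GetZ() const { return m_Z; }

private:
  double m_X, m_Y, m_Z;
};
```

The Infrastructure Event class owning such a wrapper would then pass buffer.data() and buffer.size() to the appropriate MPI send or receive call.<br />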
<br />
The MetaOptimizer subclasses return JobDefinition subclasses to the Master, which then wraps the JobDefinition subclass in a MasterEvent and sends the MasterEvent to a Slave. A Slave reads a MasterEvent, extracts the JobDefinition subclass, and passes the JobDefinition subclass on to a RegistrationWorker subclass. Upon completion of the calculation for the JobDefinition, the RegistrationWorker returns a Results subclass to the Slave, which then wraps the Results in a SlaveEvent and sends the SlaveEvent to the Master. The Master reads the SlaveEvent, extracts the Results subclass, and sends this Results subclass to the MetaOptimizer.<br />
<br />
=== Computation Specific C++ classes ===<br />
<br />
JobDefinition: A Message Wrapper class that defines the parameters of a particular job; a specific subclass of JobDefinition should be used for each computation.<br />
<br />
Results: A Message Wrapper class that defines the return values of a particular job; a specific subclass of Results should be used for each computation.<br />
<br />
<br />
Computational classes:<br />
<br />
MetaOptimizer: The MetaOptimizer subclass for a particular computation takes care of determining the parameters for each job (in the form of a JobDefinition) and how many JobDefinitions will be in a round of computation. The MetaOptimizer then takes in the Results from each JobDefinition and, upon receiving Results for every JobDefinition in a round, either determines a new round of computation by recalculating the set of parameters for the new round, or terminates the computation.<br />
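The round-based control flow described above can be sketched as follows. This is a serial, illustrative stand-in rather than the framework's code: '''DemoMetaOptimizer''' and '''RunToConvergence''' are hypothetical names, and the "job" here merely halves a parameter where a real subclass would dispatch registration jobs to Slaves.<br />
<br />
```cpp
#include <utility>
#include <vector>

// Hypothetical sketch of the round-based control flow. A
// MetaOptimizer hands out one round of job parameters, receives one
// result per job, and then either produces a new round or terminates.
// Here the "computation" converges when all values drop below a
// tolerance; the real subclass would run registrations instead.
class DemoMetaOptimizer
{
public:
  explicit DemoMetaOptimizer(std::vector<double> params)
    : m_Params(std::move(params)) {}

  // Parameters for one round of computation (one JobDefinition each).
  const std::vector<double>& GetRound() const { return m_Params; }

  // Take in the Results of a round; return true if a new round is
  // needed, false if the computation should terminate.
  bool AcceptResults(const std::vector<double>& results)
  {
    m_Params = results;
    for (double r : results)
      if (r >= m_Tolerance)
        return true; // recalculate: run another round
    return false;    // converged: terminate
  }

private:
  std::vector<double> m_Params;
  double m_Tolerance = 0.1;
};

// Drive the optimizer the way the Master would, but serially: each
// "job" here is just halving one parameter.
inline int RunToConvergence(DemoMetaOptimizer& opt)
{
  int rounds = 0;
  bool more = true;
  while (more)
  {
    std::vector<double> results;
    for (double p : opt.GetRound())
      results.push_back(p / 2.0); // stand-in for a Slave running a job
    more = opt.AcceptResults(results);
    ++rounds;
  }
  return rounds;
}
```

In the real framework, the loop body would be replaced by the Master wrapping each JobDefinition in a MasterEvent, sending it to a Slave, and collecting the Results asynchronously.<br />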
<br />
RegistrationWorker: The RegistrationWorker subclass for a particular computation takes in the particular JobDefinition subclass germane to the computation, performs the actual registration computation, then returns the results in the form of a specific Results subclass that is relevant to the computation.</div>
<hr />
<div>
Condor can be installed as a ''manager'' node, an ''execute'' node, or a ''submit'' node, or any combination of these. See [http://www.cs.wisc.edu/condor/manual/v7.2/3_1Introduction.html#SECTION00411000000000000000 The Different Roles a Machine Can Play]<br />
<br />
* Manager: There can be only one central manager for your pool. This machine is the collector of information and the negotiator between resources and resource requests.<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer to the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
''' The official instructions on how to install Condor in Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here] '''. Below we present some of the tweaks we had to do to get it to work on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.) , if it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
* Make sure the server has a hostname and a domain name.<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a similar command to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or with a user having equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (For example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation scripts ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update PATH variable to include the directory ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save file and apply the change by running <br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out / log in, or even restart the machine, and check that the CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit the Condor manager configuration file and update the lines as shown below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, then in order to allow MIDAS to run Condor commands, create a link to /root/condor/etc/condor_config in /home/condor<br />
cd /home/condor<br />
ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring an Executer/Submitter in Unix ===<br />
The files that allow the server to also be used as a Condor submitter/executer were updated automatically by the installation script ''condor_install''. Nevertheless, you still need to update its configuration file.<br />
<br />
* Edit the Condor node's condor_config.local file and update the line as shown below:<br />
vi /home/condor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the lines containing ''UID_DOMAIN'' and ''FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor in Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor in Windows 7.<br />
<br />
# Download the Windows install MSI, run it, installing to "C:/condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager ( full address ).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For accounting domain enter your domain (e.g. yourdomaininternal.com)<br />
# For Email settings (I ignored this by clicking next)<br />
# for Java settings (I ignored this as we weren't using Java, by clicking next)<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked whether you want a custom install or a standard install, choose the standard install. This will install Condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When you will be running condor commands, start a cygwin or cmd prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix up some problems. My '''condor_status''' at first gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect (since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or perhaps just an IP that you would not want).<br />
<br />
<br />
I shut down condor, right clicked on C:/condor in a Windows explorer, turned off "read only" and set permissions to allow for writing. Then I edited the file "c:/condor/condor_config.local" which started out empty, so that it could pick up some replacement values (some of them didn't seem to be set properly during the install). These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following:<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourinternaldomain.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
and then restarted Condor. If you run "condor_status" and see that the Windows machine has Activity=OWNER rather than UNCLAIMED, be sure that you have added START=True. This may not be the best configuration for a Windows workstation that is in use, though; some additional configuration is probably needed to make sure a Condor job doesn't use the machine when a physical human user is there using it.<br />
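One way to approach that last point is a more restrictive START expression. As an assumption on our part (we did not test this policy ourselves), the startd publishes machine attributes such as KeyboardIdle (seconds since local keyboard/mouse activity) and LoadAvg, so a policy along these lines in condor_config.local would only start jobs on an idle workstation:<br />
<br />
```
# Hypothetical desktop-friendly policy (untested; adjust thresholds):
# only start jobs after 15 minutes without keyboard/mouse activity
# and while the machine's load average is low.
START = (KeyboardIdle > 15 * 60) && (LoadAvg < 0.3)
```

See the startd policy configuration chapter of the Condor manual for the supported attributes and complete desktop policy examples.<br />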
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here is the batch file contents for the actual job (printname.bat)<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, get a command prompt or Cygwin terminal, right click the icon to start it up, and click "run with elevated privileges" or "run as administrator".<br />
<br />
condor_master runs as a service on Windows, which controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows for the first time, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* run the condor manager <br />
condor_master<br />
<br />
* Assuming that during installation you set the type to ''manager,execute,submit'' (the default), run the following command<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after you started condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Set up Condor to start automatically<br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add the ''condor.boot'' service to all runlevels<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
#include <unistd.h><br />
#include <stdio.h><br />
int main( int argc, char** argv )<br />
{<br />
printf( "%s\n", argv[1] );<br />
fflush( stdout );<br />
sleep( 30 );<br />
return 0; <br />
}<br />
<br />
This program prints the command-line argument it is given, waits 30 seconds, then exits.<br />
<br />
Save this file as foo.c, then compile it (note the static linking):<br />
<br />
gcc foo.c -o foo --static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once so you can see how Condor will execute the job on multiple execute resources), you can <br />
change the condorjob file to be like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
<br />
We had a case where we had 6 slots, 2 were 32 bit with Arch=INTEL, 4 were 64 bit with Arch=X86_64 (but we were unaware of the bit difference at first). We ran 6 jobs and then were wondering why they would only execute on the submitting machine. So we changed the condorjob file to specify a certain architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told Condor to execute only on machines with an Intel architecture, so the job was not attempted on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. I've included this story in case it helps with debugging.<br />
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines on the Condor grid, because we do not have shared Unix users on our network. In our case this user was ''condor''.<br />
*When building BatchMake, you need to build with grid support enabled.<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows you which of the processes you have submitted to Condor are active. It will give you a cluster ID and process ID for each process.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process will start and stop all other condor daemons. condor_master will write a MasterLog and have a .master_address file. Be sure that the .master_address contains the correct IP address. If the correct IP address doesn't appear, set the correct value in the parameter <br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has, but condor may have picked a different IP than you would have wanted, perhaps the local loopback address 127.0.0.1 or a wireless instead of ethernet adapter.<br />
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes, and represents data about submitted jobs. It tracks the queueing of jobs and tries to get resources for all of its jobs to be run. When a submitted job is run, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine when an actual execution of the job is run. It takes care of system calls that need to be executed on the submitting machine for a process. There will be a condor_shadow process for each executing process of a submission machine, meaning that on a machine with a large number of submitted processes, the number of shadow daemons supported by memory or other resources could be a limitation.<br />
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution. Log files of interest include NegotiatorLog and MatchLog.<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon is used to detect user activity on an execute node, so Condor knows whether to allow execution of a job or to disallow it because a human user is currently engaged in some task.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, corresponding to [http://en.wikipedia.org/wiki/IA-64 IA64], the 64-bit Intel Itanium processor architecture. IA-64 does not cover all 64-bit Intel processors.<br />
<br />
While trying to run the ''condor_master'', the shell returned the following error message ''cannot execute binary file''<br />
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], it is possible to extract the header of an executable and determine whether a given executable can run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the different outputs, it is possible to observe that '''Intel IA-64''' isn't the right architecture.<br />
<br />
<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the location of the MPICH2\bin directory to your system path. Add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
The following was performed in a Windows CMD prompt run with Administrative Privileges (right click on the CMD executable and run it as an Administrator).<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's Windows account password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication password for smpd; this is the '''mypassphrase''' you entered above. If all is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processors since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines, and how many processes each machine should run, and the executable '''hostname''', which should return the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, you must right click on the project file (not the solution file), then click properties, and make the following additions:<br />
<br />
* Under the C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include''' directory <br />
* Under the Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib''' directory<br />
* Under the Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
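If you prefer to drive the build with CMake instead of setting these properties by hand in Visual Studio, a minimal project file might look like the sketch below. This is a hypothetical alternative, not what the original setup used; the '''MPICH2_ROOT''' path and the target name '''mpiwork''' are assumptions to adjust for your install.<br />

```cmake
cmake_minimum_required(VERSION 2.8)
project(mpiwork CXX)

# Assumed MPICH2 install location; adjust for your machine
set(MPICH2_ROOT "C:/Program Files/MPICH2")

include_directories("${MPICH2_ROOT}/include")
link_directories("${MPICH2_ROOT}/lib")

add_executable(mpiwork main.cpp)
# mpi.lib and cxx.lib match the Additional Dependencies listed above
target_link_libraries(mpiwork mpi cxx)
```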
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then, to run your exe on two different machines (I had to build a RELEASE build for this; at first I had built a DEBUG build, which caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
<br />
== MPICH2 and Condor on Windows ==<br />
<br />
To configure MPI over Condor, I set my Windows 7 laptop up as a dedicated Condor scheduler and executor, following the instructions at this link:<br />
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
I turned off smpd.exe as a service, since Condor will manage it:<br />
<br />
smpd.exe -remove<br />
<br />
<br />
I then created the following MPI example program, and ran it with a MachineCount of 2 and 4 at different times (I ran this on my laptop, which has 4 cores and had Condor running locally). This program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.<br />
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char *argv[])<br />
{<br />
MPI::Init(argc, argv);<br />
//<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
int size = MPI::COMM_WORLD.Get_size();<br />
char processor_name[MPI_MAX_PROCESSOR_NAME];<br />
int namelen;<br />
MPI::Get_processor_name(processor_name,namelen);<br />
// <br />
int buf[1];<br />
int numElements = 1;<br />
// initialize buffer with rank<br />
buf[0] = rank;<br />
// example of synchronous communication<br />
// have even ranks send to odd ranks first<br />
// assumes an even number of processes<br />
int syncTag = 123;<br />
if (rank % 2 == 0)<br />
{<br />
// send to next higher rank<br />
int dest = rank + 1;<br />
std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Send(buf,numElements,MPI::INT,dest,syncTag);<br />
// now wait for dest to send back<br />
// even though dest is the param, in the Recv call it is used as the source<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,dest,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
}<br />
else<br />
{<br />
// source is next lower rank<br />
int source = rank - 1;<br />
std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,source,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
// send rank rather than buffer, just to be sure<br />
// even though param is source, in Send call it is for dest<br />
MPI::COMM_WORLD.Send(&rank,numElements,MPI::INT,source,syncTag);<br />
}<br />
// now for asynchronous communication<br />
std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
int asyncTag = 321;<br />
int leftRank;<br />
if (rank == 0)<br />
{<br />
leftRank = size-1;<br />
}<br />
else <br />
{<br />
leftRank = (rank-1);<br />
}<br />
int rightRank;<br />
if (rank == size-1)<br />
{<br />
rightRank = 0;<br />
}<br />
else <br />
{<br />
rightRank = (rank+1);<br />
}<br />
// everyone sends to the leftRank, and receives from the rightRank<br />
// if this were synchronous, would be in danger of deadlocking because of circular waits<br />
// reset buffers to rank<br />
buf[0] = rank;<br />
std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::Request sendReq = MPI::COMM_WORLD.Isend(buf,numElements,MPI::INT,leftRank,asyncTag);<br />
MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf,numElements,MPI::INT,rightRank,asyncTag);<br />
// wait on the receipt<br />
MPI::Status status;<br />
recvReq.Wait(status);<br />
std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
//<br />
MPI::Finalize();<br />
return(0);<br />
}<br />
<br />
<br />
I then created a two machine Condor parallel universe grid, using the same user/password combination on both machines; this user had Administrative rights. My 4 core Windows 7 Professional laptop (X86_64,WINNT61) was the '''CONDOR_HOST''' and the '''DedicatedScheduler'''. Assuming this laptop is called '''clavicle''' in the '''shoulder.com''' domain, here are the important configuration parameters in its '''condor_config.local''' file: <br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files (x86)\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD, COLLECTOR, NEGOTIATOR, SMPD_SERVER<br />
<br />
<br />
The second machine, a 2 core Windows XP desktop called '''scapula''' (INTEL,WINNT51) was a dedicated machine with the following important parameters in its '''condor_config.local''' file:<br />
<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, STARTD, SMPD_SERVER<br />
<br />
Note how it points to '''clavicle''' both as '''CONDOR_HOST''' and as the '''DedicatedScheduler''', and that it does not have '''(x86)''' in its path to '''smpd.exe'''.<br />
<br />
MPICH2 gets installed to different locations: '''C:\Program Files\MPICH2''' on an (INTEL,WINNT51) machine, and '''C:\Program Files (x86)\MPICH2''' on a (X86_64,WINNT61) machine. Because I needed to refer to the '''mpiexec.exe''' path in the batch file that my Condor submit file referred to, and this batch file would be the same for both machines, I opted to copy '''mpiexec.exe''' to the work directory I was using on all machines, '''C:\mpich2work'''.<br />
<br />
After setting up this configuration, I rebooted both machines so it would take effect. At this point, the command<br />
net stop condor<br />
no longer returns correctly, since Condor is managing the '''smpd.exe''' service now and does not know how to stop that service.<br />
<br />
<br />
<br />
Based on the Condor submit file described in the Condor wiki page, I had to adapt my own submit file for my above MPI executable called '''mpiwork.exe''' to be the following:<br />
<br />
<br />
universe = parallel<br />
executable = mp2script.bat<br />
arguments = mpiwork.exe<br />
machine_count = 6<br />
output = out.$(NODE).log<br />
error = error.$(NODE).log<br />
log = work.$(NODE).log<br />
should_transfer_files = yes<br />
when_to_transfer_output = on_exit<br />
transfer_input_files = mpiwork.exe<br />
Requirements = ((Arch == "INTEL") && (OpSys == "WINNT51")) || ((Arch == "X86_64") && (OpSys == "WINNT61"))<br />
queue<br />
<br />
and my '''mp2script.bat''' file was adapted to be: <br />
<br />
set _CONDOR_PROCNO=%_CONDOR_PROCNO%<br />
set _CONDOR_NPROCS=%_CONDOR_NPROCS%<br />
REM If not the head node, just sleep forever<br />
if not [%_CONDOR_PROCNO%] == [0] copy con nothing<br />
REM Set this to the bin directory of MPICH installation<br />
set MPDIR="C:\mpich2work"<br />
REM run the actual mpijob<br />
%MPDIR%\mpiexec.exe -n %_CONDOR_NPROCS% -p 6666 %*<br />
exit 0<br />
<br />
with the change being that '''MPDIR''' is a location that is known to have '''mpiexec.exe''' on both machines, since I copied that file there on both machines as described above.<br />
<br />
A point of confusion I have is that I would have expected the '''mpiwork.exe''' output to include the names of both '''clavicle''' and '''scapula''' when I ran it on 6 nodes; instead only the name '''clavicle''' appeared. I do believe that this was running on both machines and all 6 cores, as both machines' IP addresses were noted in the work.parallel.log, and both machines had CPU activity at the time the executable ran.<br />
<br />
<br />
== MPI Based Meta-Optimization Framework ==<br />
<br />
=== High Level Architecture ===<br />
<br />
The system was designed for infrastructure reuse. The infrastructure classes take care of the organization of the computation nodes and the communication between them. When a new MetaOptimization is needed, the client programmer should create subclasses of MetaOptimizer, JobDefinition, Results, and RegistrationWorker specific to their computational problem.<br />
<br />
=== MPI Aware Infrastructure C++ classes ===<br />
<br />
MasterSlave: Entry point of the application, initializes the MPI infrastructure, starts up a Master (sending it a particular MetaOptimizer subclass), starts up as many slave nodes as are necessary (sending each of them a particular RegistrationWorker subclass), and finalizes the MPI infrastructure when the computation is complete.<br />
<br />
Master: The central controlling part of the infrastructure. It manages the available Slaves, sending them JobDefinitions as needed, collects Results from the Slaves as they come in, will re-run (NOT YET IMPLEMENTED) a JobDefinition if it has failed or timed out, and shuts down all Slaves when the computation is finished. The Master owns a particular MetaOptimizer subclass. The MetaOptimizer subclass is aware of the particular computation; the Master only knows how to communicate with Slaves and does not have any responsibility for determining the breakdown or completion of the computation.<br />
<br />
Slave: One of the processing nodes, part of the infrastructure. Each slave manages a RegistrationWorker (SHOULD THIS BE MORE GENERAL THAN REGISTRATION?) subclass. Slaves announce their presence to the Master, receive a JobDefinition describing a particular job, pass the JobDefinition on to the RegistrationWorker to run the job, receive back a Results upon completion of the job, and send the Results back to the Master.<br />
<br />
<br />
=== Infrastructure Event C++ classes ===<br />
<br />
There are two kinds of these classes: Job Message Wrapper classes and Infrastructure Event classes.<br />
<br />
JobDefinition & Results are Job Message Wrapper classes.<br />
<br />
MasterEvent & SlaveEvent (named for the origin of the Event) are Infrastructure event classes.<br />
<br />
<br />
The idea behind the job message wrapper classes was to make message wrappers that know how to bi-directionally convert their specific named parameters to array buffers, giving them a pseudo-serialization ability without making them aware of MPI specifics. Each message wrapper class is associated with a specific Infrastructure Event class (SlaveEvents own a Results, and MasterEvents own a JobDefinition). The Infrastructure Event classes take care of sending from Master classes to Slave classes and from Slave classes to Master classes, and hold a message wrapper payload.<br />
<br />
The MetaOptimizer subclasses return JobDefinition subclasses to the Master, which then wraps the JobDefinition subclass in a MasterEvent and sends the MasterEvent to a Slave. A Slave reads a MasterEvent, extracts the JobDefinition subclass, and passes it to a RegistrationWorker subclass. Upon completion of the calculation for the JobDefinition, the RegistrationWorker returns a Results subclass to the Slave, which then wraps the Results in a SlaveEvent and sends the SlaveEvent to the Master. The Master reads the SlaveEvent, extracts the Results subclass, and sends this Results subclass to the MetaOptimizer.<br />
<br />
JobDefinition: Defines the parameters of a particular job, a specific subclass of JobDefinition should be used for a particular computation.<br />
<br />
Results: Defines the return values of a particular job, a specific subclass of Results should be used for a particular computation.<br />
<br />
<br />
Computational classes:<br />
<br />
MetaOptimizer: The MetaOptimizer subclass for a particular computation takes care of determining the parameters for a particular job (in the form of a JobDefinition), and determining how many JobDefinitions will be in a round of computation. The MetaOptimizer then takes in the Results from each JobDefinition, and upon receiving Results for each JobDefinition in a round, can either determine a new round of computation by recalculating the new set of parameters for the new round, or else terminate the computation.<br />
<br />
RegistrationWorker: The RegistrationWorker subclass for a particular computation will take in a particular JobDefinition subclass germane to the computation, perform the actual Registration computation, then return the results in the form of a specific Results subclass that is relevant to the computation.</div>
<hr />
<div>Condor can be installed as a ''manager'' node, an ''execute'' node, a ''submit'' node, or any combination of these. See [http://www.cs.wisc.edu/condor/manual/v7.2/3_1Introduction.html#SECTION00411000000000000000 The Different Roles a Machine Can Play]<br />
<br />
* Manager: There can be only one central manager for your pool. This machine is the collector of information and the negotiator between resources and resource requests.<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer to the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
''' The official instructions on how to install Condor on Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here] '''. Below we present some of the tweaks we had to do to get it to work on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.). If it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a similar command to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or with a user having equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (For example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation script ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update PATH variable to include the directory ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save the file and apply the change by running <br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out / log in, or even restart the machine, and you should be able to check that the CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit the Condor manager configuration file and update the lines referenced below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, in order to allow Midas to run condor commands, create a link to /root/condor/etc/condor_config in /home/condor<br />
cd /home/condor<br />
ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring an Executor/Submitter in Unix ===<br />
The files that allow the server to also be used as a Condor submitter/executor were automatically updated while running the installation script ''condor_install''. Nevertheless, you still need to update its configuration file.<br />
<br />
* Edit the Condor node's ''condor_config.local'' and update the line referenced below:<br />
vi /home/condor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the line having ''UID_DOMAIN'' and '' FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor in Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor in Windows 7.<br />
<br />
# Download the Windows install MSI, run it, installing to "C:/condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager ( full address ).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For accounting domain enter your domain (e.g. yourdomaininternal.com)<br />
# For Email settings (I ignored this by clicking next)<br />
# for Java settings (I ignored this as we weren't using Java, by clicking next)<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked if you want a custom install or a standard install, choose the standard install. This will install Condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When running Condor commands, start a Cygwin or cmd prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix up some problems. My '''condor_status''' at first gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect: since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or perhaps just an IP that you would not want.<br />
<br />
<br />
I shut down condor, right clicked on C:/condor in a Windows explorer, turned off "read only" and set permissions to allow for writing. Then I edited the file "c:/condor/condor_config.local" which started out empty, so that it could pick up some replacement values (some of them didn't seem to be set properly during the install). These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following:<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourinternaldomain.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
and then restarted Condor. If you run '''condor_status''' and see that the Windows machine has Activity=OWNER rather than UNCLAIMED, be sure that you have added START=True. But this may not be the best configuration for a Windows workstation that is in use; some additional configuration is probably needed to make sure a Condor job doesn't use the machine while a physical human user is there using it.<br />
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here are the contents of the batch file for the actual job (printname.bat)<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, open a command prompt or Cygwin terminal with elevated privileges: right click its icon and choose "Run as administrator".<br />
<br />
On Windows, condor_master runs as a service that controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows for the first time, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* run the condor manager <br />
condor_master<br />
<br />
* Assuming that during installation you set up the type as ''manager,execute,submit'' (the default), run the following command:<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after you started condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Set up Condor to start automatically at boot <br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add the ''condor.boot'' service to all runlevels<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
#include <unistd.h><br />
#include <stdio.h><br />
int main( int argc, char** argv )<br />
{<br />
/* echo the first command line argument, guarding against a missing one */<br />
printf( "%s\n", argc > 1 ? argv[1] : "(no argument)" );<br />
fflush( stdout );<br />
sleep( 30 );<br />
return 0; <br />
}<br />
<br />
This exe will echo the command line argument it is given, wait 30 seconds, then exit.<br />
<br />
Save this file as foo.c, then compile it as follows (note the static linking).<br />
<br />
gcc foo.c -o foo -static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once so you can see how Condor will execute the job on multiple execute resources), you can <br />
change the condorjob file to be like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
<br />
We had a case where we had 6 slots: 2 were 32 bit with Arch=INTEL, and 4 were 64 bit with Arch=X86_64 (but we were unaware of the bit difference at first). We ran 6 jobs and then wondered why they would only execute on the submitting machine. So we changed the condorjob file to specify a certain architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told Condor to execute only on machines with Intel architecture, so the job was not attempted on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. I've included this story in case it helps with debugging.<br />
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was Condor.<br />
*When building BatchMake, you need to build it with grid support enabled.<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows the jobs you have submitted to Condor that are still active. It gives you a cluster ID and process ID for each job.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process will start and stop all other condor daemons. condor_master will write a MasterLog and have a .master_address file. Be sure that the .master_address contains the correct IP address. If the correct IP address doesn't appear, set the correct value in the parameter <br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has, but Condor may have picked a different IP than you wanted, perhaps the local loopback address 127.0.0.1 or a wireless adapter instead of the ethernet one.<br />
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes, and represents data about submitted jobs. It tracks the queueing of jobs and tries to get resources for all of its jobs to be run. When a job submitted is run, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine whenever an actual execution of a job is run. It takes care of system calls that need to be executed on the submitting machine for a process. There will be a condor_shadow process for each executing process from a submission machine, meaning that on a machine with a large number of submitted processes, memory or other resources could limit the number of shadow daemons.<br />
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution. Log files of interest include NegotiatorLog and MatchLog.<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon is used to detect user activity on an execute node, so Condor knows whether to allow execution of a job or to disallow it because a human user is currently engaged in some task.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, corresponding to [http://en.wikipedia.org/wiki/IA-64 IA64], the 64-bit Intel Itanium processor. This does not cover all 64-bit Intel processors.<br />
<br />
While trying to run ''condor_master'', the shell returned the error message ''cannot execute binary file''.<br />
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], you can extract the header of an executable and determine whether it can run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the three outputs, we can see that '''Intel IA-64''' is not the right architecture for this machine.<br />
<br />
<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the location of the MPICH2\bin directory to your system path. Add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
The following was performed in a Windows CMD prompt run with Administrative Privileges (right click on the CMD executable and run it as an Administrator).<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's Windows account password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication password for smpd; this is the '''mypassphrase''' you entered above. If all is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processors since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines, and how many processes each machine should run, and the executable '''hostname''', which should return the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, you must right click on the project file (not the solution file), then click properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then to run your exe on two different machines (I had to build a RELEASE for this, at first I had built a DEBUG build, but this caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
<br />
== MPICH2 and Condor on Windows ==<br />
<br />
To configure MPI over Condor, I set up my Windows 7 laptop as a dedicated Condor scheduler and executer, following the instructions at this link:<br />
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
I turned off smpd.exe as a service, since Condor will manage it:<br />
<br />
smpd.exe -remove<br />
<br />
<br />
I then created the following MPI example program and ran it with a MachineCount of 2 and of 4 at different times (I ran this on my laptop, which has 4 cores and had Condor running locally). This program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.<br />
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char *argv[])<br />
{<br />
MPI::Init(argc, argv);<br />
//<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
int size = MPI::COMM_WORLD.Get_size();<br />
char processor_name[MPI_MAX_PROCESSOR_NAME];<br />
int namelen;<br />
MPI::Get_processor_name(processor_name,namelen);<br />
// <br />
int buf[1];<br />
int numElements = 1;<br />
// initialize buffer with rank<br />
buf[0] = rank;<br />
// example of synchronous communication<br />
// have even ranks send to odd ranks first<br />
// assumes an even number of processes<br />
int syncTag = 123;<br />
if (rank % 2 == 0)<br />
{<br />
// send to next higher rank<br />
int dest = rank + 1;<br />
std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Send(buf,numElements,MPI::INT,dest,syncTag);<br />
// now wait for dest to send back<br />
// even though dest is the param, in the Recv call it is used as the source<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,dest,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
}<br />
else<br />
{<br />
// source is next lower rank<br />
int source = rank - 1;<br />
std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,source,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
// send rank rather than buffer, just to be sure<br />
// even though param is source, in Send call it is for dest<br />
MPI::COMM_WORLD.Send(&rank,numElements,MPI::INT,source,syncTag);<br />
}<br />
// now for asynchronous communication<br />
std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
int asyncTag = 321;<br />
int leftRank;<br />
if (rank == 0)<br />
{<br />
leftRank = size-1;<br />
}<br />
else <br />
{<br />
leftRank = (rank-1);<br />
}<br />
int rightRank;<br />
if (rank == size-1)<br />
{<br />
rightRank = 0;<br />
}<br />
else <br />
{<br />
rightRank = (rank+1);<br />
}<br />
// everyone sends to the leftRank, and receives from the rightRank<br />
// if this were synchronous, would be in danger of deadlocking because of circular waits<br />
// reset buffers to rank<br />
buf[0] = rank;<br />
std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::Request sendReq = MPI::COMM_WORLD.Isend(buf,numElements,MPI::INT,leftRank,asyncTag);<br />
MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf,numElements,MPI::INT,rightRank,asyncTag);<br />
// wait on the receipt<br />
MPI::Status status;<br />
recvReq.Wait(status);<br />
std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
//<br />
MPI::Finalize();<br />
return(0);<br />
}<br />
<br />
<br />
I then created a two-machine Condor parallel universe grid, using the same user/password combination on both machines; this user had Administrative rights. My 4-core Windows 7 Professional laptop (X86_64,WINNT61) was the '''CONDOR_HOST''' and the '''DedicatedScheduler'''. Assuming this laptop is called '''clavicle''' in the '''shoulder.com''' domain, here are the important configuration parameters in its '''condor_config.local''' file: <br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files (x86)\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD, COLLECTOR, NEGOTIATOR, SMPD_SERVER<br />
<br />
<br />
The second machine, a 2 core Windows XP desktop called '''scapula''' (INTEL,WINNT51) was a dedicated machine with the following important parameters in its '''condor_config.local''' file:<br />
<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, STARTD, SMPD_SERVER<br />
<br />
Note how it points to '''clavicle''' both as '''CONDOR_HOST''' and as the '''DedicatedScheduler''', and that it does not have '''(x86)''' in its path to '''smpd.exe'''.<br />
<br />
MPICH2 gets installed to different locations, to '''C:\Program Files\MPICH2''' for an (INTEL,WINNT51) machine, and to '''C:\Program Files (x86)\MPICH2''' for a (X86_64,WINNT61) machine. Because I needed to refer to the '''mpiexec.exe''' path in the batch file that my Condor submit file referred to, and this batch file would be the same for both machines, I opted to copy '''mpiexec.exe''' to the work directory I was using on all machines, '''C:\mpich2work'''.<br />
<br />
After setting up this configuration, I rebooted both machines to start in this configuration. At this point, the command<br />
net stop condor<br />
no longer returns correctly, since Condor is managing the '''smpd.exe''' service now and does not know how to stop that service.<br />
<br />
<br />
<br />
Based on the Condor submit file described in the Condor wiki page, I had to adapt my own submit file for my above MPI executable called '''mpiwork.exe''' to be the following:<br />
<br />
<br />
universe = parallel<br />
executable = mp2script.bat<br />
arguments = mpiwork.exe<br />
machine_count = 6<br />
output = out.$(NODE).log<br />
error = error.$(NODE).log<br />
log = work.$(NODE).log<br />
should_transfer_files = yes<br />
when_to_transfer_output = on_exit<br />
transfer_input_files = mpiwork.exe<br />
Requirements = ((Arch == "INTEL") && (OpSys == "WINNT51")) || ((Arch == "X86_64") && (OpSys == "WINNT61"))<br />
queue<br />
<br />
and my '''mp2script.bat''' file was adapted to be: <br />
<br />
set _CONDOR_PROCNO=%_CONDOR_PROCNO%<br />
set _CONDOR_NPROCS=%_CONDOR_NPROCS%<br />
REM If not the head node, just sleep forever<br />
if not [%_CONDOR_PROCNO%] == [0] copy con nothing<br />
REM Set this to the bin directory of MPICH installation<br />
set MPDIR="C:\mpich2work"<br />
REM run the actual mpijob<br />
%MPDIR%\mpiexec.exe -n %_CONDOR_NPROCS% -p 6666 %*<br />
exit 0<br />
<br />
with the change being that '''MPDIR''' is a location that is known to have '''mpiexec.exe''' on both machines, since I copied that file there on both machines as described above.<br />
<br />
A point of confusion I have is that I would have expected the '''mpiwork.exe''' code above to print the names of both '''clavicle''' and '''scapula''' in the output when I ran it on 6 nodes. Instead only the name '''clavicle''' appeared. I do believe the job was running on both machines and all 6 cores, as both machines' IP addresses were noted in the work.parallel.log, and both machines had CPU activity at the time the executable ran.<br />
<br />
<br />
== MPI Based Meta-Optimization Framework ==<br />
<br />
High Level Architecture: The system was designed for infrastructure reuse. The infrastructure classes take care of organizing the computation nodes and of communication between them. When a new MetaOptimization is needed, the client programmer should be able to create subclasses of MetaOptimizer, JobDefinition, Results, and RegistrationWorker specific to their computational problem.<br />
<br />
<br />
Here are the C++ classes:<br />
<br />
MPI Aware Infrastructure classes:<br />
<br />
MasterSlave: Entry point of the application, initializes the MPI infrastructure, starts up a Master (sending it a particular MetaOptimizer subclass), starts up as many slave nodes as are necessary (sending each of them a particular RegistrationWorker subclass), and finalizes the MPI infrastructure when the computation is complete.<br />
<br />
Master: The central controlling part of the infrastructure. It manages the available Slaves, sending them JobDefinitions as needed, collects Results from the Slaves as they come in, will re-run (NOT YET IMPLEMENTED) a JobDefinition if it has failed or timed out, and shuts down all Slaves when the computation is finished. The Master owns a particular MetaOptimizer subclass. The MetaOptimizer subclass is aware of the particular computation; the Master only knows how to communicate with Slaves and has no responsibility for determining the breakdown or completion of the computation.<br />
<br />
Slave: One of the processing nodes, part of the infrastructure. Each Slave manages a RegistrationWorker (SHOULD THIS BE MORE GENERAL THAN REGISTRATION?) subclass. Slaves announce their presence to the Master, receive a JobDefinition describing a particular job, pass the JobDefinition on to the RegistrationWorker to run the job, receive back a Results object upon completion of the job, and send the Results back to the Master.<br />
<br />
<br />
Job Message Wrapper classes and Infrastructure Event classes:<br />
<br />
JobDefinition & Results are Job Message Wrapper classes.<br />
<br />
MasterEvent & SlaveEvent (named for the origin of the Event) are Infrastructure event classes.<br />
<br />
<br />
The idea behind the job message wrapper classes was to make message wrappers that know how to bi-directionally convert their specific named parameters to array buffers, giving them a pseudo-serialization ability without having to be aware of MPI specifics. Each message wrapper class is associated with a specific Infrastructure Event class (SlaveEvents own a Results, and MasterEvents own a JobDefinition). The Infrastructure Event classes take care of sending from Master classes to Slave classes and from Slave classes to Master classes, and hold a message wrapper payload.<br />
<br />
The MetaOptimizer subclasses will return JobDefinition subclasses to the Master, which then wraps the JobDefinition subclass in a MasterEvent and sends the MasterEvent to a Slave. A Slave reads a MasterEvent, extracts the JobDefinition subclass, and sends the JobDefinition subclass to a RegistrationWorker subclass. Upon completion of the calculation for the JobDefinition, the RegistrationWorker returns a Results subclass to the Slave, which then wraps the Results in a SlaveEvent and sends the SlaveEvent to the Master. The Master reads the SlaveEvent, extracts the Results subclass, and sends this Results subclass to the MetaOptimizer.<br />
<br />
JobDefinition: Defines the parameters of a particular job, a specific subclass of JobDefinition should be used for a particular computation.<br />
<br />
Results: Defines the return values of a particular job, a specific subclass of Results should be used for a particular computation.<br />
<br />
<br />
Computational classes:<br />
<br />
MetaOptimizer: The MetaOptimizer subclass for a particular computation takes care of determining the parameters for a particular job (in the form of a JobDefinition), and determining how many JobDefinitions will be in a round of computation. The MetaOptimizer then takes in the Results from each JobDefinition, and upon receiving Results for each JobDefinition in a round, can either determine a new round of computation by recalculating the new set of parameters for the new round, or else terminate the computation.<br />
<br />
RegistrationWorker: The RegistrationWorker subclass for a particular computation will take in a particular JobDefinition subclass germane to the computation, perform the actual Registration computation, then return the results in the form of a specific Results subclass that is relevant to the computation.</div>Michael.grauerhttps://public.kitware.com/Wiki/index.php?title=Proposals:Condor&diff=41099Proposals:Condor2011-06-23T17:45:23Z<p>Michael.grauer: /* MPI Based Meta-Optimization Framework */</p>
<hr />
<div><br />
Condor can be installed as a ''manager'' node, an ''execute'' node, or a ''submit'' node, or any combination of these. See [http://www.cs.wisc.edu/condor/manual/v7.2/3_1Introduction.html#SECTION00411000000000000000 The Different Roles a Machine Can Play]<br />
<br />
* Manager: There can be only one central manager for your pool. This machine is the collector of information and the negotiator between resources and resource requests.<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer to the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
''' The official instructions on how to install Condor on Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here] '''. Below we present some of the tweaks we had to make to get it to work on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.). If it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a similar command to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or with a user having equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (For example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation scripts ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update PATH variable to include the directory ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save file and apply the change by running <br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out and log back in (or even restart the machine), and check that the CONDOR_CONFIG and PATH environment variables are still set.<br />
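The checks above can be wrapped in a small script; a minimal sketch, assuming the example install prefix /root/condor used in this guide (the helper name ''check_condor_env'' is ours, not part of Condor):<br />

```shell
# check_condor_env PREFIX : verify CONDOR_CONFIG is set and PREFIX/bin is on PATH
check_condor_env() {
    prefix="$1"
    if [ -z "$CONDOR_CONFIG" ]; then
        echo "CONDOR_CONFIG is not set"
        return 1
    fi
    case ":$PATH:" in
        *":$prefix/bin:"*) echo "ok: CONDOR_CONFIG=$CONDOR_CONFIG, $prefix/bin on PATH" ;;
        *) echo "warning: $prefix/bin is not on PATH"; return 1 ;;
    esac
}

# example, using the values set in /etc/environment above
CONDOR_CONFIG="/root/condor/etc/condor_config"; export CONDOR_CONFIG
PATH="/root/condor/bin:/root/condor/sbin:$PATH"
check_condor_env /root/condor
```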
<br />
* Edit condor manager config_file and update the line as referenced below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, allow MIDAS to run Condor commands by creating a link to /root/condor/etc/condor_config in /home/condor<br />
cd /home/condor<br />
ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring an Executer/Submitter in Unix ===<br />
The files that allow the server to also act as a Condor submitter/executer were updated automatically by the installation script ''condor_install''. Nevertheless, you still need to update its local configuration file.<br />
<br />
* Edit the condor node's condor_config.local and update the line referenced below:<br />
vi /home/condor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the lines for ''UID_DOMAIN'' and ''FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor in Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor in Windows 7.<br />
<br />
# Download the Windows install MSI and run it, installing to C:\condor.<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the full hostname of the central manager.<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For accounting domain enter your domain (e.g. yourdomaininternal.com)<br />
# For Email settings, I ignored these by clicking Next<br />
# For Java settings, I ignored these (we weren't using Java) by clicking Next<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access: $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked whether you want a custom or a default install, choose the default. This will install Condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When running condor commands, start a Cygwin or cmd prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix up some problems. My '''condor_status''' at first gave "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first I looked at was '''.master_address'''. The IP address listed there was incorrect (my machine has multiple IP addresses, and I needed to specify the address of my wired connection, which is on the same subnet as my COLLECTOR_HOST). Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or simply an address that you would not want.<br />
<br />
<br />
I shut down condor, right-clicked on C:\condor in Windows Explorer, turned off "read only" and set permissions to allow writing. Then I edited the file C:\condor\condor_config.local (which started out empty) so that it could pick up some replacement values, since some of them didn't seem to be set properly during the install. These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following:<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourinternaldomain.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
I then restarted Condor. If you run '''condor_status''' and see that the Windows machine has Activity=OWNER rather than UNCLAIMED, be sure that you have added START=True. This may not be the best configuration for a Windows workstation that is in active use, though; some additional configuration is probably needed to keep Condor jobs off the machine while a physical user is working on it.<br />
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here is the batch file contents for the actual job (printname.bat)<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, start a command prompt or Cygwin terminal with elevated privileges: right-click the icon and choose "run as administrator".<br />
<br />
condor_master runs as a service on Windows, which controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows for the first time, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* run the condor manager <br />
condor_master<br />
<br />
* Assuming you set the machine type to ''manager,execute,submit'' (the default) during installation, run the following command<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after you started condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Set up condor to start automatically<br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add the ''condor.boot'' service to all runlevels<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
#include <unistd.h><br />
#include <stdio.h><br />
int main( int argc, char** argv )<br />
{<br />
printf( "%s\n", argv[1] );<br />
fflush( stdout );<br />
sleep( 30 );<br />
return 0; <br />
}<br />
<br />
This program prints the command line argument it is given, waits 30 seconds, then exits.<br />
<br />
Save this file as foo.c, then compile it with (note static linking).<br />
<br />
gcc foo.c -o foo --static<br />
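To confirm the result really is statically linked, you can inspect the ''ldd'' output; a minimal sketch (the helper name ''is_static'' is ours, and the message it looks for is the one GNU ldd prints for static binaries):<br />

```shell
# is_static : succeed if the `ldd` output on stdin reports a static executable
is_static() {
    grep -q "not a dynamic executable"
}

# real usage would be:  ldd ./foo | is_static && echo "foo is static"
# simulated here with the message GNU ldd prints for a static binary:
printf '\tnot a dynamic executable\n' | is_static && echo "static"
```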
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
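Checking these three result files after the job finishes can be automated; a minimal sketch (the helper name ''check_job_output'' is ours; ''condor_wait'' is the Condor tool that blocks until the job log records completion):<br />

```shell
# check_job_output BASE EXPECTED : fail if BASE.err is non-empty,
# succeed if EXPECTED appears in BASE.out
check_job_output() {
    base="$1"; expected="$2"
    if [ -s "$base.err" ]; then
        echo "job wrote errors:"; cat "$base.err"; return 1
    fi
    grep -q "$expected" "$base.out" && echo "ok: found '$expected' in $base.out"
}

# typical usage after submitting:
#   condor_submit condorjob && condor_wait condorjob.log && check_job_output condorjob helloworld
```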
<br />
<br />
If you want to test this job on multiple slots (say 2 at once so you can see how Condor will execute the job on multiple execute resources), you can <br />
change the condorjob file to be like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
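Only the attributes that change need to be restated before each additional ''Queue'' statement, so writing such a file for N jobs is easy to script; a minimal sketch (the helper name ''make_multi_submit'' is ours):<br />

```shell
# make_multi_submit EXE N : print a vanilla-universe submit description
# queueing N jobs, each with its own log/err/out files and argument
make_multi_submit() {
    exe="$1"; n="$2"
    printf 'universe = vanilla\nexecutable = %s\n' "$exe"
    printf 'should_transfer_files = YES\nwhen_to_transfer_output = ON_EXIT\n'
    i=1
    while [ "$i" -le "$n" ]; do
        printf 'log = condorjob%d.log\nerror = condorjob%d.err\noutput = condorjob%d.out\n' "$i" "$i" "$i"
        printf 'arguments = "helloworld%d"\nQueue\n' "$i"
        i=$((i + 1))
    done
}

make_multi_submit foo 2 > condorjob2jobs
```

The generated file matches the two-job description above and could then be submitted with ''condor_submit condorjob2jobs''.<br />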
<br />
We had a case with 6 slots: 2 were 32-bit with Arch=INTEL and 4 were 64-bit with Arch=X86_64 (though we were unaware of the difference at first). We ran 6 jobs and wondered why they would only execute on the submitting machine. So we changed the condorjob file to require a specific architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told Condor to execute only on machines with an INTEL architecture, so it no longer tried to execute on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. I've included this story in case it helps with debugging.<br />
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was ''condor''.<br />
*When building BatchMake, you need to build with grid support turned on<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows the active processes that you have submitted to Condor, giving a cluster ID and process ID for each.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process starts and stops all other Condor daemons. condor_master writes a MasterLog and a '''.master_address''' file. Be sure that '''.master_address''' contains the correct IP address. If it does not, set the correct value via the parameter <br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has, but condor may have picked a different IP than you would have wanted, perhaps the local loopback address 127.0.0.1 or a wireless instead of ethernet adapter.<br />
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes, and represents data about submitted jobs. It tracks the queueing of jobs and tries to get resources for all of its jobs to be run. When a job submitted is run, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine when an actual execution of the job is run. It will take care of systems calls that need to be executed on the submitting machine for a process. There will be a condor_shadow process for each executing process of a submission machine, meaning that on a machine with a large number of submitted processes, the number of shadow daemons supported by memory or other resources could be a limitation.<br />
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution. Log files of interest include NegotiatorLog and MatchLog<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon detects user activity on an execute node, so Condor knows whether to allow a job to run or to disallow it because a human user is currently engaged in some task.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, corresponding to [http://en.wikipedia.org/wiki/IA-64 IA-64], i.e. the 64-bit Intel Itanium processor. This does not cover all 64-bit Intel processors.<br />
<br />
While trying to run ''condor_master'', the shell returned the error message ''cannot execute binary file''.<br />
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], it's possible to extract the header of an executable and determine whether a given executable can run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the outputs, one can see that the '''Intel IA-64''' architecture is not the right one for this machine.<br />
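The comparison can be automated by extracting just the ''Machine:'' field from the header; a minimal sketch (the helper name ''extract_machine'' is ours; it parses ''readelf -h'' output, simulated here with an abridged sample so the snippet is self-contained):<br />

```shell
# extract_machine : print the "Machine:" field from `readelf -h` output on stdin
extract_machine() {
    sed -n 's/^ *Machine: *//p'
}

# abridged sample of the IA-64 condor_master header shown above
sample='ELF Header:
  Class:                             ELF64
  Machine:                           Intel IA-64'

printf '%s\n' "$sample" | extract_machine
# real usage:  readelf -h ./sbin/condor_master | extract_machine
#              readelf -h /bin/ls | extract_machine
```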
<br />
<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the location of the MPICH2\bin directory to your system path. Add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
The following was performed in a Windows CMD prompt run with Administrative Privileges (right click on the CMD executable and run it as an Administrator).<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's Windows account password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication password for smpd; this is the '''mypassphrase''' you entered above. If everything is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processors since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines, and how many processes each machine should run, and the executable '''hostname''', which should return the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, you must right click on the project file (not the solution file), then click properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then, to run your exe on two different machines (I had to build a RELEASE build for this; at first I had built a DEBUG build, which caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
<br />
== MPICH2 and Condor on Windows ==<br />
<br />
To configure MPI over Condor, I set up my Windows 7 laptop as a dedicated Condor scheduler and executer, following the instructions at this link:<br />
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
I turned off smpd.exe as a service, since Condor will manage it:<br />
<br />
smpd.exe -remove<br />
<br />
<br />
I then created the following MPI example program, and ran it with a MachineCount of 2 and, at other times, 4 (I ran this on my laptop, which has 4 cores, with Condor running locally). This program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.<br />
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char *argv[])<br />
{<br />
MPI::Init(argc, argv);<br />
//<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
int size = MPI::COMM_WORLD.Get_size();<br />
char processor_name[MPI_MAX_PROCESSOR_NAME];<br />
int namelen;<br />
MPI::Get_processor_name(processor_name,namelen);<br />
// <br />
int buf[1];<br />
int numElements = 1;<br />
// initialize buffer with rank<br />
buf[0] = rank;<br />
// example of synchronous communication<br />
// have even ranks send to odd ranks first<br />
// assumes an even number of processes<br />
int syncTag = 123;<br />
if (rank % 2 == 0)<br />
{<br />
// send to next higher rank<br />
int dest = rank + 1;<br />
std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Send(buf,numElements,MPI::INT,dest,syncTag);<br />
// now wait for dest to send back<br />
// even though dest is the param, in the Recv call it is used as the source<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,dest,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
}<br />
else<br />
{<br />
// source is next lower rank<br />
int source = rank - 1;<br />
std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,source,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
// send rank rather than buffer, just to be sure<br />
// even though param is source, in Send call it is for dest<br />
MPI::COMM_WORLD.Send(&rank,numElements,MPI::INT,source,syncTag);<br />
}<br />
// now for asynchronous communication<br />
std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
int asyncTag = 321;<br />
int leftRank;<br />
if (rank == 0)<br />
{<br />
leftRank = size-1;<br />
}<br />
else <br />
{<br />
leftRank = (rank-1);<br />
}<br />
int rightRank;<br />
if (rank == size-1)<br />
{<br />
rightRank = 0;<br />
}<br />
else <br />
{<br />
rightRank = (rank+1);<br />
}<br />
// everyone sends to the leftRank, and receives from the rightRank<br />
// if this were synchronous, would be in danger of deadlocking because of circular waits<br />
// reset buffers to rank<br />
buf[0] = rank;<br />
std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::Request sendReq = MPI::COMM_WORLD.Isend(buf,numElements,MPI::INT,leftRank,asyncTag);<br />
MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf,numElements,MPI::INT,rightRank,asyncTag);<br />
// wait on the receipt<br />
MPI::Status status;<br />
recvReq.Wait(status);<br />
std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
//<br />
MPI::Finalize();<br />
return(0);<br />
}<br />
<br />
<br />
I then created a two-machine Condor parallel universe grid, using the same user/password combination on both machines; this user had Administrative rights. My 4-core Windows 7 Professional laptop (X86_64,WINNT61) was the '''CONDOR_HOST''' and the '''DedicatedScheduler'''. Assuming this laptop is called '''clavicle''' in the '''shoulder.com''' domain, here are the important configuration parameters in its '''condor_config.local''' file: <br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files (x86)\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD, COLLECTOR, NEGOTIATOR, SMPD_SERVER<br />
<br />
<br />
The second machine, a 2 core Windows XP desktop called '''scapula''' (INTEL,WINNT51) was a dedicated machine with the following important parameters in its '''condor_config.local''' file:<br />
<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, STARTD, SMPD_SERVER<br />
<br />
Note how it points to '''clavicle''' both as '''CONDOR_HOST''' and as the '''DedicatedScheduler''', and that it does not have '''(x86)''' in its path to '''smpd.exe'''.<br />
<br />
MPICH2 gets installed to different locations, to '''C:\Program Files\MPICH2''' for an (INTEL,WINNT51) machine, and to '''C:\Program Files (x86)\MPICH2''' for a (X86_64,WINNT61) machine. Because I needed to refer to the '''mpiexec.exe''' path in the batch file that my Condor submit file referred to, and this batch file would be the same for both machines, I opted to copy '''mpiexec.exe''' to the work directory I was using on all machines, '''C:\mpich2work'''.<br />
<br />
After setting up this configuration, I rebooted both machines to start in this configuration. At this point, the command<br />
net stop condor<br />
no longer returns correctly, since Condor is managing the '''smpd.exe''' service now and does not know how to stop that service.<br />
<br />
<br />
<br />
Based on the Condor submit file described in the Condor wiki page, I had to adapt my own submit file for my above MPI executable called '''mpiwork.exe''' to be the following:<br />
<br />
<br />
universe = parallel<br />
executable = mp2script.bat<br />
arguments = mpiwork.exe<br />
machine_count = 6<br />
output = out.$(NODE).log<br />
error = error.$(NODE).log<br />
log = work.$(NODE).log<br />
should_transfer_files = yes<br />
when_to_transfer_output = on_exit<br />
transfer_input_files = mpiwork.exe<br />
Requirements = ((Arch == "INTEL") && (OpSys == "WINNT51")) || ((Arch == "X86_64") && (OpSys == "WINNT61"))<br />
queue<br />
<br />
and my '''mp2script.bat''' file was adapted to be: <br />
<br />
set _CONDOR_PROCNO=%_CONDOR_PROCNO%<br />
set _CONDOR_NPROCS=%_CONDOR_NPROCS%<br />
REM If not the head node, just sleep forever<br />
if not [%_CONDOR_PROCNO%] == [0] copy con nothing<br />
REM Set this to the bin directory of MPICH installation<br />
set MPDIR="C:\mpich2work"<br />
REM run the actual mpijob<br />
%MPDIR%\mpiexec.exe -n %_CONDOR_NPROCS% -p 6666 %*<br />
exit 0<br />
<br />
with the change being that '''MPDIR''' is a location that is known to have '''mpiexec.exe''' on both machines, since I copied that file there on both machines as described above.<br />
<br />
A point of confusion I have is that I would have expected the output of the '''mpiwork.exe''' code above to contain the names of both '''clavicle''' and '''scapula''' when I ran it on 6 nodes. Instead, only the name '''clavicle''' appeared. I do believe the job was running on both machines and all 6 cores, as both machines' IP addresses were noted in work.parallel.log, and both machines had CPU activity at the time the executable ran.<br />
<br />
<br />
== MPI Based Meta-Optimization Framework ==<br />
<br />
High-level architecture: the system was designed for infrastructure reuse. The infrastructure classes take care of the organization of the computation nodes and the communication between them. When a new meta-optimization is needed, the client programmer should be able to create subclasses of MetaOptimizer, JobDefinition, Results, and RegistrationWorker specific to their computational problem.<br />
<br />
<br />
Here are the C++ classes:<br />
<br />
MPI Aware Infrastructure classes:<br />
<br />
MasterSlave: Entry point of the application, initializes the MPI infrastructure, starts up a Master (sending it a particular MetaOptimizer subclass), starts up as many slave nodes as are necessary (sending each of them a particular RegistrationWorker subclass), and finalizes the MPI infrastructure when the computation is complete.<br />
<br />
Master: The central controlling part of the infrastructure. It manages the available Slaves, sending them JobDefinitions as needed, collects Results from the Slaves as they come in, will re-run (NOT YET IMPLEMENTED) a JobDefinition if it has failed or timed out, and shuts down all Slaves when the computation is finished. The Master owns a particular MetaOptimizer subclass. The MetaOptimizer subclass is aware of the particular computation; the Master only knows how to communicate with Slaves and has no responsibility for determining the breakdown or completion of the computation.<br />
<br />
Slave: One of the processing nodes, part of the infrastructure. Each Slave manages a RegistrationWorker (SHOULD THIS BE MORE GENERAL THAN REGISTRATION?) subclass. Slaves announce their presence to the Master, receive a JobDefinition describing a particular job, pass the JobDefinition on to the RegistrationWorker to run the job, receive back a Results upon completion of the job, and send the Results back to the Master.<br />
<br />
<br />
Job Message Wrapper classes and Infrastructure Event classes:<br />
<br />
JobDefinition & Results are Job Message Wrapper classes.<br />
<br />
MasterEvent & SlaveEvent (named for the origin of the Event) are Infrastructure event classes.<br />
<br />
<br />
The idea behind the job message wrapper classes was to make message wrappers that know how to bi-directionally convert their specific named parameters to array buffers, thus giving them a pseudo-serialization ability without requiring them to be aware of MPI specifics. Each message wrapper class is associated with a specific Infrastructure Event class (SlaveEvents own a Results, and MasterEvents own a JobDefinition). The Infrastructure Event classes take care of sending from Master classes to Slave classes, and from Slave classes to Master classes, and hold a message wrapper payload.<br />
<br />
The MetaOptimizer subclasses will return JobDefinition subclasses to the Master, which then wraps the JobDefinition subclass in a MasterEvent and sends the MasterEvent to a Slave. A Slave reads a MasterEvent, extracts the JobDefinition subclass, and passes the JobDefinition to a RegistrationWorker subclass. Upon completion of the calculation for the JobDefinition, the RegistrationWorker returns a Results subclass to the Slave, which then wraps the Results in a SlaveEvent and sends the SlaveEvent to the Master. The Master reads the SlaveEvent, extracts the Results subclass, and sends this Results subclass to the MetaOptimizer.<br />
<br />
JobDefinition: Defines the parameters of a particular job; a specific subclass of JobDefinition should be used for a particular computation.<br />
<br />
Results: Defines the return values of a particular job, a specific subclass of Results should be used for a particular computation.</div>Michael.grauerhttps://public.kitware.com/Wiki/index.php?title=Proposals:Condor&diff=41098Proposals:Condor2011-06-23T17:43:08Z<p>Michael.grauer: /* MPI Based Meta-Optimization Framework */</p>
<hr />
<div>= Introduction =<br />
Condor can be installed as a ''manager'' node, an ''execute'' node, or a ''submit'' node, or any combination of these roles. See [http://www.cs.wisc.edu/condor/manual/v7.2/3_1Introduction.html#SECTION00411000000000000000 The Different Roles a Machine Can Play]<br />
<br />
* Manager: There can be only one central manager for your pool. This machine is the collector of information, and the negotiator between resources and resource requests<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer to the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
''' The official instructions on how to install Condor in Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here] '''. Below we present some of the tweaks we had to make to get it to work on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.); if it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a similar command to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or as a user with equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (For example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation scripts ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update PATH variable to include the directory ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save file and apply the change by running <br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out / log in, or even restart the machine, and check that the CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit the Condor manager configuration file and update the lines as shown below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, then in order to allow MIDAS to run Condor commands, create a link to /root/condor/etc/condor_config in /home/condor<br />
cd /home/condor<br />
ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring an Executer/Submitter in Unix ===<br />
The files that allow the server to also be used as a Condor submitter/executer were automatically updated when the installation script ''condor_install'' ran. Nevertheless, you still need to update its configuration file.<br />
<br />
* Edit the node's ''condor_config.local'' file and update the line as shown below:<br />
vi /home/condor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the lines containing ''UID_DOMAIN'' and ''FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor in Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor in Windows 7.<br />
<br />
# Download the Windows install MSI, run it, installing to "C:/condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager ( full address ).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For accounting domain enter your domain (e.g. yourdomaininternal.com)<br />
# For Email settings (I ignored this by clicking next)<br />
# for Java settings (I ignored this as we weren't using Java, by clicking next)<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked if you want a custom install or a standard install, choose the standard install. This will install Condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When you run Condor commands, start a Cygwin or cmd prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix up some problems. My '''condor_status''' at first gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect (since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or perhaps just an IP that you would not want).<br />
<br />
<br />
I shut down condor, right clicked on C:/condor in a Windows explorer, turned off "read only" and set permissions to allow for writing. Then I edited the file "c:/condor/condor_config.local" which started out empty, so that it could pick up some replacement values (some of them didn't seem to be set properly during the install). These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following:<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourinternaldomain.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
and then restarted Condor. If you run '''condor_status''' and see that the Windows machine has Activity=OWNER rather than UNCLAIMED, be sure that you have added START=True. But this may not be the best configuration for a Windows workstation that is in use: some additional configuration is needed to make sure a Condor job doesn't use the machine while a physical human user is there using it.<br />
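As a starting point for such a policy, Condor's machine ClassAd attributes such as KeyboardIdle and LoadAvg can be used in ''condor_config.local''. The fragment below is an untested sketch along the lines of the example policies in the Condor manual; the thresholds are arbitrary and should be tuned for your site (note that keyboard detection relies on the condor_kbdd daemon being in the DAEMON_LIST):<br />

```
# Start jobs only after 15 minutes of keyboard idle time
# and when the machine load is low
START = (KeyboardIdle > 15 * 60) && (LoadAvg < 0.3)
# Suspend a running job as soon as the user returns
SUSPEND = (KeyboardIdle < 60)
# Resume once the user has been away for 5 minutes
CONTINUE = (KeyboardIdle > 5 * 60)
```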
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here is the batch file contents for the actual job (printname.bat)<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, get a command prompt or Cygwin terminal, right click the icon to start it up, and click "run with elevated privileges" or "run as administrator".<br />
<br />
condor_master runs as a service on Windows, which controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows for the first time, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* Run the Condor master<br />
condor_master<br />
<br />
* Assuming that during the installation process you set the type to ''manager,execute,submit'' (the default), run the following command<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after you started condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Set up Condor to start automatically<br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update the MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add the ''condor.boot'' service to all runlevels<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
#include <unistd.h><br />
#include <stdio.h><br />
int main( int argc, char** argv )<br />
{<br />
printf( "%s\n", argv[1] );<br />
fflush( stdout );<br />
sleep( 30 );<br />
return 0; <br />
}<br />
<br />
This program will echo the command line argument it is given, wait 30 seconds, then exit.<br />
<br />
Save this file as foo.c, then compile it (note the static linking):<br />
<br />
gcc foo.c -o foo --static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once so you can see how Condor will execute the job on multiple execute resources), you can <br />
change the condorjob file to be like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
<br />
We had a case where we had 6 slots: 2 were 32-bit with Arch=INTEL, 4 were 64-bit with Arch=X86_64 (but we were unaware of the bit difference at first). We ran 6 jobs and then wondered why they would only execute on the submitting machine. So we changed the condorjob file to specify a certain architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told Condor to execute only on machines with the INTEL architecture, so the job was not attempted on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. We've included this story in case it helps with debugging.<br />
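One way to avoid this kind of mismatch is Condor's $$() substitution mechanism, which lets the submit file name a per-platform binary and have Condor fill in the Arch and OpSys of whatever machine it matches. The fragment below is a hedged sketch based on the Condor manual's description of $$(); we did not test this exact setup, and the ''foo.INTEL.LINUX''-style file names are hypothetical:<br />

```
universe = vanilla
# Condor replaces $$(Arch) and $$(OpSys) with the values from the
# matched machine, so a binary must exist for each platform,
# e.g. foo.INTEL.LINUX and foo.X86_64.LINUX
executable = foo.$$(Arch).$$(OpSys)
requirements = (OpSys == "LINUX") && ((Arch == "INTEL") || (Arch == "X86_64"))
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
log = condorjob.log
arguments = "helloworld"
queue
```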
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was ''condor''.<br />
*When building BatchMake, you need to build with grid support turned on<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups keep Condor log files in /home/condor/localcondor/log (Unix) and C:\condor\log (Windows).<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows you which of the processes you have submitted to Condor are active. It will give you a cluster ID and process ID for each process.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process will start and stop all other condor daemons. condor_master will write a MasterLog and have a .master_address file. Be sure that the .master_address contains the correct IP address. If the correct IP address doesn't appear, set the correct value in the parameter <br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has, but condor may have picked a different IP than you would have wanted, perhaps the local loopback address 127.0.0.1 or a wireless instead of ethernet adapter.<br />
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes, and represents data about submitted jobs. It tracks the queueing of jobs and tries to get resources for all of its jobs to be run. When a submitted job is run, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine when an actual execution of the job is run. It takes care of system calls that need to be executed on the submitting machine for a process. There will be a condor_shadow process for each executing process of a submission machine, meaning that on a machine with a large number of submitted processes, the memory or other resources needed to support the shadow daemons could become a limitation.<br />
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution. Log files of interest include NegotiatorLog and MatchLog<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon is used to detect user activity on an execute node, so Condor knows whether to allow execution of a job or to disallow it because a human user is currently engaged in some task.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, corresponding to [http://en.wikipedia.org/wiki/IA-64 IA64], the 64-bit Intel Itanium processor. This does not cover all 64-bit Intel processors. <br />
<br />
While trying to run ''condor_master'', the shell returned the error message ''cannot execute binary file''<br />
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], it's possible to extract the header of an executable and determine whether it can run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the different outputs, it's possible to observe that the '''Intel IA-64''' architecture isn't the right one.<br />
<br />
<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the location of the MPICH2\bin directory to your system path, and add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
The following was performed in a Windows CMD prompt run with Administrative Privileges (right click on the CMD executable and run it as an Administrator).<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's Windows account password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication password for smpd; this is the '''mypassphrase''' you entered above. If all is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processors since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines, and how many processes each machine should run, and the executable '''hostname''', which should return the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, you must right click on the project file (not the solution file), then click properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then, to run your exe on two different machines (I had to build a RELEASE build for this; at first I had built a DEBUG build, but this caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
<br />
== MPICH2 and Condor on Windows ==<br />
<br />
To configure MPI over Condor, I set my Windows 7 laptop up as a dedicated Condor scheduler and executor, following the instructions at this link:<br />
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
I turned off smpd.exe as a service, since Condor will manage it:<br />
<br />
smpd.exe -remove<br />
<br />
<br />
I then created the following MPI example program, and ran this with a MachineCount of 2 and 4 at different times (I ran this on my laptop, which has 4 cores, and had Condor running locally). This program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.<br />
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char *argv[])<br />
{<br />
MPI::Init(argc, argv);<br />
//<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
int size = MPI::COMM_WORLD.Get_size();<br />
char processor_name[MPI_MAX_PROCESSOR_NAME];<br />
int namelen;<br />
MPI::Get_processor_name(processor_name,namelen);<br />
// <br />
int buf[1];<br />
int numElements = 1;<br />
// initialize buffer with rank<br />
buf[0] = rank;<br />
// example of synchronous communication<br />
// have even ranks send to odd ranks first<br />
// assumes an even number of processes<br />
int syncTag = 123;<br />
if (rank % 2 == 0)<br />
{<br />
// send to next higher rank<br />
int dest = rank + 1;<br />
std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Send(buf,numElements,MPI::INT,dest,syncTag);<br />
// now wait for dest to send back<br />
// even though dest is the param, in the Recv call it is used as the source<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,dest,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
}<br />
else<br />
{<br />
// source is next lower rank<br />
int source = rank - 1;<br />
std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,source,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
// send rank rather than buffer, just to be sure<br />
// even though param is source, in Send call it is for dest<br />
MPI::COMM_WORLD.Send(&rank,numElements,MPI::INT,source,syncTag);<br />
}<br />
// now for asynchronous communication<br />
std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
int asyncTag = 321;<br />
int leftRank;<br />
if (rank == 0)<br />
{<br />
leftRank = size-1;<br />
}<br />
else <br />
{<br />
leftRank = (rank-1);<br />
}<br />
int rightRank;<br />
if (rank == size-1)<br />
{<br />
rightRank = 0;<br />
}<br />
else <br />
{<br />
rightRank = (rank+1);<br />
}<br />
// everyone sends to the leftRank, and receives from the rightRank<br />
// if this were synchronous, would be in danger of deadlocking because of circular waits<br />
// reset buffers to rank<br />
buf[0] = rank;<br />
std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::Request sendReq = MPI::COMM_WORLD.Isend(buf,numElements,MPI::INT,leftRank,asyncTag);<br />
MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf,numElements,MPI::INT,rightRank,asyncTag);<br />
// wait on the receipt<br />
MPI::Status status;<br />
recvReq.Wait(status);<br />
std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
//<br />
MPI::Finalize();<br />
return(0);<br />
}<br />
<br />
<br />
I then created a two machine Condor parallel universe grid, using the same user/password combination on both machines; this user had Administrative rights. My 4 core Windows 7 Professional laptop (X86_64,WINNT61) was the '''CONDOR_HOST''' and the '''DedicatedScheduler'''. Assuming this laptop is called '''clavicle''' in the '''shoulder.com''' domain, here are the important configuration parameters in its '''condor_config.local''' file: <br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files (x86)\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD, COLLECTOR, NEGOTIATOR, SMPD_SERVER<br />
<br />
<br />
The second machine, a 2 core Windows XP desktop called '''scapula''' (INTEL,WINNT51) was a dedicated machine with the following important parameters in its '''condor_config.local''' file:<br />
<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, STARTD, SMPD_SERVER<br />
<br />
Note how it points to '''clavicle''' both as '''CONDOR_HOST''' and as the '''DedicatedScheduler''', and that it does not have '''(x86)''' in its path to the '''smpd.exe'''.<br />
<br />
MPICH2 gets installed to different locations, to '''C:\Program Files\MPICH2''' for an (INTEL,WINNT51) machine, and to '''C:\Program Files (x86)\MPICH2''' for a (X86_64,WINNT61) machine. Because I needed to refer to the '''mpiexec.exe''' path in the batch file that my Condor submit file referred to, and this batch file would be the same for both machines, I opted to copy '''mpiexec.exe''' to the work directory I was using on all machines, '''C:\mpich2work'''.<br />
<br />
After setting up this configuration, I rebooted both machines to start in this configuration. At this point, the command<br />
net stop condor<br />
no longer returns correctly, since Condor is managing the '''smpd.exe''' service now and does not know how to stop that service.<br />
<br />
<br />
<br />
Based on the Condor submit file described in the Condor wiki page, I had to adapt my own submit file for my above MPI executable called '''mpiwork.exe''' to be the following:<br />
<br />
<br />
universe = parallel<br />
executable = mp2script.bat<br />
arguments = mpiwork.exe<br />
machine_count = 6<br />
output = out.$(NODE).log<br />
error = error.$(NODE).log<br />
log = work.$(NODE).log<br />
should_transfer_files = yes<br />
when_to_transfer_output = on_exit<br />
transfer_input_files = mpiwork.exe<br />
Requirements = ((Arch == "INTEL") && (OpSys == "WINNT51")) || ((Arch == "X86_64") && (OpSys == "WINNT61"))<br />
queue<br />
<br />
and my '''mp2script.bat''' file was adapted to be: <br />
<br />
set _CONDOR_PROCNO=%_CONDOR_PROCNO%<br />
set _CONDOR_NPROCS=%_CONDOR_NPROCS%<br />
REM If not the head node, just sleep forever<br />
if not [%_CONDOR_PROCNO%] == [0] copy con nothing<br />
REM Set this to the bin directory of MPICH installation<br />
set MPDIR="C:\mpich2work"<br />
REM run the actual mpijob<br />
%MPDIR%\mpiexec.exe -n %_CONDOR_NPROCS% -p 6666 %*<br />
exit 0<br />
<br />
with the change being that '''MPDIR''' is a location that is known to have '''mpiexec.exe''' on both machines, since I copied that file there on both machines as described above.<br />
<br />
A point of confusion I have is that I would have expected the '''mpiwork.exe''' code above to show the names of both '''clavicle''' and '''scapula''' in the output when I ran it on 6 nodes. Instead only the name '''clavicle''' appeared. I do believe that this was running on both machines and all 6 cores, as both machines' IP addresses were noted in the work.parallel.log, and both machines had CPU activity at the time the executable ran.<br />
<br />
<br />
== MPI Based Meta-Optimization Framework ==<br />
<br />
<br />
Here are the C++ classes:<br />
<br />
MPI Aware Infrastructure classes:<br />
<br />
MasterSlave: Entry point of the application, initializes the MPI infrastructure, starts up a Master (sending it a particular MetaOptimizer subclass), starts up as many slave nodes as are necessary (sending each of them a particular RegistrationWorker subclass), and finalizes the MPI infrastructure when the computation is complete.<br />
<br />
Master: The central controlling part of the infrastructure. It manages the available Slaves, sending them JobDefinitions as needed, collects Results from the Slaves as they come in, will re-run (NOT YET IMPLEMENTED) a JobDefinition if it has failed or timed out, and shuts down all Slaves when the computation is finished. The Master owns a particular MetaOptimizer subclass. The MetaOptimizer subclass is aware of the particular computation; the Master only knows how to communicate with Slaves and does not have any responsibility for determining the breakdown or completion of the computation.<br />
<br />
Slave: One of the processing nodes, part of the infrastructure. Each slave manages a RegistrationWorker (SHOULD THIS BE MORE GENERAL THAN REGISTRATION?) subclass. Slaves announce their presence to the Master, receive a JobDefinition describing a particular job, pass the JobDefinition on to the RegistrationWorker to run the job, receive back a Results upon completion of the job, and send the Results back to the Master.<br />
<br />
<br />
Job Message Wrapper classes and Infrastructure Event classes:<br />
<br />
JobDefinition & Results are Job Message Wrapper classes.<br />
<br />
MasterEvent & SlaveEvent (named for the origin of the Event) are Infrastructure event classes.<br />
<br />
<br />
The idea behind the job message wrapper classes was to make message wrappers that know how to bi-directionally convert their specific named parameters to array buffers, thus giving them a pseudo-serialization ability without having to be aware of MPI specifics. Each message wrapper class is associated with a specific Infrastructure Event class (SlaveEvents own a Results, and MasterEvents own a JobDefinition). The Infrastructure Event classes take care of sending from Master classes to Slave classes, and from Slave classes to Master classes, and hold a message wrapper payload.<br />
<br />
The MetaOptimizer subclasses will return JobDefinition subclasses to the Master, which then wraps the JobDefinition subclass in a MasterEvent, and sends the MasterEvent to a Slave. A Slave reads a MasterEvent, extracts the JobDefinition subclass, and sends the JobDefinition subclass to a RegistrationWorker subclass. Upon completion of the calculation for the JobDefinition, the RegistrationWorker returns a Results subclass to the Slave, which then wraps the Results in a SlaveEvent and sends the SlaveEvent to the Master. The Master reads the SlaveEvent, extracts the Results subclass, and sends this Results subclass to the MetaOptimizer.<br />
<br />
JobDefinition: Defines the parameters of a particular job, a specific subclass of JobDefinition should be used for a particular computation.<br />
<br />
Results: Defines the return values of a particular job; a specific subclass of Results should be used for a particular computation.</div>Michael.grauer https://public.kitware.com/Wiki/index.php?title=Proposals:Condor&diff=41095 Proposals:Condor 2011-06-23T17:30:39Z <p>Michael.grauer: /* MPI Based Meta-Optimization Framework */</p>
<hr />
<div>Condor can be installed as a ''manager'' node, an ''execute'' node, or a ''submit'' node, or any combination of these. See [http://www.cs.wisc.edu/condor/manual/v7.2/3_1Introduction.html#SECTION00411000000000000000 The Different Roles a Machine Can Play]<br />
<br />
* Manager: There can be only one central manager for your pool. This machine is the collector of information, and the negotiator between resources and resource requests.<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer to the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
''' The official instructions on how to install Condor in Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here] '''. Below we present some of the tweaks we had to do to get it to work on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.). If it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a similar command to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or with a user having equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (for example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation scripts ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update PATH variable to include the directory ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save file and apply the change by running <br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out / log in, or even restart the machine, and you should be able to check that the CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit the condor manager config_file and update the lines as referenced below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, in order to allow MIDAS to run condor commands, create a link to /root/condor/etc/condor_config in /home/condor<br />
cd /home/condor<br />
ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring a Executer/Submitter in Unix ===<br />
The different files allowing the server to also be used as a condor submitter/executer were automatically updated while running the installation script ''condor_install''. Nevertheless, you still need to update its configuration file.<br />
<br />
* Edit the condor node config_file.local and update the lines as referenced below:<br />
vi /home/condor/localcondor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the line having ''UID_DOMAIN'' and '' FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor in Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor in Windows 7.<br />
<br />
# Download the Windows install MSI, run it, installing to "C:/condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager ( full address ).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For accounting domain enter your domain (e.g. yourdomaininternal.com)<br />
# For Email settings (I ignored this by clicking next)<br />
# For Java settings (I ignored this as we weren't using Java, by clicking next)<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked if you want a custom install or install, choose install. This will install condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When you will be running condor commands, start a cygwin or cmd prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix up some problems. My '''condor_status''' at first gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect: since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. (Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or perhaps just an IP that you would not want.)<br />
<br />
<br />
I shut down condor, right clicked on C:/condor in a Windows explorer, turned off "read only" and set permissions to allow for writing. Then I edited the file "c:/condor/condor_config.local" which started out empty, so that it could pick up some replacement values (some of them didn't seem to be set properly during the install). These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following:<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourinternaldomain.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
and then restarted condor. If you run "condor_status" and see that the Windows machine has Activity=OWNER rather than UNCLAIMED, be sure that you have added in START=True. But this may not be the best configuration for a Windows workstation that is in use; there is probably some additional configuration needed to make sure a Condor job doesn't use the machine when a physical human user is there using it.<br />
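A typical desktop-friendly policy would only start jobs once the machine has been idle for a while and suspend them when the user returns. The following is a sketch based on the standard policy expressions in the Condor manual; '''KeyboardIdle''' and '''LoadAvg''' are standard machine ClassAd attributes, but verify the exact expressions and thresholds against your Condor version:<br />

```
# Only start a job after 15 minutes without keyboard/mouse activity
# and while the load average is low
START = KeyboardIdle > (15 * 60) && LoadAvg <= 0.3
# Suspend a running job as soon as the user comes back
SUSPEND = KeyboardIdle < 60
# Resume once the machine has been idle again for 5 minutes
CONTINUE = KeyboardIdle > (5 * 60)
```

These expressions would go in '''condor_config.local''' in place of the unconditional START/SUSPEND/CONTINUE values shown above.<br />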
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here is the batch file contents for the actual job (printname.bat)<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, get a command prompt or Cygwin terminal, right click the icon to start it up, and click "run with elevated privileges" or "run as administrator".<br />
<br />
condor_master runs as a service on Windows, which controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows for the first time, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* run the condor manager <br />
condor_master<br />
<br />
* Assuming that during the installation process you set the type to ''manager,execute,submit'' (the default), run the following command<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after you started condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Set up condor to start automatically<br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add the ''condor.boot'' service to all runlevels<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
#include <unistd.h><br />
#include <stdio.h><br />
int main( int argc, char** argv )<br />
{<br />
printf( "%s\n", argv[1] );<br />
fflush( stdout );<br />
sleep( 30 );<br />
return 0; <br />
}<br />
<br />
This program will echo the command line argument it is given, wait 30 seconds, then exit.<br />
<br />
Save this file as foo.c, then compile it as follows (note the static linking):<br />
<br />
gcc foo.c -o foo --static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once so you can see how Condor will execute the job on multiple execute resources), you can <br />
change the condorjob file to be like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
<br />
We had a case where we had 6 slots, 2 were 32 bit with Arch=INTEL, 4 were 64 bit with Arch=X86_64 (but we were unaware of the bit difference at first). We ran 6 jobs and then were wondering why they would only execute on the submitting machine. So we changed the condorjob file to specify a certain architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told Condor to execute only on machines with the INTEL architecture, so no execution was attempted on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. I've included this story in case it helps with debugging.<br />
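To avoid this kind of surprise, you can list the architecture and operating system of every slot in the pool before submitting; the -format strings below are one possible layout:<br />
<br />
condor_status -format "%s " Name -format "%s " Arch -format "%s\n" OpSys<br />
<br />
and then add a matching Requirements expression, such as Requirements = (Arch == "X86_64"), to your submit file.<br />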
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was ''condor''.<br />
*When building BatchMake, you need to build it with grid support enabled.<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows the jobs you have submitted to Condor that are still active. It will give you a cluster ID and process ID for each job.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process will start and stop all other condor daemons. condor_master will write a MasterLog and have a .master_address file. Be sure that the .master_address contains the correct IP address. If the correct IP address doesn't appear, set the correct value in the parameter <br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has, but Condor may have picked a different IP than you wanted, perhaps the local loopback address 127.0.0.1 or a wireless adapter instead of the wired Ethernet one.<br />
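After restarting Condor, you can verify that the setting took effect with the condor_config_val utility described above:<br />
<br />
condor_config_val NETWORK_INTERFACE<br />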
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes and maintains data about submitted jobs. It tracks the queueing of jobs and tries to obtain resources so that all of its jobs can run. When a submitted job runs, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine whenever a job is actually executing. It takes care of system calls that need to be executed on the submitting machine for a process. There is one condor_shadow process for each executing process of a submission machine, so on a machine with a large number of submitted processes, the memory and other resources consumed by the shadow daemons can become a limitation.<br />
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution. Log files of interest include NegotiatorLog and MatchLog.<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon is used to detect user activity on an execute node, so Condor knows whether to allow execution of a job or to disallow it because a human user is currently engaged in some task on that machine.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, corresponding to [http://en.wikipedia.org/wiki/IA-64 IA64], the 64-bit Intel Itanium processor. This package does not cover all 64-bit Intel processors.<br />
<br />
While trying to run ''condor_master'', the shell returned the error message ''cannot execute binary file''.<br />
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], it's possible to inspect the header of an executable and determine whether it can run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the different outputs, we can see that the '''Intel IA-64''' package is built for the wrong architecture; this machine needs the '''X86-64''' build.<br />
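A quicker check is the standard ''file'' utility, which also reports the target architecture of a binary:<br />
<br />
file ./condor-7.2.0/sbin/condor_master<br />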
<br />
<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32 bit binary. Then add the location of the MPICH2\bin directory to your system path. Add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
The following was performed in a Windows CMD prompt run with Administrative Privileges (right click on the CMD executable and run it as an Administrator).<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's Windows account password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication password for smpd; this is the '''mypassphrase''' you entered above. If all is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processors since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines, and how many processes each machine should run, and the executable '''hostname''', which should return the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
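For example, with two hypothetical machines at 192.168.1.10 (4 cores) and 192.168.1.11 (2 cores), the command would be:<br />
<br />
mpiexec -hosts 2 192.168.1.10 4 192.168.1.11 2 hostname<br />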
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, you must right click on the project file (not the solution file), then click properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
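If you prefer the command line over the IDE, the equivalent invocation of the Visual Studio compiler would look roughly like the following sketch (the source file name mpitest.cpp is a placeholder, and the paths assume the default 32-bit MPICH2 install location; a project-generated stdafx.h may need adjustments for a command-line build):<br />
<br />
cl /EHsc /I"C:\Program Files\MPICH2\include" mpitest.cpp mpi.lib cxx.lib /link /LIBPATH:"C:\Program Files\MPICH2\lib"<br />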
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then to run your exe on two different machines (I had to build a RELEASE build for this; at first I had built a DEBUG build, but this caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
<br />
== MPICH2 and Condor on Windows ==<br />
<br />
To configure MPI over Condor, I set my Windows 7 laptop up as a dedicated Condor scheduler and executer, following the instructions at this link:<br />
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
I turned off smpd.exe as a service, since Condor will manage it:<br />
<br />
smpd.exe -remove<br />
<br />
<br />
I then created the following MPI example program, and ran it with a MachineCount of 2 and 4 at different times (I ran this on my laptop, which has 4 cores and had Condor running locally). This program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.<br />
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char *argv[])<br />
{<br />
MPI::Init(argc, argv);<br />
//<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
int size = MPI::COMM_WORLD.Get_size();<br />
char processor_name[MPI_MAX_PROCESSOR_NAME];<br />
int namelen; // length of the processor name, filled in by Get_processor_name<br />
MPI::Get_processor_name(processor_name,namelen);<br />
// <br />
int buf[1];<br />
int numElements = 1;<br />
// initialize buffer with rank<br />
buf[0] = rank;<br />
// example of synchronous communication<br />
// have even ranks send to odd ranks first<br />
// assumes an even number of processes<br />
int syncTag = 123;<br />
if (rank % 2 == 0)<br />
{<br />
// send to next higher rank<br />
int dest = rank + 1;<br />
std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Send(buf,numElements,MPI::INT,dest,syncTag);<br />
// now wait for dest to send back<br />
// even though dest is the param, in the Recv call it is used as the source<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,dest,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
}<br />
else<br />
{<br />
// source is next lower rank<br />
int source = rank - 1;<br />
std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,source,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
// send rank rather than buffer, just to be sure<br />
// even though param is source, in Send call it is for dest<br />
MPI::COMM_WORLD.Send(&rank,numElements,MPI::INT,source,syncTag);<br />
}<br />
// now for asynchronous communication<br />
std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
int asyncTag = 321;<br />
int leftRank;<br />
if (rank == 0)<br />
{<br />
leftRank = size-1;<br />
}<br />
else <br />
{<br />
leftRank = (rank-1);<br />
}<br />
int rightRank;<br />
if (rank == size-1)<br />
{<br />
rightRank = 0;<br />
}<br />
else <br />
{<br />
rightRank = (rank+1);<br />
}<br />
// everyone sends to the leftRank, and receives from the rightRank<br />
// if this were synchronous, would be in danger of deadlocking because of circular waits<br />
// reset buffers to rank<br />
buf[0] = rank;<br />
std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::Request sendReq = MPI::COMM_WORLD.Isend(buf,numElements,MPI::INT,leftRank,asyncTag);<br />
MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf,numElements,MPI::INT,rightRank,asyncTag);<br />
// wait on the receipt, and complete the send before moving on<br />
MPI::Status status;<br />
recvReq.Wait(status);<br />
sendReq.Wait();<br />
std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
//<br />
MPI::Finalize();<br />
return(0);<br />
}<br />
<br />
<br />
I then created a two machine Condor parallel universe grid, using the same user/password combination on both machines; this user had Administrative rights. My 4 core Windows 7 Professional laptop (X86_64,WINNT61) was the '''CONDOR_HOST''' and the '''DedicatedScheduler'''. Assuming this laptop is called '''clavicle''' in the '''shoulder.com''' domain, here are the important configuration parameters in its '''condor_config.local''' file: <br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files (x86)\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD, COLLECTOR, NEGOTIATOR, SMPD_SERVER<br />
<br />
<br />
The second machine, a 2 core Windows XP desktop called '''scapula''' (INTEL,WINNT51) was a dedicated machine with the following important parameters in its '''condor_config.local''' file:<br />
<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, STARTD, SMPD_SERVER<br />
<br />
Note how it points to '''clavicle''' both as '''CONDOR_HOST''' and as the '''DedicatedScheduler''', and that it does not have '''(x86)''' in its path to '''smpd.exe'''.<br />
<br />
MPICH2 gets installed to different locations, to '''C:\Program Files\MPICH2''' for an (INTEL,WINNT51) machine, and to '''C:\Program Files (x86)\MPICH2''' for a (X86_64,WINNT61) machine. Because I needed to refer to the '''mpiexec.exe''' path in the batch file that my Condor submit file referred to, and this batch file would be the same for both machines, I opted to copy '''mpiexec.exe''' to the work directory I was using on all machines, '''C:\mpich2work'''.<br />
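For example, on the 64-bit machine the copy would be (the 32-bit machine uses the path without '''(x86)'''):<br />
<br />
copy "C:\Program Files (x86)\MPICH2\bin\mpiexec.exe" C:\mpich2work<br />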
<br />
After setting up this configuration, I rebooted both machines so it would take effect. At this point, the command<br />
net stop condor<br />
no longer returns correctly, since Condor is managing the '''smpd.exe''' service now and does not know how to stop that service.<br />
<br />
<br />
<br />
Based on the Condor submit file described in the Condor wiki page, I had to adapt my own submit file for my above MPI executable called '''mpiwork.exe''' to be the following:<br />
<br />
<br />
universe = parallel<br />
executable = mp2script.bat<br />
arguments = mpiwork.exe<br />
machine_count = 6<br />
output = out.$(NODE).log<br />
error = error.$(NODE).log<br />
log = work.$(NODE).log<br />
should_transfer_files = yes<br />
when_to_transfer_output = on_exit<br />
transfer_input_files = mpiwork.exe<br />
Requirements = ((Arch == "INTEL") && (OpSys == "WINNT51")) || ((Arch == "X86_64") && (OpSys == "WINNT61"))<br />
queue<br />
<br />
and my '''mp2script.bat''' file was adapted to be: <br />
<br />
set _CONDOR_PROCNO=%_CONDOR_PROCNO%<br />
set _CONDOR_NPROCS=%_CONDOR_NPROCS%<br />
REM If not the head node, just sleep forever<br />
if not [%_CONDOR_PROCNO%] == [0] copy con nothing<br />
REM Set this to a directory known to contain mpiexec.exe on every machine<br />
set MPDIR="C:\mpich2work"<br />
REM run the actual mpijob<br />
%MPDIR%\mpiexec.exe -n %_CONDOR_NPROCS% -p 6666 %*<br />
exit 0<br />
<br />
with the change being that '''MPDIR''' is a location that is known to have '''mpiexec.exe''' on both machines, since I copied that file there on both machines as described above.<br />
<br />
A point of confusion: I would have expected the '''mpiwork.exe''' code above to print the names of both '''clavicle''' and '''scapula''' when I ran it on 6 nodes; instead only the name '''clavicle''' appeared. I do believe that it was running on both machines and all 6 cores, as both machines' IP addresses were noted in work.parallel.log, and both machines had CPU activity at the time the executable ran.<br />
<br />
<br />
== MPI Based Meta-Optimization Framework ==<br />
<br />
<br />
Here are the C++ classes:<br />
<br />
MPI Aware Infrastructure classes:<br />
<br />
MasterSlave: Entry point of the application, initializes the MPI infrastructure, starts up a Master (sending it a particular MetaOptimizer subclass), starts up as many slave nodes as are necessary (sending each of them a particular RegistrationWorker subclass), and finalizes the MPI infrastructure when the computation is complete.<br />
<br />
Master: The central controlling part of the infrastructure. It manages the available Slaves, sending them JobDefinitions as needed, collects Results from the Slaves as they come in, will re-run (NOT YET IMPLEMENTED) a JobDefinition if it has failed or timed out, and shuts down all Slaves when the computation is finished. The Master owns a particular MetaOptimizer subclass and requests jobs from it; the MetaOptimizer subclass is aware of the particular computation, while the Master only knows how to communicate with Slaves and has no responsibility for determining the breakdown or completion of the computation.<br />
<br />
Slave: One of the processing nodes, part of the infrastructure. Each Slave manages a RegistrationWorker (SHOULD THIS BE MORE GENERAL THAN REGISTRATION?) subclass. Slaves announce their presence to the Master, receive a JobDefinition describing a particular job, pass the JobDefinition on to the RegistrationWorker to run the job, receive back a Results object upon completion of the job, and send the Results back to the Master.<br />
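To make the protocol concrete, here is a minimal sketch of what the Slave side might look like with the MPI C++ bindings used above. The tag values, buffer layout, and the RegistrationWorker interface are hypothetical placeholders, not the actual implementation:<br />
<br />
// hypothetical message tags for the Master/Slave protocol<br />
const int TAG_JOB      = 1;  // Master -> Slave: a serialized JobDefinition<br />
const int TAG_RESULT   = 2;  // Slave -> Master: a serialized Results<br />
const int TAG_SHUTDOWN = 3;  // Master -> Slave: the computation is finished<br />
const int BUF_SIZE     = 64; // placeholder size for serialized messages<br />
//<br />
void SlaveLoop(RegistrationWorker& worker) // RegistrationWorker is assumed<br />
{<br />
  while (true)<br />
  {<br />
    double job[BUF_SIZE];<br />
    MPI::Status status;<br />
    // block until the Master sends either a job or a shutdown message<br />
    MPI::COMM_WORLD.Recv(job, BUF_SIZE, MPI::DOUBLE, 0, MPI::ANY_TAG, status);<br />
    if (status.Get_tag() == TAG_SHUTDOWN)<br />
      break;<br />
    // hand the JobDefinition to the worker, then return its Results<br />
    double results[BUF_SIZE];<br />
    worker.Run(job, results);<br />
    MPI::COMM_WORLD.Send(results, BUF_SIZE, MPI::DOUBLE, 0, TAG_RESULT);<br />
  }<br />
}<br />
<br />
A matching Master loop would send a JobDefinition to each idle Slave, collect TAG_RESULT messages, and finally send TAG_SHUTDOWN to every rank.<br />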
<br />
<br />
<br />
<br />
Condor can be installed as a ''manager'' node, an ''execute'' node, a ''submit'' node, or any combination of these. See [http://www.cs.wisc.edu/condor/manual/v7.2/3_1Introduction.html#SECTION00411000000000000000 The Different Roles a Machine Can Play]<br />
<br />
* Manager: There can be only one central manager for your pool. This machine is the collector of information and the negotiator between resources and resource requests.<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer to the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation].<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
''' The official instructions on how to install Condor on Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here] '''. Below we present some of the tweaks we had to do to get it to work on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.); if it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a similar command to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or with a user having equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (For example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation script ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update PATH variable to include the directory ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save the file and apply the change by running <br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out and log back in, or even restart the machine, and check that the CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit the Condor manager config file and update the lines referenced below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, in order to allow MIDAS to run Condor commands, create a link to /root/condor/etc/condor_config in /home/condor<br />
cd /home/condor<br />
ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring an Executer/Submitter in Unix ===<br />
The files allowing the server to also be used as a Condor submitter/executer were automatically updated when the installation script ''condor_install'' ran. Nevertheless, you still need to update the node's configuration file.<br />
<br />
* Edit the Condor node's condor_config.local and update the line referenced below:<br />
vi /home/condor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the lines with ''UID_DOMAIN'' and ''FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
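Putting it together, a minimal condor_config.local for a Unix execute/submit node might contain values such as the following (a sketch only; substitute your own domain and manager host):<br />
<br />
CONDOR_HOST = manager.website.com<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
DAEMON_LIST = MASTER, STARTD, SCHEDD<br />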
<br />
== Windows ==<br />
'''The official documentation on how to install Condor in Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor in Windows 7.<br />
<br />
# Download the Windows install MSI, run it, installing to "C:/condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager (fully qualified address).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For the accounting domain, enter your domain (e.g. yourdomaininternal.com)<br />
# For Email settings, I skipped this step by clicking next<br />
# For Java settings, I skipped this step by clicking next, as we weren't using Java<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access: $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked whether you want a custom install or the default install, choose the default install. This will install Condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When running Condor commands, start a Cygwin or cmd prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix up some problems. My '''condor_status''' at first gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect: since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or simply an IP that you do not want.<br />
<br />
<br />
I shut down Condor, right clicked on C:/condor in Windows Explorer, turned off "read only" and set permissions to allow writing. Then I edited the file "c:/condor/condor_config.local", which started out empty, so that it could pick up some replacement values (some of them didn't seem to be set properly during the install). These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following:<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourinternaldomain.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
Then restart Condor. If you run "condor_status" and see that the Windows machine has Activity=OWNER rather than UNCLAIMED, be sure that you have added START=True. However, this may not be the best configuration for a Windows workstation that is in use: some additional configuration is probably needed to make sure a Condor job doesn't use the machine while a human user is there using it.<br />
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here are the contents of the batch file for the actual job (printname.bat):<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, get a command prompt or Cygwin terminal, right click the icon to start it up, and click "run with elevated privileges" or "run as administrator".<br />
<br />
condor_master runs as a service on Windows and controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows for the first time, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* run the condor manager <br />
condor_master<br />
<br />
* Assuming that during installation you set up the type as ''manager,execute,submit'' (the default), run the following command:<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after starting Condor, you may also see the following line:<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Set up Condor to start automatically at boot<br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update the MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add the ''condor.boot'' service to all runlevels<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
#include <unistd.h><br />
#include <stdio.h><br />
int main( int argc, char** argv )<br />
{<br />
printf( "%s\n", argv[1] );<br />
fflush( stdout );<br />
sleep( 30 );<br />
return 0; <br />
}<br />
<br />
This program prints the command-line argument it is given, waits 30 seconds, then exits.<br />
<br />
Save this file as foo.c, then compile it with the following command (note the static linking):<br />
<br />
gcc foo.c -o foo --static<br />
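Before submitting, it's worth confirming that the binary really is statically linked, since a dynamically linked executable may fail on execute nodes that are missing its shared libraries. A small sketch that classifies the output of file(1); the helper name is ours, and it assumes the common GNU file(1) wording.<br />

```shell
# Sketch: classify a binary as "static" or "dynamic" by parsing file(1)
# output on stdin. Assumes GNU file's "statically linked" wording.
link_kind() {
  awk '{ print (/statically linked/ ? "static" : "dynamic") }'
}
# usage (not run here): file ./foo | link_kind
```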
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
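As a quick sanity check after the job completes, you can verify that the stdout file contains exactly the expected line. A small sketch, using the filenames from the example above (the helper name is ours):<br />

```shell
# Sketch: succeed only if the job's stdout file contains exactly the
# expected line ("helloworld" for the example job above).
check_job_output() {
  outfile="$1"
  expected="$2"
  grep -qx "$expected" "$outfile"
}
# usage (not run here): check_job_output condorjob.out helloworld && echo "job output OK"
```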
<br />
<br />
If you want to test this job on multiple slots (say 2 at once so you can see how Condor will execute the job on multiple execute resources), you can <br />
change the condorjob file to be like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
<br />
We had a case where we had 6 slots, 2 were 32 bit with Arch=INTEL, 4 were 64 bit with Arch=X86_64 (but we were unaware of the bit difference at first). We ran 6 jobs and then were wondering why they would only execute on the submitting machine. So we changed the condorjob file to specify a certain architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told Condor to execute only on machines with the INTEL architecture, so no attempt was made to execute the job on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. I've included this story in case it helps with debugging.<br />
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was Condor.<br />
*When building BatchMake, you need to build with grid support enabled.<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows which of the jobs you have submitted to Condor are active. It will give you a cluster ID and process ID for each one.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
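To remove many Held jobs at once, one common pattern is to filter the condor_q output for status "H" and pipe the resulting IDs to condor_rm. A sketch (this assumes the classic condor_q table layout, where the "ST" column is the sixth whitespace-separated field; verify against your version's output before relying on it):<br />

```shell
# Sketch: print the job IDs of Held jobs from `condor_q` output on stdin.
# Assumption: the status ("ST") column is the 6th whitespace-separated
# field, as in the classic condor_q table layout.
held_job_ids() {
  awk '$6 == "H" {print $1}'
}
# usage (not run here): condor_q | held_job_ids | xargs -r condor_rm
```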
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process will start and stop all other condor daemons. condor_master will write a MasterLog and have a .master_address file. Be sure that the .master_address contains the correct IP address. If the correct IP address doesn't appear, set the correct value in the parameter <br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has, but Condor may have picked a different IP than you wanted, perhaps the local loopback address 127.0.0.1 or a wireless adapter instead of the ethernet one.<br />
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes, and represents data about submitted jobs. It tracks the queueing of jobs and tries to get resources for all of its jobs to be run. When a job submitted is run, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine when an actual execution of the job is run. It takes care of system calls that need to be executed on the submitting machine for a process. There will be a condor_shadow process for each executing process of a submission machine, meaning that on a machine with a large number of submitted processes, the number of shadow daemons that memory and other resources can support could become a limitation.<br />
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution. Log files of interest include NegotiatorLog and MatchLog.<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon is used to detect user activity on an execute node, so Condor knows whether to allow a job to run or to disallow it because a human user is currently engaged in some task.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, where [http://en.wikipedia.org/wiki/IA-64 IA64] corresponds to the 64-bit Intel Itanium processor. This does not cover all 64-bit Intel processors.<br />
<br />
While trying to run ''condor_master'', the shell returned the error message ''cannot execute binary file''.<br />
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], it's possible to inspect the header of an executable and determine whether it can run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-6'''4<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the different outputs, it's possible to observe that the architecture '''Intel IA-64''' isn't the right one: the machine's native binaries (such as /bin/ls) target X86-64, so the X86-64 Condor package is the one to use.<br />
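The comparison above can be scripted. Here is a sketch that extracts just the Machine field from `readelf -h` output so two binaries can be compared quickly (it assumes the GNU binutils "Key: Value" header layout shown above, and the helper name is ours):<br />

```shell
# Sketch: read `readelf -h` output on stdin and print only the Machine
# field, e.g. "Intel IA-64" or "Advanced Micro Devices X86-64".
# Assumes the GNU binutils "Key: Value" ELF header layout shown above.
elf_machine() {
  awk -F: '/^ *Machine:/ {sub(/^[ \t]+/, "", $2); print $2; exit}'
}
# usage (not run here): readelf -h ./condor-7.2.0/sbin/condor_master | elf_machine
```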
<br />
<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the location of the MPICH2\bin directory to your system path. Add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
The following was performed in a Windows CMD prompt run with Administrative Privileges (right click on the CMD executable and run it as an Administrator).<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's Windows account password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication password for smpd; this is the '''mypassphrase''' you entered above. If everything is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processors since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines, and how many processes each machine should run, and the executable '''hostname''', which should return the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, you must right click on the project file (not the solution file), then click properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then to run your exe on two different machines (I had to build a RELEASE for this, at first I had built a DEBUG build, but this caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
<br />
== MPICH2 and Condor on Windows ==<br />
<br />
To configure MPI over Condor, I set my Windows 7 laptop up as a dedicated Condor scheduler and executer, following the instructions at this link:<br />
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
I turned off smpd.exe as a service, since Condor will manage it:<br />
<br />
smpd.exe -remove<br />
<br />
<br />
I then created the following MPI example program, and ran it with a MachineCount of 2 and of 4 at different times (I ran this on my laptop, which has 4 cores, with Condor running locally). This program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.<br />
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char *argv[])<br />
{<br />
MPI::Init(argc, argv);<br />
//<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
int size = MPI::COMM_WORLD.Get_size();<br />
char processor_name[MPI_MAX_PROCESSOR_NAME];<br />
int namelen;<br />
MPI::Get_processor_name(processor_name,namelen);<br />
// <br />
int buf[1];<br />
int numElements = 1;<br />
// initialize buffer with rank<br />
buf[0] = rank;<br />
// example of synchronous communication<br />
// have even ranks send to odd ranks first<br />
// assumes an even number of processes<br />
int syncTag = 123;<br />
if (rank % 2 == 0)<br />
{<br />
// send to next higher rank<br />
int dest = rank + 1;<br />
std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Send(buf,numElements,MPI::INT,dest,syncTag);<br />
// now wait for dest to send back<br />
// even though dest is the param, in the Recv call it is used as the source<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,dest,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
}<br />
else<br />
{<br />
// source is next lower rank<br />
int source = rank - 1;<br />
std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,source,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
// send rank rather than buffer, just to be sure<br />
// even though param is source, in Send call it is for dest<br />
MPI::COMM_WORLD.Send(&rank,numElements,MPI::INT,source,syncTag);<br />
}<br />
// now for asynchronous communication<br />
std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
int asyncTag = 321;<br />
int leftRank;<br />
if (rank == 0)<br />
{<br />
leftRank = size-1;<br />
}<br />
else <br />
{<br />
leftRank = (rank-1);<br />
}<br />
int rightRank;<br />
if (rank == size-1)<br />
{<br />
rightRank = 0;<br />
}<br />
else <br />
{<br />
rightRank = (rank+1);<br />
}<br />
// everyone sends to the leftRank, and receives from the rightRank<br />
// if this were synchronous, would be in danger of deadlocking because of circular waits<br />
// reset buffers to rank<br />
buf[0] = rank;<br />
std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::Request sendReq = MPI::COMM_WORLD.Isend(buf,numElements,MPI::INT,leftRank,asyncTag);<br />
MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf,numElements,MPI::INT,rightRank,asyncTag);<br />
// wait on the receipt<br />
MPI::Status status;<br />
recvReq.Wait(status);<br />
std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
//<br />
MPI::Finalize();<br />
return(0);<br />
}<br />
<br />
<br />
I then created a two machine Condor parallel universe grid, using the same user/password combination on both machines; this user had Administrative rights. My 4 core Windows 7 Professional laptop (X86_64,WINNT61) was the '''CONDOR_HOST''' and the '''DedicatedScheduler'''. Assuming this laptop is called '''clavicle''' in the '''shoulder.com''' domain, here are the important configuration parameters in its '''condor_config.local''' file: <br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files (x86)\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD, COLLECTOR, NEGOTIATOR, SMPD_SERVER<br />
<br />
<br />
The second machine, a 2 core Windows XP desktop called '''scapula''' (INTEL,WINNT51) was a dedicated machine with the following important parameters in its '''condor_config.local''' file:<br />
<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, STARTD, SMPD_SERVER<br />
<br />
Note how it points to '''clavicle''' both as '''CONDOR_HOST''' and as the '''DedicatedScheduler''', and that it does not have '''(x86)''' in its path to '''smpd.exe'''.<br />
<br />
MPICH2 gets installed to different locations, to '''C:\Program Files\MPICH2''' for an (INTEL,WINNT51) machine, and to '''C:\Program Files (x86)\MPICH2''' for a (X86_64,WINNT61) machine. Because I needed to refer to the '''mpiexec.exe''' path in the batch file that my Condor submit file referred to, and this batch file would be the same for both machines, I opted to copy '''mpiexec.exe''' to the work directory I was using on all machines, '''C:\mpich2work'''.<br />
<br />
After setting up this configuration, I rebooted both machines to start in this configuration. At this point, the command<br />
net stop condor<br />
no longer returns correctly, since Condor is managing the '''smpd.exe''' service now and does not know how to stop that service.<br />
<br />
<br />
<br />
Based on the Condor submit file described in the Condor wiki page, I had to adapt my own submit file for my above MPI executable called '''mpiwork.exe''' to be the following:<br />
<br />
<br />
universe = parallel<br />
executable = mp2script.bat<br />
arguments = mpiwork.exe<br />
machine_count = 6<br />
output = out.$(NODE).log<br />
error = error.$(NODE).log<br />
log = work.$(NODE).log<br />
should_transfer_files = yes<br />
when_to_transfer_output = on_exit<br />
transfer_input_files = mpiwork.exe<br />
Requirements = ((Arch == "INTEL") && (OpSys == "WINNT51")) || ((Arch == "X86_64") && (OpSys == "WINNT61"))<br />
queue<br />
<br />
and my '''mp2script.bat''' file was adapted to be: <br />
<br />
set _CONDOR_PROCNO=%_CONDOR_PROCNO%<br />
set _CONDOR_NPROCS=%_CONDOR_NPROCS%<br />
REM If not the head node, just sleep forever<br />
if not [%_CONDOR_PROCNO%] == [0] copy con nothing<br />
REM Set this to the bin directory of MPICH installation<br />
set MPDIR="C:\mpich2work"<br />
REM run the actual mpijob<br />
%MPDIR%\mpiexec.exe -n %_CONDOR_NPROCS% -p 6666 %*<br />
exit 0<br />
<br />
with the change being that '''MPDIR''' is a location that is known to have '''mpiexec.exe''' on both machines, since I copied that file there on both machines as described above.<br />
<br />
A point of confusion I have is that I would have expected the '''mpiwork.exe''' code above to print the names of both '''clavicle''' and '''scapula''' in the output when I ran it on 6 nodes. Instead only the name '''clavicle''' appeared. I do believe that this was running on both machines and all 6 cores, as both machines' IP addresses were noted in the work.parallel.log, and both machines had CPU activity at the time the executable ran.<br />
<br />
<br />
== MPI Based Meta-Optimization Framework ==<br />
<br />
<br />
Here are the C++ classes:<br />
<br />
MPI Aware Infrastructure classes:<br />
<br />
MasterSlave: Entry point of the application, initializes the MPI infrastructure, starts up a Master (sending it a particular MetaOptimizer subclass), starts up as many slave nodes as are necessary (sending each of them a particular RegistrationWorker subclass), and finalizes the MPI infrastructure when the computation is complete.<br />
<br />
Master: The central controlling part of the infrastructure. It manages the available slaves, sends them jobs as needed, collects results from the slaves as they come in, will re-run (NOT YET IMPLEMENTED) a job that has failed or timed out, and shuts down all slaves when the computation is finished. The Master owns a particular MetaOptimizer subclass. The MetaOptimizer subclass is aware of the particular computation; the Master only knows how to communicate with slaves and has no responsibility for determining the breakdown or completion of the computation.<br />
<br />
<br />
<br />
</div><hr />
<div>Condor can be installed as a ''manager'' node, an ''execute'' node, or a ''submit'' node, or any combination of these roles. See [http://www.cs.wisc.edu/condor/manual/v7.2/3_1Introduction.html#SECTION00411000000000000000 The Different Roles a Machine Can Play]<br />
<br />
* Manager: There can be only one central manager for your pool. This machine is the collector of information and the negotiator between resources and resource requests.<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer to the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
'''The official instructions on how to install Condor on Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here]'''. Below we present some of the tweaks we had to make to get it to work on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.) , if it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a similar command to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or with a user having equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (For example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation scripts ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update PATH variable to include the directory ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save file and apply the change by running <br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out / log in or even restart the machine, and you should be able to check that the CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit the condor manager config file and update the lines referenced below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, in order to allow Midas to run condor commands, create a link to /root/condor/etc/condor_config in /home/condor<br />
cd /home/condor<br />
ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring an Executer/Submitter in Unix ===<br />
The files that allow the server to also be used as a condor submitter/executer were automatically updated when the installation script ''condor_install'' ran. Nevertheless, you still need to update its local configuration file.<br />
<br />
* Edit the condor node config_file.local and update the line referenced below:<br />
vi /home/condor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the lines containing ''UID_DOMAIN'' and ''FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor in Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor in Windows 7.<br />
<br />
# Download the Windows install MSI, run it, installing to "C:/condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager ( full address ).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For accounting domain enter your domain (e.g. yourdomaininternal.com)<br />
# For Email settings (I ignored this by clicking next)<br />
# for Java settings (I ignored this as we weren't using Java, by clicking next)<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked whether you want a custom install or a standard install, choose install. This will install condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When running condor commands, start a cygwin or cmd prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix up some problems. My '''condor_status''' at first gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect (since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or perhaps just an IP that you would not want).<br />
<br />
<br />
I shut down condor, right clicked on C:/condor in a Windows explorer, turned off "read only" and set permissions to allow for writing. Then I edited the file "c:/condor/condor_config.local" which started out empty, so that it could pick up some replacement values (some of them didn't seem to be set properly during the install). These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following:<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourinternaldomain.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
and then restarted condor. If you run "condor_status" and see that the Windows machine has Activity=OWNER rather than UNCLAIMED, be sure that you have added START=True. But this may not be the best configuration for a Windows workstation that is in use; there is probably some additional configuration needed to make sure a Condor job doesn't use the machine when a physical human user is there using it.<br />
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here is the batch file contents for the actual job (printname.bat)<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, get a command prompt or Cygwin terminal, right click the icon to start it up, and click "run with elevated privileges" or "run as administrator".<br />
<br />
condor_master runs as a service on Windows and controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows for the first time, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* run the condor manager <br />
condor_master<br />
<br />
* Assuming that during installation you set up the type as ''manager,execute,submit'' (the default), run the following command<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after you started condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Set up condor to start automatically<br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add the ''condor.boot'' service to all runlevels<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
#include <unistd.h><br />
#include <stdio.h><br />
int main( int argc, char** argv )<br />
{<br />
/* guard against a missing argument */<br />
if ( argc < 2 )<br />
{<br />
fprintf( stderr, "usage: %s message\n", argv[0] );<br />
return 1;<br />
}<br />
printf( "%s\n", argv[1] );<br />
fflush( stdout );<br />
sleep( 30 );<br />
return 0;<br />
}<br />
<br />
This exe will echo the command line argument it is given, wait 30 seconds, then exit.<br />
<br />
Save this file as foo.c, then compile it with (note static linking).<br />
<br />
gcc foo.c -o foo --static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once so you can see how Condor will execute the job on multiple execute resources), you can <br />
change the condorjob file to be like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
<br />
We had a case where we had 6 slots, 2 were 32 bit with Arch=INTEL, 4 were 64 bit with Arch=X86_64 (but we were unaware of the bit difference at first). We ran 6 jobs and then were wondering why they would only execute on the submitting machine. So we changed the condorjob file to specify a certain architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told condor to execute only on machines with an Intel architecture, so the job was not attempted on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. I've included this story in case it helps with debugging.<br />
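One way to avoid this kind of mismatch is to pin both the architecture and operating system in the submit description. As a sketch (the attribute values are the ones condor_status reports for each slot), restricting the job above to 64-bit Linux slots would look like:<br />

```
universe     = vanilla
executable   = foo
Requirements = (Arch == "X86_64") && (OpSys == "LINUX")
Queue
```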
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was ''condor''.<br />
*When building BatchMake, you need to build with grid support enabled<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows you which of the jobs you have submitted to condor are active. It will give you a cluster ID and process ID for each job.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process will start and stop all other condor daemons. condor_master will write a MasterLog and have a .master_address file. Be sure that the .master_address contains the correct IP address. If the correct IP address doesn't appear, set the correct value in the parameter <br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has, but condor may have picked a different IP than you would have wanted, perhaps the local loopback address 127.0.0.1 or a wireless instead of ethernet adapter.<br />
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes and holds data about submitted jobs. It tracks the queueing of jobs and tries to obtain resources so that all of its jobs can be run. When a submitted job runs, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine whenever an actual execution of the job is run. It takes care of system calls that need to be executed on the submitting machine for a process. There will be a condor_shadow process for each executing process of a submission machine, meaning that on a machine with a large number of submitted processes, the number of shadow daemons that memory or other resources can support could become a limitation.<br />
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution. Log files of interest include NegotiatorLog and MatchLog<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon is used to detect user activity on an execute node, so Condor knows whether to allow execution of a job or to disallow it because a human user is currently engaged in some task.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, built for [http://en.wikipedia.org/wiki/IA-64 IA-64], the 64-bit Intel Itanium architecture. IA-64 does not cover all 64-bit Intel processors. <br />
<br />
While trying to run ''condor_master'', the shell returned the following error message: ''cannot execute binary file''<br />
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], it's possible to extract the header of an executable and determine whether it can run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the different outputs, it's possible to observe that '''Intel IA-64''' isn't the right architecture.<br />
<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the location of the MPICH2\bin directory to your system path. Add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
The following was performed in a Windows CMD prompt run with Administrative Privileges (right click on the CMD executable and run it as an Administrator).<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's Windows account password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication password for smpd; this is the '''mypassphrase''' you entered above. If all is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processors since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines, and how many processes each machine should run, and the executable '''hostname''', which should return the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, you must right click on the project file (not the solution file), then click properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then to run your exe on two different machines (I had to build a RELEASE for this; at first I had built a DEBUG build, but this caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
<br />
== MPICH2 and Condor on Windows ==<br />
<br />
To configure MPI over Condor, I set my Windows 7 laptop up as a dedicated Condor scheduler and executer, following the instructions at this link:<br />
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
I turned off smpd.exe as a service, since Condor will manage it:<br />
<br />
smpd.exe -remove<br />
<br />
<br />
I then created the following MPI example program, and ran it with a MachineCount of 2 and 4 at different times (I ran this on my laptop, which has 4 cores, and had Condor running locally). This program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.<br />
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char *argv[])<br />
{<br />
MPI::Init(argc, argv);<br />
//<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
int size = MPI::COMM_WORLD.Get_size();<br />
char processor_name[MPI_MAX_PROCESSOR_NAME];<br />
int namelen;<br />
MPI::Get_processor_name(processor_name,namelen);<br />
// <br />
int buf[1];<br />
int numElements = 1;<br />
// initialize buffer with rank<br />
buf[0] = rank;<br />
// example of synchronous communication<br />
// have even ranks send to odd ranks first<br />
// assumes an even number of processes<br />
int syncTag = 123;<br />
if (rank % 2 == 0)<br />
{<br />
// send to next higher rank<br />
int dest = rank + 1;<br />
std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Send(buf,numElements,MPI::INT,dest,syncTag);<br />
// now wait for dest to send back<br />
// even though dest is the param, in the Recv call it is used as the source<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,dest,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
}<br />
else<br />
{<br />
// source is next lower rank<br />
int source = rank - 1;<br />
std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,source,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
// send rank rather than buffer, just to be sure<br />
// even though param is source, in Send call it is for dest<br />
MPI::COMM_WORLD.Send(&rank,numElements,MPI::INT,source,syncTag);<br />
}<br />
// now for asynchronous communication<br />
std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
int asyncTag = 321;<br />
int leftRank;<br />
if (rank == 0)<br />
{<br />
leftRank = size-1;<br />
}<br />
else <br />
{<br />
leftRank = (rank-1);<br />
}<br />
int rightRank;<br />
if (rank == size-1)<br />
{<br />
rightRank = 0;<br />
}<br />
else <br />
{<br />
rightRank = (rank+1);<br />
}<br />
// everyone sends to the leftRank, and receives from the rightRank<br />
// if this were synchronous, would be in danger of deadlocking because of circular waits<br />
// reset buffers to rank<br />
buf[0] = rank;<br />
std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::Request sendReq = MPI::COMM_WORLD.Isend(buf,numElements,MPI::INT,leftRank,asyncTag);<br />
MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf,numElements,MPI::INT,rightRank,asyncTag);<br />
// wait on the receipt<br />
MPI::Status status;<br />
recvReq.Wait(status);<br />
std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
//<br />
MPI::Finalize();<br />
return(0);<br />
}<br />
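The '''leftRank'''/'''rightRank''' computation in the program above is plain ring wraparound; the same logic as a small Python sketch:<br />

```python
# Sketch of the ring wraparound used by the asynchronous section above:
# each rank sends to its left neighbor and receives from its right one,
# with wraparound at rank 0 and rank size-1.
def ring_neighbors(rank, size):
    left = size - 1 if rank == 0 else rank - 1
    right = 0 if rank == size - 1 else rank + 1
    return left, right

# With 4 processes, rank 0 wraps around to the left: neighbors are (3, 1)
print(ring_neighbors(0, 4))
```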
<br />
<br />
I then created a two-machine Condor parallel universe grid, using the same user/password combination on both machines, and this user had Administrative rights. My 4 core Windows 7 Professional laptop (X86_64,WINNT61) was the '''CONDOR_HOST''' and the '''DedicatedScheduler'''. Assuming this laptop is called '''clavicle''' in the '''shoulder.com''' domain, here are the important configuration parameters in its '''condor_config.local''' file: <br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files (x86)\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD, COLLECTOR, NEGOTIATOR, SMPD_SERVER<br />
<br />
<br />
The second machine, a 2 core Windows XP desktop called '''scapula''' (INTEL,WINNT51) was a dedicated machine with the following important parameters in its '''condor_config.local''' file:<br />
<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, STARTD, SMPD_SERVER<br />
<br />
Note how it points to '''clavicle''' both as '''CONDOR_HOST''' and as the '''DedicatedScheduler''', and that it does not have '''(x86)''' in its path to '''smpd.exe'''.<br />
<br />
MPICH2 gets installed to different locations, to '''C:\Program Files\MPICH2''' for an (INTEL,WINNT51) machine, and to '''C:\Program Files (x86)\MPICH2''' for a (X86_64,WINNT61) machine. Because I needed to refer to the '''mpiexec.exe''' path in the batch file that my Condor submit file referred to, and this batch file would be the same for both machines, I opted to copy '''mpiexec.exe''' to the work directory I was using on all machines, '''C:\mpich2work'''.<br />
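Since the install location differs only by the ''Program Files'' prefix, the mapping can be captured in a tiny helper; a hedged Python sketch of the path difference described above (copying '''mpiexec.exe''' to '''C:\mpich2work''', as done here, avoids needing anything like this in the batch file):<br />

```python
# Hedged sketch of the path difference described above: on the 64-bit
# (X86_64,WINNT61) machine MPICH2 landed under "Program Files (x86)",
# on the 32-bit (INTEL,WINNT51) machine under plain "Program Files".
def mpich2_bin(is_64bit_windows):
    base = r"C:\Program Files (x86)" if is_64bit_windows else r"C:\Program Files"
    return base + r"\MPICH2\bin"

print(mpich2_bin(True))   # path on the 64-bit machine
print(mpich2_bin(False))  # path on the 32-bit machine
```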
<br />
After setting up this configuration, I rebooted both machines to start in this configuration. At this point, the command<br />
net stop condor<br />
no longer returns correctly, since Condor is now managing the '''smpd.exe''' service and does not know how to stop it.<br />
<br />
<br />
<br />
Based on the Condor submit file described in the Condor wiki page, I adapted my own submit file for the MPI executable above, called '''mpiwork.exe''', to be the following:<br />
<br />
<br />
universe = parallel<br />
executable = mp2script.bat<br />
arguments = mpiwork.exe<br />
machine_count = 6<br />
output = out.$(NODE).log<br />
error = error.$(NODE).log<br />
log = work.$(NODE).log<br />
should_transfer_files = yes<br />
when_to_transfer_output = on_exit<br />
transfer_input_files = mpiwork.exe<br />
Requirements = ((Arch == "INTEL") && (OpSys == "WINNT51")) || ((Arch == "X86_64") && (OpSys == "WINNT61"))<br />
queue<br />
<br />
and my '''mp2script.bat''' file was adapted to be: <br />
<br />
set _CONDOR_PROCNO=%_CONDOR_PROCNO%<br />
set _CONDOR_NPROCS=%_CONDOR_NPROCS%<br />
REM If not the head node, just sleep forever<br />
if not [%_CONDOR_PROCNO%] == [0] copy con nothing<br />
REM Set this to the bin directory of MPICH installation<br />
set MPDIR="C:\mpich2work"<br />
REM run the actual mpijob<br />
%MPDIR%\mpiexec.exe -n %_CONDOR_NPROCS% -p 6666 %*<br />
exit 0<br />
<br />
with the change being that '''MPDIR''' is a location that is known to have '''mpiexec.exe''' on both machines, since I copied that file there on both machines as described above.<br />
<br />
A point of confusion: I would have expected the '''mpiwork.exe''' output to contain the names of both '''clavicle''' and '''scapula''' when I ran it on 6 nodes; instead only '''clavicle''' appeared. I do believe it was running on both machines and all 6 cores, as both machines' IP addresses were noted in the work.parallel.log, and both machines had CPU activity while the executable ran.<br />
<br />
<br />
== MPI Based Meta-Optimization Framework ==<br />
<br />
<br />
Here are the C++ classes:<br />
<br />
MPI Aware Infrastructure classes:<br />
<br />
MasterSlave: entry point of the application, initializes the MPI infrastructure, starts up a Master (sending it a particular MetaOptimizer subclass), starts up as many slave nodes as are necessary (sending each of them a particular RegistrationWorker subclass), and finalizes the MPI infrastructure when the computation is complete.</div>
<hr />
<div>Condor can be installed as either a ''manager'' node, an ''execute'' node, or a ''submit'' node, or any combination of these. See [http://www.cs.wisc.edu/condor/manual/v7.2/3_1Introduction.html#SECTION00411000000000000000 The Different Roles a Machine Can Play]<br />
<br />
* Manager: There can be only one central manager for your pool. This machine is the collector of information, and the negotiator between resources and resource requests.<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer to the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
''' The official instructions on how to install Condor in Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here] '''. Below we present some of the tweaks we had to make to get it to work on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.); if it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
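The ''/etc/hosts'' fix above boils down to making sure the second field of the host's entry carries a domain part; a minimal Python sketch of that check:<br />

```python
# Sketch: check whether an /etc/hosts entry gives the host a
# domain-qualified name (the second field must contain a dot).
def has_fqdn(hosts_line):
    parts = hosts_line.split()
    return len(parts) >= 2 and "." in parts[1]

print(has_fqdn("10.171.1.124 mymachine"))                # before the edit
print(has_fqdn("10.171.1.124 mymachine.mydomain.com"))   # after the edit
```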
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a similar command to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or with a user having equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (For example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation scripts ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update PATH variable to include the directory ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save file and apply the change by running <br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out / log in, or even restart the machine, and check that the CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit the condor manager config file and update the lines referenced below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, in order to allow Midas to run condor commands, create a link to /root/condor/etc/condor_config in /home/condor<br />
cd /home/condor<br />
ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring a Executer/Submitter in Unix ===<br />
The different files allowing the server to also be used as a condor submitter/executer were automatically updated when running the installation script ''condor_install''. Nevertheless, you still need to update its configuration file.<br />
<br />
* Edit the condor node's condor_config.local and update the line referenced below:<br />
vi /home/condor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the line having ''UID_DOMAIN'' and '' FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor in Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor in Windows 7.<br />
<br />
# Download the Windows install MSI, run it, installing to "C:/condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager ( full address ).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For accounting domain enter your domain (e.g. yourdomaininternal.com)<br />
# For Email settings (I ignored this by clicking next)<br />
# For Java settings (I ignored this, as we weren't using Java, by clicking next)<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked whether you want a custom or standard install, choose the standard install. This will install condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When running condor commands, start a Cygwin or cmd prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix up some problems. My '''condor_status''' at first gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect: since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. (Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or simply an IP that you would not want.)<br />
<br />
<br />
I shut down condor, right clicked on C:/condor in Windows Explorer, turned off "read only" and set permissions to allow writing. Then I edited the file "c:/condor/condor_config.local" (which started out empty) so that it could supply some replacement values, since some of them didn't seem to be set properly during the install. These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following:<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourinternaldomain.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
and then restarted condor. If you run "condor_status" and see that the Windows machine has Activity=OWNER rather than UNCLAIMED, be sure that you have added START=True. This may not be the best configuration for a Windows workstation that is in use, though; some additional configuration is probably needed to make sure a Condor job doesn't use the machine when a physical human user is there using it.<br />
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here is the batch file contents for the actual job (printname.bat)<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, get a command prompt or Cygwin terminal, right click the icon to start it up, and click "run with elevated privileges" or "run as administrator".<br />
<br />
condor_master runs as a service on Windows, which controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows for the first time, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* run the condor manager <br />
condor_master<br />
<br />
* Assuming that during installation you set the type to ''manager,execute,submit'' (the default), run the following command:<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after you started condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Setup condor to automatically startup <br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add ''condor.boot'' service to all runlevel<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
#include <unistd.h><br />
#include <stdio.h><br />
int main( int argc, char** argv )<br />
{<br />
printf( "%s\n", argv[1] );<br />
fflush( stdout );<br />
sleep( 30 );<br />
return 0; <br />
}<br />
<br />
This exe will echo the command line argument it is given, wait 30 seconds, then exit.<br />
<br />
Save this file as foo.c, then compile it with the following command (note the static linking):<br />
<br />
gcc foo.c -o foo --static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once so you can see how Condor will execute the job on multiple execute resources), you can <br />
change the condorjob file to be like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
<br />
We had a case where we had 6 slots, 2 were 32 bit with Arch=INTEL, 4 were 64 bit with Arch=X86_64 (but we were unaware of the bit difference at first). We ran 6 jobs and then were wondering why they would only execute on the submitting machine. So we changed the condorjob file to specify a certain architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told condor to execute only on machines with an Intel architecture, so the job no longer ran on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. I've included this story in case it helps with debugging.<br />
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was ''condor''.<br />
*When building BatchMake, you need to build with grid support on<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows you what processes are active, that you have submitted to condor. It will give you a cluster and process ID for each process.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process will start and stop all other condor daemons. condor_master will write a MasterLog and have a .master_address file. Be sure that the .master_address contains the correct IP address. If the correct IP address doesn't appear, set the correct value in the parameter <br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has, but condor may have picked a different IP than you would have wanted, perhaps the local loopback address 127.0.0.1 or a wireless instead of ethernet adapter.<br />
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes, and represents data about submitted jobs. It tracks the queueing of jobs and tries to get resources for all of its jobs to be run. When a job submitted is run, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine when an actual execution of the job is run. It will take care of systems calls that need to be executed on the submitting machine for a process. There will be a condor_shadow process for each executing process of a submission machine, meaning that on a machine with a large number of submitted processes, the number of shadow daemons supported by memory or other resources could be a limitation.<br />
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution. Log files of interest include NegotiatorLog and MatchLog<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon is used to detect user activity on an execute node, so Condor knows whether to allow execution of a job or to disallow it because a human user is currently engaged in some task.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, corresponding to [http://en.wikipedia.org/wiki/IA-64 IA-64], the 64-bit Intel Itanium processor architecture. This architecture does not cover all 64-bit Intel processors.<br />
<br />
While trying to run the ''condor_master'', the shell returned the following error message ''cannot execute binary file''<br />
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], it's possible to inspect the header of an executable and determine whether it could run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the outputs, it is clear that the '''Intel IA-64''' architecture of the first package is not the right one: the host's own binaries (such as /bin/ls) target '''Advanced Micro Devices X86-64''', and the matching x86_64 Condor package runs correctly.<br />
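This check can be condensed into a short shell sketch. Here /bin/ls is only a stand-in for the binary you are checking (e.g. ./condor-7.2.0/sbin/condor_master); compare the ELF ''Machine'' field against the host architecture before trying to run the package.<br />

```shell
# Print the ELF "Machine" field of a binary; /bin/ls stands in for the
# Condor binary being checked.
readelf -h /bin/ls | grep 'Machine:'

# Print the host architecture (e.g. x86_64) for comparison.
uname -m
```

If the two do not match (for instance ''Intel IA-64'' in the ELF header on an x86_64 host), the ''cannot execute binary file'' error above is expected.<br />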
<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines on the Condor grid, because we do not have shared Unix users on our network. In our case this user was Condor.<br />
*When building BatchMake, you need to build with grid support on.<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems. Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First, install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the location of the MPICH2\bin directory to your system path. Add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
The following was performed in a Windows CMD prompt run with Administrative Privileges (right click on the CMD executable and run it as an Administrator).<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's Windows account password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication password for smpd; this is the '''mypassphrase''' you entered above. If all is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processors since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines, and how many processes each machine should run, and the executable '''hostname''', which should return the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, you must right click on the project file (not the solution file), then click properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then to run your exe on two different machines (I had to build a RELEASE build for this; at first I had built a DEBUG build, but that caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
<br />
== MPICH2 and Condor on Windows ==<br />
<br />
To configure MPI over Condor, I set my Windows 7 laptop up as a dedicated Condor scheduler and executer, following the instructions at this link:<br />
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
I turned off smpd.exe as a service, since Condor will manage it:<br />
<br />
smpd.exe -remove<br />
<br />
<br />
I then created the following MPI example program, and ran it with a MachineCount of 2 and of 4 at different times (I ran this on my laptop, which has 4 cores and had Condor running locally). This program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.<br />
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char *argv[])<br />
{<br />
MPI::Init(argc, argv);<br />
//<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
int size = MPI::COMM_WORLD.Get_size();<br />
char processor_name[MPI_MAX_PROCESSOR_NAME];<br />
int namelen;<br />
MPI::Get_processor_name(processor_name,namelen);<br />
// <br />
int buf[1];<br />
int numElements = 1;<br />
// initialize buffer with rank<br />
buf[0] = rank;<br />
// example of synchronous communication<br />
// have even ranks send to odd ranks first<br />
// assumes an even number of processes<br />
int syncTag = 123;<br />
if (rank % 2 == 0)<br />
{<br />
// send to next higher rank<br />
int dest = rank + 1;<br />
std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Send(buf,numElements,MPI::INT,dest,syncTag);<br />
// now wait for dest to send back<br />
// even though dest is the param, in the Recv call it is used as the source<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,dest,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
}<br />
else<br />
{<br />
// source is next lower rank<br />
int source = rank - 1;<br />
std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,source,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
// send rank rather than buffer, just to be sure<br />
// even though param is source, in Send call it is for dest<br />
MPI::COMM_WORLD.Send(&rank,numElements,MPI::INT,source,syncTag);<br />
}<br />
// now for asynchronous communication<br />
std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
int asyncTag = 321;<br />
int leftRank;<br />
if (rank == 0)<br />
{<br />
leftRank = size-1;<br />
}<br />
else <br />
{<br />
leftRank = (rank-1);<br />
}<br />
int rightRank;<br />
if (rank == size-1)<br />
{<br />
rightRank = 0;<br />
}<br />
else <br />
{<br />
rightRank = (rank+1);<br />
}<br />
// everyone sends to the leftRank, and receives from the rightRank<br />
// if this were synchronous, would be in danger of deadlocking because of circular waits<br />
// reset buffers to rank<br />
buf[0] = rank;<br />
std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::Request sendReq = MPI::COMM_WORLD.Isend(buf,numElements,MPI::INT,leftRank,asyncTag);<br />
MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf,numElements,MPI::INT,rightRank,asyncTag);<br />
// wait on the receipt<br />
MPI::Status status;<br />
recvReq.Wait(status);<br />
// also complete the outstanding send before finalizing<br />
sendReq.Wait();<br />
std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
//<br />
MPI::Finalize();<br />
return(0);<br />
}<br />
<br />
<br />
I then created a two-machine Condor parallel universe grid, using the same user/password combination on both machines; this user had Administrative rights. My 4 core Windows 7 Professional laptop (X86_64,WINNT61) was the '''CONDOR_HOST''' and the '''DedicatedScheduler'''. Assuming this laptop is called '''clavicle''' in the '''shoulder.com''' domain, here are the important configuration parameters in its '''condor_config.local''' file: <br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files (x86)\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD, COLLECTOR, NEGOTIATOR, SMPD_SERVER<br />
<br />
<br />
The second machine, a 2 core Windows XP desktop called '''scapula''' (INTEL,WINNT51) was a dedicated machine with the following important parameters in its '''condor_config.local''' file:<br />
<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, STARTD, SMPD_SERVER<br />
<br />
Note how it points to '''clavicle''' both as '''CONDOR_HOST''' and as the '''DedicatedScheduler''', and that it does not have '''(x86)''' in its path to '''smpd.exe'''.<br />
<br />
MPICH2 gets installed to different locations, to '''C:\Program Files\MPICH2''' for an (INTEL,WINNT51) machine, and to '''C:\Program Files (x86)\MPICH2''' for a (X86_64,WINNT61) machine. Because I needed to refer to the '''mpiexec.exe''' path in the batch file that my Condor submit file referred to, and this batch file would be the same for both machines, I opted to copy '''mpiexec.exe''' to the work directory I was using on all machines, '''C:\mpich2work'''.<br />
<br />
After setting up this configuration, I rebooted both machines so that it would take effect. At this point, the command<br />
net stop condor<br />
no longer returns correctly, since Condor is managing the '''smpd.exe''' service now and does not know how to stop that service.<br />
<br />
<br />
<br />
Based on the Condor submit file described on the Condor wiki page, I adapted my own submit file for the MPI executable above, '''mpiwork.exe''', as follows:<br />
<br />
<br />
universe = parallel<br />
executable = mp2script.bat<br />
arguments = mpiwork.exe<br />
machine_count = 6<br />
output = out.$(NODE).log<br />
error = error.$(NODE).log<br />
log = work.$(NODE).log<br />
should_transfer_files = yes<br />
when_to_transfer_output = on_exit<br />
transfer_input_files = mpiwork.exe<br />
Requirements = ((Arch == "INTEL") && (OpSys == "WINNT51")) || ((Arch == "X86_64") && (OpSys == "WINNT61"))<br />
queue<br />
<br />
and my '''mp2script.bat''' file was adapted to be: <br />
<br />
set _CONDOR_PROCNO=%_CONDOR_PROCNO%<br />
set _CONDOR_NPROCS=%_CONDOR_NPROCS%<br />
REM If not the head node, just sleep forever<br />
if not [%_CONDOR_PROCNO%] == [0] copy con nothing<br />
REM Set this to a directory known to contain mpiexec.exe on all machines<br />
set MPDIR="C:\mpich2work"<br />
REM run the actual mpijob<br />
%MPDIR%\mpiexec.exe -n %_CONDOR_NPROCS% -p 6666 %*<br />
exit 0<br />
<br />
with the change being that '''MPDIR''' is a location that is known to have '''mpiexec.exe''' on both machines, since I copied that file there on both machines as described above.<br />
<br />
A point of confusion I have is that I would have expected the '''mpiwork.exe''' code above to print the names of both '''clavicle''' and '''scapula''' in the output when I ran it on 6 nodes. Instead only the name '''clavicle''' appeared. I do believe that this was running on both machines and all 6 cores, as both machines' IP addresses were noted in the work.parallel.log, and both machines had CPU activity at the time the executable ran.<br />
<br />
<br />
== MPI Based Meta-Optimization Framework ==</div>
<hr />
<div><br />
Condor can be installed as a ''manager'' node, an ''execute'' node, or a ''submit'' node, or any combination of these. See [http://www.cs.wisc.edu/condor/manual/v7.2/3_1Introduction.html#SECTION00411000000000000000 The Different Roles a Machine Can Play]<br />
<br />
* Manager: There can be only one central manager for your pool. This machine is the collector of information, and the negotiator between resources and resource requests.<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer to the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
''' The official instructions on how to install Condor in Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here] '''. Below we present some of the tweaks we had to make to get it working on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.). If it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
* Make sure the server has a hostname and a domainname.<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a command similar to the following to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or as a user with equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (For example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation scripts ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update the PATH variable to include the directories ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save the file and apply the change by running <br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out and log in again, or even restart the machine, and check that the CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit the Condor manager configuration file and update the lines as shown below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, in order to allow Midas to run Condor commands, create a link to /root/condor/etc/condor_config in /home/condor<br />
cd /home/condor<br />
ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring a Executer/Submitter in Unix ===<br />
The files allowing the server to also be used as a Condor submitter/executer were automatically updated when running the installation script ''condor_install''. Nevertheless, you still need to update its configuration file.<br />
<br />
* Edit the node's ''condor_config.local'' file and update the line as shown below:<br />
vi /home/condor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the lines containing ''UID_DOMAIN'' and ''FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor in Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor in Windows 7.<br />
<br />
# Download the Windows install MSI, run it, installing to "C:/condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager ( full address ).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For accounting domain enter your domain (e.g. yourdomaininternal.com)<br />
# For Email settings (I ignored this by clicking next)<br />
# for Java settings (I ignored this as we weren't using Java, by clicking next)<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked if you want a custom install or a default install, choose the default install. This will install Condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When running Condor commands, start a Cygwin or CMD prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix up some problems. My '''condor_status''' at first gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect: since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or perhaps just an IP that you would not want.<br />
<br />
<br />
I shut down condor, right clicked on C:/condor in a Windows explorer, turned off "read only" and set permissions to allow for writing. Then I edited the file "c:/condor/condor_config.local" which started out empty, so that it could pick up some replacement values (some of them didn't seem to be set properly during the install). These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following:<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourinternaldomain.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
and then restarted Condor. If you run '''condor_status''' and see that the Windows machine has Activity=OWNER rather than UNCLAIMED, be sure that you have added START=True. But this may not be the best configuration for a Windows workstation that is in use. There is probably some additional configuration needed to make sure a Condor job doesn't use the machine when a physical human user is there using it.<br />
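One common approach, sketched here as an untested assumption based on the standard startd policy expressions (''KeyboardIdle'' requires the KBDD daemon on Windows, and ''LoadAvg'' is advertised by every startd), is to gate the START expression on user activity in '''condor_config.local''':<br />

```
# Hypothetical desktop policy: only start jobs after 15 minutes of
# keyboard idle time and low CPU load; suspend if the user returns.
START    = ( KeyboardIdle > 15 * 60 ) && ( LoadAvg < 0.3 )
SUSPEND  = ( KeyboardIdle < 60 )
CONTINUE = ( KeyboardIdle > 5 * 60 )
```

These thresholds are illustrative only; tune them to your own workstations.<br />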
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here are the contents of the batch file for the actual job (printname.bat)<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, get a command prompt or Cygwin terminal, right click the icon to start it up, and click "run with elevated privileges" or "run as administrator".<br />
<br />
condor_master runs as a service on Windows, which controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows for the first time, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* run the condor manager <br />
condor_master<br />
<br />
* Assuming you set up the type as ''manager,execute,submit'' (the default) during installation, run the following command<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after you started condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Set up Condor to start automatically at boot <br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update the MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add the ''condor.boot'' service to all runlevels<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
#include <unistd.h><br />
#include <stdio.h><br />
int main( int argc, char** argv )<br />
{<br />
printf( "%s\n", argv[1] );<br />
fflush( stdout );<br />
sleep( 30 );<br />
return 0; <br />
}<br />
<br />
This exe will echo the command-line argument it is given, wait 30 seconds, then exit.<br />
<br />
Save this file as foo.c, then compile it as follows (note the static linking):<br />
<br />
gcc foo.c -o foo --static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once, so you can see how Condor will execute the job on multiple execute resources), you can change the condorjob file to be like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
<br />
We had a case where we had 6 slots: 2 were 32-bit with Arch=INTEL, 4 were 64-bit with Arch=X86_64 (but we were unaware of the difference at first). We ran 6 jobs and then wondered why they would only execute on the submitting machine. So we changed the condorjob file to specify a certain architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told Condor to execute only on machines with the INTEL architecture, so the job was not attempted on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. I've included this story in case it helps with debugging.<br />
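<br />
If you hit a similar mismatch again, you can list each slot's architecture and operating system up front; condor_status accepts printf-style -format options, each paired with a ClassAd attribute name:<br />
<br />
 condor_status -format "%s " Name -format "%s " Arch -format "%s\n" OpSys<br />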
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was Condor.<br />
*When building BatchMake, you need to build with grid support turned on.<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows the active jobs that you have submitted to Condor. It will give you a cluster ID and process ID for each job.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
<br />
The following sections describe the main Condor daemons; for more details, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process will start and stop all other condor daemons. condor_master will write a MasterLog and have a .master_address file. Be sure that the .master_address contains the correct IP address. If the correct IP address doesn't appear, set the correct value in the parameter <br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has, but Condor may have picked a different IP than you wanted, perhaps the local loopback address 127.0.0.1 or a wireless adapter instead of the Ethernet one.<br />
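<br />
After restarting Condor, you can confirm which value it is actually using by querying the configuration:<br />
<br />
 condor_config_val NETWORK_INTERFACE<br />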
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes and maintains data about submitted jobs. It tracks the queueing of jobs and tries to obtain resources for all of its jobs to be run. When a submitted job runs, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine when an actual execution of the job is run. It takes care of system calls that need to be executed on the submitting machine for a process. There will be one condor_shadow process for each executing process of a submission machine, so on a machine with a large number of submitted processes, the memory and other resources needed to support the shadow daemons can become a limitation.<br />
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution. Log files of interest include NegotiatorLog and MatchLog.<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon is used to detect user activity on an execute node, so Condor knows whether to allow execution of a job or to disallow it because a human user is currently engaged in some task.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, corresponding to [http://en.wikipedia.org/wiki/IA-64 IA64], the 64-bit Intel Itanium processor. This architecture does not cover all 64-bit Intel processors. <br />
<br />
While trying to run ''condor_master'', the shell returned the error message ''cannot execute binary file''.<br />
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], it is possible to inspect the header of an executable and determine whether it can run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the different outputs, it is possible to observe that the architecture '''Intel IA-64''' is not the right one.<br />
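<br />
Where available, the standard Unix ''file'' utility gives the same information at a glance:<br />
<br />
 file ./condor-7.2.0/sbin/condor_master<br />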
<br />
<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the location of the MPICH2\bin directory to your system path, and add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
The following was performed in a Windows CMD prompt run with Administrative Privileges (right click on the CMD executable and run it as an Administrator).<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's Windows account password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication password for smpd; this is the '''mypassphrase''' you entered above. If all is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processes since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines, and how many processes each machine should run, and the executable '''hostname''', which should return the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
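<br />
For example, with hypothetical addresses 192.168.1.10 (4 cores) and 192.168.1.11 (2 cores), the command would be:<br />
<br />
 mpiexec -hosts 2 192.168.1.10 4 192.168.1.11 2 hostname<br />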
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[])<br />
{<br />
  // initialize the MPI world<br />
  MPI::Init(argc,argv);<br />
  //<br />
  // get this process's rank<br />
  int rank = MPI::COMM_WORLD.Get_rank();<br />
  //<br />
  // get the total number of processes in the computation<br />
  int size = MPI::COMM_WORLD.Get_size();<br />
  //<br />
  // print out where this process ranks in the total<br />
  std::cout << "I am " << rank << " out of " << size << std::endl;<br />
  //<br />
  // Finalize the MPI world<br />
  MPI::Finalize();<br />
  return 0;<br />
}<br />
<br />
To compile this in MSVSE08, you must right click on the project file (not the solution file), then click properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then, to run your exe on two different machines (I had to build a RELEASE build for this; at first I had built a DEBUG build, which caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
<br />
== MPICH2 and Condor on Windows ==<br />
<br />
To configure MPI over Condor, I set my Windows 7 laptop up as a dedicated Condor scheduler and executer and followed the instructions at this link:<br />
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
I turned off smpd.exe as a service, since Condor will manage it:<br />
<br />
smpd.exe -remove<br />
<br />
<br />
I then created the following MPI example program, and ran it with a MachineCount of 2 and 4 at different times (I ran this on my laptop, which has 4 cores and had Condor running locally). This program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.<br />
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char *argv[])<br />
{<br />
  MPI::Init(argc, argv);<br />
  //<br />
  int rank = MPI::COMM_WORLD.Get_rank();<br />
  int size = MPI::COMM_WORLD.Get_size();<br />
  char processor_name[MPI_MAX_PROCESSOR_NAME];<br />
  int namelen;<br />
  MPI::Get_processor_name(processor_name,namelen);<br />
  //<br />
  int buf[1];<br />
  int numElements = 1;<br />
  // initialize buffer with rank<br />
  buf[0] = rank;<br />
  // example of synchronous communication<br />
  // have even ranks send to odd ranks first<br />
  // assumes an even number of processes<br />
  int syncTag = 123;<br />
  if (rank % 2 == 0)<br />
  {<br />
    // send to next higher rank<br />
    int dest = rank + 1;<br />
    std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
    MPI::COMM_WORLD.Send(buf,numElements,MPI::INT,dest,syncTag);<br />
    // now wait for dest to send back<br />
    // even though dest is the param, in the Recv call it is used as the source<br />
    MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,dest,syncTag);<br />
    std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
  }<br />
  else<br />
  {<br />
    // source is next lower rank<br />
    int source = rank - 1;<br />
    std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
    MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,source,syncTag);<br />
    std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
    // send rank rather than buffer, just to be sure<br />
    // even though param is source, in Send call it is for dest<br />
    MPI::COMM_WORLD.Send(&rank,numElements,MPI::INT,source,syncTag);<br />
  }<br />
  // now for asynchronous communication<br />
  std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
  int asyncTag = 321;<br />
  int leftRank;<br />
  if (rank == 0)<br />
  {<br />
    leftRank = size-1;<br />
  }<br />
  else<br />
  {<br />
    leftRank = (rank-1);<br />
  }<br />
  int rightRank;<br />
  if (rank == size-1)<br />
  {<br />
    rightRank = 0;<br />
  }<br />
  else<br />
  {<br />
    rightRank = (rank+1);<br />
  }<br />
  // everyone sends to the leftRank, and receives from the rightRank<br />
  // if this were synchronous, would be in danger of deadlocking because of circular waits<br />
  // reset buffers to rank; use a separate send buffer so the<br />
  // outstanding Isend and Irecv do not share the same memory<br />
  int sendBuf[1];<br />
  sendBuf[0] = rank;<br />
  buf[0] = rank;<br />
  std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
  MPI::Request sendReq = MPI::COMM_WORLD.Isend(sendBuf,numElements,MPI::INT,leftRank,asyncTag);<br />
  MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf,numElements,MPI::INT,rightRank,asyncTag);<br />
  // wait on the receipt<br />
  MPI::Status status;<br />
  recvReq.Wait(status);<br />
  std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
  // make sure the send has completed before finalizing<br />
  sendReq.Wait();<br />
  //<br />
  MPI::Finalize();<br />
  return(0);<br />
}<br />
<br />
<br />
I then created a two-machine Condor parallel universe grid, using the same user/password combination on both machines, and this user had Administrative rights. My 4-core Windows 7 Professional laptop (X86_64,WINNT61) was the '''CONDOR_HOST''' and the '''DedicatedScheduler'''. Assuming this laptop is called '''clavicle''' in the '''shoulder.com''' domain, here are the important configuration parameters in its '''condor_config.local''' file: <br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files (x86)\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD, COLLECTOR, NEGOTIATOR, SMPD_SERVER<br />
<br />
<br />
The second machine, a 2 core Windows XP desktop called '''scapula''' (INTEL,WINNT51) was a dedicated machine with the following important parameters in its '''condor_config.local''' file:<br />
<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, STARTD, SMPD_SERVER<br />
<br />
Note how it points to '''clavicle''' both as '''CONDOR_HOST''' and as the '''DedicatedScheduler''', and that it does not have '''(x86)''' in its path to '''smpd.exe'''.<br />
<br />
MPICH2 gets installed to different locations, to '''C:\Program Files\MPICH2''' for an (INTEL,WINNT51) machine, and to '''C:\Program Files (x86)\MPICH2''' for a (X86_64,WINNT61) machine. Because I needed to refer to the '''mpiexec.exe''' path in the batch file that my Condor submit file referred to, and this batch file would be the same for both machines, I opted to copy '''mpiexec.exe''' to the work directory I was using on all machines, '''C:\mpich2work'''.<br />
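<br />
On the XP machine, for example, that copy is a one-liner (the Windows 7 machine uses the '''(x86)''' path instead):<br />
<br />
 mkdir C:\mpich2work<br />
 copy "C:\Program Files\MPICH2\bin\mpiexec.exe" C:\mpich2work\<br />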
<br />
After setting up this configuration, I rebooted both machines to start in this configuration. At this point, the command<br />
net stop condor<br />
no longer returns correctly, since Condor is managing the '''smpd.exe''' service now and does not know how to stop that service.<br />
<br />
<br />
<br />
Based on the Condor submit file described in the Condor wiki page, I had to adapt my own submit file for my above MPI executable called '''mpiwork.exe''' to be the following:<br />
<br />
<br />
universe = parallel<br />
executable = mp2script.bat<br />
arguments = mpiwork.exe<br />
machine_count = 6<br />
output = out.$(NODE).log<br />
error = error.$(NODE).log<br />
log = work.$(NODE).log<br />
should_transfer_files = yes<br />
when_to_transfer_output = on_exit<br />
transfer_input_files = mpiwork.exe<br />
Requirements = ((Arch == "INTEL") && (OpSys == "WINNT51")) || ((Arch == "X86_64") && (OpSys == "WINNT61"))<br />
queue<br />
<br />
and my '''mp2script.bat''' file was adapted to be: <br />
<br />
set _CONDOR_PROCNO=%_CONDOR_PROCNO%<br />
set _CONDOR_NPROCS=%_CONDOR_NPROCS%<br />
REM If not the head node, just sleep forever<br />
if not [%_CONDOR_PROCNO%] == [0] copy con nothing<br />
REM Set this to the bin directory of MPICH installation<br />
set MPDIR="C:\mpich2work"<br />
REM run the actual mpijob<br />
%MPDIR%\mpiexec.exe -n %_CONDOR_NPROCS% -p 6666 %*<br />
exit 0<br />
<br />
with the change being that '''MPDIR''' is a location that is known to have '''mpiexec.exe''' on both machines, since I copied that file there on both machines as described above.<br />
<br />
A point of confusion I have is that I would have expected the '''mpiwork.exe''' code above to have the names of '''clavicle''' and '''scapula''' in the output when I ran it on 6 nodes. Instead only the name '''clavicle''' appeared. I do believe that this was running on both machines and all 6 cores, as both machines' IP addresses were noted in the work.parallel.log, and both machines had CPU activity at the time the executable ran.</div>
<div>
Condor can be installed as a ''manager'' node, an ''execute'' node, or a ''submit'' node, or any combination of these. See [http://www.cs.wisc.edu/condor/manual/v7.2/3_1Introduction.html#SECTION00411000000000000000 The Different Roles a Machine Can Play]<br />
<br />
* Manager: There can be only one central manager for your pool. This machine collects information and acts as the negotiator between resources and resource requests.<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer to the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
''' The official instructions on how to install Condor on Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here] '''. Below we present some of the tweaks we had to do to get it to work on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.); if it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
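<br />
After the reboot, most Linux distributions let you verify the fully qualified name with:<br />
<br />
 hostname --fqdn<br />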
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a similar command to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or with a user having equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (For example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation scripts ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update the PATH variable to include the directories ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save file and apply the change by running <br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out and log in again, or even restart the machine, and you should be able to check that the CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit the condor manager config file and update the lines as referenced below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, in order to allow MIDAS to run condor commands, create a link to /root/condor/etc/condor_config in /home/condor<br />
cd /home/condor<br />
ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring a Executer/Submitter in Unix ===<br />
The files that allow the server to also be used as a condor submitter/executer were automatically updated when the installation script ''condor_install'' ran. Nevertheless, you still need to update its configuration file.<br />
<br />
* Edit condor node config_file.local and update the line as referenced below:<br />
vi /home/condor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the lines containing ''UID_DOMAIN'' and ''FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor on Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor on Windows 7.<br />
<br />
# Download the Windows install MSI, run it, installing to "C:/condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager ( full address ).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For accounting domain enter your domain (e.g. yourdomaininternal.com)<br />
# For Email settings (I ignored this by clicking next)<br />
# for Java settings (I ignored this as we weren't using Java, by clicking next)<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked whether you want a custom install or the default install, choose the default install. This will install Condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When running condor commands, start a Cygwin or CMD prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix up some problems. My '''condor_status''' at first gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect (since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or perhaps just an IP that you would not want).<br />
<br />
<br />
I shut down condor, right clicked on C:/condor in a Windows explorer, turned off "read only" and set permissions to allow for writing. Then I edited the file "c:/condor/condor_config.local" which started out empty, so that it could pick up some replacement values (some of them didn't seem to be set properly during the install). These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following:<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourdomaininternal.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
and then restarted condor. If you run '''condor_status''' and see that the Windows machine has Activity=OWNER rather than UNCLAIMED, be sure that you have added START=True. But this may not be the best configuration for a Windows workstation that is in use; some additional configuration is needed to make sure a Condor job doesn't use the machine while a physical human user is there using it.<br />
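One desktop-friendly policy (a sketch based on the standard Condor example configuration; the exact thresholds here are assumptions we did not test) is to start jobs only when the keyboard has been idle for a while and the non-Condor load is low:<br />

```
# Hypothetical desktop policy for condor_config.local: start jobs after
# 15 minutes of keyboard idle time and low non-Condor CPU load, suspend
# them when the user returns, and resume after the user has been away.
NonCondorLoadAvg = (LoadAvg - CondorLoadAvg)
START    = (KeyboardIdle > 15 * $(MINUTE)) && ($(NonCondorLoadAvg) < 0.3)
SUSPEND  = (KeyboardIdle < $(MINUTE))
CONTINUE = (KeyboardIdle > 5 * $(MINUTE))
```

This relies on the '''condor_kbdd''' daemon (described below) to keep the KeyboardIdle attribute up to date on Windows.<br />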
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here is the batch file contents for the actual job (printname.bat)<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, get a command prompt or Cygwin terminal, right click the icon to start it up, and click "run with elevated privileges" or "run as administrator".<br />
<br />
condor_master runs as a service on Windows, which controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows for the first time, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* run the condor manager <br />
condor_master<br />
<br />
* Assuming that during installation you set up the machine type as ''manager,execute,submit'' (the default), run the following command<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run ''ps -e | egrep condor_'' just after starting condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Set up condor to start automatically <br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add the ''condor.boot'' service to all runlevels<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
#include <unistd.h><br />
#include <stdio.h><br />
int main( int argc, char** argv )<br />
{<br />
if ( argc < 2 )<br />
{<br />
fprintf( stderr, "usage: %s <message>\n", argv[0] );<br />
return 1;<br />
}<br />
printf( "%s\n", argv[1] );<br />
fflush( stdout );<br />
sleep( 30 );<br />
return 0;<br />
}<br />
<br />
This exe will print the command line argument it is given, wait 30 seconds, then exit.<br />
<br />
Save this file as foo.c, then compile it with (note static linking).<br />
<br />
gcc foo.c -o foo --static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once so you can see how Condor will execute the job on multiple execute resources), you can <br />
change the condorjob file to be like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
<br />
We had a case where we had 6 slots: 2 were 32 bit with Arch=INTEL, 4 were 64 bit with Arch=X86_64 (but we were unaware of the bit difference at first). We ran 6 jobs and then wondered why they would only execute on the submitting machine. So we changed the condorjob file to specify a certain architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told Condor to execute only on machines with the INTEL architecture, so it no longer tried to execute on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. I've included this story in case it helps with debugging.<br />
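One way around this mismatch (a sketch, not something from the original page: the file names below are hypothetical) is Condor's <nowiki>$$()</nowiki> substitution in the submit description, which lets one submit file pick a per-architecture binary at match time:<br />

```
# Hypothetical mixed-architecture submit fragment: Condor substitutes the
# matched machine's Arch attribute, so foo.INTEL or foo.X86_64 is
# transferred to the execute node, whichever matches.
executable   = foo.$$(Arch)
requirements = (Arch == "INTEL") || (Arch == "X86_64")
queue
```

You would then compile the job once per architecture and name the binaries foo.INTEL and foo.X86_64 in the submission directory.<br />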
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was Condor.<br />
*When building BatchMake, you need to build with grid support on<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows you what processes are active, that you have submitted to condor. It will give you a cluster and process ID for each process.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process will start and stop all other condor daemons. condor_master will write a MasterLog and have a .master_address file. Be sure that the .master_address contains the correct IP address. If the correct IP address doesn't appear, set the correct value in the parameter <br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has, but condor may have picked a different IP than you would have wanted, perhaps the local loopback address 127.0.0.1 or a wireless instead of ethernet adapter.<br />
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes, and represents data about submitted jobs. It tracks the queueing of jobs and tries to get resources for all of its jobs to be run. When a job submitted is run, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine when an actual execution of the job is run. It will take care of systems calls that need to be executed on the submitting machine for a process. There will be a condor_shadow process for each executing process of a submission machine, meaning that on a machine with a large number of submitted processes, the number of shadow daemons supported by memory or other resources could be a limitation.<br />
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution. Log files of interest include NegotiatorLog and MatchLog<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon is used to detect user activity on an execute node, so Condor knows whether to allow execution of a job or to disallow it because a human user is currently engaged in some task.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, where ia64 corresponds to [http://en.wikipedia.org/wiki/IA-64 IA-64], the 64-bit Intel Itanium processor. This does not cover all 64-bit Intel processors. <br />
<br />
While trying to run ''condor_master'', the shell returned the error message ''cannot execute binary file''.<br />
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], it is possible to inspect the header of an executable and determine whether it can run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the different outputs, it is possible to observe that the '''Intel IA-64''' architecture isn't the right one for this machine.<br />
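The readelf check above can also be scripted. Here is a small C sketch (not part of the original page) that reads the e_machine field directly from the first bytes of an ELF file; the machine codes come from the ELF specification, and only the ones relevant to this page are listed:<br />

```c
#include <stdio.h>
#include <string.h>
#include <stddef.h>

/* Return a name for the e_machine field of an ELF header, or NULL if the
 * buffer does not start with the ELF magic number. */
const char *elf_machine(const unsigned char *buf, size_t len)
{
    if (len < 20 || memcmp(buf, "\x7f" "ELF", 4) != 0)
        return NULL;
    /* e_machine is a 2-byte field at offset 18; buf[5] == 1 means little endian */
    unsigned code = (buf[5] == 1) ? (unsigned)(buf[18] | (buf[19] << 8))
                                  : (unsigned)((buf[18] << 8) | buf[19]);
    switch (code) {
    case 0x03: return "Intel 80386";
    case 0x32: return "Intel IA-64";
    case 0x3e: return "AMD x86-64";
    default:   return "unknown";
    }
}
```

To use it, fread the first 20 bytes of a binary such as condor_master into a buffer and pass them in; a result of "Intel IA-64" on an x86-64 machine indicates the wrong download.<br />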
<br />
<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the location of the MPICH2\bin directory to your system PATH. Add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
The following was performed in a Windows CMD prompt run with Administrative Privileges (right click on the CMD executable and run it as an Administrator).<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication password for smpd; this is the '''mypassphrase''' you entered above. If all is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processors since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines, and how many processes each machine should run, and the executable '''hostname''', which should return the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, you must right click on the project file (not the solution file), then click properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then to run your exe on two different machines (I had to build a RELEASE for this, at first I had built a DEBUG build, but this caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
<br />
== MPICH2 and Condor on Windows ==<br />
<br />
To configure MPI over Condor, I set my Windows 7 laptop up as a dedicated Condor scheduler and executer, following the instructions at this link:<br />
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
I turned off smpd.exe as a service, since Condor will manage it:<br />
<br />
smpd.exe -remove<br />
<br />
<br />
I then created the following MPI example program, and ran it with a MachineCount of 2 and of 4 at different times (I ran this on my laptop, which has 4 cores and had Condor running locally). This program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.<br />
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char *argv[])<br />
{<br />
MPI::Init(argc, argv);<br />
//<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
int size = MPI::COMM_WORLD.Get_size();<br />
char processor_name[MPI_MAX_PROCESSOR_NAME];<br />
int namelen;<br />
MPI::Get_processor_name(processor_name,namelen);<br />
// <br />
int buf[1];<br />
int numElements = 1;<br />
// initialize buffer with rank<br />
buf[0] = rank;<br />
// example of synchronous communication<br />
// have even ranks send to odd ranks first<br />
// assumes an even number of processes<br />
int syncTag = 123;<br />
if (rank % 2 == 0)<br />
{<br />
// send to next higher rank<br />
int dest = rank + 1;<br />
std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Send(buf,numElements,MPI::INT,dest,syncTag);<br />
// now wait for dest to send back<br />
// even though dest is the param, in the Recv call it is used as the source<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,dest,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
}<br />
else<br />
{<br />
// source is next lower rank<br />
int source = rank - 1;<br />
std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,source,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
// send rank rather than buffer, just to be sure<br />
// even though param is source, in Send call it is for dest<br />
MPI::COMM_WORLD.Send(&rank,numElements,MPI::INT,source,syncTag);<br />
}<br />
// now for asynchronous communication<br />
std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
int asyncTag = 321;<br />
int leftRank;<br />
if (rank == 0)<br />
{<br />
leftRank = size-1;<br />
}<br />
else <br />
{<br />
leftRank = (rank-1);<br />
}<br />
int rightRank;<br />
if (rank == size-1)<br />
{<br />
rightRank = 0;<br />
}<br />
else <br />
{<br />
rightRank = (rank+1);<br />
}<br />
// everyone sends to the leftRank, and receives from the rightRank<br />
// if this were synchronous, would be in danger of deadlocking because of circular waits<br />
// reset buffers to rank<br />
buf[0] = rank;<br />
std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::Request sendReq = MPI::COMM_WORLD.Isend(buf,numElements,MPI::INT,leftRank,asyncTag);<br />
MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf,numElements,MPI::INT,rightRank,asyncTag);<br />
// wait on the receipt<br />
MPI::Status status;<br />
recvReq.Wait(status);<br />
std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
//<br />
MPI::Finalize();<br />
return(0);<br />
}<br />
<br />
<br />
I then created a two machine Condor parallel universe grid, using the same user/password combination on both machines, and this user had Administrative rights. My 4 core Windows 7 profession laptop (X86_64,WINNT61) was the '''CONDOR_HOST''' and the '''DedicatedScheduler'''. Assuming this laptop is called '''clavicle''' in the '''shoulder.com''' domain, here are the important configuration parameters in its '''condor_config.local''' file: <br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files (x86)\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD, COLLECTOR, NEGOTIATOR, SMPD_SERVER<br />
<br />
<br />
The second machine, a 2 core Windows XP desktop called '''scapula''' (INTEL,WINNT51) was a dedicated machine with the following important parameters in its '''condor_config.local''' file:<br />
<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, STARTD, SMPD_SERVER<br />
<br />
Note how it points to '''clavicle''' both as '''CONDOR_HOST''' and as the '''DedicatedScheduler''', and that it does not have '''(x86)''' in its path to '''smpd.exe'''.<br />
<br />
MPICH2 gets installed to different locations, to '''C:\Program Files\MPICH2''' for an (INTEL,WINNT51) machine, and to '''C:\Program Files (x86)\MPICH2''' for a (X86_64,WINNT61) machine. Because I needed to refer to the '''mpiexec.exe''' path in the batch file that my Condor submit file referred to, and this batch file would be the same for both machines, I opted to copy '''mpiexec.exe''' to the work directory I was using on all machines, '''C:\mpich2work'''.<br />
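An alternative to copying '''mpiexec.exe''' around (a sketch we did not test on our setup) is to let the batch script pick whichever MPICH2 install path actually exists on the machine it runs on:<br />

```bat
REM Hypothetical fragment for mp2script.bat: choose the MPICH2 bin
REM directory that exists on this machine instead of a copied binary.
REM The quoted set syntax keeps the ")" in "(x86)" from closing the block.
if exist "C:\Program Files (x86)\MPICH2\bin\mpiexec.exe" (
  set "MPDIR=C:\Program Files (x86)\MPICH2\bin"
) else (
  set "MPDIR=C:\Program Files\MPICH2\bin"
)
"%MPDIR%\mpiexec.exe" -n %_CONDOR_NPROCS% -p 6666 %*
```

This keeps a single batch file valid for both the 32-bit and 64-bit Windows machines.<br />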
<br />
After setting up this configuration, I rebooted both machines to start in this configuration. At this point, the command<br />
net stop condor<br />
no longer returns correctly, since Condor is managing the '''smpd.exe''' service now and does not know how to stop that service.<br />
<br />
<br />
<br />
Based on the Condor submit file described in the Condor wiki page, I had to adapt my own submit file for my above MPI executable called '''mpiwork.exe''' to be the following:<br />
<br />
<br />
universe = parallel<br />
executable = mp2script.bat<br />
arguments = mpiwork.exe<br />
machine_count = 6<br />
output = out.$(NODE).log<br />
error = error.$(NODE).log<br />
log = work.$(NODE).log<br />
should_transfer_files = yes<br />
when_to_transfer_output = on_exit<br />
transfer_input_files = mpiwork.exe<br />
Requirements = ((Arch == "INTEL") && (OpSys == "WINNT51")) || ((Arch == "X86_64") && (OpSys == "WINNT61"))<br />
queue<br />
<br />
and my '''mp2script.bat''' file was adapted to be: <br />
<br />
set _CONDOR_PROCNO=%_CONDOR_PROCNO%<br />
set _CONDOR_NPROCS=%_CONDOR_NPROCS%<br />
REM If not the head node, just sleep forever<br />
if not [%_CONDOR_PROCNO%] == [0] copy con nothing<br />
REM Set this to the bin directory of MPICH installation<br />
set MPDIR="C:\mpich2work"<br />
REM run the actual mpijob<br />
%MPDIR%\mpiexec.exe -n %_CONDOR_NPROCS% -p 6666 %*<br />
exit 0<br />
<br />
with the change being that '''MPDIR''' is a location that is known to have '''mpiexec.exe''' on both machines, since I copied that file there on both machines as described above.<br />
<br />
A point of confusion I have is that I would have expected the '''mpiwork.exe''' code above to have the name of '''clavicle''' and '''scapula''' in the output when I ran it on 6 nodes. Instead only the name '''clavicle''' appeared. I do believe that this was running on both machines and all 6 cores, as both machines' IP addresses were noted in the work.parallel.log, and both machines had CPU activity at the time the executable ran.</div>Michael.grauerhttps://public.kitware.com/Wiki/index.php?title=ITK/Summer_v4_2011_Meeting&diff=40567ITK/Summer v4 2011 Meeting2011-06-10T17:30:34Z<p>Michael.grauer: /* Kitware */</p>
<hr />
<div>ITKv4 Summer Meeting<br />
<br />
*'''Dates: June 27-29, 2011''' <br />
*'''City: Chapel Hill, NC'''<br />
*'''Location: Franklin Hotel'''<br />
<br />
== Travel / Hotel Information ==<br />
<br />
The Meeting will take place at the Franklin Hotel.<br />
<br />
* http://franklinhotelnc-px.trvlclick.com/index.html<br />
<br />
Since the meeting starts at<br />
<br />
* 8 am on June 27th,<br />
<br />
we recommend that attendees arrange their hotel accommodation for the previous night!<br />
<br />
* Use the room group 'ITKV4'<br />
** Reservation line: 866.831.5999<br />
** Ask for the "room block" reserved under 'ITKV4'<br />
* The rate is $129 / Night<br />
<br />
== Registration Information ==<br />
<br />
* Download [[Media:ITKv4 Summer2011Meeting Registration.pdf |Registration Form]]<br />
* It is a fillable PDF form<br />
* Please fill it out.<br />
** Indicate the number of days that you are attending.<br />
** Registration fee = ( NumberOfDays * $65 );<br />
** Print it as a PDF file<br />
*** since otherwise the form is still modifiable<br />
* Send the processed form back to Kitware<br />
** email it at: admin at kitware dot com<br />
<br />
== Meeting Room ==<br />
<br />
<br />
== Meeting Agenda ==<br />
<br />
* '''WARNING : THE AGENDA WAS REORGANIZED ON MAY 27th : PLEASE REVIEW'''<br />
<br />
=== Must See Topics ===<br />
<br />
* BETA Release<br />
* GPU<br />
* Modularization<br />
* SimpleITK<br />
* DICOM<br />
* Registration Refactoring<br />
* LevelSet Refactoring<br />
<br />
=== Monday June 27th - A2D2 Summit ===<br />
<br />
* [[ITK_Release_4/The Team/A2D2 Development Team|A2D2 Development Team]]<br />
* [[ITK_Release_4/The Team/ITKv4 Development Team|ITKv4 Development Team]]<br />
<br />
==== Monday Morning ====<br />
<br />
* 8:30 am Welcome: Terry Yoo<br />
* 9:00 am State of the Toolkit: Hans Johnson<br />
* 9:30 am Split in to Focus Groups<br />
** '''Group1''' : Microscopy<br />
** '''Group2''' : Clinical Applications<br />
** '''Group3''' : Video<br />
** '''Group4''' : Data and Web-based Applications<br />
* 10:00 Break<br />
* 10:30 am Working Groups (continuation)<br />
** '''Group1''' : Microscopy<br />
** '''Group2''' : Clinical Applications<br />
** '''Group3''' : Video<br />
** '''Group4''' : Data and Web-based Applications<br />
* 12:00 pm Lunch<br />
<br />
==== Monday Afternoon ====<br />
<br />
* 1:00 pm Plenary: '''Group 1''' : 20min presentation + discussion<br />
* 2:00 pm Plenary: '''Group 2''' : 20min presentation + discussion<br />
* 2:45 pm Break<br />
* 3:15 pm Plenary: '''Group 3''' : 20min presentation + discussion<br />
* 4:00 pm Plenary: '''Group 4''' : 20min presentation + discussion<br />
* 4:45 pm Adjourn<br />
<br />
==== Group Details ====<br />
<br />
===== Group 1: Microscopy and Histology =====<br />
<br />
# '''Ross Whitaker''' (designated speaker) ''Fast Nonlocal Algorithms for Denoising Microscopy, MRI, and Ultrasound Images Using Nonparametric Neighborhood Statistics.''<br />
# '''Marc Niethammer''' ''Adding Deconvolution Algorithms to ITK''<br />
# '''Raghu Machiraju''' ''A Comprehensive Workflow for Robust Characterization of Microstructure for Cancer Studies''<br />
# '''Raghu Machiraju''' ''A Comprehensive Workflow for Large Histology Segmentation and Visualization''<br />
<br />
===== Group 2: Clinical Applications and CADs =====<br />
<br />
# '''Thomas Fletcher''' (designated speaker) ''ITK Algorithms for Analyzing Time-Varying Shape with Application to Longitudinal Heart Modeling''<br />
# '''Ricardo Avila''' ''Fostering Open Science for Lung Cancer Lesion Sizing''<br />
# '''Nikos Chrisochoides''' ''3D Real-Time Physics-Based Non-Rigid Registration for Image Guided Neurosurgery''<br />
<br />
===== Group 3: Video =====<br />
<br />
# '''Amitha Perera''' and '''Patrick Reynolds''' (designated speakers) ''ITKExtensions for Video Processing''<br />
# '''Kevin Cleary''' ''Real-Time Image Capture for ITK through a Video Grabber''<br />
# '''John Galeotti''' ''Methods in Medical Image Analysis: An ITK-Based Course with Deliverable Algorithms that extends and evaluates ITK while broadening its developer base''<br />
<br />
===== Group 4: Data and Web-based Applications =====<br />
<br />
# '''Sean Megason''' (designated speaker) ''SCORE++: Crowd source data, automatic segmentation and ground truth for ITK4''<br />
# '''Marcel Prastawa''' ''SCORE: Systematic Comparison through Objective Rating and Evaluation''<br />
# '''Ziv Yaniv''' ''Framework for automated parameter tuning of ITK registration pipelines''<br />
<br />
===== Working Groups Tasks =====<br />
<br />
* Each PI or representative will share a brief summary (max 5min) of their proposal with the other members of the group.<br />
* Things to Discuss:<br />
*# How the A2D2s will advance the subject under consideration.<br />
*# Find possible overlaps and similarities between the A2D2s, and resolve them.<br />
*# Decide how the software will be distributed. (e.g. ITK module, ITK classes, independent software, IJ, etc...)<br />
*# List all the new classes/modules that will be contributed to ITK<br />
*# Discuss the design, architecture, and dependencies<br />
*# List ITKv4 features that you might need to use (e.g. GPU? Multi-thread? Streaming?)<br />
*# Discuss how each member of the group can help/assist each other<br />
*# Come up with a plan of action and '''time-line'''<br />
*# Combine slides into a single presentation showing all the points that were discussed<br />
<br />
===== Plenary Sessions =====<br />
<br />
* Designated speaker will present<br />
* Each of the other members should be available to answer questions and/or provide additional explanation<br />
* Discuss possible problems and challenges<br />
<br />
==== Arrival Group Dinner ====<br />
<br />
* Irish Pub<br />
<br />
=== Tuesday June 28th ===<br />
<br />
<br />
==== Tuesday Morning ====<br />
<br />
* 8:30 am Welcome, Questions, Concerns<br />
* 9:00 am Working Groups<br />
** '''Group 1''' ITK Revise<br />
** '''Group 2''' DICOM<br />
* 10:30 am Break<br />
* 11:00 am Working Groups<br />
** '''Group 3''' GPU and Multithreading<br />
** '''Group 4''' Simplify<br />
* 12:30 pm Lunch<br />
<br />
==== Tuesday Afternoon ====<br />
<br />
* 1:30 pm Plenary Session: '''Revise'''<br />
** 20min presentations and the discussion about Registration, FEM, LevelSets.<br />
* 3:00 pm Break<br />
* 3:30 pm Plenary Session: '''Simplify'''<br />
** 20min presentations and the discussion about SimpleITK, WrapITK, Doxygen for SimpleITK<br />
* 4:30 pm Plenary Session: '''GPU & Multithreading'''<br />
** 20min presentations and discussion<br />
* 5:30 pm Adjourn<br />
<br />
==== Dinner ====<br />
<br />
* GROUP DINNER<br />
** Mama Dip's<br />
*** http://www.mamadips.com/<br />
<br />
=== Wednesday June 29th ===<br />
<br />
==== Wednesday Morning ====<br />
<br />
* 8:30 am Welcome, Questions, Concerns<br />
* 9:00 am Plenary Session: '''DICOM'''<br />
** 20min presentation and discussion DCMTK, GDCM<br />
* 9:30 am Modularization (Bill Hoffman)<br />
* 10:00 am Break<br />
* 10:30 am Migration Guide (Gabe Hart / Dave Cole)<br />
* 11:00 am Road Ahead (What's Next ?) (Terry Yoo)<br />
<br />
* 12:00 pm Lunch<br />
<br />
==== Wednesday Afternoon ====<br />
<br />
* 1:00 pm Testing Data (Patrick Reynolds / Bill Hoffman)<br />
* 1:30 pm Integration (Slicer/Wiki Examples/OTB/ImageJ/ICY/OME/V3D) (Bill Lorensen / Luis Ibanez)<br />
* 2:00 pm Code Revisions (Jim Miller)<br />
* 2:30 pm Break<br />
* 3:00 pm New Process for New Incoming Code (Bill Lorensen)<br />
* 3:30 pm Doxygen Documentation (Arnaud Gelas)<br />
* 4:00 pm Adjourn<br />
<br />
==== Dinner ====<br />
<br />
* GROUP DINNER for Survivors<br />
** Carolina Brewery.<br />
<br />
== Attendees ==<br />
<br />
Please add your name to the list below if you are planning to attend.<br />
<br />
=== Kitware ===<br />
<br />
* Luis Ibanez<br />
* Bill Hoffman<br />
* Stephen Aylward<br />
* David Cole <br />
* Marcus Hanwell<br />
* Xiaoxiao Liu (Lesion Sizing Toolkit)<br />
* Andinet Enquobahrie (A2D2 Registration)<br />
* Michel Audette (A2D2 Meshes)<br />
* Amitha Perera (A2D2 Video)<br />
* Gabe Hart (A2D2 Video / Simple ITK)<br />
* Patrick Reynolds (A2D2 Video/SCORE/SCORE++)<br />
* Brad Davis<br />
* Mike Grauer (SCORE/SCORE++/A2D2 Registration)<br />
<br />
=== University of Iowa ===<br />
* Vincent Magnotta<br />
* Hans Johnson<br />
<br />
=== University of Pennsylvania ===<br />
*Brian Avants<br />
*James C. Gee<br />
*Nick Tustison<br />
<br />
=== Harvard University ===<br />
* Sean Megason<br />
* Arnaud Gelas<br />
* Won-Ki Jeong (SEAS)<br />
<br />
=== The Ohio State University ===<br />
<br />
* Raghu Machiraju<br />
* Kun Huang<br />
* Zhi Han<br />
<br />
=== College of William and Mary ===<br />
<br />
* Nikos Chrisochoides<br />
* Dr. Kot<br />
* Dr. Liu<br />
<br />
=== University of Utah ===<br />
<br />
* Ross Whitaker<br />
* Marcel Prastawa<br />
<br />
=== Cosmo Software===<br />
<br />
* Drew Wasem<br />
* Ashish Sharma<br />
* Alex Gouaillard [A*STAR] (over the phone / internet)<br />
<br />
=== GE ===<br />
* Jim Miller<br />
* Dirk Padfield<br />
<br />
=== Mayo Clinic ===<br />
* Dan Blezek<br />
<br />
=== University of North Carolina ===<br />
<br />
* Cory Quammen<br />
* Marc Niethammer<br />
<br />
=== National Library of Medicine ===<br />
* Terry Yoo<br />
* Bradley Lowekamp<br />
* Jesus Caban<br />
<br />
=== Georgetown University / CNMC ===<br />
* Ziv Yaniv<br />
<br />
=== Noware ===<br />
* Bill Lorensen<br />
<br />
=== Bioscan ===<br />
* John McInerney<br />
<br />
=== Attendance Matrix ===<br />
<br />
{| border="1"<br />
|- bgcolor="#abcdef"<br />
! Name !! Monday June 27 !! Tuesday June 28 !! Wednesday June 29<br />
|-<br />
| Luis Ibanez || X || X || X<br />
|-<br />
| Bill Hoffman || || || <br />
|-<br />
| Stephen Aylward || || || <br />
|-<br />
| David Cole || || || <br />
|-<br />
| Marcus Hanwell || || || <br />
|-<br />
| Xiaoxiao Liu || || || <br />
|-<br />
| Andinet Enquobahrie || || || <br />
|-<br />
| Michel Audette || || || <br />
|-<br />
| Amitha Perera || || || <br />
|-<br />
| Gabe Hart || || || <br />
|-<br />
| Patrick Reynolds || || || <br />
|-<br />
| Brad Davis || || || <br />
|-<br />
| Vincent Magnotta || || || <br />
|-<br />
| Hans Johnson || || || <br />
|-<br />
| Brian Avants || || || <br />
|-<br />
| James C. Gee || || || <br />
|-<br />
| Nick Tustison || x || x || x <br />
|-<br />
| Sean Megason || x || x || x <br />
|-<br />
| Arnaud Gelas || || || <br />
|-<br />
| Won-Ki Jeong || X || X || X <br />
|-<br />
| Raghu Machiraju || x || x || x<br />
|-<br />
| Kun Huang || || || <br />
|-<br />
| Zhi Han || || || <br />
|-<br />
| Nikos Chrisochoides || || || <br />
|-<br />
| Dr. Kot || || || <br />
|-<br />
| Dr. Liu || || || <br />
|-<br />
| Ross Whitaker || || || <br />
|-<br />
| Drew Wasem || X || X || X <br />
|-<br />
| Ashish Sharma || X || X || X <br />
|-<br />
| Alex Gouaillard (remote) || || || <br />
|-<br />
| Jim Miller || X || X || X <br />
|-<br />
| Dirk Padfield || X || X || X <br />
|-<br />
| Dan Blezek || X || X || X <br />
|-<br />
| Cory Quammen || || || <br />
|-<br />
| Marc Niethammer || || || <br />
|-<br />
| Terry Yoo || || || <br />
|-<br />
| Bradley Lowekamp || || || <br />
|-<br />
| Jesus Caban || || || <br />
|-<br />
| Ziv Yaniv || X || X || <br />
|-<br />
| Bill Lorensen || X || X || X <br />
|-<br />
| Marcel Prastawa || X || || <br />
<br />
|}</div>Michael.grauerhttps://public.kitware.com/Wiki/index.php?title=Proposals:Condor&diff=40162Proposals:Condor2011-05-26T21:25:29Z<p>Michael.grauer: /* MPICH2 and Condor on Windows */</p>
<hr />
<div>= Introduction =<br />
<br />
Condor is an open source distributed computing software framework. It can be used to manage workload on a dedicated cluster of computers, and/or to farm out work to idle desktop computers. Condor is a cross-platform system that runs on Unix and Windows operating systems. It is a complex and flexible system that can execute jobs in serial and parallel mode; for parallel jobs, it supports the MPI standard. This wiki page documents our working experience using Condor.<br />
<br />
= Downloading Condor =<br />
<br />
Different versions of Condor can be downloaded from [http://www.cs.wisc.edu/condor/downloads-v2/download.pl here]. This documentation focuses on our experience installing and configuring Condor version 7.2.0. The official detailed documentation for this version can be found [http://www.cs.wisc.edu/condor/manual/v7.2/ here]<br />
<br />
= Preparation =<br />
<br />
As Condor is a flexible system, there are different ways of configuring it in your computing infrastructure. Hence, before starting the installation, make the following important decisions.<br />
<br />
#What machine will be the central manager?<br />
#What machines should be allowed to submit jobs?<br />
#Will Condor run as root or not?<br />
#Do I have enough disk space for Condor?<br />
#Do I need MPI configured?<br />
<br />
Condor can be installed as a ''manager'' node, an ''execute'' node, a ''submit'' node, or any combination of these roles. See [http://www.cs.wisc.edu/condor/manual/v7.2/3_1Introduction.html#SECTION00411000000000000000 The Different Roles a Machine Can Play]<br />
<br />
* Manager: There can be only one central manager for your pool. This machine is the collector of information and the negotiator between resources and resource requests.<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer to the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]<br />
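<br />
In practice, a machine's role is expressed through the DAEMON_LIST setting in its configuration file. A hedged sketch of the three pure roles (daemon lists follow the standard Condor example configuration; a machine playing several roles combines the lists, and each file would contain only the line for its role):<br />
<br />
```
# Central manager only:
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR
# Submit-only node:
DAEMON_LIST = MASTER, SCHEDD
# Execute-only node:
DAEMON_LIST = MASTER, STARTD
```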
<br />
= Installation =<br />
== Unix ==<br />
<br />
'''The official instructions on how to install Condor on Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here].''' Below we present some of the tweaks we had to make to get it working on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.); if it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
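<br />
Once the machine is back up, a quick check can confirm the change (a sketch; ''has_domain'' is a hypothetical helper, and ''hostname -f'' prints the fully qualified name on most Linux systems):<br />
<br />
```shell
# has_domain NAME: succeed if NAME contains a domain part (i.e. a dot).
has_domain() {
  case "$1" in
    *.*) return 0 ;;
    *)   return 1 ;;
  esac
}

# Fall back to the short hostname if "hostname -f" is unavailable.
fqdn=$(hostname -f 2>/dev/null || hostname)
if has_domain "$fqdn"; then
  echo "FQDN looks OK: $fqdn"
else
  echo "No domain part in '$fqdn'; check /etc/hosts and /etc/hostname" >&2
fi
```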
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a command similar to the following to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
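<br />
If you script the download, deriving the package name from the machine's architecture helps avoid the mismatch described later on this page (a sketch; ''pick_condor_pkg'' is a hypothetical helper, and the package names are illustrative examples mirroring the ones mentioned here, not an exhaustive list of builds):<br />
<br />
```shell
# pick_condor_pkg ARCH: map a `uname -m` value to a Condor package name.
pick_condor_pkg() {
  case "$1" in
    x86_64) echo "condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz" ;;
    ia64)   echo "condor-7.2.X-linux-ia64-rhel3-dynamic.tar.gz" ;;
    i?86)   echo "condor-7.2.X-linux-x86-rhel5-dynamic.tar.gz" ;;
    *)      echo "no known package for arch '$1'" >&2; return 1 ;;
  esac
}

# Print the package name matching this machine (tolerate unknown arches).
pick_condor_pkg "$(uname -m)" || true
```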
<br />
* You should install Condor as '''root''' or as a user with equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (For example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If you have not already done so, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation scripts ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update the PATH variable to include the directories ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save file and apply the change by running <br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out and log back in, or even restart the machine, and check that the CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit the Condor manager configuration file and update the lines shown below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, then to allow MIDAS to run Condor commands, create a symbolic link to /root/condor/etc/condor_config in /home/condor<br />
 cd /home/condor<br />
 ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring a Executer/Submitter in Unix ===<br />
The files that allow the server to also be used as a Condor submitter/executer were automatically updated when the installation script ''condor_install'' ran. Nevertheless, you still need to update the local configuration file.<br />
<br />
* Edit the node's condor_config.local file and update the line shown below:<br />
vi /home/condor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the lines containing ''UID_DOMAIN'' and ''FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor on Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor on Windows 7.<br />
<br />
# Download the Windows installer MSI and run it, installing to "C:\condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager (full address).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For accounting domain enter your domain (e.g. yourdomaininternal.com)<br />
# For email settings, I skipped this step by clicking Next<br />
# For Java settings, I skipped this step by clicking Next, as we weren't using Java<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked whether you want a custom install or a standard install, choose the standard install. This will install Condor to C:\condor.<br />
<br />
<br />
The installer will ask you to reboot your machine. If you want to access the Condor command-line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When you run Condor commands, start a Cygwin or cmd prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix up some problems. My '''condor_status''' at first gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect: since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. (Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or some other IP that you would not want.)<br />
<br />
<br />
I shut down Condor, right-clicked C:\condor in Windows Explorer, turned off "read only", and set permissions to allow writing. Then I edited the file "C:\condor\condor_config.local" (which started out empty) so that it picks up some replacement values, since some of them didn't seem to be set properly during the install. These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
 # Choose one of the following DAEMON_LIST settings:<br />
 #<br />
 # For a submit/execute node:<br />
 DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
 # For a central collector host that is also a submit/execute node:<br />
 # DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourinternaldomain.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
Then restart Condor. If you run "condor_status" and see that the Windows machine has Activity=OWNER rather than UNCLAIMED, be sure that you have added START = True. This may not be the best configuration for a Windows workstation that is in use, though; some additional configuration is probably needed to make sure a Condor job doesn't use the machine when a physical human user is there using it.<br />
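<br />
As a starting point for such a policy, Condor's standard example configuration expresses it with keyboard-idle and load-average expressions. A hedged sketch (the macro names KeyboardIdle, LoadAvg, and MINUTE come from that example configuration; the thresholds here are illustrative, not recommendations):<br />
<br />
```
# Only start jobs after 15 minutes of no keyboard/mouse activity,
# and only while the machine is otherwise lightly loaded (illustrative):
START    = KeyboardIdle > 15 * $(MINUTE) && LoadAvg < 0.3
# Suspend a running job as soon as the user comes back:
SUSPEND  = KeyboardIdle < $(MINUTE)
# Resume once the user has been away again for 5 minutes:
CONTINUE = KeyboardIdle > 5 * $(MINUTE)
```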
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here is the batch file contents for the actual job (printname.bat)<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub Condor submission file; I submitted it with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, open a command prompt or Cygwin terminal with elevated privileges: right-click the icon and choose "Run as administrator".<br />
<br />
condor_master runs as a service on Windows and controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows for the first time, you need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* Run the Condor master daemon<br />
condor_master<br />
<br />
* Assuming that during installation you set the type to ''manager,execute,submit'' (the default), run the following command<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after starting Condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Set up Condor to start automatically at boot<br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update the MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add the ''condor.boot'' service to all runlevels<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
#include <unistd.h><br />
#include <stdio.h><br />
int main( int argc, char** argv )<br />
{<br />
printf( "%s\n", argv[1] );<br />
fflush( stdout );<br />
sleep( 30 );<br />
return 0; <br />
}<br />
<br />
This program echoes the command-line argument it is given, waits 30 seconds, then exits.<br />
<br />
Save this file as foo.c, then compile it as follows (note the static linking).<br />
<br />
gcc foo.c -o foo --static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once, so you can see how Condor executes the job on multiple execute resources), you can change the condorjob file to the following:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
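<br />
Writing one log/error/output/queue stanza per job gets tedious beyond a few jobs; the same pattern can be generated with a small script instead (a sketch; the file and argument names follow the example above, and ''condorjob.generated'' is an illustrative output name):<br />
<br />
```shell
#!/bin/sh
# Generate a submit file that queues N variants of the job above,
# each with its own log, error, and output files.
N=3
{
  printf 'universe = vanilla\nexecutable = foo\n'
  printf 'should_transfer_files = YES\nwhen_to_transfer_output = ON_EXIT\n'
  for i in $(seq 1 "$N"); do
    printf 'log = condorjob%d.log\nerror = condorjob%d.err\noutput = condorjob%d.out\n' "$i" "$i" "$i"
    printf 'arguments = "helloworld%d"\nqueue\n' "$i"
  done
} > condorjob.generated
```
<br />
Submitting the generated file with ''condor_submit condorjob.generated'' then queues all N jobs at once.<br />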
<br />
We had a case where we had 6 slots: 2 were 32-bit with Arch=INTEL and 4 were 64-bit with Arch=X86_64 (but we were unaware of the difference at first). We ran 6 jobs and then wondered why they would only execute on the submitting machine. So we changed the condorjob file to request a specific architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told Condor to execute only on machines with the INTEL architecture, so the job was not run on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. We've included this story in case it helps with debugging.<br />
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was ''condor''.<br />
*When building BatchMake, you need to build with grid support turned on<br />
<br />
Condor supplies a number of utility programs and log files that are extremely helpful in understanding and correcting problems. Our setups keep the Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows the jobs you have submitted to Condor that are still active. It gives a cluster ID and process ID for each job.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
Given the cluster ID and process ID, these tell you how many execute machines matched each of your job's requirements.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Removes the job with cluster ID CID and process ID PID from your Condor pool. Useful for killing held or idle jobs.<br />
For more details on the daemons described below, see section 3.1.2, "The Condor Daemons", in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process starts and stops all other Condor daemons. condor_master writes a MasterLog and has a .master_address file. Be sure that .master_address contains the correct IP address. If the correct IP address doesn't appear, set the correct value in the parameter<br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has, but Condor may have picked a different IP than you wanted, perhaps the local loopback address 127.0.0.1 or a wireless adapter instead of the Ethernet one.<br />
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes and maintains data about submitted jobs. It tracks the queueing of jobs and tries to obtain resources so that all of its jobs can run. When a submitted job runs, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine while an actual execution of the job is running. It takes care of system calls that need to be executed on the submitting machine for a process. There is one condor_shadow process for each executing process from a submission machine, so on a machine with a large number of submitted processes, the memory or other resources available to support the shadow daemons could become a limitation.<br />
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution. Log files of interest include NegotiatorLog and MatchLog.<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon detects user activity on an execute node, so Condor knows whether to allow a job to execute or to disallow it because a human user is currently engaged in some task.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, corresponding to [http://en.wikipedia.org/wiki/IA-64 IA64], the 64-bit Intel Itanium processor. This architecture does not cover all 64-bit Intel processors.<br />
<br />
While trying to run ''condor_master'', the shell returned the error message ''cannot execute binary file''<br />
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], you can extract the header of an executable and determine whether it can run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the outputs, it is clear that '''Intel IA-64''' is not the right architecture: this machine runs '''Advanced Micro Devices X86-64''' binaries, so the x86_64 Condor package is required.<br />
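<br />
The same check can be scripted without readelf by reading the ''e_machine'' field directly from the ELF header. The following Python sketch is not part of the original setup (readelf remains the authoritative tool); the machine-name strings simply mirror readelf's output:<br />
<br />
```python
import struct

# A few ELF e_machine values (16-bit field at byte offset 18 of the header),
# named as readelf reports them.
MACHINES = {3: "Intel 80386", 50: "Intel IA-64", 62: "Advanced Micro Devices X86-64"}

def elf_machine(header: bytes) -> str:
    """Return the machine name encoded in an ELF header."""
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # Both headers above are little endian (EI_DATA == 1 at offset 5),
    # so e_machine can be read as a little-endian unsigned short.
    (machine,) = struct.unpack_from("<H", header, 18)
    return MACHINES.get(machine, "unknown ({0})".format(machine))

if __name__ == "__main__":
    import os
    if os.path.exists("/bin/ls"):  # any local ELF binary will do
        with open("/bin/ls", "rb") as f:
            print(elf_machine(f.read(20)))
```
<br />
Running it against an x86-64 binary such as /bin/ls should report the same machine name that readelf printed above.<br />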
<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was ''condor''.<br />
*When building BatchMake, you need to build with grid support turned on.<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems. Our setups keep the Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the location of the MPICH2\bin directory to your system path, and add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication password for smpd; this is the '''mypassphrase''' you entered above. If all is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processes since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines, how many processes each machine should run, and the executable '''hostname''', which should print the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, you must right click on the project file (not the solution file), then click properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then to run your exe on two different machines (I had to build a RELEASE for this, at first I had built a DEBUG build, but this caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
<br />
== MPICH2 and Condor on Windows ==<br />
<br />
To configure MPI over Condor, I set my Windows 7 laptop up as a dedicated Condor scheduler and executer, following the instructions at this link:<br />
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
I turned off smpd.exe as a service, since Condor will manage it:<br />
<br />
smpd.exe -remove<br />
<br />
<br />
I then created the following MPI example program, and ran it with a machine_count of 2 and of 4 at different times (I ran this on my laptop, which has 4 cores and Condor running locally). This program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.<br />
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char *argv[])<br />
{<br />
MPI::Init(argc, argv);<br />
//<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
int size = MPI::COMM_WORLD.Get_size();<br />
char processor_name[MPI_MAX_PROCESSOR_NAME];<br />
int namelen;<br />
MPI::Get_processor_name(processor_name,namelen);<br />
// <br />
int buf[1];<br />
int numElements = 1;<br />
// initialize buffer with rank<br />
buf[0] = rank;<br />
// example of synchronous communication<br />
// have even ranks send to odd ranks first<br />
// assumes an even number of processes<br />
int syncTag = 123;<br />
if (rank % 2 == 0)<br />
{<br />
// send to next higher rank<br />
int dest = rank + 1;<br />
std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Send(buf,numElements,MPI::INT,dest,syncTag);<br />
// now wait for dest to send back<br />
// even though dest is the param, in the Recv call it is used as the source<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,dest,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
}<br />
else<br />
{<br />
// source is next lower rank<br />
int source = rank - 1;<br />
std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,source,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
// send rank rather than buffer, just to be sure<br />
// even though param is source, in Send call it is for dest<br />
MPI::COMM_WORLD.Send(&rank,numElements,MPI::INT,source,syncTag);<br />
}<br />
// now for asynchronous communication<br />
std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
int asyncTag = 321;<br />
int leftRank;<br />
if (rank == 0)<br />
{<br />
leftRank = size-1;<br />
}<br />
else <br />
{<br />
leftRank = (rank-1);<br />
}<br />
int rightRank;<br />
if (rank == size-1)<br />
{<br />
rightRank = 0;<br />
}<br />
else <br />
{<br />
rightRank = (rank+1);<br />
}<br />
// everyone sends to the leftRank, and receives from the rightRank<br />
// if this were synchronous, would be in danger of deadlocking because of circular waits<br />
// reset buffers to rank<br />
buf[0] = rank;<br />
std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::Request sendReq = MPI::COMM_WORLD.Isend(buf,numElements,MPI::INT,leftRank,asyncTag);<br />
MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf,numElements,MPI::INT,rightRank,asyncTag);<br />
// wait on the receipt<br />
MPI::Status status;<br />
recvReq.Wait(status);<br />
std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
//<br />
MPI::Finalize();<br />
return(0);<br />
}<br />
<br />
<br />
I then created a two-machine Condor parallel universe grid, using the same user/password combination on both machines; this user had Administrative rights. My 4-core Windows 7 Professional laptop (X86_64,WINNT61) was the '''CONDOR_HOST''' and the '''DedicatedScheduler'''. Assuming this laptop is called '''clavicle''' in the '''shoulder.com''' domain, here are the important configuration parameters in its '''condor_config.local''' file:<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files (x86)\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD, COLLECTOR, NEGOTIATOR, SMPD_SERVER<br />
<br />
<br />
The second machine, a 2 core Windows XP desktop called '''scapula''' (INTEL,WINNT51) was a dedicated machine with the following important parameters in its '''condor_config.local''' file:<br />
<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, STARTD, SMPD_SERVER<br />
<br />
Note how it points to '''clavicle''' both as '''CONDOR_HOST''' and as the '''DedicatedScheduler''', and that it does not have '''(x86)''' in its path to '''smpd.exe'''.<br />
<br />
MPICH2 gets installed to different locations, to '''C:\Program Files\MPICH2''' for an (INTEL,WINNT51) machine, and to '''C:\Program Files (x86)\MPICH2''' for a (X86_64,WINNT61) machine. Because I needed to refer to the '''mpiexec.exe''' path in the batch file that my Condor submit file referred to, and this batch file would be the same for both machines, I opted to copy '''mpiexec.exe''' to the work directory I was using on all machines, '''C:\mpich2work'''.<br />
<br />
After setting up this configuration, I rebooted both machines to start in this configuration. At this point, the command<br />
net stop condor<br />
no longer returns correctly, since Condor is managing the '''smpd.exe''' service now and does not know how to stop that service.<br />
<br />
<br />
<br />
Based on the Condor submit file described in the Condor wiki page, I had to adapt my own submit file for my above MPI executable called '''mpiwork.exe''' to be the following:<br />
<br />
<br />
universe = parallel<br />
executable = mp2script.bat<br />
arguments = mpiwork.exe<br />
machine_count = 6<br />
output = out.$(NODE).log<br />
error = error.$(NODE).log<br />
log = work.$(NODE).log<br />
should_transfer_files = yes<br />
when_to_transfer_output = on_exit<br />
transfer_input_files = mpiwork.exe<br />
Requirements = ((Arch == "INTEL") && (OpSys == "WINNT51")) || ((Arch == "X86_64") && (OpSys == "WINNT61"))<br />
queue<br />
<br />
and my '''mp2script.bat''' file was adapted to be: <br />
<br />
set _CONDOR_PROCNO=%_CONDOR_PROCNO%<br />
set _CONDOR_NPROCS=%_CONDOR_NPROCS%<br />
REM If not the head node, just sleep forever<br />
if not [%_CONDOR_PROCNO%] == [0] copy con nothing<br />
REM Set this to the bin directory of MPICH installation<br />
set MPDIR="C:\mpich2work"<br />
REM run the actual mpijob<br />
%MPDIR%\mpiexec.exe -n %_CONDOR_NPROCS% -p 6666 %*<br />
exit 0<br />
<br />
with the change being that '''MPDIR''' is a location that is known to have '''mpiexec.exe''' on both machines, since I copied that file there on both machines as described above.<br />
<br />
A point of confusion I have is that I would have expected the '''mpiwork.exe''' code above to print the names of both '''clavicle''' and '''scapula''' in the output when I ran it on 6 nodes; instead only the name '''clavicle''' appeared. I do believe that this was running on both machines and all 6 cores, as both machines' IP addresses were noted in the work.parallel.log, and both machines had CPU activity at the time the executable ran.</div>Michael.grauerhttps://public.kitware.com/Wiki/index.php?title=Proposals:Condor&diff=40161Proposals:Condor2011-05-26T21:23:41Z<p>Michael.grauer: /* MPICH2 and Condor on Windows */</p>
<hr />
<div>= Introduction =<br />
<br />
Condor is an open source distributed computing software framework. It can be used to manage workload on a dedicated cluster of computers, and/or to farm out work to idle desktop computers. Condor is a cross-platform system that can run on Unix and Windows operating systems. Condor is a complex and flexible system that can execute jobs in serial and parallel mode. For parallel jobs, it supports the MPI standard. This Wikipage is dedicated to documenting our working experience using Condor.<br />
<br />
= Downloading Condor =<br />
<br />
Different versions of Condor can be downloaded from [http://www.cs.wisc.edu/condor/downloads-v2/download.pl here]. This documentation focuses on our experience installing/configuring Condor Version 7.2.0. The official detailed documentation for this version can be found [http://www.cs.wisc.edu/condor/manual/v7.2/ here].<br />
<br />
= Preparation =<br />
<br />
As Condor is a flexible system, there are different ways of configuring Condor in your computing infrastructure. Hence, before starting the installation, make the following important decisions.<br />
<br />
#What machine will be the central manager?<br />
#What machines should be allowed to submit jobs?<br />
#Will Condor run as root or not?<br />
#Do I have enough disk space for Condor?<br />
#Do I need MPI configured?<br />
<br />
Condor can be installed as a ''manager'' node, an ''execute'' node, or a ''submit'' node, or any combination of these roles. See [http://www.cs.wisc.edu/condor/manual/v7.2/3_1Introduction.html#SECTION00411000000000000000 The Different Roles a Machine Can Play]<br />
<br />
* Manager: There can be only one central manager for your pool. This machine is the collector of information and the negotiator between resources and resource requests.<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
''' The official instructions on how to install Condor in Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here] '''. Below we present some of the tweaks we had to do to get it to work on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.) , if it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a similar command to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or with a user having equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the Condor archive is in your home directory (for example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation scripts ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update the PATH variable to include the directories ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save file and apply the change by running <br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out / log in or even restart the machine, and you should be able to check that the CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit the Condor manager configuration file and update the lines as shown below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, in order to allow MIDAS to run Condor commands, create a link to /root/condor/etc/condor_config in /home/condor<br />
cd /home/condor<br />
ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring a Executer/Submitter in Unix ===<br />
The files allowing the server to also be used as a Condor submitter/executer were automatically updated while running the installation script ''condor_install''. Nevertheless, you still need to update its configuration file.<br />
<br />
* Edit the Condor node's condor_config.local and update the line as shown below:<br />
vi /home/condor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the line having ''UID_DOMAIN'' and '' FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor in Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor in Windows 7.<br />
<br />
# Download the Windows install MSI, run it, installing to "C:/condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager ( full address ).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For accounting domain enter your domain (e.g. yourdomaininternal.com)<br />
# For Email settings (I ignored this by clicking next)<br />
# for Java settings (I ignored this as we weren't using Java, by clicking next)<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked whether you want a custom install or the default install, choose the default install. This will install Condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When you run Condor commands, start a Cygwin or cmd prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix up some problems. My '''condor_status''' at first gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect (since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or perhaps just an IP that you would not want).<br />
<br />
<br />
I shut down condor, right clicked on C:/condor in a Windows explorer, turned off "read only" and set permissions to allow for writing. Then I edited the file "c:/condor/condor_config.local" which started out empty, so that it could pick up some replacement values (some of them didn't seem to be set properly during the install). These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following:<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourinternaldomain.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
and then restarted Condor. If you run "condor_status" and see that the Windows machine shows Owner rather than Unclaimed, be sure that you have added START=True. But this may not be the best configuration for a Windows workstation that is in use; some additional configuration is needed to make sure a Condor job doesn't use the machine while a physical human user is there using it.<br />
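<br />
For a workstation that is actively used, the usual approach is to make the '''START''' expression depend on keyboard idle time and on the load not caused by Condor, and to suspend jobs as soon as the user returns. The fragment below is only a sketch based on Condor's standard desktop-policy attributes (KeyboardIdle, LoadAvg, CondorLoadAvg); check the defaults shipped with your version before adopting it:<br />
<br />
```
# Hypothetical desktop policy for condor_config.local:
# start jobs only after 15 minutes of keyboard inactivity and low
# non-Condor load; suspend them when the user comes back.
NonCondorLoadAvg = (LoadAvg - CondorLoadAvg)
START    = (KeyboardIdle > 15 * $(MINUTE)) && ($(NonCondorLoadAvg) <= 0.3)
SUSPEND  = (KeyboardIdle < 2 * $(MINUTE))
CONTINUE = (KeyboardIdle > 5 * $(MINUTE))
```
<br />
Note that KeyboardIdle is only meaningful when the condor_kbdd daemon is running, so KBDD must be in the DAEMON_LIST.<br />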
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here is the batch file contents for the actual job (printname.bat)<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, get a command prompt or Cygwin terminal, right click the icon to start it up, and click "run with elevated privileges" or "run as administrator".<br />
<br />
condor_master runs as a service on Windows, which controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows for the first time, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* run the condor manager <br />
condor_master<br />
<br />
* Assuming that during installation you set up the type as ''manager,execute,submit'' (the default), run the following command<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after you started condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Set up Condor to start automatically at boot<br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update the MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add the ''condor.boot'' service to all runlevels<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
#include <unistd.h><br />
#include <stdio.h><br />
int main( int argc, char** argv )<br />
{<br />
printf( "%s\n", argv[1] );<br />
fflush( stdout );<br />
sleep( 30 );<br />
return 0; <br />
}<br />
<br />
This program prints the command-line argument it is given, waits 30 seconds, then exits.<br />
<br />
Save this file as foo.c, then compile it (note the static linking):<br />
<br />
gcc foo.c -o foo --static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once so you can see how Condor will execute the job on multiple execute resources), you can <br />
change the condorjob file to be like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
<br />
We had a case with 6 slots: 2 were 32-bit with Arch=INTEL, and 4 were 64-bit with Arch=X86_64 (though we were unaware of the difference at first). We ran 6 jobs and then wondered why they would only execute on the submitting machine. So we changed the condorjob file to specify a certain architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told Condor to execute only on machines with the INTEL architecture, so it did not attempt to run the job on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. I've included this story in case it helps with debugging.<br />
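One way to avoid this class of error is to state the architecture requirement explicitly in every submit file, so that a binary built for one architecture can never be matched to a machine of another. For example (a submit-file fragment; adjust the values to your own pool):<br />

```
Requirements = (Arch == "X86_64") && (OpSys == "LINUX")
```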
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was condor.<br />
*When building BatchMake, you need to build with grid support turned on.<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows the jobs you have submitted to Condor that are still in the queue. It will give you a cluster ID and process ID for each job.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process will start and stop all other condor daemons. condor_master will write a MasterLog and have a .master_address file. Be sure that the .master_address contains the correct IP address. If the correct IP address doesn't appear, set the correct value in the parameter <br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has; Condor may have picked a different IP than you wanted, such as the local loopback address 127.0.0.1, or a wireless adapter instead of the wired one.<br />
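Before setting NETWORK_INTERFACE it helps to see the candidates. A Linux sketch (assumes the iproute2 ''ip'' tool):<br />

```shell
# List every configured IPv4 address with its interface name; pick the
# wired, routable one for NETWORK_INTERFACE rather than 127.0.0.1.
ip -4 -o addr show | awk '{print $2, $4}'
```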
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes, and represents data about submitted jobs. It tracks the queueing of jobs and tries to get resources for all of its jobs to be run. When a job submitted is run, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine when an actual execution of the job is run. It will take care of systems calls that need to be executed on the submitting machine for a process. There will be a condor_shadow process for each executing process of a submission machine, meaning that on a machine with a large number of submitted processes, the number of shadow daemons supported by memory or other resources could be a limitation.<br />
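If the shadow daemons do become a bottleneck on a busy submit machine, the schedd can be told to cap the number of simultaneously running jobs. A sketch of the relevant condor_config knob (check your install's current value with condor_config_val MAX_JOBS_RUNNING):<br />

```
MAX_JOBS_RUNNING = 200
```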
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution slot. Log files of interest include NegotiatorLog and MatchLog.<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon is used to detect user activity on an execute node, so Condor knows whether to run a job or to hold off because a human user is currently engaged in some task.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, which corresponds to [http://en.wikipedia.org/wiki/IA-64 IA64], the 64-bit Intel Itanium architecture. This does not cover all 64-bit Intel processors; ordinary 64-bit Intel and AMD chips are X86-64 instead. <br />
<br />
While trying to run ''condor_master'', the shell returned the error message ''cannot execute binary file''.<br />
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], you can extract the header of an executable and determine whether it can run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the outputs, you can see that '''Intel IA-64''' is not the right architecture for this machine; the working binaries are '''Advanced Micro Devices X86-64'''.<br />
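To compare several binaries at a glance, you can print only the Machine line for each. A small sketch (the condor path is an example and is skipped if absent):<br />

```shell
# Print just the ELF Machine field for each binary; any value that differs
# from a known-native program such as /bin/ls will not run on this host.
for exe in /bin/ls ./condor-7.2.0/sbin/condor_master; do
    if [ -e "$exe" ]; then
        printf '%s:%s\n' "$exe" "$(readelf -h "$exe" | awk -F: '/Machine/ {print $2}')"
    fi
done
```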
<br />
<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the location of the MPICH2\bin directory to your system path. Add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication password for smpd; this is the '''mypassphrase''' you entered above. If all is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processes since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines, and how many processes each machine should run, and the executable '''hostname''', which should return the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, you must right click on the project file (not the solution file), then click properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then to run your exe on two different machines (I had to build a RELEASE build for this; at first I had built a DEBUG build, but this caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
<br />
== MPICH2 and Condor on Windows ==<br />
<br />
To configure MPI over Condor, I set my Windows 7 laptop up as a dedicated Condor scheduler and executer, following the instructions at this link:<br />
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
I turned off smpd.exe as a service, since Condor will manage it:<br />
<br />
smpd.exe -remove<br />
<br />
<br />
I then created the following MPI example program, and ran it with a MachineCount of 2 and 4 at different times (I ran this on my laptop, which has 4 cores, and had Condor running locally). This program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.<br />
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char *argv[])<br />
{<br />
MPI::Init(argc, argv);<br />
//<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
int size = MPI::COMM_WORLD.Get_size();<br />
char processor_name[MPI_MAX_PROCESSOR_NAME];<br />
int namelen;<br />
MPI::Get_processor_name(processor_name,namelen);<br />
// <br />
int buf[1];<br />
int numElements = 1;<br />
// initialize buffer with rank<br />
buf[0] = rank;<br />
// example of synchronous communication<br />
// have even ranks send to odd ranks first<br />
// assumes an even number of processes<br />
int syncTag = 123;<br />
if (rank % 2 == 0)<br />
{<br />
// send to next higher rank<br />
int dest = rank + 1;<br />
std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Send(buf,numElements,MPI::INT,dest,syncTag);<br />
// now wait for dest to send back<br />
// even though dest is the param, in the Recv call it is used as the source<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,dest,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
}<br />
else<br />
{<br />
// source is next lower rank<br />
int source = rank - 1;<br />
std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,source,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
// send rank rather than buffer, just to be sure<br />
// even though param is source, in Send call it is for dest<br />
MPI::COMM_WORLD.Send(&rank,numElements,MPI::INT,source,syncTag);<br />
}<br />
// now for asynchronous communication<br />
std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
int asyncTag = 321;<br />
int leftRank;<br />
if (rank == 0)<br />
{<br />
leftRank = size-1;<br />
}<br />
else <br />
{<br />
leftRank = (rank-1);<br />
}<br />
int rightRank;<br />
if (rank == size-1)<br />
{<br />
rightRank = 0;<br />
}<br />
else <br />
{<br />
rightRank = (rank+1);<br />
}<br />
// everyone sends to the leftRank, and receives from the rightRank<br />
// if this were synchronous, would be in danger of deadlocking because of circular waits<br />
// reset buffers to rank<br />
buf[0] = rank;<br />
std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::Request sendReq = MPI::COMM_WORLD.Isend(buf,numElements,MPI::INT,leftRank,asyncTag);<br />
MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf,numElements,MPI::INT,rightRank,asyncTag);<br />
// wait on the receipt<br />
MPI::Status status;<br />
recvReq.Wait(status);<br />
std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
//<br />
MPI::Finalize();<br />
return(0);<br />
}<br />
<br />
<br />
I then created a two machine Condor parallel universe grid, using the same user/password combination on both machines; this user had Administrative rights. My 4 core Windows 7 Professional laptop (X86_64,WINNT61) was the '''CONDOR_HOST''' and the '''DedicatedScheduler'''. Assuming this laptop is called '''clavicle''' in the '''shoulder.com''' domain, here are the important configuration parameters in its '''condor_config.local''' file: <br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files (x86)\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD, COLLECTOR, NEGOTIATOR, SMPD_SERVER<br />
<br />
<br />
The second machine, a 2 core Windows XP desktop called '''scapula''' (INTEL,WINNT51) was a dedicated machine with the following important parameters in its '''condor_config.local''' file:<br />
<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, STARTD, SMPD_SERVER<br />
<br />
Note how it points to '''clavicle''' both as '''CONDOR_HOST''' and as the '''DedicatedScheduler''', and that it does not have '''(x86)''' in its path to '''smpd.exe'''.<br />
<br />
MPICH2 gets installed to different locations, to '''C:\Program Files\MPICH2''' for an (INTEL,WINNT51) machine, and to '''C:\Program Files (x86)\MPICH2''' for a (X86_64,WINNT61) machine. Because I needed to refer to the '''mpiexec.exe''' path in the batch file that my Condor submit file referred to, and this batch file would be the same for both machines, I opted to copy '''mpiexec.exe''' to the work directory I was using on all machines, '''C:\mpich2work'''.<br />
<br />
<br />
Based on the Condor submit file described in the Condor wiki page, I had to adapt my own submit file for my above MPI executable called '''mpiwork.exe''' to be the following:<br />
<br />
<br />
universe = parallel<br />
executable = mp2script.bat<br />
arguments = mpiwork.exe<br />
machine_count = 6<br />
output = out.$(NODE).log<br />
error = error.$(NODE).log<br />
log = work.$(NODE).log<br />
should_transfer_files = yes<br />
when_to_transfer_output = on_exit<br />
transfer_input_files = mpiwork.exe<br />
Requirements = ((Arch == "INTEL") && (OpSys == "WINNT51")) || ((Arch == "X86_64") && (OpSys == "WINNT61"))<br />
queue<br />
<br />
and my '''mp2script.bat''' file was adapted to be: <br />
<br />
set _CONDOR_PROCNO=%_CONDOR_PROCNO%<br />
set _CONDOR_NPROCS=%_CONDOR_NPROCS%<br />
REM If not the head node, just sleep forever<br />
if not [%_CONDOR_PROCNO%] == [0] copy con nothing<br />
REM Set this to the bin directory of MPICH installation<br />
set MPDIR="C:\mpich2work"<br />
REM run the actual mpijob<br />
%MPDIR%\mpiexec.exe -n %_CONDOR_NPROCS% -p 6666 %*<br />
exit 0<br />
<br />
with the change being that '''MPDIR''' is a location that is known to have '''mpiexec.exe''' on both machines, since I copied that file there on both machines as described above.<br />
<br />
A point of confusion I have is that I would have expected the '''mpiwork.exe''' code above to print the names of both '''clavicle''' and '''scapula''' in the output when I ran it on 6 nodes. Instead only the name '''clavicle''' appeared. I do believe that this was running on both machines and all 6 cores, as both machines' IP addresses were noted in the work.parallel.log, and both machines had CPU activity at the time the executable ran.</div>
<hr />
<div>Condor can be installed as a ''manager'' node, an ''execute'' node, or a ''submit'' node, or any combination of these. See [http://www.cs.wisc.edu/condor/manual/v7.2/3_1Introduction.html#SECTION00411000000000000000 The Different Roles a Machine Can Play]:<br />
<br />
* Manager: There can be only one central manager for your pool. This machine is the collector of information, and the negotiator between resources and resource requests.<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
''' The official instructions on how to install Condor in Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here] '''. Below we present some of the tweaks we had to do to get it to work on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.). If it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
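After the reboot, you can confirm the change took; the command below should print the fully qualified name (a sketch, assuming the standard GNU/Linux ''hostname'' tool):<br />

```shell
# Condor derives the machine's identity from the fully qualified name;
# a bare hostname here means /etc/hosts or /etc/hostname still needs fixing.
hostname -f
```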
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a similar command to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or with a user having equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (For example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation scripts ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update PATH variable to include the directory ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save file and apply the change by running <br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out and back in, or even restart the machine, and check that the CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit the Condor manager's config file and update the lines as shown below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, in order to allow MIDAS to run Condor commands, create a link to /root/condor/etc/condor_config in /home/condor<br />
cd /home/condor<br />
ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring a Executer/Submitter in Unix ===<br />
The files allowing the server to also be used as a Condor submitter/executer were automatically updated when running the installation script ''condor_install''. Nevertheless, you still need to update its configuration file.<br />
<br />
* Edit the Condor node's condor_config.local and update the line as shown below:<br />
vi /home/condor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the lines for ''UID_DOMAIN'' and ''FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor in Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor in Windows 7.<br />
<br />
# Download the Windows install MSI, run it, installing to "C:/condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager ( full address ).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For accounting domain enter your domain (e.g. yourdomaininternal.com)<br />
# For Email settings, I ignored this by clicking next<br />
# For Java settings, I ignored this (as we weren't using Java) by clicking next<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked if you want a custom install or install, choose install. This will install condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When you will be running condor commands, start a cygwin or cmd prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix up some problems. My '''condor_status''' at first gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect (since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or perhaps just an IP that you would not want).<br />
<br />
<br />
I shut down condor, right clicked on C:/condor in a Windows explorer, turned off "read only" and set permissions to allow for writing. Then I edited the file "c:/condor/condor_config.local" which started out empty, so that it could pick up some replacement values (some of them didn't seem to be set properly during the install). These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose ONE of the following DAEMON_LIST lines<br />
# (if both are present, the later definition wins):<br />
#<br />
# For a submit/execute node:<br />
# DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourinternaldomain.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
Then I restarted Condor. If you run "condor_status" and see that the Windows machine shows Owner rather than Unclaimed, be sure that you have added START = True. But this may not be the best configuration for a Windows workstation that is in use: additional configuration is needed to make sure a Condor job doesn't use the machine while a person is actively working on it.<br />
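For reference, a common way to express such a desktop policy is through the START expression. The sketch below is an assumption based on standard Condor policy attributes (KeyboardIdle, LoadAvg, CondorLoadAvg), not something we tested in this setup; check the exact names against your version's manual:<br />

```
# Hypothetical desktop-policy sketch -- verify attribute names for your version.
# Load on the machine that is NOT caused by Condor jobs:
NonCondorLoadAvg = (LoadAvg - CondorLoadAvg)
# Only start a job if the keyboard/mouse have been idle for 15 minutes
# and the owner's own processes are not keeping the machine busy:
START = (KeyboardIdle > 15 * 60) && ($(NonCondorLoadAvg) < 0.3)
```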
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here is the batch file contents for the actual job (printname.bat)<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, start a command prompt or Cygwin terminal by right-clicking its icon and choosing "run as administrator" (or "run with elevated privileges").<br />
<br />
condor_master runs as a service on Windows; it controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows for the first time, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* run the condor manager <br />
condor_master<br />
<br />
* Assuming that during installation you set the type to ''manager,execute,submit'' (the default), run the following command<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after you started condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Setup condor to automatically startup <br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add ''condor.boot'' service to all runlevel<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
#include <unistd.h><br />
#include <stdio.h><br />
int main( int argc, char** argv )<br />
{<br />
printf( "%s\n", argv[1] );<br />
fflush( stdout );<br />
sleep( 30 );<br />
return 0; <br />
}<br />
<br />
This program will echo the command line argument it is given, wait 30 seconds, then exit.<br />
<br />
Save this file as foo.c, then compile it with (note static linking).<br />
<br />
gcc foo.c -o foo --static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once, so you can see how Condor executes the job on multiple execute resources), you can change the condorjob file to look like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
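Settings carry over between Queue statements, so only the changed values need to be restated. An equivalent, more compact form uses the $(Process) macro, which condor_submit expands to each job's process number (0, 1, ...) within the cluster:<br />

```
universe = vanilla
executable = foo
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
log = condorjob$(Process).log
error = condorjob$(Process).err
output = condorjob$(Process).out
arguments = "helloworld$(Process)"
Queue 2
```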
<br />
We had a case where we had 6 slots: 2 were 32 bit with Arch=INTEL, 4 were 64 bit with Arch=X86_64 (but we were unaware of the difference at first). We ran 6 jobs and wondered why they would only execute on the submitting machine. So we changed the condorjob file to specify a particular architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told Condor to execute only on machines with an Arch of INTEL, so the job was not matched to the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. I've included this story in case it helps with debugging.<br />
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was Condor.<br />
*When building BatchMake, you need to build with grid support turned on.<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows the jobs you have submitted to Condor that are still active. It will give you a cluster ID and process ID for each job.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process will start and stop all other condor daemons. condor_master will write a MasterLog and have a .master_address file. Be sure that the .master_address contains the correct IP address. If the correct IP address doesn't appear, set the correct value in the parameter <br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has, but Condor may have picked a different IP than you wanted, perhaps the local loopback address 127.0.0.1 or a wireless adapter instead of the wired one.<br />
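If you are unsure which address to put in NETWORK_INTERFACE, the short stdlib check below (Python used purely for illustration; 8.8.8.8 is an arbitrary routable target, and connecting a UDP socket sends no packets) asks the kernel which local address it would route traffic through:<br />

```python
import socket

def primary_ipv4():
    """Best-effort guess at the machine's outbound IPv4 address."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Connecting a UDP socket transmits nothing; it only makes the
        # kernel pick the local interface that would route to this address.
        s.connect(("8.8.8.8", 80))
        return s.getsockname()[0]
    except OSError:
        # No route at all (offline machine): fall back to the loopback,
        # which is exactly the value you would NOT want in NETWORK_INTERFACE.
        return "127.0.0.1"
    finally:
        s.close()

print(primary_ipv4())
```

If this prints 127.0.0.1 or a wireless address, set NETWORK_INTERFACE explicitly to the wired address instead.<br />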
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes and manages data about submitted jobs. It tracks the queue of jobs and tries to obtain resources for all of its jobs to run. When a submitted job runs, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine while an actual execution of the job is under way. It takes care of system calls that need to be executed on the submitting machine for a process. There will be one condor_shadow process for each executing process from a submission machine, meaning that on a machine with a large number of submitted processes, the memory and other resources available for shadow daemons could become a limitation.<br />
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and execute nodes to match each job with an execution slot. Log files of interest include NegotiatorLog and MatchLog.<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon is used to detect user activity on an execute node, so Condor knows whether to allow a job to run or to disallow it because a human user is currently working on the machine.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, corresponding to [http://en.wikipedia.org/wiki/IA-64 IA64], the Intel Itanium 64-bit architecture. IA-64 is not the same as x86-64, the architecture of ordinary 64-bit Intel and AMD processors.<br />
<br />
When trying to run ''condor_master'', the shell returned the error message ''cannot execute binary file''.<br />
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], you can read the ELF header of an executable and determine whether it could run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the outputs, we can see that the '''Intel IA-64''' build is not the right architecture for this machine; the working binaries are '''Advanced Micro Devices X86-64'''.<br />
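The same check can be scripted. The short Python sketch below reads the e_machine field directly from the ELF header (a 16-bit integer at byte offset 18, little-endian for the 2's-complement little-endian files shown above); only a handful of common machine codes are listed, for illustration:<br />

```python
import struct

# A few e_machine values from the ELF specification
MACHINES = {
    3: "Intel 80386",
    50: "Intel IA-64",
    62: "Advanced Micro Devices X86-64",
    183: "AArch64",
}

def elf_machine(path):
    """Return (code, name) for the e_machine field of an ELF binary."""
    with open(path, "rb") as f:
        header = f.read(20)
    if header[:4] != b"\x7fELF":
        raise ValueError("%s is not an ELF file" % path)
    # e_machine sits at byte offset 18 of the ELF header
    # (assuming a little-endian file, like those shown above)
    code = struct.unpack_from("<H", header, 18)[0]
    return code, MACHINES.get(code, "unknown (%d)" % code)

print(elf_machine("/bin/ls"))
```

On the machine above this would report code 62 for /bin/ls, matching readelf's "Advanced Micro Devices X86-64".<br />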
<br />
<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the location of the MPICH2\bin directory to your system PATH, and add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
It will ask you for an authentication password for smpd; this is the '''mypassphrase''' you entered above. If all is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processes since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines, and how many processes each machine should run, and the executable '''hostname''', which should return the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, you must right click on the project file (not the solution file), then click properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then to run your exe on two different machines (I had to build a RELEASE build for this; at first I had built a DEBUG build, but that caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
<br />
== MPICH2 and Condor on Windows ==<br />
<br />
To configure MPI over Condor, I set up my Windows 7 laptop as a dedicated Condor scheduler and execute node, following the instructions at this link:<br />
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
I turned off smpd.exe as a service, since Condor will manage it:<br />
<br />
smpd.exe -remove<br />
<br />
<br />
I then created the following MPI example program, and ran it with a MachineCount of 2 and 4 at different times (I ran this on my laptop, which has 4 cores, and had Condor running locally). This program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.<br />
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char *argv[])<br />
{<br />
MPI::Init(argc, argv);<br />
//<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
int size = MPI::COMM_WORLD.Get_size();<br />
char processor_name[MPI_MAX_PROCESSOR_NAME];<br />
int namelen; // filled in by Get_processor_name<br />
MPI::Get_processor_name(processor_name,namelen);<br />
// <br />
int buf[1];<br />
int numElements = 1;<br />
// initialize buffer with rank<br />
buf[0] = rank;<br />
// example of synchronous communication<br />
// have even ranks send to odd ranks first<br />
// assumes an even number of processes<br />
int syncTag = 123;<br />
if (rank % 2 == 0)<br />
{<br />
// send to next higher rank<br />
int dest = rank + 1;<br />
std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Send(buf,numElements,MPI::INT,dest,syncTag);<br />
// now wait for dest to send back<br />
// even though dest is the param, in the Recv call it is used as the source<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,dest,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
}<br />
else<br />
{<br />
// source is next lower rank<br />
int source = rank - 1;<br />
std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,source,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
// send rank rather than buffer, just to be sure<br />
// even though param is source, in Send call it is for dest<br />
MPI::COMM_WORLD.Send(&rank,numElements,MPI::INT,source,syncTag);<br />
}<br />
// now for asynchronous communication<br />
std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
int asyncTag = 321;<br />
int leftRank;<br />
if (rank == 0)<br />
{<br />
leftRank = size-1;<br />
}<br />
else <br />
{<br />
leftRank = (rank-1);<br />
}<br />
int rightRank;<br />
if (rank == size-1)<br />
{<br />
rightRank = 0;<br />
}<br />
else <br />
{<br />
rightRank = (rank+1);<br />
}<br />
// everyone sends to the leftRank, and receives from the rightRank<br />
// if this were synchronous, would be in danger of deadlocking because of circular waits<br />
// reset buffers to rank<br />
buf[0] = rank;<br />
std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::Request sendReq = MPI::COMM_WORLD.Isend(buf,numElements,MPI::INT,leftRank,asyncTag);<br />
MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf,numElements,MPI::INT,rightRank,asyncTag);<br />
// wait on the receipt<br />
MPI::Status status;<br />
recvReq.Wait(status);<br />
std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
//<br />
MPI::Finalize();<br />
return(0);<br />
}<br />
<br />
<br />
I then created a two-machine Condor parallel universe grid, using the same user/password combination on both machines; this user had Administrative rights. My 4-core Windows 7 Professional laptop (X86_64,WINNT61) was the '''CONDOR_HOST''' and the '''DedicatedScheduler'''. Assuming this laptop is called '''clavicle''' in the '''shoulder.com''' domain, here are the important configuration parameters in its '''condor_config.local''' file: <br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files (x86)\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD, COLLECTOR, NEGOTIATOR, SMPD_SERVER<br />
<br />
<br />
The second machine, a 2 core Windows XP desktop called '''scapula''' (INTEL,WINNT51) was a dedicated machine with the following important parameters in its '''condor_config.local''' file:<br />
<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, STARTD, SMPD_SERVER<br />
<br />
Note how it points to '''clavicle''' both as '''CONDOR_HOST''' and as the '''DedicatedScheduler''', and that it does not have '''(x86)''' in its path to '''smpd.exe'''.<br />
<br />
MPICH2 gets installed to different locations, to '''C:\Program Files\MPICH2''' for an (INTEL,WINNT51) machine, and to '''C:\Program Files (x86)\MPICH2''' for a (X86_64,WINNT61) machine. Because I needed to refer to the '''mpiexec.exe''' path in the batch file that my Condor submit file referred to, and this batch file would be the same for both machines, I opted to copy '''mpiexec.exe''' to the work directory I was using on all machines, '''C:\mpich2work'''.<br />
<br />
<br />
Based on the Condor submit file described in the Condor wiki page, I adapted my own submit file for the MPI executable above (called '''mpiwork.exe''') as follows:<br />
<br />
<br />
universe = parallel<br />
executable = mp2script.bat<br />
arguments = mpiwork.exe<br />
machine_count = 6<br />
output = out.$(NODE).log<br />
error = error.$(NODE).log<br />
log = work.$(NODE).log<br />
should_transfer_files = yes<br />
when_to_transfer_output = on_exit<br />
transfer_input_files = mpiwork.exe<br />
Requirements = ((Arch == "INTEL") && (OpSys == "WINNT51")) || ((Arch == "X86_64") && (OpSys == "WINNT61"))<br />
queue<br />
<br />
and my '''mp2script.bat''' file was adapted to be: <br />
<br />
set _CONDOR_PROCNO=%_CONDOR_PROCNO%<br />
set _CONDOR_NPROCS=%_CONDOR_NPROCS%<br />
REM If not the head node, just sleep forever<br />
if not [%_CONDOR_PROCNO%] == [0] copy con nothing<br />
REM Set this to the bin directory of MPICH installation<br />
set MPDIR="C:\mpich2work"<br />
REM run the actual mpijob<br />
%MPDIR%\mpiexec.exe -n %_CONDOR_NPROCS% -p 6666 %*<br />
exit 0<br />
<br />
with the change being that '''MPDIR''' is a location known to contain '''mpiexec.exe''' on both machines, since I copied the file there on both machines as described above.<br />
<br />
A point of confusion: I would have expected the '''mpiwork.exe''' code above to print the names of both '''clavicle''' and '''scapula''' when I ran it on 6 nodes; instead only '''clavicle''' appeared. I do believe the job was running on both machines and all 6 cores, as both machines' IP addresses were noted in the work.parallel.log, and both machines had CPU activity at the time the executable ran.</div>
<hr />
<div><br />
<br />
* Manager: There can be only one central manager for your pool. This machine is the collector of information and the negotiator between resources and resource requests.<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer to the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
''' The official instructions on how to install Condor in Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here] '''. Below we present some of the tweaks we had to make to get it working on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.). If it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
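You can verify the change programmatically as well; the short check below (Python chosen just for illustration) compares the short host name with the fully qualified one the resolver reports:<br />

```python
import socket

# Short host name, as returned by hostname(1)
print(socket.gethostname())
# Fully qualified name; should print mymachine.mydomain.com after the edit
print(socket.getfqdn())
```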
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a similar command to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or with a user having equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (For example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation scripts ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update PATH variable to include the directory ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save file and apply the change by running <br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out / log in, or even restart the machine, and check that the CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit the condor manager configuration file and update the lines as referenced below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
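A sketch of applying edits like those above non-interactively (the ''set_param'' helper is hypothetical; it is demonstrated on a scratch file here, so point CONFIG at your real /root/condor/etc/condor_config when applying it for real):<br />

```shell
# Demonstrated on a scratch copy of a config file.
CONFIG=$(mktemp)
printf 'RELEASE_DIR = /usr/local/condor\nCONDOR_ADMIN = root@localhost\n' > "$CONFIG"

set_param() {  # set_param KEY VALUE FILE -- rewrite "KEY = ..." lines in place
  sed -i "s|^$1 *=.*|$1 = $2|" "$3"
}

set_param RELEASE_DIR /root/condor "$CONFIG"
set_param CONDOR_ADMIN email@website.com "$CONFIG"
cat "$CONFIG"
```

Back up the real condor_config before running anything like this against it.<br />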
<br />
* If you have MIDAS integration, create a link to /root/condor/etc/condor_config in /home/condor so that Midas can run condor commands<br />
cd /home/condor<br />
ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring an Executer/Submitter in Unix ===<br />
The files that allow the server to also be used as a condor submitter/executer were automatically updated when the installation script ''condor_install'' ran. Nevertheless, you still need to update its local configuration file.<br />
<br />
* Edit the condor node's condor_config.local and update the line as referenced below:<br />
vi /home/condor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the lines containing ''UID_DOMAIN'' and ''FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor in Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor in Windows 7.<br />
<br />
# Download the Windows install MSI, run it, installing to "C:/condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager ( full address ).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For accounting domain enter your domain (e.g. yourdomaininternal.com)<br />
# For Email settings, I skipped this by clicking next<br />
# For Java settings, I skipped this by clicking next, as we weren't using Java<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access: $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked whether you want a custom install or a standard install, choose the standard install. This will install condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When running condor commands, start a Cygwin or cmd prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix some problems. At first, '''condor_status''' gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect: since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or perhaps just an IP that you would not want.<br />
<br />
<br />
I shut down condor, right clicked on C:/condor in a Windows explorer, turned off "read only" and set permissions to allow for writing. Then I edited the file "c:/condor/condor_config.local" which started out empty, so that it could pick up some replacement values (some of them didn't seem to be set properly during the install). These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following:<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourinternaldomain.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
and then restarted condor. If you run "condor_status" and see that the Windows machine has Activity=OWNER rather than UNCLAIMED, be sure that you have added START=True. But this may not be the best configuration for a Windows workstation that is in use: some additional configuration is needed to make sure a Condor job doesn't use the machine while a physical human user is there using it.<br />
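For a workstation that is in active use, a plausible starting point (illustrative thresholds, loosely modeled on Condor's default desktop policy using the KeyboardIdle and LoadAvg attributes) is to start jobs only when the keyboard has been idle and the load is low:<br />

```
# start jobs only after 15 minutes without keyboard activity and with low load
START    = (KeyboardIdle > 15 * 60) && (LoadAvg < 0.3)
# suspend a running job as soon as the user comes back
SUSPEND  = (KeyboardIdle < 60)
CONTINUE = (KeyboardIdle > 5 * 60)
```

These lines would go in condor_config.local in place of the unconditional START = True.<br />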
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here is the content of the batch file for the actual job (printname.bat)<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, get a command prompt or Cygwin terminal, right click the icon to start it up, and click "run with elevated privileges" or "run as administrator".<br />
<br />
condor_master runs as a service on Windows, which controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows for the first time, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* Start the condor master <br />
condor_master<br />
<br />
* Assuming that during installation you set the type to ''manager,execute,submit'' (the default), run the following command<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after you started condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Set up condor to start automatically at boot <br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update the MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add the ''condor.boot'' service to all runlevels<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
#include <unistd.h><br />
#include <stdio.h><br />
int main( int argc, char** argv )<br />
{<br />
printf( "%s\n", argc > 1 ? argv[1] : "(no argument)" );<br />
fflush( stdout );<br />
sleep( 30 );<br />
return 0; <br />
}<br />
<br />
This exe will print the command line argument it is given, wait 30 seconds, then exit.<br />
<br />
Save this file as foo.c, then compile it with (note the static linking):<br />
<br />
gcc foo.c -o foo --static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once so you can see how Condor will execute the job on multiple execute resources), you can <br />
change the condorjob file to be like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
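An equivalent, more compact submit file uses Condor's $(Process) macro, which expands to 0, 1, ... for each queued job, so each job gets its own output files:<br />

```
universe = vanilla
executable = foo
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
log = condorjob$(Process).log
error = condorjob$(Process).err
output = condorjob$(Process).out
arguments = "helloworld$(Process)"
Queue 2
```

''Queue 2'' queues two jobs in one cluster rather than repeating the output/error/log/arguments lines for each job.<br />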
<br />
We had a case where we had 6 slots: 2 were 32 bit with Arch=INTEL, 4 were 64 bit with Arch=X86_64 (but we were unaware of the difference at first). We ran 6 jobs and then wondered why they would only execute on the submitting machine. So we changed the condorjob file to specify a certain architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told condor to execute only on machines with an Intel architecture, so it did not attempt to execute on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. I've included this story in case it helps with debugging.<br />
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was ''condor''.<br />
*When building BatchMake, you need to build with grid support turned on<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows you which of the processes you have submitted to condor are active. It will give you a cluster ID and process ID for each process.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process will start and stop all other condor daemons. condor_master will write a MasterLog and have a .master_address file. Be sure that the .master_address contains the correct IP address. If the correct IP address doesn't appear, set the correct value in the parameter <br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has, but condor may have picked a different IP than you would have wanted, perhaps the local loopback address 127.0.0.1 or a wireless instead of ethernet adapter.<br />
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes, and holds data about submitted jobs. It tracks the queueing of jobs and tries to get resources for all of its jobs to be run. When a submitted job is run, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine when an actual execution of the job is run. It will take care of system calls that need to be executed on the submitting machine for a process. There will be a condor_shadow process for each executing process of a submission machine, meaning that on a machine with a large number of submitted processes, the number of shadow daemons supported by memory or other resources could be a limitation.<br />
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution. Log files of interest include NegotiatorLog and MatchLog<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon is used to detect user activity on an execute node, so it knows whether to allow execution of a job or to disallow it because a human user is currently engaged in some task.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, corresponding to [http://en.wikipedia.org/wiki/IA-64 IA64], the 64-bit Intel Itanium architecture. This does not cover all 64-bit Intel processors. <br />
<br />
While trying to run ''condor_master'', the shell returned the error message ''cannot execute binary file''<br />
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], it's possible to extract the header of an executable and determine whether it can run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the different outputs, we can see that the '''Intel IA-64''' architecture isn't the right one.<br />
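The comparison above can be shortened to a small helper that prints the ELF Machine field next to the host architecture reported by uname (the ''check_arch'' name is hypothetical; assumes GNU binutils is installed):<br />

```shell
# Print the ELF "Machine" field of a binary next to the host architecture.
check_arch() {
  readelf -h "$1" | grep 'Machine:'
  echo "  host: $(uname -m)"
}

check_arch /bin/ls
```

If the Machine field and the host architecture disagree (as with IA-64 binaries on an x86_64 host), the binary will fail with ''cannot execute binary file''.<br />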
<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the location of the MPICH2\bin directory to your system path, and add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication password for smpd; this is the '''mypassphrase''' you entered above. If all is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processors since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines, and how many processes each machine should run, and the executable '''hostname''', which should return the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, you must right click on the project file (not the solution file), then click properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then to run your exe on two different machines (I had to build a RELEASE for this, at first I had built a DEBUG build, but this caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
<br />
== MPICH2 and Condor on Windows ==<br />
<br />
To configure MPI over Condor, I set my Windows 7 laptop up as a dedicated Condor scheduler and executer, following the instructions at this link:<br />
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
I turned off smpd.exe as a service, since Condor will manage it:<br />
<br />
smpd.exe -remove<br />
<br />
<br />
I then created the following MPI example program, and ran this with a MachineCount of 2 and 4 at different times (I ran this on my laptop, which has 4 cores, and had Condor running locally) . This program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.<br />
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char *argv[])<br />
{<br />
MPI::Init(argc, argv);<br />
//<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
int size = MPI::COMM_WORLD.Get_size();<br />
char processor_name[MPI_MAX_PROCESSOR_NAME];<br />
int namelen;<br />
MPI::Get_processor_name(processor_name,namelen);<br />
// <br />
int buf[1];<br />
int numElements = 1;<br />
// initialize buffer with rank<br />
buf[0] = rank;<br />
// example of synchronous communication<br />
// have even ranks send to odd ranks first<br />
// assumes an even number of processes<br />
int syncTag = 123;<br />
if (rank % 2 == 0)<br />
{<br />
// send to next higher rank<br />
int dest = rank + 1;<br />
std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Send(buf,numElements,MPI::INT,dest,syncTag);<br />
// now wait for dest to send back<br />
// even though dest is the param, in the Recv call it is used as the source<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,dest,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
}<br />
else<br />
{<br />
// source is next lower rank<br />
int source = rank - 1;<br />
std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,source,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
// send rank rather than buffer, just to be sure<br />
// even though param is source, in Send call it is for dest<br />
MPI::COMM_WORLD.Send(&rank,numElements,MPI::INT,source,syncTag);<br />
}<br />
// now for asynchronous communication<br />
std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
int asyncTag = 321;<br />
int leftRank;<br />
if (rank == 0)<br />
{<br />
leftRank = size-1;<br />
}<br />
else <br />
{<br />
leftRank = (rank-1);<br />
}<br />
int rightRank;<br />
if (rank == size-1)<br />
{<br />
rightRank = 0;<br />
}<br />
else <br />
{<br />
rightRank = (rank+1);<br />
}<br />
// everyone sends to the leftRank, and receives from the rightRank<br />
// if this were synchronous, would be in danger of deadlocking because of circular waits<br />
// reset buffers to rank<br />
buf[0] = rank;<br />
std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::Request sendReq = MPI::COMM_WORLD.Isend(buf,numElements,MPI::INT,leftRank,asyncTag);<br />
MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf,numElements,MPI::INT,rightRank,asyncTag);<br />
// wait on the receipt<br />
MPI::Status status;<br />
recvReq.Wait(status);<br />
std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
//<br />
MPI::Finalize();<br />
return(0);<br />
}<br />
<br />
<br />
I then created a two machine Condor parallel universe grid. My 4 core Windows 7 Professional laptop (X86_64,WINNT61) was the '''CONDOR_HOST''' and the '''DedicatedScheduler'''. Assuming this laptop is called '''clavicle''' in the '''shoulder.com''' domain, here are the important configuration parameters in its '''condor_config.local''' file: <br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files (x86)\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD, COLLECTOR, NEGOTIATOR, SMPD_SERVER<br />
<br />
<br />
The second machine, a 2 core Windows XP desktop called '''scapula''' (INTEL,WINNT51) was a dedicated machine with the following important parameters in its '''condor_config.local''' file:<br />
<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, STARTD, SMPD_SERVER<br />
<br />
Note how it points to '''clavicle''' both as '''CONDOR_HOST''' and as the '''DedicatedScheduler''', and that it does not have '''(x86)''' in its path to '''smpd.exe'''.<br />
<br />
MPICH2 gets installed to different locations, to '''C:\Program Files\MPICH2''' for an (INTEL,WINNT51) machine, and to '''C:\Program Files (x86)\MPICH2''' for a (X86_64,WINNT61) machine. Because I needed to refer to the '''mpiexec.exe''' path in the batch file that my Condor submit file referred to, and this batch file would be the same for both machines, I opted to copy '''mpiexec.exe''' to the work directory I was using on all machines, '''C:\mpich2work'''.<br />
<br />
<br />
Based on the Condor submit file described in the Condor wiki page, I had to adapt my own submit file for my above MPI executable called '''mpiwork.exe''' to be the following:<br />
<br />
<br />
universe = parallel<br />
executable = mp2script.bat<br />
arguments = mpiwork.exe<br />
machine_count = 6<br />
output = out.$(NODE).log<br />
error = error.$(NODE).log<br />
log = work.$(NODE).log<br />
should_transfer_files = yes<br />
when_to_transfer_output = on_exit<br />
transfer_input_files = mpiwork.exe<br />
Requirements = ((Arch == "INTEL") && (OpSys == "WINNT51")) || ((Arch == "X86_64") && (OpSys == "WINNT61"))<br />
queue<br />
<br />
and my '''mp2script.bat''' file was adapted to be: <br />
<br />
set _CONDOR_PROCNO=%_CONDOR_PROCNO%<br />
set _CONDOR_NPROCS=%_CONDOR_NPROCS%<br />
REM If not the head node, just sleep forever<br />
if not [%_CONDOR_PROCNO%] == [0] copy con nothing<br />
REM Set this to the bin directory of MPICH installation<br />
set MPDIR="C:\mpich2work"<br />
REM run the actual mpijob<br />
%MPDIR%\mpiexec.exe -n %_CONDOR_NPROCS% -p 6666 %*<br />
exit 0<br />
<br />
with the change being that '''MPDIR''' is a location that is known to have '''mpiexec.exe''' on both machines, since I copied that file there on both machines as described above.<br />
<br />
A point of confusion I have is that I would have expected the '''mpiwork.exe''' code above to have the names of '''clavicle''' and '''scapula''' in the output when I ran it on 6 nodes. Instead only the name '''clavicle''' appeared. I do believe that this was running on both machines and all 6 cores, as both machines' IP addresses were noted in the work.parallel.log, and both machines had CPU activity at the time the executable ran.</div>
<hr />
<div><br />
* Manager: There can be only one central manager for your pool. The machine is the collector of information, and the negotiator between resources and resource requests<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
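The role a machine plays is largely determined by which daemons its '''DAEMON_LIST''' starts. As a rough sketch (adjust to your own pool), the three roles above might map to configuration values like this:<br />

```
# Central manager (collector + negotiator), also able to submit and execute:
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD

# Execute-only node:
DAEMON_LIST = MASTER, STARTD

# Submit-only node:
DAEMON_LIST = MASTER, SCHEDD
```

These combinations match the DAEMON_LIST values used in the Windows configuration examples elsewhere on this page.<br />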
<br />
For more information regarding other required preparatory work, refer the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
''' The official instructions on how to install Condor in Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here] '''. Below we present some of the tweaks we had to do to get Condor to work on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.). If it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a similar command to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or with a user having equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (For example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation scripts ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update PATH variable to include the directory ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save file and apply the change by running <br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out / log in or even restart the machine, and you should be able to check that the CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit the Condor manager configuration file and update the lines as shown below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, in order to allow Midas to run condor commands, create a link to /root/condor/etc/condor_config in /home/condor<br />
 cd /home/condor<br />
 ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring an Executer/Submitter in Unix ===<br />
The files that allow the server to also be used as a Condor submitter/executer were automatically updated when the installation script ''condor_install'' ran. Nevertheless, you still need to update its configuration file.<br />
<br />
* Edit the node's condor_config.local file and update the line as shown below:<br />
vi /home/condor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the line having ''UID_DOMAIN'' and '' FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor in Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor in Windows 7.<br />
<br />
# Download the Windows install MSI, run it, installing to "C:/condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager ( full address ).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For accounting domain enter your domain (e.g. yourdomaininternal.com)<br />
# For Email settings (I ignored this by clicking next)<br />
# for Java settings (I ignored this as we weren't using Java, by clicking next)<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked whether you want a custom or standard install, choose the standard install. This will install Condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When running condor commands, start a Cygwin or cmd prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix up some problems. My '''condor_status''' at first gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect: since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. (Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or perhaps just an IP that you would not want.)<br />
<br />
<br />
I shut down condor, right clicked on C:/condor in a Windows explorer, turned off "read only" and set permissions to allow for writing. Then I edited the file "c:/condor/condor_config.local" which started out empty, so that it could pick up some replacement values (some of them didn't seem to be set properly during the install). These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following:<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourinternaldomain.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
Then restart Condor. If you run "condor_status" and see that the Windows machine has Activity=OWNER rather than UNCLAIMED, be sure that you have added START=True. But this may not be the best configuration for a Windows workstation that is in use; some additional configuration is needed to make sure a Condor job doesn't use the machine when a physical human user is there using it.<br />
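One possible sketch of such a policy uses the standard '''KeyboardIdle''' startd attribute (the thresholds below are arbitrary assumptions, not tested values, and on Windows the '''condor_kbdd''' daemon must be running for KeyboardIdle to be populated):<br />

```
# Start jobs only after 15 minutes without keyboard/mouse activity
START    = KeyboardIdle > (15 * 60)
# Suspend a running job as soon as the user comes back
SUSPEND  = KeyboardIdle < 60
# Resume once the machine has been idle again for 15 minutes
CONTINUE = KeyboardIdle > (15 * 60)
```
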
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here is the batch file contents for the actual job (printname.bat)<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, get a command prompt or Cygwin terminal, right click the icon to start it up, and click "run with elevated privileges" or "run as administrator".<br />
<br />
condor_master runs as a service on Windows, which controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* run the condor manager <br />
condor_master<br />
<br />
* Assuming that during installation you set up the type as ''manager,execute,submit'' (the default), run the following command<br />
 ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after you started condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Set up Condor to start automatically at boot <br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add the ''condor.boot'' service to all runlevels<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
 #include <unistd.h><br />
 #include <stdio.h><br />
 int main( int argc, char** argv )<br />
 {<br />
 /* echo the first command line argument, guarding against it being absent */<br />
 printf( "%s\n", argc > 1 ? argv[1] : "(no argument)" );<br />
 fflush( stdout );<br />
 sleep( 30 );<br />
 return 0; <br />
 }<br />
<br />
This exe will repeat the command line argument it is given, wait 30 seconds, then exit.<br />
<br />
Save this file as foo.c, then compile it with the following command (note the static linking).<br />
<br />
gcc foo.c -o foo --static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once so you can see how Condor will execute the job on multiple execute resources), you can <br />
change the condorjob file to be like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
<br />
We had a case where we had 6 slots, 2 were 32 bit with Arch=INTEL, 4 were 64 bit with Arch=X86_64 (but we were unaware of the bit difference at first). We ran 6 jobs and then were wondering why they would only execute on the submitting machine. So we changed the condorjob file to specify a certain architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told condor to execute only on machines with Architecture of Intel, so Condor did not attempt to execute it on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. I've included this story in case it helps with debugging.<br />
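One way to avoid this kind of mismatch is to queue a separately compiled binary for each architecture from a single submit file. A sketch (the file names '''foo_intel''' and '''foo_x86_64''' are hypothetical; commands can be reset between ''Queue'' statements just as the log/output names were in the two-job example above):<br />

```
universe = vanilla
should_transfer_files = YES
when_to_transfer_output = ON_EXIT

executable = foo_intel
Requirements = Arch == "INTEL"
Queue

executable = foo_x86_64
Requirements = Arch == "X86_64"
Queue
```
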
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines on the Condor grid, because we do not have shared Unix users on our network. In our case this user was ''condor''.<br />
*When building BatchMake, you need to build with grid support on<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows you which of the jobs you have submitted to Condor are active. It will give you a cluster ID and process ID for each job.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process will start and stop all other condor daemons. condor_master will write a MasterLog and have a .master_address file. Be sure that the .master_address contains the correct IP address. If the correct IP address doesn't appear, set the correct value in the parameter <br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has, but Condor may have picked a different IP than you wanted, perhaps the local loopback address 127.0.0.1, or a wireless adapter instead of the ethernet one.<br />
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes, and represents data about submitted jobs. It tracks the queueing of jobs and tries to get resources for all of its jobs to be run. When a submitted job runs, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine when an actual execution of the job is run. It takes care of system calls that need to be executed on the submitting machine for a process. There will be a condor_shadow process for each executing process of a submission machine, meaning that on a machine with a large number of submitted processes, the number of shadow daemons supported by memory or other resources could be a limitation.<br />
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution. Log files of interest include NegotiatorLog and MatchLog<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon is used to detect user activity on an execute node, so Condor knows whether to allow execution of a job or to disallow it because a human user is currently engaged in some task.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, corresponding to [http://en.wikipedia.org/wiki/IA-64 IA64], i.e. the 64-bit Intel Itanium processor. IA-64 is not the same as x86-64 and does not cover ordinary 64-bit Intel/AMD processors. <br />
<br />
While trying to run the ''condor_master'', the shell returned the following error message ''cannot execute binary file''<br />
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], it's possible to read the header of an executable and determine whether it could run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-6'''4<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the outputs, it's clear that the '''Intel IA-64''' package isn't the right one: the machine's own /bin/ls is '''X86-64''', so the X86-64 Condor package is the one that matches.<br />
<br />
<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the location of the MPICH2\bin directory to your system path, and add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication password for smpd; this is the '''mypassphrase''' you entered above. If all is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processors since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines, and how many processes each machine should run, and the executable '''hostname''', which should return the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, you must right click on the project file (not the solution file), then click properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then, to run your exe on two different machines (I had to build a RELEASE build for this; at first I had built a DEBUG build, which caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
<br />
== MPICH2 and Condor on Windows ==<br />
<br />
To configure MPI over Condor, I set up my Windows 7 laptop as a dedicated Condor scheduler and executor, following the instructions at this link:<br />
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
I turned off smpd.exe as a service, since Condor will manage it:<br />
<br />
smpd.exe -remove<br />
<br />
<br />
I then created the following MPI example program and ran it with a MachineCount of 2 and 4 at different times (I ran this on my laptop, which has 4 cores, with Condor running locally). This program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.<br />
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char *argv[])<br />
{<br />
MPI::Init(argc, argv);<br />
//<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
int size = MPI::COMM_WORLD.Get_size();<br />
char processor_name[MPI_MAX_PROCESSOR_NAME];<br />
int namelen;<br />
MPI::Get_processor_name(processor_name,namelen);<br />
// <br />
int buf[1];<br />
int numElements = 1;<br />
// initialize buffer with rank<br />
buf[0] = rank;<br />
// example of synchronous communication<br />
// have even ranks send to odd ranks first<br />
// assumes an even number of processes<br />
int syncTag = 123;<br />
if (rank % 2 == 0)<br />
{<br />
// send to next higher rank<br />
int dest = rank + 1;<br />
std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Send(buf,numElements,MPI::INT,dest,syncTag);<br />
// now wait for dest to send back<br />
// even though dest is the param, in the Recv call it is used as the source<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,dest,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
}<br />
else<br />
{<br />
// source is next lower rank<br />
int source = rank - 1;<br />
std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,source,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
// send rank rather than buffer, just to be sure<br />
// even though param is source, in Send call it is for dest<br />
MPI::COMM_WORLD.Send(&rank,numElements,MPI::INT,source,syncTag);<br />
}<br />
// now for asynchronous communication<br />
std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
int asyncTag = 321;<br />
int leftRank;<br />
if (rank == 0)<br />
{<br />
leftRank = size-1;<br />
}<br />
else <br />
{<br />
leftRank = (rank-1);<br />
}<br />
int rightRank;<br />
if (rank == size-1)<br />
{<br />
rightRank = 0;<br />
}<br />
else <br />
{<br />
rightRank = (rank+1);<br />
}<br />
// everyone sends to the leftRank, and receives from the rightRank<br />
// if this were synchronous, would be in danger of deadlocking because of circular waits<br />
// reset buffers to rank<br />
buf[0] = rank;<br />
std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::Request sendReq = MPI::COMM_WORLD.Isend(buf,numElements,MPI::INT,leftRank,asyncTag);<br />
MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf,numElements,MPI::INT,rightRank,asyncTag);<br />
// wait on the receipt<br />
MPI::Status status;<br />
recvReq.Wait(status);<br />
std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
//<br />
MPI::Finalize();<br />
return(0);<br />
}<br />
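The left/right neighbor bookkeeping in the ring above is just wrap-around arithmetic; this small plain-Python sketch (no MPI needed) shows the if/else form and the modulo form agree:<br />

```python
def neighbors(rank, size):
    # Ring neighbors exactly as computed in the MPI example above.
    left = size - 1 if rank == 0 else rank - 1
    right = 0 if rank == size - 1 else rank + 1
    return left, right

def neighbors_mod(rank, size):
    # Equivalent, more compact form using modular arithmetic.
    return (rank - 1) % size, (rank + 1) % size
```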
<br />
<br />
I then created a two-machine Condor parallel universe grid. My 4-core Windows 7 Professional laptop (X86_64,WINNT61) was the '''CONDOR_HOST''' and the '''DedicatedScheduler'''. Assuming this laptop is called '''clavicle''' in the '''shoulder.com''' domain, here are the important configuration parameters in its '''condor_config.local''' file:<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files (x86)\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD, COLLECTOR, NEGOTIATOR, SMPD_SERVER<br />
<br />
<br />
The second machine, a 2 core Windows XP desktop called '''scapula''' (INTEL,WINNT51) was a dedicated machine with the following important parameters in its '''condor_config.local''' file:<br />
<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, STARTD, SMPD_SERVER<br />
<br />
Note how it points to '''clavicle''' both as '''CONDOR_HOST''' and as the '''DedicatedScheduler''', and that it does not have '''(x86)''' in its path to '''smpd.exe'''.<br />
<br />
MPICH2 installs to different locations: '''C:\Program Files\MPICH2''' on an (INTEL,WINNT51) machine and '''C:\Program Files (x86)\MPICH2''' on a (X86_64,WINNT61) machine. Because the batch file that my Condor submit file refers to needs the '''mpiexec.exe''' path, and that batch file is the same on both machines, I opted to copy '''mpiexec.exe''' to the work directory I was using on all machines, '''C:\mpich2work'''.<br />
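A tiny sketch of the path difference just described, keyed on Condor's (Arch, OpSys) labels for the two machines (only the two platforms I actually used are covered):<br />

```python
def default_mpich2_dir(arch, opsys):
    # 32-bit MPICH2 lands under "Program Files (x86)" on 64-bit
    # Windows 7, and under plain "Program Files" on 32-bit XP.
    if (arch, opsys) == ("X86_64", "WINNT61"):
        return r"C:\Program Files (x86)\MPICH2"
    if (arch, opsys) == ("INTEL", "WINNT51"):
        return r"C:\Program Files\MPICH2"
    raise ValueError("platform not covered in this write-up")
```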
<br />
<br />
Based on the Condor submit file described in the Condor wiki page, I adapted my own submit file for my MPI executable above, called '''mpiwork.exe''', to the following:<br />
<br />
<br />
universe = parallel<br />
executable = mp2script.bat<br />
arguments = mpiwork.exe<br />
machine_count = 6<br />
output = out.$(NODE).log<br />
error = error.$(NODE).log<br />
log = work.$(NODE).log<br />
should_transfer_files = yes<br />
when_to_transfer_output = on_exit<br />
transfer_input_files = mpiwork.exe<br />
Requirements = ((Arch == "INTEL") && (OpSys == "WINNT51")) || ((Arch == "X86_64") && (OpSys == "WINNT61"))<br />
queue<br />
<br />
and my '''mp2script.bat''' file was adapted to be: <br />
<br />
set _CONDOR_PROCNO=%_CONDOR_PROCNO%<br />
set _CONDOR_NPROCS=%_CONDOR_NPROCS%<br />
REM If not the head node, just sleep forever<br />
if not [%_CONDOR_PROCNO%] == [0] copy con nothing<br />
REM Set this to the bin directory of MPICH installation<br />
set MPDIR="C:\mpich2work"<br />
REM run the actual mpijob<br />
%MPDIR%\mpiexec.exe -n %_CONDOR_NPROCS% -p 6666 %*<br />
exit 0<br />
<br />
with the change being that '''MPDIR''' is a location that is known to contain '''mpiexec.exe''' on both machines, since I copied that file there on both machines as described above.</div>
<hr />
<div>Condor can be installed as a ''manager'' node, an ''execute'' node, a ''submit'' node, or any combination of these. See [http://www.cs.wisc.edu/condor/manual/v7.2/3_1Introduction.html#SECTION00411000000000000000 The Different Roles a Machine Can Play]<br />
<br />
* Manager: There can be only one central manager for your pool. The machine is the collector of information, and the negotiator between resources and resource request<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
'''The official instructions on how to install Condor in Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here]'''. Below we present some of the tweaks we had to do to get it to work on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.); if it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
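The check above boils down to whether the name reported by hostname carries a domain part; a throwaway sketch of that test (hypothetical helper, not a Condor tool):<br />

```python
def is_fqdn(name):
    # "mymachine.mydomain.com" -> True, bare "mymachine" -> False
    labels = name.strip(".").split(".")
    return len(labels) >= 2 and all(labels)
```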
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a similar command to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or as a user with equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (For example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation scripts ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update PATH variable to include the directory ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save file and apply the change by running <br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out and log back in, or even restart the machine, and check that the CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit the condor manager config file and update the lines referenced below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, in order to allow Midas to run condor commands, create a link to /root/condor/etc/condor_config in /home/condor<br />
cd /home/condor<br />
ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring a Executer/Submitter in Unix ===<br />
The files allowing the server to also be used as a condor submitter/executer were automatically updated while running the installation script ''condor_install''. Nevertheless, you still need to update its configuration file.<br />
<br />
* Edit the condor node config_file.local and update the line referenced below:<br />
vi /home/condor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the lines containing ''UID_DOMAIN'' and ''FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor in Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor in Windows 7.<br />
<br />
# Download the Windows install MSI, run it, installing to "C:/condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager ( full address ).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For accounting domain enter your domain (e.g. yourdomaininternal.com)<br />
# For Email settings, I skipped this by clicking next<br />
# For Java settings, I skipped this by clicking next (we weren't using Java)<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked whether you want a custom install or a standard install, choose install. This will install condor to C:\condor.<br />
<br />
<br />
The installer will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When running condor commands, start a cygwin or cmd prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix some problems. My '''condor_status''' at first gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect: since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or simply an IP that you do not want.<br />
<br />
<br />
I shut down condor, right-clicked on C:/condor in Windows Explorer, turned off "read only", and set permissions to allow writing. Then I edited the file "c:/condor/condor_config.local", which started out empty, so that it could pick up some replacement values (some of them didn't seem to be set properly during the install). These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following:<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourinternaldomain.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
and then restarted condor. If you run "condor_status" and see that the Windows machine has Activity=OWNER rather than UNCLAIMED, be sure that you have added START=True. This may not be the best configuration for a Windows workstation that is in use, though; some additional configuration is probably needed to make sure a Condor job doesn't use the machine while a physical human user is there using it.<br />
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here are the contents of the batch file for the actual job (printname.bat):<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, get a command prompt or Cygwin terminal, right click the icon to start it up, and click "run with elevated privileges" or "run as administrator".<br />
<br />
condor_master runs as a service on Windows, which controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows for the first time, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* run the condor manager <br />
condor_master<br />
<br />
* Assuming that during installation you set the type to ''manager,execute,submit'' (the default), run the following command<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after you started condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Set up condor to start automatically at boot<br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add the ''condor.boot'' service to all runlevels<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
#include <unistd.h><br />
#include <stdio.h><br />
int main( int argc, char** argv )<br />
{<br />
printf( "%s\n", argv[1] );<br />
fflush( stdout );<br />
sleep( 30 );<br />
return 0; <br />
}<br />
<br />
This exe will echo the command-line argument it is given, wait 30 seconds, then exit.<br />
<br />
Save this file as foo.c, then compile it as follows (note the static linking):<br />
<br />
gcc foo.c -o foo --static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once so you can see how Condor will execute the job on multiple execute resources), you can <br />
change the condorjob file to be like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
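Each '''Queue''' statement snapshots whatever settings are in effect at that point, which is why the second job above picks up the overridden log/error/output/arguments values. A toy model of that accumulate-and-snapshot behavior (hypothetical parser, nothing like Condor's real one):<br />

```python
def parse_submit(text):
    # "key = value" lines update the current settings; each "queue"
    # line snapshots them as one job description.
    settings, jobs = {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.lower() == "queue":
            jobs.append(dict(settings))
        elif "=" in line:
            key, _, value = line.partition("=")
            settings[key.strip()] = value.strip()
    return jobs
```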
<br />
We had a case where we had 6 slots, 2 were 32 bit with Arch=INTEL, 4 were 64 bit with Arch=X86_64 (but we were unaware of the bit difference at first). We ran 6 jobs and then were wondering why they would only execute on the submitting machine. So we changed the condorjob file to specify a certain architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told condor to execute only on machines with an Intel architecture, so the job was never attempted on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. I've included this story in case it helps with debugging.<br />
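The underlying matchmaking idea is that a job only runs on slots whose ClassAd attributes satisfy its Requirements expression; a greatly simplified sketch (real ClassAd evaluation supports full boolean expressions, not just equality):<br />

```python
def matching_slots(slots, requirements):
    # Keep slots whose attributes equal every (attr, value) pair in
    # requirements -- a toy stand-in for ClassAd matchmaking.
    return [s for s in slots
            if all(s.get(k) == v for k, v in requirements.items())]

slots = [
    {"Name": "slot1", "Arch": "INTEL", "OpSys": "WINNT51"},
    {"Name": "slot2", "Arch": "X86_64", "OpSys": "WINNT61"},
]
```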
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was ''condor''.<br />
*When building BatchMake, you need to build with grid support on<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows you the jobs you have submitted to condor that are still active. It will give you a cluster ID and process ID for each job.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process will start and stop all other condor daemons. condor_master will write a MasterLog and have a .master_address file. Be sure that the .master_address contains the correct IP address. If the correct IP address doesn't appear, set the correct value in the parameter <br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has, but condor may have picked a different IP than you would have wanted, perhaps the local loopback address 127.0.0.1 or a wireless instead of ethernet adapter.<br />
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes and holds data about submitted jobs. It tracks the queueing of jobs and tries to obtain resources so that all of its jobs can run. When a submitted job is run, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine whenever an actual execution of the job is under way. It takes care of system calls that need to be executed on the submitting machine for a process. There will be one condor_shadow process for each executing process of a submission machine, meaning that on a machine with a large number of submitted processes, the number of shadow daemons supportable by memory or other resources could become a limitation.<br />
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution. Log files of interest include NegotiatorLog and MatchLog<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon is used to detect user activity on an execute node, so Condor knows whether to allow execution of a job or to disallow it because a human user is currently engaged in some task.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, corresponding to [http://en.wikipedia.org/wiki/IA-64 IA64], the 64-bit Intel Itanium processor. This architecture does not include all 64-bit Intel processors.<br />
<br />
While trying to run ''condor_master'', the shell returned the error message ''cannot execute binary file''.<br />
<br />
Using the [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf] program, it is possible to extract the header of an executable and determine whether it can run on a given platform.<br />
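The field readelf reports as '''Machine''' is the e_machine value at byte offset 18 of the ELF header; a short sketch that reads it directly (little-endian headers assumed, as in the outputs below):<br />

```python
import struct

def elf_machine(header):
    # e_ident occupies bytes 0-15, e_type bytes 16-17, and
    # e_machine bytes 18-19 of an ELF header.
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    (machine,) = struct.unpack_from("<H", header, 18)
    names = {0x32: "Intel IA-64", 0x3E: "Advanced Micro Devices X86-64"}
    return names.get(machine, hex(machine))
```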
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
 Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the outputs, we can see that '''Intel IA-64''' is not the right architecture: the system itself (as /bin/ls shows) is '''Advanced Micro Devices X86-64'''.<br />
<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was ''condor''.<br />
*When building BatchMake, you need to build with grid support turned on.<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems. Our setups keep Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the location of the MPICH2\bin directory to your system path, and add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by running<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication password for smpd; this is the '''mypassphrase''' you entered above. If all is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processes since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines, and how many processes each machine should run, and the executable '''hostname''', which should return the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, you must right click on the project file (not the solution file), then click properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then, to run your exe on two different machines (I had to use a Release build for this; my initial Debug build caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
<br />
== MPICH2 and Condor on Windows ==<br />
<br />
To configure MPI over Condor, I set my Windows 7 laptop up as a dedicated Condor scheduler and executer, following the instructions at this link:<br />
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
I turned off smpd.exe as a service, since Condor will manage it:<br />
<br />
smpd.exe -remove<br />
<br />
<br />
I then created the following MPI example program, and ran it with a MachineCount of 2 and of 4 at different times (I ran this on my laptop, which has 4 cores and had Condor running locally). This program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.<br />
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char *argv[])<br />
{<br />
MPI::Init(argc, argv);<br />
//<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
int size = MPI::COMM_WORLD.Get_size();<br />
 char processor_name[MPI_MAX_PROCESSOR_NAME];<br />
 int namelen;<br />
 MPI::Get_processor_name(processor_name,namelen);<br />
// <br />
int buf[1];<br />
int numElements = 1;<br />
// initialize buffer with rank<br />
buf[0] = rank;<br />
// example of synchronous communication<br />
// have even ranks send to odd ranks first<br />
// assumes an even number of processes<br />
int syncTag = 123;<br />
if (rank % 2 == 0)<br />
{<br />
// send to next higher rank<br />
int dest = rank + 1;<br />
std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Send(buf,numElements,MPI::INT,dest,syncTag);<br />
// now wait for dest to send back<br />
// even though dest is the param, in the Recv call it is used as the source<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,dest,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
}<br />
else<br />
{<br />
// source is next lower rank<br />
int source = rank - 1;<br />
std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,source,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
// send rank rather than buffer, just to be sure<br />
// even though param is source, in Send call it is for dest<br />
MPI::COMM_WORLD.Send(&rank,numElements,MPI::INT,source,syncTag);<br />
}<br />
// now for asynchronous communication<br />
std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
int asyncTag = 321;<br />
int leftRank;<br />
if (rank == 0)<br />
{<br />
leftRank = size-1;<br />
}<br />
else <br />
{<br />
leftRank = (rank-1);<br />
}<br />
int rightRank;<br />
if (rank == size-1)<br />
{<br />
rightRank = 0;<br />
}<br />
else <br />
{<br />
rightRank = (rank+1);<br />
}<br />
// everyone sends to the leftRank, and receives from the rightRank<br />
// if this were synchronous, would be in danger of deadlocking because of circular waits<br />
// reset buffers to rank<br />
buf[0] = rank;<br />
std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::Request sendReq = MPI::COMM_WORLD.Isend(buf,numElements,MPI::INT,leftRank,asyncTag);<br />
MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf,numElements,MPI::INT,rightRank,asyncTag);<br />
// wait on the receipt<br />
MPI::Status status;<br />
recvReq.Wait(status);<br />
std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
//<br />
MPI::Finalize();<br />
return(0);<br />
}<br />
<br />
<br />
I then created a two machine Condor parallel universe grid. My 4 core Windows 7 Professional laptop (X86_64,WINNT61) was the '''CONDOR_HOST''' and the '''DedicatedScheduler'''. Assuming this laptop is called '''clavicle''' in the '''shoulder.com''' domain, here are the important configuration parameters in its '''condor_config.local''' file: <br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files (x86)\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD, COLLECTOR, NEGOTIATOR, SMPD_SERVER<br />
<br />
<br />
The second machine, a 2 core Windows XP desktop called '''scapula''' (INTEL,WINNT51) was a dedicated machine with the following important parameters in its '''condor_config.local''' file:<br />
<br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, STARTD, SMPD_SERVER<br />
<br />
Note how it points to '''clavicle''' both as '''CONDOR_HOST''' and as the '''DedicatedScheduler''', and that it does not have '''(x86)''' in its path to '''smpd.exe'''.<br />
<br />
MPICH2 gets installed to different locations: '''C:\Program Files\MPICH2''' on an (INTEL,WINNT51) machine, and '''C:\Program Files (x86)\MPICH2''' on a (X86_64,WINNT61) machine. Because I needed to refer to the '''mpiexec.exe''' path in the batch file that my Condor submit file used, and this batch file would be the same for both machines, I opted to copy '''mpiexec.exe''' to the work directory I was using on all machines, '''C:\mpich2work'''.</div>
<hr />
<div><br />
Condor can be installed as a ''manager'' node, an ''execute'' node, or a ''submit'' node, or any combination of these. See [http://www.cs.wisc.edu/condor/manual/v7.2/3_1Introduction.html#SECTION00411000000000000000 The Different Roles a Machine Can Play]<br />
<br />
* Manager: There can be only one central manager for your pool. This machine is the collector of information, and the negotiator between resources and resource requests.<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer to the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
''' The official instructions on how to install Condor on Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here] '''. Below we present some of the tweaks we had to do to get it to work on our Unix machines.<br />
<br />
=== Prerequisites ===<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.). If it returns only ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a similar command to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or with a user having equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (For example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation scripts ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update PATH variable to include the directory ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save file and apply the change by running <br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out and log back in, or even restart the machine, and you should be able to check that the CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit the Condor manager configuration file and update the lines as shown below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, in order to allow MIDAS to run Condor commands, create a link to /root/condor/etc/condor_config in /home/condor<br />
 cd /home/condor<br />
 ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring an Executer/Submitter in Unix ===<br />
The files allowing the server to also be used as a Condor submitter/executer were automatically updated while running the installation script ''condor_install''. Nevertheless, you still need to update its local configuration file.<br />
<br />
* Edit the node's condor_config.local and update the line as shown below:<br />
vi /home/condor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the line having ''UID_DOMAIN'' and '' FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor in Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor in Windows 7.<br />
<br />
# Download the Windows install MSI and run it, installing to "C:\condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager ( full address ).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For the accounting domain, enter your domain (e.g. yourdomaininternal.com)<br />
# For the Email settings, I skipped this by clicking next<br />
# For the Java settings, I skipped this by clicking next, as we weren't using Java<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked if you want a custom install or install, choose install. This will install condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When running Condor commands, start a Cygwin or cmd prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix up some problems. My '''condor_status''' at first gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect: since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or simply an IP that you would not want.<br />
<br />
<br />
I shut down condor, right clicked on C:/condor in a Windows explorer, turned off "read only" and set permissions to allow for writing. Then I edited the file "c:/condor/condor_config.local" which started out empty, so that it could pick up some replacement values (some of them didn't seem to be set properly during the install). These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following:<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourinternaldomain.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
and then restarted Condor. If you run "condor_status" and see that the Windows machine has Activity=OWNER rather than UNCLAIMED, be sure that you have added START=True. But this may not be the best configuration for a Windows workstation that is in use; some additional configuration is probably needed to make sure a Condor job doesn't use the machine while a physical human user is there using it.<br />
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here is the batch file contents for the actual job (printname.bat)<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, get a command prompt or Cygwin terminal, right click the icon to start it up, and click "run with elevated privileges" or "run as administrator".<br />
<br />
condor_master runs as a service on Windows and controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows for the first time, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* run the condor manager <br />
condor_master<br />
<br />
* Assuming that during installation you set up the type as ''manager,execute,submit'' (the default), run the following command<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after you started condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Set up Condor to start automatically at boot<br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add the ''condor.boot'' service to all runlevels<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
#include <unistd.h><br />
#include <stdio.h><br />
int main( int argc, char** argv )<br />
{<br />
printf( "%s\n", argv[1] );<br />
fflush( stdout );<br />
sleep( 30 );<br />
return 0; <br />
}<br />
<br />
This exe will echo the command line argument it is given, wait 30 seconds, then exit.<br />
<br />
Save this file as foo.c, then compile it as follows (note the static linking):<br />
<br />
gcc foo.c -o foo --static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once so you can see how Condor will execute the job on multiple execute resources), you can <br />
change the condorjob file to be like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
<br />
We had a case with 6 slots: 2 were 32-bit with Arch=INTEL, and 4 were 64-bit with Arch=X86_64 (but we were unaware of the difference at first). We ran 6 jobs and then wondered why they would only execute on the submitting machine. So we changed the condorjob file to specify a certain architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told Condor to execute only on machines with an Intel architecture, so the job was not attempted on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. I've included this story in case it helps with debugging.<br />
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was ''condor''.<br />
*When building BatchMake, you need to build with grid support turned on<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows which of the jobs you have submitted to Condor are active. It will give you a cluster ID and process ID for each job.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
The sections below describe the main Condor daemons. For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process will start and stop all other condor daemons. condor_master will write a MasterLog and have a .master_address file. Be sure that the .master_address contains the correct IP address. If the correct IP address doesn't appear, set the correct value in the parameter <br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has, but Condor may have picked a different IP than you wanted, perhaps the local loopback address 127.0.0.1 or a wireless adapter instead of the ethernet adapter.<br />
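The rule you are applying by hand when choosing NETWORK_INTERFACE is essentially "prefer a non-loopback address". A trivial Python model of that choice (an illustration only; pick_interface is an invented name, and Condor's real selection logic is more involved):<br />
<br />
```python
def pick_interface(addresses):
    """Prefer the first non-loopback IPv4 address from a candidate list."""
    for addr in addresses:
        if not addr.startswith("127."):
            return addr
    return None  # only loopback available

# e.g. a machine with loopback plus a wired address
print(pick_interface(["127.0.0.1", "10.171.1.124"]))
```
<br />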
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes, and holds data about submitted jobs. It tracks the queueing of jobs and tries to obtain resources so that all of its jobs can run. When a submitted job runs, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine while a job actually executes. It takes care of system calls that need to be executed on the submitting machine for a process. There is one condor_shadow process for each executing process of a submission machine, so on a machine with a large number of submitted processes, the memory or other resources available to support shadow daemons could become a limitation.<br />
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution. Log files of interest include NegotiatorLog and MatchLog.<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon is used to detect user activity on an execute node, so Condor knows whether to allow a job to execute or to disallow it because a human user is currently engaged in some task.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, corresponding to [http://en.wikipedia.org/wiki/IA-64 IA-64], the 64-bit Intel Itanium processor architecture. This does not cover all 64-bit Intel processors. <br />
<br />
While trying to run ''condor_master'', the shell returned the following error message: ''cannot execute binary file''.<br />
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], it's possible to extract the header of an executable and determine whether it can run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the different outputs, we can see that '''Intel IA-64''' is not the right architecture: the machine runs X86-64 binaries (like /bin/ls), so the X86-64 Condor package is the one required.<br />
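The check readelf performs here can also be scripted. The following Python sketch (an illustration, not a Condor or binutils tool; elf_machine is an invented name) reads the e_machine field straight from an ELF header, assuming a little-endian ELF file as in the readelf output above:<br />
<br />
```python
import struct

# A few e_machine codes from the ELF specification.
EM_NAMES = {3: "Intel 80386", 50: "Intel IA-64", 62: "AMD x86-64"}

def elf_machine(path):
    """Return the machine name stored in an ELF file's header."""
    with open(path, "rb") as f:
        header = f.read(20)
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # e_machine is a 16-bit field at offset 18; assume little-endian
    # data encoding (EI_DATA == 1), as shown in the headers above.
    (machine,) = struct.unpack_from("<H", header, 18)
    return EM_NAMES.get(machine, "unknown (%d)" % machine)
```
<br />
Run against the IA-64 condor_master this would report Intel IA-64, while /bin/ls on the machine above reports AMD x86-64: the mismatch behind the ''cannot execute binary file'' error.<br />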
<br />
<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the location of the MPICH2\bin directory to your system path. Add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication password for smpd; this is the '''mypassphrase''' you entered above. If all is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processors since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines, how many processes each machine should run, and the executable '''hostname''', which should print the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, you must right click on the project file (not the solution file), then click properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then, to run your exe on two different machines (I had to build a RELEASE build for this; at first I had built a DEBUG build, which caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
<br />
== MPICH2 and Condor on Windows ==<br />
<br />
To configure MPI over Condor, I set my Windows 7 laptop up as a dedicated Condor scheduler and executor, following the instructions at this link:<br />
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
I turned off smpd.exe as a service, since Condor will manage it:<br />
<br />
smpd.exe -remove<br />
<br />
<br />
I then created the following MPI example program, and ran it with a MachineCount of 2 and of 4 at different times (I ran this on my laptop, which has 4 cores, and had Condor running locally). This program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.<br />
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char *argv[])<br />
{<br />
MPI::Init(argc, argv);<br />
//<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
int size = MPI::COMM_WORLD.Get_size();<br />
char processor_name[MPI_MAX_PROCESSOR_NAME];<br />
int namelen;<br />
MPI::Get_processor_name(processor_name,namelen);<br />
// <br />
int buf[1];<br />
int numElements = 1;<br />
// initialize buffer with rank<br />
buf[0] = rank;<br />
// example of synchronous communication<br />
// have even ranks send to odd ranks first<br />
// assumes an even number of processes<br />
int syncTag = 123;<br />
if (rank % 2 == 0)<br />
{<br />
// send to next higher rank<br />
int dest = rank + 1;<br />
std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Send(buf,numElements,MPI::INT,dest,syncTag);<br />
// now wait for dest to send back<br />
// even though dest is the param, in the Recv call it is used as the source<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,dest,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
}<br />
else<br />
{<br />
// source is next lower rank<br />
int source = rank - 1;<br />
std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,source,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
// send rank rather than buffer, just to be sure<br />
// even though param is source, in Send call it is for dest<br />
MPI::COMM_WORLD.Send(&rank,numElements,MPI::INT,source,syncTag);<br />
}<br />
// now for asynchronous communication<br />
std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
int asyncTag = 321;<br />
int leftRank;<br />
if (rank == 0)<br />
{<br />
leftRank = size-1;<br />
}<br />
else <br />
{<br />
leftRank = (rank-1);<br />
}<br />
int rightRank;<br />
if (rank == size-1)<br />
{<br />
rightRank = 0;<br />
}<br />
else <br />
{<br />
rightRank = (rank+1);<br />
}<br />
// everyone sends to the leftRank, and receives from the rightRank<br />
// if this were synchronous, would be in danger of deadlocking because of circular waits<br />
// reset buffers to rank<br />
buf[0] = rank;<br />
std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::Request sendReq = MPI::COMM_WORLD.Isend(buf,numElements,MPI::INT,leftRank,asyncTag);<br />
MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf,numElements,MPI::INT,rightRank,asyncTag);<br />
// wait on the receipt<br />
MPI::Status status;<br />
recvReq.Wait(status);<br />
std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
//<br />
MPI::Finalize();<br />
return(0);<br />
}<br />
<br />
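The neighbor arithmetic and the "send left, receive from right" ring pattern above can be modeled outside MPI. This Python sketch (an illustration only; it does no real message passing, and the function names are invented) computes each rank's neighbors and simulates the exchange:<br />
<br />
```python
def neighbors(rank, size):
    # wrap around at the ends of the ring, as in the C++ code above
    left = size - 1 if rank == 0 else rank - 1
    right = 0 if rank == size - 1 else rank + 1
    return left, right

def ring_exchange(size):
    # every rank sends its own rank to its left neighbor, so after
    # the exchange each rank holds the rank of its right neighbor
    received = {}
    for rank in range(size):
        left, _ = neighbors(rank, size)
        received[left] = rank
    return received
```
<br />
With 4 ranks, rank 3 ends up holding 0, rank 0 holding 1, and so on. The reason the asynchronous version avoids the circular-wait deadlock is that each Isend returns immediately and each Irecv is posted before the Wait, so every send has a matching posted receive.<br />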
<br />
I then created a two-machine Condor parallel universe grid. My 4-core Windows 7 Professional laptop (X86_64,WINNT61) was the '''CONDOR_HOST''' and the '''DedicatedScheduler'''. Assuming this laptop is called '''clavicle''' in the '''shoulder.com''' domain, here are the important configuration parameters in its '''condor_config.local''' file: <br />
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files (x86)\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD, COLLECTOR, NEGOTIATOR, SMPD_SERVER</div>Michael.grauerhttps://public.kitware.com/Wiki/index.php?title=Proposals:Condor&diff=40155Proposals:Condor2011-05-26T20:57:18Z<p>Michael.grauer: /* MPICH2 and Condor on Windows */</p>
<hr />
<div>= Introduction =<br />
<br />
Condor is an open source distributed computing software framework. It can be used to manage workload on a dedicated cluster of computers, and/or to farm out work to idle desktop computers. Condor is a cross-platform system that runs on Unix and Windows operating systems. Condor is a complex and flexible system that can execute jobs in serial and parallel mode. For parallel jobs, it supports the MPI standard. This wiki page documents our working experience using Condor.<br />
<br />
= Downloading Condor =<br />
<br />
Different versions of Condor can be downloaded from [http://www.cs.wisc.edu/condor/downloads-v2/download.pl here]. This documentation focuses on our experience installing/configuring Condor version 7.2.0. The official detailed documentation for this version can be found [http://www.cs.wisc.edu/condor/manual/v7.2/ here].<br />
<br />
= Preparation =<br />
<br />
As Condor is a flexible system, there are different ways of configuring it in your computing infrastructure. Hence, before starting installation, make the following important decisions:<br />
<br />
#What machine will be the central manager?<br />
#What machines should be allowed to submit jobs?<br />
#Will Condor run as root or not?<br />
#Do I have enough disk space for Condor?<br />
#Do I need MPI configured?<br />
<br />
Condor can be installed as a ''manager'' node, an ''execute'' node, a ''submit'' node, or any combination of these. See [http://www.cs.wisc.edu/condor/manual/v7.2/3_1Introduction.html#SECTION00411000000000000000 The Different Roles a Machine Can Play]<br />
<br />
* Manager: There can be only one central manager for your pool. This machine is the collector of information, and the negotiator between resources and resource requests.<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer to the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
''' The official instructions on how to install Condor in Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here] '''. Below we present some of the tweaks we had to do to get it to work on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.) , if it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
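A quick way to see what name the system actually reports, before and after the reboot, is this Python sketch (assuming the standard library's view of the resolver matches what Condor will see on your machine):<br />
<br />
```python
import socket

# hostname as configured, and the fully qualified name resolution returns
name = socket.gethostname()
fqdn = socket.getfqdn()
print("hostname:", name)
print("fqdn:    ", fqdn)
if "." not in fqdn:
    print("warning: no domain part found in the fully qualified name")
```
<br />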
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a similar command to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or with a user having equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (For example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation scripts ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update PATH variable to include the directory ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save file and apply the change by running <br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out / log in, or even restart the machine, and you should be able to check that the CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit the Condor manager's config file and update the lines as shown below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
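After editing, it can be worth verifying that the file still parses the way you expect. Here is a small Python sketch (hypothetical, not a Condor utility; it handles plain KEY = VALUE lines only and ignores Condor macro expansion like $(CONDOR_HOST)) that flags missing keys from the list above:<br />
<br />
```python
REQUIRED = ["RELEASE_DIR", "LOCAL_DIR", "CONDOR_ADMIN",
            "UID_DOMAIN", "FILESYSTEM_DOMAIN"]

def parse_config(text):
    """Parse simple KEY = VALUE lines, ignoring blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "=" in line:
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    return values

def missing_keys(text):
    """Return the required keys not present in the config text."""
    return [k for k in REQUIRED if k not in parse_config(text)]
```
<br />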
* If you have MIDAS integration, in order to allow Midas to run Condor commands, create a link to /root/condor/etc/condor_config in /home/condor<br />
cd /home/condor<br />
ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring a Executer/Submitter in Unix ===<br />
The different files allowing the server to also be used as a Condor submitter/executor were automatically updated when the installation script ''condor_install'' ran. Nevertheless, you still need to update its configuration file.<br />
<br />
* Edit the Condor node's condor_config.local and update the line as shown below:<br />
vi /home/condor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the lines with ''UID_DOMAIN'' and ''FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor in Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor in Windows 7.<br />
<br />
# Download the Windows install MSI, run it, installing to "C:/condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager (full address).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For accounting domain enter your domain (e.g. yourdomaininternal.com)<br />
# For Email settings (I ignored this by clicking next)<br />
# for Java settings (I ignored this as we weren't using Java, by clicking next)<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked if you want a custom install or a standard install, choose install. This will install Condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When running Condor commands, start a Cygwin or cmd prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix up some problems. My '''condor_status''' at first gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect (since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or perhaps just an IP that you would not want).<br />
<br />
<br />
I shut down condor, right clicked on C:/condor in a Windows explorer, turned off "read only" and set permissions to allow for writing. Then I edited the file "c:/condor/condor_config.local" which started out empty, so that it could pick up some replacement values (some of them didn't seem to be set properly during the install). These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following:<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourinternaldomain.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
Then restart Condor. If you run "condor_status" and see that the Windows machine has Activity=OWNER rather than UNCLAIMED, be sure that you have added START=True. But this may not be the best configuration for a Windows workstation that is in use; some additional configuration is probably needed to make sure a Condor job doesn't use the machine while a physical human user is there using it.<br />
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here is the batch file contents for the actual job (printname.bat)<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, get a command prompt or Cygwin terminal, right click the icon to start it up, and click "run with elevated privileges" or "run as administrator".<br />
<br />
condor_master runs as a service on Windows, which controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* run the condor manager <br />
condor_master<br />
<br />
* Assuming that during installation you set the type to ''manager,execute,submit'' (the default), run the following command<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after you started condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Set up Condor to start automatically at boot<br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add ''condor.boot'' service to all runlevel<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
#include <unistd.h><br />
#include <stdio.h><br />
int main( int argc, char** argv )<br />
{<br />
/* guard against a missing argument */<br />
if( argc < 2 )<br />
{<br />
printf( "usage: %s <message>\n", argv[0] );<br />
return 1;<br />
}<br />
printf( "%s\n", argv[1] );<br />
fflush( stdout );<br />
sleep( 30 );<br />
return 0; <br />
}<br />
<br />
This executable will echo the command line argument it is given, wait 30 seconds, then exit.<br />
<br />
Save this file as foo.c, then compile it with (note static linking).<br />
<br />
gcc foo.c -o foo --static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once, so you can see how Condor will execute the job on multiple execute resources), you can change the condorjob file to be like this:
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
<br />
We had a case with 6 slots: 2 were 32-bit with Arch=INTEL and 4 were 64-bit with Arch=X86_64 (though we were unaware of the difference at first). We ran 6 jobs and wondered why they would only execute on the submitting machine. So we changed the condorjob file to specify a particular architecture by including
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told Condor to execute only on machines with an INTEL architecture, so the job was not attempted on the X86_64 submitting machine. We then saw an error in condorjob1.log saying
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. I've included this story in case it helps with debugging.<br />
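Based on that experience, here is a hedged sketch of a submit description that pins both the architecture and the operating system up front (the values shown are illustrative; run ''condor_status'' to see the Arch and OpSys values actually present in your pool):

```text
universe = vanilla
executable = foo
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
# Only match 64-bit Linux execute slots; adjust the values to your pool.
Requirements = (Arch == "X86_64") && (OpSys == "LINUX")
log = condorjob.log
error = condorjob.err
output = condorjob.out
arguments = "helloworld"
Queue
```

Constraining both attributes avoids the "Exec format error" above, since the job can only match slots that can actually run the binary.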
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some of the lessons we learned.
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was ''condor''.
*When building BatchMake, you need to build with grid support enabled.
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows the jobs you have submitted to Condor that are still active. It gives a cluster ID and process ID for each job.
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process will start and stop all other condor daemons. condor_master will write a MasterLog and have a .master_address file. Be sure that the .master_address contains the correct IP address. If the correct IP address doesn't appear, set the correct value in the parameter <br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has; Condor may have picked a different IP than you wanted, perhaps the local loopback address 127.0.0.1 or a wireless adapter instead of the ethernet one.
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes, and maintains data about submitted jobs. It tracks the queueing of jobs and tries to get resources for all of its jobs to be run. When a submitted job runs, this daemon spawns a condor_shadow daemon.
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine whenever an actual execution of a job is running. It takes care of system calls that need to be executed on the submitting machine for a process. There will be one condor_shadow process for each executing job submitted from the machine, meaning that on a machine with a large number of submitted jobs, the memory or other resources consumed by the shadow daemons could become a limitation.
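As a rough way to watch this scaling on a submit machine, you can count the shadow processes (a sketch assuming a Unix submit node; on a machine where Condor is not running it simply prints 0):

```shell
# Each running job submitted from this machine should have one
# condor_shadow process. The [c] bracket pattern keeps the grep
# process itself from ever matching.
ps -e | grep '[c]ondor_shadow' | wc -l
```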
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution. Log files of interest include NegotiatorLog and MatchLog<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon detects user activity on an execute node, so Condor knows whether to allow a job to execute or to disallow it because a human user is currently engaged in some task.
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, corresponding to [http://en.wikipedia.org/wiki/IA-64 IA64], the 64-bit Intel Itanium architecture. IA64 does not cover all 64-bit Intel processors.
<br />
While trying to run ''condor_master'', the shell returned the error message ''cannot execute binary file''.
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], it is possible to read the header of an executable and determine whether it can run on a given platform.
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
 Machine: '''Advanced Micro Devices X86-64'''
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the outputs shows that the '''Intel IA-64''' binary is the wrong architecture for this machine; the X86-64 build is the one required.
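A quick way to avoid this mistake before downloading is to check the machine's reported architecture (a sketch assuming a Linux host; ''x86_64'' means you want the linux-x86_64 package, while ''ia64'' would indicate an actual Itanium system):

```shell
# Print the hardware architecture the kernel is running on,
# to match it against the Condor package name.
uname -m
```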
<br />
<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the MPICH2\bin directory to your system path, and add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication passphrase for smpd; this is the '''mypassphrase''' you entered above. If all is correct at this point, you will get a result of '''SUCCESS'''.
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processes since my machine has 4 cores):
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines, how many processes each should run, and the executable '''hostname''', which should print the two different hostnames.
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, you must right click on the project file (not the solution file), then click properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then to run your exe on two different machines (I had to make a RELEASE build for this; at first I had made a DEBUG build, but that caused problems on the second machine):
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
<br />
== MPICH2 and Condor on Windows ==<br />
<br />
To configure MPI over Condor, I set my Windows 7 laptop up as a dedicated Condor scheduler and executor, and followed the instructions at this link:
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
I turned off smpd.exe as a service, since Condor will manage it:<br />
<br />
smpd.exe -remove<br />
<br />
<br />
I then created the following MPI example program and ran it with a MachineCount of 2 and 4 at different times (on my laptop, which has 4 cores, with Condor running locally). This program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char *argv[])<br />
{<br />
MPI::Init(argc, argv);<br />
//<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
int size = MPI::COMM_WORLD.Get_size();<br />
 char processor_name[MPI_MAX_PROCESSOR_NAME];
 int namelen;
 MPI::Get_processor_name(processor_name,namelen);
// <br />
int buf[1];<br />
int numElements = 1;<br />
// initialize buffer with rank<br />
buf[0] = rank;<br />
// example of synchronous communication<br />
// have even ranks send to odd ranks first<br />
// assumes an even number of processes<br />
int syncTag = 123;<br />
if (rank % 2 == 0)<br />
{<br />
// send to next higher rank<br />
int dest = rank + 1;<br />
std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Send(buf,numElements,MPI::INT,dest,syncTag);<br />
// now wait for dest to send back<br />
// even though dest is the param, in the Recv call it is used as the source<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,dest,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
}<br />
else<br />
{<br />
// source is next lower rank<br />
int source = rank - 1;<br />
std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,source,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
// send rank rather than buffer, just to be sure<br />
// even though param is source, in Send call it is for dest<br />
MPI::COMM_WORLD.Send(&rank,numElements,MPI::INT,source,syncTag);<br />
}<br />
// now for asynchronous communication<br />
std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
int asyncTag = 321;<br />
int leftRank;<br />
if (rank == 0)<br />
{<br />
leftRank = size-1;<br />
}<br />
else <br />
{<br />
leftRank = (rank-1);<br />
}<br />
int rightRank;<br />
if (rank == size-1)<br />
{<br />
rightRank = 0;<br />
}<br />
else <br />
{<br />
rightRank = (rank+1);<br />
}<br />
// everyone sends to the leftRank, and receives from the rightRank<br />
// if this were synchronous, would be in danger of deadlocking because of circular waits<br />
// reset buffers to rank<br />
buf[0] = rank;<br />
std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::Request sendReq = MPI::COMM_WORLD.Isend(buf,numElements,MPI::INT,leftRank,asyncTag);<br />
MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf,numElements,MPI::INT,rightRank,asyncTag);<br />
// wait on the receipt<br />
MPI::Status status;<br />
recvReq.Wait(status);<br />
std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
//<br />
MPI::Finalize();<br />
return(0);<br />
}<br />
<br />
<br />
I then created a two-machine Condor parallel universe grid. My 4-core Windows 7 Professional laptop (X86_64, WINNT61) was the '''CONDOR_HOST''' and the '''DedicatedScheduler'''. Assuming this laptop is called '''clavicle''' in the '''shoulder.com''' domain, here are the important configuration parameters in its '''condor_config.local''' file: 
<br />
CONDOR_HOST=clavicle.shoulder.com<br />
SMPD_SERVER = C:\Program Files (x86)\MPICH2\bin\smpd.exe<br />
SMPD_SERVER_ARGS = -p 6666 -d 1<br />
DedicatedScheduler = "DedicatedScheduler@clavicle.shoulder.com"<br />
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler<br />
START = TRUE<br />
SUSPEND = FALSE<br />
PREEMPT = FALSE<br />
KILL = FALSE<br />
WANT_SUSPEND = FALSE<br />
WANT_VACATE = FALSE<br />
CONTINUE = TRUE<br />
RANK = 0<br />
<br />
<br />
 DAEMON_LIST = MASTER, SCHEDD, STARTD, COLLECTOR, NEGOTIATOR, SMPD_SERVER
<br />
* Manager: There can be only one central manager for your pool. This machine is the collector of information, and the negotiator between resources and resource requests.
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer to the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]
<br />
= Installation =<br />
== Unix ==<br />
<br />
''' The official instructions on how to install Condor in Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here] '''. Below we present some of the tweaks we had to make to get it to work on our Unix machines.
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.). If it only returns ''mymachine'', then your server does not have a fully qualified domain name.
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
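The hostname check above can be scripted; a small sketch (assuming a Linux host, where ''hostname -f'' asks for the fully qualified name):

```shell
# Compare the short name with the fully qualified name; Condor
# wants the machine to resolve to a fully qualified domain name.
echo "short name: $(hostname)"
echo "fqdn:       $(hostname -f)"
```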
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' (Platform RHEL 5, Intel x86/64). See http://www.cs.wisc.edu/condor/downloads-v2/download.pl
For example, you could run a similar command to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or with a user having equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (For example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation scripts ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update PATH variable to include the directory ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save the file and apply the change by running 
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out and log in again, or even restart the machine, and you should be able to check that the CONDOR_CONFIG and PATH environment variables are still set.
<br />
* Edit the Condor manager configuration file and update the lines referenced below:
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, to allow MIDAS to run Condor commands, create a link to /root/condor/etc/condor_config in /home/condor
 cd /home/condor
 ln -s /root/condor/etc/condor_config condor_config
<br />
=== Configuring an Executer/Submitter in Unix ===
The files allowing the server to also be used as a Condor submitter/executer were automatically updated by the installation script ''condor_install''. Nevertheless, you still need to update its configuration file.
<br />
* Edit the node's condor_config.local and update the line referenced below:
vi /home/condor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the ''UID_DOMAIN'' and ''FILESYSTEM_DOMAIN'' lines should already be set to ''website.com''.
<br />
== Windows ==<br />
'''The official documentation on how to install Condor in Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor in Windows 7.<br />
<br />
# Download the Windows install MSI, run it, installing to "C:/condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager ( full address ).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For accounting domain enter your domain (e.g. yourdomaininternal.com)<br />
# For Email settings (I ignored this by clicking next)<br />
# for Java settings (I ignored this as we weren't using Java, by clicking next)<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked whether you want a custom install or a standard install, choose install. This will install Condor to C:\condor.
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When running Condor commands, start a Cygwin or cmd prompt with elevated/administrator privileges.
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix up some problems. My '''condor_status''' at first gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect: since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or perhaps just an IP that you do not want.
<br />
<br />
I shut down condor, right clicked on C:/condor in a Windows explorer, turned off "read only" and set permissions to allow for writing. Then I edited the file "c:/condor/condor_config.local" which started out empty, so that it could pick up some replacement values (some of them didn't seem to be set properly during the install). These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following:<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourinternaldomain.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
and then restarted Condor. If you run "condor_status" and see that the Windows machine has Activity=OWNER rather than UNCLAIMED, be sure that you have added START=True. But this may not be the best configuration for a Windows workstation that is in use; some additional configuration is probably needed to make sure a Condor job doesn't use the machine while a physical human user is working on it.
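One common way to address this (a hedged sketch adapted from the desktop policy described in the Condor manual; the 15-minute threshold is illustrative) is to start jobs only when the keyboard has been idle for a while, e.g. in condor_config.local:

```text
# Only start jobs after 15 minutes without keyboard/mouse activity.
# KeyboardIdle is reported in seconds.
START = KeyboardIdle > (15 * 60)
```

On Windows, the condor_kbdd daemon mentioned earlier is what detects user activity for this attribute, so it should be in the DAEMON_LIST on an execute node using this policy.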
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here is the batch file contents for the actual job (printname.bat)<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, open a command prompt or Cygwin terminal with elevated privileges: right click the icon used to start it and click "run with elevated privileges" or "run as administrator".<br />
<br />
condor_master runs as a service on Windows, which controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows for the first time, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* run the condor manager <br />
condor_master<br />
<br />
* Assuming that during installation you set up the type as ''manager,execute,submit'' (the default), run the following command<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after starting condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Set up condor to start automatically at boot<br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add the ''condor.boot'' service to all runlevels<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
#include <unistd.h><br />
#include <stdio.h><br />
int main( int argc, char** argv )<br />
{<br />
printf( "%s\n", argv[1] );<br />
fflush( stdout );<br />
sleep( 30 );<br />
return 0; <br />
}<br />
<br />
This program echoes the command line argument it is given, waits 30 seconds, then exits.<br />
<br />
Save this file as foo.c, then compile it as follows (note the static linking).<br />
<br />
gcc foo.c -o foo --static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once so you can see how Condor will execute the job on multiple execute resources), you can <br />
change the condorjob file to be like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
<br />
We had a case where we had 6 slots: 2 were 32-bit with Arch=INTEL, and 4 were 64-bit with Arch=X86_64 (but we were unaware of the difference at first). We ran 6 jobs and then wondered why they would only execute on the submitting machine. So we changed the condorjob file to specify a certain architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told condor to execute only on machines with an Arch of INTEL, so the job was not run on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. I've included this story in case it helps with debugging.<br />
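To avoid this kind of mismatch up front, the submit description file can constrain the job to matching slots. A sketch (Arch and OpSys are standard machine ClassAd attributes; the values shown assume a 64-bit Linux build of the executable):<br />

```
# In the submit description file: only match 64-bit Linux slots.
Requirements = (Arch == "X86_64") && (OpSys == "LINUX")
```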
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was condor.<br />
*When building BatchMake, you need to build with grid support enabled<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows the processes you have submitted to condor that are still active. It will give you a cluster ID and process ID for each process.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process starts and stops all other condor daemons. condor_master writes a MasterLog and a .master_address file. Be sure that .master_address contains the correct IP address. If the correct IP address doesn't appear, set the correct value in the parameter<br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has, but condor may have picked a different IP than you wanted, perhaps the local loopback address 127.0.0.1 or a wireless adapter instead of the Ethernet one.<br />
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes and manages data about submitted jobs. It tracks the queue of jobs and tries to obtain resources so that all of its jobs can run. When a submitted job runs, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine while an actual execution of the job is underway. It takes care of system calls that need to be executed on the submitting machine for a process. There is one condor_shadow process for each executing process from a submission machine, so on a machine with a large number of submitted processes, the number of shadow daemons that memory and other resources can support may become a limitation.<br />
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution. Log files of interest include NegotiatorLog and MatchLog.<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon is used to detect user activity on an execute node, so Condor knows whether to allow execution of a job or to disallow it because a human user is currently engaged in some task.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, corresponding to [http://en.wikipedia.org/wiki/IA-64 IA64], the 64-bit Intel Itanium processor. This is not the architecture used by most 64-bit Intel processors, which are x86-64.<br />
<br />
While trying to run ''condor_master'', the shell returned the error message ''cannot execute binary file''.<br />
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], you can inspect the header of an executable and determine whether it can run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the different outputs, you can observe that '''Intel IA-64''' is not the right architecture for this machine, which runs '''X86-64''' executables.<br />
<br />
<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32 bit binary. Then add the location of the MPICH2\bin directory to your system path. Add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by running<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication passphrase for smpd; this is the '''mypassphrase''' you entered above. If all is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processes since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines, and how many processes each machine should run, and the executable '''hostname''', which should return the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, you must right click on the project file (not the solution file), then click properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then, to run your exe on two different machines (I had to build a RELEASE build for this; at first I had built a DEBUG build, which caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
<br />
== MPICH2 and Condor on Windows ==<br />
<br />
To configure MPI over Condor, I set up my Windows 7 laptop as a dedicated Condor scheduler and executer, following the instructions at this link:<br />
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
I turned off smpd.exe as a service, since Condor will manage it:<br />
<br />
smpd.exe -remove<br />
<br />
<br />
I then created the following MPI example program and ran it with a MachineCount of 2 and of 4 at different times (I ran this on my laptop, which has 4 cores and had Condor running locally). This program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.<br />
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char *argv[])<br />
{<br />
MPI::Init(argc, argv);<br />
//<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
int size = MPI::COMM_WORLD.Get_size();<br />
char processor_name[MPI_MAX_PROCESSOR_NAME];<br />
int namelen;<br />
MPI::Get_processor_name(processor_name,namelen);<br />
// <br />
int buf[1];<br />
int numElements = 1;<br />
// initialize buffer with rank<br />
buf[0] = rank;<br />
// example of synchronous communication<br />
// have even ranks send to odd ranks first<br />
// assumes an even number of processes<br />
int syncTag = 123;<br />
if (rank % 2 == 0)<br />
{<br />
// send to next higher rank<br />
int dest = rank + 1;<br />
std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Send(buf,numElements,MPI::INT,dest,syncTag);<br />
// now wait for dest to send back<br />
// even though dest is the param, in the Recv call it is used as the source<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,dest,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
}<br />
else<br />
{<br />
// source is next lower rank<br />
int source = rank - 1;<br />
std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,source,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
// send rank rather than buffer, just to be sure<br />
// even though param is source, in Send call it is for dest<br />
MPI::COMM_WORLD.Send(&rank,numElements,MPI::INT,source,syncTag);<br />
}<br />
// now for asynchronous communication<br />
std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
int asyncTag = 321;<br />
int leftRank;<br />
if (rank == 0)<br />
{<br />
leftRank = size-1;<br />
}<br />
else <br />
{<br />
leftRank = (rank-1);<br />
}<br />
int rightRank;<br />
if (rank == size-1)<br />
{<br />
rightRank = 0;<br />
}<br />
else <br />
{<br />
rightRank = (rank+1);<br />
}<br />
// everyone sends to the leftRank, and receives from the rightRank<br />
// if this were synchronous, would be in danger of deadlocking because of circular waits<br />
// reset buffers to rank<br />
buf[0] = rank;<br />
std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::Request sendReq = MPI::COMM_WORLD.Isend(buf,numElements,MPI::INT,leftRank,asyncTag);<br />
MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf,numElements,MPI::INT,rightRank,asyncTag);<br />
// wait on the receipt<br />
MPI::Status status;<br />
recvReq.Wait(status);<br />
std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
//<br />
MPI::Finalize();<br />
return(0);<br />
}</div>Michael.grauerhttps://public.kitware.com/Wiki/index.php?title=Proposals:Condor&diff=40072Proposals:Condor2011-05-20T20:15:17Z<p>Michael.grauer: /* MPICH2 on Windows */</p>
<hr />
<div><br />
<br />
* Manager: There can be only one central manager for your pool. This machine is the collector of information and the negotiator between resources and resource requests.<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer to the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
'''The official instructions on how to install Condor on Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here]'''. Below we present some of the tweaks we had to do to get it to work on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.); if it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a similar command to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or with a user having equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (For example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation scripts ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update PATH variable to include the directory ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save file and apply the change by running <br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out and log back in, or even restart the machine, and you should be able to check that the CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit the condor manager config file and update the lines as shown below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, in order to allow MIDAS to run condor commands, create a link to /root/condor/etc/condor_config in /home/condor<br />
cd /home/condor<br />
ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring a Executer/Submitter in Unix ===<br />
The files allowing the server to also be used as a condor submitter/executer were automatically updated when the installation script ''condor_install'' ran. Nevertheless, you still need to update its configuration file.<br />
<br />
* Edit the condor node's condor_config.local and update the lines as shown below:<br />
vi /home/condor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the lines setting ''UID_DOMAIN'' and ''FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor in Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor in Windows 7.<br />
<br />
# Download the Windows install MSI, run it, installing to "C:/condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager ( full address ).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For accounting domain enter your domain (e.g. yourdomaininternal.com)<br />
# For Email settings, I skipped this step by clicking next<br />
# For Java settings, I skipped this step by clicking next, as we weren't using Java<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked whether you want a custom install or a default install, choose the default install. This will install condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When running condor commands, start a cygwin or cmd prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix up some problems. My '''condor_status''' at first gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect (since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or perhaps just an IP that you would not want).<br />
<br />
<br />
I shut down Condor, right-clicked on C:/condor in Windows Explorer, turned off "read only", and set permissions to allow writing. Then I edited the file "c:/condor/condor_config.local" (which started out empty) so that Condor would pick up some replacement values, since some of them did not seem to be set properly during the install. These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose ONE of the following DAEMON_LIST lines (if both are present, the later one wins):<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host that is also a submit/execute node:<br />
# DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourdomaininternal.com if your machine comes up without a domain name in Condor.<br />
<br />
<br />
Then restart Condor. If you run "condor_status" and see that the Windows machine has Activity=OWNER rather than UNCLAIMED, be sure that you have added START = True. Note that this may not be the best configuration for a Windows workstation that is in use: additional configuration is needed to make sure a Condor job doesn't use the machine while a physical human user is there using it.<br />
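<br />
A reasonable starting point (a sketch based on Condor's standard startd policy attributes, not something from our setup) is to gate job execution on the KeyboardIdle and load-average attributes that the startd publishes, in condor_config.local:<br />
<br />
 # Only start a job after 15 minutes with no keyboard/mouse activity,<br />
 # and only when the machine's non-Condor load is low<br />
 START = KeyboardIdle > (15 * 60) && (LoadAvg - CondorLoadAvg) <= 0.3<br />
 # Suspend the job as soon as the user comes back<br />
 SUSPEND = KeyboardIdle < 60<br />
 CONTINUE = KeyboardIdle > (15 * 60)<br />
<br />
See the startd policy configuration section of the Condor manual for the authoritative version of this policy.<br />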
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, with a bit of tweaking for Windows 7.<br />
<br />
Here is the batch file contents for the actual job (printname.bat)<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, open a command prompt or Cygwin terminal by right-clicking its icon and choosing "run as administrator" (run with elevated privileges).<br />
<br />
On Windows, condor_master runs as a service and controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows for the first time, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* run the condor manager <br />
condor_master<br />
<br />
* Assuming that during installation you set up the machine type as ''manager,execute,submit'' (the default), run the following command:<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after you started condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Setup condor to automatically startup <br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add ''condor.boot'' service to all runlevel<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
#include <unistd.h><br />
#include <stdio.h><br />
int main( int argc, char** argv )<br />
{<br />
printf( "%s\n", argv[1] );<br />
fflush( stdout );<br />
sleep( 30 );<br />
return 0; <br />
}<br />
<br />
This program prints the command line argument it is given, waits 30 seconds, then exits.<br />
<br />
Save this file as foo.c, then compile it with (note the static linking):<br />
<br />
gcc foo.c -o foo --static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once so you can see how Condor will execute the job on multiple execute resources), you can <br />
change the condorjob file to be like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
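<br />
As a more compact alternative (a sketch we have not run), Condor's $(Process) submit-file macro can generate the per-job file names, so the two-job file above can be written as:<br />
<br />
 universe = vanilla<br />
 executable = foo<br />
 should_transfer_files = YES<br />
 when_to_transfer_output = ON_EXIT<br />
 log = condorjob$(Process).log<br />
 error = condorjob$(Process).err<br />
 output = condorjob$(Process).out<br />
 arguments = "helloworld$(Process)"<br />
 Queue 2<br />
<br />
Here $(Process) expands to 0 and 1 for the two queued jobs.<br />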
<br />
We had a case with 6 slots: 2 were 32-bit with Arch=INTEL and 4 were 64-bit with Arch=X86_64 (though we were unaware of the difference at first). We ran 6 jobs and wondered why they would only execute on the submitting machine. So we changed the condorjob file to specify a certain architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told Condor to execute only on machines with Intel architecture, so it did not attempt to execute on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. I've included this story in case it helps with debugging.<br />
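<br />
In a mixed pool, the general fix (a sketch) is to make the job's requirements match the platform the executable was actually built for; for our statically linked 64-bit Linux build of foo that would be:<br />
<br />
 Requirements = (Arch == "X86_64") && (OpSys == "LINUX")<br />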
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was ''condor''.<br />
*When building BatchMake, you need to build with grid support turned on.<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows you which of the jobs you have submitted to Condor are active. It will give you a cluster and process ID for each job.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process will start and stop all other condor daemons. condor_master will write a MasterLog and have a .master_address file. Be sure that the .master_address contains the correct IP address. If the correct IP address doesn't appear, set the correct value in the parameter <br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has; Condor may have picked a different IP than you wanted, perhaps the local loopback address 127.0.0.1 or a wireless adapter instead of the wired ethernet adapter.<br />
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes and holds data about submitted jobs. It tracks the queueing of jobs and tries to get resources for all of its jobs to be run. When a submitted job is run, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine when an actual execution of the job is run. It takes care of system calls that need to be executed on the submitting machine for a process. There will be one condor_shadow process for each executing process of a submission machine, so on a machine with a large number of submitted processes, the number of shadow daemons that memory or other resources can support could become a limitation.<br />
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution. Log files of interest include NegotiatorLog and MatchLog<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon is used to detect user activity on an execute node, so Condor knows whether to allow execution of a job or to disallow it because a human user is currently engaged in some task.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, corresponding to [http://en.wikipedia.org/wiki/IA-64 IA64], the 64-bit Intel Itanium processor. This does not cover all 64-bit Intel processors.<br />
<br />
While trying to run ''condor_master'', the shell returned the error message ''cannot execute binary file''.<br />
<br />
Using the [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf] program, it's possible to extract the header of an executable and determine whether it can run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the different outputs, we can see that the '''Intel IA-64''' architecture does not match the machine, while the X86-64 build does.<br />
<br />
<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the location of the MPICH2\bin directory to your system path, and add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication password for smpd; this is the '''mypassphrase''' you entered above. If all is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processors since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines and how many processes each should run, with the executable '''hostname''', which should print the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, right-click on the project file (not the solution file), click Properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
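<br />
If you prefer CMake to hand-edited project settings (this is a sketch we did not use ourselves; it assumes CMake's standard FindMPI module can locate your MPICH2 install), a minimal CMakeLists.txt would look like:<br />
<br />
 cmake_minimum_required(VERSION 2.8)<br />
 project(MPIExample)<br />
 find_package(MPI REQUIRED)<br />
 include_directories(${MPI_INCLUDE_PATH})<br />
 add_executable(mpiexample main.cpp)<br />
 target_link_libraries(mpiexample ${MPI_LIBRARIES})<br />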
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then, to run your exe on two different machines (I had to build RELEASE for this; at first I had built a DEBUG build, which caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines<br />
<br />
== MPICH2 and Condor on Windows 7 ==<br />
<br />
To configure MPI over Condor, I set my laptop up as a dedicated Condor scheduler and executor, following the instructions at this link:<br />
<br />
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigureMpiOnWindows<br />
<br />
<br />
I then created the following MPI example program and ran it with a MachineCount of 2 and 4 at different times (I ran this on my laptop, which has 4 cores and had Condor running locally). This program assumes an even number of processes. It shows examples of both synchronous and asynchronous communication using MPI.<br />
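<br />
For reference, a parallel-universe submit file for such a job looks roughly like this (a sketch; '''mpitest.exe''' and '''mpiwrapper.bat''' are hypothetical names, and the wrapper script recommended by the wiki page above is what actually invokes mpiexec):<br />
<br />
 universe = parallel<br />
 executable = mpiwrapper.bat<br />
 arguments = mpitest.exe<br />
 machine_count = 4<br />
 should_transfer_files = YES<br />
 when_to_transfer_output = ON_EXIT<br />
 transfer_input_files = mpitest.exe<br />
 queue<br />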
<br />
Resources that I found to be helpful in developing this program are:<br />
<br />
http://www-uxsup.csx.cam.ac.uk/courses/MPI/<br />
<br />
https://computing.llnl.gov/tutorials/mpi/<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char *argv[])<br />
{<br />
MPI::Init(argc, argv);<br />
//<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
int size = MPI::COMM_WORLD.Get_size();<br />
char processor_name[MPI_MAX_PROCESSOR_NAME];<br />
int namelen;<br />
MPI::Get_processor_name(processor_name,namelen);<br />
// <br />
int buf[1];<br />
int numElements = 1;<br />
// initialize buffer with rank<br />
buf[0] = rank;<br />
// example of synchronous communication<br />
// have even ranks send to odd ranks first<br />
// assumes an even number of processes<br />
int syncTag = 123;<br />
if (rank % 2 == 0)<br />
{<br />
// send to next higher rank<br />
int dest = rank + 1;<br />
std::cout << processor_name << "." << rank << " will send to " << dest << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Send(buf,numElements,MPI::INT,dest,syncTag);<br />
// now wait for dest to send back<br />
// even though dest is the param, in the Recv call it is used as the source<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,dest,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << dest << ", buf[0]= " << buf[0] << std::endl;<br />
}<br />
else<br />
{<br />
// source is next lower rank<br />
int source = rank - 1;<br />
std::cout << processor_name << "." << rank << ", will receive from " << source << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::COMM_WORLD.Recv(buf,numElements,MPI::INT,source,syncTag); <br />
std::cout << processor_name << "." << rank << ", after receiving from " << source << ", buf[0]= " << buf[0] << std::endl;<br />
// send rank rather than buffer, just to be sure<br />
// even though param is source, in Send call it is for dest<br />
MPI::COMM_WORLD.Send(&rank,numElements,MPI::INT,source,syncTag);<br />
}<br />
// now for asynchronous communication<br />
std::cout << processor_name << "." << rank << " switching to asynchronous" << std::endl;<br />
int asyncTag = 321;<br />
int leftRank;<br />
if (rank == 0)<br />
{<br />
leftRank = size-1;<br />
}<br />
else <br />
{<br />
leftRank = (rank-1);<br />
}<br />
int rightRank;<br />
if (rank == size-1)<br />
{<br />
rightRank = 0;<br />
}<br />
else <br />
{<br />
rightRank = (rank+1);<br />
}<br />
// everyone sends to the leftRank, and receives from the rightRank<br />
// if this were synchronous, would be in danger of deadlocking because of circular waits<br />
// reset buffers to rank<br />
buf[0] = rank;<br />
std::cout << processor_name << "." << rank << " will send to " << leftRank << ", for now buf[0]= " << buf[0] << std::endl;<br />
MPI::Request sendReq = MPI::COMM_WORLD.Isend(buf,numElements,MPI::INT,leftRank,asyncTag);<br />
MPI::Request recvReq = MPI::COMM_WORLD.Irecv(buf,numElements,MPI::INT,rightRank,asyncTag);<br />
// wait on the receipt<br />
MPI::Status status;<br />
recvReq.Wait(status);<br />
std::cout << processor_name << "." << rank << " after receiving from " << rightRank << ", buf[0]= " << buf[0] << std::endl;<br />
//<br />
MPI::Finalize();<br />
return(0);<br />
}<br />
</div>Michael.grauerhttps://public.kitware.com/Wiki/index.php?title=Proposals:Condor&diff=39198Proposals:Condor2011-04-20T22:10:13Z<p>Michael.grauer: /* Creating an MPI program on Windows */</p>
<hr />
<div>
Condor can be installed as either a ''manager'' node, a ''execute'' or a ''submit'' node. Or any combination of these ones. See [http://www.cs.wisc.edu/condor/manual/v7.2/3_1Introduction.html#SECTION00411000000000000000 The Different Roles a Machine Can Play]<br />
<br />
* Manager: There can be only one central manager for your pool. This machine is the collector of information and the negotiator between resources and resource requests.<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer to the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]<br />
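<br />
In the configuration, the role a machine plays corresponds to which daemons its condor_master starts, controlled by the DAEMON_LIST macro. For example (a sketch):<br />
<br />
 # Central manager only:<br />
 # DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR<br />
 # Submit-only node:<br />
 # DAEMON_LIST = MASTER, SCHEDD<br />
 # Execute-only node:<br />
 DAEMON_LIST = MASTER, STARTD<br />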
<br />
= Installation =<br />
== Unix ==<br />
<br />
''' The official instructions on how to install Condor in Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here] '''. Below we present some of the tweaks we had to do to get it working on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.); if it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a similar command to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or with a user having equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (For example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation script ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update PATH variable to include the directory ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save file and apply the change by running <br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can now log out/log in, or even restart the machine, and check that the CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit the condor manager config file and update the lines as referenced below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, then in order to allow MIDAS to run Condor commands, create a link to /root/condor/etc/condor_config in /home/condor<br />
cd /home/condor<br />
ln -s /root/condor/etc/condor_config condor_config<br />
<br />
=== Configuring a Executer/Submitter in Unix ===<br />
The files that allow the server to also be used as a Condor submitter/executer were automatically updated by the installation script ''condor_install''. Nevertheless, you still need to update its local configuration file.<br />
<br />
* Edit the node's condor_config.local and update the line as referenced below:<br />
vi /home/condor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the lines containing ''UID_DOMAIN'' and ''FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor in Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor in Windows 7.<br />
<br />
# Download the Windows install MSI, run it, installing to "C:/condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager ( full address ).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For accounting domain enter your domain (e.g. yourdomaininternal.com)<br />
# For Email settings (I ignored this by clicking next)<br />
# for Java settings (I ignored this as we weren't using Java, by clicking next)<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked whether you want a custom install or a standard install, choose the standard install. This will install Condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When running Condor commands, start a Cygwin or cmd prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix up some problems. My '''condor_status''' at first gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect: since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or simply an IP that you do not want.<br />
<br />
<br />
I shut down Condor, right-clicked on C:/condor in Windows Explorer, turned off "read only", and set permissions to allow writing. Then I edited the file "c:/condor/condor_config.local" (which started out empty) so that Condor would pick up some replacement values, since some of them didn't seem to be set properly during the install. These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following:<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourinternaldomain.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
Then I restarted Condor. If you run "condor_status" and see that the Windows machine has Activity=OWNER rather than UNCLAIMED, be sure that you have added START=True. This may not be the best configuration for a Windows workstation that is in use, though: some additional configuration is probably needed to make sure a Condor job doesn't use the machine while a physical human user is there using it.<br />
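For such a workstation, the usual approach from the Condor manual's desktop policies is to gate execution on console idleness. The fragment below for condor_config.local is only a hedged sketch: the thresholds are our own example values, and you should verify the ''KeyboardIdle'' (seconds since keyboard/mouse activity) and ''LoadAvg'' attribute names against your Condor version before relying on them:

```
# Run jobs only after 15 minutes of console idleness and when the
# machine load is low (example thresholds -- tune for your site):
START    = KeyboardIdle > 15 * 60 && LoadAvg <= 0.3
# Suspend a running job as soon as the user comes back:
SUSPEND  = KeyboardIdle < 60
CONTINUE = KeyboardIdle > 5 * 60
```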
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here is the batch file contents for the actual job (printname.bat)<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, get a command prompt or Cygwin terminal, right click the icon to start it up, and click "run with elevated privileges" or "run as administrator".<br />
<br />
condor_master runs as a service on Windows, which controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows for the first time, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* run the condor manager <br />
condor_master<br />
<br />
* Assuming that during installation you set the machine type to ''manager,execute,submit'' (the default), run the following command:<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after you started condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
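If one of the daemons crashes or never starts, its absence from the ''ps'' listing is the first clue. A small sketch of that check (the helper is our own; the canned text stands in for live ''ps'' output, which truncates command names to 15 characters, hence ''condor_collecto''):

```shell
# Sketch: flag any expected Condor daemon missing from `ps` output.
# Canned text stands in for a live `ps -e | egrep condor_`.
ps_out=' 1063 ?  00:00:00 condor_master
 1064 ?  00:00:00 condor_collecto
 1065 ?  00:00:00 condor_negotiat
 1066 ?  00:00:00 condor_schedd
 1067 ?  00:00:00 condor_startd
 1068 ?  00:00:00 condor_procd'

missing=""
for d in condor_master condor_collecto condor_negotiat \
         condor_schedd condor_startd condor_procd; do
  case "$ps_out" in
    *"$d"*) ;;                    # daemon present
    *) missing="$missing $d" ;;   # daemon absent
  esac
done
echo "missing:${missing:- none}"   # -> missing: none
```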
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Set up Condor to start automatically at boot<br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add the ''condor.boot'' service to all runlevels<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
#include <unistd.h><br />
#include <stdio.h><br />
int main( int argc, char** argv )<br />
{<br />
printf( "%s\n", argv[1] );<br />
fflush( stdout );<br />
sleep( 30 );<br />
return 0; <br />
}<br />
<br />
This program prints the command-line argument it is given, waits 30 seconds, then exits.<br />
<br />
Save this file as foo.c, then compile it (note the static linking):<br />
<br />
gcc foo.c -o foo --static<br />
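Since a dynamically linked ''foo'' may fail on execute nodes that lack its libraries, it's worth confirming the link mode. A hedged sketch: GNU ''ldd'' prints ''not a dynamic executable'' for static binaries (an assumption worth checking on your platform), and this helper of our own keys on that string:

```shell
# Sketch: classify a binary's link mode from its `ldd` output.
# GNU ldd prints "not a dynamic executable" for static binaries
# (an assumption worth checking on your platform).
classify_linkage() {
  if grep -q 'not a dynamic executable'; then
    echo "statically-linked"
  else
    echo "dynamically-linked"
  fi
}

# Usage on a real binary: ldd ./foo 2>&1 | classify_linkage
echo 'not a dynamic executable' | classify_linkage   # -> statically-linked
```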
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once so you can see how Condor will execute the job on multiple execute resources), you can <br />
change the condorjob file to be like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
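Beyond a handful of jobs, writing the per-job stanzas by hand gets tedious. A sketch of a generator (the ''gen_submit'' helper is our own convention, not a Condor feature):

```shell
# Sketch: emit a submit description with N queued jobs, each with
# its own log/err/out files and argument (helper name is ours).
gen_submit() {
  n=$1
  printf 'universe = vanilla\nexecutable = foo\n'
  printf 'should_transfer_files = YES\nwhen_to_transfer_output = ON_EXIT\n'
  i=1
  while [ "$i" -le "$n" ]; do
    printf 'log = condorjob%d.log\n'      "$i"
    printf 'error = condorjob%d.err\n'    "$i"
    printf 'output = condorjob%d.out\n'   "$i"
    printf 'arguments = "helloworld%d"\n' "$i"
    printf 'Queue\n'
    i=$((i + 1))
  done
}

gen_submit 2   # redirect to a file, then pass it to condor_submit
```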
<br />
We had a case with 6 slots: 2 were 32-bit with Arch=INTEL and 4 were 64-bit with Arch=X86_64 (though we were unaware of the difference at first). We ran 6 jobs and wondered why they would only execute on the submitting machine. So we changed the condorjob file to request a specific architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told Condor to execute only on machines with the INTEL architecture, so the job was not attempted on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. I've included this story in case it helps with debugging.<br />
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines on the Condor grid, because we do not have shared Unix users on our network. In our case this user was ''condor''.<br />
*When building BatchMake, you need to build with grid support enabled.<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows the jobs you have submitted to Condor that are currently in the queue. It gives a cluster ID and process ID for each job.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
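When many jobs are stuck in the Held state, you can pull their IDs out of the queue listing and feed them to ''condor_rm''. A hedged sketch, assuming the classic ''condor_q'' table layout (in these rows the status letter lands in the sixth whitespace-separated field because the SUBMITTED date and time split into two tokens), demonstrated on canned output:

```shell
# Sketch: list the IDs of Held jobs (status letter "H").
# Canned text stands in for `condor_q` output; in these rows the
# status is the 6th field because "6/24 10:01" splits into two.
q_out=' ID      OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
  12.0   alice   6/24 10:01   0+00:00:00 H  0   9.8  foo
  12.1   alice   6/24 10:01   0+00:12:41 R  0   9.8  foo
  13.0   bob     6/24 10:15   0+00:00:00 H  0   1.2  bar'

held=$(printf '%s\n' "$q_out" | awk '$6 == "H" { print $1 }')
echo "$held"                     # one Held job ID per line
# Each could then be removed with: condor_rm <ID>
```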
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
Every machine, no matter its configuration, runs a condor_master daemon. This process starts and stops all other Condor daemons. condor_master writes a MasterLog and maintains a .master_address file. Be sure that .master_address contains the correct IP address. If it doesn't, set the correct value in the parameter <br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has, but Condor may have picked a different IP than you wanted, such as the local loopback address 127.0.0.1 or a wireless adapter instead of the ethernet one.<br />
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes and manages data about submitted jobs. It tracks the queue of jobs and tries to obtain resources so that all of its jobs can run. When a submitted job runs, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine whenever one of its jobs is actually executing. It takes care of system calls that need to be executed on the submitting machine for a process. There is one condor_shadow process for each executing process from a submission machine, so on a machine with a large number of submitted processes, the memory and other resources needed to support the shadow daemons can become a limitation.<br />
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and execute nodes to match each job with an execution resource. Log files of interest include NegotiatorLog and MatchLog.<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon detects user activity on an execute node, so Condor knows whether to allow a job to run or to disallow it because a human user is currently engaged in some task.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, corresponding to [http://en.wikipedia.org/wiki/IA-64 IA64], the 64-bit Intel Itanium processor. This architecture does not cover the common 64-bit Intel x86 processors.<br />
<br />
While trying to run ''condor_master'', the shell returned the error message ''cannot execute binary file''.<br />
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], it's possible to inspect the header of an executable and determine whether it can run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
 Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing these outputs, it's clear that the '''Intel IA-64''' architecture isn't the right one.<br />
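This comparison can be automated by extracting the Machine field from the ''readelf -h'' output of the Condor binary and of a known-native program such as /bin/ls. A sketch (the ''machine_of'' helper is our own) shown on canned header text:

```shell
# Sketch: extract the "Machine:" field from `readelf -h` output
# and compare the Condor binary against a known-native program.
# Canned header lines stand in for live readelf output.
machine_of() {
  sed -n 's/^[[:space:]]*Machine:[[:space:]]*//p'
}

condor_hdr='  Class:   ELF64
  Machine: Intel IA-64
  Version: 0x1'
native_hdr='  Class:   ELF64
  Machine: Advanced Micro Devices X86-64
  Version: 0x1'

a=$(printf '%s\n' "$condor_hdr" | machine_of)
b=$(printf '%s\n' "$native_hdr" | machine_of)
if [ "$a" = "$b" ]; then
  echo "architectures match: $a"
else
  echo "mismatch: package is '$a' but this host runs '$b'"
fi
```

On the live system you would feed it real output, e.g. ''readelf -h sbin/condor_master | machine_of''.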
<br />
<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the location of the MPICH2\bin directory to your system path, and add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication password for smpd; this is the '''mypassphrase''' you entered above. If everything is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processes since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines, how many processes each machine should run, and the executable '''hostname''', which should print the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
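Since the ''-hosts'' argument list is easy to get wrong by hand, here is a sketch that builds it from ''ip cores'' lines (the machinefile format and the ''build_hosts_args'' helper are our own ad-hoc conventions, not MPICH2 features):

```shell
# Sketch: build the `mpiexec -hosts` argument string from lines
# of the form "<ip> <cores>" (an ad-hoc format of our own).
build_hosts_args() {
  nhosts=0
  args=""
  while read -r ip cores; do
    [ -n "$ip" ] || continue      # skip blank lines
    args="$args $ip $cores"
    nhosts=$((nhosts + 1))
  done
  echo "-hosts $nhosts$args"
}

hosts_args=$(printf '192.168.1.10 4\n192.168.1.11 2\n' | build_hosts_args)
echo "mpiexec $hosts_args hostname"
# -> mpiexec -hosts 2 192.168.1.10 4 192.168.1.11 2 hostname
```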
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, right-click on the project file (not the solution file), click Properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then, to run your exe on two different machines (I had to build a Release build for this; at first I had built a Debug build, which caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 full_path_to_exe_on_both_machines</div>Michael.grauerhttps://public.kitware.com/Wiki/index.php?title=Proposals:Condor&diff=39197Proposals:Condor2011-04-20T22:09:25Z<p>Michael.grauer: /* Creating an MPI program on Windows */</p>
<hr />
<div>= Introduction =<br />
<br />
Condor is an open source distributed computing software framework. It can be used to manage workload on a dedicated cluster of computers, and/or to farm out work to idle desktop computers. Condor is a cross-platform system that can be run on Unix and Windows operating system. Condor is a complex and flexible system that can execute jobs in serial and parallel mode. For parallel jobs, it supports the MPI standard. This Wikipage is dedicated to document our working experience using Condor.<br />
<br />
= Downloading Condor =<br />
<br />
Different versions of condor can be downloaded from [http://www.cs.wisc.edu/condor/downloads-v2/download.pl here]. This documentation focuses on our experience installing/configuring Condor Version 7.2.0. The official detail documentation for this version can be found [http://www.cs.wisc.edu/condor/manual/v7.2/ here]<br />
<br />
= Preparation =<br />
<br />
As Condor is a flexible system, there are different ways of configuring condor in your computing infrastructure. Hence, before starting installation, make the following important decisions.<br />
<br />
#What machine will be the central manager?<br />
#What machines should be allowed to submit jobs?<br />
#Will Condor run as root or not?<br />
#Do I have enough disk space for Condor?<br />
#Do I need MPI configured?<br />
<br />
Condor can be installed as either a ''manager'' node, a ''execute'' or a ''submit'' node. Or any combination of these ones. See [http://www.cs.wisc.edu/condor/manual/v7.2/3_1Introduction.html#SECTION00411000000000000000 The Different Roles a Machine Can Play]<br />
<br />
* Manager: There can be only one central manager for your pool. The machine is the collector of information, and the negotiator between resources and resource request<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
''' The official instructions on how to install Condor in Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here] '''. Below we present some of tweaks we had to do to get it to work on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.) , if it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
* Make sure the server has a hostname and a domainname.<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a similar command to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or with a user having equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (For example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation scripts ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update PATH variable to include the directory ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save file and apply the change by running <br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can know logout / login or even restart he machine, and you should be able check that CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit condor manager config_file and update the line as referenced below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, in order to allow Midas to run condor command, create a link to /root/condor/etc/condor_config into /home/condor<br />
cd /home/condor<br />
ln -s /home/condor/etc/condor_config condor_config<br />
<br />
=== Configuring a Executer/Submitter in Unix ===<br />
The different files allowing the server to be also used as a condor submitter/executer have been automatically updated while running the installation script ''condor_install''. Nevertheless, you still need to update its configuration file.<br />
<br />
* Edit condor node config_file.local and update the line as referenced below:<br />
vi /home/condor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the line having ''UID_DOMAIN'' and '' FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor in Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor in Windows 7.<br />
<br />
# Download the Windows install MSI, run it, installing to "C:/condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager ( full address ).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For accounting domain enter your domain (e.g. yourdomaininternal.com)<br />
# For Email settings (I ignored this by clicking next)<br />
# for Java settings (I ignored this as we weren't using Java, by clicking next)<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked if you want a custom install or install, choose install. This will install condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When you will be running condor commands, start a cygwin or cmd prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix up some problems. My '''condor_status''' at first gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect: since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or simply an IP that you do not want.<br />
<br />
<br />
I shut down condor, right-clicked on C:/condor in Windows Explorer, turned off "read only" and set permissions to allow writing. Then I edited the file "c:/condor/condor_config.local" (which started out empty) so that it could pick up some replacement values, since some of them didn't seem to be set properly during the install. These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following:<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourdomaininternal.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
and then restarted condor. If you run "condor_status" and see that the Windows machine has Activity=OWNER rather than UNCLAIMED, be sure that you have added START=True. This may not be the best configuration for a Windows workstation that is in use, however; some additional configuration is probably needed to make sure a Condor job doesn't use the machine while a physical human user is there using it.<br />
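A minimal sketch of such a policy, set in condor_config.local (the attribute names are standard Condor ClassAd attributes, but the 15-minute threshold and load limit are arbitrary choices we have not tested):<br />
<br />
 # Only start jobs after 15 minutes of keyboard idle time and low non-Condor load<br />
 START = KeyboardIdle > (15 * 60) && (LoadAvg - CondorLoadAvg) <= 0.3<br />
 # Suspend a running job as soon as the user comes back<br />
 SUSPEND = KeyboardIdle < 60<br />
 CONTINUE = KeyboardIdle > (15 * 60)<br />
<br />
Note that on Windows the KBDD daemon must be in DAEMON_LIST for KeyboardIdle to be meaningful.<br />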
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here is the batch file contents for the actual job (printname.bat)<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file I ran with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, get a command prompt or Cygwin terminal, right click the icon to start it up, and click "run with elevated privileges" or "run as administrator".<br />
<br />
On Windows, condor_master runs as a service, and it controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows for the first time, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* run the condor manager <br />
condor_master<br />
<br />
* Assuming that during installation you set the type to ''manager,execute,submit'' (the default), run the following command<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after starting condor, you may also see the following line<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Set up condor to start automatically <br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update MASTER parameter in ''condor.boot'' to match your current setup<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add the ''condor.boot'' service to all runlevels<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
 #include <unistd.h><br />
 #include <stdio.h><br />
 int main( int argc, char** argv )<br />
 {<br />
     if( argc < 2 )  /* guard against a missing argument */<br />
     {<br />
         fprintf( stderr, "usage: %s <message>\n", argv[0] );<br />
         return 1;<br />
     }<br />
     printf( "%s\n", argv[1] );  /* echo the argument */<br />
     fflush( stdout );<br />
     sleep( 30 );  /* stay alive long enough to observe the job */<br />
     return 0;<br />
 }<br />
<br />
This program echoes the command line argument it is given, waits 30 seconds, then exits.<br />
<br />
Save this file as foo.c, then compile it with the following command (note the static linking):<br />
<br />
gcc foo.c -o foo --static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once so you can see how Condor will execute the job on multiple execute resources), you can <br />
change the condorjob file to be like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
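Instead of repeating the file names for each job, you can let Condor number them with the $(Process) macro, which expands to 0, 1, ... for each queued job. A sketch of an equivalent two-job submission:<br />
<br />
 universe = vanilla<br />
 executable = foo<br />
 should_transfer_files = YES<br />
 when_to_transfer_output = ON_EXIT<br />
 log = condorjob$(Process).log<br />
 error = condorjob$(Process).err<br />
 output = condorjob$(Process).out<br />
 arguments = "helloworld$(Process)"<br />
 Queue 2<br />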
<br />
We had a case with 6 slots: 2 were 32-bit with Arch=INTEL, and 4 were 64-bit with Arch=X86_64 (though we were unaware of the difference at first). We ran 6 jobs and then wondered why they would only execute on the submitting machine. So we changed the condorjob file to specify a certain architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told condor to execute only on machines with an Intel architecture, so Condor did not attempt to execute it on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. I've included this story in case it helps with debugging.<br />
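To avoid this class of problem, a job can be constrained to both the architecture and the operating system its executable was built for, for example (adjust the values to your pool):<br />
<br />
 Requirements = (Arch == "X86_64") && (OpSys == "LINUX")<br />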
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was Condor.<br />
*When building BatchMake, you need to build with grid support turned on<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows you which of the processes you have submitted to condor are active. It will give you a cluster ID and process ID for each process.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
<br />
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process will start and stop all other condor daemons. condor_master will write a MasterLog and have a .master_address file. Be sure that the .master_address contains the correct IP address. If the correct IP address doesn't appear, set the correct value in the parameter <br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has; condor may have picked a different IP than you wanted, perhaps the local loopback address 127.0.0.1 or a wireless adapter instead of the ethernet one.<br />
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes, and maintains data about submitted jobs. It tracks the queueing of jobs and tries to get resources for all of its jobs to be run. When a submitted job is run, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine when an actual execution of the job is run. It takes care of system calls that need to be executed on the submitting machine for a process. There is one condor_shadow process for each executing process of a submission machine, so on a machine with a large number of submitted processes, the memory or other resources needed to support the shadow daemons could become a limitation.<br />
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution. Log files of interest include NegotiatorLog and MatchLog<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon is used to detect user activity on an execute node, so Condor knows whether to allow execution of a job or to disallow it because a human user is currently engaged in some task.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, corresponding to [http://en.wikipedia.org/wiki/IA-64 IA64], i.e. the 64-bit Intel Itanium processors. This does not cover all 64-bit Intel processors. <br />
<br />
While trying to run ''condor_master'', the shell returned the following error message: ''cannot execute binary file''<br />
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], it's possible to inspect the header of an executable and determine whether it can run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the different outputs, one can see that the '''Intel IA-64''' architecture isn't the right one.<br />
<br />
<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the location of the MPICH2\bin directory to your system path, and add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by running<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication password for smpd; this is the '''mypassphrase''' you entered above. If all is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processors since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines, and how many processes each machine should run, and the executable '''hostname''', which should return the two different hostnames of the two machines.<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
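Typing the host list on every run gets tedious. MPICH2's mpiexec also accepts a machine file; the file name '''hosts.txt''' and the addresses below are only an illustration, with each line listing a host and, after a colon, how many processes to place there:<br />
<br />
 192.168.0.10:4<br />
 192.168.0.11:2<br />
<br />
and then:<br />
<br />
 mpiexec -machinefile hosts.txt -n 6 hostname<br />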
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, you must right click on the project file (not the solution file), then click properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe<br />
<br />
Then to run your exe on two different machines (I had to build a RELEASE configuration for this; at first I had built a DEBUG build, but this caused problems on the second machine):<br />
*create the same directory on both machines (e.g. '''C:\mpich2work''')<br />
*copy the exe to this directory on both machines<br />
*execute the exe on one machine, specifying both IPs and how many processes you want to run on each machine (I run as many processes as cores):<br />
mpiexec -hosts 2 IP_1 #_cores_IP_1 IP_2 #_cores_IP_2 full_path_to_exe_on_both_machines<br />
<br />
== Running your MPI program on two different Windows machines ==</div>
<hr />
<div>= Introduction =<br />
<br />
Condor is an open source distributed computing software framework. It can be used to manage workload on a dedicated cluster of computers, and/or to farm out work to idle desktop computers. Condor is a cross-platform system that can be run on Unix and Windows operating system. Condor is a complex and flexible system that can execute jobs in serial and parallel mode. For parallel jobs, it supports the MPI standard. This Wikipage is dedicated to document our working experience using Condor.<br />
<br />
= Downloading Condor =<br />
<br />
Different versions of condor can be downloaded from [http://www.cs.wisc.edu/condor/downloads-v2/download.pl here]. This documentation focuses on our experience installing/configuring Condor Version 7.2.0. The official detail documentation for this version can be found [http://www.cs.wisc.edu/condor/manual/v7.2/ here]<br />
<br />
= Preparation =<br />
<br />
As Condor is a flexible system, there are different ways of configuring condor in your computing infrastructure. Hence, before starting installation, make the following important decisions.<br />
<br />
#What machine will be the central manager?<br />
#What machines should be allowed to submit jobs?<br />
#Will Condor run as root or not?<br />
#Do I have enough disk space for Condor?<br />
#Do I need MPI configured?<br />
<br />
Condor can be installed as either a ''manager'' node, a ''execute'' or a ''submit'' node. Or any combination of these ones. See [http://www.cs.wisc.edu/condor/manual/v7.2/3_1Introduction.html#SECTION00411000000000000000 The Different Roles a Machine Can Play]<br />
<br />
* Manager: There can be only one central manager for your pool. The machine is the collector of information, and the negotiator between resources and resource request<br />
<br />
* Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.<br />
<br />
* Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.<br />
<br />
For more information regarding other required preparatory work, refer the [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00422000000000000000 documentation]<br />
<br />
= Installation =<br />
== Unix ==<br />
<br />
''' The official instructions on how to install Condor in Unix can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00423000000000000000 here] '''. Below we present some of tweaks we had to do to get it to work on our Unix machines.<br />
<br />
==== Prerequisites ====<br />
<br />
*Be sure the server has a hostname and a domain name<br />
<br />
hostname<br />
<br />
should return ''mymachine.mydomain.com'' (or .org, .edu, etc.) , if it only returns ''mymachine'', then your server does not have a fully qualified domain name.<br />
<br />
To set the domain name, edit ''/etc/hosts'' and add your domain name to the first line. You might see something like<br />
<br />
''10.171.1.124 mymachine''<br />
<br />
change this to<br />
<br />
''10.171.1.124 mymachine.mydomain.com''<br />
<br />
Also edit ''/etc/hostname'' to be<br />
<br />
''mymachine.mydomain.com''<br />
<br />
Then reboot so that the hostname changes take effect.<br />
<br />
* Make sure the following packages are installed:<br />
apt-get install mailutils<br />
<br />
* Make sure the server has a hostname and a domainname.<br />
<br />
* Download the package ''condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz'' ( Platform RHEL 5 Intel x86/64 ) See http://www.cs.wisc.edu/condor/downloads-v2/download.pl<br />
For example, you could run a similar command to download the desired package:<br />
wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
<br />
* You should install Condor as '''root''' or with a user having equivalent privileges<br />
<br />
=== Configuring a Condor Manager in Unix === <br />
<br />
* Make sure the condor archive is in your home directory (For example /home/kitware), then untar it.<br />
cd ~<br />
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz<br />
cd ./condor-7.2.X<br />
<br />
* If not yet done, create a ''condor'' user<br />
adduser condor<br />
<br />
* Run the installation scripts ''condor_install''<br />
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor<br />
<br />
After running the installation script, you should get the following output:<br />
Installing Condor from /root/condor-7.2.X to /root/condor<br />
<br />
Condor has been installed into:<br />
/root/condor<br />
<br />
Configured condor using these configuration files:<br />
global: /root/condor/etc/condor_config<br />
local: /home/condor/localcondor/condor_config.local<br />
Created scripts which can be sourced by users to setup their<br />
Condor environment variables. These are:<br />
sh: /root/condor/condor.sh<br />
csh: /root/condor/condor.csh<br />
<br />
* Switch to the directory where condor is now installed<br />
cd /root/condor<br />
<br />
* Edit ''/etc/environment'' and update PATH variable to include the directory ''/root/condor/bin'' and ''/root/condor/sbin''<br />
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"<br />
<br />
* Add the following line<br />
CONDOR_CONFIG="/root/condor/etc/condor_config"<br />
<br />
* Save file and apply the change by running <br />
source /etc/environment<br />
<br />
* Make sure CONDOR_CONFIG and PATH are set correctly<br />
root@rigel:~$ echo $PATH<br />
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games<br />
<br />
root@rigel:~$ echo $CONDOR_CONFIG<br />
/root/condor/etc/condor_config<br />
<br />
* You can know logout / login or even restart he machine, and you should be able check that CONDOR_CONFIG and PATH environment variables are still set.<br />
<br />
* Edit condor manager config_file and update the line as referenced below:<br />
cd ~/condor<br />
vi ./etc/condor_config<br />
<br />
RELEASE_DIR = /root/condor<br />
LOCAL_DIR = /home/condor/localcondor<br />
CONDOR_ADMIN = email@website.com<br />
UID_DOMAIN = website.com<br />
FILESYSTEM_DOMAIN = website.com<br />
HOSTALLOW_READ = *.website.com<br />
HOSTALLOW_WRITE = *.website.com<br />
HOSTALLOW_CONFIG = $(CONDOR_HOST)<br />
<br />
* If you have MIDAS integration, in order to allow Midas to run condor command, create a link to /root/condor/etc/condor_config into /home/condor<br />
cd /home/condor<br />
ln -s /home/condor/etc/condor_config condor_config<br />
<br />
=== Configuring a Executer/Submitter in Unix ===<br />
The different files allowing the server to be also used as a condor submitter/executer have been automatically updated while running the installation script ''condor_install''. Nevertheless, you still need to update its configuration file.<br />
<br />
* Edit condor node config_file.local and update the line as referenced below:<br />
vi /home/condor/condor_config.local<br />
<br />
CONDOR_ADMIN = email@website.com<br />
<br />
If the installation went well, the line having ''UID_DOMAIN'' and '' FILESYSTEM_DOMAIN'' should already be set to ''website.com''<br />
<br />
== Windows ==<br />
'''The official documentation on how to install Condor in Windows can be found [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html#SECTION00425000000000000000 here]'''. Below we describe our experience installing Condor in Windows 7.<br />
<br />
# Download the Windows install MSI, run it, installing to "C:/condor".<br />
# Accept the license agreement. <br />
# Decide if you are installing a central controller or a submit/execute node<br />
##If installing a Central Controller, then select "create a new central pool" and set the name of the pool<br />
##Otherwise select "Join an existing Pool" and enter the hostname of the central manager ( full address ).<br />
# Decide whether the machine should be a submitter node, and select the appropriate option.<br />
# Decide when Condor should run jobs, if the machine will be an executor.<br />
##Decide what happens to jobs when the machine stops being idle.<br />
# For accounting domain enter your domain (e.g. yourdomaininternal.com)<br />
# For Email settings (I ignored this by clicking next)<br />
# for Java settings (I ignored this as we weren't using Java, by clicking next)<br />
<br />
Set the following settings when prompted<br />
<br />
Host Permission Settings:<br />
hosts with read: *<br />
hosts with write: *<br />
hosts with administrator access $(FULL_HOSTNAME)<br />
enable vm universe: no<br />
enable hdfs support: no<br />
<br />
When asked if you want a custom install or install, choose install. This will install condor to C:\condor.<br />
<br />
<br />
The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.<br />
<br />
<br />
When you will be running condor commands, start a cygwin or cmd prompt with elevated/administrator privileges.<br />
<br />
<br />
After the install, I could see the condor system by running <br />
<br />
condor_status<br />
<br />
and this helped me fix up some problems. My '''condor_status''' at first gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was '''.master_address'''. The IP address listed there was incorrect (since my machine has multiple IP addresses, I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or perhaps just an IP that you would not want).<br />
<br />
<br />
I shut down condor, right clicked on C:/condor in a Windows explorer, turned off "read only" and set permissions to allow for writing. Then I edited the file "c:/condor/condor_config.local" which started out empty, so that it could pick up some replacement values (some of them didn't seem to be set properly during the install). These values are:<br />
<br />
NETWORK_INTERFACE = <IP address><br />
UID_DOMAIN = *.yourdomaininternal.com<br />
FILESYSTEM_DOMAIN = *.yourdomaininternal.com<br />
COLLECTOR_NAME = PoolName<br />
ALLOW_READ = *<br />
ALLOW_WRITE = *<br />
# Choose one of the following:<br />
#<br />
# For a submit/execute node:<br />
DAEMON_LIST = MASTER, SCHEDD, STARTD<br />
# For a central collector host and submit/execute node:<br />
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD<br />
TRUST_UID_DOMAIN = True<br />
START = True<br />
<br />
You may want to add DEFAULT_DOMAIN_NAME = yourinternaldomain.com if your machine comes up without a domain name in condor.<br />
<br />
<br />
and then restarted condor. If you run '''condor_status''' and see that the Windows machine has Activity=Owner rather than Unclaimed, be sure that you have added START = True. This may not be the best configuration for a Windows workstation that is in active use, though: some additional configuration is needed to keep Condor jobs off the machine while a physical human user is there using it.<br />
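One common refinement is to make START an expression rather than a constant. As a rough sketch (KeyboardIdle and LoadAvg are standard Condor machine ClassAd attributes, but the thresholds here are assumptions to tune, not values from this setup):<br />
<br />
```text
# Hypothetical policy sketch: only start jobs after 15 minutes of console
# idle time and when the load average is low, keeping jobs off an
# actively used workstation.
START = KeyboardIdle > (15 * 60) && LoadAvg < 0.3
```
<br />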
<br />
<br />
At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.<br />
<br />
Here are the contents of the batch file for the actual job (printname.bat):<br />
<br />
@echo off<br />
echo Here is the output from "net USER" :<br />
net USER<br />
<br />
<br />
And here is the printname.sub condor submission file, which I submitted with<br />
<br />
condor_submit printname.sub<br />
<br />
<br />
universe = vanilla<br />
environment = path=c:\Windows\system32<br />
executable = printname.bat<br />
output = printname.out<br />
error = printname.err<br />
log = printname.log<br />
queue<br />
<br />
<br />
<br />
==== Useful Condor Commands on Windows ====<br />
<br />
To run these commands, open a command prompt or Cygwin terminal with elevated privileges: right-click its icon and choose "Run as administrator".<br />
<br />
On Windows, condor_master runs as a service, and it controls the other daemons.<br />
<br />
To stop condor <br />
<br />
net stop condor<br />
<br />
To start condor <br />
<br />
net start condor<br />
<br />
Before you can submit a job to Condor on Windows, you'll need to store your user's credentials (password). Run<br />
<br />
condor_store_cred add <br />
<br />
then enter your password.<br />
<br />
= Running Condor =<br />
<br />
'''The official user's manual on how to perform distributed computing in Condor is [http://www.cs.wisc.edu/condor/manual/v7.2/2_Users_Manual.html here]'''<br />
<br />
<br />
* Run the condor manager:<br />
condor_master<br />
<br />
* Assuming that during installation you set the machine type to ''manager,execute,submit'' (the default), run the following command:<br />
ps -e | egrep condor_<br />
<br />
* You should get something similar to:<br />
<br />
1063 ? 00:00:00 condor_master<br />
1064 ? 00:00:00 condor_collecto<br />
1065 ? 00:00:00 condor_negotiat<br />
1066 ? 00:00:00 condor_schedd<br />
1067 ? 00:00:00 condor_startd<br />
1068 ? 00:00:00 condor_procd<br />
<br />
* If you run the command ''ps -e | egrep condor_'' just after starting condor, you may also see the following line:<br />
1077 ? 00:00:00 condor_starter<br />
<br />
<br />
<br />
* Check the status<br />
kitware@rigel:~$ condor_status<br />
<br />
Name OpSys Arch State Activity LoadAv Mem ActvtyTime<br />
<br />
slot1@rigel LINUX X86_64 Unclaimed Idle 0.010 1006 0+00:10:04<br />
slot2@rigel LINUX X86_64 Unclaimed Idle 0.000 1006 0+00:10:05<br />
<br />
Total Owner Claimed Unclaimed Matched Preempting Backfill<br />
<br />
X86_64/LINUX 2 0 0 2 0 0 0<br />
<br />
Total 2 0 0 2 0 0 0<br />
<br />
* Set up condor to start automatically at boot:<br />
cp /root/condor/etc/example/condor.boot /etc/init.d/<br />
<br />
* Update the MASTER parameter in ''condor.boot'' to match your current setup:<br />
vi /etc/init.d/condor.boot<br />
<br />
MASTER=/root/condor/sbin/condor_master<br />
<br />
* Add the ''condor.boot'' service to all runlevels:<br />
kitware@rigel:~$ update-rc.d condor.boot defaults<br />
<br />
/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot<br />
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot<br />
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot<br />
<br />
== A simple example demonstrating the use of Condor ==<br />
<br />
#include <unistd.h><br />
#include <stdio.h><br />
/* Echo the first command-line argument, wait 30 seconds, then exit. */<br />
int main( int argc, char** argv )<br />
{<br />
if ( argc < 2 )<br />
{<br />
fprintf( stderr, "usage: %s message\n", argv[0] );<br />
return 1;<br />
}<br />
printf( "%s\n", argv[1] );<br />
fflush( stdout );<br />
sleep( 30 );<br />
return 0;<br />
}<br />
<br />
This program echoes the command-line argument it is given, waits 30 seconds, then exits.<br />
<br />
Save this file as foo.c, then compile it with the following command (note the static linking):<br />
<br />
gcc foo.c -o foo --static<br />
<br />
<br />
Create a condor job description, saving the file as condorjob:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob.log<br />
error = condorjob.err<br />
output = condorjob.out<br />
arguments = "helloworld"<br />
Queue<br />
<br />
then submit the job to condor:<br />
<br />
condor_submit condorjob<br />
<br />
After this job finishes, you should have three files in the submission directory:<br />
<br />
condorjob.err (contains the standard error, empty in this case)<br />
<br />
condorjob.out (should contain standard output, in this case "helloworld")<br />
<br />
condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)<br />
<br />
<br />
If you want to test this job on multiple slots (say 2 at once so you can see how Condor will execute the job on multiple execute resources), you can <br />
change the condorjob file to be like this:<br />
<br />
universe = vanilla<br />
executable = foo <br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
log = condorjob1.log<br />
error = condorjob1.err<br />
output = condorjob1.out<br />
arguments = "helloworld1"<br />
Queue<br />
<br />
log = condorjob2.log<br />
error = condorjob2.err<br />
output = condorjob2.out<br />
arguments = "helloworld2"<br />
Queue<br />
<br />
We had a case with 6 slots: 2 were 32-bit with Arch=INTEL and 4 were 64-bit with Arch=X86_64, though we were unaware of the difference at first. We ran 6 jobs and wondered why they would only execute on the submitting machine. So we changed the condorjob file to specify a certain architecture by including<br />
<br />
Requirements = Arch == "INTEL"<br />
<br />
and submitted this from the X86_64 machine. This told condor to execute only on machines with the INTEL architecture, so it did not attempt to execute on the X86_64 submitting machine. We then saw an error in condorjob1.log saying<br />
<br />
Exec format error<br />
<br />
and we realized we had tried to run an executable compiled for the wrong architecture. I've included this story in case it helps with debugging.<br />
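One way to head off this class of mismatch is to pin both the architecture and the operating system in the job's Requirements expression. A sketch (the values here are examples; adapt them to your pool):<br />
<br />
```text
# Only match 64-bit Linux execute slots; combine attributes as needed.
Requirements = (Arch == "X86_64") && (OpSys == "LINUX")
```
<br />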
<br />
= Additional Information =<br />
== Troubleshooting Condor ==<br />
<br />
Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.<br />
<br />
*Be sure that your executable is statically linked.<br />
*For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines in the Condor grid, because we do not have shared Unix users on our network. In our case this user was condor.<br />
*When building BatchMake, you need to build it with grid support enabled.<br />
<br />
Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems.<br />
Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.<br />
<br />
<br />
condor_status<br />
<br />
Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.<br />
<br />
condor_q<br />
<br />
Shows the jobs you have submitted to condor that are still active. It gives a cluster ID and a process ID for each job.<br />
<br />
condor_q -analyze CID.PID<br />
condor_q -better-analyze CID.PID<br />
<br />
When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.<br />
<br />
<br />
condor_config_val <CONDOR_VARIABLE><br />
<br />
Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.<br />
<br />
condor_rm CID.PID<br />
<br />
Will remove the job with cluster ID CID and process ID PID from your condor pool. Useful for killing Held or Idle jobs.<br />
For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.<br />
<br />
=== condor_master ===<br />
<br />
On any machine, no matter its configuration, there will be a condor_master daemon. This process will start and stop all other condor daemons. condor_master will write a MasterLog and have a .master_address file. Be sure that the .master_address contains the correct IP address. If the correct IP address doesn't appear, set the correct value in the parameter <br />
<br />
NETWORK_INTERFACE = <desired IP><br />
<br />
in your condor_config.local file. This must be an interface the machine actually has; condor may have picked a different IP than you wanted, perhaps the local loopback address 127.0.0.1, or a wireless adapter instead of the wired one.<br />
<br />
=== condor_startd ===<br />
<br />
This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.<br />
<br />
=== condor_starter ===<br />
<br />
This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.<br />
<br />
=== condor_schedd ===<br />
<br />
This daemon runs on submit nodes and holds data about submitted jobs. It tracks the queue of jobs and tries to obtain resources so that all of its jobs can run. When a submitted job runs, this daemon spawns a condor_shadow daemon.<br />
<br />
=== condor_shadow ===<br />
<br />
This daemon runs on the submission machine whenever a job is actually executing. It takes care of system calls that need to be executed on the submitting machine on behalf of a job. There is one condor_shadow process for each executing job of a submission machine, so on a machine with a large number of running jobs, the number of shadow daemons that memory and other resources can support may become a limitation.<br />
<br />
=== condor_collector ===<br />
<br />
This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.<br />
<br />
=== condor_negotiator ===<br />
<br />
This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and execute nodes to match each job with an execution slot. Log files of interest include NegotiatorLog and MatchLog.<br />
<br />
=== condor_kbdd ===<br />
<br />
This daemon is used to detect user activity on an execute node, so Condor can decide whether to allow a job to run or to disallow it because a human user is currently using the machine.<br />
<br />
== The right processor architecture ==<br />
The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz, where [http://en.wikipedia.org/wiki/IA-64 IA64] refers to the 64-bit Intel Itanium processor. This does not cover all 64-bit Intel processors.<br />
<br />
While trying to run ''condor_master'', the shell returned the error message ''cannot execute binary file''.<br />
<br />
Using the program [http://sourceware.org/binutils/docs/binutils/readelf.html#readelf readelf], you can inspect the header of an executable and determine whether it could run on a given platform.<br />
<br />
kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Intel IA-64'''<br />
Version: 0x1<br />
Entry point address: 0x40000000000bf3e0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 9382744 (bytes into file)<br />
Flags: 0x10, 64-bit<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 7<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 32<br />
Section header string table index: 31<br />
<br />
kitware@rigel:~$ readelf -h /bin/ls<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4023c0<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 104384 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 28<br />
Section header string table index: 27<br />
<br />
kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master<br />
ELF Header:<br />
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00<br />
Class: ELF64<br />
Data: 2's complement, little endian<br />
Version: 1 (current)<br />
OS/ABI: UNIX - System V<br />
ABI Version: 0<br />
Type: EXEC (Executable file)<br />
Machine: '''Advanced Micro Devices X86-64'''<br />
Version: 0x1<br />
Entry point address: 0x4b9450<br />
Start of program headers: 64 (bytes into file)<br />
Start of section headers: 4553256 (bytes into file)<br />
Flags: 0x0<br />
Size of this header: 64 (bytes)<br />
Size of program headers: 56 (bytes)<br />
Number of program headers: 8<br />
Size of section headers: 64 (bytes)<br />
Number of section headers: 31<br />
Section header string table index: 30<br />
<br />
Comparing the different outputs, you can see that the '''Intel IA-64''' architecture is not the right one for this X86-64 machine.<br />
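A quicker version of the same check is to print only the Machine line of the ELF header and compare it with the host architecture reported by uname (here /bin/ls stands in for whichever binary you want to check):<br />
<br />
```shell
# Show just the ELF Machine field and the host architecture; the two
# should agree for the binary to run natively.
readelf -h /bin/ls | grep 'Machine:'
uname -m
```
<br />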
<br />
<br />
<br />
= Links =<br />
* Detailed Condor documentation is also available on the website [http://www.cs.wisc.edu/condor/manual/v7.2/3_2Installation.html here]<br />
<br />
= MPICH2 on Windows =<br />
<br />
Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.<br />
<br />
This will have to be cleaned up to make more sense as we gain more experience.<br />
<br />
== MPICH2 Environment on Windows 7 ==<br />
<br />
First, install MPICH2; I installed version 1.3.2p1, the Windows 32-bit binary. Then add the location of the MPICH2\bin directory to your system's PATH environment variable, and add '''mpiexec.exe''' and '''smpd.exe''' to the list of exceptions in the Windows firewall.<br />
<br />
I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.<br />
<br />
I reset the smpd passphrase<br />
smpd -remove<br />
then set the same smpd passphrase on both machines<br />
smpd -install -phrase mypassphrase<br />
and could then check the status using<br />
smpd -status<br />
<br />
<br />
I had to register my user credentials on both machines by running<br />
mpiexec -register<br />
then accepted the user it suggested by hitting enter, then entered my user's password.<br />
<br />
You can check which user you are with<br />
mpiexec -whoami<br />
<br />
And now you should be able to validate with<br />
mpiexec -validate<br />
it will ask you for an authentication passphrase for smpd; this is the '''mypassphrase''' you entered above. If all is correct at this point, you will get a result of '''SUCCESS'''.<br />
<br />
To test this, you can run the '''cpi.exe''' example application in the '''MPICH2\examples''' directory (I ran it with 4 processes since my machine has 4 cores):<br />
mpiexec -n 4 full_path_to\cpi.exe<br />
<br />
You can also test that both machines are participating in a computation by specifying the IPs of the two machines and how many processes each should run, with the executable '''hostname''', which should print the two different hostnames:<br />
mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname<br />
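Looking ahead to running MPI jobs under Condor itself, the parallel universe is the usual route. A minimal sketch of a submit description, assuming the mp2script wrapper that Condor ships among its examples for MPICH2 (the file names and machine_count here are illustrative assumptions, not tested values):<br />
<br />
```text
universe = parallel
executable = mp2script
arguments = your_mpi_program.exe
machine_count = 2
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = your_mpi_program.exe
queue
```
<br />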
<br />
== Creating an MPI program on Windows ==<br />
<br />
I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.<br />
<br />
Here is a very simple working C++ example program:<br />
<br />
#include "stdafx.h"<br />
#include <iostream><br />
#include "mpi.h"<br />
using namespace std;<br />
//<br />
int main(int argc, char* argv[]) <br />
{<br />
// initialize the MPI world<br />
MPI::Init(argc,argv);<br />
//<br />
// get this process's rank<br />
int rank = MPI::COMM_WORLD.Get_rank();<br />
//<br />
// get the total number of processes in the computation<br />
int size = MPI::COMM_WORLD.Get_size();<br />
//<br />
// print out where this process ranks in the total<br />
std::cout << "I am " << rank << " out of " << size << std::endl;<br />
//<br />
// Finalize the MPI world<br />
MPI::Finalize();<br />
return 0;<br />
}<br />
<br />
To compile this in MSVSE08, you must right click on the project file (not the solution file), then click properties, and make the following additions:<br />
<br />
* Under C/C++ menu, Additional Include Directories property, add the full path to the '''MPICH2\include directory''' <br />
* Under Linker/General menu, Additional Library Directories, add the full path to the '''MPICH2\lib directory'''<br />
* Under Linker/Input menu, Additional Dependencies, add '''mpi.lib''' and '''cxx.lib'''<br />
<br />
You can test your application using:<br />
mpiexec -n 4 full_path_to\your.exe</div>Michael.grauer