Tuesday, August 26, 2014

Install PBS: the job management software

PBS has several variants: OpenPBS, PBSPro and Torque. Here we mainly cover the installation of Torque.
1. download torque from: http://www.adaptivecomputing.com/resources/downloads/torque/
2. tar zxvf torque-2.5.5.tar.gz
3. cd torque-2.5.5
4. ./configure --prefix=/opt/pbs
5. make
6. make install
7.  ./torque.setup liumh
This failed because the Torque binaries were not yet on PATH:
[root@rocks4 torque-2.5.5]# ./torque.setup liumh
initializing TORQUE (admin: liumh@)
./torque.setup:line 31: pbs_server: command not found
./torque.setup:line 33: qmgr: command not found
ERROR:cannot set TORQUE admins
./torque.setup:line 37: qterm: command not found
7.1  Fix: add the Torque directories to PATH and MANPATH:
[root@rocks4 torque-2.5.5]# PATH=$PATH:/opt/pbs/bin:/opt/pbs/sbin
[root@rocks4 torque-2.5.5]# export PATH
[root@rocks4 torque-2.5.5]# MANPATH=$MANPATH:/opt/pbs/man
[root@rocks4 torque-2.5.5]# export MANPATH
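The PATH change above only lasts for the current shell. As a minimal sketch (the helper name path_append is made up for this example, it is not part of Torque), the same thing can be done so that running it twice does not add duplicate entries:

```shell
# path_append DIR: add DIR to PATH unless it is already listed.
# (path_append is a hypothetical helper name, not part of Torque.)
path_append() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, do nothing
        *) PATH="$PATH:$1" ;;
    esac
}

path_append /opt/pbs/bin
path_append /opt/pbs/sbin
export PATH
```

Placing these lines in a shell profile instead of typing them is one option; which file to use depends on the system and is left open here.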
7.2 [root@rocks4 torque-2.5.5]# ./torque.setup liumh
This failed again, this time because the hostname rocks4 could not be resolved:
initializing TORQUE (admin: liumh@)
PBS_Server: LOG_ERROR::pbsd_main, unable to determine local server hostname - gethostbyname(rocks4)
failed, h_errno=1
Cannot resolve default server host 'rocks4' - check server_name file.
qmgr: cannot connect to server (errno=15010) Access from host not allowed, or unknown host
ERROR: cannot set TORQUE admins
Cannot resolve default server host 'rocks4' - check server_name file.
qterm: could not connect to server '' (15010) Access from host not allowed, or unknown host
7.3  Fix: make the hostname resolvable by adding it to /etc/hosts:
[root@rocks4 torque-2.5.5]# vi /etc/hosts
add the following line:
210.45.78.9 rocks4.lcg.ustc.edu.cn rocks4
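Instead of editing by hand, the entry can be appended only when it is missing. This is a rough sketch: add_host_entry is a hypothetical helper, and it is demonstrated on a scratch file rather than the live /etc/hosts.

```shell
# add_host_entry FILE IP FQDN ALIAS: append "IP FQDN ALIAS" to FILE
# unless some line already ends a field with ALIAS. (Hypothetical helper.)
add_host_entry() {
    file=$1 ip=$2 fqdn=$3 alias=$4
    if grep -Eq "[[:space:]]${alias}([[:space:]]|\$)" "$file" 2>/dev/null; then
        echo "$alias already present in $file"
    else
        printf '%s %s %s\n' "$ip" "$fqdn" "$alias" >> "$file"
    fi
}

# Demonstrate on a scratch file instead of the real /etc/hosts.
hosts_file=$(mktemp)
add_host_entry "$hosts_file" 210.45.78.9 rocks4.lcg.ustc.edu.cn rocks4
cat "$hosts_file"
```

Once the real /etc/hosts has the entry, `getent hosts rocks4` should print the address.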
7.4  [root@rocks4 torque-2.5.5]# ./torque.setup liumh
initializing TORQUE (admin: liumh@rocks4.lcg.ustc.edu.cn)
Max open servers: 4
Max open servers: 4
8. [root@rocks4 torque-2.5.5]# make packages
9. install packages: the master machine needs the server package, the compute nodes need the mom package, and any machine that submits PBS jobs needs the clients package
./torque-package-server-linux-x86_64.sh --install
./torque-package-mom-linux-x86_64.sh --install
./torque-package-clients-linux-x86_64.sh --install
Install the work nodes
10. copy torque-package-mom-linux-x86_64.sh and torque-package-clients-linux-x86_64.sh to all work nodes
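With more than a couple of nodes, the copy is easier as a loop. A sketch under assumptions: only bl-3-1.local is named in this post, so the node list has just that entry, and the root login and /tmp destination are my own choices. By default it only prints the scp commands.

```shell
# Node list is an assumption: only bl-3-1.local appears in this post.
NODES="bl-3-1.local"
PACKAGES="torque-package-mom-linux-x86_64.sh torque-package-clients-linux-x86_64.sh"

# copy_packages: print (DRY_RUN=1, the default) or run one scp per
# node and package. Destination /tmp/ and user root are assumptions.
copy_packages() {
    for node in $NODES; do
        for pkg in $PACKAGES; do
            if [ "${DRY_RUN:-1}" = 1 ]; then
                echo "scp $pkg root@$node:/tmp/"
            else
                scp "$pkg" "root@$node:/tmp/"
            fi
        done
    done
}

copy_packages
```

Setting DRY_RUN=0 before calling copy_packages performs the real copy.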
11. On each work node (bl-3-1.local is used as the example):
11.1  ./torque-package-clients-linux-x86_64.sh --install
11.2  ./torque-package-mom-linux-x86_64.sh --install
11.3  libtool --finish /opt/pbs/lib
11.4  edit /etc/rc.local, add the following lines:
PATH=$PATH:/opt/pbs/bin:/opt/pbs/sbin
export PATH
MANPATH=$MANPATH:/opt/pbs/man
export MANPATH
11.5  run the 4 commands from 11.4 (PATH...) in the current shell as well
11.6  edit /var/spool/torque/server_name, make sure it contains "rocks4"
11.7  edit /var/spool/torque/mom_priv/config (newly created), add the following lines:
$pbsserver rocks4
$logevent 255
11.8  run pbs_mom:
pbs_mom -c /var/spool/torque/mom_priv/config
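Writing the mom config by script keeps the nodes identical. A minimal sketch: write_mom_config is a hypothetical helper, and it uses the `$`-prefixed directive form ($pbsserver, $logevent) from the Torque documentation. The demo writes into a scratch directory, not /var/spool/torque/mom_priv.

```shell
# write_mom_config DIR SERVER: create DIR/config with the two settings
# from step 11.7. (Hypothetical helper; on a real node DIR would be
# /var/spool/torque/mom_priv.)
write_mom_config() {
    mkdir -p "$1" || return 1
    cat > "$1/config" <<EOF
\$pbsserver $2
\$logevent 255
EOF
}

# Demonstrate in a scratch directory.
mom_dir=$(mktemp -d)
write_mom_config "$mom_dir" rocks4
cat "$mom_dir/config"
```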
12. configure the server node: rocks4
12.1 edit /var/spool/torque/server_priv/nodes, add one line per work node:
bl-3-1.local np=8
...
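Each line of the nodes file is a hostname followed by np=<number of cores>. A small sketch of producing such a line (node_line is a hypothetical helper; when no count is given it falls back to the CPU count of the machine it runs on, which is only right if run on that node):

```shell
# node_line NAME [NP]: print one server_priv/nodes entry.
# NP defaults to the number of online CPUs on the machine running this.
node_line() {
    np=${2:-$(getconf _NPROCESSORS_ONLN)}
    printf '%s np=%s\n' "$1" "$np"
}

node_line bl-3-1.local 8    # prints: bl-3-1.local np=8
```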
12.2 edit /var/spool/torque/mom_priv/config (newly created), add the following lines:
$pbsserver rocks4
$logevent 255
12.3 start the daemons (qterm -t quick first stops any pbs_server that is already running, so it restarts with the new nodes file):
pbs_mom -c /var/spool/torque/mom_priv/config
qterm -t quick
pbs_server
pbs_sched
12.4 make the daemons from 12.3 start at boot by adding them to /etc/rc.local:
vi /etc/rc.local
pbs_mom -c /var/spool/torque/mom_priv/config
pbs_server
pbs_sched
13. Job output is copied back to the server over ssh (scp), so passwordless ssh to the server must be configured on all the work nodes:
13.1 ssh-keygen -P "" -t rsa
13.2 eval `ssh-agent`
13.3 ssh-add /root/.ssh/id_rsa
13.4 append the contents of /root/.ssh/id_rsa.pub to /home/liumh/.ssh/authorized_keys on rocks4
13.5 try command:
scp file liumh@rocks4.lcg.ustc.edu.cn:/tmp
if the file is copied without a password prompt, the setup works
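Step 13.4 can be scripted so the key is added only once and the permissions come out right. This is a sketch, not the only way (ssh-copy-id from OpenSSH does a similar job); install_pubkey is a hypothetical helper and the key below is a placeholder, demonstrated in a scratch directory.

```shell
# install_pubkey PUBKEY_FILE HOME_DIR: append the key to
# HOME_DIR/.ssh/authorized_keys if it is not already there, and set
# restrictive permissions on the directory and file. (Hypothetical helper.)
install_pubkey() {
    ssh_dir="$2/.ssh"
    mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir" || return 1
    touch "$ssh_dir/authorized_keys"
    chmod 600 "$ssh_dir/authorized_keys"
    # -x: whole line, -F: fixed string, -f: read the pattern from the key file
    grep -qxF -f "$1" "$ssh_dir/authorized_keys" ||
        cat "$1" >> "$ssh_dir/authorized_keys"
}

# Demonstrate with a scratch "home" and a placeholder key (not a real one).
demo_home=$(mktemp -d)
echo "ssh-rsa AAAAexample root@bl-3-1.local" > "$demo_home/id_rsa.pub"
install_pubkey "$demo_home/id_rsa.pub" "$demo_home"
install_pubkey "$demo_home/id_rsa.pub" "$demo_home"   # second call adds nothing
```

On the real system this would run on rocks4 with /home/liumh as the second argument, after copying id_rsa.pub over from the node.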
13.6 Attention: if bl-3-1.local cannot resolve rocks4.lcg.ustc.edu.cn, the copy in 13.5 fails and pbs_mom cannot return job output. Errors like these then appear on bl-3-1 in /var/log/messages:
Feb 6 16:22:32 bl-3-1 pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy file /var/spool/torque/spool/10.rocks4.lcg.ustc.edu.cn.OU to liumh@rocks4.lcg.ustc.edu.cn:/home/liumh/test/pbs/pbsjob.o10
Feb 6 16:22:36 bl-3-1 pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB /var/spool/torque/spool/10.rocks4.lcg.ustc.edu.cn.ER liumh@rocks4.lcg.ustc.edu.cn:/home/liumh/test/pbs/pbsjob.e10' failed with status=1, giving up after 4 attempts
The fix is to add the following line to /etc/hosts on bl-3-1.local:
10.1.1.11 rocks4.lcg.ustc.edu.cn rocks4
14.  Done
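To check the whole chain, a trivial test job can be submitted. A sketch under assumptions: the job name pbsjob is chosen to match the pbsjob.o10 / pbsjob.e10 files in the log above, and the resource request is a guess at something every node can satisfy. Submit it as a normal user such as liumh, since Torque typically refuses jobs from root.

```shell
# Write a trivial job script to a scratch file.
job_file=$(mktemp)
cat > "$job_file" <<'EOF'
#!/bin/sh
#PBS -N pbsjob
#PBS -l nodes=1:ppn=1
echo "running on $(hostname)"
EOF

# Submit only when the Torque client tools are on PATH.
if command -v qsub >/dev/null 2>&1; then
    qsub "$job_file"
    qstat
else
    echo "qsub not found; is /opt/pbs/bin on PATH?"
fi
```

If everything works, the files pbsjob.o<N> and pbsjob.e<N> show up in the directory the job was submitted from.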
Troubleshooting:
Q1. In step 13, after adding the contents of id_rsa.pub (from bl-3-1.local) to a user's authorized_keys, scp still asks for a password. Why?
A1. This happens when /home/user or /home/user/.ssh is writable by group or others. With its default StrictModes setting, sshd then ignores authorized_keys. Fix it with: chmod 755 /home/user/.ssh (700 is stricter and also works), and do the same for /home/user if needed.
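A quick way to spot the bad permissions from A1 is shown below. It is a sketch: loose_perms is a hypothetical helper, and it only tests for the group-write and other-write bits. The demo uses a scratch directory instead of a real home.

```shell
# loose_perms PATH: succeed if PATH is writable by group or others.
loose_perms() {
    [ -n "$(find "$1" -prune \( -perm -020 -o -perm -002 \) -print)" ]
}

# Demonstrate on a scratch directory instead of a real home.
d=$(mktemp -d)
chmod 777 "$d"
loose_perms "$d" && echo "too open: $d"
chmod 755 "$d"
loose_perms "$d" || echo "ok: $d"
```

On the server, the paths to check would be /home/user, /home/user/.ssh and /home/user/.ssh/authorized_keys.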
-- MinghuiLiu - 07-Feb-2012
