[Memo] Trying out computer clustering
There is a lot you simply cannot know without trying. Which kinds of programs does this help, and which does it not? In the end everyone runs different programs, so each person has to try it for themselves; that really is the best answer. This is the kind of topic I inevitably end up researching while thinking "ah, it's going to be a while before the prompt comes back."
For programs I write myself I am free to use MPI, multithreading, multiprocessing and so on, but ready-made programs are hard to change. Even so, I want ready-made programs to get faster too. If SSI turns out to be a practical solution for that, I will be very happy.
For now, just a pile of keywords: openMosix, ClusterKnoppix, OpenSSI, Kerrighed, Grid Engine, Rocks Clusters, SCore, MIKO GNYO/Linux, switching hub, SMP machine (a machine with multiple processors), Beowulf.
- みっし~の研究生活: Linux HPCクラスターの構築(その2)
- Amazon.com: High Performance Linux Clusters with OSCAR, Rocks, OpenMosix, and MPI (Nutshell Handbooks): Joseph Sloan: Books
- 負荷分散ソフトウエアGrid Engine
- rubyneko - 第10回 関西 Debian 勉強会 行ってきました
- PC Cluster Consortium
- OpenMosixによる計算クラスタの構築
- MIKO GNYO/Linux
- MIKO GNYO/Linux: 検索結果
- kuroyagi さんのノートブック
SSI (Single System Image) environments
From what I have found there are a handful of choices; the main ones are the three below, and several distributions derive from them. ClusterKnoppix, for example, is a Knoppix built on a kernel with openMosix patched in. The point that matters most to me is whether nodes can be added and removed while the cluster is running.
- openMosix (LinuxPMI)
- Kerrighed
- OpenSSI
- openmosix|Kerrighed|OpenSSI|LinuxPMI - Google 検索
- オープンソースのクラスター管理システム - SourceForge.JP Magazine
- Linux.com :: A survey of open source cluster management systems
- coLinuxとopenMosixで異機種混合のクラスターを構成する
- スラッシュドット・ジャパン | openMosixでHPCクラスタはいかが?
Kerrighed
For a start I will try Kerrighed, the most recently updated of the three above. The environment is Debian etch. First, build the kernel while following Installing Kerrighed 2.3.0 - Kerrighed. I have a feeling I pulled in a pile of packages I did not really need, though.
$ su -
# apt-get install xmlto
# apt-get install lsb
# apt-get install rsync
# apt-get install pkg-config
# apt-get install libtool
# apt-get install gcc
# apt-get install bzip2
# cd /usr/src/
# wget http://kerrighed.gforge.inria.fr/kerrighed-latest.tar.gz
# wget http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.20.tar.bz2
# ls
kerrighed-2.3.0.tar.gz  linux-2.6.20.tar.bz2
# tar zxf kerrighed-2.3.0.tar.gz
# tar jxf linux-2.6.20.tar.bz2
# ls
kerrighed-2.3.0  kerrighed-2.3.0.tar.gz  linux-2.6.20  linux-2.6.20.tar.bz2
# cd kerrighed-2.3.0
# ./configure --with-kernel=/usr/src/linux-2.6.20
# make patch
# make defconfig
# make kernel
# make
# make kernel-install
# make install
# ls -l /boot/vmlinuz-2.6.20-krg
-rw-r--r-- 1 root root 2488432 Jan 4 12:38 /boot/vmlinuz-2.6.20-krg
# ls -l /boot/System.map
lrwxrwxrwx 1 root root 21 Jan 4 12:38 /boot/System.map -> System.map-2.6.20-krg
# ls -l /lib/modules/2.6.20-krg
total 52
lrwxrwxrwx 1 root root   21 Jan 4 12:38 build -> /usr/src/linux-2.6.20
drwxr-xr-x 2 root root 4096 Jan 4 12:49 extra
drwxr-xr-x 2 root root 4096 Jan 4 12:38 kernel
-rw-r--r-- 1 root root   45 Jan 4 12:49 modules.alias
-rw-r--r-- 1 root root   69 Jan 4 12:49 modules.ccwmap
-rw-r--r-- 1 root root   44 Jan 4 12:49 modules.dep
-rw-r--r-- 1 root root   73 Jan 4 12:49 modules.ieee1394map
-rw-r--r-- 1 root root  141 Jan 4 12:49 modules.inputmap
-rw-r--r-- 1 root root   81 Jan 4 12:49 modules.isapnpmap
-rw-r--r-- 1 root root   74 Jan 4 12:49 modules.ofmap
-rw-r--r-- 1 root root   99 Jan 4 12:49 modules.pcimap
-rw-r--r-- 1 root root   43 Jan 4 12:49 modules.seriomap
-rw-r--r-- 1 root root 3217 Jan 4 12:49 modules.symbols
-rw-r--r-- 1 root root  189 Jan 4 12:49 modules.usbmap
lrwxrwxrwx 1 root root   21 Jan 4 12:38 source -> /usr/src/linux-2.6.20
# ls -l /etc/default/kerrighed
-rwxr-xr-x 1 root root 327 Jan 4 12:49 /etc/default/kerrighed
# ls -lR /usr/local/share/man*
/usr/local/share/man:
total 36
drwxr-sr-x 2 root staff 4096 Jan 4 12:49 man1
drwxr-sr-x 2 root staff 4096 Jan 4 12:49 man2
drwxr-sr-x 2 root staff 4096 Jan 4 12:49 man3
drwxr-sr-x 2 root staff 4096 Jan 4 12:49 man4
drwxr-sr-x 2 root staff 4096 Jan 4 12:49 man5
drwxr-sr-x 2 root staff 4096 Jan 4 12:49 man6
drwxr-sr-x 2 root staff 4096 Jan 4 12:49 man7
drwxr-sr-x 2 root staff 4096 Jan 4 12:49 man8
drwxr-sr-x 2 root staff 4096 Jan 4 12:49 man9
/usr/local/share/man/man1:
total 20
-rw-r--r-- 1 root staff  886 Jan 4 12:49 checkpoint.1
-rw-r--r-- 1 root staff 1314 Jan 4 12:49 krgadm.1
-rw-r--r-- 1 root staff 2334 Jan 4 12:49 krgcapset.1
-rw-r--r-- 1 root staff  813 Jan 4 12:49 migrate.1
-rw-r--r-- 1 root staff  894 Jan 4 12:49 restart.1
/usr/local/share/man/man2:
total 12
-rw-r--r-- 1 root staff 1322 Jan 4 12:49 krgcapset.2
-rw-r--r-- 1 root staff 1349 Jan 4 12:49 migrate.2
-rw-r--r-- 1 root staff 1248 Jan 4 12:49 migrate_self.2
/usr/local/share/man/man3:
total 0
/usr/local/share/man/man4:
total 0
/usr/local/share/man/man5:
total 4
-rw-r--r-- 1 root staff 1838 Jan 4 12:49 kerrighed_nodes.5
/usr/local/share/man/man6:
total 0
/usr/local/share/man/man7:
total 8
-rw-r--r-- 1 root staff 2055 Jan 4 12:49 kerrighed.7
-rw-r--r-- 1 root staff 2900 Jan 4 12:49 kerrighed_capabilities.7
/usr/local/share/man/man8:
total 0
/usr/local/share/man/man9:
total 0
node01:/usr/src/kerrighed-2.3.0# ls -l /usr/local/bin/krgadm
-rwxr-xr-x 1 root staff 21315 Jan 4 12:49 /usr/local/bin/krgadm
node01:/usr/src/kerrighed-2.3.0# ls -l /usr/local/bin/krgcapset
-rwxr-xr-x 1 root staff 21058 Jan 4 12:49 /usr/local/bin/krgcapset
node01:/usr/src/kerrighed-2.3.0# ls -l /usr/local/bin/migrate
-rwxr-xr-x 1 root staff 11358 Jan 4 12:49 /usr/local/bin/migrate
node01:/usr/src/kerrighed-2.3.0# ls -l /usr/local/lib/libkerrighed.*
-rw-r--r-- 1 root staff 36258 Jan 4 12:49 /usr/local/lib/libkerrighed.a
-rwxr-xr-x 1 root staff   843 Jan 4 12:49 /usr/local/lib/libkerrighed.la
lrwxrwxrwx 1 root staff    21 Jan 4 12:49 /usr/local/lib/libkerrighed.so -> libkerrighed.so.1.0.0
lrwxrwxrwx 1 root staff    21 Jan 4 12:49 /usr/local/lib/libkerrighed.so.1 -> libkerrighed.so.1.0.0
-rwxr-xr-x 1 root staff 28805 Jan 4 12:49 /usr/local/lib/libkerrighed.so.1.0.0
node01:/usr/src/kerrighed-2.3.0# ls -l /usr/local/include/kerrighed
total 56
-rw-r--r-- 1 root staff   810 Jan 4 12:49 capabilities.h
-rw-r--r-- 1 root staff   840 Jan 4 12:49 capability.h
-rw-r--r-- 1 root staff   601 Jan 4 12:49 checkpoint.h
-rw-r--r-- 1 root staff   197 Jan 4 12:49 comm.h
-rw-r--r-- 1 root staff  1054 Jan 4 12:49 hotplug.h
-rw-r--r-- 1 root staff   233 Jan 4 12:49 kerrighed.h
-rw-r--r-- 1 root staff 13742 Jan 4 12:49 kerrighed_tools.h
-rw-r--r-- 1 root staff  1163 Jan 4 12:49 krgnodemask.h
-rw-r--r-- 1 root staff  1459 Jan 4 12:49 proc.h
-rw-r--r-- 1 root staff   405 Jan 4 12:49 process_group_types.h
-rw-r--r-- 1 root staff  1494 Jan 4 12:49 types.h
# mkinitramfs -o /boot/initrd.img-2.6.20-krg 2.6.20-krg
# vi /boot/grub/menu.lst
default 3
title Debian GNU/Linux, kernel 2.6.20-krg
root (hd0,0)
kernel /boot/vmlinuz-2.6.20-krg root=/dev/hda1 ro session_id=1
initrd /boot/initrd.img-2.6.20-krg
savedefault
# ifconfig
# echo "session=1" >> /etc/kerrighed_nodes
# echo "nbmin=1" >> /etc/kerrighed_nodes
# echo "127.0.0.1:0:lo" >> /etc/kerrighed_nodes
# cat /etc/kerrighed_nodes
session=1
nbmin=1
127.0.0.1:0:lo
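The /etc/kerrighed_nodes entries above follow the format <IP>:<node_id>:<interface>. As a sketch of what a real multi-node file might look like, built the same way with echo (the IPs, interface name, and node count here are hypothetical, not from this setup, and it is written to a scratch path rather than /etc):

```shell
# Hypothetical two-node /etc/kerrighed_nodes, written to /tmp for illustration.
# Format per line: <IP address>:<node id>:<network interface>. session must match
# the session_id= kernel parameter; nbmin is how many nodes to wait for at boot.
f=/tmp/kerrighed_nodes.example
echo "session=1"            >  "$f"
echo "nbmin=2"              >> "$f"
echo "192.168.0.101:0:eth0" >> "$f"
echo "192.168.0.102:1:eth0" >> "$f"
cat "$f"
```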
That is as far as I have gotten; it is not running yet. Something is wrong, but I cannot tell what.
- ocs/Howto/Kerrighed - Mandriva Community Wiki
- kerrighed installation how to | In da Wok ......
- Installing Kerrighed 2.2.0 - Kerrighed
- Main Page - Kerrighed
- grub menu.lst default - Google 検索
- GNU_GRUB
- Grubでデュアルブート時のデフォルト(標準)起動OS設定
- session_id kerrighed menu.lst - Google 検索
- Tutorial: Kerrighed | Bioinformatics
- krg_DRBL - Grid Architecture - Trac
- Linux安裝入門與基本管理
Sun Grid Engine
Once machines pile up, you start wanting to put all of them to work somehow. A computation that takes a reasonable but long time, say a month, might be cut to around 20 days by adding idle machines. Ah, having no money is wonderful: it makes you resourceful.
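That guess is just proportional scaling. Assuming a perfectly parallel job on identical machines (an idealization, and the machine counts below are hypothetical, not from this setup), the arithmetic is:

```shell
# Ideal scaling: runtime shrinks in proportion to machine count.
days_old=30       # about one month on the current pool
machines_old=2    # hypothetical current machine count
machines_new=3    # after adding one idle machine
days_new=$(( days_old * machines_old / machines_new ))
echo "$days_new"  # 20 -- a month becomes about 20 days
```

Real programs have serial fractions and network overhead, so this is an upper bound on the gain, not a prediction.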
Since I am at it, I fetch the latest version, Sun Grid Engine 6.2. A Sun account is needed for this; older releases require no account. I download the Linux build.
- Sun Grid Engine の機能詳細
- Sun Grid Engine(SGE)利用法 | スーパーコンピュータ | ヒトゲノム解析センター
- gridengine: ホーム
- gridengine: Grid Engine HOWTOs
Setting up the master host
First, create the directory to install SGE into.
# mkdir -p /opt/sge62
Set the new directory in the $SGE_ROOT environment variable.
# export SGE_ROOT=/opt/sge62
Create the SGE administrator.
# useradd sgeadmin
Extract the downloaded file (here, the x86 architecture build).
# tar zxf ge62_lx24-x86.tar.gz
Installing Sun Grid Engine (take two)
Now that I mostly understand the procedure, install on the production environment. First, the download. It turns out 6.2u2 has been released, so I grab the Linux version of that. The download comes to the following 14 files.
$ ls
bytecount_cksum.list
LICENSE.txt
sdm10u2_core_rpm.zip
sdm10u2_core_targz.zip
sge62u2_1_linux24-i586_rpm.zip
sge62u2_1_linux24-i586_targz.zip
sge62u2_1_linux24-ia64_rpm.zip
sge62u2_1_linux24-ia64_targz.zip
sge62u2_1_linux24-x64_rpm.zip
sge62u2_1_linux24-x64_targz.zip
sge62u2_arco_rpm.zip
sge62u2_arco_targz.zip
THIRDPARTYLICENSEREADME.txt
webconsole3.0.2-linux.targz.zip
There is a very helpful installation manual, so I follow it. It is in English but easy to understand. It basically describes installing the software from the CD-ROM distribution, so I adapt where needed. The machine is x86 and I want the tar method, so I extract sge62u2_1_linux24-i586_targz.zip. That creates a directory sge6_2u2_1/ containing the common and architecture-dependent tarballs the manual refers to.
$ unzip sge62u2_1_linux24-i586_targz.zip
$ ls sge6_2u2_1/
sge-6_2u2_1-bin-linux24-i586.tar.gz  sge-6_2u2-common.tar.gz
$ pwd
/usr/src/
Now the installation can proceed along the manual. First create the sge-root directory (/opt/sge6-2), move into it, and extract the two *.tar.gz files from before into it.
$ su -
Password:
# mkdir -p /opt/sge6-2
# cd /opt/sge6-2
# tar zxf /usr/src/sge6_2u2_1/sge-6_2u2-common.tar.gz
# tar zxf /usr/src/sge6_2u2_1/sge-6_2u2_1-bin-linux24-i586.tar.gz
# ls
3rd_party  doc       include        install_qmaster  mpi   start_gui_installer
catman     dtrace    inst_sge       lib              pvm   util
ckpt       examples  install_execd  man              qmon
Next, set the SGE_ROOT environment variable and check it.
# export SGE_ROOT='/opt/sge6-2'
# printenv SGE_ROOT
/opt/sge6-2
Finally, run util/setfileperm.sh.
# util/setfileperm.sh $SGE_ROOT
From here on it is a matter of answering questions one by one. Answer yes to this one.
WARNING WARNING WARNING
-----------------------
We will set the the file ownership and permission to
UserID:       0
GroupID:      0
In directory: /opt/sge6-2
We will also install the following binaries as SUID-root:
$SGE_ROOT/utilbin/<arch>/rlogin
$SGE_ROOT/utilbin/<arch>/rsh
$SGE_ROOT/utilbin/<arch>/testsuidroot
$SGE_ROOT/bin/<arch>/sgepasswd
$SGE_ROOT/bin/<arch>/authuser
Do you want to set the file permissions (yes/no) [NO] >> yes
Output scrolls past after Enter; it is apparently setting permissions.
Verifying and setting file permissions and owner in >3rd_party<
Verifying and setting file permissions and owner in >bin<
Verifying and setting file permissions and owner in >ckpt<
Verifying and setting file permissions and owner in >dtrace<
Verifying and setting file permissions and owner in >examples<
Verifying and setting file permissions and owner in >inst_sge<
Verifying and setting file permissions and owner in >install_execd<
Verifying and setting file permissions and owner in >install_qmaster<
Verifying and setting file permissions and owner in >lib<
Verifying and setting file permissions and owner in >mpi<
Verifying and setting file permissions and owner in >pvm<
Verifying and setting file permissions and owner in >qmon<
Verifying and setting file permissions and owner in >util<
Verifying and setting file permissions and owner in >utilbin<
Verifying and setting file permissions and owner in >catman<
Verifying and setting file permissions and owner in >doc<
Verifying and setting file permissions and owner in >include<
Verifying and setting file permissions and owner in >man<
Your file permissions were set
Next question: GUI install or command-line install. I go with the command-line install. I read the note section of the manual; for a fresh install on Linux there seems to be no problem. The manual's to-do list has two items.
- Run the installation script on the master host and on all execution hosts.
- Register the administrative hosts and the hosts that will submit jobs to the queues.
Not entirely clear to me, but I press on. Before installing there is a document you are told to read if you want tighter security, so I look through it. What I learned:
- Messages between hosts are encrypted with the CSP protocol.
- Secret keys are exchanged using a public-key protocol.
- The encryption happens transparently.
- An encrypted session is valid for a fixed time from the start of the session.
Encryption is worthwhile when messages between hosts might be tampered with; otherwise it is a waste of compute resources. Decision: no encryption.
I also read Installing SMF Services, but that appears to be a Solaris 10 feature, so I skip it.
On to the master host installation. Apparently you can start over from scratch if you mess it up. The "before you begin" section does carry one warning: user names must match between the execution hosts and the hosts that submit to the queues. With that,
start the installation.
# ./install_qmaster
After the license is shown you are asked to agree, so y.
Do you agree with that license? (y/n) [n] >> y
It says the terminal should be 80x24 and so on. Hit Enter to continue.
Welcome to the Grid Engine installation
---------------------------------------
Grid Engine qmaster host installation
-------------------------------------
Before you continue with the installation please read these hints:
- Your terminal window should have a size of at least 80x24 characters
- The INTR character is often bound to the key Ctrl-C. The term >Ctrl-C< is used during the installation if you have the possibility to abort the installation
The qmaster installation procedure will take approximately 5-10 minutes.
Hit <RETURN> to continue >>
If the hostname is localhost or the IP address is 127.0.x.x, it complains.
Unsupported local hostname
--------------------------
The current hostname is resolved as follows:
Hostname: localhost
Aliases: hoge
Host Address(es): 127.0.0.1
It is not supported for a Grid Engine installation that the local hostname contains the hostname "localhost" and/or the IP address "127.0.x.x" of the loopback interface.
The "localhost" hostname should be reserved for the loopback interface ("127.0.0.1") and the real hostname should be assigned to one of the physical or logical network interfaces of this machine.
Installation failed.
Press <RETURN> to exit the installation procedure >>
Exit with Enter. I will get past this by editing /etc/hosts. Before editing it looks like this:
# cat /etc/hosts
127.0.0.1 localhost hoge
# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
Now edit. Just give names to the addresses that ifconfig shows assigned to the Ethernet adapters. Until now the hostname hoge was an alias for 127.0.0.1; instead make it the name for 192.168.14.6. This machine has four eth adapters, so I also add names for the other ones.
127.0.0.1    localhost
192.168.14.6 hoge
192.168.1.1  hoge1
192.168.2.1  hoge2
192.168.3.1  hoge3
# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
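Whether the edit worked can be checked before rerunning the installer. A small sketch that mirrors the installer's loopback complaint (assumes Linux getent; check_host is just a helper name I made up):

```shell
# Report how a hostname resolves, the way the installer checks it.
check_host() {
  addr=$(getent hosts "$1" | awk '{print $1}' | head -n 1)
  case "$addr" in
    127.*|::1) echo "NG: loopback" ;;     # the installer refuses this
    "")        echo "NG: unresolved" ;;
    *)         echo "OK: $addr" ;;
  esac
}
check_host "$(hostname)"
```

If it prints "OK" with a real interface address, the Unsupported local hostname screen should not come back.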
In this state, run ./install_qmaster once more. It starts from the license text as before; I skip over the parts that repeat. I want an SGE administrator other than root, so y.
# ./install_qmaster
Choosing Grid Engine admin user account
---------------------------------------
You may install Grid Engine that all files are created with the user id of an unprivileged user. This will make it possible to install and run Grid Engine in directories where user >root< has no permissions to create and write files and directories.
- Grid Engine still has to be started by user >root<
- this directory should be owned by the Grid Engine administrator
Do you want to install Grid Engine under an user id other than >root< (y/n) [y] >> y
Next you enter the SGE admin user name, but I forgot to create the admin user first, so I bail out with Ctrl+C.
Choosing a Grid Engine admin user name
--------------------------------------
Please enter a valid user name >>
Create an SGE admin user named sgeadmin.
# adduser sgeadmin
Adding user `sgeadmin' ...
Adding new group `sgeadmin' (1001) ...
Adding new user `sgeadmin' (1001) with group `sgeadmin' ...
Creating home directory `/home/sgeadmin' ...
Copying files from `/etc/skel' ...
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
Changing the user information for sgeadmin
Enter the new value, or press ENTER for the default
Full Name []:
Room Number []:
Work Phone []:
Home Phone []:
Other []:
Is the information correct? [Y/n] Y
With the user created, run ./install_qmaster once more. Type the SGE admin user name and hit Enter.
# ./install_qmaster
Choosing a Grid Engine admin user name
--------------------------------------
Please enter a valid user name >> sgeadmin
Installing Grid Engine as admin user >sgeadmin<
Hit <RETURN> to continue >>
If the SGE_ROOT environment variable were wrong you would correct it here; it is fine, so Enter.
Checking $SGE_ROOT directory
----------------------------
The Grid Engine root directory is:
$SGE_ROOT = /opt/sge6-2
If this directory is not correct (e.g. it may contain an automounter prefix) enter the correct path to this directory or hit <RETURN> to use default [/opt/sge6-2] >>
Your $SGE_ROOT directory: /opt/sge6-2
Hit <RETURN> to continue >>
How should the port that qmaster watches be determined? To configure it in /etc/services rather than through a shell variable, I choose the default, 2.
Grid Engine TCP/IP communication service
----------------------------------------
The port for sge_qmaster is currently set as service.
sge_qmaster service set to port 6444
Now you have the possibility to set/change the communication ports by using the >shell environment< or you may configure it via a network service, configured in local >/etc/service<, >NIS< or >NIS+<, adding an entry in the form
sge_qmaster <port_number>/tcp
to your services database and make sure to use an unused port number.
How do you want to configure the Grid Engine communication ports?
Using the >shell environment<: [1]
Using a network service like >/etc/service<, >NIS/NIS+<: [2]
(default: 2) >>
This just confirms that the sge_qmaster service will be used for Grid Engine communication. Enter to continue.
Grid Engine TCP/IP service >sge_qmaster<
----------------------------------------
Using the service
sge_qmaster
for communication with Grid Engine.
Hit <RETURN> to continue >>
Now the same setting for sge_execd, the execution daemon. As with the master daemon, it goes in /etc/services, so the default 2.
Grid Engine TCP/IP communication service
----------------------------------------
The port for sge_execd is currently set as service.
sge_execd service set to port 6445
Now you have the possibility to set/change the communication ports by using the >shell environment< or you may configure it via a network service, configured in local >/etc/service<, >NIS< or >NIS+<, adding an entry in the form
sge_execd <port_number>/tcp
to your services database and make sure to use an unused port number.
How do you want to configure the Grid Engine communication ports?
Using the >shell environment<: [1]
Using a network service like >/etc/service<, >NIS/NIS+<: [2]
(default: 2) >>
Just hit Enter, same as for the master daemon.
Grid Engine TCP/IP communication service
-----------------------------------------
Using the service
sge_execd
for communication with Grid Engine.
Hit <RETURN> to continue >>
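Choosing [2] in both prompts means the two port numbers come from the services database. The entries look like the form the prompts describe, with the default ports they reported; as a sketch, written to a scratch copy rather than the live /etc/services:

```shell
# The two service entries for the default ports shown above.
svc=/tmp/services.example        # the real file would be /etc/services
printf '%s\n' \
  'sge_qmaster     6444/tcp' \
  'sge_execd       6445/tcp' > "$svc"
grep '^sge_' "$svc"              # both entries present
```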
I do not really understand cells yet, so I do as told and keep the default; Enter.
Grid Engine cells
-----------------
Grid Engine supports multiple cells. If you are not planning to run multiple Grid Engine clusters or if you don't know yet what is a Grid Engine cell it is safe to keep the default cell name
default
If you want to install multiple cells you can enter a cell name now.
The environment variable $SGE_CELL=<your_cell_name> will be set for all further Grid Engine commands.
Enter cell name [default] >>
Using cell >default<.
Hit <RETURN> to continue >>
Enter a name for the compute cluster. The default, for now.
Unique cluster name
-------------------
The cluster name uniquely identifies a specific Sun Grid Engine cluster. The cluster name must be unique throughout your organization. The name is not related to the SGE cell.
The cluster name must start with a letter ([A-Za-z]), followed by letters, digits ([0-9]), dashes (-) or underscores (_).
Enter new cluster name or hit <RETURN> to use default [p6444] >>
creating directory: /opt/sge6-2/default/common
Your $SGE_CLUSTER_NAME: p6444
Hit <RETURN> to continue >>
Choose the spool directory; the default is fine, so plain Enter.
Grid Engine qmaster spool directory
-----------------------------------
The qmaster spool directory is the place where the qmaster daemon stores the configuration and the state of the queuing system. The admin user >sgeadmin< must have read/write access to the qmaster spool directory.
If you will install shadow master hosts or if you want to be able to start the qmaster daemon on other hosts (see the corresponding section in the Grid Engine Installation and Administration Manual for details) the account on the shadow master hosts also needs read/write access to this directory.
The following directory
[/opt/sge6-2/default/spool/qmaster]
will be used as qmaster spool directory by default!
Do you want to select another qmaster spool directory (y/n) [n] >>
Asked whether to install Windows execution hosts: no.
Windows Execution Host Support
------------------------------
Are you going to install Windows Execution Hosts? (y/n) [n] >>
Permission check. The default is y, but here I try n.
Verifying and setting file permissions
--------------------------------------
Did you install this version with >pkgadd< or did you already verify and set the file permissions of your distribution (y/n) [y] >>
Asked whether to verify and set permissions; here I take the default, y.
Verifying and setting file permissions
--------------------------------------
We may now verify and set the file permissions of your Grid Engine distribution.
This may be useful since due to unpacking and copying of your distribution your files may be unaccessible to other users.
We will set the permissions of directories and binaries to
755 - that means executable are accessible for the world
and for ordinary files to
644 - that means readable for the world
Do you want to verify and set your file permissions (y/n) [y] >>
This step apparently does the same thing as the util/setfileperm.sh $SGE_ROOT run at the start.
Verifying and setting file permissions and owner in >3rd_party<
Verifying and setting file permissions and owner in >bin<
Verifying and setting file permissions and owner in >ckpt<
Verifying and setting file permissions and owner in >dtrace<
Verifying and setting file permissions and owner in >examples<
Verifying and setting file permissions and owner in >inst_sge<
Verifying and setting file permissions and owner in >install_execd<
Verifying and setting file permissions and owner in >install_qmaster<
Verifying and setting file permissions and owner in >lib<
Verifying and setting file permissions and owner in >mpi<
Verifying and setting file permissions and owner in >pvm<
Verifying and setting file permissions and owner in >qmon<
Verifying and setting file permissions and owner in >util<
Verifying and setting file permissions and owner in >utilbin<
Verifying and setting file permissions and owner in >catman<
Verifying and setting file permissions and owner in >doc<
Verifying and setting file permissions and owner in >include<
Verifying and setting file permissions and owner in >man<
Your file permissions were set
Hit <RETURN> to continue >>
I answer y. I wonder whether it consults /etc/hosts here.
Select default Grid Engine hostname resolving method
----------------------------------------------------
Are all hosts of your cluster in one DNS domain? If this is the case the hostnames
>hostA< and >hostA.foo.com<
would be treated as equal, because the DNS domain name >foo.com< is ignored when comparing hostnames.
Are all hosts of your cluster in a single DNS domain (y/n) [y] >> y
Ignoring domain name when comparing hostnames.
Hit <RETURN> to continue >>
Making directories
------------------
creating directory: /opt/sge6-2/default/spool/qmaster
creating directory: /opt/sge6-2/default/spool/qmaster/job_scripts
Hit <RETURN> to continue >>
I go with Berkeley DB. Does this mean a DB server is indispensable? That could be rather painful.
Setup spooling
--------------
Your SGE binaries are compiled to link the spooling libraries during runtime (dynamically). So you can choose between Berkeley DB spooling and Classic spooling method.
Please choose a spooling method (berkeleydb|classic) [berkeleydb] >>
No shadow master for me, and I want it as fast as possible: the default, n. No spooling server will be set up.
The Berkeley DB spooling method provides two configurations!
Local spooling:
The Berkeley DB spools into a local directory on this host (qmaster host). This setup is faster, but you can't setup a shadow master host
Berkeley DB Spooling Server:
If you want to setup a shadow master host, you need to use Berkeley DB Spooling Server! In this case you have to choose a host with a configured RPC service. The qmaster host connects via RPC to the Berkeley DB.
This setup is more failsafe, but results in a clear potential security hole. RPC communication (as used by Berkeley DB) can be easily compromised. Please only use this alternative if your site is secure or if you are not concerned about security. Check the installation guide for further advice on how to achieve failsafety without compromising security.
Do you want to use a Berkeley DB Spooling Server? (y/n) [n] >>
Hit <RETURN> to continue >>
Specify where the spool directory goes. Default.
Berkeley Database spooling parameters
-------------------------------------
Please enter the database directory now, even if you want to spool locally, it is necessary to enter this database directory.
Default: [/opt/sge6-2/default/spool/spooldb] >>
creating directory: /opt/sge6-2/default/spool/spooldb
Dumping bootstrapping information
Initializing spooling database
Hit <RETURN> to continue >>
This one also stays at the default.
Grid Engine group id range
--------------------------
When jobs are started under the control of Grid Engine an additional group id is set on platforms which do not support jobs. This is done to provide maximum control for Grid Engine jobs.
This additional UNIX group id range must be unused group id's in your system. Each job will be assigned a unique id during the time it is running. Therefore you need to provide a range of id's which will be assigned dynamically for jobs.
The range must be big enough to provide enough numbers for the maximum number of Grid Engine jobs running at a single moment on a single host. E.g. a range like >20000-20100< means, that Grid Engine will use the group ids from 20000-20100 and provides a range for 100 Grid Engine jobs at the same time on a single host.
You can change at any time the group id range in your cluster configuration.
Please enter a range [20000-20100] >>
Using >20000-20100< as gid range.
Hit <RETURN> to continue >>
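The range in the prompt can be sanity-checked with a line of arithmetic: one unused group id is consumed per running job on a host, so the range size is the per-host concurrent job cap.

```shell
# Size of the default gid range >20000-20100< from the prompt above.
lo=20000
hi=20100
echo $(( hi - lo + 1 ))   # 101 ids, i.e. roughly 100 concurrent jobs per host
```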
Default again.
Grid Engine cluster configuration
---------------------------------
Please give the basic configuration parameters of your Grid Engine installation:
<execd_spool_dir>
The pathname of the spool directory of the execution hosts. User >sgeadmin< must have the right to create this directory and to write into it.
Default: [/opt/sge6-2/default/spool] >>
Where should mail about trouble be delivered? sgeadmin seems fine for now.
Grid Engine cluster configuration (continued)
---------------------------------------------
<administrator_mail>
The email address of the administrator to whom problem reports are sent.
It's is recommended to configure this parameter. You may use >none< if you do not wish to receive administrator mail.
Please enter an email address in the form >user@foo.com<.
Default: [none] >> sgeadmin@localhost
Final confirmation: n, then Enter.
The following parameters for the cluster configuration were configured:
execd_spool_dir /opt/sge6-2/default/spool
administrator_mail sgeadmin@localhost
Do you want to change the configuration parameters (y/n) [n] >>
Creating local configuration
----------------------------
Creating >act_qmaster< file
Adding default complex attributes
Adding default parallel environments (PE)
Adding SGE default usersets
Adding >sge_aliases< path aliases file
Adding >qtask< qtcsh sample default request file
Adding >sge_request< default submit options file
Creating >sgemaster< script
Creating >sgeexecd< script
Creating settings files for >.profile/.cshrc<
Hit <RETURN> to continue >>
Whether to start the master host at boot. I want that, so just Enter.
qmaster startup script
----------------------
We can install the startup script that will start qmaster at machine boot (y/n) [y] >>
cp /opt/sge6-2/default/common/sgemaster /etc/init.d/sgemaster.p6444
/usr/sbin/update-rc.d sgemaster.p6444
Adding system startup for /etc/init.d/sgemaster.p6444 ...
/etc/rc0.d/K03sgemaster.p6444 -> ../init.d/sgemaster.p6444
/etc/rc1.d/K03sgemaster.p6444 -> ../init.d/sgemaster.p6444
/etc/rc6.d/K03sgemaster.p6444 -> ../init.d/sgemaster.p6444
/etc/rc2.d/S95sgemaster.p6444 -> ../init.d/sgemaster.p6444
/etc/rc3.d/S95sgemaster.p6444 -> ../init.d/sgemaster.p6444
/etc/rc4.d/S95sgemaster.p6444 -> ../init.d/sgemaster.p6444
/etc/rc5.d/S95sgemaster.p6444 -> ../init.d/sgemaster.p6444
Hit <RETURN> to continue >>
And it starts the master daemon for you.
Grid Engine qmaster startup
---------------------------
Starting qmaster daemon. Please wait ...
starting sge_qmaster
Hit <RETURN> to continue >>
Asked whether to read the list of execution hosts from a file: no.
Adding Grid Engine hosts
------------------------
Please now add the list of hosts, where you will later install your execution daemons. These hosts will be also added as valid submit hosts.
Please enter a blank separated list of your execution hosts. You may press <RETURN> if the line is getting too long. Once you are finished simply press <RETURN> without entering a name.
You also may prepare a file with the hostnames of the machines where you plan to install Grid Engine. This may be convenient if you are installing Grid Engine on many hosts.
Do you want to use a file which contains the list of hosts (y/n) [n] >>
For the moment this machine will serve as both master host and execution host, so I register its own hostname as an execution host.
Adding admin and submit hosts
-----------------------------
Please enter a blank seperated list of hosts.
Stop by entering <RETURN>. You may repeat this step until you are entering an empty list. You will see messages from Grid Engine when the hosts are added.
Host(s): master01
Finished adding hosts. Hit <RETURN> to continue >>
I do not quite get the difference between a shadow host and a shadow master host, but it is recommended, so y here.
If you want to use a shadow host, it is recommended to add this host to the list of administrative hosts.
If you are not sure, it is also possible to add or remove hosts after the installation with
<qconf -ah hostname> for adding and
<qconf -dh hostname> for removing
this host
Attention: This is not the shadow host installation procedure. You still have to install the shadow host separately
Do you want to add your shadow host(s) now? (y/n) [y] >>
Whether to read from a file again, so no.
Adding Grid Engine shadow hosts
-------------------------------
Please now add the list of hosts, where you will later install your shadow daemon.
Please enter a blank separated list of your execution hosts. You may press <RETURN> if the line is getting too long. Once you are finished simply press <RETURN> without entering a name.
You also may prepare a file with the hostnames of the machines where you plan to install Grid Engine. This may be convenient if you are installing Grid Engine on many hosts.
Do you want to use a file which contains the list of hosts (y/n) [n] >>
On reflection, I do not need a shadow host at this point. One machine runs both the execution host and the master host, so if it dies, shadow or no shadow, everything goes down with it. So I move on without adding any.
Adding admin hosts
------------------
Please enter a blank seperated list of hosts.
Stop by entering <RETURN>. You may repeat this step until you are entering an empty list. You will see messages from Grid Engine when the hosts are added.
Host(s):
Finished adding hosts. Hit <RETURN> to continue >>
Just a confirmation, so plain Enter.
Creating the default <all.q> queue and <allhosts> hostgroup
-----------------------------------------------------------
root@master01 added "@allhosts" to host group list
root@master01 added "all.q" to cluster queue list
Hit <RETURN> to continue >>
これは1。normalだと負荷に応じてスケジューリングしてくれるのだそうな。
Scheduler Tuning ---------------- The details on the different options are described in the manual. Configurations -------------- 1) Normal Fixed interval scheduling, report limited scheduling information, actual + assumed load 2) High Fixed interval scheduling, report limited scheduling information, actual load 3) Max Immediate Scheduling, report no scheduling information, actual load Enter the number of your preferred configuration and hit <RETURN>! Default configuration is [1] >> 1 We're configuring the scheduler with >Normal< settings! Do you agree? (y/n) [y] >>
あとは使い方の解説。3回enterでプロンプトが帰ってくる。
Using Grid Engine ----------------- You should now enter the command: source /opt/sge6-2/default/common/settings.csh if you are a csh/tcsh user or # . /opt/sge6-2/default/common/settings.sh if you are a sh/ksh user. This will set or expand the following environment variables: - $SGE_ROOT (always necessary) - $SGE_CELL (if you are using a cell other than >default<) - $SGE_CLUSTER_NAME (always necessary) - $SGE_QMASTER_PORT (if you haven't added the service >sge_qmaster<) - $SGE_EXECD_PORT (if you haven't added the service >sge_execd<) - $PATH/$path (to find the Grid Engine binaries) - $MANPATH (to access the manual pages) Hit <RETURN> to see where Grid Engine logs messages >>
Grid Engine messages
--------------------
Grid Engine messages can be found at:
   /tmp/qmaster_messages (during qmaster startup)
   /tmp/execd_messages   (during execution daemon startup)
After startup the daemons log their messages in their spool directories.
   Qmaster:     /opt/sge6-2/default/spool/qmaster/messages
   Exec daemon: <execd_spool_dir>/<hostname>/messages
Grid Engine startup scripts
---------------------------
Grid Engine startup scripts can be found at:
   /opt/sge6-2/default/common/sgemaster (qmaster)
   /opt/sge6-2/default/common/sgeexecd  (execd)
Do you want to see previous screen about using Grid Engine again (y/n) [n] >>
Your Grid Engine qmaster installation is now completed
------------------------------------------------------
Please now login to all hosts where you want to run an execution daemon
and start the execution host installation procedure.
If you want to run an execution daemon on this host, please do not forget
to make the execution host installation in this host as well.
All execution hosts must be administrative hosts during the installation.
All hosts which you added to the list of administrative hosts during this
installation procedure can now be installed.
You may verify your administrative hosts with the command:
   # qconf -sh
and you may add new administrative hosts with the command:
   # qconf -ah <hostname>
Please hit <RETURN> >>
To check that the environment variables get set correctly, I source the shell script exactly as instructed above. Everything appears to be set properly.
# printenv
SHELL=/bin/bash
TERM=screen
OLDPWD=/root
USER=root
MAIL=/var/mail/root
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
PWD=/opt/sge6-2
LANG=C
SGE_ROOT=/opt/sge6-2
PS1=\h:\w\$
SHLVL=1
HOME=/root
LOGNAME=root
_=/usr/bin/printenv
# . /opt/sge6-2/default/common/settings.sh
# printenv
MANPATH=/opt/sge6-2/man:/usr/share/man:/usr/local/share/man
SHELL=/bin/bash
TERM=screen
SGE_CELL=default
OLDPWD=/root
USER=root
MAIL=/var/mail/root
PATH=/opt/sge6-2/bin/lx24-x86:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
PWD=/opt/sge6-2
LANG=C
SGE_ROOT=/opt/sge6-2
PS1=\h:\w\$
SHLVL=1
HOME=/root
LOGNAME=root
SGE_CLUSTER_NAME=p6444
_=/usr/bin/printenv
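Sourcing settings.sh by hand only affects the current shell. A minimal sketch, assuming bash, of making the environment persistent by appending the source line to ~/.bashrc:

```shell
# Append the Grid Engine settings line to ~/.bashrc so every new
# shell picks up the environment; skip the append if it is already
# there (grep -qxF matches the whole line literally)
line='. /opt/sge6-2/default/common/settings.sh'
grep -qxF "$line" ~/.bashrc 2>/dev/null || echo "$line" >> ~/.bashrc

# Show what was added
tail -n 1 ~/.bashrc
```

For csh/tcsh users the equivalent would go in ~/.cshrc using settings.csh instead.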
That completes the master daemon setup; somewhat exhausting. Next, the execution daemon. First, check that the host is registered as an administrative host (an execution host must be one during installation). hoge is the machine where the master host was installed, but since the execution daemon will run on the same machine, this is fine.
# qconf -sh
hoge
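For reference, the various host lists can also be managed directly with qconf. A sketch, assuming a running qmaster; node01 is a hypothetical hostname:

```shell
qconf -sh          # list administrative hosts
qconf -ah node01   # add node01 as an administrative host
qconf -dh node01   # remove it again
qconf -ss          # list submit hosts
qconf -sel         # list execution hosts
```

This is handy later when adding real compute nodes to the cluster, since every execution host must first appear in the administrative host list.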
Now launch the execution daemon installer. The installation supposedly takes about five minutes. We'll see about that. Press Return as-is.
# ./install_execd
Welcome to the Grid Engine execution host installation
------------------------------------------------------
If you haven't installed the Grid Engine qmaster host yet, you must execute
this step (with >install_qmaster<) prior the execution host installation.
For a sucessfull installation you need a running Grid Engine qmaster. It is
also neccesary that this host is an administrative host.
You can verify your current list of administrative hosts with the command:
   # qconf -sh
You can add an administrative host with the command:
   # qconf -ah <hostname>
The execution host installation will take approximately 5 minutes.
Hit <RETURN> to continue >>
Confirming the location of $SGE_ROOT. It is correct, so press Enter as-is.
Checking $SGE_ROOT directory
----------------------------
The Grid Engine root directory is:
   $SGE_ROOT = /opt/sge6-2
If this directory is not correct (e.g. it may contain an automounter
prefix) enter the correct path to this directory or hit <RETURN>
to use default [/opt/sge6-2] >>
Your $SGE_ROOT directory: /opt/sge6-2
Hit <RETURN> to continue >>
Cell selection: the cell created during the master daemon install was named "default", so just press Enter.
Grid Engine cells
-----------------
Please enter cell name which you used for the qmaster
installation or press <RETURN> to use [default] >>
Using cell: >default<
Hit <RETURN> to continue >>
Next is another confirmation: the port the execution daemon listens on. Press Enter as-is.
Grid Engine TCP/IP communication service
----------------------------------------
The port for sge_execd is currently set as service.
   sge_execd service set to port 6445
Hit <RETURN> to continue >>
Checking hostname resolving
---------------------------
This hostname is known at qmaster as an administrative host.
Hit <RETURN> to continue >>
Spool directory selection. Accept the default and press Enter.
Execd spool directory configuration
-----------------------------------
You defined a global spool directory when you installed the master host.
You can use that directory for spooling jobs from this execution host or
you can define a different spool directory for this execution host.
ATTENTION: For most operating systems, the spool directory does not have to
be located on a local disk. The spool directory can be located on a
network-accessible drive. However, using a local spool directory provides
better performance.
FOR WINDOWS USERS: On Windows systems, the spool directory MUST be located
on a local disk. If you install an execution daemon on a Windows system
without a local spool directory, the execution host is unusable.
The spool directory is currently set to:
   <</opt/sge6-2/default/spool/master01>>
Do you want to configure a different spool directory for this host (y/n) [n] >>
Creating local configuration
----------------------------
sgeadmin@master01 modified "master01" in configuration list
Local configuration for host >master01< created.
Hit <RETURN> to continue >>
Registering the execution daemon as a boot-time startup service. Type y and press Enter.
execd startup script
--------------------
We can install the startup script that will start execd at machine boot (y/n) [y] >> y
cp /opt/sge6-2/default/common/sgeexecd /etc/init.d/sgeexecd.p6444
/usr/sbin/update-rc.d sgeexecd.p6444
 Adding system startup for /etc/init.d/sgeexecd.p6444 ...
   /etc/rc0.d/K03sgeexecd.p6444 -> ../init.d/sgeexecd.p6444
   /etc/rc1.d/K03sgeexecd.p6444 -> ../init.d/sgeexecd.p6444
   /etc/rc6.d/K03sgeexecd.p6444 -> ../init.d/sgeexecd.p6444
   /etc/rc2.d/S95sgeexecd.p6444 -> ../init.d/sgeexecd.p6444
   /etc/rc3.d/S95sgeexecd.p6444 -> ../init.d/sgeexecd.p6444
   /etc/rc4.d/S95sgeexecd.p6444 -> ../init.d/sgeexecd.p6444
   /etc/rc5.d/S95sgeexecd.p6444 -> ../init.d/sgeexecd.p6444
Hit <RETURN> to continue >>
Grid Engine execution daemon startup
------------------------------------
Starting execution daemon. Please wait ...
   starting sge_execd
Hit <RETURN> to continue >>
Adding a queue for this host
----------------------------
We can now add a queue instance for this host:
   - it is added to the >allhosts< hostgroup
   - the queue provides 1 slot(s) for jobs in all queues
     referencing the >allhosts< hostgroup
You do not need to add this host now, but before running jobs on this host
it must be added to at least one queue.
Do you want to add a default queue instance for this host (y/n) [y] >>
root@master01 modified "@allhosts" in host group list
root@master01 modified "all.q" in cluster queue list
Hit <RETURN> to continue >>
Using Grid Engine
-----------------
You should now enter the command:
   source /opt/sge6-2/default/common/settings.csh
if you are a csh/tcsh user or
   # . /opt/sge6-2/default/common/settings.sh
if you are a sh/ksh user.
This will set or expand the following environment variables:
   - $SGE_ROOT         (always necessary)
   - $SGE_CELL         (if you are using a cell other than >default<)
   - $SGE_CLUSTER_NAME (always necessary)
   - $SGE_QMASTER_PORT (if you haven't added the service >sge_qmaster<)
   - $SGE_EXECD_PORT   (if you haven't added the service >sge_execd<)
   - $PATH/$path       (to find the Grid Engine binaries)
   - $MANPATH          (to access the manual pages)
Hit <RETURN> to see where Grid Engine logs messages >>
Grid Engine messages
--------------------
Grid Engine messages can be found at:
   /tmp/qmaster_messages (during qmaster startup)
   /tmp/execd_messages   (during execution daemon startup)
After startup the daemons log their messages in their spool directories.
   Qmaster:     /opt/sge6-2/default/spool/qmaster/messages
   Exec daemon: <execd_spool_dir>/<hostname>/messages
Grid Engine startup scripts
---------------------------
Grid Engine startup scripts can be found at:
   /opt/sge6-2/default/common/sgemaster (qmaster)
   /opt/sge6-2/default/common/sgeexecd  (execd)
Do you want to see previous screen about using Grid Engine again (y/n) [n] >>
That completes the execution daemon installation. Time to test. First, log out of root.
# exit
Log back in as sgeadmin and source the shell script that sets the environment variables.
$ . /opt/sge6-2/default/common/settings.sh
Check the cluster configuration with qconf.
$ qconf -sconf
#global:
execd_spool_dir              /opt/sge6-2/default/spool
mailer                       /bin/mail
xterm                        /usr/bin/X11/xterm
load_sensor                  none
prolog                       none
epilog                       none
shell_start_mode             posix_compliant
login_shells                 sh,ksh,csh,tcsh
min_uid                      0
min_gid                      0
user_lists                   none
xuser_lists                  none
projects                     none
xprojects                    none
enforce_project              false
enforce_user                 auto
load_report_time             00:00:40
max_unheard                  00:05:00
reschedule_unknown           00:00:00
loglevel                     log_warning
administrator_mail           sgeadmin@localhost
set_token_cmd                none
pag_cmd                      none
token_extend_time            none
shepherd_cmd                 none
qmaster_params               none
execd_params                 none
reporting_params             accounting=true reporting=false \
                             flush_time=00:00:15 joblog=false sharelog=00:00:00
finished_jobs                100
gid_range                    20000-20100
qlogin_command               builtin
qlogin_daemon                builtin
rlogin_command               builtin
rlogin_daemon                builtin
rsh_command                  builtin
rsh_daemon                   builtin
max_aj_instances             2000
max_aj_tasks                 75000
max_u_jobs                   0
max_jobs                     0
max_advance_reservations     0
auto_user_oticket            0
auto_user_fshare             0
auto_user_default_project    none
auto_user_delete_time        86400
delegated_file_staging       false
reprioritize                 0
jsv_url                      none
jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w
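The configuration shown above is not read-only. A sketch of how it can be edited with qconf (assuming a running qmaster; master01 is the hostname from the transcripts above):

```shell
# Open the global configuration in $EDITOR; changes take effect
# when the editor exits
qconf -mconf global

# Show the host-local configuration created during the execd install
qconf -sconf master01
```

Edits made this way go through the qmaster, so there is no need to hunt for configuration files on disk.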
Submit a test job. According to the manual, it goes like this:
$ rsh hoge date
Permission denied.
$ qsub $SGE_ROOT/examples/jobs/simple.sh
rsh fails, but qsub succeeds. rsh fails because .rhosts is not configured correctly; qsub presumably succeeds regardless because the job was dispatched without going through rsh.
$ echo 'hoge sgeadmin' >> ~/.rhosts
$ rsh hoge date
2009年 4月 10日 金曜日 15:12:40 JST
$ date
2009年 4月 10日 金曜日 15:12:49 JST
That completes the test job submission.
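Beyond a single qsub, the submit-and-watch cycle can be scripted. A hypothetical sketch, assuming settings.sh has been sourced and a queue is available (the -terse flag makes qsub print only the job id):

```shell
# Submit the example job and capture its id
job_id=$(qsub -terse "$SGE_ROOT/examples/jobs/simple.sh")

# Poll until the job leaves the queue; qstat -j exits nonzero
# once the job is no longer known to the qmaster
while qstat -j "$job_id" >/dev/null 2>&1; do
    sleep 5
done

# By default stdout/stderr land in $HOME as <job_name>.o<id> / .e<id>
cat "$HOME/simple.sh.o${job_id}"

# Accounting record after the job has finished
qacct -j "$job_id"
```

Polling with qstat is crude but dependency-free; for chains of jobs, qsub's hold options are the more idiomatic route.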
[Memo] hudson, TheSchwartz, and the like
Committing only after compiling and testing is not very efficient when compilation and tests take a long time. That is where the idea called CI (continuous integration) comes in, and there is apparently a tool called hudson that implements it. The workflow, I gather, is to commit freely to a branch of a version-control repository such as Subversion, have hudson kick off the compile and test scripts, and automatically merge to trunk once the build and tests succeed.
There is also such a thing as a job queue server: rather than the shell blocking until a command you issued finishes, the command is queued and run automatically when the machine is idle. TheSchwartz seems to be one way to achieve this.
On a different note, there are apparently also job schedulers such as Torque.