A question about DQS3.3.2 (Computation - Scientific Computing)
L*i
1
On our PC cluster, we use DQS3.3.2 to manage all jobs. There are several
classes of queues, for example X1000, X2000, and X3000. I have run into a
problem: jobs cannot be submitted to one queue, X2000. For instance, if you
run 'qsub file.job' and then 'qstat', no job for queue X2000 appears in
either the queued list or the running list. Yet sometimes the jobs do get
picked up. It's very strange to me.

Does anyone know what is going on? I would greatly appreciate your help.
d*w
2
Are you sure the queue is okay?
1. Is dqs_execd running on the node?
2. Have the dead jobs in the queue been cleared?
Try qstat -f to find the problem.


d*w
3

It seems okay. Perhaps file.job is running on other nodes.
If qstat -f shows X2000 as UP, then it should be normal.



L*i
4
I checked all queues with "qstat -f", and every machine is UP. But the
machines serving the X2000 queue do not pick up jobs, even though dqs_execd
is running on them.
The following messages appear in err_file (host033 runs the qmaster daemon):
time=1058023801 DQS_WARNING_0257 dqs_open_tcp: cannot connect to peer host033 errno= 111 ../SRC/dqs_io.c 212 /usr/local/DQS_332/bin/dqs_execd332 host067
time=1058023801 DQS_ERROR_0458 unable to connect to host "host033" ../SRC/dqs_send_receive.c 170 /usr/
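A rough way to read those err_file records, sketched in Python. It assumes the `time=` field is Unix epoch seconds and that the nodes run Linux, where errno 111 is ECONNREFUSED (nothing was accepting connections on the target port). The helper names `parse_dqs_line` and `probe` are made up for this illustration and are not part of DQS. The port to probe is not given anywhere in the thread, so it has to come from your own DQS configuration.

```python
import errno
import re
import socket
from datetime import datetime, timezone

# First record from the err_file above, with the wrapped lines joined.
LOG = (
    "time=1058023801 DQS_WARNING_0257 dqs_open_tcp: cannot connect to peer "
    "host033 errno= 111 ../SRC/dqs_io.c 212"
)


def parse_dqs_line(text):
    """Pull the timestamp, message code, and errno out of one record.

    Hypothetical helper: the field layout is inferred from the two
    records quoted in the post, not from DQS documentation.
    """
    head = re.search(r"time=(\d+)\s+(DQS_[A-Z]+_\d+)", text)
    if head is None:
        return None
    err = re.search(r"errno=\s*(\d+)", text)
    code = int(err.group(1)) if err else None
    return {
        # Assumption: time= is seconds since the Unix epoch.
        "when": datetime.fromtimestamp(int(head.group(1)), tz=timezone.utc),
        "message": head.group(2),
        "errno": code,
        # Symbolic name on the machine running this script; on Linux
        # 111 maps to ECONNREFUSED, other systems may differ.
        "errno_name": errno.errorcode.get(code) if code is not None else None,
    }


def probe(host, port, timeout=3.0):
    """Try a plain TCP connect; return None on success, else the OSError."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return None
    except OSError as exc:
        return exc
```

Running `parse_dqs_line(LOG)` gives the record's time as a readable UTC datetime together with the errno. Calling `probe("host033", port)` from host067, with `port` set to whatever port your qmaster is configured to listen on, would show whether the refusal can be reproduced outside of dqs_execd. If it can, the likely suspects are the qmaster not running on host033 or a port mismatch between the X2000 nodes and the master.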

