Working with the scheduler

Overview

Teaching: 25 min
Exercises: 5 min

Questions

How do I know if something went wrong with my jobs?

How do I decrease the waiting time of my jobs?

How do I cancel a job?

Objectives

Submit a job and monitor its status.

Decipher the output of the monitoring application.

Quit or cancel an already running job.

Specify the expected runtime of your job to decrease the waiting time.

While submitting more test jobs, Lola notices that she keeps watching the current directory for a log file to appear. Sometimes this takes a while, and sometimes it happens almost instantly. How can she tell whether a job is running or not?

$ cat multiple_commands_and_sleep.sh
#!/bin/bash

hostname
date
sleep 30
date
$ sbatch -o multiple_commands_and_sleep.log < multiple_commands_and_sleep.sh

Now Lola tries one of the monitoring commands that she discovered in the man pages of her scheduler:

$ squeue -u $USER

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               171     batch   sbatch     lola  R       0:07      1 n01   

The output of the monitoring command gives her some telemetry data about her job:

the state of her job in the ST column (waiting to start, also known as ‘pending’, running, or failing)
the node her job is running on (NODELIST)
the partition she submitted the job to (PARTITION)
the time her job has been running so far (TIME)
the job name (NAME)
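Once many jobs are queued, reading the table by eye gets tedious. The following is a minimal sketch of how the ST column could be summarised with awk. The function name `count_job_states` and the two pending jobs (172 and 173) in the sample are hypothetical and only serve as illustration; `R` and `PD` are the state codes squeue uses for running and pending jobs.

```shell
# Count jobs per state from squeue-style output read on standard input.
# Assumes the default column layout, where the state is the 5th field.
count_job_states() {
    awk 'NR > 1 && NF > 0 { count[$5]++ } END { for (s in count) print s, count[s] }' | sort
}

# Sample listing: job 171 is from the text, jobs 172 and 173 are made up.
sample_output='             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               171     batch   sbatch     lola  R       0:07      1 n01
               172     batch   sbatch     lola PD       0:00      1 (Priority)
               173     batch   sbatch     lola PD       0:00      1 (Priority)'

printf '%s\n' "$sample_output" | count_job_states
# prints:
# PD 2
# R 1
```

On a real cluster the same function could be fed live data with `squeue -u "$USER" | count_job_states`.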

All of this information might appear pointless when submitting only one job at a time. But in larger data analysis or simulation campaigns, when individual users submit hundreds or thousands of jobs, this information becomes crucial.

Lola is quite happy with her progress so far. All of a sudden, she detects a mistake in one of her scripts.

$ cat multiple_commands_and_sleep.sh
#!/bin/bash

hostname
date
sleep 300
date
$ sbatch -o multiple_commands_and_sleep.log < multiple_commands_and_sleep.sh

No, that shouldn’t happen! She doesn’t want to wait 5 minutes for the job to complete. So Lola would love to cancel this job.

$ squeue -u ${USER}
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           2047172     batch   sbatch     lola  R       0:02      1 n02   

To do so, she needs the job ID of the job she would like to stop, which squeue lists in the JOBID column. With it, she can ask the scheduler to cancel her job.

$ scancel 2047172
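Looking up the ID with squeue is not the only option: by default sbatch reports the ID on submission in a line of the form `Submitted batch job <id>`. The sketch below shows one way that line could be parsed so the ID is at hand for a later scancel. The helper name `job_id_from_sbatch` is hypothetical, and the example works on a captured line and only prints the scancel command rather than running it.

```shell
# Extract the numeric job ID from the confirmation line sbatch prints.
job_id_from_sbatch() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Captured example line; the ID matches the squeue listing above.
sample_line="Submitted batch job 2047172"
job_id=$(printf '%s\n' "$sample_line" | job_id_from_sbatch)

# Dry run: show the cancel command instead of executing it.
echo "scancel ${job_id}"
```

On a cluster, the sample line would be replaced by the real submission, for example `job_id=$(sbatch -o multiple_commands_and_sleep.log < multiple_commands_and_sleep.sh | job_id_from_sbatch)`. Slurm's `sbatch --parsable` option prints the ID without the surrounding text and may be simpler where it is available.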

Right before lunch on that day, Lola notices that more and more staff members of her lab start using the cluster. Her own jobs, which she would like to have done before lunch, wait longer than expected before they actually start. Lola revisits the cluster documentation. Maybe she overlooked something there? She finds a passage about giving the scheduler the estimated run time of a job (sometimes also referred to as wall time).

Wall time?

Wall-clock time, or wall time, is the human perception of the passage of time from the start to the completion of a task. In the context of a task being performed on a computer, wall-clock time is a measure of the real time that elapses from start to end, including time that passes due to programmed (artificial) delays or waiting for resources to become available. In other words, it is the difference between the time at which a task finishes and the time at which the task started. Wall-clock time is the time that a clock on the wall (or a stopwatch in hand) would measure as having elapsed between the start of the process and “now”. (from en.wikipedia.org/wiki/Wall-clock_time)

The documentation indicates that she can provide an estimate of the wall time of her job to the scheduler.

$ cat multiple_commands_and_sleep.sh
#!/bin/bash

hostname
date
sleep 300
date
$ sbatch -t 00:06:00 -o multiple_commands_and_sleep.log < multiple_commands_and_sleep.sh

Because the requested 6 minutes is much shorter than the default wall time limit, the scheduler can slot Lola’s job in sooner. It starts a lot earlier than the jobs of her colleagues, and she finishes her tasks before she goes to lunch.
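A common way to choose the value for `-t` is to take a measured or expected run time and add a safety margin, since a job that exceeds its limit is stopped by the scheduler. The sketch below does this arithmetic and formats the result as HH:MM:SS. The function name `walltime_from_seconds` and the default margin of 20 percent are assumptions for illustration, not a recommendation from the cluster documentation.

```shell
# Turn a run time in seconds plus a safety margin (in percent) into
# an HH:MM:SS string suitable for "sbatch -t".
walltime_from_seconds() {
    wt_seconds=$1
    wt_margin=${2:-20}   # assumed default margin: 20 percent
    # Integer arithmetic, rounded up to the next full second.
    wt_total=$(( (wt_seconds * (100 + wt_margin) + 99) / 100 ))
    printf '%02d:%02d:%02d\n' \
        $(( wt_total / 3600 )) $(( (wt_total % 3600) / 60 )) $(( wt_total % 60 ))
}

# The script above sleeps for 300 seconds; a 20 percent margin gives 6 minutes.
walltime_from_seconds 300 20
# prints: 00:06:00
```

The result could be passed on directly, for example `sbatch -t "$(walltime_from_seconds 300)" ...`.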

Errors and Outputs

Submit this script to your cluster and tell the scheduler to split the output into separate files for stdout and stderr. Check the contents of the log files that were created.

Submit the same job again, but this time make the scheduler send both stdout and stderr to the same output file. Use the manpage of the scheduler commands to find out how.

Ready, Set, Go!
Download this small Python script to some place on your cluster.
Run it by issuing:
$ python3 ./calc_pi.py 100000000
Put the time command before python3 to measure the runtime of this command.

Submit a job with time python3 but using 10000000000, i.e. 100 times the argument used above. What runtime limit do you want to specify?

Key Points

As there are many users logged in, using monitoring tools is key.

People make mistakes. Cancelling jobs is key to keeping your admin happy and not stressing the system unnecessarily.

The more information you give the job scheduler about your job, the quicker it will be dispatched/spooled.
