
Resource tracking

Tracking resource usage helps you understand how your jobs are currently consuming cluster resources and monitor for unexpected changes (e.g., a large, unexpected change in priority can reveal that jobs are less efficient than expected). In this section, we provide general guidelines for checking your account’s (and its associated allocation’s) resource usage, both through a GUI-based dashboard and directly at the command line. We also give an overview of how to check current usage patterns on a given cluster, so that you can make an informed choice about which resources to request.

Using the Metrix portal

Alliance Canada maintains a Metrix portal for each of its clusters. Please refer to the Alliance Canada documentation for each cluster (e.g., Rorqual) for the most up-to-date link to the portal.

This is the most user-friendly way to access all of the information we discuss below. However, it is still a good idea to understand how to access this information outside of the Metrix portal: because the data is pulled in real time, the portal may fail to populate if the system is over-subscribed! If there is a known problem with the Metrix portal, it should be documented on https://status.alliancecan.ca.

Tracking resource usage at an account level

To view all users in a given allocation (e.g., rrg-pbellec_gpu), we can run the following, where -l requests the long output format and -a lists all users:

sshare -l --accounts=rrg-pbellec_gpu -a

We can also pass multiple allocations as a comma-separated list. To view only a subset of users within those allocations, replace -a with the -u flag and the relevant user name(s):

sshare -l --accounts=rrg-pbellec_cpu,def-pbellec_cpu -u emdupre
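When an allocation has many users, the long sshare table can be hard to scan. As a minimal sketch, we can ask sshare for pipe-delimited output (its -P flag) and keep only the account, the user, and one column of interest. The helper name sshare_field is hypothetical and not part of Slurm, and the column names are taken from whatever header sshare prints on your cluster, so treat the LevelFS example as an assumption about that header.

```shell
# Hypothetical helper (not part of Slurm): read pipe-delimited sshare output
# on stdin and print "Account User <column>" for every row.
# The column is looked up by name in the header line, so the helper makes
# no assumption about column order.
sshare_field() {
    local want
    want="$1"
    awk -F'|' -v want="$want" '
        NR == 1 {
            # Map each header name to its column index.
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!(want in col)) {
                print "no such column: " want > "/dev/stderr"
                exit 1
            }
            next
        }
        {
            # Account names are indented to show the hierarchy; strip that.
            account = $(col["Account"])
            sub(/^ +/, "", account)
            print account, $(col["User"]), $(col[want])
        }
    '
}

# Example on a cluster (not run here):
#   sshare -l -P --accounts=rrg-pbellec_gpu -a | sshare_field LevelFS
```

Rows that describe the account itself rather than a user have an empty user field, so they show up with a blank in the second position.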

We can interpret each of the returned fields following Alliance Canada’s documentation.

Tracking resource usage at a job level

The Alliance Canada documentation provides many resources for monitoring jobs, including for tracking their resource usage.

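One useful command for getting a detailed record of a job is scontrol. Note that Slurm only keeps this record while the job is pending or running and for a short time after it finishes; for jobs that completed longer ago, accounting tools such as sacct or seff are the better choice.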

scontrol show job -dd <JOBID>
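The scontrol record is printed as a long series of Key=Value pairs, which is awkward to read when we only care about one or two of them. The following is a minimal sketch of pulling a single value out of that output; the helper name job_field is hypothetical, and it assumes the value itself contains no spaces (true for keys such as RunTime and TimeLimit, but not for free-text keys such as Command).

```shell
# Hypothetical helper (not part of Slurm): read `scontrol show job` output
# on stdin and print the value of the first Key=Value pair matching the key.
# Assumes the value contains no whitespace.
job_field() {
    local key
    key="$1"
    # Put every Key=Value pair on its own line, then match on the key.
    tr -s ' \t' '\n' |
        awk -F'=' -v key="$key" '
            $1 == key {
                # Print everything after "key=", so that values which
                # themselves contain "=" (e.g. TRES=cpu=1,...) stay intact.
                print substr($0, length(key) + 2)
                exit
            }
        '
}

# Example on a cluster (not run here):
#   scontrol show job -dd <JOBID> | job_field RunTime
#   scontrol show job -dd <JOBID> | job_field TimeLimit
```

Comparing RunTime against TimeLimit for a running job gives a quick sense of whether the requested time limit was realistic.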

Requesting resources

To request resources more efficiently, we can use the partition-stats command, which is called simply as:

partition-stats

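We can again interpret its output following the Alliance Canada documentation. In brief, the command reports, for each partition, how many jobs are queued, how many jobs are running, how many nodes are idle, and how many nodes are assigned to that partition.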

Estimating start time for a given job

Because Alliance clusters use backfill scheduling, jobs are not started strictly in priority order; the scheduler also takes into account which resources are currently available. The start time for a given job can therefore only be estimated, using:

squeue --start -j <JOBID>
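To turn the reported start time into an expected wait, we can subtract the current time from it. The sketch below assumes GNU date (for its -d option) and a timestamp in the form squeue prints by default, such as 2025-01-31T14:05:00. The helper name seconds_until is hypothetical; its optional second argument overrides "now", which is mainly useful for testing.

```shell
# Hypothetical helper (not part of Slurm): print the number of seconds
# between "now" and an estimated start time.
# Assumes GNU date and timestamps like 2025-01-31T14:05:00.
seconds_until() {
    local start now
    start="$1"
    now="${2:-$(date +%Y-%m-%dT%H:%M:%S)}"
    # squeue prints N/A when the scheduler has no estimate yet.
    if [ "$start" = "N/A" ]; then
        echo "N/A"
        return 0
    fi
    echo $(( $(date -d "$start" +%s) - $(date -d "$now" +%s) ))
}

# Example on a cluster (not run here); -h drops the header and
# -o %S prints only the actual or expected start time:
#   start=$(squeue -h -j <JOBID> -o %S)
#   seconds_until "$start"
```

A negative result means the reported start time is already in the past, which is what we would see for a job that has started running.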

Note, though, that this estimate is not strictly accurate, as it depends on multiple factors, including whether other, currently running jobs requested accurate time limits.