Details about accounting (resource utilization) on the HSUper cluster can be found on this page.

Accounting

Job Output

At the end of each Slurm job, some accounting information is appended to the output file:

########################################
#            Job Accounting            #
########################################
Name                : my-job-name
User                : myuser
Account             : hsuper
Partition           : small
QOS                 : normal
NNodes              : 1
Nodes               : node0267
Cores               : 144 (72 physical)
GPUs                : 0
State               : COMPLETED
ExitCode            : 0:0
Submit              : 2024-02-29T14:28:30
Start               : 2024-02-29T14:28:31
End                 : 2024-02-29T17:55:27
Waited              :    00:00:01
Reserved walltime   :  1-00:00:00
Used walltime       :    03:26:56
Used CPU time       : 10-07:02:27 (Efficiency: 99.48%)
% User (Computation): 99.91%
% System (I/O)      :  0.09%
Mem reserved        : 245000M
Max Mem used        : 205.48M (node0267)
Max Disk Write      : 16.36M (node0267)
Max Disk Read       : 1.14M (node0267)
Energy (CPU+Mem)    : 2.26kWh (0.95kg CO2, 1.17€)

This output is generated by the jobinfo script.
If a GPU node was used, the statistics from the NVIDIA Data Center GPU Manager are also appended.
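
The efficiency figure in the sample output above is consistent with dividing the used CPU time by the used walltime multiplied by the number of physical cores. The following minimal sketch reproduces that arithmetic. The formula is inferred from the sample numbers and is an assumption, not a confirmed description of how jobinfo computes the value; the function names are illustrative, and only durations of the form [D-]HH:MM:SS are handled.

```python
def slurm_duration_to_seconds(text):
    """Convert a Slurm duration of the form [D-]HH:MM:SS to seconds."""
    # rpartition leaves the day part empty when no "D-" prefix is present.
    days, _, clock = text.strip().rpartition("-")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return int(days or 0) * 86400 + hours * 3600 + minutes * 60 + seconds


def cpu_efficiency(cpu_time, walltime, cores):
    """Fraction of the available core time that was actually used."""
    available = slurm_duration_to_seconds(walltime) * cores
    return slurm_duration_to_seconds(cpu_time) / available


# Values taken from the sample output above; 72 is the physical core count
# (assumed to be the divisor, since it matches the reported efficiency).
efficiency = cpu_efficiency("10-07:02:27", "03:26:56", 72)
print(f"{efficiency:.2%}")  # prints 99.48%
```

With the 144 logical cores as the divisor, the same calculation would yield roughly half that value, which suggests that the reported efficiency refers to physical cores.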

Slurm Accounting

Additional metrics can be obtained with the Slurm CLI by running sacct -j <job id> -o <field1,field2,...>, as described in the sacct documentation.
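
As a rough illustration of the sacct call above, the sketch below builds the command, runs it, and parses the result into dictionaries. It assumes the Slurm client tools are available on the machine and uses the --parsable2 option so that columns are separated by "|". The helper names and the job ID in the usage note are placeholders, not part of the cluster's tooling.

```python
import subprocess


def build_sacct_command(job_id, fields):
    """Build the sacct argument list for one job and a list of field names."""
    # --parsable2 separates columns with "|" and adds no trailing delimiter.
    return ["sacct", "-j", str(job_id), "-o", ",".join(fields), "--parsable2"]


def parse_sacct_output(text):
    """Turn --parsable2 output (header line plus rows) into a list of dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split("|")
    return [dict(zip(header, line.split("|"))) for line in lines[1:]]


def query_job(job_id, fields):
    """Run sacct; requires a machine with access to the Slurm accounting data."""
    command = build_sacct_command(job_id, fields)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return parse_sacct_output(result.stdout)
```

For example, query_job(123456, ["JobID", "Elapsed", "TotalCPU", "MaxRSS", "ConsumedEnergy"]) would return one dictionary per job step, where 123456 stands in for a real job ID. Separating command construction from parsing keeps the parsing logic usable on saved sacct output as well.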

Energy Consumption

⚠ Note: The current implementation of the AcctGatherEnergy RAPL plugin has a bug. As a result, the reported energy values (covering CPU and DRAM) are significantly inflated. They should not be taken as absolute figures, but only used to compare the energy intensity of jobs running on the HSUper cluster.

The raw RAPL measurements can be obtained manually using the Linux Power Capping Framework. All relevant files are located in /sys/devices/virtual/powercap/intel-rapl/ with the following subdirectories mapping to individual RAPL domains:

Subdirectory                       Domain      Explanation
intel-rapl\:0                      package-0   Energy consumption of CPU #1
intel-rapl\:0/intel-rapl\:0\:0/    dram        Energy consumption of the DRAM for CPU #1
intel-rapl\:1                      package-1   Energy consumption of CPU #2
intel-rapl\:1/intel-rapl\:1\:0/    dram        Energy consumption of the DRAM for CPU #2

For each domain, the name file holds the name of the corresponding RAPL domain, the energy_uj file holds the current value of the counter in microjoules, and the max_energy_range_uj file holds the maximum value of this counter. Note that these counters overflow rather quickly, so they must be read more than once per minute, and any calculation based on them must account for the wrap-around.
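
The polling and overflow handling described above can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes that at most one wrap-around happens between two consecutive readings and that the counter restarts near zero after reaching max_energy_range_uj. The function names and the 10-second polling interval are arbitrary choices for this example.

```python
import time
from pathlib import Path

RAPL_ROOT = Path("/sys/devices/virtual/powercap/intel-rapl")


def read_counter(domain_dir, filename):
    """Read one integer counter file (energy_uj or max_energy_range_uj)."""
    return int((Path(domain_dir) / filename).read_text().strip())


def energy_delta_uj(previous, current, max_range):
    """Energy between two readings in microjoules, allowing for one wrap."""
    if current >= previous:
        return current - previous
    # Assumption: the counter wrapped exactly once and restarted near zero.
    return (max_range - previous) + current


def measure_joules(domain_dir, duration_s, interval_s=10.0):
    """Accumulate the energy of one domain by polling faster than it wraps."""
    max_range = read_counter(domain_dir, "max_energy_range_uj")
    previous = read_counter(domain_dir, "energy_uj")
    total_uj = 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        time.sleep(interval_s)
        current = read_counter(domain_dir, "energy_uj")
        total_uj += energy_delta_uj(previous, current, max_range)
        previous = current
    return total_uj / 1e6  # microjoules to joules
```

A call such as measure_joules(RAPL_ROOT / "intel-rapl:0", 60) would sample the package-0 domain for about one minute. The backslashes in the table above are shell escapes and are not part of the directory name when the path is used from Python. Depending on the system configuration, reading energy_uj may require elevated privileges.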

Additional Resources

Additional information about RAPL can be found in the following resources: