At the end of each Slurm job, accounting information is appended to the job's output file:
########################################
# Job Accounting #
########################################
Name : my-job-name
User : myuser
Account : hsuper
Partition : small
QOS : normal
NNodes : 1
Nodes : node0267
Cores : 144 (72 physical)
GPUs : 0
State : COMPLETED
ExitCode : 0:0
Submit : 2024-02-29T14:28:30
Start : 2024-02-29T14:28:31
End : 2024-02-29T17:55:27
Waited : 00:00:01
Reserved walltime : 1-00:00:00
Used walltime : 03:26:56
Used CPU time : 10-07:02:27 (Efficiency: 99.48%)
% User (Computation): 99.91%
% System (I/O) : 0.09%
Mem reserved : 245000M
Max Mem used : 205.48M (node0267)
Max Disk Write : 16.36M (node0267)
Max Disk Read : 1.14M (node0267)
Energy (CPU+Mem) : 2.26kWh (0.95kg CO2, 1.17€)
This output is generated by the jobinfo script.
If a GPU node was used, the statistics from the NVIDIA Data Center GPU Manager are also appended.
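The efficiency figure in the sample output above can be checked by hand. The sketch below assumes that efficiency is the used CPU time divided by the used walltime multiplied by the number of physical cores. The jobinfo script does not state this formula here, but it reproduces the 99.48% shown in the sample. The function names are hypothetical.

```python
def slurm_time_to_seconds(text: str) -> int:
    """Convert a Slurm duration of the form [D-]HH:MM:SS to seconds."""
    days, _, clock = text.rpartition("-")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return int(days or 0) * 86400 + hours * 3600 + minutes * 60 + seconds


def cpu_efficiency(cpu_time: str, walltime: str, cores: int) -> float:
    """Return CPU efficiency in percent.

    Assumption: efficiency = CPU time / (walltime * cores), where
    `cores` is the number of physical cores of the job.
    """
    used = slurm_time_to_seconds(cpu_time)
    available = cores * slurm_time_to_seconds(walltime)
    return 100.0 * used / available


if __name__ == "__main__":
    # Values taken from the sample accounting output (72 physical cores).
    print(f"{cpu_efficiency('10-07:02:27', '03:26:56', 72):.2f}%")
```

Using all 144 logical cores instead of the 72 physical ones would roughly halve the result, so the sample is only consistent with the physical core count.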
Additional metrics can be obtained with the Slurm command-line tool sacct, using sacct -j <job id> -o <field1,field2,...>
as documented in the sacct manual page.
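A query like the one above can also be scripted. The following is a minimal sketch, assuming sacct is available on the PATH. The helper names are hypothetical, and the field list (JobID, State, Elapsed, MaxRSS) is only an illustration; any field supported by sacct can be requested. The --parsable2 and --noheader options make sacct print pipe-separated values without a header line, which is easier to parse than the default table.

```python
import subprocess


def build_sacct_command(job_id, fields):
    """Build the sacct argument list for one job and a list of fields."""
    return [
        "sacct",
        "-j", str(job_id),
        "-o", ",".join(fields),
        "--parsable2",   # pipe-separated output, no trailing delimiter
        "--noheader",    # omit the header line
    ]


def parse_sacct_output(text, fields):
    """Turn pipe-separated sacct output into one dict per job step."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            rows.append(dict(zip(fields, line.split("|"))))
    return rows


def query_job(job_id, fields):
    """Run sacct and return the parsed rows (requires a Slurm installation)."""
    result = subprocess.run(
        build_sacct_command(job_id, fields),
        capture_output=True, text=True, check=True,
    )
    return parse_sacct_output(result.stdout, fields)


if __name__ == "__main__":
    # 123456 is a placeholder job id, not a real job.
    for row in query_job(123456, ["JobID", "State", "Elapsed", "MaxRSS"]):
        print(row)
```

sacct usually reports one line per job step (for example the batch step), so the result is a list rather than a single record.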
⚠ Note: The current implementation of the AcctGatherEnergy RAPL plugin has a bug. As a result, the reported energy measurements (covering CPU and DRAM) are significantly inflated. Do not use them as absolute values; use them only to compare the energy intensity of jobs running on the HSUper cluster.
The raw RAPL measurements can be obtained manually using the Linux Power Capping Framework.
All relevant files are located in /sys/devices/virtual/powercap/intel-rapl/, with the following subdirectories mapping to individual RAPL domains:
Subdirectory | Domain | Explanation |
---|---|---|
intel-rapl\:0 | package-0 | Energy consumption of CPU #1 |
intel-rapl\:0/intel-rapl\:0\:0/ | dram | Energy consumption of the DRAM for CPU #1 |
intel-rapl\:1 | package-1 | Energy consumption of CPU #2 |
intel-rapl\:1/intel-rapl\:1\:0/ | dram | Energy consumption of the DRAM for CPU #2 |
For each domain, the name file holds the name of the corresponding RAPL domain, the energy_uj file holds the current value of the counter in microjoules (µJ), and the max_energy_range_uj file holds the maximum value of this counter.
Note that these counters overflow rather quickly, so they must be read more than once per minute, and any code that reads them must account for the wraparound.
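The reading procedure described above can be sketched in Python. This is a minimal sketch, not the method used by jobinfo or Slurm. The helper names (find_domains, energy_delta_uj, measure_uj) are hypothetical, and the code assumes that a counter restarts near zero after reaching max_energy_range_uj and wraps at most once between two reads. Depending on the system configuration, reading energy_uj may require elevated privileges.

```python
import os
import time

RAPL_ROOT = "/sys/devices/virtual/powercap/intel-rapl/"


def read_text(path):
    """Return the stripped content of a small sysfs-style text file."""
    with open(path) as handle:
        return handle.read().strip()


def find_domains(root=RAPL_ROOT):
    """Map each directory holding a counter to its RAPL domain name."""
    domains = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        if "name" in filenames and "energy_uj" in filenames:
            domains[dirpath] = read_text(os.path.join(dirpath, "name"))
    return domains


def energy_delta_uj(previous, current, max_range):
    """Energy used between two reads in microjoules, handling one wraparound.

    Assumption: the counter wraps to zero after max_range, so a smaller
    current value means exactly one overflow happened in between.
    """
    if current >= previous:
        return current - previous
    return current + max_range - previous


def measure_uj(domain_dir, interval_s=10.0, samples=3):
    """Sum the energy of one domain over `samples` polling intervals."""
    max_range = int(read_text(os.path.join(domain_dir, "max_energy_range_uj")))
    counter = os.path.join(domain_dir, "energy_uj")
    previous = int(read_text(counter))
    total = 0
    for _ in range(samples):
        time.sleep(interval_s)  # keep this well below the overflow period
        current = int(read_text(counter))
        total += energy_delta_uj(previous, current, max_range)
        previous = current
    return total


if __name__ == "__main__":
    for path, name in sorted(find_domains().items()):
        joules = measure_uj(path, interval_s=1.0, samples=3) / 1e6
        print(f"{name:12s} {path}: {joules:.2f} J in 3 s")
```

Polling at a short, fixed interval keeps the single-wraparound assumption valid; if two reads are too far apart, an unknown number of overflows may be lost and the result underestimates the real consumption.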
Additional information about RAPL can be found in the following resources: