
Slurm also records energy measurements using the RAPL interface.[1] The corresponding energy consumption is reported in the job output as shown on the previous page.
To obtain the raw RAPL measurements, one can use the Linux Power Capping Framework through the file system.
All relevant attributes are located under /sys/devices/virtual/powercap/intel-rapl/ with the following subdirectories mapping to individual RAPL domains:
| Subdirectory | Domain | Explanation |
|---|---|---|
| intel-rapl\:0 | package-0 | Energy consumption of CPU #1 |
| intel-rapl\:0/intel-rapl\:0\:0/ | dram | Energy consumption of the DRAM for CPU #1 |
| intel-rapl\:1 | package-1 | Energy consumption of CPU #2 |
| intel-rapl\:1/intel-rapl\:1\:0/ | dram | Energy consumption of the DRAM for CPU #2 |
For each domain, the name file holds the name of the corresponding RAPL domain, the energy_uj file holds the current value of the counter in microjoules (µJ), and the max_energy_range_uj file holds the maximum value of this counter.
Note that these counters overflow rather quickly, so they must be read more than once per minute, and any code reading them must account for the wrap-around.
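The file layout and the overflow handling described above can be sketched in Python as follows. This is a minimal illustration, not part of any provided tooling: the helper names `read_domain`, `energy_delta_uj`, and `measure_domain` are hypothetical, and the wrap-around formula assumes that the counter wraps at most once between two readings and restarts from zero after reaching `max_energy_range_uj`.

```python
import time
from pathlib import Path

# Root of the powercap tree named above. The backslashes in the table are
# shell escapes; the actual directory names are e.g. "intel-rapl:0".
RAPL_ROOT = Path("/sys/devices/virtual/powercap/intel-rapl")


def read_domain(domain_dir):
    """Return (name, energy_uj, max_energy_range_uj) for one RAPL domain."""
    d = Path(domain_dir)
    name = (d / "name").read_text().strip()
    energy_uj = int((d / "energy_uj").read_text())
    max_range_uj = int((d / "max_energy_range_uj").read_text())
    return name, energy_uj, max_range_uj


def energy_delta_uj(previous_uj, current_uj, max_range_uj):
    """Energy between two readings, assuming at most one wrap-around."""
    if current_uj >= previous_uj:
        return current_uj - previous_uj
    # Counter wrapped: range left before the wrap plus the value after it.
    return (max_range_uj - previous_uj) + current_uj


def measure_domain(domain_dir, seconds, sleep=time.sleep):
    """Return (name, joules) consumed by one domain over `seconds`."""
    name, before_uj, max_range_uj = read_domain(domain_dir)
    sleep(seconds)
    _, after_uj, _ = read_domain(domain_dir)
    return name, energy_delta_uj(before_uj, after_uj, max_range_uj) / 1e6
```

For example, `measure_domain(RAPL_ROOT / "intel-rapl:0", 1.0)` would report the energy of the first CPU package over one second. For longer measurements, the counter has to be sampled repeatedly and the deltas summed, since a single pair of readings cannot detect more than one wrap.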
However, reading the RAPL measurements manually is not trivial. To get the energy consumption of a specific command in your job (running on a single node), proceed as follows:
First, install the Python package Energy Monitoring Tool (EMT) as well as additional required packages in a virtual environment.
# Install Python dependencies
ml python/3.11 cuda/12
python -m venv .venv
source .venv/bin/activate
pip install numba-cuda[cu12]
# Install EMT
pip install emt
# ...or from source
# git clone https://github.com/FairCompute/energy-monitoring-tool emt
# cd emt; pip install . ; cd ..
Then, inside your job, wrap your command in a measurement to determine the energy it consumed.[2]
ml python/3.11 cuda/12
python -m venv .venv
source .venv/bin/activate
# Required to account for new processes being spawned
export EMT_RELOAD_PROCS=1
# TODO: Replace with command to measure
CMD="..."
# Run command and measure energy consumption
python3 - <<EOF
import emt, json, os
with emt.EnergyMonitor() as m:
    os.system("""$CMD""")
e = m.consumed_energy | dict(total_J=m.total_consumed_energy)
print(json.dumps(e, indent=3))
EOF
(Python-based ML or data processing workloads may use the EnergyMonitor context manager directly.)
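As a rough sketch of such direct use, the helper below wraps an arbitrary Python callable in the same `EnergyMonitor` context manager used in the job script. Only `emt.EnergyMonitor`, `consumed_energy`, and `total_consumed_energy` are taken from the script above; the function name `measure_energy` and the `monitor_factory` parameter are hypothetical and exist only so that the helper can be exercised without EMT installed.

```python
import json


def measure_energy(workload, monitor_factory=None):
    """Run workload() inside an energy monitor and return (result, report)."""
    if monitor_factory is None:
        # Imported lazily so the module loads even where EMT is missing.
        import emt
        monitor_factory = emt.EnergyMonitor
    with monitor_factory() as m:
        result = workload()
    # Same report layout as in the job script: per-source values plus a total.
    report = dict(m.consumed_energy) | {"total_J": m.total_consumed_energy}
    return result, report


def format_report(report):
    """Render the report as indented JSON, as the job script does."""
    return json.dumps(report, indent=3)
```

A training loop or data processing step would then be passed as `workload`, for example `measure_energy(lambda: train(model))`, where `train` and `model` stand in for your own code.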
Alternatively, you can use pyRAPL in a similar fashion; however, it does not support measuring GPU energy.
Since RAPL does not capture the full energy consumption (it excludes I/O, networking, etc.), custom software is provided to measure the input power using IPMI. The following commands can be used to measure the total power/energy consumption of the corresponding HSUper compute nodes:
# MPI is required (even for measuring non-MPI applications)
ml intel-oneapi-mpi
POWER_LOG=job-power-$SLURM_JOB_ID.csv
# Measure power at 3 sec interval
power-watch start 3000 $POWER_LOG
# TODO: Replace with command to measure
srun ...
# Stop the (latest) measurement
power-watch stop
# Compute and output the total energy consumption
power-watch aggregate $POWER_LOG
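To illustrate how periodic power samples relate to a total energy figure, the following sketch integrates power over time with the trapezoidal rule. It is not the implementation of `power-watch aggregate`, and it makes no assumption about the layout of the CSV log; the function name `energy_joules` and the `(time_s, power_w)` sample format are hypothetical.

```python
def energy_joules(samples):
    """Integrate (time_s, power_w) samples, sorted by time, into joules."""
    samples = list(samples)
    total = 0.0
    # Trapezoidal rule: mean power of two neighbouring samples times the
    # time elapsed between them.
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


def joules_to_kwh(joules):
    """Convert joules to kilowatt-hours (1 kWh = 3.6e6 J)."""
    return joules / 3.6e6
```

With samples taken every 3 seconds, such as `[(0, 100.0), (3, 100.0), (6, 200.0)]`, the first interval contributes 300 J and the second 450 J, for a total of 750 J.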
[1] ⚠ Note: On HSUper, the Slurm AcctGatherEnergy RAPL plugin is used. Earlier versions of this plugin have a bug, and the implementation of the RAPL interface differs between architectures. Therefore, energy measurements (capturing the CPU and DRAM) may vary significantly between different hardware platforms and cannot be directly compared or taken as the true energy consumption of the system.
[2] ⚠ Note: This only works for single-node jobs.