Energy Measurement

RAPL (CPU and DRAM Energy)

Slurm also records energy measurements using the RAPL interface.¹ The corresponding energy consumption is reported in the job output, as shown on the previous page.

Manual Usage

To obtain the raw RAPL measurements, you can use the Linux Power Capping Framework through the file system. All relevant attributes are located under /sys/devices/virtual/powercap/intel-rapl/, with the following subdirectories mapping to individual RAPL domains:

Subdirectory	Domain	Explanation
`intel-rapl\:0`	package-0	Energy consumption of CPU #1
`intel-rapl\:0/intel-rapl\:0\:0/`	dram	Energy consumption of the DRAM for CPU #1
`intel-rapl\:1`	package-1	Energy consumption of CPU #2
`intel-rapl\:1/intel-rapl\:1\:0/`	dram	Energy consumption of the DRAM for CPU #2

For each domain, the name file holds the name of the corresponding RAPL domain, the energy_uj file holds the current value of the counter in microjoules (µJ), and the max_energy_range_uj file holds the maximum value of this counter. Note that these counters overflow rather quickly, so they must be read more than once per minute, and any wraparound between two consecutive readings must be accounted for.
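The wraparound handling described above can be sketched in Python as follows. This is a minimal illustration, not a provided tool: the helper names `read_int`, `sample_domain`, and `energy_delta_uj` are hypothetical, and the sketch assumes that at most one overflow occurs between two readings and that the counter restarts from zero after reaching max_energy_range_uj.

```python
from pathlib import Path

# Root of the RAPL powercap tree named in the text above.
RAPL_ROOT = Path("/sys/devices/virtual/powercap/intel-rapl")


def read_int(path):
    """Read a sysfs attribute that holds a single integer."""
    return int(Path(path).read_text().strip())


def sample_domain(domain_dir):
    """Return (name, energy_uj, max_energy_range_uj) for one RAPL domain."""
    d = Path(domain_dir)
    name = (d / "name").read_text().strip()
    return name, read_int(d / "energy_uj"), read_int(d / "max_energy_range_uj")


def energy_delta_uj(before_uj, after_uj, max_range_uj):
    """Energy in microjoules between two readings.

    Assumption: the counter wrapped at most once between the readings.
    """
    if after_uj >= before_uj:
        return after_uj - before_uj
    # The later reading is smaller, so the counter overflowed in between.
    return after_uj + max_range_uj - before_uj
```

For example, calling `sample_domain(RAPL_ROOT / "intel-rapl:0")` twice a few seconds apart and passing both energy values, together with the reported maximum, to `energy_delta_uj` gives the package-0 energy for that interval. Dividing by 1e6 converts the result to joules.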

EMT (RAPL + GPU Energy)

Manually reading the RAPL measurements is not trivial. To measure the energy consumption of a specific command in your job (running on a single node), proceed as follows:

First, install the Python package Energy Monitoring Tool (EMT) as well as additional required packages in a virtual environment.

# Install Python dependencies
ml python/3.11 cuda/12
python -m venv .venv
source .venv/bin/activate
# Quoted so that the shell does not interpret the brackets as a glob pattern
pip install "numba-cuda[cu12]"
# Install EMT
pip install emt
# ...or from source
# git clone https://github.com/FairCompute/energy-monitoring-tool emt
# cd emt; pip install . ; cd ..

Then, inside your job, wrap your command in a measurement to determine the energy it consumed.²

ml python/3.11 cuda/12
python -m venv .venv
source .venv/bin/activate
# Required to account for new processes being spawned
export EMT_RELOAD_PROCS=1
# TODO: Replace with command to measure
CMD="..."
# Run command and measure energy consumption
python3 - <<EOF
import emt, json, os
with emt.EnergyMonitor() as m:
   os.system("""$CMD""")
   e = m.consumed_energy | dict(total_J=m.total_consumed_energy)
   print(json.dumps(e, indent=3))
EOF

(Python-based ML or data processing workloads may use the EnergyMonitor context manager directly.)
Alternatively, you can use pyRAPL in a similar fashion; note, however, that it does not support measuring GPU energy.
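For Python workloads, direct use of the EnergyMonitor context manager might look like the sketch below. It relies only on the attributes that appear in the shell snippet above (`consumed_energy` and `total_consumed_energy`) and assumes that `consumed_energy` behaves like a dictionary. The helper `measure_energy` and its `monitor_factory` parameter are hypothetical names introduced here for illustration and are not part of EMT.

```python
def measure_energy(workload, monitor_factory=None):
    """Run workload() inside an energy monitor and return (result, report)."""
    if monitor_factory is None:
        # Imported lazily so the helper can be exercised without EMT installed.
        import emt
        monitor_factory = emt.EnergyMonitor
    with monitor_factory() as monitor:
        result = workload()
        # Same attributes as in the shell snippet above; consumed_energy is
        # assumed to be dict-like.
        report = dict(monitor.consumed_energy)
        report["total_J"] = monitor.total_consumed_energy
    return result, report
```

A training function `train` (a placeholder for your own code) could then be measured with `result, report = measure_energy(train)` and the report printed with `json.dumps(report, indent=3)`, as in the shell variant.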

IPMI (Total System Power)

Since RAPL does not capture the full energy consumption of a node (I/O, networking, etc. are not included), custom software is provided to measure the input power using IPMI. The following commands can be used to measure the total power/energy consumption of the corresponding HSUper compute nodes:

# MPI is required (even for measuring non-MPI applications)
ml intel-oneapi-mpi
POWER_LOG=job-power-$SLURM_JOB_ID.csv
# Measure power at 3 sec interval
power-watch start 3000 $POWER_LOG
# TODO: Replace with command to measure
srun ...
# Stop the (latest) measurement
power-watch stop
# Compute and output the total energy consumption
power-watch aggregate $POWER_LOG
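Conceptually, turning periodic power samples into an energy value means integrating power over time. The sketch below illustrates that idea with the trapezoidal rule. It is not the implementation of power-watch aggregate, and it does not parse the power-watch CSV, whose layout is not described here; the function name `integrate_power` and its list-based inputs are assumptions made for illustration.

```python
def integrate_power(timestamps_s, power_w):
    """Approximate energy in joules from power samples (trapezoidal rule).

    timestamps_s: sample times in seconds, in increasing order
    power_w:      power readings in watts, one per timestamp
    """
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    energy_j = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        # Average of two neighbouring readings times the elapsed time.
        energy_j += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return energy_j
```

For instance, three readings of 100 W taken 3 seconds apart, `integrate_power([0, 3, 6], [100, 100, 100])`, yield 600 J. A shorter sampling interval follows rapid load changes more closely at the cost of more frequent IPMI queries.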

¹ Note: On HSUper, the Slurm AcctGatherEnergy RAPL plugin is used. Earlier versions of this plugin have a bug, and the implementation of the RAPL interface differs between architectures. Therefore, energy measurements (capturing the CPU and DRAM) may vary significantly between hardware platforms and cannot be directly compared or taken as the true energy consumption of the system.
² Note: This only works for single-node jobs.

HSUper Documentation
