Measuring AI cluster energy performance beyond PUE
- Cedric KTORZA
- Oct 22
- 7 min read

Measuring the energy performance of an AI cluster beyond classic PUE: this article shows exactly how.
AI infrastructure pushes power and cooling to their limits, and Power Usage Effectiveness (PUE) alone no longer tells you if your compute is energy-efficient. In this guide, we explain practical, decision-ready metrics for AI clusters—from tokens-per-kWh and accelerator utilization to carbon- and water-aware KPIs—plus how to instrument, analyze, and act on them. At Score Group, we combine energy, digital infrastructure, and new tech to help you make efficiency a competitive edge, not an afterthought.
In brief
Move beyond PUE with layered metrics: compute, workload, system, facility, and carbon.
Instrument GPUs/accelerators, PDUs, cooling, and jobs with time-aligned telemetry.
Normalize by business output (e.g., tokens/kWh, accuracy-per-kWh) to guide trade-offs.
Optimize via utilization, cooling upgrades, scheduling, model and software efficiency.
Govern with dashboards and standards (ISO 30134, EU Code of Conduct, MLCommons).
Why PUE is not enough for AI clusters
PUE measures facility overhead versus IT power, but says nothing about how effectively compute power turns into business output. In AI clusters, accelerators dominate the power budget, workloads are bursty, and interconnects plus cooling strategies shape actual efficiency. A site can have an excellent PUE and still waste megawatt-hours on idle GPUs or suboptimal jobs.
What PUE captures—and misses
Captures: Relative facility efficiency (power to IT vs. total power) and common improvements (containment, UPS efficiency).
Misses: Accelerator utilization, software stack efficiency, model architecture choices, dataset and training strategy, inference throughput, and carbon intensity of power supply.
Risk: Optimizing PUE while leaving 30–50% of GPU cycles underutilized yields low real productivity-per-kWh.
For a primer on PUE and its standardization, see ISO/IEC 30134-2 and the Energy Star overview.
A multi-metric framework for AI energy performance
Think in layers. Each layer answers a different decision-making question and together they form a coherent picture.
Compute layer: hardware efficiency
Accelerator utilization (%): Are GPUs/TPUs busy doing useful work?
Perf/W at the chip level: FLOPS/W or tokens/s/W for representative kernels.
Memory and interconnect: HBM bandwidth use, NVLink/InfiniBand counters to expose bottlenecks.
How to measure
NVIDIA SMI, AMD ROCm SMI, and vendor telemetry APIs; Redfish/IPMI for server power.
Link: https://docs.nvidia.com/deploy/nvidia-smi/index.html and https://www.dmtf.org/standards/redfish
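As a concrete starting point, per-accelerator power and utilization can be polled directly from nvidia-smi. The sketch below is a minimal example, assuming NVIDIA GPUs with nvidia-smi on the PATH; the 5-second interval and the printed output are placeholders for whatever collection pipeline you actually use.

```python
import csv
import io
import subprocess
import time

QUERY = "index,timestamp,power.draw,utilization.gpu,memory.used"

def sample_gpus():
    """Poll nvidia-smi once and return a list of per-GPU readings."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    rows = []
    for idx, ts, watts, util, mem in csv.reader(io.StringIO(out)):
        rows.append({
            "gpu": int(idx),
            "timestamp": ts.strip(),
            "power_w": float(watts),
            "util_pct": float(util),
            "mem_used_mib": float(mem),
        })
    return rows

if __name__ == "__main__":
    # Sample every 5 seconds; in production, push readings into your telemetry pipeline.
    while True:
        for reading in sample_gpus():
            print(reading)
        time.sleep(5)
```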
Workload layer: output per unit energy
Tokens-per-kWh (inference): Total generated tokens divided by energy at the node/rack boundary.
Samples/s/W or training energy-to-target-accuracy: kWh required to reach a defined accuracy or perplexity.
SLA-aware metrics: Energy per request at p99 latency.
Why it matters
Converts engineering trade-offs into business terms; compares models, precisions (FP8/FP16), quantization and sparsity strategies on equal footing.
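Both workload metrics reduce to simple ratios once tokens (from application logs) and energy (from PDU interval data) are tagged to the same window. A minimal sketch, with hypothetical figures:

```python
def tokens_per_kwh(total_tokens: int, energy_kwh: float) -> float:
    """Workload-level efficiency: generated tokens per kWh at the chosen boundary."""
    if energy_kwh <= 0:
        raise ValueError("energy must be positive")
    return total_tokens / energy_kwh

def energy_to_target_accuracy(power_samples_w, interval_s: float, reached_target: bool) -> float:
    """Integrate sampled power (W) into kWh for a training run that hit its accuracy target."""
    if not reached_target:
        raise ValueError("target accuracy not reached; energy figure would not be comparable")
    return sum(power_samples_w) * interval_s / 3_600_000  # W*s -> kWh

# Example: 1.2e9 completion tokens served while the rack PDU logged 850 kWh.
print(f"{tokens_per_kwh(1_200_000_000, 850):,.0f} tokens/kWh")
```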
System layer: rack/pod efficiency and sustainability
ITUE/TUE-like views: How much of the IT power actually reaches the accelerators versus being lost to platform overhead.
WUE (Water Usage Effectiveness) and CUE (Carbon Usage Effectiveness) from The Green Grid.
ERF (Energy Reuse Factor): Quantify useful heat recovery.
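For reference, these sustainability ratios are straightforward to compute once the meters are in place. The helpers below follow the commonly cited Green Grid definitions; treat the exact measurement boundaries and reporting periods as assumptions to confirm against ISO/IEC 30134.

```python
def wue(site_water_liters: float, it_energy_kwh: float) -> float:
    """Water Usage Effectiveness: liters of site water per kWh of IT energy."""
    return site_water_liters / it_energy_kwh

def cue(total_co2_kg: float, it_energy_kwh: float) -> float:
    """Carbon Usage Effectiveness: kgCO2e from facility energy per kWh of IT energy."""
    return total_co2_kg / it_energy_kwh

def erf(energy_reused_kwh: float, total_facility_energy_kwh: float) -> float:
    """Energy Reuse Factor: share of total facility energy exported for reuse (0..1)."""
    return energy_reused_kwh / total_facility_energy_kwh
```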
Facility layer: power and cooling in real conditions
PUE at part-load and per zone: AI pods often run at 30–80 kW/rack; local PUE can differ from site average.
Thermal KPIs: Delta-T, containment effectiveness, supply temperature compliance (ASHRAE).
Cooling capability per rack: Air vs. rear-door heat exchangers vs. direct liquid cooling.
References
https://www.opencompute.org/blog/direct-liquid-cooling-and-ocp
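Pod-local PUE follows the same quotient as site PUE, applied at a smaller boundary. The sketch below is illustrative; how cooling and distribution losses are apportioned to a pod (sub-metered CDU/CRAH circuits versus pro-rata allocation) is a site-specific assumption, and the example figures are hypothetical.

```python
def pue(total_facility_kwh: float, it_kwh: float) -> float:
    """Classic PUE over a reporting interval."""
    return total_facility_kwh / it_kwh

def pod_local_pue(pod_it_kwh: float, pod_cooling_kwh: float, pod_power_loss_kwh: float) -> float:
    """Pod-level view: IT energy plus the cooling and distribution losses attributed to the pod."""
    return (pod_it_kwh + pod_cooling_kwh + pod_power_loss_kwh) / pod_it_kwh

# Example: a high-density AI pod can show a different local PUE than the site average at part load.
print(round(pod_local_pue(pod_it_kwh=4_200, pod_cooling_kwh=520, pod_power_loss_kwh=130), 2))
```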
Carbon layer: time- and location-aware
Carbon intensity (gCO2e/kWh) at job time: Run flexible workloads when the grid is greener.
Scope 2 market-based emissions with hourly matching where available.
Reference
https://www.iea.org/reports/data-centres-and-data-transmission-networks
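Hourly matching is just a time-aligned sum. The sketch below assumes you already have the job's hourly energy (from PDU or GPU telemetry tagged to the job) and an hourly grid-intensity series from a public API or utility feed; both inputs are placeholders.

```python
def job_emissions_gco2e(hourly_energy_kwh: dict, hourly_intensity_g_per_kwh: dict) -> float:
    """Time-matched (hourly) Scope 2 estimate: sum of energy x grid carbon intensity per hour.
    Both dicts are keyed by the hour, e.g. an ISO timestamp truncated to the hour."""
    total = 0.0
    for hour, kwh in hourly_energy_kwh.items():
        intensity = hourly_intensity_g_per_kwh.get(hour)
        if intensity is None:
            raise KeyError(f"no carbon intensity available for {hour}")
        total += kwh * intensity
    return total
```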
“Measure what matters, at the granularity your decisions require.”
Instrumentation: getting the data you need
Good metrics require trustworthy, time-aligned telemetry that spans chips to chillers.
Telemetry sources and signals
Accelerators and servers: Per-GPU power, clocks, utilization, memory bandwidth (nvidia-smi/rocm-smi), CPU package power (RAPL), NIC counters.
Electrical: Smart PDUs, branch circuit meters, UPS input/output, generators.
Cooling: Chilled water flow and temperatures, CRAH/CRAC power, pump and fan VFDs, CDU heat rejection.
Environmental: Rack inlet temperature, humidity, differential pressure.
Job context: Workload tags (model, precision, batch size), SLA class, job start/stop.
Standards and tools
Redfish for server/fleet telemetry; OpenTelemetry for metric transport and tracing.
Links: https://www.dmtf.org/standards/redfish and https://opentelemetry.io/
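To show how these pieces fit together, here is a minimal OpenTelemetry sketch that exports per-GPU power as an observable gauge, assuming the opentelemetry-api and opentelemetry-sdk packages are installed. The metric name, attributes, and the stubbed readings are illustrative choices, not a standard schema; in practice the callback would call a real sampler such as the nvidia-smi example earlier.

```python
from opentelemetry import metrics
from opentelemetry.metrics import CallbackOptions, Observation
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader

# Console exporter for illustration; swap in an OTLP exporter to feed your real pipeline.
reader = PeriodicExportingMetricReader(ConsoleMetricExporter(), export_interval_millis=10_000)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
meter = metrics.get_meter("ai-cluster-energy")

def observe_gpu_power(options: CallbackOptions):
    # Replace this stub with real readings from your GPU sampler.
    for gpu_index, watts in enumerate([412.0, 405.5, 398.2, 410.9]):
        yield Observation(watts, {"gpu": str(gpu_index), "rack": "A01"})

meter.create_observable_gauge(
    "gpu.power.draw",
    callbacks=[observe_gpu_power],
    unit="W",
    description="Per-GPU power draw sampled at each collection",
)
```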
Time alignment and normalization
Synchronize clocks (NTP/PTP) across IT and facility systems.
Aggregate at meaningful boundaries: per-GPU, per-node, per-rack, per-pod.
Normalize by output (tokens, accuracy, jobs completed) and by time-of-day carbon intensity.
Maintain baselines for A/B testing: model version, driver/container versions, and cooling setpoints.
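Time alignment is mostly careful bookkeeping. The pandas sketch below aligns GPU samples with the most recent PDU reading and converts rack power into per-minute energy; all timestamps and values are hypothetical, and the tokens series would come from your application logs.

```python
import pandas as pd

# Hypothetical inputs: per-GPU samples and PDU readings, both with UTC timestamps.
gpu = pd.DataFrame({
    "ts": pd.to_datetime(["2024-05-01 10:00:02", "2024-05-01 10:00:07"], utc=True),
    "power_w": [410.0, 415.5],
})
pdu = pd.DataFrame({
    "ts": pd.to_datetime(["2024-05-01 10:00:00", "2024-05-01 10:00:05"], utc=True),
    "rack_power_w": [29_500.0, 29_720.0],
})

# Align each GPU sample with the most recent PDU reading; the tolerance guards against stale data.
aligned = pd.merge_asof(
    gpu.sort_values("ts"), pdu.sort_values("ts"),
    on="ts", direction="backward", tolerance=pd.Timedelta(seconds=10),
)

# Normalize: resample rack power into 1-minute energy (kWh) for output-per-kWh ratios.
rack_kwh_per_min = pdu.set_index("ts")["rack_power_w"].resample("1min").mean() / 1000 / 60
# tokens_per_min would come from application logs indexed the same way:
# tokens_per_kwh = tokens_per_min / rack_kwh_per_min
```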
Validating measurements
Cross-check chip-reported watts with PDU circuits during controlled load sweeps.
Detect drift and outliers with automated sanity checks (e.g., power > nameplate, or utilization at 0% with high power = misconfiguration).
Keep metadata on firmware versions, card seating, and cabling, since all of these can affect performance.
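A few automated checks catch most bad telemetry before it pollutes dashboards. The thresholds below are placeholders to tune per GPU SKU and platform; the sample format matches the nvidia-smi sketch shown earlier.

```python
NAMEPLATE_W = 700.0            # assumption: per-GPU board power limit from the datasheet
IDLE_UTIL_PCT = 1.0
SUSPICIOUS_IDLE_POWER_W = 150.0

def sanity_check(sample: dict) -> list[str]:
    """Flag obviously bad or suspicious telemetry before it reaches dashboards."""
    issues = []
    if sample["power_w"] > NAMEPLATE_W:
        issues.append("reported power exceeds nameplate: check meter scaling or firmware")
    if sample["util_pct"] <= IDLE_UTIL_PCT and sample["power_w"] > SUSPICIOUS_IDLE_POWER_W:
        issues.append("near-zero utilization with high power: possible misconfiguration or stuck process")
    if sample["power_w"] < 0:
        issues.append("negative power reading: sensor or parsing error")
    return issues
```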
From metrics to action: where efficiency gains come from
Metrics pay off when they inform design, operations, and software choices.
Improve accelerator and system utilization
Right-size jobs to fit memory; use multi-instance GPU (MIG) or partitioning to reduce fragmentation.
Batch and queue intelligently; co-schedule complementary jobs to keep SMs busy.
Eliminate “dark” power: disable unused links, cap idle clocks, and power-manage accelerators when queues empty.
Reference
MLCommons power measurement and best practices: https://mlcommons.org/en/groups/research/power/
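One practical lever for idle power is the GPU power limit. The sketch below uses the standard nvidia-smi persistence-mode and power-limit flags; the chosen limit must stay within the range the SKU reports (see nvidia-smi -q -d POWER), and wiring this to the scheduler's queue state is an assumption about your orchestration setup.

```python
import subprocess

def cap_idle_gpu_power(gpu_index: int, idle_limit_w: int) -> None:
    """Lower a GPU's power limit when its queue is empty; restore it before dispatching work."""
    subprocess.run(["nvidia-smi", "-i", str(gpu_index), "-pm", "1"], check=True)          # persistence mode
    subprocess.run(["nvidia-smi", "-i", str(gpu_index), "-pl", str(idle_limit_w)], check=True)

# Example policy hook: called by the scheduler when a node's queue drains.
# for idx in range(8):
#     cap_idle_gpu_power(idx, idle_limit_w=200)
```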
Optimize models and software
Adopt lower precision (e.g., FP8) where accuracy permits; apply quantization-aware training.
Exploit sparsity and operator fusion; profile kernels to remove bottlenecks.
Cache embeddings/results and use distillation to shrink models for inference.
Benchmarking references
SPECpower and SERT for server energy characteristics: https://www.spec.org/power_ssj2008/ and https://www.spec.org/sert2/
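Quantization can be explored incrementally. The sketch below shows post-training dynamic INT8 quantization of Linear layers in PyTorch, which runs on CPU; FP8 or INT8 paths for GPU serving go through your inference stack's own tooling, so treat this purely as an illustration of the measure-before, measure-after workflow, with a toy model standing in for yours.

```python
import torch
import torch.nn as nn

# Illustrative toy model; in practice you would load your serving model here.
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 1024)).eval()

# Post-training dynamic quantization: weights stored in INT8, activations quantized on the fly.
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# Measure tokens/kWh (or samples/s/W) before and after, at the same batch size and SLA,
# and confirm accuracy or perplexity stays within your acceptance threshold.
```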
Cooling and power delivery upgrades
Move to warm-water liquid cooling or rear-door heat exchangers for >30–50 kW/rack densities.
Raise supply temperatures within ASHRAE allowable ranges to unlock chiller efficiency.
Optimize airflow (containment, blanking panels, cable hygiene); reduce recirculation.
Carbon-aware scheduling and energy sourcing
Shift flexible training to low-carbon hours; pin latency-critical inference to high-performance windows.
Use on-site or contracted renewables; integrate energy storage where viable.
Track ERF when heat reuse is available (district heating, industrial processes).
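A carbon-aware policy can be as simple as a threshold plus a deadline guard. The sketch below is illustrative: the 250 gCO2e/kWh threshold, runtime estimate, and deadline handling are assumptions to tune per site, and the current intensity would come from the grid feeds mentioned above.

```python
from datetime import datetime, timezone

CARBON_THRESHOLD_G_PER_KWH = 250.0   # illustrative policy threshold, tune per site and grid

def should_start_flexible_job(current_intensity_g_per_kwh: float,
                              deadline: datetime,
                              est_runtime_hours: float,
                              now: datetime | None = None) -> bool:
    """Start a deferrable training job only when the grid is green enough,
    unless waiting any longer would miss the job's deadline."""
    now = now or datetime.now(timezone.utc)
    hours_left = (deadline - now).total_seconds() / 3600
    must_start_now = hours_left <= est_runtime_hours
    return must_start_now or current_intensity_g_per_kwh <= CARBON_THRESHOLD_G_PER_KWH
```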
Governance, standards, and reporting
Align KPIs to ISO/IEC 30134 series (PUE, WUE, CUE) and EN 50600 concepts for consistent reporting.
Adopt EU Code of Conduct for Data Centres best practices for continual improvement.
Publish a small, stable KPI set: tokens/kWh, energy-to-target-accuracy, average accelerator utilization, part-load PUE, WUE, hourly CUE.
Build a cross-functional review: operations, AI engineering, sustainability, finance.
References
ISO/IEC 30134 overview: https://www.iso.org/standard/63426.html
EU Code of Conduct for Data Centres: https://joint-research-centre.ec.europa.eu/energy-efficiency/energy-efficient-products/data-centres_en
A practical assessment workflow (example)
Define outcomes and SLAs
Choose output metrics: tokens/kWh for inference; kWh-to-target-accuracy for training; p99 latency bounds.
Instrument and baseline
Enable per-GPU power, PDU circuits, cooling meters; tag jobs; sync clocks.
Run a 72-hour mixed workload to capture diurnal and thermal dynamics.
Analyze bottlenecks
Plot utilization vs. watts; find underfilled memory or chokepoints in networking.
Compare PUE sitewide vs. pod-local; identify hot/cold aisle imbalances.
Optimize and A/B test
Trial FP8, quantization, and batch adjustments; measure output-per-kWh changes.
Raise chilled water by 1–2°C within ASHRAE guidance; measure PUE and component temps.
Pilot carbon-aware scheduling for flexible jobs.
Operationalize
Codify policies in orchestrators and schedulers; set guardrails for power caps.
Review KPIs weekly; track regressions after driver or model changes.
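To make the weekly review concrete, a simple guardrail can flag regressions in output-per-kWh after a driver or model change. The tolerance and the example figures below are placeholders, not recommended values.

```python
REGRESSION_TOLERANCE_PCT = 5.0   # illustrative guardrail, tune to your KPI noise level

def check_regression(kpi_name: str, baseline: float, current: float) -> str | None:
    """Return a warning if a 'higher is better' KPI drops beyond the tolerance."""
    drop_pct = (baseline - current) / baseline * 100
    if drop_pct > REGRESSION_TOLERANCE_PCT:
        return f"{kpi_name} regressed {drop_pct:.1f}% vs. baseline; review the last driver or model change"
    return None

# Example: weekly review after a driver upgrade (values are hypothetical).
print(check_regression("tokens/kWh", baseline=405_000, current=362_000))
```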
Recommended metrics beyond PUE (operational cheat sheet)
Compute: accelerator utilization (%), FLOPS/W or tokens/s/W at the chip level.
Workload: tokens/kWh (inference), kWh-to-target-accuracy (training), energy per request at p99 latency.
System: ITUE/TUE views, WUE, CUE, ERF.
Facility: part-load and per-zone PUE, Delta-T, ASHRAE supply temperature compliance.
Carbon: gCO2e/kWh at job time, hourly-matched Scope 2 emissions per job.
How Score Group helps: energy, digital, and new tech working together
At Score Group, we unite energy efficiency and cutting-edge digital infrastructure—where efficiency embraces innovation. Our divisions work in concert to improve AI cluster performance end to end:
Noor ITS – The infrastructure backbone
Datacenter design and optimization, network and systems engineering, Cloud & Hosting, PRA/PCA and resilience.
We help you instrument telemetry (Redfish, PDUs, cooling), architect high-density pods, and baseline PUE at part-load.
Noor Energy – Intelligence for energy performance
Energy management systems, GTB/GTC, renewable integration, storage, and EV-ready facilities.
We deploy metering, optimize HVAC setpoints, and evaluate heat reuse to improve WUE/CUE/ERF.
Noor Technology – Innovation for tomorrow’s workloads
AI, RPA, IoT and smart connecting, and application development.
We collaborate with your ML teams to raise tokens/kWh via software and model optimization, and to integrate carbon-aware policies.
Discover how our integrated approach translates metrics into measurable gains: Score Group.
FAQ
How do I calculate tokens-per-kWh for LLM inference?
Aggregate the total number of tokens produced during a defined window (from application logs or your inference service) and divide by the electrical energy measured at the relevant boundary. For transparency, prefer node- or rack-level energy from PDUs; include only the time/energy when the workload runs. Separate prompt vs. completion tokens if that affects performance. Tag results by model version, precision (e.g., FP8/INT8), and batch size so A/B tests are comparable. Report tokens/kWh alongside latency (p95/p99) to ensure efficiency doesn’t compromise SLA.
What telemetry is essential to measure GPU power accurately?
Combine on-device readings (e.g., nvidia-smi power.draw) with independent electrical metering at the PDU or rack circuit. Chip-level data gives per-accelerator granularity, while PDU meters capture platform overheads and losses. Synchronize timestamps using NTP/PTP, and sample at 1–10 seconds for dynamic workloads. Validate by running controlled load steps and comparing trends between GPU-reported watts and PDU power. Include ambient and inlet temperatures to correlate thermal effects with power draw.
How can I include carbon intensity in AI job scheduling?
Fetch real-time or day-ahead grid carbon intensity for your location(s) via public APIs or utility feeds. Multiply measured energy for each job by the time-matched intensity to compute gCO2e/job. For flexible training jobs, define scheduling policies that prefer low-carbon windows; for latency-critical inference, set minimum performance windows but still shift non-urgent tasks. Track CUE and gCO2e per job on dashboards and periodically re-evaluate policies as grid mix and SLAs evolve. This approach aligns energy savings with decarbonization goals.
Is liquid cooling necessary for high-density AI racks?
Above roughly 30–50 kW per rack, traditional air cooling can struggle with efficiency and thermal headroom, especially for sustained accelerator loads. Rear-door heat exchangers or direct liquid cooling often improve thermal stability and reduce fan and chiller power. Evaluate using your part-load PUE, rack inlet/outlet Delta-T, and component temperatures versus ASHRAE guidelines. Pilot a small pod, measure energy and temperatures at matched workload, and compare lifecycle costs and constraints (e.g., water quality, CDUs, maintenance).
Which standards should I use to report beyond PUE?
Anchor your reporting in the ISO/IEC 30134 series (PUE, WUE, CUE) for consistency, and align operational practices with the EU Code of Conduct for Data Centres. For server-level characterization use SPECpower and SERT; for AI-specific power benchmarking, refer to MLCommons power measurement methodologies. Include workload-normalized KPIs (tokens/kWh, kWh-to-target-accuracy) and sustainability metrics (CUE, WUE, ERF). Maintain a concise KPI set and publish methods so stakeholders can compare across time and sites.
Key takeaways
PUE is necessary but insufficient for AI; add workload-, system-, and carbon-aware KPIs.
Instrument chips to chillers; time-align and normalize by business output.
Drive efficiency with utilization, model/software optimization, and cooling upgrades.
Adopt standards (ISO 30134, The Green Grid, EU Code) and automate dashboards.
Treat carbon intensity as a first-class signal in scheduling and planning.
Ready to turn metrics into outcomes? Connect with Score Group to design, instrument, and optimize your AI cluster—end to end.