
AI data center 2025: GPU density, power and cooling

  • Cedric KTORZA
  • Oct 7
  • 7 min read

Artificial-intelligence data centers are reshaping power, density, and cooling in 2025.

If you’re planning or upgrading AI data centers, this guide explains how GPU density drives electrical and thermal design, what power topologies work at multi‑MW scale, and how to choose the right cooling (air, rear‑door, direct‑to‑chip liquid, or immersion) while improving efficiency and resilience.

 

In brief

  • GPU clusters push racks from 10–20 kW to 50–120+ kW; cooling strategy must evolve accordingly.

  • Design power for stepwise expansion: modular MV/LV, scalable UPS, and liquid‑ready cooling loops.

  • Start with rear‑door heat exchangers, graduate to direct‑to‑chip liquid above roughly 50 kW/rack, and reserve immersion for niche use cases.

  • Use real‑time telemetry and AI‑assisted operations to balance performance, PUE, and WUE.

  • At Score Group, Noor ITS (infrastructure), Noor Energy (power/renewables), and Noor Technology (AI/IoT) deliver end‑to‑end integration.

Topic | 2025 reality | Practical choice
Rack density | 30–120+ kW per rack for GPU training | Liquid‑ready designs from day one
Power | Multi‑MW blocks, fast growth | Modular MV skids, scalable UPS/BESS
Cooling | Air plateau reached | RDHx to direct‑to‑chip liquid
Efficiency | PUE and WUE both matter | Heat reuse + renewable sourcing
Operations | Rapid change, supply risk | Reference designs + DCIM + AIOps

 

The 2025 inflection: why AI density changes everything

High‑end GPU nodes pack 6–8 accelerators with high‑bandwidth interconnects, concentrating heat in small footprints. The result: rack densities commonly jump from a traditional 5–10 kW to 30–80 kW, with 100–150 kW feasible when liquid cooled. Industry observers have tracked this shift; Uptime Institute, for example, notes the steady rise of high‑density deployments and the accelerating adoption of liquid cooling as AI workloads scale (Uptime Institute, 2024).

Two consequences flow from this:

  • Electrical distribution must deliver stable, high current at the rack with minimal losses and redundancy suited to training clusters.

  • Thermal management must remove concentrated heat efficiently while preserving serviceability and safety.

Where efficiency meets innovation… That’s the design mindset AI data centers now require.

 

GPU density, interconnects, and thermal footprints

 

What drives heat density

  • Accelerator count per node and TDP per GPU.

  • High‑speed fabrics (e.g., NVLink/NVSwitch class) that raise board‑level heat.

  • Memory and storage proximity for feeding models at scale.

Even without exact TDPs, the trend is clear: training nodes can draw several kilowatts each, making 10–20 kW per chassis common and pushing racks beyond the limits of conventional room‑level air cooling. Open frameworks such as the OCP Advanced Cooling Solutions project highlight standardized approaches for high‑heat‑flux hardware (Open Compute Project, 2023).
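
As a rough, illustrative back‑of‑envelope check (the TDP, node count, and overhead figures below are assumptions, not vendor specifications), a few lines of arithmetic show how quickly rack power climbs once accelerators are densely packed:

```python
# Back-of-envelope rack power estimate (all figures are illustrative assumptions).
GPU_TDP_W = 700          # assumed accelerator TDP; actual values vary by platform
GPUS_PER_NODE = 8        # common count for high-end training nodes
NODE_OVERHEAD_W = 2000   # assumed CPUs, memory, NICs, fans and conversion losses
NODES_PER_RACK = 4       # assumed packing density

node_power_kw = (GPU_TDP_W * GPUS_PER_NODE + NODE_OVERHEAD_W) / 1000
rack_power_kw = node_power_kw * NODES_PER_RACK

print(f"Per node: ~{node_power_kw:.1f} kW, per rack: ~{rack_power_kw:.1f} kW")
# Per node: ~7.6 kW, per rack: ~30.4 kW -- already beyond comfortable room-level air cooling
```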

 

Density bands and what they imply

  • 10–30 kW/rack: High‑efficiency air with containment or rear‑door heat exchangers (RDHx).

  • 30–80 kW/rack: RDHx at the low end; direct‑to‑chip (D2C) liquid becomes the primary choice.

  • 80–150 kW+/rack: D2C liquid with facility water loops; immersion cooling for specific use cases.

ASHRAE’s technical guidance for data processing environments provides thermal envelopes and liquid‑cooling recommendations that inform these thresholds (ASHRAE TC 9.9).
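
For early capacity planning, these bands can be encoded in a small helper like the sketch below; the cut‑off values simply mirror the density bands above and should be validated against ASHRAE guidance and vendor limits.

```python
def cooling_strategy(rack_kw: float) -> str:
    """Map a planned rack density (kW) to a candidate cooling approach.

    Thresholds mirror the density bands above; treat them as planning
    assumptions, not hard limits.
    """
    if rack_kw <= 30:
        return "High-efficiency air with containment or RDHx"
    if rack_kw <= 80:
        return "RDHx at the low end; direct-to-chip (D2C) liquid as primary"
    return "D2C liquid on facility water loops; immersion for specific use cases"

for kw in (20, 60, 120):
    print(f"{kw} kW/rack -> {cooling_strategy(kw)}")
```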

 

Power architecture: feeding multi‑MW AI clusters

 

From grid to rack: scalable and resilient

Design for growth in modular blocks (e.g., 5–20 MW):

  • Medium‑voltage intake and prefabricated power skids for rapid expansion.

  • LV distribution at 400/415 V 3‑phase with busway or high‑capacity PDUs to support dense racks.

  • Proper branch circuit sizing (e.g., 63–125 A feeds) per rack, with metering for chargeback and capacity planning.
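
To illustrate the branch‑circuit sizing point above, the sketch below computes per‑phase current for a balanced three‑phase 400 V rack feed; the power factor and design margin are assumptions, and actual sizing must follow local electrical codes and vendor PSU data.

```python
import math

def rack_feed_current_a(rack_kw: float, line_voltage_v: float = 400.0,
                        power_factor: float = 0.95) -> float:
    """Phase current for a balanced three-phase feed: I = P / (sqrt(3) * V * PF)."""
    return rack_kw * 1000 / (math.sqrt(3) * line_voltage_v * power_factor)

for kw in (30, 60, 120):
    amps = rack_feed_current_a(kw)
    # Assumed 25% design margin before selecting the breaker/feed rating.
    print(f"{kw} kW rack: ~{amps:.0f} A per phase, ~{amps * 1.25:.0f} A with margin")
```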

Redundancy strategies align with workload criticality:

  • N+1 for capacity clusters where retraining can resume.

  • 2N or distributed redundant for latency‑sensitive inference or shared enterprise loads.

  • Lithium‑ion UPS improves power density and lifecycle; consider battery energy storage systems (BESS) to shave peaks and support grid services where regulation permits.

The IEA notes rapidly rising electricity needs from data centers and AI, reinforcing the importance of efficiency and grid‑friendly design (IEA, 2024).

 

Cabling and rack power distribution

  • Dual‑cord cabinet power strips with granular metering, hot‑swappable breakers, and high‑temperature ratings.

  • 48 V DC internal distribution within the rack is increasingly common for advanced platforms; ensure compatibility with facility AC delivery and vendor PSUs.

 

Cooling options: choosing the right path

 

Air cooling and its limits

Air remains viable up to roughly 20–30 kW/rack with strict containment and high‑efficiency coils. Above that, fan energy, delta‑T constraints, and chip hot‑spot temperatures make air alone insufficient.
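
A simple heat balance shows why: the airflow a rack needs grows linearly with its load and shrinks only with a larger air temperature rise. The density and specific‑heat constants below are standard approximations; the ΔT is an assumption.

```python
# Airflow needed to remove rack heat: V = Q / (rho * cp * dT)
AIR_DENSITY_KG_M3 = 1.2   # approximate air density at data-center conditions
AIR_CP_J_KG_K = 1005.0    # specific heat of air

def airflow_m3_per_h(rack_kw: float, delta_t_k: float = 12.0) -> float:
    """Volumetric airflow (m^3/h) required for a given rack load and air dT."""
    mass_flow_kg_s = rack_kw * 1000 / (AIR_CP_J_KG_K * delta_t_k)
    return mass_flow_kg_s / AIR_DENSITY_KG_M3 * 3600

for kw in (10, 30, 60):
    print(f"{kw} kW rack at dT = 12 K: ~{airflow_m3_per_h(kw):,.0f} m^3/h of air")
# 60 kW needs roughly 15,000 m^3/h -- impractical fan energy for a single rack
```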

 

Rear‑door heat exchangers (RDHx)

  • A bridge solution that can handle 20–50 kW/rack by removing heat at the rack rear.

  • Pros: retrofittable, minimal server changes.

  • Watchouts: floor loading, condensate control, water loop reliability and monitoring.

 

Direct‑to‑chip liquid cooling

  • Cold plates on CPUs/GPUs move heat into a secondary coolant loop, enabling 50–120+ kW/rack.

  • Pros: highest efficiency at scale, smaller white‑space footprint, lower fan power.

  • Requirements: facility water distribution, coolant distribution units (CDUs), leak detection, and maintenance procedures compatible with SLAs.

ASHRAE and vendor‑neutral references outline materials compatibility, coolant quality, and safe operating envelopes for liquid loops (ASHRAE TC 9.9).
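
Applying the same heat balance to the secondary loop illustrates why liquid is so much more effective per unit of flow; the coolant properties below approximate water, and the loop ΔT is an assumption to agree with the CDU vendor.

```python
# Coolant flow needed to absorb rack heat: m_dot = Q / (cp * dT)
COOLANT_CP_J_KG_K = 4186.0      # approximate for water / light glycol mixes
COOLANT_DENSITY_KG_M3 = 1000.0

def coolant_flow_l_min(rack_kw: float, delta_t_k: float = 10.0) -> float:
    """Secondary-loop flow (litres/min) for a given rack load and coolant dT."""
    mass_flow_kg_s = rack_kw * 1000 / (COOLANT_CP_J_KG_K * delta_t_k)
    return mass_flow_kg_s / COOLANT_DENSITY_KG_M3 * 1000 * 60

for kw in (50, 80, 120):
    print(f"{kw} kW rack at dT = 10 K: ~{coolant_flow_l_min(kw):.0f} L/min of coolant")
# Tens of litres per minute replace thousands of cubic metres of air per hour
```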

 

Immersion cooling

  • Full or partial immersion can reach very high densities and uniform chip temperatures.

  • Best for specialized environments; consider serviceability, fluid handling, and ecosystem maturity.

  • Engage early with hardware vendors to validate warranties and fluid compatibility.

 

Water, heat reuse, and sustainability

  • Balance PUE with WUE: adiabatic systems save power but consume water; hybrid solutions can optimize both.

  • Consider heat reuse where district or process heat sinks exist; liquid loops simplify recovery at useful temperatures.

  • Integrate on‑site renewables and storage where feasible, and explore PPAs to decarbonize supply. Global analyses emphasize the role of efficient design and clean electricity in mitigating AI’s energy footprint (IEA, 2024).
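
Both metrics are straightforward ratios once site metering is in place; the sketch below uses the standard definitions, with purely illustrative monthly figures (the numbers are assumptions, not benchmarks).

```python
def pue(total_facility_kwh: float, it_kwh: float) -> float:
    """Power Usage Effectiveness: total facility energy divided by IT energy."""
    return total_facility_kwh / it_kwh

def wue(site_water_litres: float, it_kwh: float) -> float:
    """Water Usage Effectiveness: litres of water consumed per kWh of IT energy."""
    return site_water_litres / it_kwh

# Illustrative monthly figures (assumed, not benchmarks):
it_energy_kwh = 2_000_000
facility_energy_kwh = 2_500_000
water_litres = 3_000_000

print(f"PUE: {pue(facility_energy_kwh, it_energy_kwh):.2f}")   # 1.25
print(f"WUE: {wue(water_litres, it_energy_kwh):.2f} L/kWh")    # 1.50
```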

 

Reference designs: rooms, rows, and modules

 

High‑density row in an existing room

  • Retrofit RDHx, seal the envelope with containment, upgrade electrical whips, and add in‑row CDUs.

  • Useful for pilots and incremental growth without full rebuild.

 

Liquid‑ready new build

  • Central plant sized for future liquid loads, overhead busway, hot aisle containment, and white space zoning (air vs liquid).

  • Modular power skids and CDU capacity added per block.

 

Prefabricated modules

  • Factory‑integrated power and liquid loops reduce time‑to‑AI.

  • Ideal when grid capacity or permits are phased.

 

Instrumentation and AI‑assisted operations

  • Deploy dense telemetry: rack inlet/outlet temps, coolant temperatures and flow, rack PDU metrics, and GPU‑level sensors.

  • DCIM integrated with IT telemetry (job schedulers) enables energy‑aware workload placement.

  • Use AIOps to detect hotspots, optimize setpoints, and predict failures. Uptime Institute research underscores the value of operational data in reducing incidents and improving efficiency (Uptime Institute, 2024).
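
As a minimal sketch of the hotspot‑detection idea above (the sensor feed, window size, and threshold are hypothetical and would need tuning), a rolling statistical check over rack inlet temperatures is often a useful first step before heavier models:

```python
from collections import deque
from statistics import mean, stdev

def hotspot_alerts(samples, window: int = 20, z_limit: float = 3.0):
    """Flag inlet-temperature readings that jump well above a rack's recent history.

    samples: iterable of (rack_id, temperature_c) pairs in time order.
    A reading is flagged when it sits more than z_limit standard deviations
    above the rolling mean for that rack (assumed thresholds, tune per site).
    """
    history = {}
    for rack_id, temp in samples:
        buf = history.setdefault(rack_id, deque(maxlen=window))
        if len(buf) >= 5 and stdev(buf) > 0:
            z = (temp - mean(buf)) / stdev(buf)
            if z > z_limit:
                yield rack_id, temp, round(z, 1)
        buf.append(temp)

readings = [("R12", t) for t in (24.0, 24.2, 23.9, 24.1, 24.0,
                                 24.3, 24.1, 23.8, 24.2, 24.0, 31.0)]
for rack, temp, z in hotspot_alerts(readings):
    print(f"Hotspot candidate: {rack} at {temp} °C (z = {z})")
```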

 

Risk, resilience, and continuity for AI

  • Plan for partial capacity loss: graceful degradation strategies for training jobs.

  • Disaster recovery and business continuity (PRA/PCA): replicate storage and checkpoints; align redundancy to business impact rather than a single tier label.

  • Test failover of liquid loops and power transitions under load.
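
One concrete piece of the checkpoint strategy above is deciding how often training jobs should checkpoint. A classic first‑order answer is the Young/Daly approximation, which balances checkpoint overhead against expected rework after a failure; the checkpoint cost and cluster MTBF below are illustrative assumptions.

```python
import math

def optimal_checkpoint_interval_s(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young/Daly approximation: T_opt ~= sqrt(2 * checkpoint_cost * MTBF)."""
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)

# Assumed figures: a 5-minute checkpoint write and a 24-hour effective MTBF
# for the whole cluster (effective MTBF shrinks as clusters grow).
t_opt = optimal_checkpoint_interval_s(checkpoint_cost_s=300, mtbf_s=24 * 3600)
print(f"Checkpoint roughly every {t_opt / 3600:.1f} h")   # ~2.0 h
```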

 

How Score Group helps: Energy, Digital, New Tech working as one

Score Group acts as an end‑to‑end integrator—energy, digital infrastructure, and emerging tech aligned to outcomes.

  • Noor ITS – The digital infrastructure foundation

  • Data center design and optimization, scalable networks, secure architectures, cloud and hybrid integration.

  • High‑density reference designs, DCIM deployment, and migration strategies.

  • Noor Energy – Intelligence for energy performance

  • Smart energy management, building management systems (GTB/GTC), renewable integration, EV infrastructure.

  • MV/LV design, UPS/BESS strategy, heat‑recovery projects, and efficiency roadmaps.

  • Noor Technology – Innovation to stay ahead

  • AI‑driven operations, IoT sensorization, automation (RPA), and custom app development for telemetry and workflow.

  • Predictive maintenance models for pumps, CDUs, and cooling plants; energy optimization algorithms.

At Score Group, we deliver “solutions adapted to each of your needs,” aligning GPU performance with power, cooling, and sustainability objectives.

To start a conversation with our architects and engineers, visit Score Group.

 

Sizing checklist: from concept to commissioning

  1. Define workload mix and growth curve (training vs inference, model sizes, utilization targets).

  2. Establish density bands per rack and thermal strategy (RDHx, D2C liquid, immersion).

  3. Plan power blocks, redundancy model (N+1/2N), and expansion steps.

  4. Design liquid loops (supply temperature, CDU sizing), water strategy, and heat‑reuse options.

  5. Instrument everything (power, thermal, flow) and integrate DCIM with IT telemetry.

  6. Commission with realistic workload tests; iterate setpoints for efficiency and stability.
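
To make steps 2–3 concrete, a small sketch can translate a target rack population into power blocks and CDU count; the block size, CDU capacity, and diversity factor below are planning assumptions to replace with real project data.

```python
import math

def plan_capacity(racks: int, kw_per_rack: float, block_mw: float = 5.0,
                  cdu_kw: float = 1000.0, diversity: float = 0.85) -> dict:
    """Rough power-block and CDU count for a rack population (assumed parameters)."""
    it_load_kw = racks * kw_per_rack * diversity
    return {
        "it_load_mw": round(it_load_kw / 1000, 1),
        "power_blocks": math.ceil(it_load_kw / (block_mw * 1000)),
        "cdus": math.ceil(it_load_kw / cdu_kw),
    }

print(plan_capacity(racks=200, kw_per_rack=60))
# {'it_load_mw': 10.2, 'power_blocks': 3, 'cdus': 11}
```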

 

FAQ

 

How many kilowatts per rack should I plan for an AI training cluster in 2025?

Plan for at least 30–60 kW/rack for mid‑range GPU nodes and 80–120 kW/rack for leading platforms when liquid cooled. The exact figure depends on accelerators per node, interconnects, and oversubscription. Start with an envelope (e.g., 60 kW standard, 120 kW peak) and make the facility liquid‑ready even if you begin with RDHx. Use measured IT loads during pilot phases to refine branch circuit sizing, CDU capacity, and cooling water temperatures before scaling.

 

When do I need liquid cooling instead of advanced air?

Air with containment can be efficient up to roughly 20–30 kW/rack. Beyond that, fan energy rises and component hotspots become harder to manage. RDHx can extend air‑assisted performance to about 50 kW/rack, but for sustained 50–120+ kW, direct‑to‑chip liquid delivers better thermals and lower total energy. Immersion suits specialized, very high‑density or acoustically constrained deployments. Follow ASHRAE envelopes and vendor guidance for exact limits and service procedures.

 

What electrical topology fits fast‑growing AI campuses?

Use modular MV intake and power skids feeding 400/415 V distribution, with scalable UPS (N+1 or distributed redundant) and dual‑cord cabinet power. Choose Li‑ion UPS for higher energy density and lifecycle. Design in stages (e.g., 5 MW blocks) that can be commissioned quickly as hardware arrives. Where allowed, add BESS for peak shaving and grid support. Keep fault domains small so maintenance or failures don’t interrupt entire training runs.

 

How can I reduce PUE and water use while cooling high‑density racks?

Operate with higher supply water temperatures in liquid loops to improve chiller efficiency and enable more free‑cooling hours. Use RDHx or D2C to cut fan energy. If using adiabatic systems, manage WUE with smart controls and seasonal modes. Consider heat reuse into district networks or building systems to raise overall site efficiency. Continuous telemetry and AI‑assisted setpoint optimization can improve both PUE and WUE without risking thermal limits.

 

Can existing enterprise data centers host AI training without major rebuilds?

Often, yes—but with limits. You can create high‑density rows inside existing rooms using RDHx, CDU modules, and upgraded electrical feeds. This is ideal for pilots or smaller clusters. For sustained growth, plan a liquid‑ready expansion: dedicated white space, facility water loops, and modular power. The best path is a phased roadmap that validates densities and operations before broad rollout, avoiding oversizing and stranded capacity.

 

Key takeaways

  • GPU density is the design driver: plan for 30–120+ kW/rack and liquid‑ready facilities.

  • Modular power, scalable UPS/BESS, and robust LV distribution de‑risk growth.

  • RDHx is a practical bridge; direct‑to‑chip liquid becomes the mainstream choice.

  • Balance PUE with WUE; pursue heat reuse and cleaner power to cut emissions.

  • Instrument deeply and apply AIOps to maintain stability and efficiency as you scale.

  • Ready to plan or upgrade your AI data center? Connect with Score Group to align energy, digital infrastructure, and new tech from day one.

External sources:

  • Uptime Institute, research on high‑density deployments and liquid‑cooling adoption (2024)

  • Open Compute Project, OCP Advanced Cooling Solutions (2023)

  • ASHRAE TC 9.9, technical guidance for data processing environments

  • IEA, analysis of electricity demand from data centers and AI (2024)