
HPC and AI in existing data centers: challenges, ROI

  • Cedric KTORZA
  • Oct 22
  • 8 min read

Integrating HPC and AI into existing data centers (constraints, ROI, technical roadmap): this guide shows how to achieve it safely, efficiently, and with measurable ROI.

High‑density compute and AI accelerators are pushing legacy data centers past their design envelopes. In this article, we map the constraints you must resolve, the ROI levers to quantify, and a pragmatic technical roadmap to retrofit GPU/HPC capacity inside existing facilities. As an integrator, Score Group unifies energy, digital infrastructure and new technologies through its Noor divisions to deliver end‑to‑end outcomes.

 

At a glance

  • Start with a constraints map: power density, cooling method, space/paths, network/storage fabric, and governance/safety.

  • Build ROI on three layers: time‑to‑solution gains, energy efficiency improvements, and capacity reuse (brownfield over greenfield).

  • Expect liquid cooling for >30 kW/rack; design for modular, phased deployment to de‑risk and preserve uptime.

  • Use a dual‑stack control plane: DCIM/energy analytics plus cluster/MLOps observability for cost and carbon visibility.

  • Execute in phases: assess, upgrade facility, deploy IT, operationalize, then continuously optimize with workload placement and heat reuse.

 

The brownfield reality: key constraints to resolve

 

Power density and electrical distribution

Most legacy rooms were engineered for single‑digit kW per rack; modern GPU/HPC racks often exceed 30 kW, and immersion deployments can surpass 80 kW. Validate:

  • Feeder and busway capacity, selective coordination, and fault levels

  • UPS topology, runtime objectives and battery chemistry

  • Branch circuits and smart PDUs for high currents and telemetry

Plan for staged power upgrades and high‑efficiency power paths. Where appropriate, consider bypassing double‑conversion UPS for specific non‑critical training workloads to reduce losses, while maintaining resilience for business‑critical services.
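
To put a number on those power-path losses, here is a minimal sketch; every figure (pod load, efficiencies, tariff) is an assumption to replace with your own metering and contract data:

```python
# Minimal sketch (assumed figures): annual losses of a double-conversion UPS path
# vs. an eco-mode/bypass path for a non-critical training pod.
it_load_kw = 300.0          # assumed average pod IT load
double_conversion_eff = 0.94  # assumed double-conversion efficiency at this load
eco_mode_eff = 0.99           # assumed eco-mode/bypass efficiency
hours_per_year = 8760
tariff_eur_per_kwh = 0.15     # assumed electricity tariff

def annual_loss_kwh(load_kw: float, efficiency: float) -> float:
    """Energy dissipated in the power path over a year at constant load."""
    return load_kw * (1.0 / efficiency - 1.0) * hours_per_year

delta_kwh = (annual_loss_kwh(it_load_kw, double_conversion_eff)
             - annual_loss_kwh(it_load_kw, eco_mode_eff))
print(f"Avoided losses: {delta_kwh:,.0f} kWh/year (~EUR {delta_kwh * tariff_eur_per_kwh:,.0f})")
```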

 

Cooling and the thermal envelope

Air cooling alone struggles beyond ~20–30 kW per rack. Options include:

  • Enhanced air: containment, higher supply temperatures, and rear‑door heat exchangers

  • Direct‑to‑chip liquid cooling (D2C): removes high‑grade heat at the source

  • Immersion cooling: maximizes density and acoustic comfort, with specific maintenance workflows

Use ASHRAE’s thermal guidelines as your guardrail for allowable envelopes and component reliability expectations (see ASHRAE TC 9.9 guidance, ASHRAE Thermal Guidelines). For sustainability and grid-impact context, track water use and power efficiency; PUE and WUE are commonly used metrics with clear definitions from industry bodies such as the Uptime Institute (What is PUE).
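
Once the facility is sub-metered, PUE and WUE reduce to simple ratios. The sketch below uses assumed meter readings, not real site data:

```python
# Minimal sketch (assumed meter readings over the same reporting period).
facility_energy_kwh = 1_450_000.0  # assumed total facility energy
it_energy_kwh = 1_100_000.0        # assumed IT equipment energy
site_water_liters = 1_200_000.0    # assumed site water use

pue = facility_energy_kwh / it_energy_kwh   # dimensionless, >= 1.0
wue = site_water_liters / it_energy_kwh     # liters per kWh of IT energy
print(f"PUE = {pue:.2f}, WUE = {wue:.2f} L/kWh")
```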

 

Floor loading, space and pathways

GPU pods and liquid distribution introduce new weight and routing constraints:

  • Check slab ratings and localized loading under tanks/CDUs

  • Route supply/return manifolds, and ensure service clearance and spill containment

  • Maintain hot‑work and leak detection procedures with facilities
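
A quick sanity check on slab loading flags where load-spreading plates or a structural review are needed. The figures below are assumptions and no substitute for a structural engineer’s assessment:

```python
# Minimal sketch (assumed values): distributed load under a dense rack or immersion tank.
rack_mass_kg = 1600.0      # assumed loaded mass, coolant included
footprint_m2 = 0.6 * 1.2   # assumed footprint (width x depth, m)
slab_rating_kn_m2 = 12.0   # assumed rating from the facility audit

g = 9.81
imposed_kn_m2 = rack_mass_kg * g / 1000.0 / footprint_m2
print(f"Imposed load ~{imposed_kn_m2:.1f} kN/m2 vs slab rating {slab_rating_kn_m2} kN/m2")
print("OK" if imposed_kn_m2 <= slab_rating_kn_m2 else "Load spreading / structural review needed")
```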

 

Network and storage fabric

AI training saturates east‑west traffic and storage I/O:

  • Leaf‑spine or dragonfly fabrics with 100–400 GbE or HDR/NDR InfiniBand

  • NVMe‑oF, parallel file systems for streaming datasets

  • DPU/SmartNIC offloads to contain CPU overhead and improve isolation
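
For rough fabric sizing, the communication volume of a ring all-reduce (each endpoint exchanges about 2(N-1)/N times the gradient payload per step) gives a first-order per-node bandwidth estimate. All workload figures below are assumptions:

```python
# Minimal sketch (assumed workload): per-node bandwidth to hide data-parallel all-reduce.
params = 7e9            # assumed model size (parameters)
bytes_per_param = 2     # assumed fp16/bf16 gradients
nodes = 16              # assumed data-parallel endpoints
step_time_s = 1.0       # assumed training step time
comm_fraction = 0.5     # assumed share of the step usable to overlap communication

payload_bytes = params * bytes_per_param
per_node_bytes = 2 * (nodes - 1) / nodes * payload_bytes   # ring all-reduce volume
required_gbps = per_node_bytes * 8 / (step_time_s * comm_fraction) / 1e9
print(f"~{required_gbps:.0f} Gb/s per node to keep the all-reduce off the critical path")
```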

 

Cybersecurity, safety and governance

HPC/AI estates widen the attack surface and add new safety dimensions (coolants, higher currents):

  • Threat modeling, micro‑segmentation, and zero‑trust access

  • Safety cases for liquid systems, including materials compatibility and emergency response

  • AI governance aligned with NIST’s AI Risk Management Framework (NIST AI RMF) and readiness for evolving EU rules (EU AI regulatory approach)

 

ROI, TCO and sustainability: building the business case

 

What to measure before you invest

Tie outcomes to business value and resilience:

  • Time‑to‑solution (TTS) and throughput per watt for target workloads

  • Utilization (cluster and accelerator), queue wait times, and job preemption efficiency

  • Facility KPIs: PUE, WUE, carbon intensity (location‑based and market‑based), and rack density

  • Reliability: MTTI/MTTR, failure domains, and maintenance windows

Industry trackers show rapidly rising compute demand and energy stakes. The IEA notes data‑center electricity demand is set to increase substantially by mid‑decade, driven by AI and HPC growth (IEA report). Uptime Institute trend data highlights rising densities and a gradual shift toward liquid cooling approaches (Uptime survey).
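
To make these KPIs concrete before investing, a minimal sketch like the one below (with assumed job telemetry) compares energy per job and time-to-solution between a baseline platform and a GPU pilot:

```python
# Minimal sketch (assumed job telemetry): baseline vs. pilot pod for one target workload.
jobs = {
    "baseline_cluster": {"wall_h": 52.0, "avg_power_kw": 18.0},
    "gpu_pilot_pod":    {"wall_h": 6.5,  "avg_power_kw": 31.0},
}

baseline_h = jobs["baseline_cluster"]["wall_h"]
for name, j in jobs.items():
    energy_kwh = j["wall_h"] * j["avg_power_kw"]      # energy per job
    tts_gain = baseline_h / j["wall_h"]                # time-to-solution speedup
    print(f"{name}: {energy_kwh:,.0f} kWh/job, time-to-solution x{tts_gain:.1f}")
```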

 

Three common business patterns

  1. Pilot pod (quick wins)

       • One to three racks, D2C liquid or rear‑door heat exchangers

       • Demonstrate TTS gains on priority use cases

       • Validate telemetry stack and operating procedures

  2. Phased cluster expansion

       • Add GPU nodes and storage in waves

       • Iterate on fabric design and cooling distribution

       • Optimize scheduling policies and chargeback

  3. Inference at the edge/core

       • Retrofit inference‑optimized nodes across existing rooms

       • Use power‑capped profiles to fit air‑cooled envelopes (see the sketch after this list)

       • Centralize model management and observability
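
Power-capped profiles are typically applied per accelerator. A minimal sketch, assuming NVIDIA GPUs managed with nvidia-smi and a 250 W cap chosen to fit the room’s air-cooled envelope (both are assumptions to validate against your thermal model; requires admin rights):

```python
# Minimal sketch: apply an assumed power cap to a set of GPUs via nvidia-smi.
import subprocess

POWER_CAP_W = 250          # assumed cap derived from the thermal/airflow study
GPU_INDICES = [0, 1, 2, 3] # assumed GPU indices on the node

for idx in GPU_INDICES:
    # -pm 1 enables persistence mode; -pl sets the board power limit in watts.
    subprocess.run(["nvidia-smi", "-i", str(idx), "-pm", "1"], check=True)
    subprocess.run(["nvidia-smi", "-i", str(idx), "-pl", str(POWER_CAP_W)], check=True)
```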

 

Funding and risk levers

  • Energy performance contracts and ISO 50001 programs can fund efficiency upgrades (ISO 50001).

  • Demand response and carbon‑aware scheduling reduce cost and emissions during peaks.

  • Brownfield reuse avoids greenfield lead times and embodied‑carbon impacts; prioritize modular upgrades to de‑risk capex.
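
Carbon-aware scheduling can start as simply as shifting deferrable jobs into the lowest-carbon forecast window. The hourly intensities below are made-up values standing in for a real grid forecast:

```python
# Minimal sketch (assumed forecast): pick the lowest-carbon start window for a deferrable job.
forecast = list(enumerate([320, 310, 290, 250, 210, 190, 180, 200, 260, 300, 340, 360,
                           370, 350, 330, 310, 300, 320, 350, 380, 390, 370, 350, 330]))
job_hours = 4  # assumed job duration

def best_window(hourly, duration):
    """Start hour of the contiguous window with the lowest total carbon intensity."""
    starts = range(len(hourly) - duration + 1)
    return min(starts, key=lambda s: sum(ci for _, ci in hourly[s:s + duration]))

start_hour = best_window(forecast, job_hours)
print(f"Schedule the {job_hours} h deferrable job to start at hour {start_hour}")
```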

 

A pragmatic technical roadmap for brownfield integration

 

Phase 0 — Assessment and strategy

  • Facility audit: electrical one‑line, cooling capacity, structural, and code compliance

  • IT workload discovery: GPU/CPU mix, storage tiers, network patterns

  • Risk register: safety, cybersecurity, operations, and regulatory exposure

  • Target architecture and phasing plan with success metrics

 

Phase 1 — Facility upgrades for high density

  • Power: selective coordination review, new panels/busways, high‑efficiency UPS modules, and advanced metering

  • Cooling: choose containment upgrades, rear‑door HEX, or D2C/immersion; plan CDUs, manifolds, and monitoring; study heat‑reuse options

  • Controls: integrate new sensors and leak detection into BMS/GTB and DCIM

For liquid‑cooling pathways and vendor‑neutral designs, OCP’s Advanced Cooling Solutions community offers helpful references (OCP ACS).
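
When studying heat-reuse options, a first-order estimate of recoverable heat helps size the exchanger toward the building loop. Capture fraction and load factor below are assumptions to refine during the study:

```python
# Minimal sketch (assumed figures): annual heat available for reuse from a D2C-cooled pod.
pod_it_load_kw = 250.0   # assumed average IT load of the liquid-cooled pod
capture_fraction = 0.7   # assumed share of heat captured by the liquid loop
load_factor = 0.8        # assumed average utilization over the year
hours_per_year = 8760

recoverable_kw = pod_it_load_kw * capture_fraction * load_factor
annual_mwh = recoverable_kw * hours_per_year / 1000.0
print(f"~{recoverable_kw:.0f} kW average, ~{annual_mwh:,.0f} MWh/year of recoverable heat")
```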

 

Phase 2 — IT platform and data pipelines

  • Cluster: GPU servers, high‑bandwidth interconnects, and HPC schedulers (Slurm) or Kubernetes for AI/ML

  • Storage: NVMe‑oF and parallel filesystems for high‑throughput datasets

  • Security: hardened images, secrets management, and workload isolation with DPUs/Service Mesh

Use Green500 performance-per-watt data to inform node and cooling choices for energy‑efficient compute (Green500 list).
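
A Green500-style comparison is simply measured sustained performance divided by measured power. The candidate figures below are placeholders for your own benchmark runs:

```python
# Minimal sketch (assumed benchmark figures): energy efficiency of candidate node + cooling options.
candidates = {
    "air_cooled_node": {"rmax_tflops": 55.0, "node_power_kw": 1.30},
    "d2c_cooled_node": {"rmax_tflops": 58.0, "node_power_kw": 1.15},
}

for name, c in candidates.items():
    gflops_per_watt = (c["rmax_tflops"] * 1000.0) / (c["node_power_kw"] * 1000.0)
    print(f"{name}: {gflops_per_watt:.1f} GFLOPS/W")
```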

 

Phase 3 — Operations, observability and governance

  • DCIM plus energy analytics for PUE/WUE and sub‑metered costs

  • Cluster observability (exporters, job telemetry) and FinOps chargeback/showback

  • MLOps lifecycle: data lineage, model registry, reproducibility, and approval workflows

  • Safety drills for liquid systems; update change‑management for high‑density bays

NREL’s resources on HPC data‑center efficiency provide practical guidance on monitoring and continuous improvement loops (NREL HPC efficiency).
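
Chargeback/showback can start from nothing more than energy per job, a tariff and a carbon factor. The job records, tariff and intensity below are assumptions:

```python
# Minimal sketch (assumed accounting records): monthly showback per team.
TARIFF_EUR_PER_KWH = 0.15   # assumed tariff
CARBON_G_PER_KWH = 220.0    # assumed location-based grid intensity

jobs = [
    {"team": "vision", "energy_kwh": 820.0},
    {"team": "nlp",    "energy_kwh": 1540.0},
    {"team": "vision", "energy_kwh": 430.0},
]

totals = {}
for job in jobs:
    totals[job["team"]] = totals.get(job["team"], 0.0) + job["energy_kwh"]

for team, kwh in totals.items():
    cost = kwh * TARIFF_EUR_PER_KWH
    co2_kg = kwh * CARBON_G_PER_KWH / 1000.0
    print(f"{team}: {kwh:.0f} kWh, EUR {cost:.0f}, {co2_kg:.0f} kg CO2e")
```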

 

Phase 4 — Continuous optimization

  • Workload placement by thermal zones and carbon intensity

  • Power capping and QoS policies for predictable operations

  • Heat reuse to buildings/processes where feasible

  • Periodic right‑sizing and decommissioning of under‑utilized assets
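
Placement by thermal zone can begin as a simple headroom filter before graduating to scheduler plugins. The zone telemetry below is illustrative only:

```python
# Minimal sketch (assumed zone telemetry): place the next job in the coolest zone with headroom.
zones = {
    "row_a": {"inlet_c": 24.5, "power_headroom_kw": 12.0},
    "row_b": {"inlet_c": 22.0, "power_headroom_kw": 35.0},
    "row_c": {"inlet_c": 27.0, "power_headroom_kw": 5.0},
}
job_power_kw = 10.0  # assumed job power envelope

eligible = {z: v for z, v in zones.items() if v["power_headroom_kw"] >= job_power_kw}
target = min(eligible, key=lambda z: eligible[z]["inlet_c"]) if eligible else None
print(f"Place job in: {target}")
```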

Where efficiency embraces innovation: the most sustainable teraflop is the one delivered faster, at lower energy per solution, inside infrastructure you already own.

 

How Score Group delivers end‑to‑end integration

At Score Group, we act as a global integrator across energy, digital infrastructure and new technologies, aligning operations, sustainability and performance.

 

Noor Energy — Intelligent, profitable energy performance

  • Smart energy management and analytics, ISO 50001 programs

  • Building and plant controls (GTB/GTC), heat‑reuse studies and implementation

  • Renewables, storage and demand‑response integration to stabilize energy costs and reduce emissions

 

Noor ITS — Digital infrastructure as the transformation backbone

  • Data center design and optimization, high‑density retrofits, DCIM integration

  • Networks, systems engineering, and hybrid cloud landing zones

  • Cybersecurity audits, protection and incident response; disaster recovery and business continuity plans (PRA/PCA) and resilience engineering

 

Noor Technology — Innovative solutions to stay ahead

  • AI enablement (from use‑case discovery to MLOps and governance)

  • Automation (RPA) and Smart Connecting for real‑time telemetry and control

  • Application development to bridge data pipelines, operations and decision‑making

To discuss your context and constraints, start with our team at Score Group.

 

Decision matrix for HPC/AI retrofits in existing facilities

| Topic | Typical constraint in brownfield | Viable options | Facility impact | Indicative timeline | Led by (Noor division) |
| --- | --- | --- | --- | --- | --- |
| Rack density | Air cooling plateau at ~20–30 kW/rack | Containment + rear‑door HEX; D2C liquid; immersion (pod) | From minor airflow tweaks to new CDUs/manifolds | 6–20 weeks (per pod) | Noor ITS + Noor Energy |
| Power path | UPS capacity/efficiency limits | Modular UPS, busway upgrades, high‑efficiency power path | Planned outages, commissioning tests | 8–16 weeks | Noor ITS |
| Heat reuse | Waste heat rejected to atmosphere | D2C high‑grade heat capture, heat exchangers to HVAC/process | Integration with building loops and controls | 8–24 weeks | Noor Energy |
| Fabric | East‑west congestion | 100–400 GbE or IB spine/leaf, QoS and RoCE tuning | Cabling, optics, fabric telemetry | 4–12 weeks | Noor ITS |
| Governance | AI risk and data lineage | NIST AI RMF, MLOps tooling, model registry and approvals | Policy, process and platform updates | 4–10 weeks | Noor Technology |
| Observability | Limited visibility of cost/carbon | DCIM + energy analytics; cluster/FinOps telemetry | Metering, exporters, dashboards | 3–8 weeks | Noor ITS + Noor Energy |

 

Risk management and compliance essentials

 

Safety and reliability with liquid cooling

Adopt materials‑compatibility matrices, quick‑disconnect standards and leak detection. Follow ASHRAE TC 9.9 envelopes and maintenance practices, and document emergency procedures for spill containment and electrical safety. Validate warranties with server vendors when introducing D2C or immersion.

 

Energy, sustainability and reporting

Set baselines for PUE/WUE and track improvements. ISO 50001 offers a management framework to institutionalize continuous energy performance (ISO 50001). Use carbon‑aware scheduling where supported, and explore heat‑reuse integrations to nearby buildings or industrial processes.

 

Data, models and regulatory readiness

Implement role‑based access, dataset governance, and model lifecycle controls aligned with the NIST AI Risk Management Framework (NIST AI RMF). Monitor evolving European AI regulations to ensure risk‑appropriate controls for high‑risk applications (EU AI approach).

 

FAQ

 

Can I run GPU‑dense racks in a legacy air‑cooled room?

Yes, within limits. If your racks target 20–30 kW, you can often combine hot/cold aisle containment, higher supply temperatures, and rear‑door heat exchangers to stay within safe envelopes. Above ~30 kW/rack, direct‑to‑chip liquid cooling becomes the practical option for stable operation and acoustic comfort. Start with a pilot pod to verify airflow, electrical capacity and operations. Use DCIM and rack‑level telemetry to validate temperatures, leakage and power draw before scaling further, and plan for structured maintenance windows.

 

What is the fastest path to production for AI training in a brownfield site?

A two‑track approach works best: deploy a small, liquid‑ready GPU pod for immediate training needs while preparing facility upgrades in parallel. Choose nodes with proven interconnects, align on a scheduler (Slurm) or Kubernetes stack, and pre‑stage datasets and storage. Put observability in place from day one (jobs, energy, carbon). Keep the pod modular (CDU, manifolds) so it can scale without rework. This de‑risks the program, demonstrates time‑to‑solution gains quickly, and informs subsequent phases with real telemetry.

 

How do I estimate the cooling required for a 30 kW rack?

Compute heat roughly equals electrical input, so a 30 kW rack requires removal of ~30 kW of heat. Air‑only cooling at this level is challenging; rear‑door heat exchangers can help, but direct‑to‑chip liquid cooling offers greater margin and efficiency. Use manufacturer datasheets for flow rates and delta‑T, and model the loop (CDU capacity, pump curves). Validate against ASHRAE thermal classes and include redundancy (N+1) for pumps and heat rejection. Instrument the rack and loop to confirm real‑world performance before scaling.
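
For orientation, the required coolant flow follows from Q = m_dot * cp * deltaT. The delta-T below is an assumption; vendor datasheets and CDU specifications take precedence:

```python
# Minimal sketch (assumed loop parameters): water flow needed to remove ~30 kW of heat.
heat_kw = 30.0           # rack heat load, roughly equal to electrical input
cp_kj_per_kg_k = 4.18    # specific heat of water
delta_t_k = 8.0          # assumed supply/return temperature difference
density_kg_per_l = 0.998

mass_flow_kg_s = heat_kw / (cp_kj_per_kg_k * delta_t_k)   # Q = m_dot * cp * deltaT
flow_l_min = mass_flow_kg_s / density_kg_per_l * 60.0
print(f"~{mass_flow_kg_s:.2f} kg/s (~{flow_l_min:.0f} L/min) at deltaT = {delta_t_k} K")
```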

 

Will liquid cooling increase my water usage and operational risk?

Not necessarily. Many D2C systems use closed loops that interface with existing dry coolers or chilled water, so site water usage can remain low. Immersion systems also operate as closed loops. The operational risk profile changes—leaks are managed through design (quick disconnects, dripless couplings), monitoring and procedures. Track WUE alongside PUE to quantify benefits. Organizations like OCP and ASHRAE provide design practices to minimize risk and ensure maintainability (OCP ACS, ASHRAE guidelines).

 

How should I track efficiency and cost after deploying HPC/AI?

Combine facility and cluster views. At the facility layer, use DCIM and sub‑metering for PUE, WUE and per‑pod kWh. At the compute layer, collect node/GPU utilization, job duration and energy per job. Correlate with carbon intensity and apply chargeback/showback to encourage efficient usage. Benchmark periodically (e.g., against Green500‑style metrics) and revisit placement and power‑capping policies. Public resources from NREL and the Uptime Institute can help structure your monitoring strategy (NREL HPC efficiency, Uptime PUE).

 

Key takeaways

  • Start with a constraints audit; don’t force GPU racks into envelopes they can’t sustain.

  • Build ROI on business outcomes, energy performance and brownfield reuse.

  • Expect liquid cooling for high densities; design for modular, phased rollouts.

  • Run a dual telemetry stack: DCIM/energy plus cluster/MLOps for cost and carbon clarity.

  • Treat safety, cybersecurity and AI governance as first‑class requirements.

Ready to plan your HPC/AI retrofit? Speak with our integrated team at Score Group to align energy, digital and new technology streams into one actionable roadmap.

 
 