HPC and AI in existing data centers: challenges, ROI
- Cedric KTORZA
- Oct 22
- 8 min read

Integrating HPC and AI into existing data centers raises questions of constraints, ROI and technical roadmap: this guide shows how to achieve it safely, efficiently, and with measurable ROI.
High‑density compute and AI accelerators are pushing legacy data centers past their design envelopes. In this article, we map the constraints you must resolve, the ROI levers to quantify, and a pragmatic technical roadmap to retrofit GPU/HPC capacity inside existing facilities. As an integrator, Score Group unifies energy, digital infrastructure and new technologies through its Noor divisions to deliver end‑to‑end outcomes.
At a glance
Start with a constraints map: power density, cooling method, space/paths, network/storage fabric, and governance/safety.
Build ROI on three layers: time‑to‑solution gains, energy efficiency improvements, and capacity reuse (brownfield over greenfield).
Expect liquid cooling for >30 kW/rack; design for modular, phased deployment to de‑risk and preserve uptime.
Use a dual‑stack control plane: DCIM/energy analytics plus cluster/MLOps observability for cost and carbon visibility.
Execute in phases: assess, upgrade facility, deploy IT, operationalize, then continuously optimize with workload placement and heat reuse.
The brownfield reality: key constraints to resolve
Power density and electrical distribution
Most legacy rooms were engineered for single‑digit kW per rack; modern GPU/HPC racks often exceed 30 kW, and immersion deployments can surpass 80 kW. Validate:
Feeder and busway capacity, selective coordination, and fault levels
UPS topology, runtime objectives and battery chemistry
Branch circuits and smart PDUs for high currents and telemetry
Plan for staged power upgrades and high‑efficiency power paths. Where appropriate, consider bypassing double‑conversion UPS for specific non‑critical training workloads to reduce losses, while maintaining resilience for business‑critical services.
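To make the stakes concrete, here is a minimal sketch comparing annual UPS-path losses for a training pod under double-conversion versus a high-efficiency eco/bypass path; the efficiency figures, pod size and tariff are assumptions for illustration, not vendor data.

```python
# Minimal sketch: annual UPS-path losses for a GPU training pod.
# Efficiency figures, pod size and tariff are assumptions, not vendor data.

POD_LOAD_KW = 300            # e.g. ten ~30 kW racks of training load
HOURS_PER_YEAR = 8760
TARIFF_EUR_PER_KWH = 0.15    # assumed blended electricity price

paths = {
    "double_conversion": 0.94,    # assumed end-to-end efficiency
    "eco_or_bypass_mode": 0.985,  # assumed efficiency when largely bypassed
}

for name, efficiency in paths.items():
    input_kw = POD_LOAD_KW / efficiency
    loss_kwh = (input_kw - POD_LOAD_KW) * HOURS_PER_YEAR
    print(f"{name}: ~{loss_kwh:,.0f} kWh/year lost "
          f"(~{loss_kwh * TARIFF_EUR_PER_KWH:,.0f} EUR/year)")
```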
Cooling and the thermal envelope
Air cooling alone struggles beyond ~20–30 kW per rack. Options include:
Enhanced air: containment, higher supply temperatures, and rear‑door heat exchangers
Direct‑to‑chip liquid cooling (D2C): removes high‑grade heat at the source
Immersion cooling: maximizes density and acoustic comfort, with specific maintenance workflows
Use ASHRAE's thermal guidelines (ASHRAE TC 9.9 Thermal Guidelines) as your guardrail for allowable envelopes and component reliability expectations. For sustainability and grid-impact context, track water use and power efficiency; PUE and WUE are commonly used metrics with clear definitions from industry bodies such as the Uptime Institute (What is PUE).
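Both metrics reduce to simple ratios over metered data; the sketch below uses assumed meter readings purely for illustration.

```python
# Minimal sketch: PUE and WUE from sub-metered readings (values are assumptions).

facility_energy_kwh = 1_450_000   # total site energy over the period
it_energy_kwh = 1_000_000         # energy delivered to IT equipment
site_water_liters = 3_200_000     # water consumed for cooling over the period

pue = facility_energy_kwh / it_energy_kwh   # dimensionless, ideal -> 1.0
wue = site_water_liters / it_energy_kwh     # liters per IT kWh

print(f"PUE: {pue:.2f}")
print(f"WUE: {wue:.2f} L/kWh")
```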
Floor loading, space and pathways
GPU pods and liquid distribution introduce new weight and routing constraints (see the loading check sketched after this list):
Check slab ratings and localized loading under tanks/CDUs
Route supply/return manifolds, and ensure service clearance and spill containment
Maintain hot‑work and leak detection procedures with facilities
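As a first-pass check on the loading question, here is a minimal sketch comparing the distributed load of a filled immersion tank against an assumed slab rating; all figures are placeholders, and the real verdict belongs to your structural engineer.

```python
# Minimal sketch: localized floor loading under an immersion tank.
# All figures are assumptions for illustration; use real structural data.

tank_footprint_m2 = 0.8 * 2.2       # footprint of one tank (m^2), assumed
tank_total_kg = 2500                # tank + coolant + IT, filled weight, assumed
slab_rating_kg_per_m2 = 1200        # assumed allowable distributed load

load_kg_per_m2 = tank_total_kg / tank_footprint_m2
print(f"Tank load: {load_kg_per_m2:.0f} kg/m^2 vs slab rating {slab_rating_kg_per_m2} kg/m^2")
if load_kg_per_m2 > slab_rating_kg_per_m2:
    print("Exceeds rating: spread the load (plinth/steel frame) or relocate.")
```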
Network and storage fabric
AI training saturates east‑west traffic and storage I/O (a quick oversubscription check follows this list):
Leaf‑spine or dragonfly fabrics with 100–400 GbE or HDR/NDR InfiniBand
NVMe‑oF, parallel file systems for streaming datasets
DPU/SmartNIC offloads to contain CPU overhead and improve isolation
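A quick oversubscription calculation helps validate a leaf-spine design before committing to port counts; the sketch below uses assumed server, NIC and uplink figures.

```python
# Minimal sketch: leaf-spine oversubscription for a GPU pod (assumed port counts).

servers_per_leaf = 16
server_nic_gbps = 400      # e.g. one 400 GbE / NDR-class port per server
uplinks_per_leaf = 8
uplink_gbps = 400

downlink_bw = servers_per_leaf * server_nic_gbps
uplink_bw = uplinks_per_leaf * uplink_gbps
oversubscription = downlink_bw / uplink_bw

print(f"Oversubscription: {oversubscription:.1f}:1 "
      "(AI training fabrics typically target 1:1 or close to it)")
```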
Cybersecurity, safety and governance
HPC/AI estates widen the attack surface and add new safety dimensions (coolants, higher currents):
Threat modeling, micro‑segmentation, and zero‑trust access
Safety cases for liquid systems, including materials compatibility and emergency response
AI governance aligned with the NIST AI Risk Management Framework (NIST AI RMF) and readiness for evolving EU rules (the EU's AI regulatory approach)
ROI, TCO and sustainability: building the business case
What to measure before you invest
Tie outcomes to business value and resilience (a short KPI sketch follows this list):
Time‑to‑solution (TTS) and throughput per watt for target workloads
Utilization (cluster and accelerator), queue wait times, and job preemption efficiency
Facility KPIs: PUE, WUE, carbon intensity (location‑based and market‑based), and rack density
Reliability: MTTI/MTTR, failure domains, and maintenance windows
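Here is a minimal sketch showing how a few of these KPIs fall out of job and facility telemetry; the field names and values are assumptions for illustration.

```python
# Minimal sketch: deriving a few investment KPIs from job and facility telemetry.
# Field names and values are assumptions for illustration.

jobs = [
    {"name": "llm-pretrain", "runtime_h": 72.0, "avg_power_kw": 180.0, "useful_units": 4.1e9},
    {"name": "cfd-sweep",    "runtime_h": 18.0, "avg_power_kw": 95.0,  "useful_units": 2.4e6},
]
pue = 1.35  # measured facility PUE for the period

for job in jobs:
    it_energy_kwh = job["runtime_h"] * job["avg_power_kw"]
    site_energy_kwh = it_energy_kwh * pue
    throughput_per_kwh = job["useful_units"] / site_energy_kwh
    print(f"{job['name']}: {site_energy_kwh:,.0f} kWh at the meter, "
          f"{throughput_per_kwh:,.1f} units of work per kWh")
```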
Industry trackers show rapidly rising compute demand and energy stakes. The IEA notes that data‑center electricity demand is set to increase substantially by mid‑decade, driven by AI and HPC growth (IEA report). Uptime Institute trend data highlights rising densities and a gradual shift toward liquid cooling approaches (Uptime survey).
Three common business patterns
Pilot pod (quick wins)
One to three racks, D2C liquid or rear‑door heat exchangers
Demonstrate TTS gains on priority use cases
Validate telemetry stack and operating procedures
Phased cluster expansion
Add GPU nodes and storage in waves
Iterate on fabric design and cooling distribution
Optimize scheduling policies and chargeback
Inference at the edge/core
Retrofit inference‑optimized nodes across existing rooms
Use power‑capped profiles to fit air‑cooled envelopes
Centralize model management and observability
Funding and risk levers
Energy performance contracts and ISO 50001 programs can fund efficiency upgrades.
Demand response and carbon‑aware scheduling reduce cost and emissions during peaks (see the scheduling sketch after this list).
Brownfield reuse avoids greenfield lead times and embodied‑carbon impacts; prioritize modular upgrades to de‑risk capex.
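As a minimal sketch of the carbon-aware scheduling idea: hold deferrable training work while grid carbon intensity is high and release it when it falls. The data source, threshold and helper functions are assumptions; wire in your grid operator's feed and your scheduler of choice.

```python
# Minimal sketch: hold deferrable training jobs when grid carbon intensity is high.
# The data source, threshold and helpers are assumptions for illustration.

import time

CARBON_THRESHOLD_G_PER_KWH = 300   # assumed cut-off for deferrable work
CHECK_INTERVAL_S = 900

def current_carbon_intensity() -> float:
    """Placeholder for a call to a grid-carbon API or internal forecast."""
    return 275.0  # assumed value for illustration

def submit_deferred_jobs() -> None:
    """Placeholder: release held jobs to Slurm/Kubernetes here."""
    print("Carbon intensity acceptable: releasing deferred jobs.")

while True:
    if current_carbon_intensity() <= CARBON_THRESHOLD_G_PER_KWH:
        submit_deferred_jobs()
        break
    time.sleep(CHECK_INTERVAL_S)
```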
A pragmatic technical roadmap for brownfield integration
Phase 0 — Assessment and strategy
Facility audit: electrical one‑line, cooling capacity, structural, and code compliance
IT workload discovery: GPU/CPU mix, storage tiers, network patterns
Risk register: safety, cybersecurity, operations, and regulatory exposure
Target architecture and phasing plan with success metrics
Phase 1 — Facility upgrades for high density
Power: selective coordination review, new panels/busways, high‑efficiency UPS modules, and advanced metering
Cooling: choose containment upgrades, rear‑door HEX, or D2C/immersion; plan CDUs, manifolds, and monitoring; study heat‑reuse options
Controls: integrate new sensors and leak detection into BMS/GTB and DCIM
For liquid‑cooling pathways and vendor‑neutral designs, OCP's Advanced Cooling Solutions community offers helpful references (OCP ACS).
Phase 2 — IT platform and data pipelines
Cluster: GPU servers, high‑bandwidth interconnects, and HPC schedulers (Slurm) or Kubernetes for AI/ML; a minimal GPU pod request is sketched below
Storage: NVMe‑oF and parallel filesystems for high‑throughput datasets
Security: hardened images, secrets management, and workload isolation with DPUs/Service Mesh
Leverage learnings from the Green500 on performance per watt to inform node and cooling choices for energy‑efficient compute (Green500 list).
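For the Kubernetes route, requesting accelerators comes down to resource limits handled by the GPU device plugin. A minimal sketch with the official Python client follows; the namespace, image name and resource sizes are assumptions.

```python
# Minimal sketch: submit a GPU pod via the Kubernetes Python client.
# Assumes the NVIDIA device plugin is installed; namespace, image and sizes are illustrative.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

container = client.V1Container(
    name="gpu-training",
    image="registry.example.com/ai/train:latest",  # hypothetical image
    command=["python", "train.py"],
    resources=client.V1ResourceRequirements(
        limits={"nvidia.com/gpu": "4", "memory": "256Gi", "cpu": "32"},
    ),
)

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="train-job-demo", namespace="training"),
    spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
)

client.CoreV1Api().create_namespaced_pod(namespace="training", body=pod)
```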
Phase 3 — Operations, observability and governance
DCIM plus energy analytics for PUE/WUE and sub‑metered costs
Cluster observability (exporters, job telemetry) and FinOps chargeback/showback; a minimal exporter sketch follows below
MLOps lifecycle: data lineage, model registry, reproducibility, and approval workflows
Safety drills for liquid systems; update change‑management for high‑density bays
NREL's resources on HPC data‑center efficiency provide practical guidance on monitoring and continuous improvement loops (NREL HPC efficiency).
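A minimal sketch of the exporter idea referenced above: expose rack power and PUE as Prometheus metrics. The metric names and the polling source are assumptions; in practice you would read from your PDUs, CDUs or DCIM API.

```python
# Minimal sketch: expose rack-level power and facility PUE as Prometheus metrics.
# Metric names and the data source are assumptions for illustration.

import random
import time

from prometheus_client import Gauge, start_http_server

rack_power_kw = Gauge("rack_power_kw", "Rack input power in kW", ["rack"])
facility_pue = Gauge("facility_pue", "Facility PUE over the last interval")

def poll_dcim() -> dict:
    """Placeholder for a DCIM/PDU query; returns fake readings here."""
    return {"A01": 28.5 + random.random(), "A02": 31.2 + random.random(), "pue": 1.34}

if __name__ == "__main__":
    start_http_server(9105)          # scrape endpoint on :9105/metrics
    while True:
        readings = poll_dcim()
        facility_pue.set(readings.pop("pue"))
        for rack, kw in readings.items():
            rack_power_kw.labels(rack=rack).set(kw)
        time.sleep(30)
```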
Phase 4 — Continuous optimization
Workload placement by thermal zones and carbon intensity
Power capping and QoS policies for predictable operations (see the capping sketch after this list)
Heat reuse to buildings/processes where feasible
Periodic right‑sizing and decommissioning of under‑utilized assets
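For the power-capping policy, here is a minimal sketch using NVIDIA's NVML bindings (pynvml); the 250 W target is an assumption and is clamped to the limits each device reports.

```python
# Minimal sketch: apply a conservative GPU power cap via NVML (requires admin rights).
# The 250 W target is an assumption; the value is clamped to the device's allowed range.

import pynvml

TARGET_W = 250

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        min_mw, max_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
        cap_mw = min(max(TARGET_W * 1000, min_mw), max_mw)
        pynvml.nvmlDeviceSetPowerManagementLimit(handle, cap_mw)
        print(f"GPU {i}: power cap set to {cap_mw / 1000:.0f} W")
finally:
    pynvml.nvmlShutdown()
```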
Where efficiency embraces innovation: the most sustainable teraflop is the one delivered faster, at lower energy per solution, inside infrastructure you already own.
How Score Group delivers end‑to‑end integration
At Score Group, we act as a global integrator across energy, digital infrastructure and new technologies, aligning operations, sustainability and performance.
Noor Energy — Intelligent, profitable energy performance
Smart energy management and analytics, ISO 50001 programs
Building and plant controls (GTB/GTC building management and supervision systems), heat‑reuse studies and implementation
Renewables, storage and demand‑response integration to stabilize energy costs and reduce emissions
Noor ITS — Digital infrastructure as the transformation backbone
Data center design and optimization, high‑density retrofits, DCIM integration
Networks, systems engineering, and hybrid cloud landing zones
Cybersecurity audits, protection and incident response; disaster recovery and business continuity planning (PRA/PCA) and resilience engineering
Noor Technology — Innovative solutions to stay ahead
AI enablement (from use‑case discovery to MLOps and governance)
Automation (RPA) and Smart Connecting for real‑time telemetry and control
Application development to bridge data pipelines, operations and decision‑making
To discuss your context and constraints, start with our team at Score Group.
Decision matrix for HPC/AI retrofits in existing facilities
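Indicative comparison of retrofit options; density ranges are approximate and should be validated against your facility audit:
Enhanced air (containment, higher supply temperatures): up to roughly 20 kW/rack, lowest retrofit effort, no coolant handling
Rear‑door heat exchangers: roughly 20–40 kW/rack, moderate effort, keeps existing racks and most airflow practices
Direct‑to‑chip liquid (D2C): roughly 30–80 kW/rack, requires CDUs and manifolds, best efficiency for dense GPU pods
Immersion: 80+ kW/rack, highest density, new maintenance workflows and materials‑compatibility checks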
Risk management and compliance essentials
Safety and reliability with liquid cooling
Adopt materials‑compatibility matrices, quick‑disconnect standards and leak detection. Follow ASHRAE TC 9.9 envelopes and maintenance practices, and document emergency procedures for spill containment and electrical safety. Validate warranties with server vendors when introducing D2C or immersion.
Energy, sustainability and reporting
Set baselines for PUE/WUE and track improvements. ISO 50001 offers a management framework to institutionalize continuous energy-performance improvement. Use carbon‑aware scheduling where supported, and explore heat‑reuse integration with nearby buildings or industrial processes.
Data, models and regulatory readiness
Implement role‑based access, dataset governance, and model lifecycle controls aligned with the NIST AI RMF. Monitor evolving European AI regulations (the EU AI approach) to ensure risk‑appropriate controls for high‑risk applications.
FAQ
Can I run GPU‑dense racks in a legacy air‑cooled room?
Yes, within limits. If your racks target 20–30 kW, you can often combine hot/cold aisle containment, higher supply temperatures, and rear‑door heat exchangers to stay within safe envelopes. Above ~30 kW/rack, direct‑to‑chip liquid cooling becomes the practical option for stable operation and acoustic comfort. Start with a pilot pod to verify airflow, electrical capacity and operations. Use DCIM and rack‑level telemetry to validate temperatures, leakage and power draw before scaling further, and plan for structured maintenance windows.
What is the fastest path to production for AI training in a brownfield site?
A two‑track approach works best: deploy a small, liquid‑ready GPU pod for immediate training needs while preparing facility upgrades in parallel. Choose nodes with proven interconnects, align on a scheduler (Slurm) or Kubernetes stack, and pre‑stage datasets and storage. Put observability in place from day one (jobs, energy, carbon). Keep the pod modular (CDU, manifolds) so it can scale without rework. This de‑risks the program, demonstrates time‑to‑solution gains quickly, and informs subsequent phases with real telemetry.
How do I estimate the cooling required for a 30 kW rack?
Compute heat roughly equals electrical input, so a 30 kW rack requires removal of ~30 kW of heat. Air‑only cooling at this level is challenging; rear‑door heat exchangers can help, but direct‑to‑chip liquid cooling offers greater margin and efficiency. Use manufacturer datasheets for flow rates and delta‑T, and model the loop (CDU capacity, pump curves). Validate against ASHRAE thermal classes and include redundancy (N+1) for pumps and heat rejection. Instrument the rack and loop to confirm real‑world performance before scaling.
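A minimal sketch of that sizing arithmetic; the delta‑T and fluid properties are assumptions, so use the CDU and server vendors' datasheet values in practice.

```python
# Minimal sketch: required coolant flow to remove 30 kW at a given delta-T.
# Fluid properties and delta-T are assumptions; use CDU/vendor datasheet values.

heat_load_w = 30_000          # rack heat to remove (W)
delta_t_c = 10.0              # assumed supply/return temperature difference (degC)
cp_j_per_kg_k = 4186.0        # specific heat of water (a glycol mix is lower)
density_kg_per_l = 1.0        # approximate density of water

mass_flow_kg_s = heat_load_w / (cp_j_per_kg_k * delta_t_c)
flow_l_min = mass_flow_kg_s / density_kg_per_l * 60

print(f"Mass flow: {mass_flow_kg_s:.2f} kg/s  (~{flow_l_min:.0f} L/min)")
```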
Will liquid cooling increase my water usage and operational risk?
Not necessarily. Many D2C systems use closed loops that interface with existing dry coolers or chilled water, so site water usage can remain low. Immersion systems also operate as closed loops. The operational risk profile changes: leaks are managed through design (quick disconnects, dripless couplings), monitoring and procedures. Track WUE alongside PUE to quantify benefits. Organizations like OCP and ASHRAE provide design practices to minimize risk and ensure maintainability (OCP ACS, ASHRAE guidelines).
How should I track efficiency and cost after deploying HPC/AI?
Combine facility and cluster views. At the facility layer, use DCIM and sub‑metering for PUE, WUE and per‑pod kWh. At the compute layer, collect node/GPU utilization, job duration and energy per job. Correlate with carbon intensity and apply chargeback/showback to encourage efficient usage. Benchmark periodically (e.g., against Green500‑style metrics) and revisit placement and power‑capping policies. Public resources from NREL and Uptime Institute can help structure your monitoring strategy (NREL HPC efficiency, Uptime Institute on PUE).
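A minimal sketch of a per‑job chargeback line that combines those layers; the tariff, PUE and carbon‑intensity figures are assumptions.

```python
# Minimal sketch: per-job chargeback combining IT energy, facility overhead and carbon.
# Tariff, PUE and carbon-intensity figures are assumptions for illustration.

it_energy_kwh = 12_960        # e.g. 72 h x 180 kW measured at the nodes
pue = 1.35                    # facility overhead factor for the period
tariff_eur_per_kwh = 0.15
carbon_g_per_kwh = 220        # location-based grid intensity (gCO2e/kWh)

site_energy_kwh = it_energy_kwh * pue
cost_eur = site_energy_kwh * tariff_eur_per_kwh
carbon_kg = site_energy_kwh * carbon_g_per_kwh / 1000

print(f"Energy at the meter: {site_energy_kwh:,.0f} kWh")
print(f"Chargeback: {cost_eur:,.0f} EUR, carbon: {carbon_kg:,.0f} kgCO2e")
```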
Key takeaways
Start with a constraints audit; don’t force GPU racks into envelopes they can’t sustain.
Build ROI on business outcomes, energy performance and brownfield reuse.
Expect liquid cooling for high densities; design for modular, phased rollouts.
Run a dual telemetry stack: DCIM/energy plus cluster/MLOps for cost and carbon clarity.
Treat safety, cybersecurity and AI governance as first‑class requirements.
Ready to plan your HPC/AI retrofit? Speak with our integrated team at Score Group to align energy, digital and new technology streams into one actionable roadmap.



