AI isn’t just a software story anymore. It’s a physics problem.
New GPU platforms are pushing rack densities above 100 kW per rack, with roadmaps that point toward 200 kW, 600 kW, and even megawatt-class racks before the end of the decade.
At those power levels, air-assisted cooling is no longer sufficient for high-density AI racks; it simply can’t move enough heat. Above roughly 50–100 kW per rack, liquid cooling stops being a “nice option” and becomes a hard requirement.
But the real differentiator between operators won’t just be who adopts liquid cooling; it’ll be who designs an operating model that can run it, 24/7, at scale.
This article goes beyond the usual “air vs. liquid” debate and looks at what it actually takes to design liquid cooling for 100 kW+ racks, and for the people who will live with that decision every day.
Why 100 kW Racks Change the Rules
Traditional enterprise racks lived in a world where:
- 5–15 kW per rack was common
- Air cooled almost everything
- “Cooling” mostly meant making the Data Hall colder
AI clusters have blown up those assumptions:
- Nvidia H100-class deployments already drive tens of kilowatts per rack; Blackwell and successor architectures push into 130 kW+ territory, with future densification targeting ~250 kW.
- Grand View Research and others estimate the global data center liquid cooling market will grow at ~20%+ CAGR into the 2030s, fueled by AI and HPC workloads.
- Cooling systems already account for around 40% of total data center energy consumption, meaning any gains here have an outsized impact on energy bills and emissions.
In other words, liquid cooling is no longer an exotic option. It’s the only realistic way to keep AI hardware inside its thermal envelope without blowing up your power bill or your PUE.
The catch? At 100 kW per rack and beyond, you’re no longer designing just a cooling system; you’re designing an operations system.
Principle 1: Treat Liquid Loops as Critical Infrastructure, Not Just Plumbing
At 100 kW per rack, a failed loop isn’t an inconvenience; it’s a major incident.
ASHRAE’s liquid cooling guidance emphasizes compatibility of wetted materials, strict water quality, and clearly defined Technology Cooling System (TCS) classes to avoid corrosion, scaling, and premature failures.
Practically, that means:
- Dedicated monitoring: Install sensors for pressure, temperature, flow, and leaks at every critical leg (CDUs, manifolds, rack inlets/outlets).
- Alarm engineering: Define thresholds and escalation paths just as you would for power or network alarms, because that’s exactly how critical they are (a minimal threshold sketch follows this list).
- Change control: Any maintenance involving drains, refills, or re-plumbing should be handled under formal change management, not “best effort” tickets.
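To make the alarm-engineering point concrete, here is a minimal sketch of the kind of threshold-and-escalation logic those sensors would feed. Everything in it (sensor fields, limits, escalation wording) is an illustrative assumption, not a vendor default or an ASHRAE setpoint.

```python
# Illustrative only: hypothetical thresholds and field names, not tied to any DCIM/BMS product.
from dataclasses import dataclass

@dataclass
class LoopReading:
    location: str          # e.g. "CDU-3 secondary supply"
    supply_temp_c: float   # coolant supply temperature
    flow_lpm: float        # loop flow rate, liters per minute
    pressure_kpa: float    # loop pressure
    leak_detected: bool    # output of a rope or spot leak sensor

def classify(reading: LoopReading) -> str:
    """Map a reading to an alarm severity with a defined escalation path."""
    if reading.leak_detected:
        return "CRITICAL: isolate rack, dispatch on-site team, open incident"
    if reading.flow_lpm < 40 or reading.pressure_kpa < 150:
        return "CRITICAL: possible pump or valve failure, page facilities"
    if reading.supply_temp_c > 45:
        return "WARNING: supply temperature drifting, raise a change-managed ticket"
    return "OK"

print(classify(LoopReading("CDU-3 secondary supply", 41.0, 62.0, 210.0, False)))
```

The point is not the specific numbers; it is that every loop alarm has a named owner and a predefined next step, the same way a UPS or breaker alarm does.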
This is where an execution partner like Guardian can help you extend the same rigor you already use for power, data destruction, and logistics into your liquid cooling environment. Guardian’s field teams already operate under strict procedures, chain-of-custody, and documentation requirements for data center projects nationwide.
Principle 2: Design for Humans, Not Just Heat Transfer Coefficients
We love talking about pump curves and cold plate designs. But in daily operations, the questions that matter are more human:
- Who is allowed to open a wet connection?
- Where is the isolation valve for this rack, and is it labeled clearly enough to find in the dark?
- How does a new technician learn the “right way” to do a fluid top-off without shadowing the one veteran who’s done it before?
A good 100 kW rack design includes:
- Visual clarity
– Color-coded loops and manifolds
– Clear labeling for isolation points and flow direction
– Diagrams posted in the Data Hall and available in the CMDB
- Procedural clarity
– Written runbooks for commissioning, maintenance, and incident response
– Simple checklists for tasks like connecting new racks or replacing a CDU
- Training and drills
– Hands-on practice for leak response (including cleanup and documentation)
– Cross-training between facilities and IT so that no one works in isolation
Guardian’s national field teams already run complex, multi-site projects where every step is documented, from decommissioning to onsite data destruction and logistics. Applying the same playbook to liquid cooling tasks reduces the dependency on “tribal knowledge” and makes 100 kW racks manageable for more than a handful of specialists.
Principle 3: Instrument for PUE, Not Just “Good Enough”
Most operators know the headline: a lower PUE is better. But in practice, PUE is often treated as a quarterly KPI, not a real-time feedback loop.
As noted earlier, cooling typically consumes around 40% of total data center power, so even modest PUE improvements translate directly into lower overhead energy and emissions.
Liquid cooling gives you new levers:
- Higher supply temperatures
- Reduced or eliminated server fans
- More efficient heat rejection (e.g., dry coolers or heat reuse)
To take advantage of those levers, design your monitoring so you can:
- Track partial PUE at the Data Hall or cluster level, not just for the whole site (a minimal calculation sketch follows this list)
- Correlate loop performance (temperatures, flows, pump power) with IT utilization and AI training runs
- Run experiments safely, e.g., nudging water temperature up while verifying chip temps stay within ASHRAE limits
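Here is a minimal sketch of what tracking partial PUE at the pod or cluster level means, assuming you can meter the IT load and the cooling plant serving that pod separately. Every power figure below is invented for illustration.

```python
# Partial-PUE sketch; all power figures are made-up illustrations, not measured values.
def partial_pue(it_power_kw: float, cooling_power_kw: float, other_overhead_kw: float = 0.0) -> float:
    """Partial PUE for one data hall or cluster: (IT + overhead attributable to it) / IT."""
    return (it_power_kw + cooling_power_kw + other_overhead_kw) / it_power_kw

# Example: a 20-rack AI pod at 100 kW/rack whose pumps, CDUs, and dry coolers draw 260 kW.
it_load = 20 * 100.0   # 2,000 kW of IT load
cooling = 260.0        # cooling power attributable to this pod only
print(round(partial_pue(it_load, cooling), 3))  # -> 1.13
```

Computed per pod and logged over time, a number like this lets you see what a 2 °C supply-temperature change actually did, instead of waiting for a quarterly site-wide PUE report.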
This is where AI-assisted control can shine, similar to how Google’s DeepMind system cut cooling energy by up to 40% in one of its facilities.
But the foundation is boring: accurate, trusted data. Build that instrumentation into your 100 kW rack design from day zero.
Principle 4: Plan the Liquid Cooling Lifecycle on Day One
Today’s 100 kW rack is tomorrow’s “legacy architecture.”
Analysts estimate that the data center liquid cooling market could more than triple this decade, as operators retrofit existing sites and build new AI-ready facilities.
That growth guarantees change:
- Racks will be moved, expanded, or retired long before their nominal end-of-life.
- Fluids will need to be sampled, filtered, replaced, and ultimately removed.
- ESG teams will ask where those fluids and materials went and whether you have evidence.
To avoid surprises, bake the lifecycle into your design:
- Documented fluid inventory: types, volumes, and storage locations (see the record sketch after this list)
- Standard processes for draining, capturing, and transporting fluids
- Defined decommissioning workflows that tie into ITAD, recycling, and security
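A fluid inventory doesn’t need to be elaborate to be useful. As a sketch, assuming one record per loop (the field names here are illustrative, not an industry schema):

```python
# Hypothetical fluid-inventory record; fields are illustrative, not a standard schema.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class FluidRecord:
    loop_id: str                 # e.g. "HALL2-POD4-SECONDARY"
    fluid_type: str              # e.g. "PG25 (25% propylene glycol)"
    volume_liters: float
    last_sampled: date
    storage_location: str        # where spare or drained fluid is kept
    custody_log: list[str] = field(default_factory=list)  # who handled it, when, and why

rec = FluidRecord("HALL2-POD4-SECONDARY", "PG25 (25% propylene glycol)", 310.0,
                  date(2025, 3, 14), "Chem store B, drum 12")
rec.custody_log.append("2025-03-14: sampled by facilities tech, sent to lab")
print(rec.loop_id, rec.volume_liters, "L")
```

The custody log is the part ESG and audit teams will ask about: when fluid leaves the building, you want the same evidence trail you already keep for drives and assets.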
Guardian already handles the messy middle of the IT lifecycle: full data center decommissioning, onsite data destruction, and packing & logistics. Extending those services to fluid handling and liquid-cooled hardware means your future “tear-downs” are just another standard project type, not a bespoke fire drill.
Principle 5: Make 100 kW Racks a Repeatable Product, Not a Science Project
The ultimate test of your 100 kW liquid cooling design is simple:
Can you replicate it in a second site without reinventing the process?
To get there:
- Standardize a “liquid pod” pattern
– A reference design for 1–4 racks, including cooling, power, monitoring, and network (sketched after this list)
– A bill of materials that’s pre-vetted for compatibility and lead times
- Bundle services with hardware
– Work with your OEMs, ITADs, VARs, and MSPs to ensure the design always includes implementation, training, and lifecycle services, often delivered by partners like Guardian behind the scenes.
- Template the documentation
– Reusable runbooks, RACI charts, commissioning checklists, and decommissioning plans
– A consistent way to log what happened at each site for audit and ESG reporting
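One way to make the pod pattern tangible is to keep the reference design as a machine-readable template that every site instantiates. The sketch below is a toy example; the component counts and setpoints are assumptions, not a vetted bill of materials.

```python
# Toy "liquid pod" template; counts and setpoints are assumptions, not a recommended design.
LIQUID_POD_TEMPLATE = {
    "racks": 4,
    "design_kw_per_rack": 100,
    "cooling": {"cdus": 2, "secondary_loops": "N+1", "supply_temp_c": 32},
    "monitoring": ["flow", "pressure", "supply/return temp", "leak detection"],
    "documents": ["commissioning checklist", "runbook", "RACI", "decommissioning plan"],
}

def instantiate(site: str, pod_number: int) -> dict:
    """Stamp out a site-specific copy of the reference pod for tracking and audit."""
    pod = dict(LIQUID_POD_TEMPLATE)
    pod["id"] = f"{site}-POD{pod_number:02d}"
    return pod

print(instantiate("DAL1", 3)["id"])  # -> DAL1-POD03
```

When the second site orders “one more pod,” it inherits the same monitoring list, the same documents, and the same audit trail by default.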
When you do this, 100 kW racks become a product your organization knows how to buy, deploy, support, and retire—not a one-off engineering experiment.
Where Guardian Fits in a 100 kW World
In the context of 100 kW+ liquid-cooled racks, Guardian’s existing field services translate to:
- Commissioning and validation: Onsite support to bring new liquid-cooled racks into production using standardized checklists.
- Preventative maintenance: Scheduled visits that combine inspections, minor remediation, and documentation.
- Fluid management and remediation: Handling, cleanup, and coordination when liquid work is needed, always under clear procedures.
- Decommissioning and moves: Integrated fluid handling, data protection, packing, and logistics when racks or entire Data Halls change roles.
AI is rewriting the physics of the data center. Guardian helps make sure your operations and your risk management can keep up.
If your AI roadmap assumes 100 kW racks, what’s the one operational capability you’re most worried about—monitoring, training, incident response, or lifecycle / ESG?
