In a blog post on Wednesday, Rich Evans, a Research Engineer at Google DeepMind, and Jim Gao, a Data Center Engineer at Google, described work the company has done using machine learning to manage the power consumption of one of its data centers, reducing the energy used for cooling by as much as 40%. The figure is particularly impressive because it comes on top of investments the company has already made in reducing data center (DC) power consumption, such as developing extremely efficient servers, highly efficient cooling strategies, and renewable energy sources.
The general idea behind applying ML to the problem of data center optimization was explained by Jim in a 2014 white paper on the topic:
Machine learning is well-suited for the DC environment given the complexity of plant operations and the abundance of existing monitoring data. The modern large-scale DC has a wide variety of mechanical and electrical equipment, along with their associated setpoints and control schemes. The interactions between these systems and various feedback loops make it difficult to accurately predict DC efficiency using traditional engineering formulas.
For example, a simple change to the cold aisle temperature setpoint will produce load variations in the cooling infrastructure (chillers, cooling towers, heat exchangers, pumps, etc.), which in turn cause nonlinear changes in equipment efficiency. Ambient weather conditions and equipment controls will also impact the resulting DC efficiency. Using standard formulas for predictive modeling often produces large errors because they fail to capture such complex interdependencies.
The paper described a neural network with 5 hidden layers of 50 nodes each, trained on features like Server Load, Number of Water Pumps Running, Humidity and Wind Speed. They use 19 such features, all normalized, derived from 2 years of plant sensor data captured at 5-minute intervals. The output of the neural net, and their primary prediction and optimization target, is PUE, or Power Usage Effectiveness, a measure of data center energy efficiency, though they mention other possible optimization targets such as server utilization or equipment uptime.
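To give a sense of scale, a model of this shape is quite small. The sketch below builds an equivalently sized feed-forward network: 19 normalized inputs, 5 hidden layers of 50 units, and a single PUE output. The framework (PyTorch), activation function, optimizer, and placeholder data are all assumptions for illustration; the white paper does not specify these implementation details.

```python
import torch
import torch.nn as nn

# A minimal sketch of the architecture described in the 2014 white paper:
# a feed-forward network with 5 hidden layers of 50 units each, taking the
# 19 normalized plant features and predicting a single value (PUE).
# Activation, optimizer, and data below are illustrative assumptions.
class PUEModel(nn.Module):
    def __init__(self, n_features: int = 19, hidden: int = 50, depth: int = 5):
        super().__init__()
        layers, in_dim = [], n_features
        for _ in range(depth):
            layers += [nn.Linear(in_dim, hidden), nn.ReLU()]
            in_dim = hidden
        layers.append(nn.Linear(in_dim, 1))  # single output: predicted PUE
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# One training step on a batch of normalized sensor readings
# (x: [batch, 19] features, y: [batch, 1] observed PUE).
model = PUEModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.randn(32, 19)              # placeholder for normalized features
y = 1.1 + 0.05 * torch.rand(32, 1)   # placeholder PUE targets
loss = loss_fn(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```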
So, Jim and his team at Google have been working on this for a couple of years, and over the past few months they've been collaborating with Rich and the team at DeepMind to improve the results and utility of the system they've developed.
Whereas the previous effort was primarily used for alerting and simulation, the new system allows Google to actually control the data center using machine learning.
The new system uses an ensemble of deep neural networks to predict PUE, and they've developed additional ensembles that predict future data center temperature and pressure. The blog post doesn't discuss the control system at all, but according to Jack Clark's Bloomberg article, DeepMind co-founder Demis Hassabis has described this work in presentations, suggesting it uses a technique like the Deep Q-Networks approach DeepMind developed to play Atari games. The system reportedly controls about 120 data center variables such as fans, cooling systems, and windows.
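The post doesn't say how the ensemble members are combined, but a common approach is to average their predictions and treat the spread across members as a rough confidence signal. The sketch below (again PyTorch, with an assumed member architecture) illustrates that idea; it is not DeepMind's actual implementation.

```python
import torch
import torch.nn as nn

# Illustrative sketch of combining an ensemble of PUE predictors.
# The blog post only says an ensemble of deep networks is used; the member
# architecture and the averaging scheme here are assumptions.
def make_member(n_features: int = 19, hidden: int = 50, depth: int = 5) -> nn.Module:
    layers, in_dim = [], n_features
    for _ in range(depth):
        layers += [nn.Linear(in_dim, hidden), nn.ReLU()]
        in_dim = hidden
    layers.append(nn.Linear(in_dim, 1))
    return nn.Sequential(*layers)

def ensemble_predict(models, x: torch.Tensor):
    # Stack member outputs: the mean is the ensemble prediction,
    # the standard deviation gives a rough uncertainty estimate.
    preds = torch.stack([m(x) for m in models], dim=0)  # [n_members, batch, 1]
    return preds.mean(dim=0), preds.std(dim=0)

members = [make_member() for _ in range(5)]   # 5 independently initialized members
mean_pue, pue_std = ensemble_predict(members, torch.randn(8, 19))
```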
In practice the system has been able to achieve a 40% reduction in the amount of energy used for cooling at the Google test site, which translates into a 15% reduction in overall PUE overhead, producing the lowest PUE the site had ever seen.
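A quick back-of-the-envelope calculation shows why the 15% figure applies to the overhead portion of PUE rather than to PUE itself: cutting a realistic PUE by 15% outright would push it below 1.0, which is impossible. The numbers below are purely illustrative (chosen so a 40% cooling cut works out to a 15% overhead reduction); Google hasn't published the test site's actual breakdown.

```python
# Illustrative only: assume 100 units of IT power, a PUE of 1.12, and that
# cooling accounts for 4.5 of the 12 units of overhead (values chosen so the
# arithmetic matches the reported figures, not Google's actual numbers).
it_power = 100.0
cooling = 4.5
other_overhead = 7.5

pue_before = (it_power + cooling + other_overhead) / it_power        # 1.12
cooling_after = cooling * (1 - 0.40)                                 # 40% less cooling energy
pue_after = (it_power + cooling_after + other_overhead) / it_power   # 1.102

overhead_cut = ((pue_before - 1) - (pue_after - 1)) / (pue_before - 1)
print(round(pue_after, 3), round(overhead_cut, 2))                   # 1.102 0.15
```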
Ultimately, the blog describes the new system as a general-purpose framework for understanding complex dynamics, and Google expects to apply this to other optimization challenges in the data center and beyond, such as improving power conversion efficiency, reducing semiconductor manufacturing energy and water usage, or increasing manufacturing throughput.
They also plan to share more details in an upcoming paper, which I’m looking forward to reading.