Home

The MIT Datacenter Challenge is part of the MIT-USAF AI Accelerator (AIA), a joint MIT-USAF program focused on fundamental AI research that improves Department of the Air Force operations while also addressing broader societal needs. Within the AIA, the Datacenter Challenge falls under the FastAI project, a research program dedicated to the rapid development of portable, high-performance AI applications. Areas of research in the FastAI project include programming languages, compiler technologies, comprehensive instrumentation, analytical productivity tools, and parallel algorithms.

To learn more about the MIT-USAF AIA and the FastAI project, please visit https://aia.mit.edu/research/

If you have questions related to the Datacenter Challenge, please email us at mit-dcc@mit.edu.

Overview

The MIT Datacenter Challenge is making an HPC cluster dataset publicly available for the research and development of innovative AI/ML techniques that will facilitate the development of tools to detect, explain, recover from, and potentially predict outliers, and ultimately to mitigate system failures in complex multi-sensor systems. Details about the dataset, collection methodology, and preliminary analysis can be found in our paper:

Samsi, Siddharth, Matthew Weiss, David Bestor, et al. "The MIT Supercloud Dataset." 2021 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 2021.

Full text of the paper is available at https://ieeexplore.ieee.org/abstract/document/9622850 or https://arxiv.org/abs/2108.02037

Dataset

Datacenter monitoring systems offer a variety of data streams and events. The Datacenter Challenge datasets are a combination of high-level data (e.g., Slurm Workload Manager scheduler data) and low-level, job-specific time series data. The high-level data includes parameters such as the number of nodes requested, the CPU/GPU/memory resources requested, exit codes, and run time data. The low-level time series data is collected on the order of seconds for each job. This granular time series data includes CPU/GPU/memory utilization, disk I/O volume, and environmental parameters such as power draw and temperature. Ideally, leveraging both high-level scheduler data and low-level time series data will facilitate the development of AI/ML algorithms which not only predict and detect failures, but also allow for the accurate determination of their cause.
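The structure above (one scheduler record per job, many time series samples per job) can be sketched in a few lines of Python. This is a minimal illustration, not the dataset's actual schema: the field names (`job_id`, `cpu_util`, etc.) and values here are hypothetical placeholders for the corresponding Slurm and monitoring fields.

```python
# Hypothetical sketch: joining high-level scheduler records with
# low-level time series samples by job ID, and deriving a per-job
# feature (peak CPU utilization). Field names are illustrative only.

from collections import defaultdict

# High-level scheduler records (one per job)
scheduler = [
    {"job_id": 101, "nodes": 2, "exit_code": 0, "run_time_s": 3600},
    {"job_id": 102, "nodes": 1, "exit_code": 1, "run_time_s": 120},
]

# Low-level time series samples (many per job, ~second granularity)
timeseries = [
    {"job_id": 101, "t": 0, "cpu_util": 0.20},
    {"job_id": 101, "t": 1, "cpu_util": 0.85},
    {"job_id": 102, "t": 0, "cpu_util": 0.05},
]

def peak_cpu_by_job(samples):
    """Reduce per-second samples to a peak CPU utilization per job."""
    peaks = defaultdict(float)
    for s in samples:
        peaks[s["job_id"]] = max(peaks[s["job_id"]], s["cpu_util"])
    return dict(peaks)

# Attach the derived feature to each scheduler record
peaks = peak_cpu_by_job(timeseries)
enriched = [dict(job, peak_cpu=peaks.get(job["job_id"], 0.0)) for job in scheduler]
```

Joining the two data levels this way is what lets a model relate an observed outcome (e.g., a nonzero exit code) to the resource behavior that preceded it.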

Below are some potential AI/ML challenge problems for the Datacenter Challenge. 

  • Scheduling Characterization

    • As jobs are submitted to a shared HPC system asynchronously, are there insights to be gained by characterizing job submission rates and times?

    • Can such characterization help improve scheduling and resource allocation policies?

    • Can AI/ML models accurately predict job run-times?

  • Workload Characterization

    • As a variety of different jobs are run in a shared HPC cluster environment, one natural machine learning task is to classify different workloads into traditional HPC jobs and jobs that involve training AI/ML models.

  • File System Characterization

    • Given high-level scheduler data and low-level, job-specific time series data, can we determine what normal HPC system usage looks like?

  • Error Characterization

    • Many HPC jobs run over the course of several hours or even days. Jobs that fail after a significant amount of time are an inefficient use of HPC resources and user time. In this context, can AI/ML techniques be developed to accurately predict job failures far in advance of their occurrence?
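The error-characterization problem above can be sketched with a toy early-warning rule: project a job's recent memory utilization trend forward and flag jobs headed toward exhaustion long before they fail. Everything here is an assumption for illustration; the window size, threshold, horizon, and the linear-trend model are placeholders for whatever an actual AI/ML approach would learn from the dataset.

```python
# Hypothetical sketch of early failure prediction: flag jobs whose
# memory utilization trends toward exhaustion well before they fail.
# Window, threshold, and horizon are illustrative, not tuned values.

def slope(values):
    """Least-squares slope of evenly spaced samples (pure Python)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den if den else 0.0

def at_risk(mem_util, window=5, threshold=0.9, horizon=60):
    """Project the recent trend `horizon` samples ahead; flag if it crosses `threshold`."""
    recent = mem_util[-window:]
    projected = recent[-1] + slope(recent) * horizon
    return projected >= threshold

# A job whose memory footprint grows steadily is flagged early...
rising = [0.10, 0.12, 0.14, 0.16, 0.18]
# ...while a flat job is not.
steady = [0.50, 0.50, 0.51, 0.50, 0.50]
```

A real solution would replace this hand-set heuristic with a model trained on the dataset's labeled job outcomes, but the shape of the task is the same: map a prefix of a job's time series to a prediction about its eventual fate.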