Welcome to the MIT Datacenter Challenge website. The goal of the Datacenter Challenge is to foster innovative artificial intelligence (AI) and machine learning (ML) approaches to the analysis of large-scale high performance computing (HPC) center data and monitoring.
From a “big picture” perspective, the algorithms developed for the analysis of HPC data can generalize to other multi-sensor settings, such as predicting cyber attacks or monitoring the various sensors within an airplane. Furthermore, system failure in these contexts is often caused by a complex set of interacting events that are not easily isolated or predicted.
To this end, the MIT Datacenter Challenge is making an HPC cluster dataset publicly available for the research and development of innovative AI/ML techniques. These techniques are intended to facilitate the development of tools that detect, explain, recover from, and potentially predict outliers, and that ultimately mitigate system failure in complex multi-sensor systems.
The Datacenter Challenge datasets combine high-level data (e.g. Slurm Workload Manager scheduler data) with low-level, job-specific time series data. The high-level data includes parameters such as the number of nodes requested, the CPU, GPU, and memory resources requested, exit codes, and run time data. The low-level time series data is sampled for each job at intervals on the order of seconds. This granular time series data includes CPU/GPU/memory utilization, amount of disk I/O, and environmental parameters such as power drawn and temperature. Ideally, leveraging both high-level scheduler data and low-level time series data will facilitate the development of AI/ML algorithms that not only predict and detect failures, but also allow their cause to be determined accurately.
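As a rough illustration of how the two levels of data might be combined, the sketch below summarizes per-job time series samples and joins them onto scheduler records by job ID. It is a minimal sketch, not the dataset's actual schema: every field name (`job_id`, `exit_code`, `gpu_util`, `power_w`, `temp_c`, and so on), every sample value, and the rule that a nonzero exit code marks a failed job are assumptions made for illustration only.

```python
from statistics import mean

# Hypothetical high-level scheduler records (field names are assumed).
scheduler_rows = [
    {"job_id": 101, "nodes": 2, "gpus": 4, "exit_code": 0, "run_time_s": 3600},
    {"job_id": 102, "nodes": 1, "gpus": 1, "exit_code": 1, "run_time_s": 120},
]

# Hypothetical low-level samples, one row per job per timestamp (assumed layout).
timeseries_rows = [
    {"job_id": 101, "t": 0, "gpu_util": 0.5, "power_w": 200.0, "temp_c": 60.0},
    {"job_id": 101, "t": 10, "gpu_util": 1.0, "power_w": 300.0, "temp_c": 70.0},
    {"job_id": 102, "t": 0, "gpu_util": 0.0, "power_w": 50.0, "temp_c": 40.0},
    {"job_id": 102, "t": 10, "gpu_util": 0.0, "power_w": 150.0, "temp_c": 85.0},
]


def summarize_timeseries(rows):
    """Reduce each job's samples to a few summary statistics."""
    by_job = {}
    for row in rows:
        by_job.setdefault(row["job_id"], []).append(row)
    return {
        job_id: {
            "mean_gpu_util": mean(s["gpu_util"] for s in samples),
            "mean_power_w": mean(s["power_w"] for s in samples),
            "max_temp_c": max(s["temp_c"] for s in samples),
            "n_samples": len(samples),
        }
        for job_id, samples in by_job.items()
    }


def join_job_features(scheduler, ts_summary):
    """Attach time series summaries to scheduler records, keyed by job_id."""
    joined = []
    for job in scheduler:
        features = dict(job)
        # Jobs with no samples keep only their scheduler fields.
        features.update(ts_summary.get(job["job_id"], {}))
        # Assumption: a nonzero exit code is treated as a failure label.
        features["failed"] = job["exit_code"] != 0
        joined.append(features)
    return joined


features = join_job_features(scheduler_rows, summarize_timeseries(timeseries_rows))
for row in features:
    print(row)
```

Each resulting row pairs what a job requested and how it exited with how it actually behaved while running, which is the kind of combined view a failure-prediction or root-cause model could be trained on. Real analyses would likely keep the full time series rather than only summary statistics.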