Please cite the following paper if you use or reference this dataset:
Samsi, Siddharth, Weiss, Matthew, Bestor, David, et al. “The MIT Supercloud Dataset.” 2021 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 2021.
If you have any questions, please email us at mit-dcc@mit.edu
The MIT Supercloud Dataset is available for download from the Amazon Open Data Registry via the following bucket:
s3://mit-supercloud-dataset/datacenter-challenge/202201/
The above S3 bucket contains data released as of January 2022. As we continue to collect data, the bucket will be updated with additional folders named for the date of each release. The easiest way to download the data is with the AWS S3 command line tools. Instructions for installing the AWS CLI tools can be found here: https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html
A directory listing can be obtained using the following command:
aws s3 ls s3://mit-supercloud-dataset/datacenter-challenge/202201/ --no-sign-request
PRE cpu/
PRE gpu/
2022-01-19 14:11:40 285 LICENSE
2022-01-19 14:11:40 7079 README.md
2022-01-20 12:20:09 339 labelled_job_stats.csv
2022-01-20 12:20:10 82332 labelled_jobids.csv
2022-01-20 12:20:10 2274303873 node-data.csv
2022-01-20 12:20:11 103165427 slurm-log.csv
2022-01-20 12:20:11 111 tres-mapping.txt
CPU and GPU time series data are in the corresponding folders shown above (cpu/ and gpu/) and are organized in subfolders.
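Because each top-level file in the listing is a separate object under the release prefix, a single file can be fetched without downloading the time series folders. The following is a minimal sketch, not part of the dataset's own tooling: the helper names `supercloud_uri` and `fetch_file` and the local destination directory are hypothetical, and only the bucket path and file names come from the listing above.

```shell
#!/usr/bin/env bash
# Sketch: fetch individual files from the January 2022 release.
# Assumes the AWS CLI is installed and on PATH.
BUCKET_PREFIX="s3://mit-supercloud-dataset/datacenter-challenge/202201"

# Build the full S3 URI for a file name taken from the directory listing.
supercloud_uri() {
  printf '%s/%s\n' "$BUCKET_PREFIX" "$1"
}

# Download one file into a local directory (hypothetical helper).
# $1 = file name from the listing, $2 = destination directory (default: current directory).
fetch_file() {
  local name="$1" dest="${2:-.}"
  aws s3 cp "$(supercloud_uri "$name")" "$dest/$name" --no-sign-request
}

# Example (requires network access):
# fetch_file slurm-log.csv ./supercloud
# fetch_file tres-mapping.txt ./supercloud
```

Starting with the smaller files such as slurm-log.csv or labelled_jobids.csv is a reasonable way to inspect the data layout before committing to the larger downloads.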
Data can be downloaded using the following command:
aws s3 cp s3://mit-supercloud-dataset/datacenter-challenge datacenter-challenge --recursive --no-sign-request
Note: The dataset is approximately 2 TB. Please ensure that you have sufficient storage space available before running the above command to download the entire dataset.
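One way to act on this note is to compare the remote size with local free space before starting the download. The sketch below is an illustration under stated assumptions: the helper names `remote_summary`, `free_kib`, and `has_space` are hypothetical, and the 2 TB threshold in the example is taken from the approximate figure above rather than measured. Listing every object recursively can itself take some time.

```shell
#!/usr/bin/env bash
# Sketch: check dataset size and local free space before a full download.
SRC="s3://mit-supercloud-dataset/datacenter-challenge/"

# Print the "Total Objects" and "Total Size" summary lines for the prefix.
remote_summary() {
  aws s3 ls "$SRC" --recursive --summarize --human-readable --no-sign-request | tail -n 2
}

# Free space, in 1 KiB blocks, on the filesystem holding directory $1
# (default: current directory). Uses POSIX df output, column 4 = available.
free_kib() {
  df -Pk "${1:-.}" | awk 'NR==2 {print $4}'
}

# Succeeds when at least $1 KiB are free in directory $2.
has_space() {
  [ "$(free_kib "${2:-.}")" -ge "$1" ]
}

# Example (2 TB expressed in KiB; the threshold is an assumption):
# remote_summary
# has_space $((2 * 1024 * 1024 * 1024)) . && echo "enough space" || echo "not enough space"
```

The AWS CLI also accepts `--dryrun` on `aws s3 cp`, which prints the operations that would be performed without transferring any data.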
For full details of the data please refer to the paper “The MIT Supercloud Dataset”, available at https://ieeexplore.ieee.org/abstract/document/9622850 or https://arxiv.org/abs/2108.02037