Problems

Datacenter monitoring systems offer a variety of data streams and events. Below are some potential AI/ML challenge problems for the Datacenter Challenge. We anticipate providing baseline implementations as well as qualitative metrics that can be used to measure performance. Full challenge problem specifications will be announced in Summer 2021.

  • Scheduling Characterization

    • As jobs are submitted to a shared HPC system asynchronously, are there insights to be gained by characterizing job submission rates and times?

    • Can such characterization help improve scheduling and resource allocation policies?

    • Can AI/ML models accurately predict job run-times?
       

  • Workload Characterization

    • Because a variety of jobs run in a shared HPC cluster environment, one natural machine learning task is to classify workloads as either traditional HPC jobs or jobs that involve training AI/ML models.
       

  • File System Characterization

    • Given high-level scheduler data and low-level, job-specific time-series data, can we determine what normal HPC system usage looks like?
       
  • Error Characterization

    • Many HPC jobs run over the course of several hours or even days. Jobs that fail after a significant amount of time are an inefficient use of HPC resources and user time. In this context, can AI/ML techniques be developed to accurately predict job failures far in advance of their occurrence?
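
As a rough illustration of the scheduling characterization question above, the sketch below summarizes job submission times by mean inter-arrival gap and by hour of day. It is only a starting point under stated assumptions: the function name, the ISO-8601 timestamp input, and the sample data are all hypothetical and do not reflect the challenge's actual data format or baseline implementations.

```python
from collections import Counter
from datetime import datetime
from statistics import mean


def characterize_submissions(timestamps):
    """Summarize job submission times.

    timestamps: iterable of ISO-8601 strings (an assumed input format).
    Returns (mean gap between consecutive submissions in seconds,
    Counter mapping hour of day to number of submissions).
    """
    times = sorted(datetime.fromisoformat(t) for t in timestamps)
    # Gaps between consecutive submissions, in seconds.
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    mean_gap = mean(gaps) if gaps else None
    # Submission counts bucketed by hour of day (0-23).
    per_hour = Counter(t.hour for t in times)
    return mean_gap, per_hour


# Made-up sample data, for illustration only.
sample = [
    "2021-03-01T09:00:00",
    "2021-03-01T09:10:00",
    "2021-03-01T09:40:00",
    "2021-03-01T14:00:00",
]
mean_gap, per_hour = characterize_submissions(sample)
print(mean_gap, dict(per_hour))
```

Summaries like these could feed a simple arrival-rate model before any learned approach is attempted.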