Tag Archives: GPU

Fluid Accepted to Appear at MLSys’2021

While training and inference of deep learning models have received significant attention in recent years (e.g., Tiresias, AlloX, and Salus from our group), hyperparameter tuning is often overlooked or lumped into the same bucket of optimizations as training. Existing hyperparameter tuning solutions, primarily from the ML research community, are mostly resource-agnostic. More importantly, even if they try to use up all available resources, existing solutions do not distinguish between the throughput of a GPU (how much work a GPU is doing) and its goodput (how much of that is ultimately useful work) during hyperparameter tuning. Fluid is our attempt at bridging the gap between hyperparameter tuning algorithms and the underlying cluster resources by improving both intra- and inter-GPU goodput in large clusters.

Current hyperparameter tuning solutions lack complementary execution engines that efficiently leverage distributed computation; by ignoring the possibility of intra- and inter-GPU sharing, they exhibit poor resource usage. In this paper, we present Fluid, a generalized hyperparameter tuning execution engine that coordinates between hyperparameter tuning jobs and cluster resources. Fluid schedules evaluation trials in such jobs using a water-filling approach to make the best use of resources at both intra- and inter-GPU granularities to speed up the tuning process. By abstracting a hyperparameter tuning job as a sequence of TrialGroups, Fluid can boost the performance of diverse hyperparameter tuning solutions. Our experiments show that Fluid can speed up synchronous BOHB by 200%, and BOHB and ASHA by 30%, while achieving similar final accuracy.
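To make the water-filling idea a bit more concrete, here is a minimal, self-contained sketch (illustrative names only, not Fluid's actual API) of dividing a pool of GPUs among trials that have different scaling limits. The allocator keeps raising every active trial's share until the trial hits its cap or the GPUs run out; fractional shares correspond to packing multiple trials onto one GPU, while shares above one correspond to multi-GPU data parallelism.

```python
# Minimal water-filling sketch: scaling_caps[i] is the most GPUs trial i
# can use productively (hypothetical input, e.g., from profiling).
def water_fill(num_gpus, scaling_caps):
    alloc = [0.0] * len(scaling_caps)
    remaining = float(num_gpus)
    active = [i for i, cap in enumerate(scaling_caps) if cap > 0]
    while remaining > 1e-9 and active:
        share = remaining / len(active)        # raise the "water level"
        still_active = []
        for i in active:
            take = min(share, scaling_caps[i] - alloc[i])
            alloc[i] += take
            remaining -= take
            if alloc[i] < scaling_caps[i] - 1e-9:
                still_active.append(i)         # trial can still absorb more
        active = still_active
    return alloc

# 4 GPUs, 3 trials: one tiny trial capped at half a GPU and two that can
# scale up to 4 GPUs each -> roughly [0.5, 1.75, 1.75].
print(water_fill(4, [0.5, 4.0, 4.0]))
```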

Fluid is a joint project between Peifeng and Jiachen, which started right after Salus and before Jiachen started her Ph.D.! I’m super excited about many future works in the Systems + AI area from SymbioticLab members.

Presented Keynote Speech at HotEdgeVideo’2020

Earlier this week, I presented a keynote speech on the state of resource management for deep learning at the HotEdgeVideo’2020 workshop, covering our recent works on systems support for AI (Tiresias, AlloX, and Salus) and discussing open challenges in this space.

In this talk, I highlighted three aspects of resource management in the context of deep learning that I think make this setting unique even after decades of resource management research on CPUs, networks, and even Big Data clusters.

  1. Short-term predictability and long-term uncertainty of deep learning workloads: Deep learning workloads (hyperparameter tuning, training, and inference) may have different objectives, but they all share a common trend. Although we cannot know how many iterations there will be in a job or how many requests will come to a model for inference, each iteration/request performs the same computation for a given job. This means we can profile a job and then exploit that information for long-term benefits by adapting classic information-agnostic and information-limited scheduling techniques (see the sketch after this list).
  2. Heterogeneous, interchangeable compute devices in deep learning clusters: Deep learning clusters are becoming increasingly diverse, with many generations of GPUs and new hardware accelerators coming out every month. The key resource management challenge here is that all these compute devices are interchangeable (they all can compute), but they do not compute at the same rate for all models. Some models are better suited to CPUs, some to GPUs, some to TPUs, and so on. We need to rethink resource management algorithms to account for resource interchangeability.
  3. Black-box hardware accelerators: Deep learning hardware devices are also black boxes. Even for GPUs, we do not have any control over their internals; apart from some high-level information that is publicly available, we don’t know anything about what happens inside. For newer, vendor-locked accelerators, details are even scarcer. Consequently, resource management solutions should be designed to assume black-box hardware from the get-go and then rely on profiling (by leveraging the iterative nature of deep learning) and short-term predictability to extract good performance using software techniques.
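As a small illustration of the first point (my own sketch, not something from the talk), exploiting short-term predictability can be as simple as timing a handful of iterations and handing the estimate to a scheduler; run_one_iteration below is a hypothetical stand-in for a single training step or inference request.

```python
import time

def profile_iteration_time(run_one_iteration, warmup=3, samples=10):
    """Estimate the steady-state time of one iteration of a given job."""
    for _ in range(warmup):          # discard warm-up effects (caching, autotuning)
        run_one_iteration()
    start = time.perf_counter()
    for _ in range(samples):
        run_one_iteration()
    return (time.perf_counter() - start) / samples   # seconds per iteration
```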

My slides from this talk are publicly available and elaborate on these points in more detail.

AlloX Accepted to Appear at EuroSys’2020

While GPUs are always in the news when it comes to deep learning clusters (e.g., Salus or Tiresias), we are in the midst of an emergence of many more computing devices (e.g., FPGAs and problem-specific accelerators) alongside traditional CPUs. All of them are compute devices, but one cannot expect the same speed from all of them for all types of computations. A natural question, therefore, is how to allocate such interchangeable resources in a hybrid cluster. AlloX is our attempt at a reasonable answer.

Modern deep learning frameworks support a variety of hardware, including CPU, GPU, and other accelerators, to perform computation. In this paper, we study how to schedule jobs over such interchangeable resources – each with a different rate of computation – to optimize performance while providing fairness among users in a shared cluster. We demonstrate theoretically and empirically that existing solutions and their straightforward modifications perform poorly in the presence of interchangeable resources, which motivates the design and implementation of AlloX. At its core, AlloX transforms the scheduling problem into a min-cost bipartite matching problem and provides dynamic fair allocation over time. We theoretically prove its optimality in an ideal, offline setting and show empirically that it works well in the online scenario by integrating it with Kubernetes. Evaluations on a small-scale CPU-GPU hybrid cluster and large-scale simulations highlight that AlloX can reduce the average job completion time significantly (by up to 95% when the system load is high) while providing fairness and preventing starvation.
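To give a rough feel for the transformation (a sketch of the underlying idea under stated assumptions, not AlloX's implementation), suppose profiling gives estimated processing times p[i][d] for job i on device type d. A job placed k-th from the end of a device's queue delays itself and the k-1 jobs after it, so it contributes k times its processing time to the sum of completion times; minimizing the average JCT for the current batch of jobs then reduces to a min-cost bipartite matching between jobs and (device, position) slots.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def allocate(p):
    """p[i][d]: estimated processing time of job i on device type d."""
    n_jobs, n_devs = p.shape
    # Slot (d, k) = "k-th position from the end of device d's queue";
    # occupying it costs k * p[i][d] toward the total completion time.
    cost = np.zeros((n_jobs, n_devs * n_jobs))
    for d in range(n_devs):
        for k in range(1, n_jobs + 1):
            cost[:, d * n_jobs + (k - 1)] = k * p[:, d]
    rows, cols = linear_sum_assignment(cost)   # min-cost bipartite matching
    schedule = []
    for job, slot in zip(rows, cols):
        device, k = divmod(slot, n_jobs)
        schedule.append((job, device, k + 1))  # k=1 means it runs last on its device
    return schedule

# 3 jobs, 2 device types (CPU, GPU) with different speeds: job 0 gets the
# GPU; jobs 2 and 1 share the CPU in shortest-first order.
p = np.array([[9.0, 3.0],
              [4.0, 4.0],
              [2.0, 6.0]])
print(allocate(p))
```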

AlloX has been in the oven for more than two years, and it is a testament to Tan and Xiao’s tenacity. It’s also my second paper with Zhenhua after HUG, our first joint work published from our recent NSF project, and my first paper at EuroSys. I believe the results in this paper will give rise to new analyses of many systems that have interchangeable resources, such as DRAM-NVM hybrid systems or the storage/caching hierarchy.

This year the EuroSys PC accepted 43 out of 234 submissions.

Salus Accepted to Appear at MLSys’2020

With the rising popularity of deep learning, GPUs have become increasingly prevalent in recent years. Modern GPUs are extremely powerful, with a lot of resources. A key challenge in this context is making sure that these devices are highly utilized. Although there has been a lot of research on improving GPU efficiency at the cluster level (e.g., our Tiresias in NSDI’19), little is known about how well individual GPUs are being utilized today. Worse, even if they are underutilized, little can be done because GPUs are opaque black boxes without any primitives for sharing them. Existing mechanisms for GPU sharing, such as NVIDIA MPS, are coarse-grained and cannot leverage application-specific information. Salus is our foray into the GPU sharing domain: it provides two key sharing primitives that allow one to develop a variety of algorithms and improve GPU efficiency for training, inference, and hyperparameter tuning workloads.

Unlike traditional resources such as CPU or the network, modern GPUs do not natively support fine-grained sharing primitives. Consequently, implementing common policies such as time-sharing and preemption is expensive. Worse, when a deep learning (DL) application cannot completely use a GPU’s resources, the GPU cannot be efficiently shared between multiple applications, leading to GPU underutilization.

We present Salus to achieve fine-grained GPU sharing among multiple DL applications by enabling two GPU sharing primitives: fast job switching and memory sharing. Salus is an efficient, consolidated execution service that exposes a GPU to different DL applications, and enforces fine-grained sharing by performing iteration scheduling and addressing associated memory management issues. We show that these primitives can then be used to implement flexible sharing policies. Our integration of Salus with TensorFlow and evaluation on popular DL jobs shows that Salus can improve the average completion time of DL training jobs by 3.19X, GPU utilization for hyper-parameter tuning by 2.38X, and GPU utilization of DL inference applications by 42X over not sharing the GPU and 7X over NVIDIA MPS with small overhead.
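For intuition, here is a toy sketch of iteration-granularity time sharing on a single GPU (not Salus's code, which lives inside the DL framework's execution service): each job is driven one iteration at a time, and the scheduler switches jobs only at iteration boundaries, where switching is cheap; different sharing policies amount to reordering the queue.

```python
from collections import deque

def run_iteration_scheduler(jobs):
    """jobs: dict of name -> iterator whose next() runs one training iteration."""
    queue = deque(jobs.items())
    while queue:
        name, it = queue.popleft()
        try:
            next(it)                  # execute exactly one iteration on the GPU
            queue.append((name, it))  # round-robin; swap in priorities or SRTF here
        except StopIteration:
            print(f"{name} finished")

# Stand-in "training" jobs that just count iterations.
def fake_training(num_iterations):
    for i in range(num_iterations):
        yield i                       # a real job would launch GPU kernels here

run_iteration_scheduler({"resnet": fake_training(3), "vgg": fake_training(2)})
```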

Salus has long been in the making and is the project that first got me into systems for AI and GPU resource management. Peifeng has been diligently working on it since 2017! While it took a long time, I’m excited that it has found a great home, and I’m looking forward to building on top of it. This is Peifeng’s first major paper, and the future is even brighter.

This year’s MLSys accepted 34 papers and remained as highly competitive as its previous iteration.

Tiresias Accepted to Appear at NSDI’2019

With the advancement of AI in recent years, GPUs have emerged as a popular choice for training deep learning (DL) models on large datasets. To deal with ever-growing datasets, it is also common to run distributed deep learning over multiple GPUs in parallel. Achieving cost-effectiveness and high performance in these clusters relies on efficiently sharing resources between multiple users. Unfortunately, most GPU clusters in production rely on resource managers designed for traditional big data analytics. This results in suboptimal performance and strong, but unnecessary, constraints. Tiresias is our first attempt at designing a GPU cluster resource manager that relies on profiling to make good scheduling and placement decisions with little or no input from the users.

Distributed training of deep learning (DL) models on GPU clusters is becoming increasingly more popular. Existing cluster managers face some unique challenges from DL training jobs, such as unpredictable training times, an all-or-nothing execution model, and inflexibility in GPU sharing. Our analysis of a large GPU cluster in production shows that existing big data schedulers — coupled with a consolidated job placement constraint, whereby GPUs for the same job must be allocated on as few machines as possible — cause long queueing delays and low overall performance.

We present Tiresias, a GPU cluster resource manager tailored for distributed DL training jobs, which efficiently schedules and places DL jobs to reduce their job completion times (JCT). Given that a DL job’s execution time is often unpredictable, we propose two scheduling algorithms — Discretized Two-Dimensional Gittins Index relies on partial information and Discretized Two-Dimensional LAS is information-agnostic — that aim to minimize the average JCT. Additionally, we describe when the consolidated placement constraint can be relaxed and present a placement algorithm to leverage these observations without any user input. Experiments on a cluster with 60 P100 GPUs — and large-scale trace-driven simulations — show that Tiresias improves the average JCT by up to 5.5X over an Apache YARN-based resource manager used in production. More importantly, Tiresias’s performance is comparable to that of solutions assuming perfect knowledge.
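To illustrate the information-agnostic variant (a simplified sketch with made-up thresholds, not Tiresias's implementation), Discretized Two-Dimensional LAS treats a job's attained service as the product of its GPU count and running time and demotes the job through a small set of priority queues as that product crosses fixed thresholds, scheduling FIFO within each queue.

```python
# Illustrative thresholds (in GPU-seconds) separating the priority queues.
THRESHOLDS = [3200, 32000]

def queue_index(num_gpus, seconds_run):
    attained = num_gpus * seconds_run        # 2D attained service
    for q, limit in enumerate(THRESHOLDS):
        if attained < limit:
            return q
    return len(THRESHOLDS)                   # lowest-priority queue

def schedule_order(jobs):
    """jobs: (name, num_gpus, seconds_run, arrival); highest priority first."""
    return sorted(jobs, key=lambda j: (queue_index(j[1], j[2]), j[3]))

jobs = [("a", 4, 100, 1.0), ("b", 1, 50000, 0.0), ("c", 8, 600, 2.0)]
print([name for name, *_ in schedule_order(jobs)])   # -> ['a', 'c', 'b']
```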

This is Juncheng’s second NSDI paper after Infiniswap in NSDI’17, and a very proud moment for me as his advisor. I would like to thank all our collaborators. I would also like to thank Samir and Barna for inviting me to the TTIC Summer Workshop on Datacenter Scheduling, where I heard Mor’s talk on SOAP that inspired our use of Gittins Index-based scheduling in this context for the partial-information case. The application of LAS was inspired by my earlier work on information-agnostic coflow scheduling.

This year, the NSDI PC accepted 49 out of 332 submissions across the Spring (19/92) and Fall (30/240) deadlines, for a somewhat lower acceptance rate compared to recent years.