Foundational Speech Models and their Efficient Training with NVIDIA NeMo

About this Topic:
The intersection of speech and language models offers unique opportunities and challenges. This talk provides a comprehensive walkthrough of speech-language model research from NVIDIA NeMo. We cover several types of models, including the attention-encoder-decoder Canary-1B and LLM-based architectures such as SALM and BESTOW. In particular, we highlight the challenges in training and inference efficiency of such models and propose robust solutions via 2D bucketing and the batch size OOMptimizer. Finally, we highlight the difficulty of preserving text-domain capabilities in speech-augmented training and present several possible solutions: EMMeTT, VoiceTextBlender, and Canary-Qwen-2.5B.
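
To make the efficiency theme concrete, here is a minimal, hypothetical Python sketch of the 2D bucketing idea: training examples are grouped by both audio duration and output token length, so each batch is homogeneous in both dimensions and can use the largest batch size known to fit in GPU memory for its bucket (finding those per-bucket batch sizes automatically is what the batch size OOMptimizer does in NeMo). All names, bin edges, and batch sizes below are illustrative assumptions, not the actual NeMo/Lhotse API.

    # Hypothetical sketch of 2D bucketing, not the NeMo/Lhotse implementation.
    import bisect
    from collections import defaultdict

    DURATION_BINS = [5.0, 10.0, 20.0, 30.0]   # audio duration edges in seconds (illustrative)
    TOKEN_BINS = [32, 64, 128, 256]           # output token length edges (illustrative)

    def bucket_key(example):
        """Map an example to a (duration_bin, token_bin) bucket index."""
        d_bin = bisect.bisect_left(DURATION_BINS, example["duration"])
        t_bin = bisect.bisect_left(TOKEN_BINS, example["num_tokens"])
        return (d_bin, t_bin)

    def make_batches(examples, batch_sizes):
        """Group examples into per-bucket batches.

        `batch_sizes` maps a bucket key to the largest batch size that fits in
        memory for that bucket; here it is simply a user-provided dict.
        """
        buckets = defaultdict(list)
        for ex in examples:
            key = bucket_key(ex)
            buckets[key].append(ex)
            if len(buckets[key]) == batch_sizes.get(key, 8):
                yield buckets.pop(key)
        # flush the remaining partial batches
        yield from buckets.values()

    # Toy usage: two short/low-token examples share a bucket; the long one does not.
    data = [{"duration": 7.3, "num_tokens": 45},
            {"duration": 6.1, "num_tokens": 50},
            {"duration": 25.0, "num_tokens": 200}]
    for batch in make_batches(data, batch_sizes={(1, 1): 2}):
        print(len(batch), [bucket_key(ex) for ex in batch])
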
About the Presenter:
Piotr Żelasko received the B.Sc. and M.Sc. degrees in acoustic engineering and the Ph.D. degree in electronic engineering from the AGH University of Krakow, Poland, in 2013, 2014, and 2019, respectively.
He is currently a research scientist at NVIDIA NeMo, building multitask and multimodal models and efficient training infrastructure. Previously, he held a research scientist position at JHU's Center for Language and Speech Processing (CLSP) and developed speech technology at several companies (Techmo, Avaya, Meaning.Team).
Dr. Żelasko is a co-author of the next-generation Kaldi toolkit (k2) and the maintainer of Lhotse.