Nvidia and Google collaborate to optimize Gemma language models for enhanced GPU performance

Google’s Gemma, the latest addition to the field of open language models, is now optimized to run on Nvidia GPUs, the result of a collaboration between the two companies to improve performance and accessibility. Gemma, available in 2 billion and 7 billion parameter versions, is a family of lightweight language models designed to run efficiently across a range of platforms, including Nvidia’s AI infrastructure.

Through close collaboration between Nvidia and Google teams, Gemma has been tuned to take advantage of Nvidia’s TensorRT-LLM, a specialized library for optimizing large language model inference. This optimization covers Nvidia GPUs deployed in data centers, cloud environments, and PCs equipped with Nvidia RTX GPUs. The RTX integration is notable because it gives developers access to an installed base of more than 100 million RTX GPUs worldwide for local, high-performance AI computing.
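Part of what makes local deployment on consumer RTX GPUs practical is the small weight footprint of models in this size class, especially once quantized. The sketch below is illustrative arithmetic only (not from the article): it estimates weight memory for the 2B and 7B variants at common precisions, ignoring activation memory, the KV cache, and runtime overhead.

```python
# Rough weight-memory estimates for Gemma-class models at common precisions.
# Illustrative back-of-the-envelope math; real deployments also need memory
# for activations, the KV cache, and framework overhead.

def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate weight memory in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

for params, label in [(2e9, "Gemma 2B"), (7e9, "Gemma 7B")]:
    fp16 = weight_memory_gb(params, 2)    # 16-bit weights
    int8 = weight_memory_gb(params, 1)    # 8-bit quantization
    int4 = weight_memory_gb(params, 0.5)  # 4-bit quantization
    print(f"{label}: FP16 ~{fp16:.1f} GB, INT8 ~{int8:.1f} GB, INT4 ~{int4:.1f} GB")
```

At 4-bit precision the 7B model's weights fit in roughly 3.5 GB, comfortably inside the VRAM of mainstream RTX cards.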

Developers can also run Gemma on Nvidia GPUs in Google Cloud, using instances powered by H100 Tensor Core GPUs, with support for Nvidia’s H200 Tensor Core GPUs planned for the near future. The H200 offers 141GB of HBM3e memory with 4.8 terabytes per second of bandwidth, promising further performance and scalability headroom for AI applications.
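Why does memory bandwidth matter so much for LLM serving? During autoregressive decoding, every generated token requires reading essentially all model weights from GPU memory once, so bandwidth sets a hard ceiling on single-stream tokens per second. The sketch below applies that simplification (my own estimate, not a figure from the article) to the H200 specs quoted above and a 7B-parameter model at FP16; it ignores the KV cache, batching, and compute/memory overlap.

```python
# Bandwidth-bound upper limit on single-stream decode throughput.
# Simplified model: one full read of the weights per generated token.

H200_BANDWIDTH_TBS = 4.8           # terabytes per second (spec quoted above)
GEMMA_7B_FP16_BYTES = 7e9 * 2      # ~14 GB of weights at 2 bytes/parameter

def max_tokens_per_second(weight_bytes: float, bandwidth_tbs: float) -> float:
    """Ceiling on tokens/s when decoding is purely memory-bandwidth bound."""
    return bandwidth_tbs * 1e12 / weight_bytes

limit = max_tokens_per_second(GEMMA_7B_FP16_BYTES, H200_BANDWIDTH_TBS)
print(f"~{limit:.0f} tokens/s bandwidth-bound ceiling")  # ~343 tokens/s
```

Real throughput depends heavily on batching and kernel efficiency, which is exactly where inference libraries like TensorRT-LLM focus their optimization effort.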

Enterprise developers stand to benefit from Nvidia’s comprehensive ecosystem of tools, including Nvidia AI Enterprise with the NeMo framework and TensorRT-LLM, enabling them to optimize Gemma for specific use cases and seamlessly integrate the optimized model into their production environments.

For developers who want to dig deeper into Gemma’s capabilities and its optimization with TensorRT-LLM, additional resources and model checkpoints are available. Gemma can also be explored directly through the Nvidia AI Playground, where developers can experiment with both the 2 billion and 7 billion parameter versions of the model.

Gemma will also soon come to “Chat with RTX,” an Nvidia tech demo that showcases retrieval-augmented generation and TensorRT-LLM software. The integration lets users run generative AI on their local, RTX-powered Windows PCs and personalize chatbots with their own data. Because the model runs locally, Chat with RTX delivers fast results while keeping user data on the device, with no dependence on cloud services or an internet connection.
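The retrieval-augmented generation pattern behind Chat with RTX can be sketched in a few lines: retrieve the user document most relevant to a query, then prepend it to the prompt the language model receives. The toy below (my own illustration, not Nvidia’s implementation) scores relevance by simple word overlap; a real pipeline would use learned embeddings and an actual LLM call.

```python
# Minimal retrieval-augmented generation (RAG) sketch.
# Relevance scoring here is bag-of-words overlap, purely for illustration;
# production systems use vector embeddings and a real model such as Gemma.

def score(query: str, doc: str) -> int:
    """Count lowercase words shared between the query and a document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, docs: list[str]) -> str:
    """Return the document with the highest overlap score."""
    return max(docs, key=lambda d: score(query, d))

def build_prompt(query: str, docs: list[str]) -> str:
    """Assemble the augmented prompt the language model would receive."""
    context = retrieve(query, docs)
    return f"Context: {context}\nQuestion: {query}\nAnswer:"

docs = [
    "The meeting notes say the launch is on March 5.",
    "The budget spreadsheet lists travel expenses.",
]
print(build_prompt("when is the launch", docs))
```

Grounding the model in locally retrieved documents is what lets a general-purpose chatbot answer questions about a user’s own files without those files ever leaving the machine.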
