NVIDIA and OpenAI unveil the fastest open-source models for AI reasoning
NVIDIA and OpenAI have unveiled their latest models, gpt-oss-120b and gpt-oss-20b. The release reads less like a routine product launch and more like a significant turning point for open AI models.
The collaboration between NVIDIA and OpenAI, dating back to the first DGX-1, has led to the gpt-oss series. The new models use a Mixture of Experts (MoE) architecture with SwiGLU activations, a design that allows large total parameter counts while keeping the number of active parameters per token small. They also incorporate Rotary Position Embedding (RoPE) with context lengths up to 128k tokens, alternating full-context attention layers with a sliding 128-token attention window.
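To make the routing idea concrete, here is a minimal sketch of a top-k MoE layer with SwiGLU experts. The dimensions, expert count, and top-k value are illustrative placeholders rather than the actual gpt-oss configuration, and production systems replace the Python loop with fused batched kernels.

```python
# Minimal top-k Mixture-of-Experts routing with SwiGLU experts (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """One expert: a SwiGLU feed-forward block."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        # SwiGLU: silu(gate(x)) * up(x), projected back to d_model
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class MoELayer(nn.Module):
    """A router picks the top-k experts per token; only those experts run."""
    def __init__(self, d_model=64, d_ff=256, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            [SwiGLUExpert(d_model, d_ff) for _ in range(num_experts)])
        self.top_k = top_k

    def forward(self, x):                       # x: (tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)       # normalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e        # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

print(MoELayer()(torch.randn(10, 64)).shape)    # torch.Size([10, 64])
```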
These models run in FP4 precision, a low-precision data format natively supported by NVIDIA’s Blackwell GPU architecture. That is what lets the 120B model fit on a single 80GB GPU for inference while maintaining accuracy.
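The memory arithmetic behind that claim is simple. Ignoring per-block scale factors and any tensors kept at higher precision, 120 billion parameters at 4 bits each come to roughly 56 GiB, leaving headroom on an 80GB card for activations and the KV cache:

```python
# Back-of-the-envelope: why FP4 weights for ~120B parameters fit on one 80 GB GPU.
params = 120e9
bytes_per_param = 0.5  # 4 bits; ignores block scale factors and mixed-precision layers
print(f"~{params * bytes_per_param / 2**30:.0f} GiB of weights")  # ~56 GiB
```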
Training these models required NVIDIA H100 Tensor Core GPUs, with the 120B model alone consuming roughly 2.1 million GPU hours. For inference, NVIDIA optimized the models via the TensorRT-LLM backend with improved CUDA kernels for MoE layers, achieving up to 1.5 million tokens per second of throughput on NVIDIA GB200 NVL72 systems. Performance is further boosted by parallelism strategies such as Tensor Parallelism and Expert Parallelism, which shard the model across multiple GPUs for large deployments.
The core technology relies on several ingredients working together: MoE routing that activates only a subset of experts per token, reducing computation while expanding model capacity; efficient attention with RoPE over very long context windows; low-precision FP4 computation enabled by Blackwell GPUs; and highly optimized software stacks spanning TensorRT-LLM and frameworks such as Hugging Face Transformers, Ollama, and vLLM.
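One concrete way to see those pieces combine is multi-GPU serving. The sketch below uses vLLM, one of the frameworks named above; the repository id and GPU count are assumptions to adjust for your deployment.

```python
# Hedged sketch: serving a gpt-oss checkpoint with tensor parallelism in vLLM.
# "openai/gpt-oss-120b" and tensor_parallel_size=8 are illustrative values.
from vllm import LLM, SamplingParams

llm = LLM(
    model="openai/gpt-oss-120b",  # assumed Hugging Face repository id
    tensor_parallel_size=8,       # shard the weights across 8 GPUs
)
params = SamplingParams(max_tokens=256, temperature=0.7)
outputs = llm.generate(["Explain mixture-of-experts routing briefly."], params)
print(outputs[0].outputs[0].text)
```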
This combination allows NVIDIA to deliver these advanced LLMs with high token throughput, low latency, and efficient hardware utilization from cloud GPUs down to desktop RTX GPUs.
NVIDIA also packages the models as NIM inference microservices: prebuilt, optimized containers that make deployment faster and simpler, particularly for teams already running CUDA infrastructure. Given the reach of the NVIDIA and OpenAI ecosystems, that packaging tends to translate into rapid adoption.
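In practice, NIM microservices for LLMs expose an OpenAI-compatible HTTP API, so a deployed container can be queried with the standard openai client. The base URL, port, and served model name below are assumptions about a local deployment:

```python
# Hedged sketch: querying a locally deployed gpt-oss NIM container through
# its OpenAI-compatible endpoint. URL, port, and model name are assumed.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")
resp = client.chat.completions.create(
    model="openai/gpt-oss-120b",  # assumed name of the served model
    messages=[{"role": "user", "content": "Summarize RoPE in one sentence."}],
)
print(resp.choices[0].message.content)
```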
Over 4 million developers build on OpenAI's platform, and more than 6.5 million use NVIDIA's software tools. The models are optimized to run well across a range of hardware, from large-scale cloud systems to desktop machines with NVIDIA RTX cards. For anyone already using common AI tooling such as Hugging Face Transformers or llama.cpp, the models should slot in with little friction, as the sketch below illustrates.
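A minimal Transformers example: the repository id openai/gpt-oss-20b is assumed here, and device_map="auto" simply lets the library place weights on whatever accelerators are available.

```python
# Hedged sketch: loading a gpt-oss checkpoint with Hugging Face Transformers.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",  # assumed Hugging Face repository id
    device_map="auto",           # place layers on available GPU(s) automatically
)
prompt = "Explain sliding-window attention in one sentence."
print(generator(prompt, max_new_tokens=64)[0]["generated_text"])
```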
That said, the gpt-oss models demand substantially more compute and operational maturity to run well than earlier, smaller models. The open-weight release also opens the models to contributions from startups, universities, and other organizations.
The NVIDIA and OpenAI collaboration behind the gpt-oss series has now been running for nearly a decade, and co-design across hardware, software, and services at this depth remains unusual. The efficiency of gpt-oss-120b and gpt-oss-20b comes from that combination: NVIDIA's H100 training hardware paired with software tuned specifically for the models.
A key enabler is NVIDIA's Blackwell architecture, specifically its NVFP4 4-bit floating-point format. It lets models run faster and more efficiently on lower-precision numbers without a meaningful loss of accuracy.
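To make "lower-precision numbers" concrete: a 4-bit E2M1 float, the element format used by FP4 variants such as NVFP4, can represent only 15 distinct values. Real FP4 schemes recover dynamic range by attaching a shared scale factor to each small block of values; the toy rounding function below omits that step.

```python
# Illustrative only: the 15 values representable by a 4-bit E2M1 float,
# and nearest-neighbor rounding to them. Per-block scale factors omitted.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_VALUES = sorted({s * m for m in E2M1_MAGNITUDES for s in (1, -1)})

def quantize_fp4(x: float) -> float:
    """Round x to the nearest representable FP4 (E2M1) value."""
    return min(FP4_VALUES, key=lambda v: abs(v - x))

print(FP4_VALUES)         # 15 values from -6.0 to 6.0
print(quantize_fp4(2.4))  # 2.0 (nearest representable neighbor)
```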
In summary, the efficiency and speed of the gpt-oss models come from the Mixture of Experts design, FP4 precision on NVIDIA Blackwell GPUs, attention optimizations with RoPE over 128k-token contexts, and TensorRT-LLM kernels tuned for these models. Together, these advances enable faster AI reasoning and tool use at scale.