NVIDIA's new open-source juggernaut: an efficiency revolution from 671 billion down to 253 billion parameters
In today's era of rapidly evolving large AI models, NVIDIA is once again making waves with its technical prowess. The recently released Llama-Nemotron series has quickly risen to the top of the open-source rankings with remarkable efficiency and performance, even surpassing DeepSeek-R1, a model with far more parameters, on a number of key benchmarks.

The Llama-Nemotron series contains three models:
- LN-Nano (8B): An efficient compact model designed for edge devices and mobile applications
- LN-Super (49B): A mid-range model that balances performance and efficiency
- LN-Ultra (253B): The flagship reasoning model, designed for complex tasks
Most strikingly, LN-Ultra outperforms DeepSeek-R1 on a number of key benchmarks, including GPQA-Diamond (76.01 vs. 71.5), IFEval (89.45 vs. 83.3), and LiveCodeBench (66.31), while using only 253 billion parameters, roughly one-third of DeepSeek-R1's 671 billion. More importantly, LN-Ultra runs efficiently on a single 8xH100 node, whereas DeepSeek-R1 requires 8xH200 hardware: it not only performs better, it also delivers higher inference throughput and a lower barrier to deployment.

According to the Artificial Analysis Intelligence Index, as of April 2025 Llama-Nemotron-Ultra was recognized as the "smartest" open-source model available. All models in the series are released under business-friendly open-source licenses, the NVIDIA Open Model License and the Llama Community License, allowing enterprises to use and modify them freely, which will undoubtedly accelerate the popularization of AI technology and application innovation.
Model Training Revealed: 140,000 H100 Hours in a Five-Stage Construction Process
NVIDIA reveals the five-stage build process for the Llama-Nemotron family of models in a technical report, showing all the technical details from architecture optimization to reinforcement learning.
Phase 1: Neural Architecture Search with FFN Fusion
The team started by deeply optimizing the original Llama 3.1-based architecture using a Neural Architecture Search (NAS) framework called Puzzle. Variants were created by building a library of alternative Transformer blocks:
- Selective removal of attention layers to reduce computation and KV-cache memory consumption
- Variable FFN dimensions, allowing model compression at different granularities

Particularly innovative is the FFN Fusion technique: when the NAS removes some attention layers and leaves consecutive FFN blocks behind, FFN Fusion replaces these structures with fewer but wider FFN layers that can execute in parallel, significantly improving computational efficiency in multi-GPU environments.
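To make the idea concrete, here is a minimal PyTorch sketch of what FFN Fusion does conceptually. The dimensions, module names, and the simple sum-of-branches fusion rule are illustrative assumptions, not NVIDIA's actual Puzzle implementation.

```python
import torch
import torch.nn as nn

class FFN(nn.Module):
    """A standard Transformer feed-forward block (illustrative sizes)."""
    def __init__(self, d_model=1024, d_ff=4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        return self.net(x)

class SequentialFFNs(nn.Module):
    """Two consecutive residual FFN blocks, as left behind after attention removal."""
    def __init__(self, d_model=1024):
        super().__init__()
        self.ffn1, self.ffn2 = FFN(d_model), FFN(d_model)

    def forward(self, x):
        x = x + self.ffn1(x)   # block i
        x = x + self.ffn2(x)   # block i+1 depends on block i, so no parallelism
        return x

class FusedFFN(nn.Module):
    """FFN Fusion idea: both FFNs read the same input and their outputs are summed."""
    def __init__(self, d_model=1024):
        super().__init__()
        self.ffn1, self.ffn2 = FFN(d_model), FFN(d_model)

    def forward(self, x):
        # Independent branches with a single residual add: the two (effectively one
        # wider) FFN layers can now run in parallel across GPUs.
        return x + self.ffn1(x) + self.ffn2(x)
```

The key point is that the two branches in FusedFFN no longer depend on each other, so they can be dispatched to different devices or merged into one wider matrix multiplication.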
Phase 2: Knowledge Distillation and Continued Pre-Training
After architecture optimization, the team performed large-scale knowledge distillation and continued pre-training to recover and improve model performance (a generic sketch of the distillation objective follows the list):
- LN-Super is trained on 40 billion tokens from the Distillation Mix dataset
- LN-Ultra is first trained on 65 billion tokens from the same dataset, then continues with 88 billion tokens from the Nemotron-H stage 4 dataset
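The report's exact distillation objective is not reproduced here; the snippet below is a generic token-level knowledge-distillation loss (temperature-scaled KL divergence between teacher and student logits), the standard form such a phase typically takes. The temperature value and any mixing with the regular language-modeling loss are assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Token-level knowledge distillation.

    student_logits, teacher_logits: tensors of shape [batch, seq_len, vocab].
    The temperature and how this is combined with the normal LM loss are
    assumptions, not values from the Llama-Nemotron report.
    """
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # batchmean KL, rescaled by T^2 as in standard distillation setups
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2
```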

Phase 3: Supervised Fine-Tuning on Synthetic Data
The supervised fine-tuning phase employs an innovative synthetic-data training methodology that carefully constructs datasets containing both reasoning and non-reasoning samples:
- Reasoning samples: "detailed thinking on" is added to the system prompt
- Non-reasoning samples: "detailed thinking off" is used instead
This design teaches the model to switch its reasoning behavior dynamically based on the system prompt, laying the foundation for the "reasoning toggle" feature.
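In practice, the toggle is just a system prompt. Below is a minimal sketch using an OpenAI-compatible client against a locally served model; the endpoint, model id, and example questions are placeholders.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1"  # assumed local deployment

def ask(question: str, reasoning: bool) -> str:
    # The only difference between the two modes is the system prompt.
    system = "detailed thinking on" if reasoning else "detailed thinking off"
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

print(ask("What is 17 * 24?", reasoning=False))              # quick, direct answer
print(ask("Prove that sqrt(2) is irrational.", reasoning=True))  # full thought process
```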
Phase 4: Large-Scale Reinforcement Learning
This phase is key to LN-Ultra surpassing DeepSeek-R1. The team used the same Group Relative Policy Optimization (GRPO) algorithm as DeepSeek-R1, and the key design elements of the training process included:
- Rewards: accuracy rewards (based on matching the reference answer) and format rewards (enforcing the use of specific tags); see the sketch after this list
- Data screening: easy samples with pass rates ≥75% were filtered out in advance
- Curriculum training: a gradual transition from easy to hard samples via progressive batch assignment based on pass rates
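As a rough illustration of the two reward signals, here is a minimal sketch; the `<think>` tag convention, the string-match answer check, and the equal weighting are assumptions rather than the reported configuration.

```python
import re

def format_reward(response: str) -> float:
    """Reward responses that wrap their reasoning in the expected tags.
    The <think>...</think> convention is an assumption for illustration."""
    return 1.0 if re.search(r"<think>.*?</think>", response, re.DOTALL) else 0.0

def accuracy_reward(response: str, reference: str) -> float:
    """Reward responses whose final answer matches the reference answer.
    Real pipelines use a verifier or equivalence checker, not plain string match."""
    final = response.split("</think>")[-1].strip().lower()
    return 1.0 if reference.strip().lower() in final else 0.0

def total_reward(response: str, reference: str) -> float:
    # Equal weighting is illustrative, not the reported configuration.
    return accuracy_reward(response, reference) + format_reward(response)
```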
The entire training process consumed about 140,000 H100 GPU hours across 72 nodes (8 H100 GPUs per node), using FP8 precision in the generation phase and BF16 precision in the training phase, a combination that enabled LN-Ultra to achieve significant accuracy gains on the GPQA-Diamond dataset.

Phase 5: Instruction Alignment and Human Preference Optimization
The final phase was a short reinforcement learning run focused on optimizing the model's instruction-following ability and alignment with human preferences. The team used RLHF to improve the model's general helpfulness and chat performance while retaining its abilities in specialized areas such as math and science. The results showed that the aligned LN-Super scored 88.3 on the Arena Hard test, outperforming proprietary models such as Claude 3.5 Sonnet and GPT-4o.

Revolutionary Innovations: Reasoning Toggle and Hardware-Aware Optimization
One of the biggest innovations of the Llama-Nemotron series is the reasoning toggle, which lets users switch dynamically between two modes simply by adding "detailed thinking on/off" to the system prompt:
- Standard chat mode: responds quickly to everyday queries with direct answers
- Deep reasoning mode: performs complex multi-step reasoning and shows the complete thought process
This design addresses a major pain point of current AI models: developers no longer need to maintain separate models with different architectures and can flexibly adjust model behavior on demand. In the global open-source AI space, this is the first model family to offer such a feature.
At the hardware level, the Llama-Nemotron series is deeply optimized for the underlying hardware:
- Precision support: BF16 is used in the training phase, FP8 in the generation phase (a 1.8x speedup), and the optimizer state is kept in FP32
- FP8-precision generation: the researchers developed an online FP8 generation path for the vLLM framework, reaching a generation throughput of up to 32 tokens/s per prompt on a single GPU
- Custom vLLM weight loader: converts BF16 weights to FP8 format at runtime
With these optimizations, LN-Ultra achieves a staggering 4x higher inference throughput than DeepSeek-R1 while maintaining superior accuracy.
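From the user's side, online FP8 generation with vLLM can look like the sketch below. The model id, parallelism, and sampling settings are placeholders, and vLLM's built-in online FP8 quantization option is used here as a stand-in for the custom weight loader described above.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/Llama-3_1-Nemotron-Ultra-253B-v1",  # assumed HF model id
    quantization="fp8",        # online BF16 -> FP8 conversion at load time
    tensor_parallel_size=8,    # a single 8xH100 node
)

params = SamplingParams(temperature=0.6, max_tokens=512)
outputs = llm.generate(["Explain why wider parallel FFN layers speed up inference."], params)
print(outputs[0].outputs[0].text)
```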

Performance Comparison: Dispelling the Myth of a Linear Relationship between Number of Parameters and Performance
In comparative testing, the Llama-Nemotron family demonstrates performance well beyond what its parameter count would suggest:
| Model | GPQA-Diamond | IFEval | LiveCodeBench | Arena Hard |
|---|---|---|---|---|
| LN-Ultra (253B) | 76.01 | 89.45 | 66.31 | 85.2 |
| DeepSeek-R1 | 71.5 | 83.3 | – | 81.7 |
| Llama 3.1-405B | 70.7 | 88.5 | 63.3 | 82.4 |
Even the smaller LN-Super (49B) performs well, achieving a high score of 88.3 on the Arena Hard test, beating proprietary models such as Claude 3.5 Sonnet and GPT-4o-2024-05-13 as well as much larger open-source models.
More notably, on the out-of-distribution task JudgeBench (distinguishing high-quality from low-quality responses), LN-Ultra is the best-performing open-source model, significantly outperforming DeepSeek-R1 and second only to the proprietary model o3-mini (high). This is strong evidence of the model's generalization ability.
The New Open Source Landscape: The Dawn of the Efficiency-First Era
The release of the Llama-Nemotron series marks a new phase of AI development that prioritizes efficiency and impacts the industry in multiple ways:
- Breaking parameter barriers: outperforming larger models at a smaller scale challenges the conventional wisdom that "bigger is better"
- Lowering the deployment threshold: efficient architectural design lets more organizations afford large-model deployments
- Accelerating technological innovation: a fully open-source strategy will speed up the popularization of AI technology and application innovation
- Promoting efficiency research: motivating more researchers to explore the efficiency boundaries of large models
As the AI race enters an era where efficiency is king, the innovations NVIDIA has made public with the Llama-Nemotron series - from the dynamic reasoning toggle to hardware-aware optimization, and from synthetic-data training to large-scale reinforcement learning - will shape the future direction of large models.
The significance of this disclosure lies not only in the birth of a new generation of high-efficiency models, but also in setting a new technical benchmark for the entire AI industry, pushing AI technology to evolve toward greater practicality and accessibility. With the support of next-generation hardware such as the upcoming B100 GPU, this series of models is likely just the beginning of the efficiency revolution.