
NVIDIA Llama-Nemotron: The New Open-Source King That Surpasses DeepSeek-R1

NVIDIA's new open-source juggernaut: an efficiency revolution, from 671 billion parameters down to 253 billion

In today's era of rapid progress in large AI models, NVIDIA is once again making waves with its technical prowess. The recently released Llama-Nemotron series has quickly risen to the top of the open-source rankings on the strength of remarkable efficiency and performance, even surpassing DeepSeek-R1, a model with far more parameters, on a number of key benchmarks.

The Llama-Nemotron series contains three models:

  • LN-Nano (8B): a compact, efficient model designed for edge devices and mobile applications
  • LN-Super (49B): a mid-range model that balances performance and efficiency
  • LN-Ultra (253B): the flagship reasoning model, designed for complex tasks

Most strikingly, LN-Ultra outperforms DeepSeek-R1 across the board on key benchmarks, including GPQA-Diamond (76.01 vs. 71.5), IFEval (89.45 vs. 83.3), and LiveCodeBench (66.31), despite having only 253 billion parameters, roughly one-third of DeepSeek-R1's 671 billion. More importantly, LN-Ultra runs efficiently on a single 8xH100 node, while DeepSeek-R1 requires 8xH200 hardware: not only is the performance better, but inference throughput is higher and the deployment barrier lower.

According to the Artificial Analysis Intelligence Index, as of April 2025 Llama-Nemotron-Ultra is recognized as the "smartest" open-source model available. All models in the series are released under business-friendly open-source licenses (the NVIDIA Open Model License and the Llama Community License), allowing enterprises to use and modify them freely, which will undoubtedly accelerate the adoption of AI technology and application innovation.

Model Training Revealed: 140,000 H100 Hours Across a Five-Stage Build Process

In its technical report, NVIDIA reveals the five-stage build process behind the Llama-Nemotron family, detailing everything from architecture optimization to reinforcement learning.

Phase 1: Neural Architecture Search with FFN Fusion

The team began by deeply optimizing the original Llama 3.1 architecture with a Neural Architecture Search (NAS) framework called Puzzle, building a library of alternative Transformer blocks to generate variants:

  • Selective removal of attention layers to reduce computation and KV-cache memory consumption
  • Variable FFN dimensions, enabling model compression at different granularities

Particularly innovative is the FFN Fusion technique: when NAS removes some attention layers and leaves consecutive FFN blocks in the model, FFN Fusion replaces these structures with fewer but wider FFN layers that can execute in parallel, significantly improving computational efficiency in multi-GPU environments.
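The core trick can be sketched in a few lines of NumPy: two leftover FFN blocks are treated as if they act on the same input, so their weights can be concatenated along the intermediate dimension into a single wider FFN. The shapes, the GELU activation, and the treat-as-parallel approximation below are illustrative assumptions, not NVIDIA's exact implementation:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, w_in, w_out):
    return gelu(x @ w_in) @ w_out

rng = np.random.default_rng(0)
d, h = 16, 32                      # hidden size, per-FFN intermediate size
x = rng.standard_normal((4, d))

# Two consecutive FFN blocks left behind after attention-layer removal.
w1_in, w1_out = rng.standard_normal((d, h)), rng.standard_normal((h, d))
w2_in, w2_out = rng.standard_normal((d, h)), rng.standard_normal((h, d))

# Fusion: concatenate along the intermediate dimension so both FFNs
# run as one wider matmul instead of two sequential ones.
w_in_fused = np.concatenate([w1_in, w2_in], axis=1)     # (d, 2h)
w_out_fused = np.concatenate([w1_out, w2_out], axis=0)  # (2h, d)

parallel_sum = ffn(x, w1_in, w1_out) + ffn(x, w2_in, w2_out)
fused = ffn(x, w_in_fused, w_out_fused)
assert np.allclose(parallel_sum, fused)
```

The fused layer computes exactly the sum of the two individual FFN outputs in one wide matmul; treating sequential residual blocks as parallel is an approximation whose accuracy cost the NAS search can evaluate.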

Phase 2: Knowledge Distillation and Continued Pre-training

After architecture optimization, the team performed large-scale knowledge distillation and continued pre-training to recover and improve model performance:

  • LN-Super was trained on 40 billion tokens of the Distillation Mix dataset
  • LN-Ultra was first trained on 65 billion tokens of the same dataset, then on a further 88 billion tokens from the Nemotron-H stage-4 dataset
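As a rough illustration of what a distillation objective looks like (the temperature, loss form, and shapes below are generic assumptions, not details from the report), the student is trained to match the teacher's softened output distribution:

```python
import numpy as np

def softmax(z, t=1.0):
    # temperature-scaled, numerically stable softmax
    z = z / t
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, t=2.0):
    """KL(teacher || student) over the vocabulary, averaged over positions:
    the standard soft-target knowledge-distillation objective."""
    p = softmax(teacher_logits, t)             # soft teacher targets
    log_q = np.log(softmax(student_logits, t))
    return float(np.mean(np.sum(p * (np.log(p) - log_q), axis=-1)))

rng = np.random.default_rng(0)
teacher = rng.standard_normal((8, 100))        # (positions, vocab)
student = teacher + 0.1 * rng.standard_normal((8, 100))

loss_close = distill_loss(student, teacher)    # student near the teacher
loss_far = distill_loss(rng.standard_normal((8, 100)), teacher)
assert loss_close < loss_far                   # closer logits, lower loss
```

In continued pre-training this soft-target loss is typically mixed with the ordinary next-token cross-entropy on the raw corpus.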

Phase 3: Supervised Fine-Tuning on Synthetic Data

The supervised fine-tuning phase employs an innovative synthetic-data training methodology, carefully constructing datasets that contain both reasoning and non-reasoning samples:

  • Reasoning samples: "detailed thinking on" is added to the system prompt
  • Non-reasoning samples: use "detailed thinking off"

This design allows the model to switch its reasoning behavior dynamically based on the prompt, laying the foundation for the "reasoning switch" feature.
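A minimal sketch of how such paired samples might be constructed (the field names, the `<think>` tags, and the chat schema are illustrative assumptions, not NVIDIA's actual data format):

```python
def make_pair(question, short_answer, reasoning_trace):
    """Build one reasoning and one non-reasoning SFT record for the
    same question, differing only in the system prompt and target."""
    reasoning = {
        "system": "detailed thinking on",
        "user": question,
        # target includes the full chain of thought before the answer
        "assistant": f"<think>{reasoning_trace}</think>\n{short_answer}",
    }
    direct = {
        "system": "detailed thinking off",
        "user": question,
        # target is just the concise answer
        "assistant": short_answer,
    }
    return reasoning, direct

on_sample, off_sample = make_pair(
    "What is 12 * 7?", "84", "12 * 7 = (10 + 2) * 7 = 70 + 14 = 84."
)
assert on_sample["system"] == "detailed thinking on"
assert off_sample["assistant"] == "84"
```

Training on both variants of every question is what teaches the model to condition its behavior on the switch rather than on the question itself.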

Phase 4: Large-Scale Reinforcement Learning

This phase is the key to LN-Ultra surpassing DeepSeek-R1. The team used the same Group Relative Policy Optimization (GRPO) algorithm as DeepSeek-R1, with several innovations in the training process:

  • Rewards: accuracy rewards (based on matching reference answers) and format rewards (enforcing the use of specific tags)
  • Data filtering: easy samples with pass rates ≥ 75% were filtered out in advance
  • Curriculum training: a gradual shift from easy to hard samples, using progressive batch allocation based on pass rates
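The two reward components can be sketched as follows (the tag names, the exact-match rule, and the weights are illustrative assumptions, not the reported implementation):

```python
import re

def format_reward(response):
    """1.0 only if the response wraps its reasoning in the required
    tags before the answer (tag names here are illustrative)."""
    ok = re.fullmatch(r"(?s)<think>.+?</think>\s*.+", response.strip())
    return 1.0 if ok else 0.0

def accuracy_reward(response, reference):
    """Strip the think block, then match the final answer verbatim
    against the reference answer."""
    answer = re.sub(r"(?s)<think>.*?</think>", "", response).strip()
    return 1.0 if answer == reference.strip() else 0.0

def total_reward(response, reference, w_acc=1.0, w_fmt=0.5):
    return (w_acc * accuracy_reward(response, reference)
            + w_fmt * format_reward(response))

good = "<think>2 + 2: add the units.</think>\n4"
assert total_reward(good, "4") == 1.5
assert total_reward("4", "4") == 1.0   # right answer, wrong format
```

In GRPO, rewards like these are computed for a group of sampled responses per prompt, and each response's advantage is its reward relative to the group average.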

The entire training run consumed about 140,000 H100 GPU hours across 72 nodes (8 H100 GPUs per node), using FP8 precision in the generation phase and BF16 precision in the training phase. This combination of techniques enabled LN-Ultra to achieve significant accuracy gains on the GPQA-Diamond dataset.

Phase 5: Instruction Alignment and Human Preference Optimization

The final phase was a short reinforcement learning run focused on instruction following and alignment with human preferences. The team used RLHF to improve the model's general helpfulness and chat performance while preserving its abilities in specialized areas such as math and science. The aligned LN-Super scored 88.3 on Arena Hard, outperforming proprietary models such as Claude 3.5 Sonnet and GPT-4o.

Revolutionary Innovations: The Reasoning Switch and Hardware-Aware Optimization

One of the biggest innovations of the Llama-Nemotron series is the reasoning switch, which lets users toggle between two modes simply by adding "detailed thinking on/off" to the system prompt:

  • Standard chat mode: responds quickly to everyday queries with direct answers
  • Deep reasoning mode: performs complex multi-step reasoning and shows the complete thought process

This design addresses a major pain point of current AI models: developers no longer need to maintain separately architected models, and can adjust model behavior flexibly on demand. It is the first open-source model family worldwide to offer such a feature.
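In practice, toggling the switch is just a change to the system message. A sketch of building such a request for an OpenAI-compatible chat endpoint (which vLLM can serve); the model identifier and payload details are illustrative assumptions:

```python
def build_request(question, thinking):
    """Build a chat-completions payload that toggles the reasoning
    switch via the system prompt."""
    mode = "on" if thinking else "off"
    return {
        "model": "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1",  # illustrative name
        "messages": [
            {"role": "system", "content": f"detailed thinking {mode}"},
            {"role": "user", "content": question},
        ],
    }

req = build_request("Prove that sqrt(2) is irrational.", thinking=True)
assert req["messages"][0]["content"] == "detailed thinking on"
```

The same deployment thus serves both a fast chat assistant and a slow, deliberate reasoner, with the choice made per request.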

At the hardware level, the Llama-Nemotron series is deeply optimized:

  • Precision scheme: BF16 in the training phase, FP8 in the generation phase (a 1.8x speedup), with optimizer states kept in FP32
  • FP8 generation: the researchers built an online FP8 generation path for the vLLM framework, reaching a generation throughput of up to 32 tokens/s per prompt on a single GPU
  • A custom vLLM weight loader: converts BF16 weights to FP8 format at runtime
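The runtime conversion amounts to per-tensor scaled quantization. The sketch below simulates the FP8 E4M3 value range in NumPy; real loaders cast to a hardware fp8 dtype, and the rounding scheme here is a crude stand-in for e4m3's ~3 mantissa bits:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in e4m3

def quantize_to_fp8(w):
    """Per-tensor scaled quantization sketch (simulated fp8)."""
    scale = np.abs(w).max() / FP8_E4M3_MAX
    q = np.clip(w / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    # snap each value to a grid with ~3 mantissa bits of resolution
    exp = np.floor(np.log2(np.abs(q) + 1e-30))
    step = 2.0 ** (exp - 3)
    q = np.round(q / step) * step
    return q, scale

def dequantize(q, scale):
    return q * scale

rng = np.random.default_rng(0)
w_bf16 = rng.standard_normal((256, 256)).astype(np.float32)  # stand-in for bf16
q, scale = quantize_to_fp8(w_bf16)
w_back = dequantize(q, scale)
rel_err = np.abs(w_back - w_bf16).max() / np.abs(w_bf16).max()
assert rel_err < 0.1  # a few percent of relative error, as expected for fp8
```

Storing one float scale per tensor is what lets the narrow fp8 range cover weights of any magnitude; finer-grained (per-channel or per-block) scales reduce the error further at a small bookkeeping cost.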

With these optimizations, LN-Ultra achieves up to 4x higher inference throughput than DeepSeek-R1 while maintaining superior accuracy.

Performance Comparison: Dispelling the Myth That Performance Scales Linearly with Parameter Count

Comparative testing shows the Llama-Nemotron family punching well above its parameter scale:

| Model | GPQA-Diamond | IFEval | LiveCodeBench | Arena Hard |
|---|---|---|---|---|
| LN-Ultra (253B) | 76.01 | 89.45 | 66.31 | 85.2 |
| DeepSeek-R1 | 71.5 | 83.3 | n/a | 81.7 |
| Llama 3.1-405B | 70.7 | 88.5 | 63.3 | 82.4 |

Even the smaller LN-Super (49B) performed impressively, scoring 88.3 on Arena Hard, outperforming proprietary models such as Claude 3.5 Sonnet and GPT-4o-2024-05-13 as well as much larger open-source models.

More notably, on the out-of-distribution JudgeBench task (distinguishing high-quality from low-quality responses), LN-Ultra is the best-performing open-source model, significantly ahead of DeepSeek-R1 and second only to the proprietary o3-mini (high). This is strong evidence of the model's generalization ability.

The New Open Source Landscape: The Dawn of the Efficiency-First Era

The release of the Llama-Nemotron series marks a new phase of AI development that prioritizes efficiency and impacts the industry in multiple ways:

  1. Breaking the parameter barrier: outperforming larger models at a smaller scale, challenging the conventional wisdom that "bigger is better"
  2. Lowering the deployment threshold: Efficient architectural design enables more organizations to afford large model deployments
  3. Accelerating technological innovation: A fully open source strategy will accelerate the popularization and innovation of AI technology
  4. Promoting efficiency research: motivating more researchers to explore the efficiency boundaries of large models

As the AI race enters an era where efficiency is king, the innovations NVIDIA has made public in the Llama-Nemotron series, from the dynamic reasoning switch to hardware-aware optimization, and from synthetic-data training to large-scale reinforcement learning, will shape the future direction of large models.

The significance of this technology disclosure lies not only in the birth of a new generation of high-efficiency models, but also in the establishment of a new technical benchmark for the entire AI industry, which promotes the continued evolution of AI technology in the direction of greater practicality and universality. With the support of new generation hardware such as the upcoming B100 GPU, this series of models is likely to be just the beginning of the efficiency revolution.
