About Megatron-Turing NLG

NVIDIA, in collaboration with Microsoft, has introduced the Megatron-Turing Natural Language Generation model (MT-NLG), at the time of its announcement the largest and most powerful monolithic transformer language model ever trained, with 530 billion parameters. The model is a successor to Turing NLG 17B and Megatron-LM. With roughly three times the parameters of the previously largest model of its kind, MT-NLG demonstrates state-of-the-art accuracy across a wide range of natural language tasks.
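
The 530-billion figure can be sanity-checked with a standard back-of-envelope transformer parameter count. The layer count (105) and hidden size (20480) below follow the publicly described MT-NLG architecture; the 12·h² per-layer estimate (attention plus MLP weight matrices) is a common rule of thumb that ignores embeddings, biases, and layer norms, so treat this as an illustrative sketch rather than the exact accounting.

```python
# Back-of-envelope transformer parameter count for MT-NLG.
# layers and hidden size follow the public MT-NLG description;
# the 12*h^2 per-layer rule ignores embeddings, biases, layer norms.

layers = 105
hidden = 20480

params_per_layer = 12 * hidden ** 2   # 4*h^2 attention + 8*h^2 MLP weights
total = layers * params_per_layer
print(f"{total / 1e9:.0f}B parameters")  # ~528B, close to the quoted 530B
```

The estimate lands within about half a percent of the quoted 530 billion, which is typical: the small remainder is covered by the embedding table and other terms the rule of thumb omits.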

Features of Megatron-Turing NLG (MT-NLG)

  1. Scale and Power: MT-NLG is equipped with 530 billion parameters, making it the largest monolithic transformer language model to date.
  2. Accuracy: The model demonstrates unmatched accuracy in various natural language tasks such as completion prediction, reading comprehension, commonsense reasoning, natural language inference, and word sense disambiguation.
  3. Advanced Transformer Layers: MT-NLG is a 105-layer transformer that improved on prior state-of-the-art results in zero-, one-, and few-shot settings.
  4. Large-Scale Training Infrastructure: The model utilizes NVIDIA A100 Tensor Core GPUs and HDR InfiniBand networking, enabling the training of models with trillions of parameters within a feasible timeframe.
  5. 3D Parallel System: A collaboration between NVIDIA Megatron-LM and Microsoft DeepSpeed resulted in an efficient and scalable 3D parallel system that combines data, pipeline, and tensor-slicing based parallelism.
  6. Training Dataset: The training dataset for MT-NLG is primarily based on “The Pile”. It includes a selection of high-quality datasets, two recent Common Crawl snapshots, and additional datasets like RealNews and CC-Stories.
  7. Achievements: MT-NLG has set new benchmarks across several categories of NLP tasks, especially in zero-shot, one-shot, and few-shot evaluations.

Additional Features

  1. Bias Mitigation: While MT-NLG is a significant advancement in language generation, it’s acknowledged that such models can inherit biases from their training data. Both Microsoft and NVIDIA are actively researching ways to understand and mitigate these biases.
  2. Hardware System: The model training is executed with mixed precision on the NVIDIA DGX SuperPOD-based Selene supercomputer, which is powered by 560 DGX A100 servers networked with HDR InfiniBand.
  3. System Throughput: The system’s end-to-end throughput for the 530-billion-parameter model with a batch size of 1920 was measured on several DGX A100 server configurations on Selene, reporting the iteration time and sustained teraFLOP/s per GPU achieved at each scale.
  4. Commitment to Responsible AI: Both Microsoft and NVIDIA emphasize the importance of fairness, reliability, safety, privacy, security, inclusiveness, transparency, and accountability in AI development and usage, aligning with the Microsoft Responsible AI Principles.
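
The throughput numbers in item 3 can be roughly reconstructed from first principles using the common ~6 × parameters FLOPs-per-token rule of thumb for training. The batch size (1920) and the 560-server system come from the text above; the sequence length (2048) and the sustained ~113 teraFLOP/s per GPU are assumptions based on the public description, so the result is an order-of-magnitude illustration, not a reported benchmark.

```python
# Back-of-envelope FLOPs-per-iteration estimate for the 530B model.
# Sequence length and sustained per-GPU throughput are assumptions;
# batch size and GPU count come from the system description.

params = 530e9
batch_size = 1920
seq_len = 2048                 # assumed sequence length
gpus = 560 * 8                 # Selene configuration cited above

tokens_per_iter = batch_size * seq_len
flops_per_iter = 6 * params * tokens_per_iter   # ~1.25e19 FLOPs

sustained_per_gpu = 113e12     # assumed sustained teraFLOP/s per GPU
iter_seconds = flops_per_iter / (gpus * sustained_per_gpu)
print(f"{flops_per_iter:.2e} FLOPs/iter, ~{iter_seconds:.0f} s/iteration")
```

Each iteration costs on the order of 10¹⁹ floating-point operations, which explains why efficiency per GPU, not just raw GPU count, dominates the feasibility of training at this scale.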