GPT-J is a 6-billion-parameter, autoregressive text generation model developed by Ben Wang and Aran Komatsuzaki, hosted in the “mesh-transformer-jax” GitHub repository. The model is trained on a dataset called The Pile, and the codebase is designed to scale up to approximately 40B parameters on TPUv3s.
Here are four key features of GPT-J:
- Model Parallelism: GPT-J uses the xmap/pjit operators in JAX for model parallelism of transformers, similar to the original Megatron-LM. This makes it efficient on TPUs due to the high-speed 2D mesh network.
- Scalability: The library is designed for scalability up to approximately 40B parameters on TPUv3s. For larger models, different parallelism strategies should be used.
- Pretrained Model: A pretrained checkpoint with 6 billion parameters, trained on The Pile, is released alongside the code, so the model can generate text out of the box.
- Fine-tuning Capabilities: The model can be fine-tuned on a TPU VM at a rate of ~5000 tokens/second, making it suitable for small-to-medium-size datasets.
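To build intuition for the Megatron-style tensor parallelism mentioned above, here is a minimal NumPy sketch (not the actual xmap/pjit implementation) of a column-parallel matmul: the weight matrix is split column-wise across simulated devices, each "device" computes its partial product independently, and the outputs are concatenated. The function name and shapes are illustrative, not from the repository.

```python
import numpy as np

def column_parallel_matmul(x, w, n_shards):
    """Illustrative sketch of Megatron-style column parallelism:
    each shard owns a block of weight columns and computes its
    partial output independently; concatenation plays the role
    of the all-gather across the device mesh."""
    shards = np.split(w, n_shards, axis=1)      # one column block per "device"
    partials = [x @ shard for shard in shards]  # independent per-device matmuls
    return np.concatenate(partials, axis=1)     # gather the per-shard outputs

# Toy check: the sharded result matches the unsharded matmul.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))   # (batch, d_model)
w = rng.standard_normal((8, 16))  # (d_model, d_ff); 16 columns split 4 ways
y = column_parallel_matmul(x, w, n_shards=4)
assert np.allclose(y, x @ w)
```

In the real library, JAX's xmap/pjit operators express this sharding declaratively, and the TPU's 2D mesh network makes the cross-device gathers cheap.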
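As a rough sense of what the quoted ~5000 tokens/second means in practice, the back-of-the-envelope calculation below estimates wall-clock fine-tuning time for a given dataset size. The helper function is hypothetical, and real throughput will vary with sequence length and batch size.

```python
# Quoted fine-tuning throughput on a TPU VM (tokens per second).
TOKENS_PER_SECOND = 5000

def finetune_hours(dataset_tokens, epochs=1):
    """Estimated wall-clock hours to fine-tune over a dataset,
    assuming the quoted throughput holds throughout training."""
    return dataset_tokens * epochs / TOKENS_PER_SECOND / 3600

# One epoch over a 100M-token dataset:
print(f"{finetune_hours(100_000_000):.1f} hours")  # → 5.6 hours
```

At this rate a small-to-medium corpus finishes in hours rather than days, which is what makes single-TPU-VM fine-tuning practical.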