About GPT-J

GPT-J is a 6-billion-parameter, autoregressive text generation model developed by Ben Wang and Aran Komatsuzaki and released through the GitHub repository “mesh-transformer-jax”. The model was trained on The Pile dataset; the underlying library is designed to scale to approximately 40B parameters on TPUv3s.
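As a quick orientation, the sketch below generates text from the released checkpoint via the Hugging Face Transformers port of GPT-J (the repository itself ships JAX weights and its own inference code). The “EleutherAI/gpt-j-6B” model ID, the prompt, and the sampling settings are illustrative assumptions, and loading the full 6B weights requires a machine with substantial memory.

```python
# Minimal generation sketch using the Hugging Face Transformers port of GPT-J.
# The hub ID, prompt, and sampling settings are placeholders for illustration.
from transformers import AutoTokenizer, GPTJForCausalLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

prompt = "GPT-J was trained on The Pile, a dataset that"
inputs = tokenizer(prompt, return_tensors="pt")

# Sample a short autoregressive continuation from the prompt.
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```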

Here are four key features of GPT-J:

  1. Model Parallelism: GPT-J uses JAX’s xmap/pjit operators for transformer model parallelism. The parallelism scheme is similar to the original Megatron-LM’s and is efficient on TPUs thanks to their high-speed 2D mesh network (see the sharding sketch after this list).
  2. Scalability: The library is designed for scalability up to approximately 40B parameters on TPUv3s. For larger models, different parallelism strategies should be used.
  3. Pretrained Model: The released GPT-J-6B checkpoint is an autoregressive text generation model with 6 billion parameters, pretrained on The Pile.
  4. Fine-tuning Capabilities: The model can be fine-tuned on a TPU VM at roughly 5,000 tokens/second, a throughput sufficient for small-to-medium-size datasets (see the fine-tuning sketch at the end of this section).
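To make the model-parallelism idea concrete, here is a toy sketch of column-sharding a feed-forward matmul across a named device mesh. It is not the library’s actual code: in recent JAX releases pjit has been folded into jax.jit, so the sketch uses jax.jit with the jax.sharding API, and the mesh layout, axis names (“dp”, “mp”), and array sizes are illustrative assumptions.

```python
# Toy sketch: shard a feed-forward weight matrix column-wise over the "mp" axis
# of a 2D device mesh (the same basic idea xmap/pjit express in the library).
# Mesh shape, axis names, and tensor sizes are illustrative, not the real values.
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Arrange the available devices into a (data, model) mesh, e.g. 1x8 on a TPU v3-8.
devices = np.array(jax.devices()).reshape(1, -1)
mesh = Mesh(devices, axis_names=("dp", "mp"))

# Shard activations over the data axis and weight columns over the model axis.
x = jax.device_put(jnp.ones((8, 512)), NamedSharding(mesh, P("dp", None)))
w = jax.device_put(jnp.ones((512, 2048)), NamedSharding(mesh, P(None, "mp")))

@jax.jit
def ffn(x, w):
    # Each device computes only its slice of the output columns; XLA's SPMD
    # partitioner inserts whatever collectives are needed automatically.
    return jnp.dot(x, w)

y = ffn(x, w)
print(y.shape, y.sharding)  # output is typically sharded over both mesh axes
```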
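The repository provides its own TPU VM fine-tuning scripts. As a rough illustration of the general causal-LM fine-tuning workflow only (not the repository’s JAX/TPU pipeline), here is a sketch using the Hugging Face Trainer; the “train.txt” dataset path and all hyperparameters are placeholders, and fine-tuning the full 6B model this way requires accelerator-scale memory.

```python
# Rough causal-LM fine-tuning sketch with the Hugging Face Trainer (NOT the
# repository's JAX/TPU pipeline). "train.txt" and all hyperparameters are
# placeholders for illustration.
from datasets import load_dataset
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          GPTJForCausalLM, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer.pad_token = tokenizer.eos_token  # GPT-J ships without a pad token
model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

# Tokenize a plain-text corpus into training examples.
dataset = load_dataset("text", data_files={"train": "train.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt-j-finetuned",
                           per_device_train_batch_size=1,
                           num_train_epochs=1),
    train_dataset=dataset,
    # mlm=False gives standard next-token (causal) language-modeling labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```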