About DistilBERT

DistilBERT is a compact, fast, and light version of BERT, a popular transformer model used for natural language processing tasks. It was proposed to address the computational cost of deploying large-scale transformer models like BERT. DistilBERT is trained by distilling knowledge from BERT, resulting in a model that retains over 95% of BERT’s performance while being 40% smaller and 60% faster.

Here are four key features of DistilBERT:

  1. Size and Speed: DistilBERT is 40% smaller and 60% faster than BERT, making it more efficient for deployment in resource-constrained environments.
  2. Performance: Despite its smaller size, DistilBERT retains over 95% of BERT’s performance on the GLUE language understanding benchmark.
  3. No token_type_ids: Unlike BERT, DistilBERT doesn’t require token_type_ids, so you don’t need to indicate which token belongs to which segment; you simply separate your segments with the separation token (see the tokenizer sketch after this list).
  4. Training Method: DistilBERT is trained using knowledge distillation, where the student model is trained to reproduce the output probabilities of the larger BERT teacher. The training objective combines this distillation loss with a masked language modeling loss (predicting the masked tokens correctly) and a cosine embedding loss that aligns the hidden states of the student (DistilBERT) and the teacher (BERT); a sketch of this combination follows the list.
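To see point 3 in practice, here is a minimal sketch using the Hugging Face transformers library. The checkpoint names and the example sentence pair are only illustrative; the point is that the DistilBERT tokenizer returns input_ids and attention_mask but no token_type_ids, while the BERT tokenizer returns all three.

```python
from transformers import AutoTokenizer

distil_tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# Encode the same sentence pair with both tokenizers.
distil_enc = distil_tok("How far is the moon?", "Quite far away.")
bert_enc = bert_tok("How far is the moon?", "Quite far away.")

# DistilBERT: no token_type_ids; the two segments are simply joined by the [SEP] token.
print(distil_enc.keys())  # dict_keys(['input_ids', 'attention_mask'])

# BERT: token_type_ids mark which token belongs to which segment.
print(bert_enc.keys())    # dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
```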
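The sketch below shows how the three training terms from point 4 could be combined in a single step. It is an illustrative approximation, not DistilBERT’s actual training code: the function name, the temperature value, and the equal weighting of the three losses are assumptions made for clarity.

```python
import torch
import torch.nn.functional as F

def distillation_step(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      labels, temperature=2.0):
    """Illustrative triple loss: distillation + masked LM + cosine embedding.

    Shapes assumed: logits (batch, seq, vocab), hidden states (batch, seq, dim),
    labels (batch, seq) with -100 for unmasked positions.
    """
    # 1) Distillation loss: the student matches the teacher's softened probabilities.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    loss_distill = F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature ** 2

    # 2) Masked language modeling loss: predict the masked tokens correctly.
    loss_mlm = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )

    # 3) Cosine embedding loss: align student and teacher hidden states.
    flat_student = student_hidden.view(-1, student_hidden.size(-1))
    flat_teacher = teacher_hidden.view(-1, teacher_hidden.size(-1))
    target = torch.ones(flat_student.size(0), device=flat_student.device)
    loss_cos = F.cosine_embedding_loss(flat_student, flat_teacher, target)

    # Equal weighting here is a simplification; in practice the terms are weighted.
    return loss_distill + loss_mlm + loss_cos
```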